## Group 9 Project Report

### Predicting Crime Type in the Greater LA Region 

##### A deep-dive into the types of crime occurring in LA neighborhoods, and using victim characteristics to predict crime type

#### **Introduction**

If you have ever been to Los Angeles, you may know that the region is grappling with increasing crime. The metropolitan area has seen a steady rise in property crime affecting individuals of all ages. Crimes are no longer occurring solely at night; daytime crimes are also on the rise. Investigating recent crime data is crucial for identifying potential trends and helping to create safer neighborhoods in the future. For this analysis, we use crime data released by the LAPD.

The dataset that we will use is "Crime Data" from 2020 to present, a public dataset released by the Los Angeles Police Department (LAPD) and obtained from Kaggle. Refer to the reference section for a link to the dataset. From the dataset, we will focus solely on the following columns: DR_NO (unique identifier of the crime report), DATE OCC (date the crime occurred), TIME OCC (time the crime occurred), AREA NAME (area of the crime), Crm Cd (crime code), Crm Cd Desc (crime code description), Vict Age (victim age), Vict Sex (victim sex), Vict Descent (victim descent), and Weapon Desc (weapon description). Some columns were omitted because multiple columns display the same information. For example, the dataset includes both a weapon code and a weapon description (in two separate columns), but a general reader would not recognize the weapon code without further information.

> *make sure that we do not need to talk about every column since there are 28 of them*

In this analysis, we pose the question: **What crime type is predicted to occur to a victim of a certain age at a certain time?** Victim age and time of occurrence were chosen as predictor variables because they are numeric, so K-nearest neighbors classification can be used to build a prediction model. By understanding the relationship between victim traits and the crimes committed against them, we can better warn citizens and help keep LA neighborhoods safe.
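
To illustrate why numeric predictors matter here, the following is a minimal sketch of K-nearest neighbors voting written from scratch. The function name `knn_predict` and the toy observations are hypothetical and are not taken from the LAPD data; the sketch only shows the idea of measuring distance between (age, time) pairs and taking a majority vote.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, new_point, k=3):
    """Return the majority label among the k training points closest to new_point."""
    # Pair each Euclidean distance with its label, then sort by distance
    by_distance = sorted(
        (dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy observations: (victim age, time of occurrence in HHMM)
toy_points = [(25, 2300), (27, 2230), (30, 2345), (60, 900), (65, 1000), (58, 1130)]
toy_labels = [
    "BURGLARY", "BURGLARY", "BURGLARY",
    "THEFT OF IDENTITY", "THEFT OF IDENTITY", "THEFT OF IDENTITY",
]

prediction = knn_predict(toy_points, toy_labels, (28, 2250), k=3)
print(prediction)
```

Because time in HHMM format spans a much wider range than age, it dominates the raw distance in this sketch; standardizing the predictors before computing distances is a common way to balance their influence.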

#### **Preliminary exploratory data analysis**

> *to be completed after we create a tidy dataset*

The chosen dataset is quite large since 

First, it is necessary to tidy the data by excluding unwanted columns and incomplete observations. Some observations contain NaN values; because the dataset is very large, these can be discarded with the pandas `dropna` function.
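
As a hedged illustration of this cleaning step, the sketch below uses a small hypothetical DataFrame (not the LAPD data) to show that `dropna` and `drop_duplicates` return new DataFrames, so their results must be assigned to take effect. The variable names are placeholders chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from the LAPD data
toy = pd.DataFrame(
    {
        "Vict_Age": [31, 32, np.nan, 31],
        "Vict_Sex": ["F", "M", "F", "F"],
    }
)

# Calling dropna() without assignment leaves the original frame untouched
toy.dropna()
rows_before = len(toy)

# Assign the result to keep the cleaned frame
cleaned = toy.dropna().drop_duplicates()
rows_after = len(cleaned)

# subset= limits the missing-value check to the listed columns
age_known = toy.dropna(subset=["Vict_Age"])

print(rows_before, rows_after, len(age_known))
```

When a dataset has columns that are mostly empty, restricting `dropna` with `subset=` to the columns actually used in the analysis avoids discarding otherwise complete observations.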

In [1]:
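import pandas as pd

# Load the full dataset and replace spaces in column names with underscores
large_crime_data = pd.read_csv("Crime_Data_from_2020_to_Present.csv")
large_crime_data.columns = large_crime_data.columns.str.replace(' ', '_')

# Treat the occurrence date as text so that it can be searched for the year
large_crime_data['DATE_OCC'] = large_crime_data['DATE_OCC'].astype(str)

# Filter rows based on the year
crime_data_2021 = large_crime_data[large_crime_data['DATE_OCC'].str.contains("2021")]
# Note: dropna() and drop_duplicates() return new DataFrames. Their results are
# not assigned here, so crime_data_2021 is unchanged and still contains missing values.
crime_data_2021.dropna()
crime_data_2021.drop_duplicates()
# Drop the columns that are not needed for the analysis
clean_crime_data_2021 = crime_data_2021.drop(columns = ['DR_NO', 'Date_Rptd', 'Rpt_Dist_No', 'Part_1-2', 'Crm_Cd', 'Mocodes', 'Premis_Cd', 'Premis_Desc', 'Weapon_Used_Cd', 'Weapon_Desc', 'Status', 'Status_Desc', 'Crm_Cd_1', 'Crm_Cd_2', 'Crm_Cd_3', 'Crm_Cd_4', 'LOCATION', 'Cross_Street', 'LAT', 'LON'])

# Save the 2021 subset (the row index is written as an extra first column)
clean_crime_data_2021.to_csv('crime_data_2021.csv')
clean_crime_data_2021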

| Unnamed: 0 | DATE_OCC | TIME_OCC | AREA | AREA_NAME | Crm_Cd_Desc | Vict_Age | Vict_Sex | Vict_Descent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 250 | 02/01/2021 12:00:00 AM | 650 | 7 | Wilshire | BURGLARY | 31 | F | H |
| 1531 | 04/25/2021 12:00:00 AM | 300 | 16 | Foothill | THEFT PLAIN - PETTY ($950 & UNDER) | 32 | M | H |
| 1638 | 03/18/2021 12:00:00 AM | 1100 | 2 | Rampart | VEHICLE - STOLEN | 0 | | |
| 3296 | 02/07/2021 12:00:00 AM | 1800 | 2 | Rampart | VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA... | 0 | X | X |
| 4369 | 01/25/2021 12:00:00 AM | 338 | 11 | Northeast | ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT | 36 | F | H |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 276579 | 01/19/2021 12:00:00 AM | 1600 | 4 | Hollenbeck | VEHICLE - STOLEN | 0 | | |
| 276580 | 05/28/2021 12:00:00 AM | 1930 | 20 | Olympic | TRESPASSING | 29 | M | H |
| 276581 | 03/19/2021 12:00:00 AM | 1105 | 12 | 77th Street | VEHICLE - STOLEN | 0 | | |
| 276582 | 03/04/2021 12:00:00 AM | 2210 | 5 | Harbor | FALSE IMPRISONMENT | 41 | F | B |


In [2]:
import pandas as pd

# Location of the saved 2021 subset
url = "https://raw.githubusercontent.com/benjaminnguyen04/groupproject/main/crime_data_2021.csv"

crime_data_2021 = pd.read_csv(url)
# Remove rows where the victim age is recorded as 0
crime_data_2021 = crime_data_2021[crime_data_2021["Vict_Age"] != 0]
# Remove rows where the victim sex is coded "X" or "H"
crime_data_2021 = crime_data_2021[crime_data_2021["Vict_Sex"] != "X"]
crime_data_2021 = crime_data_2021[crime_data_2021["Vict_Sex"] != "H"]

crime_data_2021

| Unnamed: 0.1 | Unnamed: 0 | DATE_OCC | TIME_OCC | AREA | AREA_NAME | Crm_Cd_Desc | Vict_Age | Vict_Sex | Vict_Descent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 250 | 02/01/2021 12:00:00 AM | 650 | 7 | Wilshire | BURGLARY | 31 | F | H |
| 1 | 1531 | 04/25/2021 12:00:00 AM | 300 | 16 | Foothill | THEFT PLAIN - PETTY ($950 & UNDER) | 32 | M | H |
| 4 | 4369 | 01/25/2021 12:00:00 AM | 338 | 11 | Northeast | ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT | 36 | F | H |
| 8 | 5374 | 02/09/2021 12:00:00 AM | 2130 | 1 | Central | VANDALISM - MISDEAMEANOR ($399 OR UNDER) | 54 | F | W |
| 9 | 5549 | 03/10/2021 12:00:00 AM | 1800 | 19 | Mission | BURGLARY | 50 | F | H |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 79802 | 276574 | 05/25/2021 12:00:00 AM | 813 | 6 | Hollywood | BATTERY - SIMPLE ASSAULT | 33 | M | B |
| 79803 | 276575 | 04/02/2021 12:00:00 AM | 1730 | 2 | Rampart | BUNCO, GRAND THEFT | 39 | M | W |
| 79804 | 276576 | 05/02/2021 12:00:00 AM | 900 | 10 | West Valley | THEFT OF IDENTITY | 38 | M | H |
| 79808 | 276580 | 05/28/2021 12:00:00 AM | 1930 | 20 | Olympic | TRESPASSING | 29 | M | H |


Next, we split the data into training and testing sets, which is necessary in order to perform K-nearest neighbors classification.

In [3]:
# Create training and testing data
from sklearn.model_selection import train_test_split

# Divide the data into a training set (75%) and a test set (25%)
crime_train, crime_test = train_test_split(
    crime_data_2021, train_size=0.75
)

# Create the X (predictors) and y (target) variables for each data set:
# victim age and time of occurrence are used to predict the crime type
X_train = crime_train[["Vict_Age", "TIME_OCC"]]
y_train = crime_train["Crm_Cd_Desc"]

X_test = crime_test[["Vict_Age", "TIME_OCC"]]
y_test = crime_test["Crm_Cd_Desc"]
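
As a rough sketch of how a split like this could feed a K-nearest neighbors classifier, the example below fits scikit-learn's `KNeighborsClassifier` on a small hypothetical DataFrame. The toy values, the choice of `n_neighbors=3`, the `stratify` and `random_state` arguments, and the standardization step are all assumptions made for illustration; they are not part of the analysis above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two well-separated groups, not real LAPD records
toy = pd.DataFrame({
    "Vict_Age": [20, 21, 22, 23, 24, 25, 26, 27, 60, 61, 62, 63, 64, 65, 66, 67],
    "TIME_OCC": [2200, 2220, 2240, 2300, 2310, 2320, 2330, 2340,
                 900, 920, 940, 1000, 1010, 1020, 1030, 1040],
    "Crm_Cd_Desc": ["BURGLARY"] * 8 + ["THEFT OF IDENTITY"] * 8,
})

# stratify keeps the class proportions the same in both splits;
# random_state (an arbitrary value) makes the split reproducible
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["Crm_Cd_Desc"], random_state=0
)

predictors = ["Vict_Age", "TIME_OCC"]
# Standardize the predictors so that TIME_OCC does not dominate the distance
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(toy_train[predictors], toy_train["Crm_Cd_Desc"])

# Proportion of test observations whose crime type is predicted correctly
accuracy = knn.score(toy_test[predictors], toy_test["Crm_Cd_Desc"])
print(accuracy)
```

On real data the number of neighbors would normally be tuned, for example with cross-validation on the training set, instead of being fixed in advance.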