# Crime Prediction
## Author: Frank Serafine
---

**GOAL:** Train several time-series machine learning models to forecast one year of crime activity from ten years of incident history.

**SOURCE:** FBI's [Crime Data Explorer](https://crime-data-explorer.fr.cloud.gov/pages/downloads)

**TRIMMED INCIDENT-LEVEL DATA SET:** [Crime_Pred.csv](https://drive.google.com/file/d/1nM8-DCRj2GA8FxOi04btjwks0gxKAQom/view?usp=sharing)

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [44]:
df = pd.read_csv('Crime_Pred.csv')
df_altdrop = pd.read_csv('crimepred_altdrop.csv')
df.head()

Unnamed: 0,Incident ID,Offense ID,Date,Hour,Offense,Attempted or Completed,Offender Age,Offender Gender,Offender Race,Offender Bias,Weapon,Victim Type,Relation to Offender,Victim Age,Victim Gender,Victim Race
0,52643345,60268377,2010-07-15,10.0,Theft of Motor Vehicle Parts or Accessories,A,,U,Unknown,,,Individual,Relationship Unknown,40.0,M,Black or African American
1,52643346,60268378,2010-07-14,16.0,Theft of Motor Vehicle Parts or Accessories,C,,U,Unknown,,,Individual,Relationship Unknown,39.0,F,White
2,52643347,60274057,2010-07-15,11.0,Destruction/Damage/Vandalism of Property,C,,U,Unknown,,,Business,Not Applicable,,U,Unknown
3,52643348,60276871,2010-07-15,10.0,Theft From Motor Vehicle,C,,U,Unknown,,,Individual,Relationship Unknown,38.0,F,White
4,52643348,60276871,2010-07-15,10.0,Theft From Motor Vehicle,C,,U,Unknown,,,Business,Not Applicable,,U,Unknown


In [24]:
# Percentage of total observations that are null:

df.isna().sum()/df.shape[0]*100

# 4.6% of crime incidents did not have an hour logged.
# 34.6% of crime incidents did not log the age of the offender.
# 27.4% of crime incidents did not have an age listed for the victim, but this does not always indicate an omission:
# Crimes against an institution or against society do not have a specific victim and thus no victim age.

Incident ID                0.000000
Offense ID                 0.000000
Date                       0.000000
Hour                       4.562732
Offense                    0.000000
Attempted or Completed     0.000000
Offender Age              34.568539
Offender Gender            0.000000
Offender Race              0.000000
Offender Bias              0.000000
Weapon                     0.000000
Victim Type                0.000000
Relation to Offender       0.000000
Victim Age                27.404400
Victim Gender              0.000000
Victim Race                0.000000
dtype: float64
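
The note above that a missing victim age is not always an omission can be checked by breaking the null rate down by `Victim Type`. The following is a minimal sketch on a small made-up frame; the rows are illustrative only and `toy` is a hypothetical stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for df; values are illustrative only.
toy = pd.DataFrame({
    'Victim Type': ['Individual', 'Individual', 'Business',
                    'Society/Public', 'Individual'],
    'Victim Age': [40.0, np.nan, np.nan, np.nan, 38.0],
})

# Percentage of rows with no victim age, computed separately per victim type.
null_pct_by_type = (
    toy['Victim Age'].isna()
    .groupby(toy['Victim Type'])
    .mean()
    .mul(100)
)
print(null_pct_by_type)
```

If non-person victim types show a null rate near 100% while `Individual` is much lower, most of the missing victim ages are structural rather than omissions.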

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6319810 entries, 0 to 6319809
Data columns (total 16 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   Incident ID             int64  
 1   Offense ID              int64  
 2   Date                    object 
 3   Hour                    float64
 4   Offense                 object 
 5   Attempted or Completed  object 
 6   Offender Age            float64
 7   Offender Gender         object 
 8   Offender Race           object 
 9   Offender Bias           object 
 10  Weapon                  object 
 11  Victim Type             object 
 12  Relation to Offender    object 
 13  Victim Age              float64
 14  Victim Gender           object 
 15  Victim Race             object 
dtypes: float64(3), int64(2), object(11)
memory usage: 771.5+ MB


In [26]:
# Parse the Date strings (dtype object in df.info()) into datetime64 values.
df['Date'] = pd.to_datetime(df['Date'])
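
With `Date` parsed as datetime, one way to get from incident-level rows to a series a forecasting model can consume is to count incidents per day and split chronologically. This is a sketch under assumptions: counting distinct `Incident ID` values per day, filling missing days with zero, and the cutoff date are all choices made here for illustration, and `toy` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical toy rows standing in for df (note the gap on 2018-12-30).
toy = pd.DataFrame({
    'Incident ID': [1, 1, 2, 3, 4],
    'Date': pd.to_datetime(['2018-12-29', '2018-12-29', '2018-12-31',
                            '2019-01-01', '2019-01-01']),
})

# Count distinct incidents per calendar day; days with no rows become 0.
daily = (
    toy.groupby('Date')['Incident ID']
    .nunique()
    .asfreq('D', fill_value=0)
)

# Chronological split: rows before the cutoff train, the rest are held out.
cutoff = pd.Timestamp('2019-01-01')  # assumed cutoff, for illustration only
train = daily[daily.index < cutoff]
test = daily[daily.index >= cutoff]
```

Splitting by date rather than at random keeps future observations out of the training set, which matters for any time-series evaluation.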

In [41]:
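# Flag every row that has an exact duplicate; keep=False marks all copies,
# not only the repeats, so each full group can be inspected.
dupe_data = df.duplicated(keep=False)
df.loc[dupe_data, :]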

Unnamed: 0,Incident ID,Offense ID,Date,Hour,Offense,Attempted or Completed,Offender Age,Offender Gender,Offender Race,Offender Bias,Weapon,Victim Type,Relation to Offender,Victim Age,Victim Gender,Victim Race
30,52643365,60280691,2010-07-14,17.0,Simple Assault,C,17.0,F,Black or African American,,Personal Weapons,Individual,Victim Was Acquaintance,19.0,F,Black or African American
31,52643365,60280691,2010-07-14,17.0,Simple Assault,C,17.0,F,Black or African American,,Personal Weapons,Individual,Victim Was Acquaintance,19.0,F,Black or African American
32,52643365,60280691,2010-07-14,17.0,Simple Assault,C,22.0,M,Black or African American,,Personal Weapons,Individual,Victim Was Acquaintance,19.0,F,Black or African American
33,52643365,60280691,2010-07-14,17.0,Simple Assault,C,22.0,M,Black or African American,,Personal Weapons,Individual,Victim Was Acquaintance,19.0,F,Black or African American
34,52643365,60280692,2010-07-14,17.0,Destruction/Damage/Vandalism of Property,C,17.0,F,Black or African American,,,Individual,Victim Was Acquaintance,19.0,F,Black or African American
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6319543,118447058,143631782,2019-10-15,17.0,Shoplifting,C,19.0,F,White,,,Business,Not Applicable,,Not Applicable,Not Applicable
6319545,118447058,143631783,2019-10-15,17.0,False Pretenses/Swindle/Confidence Game,C,19.0,F,White,,,Business,Not Applicable,,Not Applicable,Not Applicable
6319546,118447058,143631783,2019-10-15,17.0,False Pretenses/Swindle/Confidence Game,C,19.0,F,White,,,Business,Not Applicable,,Not Applicable,Not Applicable
6319703,122904943,148700888,2019-11-10,12.0,Impersonation,C,,M,White,,,Individual,Relationship Unknown,20.0,F,White
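
`duplicated(keep=False)` flags every member of a duplicate group, which suits inspection, whereas `drop_duplicates(keep='first')` retains one row per group. A small sketch on made-up rows (`toy` is hypothetical) shows the difference:

```python
import pandas as pd

# Hypothetical toy rows: rows 0 and 1 are identical, rows 2 and 3 are unique.
toy = pd.DataFrame({
    'Incident ID': [10, 10, 10, 11],
    'Offender Age': [17.0, 17.0, 22.0, 30.0],
})

# keep=False marks both copies of the repeated row.
all_dupes = toy[toy.duplicated(keep=False)]

# keep='first' drops only the later copy and leaves one representative.
deduped = toy.drop_duplicates(keep='first')
```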


In [31]:
df.nunique()

Incident ID               3182165
Offense ID                3400357
Date                         3652
Hour                           24
Offense                        51
Attempted or Completed          2
Offender Age                   98
Offender Gender                 3
Offender Race                   7
Offender Bias                  33
Weapon                         22
Victim Type                     9
Relation to Offender           27
Victim Age                     98
Victim Gender                   4
Victim Race                     8
dtype: int64
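
The counts above list 3,182,165 distinct `Incident ID` values against the 6,319,810 rows reported by `df.info()`, so an incident occupies about two rows on average. The following is a minimal sketch, on a made-up frame, of how the rows-per-incident distribution could be inspected; `toy` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical toy rows standing in for df.
toy = pd.DataFrame({'Incident ID': [1, 2, 2, 3, 3, 3]})

# Number of rows recorded for each incident.
rows_per_incident = toy.groupby('Incident ID').size()

# Average rows per incident: total rows over distinct incidents.
avg_rows = len(toy) / toy['Incident ID'].nunique()
print(rows_per_incident.describe())
```

This matters when aggregating to a time series, since counting rows and counting incidents give different totals.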

In [47]:
# Keep one row from each group of fully identical rows.
df_altdrop = df_altdrop.drop_duplicates(keep='first')

In [49]:
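# Note: this divides by the row count of df, not of the deduplicated df_altdrop,
# so the percentages are relative to the size of the original frame.
# Use df_altdrop.shape[0] for percentages relative to df_altdrop itself.
df_altdrop.isna().sum()/df.shape[0]*100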

Incident ID                0.000000
Offense ID                 0.000000
Victim ID                  0.000000
Offender ID                0.000000
Date                       0.000000
Hour                       3.378013
Offense                    0.000000
Attempted or Completed     0.000000
Offender Age              26.928784
Offender Gender            0.000000
Offender Race              0.000000
Offender Bias              0.000000
Weapon                     0.000000
Victim Type                0.000000
Relation to Offender       0.000000
Victim Age                22.837079
Victim Gender              0.000000
Victim Race                0.000000
dtype: float64

In [None]:
# from sklearn.preprocessing import OneHotEncoder

# # Instantiate the OneHotEncoder
# ohe = OneHotEncoder()

# # Fit the OneHotEncoder to the Offense column and transform.
# # It expects a 2D array, so we first convert the column into a DataFrame.
# offense = pd.DataFrame(df['Offense'])
# encoded = ohe.fit_transform(offense)
# encoded
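
For comparison, here is a runnable sketch of one-hot encoding the `Offense` column on a tiny frame. `pd.get_dummies` is swapped in for scikit-learn's `OneHotEncoder` so that the example depends only on pandas; it is not the encoder used in the cell above. The offense names come from the previews earlier in the notebook, and `toy` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical toy rows standing in for df.
toy = pd.DataFrame({'Offense': ['Theft From Motor Vehicle', 'Simple Assault',
                                'Theft From Motor Vehicle', 'Shoplifting']})

# One indicator column per distinct offense; uint8 keeps memory use low.
encoded = pd.get_dummies(toy['Offense'], prefix='Offense', dtype='uint8')
print(encoded.columns.tolist())
```

With 51 distinct offenses and several million rows, a sparse result (the default output of `OneHotEncoder`, or `sparse=True` in `get_dummies`) is likely to be the more practical choice.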