# LAPD Crime Stats 

## Dataset column descriptions:

source: https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8

<ul>
    <li>DR_NO: Division of Records number</li>
    <li>Date Rptd: Date reported</li>
    <li>DATE OCC: Date incident occurred</li>
    <li>TIME OCC: Time incident occurred</li>
    <li>AREA: Police station (numbered 1 - 21)</li>
    <li>AREA NAME: Name of area station</li>
    <li>Rpt Dist No: Reporting district number</li>
    <li>Part 1-2: Part I or Part II level offense</li>
    <li>Crm Cd: Code for the crime committed</li>
    <li>Crm Cd Desc: Description of crime committed</li>
    <li>Mocodes: Modus Operandi code activity of associated suspect</li>
    <li>Vict Age: Age of victim</li>
    <li>Vict Sex: Sex of victim</li>
        <ul>
            <li>M: Male</li>
            <li>F: Female</li>
            <li>X: Unknown</li>
        </ul>
    <li>Vict Descent: Descent codes:</li>
        <ul>
            <li>A: Other Asian</li>
            <li>B: Black</li>
            <li>C: Chinese</li>
            <li>D: Cambodian</li>
            <li>F: Filipino</li>
            <li>G: Guamanian</li>
            <li>H: Hispanic/Latin/Mexican</li>
            <li>I: American Indian/Alaskan Native</li>
            <li>J: Japanese</li>
            <li>K: Korean</li>
            <li>L: Laotian</li>
            <li>O: Other</li>
            <li>P: Pacific Islander</li>
            <li>S: Samoan</li>
            <li>U: Hawaiian</li>
            <li>V: Vietnamese</li>
            <li>W: White</li>
            <li>X: Unknown</li>
            <li>Z: Asian Indian</li>
        </ul>
    <li>Premis Cd: Type of structure/vehicle/location where crime took place</li>
    <li>Premis Desc: Description of the premises code</li>
    <li>Weapon Used Cd: Type of weapon used in crime</li>
    <li>Weapon Desc: Description of the weapon code</li>
    <li>Status: Status of case</li>
    <li>Status Desc: Description of status</li>
    <li>Crm Cd 1: Indicates code for primary crime committed </li>
    <li>Crm Cd 2: Additional code for crimes committed</li>
    <li>Crm Cd 3: Additional code for crimes committed</li>
    <li>Crm Cd 4: Additional code for crimes committed</li>
    <li>LOCATION: Street address of crime</li>
    <li>Cross Street: Cross street of rounded address</li>
    <li>LAT: Latitude</li>
    <li>LON: Longitude</li>
</ul>
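
The single-letter codes listed above are easier to read once decoded. Here is a minimal sketch of how that could look, assuming the code lists above are complete. The names `VICT_SEX_LABELS`, `VICT_DESCENT_LABELS`, and `decode_codes` are made up for illustration and are not used anywhere else in this notebook.

```python
import pandas as pd

# Lookup tables transcribed from the column descriptions above.
VICT_SEX_LABELS = {"M": "Male", "F": "Female", "X": "Unknown"}
VICT_DESCENT_LABELS = {
    "A": "Other Asian", "B": "Black", "C": "Chinese", "D": "Cambodian",
    "F": "Filipino", "G": "Guamanian", "H": "Hispanic/Latin/Mexican",
    "I": "American Indian/Alaskan Native", "J": "Japanese", "K": "Korean",
    "L": "Laotian", "O": "Other", "P": "Pacific Islander", "S": "Samoan",
    "U": "Hawaiian", "V": "Vietnamese", "W": "White", "X": "Unknown",
    "Z": "Asian Indian",
}


def decode_codes(series, labels):
    """Replace each code with its label; codes missing from `labels` become NaN."""
    return series.map(labels)


# Tiny made-up sample, not real records.
sample = pd.DataFrame({"Vict Sex": ["F", "M", "X"], "Vict Descent": ["B", "H", "W"]})
decoded_sex = decode_codes(sample["Vict Sex"], VICT_SEX_LABELS)
decoded_descent = decode_codes(sample["Vict Descent"], VICT_DESCENT_LABELS)
```

Because `Series.map` returns NaN for anything missing from the lookup, undocumented codes show up as nulls instead of passing through silently.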

In [None]:
import pandas as pd
import numpy as np 

file_path = "/home/francisco/Downloads/Crime_Data_from_2020_to_Present.csv"

df = pd.read_csv(file_path)
df.head()

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.isnull().sum()

## Data Cleaning 
<p>Column names are inconsistently formatted, using underscores, dashes, and literal spaces as word separators. Some column names are in all caps. 
Several columns carry redundant information by providing both a code and the code's description. Crm Cd and Crm Cd 1 refer to the same data, although the first is read as an int and the second as a float (most likely because pandas falls back to float when a numeric column contains nulls). 
We can create a reference dictionary for each code/description pair and eliminate the description columns.</p>
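
Before treating Crm Cd 1 as redundant, it is worth checking the claim. The sketch below uses made-up rows and a hypothetical helper, `columns_match`, to compare the two columns wherever Crm Cd 1 is present. It also shows that a numeric column with a missing value comes out as float64 in pandas by default.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the two columns; not real records.
toy = pd.DataFrame({
    "Crm Cd": [624, 745, 121, 442],
    "Crm Cd 1": [624.0, 745.0, np.nan, 442.0],
})


def columns_match(frame, left, right):
    """Return True if `left` equals `right` on every row where `right` is not null."""
    present = frame[right].notna()
    return bool((frame.loc[present, left] == frame.loc[present, right]).all())


# The column containing NaN is stored as float64; the complete one stays an integer type.
dtypes = toy.dtypes.astype(str).to_dict()
same = columns_match(toy, "Crm Cd", "Crm Cd 1")
```

If the same check returns True on the real dataframe, dropping Crm Cd 1 loses nothing.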



In [None]:
np.sort(df['Crm Cd'].unique())

In [None]:
crime_codes = df[['Crm Cd', 'Crm Cd Desc']]
crime_codes.head()

In [None]:
crime_codes  = crime_codes.drop_duplicates()
crime_codes.head()

In [None]:
# Map each crime code to its description
crime_code_dictionary = dict(zip(crime_codes['Crm Cd'], crime_codes['Crm Cd Desc']))

    

In [None]:
crime_code_dictionary

In [None]:
# We can repeat these steps for the other columns that follow the same pattern, so let's wrap them in a function

def create_reference_dictionary(dataframe, column_1, column_2):
    """Return a dict mapping each unique value in column_1 to its value in column_2."""
    # Keep one row per (code, description) pair and skip pairs with a missing value
    temp_df = dataframe[[column_1, column_2]].drop_duplicates().dropna()

    return dict(zip(temp_df[column_1], temp_df[column_2]))

In [None]:
premis_reference_dictionary = create_reference_dictionary(df, 'Premis Cd', 'Premis Desc')

premis_reference_dictionary

In [None]:
premis_reference_dictionary = {int(key):value for (key, value) in premis_reference_dictionary.items()}

In [None]:
premis_reference_dictionary

In [None]:
# Now for Weapons and status

weapon_reference_dictionary = create_reference_dictionary(df, 'Weapon Used Cd', 'Weapon Desc')
status_reference_dictionary = create_reference_dictionary(df, 'Status', 'Status Desc')

weapon_reference_dictionary

In [None]:
np.sort(df['Weapon Used Cd'].unique())

In [None]:
# Again, no reason for the dictionary to have floats as keys 

weapon_reference_dictionary = {int(key):value for (key, value) in weapon_reference_dictionary.items()}

In [None]:
weapon_reference_dictionary

In [None]:
status_reference_dictionary

Now that we have our reference dictionaries, we no longer need the description columns in our dataframe. We can also drop DR_NO, since it is just an internal record number, and Crm Cd 1, which duplicates Crm Cd.
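
Dropping the description columns loses nothing, because a description can be rebuilt from its code whenever it is needed. A small sketch with a made-up lookup (`crime_code_lookup` and its placeholder descriptions stand in for the real `crime_code_dictionary`):

```python
import pandas as pd

# Placeholder lookup; the descriptions are dummies, not real LAPD text.
crime_code_lookup = {624: "EXAMPLE DESCRIPTION A", 745: "EXAMPLE DESCRIPTION B"}

# A frame that keeps only the code column.
slim = pd.DataFrame({"Crm Cd": [624, 745, 624]})

# Rebuild the description on demand instead of storing it in every row.
slim["Crm Cd Desc"] = slim["Crm Cd"].map(crime_code_lookup)
```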

In [None]:
df.columns

In [None]:
columns_to_drop = [
    'DR_NO','Crm Cd Desc', 'Weapon Desc', 
    'Premis Desc', 'Status Desc', 'Crm Cd 1'
]

cleaner_df = df.drop(columns_to_drop, axis = 1)
cleaner_df.shape


In [None]:
percent_null_values = (cleaner_df.isnull().sum()) / len(cleaner_df)

with pd.option_context('display.float_format','{:.2%}'.format):
    display(percent_null_values)

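There's something wrong with the DATE OCC and TIME OCC columns: the time of the incident is stored separately in TIME OCC, so anything past the first 10 characters of a DATE OCC value (the date itself) is redundant.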
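
One way to reconcile the two columns is to keep only the date part of DATE OCC and attach the time from TIME OCC. This is a sketch, not what the rest of the notebook does (it only trims the date string). The raw timestamp format in the sample rows is an assumption about the export, and `combine_date_and_time` is a hypothetical helper.

```python
import pandas as pd

# Made-up rows; the "MM/DD/YYYY hh:mm:ss AM" layout is an assumption about the raw file.
raw = pd.DataFrame({
    "DATE OCC": ["01/08/2020 12:00:00 AM", "01/01/2020 12:00:00 AM"],
    "TIME OCC": [2230, 30],
})


def combine_date_and_time(frame):
    """Build one timestamp from the date part of DATE OCC and the HHMM integer in TIME OCC."""
    date_part = frame["DATE OCC"].str[:10]                  # keep "MM/DD/YYYY"
    time_part = frame["TIME OCC"].astype(str).str.zfill(4)  # 30 -> "0030"
    return pd.to_datetime(date_part + " " + time_part, format="%m/%d/%Y %H%M")


occurred_at = combine_date_and_time(raw)
```

Zero-padding matters here: TIME OCC is an integer, so half past midnight is stored as 30 and would not parse as a time without it.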

In [None]:
df[['DATE OCC', 'TIME OCC']].head(15)


In [None]:
df['DATE OCC'][0][:10]

In [None]:
new_dates = [date[:10] for date in cleaner_df['DATE OCC'].values ]

In [None]:
cleaner_df['date_occurred'] = new_dates

In [None]:
cleaner_df[['DATE OCC','date_occurred']].head(10)

That's much better.
We also have 6 columns dedicated to location, which is a bit much considering we can derive most of this information from the LAT and LON coordinates alone.
We'll drop the original DATE OCC column, along with the 4 location columns that are NOT LAT and LON.

In [None]:
columns_to_drop = [
    'DATE OCC', 'AREA', 'AREA NAME',
    'LOCATION', 'Cross Street'
]

more_cleaner_df = cleaner_df.drop(columns_to_drop, axis = 1)
more_cleaner_df.head(10)

Much better! The values in Premis Cd, Weapon Used Cd, and the Crm Cd 2/3/4 columns still need to be converted to ints. We can also fill their null values with 0: every real code has three digits, so 0 can never be mistaken for one.
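
To make the idea concrete, here is a small sketch on a made-up series. It shows two options: filling nulls with the 0 sentinel and casting to a plain integer, or keeping the nulls by casting to the pandas nullable `Int64` dtype.

```python
import numpy as np
import pandas as pd

# Made-up codes with gaps; real codes have three digits, so 0 cannot collide with one.
codes = pd.Series([400.0, np.nan, 500.0, np.nan], name="Weapon Used Cd")

# Option 1: use 0 as the "no code" sentinel, then cast to a plain integer dtype.
filled = codes.fillna(0).astype("int64")

# Option 2: keep the missing values by using the nullable integer dtype instead.
nullable = codes.astype("Int64")
```

The sentinel keeps the column a plain integer, which most modeling libraries expect. The nullable dtype keeps the distinction between "no weapon recorded" and a real code.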

In [None]:
percent_null_values = (more_cleaner_df.isnull().sum()) / len(more_cleaner_df)

with pd.option_context('display.float_format','{:.2%}'.format):
    display(percent_null_values)

In [None]:
more_cleaner_df['Vict Sex'].value_counts()

In [None]:
# H??? That wasn't in the documentation 

mysterious_biology = more_cleaner_df[more_cleaner_df['Vict Sex'] == 'H']
mysterious_biology.head(10)

In [None]:
for crime in mysterious_biology['Crm Cd']:
    print(f'{crime}: {crime_code_dictionary[crime]}')

An H value for victim sex is not listed in any documentation, and it seems unlikely that H would be used to identify an intersex victim, so we'll change these values to unknown (X).
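
Instead of replacing each stray value one at a time, anything outside the documented codes can be folded into X in a single step. This is a sketch; `DOCUMENTED_SEX_CODES` and `normalize_vict_sex` are names invented for the example.

```python
import pandas as pd

DOCUMENTED_SEX_CODES = {"M", "F", "X"}


def normalize_vict_sex(series):
    """Map any non-null value outside the documented codes to 'X'; leave nulls untouched."""
    unexpected = series.notna() & ~series.isin(DOCUMENTED_SEX_CODES)
    return series.mask(unexpected, "X")


# Made-up sample containing two undocumented values and one null.
sample = pd.Series(["M", "F", "H", "-", None, "X"])
normalized = normalize_vict_sex(sample)
```

Nulls are deliberately left alone so they can be handled separately from the undocumented values.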

In [None]:
# Taking care of all of the remaining null values

# Undocumented Vict Sex values (H and -) are treated as unknown (X).
# Vict Sex and Vict Descent have the same number of null values. While not stated explicitly,
# the documentation suggests these are crimes against property/things, so nulls get their own code (N).
more_cleaner_df['Vict Sex'] = more_cleaner_df['Vict Sex'].replace({'H': 'X', '-': 'X'})
more_cleaner_df['Vict Sex'] = more_cleaner_df['Vict Sex'].fillna('N')
more_cleaner_df['Vict Descent'] = more_cleaner_df['Vict Descent'].fillna('N')

more_cleaner_df['Mocodes'] = more_cleaner_df['Mocodes'].fillna(0)
more_cleaner_df['Weapon Used Cd'] = more_cleaner_df['Weapon Used Cd'].fillna(0)
more_cleaner_df['Crm Cd 2'] = more_cleaner_df['Crm Cd 2'].fillna(0)
more_cleaner_df['Crm Cd 3'] = more_cleaner_df['Crm Cd 3'].fillna(0)
more_cleaner_df['Crm Cd 4'] = more_cleaner_df['Crm Cd 4'].fillna(0)

In [None]:
percent_null_values = (more_cleaner_df.isnull().sum()) / len(more_cleaner_df)

with pd.option_context('display.float_format','{:.2%}'.format):
    display(percent_null_values)

In [None]:
# Convert the floats to ints (convert_dtypes picks the nullable Int64 dtype, so nulls survive as <NA>)

more_cleaner_df['Premis Cd'] = more_cleaner_df['Premis Cd'].convert_dtypes()


In [None]:
more_cleaner_df['Premis Cd'].unique()

In [None]:
# To avoid unnecessary repeats

def convert_and_fill_nulls(series):
    """Convert a float code column to the nullable integer dtype, then replace nulls with 0."""
    return series.convert_dtypes().fillna(0)

In [None]:
more_cleaner_df['Premis Cd'] = more_cleaner_df['Premis Cd'].fillna(0)

more_cleaner_df['Weapon Used Cd'] = convert_and_fill_nulls(more_cleaner_df['Weapon Used Cd'])

more_cleaner_df['Crm Cd 2'] = convert_and_fill_nulls(more_cleaner_df['Crm Cd 2'])

more_cleaner_df['Crm Cd 3'] = convert_and_fill_nulls(more_cleaner_df['Crm Cd 3'])

more_cleaner_df['Crm Cd 4'] = convert_and_fill_nulls(more_cleaner_df['Crm Cd 4'])

In [None]:
more_cleaner_df.dtypes

### We're almost ready for analysis! 
<p>A few things left to clean up:</p>
<ul>
    <li>Convert the date occurred and date reported columns to a datetime format</li>
    <li>Reorder and rename the columns for consistency</li>
</ul>
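
For the renaming step, a generic normalizer is one alternative to a hand-written mapping. The sketch below uses a hypothetical `to_snake_case` helper. It only standardizes the separators and case, so it will not expand abbreviations the way a hand-written dictionary can.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Lower-case a column name and collapse spaces, dashes, and other separators into single underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


# Empty frame that only carries a few of the original column names.
frame = pd.DataFrame(columns=["TIME OCC", "Part 1-2", "Rpt Dist No", "Crm Cd", "LAT"])
frame = frame.rename(columns=to_snake_case)
renamed = list(frame.columns)
```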

In [None]:
more_cleaner_df['date_occurred'] = pd.to_datetime(more_cleaner_df['date_occurred'])
more_cleaner_df['date_reported'] = pd.to_datetime(more_cleaner_df['Date Rptd'])

In [None]:
rename_dictionary = {
    'TIME OCC':'time_occurred',
    'Part 1-2':'part_offense',
    'Rpt Dist No':'reporting_district',
    'Crm Cd':'crime_code',
    'Mocodes':'mo_codes',
    'Vict Age':'victim_age',
    'Vict Sex':'victim_sex',
    'Vict Descent':'victim_descent',
    'Premis Cd':'premises_code',
    'Weapon Used Cd':'weapon_used_code',
    'Status':'status_code',
    'Crm Cd 2':'crime_code_2',
    'Crm Cd 3':'crime_code_3',
    'Crm Cd 4':'crime_code_4',
    'LAT':'latitude',
    'LON':'longitude'
}

In [None]:
more_cleaner_df = more_cleaner_df.rename(columns = rename_dictionary)

more_cleaner_df.columns

In [None]:
more_cleaner_df = more_cleaner_df.drop('Date Rptd', axis = 1)

In [None]:
clean_df = more_cleaner_df[['date_occurred','time_occurred', 'part_offense', 'crime_code', 'crime_code_2', 
                            'crime_code_3', 'crime_code_4', 'victim_age', 'victim_sex', 'victim_descent', 
                            'weapon_used_code', 'premises_code', 'status_code', 'reporting_district', 'date_reported',
                            'mo_codes', 'latitude', 'longitude']]

clean_df.head()

In [None]:
import pdfquery

# Load the MO code reference PDF
pdf = pdfquery.PDFQuery('/home/francisco/Downloads/MO_CODES_Numerical_20180627.pdf')
pdf.load()

# Dump the parsed layout tree to XML for inspection
pdf.tree.write('mo_codes.xml', pretty_print=True)
pdf

In [None]:
# Extract the text of every horizontal text line in the PDF
mo_codes = pdf.pq('LTTextLineHorizontal').text()

print(mo_codes)

In [None]:
# vs what we started with 
df.head()

We'll also pickle the reference dictionaries for future use 
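
For reference, reading a pickled dictionary back is the mirror image of writing it. The sketch below round-trips a made-up dictionary through a temporary folder; `status_lookup` and its placeholder descriptions are stand-ins, not the real reference data.

```python
import os
import pickle
import tempfile

# Made-up stand-in for one of the reference dictionaries.
status_lookup = {"IC": "example description A", "AO": "example description B"}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "status_reference_dictionary.pkl")

    # Write the dictionary to disk...
    with open(path, "wb") as file:
        pickle.dump(status_lookup, file, protocol=pickle.HIGHEST_PROTOCOL)

    # ...and read it back.
    with open(path, "rb") as file:
        restored = pickle.load(file)
```

Only unpickle files you created or trust, since loading a pickle can execute arbitrary code.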

In [None]:
import pickle 


with open('crime_code_reference.pickle', 'wb') as file:
    pickle.dump(crime_code_dictionary, file, protocol=pickle.HIGHEST_PROTOCOL)


with open('weapon_reference_dictionary.pkl', 'wb') as file:
    pickle.dump(weapon_reference_dictionary, file, protocol=pickle.HIGHEST_PROTOCOL)
    
    
with open('status_reference_dictionary.pkl', 'wb') as file:
    pickle.dump(status_reference_dictionary, file, protocol=pickle.HIGHEST_PROTOCOL)
    

with open('premis_reference_dictionary.pkl', 'wb') as file:
    pickle.dump(premis_reference_dictionary, file, protocol=pickle.HIGHEST_PROTOCOL)
    

In [None]:
clean_df.to_pickle('./clean_crime_data.pkl')

## Building the ML Model

In [1]:
# Starting with the cleaned data set (loaded here from ml_crime_data.pkl, not the clean_crime_data.pkl written above)

import pandas as pd
import numpy as np

file_path = './ml_crime_data.pkl' 

df = pd.read_pickle(file_path)
df.head()

| | date_occurred | time_occurred | part_offense | crime_code | crime_code_2 | crime_code_3 | crime_code_4 | victim_age | victim_sex | victim_descent | weapon_used_code | premises_code | status_code | reporting_district | date_reported | mo_codes | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-08 | 2230 | 2 | 624 | 0 | 0 | 0 | 36 | F | B | 400 | 501 | AO | 377 | 2020-01-08 | 0444 0913 | 34.0141 | -118.2978 |
| 1 | 2020-01-01 | 330 | 2 | 624 | 0 | 0 | 0 | 25 | M | H | 500 | 102 | IC | 163 | 2020-01-02 | 0416 1822 1414 | 34.0459 | -118.2545 |
| 3 | 2020-01-01 | 1730 | 2 | 745 | 998 | 0 | 0 | 76 | F | W | 0 | 502 | IC | 1543 | 2020-01-01 | 0329 1402 | 34.1685 | -118.4019 |
| 5 | 2020-01-01 | 30 | 1 | 121 | 998 | 0 | 0 | 25 | F | H | 500 | 735 | IC | 163 | 2020-01-02 | 0413 1822 1262 1415 | 34.0452 | -118.2534 |
| 6 | 2020-01-02 | 1315 | 1 | 442 | 998 | 0 | 0 | 23 | M | H | 0 | 404 | IC | 161 | 2020-01-02 | 1402 2004 0344 0387 | 34.0483 | -118.2631 |


In [2]:
# Bucket each three-digit code by its leading digit (e.g. 624 -> 6) to cut down the number of categories
code_columns = ['crime_code', 'crime_code_2', 'crime_code_3', 'crime_code_4',
                'weapon_used_code', 'premises_code']

for column in code_columns:
    df[column] = df[column] // 100

In [3]:
# Number of whole days between the incident and the report
reporting_lag = (df['date_reported'] - df['date_occurred']).dt.days

In [4]:
df['reporting_lag'] = reporting_lag

In [5]:
from sklearn.preprocessing import MinMaxScaler

# Rescale the continuous features to the [0, 1] range, one column at a time.
# Note: the scaler is fit on the full data set, before the train/test split below,
# so the test rows influence the scaling ranges.
columns_to_scale = ['victim_age', 'time_occurred', 'latitude', 'longitude', 'reporting_lag']

scaler = MinMaxScaler()

for column in columns_to_scale:
    df[column] = scaler.fit_transform(np.array(df[column]).reshape(-1,1)).ravel()

df = df.drop(['date_occurred','date_reported'],axis=1)

In [6]:
df.head()

| | time_occurred | part_offense | crime_code | crime_code_2 | crime_code_3 | crime_code_4 | victim_age | victim_sex | victim_descent | weapon_used_code | premises_code | status_code | reporting_district | mo_codes | latitude | longitude | reporting_lag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.945293 | 2 | 6 | 0 | 0 | 0 | 0.376238 | F | B | 4 | 5 | AO | 377 | 0444 0913 | 0.990674 | 0.003116 | 0.0 |
| 1 | 0.139525 | 2 | 6 | 0 | 0 | 0 | 0.267327 | M | H | 5 | 1 | IC | 163 | 0416 1822 1414 | 0.9916 | 0.003481 | 0.000705 |
| 3 | 0.733249 | 2 | 7 | 9 | 0 | 0 | 0.772277 | F | W | 0 | 5 | IC | 1543 | 0329 1402 | 0.995171 | 0.002239 | 0.0 |
| 5 | 0.012299 | 1 | 1 | 9 | 0 | 0 | 0.267327 | F | H | 5 | 7 | IC | 163 | 0413 1822 1262 1415 | 0.99158 | 0.00349 | 0.000705 |
| 6 | 0.557252 | 1 | 4 | 9 | 0 | 0 | 0.247525 | M | H | 0 | 4 | IC | 161 | 1402 2004 0344 0387 | 0.99167 | 0.003409 | 0.0 |


In [7]:
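# Split each space-separated mo_codes string into a list of individual codes
mcs = df['mo_codes']

str_mcs = [str(x) for x in mcs]

new_list = [x.split(" ") for x in str_mcs]

# The longest list tells us how many mo_code columns we need
lens = [len(x) for x in new_list]
print(max(lens))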

10


In [8]:
# Pad every list with zeros so each row has exactly 10 codes (the maximum found above)
for x in new_list:
    while len(x) < 10:
        x.append(0)

# Stack the padded lists into a single (rows, 10) array
mo_code_arr = np.asarray(new_list)
mo_code_arr.shape

(661006, 10)
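
The same split-and-pad step can also be written with pandas string methods, which keeps the original row labels so that a later join lines up row for row. This is a sketch on made-up rows with a deliberately gapped index; it is an alternative to the loop above, not what the notebook runs.

```python
import pandas as pd

# Made-up rows with a gapped index, like a frame that had some rows removed.
toy = pd.DataFrame({"mo_codes": ["0444 0913", "0416 1822 1414", "0"]}, index=[0, 1, 3])

# expand=True yields one column per code and keeps toy's index;
# rows with fewer codes are padded with missing values, filled with "0" here.
wide = toy["mo_codes"].astype(str).str.split(" ", expand=True)
wide = wide.fillna("0").astype(float)
wide.columns = [f"mo_code_{i + 1}" for i in range(wide.shape[1])]

# Joining on the shared index attaches each row's codes to that same row.
joined = toy.join(wide)
```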

In [9]:
# Build the MO code columns on df's own index. df's index has gaps (0, 1, 3, 5, ...),
# so a default 0..n-1 index would attach codes to the wrong rows when joined.
mo_code_columns = [f'mo_code_{i}' for i in range(1, 11)]
mo_code_df = pd.DataFrame(mo_code_arr, columns=mo_code_columns, index=df.index, dtype='float')
df = df.join(mo_code_df)
df = df.drop(['mo_codes','status_code','reporting_district'],axis=1)

In [10]:
df['victim_sex'] = df['victim_sex'].astype('str')
df['victim_descent'] = df['victim_descent'].astype('str')

In [11]:
# Binary encoding: 0 for male, 1 for every other value
df['victim_sex'] = (df['victim_sex'] != 'M').astype('float')

# Binary encoding: 0 for White, 1 for every other value
df['victim_descent'] = (df['victim_descent'] != 'W').astype('float')

In [12]:
df = df.fillna(0)

In [14]:
from sklearn.model_selection import train_test_split

X = np.asarray(df.drop('victim_sex',axis=1))
y = np.asarray(df['victim_sex'])

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [None]:
from sklearn.svm import SVC

clf = SVC().fit(X_train,y_train)
clf.score(X_test,y_test)

In [None]:
from sklearn.model_selection import GridSearchCV


params = {'C':np.arange(1,10),
          'gamma':np.arange(1,10)}

grid = GridSearchCV(clf,params)
grid.fit(X_train,y_train)

print(grid.best_params_)
print(grid.best_score_)
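
One possible refinement is to put the scaler inside a scikit-learn `Pipeline`, so that during the grid search it is refit on each training fold and the held-out fold never influences the scaling. The sketch below runs on a small synthetic data set, not the crime data, and the grid values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Small synthetic data set so the search finishes quickly; not the crime data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# The scaler is a pipeline step, so it is fit only on the training part of each fold.
pipeline = Pipeline([("scale", MinMaxScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"svc__C": [1, 10], "svc__gamma": [0.1, 1]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_demo, y_demo)
best = search.best_params_
```

On a data set the size of this one, running the search on a random subsample first is a pragmatic way to narrow the grid, since SVC training time grows quickly with the number of rows.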