## Decision Tree Analysis

Using the Decision Tree algorithm, we test whether the season, together with other factors such as area and victim type, can predict the type of crime.

In [39]:
import numpy as np 
import pandas as pd 
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Convert DATE OCC strings into season labels
def date_to_float(x):
    """Map a 'MM/DD/YYYY ...' date string to a season label.

    Despite the name, this returns a season string, not a float.
    """
    month = int(x.split('/')[0])

    # Boundaries used in this analysis: Winter covers Nov-Feb, Summer only Jun-Jul
    if month >= 11 or month <= 2:
        return 'Winter'
    elif month < 6:
        return 'Spring'
    elif month < 8:
        return 'Summer'
    else:
        return 'Autumn'
    
# import the data required 
data = pd.read_csv('crime.csv')

# Drop columns that are not used as features
data.drop(columns=['Vict Descent', 'TIME OCC'], inplace=True)

#Parse Date into Seasons
data['DATE OCC'] = data['DATE OCC'].apply(date_to_float)

# One-hot encode the different regions, seasons, and victim types
df1 = pd.get_dummies(data['AREA NAME'])
data = pd.concat([data, df1], axis=1).reindex(data.index)
data.drop('AREA NAME', axis=1, inplace=True)
df2 = pd.get_dummies(data['DATE OCC'])
data = pd.concat([data, df2], axis=1).reindex(data.index)
data.drop('DATE OCC', axis=1, inplace=True)
df3 = pd.get_dummies(data['Vict Sex'])
data = pd.concat([data, df3], axis=1).reindex(data.index)
data.drop('Vict Sex', axis=1, inplace=True)
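# Use every remaining column as a feature, except the crime code and its description
features = data.columns.tolist()
features.remove('Crm Cd')
features.remove('Crm Cd Desc')
X = data[features]
# The target is the textual crime description
y = data['Crm Cd Desc']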
X[1:50]

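| | Vict Age | 77th Street | Central | Devonshire | Foothill | Harbor | Hollenbeck | Hollywood | Mission | N Hollywood | ... | Wilshire | Autumn | Spring | Summer | Winter | - | F | H | M | X |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 25 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 76 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 5 | 25 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 6 | 23 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 7 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 8 | 23 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 9 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10 | 29 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |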
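
The season binning and one-hot encoding steps above can also be expressed more compactly. The following is a minimal sketch on a small made-up frame, not the notebook's actual code: `MONTH_TO_SEASON`, `preprocess`, and the `toy` rows are hypothetical names and data introduced for illustration, and the month boundaries simply mirror those in `date_to_float`. It assumes `DATE OCC` strings start with a two-digit month followed by `/`.

```python
import pandas as pd

# Same month boundaries as date_to_float (assumption: kept identical on purpose).
MONTH_TO_SEASON = {
    11: 'Winter', 12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer',
    8: 'Autumn', 9: 'Autumn', 10: 'Autumn',
}

def preprocess(frame):
    """Bin DATE OCC into seasons, then one-hot encode the categorical columns."""
    out = frame.copy()
    # Take the text before the first '/' as the month number.
    month = out['DATE OCC'].str.split('/').str[0].astype(int)
    out['DATE OCC'] = month.map(MONTH_TO_SEASON)
    # Empty prefix and separator keep the bare category names as column names.
    return pd.get_dummies(
        out,
        columns=['AREA NAME', 'DATE OCC', 'Vict Sex'],
        prefix='', prefix_sep='', dtype=int,
    )

# Hypothetical rows, only to show the shape of the result.
toy = pd.DataFrame({
    'DATE OCC': ['01/08/2020 12:00:00 AM',
                 '07/04/2020 12:00:00 AM',
                 '09/15/2020 12:00:00 AM'],
    'AREA NAME': ['Central', 'Hollywood', 'Central'],
    'Vict Sex': ['M', 'F', 'X'],
    'Vict Age': [25, 31, 0],
})
encoded = preprocess(toy)
print(encoded)
```

One caveat with bare category names: if two encoded columns shared a category value, the resulting column names would collide, so a prefix may be safer on other datasets.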


#### We split the data into training and test sets and fit a Decision Tree using the entropy criterion
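
For context, the entropy criterion scores a candidate split by how much it reduces the Shannon entropy of the class labels. The sketch below illustrates that idea with NumPy on made-up labels; `entropy` and `information_gain` are hypothetical helpers written for this illustration, not scikit-learn functions, and scikit-learn's internal implementation differs in detail.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, mask):
    """Entropy reduction from splitting `labels` by a boolean mask."""
    labels = np.asarray(labels)
    mask = np.asarray(mask, dtype=bool)
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    # Weight each child's entropy by the share of samples it receives.
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Made-up labels: a split that separates the classes perfectly gains 1 bit,
# while a split that leaves both children mixed 50/50 gains nothing.
crimes = np.array(['THEFT', 'THEFT', 'ASSAULT', 'ASSAULT'])
perfect_gain = information_gain(crimes, [True, True, False, False])
useless_gain = information_gain(crimes, [True, False, True, False])
print(perfect_gain, useless_gain)
```

At each node the tree picks the feature and threshold with the largest such gain, so features that barely change the label distribution contribute little.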

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=5)

print(X_train.shape)
print(y_train.shape)
print(X_train.head())
print(y_train.head())
 
my_decisiontree = DecisionTreeClassifier(criterion='entropy', random_state=0)
my_decisiontree.fit(X_train, y_train)

(729009, 31)
(729009,)
        Vict Age  77th Street  Central  Devonshire  Foothill  Harbor  \
321303         0            0        0           0         0       0   
341978        97            0        0           0         0       0   
769187        25            0        0           0         0       0   
448958        34            0        0           0         0       0   
296745         0            0        0           0         0       0   

        Hollenbeck  Hollywood  Mission  N Hollywood  ...  Wilshire  Autumn  \
321303           0          0        1            0  ...         0       0   
341978           0          0        0            0  ...         0       0   
769187           0          0        0            1  ...         0       1   
448958           0          0        0            0  ...         0       0   
296745           0          0        0            1  ...         0       0   

        Spring  Summer  Winter  -  F  H  M  X  
321303       0       1     

In [43]:
from sklearn.metrics import accuracy_score
y_predict_dt = my_decisiontree.predict(X_test)
score_dt = accuracy_score(y_test, y_predict_dt)

print(score_dt)

0.22798467147043505


#### Here we see an accuracy of 22.8%. This suggests that the Decision Tree could not find splits on Area, Victim Type, and Season that strongly separate the types of crime.
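
An accuracy figure is easier to interpret next to a trivial baseline, such as always predicting the most frequent class. The sketch below shows one way to compute that baseline with scikit-learn's `DummyClassifier`; the labels in `y_toy` are made up for illustration. In this notebook, one would presumably fit on `X_train`, `y_train` and score on `X_test`, `y_test` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up, imbalanced labels: 60% 'A', 30% 'B', 10% 'C'.
y_toy = np.array(['A'] * 6 + ['B'] * 3 + ['C'] * 1)
# The baseline ignores the features, so a placeholder matrix is enough.
X_toy = np.zeros((len(y_toy), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)  # share of the majority class
```

If the tree's accuracy is only slightly above this baseline, the features add little beyond the class frequencies; a clear margin would indicate that they carry some signal even at a low absolute accuracy.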


#### The data may also be sparse, since many of the features are one-hot encoded, which may lead to underfitting. To explore these relationships further, we look at other algorithms better suited to this type of data.
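
One way to check the underfitting hypothesis is to compare accuracy on the training set with accuracy on the test set. The sketch below does this on synthetic data from `make_classification`; `fit_report` is a hypothetical helper, and the synthetic data only stands in for the crime features. In this notebook it could be called as `fit_report(my_decisiontree, X_train, X_test, y_train, y_test)`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_report(tree, X_train, X_test, y_train, y_test):
    """Return train/test accuracy and the size of an already fitted tree."""
    return {
        'train_accuracy': accuracy_score(y_train, tree.predict(X_train)),
        'test_accuracy': accuracy_score(y_test, tree.predict(X_test)),
        'depth': tree.get_depth(),
        'leaves': tree.get_n_leaves(),
    }

# Synthetic stand-in data (assumption: not the crime dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
Xa, Xb, ya, yb = train_test_split(X_demo, y_demo, test_size=0.15, random_state=5)

demo_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
demo_tree.fit(Xa, ya)
report = fit_report(demo_tree, Xa, Xb, ya, yb)
print(report)
```

Low accuracy on both sets points toward features that carry little signal (underfitting), whereas high training accuracy with low test accuracy points toward overfitting.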