# Crime Prediction

## Initializations

This notebook uses Intel's distribution of Python for improved performance.

In [1]:
import sys
sys.version

'3.6.3 |Intel Corporation| (default, Oct 17 2017, 23:26:12) [MSC v.1900 64 bit (AMD64)]'

In [1]:
%%bash
pip freeze | grep scikit-learn #check scikit-learn version

scikit-learn==0.19.0




In [2]:
#import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Decision Trees

#### a.	Create a new field “highCrime” which is true if the crime rate per capita (ViolentCrimesPerPop) is greater than 0.1, and false otherwise. What are the percentages of positive and negative instances in the dataset?

In [4]:
#Load data
data = pd.read_csv('Data/communities-crime-clean.csv')

In [22]:
#Create highCrime field
data['highCrime'] = (data['ViolentCrimesPerPop'] > 0.1).astype(int)

In [53]:
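sum(data['highCrime'])/len(data['highCrime']) #fraction of positive instances

0.62719518314099343

About 62.7% of the instances are positive (highCrime), so the remaining 37.3% are negative.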
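Both class shares can also be read off in one step with `value_counts(normalize=True)`. The sketch below uses a small made-up series as a stand-in for `data['ViolentCrimesPerPop']`; the values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for data['ViolentCrimesPerPop']; not the real dataset.
rates = pd.Series([0.05, 0.20, 0.67, 0.09, 0.43, 0.12, 0.03, 0.31])

# Same thresholding rule as in the notebook: 1 if the rate exceeds 0.1.
high_crime = (rates > 0.1).astype(int)

# Share of each class as a fraction of all rows (index 0 = negative, 1 = positive).
class_share = high_crime.value_counts(normalize=True).sort_index()
print(class_share)
```

On the real data the same call would be `data['highCrime'].value_counts(normalize=True)`.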

#### b. Use DecisionTreeClassifier to learn a decision tree to predict highCrime on the entire dataset. Remember to exclude the crime rate feature (ViolentCrimesPerPop) from the input feature set so you are not cheating.

We need to remove the non-predictive attributes (communityname, fold), ViolentCrimesPerPop (to avoid cheating), and the target (highCrime).

In [24]:
#Create X,y variables
X = data.drop(['communityname','fold','ViolentCrimesPerPop','highCrime'],axis=1) #predictors
y = data.highCrime #target

In [25]:
#check dimensions
print(X.shape)
print(y.shape)

(1993, 101)
(1993,)


In [17]:
# import DecisionTreeClassifier 
from sklearn.tree import DecisionTreeClassifier 

In [26]:
# create and train decision tree
dTree = DecisionTreeClassifier()
dTree.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

**i.	What are the training accuracy, precision, and recall for this tree?**

In [31]:
# evaluate model on training data
from sklearn import metrics
yPred = dTree.predict(X)
print('Accuracy: ' + str(metrics.accuracy_score(y,yPred)))
print('Precision: ' + str(metrics.precision_score(y,yPred)))
print('Recall: ' + str(metrics.recall_score(y,yPred)))

Accuracy: 1.0
Precision: 1.0
Recall: 1.0


**ii.	What are the main features used for classification? Can you explain why they make sense (or not)?**

In [47]:
# assess feature importance
f_importance = pd.DataFrame(list(zip(X.columns,dTree.feature_importances_)))
f_importance.columns = ['feature','importance']
f_importance.sort_values(by=['importance'],ascending=False).head()

|  | feature | importance |
|---|---|---|
| 45 | PctKids2Par | 0.358652 |
| 4 | racePctWhite | 0.088817 |
| 6 | racePctHisp | 0.04512 |
| 35 | PctEmplManu | 0.021453 |
| 34 | PctEmploy | 0.018172 |


It is nearly impossible to establish any kind of causality, since so many of the features are highly correlated.
The most important feature by far is the percentage of kids in family housing with two parents (PctKids2Par, importance 0.36). The second and third are race related, and the fourth and fifth are employment related. It makes sense that these features are good predictors, but all of them are likely to be highly correlated with, for example, income and level of education, so no causal relationship can be read from this ranking.
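One way to check how strongly the top features move together is a correlation matrix. The sketch below is only an illustration: the column names `PctKids2Par` and `PctEmploy` are borrowed from the dataset, but the values are synthetic, generated from a hypothetical shared driver (`income`), and the `unrelated` column is invented for contrast.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n = 500

# Hypothetical latent driver shared by two of the features (synthetic data).
income = rng.normal(size=n)

df = pd.DataFrame({
    'PctKids2Par': income + 0.3 * rng.normal(size=n),
    'PctEmploy': income + 0.3 * rng.normal(size=n),
    'unrelated': rng.normal(size=n),  # invented column with no shared driver
})

# Pairwise Pearson correlations between the columns.
corr = df.corr()
print(corr.round(2))
```

On the real data, something like `X[['PctKids2Par','racePctWhite','PctEmploy']].corr()` would show whether the top-ranked features are largely interchangeable from the tree's point of view.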

#### c.	Now apply cross-validation (cross_val_score) to do 10-fold cross-validation to estimate the out-of-training accuracy of decision tree learning for this task
**i.	What are the 10-fold cross-validation accuracy, precision, and recall?**

In [56]:
#helper function
from sklearn.model_selection import cross_val_score

def crossValidationScores(model,cv = 10, printResults=True):
    '''Returns cross-validation accuracy, precision and recall for model (uses the global X and y)'''
    accuracy = cross_val_score(model,X,y,cv=cv,scoring='accuracy').mean()
    precision = cross_val_score(model,X,y,cv=cv,scoring='precision').mean()
    recall  = cross_val_score(model,X,y,cv=cv,scoring='recall').mean()
    if printResults:
        print('Accuracy: ' + str(accuracy))
        print('Precision: ' + str(precision))
        print('Recall: ' + str(recall))
    return accuracy,precision,recall

In [58]:
from sklearn.model_selection import cross_val_score #stratified by default
dTree = DecisionTreeClassifier()
accuracy,precision,recall = crossValidationScores(dTree)

Accuracy: 0.714984924623
Precision: 0.776228635597
Recall: 0.7816
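The helper above runs cross-validation three times, once per metric. As an alternative sketch, `cross_validate` from `sklearn.model_selection` accepts several scorers and fits each fold only once. The data here comes from `make_classification` and is purely synthetic; `X_demo` and `y_demo` are placeholder names, not the notebook's `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the crime dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

metric_names = ['accuracy', 'precision', 'recall']
model = DecisionTreeClassifier(random_state=0)

# One 10-fold run that scores every fold with all three metrics.
scores = cross_validate(model, X_demo, y_demo, cv=10, scoring=metric_names)

# Results are stored under 'test_<metric>' keys, one value per fold.
summary = {name: scores['test_' + name].mean() for name in metric_names}
print(summary)
```

Because all three metrics come from the same fitted trees, they describe the same models, which is not guaranteed when an unseeded tree is refit separately for each metric.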


**ii.	Why are they different from the results in the previous test?**

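The large gap between the training scores (all 1.0) and the cross-validated scores suggests overfitting: a tree grown without depth limits can memorize the training set, so its training metrics say little about performance on unseen data.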
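The gap between training and cross-validated accuracy can be made visible by comparing an unrestricted tree with a depth-limited one. This is a minimal sketch on synthetic, deliberately noisy data (`make_classification` with `flip_y=0.1`); the depth of 3 is an arbitrary choice for illustration, not a tuned value for the crime dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, so a perfect training fit must be memorization.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, n_informative=5,
                                     flip_y=0.1, random_state=0)

results = {}
for depth in [None, 3]:  # None = grow the tree until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Accuracy on the same data the tree was fitted on.
    train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
    # Accuracy estimated on held-out folds.
    cv_acc = cross_val_score(tree, X_demo, y_demo, cv=10, scoring='accuracy').mean()
    results[depth] = (train_acc, cv_acc)
    print(depth, round(train_acc, 3), round(cv_acc, 3))
```

The unrestricted tree reaches a training accuracy of 1.0 while its cross-validated accuracy is lower; the depth-limited tree no longer fits the training set perfectly.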

## Linear Classification

#### a.	Use GaussianNB to learn a Naive Bayes classifier to predict highCrime.

In [54]:
from sklearn.naive_bayes import GaussianNB

**i.	What is the 10-fold cross-validation accuracy, precision, and recall for this method?**

In [59]:
gNB = GaussianNB()
accuracy,precision,recall = crossValidationScores(gNB)

Accuracy: 0.765108040201
Precision: 0.912419309104
Recall: 0.6976


**ii.	What are the 10 most predictive features?**

In [81]:
# helper class
class normalizedAbsoluteMeansDifference(object):
    '''Scores each feature x as |mean(y) - mean(x)| / (sd(y) + sd(x))'''
    def __init__(self,X,y):
        self.X = X
        self.y = y
        self.nu_y = self.y.mean()
        self.sd_y = np.sqrt(self.y.var())
        self.df = self.dfNAMD()
    def featureNAMD(self,f):
        '''computes normalized absolute difference for one feature f'''
        x = self.X[f] # use the instance's data, not the global X
        nu_x = x.mean()
        sd_x = np.sqrt(x.var())
        return (abs(self.nu_y-nu_x)/(self.sd_y+sd_x))
    def dfNAMD(self):
        '''computes normalized absolute difference for a whole dataFrame'''
        dfNAMD = pd.DataFrame(self.X.columns, columns = ['feature'])
        dfNAMD['value'] = dfNAMD['feature'].apply(self.featureNAMD)
        return dfNAMD

In [83]:
# compute normalized Absolute Means Difference
NAMD = normalizedAbsoluteMeansDifference(X,y)
NAMD.df.sort_values(by=['value'],ascending=False).head()

|  | feature | value |
|---|---|---|
| 0 | state | 1.661602 |
| 52 | NumImmig | 1.045962 |
| 91 | NumStreet | 1.03477 |
| 90 | NumInShelters | 1.019511 |
| 50 | NumIlleg | 0.997523 |


The most predictive variables are:
1. State.
2. Number of people known to be foreign born.
3. Number of homeless people counted in the street.
4. Number of people in homeless shelters.
5. Number of kids born to never-married parents.

Most of them seem to be linked with poverty.
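The score computed above compares each feature's overall mean with the mean of the target, so a feature can rank highly simply because its scale differs from that of `highCrime`. A variant often used with Gaussian Naive Bayes instead compares the feature's mean in the positive class with its mean in the negative class, normalized by the two class standard deviations. The sketch below illustrates that variant under stated assumptions: `class_conditional_namd` is a hypothetical helper, not part of the notebook, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def class_conditional_namd(X, y):
    '''|mean(x | y=1) - mean(x | y=0)| / (sd(x | y=1) + sd(x | y=0)) per feature.'''
    pos, neg = X[y == 1], X[y == 0]
    score = (pos.mean() - neg.mean()).abs() / (pos.std() + neg.std())
    return score.sort_values(ascending=False)

rng = np.random.RandomState(0)
y_demo = pd.Series(rng.randint(0, 2, size=400))

# Synthetic features: one that shifts with the class, one pure noise,
# and one with a large mean that carries no class information.
X_demo = pd.DataFrame({
    'shifted': rng.normal(size=400) + 2.0 * y_demo,
    'noise': rng.normal(size=400),
    'large_scale': 50 + rng.normal(size=400),
})

ranking = class_conditional_namd(X_demo, y_demo)
print(ranking)
```

In this toy example only `shifted` scores highly; `large_scale` does not, even though its mean is far from the mean of the target. Whether the ranking for the crime data would change under this variant has not been checked here.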

**iii.	How do these results compare with your results from decision trees, above?**

Precision improved markedly (0.912 vs 0.776) and accuracy improved as well (0.765 vs 0.715), while recall dropped (0.698 vs 0.782).

#### b.	Use LinearSVC to learn a linear Support Vector Machine model to predict highCrime.

In [85]:
from sklearn.svm import LinearSVC

**i.	What is the 10-fold cross-validation accuracy, precision, and recall for this method?**

In [87]:
lSVC = LinearSVC()
accuracy,precision,recall = crossValidationScores(lSVC)

Accuracy: 0.768138190955
Precision: 0.831503504277
Recall: 0.7472


**ii.	What are the 10 most predictive features?**

In [91]:
lSVC.fit(X,y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [101]:
# assess feature importance
f_importance = pd.DataFrame(list(zip(X.columns,lSVC.coef_[0]))) #lSVC.coef_ returns a 2D array; row 0 holds the coefficients
f_importance.columns = ['feature','coefficient']
f_importance.sort_values(by=['coefficient'],ascending=False).head()

|  | feature | coefficient |
|---|---|---|
| 65 | PersPerOccupHous | 0.938564 |
| 85 | RentHighQ | 0.911587 |
| 39 | MalePctDivorce | 0.865824 |
| 6 | racePctHisp | 0.853696 |
| 72 | HousVacant | 0.813112 |


The most predictive variables are:
1. Mean persons per household.
2. Rental housing: upper quartile rent.
3. Percentage of males who are divorced.
4. Percentage of population that is of Hispanic heritage.
5. Number of vacant households.

These look far more disparate than the previous ones.
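Sorting by the raw coefficient only surfaces features with large positive weights; a large negative weight is just as predictive, pointing toward the negative class. A possible refinement, sketched below on synthetic data, is to standardize the features and rank by absolute coefficient. The feature names `f0` to `f7`, the use of `StandardScaler`, and the `max_iter` value are assumptions made for this illustration and are not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data with placeholder feature names f0..f7.
X_arr, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=['f%d' % i for i in range(8)])

# Put all features on a common scale so coefficient sizes are comparable.
X_scaled = StandardScaler().fit_transform(X_demo)
svc = LinearSVC(random_state=0, max_iter=10000).fit(X_scaled, y_demo)

# Rank by magnitude, keeping the sign for interpretation.
coefs = pd.DataFrame({'feature': X_demo.columns, 'coefficient': svc.coef_[0]})
coefs['abs_coefficient'] = coefs['coefficient'].abs()
top = coefs.sort_values(by='abs_coefficient', ascending=False).head(3)
print(top)
```

The sign of each coefficient then indicates the direction of the association, and the magnitude indicates its strength relative to the other standardized features.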

**iii.	How do these results compare with your results from decision trees, above?**

Accuracy (0.768 vs 0.715) and precision (0.832 vs 0.776) are better than the decision tree's, while recall is worse (0.747 vs 0.782).