# Crime Prediction

## Initializations

This notebook uses Intel's distribution of Python for improved performance.

In [1]:
import sys
sys.version

'3.6.3 |Intel Corporation| (default, Oct 17 2017, 23:26:12) [MSC v.1900 64 bit (AMD64)]'

In [None]:
%%bash
pip freeze | grep scikit-learn #check scikit-learn version

In [2]:
#import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Decision Trees

#### a.	Create a new field “highCrime” which is true if the crime rate per capita (ViolentCrimesPerPop) is greater than 0.1, and false otherwise. What are the percentage of positive and negative instances in the dataset?

In [3]:
#Load data
data = pd.read_csv('Data/communities-crime-clean.csv')

In [4]:
#Create highCrime field
data['highCrime'] = data['ViolentCrimesPerPop'].apply(lambda x: int(x>0.1))

In [5]:
sum(data['highCrime'])/len(data['highCrime']) #percentage of positive

0.62719518314099343
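The question asks for the share of both classes, while the cell above reports only the positive fraction. A minimal sketch of getting both at once with `value_counts(normalize=True)` follows. The series `toy_rates` is a made-up stand-in for `data['ViolentCrimesPerPop']`, and all variable names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for data['ViolentCrimesPerPop']; the values are made up.
toy_rates = pd.Series([0.05, 0.20, 0.67, 0.09, 0.12, 0.43, 0.03, 0.15])

# The comparison is vectorised, so no apply/lambda is needed.
toy_high_crime = (toy_rates > 0.1).astype(int)

# normalize=True returns class fractions that sum to 1.
class_share = toy_high_crime.value_counts(normalize=True)
positive_pct = 100 * class_share.get(1, 0.0)
negative_pct = 100 * class_share.get(0, 0.0)
print(f"positive: {positive_pct:.1f}%, negative: {negative_pct:.1f}%")
```

On the notebook's data, the fraction printed above (about 0.627) corresponds to roughly 62.7% positive instances and therefore roughly 37.3% negative instances.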

#### b. Use DecisionTreeClassifier to learn a decision tree to predict highCrime on the entire dataset. Remember to exclude the crime rate feature (ViolentCrimesPerPop) from the input feature set so you are not cheating.

We need to remove the non-predictive attributes (communityname, fold), ViolentCrimesPerPop (to avoid cheating) and the target (highCrime).

In [6]:
#Create X,y variables
X = data.drop(['communityname','fold','ViolentCrimesPerPop','highCrime'],axis=1) #predictors
y = data.highCrime #target

In [7]:
#check dimensions
print(X.shape)
print(y.shape)

(1993, 101)
(1993,)


In [8]:
# import DecisionTreeClassifier 
from sklearn.tree import DecisionTreeClassifier 

In [9]:
# create and train decision tree
dTree = DecisionTreeClassifier()
dTree.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

**i.	What are the training accuracy, precision, and recall for this tree?**

In [10]:
from sklearn import metrics

In [11]:
#helper function
def trainScores(yTrue,yPred, printResults=True):
    '''Returns accuracy, precision and recall of the predictions yPred against yTrue'''
    accuracy = metrics.accuracy_score(yTrue,yPred)
    precision = metrics.precision_score(yTrue,yPred)
    recall  = metrics.recall_score(yTrue,yPred)
    if printResults:
        print('Accuracy: ' + str(accuracy))
        print('Precision: ' + str(precision))
        print('Recall: ' + str(recall))
    return accuracy,precision,recall

In [13]:
# evaluate model on training data
yPred = dTree.predict(X)
accuracy,precision,recall = trainScores(y,yPred)

Accuracy: 1.0
Precision: 1.0
Recall: 1.0


**ii.	What are the main features used for classification? Can you explain why they make sense (or not)?**

In [14]:
# assess feature importance
f_importance = pd.DataFrame(list(zip(X.columns,dTree.feature_importances_)))
f_importance.columns = ['feature','importance']
f_importance.sort_values(by=['importance'],ascending=False).head()

| | feature | importance |
|---|---|---|
| 45 | PctKids2Par | 0.359725 |
| 4 | racePctWhite | 0.088817 |
| 6 | racePctHisp | 0.048585 |
| 35 | PctEmplManu | 0.018145 |
| 72 | HousVacant | 0.017388 |


It is nearly impossible to establish any kind of causality, since so many of the features are highly correlated.
The most relevant feature is the percentage of kids in family housing with two parents (PctKids2Par). The second and third are race related (racePctWhite, racePctHisp), the fourth is employment related (PctEmplManu) and the fifth is the number of vacant households (HousVacant). It makes sense that these features are good predictors, but it would be difficult to establish a causal relationship, since all of them are likely to be highly correlated with, for example, income and level of education.
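The claim that the features are highly correlated can be checked directly. The sketch below lists feature pairs by absolute Pearson correlation. It runs on made-up data in which two columns share a hypothetical latent driver (`latent`); the frame `toy` and its column names are assumptions for illustration, not columns of the crime dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
latent = rng.normal(size=500)  # hypothetical shared driver, e.g. income

# Made-up features: two depend on the latent driver, one is independent.
toy = pd.DataFrame({
    'feat_a': latent + rng.normal(scale=0.3, size=500),
    'feat_b': -latent + rng.normal(scale=0.3, size=500),
    'feat_c': rng.normal(size=500),
})

corr = toy.corr().abs()

# Keep only the upper triangle so that each pair appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)
print(pairs)
```

Replacing `toy` with the top-ranked columns of `X` would show how strongly the features singled out by the tree move together.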

#### c.	Now apply cross-validation (cross_val_score) to do 10-fold cross-validation to estimate the out-of-training accuracy of decision tree learning for this task
**i.	What are the 10-fold cross-validation accuracy, precision, and recall?**

In [15]:
#helper function
def crossValidationScores(model,X,y,cv = 10, printResults=True):
    '''Returns mean cross-validation accuracy, precision and recall for model'''
    # pass the cv argument through instead of hard-coding 10 folds
    accuracy = cross_val_score(model,X,y,cv=cv,scoring='accuracy').mean()
    precision = cross_val_score(model,X,y,cv=cv,scoring='precision').mean()
    recall  = cross_val_score(model,X,y,cv=cv,scoring='recall').mean()
    if printResults:
        print('Accuracy: ' + str(accuracy))
        print('Precision: ' + str(precision))
        print('Recall: ' + str(recall))
    return accuracy,precision,recall

In [16]:
from sklearn.model_selection import cross_val_score #stratified by default
dTree = DecisionTreeClassifier()
accuracy,precision,recall = crossValidationScores(dTree,X,y)

Accuracy: 0.721520100503
Precision: 0.774827397792
Recall: 0.7776

**ii.	Why are they different from the results in the previous test?**

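The large gap between the training results (accuracy, precision and recall all 1.0) and the cross-validated results (accuracy of about 0.72) suggests that the fully grown tree overfits the training data.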
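The overfitting argument can be illustrated by comparing training and cross-validated accuracy for a shallow tree and a fully grown one. This is a sketch on synthetic, deliberately noisy data (`make_classification` with `flip_y=0.1`), not on the crime dataset, and the depth values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y randomly reassigns about 10% of the labels.
X_toy, y_toy = make_classification(n_samples=400, n_features=20, n_informative=5,
                                   flip_y=0.1, random_state=0)

results = {}
for depth in (2, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_acc = cross_val_score(tree, X_toy, y_toy, cv=10, scoring='accuracy').mean()
    train_acc = tree.fit(X_toy, y_toy).score(X_toy, y_toy)
    results[depth] = (train_acc, cv_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```

A fully grown tree memorises the training set, noise included, so its training accuracy says little about out-of-sample performance. Limiting `max_depth` or `min_samples_leaf` is one way to narrow the gap.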

## Linear Classification

#### a.	Use GaussianNB to learn a Naive Bayes classifier to predict highCrime.

In [None]:
from sklearn.naive_bayes import GaussianNB

**i.	What is the 10-fold cross-validation accuracy, precision, and recall for this method?**

In [None]:
gNB = GaussianNB()
accuracy,precision,recall = crossValidationScores(gNB,X,y)

**ii.	What are the 10 most predictive features?**

In [None]:
# helper class
class normalizedAbsoluteMeansDifference(object):
    def __init__(self,X,y):
        self.X = X
        self.y = y
        self.nu_y = self.y.mean()
        self.sd_y = self.y.std()
        self.df = self.dfNAMD()
    def featureNAMD(self,f):
        '''computes the normalized absolute difference between the means of feature f and the target'''
        x = self.X[f]  # use the instance's data instead of the global X
        nu_x = x.mean()
        sd_x = x.std()
        return abs(self.nu_y-nu_x)/(self.sd_y+sd_x)
    def dfNAMD(self):
        '''computes the normalized absolute difference for every feature in the DataFrame'''
        dfNAMD = pd.DataFrame(self.X.columns, columns = ['feature'])
        dfNAMD['value'] = dfNAMD['feature'].apply(self.featureNAMD)
        return dfNAMD

In [None]:
# compute normalized Absolute Means Difference
NAMD = normalizedAbsoluteMeansDifference(X,y)
NAMD.df.sort_values(by=['value'],ascending=False).head()

The most predictive variables are:
1. State.
2. Number of people known to be foreign born.
3. Number of homeless people counted in the street.
4. Number of people in homeless shelters.
5. Number of kids born to never married parents.

Most of them seem to be linked with poverty.
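The helper class above compares each feature's overall mean with the mean of the target. A different and commonly used measure for a binary target compares the feature's mean within each class: |mean_pos - mean_neg| / (std_pos + std_neg). The sketch below implements that class-conditional variant as an assumption about what "normalized absolute difference of means" could mean here. It is not the notebook's method, and `class_mean_difference` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def class_mean_difference(X, y):
    """Score each column of X by |mean_pos - mean_neg| / (std_pos + std_neg)."""
    pos, neg = X[y == 1], X[y == 0]
    score = (pos.mean() - neg.mean()).abs() / (pos.std() + neg.std())
    return score.sort_values(ascending=False)

# Made-up data: one feature shifts with the class, the other is pure noise.
rng = np.random.default_rng(0)
y_toy = pd.Series(np.repeat([0, 1], 100))
X_toy = pd.DataFrame({
    'separating': rng.normal(loc=3.0 * y_toy.to_numpy(), scale=1.0),
    'noise': rng.normal(loc=0.0, scale=1.0, size=200),
})

ranking = class_mean_difference(X_toy, y_toy)
print(ranking)
```

Under this definition a feature scores highly only when its distribution differs between the two classes, which is also what Gaussian Naive Bayes exploits. Rankings from the two definitions can differ.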

**iii.	How do these results compare with your results from decision trees, above?**

Precision has improved, accuracy is slightly lower, and recall is much lower.

#### b.	Use LinearSVC to learn a linear Support Vector Machine model to predict highCrime.

In [None]:
from sklearn.svm import LinearSVC

**i.	What is the 10-fold cross-validation accuracy, precision, and recall for this method?**

In [None]:
lSVC = LinearSVC()
accuracy,precision,recall = crossValidationScores(lSVC,X,y)

**ii.	What are the 10 most predictive features?**

In [None]:
lSVC.fit(X,y)

In [None]:
# assess feature importance
f_importance = pd.DataFrame(list(zip(X.columns,lSVC.coef_[0]))) #lSVC.coef_ has shape (1, n_features) for a binary problem, so take row 0
f_importance.columns = ['feature','coefficient']
f_importance.sort_values(by=['coefficient'],ascending=False).head()

The most predictive variables are:
1. Mean persons per household.
2. Rental housing - upper quartile rent.
3. Percentage of males who are divorced.
4. Percentage of population that is of Hispanic heritage.
5. Number of vacant households.

These look far more disparate than the previous ones.
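Sorting the signed coefficients in descending order surfaces only the features that push towards the positive class. Features with large negative coefficients are equally informative. One possible refinement, sketched here on synthetic data, is to rank by absolute coefficient after putting the features on a common scale. The use of `StandardScaler` and all variable names are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data with hypothetical column names f0..f5.
X_arr, y_toy = make_classification(n_samples=300, n_features=6, n_informative=3,
                                   n_redundant=0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])

# Scaling first makes coefficient magnitudes comparable across features.
model = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000, random_state=0))
model.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name.
coefs = pd.Series(model.named_steps['linearsvc'].coef_[0], index=X_toy.columns)

# Order by magnitude but keep the sign, which shows the direction of the effect.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```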

**iii.	How do these results compare with your results from decision trees, above?**

Better accuracy and precision, but worse recall.

## Regression 

As before, we need to remove the non-predictive attributes (communityname, fold) and the target (ViolentCrimesPerPop). We also continue to remove highCrime, since it was derived from ViolentCrimesPerPop.

In [19]:
#Create X,y variables
X = data.drop(['communityname','fold','ViolentCrimesPerPop','highCrime'],axis=1) #predictors
y = data.ViolentCrimesPerPop #target

#### a.	Use LinearRegression to learn a linear model directly predicting the crime rate per capita (ViolentCrimesPerPop).

In [20]:
from sklearn.linear_model import LinearRegression

**i.	Using 10-fold cross-validation, what is the estimated mean-squared-error (MSE) of the model**

In [21]:
#helper function
def crossValidationMSE(model, X, y, cv = 10, printResults=True):
    '''Returns mean cross-validation MSE for model'''
    # scikit-learn reports negated MSE, so flip the sign; pass cv through instead of hard-coding 10
    MSE = -cross_val_score(model,X,y,cv=cv,scoring='neg_mean_squared_error').mean()
    if printResults:
        print('cross-validation MSE: ' + str(MSE))
    return MSE

In [22]:
linReg = LinearRegression()
MSE = crossValidationMSE(linReg,X,y)

cross-validation MSE: 0.0199142186962


**ii.	What is the MSE on the training set (train on all the data then test on it all)?**

In [23]:
#helper function
def trainMSE(yTrue,yPred, printResults=True):
    '''Returns model's MSE on train data'''
    MSE = metrics.mean_squared_error(yTrue,yPred)
    if printResults:
        print('train MSE: '  + str(MSE))
    return MSE

In [24]:
linReg.fit(X,y)
yPred = linReg.predict(X)
MSE = trainMSE(y,yPred)

train MSE: 0.0164410183897


**iii.	What features are most predictive of a high crime rate? A low crime rate?**

In [25]:
coefficients = pd.DataFrame(list(zip(X.columns,linReg.coef_))) #linReg.coef_ is a 1-D array with one coefficient per feature
coefficients.columns = ['feature','coefficient']
sorted_coefficients = coefficients.sort_values(by=['coefficient'],ascending=False)
sorted_coefficients.head()

| | feature | coefficient |
|---|---|---|
| 74 | PctHousOwnOcc | 0.573658 |
| 65 | PersPerOccupHous | 0.57086 |
| 39 | MalePctDivorce | 0.461738 |
| 59 | PctRecImmig8 | 0.415278 |
| 86 | MedRent | 0.34763 |


In [26]:
sorted_coefficients.tail()

| | feature | coefficient |
|---|---|---|
| 45 | PctKids2Par | -0.282629 |
| 22 | whitePerCap | -0.352592 |
| 80 | OwnOccLowQuart | -0.381967 |
| 42 | TotalPctDiv | -0.552048 |
| 68 | PctPersOwnOccup | -0.702292 |


Most Positive Coefficients:
1. Percent of households owner occupied.
2. Mean persons per household.
3. Percentage of males who are divorced.
4. Percent of _population_ who have immigrated within the last 8 years.
5. Median gross rent.

Most Negative Coefficients (from most negative to least negative):
1. Percent of people in owner occupied households.
2. Percentage of population who are divorced.
3. Owner occupied housing - lower quartile value.
4. Per capita income for caucasians.
5. Percentage of kids in family housing with two parents.

#### b.	Now use Ridge regression to reduce the amount of overfitting, pick the best alpha from among (10, 1, 0.1, 0.01, and 0.001).

In [27]:
from sklearn.linear_model import RidgeCV
rReg = RidgeCV(alphas=(10, 1, 0.1, 0.01, 0.001))

**i.	What is the estimated MSE of the model under 10-fold CV?**

In [33]:
MSE = crossValidationMSE(rReg,X,y)

cross-validation MSE: 0.0195627392747


**ii.	What is the MSE on the training set (train on all the data then test on it all)?**

In [29]:
rReg.fit(X,y)
yPred = rReg.predict(X)
MSE = trainMSE(y,yPred)

train MSE: 0.0166785109027


**iii.	What is the best alpha?**

In [30]:
rReg.alpha_

1.0

**iv.	What does this say about the amount of overfitting in linear regression for this problem?**

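Ridge regression lowers the cross-validation MSE only slightly (from about 0.0199 to 0.0196) and barely changes the training MSE (from about 0.0164 to 0.0167), so there was not much overfitting in the original linear regression model.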
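The size of the overfitting effect can be read off the gap between cross-validation MSE and training MSE. The short calculation below uses only the four values printed in the cells above. The dictionary names are hypothetical.

```python
# MSE values copied from the outputs recorded above.
recorded = {
    'linear': {'cv': 0.0199142186962, 'train': 0.0164410183897},
    'ridge': {'cv': 0.0195627392747, 'train': 0.0166785109027},
}

# Gap between out-of-sample and in-sample error for each model.
gaps = {name: mse['cv'] - mse['train'] for name, mse in recorded.items()}
for name, gap in gaps.items():
    print(f"{name}: cv - train = {gap:.5f}")

# Relative improvement in cross-validation MSE from adding the ridge penalty.
cv_improvement = 1 - recorded['ridge']['cv'] / recorded['linear']['cv']
print(f"relative CV improvement: {cv_improvement:.2%}")
```

The gap shrinks from about 0.0035 to about 0.0029, and the cross-validation MSE improves by a little under 2%, which is consistent with regularization helping only marginally here.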

#### c.	Now use polynomial features to do quadratic (second-order) polynomial regression

In [31]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)

**i.	What is the estimated MSE of the model under 10-fold CV?**

In [34]:
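from sklearn.pipeline import make_pipeline
# PolynomialFeatures only transforms the inputs and has no predict method,
# so chain it with a regressor before cross-validating
polyReg = make_pipeline(poly, LinearRegression())
MSE = crossValidationMSE(polyReg,X,y)

**ii.	What is the MSE on the training set (train on all the data then test on it all)?**

In [None]:
polyReg.fit(X,y)
yPred = polyReg.predict(X)
MSE = trainMSE(y,yPred)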

**iii.	Does this mean the quadratic model is better than the linear model for this problem?**

The results are the same as for the linear model. That would mean the second-order coefficients are 0 and the final model is identical to the linear one, in which case the linear model is preferable because it reaches the same model with less computation.
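`PolynomialFeatures` is a transformer, so a quadratic regression needs it chained with an estimator. The sketch below shows one way to do that with `make_pipeline` and to compare it with a plain linear model under 10-fold cross-validation. It uses made-up data with a genuinely quadratic term, not the crime dataset, and every variable name is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: the target depends on the square of the first feature.
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(200, 3))
y_toy = X_toy[:, 0] ** 2 + 0.5 * X_toy[:, 1] + rng.normal(scale=0.01, size=200)

# Degree 2 on 3 inputs yields 1 bias + 3 linear + 6 second-order columns.
n_expanded = PolynomialFeatures(degree=2).fit_transform(X_toy).shape[1]

def cv_mse(model):
    """Mean 10-fold cross-validation MSE (scikit-learn returns it negated)."""
    return -cross_val_score(model, X_toy, y_toy, cv=10,
                            scoring='neg_mean_squared_error').mean()

mse_lin = cv_mse(LinearRegression())
mse_quad = cv_mse(make_pipeline(PolynomialFeatures(degree=2), LinearRegression()))
print(n_expanded, mse_lin, mse_quad)
```

With n input features the degree-2 expansion has (n+1)(n+2)/2 columns, so the 101 predictors used here would expand to 5253 columns, more than the 1993 rows. In that regime an unregularized fit can match the training data very closely, so the training MSE alone would not be a reliable basis for choosing between the two models.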

## Dirty Data

In [36]:
#Load data
data = pd.read_csv('Data/communities-crime-full.csv',na_values='?')
#Create highCrime field
data['highCrime'] = data['ViolentCrimesPerPop'].apply(lambda x: int(x>0.1))
#Create X,y variables
X = data.drop(['communityname','fold','ViolentCrimesPerPop','highCrime'],axis=1) #predictors
y = data.highCrime #target

In [37]:
#TODO: handle the missing values (NaN) before fitting
dTree = DecisionTreeClassifier()
cvScores = crossValidationScores(dTree,X,y)
dTree.fit(X,y)
yPred = dTree.predict(X)
tScores = trainScores(y,yPred)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
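The error arises because the full dataset contains missing values (read in as NaN through `na_values='?'`). One possible remedy, sketched below on made-up data, is to impute the gaps inside a pipeline so that the fill values are learned from the training folds only. The sketch assumes a scikit-learn release that provides `sklearn.impute.SimpleImputer` (0.20 or later), and the median strategy is an arbitrary choice rather than the author's method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up data with hypothetical column names; the target depends on column 'a'.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.uniform(size=(200, 4)), columns=list('abcd'))
y_toy = (X_toy['a'] > 0.5).astype(int)

# Blank out roughly 10% of the entries to mimic the '?' values in the raw file.
X_toy = X_toy.mask(rng.uniform(size=X_toy.shape) < 0.1)

# The imputer is refitted on each training fold, so no test-fold data leaks in.
model = make_pipeline(SimpleImputer(strategy='median'),
                      DecisionTreeClassifier(random_state=0))
cv_accuracy = cross_val_score(model, X_toy, y_toy, cv=10, scoring='accuracy').mean()
print(cv_accuracy)
```

The same pipeline object could be passed to `crossValidationScores` in place of the bare `DecisionTreeClassifier`. Dropping the columns with missing values is another option, at the cost of discarding their information.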