# Crime Prediction

## Initializations

We will be using Intel's distribution of Python for improved performance.

In [1]:
import sys
sys.version

'3.6.3 |Intel Corporation| (default, Oct 17 2017, 23:26:12) [MSC v.1900 64 bit (AMD64)]'

In [2]:
%%bash
#check scikit-learn version
pip freeze | grep scikit-learn 

scikit-learn==0.19.0




In [3]:
#import modules
import numpy as np
import pandas as pd

## Load and check data

In [4]:
#Load data
data = pd.read_csv('Data/communities-crime-clean.csv')
data.shape

(1993, 104)

scikit-learn needs all attributes to be numeric. Let's check the data types to be sure there are no non-numeric columns.

In [5]:
#Check column types
data.dtypes.value_counts()

float64    101
int64        2
object       1
dtype: int64

In [6]:
data.dtypes[data.dtypes != float]

state             int64
communityname    object
fold              int64
dtype: object

The only column with a non-numeric data type is "communityname". This column won't be used in the model, so this is not a problem.
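
The same check can be made more direct by listing the non-numeric columns by name. The sketch below runs on a small made-up frame (the values are invented, not taken from the dataset) and relies on pandas' `select_dtypes`.

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented for illustration)
toy = pd.DataFrame({
    'state': [1, 2, 3],
    'communityname': ['A', 'B', 'C'],
    'population': [0.19, 0.00, 0.04],
})

# Columns that scikit-learn could not consume as-is
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```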

## Decision Trees

#### a.	Create a new field “highCrime” which is true if the crime rate per capita (ViolentCrimesPerPop) is greater than 0.1, and false otherwise. What are the percentage of positive and negative instances in the dataset?

In [7]:
#Create highCrime field
data['highCrime'] = data['ViolentCrimesPerPop'].apply(lambda x: int(x>0.1))

In [8]:
sum(data['highCrime'])/len(data['highCrime']) #fraction of positive instances

0.62719518314099343
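
The value above is the share of positive instances (about 62.7%), so the share of negative instances is its complement (about 37.3%). A minimal sketch of the same computation on invented rates, assuming the same strict `> 0.1` threshold:

```python
# Toy crime rates (invented values, not from the dataset)
rates = [0.05, 0.20, 0.67, 0.10, 0.43]

# Same rule as the notebook: strictly greater than 0.1 counts as high crime
high_crime = [int(r > 0.1) for r in rates]

positive_share = sum(high_crime) / len(high_crime)
negative_share = 1 - positive_share
print(positive_share, negative_share)
```

Note that a rate of exactly 0.1 falls in the negative class under this rule.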

#### b. Use DecisionTreeClassifier to learn a decision tree to predict highCrime on the entire dataset. Remember to exclude the crime rate feature (ViolentCrimesPerPop) from the input feature set so you are not cheating.

We need to remove the non-predictive attributes (communityname, fold), ViolentCrimesPerPop (to avoid cheating) and the target (highCrime).

In [9]:
#Create X,y variables
X = pd.DataFrame.drop(data,['communityname','fold','ViolentCrimesPerPop','highCrime'],axis=1) #predictors
y = data.highCrime #target

In [10]:
#check dimensions
print(X.shape)
print(y.shape)

(1993, 101)
(1993,)


In [11]:
# import DecisionTreeClassifier 
from sklearn.tree import DecisionTreeClassifier 

In [12]:
# create and train decision tree
dTree = DecisionTreeClassifier()
dTree.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

**i.	What are the training accuracy, precision, and recall for this tree?**

In [13]:
from sklearn import metrics

In [14]:
#helper function
def trainScores(yTrue,yPred, printResults=True):
    '''Returns accuracy, precision and recall for the given true and predicted labels'''
    accuracy = metrics.accuracy_score(yTrue,yPred)
    precision = metrics.precision_score(yTrue,yPred)
    recall  = metrics.recall_score(yTrue,yPred)
    if printResults:
        print('Accuracy: ' + str(accuracy))
        print('Precision: ' + str(precision))
        print('Recall: ' + str(recall))
    return accuracy,precision,recall

In [15]:
# evaluate model on training data
yPred = dTree.predict(X)
accuracy,precision,recall = trainScores(y,yPred)

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
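
For reference, the three metrics can be written out in terms of true/false positives and negatives. The helper below is a hypothetical plain-Python sketch (it is not part of scikit-learn) evaluated on invented labels, and it assumes at least one positive label and one positive prediction.

```python
def scores_from_counts(y_true, y_pred):
    """Accuracy, precision and recall for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)  # assumes at least one positive prediction
    recall = tp / (tp + fn)     # assumes at least one positive label
    return accuracy, precision, recall

# Invented labels, only to exercise the helper
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
accuracy, precision, recall = scores_from_counts(y_true, y_pred)
print(accuracy, precision, recall)
```

A score of 1.0 on all three, as obtained on the training data, means there are no false positives and no false negatives at all.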


**ii.	What are the main features used for classification? Can you explain why they make sense (or not)?**

In [16]:
# assess feature importance
f_importance = pd.DataFrame(list(zip(X.columns,dTree.feature_importances_)))
f_importance.columns = ['feature','importance']
f_importance.sort_values(by=['importance'],ascending=False).head()

| | feature | importance |
|---|---|---|
| 45 | PctKids2Par | 0.359725 |
| 4 | racePctWhite | 0.089867 |
| 6 | racePctHisp | 0.048585 |
| 35 | PctEmplManu | 0.023214 |
| 30 | PctLess9thGrade | 0.018903 |


It is nearly impossible to establish any kind of causality because so many of the features are highly correlated. The most important feature by far is the percentage of kids in family housing with two parents (PctKids2Par). The second and third most important features are race related, the fourth is employment related (PctEmplManu) and the fifth is education related (PctLess9thGrade). It makes sense that these features are good predictors, but all of them are likely to be highly correlated with, for example, income and level of education, so no causal relationship can be inferred from the tree alone.

#### c.	Now apply cross-validation (cross_val_score) to do 10-fold cross-validation to estimate the out-of-training accuracy of decision tree learning for this task
**i.	What are the 10-fold cross-validation accuracy, precision, and recall?**

In [17]:
from sklearn.model_selection import KFold

In [18]:
#helper function
def crossValidationScores(model,X,y,cv_folds = 10, printResults=True):
    '''Returns cross validation accuracy,precision and recall for model'''
    kf = KFold(n_splits=cv_folds) #no shuffling, so the folds are deterministic (random_state only matters with shuffle=True)
    
    accuracy = cross_val_score(model,X,y,cv=kf,scoring='accuracy').mean()
    precision = cross_val_score(model,X,y,cv=kf,scoring='precision').mean()
    recall  = cross_val_score(model,X,y,cv=kf,scoring='recall').mean()
    
    if printResults:
        print('Accuracy: ' + str(accuracy))
        print('Precision: ' + str(precision))
        print('Recall: ' + str(recall))
    
    return accuracy,precision,recall

In [19]:
from sklearn.model_selection import cross_val_score #used inside crossValidationScores; the KFold passed there is not stratified
dTree = DecisionTreeClassifier()
accuracy,precision,recall = crossValidationScores(dTree,X,y)
allScores = pd.DataFrame(data = {'Accuracy': accuracy,'Precision': precision,'Recall': recall},
                         index=['Decision Tree']) #to compare with other models

Accuracy: 0.760595477387
Precision: 0.790857828716
Recall: 0.788514107082
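
The helper above calls `cross_val_score` once per metric, which refits the model three times per fold. A possible alternative, sketched here on synthetic data rather than the crime data, is `cross_validate` with a list of scorers. The names `X_demo`, `y_demo` and `mean_scores` are illustrative, and `shuffle=True` is an assumption added so that `random_state` actually affects the splits.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (X, y); not the crime data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# shuffle=True is what makes random_state meaningful for KFold
kf = KFold(n_splits=10, shuffle=True, random_state=1914)

# One pass over the folds, scoring all three metrics on each fitted model
results = cross_validate(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    cv=kf, scoring=['accuracy', 'precision', 'recall'],
)
mean_scores = {name: results['test_' + name].mean()
               for name in ['accuracy', 'precision', 'recall']}
print(mean_scores)
```

Seeding the tree as well (`random_state=0` here) is what makes the scores repeatable from run to run.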


**ii.	Why are they different from the results in the previous test?**

The large gap between the perfect training scores and the cross-validation scores (accuracy 1.0 vs. 0.76) suggests that the unrestricted tree is overfitting the training data.
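
One way to see this effect in isolation is to compare training and cross-validation accuracy for a fully grown tree and a depth-limited one. The sketch below uses synthetic, deliberately noisy data (not the crime data), and the depth of 3 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (flip_y adds label noise); not the crime data
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     flip_y=0.1, random_state=0)

results = {}
for depth in [None, 3]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
    cv_acc = cross_val_score(tree, X_demo, y_demo, cv=10).mean()
    results[depth] = (train_acc, cv_acc)
    print(depth, round(train_acc, 3), round(cv_acc, 3))
```

The fully grown tree memorizes the training set, so its training accuracy says little about out-of-sample performance. Limiting the depth usually narrows the gap between the two numbers.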

## Linear Classification

#### a.	Use GaussianNB to learn a Naive Bayes classifier to predict highCrime.

In [20]:
from sklearn.naive_bayes import GaussianNB

**i.	What is the 10-fold cross-validation accuracy, precision, and recall for this method?**

In [21]:
gNB = GaussianNB()
accuracy,precision,recall = crossValidationScores(gNB,X,y)
gNBscores = pd.DataFrame(data = {'Accuracy': accuracy,'Precision': precision,'Recall': recall},
                         index=['Gaussian NB']) #to compare with other models
allScores = allScores.append(gNBscores)

Accuracy: 0.768153266332
Precision: 0.901253562112
Recall: 0.684065991577


**ii.	What are the 10 most predictive features? Why do these make sense (or not)?**

In [22]:
gNB.fit(X,y)

GaussianNB(priors=None)

We use the normalized absolute difference of means (NADM) between the two classes as a measure of how predictive each feature is.

In [23]:
theta_F, theta_T = gNB.theta_ #per-class feature means; rows follow gNB.classes_ = [0, 1]
sigma_F, sigma_T = gNB.sigma_ #per-class feature variances
NADMs = abs(theta_T-theta_F)/(sigma_T+sigma_F) #normalized absolute difference of means 
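
To make the measure concrete, the sketch below reproduces the same expression with NumPy on a tiny invented data set. In the scikit-learn version used here, `sigma_` holds per-class variances (newer releases expose the same quantity as `var_`), so the denominator is a sum of variances rather than of standard deviations. The names `X_low` and `X_high` are illustrative.

```python
import numpy as np

# Two toy features for each class (invented values)
X_low = np.array([[1.0, 1.0], [2.0, 5.0], [3.0, 9.0]])   # class 0 rows
X_high = np.array([[5.0, 2.0], [6.0, 5.0], [7.0, 8.0]])  # class 1 rows

mean_low, mean_high = X_low.mean(axis=0), X_high.mean(axis=0)
var_low, var_high = X_low.var(axis=0), X_high.var(axis=0)  # population variance

# Same expression as the cell above: |difference of means| / (sum of variances)
nadm = np.abs(mean_high - mean_low) / (var_high + var_low)
print(nadm)
```

The first feature separates the classes well and gets a large score. The second has identical class means and scores zero.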

In [24]:
f_importance = pd.DataFrame(list(zip(X.columns,NADMs)))
f_importance.columns = ['feature','NADM']
f_importance.sort_values(by=['NADM'],ascending=False).head(10)

| | feature | NADM |
|---|---|---|
| 45 | PctKids2Par | 5.001257 |
| 41 | FemalePctDiv | 4.756826 |
| 44 | PctFam2Par | 4.59394 |
| 16 | pctWInvInc | 4.394939 |
| 42 | TotalPctDiv | 4.385936 |
| 47 | PctTeen2Par | 3.999311 |
| 39 | MalePctDivorce | 3.950688 |
| 46 | PctYoungKids2Par | 3.64827 |
| 51 | PctIlleg | 3.469861 |
| 4 | racePctWhite | 3.443203 |


The most predictive variables are:
1. Percentage of kids in family housing with two parents. 
2. Percentage of females who are divorced.
3. Percentage of families (with kids) that are headed by two parents.
4. Percentage of households with investment / rent income in 1989.
5. Percentage of population who are divorced.
6. Percent of kids age 12-17 in two parent households.
7. Percentage of males who are divorced.
8. Percent of kids 4 and under in two parent households.
9. Percentage of kids born to never-married parents.
10. Percentage of population that is Caucasian.

Most of the predictive variables seem to be related to kids and their parents' marital status. Two possible hypotheses are that kids with divorced parents tend to become involved in criminal activity more often, or that families with kids are more likely to report crimes near their household.

**iii.	How do these results compare with your results from decision trees, above?**

In [25]:
allScores

| | Accuracy | Precision | Recall |
|---|---|---|---|
| Decision Tree | 0.760595 | 0.790858 | 0.788514 |
| Gaussian NB | 0.768153 | 0.901254 | 0.684066 |


Compared with the decision tree, accuracy improves slightly (0.768 vs. 0.761) and precision improves substantially (0.901 vs. 0.791), while recall drops (0.684 vs. 0.789).
Regarding the importance of the features, in both models "Percentage of kids in family housing with two parents" turns out to be the most important variable. "racePctWhite" appears in both models as well.

#### b.	Use LinearSVC to learn a linear Support Vector Machine model to predict highCrime.

In [26]:
from sklearn.svm import LinearSVC

**i.	What is the 10-fold cross-validation accuracy, precision, and recall for this method?**

In [27]:
lSVC = LinearSVC()
accuracy,precision,recall = crossValidationScores(lSVC,X,y)
lSVCscores = pd.DataFrame(data = {'Accuracy': accuracy,'Precision': precision,'Recall': recall},
                         index=['Linear SVC']) #to compare with other models
allScores = allScores.append(lSVCscores)

Accuracy: 0.77156281407
Precision: 0.847807807295
Recall: 0.859086278618


**ii.	What are the 10 most predictive features?**

In [28]:
lSVC.fit(X,y)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [29]:
# assess feature importance
f_importance = pd.DataFrame(list(zip(X.columns,lSVC.coef_[0]))) #lSVC.coef_ returns a 2-D array with one row of coefficients
f_importance.columns = ['feature','coefficient']
f_importance.sort_values(by=['coefficient'],ascending=False).head(10)

| | feature | coefficient |
|---|---|---|
| 65 | PersPerOccupHous | 0.93984 |
| 85 | RentHighQ | 0.90177 |
| 39 | MalePctDivorce | 0.873811 |
| 6 | racePctHisp | 0.842854 |
| 72 | HousVacant | 0.82644 |
| 7 | agePct12t21 | 0.820013 |
| 69 | PctPersDenseHous | 0.789112 |
| 38 | PctOccupMgmtProf | 0.74534 |
| 28 | NumUnderPov | 0.738221 |
| 3 | racepctblack | 0.654462 |


The features with the largest positive coefficients are:
1. Mean persons per household.
2. Rental housing - upper quartile rent.
3. Percentage of males who are divorced.
4. Percentage of population that is of Hispanic heritage.
5. Number of vacant households.
6. Percentage of population that is 12-21 in age.
7. Percent of persons in dense housing.
8. Percentage of people 16 and over who are employed in management or professional occupations.
9. Number of people under the poverty level.
10. Percentage of population that is African American.

As in the other results, features related to income level, sex and marital status seem to be among the best predictors. Note that the table is sorted by signed coefficient, so it lists only the features that push the prediction towards high crime.
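
Because the sign of a linear coefficient only indicates the direction of the effect, an alternative is to rank features by the magnitude of their coefficients, so that strongly negative features also surface. A minimal sketch with invented coefficients (the feature names are placeholders):

```python
import pandas as pd

# Invented coefficients, only to illustrate the ranking
coefs = pd.Series({'featA': 0.9, 'featB': -1.4, 'featC': 0.1, 'featD': -0.3})

# Order features by magnitude, keeping the signed value for interpretation
order = coefs.abs().sort_values(ascending=False).index
ranked = coefs.reindex(order)
print(ranked)
```

Magnitudes are only comparable across features when the inputs are on similar scales.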

**iii.	How do these results compare with your results from decision trees, above?**

In [30]:
allScores

| | Accuracy | Precision | Recall |
|---|---|---|---|
| Decision Tree | 0.760595 | 0.790858 | 0.788514 |
| Gaussian NB | 0.768153 | 0.901254 | 0.684066 |
| Linear SVC | 0.771563 | 0.847808 | 0.859086 |


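Unlike the Naive Bayes classifier, the linear SVM improves on the decision tree in all three metrics: accuracy (0.772 vs. 0.761), precision (0.848 vs. 0.791) and recall (0.859 vs. 0.789). Compared with Naive Bayes, precision improves less (0.848 vs. 0.901), but recall is much higher (0.859 vs. 0.684).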

## Regression 

As before, we need to remove the non-predictive attributes (communityname, fold) and the target (ViolentCrimesPerPop). We also continue to remove highCrime, since we derived it from ViolentCrimesPerPop.

In [31]:
#Create X,y variables
X = pd.DataFrame.drop(data,['communityname','fold','ViolentCrimesPerPop','highCrime'],axis=1) #predictors
y = data.ViolentCrimesPerPop #target

#### a.	Use LinearRegression to learn a linear model directly predicting the crime rate per capita (ViolentCrimesPerPop).

In [32]:
from sklearn.linear_model import LinearRegression

**i.	Using 10-fold cross-validation, what is the estimated mean-squared-error (MSE) of the model**

In [34]:
#helper function
def crossValidationMSE(model, X, y, cv = 10, printResults=True):
    '''Returns cross validation MSE for model'''
    kf = KFold(n_splits=cv) #no shuffling, so the folds are deterministic (random_state only matters with shuffle=True)
    
    MSE = -cross_val_score(model,X,y,cv=kf,scoring='neg_mean_squared_error').mean()
    if printResults:
        print('cross-validation MSE: ' + str(MSE))
    return MSE

In [35]:
linReg = LinearRegression()
cvMSE = crossValidationMSE(linReg,X,y)

cross-validation MSE: 0.0199142186962
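
The minus sign in the helper is needed because scikit-learn scorers follow a "higher is better" convention, so `neg_mean_squared_error` returns the negated MSE for each fold. A short sketch on synthetic regression data (not the crime data) illustrates this; `X_demo` and `y_demo` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data; not the crime data
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# One (non-positive) score per fold
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=10, scoring='neg_mean_squared_error')
cv_mse = -neg_mse.mean()  # flip the sign to get an ordinary, non-negative MSE
print(cv_mse)
```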


**ii.	What is the MSE on the training set (train on all the data then test on it all)?**

In [36]:
#helper function
def trainMSE(yTrue,yPred, printResults=True):
    '''Returns model's MSE on train data'''
    MSE = metrics.mean_squared_error(yTrue,yPred)
    if printResults:
        print('train MSE: '  + str(MSE))
    return MSE

In [37]:
linReg.fit(X,y)
yPred = linReg.predict(X)
tMSE = trainMSE(y,yPred)

train MSE: 0.0164410183897


**iii.	What features are most predictive of a high crime rate? A low crime rate?**

In [38]:
coefficients = pd.DataFrame(list(zip(X.columns,linReg.coef_))) #linReg.coef_ is a 1-D array with one coefficient per feature
coefficients.columns = ['feature','coefficient']
sorted_coefficients = coefficients.sort_values(by=['coefficient'],ascending=False)
sorted_coefficients.head()

| | feature | coefficient |
|---|---|---|
| 74 | PctHousOwnOcc | 0.573658 |
| 65 | PersPerOccupHous | 0.57086 |
| 39 | MalePctDivorce | 0.461738 |
| 59 | PctRecImmig8 | 0.415278 |
| 86 | MedRent | 0.34763 |


In [39]:
sorted_coefficients.tail()

| | feature | coefficient |
|---|---|---|
| 45 | PctKids2Par | -0.282629 |
| 22 | whitePerCap | -0.352592 |
| 80 | OwnOccLowQuart | -0.381967 |
| 42 | TotalPctDiv | -0.552048 |
| 68 | PctPersOwnOccup | -0.702292 |


Most Positive Coefficients:
1. Percent of households owner occupied.
2. Mean persons per household.
3. Percentage of males who are divorced.
4. Percent of _population_ who have immigrated within the last 8 years.
5. Median gross rent.

Most Negative Coefficients (from most to least negative):
1. Percent of people in owner occupied households.
2. Percentage of population who are divorced.
3. Owner occupied housing - lower quartile value.
4. Per capita income for caucasians.
5. Percentage of kids in family housing with two parents.

#### b.	Now use Ridge regression to reduce the amount of overfitting, pick the best alpha from among (10, 1, 0.1, 0.01, and 0.001).

In [40]:
from sklearn.linear_model import RidgeCV
rReg = RidgeCV(alphas=(10, 1, 0.1, 0.01, 0.001))

**i.	What is the estimated MSE of the model under 10-fold CV?**

In [41]:
MSE = crossValidationMSE(rReg,X,y)

cross-validation MSE: 0.0195627392747


**ii.	What is the MSE on the training set (train on all the data then test on it all)?**

In [42]:
rReg.fit(X,y)
yPred = rReg.predict(X)
MSE = trainMSE(y,yPred)

train MSE: 0.0166785109027


**iii.	What is the best alpha?**

In [43]:
rReg.alpha_

1.0

**iv.	What does this say about the amount of overfitting in linear regression for this problem?**

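Ridge regression barely changes the results: the cross-validation MSE drops only from 0.0199 to 0.0196, and the training MSE rises only from 0.0164 to 0.0167. This suggests there was not much overfitting in the original linear regression model.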
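
For reference, ridge regression adds an L2 penalty on the coefficients to the least-squares objective, with alpha controlling its strength. In the usual notation (the symbols are ours, with $w$ the coefficient vector and any intercept left out of the penalty):

$$
\hat{w} = \arg\min_{w} \; \lVert Xw - y \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
$$

With $\alpha = 0$ this reduces to ordinary least squares, and larger values of $\alpha$ shrink the coefficients towards zero. A small gain from a moderate penalty such as $\alpha = 1$ is consistent with the unpenalized model not being badly overfit.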

#### c.	Now use polynomial features to do quadratic (second-order) polynomial regression

In [44]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(X)
linReg = LinearRegression()

**i.	What is the estimated MSE of the model under 10-fold CV?**

In [45]:
cvMSE = crossValidationMSE(linReg,X_,y)

cross-validation MSE: 0.132821401449


**ii.	What is the MSE on the training set (train on all the data then test on it all)?**

In [46]:
linReg.fit(X_,y)
yPred = linReg.predict(X_)
tMSE = trainMSE(y,yPred)

train MSE: 1.07706567726e-25


**iii.	Does this mean the quadratic model is better than the linear model for this problem?**

No. The train MSE is essentially 0, but the cross-validation MSE (0.133) is far worse than that of the linear model (0.020). The quadratic model is therefore severely overfitting, and we expect it to perform worse out of sample.
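
A quick count suggests why the training error collapses to zero: a degree-2 expansion of 101 predictors produces far more columns than there are rows, so an unregularized linear model can fit the training data exactly. The sketch below counts the monomials of total degree at most 2, assuming the default `PolynomialFeatures` behaviour of including the bias column.

```python
from math import comb

n_samples = 1993   # rows in the clean data set
n_features = 101   # predictors in X
degree = 2

# Number of monomials of total degree <= 2 in 101 variables, bias term included
n_poly_features = comb(n_features + degree, degree)
print(n_poly_features, n_poly_features > n_samples)
```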

## Dirty Data
Load and check dirty data

In [47]:
#Load data
data = pd.read_csv('Data/communities-crime-full.csv',na_values='?')

In [48]:
data.shape

(1994, 128)

The data now has 1994 rows and 128 columns (the clean one had 1993 rows and 104 columns)

In [49]:
n_NaNs = data.apply(lambda x: sum(x.isnull()),axis=0) 
n_NaNs [n_NaNs != 0]

county                  1174
community               1177
OtherPerCap                1
LemasSwornFT            1675
LemasSwFTPerPop         1675
LemasSwFTFieldOps       1675
LemasSwFTFieldPerPop    1675
LemasTotalReq           1675
LemasTotReqPerPop       1675
PolicReqPerOffic        1675
PolicPerPop             1675
RacialMatchCommPol      1675
PctPolicWhite           1675
PctPolicBlack           1675
PctPolicHisp            1675
PctPolicAsian           1675
PctPolicMinor           1675
OfficAssgnDrugUnits     1675
NumKindsDrugsSeiz       1675
PolicAveOTWorked        1675
PolicCars               1675
PolicOperBudg           1675
LemasPctPolicOnPatr     1675
LemasGangUnitDeploy     1675
PolicBudgPerPop         1675
dtype: int64

It looks like all these columns (except OtherPerCap) were deleted in the clean data, as well as the row with the missing OtherPerCap value. Let's try mean value imputation.

In [50]:
missingFeatures_series = n_NaNs [n_NaNs != 0]
missingFeatures = list(missingFeatures_series.index)
for f in missingFeatures:
    data[f] = data[f].fillna(data[f].mean()) #assign back instead of in-place fillna on a column selection

In [51]:
#Check no NaNs remaining
data.isnull().any().sum() 

0
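
The same imputation can be written without an explicit loop, since `fillna` accepts a Series of per-column fill values. A minimal sketch on an invented frame:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (invented values)
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 4.0, 8.0]})

# fillna with a Series of column means fills each column with its own mean
imputed = toy.fillna(toy.mean())
print(imputed)
```

One caveat: computing the means on the full data set before cross-validation lets a little information from the held-out folds leak into training. The effect is probably small here, but it is worth keeping in mind when comparing scores.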

**Repeat the decision tree learning question for the full (non-clean) data set and present the results**

In [52]:
#Create highCrime field
data['highCrime'] = data['ViolentCrimesPerPop'].apply(lambda x: int(x>0.1))

#Create X,y variables
X = pd.DataFrame.drop(data,['communityname','fold','ViolentCrimesPerPop','highCrime'],axis=1) #predictors
y = data.highCrime #target

In [53]:
#Fit model and measure performance
dTree = DecisionTreeClassifier()

print('cross validation')
accuracy,precision,recall = crossValidationScores(dTree,X,y)

dTree.fit(X,y)
yPred = dTree.predict(X)

print('\nTrain')
trainAccuracy,trainPrecision,trainRecall = trainScores(y,yPred)

dirtyScores = pd.DataFrame(data = {'Accuracy': accuracy,'Precision': precision,'Recall': recall},
                         index=['Dirty Decision Tree']) #to compare with other models
allScores = allScores.append(dirtyScores)

cross validation
Accuracy: 0.757718592965
Precision: 0.820509609574
Recall: 0.796863978537

Train
Accuracy: 1.0
Precision: 1.0
Recall: 1.0


**a. Are the CV results better or worse? What does this say about the effect of missing values?**

In [54]:
allScores

| | Accuracy | Precision | Recall |
|---|---|---|---|
| Decision Tree | 0.760595 | 0.790858 | 0.788514 |
| Gaussian NB | 0.768153 | 0.901254 | 0.684066 |
| Linear SVC | 0.771563 | 0.847808 | 0.859086 |
| Dirty Decision Tree | 0.757719 | 0.82051 | 0.796864 |


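Even though mean value imputation is not the best possible method, the cross-validation results are broadly comparable to those on the clean data: precision (0.821 vs. 0.791) and recall (0.797 vs. 0.789) improve slightly, while accuracy is marginally lower (0.758 vs. 0.761). This suggests that some of the eliminated columns may contain useful information, although the differences are small.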