<h2>Building a multivariable regressor using crime data in New York State in 2013 (Source: FBI, UCR)
We try to predict the occurrence of murder in a city using other crime-related data</h2>

To reduce the risk of overfitting, we will apply the following regularization procedures to the linear regression:

- Ridge Regression: Ordinary Least Squares is modified to also minimize the sum of the squared coefficients (called L2 regularization).
- Lasso Regression: Ordinary Least Squares is modified to also minimize the sum of the absolute values of the coefficients (called L1 regularization).
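
As a rough illustration of the two penalties, the sketch below evaluates each objective by hand for a toy coefficient vector. The function names `ridge_objective` and `lasso_objective` and the toy numbers are made up for this example. scikit-learn scales the squared-error term differently in its own `Lasso` objective, so this is a conceptual sketch and not that library's exact formula.

```python
import numpy as np

def ridge_objective(X, y, coef, alpha):
    """Squared-error loss plus alpha times the sum of squared coefficients (L2)."""
    residuals = y - X @ coef
    return np.sum(residuals ** 2) + alpha * np.sum(coef ** 2)

def lasso_objective(X, y, coef, alpha):
    """Squared-error loss plus alpha times the sum of absolute coefficients (L1)."""
    residuals = y - X @ coef
    return np.sum(residuals ** 2) + alpha * np.sum(np.abs(coef))

# Toy data (hypothetical): two observations, two features.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([3.0, -2.0])
coef = np.array([3.0, -2.0])  # fits y exactly, so only the penalty remains

ridge_value = ridge_objective(X, y, coef, alpha=1.0)  # 3**2 + (-2)**2 = 13
lasso_value = lasso_objective(X, y, coef, alpha=1.0)  # |3| + |-2| = 5
print(ridge_value, lasso_value)
```

Because the L2 penalty squares each coefficient, it punishes large coefficients much harder than the L1 penalty does.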


In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import math
import seaborn as sns
import sklearn
from sklearn import linear_model
from sklearn import preprocessing
%matplotlib inline
sns.set_style('white')

In [2]:
# Grab and process the cleaned data
path1 = ("C:/Users/aath/Dropbox/MAEN/Thankful/Data/NY Crime/NY_Crime_2013_cleaned2.csv")

df = pd.read_csv(path1)

In [3]:
df.head()

Unnamed: 0,City,prop,pop,murd,rob,violent,assul,burg,theft,motor,murd_binary
0,Adams Village,11,1851,0,0,0,0,1,10,0,0
1,Addison Town and Village,49,2568,0,1,2,1,1,47,1,0
2,Afton Village4,1,820,0,0,0,0,0,1,0,0
3,Akron Village,17,2842,0,0,1,1,0,17,0,0
4,Albany4,3888,98595,8,237,802,503,683,3083,122,1


In [4]:
# Prepare the data
# Remove New York City: it is by far the largest city in NY State and distorts the visualizations
df = df[df.City != 'New York']

# Murder (murd) is our dependent variable; it is already represented in binary form
# by murd_binary, so the raw count is dropped.
# City is only a label, not a numeric feature, so it is dropped as well.
df = df.drop(columns=['murd', 'City'])

In [5]:
df.isnull().sum()

prop           0
pop            0
rob            0
violent        0
assul          0
burg           0
theft          0
motor          0
murd_binary    0
dtype: int64

In [6]:
import numpy as np
trainsize = int(df.shape[0] / 2)
chosen_idx1 = np.random.choice(df.shape[0], replace=False, size=trainsize)
chosen_idx2 = np.random.choice(df.shape[0], replace=False, size=trainsize)

In [7]:
# chosen_idx1 is a NumPy array, so it has no .index attribute.
# Take the complement: every row position that is not in chosen_idx1.
chosen_idx2 = np.setdiff1d(np.arange(df.shape[0]), chosen_idx1)
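
One way to obtain two index sets that do not overlap is to permute the row positions once and cut the permutation in half. A minimal sketch on a small stand-in frame (the frame `toy` and the seed are assumptions for illustration, not the crime data):

```python
import numpy as np
import pandas as pd

# Stand-in data frame; the real notebook would use `df` here.
toy = pd.DataFrame({"pop": np.arange(10), "murd_binary": [0, 1] * 5})

rng = np.random.default_rng(42)            # fixed seed so the split is repeatable
shuffled = rng.permutation(toy.shape[0])   # every row position exactly once
half = toy.shape[0] // 2

train_idx = shuffled[:half]
test_idx = shuffled[half:]

toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]
print(len(toy_train), len(toy_test))
```

Since each position appears exactly once in the permutation, the two halves cannot share a row.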

In [None]:
# Define the training and test sizes.
df = df.sample(frac=1) # Shuffle the rows; sample() returns a new frame, so assign it back
trainsize = int(df.shape[0] / 2)
df_test = df.iloc[trainsize:, :].copy()
df_train = df.iloc[:trainsize, :].copy()

# Set up the regression model to predict murder occurrence (murd_binary)
# using all other variables as features.
regr1 = linear_model.LinearRegression()
Y_train = df_train['murd_binary'].values.reshape(-1, 1)
X_train = df_train.loc[:, ~(df_train.columns).isin(['murd_binary'])]
regr1.fit(X_train, Y_train)
print('\nR-squared simple model:')
print(regr1.score(X_train, Y_train))

- On the training data, this model explains 37% of the variability of the response around its mean.
- R-squared = Explained variation / Total variation
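
To make the definition concrete, the sketch below computes R-squared by hand as one minus the ratio of unexplained to total variation and compares it with `score`. The data are synthetic and only serve the illustration; for least squares with an intercept, scored on its own training data, this form equals explained variation over total variation.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=50)

model = linear_model.LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = np.sum((y - pred) ** 2)       # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X, y))
```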

In [None]:
#Store the parameter estimates.
origparams = np.append(regr1.coef_, regr1.intercept_)
origparams

In [None]:
# Make new features to capture potential quadratic
# relationships between the features and the response.

df_train['prop2'] = (df_train['prop'] + 100) ** 2
df_train['pop2'] = (df_train['pop'] + 100) ** 2
df_train['rob2'] = (df_train['rob'] + 100) ** 2
df_train['violent2'] = (df_train['violent'] + 100) ** 2
df_train['assul2'] = (df_train['assul'] + 100) ** 2
df_train['burg2'] = (df_train['burg'] + 100) ** 2
df_train['theft2'] = (df_train['theft'] + 100) ** 2
df_train['mortor2'] = (df_train['motor'] + 100) ** 2
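
The same transform has to be applied to the test set later, so it may be convenient to wrap it in a helper. The function `add_squared_features` below is a hypothetical name, shown on a two-row stand-in frame; it names each new column after its source column with a `2` suffix.

```python
import pandas as pd

def add_squared_features(frame, columns, shift=100):
    """Return a copy of `frame` with a '<name>2' column for each listed column."""
    out = frame.copy()
    for col in columns:
        out[col + "2"] = (out[col] + shift) ** 2
    return out

# Stand-in rows taken from the head of the data shown earlier.
toy = pd.DataFrame({"prop": [11, 49], "pop": [1851, 2568]})
toy_sq = add_squared_features(toy, ["prop", "pop"])
print(toy_sq.columns.tolist())
```

Calling the helper once on the training frame and once on the test frame keeps the two sets of columns identical.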

In [None]:
# Re-run the model with the new features.
regrBig = linear_model.LinearRegression()
X_train2 = df_train.loc[:, ~(df_train.columns).isin(['murd_binary'])]
regrBig.fit(X_train2, Y_train)
print('\nR-squared complex model:')
print(regrBig.score(X_train2, Y_train))

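- R-squared on the training data increases when features are added. For ordinary least squares this is expected, since extra features can never lower the training R-squared.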

In [None]:
# Store the new parameter estimates for the same features.
newparams = np.append(
    regrBig.coef_[0,0:(len(origparams)-1)],
    regrBig.intercept_)

print('\nParameter Estimates for the same predictors for the small model '
      'and large model:')
compare = np.column_stack((origparams, newparams))
prettycompare = np.array2string(
    compare,
    formatter={'float_kind':'{0:.3f}'.format})
print(prettycompare)

- After adding the extra features, the magnitude and sign of some of the coefficients changed. This can be a sign of overfitting and of having too many correlated dimensions.
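
The squared features are built directly from the originals, so they can be very strongly correlated with them. A small check on synthetic counts (a stand-in, not the crime data) shows how close to 1 that correlation can get:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.integers(0, 50, size=200).astype(float)  # synthetic stand-in for a crime count
x_sq = (x + 100) ** 2                            # same transform as the new features

corr = np.corrcoef(x, x_sq)[0, 1]
print(round(corr, 3))
```

When two columns carry nearly the same information, least squares can trade weight between them freely, which is one reason coefficients may change sign or size.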

In [None]:
# Test the simpler model with smaller coefficients.
Y_test = df_test['murd_binary'].values.reshape(-1, 1)
X_test = df_test.loc[:, ~(df_test.columns).isin(['murd_binary'])]
print('\nR-squared simple model:')
print(regr1.score(X_test, Y_test))

# Test the more complex model with larger coefficients.

df_test['prop2'] = (df_test['prop'] + 100) ** 2
df_test['pop2'] = (df_test['pop'] + 100) ** 2
df_test['rob2'] = (df_test['rob'] + 100) ** 2
df_test['violent2'] = (df_test['violent'] + 100) ** 2
df_test['assul2'] = (df_test['assul'] + 100) ** 2
df_test['burg2'] = (df_test['burg'] + 100) ** 2
df_test['theft2'] = (df_test['theft'] + 100) ** 2
df_test['mortor2'] = (df_test['motor'] + 100) ** 2

# Score the complex model on the test set with the new features.
X_test2 = df_test.loc[:, ~(df_test.columns).isin(['murd_binary'])]
print('\nR-squared complex model:')
print(regrBig.score(X_test2, Y_test))

- Testing the model on the remaining data gives a very poor result. A negative R-squared indicates that the model fits the test data worse than a horizontal line at the mean of the response.
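
A tiny hand-made example shows how R-squared becomes negative. The numbers are invented for illustration: predicting the mean gives exactly zero, and predictions that point the wrong way give a negative value.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])

mean_pred = np.full_like(y_true, y_true.mean())  # the "horizontal line"
bad_pred = np.array([3.0, 2.0, 1.0])             # predictions pointing the wrong way

r2_mean = r2_score(y_true, mean_pred)  # 1 - 2/2 = 0
r2_bad = r2_score(y_true, bad_pred)    # 1 - 8/2 = -3
print(r2_mean, r2_bad)
```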

# Next, we try to improve the model by using the Ridge procedure.

In [None]:
# To address the overfitting problem we can relax the requirement of an unbiased parameter estimator
# and instead use a biased estimator, which may have smaller variance.
# One such biased estimator is ridge regression.
# Below we fit a ridge regression model where alpha is the regularization parameter (usually called lambda).
# As alpha gets larger, parameter shrinkage grows more pronounced. By convention the intercept is not
# regularized; here fit_intercept=False, so no intercept is estimated at all.

ridgeregr = linear_model.Ridge(alpha=20, fit_intercept=False) 
ridgeregr.fit(X_train, Y_train)
print(ridgeregr.score(X_train, Y_train))
origparams = ridgeregr.coef_[0]
print(origparams)

ridgeregrBig = linear_model.Ridge(alpha=20, fit_intercept=False)
ridgeregrBig.fit(X_train2, Y_train)
print(ridgeregrBig.score(X_train2, Y_train))
newparams = ridgeregrBig.coef_[0, 0:len(origparams)]

print('\nParameter Estimates for the same predictors for the small model'
      ' and large model:')
compare = np.column_stack((origparams, newparams))
prettycompare = np.array2string(
    compare,
    formatter={'float_kind':'{0:.3f}'.format})
print(prettycompare)

In [None]:
print('\nR-squared simple model:')
print(ridgeregr.score(X_test, Y_test))
print('\nR-squared complex model:')
print(ridgeregrBig.score(X_test2, Y_test))

- Ridge regression with regularization parameter λ = 20 makes the model worse.
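
A fixed λ = 20 is only one point on the regularization path, and the penalty is sensitive to feature scale (population values are orders of magnitude larger than most of the crime counts). One possible next step, sketched here on synthetic data and not on the crime set, is to standardize the features and let cross-validation choose alpha. The alpha grid, the seed, and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Four synthetic features on very different scales.
X = rng.normal(size=(120, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = 2.0 * X[:, 0] + 0.01 * X[:, 3] + rng.normal(size=120)

alphas = np.logspace(-3, 3, 13)  # candidate regularization strengths
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
ridge_cv.fit(X, y)

best_alpha = ridge_cv.named_steps["ridgecv"].alpha_
print(best_alpha, ridge_cv.score(X, y))
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so nothing from held-out data leaks into the scaling.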

# Next, we try to improve the model by using the Lasso procedure.

Next we use Lasso Regression, a close cousin of Ridge Regression in which the penalty is the sum of the absolute values of the coefficients rather than the sum of their squares. This penalty can shrink some coefficients to exactly zero, which helps to eliminate insignificant variables. Lasso regression is generally used when there is a very large number of variables, since it performs variable selection automatically.
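
The selection effect can be seen on synthetic data where only two of six features matter. This is a sketch under made-up assumptions (data, seed, and `alpha=0.5` are all chosen for illustration): Lasso sets the irrelevant coefficients to exactly zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
# Only the first two columns influence y; the other four are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso:", np.round(lasso.coef_, 3))
print("Ridge:", np.round(ridge.coef_, 3))
```

The surviving Lasso coefficients are also pulled toward zero, which is the bias paid for the reduced variance.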



In [None]:
# Small number of parameters.
lass = linear_model.Lasso(alpha=.35)
lassfit = lass.fit(X_train, Y_train)
print('R² for the model with few features:')
print(lass.score(X_train, Y_train))
origparams = np.append(lassfit.coef_, lassfit.intercept_)
print('\nParameter estimates for the model with few features:')
print(origparams)

# Large number of parameters.
lassBig = linear_model.Lasso(alpha=.35)
lassBig.fit(X_train2, Y_train)
print('\nR² for the model with many features:')
print(lassBig.score(X_train2, Y_train))
# Use a separate name so the few-feature estimates are not overwritten.
newparams = np.append(lassBig.coef_, lassBig.intercept_)
print('\nParameter estimates for the model with many features:')
print(newparams)

In [None]:
# Checking predictive power using the test set:
print('\nR-squared simple model:')
print(lass.score(X_test, Y_test))
print('\nR-squared complex model:')
print(lassBig.score(X_test2, Y_test))

- With the Lasso method the model improves considerably compared to the previous models. However, the R-squared is still negative, so this model, with its constraints, fits the data poorly.
- We can also see that the Lasso method performs variable selection. Both Lasso and Ridge manage the bias-variance trade-off through the choice of lambda. On this dataset the Lasso performed better, apparently because many of the predictors were not actually tied to the response variable. To test this, one should use cross-validation and also try stepwise regression.
- Here we started with all variables and tried to reduce or shrink them. Another approach is forward stepwise selection: start with an empty model and, at each step, add the variable that has the most statistically significant association with the dependent variable. Using cross-validation and the pipeline module, we should be able to select the best variables in both backward and forward stepwise fashion.
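
A possible starting point for that follow-up is sketched below on synthetic data. It uses scikit-learn's `SequentialFeatureSelector`, which adds features by cross-validated score rather than by p-values, so it is a stand-in for the significance-based procedure described above and not the same method. The data, the seed, and the choice of two features are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 5))
# Only columns 0 and 2 drive the response in this synthetic example.
y = 4.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(scale=0.5, size=150)

# Forward selection: start empty and add the feature that helps the CV score most.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
pipeline = make_pipeline(StandardScaler(), selector, LinearRegression())

cv_scores = cross_val_score(pipeline, X, y, cv=5)  # R-squared per fold
pipeline.fit(X, y)
selected = pipeline.named_steps["sequentialfeatureselector"].get_support()
print(selected, cv_scores.mean())
```

Setting `direction="backward"` runs the elimination variant that starts from all features.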