__This is an example I developed that spans a few projects involving regression models. The data is a record of criminal offenses known to law enforcement, retrieved from the FBI.__

-  We will start by importing and cleaning the initial data that we will train our linear regression model on.
-  We will then create our features and our outcome in order to predict property crime.
-  After our dataframes are created, we will fit the data to the model and look at the coefficients, the intercept, and the R² score on the training data. We will then look at the R² score on the test data to see if there is any overfitting.
-  If the initial model scores at least 85% (an R² of 0.85) on the test set, we will move on to testing our model on a new data set.
-  If the model also scores at least 85% on the test set of the new data, we will consider this experiment a success.

In [1]:
%matplotlib inline
import math

import numpy as np
import pandas as pd
import scipy
import sklearn
from sklearn import linear_model
from sklearn.model_selection import train_test_split

import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std

import matplotlib.pyplot as plt
import seaborn as sns

We will start by importing the data, renaming the columns, dropping unnecessary rows, and cleaning the data as needed. We will also create categorical features from the murder and robbery counts, where values greater than 0 are coded 1 and values equal to 0 are coded 0.
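The coding rule described above can be sketched on a toy column. This is a minimal illustration, not the notebook's exact code: the `counts` series is made-up data, and the comparison-based form is an alternative to the `np.where` calls used in the cells below (the two agree for non-negative whole-number counts).

```python
import numpy as np
import pandas as pd

# Toy counts standing in for a cleaned numeric column such as Robbery (made-up values).
counts = pd.Series([0.0, 3.0, np.nan, 12.0], name="Robbery")

# 1 where the count is positive, 0 where it is zero; missing values stay missing.
indicator = (counts > 0).astype(float).where(counts.notna())

print(indicator.tolist())
```

Keeping missing values as NaN, rather than silently coding them 0, leaves the decision about incomplete rows to the later `dropna` step.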

In [2]:
data = pd.read_csv('https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/master/New_York_offenses/NEW_YORK-Offenses_Known_to_Law_Enforcement_by_City_2013%20-%2013tbl8ny.csv')
# The header information is located at row 3.
data.columns = data.iloc[3]
# Dropping the title and header rows (0 through 3)
data = data.reindex(data.index.drop(range(0,4)))
data = data.rename(columns={ data.columns[2]: "Violent_crime" })
data = data.rename(columns={ data.columns[3]: "Murder_total" })
data = data.rename(columns={ data.columns[7]: "Aggravated_assault" })

In [3]:
# Cleaning and converting data
data['Violent_crime'] = data['Violent_crime'].str.replace(',','').astype(float)
data['Robbery'] = data['Robbery'].str.replace(',','').astype(float)
data['Robbery_convert'] = np.where(data['Robbery'] >= 1, int(1), data['Robbery'])
data['Population'] = data['Population'].str.replace(',','').astype(float)
data['Murder_total'] = data['Murder_total'].str.replace(',','').astype(float)
data['Property\ncrime'] = data['Property\ncrime'].str.replace(',','').astype(float)
data['Aggravated_assault'] = data['Aggravated_assault'].str.replace(',','').astype(float)

# Creating the columns murder and population squared.
data['Murder'] = np.where(data['Murder_total'] >= 1, int(1), data['Murder_total'])
data['Population_sq'] = np.square(data['Population'])


In [4]:
data.head(5)

3,City,Population,Violent_crime,Murder_total,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated_assault,Property crime,Burglary,Larceny- theft,Motor vehicle theft,Arson3,Robbery_convert,Murder,Population_sq
4,Adams Village,1861.0,0.0,0.0,,0,0.0,0.0,12.0,2,10,0,0.0,0.0,0.0,3463321.0
5,Addison Town and Village,2577.0,3.0,0.0,,0,0.0,3.0,24.0,3,20,1,0.0,0.0,0.0,6640929.0
6,Akron Village,2846.0,3.0,0.0,,0,0.0,3.0,16.0,1,15,0,0.0,0.0,0.0,8099716.0
7,Albany,97956.0,791.0,8.0,,30,227.0,526.0,4090.0,705,3243,142,,1.0,1.0,9595378000.0
8,Albion Village,6388.0,23.0,0.0,,3,4.0,16.0,223.0,53,165,5,,1.0,0.0,40806540.0


We will create a dataframe of our independent variables and a target holding our dependent variable. We will then fit the model and look at the coefficients, the intercept, and the R² scores on the full sample and on a 20% holdout.

In [5]:
# Creating our independent variables dataframe and our target or dependent variable dataframe.
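# Dropping incomplete rows jointly so that the features and the target stay row-aligned.
features = ['Population', 'Population_sq', 'Murder', 'Robbery_convert']
complete = data[features + ['Property\ncrime']].dropna()
ind_var = complete[features]
target = complete['Property\ncrime']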
x = ind_var
y = target.values.reshape(-1, 1)

# Fitting the data to the model and displaying various statistical information.
lm = linear_model.LinearRegression()
lm_split = linear_model.LinearRegression()
model = lm.fit(x,y)
print('\nCoefficients: \n', lm.coef_)
print('\nIntercept: \n', lm.intercept_)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=.20, random_state=15)
print('Testing on Sample: ' + str(lm.fit(x, y).score(x, y)))
print('With 20% Holdout: ' + str(lm_split.fit(X_train, y_train).score(X_test, y_test)))




Coefficients: 
 [[ 3.46570268e-02 -2.11108019e-09  1.51866535e+01 -9.62774363e+01]]

Intercept: 
 [-109.57533562]
Testing on Sample: 0.9961247104988709
With 20% Holdout: 0.7530732214289351


The model is not doing well: the R² is 0.996 on the full sample but only 0.75 on the 20% holdout, which points to overfitting. Let's take a look at the correlation matrix and experiment with the features.
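A single 20% split gives one estimate that depends on which rows happen to land in the test set. One way to get a steadier picture is k-fold cross-validation. The sketch below uses synthetic data (the arrays `X` and `y` are assumptions, not the crime data) and scikit-learn's `cross_val_score` with the same R² metric that `LinearRegression.score` reports.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the notebook's x and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Five shuffled folds; each fold is held out once while the rest train the model.
cv = KFold(n_splits=5, shuffle=True, random_state=15)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("Mean holdout R^2:", scores.mean())
print("Spread across folds:", scores.std())
```

A large spread across folds would suggest that a single holdout score, such as the 0.75 above, should be read with some caution.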

In [6]:
correlation_matrix = x.corr()
correlation_matrix

3,Population,Population_sq,Murder,Robbery_convert
Population,1.0,0.998264,0.162309,0.064371
Population_sq,0.998264,1.0,0.133067,0.043983
Murder,0.162309,0.133067,1.0,0.313271
Robbery_convert,0.064371,0.043983,0.313271,1.0
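
The 0.998 correlation between Population and Population_sq stands out in the matrix above. As a minimal sketch, assuming a toy dataframe with made-up values, pairs above a chosen threshold can be listed automatically. The 0.9 cutoff is an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd

# Toy features: a population column, its square, and an unrelated 0/1 indicator.
rng = np.random.default_rng(0)
pop = rng.uniform(1_000, 100_000, size=100)
toy = pd.DataFrame({
    "Population": pop,
    "Population_sq": pop ** 2,
    "Murder": rng.integers(0, 2, size=100),
})

corr = toy.corr().abs()
# Keep only the upper triangle (above the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs > 0.9]

print(high)
```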


In [7]:
# Creating our independent variables dataframe and our target or dependent variable dataframe.
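# Dropping incomplete rows jointly so that the features and the target stay row-aligned.
features = ['Population', 'Population_sq', 'Murder', 'Robbery_convert', 'Aggravated_assault', 'Violent_crime']
complete = data[features + ['Property\ncrime']].dropna()
ind_var = complete[features]
target = complete['Property\ncrime']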
x = ind_var
y = target.values.reshape(-1, 1)

# Fitting the data to the model and displaying various statistical information.
lm = linear_model.LinearRegression()
lm_split = linear_model.LinearRegression()
model = lm.fit(x,y)
print('\nCoefficients: \n', lm.coef_)
print('\nIntercept: \n', lm.intercept_)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=.20, random_state=15)
print('Testing on Sample: ' + str(lm.fit(x, y).score(x, y)))
print('With 20% Holdout: ' + str(lm_split.fit(X_train, y_train).score(X_test, y_test)))



Coefficients: 
 [[ 1.19428225e-02 -1.49788925e-09  8.48739728e+00  9.98337267e+01
  -2.74758719e+00  4.47609483e+00]]

Intercept: 
 [-10.98740258]
Testing on Sample: 0.998716677556776
With 20% Holdout: 0.9010740839156719


I originally ran this model without the "Aggravated_assault" and "Violent_crime" columns, but it was overfitting and giving poor results on the holdout. After adding those two columns, the model works well and explains about 90% of the variance in the holdout data. I also tried removing individual features to see the effect each one had on the model, and it appears that Population and Population_sq have the biggest influence.
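
The feature-removal experiment described above can be sketched as a drop-one-column loop. This is an illustration on synthetic data, not the runs behind the numbers reported here: the column names `strong`, `weak`, and `noise` and the helper `holdout_r2` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: one influential feature, one minor feature, one irrelevant feature.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = 5.0 * X["strong"] + 0.5 * X["weak"] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=15
)

def holdout_r2(columns):
    """Fit on the training rows using only `columns` and score on the test rows."""
    model = LinearRegression().fit(X_train[columns], y_train)
    return model.score(X_test[columns], y_test)

baseline = holdout_r2(list(X.columns))
# How much the holdout R^2 falls when each column is left out in turn.
drops = {
    col: baseline - holdout_r2([c for c in X.columns if c != col])
    for col in X.columns
}

print("Baseline holdout R^2:", baseline)
print("Drop in R^2 per removed feature:", drops)
```

With strongly correlated features, such as a variable and its square, removing one of them alone may change the score very little because the other carries almost the same information, so the drops are best read together.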

We will now import a new year of data and create the same columns to see whether the model works well with other data sets.
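
Because both years need the same derived columns, the preparation can be wrapped in one function and applied to each dataframe. The sketch below is one possible arrangement, not the code used in the cells that follow: `add_features` and `NUMERIC_COLUMNS` are hypothetical names, and the two small frames are made-up rows.

```python
import pandas as pd

NUMERIC_COLUMNS = ["Population", "Murder_total", "Robbery"]

def add_features(df):
    """Return a copy of df with the derived model columns added."""
    out = df.copy()
    for col in NUMERIC_COLUMNS:
        # Remove thousands separators; values that cannot be parsed become NaN.
        as_text = out[col].astype(str).str.replace(",", "", regex=False)
        out[col] = pd.to_numeric(as_text, errors="coerce")
    # Indicators: 1 where the count is positive, 0 where it is zero, NaN kept.
    for source, name in [("Murder_total", "Murder"), ("Robbery", "Robbery_convert")]:
        out[name] = (out[source] > 0).astype(float).where(out[source].notna())
    out["Population_sq"] = out["Population"] ** 2
    return out

# Toy rows: one frame with comma-formatted strings, one with plain integers.
year_a = pd.DataFrame(
    {"Population": ["1,861", "97,956"], "Murder_total": ["0", "8"], "Robbery": ["0", "227"]}
)
year_b = pd.DataFrame(
    {"Population": [1851, 98595], "Murder_total": [0, 8], "Robbery": [0, 237]}
)

prepared_a = add_features(year_a)
prepared_b = add_features(year_b)

print(prepared_a)
print(prepared_b)
```

Sharing one function makes it harder for the two years to drift apart in how the features are built.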

In [8]:
data2 = pd.read_excel('nycrime2014.xls')
data2.head()

Unnamed: 0,Table 8,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12
0,NEW YORK,,,,,,,,,,,,
1,Offenses Known to Law Enforcement,,,,,,,,,,,,
2,"by City, 2014",,,,,,,,,,,,
3,City,Population,Violent\ncrime,Murder and\nnonnegligent\nmanslaughter,Rape\n(revised\ndefinition)1,Rape\n(legacy\ndefinition)2,Robbery,Aggravated\nassault,Property\ncrime,Burglary,Larceny-\ntheft,Motor\nvehicle\ntheft,Arson3
4,Adams Village,1851,0,0,,0,0,0,11,1,10,0,0


In [9]:
data2.columns = data2.iloc[3]
# Dropping the title and header rows (0 through 3)
data2 = data2.reindex(data2.index.drop(range(0,4)))
data2 = data2.rename(columns={ data2.columns[2]: "Violent_crime" })
data2 = data2.rename(columns={ data2.columns[3]: "Murder_total" })
data2 = data2.rename(columns={ data2.columns[7]: "Aggravated_assault" })
data2 = data2.rename(columns={ data2.columns[8]: "Property_crime" })

In [10]:
# Creating the columns Murder, Population_sq and Robbery_convert.
data2['Murder'] = np.where(data2['Murder_total'] >= 1, int(1), data2['Murder_total'])
data2['Population_sq'] = np.square(data2['Population'])
data2['Robbery_convert'] = np.where(data2['Robbery'] >= 1, int(1), data2['Robbery'])

In [11]:
data2.head()

3,City,Population,Violent_crime,Murder_total,Rape (revised definition)1,Rape (legacy definition)2,Robbery,Aggravated_assault,Property_crime,Burglary,Larceny- theft,Motor vehicle theft,Arson3,Murder,Population_sq,Robbery_convert
4,Adams Village,1851,0,0,,0.0,0,0,11,1,10,0,0,0,3426201,0
5,Addison Town and Village,2568,2,0,,0.0,1,1,49,1,47,1,0,0,6594624,1
6,Afton Village4,820,0,0,0.0,,0,0,1,0,1,0,0,0,672400,0
7,Akron Village,2842,1,0,,0.0,0,1,17,0,17,0,0,0,8076964,0
8,Albany4,98595,802,8,54.0,,237,503,3888,683,3083,122,12,1,9720974025,1


Below we will fit the model to the new data and compute the same scores to observe the outcome.

In [12]:
# Dropping incomplete rows jointly so that the features and the target stay row-aligned.
complete2 = data2[['Population', 'Population_sq', 'Murder', 'Robbery_convert', 'Aggravated_assault', 'Violent_crime', 'Property_crime']].dropna()
ind_var2 = complete2[['Population', 'Population_sq', 'Murder', 'Robbery_convert', 'Aggravated_assault', 'Violent_crime']]
target2 = complete2['Property_crime']
x2 = ind_var2
y2 = target2.values.reshape(-1, 1)

# Fitting the data to the model and displaying various statistical information.
lm = linear_model.LinearRegression()
lm_split = linear_model.LinearRegression()
model = lm.fit(x2,y2)
print('\nCoefficients: \n', lm.coef_)
print('\nIntercept: \n', lm.intercept_)
X_train2, X_test2, y_train2, y_test2 = train_test_split(x2, y2, test_size=.20, random_state=15)
print('Testing on Sample: ' + str(lm.fit(x2, y2).score(x2, y2)))
print('With 20% Holdout: ' + str(lm_split.fit(X_train2, y_train2).score(X_test2, y_test2)))


Coefficients: 
 [[ 9.74328079e-03 -1.05343995e-09  1.15239818e+02  8.34817699e+01
  -5.83262140e+00  6.17295827e+00]]

Intercept: 
 [-6.38942681]
Testing on Sample: 0.9986756341422635
With 20% Holdout: 0.8748441215595081


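The model seems to be working well. On the 2014 data the R² is 0.999 on the full sample and 0.87 on the 20% holdout, so the holdout score again meets our 85% threshold. The gap between the two scores is smaller than in the first attempt, which suggests that the model is not badly overfitting. The results have remained consistent across data sets, and this experiment has therefore been successful.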
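
A complementary check, not run in this notebook, is to score a model fitted on one year directly against another year without refitting. The sketch below shows the mechanics on synthetic data: `make_year`, the array names, and the coefficients are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
coef = np.array([3.0, -2.0])

def make_year(n):
    """Generate one synthetic 'year' of features and targets from a shared process."""
    X = rng.normal(size=(n, 2))
    y = X @ coef + rng.normal(scale=0.5, size=n)
    return X, y

X_2013, y_2013 = make_year(300)
X_2014, y_2014 = make_year(300)

# Fit once on the first year, then score on the second year without refitting.
model = LinearRegression().fit(X_2013, y_2013)
r2_new_year = model.score(X_2014, y_2014)

print("R^2 on the unseen year:", r2_new_year)
print("Meets the 85% threshold:", r2_new_year >= 0.85)
```

Applied to the real data, this would use the model fitted on the 2013 features and the 2014 dataframe built with the same column order.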