# Model Comparison: KNN vs. Linear Regression

Here let's work on regression. Find a data set and build a KNN Regression and an OLS regression. Compare the two. How similar are they? Do they miss in different ways?

Create a Jupyter notebook with your models. At the end in a markdown cell write a few paragraphs to describe the models' behaviors and why you favor one model or the other. Try to determine whether there is a situation where you would change your mind, or whether one is unambiguously better than the other. Lastly, try to note what it is about the data that causes the better model to outperform the weaker model. Submit a link to your notebook below.

## Linear regression

### Build the model

In [1]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
import sklearn
from sklearn import linear_model
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format
sns.set_style('white')
# Suppress annoying harmless error.
import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

In [2]:
# import data and show format
df_ny = pd.read_csv('table_8_offenses_known_to_law_enforcement_new_york_by_city_2013.csv')
# rename the columns
df_ny.rename(columns={'Table 8': 'City', 'Unnamed: 1': 'Population', 'Unnamed: 2': 'Violent_crime', 
                   'Unnamed: 3': 'Murder', 'Unnamed: 4': 'Rape (revised definition)', 
                   'Unnamed: 5': 'Rape', 'Unnamed: 6': 'Robbery', 
                   'Unnamed: 7': 'Aggravated_assault', 'Unnamed: 8': 'Property_crime', 
                   'Unnamed: 9': 'Burglary', 'Unnamed: 10': 'Larceny_theft', 'Unnamed: 11': 'Motor_vehicle_theft',
                   'Unnamed: 12': 'Arson'}, inplace=True)
# drop title rows from the csv file
df_ny=df_ny.drop(['Unnamed: 13', 'Rape (revised definition)'], axis=1)
df_ny=df_ny.drop([0, 1, 2, 3, 352, 353, 354], axis=0)
df_ny.head()

Unnamed: 0,City,Population,Violent_crime,Murder,Rape,Robbery,Aggravated_assault,Property_crime,Burglary,Larceny_theft,Motor_vehicle_theft,Arson
4,Adams Village,1861,0,0,0,0,0,12,2,10,0,0.0
5,Addison Town and Village,2577,3,0,0,0,3,24,3,20,1,0.0
6,Akron Village,2846,3,0,0,0,3,16,1,15,0,0.0
7,Albany,97956,791,8,30,227,526,4090,705,3243,142,
8,Albion Village,6388,23,0,3,4,16,223,53,165,5,
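As an alternative to cleaning each column after loading, the thousands separators can be handled while the file is read. The sketch below is only an illustration: it parses a small inline sample (two rows taken from the table above, with comma-formatted numbers assumed to match the raw file) instead of the real CSV. `thousands` is a standard `pd.read_csv` parameter; the variable names are hypothetical.

```python
import io

import pandas as pd

# Inline stand-in for the raw file: counts are quoted and comma-formatted
raw = (
    'City,Population,Property_crime\n'
    '"Albany","97,956","4,090"\n'
    '"Adams Village","1,861",12\n'
)

# thousands=',' tells the parser to treat commas inside numbers as separators
sample = pd.read_csv(io.StringIO(raw), thousands=',')

populations = sample['Population'].tolist()
property_crimes = sample['Property_crime'].tolist()
print(sample.dtypes)
```

With the real file, the same argument could be combined with `skiprows` to skip the title rows, provided the row layout is checked first.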


In [3]:
# add state column
df_ny['State']='NY'
# convert all numerical columns to integers (strip thousands separators first)
int_cols = ['Population', 'Violent_crime', 'Murder', 'Rape', 'Robbery',
            'Aggravated_assault', 'Property_crime', 'Burglary',
            'Larceny_theft', 'Motor_vehicle_theft']
for col in int_cols:
    # whole-column assignment avoids the chained-indexing SettingWithCopyWarning
    df_ny[col] = df_ny[col].str.replace(',', '', regex=False).astype('int64')
# values for Arson contain strings, integers, and floats. Convert all to floats
df_ny['Arson'] = df_ny['Arson'].astype(float)


In [4]:
# add a population^2 column
df_ny['Population^2'] = df_ny['Population']**2

# add binary indicator columns: 1 if the city reported any murder / robbery, else 0
df_ny['Murder Cat'] = (df_ny['Murder'] != 0).astype(int)
df_ny['Robbery Cat'] = (df_ny['Robbery'] != 0).astype(int)
df_ny.head()


Unnamed: 0,City,Population,Violent_crime,Murder,Rape,Robbery,Aggravated_assault,Property_crime,Burglary,Larceny_theft,Motor_vehicle_theft,Arson,State,Population^2,Murder Cat,Robbery Cat
4,Adams Village,1861,0,0,0,0,0,12,2,10,0,0.0,NY,3463321,0,0
5,Addison Town and Village,2577,3,0,0,0,3,24,3,20,1,0.0,NY,6640929,0,0
6,Akron Village,2846,3,0,0,0,3,16,1,15,0,0.0,NY,8099716,0,0
7,Albany,97956,791,8,30,227,526,4090,705,3243,142,,NY,9595377936,1,1
8,Albion Village,6388,23,0,3,4,16,223,53,165,5,,NY,40806544,0,1


New York City is an extreme outlier that skews the fit (as found in the [previous notebook](http://localhost:8888/notebooks/Desktop/Piles%20of%20Files/Challenge-%20Validating%20a%20linear%20regression(Stats%20model).ipynb)), so it is removed from the dataset.

In [5]:
# remove NYC from dataframe
df_wony=df_ny.drop(220)
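
To illustrate why a single very large city can skew an OLS fit, here is a minimal sketch on synthetic data (not the crime dataset). The towns, the rates, and the "megacity" are all made up for illustration. One point far from the rest has high leverage, so the fitted slope ends up close to whatever that point implies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical towns: population in thousands, about 20 crimes per thousand people
pop = rng.uniform(1, 250, size=50)
crime = 20 * pop + rng.normal(0, 100, size=50)


def ols_slope(x, y):
    """Slope of a one-variable least-squares fit with an intercept."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope


slope_without = ols_slope(pop, crime)

# Add one hypothetical megacity with a lower rate (17 per thousand)
pop_with = np.append(pop, 8400.0)
crime_with = np.append(crime, 8400.0 * 17)
slope_with = ols_slope(pop_with, crime_with)

print(f"slope without megacity: {slope_without:.2f}")
print(f"slope with megacity:    {slope_with:.2f}")
```

The fit that includes the large city describes that city well and the fifty ordinary towns less well, which is the motivation for dropping it here.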

In [6]:
import statsmodels.api as sm

# Instantiate and fit our model.
Y = df_wony['Property_crime'].values.reshape(-1, 1)
X = df_wony[['Population', 'Population^2', 'Murder', 'Robbery']]
X = sm.add_constant(X)
# Note the difference in argument order
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()

Dep. Variable:,y,R-squared:,0.939
Model:,OLS,Adj. R-squared:,0.939
Method:,Least Squares,F-statistic:,1323.0
Date:,"Tue, 11 Dec 2018",Prob (F-statistic):,1.43e-206
Time:,20:35:28,Log-Likelihood:,-2414.5
No. Observations:,347,AIC:,4839.0
Df Residuals:,342,BIC:,4858.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-25.0399,18.877,-1.326,0.186,-62.169,12.089
Population,0.0206,0.001,18.350,0.000,0.018,0.023
Population^2,-7.195e-08,1.03e-08,-7.010,0.000,-9.21e-08,-5.18e-08
Murder,102.6434,14.278,7.189,0.000,74.559,130.728
Robbery,5.1300,0.764,6.718,0.000,3.628,6.632

Omnibus:,173.424,Durbin-Watson:,2.013
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4534.307
Skew:,-1.499,Prob(JB):,0.0
Kurtosis:,20.453,Cond. No.,7070000000.0
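
The summary reports a condition number of roughly 7e9, which statsmodels flags as a possible sign of multicollinearity or numerical problems. A likely contributor is that `Population` and `Population^2` sit on very different scales. The sketch below uses synthetic populations (an assumption, not the notebook's data) to show how standardizing the variable before squaring it changes the condition number of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
pop = rng.uniform(500, 250_000, size=300)  # hypothetical city populations


def design(p):
    """Design matrix with a constant, p, and p squared."""
    return np.column_stack([np.ones_like(p), p, p ** 2])


# Raw columns: values near 1, 1e5 and 1e10 in the same matrix
raw_cond = np.linalg.cond(design(pop))

# Standardize first, then square: all columns are of order 1
z = (pop - pop.mean()) / pop.std()
scaled_cond = np.linalg.cond(design(z))

print(f"raw condition number:    {raw_cond:.3g}")
print(f"scaled condition number: {scaled_cond:.3g}")
```

Rescaling changes the coefficients' units but not the fitted values, so it is a cheap way to make the numerical diagnostics easier to interpret.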


### Test the model

In [7]:
# import CA data and show format
df_ca = pd.read_csv('table_8_offenses_known_to_law_enforcement_california_by_city_2013.csv')
# rename the columns
df_ca.rename(columns={'Table 8': 'City', 'Unnamed: 1': 'Population', 'Unnamed: 2': 'Violent_crime', 
                   'Unnamed: 3': 'Murder', 'Unnamed: 4': 'Rape (revised definition)', 
                   'Unnamed: 5': 'Rape', 'Unnamed: 6': 'Robbery', 
                   'Unnamed: 7': 'Aggravated_assault', 'Unnamed: 8': 'Property_crime', 
                   'Unnamed: 9': 'Burglary', 'Unnamed: 10': 'Larceny_theft', 'Unnamed: 11': 'Motor_vehicle_theft',
                   'Unnamed: 12': 'Arson'}, inplace=True)
# drop title rows from the csv file
df_ca=df_ca.drop(['Unnamed: 13', 'Rape (revised definition)'], axis=1)
df_ca=df_ca.drop([0, 1, 2, 3, 466, 467], axis=0)
df_ca.head()

Unnamed: 0,City,Population,Violent_crime,Murder,Rape,Robbery,Aggravated_assault,Property_crime,Burglary,Larceny_theft,Motor_vehicle_theft,Arson
4,Adelanto,31165,198,2,15,52,129,886,381,372,133,17
5,Agoura Hills,20762,19,0,2,10,7,306,109,185,12,7
6,Alameda,76206,158,0,10,85,63,1902,287,1285,330,17
7,Albany,19104,29,0,1,24,4,557,94,388,75,7
8,Alhambra,84710,163,1,9,81,72,1774,344,1196,234,7


In [8]:
# add state column
df_ca['State']='CA'
# convert all numerical columns to integers (strip thousands separators first)
int_cols = ['Population', 'Violent_crime', 'Murder', 'Rape', 'Robbery',
            'Aggravated_assault', 'Property_crime', 'Burglary',
            'Larceny_theft', 'Motor_vehicle_theft', 'Arson']
for col in int_cols:
    # whole-column assignment avoids the chained-indexing SettingWithCopyWarning
    df_ca[col] = df_ca[col].str.replace(',', '', regex=False).astype('int64')


In [9]:
# add a population^2 column
df_ca['Population^2'] = df_ca['Population']**2

# add binary indicator columns: 1 if the city reported any murder / robbery, else 0
df_ca['Murder Cat'] = (df_ca['Murder'] != 0).astype(int)
df_ca['Robbery Cat'] = (df_ca['Robbery'] != 0).astype(int)
df_ca.head()


Unnamed: 0,City,Population,Violent_crime,Murder,Rape,Robbery,Aggravated_assault,Property_crime,Burglary,Larceny_theft,Motor_vehicle_theft,Arson,State,Population^2,Murder Cat,Robbery Cat
4,Adelanto,31165,198,2,15,52,129,886,381,372,133,17,CA,971257225,1,1
5,Agoura Hills,20762,19,0,2,10,7,306,109,185,12,7,CA,431060644,0,1
6,Alameda,76206,158,0,10,85,63,1902,287,1285,330,17,CA,5807354436,0,1
7,Albany,19104,29,0,1,24,4,557,94,388,75,7,CA,364962816,0,1
8,Alhambra,84710,163,1,9,81,72,1774,344,1196,234,7,CA,7175784100,1,1


In [10]:
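# run the model on the new set of data
X = df_ca[['Population', 'Population^2', 'Murder', 'Robbery']]
X = sm.add_constant(X)
# keep the California predictions so they can be compared with the observed values
predictions_ca = model.predict(X)
# note: summary() still describes the fit on the New York training data,
# not how well the model predicts California
model.summary()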

Dep. Variable:,y,R-squared:,0.939
Model:,OLS,Adj. R-squared:,0.939
Method:,Least Squares,F-statistic:,1323.0
Date:,"Tue, 11 Dec 2018",Prob (F-statistic):,1.43e-206
Time:,20:36:31,Log-Likelihood:,-2414.5
No. Observations:,347,AIC:,4839.0
Df Residuals:,342,BIC:,4858.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-25.0399,18.877,-1.326,0.186,-62.169,12.089
Population,0.0206,0.001,18.350,0.000,0.018,0.023
Population^2,-7.195e-08,1.03e-08,-7.010,0.000,-9.21e-08,-5.18e-08
Murder,102.6434,14.278,7.189,0.000,74.559,130.728
Robbery,5.1300,0.764,6.718,0.000,3.628,6.632

Omnibus:,173.424,Durbin-Watson:,2.013
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4534.307
Skew:,-1.499,Prob(JB):,0.0
Kurtosis:,20.453,Cond. No.,7070000000.0
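
The summary above is identical to the training summary because `model.summary()` only describes the original New York fit. To actually test the model on California, the predictions have to be compared with the observed `Property_crime` values. A minimal sketch of two such metrics follows; the helper names `r_squared` and `rmse` are hypothetical, and the arrays are toy values rather than notebook results.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Out-of-sample R^2: 1 - SS_res / SS_tot (can be negative on new data)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy check: perfect predictions score 1, predicting the mean scores 0
y_obs = np.array([10.0, 20.0, 30.0, 40.0])
r2_perfect = r_squared(y_obs, y_obs)
r2_mean = r_squared(y_obs, np.full_like(y_obs, y_obs.mean()))
rmse_shifted = rmse(y_obs, y_obs + 2.0)
```

In this notebook the call would look something like `r_squared(df_ca['Property_crime'], model.predict(X))`, assuming `X` is still the California design matrix built in the cell above.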


## KNN

### Build the model

In [11]:
df_wony.head()

Unnamed: 0,City,Population,Violent_crime,Murder,Rape,Robbery,Aggravated_assault,Property_crime,Burglary,Larceny_theft,Motor_vehicle_theft,Arson,State,Population^2,Murder Cat,Robbery Cat
4,Adams Village,1861,0,0,0,0,0,12,2,10,0,0.0,NY,3463321,0,0
5,Addison Town and Village,2577,3,0,0,0,3,24,3,20,1,0.0,NY,6640929,0,0
6,Akron Village,2846,3,0,0,0,3,16,1,15,0,0.0,NY,8099716,0,0
7,Albany,97956,791,8,30,227,526,4090,705,3243,142,,NY,9595377936,1,1
8,Albion Village,6388,23,0,3,4,16,223,53,165,5,,NY,40806544,0,1


In [12]:
df_wony.describe()

Unnamed: 0,Population,Violent_crime,Murder,Rape,Robbery,Aggravated_assault,Property_crime,Burglary,Larceny_theft,Motor_vehicle_theft,Arson,Population^2,Murder Cat,Robbery Cat
count,347.0,347.0,347.0,347.0,347.0,347.0,347.0,347.0,347.0,347.0,187.0,347.0,347.0,347.0
mean,15956.686,51.213,0.605,2.677,17.867,30.063,385.752,72.173,298.994,14.585,1.872,985840709.758,0.138,0.599
std,27080.219,236.667,3.707,10.741,94.972,128.783,1034.369,264.941,715.232,67.682,10.693,5067232380.434,0.346,0.491
min,526.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,276676.0,0.0,0.0
25%,2997.0,2.0,0.0,0.0,0.0,1.0,40.0,6.0,31.0,0.0,0.0,8982153.0,0.0,0.0
50%,7187.0,6.0,0.0,0.0,1.0,4.0,112.0,17.0,94.0,2.0,0.0,51652969.0,0.0,1.0
75%,18160.5,21.5,0.0,2.0,5.0,14.0,340.5,51.0,284.5,7.0,1.0,329804222.5,0.0,1.0
max,258789.0,3249.0,47.0,145.0,1322.0,1735.0,12491.0,3458.0,8076.0,957.0,132.0,66971746521.0,1.0,1.0


In [52]:
from sklearn.neighbors import KNeighborsRegressor
# Property_crime is a continuous count, so use the regressor rather than the classifier
neighbors = KNeighborsRegressor(n_neighbors=10, weights='distance')
X = df_wony[['Population', 'Population^2', 'Murder', 'Robbery']]
# pass y as a 1-D Series to avoid sklearn's column-vector warning
Y = df_wony['Property_crime'].astype(float)
neighbors.fit(X, Y)

# Set up our prediction line.
G = np.arange(0, 300000, 30000, dtype=np.int64)[:, np.newaxis]
H = G ** 2  # keep Population^2 consistent with Population
I = np.arange(0, 50, 5)[:, np.newaxis]
J = np.arange(0, 2000, 200)[:, np.newaxis]

Z = np.c_[G, H, I, J]

# Trailing underscores are a common convention for a prediction.
Y_ = neighbors.predict(Z)


In [53]:
neighbors.score(X, Y)

1.0

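The score comes out as 1.0, but this does not mean the fit is perfect in any useful sense. The model is scored on the same data it was trained on, and with `weights='distance'` every training point is its own nearest neighbor at distance zero, so KNN simply returns the stored target. The score therefore reflects memorization (overfitting) rather than predictive accuracy; a fair comparison with OLS needs held-out data.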
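
A minimal sketch of a fairer comparison is shown below, under several assumptions: the data are synthetic (generated from a linear rule plus noise, not the crime tables), the features are standardized before KNN so that no single large-valued column dominates the distance, and both models are scored on a held-out split. The variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-ins for two of the notebook's features
population = rng.uniform(500, 250_000, size=n)
robbery = rng.poisson(population / 5_000)
X = np.column_stack([population, robbery])
y = 0.02 * population + 5 * robbery + rng.normal(0, 50, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale inside a pipeline so the scaler is fit on the training split only
knn = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=10, weights='distance'),
)
ols = LinearRegression()
knn.fit(X_train, y_train)
ols.fit(X_train, y_train)

knn_train_r2 = knn.score(X_train, y_train)  # memorized training points
knn_test_r2 = knn.score(X_test, y_test)     # held-out performance
ols_test_r2 = ols.score(X_test, y_test)

print(f"KNN train R^2: {knn_train_r2:.4f}")
print(f"KNN test  R^2: {knn_test_r2:.4f}")
print(f"OLS test  R^2: {ols_test_r2:.4f}")
```

The gap between the KNN training and test scores is the quantity to watch. Which model wins on the real data depends on how close the true relationship is to linear, so the same split-and-score procedure would need to be run on the New York and California tables before drawing a conclusion.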