# Challenge: Model Comparison
   ---
You now know two kinds of regression and two kinds of classifier. So let's use that to compare models!

Comparing models is something data scientists do all the time. There is rarely only one model that could work in a given situation, so learning to choose the best one is very important.

Here, let's work on regression. Find a data set and build a KNN regression and an OLS regression. Compare the two. How similar are they? Do they miss in different ways?
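One way to check whether two models "miss in different ways" is to compare their out-of-fold residuals. The sketch below is illustrative only: it uses synthetic data (a noisy sine curve, not the challenge data set), and the names `X_demo`, `y_demo`, `residuals`, and `rmse` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Synthetic, deliberately nonlinear data (an assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

models = {
    "OLS": LinearRegression(),
    "KNN": KNeighborsRegressor(n_neighbors=10),
}

residuals = {}
rmse = {}
for name, model in models.items():
    # Out-of-fold predictions: each point is predicted by a model
    # that did not see it during fitting.
    pred = cross_val_predict(model, X_demo, y_demo, cv=5)
    residuals[name] = y_demo - pred
    rmse[name] = float(np.sqrt(np.mean(residuals[name] ** 2)))
    print(f"{name}: RMSE = {rmse[name]:.3f}")

# A correlation near 1 means both models err on the same points;
# a low correlation means they miss in different places.
corr = np.corrcoef(residuals["OLS"], residuals["KNN"])[0, 1]
print(f"Residual correlation: {corr:.2f}")
```

Plotting each model's residuals against the feature (or against the predictions) is a natural next step, since structure in the residuals shows where a model is systematically wrong.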

Create a Jupyter notebook with your models. At the end, in a markdown cell, write a few paragraphs describing the models' behaviors and why you favor one model over the other. Try to determine whether there is a situation where you would change your mind, or whether one is unambiguously better than the other. Lastly, try to note what it is about the data that causes the better model to outperform the weaker one. Submit a link to your notebook below.

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import neighbors
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
import statsmodels.formula.api as smf
%matplotlib inline

In [47]:
df = pd.read_csv('2 Dataset.csv')
list(df)

['Amount.Requested',
 'Amount.Funded.By.Investors',
 'Interest.Rate',
 'Loan.Length',
 'Loan.Purpose',
 'Debt.To.Income.Ratio',
 'State',
 'Home.Ownership',
 'Monthly.Income',
 'FICO.Range',
 'Open.CREDIT.Lines',
 'Revolving.CREDIT.Balance',
 'Inquiries.in.the.Last.6.Months',
 'Employment.Length']

In [48]:
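# Drop rows with missing values before parsing the string columns.
df.dropna(inplace=True)
# Strip the percent sign so these columns can be parsed as numbers.
df['Debt.To.Income.Ratio'] = df['Debt.To.Income.Ratio'].str.replace('%', '', regex=False)
df['Interest.Rate'] = df['Interest.Rate'].str.replace('%', '', regex=False)

# Convert to numeric; errors='coerce' turns any unparseable value into NaN.
cols = ['Debt.To.Income.Ratio', 'Interest.Rate', 'Monthly.Income', 'Revolving.CREDIT.Balance']
for c in cols:
    df[c] = pd.to_numeric(df[c], errors='coerce')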
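Because `errors='coerce'` can create new `NaN` values, it can be safer to drop missing rows after the conversion rather than before it. Here is a minimal sketch of that ordering on a tiny hand-made frame; the values and the names `raw` and `clean` are hypothetical and are not taken from the data set.

```python
import pandas as pd

# Hypothetical rows: one missing value and one unparseable string.
raw = pd.DataFrame({
    "Interest.Rate": ["8.90%", "12.12%", None, "bad"],
    "Monthly.Income": [6541.67, 4583.33, 11500.0, 3833.33],
})

clean = raw.copy()
# Strip the trailing percent sign, then convert; bad values become NaN.
clean["Interest.Rate"] = pd.to_numeric(
    clean["Interest.Rate"].str.rstrip("%"), errors="coerce"
)
# Drop missing rows only after coercion, so coerced NaNs are removed too.
clean = clean.dropna().reset_index(drop=True)

print(clean)
```

With this ordering, the row that was missing from the start and the row that failed to parse are both removed before any model sees the data.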

In [63]:
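# KNN setup: Interest.Rate is the only feature, and the three columns below form a multi-output target.
X = pd.DataFrame(df['Interest.Rate'])
Y = df[['Debt.To.Income.Ratio', 'Monthly.Income', 'Revolving.CREDIT.Balance']]

knn = neighbors.KNeighborsRegressor(n_neighbors=10)
knn.fit(X, Y)
# cross_val_score refits clones of the estimator on each fold; for regressors the default score is R^2.
score = cross_val_score(knn, X, Y, cv=5)

knn_w = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance')
knn_w.fit(X, Y)
score_w = cross_val_score(knn_w, X, Y, cv=5)

# OLS setup: the roles are reversed here, so Interest.Rate is now the target.
Y = df['Interest.Rate'].values.reshape(-1,1)
X = df[['Debt.To.Income.Ratio', 'Monthly.Income', 'Revolving.CREDIT.Balance']]
# Note: this X and Y differ from the X and Y used for KNN above, so the KNN and OLS
# scores are not directly comparable. Is there a reason for the difference?
ols = linear_model.LinearRegression()
ols.fit(X, Y)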

# These scores are R^2 values, not accuracies. The OLS score is computed in-sample on the training data.
print('KNN unweighted R^2: %0.2f (+/- %0.2f)' % (score.mean(), score.std() * 2))
print('KNN weighted R^2: %0.2f (+/- %0.2f)' % (score_w.mean(), score_w.std() * 2))
print('OLS in-sample R^2: %0.2f' % ols.score(X, Y))

KNN unweighted R^2: -0.14 (+/- 0.24)
KNN weighted R^2: -0.20 (+/- 0.25)
OLS in-sample R^2: 0.03
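The cell above scores KNN and OLS on different targets, and only the KNN models are cross-validated, so the three numbers cannot be compared directly. A like-for-like comparison might look like the sketch below. It runs on synthetic data so that it is self-contained; the feature scales, the coefficients, and the names `X_syn`, `y_syn`, `candidates`, and `cv_r2` are assumptions for illustration. Standardizing the features before KNN is a suggested addition rather than something the original notebook does; it is included because KNN distances are otherwise dominated by the columns with the largest units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three features on very different scales.
rng = np.random.default_rng(42)
n = 500
X_syn = np.column_stack([
    rng.uniform(0, 35, n),          # ratio-like feature
    rng.uniform(1_000, 15_000, n),  # income-like feature
    rng.uniform(0, 50_000, n),      # balance-like feature
])
# Assumed target: linear in the first feature plus noise.
y_syn = 5 + 0.2 * X_syn[:, 0] + rng.normal(scale=1.0, size=n)

# Same features, same target, same folds for every model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    "OLS": LinearRegression(),
    "KNN (scaled)": make_pipeline(
        StandardScaler(), KNeighborsRegressor(n_neighbors=10)
    ),
    "KNN (scaled, distance-weighted)": make_pipeline(
        StandardScaler(),
        KNeighborsRegressor(n_neighbors=10, weights="distance"),
    ),
}

cv_r2 = {}
for name, model in candidates.items():
    cv_r2[name] = cross_val_score(model, X_syn, y_syn, cv=cv, scoring="r2")
    print(f"{name}: R^2 = {cv_r2[name].mean():.2f} "
          f"(+/- {cv_r2[name].std() * 2:.2f})")
```

To apply this to the loan data, `X_syn` and `y_syn` would be replaced by the three numeric feature columns and `Interest.Rate` respectively, with every model scored on the same folds.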
