Take the loan data and process it as you did previously to build your linear regression model.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np

In [2]:
#Import Data
loansData = pd.read_csv('https://github.com/Thinkful-Ed/curric-data-001-data-sets/raw/master/loans/loansData.csv')

In [3]:
#Clean Data
# Strip the trailing '%' from the interest rate and the ' months' suffix from the loan length
loansData['Interest.Rate'] = loansData['Interest.Rate'].map(lambda x: x.rstrip('%'))
loansData['Loan.Length'] = loansData['Loan.Length'].map(lambda x: x.replace(' months', ''))

# Use the lower bound of each FICO range (e.g. '735-739' -> 735.0) as the score
loansData['FICO.Score'] = loansData['FICO.Range'].astype(str)
loansData['FICO.Score'] = loansData['FICO.Score'].map(lambda x: x.split('-')[0])
loansData['FICO.Score'] = loansData['FICO.Score'].astype(float)

In [5]:
loansData.head()

Unnamed: 0,Amount.Requested,Amount.Funded.By.Investors,Interest.Rate,Loan.Length,Loan.Purpose,Debt.To.Income.Ratio,State,Home.Ownership,Monthly.Income,FICO.Range,Open.CREDIT.Lines,Revolving.CREDIT.Balance,Inquiries.in.the.Last.6.Months,Employment.Length,FICO.Score
81174,20000,20000.0,8.9,36,debt_consolidation,14.90%,SC,MORTGAGE,6541.67,735-739,14.0,14272.0,2.0,< 1 year,735.0
99592,19200,19200.0,12.12,36,debt_consolidation,28.36%,TX,MORTGAGE,4583.33,715-719,12.0,11140.0,1.0,2 years,715.0
80059,35000,35000.0,21.98,60,debt_consolidation,23.81%,CA,MORTGAGE,11500.0,690-694,14.0,21977.0,1.0,2 years,690.0
15825,10000,9975.0,9.99,36,debt_consolidation,14.30%,KS,MORTGAGE,3833.33,695-699,10.0,9346.0,0.0,5 years,695.0
33182,12000,12000.0,11.71,36,credit_card,18.78%,NJ,RENT,3195.0,695-699,11.0,14469.0,0.0,9 years,695.0


Break the data set into 10 segments, following the scikit-learn KFold example.

In [6]:
#Extract Columns
intrate = loansData['Interest.Rate'].astype(float)
loanamt = loansData['Amount.Requested']
fico = loansData['FICO.Score'].astype(int)

#Reshape the Data
# The dependent variable
y = np.matrix(intrate).transpose()
# The independent variables shaped as columns
x1 = np.matrix(fico).transpose()
x2 = np.matrix(loanamt).transpose()
x = np.column_stack([x1,x2])

X = sm.add_constant(x)

In [15]:
from sklearn.cross_validation import KFold
import sklearn.metrics as metrics

kf = KFold(len(X), n_folds=10)
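
The `sklearn.cross_validation` module imported above was deprecated in scikit-learn 0.18 and removed in 0.20. For readers on a newer release, a rough equivalent is sketched below, assuming `sklearn.model_selection.KFold`. The toy array `X_demo` and the names `kf_new` and `fold_sizes` are illustrative stand-ins and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in for the design matrix X (25 rows, 3 columns)
X_demo = np.arange(75).reshape(25, 3)

# n_splits replaces n_folds; the data is passed to split(), not to the constructor
kf_new = KFold(n_splits=10)

fold_sizes = []
for train_idx, test_idx in kf_new.split(X_demo):
    # Each split yields two integer index arrays: training rows and held-out rows
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)
```

With this API the loop header becomes `for train, test in kf_new.split(X):`, and the number of folds is read from `kf_new.n_splits` or `kf_new.get_n_splits()` rather than `n_folds`.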

Compute each performance metric (MAE, MSE, and R2) for every fold. The average across folds is the estimated performance of your model.
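
For reference, the three metrics can be written out directly from their definitions. The sketch below is a minimal NumPy version, assuming 1-D inputs of equal length; the helper name `regression_metrics` and the sample values are made up for illustration, and the notebook itself relies on `sklearn.metrics` instead.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R2) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))   # mean absolute error
    mse = np.mean(residuals ** 2)      # mean squared error
    ss_res = np.sum(residuals ** 2)    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot         # coefficient of determination
    return mae, mse, r2

# Made-up interest rates (percent) and predictions, for illustration only
y_true_demo = [10.0, 12.0, 14.0, 16.0]
y_pred_demo = [11.0, 12.0, 13.0, 18.0]
mae_demo, mse_demo, r2_demo = regression_metrics(y_true_demo, y_pred_demo)
print(mae_demo, mse_demo, r2_demo)
```

MAE and MSE are in units of the target (percentage points and squared percentage points here), while R2 is unitless and compares the model against always predicting the mean.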

In [16]:
mae_train_list = []
mae_test_list = []
mse_train_list = []
mse_test_list = []
r2_train_list = []
r2_test_list = []

for train, test in kf:
    # Fit on the training fold only, then predict once for each fold
    f = sm.OLS(y[train], X[train]).fit()
    pred_train = f.predict(X[train])
    pred_test = f.predict(X[test])

    mae_train = metrics.mean_absolute_error(y[train], pred_train)
    mae_test = metrics.mean_absolute_error(y[test], pred_test)
    mse_train = metrics.mean_squared_error(y[train], pred_train)
    mse_test = metrics.mean_squared_error(y[test], pred_test)
    r2_train = metrics.r2_score(y[train], pred_train)
    r2_test = metrics.r2_score(y[test], pred_test)

    mae_train_list.append(mae_train)
    mae_test_list.append(mae_test)
    mse_train_list.append(mse_train)
    mse_test_list.append(mse_test)
    r2_train_list.append(r2_train)
    r2_test_list.append(r2_test)

    # Single-argument print calls behave the same under Python 2 and 3
    print('Mean Absolute Error: train {}, test {}'.format(mae_train, mae_test))
    print('Mean Squared Error: train {}, test {}'.format(mse_train, mse_test))
    print('R-Squared: train {}, test {}'.format(r2_train, r2_test))
    print('')

Mean Absolute Error: train 1.92574828457, test 2.03526982218
Mean Squared Error: train 5.91856408622, test 6.67721546898
R-Squared: train 0.658869304534, test 0.634405537856

Mean Absolute Error: train 1.93346982734, test 1.96086476069
Mean Squared Error: train 5.97390630798, test 6.15688091076
R-Squared: train 0.658648931299, test 0.63722077469

Mean Absolute Error: train 1.93458991973, test 1.98855434507
Mean Squared Error: train 5.95952150769, test 6.29592702927
R-Squared: train 0.662470448396, test 0.596348083509

Mean Absolute Error: train 1.94004506613, test 1.90918225184
Mean Squared Error: train 6.04011482904, test 5.56199441393
R-Squared: train 0.657893916623, test 0.643579708291

Mean Absolute Error: train 1.94151286181, test 1.90098016795
Mean Squared Error: train 5.9946167782, test 5.97575835329
R-Squared: train 0.655387003209, test 0.665164764526

Mean Absolute Error: train 1.93938230057, test 1.93012663743
Mean Squared Error: train 5.98824212345, test 6.03414494336
R-Squa

In [17]:
# Average each metric over the folds
avg_mse_train = np.mean(mse_train_list)
avg_mse_test = np.mean(mse_test_list)
avg_mae_train = np.mean(mae_train_list)
avg_mae_test = np.mean(mae_test_list)
avg_r2_train = np.mean(r2_train_list)
avg_r2_test = np.mean(r2_test_list)

# Report the held-out (test) averages as the model's estimated performance
print('Average MAE: {}'.format(avg_mae_test))
print('\nAverage MSE: {}'.format(avg_mse_test))
print('\nAverage R-Squared: {}'.format(avg_r2_test))

Average MAE: 1.94119197599

Average MSE: 6.01329025629

Average R-Squared: 0.653129530565


Comment on each of the performance metrics you obtained.
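
As a starting point for that commentary, it can help to look at how much a metric varies across folds and how far the test average sits from the train average. The sketch below uses a hypothetical helper, `summarize_folds`, which is not part of the original notebook. As sample input it takes only the five complete R-squared pairs visible in the per-fold output, so its numbers will differ from the ten-fold averages.

```python
import numpy as np

def summarize_folds(train_scores, test_scores):
    """Return mean train score, mean test score, test spread, and train-test gap."""
    train = np.asarray(train_scores, dtype=float)
    test = np.asarray(test_scores, dtype=float)
    return {
        'train_mean': train.mean(),
        'test_mean': test.mean(),
        'test_std': test.std(ddof=1),  # sample standard deviation across folds
        'gap': train.mean() - test.mean(),
    }

# R-squared values copied from the first five folds of the printed output
r2_train_demo = [0.658869304534, 0.658648931299, 0.662470448396,
                 0.657893916623, 0.655387003209]
r2_test_demo = [0.634405537856, 0.63722077469, 0.596348083509,
                0.643579708291, 0.665164764526]

summary = summarize_folds(r2_train_demo, r2_test_demo)
for name, value in sorted(summary.items()):
    print('{}: {:.4f}'.format(name, value))
```

A small gap together with a small spread would suggest the model predicts unseen loans about as well as it fits the training data; a large gap would point toward overfitting. The same helper could be applied to the MAE and MSE lists.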