# CMSE 802 In-class assignment: Credit Scoring
## Date: 11/27/2018
## Due: 11/27/2018; 10:30 PM

### The goal of this assignment is to learn Logistic Regression and Random Forest.

<img src="http://mechanicalforex.com/wp-content/uploads/2016/08/Selection_999256.jpg" width="60%"/>

source: http://mechanicalforex.com/2016/08/making-random-forests-cheap-enough-for-machine-learning-mining.html

---
### Your name: Boyao Zhu

---
### Review pre-class activity 

In [26]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('https://stats.idre.ucla.edu/stat/data/binary.csv') 
df.columns = ["admit", "gre", "gpa", "prestige"]

X = df.copy()
y = X['admit']
X = X.drop(['admit'], axis=1)


scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 42)


logisModel = LogisticRegression()
logisModel.fit(X_train, y_train)

forestModel = RandomForestClassifier(random_state=42)
forestModel.fit(X_train, y_train)

print('Logistic Model\n**************')
print('Training accuracy:', accuracy_score(y_train, logisModel.predict(X_train)))
print('Validation accuracy:', accuracy_score(y_test, logisModel.predict(X_test)))

print('\nRandom Forest Model\n*******************')
print('Training accuracy:', accuracy_score(y_train, forestModel.predict(X_train)))
print('Validation accuracy:', accuracy_score(y_test, forestModel.predict(X_test)))

Logistic Model
**************
Training accuracy: 0.725
Validation accuracy: 0.6625

Random Forest Model
*******************
Training accuracy: 0.946875
Validation accuracy: 0.6875


**Thoughts:**

1. The logistic model is underfitting the data. We can try a more complicated model: for instance, engineering new features of the type $x_1^2, x_2^2, x_3^2 , \ldots, x_1x_2, x_1x_3,x_2x_3,\ldots, x_1^3,x_2^3,x_3^3,\ldots$. The new features should be nonlinear functions of the old variables.
1. Overfitting in the logistic model does not seem to be a problem.
1. Random forest suffers from massive overfitting (training accuracy 0.95 vs. validation accuracy 0.69). This method has tuning parameters that can be used to enforce a simpler model (e.g. the number of trees, their depth, etc.; see http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
1. Using fewer variables is another strategy for improving a Random Forest model: pick the top $n$ according to their importance.
---
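
The feature engineering suggested in the first point can be sketched with scikit-learn's `PolynomialFeatures`, which generates all monomials up to a chosen degree. This is only a minimal sketch on a small made-up array (`X_demo` is hypothetical, not the admissions data); in the notebook one would presumably apply the same transformer to `X_train` and `X_test`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical stand-in for three scaled predictors (gre, gpa, prestige).
X_demo = np.array([[1.0, 2.0, 3.0],
                   [0.5, -1.0, 2.0]])

# degree=2 adds the squares and the pairwise products x_i * x_j.
# include_bias=False omits the constant column, because
# LogisticRegression fits its own intercept.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_demo)

print(X_poly.shape)  # 3 linear + 3 squared + 3 cross terms = 9 columns
print(X_poly[0])     # x1, x2, x3, x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2
```

Concatenating `X_train*X_train` adds only the squared terms, whereas this transformer also produces the cross terms $x_1x_2, x_1x_3, x_2x_3$.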

In [27]:
# Let's see if we can bump up training accuracy in the logistic model

X_train_aug = np.concatenate((X_train, X_train*X_train), axis = 1)
X_test_aug = np.concatenate((X_test, X_test*X_test), axis = 1)


aug_logisModel = LogisticRegression()
aug_logisModel.fit(X_train_aug, y_train)


print(X_train_aug.shape,X_train.shape)
print('Logistic Model\n**************')
print('Training accuracy:', accuracy_score(y_train, logisModel.predict(X_train)))
print('Validation accuracy:', accuracy_score(y_test, logisModel.predict(X_test)))

print('\nAugmented Logistic Model\n**************')
print('Training accuracy:', accuracy_score(y_train, aug_logisModel.predict(X_train_aug)))
print('Validation accuracy:', accuracy_score(y_test, aug_logisModel.predict(X_test_aug)))

(320, 6) (320, 3)
Logistic Model
**************
Training accuracy: 0.725
Validation accuracy: 0.6625

Augmented Logistic Model
**************
Training accuracy: 0.728125
Validation accuracy: 0.65


In [28]:
# Constrain the random forest model using parameters

red_forestModel = RandomForestClassifier(n_estimators=30, min_samples_split=22, max_depth = 7, random_state=42)
red_forestModel.fit(X_train, y_train)


print('\nRandom Forest Model\n*******************')
print('Training accuracy:', accuracy_score(y_train, forestModel.predict(X_train)))
print('Validation accuracy:', accuracy_score(y_test, forestModel.predict(X_test)))

print('\nReduced Random Forest Model\n***************************')
print('Training accuracy:', accuracy_score(y_train, red_forestModel.predict(X_train)))
print('Validation accuracy:', accuracy_score(y_test, red_forestModel.predict(X_test)))


Random Forest Model
*******************
Training accuracy: 0.946875
Validation accuracy: 0.6875

Reduced Random Forest Model
***************************
Training accuracy: 0.775
Validation accuracy: 0.6875
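
One alternative to choosing the forest constraints by hand is a cross-validated grid search. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, so that the block runs on its own), and the candidate values in `param_grid` are illustrative rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; replace with X_train, y_train).
X_demo, y_demo = make_classification(n_samples=200, n_features=3,
                                     n_informative=2, n_redundant=0,
                                     random_state=42)

# Illustrative candidate values, not tuned recommendations.
param_grid = {
    "max_depth": [3, 7],
    "min_samples_split": [2, 22],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation within the training data
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the search scores each setting on held-out folds of the training data, the validation set stays untouched until the final comparison.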


In [30]:
# How are the different models using the features?

featImportanceForest = pd.Series(forestModel.feature_importances_, index = ['gre','gpa','prestige'])

featImportance_redForest = pd.Series(red_forestModel.feature_importances_, index = ['gre','gpa','prestige'])

print('Default Random Forest')
print(featImportanceForest)
print('\nReduced Random Forest')
print(featImportance_redForest)

Default Random Forest
gre         0.334353
gpa         0.528977
prestige    0.136670
dtype: float64

Reduced Random Forest
gre         0.285422
gpa         0.458797
prestige    0.255782
dtype: float64
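
Picking the top $n$ variables by importance, as suggested in the thoughts above, could look like the following sketch. The importances are copied from the default random forest output; `n = 2` is a hypothetical choice.

```python
import pandas as pd

# Importances copied from the default random forest output above.
importances = pd.Series({"gre": 0.334353, "gpa": 0.528977, "prestige": 0.136670})

n = 2  # hypothetical number of features to keep

# Sort from most to least important and keep the first n names.
top_features = importances.sort_values(ascending=False).head(n).index.tolist()
print(top_features)  # ['gpa', 'gre']
```

The resulting list can then be used to select columns from the unscaled data frame before re-scaling and refitting the model.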


**Conclusion:** With these classification algorithms, the three predictors do not seem to carry enough information to capture the phenomenon: none of the variants above pushed validation accuracy past 0.6875.

----

### Activity

1. Load the transformed loan data from last class.

1. Use the get_dummies function in the pandas library to convert categorical data to numerical data. Please remember that having highly correlated variables = bad.

1. Prepare the data: y ~ variable to predict, X ~ predictor variables; re-scale X and divide the training data into training and validation.

1. Build the best classification model you can. At your disposal: logistic regression, random forest; you also have control over all the parameters in the models. Please remember to use random_state = 42 so that we can all compare results.
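
For the second step, one way to avoid perfectly correlated dummy columns is the `drop_first` option of `pd.get_dummies`. The sketch below runs on a tiny made-up frame (`demo` is hypothetical and only borrows column names from the loan data).

```python
import pandas as pd

# Tiny made-up frame with the same kind of columns as the loan data.
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "LoanAmount": [128.0, 66.0, 120.0],
})

# Default encoding: one column per category, so Gender_Female
# is always the exact complement of Gender_Male.
full = pd.get_dummies(demo)

# drop_first=True drops one level per categorical column,
# which removes that redundancy.
reduced = pd.get_dummies(demo, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True` a two-level variable such as `Gender` is encoded by a single column, and a three-level variable such as `Property_Area` by two.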

In [44]:
# Your code
df = pd.read_csv('loan_data_Train_clean.csv')



In [45]:
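# fillna() and replace() return new Series; assign the result back so that
# the changes take effect (the bare replace() calls changed nothing).
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
df['LoanAmount'] = df['LoanAmount'].replace(0., df['LoanAmount'].mean())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mean())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].replace(0., df['Loan_Amount_Term'].mean())
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mean())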


In [35]:
df_drop = df.drop(columns=['Loan_ID'])
df_dummies = pd.get_dummies(df_drop)
df_dummies

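| Unnamed: 0.1 | Unnamed: 0 | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Gender_Female | Gender_Male | Married_No | Married_Yes | ... | Dependents_3+ | Education_Graduate | Education_Not Graduate | Self_Employed_No | Self_Employed_Yes | Property_Area_Rural | Property_Area_Semiurban | Property_Area_Urban | Loan_Status_N | Loan_Status_Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5849 | 0.0 | 146.412162 | 360.0 | 1.000000 | 0 | 1 | 1 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 4583 | 1508.0 | 128.000000 | 360.0 | 1.000000 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 2 | 3000 | 0.0 | 66.000000 | 360.0 | 1.000000 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 3 | 3 | 2583 | 2358.0 | 120.000000 | 360.0 | 1.000000 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 4 | 6000 | 0.0 | 141.000000 | 360.0 | 1.000000 | 0 | 1 | 1 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 5 | 5 | 5417 | 4196.0 | 267.000000 | 360.0 | 1.000000 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 6 | 6 | 2333 | 1516.0 | 95.000000 | 360.0 | 1.000000 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 7 | 7 | 3036 | 2504.0 | 158.000000 | 360.0 | 0.000000 | 0 | 1 | 0 | 1 | ... | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 8 | 8 | 4006 | 1526.0 | 168.000000 | 360.0 | 1.000000 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 9 | 9 | 12841 | 10968.0 | 349.000000 | 360.0 | 1.000000 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |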


In [36]:
X = df_dummies.copy()
y = X['Loan_Status_Y']
X = X.drop(['Loan_Status_N', 'Loan_Status_Y'], axis=1)

In [37]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 42)


logisModel = LogisticRegression()
logisModel.fit(X_train, y_train)

forestModel = RandomForestClassifier(random_state=42)
forestModel.fit(X_train, y_train)

print('Logistic Model\n**************')
print('Training accuracy:', accuracy_score(y_train, logisModel.predict(X_train)))
print('Validation accuracy:', accuracy_score(y_test, logisModel.predict(X_test)))

print('\nRandom Forest Model\n*******************')
print('Training accuracy:', accuracy_score(y_train, forestModel.predict(X_train)))
print('Validation accuracy:', accuracy_score(y_test, forestModel.predict(X_test)))

Logistic Model
**************
Training accuracy: 0.8167006109979633
Validation accuracy: 0.7967479674796748

Random Forest Model
*******************
Training accuracy: 0.9938900203665988
Validation accuracy: 0.7154471544715447
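
In the cell above the scaler is fitted on all of `X` before the split, so the validation rows contribute to the means and standard deviations used for scaling. A possible way to avoid this is to split first and let a pipeline fit the scaler on the training part only. This sketch uses synthetic data as a stand-in (an assumption; the loan features would take its place in practice).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; use the loan features in practice).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)

# Split first, so the validation rows never influence the scaler.
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# The pipeline fits StandardScaler on X_tr only, then reuses those
# means and standard deviations when it transforms X_va.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

print("Training accuracy:", model.score(X_tr, y_tr))
print("Validation accuracy:", model.score(X_va, y_va))
```

On a data set of this size the effect on accuracy is likely small, but the pipeline keeps the validation score an honest estimate.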


In [38]:
X_train_aug = np.concatenate((X_train, X_train*X_train), axis = 1)
X_test_aug = np.concatenate((X_test, X_test*X_test), axis = 1)


aug_logisModel = LogisticRegression()
aug_logisModel.fit(X_train_aug, y_train)


print(X_train_aug.shape,X_train.shape)
print('Logistic Model\n**************')
print('Training accuracy:', accuracy_score(y_train, logisModel.predict(X_train)))
print('Validation accuracy:', accuracy_score(y_test, logisModel.predict(X_test)))

print('\nAugmented Logistic Model\n**************')
print('Training accuracy:', accuracy_score(y_train, aug_logisModel.predict(X_train_aug)))
print('Validation accuracy:', accuracy_score(y_test, aug_logisModel.predict(X_test_aug)))

(491, 42) (491, 21)
Logistic Model
**************
Training accuracy: 0.8167006109979633
Validation accuracy: 0.7967479674796748

Augmented Logistic Model
**************
Training accuracy: 0.8309572301425662
Validation accuracy: 0.7804878048780488


In [40]:
X = df_dummies.copy()
y = X['Loan_Status_Y']
X = X.loc[:, ['ApplicantIncome', 'LoanAmount', 'Credit_History']]


In [41]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 42)


logisModel = LogisticRegression()
logisModel.fit(X_train, y_train)

forestModel = RandomForestClassifier(random_state=42)
forestModel.fit(X_train, y_train)

print('Logistic Model\n**************')
print('Training accuracy:', accuracy_score(y_train, logisModel.predict(X_train)))
print('Validation accuracy:', accuracy_score(y_test, logisModel.predict(X_test)))

print('\nRandom Forest Model\n*******************')
print('Training accuracy:', accuracy_score(y_train, forestModel.predict(X_train)))
print('Validation accuracy:', accuracy_score(y_test, forestModel.predict(X_test)))

Logistic Model
**************
Training accuracy: 0.814663951120163
Validation accuracy: 0.7886178861788617

Random Forest Model
*******************
Training accuracy: 0.9877800407331976
Validation accuracy: 0.6910569105691057


In [42]:
red_forestModel = RandomForestClassifier(n_estimators=30, min_samples_split=22, max_depth = 7, random_state=42)
red_forestModel.fit(X_train, y_train)


print('\nRandom Forest Model\n*******************')
print('Training accuracy:', accuracy_score(y_train, forestModel.predict(X_train)))
print('Validation accuracy:', accuracy_score(y_test, forestModel.predict(X_test)))

print('\nReduced Random Forest Model\n***************************')
print('Training accuracy:', accuracy_score(y_train, red_forestModel.predict(X_train)))
print('Validation accuracy:', accuracy_score(y_test, red_forestModel.predict(X_test)))


Random Forest Model
*******************
Training accuracy: 0.9877800407331976
Validation accuracy: 0.6910569105691057

Reduced Random Forest Model
***************************
Training accuracy: 0.8187372708757638
Validation accuracy: 0.7886178861788617


In [43]:
featImportanceForest = pd.Series(forestModel.feature_importances_, index = ['ApplicantIncome', 'LoanAmount', 'Credit_History'])

featImportance_redForest = pd.Series(red_forestModel.feature_importances_, index = ['ApplicantIncome', 'LoanAmount', 'Credit_History'])

print('Default Random Forest')
print(featImportanceForest)
print('\nReduced Random Forest')
print(featImportance_redForest)

Default Random Forest
ApplicantIncome    0.384651
LoanAmount         0.340088
Credit_History     0.275261
dtype: float64

Reduced Random Forest
ApplicantIncome    0.165879
LoanAmount         0.153164
Credit_History     0.680956
dtype: float64


---
### Congratulations, we're done!

**Don't forget to add your names to the top!!**

Log into the course D2L website (d2l.msu.edu) and go to "Assessments > Assignments > In-class Assignment 20181127".