## Model Evaluation for College Scorecard: Earnings Prediction

This notebook uses several supervised regression algorithms to model earnings after college from College Scorecard data. The models evaluated are:

1. Linear Regression
1. Decision Tree
1. Random Forest

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
seed = 12345

#Load the data into a data frame
data_file = ''
data = pd.read_csv(data_file)

### Data preprocessing
We need to split the data into training and test sets. We should also consider centering and scaling the features.

Scaling makes sense at least for linear regression, because it puts the coefficients on a common scale and makes them easier to compare across features.

We can also consider PCA, if only for visualization purposes.
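As a rough illustration of the PCA idea, the sketch below standardizes a synthetic feature matrix and projects it onto two principal components. `X_demo` is made-up data standing in for the scorecard features, not the real data set, and the choice of two components is only for plotting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: 100 rows, 5 columns on very different scales
rng = np.random.default_rng(12345)
X_demo = rng.normal(size=(100, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])

# Standardize first so that no single large-scale column dominates the components
X_std = StandardScaler().fit_transform(X_demo)

# Project onto the first two principal components for a 2-D scatter plot
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)
print(pca.explained_variance_ratio_)
```

The two columns of `X_2d` can be passed straight to `plt.scatter`; `explained_variance_ratio_` indicates how much of the variance the 2-D picture retains.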

In [None]:
# Split the data into training and test
# Need to figure out best way to split time series data
# Do we split based only on colleges (i.e. each college is either train or test)
# Do we split based on college and year (i.e. each data entry is either train or test)
# Another way to split?
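One possible answer to the first question above is a group-wise split, in which every row of a given college lands entirely in train or entirely in test. The sketch below uses scikit-learn's `GroupShuffleSplit` on a toy frame; the column names `college_id`, `year`, `feature`, and `earnings` are hypothetical and would need to be replaced by the real scorecard columns.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the scorecard data: 6 hypothetical colleges, 4 years each
rng = np.random.default_rng(12345)
toy = pd.DataFrame({
    'college_id': np.repeat(np.arange(6), 4),    # hypothetical column name
    'year': np.tile(np.arange(2010, 2014), 6),   # hypothetical column name
    'feature': rng.normal(size=24),
    'earnings': rng.normal(size=24),
})

# Split on colleges, not rows: each college is either train or test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=12345)
train_idx, test_idx = next(splitter.split(toy, groups=toy['college_id']))

train_colleges = set(toy.iloc[train_idx]['college_id'])
test_colleges = set(toy.iloc[test_idx]['college_id'])
print('train colleges:', sorted(train_colleges))
print('test colleges: ', sorted(test_colleges))
```

Splitting by college keeps the same institution from appearing on both sides of the split in different years, which could otherwise inflate the test score. Whether that is the right choice for this project is still an open question.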


### Linear Regression
First try: no feature scaling

In [1]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
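# score() needs the data to evaluate on and returns a float, so pass it to print() as a separate argument
print('R^2 of training set:', lin_reg.score(X_train, y_train))
print('R^2 of test set:', lin_reg.score(X_test, y_test))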

Second try: feature scaling

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
X_train_s, y_train_s, X_test_s, y_test_s = ???

lin_reg_scaled = LinearRegression()
lin_reg_scaled.fit(X_train_s, y_train_s)
print('R^2 of training set:', lin_reg_scaled.score(X_train_s, y_train_s))
print('R^2 of test set:', lin_reg_scaled.score(X_test_s, y_test_s))
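One way to scale without letting test-set statistics leak into training is to put the scaler and the regression in a single pipeline, so the scaler is fit on the training rows only. The sketch below does this on synthetic data; `X_demo`, `y_demo`, and the split variables are illustrative stand-ins, not the notebook's real arrays.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem with unscaled features
rng = np.random.default_rng(12345)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12345)

# The pipeline fits the scaler on X_tr only and reuses those statistics on X_te
scaled_lin_reg = make_pipeline(StandardScaler(), LinearRegression())
scaled_lin_reg.fit(X_tr, y_tr)
r2_train = scaled_lin_reg.score(X_tr, y_tr)
r2_test = scaled_lin_reg.score(X_te, y_te)

# Unscaled fit for comparison
r2_test_plain = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

print('R^2 of training set:', r2_train)
print('R^2 of test set:', r2_test)
print('R^2 of test set (no scaling):', r2_test_plain)
```

For ordinary least squares with an intercept, standardizing the features changes the coefficients but not the fitted predictions, so the scaled and unscaled R^2 values should agree. The benefit of scaling here is coefficient interpretability rather than a better score.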

### Decision Tree
First using default parameters, then tuning them.
We can also try different splitting criteria (MSE vs. MAE).

In [2]:
from sklearn.tree import DecisionTreeRegressor
# Fit a decision tree with out-of-the-box parameters
# Note: scikit-learn >= 1.0 renames 'mse' to 'squared_error'
dec_tree = DecisionTreeRegressor(criterion = 'mse', random_state = seed)
dec_tree.fit(X_train, y_train)

# Train and test scores (score() returns R^2 for regressors)
train_accuracy = dec_tree.score(X_train, y_train)
test_accuracy = dec_tree.score(X_test, y_test)
print('mse')
print('training accuracy:\t', train_accuracy)
print('test accuracy:\t\t', test_accuracy)

# Plot the top 20 feature importances
# (assumes X_train is a DataFrame whose columns are the features)
x = np.arange(20)
sorted_inds = np.argsort(-(dec_tree.feature_importances_))[:20]
sorted_colnames = X_train.columns[sorted_inds]
fig = plt.figure(figsize = (10,5))
plt.bar(x, dec_tree.feature_importances_[sorted_inds], width = 0.8)
plt.xticks(x, sorted_colnames, rotation = 90)
plt.show()

# Repeat the decision tree with out-of-the-box parameters and 'mae' as the criterion
# Note: scikit-learn >= 1.0 renames 'mae' to 'absolute_error'
dec_tree = DecisionTreeRegressor(criterion = 'mae', random_state = seed)
dec_tree.fit(X_train, y_train)

# Train and test scores (score() returns R^2 for regressors)
train_accuracy = dec_tree.score(X_train, y_train)
test_accuracy = dec_tree.score(X_test, y_test)
print('mae')
print('training accuracy:\t', train_accuracy)
print('test accuracy:\t\t', test_accuracy)

# Plot the top 20 feature importances
# (assumes X_train is a DataFrame whose columns are the features)
x = np.arange(20)
sorted_inds = np.argsort(-(dec_tree.feature_importances_))[:20]
sorted_colnames = X_train.columns[sorted_inds]
fig = plt.figure(figsize = (10,5))
plt.bar(x, dec_tree.feature_importances_[sorted_inds], width = 0.8)
plt.xticks(x, sorted_colnames, rotation = 90)
plt.show()

Tune hyperparameters

In [None]:
# Named *_values to match the names used in the loops and plots below
min_samples_split_values = np.arange(??,??,??)
min_samples_leaf_values = np.arange(??,??,??)

# Fit a tree with the given settings and return its R^2 on the test set
def get_r2(criterion, min_split, min_leaf):
    model = DecisionTreeRegressor(criterion = criterion, random_state = seed,
                                  min_samples_split = min_split,
                                  min_samples_leaf = min_leaf)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# One 10x10 grid of R^2 values per criterion (rows: min_samples_split, columns: min_samples_leaf)
r2_array = {
    'mse' : np.zeros([10,10]),
    'mae' : np.zeros([10,10])}
#This for loop runs the model
for crit in ['mse', 'mae']:
    for i in range(0,10):
        min_samples_split = min_samples_split_values[i]
        for j in range(0,10):
            min_samples_leaf = min_samples_leaf_values[j]
            r2_array[crit][i,j]=get_r2(crit, min_samples_split,min_samples_leaf)

#plot the models
fig = plt.figure(figsize = (10,4))
for i in range(0,10):
    plt.plot(min_samples_split_values, r2_array['mse'][:,i], 
             label = 'min_leaf: ' + str(min_samples_leaf_values[i]))
plt.legend(loc=(1.01,0.25))
plt.xticks(min_samples_split_values)
plt.xlabel('min_samples_split')
plt.title('R^2 for MSE decision tree')
plt.show()

fig = plt.figure(figsize = (10,4))
for i in range(0,10):
    plt.plot(min_samples_split_values, r2_array['mae'][:,i], 
             label = 'min_leaf: ' + str(min_samples_leaf_values[i]))
plt.legend(loc=(1.01,0.25))
plt.xticks(min_samples_split_values)
plt.xlabel('min_samples_split')
plt.title('R^2 for MAE decision tree')
plt.show()
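# r2_array is a dict of arrays, so report the best R^2 for each criterion
for crit in ['mse', 'mae']:
    print(crit, np.max(r2_array[crit]))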
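The manual grid above scores every setting on the test set, which can make the best R^2 look better than it would on new data. An alternative, sketched below under the assumption that cross-validation on the training data is acceptable here, is scikit-learn's `GridSearchCV`. The data and the grid values are synthetic placeholders, and the tree uses its default criterion to avoid version-specific criterion names.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear regression problem standing in for the training data
rng = np.random.default_rng(12345)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.2, size=300)

# Illustrative grid; the real ranges are still to be decided
param_grid = {
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
}

# 5-fold cross-validated search, scored by R^2
search = GridSearchCV(DecisionTreeRegressor(random_state=12345),
                      param_grid, cv=5, scoring='r2')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

With this approach the test set is touched only once, at the end, by calling `search.score` on the held-out rows.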

### Random forest


In [4]:
#First using out-of-box parameters (which will probably lead to overfitting)
#use mse and then use mae
from sklearn.ensemble import RandomForestRegressor
rand_forest = RandomForestRegressor(criterion = 'mse',oob_score = True, random_state = seed)
rand_forest.fit(X_train, y_train)
print('mse')
# The fitted out-of-bag score is stored in oob_score_ (oob_score is only the constructor flag)
print('OOB score :', rand_forest.oob_score_)
print('Train accuracy:', rand_forest.score(X_train, y_train))
print('Test accuracy:', rand_forest.score(X_test, y_test))
print()
rand_forest = RandomForestRegressor(criterion = 'mae',oob_score = True, random_state = seed)
rand_forest.fit(X_train, y_train)
print('mae')
# The fitted out-of-bag score is stored in oob_score_ (oob_score is only the constructor flag)
print('OOB score :', rand_forest.oob_score_)
print('Train accuracy:', rand_forest.score(X_train, y_train))
print('Test accuracy:', rand_forest.score(X_test, y_test))


In [6]:
#change some parameters and look for improvement (use better of mae and mse)

# First change n_estimators (to 100, the default as of scikit-learn 0.22)
# and change max_features to 0.66 (which Wikipedia claims is a good starting point
# for random forest regression)
n_estimators = 100
max_features = 0.66

rand_forest = RandomForestRegressor(n_estimators = n_estimators, max_features = max_features,
                                    criterion = 'mae', oob_score = True, random_state = seed)
rand_forest.fit(X_train, y_train)
print(rand_forest.get_params())
# The fitted out-of-bag score is stored in oob_score_ (oob_score is only the constructor flag)
print('OOB score :', rand_forest.oob_score_)
print('Train accuracy:', rand_forest.score(X_train, y_train))
print('Test accuracy:', rand_forest.score(X_test, y_test))

# Change other parameters as we see fit (probably max_depth, min_samples_leaf, min_samples_split)