# AdSel Modeling
## Step 3: Modeling
### Step 3a: Cross validate model hyperparameters

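The purpose of this script is to find optimal hyperparameters for the AA predictive models using cross-validation. It is an optional step in the pipeline. Note: the grid search takes a very long time to run (roughly 20 hours on 7 workers for the resident model in the run recorded in Part D).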

### Goals

* Use cross-validation to find optimal hyperparameters for the AA predictive models

### Process

* A. Load data and modules
* B. Set configurations
* C. Preprocessing
* D. Cross-validation

## Part A - Load data and modules

In [1]:
import pandas as pd
import pickle
import numpy as np
import multiprocessing

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.metrics import mean_squared_error

In [4]:
data = pd.read_csv('../../data/processed_data/dataForAaPredictions.csv')

In [5]:
data.head()

|  | folderId | applicationType | aaScore | gpa | ibDiplomaProjected | hsMathLevel | deficiencyArts | yrsArts | deficiencyEnglish | yrsEnglish | ... | hasCollegeGpa | hasCollegeCredits | avoidedIb | avoidedAp | highTOEFL | highIELTS | predictedSAT | calcPredictedSAT | predictedTOEFL | calcPredictedTOEFL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3890DAA4-A2AC-E911-90FA-00505692D664 | Freshman | 6.0 | 3.18 | False | 4.0 | 0.0 | 2.386 | 0.0 | 4.016239 | ... | False | False | False | False | False | False | 1091.027139 | False | 87.240594 | False |
| 1 | E83E41B7-7FB3-E911-90FA-00505692D664 | Freshman | 9.0 | 3.16 | False | 4.0 | 0.0 | 2.386324 | 0.0 | 4.01615 | ... | True | False | False | False | False | False | 1153.546231 | False | 105.764698 | False |
| 2 | EB6B2BB9-FFB8-E911-90FA-00505692D664 | Freshman | 17.0 | 3.95 | False | 5.0 | 0.0 | 2.24716 | 0.0 | 4.006289 | ... | False | False | False | False | False | False | 1285.029809 | False | 92.499766 | False |
| 3 | BEFDE9D7-C7C4-E911-90FB-00505692D664 | Freshman | 12.0 | 3.9 | False | 5.0 | 0.0 | 2.304902 | 0.0 | 3.996059 | ... | False | False | False | False | False | False | 1269.673844 | False | 100.538146 | False |
| 4 | 35D869E3-EDC7-E911-90FB-00505692D664 | Freshman | 8.0 | 3.46 | False | 3.0 | 0.0 | 2.386324 | 0.0 | 4.01615 | ... | True | False | False | False | False | False | 826.700887 | False | 96.279605 | False |


In [6]:
data.columns

Index(['folderId', 'applicationType', 'aaScore', 'gpa', 'ibDiplomaProjected',
       'hsMathLevel', 'deficiencyArts', 'yrsArts', 'deficiencyEnglish',
       'yrsEnglish',
       ...
       'hasCollegeGpa', 'hasCollegeCredits', 'avoidedIb', 'avoidedAp',
       'highTOEFL', 'highIELTS', 'predictedSAT', 'calcPredictedSAT',
       'predictedTOEFL', 'calcPredictedTOEFL'],
      dtype='object', length=314)

## Part B - Set configurations

In [2]:
save_results = True

In [3]:
cores = multiprocessing.cpu_count()
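
The grid search later runs with `cores - 1` workers so that one core stays free. On a single-core machine that expression evaluates to 0, which joblib rejects as a worker count. A minimal sketch of a guard, using a hypothetical helper name `worker_count` that is not part of the original notebook:

```python
import multiprocessing


def worker_count(total_cores, reserve=1):
    """Return the number of workers to use: keep `reserve` cores free, but never go below 1."""
    return max(1, total_cores - reserve)


cores = multiprocessing.cpu_count()
n_workers = worker_count(cores)
print(f"{cores} cores detected, using {n_workers} workers")
```

With the 7 workers reported in the fit log of Part D, the original machine presumably exposed 8 logical cores.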

In [19]:
parameters = { #these can be altered/fine-tuned
    'max_depth': [5, 7, 9],
    'learning_rate': [1, 0.1, 0.01],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5],
    'max_features': [0.5, 0.75, 0.9, 1.0]
}
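
The cost of an exhaustive grid search grows with the product of the list lengths above. As a quick sanity check before launching a long run, the sketch below counts the candidates with the standard library only. The fold count of 3 is an assumption taken from the fit log in Part D, not something set in this cell.

```python
from itertools import product
from math import prod

parameters = {
    'max_depth': [5, 7, 9],
    'learning_rate': [1, 0.1, 0.01],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5],
    'max_features': [0.5, 0.75, 0.9, 1.0],
}

# Number of distinct hyperparameter combinations in the grid
n_candidates = prod(len(values) for values in parameters.values())

# Assumption: 3 cross-validation folds, as reported by the fit log
folds = 3
total_fits = n_candidates * folds

# The first combination the grid would enumerate, for illustration
first_candidate = dict(zip(parameters, next(product(*parameters.values()))))

print(n_candidates, total_fits)  # 432 1296
print(first_candidate)
```

Dropping a single value from any list shrinks the run proportionally, which is usually the cheapest way to shorten the search.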

## Part C - Preprocessing

In [7]:
allData = data.copy()
allData = allData[allData.aaScore.notna()]

In [8]:
len(allData)

179170

In [9]:
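# Collect every column that contains at least one value that cannot be cast to float
cols = []
for col in allData.columns:
    current = allData[col]
    for each in current:
        try:
            float(each)
        except ValueError:
            cols.append(col)
            break  # one failure is enough; skip the rest of this column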

In [10]:
cols = list(set(cols))

In [11]:
cols

['applicationType', 'folderId']
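
The check in cell `In [9]` calls `float()` on individual values in pure Python, which is slow on a frame of this size. A possible vectorized alternative, sketched here on a toy frame, uses `pd.to_numeric(..., errors="coerce")` and flags a column when a value that was present turns into NaN. The helper name `non_numeric_columns` and the toy data are illustrative assumptions, and the result may differ from `float()` on unusual string formats.

```python
import pandas as pd


def non_numeric_columns(df):
    """Return the columns holding at least one non-missing value that is not numeric."""
    bad = []
    for col in df.columns:
        converted = pd.to_numeric(df[col], errors="coerce")
        # A value that existed before but is NaN now could not be parsed as a number
        if (converted.isna() & df[col].notna()).any():
            bad.append(col)
    return bad


# Toy frame (made-up values) mixing identifiers, floats with a gap, booleans and numeric strings
toy = pd.DataFrame({
    "folderId": ["a1", "b2", "c3"],
    "gpa": [3.18, 3.16, None],
    "avoidedIb": [False, True, False],
    "yrsArts": ["2.4", "2.2", "2.3"],
})

result = non_numeric_columns(toy)
print(result)
```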

In [12]:
allData = allData.drop(cols, axis = 1).astype(float)

In [13]:
allData.entryYear.value_counts()

2018.0    44659
2019.0    44564
2017.0    43926
2016.0    42650
2020.0     3371
Name: entryYear, dtype: int64

## Part D - Cross-validation

### First, resident students

In [14]:
currentData = allData[allData['resident'] == 1].copy()

In [15]:
currentX = currentData.drop('aaScore', axis = 1)
currentY = currentData[['aaScore']]

In [16]:
scalerResX = MinMaxScaler()
scaledX = scalerResX.fit_transform(currentX)
scalerResY = MinMaxScaler()
scaledY = scalerResY.fit_transform(currentY)
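
`MinMaxScaler` maps each column to the range [0, 1] via `(x - min) / (max - min)`. Because the target is scaled as well, predictions from a model trained on `scaledY` come out in scaled units and need `scalerResY.inverse_transform` to return to the aaScore scale. A small sketch of both directions, using the five aaScore values from the preview rows as toy input:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# aaScore values from the five preview rows, as a single-column 2-D array
scores = np.array([[6.0], [9.0], [17.0], [12.0], [8.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(scores)

# The same transformation written out by hand
col_min = scores.min(axis=0)
col_max = scores.max(axis=0)
manual = (scores - col_min) / (col_max - col_min)

# Map scaled values (for example, model predictions) back to the original units
restored = scaler.inverse_transform(scaled)

print(scaled.ravel())
print(restored.ravel())
```

Note that the scalers here are fitted on the full resident subset before cross-validation, so each validation fold has already influenced the scaling. Tree-based models are insensitive to monotonic feature scaling, so the effect on `scaledX` should be negligible, but it is worth keeping in mind for the target.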

In [22]:
xgbRes = GradientBoostingRegressor() # scikit-learn gradient boosting (not XGBoost, despite the variable name)

In [23]:
clf = GridSearchCV(xgbRes, parameters, n_jobs = max(1, cores - 1), verbose = 1) #leave one core free, but use at least one worker
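
The call above relies on the library defaults for the number of folds and for the scoring metric. For a regressor the default score is R², and the default fold count has changed between scikit-learn releases, so stating both explicitly makes a run easier to reproduce. The sketch below shows this on small synthetic data. The data, the reduced grid, `n_estimators=20` and the choice of `neg_mean_squared_error` are all illustrative assumptions, not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only
rng = np.random.RandomState(0)
X = rng.rand(120, 4)
y = 2.0 * X[:, 0] + X[:, 1] + 0.1 * rng.randn(120)

# Deliberately tiny grid so the example finishes in seconds
small_grid = {"max_depth": [2, 3], "learning_rate": [0.1, 0.01]}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=20, random_state=0),
    small_grid,
    cv=3,                              # explicit fold count
    scoring="neg_mean_squared_error",  # explicit metric; higher (closer to 0) is better
    n_jobs=1,
)
search.fit(X, y)

print(search.best_params_)
print("best cross-validated MSE:", -search.best_score_)
```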

In [24]:
clf.fit(scaledX, scaledY)

Fitting 3 folds for each of 432 candidates, totalling 1296 fits


[Parallel(n_jobs=7)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=7)]: Done  36 tasks      | elapsed: 18.1min
[Parallel(n_jobs=7)]: Done 186 tasks      | elapsed: 128.7min
[Parallel(n_jobs=7)]: Done 436 tasks      | elapsed: 378.6min
[Parallel(n_jobs=7)]: Done 786 tasks      | elapsed: 623.7min
[Parallel(n_jobs=7)]: Done 1236 tasks      | elapsed: 1081.7min
[Parallel(n_jobs=7)]: Done 1296 out of 1296 | elapsed: 1207.1min finished
  y = column_or_1d(y, warn=True)
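
The truncated `column_or_1d` line above appears to be the tail of scikit-learn's warning about a column-vector target: `scaledY` comes from a one-column frame and therefore has shape `(n_samples, 1)`, while the regressor expects `(n_samples,)`. The fit still succeeds, but flattening the target first should avoid the warning. A minimal sketch with made-up values:

```python
import numpy as np

# A scaled target as produced by MinMaxScaler: one column, 2-D
y_column = np.array([[0.1], [0.4], [1.0]])

# Flatten to the 1-D shape that scikit-learn estimators expect for y
y_flat = y_column.ravel()

print(y_column.shape, "->", y_flat.shape)  # (3, 1) -> (3,)
```

In this notebook that would amount to passing `scaledY.ravel()` to `clf.fit`.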


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=GradientBoostingRegressor(alpha=0.9,
                                                 criterion='friedman_mse',
                                                 init=None, learning_rate=0.1,
                                                 loss='ls', max_depth=3,
                                                 max_features=None,
                                                 max_leaf_nodes=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_weight_fraction_leaf=0.0,
                                                 n_estimators=100,
                                                 n...
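
The progress log above gives enough information for a rough cost model of future runs. The sketch below derives the average cost per fit from the reported totals and uses it to estimate a smaller grid. It assumes every fit costs the same, which is not true in practice (deeper trees and smaller learning rates behave differently), so treat the result as an order-of-magnitude guide. The helper name `estimate_minutes` is hypothetical.

```python
# Totals reported in the fit log
total_fits = 1296
elapsed_min = 1207.1
workers = 7

wall_min_per_fit = elapsed_min / total_fits      # wall-clock minutes per fit
worker_min_per_fit = wall_min_per_fit * workers  # compute minutes per fit on one worker
elapsed_hours = elapsed_min / 60


def estimate_minutes(n_candidates, folds, n_workers, cost_per_fit):
    """Estimate wall-clock minutes for a grid, assuming uniform cost per fit."""
    return n_candidates * folds * cost_per_fit / n_workers


# Example: halving the grid to 216 candidates with the same folds and workers
half_grid_min = estimate_minutes(216, 3, workers, worker_min_per_fit)

print(f"{elapsed_hours:.1f} h total, {worker_min_per_fit:.2f} worker-minutes per fit")
print(f"estimated minutes for 216 candidates: {half_grid_min:.0f}")
```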
                             

In [25]:
paramsRes = clf.best_params_
paramsRes

{'learning_rate': 0.1,
 'max_depth': 7,
 'max_features': 0.5,
 'min_samples_leaf': 2,
 'min_samples_split': 10}
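
The dictionary above can be unpacked straight into a new estimator when the final model is trained. The sketch below does this on synthetic data to show the mechanics. The data and `random_state` are illustrative assumptions, and the actual training step belongs to a later script in the pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Best hyperparameters reported by the grid search for resident students
paramsRes = {
    'learning_rate': 0.1,
    'max_depth': 7,
    'max_features': 0.5,
    'min_samples_leaf': 2,
    'min_samples_split': 10,
}

# Synthetic stand-in for the scaled feature matrix and target
rng = np.random.RandomState(0)
X = rng.rand(200, 6)
y = X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.randn(200)

# Unpack the tuned values; anything not listed keeps its library default
model = GradientBoostingRegressor(random_state=0, **paramsRes)
model.fit(X, y)

preds = model.predict(X[:5])
print(preds)
```

Since `GridSearchCV` refits on the full data by default, `clf.best_estimator_` already holds a model trained with these settings. Saving only the parameters keeps the output small and lets a later step retrain on its own data split.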

In [26]:
if save_results: #save the best hyperparameters (not the fitted model)
    output_name = '../../outputs/resAaParams.pkl'
    with open(output_name, 'wb') as output:
        pickle.dump(paramsRes, output, pickle.HIGHEST_PROTOCOL)
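
A downstream script can read the saved dictionary back with `pickle.load`. The round trip is sketched below against a temporary directory so it does not touch the real `outputs` folder. The temporary path is an assumption made for this example.

```python
import os
import pickle
import tempfile

params = {
    'learning_rate': 0.1,
    'max_depth': 7,
    'max_features': 0.5,
    'min_samples_leaf': 2,
    'min_samples_split': 10,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'resAaParams.pkl')

    # Write the dictionary the same way the notebook does
    with open(path, 'wb') as handle:
        pickle.dump(params, handle, pickle.HIGHEST_PROTOCOL)

    # Read it back, as a later pipeline step would
    with open(path, 'rb') as handle:
        loaded = pickle.load(handle)

print(loaded == params)  # True
```

Because the payload is a plain dictionary of numbers, a JSON file would work equally well and would stay readable across Python versions.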