# Essay Keystrokes Project

This project is based on the Kaggle competition [Linking Writing Processes to Writing Quality](https://www.kaggle.com/competitions/linking-writing-processes-to-writing-quality). The competition's train data set includes scores for about 2.5k essays, along with keystroke logs containing thousands of records per essay and over 8 million records in total.

The goal of the project is to use the data about each student's writing process to predict the score they receive on their essay while minimizing root mean squared error (RMSE). My project includes two notebooks: one for preprocessing and one for modeling.
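
For reference, the error metric is the square root of the mean squared difference between predicted and actual scores. The following is a minimal sketch in plain Python; the function name `rmse` and the sample scores are illustrative assumptions, not competition data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative (made-up) essay scores and predictions
actual = [3.5, 4.0, 2.0, 5.0]
predicted = [3.0, 4.0, 3.0, 4.5]

example_rmse = rmse(actual, predicted)
print(round(example_rmse, 4))
```

Because the errors are squared before averaging, one large miss (the third essay here) weighs more heavily than several small ones.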

## Project Summary

<img src="Essay_DFD.png" alt=" " width="900" height="900" style="float:left; margin-right:10px;">

## Modeling Notebook

In this notebook, I apply the preprocessing function from my previous notebook to prepare the data for modeling. Then I build regression models to predict each essay's score, iterating toward the model with the lowest rmse.

Because the test data is hosted privately on Kaggle, I evaluate each model in this notebook on a held-out 20% split of the train data and use 5-fold cross validation within the grid searches. My final model achieves an rmse of .626 on the unseen test data on Kaggle.
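
Cross-validated rmse can be computed directly with scikit-learn's `cross_val_score`. The sketch below is one way to do it, under stated assumptions: the data is synthetic (from `make_regression`), and the names `demo_pipe`, `X_demo`, and `y_demo` are hypothetical stand-ins rather than objects from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the essay features (not the competition data)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=42)

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge()),
])

# scikit-learn maximizes scores, so rmse is exposed as a negated value
neg_rmse = cross_val_score(demo_pipe, X_demo, y_demo, cv=5,
                           scoring='neg_root_mean_squared_error')
cv_rmse = -neg_rmse

print(cv_rmse.mean(), cv_rmse.std())
```

Reporting the mean and spread across folds gives a sense of how stable the estimate is, which a single split cannot show.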

## 1. Imports and Applying Preprocessing Function

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np

import pandas as pd

import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import StackingRegressor

from xgboost import XGBRegressor

In [2]:
train_df = pd.read_csv('Data/train_logs.csv')
train_scores_df = pd.read_csv('Data/train_scores.csv')

In [3]:
def preprocess(log_df):
    
    # Action Time
    action_time = log_df.groupby(by='id')['action_time'].sum()
    action_time_df = pd.DataFrame({'id':action_time.index, 'action_time':action_time.values})
    # Share of a 30-minute window (1,800,000 ms) spent on logged actions
    action_time_df['action_time_%_total'] = action_time_df['action_time'] / 1800000
    
    # Final Word Count
    final_word_count = log_df.groupby('id')['word_count'].last()
    word_count_df = pd.DataFrame({'id':final_word_count.index,'word_count':final_word_count.values})
    final_record_id_series = log_df.groupby('id')['event_id'].last()
    final_record_ids = final_record_id_series.to_dict()
    
    def total_words(group):
        final_record_id = final_record_ids[group.name]
        word_counts = group[group['event_id'] <= final_record_id]['word_count']
        word_diffs = word_counts.diff().where(lambda x: x > 0).sum()  
    
        return word_diffs
    
    total_words = log_df.groupby('id').apply(total_words)
    total_words_df = pd.DataFrame({'id':total_words.index,'total_words':total_words.values})
    word_count_df = pd.merge(word_count_df, total_words_df, on='id')
    
    # Final Event Count
    event_count = log_df.groupby('id')['event_id'].last()
    event_count_df = pd.DataFrame({'id':event_count.index,'event_count':event_count.values})
    
    # Activity
    activities = ['Input', 'Move', 'Nonproduction', 'Paste', 'Remove/Cut', 'Replace']
    
    # Collapse every 'Move From ... To ...' activity into a single 'Move' category
    log_df.loc[log_df['activity'].str.contains('From',case=False), 'activity'] = 'Move'
    activity_data = log_df.groupby(by='id')['activity'].value_counts().unstack(fill_value=0)
    
    # Ensure every expected activity has a count column, even if it never occurs
    for activity in activities:
        if activity not in activity_data.columns:
            activity_data[activity] = 0
    
    activity_df = activity_data
    activity_df['id'] = activity_df.index
    activity_df.reset_index(drop=True,inplace=True)
    
    pivot = log_df.pivot_table(values='action_time', index='id', columns='activity', aggfunc=['sum']).fillna(0)
    pivot.columns = pivot.columns.droplevel(level=0)
    
    # Ensure every expected activity has a time column, even if it never occurs
    for activity in activities:
        if activity not in pivot.columns:
            pivot[activity] = 0
    
    new_columns = [col + '_time' for col in pivot.columns]
    pivot.columns = new_columns
    pivot['id'] = pivot.index
    pivot.reset_index(drop=True,inplace=True)
    
    # Time of First and Last Input
    first_input = log_df[log_df['activity'] == 'Input'].groupby('id').first()
    first_input_df = pd.DataFrame({'id':first_input.index,'first_input':first_input['down_time']})
    first_input_df.reset_index(drop=True,inplace=True)
    last_input = log_df[log_df['activity'] == 'Input'].groupby('id').last()
    last_input_df = pd.DataFrame({'id':last_input.index,'last_input':last_input['up_time']})
    last_input_df.reset_index(drop=True,inplace=True)
    first_last_input_df = pd.merge(first_input_df,last_input_df,on='id')
    
    # Inputs
    columns = ['q', 'Space', 'Backspace', 'Shift', 'ArrowRight', 'Leftclick', 'ArrowLeft', '.', ',',
               'ArrowDown', 'ArrowUp', 'Enter', 'CapsLock', "'", 'Delete', 'Unidentified', 'Control',
               '"', '-', '?', ';', '=', 'Tab', '/', 'Rightclick', ':', '(', ')', '\\', 'ContextMenu',
               'End', '!', 'Meta', 'Alt', '[', 'c', 'v', 'NumLock', 'Insert', 'Home', 'z', 'AudioVolumeDown',
               'F2', 'a', 'x', 'AudioVolumeUp', '$', '>', ']', '*', '%', '&', 'Dead', 's', 'Escape',
               'ModeChange', 'F3', '<', 'AudioVolumeMute', 'F15', '+', 'ScrollLock', 'Process', 'PageDown',
               't', 'i', '_', '`', 'PageUp', '{', '0', '#', 'Middleclick', '1', '5', 'F12', '\x97',
               'OS', 'e', '@', 'F11', 'r', 'MediaTrackNext', 'y', 'm', 'n', 'b', 'Clear', 'MediaPlayPause', 'o']
    
    down_event_data = log_df.groupby(by='id')['down_event'].value_counts().unstack(fill_value=0)
    
    # Ensure every expected key has a count column, even if it never occurs
    for col in columns:
        if col not in down_event_data.columns:
            down_event_data[col] = 0
    
    # Keep only the expected columns; reset_index returns a new frame with 'id' as a column
    down_event_df = down_event_data[columns].reset_index()
    
            
    # Pauses
    def add_previous_down_time(group):
        group['previous_down_time'] = group['down_time'].shift(1).fillna(group['down_time'])
        return group[['id','down_time','previous_down_time']]

    pause_df = log_df.groupby('id').apply(add_previous_down_time)
    
    pause_df['down_time_difference'] = pause_df['down_time'] - pause_df['previous_down_time']
    pause_df['pauses'] = 0
    pause_df.loc[pause_df['down_time_difference']>=5000,'pauses'] = 1
    
    pause_df['pause_time'] = 0
    pause_df.loc[pause_df['down_time_difference']>=5000,'pause_time'] = pause_df['down_time_difference']
    
    pause_df = pause_df.reset_index(drop=True)
    pause_df = pause_df[['id','pauses','pause_time']]
    pause_df = pause_df.groupby('id').sum()
    
    # Merge DataFrames
    full_df = pd.merge(action_time_df, word_count_df, on = 'id')
    full_df = pd.merge(full_df, pivot, on = 'id')
    full_df = pd.merge(full_df, event_count_df, on = 'id')
    full_df = pd.merge(full_df, activity_df, on = 'id')
    full_df = pd.merge(full_df, first_last_input_df, on = 'id')
    full_df = pd.merge(full_df, down_event_df, on = 'id')
    full_df = pd.merge(full_df, pause_df, on = 'id')
    
    # Sentence Length
    full_df['end_punctuation'] = full_df['.'] + full_df['!'] + full_df['?']
    full_df['sentence_length'] = full_df['word_count'] / full_df['end_punctuation']
    full_df.loc[full_df['end_punctuation']==0,'sentence_length'] = full_df['word_count']
    
    # Ratios
    full_df['Move_Ratio'] = full_df['Move'] / full_df['event_count']
    full_df['Paste_Ratio'] = full_df['Paste'] / full_df['event_count']
    full_df['Replace_Ratio'] = full_df['Replace'] / full_df['event_count']
    full_df['Nonproduction_Ratio'] = full_df['Nonproduction'] / full_df['event_count']
    full_df['Remove_Ratio'] = full_df['Remove/Cut'] / full_df['event_count']
    full_df['Input_Ratio'] = full_df['Input'] / full_df['event_count']
    full_df['revision_events'] = full_df['Move'] + full_df['Paste'] + full_df['Replace'] + full_df['Remove/Cut']
    full_df['Revision_Ratio'] =  full_df['revision_events'] / full_df['event_count']
    full_df['characters_per_pause'] = full_df['q'] / full_df['pauses']
    full_df.loc[full_df['pauses']==0,'characters_per_pause'] = full_df['q']
    full_df['Move_Time_Ratio'] = full_df['Move_time'] / full_df['action_time']
    full_df['Paste_Time_Ratio'] = full_df['Paste_time'] / full_df['action_time']
    full_df['Replace_Time_Ratio'] = full_df['Replace_time'] / full_df['action_time']
    full_df['Nonproduction_Time_Ratio'] = full_df['Nonproduction_time'] / full_df['action_time']
    full_df['Remove_Time_Ratio'] = full_df['Remove/Cut_time'] / full_df['action_time']
    full_df['Input_Time_Ratio'] = full_df['Input_time'] / full_df['action_time']
    full_df['revision_time'] = full_df['Move_time'] + full_df['Paste_time'] + full_df['Replace_time'] + full_df['Remove/Cut_time']
    full_df['Revision_Time_Ratio'] =  full_df['revision_time'] / full_df['action_time']
    full_df['action_time_per_pause'] = full_df['action_time'] / full_df['pauses']
    full_df.loc[full_df['pauses']==0,'action_time_per_pause'] = full_df['action_time']
    full_df['action_time_per_pause_time'] = full_df['action_time'] / full_df['pause_time']
    full_df.loc[full_df['pause_time']==0,'action_time_per_pause_time'] = full_df['action_time']
    
    return full_df
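
The pause features in `preprocess` count gaps of at least 5 seconds between consecutive key presses within an essay. As an alternative sketch (not the code used above), the same idea can be expressed with a grouped `diff`; the tiny log, `demo_log`, and `PAUSE_THRESHOLD_MS` are made-up names and values for illustration.

```python
import pandas as pd

# Tiny made-up log: two essays, down_time in milliseconds
demo_log = pd.DataFrame({
    'id': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'down_time': [0, 1000, 7000, 7500, 0, 6000, 12000],
})

PAUSE_THRESHOLD_MS = 5000  # same 5-second cutoff as in preprocess

# Gap since the previous key press of the same essay; the first event has no gap
gaps = demo_log.groupby('id')['down_time'].diff().fillna(0)
is_pause = gaps >= PAUSE_THRESHOLD_MS

pause_demo = pd.DataFrame({
    'id': demo_log['id'],
    'pauses': is_pause.astype(int),
    'pause_time': gaps.where(is_pause, 0),
}).groupby('id').sum()

print(pause_demo)
```

In this toy log, essay `a` has one pause totalling 6000 ms and essay `b` has two pauses totalling 12000 ms.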

In [4]:
## Applying preprocess function

full_df = preprocess(train_df)

In [5]:
## Merging with scores

full_df = pd.merge(full_df,train_scores_df,on='id')
full_df

Unnamed: 0,id,action_time,action_time_%_total,word_count,total_words,Input_time,Move_time,Nonproduction_time,Paste_time,Remove/Cut_time,...,Paste_Time_Ratio,Replace_Time_Ratio,Nonproduction_Time_Ratio,Remove_Time_Ratio,Input_Time_Ratio,revision_time,Revision_Time_Ratio,action_time_per_pause,action_time_per_pause_time,score
0,001519c8,297243,0.165135,255,348.0,243731.0,0.0,18506.0,0.0,34130.0,...,0.000000,0.002947,0.062259,0.114822,0.819972,35006.0,0.117769,4644.421875,0.300140,3.5
1,0022f953,275391,0.152995,320,369.0,237891.0,0.0,13781.0,71.0,23550.0,...,0.000258,0.000356,0.050042,0.085515,0.863830,23719.0,0.086128,5859.382979,0.255246,3.5
2,0042269b,421201,0.234001,404,549.0,353718.0,0.0,33951.0,0.0,32905.0,...,0.000000,0.001489,0.080605,0.078122,0.839784,33532.0,0.079610,12388.264706,0.427964,6.0
3,0059420b,189596,0.105331,206,244.0,167790.0,0.0,3062.0,160.0,18410.0,...,0.000844,0.000918,0.016150,0.097101,0.884987,18744.0,0.098863,6319.866667,0.315773,2.0
4,0075873a,313702,0.174279,252,339.0,266515.0,0.0,6988.0,0.0,40199.0,...,0.000000,0.000000,0.022276,0.128144,0.849580,40199.0,0.128144,6402.081633,0.340198,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2466,ffb8c745,499670,0.277594,273,620.0,426990.0,0.0,5203.0,0.0,67253.0,...,0.000000,0.000448,0.010413,0.134595,0.854544,67477.0,0.135043,20819.583333,0.542786,3.5
2467,ffbef7e5,214221,0.119012,438,450.0,203403.0,0.0,6583.0,0.0,4118.0,...,0.000000,0.000546,0.030730,0.019223,0.949501,4235.0,0.019769,5100.500000,0.260813,4.0
2468,ffccd6fd,231580,0.128656,201,213.0,214677.0,0.0,10232.0,0.0,6671.0,...,0.000000,0.000000,0.044183,0.028806,0.927010,6671.0,0.028806,6811.176471,0.236213,1.5
2469,ffec5b38,289439,0.160799,413,472.0,263216.0,0.0,5624.0,0.0,20599.0,...,0.000000,0.000000,0.019431,0.071169,0.909401,20599.0,0.071169,9647.966667,0.453550,5.0


In [6]:
## Checking for nulls

full_df.isnull().sum().sum()

0

## 2. Modeling

### 2a. Train Test Split and Pipeline Setup

In [7]:
# Defining independent and dependent variables

X = full_df.drop(['id','score'],axis=1)
y = full_df['score']

In [8]:
# Performing train test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state=42)

In [9]:
# Creating subpipeline to scale the data; all features are numerical

sub_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

In [10]:
# Creating column transformer

ct = ColumnTransformer(transformers = [
    ('sub', sub_pipe, X.columns)
])

### 2b. Dummy: Baseline Model

I start by building a dummy model to create baseline rates to evaluate the performance of the other models I create. I evaluate all models in terms of rmse and r2 even though rmse is the only metric used for the competition. Using r2 as well provides a more complete description of how each model performs.

In [11]:
# Creating dummy pipeline

dum_pipe = Pipeline([
    ('ct', ct),
    ('dummy', DummyRegressor(strategy='mean'))
])

In [12]:
dum_pipe.fit(X_train, y_train)

In [13]:
# Calculating dummy rmse (note: computed on the train split)

dummy_rmse = np.sqrt(mean_squared_error(y_train,dum_pipe.predict(X_train)))
dummy_rmse

1.0338843406754694

In [14]:
# Calculating dummy r2 (computed on the test split)

dummy_r2 = r2_score(y_test,dum_pipe.predict(X_test))
dummy_r2

-0.0001285506665325009

In [15]:
# Creating results dataframe and appending dummy scores

results_df = pd.DataFrame({'Model':'Dummy','RMSE':dummy_rmse,'R2':dummy_r2}, index=[0])
results_df

Unnamed: 0,Model,RMSE,R2
0,Dummy,1.033884,-0.000129


The dummy regressor, which predicts the mean score from the train data for every essay, has an rmse of a little over 1 (measured here on the train split) and accounts for none of the variation in the data. An r2 of essentially zero makes sense because the predictions never vary; this rmse serves as a baseline that shows how much value the later models add over simply predicting that every writer achieves an average score.
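
To see why a constant mean prediction yields an r2 of zero on the data it was fit on, and a slightly negative r2 on other data, consider this minimal plain-Python sketch. The scores are made up for illustration, and the helper `r2` is a hypothetical hand-written version of the usual formula, 1 - SS_res / SS_tot.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up scores standing in for the train and test splits
train_scores = [2.0, 3.0, 4.0, 5.0]
test_scores = [2.5, 3.5, 5.0]

# The dummy model always predicts the train mean
train_mean = sum(train_scores) / len(train_scores)

r2_train = r2(train_scores, [train_mean] * len(train_scores))
r2_test = r2(test_scores, [train_mean] * len(test_scores))

print(r2_train, r2_test)
```

On the train scores the residuals equal the deviations from the mean, so r2 is exactly zero. On the test scores the train mean differs a little from the test mean, which pushes r2 just below zero, the same pattern as the -0.0001 reported above.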

### 2c. MLR: First Simple Model

Next, I use a multiple linear regression as a first simple model to see how it performs compared to the dummy. I also create a function to append the results of each model to a rolling dataframe to make it easier to evaluate the models against one another. I append all results to the results dataframe created during the dummy model section.

In [16]:
# Creating the MLR pipeline

mlr_pipe = Pipeline(steps=[
    ('ct', ct),
    ('MLR', LinearRegression())
])

In [17]:
mlr_pipe.fit(X_train,y_train)

In [18]:
# Creating a function to append the results for each model to a rolling dataframe

def results(pipe, results_df, X, y, split):
    
    # Defining predictions
    y_pred = pipe.predict(X)
    
    # Defining variables for rmse and r2
    mse = mean_squared_error(y, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y, y_pred)
    
    # Extracting model name from the last step in the pipeline
    model_name = pipe.steps[-1][0]
    
    # Appending results to the DataFrame
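    # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
    new_row = pd.DataFrame([{'Model': model_name, 'RMSE': rmse, 'R2': r2, 'Split': split}])
    results_df = pd.concat([results_df, new_row], ignore_index=True)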
    
    return results_df

In [19]:
# Finding results for MLR train and test

results_df = results(mlr_pipe, results_df, X_train, y_train, 'train')
results_df = results(mlr_pipe, results_df, X_test, y_test, 'test')
results_df

Unnamed: 0,Model,RMSE,R2,Split
0,Dummy,1.033884,-0.000129,
1,MLR,0.701514,0.539606,train
2,MLR,0.710663,0.481872,test


The MLR model explains a little less than half of the variance in the data and is typically off by about .7 per prediction. This is a substantial improvement from the dummy and sets a new score for the subsequent models to beat.

### 2d. Ridge

Next, I try a ridge regressor as a quick way to see if regularization could help improve the first simple model.

In [20]:
# Creating pipeline for ridge regressor

ridge_pipe = Pipeline([
    ('ct', ct),
    ('Ridge', Ridge())
])

In [21]:
ridge_pipe.fit(X_train,y_train)

In [22]:
# Appending results to the rolling dataframe

results_df = results(ridge_pipe, results_df, X_train, y_train, 'train')
results_df = results(ridge_pipe, results_df, X_test, y_test, 'test')
results_df

Unnamed: 0,Model,RMSE,R2,Split
0,Dummy,1.033884,-0.000129,
1,MLR,0.701514,0.539606,train
2,MLR,0.710663,0.481872,test
3,Ridge,0.701801,0.53923,train
4,Ridge,0.705218,0.489781,test


The ridge regressor performs slightly better than the MLR model, but not by much. The models could be improved by trying to reduce factors like multicollinearity; first, I check more models to see whether doing more work with the linear regression models would be worth it.
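
As a rough illustration of what regularization does when features are highly correlated, the sketch below fits ordinary least squares and ridge on two nearly identical synthetic features. All data and names here (`x1`, `x2`, `X_demo`, `y_demo`) are assumptions made for the example, not the essay features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)

# Two nearly identical (highly collinear) synthetic features
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X_demo = np.column_stack([x1, x2])
y_demo = 3 * x1 + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Ridge penalizes the squared size of the coefficients, so its
# coefficient vector is never longer than the least-squares one
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print(ols.coef_, ols_norm)
print(ridge.coef_, ridge_norm)
```

With collinear inputs, least squares can assign large offsetting weights to the two features, while ridge tends to spread a smaller weight across both. When collinearity is mild, the two fits end up close, which is consistent with the small difference between the MLR and ridge rows in the table.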

### 2e. Random Forest

The next model I try is random forest. As an ensemble model based on decision trees, it is less sensitive to factors such as multicollinearity and therefore may perform better than the linear models.

In [23]:
# Creating random forest pipeline

rf_pipe = Pipeline([
    ('ct', ct),
    ('RF', RandomForestRegressor(random_state=42))
])

In [24]:
rf_pipe.fit(X_train,y_train)

In [25]:
# Appending results to the rolling dataframe

results_df = results(rf_pipe, results_df, X_train, y_train, 'train')
results_df = results(rf_pipe, results_df, X_test, y_test, 'test')
results_df

Unnamed: 0,Model,RMSE,R2,Split
0,Dummy,1.033884,-0.000129,
1,MLR,0.701514,0.539606,train
2,MLR,0.710663,0.481872,test
3,Ridge,0.701801,0.53923,train
4,Ridge,0.705218,0.489781,test
5,RF,0.255033,0.939152,train
6,RF,0.585229,0.648632,test


The random forest model performs substantially better than the ridge or MLR models, explaining nearly 65% of the variance in the test data and achieving a lower rmse. With no hyperparameter tuning, it is also significantly overfit to the train data.

Because the model performs this well, I use a grid search to tune hyperparameters and try to maximize model performance. To minimize the computing power required at any one time, I split the grid search in half and perform two separate searches.
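
The saving from splitting the search can be estimated by counting cross-validation fits. The sketch below uses the two random forest grids from this section with 5 folds; the helper `n_candidates` is a hypothetical name, and the count ignores the single refit that `GridSearchCV` performs at the end of each search.

```python
from itertools import product

cv_folds = 5

# The two grids used in the random forest searches
first_grid = {'n_estimators': [500, 600, 700], 'max_depth': [7, 9, 11]}
second_grid = {'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]}

def n_candidates(grid):
    """Number of parameter combinations a full grid search would try."""
    return len(list(product(*grid.values())))

# Two sequential searches: candidate counts add up
split_fits = (n_candidates(first_grid) + n_candidates(second_grid)) * cv_folds

# One joint search over all four parameters: candidate counts multiply
joint_fits = n_candidates({**first_grid, **second_grid}) * cv_folds

print(split_fits, joint_fits)  # 90 405
```

The trade-off is that a sequential search fixes the first pair of parameters before exploring the second, so it can miss combinations that only work well together.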

In [26]:
# Selecting first parameters for gridsearch

params_rf = {
    'RF__n_estimators' : [500, 600, 700],
    'RF__max_depth' : [7, 9, 11] 
}

In [27]:
# Instantiating gridsearch variable

gs_rf = GridSearchCV(
    estimator = rf_pipe,
    param_grid = params_rf,
    cv = 5,
    verbose = 1)

In [28]:
# For time purposes, I comment these lines out and use the parameters in the code that follows.

#gs_rf.fit(X_train,y_train)

In [29]:
#gs_rf.best_params_

In [30]:
# Creating a new pipeline with the first gridsearch parameters

gs_rf_pipe = Pipeline([
    ('ct', ct),
    ('GS_RF',RandomForestRegressor(max_depth=7,n_estimators=500,random_state=42))
])

In [31]:
gs_rf_pipe.fit(X_train,y_train)

In [32]:
# Defining parameters for the second gridsearch

params_rf1 = {
    'GS_RF__min_samples_split': [2, 5, 10],
    'GS_RF__min_samples_leaf': [1, 2, 4]
}

In [33]:
# Instantiating variable for second gridsearch

gs_rf1 = GridSearchCV(
    estimator = gs_rf_pipe,
    param_grid = params_rf1,
    cv = 5,
    verbose = 1)

In [34]:
# gs_rf1.fit(X_train,y_train)

In [35]:
# gs_rf1.best_params_

In [37]:
# Creating a new pipeline with the parameters from both gridsearches

gs_rf_pipe1 = Pipeline([
    ('ct', ct),
    ('GS_RF',RandomForestRegressor(max_depth=7,
                                   n_estimators=500,
                                   min_samples_leaf=2,
                                   min_samples_split=2,
                                   random_state=42))
])

In [39]:
gs_rf_pipe1.fit(X_train, y_train)

In [40]:
# Appending results to the rolling dataframe

results_df = results(gs_rf_pipe1, results_df, X_train, y_train, 'train')
results_df = results(gs_rf_pipe1, results_df, X_test, y_test, 'test')
results_df

Unnamed: 0,Model,RMSE,R2,Split
0,Dummy,1.033884,-0.000129,
1,MLR,0.701514,0.539606,train
2,MLR,0.710663,0.481872,test
3,Ridge,0.701801,0.53923,train
4,Ridge,0.705218,0.489781,test
5,RF,0.255033,0.939152,train
6,RF,0.585229,0.648632,test
7,GS_RF,0.499705,0.766394,train
8,GS_RF,0.588884,0.64423,test


The grid search did not significantly alter the performance of the random forest model on the test data; in fact, after the grid search, it performed slightly worse. The grid search did reduce the overfitting, though.

Both with and without hyperparameter tuning, the random forest model performed much better than the linear models did. I continue to try different models to see if they can outperform random forest.
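
One simple way to quantify the reduction in overfitting is the gap between test and train rmse. The sketch below copies the random forest values from the results table above; the dictionary layout and the name `gaps` are illustrative choices.

```python
# RMSE values copied from the results table above
rmse_table = {
    'RF':    {'train': 0.255033, 'test': 0.585229},
    'GS_RF': {'train': 0.499705, 'test': 0.588884},
}

# Generalization gap: how much worse the model is on unseen data
gaps = {name: round(s['test'] - s['train'], 6) for name, s in rmse_table.items()}

print(gaps)
```

The gap shrinks from roughly .33 for the untuned model to roughly .09 for the tuned one, while the test rmse itself barely moves.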

### 2f. Gradient Boost

In [41]:
# Creating the pipeline

gb_pipe = Pipeline([
    ('ct', ct),
    ('GB', GradientBoostingRegressor(random_state=42))
])

In [42]:
gb_pipe.fit(X_train, y_train)

In [43]:
# Appending the results to the rolling dataframe

results_df = results(gb_pipe, results_df, X_train, y_train,'train')
results_df = results(gb_pipe, results_df, X_test, y_test,'test')
results_df

Unnamed: 0,Model,RMSE,R2,Split
0,Dummy,1.033884,-0.000129,
1,MLR,0.701514,0.539606,train
2,MLR,0.710663,0.481872,test
3,Ridge,0.701801,0.53923,train
4,Ridge,0.705218,0.489781,test
5,RF,0.255033,0.939152,train
6,RF,0.585229,0.648632,test
7,GS_RF,0.499705,0.766394,train
8,GS_RF,0.588884,0.64423,test
9,GB,0.531387,0.735833,train


Even without any hyperparameter tuning, the gradient boosting regressor performs slightly better on the unseen data than the random forest model did. It also produces a better r2 score. Because it performs well, I use a grid search to tune hyperparameters, following the same two-part approach that I used for the random forest grid search.
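
When `scoring` is not passed, `GridSearchCV` ranks candidates with the estimator's own `score` method, which for scikit-learn regressors is r2. If the goal is to select on rmse directly, one option is to pass `scoring='neg_root_mean_squared_error'`. The sketch below shows this on synthetic data; `demo_pipe`, `demo_search`, and the small grid are hypothetical and chosen only to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the essay features)
X_demo, y_demo = make_regression(n_samples=150, n_features=8,
                                 noise=15.0, random_state=42)

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('GB', GradientBoostingRegressor(n_estimators=50, random_state=42)),
])

demo_search = GridSearchCV(
    estimator=demo_pipe,
    param_grid={'GB__learning_rate': [0.05, 0.1, 0.2]},
    cv=5,
    scoring='neg_root_mean_squared_error',  # select on rmse instead of the default r2
)
demo_search.fit(X_demo, y_demo)

best_rate = demo_search.best_params_['GB__learning_rate']
best_cv_rmse = -demo_search.best_score_  # flip the sign back to a positive rmse

print(best_rate, best_cv_rmse)
```

The two criteria often agree on the best candidate, but selecting on the competition metric removes any doubt about what the search is optimizing.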

In [44]:
# Defining first parameters

params_gb = {
    'GB__learning_rate' : [.01, .05, .1, .2, .3],
}

In [45]:
# Instantiating first grid search

gs_gb = GridSearchCV(
    estimator = gb_pipe,
    param_grid = params_gb,
    cv = 5,
    verbose = 1
)

In [46]:
# Commented out to save time/computing power

# gs_gb.fit(X_train,y_train)

In [47]:
# gs_gb.best_params_

In [48]:
# Creating first gradient boost pipeline

gs_gb_pipe = Pipeline([
    ('ct', ct),
    ('GS_GB', GradientBoostingRegressor(learning_rate=.05, random_state=42))
])

In [49]:
gs_gb_pipe.fit(X_train, y_train)

In [50]:
# Defining second parameter set

params_gb1 = {
    'GS_GB__n_estimators': [50, 100, 200],
    'GS_GB__max_depth': [3, 5, 7]
}

In [51]:
# Instantiating second grid search

gs_gb1 = GridSearchCV(
    estimator = gs_gb_pipe,
    param_grid = params_gb1,
    cv = 5,
    verbose = 1
)

In [52]:
# gs_gb1.fit(X_train,y_train)

In [53]:
# gs_gb1.best_params_

In [54]:
# Creating final pipeline with grid search results

gs_gb_pipe1 = Pipeline([
    ('ct', ct),
    ('GS_GB', GradientBoostingRegressor(learning_rate=.05,
                                        max_depth=3,
                                        n_estimators=100,
                                        random_state=42))
])

In [56]:
gs_gb_pipe1.fit(X_train,y_train)

In [57]:
# Appending results to the rolling dataframe

results_df = results(gs_gb_pipe1, results_df, X_train, y_train, 'train')
results_df = results(gs_gb_pipe1, results_df, X_test, y_test, 'test')
results_df

Unnamed: 0,Model,RMSE,R2,Split
0,Dummy,1.033884,-0.000129,
1,MLR,0.701514,0.539606,train
2,MLR,0.710663,0.481872,test
3,Ridge,0.701801,0.53923,train
4,Ridge,0.705218,0.489781,test
5,RF,0.255033,0.939152,train
6,RF,0.585229,0.648632,test
7,GS_RF,0.499705,0.766394,train
8,GS_RF,0.588884,0.64423,test
9,GB,0.531387,0.735833,train


Once again, the grid search reduced the overfitting to the train data but performed worse on the unseen data than the untuned model in terms of both rmse and r2.

### 2g. AdaBoost

Next, I try the same approach with an AdaBoost regressor.

In [58]:
# Creating the pipeline

ada_pipe = Pipeline([
    ('ct', ct),
    ('ADA', AdaBoostRegressor(random_state=42))
])

In [59]:
ada_pipe.fit(X_train, y_train)

In [60]:
# Appending results to the dataframe

results_df = results(ada_pipe, results_df, X_train, y_train, 'train')
results_df = results(ada_pipe, results_df, X_test, y_test, 'test')
results_df

Unnamed: 0,Model,RMSE,R2,Split
0,Dummy,1.033884,-0.000129,
1,MLR,0.701514,0.539606,train
2,MLR,0.710663,0.481872,test
3,Ridge,0.701801,0.53923,train
4,Ridge,0.705218,0.489781,test
5,RF,0.255033,0.939152,train
6,RF,0.585229,0.648632,test
7,GS_RF,0.499705,0.766394,train
8,GS_RF,0.588884,0.64423,test
9,GB,0.531387,0.735833,train


The model starts out performing worse than the previous two. I tune the learning rate using grid search to see whether it would be worth continuing to further tune the model.

In [None]:
params_ada = {
    'ADA__learning_rate' : [.01, .05, .1, .2, .3],
}

In [62]:
# Instantiating grid search

gs_ada = GridSearchCV(
    estimator = ada_pipe,
    param_grid = params_ada,
    cv = 5,
    verbose = 1
)

In [63]:
#gs_ada.fit(X_train,y_train)

In [64]:
#gs_ada.best_params_

In [65]:
# Creating the pipeline with tuning

gs_ada_pipe = Pipeline([
    ('ct',ct),
    ('GS_ADA',AdaBoostRegressor(learning_rate=.2,random_state=42))
])

In [66]:
gs_ada_pipe.fit(X_train,y_train)

In [67]:
# Appending results to the rolling data frame

results_df = results(gs_ada_pipe, results_df, X_train, y_train, 'train')
results_df = results(gs_ada_pipe, results_df, X_test, y_test, 'test')
results_df

Unnamed: 0,Model,RMSE,R2,Split
0,Dummy,1.033884,-0.000129,
1,MLR,0.701514,0.539606,train
2,MLR,0.710663,0.481872,test
3,Ridge,0.701801,0.53923,train
4,Ridge,0.705218,0.489781,test
5,RF,0.255033,0.939152,train
6,RF,0.585229,0.648632,test
7,GS_RF,0.499705,0.766394,train
8,GS_RF,0.588884,0.64423,test
9,GB,0.531387,0.735833,train


After one round of grid search, the model performs better, but it is still not as good as the gradient boosting regressor or the random forest regressor. I stop tuning it to try one final model.

### 2h. XGBoost

The last model I try before trying a stacking model with the best performers is XGB.

In [68]:
# Creating the pipeline

xgb_pipe = Pipeline([
    ('ct', ct),
    ('XGB', XGBRegressor(random_state=42))
])

In [69]:
xgb_pipe.fit(X_train, y_train)

In [70]:
# Appending results to the rolling dataframe

results_df = results(xgb_pipe, results_df, X_train, y_train, 'train')
results_df = results(xgb_pipe, results_df, X_test, y_test, 'test')
results_df

Unnamed: 0,Model,RMSE,R2,Split
0,Dummy,1.033884,-0.000129,
1,MLR,0.701514,0.539606,train
2,MLR,0.710663,0.481872,test
3,Ridge,0.701801,0.53923,train
4,Ridge,0.705218,0.489781,test
5,RF,0.255033,0.939152,train
6,RF,0.585229,0.648632,test
7,GS_RF,0.499705,0.766394,train
8,GS_RF,0.588884,0.64423,test
9,GB,0.531387,0.735833,train


XGB performs worse on the unseen data than random forest and gradient boost. It is also extremely overfit to the train data without tuning, with 99% of variance explained and an rmse of .08 on the train split. I run two rounds of grid search to tune the model so that it overfits less and generalizes better to unseen data.

In [71]:
params_xgb = {
    'XGB__learning_rate' : [.01, .05, .1, .2, .3],
}

In [72]:
# Instantiating first grid search

gs_xgb = GridSearchCV(
    estimator = xgb_pipe,
    param_grid = params_xgb,
    cv = 5,
    verbose = 1
)

In [73]:
# gs_xgb.fit(X_train, y_train)

In [74]:
# gs_xgb.best_params_

In [75]:
# Creating pipeline

gs_xgb_pipe = Pipeline([
    ('ct',ct),
    ('GS_XGB',XGBRegressor(learning_rate = .05, random_state=42))
])

In [76]:
gs_xgb_pipe.fit(X_train, y_train)

In [77]:
# Defining hyperparameters for second grid search

params_xgb1 = {
    'GS_XGB__n_estimators': [50, 100, 200],
    'GS_XGB__max_depth': [3, 5, 7],
    'GS_XGB__min_child_weight': [1, 3, 5]
}

In [78]:
# Instantiating second grid search

gs_xgb1 = GridSearchCV(
    estimator = gs_xgb_pipe,
    param_grid = params_xgb1,
    cv = 5,
    verbose = 1
)

In [79]:
# gs_xgb1.fit(X_train,y_train)

In [80]:
# gs_xgb1.best_params_

In [81]:
# Creating the pipeline with hyperparameter tuning

gs_xgb_pipe1 = Pipeline([
    ('ct',ct),
    ('GS_XGB',XGBRegressor(learning_rate = .05, 
                           max_depth = 3,
                           min_child_weight = 3,
                           n_estimators = 100,
                           random_state=42))
])

In [82]:
gs_xgb_pipe1.fit(X_train,y_train)

In [83]:
results_df = results(gs_xgb_pipe1, results_df, X_train, y_train, 'train')
results_df = results(gs_xgb_pipe1, results_df, X_test, y_test, 'test')
results_df

Unnamed: 0,Model,RMSE,R2,Split
0,Dummy,1.033884,-0.000129,
1,MLR,0.701514,0.539606,train
2,MLR,0.710663,0.481872,test
3,Ridge,0.701801,0.53923,train
4,Ridge,0.705218,0.489781,test
5,RF,0.255033,0.939152,train
6,RF,0.585229,0.648632,test
7,GS_RF,0.499705,0.766394,train
8,GS_RF,0.588884,0.64423,test
9,GB,0.531387,0.735833,train


After tuning, the XGB model performs almost identically on the train and test data in terms of rmse. It is also one of the strongest models so far. Given these results, I try one final stacking regressor with my three best models to see if it performs better than any of the individual models. I use the tuned versions of the models even though they sometimes performed slightly worse on the test data, because they were less overfit and should generalize better.

### 2i. Stacking Regressor

In [88]:
# Identifying the three best models

estimator_list = [
    ('rf', gs_rf_pipe1),
    ('gb', gs_gb_pipe1),
    ('xgb', gs_xgb_pipe1)
]

In [89]:
# Building the stacking model using a linear regression as the final estimator

stack_model = StackingRegressor(
                    estimators=estimator_list,
                    final_estimator=LinearRegression(),
)

In [90]:
stack_model.fit(X_train, y_train)

In [91]:
# Calculating rmse

stack_rmse = np.sqrt(mean_squared_error(y_test,stack_model.predict(X_test)))
stack_rmse

0.5762736201650145

In [92]:
# Calculating r2

stack_r2 = r2_score(y_test,stack_model.predict(X_test))
stack_r2

0.6593038870517494

The stacking regressor performs slightly better than the previous best-performing model, the XGB with hyperparameter tuning. With an rmse of .576 and an r2 of .659 on the test split, this is my best-performing model.
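
For readers unfamiliar with stacking, the sketch below shows the mechanics on synthetic data: each base model produces out-of-fold predictions, and the final linear regression learns one weight per base model from them. The data, the two untuned base models, and the names `demo_stack` and `stack_weights` are assumptions for illustration, not the tuned pipelines used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the essay features)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

demo_stack = StackingRegressor(
    estimators=[
        ('rf', RandomForestRegressor(n_estimators=50, random_state=42)),
        ('gb', GradientBoostingRegressor(n_estimators=50, random_state=42)),
    ],
    final_estimator=LinearRegression(),
    cv=5,  # base-model predictions fed to the final estimator are out-of-fold
)
demo_stack.fit(X_tr, y_tr)

# The fitted final estimator holds one weight per base model
stack_weights = dict(zip(['rf', 'gb'], demo_stack.final_estimator_.coef_))
demo_preds = demo_stack.predict(X_te)

print(stack_weights)
```

Inspecting the learned weights is a quick way to see which base model the ensemble leans on most.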

## 3. Results and Conclusions