# Predictive Modelling
This notebook sets up predictive models to estimate the salary for a job opening. It covers the following steps:

1. Preprocessing Data
2. Training different models
3. Finding the best model (the one with the lowest MSE)

In [1]:
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

## Data Preparation and Preprocessing

### Loading Files

Previously we cleaned the training CSV file and saved it as 'train_df_clean.csv'. This file is already consolidated on the jobkey_id index. We load it here along with the test features file, 'test_features.csv'.

In [2]:
clean_train_df = pd.read_csv('train_df_clean.csv')
clean_test_df = pd.read_csv('test_features.csv')

### Shuffling Dataframe
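Shuffling the dataframe can make the cross-validation scores more reliable: with an integer `cv`, `cross_val_score` splits a regression dataset into consecutive folds without shuffling, so any ordering in the file would otherwise carry over into the folds.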

In [3]:
clean_train_df = shuffle(clean_train_df).reset_index(drop=True)
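The shuffle above is unseeded, so each run produces a different row order and slightly different cross-validation scores. Below is a minimal sketch of a reproducible shuffle. The toy frame `toy_df` and the seed value 42 are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical toy frame standing in for clean_train_df
toy_df = pd.DataFrame({'yearsExperience': [1, 5, 10, 3],
                       'salary': [50, 90, 130, 70]})

# Fixing random_state makes the row order identical across runs
shuffled_a = shuffle(toy_df, random_state=42).reset_index(drop=True)
shuffled_b = shuffle(toy_df, random_state=42).reset_index(drop=True)

# Same seed, same order; the set of rows itself is unchanged
print(shuffled_a.equals(shuffled_b))
```

Passing the same `random_state` in the notebook's own `shuffle` call would make the reported MSE values repeatable.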

### Define Variables

In [4]:
categorical_vars = ['companyId', 'jobType', 'degree', 'major', 'industry']
numeric_vars = ['yearsExperience', 'milesFromMetropolis']
target_var = 'salary'

### Get Target Variable

In [5]:
target_df = clean_train_df[target_var]

### One Hot Encoding

In [6]:
def one_hot_encode_feature_df(df, cat_vars=None, num_vars=None):
    '''performs one-hot encoding on all categorical variables and combines result with continuous variables'''
    cat_df = pd.get_dummies(df[cat_vars])
    num_df = df[num_vars].apply(pd.to_numeric)
    return pd.concat([cat_df, num_df], axis=1)

In [7]:
feature_df = one_hot_encode_feature_df(clean_train_df, cat_vars=categorical_vars, num_vars=numeric_vars)
test_df = one_hot_encode_feature_df(clean_test_df, cat_vars=categorical_vars, num_vars=numeric_vars)
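Encoding the training and test frames separately gives matching columns only when both contain the same category levels. If a level is missing from one of them, the column sets differ and a model fitted on the training features cannot score the test features. Below is one possible way to align them, using hypothetical toy frames (`toy_train`, `toy_test`) rather than the notebook's data.

```python
import pandas as pd

# Hypothetical toy frames: the test set lacks the 'PHD' degree level
toy_train = pd.DataFrame({'degree': ['BS', 'MS', 'PHD'],
                          'yearsExperience': [1, 5, 10]})
toy_test = pd.DataFrame({'degree': ['BS', 'MS'],
                         'yearsExperience': [2, 7]})

# Encode each frame separately, mirroring one_hot_encode_feature_df
train_enc = pd.concat([pd.get_dummies(toy_train[['degree']]),
                       toy_train[['yearsExperience']]], axis=1)
test_enc = pd.concat([pd.get_dummies(toy_test[['degree']]),
                      toy_test[['yearsExperience']]], axis=1)

# Reindex the test frame to the training columns; missing dummies become 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns) == list(train_enc.columns))
```

In the notebook this would correspond to reindexing `test_df` to `feature_df.columns`. Whether that is needed depends on whether the two files share all category levels, which the notebook does not state.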

## Initializing and Training Predictive Models

Steps:
1. Initializing model lists and dicts
2. Creating models and tuning hyperparameters
3. Summarizing the models

Models:
1. LinearRegression
2. RandomForestRegressor
3. GradientBoostingRegressor

In [8]:
#initializing model lists and dict
models = []
mean_mse = {}
cv_std = {}
res = {}

In [9]:
#define number of processes to run in parallel
num_procs = 2

#shared model parameters
verbose_lvl = 5

In [10]:
#Initializing Models
lr = LinearRegression()
rf = RandomForestRegressor(n_estimators=150, n_jobs=num_procs, max_depth=25, min_samples_split=60, max_features=30, verbose=verbose_lvl)
# squared-error loss is the default; the 'ls' alias was removed in newer scikit-learn releases
gb = GradientBoostingRegressor(n_estimators=150, max_depth=5, verbose=verbose_lvl)

In [11]:
models.extend([lr, rf, gb])

In [12]:
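def train_model(model, feature_df, target_df, num_procs, mean_mse, cv_std):
    '''cross-validates a model and stores the mean MSE and its standard deviation across folds'''
    # scikit-learn returns negated MSE so that higher scores are always better
    neg_mse = cross_val_score(model, feature_df, target_df, cv=2, n_jobs=num_procs, scoring='neg_mean_squared_error')
    mean_mse[model] = -1.0*np.mean(neg_mse)
    cv_std[model] = np.std(neg_mse)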
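The `neg_mean_squared_error` scorer used in `train_model` returns negated MSE values, which is why the function multiplies the mean by -1. Below is a minimal sketch of that sign convention. The synthetic data (`X`, `y`) is an assumption for illustration, and the RMSE line is an optional extra that the notebook does not compute.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration): y = 3x + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

neg_mse = cross_val_score(LinearRegression(), X, y, cv=2,
                          scoring='neg_mean_squared_error')

# Scores are negated MSE values, so they are <= 0; flip the sign to report MSE
fold_mse = -neg_mse
print('per-fold MSE:', fold_mse)
print('mean MSE:', fold_mse.mean())

# RMSE is in the same units as the target, which can be easier to interpret
print('RMSE:', np.sqrt(fold_mse.mean()))
```

The standard deviation is unaffected by the sign flip, so `np.std(neg_mse)` and `np.std(-neg_mse)` give the same value.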

In [13]:
#cross validate models and print summaries to compare each model
for model in models:
    train_model(model, feature_df, target_df, num_procs, mean_mse, cv_std)
    print('\nModel:\n', model)
    print('Average MSE:\n', mean_mse[model])
    print('Standard deviation during CV:\n', cv_std[model])


Model:
 LinearRegression()
Average MSE:
 384.47499047251506
Standard deviation during CV:
 3.3891256464357866e-05

Model:
 RandomForestRegressor(max_depth=25, max_features=30, min_samples_split=60,
                      n_estimators=150, n_jobs=2, verbose=5)
Average MSE:
 367.5588277034447
Standard deviation during CV:
 0.04838140050742368

Model:
 GradientBoostingRegressor(max_depth=5, n_estimators=150, verbose=5)
Average MSE:
 357.33422543210236
Standard deviation during CV:
 0.09232045687170398


| Model | Average MSE | SD during CV |
| ---- | ---- | ---- |
| Linear Regression | 384.475 | 3.389e-05 |
| Random Forest | 367.559 | 0.048 |
| Gradient Boost | 357.334 | 0.092 |

The Gradient Boosting regressor has the lowest average mean squared error (357.334), making it the best of the three models compared above.
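The comparison can also be done programmatically instead of reading it off the table. Below is a minimal sketch. The dict `mean_mse_by_name` is a hypothetical stand-in keyed by model name, with values rounded from the table; the notebook's own `mean_mse` dict is keyed by the model objects.

```python
# Hypothetical scores keyed by model name, rounded from the table above
mean_mse_by_name = {
    'LinearRegression': 384.475,
    'RandomForestRegressor': 367.559,
    'GradientBoostingRegressor': 357.334,
}

# min() with key=dict.get returns the key whose value is smallest
best_name = min(mean_mse_by_name, key=mean_mse_by_name.get)
print(best_name)
```

The same call on the notebook's dict, `min(mean_mse, key=mean_mse.get)`, would return the model object itself, which could then be fitted on the full training features before predicting on the test set.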