# Case: Model Operationalization
### Part 1a: Training a model in the development environment

Welcome to the first part of the case! In this step we will train the model which we will later deploy into our production environment. 
The goal of this step is to get a feeling for the data and the model that we're dealing with. 
As in many data science departments, you will use a Jupyter notebook to train the model. 
A Jupyter notebook makes our script readable and easy to explain to others, because we can include visualizations and text blocks alongside the code.

Run the code and skim it to understand what is going on. The whole notebook should already run without problems. While reading, try to figure out which objects you will have to save as pickle files. Don't spend too much time on understanding the details of the code!

Good luck and have fun!

In [None]:
# Imports

import pandas as pd
import random
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
pd.options.mode.chained_assignment = None
import matplotlib.pyplot as plt
import sklearn
import sklearn.preprocessing  # loaded explicitly because MinMaxScaler is used below
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np

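# We set random seeds to get the same results with every run.
# pandas and scikit-learn draw from NumPy's global random state, so we seed that as well.
random.seed(15)
np.random.seed(15)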

Our dataset contains HR data: for each employee we have collected a set of attributes and marked whether they have left the company.
We would like to train a model that alerts us to employees who might leave.

In [None]:
# Put the csv file in the same directory as this notebook or update the path below
csv_path = "HR_source_data.csv"

In [None]:
inputdata = pd.read_csv(csv_path, header=0, sep=",")
inputdata.sample(5)

In [None]:
# Let's review some statistics

inputdata.describe()

In [None]:
# We will use specific variables for the model

columns = ['average_monthly_hours', 'department', 'salary', 'number_project', 'last_evaluation', 'satisfaction_level', 'left']
inputdata = inputdata.loc[:, columns]
inputdata.sample(5)

In [None]:
# Does department affect churn? And what about other columns?
# Plot the number of employees per department, split by the target variable ('left')
g=sns.catplot(x="department", hue="left", kind="count",
            palette="pastel", edgecolor=".6",
            data=inputdata)
g.fig.set_size_inches(12, 12)

# And as a table: average churn rate per department
# (select 'left' before averaging, so that text columns such as 'salary' are not averaged)
print(inputdata.groupby('department')['left'].mean())

# Churn is lower in RandD and management

## To build a model, we have to apply the following transformations to the data:
1. Dummify categorical variables
2. Create a new variable (hours per project)
3. Scale and impute certain variables

In [None]:
# Split to train and test
train_img, test_img = train_test_split(inputdata, test_size=0.3, random_state=15)

In [None]:
# Alter department variables based on exploration
train_img['department'] = train_img['department'].apply(lambda x: 'other' if x not in ['RandD','management'] else x)

# Dummify the department variable  
train_img = pd.concat([train_img, pd.get_dummies(train_img['department'])], axis=1)
train_img.drop('department', axis=1, inplace=True)
train_img.sample(5)
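
One point to keep in mind for later deployment: `pd.get_dummies` only creates columns for the categories it actually sees. The following is a minimal sketch on made-up toy data (the names `train_toy` and `new_toy` are hypothetical and not part of the case) showing one possible way to align new data to the training columns with `reindex`. It is not required for this notebook to run.

```python
import pandas as pd

# Toy data (hypothetical): the new batch happens to contain no 'management' rows.
train_toy = pd.DataFrame({"department": ["RandD", "management", "other"]})
new_toy = pd.DataFrame({"department": ["other", "RandD"]})

train_dummies = pd.get_dummies(train_toy["department"])
new_dummies = pd.get_dummies(new_toy["department"])

# Align the new data to the training columns; categories that are
# missing from the new batch become columns filled with 0.
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(new_dummies.columns))   # columns created from the new batch only
print(list(new_aligned.columns))   # columns after aligning to the training data
```

Without the alignment step, a model trained on three dummy columns would receive only two and fail at prediction time.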

In [None]:
# Make salary an ordinal variable

replace_map_salary = {'salary': {'low': 1, 'medium': 2, 'high': 3}}
train_img.replace(replace_map_salary, inplace=True)
train_img.sample(5)

In [None]:
# Create new variable - number of hours per project

train_img['hours_per_project'] = train_img['average_monthly_hours']/train_img['number_project']
train_img.sample(5)

In [None]:
# Scale the average monthly hours variable using the min-max scaler

scaler = sklearn.preprocessing.MinMaxScaler()
train_img['average_monthly_hours'] = scaler.fit_transform(train_img['average_monthly_hours'].values.reshape(-1, 1))
train_img.sample(5)
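
Min-max scaling maps each value to `(x - min) / (max - min)`, where `min` and `max` are learned from the data passed to `fit`. The sketch below uses made-up numbers (the names `train_hours`, `new_hours` and `toy_scaler` are hypothetical) to illustrate why the fitted scaler has to be reused on new data: values outside the training range end up outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy values (hypothetical), shaped as a single column
train_hours = np.array([[100.0], [200.0], [300.0]])
new_hours = np.array([[150.0], [350.0]])

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(train_hours)  # learns min=100 and max=300
new_scaled = toy_scaler.transform(new_hours)          # reuses the learned min and max

print(train_scaled.ravel())
print(new_scaled.ravel())
```

Here 350 hours is scaled to 1.25, because it lies above the maximum seen during fitting.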

In [None]:
# Dealing with missing values
# Impute satisfaction level using the mean of the variable

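# Assign the result back instead of using inplace=True on the column,
# which does not reliably modify the DataFrame in recent pandas versions
train_img['satisfaction_level'] = train_img['satisfaction_level'].fillna(train_img['satisfaction_level'].mean())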
train_img.describe()

In [None]:
# Let's use Random Forest for classification
# More hyperparameters can be tweaked, but we keep it simple for now --> not the focus of this module

# random_state makes the fitted forest reproducible between runs
forest = RandomForestClassifier(n_estimators=100, n_jobs=1, random_state=15)

In [None]:
X_train = train_img.drop(['left'], axis=1)
y_train = train_img.loc[:, 'left']

forest.fit(X_train, y_train)

In [None]:
# Using feature importance, we can better understand which variables are contributing to the model the most

features = X_train.columns
importances = forest.feature_importances_
indices = np.argsort(importances)

plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

In [None]:
# Create a pipeline that transforms the test set into the form the trained model expects.
# In other words, apply the same transformations as we have done during our exploration phase, but now in one function.
def test_transformation(test_set):
    # Create a local copy of the incoming data in this function.
    test_set_copy = test_set.copy()
    
    # Transform department variable
    test_set_copy['department'] = test_set_copy['department'].apply(lambda x: 'other' if x not in ['RandD', 'management'] else x)
    
    # Dummify categorical variable 
    test_set_copy = pd.concat([test_set_copy, pd.get_dummies(test_set_copy['department'])], axis=1)
    
    # Remove old categorical variable 
    test_set_copy.drop('department', axis=1, inplace=True)
    
    # Make salary variable ordinal
    test_set_copy.replace(replace_map_salary, inplace=True)
    
    # Create new variable
    test_set_copy['hours_per_project'] = test_set_copy['average_monthly_hours'] / test_set_copy['number_project']
    
    # Scale  average monthly hours
    test_set_copy['average_monthly_hours'] = scaler.transform(test_set_copy['average_monthly_hours'].values.reshape(-1,1))
    
    # Impute missing values in satisfaction level
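    # (use the mean of the training set, so no information from the test set is used)
    test_set_copy['satisfaction_level'] = test_set_copy['satisfaction_level'].fillna(train_img['satisfaction_level'].mean())
    
    return test_set_copy.drop(['left'], axis=1), test_set_copy.loc[:, 'left']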

X_test, y_test = test_transformation(test_img)

In [None]:
expected = y_test

# Make new predictions
predicted = forest.predict(X_test)

# Print the classification report (precision, recall and F1-score per class)
print(classification_report(expected, predicted))
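
The classification report summarizes precision and recall per class. If you also want the raw counts behind those numbers, scikit-learn provides `confusion_matrix`. The sketch below uses made-up labels (`toy_expected` and `toy_predicted` are hypothetical) rather than the model output, purely to show how the matrix is laid out.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 1 = left the company, 0 = stayed
toy_expected = [0, 0, 0, 1, 1, 0]
toy_predicted = [0, 0, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(toy_expected, toy_predicted)

# For a binary problem, ravel() yields the four cells in this order
tn, fp, fn, tp = cm.ravel()

print(cm)
print("true negatives:", tn, "false positives:", fp)
print("false negatives:", fn, "true positives:", tp)
```

In the notebook itself, the same call would take `expected` and `predicted` as arguments.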

## Exercise: save relevant model objects to be able to deploy the model to production
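
As a reminder of the mechanics only, here is a minimal sketch of a pickle round trip using a stand-in dictionary. The object `toy_object` and the file name `toy_object.pkl` are hypothetical; deciding which fitted objects from this notebook need to be saved is still up to you.

```python
import os
import pickle
import tempfile

# Stand-in object (hypothetical); in the exercise this would be a fitted model object
toy_object = {"name": "example", "values": [1, 2, 3]}

# Write to the system's temporary directory to keep the working directory clean
toy_path = os.path.join(tempfile.gettempdir(), "toy_object.pkl")

# Pickle files are binary, so open them with "wb" for writing ...
with open(toy_path, "wb") as f:
    pickle.dump(toy_object, f)

# ... and with "rb" for reading
with open(toy_path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_object)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.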

In [None]:
# Tip: go through the theory to recall which model objects are relevant to save.
# Use the documentation of the pickle module to save the model objects.