# SI 618 - Homework 9 - Classification of Employee Attrition

## Objectives
* Be able to extract, transform, and select multiple features to prepare for classification algorithm
* Be able to apply random forest regression
* Be able to evaluate classification results using appropriate metrics

## Submission Instructions:
Please submit your completed Databricks notebook file in .html format as well as the URL to the published version of your notebook via Canvas.

## Goal: 
1. Try to predict the IBM employee attrition using multiple factors, and evaluate how good your predictions are; 
2. Find out the leading drivers of Employee Attrition.

The dataset is downloaded from IBM's website:
https://www.ibm.com/communities/analytics/watson-analytics-blog/guide-to-sample-datasets/

#### NOTE: This homework assignment follows very closely the structure of this week's lab assignment.

You should be able to complete the core (i.e. everything other than "Above and Beyond") of this
assignment based on the code in the lab notebook.

#### Read the dataset from AWS S3 bucket.

In [4]:
ACCESS_KEY = 
SECRET_KEY = 
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "umsi-data-science-west"
MOUNT_NAME = "umsi-data-science"
try:
  dbutils.fs.unmount("/mnt/%s/" % MOUNT_NAME)
except:
  print("Could not unmount %s, but that's ok." % MOUNT_NAME)
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/umsi-data-science/si618wn2017"))

In [5]:
# Read the data (in AWS as a csv) in a pyspark DataFrame
ibm = spark.read.csv("/mnt/umsi-data-science/si618wn2017/WA_Fn-UseC_-HR-Employee-Attrition.csv", inferSchema=True, header=True)

In [6]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [7]:
string_columns = ["Attrition","BusinessTravel","Department","EducationField","Gender","JobRole","MaritalStatus","Over18","OverTime"]
columns = ibm.columns

# Anna's trick to create new columns
for x, y in enumerate(columns):
  if y in string_columns:
    columns[x] = "indexed_"+y

# Step 1:
ibm_string_indexer = [StringIndexer(inputCol=c, outputCol="indexed_"+ c) for c in string_columns]

# remove target column (now called "indexed_attrition")
columns.remove("indexed_Attrition")

# Step 2:
ibm_assembler = VectorAssembler(
    inputCols=columns,
    outputCol="features")

# Step 3:
ibm_cat_indexer = VectorIndexer(inputCol="features", outputCol="indexed", maxCategories=3)

# Step 4:
ibm_rf = RandomForestClassifier(featuresCol="indexed", labelCol="indexed_Attrition", numTrees=10)

# Step 5: (Before this wasn't working because I was trying to fit a list)
ibm_labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", 
	labels=ibm_string_indexer[0].fit(ibm).labels)

# Step 6:
splits = ibm.randomSplit([0.8, 0.2], 1234)
ibm_train = splits[0]
ibm_test = splits[1]

# Step 7:
ibm_pipeline = Pipeline(stages=ibm_string_indexer + [ibm_assembler, ibm_cat_indexer, ibm_rf, ibm_labelConverter])

# Step 8:
ibm_model = ibm_pipeline.fit(ibm_train)
ibm_predictions = ibm_model.transform(ibm_test)

In [8]:
# Step 9 (what happened = "attrition"; our prediction = "prediction")
display(ibm_predictions.select("Attrition","predictedLabel"))

In [9]:
# Step 10: (we compare indexed variables (i.e., indexed_Attrition and prediction) not (Attrition and predictedLabel))
evaluator = MulticlassClassificationEvaluator(labelCol="indexed_Attrition", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(ibm_predictions)
print("Test set accuracy = " + str(accuracy))

In [10]:
# Step 11:
import pandas as pd
rf_features = pd.DataFrame({"index":ibm.columns[:-1],"featureImportances":ibm_model.stages[-2].featureImportances})\
  .sort_values("featureImportances", ascending=False)
rf_features['Rank'] = rf_features['featureImportances'].rank(ascending=0)
# nicer table if you make index the pandas DataFrame index
indexed_rf_features = rf_features.set_index(['index'])
indexed_rf_features.head(10)

In [11]:
# Step 12:
import seaborn as sns
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(12, 7))
sns.set(style="darkgrid")
sns.barplot(y="index",x="featureImportances",data=rf_features)

In [12]:
display(f.figure)

What do we see?
- MaritalStatus is the most important predictor of attrition. Of course, marital status can be a proxy for whether or not an employee becomes a parent, which could lead an employee to leave. This suggests IBM should re-consider (and improve) their policies for parents. It could also be a strong predictor because employees that are unmarried are also younger, and more likely to leave. So we'd want to know the direction of the feature, but Random Forrest Classifiers don't offer this info (they are multi-dimensional).
- Pay matters. Stock Options and Years Since Last Promotion (another proxy) are predictive.
- Job level and department also matter. This could suggest some departments are more likely to have people leave, because of the nature of the work (e.g., sales people may leave more often) and some roles may be more likely to have attrition (e.g., analysts, who spend two or three years at IBM and then go to business schools for an MBA).

## Above and Beyond

2. Repeat the analysis for Steps 4-12 using Gradient-Boosted Trees (https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier) and compare the results to Random Forest. Describe, in detail, how the classification results differ. The majority of your work should go into exploring the differences in the results.

In [15]:
# Gradient-Boosted Trees
from pyspark.ml.classification import GBTClassifier

# Re-do Step 4:
ibm_gbt = GBTClassifier(featuresCol="indexed", labelCol="indexed_Attrition", maxIter=10)

# Re-run Step 6:
splits = ibm.randomSplit([0.8, 0.2], 1234)
ibm_train = splits[0]
ibm_test = splits[1]

# Step 7:
ibm_pipeline_gbt = Pipeline(stages=ibm_string_indexer + [ibm_assembler, ibm_cat_indexer, ibm_gbt, ibm_labelConverter])

# Step 8:
ibm_model_gbt = ibm_pipeline_gbt.fit(ibm_train)
ibm_predictions_gbt = ibm_model.transform(ibm_test)

evaluator = MulticlassClassificationEvaluator(labelCol="indexed_Attrition", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(ibm_predictions)
print("Test set accuracy = " + str(accuracy))

In [16]:
# Now calculate feature importance for GBT model
gbt_features = pd.DataFrame({"index":ibm.columns[:-1],"featureImportances":ibm_model_gbt.stages[-2].featureImportances})\
  .sort_values("featureImportances", ascending=False)
gbt_features['Rank'] = gbt_features['featureImportances'].rank(ascending=0)
# nicer table if you make index the pandas DataFrame index
indexed_gbt_features = gbt_features.set_index(['index'])

In [17]:
# add column to each dataframe, to make getting it into Seaborn format a bit easier
gbt_features['type'] = "gbt"
rf_features['type'] = "rf"

all_features = pd.concat([rf_features, gbt_features])
# This code isn't displaying because DataBricks is a pain with graphing :(
#g = sns.factorplot(x="index", y="featureImportances", hue="type", data=all_features,
                   size=6, kind="bar", palette="muted")
#g.despine(left=True)
#g.set_ylabels("")
#display(g.FacetGrid)

In [18]:
all_features.head()

In [19]:
indexed_gbt_features.head(20)

Our RF model indicated that Marital status was the strongest predictor (in terms feature importance), and the second strongest was Over 18. Over 18 was strange because all of the employees are over 18 years old, so there is no variation in the data (going to Kaggle confirmed this). So why is it such a strong predictor?

This brings us to the question why do the features have different relative importances?
- A Random Forrest Classifier uses a bagging ensemble to derive predictors for each feature--it bootstraps samples, stores the feature importances, and then averages them. The key idea is that each observation in the bootstrap has the same probability of appearing in each boot.
- A Gradient Boosted Classifier uses a boosting ensemble technique, which in simple terms gives more weight to some observations when calculating feature importances. A blog post on Medium explains: "Therefore, the observations have an unequal probability of appearing in subsequent models and ones with the highest error appear most... new predictors are learning from mistakes committed by previous predictors, it takes less time/iterations to reach close to actual predictions"

In one sentence, the difference between RF and GBT is that the former is parallel, and the later is sequential. 

Why does GBT throw out Over18 feature, which has no variation? The same Medium article says "the intuition behind gradient boosting algorithm is to repetitively leverage the patterns in residuals and strengthen a model with weak predictions and make it better." The Over18 feature has no patterns in the residuals -- zero variation, zero residuals. So the feature isn't as important.


FYI, here are the links to the articles that I read to learn the difference between RF and GBT:
- http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
- https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d

## End of Homework 9