# Model 1 - Logistic Regression

Import all required libraries and create a Spark session.

In [1]:
import findspark
findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import corr

spark = SparkSession.builder.appName('hr').getOrCreate()

In [2]:
results = {}

### Read data from the CSV file and load it into a DataFrame
The **inferSchema** parameter makes Spark try to infer the DataFrame schema from the CSV values.<br/>
The **header** parameter makes the DataFrame use the first CSV row as its header.

In [3]:
df = spark.read.csv('HR_comma_sep.csv', inferSchema=True, header=True)

### Inspect DataFrame schema and columns

In [4]:
df.printSchema()

root
 |-- satisfaction_level: double (nullable = true)
 |-- last_evaluation: double (nullable = true)
 |-- number_project: integer (nullable = true)
 |-- average_montly_hours: integer (nullable = true)
 |-- time_spend_company: integer (nullable = true)
 |-- Work_accident: integer (nullable = true)
 |-- left: integer (nullable = true)
 |-- promotion_last_5years: integer (nullable = true)
 |-- sales: string (nullable = true)
 |-- salary: string (nullable = true)



In [5]:
df.columns

['satisfaction_level',
 'last_evaluation',
 'number_project',
 'average_montly_hours',
 'time_spend_company',
 'Work_accident',
 'left',
 'promotion_last_5years',
 'sales',
 'salary']

### Index sales and salary columns to numerical indexes using StringIndexer

In [6]:
indexers = [StringIndexer(inputCol=column, outputCol=column+'_index').fit(df) for column in ['sales','salary']]
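By default, `StringIndexer` assigns indices in order of descending label frequency, so the most frequent label becomes `0.0`. The following is a minimal pure-Python sketch of that mapping, not Spark's implementation; the `fit_string_index` helper and the toy `salary` values are made up for illustration, and the alphabetical tie-break is an assumption of the sketch.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency. Breaking ties alphabetically is an
    # assumption of this sketch, not a statement about Spark 2.1.1.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Toy values for illustration only (not the real 'salary' column).
salary = ['low', 'medium', 'low', 'high', 'low', 'medium']
mapping = fit_string_index(salary)
indexed = [mapping[s] for s in salary]

print(mapping)
print(indexed)
```

Because the index reflects frequency rather than any natural order, it is best read as a category code and not as a magnitude.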

### Use a pipeline to chain the two indexing stages together

In [7]:
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)

#### See how other features correlate with 'left'

In [9]:
df_r.select(corr('satisfaction_level', 'left')).show()

+------------------------------+
|corr(satisfaction_level, left)|
+------------------------------+
|           -0.3883749834241161|
+------------------------------+



In [10]:
df_r.select(corr('sales_index', 'left')).show()

+-----------------------+
|corr(sales_index, left)|
+-----------------------+
|   -0.02839400311423293|
+-----------------------+
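The `corr` function used above computes the Pearson correlation coefficient. As a rough illustration of what that number measures, here is a small pure-Python sketch; the `pearson` helper and the toy values are assumptions made for the example and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values: lower satisfaction tends to go with left == 1.
satisfaction = [0.9, 0.8, 0.4, 0.1, 0.7, 0.2]
left = [0, 0, 1, 1, 0, 1]

r = pearson(satisfaction, left)
print(r)
```

A value near -1 or 1 indicates a strong linear relationship and a value near 0 a weak one. For an indexed nominal column such as `sales_index`, the coefficient depends on the arbitrary index order, so it should be interpreted with care.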



### Build a vector column using features, append the vector column to DataFrame
No single feature correlates strongly with an employee leaving, so<br/>
various combinations of features are tried, reducing the number of features where that increases the accuracy.

In [12]:
assembler = VectorAssembler(inputCols=['satisfaction_level',
                                       'last_evaluation',
                                       'number_project',
                                       'time_spend_company',
                                       'Work_accident',
                                       'promotion_last_5years',
                                       'sales_index',
                                       'salary_index'], outputCol='features')
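Conceptually, `VectorAssembler` concatenates the listed input columns of each row, in order, into one numeric vector. A minimal pure-Python sketch of that idea follows; the `assemble` helper and the row values are hypothetical and only illustrate the shape of the result.

```python
def assemble(row, input_cols):
    """Collect the chosen columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]

# Hypothetical row; the label 'left' is deliberately not part of the features.
row = {'satisfaction_level': 0.38, 'number_project': 2, 'salary_index': 0.0, 'left': 1}
features = assemble(row, ['satisfaction_level', 'number_project', 'salary_index'])

print(features)
```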

### Transform DataFrame
i.e. build the feature vector, add it to the DataFrame, and store the result in a variable called `output`.

In [13]:
output = assembler.transform(df_r)

### Select only the feature vector and the 'left' column and store the result in final_data

In [14]:
final_data = output.select('features','left')

### Split data in two sets, randomly.

In [15]:
train_left, test_left = final_data.randomSplit([0.4,0.6])
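`randomSplit` normalizes the weights and assigns each row at random, so the split sizes are only approximately 40% and 60%, and they change from run to run unless the optional `seed` argument is passed. The sketch below imitates that behaviour in plain Python; the `random_split` helper is an illustration written for this notebook, not Spark's code.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one part with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.4, 1.0].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    last = len(bounds) - 1
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last part also catches any floating-point rounding gap.
            if draw < bound or i == last:
                parts[i].append(row)
                break
    return parts

train, test = random_split(list(range(1000)), [0.4, 0.6], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the reported metrics reproducible, which matters when comparing models trained on different splits.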

### Build regression model for predicting 'left'
The regularization parameter &lambda; can be set if the model is overfitting.<br/>
However, it is not required in this case.

In [16]:
lr_left = LogisticRegression(labelCol='left',
                             family='multinomial') 

In [17]:
#lr_left.setRegParam(0.0)

In [18]:
#lr_left.setElasticNetParam(0.0)
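The two commented-out setters control the elastic net penalty: `regParam` is the overall strength &lambda; and `elasticNetParam` is the mix &alpha; between the L1 and L2 terms. A small sketch of the penalty as described in the Spark documentation, &lambda;(&alpha;&#8214;w&#8214;<sub>1</sub> + (1 - &alpha;)/2 &#8214;w&#8214;<sub>2</sub><sup>2</sup>), is shown below; the function name and the example weights are made up.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss: lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) * 0.5 * l2_squared)

weights = [0.5, -2.0, 0.0]  # hypothetical coefficients

print(elastic_net_penalty(weights, 0.0, 0.0))  # regParam = 0 disables the penalty
print(elastic_net_penalty(weights, 0.1, 0.0))  # pure L2 (ridge)
print(elastic_net_penalty(weights, 0.1, 1.0))  # pure L1 (lasso)
```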

### Train the model using training data

In [19]:
fitted_left_model = lr_left.fit(train_left)

### Test the fitted model using test dataset

In [20]:
logPreds = fitted_left_model.evaluate(test_left)

### Check the evaluation metric - Area under ROC curve

In [21]:
aucLR = logPreds.areaUnderROC
results['Logistic Regression\t'] = aucLR
print("areaUnderROC",": ", aucLR)

areaUnderROC :  0.8038551374947741
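The area under the ROC curve can be read as the probability that a randomly chosen positive example (`left == 1`) receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy data; it is an illustration of the metric, not how Spark computes `areaUnderROC` internally, and the labels and scores are invented.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Toy labels and model scores for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.65]

auc = area_under_roc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.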


# Model 2 & 3 - Decision Trees and Random Forest Classification

### Import required classes/functions for Decision Tree and Random Forest classification

In [22]:
from pyspark.ml.classification import DecisionTreeClassifier

In [23]:
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer

### Read data from the CSV file and load it into a DataFrame
The **inferSchema** parameter makes Spark try to infer the DataFrame schema from the CSV values.<br/>
The **header** parameter makes the DataFrame use the first CSV row as its header.

In [24]:
df = spark.read.csv('HR_comma_sep.csv', inferSchema=True, header=True)

### Index sales and salary columns to numerical indexes using StringIndexer

In [25]:
indexers = [StringIndexer(inputCol=column, outputCol=column+'_index').fit(df) for column in ['sales','salary']]

### Use a pipeline to chain the two indexing stages together

In [26]:
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)

### Build a vector column using features, append the vector column to DataFrame
No single feature correlates strongly with an employee leaving, so<br/>
various combinations of features are tried, reducing the number of features where that increases the accuracy.

In [27]:
data = VectorAssembler(inputCols=['satisfaction_level',
                                       'last_evaluation',
                                       'number_project',
                                       'average_montly_hours',
                                       'time_spend_company',
                                       'Work_accident',
                                       'promotion_last_5years',
                                       'sales_index',
                                       'salary_index'], outputCol='features').transform(df_r)

In [29]:
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=5).fit(data)
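`VectorIndexer` inspects every position of the feature vector and treats a feature as categorical when it has at most `maxCategories` distinct values; the remaining features are left as continuous. A rough pure-Python sketch of that decision follows. The helper name and the toy rows are assumptions for the example, and the ascending order of the category indices is a simplification.

```python
def categorical_feature_indices(rows, max_categories):
    """Return {feature position: {value: category index}} for categorical features."""
    n_features = len(rows[0])
    result = {}
    for j in range(n_features):
        distinct = sorted({row[j] for row in rows})
        # Features with few distinct values are treated as categorical.
        if len(distinct) <= max_categories:
            result[j] = {value: idx for idx, value in enumerate(distinct)}
    return result

# Toy feature vectors: position 0 is continuous, positions 1 and 2 have few values.
rows = [
    (0.38, 0.0, 2.0),
    (0.80, 1.0, 0.0),
    (0.11, 0.0, 1.0),
    (0.72, 0.0, 0.0),
]

categorical = categorical_feature_indices(rows, max_categories=3)
print(sorted(categorical))
```

Marking low-cardinality features as categorical lets tree-based models split on category membership instead of on a numeric threshold.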

### Transform DataFrame
i.e. apply the fitted VectorIndexer, which adds the `indexedFeatures` column to the DataFrame, and store the result back in `data`.

In [30]:
data = featureIndexer.transform(data)

### Randomly split data into training and test datasets

In [31]:
(trainingDataDT, testDataDT) = data.randomSplit([0.5, 0.5])

In [32]:
(trainingDataRF, testDataRF) = data.randomSplit([0.5, 0.5])

### Define decision tree classifier for label 'left'

In [33]:
dt = DecisionTreeClassifier(labelCol="left", featuresCol="indexedFeatures")

In [34]:
dtmodel = dt.fit(trainingDataDT)

In [35]:
predictionsDT = dtmodel.transform(testDataDT)

In [36]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [37]:
evaluatorDT = BinaryClassificationEvaluator(labelCol="left", rawPredictionCol="rawPrediction")

In [38]:
aucDT = evaluatorDT.evaluate(predictionsDT)

In [39]:
results['Decision Tree Classifier'] = aucDT
print(evaluatorDT.getMetricName(),": ", aucDT)

areaUnderROC :  0.9428336966632171


### Define RandomForestClassifier for label 'left' and use vector 'indexedFeatures' we built in previous steps

In [40]:
rf = RandomForestClassifier(labelCol="left", featuresCol="indexedFeatures")

### Train the model

In [41]:
model = rf.fit(trainingDataRF)

### Apply the model to the test dataset and collect predictions

In [42]:
predictionsRF = model.transform(testDataRF)

### Build BinaryClassificationEvaluator (since classification is binary)

In [43]:
evaluatorRF = BinaryClassificationEvaluator(labelCol="left", rawPredictionCol="rawPrediction")

### Check the evaluation metric - Area under ROC curve

In [44]:
aucRF = evaluatorRF.evaluate(predictionsRF)
results['Random Forest Classifier'] = aucRF
print(evaluatorRF.getMetricName(),": ", aucRF)

areaUnderROC :  0.9818101464647424


# Model 4 - Naive Bayes - Classification

### Import essential classes

In [45]:
from pyspark.ml.classification import NaiveBayes

### Read data from the CSV file and load it into a DataFrame
The **inferSchema** parameter makes Spark try to infer the DataFrame schema from the CSV values.<br/>
The **header** parameter makes the DataFrame use the first CSV row as its header.

In [46]:
df_nb = spark.read.csv('HR_comma_sep.csv', inferSchema=True, header=True)

### Assemble a vector for Naive Bayes with selective features

In [48]:
data_nb = VectorAssembler(inputCols=['satisfaction_level', 'Work_accident', 'promotion_last_5years'], outputCol='features').transform(df_nb)

### Split data for training and testing

In [49]:
(trainingData, testData) = data_nb.randomSplit([0.7, 0.3])

### Define Naive Bayes model for predicting label 'left'

In [50]:
nb = NaiveBayes(smoothing=0, modelType='multinomial', labelCol='left')
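The `smoothing` parameter is the additive (Laplace) smoothing term. With `smoothing=0`, a feature that never occurs within a class gets a zero likelihood, whose logarithm is negative infinity. The sketch below shows the standard additive-smoothing estimate for one class on made-up per-feature sums; it assumes that textbook formula and is not a copy of Spark's implementation.

```python
import math

def smoothed_log_likelihoods(feature_sums, smoothing):
    """Log of (sum_i + smoothing) / (total + smoothing * n_features) per feature."""
    total = sum(feature_sums)
    n_features = len(feature_sums)
    denominator = total + smoothing * n_features
    logs = []
    for s in feature_sums:
        numerator = s + smoothing
        # log(0) is undefined, so represent it explicitly as -infinity.
        logs.append(math.log(numerator / denominator) if numerator > 0 else float('-inf'))
    return logs

# Hypothetical per-feature sums for one class; the second feature never occurs.
feature_sums = [12.0, 0.0, 3.0]

unsmoothed = smoothed_log_likelihoods(feature_sums, smoothing=0.0)
smoothed = smoothed_log_likelihoods(feature_sums, smoothing=1.0)
print(unsmoothed)
print(smoothed)
```

A small positive smoothing value keeps every likelihood finite, at the cost of a slight bias toward uniform probabilities.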

### Train Naive Bayes model using training dataset

In [51]:
model = nb.fit(trainingData)

### Test the model on test dataset

In [52]:
predictions = model.transform(testData)

### Evaluate the predictions using BinaryClassification evaluator

In [53]:
evaluator = BinaryClassificationEvaluator(labelCol="left", rawPredictionCol="rawPrediction")

### Measure area under ROC curve for Naive Bayes model

In [54]:
aucNB = evaluator.evaluate(predictions)
results['Naive Bayes Classifier'] = aucNB
print(evaluator.getMetricName(),": ", aucNB)

areaUnderROC :  0.7667689475712651


# Model 5 - Linear Support Vector Classifier

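### Import LinearSVC class
`LinearSVC` was added to `pyspark.ml.classification` in Spark 2.2.0, so this import fails on the Spark 2.1.1 installation initialized at the top of the notebook.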

In [55]:
from pyspark.ml.classification import LinearSVC

ImportError: cannot import name 'LinearSVC'

### Read data from the CSV file and load it into a DataFrame
The **inferSchema** parameter makes Spark try to infer the DataFrame schema from the CSV values.<br/>
The **header** parameter makes the DataFrame use the first CSV row as its header.

In [56]:
df = spark.read.csv('HR_comma_sep.csv', inferSchema=True, header=True)

### Index sales and salary columns to numerical indexes using StringIndexer

In [57]:
indexers = [StringIndexer(inputCol=column, outputCol=column+'_index').fit(df) for column in ['sales','salary']]

### Use a pipeline to chain the two indexing stages together

In [58]:
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)

### Build a vector column using features, append the vector column to DataFrame
No single feature correlates strongly with an employee leaving, so<br/>
various combinations of features are tried, reducing the number of features where that increases the accuracy.

In [59]:
data = VectorAssembler(inputCols=['satisfaction_level',
                                       'last_evaluation',
                                       'number_project',
                                       'average_montly_hours',
                                       'time_spend_company',
                                       'Work_accident',
                                       'promotion_last_5years',
                                       'sales_index',
                                       'salary_index'], outputCol='features').transform(df_r)

### Randomly split the dataset into trainingData and testData

In [60]:
(trainingData, testData) = data.randomSplit([0.20, 0.80])

### Define linear SVC model to predict label 'left'

In [61]:
# lsvc = LinearSVC(regParam=0.0, labelCol='left', maxIter=30)
lsvc = LinearSVC(regParam=0.0, labelCol='left')

NameError: name 'LinearSVC' is not defined

### Train the model with training dataset

In [62]:
lsvcModel = lsvc.fit(trainingData)

NameError: name 'lsvc' is not defined

### Apply the model to the test dataset and collect predictions

In [63]:
svcPred = lsvcModel.transform(testData)

NameError: name 'lsvcModel' is not defined

### Use BinaryClassificationEvaluator to evaluate predictions from linear SVC

In [64]:
evaluator = BinaryClassificationEvaluator(labelCol="left", rawPredictionCol="rawPrediction")

### Evaluate the predictions and collect area under ROC curve

In [65]:
aucSVC = evaluator.evaluate(svcPred)
results['Support Vector Classifier'] = aucSVC
print(evaluator.getMetricName(),": ", aucSVC)

NameError: name 'svcPred' is not defined

## Comparing Logistic Regression, Decision Tree, Random Forest, Naive Bayes, and Linear SVC

In [66]:
for (k,v) in results.items():
    print(k, ':\t  ', v)

Decision Tree Classifier :	   0.9428336966632171
Random Forest Classifier :	   0.9818101464647424
Logistic Regression	 :	   0.8038551374947741
Naive Bayes Classifier :	   0.7667689475712651
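The printed table relies on a tab embedded in one of the dictionary keys to line up the columns. As an alternative, here is a small sketch that pads the names to a common width and ranks the models by AUC; the values are the ones reported above, and the `format_results` helper is a hypothetical name.

```python
# AUC values copied from the outputs above.
results = {
    'Logistic Regression': 0.8038551374947741,
    'Decision Tree Classifier': 0.9428336966632171,
    'Random Forest Classifier': 0.9818101464647424,
    'Naive Bayes Classifier': 0.7667689475712651,
}

def format_results(results):
    """Return one aligned line per model, best AUC first."""
    width = max(len(name) for name in results)
    ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
    return ['{} : {:.4f}'.format(name.ljust(width), auc) for name, auc in ranked]

for line in format_results(results):
    print(line)
```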


#### Logistic Regression
Logistic regression works best when the classes are separated by a clear decision boundary; otherwise it is prone to underfitting (high bias). One advantage of logistic regression is that its prediction output is available as a probability, which can be used to group the labels into categories and is not limited to binary classification.

For this dataset, the model's accuracy drops if I reduce the training set size below 40%, which causes underfitting.
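To illustrate how a probability output can be grouped into more than two categories, the sketch below applies the logistic (sigmoid) function to a linear score and buckets the result. The weights, intercept, thresholds, and feature values are all hypothetical and are not the coefficients of the fitted model.

```python
import math

def predict_probability(features, weights, intercept):
    """Logistic regression probability: sigmoid of the linear score."""
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-margin))

def risk_group(probability):
    """Bucket a probability into three groups (thresholds are arbitrary)."""
    if probability < 0.33:
        return 'low'
    if probability < 0.66:
        return 'medium'
    return 'high'

# Hypothetical coefficients for (satisfaction_level, time_spend_company).
weights = [-4.0, 0.3]
intercept = 1.0

p_unhappy = predict_probability([0.2, 5.0], weights, intercept)
p_happy = predict_probability([0.9, 3.0], weights, intercept)
print(p_unhappy, risk_group(p_unhappy))
print(p_happy, risk_group(p_happy))
```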

#### Decision Trees and Random Forest Classifier
Decision trees and random forests perform well with this data with little or no tuning. They seem to handle categorical features such as the salary and department (sales) indexes better than the other models. Random forest is less prone to overfitting than a decision tree when very few features are used. When I tested both models with selected features, random forest did not lose accuracy as quickly as the decision tree.

#### Naive Bayes Classifier
Naive Bayes does not work as well with numeric data as it does with categorical features. It makes a very strong assumption about the features: that they are independent of one another given the label. That is why it is called 'naive'. Results from the Naive Bayes classifier improve when the features that correlate least with the label are removed.

It required the largest share of the data for training.

#### Support Vector Classifier
Computationally expensive; it requires many iterations to reach relatively good results. It works better with a large number of features.

Required the least amount of data for training. I was able to reduce training set to 20% without losing any significant accuracy.