# Company-Employee relationship predictions

Predicting whether an employee will leave the company, based on the employee's HR data, using logistic regression.

### Initialize and create a spark session

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Company").getOrCreate()

Intitializing Scala interpreter ...

Spark Web UI available at http://Varun-CK:4040
SparkContext available as 'sc' (version = 2.3.0, master = local[*], app id = local-1577624611377)
SparkSession available as 'spark'


2019-12-29 18:33:27 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-12-29 18:33:45 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4c46b092


### Initialize Logger

In [2]:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)

import org.apache.log4j._


### Import statements to setup ML for Logistic Regression

In [3]:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer,VectorAssembler,OneHotEncoder}
import org.apache.spark.ml.linalg.Vectors

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, OneHotEncoder}
import org.apache.spark.ml.linalg.Vectors


### Using Spark to read the employee data set

In [4]:
val data = spark.read.options(Map(("header","true"),("inferSchema","true"))).csv("HR_comma_sep.csv")

data: org.apache.spark.sql.DataFrame = [satisfaction_level: double, last_evaluation: double ... 8 more fields]


### Printing the first row of the dataframe

In [5]:
data.head(1)

res1: Array[org.apache.spark.sql.Row] = Array([0.38,0.53,2,157,3,0,1,0,sales,low])


### Printing the schema of the dataframe

In [6]:
data.printSchema

root
 |-- satisfaction_level: double (nullable = true)
 |-- last_evaluation: double (nullable = true)
 |-- number_project: integer (nullable = true)
 |-- average_montly_hours: integer (nullable = true)
 |-- time_spend_company: integer (nullable = true)
 |-- Work_accident: integer (nullable = true)
 |-- left: integer (nullable = true)
 |-- promotion_last_5years: integer (nullable = true)
 |-- Department: string (nullable = true)
 |-- salary: string (nullable = true)



### Showing the first five rows of the dataframe

In [7]:
data.show(5)

+------------------+---------------+--------------+--------------------+------------------+-------------+----+---------------------+----------+------+
|satisfaction_level|last_evaluation|number_project|average_montly_hours|time_spend_company|Work_accident|left|promotion_last_5years|Department|salary|
+------------------+---------------+--------------+--------------------+------------------+-------------+----+---------------------+----------+------+
|              0.38|           0.53|             2|                 157|                 3|            0|   1|                    0|     sales|   low|
|               0.8|           0.86|             5|                 262|                 6|            0|   1|                    0|     sales|medium|
|              0.11|           0.88|             7|                 272|                 4|            0|   1|                    0|     sales|medium|
|              0.72|           0.87|             5|                 223|                 5|   

### Averaging all the columns to compare the relation with label "left"

In [10]:
data.groupBy("left").mean().show(2)

+----+-----------------------+--------------------+-------------------+-------------------------+-----------------------+--------------------+---------+--------------------------+
|left|avg(satisfaction_level)|avg(last_evaluation)|avg(number_project)|avg(average_montly_hours)|avg(time_spend_company)|  avg(Work_accident)|avg(left)|avg(promotion_last_5years)|
+----+-----------------------+--------------------+-------------------+-------------------------+-----------------------+--------------------+---------+--------------------------+
|   1|    0.44009801176140917|  0.7181125735088183| 3.8555026603192384|       207.41921030523662|      3.876505180621675|0.047325679081489776|      1.0|      0.005320638476617194|
|   0|      0.666809590479516|  0.7154733986699274|  3.786664333216661|        199.0602030101505|     3.3800315015750786| 0.17500875043752187|      0.0|      0.026251312565628283|
+----+-----------------------+--------------------+-------------------+-------------------------+-----------------------+--------------------+---------+--------------------------+

From the above results we can draw the following conclusions:

<br>**Satisfaction Level**: The average satisfaction level is noticeably lower among employees who left the firm (0.44) than among those who stayed (0.67).
<br>**Average Monthly Hours**: Average monthly hours are slightly higher among employees who left the firm (207 vs 199).
<br>**Promotion Last 5 Years**: Employees promoted in the last five years are more likely to be retained: about 2.6% of the retained employees had been promoted, versus about 0.5% of those who left.

### Analyzing the dependency on the categorical columns "salary" and "Department"

In [11]:
data.count()

res7: Long = 14999


In [12]:
data.groupBy("Salary").count().count()

res8: Long = 3


In [13]:
data.groupBy("Department").count().count()

res9: Long = 10


In [14]:
data.printSchema

root
 |-- satisfaction_level: double (nullable = true)
 |-- last_evaluation: double (nullable = true)
 |-- number_project: integer (nullable = true)
 |-- average_montly_hours: integer (nullable = true)
 |-- time_spend_company: integer (nullable = true)
 |-- Work_accident: integer (nullable = true)
 |-- left: integer (nullable = true)
 |-- promotion_last_5years: integer (nullable = true)
 |-- Department: string (nullable = true)
 |-- salary: string (nullable = true)



In [40]:
val filtered_data = data.select("satisfaction_level","last_evaluation","number_project","average_montly_hours","time_spend_company"
                                ,"Work_accident","left","promotion_last_5years","Department","salary")

filtered_data: org.apache.spark.sql.DataFrame = [satisfaction_level: double, last_evaluation: double ... 8 more fields]


### Since the `salary` and `Department` columns are categorical, we convert them to numeric indices with StringIndexer and then to vectors with OneHotEncoder

#### Using StringIndexer to convert categorical string columns to numerical type

In [41]:
val DepartmentIndexer = new StringIndexer().setInputCol("Department").setOutputCol("DepartmentInd")
val salaryIndexer = new StringIndexer().setInputCol("salary").setOutputCol("salaryInd")

DepartmentIndexer: org.apache.spark.ml.feature.StringIndexer = strIdx_4de41bbd4e1c
salaryIndexer: org.apache.spark.ml.feature.StringIndexer = strIdx_0fcb13efa78d


#### Using OneHotEncoder to convert the indexed categorical columns to vector type

In [42]:
val DepartmentEncoder = new OneHotEncoder().setInputCol("DepartmentInd").setOutputCol("DepartmentVec")
val salaryEncoder = new OneHotEncoder().setInputCol("salaryInd").setOutputCol("salaryVec")

DepartmentEncoder: org.apache.spark.ml.feature.OneHotEncoder = oneHot_f3f71d4cf533
salaryEncoder: org.apache.spark.ml.feature.OneHotEncoder = oneHot_007c99d1613c
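
To make the two steps concrete, here is a minimal plain-Scala sketch of what the indexer and the encoder do conceptually. It is not Spark's implementation: `indexByFrequency` and `oneHotDropLast` are hypothetical helpers, the sample values are made up, and the sketch assumes the default settings (labels ordered by descending frequency, last category dropped by the encoder).

```scala
// Hypothetical helper: index 0 goes to the most frequent label, 1 to the next, and so on
// (mirrors StringIndexer's default frequency-descending order; ties are not handled specially).
def indexByFrequency(values: Seq[String]): Map[String, Int] =
  values.groupBy(identity).toSeq
    .sortBy { case (_, occurrences) => -occurrences.size }
    .map(_._1)
    .zipWithIndex
    .toMap

// Hypothetical helper: one-hot encode an index while dropping the last category
// (mirrors OneHotEncoder's default dropLast = true, so the last category is all zeros).
def oneHotDropLast(index: Int, numCategories: Int): Vector[Double] =
  Vector.tabulate(numCategories - 1)(i => if (i == index) 1.0 else 0.0)

// Made-up sample: "low" appears three times, "medium" twice, "high" once.
val salaries = Seq("low", "medium", "low", "high", "medium", "low")
val salaryIndex = indexByFrequency(salaries)

val encoded = salaries.map(s => oneHotDropLast(salaryIndex(s), salaryIndex.size))
encoded.foreach(println)

// Width of the assembled feature vector in this notebook:
// 7 numeric columns, 10 departments -> 9 slots, 3 salary levels -> 2 slots.
val featureWidth = 7 + (10 - 1) + (3 - 1)
println(featureWidth)
```

Under these assumptions the assembled vector has 18 entries, which is consistent with the `(18,[...` sparse vectors shown in the `features` column of the results.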


### Assembling all the features to a single vector column "features"

In [43]:
data.columns

res24: Array[String] = Array(satisfaction_level, last_evaluation, number_project, average_montly_hours, time_spend_company, Work_accident, left, promotion_last_5years, Department, salary)


In [44]:
val assembler = new VectorAssembler().setInputCols(Array("satisfaction_level","last_evaluation","number_project","average_montly_hours"
                                                         ,"time_spend_company","Work_accident","promotion_last_5years"
                                                         ,"DepartmentVec","salaryVec")).setOutputCol("features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_7426c903d501


### Splitting the resultant data into training data and testing data

<code>
<b>Training data is used to train the model</b>
<b>Testing data is used to test the built model</b>
</code>

In [45]:
val Array(train_data,test_data) = filtered_data.randomSplit(Array(0.7,0.3))

train_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [satisfaction_level: double, last_evaluation: double ... 8 more fields]
test_data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [satisfaction_level: double, last_evaluation: double ... 8 more fields]
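
The split above uses no seed, so each run produces a different partition and slightly different metrics. The sketch below shows how passing a seed to `randomSplit` could make the split repeatable. It runs on a stand-in dataframe built with `spark.range` rather than the HR data; the names `demo`, `trainA`, `trainB` and the seed value 42 are arbitrary choices for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder().appName("SplitSketch").master("local[*]").getOrCreate()

// Stand-in data: 1000 rows with a single `id` column (not the HR data).
val demo = spark.range(0, 1000).toDF("id")

// Passing the same seed twice should yield the same partition of the rows.
val Array(trainA, testA) = demo.randomSplit(Array(0.7, 0.3), seed = 42L)
val Array(trainB, testB) = demo.randomSplit(Array(0.7, 0.3), seed = 42L)

// Number of rows that are in trainA but not in trainB.
println(trainA.except(trainB).count())
// Train and test together cover the whole stand-in dataframe.
println(trainA.count() + testA.count())
```

In this notebook the equivalent change would be `filtered_data.randomSplit(Array(0.7, 0.3), seed = 42L)`; the recorded counts and metrics would then differ from the ones shown here, since they come from an unseeded run.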


In [46]:
filtered_data.count()

res25: Long = 14999


In [47]:
filtered_data.na.drop().count()

res26: Long = 14999


In [48]:
filtered_data.select("left").describe().show()

+-------+-------------------+
|summary|               left|
+-------+-------------------+
|  count|              14999|
|   mean| 0.2380825388359224|
| stddev|0.42592409938029885|
|    min|                  0|
|    max|                  1|
+-------+-------------------+



In [52]:
train_data.count

res31: Long = 10439


In [53]:
test_data.count

res32: Long = 4560


In [54]:
train_data.select("left").describe().show

+-------+-------------------+
|summary|               left|
+-------+-------------------+
|  count|              10439|
|   mean|0.23728326468052496|
| stddev|0.42543772228815196|
|    min|                  0|
|    max|                  1|
+-------+-------------------+



In [55]:
test_data.select("left").describe().show

+-------+-------------------+
|summary|               left|
+-------+-------------------+
|  count|               4560|
|   mean|0.23991228070175438|
| stddev| 0.4270765470464649|
|    min|                  0|
|    max|                  1|
+-------+-------------------+



### Creating a logistic regression model object

In [56]:
val lor = new LogisticRegression().setLabelCol("left").setFeaturesCol("features")

lor: org.apache.spark.ml.classification.LogisticRegression = logreg_add7b4de687f


### Setting Up the Pipeline

In [57]:
import org.apache.spark.ml.Pipeline

import org.apache.spark.ml.Pipeline


In [59]:
val pipeline = new Pipeline().setStages(Array(DepartmentIndexer,salaryIndexer,DepartmentEncoder,salaryEncoder,assembler,lor))

pipeline: org.apache.spark.ml.Pipeline = pipeline_c42d7c0a46e4


### Fitting the pipeline to training set.

In [61]:
val model = pipeline.fit(train_data)

model: org.apache.spark.ml.PipelineModel = pipeline_c42d7c0a46e4


### Getting Results on Test Set

In [62]:
val results = model.transform(test_data)

results: org.apache.spark.sql.DataFrame = [satisfaction_level: double, last_evaluation: double ... 16 more fields]


In [63]:
results.printSchema()

root
 |-- satisfaction_level: double (nullable = true)
 |-- last_evaluation: double (nullable = true)
 |-- number_project: integer (nullable = true)
 |-- average_montly_hours: integer (nullable = true)
 |-- time_spend_company: integer (nullable = true)
 |-- Work_accident: integer (nullable = true)
 |-- left: integer (nullable = true)
 |-- promotion_last_5years: integer (nullable = true)
 |-- Department: string (nullable = true)
 |-- salary: string (nullable = true)
 |-- DepartmentInd: double (nullable = false)
 |-- salaryInd: double (nullable = false)
 |-- DepartmentVec: vector (nullable = true)
 |-- salaryVec: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [64]:
val output = results.select("left","rawPrediction","prediction","probability","features")

output: org.apache.spark.sql.DataFrame = [left: int, rawPrediction: vector ... 3 more fields]


In [65]:
output.show(5)

+----+--------------------+----------+--------------------+--------------------+
|left|       rawPrediction|prediction|         probability|            features|
+----+--------------------+----------+--------------------+--------------------+
|   1|[-0.7713329005047...|       1.0|[0.31619084367384...|(18,[0,1,2,3,4,11...|
|   1|[-0.7930519955582...|       1.0|[0.31151372275216...|(18,[0,1,2,3,4,7,...|
|   1|[-1.0232302868408...|       1.0|[0.26439865759682...|(18,[0,1,2,3,4,9,...|
|   1|[-1.0582202795894...|       1.0|[0.25764970845306...|(18,[0,1,2,3,4,8,...|
|   1|[-0.3954148831148...|       1.0|[0.40241446026218...|(18,[0,1,2,3,4,16...|
+----+--------------------+----------+--------------------+--------------------+
only showing top 5 rows
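
The relationship between `rawPrediction`, `probability` and `prediction` can be checked by hand. The sketch below applies the logistic (sigmoid) function to the first component of `rawPrediction` from the first row above. The value is truncated in the printed output, so the result is only approximate, and the variable names are illustrative.

```scala
// Logistic (sigmoid) function: maps a raw margin to a probability in (0, 1).
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// First component of rawPrediction in the first row above (truncated in the output).
val rawClass0 = -0.7713329005047

val pStay  = sigmoid(rawClass0) // probability of class 0 (stayed)
val pLeave = 1.0 - pStay        // probability of class 1 (left)

// Default decision rule: predict class 1 when its probability exceeds 0.5.
val prediction = if (pLeave > 0.5) 1.0 else 0.0

println(f"P(stay) = $pStay%.6f, P(leave) = $pLeave%.6f, prediction = $prediction")
```

The computed probability of staying agrees with the first entry of `probability` in that row (0.316...), and the resulting prediction of 1.0 matches the `prediction` column.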



## MODEL EVALUATION

### 1) Converting the data to an RDD and evaluating with MulticlassMetrics to print the confusion matrix

In [66]:
import org.apache.spark.mllib.evaluation.MulticlassMetrics

import org.apache.spark.mllib.evaluation.MulticlassMetrics


In [67]:
val clean_result = output.withColumn("left",output("left").cast("double"))

clean_result: org.apache.spark.sql.DataFrame = [left: double, rawPrediction: vector ... 3 more fields]


In [69]:
clean_result.select("left","prediction").show(5)

+----+----------+
|left|prediction|
+----+----------+
| 1.0|       1.0|
| 1.0|       1.0|
| 1.0|       1.0|
| 1.0|       1.0|
| 1.0|       1.0|
+----+----------+
only showing top 5 rows



In [71]:
val predictionAndLabel = clean_result.select("left","prediction").as[(Double,Double)].rdd

predictionAndLabel: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[382] at rdd at <console>:56


In [72]:
val metrics = new MulticlassMetrics(predictionAndLabel)

metrics: org.apache.spark.mllib.evaluation.MulticlassMetrics = org.apache.spark.mllib.evaluation.MulticlassMetrics@3de84f4b


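#### Printing the confusion matrix

Note: `MulticlassMetrics` expects `(prediction, label)` pairs, but the RDD above was built as `(left, prediction)`, that is `(label, prediction)`. As a result, the rows of the matrix below are predicted classes and the columns are actual classes; the column sums (3466 and 1094) match the actual class counts in the test set.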

In [73]:
println(metrics.confusionMatrix)

3207.0  718.0  
259.0   376.0  


#### Printing the Accuracy

In [74]:
println(metrics.accuracy)

0.7857456140350877


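#### Recall (overall)

The no-argument `recall` is deprecated in Spark 2.x and simply returns the overall accuracy, which is why the value below is identical to the accuracy above. The same applies to the no-argument `precision`.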

In [75]:
println(metrics.recall)

0.7857456140350877


#### Precision (overall)

In [76]:
println(metrics.precision)

0.7857456140350877
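
The overall figures above hide how the model does on the minority class. The rough sketch below derives precision and recall for the "left" class directly from the four counts in the confusion matrix. It assumes that rows are predictions and columns are actual labels, which follows from the pairs having been built as `(left, prediction)` while `MulticlassMetrics` reads them as `(prediction, label)`. All variable names are illustrative.

```scala
// Counts copied from the confusion matrix printed above
// (rows = predicted class, columns = actual class, under the assumption stated in the text).
val trueNegatives  = 3207.0 // predicted stayed, actually stayed
val falseNegatives = 718.0  // predicted stayed, actually left
val falsePositives = 259.0  // predicted left, actually stayed
val truePositives  = 376.0  // predicted left, actually left

val total = trueNegatives + falseNegatives + falsePositives + truePositives

val accuracy      = (trueNegatives + truePositives) / total
val precisionLeft = truePositives / (truePositives + falsePositives)
val recallLeft    = truePositives / (truePositives + falseNegatives)

// Accuracy of a trivial model that always predicts "stayed".
val baselineAccuracy = (trueNegatives + falsePositives) / total

println(f"accuracy = $accuracy%.4f, baseline = $baselineAccuracy%.4f")
println(f"precision(left) = $precisionLeft%.4f, recall(left) = $recallLeft%.4f")
```

On these counts the model finds roughly a third of the employees who actually left, and always predicting "stayed" would already reach about 76% accuracy, so the 78.6% accuracy should be read with the class imbalance in mind.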


### 2) Evaluating using BinaryClassificationEvaluator

In [77]:
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator


In [78]:
val bin_eval = new BinaryClassificationEvaluator().setRawPredictionCol("rawPrediction").setLabelCol("left")

bin_eval: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_66d8ebdd9926


#### Calculating Area Under ROC

In [79]:
val AOC =bin_eval.evaluate(output)

AOC: Double = 0.8201352179595757


#### Printing Area Under ROC

In [80]:
println(AOC)

0.8201352179595757
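
One way to read the area under ROC is as the probability that a randomly chosen leaver receives a higher score than a randomly chosen stayer. The sketch below computes that quantity by brute force on made-up scores. `pairwiseAuc` is a hypothetical helper and not how Spark computes the metric internally; it only illustrates the interpretation.

```scala
// Hypothetical helper: fraction of (positive, negative) pairs in which the positive
// example gets the higher score; ties count as half a win.
def pairwiseAuc(posScores: Seq[Double], negScores: Seq[Double]): Double = {
  val wins = for {
    p <- posScores
    n <- negScores
  } yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  wins.sum / (posScores.size * negScores.size)
}

// Made-up model scores for three leavers and four stayers.
val leaverScores = Seq(0.9, 0.6, 0.4)
val stayerScores = Seq(0.7, 0.3, 0.2, 0.1)

val auc = pairwiseAuc(leaverScores, stayerScores)
println(auc)
```

Here 10 of the 12 leaver/stayer pairs are ranked correctly, giving about 0.83. A value of 0.5 would correspond to random ranking and 1.0 to a perfect one.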


### 3) Evaluating using MulticlassClassificationEvaluator

In [81]:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator


In [82]:
val multi_eval = new MulticlassClassificationEvaluator().setPredictionCol("prediction").setLabelCol("left")

multi_eval: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_f895475bd3dd


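#### Calculating the weighted F1 score

`MulticlassClassificationEvaluator` uses `f1` as its default metric, so the value below is the weighted F1 score, not the area under ROC (despite the variable name `AOC_2`).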

In [83]:
val AOC_2 = multi_eval.evaluate(output)

AOC_2: Double = 0.7639592838971173


#### Printing the weighted F1 score

In [84]:
println(AOC_2)

0.7639592838971173
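
The value above can be reproduced from the confusion matrix counts printed earlier in the notebook, which supports reading it as a support-weighted F1 score. The sketch below is a hand calculation, not a call into Spark; the helper `f1` and the variable names are illustrative. Since F1 is symmetric in false positives and false negatives, only the class sizes (3466 stayers and 1094 leavers in the test set) depend on how the matrix is oriented.

```scala
// F1 for one class from its true positives, false positives and false negatives.
def f1(tp: Double, fp: Double, fn: Double): Double = 2 * tp / (2 * tp + fp + fn)

// Counts read off the confusion matrix (class 0 = stayed, class 1 = left).
val f1Stay = f1(tp = 3207, fp = 718, fn = 259)
val f1Left = f1(tp = 376, fp = 259, fn = 718)

// Actual class sizes in the test set, used as weights.
val actualStay = 3207.0 + 259.0
val actualLeft = 376.0 + 718.0

val weightedF1 = (actualStay * f1Stay + actualLeft * f1Left) / (actualStay + actualLeft)

println(f"F1(stayed) = $f1Stay%.4f, F1(left) = $f1Left%.4f, weighted F1 = $weightedF1%.6f")
```

The weighted result matches the evaluator's output to six decimal places, while the per-class values show that the F1 for leavers (about 0.43) is far lower than for stayers (about 0.87).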


### Stopping the created spark session

In [85]:
spark.stop()

## Thank You!