### Employee Promotion Evaluation

#### Problem Statement
Your client is a large MNC with 9 broad verticals across the organisation. One of the problems the client faces is identifying the right people for promotion (only for manager positions and below) and preparing them in time. The current process is:

1. They first identify a set of employees based on recommendations and past performance.
2. Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills required in each vertical.
3. At the end of the program, employees are promoted based on factors such as training performance and KPI completion (only employees who completed more than 60% of their KPIs are considered).

Under this process, final promotions are announced only after the evaluation, which delays employees' transition to their new roles. The company therefore needs help identifying eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle.

The company has provided multiple attributes covering each employee's past and current performance, along with demographics. The task is to predict whether a potential promotee at the checkpoint in the test set will be promoted after the evaluation process.

#### Attribute description
employee_id - Unique ID of the employee <br/>
department - Department of the employee <br/>
region - Region of employment (unordered)<br/>
education - Education level<br/>
gender - Gender of the employee<br/>
recruitment_channel - Channel of recruitment for the employee<br/>
no_of_trainings - Number of other trainings completed in the previous year (soft skills, technical skills, etc.)<br/>
age - Age of the employee<br/>
previous_year_rating - Employee rating for the previous year<br/>
length_of_service - Length of service in years<br/>
KPIs_met >80% - 1 if more than 80% of KPIs (Key Performance Indicators) were met, else 0<br/>
awards_won? - 1 if awards were won during the previous year, else 0<br/>
avg_training_score - Average score in current training evaluations<br/>
is_promoted (Target) - Recommended for promotion<br/>


#### Importing Packages

In [380]:
import pandas as pd  # used later for the correlation table and the scores table
import findspark
findspark.init()
findspark.find()
from pyspark.sql import SparkSession

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.mllib.stat import Statistics

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import LinearSVC

from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [167]:
spark = SparkSession.builder.appName("Employee Promotion Evaluation").getOrCreate()

In [168]:
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)

In [169]:
dataset = spark.read.csv("attachment_train_lyst4523.csv",header=True)

In [170]:
dataset.show(5)

+-----------+-----------------+---------+----------------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+
|employee_id|       department|   region|       education|gender|recruitment_channel|no_of_trainings|age|previous_year_rating|length_of_service|KPIs_met >80%|awards_won?|avg_training_score|is_promoted|
+-----------+-----------------+---------+----------------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+
|      65438|Sales & Marketing| region_7|Master's & above|     f|           sourcing|              1| 35|                   5|                8|            1|          0|                49|          0|
|      65141|       Operations|region_22|      Bachelor's|     m|              other|              1| 30|                   5|                4|            0|          0|                60|   

In [171]:
dataset.printSchema()

root
 |-- employee_id: string (nullable = true)
 |-- department: string (nullable = true)
 |-- region: string (nullable = true)
 |-- education: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- recruitment_channel: string (nullable = true)
 |-- no_of_trainings: string (nullable = true)
 |-- age: string (nullable = true)
 |-- previous_year_rating: string (nullable = true)
 |-- length_of_service: string (nullable = true)
 |-- KPIs_met >80%: string (nullable = true)
 |-- awards_won?: string (nullable = true)
 |-- avg_training_score: string (nullable = true)
 |-- is_promoted: string (nullable = true)



In [172]:
from pyspark.sql.functions import col, count, isnan, when
# Count the null values in each column
dataset.select([count(when(col(c).isNull(), c)).alias(c) for c in dataset.columns]).show()

+-----------+----------+------+---------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+
|employee_id|department|region|education|gender|recruitment_channel|no_of_trainings|age|previous_year_rating|length_of_service|KPIs_met >80%|awards_won?|avg_training_score|is_promoted|
+-----------+----------+------+---------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+
|          0|         0|     0|     2409|     0|                  0|              0|  0|                4124|                0|            0|          0|                 0|          0|
+-----------+----------+------+---------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+
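
To put these null counts in context, they can be expressed as a share of all rows. The sketch below is plain Python rather than Spark, and it assumes the total of 54808 rows that `count()` reports in this notebook; the variable names are illustrative.

```python
# Null counts copied from the output above; total_rows is the row count
# reported by count() in this notebook (assumed unchanged here).
null_counts = {"education": 2409, "previous_year_rating": 4124}
total_rows = 54808

missing_share = {name: n / total_rows for name, n in null_counts.items()}

for name, share in missing_share.items():
    print(f"{name}: {share:.1%} missing")
```

Both columns are missing in well under a tenth of the rows, which is one reason filling them can be preferable to dropping the rows.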



### Question: Perform Descriptive Statistics on the dataset 

In [173]:
# how='all' drops only rows in which every column is null; rows with some missing values are kept
data_without_missing = dataset.dropna(how='all')

In [174]:
data_without_missing.show(5, truncate=False)

+-----------+-----------------+---------+----------------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+
|employee_id|department       |region   |education       |gender|recruitment_channel|no_of_trainings|age|previous_year_rating|length_of_service|KPIs_met >80%|awards_won?|avg_training_score|is_promoted|
+-----------+-----------------+---------+----------------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+
|65438      |Sales & Marketing|region_7 |Master's & above|f     |sourcing           |1              |35 |5                   |8                |1            |0          |49                |0          |
|65141      |Operations       |region_22|Bachelor's      |m     |other              |1              |30 |5                   |4                |0            |0          |60                |0  

In [175]:
data_without_missing.count()

54808
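
The row count stays at 54808 because `how='all'` removes a row only when every column is null. A plain-Python sketch of the two modes, using made-up rows (it emulates the semantics and is not Spark code; `drop_rows` is a hypothetical helper):

```python
rows = [
    {"education": "Bachelor's", "previous_year_rating": "5"},
    {"education": None, "previous_year_rating": "3"},    # partly missing
    {"education": None, "previous_year_rating": None},   # entirely missing
]

def drop_rows(rows, how):
    """Emulate dropna: 'any' drops rows with at least one null, 'all' only fully-null rows."""
    if how == "any":
        return [r for r in rows if all(v is not None for v in r.values())]
    if how == "all":
        return [r for r in rows if any(v is not None for v in r.values())]
    raise ValueError("how must be 'any' or 'all'")

kept_all = drop_rows(rows, "all")
kept_any = drop_rows(rows, "any")
print(len(kept_all), len(kept_any))
```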

In [176]:
data_without_missing.describe().show(truncate=False, vertical=True)

-RECORD 0------------------------------------
 summary              | count                
 employee_id          | 54808                
 department           | 54808                
 region               | 54808                
 education            | 52399                
 gender               | 54808                
 recruitment_channel  | 54808                
 no_of_trainings      | 54808                
 age                  | 54808                
 previous_year_rating | 50684                
 length_of_service    | 54808                
 KPIs_met >80%        | 54808                
 awards_won?          | 54808                
 avg_training_score   | 54808                
 is_promoted          | 54808                
-RECORD 1------------------------------------
 summary              | mean                 
 employee_id          | 39195.83062691578    
 department           | null                 
 region               | null                 
 education            | null      

In [177]:
# Check the unique values and their counts for education, region, department and recruitment_channel
dataset.groupby("education").count().show()

+----------------+-----+
|       education|count|
+----------------+-----+
|            null| 2409|
| Below Secondary|  805|
|Master's & above|14925|
|      Bachelor's|36669|
+----------------+-----+



In [178]:
dataset.groupby("region").count().show()

+---------+-----+
|   region|count|
+---------+-----+
|region_16| 1465|
| region_2|12343|
|region_28| 1318|
|region_10|  648|
|region_27| 1659|
|region_18|   31|
| region_9|  420|
|region_24|  508|
| region_5|  766|
|region_26| 2260|
|region_32|  945|
|region_13| 2648|
|region_14|  827|
|region_19|  874|
| region_4| 1703|
|region_23| 1175|
|region_33|  269|
|region_12|  500|
|region_20|  850|
| region_8|  655|
+---------+-----+
only showing top 20 rows



In [179]:
# The 'region_' prefix can be stripped from the region names.

In [180]:
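from pyspark.sql.functions import split

# Keep only the part after the underscore, e.g. 'region_7' -> '7'
split_col = split(dataset['region'], '_')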
dataset = dataset.withColumn('region_1', split_col.getItem(1))
dataset.groupby("region_1").count().show()

+--------+-----+
|region_1|count|
+--------+-----+
|       7| 4843|
|      15| 2808|
|      11| 1315|
|      29|  994|
|       3|  346|
|      30|  657|
|      34|  292|
|       8|  655|
|      22| 6428|
|      28| 1318|
|      16| 1465|
|       5|  766|
|      31| 1935|
|      18|   31|
|      27| 1659|
|      17|  796|
|      26| 2260|
|       6|  690|
|      19|  874|
|      23| 1175|
+--------+-----+
only showing top 20 rows
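
The split keeps whatever follows the underscore. A plain-Python sketch of the same parsing with an explicit format check (the helper `region_code` is hypothetical and not part of the notebook). The attribute list describes region as unordered, so the resulting number is still a category label rather than a magnitude.

```python
def region_code(region):
    """Return the numeric suffix of a value such as 'region_7'."""
    prefix, sep, code = region.partition("_")
    if prefix != "region" or not sep or not code.isdigit():
        raise ValueError(f"unexpected region value: {region!r}")
    return int(code)

print([region_code(r) for r in ["region_7", "region_22", "region_2"]])
```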



In [181]:
dataset.groupby("department").count().show()

+-----------------+-----+
|       department|count|
+-----------------+-----+
|               HR| 2418|
|          Finance| 2536|
|        Analytics| 5352|
|            Legal| 1039|
|Sales & Marketing|16840|
|       Technology| 7138|
|      Procurement| 7138|
|       Operations|11348|
|              R&D|  999|
+-----------------+-----+



In [182]:
dataset.groupby("recruitment_channel").count().show()

+-------------------+-----+
|recruitment_channel|count|
+-------------------+-----+
|              other|30446|
|           sourcing|23220|
|           referred| 1142|
+-------------------+-----+



In [183]:
# we can use the label encoding for this variable

In [184]:
dataset.select([count(when(
    col(c).isNull(), c)).alias(c) 
                for c in dataset.columns])

employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,region_1
0,0,0,2409,0,0,0,0,4124,0,0,0,0,0,0


### Question: Remove missing values from the data frame

In [185]:
dataset.columns

['employee_id',
 'department',
 'region',
 'education',
 'gender',
 'recruitment_channel',
 'no_of_trainings',
 'age',
 'previous_year_rating',
 'length_of_service',
 'KPIs_met >80%',
 'awards_won?',
 'avg_training_score',
 'is_promoted',
 'region_1']

In [186]:
dataset = dataset.drop('employee_id','region')

In [187]:
# Please drop employee_id

In [229]:
# Columns that can be cast to integer
int_columns = ['no_of_trainings','is_promoted','KPIs_met >80%','length_of_service','avg_training_score',
 'awards_won?',
 'age','region_1','gender','previous_year_rating']

In [230]:
len(int_columns)

10

In [190]:

dataset.groupby("gender").count().show()

+------+-----+
|gender|count|
+------+-----+
|     m|38496|
|     f|16312|
+------+-----+



In [191]:
from pyspark.sql.functions import when, regexp_replace
# Encode gender as digits: 'm' -> '1', 'f' -> '2' (cast to integer later)
dataset = dataset.withColumn('gender', regexp_replace('gender', 'm', '1')) \
                 .withColumn('gender', regexp_replace('gender', 'f', '2'))

     

In [192]:

dataset.groupby("gender").count().show()

+------+-----+
|gender|count|
+------+-----+
|     1|38496|
|     2|16312|
+------+-----+
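
The chained `regexp_replace` calls work here because the column holds only the single letters `m` and `f`; a regular-expression replace rewrites every match inside a string, not just whole values. A plain-Python sketch of the difference, with a hypothetical `GENDER_CODES` mapping as an exact-match alternative:

```python
import re

GENDER_CODES = {"m": 1, "f": 2}  # hypothetical mapping mirroring the notebook's codes

def encode_by_regex(value):
    # Same idea as the chained regexp_replace calls: substitute every match
    return re.sub("f", "2", re.sub("m", "1", value))

def encode_exact(value):
    # Whole-value lookup; unexpected values become None instead of being half-rewritten
    return GENDER_CODES.get(value)

print(encode_by_regex("m"), encode_by_regex("f"))
print(encode_by_regex("female"))   # a longer value gets partially rewritten
print(encode_exact("female"))
```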



In [193]:
dataset.columns

['department',
 'education',
 'gender',
 'recruitment_channel',
 'no_of_trainings',
 'age',
 'previous_year_rating',
 'length_of_service',
 'KPIs_met >80%',
 'awards_won?',
 'avg_training_score',
 'is_promoted',
 'region_1']

In [195]:
dataset.printSchema()

root
 |-- department: string (nullable = true)
 |-- education: string (nullable = true)
 |-- gender: integer (nullable = true)
 |-- recruitment_channel: string (nullable = true)
 |-- no_of_trainings: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- previous_year_rating: string (nullable = true)
 |-- length_of_service: integer (nullable = true)
 |-- KPIs_met >80%: integer (nullable = true)
 |-- awards_won?: integer (nullable = true)
 |-- avg_training_score: integer (nullable = true)
 |-- is_promoted: integer (nullable = true)
 |-- region_1: integer (nullable = true)



In [247]:
dataset.groupby("previous_year_rating").count().show()
dataset.groupby("education").count().show()

dataset.groupby("recruitment_channel").count().show()
dataset.groupby("department").count().show()

+--------------------+-----+
|previous_year_rating|count|
+--------------------+-----+
|                   1| 6223|
|                   3|18618|
|                   5|11741|
|                   4| 9877|
|                   2| 4225|
|                   0| 4124|
+--------------------+-----+

+----------------+-----+
|       education|count|
+----------------+-----+
|         unknown| 2409|
| Below Secondary|  805|
|Master's & above|14925|
|      Bachelor's|36669|
+----------------+-----+

+-------------------+-----+
|recruitment_channel|count|
+-------------------+-----+
|              other|30446|
|           sourcing|23220|
|           referred| 1142|
+-------------------+-----+

+-----------------+-----+
|       department|count|
+-----------------+-----+
|               HR| 2418|
|          Finance| 2536|
|        Analytics| 5352|
|            Legal| 1039|
|Sales & Marketing|16840|
|       Technology| 7138|
|      Procurement| 7138|
|       Operations|11348|
|              R&D|  999|

In [200]:
dataset = dataset.na.fill({'previous_year_rating': 0, 'education': 'unknown'})

In [None]:
from pyspark.sql.types import IntegerType
for col_name in int_columns:
    dataset = dataset.withColumn(col_name, dataset[col_name].cast(IntegerType()))

In [201]:
dataset

department,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,region_1
Sales & Marketing,Master's & above,2,sourcing,1,35,5,8,1,0,49,0,7
Operations,Bachelor's,1,other,1,30,5,4,0,0,60,0,22
Sales & Marketing,Bachelor's,1,sourcing,1,34,3,7,0,0,50,0,19
Sales & Marketing,Bachelor's,1,other,2,39,1,10,0,0,50,0,23
Technology,Bachelor's,1,other,1,45,3,2,0,0,73,0,26
Analytics,Bachelor's,1,sourcing,2,31,3,7,0,0,85,0,2
Operations,Bachelor's,2,other,1,31,3,5,0,0,59,0,20
Operations,Master's & above,1,sourcing,1,33,3,6,0,0,63,0,34
Analytics,Bachelor's,1,other,1,28,4,5,0,0,83,0,20
Sales & Marketing,Master's & above,1,sourcing,1,32,5,5,1,0,54,0,1


In [228]:
dataset.printSchema()

root
 |-- department: string (nullable = true)
 |-- education: string (nullable = false)
 |-- gender: integer (nullable = true)
 |-- recruitment_channel: string (nullable = true)
 |-- no_of_trainings: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- previous_year_rating: string (nullable = false)
 |-- length_of_service: integer (nullable = true)
 |-- KPIs_met >80%: integer (nullable = true)
 |-- awards_won?: integer (nullable = true)
 |-- avg_training_score: integer (nullable = true)
 |-- is_promoted: integer (nullable = true)
 |-- region_1: integer (nullable = true)



In [227]:
# from pyspark.ml.feature import Imputer
# imputer = Imputer(inputCols=["previous_year_rating", "education"], 
#                   outputCols=["previous_year_rating", "education"])
# model = imputer.fit(dataset)

# imputed_data = model.transform(dataset)

In [203]:
dataset.select([count(when(
    col(c).isNull(), c)).alias(c) 
                for c in dataset.columns])

department,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,region_1
0,0,0,0,0,0,0,0,0,0,0,0,0


In [259]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categoricalColumns = ["recruitment_channel", "education","department"]
stages = []
for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol = categoricalCol, outputCol = categoricalCol + 'Index')
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]

    
    


In [226]:
set(dataset.columns) - set(categoricalColumns)

{'KPIs_met >80%',
 'age',
 'avg_training_score',
 'awards_won?',
 'gender',
 'is_promoted',
 'length_of_service',
 'no_of_trainings',
 'previous_year_rating',
 'region_1'}

In [260]:
stages

[StringIndexer_69635acfda84,
 OneHotEncoder_e6f77dce1c3d,
 StringIndexer_7906cbdb1a1f,
 OneHotEncoder_95bb7b63f7ef,
 StringIndexer_8948e5d6e828,
 OneHotEncoder_ea9d5ccf0acc]
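
Each categorical column passes through two stages: `StringIndexer` assigns an integer index to every category, and `OneHotEncoder` turns that index into an indicator vector. The plain-Python sketch below emulates the default behaviour assumed here (indices by descending frequency, last category dropped from the vector). The tie-break by name is an assumption, the helper names are hypothetical, and the counts are the `recruitment_channel` counts shown earlier.

```python
def fit_index(counts):
    """Map each category to an index, most frequent first (ties broken by name: an assumption)."""
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return {name: i for i, name in enumerate(ordered)}

def one_hot(index, n_categories, drop_last=True):
    """Indicator vector; with drop_last the final category is encoded as all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    return [1.0 if i == index else 0.0 for i in range(size)]

channel_counts = {"other": 30446, "sourcing": 23220, "referred": 1142}
channel_index = fit_index(channel_counts)

for name, idx in channel_index.items():
    print(name, idx, one_hot(idx, len(channel_index)))
```

Under these assumptions a column with k categories contributes k - 1 entries to the assembled feature vector.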

In [261]:
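# Index the target column into 'label'
label_stringIdx = StringIndexer(inputCol = 'is_promoted', outputCol = 'label')
stages += [label_stringIdx]

# Every non-categorical column except the target is a numeric feature.
# A set difference does not preserve column order, so the order of numericCols can vary between sessions.
numericCols = list(set(dataset.columns) - set(categoricalColumns))
numericCols.remove('is_promoted')
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]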

In [262]:
assemblerInputs

['recruitment_channelclassVec',
 'educationclassVec',
 'departmentclassVec',
 'previous_year_rating',
 'KPIs_met >80%',
 'awards_won?',
 'gender',
 'length_of_service',
 'avg_training_score',
 'no_of_trainings',
 'age',
 'region_1']

In [263]:
stages

[StringIndexer_69635acfda84,
 OneHotEncoder_e6f77dce1c3d,
 StringIndexer_7906cbdb1a1f,
 OneHotEncoder_95bb7b63f7ef,
 StringIndexer_8948e5d6e828,
 OneHotEncoder_ea9d5ccf0acc,
 StringIndexer_30ea4fadeb1b,
 VectorAssembler_fef29926ef8e]

### Question: Perform classification using Support Vector Machines Algorithm and analyse the metrics

In [272]:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages = stages)
pipelineModel = pipeline.fit(dataset)
df = pipelineModel.transform(dataset)
selectedCols = ['label', 'features'] + dataset.columns
df = df.select(selectedCols)
df.printSchema()

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- department: string (nullable = true)
 |-- education: string (nullable = false)
 |-- gender: integer (nullable = true)
 |-- recruitment_channel: string (nullable = true)
 |-- no_of_trainings: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- previous_year_rating: integer (nullable = true)
 |-- length_of_service: integer (nullable = true)
 |-- KPIs_met >80%: integer (nullable = true)
 |-- awards_won?: integer (nullable = true)
 |-- avg_training_score: integer (nullable = true)
 |-- is_promoted: integer (nullable = true)
 |-- region_1: integer (nullable = true)



In [271]:
df.columns

['department',
 'education',
 'gender',
 'recruitment_channel',
 'no_of_trainings',
 'age',
 'previous_year_rating',
 'length_of_service',
 'KPIs_met >80%',
 'awards_won?',
 'avg_training_score',
 'is_promoted',
 'region_1',
 'recruitment_channelIndex',
 'recruitment_channelclassVec',
 'educationIndex',
 'educationclassVec',
 'departmentIndex',
 'departmentclassVec',
 'label',
 'features']


In [285]:
feature_columsn = [
 'department',
 'education',
 'gender',
 'recruitment_channel',
 'no_of_trainings',
 'age',
 'previous_year_rating',
 'length_of_service',
 'KPIs_met >80%',
 'awards_won?',
 'avg_training_score',
 'region_1']

In [287]:
df.select(feature_columsn).rdd.map(lambda row: row[0:]).collect()

[('Sales & Marketing',
  "Master's & above",
  2,
  'sourcing',
  1,
  35,
  5,
  8,
  1,
  0,
  49,
  7),
 ('Operations', "Bachelor's", 1, 'other', 1, 30, 5, 4, 0, 0, 60, 22),
 ('Sales & Marketing', "Bachelor's", 1, 'sourcing', 1, 34, 3, 7, 0, 0, 50, 19),
 ('Sales & Marketing', "Bachelor's", 1, 'other', 2, 39, 1, 10, 0, 0, 50, 23),
 ('Technology', "Bachelor's", 1, 'other', 1, 45, 3, 2, 0, 0, 73, 26),
 ('Analytics', "Bachelor's", 1, 'sourcing', 2, 31, 3, 7, 0, 0, 85, 2),
 ('Operations', "Bachelor's", 2, 'other', 1, 31, 3, 5, 0, 0, 59, 20),
 ('Operations', "Master's & above", 1, 'sourcing', 1, 33, 3, 6, 0, 0, 63, 34),
 ('Analytics', "Bachelor's", 1, 'other', 1, 28, 4, 5, 0, 0, 83, 20),
 ('Sales & Marketing',
  "Master's & above",
  1,
  'sourcing',
  1,
  32,
  5,
  5,
  1,
  0,
  54,
  1),
 ('Technology', 'unknown', 1, 'sourcing', 1, 30, 0, 1, 0, 0, 77, 23),
 ('Sales & Marketing', "Bachelor's", 2, 'sourcing', 1, 35, 5, 3, 1, 0, 50, 7),
 ('Sales & Marketing', "Bachelor's", 1, 'sourcing'

In [288]:
df.select("features")

features
"(22,[1,3,5,13,14,..."
"(22,[0,2,6,13,16,..."
"(22,[1,2,5,13,16,..."
"(22,[0,2,5,13,16,..."
"(22,[0,2,8,13,16,..."
"(22,[1,2,9,13,16,..."
"(22,[0,2,6,13,16,..."
"(22,[1,3,6,13,16,..."
"(22,[0,2,9,13,16,..."
"(22,[1,3,5,13,14,..."


In [296]:
dataset_rdd.collect()

[(SparseVector(22, {1: 1.0, 3: 1.0, 5: 1.0, 13: 5.0, 14: 1.0, 16: 2.0, 17: 8.0, 18: 49.0, 19: 1.0, 20: 35.0, 21: 7.0}),),
 (SparseVector(22, {0: 1.0, 2: 1.0, 6: 1.0, 13: 5.0, 16: 1.0, 17: 4.0, 18: 60.0, 19: 1.0, 20: 30.0, 21: 22.0}),),
 (SparseVector(22, {1: 1.0, 2: 1.0, 5: 1.0, 13: 3.0, 16: 1.0, 17: 7.0, 18: 50.0, 19: 1.0, 20: 34.0, 21: 19.0}),),
 (SparseVector(22, {0: 1.0, 2: 1.0, 5: 1.0, 13: 1.0, 16: 1.0, 17: 10.0, 18: 50.0, 19: 2.0, 20: 39.0, 21: 23.0}),),
 (SparseVector(22, {0: 1.0, 2: 1.0, 8: 1.0, 13: 3.0, 16: 1.0, 17: 2.0, 18: 73.0, 19: 1.0, 20: 45.0, 21: 26.0}),),
 (SparseVector(22, {1: 1.0, 2: 1.0, 9: 1.0, 13: 3.0, 16: 1.0, 17: 7.0, 18: 85.0, 19: 2.0, 20: 31.0, 21: 2.0}),),
 (SparseVector(22, {0: 1.0, 2: 1.0, 6: 1.0, 13: 3.0, 16: 2.0, 17: 5.0, 18: 59.0, 19: 1.0, 20: 31.0, 21: 20.0}),),
 (SparseVector(22, {1: 1.0, 3: 1.0, 6: 1.0, 13: 3.0, 16: 1.0, 17: 6.0, 18: 63.0, 19: 1.0, 20: 33.0, 21: 34.0}),),
 (SparseVector(22, {0: 1.0, 2: 1.0, 9: 1.0, 13: 4.0, 16: 1.0, 17: 5.0, 18: 83.0,

### Question: Perform EDA and find any correlation

In [289]:
dataset_rdd = df.select("features").rdd.map(lambda row: row[0:])
corr_mat=Statistics.corr(dataset_rdd, method="spearman")
corr_df = pd.DataFrame(corr_mat)



In [292]:
corr_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
0,1.0,-0.95847,-0.009265,0.008341,0.008382,-0.005147,-0.002252,0.004017,-0.005583,-6e-06,...,0.003455,-0.01111,-0.006551,0.005006,0.00752,0.006491,-0.001326,0.011463,0.015921,0.014436
1,-0.95847,1.0,-1.7e-05,-0.000177,-0.000828,0.012529,0.002123,0.00361,-0.015589,0.003554,...,-0.001403,-0.003969,-0.007178,-0.005902,-0.005472,0.001786,-0.006358,-0.008509,-0.003728,0.00164
2,-0.009265,-1.7e-05,1.0,-0.869773,-0.30486,-0.014095,0.018056,-0.044083,-0.015396,0.051893,...,0.033796,-0.038274,0.007507,0.001885,-0.005042,-0.17301,0.007093,0.041174,-0.285934,0.118712
3,0.008341,-0.000177,-0.869773,1.0,-0.131166,-0.037293,0.007564,0.073096,0.030716,-0.058052,...,-0.03815,0.085861,0.010452,-0.000773,0.030122,0.26955,0.032204,-0.032654,0.413659,-0.092097
4,0.008382,-0.000828,-0.30486,-0.131166,1.0,0.161057,-0.059921,-0.063929,-0.056789,0.030513,...,-0.027195,-0.057421,-0.047133,0.000106,-0.063457,-0.092806,-0.098175,-0.025478,-0.126124,-0.024601
5,-0.005147,0.012529,-0.014095,-0.037293,0.161057,1.0,-0.340312,-0.257708,-0.257708,-0.219084,...,-0.092577,-0.109326,-0.121167,-0.007679,-0.160706,0.006522,-0.719331,0.016394,0.006664,0.008436
6,-0.002252,0.002123,0.018056,0.007564,-0.059921,-0.340312,1.0,-0.197734,-0.197734,-0.168098,...,-0.071032,0.126956,0.084272,-0.000285,0.127993,0.078934,-0.027099,-0.069759,0.104153,0.030311
7,0.004017,0.00361,-0.044083,0.073096,-0.063929,-0.257708,-0.197734,1.0,-0.149738,-0.127296,...,-0.053791,-0.003505,0.020728,0.002018,0.137846,0.042884,0.244841,0.020577,0.05547,-0.066652
8,-0.005583,-0.015589,-0.015396,0.030716,-0.056789,-0.257708,-0.197734,-0.149738,1.0,-0.127296,...,-0.053791,-0.048424,-0.006174,0.007062,0.07868,-0.005814,0.431598,0.008572,-0.006516,-0.028302
9,-6e-06,0.003554,0.051893,-0.058052,0.030513,-0.219084,-0.168098,-0.127296,-0.127296,1.0,...,-0.045729,0.041872,0.035938,0.002445,-0.145173,-0.065924,0.475702,0.074476,-0.107817,0.104285
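
`method="spearman"` requests a rank correlation: each column is replaced by its ranks and the Pearson correlation of the ranks is taken. A minimal plain-Python sketch of that definition, with average ranks for ties (helper names are illustrative, the data are toy values, and this is not how Spark computes it internally):

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

ages = [28, 30, 35, 45]
scores = [83, 60, 49, 73]   # toy values, not a slice of the real data
print(spearman(ages, [a ** 2 for a in ages]))  # perfectly monotone relation
print(spearman(ages, scores))
```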


In [291]:
corr_df.index, corr_df.columns = feature_columsn, feature_columsn

ValueError: Length mismatch: Expected axis has 22 elements, new values have 12 elements
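
The length mismatch arises because `features` has 22 entries while `feature_columsn` lists the 12 source columns: each one-hot encoded column expands to one entry per category minus one. The plain-Python sketch below builds 22 labels from the category counts and the `assemblerInputs` order shown earlier. It assumes the encoder defaults (indices by descending frequency, last category dropped) and an alphabetical tie-break for the two departments with equal counts; `expanded_names` is a hypothetical helper.

```python
category_counts = {
    "recruitment_channel": {"other": 30446, "sourcing": 23220, "referred": 1142},
    "education": {"Bachelor's": 36669, "Master's & above": 14925,
                  "unknown": 2409, "Below Secondary": 805},
    "department": {"Sales & Marketing": 16840, "Operations": 11348,
                   "Technology": 7138, "Procurement": 7138, "Analytics": 5352,
                   "Finance": 2536, "HR": 2418, "Legal": 1039, "R&D": 999},
}
numeric_cols = ["previous_year_rating", "KPIs_met >80%", "awards_won?", "gender",
                "length_of_service", "avg_training_score", "no_of_trainings",
                "age", "region_1"]

def expanded_names(category_counts, numeric_cols):
    names = []
    for column, counts in category_counts.items():
        # Most frequent first; equal counts ordered by name (assumption)
        ordered = sorted(counts, key=lambda c: (-counts[c], c))
        # The last category has no slot of its own when it is dropped
        names += [f"{column}={c}" for c in ordered[:-1]]
    return names + list(numeric_cols)

feature_names = expanded_names(category_counts, numeric_cols)
print(len(feature_names))
```

A list of this length could then be assigned to `corr_df.index` and `corr_df.columns`, provided the assumed ordering matches the fitted pipeline.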

In [290]:
# dataset_rdd.collect()

summary = Statistics.colStats(df.select("features").rdd.map(lambda row: row[0:]))
print(summary.mean())  # a dense vector containing the mean value for each column
print(summary.variance())  # column-wise variance
print(summary.numNonzeros())  # number of nonzeros in each column
print(summary.normL1())  # L1 norm (sum of absolute values) of each column


[5.55502846e-01 4.23660779e-01 6.69044665e-01 2.72314261e-01
 4.39534375e-02 3.07254415e-01 2.07050066e-01 1.30236462e-01
 1.30236462e-01 9.76499781e-02 4.62706174e-02 4.41176471e-02
 1.89570866e-02 3.07874763e+00 3.51974164e-01 2.31717997e-02
 1.29762079e+00 5.86551233e+00 6.33867501e+01 1.25301051e+00
 3.48039155e+01 1.41950445e+01]
[2.46923939e-01 2.44176779e-01 2.21427941e-01 1.98162820e-01
 4.20222995e-02 2.12853023e-01 1.64183332e-01 1.13276993e-01
 1.13276993e-01 8.81160676e-02 4.41304526e-02 4.21720497e-02
 1.85980548e-02 2.23938777e+00 2.28092514e-01 2.26352804e-02
 2.09046468e-01 1.81910284e+01 1.78798603e+02 3.71202643e-01
 5.86781922e+01 1.01732899e+02]
[30446. 23220. 36669. 14925.  2409. 16840. 11348.  7138.  7138.  5352.
  2536.  2418.  1039. 50684. 19291.  1270. 54808. 54808. 54808. 54808.
 54808. 54808.]
[3.044600e+04 2.322000e+04 3.666900e+04 1.492500e+04 2.409000e+03
 1.684000e+04 1.134800e+04 7.138000e+03 7.138000e+03 5.352000e+03
 2.536000e+03 2.418000e+03 1.039000e
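
For a one-hot indicator column these summaries have a simple reading: the mean is the share of rows in that category, and the nonzero count is the number of such rows. A plain-Python check for the first feature, assuming it is the `recruitment_channel = other` indicator (30446 of 54808 rows) and that the reported variance is the sample variance with an n - 1 denominator:

```python
n_rows = 54808
n_ones = 30446   # rows where the indicator equals 1

mean = n_ones / n_rows
# Sample variance of a 0/1 column: sum of squared deviations over (n - 1)
squared_dev = n_ones * (1 - mean) ** 2 + (n_rows - n_ones) * (0 - mean) ** 2
variance = squared_dev / (n_rows - 1)

print(f"mean={mean:.9f} variance={variance:.9f}")
```

Under those assumptions the two values line up with the first entries of the mean and variance vectors printed above.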

In [None]:
# # Apply the model 

In [297]:
train, test = df.randomSplit([0.7, 0.3], seed = 2018)
print("Training Dataset Count: " + str(train.count()))
print("Test Dataset Count: " + str(test.count()))

Training Dataset Count: 38407
Test Dataset Count: 16401


In [298]:
train.columns

['label',
 'features',
 'department',
 'education',
 'gender',
 'recruitment_channel',
 'no_of_trainings',
 'age',
 'previous_year_rating',
 'length_of_service',
 'KPIs_met >80%',
 'awards_won?',
 'avg_training_score',
 'is_promoted',
 'region_1']

In [395]:
# Set up a DataFrame to capture the model performance metrics for comparison
scores = pd.DataFrame(columns = ['model_name', 'accuracy_test', 'precision_test', 
                                 'recall_test', 
                                 'f1_test'
                                  ])

In [397]:
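# Compute accuracy, precision, recall and F1 on a predictions DataFrame and append them to score_df.
# precisionByLabel and recallByLabel are reported for the evaluator's metricLabel, which defaults to 0.0,
# so they describe the "not promoted" class here; the "f1" metric is the weighted F1 over both classes.

def pyrpsark_evalute_metrics(model_name, label_name, prediction_col_name, prediction_dataframe, score_df):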
    
    evaluator = MulticlassClassificationEvaluator( labelCol=label_name, predictionCol=prediction_col_name, metricName="accuracy")
    accuracy = evaluator.evaluate(prediction_dataframe)
    print ("Accuracy :: {0:.2f}".format(accuracy))
    
    evaluator = MulticlassClassificationEvaluator( labelCol=label_name, predictionCol=prediction_col_name, metricName="precisionByLabel")
    precision = evaluator.evaluate(prediction_dataframe)
    print ("Precision :: {0:.2f}".format(precision))
    
    evaluator = MulticlassClassificationEvaluator( labelCol=label_name, predictionCol=prediction_col_name, metricName="f1")
    f1 = evaluator.evaluate(prediction_dataframe)
    print ("f1 :: {0:.2f}".format(f1))
    
    evaluator = MulticlassClassificationEvaluator( labelCol=label_name, predictionCol=prediction_col_name, metricName="recallByLabel")
    recall = evaluator.evaluate(prediction_dataframe)
    print ("Recall :: {0:.2f}".format(recall))
    
    score_df.loc[len(score_df)] = [model_name, accuracy,precision, recall, f1]
    return score_df

## Logistic Regression

In [401]:
log_reg = LogisticRegression(labelCol="label", featuresCol="features",maxIter=40)
model=log_reg.fit(train)
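
Logistic regression scores each row with a weighted sum of its features, maps that score to a probability with the sigmoid function, and predicts class 1 when the probability exceeds a threshold (0.5 by default in Spark's `LogisticRegression`). A plain-Python sketch with made-up weights, not the fitted coefficients:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def predict(features, weights, intercept, threshold=0.5):
    """Return (probability of class 1, predicted class) for one row."""
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(margin)
    return probability, 1.0 if probability > threshold else 0.0

# Hypothetical two-feature model: a KPIs-met flag and an average training score
weights, intercept = [2.0, 0.05], -5.0
print(predict([1, 80], weights, intercept))   # positive margin
print(predict([0, 49], weights, intercept))   # negative margin
```

Lowering `threshold` flags more candidates as promoted, which trades precision for recall on the promoted class.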

In [402]:
prediction_test=model.transform(test)

In [403]:
# prediction_test.show()

In [404]:
prediction_test.select("label","prediction").show(10)

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       1.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       1.0|
|  0.0|       1.0|
+-----+----------+
only showing top 10 rows



In [405]:
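# Build the RDD passed to BinaryClassificationMetrics below.
# That class expects (score, label) pairs; the columns selected here are (label, prediction) with hard 0/1 predictions.
predictionAndLabels = prediction_test.select("label","prediction").rdd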

In [406]:
predictionAndLabels.collect()

[Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=1.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=1.0),
 Row(label=0.0, prediction=1.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=1.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label=0.0, prediction=0.0),
 Row(label

In [407]:
metrics = BinaryClassificationMetrics(predictionAndLabels)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)

Area under ROC = 0.7842906905879232
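
`BinaryClassificationMetrics` takes pairs in (score, label) order, so the order of the selected columns matters. A plain-Python sketch of the area under the ROC curve as the probability that a random positive outranks a random negative, evaluated with both argument orders on toy data (the `auc` helper is illustrative, not the Spark implementation):

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]                       # toy model scores
predictions = [1 if p > 0.5 else 0 for p in probabilities]  # hard 0/1 labels

auc_scores = auc(probabilities, labels)    # (score, label) order
auc_swapped = auc(labels, predictions)     # arguments swapped
print(auc_scores, auc_swapped)
```

The two orders give different values on the same data. A hard 0/1 prediction also reduces the ROC curve to a single operating point, so a continuous score such as the model's probability is the more informative input.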


In [408]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy_LR = evaluator.evaluate(prediction_test)
print ("Accuracy = " ,accuracy_LR)

Accuracy =  0.921163343698555


In [409]:
scores = pyrpsark_evalute_metrics("Logistic Regression", "label", "prediction", prediction_test, scores)

Accuracy :: 0.92
Precision :: 0.93
f1 :: 0.90
Recall :: 0.99


In [410]:
scores

Unnamed: 0,model_name,accuracy_test,precision_test,recall_test,f1_test
0,Logistic Regression,0.915066,0.915066,1.0,0.874483
1,Logistic Regression,0.921163,0.927338,0.991538,0.899039


## Naive Bayes

In [411]:
naive_bayes = NaiveBayes(featuresCol='features',labelCol='label',smoothing=1.0)
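
The `smoothing` argument is additive (Laplace) smoothing: a constant is added to every feature count before the per-class likelihoods are normalised, so a feature never seen with a class does not receive probability zero. A plain-Python sketch with made-up counts (`smoothed_likelihoods` is a hypothetical helper):

```python
def smoothed_likelihoods(feature_counts, smoothing=1.0):
    """Per-feature likelihoods for one class with additive smoothing."""
    total = sum(feature_counts) + smoothing * len(feature_counts)
    return [(c + smoothing) / total for c in feature_counts]

counts_for_class = [3, 0, 1]   # hypothetical feature totals within one class
unsmoothed = smoothed_likelihoods(counts_for_class, smoothing=0.0)
smoothed = smoothed_likelihoods(counts_for_class, smoothing=1.0)
print(unsmoothed)
print(smoothed)
```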

In [412]:
model = naive_bayes.fit(train) 

In [413]:
# Generate predictions on the test set
prediction_test = model.transform(test)

In [414]:
prediction_test.show()

+-----+--------------------+-----------------+----------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+--------+--------------------+--------------------+----------+
|label|            features|       department| education|gender|recruitment_channel|no_of_trainings|age|previous_year_rating|length_of_service|KPIs_met >80%|awards_won?|avg_training_score|is_promoted|region_1|       rawPrediction|         probability|prediction|
+-----+--------------------+-----------------+----------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+--------+--------------------+--------------------+----------+
|  0.0|(22,[0,2,5,13,14,...|Sales & Marketing|Bachelor's|     1|              other|              2| 28|                   1|                2|            1|          1|                49|          0|      20|[-

In [415]:
prediction_test.select("label","prediction").show(10)

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 10 rows
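
The first ten rows are all `0.0 / 0.0`, which says little about how the minority class is handled. A confusion-count summary is usually more informative. The sketch below is plain Python over hypothetical `(label, prediction)` pairs; the function name `confusion_counts` and the sample data are illustrative and do not come from the notebook.

```python
from collections import Counter

def confusion_counts(pairs):
    """Count (label, prediction) pairs into TN/FP/FN/TP for a binary problem."""
    counts = Counter((int(label), int(pred)) for label, pred in pairs)
    return {
        "tn": counts[(0, 0)],
        "fp": counts[(0, 1)],
        "fn": counts[(1, 0)],
        "tp": counts[(1, 1)],
    }

# Hypothetical (label, prediction) pairs, not taken from the notebook's test set.
sample_pairs = [(0.0, 0.0)] * 8 + [(1.0, 0.0), (1.0, 1.0)]
print(confusion_counts(sample_pairs))
```

On the Spark side, `prediction_test.groupBy("label", "prediction").count().show()` should give the same four counts without collecting rows to the driver.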



In [416]:
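# Note: BinaryClassificationMetrics expects (score, label) pairs, but this selects (label, prediction),
# so the area under ROC reported below is computed with the two columns swapped.
predictionAndLabels = prediction_test.select("label","prediction").rdd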

In [417]:
# Compute test accuracy from the label and prediction columns
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy_NB = evaluator.evaluate(prediction_test)

In [418]:
print ("Accuracy",accuracy_NB)

Accuracy 0.8933601609657947


In [419]:
metrics = BinaryClassificationMetrics(predictionAndLabels)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)

Area under ROC = 0.5980957586680652
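
An accuracy near 0.89 alongside an area under ROC near 0.60 is typical of imbalanced data scored with hard 0/1 predictions. When the "score" is a hard prediction, the ROC curve has a single interior point and its area reduces to the mean of the true-positive and true-negative rates. The sketch below illustrates that identity with hypothetical confusion counts; the function names and the numbers are assumptions for illustration, not values from this notebook.

```python
def auc_from_hard_predictions(tn, fp, fn, tp):
    """AUC of the one-threshold ROC curve produced by 0/1 predictions.

    The curve passes through (0, 0), (FPR, TPR) and (1, 1), so the
    trapezoidal area simplifies to (TPR + TNR) / 2.
    """
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2

def accuracy(tn, fp, fn, tp):
    """Share of rows whose prediction matches the label."""
    return (tp + tn) / (tn + fp + fn + tp)

# Hypothetical counts for an imbalanced test set (90% negatives).
counts = dict(tn=880, fp=20, fn=80, tp=20)
print("accuracy:", accuracy(**counts))
print("AUC     :", auc_from_hard_predictions(**counts))
```

Accuracy is dominated by the large negative class, while the AUC weights both classes equally, so a model that misses most positives scores high on the first and low on the second.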


In [420]:
scores = pyrpsark_evalute_metrics("NB", "label", "prediction", prediction_test, scores)

Accuracy :: 0.89
Precision :: 0.92
f1 :: 0.88
Recall :: 0.96


## SVM

In [421]:
SVM = LinearSVC(featuresCol='features',labelCol='label')

In [422]:
model = SVM.fit(train) 

In [423]:
# Score the held-out test set with the fitted linear SVM.
prediction_test = model.transform(test)

In [424]:
prediction_test.show()

+-----+--------------------+-----------------+----------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+--------+--------------------+----------+
|label|            features|       department| education|gender|recruitment_channel|no_of_trainings|age|previous_year_rating|length_of_service|KPIs_met >80%|awards_won?|avg_training_score|is_promoted|region_1|       rawPrediction|prediction|
+-----+--------------------+-----------------+----------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+--------+--------------------+----------+
|  0.0|(22,[0,2,5,13,14,...|Sales & Marketing|Bachelor's|     1|              other|              2| 28|                   1|                2|            1|          1|                49|          0|      20|[0.96209304609969...|       0.0|
|  0.0|(22,[0,2,5,13,14,...|Sale

In [425]:
prediction_test.select("label","prediction").show(10)

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 10 rows



In [426]:
# BinaryClassificationMetrics expects (score, label) pairs, so the prediction column comes first
predictionAndLabels = prediction_test.select("prediction","label").rdd

In [427]:
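# Only three of the five required arguments are passed (the model name and the scores
# DataFrame are missing), which raises the TypeError below; In [428] makes the full call.
pyrpsark_evalute_metrics('label',"prediction", prediction_test)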

TypeError: pyrpsark_evalute_metrics() missing 2 required positional arguments: 'prediction_dataframe' and 'score_df'

In [None]:
# Compute test accuracy from the label and prediction columns
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy_SVM = evaluator.evaluate(prediction_test)

In [None]:
print ("Accuracy",accuracy_SVM)

In [None]:
metrics = BinaryClassificationMetrics(predictionAndLabels)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)

In [428]:
scores = pyrpsark_evalute_metrics("SVM", "label", "prediction", prediction_test, scores)

Accuracy :: 0.92
Precision :: 0.92
f1 :: 0.87
Recall :: 1.00
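
Identical accuracy and precision together with a recall of 1.00 is the pattern a majority-class predictor produces, so it may be worth checking whether this model predicts label 0 for every row. The sketch below computes what such a baseline would score. It assumes (the helper's source is not shown here) that precision and recall are reported for label 0 and that F1 is weighted by class frequency; the function name and the class counts are hypothetical.

```python
def majority_baseline_metrics(n_negative, n_positive):
    """Metrics for a classifier that predicts label 0 for every row.

    Assumption: precision and recall are reported for label 0,
    and F1 is the class-frequency-weighted average.
    """
    total = n_negative + n_positive
    accuracy = n_negative / total
    precision_0 = n_negative / total  # every row is predicted 0
    recall_0 = 1.0                    # every true 0 is recovered
    f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
    f1_1 = 0.0                        # label 1 is never predicted
    weighted_f1 = (n_negative * f1_0 + n_positive * f1_1) / total
    return accuracy, precision_0, recall_0, weighted_f1

# Hypothetical class counts chosen to give 91.5% negatives.
acc, prec, rec, f1 = majority_baseline_metrics(n_negative=9150, n_positive=850)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Under these assumptions, a negative share of about 0.915 gives a weighted F1 of roughly 0.874, which is close to the figures printed above.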


## GBTClassifier

In [429]:
gradient_boost_class = GBTClassifier(labelCol="label", featuresCol="features")

In [430]:
model_gb = gradient_boost_class.fit(train)

In [431]:
prediction_test = model_gb.transform(test)

In [432]:
prediction_test.show()

+-----+--------------------+-----------------+----------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+--------+--------------------+--------------------+----------+
|label|            features|       department| education|gender|recruitment_channel|no_of_trainings|age|previous_year_rating|length_of_service|KPIs_met >80%|awards_won?|avg_training_score|is_promoted|region_1|       rawPrediction|         probability|prediction|
+-----+--------------------+-----------------+----------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+--------+--------------------+--------------------+----------+
|  0.0|(22,[0,2,5,13,14,...|Sales & Marketing|Bachelor's|     1|              other|              2| 28|                   1|                2|            1|          1|                49|          0|      20|[1

In [433]:
prediction_test.select("label","prediction").show(10)

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 10 rows



In [434]:
# BinaryClassificationMetrics expects (score, label) pairs, so the prediction column comes first
predictionAndLabels = prediction_test.select("prediction","label").rdd

In [None]:
# Incomplete call (model name and scores DataFrame missing); the full call is made in In [435].
# pyrpsark_evalute_metrics('label',"prediction", prediction_test)

In [None]:
metrics = BinaryClassificationMetrics(predictionAndLabels)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)

In [None]:
# Compute test accuracy from the label and prediction columns
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy_GBT = evaluator.evaluate(prediction_test)

In [None]:
print ("Accuracy",accuracy_GBT)

In [435]:
scores = pyrpsark_evalute_metrics("GBT", "label", "prediction", prediction_test, scores)

Accuracy :: 0.94
Precision :: 0.94
f1 :: 0.93
Recall :: 1.00


## RandomForestClassifier

In [436]:
random_forest_classifier = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=40)

In [437]:
model = random_forest_classifier.fit(train)

In [438]:
prediction_test = model.transform(test)

In [439]:
prediction_test.show()

+-----+--------------------+-----------------+----------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+--------+--------------------+--------------------+----------+
|label|            features|       department| education|gender|recruitment_channel|no_of_trainings|age|previous_year_rating|length_of_service|KPIs_met >80%|awards_won?|avg_training_score|is_promoted|region_1|       rawPrediction|         probability|prediction|
+-----+--------------------+-----------------+----------+------+-------------------+---------------+---+--------------------+-----------------+-------------+-----------+------------------+-----------+--------+--------------------+--------------------+----------+
|  0.0|(22,[0,2,5,13,14,...|Sales & Marketing|Bachelor's|     1|              other|              2| 28|                   1|                2|            1|          1|                49|          0|      20|[3

In [440]:
prediction_test.select("label","prediction").show(10)

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 10 rows



In [441]:
# BinaryClassificationMetrics expects (score, label) pairs, so the prediction column comes first
predictionAndLabels = prediction_test.select("prediction","label").rdd

In [None]:
metrics = BinaryClassificationMetrics(predictionAndLabels)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)

In [442]:
scores = pyrpsark_evalute_metrics("Random Forest", "label", "prediction", prediction_test, scores)

Accuracy :: 0.92
Precision :: 0.92
f1 :: 0.89
Recall :: 1.00


In [445]:
scores.sort_values('precision_test', ascending=False)

| Unnamed: 0 | model_name | accuracy_test | precision_test | recall_test | f1_test |
|---|---|---|---|---|---|
| 4 | GBT | 0.940491 | 0.941091 | 0.997401 | 0.927204 |
| 1 | Logistic Regression | 0.921163 | 0.927338 | 0.991538 | 0.899039 |
| 2 | NB | 0.89336 | 0.924397 | 0.962154 | 0.879386 |
| 5 | Random Forest | 0.921956 | 0.921622 | 0.999733 | 0.890758 |
| 0 | Logistic Regression | 0.915066 | 0.915066 | 1.0 | 0.874483 |
| 3 | SVM | 0.915066 | 0.915066 | 1.0 | 0.874483 |
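
Sorting by precision alone can order models differently than a metric that also accounts for the other class. As a small sketch, the snippet below ranks the rows of the scores table above by precision and by F1 in plain Python; the helper name `rank_by` is illustrative, and the values are copied from the table.

```python
# Rows copied from the scores table above:
# (model_name, accuracy_test, precision_test, recall_test, f1_test)
rows = [
    ("Logistic Regression", 0.915066, 0.915066, 1.0, 0.874483),
    ("Logistic Regression", 0.921163, 0.927338, 0.991538, 0.899039),
    ("NB", 0.89336, 0.924397, 0.962154, 0.879386),
    ("SVM", 0.915066, 0.915066, 1.0, 0.874483),
    ("GBT", 0.940491, 0.941091, 0.997401, 0.927204),
    ("Random Forest", 0.921956, 0.921622, 0.999733, 0.890758),
]

def rank_by(rows, column_index):
    """Return rows sorted from best to worst on the given metric column."""
    return sorted(rows, key=lambda row: row[column_index], reverse=True)

by_precision = [row[0] for row in rank_by(rows, 2)]
by_f1 = [row[0] for row in rank_by(rows, 4)]
print(by_precision[:4])
print(by_f1[:4])
```

GBT leads on both metrics, while NB and Random Forest swap places depending on which column is used.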


In [None]:
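# prediction_test still holds the Random Forest predictions at this point
accuracy_RF = evaluator.evaluate(prediction_test)
print("Accuracy {0:.2f}".format(accuracy_RF))

In [None]:
print("Accuracy of GBT : ", accuracy_GBT)
print("Accuracy of LR : ", accuracy_LR)
print("Accuracy of NB : ", accuracy_NB)
print("Accuracy of RF : ", accuracy_RF)
# SVM accuracy is read from the scores table
print("Accuracy of SVM : ", scores.loc[scores["model_name"] == "SVM", "accuracy_test"].iloc[0])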


In [None]:
scores