## Data Science Nigeria Staff Promotion Algorithm
Author: 🧕🏿 Hasanat Owoseni\
Date: 25th September, 2019

### STEPS 
1. Import libraries and dataset
2. Merge the datasets (train and test)
3. Normalize the column names: uppercase to lowercase, no special characters except underscores

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### The Dataset 

In [3]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [4]:
df_train.head(3)

Unnamed: 0,EmployeeNo,Division,Qualification,Gender,Channel_of_Recruitment,Trainings_Attended,Year_of_birth,Last_performance_score,Year_of_recruitment,Targets_met,Previous_Award,Training_score_average,State_Of_Origin,Foreign_schooled,Marital_Status,Past_Disciplinary_Action,Previous_IntraDepartmental_Movement,No_of_previous_employers,Promoted_or_Not
0,YAK/S/00001,Commercial Sales and Marketing,"MSc, MBA and PhD",Female,Direct Internal process,2,1986,12.5,2011,1,0,41,ANAMBRA,No,Married,No,No,0,0
1,YAK/S/00002,Customer Support and Field Operations,First Degree or HND,Male,Agency and others,2,1991,12.5,2015,0,0,52,ANAMBRA,Yes,Married,No,No,0,0
2,YAK/S/00003,Commercial Sales and Marketing,First Degree or HND,Male,Direct Internal process,2,1987,7.5,2012,0,0,42,KATSINA,Yes,Married,No,No,0,0


In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38312 entries, 0 to 38311
Data columns (total 19 columns):
EmployeeNo                             38312 non-null object
Division                               38312 non-null object
Qualification                          36633 non-null object
Gender                                 38312 non-null object
Channel_of_Recruitment                 38312 non-null object
Trainings_Attended                     38312 non-null int64
Year_of_birth                          38312 non-null int64
Last_performance_score                 38312 non-null float64
Year_of_recruitment                    38312 non-null int64
Targets_met                            38312 non-null int64
Previous_Award                         38312 non-null int64
Training_score_average                 38312 non-null int64
State_Of_Origin                        38312 non-null object
Foreign_schooled                       38312 non-null object
Marital_Status                         383

In [6]:
df_test.head(3)

Unnamed: 0,EmployeeNo,Division,Qualification,Gender,Channel_of_Recruitment,Trainings_Attended,Year_of_birth,Last_performance_score,Year_of_recruitment,Targets_met,Previous_Award,Training_score_average,State_Of_Origin,Foreign_schooled,Marital_Status,Past_Disciplinary_Action,Previous_IntraDepartmental_Movement,No_of_previous_employers
0,YAK/S/00005,Information Technology and Solution Support,First Degree or HND,Male,Agency and others,2,1976,7.5,2017,0,0,65,FCT,Yes,Married,No,No,1
1,YAK/S/00011,Information Technology and Solution Support,,Male,Direct Internal process,2,1991,0.0,2018,0,0,69,OGUN,Yes,Married,No,No,1
2,YAK/S/00015,Research and Innovation,"MSc, MBA and PhD",Male,Direct Internal process,2,1984,7.5,2012,0,0,76,KANO,Yes,Married,No,No,1


In [7]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16496 entries, 0 to 16495
Data columns (total 18 columns):
EmployeeNo                             16496 non-null object
Division                               16496 non-null object
Qualification                          15766 non-null object
Gender                                 16496 non-null object
Channel_of_Recruitment                 16496 non-null object
Trainings_Attended                     16496 non-null int64
Year_of_birth                          16496 non-null int64
Last_performance_score                 16496 non-null float64
Year_of_recruitment                    16496 non-null int64
Targets_met                            16496 non-null int64
Previous_Award                         16496 non-null int64
Training_score_average                 16496 non-null int64
State_Of_Origin                        16496 non-null object
Foreign_schooled                       16496 non-null object
Marital_Status                         164

### Dataframe merging 
The test dataframe doesn't include the `Promoted_or_Not` column.\
I'm merging the two so the train and test data can be cleaned together as one dataframe, which is faster
than cleaning them individually.

Hence, I will add the missing `Promoted_or_Not` column to the test dataframe and populate it with a placeholder constant: `15`


In [8]:
df_test['Promoted_or_Not'] = 15

In [16]:
df = pd.concat([df_train, df_test])
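Once the cleaning is done, the merged dataframe has to be separated again, and the placeholder value makes that possible. Here is a minimal sketch of the round trip on toy data. The names `toy_train`, `toy_test` and `SENTINEL` are my own illustrative assumptions, not part of the dataset, and the sketch assumes the placeholder never occurs as a real label.

```python
import pandas as pd

# Toy stand-ins for df_train and df_test (values are made up for illustration)
toy_train = pd.DataFrame({"EmployeeNo": ["A1", "A2"], "Promoted_or_Not": [0, 1]})
toy_test = pd.DataFrame({"EmployeeNo": ["B1"]})

SENTINEL = 15  # placeholder label that marks rows coming from the test set
toy_test["Promoted_or_Not"] = SENTINEL

# Stack the two frames; pd.concat keeps each frame's original row labels
merged = pd.concat([toy_train, toy_test])

# ... shared cleaning steps would go here ...

# Recover the two parts by filtering on the placeholder
train_part = merged[merged["Promoted_or_Not"] != SENTINEL]
test_part = merged[merged["Promoted_or_Not"] == SENTINEL].drop(columns="Promoted_or_Not")
```

Because `pd.concat` keeps the original row labels, labels repeat across the two parts of the merged frame. Passing `ignore_index=True` is an option if unique labels are needed.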

In [17]:
# Normalize the column names: strip whitespace, lowercase, replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
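The line above only replaces spaces. If a dataset also had other special characters in its column names, a regular expression could enforce the "underscore only" rule from the steps list. This is a sketch under that assumption; `normalize_column` and the toy column names are hypothetical and not part of this notebook's data.

```python
import re
import pandas as pd

def normalize_column(name: str) -> str:
    """Lowercase a column name and collapse non-alphanumeric runs to '_'."""
    name = name.strip().lower()
    name = re.sub(r"[^0-9a-z]+", "_", name)  # spaces, hyphens, etc. become '_'
    return name.strip("_")                   # drop leading/trailing underscores

# Toy column names (made up) to show the effect
toy = pd.DataFrame(columns=[" Employee No", "State-Of Origin", "Promoted_or_Not"])
toy.columns = [normalize_column(c) for c in toy.columns]
print(list(toy.columns))
```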

In [19]:
df.head(3)

Unnamed: 0,employeeno,division,qualification,gender,channel_of_recruitment,trainings_attended,year_of_birth,last_performance_score,year_of_recruitment,targets_met,previous_award,training_score_average,state_of_origin,foreign_schooled,marital_status,past_disciplinary_action,previous_intradepartmental_movement,no_of_previous_employers,promoted_or_not
0,YAK/S/00001,Commercial Sales and Marketing,"MSc, MBA and PhD",Female,Direct Internal process,2,1986,12.5,2011,1,0,41,ANAMBRA,No,Married,No,No,0,0
1,YAK/S/00002,Customer Support and Field Operations,First Degree or HND,Male,Agency and others,2,1991,12.5,2015,0,0,52,ANAMBRA,Yes,Married,No,No,0,0
2,YAK/S/00003,Commercial Sales and Marketing,First Degree or HND,Male,Direct Internal process,2,1987,7.5,2012,0,0,42,KATSINA,Yes,Married,No,No,0,0


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54808 entries, 0 to 16495
Data columns (total 19 columns):
employeeno                             54808 non-null object
division                               54808 non-null object
qualification                          52399 non-null object
gender                                 54808 non-null object
channel_of_recruitment                 54808 non-null object
trainings_attended                     54808 non-null int64
year_of_birth                          54808 non-null int64
last_performance_score                 54808 non-null float64
year_of_recruitment                    54808 non-null int64
targets_met                            54808 non-null int64
previous_award                         54808 non-null int64
training_score_average                 54808 non-null int64
state_of_origin                        54808 non-null object
foreign_schooled                       54808 non-null object
marital_status                         548

## Dealing with Missing Values
All the columns (features) except `qualification` have complete values; `qualification` has 2,409 missing entries (54,808 rows, 52,399 non-null).\
Let's view the rows with missing values

In [25]:
df[df['qualification'].isnull()].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2409 entries, 15 to 16469
Data columns (total 19 columns):
employeeno                             2409 non-null object
division                               2409 non-null object
qualification                          0 non-null object
gender                                 2409 non-null object
channel_of_recruitment                 2409 non-null object
trainings_attended                     2409 non-null int64
year_of_birth                          2409 non-null int64
last_performance_score                 2409 non-null float64
year_of_recruitment                    2409 non-null int64
targets_met                            2409 non-null int64
previous_award                         2409 non-null int64
training_score_average                 2409 non-null int64
state_of_origin                        2409 non-null object
foreign_schooled                       2409 non-null object
marital_status                         2409 non-null object
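A more compact way to get the same information is to count the missing values per column directly. The sketch below uses a toy dataframe with made-up values so that it runs on its own; on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps only in 'qualification'
toy = pd.DataFrame({
    "qualification": ["First Degree or HND", np.nan, "MSc, MBA and PhD", np.nan],
    "gender": ["Male", "Female", "Male", "Female"],
})

missing_counts = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```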

In [27]:
df[df['qualification'].isnull()]

Unnamed: 0,employeeno,division,qualification,gender,channel_of_recruitment,trainings_attended,year_of_birth,last_performance_score,year_of_recruitment,targets_met,previous_award,training_score_average,state_of_origin,foreign_schooled,marital_status,past_disciplinary_action,previous_intradepartmental_movement,no_of_previous_employers,promoted_or_not
15,YAK/S/00022,Customer Support and Field Operations,,Male,Direct Internal process,2,1980,10.0,2008,0,0,49,RIVERS,Yes,Married,No,No,1,0
22,YAK/S/00033,Commercial Sales and Marketing,,Female,Direct Internal process,2,1997,2.5,2017,0,0,40,EDO,Yes,Married,No,No,1,0
28,YAK/S/00044,Commercial Sales and Marketing,,Male,Agency and others,4,1997,5.0,2017,0,0,40,CROSS RIVER,Yes,Married,No,No,1,0
60,YAK/S/00091,Commercial Sales and Marketing,,Female,Direct Internal process,2,2001,0.0,2018,0,0,47,ZAMFARA,Yes,Single,No,No,2,0
137,YAK/S/00190,Customer Support and Field Operations,,Female,Agency and others,2,1988,10.0,2010,0,0,56,LAGOS,Yes,Single,No,No,5,0
168,YAK/S/00232,Commercial Sales and Marketing,,Male,Agency and others,2,1999,10.0,2017,0,0,43,LAGOS,Yes,Single,No,No,0,0
198,YAK/S/00278,Information and Strategy,,Male,Direct Internal process,2,1990,7.5,2015,0,0,77,LAGOS,Yes,Married,No,No,1,0
240,YAK/S/00337,Customer Support and Field Operations,,Male,Agency and others,2,1973,7.5,2011,0,0,49,ADAMAWA,No,Married,No,No,0,0
253,YAK/S/00353,Customer Support and Field Operations,,Female,Agency and others,2,1963,10.0,2001,1,0,52,OGUN,Yes,Married,No,No,0,0
255,YAK/S/00355,Commercial Sales and Marketing,,Male,Agency and others,2,1997,10.0,2017,0,0,43,FCT,Yes,Married,No,No,0,0


Next, I check the unique values in the `qualification` column

In [28]:
df['qualification'].unique()

array(['MSc, MBA and PhD', 'First Degree or HND', nan,
       'Non-University Education'], dtype=object)
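`unique()` shows which values exist but not how often each one occurs. A possible follow-up is `value_counts(dropna=False)`, which also counts the missing entries. The sketch below runs on a toy series with made-up values; on the real data it would be called on `df['qualification']`.

```python
import numpy as np
import pandas as pd

# Toy series (made-up values) standing in for df['qualification']
toy_qualification = pd.Series(
    ["First Degree or HND", np.nan, "MSc, MBA and PhD", "First Degree or HND"],
    name="qualification",
)

# dropna=False keeps NaN as its own category in the counts
counts = toy_qualification.value_counts(dropna=False)
print(counts)
```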