## Brief note on the Problem Statement
* Task: <b>Imbalanced Classification (is_promoted: 0 or 1)</b>

<p>Your client is a large MNC with 9 broad verticals across the organisation. One of the problems the client faces is identifying the right people for promotion (only for manager position and below) and preparing them in time. The process they currently follow is:</p>

1. They first identify a set of employees based on recommendations and past performance. Selected employees go through a separate training and evaluation program for each vertical. These programs are based on the skills required in each vertical.

2. At the end of the program, an employee gets promoted based on various factors such as training performance and KPI completion (only employees with KPIs completed greater than 60% are considered).


## Hypothesis on the approach and brief description of some thoughts

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [14]:
raw_data = pd.read_csv("./train_LZdllcl.csv")

In [15]:
raw_data.head(4)

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0


In [16]:
raw_data.columns

Index(['employee_id', 'department', 'region', 'education', 'gender',
       'recruitment_channel', 'no_of_trainings', 'age', 'previous_year_rating',
       'length_of_service', 'KPIs_met >80%', 'awards_won?',
       'avg_training_score', 'is_promoted'],
      dtype='object')

## EDA and brief notes on the findings

In [17]:
raw_data.is_promoted.value_counts()

0    50140
1     4668
Name: is_promoted, dtype: int64
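
The counts above make the imbalance concrete. A minimal sketch, using only the two counts printed above (the variable names are illustrative and not part of the original notebook), of the positive rate and of the accuracy a trivial "never promoted" classifier would reach. This is why plain accuracy is a weak metric for this task:

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 50140, 1: 4668}, name="is_promoted")

positive_rate = counts[1] / counts.sum()       # share of promoted employees
baseline_accuracy = counts[0] / counts.sum()   # accuracy of always predicting 0
imbalance_ratio = counts[0] / counts[1]        # negatives per positive

print(f"positive rate:          {positive_rate:.4f}")
print(f"baseline accuracy:      {baseline_accuracy:.4f}")
print(f"negatives per positive: {imbalance_ratio:.2f}")
```

Metrics that focus on the minority class, such as precision, recall, or F1, are usually more informative in this setting.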

In [18]:
raw_data.isnull().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

In [19]:
raw_data.education.value_counts()

Bachelor's          36669
Master's & above    14925
Below Secondary       805
Name: education, dtype: int64

In [20]:
# Fill missing education values with the mode ("Bachelor's", per the counts above)
raw_data['education'] = raw_data['education'].fillna("Bachelor's")

In [21]:
raw_data.isnull().sum()

employee_id                0
department                 0
region                     0
education                  0
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

In [23]:
# Fill missing previous_year_rating values with the mode (3)
raw_data['previous_year_rating'] = raw_data['previous_year_rating'].fillna(3)

In [25]:
raw_data.isnull().sum()

employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
KPIs_met >80%           0
awards_won?             0
avg_training_score      0
is_promoted             0
dtype: int64
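
The two imputation cells above hard-code the fill values. A possible variant, sketched on a small made-up frame (the `toy` frame and the `_was_missing` columns are assumptions, not part of the original notebook), computes the mode from the data and keeps a flag for rows that were originally missing, since missingness itself may carry signal:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_data; the values are made up for illustration.
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's"],
    "previous_year_rating": [3.0, np.nan, 5.0, 3.0],
})

for col in ["education", "previous_year_rating"]:
    # Record which rows were missing before they are overwritten.
    toy[col + "_was_missing"] = toy[col].isnull().astype(int)
    # mode() returns a Series (ties are possible); take the first entry.
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Whether the extra indicator columns help is an empirical question; they are cheap to add and easy to drop.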

In [26]:
from sklearn import model_selection

In [27]:
feature_list1 = ['department', 'region', 'education', 'gender',
       'recruitment_channel', 'no_of_trainings', 'age', 'previous_year_rating',
       'length_of_service', 'KPIs_met >80%', 'awards_won?',
       'avg_training_score']
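
With `model_selection` imported and the feature list defined, one way to hold out validation data is a stratified split, which keeps the share of promoted employees roughly equal in both parts. The sketch below runs on synthetic data; the `toy` frame, `test_size=0.2`, and `random_state=0` are assumptions chosen for illustration:

```python
import pandas as pd
from sklearn import model_selection

# Synthetic stand-in for raw_data: 100 rows, 10% positives (made-up values).
toy = pd.DataFrame({
    "avg_training_score": list(range(100)),
    "is_promoted": [1] * 10 + [0] * 90,
})

X_train, X_valid, y_train, y_valid = model_selection.train_test_split(
    toy[["avg_training_score"]],
    toy["is_promoted"],
    test_size=0.2,                # assumed hold-out fraction
    stratify=toy["is_promoted"],  # preserve the class ratio in both parts
    random_state=0,
)

# Both means should be close to the overall positive rate of the toy data.
print(y_train.mean(), y_valid.mean())
```

Without `stratify`, a random split of a heavily imbalanced target can leave the validation part with too few positives to evaluate on.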

In [30]:
raw_data[feature_list1].dtypes

department               object
region                   object
education                object
gender                   object
recruitment_channel      object
no_of_trainings           int64
age                       int64
previous_year_rating    float64
length_of_service         int64
KPIs_met >80%             int64
awards_won?               int64
avg_training_score        int64
dtype: object
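
The dtypes above show five text columns, which most scikit-learn estimators cannot consume directly. A minimal sketch of one-hot encoding with `pd.get_dummies`, on a made-up frame (the `toy` data and variable names are assumptions; other encodings may suit high-cardinality columns such as `region` better):

```python
import pandas as pd

# Toy frame with two text columns and one numeric column (made-up values).
toy = pd.DataFrame({
    "department": ["Operations", "Sales & Marketing", "Operations"],
    "gender": ["m", "f", "m"],
    "age": [30, 35, 34],
})

# Select the non-numeric columns by dtype instead of listing them by hand.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Replace each text column with one 0/1 indicator column per category.
encoded = pd.get_dummies(toy, columns=categorical_cols, dtype=int)

print(encoded.columns.tolist())
```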

In [31]:
raw_data[feature_list1].head(3)

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score
0,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49
1,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50


In [34]:
raw_data['no_of_trainings'].value_counts()

1     44378
2      7987
3      1776
4       468
5       128
6        44
7        12
10        5
9         5
8         5
Name: no_of_trainings, dtype: int64

In [39]:
for x in feature_list1:
    # Skip float columns (previous_year_rating); print the promotion rate per category
    if raw_data[x].dtype != 'float64':
        print(raw_data[[x, 'is_promoted']].groupby(x, as_index=False).mean())
        print("-" * 30, "\n")

          department  is_promoted
0          Analytics     0.095665
1            Finance     0.081230
2                 HR     0.056245
3              Legal     0.051011
4         Operations     0.090148
5        Procurement     0.096386
6                R&D     0.069069
7  Sales & Marketing     0.072031
8         Technology     0.107593
------------------------------ 

       region  is_promoted
0    region_1     0.095082
1   region_10     0.078704
2   region_11     0.056274
3   region_12     0.066000
4   region_13     0.086858
5   region_14     0.074970
6   region_15     0.079060
7   region_16     0.069625
8   region_17     0.136935
9   region_18     0.032258
10  region_19     0.060641
11   region_2     0.080126
12  region_20     0.057647
13  region_21     0.043796
14  region_22     0.114188
15  region_23     0.116596
16  region_24     0.035433
17  region_25     0.125763
18  region_26     0.063274
19  region_27     0.078963
20  region_28     0.116844
21  region_29     0.043260
22   r
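
The per-category means above are easier to read when sorted and shown next to the group size, because a rate computed on a small group is noisy. A sketch on made-up data (the `toy` frame and the output column names are assumptions):

```python
import pandas as pd

# Made-up rows standing in for raw_data.
toy = pd.DataFrame({
    "department": ["HR", "HR", "Technology", "Technology", "Technology", "Legal"],
    "is_promoted": [0, 0, 1, 0, 1, 0],
})

# Promotion rate and headcount per department, highest rate first.
summary = (
    toy.groupby("department")["is_promoted"]
    .agg(promotion_rate="mean", n_employees="count")
    .sort_values("promotion_rate", ascending=False)
)

print(summary)
```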

In [36]:
# Overall promotion rate: mean of the 0/1 is_promoted target
raw_data['is_promoted'].mean()

0.08517004816815063

In [40]:
pd.cut(raw_data['age'].astype(int), 5)

0         (28.0, 36.0]
1         (28.0, 36.0]
2         (28.0, 36.0]
3         (36.0, 44.0]
4         (44.0, 52.0]
5         (28.0, 36.0]
6         (28.0, 36.0]
7         (28.0, 36.0]
8        (19.96, 28.0]
9         (28.0, 36.0]
10        (28.0, 36.0]
11        (28.0, 36.0]
12        (44.0, 52.0]
13        (36.0, 44.0]
14        (36.0, 44.0]
15        (36.0, 44.0]
16        (36.0, 44.0]
17        (28.0, 36.0]
18        (28.0, 36.0]
19        (36.0, 44.0]
20        (28.0, 36.0]
21        (36.0, 44.0]
22       (19.96, 28.0]
23       (19.96, 28.0]
24        (36.0, 44.0]
25       (19.96, 28.0]
26        (28.0, 36.0]
27        (28.0, 36.0]
28        (28.0, 36.0]
29       (19.96, 28.0]
             ...      
54778     (28.0, 36.0]
54779     (28.0, 36.0]
54780     (36.0, 44.0]
54781     (36.0, 44.0]
54782    (19.96, 28.0]
54783    (19.96, 28.0]
54784     (36.0, 44.0]
54785     (28.0, 36.0]
54786     (28.0, 36.0]
54787    (19.96, 28.0]
54788    (19.96, 28.0]
54789     (28.0, 36.0]
54790     (

In [41]:
raw_data['age'].min()

20
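
The minimum of 20 explains the odd-looking `19.96` edge in the `pd.cut` output above: with an integer bin count, `pd.cut` builds equal-width bins over the observed range (here 20 to 60, so 8 years each) and lowers the first edge by 0.1% of that range so the minimum falls inside the first right-closed bin. A small sketch on made-up ages, which also shows explicit edges with readable labels (the `edges` and `labels` values are assumptions):

```python
import pandas as pd

# Made-up ages that span the same 20..60 range as the age column.
ages = pd.Series([20, 28, 29, 36, 44, 52, 60], name="age")

# Equal-width bins: (60 - 20) / 5 = 8 years each, right-closed.
equal_width = pd.cut(ages, 5)

# Explicit edges with readable labels (assumed values, chosen for illustration).
edges = [19, 28, 36, 44, 52, 60]
labels = ["20-28", "29-36", "37-44", "45-52", "53-60"]
labelled = pd.cut(ages, bins=edges, labels=labels)

print(equal_width.cat.categories)
print(labelled.tolist())
```

Explicit edges keep the bins stable if the data changes, whereas an integer bin count recomputes them from the current minimum and maximum.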

In [46]:
raw_data['age'].value_counts()

30    3665
31    3534
32    3534
29    3405
33    3210
28    3147
34    3076
27    2827
35    2711
36    2517
37    2165
26    2060
38    1923
39    1695
40    1663
25    1299
41    1289
42    1149
43     992
44     847
24     845
45     760
46     697
47     557
48     557
50     521
49     441
23     428
51     389
53     364
52     351
54     313
55     294
56     264
57     238
22     231
60     217
58     213
59     209
20     113
21      98
Name: age, dtype: int64