
# Prerequisites


We will use supervised machine learning, training a model on our dataframe with the 'is_promoted' column as the target.
We will convert the data in non-numeric columns to numeric values.
We will drop the employee_id column from the features, as it is not of importance at this point.

The main goal is to use the features defined in the dataset to train a model that predicts whether an employee will be promoted.
The model will be successful if it is able to predict whether an employee will be promoted (1) or not (0).

In [1]:
# Import numpy and pandas libraries
import pandas as pd
import numpy as np


# Load the HR dataset and glossary and preview the records

In [15]:
# load datafile and preview first few records
hr_df = pd.read_csv('https://bit.ly/2ODZvLCHRDataset')
hr_df.head()

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0


In [None]:
# load glossary dataframe and view the records
glossary_df = pd.read_csv('https://bit.ly/2Wz3sWcGlossary')
glossary_df

Unnamed: 0,employee_id - Unique ID for employee
0,department - Department of employee
1,region - Region of employment (unordered)
2,education - Education Level
3,gender - Gender of Employee
4,recruitment_channel - Channel of recruitment f...
5,nooftrainings - no of other trainings complete...
6,age - Age of Employee
7,previousyearrating - Employee Rating for the p...
8,lengthofservice - Length of service in years
9,KPIs_met >80% - if Percent of KPIs(Key perform...


In [None]:
# Select and preview unique departments/verticals
hr_df.department.unique().tolist()

['Sales & Marketing',
 'Operations',
 'Technology',
 'Analytics',
 'R&D',
 'Procurement',
 'Finance',
 'HR',
 'Legal']

In [7]:
# Select and preview the unique values of the is_promoted (recommended for promotion) column
hr_df.is_promoted.unique().tolist()

[0, 1]

In [None]:
# Select and preview unique 'KPIs_met >80%' entries
hr_df['KPIs_met >80%'].unique().tolist()

[1, 0]

In [16]:
# check datatypes for the various columns
hr_df.dtypes

employee_id               int64
department               object
region                   object
education                object
gender                   object
recruitment_channel      object
no_of_trainings           int64
age                       int64
previous_year_rating    float64
length_of_service         int64
KPIs_met >80%             int64
awards_won?               int64
avg_training_score        int64
is_promoted               int64
dtype: object

In [17]:
# check if there are null observations in the dataset
hr_df.isnull().any()

employee_id             False
department              False
region                  False
education                True
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating     True
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool

# Cleaning our data

In [3]:
hr_df[hr_df.duplicated(['employee_id'])]

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted


In [10]:
hr_df.shape

(54808, 14)

In [19]:
unique_df = hr_df.drop_duplicates(['employee_id'])

In [None]:
unique_df.shape

(54808, 14)
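
The duplicate check above can be illustrated on a small made-up frame. The rows below are hypothetical and not taken from the HR dataset. duplicated flags the second and later occurrences of a key, and drop_duplicates keeps the first occurrence by default.

```python
import pandas as pd

# Hypothetical toy data: employee_id 2 appears twice
toy_df = pd.DataFrame({
    'employee_id': [1, 2, 2, 3],
    'department': ['HR', 'Legal', 'Legal', 'Finance'],
})

# Rows whose employee_id was already seen earlier in the frame
dupes = toy_df[toy_df.duplicated(['employee_id'])]
print(dupes)

# Keep only the first row for each employee_id
toy_unique = toy_df.drop_duplicates(['employee_id'])
print(toy_df.shape, toy_unique.shape)
```

In the HR dataset the duplicate filter returns no rows, which is why the shape is unchanged after drop_duplicates.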

In [18]:
# check if there is a column with null values
hr_df.isnull().sum()

employee_id                0
department                 0
region                     0
education               2409
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating    4124
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

In [20]:
# Instead of replacing null values in the previous year rating column with zero, we opt to delete these rows i.e. 4124 rows
hr_df = hr_df.dropna(axis=0, subset=['previous_year_rating'])
hr_df.isnull().sum()

employee_id                0
department                 0
region                     0
education               2024
gender                     0
recruitment_channel        0
no_of_trainings            0
age                        0
previous_year_rating       0
length_of_service          0
KPIs_met >80%              0
awards_won?                0
avg_training_score         0
is_promoted                0
dtype: int64

In [21]:
# Instead of replacing null values in the education column with zero, we opt to delete these rows i.e. the remaining 2024 rows
hr_df = hr_df.dropna(axis=0, subset=['education'])
hr_df.isnull().sum()

employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
KPIs_met >80%           0
awards_won?             0
avg_training_score      0
is_promoted             0
dtype: int64

In [22]:
# check the shape of the dataframe
hr_df.shape

(48660, 14)
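
The cells above drop rows with missing values rather than filling them in. A minimal sketch of the two options on a hypothetical toy frame follows. The values are made up, and filling with the median or the most frequent value is an assumption shown only for comparison, not something this analysis did.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy_df = pd.DataFrame({
    'education': ["Bachelor's", None, "Master's & above", "Bachelor's"],
    'previous_year_rating': [3.0, 5.0, np.nan, 1.0],
})

# Option 1 (used in this notebook): drop rows with a missing value in either column
dropped = toy_df.dropna(axis=0, subset=['previous_year_rating', 'education'])

# Option 2 (assumption, for comparison): fill instead of dropping
filled = toy_df.copy()
filled['previous_year_rating'] = filled['previous_year_rating'].fillna(
    filled['previous_year_rating'].median()
)
filled['education'] = filled['education'].fillna(filled['education'].mode()[0])

print(dropped.shape, filled.shape)
print(filled.isnull().sum())
```

Dropping keeps only fully observed rows at the cost of sample size; filling keeps every row but introduces values that were never observed.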

# Non-numeric data conversion to numeric data

In [23]:
# Iterate through the columns in the dataframe and find the unique elements of each non-numeric column.
# We take a set of the column values; the position of each value within that set becomes
# the new numeric id of that non-numeric observation.

# Create a function that gets the columns and iterates through them

def handle_non_numerical_data(hr_df):
    columns = hr_df.columns.values
    for column in columns:

        # Per-column dictionary mapping each text value (key) to its integer id,
        # and a helper that looks up the id for a given value
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]

        # Pick columns that are not int64 or float64 and convert the column to a list of its values
        if hr_df[column].dtype != np.int64 and hr_df[column].dtype != np.float64:
            column_contents = hr_df[column].values.tolist()

            # Take a set of the column contents to extract the unique values only
            unique_elements = set(column_contents)

            # Create a new dictionary key for each unique value found, with a new number as its value
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x += 1

            # Use the map function to replace the column's values with their new integer ids
            hr_df[column] = list(map(convert_to_int, hr_df[column]))

    return hr_df


# Call our handle_non_numerical_data function and preview the newly converted data frame

hr_df = handle_non_numerical_data(hr_df)
print(hr_df.head())

   employee_id  department  ...  avg_training_score  is_promoted
0        65438           5  ...                  49            0
1        65141           3  ...                  60            0
2         7513           5  ...                  50            0
3         2542           5  ...                  50            0
4        48945           7  ...                  73            0

[5 rows x 14 columns]
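
The function above assigns integer codes in whatever order Python's set yields the values, so the codes can differ between runs. A minimal alternative sketch, assuming a stable mapping is wanted, sorts the unique values before numbering them and returns the mapping so the codes can be decoded later. The helper name encode_non_numeric and the toy data are hypothetical.

```python
import pandas as pd

def encode_non_numeric(df):
    """Return a copy of df with non-numeric columns replaced by integer codes,
    plus the value-to-code mapping used for each encoded column."""
    encoded = df.copy()
    mappings = {}
    for column in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[column]):
            # Sorted unique values give the same codes on every run
            categories = sorted(encoded[column].unique())
            mapping = {value: code for code, value in enumerate(categories)}
            encoded[column] = encoded[column].map(mapping)
            mappings[column] = mapping
    return encoded, mappings

# Hypothetical toy data: one text column, one numeric column
toy_df = pd.DataFrame({
    'gender': ['f', 'm', 'm'],
    'age': [35, 30, 34],
})
encoded_df, mappings = encode_non_numeric(toy_df)
print(encoded_df)
print(mappings)
```

Working on a copy leaves the input frame untouched, and keeping the mappings makes it possible to build new rows for prediction with the same codes the model was trained on.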


# Create a Random Forest Regression model for predicting if an employee with certain features will be promoted or not.

In [24]:
# import RandomForestRegressor as follows
from sklearn.ensemble import RandomForestRegressor

# Defining features and target
features =  hr_df.drop(['employee_id', 'is_promoted'], axis=1)
target = hr_df['is_promoted']

# Create a regressor object with random_state set to 42 and n_estimators set to 3
random_regressor = RandomForestRegressor(random_state = 42, n_estimators=3)

# Train the model
random_regressor.fit(features, target)

# Define sample data that will be used to predict the 'is_promoted' outcome.

features =  hr_df.drop(['employee_id', 'is_promoted'], axis=1)
new_features = pd.DataFrame(
    [
        [4, 12, 3 , 0, 1, 2, 50, 4.0, 11, 1, 0, 65],
        [3, 6, 1 , 1, 1, 2, 27, 0, 6, 1, 0, 30],
    ],
    columns=features.columns
)

# Predict if this employee will be promoted

is_promoted = random_regressor.predict(new_features)  
print(is_promoted)

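# Check the model's fit: score() on a regressor returns R^2 (the coefficient of determination),
# here computed on the same data used for training
train_r2 = random_regressor.score(features, target)
print(train_r2)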




[1. 0.]
0.7758776305347055
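
The score printed above is computed on the same rows the model was trained on. A minimal sketch of a held-out evaluation follows. It assumes a classifier is acceptable in place of the regressor (a swapped-in technique, not the method used in this notebook) and uses synthetic data so the block runs on its own. The feature names and the rule that generates the label are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the HR features (hypothetical columns and values)
rng = np.random.default_rng(42)
n = 500
toy_features = pd.DataFrame({
    'avg_training_score': rng.integers(40, 100, size=n),
    'kpis_met': rng.integers(0, 2, size=n),
})
# Made-up labelling rule, used only so there is something to learn
toy_target = (
    (toy_features['avg_training_score'] > 80) & (toy_features['kpis_met'] == 1)
).astype(int)

# Hold out 25% of the rows; stratify keeps the class balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_target, test_size=0.25, random_state=42, stratify=toy_target
)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

# Fraction of held-out rows whose 0/1 label is predicted correctly
test_accuracy = accuracy_score(y_test, clf.predict(X_test))
print(test_accuracy)
```

Because the test rows are never seen during fitting, this number estimates how the model behaves on new employees, which a score on the training rows cannot do.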


In [25]:
# Preview the new observations used as sample features
new_features

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score
0,4,12,3,0,1,2,50,4.0,11,1,0,65
1,3,6,1,1,1,2,27,0.0,6,1,0,30


# Findings and Conclusion

Findings:
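The model predicts that the employee with the features in the first row will be promoted, while the employee with the features in the second row will not be promoted.
The model's score is about 0.78. Because score on a RandomForestRegressor returns R² and was computed here on the training data, this is a measure of fit rather than a 77% prediction accuracy on unseen employees.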
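
For a regressor, score returns the coefficient of determination R², not the fraction of correct predictions. A small worked sketch with made-up labels and predictions shows that the two numbers can differ. The 0.5 threshold used to turn regressor outputs into 0/1 labels is an assumption for illustration.

```python
import numpy as np

# Made-up 0/1 labels and regressor outputs (a 3-tree forest averages to multiples of 1/3)
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0.0, 0.0, 1 / 3, 2 / 3, 1.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Accuracy after thresholding the regressor output at 0.5 (assumed cut-off)
accuracy = np.mean((y_pred >= 0.5).astype(int) == y_true)

print(r2, accuracy)
```

Here every thresholded prediction matches its label, yet R² is below 1 because the raw outputs 1/3 and 2/3 are not exactly 0 or 1.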