
## A tech-support call center wants to know, based on certain variables, whether an employee should be promoted. Build a Random Forest model to predict whether an employee will be promoted.



**Dataset Description**

    Feature                 Definition
    employee_id             Unique ID for employee
    department              Department of employee
    education               Education level
    gender                  Gender of employee
    no_of_trainings         Number of other trainings completed in the previous year (soft skills, technical skills, etc.)
    age                     Age of employee
    last_promotion_rating   Employee rating for the previous year
    employee_since_years    Length of service in years
    all_tasks_completed     Whether the employee has completed all assigned tasks
    training_marks          Marks in the current year's training evaluations
    is_promoted             (Target) Whether the employee is promoted

In [None]:
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.ensemble import RandomForestClassifier

**Get the data**

In [None]:
df = pd.read_csv("hrdata.csv")

In [None]:
df.head() 

In [None]:
df.shape

In [None]:
df.drop('Unnamed: 0',axis=1,inplace=True)

In [None]:
df.dtypes.value_counts()

**Check for Null values**
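A minimal sketch of the null check. The small frame below is a made-up stand-in for `df` (its values are invented) so that the snippet runs on its own; in the notebook the same two calls can be applied to `df` directly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration only
toy = pd.DataFrame({
    "education": ["Bachelors", None, "Masters", "Bachelors"],
    "last_promotion_rating": [3.0, np.nan, 5.0, np.nan],
    "age": [35, 30, 34, 39],
})

# Count missing values per column, then express them as a percentage of rows
null_counts = toy.isnull().sum()
null_pct = (toy.isnull().mean() * 100).round(2)

print(pd.DataFrame({"nulls": null_counts, "percent": null_pct}))
```

Columns with a non-zero count need imputing or dropping before the model is fitted, since scikit-learn estimators generally expect complete numeric input.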

## EDA

**Plot bar chart of target variable**
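One possible way to draw this chart, sketched with pandas and matplotlib. The series below is an invented stand-in for `df.is_promoted`; in the notebook, replace `toy_target` with the real column.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for df.is_promoted; values invented for illustration
toy_target = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0], name="is_promoted")

# Count each class and draw one bar per class
class_counts = toy_target.value_counts().sort_index()

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar([str(label) for label in class_counts.index], class_counts.values)
ax.set_xlabel("is_promoted")
ax.set_ylabel("Number of employees")
ax.set_title("Target variable distribution")
plt.show()
```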



**Check % of 0s and 1s in Target variable**

In [None]:
df.is_promoted.value_counts(normalize=True)

##### Inference: The data is highly imbalanced, with only 8% of 1s in the target variable

**Check the gender distribution**
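A short sketch of the gender check using `value_counts`. The series is an invented stand-in for `df.gender`, and the labels `"m"` and `"f"` are an assumption about how the column is coded.

```python
import pandas as pd

# Hypothetical stand-in for df.gender; labels and values are assumptions
toy_gender = pd.Series(["m", "f", "m", "m", "f", "m", "m", "m"], name="gender")

# Absolute counts and percentage share per gender
gender_counts = toy_gender.value_counts()
gender_share = toy_gender.value_counts(normalize=True).mul(100).round(1)

print(pd.DataFrame({"count": gender_counts, "percent": gender_share}))
```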

**Check distribution of continuous variables**

In [None]:
continuous=df.dtypes[df.dtypes=='int64'].index

sns.pairplot(df[continuous].drop('is_promoted',axis=1),diag_kind='kde')
plt.show()

**Plot gender with target variable**

In [None]:
# pass columns by keyword; recent seaborn versions no longer accept x positionally
sns.countplot(data=df, x='is_promoted', hue='gender', palette='BuPu_r')
plt.show()

**Plot Employee Since vs Training marks**

Set hue to is_promoted.

In [None]:
plt.figure(figsize=(20,15))
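# errorbar=None hides the confidence bars (use ci=None on seaborn versions before 0.12)
sns.barplot(data=df, x='employee_since_years', y='training_marks', hue='is_promoted', errorbar=None)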
plt.show()

**Check for outliers**

In [None]:
data_plot=df[continuous].drop('is_promoted',axis=1)
fig=plt.figure(figsize=(12,7))
for i in range(0,len(data_plot.columns)):
    ax=fig.add_subplot(2,4,i+1)
    sns.boxplot(y=data_plot[data_plot.columns[i]])
    ax.set_title(data_plot.columns[i],color='Red')
    plt.grid()
plt.tight_layout()


**There are many outliers in employee_since_years, age and no_of_trainings**
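To put a number on what the boxplots show, the sketch below counts the points that fall outside the 1.5 × IQR whiskers, which is the default whisker rule for seaborn boxplots. The helper `count_iqr_outliers` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame) -> pd.Series:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.sum()

# Hypothetical stand-in for the numeric columns of df; values invented
toy = pd.DataFrame({
    "age": [30, 31, 32, 33, 34, 35, 36, 60],
    "no_of_trainings": [1, 1, 1, 1, 1, 1, 1, 1],
})

outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

In the notebook the helper could be called as `count_iqr_outliers(data_plot)`. Tree-based models such as Random Forest split on thresholds, so they are usually less sensitive to outliers than distance-based or linear models.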

**Convert Object datatypes to categorical**

In [None]:
df.gender=pd.Categorical(df.gender).codes

In [None]:
#convert department and education to 0s and 1s using get_dummies in pandas
department=pd.get_dummies(df.department,drop_first=True)
education=pd.get_dummies(df.education,drop_first=True)
df=pd.concat((df.drop(['education','department'],axis=1),department,education),axis=1)

In [None]:
df.head()

In [None]:
# separate the predictors from the target column ("is_promoted"); employee_id is only an identifier, so drop it too
X = df.drop(["is_promoted",'employee_id'] , axis=1)
y = df.pop("is_promoted")

In [None]:
# split predictors and target into training (70%) and test (30%) sets
from sklearn.model_selection import train_test_split

X_train, X_test, train_labels, test_labels = train_test_split(X, y, test_size=.30, random_state=1)


## Ensemble Random Forest Classifier

**Instantiate the Random Forest Classifier Class**

In [None]:
rfcl = RandomForestClassifier(n_estimators=501, random_state=123)
rfcl = rfcl.fit(X_train, train_labels)

**Perform Grid Search**

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [7, 10],
    'max_features': [4, 6],
    'min_samples_leaf': [50, 100],
    'min_samples_split': [150, 300],
    'n_estimators': [301, 501]
}

rfcl = RandomForestClassifier()

grid_search = GridSearchCV(estimator = rfcl, param_grid = param_grid, cv = 3)
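Before running the search it helps to know how much work it involves. The sketch below uses scikit-learn's `ParameterGrid` to count the candidate combinations in the grid above; `cv_folds`, `n_candidates` and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    'max_depth': [7, 10],
    'max_features': [4, 6],
    'min_samples_leaf': [50, 100],
    'min_samples_split': [150, 300],
    'n_estimators': [301, 501]
}
cv_folds = 3

# Every combination is fitted once per cross-validation fold
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * cv_folds
print(n_candidates, "candidates,", n_fits, "fits")
```

With forests of 301 or 501 trees per fit, the search can take a while; `GridSearchCV` accepts `n_jobs=-1` to run the fits in parallel. After cross-validation, the best combination is refitted once more on the whole training set.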



In [None]:
grid_search.fit(X_train, train_labels)

**Get best parameters**

In [None]:
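# show the winning parameter combination, then keep the refitted model
print(grid_search.best_params_)
best_grid = grid_search.best_estimator_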

**Predict on both train and test sets**

In [None]:
ytrain_predict = best_grid.predict(X_train)
ytest_predict = best_grid.predict(X_test)


**Check Feature Importances**

In [None]:
x=pd.DataFrame(best_grid.feature_importances_*100,index=X_train.columns).sort_values(by=0,ascending=False)
plt.figure(figsize=(12,7))
sns.barplot(x=x[0], y=x.index, palette='dark')
plt.xlabel('Feature Importance in %')
plt.ylabel('Features')
plt.title('Feature Importance using RF')
plt.show()

**Using Scikit-learn metrics, print confusion matrix and classification report**
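A hedged sketch of this step. So that it runs on its own, it trains a small forest on synthetic, imbalanced data from `make_classification`; that data and the `model` object are stand-ins, not the HR dataset or the tuned `best_grid`. In the notebook, use `best_grid` with `X_train`/`train_labels` and `X_test`/`test_labels` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 8% positives; NOT the real HR data
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.92], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.30, random_state=1)

# Small forest standing in for best_grid
model = RandomForestClassifier(n_estimators=50, random_state=123).fit(X_tr, y_tr)

# Report the same metrics on both splits so overfitting is easy to spot
for name, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    y_pred = model.predict(X_part)
    cm = confusion_matrix(y_part, y_pred)  # rows: actual class, columns: predicted class
    print(f"--- {name} ---")
    print(cm)
    print(classification_report(y_part, y_pred, zero_division=0))
```

With a target this imbalanced, overall accuracy can look high even when few promotions are caught, so the recall and precision of class 1 are the figures to watch.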

**Plot ROC Curve and print area under the curve**

In [None]:
import matplotlib.pyplot as plt

In [None]:
# AUC and ROC for the training data

# predict probabilities
probs = best_grid.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(train_labels, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(train_labels, probs)
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
# show the plot
plt.show()

In [None]:
# AUC and ROC for the test data
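The test-set version mirrors the training cell, with the held-out rows swapped in. The sketch below is self-contained, so it fits a small forest on synthetic data; `model`, `X_te` and `y_te` are stand-ins for `best_grid`, `X_test` and `test_labels`.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 8% positives; NOT the real HR data
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.92], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.30, random_state=1)
model = RandomForestClassifier(n_estimators=50, random_state=123).fit(X_tr, y_tr)

# keep probabilities for the positive outcome only, on the held-out rows
test_probs = model.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, test_probs)
print('AUC: %.3f' % test_auc)

# ROC curve against the chance diagonal
fpr, tpr, thresholds = roc_curve(y_te, test_probs)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.plot(fpr, tpr, marker='.')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()
```

A test AUC well below the training AUC would suggest the forest is overfitting.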


## Conclusion
