# Employee turnover prediction

### Here’s How your HR can Improve Employee Turnover Prediction Accuracy

HR managers play a vital role in any organization because they manage its most important resource: its employees. Managing attrition is one of HR's key areas of focus, and so is predicting the employee turnover rate. This article explores methods for predicting employee turnover and ways to develop employee retention strategies.

### Why is Employee Turnover Prediction Important?
Employee turnover is the percentage of employees leaving an organization within a certain period. Turnover is part of doing business, but it can be costly for companies if it happens too often. A study by Employee Benefit News found that the cost of losing a single employee can be up to 33% of their annual salary. That’s why HR managers need to predict employee turnover and take steps to reduce it. 

Several factors can impact employee turnover, including job satisfaction, company culture, pay and benefits, and workload. When predicting turnover, HR managers should consider these factors and work to improve employee engagement and retention by focusing on job satisfaction and company culture. Offering competitive pay and benefits also helps keep employees satisfied and reduces turnover.

### Common Factors that can affect Employee Retention Rate
- Compensation: Employees who feel they are not being paid enough may be more likely to leave the company.
- Benefits: Employees may be more likely to stay with a company if they feel they are receiving good benefits.
- Work-life balance: If employees feel overloaded and think they are working too much, they may be more likely to leave the company. Not having enough time for their personal lives is a crucial factor.
- Job satisfaction: If employees are not happy with their jobs, roles, or responsibilities, they may not wish to stay with the company.

### Perfecting the art of Predicting Employee Turnover Rate
There are many factors that HR managers need to take into account while predicting employee turnover rate. Some of the key elements include: 

- Service time: Employees who have been with the organization for a shorter period are more likely to leave than those who have been with the company for longer.
- Age: Younger employees are more often looking for new opportunities and tend to leave more often than older employees.
- Job satisfaction: It is closely linked with employee engagement. Employees who don't feel engaged with their job and role become disenchanted and are less likely to stay.
- Compensation: Employees who feel they are not adequately compensated are more likely to leave than those who feel fairly compensated and paid the market rate.
- Workload: Employees who feel their workload is excessive are likelier to leave than those who think their workload is manageable.
- Job market conditions: Job market conditions are also a key factor that HR managers must consider while predicting employee turnover. If the job market is good, employees are more likely to leave their current company in search of a better job.

I'll uncover the factors that lead to employee attrition and explore important questions such as ‘Show me a breakdown of distance from home by job role and attrition’ or ‘Compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists.

# About the data
<details>
    
    Education
    1 'Below College'
    2 'College'
    3 'Bachelor'
    4 'Master'
    5 'Doctor'

    EnvironmentSatisfaction
    1 'Low'
    2 'Medium'
    3 'High'
    4 'Very High'

    JobInvolvement
    1 'Low'
    2 'Medium'
    3 'High'
    4 'Very High'

    JobSatisfaction
    1 'Low'
    2 'Medium'
    3 'High'
    4 'Very High'

    PerformanceRating
    1 'Low'
    2 'Good'
    3 'Excellent'
    4 'Outstanding'

    RelationshipSatisfaction
    1 'Low'
    2 'Medium'
    3 'High'
    4 'Very High'

    WorkLifeBalance
    1 'Bad'
    2 'Good'
    3 'Better'
    4 'Best'
    
    </details>

In [1]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

from category_encoders import OneHotEncoder
from sklearn.preprocessing import  LabelEncoder,StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    confusion_matrix,
)
from sklearn.linear_model import LogisticRegression
 
from sklearn.model_selection import GridSearchCV, cross_val_score,train_test_split,RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier



In [2]:
df=pd.read_csv('/kaggle/input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [None]:
df.info()

Cool, there are no missing values.

# Explore Data

How `Attrition` relates to the other features

Is attrition good or bad?

Negative attrition, especially in industries with the highest turnover rates, is expensive. The organization must once again recruit, assess, hire and train a new employee, and until the position is filled, team productivity declines. Positive attrition refers to staff turnover that actually benefits the organization.

In [None]:
df[['Attrition']].head(5)

In [None]:
# Creating plot
att=df['Attrition'].value_counts(normalize=True)
explode = (0.1, 0.0, )
 
# Creating color parameters
colors = ( "orange", "beige")
 
# Wedge properties
wp = { 'linewidth' : 1, 'edgecolor' : "green" }
 
# Creating autopct arguments: format each wedge label as a percentage
def func(pct, allvalues):
    return "{:.1f}%".format(pct)
 
# Creating plot
fig, ax = plt.subplots(figsize =(10, 7))
wedges, texts, autotexts = ax.pie(att,
                                  autopct = lambda pct: func(pct, att),
                                  explode = explode,
                                  labels = att.index,
                                  shadow = True,
                                  colors = colors,
                                  startangle = 90,
                                  wedgeprops = wp,
                                  textprops = dict(color ="black"))
 
# Adding legend
ax.legend(wedges, att.index,
          title ="Attrition",
          loc ="center left",
          bbox_to_anchor =(1, 0, 0.5, 1))
 
plt.setp(autotexts, size = 8, weight ="bold")
ax.set_title("Proportion of employee attrition")
 
# show plot
plt.show()

## Attrition VS Age

`daily rate` is a predetermined sum of money paid to a worker for each day of work or service, often used for temporary or freelance positions

In [None]:
df[['Age']].head(5)

In [None]:
age_att=df.groupby(['Age','Attrition'])['Attrition'].count().reset_index(name='Counts')
age_att

In [None]:
px.line(age_att[age_att['Attrition']=='Yes'],x='Age',y='Counts',color='Attrition',title='Age line of attrition employees in an Organization')

- Attrition counts peak between ages 28 and 32.
- After that peak, the number of leavers keeps falling as age increases.
- Employees aged 18-20 appear far more likely to leave the organization, although this plot shows counts rather than rates.
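
The line plot above shows counts of leavers, so ages with many employees dominate it. Here is a minimal sketch of how a per-age attrition rate could be computed instead with `pd.crosstab` and `normalize="index"`. The small `toy` frame is made-up data standing in for `df`, and it is assumed to have the same `Age` and `Attrition` columns.

```python
import pandas as pd

# Hypothetical stand-in for df with the same two columns (values are made up)
toy = pd.DataFrame({
    "Age": [19, 19, 19, 30, 30, 30, 30, 45, 45, 45],
    "Attrition": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No", "No"],
})

# Share of "No"/"Yes" within each age, so every row sums to 1
age_rate = pd.crosstab(toy["Age"], toy["Attrition"], normalize="index")

# Attrition rate per age, in percent
age_rate_pct = (age_rate["Yes"] * 100).round(2)
print(age_rate_pct)
```

In the notebook, replacing `toy` with `df` should give one rate per age, which can then be plotted in place of the raw counts.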

## BusinessTravel vs Attrition 

The vast majority of business travelers across the globe report the quality of their business travel experience impacts their overall job satisfaction. Beyond that, it can also influence whether they take a job in the first place. Many survey respondents indicated a company’s travel policy is an important factor when considering a potential new employer. Finally, an overwhelming majority reported that the quality of their business travel experience impacts their business results at least somewhat. Ensuring a high quality business travel experience for your road warriors is a win-win, creating happy employees and a healthy bottom line.

In [None]:
df['BusinessTravel'].value_counts()

In [None]:
n=df.groupby(['BusinessTravel','Attrition'])['BusinessTravel'].count().reset_index(name='Counts')
n

In [None]:
per=np.array([(138/150)*100,(12/150)*100,(208/(208+69))*100,(69/(208+69))*100,(887/(887+156))*100,(156/(887+156))*100]).round(3)
n['Percent']=per
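
Typing the percentages in by hand is easy to get wrong and has to be repeated for every feature. As a sketch of an alternative, the hypothetical helper `add_group_percent` below derives the same within-group percentages from the `Counts` column using `groupby(...).transform("sum")`. The `demo` frame simply reuses the counts from the cell above.

```python
import pandas as pd

def add_group_percent(counts, group_col, count_col="Counts"):
    """Return a copy with a Percent column: each row's share of its group total."""
    out = counts.copy()
    totals = out.groupby(group_col)[count_col].transform("sum")
    out["Percent"] = (out[count_col] / totals * 100).round(3)
    return out

# Same counts as in the cell above, in groupby order (No, Yes per travel class)
demo = pd.DataFrame({
    "BusinessTravel": ["Non-Travel", "Non-Travel",
                       "Travel_Frequently", "Travel_Frequently",
                       "Travel_Rarely", "Travel_Rarely"],
    "Attrition": ["No", "Yes"] * 3,
    "Counts": [138, 12, 208, 69, 887, 156],
})

demo_pct = add_group_percent(demo, "BusinessTravel")
print(demo_pct)
```

The same call should work for any of the count tables in this notebook, for example `add_group_percent(dept_att, "Department")`.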

In [None]:
n

In [None]:
fig=px.bar(n,x='BusinessTravel',y='Counts',color='Attrition', text_auto=True,title='BusinessTravel Counts of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0)
fig.show()

In [None]:
fig=px.bar(n,x='BusinessTravel',y='Percent',color='Attrition', text_auto=True,title='BusinessTravel percentage for each individual class of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0)
fig.show()

- The chart shows that Travel_Frequently employees have the highest attrition rate, followed by Travel_Rarely, then Non-Travel.
- The most important takeaway is that more travel goes with more attrition.

## Department vs Attrition

In [None]:
dept_att=df.groupby(['Department','Attrition'])['Attrition'].count().reset_index(name='Counts')
# Share of each attrition class within every department, in percent
dept_att['Percent']=(dept_att['Counts']/dept_att.groupby('Department')['Counts'].transform('sum')*100).round(2)

In [None]:
fig=px.bar(dept_att,x='Department',y='Counts',color='Attrition', text_auto=True,title='Department counts of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0)
fig.show()

In [None]:
fig=px.bar(dept_att,x='Department',y='Percent',color='Attrition', text_auto=True,title='Department proportion for each individual class of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0)
fig.show()

From the bar chart above we find that:
- The Sales department has the highest attrition rate at about 21%, followed by Human Resources, then Research & Development.
- The Research & Development department has the most employees but the lowest attrition rate.

## Daily Rate VS Attrition 

In [None]:
df.boxplot(by= 'Attrition',column='DailyRate',vert=False)


The median daily rate of employees who stayed (`No`) is higher than that of employees who left (`Yes`), which suggests that employees with a higher daily rate are more likely to stay in their position.

In [None]:
rate_att=df.groupby(['DailyRate','Attrition']).apply(lambda x:x['DailyRate'].count()).reset_index(name='Counts')


In [None]:
fig=px.histogram(rate_att[rate_att['Attrition']=='Yes'],x='DailyRate',y='Counts',nbins=20,color='Attrition', title='DailyRate counts of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0)
fig.show()

the bar has some sort of left skew

## Distance From Home VS Attrition

In [None]:
df['DistanceFromHome'].head()


In [None]:
dist_att=df.groupby(['DistanceFromHome','Attrition'])['Attrition'].count().reset_index(name='Counts')


In [None]:
px.histogram(dist_att,x='DistanceFromHome',y='Counts',color='Attrition',title='Distance From Home counts for attrition classes of employees in an Organization')


- To start with, the number of employees decreases as the distance between home and the organization increases.
- Distance from home does not appear to strongly affect whether an employee stays, but it is clear that most employees live close to work.

## Education VS Attrition

In [None]:
df['Education'].unique()

In [None]:
df['EducationField'].unique()

In [None]:
edu_att=df.groupby(['Education','Attrition'])['Attrition'].count().reset_index(name='Counts')
edu_att[edu_att['Attrition']=='Yes']

In [None]:
fig=px.bar(edu_att,x='Education',y='Counts',color='Attrition', text_auto=True,title='Education counts of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0,
    xaxis_title="Education",
    yaxis_title="Counts",)  # the y-axis shows counts, not proportions

fig.show()

In [None]:
# Share of each attrition class within every education level, in percent
edu_att['Percent']=(edu_att['Counts']/edu_att.groupby('Education')['Counts'].transform('sum')*100).round(2)

In [None]:
fig=px.bar(edu_att,x='Education',y='Percent',color='Attrition', text_auto=True,title='Education proportion for each individual class of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0)
fig.show()

Education

1- 'Below College'

2- 'College'

3- 'Bachelor'

4- 'Master'

5- 'Doctor'

From the bar chart above we find that:
- Employees with a Below College education have the highest attrition rate at 18.24%.
- Employees with a Doctor degree have the lowest attrition rate at 10.42% (5 leavers out of 48).
- Employees with a Bachelor degree have a high attrition rate at 17.31%. I think this comes from the employees themselves: they may be fresh graduates still searching for the best organization.

In [None]:
eduf_att=df.groupby(['EducationField','Attrition'])['Attrition'].count().reset_index(name='Counts')
eduf_att[eduf_att['Attrition']=='Yes']

In [None]:
fig=px.bar(eduf_att,x='EducationField',y='Counts',color='Attrition', text_auto=True,title='EducationField counts of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0,
    xaxis_title="EducationField",
    yaxis_title="Counts",)  # the y-axis shows counts, not proportions

fig.show()

## Environment Satisfaction VS Attrition

A work environment is made up of a range of factors, including company culture, management styles, hierarchies, and human resources policies.

`Employee satisfaction` is the degree to which employees feel personally fulfilled and content in their job roles

In [None]:
envs_att=df.groupby(['EnvironmentSatisfaction','Attrition'])['Attrition'].count().reset_index(name='Counts')
envs_att[envs_att['Attrition']=='Yes']

In [None]:
fig=px.area(envs_att,x='EnvironmentSatisfaction',y='Counts',color='Attrition',title='EnvironmentSatisfaction counts of employees in an Organization')


fig.show()

In [None]:
perc_envs=np.array([(212/(212+72))*100,(72/(212+72))*100,(244/(244+43))*100,(43/(244+43))*100,
               (391/(391+62))*100,(62/(391+62))*100,(386/(386+60))*100,(60/(386+60))*100]).round(2)
envs_att['Percent']=perc_envs

In [None]:
fig=px.bar(envs_att,x='EnvironmentSatisfaction',y='Percent',color='Attrition', text_auto=True,title='Environment Satisfaction proportion for each individual class of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0)
fig.show()

EnvironmentSatisfaction

1 'Low'

2 'Medium'

3 'High'

4 'Very High'

- As expected, the higher the environment satisfaction, the lower the attrition rate, and vice versa.

## Gender VS Attrition 

In [None]:
gndr_att=df.groupby(['Gender','Attrition'])['Attrition'].count().reset_index(name='Counts')
gndr_att[gndr_att['Attrition']=='Yes']

In [None]:
fig=px.bar(gndr_att,x='Gender',y='Counts',color='Attrition', text_auto=True,title='Gender counts of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0,
    xaxis_title="Gender",
    yaxis_title="Counts",)  # the y-axis shows counts, not proportions

fig.show()

In [None]:
perc_gndr=np.array([(501/(501+87))*100,(87/(501+87))*100,(732/(732+150))*100,(150/(732+150))*100]).round(2)
gndr_att['Percent']=perc_gndr

In [None]:
fig=px.bar(gndr_att,x='Gender',y='Percent',color='Attrition', text_auto=True,title='Gender proportion for each individual class of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0)
fig.show()

The attrition rate for men (about 17%) is higher than for women (about 15%). I don't know why, but the difference is not large.

## Job involvement VS Attrition
Job involvement, also referred to as job participation, is the degree to which an employee identifies with their work, actively participates in it, and derives a sense of self-worth from it.

In [None]:
jinv_att=df.groupby(['JobInvolvement','Attrition'])['Attrition'].count().reset_index(name='Counts')
jinv_att[jinv_att['Attrition']=='Yes']

In [None]:
fig=px.bar(jinv_att,x='JobInvolvement',y='Counts',color='Attrition', text_auto=True,title='Job Involvement counts of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0,
    xaxis_title="Job Involvement",
    yaxis_title="Counts",)  # the y-axis shows counts, not proportions

fig.show()

In [None]:
per_jinv=np.array([(55/(55+28))*100,(28/(55+28))*100,(304/(304+71))*100,(71/(304+71))*100,
               (743/(743+125))*100,(125/(743+125))*100,(131/(131+13))*100,(13/(131+13))*100]).round(2)
jinv_att['Percent']=per_jinv

In [None]:
fig=px.bar(jinv_att,x='JobInvolvement',y='Percent',color='Attrition', text_auto=True,title='Job Involvement proportion for each individual class of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0)
fig.show()

JobInvolvement

1 'Low'

2 'Medium'

3 'High'

4 'Very High'

- As expected, the higher the level of job involvement, the lower the attrition rate, and vice versa.

## RelationshipSatisfaction VS Attrition
Going to work every day can be stressful when there is an employer or colleague with whom you struggle to get along. It can leave you feeling unsatisfied at the end of each workday -- and for that matter at the start of it. Eventually you may start looking for other employment. If each person did this, the business would suffer because of its retention issues. The fact is, it already suffers from a morale issue, yours and maybe others'. An employer who recognizes the impact of workplace relationships to employee satisfaction, and encourages flexibility and interaction, can transform a brittle workplace into a productive, satisfying environment.

In [None]:
rltnS_att=df.groupby(['RelationshipSatisfaction','Attrition'])['Attrition'].count().reset_index(name='Counts')
rltnS_att[rltnS_att['Attrition']=='Yes']

In [None]:
fig=px.bar(rltnS_att,x='RelationshipSatisfaction',y='Counts',color='Attrition', text_auto=True,title='Relationship Satisfaction counts of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0,
    xaxis_title="Relationship Satisfaction",
    yaxis_title="Counts",)  # the y-axis shows counts, not proportions

fig.show()

In [None]:
per_rltnS=np.array([(219/(219+57))*100,(57/(219+57))*100,(258/(258+45))*100,(45/(258+45))*100,
               (388/(388+71))*100,(71/(388+71))*100,(368/(368+64))*100,(64/(368+64))*100]).round(2)
rltnS_att['Percent']=per_rltnS

In [None]:
fig=px.bar(rltnS_att,x='RelationshipSatisfaction',y='Percent',color='Attrition', text_auto=True,title='Relationship Satisfaction proportion for each individual class of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0)
fig.show()

RelationshipSatisfaction

1 'Low'

2 'Medium'

3 'High'

4 'Very High'


- Employees with low relationship satisfaction have the highest attrition rate, and higher levels generally go with lower attrition.
- The effect is not large, though.


## WorkLife Balance VS Attrition 

In [None]:
wlblnc_att=df.groupby(['WorkLifeBalance','Attrition'])['Attrition'].count().reset_index(name='Counts')
wlblnc_att[wlblnc_att['Attrition']=='Yes']

In [None]:
fig=px.bar(wlblnc_att,x='WorkLifeBalance',y='Counts',color='Attrition', text_auto=True,title='WorkLife Balance counts of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0,
    xaxis_title="WorkLife Balance",
    yaxis_title="Counts",)  # the y-axis shows counts, not proportions

fig.show()

In [None]:
per_wlblnc=np.array([(55/(25+55))*100,(25/(55+25))*100,(286/(286+58))*100,(58/(286+58))*100,
               (766/(766+127))*100,(127/(766+127))*100,(126/(126+27))*100,(27/(126+27))*100]).round(2)
wlblnc_att['Percent']=per_wlblnc

In [None]:
fig=px.bar(wlblnc_att,x='WorkLifeBalance',y='Percent',color='Attrition', text_auto=True,title='WorkLife Balance proportion for each individual class of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0)
fig.show()

WorkLifeBalance

1 'Bad'

2 'Good'

3 'Better'

4 'Best'

- Unexpectedly, at least for me, employees with the best work-life balance do not have the lowest attrition rate, as the chart shows.


## Num Companies Worked VS Attrition

In [None]:
num_c_att=df.groupby(['NumCompaniesWorked','Attrition'])['Attrition'].count().reset_index(name='Counts')
num_c_att[num_c_att['Attrition']=='Yes']

In [None]:
fig=px.area(num_c_att,x='NumCompaniesWorked',y='Counts',color='Attrition', title='Num Companies Worked  counts of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0,)

fig.show()

Employees who have worked at fewer companies account for most of the leavers, which may reflect less work experience.

## YearsWithCurrManager VS Attrition 

In [None]:
df['YearsWithCurrManager'].unique()

In [None]:
yrs_mngr_att=df.groupby(['YearsWithCurrManager','Attrition'])['Attrition'].count().reset_index(name='Counts')
yrs_mngr_att[yrs_mngr_att['Attrition']=='Yes']

In [None]:
fig=px.line(yrs_mngr_att[yrs_mngr_att['Attrition']=='Yes'],x='YearsWithCurrManager',y='Counts',color='Attrition', title='Years With CurrManager  counts of employees in an Organization')
fig.update_layout(barmode='group', xaxis_tickangle=0,)

fig.show()

From the line plot, my interpretation is:
- Between 1 and 3 years with the same manager, attrition is high. Possibly the manager does not find what they want in the employee, or the employee is not comfortable with the manager.
- Employees who have worked with a manager for about seven years may leave to look for a better opportunity, which probably has little to do with how well they get along with the manager.
- Once an employee has worked with the same manager for more than 8 years, attrition decreases significantly.

# Prepare Data

In [None]:
df.describe()

In [None]:
# ordinal and categorical columns
drop_cat=['Education','EnvironmentSatisfaction','StockOptionLevel','JobLevel','JobInvolvement',
          'JobSatisfaction','PerformanceRating','RelationshipSatisfaction','WorkLifeBalance',
         'BusinessTravel','Department','Gender','MaritalStatus','OverTime','EducationField',
         'JobRole']
# columns that carry no useful information (identifiers and constants)
drop_id=['EmployeeNumber','EmployeeCount','StandardHours','Over18']
# target column
target =['Attrition']
# data frame of continuous columns
con_df=df.drop(columns=drop_cat+drop_id+target)
len(con_df.columns)

In [None]:
cum=[]
fig,axs=plt.subplots(4,4,figsize=(25,15))
axs=axs.flatten()
for i,column in enumerate(con_df.columns) :
    sns.histplot(x=column,data=df,ax=axs[i])
    cum.append(column)
fig.tight_layout()
plt.show()
print(cum)


Several columns have skewed distributions. Keep this in mind for later.

In [None]:
df['Attrition'].value_counts(normalize=True).plot(kind='bar')

The bar chart shows that we have an imbalanced dataset, where the majority class is far bigger than the minority class.

There are 34 features for each employee, many of them numerical. It might be useful to understand where the values for one of these features cluster, so let's make a boxplot to see how the values in `"Age"` are distributed for each attrition class.

In [None]:
# Create boxplot of Age for each attrition class
sns.boxplot(y='Age',x='Attrition',data=df)
plt.xlabel("Attrition")
plt.ylabel("Age")
plt.title("Distribution of Age, by Attrition Class");

In [None]:
# lets see the correlation between columns 
corr=con_df.corr().abs()
plt.figure(figsize = (25,15))

ax = sns.heatmap(corr, annot=True, linewidths=1,cmap='mako_r')

In [None]:
drop_corr=['YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
co=con_df.drop(columns=drop_corr).corr().abs()
plt.figure(figsize = (25,15))

ax = sns.heatmap(co, annot=True, linewidths=1,cmap='mako_r')

- There are multiple columns with skewed data.
- Some columns are highly correlated.
- The classes of the `Attrition` column are imbalanced.

Given these properties, a decision tree looks like a reasonable candidate classifier.

## Skewed Data

Skewed data can hurt the performance of many models, so let's look at the **distribution** of the features and fix the skewed ones.

As a general rule of thumb:

    Data is symmetrical: skewness is between -0.5 and 0.5
    Data is slightly skewed: skewness is between -1 and -0.5 or 0.5 and 1
    Data is highly skewed: skewness is less than -1 or greater than 1.
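
The rule of thumb above can be written as a small function. This is only a sketch: `skew_label` is a hypothetical helper, the thresholds are the ones listed above, and the two `toy` columns are made-up values for illustration.

```python
import pandas as pd

def skew_label(series):
    """Classify a numeric series using the rule of thumb above."""
    s = series.skew()
    if abs(s) <= 0.5:
        return "symmetrical"
    if abs(s) <= 1:
        return "slightly skewed"
    return "highly skewed"

# Made-up columns for illustration only
toy = pd.DataFrame({
    "balanced": [1, 2, 3, 4, 5, 6, 7],
    "long_tail": [1, 1, 1, 2, 2, 3, 50],
})

labels = {col: skew_label(toy[col]) for col in toy.columns}
print(labels)
```

Applied to the notebook's continuous columns, the same function would flag the columns that end up in the `sk` list below as highly skewed.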

In [None]:
# select columns with skew() > 1 or skew() < -1

sk=[]
for i in df.drop(columns=drop_cat+target+drop_id+drop_corr).columns:
    if ((df[i].skew()>1) or (df[i].skew()<-1)):
        sk.append(i)
sk

In [None]:
# replace each skewed value x with log(x), leaving zeros as 0
np.seterr(divide = 'ignore') 
sk_=pd.DataFrame(np.select([df[sk]==0, df[sk] > 0, df[sk] < 0], [0, np.log(df[sk]), np.log(df[sk])]),columns=sk).set_index(df.index)
df_skew=df.drop(columns=sk).set_index(df.index)

df_skew=pd.concat([df_skew,sk_],axis=1)
X_skew=df_skew.drop(columns='Attrition')

# Split

In [None]:
# instantiate labelencoder object
le = LabelEncoder()

X = df_skew.drop(columns=target+drop_id+drop_corr)
y = le.fit_transform( np.ravel(df[target]))
#y=df[target].values
print("X shape:", X.shape)
print("y shape:", y.shape)

In [None]:
y

In [None]:
one_hot_encoded_X = pd.get_dummies(X, columns = drop_cat)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(one_hot_encoded_X,y,test_size=0.2,random_state=42)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
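
Because the classes are imbalanced, a purely random split can leave the test set with a different share of leavers than the training set. One option, shown here as a sketch on made-up data rather than as the split used in this notebook, is the `stratify` argument of `train_test_split`. The names `X_demo` and `y_demo` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 84 "stay" (0) labels and 16 "leave" (1) labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 84 + [1] * 16)

# stratify=y_demo keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train minority share:", y_tr.mean())
print("test minority share:", y_te.mean())
```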

# Resample

Now that we've split our data into training and test sets, we can address the class imbalance we saw during our EDA. One strategy is to resample the training data. There are many ways to do this, so let's start with under-sampling.

In [None]:
under_sampler =RandomUnderSampler(random_state=42)
X_train_under, y_train_under = under_sampler.fit_resample(X_train,y_train)
print(X_train_under.shape)
X_train_under.head()

In [None]:
#then over sampler
over_sampler = RandomOverSampler(random_state=42)
X_train_over, y_train_over = over_sampler.fit_resample(X_train,y_train)
print(X_train_over.shape)
X_train_over.head()
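
To make clear what random over-sampling does to the training data, here is a hand-rolled NumPy sketch of the idea. It is an illustration only, not the imbalanced-learn implementation: `random_oversample` is a hypothetical helper and the demo arrays are made up.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until all classes match the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx_parts = []
    for cls, n in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement from this class
        extra = rng.choice(cls_idx, size=target - n, replace=True)
        idx_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(idx_parts)
    return X[idx], y[idx]

# Made-up training data: 8 rows of class 0 and 2 rows of class 1
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 8 + [1] * 2)

X_res, y_res = random_oversample(X_demo, y_demo)
print("before:", np.bincount(y_demo), "after:", np.bincount(y_res))
```

Only the training set is resampled. The test set keeps its original class balance so that evaluation reflects real conditions.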

# Build Model

## Baseline
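
A common baseline for a classifier is the accuracy of always predicting the majority class. The sketch below assumes integer-encoded labels like the `y_train` array produced above. `baseline_accuracy` is a hypothetical helper and `y_demo` is made-up data.

```python
import numpy as np

def baseline_accuracy(y):
    """Accuracy of always predicting the most frequent class in y."""
    counts = np.bincount(y)
    return counts.max() / counts.sum()

# Made-up labels: 0 = stayed, 1 = left
y_demo = np.array([0] * 84 + [1] * 16)

acc_baseline = baseline_accuracy(y_demo)
print("Baseline accuracy:", round(acc_baseline, 4))
```

In the notebook this would be called as `baseline_accuracy(y_train)`. A model is only useful if it beats that number on the test set.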

## Iterate

## 1- LogisticRegression

### Hyper parameter Tuning

In [None]:
# Hyperparameter ranges for tuning

clf_LogisticRegression=make_pipeline(StandardScaler(),LogisticRegression())
parameters = [    
    {'logisticregression__penalty' : ['l1', 'l2', 'elasticnet'],   # Used to specify the norm used in the penalization.
    'logisticregression__C' : [1e-5, 1e-4, 1e-3, 1, 10, 100, ] ,                      # Inverse of regularization strength; must be a positive float.
    'logisticregression__solver' : ['lbfgs','newton-cg','liblinear','sag','saga'],  # Algorithm to use in the optimization problem.
    'logisticregression__max_iter' : [100, 1000,2500, 5000,10000]                # Maximum number of iterations taken for the solvers to converge.
    }
]
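
The grid above combines every penalty with every solver, but scikit-learn's `LogisticRegression` does not support all of those pairs (for example, `lbfgs` only works with `l2` or no penalty), so many of the candidate fits fail. A possible alternative, sketched below, is a list of smaller grids that only contain compatible combinations. The `C` and `l1_ratio` values are illustrative assumptions, not tuned choices.

```python
from sklearn.model_selection import ParameterGrid

C_values = [1e-3, 1, 10, 100]  # illustrative values only

param_grid = [
    # lbfgs, newton-cg and sag support the l2 penalty
    {"logisticregression__solver": ["lbfgs", "newton-cg", "sag"],
     "logisticregression__penalty": ["l2"],
     "logisticregression__C": C_values},
    # liblinear supports both l1 and l2
    {"logisticregression__solver": ["liblinear"],
     "logisticregression__penalty": ["l1", "l2"],
     "logisticregression__C": C_values},
    # elasticnet requires saga and an l1_ratio
    {"logisticregression__solver": ["saga"],
     "logisticregression__penalty": ["elasticnet"],
     "logisticregression__l1_ratio": [0.5],
     "logisticregression__C": C_values},
]

n_candidates = len(ParameterGrid(param_grid))
print("candidate settings:", n_candidates)
```

A list like this can be passed to `GridSearchCV` through `param_grid` in the same way as the single dictionary above.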

In [None]:
#cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
tuning_model=GridSearchCV(clf_LogisticRegression,param_grid=parameters,cv=3,verbose=True,scoring= 'accuracy')
tuning_model_over=GridSearchCV(clf_LogisticRegression,param_grid=parameters,cv=3,verbose=True,scoring= 'accuracy')

In [None]:

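from datetime import datetime

def timer(start_time=None):
    # Called without an argument: return the current time as the start marker
    if not start_time:
        return datetime.now()
    # Called with a start time: print the elapsed hours, minutes and seconds
    thour,temp_sec=divmod((datetime.now()-start_time).total_seconds(),3600)
    tmin,tsec=divmod(temp_sec,60)
    print(thour,":",tmin,':',round(tsec,2))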

In [None]:

# from datetime import datetime

# start_time=timer(None)

# tuning_model.fit(X_train,y_train)

# timer(start_time)
# #used to find best params 

In [None]:
# from datetime import datetime

# start_time=timer(None)

# tuning_model_over.fit(X_train_over,y_train_over)

# timer(start_time)
# #used to find best params 

In [None]:
# best hyperparameters 
# tuning_model.best_params_

In [None]:
# best hyperparameters 
# tuning_model_over.best_params_

In [None]:
# best model score
# tuning_model.best_score_

In [None]:
# best model score
# tuning_model_over.best_score_

## Fit the best params

In [None]:
tuned_hyper_model= make_pipeline(StandardScaler(),LogisticRegression(C=1,penalty='l1',solver= 'liblinear',max_iter=100))
tuned_hyper_model_over= make_pipeline(StandardScaler(),LogisticRegression(C=1,penalty='l1',solver= 'liblinear',max_iter=100))

In [None]:
tuned_hyper_model.fit(X_train,y_train)

In [None]:

tuned_hyper_model_over.fit(X_train_over,y_train_over)

## Evaluate 

In [None]:
for m in [tuned_hyper_model, tuned_hyper_model_over]:
    acc_train = m.score(X_train,y_train)
    acc_test = m.score(X_test,y_test)

    print("Training Accuracy:", round(acc_train, 4))
    print("Test Accuracy:", round(acc_test, 4))

In [None]:
# Plot confusion matrix
ConfusionMatrixDisplay.from_estimator(tuned_hyper_model,X_test,y_test)
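
With imbalanced classes, accuracy alone can look good even when the model misses most leavers. The sketch below uses made-up labels and predictions to show why per-class recall is worth checking. `y_true_demo` and `y_pred_demo` are placeholders for `y_test` and the model's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, classification_report

# Made-up example: 8 employees stayed (0), 2 left (1),
# and the model predicts "stayed" for everyone
y_true_demo = np.array([0] * 8 + [1] * 2)
y_pred_demo = np.zeros(10, dtype=int)

acc = accuracy_score(y_true_demo, y_pred_demo)
# Recall for class 1: the share of actual leavers the model found
rec = recall_score(y_true_demo, y_pred_demo, zero_division=0)

print("accuracy:", acc)
print("recall for leavers:", rec)
print(classification_report(y_true_demo, y_pred_demo, zero_division=0))
```

In the notebook, `classification_report(y_test, tuned_hyper_model.predict(X_test))` should give the same per-class breakdown for the fitted model.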