<a href="https://colab.research.google.com/github/dengathitu/ML-Collaboration/blob/main/ML_Collaboration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Statement

XYZ Ltd is facing issues of employee attrition.
They want to predict when the attrition occur for ease of preparedness. They need a data scientist to be able to create a machine learning pipeline that will enable them to predict whether an employee will quit or not.

 # Data Exploration
Understanding the data structure of the dataset

In [67]:
import pandas as pd
import numpy
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split # divide the dataset into train & test data
from sklearn.linear_model import LinearRegression, LogisticRegression # Contains the LinearRegression and LogisticRegression algorithims to be used
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeRegressor # Contains the Decision Tree Algorithim to be used
from sklearn.ensemble import RandomForestRegressor # Containins the Radom Forest Algorithim to be used

In [44]:
df = pd.read_csv("HR-Employee-Attrition.csv")

**Print an overview of the dataset**

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

**Description**
* There are 1470 entries.
* There are 26 numerical futures
* There 9 categorical features

**Check for null values**

In [46]:
df.isnull().sum()

Unnamed: 0,0
Age,0
Attrition,0
BusinessTravel,0
DailyRate,0
Department,0
DistanceFromHome,0
Education,0
EducationField,0
EmployeeCount,0
EmployeeNumber,0


**Interpretation**
* The dataset has no null values.

**check for duplicates**

In [47]:
data=df.duplicated().sum()
print("number of duplicated columns are;", data)

number of duplicated columns are; 0


**Find column names**

In [48]:
#Check all the columns in the dataset
df.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

# Data Preprocessing
Clean the dataset ready for analysis

In [49]:
categorical_variables = df.select_dtypes(include=["object"]).columns.tolist()

In [50]:
encoded_data = pd.get_dummies(df,columns=categorical_variables, drop_first=True)
encoded_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 48 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Age                                1470 non-null   int64
 1   DailyRate                          1470 non-null   int64
 2   DistanceFromHome                   1470 non-null   int64
 3   Education                          1470 non-null   int64
 4   EmployeeCount                      1470 non-null   int64
 5   EmployeeNumber                     1470 non-null   int64
 6   EnvironmentSatisfaction            1470 non-null   int64
 7   HourlyRate                         1470 non-null   int64
 8   JobInvolvement                     1470 non-null   int64
 9   JobLevel                           1470 non-null   int64
 10  JobSatisfaction                    1470 non-null   int64
 11  MonthlyIncome                      1470 non-null   int64
 12  MonthlyRate         

# Interpretation
* There are 22 Booleans
* There are 26 intergers (numerical data)

**Data redudancy**
* Repetitive data reduces the quality of our data

In [51]:
List = [col for col in df.columns if df[col].nunique()==1]
print("Repetative columns are;", List)

Repetative columns are; ['EmployeeCount', 'Over18', 'StandardHours']


In [52]:
drop = df.drop(columns=List)

In [53]:
drop.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeNumber',
       'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
       'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'OverTime',
       'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')

**Class Imbalanace**  
To prevent biasness of the machine learning model, we have to balance the classes so that the machine doesn't predict only value

In [54]:
X = encoded_data.drop(columns = ["Attrition_Yes"])
y = encoded_data["Attrition_Yes"]

smote_algorithim = SMOTE(random_state=42)
X_balance, y_balance = smote_algorithim.fit_resample(X,y)
print(y_balance.value_counts())


Attrition_Yes
True     1233
False    1233
Name: count, dtype: int64


# Data Splitting
* Split data into train(80%) and test(20%) data



In [55]:
X_test, X_train, y_test, y_train = train_test_split(X,y, test_size=0.2, random_state=42)

# Logistic Regression
* Purpose: Classifying data into two categories.
* How it Works: Uses a logistic function to model the probability of a certain class.

In [56]:
algorithim = LogisticRegression()
algorithim.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# Use the Algorithim to make Predictions

In [57]:
y_prediction = algorithim.predict(X_test)

# Model Evaluation

In [58]:
print(confusion_matrix(y_test,y_prediction))

[[967  11]
 [184  14]]


In [60]:
print(classification_report(y_test, y_prediction))

              precision    recall  f1-score   support

       False       0.84      0.99      0.91       978
        True       0.56      0.07      0.13       198

    accuracy                           0.83      1176
   macro avg       0.70      0.53      0.52      1176
weighted avg       0.79      0.83      0.78      1176



# Decision Tree
* **Purpose**: Both regression and classification tasks.
* **How it Works**: Splits data into branches based on feature values until a decision is made.

In [63]:
algorithim2 = DecisionTreeRegressor()
algorithim2.fit(X_train, y_train)

**Use the Model to make Prediction**

In [65]:
y_prediction2 = algorithim2.predict(X_test)

# Model Evaluation

In [66]:
print(confusion_matrix(y_test,y_prediction))

print(classification_report(y_test, y_prediction))

[[843 135]
 [156  42]]
              precision    recall  f1-score   support

       False       0.84      0.86      0.85       978
        True       0.24      0.21      0.22       198

    accuracy                           0.75      1176
   macro avg       0.54      0.54      0.54      1176
weighted avg       0.74      0.75      0.75      1176



# Random Forest
* Purpose: Both regression and classification tasks.
* How it Works: Combines multiple decision trees to improve accuracy and prevent overfitting.

In [68]:
algorithim3= RandomForestRegressor()
algorithim3.fit(X_train, y_train)

# Use the Model to make Predictions

In [69]:
y_prediction3 = algorithim3.predict(X_test)

## Model Evaluation

In [70]:
print(confusion_matrix(y_test,y_prediction))

print(classification_report(y_test, y_prediction))

[[843 135]
 [156  42]]
              precision    recall  f1-score   support

       False       0.84      0.86      0.85       978
        True       0.24      0.21      0.22       198

    accuracy                           0.75      1176
   macro avg       0.54      0.54      0.54      1176
weighted avg       0.74      0.75      0.75      1176



# Conclusion
* From the above model fitting, the best model to use for predition is the Logistic Regression which has an accuracy score of **83%**.