## <div style="text-align:center;text-decoration:underline;color:brown">Researcher: Akinsulure Akintunde</div>

**<u>Problem Statement:</u>**

In talent management, organizations struggle to predict and understand employee attrition well enough to improve their retention strategies. Accurate forecasting models matter because businesses want to optimize their workforce and create a workplace that supports employee satisfaction and longevity. The task is to develop a predictive model that analyzes HR data, identifies the most influential factors, and offers insight into employee turnover patterns, ultimately contributing to strategic talent management.

How can we leverage data and AI to predict employee attrition, so that organizations can proactively address retention challenges and foster a more stable and productive workforce?

---

### <div style='text-align:center;text-decoration:underline'>Loading the datasets from CSV files into pandas</div>

In [4]:
### Necessary libraries

import pandas as pd

# setting pandas to display all columns
pd.set_option('display.max_columns', None)

In [5]:
# Train data
train_data = pd.read_csv("data/train.csv")

## Inspecting the head
train_data.head()

| | id | Age | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | Attrition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 36 | Travel_Frequently | 599 | Research & Development | 24 | 3 | Medical | 1 | 4 | Male | 42 | 3 | 1 | Laboratory Technician | 4 | Married | 2596 | 5099 | 1 | Y | Yes | 13 | 3 | 2 | 80 | 1 | 10 | 2 | 3 | 10 | 0 | 7 | 8 | 0 |
| 1 | 1 | 35 | Travel_Rarely | 921 | Sales | 8 | 3 | Other | 1 | 1 | Male | 46 | 3 | 1 | Sales Representative | 1 | Married | 2899 | 10778 | 1 | Y | No | 17 | 3 | 4 | 80 | 1 | 4 | 3 | 3 | 4 | 2 | 0 | 3 | 0 |
| 2 | 2 | 32 | Travel_Rarely | 718 | Sales | 26 | 3 | Marketing | 1 | 3 | Male | 80 | 3 | 2 | Sales Executive | 4 | Divorced | 4627 | 16495 | 0 | Y | No | 17 | 3 | 4 | 80 | 2 | 4 | 3 | 3 | 3 | 2 | 1 | 2 | 0 |
| 3 | 3 | 38 | Travel_Rarely | 1488 | Research & Development | 2 | 3 | Medical | 1 | 3 | Female | 40 | 3 | 2 | Healthcare Representative | 1 | Married | 5347 | 13384 | 3 | Y | No | 14 | 3 | 3 | 80 | 0 | 15 | 1 | 1 | 6 | 0 | 0 | 2 | 0 |
| 4 | 4 | 50 | Travel_Rarely | 1017 | Research & Development | 5 | 4 | Medical | 1 | 2 | Female | 37 | 3 | 5 | Manager | 1 | Single | 19033 | 19805 | 1 | Y | Yes | 13 | 3 | 3 | 80 | 0 | 31 | 0 | 3 | 31 | 14 | 4 | 10 | 1 |


In [7]:
# Test data
test_data = pd.read_csv("data/test.csv")

## Inspecting the head
test_data.head()

| | id | Age | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1677 | 19 | Non-Travel | 992 | Research & Development | 1 | 1 | Medical | 1 | 4 | Male | 43 | 3 | 1 | Laboratory Technician | 3 | Single | 2318 | 17778 | 1 | Y | No | 12 | 3 | 4 | 80 | 0 | 1 | 2 | 2 | 1 | 0 | 0 | 0 |
| 1 | 1678 | 45 | Travel_Rarely | 1136 | Sales | 4 | 4 | Marketing | 1 | 3 | Male | 67 | 3 | 2 | Sales Executive | 1 | Divorced | 5486 | 12421 | 6 | Y | Yes | 12 | 3 | 3 | 80 | 1 | 7 | 3 | 3 | 2 | 2 | 2 | 2 |
| 2 | 1679 | 37 | Travel_Rarely | 155 | Research & Development | 13 | 3 | Life Sciences | 1 | 4 | Male | 41 | 3 | 1 | Research Scientist | 4 | Divorced | 2741 | 23577 | 4 | Y | Yes | 13 | 3 | 2 | 80 | 2 | 13 | 2 | 2 | 7 | 7 | 1 | 7 |
| 3 | 1680 | 32 | Travel_Rarely | 688 | Research & Development | 1 | 4 | Life Sciences | 1 | 3 | Male | 89 | 2 | 2 | Healthcare Representative | 3 | Single | 5228 | 20364 | 1 | Y | No | 13 | 3 | 3 | 80 | 0 | 14 | 2 | 2 | 14 | 10 | 11 | 8 |
| 4 | 1681 | 29 | Travel_Frequently | 464 | Research & Development | 9 | 1 | Life Sciences | 1 | 3 | Male | 79 | 3 | 1 | Laboratory Technician | 4 | Single | 1223 | 15178 | 1 | Y | No | 14 | 3 | 1 | 80 | 0 | 1 | 5 | 3 | 1 | 0 | 0 | 0 |
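
The two previews differ by one column: `Attrition` appears only in the training data, which marks it as the prediction target. A minimal sketch of how that difference could be checked programmatically, using tiny stand-in frames rather than the real files (`toy_train` and `toy_test` are illustrative names, and their values are a small subset of the rows shown above):

```python
import pandas as pd

# Tiny stand-ins for train_data / test_data (illustrative subset only)
toy_train = pd.DataFrame({"id": [0, 1], "Age": [36, 35], "Attrition": [0, 0]})
toy_test = pd.DataFrame({"id": [1677, 1678], "Age": [19, 45]})

# Columns present in train but absent from test: the target candidates
target_cols = toy_train.columns.difference(toy_test.columns).tolist()
print(target_cols)
```

Running the same comparison on the real `train_data` and `test_data` frames would be a quick guard against mismatched feature sets before modelling.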


In [9]:
## Finally, viewing what the submission should look like
sample_sub = pd.read_csv('data/sample_submission.csv')
sample_sub.head()

| | id | Attrition |
|---|---|---|
| 0 | 1677 | 0.119261 |
| 1 | 1678 | 0.119261 |
| 2 | 1679 | 0.119261 |
| 3 | 1680 | 0.119261 |
| 4 | 1681 | 0.119261 |


The hackathon submission file should follow this format, i.e. it should contain the `id` and `Attrition` columns.
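
As a rough illustration of that format, the sketch below assembles a two-column frame and serializes it without the pandas index. It assumes the submission file holds only `id` and `Attrition`; the `toy_ids` and `toy_probs` values are placeholders, not model output from this notebook.

```python
import numpy as np
import pandas as pd

# Stand-in for test_data["id"]; the real ids come from data/test.csv
toy_ids = pd.Series([1677, 1678, 1679], name="id")

# Placeholder probabilities; a fitted model's predictions would go here
toy_probs = np.array([0.12, 0.40, 0.05])

submission = pd.DataFrame({"id": toy_ids, "Attrition": toy_probs})

# index=False keeps the file to exactly the two expected columns
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file path to `to_csv` instead of capturing the text would write the submission to disk.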

---

### <div style="text-align:center">Exploratory Data Analysis</div>

In [12]:
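### Train data inspection with a custom overview function
def get_overview(df):
    """Summarize each column: missing values, unique values, dtype, repeated values."""
    var_df = pd.DataFrame(columns=[
        'Variable', 'NaN', 'Percentage Missing', 'Unique', 'N-Unique', 'Dtype', 'Duplicated Rows'
    ])
    for i, col in enumerate(df.columns):
        n_missing = df[col].isna().sum()
        missing = f"{round((n_missing / len(df[col])) * 100, 2)}%"
        # Counts values that repeat an earlier value in this column,
        # not fully duplicated rows, despite the column name
        duplicated = df[col].duplicated().sum()
        var_df.loc[i] = [
            col, n_missing, missing, df[col].unique(),
            df[col].nunique(), df[col].dtypes, duplicated
        ]
    # Keeps the old index as an extra 'index' column in the output
    var_df.reset_index(inplace=True)
    return var_df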


get_overview(train_data)

| | index | Variable | NaN | Percentage Missing | Unique | N-Unique | Dtype | Duplicated Rows |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | id | 0 | 0.0% | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | 1677 | int64 | 0 |
| 1 | 1 | Age | 0 | 0.0% | [36, 35, 32, 38, 50, 27, 34, 40, 51, 25, 29, 4... | 43 | int64 | 1634 |
| 2 | 2 | BusinessTravel | 0 | 0.0% | [Travel_Frequently, Travel_Rarely, Non-Travel] | 3 | object | 1674 |
| 3 | 3 | DailyRate | 0 | 0.0% | [599, 921, 718, 1488, 1017, 566, 944, 1009, 12... | 625 | int64 | 1052 |
| 4 | 4 | Department | 0 | 0.0% | [Research & Development, Sales, Human Resources] | 3 | object | 1674 |
| 5 | 5 | DistanceFromHome | 0 | 0.0% | [24, 8, 26, 2, 5, 10, 6, 9, 28, 1, 25, 11, 7, ... | 29 | int64 | 1648 |
| 6 | 6 | Education | 0 | 0.0% | [3, 4, 1, 2, 5, 15] | 6 | int64 | 1671 |
| 7 | 7 | EducationField | 0 | 0.0% | [Medical, Other, Marketing, Life Sciences, Tec... | 6 | object | 1671 |
| 8 | 8 | EmployeeCount | 0 | 0.0% | [1] | 1 | int64 | 1676 |
| 9 | 9 | EnvironmentSatisfaction | 0 | 0.0% | [4, 1, 3, 2] | 4 | int64 | 1673 |
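
The overview shows `EmployeeCount` with a single unique value, so that column cannot help a model distinguish one employee from another. One way such constant columns might be listed is sketched below on a toy frame; the `constant_columns` helper is hypothetical and not part of the notebook.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    n_unique = df.nunique(dropna=False)
    return n_unique[n_unique == 1].index.tolist()

# Toy frame mirroring a few columns from the overview (illustrative values)
toy = pd.DataFrame({
    "Age": [36, 35, 32],
    "BusinessTravel": ["Travel_Frequently", "Travel_Rarely", "Travel_Rarely"],
    "EmployeeCount": [1, 1, 1],
})
print(constant_columns(toy))
```

Whether to drop the columns this returns is a modelling decision; the sketch only identifies them.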


In [13]:
get_overview(test_data)

| | index | Variable | NaN | Percentage Missing | Unique | N-Unique | Dtype | Duplicated Rows |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | id | 0 | 0.0% | [1677, 1678, 1679, 1680, 1681, 1682, 1683, 168... | 1119 | int64 | 0 |
| 1 | 1 | Age | 0 | 0.0% | [19, 45, 37, 32, 29, 51, 52, 30, 44, 42, 35, 2... | 42 | int64 | 1077 |
| 2 | 2 | BusinessTravel | 0 | 0.0% | [Non-Travel, Travel_Rarely, Travel_Frequently] | 3 | object | 1116 |
| 3 | 3 | DailyRate | 0 | 0.0% | [992, 1136, 155, 688, 464, 990, 1146, 945, 548... | 515 | int64 | 604 |
| 4 | 4 | Department | 0 | 0.0% | [Research & Development, Sales, Human Resources] | 3 | object | 1116 |
| 5 | 5 | DistanceFromHome | 0 | 0.0% | [1, 4, 13, 9, 20, 6, 5, 7, 26, 14, 2, 3, 11, 1... | 29 | int64 | 1090 |
| 6 | 6 | Education | 0 | 0.0% | [1, 4, 3, 2, 5] | 5 | int64 | 1114 |
| 7 | 7 | EducationField | 0 | 0.0% | [Medical, Marketing, Life Sciences, Technical ... | 6 | object | 1113 |
| 8 | 8 | EmployeeCount | 0 | 0.0% | [1] | 1 | int64 | 1118 |
| 9 | 9 | EnvironmentSatisfaction | 0 | 0.0% | [4, 3, 2, 1, 0] | 5 | int64 | 1114 |
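
The two overviews do not list identical category sets: `Education` includes a 15 in the training data, and `EnvironmentSatisfaction` includes a 0 in the test data only. A small sketch of how values that appear in one split but not the other could be surfaced; `unseen_values` is a hypothetical helper, and the two series copy the `EnvironmentSatisfaction` values shown in the overviews.

```python
import pandas as pd

def unseen_values(train_col, test_col):
    """Values that occur in test_col but never in train_col, sorted."""
    return sorted(set(test_col.unique()) - set(train_col.unique()))

# Unique EnvironmentSatisfaction values as listed in the two overviews
toy_train_env = pd.Series([4, 1, 3, 2])
toy_test_env = pd.Series([4, 3, 2, 1, 0])

print(unseen_values(toy_train_env, toy_test_env))
```

Swapping the arguments gives the reverse check, i.e. values seen in training that never occur in the test split.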


This is lovely news: there are no missing values in either dataset!
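
A compact way to confirm this without reading the per-column table is to total the missing cells in each frame. The sketch below uses toy frames in place of `train_data` and `test_data`; `count_missing` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def count_missing(df):
    """Total number of NaN/None cells across the whole frame."""
    return int(df.isna().sum().sum())

# Toy frames: one complete, one with two gaps (illustrative values)
toy_clean = pd.DataFrame({"Age": [36, 35], "Gender": ["Male", "Male"]})
toy_gappy = pd.DataFrame({"Age": [36, np.nan], "Gender": ["Male", None]})

print(count_missing(toy_clean))
print(count_missing(toy_gappy))
```

A result of zero for both real frames would match the `NaN` column of the overviews above.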

---

### <div style="text-align:center">Data Visualization</div>