## Understanding the Project Scenario

**Problem Statement:**
Salifort Motors is facing a high employee turnover rate, leading to increased costs and decreased productivity. The leadership team seeks to understand the underlying factors contributing to turnover and develop strategies to improve retention.

**Objective:**
Build a model that predicts whether an employee will leave the company, using factors such as department, number of projects, average monthly hours, tenure, and other relevant data points.

**Data:**
* Employee survey data: self-reported satisfaction, last evaluation score, workload, tenure, accident and promotion history, department, and salary band.
* Relevant variables: department, number of projects, average monthly hours, and tenure are the first candidates to examine; others may prove relevant during exploration.

**Model Approach:**
* **Statistical Model:** Logistic regression could be a suitable choice due to its ability to handle binary outcomes (leave or stay).
* **Machine Learning Models:** Decision trees, random forests, and XGBoost are potential candidates for their ability to handle complex relationships and potentially improve predictive accuracy.

**Evaluation:**
* Use appropriate metrics such as accuracy, precision, recall, F1-score, and AUC-ROC to assess the model's performance.
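
As a minimal sketch of how these metrics can be computed with scikit-learn, the snippet below uses made-up labels and scores (not results from the Salifort data) and an assumed 0.5 decision threshold.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical ground truth: 1 = left, 0 = stayed (illustrative values only)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
# Hypothetical predicted probabilities of leaving
y_score = [0.1, 0.4, 0.2, 0.6, 0.8, 0.3, 0.9, 0.2]
# Turn probabilities into class labels using an assumed 0.5 threshold
y_pred = [int(s >= 0.5) for s in y_score]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # AUC-ROC is computed from the scores, not the thresholded labels
    "auc_roc": roc_auc_score(y_true, y_score),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When leavers are a minority of the data, recall and AUC-ROC tend to be more informative than accuracy alone, since a model that predicts "stay" for everyone can still score a high accuracy.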

**Recommendations:**
* Based on the model's findings, identify key factors driving turnover.
* Propose strategies to address these factors and improve employee retention.

Following these steps should clarify what drives turnover at Salifort Motors and give leadership concrete, data-backed options for improving retention.


## PACE Strategy Table for the Salifort Motors Project

| Milestone | Task | PACE Stage |
|---|---|---|
| **Data Acquisition and Exploration** | Collect employee survey data | Plan |
| | Clean and preprocess data | Analyze |
| | Explore data relationships and distributions | Analyze |
| | Identify relevant variables | Analyze |
| **Model Development and Selection** | Build logistic regression model | Construct |
| | Build decision tree, random forest, and XGBoost models | Construct |
| | Evaluate model performance using appropriate metrics | Construct |
| | Select the best-performing model | Construct |
| **Model Interpretation and Insights** | Analyze model coefficients or feature importance | Execute |
| | Identify key factors driving turnover | Execute |
| | Generate actionable recommendations | Execute |
| | Communicate findings to leadership | Execute |
| **Model Deployment and Monitoring** | Deploy model into production environment | Execute |
| | Monitor model performance and retrain as needed | Execute |
| | Continuously evaluate and refine the model | Execute |
| | Provide ongoing insights to leadership | Execute |



## Analyzing the Salifort Motors Employee Data

### Data Understanding

**Dataset:** HR_capstone_dataset.csv (read in this notebook from the Kaggle file `HR_comma_sep.csv`)

**Rows:** 14,999 (representing individual employees)

**Columns:** 10 (containing various employee attributes)

**Column Descriptions:**

| Column Name | Type | Description |
|---|---|---|
| satisfaction_level | float64 | Self-reported satisfaction level (0-1) |
| last_evaluation | float64 | Score of last performance review (0-1) |
| number_project | int64 | Number of projects contributed to |
| average_montly_hours | int64 | Average monthly working hours (column name is misspelled in the source file) |
| time_spend_company | int64 | Years with the company |
| Work_accident | int64 | Whether the employee had a work accident (0/1) |
| left | int64 | Whether the employee left the company (0/1) |
| promotion_last_5years | int64 | Whether promoted in the last 5 years (0/1) |
| Department | object | Employee's department |
| salary | object | Salary level (low, medium, high) |

### Initial Observations

* **Target Variable:** `left` (binary indicating employee attrition)
* **Predictor Variables:** `satisfaction_level`, `last_evaluation`, `number_project`, `average_montly_hours`, `time_spend_company`, `Work_accident`, `promotion_last_5years`, `Department`, and `salary`
* **Data Types:** Two variables are floats (`satisfaction_level`, `last_evaluation`), six are integers (int64), and `Department` and `salary` are categorical (object).
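
The split into target, numeric predictors, and categorical predictors can be sketched as follows. The small frame here is a stand-in with illustrative values and only a few of the dataset's columns; the variable names `sample`, `numeric_cols`, and `categorical_cols` are hypothetical.

```python
import pandas as pd

# Tiny stand-in frame using a few of the dataset's column names (values are illustrative)
sample = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "number_project": [2, 5, 7],
    "Department": ["sales", "sales", "hr"],
    "salary": ["low", "medium", "medium"],
    "left": [1, 1, 0],
})

# Separate the binary target from the predictors
y = sample["left"]
X = sample.drop(columns="left")

# Group predictors by dtype so each group can be preprocessed appropriately
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```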

### Potential Relationships and Hypotheses

Based on the data, we can explore the following relationships and hypotheses:

* **Satisfaction and Attrition:** Employees with lower satisfaction levels may be more likely to leave.
* **Performance and Attrition:** Employees with poor performance reviews or excessive workload might be more likely to leave.
* **Tenure and Attrition:** Employees with shorter tenures may be more likely to leave due to lack of commitment or fit.
* **Promotions and Attrition:** Employees who feel undervalued or lack opportunities for growth may be more likely to leave.
* **Work-Life Balance and Attrition:** Employees with excessive working hours or poor work-life balance may be more likely to leave.
* **Department and Attrition:** Certain departments or roles might have higher turnover rates.
* **Salary and Attrition:** Employees who feel underpaid or dissatisfied with their compensation may be more likely to leave.
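
A quick first check for hypotheses like these is the attrition rate per group. Because `left` is coded 0/1, its group mean is the share of employees who left. The sketch below uses a made-up six-row frame, not the real data.

```python
import pandas as pd

# Illustrative rows only: salary band and whether the employee left
sample = pd.DataFrame({
    "salary": ["low", "low", "low", "medium", "medium", "high"],
    "left": [1, 1, 0, 1, 0, 0],
})

# Mean of a 0/1 column per group = attrition rate per group
attrition_by_salary = sample.groupby("salary")["left"].mean()
print(attrition_by_salary.sort_values(ascending=False))
```

The same pattern applies to any categorical or binned predictor, for example `Department` or tenure in years.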

### Next Steps

1. **Data Cleaning and Preprocessing:**
   * Handle missing values (if any).
   * Check for outliers and inconsistencies.
   * Convert categorical variables (department, salary) to numerical format (e.g., one-hot encoding).

2. **Exploratory Data Analysis (EDA):**
   * Visualize the distribution of variables (histograms, box plots).
   * Calculate summary statistics (mean, median, mode, standard deviation).
   * Explore correlations between variables.

3. **Feature Engineering:**
   * Consider creating new features based on existing variables (e.g., calculate a work-life balance index).

4. **Model Building and Evaluation:**
   * Build and evaluate various models (logistic regression, decision trees, random forests, XGBoost).
   * Use appropriate metrics (accuracy, precision, recall, F1-score, AUC-ROC) to assess model performance.

5. **Interpretation and Recommendations:**
   * Analyze the model's results to identify key factors influencing attrition.
   * Provide actionable recommendations to improve employee retention.
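
The steps above can be sketched end to end on synthetic data. Everything in this example is an assumption made for illustration: the data is randomly generated (it is not the Salifort dataset), the rule that produces `left` is invented, and the `long_hours` feature with its 200-hour threshold is hypothetical. XGBoost is left out to keep the sketch light on dependencies.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data; column names mirror a few of the real ones
demo = pd.DataFrame({
    "satisfaction_level": rng.uniform(0.1, 1.0, n).round(2),
    "average_montly_hours": rng.integers(96, 311, n),
    "Department": rng.choice(["sales", "technical", "hr"], n),
    "salary": rng.choice(["low", "medium", "high"], n),
})
# Invented rule so the target carries some signal: lower satisfaction, higher chance of leaving
p_leave = 0.6 - 0.5 * demo["satisfaction_level"]
demo["left"] = (rng.uniform(0, 1, n) < p_leave).astype(int)

# Step 1: encode categoricals (salary is ordered, Department is not)
X = demo.drop(columns="left")
X["salary"] = X["salary"].map({"low": 0, "medium": 1, "high": 2}).astype(int)
X = pd.get_dummies(X, columns=["Department"], dtype=int)
y = demo["left"]

# Step 3: hypothetical engineered feature, using an assumed 200-hour threshold
X["long_hours"] = (X["average_montly_hours"] > 200).astype(int)

# Step 4: stratified split keeps the leave/stay ratio similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Score with the predicted probability of the positive class (left = 1)
    auc[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(auc)
```

Mapping `salary` to 0/1/2 preserves its order, while one-hot encoding `Department` avoids implying an order that does not exist.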



## Step 1. Imports

*   Import packages
*   Load dataset

### Import packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

### Load dataset

In [2]:
# Load dataset into a dataframe
df0 = pd.read_csv("/kaggle/input/hr-analytics-and-job-prediction/HR_comma_sep.csv")

# Display first few rows of the dataframe
df0.head()

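| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |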


## Step 2. Data Exploration (Initial EDA and data cleaning)

- Understand your variables
- Clean your dataset (missing data, redundant data, outliers)

In [3]:
print("Dataset Shape:")
df0.shape

Dataset Shape:


(14999, 10)

In [4]:
print("\nData Types:")
df0.dtypes


Data Types:


satisfaction_level       float64
last_evaluation          float64
number_project             int64
average_montly_hours       int64
time_spend_company         int64
Work_accident              int64
left                       int64
promotion_last_5years      int64
Department                object
salary                    object
dtype: object

In [5]:
# Gather descriptive statistics about the data
df0.describe()

| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years |
|---|---|---|---|---|---|---|---|---|
| count | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 | 14999.0 |
| mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.14461 | 0.238083 | 0.021268 |
| std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 |
| min | 0.09 | 0.36 | 2.0 | 96.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.44 | 0.56 | 3.0 | 156.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.64 | 0.72 | 4.0 | 200.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.82 | 0.87 | 5.0 | 245.0 | 4.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 7.0 | 310.0 | 10.0 | 1.0 | 1.0 | 1.0 |
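
Because `left` is coded 0/1, its mean in the summary above (0.238083) is the share of employees who left in the raw data, roughly 24%, so the classes are imbalanced. A minimal sketch of that reading, on a made-up series rather than the real column:

```python
import pandas as pd

# Illustrative 0/1 target: 2 of 8 employees left
left = pd.Series([1, 0, 0, 1, 0, 0, 0, 0], name="left")

# The mean of a 0/1 column equals the proportion of ones
share_left = left.mean()

# value_counts(normalize=True) gives the share of each class explicitly
class_shares = left.value_counts(normalize=True)

print(f"share who left: {share_left:.2%}")
print(class_shares)
```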


### Rename columns

In [6]:
# Display all column names
df0.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'Department', 'salary'],
      dtype='object')

In [7]:
# Rename columns as needed (this also drops the 'montly' typo and lowercases 'Department')
df0 = df0.rename(columns={
    'satisfaction_level': 'satisfaction',
    'last_evaluation': 'performance_score',
    'number_project': 'projects',
    'average_montly_hours': 'average_hours',
    'promotion_last_5years': 'promotion',
    'Department': 'department',
})


# Display all column names after the update
print(df0.columns)

Index(['satisfaction', 'performance_score', 'projects', 'average_hours',
       'time_spend_company', 'Work_accident', 'left', 'promotion',
       'department', 'salary'],
      dtype='object')
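
After this rename, `Work_accident` is still capitalized and `time_spend_company` keeps its original wording. If fully consistent names were wanted, one option is sketched below on an empty frame with the same columns. This is not what the notebook does, and `tenure` is a hypothetical replacement name.

```python
import pandas as pd

# Empty frame carrying the column names shown in the output above
demo = pd.DataFrame(columns=[
    'satisfaction', 'performance_score', 'projects', 'average_hours',
    'time_spend_company', 'Work_accident', 'left', 'promotion',
    'department', 'salary',
])

# Lowercase every column name in one step
demo.columns = demo.columns.str.lower()

# Hypothetical, more descriptive name for years at the company
demo = demo.rename(columns={'time_spend_company': 'tenure'})

print(list(demo.columns))
```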


### Check missing values

In [8]:
# Check for missing values
df0.isnull().sum()

satisfaction          0
performance_score     0
projects              0
average_hours         0
time_spend_company    0
Work_accident         0
left                  0
promotion             0
department            0
salary                0
dtype: int64

### Check duplicates

In [9]:
# Check for duplicates
duplicates = df0.duplicated()
# Count the number of duplicates
num_duplicates = duplicates.sum()
# Print the number of duplicates
print("Number of duplicates:", num_duplicates)

Number of duplicates: 3008


In [10]:
# Inspect some rows containing duplicates as needed
df0[df0.duplicated()].head()

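| | satisfaction | performance_score | projects | average_hours | time_spend_company | Work_accident | left | promotion | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 396 | 0.46 | 0.57 | 2 | 139 | 3 | 0 | 1 | 0 | sales | low |
| 866 | 0.41 | 0.46 | 2 | 128 | 3 | 0 | 1 | 0 | accounting | low |
| 1317 | 0.37 | 0.51 | 2 | 127 | 3 | 0 | 1 | 0 | sales | medium |
| 1368 | 0.41 | 0.52 | 2 | 132 | 3 | 0 | 1 | 0 | RandD | low |
| 1461 | 0.42 | 0.53 | 2 | 142 | 3 | 0 | 1 | 0 | sales | low |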


In [11]:
# Drop duplicates and save resulting dataframe in a new variable as needed
df0_no_duplicates = df0.drop_duplicates(keep='first')

# Display first few rows of new dataframe as needed
print("First few rows of the dataframe without duplicates:")
df0_no_duplicates.head()

First few rows of the dataframe without duplicates:


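| | satisfaction | performance_score | projects | average_hours | time_spend_company | Work_accident | left | promotion | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |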


In [12]:
print("Dataset Shape with duplicates:")
df0.shape

Dataset Shape with duplicates:


(14999, 10)

In [13]:
print("Dataset Shape with no duplicates:")
df0_no_duplicates.shape

Dataset Shape with no duplicates:


(11991, 10)

3,008 rows are exact duplicates of an earlier row. That is about 20% of the 14,999 rows, and dropping them leaves 11,991.
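
The duplicate share works out to 3,008 / 14,999 ≈ 0.2005. A small sketch of the same checks on a made-up five-row frame shows how the count, the share, and the deduplicated shape relate; the values are illustrative only.

```python
import pandas as pd

# Illustrative frame: rows 2 and 4 repeat rows 0 and 1 exactly
demo = pd.DataFrame({
    "satisfaction": [0.38, 0.80, 0.38, 0.11, 0.80],
    "average_hours": [157, 262, 157, 272, 262],
    "left": [1, 1, 1, 0, 1],
})

# True for every repeat after the first occurrence of a row
dup_mask = demo.duplicated()
dup_count = int(dup_mask.sum())
dup_share = dup_mask.mean()

# keep=False marks every copy, which helps when inspecting duplicates side by side
all_copies = demo[demo.duplicated(keep=False)].sort_values(list(demo.columns))

# Keep the first occurrence of each row, as in the notebook
deduped = demo.drop_duplicates(keep="first").reset_index(drop=True)

print(f"{dup_count} duplicate rows ({dup_share:.0%}); {len(deduped)} rows remain")
```

With several continuous columns, exact matches across every field are unlikely to be distinct employees, which supports dropping them.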

### Check outliers