### EDA_02: Bivariate Analysis & Strategic Hypothesis-Driven Investigation
##### Project name: Employee Churn Prediction
##### Author: Fausto Pucheta Fortin

### **Overview:**
This notebook focuses on understanding the factors influencing employee turnover by performing bivariate analysis and hypothesis-driven investigation. Key steps include examining the relationships between the target variable (left) and key features, identifying significant patterns, and highlighting actionable insights for model feature engineering.

- **Approach**: Apply statistical tests and visualizations to uncover turnover patterns, considering both individual features and the feature interactions that affect turnover.

### **Tasks:**
#### 1. Exploration of Relationships Between Key Continuous Features and Turnover
Objective: To analyze how numerical features (satisfaction level, last evaluation score, number of projects, average monthly hours, and time spent at the company) relate to turnover (`left`).

- **Satisfaction Level vs. Left:** Visualized distributions of satisfaction levels among employees who stayed versus those who left. Focus on identifying threshold ranges that may signal increased turnover risk.
- **Last Evaluation vs. Left:** Assessed last evaluation scores across turnover groups, identifying any indication of under- or over-performance that correlates with attrition.
- **Number of Projects vs. Left:** Analyzed the average and distribution of project counts for retained and departed employees to determine whether workload intensity influences turnover.
- **Average Monthly Hours vs. Left:** Evaluated whether excessive hours or low engagement (in terms of monthly hours) are associated with turnover, using distribution and central tendency metrics.
- **Time Spent at Company vs. Left:** Investigated tenure patterns to understand if employees are more likely to leave at specific milestones (e.g., 1, 3, or 5 years).
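
The comparisons listed above can be sketched with a grouped summary. This is a minimal illustration, not the notebook's actual analysis: `toy` is a hypothetical synthetic frame that only borrows the dataset's column names, and its values are randomly generated.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's `df` (assumption: same column names
# as the processed dataset; every value here is randomly generated).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "satisfaction_level": rng.uniform(0.1, 1.0, n).round(2),
    "last_evaluation": rng.uniform(0.4, 1.0, n).round(2),
    "number_project": rng.integers(2, 8, n),
    "average_montly_hours": rng.integers(96, 311, n),
    "time_spend_company": rng.integers(2, 11, n),
    "left": rng.integers(0, 2, n),
})

continuous_cols = [c for c in toy.columns if c != "left"]

# Central tendency and spread of each feature, split by turnover status.
group_summary = toy.groupby("left")[continuous_cols].agg(["mean", "median", "std"])
print(group_summary.T)

# Share of leavers per tenure year, to look for milestone effects.
tenure_rate = toy.groupby("time_spend_company")["left"].mean()
print(tenure_rate)
```

On the real data, the same two calls could be pointed at `df`; a box plot or histogram per feature, split by `left`, would complement the table.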


#### 2. Analysis of Categorical Features and Their Association with Turnover
Objective: To assess categorical factors, such as work accident incidence, promotions, department, and salary level, and their contribution to turnover likelihood.

- **Work Accident and Turnover:** Analyzed turnover rates within work accident categories (accidents vs. no accidents) to see if safety incidents influence departure decisions.
- **Promotion in Last 5 Years vs. Turnover:** Compared turnover rates between those who received recent promotions and those who did not, hypothesizing that lack of advancement could drive turnover.
- **Department vs. Turnover:** Assessed department-wise turnover rates to uncover specific functional areas (e.g., sales, technical) with higher attrition, which could signal department-specific issues.
- **Salary Level vs. Turnover:** Evaluated turnover distribution across salary levels (low, medium, high) to identify if compensation tiers affect employee retention.
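
A turnover rate per category level is one simple way to carry out the comparisons above. The sketch below assumes a binary `left` column and uses a small hand-made frame; `toy` and `turnover_rate_by` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Tiny hand-made example (assumption: column names mirror the dataset).
toy = pd.DataFrame({
    "Work_accident": [0, 0, 1, 0, 1, 0, 0, 1],
    "Department": ["sales", "sales", "technical", "technical",
                   "sales", "hr", "hr", "technical"],
    "salary": ["low", "medium", "low", "high", "low", "medium", "low", "high"],
    "left": [1, 0, 0, 0, 1, 1, 0, 0],
})


def turnover_rate_by(frame, column):
    """Return headcount and share of leavers for each level of `column`."""
    return (
        frame.groupby(column)["left"]
        .agg(n="count", turnover_rate="mean")
        .sort_values("turnover_rate", ascending=False)
    )


for col in ["Work_accident", "Department", "salary"]:
    print(turnover_rate_by(toy, col), end="\n\n")
```

Reporting the headcount `n` next to each rate helps avoid over-reading rates that come from very small groups.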

#### 3. Correlation Analysis of Continuous Features
Objective: To understand the interrelationships among continuous features and their potential collective influence on turnover.

- Created a correlation heatmap to visualize relationships among continuous variables, such as satisfaction level, last evaluation score, average monthly hours, and tenure.
- Focused on identifying high or moderate correlations that could affect turnover jointly, such as the relationship between monthly hours and evaluation scores or satisfaction levels and number of projects.
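
As a rough sketch of this step, the correlation matrix can be computed with `DataFrame.corr` and then flattened into a ranked list of feature pairs. The data below is synthetic, with a dependence between hours and evaluation score deliberately built in so the ranking has something to find; it says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption: column names mirror the dataset). The link
# between hours and evaluation score is injected on purpose for the demo.
rng = np.random.default_rng(1)
n = 300
hours = rng.normal(200, 50, n)
toy = pd.DataFrame({
    "satisfaction_level": rng.uniform(0.1, 1.0, n),
    "last_evaluation": 0.4 + 0.002 * hours + rng.normal(0, 0.05, n),
    "number_project": rng.integers(2, 8, n),
    "average_montly_hours": hours,
    "time_spend_company": rng.integers(2, 11, n),
})

corr = toy.corr(method="pearson")

# Keep each pair once (upper triangle, diagonal excluded) and rank the
# pairs by absolute correlation.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)
    .unstack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)
print(pairs.head())
```

Passing `corr` to `sns.heatmap(corr, annot=True)` would render the heatmap itself; the ranked list is a convenient companion when the matrix is large.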


#### 4. Hypothesis Testing for Significant Differences Between Groups
Objective: To conduct statistical tests that quantify differences between turnover groups across various features, ensuring results are statistically valid.

- **T-tests for Continuous Variables:** Conducted t-tests (or Mann-Whitney U tests where distributions were non-normal) on continuous features (e.g., satisfaction level, monthly hours) to confirm if observed differences between groups (stayed vs. left) are statistically significant.
- **Cohen's d for Effect Size:** Calculated Cohen's d for each continuous variable where the t-test indicated a statistically significant difference. This quantifies the magnitude of the difference between groups, using conventional thresholds (0.2 = small, 0.5 = medium, 0.8 = large) to judge whether the difference is practically relevant.
- **Chi-Square Test for Categorical Variables:** Performed chi-square tests of independence on categorical features like work accident, department, and salary level to determine whether each is associated with turnover.
- **ANOVA for Categorical Variables with Multiple Categories:** Applied ANOVA where appropriate to assess differences in turnover across departments, with post-hoc tests to pinpoint specific department differences if needed.
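
The tests listed above can be sketched with `scipy.stats`. The example assumes two independent groups and uses made-up numbers throughout: the samples are drawn from arbitrary normal distributions and the contingency counts are invented. `cohens_d` is a hypothetical helper using the pooled standard deviation, and Cramér's V is added here only as one possible effect size for the chi-square test.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up samples standing in for one continuous feature in each group.
rng = np.random.default_rng(42)
stayed = rng.normal(0.65, 0.20, 300).clip(0.09, 1.0)
left = rng.normal(0.45, 0.20, 100).clip(0.09, 1.0)


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled std. dev."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)
    ) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Welch's t-test (does not assume equal variances) and its rank-based
# alternative for non-normal distributions.
t_stat, t_p = stats.ttest_ind(stayed, left, equal_var=False)
u_stat, u_p = stats.mannwhitneyu(stayed, left, alternative="two-sided")
d = cohens_d(stayed, left)
print(f"Welch t={t_stat:.2f} (p={t_p:.3g}), U p={u_p:.3g}, d={d:.2f}")

# Chi-square test of independence on an invented salary x turnover table.
counts = pd.DataFrame(
    {"stayed": [120, 130, 50], "left": [60, 35, 5]},
    index=["low", "medium", "high"],
)
chi2, chi_p, dof, expected = stats.chi2_contingency(counts)

# Cramér's V rescales chi-square to the 0..1 range as an effect size.
n_obs = counts.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n_obs * (min(counts.shape) - 1)))
print(f"chi2={chi2:.2f} (dof={dof}, p={chi_p:.3g}), V={cramers_v:.2f}")
```

With samples as large as a full HR dataset, very small differences can reach statistical significance, which is why the effect sizes are reported alongside the p-values.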

#### 5. Interaction Analysis of Key Features Related to Turnover
Objective: To explore multi-dimensional relationships that might reveal more nuanced turnover drivers.

- **Satisfaction Level & Salary Level:** Analyzed turnover within combinations of satisfaction and salary levels to check for patterns that indicate if dissatisfaction is heightened at certain salary tiers.
- **Number of Projects & Average Monthly Hours:** Explored turnover across combinations of project count and monthly hours to assess if workload and effort level jointly influence turnover risk.
- **Tenure and Satisfaction Level:** Examined combinations of tenure and satisfaction levels to determine if new versus longer-tenured employees display different satisfaction-to-turnover patterns.
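
One way to look at such two-feature combinations is to bin the continuous feature and pivot the turnover rate against the second feature. The sketch below uses synthetic data; the band cut points (0.4 and 0.7) and the names `toy` and `satisfaction_band` are arbitrary choices for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption: column names mirror the dataset).
rng = np.random.default_rng(7)
n = 500
toy = pd.DataFrame({
    "satisfaction_level": rng.uniform(0.09, 1.0, n),
    "salary": rng.choice(["low", "medium", "high"], n),
    "left": rng.integers(0, 2, n),
})

# Bin satisfaction into three bands; the cut points are arbitrary.
toy["satisfaction_band"] = pd.cut(
    toy["satisfaction_level"],
    bins=[0.0, 0.4, 0.7, 1.0],
    labels=["low", "mid", "high"],
)

salary_order = ["low", "medium", "high"]

# Turnover rate for every satisfaction band x salary tier combination.
rate = toy.pivot_table(
    index="satisfaction_band",
    columns="salary",
    values="left",
    aggfunc="mean",
    observed=False,
).reindex(columns=salary_order)

# Cell sizes, so that rates from sparsely populated cells can be discounted.
cell_counts = pd.crosstab(toy["satisfaction_band"], toy["salary"]).reindex(
    columns=salary_order
)

print(rate.round(2))
print(cell_counts)
```

The same pattern applies to the other pairs, for example `number_project` against binned `average_montly_hours`, or `time_spend_company` against satisfaction bands.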


### **Summary of Findings**


```python
# IMPORT LIBRARIES

# Operational
import pandas as pd
import numpy as np

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Statistics
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.proportion import proportions_ztest

# Optional for effect size (Cohen's d) and other convenience functions:
import pingouin as pg

# Formatting Summary Tables
from tabulate import tabulate

df = pd.read_csv("./../data/processed/df_subset.csv")
```

```python
df.head(10)
```

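| Unnamed: 0 | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | Department | salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | sales | low |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | sales | low |
| 5 | 0.41 | 0.5 | 2 | 153 | 3 | 0 | 1 | sales | low |
| 6 | 0.1 | 0.77 | 6 | 247 | 4 | 0 | 1 | sales | low |
| 7 | 0.92 | 0.85 | 5 | 259 | 5 | 0 | 1 | sales | low |
| 8 | 0.89 | 1.0 | 5 | 224 | 5 | 0 | 1 | sales | low |
| 9 | 0.42 | 0.53 | 2 | 142 | 3 | 0 | 1 | sales | low |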
