# Employee Attrition Analysis

## Objective
The goal of this analysis is to understand the key factors that contribute to employee attrition and to build a predictive model that identifies employees at higher risk of leaving. The insights from this analysis are intended to help the HR team improve employee retention strategies.

## Power BI Dashboard – Attrition Analysis

![Power BI Attrition Dashboard](https://github.com/franklincastelino95/Data_Analysis_Portfolio/edit/main/images/powerbi_attrition_dashboard.png)

### View interactive dashboard 🔗 [Click here](https://app.powerbi.com/groups/me/reports/26d0c0e1-b25a-4ece-99a2-83b2f24bc1de/424c8823459eb66f4017?experience=power-bi)
----------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

## Data Overview
The dataset contains employee-level information related to compensation, work experience, and job tenure. Key features used in the analysis include age, monthly income, total working years, years in the current role, overtime status, and gender. The target variable is Attrition, which indicates whether an employee left the organization.

In [None]:
df = pd.read_csv("Employee_attrition.csv")
num_cols = ['Age', 'MonthlyIncome', 'TotalWorkingYears','YearsInCurrentRole', 'YearsSinceLastPromotion','YearsWithCurrManager', 'Education', 'StockOptionLevel']
df['Attrition_encoded'] = df['Attrition'].map({'Yes':1, 'No':0})
df['OverTime_encoded'] = df['OverTime'].map({'Yes':1, 'No':0})
df['Gender_encoded'] = df['Gender'].map({'Male':1, 'Female':0})

## Exploratory Data Analysis
Exploratory analysis was performed to identify patterns and relationships between employee attributes and attrition. Correlation analysis showed that tenure-related variables, such as years in the current role and total working years, have stronger relationships with attrition compared to other features. The Power BI dashboard further highlights that attrition is higher among employees with fewer working years, those early in their roles, and employees working overtime.

In [None]:
#  Correlation between variables
df_encoded = df[num_cols + ['Attrition_encoded', 'OverTime_encoded', 'Gender_encoded']]
df_matrix = df_encoded.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(
    df_matrix,
    cmap="coolwarm",
    center=0,
    square=True
)
plt.title("Correlation Heatmap")
plt.show()
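
The overtime pattern described above can also be checked numerically. The sketch below is only illustrative: `toy` is a small hypothetical frame standing in for the encoded columns built earlier, not the real dataset. Because `Attrition_encoded` is 0/1, its mean within each group is that group's attrition rate.

```python
import pandas as pd

# Hypothetical toy data standing in for the encoded columns built earlier
toy = pd.DataFrame({
    "OverTime_encoded":  [1, 1, 1, 0, 0, 0, 0, 1],
    "Attrition_encoded": [1, 1, 0, 0, 0, 1, 0, 1],
})

# The mean of a 0/1 column within each group is that group's attrition rate
attrition_by_overtime = toy.groupby("OverTime_encoded")["Attrition_encoded"].mean()
print(attrition_by_overtime)
```

On the real data, the same `groupby` could be applied to `df` with any of the encoded or binned columns as the grouping key.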

## Filtering variable pairs with high correlation

In [None]:
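# Flatten the correlation matrix into (variable, variable) pairs
df_encoded_filtered = df_encoded.corr()
corr_pairs = df_encoded_filtered.unstack()
# Drop self-correlations (values equal to 1 on the diagonal)
corr_pairs = corr_pairs[corr_pairs != 1]
# Keep pairs whose absolute correlation exceeds 0.4
high_corr = corr_pairs[corr_pairs.abs() > 0.4]
print(high_corr)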
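
Unstacking a full correlation matrix lists every pair twice, once as (A, B) and once as (B, A). One possible way to keep each pair only once is to mask everything except the strict upper triangle before filtering. This is a sketch under stated assumptions: `toy` and its columns `a`, `b`, `c` are hypothetical, and in the notebook the same steps would be applied to `df_encoded`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the notebook this would be df_encoded
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # built to correlate strongly with a
    "c": rng.normal(size=200),                  # independent noise
})

corr = toy.corr()
# Strict upper triangle: drops the diagonal and the mirrored lower half
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
unique_pairs = corr.where(mask).stack().dropna()
high_corr_unique = unique_pairs[unique_pairs.abs() > 0.4]
print(high_corr_unique)
```

Masking also removes the diagonal by position, so it does not rely on self-correlations being exactly equal to 1.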

In [None]:
high_corr_df = high_corr.reset_index()
high_corr_df.columns = ['Var1', 'Var2', 'Corr']

features = pd.unique(high_corr_df[['Var1', 'Var2']].values.ravel())

In [None]:
# Keep only the variables that appear in a high-correlation pair
# (df_encoded is used here; the original referenced an undefined corr_df)
final_df_encoded = df_encoded[features].assign(
    Attrition=df_encoded['Attrition_encoded'],
    Employee_Count=1
)

# Save relevant features for modeling
final_df_encoded.to_csv("final_model_dataset.csv", index=False)

# Build dataset for dashboard in Power BI
dashboard_dataset = df[num_cols + ['Attrition_encoded', 'OverTime_encoded', 'Gender_encoded']]

## Model Training
A logistic regression model was used to predict employee attrition. This model was selected because it is easy to interpret and suitable for binary classification problems. Categorical variables such as gender and overtime were encoded, and the data was split into training and testing sets to evaluate model performance on unseen data.

In [None]:
features = [
    'Age',
    'TotalWorkingYears',
    'YearsInCurrentRole',
    'OverTime_encoded',
]
target = 'Attrition'
df = df[features + [target]]

In [None]:


X = df.drop('Attrition', axis=1)
y = df['Attrition']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
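
Attrition datasets are often imbalanced, so a plain random split can leave the test set with a different share of leavers than the training set. A minimal sketch of a stratified split follows, assuming a hypothetical 20% positive rate; `X_toy` and `y_toy` are illustrative arrays, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 leavers (1) and 80 stayers (0)
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```

In the notebook, the equivalent change would be passing `stratify=y` to the existing `train_test_split` call.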


In [None]:

model = LogisticRegression(
    max_iter=1000,
    class_weight="balanced"
)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

## Results & Interpretation
The model achieved moderate overall accuracy; more importantly, it identified a majority of the employees who left, which corresponds to recall on the attrition class. Tenure-related features showed the strongest influence on attrition risk: their negative coefficients indicate that employees who stay longer in their roles are less likely to leave. This is consistent with early-stage employees facing a higher risk of attrition.
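
The share of leavers the model catches is the recall of the attrition class. The sketch below shows how it can be read from predictions; the labels in `y_true_toy` and `y_pred_toy` are hypothetical and do not reflect the model's actual results. If the labels are the strings `'Yes'`/`'No'` rather than 1/0, `pos_label='Yes'` would be needed instead.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels and predictions (1 = left, 0 = stayed)
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Recall for the positive class: caught leavers / all actual leavers
recall_leavers = recall_score(y_true_toy, y_pred_toy, pos_label=1)

# For binary 0/1 labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(recall_leavers, tp, fn)
```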


In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


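# Positive coefficients raise the predicted log-odds of attrition; negative ones lower it.
# Features are unscaled, so coefficient magnitudes are not directly comparable.
importance = pd.DataFrame({
    'Feature': X.columns,
    'Impact': model.coef_[0]
}).sort_values(by='Impact', ascending=False)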

print(importance)
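
Coefficients are easier to compare when the features share a scale, and exponentiating a coefficient gives an odds ratio (below 1 means lower odds of leaving). The following is a sketch on synthetic data, not the notebook's pipeline: `toy_X`, `toy_y`, and the effect sizes used to generate the labels are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
toy_X = pd.DataFrame({
    "YearsInCurrentRole": rng.integers(0, 15, size=n),
    "OverTime_encoded": rng.integers(0, 2, size=n),
})
# Synthetic labels: leaving made less likely with tenure, more likely with overtime
logit = 0.5 - 0.3 * toy_X["YearsInCurrentRole"] + 1.0 * toy_X["OverTime_encoded"]
toy_y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize features so coefficients are per standard deviation
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(toy_X, toy_y)

coefs = pipe.named_steps["logisticregression"].coef_[0]
summary = pd.DataFrame({
    "Feature": toy_X.columns,
    "Coefficient": coefs,
    "OddsRatio": np.exp(coefs),  # multiplicative change in odds per 1 SD
})
print(summary)
```

With standardized inputs, each odds ratio describes the change in odds for a one standard deviation increase in that feature, which makes the rows of the table comparable.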



## Conclusion
The analysis shows that employee attrition is strongly associated with job tenure and early role experience. Employees who are new to their roles or have fewer working years are more likely to leave, which highlights the importance of effective onboarding and early engagement. By combining Power BI visual analytics with machine learning, this project demonstrates how HR data can be turned into actionable insights that support employee retention and workforce planning.