
#**IBM HR Analytics Employee Attrition & Performance**

##1. Introduction
Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists.

Attrition Rate: the percentage of employees leaving an organization over a certain period.
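
As a minimal sketch of the definition above, the attrition rate can be computed as the share of `Yes` values in an `Attrition` column. The helper name `attrition_rate` and the toy values are assumptions made for illustration; they are not taken from the IBM dataset.

```python
import pandas as pd

def attrition_rate(attrition: pd.Series) -> float:
    """Return the share of 'Yes' values in an Attrition column, as a percentage."""
    return attrition.eq("Yes").mean() * 100

# Toy data for illustration only (not taken from the IBM dataset)
sample = pd.Series(["Yes", "No", "No", "No", "Yes", "No", "No", "No"])
rate = attrition_rate(sample)
print(f"Attrition rate: {rate:.1f}%")
```

The same helper could be applied to `df['Attrition']` once the data is loaded.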

##1.1 Dataset Overview
**Education**: 1-Below College, 2-College, 3-Bachelor, 4-Master, 5-Doctor

**Environment Satisfaction**: 1-Low, 2-Medium, 3-High, 4-Very High

**Job Involvement**: 1-Low, 2-Medium, 3-High, 4-Very High

**Job Satisfaction**: 1-Low, 2-Medium, 3-High, 4-Very High

**Relationship Satisfaction**: 1-Low, 2-Medium, 3-High, 4-Very High

**Performance Rating**: 1-Low, 2-Good, 3-Excellent, 4-Outstanding

**Work-Life Balance**: 1-Bad, 2-Good, 3-Better, 4-Best

The dataset contains one record per employee, with demographic, job, and compensation attributes alongside an `Attrition` label.
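
The ordinal codes listed above can be mapped back to their labels when a plot or table needs readable categories. The sketch below uses a small made-up frame; the dictionary names and the `*Label` column names are hypothetical, and the column names `Education` and `JobSatisfaction` are assumed to match the dataset.

```python
import pandas as pd

# Code-to-label mappings copied from the overview above
EDUCATION_LABELS = {1: "Below College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor"}
SATISFACTION_LABELS = {1: "Low", 2: "Medium", 3: "High", 4: "Very High"}

# Toy rows for illustration only
toy = pd.DataFrame({"Education": [3, 1, 5], "JobSatisfaction": [4, 2, 1]})

# Add readable label columns next to the numeric codes
decoded = toy.assign(
    EducationLabel=toy["Education"].map(EDUCATION_LABELS),
    JobSatisfactionLabel=toy["JobSatisfaction"].map(SATISFACTION_LABELS),
)
print(decoded)
```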


##2. Data Collection

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
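import kagglehub
import os

# Download the latest version; the call returns the local dataset directory
dataset_dir = kagglehub.dataset_download("pavansubhasht/ibm-hr-analytics-attrition-dataset")

# Find the first CSV file within the downloaded directory
csv_path = None
for filename in os.listdir(dataset_dir):
    if filename.endswith(".csv"):
        csv_path = os.path.join(dataset_dir, filename)
        break

# Fail early with a clear message instead of a NameError if no CSV was found
if csv_path is None:
    raise FileNotFoundError(f"No CSV file found in {dataset_dir}")

# Read the CSV file using pandas
df = pd.read_csv(csv_path)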


In [None]:
df.head()

##3. Data Cleaning and Preprocessing

In [None]:
# Overview of the Dataset
df.info()

In [None]:

# Check for Missing Values
df.isnull().sum()

The dataset has no null values.

In [None]:
#Basic Statistics:
df.describe()

In [None]:
#check duplicate values
df.duplicated().sum()

In [None]:
# Drop irrelevant columns
irrelevant_cols = ['EmployeeNumber', 'Over18', 'EmployeeCount', 'StandardHours']
attrition_df = df.drop(columns=irrelevant_cols)

In [None]:
attrition_df.head(10)

##4. Exploratory Data Analysis (EDA)

In [None]:
# Understanding the distribution of 'Attrition'
plt.figure(figsize=(8, 6))
sns.countplot(x='Attrition', data=attrition_df, palette='viridis')
plt.title('Attrition Count')
plt.show()

In [None]:
# Outlier Analysis
# Analyzing numeric columns for potential outliers
numerical_cols = ['Age', 'DailyRate', 'DistanceFromHome', 'MonthlyIncome',
                  'NumCompaniesWorked', 'PercentSalaryHike', 'TotalWorkingYears',
                  'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
                  'YearsWithCurrManager']

In [None]:
# Boxplots for outliers
for col in numerical_cols:
    plt.figure(figsize=(10, 5))
    sns.boxplot(data=attrition_df, x=col, palette='viridis')
    plt.title(f'Boxplot of {col}')
    plt.show()

In [None]:

# List of features to check for outliers
features = ['YearsWithCurrManager', 'YearsSinceLastPromotion',
            'YearsInCurrentRole', 'YearsAtCompany',
            'TotalWorkingYears', 'MonthlyIncome']

# Function to report (not modify) outliers using the 1.5 * IQR rule
def analyze_outliers(df, features):
    for feature in features:
        # Calculate IQR
        Q1 = df[feature].quantile(0.25)
        Q3 = df[feature].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        print(f"\nAnalyzing outliers for '{feature}':")
        outliers = df[(df[feature] < lower_bound) | (df[feature] > upper_bound)]
        print(f"Number of outliers: {len(outliers)}")
        print(f"Lower bound: {lower_bound}, Upper bound: {upper_bound}")
        # Box plot to visualize the outliers
        plt.figure(figsize=(10, 6))
        sns.boxplot(x=df[feature])
        plt.title(f'Boxplot of {feature}')
        plt.show()

# Report outliers; the dataframe itself is left unchanged
analyze_outliers(attrition_df, features)

# Dataset shape before any outlier handling
print(f"Dataset shape: {attrition_df.shape}")




In [None]:

# Function to handle outliers by capping or removing them
def handle_outliers(df, column, lower_bound, upper_bound, action='remove'):
    """
    Handle outliers for a given column.
    Parameters:
    - df: The dataframe
    - column: The column to handle
    - lower_bound: The lower bound for outliers
    - upper_bound: The upper bound for outliers
    - action: 'remove' to remove outliers, 'cap' to cap the outliers
    Returns a new dataframe; the input is not modified.
    """
    if action == 'remove':
        # Keep only rows whose value lies within the bounds
        df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    elif action == 'cap':
        # Work on a copy so a filtered slice passed in by the caller is not modified
        df = df.copy()
        # Cap values to the lower and upper bounds
        df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)

    return df

# Handling outliers for 'YearsWithCurrManager'
attrition_df = handle_outliers(attrition_df, 'YearsWithCurrManager', lower_bound=-5.5, upper_bound=14.5, action='remove')

# Handling outliers for 'YearsSinceLastPromotion'
attrition_df = handle_outliers(attrition_df, 'YearsSinceLastPromotion', lower_bound=-4.5, upper_bound=7.5, action='remove')

# Handling outliers for 'YearsInCurrentRole'
attrition_df = handle_outliers(attrition_df, 'YearsInCurrentRole', lower_bound=-5.5, upper_bound=14.5, action='remove')

# Handling outliers for 'YearsAtCompany'
attrition_df = handle_outliers(attrition_df, 'YearsAtCompany', lower_bound=-6.0, upper_bound=18.0, action='remove')

# Handling outliers for 'TotalWorkingYears'
attrition_df = handle_outliers(attrition_df, 'TotalWorkingYears', lower_bound=-7.5, upper_bound=28.5, action='remove')

# Handling outliers for 'MonthlyIncome'
attrition_df = handle_outliers(attrition_df, 'MonthlyIncome', lower_bound=-5291.0, upper_bound=16581.0, action='cap')

# After handling outliers, check the shape of the dataset
print(f"Final dataset shape: {attrition_df.shape}")

# You can check the changes by inspecting any column
print(attrition_df[['YearsWithCurrManager', 'YearsSinceLastPromotion', 'YearsInCurrentRole', 'YearsAtCompany', 'TotalWorkingYears', 'MonthlyIncome']].describe())
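
The bounds passed to `handle_outliers` above are typed in by hand from the printed IQR analysis. As a hedged alternative, the sketch below derives them from the data with the same 1.5 × IQR rule. The helper name `iqr_bounds` and the toy series are assumptions for illustration only.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) outlier bounds using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values with one obvious outlier (50)
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50])
lower, upper = iqr_bounds(toy)

# Cap values outside the bounds instead of dropping the rows
capped = toy.clip(lower=lower, upper=upper)
print(lower, upper, capped.max())
```

Computing the bounds this way keeps them in sync with the data if rows are added or filtered earlier in the notebook.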


In [None]:

# Correlation Analysis
corr_matrix = attrition_df[numerical_cols].corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()


Experience Drives Tenure: Total working years correlates strongly with years at the company, years in the current role, and years with the current manager.

Income Rises with Experience: Monthly income is moderately correlated with age and total working years.

Independent Factors: Distance from home, percent salary hike, and daily rate show little correlation with the other metrics.
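
Observations like these can be read off the heatmap, but listing the strongest pairs makes them easier to check. The sketch below is one possible way to do that; the helper name `strongest_pairs` is hypothetical and the toy values are made up (only the column names come from the notebook).

```python
from itertools import combinations

import pandas as pd

def strongest_pairs(df: pd.DataFrame, top_n: int = 3) -> pd.DataFrame:
    """List feature pairs ordered by absolute Pearson correlation."""
    corr = df.corr()
    # Visit each unordered pair of columns exactly once
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "corr"])
    return pairs.sort_values("corr", key=lambda s: s.abs(), ascending=False).head(top_n)

# Toy values for illustration only
toy = pd.DataFrame({
    "TotalWorkingYears": [1, 5, 10, 15, 20, 25],
    "YearsAtCompany": [1, 4, 8, 12, 15, 20],
    "DistanceFromHome": [9, 2, 7, 1, 8, 3],
})
top = strongest_pairs(toy, top_n=1)
print(top)
```

In the notebook, the same call could be made on `attrition_df[numerical_cols]`.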








In [None]:
attrition_df.head()

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier

# Identify all categorical (object-typed) columns in the dataset
categorical_cols = attrition_df.select_dtypes(include=['object']).columns.tolist()

# Encode all categorical columns, including 'Attrition'
data_encoded = attrition_df.copy()
label_encoders = {}

for col in categorical_cols:
    le = LabelEncoder()
    data_encoded[col] = le.fit_transform(data_encoded[col])
    label_encoders[col] = le  # Keep the encoder so codes can be mapped back to labels

# Ensure all columns are numeric
non_numeric_cols = data_encoded.select_dtypes(include=['object']).columns
if not non_numeric_cols.empty:
    print("Remaining non-numeric columns:", non_numeric_cols)
else:
    print("All columns are numeric now.")

# Define features (X) and target (y)
target = 'Attrition'
X = data_encoded.drop(columns=[target])
y = data_encoded[target]

# Train the Random Forest model on the full dataset (used here only to rank features)
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

# Feature Importance Plot
importances = pd.DataFrame({'Feature': X.columns, 'Importance': rf.feature_importances_})
importances = importances.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(data=importances, x='Importance', y='Feature', palette='viridis')
plt.title('Feature Importance')
plt.show()


`RandomForestClassifier` handles mixed, multidimensional employee data well, which makes it a reasonable choice for classifying HR outcomes such as attrition. Here it is fit on the full dataset without a train/test split, so it serves only to rank features, not to estimate predictive accuracy.

Random forests expose feature importances, which help HR teams see which factors (e.g., satisfaction, work-life balance) are most predictive of outcomes like attrition or performance.
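
Feature importances are easier to trust when the model is also checked on data it has not seen. The sketch below shows one way to pair a held-out evaluation with the importance ranking. It uses synthetic stand-in data (the columns `signal`, `noise_a`, and `noise_b` are invented for illustration), so the numbers say nothing about the IBM dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data: only 'signal' determines the label, the rest is noise
X_toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y_toy = (X_toy["signal"] > 0).astype(int)

# Hold out a stratified test set so accuracy is measured on unseen rows
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
ranking = pd.Series(model.feature_importances_, index=X_toy.columns).sort_values(ascending=False)
print(f"Held-out accuracy: {test_accuracy:.2f}")
print(ranking)
```

In the notebook, `X` and `y` from the cell above could take the place of `X_toy` and `y_toy`.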



In [None]:
# Attrition by Job Role

plt.figure(figsize=(14, 8))
sns.countplot(data=attrition_df, y='JobRole', hue='Attrition', palette='viridis', order=attrition_df['JobRole'].value_counts().index)
plt.title('Attrition by Job Role')
plt.show()

In [None]:
 # Violin plot of MonthlyIncome by Attrition
plt.figure(figsize=(10, 6))
sns.violinplot(data=attrition_df, x='Attrition', y='MonthlyIncome', palette='viridis')
plt.title('Monthly Income Distribution by Attrition')
plt.show()


In [None]:
# Daily Rate Distribution by Attrition (a box plot shows the median and spread, not the mean)
plt.figure(figsize=(10, 6))
sns.boxplot(data=attrition_df, x='Attrition', y='DailyRate', palette='viridis')
plt.title('Daily Rate Distribution by Attrition')
plt.xlabel('Attrition')
plt.ylabel('Daily Rate')
plt.show()


In [None]:
#Daily Rate by Department and Attrition
plt.figure(figsize=(14, 8))
sns.boxplot(data=attrition_df, x='Department', y='DailyRate', hue='Attrition', palette='viridis')
plt.title('Daily Rate by Department and Attrition')
plt.xlabel('Department')
plt.ylabel('Daily Rate')
plt.xticks(rotation=45)
plt.legend(title='Attrition', loc='upper right')
plt.show()


In [None]:
# Attrition Distribution for OverTime Workers
# Use the viridis colormap (plt.get_cmap replaces the deprecated cm.get_cmap)
cmap = plt.get_cmap('viridis')
colors = [cmap(0.2), cmap(0.7)]  # Adjust indices for contrast

# Fix the order to No, Yes so the labels below always match the slices
overtime_attrition = (
    attrition_df[attrition_df['OverTime'] == 'Yes']['Attrition']
    .value_counts()
    .reindex(['No', 'Yes'], fill_value=0)
)

overtime_attrition.plot.pie(
    autopct='%1.1f%%',
    startangle=90,
    colors=colors,
    labels=['Stayed', 'Left'],
    title='Attrition Distribution for Overtime Workers'
)
plt.ylabel('')  # Removes the default y-axis label
plt.show()


In [None]:
# Average Distance From Home by Attrition
avg_distance = attrition_df.groupby('Attrition')['DistanceFromHome'].mean().reset_index()

sns.barplot(data=avg_distance, x='Attrition', y='DistanceFromHome', palette='viridis')
plt.title('Average Distance From Home by Attrition')
plt.xlabel('Attrition')
plt.ylabel('Average Distance From Home')
plt.show()


In [None]:
#Average Job Satisfaction by Attrition
sns.barplot(data=attrition_df, x='Attrition', y='JobSatisfaction', palette='viridis', ci=None)
plt.title('Average Job Satisfaction by Attrition')
plt.xlabel('Attrition')
plt.ylabel('Average Job Satisfaction')
plt.ylim(0, 5)  # Assuming JobSatisfaction levels range from 1 to 4
plt.show()


In [None]:
# Numerical Features (matplotlib and seaborn are already imported above)

# Features to include
numerical_features = ['YearsInCurrentRole', 'YearsWithCurrManager', 'YearsAtCompany', 'YearsSinceLastPromotion']

# Create subplots for all features
fig, axes = plt.subplots(1, len(numerical_features), figsize=(20, 6), sharey=False)

# Iterate through the features and apply different plot types
for i, feature in enumerate(numerical_features):
    ax = axes[i]  # Select the appropriate subplot

    if i == 0:
        # Violin plot for YearsInCurrentRole
        sns.violinplot(data=attrition_df, x='Attrition', y=feature, ax=ax, hue='Attrition')
        ax.set_title(f'Violin: {feature} by Attrition', fontsize=12)

    elif i == 1:
        # Bar plot for YearsWithCurrManager
        sns.barplot(data=attrition_df, x='Attrition', y=feature, ci=None, palette='muted', ax=ax)
        ax.set_title(f'Bar: {feature} by Attrition', fontsize=12)

    elif i == 2:
        # Box plot for YearsAtCompany
        sns.boxplot(data=attrition_df, x='Attrition', y=feature, palette='viridis', ax=ax)
        ax.set_title(f'Box: {feature} by Attrition', fontsize=12)

    elif i == 3:
        # Simple bar chart for YearsSinceLastPromotion
        sns.barplot(data=attrition_df, x='Attrition', y=feature, ci=None, palette='muted', ax=ax)
        ax.set_title(f'Bar: {feature} by Attrition', fontsize=12)

# Adjust layout for readability
plt.suptitle('Mixed Plots for Years Features by Attrition', fontsize=16, y=1.05)
plt.tight_layout()
plt.show()


In [None]:
# Categorical Features (matplotlib and seaborn are already imported above)

# Define features
categorical_features = ['NumCompaniesWorked', 'EnvironmentSatisfaction', 'StockOptionLevel', 'PercentSalaryHike']

# Create subplots
fig, axes = plt.subplots(1, len(categorical_features), figsize=(24, 6), sharey=False)

# Plot for NoOfCompaniesWorked (Horizontal Bar Chart)
ax = axes[0]
no_of_companies_counts = attrition_df['NumCompaniesWorked'].value_counts()
sns.barplot(
    y=no_of_companies_counts.index,
    x=no_of_companies_counts.values,
    palette='viridis',
    ax=ax
)
ax.set_title('No. of Companies Worked (Bar Chart)', fontsize=12)
ax.set_xlabel('Count')
ax.set_ylabel('No. of Companies Worked')


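# Plot for EnvironmentSatisfaction (attrition rate per satisfaction level)
ax = axes[1]
# 'Attrition' holds 'Yes'/'No' strings, so convert to a boolean before averaging
env_attrition_rate = (
    attrition_df['Attrition'].eq('Yes')
    .groupby(attrition_df['EnvironmentSatisfaction'])
    .mean()
)
sns.barplot(
    x=env_attrition_rate.index,
    y=env_attrition_rate.values,
    palette='coolwarm',
    ax=ax
)
ax.set_title('Environment Satisfaction vs Attrition', fontsize=12)
ax.set_xlabel('Environment Satisfaction')
ax.set_ylabel('Attrition Rate')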

# Plot for StockOptionLevel (Stacked Bar Plot)
ax = axes[2]
stock_option_counts = attrition_df.groupby(['Attrition', 'StockOptionLevel']).size().unstack(fill_value=0)
# One color per stock option level (a two-color list would repeat across the levels)
stock_option_counts.plot(kind='bar', stacked=True, ax=ax, colormap='viridis', edgecolor='black')
ax.set_title('Stock Option Level by Attrition', fontsize=12)
ax.set_xlabel('Attrition')
ax.set_ylabel('Count')
ax.legend(title='Stock Option Level')

# Plot for PercentSalaryHike (Boxen Plot)
ax = axes[3]
sns.boxenplot(
    data=attrition_df,
    x='Attrition',
    y='PercentSalaryHike',
    palette='pastel',
    ax=ax
)
ax.set_title('Percent Salary Hike by Attrition (Boxen Plot)', fontsize=12)
ax.set_xlabel('Attrition')
ax.set_ylabel('Percent Salary Hike')

# Adjust layout
plt.suptitle('Visualization of Selected Features by Attrition', fontsize=16, y=1.05)
plt.tight_layout()
plt.show()


In [None]:
# Monthly Income vs Job Role for Attrition (Bar Plot)
plt.figure(figsize=(14, 8))
sns.barplot(data=attrition_df, x='MonthlyIncome', y='JobRole', hue='Attrition', ci=None, palette='viridis')
plt.title('Monthly Income vs. Job Role for Attrition')
plt.xticks(rotation=70)
plt.show()

In [None]:
import plotly.express as px

fig = px.scatter(attrition_df, x='Age', y='MonthlyIncome', color='Attrition', hover_data=['JobRole'])
fig.show()




Higher Attrition in Younger and Low-Income Groups: Younger employees (20–30 years) and those earning below 5,000 show the highest attrition rates.

Income Correlation with Retention: Higher-income employees (>10,000) have significantly lower attrition, highlighting income as a key retention factor.

Age-Income Trend: Monthly income increases with age, but attrition remains concentrated among low-income employees across all age groups.
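
Readings from a scatter plot can be backed with numbers by binning income and computing the attrition rate per band. The sketch below uses the 5,000 and 10,000 thresholds mentioned above; the band labels and the toy rows are assumptions for illustration and do not reflect the real dataset.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "MonthlyIncome": [2500, 3200, 4800, 6100, 7500, 9900, 12000, 15000],
    "Attrition": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"],
})

# Right-inclusive bins: (0, 5000], (5000, 10000], (10000, inf)
bands = pd.cut(
    toy["MonthlyIncome"],
    bins=[0, 5000, 10000, np.inf],
    labels=["up to 5k", "5k-10k", "above 10k"],
)

# Percentage of employees who left within each income band
rate_by_band = toy["Attrition"].eq("Yes").groupby(bands, observed=False).mean() * 100
print(rate_by_band.round(1))
```

The same pattern would work for age bands by swapping in `Age` and suitable bin edges.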


In [None]:

# Monthly Income and Years at Company (Line Plot)
plt.figure(figsize=(12, 6))
sns.lineplot(data=attrition_df, x='YearsAtCompany', y='MonthlyIncome', hue='Attrition', palette='viridis', markers=True, ci=None)
plt.title('Monthly Income vs. Years at Company by Attrition')
plt.show()


Employees with fewer years at the company (0–5 years) and lower income exhibit higher attrition rates compared to long-tenured employees.

Monthly income steadily increases with years at the company, reflecting the impact of tenure on compensation.

Employees with more than 10 years at the company and higher incomes demonstrate significantly lower attrition rates.

In [None]:
attrition_df.columns

In [None]:
g = sns.FacetGrid(data=attrition_df, col="Department", hue="Attrition", height=5, aspect=1)
g.map(sns.scatterplot, "YearsAtCompany", "MonthlyIncome", alpha=0.7)
g.add_legend()


Referring back to the daily-rate box plots: employees with a higher DailyRate tend to have lower attrition, and the DailyRate distribution differs across departments.

Employees with lower monthly income are more likely to leave, regardless of their job satisfaction level; the heatmap in the next cell breaks this down.

In [None]:
# Pivot table to calculate average monthly income
heatmap_data = attrition_df.pivot_table(values='MonthlyIncome', index='JobSatisfaction', columns='Attrition', aggfunc='mean')

sns.heatmap(heatmap_data, annot=True, fmt='.0f', cmap='viridis', cbar=True)
plt.title('Average Monthly Income by Job Satisfaction and Attrition')
plt.xlabel('Attrition')
plt.ylabel('Job Satisfaction')
plt.show()


##Summary of Attrition Analysis
- Employees working overtime leave more often; flexible schedules may help.
- Lower job satisfaction is associated with higher turnover; better engagement may reduce it.
- Longer commutes are associated with more departures; remote work could help.
- Shorter tenure or unstable roles go with more quitting; career growth support is key.
- A lack of promotions is associated with attrition; regular growth opportunities are important.
- Employees with many past jobs may need tailored retention plans.
- Low environment satisfaction is associated with more quitting; improving workplace conditions may boost retention.
- Limited stock options are associated with more attrition; better benefits may improve loyalty.
- Small salary hikes are associated with leaving; regular raises matter.
- Overall, focus on work-life balance, engagement, flexibility, career growth, pay, and benefits.
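
Each point in this summary can be checked against the data with a normalized cross-tabulation, which gives the share of leavers within each group. The sketch below does this for overtime on made-up rows; the values are for illustration only, and the column names `OverTime` and `Attrition` follow the notebook.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"],
    "Attrition": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"],
})

# normalize="index" turns counts into row-wise shares, so each row sums to 1
rates = pd.crosstab(toy["OverTime"], toy["Attrition"], normalize="index")
print(rates)
```

Replacing `OverTime` with another categorical column, such as `StockOptionLevel`, gives the matching breakdown for that factor.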







