# Salifort Motors: Predicting Employee Attrition
### **Project:** Capstone Project for the Google Advanced Data Analytics Certificate

## 1. Plan

### Business Understanding and Project Goal

This project serves as the capstone for the Google Advanced Data Analytics Certificate. It applies the skills learned throughout the course to a real-world business scenario at Salifort Motors, a fictional alternative energy vehicle manufacturer.

Salifort Motors faces a significant challenge with employee retention. The senior leadership team has tasked the data analytics department with analyzing recent employee data to understand the key drivers of attrition.

The primary goal of this project is to construct a predictive model that can identify employees who are likely to leave the company. This model will provide the Human Resources (HR) department with actionable insights to develop targeted retention strategies, ultimately reducing turnover costs and preserving valuable institutional knowledge.

This notebook documents the entire process, following the **PACE (Plan, Analyze, Construct, Execute)** framework, a key methodology taught in the Google program.

## 2. Analyze
In this phase, we will perform a thorough Exploratory Data Analysis (EDA) to understand the dataset, identify key patterns, and formulate hypotheses about the drivers of employee attrition.

### 2.1. Data Loading and Initial Inspection
The first step is to load the dataset and inspect it: its structure, its data types, and any immediate data quality issues such as missing values or duplicate rows.

In [None]:
# Importing necessary libraries and modeling tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
# Load the dataset
df = pd.read_csv('HR_comma_sep.csv')

In [None]:
# Display the first ten rows to understand the features
df.head(10)

In [None]:
# Get a summary of the data structure and types
df.info()

In [None]:
# Get descriptive statistics about the data
df.describe()

In [None]:
# Display all column names
df.columns

In [None]:
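# Standardize column names: lowercase everything, fix the 'montly' misspelling,
# and shorten 'time_spend_company' to 'tenure'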
df = df.rename(columns={'Work_accident': 'work_accident',
                          'average_montly_hours': 'average_monthly_hours',
                          'time_spend_company': 'tenure',
                          'Department': 'department'})

# Display all column names after the update
df.columns
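
The same renaming can be done programmatically, which may be easier to maintain when many columns are inconsistently cased. The sketch below is one possible approach, not the method used in this notebook: `standardize_columns` is a hypothetical helper, and the toy DataFrame only mimics the original column names.

```python
import pandas as pd

def standardize_columns(frame, corrections=None):
    """Return a copy with lowercase column names, then apply explicit renames.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    corrections = corrections or {}
    lowered = frame.rename(columns=str.lower)   # a callable is applied to every column name
    return lowered.rename(columns=corrections)  # a dict handles the remaining special cases

# Toy frame with made-up values that mimics the raw column names
toy = pd.DataFrame({
    'Work_accident': [0, 1],
    'average_montly_hours': [157, 262],
    'time_spend_company': [3, 6],
    'Department': ['sales', 'hr'],
})

clean = standardize_columns(
    toy,
    corrections={
        'average_montly_hours': 'average_monthly_hours',
        'time_spend_company': 'tenure',
    },
)
print(list(clean.columns))
```

Lowercasing first keeps the explicit mapping short, since only the misspelled and renamed columns need an entry.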

In [None]:
# Check for missing values
print("Missing Values Check:")
print(df.isnull().sum())
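
When a dataset does contain gaps, raw counts are easier to interpret alongside percentages. The following is a minimal sketch on a toy DataFrame; the column values are made up for illustration and say nothing about the actual Salifort data.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (illustrative values only)
toy = pd.DataFrame({
    'satisfaction_level': [0.38, np.nan, 0.11, 0.72],
    'salary': ['low', 'medium', None, 'low'],
    'left': [1, 0, 1, 0],
})

# isnull().sum() counts missing cells per column;
# isnull().mean() gives the same information as a fraction of rows
missing = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'pct_missing': toy.isnull().mean() * 100,
})

# Show only the columns that actually have gaps
print(missing[missing['n_missing'] > 0])
```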

In [None]:
# Check for and handle duplicate records
print(f"Initial number of duplicate rows: {df.duplicated().sum()}")
df = df.drop_duplicates(keep='first')
# Display the first ten rows of the new DataFrame if needed
df.head(10)

In [None]:
# Show the number of rows and columns in the new DataFrame before moving on to visualization
print(f"Data shape after removing duplicates: {df.shape}")
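
Before dropping duplicates, it can help to quantify how much of the data they represent. The sketch below illustrates how `duplicated()` and `drop_duplicates(keep='first')` behave on a small toy DataFrame with made-up values; it is not a result from the real dataset.

```python
import pandas as pd

# Toy frame in which the third row repeats the first (illustrative values only)
toy = pd.DataFrame({
    'satisfaction_level': [0.38, 0.80, 0.38, 0.11],
    'number_project': [2, 5, 2, 7],
    'left': [1, 0, 1, 1],
})

# duplicated() flags every repeat after the first occurrence of a row
dup_mask = toy.duplicated()
n_dupes = int(dup_mask.sum())
dup_share = n_dupes / len(toy)

# keep='first' retains the first occurrence and discards later copies
deduped = toy.drop_duplicates(keep='first').reset_index(drop=True)

print(f"{n_dupes} duplicate row(s), {dup_share:.0%} of the data")
print(f"Shape after removing duplicates: {deduped.shape}")
```

Resetting the index is optional; it simply avoids gaps in the row labels after rows are removed.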

### 2.2. Exploratory Data Analysis (EDA)
With a clean dataset, we can now explore the relationships between different variables and employee attrition. We will visualize these relationships to draw initial conclusions.

Hypothesis 1: Employee satisfaction is a key factor in attrition.

In [None]:
# Analyze the distribution of satisfaction_level for both groups
plt.figure(figsize=(10, 6))
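ax = sns.histplot(data=df, x='satisfaction_level', hue='left', multiple='dodge', shrink=0.8)
plt.title('Distribution of Satisfaction Level by Employee Status')
plt.xlabel('Satisfaction Level')
plt.ylabel('Number of Employees')
# Relabel seaborn's own legend so the text stays matched to the hue colors:
# entries follow the hue order (0, 1), i.e. 0 = stayed, 1 = left
legend = ax.get_legend()
legend.set_title('Status')
for text, label in zip(legend.get_texts(), ['Stayed', 'Left']):
    text.set_text(label)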
plt.show()

Observation: The histogram shows that many of the employees who left reported a low satisfaction level (below 0.5). This is consistent with our hypothesis that dissatisfaction is a major driver of attrition.
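
A visual impression like this can be backed by a numeric summary. The sketch below is one way to do that, assuming the same `satisfaction_level` and `left` columns; the toy values are invented for illustration and the 0.5 cutoff is simply the threshold mentioned in the observation.

```python
import pandas as pd

# Toy data with made-up values; 'left' is 1 for employees who left, 0 otherwise
toy = pd.DataFrame({
    'satisfaction_level': [0.10, 0.40, 0.45, 0.90, 0.60, 0.70, 0.80, 0.30],
    'left':               [1,    1,    1,    1,    0,    0,    0,    0],
})

status_names = {0: 'Stayed', 1: 'Left'}

# Mean and median satisfaction per group
summary = toy.groupby('left')['satisfaction_level'].agg(['mean', 'median'])
summary.index = summary.index.map(status_names)

# Share of each group below the 0.5 cutoff: the mean of a boolean column is a proportion
low_share = (
    toy.assign(low=toy['satisfaction_level'] < 0.5)
       .groupby('left')['low']
       .mean()
)
low_share.index = low_share.index.map(status_names)

print(summary)
print(low_share)
```

Comparing the share below the cutoff in each group makes the claim checkable, because it shows whether low satisfaction is actually more common among leavers than among those who stayed.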

Hypothesis 2: Workload, measured by average_monthly_hours and number_project, impacts an employee's decision to leave.

In [None]:
# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot for average_monthly_hours
sns.boxplot(ax=axes[0], data=df, x='left', y='average_monthly_hours')
axes[0].set_title('Average Monthly Hours vs. Attrition')
axes[0].set_xticklabels(['Stayed', 'Left'])

# Plot for number_project
sns.countplot(ax=axes[1], data=df, x='number_project', hue='left')
axes[1].set_title('Number of Projects vs. Attrition')

plt.tight_layout()
plt.show()
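
The two plots can be complemented with a small numeric summary. The following sketch assumes the renamed columns `average_monthly_hours`, `number_project`, and `left`; the toy values are invented for illustration and do not reflect the real dataset.

```python
import pandas as pd

# Toy data with made-up values; 'left' is 1 for employees who left, 0 otherwise
toy = pd.DataFrame({
    'average_monthly_hours': [150, 160, 280, 290, 200, 210, 130, 300],
    'number_project':        [3,   4,   6,   7,   4,   3,   2,   7],
    'left':                  [0,   0,   1,   1,   0,   0,   1,   1],
})

# Median monthly hours per group (the median is the center line of a boxplot)
hours_by_status = toy.groupby('left')['average_monthly_hours'].median()

# Row-normalized crosstab: each row sums to 1, so column 1 is the share of
# employees with that project count who left
attrition_by_projects = pd.crosstab(
    toy['number_project'], toy['left'], normalize='index'
)

print(hours_by_status)
print(attrition_by_projects)
```

Normalizing by row matters because a raw count plot can hide a high attrition rate in project counts that only a few employees have.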