# Task
Analyze employee turnover by understanding the problem statement from `/content/1739525570_employee_turnover_problem_statement.docx`, extracting objectives, unzipping and loading the HR data from `/content/1739525599_hr_comma_sep.zip`, and outlining a plan for exploratory data analysis, data cleaning, and feature engineering.

## Read Problem Statement Document

### Subtask:
Load and read the content from the `/content/1739525570_employee_turnover_problem_statement.docx` file to understand the context of the employee turnover problem.


**Reasoning**:
The first step is to install the `python-docx` library, which is necessary to read the `.docx` file specified in the subtask.



In [1]:
pip install python-docx
print("python-docx installed successfully")

SyntaxError: invalid syntax (ipython-input-216443323.py, line 1)

**Reasoning**:
The previous command failed because `pip install` needs to be prefixed with `!` to be executed as a shell command within a Colab notebook.



In [2]:
!pip install python-docx
print("python-docx installed successfully")

Collecting python-docx
  Downloading python_docx-1.2.0-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.2.0-py3-none-any.whl (252 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 253.0/253.0 kB 6.6 MB/s eta 0:00:00
Installing collected packages: python-docx
Successfully installed python-docx-1.2.0
python-docx installed successfully
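
As a hedged aside, a notebook can check whether a package is already importable before installing it. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name introduced for illustration. Note that the `python-docx` distribution is imported under the module name `docx`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# The python-docx distribution is imported as `docx`.
if not is_installed("docx"):
    print("python-docx is missing; run `!pip install python-docx` in a Colab cell")
```

A check like this only reports whether the module can be found; it does not verify the installed version.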


**Reasoning**:
Now that the `python-docx` library is installed, I will import the `Document` class, load the specified `.docx` file, extract all the text content by iterating through its paragraphs, and then print the extracted text to fulfill the subtask's requirements.



In [3]:
from docx import Document

# Path to the problem statement uploaded to the Colab session
doc_path = '/content/1739525570_employee_turnover_problem_statement.docx'
document = Document(doc_path)

# Collect the text of every body paragraph, then join it into one string
problem_statement_text = [para.text for para in document.paragraphs]

full_text = '\n'.join(problem_statement_text)
print(full_text)

Machine Learning
Course-End Project Problem Statement

Employee Turnover Analytics

Project Statement:
Portobello Tech is an app innovator who has devised an intelligent way of predicting employee turnover within the company. It periodically evaluates employees' work details, including the number of projects they worked on, average monthly working hours, time spent in the company, promotions in the last five years, and salary level.
Data from prior evaluations shows the employees’ satisfaction in the workplace. The data could be used to identify patterns in work style and their interest in continuing to work for the company. 
The HR Department owns the data and uses it to predict employee turnover. Employee turnover refers to the total number of workers who leave a company over time.
As the ML Developer assigned to the HR Department, you have been asked to create ML programs to:
Perform data quality checks by checking for missing values, if any.
Understand
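
A `.docx` file like this one can contain runs of empty paragraphs, which show up as blank lines in the printed text. The following is a minimal sketch of filtering them out before joining; `drop_blank_paragraphs` is a hypothetical helper, and the `sample` list is a stand-in for `[para.text for para in document.paragraphs]`.

```python
def drop_blank_paragraphs(paragraph_texts):
    """Keep only paragraphs that contain non-whitespace text."""
    return [text.strip() for text in paragraph_texts if text.strip()]


# Stand-in for the list of paragraph texts read from the document.
sample = [
    "",
    "Machine Learning",
    "Course-End Project Problem Statement",
    "",
    "   ",
    "Employee Turnover Analytics",
]

cleaned = drop_blank_paragraphs(sample)
print("\n".join(cleaned))
```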

## Summarize Objectives and Requirements

### Subtask:
Extract and summarize the key objectives, requirements, and scope of the employee turnover problem as defined in the document.


### Summary of Objectives and Requirements

The main goal of this project for Portobello Tech's HR department is to **predict employee turnover within the company by creating Machine Learning programs**.

**Specific Tasks and Requirements:**

1.  **Data Quality Checks**: Perform checks for missing values.
2.  **Exploratory Data Analysis (EDA)**:
    *   Understand factors contributing to employee turnover.
    *   Draw a heatmap of the correlation matrix for numerical features.
    *   Draw distribution plots for `satisfaction_level`, `last_evaluation`, and `average_montly_hours`.
    *   Draw a bar plot for `number_project` comparing employees who left and stayed.
3.  **Clustering**: Perform K-means clustering (3 clusters) on employees who left, based on `satisfaction_level` and `last_evaluation`.
4.  **Handle Class Imbalance**: Address the imbalance in the 'left' class using the SMOTE technique.
    *   Pre-process data by converting categorical columns to numerical using `get_dummies()`.
    *   Perform a stratified split of the dataset (80:20 train-test ratio, `random_state=123`).
    *   Upsample the training dataset using SMOTE.
5.  **Model Training and Evaluation (5-fold Cross-Validation)**:
    *   Train a Logistic Regression model.
    *   Train a Random Forest Classifier model.
    *   Train a Gradient Boosting Classifier model.
    *   Plot classification reports for each model.
6.  **Identify Best Model**: Determine the best model and justify evaluation metrics.
    *   Find and plot ROC/AUC curves for each model.
    *   Find confusion matrices for each model.
    *   Explain the appropriate metric (Recall or Precision) from the confusion matrix.
7.  **Suggest Retention Strategies**: Using the best model, predict employee turnover probability and categorize employees into four zones (Safe, Low-Risk, Medium-Risk, High-Risk) to suggest targeted retention strategies.
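
The zone assignment in item 7 can be sketched as a simple lookup on the predicted probability. The cut-offs below (0.20, 0.60, 0.90) are assumptions chosen for illustration; the extracted text does not state them, so they should be replaced with whatever the full problem statement specifies. `risk_zone` is a hypothetical helper name.

```python
from bisect import bisect_right

# Hypothetical cut-offs (assumption, not taken from the extracted text).
# Resulting intervals: [0, 0.20) Safe, [0.20, 0.60) Low-Risk,
# [0.60, 0.90) Medium-Risk, [0.90, 1.0] High-Risk.
ZONE_UPPER_BOUNDS = [0.20, 0.60, 0.90]
ZONE_LABELS = ["Safe", "Low-Risk", "Medium-Risk", "High-Risk"]


def risk_zone(probability: float) -> str:
    """Map a turnover probability in [0, 1] to one of the four zones."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return ZONE_LABELS[bisect_right(ZONE_UPPER_BOUNDS, probability)]


for p in (0.05, 0.35, 0.75, 0.95):
    print(p, risk_zone(p))
```

In practice the probabilities would come from the chosen model's predicted probability for the `left` class.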

**Scope of the Project:**
The project will utilize employee work details including `number_project`, `average_montly_hours`, `time_spend_company`, `promotion_last_5years`, `salary` level, employee `satisfaction_level`, and `last_evaluation`. The objective is to identify patterns in work style and employee interest in continuing with the company, using data provided by the HR Department to predict and address employee turnover.

## Unzip HR Data

### Subtask:
Unzip the `/content/1739525599_hr_comma_sep.zip` file to access the HR data, which will be used for subsequent analysis.


**Reasoning**:
To unzip the file, I will use the `zipfile` module. The code imports `zipfile`, specifies the path to the zip file, opens it as a `ZipFile` object inside a `with` block (which closes the archive automatically), and extracts all of its contents to the current directory.



In [4]:
import zipfile

zip_file_path = '/content/1739525599_hr_comma_sep.zip'

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall('./')

print(f"'{zip_file_path}' unzipped successfully to './'")

'/content/1739525599_hr_comma_sep.zip' unzipped successfully to './'
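
Before extracting, it can help to list what an archive contains and confirm it is not corrupted. The sketch below builds a tiny in-memory zip as a stand-in for the real file (the CSV content inside it is placeholder text, not the HR data) and uses only standard-library calls.

```python
import io
import tempfile
import zipfile
from pathlib import Path

# Build a tiny in-memory archive as a stand-in for the real zip file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("HR_comma_sep.csv", "satisfaction_level,left\n0.38,1\n")

buffer.seek(0)
with zipfile.ZipFile(buffer, "r") as zf:
    names = zf.namelist()      # inspect member names before extracting
    bad_member = zf.testzip()  # None when every member passes its CRC check
    with tempfile.TemporaryDirectory() as tmp_dir:
        zf.extractall(tmp_dir)
        extracted = sorted(p.name for p in Path(tmp_dir).iterdir())

print(names, bad_member, extracted)
```

With the real file, the same calls would be made on `zipfile.ZipFile(zip_file_path, 'r')` instead of the in-memory buffer.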


## Load HR Data

### Subtask:
Load the unzipped HR data into a pandas DataFrame, preparing it for analysis related to employee turnover.


**Reasoning**:
First, I'll import the pandas library, which is essential for data manipulation and loading the CSV file into a DataFrame. Then, I will load the 'HR_comma_sep.csv' file into a DataFrame named `df` to make the data accessible for analysis. Finally, displaying the first five rows of the DataFrame will allow me to quickly verify that the data has been loaded correctly and to get a preliminary look at its structure.



In [5]:
import pandas as pd

df = pd.read_csv('HR_comma_sep.csv')

print("HR data loaded successfully. Displaying the first 5 rows:")
df.head()

HR data loaded successfully. Displaying the first 5 rows:


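|   | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |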
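
Beyond eyeballing the first rows, a quick schema check can confirm that the expected columns are present after loading. This is a minimal sketch: the CSV text reuses three rows from the preview above as a stand-in for `HR_comma_sep.csv`, and `expected_columns` is an assumed list based on that preview.

```python
import io

import pandas as pd

# Three rows from the preview, used as a stand-in for HR_comma_sep.csv.
preview_csv = """satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,sales,salary
0.38,0.53,2,157,3,0,1,0,sales,low
0.8,0.86,5,262,6,0,1,0,sales,medium
0.11,0.88,7,272,4,0,1,0,sales,medium
"""
preview_df = pd.read_csv(io.StringIO(preview_csv))

# Columns the later analysis relies on (assumed from the preview).
expected_columns = {
    "satisfaction_level", "last_evaluation", "number_project",
    "average_montly_hours", "time_spend_company", "Work_accident",
    "left", "promotion_last_5years", "sales", "salary",
}
missing_columns = expected_columns - set(preview_df.columns)
if missing_columns:
    raise ValueError(f"Missing expected columns: {sorted(missing_columns)}")

print(preview_df.dtypes)
```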


## Outline Subsequent Analysis Steps

### Subtask:
Based on the problem statement and the available HR data, outline a detailed plan for initial data analysis, including exploratory data analysis (EDA) to identify potential factors contributing to employee turnover, data cleaning, and feature engineering considerations.


### Detailed Plan for Initial Data Analysis

#### 1. Initial Data Inspection:
*   **Inspect DataFrame Dimensions**: Use `df.shape` to understand the number of rows and columns in the dataset.
*   **Review Data Types and Non-Null Values**: Use `df.info()` to get a summary of the DataFrame, including data types of each column and the count of non-null values, which is crucial for identifying potential data quality issues.
*   **Generate Descriptive Statistics**: Apply `df.describe()` to obtain statistical summaries (count, mean, std, min, 25%, 50%, 75%, max) for all numerical columns, providing insights into their distribution and potential outliers.
*   **Check for Missing Values**: Use `df.isnull().sum()` to calculate the total number of missing values for each column. This will directly address the problem statement's requirement to 'Perform data quality checks by checking for missing values, if any.'
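
The four inspection steps above can be sketched as follows. To keep the example self-contained it rebuilds the five preview rows as `sample_df`; in the notebook the same calls would run on the loaded `df`.

```python
import pandas as pd

# The five rows shown in the preview, rebuilt by hand for a runnable example.
sample_df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37],
    "last_evaluation": [0.53, 0.86, 0.88, 0.87, 0.52],
    "number_project": [2, 5, 7, 5, 2],
    "average_montly_hours": [157, 262, 272, 223, 159],
    "time_spend_company": [3, 6, 4, 5, 3],
    "Work_accident": [0] * 5,
    "left": [1] * 5,
    "promotion_last_5years": [0] * 5,
    "sales": ["sales"] * 5,
    "salary": ["low", "medium", "medium", "low", "low"],
})

print(sample_df.shape)       # (rows, columns)
sample_df.info()             # dtypes and non-null counts
print(sample_df.describe())  # summary statistics for numerical columns

missing_per_column = sample_df.isnull().sum()
print(missing_per_column)    # missing values per column
```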

#### 2. Exploratory Data Analysis (EDA) Plan:
*   **Correlation Analysis**: Compute a correlation matrix for all numerical features and visualize it as a heatmap to identify relationships between variables, especially with the `left` column.
*   **Distribution Analysis**: Create distribution plots for `satisfaction_level`, `last_evaluation`, and `average_montly_hours` to understand their spread and shape.
*   **Comparative Analysis**: Generate a bar plot of `number_project` for employees who left versus those who stayed, to look for patterns linking project workload and turnover.
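
The numbers behind the heatmap and the bar plot can be computed with pandas alone, as in the sketch below. The rows in `toy_df` are made up for illustration only and are not taken from the HR dataset; the plotting step itself is left out.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the HR dataset.
toy_df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.90, 0.65],
    "number_project": [2, 5, 7, 5, 3, 4],
    "left": [1, 1, 1, 0, 0, 0],
    "salary": ["low", "medium", "medium", "low", "high", "medium"],
})

# Correlation matrix over numerical columns only (the input to a heatmap).
corr_matrix = toy_df.select_dtypes(include="number").corr()
print(corr_matrix)

# Counts of employees per project count, split by the `left` flag
# (the input to a grouped bar plot).
project_counts = pd.crosstab(toy_df["number_project"], toy_df["left"])
print(project_counts)
```

A matrix like `corr_matrix` could then be passed to a plotting function such as `seaborn.heatmap`, and `project_counts` to a bar plot.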

#### 3. Data Cleaning Plan:
*   If the missing-value check finds gaps, choose a handling strategy (imputation, row removal, or further investigation). The problem statement asks only for a missing-value check, so no cleaning action is needed if none are found.

#### 4. Feature Engineering Considerations:
*   Identify the categorical columns (`sales` and `salary` in the preview) and convert them into numerical representations using one-hot encoding (e.g., `pd.get_dummies()`), as the problem statement specifies for preprocessing before handling class imbalance.
*   Consider whether any new features would benefit modeling, although the problem statement focuses on the existing features.
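
A minimal sketch of the encoding step, assuming the non-numerical columns are the ones to encode. The three-row `sample_df` reuses values from the preview so the example runs on its own.

```python
import pandas as pd

# Small stand-in frame with values taken from the preview rows.
sample_df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "sales": ["sales", "sales", "sales"],
    "salary": ["low", "medium", "medium"],
})

# Treat every non-numerical column as categorical.
categorical_columns = sample_df.select_dtypes(exclude="number").columns.tolist()

# One-hot encode: each category becomes its own indicator column.
encoded_df = pd.get_dummies(sample_df, columns=categorical_columns)
print(encoded_df.columns.tolist())
```

Each indicator column is named `<column>_<category>`, so `salary` yields `salary_low` and `salary_medium` here.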

## Final Task

### Subtask:
Summarize the understanding of the problem statement, the key requirements, and the proposed analytical steps to address employee turnover.


## Summary:

### Data Analysis Key Findings

*   The problem statement was successfully extracted from the provided document, revealing Portobello Tech's goal to predict employee turnover using machine learning.
*   A comprehensive list of specific tasks was identified from the problem statement, including data quality checks, Exploratory Data Analysis (EDA), K-means clustering, handling class imbalance with SMOTE, training and evaluating multiple classification models (Logistic Regression, Random Forest, Gradient Boosting), identifying the best model, and suggesting retention strategies.
*   The HR dataset was prepared by unzipping `1739525599_hr_comma_sep.zip` and loading `HR_comma_sep.csv` into a pandas DataFrame.
*   A detailed plan for subsequent analysis steps was outlined, covering:
    *   **Initial Data Inspection**: Checking DataFrame dimensions, data types, non-null values, descriptive statistics, and missing values.
    *   **Exploratory Data Analysis**: Performing correlation analysis with heatmaps, distribution plots for `satisfaction_level`, `last_evaluation`, and `average_montly_hours`, and comparative bar plots for `number_project` between employees who left and stayed.
    *   **Data Cleaning**: Strategizing for handling any identified missing data.
    *   **Feature Engineering**: Converting categorical columns to numerical using one-hot encoding (`pd.get_dummies()`).

### Insights or Next Steps

*   The detailed plan aligns directly with the project objectives, ensuring that all aspects of the problem statement, from data understanding to model development and strategy formulation, will be addressed systematically.
*   The immediate next step is to execute the outlined "Initial Data Inspection" and "Exploratory Data Analysis" steps on the loaded HR data to gain initial insights and identify data quality issues.
