# Outsmarting a Dictator at TechTrend Innovations - Assignment 02

You should complete this Jupyter Notebook with your answers. You may need to write code or add explanatory notes.

Dataset link: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset/data

In [1]:
# Importing packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Getting the Dataset in Colab
To download the dataset, you’ll need a Kaggle API key. If you’re using Google Colab and don’t have one yet, follow these steps:  
1. Go to Kaggle.com, sign in, and create an API key under your account settings.  
2. Download the `kaggle.json` file.  
3. Run the code snippet below, then upload your `kaggle.json` file when prompted.  

Once that’s done, you can download the dataset with the Kaggle API.

In [2]:
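from google.colab import files

# Upload kaggle.json through the browser prompt
files.upload()
# Copy the key to the folder the Kaggle CLI reads, and restrict its permissions
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json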

Saving kaggle.json to kaggle.json


In [3]:
import os
import zipfile

# Install Kaggle
%pip install kaggle

# Create a folder to store the dataset
folder_name = 'hrdata'
os.makedirs(folder_name, exist_ok=True)

# Download dataset from Kaggle
!kaggle datasets download -d pavansubhasht/ibm-hr-analytics-attrition-dataset -p {folder_name}

# Unzip the dataset into the folder
zip_path = f"{folder_name}/ibm-hr-analytics-attrition-dataset.zip"
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(folder_name)

Dataset URL: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
License(s): DbCL-1.0


In [4]:
# Load and print head of the dataset
df = pd.read_csv(f"./{folder_name}/WA_Fn-UseC_-HR-Employee-Attrition.csv")
df.head()

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


## Background
Picture a ruthless dictator who’s faking a perfect country. He’s hacked into company employee databases, including TechTrend Innovations, and jacked up the `JobSatisfaction` column (1-4) to trick everyone into thinking workers are thrilled. This messes up systems people rely on: performance trackers for raises, health insurance for fair rates, and recruitment sites for job matches. Everything’s broken until a new survey fixes it. But here’s the twist—you’ve snagged a clean copy of TechTrend’s data from before the hack! Your job: predict the real `JobSatisfaction` so these systems can get back on track.  

These systems—performance trackers, health insurance, recruitment—only care if employees are Satisfied or Dissatisfied, not the full 1-4 scale. So, you’ll turn `JobSatisfaction` into a two-class target: **Satisfied** (3-4) vs. **Dissatisfied** (1-2). Nail this, and you’ll expose the dictator’s lie while saving TechTrend’s day!
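
As a rough illustration of that two-class mapping, here is a minimal sketch. The `toy` frame is made up and only stands in for the real `JobSatisfaction` column; the threshold (3 and above counts as Satisfied) comes straight from the rule above.

```python
import pandas as pd

# Toy stand-in for the JobSatisfaction column (values 1-4), not the real data.
toy = pd.DataFrame({"JobSatisfaction": [1, 2, 3, 4, 3, 2]})

# Ratings 3-4 become "Satisfied"; ratings 1-2 become "Dissatisfied".
toy["SatisfactionLevel"] = toy["JobSatisfaction"].map(
    lambda rating: "Satisfied" if rating >= 3 else "Dissatisfied"
)

print(toy)
```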

## Task 1: Fixing TechTrend’s Data to Beat the Hack
The dictator’s goons scrambled TechTrend’s satisfaction numbers, but your clean data can strike back.   
- Make a new target feature, `SatisfactionLevel`, where `JobSatisfaction` 3-4 is "Satisfied" and 1-2 is "Dissatisfied". Then ditch `JobSatisfaction`—it’s fake out there now, and we’re rebuilding it.  
- Clean up the rest: encode categorical columns like `JobRole` or `Gender`, scale numeric ones like `Age` or `MonthlyIncome`, and patch any gaps (missing values).  
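
One possible way to wire up the encoding, scaling, and gap-patching is scikit-learn's `ColumnTransformer`. The sketch below is only an assumption about how you might structure it, not the required solution. The `toy` rows are invented, and the choices of median imputation, most-frequent imputation, and one-hot encoding are illustrative defaults you may want to swap.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows that mimic a few columns of the dataset; the values are made up.
toy = pd.DataFrame({
    "Age": [41, 49, np.nan, 33, 27],
    "MonthlyIncome": [5993, 5130, 2090, np.nan, 3468],
    "Gender": ["Female", "Male", "Male", "Female", "Male"],
    "JobRole": ["Sales Executive", "Research Scientist", "Laboratory Technician",
                "Research Scientist", "Sales Executive"],
    "JobSatisfaction": [4, 2, 3, 3, 2],
    "SatisfactionLevel": ["Satisfied", "Dissatisfied", "Satisfied",
                          "Satisfied", "Dissatisfied"],
})

# Separate the target, and drop JobSatisfaction so it cannot leak into the features.
y = toy["SatisfactionLevel"]
X = toy.drop(columns=["JobSatisfaction", "SatisfactionLevel"])

numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

X_ready = preprocess.fit_transform(X)
print(X_ready.shape)
```

Keeping the steps inside a pipeline means the same transformer can later be fitted on training folds only, which avoids leaking test-fold statistics into the scaler or imputer.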


In [None]:
# Write your answer here:

## Task 2: Showing the Real Satisfaction Split at TechTrend
The dictator’s bragging that everyone’s happy, but your data might bust that myth. Let’s see the truth.  
- Count how many are Satisfied vs. Dissatisfied in `SatisfactionLevel` and show it.  
- Plot it (e.g., a bar chart) to spotlight the imbalance—maybe more Satisfied, maybe not.  

*In a markdown cell, say what this split means for TechTrend and how it dents the dictator’s story.*
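
A minimal sketch of the counting and plotting steps, assuming plain matplotlib. The `labels` series is made up and only stands in for the real `SatisfactionLevel` column, so the 6-to-4 split here says nothing about TechTrend.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up labels standing in for the real SatisfactionLevel column.
labels = pd.Series(
    ["Satisfied"] * 6 + ["Dissatisfied"] * 4, name="SatisfactionLevel"
)

# Absolute counts and relative shares per class.
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)
print(counts)
print(shares)

# Bar chart of the class counts; in a notebook the figure renders when the cell finishes.
fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(counts.index.tolist(), counts.to_numpy())
ax.set_xlabel("SatisfactionLevel")
ax.set_ylabel("Number of employees")
ax.set_title("Satisfied vs. Dissatisfied (toy data)")
fig.tight_layout()
```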

In [None]:
# Write your answer here:

## Task 3: Digging Up What Really Shapes Satisfaction at TechTrend
TechTrend’s systems need the real drivers of satisfaction to work again. Let’s snoop around.  
- Guess which features—like `OverTime`, `WorkLifeBalance`, or `MonthlyIncome`—might sway `SatisfactionLevel`.  
- Draw at least two graphs (e.g., box plots, bar plots) to link these to satisfaction.  

*In a markdown cell, share your guesses and what the graphs spill about TechTrend’s real vibe.*
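
Two graph types that could work here are a box plot for a numeric feature and a stacked bar plot for a categorical one. The sketch below uses an invented `toy` frame, so any pattern it shows is an artifact of the made-up values and not a finding about the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows; column names follow the dataset, values do not.
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
    "MonthlyIncome": [2900, 5200, 3100, 6100, 4800, 2500, 7000, 3900],
    "SatisfactionLevel": ["Dissatisfied", "Satisfied", "Dissatisfied", "Satisfied",
                          "Satisfied", "Dissatisfied", "Satisfied", "Satisfied"],
})

# Share of each satisfaction level within each OverTime group (rows sum to 1).
overtime_share = pd.crosstab(
    toy["OverTime"], toy["SatisfactionLevel"], normalize="index"
)

# MonthlyIncome values split by satisfaction level, ready for a box plot.
levels = ["Dissatisfied", "Satisfied"]
income_by_level = [
    toy.loc[toy["SatisfactionLevel"] == level, "MonthlyIncome"].to_numpy()
    for level in levels
]

fig, (ax_box, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Graph 1: distribution of income per satisfaction level.
ax_box.boxplot(income_by_level)
ax_box.set_xticks([1, 2])
ax_box.set_xticklabels(levels)
ax_box.set_ylabel("MonthlyIncome")
ax_box.set_title("Income by SatisfactionLevel (toy data)")

# Graph 2: satisfaction shares within each OverTime group.
overtime_share.plot(kind="bar", stacked=True, ax=ax_bar)
ax_bar.set_ylabel("Share of employees")
ax_bar.set_title("SatisfactionLevel by OverTime (toy data)")

fig.tight_layout()
```

Normalizing the crosstab by row makes groups of different sizes comparable, which matters when far fewer employees work overtime than not.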

In [None]:
# Write your answer here:

## Task 4: Nailing the Big Clues to Thwart the Dictator
The dictator’s hack buried the truth, but TechTrend needs the key factors to fix its systems.
- Pick a method (e.g., Recursive Feature Elimination, Correlation Thresholding) to find the top features predicting `SatisfactionLevel`.  
- List your picks and defend your method—why’s it perfect for TechTrend’s counterattack?  
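
If you go with Recursive Feature Elimination, a sketch along these lines may help. Everything in it is an assumption for illustration: the data is synthetic, the column names `signal_a`, `signal_b`, and `noise_*` are hypothetical, and logistic regression is just one estimator that RFE can rank features with.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows = 200

# Synthetic features: two carry signal, three are pure noise (hypothetical names).
X = pd.DataFrame({
    "signal_a": rng.normal(size=n_rows),
    "signal_b": rng.normal(size=n_rows),
    "noise_a": rng.normal(size=n_rows),
    "noise_b": rng.normal(size=n_rows),
    "noise_c": rng.normal(size=n_rows),
})

# Synthetic binary target that depends only on the two signal columns.
score = 2.0 * X["signal_a"] - 2.0 * X["signal_b"] + rng.normal(scale=0.5, size=n_rows)
y = (score > 0).astype(int)

# Scale first so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

# RFE repeatedly fits the model and drops the feature with the smallest |coefficient|.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X_scaled, y)

selected = X.columns[selector.support_].tolist()
print(selected)
print(dict(zip(X.columns, selector.ranking_)))
```

`ranking_` assigns 1 to every kept feature and larger numbers to features eliminated earlier, which gives you an ordering to cite when you defend your picks.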


In [None]:
# Write your answer here:

## Task 5: Modeling to Crush the Dictator’s Fakeout
TechTrend’s systems are stalled by the dictator’s phony data. Use your Task 4 features to predict the real `SatisfactionLevel`.  
- Pick a classifier (e.g., Naive Bayes or whatever you like) and tune at least one hyperparameter (e.g., smoothing).  
- Use cross-validation (e.g., k-fold) to train and test it reliably.  
- Tackle the unbalanced classes (like more Satisfied than Dissatisfied) with something like oversampling or class weights.  
- Check it with metrics like precision, recall, and F1-score—make sure it’s sharp for systems like health insurance.  

*In a markdown cell, explain your classifier, imbalance trick, and why your metrics save TechTrend.*
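
The sketch below shows one way the four bullets could fit together. It swaps in logistic regression instead of the Naive Bayes example, because logistic regression accepts `class_weight="balanced"` directly. The data comes from `make_classification` and is purely synthetic, the 0/1 label encoding (0 for Dissatisfied, 1 for Satisfied) is an assumption, and the grid of `C` values is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced stand-in for the Task 4 features (roughly 40/60).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.4, 0.6], random_state=0,
)

# Scaling lives inside the pipeline so each fold is scaled on its own training part.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Tune the regularization strength C, scoring by macro F1 so both classes count equally.
search = GridSearchCV(
    model,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1_macro",
    cv=inner_cv,
)

# Nested cross-validation: every outer fold reruns the search on its training part,
# so the reported metrics are not tuned on the rows they are measured on.
y_pred = cross_val_predict(search, X, y, cv=outer_cv)
print(classification_report(y, y_pred, target_names=["Dissatisfied", "Satisfied"]))

# Final fit on all rows to see which C the search prefers.
search.fit(X, y)
print(search.best_params_)
```

If you prefer oversampling to class weights, the resampling step should run inside each training fold only; oversampling before the split copies rows into the test folds and inflates the scores.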

In [None]:
# Write your answer here: