# Fixing Employee Happiness at TechTrend Innovations - Assignment 02

You should complete this Jupyter Notebook with your answers. You may need to write code or add explanatory notes.

Dataset link: https://www.kaggle.com/datasets/lainguyn123/employee-survey/data

In [1]:
# Importing packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Getting the Dataset in Colab
To download the dataset, you’ll need a Kaggle API key. If you’re using Google Colab and don’t have one yet, follow these steps:  
1. Go to Kaggle.com, sign in, and create an API key under your account settings.  
2. Download the `kaggle.json` file.  
3. Run the code snippet below, then upload your `kaggle.json` file when prompted.  

Once that’s done, you can download the dataset with the Kaggle API.

In [2]:
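from google.colab import files

# Opens a file picker; choose the kaggle.json file downloaded in step 2
files.upload()
# Copy the key to the folder the Kaggle CLI reads and restrict its permissions
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json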

Saving kaggle.json to kaggle.json


In [3]:
import os
import zipfile

# Install Kaggle
%pip install kaggle

# Create a folder to store the dataset
folder_name = 'EmployeeData'
os.makedirs(folder_name, exist_ok=True)

# Download dataset from Kaggle
!kaggle datasets download -d lainguyn123/employee-survey -p {folder_name}

# Unzip the dataset into the folder
zip_path = f"{folder_name}/employee-survey.zip"
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(folder_name)

Dataset URL: https://www.kaggle.com/datasets/lainguyn123/employee-survey
License(s): other


In [4]:
# Load and print head of the dataset
df = pd.read_csv(f"./{folder_name}/employee_survey.csv")
df.head()

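| | EmpID | Gender | Age | MaritalStatus | JobLevel | Experience | Dept | EmpType | WLB | WorkEnv | ... | SleepHours | CommuteMode | CommuteDistance | NumCompanies | TeamSize | NumReports | EduLevel | haveOT | TrainingHoursPerYear | JobSatisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | Male | 32 | Married | Mid | 7 | IT | Full-Time | 1 | 1 | ... | 7.6 | Car | 20 | 3 | 12 | 0 | Bachelor | True | 33.5 | 5 |
| 1 | 11 | Female | 34 | Married | Mid | 12 | Finance | Full-Time | 1 | 1 | ... | 7.9 | Car | 15 | 4 | 11 | 0 | Bachelor | False | 36.0 | 5 |
| 2 | 33 | Female | 23 | Single | Intern/Fresher | 1 | Marketing | Full-Time | 2 | 4 | ... | 6.5 | Motorbike | 17 | 0 | 30 | 0 | Bachelor | True | 10.5 | 5 |
| 3 | 20 | Female | 29 | Married | Junior | 6 | IT | Contract | 2 | 2 | ... | 7.5 | Public Transport | 13 | 2 | 9 | 0 | Bachelor | True | 23.0 | 5 |
| 4 | 28 | Other | 23 | Single | Junior | 1 | Sales | Part-Time | 3 | 1 | ... | 4.9 | Car | 20 | 0 | 7 | 0 | Bachelor | False | 20.5 | 5 |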


## Background
Imagine a bossy leader in a made-up country who wants everyone to think his people are super happy. He tells hackers to sneak into the records of companies, including TechTrend Innovations, and change the `JobSatisfaction` numbers (1-5) so it looks like everyone loves their job! This tricks important tools that rely on those numbers: performance trackers that decide pay raises, health insurers that set costs, and job sites that help people find work. These tools will stay confused until a new survey happens. But you’ve got a secret copy of TechTrend’s real data from before the hack! Your job: figure out the true `JobSatisfaction`, split into **Satisfied** (3-5) or **Dissatisfied** (1-2), to help these tools work again. They only need a simple “happy or not” answer, so two groups make sense. Let’s beat the leader’s trick!

## Task 1: Fixing TechTrend’s Data to Beat the Hack
The leader’s hackers messed up TechTrend’s happiness numbers, but your real data can fix it.  
- Make a new target, `SatisfactionLevel`, where `JobSatisfaction` 3-5 is "Satisfied" and 1-2 is "Dissatisfied". Then remove `JobSatisfaction`: the hacked copies of it are fake, and the new target replaces it (keeping it would also leak the answer into your features).  
- Clean the rest: turn text columns like `Gender` or `Dept` into numbers, scale numeric columns like `Age` or `HoursWorkload`, and fill any blanks.
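
The steps above could be sketched roughly as follows. This is only one possible approach, shown on a handful of made-up rows rather than the real survey. The names `toy`, `num_cols`, `cat_cols`, `features`, and `target` are illustrative, and the choices of median/mode filling, one-hot encoding, and `StandardScaler` are assumptions, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows standing in for df; only columns visible in df.head() are used.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", None, "Other"],
    "Age": [32, 34, np.nan, 29, 23],
    "Dept": ["IT", "Finance", "Marketing", "IT", "Sales"],
    "haveOT": [True, False, True, True, False],
    "JobSatisfaction": [5, 2, 3, 1, 4],
})

# 1. Binary target: 3-5 -> "Satisfied", 1-2 -> "Dissatisfied"; then drop the original.
toy["SatisfactionLevel"] = np.where(
    toy["JobSatisfaction"] >= 3, "Satisfied", "Dissatisfied"
)
toy = toy.drop(columns="JobSatisfaction")

# 2. Fill blanks: median for numeric columns, most frequent value for text columns.
num_cols = ["Age"]
cat_cols = ["Gender", "Dept"]
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# 3. Turn text columns into 0/1 indicator columns and booleans into 0/1.
features = pd.get_dummies(
    toy.drop(columns="SatisfactionLevel"), columns=cat_cols, dtype=int
)
features["haveOT"] = features["haveOT"].astype(int)

# 4. Standardize numeric columns to mean 0 and unit variance.
features[num_cols] = StandardScaler().fit_transform(features[num_cols])

target = toy["SatisfactionLevel"]
print(features.head())
print(target.tolist())
```

Scaling the whole table up front is fine for exploring the data, but when you evaluate a model it is usually safer to fit the scaler inside each training fold (for example with a scikit-learn pipeline) so the test rows do not influence it.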


In [None]:
# Write your answer here:

## Task 2: Showing the Real Satisfaction Split at TechTrend
The leader says all workers are happy, but your data might prove him wrong. Let’s look.  
- Count how many are Satisfied vs. Dissatisfied in `SatisfactionLevel` and show the numbers.  
- Make a picture (like a bar chart) to show if more are Satisfied or not.  
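
As a rough illustration of both bullets, the sketch below counts the two classes and draws a bar chart. The `levels` series is made up and merely stands in for `df["SatisfactionLevel"]` from Task 1; the variable names and colors are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up labels standing in for df["SatisfactionLevel"] from Task 1.
levels = pd.Series(
    ["Satisfied"] * 7 + ["Dissatisfied"] * 3, name="SatisfactionLevel"
)

# Absolute counts and relative shares per class.
counts = levels.value_counts()
shares = levels.value_counts(normalize=True)
print(pd.DataFrame({"count": counts, "share": shares.round(2)}))

# Bar chart of the class counts.
fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(counts.index.tolist(), counts.to_numpy(), color=["tab:green", "tab:red"])
ax.set_xlabel("SatisfactionLevel")
ax.set_ylabel("Number of employees")
ax.set_title("Satisfied vs. Dissatisfied (toy data)")
fig.tight_layout()
```

Printing the shares next to the counts makes any class imbalance obvious early, which matters again in Task 5.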


In [None]:
# Write your answer here:

## Task 3: Digging Up What Really Shapes Satisfaction at TechTrend
TechTrend’s tools need to know what changes happiness to work right. Let’s check it out.  
- Guess which things—like `haveOT`, `WLB`, or `HoursWorkload`—might affect `SatisfactionLevel`.  
- Draw two pictures (like box plots or bar charts) to show how these connect to happiness.  

*In a markdown cell, state your guesses and explain what the pictures show about how TechTrend’s workers really feel.*
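
One way the two pictures might look, sketched on made-up rows: a box plot of `WLB` per satisfaction group, and a stacked bar chart of the satisfaction split for employees with and without overtime. The data, the `toy` frame, and the choice of `WLB` and `haveOT` are assumptions for illustration; your own guesses may point to other columns.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows; in the notebook these columns would come from the cleaned df.
toy = pd.DataFrame({
    "WLB": [1, 2, 2, 1, 4, 5, 3, 4, 5, 3],
    "haveOT": [True, True, True, False, False, False, True, False, False, True],
    "SatisfactionLevel": ["Dissatisfied"] * 4 + ["Satisfied"] * 6,
})

fig, (ax_box, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Picture 1: distribution of WLB within each satisfaction group.
names = ["Dissatisfied", "Satisfied"]
data = [toy.loc[toy["SatisfactionLevel"] == n, "WLB"].to_numpy() for n in names]
ax_box.boxplot(data)
ax_box.set_xticks([1, 2])
ax_box.set_xticklabels(names)
ax_box.set_ylabel("WLB")
ax_box.set_title("WLB by SatisfactionLevel")

# Picture 2: satisfaction split among employees with and without overtime.
# normalize="index" makes each haveOT row sum to 1.
ot_share = pd.crosstab(toy["haveOT"], toy["SatisfactionLevel"], normalize="index")
ot_share.plot(kind="bar", stacked=True, ax=ax_bar, rot=0)
ax_bar.set_ylabel("Share of employees")
ax_bar.set_title("SatisfactionLevel by haveOT")

fig.tight_layout()
print(ot_share)
```

Box plots suit numeric or ordinal features, while a normalized cross-tabulation suits yes/no or categorical features, because it compares shares instead of raw counts.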

In [None]:
# Write your answer here:

## Task 4: Nailing the Big Clues to Stop the Leader
The hack hid what’s real, but TechTrend needs the top reasons for happiness to fix its tools. Let’s find them.  
- Choose a way (like Recursive Feature Elimination or Correlation Thresholding) to pick the best features for guessing `SatisfactionLevel`.  
- List your top features and say why your way works for TechTrend.  

*In a markdown cell, tell how these features can help TechTrend’s tools win.*
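
For orientation, here is a minimal sketch of Recursive Feature Elimination with scikit-learn's `RFE`. The data is synthetic: the labels are generated from `WLB` and `haveOT` by an invented rule, purely so the example has something to find, and says nothing about the real survey. Using `LogisticRegression` as the base estimator and keeping two features are assumptions as well.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned feature table from Task 1.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "WLB": rng.integers(1, 6, n),
    "Age": rng.integers(22, 60, n),
    "haveOT": rng.integers(0, 2, n),
    "TeamSize": rng.integers(5, 31, n),
})
# Invented rule for illustration only: labels depend on WLB and haveOT plus noise.
score = X["WLB"] - 2 * X["haveOT"] + rng.normal(0, 0.5, n)
y = np.where(score >= 2, "Satisfied", "Dissatisfied")

# Scale first so coefficient sizes are comparable across features.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# RFE refits the model repeatedly, dropping the weakest feature each round.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X_scaled, y)

selected = X.columns[selector.support_].tolist()
ranking = pd.Series(selector.ranking_, index=X.columns).sort_values()
print("Selected:", selected)
print(ranking)
```

In `ranking_`, a value of 1 marks a kept feature and larger values mark features that were eliminated earlier. RFE judges features in the context of a model, whereas correlation thresholding looks at each feature on its own, which is one point you could discuss in your markdown cell.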


In [None]:
# Write your answer here:

## Task 5: Modeling to Crush the Leader’s Fakeout
The leader’s fake numbers broke TechTrend’s tools. Use your Task 4 features to guess the real `SatisfactionLevel`.
- Pick a classifier (e.g., Naive Bayes or whatever you like) and tune at least one hyperparameter (e.g., smoothing).  
- Use cross-validation (e.g., k-fold) so the model is trained and tested on different rows each time and the scores are reliable.  
- Tackle the unbalanced classes (like more Satisfied than Dissatisfied) with something like oversampling or class weights.  
- Evaluate it with scores like precision, recall, and F1-score, which tell tools like job sites how far they can trust each prediction.  

*In a markdown cell, explain your classifier, imbalance trick, and why your metrics save TechTrend.*
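
The bullets above could be combined roughly as in the sketch below. It swaps in logistic regression instead of Naive Bayes, because `class_weight="balanced"` gives a simple way to handle imbalance without extra packages; that swap, the synthetic data, the `C` grid, and the macro-F1 scoring are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the Task 4 features and target.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "WLB": rng.integers(1, 6, n),
    "haveOT": rng.integers(0, 2, n),
})
score = X["WLB"] - X["haveOT"] + rng.normal(0, 0.7, n)
y = np.where(score >= 1.5, "Satisfied", "Dissatisfied")

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Tune one hyperparameter: the inverse regularization strength C.
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},
    scoring="f1_macro",
    cv=cv,
)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best macro F1:", round(search.best_score_, 3))

# Out-of-fold predictions give per-class precision, recall, and F1.
y_pred = cross_val_predict(search.best_estimator_, X, y, cv=cv)
report = classification_report(y, y_pred)
print(report)
```

Because the hyperparameter was chosen on the same folds that produce the final report, these scores are slightly optimistic; a held-out test set or nested cross-validation would give a cleaner estimate.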

In [None]:
# Write your answer here: