# Instructions

* Add your code as indicated in each cell.
* You may add as many cells as you need to experiment with your code. Once you are done, remove the cells you added and keep only the original cells.
* Do not alter/delete cells that state **"Do not change or delete this cell".**
* Before you turn this in, make sure everything runs as expected.
> * Start from the very first cell, run all cells one by one, including the cell that computes your total score for this assignment.
> * If you cannot solve a specific question and its cell raises an error, still run that cell, then move on and keep running the remaining cells.
* All work must be your own. I use a plagiarism check for coding. Copying and pasting code from friends is not allowed.
* If you attempt to fake passing the tests, you will receive an F, and it will be considered an ethical violation.
* You will need to submit three things:
> 1.   ***The link to your Google Colab notebook file***: Submit the link to your notebook file. To do so, click **Share** on the top right-hand side. A box will pop up. Change **"restricted"** to **"anyone with the link."** Then copy the link and paste it as a comment when submitting the assignment.
> 2.   ***The notebook file***: Download the same file as ipynb. To do so, go to **File** and select **Download**. Then click **ipynb** in the menu box.
> 3.   ***The pdf version of your notebook file***: Download the same file as PDF. To do so, go to **File** and select **Print**. A menu box will pop up. Then click **PDF** in the menu box. This converts the file into a PDF instead of sending it to a printer.


# Business Context
* Employee attrition refers to the reduction in workforce due to employees leaving the company through resignations, retirements, or terminations. Predicting employee attrition using analytics is crucial for organizations because it enables them to proactively address potential issues that cause employees to leave.
* By leveraging data on employee performance, satisfaction, and engagement, companies can identify patterns and predictors of attrition, allowing them to implement targeted retention strategies, reduce turnover costs, and maintain a stable and motivated workforce, ultimately enhancing overall productivity and organizational success.

# Run Cell Below to Store Your Assignment Score

In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

import random
random.seed(10)

score = {}

# Run Cell Below to Read the Dataset

In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL
### THIS CELL WILL READ THE DATA

import pandas as pd
import numpy as np
import random
random.seed(10)

url_employee_attrition = 'https://drive.google.com/file/d/1rd5jMbd7gTXsMEQuALpru5QNSjlmKbPV/view?usp=sharing'
path_employee_attrition = 'https://drive.google.com/uc?export=download&id='+url_employee_attrition.split('/')[-2]
df_employee_attrition = pd.read_csv(path_employee_attrition)

df_employee_attrition.head()


# READ THIS CAREFULLY
# Use the code below only if you reach the download limit on Google Drive
# df_employee_attrition = pd.read_csv('https://raw.githubusercontent.com/marineevy/datasets/main/hr_analytics.csv')

# READ THIS CAREFULLY
# Only use the code below if you cannot load the data using the previous two methods
# df_employee_attrition = pd.read_csv('hr_analytics.csv')

df_employee_attrition.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


# Question 1

* Create a variable called `X` and assign it the dataset with only the input variables.
* Create a variable called `y` and assign it the output data. Remember that `Attrition` is the output (target or y) variable.
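
As a minimal sketch of this step, the toy frame below (hypothetical values, not the assignment data) separates the inputs from the target. The names `toy_df`, `X_toy`, and `y_toy` are illustrative only.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real dataset.
toy_df = pd.DataFrame({
    "Age": [41, 49, 37],
    "Department": ["Sales", "Research & Development", "Research & Development"],
    "Attrition": ["Yes", "No", "Yes"],
})

# Inputs: every column except the target.
X_toy = toy_df.drop(columns="Attrition")

# Output: the target column on its own (a pandas Series).
y_toy = toy_df["Attrition"]

print(X_toy.columns.tolist())  # ['Age', 'Department']
print(y_toy.tolist())          # ['Yes', 'No', 'Yes']
```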


In [None]:
###
### YOUR CODE HERE
###
# Assign input features (independent variables) to X
X = df_employee_attrition.drop('Attrition', axis=1)

# Assign output feature (dependent variable) to y
y = df_employee_attrition['Attrition']


In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

X.head(2)

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,2,...,1,80,0,8,0,1,6,4,0,5
1,49,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,3,...,4,80,1,10,3,3,10,7,1,7


__Question 1 test case is below__

In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

try:
    if not('Attrition' in X.columns):

        score['question 1'] = 'pass'
    else:
        score['question 1'] = 'fail'
except:
    score['question 1'] = 'fail'
score

{'question 1': 'pass'}

# Question 2

* Split the dataset into training and testing sets
* Recall that now you have `X` and `y`. After this step, you should have `X_train`, `y_train`, `X_test`, and `y_test`
* So you will create four new variables named `X_train`, `y_train`, `X_test`, and `y_test` using the `X` and `y` data
* Test size should be **`25 percent`** of the entire dataset
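
The split can be sketched on stand-in data as follows. The row count of 1,470 is an assumption made for illustration (this notebook does not state the dataset size), and `X_demo`, `y_demo`, and the `X_tr`/`X_te` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,470 rows is an assumed size, used only for illustration.
n_rows = 1470
X_demo = pd.DataFrame({"feature": np.arange(n_rows)})
y_demo = pd.Series(np.arange(n_rows) % 2, name="target")

# test_size=0.25 holds out 25% of the rows; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print(X_tr.shape, X_te.shape)  # (1102, 1) (368, 1)

# Rows of X and y stay aligned because both are split with the same shuffle.
print((X_tr.index == y_tr.index).all())  # True
```

Passing `stratify=y_demo` would additionally keep the class proportions the same in both parts. Whether the assignment expects that is not stated, so it is left out here.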


In [None]:
###
### YOUR CODE HERE
###
from sklearn.model_selection import train_test_split

# Split the dataset: 75% training, 25% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

X_train.head(2)

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
1343,29,Travel_Rarely,592,Research & Development,7,3,Life Sciences,1,1883,4,...,2,80,0,11,2,3,3,2,1,2
1121,36,Travel_Rarely,884,Sales,1,4,Life Sciences,1,1585,2,...,1,80,0,15,5,3,1,0,0,0


__Question 2 test case is below__

In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

try:
    if ((X_train.shape[0]>1100) and (X_train.shape[0]<1110)):
        score['question 2'] = 'pass'
    else:
        score['question 2'] = 'fail'
except:
    score['question 2'] = 'fail'
score

{'question 1': 'pass', 'question 2': 'pass'}

# Question 3

* Identify the names of the numerical variables in the input dataset `X_train`, and save them to a Python variable called `numerical_cols`.
* Identify the names of the categorical variables in the input dataset `X_train`, and save them to a Python variable called `categorical_cols`.
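
One way to do this is with `DataFrame.select_dtypes`, sketched here on a hypothetical two-row frame. Treating every non-numeric column as categorical is an assumption that suits this kind of data but may not hold for dates or booleans.

```python
import pandas as pd

# Hypothetical frame with two numeric columns and one text column.
toy = pd.DataFrame({
    "Age": [41, 49],
    "DailyRate": [1102.0, 279.0],
    "Department": ["Sales", "Research & Development"],
})

# "number" matches every integer and float dtype.
num_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical here.
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(num_cols)  # ['Age', 'DailyRate']
print(cat_cols)  # ['Department']
```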

In [None]:
###
### YOUR CODE HERE
###
# Identify numerical columns
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Identify categorical columns
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()


In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

print(numerical_cols)
print(categorical_cols)

['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', 'OverTime']


**Question 3 test case is below**

In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

try:
  if ((len(numerical_cols) == 26) and (len(categorical_cols) == 8)):
    score['question 3'] = 'pass'
  else:
    score['question 3'] = 'fail'
except:
  score['question 3'] = 'fail'
score

{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass'}

# Question 4

* Scale the numerical variables in the X_train and X_test datasets using the **StandardScaler** method
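
A small worked example may help show why the scaler is fitted on the training data only. The arrays below are made up, and `scaler_demo` is an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler_demo = StandardScaler()

# fit_transform learns the training mean (2.0) and standard deviation (sqrt(2/3)).
train_scaled = scaler_demo.fit_transform(train)

# transform reuses those training statistics; nothing is learned from the test set.
test_scaled = scaler_demo.transform(test)

print(scaler_demo.mean_, scaler_demo.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test value 4.0 becomes (4 - 2) / sqrt(2/3), roughly 2.45, because it is measured against the training distribution. Fitting a second scaler on the test set would leak information about it into preprocessing.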

In [None]:
###
### YOUR CODE HERE
###
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

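# Work on copies so the column assignments below do not write into slices of X
X_train, X_test = X_train.copy(), X_test.copy()

# Fit on training data only, then apply the same scaling to both sets
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])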


In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

X_train.head(2)

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
1343,-0.852159,Travel_Rarely,-0.508455,Research & Development,-0.285906,0.094018,Life Sciences,0.0,1.419614,1.190392,...,-0.645578,0.0,-0.949842,-0.014296,-0.615111,0.365845,-0.64246,-0.603137,-0.360548,-0.569369
1121,-0.093088,Travel_Rarely,0.209318,Sales,-1.017204,1.053349,Life Sciences,0.0,0.928104,-0.641748,...,-1.574333,0.0,-0.949842,0.498037,1.695749,0.365845,-0.970918,-1.155867,-0.682269,-1.138997


**Question 4 test case is below**

In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

try:
    if ((round(X_train['Age'].max(), 2) < 3) and (round(X_train['Age'].max(), 2) > 2)):
        score['question 4'] = 'pass'
    else:
        score['question 4'] = 'fail'
except:
    score['question 4'] = 'fail'
score

{'question 1': 'pass',
 'question 2': 'pass',
 'question 3': 'pass',
 'question 4': 'pass'}

# Question 5
* Encode the categorical input variables for X_train and X_test datasets using the OneHotEncoder method


In [None]:
###
### YOUR CODE HERE
from sklearn.preprocessing import OneHotEncoder

# Initialize the encoder
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit and transform on X_train, transform on X_test
X_train_encoded = encoder.fit_transform(X_train[categorical_cols])
X_test_encoded = encoder.transform(X_test[categorical_cols])

# Get new column names after encoding
encoded_cols = encoder.get_feature_names_out(categorical_cols)

# Convert encoded arrays to DataFrames
X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=encoded_cols, index=X_train.index)
X_test_encoded_df = pd.DataFrame(X_test_encoded, columns=encoded_cols, index=X_test.index)

# Drop original categorical columns and replace with encoded ones
X_train = X_train.drop(columns=categorical_cols).join(X_train_encoded_df)
X_test = X_test.drop(columns=categorical_cols).join(X_test_encoded_df)

# Display first few rows to verify
print("Updated X_train with encoded columns:")
display(X_train.head())

Updated X_train with encoded columns:


Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,...,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Divorced,MaritalStatus_Married,MaritalStatus_Single,Over18_Y,OverTime_No,OverTime_Yes
1343,-0.852159,-0.508455,-0.285906,0.094018,0.0,1.419614,1.190392,-0.3746,0.393923,-0.934008,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
1121,-0.093088,0.209318,-1.017204,1.053349,0.0,0.928104,-0.641748,0.309668,0.393923,-0.027156,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
1048,-0.309965,1.29581,-0.773438,0.094018,0.0,0.751623,1.190392,0.700678,-2.415809,-0.027156,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
1393,-1.069036,0.381387,-0.042139,0.094018,0.0,1.554862,1.190392,-1.107744,0.393923,-0.027156,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
527,-0.526843,0.319933,0.079744,0.094018,0.0,-0.495295,1.190392,-0.570105,0.393923,-0.027156,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0


In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

X_train.head()

Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,...,JobRole_Research Director,JobRole_Research Scientist,JobRole_Sales Executive,JobRole_Sales Representative,MaritalStatus_Divorced,MaritalStatus_Married,MaritalStatus_Single,Over18_Y,OverTime_No,OverTime_Yes
1343,-0.852159,-0.508455,-0.285906,0.094018,0.0,1.419614,1.190392,-0.3746,0.393923,-0.934008,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
1121,-0.093088,0.209318,-1.017204,1.053349,0.0,0.928104,-0.641748,0.309668,0.393923,-0.027156,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
1048,-0.309965,1.29581,-0.773438,0.094018,0.0,0.751623,1.190392,0.700678,-2.415809,-0.027156,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
1393,-1.069036,0.381387,-0.042139,0.094018,0.0,1.554862,1.190392,-1.107744,0.393923,-0.027156,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
527,-0.526843,0.319933,0.079744,0.094018,0.0,-0.495295,1.190392,-0.570105,0.393923,-0.027156,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0


**Question 5 test case is below**

In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

try:
    if 'JobRole_Research Scientist' in X_train.columns:
        score['question 5'] = 'pass'
    else:
        score['question 5'] = 'fail'
except:
    score['question 5'] = 'fail'
score

{'question 1': 'pass',
 'question 2': 'pass',
 'question 3': 'pass',
 'question 4': 'pass',
 'question 5': 'pass'}

# Question 6
* Encode the categorical output variable in the `y_train` and `y_test` data. Yes should be coded as 1, and No should be coded as 0.
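
A minimal sketch of this encoding on made-up labels follows. `labels` and `labels_encoded` are illustrative names.

```python
import pandas as pd

labels = pd.Series(["Yes", "No", "No", "Yes"], name="Attrition")

# Map each label to its numeric code.
labels_encoded = labels.map({"Yes": 1, "No": 0})

# map() turns any label missing from the dictionary into NaN, so check for that.
n_unmapped = int(labels_encoded.isna().sum())

print(labels_encoded.tolist())  # [1, 0, 0, 1]
print(n_unmapped)               # 0
print(labels_encoded.mean())    # 0.5
```

Once the labels are 0 and 1, the mean of the series is the share of positive cases, which is the quantity the Question 6 test case inspects.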


In [None]:
###
### YOUR CODE HERE
# Encode the target variable
y_train = y_train.map({'Yes': 1, 'No': 0})
y_test = y_test.map({'Yes': 1, 'No': 0})



In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

y_train.head()

Unnamed: 0,Attrition
1343,0
1121,0
1048,0
1393,0
527,0


**Question 6 test case is below**

In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

try:
    if ((round(y_train.mean(),3) > 0.1) and (round(y_train.mean(),3) < 0.2)):
        score['question 6'] = 'pass'
    else:
        score['question 6'] = 'fail'
except:
    score['question 6'] = 'fail'
score

{'question 1': 'pass',
 'question 2': 'pass',
 'question 3': 'pass',
 'question 4': 'pass',
 'question 5': 'pass',
 'question 6': 'pass'}

# Question 7

* As you may have checked already, the dataset is not balanced.
* You will need to balance the dataset.
* Use the imblearn library and the SMOTE method to balance the dataset.
* Recall that you only balance the training dataset.
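
To show the idea behind SMOTE, here is a simplified sketch. It is not the `imblearn` implementation: it only reproduces the core step, in which each synthetic row is placed at a random point on the line between a minority-class row and one of its nearest minority-class neighbours. The function `smote_like_oversample` and the toy arrays are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours because each row comes back as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)          # random minority rows
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # one neighbour per base row
    gap = rng.random((n_new, 1))                            # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])


rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(20, 2))   # class 0 (majority), training rows only
X_minor = rng.normal(3.0, 1.0, size=(5, 2))    # class 1 (minority), training rows only

# Create just enough synthetic minority rows to match the majority count.
X_new = smote_like_oversample(X_minor, n_new=len(X_major) - len(X_minor))

X_bal = np.vstack([X_major, X_minor, X_new])
y_bal = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(X_new)))

print(np.bincount(y_bal))  # [20 20]
```

Only training rows are resampled. The test set keeps its original class mix so that the evaluation reflects the imbalance the model will face in practice.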

In [None]:
###
### YOUR CODE HERE
###
from imblearn.over_sampling import SMOTE

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to training data and overwrite X_train and y_train
X_train, y_train = smote.fit_resample(X_train, y_train)

# Check class distribution to verify balance
print("Balanced y_train distribution:")
print(y_train.value_counts())




Balanced y_train distribution:
Attrition
0    913
1    913
Name: count, dtype: int64


In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

y_train.value_counts()[0]

np.int64(913)

**Question 7 test case is given below**

In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

try:
    if (y_train.value_counts()[0] - y_train.value_counts()[1]) < 2:
        score['question 7'] = 'pass'
    else:
        score['question 7'] = 'fail'
except:
    score['question 7'] = 'fail'
score

{'question 1': 'pass',
 'question 2': 'pass',
 'question 3': 'pass',
 'question 4': 'pass',
 'question 5': 'pass',
 'question 6': 'pass',
 'question 7': 'pass'}

# Question 8

1. Train a decision tree model using the `X_train` and `y_train` data.
2. Make predictions using the test dataset, `X_test`.
3. Compute the model accuracy on the test set (`X_test` and `y_test`) and save the value to a variable called `accuracy`.
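
The three steps can be sketched on synthetic data as shown below. Because accuracy on an imbalanced test set can look reasonable even for a trivial model, the sketch also scores a majority-class baseline for comparison. That comparison is an addition for illustration, not something the assignment asks for, and all names and data here are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data (about 85% class 0); not the assignment dataset.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

# Steps 1 and 2: train the tree, then predict on the held-out rows.
tree = DecisionTreeClassifier(random_state=0).fit(X_a, y_a)
tree_pred = tree.predict(X_b)

# Step 3: accuracy is the fraction of test rows predicted correctly.
tree_acc = accuracy_score(y_b, tree_pred)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_a, y_a)
baseline_acc = accuracy_score(y_b, baseline.predict(X_b))

print(f"tree accuracy:     {tree_acc:.3f}")
print(f"baseline accuracy: {baseline_acc:.3f}")
```

If the tree's accuracy is close to the baseline, metrics such as recall or a confusion matrix for the minority class may say more about how well leavers are being identified.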



In [None]:
###
### YOUR CODE HERE
###
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Initialize and train the decision tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Step 2: Make predictions on the test set
y_pred = dt_model.predict(X_test)

# Step 3: Compute accuracy
accuracy = accuracy_score(y_test, y_pred)



In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

accuracy

0.8097826086956522

**Question 8 test case is given below**

In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

try:
    if (accuracy>.7):
        score['question 8'] = 'pass'
    else:
        score['question 8'] = 'fail'
except:
    score['question 8'] = 'fail'
score

{'question 1': 'pass',
 'question 2': 'pass',
 'question 3': 'pass',
 'question 4': 'pass',
 'question 5': 'pass',
 'question 6': 'pass',
 'question 7': 'pass',
 'question 8': 'pass'}

# Total Score
* The code below computes your total score. You must run it.

In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL

total_score = 0
for i in list(score.values()):
    if i=='pass':
        total_score = total_score + 12.5
print('your total score is: ', round(total_score))

your total score is:  100


* The cell below shows the individual test case results.

In [None]:
### RUN THIS CELL
### DO NOT CHANGE OR DELETE THIS CELL
### CHANGING OR DELETING THIS CELL MAY IMPACT YOUR GRADE
### AGAIN - DO NOT CHANGE OR DELETE THIS CELL
score

{'question 1': 'pass',
 'question 2': 'pass',
 'question 3': 'pass',
 'question 4': 'pass',
 'question 5': 'pass',
 'question 6': 'pass',
 'question 7': 'pass',
 'question 8': 'pass'}