# Homework: KNN Classification for Employee Attrition

**Total: 120 points** • **Questions: 11**

This homework uses one dataset: **ibm_attrition** (CSV). You will practice:

- Working with a close-to-reality HRIS dataset addressing a recurrent HR problem: predicting employee attrition based on a rich set of features.
- Feature scaling
- Feature Selection
- Splitting data into train/test sets
- Developing an understanding of where the test split should and SHOULD NOT be used, to prevent test signal leakage
- Fitting a KNN classifier to predict which employees are most at risk of leaving the company
- Evaluating model performance against the test split and finding optimal KNN parameters to maximize evaluation metrics

## Instructions (important)

- Do **not** hard-code answers; compute them from the data.
- Some questions ask you to create specific variables. **Name them exactly** as requested.
- If you're using Google Colab, you need to upload the downloaded dataset to your Colab Files section.
- For much of this homework, your solution will be self-guided. You may refer back to the lecture notebook for the steps you need to follow and the order in which to follow them.
- Following guidance from the lecture will result in an **acceptable solution**. But for a **perfect solution**, further experimentation and exploration is needed.
- You may use additional tools and functions from `sklearn` but using any other libraries besides the ones provided in this notebook is **strictly prohibited**.

### Grading

This notebook uses autograding:

- The major part of the grade for this assignment comes from **Q7** but this question also depends on Q6 and others. Make sure you budget your effort proportionally.
- **Answer cells** are marked as `# YOUR CODE HERE` or `# YOUR ANSWER HERE` and will be graded.
- Remove `raise NotImplementedError()` once you start working on a solution.
- **Do not** edit the content of LOCKED cells.
- **Do not** attempt to DELETE or MOVE any of the included cells.
- You MAY ADD **additional code cells** to experiment. You may remove these added blocks once you're done with your work. But if you intend to show your work (graph, summaries etc.) or another cell depends on the content of the created cell, you may keep it.
- Besides the sanity check tests visible to you, we might use additional rigorous hidden tests that are only available after submission. Double-check your work for accuracy and do not rely on sanity checks.
- This notebook contains metadata for tracking. Do not share your notebook or create a new notebook from scratch.
- If anything breaks, save your work and download a fresh copy of the notebook from Canvas. You can copy your finished code and insert that block by block into the new copy.


In [124]:
# Setup

import hashlib

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

RANDOM_STATE = 2025

# Plot style
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)


def dataframe_digest(df: pd.DataFrame) -> str:
    """
    Returns the hex digest of the hash of a given pandas DataFrame.
    """
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    h = hashlib.sha256()
    h.update(row_hashes.values.tobytes())
    return h.hexdigest()

---
# Data Preprocessing

We are using the `ibm_attrition.csv` for this homework.

We first form and process our target vector `y` and feature matrix `X`.


In [125]:
df = pd.read_csv("ibm_attrition.csv")

# Dropping columns with no significant contribution.
df.drop(columns=["EmployeeCount", "EmployeeNumber", "StandardHours"], inplace=True)

print(f"Shape of df: {df.shape}")

df.head()

Shape of df: (1470, 32)


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EnvironmentSatisfaction,Gender,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,2,Female,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,3,Male,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,4,Male,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,4,Female,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,Male,...,3,4,1,6,3,3,2,2,2,2


In [126]:
assert (
    dataframe_digest(df)
    == "3e7e5b96a1eacc81f5b775506e69ef95cd3476d75bf6e0123dec51f823bc06db"
), (
    "Dataframe digest doesn't match. Either your data file is corrupted or you're using a different version of the pandas library."
)

### Q1 (3 pts) – Manual Encoding

Use a custom Python dictionary to encode the target variable `"Attrition"`. `No` should be mapped to 0 and `Yes` should be mapped to 1. Assign the mapped object to a variable named `y`.

In [127]:
df["Attrition"].value_counts()

Attrition
No     1233
Yes     237
Name: count, dtype: int64

In [128]:
# y = ...
### BEGIN SOLUTION
label_map = {"No": 0, "Yes": 1}
y = df["Attrition"].map(label_map)
### END SOLUTION

In [129]:
assert isinstance(y, pd.core.series.Series), "y should be a pandas Series object"
### BEGIN HIDDEN TESTS
assert set(np.unique(y)) == {0, 1}
assert y.shape == (1470,)
### END HIDDEN TESTS

### Q2 (2 pts) – Using LabelEncoder

Now perform the same encoding of the target variable using `LabelEncoder()` from scikit-learn and assign the result to `y2`. The results of this question and the previous one should be numerically equivalent, but they may have different types.
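
As a small illustration of how `LabelEncoder` assigns integer codes, the sketch below uses a made-up array of labels rather than the homework data. It relies on the documented behavior that `classes_` stores the sorted unique labels, so `"No"` receives code 0 and `"Yes"` receives code 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the "Attrition" column (illustrative values only).
labels = np.array(["Yes", "No", "No", "Yes", "No"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's position is its integer code.
print(encoder.classes_)  # ['No' 'Yes']
print(encoded)           # [1 0 0 1 0]

# inverse_transform maps integer codes back to the original strings.
decoded = encoder.inverse_transform(encoded)
```

Because the codes follow the sorted order of the labels, the result happens to agree with the manual mapping from Q1 for this particular target.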

In [130]:
target_encoder = LabelEncoder()
# y2 = ...
### BEGIN SOLUTION
y2 = target_encoder.fit_transform(df["Attrition"])
### END SOLUTION
# df["Attrition"].value_counts()

In [131]:
assert isinstance(y2, np.ndarray), "y2 should be a NumPy ndarray object"
### BEGIN HIDDEN TESTS
assert y2.shape == (1470,)
### END HIDDEN TESTS

In [132]:
assert np.isclose(sum(y), sum(y2), atol=0.0001), (
    "y and y2 should be numerically similar"
)

After we're done processing y vector, we can form our feature matrix `X`.

In [133]:
X = df.drop(columns="Attrition")

## Categorical vs. Numerical Features

Numerical and categorical columns call for different treatment. In the lecture we covered numerical features, which we simply standardize. Here we drop the categorical features, since the numerical features alone provide good enough decision boundaries.

> **Advanced Topic:**  
> For categorical features we need to employ an encoding scheme (remember we also encoded our binary target y). However, features use different encoding schemes than the target. One of the most popular is one-hot encoding, and sklearn provides a `OneHotEncoder()` class for this purpose. This is an advanced topic that you may explore on your own, but working with this encoder is somewhat similar to working with the `StandardScaler()` class.  
> In the starter code below, we are simply dropping the categorical columns so we don't have to deal with them. If you decide to include them back in, you may do that in a future block of code. Remember that your notebook should execute from top to bottom, so make sure you're not accidentally overwriting your own code in lower blocks.  
> The object `X` will remain untouched which you can use to slice for categorical features.

In [134]:
# Here we separate out the categorical and numerical feature names.

numerical_feats = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
# You may use the categorical feature names stored in this object for
# indexing if you choose to include them back in.
categorical_feats = X.select_dtypes(include="object").columns.tolist()

print("categorical columns that will be dropped:")
print(categorical_feats)
# We continue working on X_num while leaving X on its own.
X_num = X.drop(columns=categorical_feats)
print("\n")
print(f"X_num shape: {X_num.shape}")

categorical columns that will be dropped:
['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', 'OverTime']


X_num shape: (1470, 23)


# Splitting and Scaling Sets

Pay attention to the order of operations here. We first split the data, then work on scaling both splits using parameters obtained **ONLY** from train split.  
Since KNN relies on distance metrics, it is very important to have features on the same scale. This is why we must choose a scaling scheme like standardization.  
The target `y` does not need scaling. Why?
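
To make the point about distances concrete, the sketch below uses five made-up employees described by age and monthly income. The values and the helper `euclidean` are illustrative assumptions, and for brevity the toy standardizes a single set rather than a train/test pair.

```python
import numpy as np

# Five toy employees: [Age in years, MonthlyIncome in dollars] (made-up values).
people = np.array([
    [25.0, 5000.0],  # A
    [55.0, 5100.0],  # B: very different age, similar income
    [26.0, 5400.0],  # C: similar age, slightly different income
    [40.0, 9000.0],  # D
    [60.0, 2000.0],  # E
])


def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


# Raw units: income differences dominate, so B looks closer to A than C does.
raw_ab = euclidean(people[0], people[1])
raw_ac = euclidean(people[0], people[2])

# Standardize each column, then measure the same distances again.
scaled = (people - people.mean(axis=0)) / people.std(axis=0)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print(f"raw:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"scaled: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```

Before scaling, a 30-year age gap counts for less than a 400-dollar income gap. After scaling, C becomes A's nearer neighbor, which is what we would intuitively expect.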

> **Advanced Topic:**  
> If you choose to explore using categorical features, keep note of another complication.  
> You must split first, then scale the numerical features while encoding the categorical features.  
> Then you can concatenate them back into a unified feature matrix.
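
The order of operations described in this advanced topic could be sketched as follows. Everything here is an assumption made for illustration: the toy DataFrame, the names `toy_scaler`, `toy_ohe`, and `assemble`, and the choice of `np.hstack` for joining the two parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one numerical and one categorical feature (made-up values).
toy = pd.DataFrame({
    "Age": [25, 32, 41, 29, 50, 38, 45, 27],
    "OverTime": ["Yes", "No", "No", "Yes", "No", "Yes", "No", "No"],
})
toy_y = pd.Series([1, 0, 0, 1, 0, 1, 0, 0])

# 1) Split first.
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy, toy_y, test_size=0.25, random_state=0
)

# 2) Fit the scaler and the encoder on the train split only.
toy_scaler = StandardScaler().fit(toy_train[["Age"]])
toy_ohe = OneHotEncoder(handle_unknown="ignore").fit(toy_train[["OverTime"]])


def assemble(frame):
    """Scale the numerical column, one-hot encode the categorical one, and join them."""
    num_part = toy_scaler.transform(frame[["Age"]])
    cat_part = toy_ohe.transform(frame[["OverTime"]]).toarray()
    return np.hstack([num_part, cat_part])


# 3) Apply the same fitted transformers to both splits.
toy_train_ready = assemble(toy_train)
toy_test_ready = assemble(toy_test)
print(toy_train_ready.shape, toy_test_ready.shape)
```

Both splits end up with the same columns in the same order, because the test split is transformed with objects that were fitted on train.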

In [135]:
# The choice of random_state, which we hard-coded at the top of the notebook, can change your results.
# We fix it for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X_num, y, test_size=0.25, random_state=RANDOM_STATE, stratify=y
)

## Q3 (4 pts) – Manual Feature Scaling

Scale the train and test splits of feature matrix X manually. Do not use `sklearn` provided methods here. Store the results in `X_train_scaled` and `X_test_scaled` objects respectively.

In [136]:
# X_train_scaled = ...
# X_test_scaled = ...

### BEGIN SOLUTION
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std
### END SOLUTION

In [137]:
assert isinstance(X_train_scaled, pd.core.frame.DataFrame), (
    "X_train_scaled should be a pandas DataFrame"
)
assert isinstance(X_test_scaled, pd.core.frame.DataFrame), (
    "X_test_scaled should be a pandas DataFrame"
)

### BEGIN HIDDEN TESTS
assert np.allclose(X_train_scaled.mean(axis=0), 0)
assert np.allclose(X_train_scaled.std(axis=0), 1)
assert not np.allclose(X_test_scaled.mean(axis=0), 0)
assert not np.allclose(X_test_scaled.std(axis=0), 1)
assert X_train_scaled.shape == (1102, 23)
assert X_test_scaled.shape == (368, 23)
### END HIDDEN TESTS

## Q4 (4 pts) – Examining Feature Scaling

In the code block below show if `X_train_scaled` and `X_test_scaled` are standardized.

In [138]:
### BEGIN SOLUTION
### END SOLUTION

Are these two feature sets standardized? Explain why.

### BEGIN SOLUTION
`X_train_scaled` is standardized: each column has a mean of 0 and a standard deviation of 1. `X_test_scaled` is only approximately standardized, because we scaled the test split with parameters obtained from the train split. This is expected.
### END SOLUTION

## Q5 (2 pts) – Feature Scaling Using StandardScaler

Now let's scale both the train and test X splits again, this time using the `StandardScaler` provided by sklearn.  
- Save the output to the `X_train_scaled_s` and `X_test_scaled_s` objects.  
- The results should be numerically equivalent to those you obtained manually in a previous question.
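
One detail worth knowing when comparing manual and sklearn scaling: pandas `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`). The sketch below demonstrates this on a made-up single-column DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy single-feature frame (made-up values).
toy = pd.DataFrame({"Age": [25.0, 32.0, 41.0, 29.0, 50.0]})
n = len(toy)

# pandas .std() defaults to ddof=1 (sample standard deviation).
manual_ddof1 = (toy - toy.mean()) / toy.std()

# StandardScaler divides by the ddof=0 (population) standard deviation.
sk = StandardScaler().fit_transform(toy)

# The two results differ by the constant factor sqrt(n / (n - 1)).
ratio = sk[:, 0] / manual_ddof1["Age"].to_numpy()
print(ratio, np.sqrt(n / (n - 1)))

# Passing ddof=0 to pandas reproduces the sklearn result exactly.
manual_ddof0 = (toy - toy.mean()) / toy.std(ddof=0)
```

With five rows the factor is noticeable (about 1.118). With a train split of 1,102 rows it is sqrt(1102/1101), roughly 1.0005, so the two approaches agree only up to a small tolerance.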

In [139]:
scaler = StandardScaler()
# X_train_scaled_s = ...
# X_test_scaled_s = ...
### BEGIN SOLUTION
X_train_scaled_s = scaler.fit_transform(X_train)
X_test_scaled_s = scaler.transform(X_test)
### END SOLUTION

> **Advanced Topic:**
> You can examine the scaler object to see the feature names.
> These are stored in the object after .fit() is called on StandardScaler instance and a dataset is passed to it.

In [140]:
assert isinstance(X_train_scaled_s, np.ndarray), (
    "X_train_scaled_s should be a numpy ndarray type"
)
assert isinstance(X_test_scaled_s, np.ndarray), (
    "X_test_scaled_s should be a numpy ndarray type"
)

In [141]:
scaler.get_feature_names_out()

array(['Age', 'DailyRate', 'DistanceFromHome', 'Education',
       'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement',
       'JobLevel', 'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate',
       'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'], dtype=object)

In [142]:
### BEGIN HIDDEN TESTS
assert np.allclose(X_train_scaled.to_numpy(), X_train_scaled_s, atol=0.01)
assert np.allclose(X_train_scaled_s.mean(axis=0), 0)
assert np.allclose(X_train_scaled_s.std(axis=0), 1)
assert not np.allclose(X_test_scaled_s.mean(axis=0), 0)
assert not np.allclose(X_test_scaled_s.std(axis=0), 1)
### END HIDDEN TESTS

---
# Feature Selection and Model Fitting

## Q6 (8 pts) – Feature Selection

Here you need to perform a few steps to filter and select the features that provide the best model.  
- What constitutes a good model is usually a combination of explainability as well as the performance of predictions.  
- While explaining the process is important, for the purpose of this assignment we aim to maximize the prediction performance of the model.  
- Feature selection is usually a cyclic and iterative approach. You first select some features based on what you see from the data or theories you have about their predictive power. Then you fit the model. Then you come back to drop or add other features. Rinse and repeat.  
- Model fitting will follow this step.

> You can explain your process in the provided block below.  
> You may also use additional code blocks for tables, summaries, graphs etc.  
> When done, you'd need to store your feature names in the provided `selected_features` list.
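
One possible first pass at feature selection, sketched here on synthetic data, is to rank features by the absolute correlation with the target computed on the train split only. This is just one heuristic among many and not necessarily the approach from the lecture. The column names `informative` and `noise` are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic train split: one informative feature and one pure-noise feature.
toy_y_train = pd.Series(rng.integers(0, 2, size=n), name="target")
toy_X_train = pd.DataFrame({
    "informative": toy_y_train + rng.normal(0, 0.5, size=n),
    "noise": rng.normal(0, 1, size=n),
})

# Absolute Pearson correlation of each feature with the binary target.
ranking = toy_X_train.corrwith(toy_y_train).abs().sort_values(ascending=False)
print(ranking)
```

Correlation captures only linear, one-feature-at-a-time relationships, so treat a ranking like this as a starting point for the fit-evaluate-revise loop described above rather than a final answer.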

Explanations:

### BEGIN SOLUTION
### END SOLUTION

In [143]:
# selected_features = [
#     "Age",
#     "DistanceFromHome",
#     "JobInvolvement",
# ]

### BEGIN SOLUTION
selected_features = [
    "Age",
    "DistanceFromHome",
    "JobInvolvement",
    "JobLevel",
    "JobSatisfaction",
    "MonthlyIncome",
    "StockOptionLevel",
    "YearsWithCurrManager",
    "PerformanceRating",
    "RelationshipSatisfaction",
    "WorkLifeBalance",
]
### END SOLUTION

In [144]:
assert isinstance(selected_features, list), "selected_features must be a list"
assert len(selected_features) >= 3, "You should select at least 3 features"
assert all(isinstance(el, str) for el in selected_features), (
    "All elements of selected_features should be strings."
)

### BEGIN HIDDEN TESTS
assert set(selected_features).issubset(set(X_train_scaled.columns))
assert len(set(selected_features)) == len(selected_features)
assert "Attrition" not in selected_features
### END HIDDEN TESTS

## Q7 (80 pts) – Model Fitting and Evaluation

Now it's time to train your model.  
- Besides `n_neighbors` (k), our classifier has other hyper-parameters that you may choose to tweak.  
- Remember this is an iterative process. You pick some hyper-parameters, fit the model, evaluate, tweak the parameters, then fit and evaluate again.
- The other part of the iteration is feature selection. You may need to go back to that many times to fit a great model.
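
The tweak-fit-evaluate loop could be sketched as below. Note the assumptions: the data is synthetic (generated with `make_classification`), the candidate values are arbitrary, and candidates are scored with 5-fold cross-validation on the train data via `cross_val_score`. Cross-validation is a swapped-in technique that this notebook does not otherwise use; it is shown as one way to compare settings without consulting the test split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for a scaled train split.
toy_X, toy_y = make_classification(
    n_samples=400, n_features=6, n_informative=4, weights=[0.84, 0.16], random_state=0
)

results = []
for k in (3, 7, 15, 25):
    for weights in ("uniform", "distance"):
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        # Mean 5-fold cross-validated F1 for the positive class.
        score = cross_val_score(knn, toy_X, toy_y, cv=5, scoring="f1").mean()
        results.append((k, weights, score))

# Pick the combination with the highest mean F1.
best_k, best_weights, best_score = max(results, key=lambda r: r[2])
print(best_k, best_weights, round(best_score, 3))
```

The same loop can be extended with other `KNeighborsClassifier` arguments such as `metric` or `p`.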

**Grading:**  
- You must optimize for your model's accuracy and F1 score (F1 for attrition class)  
- This question is worth 80/120 of your total score for this assignment
- 50/80 comes from F1 score for attrition class

| F1 Score     | Points |
| ------------ | ------ |
| >= 0.36      | 50     |
| 0.33 - 0.359 | 47     |
| 0.30 - 0.329 | 42     |
| 0.25 - 0.299 | 35     |
| 0.20 - 0.249 | 30     |
| 0.12 - 0.199 | 20     |
| 0.06 - 0.119 | 10     |
| < 0.06       | 0      |

- 30/80 comes from the overall accuracy

| Accuracy     | Points |
| ------------ | ------ |
| >= 0.85      | 30     |
| 0.83 - 0.849 | 22     |
| 0.81 - 0.829 | 15     |
| < 0.81       | 0      |


> **Note:**  
> We use your fitted model named `model` to grade your work.


In [145]:
# A base solution is provided here.
# Slicing our feature sets to only include our selected_features
X_train_retained = X_train_scaled[selected_features]
X_test_retained = X_test_scaled[selected_features]

# You can supply different parameters to metric, weights, and p arguments of the classifier
# to override the default values. Some of these might improve evaluation metrics
# Defining the model
model = KNeighborsClassifier(n_neighbors=25)
# Fitting to train set
model.fit(X_train_retained, y_train)
# Making predictions
y_pred = model.predict(X_test_retained)

# Calculating overall accuracy and F1 score for attrition class.
f1 = f1_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print(f"F1 Score for Attrition Class: {f1:.3f}")
print(f"Overall Accuracy: {accuracy:.3f}")

# Most of the base solution is implemented for you.
# You can simply remove the error code and run.

### BEGIN SOLUTION
### END SOLUTION

F1 Score for Attrition Class: 0.063
Overall Accuracy: 0.840


In [None]:
assert isinstance(model, KNeighborsClassifier), (
    "model must be a sklearn KNeighborsClassifier."
)
assert hasattr(model, "classes_"), "Model does not appear fitted (missing classes_)."
assert hasattr(model, "n_features_in_"), (
    "Model does not appear fitted (missing n_features_in_)."
)
assert X_train_retained.shape[1] == len(selected_features), (
    "Ensure you are using the right feature subset for model training."
)

> **DO NOT delete the following empty cells. They contain your grading mechanism!**

In [152]:
### BEGIN HIDDEN TESTS
def f1_points(f1: float) -> int:
    f1 = float(f1)
    if f1 >= 0.36:
        return 50
    elif 0.33 <= f1 < 0.36:
        return 47
    elif 0.30 <= f1 < 0.33:
        return 42
    elif 0.25 <= f1 < 0.30:
        return 35
    elif 0.20 <= f1 < 0.25:
        return 30
    elif 0.12 <= f1 < 0.20:
        return 20
    elif 0.06 <= f1 < 0.12:
        return 10
    else:
        return 0


def accuracy_points(acc: float) -> int:
    acc = float(acc)
    if acc >= 0.85:
        return 30
    elif 0.83 <= acc < 0.85:
        return 22
    elif 0.81 <= acc < 0.83:
        return 15
    else:
        return 0


y_pred = model.predict(X_test_retained)

f1s = f1_score(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
assert 0.0 <= f1s <= 1.0, "F1 out of range."
assert 0.0 <= acc <= 1.0, "Accuracy out of range."

f1v = f1_points(f1s)
acv = accuracy_points(acc)
total_score = acv + f1v
print(f"F1 points: {f1v}")
print(f"Accuracy points: {acv}")
print(f"The total score for this question: {total_score}")
### END HIDDEN TESTS

F1 points: 10
Accuracy points: 22
The total score for this question: 32


In [153]:
### BEGIN HIDDEN TESTS
assert X_test_retained.shape[0] == 368
assert y_test.shape[0] == 368
assert model.n_samples_fit_ == X_train_retained.shape[0]
assert np.allclose(np.asarray(model._fit_X), np.asarray(X_train_retained))
assert np.array_equal(np.asarray(model._y).ravel(), np.asarray(y_train).ravel())
### END HIDDEN TESTS

In [None]:
### BEGIN HIDDEN TESTS
assert total_score >= 40
### END HIDDEN TESTS

In [None]:
### BEGIN HIDDEN TESTS
assert total_score >= 60
### END HIDDEN TESTS

In [None]:
### BEGIN HIDDEN TESTS
assert total_score >= 80
### END HIDDEN TESTS

> Hints for getting to 100% score:
> - Look for clues as to why the F1 score is low.
> - Pay attention to the relationship between accuracy and F1 score. Is there a way to improve one without sacrificing the other?
> - Look for other resources and signals in the dataset.
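
As a sketch of why accuracy and F1 can diverge on this data, the example below scores a degenerate "model" that always predicts no attrition. It uses the class counts printed by `value_counts()` earlier in the notebook (1233 `No`, 237 `Yes`) over the full dataset, not the test split.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts from the value_counts() output: 1233 "No" (0) and 237 "Yes" (1).
y_true = np.array([0] * 1233 + [1] * 237)

# A degenerate "model" that always predicts the majority class.
y_always_no = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_no)
# zero_division=0 suppresses the warning raised when no positives are predicted.
baseline_f1 = f1_score(y_true, y_always_no, zero_division=0)

print(f"Accuracy: {baseline_accuracy:.3f}")  # 0.839
print(f"F1:       {baseline_f1:.3f}")        # 0.000
```

An accuracy near 0.84 is therefore achievable without identifying a single leaver, which is why the F1 score for the attrition class carries most of the weight in grading.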

---
# Predictions and Visualizations

## Q8 (6 pts) – Instance Prediction

Imagine this scenario:  
Our model is trained and fully deployed to our production environment. The HR team sends us a list of employees and asks us to determine whether they are likely to leave the company. The managers of these employees have determined that they might be on the verge of calling it quits!  
You look at the list and see `John`. John is your buddy and he has complained about the work many times to you and said he's looking for work elsewhere.  
Here is what HRIS API call returns for John. Use this data and make a prediction using your model.

Save your model output to an object named `john_class`

> **Hint:**  
> This is a dictionary object (similar to what most API calls can return). How can you pass this to `.predict()` method from your model instance?

In [None]:
john = {
    "Age": 30,
    "BusinessTravel": "Travel_Frequently",
    "DailyRate": 109,
    "Department": "Research & Development",
    "DistanceFromHome": 5,
    "Education": 3,
    "EducationField": "Medical",
    "EnvironmentSatisfaction": 2,
    "Gender": "Female",
    "HourlyRate": 60,
    "JobInvolvement": 3,
    "JobLevel": 1,
    "JobRole": "Laboratory Technician",
    "JobSatisfaction": 2,
    "MaritalStatus": "Single",
    "MonthlyIncome": 2422,
    "MonthlyRate": 25725,
    "NumCompaniesWorked": 0,
    "Over18": "Y",
    "OverTime": "No",
    "PercentSalaryHike": 17,
    "PerformanceRating": 3,
    "RelationshipSatisfaction": 1,
    "StockOptionLevel": 0,
    "TotalWorkingYears": 4,
    "TrainingTimesLastYear": 3,
    "WorkLifeBalance": 3,
    "YearsAtCompany": 3,
    "YearsInCurrentRole": 2,
    "YearsSinceLastPromotion": 1,
    "YearsWithCurrManager": 2,
}

In [None]:
# This is a suggestion, there are many ways to read in this datapoint.

# j_df = ... # First create a DataFrame from john
# j_num = ... # Then select the numerical features (we have their names in the notebook)
# j_scaled = ... # Then scale using your previous scaler object
# j_scaled_df = ... # Then turn back into a DataFrame
# j_selected = ... # Filter by your selected features
# john_class = model.predict(...) # Finally make a prediction

### BEGIN SOLUTION
j_df = pd.DataFrame([john])
j_num = j_df[numerical_feats]
j_scaled = scaler.transform(j_num)
j_scaled_df = pd.DataFrame(j_scaled, columns=numerical_feats)
j_selected = j_scaled_df[selected_features]
john_class = model.predict(j_selected)[0]
### END SOLUTION

In [None]:
assert isinstance(john_class, (np.int64, np.int32, int, np.ndarray)), (
    "john_class should be an integer class label or a NumPy ndarray"
)

### BEGIN HIDDEN TESTS
if isinstance(john_class, np.ndarray):
    assert john_class.shape == (1,)
    assert john_class[0] == 1
else:
    assert john_class == 1
### END HIDDEN TESTS

## Q9 (2 pts) – Instance Prediction Continued

Explain what the outcome of your model's prediction for John means.
### BEGIN SOLUTION
1 means attrition, 0 means no attrition.
### END SOLUTION

## Q10 (4 pts) – Visualize Decision Boundaries

Pick two of the most influential features in your feature set (the ones you included in your final model), and plot decision boundaries for a range of K values. This range should include your chosen K value for your model as well.

> **Notes:**
> - Picking the most influential or determining feature can be somewhat subjective, and you may need to refer to what you did for your feature selection.
> - You can copy the entire `plot_decision_boundary()` function definition from the lecture notebook and use it the same way we did there.
> - Remember, the best K for the particular pair of features you choose here is not necessarily the best K for your fully fitted model. Why?
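
For readers without the lecture notebook at hand, the general idea behind a decision-boundary plot can be sketched as follows. The function `sketch_decision_boundary` is a hypothetical stand-in and not the lecture's `plot_decision_boundary()`. The data is synthetic, and the K values and grid step are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature data standing in for a scaled feature pair.
toy_X, toy_y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)


def sketch_decision_boundary(X2, y, k, ax, step=0.1):
    """Fit KNN on two features and shade the predicted class over a grid."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X2, y)
    # Build a grid that covers the data with a margin of one unit.
    xx, yy = np.meshgrid(
        np.arange(X2[:, 0].min() - 1, X2[:, 0].max() + 1, step),
        np.arange(X2[:, 1].min() - 1, X2[:, 1].max() + 1, step),
    )
    # Predict a class for every grid point and reshape back to the grid.
    grid_pred = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, grid_pred, alpha=0.3)
    ax.scatter(X2[:, 0], X2[:, 1], c=y, edgecolor="k", s=20)
    ax.set_title(f"k = {k}")
    return grid_pred


k_values = (1, 15, 45)
fig, axes = plt.subplots(1, len(k_values), figsize=(15, 4))
grids = [sketch_decision_boundary(toy_X, toy_y, k, ax) for k, ax in zip(k_values, axes)]
```

Small K values tend to produce jagged regions that wrap around individual points, while larger K values give smoother boundaries. Comparing the panels side by side helps when choosing a K for a given feature pair.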

In [None]:
### BEGIN SOLUTION
### END SOLUTION

## Q11 (5 pts) – Visualize Decision Boundaries Continued

Explain which K value makes the most sense for this particular feature pair. Can you explain what the decision boundaries mean at this K value? How do they explain the classification of observations?

### BEGIN SOLUTION
### END SOLUTION