In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw7.ipynb")

# Homework 7: Studying Employee Attrition With K-Means

Name:

Student ID:

Collaborators:


## Instructions

In this homework, we will be exploring a more realistic application of clustering. It might be helpful to review **Lab 7 (K-Means Clustering)** first. Most of the things we ask you to do in this homework are explained in the lab. In general, you should feel free to import any package that we have previously used in class. Ensure that all plots have the necessary components that a plot should have (e.g. axes labels, a title, a legend).

Furthermore, in addition to recording your collaborators on this homework, please also remember to cite all external sources used when finishing this assignment. This includes peers, TAs, and links to online sources. Note that these citations do not free you from your obligation to submit your _own_ code and write-ups; however, they will be taken into account during the grading and regrading process.

### Submission instructions
* Submit this Python notebook, with your answers in the code cells, as your homework submission.
* **Feel free to add as many cells as you need to** — just make sure you don't change what we gave you. 
* **Does it spark joy?** Note that you will be partially graded on the presentation (_cleanliness, clarity, comments_) of your notebook so make sure you [Marie Kondo](https://lifehacker.com/marie-kondo-is-not-a-verb-1833373654) your notebook before submitting it. Remember that part of data science is clearly presenting your results to others.

### Some imports and configurations

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from utility.util_hw import load_toy, configure_plots
from utility.util_hw import sample_centroids, fit, plot_kmeans

# run this cell twice to have pretty plots
configure_plots()

In [None]:
configure_plots()

## 1. Do the Initial Centroids Matter?

Let's investigate whether the $k$-means algorithm is sensitive to its initial starting points. In the cell below, we generate a toy dataset, this time with five clusters to make things more obvious.
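To make "sensitive to the starting points" concrete, here is a minimal sketch on a hand-made dataset rather than the toy data. The function `lloyd` is a hypothetical stand-in written only for this illustration; it is not the course's `fit` helper, and its signature is an assumption. It alternates the usual assignment and update steps from a given set of initial centroids.

```python
import numpy as np

def lloyd(X, init_centroids, n_iter=100):
    """Hypothetical minimal k-means loop (not the course's `fit` helper)."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid: shape (n, k)
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignments = sq_dist.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if it has none)
        new_centroids = np.array([
            X[assignments == j].mean(axis=0) if np.any(assignments == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignments and total sum of squared distances
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignments = sq_dist.argmin(axis=1)
    ssd = sq_dist[np.arange(len(X)), assignments].sum()
    return centroids, assignments, ssd

# Four corners of a 4-by-2 rectangle, clustered with k = 2
X_rect = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])

# Start between the left pair and the right pair
_, labels_a, ssd_a = lloyd(X_rect, [[0.0, 1.0], [4.0, 1.0]])
# Start between the bottom pair and the top pair
_, labels_b, ssd_b = lloyd(X_rect, [[2.0, 0.0], [2.0, 2.0]])

print(labels_a, ssd_a)
print(labels_b, ssd_b)
```

Both runs stop after the first iteration because the centroids no longer move, yet they settle on different groupings with different total squared distances.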

In [None]:
X, _ = load_toy(500, 5, width=0.07, random_state=4)

<!-- BEGIN QUESTION -->

### Problem 1.1

Let's take a quick peek at what the data looks like.

**Do this!** Plot the toy data $X$. Make sure that your plot has all of the necessary components.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Problem 1.2

**Do this!** Using the functions `sample_centroids`, `fit`, and `plot_kmeans`, experiment with different `random_state`s to see if you can observe different final centroids depending on the initial starting points. Use the data sampled above and produce two plots in the two code cells provided below; at least one of them should show a reasonably nice $k$-means solution.
> **Hint:** `fit` as imported from `util_hw` returns two arrays, the centroids **and** assignments.  

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Problem 1.3

**Write-up!** You should have seen quite different clusterings based on different initial centroids. What might be causing this to happen? How might we better choose our initial centroids?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 2. Using `sklearn` for $k$-means 

In this section, we will explore the [$k$-means model from `sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) and discuss some of the additional features supplied by their implementation. Before we begin, we suggest that you work through Lab 7 if you haven't already, as we will assume familiarity with the terms used there.

### Looking Into the Model

Now let's create a new $k$-means model and learn about its interface. In general, you will find that the $k$-means model from `sklearn` shares many of its methods with the other models that we have looked at. However, there are some notable differences.
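As a rough illustration of that shared interface, the sketch below fits a `KMeans` model on synthetic data and calls `fit`, `predict`, and `fit_predict`. The data and the names `X_demo`, `demo_model`, and `new_points` are made up for this example; `n_init=10` is passed explicitly only so the sketch behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 2D data: two tight groups of 20 points each (illustrative only)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])

demo_model = KMeans(n_init=10, random_state=0)

# Unlike the supervised models, fit() needs only X: there are no labels y
demo_model.fit(X_demo)

# predict() returns the index of the closest fitted centroid for each point
new_points = np.array([[0.1, -0.1], [4.9, 5.2]])
predicted = demo_model.predict(new_points)

# fit_predict() fits and returns the cluster index of every training point
demo_labels = demo_model.fit_predict(X_demo)

print(predicted)
print(demo_labels.shape)
```

One difference that shows up here is that clustering is unsupervised, so there is no target array anywhere in the calls.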

In [None]:
from sklearn.cluster import KMeans

model = KMeans()

<!-- BEGIN QUESTION -->

### Problem 2.1

**Write-up!** Use the IPython `?` operator to answer the following question: how do you specify the number of clusters you would like to fit?

### Note: Please remove these `?` operators before you submit your work; they cause problems with the autograder.

In [None]:
# use this cell to explore


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Problem 2.2

**Write-up!** Use the IPython `?` operator to answer the following questions: How does the model initialize centroids by default? How does it work and why is it better than choosing random starting centroids?

In [None]:
# use this cell to explore


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Problem 2.3

**Write-up!** Use the IPython `?` operator to answer the following questions: How does the model decide that the centroids have converged? Why might we need to adjust this based on our input data?

In [None]:
# use this cell to explore


_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Clustering the Toy Data with `sklearn`

Let's try using `sklearn` to cluster our data.

### Problem 3.1

**Do this!** Create and fit a _new_ `KMeans` model of our data with the default arguments except for `random_state`, which should be set to 11. Be sure to store the fitted centroids in `toy_centroids` and the cluster assignment indices in `toy_assignments` for later use.
> **Hint:** Check the scikit-learn [`KMeans` 🔗](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) documentation for how to get the centroids and assignments or use `KMeans?` and `KMeans.` + `tab`.

In [None]:
toy_model = ...
toy_assignments = ...
toy_centroids = ...

In [None]:
grader.check("q3ai")

<!-- BEGIN QUESTION -->

**Write-up!** How many centroids were fitted by the model? How many points were assigned to each cluster?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Problem 3.2

**Do this!** Create a plot showing the centroids that were produced by the model and the data points colored by their cluster assignment. Be sure to include all necessary plot components and remember that presentation matters.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Write-up!** Given this plot, do you think this is a reasonable clustering of the data?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Deciding How Many Clusters to Use
As we have seen for $k$-means, $k$ is the number of clusters/centroids that the algorithm will try to find. Choosing $k$ is an important task as it determines the output of the algorithm. Since $k$ is a parameter to our clustering algorithm, we can use a **model selection** strategy to do this.


### Problem 3.3

Consider the Sum of Squared Distances $SSD_j$, the sum of the squared distances from all points in the $j$th cluster to the corresponding cluster center $c_j$:
$$SSD_j = \sum_{i=1}^{n} \gamma_{ij} \;d(x_i,c_j)^2,$$
where $\gamma_{ij}$ is 1 if $x_i$ belongs to cluster $j$ and 0 otherwise. 

Then, the objective function that $k$-means minimizes is the sum of the $SSD_j$ over all clusters, $SSD = \sum_j SSD_j$. This means that we want to find clusters of points that are close to one another. We can estimate how close the cluster points are to one another by measuring how far each point assigned to the cluster is from its center.
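The definition above can be written out directly in NumPy. This is a small sketch on made-up numbers, assuming $d$ is the Euclidean distance; the arrays `X_demo`, `centroids_demo`, and `assignments_demo` are hypothetical and not part of the homework data.

```python
import numpy as np

# Four points, two centroids, and a fixed assignment of points to clusters
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 4.0]])
centroids_demo = np.array([[0.0, 1.0], [4.0, 2.0]])
assignments_demo = np.array([0, 0, 1, 1])

n_clusters = len(centroids_demo)

# gamma[i, j] is 1 if point i belongs to cluster j and 0 otherwise
gamma = np.eye(n_clusters)[assignments_demo]

# sq_dist[i, j] is d(x_i, c_j)^2, assuming Euclidean distance
sq_dist = ((X_demo[:, None, :] - centroids_demo[None, :, :]) ** 2).sum(axis=2)

# SSD_j: sum over points i of gamma_ij * d(x_i, c_j)^2
ssd_per_cluster = (gamma * sq_dist).sum(axis=0)

# SSD: sum of SSD_j over all clusters
ssd_total = ssd_per_cluster.sum()

print(ssd_per_cluster)  # [2. 8.]
print(ssd_total)        # 10.0
```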

**Write-up!** Why is it difficult to find the right number of clusters for a general clustering task with $d$-dimensional input? Think about the value of the objective function we are optimizing with respect to the number of clusters $k$.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

In the cell below, we will generate a dataset with an _unknown_ number of clusters.

In [None]:
X_unknown, _ = load_toy(1000, k=-1, random_state=6)

<!-- BEGIN QUESTION -->

**Do this!** Use the elbow method described in the lecture to find a good clustering for our data. Produce a plot that shows the model performance $SSD^{(k)}$ as a function of $k \in [1, 10]$, where $SSD^{(k)} = \sum_{j=1}^k SSD_j$.  Make sure to create new models when appropriate.
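For orientation only, here is a hedged sketch of the same idea on separate synthetic data (not `X_unknown`). It relies on the fitted model's `inertia_` attribute, which scikit-learn documents as the sum of squared distances of samples to their closest cluster center, instead of computing distances by hand. The `make_blobs` data and the names `X_blobs`, `ks`, and `ssds` are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (illustrative only)
X_blobs, _ = make_blobs(
    n_samples=150,
    centers=np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]),
    cluster_std=0.5,
    random_state=0,
)

ks = list(range(1, 11))
ssds = []
for k in ks:
    # A fresh model for every k, so no fitted state is reused between runs
    model_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_blobs)
    ssds.append(model_k.inertia_)

for k, ssd in zip(ks, ssds):
    print(k, round(ssd, 1))
```

Plotting `ssds` against `ks` gives the elbow curve; on data like this the curve drops steeply up to the true number of groups and flattens afterwards.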

In [None]:
from utility.util_hw import squared_distance

...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Write-up!** Describe how you would choose which $k$ to use. Then, choose the appropriate $k$.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**[Optional] Try this!** In the following cell, plot the dataset to see if your choice is reasonable. 
> **Note:** This kind of visual verification/evaluation is _only_ possible for 2D input data. 

In [None]:
# This is optional but useful!
# BEGIN SOLUTION
plt.scatter(X_unknown[:, 0], X_unknown[:, 1])

plt.title('Toy Clustering Data')
plt.xlabel('x1')
plt.ylabel('x2')

# END SOLUTION

## 3. Tackling Employee Attrition


A real problem that Human Resources (HR) departments in companies all over the world would like to address is employee attrition, or turnover. They would like to reduce the number of employees who leave the company as hiring new employees is expensive. In this section, we would like to see if we can make use of $k$-means to identify patterns in employee attrition so that we might suggest which areas an HR department should focus on.

To show off their Watson platform, IBM released a (fictional) [sample dataset](https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/) in 2015 containing employee statistics and whether or not they left the company. We'll use this dataset in our own exercise.

In [None]:
import pandas as pd

data = pd.read_csv('./utility/data/HR-Employee-Attrition.csv')

<!-- BEGIN QUESTION -->

### Problem 4.1

With our problem in mind, the next thing to do is to acquire and process our data.

**Try this!** Describe the data in `HR-Employee-Attrition.csv` (`data`), answering questions including, but not limited to, these: How many examples and features does the dataset have? What kinds of features are in the dataset? What values can these features take?
> **Hint:** Consider the steps of EDA: what would you like to know about this dataset?

In [None]:
# use this cell to explore!

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Problem 4.2

Now that we have a feel for what our data looks like, let's do some data wrangling.

**Try this!** In the cells below (feel free to add more as you need them), explain and perform the steps that you need to prepare this data for further analysis. Make sure that your analysis is presented well and effectively communicates your work. In this process, consider whether each feature is informative (e.g., `EmployeeNumber` might not be) and remove those that are not from your dataset. This is somewhat subjective, but there are at least two columns that aren't necessary.
> **Hint:** You can use the `pandas.DataFrame.drop` function.

In [None]:
...

<!-- END QUESTION -->

### Problem 4.3

Another step in our data processing phase is to replace categorical variables that are represented as strings with an enumeration. For example, `'Attrition'` has `'Yes'` and `'No'` values that we would like to encode as `1` and `0` respectively.

**Do this!** In the following cell, [`replace`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) the string values in categorical variables with enumerations. Make use of the `encoded` `DataFrame`, which is a copy of `data`.
> **Hint:** You can use the `unique` and `enumerate` functions to help you do this.
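The sketch below shows the `unique` plus `enumerate` pattern on a tiny made-up frame. The rows are invented for illustration, and the sketch uses `Series.map` with a complete mapping instead of `replace`, purely to keep it short and version-independent; treat it as one possible approach, not the expected solution.

```python
import pandas as pd

# Tiny made-up frame; the values are illustrative, not taken from the dataset
demo = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "Yes"],
    "Department": ["Sales", "R&D", "Sales", "HR"],
    "Age": [41, 49, 37, 33],
})
demo_encoded = demo.copy()

# 'Attrition' gets an explicit mapping so that 'Yes' is 1 and 'No' is 0
demo_encoded["Attrition"] = demo_encoded["Attrition"].map({"No": 0, "Yes": 1}).astype(int)

# Other string columns: number the categories in order of first appearance
for column in ["Department"]:
    mapping = {value: code for code, value in enumerate(demo_encoded[column].unique())}
    demo_encoded[column] = demo_encoded[column].map(mapping).astype(int)

print(demo_encoded)
```

Note that the codes produced by `enumerate` depend on the order in which values first appear, which is why `'Attrition'` is mapped explicitly here.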

In [None]:
encoded = data.copy()

...

In [None]:
grader.check("q4c")

<!-- BEGIN QUESTION -->

### Problem 4.4

Now that we have a processed dataset, let's move on to forming clusters with $k$-means. Normally, we would do some EDA here, but in the interest of time, we will forgo that part of the data science workflow. If you want to, we still encourage you to do so.

That being said, we will need to prepare an $X$ matrix of our dataset. At this point, we will drop the `'Attrition'` column from our dataset. We will also [scale our data](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling). 
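To see what scaling does, here is a small sketch with made-up numbers in which the second column is on a much larger scale than the first. Without scaling, squared distances would be dominated by the larger column. The array `raw` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import scale

# Two made-up features on very different scales (e.g. years vs. dollars)
raw = np.array([
    [30.0, 2000.0],
    [40.0, 6000.0],
    [50.0, 10000.0],
])

# scale() standardizes each column to zero mean and unit variance
scaled = scale(raw)

col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)

print(scaled)
print(col_means, col_stds)
```

After scaling, both columns contribute on a comparable footing to the distances that $k$-means uses.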

In [None]:
from sklearn.preprocessing import scale

X_processed = scale(np.float64(encoded.copy().drop('Attrition', axis=1)))

**Do this!** In the cell below, build an elbow plot for $k \in [1, 21]$ as you did in [Problem 3.3](#Problem-3.3).

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Write-up!** State which $k$ you would choose and explain why.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### Problem 4.5

**Do this!** Using the $k$ you selected in [Problem 4.4](#Problem-4.4), create and fit a new model. Remember to save the cluster assignments and centroids. Use a random state of 11.

In [None]:
k = ...
new_model = ...
new_assignments = ...
new_centroids = ...

In [None]:
grader.check("q4e")

### Problem 4.6

Now that we have cluster assignments from $k$-means, we need to analyze the significance of each cluster. To do that, let's return to our original DataFrame, `data`.

In the following cell, we add our cluster assignments to `data_aug`. We also compute a pivot table, which summarizes each cluster with the within-cluster `mean` of each feature.
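As a quick illustration of what such a pivot table contains, the sketch below builds one from a tiny made-up frame with only numeric columns. The frame `demo_aug` and its values are hypothetical.

```python
import pandas as pd

# Made-up numeric data with a cluster label per row
demo_aug = pd.DataFrame({
    "Age": [30, 40, 50, 60],
    "MonthlyIncome": [2000, 4000, 6000, 10000],
    "Cluster": [0, 0, 1, 1],
})

# One row per cluster, one column per feature, each cell a within-cluster mean
demo_pivot = demo_aug.pivot_table(index="Cluster", aggfunc="mean")

print(demo_pivot)
```

Each row of `demo_pivot` can be read as the "average member" of that cluster.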

In [None]:
data_aug = data.copy()
data_aug['Cluster'] = new_assignments
pivot = data_aug.pivot_table(index='Cluster', aggfunc=np.mean)

**Do this!** Compute the percentage of total attrition accounted for by each cluster and store the result in `pivot['% of Attrition']`.

In [None]:
...

...
pivot

In [None]:
grader.check("q4f")

<!-- BEGIN QUESTION -->

### Problem 4.7

Let's take a look at the results and identify potential areas for intervention to suggest to the HR department.

In [None]:
pivot.sort_values(by='% of Attrition', axis=0).T

**Write-up!** Describe the clusters produced and interpret their meaning. What makes each one a separate cluster? Is there anything that stands out with respect to attrition rate? What might you suggest HR look into to improve employee retention?

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)