<center>
<img style="float: center;" src="images/CI_horizontal.png" width="400">
</center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>


<center> Julia Lane, Benjamin Feder, Angela Tombari, Ekaterina Levitskaya, Tian Lou, Lina Osorio-Copete. </center> 

# Unsupervised Machine Learning

In some problems there is no target variable to predict; instead, we want to discover inherent groupings or patterns in the data. Unsupervised machine learning methods can help tackle these problems. Clustering is the most common unsupervised machine learning technique, but you might also be aware of principal components analysis (PCA) or neural network implementations such as self-organizing maps (SOM). This notebook provides an introduction to unsupervised machine learning through a clustering example.

## Introduction to Clustering

Clustering is used to group data points together that are similar to each other. Optimally, a given clustering method will produce groupings with high intra-cluster (within) similarity and low inter-cluster (between) similarity. Clustering algorithms typically require a distance or similarity metric to generate clusters. They take a dataset and a distance metric (and sometimes additional parameters), and they generate clusters based on that distance metric. The most common distance metric used is Euclidean distance, but other commonly-used metrics are Manhattan, Minkowski, Chebyshev, Cosine, Hamming, Pearson, and Mahalanobis.
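
To make the idea of a distance metric concrete, here is a minimal sketch using base R's `dist()` on two made-up points (the values are purely illustrative and not taken from the notebook's data). It computes three of the metrics named above.

```r
# Two toy points in a 2-dimensional feature space (made-up values)
pts <- rbind(a = c(0, 0), b = c(3, 4))

# dist() from base R's stats package returns pairwise distances
d_euclidean <- dist(pts, method = "euclidean")  # sqrt(3^2 + 4^2)
d_manhattan <- dist(pts, method = "manhattan")  # |3| + |4|
d_chebyshev <- dist(pts, method = "maximum")    # max(|3|, |4|)

c(euclidean = as.numeric(d_euclidean),
  manhattan = as.numeric(d_manhattan),
  chebyshev = as.numeric(d_chebyshev))
```

The same pair of points is 5, 7, or 4 units apart depending on the metric, which is why the choice of metric can change which points end up grouped together.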

Most clustering algorithms also require the user to specify the number of clusters (or some other parameter that indirectly determines the number of clusters) in advance. This is often difficult to do a priori and typically makes clustering an iterative and interactive task. Clustering is also interactive because it is often difficult to evaluate the quality of the clusters automatically. While various analytical clustering metrics have been developed, the best clustering is task-dependent and thus must be evaluated by the user. The same data can yield several different valid clusterings: for example, you can imagine clustering similar news stories based on topic content, writing style, or sentiment. The right set of clusters depends on the user and the task at hand. Clustering is therefore typically used in a loop: explore the data, generate clusters, explore the clusters, and then rerun the clustering method with different parameters or modify the clusters (by splitting or merging the previous set of clusters). Interpreting a cluster can be nontrivial: you can look at the centroid of a cluster, look at frequency distributions of different features (and compare them to the prior distribution of each feature), or examine other aspects.

Here, we will focus on **K-Means clustering** (*k* defines the number of clusters), which is considered to be the most commonly used clustering method. The algorithm works as follows:
1. Select *k* (the number of clusters you want to generate).
2. Initialize by selecting k points as centroids of the *k* clusters. This is typically done by selecting k points uniformly at random.
3. Assign each point a cluster according to the nearest centroid.
4. Recalculate cluster centroids based on the assignment in **(3)** as the mean of all data points belonging to that cluster.
5. Repeat **(3)** and **(4)** until convergence.

The algorithm stops when the assignments do not change from one iteration to the next. The final set of clusters, however, depends on the starting points. If initialized differently, it is possible that different clusters are obtained. One common practical trick is to run *k*-means several times, each with different (random) starting points. The *k*-means algorithm is fast, simple, and easy to use, and is often a good first clustering algorithm to try and see if it fits your needs. When the mean of the data points cannot be computed, a related method called *K-medoids* can be used.
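
The steps above can be sketched in a few lines of R. This is only an illustrative toy implementation under simple assumptions (Euclidean distance, random rows as starting centroids); `simple_kmeans` and the toy data are hypothetical names introduced here, and this is not the algorithm that R's `kmeans()` runs by default.

```r
# Minimal sketch of the k-means steps listed above.
# simple_kmeans is a hypothetical helper written for illustration only.
simple_kmeans <- function(x, k, max_iter = 100) {
    x <- as.matrix(x)
    # Step 2: pick k distinct rows uniformly at random as initial centroids
    centroids <- x[sample(nrow(x), k), , drop = FALSE]
    assignment <- rep(0L, nrow(x))
    for (iter in seq_len(max_iter)) {
        # Step 3: squared Euclidean distance from every point to every centroid
        d2 <- sapply(seq_len(k), function(j) {
            rowSums((x - matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
        })
        new_assignment <- max.col(-d2, ties.method = "first")  # nearest centroid
        # Step 5: stop when no assignment changes
        if (all(new_assignment == assignment)) break
        assignment <- new_assignment
        # Step 4: recompute each centroid as the mean of its points
        for (j in seq_len(k)) {
            members <- x[assignment == j, , drop = FALSE]
            if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
        }
    }
    list(cluster = assignment, centers = centroids)
}

# Toy data: two made-up groups of 20 points each
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

Rerunning the last three lines with a different seed changes the starting centroids, which is a quick way to see how the final clusters can depend on initialization.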

### Learning Objectives

This notebook demonstrates using *k*-means clustering to better understand Kentucky's labor market in 2013Q3. We've already developed a handful of employer-level measures in a supplemental notebook. We will try a few different values of *k* to see how we can best understand the labor market by looking for differentiation between each of the clusters.

## Import Packages and Set Up


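The `kmeans()` function that we will use for clustering ships with base R's `stats` package, so it needs no extra import. We also load the `cluster` package, which offers additional clustering tools, along with our usual packages for database connection and data manipulation/visualization.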

In [None]:
#database interaction imports
library(DBI)
library(RPostgreSQL)

# for data manipulation/visualization
library(tidyverse)
library(ggplot2)

# clustering
library(cluster)

In [None]:
# create an RPostgreSQL driver
drv <- dbDriver("PostgreSQL")

# connect to the database
con <- dbConnect(drv,dbname = "postgresql://stuffed.adrf.info/appliedda")

## 1. Read in the Data

We will read in a table from the database called `employers_2013`, which contains characteristics of Kentucky's labor market from 2012Q4-2014Q3.

In [None]:
# read into R
qry <- "
select *
from ada_ky_20.employers_2013
"
emp <- dbGetQuery(con, qry)

# see employers
head(emp)

This table contains information for employers by quarter. Because some employers appear in one quarter but may not appear in another quarter, for consistency, we will subset our dataframe to include information only for one quarter and one year: third quarter of 2013.

In [None]:
# Subset a dataframe by rows from 2013Q3

emp <- emp %>%
    filter(qtr == 3, calendaryear == 2013)

In [None]:
# Check that we only have 3rd quarter now
unique(emp$qtr)

In [None]:
# Check that we only have one year
unique(emp$calendaryear)

### Clean the Data

We need to remove the `employeeno` variable from our data frame since it is an identifier and provides no explanatory power for our *k*-means algorithm. Additionally, *k*-means only works properly with continuous features. This is because it measures similarity with the Euclidean distance between each data point and the centroid of a cluster, and each centroid is the mean of its points. Categorical variables such as industry codes have no natural position in Euclidean space. Thus, we also need to remove `naics` from `emp`.

> There are more sophisticated clustering algorithms that do not use Euclidean distances and thus allow categorical variables in the model. If you are interested in them, you can take a look at the functions `kmodes` and `gower.dist` - you will need to download their respective libraries first.
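
As a rough illustration of clustering with a categorical feature, the sketch below uses `daisy()` and `pam()` from the `cluster` package instead of the `kmodes` and `gower.dist` functions mentioned above. This is a swapped-in alternative chosen because `cluster` is already loaded in this notebook. The data frame is made up for the example.

```r
library(cluster)  # provides daisy() and pam()

# Made-up employer data with one numeric and one categorical feature
toy <- data.frame(
    avg_earnings = c(5000, 5200, 12000, 11800),
    sector = factor(c("retail", "retail", "finance", "finance"))
)

# Gower dissimilarity handles mixed numeric/categorical columns
gower_d <- daisy(toy, metric = "gower")

# pam() (k-medoids) accepts a dissimilarity object directly
fit <- pam(gower_d, k = 2)
fit$clustering
```

Because k-medoids uses actual observations as cluster centers, it never needs to average a categorical column.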

In [None]:
# Remove employeeno, naics, and also quarter and calendaryear as we only have one quarter and one year
emp_ml <- emp %>%
    select(-c(employeeno, naics, qtr, calendaryear))

In [None]:
head(emp_ml)

In [None]:
# Check data type of all variables - make sure all of them are numeric
str(emp_ml)

**It is important that we consider scaling these features** before we compute *k*-means clustering, especially if the metrics are on a variety of numerical scales. Let's see if they are.

In [None]:
# Get descriptions of each variable using "summary" function
summary(emp_ml)

We can see that we have variables on different numerical scales - we can scale them using the `scale()` function on our dataframe `emp_ml`.

In [None]:
# Scale the features
emp_ml <- scale(emp_ml)

# View first rows after scaling
head(emp_ml)
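
To see what `scale()` does with its default arguments, here is a small sketch on a made-up matrix (the column names and values are illustrative only). It compares the result with the manual calculation: subtract each column's mean, then divide by its standard deviation.

```r
# Made-up matrix with two columns on very different scales
toy <- cbind(employees = c(10, 20, 30), earnings = c(1000, 5000, 9000))

scaled <- scale(toy)

# Manual equivalent: subtract the column mean, divide by the column sd
manual <- sweep(sweep(toy, 2, colMeans(toy)), 2, apply(toy, 2, sd), "/")

all.equal(as.numeric(scaled), as.numeric(manual))
colMeans(scaled)       # approximately 0 for every column
apply(scaled, 2, sd)   # 1 for every column
```

After scaling, both columns contribute on a comparable footing to the Euclidean distance, rather than the column with the larger raw values dominating.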

Before running a clustering algorithm, we need to make sure that there are no missing values. Here we will use the `na.omit()` function, which removes all rows with any NA values. (If an employer has missing information in any of the columns, its row will be dropped.)

> Note that you should **avoid removing data** whenever possible - in a real world setting you would likely want to fill any missing data with an imputation or baseline assumption. We will discuss missing data during the Inference session in Module 3 of the program.
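
As one hedged example of what an imputation could look like, the sketch below fills missing values with the column median on a made-up data frame. Median imputation is only one simple option and is an assumption made for this illustration. It is not the approach used in this notebook.

```r
# Made-up employer data with missing values
toy <- data.frame(
    avg_earnings = c(5000, NA, 7000, 9000),
    full_num_employed = c(12, 40, NA, 8)
)

# Replace each NA with the median of the observed values in its column
toy_imputed <- as.data.frame(lapply(toy, function(col) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
    col
}))

nrow(na.omit(toy))   # rows kept when dropping incomplete rows
nrow(toy_imputed)    # rows kept with median imputation
toy_imputed
```

Dropping incomplete rows keeps only half of this toy data, while imputation keeps every employer at the cost of introducing assumed values.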

In [None]:
# Check number of rows (where each row is a unique employer)
nrow(emp_ml)

In [None]:
# We also need to remove all missing data points before running clustering
# na.omit will remove any rows with any NA values
emp_ml <- na.omit(emp_ml)

In [None]:
# Check number of rows after dropping rows with any NA values
nrow(emp_ml)

## 2. Choose the Number of Clusters, *K*

Running a *k*-means model is simple: we just need to call `kmeans()` and choose the number of clusters (the `centers` argument). What number should we choose? Here, we have 11 features, so it is hard to visualize the data and pick a sensible number by eye. Let's start with a small number, such as 3, and see what the results look like.

Because *k*-means clustering can generate different results from different random starting points, we will set a seed with `set.seed()` so that the work in this notebook is reproducible. To get the same results, you must set the same seed before running the clustering algorithm every time. If you and your collaborators set the same seed and run the same *k*-means call on the same data, you should see the same results, even when working in different environments (e.g., separate Jupyter notebooks), provided your R versions use the same random number generator settings.
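
A quick way to convince yourself of this is the sketch below, which runs `kmeans()` twice on made-up data and resets the seed before each call. The toy matrix and seed values are arbitrary choices for illustration.

```r
# Made-up data: 100 points with 2 features
set.seed(42)
toy <- matrix(rnorm(200), ncol = 2)

# Same seed before each call gives the same random starting centroids
set.seed(1)
fit_a <- kmeans(toy, centers = 3)

set.seed(1)
fit_b <- kmeans(toy, centers = 3)

identical(fit_a$cluster, fit_b$cluster)
```

If you remove the second `set.seed(1)`, the two fits are no longer guaranteed to match.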

### k = 3

In [None]:
# Initialize the model and run on emp_ml
set.seed(1)
k3 <- kmeans(emp_ml, centers=3, nstart=20)

> `nstart` specifies a number of initial configurations and reports on the best one - an optimal number is usually somewhere between 20 and 50. (See more information in the Resources section - Professor Steorts, Duke University).
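
As a rough illustration of why `nstart` matters, the sketch below compares a single random start with 20 starts on made-up data. The data, seed, and number of clusters are arbitrary assumptions. With the same seed, the multi-start run should report a total within-cluster sum of squares that is no worse than the single-start run.

```r
# Made-up data: 200 points with 2 features
set.seed(42)
toy <- matrix(rnorm(400), ncol = 2)

# One random starting configuration
set.seed(1)
one_start <- kmeans(toy, centers = 5, nstart = 1)

# Twenty starting configurations; kmeans() keeps the best one
set.seed(1)
many_starts <- kmeans(toy, centers = 5, nstart = 20)

c(one_start = one_start$tot.withinss, many_starts = many_starts$tot.withinss)
```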

In [None]:
str(k3)

The `kmeans` function returns the following components, which are the most useful for us:
- `cluster` - a vector of integers indicating the cluster to which each point is allocated
- `centers` - a matrix of cluster centers
- `totss` - the total sum of squares
- `withinss` - vector of within-cluster sum of squares, one component per cluster.
- `tot.withinss` - total within-cluster sum of squares, i.e. `sum(withinss)`
- `betweenss` - the between-cluster sum of squares, i.e. `totss-tot.withinss`
- `size` - the number of points in each cluster
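
The relationships between these components can be checked directly. The sketch below fits `kmeans()` on made-up data (not the employer data) and verifies the identities listed above.

```r
# Made-up data: 100 points with 2 features
set.seed(1)
toy <- matrix(rnorm(200), ncol = 2)
fit <- kmeans(toy, centers = 3, nstart = 20)

all.equal(fit$tot.withinss, sum(fit$withinss))           # total = sum over clusters
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)   # total = within + between
sum(fit$size) == nrow(toy)                               # every point is in one cluster

# Share of total variation that lies between clusters
fit$betweenss / fit$totss
```

The last ratio is a rough summary of separation: values closer to 1 mean most of the variation lies between clusters rather than within them.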

Let's check the size of each cluster:

In [None]:
k3$size

We can see that most of the employers are concentrated in cluster [redacted]. In a perfect world, we would want them to be distributed more evenly across clusters, but in some cases, it may make sense that they are not. Most importantly, we are looking for high intra-cluster similarity and low inter-cluster similarity.

Are there major differences in the characteristics of employers in each cluster?

We can take a look at basic descriptives of the employers in these clusters by adding our clustering results to the original dataframe, `emp`, and call this dataframe `frame_3`.

In [None]:
emp <- na.omit(emp)                     # remove missing values (assumes the same rows are dropped as in emp_ml)
frame_3 <- data.frame(emp, k3$cluster)  # add cluster number to the original dataframe
frame_3 <- subset(frame_3, select= -c(employeeno,naics, qtr, calendaryear))  # remove employeeno, naics, qtr, calendaryear columns

frame_3 %>%
    group_by(k3.cluster) %>%
    summarize_all("mean")

In general, we can see that our biggest cluster, cluster [redacted], contains relatively [redacted] employers that pay their employees [redacted] wages. Cluster [redacted] also has relatively [redacted], but on average, they pay their employees more than [redacted] employers in cluster [redacted] and they employ more full-quarter employees than in cluster [redacted]. 

## Evaluate clusters

One simple way to evaluate resulting clusters is to compare the summary stats between key variables of interest.

We can also visualize the differences between the clusters in more detail by finding mean and standard deviation for the following variables: `avg_earnings`, `bottom_25_pctile`, and `top_25_pctile`. We will first need to convert our data frame into a long format, with each variable/cluster combination corresponding to a distinct row.

In [None]:
head(frame_3)

In [None]:
# Save results with mean to a dataframe
frame_3_mean <- frame_3 %>%
    group_by(k3.cluster) %>%
    select(c(avg_earnings, bottom_25_pctile, top_25_pctile)) %>%
    summarize_all(mean) %>%
    pivot_longer(-k3.cluster, names_to = "variable", values_to = "mean")

# Save results with standard deviation to a dataframe
frame_3_sd <- frame_3 %>%
    group_by(k3.cluster) %>%
    select(c(avg_earnings, bottom_25_pctile, top_25_pctile)) %>%
    summarize_all(sd) %>%
    pivot_longer(-k3.cluster, names_to = "variable", values_to = "sd") %>%
    select(-c(k3.cluster, variable))

# Bind two dataframes together
df <- cbind(frame_3_mean,frame_3_sd)

df

Now we can use this data frame to visualize the means and standard deviations in our 3 clusters for these 3 variables: `avg_earnings`, `bottom_25_pctile`, and `top_25_pctile` using a bar plot.

In [None]:
ggplot(df, aes(x=k3.cluster, y=mean, fill=k3.cluster)) +
    geom_bar(stat="identity", position = position_dodge()) +    # plot bars for the mean values
    geom_errorbar(aes(ymax= mean + sd, ymin = mean),            # add standard deviation bars
                  width=.2,
                  position = position_dodge(.9)) +
    facet_grid(. ~ variable) +                                  # plot by 3 variables of interest
    ggtitle("REDACTED") +                                       # add title
    xlab("Clusters") +                                          # add label for x-axis
    ylab("Mean") +                                              # add label for y-axis
    theme(text = element_text(size=16),                         # increase text font
          axis.text.x = element_text(size=18, face="bold"),     # increase text font on x-axis and make it bold
          legend.position = "none")                             # remove legend

### Visualization functions

We can also create a function to visualize different columns in a similar way. The function takes the mean, the standard deviation, and the title of the visualization.

In [None]:
# Save it to a dataframe
frame_3_mean_sd <- frame_3 %>%
    group_by(k3.cluster) %>%
    select(c(avg_earnings, bottom_25_pctile, top_25_pctile)) %>%
    summarise_all(list(mean = mean, sd = sd))  # named list replaces the deprecated funs()

# Visualize average earnings by cluster
viz <- function(mean, sd, title) {
    ggplot(frame_3_mean_sd, aes(x=k3.cluster, y=mean, fill=k3.cluster)) +
    geom_bar(position = position_dodge(), stat="identity", fill="gray") +
    geom_errorbar(aes(ymax= mean + sd, ymin = mean),
                  width=.2,
                  position = position_dodge(.9)) +
    ggtitle(title) +
    xlab("Clusters") +
    ylab("Mean") +
    theme(text = element_text(size=16),
          axis.text.x = element_text(size=18, face="bold"),
          legend.position = "none")
}

In [None]:
viz(frame_3_mean_sd$avg_earnings_mean, frame_3_mean_sd$avg_earnings_sd, "Average Earnings: Differences between clusters")

In [None]:
viz(frame_3_mean_sd$bottom_25_pctile_mean, frame_3_mean_sd$bottom_25_pctile_sd, "Bottom 25 Percentile: Differences between clusters")

In [None]:
viz(frame_3_mean_sd$top_25_pctile_mean, frame_3_mean_sd$top_25_pctile_sd, "Top 25 Percentile: Differences between clusters")

### Compare industries

We can also compare clusters by looking at the most common industries within each cluster. Let's read in our `naics_2012_upd` table to find the corresponding titles to these codes.

In [None]:
# read naics_2012_upd table into R as dataframe naics
qry = '
select *
from ada_ky_20.naics_2012_upd
'
naics <- dbGetQuery(con, qry)

In [None]:
frame_3 <- data.frame(emp, k3$cluster)  # add cluster number to the original dataframe

frame_3 <- frame_3 %>%
    group_by(k3.cluster, naics) %>%                      # group by cluster and industry
    summarise(unique_emp = n_distinct(employeeno)) %>%   # count number of unique employers
    ungroup() %>%
    group_by(k3.cluster) %>%
    arrange(desc(unique_emp)) %>%                        # sort industries by number of employers
    slice(1:3)                                           # choose top 3 industries in each cluster

In [None]:
# left join with industry names
frame_3 %>% 
    left_join(naics, by=c('naics' = 'naics_us_code')) %>%
    select(-c(seq_no,naics)) %>%
    filter(!is.na(naics_us_title)) %>%
    arrange(k3.cluster, desc(unique_emp))

What are the most prominent industries in each of the clusters?

Do these clustering results make sense to you? 

## Selecting *k*

How do we know if we chose an optimal number of clusters to describe our data?

### Elbow method

We can use the *Elbow method* as one input in selecting the number of clusters. Recall that *k*-means starts with *k* random cluster centers (centroids), assigns each data point to the closest centroid, and then moves each centroid to the mean of its assigned points, repeating these steps until convergence. In the *Elbow method*, we try different values of *k* and calculate the sum of squared errors (`SSE`), i.e. the total within-cluster sum of squares, after the model converges. Then we plot `SSE` against *k* in a line chart. The line chart should resemble an arm.

In [None]:
set.seed(1)

# function to compute total within-cluster sum of squares
wss <- function(k) {
    kmeans(emp_ml, k)$tot.withinss
}

# compute and plot wss for k = 1 to k = 15
k.values <- 1:15

# extract wss values for each k
wss_values <- map_dbl(k.values, wss)

# plot the resulting SSE for each value of k
plot(k.values, wss_values, 
    type = "b", pch=19, frame=FALSE,
    xlab = "Number of clusters K", 
    ylab = "Total within-clusters sum of squares")

We can see that SSE decreases as we increase k. Here, it decreases faster when k is small. As k increases, the reduction in SSE becomes smaller. We try to choose the number around the inflection point, where the change in SSE becomes negligible, indicating that there is little room to improve the model by increasing k (the bend in the elbow). On our graph, the elbow curve begins to flatten around k = 4.

Let's run the model with 4 clusters.

In [None]:
set.seed(1)
k4 <- kmeans(emp_ml, centers = 4)
k4$size

We can see that the cluster sizes with 4 clusters are more evenly distributed now.

Let's save these results to a dataframe called `frame_4`, and check characteristics of employers in each cluster:

In [None]:
frame_4 <- data.frame(emp, k4$cluster)  # add cluster number to the original dataframe
frame_4 <- subset(frame_4, select= -c(employeeno,naics, qtr, calendaryear))  # remove employeeno, naics, qtr, calendaryear columns

frame_4 %>%
    group_by(k4.cluster) %>%
    summarize_all("mean")

We still have [redacted] cluster(s) with [redacted] employers, but who do not necessarily pay the highest wages - the highest wages are in cluster [redacted], from [redacted] employers, and then we have [redacted] cluster(s) [redacted] with [redacted] employers and [redacted] wages. The difference between [redacted] employers is in the number of full-quarter employees, as well as employment, hire, and separation rates.

We can also take a look at the three most prominent industries in each cluster.

In [None]:
# read naics_2017 table into R as naics
qry = '
select *
from public.naics_2017
'
naics <- dbGetQuery(con, qry)

In [None]:
frame_4 <- data.frame(emp, k4$cluster)  # add cluster number to the original dataframe

frame_4 <- frame_4 %>%
    group_by(k4.cluster, naics) %>%
    summarise(unique_emp = n_distinct(employeeno)) %>%
    top_n(3, unique_emp) 

frame_4 %>% 
    left_join(naics, by=c('naics' = 'naics_us_code')) %>%
    select(-c(seq_no,naics))  %>%
    arrange(k4.cluster, desc(unique_emp))

Which clustering results - `frame_3` or `frame_4` - do you prefer? Do you think it could be optimal to choose more clusters?

In summary, in clustering there is no single right answer - every time we run a different number of clusters, interesting patterns about our data can be exposed. However, what we do want to know is whether the clusters that we find represent true subgroups in our data. This could be a crucial input toward choosing the right number of clusters. (See more information on additional methods for selecting `k` in the Resources section - Professor Steorts, Duke University).
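
One commonly used additional check is the average silhouette width, which can be computed with `silhouette()` from the `cluster` package. The sketch below is an illustration on made-up data with two well-separated groups. It is not taken from the resources cited above, and the range of `k` values is an arbitrary choice.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Made-up data: two well-separated groups of 30 points each
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))
d <- dist(toy)  # pairwise Euclidean distances

# Average silhouette width for k = 2, ..., 6
avg_sil <- sapply(2:6, function(k) {
    fit <- kmeans(toy, centers = k, nstart = 20)
    mean(silhouette(fit$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil
```

Silhouette widths range from -1 to 1, and higher averages suggest points sit closer to their own cluster than to the nearest other cluster. Like the elbow curve, this is one input to the decision rather than a definitive answer.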

Experiment with different numbers of clusters in Checkpoint 1 below. Given what you know about Kentucky's labor market in 2013Q3, which number of clusters makes the most sense to you?

<font color=red><h3> Checkpoint 1: Run a K-Means clustering model </h3></font> 

1. Take a look again at the elbow curve, which number(s) do you think is (are) optimal?

2. Choose a cluster number that you think is best (other than 3 or 4). Use `kmeans()` to run a k-means clustering model with the number you choose. Save your results and features in `frame_k`. 

3. Compare your results with the results we got previously. Do you find any differences? Are the results improved, in your opinion?

Hint: in the Elbow method graph, it looks like 11 could be another reasonable number of clusters - you can try 11 clusters and see the differences.

### Cohort's Employers

In this section we will take a look at our cohort's employers, and identify the clusters they belong to based on the `frame_4` clustering results.

> We will need to subset `df_wages` to just jobs in 2013Q3 in order to line up with these clusters.

In [None]:
# read earnings of cohort into R
qry = "
select *
from ada_ky_20.cohort_wages
"
df_wages = dbGetQuery(con, qry)

In [None]:
unique(df_wages$job_date)

In [None]:
# Subset by 2013 Q3
df_wages <- df_wages[which(df_wages$job_date=='2013-07-01'), ]

Here, we will not subset to the dominant employer for each `coleridge_id` in `df_wages`. However, in the final section, we introduce the idea of just selecting dominant employers for each `coleridge_id` before starting a cohort-specific analysis.

In [None]:
frame_4 <- data.frame(emp, k4$cluster)  

# Join wages table with frame_4 clustering results
df_wages <- df_wages %>%
    inner_join(frame_4, by='employeeno')

In [None]:
# Group by clusters and find number of unique employers in each cluster
df_wages %>%
    group_by(k4.cluster) %>%
    summarise(emp_cohort = n_distinct(employeeno))

We can also compare what percentage of all employers in our clusters hire our cohort.

In [None]:
# Get number of unique employers per cluster in the full dataframe (all employers)
frame_4 %>%
    group_by(k4.cluster) %>%
    summarise(emp_all = n_distinct(employeeno))

In [None]:
# Save cohort and all employers dataframes

cohort_emp <- df_wages %>%
    group_by(k4.cluster) %>%
    summarise(emp_cohort = n_distinct(employeeno))

emp_all <- frame_4 %>%
    group_by(k4.cluster) %>%
    summarise(emp_all = n_distinct(employeeno))

# Join cohort employers with all employers, and find percentage
cohort_emp %>%
    inner_join(emp_all, by = 'k4.cluster') %>%
    mutate(percentage = (emp_cohort / emp_all) * 100)

As a reminder, one limitation of our `employers_2013` file is that it doesn't include employers with fewer than 5 employees.

Let's add industry names:

In [None]:
df_wages_industry_names <- df_wages %>%
    group_by(k4.cluster, naics) %>%
    summarise(unique_emp = n_distinct(employeeno)) %>%
    # keep exactly the top 3 industries per cluster; with_ties = FALSE stops ties
    # from returning extra rows, so no higher-ranked industry is dropped afterwards
    slice_max(unique_emp, n = 3, with_ties = FALSE) %>%
    arrange(k4.cluster, naics)

In [None]:
df_wages_industry_names %>% 
    left_join(naics, by=c('naics' = 'naics_us_code')) %>%
    select(-c(seq_no,naics)) %>%
    arrange(k4.cluster, desc(unique_emp))

We can also compare average earnings of our cohort by cluster with average earnings of all employees in each cluster:

In [None]:
# Calculate average earnings for KY graduates by cluster
df_wages %>%
    group_by(k4.cluster) %>%
    summarise(mean_earnings_cohort = mean(wages))

In [None]:
# Calculate average earnings for all employees by cluster
frame_4 %>%
    group_by(k4.cluster) %>%
    summarise(mean_earnings_all = mean(avg_earnings))

<font color=red><h3> Checkpoint 2: Cohort's Employers </h3></font> 

How are the cohort's employers distributed between clusters in other clustering models (numbers of clusters) that you tried in Checkpoint 1?

-------------------------------------------------------------
### Code for visualizations from the ML slides

In this section we provide the code that created visualizations for results with 11 clusters in the Machine Learning slides.

1. Compare the mean of variables between all employers and employers in each cluster.

In [None]:
# Get 11 clusters
set.seed(1)
k11 <- kmeans(emp_ml, centers = 11)
k11$size

In [None]:
# Add cluster assignment to the original dataframe
frame_11 <- data.frame(emp, k11$cluster) 

In [None]:
# Count number of unique employers by cluster
result <- frame_11 %>%
                group_by(k11.cluster) %>%
                summarise(n_employers = n_distinct(employeeno))

In [None]:
# Remove non-numerical columns
frame_11_subset <- subset(frame_11, select = -c(employeeno, naics, qtr, calendaryear))

# Get means by cluster
cframe_cluster <- frame_11_subset  %>%
                    group_by(k11.cluster) %>%
                    summarise_all('mean')

# Get means for all employers
cframe_all <- frame_11_subset %>%
                    summarise_all('mean')

In [None]:
# Remove column name with clusters
cframe_cluster_subset <- subset(cframe_cluster, select = -c(k11.cluster))

In [None]:
# Loop through columns and compare means of clusters with means of all employers
# If a mean of a variable in a cluster is higher than a mean for all employers, then add +, otherwise -

for(i in names(cframe_cluster_subset)){
    # [[i]] extracts each column as a plain vector before comparing
    cframe_cluster_subset[[i]] <- ifelse(cframe_cluster[[i]] > cframe_all[[i]], '+', '-')
}


In [None]:
# Combine two tables: one with number of unique employers and with means comparison
means_comparison <- cbind(result,cframe_cluster_subset)

means_comparison

In [None]:
# You can save a dataframe to a csv with a write_csv function
# means_comparison %>% write_csv('/nfshome/YOURUSERNAME/means_comparison.csv')

The resulting table shows us a comparison between the means of variables in each cluster and the means of the same variables for all employers.

2. Fuzzy box plot of full-quarter employees by cluster.

In [None]:
# Create a new dataframe
upd_stats <- data.frame()

# Create a loop to go through each cluster
# start of a loop:
for(grp in unique(frame_11$k11.cluster)){
    new_df <- frame_11 %>%
        filter(k11.cluster == grp)

    stats <- new_df %>%
        group_by(k11.cluster) %>%
        # find fuzzy percentiles
        summarize(
            'fuzzy_25' = (quantile(full_num_employed, .20) + quantile(full_num_employed, .30))/2,
            'fuzzy_50' = (quantile(full_num_employed, .45) + quantile(full_num_employed, .55))/2,
            'fuzzy_75' = (quantile(full_num_employed, .70) + quantile(full_num_employed, .80))/2
        ) %>%
        # find min and max cutoff values
        mutate(
            fuzzy_min_cutoff = (fuzzy_25 - 1.5*(fuzzy_75 - fuzzy_25)),
            fuzzy_max_cutoff = (fuzzy_75 + 1.5*(fuzzy_75 - fuzzy_25))
        )

    # keep only employers inside the cutoff values
    df_grp <- new_df %>%
        filter(full_num_employed > stats[stats$k11.cluster == grp,]$fuzzy_min_cutoff,
               full_num_employed < stats[stats$k11.cluster == grp,]$fuzzy_max_cutoff)

    # find fuzzy max
    new_max <- df_grp %>%
        arrange(desc(full_num_employed)) %>%
        head(2) %>%
        summarize(m = mean(full_num_employed))

    # find fuzzy min
    new_min <- df_grp %>%
        arrange(full_num_employed) %>%
        head(2) %>%
        summarize(m = mean(full_num_employed))

    stats <- stats %>%
        mutate(
            fuzzy_min = new_min$m,
            fuzzy_max = new_max$m
        )

    # fill upd_stats with the stats for each of the clusters
    upd_stats <- rbind(upd_stats, stats)
}
# end of a loop
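
The fuzzy percentiles in the loop above all follow the same pattern: average the quantiles slightly below and slightly above the target percentile. A minimal sketch of that pattern as a reusable function is shown below. `fuzzy_quantile` and its `width` argument are hypothetical names introduced here for illustration and are not part of the original code.

```r
# fuzzy_quantile is a hypothetical helper, not part of the original notebook
fuzzy_quantile <- function(x, p, width = 0.05) {
    # Average the quantiles just below and just above p
    unname((quantile(x, p - width) + quantile(x, p + width)) / 2)
}

x <- 1:101
fuzzy_quantile(x, 0.25)   # averages the 20th and 30th percentiles
fuzzy_quantile(x, 0.50)   # averages the 45th and 55th percentiles
fuzzy_quantile(x, 0.75)   # averages the 70th and 80th percentiles
```

With a helper like this, each `summarize()` entry in the loop could be written as a single call, for example `fuzzy_quantile(full_num_employed, 0.25)`.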

In [None]:
upd_stats %>%
    ggplot(aes(x=as.character(k11.cluster), ymin = fuzzy_min, lower = fuzzy_25, middle = fuzzy_50, upper = fuzzy_75, ymax = fuzzy_max)) +
    geom_boxplot(stat="identity") + 
    labs(
        title = 'Number of Full Quarter Employees per Employer by Cluster',
        y = 'Number of Full Quarter Employees (log10 scale)',
        x='Cluster',
        caption = 'Source: KPEDS, UI Wages data'
    ) +
    scale_y_continuous(trans = 'log10') + 
    scale_x_discrete(limits = as.character(1:11)) +   # discrete limits must match the character x values
    theme_minimal() 

In [None]:
# ggsave('/nfshome/YOURUSERNAME/box_plot_full_quarter_employees.pdf')

3. Fuzzy box plot of average earnings per employee by cluster.

In [None]:
# Use the same loop as above - change the variable to: avg_earnings

upd_stats <- data.frame()

for(grp in unique(frame_11$k11.cluster)){
    new_df <- frame_11 %>%
        filter(k11.cluster == grp)

    stats <- new_df %>%
        group_by(k11.cluster) %>%
        # find fuzzy percentiles
        summarize(
            'fuzzy_25' = (quantile(avg_earnings, .20) + quantile(avg_earnings, .30))/2,
            'fuzzy_50' = (quantile(avg_earnings, .45) + quantile(avg_earnings, .55))/2,
            'fuzzy_75' = (quantile(avg_earnings, .70) + quantile(avg_earnings, .80))/2
        ) %>%
        # find min and max cutoff values
        mutate(
            fuzzy_min_cutoff = (fuzzy_25 - 1.5*(fuzzy_75 - fuzzy_25)),
            fuzzy_max_cutoff = (fuzzy_75 + 1.5*(fuzzy_75 - fuzzy_25))
        )

    # keep only employers inside the cutoff values
    df_grp <- new_df %>%
        filter(avg_earnings > stats[stats$k11.cluster == grp,]$fuzzy_min_cutoff,
               avg_earnings < stats[stats$k11.cluster == grp,]$fuzzy_max_cutoff)

    # find fuzzy max
    new_max <- df_grp %>%
        arrange(desc(avg_earnings)) %>%
        head(2) %>%
        summarize(m = mean(avg_earnings))

    # find fuzzy min
    new_min <- df_grp %>%
        arrange(avg_earnings) %>%
        head(2) %>%
        summarize(m = mean(avg_earnings))

    stats <- stats %>%
        mutate(
            fuzzy_min = new_min$m,
            fuzzy_max = new_max$m
        )

    # fill upd_stats with the stats for each of the clusters
    upd_stats <- rbind(upd_stats, stats)
}

In [None]:
upd_stats %>%
    ggplot(aes(x=as.character(k11.cluster), ymin = fuzzy_min, lower = fuzzy_25, middle = fuzzy_50, upper = fuzzy_75, ymax = fuzzy_max)) +
    geom_boxplot(stat="identity") + 
    labs(
        title = 'Average Earnings per Employee by Cluster',
        y = 'Earnings',
        x='Cluster',
        caption = 'Source: KPEDS, UI Wages data'
    ) +
    scale_x_discrete(limits = as.character(1:11)) +   # discrete limits must match the character x values
    theme_minimal() 

In [None]:
# You can save this plot by using ggsave:
# ggsave('/nfshome/YOURUSERNAME/box_plot_average_earnings.pdf')

4. Compare cohort's employers and all employers in each cluster.

In [None]:
# read earnings of cohort into R
qry = "
select *
from ada_ky_20.cohort_wages
"
df_wages = dbGetQuery(con, qry)

In [None]:
# Subset by 2013 Q3
df_wages <- df_wages[which(df_wages$job_date=='2013-07-01'), ]

In [None]:
# Subset by only dominant employer (by highest wages)
df_wages_dominant <- df_wages %>%
                        group_by(coleridge_id) %>%
                        top_n(1, wages)

In [None]:
# Add industry names
frame_11 <- frame_11 %>% 
    left_join(naics, by=c('naics' = 'naics_us_code')) %>%
    select(-c(seq_no,naics))

# Rename the industry column to match with the df_wages_dominant dataframe
frame_11 <- frame_11 %>%
                rename(industry = naics_us_title)

In [None]:
# Join wages table with frame_11 clustering results
df_wages_dominant <- df_wages_dominant %>%
    inner_join(frame_11, by=c('employeeno','industry'))

In [None]:
dominance <- df_wages_dominant %>%
    group_by(employeeno, k11.cluster) %>%
    summarize(n=n()) %>%
    ungroup() %>%
    group_by(k11.cluster) %>%
    mutate(prop=n/sum(n)) %>%
    top_n(1) %>%
    arrange(k11.cluster)

In [None]:
# dominance %>% write_csv('/nfshome/YOURUSERNAME/table_industries_Kentucky_dominance.csv')

In [None]:
# Remove columns that are not needed
df_wages_dominant_subset <- subset(df_wages_dominant, select = c(coleridge_id, employeeno, k11.cluster))

In [None]:
# Get number of distinct employers and individuals from the cohort by cluster
cohort_person_employer <- df_wages_dominant_subset %>%
    group_by(k11.cluster) %>%
    summarise_all("n_distinct")

In [None]:
cohort_person_employer <- subset(cohort_person_employer, select = -c(k11.cluster))

new_df <- cbind(result, cohort_person_employer)

new_df

In [None]:
# Find number of unique dominant employers in the cohort
length(unique(df_wages_dominant_subset$employeeno))

In [None]:
# Find number of unique individuals in the cohort with dominant employers
length(unique(df_wages_dominant_subset$coleridge_id))

In [None]:
# Calculate proportions of employers and individuals
new_df$total_cohort_emp <- [redacted]
new_df$total_individ_cohort <- [redacted]
new_df$percentage_emp <- (new_df$employeeno / new_df$total_cohort_emp) * 100
new_df$percentage_indiv <- (new_df$coleridge_id / new_df$total_individ_cohort) * 100

new_df

In [None]:
# Get the mean of average earnings of the cohort by cluster
avg_earnings <- df_wages_dominant %>%
                group_by(k11.cluster) %>%
                summarise(mean_earnings_cohort = mean(wages))

In [None]:
new_df <- left_join(new_df, avg_earnings, by='k11.cluster')

In [None]:
# Remove missing industry names
frame_11 <- na.omit(frame_11)

# Add top 3 industries by cluster
cluster_industries <- frame_11 %>%
    group_by(k11.cluster, industry) %>%
    summarise(unique_emp = n_distinct(employeeno)) %>%
    top_n(3, unique_emp) %>%
    slice(1:3)

In [None]:
# Sort by industry with largest number of employers first by cluster
cluster_industries <- cluster_industries %>%
    arrange(k11.cluster, desc(unique_emp))

In [None]:
library(reshape2)
library(data.table)

In [None]:
cluster_industries <- dcast(setDT(cluster_industries), k11.cluster~rowid(k11.cluster), value.var=c('industry','unique_emp'))

In [None]:
new_df <- merge(new_df, cluster_industries, by='k11.cluster')

In [None]:
new_df <- subset(new_df, select = -c(total_cohort_emp, total_individ_cohort))

In [None]:
new_df

In [None]:
# You can write this dataframe to a csv using write_csv function
# new_df %>% write_csv('/nfshome/YOURUSERNAME/table_industries.csv')

In [None]:
options(repr.matrix.max.cols=100)

In [None]:
degrees <- df_wages_dominant %>%
    group_by(k11.cluster, deg_class) %>%
    summarise(indiv = n_distinct(coleridge_id)) %>%
    top_n(3, indiv) %>%
    slice(1:3) %>%
    arrange(k11.cluster, desc(indiv))

In [None]:
degrees <- dcast(setDT(degrees), k11.cluster~rowid(k11.cluster), value.var=c('deg_class','indiv'))

In [None]:
degrees

In [None]:
# degrees %>% write_csv('/nfshome/YOURUSERNAME/degrees.csv')

### Resources:
- UC Business Analytics R Programming Guide: https://uc-r.github.io/kmeans_clustering
- Rebecca Steorts, Assistant Professor, Duke University, Department of Statistical Science, Data Mining and Machine Learning course: https://github.com/resteorts/data-mine/tree/master/lectures_2018/10-unsupervise/10-kmeans.pdf