Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
182 lines (139 sloc) 4.78 KB
---
title: "Cluster Analysis in R"
output:
pdf_document:
toc: yes
toc_depth: 2
html_document:
theme: readable
toc: yes
toc_depth: 3
date: "June 6, 2016"
---
```{r setup, include = FALSE}
# my set-up
knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE,
fig.align = 'center')
options(scipen = 999)
require(dplyr)
require(ggplot2)
colleges = read.delim('./data/colleges.tsv',
sep = '\t',
header = TRUE)
```
# Read in data
We're continuing our exploration of college features, so read in the College
Scorecard data:
```{r, eval = FALSE}
require(dplyr)
require(ggplot2)
colleges = read.delim('./data/colleges.tsv',
sep = '\t',
header = TRUE)
```
We're also going to use a couple new libraries, `stringr` and `ggdendro`, so
go ahead and install those now. You can use the `packages` menu, or the following
commands:
```{r, eval = FALSE}
install.packages('stringr', dependencies = TRUE)
install.packages('ggdendro', dependencies = TRUE)
```
Let's use a realistic application that we might use clustering to solve. Say that
I work for a granting agency, like the Gates Foundation, and I want to identify
colleges that have high numbers of low-income, and first generation college
attendees because I want to give those colleges additional funding. How should I
identify which schools to fund?
# $k$-means clustering
One approach would be to use $k$-means to identify clusters of schools meeting
our granting requirements.
```{r}
college_features =
colleges %>%
select(institution_name, first_gen_share, poverty_rate,
family_income_median, median_earnings) %>%
na.omit() %>%
distinct()
# run k-means clustering
kmeans_cluster = kmeans(select(college_features, -institution_name), 3)
# check what attributes are in the kmeans object
attributes(kmeans_cluster)
# Find which cluster the observations belong to
head(kmeans_cluster$cluster, 10)
# centers
kmeans_cluster$centers
```
Now we can plot the clusters and how they relate to two of the variables. Note that the clusters are labeled by numbers as seen in the figure legend.
```{r}
# plot 4 clusters
ggplot(college_features,
aes(x = family_income_median,
y = median_earnings,
color = factor(kmeans_cluster$cluster))) +
geom_point(alpha = 0.50) +
theme_minimal()
```
We can use our trusty friend `ggpairs` to visualize the resulting clusters:
```{r}
require(GGally)
college_features =
college_features %>%
mutate(cluster = factor(kmeans_cluster$cluster))
ggpairs(college_features,
lower = list(mapping = aes(color = cluster, alpha = 0.20)),
diag = list(mapping = aes(fill = cluster, color = cluster, alpha = 0.50)),
upper = list(mapping = aes(group = cluster)),
columns = c('first_gen_share', 'poverty_rate', 'family_income_median', 'median_earnings'))
```
What schools should we give money to? (Use the earlier plot to figure out which cluster number to select.)
```{r}
grant_candidates =
college_features %>%
filter(cluster == 1)
```
# Hierarchical Clustering
Can we find similarities in the colleges based on the composition of majors?
```{r, fig.width = 12, fig.height = 12}
require(stringr)
require(ggdendro)
# take a sample to readablility
sample = colleges[sample(nrow(colleges), 50), ]
# select all the columns that contain the string "_major_perc"
majors = sample[, str_detect(names(sample), '_major_perc')]
# put the institution name back on
majors$institution_name = sample$institution_name
# remove missing values
majors = na.omit(majors)
# compute the euclidean distance
euclidean = dist(select(majors, -institution_name),
method = 'euclidean')
attributes(euclidean)
# hierarchical clustering
hier = hclust(euclidean)
attributes(hier)
# label by id
hier$labels
hier$labels = majors$institution_name
# plot dendrogram
ggdendrogram(hier, rotate = FALSE, size = 2)
ggdendrogram(hier, rotate = TRUE, size = 2)
```
Do the schools we identified as candidates for grants cluster by area of study?
```{r}
# extract the dendrogram data
dendro_data = dendro_data(hier)
attributes(dendro_data)
dendro_data$labels = unique(merge(dendro_data$labels,
select(college_features, institution_name, cluster),
by.x = 'label',
by.y = 'institution_name',
all.x = TRUE))
ggplot(segment(dendro_data)) +
geom_segment(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_text(data = label(dendro_data),
aes(label = label, x = x, y = 0, hjust = 0, color = cluster),
size = 2) +
coord_flip() +
scale_y_reverse(expand = c(0.25, 0)) +
theme_minimal() +
theme(legend.position = 'bottom')
```