# Clustering

If you begin with unlabeled data, you can use clustering to create class labels. From there, you could apply a supervised learner such as decision trees to find the most important predictors of these classes. This is called <b>semi-supervised learning</b>.

K-means strengths:
+ uses simple principles that can be explained in non-statistical terms
+ is highly flexible, and can be adapted with simple adjustments to address nearly all of its shortcomings
+ performs well enough in many real-world use cases

K-means weaknesses:
+ is not as sophisticated as more modern clustering algorithms
+ uses an element of random chance, so it is not guaranteed to find the optimal set of clusters
+ requires a reasonable guess as to how many clusters naturally exist in the data
+ is not ideal for non-spherical clusters or clusters of widely varying density
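
Because the starting centers are chosen at random, two runs can return different clusters. A minimal sketch of two common mitigations, assuming made-up toy data (the `toy` matrix is not part of the teen dataset): fix the seed for reproducibility, and let `kmeans()` try several random starts through its `nstart` argument.

```r
# Toy data (made up for illustration): two well-separated groups of 50 points.
set.seed(42)  # fix the random starting centers so results are reproducible
toy = rbind(matrix(rnorm(100, mean = 0), ncol = 2),
            matrix(rnorm(100, mean = 4), ncol = 2))

# nstart = 25 runs k-means from 25 random starts and keeps the best solution,
# i.e. the one with the lowest total within-cluster sum of squares.
fit = kmeans(toy, centers = 2, nstart = 25)

fit$size          # number of points in each cluster
fit$tot.withinss  # objective value of the retained solution
```

Multiple starts reduce the risk of a poor local optimum but still do not guarantee the global one.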

## algorithm

1. Assign examples to an initial set of k clusters.
2. Update the assignments by adjusting the cluster boundaries according to the examples that currently fall into the cluster.
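
The two steps can be sketched as a short loop. This is an illustrative toy version, not the implementation behind `stats::kmeans()`; the function name `simple_kmeans` and the toy data are hypothetical, and closeness is measured with squared Euclidean distance.

```r
# Illustrative k-means loop (hypothetical helper; use stats::kmeans() in practice).
simple_kmeans = function(x, k, max_iter = 100) {
  x = as.matrix(x)
  # Step 1: pick k distinct rows as the initial cluster centers.
  centers = x[sample(nrow(x), k), , drop = FALSE]
  assignment = rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every example to every center (n x k matrix).
    d = sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_assignment = max.col(-d, ties.method = 'first')  # index of nearest center
    # Stop once no example changes cluster.
    if (all(new_assignment == assignment)) break
    assignment = new_assignment
    # Step 2: move each center to the mean of its current members.
    for (j in seq_len(k)) {
      members = x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] = colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}

# Toy data (made up): two groups of 20 points each.
set.seed(1)
toy = rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
            matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit = simple_kmeans(toy, k = 2)
table(fit$cluster)
```

The loop stops when the assignments no longer change, at which point each center is the mean of the examples assigned to it.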

If n indicates the number of features, the formula for Euclidean distance between example x and example y is:

$$dist(x,y)=\sqrt{\sum_{i=1}^{n}{(x_i-y_i)}^2}$$
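
As a quick check of the formula, here is a small sketch with two made-up examples of n = 3 features; `euclidean` is a hypothetical helper, compared against the built-in `dist()`.

```r
# Direct translation of the formula: square root of the summed squared differences.
euclidean = function(x, y) sqrt(sum((x - y)^2))

x = c(1, 2, 3)
y = c(4, 6, 3)

d_manual = euclidean(x, y)                 # sqrt(3^2 + 4^2 + 0^2) = 5
d_builtin = as.numeric(dist(rbind(x, y)))  # stats::dist() defaults to Euclidean distance

d_manual
d_builtin
```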

## finding teen market segments using k-means clustering

In [1]:
teens = read.csv('../../R_projects/MLwR/Machine Learning with R (2nd Ed.)//Chapter 09/snsdata.csv')

In [2]:
str(teens)

'data.frame':	30000 obs. of  40 variables:
 $ gradyear    : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
 $ gender      : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ...
 $ age         : num  19 18.8 18.3 18.9 19 ...
 $ friends     : int  7 0 69 0 10 142 72 17 52 39 ...
 $ basketball  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ football    : int  0 1 1 0 0 0 0 0 0 0 ...
 $ soccer      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ softball    : int  0 0 0 0 0 0 0 1 0 0 ...
 $ volleyball  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ swimming    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cheerleading: int  0 0 0 0 0 0 0 0 0 0 ...
 $ baseball    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ tennis      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ sports      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cute        : int  0 1 0 1 0 0 0 0 0 1 ...
 $ sex         : int  0 0 0 0 1 1 0 2 0 0 ...
 $ sexy        : int  0 0 0 0 0 0 0 1 0 0 ...
 $ hot         : int  0 0 0 0 0 0 0 0 0 1 ...
 $ kissed      : int  0 0 0 0 5 0 0 0 0 0 ...
 $ dance       : int

In [3]:
table(teens$gender)


    F     M 
22054  5222 

In [5]:
table(teens$gender, useNA = 'ifany')


    F     M  <NA> 
22054  5222  2724 

In [6]:
summary(teens$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  3.086  16.312  17.287  17.994  18.259 106.927    5086 

In [15]:
teens$age = ifelse(teens$age >= 13 & teens$age < 20, teens$age, NA)

In [16]:
summary(teens$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  13.03   16.30   17.27   17.25   18.22   20.00    5523 

In [21]:
teens$female = ifelse(teens$gender == 'F' 
                     & !is.na(teens$gender), 1, 0)

In [22]:
teens$no_gender = ifelse(is.na(teens$gender), 1, 0)
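
The `!is.na()` guard matters because comparing `NA` to a value yields `NA`, which `ifelse()` would pass through. A small sketch on a made-up vector (the `gender` vector below is illustrative, not the real column):

```r
gender = c('F', 'M', NA)

# Without the guard, the missing value stays missing.
naive = ifelse(gender == 'F', 1, 0)

# With the guard, NA & FALSE evaluates to FALSE, so the missing value is coded 0.
female = ifelse(gender == 'F' & !is.na(gender), 1, 0)

# A separate indicator records which values were missing.
no_gender = ifelse(is.na(gender), 1, 0)

naive      # 1 0 NA
female     # 1 0 0
no_gender  # 0 0 1
```

With both indicators in place, male users are the rows where `female` and `no_gender` are both 0.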

In [24]:
table(teens$gender, useNA = 'ifany')


    F     M  <NA> 
22054  5222  2724 

In [25]:
table(teens$female, useNA = 'ifany')


    0     1 
 7946 22054 

In [26]:
table(teens$no_gender, useNA = 'ifany')


    0     1 
27276  2724 

In [27]:
mean(teens$age)  # returns NA because age still contains missing values

In [28]:
mean(teens$age, na.rm = TRUE)

In [29]:
aggregate(data = teens, age ~ gradyear, mean, na.rm = TRUE)

| gradyear | age |
|---|---|
| 2006 | 18.65586 |
| 2007 | 17.70617 |
| 2008 | 16.7677 |
| 2009 | 15.81957 |


In [32]:
ave_age = ave(teens$age, teens$gradyear, FUN = function(x) mean(x, na.rm = TRUE))

In [34]:
teens$age = ifelse(is.na(teens$age), ave_age, teens$age)
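
To see what `ave()` returns, here is a small sketch on a made-up data frame (`toy` is illustrative only): it produces a vector as long as the input in which every element holds the mean of its own group, so it can be used directly to fill the gaps.

```r
toy = data.frame(gradyear = c(2006, 2006, 2006, 2007, 2007),
                 age      = c(18, 19, NA, 17, NA))

# Group mean repeated for every row of the group, ignoring missing ages.
group_mean = ave(toy$age, toy$gradyear, FUN = function(x) mean(x, na.rm = TRUE))
group_mean  # 18.5 18.5 18.5 17.0 17.0

# Replace only the missing ages with the mean of the same graduation year.
toy$age = ifelse(is.na(toy$age), group_mean, toy$age)
toy$age     # 18.0 19.0 18.5 17.0 17.0
```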

In [35]:
summary(teens$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.03   16.28   17.24   17.24   18.21   20.00 

In [37]:
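# General kmeans() syntax; mydata and k are placeholders, so this cell is a template
library(stats)  # attached by default in R; shown for completeness

myclusters <- kmeans(mydata, k)  # mydata: numeric matrix or data frame, k: number of clusters

myclusters$cluster  # cluster assignment for each example

myclusters$centers  # mean value of each feature for each cluster

myclusters$size  # number of examples in each cluster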

In [38]:
interests = teens[5:40]

In [39]:
interests_z = as.data.frame(lapply(interests, scale))

Standardize each feature to a z-score (mean 0, standard deviation 1) so that a value shows whether someone mentioned a topic more or less often <b>than the average teenager.</b>
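
A small sketch of the same standardization on made-up counts (the `mentions` data frame is illustrative only): after `scale()`, each column has mean 0 and standard deviation 1, and a positive value means more mentions than average.

```r
mentions = data.frame(football = c(0, 0, 1, 5),
                      shopping = c(2, 2, 2, 10))

# Same pattern as above: scale() each column, then rebuild a data frame.
mentions_z = as.data.frame(lapply(mentions, scale))

# Manual z-score of the last football value for comparison.
z_manual = (5 - mean(mentions$football)) / sd(mentions$football)

round(colMeans(mentions_z), 10)  # both column means are 0
sapply(mentions_z, sd)           # both standard deviations are 1
z_manual                         # matches mentions_z$football[4]
```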

How many clusters? Teenage characters are often identified by five stereotypes: a brain, an athlete, a basket case, a princess, and a criminal, so k = 5 is a reasonable starting point.
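
When no such prior knowledge is available, one common heuristic (not used in this notebook) is to compare the total within-cluster sum of squares across several values of k and look for the point where the improvement levels off. A hedged sketch on made-up data with three groups; the `toy` matrix and the range of k are assumptions for illustration.

```r
# Toy data (made up): three well-separated groups of 30 points each.
set.seed(123)
toy = rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
            matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
            matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2))

# Total within-cluster sum of squares for k = 1..6.
ks = 1:6
wss = sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

data.frame(k = ks, wss = round(wss, 1))
```

On data like this, the value drops sharply up to the true number of groups and only slowly afterwards.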

In [41]:
set.seed(2345)

In [42]:
teen_clusters = kmeans(interests_z, 5)

In [43]:
teen_clusters$size

In [44]:
teen_clusters$centers

| cluster | basketball | football | soccer | softball | volleyball | swimming | cheerleading | baseball | tennis | sports | ⋯ | blonde | mall | shopping | clothes | hollister | abercrombie | die | death | drunk | drugs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.16001227 | 0.2364174 | 0.10385512 | 0.07232021 | 0.18897158 | 0.23970234 | 0.3931445 | 0.02993479 | 0.13532387 | 0.10257837 | ⋯ | 0.0613734 | 0.60368108 | 0.79806891 | 0.5651537331 | 4.1521844 | 3.9649381 | 0.043475966 | 0.09857501 | 0.035614771 | 0.03443294 |
| 2 | -0.09195886 | 0.0652625 | -0.09932124 | -0.01739428 | -0.06219308 | 0.03339844 | -0.1101103 | -0.1148751 | 0.04062204 | -0.09899231 | ⋯ | -0.01146396 | -0.08724304 | -0.03865318 | -0.0003526292 | -0.16783 | -0.14129577 | 0.009447317 | 0.05135888 | -0.08677322 | -0.06878491 |
| 3 | 0.52755083 | 0.487348 | 0.29778605 | 0.37178877 | 0.37986175 | 0.29628671 | 0.3303485 | 0.35231971 | 0.14057808 | 0.3296713 | ⋯ | 0.03471458 | 0.48318495 | 0.66327838 | 0.375972512 | -0.0553846 | -0.07417839 | 0.037989066 | 0.1197219 | -0.009688746 | -0.05973769 |
| 4 | 0.34081039 | 0.3593965 | 0.1272225 | 0.16384661 | 0.110322 | 0.26943332 | 0.1856664 | 0.27527088 | 0.10980958 | 0.7971192 | ⋯ | 0.36134138 | 0.62256686 | 0.27101815 | 1.2306917174 | 0.1610784 | 0.26324494 | 1.71218187 | 0.93631312 | 1.8973882 | 2.73326605 |
| 5 | -0.16695523 | -0.1641499 | -0.0903352 | -0.11367669 | -0.11682181 | -0.10595448 | -0.1136077 | -0.10918483 | -0.05097057 | -0.13135334 | ⋯ | -0.02918252 | -0.18625656 | -0.22865236 | -0.1865419798 | -0.1557662 | -0.14861104 | -0.09487518 | -0.08370729 | -0.087520105 | -0.11423381 |


Cluster 1: has the most mentions of "cheerleading" and "hot", above-average "football", and very high "hollister" and "abercrombie" (Princesses?)

Cluster 3: well above average on nearly every sport (Athletes?)

Cluster 4: far above average on "die", "death", "drunk", and "drugs" (Criminals?)

Cluster 5 is unexceptional. Its members had lower-than-average levels of interest in every measured activity. It is also the largest group by number of members. One potential explanation: these users created a profile but never posted.

Based on the centers above, the clusters can be labeled as follows:

Cluster 1: Princesses

Cluster 2: Brains

Cluster 3: Athletes

Cluster 4: Criminals

Cluster 5: Basket Cases

Apply the clusters to the original data:

In [45]:
teens$cluster = teen_clusters$cluster

Examine how the cluster assignment relates to individual characteristics:

In [46]:
teens[1:5, c('cluster', 'gender', 'age', 'friends')]

| cluster | gender | age | friends |
|---|---|---|---|
| 5 | M | 18.982 | 7 |
| 3 | F | 18.801 | 0 |
| 5 | M | 18.335 | 69 |
| 5 | F | 18.875 | 0 |
| 4 | NA | 18.995 | 10 |


Average age of each group:

In [47]:
aggregate(data = teens, age ~ cluster, mean)

| cluster | age |
|---|---|
| 1 | 16.86497 |
| 2 | 17.39037 |
| 3 | 17.07656 |
| 4 | 17.11957 |
| 5 | 17.29849 |


Although we didn't use gender data to create the clusters, they are still predictive of gender. The share of female users in each cluster:

In [48]:
aggregate(data = teens, female ~ cluster, mean)

| cluster | female |
|---|---|
| 1 | 0.8381171 |
| 2 | 0.725 |
| 3 | 0.8378198 |
| 4 | 0.8027079 |
| 5 | 0.6994515 |


The clusters also relate to the number of friends users have. Average number of friends per cluster:

In [49]:
aggregate(data = teens, friends ~ cluster, mean)

| cluster | friends |
|---|---|
| 1 | 41.43054 |
| 2 | 32.57333 |
| 3 | 37.16185 |
| 4 | 30.5029 |
| 5 | 27.70052 |
