# DMA Fall 19

In [0]:
NAME = "Fan Zhang"
COLLABORATORS = ""

---

# Lab 2: Clustering

**Please read the following instructions very carefully**

## About the Dataset
The dataset for this lab has been created from some custom features from Lab 1. The columns are named q1, q2, and so on. A description of the features can be found at this link: https://docs.google.com/spreadsheets/u/1/d/1PJGxD8GzXL6xb4zuyWmtlgBbpafkCdVemLLK6WDsBXg/edit?usp=sharing

## Working on the assignment / FAQs
- **Always use the seed/random_state as *42* wherever applicable** (This is to ensure repeatability in answers, across students and coding environments) 
- Questions are either autograded or manually graded.
- The type of question and the points it carries are indicated in each question cell.
- An autograded question has 3 cells
     - **Question cell** : Read only cell containing the question
     - **Code Cell** : This is where you write the code
     - **Grading cell** : This is where the grading occurs, and **you are required not to edit this cell**
- Manually graded questions only have the question and code cells.
- To avoid any ambiguity, each question also specifies what *value* the function must return. Note that these are dummy values and not the answers.
- If an autograded question has multiple answers (due to differences in handling NaNs, zeros, etc.), all answers will be considered.
- Most assignments have bonus questions for extra credit; do try them out!
- You can delete the `raise NotImplementedError()` for all manually graded questions.
- **Submitting the assignment** : Download the '.ipynb' file from Colab and upload it to bcourses. Do not delete any outputs from cells before submitting.
- That's about it. Happy coding! 
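
As a small illustration of why the fixed seed matters, the sketch below fits `KMeans` twice with `random_state=42` and compares the labels. The data is synthetic and the helper `fit_labels` is a hypothetical name used only for this example; it is not part of the lab.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data for illustration only (not the lab dataset).
rng = np.random.RandomState(0)
X = rng.rand(200, 3)

def fit_labels(seed):
    """Fit k-means with a fixed seed and return the cluster labels."""
    return KMeans(n_clusters=4, n_init=10, random_state=seed).fit_predict(X)

first = fit_labels(42)
second = fit_labels(42)

# Same seed, same data, same settings: the two label arrays are compared.
print(np.array_equal(first, second))
```

Without a fixed seed, the initial centroids differ between runs, so cluster numbering and sometimes the clusters themselves can change.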

In [20]:
import pandas as pd
import collections
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np
from sklearn.preprocessing import normalize

import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
matplotlib.style.use('ggplot')



#DOWNLOADING DATASET
!wget -nc http://people.ischool.berkeley.edu/~zp/course_datasets/yelp_reviewers.zip
!unzip -u yelp_reviewers.zip
print('Dataset Downloaded: yelp_reviewers.csv')
df = pd.read_csv('yelp_reviewers.csv')
df = df.sample(frac=0.3, random_state=42)
print(df.dropna().describe())

print('....SETUP COMPLETE....')

File ‘yelp_reviewers.zip’ already there; not retrieving.

Archive:  yelp_reviewers.zip
Dataset Downloaded: yelp_reviewers.csv
                q3           q4  ...        q16ab        q16ac
count  7177.000000  7177.000000  ...  7177.000000  7177.000000
mean      6.838651     5.281455  ...     1.127751     3.649254
std       7.597977    16.208703  ...     4.652206     0.977100
min       1.000000     1.000000  ...     0.000000     1.000000
25%       3.000000     1.000000  ...     0.000000     3.200000
50%       5.000000     2.000000  ...     0.500000     3.777778
75%       9.000000     4.000000  ...     1.307692     4.333333
max     252.000000   607.000000  ...   342.300000     5.000000

[8 rows x 40 columns]
....SETUP COMPLETE....


---

### Question 1 `(1 point)`
What is the best choice of k according to the silhouette metric for clustering q4-q6? Only consider 2 <= k <= 8. 


**NOTE**: For features with high variance, empty clusters can occur. There are several ways of dealing with empty clusters. A common approach is to drop them; the preferred approach for this lab is to treat the empty cluster as a “singleton”, leaving it empty with a single-point placeholder.


In [0]:
#Make sure you return the answer value in this function
#The return value must be an integer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
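def q1(df):
  # Features q4-q6 as a NumPy array
  ndf = df[['q4','q5','q6']].values
  best_k = None
  best_score = -1  # silhouette scores lie in [-1, 1]
  for k in range(2,9):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(ndf)
    score = silhouette_score(ndf, kmeans.labels_)
    # Keep the k with the highest silhouette score so far
    if score > best_score:
      best_k = k
      best_score = score

  return best_k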

    # YOUR CODE HERE
 
    
print(q1(df))

What is the best choice of k? 

In [0]:
# YOUR ANSWER HERE
2


### Question 2 `(1 point)`
What is the best choice of k according to the silhouette metric for clustering q7-q10? Only consider 2 <= k <= 8. 

In [0]:
#Make sure you return the answer value in this function
#The return value must be an integer
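def q2(df):
  # Features q7-q10, with rows containing NaNs removed
  ndf = df[['q7','q8','q9','q10']].dropna().values
  best_k = None
  best_score = -1  # silhouette scores lie in [-1, 1]
  for k in range(2,9):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(ndf)
    score = silhouette_score(ndf, kmeans.labels_)
    # Keep the k with the highest silhouette score so far
    if score > best_score:
      best_k = k
      best_score = score

  return best_k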
    
    # YOUR CODE HERE
    #raise NotImplementedError()
print(q2(df))

What is the best choice of k? 

In [0]:
# YOUR ANSWER HERE
2

### Question 3 `(1 point)`
What is the best choice of k according to the silhouette metric for clustering q11-q13? Only consider 2 <= k <= 8. 

In [0]:
#Make sure you return the answer value in this function
#The return value must be an integer
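def q3(df):
  # Features q11-q13, with rows containing NaNs removed
  ndf = df[['q11','q12','q13']].dropna().values
  best_k = None
  best_score = -1  # silhouette scores lie in [-1, 1]
  for k in range(2,9):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(ndf)
    score = silhouette_score(ndf, kmeans.labels_)
    # Keep the k with the highest silhouette score so far
    if score > best_score:
      best_k = k
      best_score = score

  return best_k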
    
print(q3(df))

What is the best choice of k?

In [0]:
# YOUR ANSWER HERE
8

### Question 4 `(1 point)`
Consider the best clustering from Question 3 and list the number of data points in each cluster.

In [0]:
#Make sure you return the answer value in this function
#The return value must be a dictionary. Eg : {0:1000,1:500,2:1460}
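def q4(df):
    ndf = df[['q11','q12','q13']].dropna().values
    # k=8 is the best k found in Question 3
    kmeans = KMeans(n_clusters=8, random_state=42)
    kmeans.fit(ndf)
    labels = kmeans.labels_.tolist()
    # Map each cluster label to the number of points assigned to it
    counts = {}
    for i in range(8):
        counts[i] = labels.count(i)
    return counts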
    # YOUR CODE HERE
    #raise NotImplementedError()

In [0]:
#This is an autograded cell, do not edit
print(q4(df))

### Question 5 `(1 point)`
Consider the best clustering from Question 3. Were there clusters that represented very funny but useless reviewers (check the column definitions for the columns corresponding to funny, useless, etc.)? If so, print the center of that cluster.

In [0]:
#Make sure you return the answer value in this function
#The return value must be an Array. Eg : [10,30,54]
def q5(df):
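  ndf = df[['q11','q12','q13']].dropna().values
  kmeans = KMeans(n_clusters=8, random_state=42)
  kmeans.fit(ndf)
  # Index 6 is hard-coded as the "very funny but useless" cluster; cluster
  # numbering depends on the seed and data sample, so re-check the centers if either changes.
  return kmeans.cluster_centers_[6]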
    # YOUR CODE HERE
    #raise NotImplementedError()

In [0]:
#This is an autograded cell, do not edit
print(np.round_(q5(df), decimals=1, out=None))

### Question 6 `(1 point)`
Consider the best clustering from Question 3. How many reviewers were in the cluster that represented relatively equal strength in all voting categories?

In [0]:
#Make sure you return the answer value in this function
#The return value must be an Array. Eg : [10,30,54]
def q6(df):
  return q4(df)[7]
    
    # YOUR CODE HERE
    #raise NotImplementedError()

In [0]:
#This is an autograded cell, do not edit
print(q6(df))

### Question 7 `(1 point)`
Cluster the dataset using $k = 5$ and using features q7-q15 (refer to the column descriptions if needed).
What is the silhouette metric for this clustering?
For a more in-depth understanding of cluster analysis with silhouette, look [here](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html)
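
To make the metric a little more concrete, here is a minimal sketch on synthetic blobs (not the lab data). It uses `silhouette_samples`, which returns one coefficient per point, and shows that `silhouette_score` is the mean of those values. The per-cluster breakdown in `per_cluster` is an illustrative addition, not something the lab asks for.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic blobs (illustration only, not the lab data).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# One silhouette coefficient per sample, each in [-1, 1].
per_sample = silhouette_samples(X, labels)

# silhouette_score is the mean of the per-sample coefficients.
overall = silhouette_score(X, labels)

# Average silhouette within each cluster; low values flag weak clusters.
per_cluster = {int(c): float(per_sample[labels == c].mean())
               for c in np.unique(labels)}

print(round(float(overall), 3))
print(per_cluster)
```

A cluster whose average is far below the overall score is a hint that its points sit close to a neighboring cluster.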

In [0]:
#Make sure you return the answer value in this function
#The return value must be a float
def q7(df):
  ndf = df[['q7','q8','q9','q10','q11','q12','q13','q14','q15']].dropna().values
  kmeans = KMeans(n_clusters=5, random_state=42)
  kmeans.fit(ndf)
  kmeans.labels_
  score = silhouette_score(ndf, kmeans.labels_)
  return score
    # YOUR CODE HERE
    #raise NotImplementedError()

In [0]:
#This is an autograded cell, do not edit
print(q7(df))

### Question 8 `(1 point)`
Cluster the dataset using $k = 5$ and using features q7-q15 (refer to the column descriptions if needed).

What was the average q3 among the points in each of the clusters?

In [0]:
ndf = df[['q7','q8','q9','q10','q11','q12','q13','q14','q15']].dropna().values
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(ndf)
kmeans.labels_
kmeans.cluster_centers_
arr = kmeans.labels_
nndf = df[['q3','q7','q8','q9','q10','q11','q12','q13','q14','q15']].dropna()
nndf['group']= arr
nndf.groupby(['group']).mean()['q3'].array

In [0]:
#Make sure you return the answer value in this function
#The return value must be an Array. Eg : [10,30,54]
def q8(df):
  features = ['q7','q8','q9','q10','q11','q12','q13','q14','q15']
  # Drop NaNs once, with q3 included, so the labels line up with the rows
  nndf = df[['q3'] + features].dropna()
  kmeans = KMeans(n_clusters=5, random_state=42)
  kmeans.fit(nndf[features].values)
  nndf['group'] = kmeans.labels_
  # Mean q3 per cluster, ordered by cluster label
  return nndf.groupby('group')['q3'].mean().to_numpy()
    # YOUR CODE HERE
    

In [0]:
#This is an autograded cell, do not edit
print(np.round_(q8(df), decimals=1, out=None))

### Question 9 `(2 points)`
**This question will be manually graded.**

Cluster the dataset using all features in the dataset

We can drop features with a high incidence of -Inf, blank, or NaN values. It is suggested that you perform some form of normalization on the question 16 features so as not to bias the clustering toward the larger-magnitude features. Let's do that now.

#### Data Cleansing and Normalization ####
Check how many null values there are in each column.

In [0]:
# YOUR CODE HERE
df.isnull().sum()


It looks like q8 - q13 and q16ab have a lot of null values, especially q8 and q9. Let's see how many rows survive `dropna()` with all columns kept, and then how many survive once q8 and q9 are removed.
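
As a small illustration of that comparison, the sketch below uses a toy frame with hypothetical values (the column names `a`, `b`, `c` are made up). It counts nulls per column, then compares how many rows survive `dropna()` with and without the sparsest column.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); column "b" is mostly missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [5.0, np.nan, 7.0, 8.0],
})

# Missing values per column.
null_counts = toy.isnull().sum()

# Rows that survive dropna() with every column kept.
rows_all = toy.dropna().shape[0]

# Rows that survive if the sparsest column is dropped first.
sparsest = null_counts.idxmax()
rows_without = toy.drop(columns=[sparsest]).dropna().shape[0]

print(null_counts.to_dict())
print(rows_all, rows_without)
```

Here one sparse column is responsible for most of the lost rows, which is the same trade-off being weighed for q8 and q9.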

In [0]:
# YOUR CODE HERE
df.dropna().shape

In [0]:
ndf = df.drop(['q8','q9'],axis=1).dropna()
ndf.shape

By removing 2 features, we effectively have double the number of rows remaining. That's pretty good.  
Preprocess categorical variables to dummy values.
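
For readers unfamiliar with dummy encoding, here is a minimal sketch on a toy frame. The column `level` and its values are hypothetical; `dtype=float` is an optional choice made here so the indicator columns are numeric rather than boolean.

```python
import pandas as pd

# Toy frame; "level" is a hypothetical categorical column.
toy = pd.DataFrame({
    "votes": [3, 10, 1],
    "level": ["low", "high", "low"],
})

# One indicator column per category; numeric columns pass through unchanged.
# dtype=float keeps the indicators numeric rather than boolean.
encoded = pd.get_dummies(toy, columns=["level"], dtype=float)

print(list(encoded.columns))
print(encoded)
```

Each category becomes its own 0/1 column, so k-means can compute distances on it like on any other numeric feature.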

In [0]:
# YOUR CODE HERE
ndf=ndf.drop(['user_id'],axis=1)
ndf=pd.get_dummies(data=ndf,columns=['q16s','q16t'])

Now normalize the remaining values
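
One detail worth keeping in mind: `sklearn.preprocessing.normalize`, which this notebook imports, rescales each row (sample) to unit norm by default and returns a new array. The sketch below contrasts it with `StandardScaler`, a per-column alternative that the notebook does not use and that is shown here only for comparison. The toy matrix is made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Toy matrix: the second feature is roughly 1000 times larger than the first.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])

# normalize() rescales each ROW to unit L2 norm and returns a new array;
# X itself is left untouched, so the result must be assigned to be used.
row_scaled = normalize(X)

# StandardScaler rescales each COLUMN to zero mean and unit variance,
# which puts the features on a comparable scale.
col_scaled = StandardScaler().fit_transform(X)

print(np.linalg.norm(row_scaled, axis=1))
print(col_scaled.mean(axis=0), col_scaled.std(axis=0))
```

Which of the two is appropriate depends on the goal; if the concern is that large-magnitude features dominate the distance, per-column scaling addresses that more directly.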

In [0]:
# YOUR CODE HERE
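# normalize() returns a new array, so assign the result back.
# Note: by default it scales each row (sample) to unit L2 norm.
ndf = pd.DataFrame(normalize(ndf), columns=ndf.columns, index=ndf.index)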

Using the `sum of within cluster variance` metric with the elbow method, what was the best k?
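
For reference, the quantity scikit-learn exposes as `inertia_` is the sum of squared distances from each point to the center of its assigned cluster. The sketch below recomputes it by hand on synthetic data (not the lab data) to show what the elbow plot is tracking.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data for illustration only.
rng = np.random.RandomState(42)
X = rng.rand(150, 2)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Squared distance from every point to the center of its assigned cluster.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = float(((X - assigned_centers) ** 2).sum())

print(manual_inertia, km.inertia_)
```

Because this sum can only shrink as k grows, the elbow method looks for the k after which the decrease levels off rather than for a minimum.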

In [0]:
# YOUR CODE HERE
#raise NotImplementedError()
Sum_of_squared_distances = []
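for k in range(1,8):
    # Fixed seed, as required by the lab instructions
    km = KMeans(n_clusters=k, random_state=42)
    km = km.fit(ndf)
    # inertia_ is the within-cluster sum of squared distances
    Sum_of_squared_distances.append(km.inertia_)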
plt.plot(range(1,8), Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('sum of within cluster variance')
plt.title('Elbow Method')
plt.show()
# k=4

### Question 10 `(1 point)`
**This question will be manually graded.**

For this question, please come up with your own question about this dataset and use a clustering technique as part of your method of answering it. Briefly describe the question and how clustering can answer it.


In [0]:
# If we use the elbow method on Question 3, is the optimal number of clusters different?
ddf = df[['q11','q12','q13']].dropna().values
Sum_of_squared_distances = []
for k in range(1,15):
    # Fixed seed, as required by the lab instructions
    km = KMeans(n_clusters=k, random_state=42)
    km = km.fit(ddf)
    Sum_of_squared_distances.append(km.inertia_)
plt.plot(range(1,15), Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('sum of within cluster variance')
plt.title('Elbow Method')
plt.show()
# k=5 is the optimal number of clusters according to the elbow method.

In [0]:
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(ddf)
kmeans.labels_
score = silhouette_score(ddf, kmeans.labels_)
print(score)

In [0]:
kmeans = KMeans(n_clusters=8, random_state=42)
kmeans.fit(ddf)
kmeans.labels_
score = silhouette_score(ddf, kmeans.labels_)
print(score)
# However, the silhouette score favors k=8 over k=5, so the elbow method and the silhouette metric do not always agree.

## Bonus question (`2 Points`) - Reviewer overlap:
- Download last week's dataset
- Aggregate cool, funny and useful votes for each business id
- You may transform the aggregations (take %, log, or leave it as it is)
- Cluster this dataframe (you can choose k). Do you find any meaningful/interesting clusters?
- Assign the cluster label to each business id
- Merge this with users to show what clusters the reviewers have reviewed. (You may need to use the pivot function) 
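
The last two steps could be sketched roughly as follows. Everything here is hypothetical: the toy frames, the column names `user_id`, `business_id`, and `cluster`, and the cluster labels themselves, which in practice would come from the clustering step. It is meant only to show the shape of the merge and pivot, not a complete answer.

```python
import pandas as pd

# Hypothetical review-level data: one row per (user, business) review.
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3"],
    "business_id": ["b1", "b2", "b1", "b2", "b3"],
})

# Hypothetical cluster label per business, e.g. taken from KMeans labels_.
business_clusters = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "cluster": [0, 1, 1],
})

# Attach each business's cluster label to the reviews that mention it.
merged = reviews.merge(business_clusters, on="business_id", how="left")

# Rows are users, columns are clusters, cells count reviews in that cluster.
overlap = merged.pivot_table(index="user_id", columns="cluster",
                             values="business_id", aggfunc="count",
                             fill_value=0)

print(overlap)
```

Users with non-zero counts in several columns are the reviewers who overlap across clusters.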

In [0]:
# YOUR CODE HERE
#raise NotImplementedError()