# Lab 11 - Extended Exercises on Time Series Clustering

You are the Senior Data Scientist at a learning platform called LernTime. Your data science team built a data frame in which each row contains the aggregated features of one student (calculated over the first 5 weeks of interactions). The feature `dropout` indicates whether the student stopped using the platform (1) or not (0) before week 10.

The data frame is stored in the file `lerntime_dropout.csv` and contains the following features:
- `video_time`: total video time (in minutes)
- `num_sessions`: total number of sessions
- `num_quizzes`: total number of quiz attempts
- `reading_time`: total theory reading time
- `previous_knowledge`: standardized previous knowledge
- `browser_speed`: standardized browser speed
- `device`: whether the student logged in using a smartphone (1) or a computer (-1)
- `topics`: the topics covered by the user
- `education`: current level of education (0: middle school, 1: high school, 2: bachelor, 3: master, 4: Ph.D.).
- `dropout`: whether the student stopped using the platform (1) or not (0) before week 10.

In [1]:
import pandas as pd

# Data directory
DATA_DIR = "./../../data/"

In [2]:
import requests

exec(requests.get("https://courdier.pythonanywhere.com/get-send-code").content)

npt_config = {
    'session_name': 'lab-11',
    'session_owner': 'mlbd',
    'sender_name': input("Your name: "),
}

In [3]:
df = pd.read_csv(f'{DATA_DIR}/lerntime_dropout.csv')

In [4]:
df.head()

| Unnamed: 0 | video_time | num_sessions | num_quizzes | reading_time | previous_knowledge | browser_speed | device | topics | education | dropout |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45.793303 | 99.0 | 36.0 | 48.186562 | 1.675972 | -0.294704 | 1.0 | ['Locke', 'Descartes', 'Socrates', 'Kant', 'Ni... | 2.0 | 0 |
| 1 | 51.331242 | 57.0 | 12.0 | 49.94581 | 0.700522 | 1.253694 | 1.0 | ['Nietzche', 'Locke', 'Confucius', 'Aristotle'... | 3.0 | 0 |
| 2 | 87.414834 | 52.0 | 7.0 | 20.611978 | 1.836716 | -1.171352 | 1.0 | ['Plato', 'Locke', 'Nietzche', 'Socrates', 'De... | 4.0 | 0 |
| 3 | 58.556388 | 47.0 | 31.0 | 33.785805 | 0.209577 | -2.043047 | 1.0 | ['Aristotle', 'Socrates', 'Plato', 'Confucius'... | 3.0 | 0 |
| 4 | 74.822362 | 58.0 | 37.0 | 38.907983 | 0.265678 | -0.754559 | 1.0 | ['Kant', 'Aristotle', 'Confucius', 'Locke', 'P... | 4.0 | 0 |


You decide to explore the different types of users. You want to apply your knowledge from the ML4BD course and decide to cluster the users with Spectral Clustering.
In the course, you learned different ways of constructing the similarity graph, which yields the adjacency matrix that serves as input to Spectral Clustering.
Based on your in-depth exploration of the data, you decide to construct the similarity graph as a *k-nearest neighbor graph*.

Your tasks are to:

a) Write a function to compute the k-nearest neighbor graph.

b) Cluster the users using Spectral Clustering and your k-nearest neighbor graph function (use 4 neighbors). Use only the features *reading_time* and *topics*. You can assume that the optimal number of clusters is 2.


## a) Computation of the k-nearest neighbor graph 
Unfortunately, there is no k-nearest neighbor graph implementation available in scikit-learn and you therefore have to implement the function yourself.

The function `k_nearest_neighbor_graph` takes a similarity matrix `S` and the number of neighbors `k` as input and returns the adjacency matrix `W`.

Note that we will not evaluate the coding efficiency of your function. 

In [49]:
import numpy as np


def k_nearest_neighbor_graph(S, k):
    # S: similarity matrix (n x n)
    # k: number of neighbors
    np_S = np.array(S, dtype=float)
    # For each row of S, keep the k largest values and set the rest to 0.
    # Note: a diagonal self-similarity of 1 is among the k largest entries,
    # so it occupies one of the k slots.
    # Hint: use np.argsort and fancy indexing
    rows = np.arange(np_S.shape[0])[:, None]  # column vector of row indices
    indexes = np.argsort(np_S, axis=1)[:, -k:]  # column indices of the k largest entries per row
    print(indexes)
    print(rows)
    W = np.zeros_like(np_S)
    W[rows, indexes] = np_S[rows, indexes]
    return W

In [75]:
# np.argsort returns the indices that would sort the array,
# e.g. for [1, 3, 2] it returns [0, 2, 1]: the smallest element is at index 0, the second smallest at index 2 and the largest at index 1.
# Taking the last k entries of that result therefore gives THE INDICES of the k largest elements of the original array.

k = 3
# Please run this cell for evaluation purposes
S = [[1, 0.2, 0.7, 0.1],
     [0.2, 1, 0.8, 0.4],
     [0.7, 0.8, 1, 0.6],
     [0.1, 0.4, 0.6, 1]]

a = k_nearest_neighbor_graph(S, k)
print(a)
#send(a, 1)

[[1 2 0]
 [3 2 1]
 [0 1 2]
 [1 2 3]]
[[0]
 [1]
 [2]
 [3]]
[[1.  0.2 0.7 0. ]
 [0.  1.  0.8 0.4]
 [0.7 0.8 1.  0. ]
 [0.  0.4 0.6 1. ]]
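
The adjacency matrix printed above is not symmetric (for example, `W[0, 1]` is 0.2 while `W[1, 0]` is 0), and each row spends one of its `k` slots on the diagonal self-similarity. Spectral clustering is usually formulated on an undirected graph, so a common variant ignores the diagonal when ranking neighbors and then symmetrizes the result. The following is a minimal sketch of that variant, assuming a symmetric `S` and `k < n`; the function name `symmetric_knn_graph` and the `mutual` flag are hypothetical and not part of the interface required by the lab.

```python
import numpy as np


def symmetric_knn_graph(S, k, mutual=False):
    """Sketch of a symmetrized k-NN graph (hypothetical helper, not the lab's function)."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]

    # Rank neighbors without the diagonal, so a point is never its own neighbor
    ranked = S.copy()
    np.fill_diagonal(ranked, -np.inf)
    idx = np.argsort(ranked, axis=1)[:, -k:]

    # Directed k-NN relation: A[i, j] is True if j is among the k nearest neighbors of i
    A = np.zeros((n, n), dtype=bool)
    A[np.arange(n)[:, None], idx] = True

    # "mutual" keeps an edge only if both endpoints choose each other;
    # otherwise an edge is kept if at least one endpoint chooses the other
    A = (A & A.T) if mutual else (A | A.T)

    # Keep the original similarity as edge weight, 0 elsewhere
    return np.where(A, S, 0.0)


S = [[1, 0.2, 0.7, 0.1],
     [0.2, 1, 0.8, 0.4],
     [0.7, 0.8, 1, 0.6],
     [0.1, 0.4, 0.6, 1]]

W = symmetric_knn_graph(S, k=2)
print(W)
print(np.allclose(W, W.T))
```

Whether the diagonal should count as a neighbor, and whether the "either" or the "mutual" rule is preferable, depends on the convention used in the course; the automatic evaluation in this lab may expect the plain row-wise version.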


In [13]:
# Please run this cell for evaluation purposes
S = [[1, 0.3, 0.01, 0.1],
     [0.3, 1, 0.8, 0.9],
     [0.01, 0.8, 1, 0.6],
     [0.1, 0.9, 0.6, 1]]

a = k_nearest_neighbor_graph(S, k)
print(a)
send(a, 2)

[[1. 1. 0. 0.]
 [0. 1. 0. 1.]
 [0. 1. 1. 0.]
 [0. 1. 0. 1.]]


<Response [200]>

## b) Spectral Clustering 
Perform a spectral clustering using a k-nearest neighbor graph (with 4 neighbors). 

Use the two features `reading_time` and `topics` only. 

If you did not manage to solve task a), use a *fully connected graph* as similarity graph to obtain the adjacency matrix `W`. 

You can assume that the optimal number of clusters is 2. 

Print the obtained cluster labels. 

In [52]:
from sklearn.manifold import spectral_embedding
from sklearn.cluster import KMeans
from scipy import linalg
from scipy.sparse.csgraph import laplacian

def spectral_clustering(W, n_clusters, random_state=111):
    """
    Spectral clustering
    :param W: np array of adjacency matrix
    :param n_clusters: number of clusters
    :param random_state: seed used for the spectral embedding and for k-means
    :return: tuple (kmeans, proj_X, eigenvals_sorted)
        WHERE
        kmeans scikit learn clustering object
        proj_X is np array of transformed data points
        eigenvals_sorted is np array with ordered eigenvalues 
        
    """
    # Compute the sorted eigenvalues of the normalized Laplacian (used for the eigengap heuristic)
    L = laplacian(W, normed=True)
    eigenvals, _ = linalg.eig(L)
    eigenvals = np.real(eigenvals)
    eigenvals_sorted = eigenvals[np.argsort(eigenvals)]

    # Create embedding
    random_state = np.random.RandomState(random_state)
    proj_X = spectral_embedding(W, n_components=n_clusters,
                              random_state=random_state,
                              drop_first=False)

    # Cluster the points using k-means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state = random_state)
    kmeans.fit(proj_X)

    return kmeans, proj_X, eigenvals_sorted
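
The function returns `eigenvals_sorted` so that the eigengap heuristic can be applied: the number of clusters is taken to be the position of the largest gap between consecutive small eigenvalues of the normalized Laplacian. Below is a minimal sketch of that idea on a toy adjacency matrix with two weakly linked groups. The helper `eigengap_num_clusters`, its `max_clusters` parameter, and the matrix `W_toy` are hypothetical and only serve as illustration.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian


def eigengap_num_clusters(W, max_clusters=10):
    """Sketch of the eigengap heuristic (hypothetical helper)."""
    L = laplacian(np.asarray(W, dtype=float), normed=True)
    # L is symmetric for a symmetric W, so eigvalsh returns real eigenvalues in ascending order
    eigenvals = np.linalg.eigvalsh(L)
    m = min(max_clusters, len(eigenvals) - 1)
    # gaps[i] is the difference between the (i+2)-th and (i+1)-th smallest eigenvalue
    gaps = np.diff(eigenvals[:m + 1])
    return int(np.argmax(gaps)) + 1, eigenvals


# Toy graph: two triangles joined by a single weak edge (weight 0.01)
W_toy = np.array([
    [0.00, 1, 1, 0.01, 0, 0],
    [1.00, 0, 1, 0.00, 0, 0],
    [1.00, 1, 0, 0.00, 0, 0],
    [0.01, 0, 0, 0.00, 1, 1],
    [0.00, 0, 0, 1.00, 0, 1],
    [0.00, 0, 0, 1.00, 1, 0],
])

n_clusters, eigenvals = eigengap_num_clusters(W_toy)
print(n_clusters)
print(np.round(eigenvals, 3))
```

In this toy graph the two smallest eigenvalues are close to zero and the remaining ones are close to 1.5, so the largest gap appears after the second eigenvalue.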

In [79]:
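import ast

reading_time = df['reading_time'].values
# Each entry of `topics` is the string representation of a Python list;
# parse it with ast.literal_eval, which is safer than eval
topics = [ast.literal_eval(t) for t in df['topics'].values]
# The lists have different lengths, so store them in a 1-D object array
topics_np = np.array(topics, dtype=object)
topics_np.shape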


(300,)

In [83]:
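# Reading-time similarity matrix: Gaussian (RBF) kernel via pairwise_kernels from sklearn.metrics.pairwise
# Note: the default gamma is 1 / n_features = 1; pass gamma=... if the kernel is too narrow for the raw scale
from sklearn.metrics.pairwise import pairwise_kernels
reading_similarity = pairwise_kernels(reading_time.reshape(-1, 1), metric='rbf')
print(reading_similarity.shape)
# Topics form a set per student, so we can use the Jaccard similarity |A & B| / |A | B|.
# pairwise_kernels requires numeric input and scipy's jaccard is a distance on boolean vectors,
# so the similarity is computed pair by pair on Python sets instead.
topic_sets = [set(t) for t in topics]
topic_similarity = np.array([[len(a & b) / len(a | b) if (a | b) else 1.0
                              for b in topic_sets] for a in topic_sets])

(300, 300)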

In [None]:
cluster_labels = []
send(cluster_labels, 3)
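
The remaining steps of task b) are to combine the two similarity matrices, build the 4-nearest-neighbor graph, and cluster it. The sketch below walks through these steps on small synthetic data, not on the `lerntime` values. Several choices in it are assumptions rather than the lab's prescribed solution: the reading time is standardized before applying the RBF kernel, the two similarities are averaged with equal weight, the k-NN graph is symmetrized with an element-wise maximum, and scikit-learn's `SpectralClustering` with `affinity='precomputed'` stands in for the `spectral_clustering` helper defined earlier.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

# Synthetic stand-ins for the lab data (assumption: two clearly separated user groups)
reading_time = np.concatenate([np.linspace(18, 22, 20), np.linspace(48, 52, 20)])
topics = [{'Plato', 'Kant'}] * 20 + [{'Locke', 'Confucius'}] * 20

# Reading-time similarity: RBF kernel on the standardized feature (assumption)
z = (reading_time - reading_time.mean()) / reading_time.std()
reading_similarity = rbf_kernel(z.reshape(-1, 1))

# Topic similarity: Jaccard similarity between the topic sets
topic_similarity = np.array([[len(a & b) / len(a | b) for b in topics] for a in topics])

# Combined similarity: equal-weight average (assumption)
S = (reading_similarity + topic_similarity) / 2

# k-nearest neighbor graph with k = 4: keep the 4 largest entries per row
k = 4
rows = np.arange(S.shape[0])[:, None]
idx = np.argsort(S, axis=1)[:, -k:]
W = np.zeros_like(S)
W[rows, idx] = S[rows, idx]
W = np.maximum(W, W.T)  # symmetrize so the graph is undirected (assumption)

# Spectral clustering on the precomputed adjacency matrix
model = SpectralClustering(n_clusters=2, affinity='precomputed', random_state=111)
cluster_labels = model.fit_predict(W)
print(cluster_labels)
```

With well-separated groups the k-NN graph splits into disconnected components, and scikit-learn may warn that the graph is not fully connected. On real data, the weighting of the two similarities and the kernel width would need to be justified by exploring the features.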