Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All) to avoid typical problems with Jupyter notebooks. **Unfortunately, this does not work with Chrome right now, you will also need to reload the tab in Chrome afterwards**.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". Please put your name here:

In [70]:
NAME = "Aymane Hachcham"

---

# Spherical k-Means Clustering

In this assignment, your task is to implement spherical k-means clustering *yourself*.

You will need to pay attention to performance. Using "for" loops over all instances and variables will not work, but instead you need to perform efficient vectorized operations.

In [71]:
import numpy as np, pandas as pd, scipy

In [72]:
# Load the input data
import json, gzip
raw = json.load(gzip.open("/data/simpsonswiki.json.gz", "rt", encoding="utf-8"))
titles, texts, classes = [x["title"] for x in raw], [x["text"] for x in raw], [x["c"] for x in raw]

Before you begin anything, always first have a look at the data you are dealing with!

In [73]:
# YOUR CODE HERE
classes

['Episodes',
 'Episodes',
 'Episodes',
 'Episodes',
 'Episodes',
 'Episodes',
 'Episodes',
 'Episodes',
 'Episodes',
 'Episodes',
 'Episodes',
 'Episodes',
 'Characters',
 'Characters',
 'Characters',
 'Characters',
 'Characters',
 'Characters',
 'Characters',
 'Locations',
 'Characters',
 'Characters',
 'Characters',
 'Characters',
 'Characters',
 'Characters',
 'Characters',
 'Episodes',
 'Episodes',
 'Characters',
 'Episodes',
 'Locations',
 'Episodes',
 'Characters',
 'Guest stars',
 'Characters',
 'Locations',
 'Characters',
 'Episodes',
 'Trivia',
 'Episodes',
 'Episodes',
 'Episodes',
 'Characters',
 'Episodes',
 'Episodes',
 'Guest stars',
 'Episodes',
 'Episodes',
 'Episodes',
 'Characters',
 'Characters',
 'Episodes',
 'Episodes',
 'Episodes',
 'Characters',
 'Characters',
 'Characters',
 'Characters',
 'Characters',
 'Characters',
 'Objects',
 'Characters',
 'Characters',
 'Characters',
 'Episodes',
 'Characters',
 'Characters',
 'Characters',
 'Characters',
 'Characters',
 

## Vectorize the text

Vectorize the Wiki texts, use the standard TF-IDF from the lecture (standard SMART `ltc` version, lowercase, *not* the scikit-learn variant) as discussed in the previous assignments. Use a minimum document frequency of 5 and standard english stopwords to reduce the vocabulary.

In [74]:
from sklearn.feature_extraction.text import CountVectorizer # Please use this
from scipy.sparse import spdiags


vectorizer = CountVectorizer(min_df=5, stop_words='english')
dtm = vectorizer.fit_transform(texts)

tfidf = None # sparse tf-idf matrix
vocabulary = vectorizer.get_feature_names_out()
idf = None # IDF values

def tf(dtm):
    tf_matrix = dtm.astype(np.float32)
    tf_matrix.data = 1 + np.log(tf_matrix.data)
    
    return tf_matrix

def idf(dtm):
    idf_matrix = np.log(dtm.shape[0] / (dtm.getnnz(0)))
    return idf_matrix

# _tf, _idf = tf(dtm), idf(dtm)
# mat = spdiags(_idf, np.array([0]), _idf.shape[0], _idf.shape[0], format=None)

def tfidf(dtm):
    _tf, _idf = tf(dtm), idf(dtm)
    
    sparse_matrix = _tf @ scipy.sparse.spdiags(_idf, 0, _idf.shape[0], _idf.shape[0])
    _tfidf = 1/np.sqrt(sparse_matrix.power(2).sum(axis=1).A1)
    return scipy.sparse.spdiags(_tfidf, 0, _tfidf.shape[0], _tfidf.shape[0]) @ sparse_matrix

tfidf = tfidf(dtm)
idf = idf(dtm)

In [75]:
pd.DataFrame.sparse.from_spmatrix(tfidf, columns=vocabulary)

Unnamed: 0,00,000,01,02,04,05,06,07,08,10,...,zoo,zooms,zorina,zorro,zsa,zuckerberg,zuylen,zzyzwicz,zörker,üter
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.087577,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10121,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10123,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10124,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [76]:
### Automatic tests
assert tfidf is not None and vocabulary is not None, "Variables not set"
assert tfidf.shape[0] == len(texts), "Missing documents"
assert len(vocabulary) == tfidf.shape[1], "Vocabulary size does not match"
assert isinstance(tfidf, scipy.sparse.csr_matrix), "Not a sparse matrix"

In [77]:
### Automatic tests
assert len(idf) == tfidf.shape[1], "IDF size does not match"
assert idf.max() != 1 +np.log(len(texts)/5), "No, default sklearn is NOT okay"
assert idf.max() == np.log(len(texts)/5), "IDF does not match definition"
assert not isinstance(idf, scipy.sparse.csr_matrix), "IDF should not be sparse!"
assert isinstance(idf, np.ndarray), "IDF must be an array"

## Reassignment step

Implement the reassignment step of **spherical** k-means. Use **vectorized code**, or it will likely be too slow.

Do *not* use a Python `for` loop, and do *not* convert the input data to a dense matrix (slow).

In [78]:
def reassign(tfidf, centers):
    """Reassign each object in tfidf to the most similar center.
       Return a flat array, not a matrix."""
    # YOUR CODE HERE
    similar_centers = np.array((centers @ tfidf.T).argmax(axis=0)).flatten()
    return similar_centers
# Test run
print(reassign(tfidf[:20], tfidf[:5]))

[0 1 2 3 4 4 1 1 4 1 4 1 0 2 0 0 0 0 0 0]


In [79]:
### Automatic tests
_test = reassign(tfidf, tfidf[:5])
assert _test.shape == (tfidf.shape[0],), "Return shape does not match"
assert (_test[:5] == np.arange(0,5)).all(), "Incorrect results"
assert _test.min() == 0 and _test.max() == 4, "Invalid values in array"
assert isinstance(_test, np.ndarray), "Return value is not a dense array -- you may need to use asarray() on the return value, unfortunately."
assert _test.dtype == np.int64, "Not an integer array"
del _test

In [80]:
### Automatic tests
from unittest.mock import patch
with patch('__main__.range') as mock_r1, patch('numpy.arange') as mock_r2:
    reassign(tfidf[:10], tfidf[:5])
assert not mock_r1.called and not mock_r2.called, "Vectorize your code! Otherwise you will be waiting a long time below."
with patch('sklearn.metrics.pairwise.cosine_similarity') as mock_c1, patch('sklearn.metrics.pairwise.cosine_distances') as mock_c2:
    reassign(tfidf[:10], tfidf[:5])
assert not mock_c1.called and not mock_c2.called, "Use your own code, not sklearn."

## Recompute the cluster centers

Given a cluster assignment, recompute the cluster centers as used by *spherical* k-means.

Vectorize your code: do not iterate over all points with a Python for loop

Hint: for the assignment, it is okay to assume that a cluster never becomes empty.

In [81]:
def new_centers(tfidf, assignment):
    """Return a matrix containing the new cluster centers for spherical k-means."""
    centers = [] # Okay to use a list or an array for the assignment
    # YOUR CODE HERE
    for text_cluster in np.unique(assignment):
        k_centers = tfidf[assignment == text_cluster].sum(axis=0).A1
        centers.append(k_centers/np.sqrt(np.square(k_centers).sum()))
        
    return np.array(centers) # Always return an array, copying is okay for the assignment

In [82]:
### Automatic tests
_tmp = new_centers(tfidf[:10], np.linspace(0, 4, 10).astype(np.int64))
assert len(_tmp) == 5, "Wrong number of centers."
for r in _tmp: assert r.shape[-1] == tfidf.shape[-1], "Not a proper center"
del _tmp

In [83]:
### Automatic tests
_tmp = new_centers(tfidf, np.zeros((tfidf.shape[0],)))
assert abs(np.array(_tmp).mean()-tfidf.mean()) > 1e-5, "This is not spherical k-means."
del _tmp

## Initialization

Now write initialization code. Given a random generator *seed*, chose `k` objects as initial cluster centers without replacement. Please use numpy.

In [84]:
from sklearn.utils.extmath import safe_sparse_dot

def initial_centers(tfidf, k, seed):
    """Choose k initial cluster centers."""
    # YOUR CODE HERE
    np.random.seed(seed)
    
    initial_dict = {}
    ind = np.random.randint(0, tfidf.shape[0], k)
    
    for index in range(0, len(ind)):
        initial_dict[index] = ind[index]
    
    centers = tfidf[list(map(initial_dict.get, list(set(safe_sparse_dot(tfidf, tfidf[np.random.randint(0, tfidf.shape[0], k)].T, dense_output=True).argmax(axis=1)))))].toarray()
    
    return np.array(centers)

In [85]:
### Automatic tests
_tmp = initial_centers(tfidf, 10, 42)
assert isinstance(_tmp, scipy.sparse.csr_matrix) or len(_tmp) == 10, "Wrong number of centers."
assert not isinstance(_tmp, scipy.sparse.csr_matrix) or _tmp.shape[0] == 10, "Wrong number of centers."
for r in _tmp: assert r.shape[-1] == tfidf.shape[-1], "Not a proper center."
del _tmp

In [86]:
### Automatic tests
assert (initial_centers(tfidf, 1, 42)-initial_centers(tfidf, 1, 42)).sum() == 0, "Seeding not okay."
assert (initial_centers(tfidf, 1, 42)-initial_centers(tfidf, 1, 21)).sum() != 0, "Seeding not okay."

## Implement a Quality Measure

As quality measure, compute the *sum* of cosine similarities of every point to its cluster center

In [87]:
def quality(tfidf, centers, assignment):
    """Evaluate the quality given the current centers and cluster assignment."""
    # YOUR CODE HERE
    s = 0
    for index in set(assignment):
        s = s + (tfidf[assignment== index] @ centers[index].T).sum()
        
    #raise NotImplementedError()
    return s

In [88]:
### Automatic tests
# This test is likely slow if you use a "for" loop in quality(). But that is okay.
_tmp = quality(tfidf, tfidf[0], np.zeros((tfidf.shape[0],)))
assert quality(tfidf, tfidf, np.arange(0,tfidf.shape[0])) == tfidf.shape[0], "Result incorrect"
assert _tmp > 100, "This largely random result should score better"
assert _tmp < 500, "This largely random result should score less"

As a reference value, compute the quality of assigning every object to the global *spherical* center.

Hint: you can use `new_centers` here.

In [89]:
center1 = None # Compute the overall center
sim1 = 0 # Compute the overall similarity

# YOUR CODE HERE
center1 = new_centers(tfidf, np.zeros((tfidf.shape[0],), dtype=np.int64))
sim1 = quality(tfidf, center1, np.zeros((tfidf.shape[0],), dtype=np.int64))

print("Similarity sum to center:", sim1)
print("Average similarity to center:", sim1 / tfidf.shape[0])

Similarity sum to center: 1042.5999538592491
Average similarity to center: 0.10296266579688418


In [90]:
### Automatic tests
assert center1 is not None, "Not answered."
assert abs(np.array(center1).mean()-tfidf.mean()) > 1e-5, "This is not the spherical center."
assert sim1 > 500, "This result should score better"
assert sim1 < 2000, "This result should score less"

## Implement Spherical k-Means

Now use these methods to implement spherical k-means clustering. Stop after a maximum number of iterations, or if no point is reassigned.

Return the cluster centers, the final cluster assignment, and an array of quality scores evaluated every time *after* reassigning the points to the clusters.

In [93]:
def spherical_kmeans(tfidf, initial_centers, max_iter=100):
    qualities = []
    
    # YOUR CODE HERE
    centers = initial_centers
    assignment = None
    for index in range(0, max_iter):
        assign_again = reassign(tfidf, centers)
        
        qualities.append(quality(tfidf, centers, assign_again))
        if assignment is not None and all(assignment == assign_again):
            break
        
        assignment = assign_again
        centers = new_centers(tfidf, assignment)
    
    return centers, assignment, qualities

In [94]:
### Automatic tests
from unittest.mock import patch
with patch('__main__.reassign') as mock_1, patch('__main__.new_centers') as mock_2, patch('__main__.quality') as mock_3:
    spherical_kmeans(tfidf, tfidf[0], 1)
    assert mock_1.called, "You did not use reassign"
    assert mock_2.called, "You did not use new_centers"
    assert mock_3.called, "You did not use quality"

## CLUSTER!

Now try out if your code works! First, cluster with `k=2`.

In [95]:
# YOUR CODE HERE
centers = initial_centers(tfidf, 2, 21)
centers, assign, qualities = spherical_kmeans(tfidf, centers, 100)
qualities

[167.6762003518968,
 1116.224472635951,
 1159.7350903225708,
 1172.9851729628508,
 1178.0056844418848,
 1180.1271594505547,
 1180.9847121420466,
 1181.2532651430688,
 1181.4404554346825,
 1181.5492072542543,
 1181.5983030670272,
 1181.624383568244,
 1181.635029531228,
 1181.6503654295382,
 1181.6697869663308,
 1181.7230700869272,
 1181.7481779426025,
 1181.752141412929]

In [96]:
### Automatic tests
_tmp = spherical_kmeans(tfidf, tfidf[[0,1]], 100)
assert len(_tmp[0]) == 2, "Wrong number of clusters"
assert _tmp[0].shape[-1] == tfidf.shape[-1], "Centers have bad shape"
assert sorted(np.unique(_tmp[1])) == [0,1], "Missing some clusters?"
assert len(_tmp[2]) < 90, "Should take much fewer iterations"
assert _tmp[2] == sorted(_tmp[2]), "Quality must be increasing"
assert _tmp[2][-1] == quality(tfidf, _tmp[0], _tmp[1]), "Quality wrong"
assert len(spherical_kmeans(tfidf, tfidf[[0,1]], 2)[2]) == 2, "max_iter incorrect."
del _tmp

In [97]:
### Automatic tests
_tmp = spherical_kmeans(tfidf, tfidf[[0,1]], 5)
_tmp2 = spherical_kmeans(tfidf, tfidf[[0,1]], 10)
assert _tmp[2] == _tmp2[2][:5]
del _tmp, _tmp2
# Additional hidden tests

## Study the Clusters

As we cannot rely on heuristics such as the "knee" to choose the number of clusters, we need to perform manual inspection:

- what are the most important words of each cluster?
- what are the most central documents in each cluster?

In [98]:
def most_important(vocabulary, center, k=10):
    """Find the most important words for each cluster."""
    
    # YOUR CODE HERE
    important_words = [vocabulary[word] for word in center.argsort() [-1::-1][:k]]
    return important_words

In [99]:
### Automatic tests
_tmp = tfidf[0].toarray()[0]
assert len(most_important(vocabulary, _tmp, 42)) == 42, "Wrong number of results."
for x in most_important(vocabulary, _tmp): assert isinstance(x, str), "Not words."

In [100]:
def most_central(tfidf, centers, assignment, i, k=5):
    """Find the most central documents of cluster i"""
    
    # YOUR CODE HERE
    final_central_result = (tfidf@centers[i].T).flatten()*(assignment==i)
    return final_central_result.argsort() [-1::-1][:k]

In [101]:
### Automatic tests
assert len(most_central(tfidf, tfidf[[0]].toarray(), np.zeros((tfidf.shape[0],)), 0, 42)) == 42, "Wrong number of results."
assert (most_central(tfidf, tfidf[[0,1]].toarray(), np.arange(0,tfidf.shape[0])&1, 0, 10)&1==0).all(), "Only documents from the same cluster may be returned."

## Explain your Clusters

Write a function to print a cluster explanation using above functions, and run it for k=20.

In [102]:
def explain(tfidf, vocabulary, titles, centers, assignment):
    """Use what you built."""
    
    # YOUR CODE HERE
    for index, cluster in enumerate(centers):
        print(f"The Cluster number {index + 1} and the number of words {len(cluster[cluster != 0])}")
        print("----------------------#####----------------------")
        print("The Top Important Words: ", "; ".join(most_important(vocabulary, cluster)))
        print("The Top title entities : \n")
        for x in most_central(tfidf, centers, assignment, index):
            print("", titles[x])
        print("----------------------#####----------------------")

In [104]:
# Cluster with k=20, and explain!

# YOUR CODE HERE
centers, clusters, _ = spherical_kmeans(tfidf, initial_centers (tfidf, 20, 42), 100)
explain(tfidf, vocabulary, titles, centers, clusters)

The Cluster number 1 and the number of words 4610
----------------------#####----------------------
The Top Important Words:  krusty; clown; burger; bart; brand; history; springfield; rabbi; krustofsky; kancelled
The Top title entities : 

 Bill (Accountant)
 Little Bearded Woman
 Krusty's ex-girlfriend
 The Krusty the Clown Show
 It's the Most Wonderful Time of the Year
----------------------#####----------------------
The Cluster number 2 and the number of words 13823
----------------------#####----------------------
The Top Important Words:  homer; bart; marge; lisa; episode; simpsons; moe; family; tells; springfield
The Top title entities : 

 Homer Simpson
 Bart Simpson
 Homer Strangles Bart (or Someone)
 Robert Terwilliger
 Opening Sequence
----------------------#####----------------------
The Cluster number 3 and the number of words 4100
----------------------#####----------------------
The Top Important Words:  book; comic; books; guy; history; read; lisa; reading; angelica; se

In [106]:
### Automatic tests
with patch('__main__.most_important') as mock_1, patch('__main__.most_central') as mock_2, patch('__main__.print') as mock_3:
    explain(tfidf, vocabulary, titles, tfidf[[0,1]].toarray(), np.arange(0,tfidf.shape[0])&1)
    assert mock_1.called, "You did not use most_important"
    assert mock_2.called, "You did not use most_central"
    assert mock_3.called, "You did not print"