Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All) to avoid typical problems with Jupyter notebooks. **Unfortunately, this does not work with Chrome right now, you will also need to reload the tab in Chrome afterwards**.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". Please put your name here:

In [1]:
NAME = "Aymane Hachcham"

---

# Explore pre-trained word2vec embeddings

In [2]:
### Load the input data - do not modify
import json, gzip, numpy as np
raw = json.load(gzip.open("/data/simpsonswiki.json.gz", "rt", encoding="utf-8"))
titles, texts, classes = [x["title"] for x in raw], [x["text"] for x in raw], [x["c"] for x in raw]

In [3]:
### Load the pretrained word2vec model from Google
from gensim.models import KeyedVectors
model = KeyedVectors.load("/data/w2v-google-news.wordvectors", mmap="r")
model.fill_norms()

In [4]:
# Find the 10 most similar words to the "Simpsons"
most_simpsons = None # words only

# YOUR CODE HERE
most_simpsons = [item[0] for item in model.most_similar('Simpsons', topn=10)]

In [5]:
# Automatic unit tests, no need to modify/study this.
assert len(most_simpsons) == 10
assert isinstance(most_simpsons[0], str)
assert not "paris_hilton" in most_simpsons
assert "Simpsons_Movie" in most_simpsons

## Verify the classic king-queen example

Verify that "King - Man + Woman = Queen", using the built-in function for this.

In [6]:
most_kmw = None # 10 nearest words to "king-man+woman" using the gensim API

# Finding the top ten most similar words to the combination: King-Man + Woman
most_kmw = [item[0] for item in model.most_similar(positive=['king', 'woman'], negative=['man'], topn=10)]

In [7]:
# Automatic unit tests, no need to modify/study this.
assert most_kmw[0] == "queen" or most_kmw[0][0] == "queen"

## Try using Euclidean geometry

Get the vectors for king, man, queen, and woman.

Compute king-man+woman, and compute the distances to each of above four words. What word is closest?

In [9]:
king, man, queen, woman = None, None, None, None # get the word vectors

# Get the vectors for the 4 above words:
king = model['king']
man = model['man']
queen = model['queen']
woman = model['woman']

king_man_women = king - man + woman

# Calculate the Euclidean Distance using Numpy:
import numpy as np

all_distances = [np.linalg.norm(king_man_women - vect) for vect in [king, man, queen, woman]]

# all_distances = []
# for vect in [king, man, queen, woman]:
#     euclidean_dist = np.linalg.norm(king_man_women - vect)
#     all_distances.append(euclidean_dist)

print('Vector of All distances:{}'.format(all_distances))

Vector of All distances:[1.727951, 3.7211041, 2.2986577, 3.2687893]


In [10]:
target = king - man + woman
for word, vec in [("king", king), ("man", man), ("woman", woman), ("queen", queen)]:
    score = np.sqrt(((target - vec)**2).sum())
    print("distance(king - man + woman, %s) = %.5f" % (word, score))

distance(king - man + woman, king) = 1.72795
distance(king - man + woman, man) = 3.72110
distance(king - man + woman, woman) = 3.26879
distance(king - man + woman, queen) = 2.29866


In [40]:
# Hidden unit tests

## Document representations

Represent each document as the average word2vec vector of all words present in the model.

In [14]:
document_vectors = np.zeros((len(titles), 300))
from gensim.utils import tokenize
for i, (title, text) in enumerate(zip(titles, texts)):
    tokens = tokenize(title + "\n" + text)
    
    # For each doc we assemble all words that exist in the model and create the mean vector out of them:
    document_vectors[i] = np.mean(model[[word for word in list(set(tokens)) if word in model.key_to_index]], axis=0)

In [15]:
# Automatic unit tests, no need to modify/study this.
assert document_vectors.shape == (len(titles), 300)
assert np.abs(document_vectors).sum(axis=0).min() > 0, "Some vector not initialized?"
assert np.abs(document_vectors).sum(axis=1).min() > 0, "Some vector not initialized?"

## Find the document with the shortest vector

Note: this likely will be one of the longer documents.

In [16]:
shortest = None # Document number of the document with the shortest vector

# YOUR CODE HERE
lengths = []
from gensim.utils import tokenize
for i, text in enumerate(texts):
    # Tokenize first:
    tokens = tokenize(text)
    doc = [word for word in list(tokens) if word in model.key_to_index]
    lengths.append(len(doc))

shortest = lengths[min(lengths)]

print(titles[shortest], len(texts[shortest]))

Üter Zörker 2197


In [44]:
# Hidden unit tests for grading

## Find the two most similar documents

In [17]:
similarity_matrix = None

# Using Cosine similarity:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate the cosine similarity matrix:
similarity_matrix = cosine_similarity(document_vectors)

In [18]:
# Automatic tests (no points)
assert similarity_matrix.shape == (len(titles), len(titles))

In [19]:
# lower triangular part of the matrix:
lower_similarity_matrix = np.tril(similarity_matrix, -1)

In [20]:
most_similar = None # Pair of two different documents

# YOUR CODE HERE
most_similar = np.unravel_index(np.argmax(lower_similarity_matrix, axis=None), lower_similarity_matrix.shape)

print(titles[most_similar[0]], " and ", titles[most_similar[1]])
print(len(texts[most_similar[0]]), " and ", len(texts[most_similar[1]]))

Larry Mackleberry  and  Jerry Mackleberry
185  and  185


In [21]:
# Automatic unit tests, no need to modify/study this.
assert most_similar[0] != most_similar[1]
_a, _b = min(most_similar), max(most_similar)
_tmp = similarity_matrix[_a].copy()
_tmp[[_a,_b]] = -1
assert similarity_matrix[_a, _b] >= _tmp.max()
del _tmp

## Find the two most similar longer documents

Now only consider documents that have at least 1000 characters in the body!

In [44]:
most_similar = None # Pair of two different documents

# YOUR CODE HERE
# lengths = []
from gensim.utils import tokenize
# for i, text in enumerate(texts):
#     # Tokenize first:
#     tokens = tokenize(text)
#     doc = [word for word in list(tokens) if word in model.key_to_index]
#     if len(doc) >= 1000:
#         lengths.append((i, len(doc)))

# lengths
        
# print(titles[most_similar[0]], " and ", titles[most_similar[1]])
# print(len(texts[most_similar[0]]), " and ", len(texts[most_similar[1]]))

In [None]:
# Automatic unit tests, no need to modify/study this.
assert most_similar[0] != most_similar[1]
_a, _b = min(most_similar), max(most_similar)
assert len(texts[_a]) >= 1000 and len(texts[_b]) >= 1000, "not long documents."
_tmp = similarity_matrix[_a].copy()
_tmp[[_a,_b]] = -1
assert similarity_matrix[_a, _b] >= _tmp.max()
del _tmp

## Run k-means and spherical k-means

Cluster the document vectors (*not* the similarity matrix) with spherical k-means.

Use k=10, and a fixed random seed of 42.

Recall the assumptions of our spherical k-means implementation!

In [23]:
kcent = None # Compute the k-means cluster centers
kassi = None # Compute the k-means cluster assignment
from sklearn.cluster import KMeans

# YOUR CODE HERE
kmeans = KMeans(n_clusters=10, random_state=42).fit(document_vectors)
kcent = kmeans.cluster_centers_
kassi = kmeans.labels_

In [24]:
# Minimalistic implementation for spherical k-means, so we use the same version in this assignment
# This is NOT meant as an example of good code, but to be short.
def initial_centers(X, k, seed):
    return X[np.random.default_rng(seed=seed).choice(X.shape[0], k, replace=False)]

def sphericalkmeans(X, centers, max_iter=100):
    assert abs((X**2).sum()-len(X)) < 1e-7, "Improper input for spherical k-means!"
    last_assignment = None
    for iter in range(max_iter):
        assignment = np.asarray((X @ centers.T).argmax(axis=1)).squeeze()
        if last_assignment is not None and all(assignment == last_assignment): break
        last_assignment, centers = assignment, np.zeros(centers.shape)
        for i in range(centers.shape[0]):
            c = X[assignment == i,:].sum(axis=0)
            centers[i] = c / np.sqrt((c**2).sum())
    return centers, assignment

In [25]:
scent = None # Compute the spherical k-means cluster centers
sassi = None # Compute the spherical k-means cluster assignment

# First, let's normalize the input:
from sklearn.preprocessing import normalize
normalized_documents = normalize(document_vectors, 'l2')

# Initialize the centers:
init = initial_centers(document_vectors, 10, 42)

# Compute Spherical K means:
scent, sassi = sphericalkmeans(normalized_documents, init)

## Explore your result

Explore the result: write a function to determine the most important words for each factor, and the most relevant documents.

In [35]:
def most_central(tfidf, centers, assignment, i, k=5):
    """Find the most central documents of cluster i"""
    central_docs = (tfidf@centers[i].T).flatten()*(assignment==i)
    return central_docs.argsort() [-1::-1][:k]
    
def explain(tfidf, titles, classes, centers, assignment):
    """Explain the clusters: print
    (1) relative size of each cluster
    (2) three most frequent classes of each cluster
    (3) five most central documents of each cluster
    (4) ARI of the entire clustering"""
    from sklearn.metrics import adjusted_rand_score
    from collections import Counter
        
    # Relative size of each cluster:
    print('----------------Size of Clusters-------------------\n')
    print('The Relative size of each cluster:')
    [print('For Cluster: {}, There is {} documents'.format(el, sassi.tolist().count(el))) for el in list(set(assignment)) ]
        
    # Three most frequent classes of each cluster:
    print('----------------Most frequent Classes-------------------\n')
    print('The most frequent classes of each cluster:')
    print([classes[i[0]] for i in Counter(assignment).most_common(3)])
    
    # Five most central documents for each cluster:
    print('----------------Central Documents-------------------\n')
    print('The Five most central documents for each cluster:')
    print()
    [print((titles[t[0]],titles[t[1]], titles[t[2]], titles[t[3]], titles[t[4]]), '\n')  for t in [most_central(tfidf, centers, assignment, i, k=5) for i in np.unique(assignment)]]
#     for i in np.unique(assignment):
#         print('Cluster: {} has the following 5 central documents: \n'.format(i))
#         for t in most_central(tfidf, centers, assignment, i, k=5):
#             print(titles[t])
    print('---------------------####---------------------')
    # ARI for the entire clustering:
    print('----------------ARI-------------------\n')
    print('The ARI measure for the current assignment is: {}'.format(adjusted_rand_score(assignment, classes)))
    

In [36]:
print("Regular k-means clustering:")
explain(document_vectors, titles, classes, kcent, kassi)

Regular k-means clustering:
----------------Size of Clusters-------------------

The Relative size of each cluster:
For Cluster: 0, There is 1218 documents
For Cluster: 1, There is 2164 documents
For Cluster: 2, There is 395 documents
For Cluster: 3, There is 386 documents
For Cluster: 4, There is 694 documents
For Cluster: 5, There is 723 documents
For Cluster: 6, There is 1672 documents
For Cluster: 7, There is 1147 documents
For Cluster: 8, There is 1045 documents
For Cluster: 9, There is 682 documents
----------------Most frequent Classes-------------------

The most frequent classes of each cluster:
['Episodes', 'Episodes', 'Episodes']
----------------Central Documents-------------------

The Five most central documents for each cluster:

('Everytime We Say Good-Bye', 'Explode You', "If You Think I'm Cuddly", "I Don't Know You", '(You Make Me Feel Like) A Natural Woman') 

('Rotoscoped couch gag', 'Cake couch gag', 'Dice Couch Gag', 'The Couch Movie Trailer couch gag', 'Paintbrush

In [37]:
# Note: in case of poor performance, revisit your code above!
print("Spherical k-means clustering:")
explain(document_vectors, titles, classes, scent, sassi)

Spherical k-means clustering:
----------------Size of Clusters-------------------

The Relative size of each cluster:
For Cluster: 0, There is 1218 documents
For Cluster: 1, There is 2164 documents
For Cluster: 2, There is 395 documents
For Cluster: 3, There is 386 documents
For Cluster: 4, There is 694 documents
For Cluster: 5, There is 723 documents
For Cluster: 6, There is 1672 documents
For Cluster: 7, There is 1147 documents
For Cluster: 8, There is 1045 documents
For Cluster: 9, There is 682 documents
----------------Most frequent Classes-------------------

The most frequent classes of each cluster:
['Episodes', 'Episodes', 'Episodes']
----------------Central Documents-------------------

The Five most central documents for each cluster:

('Gaston Simpson', 'Little Bearded Woman', 'Jeopardy Contestant 2', 'They Read, And Write, They Read and Read and Write', 'Joey (flashback)') 

('Stanlerina', 'Comedian', 'Rapunzel', 'Pa (How Munched Is That Birdie in the Window?)', 'Tommy') 



In [None]:
# Hidden unit tests

In [None]:
# Hidden unit tests