# Text 4: Word2Vec
**Internet Analytics - Lab 4**

---

**Group:** *H*

**Names:**

* *Antoine Basseto*
* *Andrea Pinto*
* *Jérémy Baffou*


---

#### Instructions

*This is a template for part 4 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import re
import pickle
import string
import numpy as np

import nltk
from nltk.corpus import wordnet
from nltk.util import ngrams

from collections import defaultdict, Counter
from scipy.sparse import csr_matrix
import json
from utils import *
import gensim
from sklearn.cluster import KMeans

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

## Redo pre-processing

### Cleaning

In [2]:
"""Preprocess text description
@param: (description) Dirty string text
@return: Cleaned array of words
"""
def preprocess(description):
    # Preprocess courses descriptions
    split_joined_words = re.sub(r'([a-z])([A-Z])', r'\1 \2', description)
    split_3D_occurences = re.sub(r'3D', r' 3D ', split_joined_words)
    stick_phd_occurences = re.sub(r'Ph D', r'PhD', split_3D_occurences)
    cleaned_dataset = stick_phd_occurences
    
    # Remove punctuations and separate
    cleaned_dataset = (cleaned_dataset
                       .translate(str.maketrans(string.punctuation, ' '*len(string.punctuation)))
                       .split())
    
    # Remove stopwords and digits
    cleaned_dataset = [
        word for word in cleaned_dataset 
        if word.lower() not in stopwords and not word.isdigit()
    ]
    
    return cleaned_dataset

In [3]:
"""Clean courses RDD
@param: (courses_rdd) Dirty courses RDD
@return: Cleaned courses RDD
"""
def clean_rdd(courses_rdd):
    return courses_rdd.map(
        lambda c: {
            'courseId': c['courseId'],
            'name': c['name'],
            'description': preprocess(c['description'])
        }
    )

In [4]:
courses_rdd = sc.parallelize(courses)
courses_preprocessed_rdd = clean_rdd(courses_rdd)

### Frequency analysis

In [5]:
"""Frequency analysis of given corpus
@param: (corpus) RDD of words
@return: RDD of (frequency ratio, word) tuples, sorted by decreasing frequency,
         and the number of words in the corpus
"""
def frequency_analysis(corpus):
    count = corpus.count()
    freqs2word_rdd = (
        corpus
        .map(lambda word: (word, 1))
        .reduceByKey(lambda x, y: x + y)
        .map(lambda x: (x[1] / count, x[0]))
        .sortByKey(False)
    )
    return freqs2word_rdd, count

In [6]:
corpus = courses_preprocessed_rdd.flatMap(lambda c: c["description"])
freqs2word_rdd, count = frequency_analysis(corpus)
freqs = np.asarray(freqs2word_rdd.map(lambda x: x[0]).collect())

In [7]:
def frequent_words(freqs2word_rdd, freqs, quantile):
    indices = np.where(freqs > np.quantile(freqs, quantile))[0]
    frequent = freqs2word_rdd.take(indices[-1])
    frequent_words = set(map(lambda x: x[1], frequent))
    return frequent_words

In [8]:
def rare_words(freqs2word_rdd, count, apparitions):
    # Recover the integer count from the ratio; round() avoids a fragile float equality test
    rare = freqs2word_rdd.filter(lambda x: round(x[0] * count) == apparitions)
    rare_words = set(rare.map(lambda x: x[1]).collect())
    return rare_words

In [9]:
def remove_rdd(courses_rdd, words):
    return courses_rdd.map(
        lambda c: {
            "courseId": c["courseId"],
            "name": c["name"],
            "description": [w for w in c["description"] if w not in words]
        }
    )

In [10]:
fw = frequent_words(freqs2word_rdd, freqs, 0.995)
rw = rare_words(freqs2word_rdd, count, 1)

In [11]:
print(f'{len(fw)} of frequent words out of {freqs2word_rdd.count()} total words.')
print(f'{len(rw)} of rare words out of {freqs2word_rdd.count()} total words.')

81 of frequent words out of 16772 total words.
7757 of rare words out of 16772 total words.


We only delete the frequent words. Words that occur exactly once make up 46% of the vocabulary (7757 of 16772 distinct words), which is too much to drop.
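
The shares behind this decision can be checked directly from the counts printed above. A quick sketch, with the numbers copied from that output (the variable names are only illustrative):

```python
# Counts copied from the printed output above
vocabulary_size = 16772   # distinct words after cleaning
n_frequent = 81           # words above the 0.995 frequency quantile
n_rare = 7757             # words occurring exactly once

rare_share = n_rare / vocabulary_size
frequent_share = n_frequent / vocabulary_size

print(f"rare words: {rare_share:.1%} of the vocabulary")
print(f"frequent words: {frequent_share:.2%} of the vocabulary")
```

These are shares of distinct words, not of tokens: dropping the rare words would remove almost half of the vocabulary, whereas the frequent words are a very small part of it.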

In [12]:
preprocessed_data_rdd = remove_rdd(courses_preprocessed_rdd, fw)

## Exercise 4.12 : Clustering word vectors

### Get pre-trained model

In [13]:
w2v_model = gensim.models.KeyedVectors.load_word2vec_format('/ix/model.txt')  

In [14]:
def get_word_vector(model, word):
    # Words outside the model's vocabulary get the zero vector as a default, so that
    # they do not contribute to the document and query vector representations
    if word in model.vocab:
        return model.get_vector(word)
    return np.zeros(model.vector_size)

### K-Means

In [15]:
np.linalg.norm(w2v_model.get_vector("dog"))

0.9999998

Note that vectors in our model are already normalised, meaning that k-means, even though it is using the euclidean distance and not the cosine similarity, should find reasonable clusters.
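
The reason is that, for unit vectors $u$ and $v$, $\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\,u \cdot v = 2 - 2\cos(u, v)$, so a smaller Euclidean distance always means a larger cosine similarity. A minimal numerical check of this identity, using made-up vectors rather than the actual embeddings (pure Python to stay self-contained):

```python
import math

def normalise(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Arbitrary illustrative vectors, not taken from the model
u = normalise([1.0, 2.0, 3.0])
v = normalise([-2.0, 0.5, 4.0])

lhs = squared_distance(u, v)
rhs = 2 - 2 * cosine(u, v)
print(lhs, rhs)  # the two values agree up to floating-point error
```

Cluster centres are averages of unit vectors and are generally not unit length themselves, so the equivalence is exact between pairs of words and only approximate between a word and a centre.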

In [16]:
"""Get dict with words in the data and their vector 
representation according to the given model, for use with the kmeans algorithm.
@param: (data_rdd) RDD of courses, each with a list of words as description
@param: (model) Gensim w2v model
@return: dict of words in the data and their vector representation
"""
def get_data_word_vectors(data_rdd, model):
    data_w2v = {} 
    for word in data_rdd.flatMap(lambda c : c["description"]).collect():
        # As this is for use with the kmeans algorithm, we discard any data not in 
        # the model as it would create a big cluster around the default word vector 
        # and pollute our findings
        if word in model.vocab:
            data_w2v.update({word: get_word_vector(model, word)})
        
    return data_w2v

In [17]:
data_w2v = get_data_word_vectors(preprocessed_data_rdd, w2v_model)

In [18]:
# Apply k-means to cluster the word vectors of the words in our pre-processed dataset
# (uniqueness is guaranteed because they are the values of a dictionary keyed by word)
kmeans = KMeans(n_clusters=25, random_state=0).fit(list(data_w2v.values()))

In [21]:
cluster_assignments = kmeans.predict(list(data_w2v.values()))

In [36]:
def print_top_n_cluster(kmeans, data_w2v, model, topn=10):    
    data_vectors = np.array(list(data_w2v.values()))
    data_word = np.array(list(data_w2v.keys()))
    # Cluster index of each word, computed here rather than read from a global variable
    cluster_assignments = kmeans.predict(data_vectors)
    
    for i, center in enumerate(kmeans.cluster_centers_):
        in_cluster = cluster_assignments == i
        similarities = model.cosine_similarities(center, data_vectors[in_cluster])
        # Indices of the topn most similar words, in descending order of similarity
        top_indices = np.argsort(similarities)[:-topn-1:-1]
        print(f'The {topn} closest words to cluster center #{i} are:\n{data_word[in_cluster][top_indices]}\n')

In [37]:
print_top_n_cluster(kmeans, data_w2v, w2v_model, topn=10)

The 10 closest words to cluster center #0 are:
['consecutive' 'final' 'starting' 'September' 'scoring' 'August' 'ended'
 'June' 'July' 'October']

The 10 closest words to cluster center #1 are:
['optimal' 'parameters' 'optimizing' 'evaluating' 'optimum' 'determining'
 'utilization' 'flexibility' 'robustness' 'tradeoffs']

The 10 closest words to cluster center #2 are:
['Perturbation' 'Variational' 'Estimation' 'Processes' 'Analysis'
 'Iterative' 'Kinematics' 'Function' 'Deterministic' 'Mechanisms']

The 10 closest words to cluster center #3 are:
['seminars' 'teaching' 'academics' 'internships' 'lecturers' 'education'
 'faculty' 'coursework' 'educational' 'professors']

The 10 closest words to cluster center #4 are:
['notion' 'notions' 'epistemologies' 'normative' 'context' 'rationality'
 'subjectivity' 'worldview' 'understanding' 'paradigm']

The 10 closest words to cluster center #5 are:
['connecting' 'adjacent' 'linking' 'junction' 'situated' 'stretching'
 'area' 'connected' 'located

We can observe clusters about course topics, academics, actions, people's names, concepts (time) and French words.

We give the following labels for 10 of the above clusters:

concepts
- cluster 0 is `time`
- cluster 3 is `education`

French words
- cluster 9 is `French words`

names
- cluster 10 is `names`

actions
- cluster 17 is `understanding`

course topics
- cluster 14 is `finance`
- cluster 16 is `development`
- cluster 18 is `chemistry`
- cluster 19 is `life sciences`
- cluster 22 is `mathematics`


The clusters are much more meaningful here than the topics found with LSI and LDA: almost every cluster can be given a clear meaning.

## Exercise 4.13 : Document similarity search

In [21]:
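def get_DF(data_rdd):
    # Count how many times each word occurs in the whole corpus.
    # Note: this is a corpus-wide term count (a word repeated within one
    # description is counted each time), used below as the "DF" weight.
    DF = {}
    for word in data_rdd.flatMap(lambda c : c["description"]).collect():
        if word not in DF:
            DF[word] = 1
        else:
            DF[word] += 1
    return DF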

In [22]:
DF = get_DF(preprocessed_data_rdd)

In [23]:
"""Get courses with a vector for their description
@param: (data_rdd) RDD of the data
@param: (model) Gensim w2v model
@param: (DF) dict of word counts over the corpus
@return: list of dict of courses with a vector component
"""
def document_vectors_rdd(data_rdd, model, DF):
    return list(map(lambda c: add_vector_to_course(c, model, DF), data_rdd.collect()))

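def add_vector_to_course(course, model, DF):
    # Weight of each word: its term frequency in the description divided by its corpus count
    TF_IDF = {word: TF/DF[word] for word, TF in Counter(course['description']).items()}
    total_TF_IDF = np.sum(list(TF_IDF.values()))
    # Document vector: weighted sum of the word vectors. The sum runs over every
    # token, so a repeated word contributes once per occurrence.
    return {'courseId': course['courseId'],
            'name': course['name'],
            'description': course['description'],
            'vector': np.sum([get_word_vector(model, word) * TF_IDF[word] / total_TF_IDF 
                              for word in course['description']], axis=0)}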

In [24]:
def get_query_vector(query, DF, model):
    words = query.split()
    if np.all([word in DF for word in words]):
        TF_IDF = {word: TF/DF[word] for word, TF in Counter(words).items()}
        total_TF_IDF = np.sum(list(TF_IDF.values()))

        return np.sum([get_word_vector(model, word) * TF_IDF[word] / total_TF_IDF for word in words], axis=0)
    else:
        return np.mean([get_word_vector(model, word) for word in words], axis=0)

In [25]:
def get_topn_courses(courses_with_vectors, search_word_vector, model, topn=5):
    result = list(map(lambda c: (c, model.cosine_similarities(search_word_vector, [c['vector']])[0]),
                      courses_with_vectors))
    result.sort(reverse=True, key=lambda t: t[1])
    return result[:topn]

In [26]:
def print_topn_courses(courses_with_vectors, query, DF, model, topn=5):
    search_word_vector = get_query_vector(query, DF, model)
    sorted_courses = get_topn_courses(courses_with_vectors, search_word_vector, model, topn=topn)
    
    print(f'Searching the top {topn} courses for query \"{query}\":\n')
    for i, c in enumerate(sorted_courses):
        print(f'Result #{i} is the course {c[0]["courseId"]} {c[0]["name"]} with similarity {c[1]}')

In [27]:
courses_with_vectors = document_vectors_rdd(preprocessed_data_rdd, w2v_model, DF)

In [28]:
print_topn_courses(courses_with_vectors, "Markov chains", DF, w2v_model, topn=5)

Searching the top 5 courses for query "Markov chains":

Result #0 is the course MATH-332 Applied stochastic processes with similarity 0.5617865746406503
Result #1 is the course MGT-484 Applied probability & stochastic processes with similarity 0.5294180894619351
Result #2 is the course COM-516 Markov chains and algorithmic applications with similarity 0.5192955703253681
Result #3 is the course CH-311 Molecular and cellular biophysic I with similarity 0.4822200434776796
Result #4 is the course MSE-211 Organic chemistry with similarity 0.472022200454137


In [29]:
print_topn_courses(courses_with_vectors, "Facebook", DF, w2v_model, topn=5)

Searching the top 5 courses for query "Facebook":

Result #0 is the course EE-727 Computational Social Media with similarity 0.7550354730039651
Result #1 is the course COM-308 Internet analytics with similarity 0.4954955788676604
Result #2 is the course CS-622 Privacy Protection with similarity 0.4782632547353867
Result #3 is the course COM-208 Computer networks with similarity 0.4726744392633931
Result #4 is the course CS-486 Human computer interaction with similarity 0.47204086049787874


Results are quite good. Two of the courses returned for the `Markov chains` query (results #3 and #4) are more related to carbon chains than to anything else, but overall the results are comparable to LSI and better than using TF-IDF alone (especially when considering the `Facebook` results).
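
The search above averages the word vectors of a description, weighting each word by its term frequency over its corpus count, and ranks courses by cosine similarity to the query vector. A toy sketch of that idea follows. The 2-dimensional embeddings, the `df` counts and the document names are all invented for illustration, and the weighting is simplified to one term per distinct word, so this is not the exact computation of the notebook.

```python
import math
from collections import Counter

# Hypothetical 2-d embeddings and corpus counts (not from the real model)
embeddings = {"markov": [1.0, 0.0], "chain": [0.8, 0.6], "carbon": [0.0, 1.0]}
df = {"markov": 2, "chain": 4, "carbon": 2}

def weighted_vector(words):
    """Average of the word vectors, weighted by TF / DF with weights summing to 1."""
    weights = {w: tf / df[w] for w, tf in Counter(words).items()}
    total = sum(weights.values())
    vec = [0.0, 0.0]
    for w, weight in weights.items():
        for i, x in enumerate(embeddings[w]):
            vec[i] += x * weight / total
    return vec

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two invented documents sharing the ambiguous word "chain"
docs = {
    "stochastic course": ["markov", "chain", "markov"],
    "chemistry course": ["carbon", "chain", "carbon"],
}
query = weighted_vector(["markov", "chain"])
scores = {name: cosine(query, weighted_vector(words)) for name, words in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy setting the shared word `chain` gives the chemistry document a non-zero similarity to the query, which is plausibly the kind of effect behind the two chemistry-related results.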

## Exercise 4.14: Document similarity search with outside terms

In [30]:
print_topn_courses(courses_with_vectors, "MySpace Orkut", DF, w2v_model, topn=5)

Searching the top 5 courses for query "MySpace Orkut":

Result #0 is the course EE-727 Computational Social Media with similarity 0.7118328548684894
Result #1 is the course COM-208 Computer networks with similarity 0.5225126718946227
Result #2 is the course COM-308 Internet analytics with similarity 0.5213876867320217
Result #3 is the course MGT-517 Entrepreneurship laboratory (e-lab) with similarity 0.48860631331937526
Result #4 is the course CS-486 Human computer interaction with similarity 0.485148847320839


Results are almost the same as for the `Facebook` query, which makes sense and supports the validity of the model.
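
Assuming neither term appears in the course descriptions (as the exercise title suggests), `get_query_vector` cannot weight them and falls back to the unweighted mean of their word vectors. A toy sketch of that fallback, where the embeddings, the `df` dictionary and the helper names are invented for illustration:

```python
# Hypothetical embeddings and corpus counts, for illustration only
embeddings = {"facebook": [0.9, 0.1], "myspace": [0.8, 0.2], "orkut": [0.7, 0.3]}
df = {"facebook": 3}  # only "facebook" occurs in the toy corpus

def mean_vector(words):
    """Unweighted mean of the word vectors; unknown words map to the zero vector."""
    dim = 2
    vectors = [embeddings.get(w, [0.0] * dim) for w in words]
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def uses_fallback(words):
    """True when at least one query word is missing from the corpus counts."""
    return not all(w in df for w in words)

query = ["myspace", "orkut"]
print(uses_fallback(query), mean_vector(query))
```

Because the pre-trained model places related terms close together, such a mean vector can still land near the vector of an in-corpus term like `Facebook`, which is consistent with the similar rankings observed.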

In [31]:
print_topn_courses(courses_with_vectors, "coronavirus", DF, w2v_model, topn=5)

Searching the top 5 courses for query "coronavirus":

Result #0 is the course BIO-657 Landmark Papers in Cancer and Infection with similarity 0.5979577792035651
Result #1 is the course BIO-477 Infection biology with similarity 0.5882584923093604
Result #2 is the course BIO-638 Practical - Lemaitre Lab with similarity 0.5714327124173664
Result #3 is the course CH-414 Pharmacological chemistry with similarity 0.5480594849117834
Result #4 is the course BIOENG-433 Biotechnology lab (for CGC) with similarity 0.5412282985903536


Results here are quite satisfactory: all courses pertain to life sciences, and the two most similar ones are about infections.