# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *H*

**Names:**

* *Baffou Jérémy*
* *Basseto Antoine*
* *Pinto Andrea*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [2]:
import pickle
import string
import numpy as np
import matplotlib.pyplot as plt 
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.util import ngrams

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

## Exercise 4.1: Pre-processing

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw')

For me it makes sense to do the filtering operations in this order:

- remove punctuation
- remove stop words
- remove digits
- lower-case the words
- lemmatization
- stemming (optional)
- compute frequencies over the whole corpus
- remove too frequent and infrequent terms
- add n-grams (2-, 3- and 4-grams)
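As a rough illustration of the first steps of this list, here is a minimal pure-Python sketch on a toy sentence. The stop-word set `toy_stopwords` and the helper `basic_clean` are hypothetical names made up for this example; the notebook itself uses the stop words loaded from `data/stopwords.pkl` and runs on Spark.

```python
import string

# Hypothetical stop-word list, for illustration only
toy_stopwords = {"the", "of", "and", "in", "to", "a"}

def basic_clean(text, stopwords):
    """Strip punctuation, drop stop words and tokens containing digits, lower-case the rest."""
    no_punct = text.translate(str.maketrans('', '', string.punctuation))
    return [w.casefold() for w in no_punct.split()
            if w.lower() not in stopwords and not any(ch.isdigit() for ch in w)]

tokens = basic_clean("Introduction to Markov chains, and 3 applications in 2D!", toy_stopwords)
print(tokens)
```

Lemmatization and stemming would then be applied to the resulting token list.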

In [6]:
courses_rdd = sc.parallelize(courses)

In [9]:
courses_processed = courses_rdd.map(lambda c : {"courseId" : c["courseId"], "name" : c["name"], "description" : [word.casefold() for word in c["description"].translate(str.maketrans('', '', string.punctuation)).split() if (word.lower() not in stopwords and not word.isdigit())]})

In [10]:
courses_processed.count()

854

We also remove words that contain a digit, as most of the time they carry no information about the course content (with exceptions such as "3d", which we could perhaps keep).

In [None]:
courses_processed = courses_processed.map(lambda c :{"courseId" : c["courseId"], "name" : c["name"], "description" : [w for w in c["description"] if not any(i.isdigit() for i in w)]}) 

### Lemmatization

In [None]:
tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
def tag_mapper(tag):
    if tag[0] in tag_dict:
        return tag_dict[tag[0]]
    else :
        return wordnet.NOUN

def lemmatize(words):
    return list(map(lambda w : lm.lemmatize(w[0],tag_mapper(w[1])),nltk.pos_tag(words)))

In [None]:
# lemmatization (the dataset has to be collected because we could not figure out how to make WordNet available on every worker)
lm = WordNetLemmatizer()
courses_lemmatized = sc.parallelize(list(map(lambda c : {"courseId" : c["courseId"], "name" : c["name"], "description" : lemmatize(c["description"])},courses_processed.collect())))

### [OPTIONAL] : Stemming

In [None]:
ps = PorterStemmer()
courses_stemmed = courses_lemmatized.map(lambda c : {"courseId" : c["courseId"], "name" : c["name"], "description" : list(map(lambda w : ps.stem(w), c["description"]))})

I'm not sure about stemming here. It can reduce words to a common root and thus merge them into larger groups (for instance, "worker" and "working" can be mapped to "work", which the lemmatizer does not do here). However, it sometimes cuts off too much of the end of a word, so we lose some meaning.

Now we're going to compute the frequencies of words in the **entire corpus**.

In [None]:
#courses_word_aggregation = courses_stemmed.flatMap(lambda c : c["description"]) #flatten all words lists
courses_word_aggregation = courses_lemmatized.flatMap(lambda c : c["description"]) #flatten all words lists

In [None]:
words_number = courses_word_aggregation.count()
words_count = courses_word_aggregation.map(lambda w : (w,1)).reduceByKey(lambda w1,w2 : w1+w2).map(lambda w : (w[1],w[0])).sortByKey(False)
words_freq = words_count.map(lambda w : (w[1],w[0]/words_number))
words_freq_for_plot = np.asarray(words_freq.map(lambda w : w[1]).collect())

In [None]:
fig,axs = plt.subplots(1,2,figsize=(16,4))
axs[0].set_title("Frequencies of words")
axs[0].set_xlabel("wordId")
axs[0].set_ylabel("log of frequencies")
plot1 = axs[0].plot(np.linspace(0,len(words_freq_for_plot), num=len(words_freq_for_plot)),np.log(words_freq_for_plot))
plot2 = axs[1].boxplot(words_freq_for_plot)

We want to remove words that are too frequent in the corpus, as they are unlikely to differentiate documents. We keep only the words below the 0.975 quantile of the frequency distribution, because this threshold corresponds roughly to the steep vertical drop at the beginning of the frequency plot.
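The idea of cutting off the most frequent words can be sketched on a toy corpus as follows. This is a simplified, rank-based cutoff written with the standard library only; it stands in for the `np.quantile` threshold used in the notebook and is not identical to it. The names `too_frequent` and `q` are hypothetical.

```python
from collections import Counter

tokens = "the model the data the model learns the data a graph".split()
counts = Counter(tokens)
total = sum(counts.values())
# (word, relative frequency) pairs, most frequent first
freqs = sorted(((w, c / total) for w, c in counts.items()),
               key=lambda p: p[1], reverse=True)

def too_frequent(freqs, q):
    """Return the words ranked in the top (1 - q) share of the vocabulary by frequency."""
    n_drop = int(round(len(freqs) * (1 - q)))
    return {w for w, _ in freqs[:n_drop]}

dropped = too_frequent(freqs, 0.8)
kept = [w for w in tokens if w not in dropped]
```

Note that the resulting set holds plain words rather than (word, frequency) pairs, so that the membership test used when filtering the token lists behaves as intended.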

In [None]:
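# indices of the words whose frequency lies above the chosen quantile
really_frequent_indices = np.where(words_freq_for_plot > np.quantile(words_freq_for_plot,0.60)) # maybe change quantile
# words_freq is sorted by decreasing frequency, so these words come first; keep only the word, not the (word, frequency) pair
really_frequent_words = set(w for w, _ in words_freq.take(int(really_frequent_indices[0][-1]) + 1))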

(Note: the `<<` and `>>` markers still need to be removed from the text.)

Concerning the infrequent words we have:

In [None]:
single_apparition_words = words_count.filter(lambda w : w[0] == 1)
single_apparition_words.count()/words_count.count()

These words make up a large part of the vocabulary, so should we really remove them?

Now we remove the really frequent words from the list of words of each course.

In [None]:
#bag_of_words_per_course = courses_stemmed.map(lambda c :{"courseId" : c["courseId"], "name" : c["name"], "description" : [w for w in c["description"] if w not in really_frequent_words]}) 
bag_of_words_per_course = courses_lemmatized.map(lambda c :{"courseId" : c["courseId"], "name" : c["name"], "description" : [w for w in c["description"] if w not in really_frequent_words]}) 

### N-grams

A big question is at which step we should create the n-grams:
- before the first processing step
- after lemmatization
- after stemming

I think that creating them after the lemmatization step is a good choice because we are interested in words in their context. Stemming mainly extracts the core of a single word, but it loses information about the original form, which can be more informative when combined with neighbouring words.

Another big question is how many n-grams we should create.
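To make the question of quantity concrete, here is a small pure-Python sketch of n-gram construction (the notebook itself relies on `nltk.util.ngrams`). The helper `make_ngrams` is a hypothetical name used only here. A list of `k` tokens yields `k - n + 1` n-grams, so each additional value of `n` adds roughly one more term per token.

```python
def make_ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

tokens = ["markov", "chain", "monte", "carlo"]
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```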

In [None]:
#after lemming and cut of frequent words:
two_grams = bag_of_words_per_course.map(lambda c : {"courseId" : c["courseId"], "name" : c["name"], "description" : [w for w in ngrams(c["description"],2)]})
three_grams = bag_of_words_per_course.map(lambda c : {"courseId" : c["courseId"], "name" : c["name"], "description" : [w for w in ngrams(c["description"],3)]})
four_grams =bag_of_words_per_course.map(lambda c : {"courseId" : c["courseId"], "name" : c["name"], "description" : [w for w in ngrams(c["description"],4)]})
n_grams = two_grams.union(three_grams.union(four_grams))

---

In [None]:
n_gram_aggregation = n_grams.flatMap(lambda c : c["description"]) #flatten all words lists
n_gram_number = n_gram_aggregation.count()
n_gram_count = n_gram_aggregation.map(lambda w : (w,1)).reduceByKey(lambda w1,w2 : w1+w2).map(lambda w : (w[1],w[0])).sortByKey(False)
n_gram_freq = n_gram_count.map(lambda w : (w[1],w[0]/n_gram_number))
n_gram_freq_for_plot = np.asarray(n_gram_freq.map(lambda w : w[1]).collect())
fig,axs = plt.subplots(1,2,figsize=(16,4))
axs[0].set_title("Frequencies of n-grams")
axs[0].set_xlabel("ngram Id")
axs[0].set_ylabel("log of frequencies")
plot1 = axs[0].plot(np.linspace(0,len(n_gram_freq_for_plot), num=len(n_gram_freq_for_plot)),np.log(n_gram_freq_for_plot))
plot2 = axs[1].boxplot(n_gram_freq_for_plot)

In [None]:
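# indices of the n-grams whose frequency lies above the chosen quantile
really_frequent_n_gram_indices = np.where(n_gram_freq_for_plot > np.quantile(n_gram_freq_for_plot,0.5)) # maybe change quantile
# n_gram_freq is sorted by decreasing frequency; keep only the n-gram, not the (n-gram, frequency) pair
really_frequent_n_gram = set(g for g, _ in n_gram_freq.take(int(really_frequent_n_gram_indices[0][-1]) + 1))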

Should we cut off the n-grams that are too frequent?

In [None]:
n_grams = n_grams.map(lambda c :{"courseId" : c["courseId"], "name" : c["name"], "description" : [w for w in c["description"] if w not in really_frequent_n_gram]}) 

In [None]:
n_grams = n_grams.map(lambda c : {"courseId" : c["courseId"], "name" : c["name"], "description" : [" ".join(w) for w in c["description"]]})

### Bag of words creation

We take the bag of words of each course, convert it to a set and take the union over all courses so that no duplicates are kept. We then do the same for the n-grams, and finally take the union of everything.
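The set-union step can be sketched on toy data as follows. The dictionary `toy_courses` and its contents are made up for illustration, and plain Python sets are used instead of Spark RDDs.

```python
# Hypothetical per-course term lists (single words and n-grams mixed)
toy_courses = {
    "C1": ["markov", "chain", "markov chain"],
    "C2": ["graph", "markov", "graph theory"],
}

# Union of the per-course sets: every distinct term appears exactly once
vocabulary = set()
for terms in toy_courses.values():
    vocabulary |= set(terms)

# Sorting the vocabulary gives each term a stable integer id
term_ids = {term: i for i, term in enumerate(sorted(vocabulary))}
print(term_ids)
```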

In [None]:
bag_of_words_per_course = bag_of_words_per_course.map(lambda c : (c["courseId"], (c["name"],c["description"]))).union(n_grams.map(lambda c : (c["courseId"], (c["name"],c["description"])))).reduceByKey(lambda a,b : (a[0],a[1]+b[1]))
bag_of_words_per_course = bag_of_words_per_course.map(lambda c : {"courseId" : c[0], "name" : c[1][0], "description" : c[1][1]})

In [None]:
bag_of_words = bag_of_words_per_course.map(lambda c : set(c["description"])).reduce(lambda a,b : a.union(b))

### IX description

Should the n_grams be included?

In [None]:
sorted(set(bag_of_words_per_course.filter(lambda c : c["courseId"] == "COM-308").take(1)[0]["description"]))

## Exercise 4.2: Term-document matrix

We use a TF-IDF implementation in which the term frequency is $\frac{f_{t,d}}{\sum_{t' \in d}{f_{t',d}}}$ and the inverse document frequency is $\log{\frac{N}{|\{d \in D : t \in d\}|}}$.
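A minimal pure-Python sketch of these two expressions on a toy corpus is given below. The documents in `docs` and the helper `tf_idf_scores` are hypothetical, and the natural logarithm is assumed.

```python
import math
from collections import Counter

# Hypothetical toy corpus: document id -> list of terms
docs = {
    "d1": ["markov", "chain", "markov"],
    "d2": ["graph", "chain"],
}
N = len(docs)
# Document frequency: number of documents that contain each term
df = Counter(t for terms in docs.values() for t in set(terms))

def tf_idf_scores(terms):
    """TF-IDF of every term of one document: (f_td / sum of f_t'd) * log(N / df_t)."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {t: (c / total) * math.log(N / df[t]) for t, c in counts.items()}

scores = {d: tf_idf_scores(terms) for d, terms in docs.items()}
print(scores["d1"])
```

In this toy corpus "chain" occurs in every document, so its idf, and therefore its score, is zero, whereas "markov" only occurs in `d1` and gets a positive score.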

In [None]:
course_id_document_mapping = dict(zip(bag_of_words_per_course.map(lambda c : c["courseId"]).collect(), range(bag_of_words_per_course.count())))
term_id_mapping = dict(zip(sorted(bag_of_words), range(len(bag_of_words))))

In [None]:
def set_freq_element(t):
    term_frequency[t[0][0]][t[0][1]] = t[1]
    return t

In [None]:
term_count = bag_of_words_per_course.map(lambda c : [((term_id_mapping[w],course_id_document_mapping[c["courseId"]]),1) for w in c["description"]]).flatMap(lambda c : c).reduceByKey(lambda a,b : a+b).collect()

In [None]:
term_frequency = np.zeros((len(term_id_mapping),len(course_id_document_mapping)),dtype=np.int64)
term_count = list(map(lambda t : set_freq_element(t), term_count)) # fill in the raw counts
# document frequency: number of documents in which each term appears
document_frequency = np.count_nonzero(term_frequency, axis=1).reshape(-1,1)
# normalise each column by the document length to obtain the term frequencies
term_frequency = term_frequency/term_frequency.sum(axis=0)
# idf = log(N / document frequency), as in the formula above
idf = np.log(term_frequency.shape[1]/document_frequency)
tf_idf = term_frequency*idf

Top 15 terms of the IX course (COM-308) in terms of TF-IDF score:

In [None]:
sorted(list(zip(tf_idf[:,course_id_document_mapping["COM-308"]],term_id_mapping.keys())),key=lambda t : t[0],reverse=True)[:15]

With this TF-IDF definition, the score is high when a term is frequent in the document but rare in the rest of the corpus (the term frequency is large and $|\{d \in D : t \in d\}|$ is small, so the idf is large).
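As a worked illustration of this behaviour, consider two hypothetical terms that each appear 3 times in a document of 100 terms, in a corpus of $N = 854$ documents (the size of the course collection). The counts are made up for illustration, and the natural logarithm is assumed:

$$\text{rare term (in 2 documents):}\quad \frac{3}{100}\cdot\log\frac{854}{2} \approx 0.03 \times 6.06 \approx 0.18$$

$$\text{common term (in 427 documents):}\quad \frac{3}{100}\cdot\log\frac{854}{427} \approx 0.03 \times 0.69 \approx 0.02$$

Both terms have the same term frequency, but the rare one scores roughly nine times higher.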

## Exercise 4.3: Document similarity search

In [None]:
def cos_similarity(d_i,d_j):
    return (d_i.T@d_j)/(np.linalg.norm(d_i)*np.linalg.norm(d_j))

In [None]:
query_doc = np.zeros((tf_idf.shape[0],1),dtype=np.float64) # float dtype so that the TF-IDF weights are not truncated to integers
query_term = ["markov chain","facebook"]
for t in query_term:
    query_doc[term_id_mapping[t]] = np.max(tf_idf)

In [None]:
query_result = np.zeros((tf_idf.shape[1],1))
for i in range(tf_idf.shape[1]):
    query_result[i,0] = cos_similarity(query_doc,tf_idf[:,i].reshape(tf_idf.shape[0],1))

In [None]:
sorted(list(zip(query_result,course_id_document_mapping.keys())),key=lambda t : t[0],reverse=True)[:5]

In [None]:
courses

In [None]:
def query_processor(words_list,num=5):
    """Return the `num` courses most similar to the query terms, best match first."""
    # float dtype so that the TF-IDF weights are not truncated to integers
    query_doc = np.zeros((tf_idf.shape[0],1),dtype=np.float64)
    for t in words_list:
        query_doc[term_id_mapping[t]] = np.max(tf_idf)
    query_result = np.zeros((tf_idf.shape[1],1))
    for i in range(tf_idf.shape[1]):
        query_result[i,0] = cos_similarity(query_doc,tf_idf[:,i].reshape(tf_idf.shape[0],1))
    best_fit = sorted(list(zip(query_result,course_id_document_mapping.keys())),key=lambda t : t[0],reverse=True)[:num]
    best_ids = [t[1] for t in best_fit]
    # index the matching courses by id so that they can be returned in ranking order
    courses_by_id = {c["courseId"] : c for c in courses_rdd.filter(lambda c : c["courseId"] in best_ids).collect()}
    return [courses_by_id[i] for i in best_ids]

In [None]:
query_processor(["markov chain","facebook"])