# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *D*

**Names:**

* *Marc Bickel*
* *Cyril Cadoux*
* *Emma Lejal*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [107]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl


stopwords = load_pkl('data/stopwords.pkl')

## Exercise 4.1: Pre-processing

In [108]:
import string

Check NLTK library !

In [109]:
import nltk
#nltk.download()

We chose to implement the following pre-processing steps:
- Remove the stop words
- Remove the punctuation
- Stem the words
- Lemmatize the words

In [110]:
def deletePunctNumber(s):
    
    punct_number = ['.',',','\'', '=', '\"','#', '*', '&', '!', '?', ':', ';', '-', '_', '(', ')', '/',
                    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '%']
    result = s
    
    for p in punct_number :
        result = result.replace(p, " ")
        
    return result

In [111]:
def remove_mail_and_http(s):
    
    result = s
    
    for word in s.split():
        
        if ('www' in word or '@' in word or 'http' in word or 'moodle' in word):
            result = result.replace(word, " ")
    
    return result

Using the NLTK library, we define a function that lemmatizes and stems a string.

In [112]:
def givePos(c):
    # c is the first letter (lowercased) of a tag returned by nltk.pos_tag.
    # Verbs ('v') are lemmatized as verbs, everything else as nouns ('n').
    if c != 'v':
        return 'n'
    else:
        return 'v'

In [113]:
def lemm_stemm(s):
    
    # ----- Ensuring lemm and stemm don't crash -----
    while( len(s) > 0 and s[0] == ' '):
        s = s[1:]
        
    s = s.replace('  ', ' ')
    s = s.replace('   ', ' ')
    s = s.replace('    ', ' ')
    s = [word for word in s.split(" ") if len(word) > 0]
    # ------------------------------------------------

    wordsWtag = nltk.pos_tag(s)

    lemmatizer = nltk.stem.WordNetLemmatizer()
    lems = [lemmatizer.lemmatize(w, pos = givePos(t.lower()[0])) for (w, t) in wordsWtag ]

    porter_stemmer = nltk.stem.PorterStemmer()
    stems = [porter_stemmer.stem(l) for l in lems]

    return (' '.join(stems)).lower()

In [114]:
all_stopwords = ''

for e in stopwords:
    all_stopwords += e + " "


# Handmade stopwords :
all_stopwords += "class explore number key past decade century course field application pratice current"
all_stopwords += " commun provide good sessions laboratory prerequisite problem student teach cathedra"
all_stopwords +=  " homework lab expect knowledge lecture fundament concept project midterm final exam question"
all_stopwords += " grade make learn understand skill capacity small large group report write oral interpret access output"
all_stopwords += " start wide invitation method develop base website control exercise discuss scale introduction illustration "
all_stopwords += " selection student study keyword opportun previous"
all_stopwords += " model system content design assessment process analysi basic activity outcome present evaluation material"
all_stopwords += " structure requirement theory generation plan hour import effect scientific time proprety cells office program"
all_stopwords += " technology principles equation sources science solution relation interaction week semester bachelor master"
all_stopwords += " results industry test device experiment weekly medium concrete techniques acquire handson moodle epfl"

l_s_stop = lemm_stemm(all_stopwords)

In [115]:
def strip_stopwords(s):
    
    s_words = s.split() 
    
        
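    # Remove basic stopwords, internet and email addresses.
    # Note: all_stopwords is a string, so `not in` is a substring test and
    # also removes any word contained in a longer stopword.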
    first_filter = [word for word in s_words if (word.lower() not in all_stopwords)
                   and (len(word) > 2)
                   and ("http" not in word)
                   and ('@' not in word) ]
        
    s_pure = ''
    for w in first_filter:
        s_pure += w + " "
    
    if (len(s_pure) == 0) :
        return ''
    
    # Remove lemm_stemm versions of stopwords
    lemm_stemm_words = lemm_stemm(s_pure).split()
    
    resultWords = [word for word in lemm_stemm_words if (word.lower() not in l_s_stop)]
    

    result = ' '.join(resultWords)
    return result
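
Because `all_stopwords` and `l_s_stop` are plain strings, the `in` tests in `strip_stopwords` match substrings rather than whole words. The sketch below contrasts the two behaviours on a toy stop list; all names and values in it are assumptions made for illustration.

```python
# Toy stop list, assumed for illustration only.
stop_string = "introduction illustration selection"
stop_set = set(stop_string.split())

tokens = ["ion", "select", "introduction", "energy"]

# Substring test: "ion" and "select" are dropped because they occur
# inside longer stopwords.
kept_substring = [t for t in tokens if t not in stop_string]

# Whole-word test: only exact stopwords are dropped.
kept_set = [t for t in tokens if t not in stop_set]

print(kept_substring)  # ['energy']
print(kept_set)        # ['ion', 'select', 'energy']
```

Switching to a set would change which words survive pre-processing, so the vocabulary size and the scores reported later in this notebook would change accordingly.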

In [116]:
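def function_easy(description):
    # Separate words glued together by a missing space,
    # e.g. 'environmentRisk' -> 'environment Risk'.
    # Each word is split at its last uppercase letter that is not the first character.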
    liste_dude = description.split(' ')
    dude_viens = ''
    for word_dude in liste_dude:
        petit_string_1 = word_dude
        petit_string_2 = ''
        for i in range(len(word_dude)):
            if (str.isupper(word_dude[i]) and i != 0):
                petit_string_1 = word_dude[:i]
                petit_string_2 = word_dude[i:] + ' '
        dude_viens += petit_string_1 + ' ' + petit_string_2
    return dude_viens

In [117]:
def preProcess(s):
    s = function_easy(s)
    s = remove_mail_and_http(s)
    s = deletePunctNumber(s)
    

    return strip_stopwords(s)

Now we can finally preprocess all the course descriptions.

In [118]:
courses = load_json('data/courses.txt')

for i in range(len(courses)):
    if (i % 7 == 1):
        print("\r", str(i*100/len(courses))[:4], " %", end='')
        
    courses[i]['description'] = preProcess(courses[i]['description']) #lemm_stemmed
    
print("\r100 %")

100 %


In [119]:
print(function_easy("environmentRisk assessmentMicrobial source trackingDisinfection"))

environment Risk assessment Microbial source tracking Disinfection 
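
`function_easy` splits each word at a single position (the last uppercase letter that is not the first character). A regex-based variant that splits at every lowercase-to-uppercase boundary could look like the sketch below; `split_glued_words` is a hypothetical name, and its behaviour differs slightly from `function_easy` (no trailing spaces, and all-caps words such as acronyms are left intact).

```python
import re

def split_glued_words(text):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

print(split_glued_words("environmentRisk assessmentMicrobial source trackingDisinfection"))
```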


In [120]:
coursesTest = load_json('data/courses.txt')
coursesTest[549]['description']

'The goal of this course is to obtain an overview over waterborne pathogens and the risk they pose. We will not focus on biological virulence mechanisms. Instead, we will discuss the fate of pathogens in the environment, detection, inactivation mechanisms, and microbial risk assessment Content Overview over waterborne diseasesWaterborne virusesWaterborne bacteriaWaterborne protozoaWaterborne helminths Detection methodsIndicator organismsPathogens and global changeTransport and survival in the environmentRisk assessmentMicrobial source trackingDisinfection and other control mechanisms'

Some useful lookup dictionaries (course name to description, position to name, and name to position):

In [121]:
name_2_description = {}
pos_2_name = {}
name_2_pos = {}


for i in range  (len(courses)):
    name_2_description[courses[i]["name"]] = courses[i]["description"]
    pos_2_name[i] = courses[i]['name']
    
name_2_pos = {v:k for k,v in pos_2_name.items()}


Here is the preprocessed description of the course Internet Analytics:

In [122]:
name_2_description["Internet analytics"]

'internet analyt collect user data onlin servic social network commerc search advertis function onlin servic ubiquit seek balanc foundat algorithm statist graph world inspir practic internet cloud servic social inform network recommend cluster detect search retriev topic dimension reduct stream comput onlin auction coverag data mine analyt social network commerc social combin theoret dataset world dedic infrastructur hadoop apach spark data mine machin social network map reduc hadoop recommend cluster detect topic inform retriev stream comput auction stochast recommend linear algebra algorithm data graph linear algebra markov chain java world data onlin servic framework typic data mine onlin servic analyz effici modelsdata mine machin world practic world dataset curat draw'

## Exercise 4.2: Term-document matrix

Let's compute the absolute number of occurrences of each word over the whole corpus.

In [123]:
from collections import Counter

- We create a string containing the whole corpus.
- We create a list 'docs' containing the course names.
- We create a list of tuples 'occurences' that stores each word of the corpus along with its number of occurrences.
- We create a list 'words' that contains all distinct words.

In [124]:
wholeCorpus = ''
docs = []

for c in courses:
    wholeCorpus += c['description']
    docs.append(c['name'])

In [125]:
occurences = Counter(wholeCorpus.lower().split()).most_common()

nb_words = len(occurences)
nb_docs = len(courses)

print("We have ", nb_words , " different words in total among the ", nb_docs , " descriptions")

We have  8620  different words in total among the  854  descriptions
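
One detail worth checking in the corpus construction: the descriptions are concatenated without a separator, so the last word of one description and the first word of the next appear to fuse into a single token. The toy example below (made-up descriptions, not the course data) shows the effect and how joining with a space avoids it.

```python
from collections import Counter

# Toy preprocessed descriptions, assumed for illustration only.
descriptions = ["data mine graph", "graph theori data"]

glued = "".join(descriptions)    # same effect as repeated += without a separator
spaced = " ".join(descriptions)  # keeps document boundaries as word boundaries

glued_counts = Counter(glued.split())
spaced_counts = Counter(spaced.split())

print(glued_counts.most_common())
print(spaced_counts.most_common())
```

With the glued version, `graphgraph` shows up as a spurious word and `graph` itself is never counted.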


In [126]:
words = []
for (w, n) in occurences:
    words.append(w)

In [127]:
print(words[:100])

['data', 'comput', 'engin', 'energi', 'physic', 'mechan', 'optim', 'recommend', 'research', 'transvers', 'function', 'optic', 'inform', 'electron', 'practic', 'includ', 'properti', 'supervis', 'biolog', 'tool', 'resourc', 'imag', 'manag', 'linear', 'chemic', 'perform', 'integr', 'topic', 'discuss', 'analyz', 'state', 'case', 'assist', 'algorithm', 'dynam', 'signal', 'chemistri', 'note', 'organ', 'object', 'statist', 'molecular', 'measur', 'level', 'network', 'solv', 'product', 'particip', 'simul', 'architectur', 'continu', 'flow', 'paper', 'methodolog', 'oper', 'numer', 'implement', 'approach', 'reaction', 'set', 'circuit', 'cover', 'technic', 'explain', 'task', 'power', 'space', 'experiment', 'type', 'limit', 'risk', 'attend', 'mathemat', 'team', 'feedback', 'sensor', 'languag', 'domain', 'theoret', 'distribut', 'strategi', 'estim', 'protein', 'critic', 'role', 'digit', 'phase', 'demonstr', 'quantum', 'complex', 'advanc', 'prepar', 'transfer', 'goal', 'laser', 'market', 'magnet', 'mod

- Now we define the TF function, which divides each word count by the count of the most frequent word in the document

In [128]:
#Function to compute TF
def TF(s):
    
    w_n_tuple = Counter(s.lower().split()).most_common()
    highestScore = w_n_tuple[0][1]
    res = []
    
    for(w, n) in w_n_tuple:
        res.append((w, n/highestScore))
    return res

In [129]:
TF(name_2_description['Internet analytics'])

[('data', 1.0),
 ('world', 0.8333333333333334),
 ('social', 0.8333333333333334),
 ('servic', 0.8333333333333334),
 ('onlin', 0.8333333333333334),
 ('mine', 0.6666666666666666),
 ('network', 0.6666666666666666),
 ('recommend', 0.5),
 ('cluster', 0.3333333333333333),
 ('search', 0.3333333333333333),
 ('detect', 0.3333333333333333),
 ('dataset', 0.3333333333333333),
 ('algebra', 0.3333333333333333),
 ('machin', 0.3333333333333333),
 ('commerc', 0.3333333333333333),
 ('topic', 0.3333333333333333),
 ('hadoop', 0.3333333333333333),
 ('inform', 0.3333333333333333),
 ('stream', 0.3333333333333333),
 ('analyt', 0.3333333333333333),
 ('algorithm', 0.3333333333333333),
 ('comput', 0.3333333333333333),
 ('graph', 0.3333333333333333),
 ('linear', 0.3333333333333333),
 ('retriev', 0.3333333333333333),
 ('auction', 0.3333333333333333),
 ('internet', 0.3333333333333333),
 ('practic', 0.3333333333333333),
 ('seek', 0.16666666666666666),
 ('spark', 0.16666666666666666),
 ('dimension', 0.1666666666666666

In [130]:
#Computing IDF
words_2_nbDoc = {}
for w in words:
    words_2_nbDoc[w] = 0

In [131]:
for i in range(nb_docs):
    
    d = courses[i]['description'].lower()
    listOfWords = d.split(' ')
    
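    for j in listOfWords:
        try:
            # Every occurrence is counted, so a word repeated within one
            # description is counted more than once.
            words_2_nbDoc[j] += 1
        except KeyError:
            # Word missing from the corpus-level vocabulary: register it.
            words_2_nbDoc[j] = 1
            words.append(j)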

In [132]:
for k in words_2_nbDoc.keys():
    words_2_nbDoc[k] = - np.log2(words_2_nbDoc[k]/nb_docs)



In [133]:
words_2_nbDoc['internet']

5.7380922596204904
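
The counting loop above adds one per occurrence, so a word used several times in a single description raises the count several times. If the intent is a document frequency (number of descriptions containing the word), a minimal sketch on toy data could look as follows; the corpus and the names `docs_tokens`, `doc_freq` and `idf` are assumptions for illustration.

```python
import math

# Toy corpus, assumed for illustration only.
docs_tokens = [
    "data data mine".split(),
    "data graph".split(),
    "graph theori".split(),
]
n_docs = len(docs_tokens)

# Document frequency: count each word at most once per document.
doc_freq = {}
for tokens in docs_tokens:
    for w in set(tokens):
        doc_freq[w] = doc_freq.get(w, 0) + 1

# Same IDF definition as in the notebook: -log2(n / N).
idf = {w: -math.log2(n / n_docs) for w, n in doc_freq.items()}

print(doc_freq)
print(idf)
```

Here `data` occurs three times overall but in only two documents, so its document frequency is 2.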

In [134]:
TFIDF = np.zeros((len(words),nb_docs))

In [135]:
def indexOf(w):
    for i in range(len(words)):
        if(words[i] == w):
            return i
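
`indexOf` scans the whole `words` list on every call, and it is called once per word of every document. A dictionary built once gives the same index in constant time. The sketch below uses a toy vocabulary; `vocab`, `word_2_index` and `index_of` are hypothetical names.

```python
# Toy vocabulary, assumed for illustration only.
vocab = ["data", "comput", "engin"]

# Build the word -> row index mapping once.
word_2_index = {w: i for i, w in enumerate(vocab)}

def index_of(w):
    """Return the row index of w, or None if w is not in the vocabulary."""
    return word_2_index.get(w)

print(index_of("comput"))
print(index_of("facebook"))
```

Like `indexOf`, this returns `None` for an unknown word, so callers still need to handle that case explicitly.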

In [136]:
for k in range(len(courses)):
    doc = courses[k]['description']
    if len(doc) != 0:
        tf = TF(doc)
        for (w, f) in tf:
            TFIDF[indexOf(w)][k] = f* words_2_nbDoc[w]

In [137]:
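# Note: TFIDF[:][43] is the same as TFIDF[43], i.e. row 43 (one word across
# all documents); the column of document 43 would be TFIDF[:, 43].
TFIDF[:][43]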

array([ 0.        ,  0.        ,  0.10914407,  0.        ,  0.        ,
        0.        ,  0.15462077,  0.        ,  0.        ,  0.61848307,
        0.        ,  0.        ,  0.23193115,  0.        ,  0.        ,
        0.        ,  0.        ,  0.37108984,  0.        ,  0.20616102,
        0.        ,  0.        ,  0.18554492,  0.1686772 ,  0.4638623 ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.26506417,  0.        ,  0.23193115,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.23193115,  0.30924154,  0.        ,  0.        ,
        0.26506417,  0.        ,  0.20616102,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  1.11326953,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.18

In [138]:
orderedIndexes = np.argsort(TFIDF[:][43], axis = 0)
for i in range(15):
    print(words[orderedIndexes[-15 +i]])

volum
charg
exchang
static
therapi
gaussian
obtain
neuron
offer
wastewat
workshop
technic
fix
classroom
ressourc
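
Since `TFIDF` has one row per word and one column per document, the top words of a given document come from a column of the matrix. The toy example below (made-up values) shows how row and column selection differ in NumPy and how the highest-scoring word of a document can be retrieved.

```python
import numpy as np

# Toy term-document matrix: 3 words (rows) x 2 documents (columns), values assumed.
M = np.array([[0.1, 0.9],
              [0.5, 0.0],
              [0.2, 0.3]])

row_1 = M[:][1]   # M[:] is a view of the whole array, so this is row 1
col_1 = M[:, 1]   # column 1: the scores of every word in document 1

top_word = int(np.argmax(col_1))  # row index of the highest-scoring word in document 1

print(row_1, col_1, top_word)
```

If the list printed above was meant to show the top words of course 43, it is worth re-checking it with column indexing, because the argsort indexes would otherwise refer to documents rather than words.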


Large scores belong to words that appear often in one document but rarely in the other documents (both TF and IDF are high).
Indeed, IDF is $-\log_2(n/N)$, where $n$ is the number of documents containing the word and $N$ is the total number of documents. Since $N$ is fixed, IDF is high when $n/N \ll 1$, that is, when the word appears in very few documents.
Small scores belong to words that either appear rarely in the document (low TF) or appear in most documents (low IDF).
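
For reference, the scores computed in the cells above can be written as follows. The notation is introduced here for illustration: $c(w,d)$ is the number of occurrences of word $w$ in document $d$, $n_w$ is the count stored for $w$ in `words_2_nbDoc` before the logarithm is applied, and $N$ is `nb_docs`.

$$
\mathrm{tf}(w,d) = \frac{c(w,d)}{\max_{w'} c(w',d)}, \qquad
\mathrm{idf}(w) = -\log_2\frac{n_w}{N}, \qquad
\mathrm{tfidf}(w,d) = \mathrm{tf}(w,d)\,\mathrm{idf}(w)
$$

As a worked example with the values printed earlier, the word `internet` in the Internet Analytics description has $\mathrm{tf} \approx 0.3333$ and $\mathrm{idf} \approx 5.7381$, which gives $\mathrm{tfidf} \approx 0.3333 \times 5.7381 \approx 1.913$.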

## Exercise 4.3: Document similarity search

In [139]:
import numpy.linalg as la

In [140]:
def sim(di, dj):
    return di.T@ dj /(la.norm(di)*la.norm(dj))

In [141]:
def topFiveCourses(query):
    queryWords = query.split(' ')
    Q = np.zeros(len(words))
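    for w in queryWords:
        # Query words are used as typed (no stemming). If w is not in `words`,
        # indexOf returns None and Q[None] = 1 sets every entry of Q to 1.
        Q[indexOf(w)] = 1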
    similarityScores = np.zeros(nb_docs)
    similarityScores = sim(Q, TFIDF[:])
    indexes = np.argsort(similarityScores)
    for i in range(5):
        print("Course : ", docs[indexes[-5 +i]], 
              " with similarity :", similarityScores[indexes[-5+i]] )

In [142]:
topFiveCourses('markov chain')

Course :  Operations: economics & strategy  with similarity : 0.00885964223185
Course :  Statistical Sequence Processing  with similarity : 0.0110378462712
Course :  Markov chains and algorithmic applications  with similarity : 0.014741599374
Course :  Applied probability & stochastic processes  with similarity : 0.0154676673871
Course :  Applied stochastic processes  with similarity : 0.018631825327


In [143]:
topFiveCourses('facebook')

Course :  Molecular and cellular biophysic II  with similarity : 0.0
Course :  CCMX Advanced Course - Instrumented Nanoindentation  with similarity : 0.0
Course :  Electronic properties of solids and superconductivity  with similarity : 0.0
Course :  Hydrogeophysics  with similarity : 0.0
Course :  Computational Social Media  with similarity : 0.00469659925583


In [144]:
from utils import save_pkl

In [146]:
words[:10]

['data',
 'comput',
 'engin',
 'energi',
 'physic',
 'mechan',
 'optim',
 'recommend',
 'research',
 'transvers']

In [147]:
save_pkl(TFIDF, 'data/tfidf.pkl')
save_pkl(words, 'data/words.pkl')
save_pkl(docs, 'data/docs.pkl')