# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *D*

**Names:**

* *Marc Bickel*
* *Cyril Cadoux*
* *Emma Lejal*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [93]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl


stopwords = load_pkl('data/stopwords.pkl')

## Exercise 4.1: Pre-processing

In [94]:
import string

Check NLTK library !

In [95]:
import nltk
#nltk.download()

We have chosen to implement the following features : 
- Remove the stop words
- Remove the punctuation
- Stem the words
- Lemmatize the words

In [211]:
def deletePunctNumber(s):
    
    punct_number = ['.',',','\'', '=', '\"','#', '*', '&', '!', '?', ':', ';', '-', '_', '(', ')', '/',
                    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '%']
    result = s
    
    for p in punct_number :
        result = result.replace(p, " ")
        
    return result

In [248]:
def remove_mail_and_http(s):
    
    result = s
    
    for word in s.split():
        
        if ('www' in word or '@' in word or 'http' in word or 'moodle' in word):
            result = result.replace(word, " ")
    
    return result

Using the nltk lib, we define a function that lemmatize and stemms a string

In [249]:
def lemm_stemm(s):
    
    # ----- Ensuring lemm and stemm don't crash -----
    while( len(s) > 0 and s[0] == ' '):
        s = s[1:]
        
    s = s.replace('  ', ' ')
    s = s.replace('   ', ' ')
    s = s.replace('    ', ' ')
    s = [word for word in s.split(" ") if len(word) > 0]
    # ------------------------------------------------

    wordsWtag = nltk.pos_tag(s)

    lemmatizer = nltk.stem.WordNetLemmatizer()
    lems = [lemmatizer.lemmatize(w, pos = givePos(t.lower()[0])) for (w, t) in wordsWtag ]

    porter_stemmer = nltk.stem.PorterStemmer()
    stems = [porter_stemmer.stem(l) for l in lems]

    return (' '.join(stems)).lower()

In [266]:
all_stopwords = ''

for e in stopwords:
    all_stopwords += e + " "


# Handmade stopwords :
all_stopwords += "class explore number key past decade century course field application pratice current"
all_stopwords += " commun provide good sessions laboratory prerequisite problem student teach cathedra"
all_stopwords +=  " homework lab expect knowledge lecture fundament concept project midterm final exam question"
all_stopwords += " grade make learn understand skill capacity small large group report write oral interpret access output"
all_stopwords += " start wide invitation method develop base website control exercise discuss scale introduction illustration "
all_stopwords += " selection student study keyword opportun previous"
all_stopwords += " model system content design assessment process analysi basic activity outcome present evaluation material"
all_stopwords += " structure requirement theory generation plan hour import effect scientific time proprety cells office program"
all_stopwords += " technology principles equation sources science solution relation interaction week semester bachelor master"
all_stopwords += " results industry test device experiment weekly medium concrete techniques acquire handson moodle epfl"

l_s_stop = lemm_stemm(all_stopwords)

In [281]:
def strip_stopwords(s):
    
    s_words = s.split() 
    
        
    # Remove basic stopwords, internet and email addresses
    first_filter = [word for word in s_words if (word.lower() not in all_stopwords)
                   and (len(word) > 2)
                   and ("http" not in word)
                   and ('@' not in word) ]
        
    s_pure = ''
    for w in first_filter:
        s_pure += w + " "
    
    if (len(s_pure) == 0) :
        return ''
    
    # Remove lemm_stemm versions of stopwords
    lemm_stemm_words = lemm_stemm(s_pure).split()
    
    resultWords = [word for word in lemm_stemm_words if (word.lower() not in l_s_stop)]
    
    unique_words = list(set(resultWords))


    result = ' '.join(unique_words)
    return result

In [282]:
def givePos(c):
    #if(c == 'p' or c == 'i' or c == 'j' or c== 'f' or c == 'c' or c== 'm' or c == 'd' or c== 'w'):
    if(c != 'v'):   
        return 'n'
    else:
        return 'v'

In [283]:
def preProcess(s):
    
    s = remove_mail_and_http(s)
    s = deletePunctNumber(s)
    

    return strip_stopwords(s)

Now we can finally preprocess all the courses' descriptions

In [284]:
courses = load_json('data/courses.txt')

for i in range(len(courses)):
    if (i % 7 == 1):
        print("\r", str(i*100/len(courses))[:4], " %", end='')
        
    courses[i]['description'] = preProcess(courses[i]['description']) #lemm_stemmed
    
print("\r100 %")

100 %


A useful dict :

In [285]:
name_2_description = {}
pos_2_name = {}
name_2_pos = {}


for i in range  (len(courses)):
    name_2_description[courses[i]["name"]] = courses[i]["description"]
    pos_2_name[i] = courses[i]['name']
    
name_2_pos = {v:k for k,v in pos_2_name.items()}


Here is the preprocessed version of the description of the course Internet Analytics

In [286]:
name_2_description["Internet analytics"]

'seek chain social modelsdata combin reduct reduc function algebra linear curat inform commerc markov infrastructur apach framework draw algorithm analyt cluster mine graph stochast data network auction map practic hadoop retriev statist effici dimension java balanc cloud advertis recommend dataset topic collect typic theoret servicesdevelop user ubiquit search stream onlin internet comput world spark inspir servic dedic detect machin servicesanalyz coverag foundat'

## Exercise 4.2: Term-document matrix

Let's compute the absolute occurence of each word on the whole corpus

In [157]:
from collections import Counter

- We create a string containing the whole corpus.
- We create a list 'docs' containing the courses's names
- We create a list of tuple 'occurences' that stores the words of the corpus along with their number of apparition
- We create a list 'words' that contains all different words

In [158]:
wholeCorpus = ''
docs = []

for c in courses:
    wholeCorpus += c['description']
    docs.append(c['name'])

In [159]:
occurences = Counter(wholeCorpus.lower().split()).most_common()

nb_words = len(occurences)
nb_docs = len(courses)

print("We have ", nb_words , " different words in total among the ", nb_docs , " descriptions")

We have  13302  different words in total among the  854  descriptions


In [160]:
words = []
for (w, n) in occurences:
    words.append(w)

In [169]:
print(words[:100])

['model', 'system', 'content', 'design', 'assess', 'process', 'analysi', 'basic', 'activ', 'outcom', 'present', 'evalu', 'materi', 'data', 'structur', 'requir', 'theori', 'comput', 'engin', 'energi', 'physic', 'techniqu', 'gener', 'recommend', 'transvers', 'plan', 'mechan', 'hour', 'research', 'import', 'discuss', 'effect', 'optim', 'includ', 'practic', 'technolog', 'electron', 'supervis', 'optic', 'function', 'biolog', 'principl', 'properti', 'time', 'scientif', 'imag', 'chemic', 'program', 'topic', 'inform', 'linear', 'assist', 'offic', 'cell', 'manag', 'tool', 'integr', 'case', 'perform', 'equat', 'dynam', 'algorithm', 'semest', 'sourc', 'chemistri', 'resourc', 'molecular', 'statist', 'organ', 'scienc', 'solut', 'signal', 'particip', 'relat', 'interact', 'week', 'methodolog', 'network', 'state', 'architectur', 'product', 'measur', 'level', 'numer', 'simul', 'note', 'flow', 'paper', 'solv', 'industri', 'experi', 'cover', 'technic', 'oper', 'test', 'approach', 'devic', 'circuit', 'res

In [None]:
all_stopwords += "class explore number key past decade century course field application pratice current"
all_stopwords += " commun provide good sessions laboratory prerequisite problem student teach cathedra"
all_stopwords +=  " homework lab expect knowledge lecture fundament concept project midterm final exam question"
all_stopwords += " grade make learn understand skill capacity small large group report write oral interpret access output"
all_stopwords += " start wide invitation method develop base website control exercise discuss scale introduction illustration "
all_stopwords += " selection student study keyword opportun previous"
all_stopwords += " model system content design assess process analysi basic activity outcome present evaluation material"
all_stopwords += " structure requirement theory generation plan hour import effect scientific time proprety cells office program"
all_stopwords += " technology principles equation sources science solution relation interaction week semester bachelor master"
all_stopwords += " results industry test device experiment"

- Now we define the TF function

In [161]:
#Function to compute TF
def TF(s):
    
    w_n_tuple = Counter(s.lower().split()).most_common()
    highestScore = w_n_tuple[0][1]
    res = []
    
    for(w, n) in w_n_tuple:
        res.append((w, n/highestScore))
    return res

In [162]:
TF(name_2_description['Internet analytics'])

[('data', 1.0),
 ('model', 1.0),
 ('social', 0.8333333333333334),
 ('onlin', 0.8333333333333334),
 ('realworld', 0.6666666666666666),
 ('network', 0.6666666666666666),
 ('mine', 0.5),
 ('recommend', 0.5),
 ('largescal', 0.5),
 ('basic', 0.5),
 ('servic', 0.5),
 ('analyt', 0.3333333333333333),
 ('ecommerc', 0.3333333333333333),
 ('materi', 0.3333333333333333),
 ('inform', 0.3333333333333333),
 ('algorithm', 0.3333333333333333),
 ('graph', 0.3333333333333333),
 ('practic', 0.3333333333333333),
 ('comput', 0.3333333333333333),
 ('machin', 0.3333333333333333),
 ('algebra', 0.3333333333333333),
 ('linear', 0.3333333333333333),
 ('system', 0.3333333333333333),
 ('auction', 0.3333333333333333),
 ('hadoop', 0.3333333333333333),
 ('dataset', 0.3333333333333333),
 ('stream', 0.3333333333333333),
 ('internet', 0.3333333333333333),
 ('cluster', 0.3333333333333333),
 ('detect', 0.3333333333333333),
 ('seek', 0.16666666666666666),
 ('weekli', 0.16666666666666666),
 ('selfcontain', 0.1666666666666666

In [163]:
#Computing IDF
words_2_nbDoc = {}
for w in words:
    words_2_nbDoc[w] = 0

In [164]:
for i in range(nb_docs):
    
    d = courses[i]['description'].lower()
    listOfWords = d.split(' ')
    
    for j in listOfWords:
        try:
            words_2_nbDoc[j] += 1
        except KeyError:
            print(j)
            words_2_nbDoc[j] = 1
            words.append(j)

concis
schmid
zaharko
genèv
frossard
examnin
mallin
telephon
microbiologybiochemistri
synopsi
solidifi
chatedra
productlifecycl
esem
fontain
bibliograpgi
videos
projectbas
perpect
twoday
broadbas
kunt
appointment
fudament
rigour
durret
biostatist
presentationpost
madou
hesit
biol
zhakypov
acquired
rnabas
reportpresent
engineeringrelev
available
practicebas
haykin
introdut
threesemest
wellin
enviroment
lasting
présentat
tini


In [165]:
for k in words_2_nbDoc.keys():
    words_2_nbDoc[k] = - np.log2(words_2_nbDoc[k]/nb_docs)

In [166]:
words_2_nbDoc['internet']

6.0376525414793987

In [85]:
TFIDF = np.zeros((len(words),nb_docs))

In [86]:
def indexOf(w):
    for i in range(len(words)):
        if(words[i] == w):
            return i

In [89]:
for k in range(len(courses)):
    doc = courses[k]['description']
    tf = TF(doc)
    for (w, f) in tf:
        TFIDF[indexOf(w)][k] = f* words_2_nbDoc[w]

In [54]:
TFIDF[:][43]

array([ 0.09178897,  0.        ,  0.06884173,  0.        ,  0.        ,
        0.        ,  0.11014676,  0.08472828,  0.        ,  0.        ,
        0.22029353,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.13768345,  0.18357794,  0.        ,  0.        ,
        0.12238529,  0.04589448,  0.11014676,  0.11014676,  0.07867626,
        0.09178897,  0.13768345,  0.13768345,  0.        ,  0.        ,
        0.        ,  0.10013342,  0.13768345,  0.18357794,  0.09178897,
        0.15735252,  0.        ,  0.13768345,  0.05797198,  0.        ,
        0.        ,  0.12238529,  0.22029353,  0.        ,  0.07867626,
        0.        ,  0.06884173,  0.        ,  0.13768345,  0.        ,
        0.11014676,  0.11014676,  0.11014676,  0.        ,  0.08472828,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.06884173,  0.08472828,  0.18357794,  0.        ,  0.        ,
        0.18357794,  0.        ,  0.        ,  0.        ,  0.11

In [55]:
orderedIndexes = np.argsort(TFIDF[:][43], axis = 0)
for i in range(15):
    print(words[orderedIndexes[-15 +i]])

transit
area
step
moodl
experiment
period
standard
teacher
hybrid
organis
lab
laboratori
current
photovolta
tp


The large scores are words that appears a lot in one document but rarely in other documents (TF score is high and IDF is high).
Indeed IDF is high when log2(n/N) is low, log2(n/N) is low when n/M << 1. and n/M << 1 means that n is small as M is fixed. n is small when the word apears in very few documents 
The small scores are words that either appear rarely in a doc or rarely in every documents (TF or IDF score is low)

## Exercise 4.3: Document similarity search

In [146]:
import numpy.linalg as la

In [147]:
def sim(di, dj):
    return di.T@ dj /(la.norm(di)*la.norm(dj))

In [152]:
#Queries = "Markov Chain" "facebook"
"""Q1= np.zeros(len(words))
Q2 = np.zeros(len(words))
indMarkov = indexOf('markov')
indChain = indexOf('chain')
indFacebook = indexOf('facebook')
Q1[indMarkov] = 1;
Q1[indChain] = 1;
Q2[indFacebook] = 1;"""

In [153]:
#similarityScores = np.zeros(nb_docs)
#similarityScores = sim(Q1, TFIDF[:])

In [158]:
#indexes = np.argsort(similarityScores)
#docs[indexes[-1]]

'Applied probability & stochastic processes'

In [159]:
def topFiveCourses(query):
    queryWords = query.split(' ')
    Q = np.zeros(len(words))
    for w in queryWords:
        Q[indexOf(w)] = 1
    similarityScores = np.zeros(nb_docs)
    similarityScores = sim(Q, TFIDF[:])
    indexes = np.argsort(similarityScores)
    for i in range(5):
        print("Course : ", docs[indexes[-5 +i]], 
              " with similarity :", similarityScores[indexes[-5+i]] )

In [160]:
topFiveCourses('markov chain')

Course :  Markov chains and algorithmic applications  with similarity : 0.00773359636733
Course :  Markov chains and algorithmic applications  with similarity : 0.00773359636733
Course :  Applied probability & stochastic processes  with similarity : 0.00895240801474
Course :  Applied probability & stochastic processes  with similarity : 0.00895240801474
Course :  Applied probability & stochastic processes  with similarity : 0.00895240801474


In [161]:
topFiveCourses('facebook')

Course :  Transportation systems engineering  with similarity : 0.0
Course :  Discrete mathematics  with similarity : 0.0
Course :  Computational Social Media  with similarity : 0.00264110109284
Course :  Computational Social Media  with similarity : 0.00264110109284
Course :  Computational Social Media  with similarity : 0.00264110109284
