# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *N*

**Names:**

* *Anh Nghia Khau (223613)*
* *Sandra Djambazovska (224638)*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long code cells in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [None]:
import pickle
import math
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl, save_json
import operator

import nltk
from nltk import pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.corpus import wordnet


courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

## Exercise 4.1: Pre-processing

In [33]:
def split_words(word):
    """Transform HelloWord into Hello Word"""
    if (word.isupper() or word.islower()):
        return word
    else:
        pos_to_cut = []
        for i in range(1, len(word)):
            if (word[i].isupper()):
                pos_to_cut.append(i)
        curr = 0
        words = ''
        for pos in pos_to_cut:
            words += ' ' + word[curr: pos]
            curr = pos
        words += ' ' + word[curr:]
        return words
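
For comparison, the same kind of split can be sketched with a regular expression. This is an alternative, not the function used in this notebook: `split_camel_case` is a hypothetical name, and unlike `split_words` it only cuts between a lowercase letter and an uppercase letter and adds no leading space.

```python
import re

def split_camel_case(word):
    """Insert a space at every lowercase-to-uppercase boundary."""
    # (?<=[a-z]) : the previous character is a lowercase letter
    # (?=[A-Z])  : the next character is an uppercase letter
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', word)

print(split_camel_case('HelloWord'))   # Hello Word
print(split_camel_case('EPFL'))        # EPFL
print(split_camel_case('dataMining'))  # data Mining
```

Because the pattern needs a lowercase letter on the left, an acronym such as `EPFL` is left untouched, which matches the `isupper()` shortcut in `split_words`.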

In [34]:
"""Pre-requisite to choosing indexing terms"""
"""Combine RegularExpr (Remove the punctuation) and word_tokenize"""
def tokenization(sentence):
    tokenizer = RegexpTokenizer(r'\w+')
    temp = ''
    for w in sentence.split():
        temp += ' ' + split_words(w)   
    temp = word_tokenize(temp)
    new_sentence = ''
    for grams in temp:
        new_sentence += ' ' + grams

    return  tokenizer.tokenize(new_sentence) 

In [35]:
"""Remove words not important => smaller indexes and give more informative indexes"""
def stop_words(sentence):
    stopwords = load_pkl('data/stopwords.pkl')
    return [x.lower() for x in sentence if x.lower() not in stopwords]

In [36]:
def get_wordnet_pos(treebank_tag):
    """Map ['NN', 'NNS', 'NNP', 'NNPS'] to NOUN....."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

In [37]:
"""The goal of lemma and stemming is : reduces lexical variability 
                                      ⇒ reduces index size"""
def lemmatization(sentence):
    lemmatiser = WordNetLemmatizer()
    tokens_pos = pos_tag(sentence)
    tokens_pos = [(w,get_wordnet_pos(p)) for (w,p) in tokens_pos]
    
    return [lemmatiser.lemmatize(w, pos=p) for (w,p) in tokens_pos if p != None]

In [38]:
def preprocessing(sentence):
    """Tokenization"""
    new_sentence = tokenization(sentence)
    """Stopwords"""
    new_sentence = stop_words(new_sentence)
    """POS and Lemmatization"""
    new_sentence = lemmatization(new_sentence)
    
    return new_sentence

In [39]:
def frequent_words(nbOccurrences):
    frequencies = {}
    for course in courses:
        for prep in preprocessing(course['description']):
            if prep not in frequencies:
                frequencies[prep] = 1
            else:
                frequencies[prep] += 1
    sorted_frequencies = sorted(frequencies.items(), key=operator.itemgetter(1), reverse=True)
    blacklist = []
    for(k, v) in sorted_frequencies:
        if (v > nbOccurrences):
            blacklist.append(k)
    return blacklist, sorted_frequencies
blacklist, sorted_frequencies = frequent_words(300)
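
The counting loop in `frequent_words` can also be sketched with `collections.Counter`. The snippet below is a minimal illustration on made-up token lists, not a replacement for the cell above; `frequent_terms` and `docs` are hypothetical names.

```python
from collections import Counter

# Hypothetical pre-processed descriptions (lists of lemmas), for illustration only.
docs = [
    ['student', 'learn', 'method'],
    ['student', 'method', 'model'],
    ['student', 'energy'],
]

def frequent_terms(token_lists, threshold):
    """Return (terms occurring more than `threshold` times, all counts sorted descending)."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    ranked = counts.most_common()  # list of (term, count), highest count first
    frequent = [term for term, count in ranked if count > threshold]
    return frequent, ranked

frequent, ranked = frequent_terms(docs, threshold=1)
print(frequent)  # ['student', 'method']
```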

In [23]:
"""Frequency of words"""
sorted_frequencies

[('student', 2055),
 ('method', 1810),
 ('learn', 1554),
 ('system', 1194),
 ('model', 1177),
 ('design', 983),
 ('content', 922),
 ('project', 774),
 ('course', 760),
 ('analysis', 744),
 ('lecture', 707),
 ('basic', 706),
 ('end', 679),
 ('work', 664),
 ('assessment', 655),
 ('concept', 651),
 ('exercise', 628),
 ('teach', 619),
 ('data', 605),
 ('process', 602),
 ('prerequisite', 597),
 ('application', 573),
 ('keywords', 573),
 ('outcome', 572),
 ('problem', 566),
 ('material', 562),
 ('write', 547),
 ('activity', 524),
 ('introduction', 506),
 ('theory', 505),
 ('skill', 495),
 ('presentation', 474),
 ('study', 473),
 ('report', 465),
 ('structure', 455),
 ('plan', 446),
 ('exam', 435),
 ('energy', 434),
 ('require', 433),
 ('technique', 433),
 ('evaluate', 429),
 ('base', 429),
 ('expect', 427),
 ('time', 407),
 ('transversal', 397),
 ('hour', 384),
 ('research', 379),
 ('group', 377),
 ('engineering', 375),
 ('recommend', 370),
 ('class', 368),
 ('property', 367),
 ('field', 365

## Exercise 4.2: Term-document matrix

In [24]:
"""Read data and return a term-frequency matrix"""
"""Weighting scheme for term t in doc d: 
   
   TF(t,d) =  # occurs of t in d / max {# occurs of t'} for all terms t' in d"""


def load_data(path):
    """Mapping from 'term' to 'row indice'"""
    term_indices = {}
    """Mapping from 'row indice' to 'term'"""
    indices_term = {}
    """Mapping from 'courseID' to 'col indice'"""
    doc_indices = {}
    """Mapping from 'courseID' to 'course name'"""
    doc_names = {}
    
    values = []
    rows = []
    columns = []
    courses = load_json(path)
    terms_count = 0
    docs_count = 0
    
    blacklist, frequencies = frequent_words(300)
    
    for course in courses:
        id_  = course['courseId']
        name = course['name']
        description = course['description']
        doc_indices[id_] = docs_count
        doc_names[id_] = name

        processed = preprocessing(description)
        for term in processed:
            """Remove 1-gram and 2-grams"""
            if(len(term) <= 2):
                continue
            """Remove very frequent words nb occurrences > 300"""
            if (term in blacklist):
                continue
            if term not in term_indices:
                term_indices[term] = terms_count
                indices_term[terms_count] = term
                terms_count += 1
            """Append a value to matrix(row, col)"""
            values.append(1.0)
            rows.append(term_indices[term])
            columns.append(docs_count)
        """Go to another doc"""
        docs_count += 1
    """Create csr matrix"""
    tf_matrix = csr_matrix((values, (rows, columns)), shape=(terms_count, docs_count))
    """Transforme to TF matrix"""
    for col in range(docs_count):
        tf_matrix[:,col] /= np.max(tf_matrix.getcol(col))
        
    return tf_matrix, term_indices, indices_term, doc_indices, doc_names
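
As a small worked example of the TF weighting used in `load_data`, the sketch below computes max-normalized term frequencies for a single made-up document in plain Python. `max_normalized_tf` is a hypothetical helper, and the guard for an empty document is an addition of this sketch.

```python
from collections import Counter

def max_normalized_tf(tokens):
    """TF(t, d) = count(t, d) / max over t' of count(t', d)."""
    counts = Counter(tokens)
    if not counts:
        return {}  # an empty document has no term frequencies
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

# Hypothetical document after pre-processing.
tf = max_normalized_tf(['markov', 'chain', 'markov', 'probability', 'markov', 'chain'])
print(tf['markov'])  # 1.0
```

The most frequent term of a document always gets a TF of 1, so long and short documents are put on a comparable scale.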

In [25]:
"""Inverse Docement Frequency IDF(t,D): log(# documents/ # documents contain term t)"""
"""TF_IDF = TF(t,d)*IDF(t,D)"""
def tf_idf(tf_matrix, doc_indices, term_indices):
    nbDocs = len(doc_indices)
    nbTerms= len(term_indices)
    tfidf_matrix = tf_matrix.copy()
    for i in range(nbTerms):
        tfidf_matrix[i,:] *= math.log(nbDocs/tfidf_matrix.getrow(i).nnz)
    return tfidf_matrix
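
Written out in $\LaTeX$, the weighting schemes implemented by `load_data` and `tf_idf` are as follows. Here $f_{t,d}$ is a symbol introduced for this note: it denotes the number of occurrences of term $t$ in document $d$. The logarithm is the natural logarithm, as computed by `math.log`.

$$
\mathrm{TF}(t,d) = \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}, \qquad
\mathrm{IDF}(t,D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}, \qquad
\mathrm{TFIDF}(t,d,D) = \mathrm{TF}(t,d)\cdot\mathrm{IDF}(t,D)
$$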

### Store result for the rest of the lab

In [26]:
def save_sparse_csr(filename,array):
    np.savez(filename,data = array.data ,indices=array.indices,
             indptr =array.indptr, shape=array.shape )
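
To read a stored matrix back later in the lab, a loader along the following lines should work. This is a sketch: `load_sparse_csr` is a hypothetical name that does not appear in the notebook, and `save_sparse_csr` is repeated only to make the round trip self-contained.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # Same fields as in the cell above.
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    """Rebuild a CSR matrix from an archive written by save_sparse_csr."""
    # np.savez appends '.npz' when the name lacks it, so pass the full file name here.
    with np.load(filename) as loader:
        return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                          shape=tuple(loader['shape']))

# Round trip on a small hypothetical matrix.
original = csr_matrix(np.array([[1.0, 0.0], [0.0, 0.5], [0.25, 0.0]]))
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'tf_matrix')
    save_sparse_csr(path, original)
    restored = load_sparse_csr(path + '.npz')

print((original != restored).nnz == 0)  # True
```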

In [27]:
def store_result():
    """Store result"""
    tf_matrix, term_indices, indices_term, doc_indices, doc_names = load_data('data/courses.txt')
    tfidf_matrix = tf_idf(tf_matrix, doc_indices, term_indices)
    save_sparse_csr("tf_matrix", tf_matrix)
    save_sparse_csr("tfidf_matrix", tfidf_matrix)
    save_json([term_indices], 'term_indices.txt')
    save_json([indices_term], 'indices_term.txt')
    save_json([doc_indices], 'doc_indices.txt')
    save_json([doc_names], 'doc_names.txt')
store_result()

### The 15 terms of the IX class with the highest TF-IDF scores

In [28]:
def top_highest_score(class_name, n):
    tf_matrix, term_indices, indices_term, doc_indices, doc_names = load_data('data/courses.txt')
    tfidf_matrix = tf_idf(tf_matrix, doc_indices, term_indices)
    
    cols = tfidf_matrix.getcol(doc_indices[class_name])
    data = cols.data
    row_indices = cols.toarray().nonzero()[0]
    #col_indices = cols.toarray().nonzero()[1] # Not interested in
    
    indices = data.argsort()[::-1][:n]
    top_values = data[indices]
    top_indices = row_indices[indices]
    
    print("Top {k} (descending) terms of {c} class: ".format(k=n, c=class_name))
    for i in top_indices:
        print (indices_term[i])
    
top_highest_score("COM-308", 15)

Top 15 (descending) terms of COM-308 class: 
mining
service
online
social
explore
world
hadoop
real
recommender
auction
commerce
retrieval
datasets
internet
analytics


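Difference between the large and the small scores: a term such as 'mining' has a high frequency in this document and appears in only a few documents of the corpus, so both its TF and its IDF are large. Such a term is an important keyword of the document. Terms with small scores are either rare in this document or common across the corpus.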
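
The interaction between the two factors can be illustrated with a few made-up numbers. In the sketch below the corpus size, the TF values and the document frequencies are all hypothetical; they are not taken from the course data.

```python
import math

n_docs = 100  # hypothetical corpus size

# Hypothetical (TF in this document, number of documents containing the term).
terms = {
    'mining': (1.0, 4),   # frequent here, rare elsewhere
    'online': (0.5, 10),  # less frequent here, fairly rare elsewhere
    'world':  (1.0, 80),  # frequent here, but common everywhere
}

scores = {t: tf * math.log(n_docs / df) for t, (tf, df) in terms.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['mining', 'online', 'world']
```

With these numbers 'world' ends up last even though it is as frequent in the document as 'mining', because its IDF is close to zero.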

## Exercise 4.3: Document similarity search

In [29]:
"""https://en.wikipedia.org/wiki/Okapi_BM25"""

def bm25_score(tf_matrix, row, col, avg_doc_lenght, nbDocs, doc_length, K, B):
    tf  = tf_matrix[row, col]
    """term_length : # of doc contain 'term'"""
    term_length = tf_matrix.getrow(row).nnz
    idf = math.log((nbDocs - term_length + 0.5)/(term_length + 0.5))
    return idf * (tf * (K + 1)) / (tf + K*(1 - B + (B * doc_length / avg_doc_lenght)))
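
The formula in `bm25_score` can be checked on plain numbers without any matrix. The sketch below uses a hypothetical helper, `bm25_term_score`, with made-up inputs. It also shows that this IDF variant becomes negative once a term occurs in more than half of the documents.

```python
import math

def bm25_term_score(tf, doc_freq, n_docs, doc_length, avg_doc_length, k=1.5, b=0.75):
    """Okapi BM25 contribution of one query term to one document."""
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * doc_length / avg_doc_length
    return idf * tf * (k + 1) / (tf + k * length_norm)

# Hypothetical numbers: 100 documents, and a document of exactly average length.
rare = bm25_term_score(tf=1.0, doc_freq=3, n_docs=100, doc_length=50, avg_doc_length=50)
common = bm25_term_score(tf=1.0, doc_freq=60, n_docs=100, doc_length=50, avg_doc_length=50)
print(rare > 0, common < 0)  # True True
```

For a document of average length the length normalization equals 1, so with `tf = 1.0` the score reduces to the IDF itself.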
    

In [30]:
def top_bm25_similarity(query, n):
    K = 1.5
    B = .75
    """a dict for storing result. E.g.: scores[CS-308] = 14"""
    scores = {}
    tf_matrix, term_indices, indices_term, doc_indices, doc_names = load_data('data/courses.txt')
        
    """Number of documents"""
    nbDocs = len(doc_indices)
    
    """Average document length"""
    avg_doc_lenght = 0.0
    for doc in doc_indices.keys():
        avg_doc_lenght += tf_matrix.getcol(doc_indices[doc]).nnz
    avg_doc_lenght /= nbDocs

    """Compute score (query, doc)"""
    processed = preprocessing(query)
    for doc in doc_indices.keys():
        col = doc_indices[doc]
        """Length of document 'doc'"""
        doc_length = tf_matrix.getcol(col).nnz
        score = 0.0
        for term in processed:
            row = term_indices.get(term)
            # Skip query terms that are not in the vocabulary
            if row is not None:
                score += bm25_score(tf_matrix, row, col, avg_doc_lenght, nbDocs, doc_length, K, B)
        scores[doc] = score
        
    """Sort dict by value/score and take n"""
    sorted_scores = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)[:n]
    
    print("Top {k} classes (descending) for your query '{q}'".format(k=n,q=query))
    for (k,v) in sorted_scores:
        print("{key}--{c} with score {s}".format(key=k, c=doc_names[k],s=v))


In [31]:
top_bm25_similarity("markov chains", 5)

Top 5 classes (descending) for your query 'markov chains'
MATH-332--Applied stochastic processes with score 7.438899485691768
COM-516--Markov chains and algorithmic applications with score 5.528111584837584
MGT-484--Applied probability & stochastic processes with score 5.268466618583339
EE-605--Statistical Sequence Processing with score 3.9862647912865166
MGT-528--Operations: economics & strategy with score 3.2625344882425895


In [32]:
top_bm25_similarity("facebook", 5)

Top 5 classes (descending) for your query 'facebook'
EE-727--Computational Social Media with score 1.223114375979703
MGT-641(b)--Technology and Public Policy - (b) Technology, policy and regulation with score 0.0
MSE-231--Ceramics, structures and properties   TP with score 0.0
PHYS-458--Metrology I with score 0.0
FIN-504--Credit risk with score 0.0


Explanation of the results:

For 'markov chains': the result is plausible; all returned documents contain 'markov', 'chains', or both.

For 'facebook': the only document that contains the term 'facebook' is EE-727, which is ranked first. The other four results have a score of 0.0 and are not actual matches. This exposes a limitation: the algorithm cannot identify documents that are related to 'facebook' unless they contain the term itself.
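
One way to make the ranking more honest is to drop documents whose score is exactly zero before taking the top n. The sketch below applies this to the scores printed for 'facebook'; `top_matches` is a hypothetical helper and not part of the notebook.

```python
# Scores copied from the 'facebook' output above.
scores = {
    'EE-727': 1.223114375979703,
    'MGT-641(b)': 0.0,
    'MSE-231': 0.0,
    'PHYS-458': 0.0,
    'FIN-504': 0.0,
}

def top_matches(scores, n):
    """Keep only documents with a non-zero score, best first."""
    # A score of exactly 0.0 means that no query term occurs in the document.
    matched = [(doc, s) for doc, s in scores.items() if s != 0.0]
    return sorted(matched, key=lambda pair: pair[1], reverse=True)[:n]

print(top_matches(scores, 5))  # [('EE-727', 1.223114375979703)]
```

This does not solve the underlying problem, since related documents that lack the query term are still not found, but it avoids presenting non-matching documents as results.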
