# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *J*

**Names:**

* *Ann-Kristin Bergmann*
* *Nephele Aesopou*
* *Ewa Miazga*

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl, save_json

import nltk
nltk.download("punkt")
nltk.download('wordnet')
from nltk.tokenize import word_tokenize

import re
import string
import math

from nltk.stem import PorterStemmer
ps = PorterStemmer()

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

from collections import defaultdict

from scipy import linalg



[nltk_data] Downloading package punkt to /home/aesopou/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/aesopou/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

## Exercise 4.1: Pre-processing

In [3]:
courses[0]['description']

"The latest developments in processing and the novel generations of organic composites are discussed. Nanocomposites, adaptive composites and biocomposites are presented. Product development, cost analysis and study of new markets are practiced in team work. Content Basics of composite materialsConstituentsProcessing of compositesDesign of composite structures\xa0Current developmentNanocomposites Textile compositesBiocompositesAdaptive composites\xa0ApplicationsDriving forces and marketsCost analysisAerospaceAutomotiveSport Keywords Composites - Applications - Nanocomposites - Biocomposites - Adaptive composites - Design - Cost Learning Prerequisites Required courses Notion of polymers Recommended courses Polymer Composites Learning Outcomes By the end of the course, the student must be able to: Propose suitable design, production and performance criteria for the production of a composite partApply the basic equations for process and mechanical properties modelling for composite materi

In [4]:
num_courses = len(courses)
print("Number of courses in our dataset =", num_courses)

Number of courses in our dataset = 854


In [5]:
# Keep track of courseId and the course's index

# (index, courseId)
courseId_index = []
for i, course in enumerate(courses):
    courseId_index.append((i, course['courseId']))
# get an example
print(courseId_index[0])

(0, 'MSE-440')


In [6]:
def pre_processing(text):
    # Pre-process the description of every course in place
    for course in text:

        # Tokenize
        tokens = word_tokenize(course['description'], language='english')

        # Lowercase
        tokens = [w.lower() for w in tokens]

        # Remove stopwords
        tokens = [w for w in tokens if w not in stopwords]

        # Lemmatization on alphabetic tokens only (drops numbers and URLs)
        tokens = [lemmatizer.lemmatize(w) for w in tokens if w.isalpha()]

        # Stemming (tried as an alternative; lemmatization was kept instead)
        # tokens = [ps.stem(w) for w in tokens if w.isalpha()]

        # Remove punctuation
        tokens = [t for t in tokens if t not in string.punctuation]

        course['description'] = tokens


In [7]:
pre_processing(courses)

In [8]:
# Also remove very short tokens, because words with two letters or fewer are rarely useful
for course in courses:
    course['description'] = [item for item in course['description'] if len(item) > 2]    

In [9]:
# Function that takes the courses and returns, for each course, the list of n-grams of its tokens.
def n_grams(text, n):
    # Local list, so that repeated calls do not accumulate results from earlier calls
    ngrams = []
    for course in text:
        ngrams_curr = zip(*[course['description'][j:] for j in range(n)])
        ngrams.append([" ".join(ngram) for ngram in ngrams_curr])
    return ngrams

In [10]:
bigrams = n_grams(courses, 2)

In [11]:
# Add bigrams to the vocabulary
for i in range(num_courses):
    courses[i]['description'] += bigrams[i]

In [12]:
# This function creates a word_corpus by adding the words from all documents to one list.
# Word_corpus also contains duplicates because we need them to count the occurrences of words.
def make_word_corpus(text):
    word_corpus = []
    # Iterate over the argument (not the global courses list)
    for course in text:
        for token in course['description']:
            word_corpus.append(token)
    # Sort the words alphabetically
    word_corpus.sort()
    
    return word_corpus

In [13]:
# Name it as word_corpus
word_corpus = make_word_corpus(courses)

In [14]:
print(f"Word_corpus contains {len(word_corpus)} words.")
print(f"There are {len(np.unique(word_corpus))} unique words in our corpus.")

Word_corpus contains 258378 words.
There are 90319 unique words in our corpus.


**1. Explain which ones you implemented and why.**

First, we tokenize each course description, i.e. we split the text into words. We then lowercase all words so that the same word is not counted twice under different capitalizations, and we remove the stopwords using the provided stopwords file. Stopwords are removed because they carry no information about the content of a course.

We then tried both lemmatization and stemming and decided that lemmatization works better for this corpus. Stemming reduces a word to its "root" by stripping suffixes, whereas lemmatization takes the word's meaning into account. Words that share a root do not necessarily share a meaning, and in the context of course descriptions this distinction matters.

Finally, we remove punctuation using `string.punctuation`. We also remove words with two letters or fewer because they mostly carry no meaning. We add bigrams to our bag of words because many course terms consist of two words (for example "data mining"). We do not add 3-grams or longer n-grams because very few course terms have that form, and they would make our word corpus much larger.
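
As an illustration of the n-gram step described above, here is a minimal sketch on a toy token list. The helper name `make_ngrams` and the example tokens are hypothetical and only serve to show how shifted copies of a list can be zipped into n-grams.

```python
def make_ngrams(tokens, n):
    """Return the n-grams of a token list, each joined with a single space."""
    # zip over n shifted copies of the list; zip stops at the shortest copy,
    # so a list with fewer than n tokens yields no n-grams
    return [" ".join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]


# Toy example (hypothetical tokens, not taken from the dataset)
tokens = ["markov", "chain", "monte", "carlo"]
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Each extra value of `n` adds almost as many new tokens as the description already has, which is one reason to stop at bigrams.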

In 4.2, we also remove low-frequency words. We set the threshold at 2 occurrences or fewer in the whole corpus: some tokens may appear in only one course, so the cutoff cannot be too high. This threshold can easily be changed.
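
The low-frequency filter can be sketched on a toy corpus as follows. This is only an illustration under simplifying assumptions: the function name `drop_rare_terms`, its `min_count` parameter and the sample documents are hypothetical, and the sketch returns new lists instead of modifying the courses in place.

```python
from collections import Counter


def drop_rare_terms(docs, min_count):
    """Keep only the terms whose corpus-wide count is at least min_count."""
    # Count every occurrence of every term over the whole corpus
    counts = Counter(term for doc in docs for term in doc)
    # A set gives constant-time membership tests during filtering
    frequent = {term for term, count in counts.items() if count >= min_count}
    return [[term for term in doc if term in frequent] for doc in docs]


# Toy corpus (hypothetical): "spark" occurs once, the other terms three times
docs = [
    ["graph", "markov", "graph"],
    ["markov", "spark"],
    ["graph", "markov"],
]
filtered = drop_rare_terms(docs, min_count=3)
print(filtered)
```

With `min_count=3`, terms that occur 2 times or fewer are dropped, which matches the threshold described above.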

In [15]:
# Function that returns the index of a course in the courses list, if given the courseId.
def find_index(courseId):
    index = None
    for i in range(num_courses):
        if courseId_index[i][1] == courseId:
            index = courseId_index[i][0]
            break

    return index

**2. Print the terms in the pre-processed description of the IX class in alphabetical order.**

In [16]:
COM_308_tokens = courses[find_index('COM-308')]['description']
COM_308_tokens.sort()
print("Words tokens for course COM-308:\n")
print(COM_308_tokens)

Words tokens for course COM-308:

['acquired', 'acquired lecture', 'activity', 'activity lecture', 'advertisement', 'advertisement class', 'algebra', 'algebra', 'algebra algorithm', 'algebra markov', 'algorithm', 'algorithm', 'algorithm data', 'algorithm statistic', 'analysis', 'analysis user', 'analytics', 'analytics', 'analytics application', 'analytics collection', 'apache', 'apache spark', 'application', 'application', 'application inspired', 'application social', 'assessment', 'assessment method', 'auction', 'auction', 'auction learning', 'auction provide', 'balance', 'balance foundational', 'based', 'based', 'based hadoop', 'based number', 'basic', 'basic', 'basic', 'basic linear', 'basic material', 'basic model', 'cathedra', 'cathedra homework', 'chain', 'chain java', 'class', 'class', 'class', 'class explores', 'class lab', 'class seek', 'cloud', 'cloud service', 'clustering', 'clustering', 'clustering community', 'clustering community', 'collection', 'collection modeling', 'co

## Exercise 4.2: Term-document matrix

In [17]:
# Function gets a word corpus and an empty dictionary and creates a (key, value) = (word, # of occurrences). 
def get_word_count(corpus, dictionary):
    for word in corpus:
        dictionary[word] += 1
#   return dictionary

In [18]:
# We create a dict where each potential key is an int
word_corpus_count = defaultdict(int)

# Build a dictionary called word_corpus_count
get_word_count(word_corpus, word_corpus_count)

In [19]:
print(len(word_corpus_count)) # This number agrees with our unique word corpus

90319


In [20]:
# Remove low-frequency words; n is the minimum corpus frequency a word needs to be kept
def remove_low_freq(corpus, n):
    # Find the words with frequency in the corpus less than n (a set gives fast membership tests)
    low_freq = {key for key, value in corpus.items() if value < n}

    # Remove them from the description of the courses they belong to
    for course in courses:
        course['description'] = [word for word in course['description'] if word not in low_freq]
    
    # Note: the count dictionary passed in is left unchanged.
    # The word corpus is rebuilt from the filtered descriptions with make_word_corpus.

In [21]:
remove_low_freq(word_corpus_count, 3)

In [22]:
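# Note: word_corpus has not been rebuilt yet at this point, so the count below is still
# the one from before the removal. The filtered corpus is rebuilt in the next cell.
print("When we remove low frequency terms we have:")
print(f"There are {len(np.unique(word_corpus))} unique words in our corpus.")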

When we remove low frequency terms we have:
There are 90319 unique words in our corpus.


In [23]:
word_corpus = make_word_corpus(courses)

In [24]:
# To access each document separately, make a list of dictionaries with (key, value) = (word in doc, frequency in doc)
bag_docs_words = []
for j in range(num_courses):
    diction = defaultdict(int)
    for word in courses[j]['description']:
        diction[word] += 1
        
    bag_docs_words.append(diction)

In [25]:
# Create a list containing the set of distinct tokens of each course
course_tokens = [set(course.keys()) for course in bag_docs_words]


In [26]:
# First find the max frequency among the words in each document
max_word_in_doc = {i : max(dic.values()) for i,dic in enumerate(bag_docs_words)}

In [27]:
# Calculate the inverse document frequency (IDF): it diminishes the weight of terms that occur in many documents
# of the document set and increases the weight of terms that occur in few.
# Iterate over the unique words only, since word_corpus contains duplicates.
idf = {w: -math.log2(sum(1 for course in course_tokens if w in course) / num_courses) for w in set(word_corpus)}

TF_IDF matrix

In [28]:
# Build a unique word corpus
unique_word_corpus = list(np.unique(np.array(word_corpus)))

In [29]:
# Matrix_values is a list containing all the TF_IDF values that we are going to insert in the matrix
matrix_values = []

# Val_position is a list containing tuples of (word index, document index) for each of the above values.
val_position = []

n = 0
for course in bag_docs_words:
    
    # Get the word indexes in unique_word_corpus of each word in the document 
    word_pos = [unique_word_corpus.index(key) for key in list(course.keys())]
    words = len(word_pos) # num of words in the current document
    
    # Get the IDF value for each of the words
    curr_idf = [idf[word] for word in list(course.keys())]
    
    # Calculate TF for this document
    TF = [val/ max_word_in_doc[n] for val in list(course.values())]
    
    # Multiply TF * IDF
    values = [x*y for x,y in zip(TF, curr_idf)]
    
    matrix_values.extend(values)
    
    for j in range(words):
        val_position.append((word_pos[j], n))
        
    n+=1
    
# Change into an array
positions = np.array(val_position)

In [30]:
# Build TFIDF matrix

# Define the size of the matrix
m = len(unique_word_corpus)
n = num_courses

# Set the values = TFIDF
values = matrix_values

# Define their position in the matrix
rows = positions[:,0]
columns = positions[:,1]

X = csr_matrix((matrix_values, (rows, columns)), shape=(m, n))
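
The position lists above are built with `list.index`, which scans the vocabulary for every lookup. As a hedged alternative, the sketch below collects the same kind of (row, column, value) triplets with a dictionary lookup instead. The names `build_triplets`, `row_of` and the toy weights are hypothetical; the resulting lists have the form that `csr_matrix((values, (rows, cols)), shape=...)` accepts.

```python
def build_triplets(doc_weights, vocabulary):
    """Return parallel lists (rows, cols, values) for a term-document matrix."""
    # Map each term to its row index once, instead of calling list.index repeatedly
    row_of = {term: i for i, term in enumerate(vocabulary)}
    rows, cols, values = [], [], []
    for col, weights in enumerate(doc_weights):
        for term, weight in weights.items():
            rows.append(row_of[term])
            cols.append(col)
            values.append(weight)
    return rows, cols, values


# Toy input (hypothetical): one {term: weight} dictionary per document
vocabulary = ["graph", "markov", "spark"]
doc_weights = [{"graph": 0.5, "markov": 1.0}, {"spark": 2.0}]
rows, cols, values = build_triplets(doc_weights, vocabulary)
print(rows, cols, values)
```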

In [31]:
X.shape

(10820, 854)

In [32]:
from scipy.sparse import save_npz, load_npz
save_npz('X_matrix.npz', X)

**1. Print the 15 terms in the description of the IX class with the highest TF-IDF scores.**


In [33]:
IX_class = find_index('COM-308')

In [34]:
# Accessing column of the given class
column_IX = X[:, IX_class]  # Remember, index starts from 0


# Convert column to an array and get the values
values = column_IX.toarray().flatten()
# Getting the indices of the highest 15 values (- makes the sorting in descending order)
top_15_indices = np.argsort(-values)[:15] 

# Getting the words
top_15_words = [unique_word_corpus[i] for i in top_15_indices]
print(top_15_words)

['social networking', 'online', 'social', 'data mining', 'explore', 'mining', 'networking', 'community detection', 'hadoop', 'recommender', 'recommender system', 'service', 'auction', 'datasets', 'internet']


**2. Explain where the difference between the large scores and the small ones comes from.**

A large score in the TF-IDF matrix means that the term occurs frequently in the specific document (high TF) while appearing in few of the other documents (high IDF).

A small score means that the term is either rare in the current document or common to many documents, so it is not specific to the current one.
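
To make this concrete, here is a small worked sketch that uses the same weighting as the notebook: the term frequency is the count divided by the count of the most frequent term in the document, and $\mathrm{idf}(w) = -\log_2(\mathrm{df}(w)/N)$, where $\mathrm{df}(w)$ is the number of documents containing $w$ and $N$ is the number of documents. The function name `tf_idf` and the toy documents are hypothetical, and the sketch assumes that no document is empty.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf score} dictionary per document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: -math.log2(df[term] / n_docs) for term in df}
    scores = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values())  # assumes the document is not empty
        scores.append({t: (c / max_count) * idf[t] for t, c in counts.items()})
    return scores


# Toy corpus (hypothetical): "course" is in every document, "hadoop" in one only
docs = [
    ["course", "hadoop", "hadoop"],
    ["course", "algebra"],
    ["course", "algebra", "matrix"],
    ["course", "matrix"],
]
scores = tf_idf(docs)
print(scores[0])
```

In the first document, "hadoop" is both the most frequent term and present in only one of the four documents, so it gets the largest score, while "course" appears everywhere and its score collapses to zero.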

## Exercise 4.3: Document similarity search

**1. Display the top five courses together with their similarity score for each query.**

In [35]:
# Calculate similarity of two documents
def similarity(doc1, doc2):
    value = (doc1.T @ doc2) / (linalg.norm(doc1.toarray()) * linalg.norm(doc2.toarray()))
    return value[0,0]

In [36]:
# markov chain
index_markov = unique_word_corpus.index('markov chains')
print("index of markov chain bigram:",index_markov)

print(X[index_markov])

# Convert to an array and perform argsort. [::-1] reverses the order of the indices obtained from np.argsort() --> descending
# [:5] selects the first 5 indices from the reversed order.
top5_markov = np.argsort(X[index_markov].toarray()[0])[::-1][:5]

index of markov chain bigram: 5649
  (0, 43)	1.1551228895938142
  (0, 80)	2.7722949350251547
  (0, 245)	4.620491558375257
  (0, 398)	3.465368668781443
  (0, 412)	1.1551228895938142
  (0, 417)	1.2601340613750702
  (0, 555)	0.7700819263958761


In [37]:
print("Markov Chain:\n" )

for i in range(5):
    doc1 = X[:,top5_markov[i]] 
    course1 = courses[top5_markov[i]]['name']
    
    for j in range(i+1,5):
        doc2 = X[:,top5_markov[j]] 
        course2 = courses[top5_markov[j]]['name']
        
        similarity_score = similarity(doc1,doc2)
        print(f'"{course1}" & "{course2}": score =  {similarity_score}.')

Markov Chain:

"Markov chains and algorithmic applications" & "Applied probability & stochastic processes": score =  0.29414406139161475.
"Markov chains and algorithmic applications" & "Applied stochastic processes": score =  0.3197696157233014.
"Markov chains and algorithmic applications" & "Stochastic calculus I": score =  0.19203475147034796.
"Markov chains and algorithmic applications" & "Internet analytics": score =  0.10838044263157404.
"Applied probability & stochastic processes" & "Applied stochastic processes": score =  0.3140109171696444.
"Applied probability & stochastic processes" & "Stochastic calculus I": score =  0.15509394061628803.
"Applied probability & stochastic processes" & "Internet analytics": score =  0.07454353919365653.
"Applied stochastic processes" & "Stochastic calculus I": score =  0.17457143406429212.
"Applied stochastic processes" & "Internet analytics": score =  0.07449246262450858.
"Stochastic calculus I" & "Internet analytics": score =  0.032650351312

In [38]:
index_facebook = unique_word_corpus.index('facebook')
print("index of facebook word:", index_facebook)
print(X[index_facebook])

# Facebook appears in only one course
# Find its similarity score with the Markov chain courses

# Get the column (= course) of the X matrix where the facebook word appears.
# The row has a single non-zero entry, so we take the first (and only) column index.
course_facebook = int(X[index_facebook].indices[0])

index of facebook word: 3604
  (0, 798)	1.5375935146769195


In [39]:
print("Facebook & Markov chain:\n" )

for i in range(5):
    doc1 = X[:,top5_markov[i]] 
    course1 = courses[top5_markov[i]]['name']
    
    doc2 = X[:, course_facebook] 
    course2 = courses[course_facebook]['name']
        
    similarity_score = similarity(doc1,doc2)
    print(f'"{course1}" & "{course2}": score =  {similarity_score}.')

Facebook & Markov chain:

"Markov chains and algorithmic applications" & "Computational Social Media": score =  0.030149529921798212.
"Applied probability & stochastic processes" & "Computational Social Media": score =  0.01170559340863037.
"Applied stochastic processes" & "Computational Social Media": score =  0.014758550536404303.
"Stochastic calculus I" & "Computational Social Media": score =  0.004213235867044565.
"Internet analytics" & "Computational Social Media": score =  0.19592516704149104.


**What do you think of the results? Give your intuition on what is happening**

In general, the similarity scores among the "markov chains" courses are higher than the scores between those courses and the Facebook course. The course that contains the bigram "markov chains" in its title has some of the highest scores with the other courses. However, some scores are very low, i.e. below 0.09. Hence, even though these are the top 5 courses for "markov chains", some of them share few of their other terms and therefore get a low similarity.

The Facebook course, Computational Social Media, has its highest score (about 0.196) with the course Internet analytics.
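
For comparison, a query-based search can be sketched by treating the query itself as a small document vector and ranking every course by its cosine similarity with that vector. The code below is an illustration on toy data, not the method used in the cells above: the names `cosine` and `rank`, the document labels and the weights are all hypothetical, and sparse vectors are stored as plain `{term: weight}` dictionaries.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


def rank(query, docs, k):
    """Return the k (name, score) pairs with the highest similarity to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy documents and query (hypothetical weights)
docs = {
    "A": {"markov": 3.0, "chain": 4.0},
    "B": {"markov": 1.0, "facebook": 1.0},
    "C": {"facebook": 2.0},
}
query = {"markov": 1.0, "chain": 1.0}
top = rank(query, docs, k=2)
print(top)
```

Because the cosine normalizes by vector length, a document is rewarded for matching the query terms in the right proportions rather than for simply being long.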