# Text 4: Word2Vec
**Internet Analytics - Lab 4**

---

**Group:** *J*

**Names:**

* *Ann-Kristin Bergmann*
* *Nephele Aesopou*
* *Ewa Miazga*

---

#### Instructions

*This is a template for part 4 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import re
import numpy as np
from numpy import linalg

from scipy.sparse import csr_matrix
from collections import defaultdict
import json
from utils import *
import gensim
from sklearn.cluster import KMeans


courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')



In [2]:
import string
import nltk
nltk.download("punkt")
nltk.download('wordnet')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /home/aesopou/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/aesopou/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Redo pre-processing

In [3]:
from gensim.models import KeyedVectors

# Specify the path to your vector file
vector_file = '/ix/model.txt'

# Load the vectors using KeyedVectors
model = KeyedVectors.load_word2vec_format(vector_file, binary=False, no_header=False)


In [4]:
# Test the loaded vectors
word = 'apple'

# Check if the word is in the vocabulary
if word in model.key_to_index:
    # Retrieve the vector representation of the word
    vector = model.get_vector(word)
else:
    print(f"'{word}' is not in the vocabulary.")

# Read the dimensionality from the model so it is defined even if the test word is missing
vector_dim = model.vector_size

In [5]:
def pre_processing(text):
    # Clean the 'description' field of every course in place
    for course in text:

        # tokenize
        tokens = word_tokenize(course['description'], language='english')

        # keep only purely alphabetic tokens (this also drops numbers and punctuation)
        tokens = [w for w in tokens if w.isalpha()]

        # remove stopwords
        tokens = [w for w in tokens if w not in stopwords]

        # remove punctuation (already covered by isalpha(), kept as a safeguard)
        course['description'] = [t for t in tokens if t not in string.punctuation]
        


In [6]:
pre_processing(courses)


In [7]:
#You would also need to make a reasonable choice for a ‘default’ vector for those words that
#are in your dataset but are not present in the pre-trained model

# Add n-grams?
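
One possible choice of default vector, sketched here as an assumption rather than as the method used in this notebook (which falls back to a zero vector further down), is the mean of the vectors of the known words, so that unknown words sit at an "average" position instead of at the origin. The dictionary `toy_model` and the helpers `mean_vector` and `lookup` are hypothetical stand-ins for the pre-trained model, and the sketch uses only the standard library so that it runs on its own.

```python
# Hypothetical toy embeddings standing in for the pre-trained model
toy_model = {
    "data": [1.0, 0.0, 2.0],
    "network": [3.0, 2.0, 0.0],
}

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

# Default used for every word that the model does not know
default_vector = mean_vector(list(toy_model.values()))

def lookup(word):
    """Return the word's vector, or the default for out-of-vocabulary words."""
    return toy_model.get(word, default_vector)
```

With real embeddings, the same idea would average the vectors of the corpus words found in the model. Whether this works better than a zero vector for clustering is something we would have to check empirically.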

In [8]:
# Also, remove very short word-strings because there are no useful words with two or fewer letters
for course in courses:
    course['description'] = [item for item in course['description'] if len(item) > 2]   

In [9]:
# This function creates a word_corpus by adding the words from all documents to one list.
# word_corpus also contains duplicates because we need them to count the occurrences of words.
def make_word_corpus(text):
    word_corpus = []
    # Iterate over the argument rather than the global `courses`
    for course in text:
        for token in course['description']:
            word_corpus.append(token)
    # Sort the words alphabetically
    word_corpus.sort()
    
    return word_corpus

In [10]:
# Name it as word_corpus
word_corpus = make_word_corpus(courses)

# Make unique word corpus
unique_word_corpus = list(np.unique(np.array(word_corpus)))

In [11]:
print(f"Word_corpus contains {len(word_corpus)} words.")
print(f"There are {len(unique_word_corpus)} unique words in our corpus.")

Word_corpus contains 133227 words.
There are 17122 unique words in our corpus.


## Exercise 4.12 : Clustering word vectors

In [12]:
def normalize_vec(vector): 
    array = np.array(vector)
    # Calculate the normalization factor
    norm_factor = np.linalg.norm(array)
    # Normalize the array
    normalized_array = array / norm_factor
    # Convert the normalized array back to a list (if needed)
    normalized_data = normalized_array.tolist()
    return normalized_data


In [13]:
# Build matrix word_vec_matrix
num_words = len(unique_word_corpus)
word_vec_matrix = []
word_count = 0

for word in unique_word_corpus:
    # Check if the word is in the vocabulary
    if word in model.key_to_index:
        # Retrieve the vector representation of the word
        vector = model.get_vector(word)
        word_vec_matrix.append(normalize_vec(vector))
        
        word_count += 1
    else:
        # If the word is not in the pre-trained vocabulary, represent it by a zero vector,
        # on the assumption that such words are probably not important
        word_vec_matrix.append([0] * vector_dim)

print(f"We have {word_count} words that belong to the pre-trained model out of {num_words}")
    


We have 12260 words that belong to the pre-trained model out of 17122


In [14]:
kmeans = KMeans(n_clusters=10, random_state=0).fit(np.array(word_vec_matrix))

In [15]:
# To investigate the clusters, we extract the label assigned to each word and the centroid of each cluster
words_in_cluster = []
cluster_labels = list(kmeans.labels_)
cluster_centroids = list(kmeans.cluster_centers_)

In [16]:
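# Returns a list of (word, cosine similarity) tuples for the 10 words closest to the centroid.
# Note: model.most_similar searches the whole pre-trained vocabulary, so word_list is currently
# unused and the returned words need not occur in the course descriptions.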
def get_most_similar(centroid, word_list):
    similar_words = model.most_similar(positive=[centroid])
    return similar_words
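
As written, `get_most_similar` does not use `word_list`, and `model.most_similar` ranks against the whole pre-trained vocabulary, which is presumably why the printed words can include terms that never occur in the course descriptions. A minimal sketch of a variant that ranks only the words of one cluster is given below. The names `cosine_similarity`, `most_similar_in_list` and `toy_vectors` are hypothetical and not part of gensim, and the example sticks to the standard library so that it runs on its own.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar_in_list(centroid, word_list, vectors, topn=10):
    """Rank only the words in word_list by cosine similarity to the centroid."""
    scored = [
        (word, cosine_similarity(centroid, vectors[word]))
        for word in word_list
        if word in vectors  # skip words without a known vector
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical toy vectors standing in for the pre-trained embeddings
toy_vectors = {"matrix": [1.0, 0.0], "vector": [1.0, 1.0], "cell": [0.0, 1.0]}
ranking = most_similar_in_list([1.0, 0.0], ["matrix", "vector", "cell"], toy_vectors, topn=2)
```

In the notebook, `vectors` could be any mapping from word to embedding, and `word_list` would be `cluster_words[cluster_id]`. The zero-norm guard matters here because out-of-vocabulary words are stored as zero vectors.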

In [21]:
# Build cluster_words, a dictionary storing (key, value) = (cluster_id, [words in cluster])
cluster_words = defaultdict(list)

# Assign each word to the cluster given by its k-means label
for i, label in enumerate(cluster_labels):
    cluster_words[label].append(unique_word_corpus[i])


**1. After performing k-means, print the top-10 words for each cluster.**

In [19]:
for cluster_id in cluster_words:
    print("Cluster ID:\n", cluster_id)
    # Pass the centroid and the word list of the same cluster
    top_10 = get_most_similar(cluster_centroids[cluster_id], cluster_words[cluster_id])

    for word, similarity in top_10:
        print(f"Word:{word}\t\t\t Similarity: {similarity}")

Cluster ID:
 4
Word:interfaces			 Similarity: 0.7877712845802307
Word:software			 Similarity: 0.7847545146942139
Word:hardware/software			 Similarity: 0.783154308795929
Word:Simulink			 Similarity: 0.777946412563324
Word:multi-threading			 Similarity: 0.7766555547714233
Word:Exadata			 Similarity: 0.7758769392967224
Word:kernel-mode			 Similarity: 0.7736071348190308
Word:server-based			 Similarity: 0.7725211977958679
Word:implementations			 Similarity: 0.7722378969192505
Word:hardware-based			 Similarity: 0.7716246843338013
Cluster ID:
 2
Word:Miller			 Similarity: 0.6404972076416016
Word:Bennett			 Similarity: 0.6253702044487
Word:Baker			 Similarity: 0.620003342628479
Word:Moore			 Similarity: 0.6169677376747131
Word:McEveety			 Similarity: 0.6096540093421936
Word:Smith			 Similarity: 0.6080101728439331
Word:Robinson			 Similarity: 0.6079325079917908
Word:Thompson			 Similarity: 0.6040717959403992
Word:Walker			 Similarity: 0.6033617258071899
Word:Matthews			 Similarity: 0.6013290882

**2. What are the different types of clusters that you observe ? Give labels to 10 of the
clusters that are representative of the different types that you see.**

To choose k, we simply tried a few values and used our judgement: we wanted k to be neither too large nor too small. In the end, we settled on k=10, as in LSI.
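
A possible way to make this judgement more reproducible, which is not part of the analysis above, is to compare the within-cluster sum of squared distances across candidate values of k and look for the point where it stops dropping sharply. scikit-learn exposes this quantity as the `inertia_` attribute of a fitted `KMeans`. The sketch below only illustrates what is being measured. The function `inertia` and the toy data are hypothetical, and it uses the standard library alone.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

# Hypothetical toy data: four 2-D points split into two clusters
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
centroids = [[0.0, 1.0], [10.0, 1.0]]
score = inertia(points, labels, centroids)
```

This score always decreases as k grows, so it can only support the choice of k and not make it automatically.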

Here are the labels we give to the clusters:

Cluster 0: international terms

Cluster 1: methodologies (notice the capital letters)

Cluster 2: first/last names

Cluster 3: mathematics

Cluster 4: computer software/hardware terms

Cluster 5: biology

Cluster 6: nature/object description

Cluster 7: comprehension

Cluster 8: employment

Cluster 9: chemistry

**3. Compare the clusters to the topics that you obtained for LSI and LDA.**

## Exercise 4.13 : Document similarity search

## Exercise 4.14: Document similarity search with outside terms