# Text 1: Vector space models
**Internet Analytics - Lab 4**

---

**Group:** *X*

**Names:**

* Linqi LIU
* Yifei SONG
* Yuhang YAN
* Ying Xu Dempster TAY

---

#### Instructions

*This is a template for part 1 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

In [1]:
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from utils import load_json, load_pkl
import string
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

courses = load_json('data/courses.txt')
stopwords = load_pkl('data/stopwords.pkl')

## Exercise 4.1: Pre-processing

In [2]:
# Number of courses

len(courses)

854

In [3]:
# First course description

print(courses[0]['description'])

The latest developments in processing and the novel generations of organic composites are discussed. Nanocomposites, adaptive composites and biocomposites are presented. Product development, cost analysis and study of new markets are practiced in team work. Content Basics of composite materialsConstituentsProcessing of compositesDesign of composite structures Current developmentNanocomposites Textile compositesBiocompositesAdaptive composites ApplicationsDriving forces and marketsCost analysisAerospaceAutomotiveSport Keywords Composites - Applications - Nanocomposites - Biocomposites - Adaptive composites - Design - Cost Learning Prerequisites Required courses Notion of polymers Recommended courses Polymer Composites Learning Outcomes By the end of the course, the student must be able to: Propose suitable design, production and performance criteria for the production of a composite partApply the basic equations for process and mechanical properties modelling for composite materialsDisc

In [4]:
# Split connected words (e.g. materialsConstituentsProcessing) while keeping acronyms such as IT intact

for course in courses:
	words = []
	current_word = ''

	for word in course['description'].split():
		for char in word:
			# If char is uppercase, it is the start of a new word
			if char.isupper():
				# Don't add word yet if it is all uppercase (acronym)
				if current_word.isupper() and len(current_word) > 0:
					current_word += char
					continue
				# Add current word
				words.append(current_word)
				current_word = char
			else:
				current_word += char

		words.append(current_word)
		current_word = ''

	course['description'] = ' '.join(words)

print(courses[0]['description'])

 The latest developments in processing and the novel generations of organic composites are discussed.  Nanocomposites, adaptive composites and biocomposites are presented.  Product development, cost analysis and study of new markets are practiced in team work.  Content  Basics of composite materials Constituents Processing of composites Design of composite structures  Current development Nanocomposites  Textile composites Biocomposites Adaptive composites  Applications Driving forces and markets Cost analysis Aerospace Automotive Sport  Keywords  Composites -  Applications -  Nanocomposites -  Biocomposites -  Adaptive composites -  Design -  Cost  Learning  Prerequisites  Required courses  Notion of polymers  Recommended courses  Polymer  Composites  Learning  Outcomes  By the end of the course, the student must be able to:  Propose suitable design, production and performance criteria for the production of a composite part Apply the basic equations for process and mechanical propertie

In [5]:
# Remove punctuation

for course in courses:
	course['description'] = ''.join([char for char in course['description'] if char not in string.punctuation])

print(courses[0]['description'])

 The latest developments in processing and the novel generations of organic composites are discussed  Nanocomposites adaptive composites and biocomposites are presented  Product development cost analysis and study of new markets are practiced in team work  Content  Basics of composite materials Constituents Processing of composites Design of composite structures  Current development Nanocomposites  Textile composites Biocomposites Adaptive composites  Applications Driving forces and markets Cost analysis Aerospace Automotive Sport  Keywords  Composites   Applications   Nanocomposites   Biocomposites   Adaptive composites   Design   Cost  Learning  Prerequisites  Required courses  Notion of polymers  Recommended courses  Polymer  Composites  Learning  Outcomes  By the end of the course the student must be able to  Propose suitable design production and performance criteria for the production of a composite part Apply the basic equations for process and mechanical properties modelling fo

In [6]:
# Remove stopwords and capitalized versions

stopwords_capitalized = [s.capitalize() for s in stopwords]

for course in courses:
	course['description'] = ' '.join([w for w in course['description'].split() if w not in stopwords and w not in stopwords_capitalized])

print(courses[0]['description'])

latest developments processing generations organic composites discussed Nanocomposites adaptive composites biocomposites presented Product development cost analysis study markets practiced team work Content Basics composite materials Constituents Processing composites Design composite structures Current development Nanocomposites Textile composites Biocomposites Adaptive composites Applications Driving forces markets Cost analysis Aerospace Automotive Sport Keywords Composites Applications Nanocomposites Biocomposites Adaptive composites Design Cost Learning Prerequisites Required courses Notion polymers Recommended courses Polymer Composites Learning Outcomes end student Propose suitable design production performance criteria production composite part Apply basic equations process mechanical properties modelling composite materials Discuss main types composite applications Transversal skills work methodology task general domain specific IT resources tools Communicate effectively profe

In [7]:
# Check word frequency

freq_dict = {}

for course in courses:
  for word in course['description'].split():
    if word in freq_dict:
      freq_dict[word] += 1
    else:
      freq_dict[word] = 1

len(freq_dict)

18796

In [8]:
frequent_sorted = sorted(freq_dict.items(), key=lambda x: x[1], reverse=True)
infrequent_sorted = sorted(freq_dict.items(), key=lambda x: x[1])
print(frequent_sorted[:10])
print(infrequent_sorted[:10])

[('methods', 1592), ('Learning', 1239), ('student', 1177), ('Content', 835), ('courses', 757), ('systems', 697), ('end', 661), ('students', 655), ('design', 603), ('Outcomes', 600)]
[('biocomposites', 1), ('Constituents', 1), ('Textile', 1), ('Automotive', 1), ('Sport', 1), ('photograph', 1), ('blot', 1), ('extracted', 1), ('reproducibility', 1), ('pipelines', 1)]


In [9]:
# Choose infrequent words that only appear once

infrequent_sorted = [w for w in infrequent_sorted if w[1] == 1]
len(infrequent_sorted)

9462

In [10]:
# Make lists of frequent and infrequent words

frequent_list = [w[0] for w in frequent_sorted[:10]]
infrequent_list = [w[0] for w in infrequent_sorted]

In [11]:
# Remove frequent and infrequent words

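# Use a set so that each membership test is O(1) instead of scanning a list
words_to_remove = set(frequent_list) | set(infrequent_list)

for course in courses:
	course['description'] = ' '.join([w for w in course['description'].split() if w not in words_to_remove])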

print(courses[0]['description'])

latest developments processing generations organic composites discussed Nanocomposites adaptive composites presented Product development cost analysis study markets practiced team work Basics composite materials Processing composites Design composite structures Current development Nanocomposites composites Biocomposites Adaptive composites Applications Driving forces markets Cost analysis Aerospace Keywords Composites Applications Nanocomposites Biocomposites Adaptive composites Design Cost Prerequisites Required Notion polymers Recommended Polymer Composites Propose suitable production performance criteria production composite part Apply basic equations process mechanical properties modelling composite materials Discuss main types composite applications Transversal skills work methodology task general domain specific IT resources tools Communicate effectively professionals disciplines Evaluate performance team receive respond appropriately feedback Teaching cathedra invited speakers G

In [12]:
# Stem the words

stemmer = PorterStemmer()

for course in courses:
	stemmed_words = [stemmer.stem(word) for word in course['description'].split()]
	course['description'] = ' '.join(stemmed_words)

print(courses[0]['description'])


latest develop process gener organ composit discuss nanocomposit adapt composit present product develop cost analysi studi market practic team work basic composit materi process composit design composit structur current develop nanocomposit composit biocomposit adapt composit applic drive forc market cost analysi aerospac keyword composit applic nanocomposit biocomposit adapt composit design cost prerequisit requir notion polym recommend polym composit propos suitabl product perform criteria product composit part appli basic equat process mechan properti model composit materi discuss main type composit applic transvers skill work methodolog task gener domain specif IT resourc tool commun effect profession disciplin evalu perform team receiv respond appropri feedback teach cathedra invit speaker group session exercis work project expect activ attend lectur design composit part bibliographi search assess written exam report oral present class


### Explain which ones you implemented and why.

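1. **Split Connected CamelCase Words**: The raw descriptions contain words that were fused together (e.g. `materialsConstituentsProcessing`). We split at each lowercase-to-uppercase boundary so that every word becomes its own token, while keeping acronyms such as `IT` intact.

2. **Remove Punctuation**: Punctuation carries no information in a bag-of-words model. Removing it ensures that `composites,` and `composites` map to the same token. A side effect is that hyphenated words are merged (e.g. `realworld`).

3. **Remove Stopwords**: Stopwords carry little meaning on their own, so we remove them to keep only content-bearing words. Because the text is not yet lowercased at this stage, we also remove their capitalized versions.

4. **Remove Very Frequent Words**: We remove the 10 most frequent words (e.g. `methods`, `Learning`, `student`, `Content`). Several of them look like section headers shared by most course descriptions, so they do not help distinguish between documents.

5. **Remove Infrequent Words**: We remove the words that occur only once in the whole corpus (9462 of the 18796 distinct words). A word that appears a single time cannot link two documents, and many such words are typos or very specific terms. Removing them roughly halves the vocabulary.

6. **Stem the Words**: The Porter stemmer collapses inflected forms into a common stem (e.g. `composites` and `composite` both become `composit`) and lowercases most tokens, so that variants of the same word share one row in the term-document matrix.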
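
The CamelCase split, punctuation removal and stopword filtering described above can also be sketched more compactly with a regular expression. This is only an illustrative alternative to the loop used in this notebook, not the code that produced the outputs shown here: `EXAMPLE_STOPWORDS`, `split_camel_case` and `preprocess` are hypothetical names, the stopword set is a tiny stand-in for `data/stopwords.pkl`, and edge cases (for instance digits followed by an uppercase letter) may be handled differently from the loop.

```python
import re
import string

# Hypothetical stand-in for the lab's stopword list (data/stopwords.pkl)
EXAMPLE_STOPWORDS = {"the", "of", "and", "are", "in"}


def split_camel_case(text):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)


def preprocess(text, stopwords=EXAMPLE_STOPWORDS):
    """Split fused words, strip punctuation, then drop stopwords and their capitalized forms."""
    capitalized = {s.capitalize() for s in stopwords}
    text = split_camel_case(text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in stopwords and w not in capitalized]


tokens = preprocess("Basics of composite materialsConstituents, IT tools.")
print(tokens)
```

Acronyms such as `IT` survive because the pattern only matches a lowercase letter followed by an uppercase one, never two uppercase letters in a row.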

In [13]:
# Print the terms in the pre-processed description of the IX class in alphabetical order

for course in courses:
	if course['courseId'] == 'COM-308':
		description = course['description'].split()
		print(sorted(description))

['20', '30', '50', 'acquir', 'activ', 'ad', 'ad', 'advertis', 'algebra', 'algebra', 'algorithm', 'algorithm', 'analysi', 'analyt', 'analyt', 'analyz', 'applic', 'applic', 'assess', 'auction', 'auction', 'balanc', 'base', 'base', 'basic', 'basic', 'basic', 'cathedra', 'chain', 'class', 'class', 'class', 'cloud', 'cluster', 'cluster', 'collect', 'com300', 'combin', 'commun', 'commun', 'commun', 'comput', 'comput', 'concept', 'concept', 'concret', 'coverag', 'current', 'data', 'data', 'data', 'data', 'data', 'data', 'dataset', 'dataset', 'decad', 'dedic', 'design', 'detect', 'detect', 'develop', 'dimension', 'draw', 'ecommerc', 'ecommerc', 'effect', 'effici', 'exam', 'expect', 'explor', 'explor', 'explor', 'explor', 'explor', 'field', 'final', 'foundat', 'framework', 'function', 'fundament', 'good', 'graph', 'graph', 'hadoop', 'handson', 'homework', 'homework', 'import', 'inform', 'inform', 'infrastructur', 'inspir', 'internet', 'internet', 'java', 'key', 'keyword', 'knowledg', 'lab', 'la

## Exercise 4.2: Term-document matrix

In [14]:
# Keep track of the mapping between terms and their indices, and documents and their indices

terms = []
for course in courses:
  for word in course['description'].split():
    terms.append(word)
terms_indices = list(set(terms))

document_indices = [x['courseId'] for x in courses]

In [15]:
# Construct matrix X

M = len(terms_indices)
N = len(document_indices)

X = np.zeros((M, N))
X.shape

(5262, 854)

In [16]:
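# Populate matrix X with term frequencies normalised by document length

# Map each term to its row so every description is scanned only once
term_to_row = {term: i for i, term in enumerate(terms_indices)}

for j, course in enumerate(courses):
  description_list = course['description'].split()
  for word in description_list:
    X[term_to_row[word], j] += 1
  # Guard against an empty description to avoid dividing by zero
  if description_list:
    X[:, j] /= len(description_list)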

In [17]:
# Construct IDF

# Document frequency: number of courses in which each term appears
doc_freq = np.count_nonzero(X, axis=1)

# IDF as a column vector so that it broadcasts across the columns of X
idf = np.log(len(courses) / doc_freq).reshape(-1, 1)
idf

array([[2.96574156],
       [0.98787981],
       [3.45409433],
       ...,
       [5.65131891],
       [6.74993119],
       [6.05678401]])

In [18]:
# Construct TF-IDF matrix

tf_idf = X * idf
tf_idf.shape

(5262, 854)

In [19]:
# Show the 15 terms in the description of the IX class with the highest TF-IDF scores.

ix_idx = document_indices.index('COM-308')

ix_words = tf_idf[:, ix_idx]
ix_tf_idf_vals = {}
for idx, val in enumerate(ix_words):
	if val <= 0:  # skip terms that do not occur in the IX description
		continue
	ix_tf_idf_vals[terms_indices[idx]] = val

ix_tf_idf_vals_sorted = sorted(ix_tf_idf_vals.items(), key=lambda x:x[1], reverse = True)
ix_tf_idf_vals_sorted[:15]

[('servic', 0.09381766001103216),
 ('realworld', 0.08748768295385784),
 ('onlin', 0.08673424009748963),
 ('social', 0.08119005782903743),
 ('explor', 0.07435342543791054),
 ('mine', 0.07185368695551898),
 ('largescal', 0.06218278450286707),
 ('ecommerc', 0.06212086167413974),
 ('auction', 0.05501165982224287),
 ('internet', 0.04927201071521289),
 ('network', 0.04732881922663084),
 ('ad', 0.04463626585630974),
 ('stream', 0.04292289062899521),
 ('dataset', 0.04292289062899521),
 ('data', 0.041592876685255915)]

In [28]:
np.save('TFIDF_matrix.npy', tf_idf)
np.save('document_indices.npy', document_indices)
np.save('terms_indices.npy', terms_indices)

In [27]:
# Load the TF-IDF matrix
tf_idf = np.load('TFIDF_matrix.npy')
print("TF-IDF matrix (first 5 rows):")
print(tf_idf[:5])

# Load the document indices
document_indices = np.load('document_indices.npy', allow_pickle=True).tolist()
print("Document Indices (first 5):")
print(document_indices[:5])

# Load the terms indices
terms_indices = np.load('terms_indices.npy', allow_pickle=True).tolist()
print("Terms Indices (first 5):")
print(terms_indices[:5])

TF-IDF matrix (first 5 rows):
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.01555716 0.01924441 0.         ... 0.         0.03528142 0.03704549]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
Document Indices (first 5):
['MSE-440', 'BIO-695', 'FIN-523', 'MICRO-614', 'ME-231(a)']
Terms Indices (first 5):
['approxim', 'analysi', 'divers', 'matanya', 'refriger']


### Explain where the difference between the large scores and the small ones comes from.

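Each score is the product of two factors, $\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N}{\text{df}(t)}$, where $\text{tf}(t, d)$ is the share of the description's words that are term $t$, $\text{df}(t)$ is the number of courses containing $t$, and $N = 854$. A term scores high when it is used often in the IX description and appears in few other courses. This explains why 'data' ranks only 15th although it occurs six times in the description: it is mentioned in many other course descriptions, so its IDF is low. The same probably holds for 'stream'. Terms like 'servic', 'onlin' and 'realworld' rank higher because they are comparatively rare elsewhere in the corpus. At the extreme, a term that appears in every course gets an IDF of $\log 1 = 0$, while a term exclusive to this course gets the maximum IDF of $\log N$.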
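
The effect can be reproduced on a small made-up corpus. The sketch below uses the same definitions as this notebook (term frequency normalised by document length, natural-log IDF), but the documents and the helper name `tf_idf_score` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

# Toy corpus (hypothetical, not the course data): each document is a list of terms
docs = {
    "A": ["data", "stream", "auction", "auction"],
    "B": ["data", "stream", "network"],
    "C": ["data", "network", "graph"],
}


def tf_idf_score(term, doc_id):
    """TF-IDF with length-normalised term frequency and natural-log IDF."""
    words = docs[doc_id]
    tf = words.count(term) / len(words)
    df = sum(term in d for d in docs.values())  # number of documents containing the term
    return tf * math.log(len(docs) / df)


scores = {t: tf_idf_score(t, "A") for t in set(docs["A"])}
for term, score in sorted(scores.items(), key=lambda x: x[1], reverse=True):
    print(term, round(score, 3))
```

In document A, 'auction' scores highest because it is both frequent in A and absent elsewhere, 'stream' scores lower because another document shares it, and 'data' scores exactly 0 because it appears in every document.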

## Exercise 4.3: Document similarity search

In [21]:
# Build the query vectors for "markov chains" and "facebook" (using the stemmed terms)

markov_chain = np.zeros(len(terms_indices))
markov_chain[terms_indices.index('markov')] = 0.5
markov_chain[terms_indices.index('chain')] = 0.5

facebook = np.zeros(len(terms_indices))
facebook[terms_indices.index('facebook')] = 1

In [22]:
# Function to compare two documents

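def cosine_similarity(doc1, doc2):
	norm_doc1 = np.linalg.norm(doc1)
	norm_doc2 = np.linalg.norm(doc2)
	# An all-zero vector has no direction, so define its similarity as 0
	if norm_doc1 == 0 or norm_doc2 == 0:
		return 0.0
	return np.dot(doc1, doc2) / (norm_doc1 * norm_doc2)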

In [23]:
# Search for "markov chains"

markov_similarity = []
for idx in range(len(document_indices)):
  markov_similarity.append(cosine_similarity(tf_idf.T[idx], markov_chain))

# Dictionary {course name: similarity value}
markov_sim_vals = {}
for idx, sim in enumerate(markov_similarity):
  markov_sim_vals[courses[idx]['name']] = sim

# Sort by similarity value
markov_sim_vals_sorted = sorted(markov_sim_vals.items(), key=lambda x:x[1], reverse = True)


In [24]:
# Search for "facebook"

facebook_similarity = []
for idx in range(len(document_indices)):
    facebook_similarity.append(cosine_similarity(tf_idf.T[idx], facebook))

# Dictionary {course name: similarity value}
facebook_sim_vals = {}
for idx, sim in enumerate(facebook_similarity):
    facebook_sim_vals[courses[idx]['name']] = sim

# Sort by similarity value
facebook_sim_vals_sorted = sorted(facebook_sim_vals.items(), key=lambda x:x[1], reverse = True)

In [25]:
# Display the top five courses together with their similarity score for each query.

print("markov chain - top five courses with similarity score")
for i in range(5):
    print(markov_sim_vals_sorted[i])

print("\nfacebook - top five courses with similarity score")
for i in range(5):
    print(facebook_sim_vals_sorted[i])

markov chain - top five courses with similarity score
('Applied stochastic processes', 0.557910816140559)
('Applied probability & stochastic processes', 0.541839456832587)
('Markov chains and algorithmic applications', 0.4210724606519117)
('Supply chain management', 0.39279034730463636)
('Mathematical models in supply chain management', 0.3076572352007134)

facebook - top five courses with similarity score
('Computational Social Media', 0.18922283654097266)
('Composites technology', 0.0)
('Image Processing for Life Science', 0.0)
('Global business environment', 0.0)
('Electrochemical nano-bio-sensing and bio/CMOS interfaces', 0.0)


### What do you think of the results? Give your intuition on what is happening.



In [26]:
count_markov = 0
count_chain = 0
count_facebook = 0

for course in courses:
	for word in course['description'].split():
		if word == 'markov':
			count_markov += 1
		if word == 'chain':
			count_chain += 1
		if word == 'facebook':
			count_facebook += 1

print("Total number of occurrences for")
print("markov:", count_markov)
print("chain:", count_chain)
print("facebook:", count_facebook)

Total number of occurrences for
markov: 45
chain: 74
facebook: 3


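For the "markov chains" query, the top three courses are clearly about Markov chains. The fourth and fifth ('Supply chain management' and 'Mathematical models in supply chain management') are most likely matched on the term 'chain' alone. The vector space model treats the query as a bag of independent terms, so it cannot tell that 'chain' in 'supply chain' is unrelated to 'Markov chain'.

For the "facebook" query, only 'Computational Social Media' has a non-zero score, and every other course has a similarity of exactly 0. The term 'facebook' occurs only 3 times in the whole corpus, and the cosine similarity with a one-term query is non-zero only for documents that contain that exact term. The model therefore cannot retrieve courses that are topically related to Facebook (for example, courses on social networks) unless they use the word itself. This is a limitation of exact term matching rather than of rarity as such: a rare term receives a high IDF weight, but that weight only benefits the documents in which the term actually appears.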
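
The zero scores for the "facebook" query can be illustrated with a minimal sketch. The three-term vocabulary, the document vectors and the helper `cosine` below are all made up for the example and are not taken from the course data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical vocabulary: ['facebook', 'social', 'network']
query = [1.0, 0.0, 0.0]          # one-term query "facebook"
doc_with_term = [0.2, 0.5, 0.5]  # mentions 'facebook' once among other terms
doc_related = [0.0, 0.6, 0.4]    # about social networks, but never says 'facebook'

print(cosine(query, doc_with_term))
print(cosine(query, doc_related))
```

The related document gets a similarity of exactly 0 because its dot product with the query has no overlapping non-zero component, however close its topic is.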