In [68]:
import pandas as pd
from tqdm import tqdm
import random

from embedding import OneHotEmbedder, Doc2VecEmbedder
from vectorstore import SimpleVectorDatabase
from models import TextDoc, Vector

# Text Document Similarity
Create a Python program that computes the text document similarity between different documents. Your implementation will take a list of documents as an input text corpus, and it will compute a dictionary of words for the given corpus. Later, when a new document (i.e., a search document) is provided, your implementation should return a list of documents that are similar to the given search document, in descending order of their similarity to the search document.

For computing similarity between any two documents in our question, you can use the following distance measures (optionally, you can also use any other measure as well).
1. dot product between the two vectors
2. distance norm (or Euclidean distance) between two vectors, e.g. $\| u - v \|$

As part of answering the question, you can also compare and comment on which of the two methods (or any other measure if you have used some other measure) will perform better and what are the reasons for it.
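
As a point of reference for the measures named above, here is a minimal sketch in plain Python. The function names `dot`, `euclidean` and `cosine` are chosen for illustration and are not taken from the project code; cosine similarity is included because it is one of the measures used later in this notebook.

```python
import math

def dot(u, v):
    """Dot product: larger means more similar; grows with vector magnitude."""
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    """Euclidean distance ||u - v||: smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity: dot product of the length-normalized vectors."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    return dot(u, v) / (norm_u * norm_v)

u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]  # same direction as u, twice the magnitude

print(dot(u, v))        # 10.0
print(euclidean(u, v))  # sqrt(5), roughly 2.236
print(cosine(u, v))     # 1.0, up to floating-point rounding
```

Note that the dot product and cosine similarity are similarities (higher is closer), while the Euclidean norm is a distance (lower is closer).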

---

This notebook guides you through the implementation and its intended use. For additional details, see the project report (PDF).


## Data

In [69]:
# Let's load the dataset. We use a small sample dataset from Wikipedia here. The
# `data_load` notebook contains a script to download a larger dataset. The larger
# dataset will give better results, but is more computationally expensive. To test this
# script, the small dataset is sufficient.

# The dataset consists of multiple Wikipedia articles. Each row contains the title of the
# article, the text of the article, and the URL.
dataset = pd.read_csv("data/wiki_subset_mini.csv", sep=",")
dataset.head()

Unnamed: 0,id,url,title,text
0,31945,https://en.wikipedia.org/wiki/List%20of%20metr...,List of metro systems,This list of metro systems includes electrifie...
1,37600,https://en.wikipedia.org/wiki/Axis,Axis,Axis may refer to:\n\nMathematics\nAxis (mathe...
2,37945,https://en.wikipedia.org/wiki/Giovanni%20Paisi...,Giovanni Paisiello,Giovanni Paisiello (or Paesiello; 9 May 1740 –...
3,20807,https://en.wikipedia.org/wiki/Merlin,Merlin,"Merlin (, , ) is a mythical figure prominently..."
4,621,https://en.wikipedia.org/wiki/Amphibian,Amphibian,"Amphibians are ectothermic, tetrapod vertebrat..."


In [70]:
# remove some random articles for later testing
# draw random sample of 5 articles
random.seed(42)
testing_indices = random.sample(range(len(dataset)), 5)

test_texts = dataset.loc[testing_indices,:].reset_index(drop=True)
train_texts = dataset.loc[~dataset.index.isin(testing_indices),:].reset_index(drop=True)

## Embedding

In [71]:
# This parameter sets the vector dimension of the embedding algorithm and the vector
# database. A larger dimension lets the embedding retain more information, which tends
# to give more accurate results, but it can also retain more noise and makes the
# calculations more expensive. For reference, some of OpenAI's embedding models use a
# dimension of 1536.
# We suggest a dimension of 150 for the smaller datasets. For the larger datasets, this
# parameter can be increased.

VECTOR_DIM = 150

In [72]:
# In this step the embedding model is fitted. This means that the model is trained on the
# given dataset. The user can choose which embedding algorithm to use - this implementation
# is designed to be easily extendable to other embedding algorithms while keeping the user
# interface the same. Every embedding implementation needs to have the methods `fit` and
# `embed`. The `fit` method is used to train the model on the given dataset. The `embed`
# method is used to embed a given text into a vector.

# The `OneHotEmbedder` is a simple embedding algorithm that uses a one-hot encoding to create
# the embedding vectors. It is trained on the dataset by creating a vocabulary of all words
# and then creating a very sparse vector for each text. Each vector signals which words are
# present in the text. There are two different embedding methods available: additive and
# one-hot. The additive method is the default; it counts the number of occurrences of
# each word in the text. The one-hot method only signals whether a word is present in the
# text or not, without counting. After creating the vector for one text, the vector has one
# entry per word in the vocabulary, which is very large. To reduce the dimensionality of the
# vector, PCA is applied. This reduces the dimensionality to the given `vector_dim`
# parameter.

# The `Doc2VecEmbedder` is a more complex embedding algorithm from the gensim library. It
# is more computationally expensive, but also more accurate. Details can be found in the
# gensim documentation. We won't go into detail here, since this is out of scope for this
# exercise. (gensim documentation: https://radimrehurek.com/gensim/models/doc2vec.html)


# Choose the embedder model to use: OneHot or Doc2Vec currently available
embedder = OneHotEmbedder(vector_dim=VECTOR_DIM, embedding_method="additive")
#embedder = Doc2VecEmbedder(vector_dim=VECTOR_DIM)

# fit the embedder model: this could take a few minutes
embedder.fit(list(train_texts.text))

# Update the vector dimension to the actual dimension of the embedding model (should be 
# equal to the given parameter, but can be changed by the embedding algorithm if the 
# dimensions don't fit)
VECTOR_DIM = embedder.vector_dim

2024-01-28 11:04:44,376::onehotembedder.py[_fit()]::INFO::Fitting embedder. Preprocessing documents...
100%|██████████| 295/295 [00:06<00:00, 45.13it/s]
2024-01-28 11:04:50,957::onehotembedder.py[_fit()]::INFO::Corpus created with 65281 words.
2024-01-28 11:04:50,958::onehotembedder.py[_fit()]::INFO::Reducing vector dimensions: fitting PCA (n=150)
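
To make the `fit`/`embed` interface and the two embedding methods concrete, here is a minimal sketch. `TinyCountEmbedder` is a hypothetical stand-in and not the project's `OneHotEmbedder`: it assumes plain lowercase whitespace tokenization, assumes the method names `"additive"` and `"one-hot"`, and leaves out preprocessing and the PCA step entirely.

```python
from collections import Counter

class TinyCountEmbedder:
    """Hypothetical stand-in, not the project's OneHotEmbedder (no PCA step)."""

    def __init__(self, embedding_method="additive"):
        self.embedding_method = embedding_method
        self.vocabulary = {}  # word -> position in the vector

    @staticmethod
    def _tokenize(text):
        # Assumption: lowercase whitespace tokenization is enough for a sketch.
        return text.lower().split()

    def fit(self, texts):
        # Build the vocabulary from every word seen in the corpus.
        words = sorted({w for text in texts for w in self._tokenize(text)})
        self.vocabulary = {word: i for i, word in enumerate(words)}

    def embed(self, text):
        counts = Counter(self._tokenize(text))
        vector = [0] * len(self.vocabulary)
        for word, n in counts.items():
            if word in self.vocabulary:  # words unseen during fit are ignored
                i = self.vocabulary[word]
                vector[i] = n if self.embedding_method == "additive" else 1
        return vector

corpus = ["the cat sat", "the dog sat on the mat"]
additive = TinyCountEmbedder("additive")
additive.fit(corpus)
binary = TinyCountEmbedder("one-hot")
binary.fit(corpus)

# Vocabulary order: cat, dog, mat, on, sat, the
print(additive.embed("the dog and the cat"))  # [1, 1, 0, 0, 0, 2]
print(binary.embed("the dog and the cat"))    # [1, 1, 0, 0, 0, 1]
```

The vector length equals the vocabulary size, which is why a real corpus (65281 words in the log above) needs a dimensionality reduction step afterwards.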


In [73]:
# In the next step we create a vector database. This database stores the texts we want
# to search later. We fill the database with the texts, embedded as vectors.
# Again, we designed this implementation to be easily extendable to other database
# implementations. The only requirement is that the database provides specific methods:
# `upsert`, which takes a vector and updates or inserts it, and `sim_search`, which
# returns the stored vectors most similar to a query vector.

# The `SimpleVectorDatabase` is a simple implementation of a vector database. It stores
# the vectors in a simple dictionary. For each Wikipedia article, we create a TextDoc from
# the text. This TextDoc is embedded as a vector using the embedder model fitted
# earlier. We add some metadata like the actual text, the title and the source, and finally
# upsert the vector into the database. This process might take some time, depending on the
# size of the dataset.


# Create vector database instance
vecdb = SimpleVectorDatabase(vector_dim=VECTOR_DIM)

# Create vectors and store them in the database. The tqdm package is used to show a
# progress bar. This is not necessary, but it is nice to see the progress.
for i in tqdm(range(len(train_texts))):
    row = train_texts.iloc[i]
    # Create TextDoc object from text.
    # This acts as a common interface for the database
    doc = TextDoc(row.text)
    # Create the vector instance and add the metadata
    vec = Vector(
        embedding=embedder.embed(doc),
        data=TextDoc(doc),
        metadata={"type": "wikipedia", "src": row.url, "title": row.title})
    # Upsert the vector into the database
    vecdb.upsert(vec)

100%|██████████| 295/295 [00:15<00:00, 18.86it/s]
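
The idea of a dictionary-backed vector database can be sketched in a few lines. `TinyVectorStore` below is a hypothetical illustration, not the project's `SimpleVectorDatabase`: its method signatures, the tuple layout of the stored values and the fixed cosine measure are assumptions made to keep the example short.

```python
import math
import uuid

class TinyVectorStore:
    """Hypothetical stand-in for a dict-backed vector database."""

    def __init__(self, vector_dim):
        self.vector_dim = vector_dim
        self.data = {}  # id -> (embedding, metadata)

    def upsert(self, embedding, metadata, vec_id=None):
        # Insert under a new id, or overwrite when an existing id is given.
        if len(embedding) != self.vector_dim:
            raise ValueError("embedding has the wrong dimension")
        vec_id = vec_id or uuid.uuid4()
        self.data[vec_id] = (list(embedding), metadata)
        return vec_id

    def sim_search(self, query, k=3):
        # Brute force: score every stored vector with cosine similarity.
        def cosine(u, v):
            dot = sum(a * b for a, b in zip(u, v))
            norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
            return dot / norms if norms else 0.0

        scored = [
            {"id": vec_id, "metadata": meta, "score": cosine(query, emb)}
            for vec_id, (emb, meta) in self.data.items()
        ]
        return sorted(scored, key=lambda r: r["score"], reverse=True)[:k]

store = TinyVectorStore(vector_dim=2)
first = store.upsert([1.0, 0.0], {"title": "A"})
store.upsert([0.0, 1.0], {"title": "B"})
store.upsert([0.9, 0.1], {"title": "A (updated)"}, vec_id=first)  # overwrite

results = store.sim_search([1.0, 0.0], k=1)
print(len(store.data), results[0]["metadata"]["title"])  # 2 A (updated)
```

A brute-force scan like this is linear in the number of stored vectors, which is fine for a few hundred articles but would need an index for much larger collections.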


In [74]:
# Let's look at our SimpleVectorDatabase

# show the entries of the dict
vecdb.data

# Each entry consists of a UUID as the key and a vector as the value.

{UUID('9a591172-fbd0-4e61-a858-d81bdf24a83d'): <Vector 9a591172-fbd0-4e61-a858-d81bdf24a83d (150) : TextDoc(6166): This list of metro systems includes electrified rapid t ...>,
 UUID('edac5c25-b7f6-405f-8729-1aad66dab162'): <Vector edac5c25-b7f6-405f-8729-1aad66dab162 (150) : TextDoc(2346): Axis may refer to:
 
 Mathematics
 Axis (mathematics), a d ...>,
 UUID('f2bd783f-c4db-406e-8552-46a5c5671ac7'): <Vector f2bd783f-c4db-406e-8552-46a5c5671ac7 (150) : TextDoc(21140): Giovanni Paisiello (or Paesiello; 9 May 1740 – 5 June 1 ...>,
 UUID('4be0177d-9c88-4ddc-b684-f29be891ed92'): <Vector 4be0177d-9c88-4ddc-b684-f29be891ed92 (150) : TextDoc(24719): Merlin (, , ) is a mythical figure prominently featured ...>,
 UUID('fc0e1297-3270-41e1-ba46-f06a066ec5e1'): <Vector fc0e1297-3270-41e1-ba46-f06a066ec5e1 (150) : TextDoc(75026): Amphibians are ectothermic, tetrapod vertebrates of the ...>,
 UUID('cfb7787d-eb6f-41a7-8b24-4cc03b7c0318'): <Vector cfb7787d-eb6f-41a7-8b24-4cc03b7c0318 (150) : TextDoc(7

In [75]:
# We have now fitted an embedding model and created a vector database, so we can try to
# find similar documents in the database. First we need to embed the new text into a
# vector using the same embedding model.


# Let's take one row of the test dataset (index 3) and query the database for similar
# vectors.
row = test_texts.iloc[3,:]
new_doc = TextDoc(row.text)
print(new_doc)

TextDoc(36454): Stephen Gary Wozniak (; born August 11, 1950), also kno ...


In [76]:
# Now we create an embedding vector from the TextDoc
new_vec = Vector(
    embedding=embedder.embed(new_doc),
    data=TextDoc(new_doc),
    metadata={"type": "wikipedia", "src": row.url, "title": row.title}
)
print(new_vec)

<Vector 6412c642-4df7-4373-ba52-2ac6d8aa78b5 (150) : TextDoc(36454): Stephen Gary Wozniak (; born August 11, 1950), also kno ...>


In [77]:
# Now we can query the database for similar vectors. Select the measure to use and the
# number of similar vectors to return. Again, this implementation is designed to be easily
# extendable to other similarity measures. Currently implemented measures are: cosine,
# euclidean and dot. 
# Each result is the vector found in the database and the corresponding similarity score.

similar_vectors = vecdb.sim_search(new_vec, measure="dot", k=3)
similar_vectors

[{'vector': <Vector bc847194-e212-4730-9124-7e5b5f166757 (150) : TextDoc(62194): Dolly Rebecca Parton (born January 19, 1946) is an Amer ...>,
  'score': 4112.219389186815},
 {'vector': <Vector fa86d81b-9145-4d85-8256-949f5d614106 (150) : TextDoc(29752): A wearable computer, also known as a wearable or body-b ...>,
  'score': 4092.8116332783234},
 {'vector': <Vector 92524dad-61db-4292-ae38-1dcf952d4f2a (150) : TextDoc(31711): Activision Publishing, Inc. is an American video game p ...>,
  'score': 3965.4917935897442}]

In [78]:
# Note: The following description assumes that the OneHotEmbedder (additive) is used, with
# a vector dimension of 150. The results might differ if you use a different embedding or 
# a different vector dimension.

In [79]:
# Let's look at the results. Our search vector is the Wikipedia article for Stephen Gary
# Wozniak, the co-founder of Apple. The most similar vector in our database is the
# Wikipedia article on Dolly Parton, the American singer. The second most similar vector
# is the article about wearable computers. This is a reasonable result: the article about
# Dolly Parton is also about a famous person, and the article about wearable computers is
# also about a technical topic. The similarity measure is not perfect, but it is a good
# start. The results can be improved by using a larger dataset and a more complex embedding.


print(f"Vector: [title: {new_vec.metadata['title']}] {new_vec.data}")
print("Most Similar:")
for i in range(len(similar_vectors)):
    print("_"*80)
    print(f"Rank {i+1}")
    print(f"{similar_vectors[i]['vector'].metadata['title']}")
    print(f"{similar_vectors[i]['vector'].data}")
    print(f"Similarity Score: {similar_vectors[i]['score']}")
    # uncomment to see the full text
    #print(f"{similar_vectors[i]['vector'].data.body}")

Vector: [title: Steve Wozniak] TextDoc(36454): Stephen Gary Wozniak (; born August 11, 1950), also kno ...
Most Similar:
________________________________________________________________________________
Rank 1
Dolly Parton
TextDoc(62194): Dolly Rebecca Parton (born January 19, 1946) is an Amer ...
Similarity Score: 4112.219389186815
________________________________________________________________________________
Rank 2
Wearable computer
TextDoc(29752): A wearable computer, also known as a wearable or body-b ...
Similarity Score: 4092.8116332783234
________________________________________________________________________________
Rank 3
Activision
TextDoc(31711): Activision Publishing, Inc. is an American video game p ...
Similarity Score: 3965.4917935897442


In [80]:
# Let's try another measure: the cosine similarity. This measure is more robust to
# differences in vector magnitude and is a good choice for text embeddings. The results
# look more relevant than before. We again get the articles about wearable computers and
# Activision, a video game company, now at ranks 1 and 2, followed by id Software, a
# video game developer. This is a good result.


similar_vectors = vecdb.sim_search(new_vec, measure="cosine", k=3)

print(f"Vector: [title: {new_vec.metadata['title']}] {new_vec.data}")
print("Most Similar:")
for i in range(len(similar_vectors)):
    print("_"*80)
    print(f"Rank {i+1}")
    print(f"{similar_vectors[i]['vector'].metadata['title']}")
    print(f"{similar_vectors[i]['vector'].data}")
    print(f"Similarity Score: {similar_vectors[i]['score']}")
    # uncomment to see the full text
    #print(f"{similar_vectors[i]['vector'].data.body}")

Vector: [title: Steve Wozniak] TextDoc(36454): Stephen Gary Wozniak (; born August 11, 1950), also kno ...
Most Similar:
________________________________________________________________________________
Rank 1
Wearable computer
TextDoc(29752): A wearable computer, also known as a wearable or body-b ...
Similarity Score: 0.6031062451740679
________________________________________________________________________________
Rank 2
Activision
TextDoc(31711): Activision Publishing, Inc. is an American video game p ...
Similarity Score: 0.41472153970255576
________________________________________________________________________________
Rank 3
Id Software
TextDoc(33704): id Software LLC () is an American video game developer  ...
Similarity Score: 0.27824082034685466
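
One intuition for why the dot product and cosine similarity can rank documents differently is sketched below with made-up raw word-count vectors. This is an illustration under simplifying assumptions: the vectors in this notebook are PCA-reduced, so the effect there is less direct, and the numbers here are invented for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

query = [2, 1, 0]             # word counts of the search document
short_on_topic = [4, 2, 0]    # same word mix as the query
long_off_topic = [10, 0, 40]  # long text, mostly other words

for name, doc in [("short_on_topic", short_on_topic),
                  ("long_off_topic", long_off_topic)]:
    print(name, dot(query, doc), round(cosine(query, doc), 3))
# short_on_topic 10 1.0
# long_off_topic 20 0.217
```

The dot product prefers the long, mostly unrelated document simply because its counts are larger, while cosine similarity divides out the magnitudes and prefers the document with the same word mix.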


In [81]:
# Lastly, let's try the Euclidean distance. Since this is a distance, a lower score means
# a more similar document. This measure returns an article about Akio Morita, the
# co-founder of Sony. The second most similar article is about Microserfs, a novel about a
# group of computer programmers. This also seems to be a pretty good result.


similar_vectors = vecdb.sim_search(new_vec, measure="euclidean", k=3)

print(f"Vector: [title: {new_vec.metadata['title']}] {new_vec.data}")
print("Most Similar:")
for i in range(len(similar_vectors)):
    print("_"*80)
    print(f"Rank {i+1}")
    print(f"{similar_vectors[i]['vector'].metadata['title']}")
    print(f"{similar_vectors[i]['vector'].data}")
    print(f"Similarity Score: {similar_vectors[i]['score']}")
    # uncomment to see the full text
    #print(f"{similar_vectors[i]['vector'].data.body}")

Vector: [title: Steve Wozniak] TextDoc(36454): Stephen Gary Wozniak (; born August 11, 1950), also kno ...
Most Similar:
________________________________________________________________________________
Rank 1
Akio Morita
TextDoc(7900): Akio Morita (January 26, 1921 – October 3, 1999) was a  ...
Similarity Score: 55.13239118699554
________________________________________________________________________________
Rank 2
Microserfs
TextDoc(10410): Microserfs, published by HarperCollins in 1995, is an e ...
Similarity Score: 56.75194130105295
________________________________________________________________________________
Rank 3
Nastassja Kinski
TextDoc(9790): Nastassja Aglaia Kinski (née Nakszynski; ; born 24 Janu ...
Similarity Score: 57.573978379397495
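
The scores above increase from rank 1 to rank 3 because Euclidean distance is a distance, not a similarity. A search routine that supports several measures therefore has to sort in the right direction for each one. The sketch below shows one way to do that; the `MEASURES` registry and the `rank` helper are hypothetical and not taken from the project code.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical registry: measure name -> (function, higher_is_better)
MEASURES = {
    "dot": (dot, True),
    "euclidean": (euclidean, False),
}

def rank(query, docs, measure, k=3):
    func, higher_is_better = MEASURES[measure]
    scored = [(title, func(query, vec)) for title, vec in docs.items()]
    # Similarities sort descending, distances sort ascending.
    return sorted(scored, key=lambda pair: pair[1], reverse=higher_is_better)[:k]

docs = {"near": [1.0, 1.0], "far": [5.0, 5.0]}
query = [1.0, 2.0]

print(rank(query, docs, "euclidean"))  # [('near', 1.0), ('far', 5.0)]
print(rank(query, docs, "dot"))        # [('far', 15.0), ('near', 3.0)]
```

The two measures disagree on this toy data: the vector that is closest in space is not the one with the largest dot product, which mirrors the different rankings seen in the cells above.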
