<div style="background-color: #ffffff; color: #000000; padding: 10px;">
<img src="../media/img/kisz_logo.png" width="192" height="69"> 
<h1> NLP Fundamentals</h1>
<h2> Working with Embeddings</h2>
</div>

<div style="background-color: #f6a800; color: #ffffff; padding: 10px;">
<h2>Part 2.2 - Bag of Words</h2>
</div>

In this section we will build our first vector representation based on a statistical model, Bag of Words, first from scratch and then with a dedicated package called <kbd>gensim</kbd>. We will then make our first query and check how this vector representation works. 

In [None]:
# imports
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

import nb_config

from src.data import data_loader

import warnings
warnings.filterwarnings('ignore')

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>1. Overview</h3>
</div>

The Bag of Words (BoW) model is a simplified representation used in Natural Language Processing and text analysis. It treats a document as an unordered collection (a multiset, or "bag") of words: grammar and word order are discarded, and only word frequency is kept.

Here's a brief explanation of how the Bag of Words model works:
- Tokenization:
    The first step involves breaking down a document into tokens. We have already done that!

- Building the Vocabulary:
    A unique vocabulary is created by compiling a list of all distinct tokens present in the entire set of documents. Each token in this vocabulary is assigned a unique index or identifier.

- Vectorization:
    For each document in the dataset, a numerical vector is constructed. The length of the vector is equal to the size of the vocabulary, and each position corresponds to the count or presence of a specific word in the document. If a word is present, its corresponding position is marked; otherwise, it is set to zero.

- Document Comparison:
    The Bag of Words model allows for the comparison of documents based on their word vectors. Similar documents will have similar vector representations, despite variations in word order or grammar. We will compare documents with the metrics we discussed before.
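
The steps above can be sketched on a toy corpus. This is a minimal illustration with made-up tokens and hypothetical variable names (`toy_docs`, `toy_vocabulary`, `toy_vectors`), not the film descriptors we use in the rest of the notebook. Sorting the vocabulary is an assumption made here to keep the output reproducible.

```python
# Toy corpus, already tokenized (hypothetical data)
toy_docs = [
    ["prison", "escape", "hope"],
    ["prison", "warden", "prison"],
    ["love", "hope"],
]

# Building the vocabulary: all distinct tokens, in a fixed (sorted) order
toy_vocabulary = sorted({token for doc in toy_docs for token in doc})

# Each token gets its position in the vocabulary as identifier
toy_token_to_index = {token: i for i, token in enumerate(toy_vocabulary)}

# Vectorization: one count vector per document
toy_vectors = []
for doc in toy_docs:
    vector = [0] * len(toy_vocabulary)
    for token in doc:
        vector[toy_token_to_index[token]] += 1
    toy_vectors.append(vector)

print(toy_vocabulary)  # ['escape', 'hope', 'love', 'prison', 'warden']
for vector in toy_vectors:
    print(vector)
# [1, 1, 0, 1, 0]
# [0, 0, 0, 2, 1]
# [0, 1, 1, 0, 0]
```

The first two documents share the token *prison* and the first and third share *hope*, which is exactly the kind of overlap the comparison step picks up.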

Our first step is to load the data we tokenized in the previous part, together with the arguments passed to the <kbd>normalize()</kbd> function when we created the tokens. We will need this information later, when we tokenize the query. 

In [None]:
df, params = data_loader("my_tokenized_data.parquet")

# extract the arguments for the normalize function
tokenizer = params['tokenizer']
arguments = params['args']

# print the tokenizer used to create the tokens
print(f"Tokenizer used: {tokenizer}")
print(f"Arguments passed to the normalize function:\n{arguments}")

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>2. Creating the vocabulary</h3>
</div>

As we said before, the vocabulary is the complete set of unique words present in a given set of documents, or corpus. It fixes which word each vector position stands for, so it serves as the guide for building our vector representations afterwards.

<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

Create a python list that contains all the unique tokens in our descriptors and store it in a variable called <kbd>vocabulary</kbd>.
</div>


The final vocabulary can't be stored in a set, because a set has no order and cannot be indexed. It needs to be a list or, even better, a tuple, so that each word keeps a fixed position and we can map words to their corresponding indices.
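
A short sketch of why order matters, using made-up tokens and hypothetical names. Sorting is not required by the exercise; it is one optional way to get the same order on every run, since the iteration order of a set of strings can change between Python sessions.

```python
toy_tokens_seen = ["warden", "prison", "hope", "prison"]

# A set removes duplicates but has no positions:
# toy_unique[0] would raise a TypeError
toy_unique = set(toy_tokens_seen)

# Sorting gives a reproducible order; a tuple also prevents accidental changes
toy_vocabulary = tuple(sorted(toy_unique))
toy_token_to_index = {token: i for i, token in enumerate(toy_vocabulary)}

print(toy_vocabulary)                # ('hope', 'prison', 'warden')
print(toy_token_to_index["prison"])  # 1
```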

We will create a variable called *tokens* for easier access to the column *'tokens'* in our dataframe.

>
> <details>
> <summary>Need help? Here are some possible solutions</summary>
> 
> 1. Using a for loop over all the elements in the column:
> <pre><code># create an empty set
> vocabulary = set()
> 
> # iterate over the pd Series and update the set
> for index in df.index:
>    vocabulary.update(df.loc[index, 'tokens'].tolist())
> 
> # turn the vocabulary into a list
> vocabulary = list(vocabulary)
> 
> </code></pre>
> 
> <hr align=left width=350>
> 
> 2. Using <kbd>chain</kbd> from the package <kbd>itertools</kbd>:
> <pre><code>from itertools import chain
> 
> # chain all token lists, convert the result into a set
> # and then turn it into a list
> vocabulary = list(set(chain(*tokens)))
> 
> </code></pre>
> 
> <hr align=left width=350>
> 
> 3. Using <kbd>corpora</kbd> from the package <kbd>gensim</kbd>:
> <pre><code>from gensim import corpora
> 
> # create a Dictionary object from the tokens
> dictionary = corpora.Dictionary(tokens)
> 
> # filter the tokens with a set and turn it into a list
> vocabulary = list({value for key, value in dictionary.items()})
> 
> </code></pre>
> This last approach has the advantage of letting us do really useful things based on the structure of the texts, e.g. directly filtering out tokens that appear in too many or too few texts. In addition, the <kbd>Dictionary</kbd> object offers a collection of other useful methods.
> </details>
>
<br>

In [None]:
# create an easy access to the tokens
tokens = df['tokens']

# your code here
...
...
vocabulary = ...

# print vocabulary size
print(f"The vocabulary has {len(vocabulary)} tokens.")

Let's take a look at the first 15 tokens in our vocabulary before we proceed.

In [None]:
print(*vocabulary[:15])

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>3. Vectorizing the documents</h3>
</div>

We are now going to construct a numerical vector for each document in the dataset. The length of the vector is equal to the size of the vocabulary, and each position corresponds to the count or presence of a specific word in the document. This results in a numerical representation of the document based on the words it contains.

We have two different possibilities here:
- **binary BoW**, where a 1 or a 0 records whether the token is present in the document
- **Term Frequency BoW**, where we also count the number of occurrences of the token in the document and use that count as the vector value for that token in that document.

We will start with the binary Bag of Words, but you should be able later to reuse the same code with small modifications for trying the TF-BoW.
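
The difference between the two variants only shows up when a token is repeated. A minimal sketch with a made-up document and hypothetical names:

```python
from collections import Counter

toy_vocabulary = ("hope", "prison", "warden")
toy_doc = ["prison", "warden", "prison", "prison"]

# Counter returns 0 for tokens that do not occur in the document
counts = Counter(toy_doc)

# Term Frequency BoW: number of occurrences of each vocabulary token
tf_vector = [counts[token] for token in toy_vocabulary]

# binary BoW: 1 if the token occurs at least once, 0 otherwise
binary_vector = [int(counts[token] > 0) for token in toy_vocabulary]

print(tf_vector)      # [0, 3, 1]
print(binary_vector)  # [0, 1, 1]
```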

<br>

> <details>
> <summary>About sparse representations...</summary>
> As most documents contain only a small subset of the entire vocabulary, the resulting vectors are typically sparse, meaning that the majority of entries are zero. In those cases we will try to store the vectors as sparse representations.
> These sparse representations help manage computational resources efficiently: only the non-zero elements need to be stored and processed, resulting in significant savings in both storage space and computational effort.
> </details>
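
The idea can be sketched in plain Python with a dictionary keyed by (row, column), which is the principle behind the "dictionary of keys" (DOK) format. This is only an illustration with made-up numbers and hypothetical names, not how <kbd>scipy</kbd> implements it internally.

```python
n_docs, vocab_size = 3, 10_000

# Dense storage keeps one cell per (document, token) pair, zeros included
dense_cells = n_docs * vocab_size

# Dictionary-of-keys storage keeps only the non-zero entries
sparse_entries = {
    (0, 17): 1,
    (0, 4032): 1,
    (1, 17): 1,
    (2, 9998): 1,
}

def get_value(row, col):
    """Return the stored value, or 0 for any cell that was never set."""
    return sparse_entries.get((row, col), 0)

print(dense_cells)                           # 30000
print(len(sparse_entries))                   # 4
print(get_value(0, 17), get_value(1, 4032))  # 1 0
```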

<br>

<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

Create a sparse representation of the documents using the **binary Bag of Words** model. Follow the next steps:
1. Create a *scipy* sparse matrix called <kbd>bow_sparse</kbd> that has as many rows as documents in our dataset and as many columns as tokens in our vocabulary. The datatype should be integer.
2. Index the tokens in the vocabulary to their position in the vocabulary with a dictionary called <kbd>string_to_index</kbd>.
3. Populate the matrix with nested for loops.
</div>

For pedagogical reasons, we will also create a pandas DataFrame with the matrix and use the tokens in our vocabulary as column names.


> <details>
> <summary>You don't know what to do?</summary>
> 
> Maybe this code can help you:
> 
> <pre><code># create the sparse matrix
> bow_sparse = dok_matrix((len(tokens), len(vocabulary)), dtype=np.int32)
> 
> # dictionary with pairs 'token':index
> string_to_index = {string: i for i, string in enumerate(vocabulary)}
> 
> # populate the matrix
> for i, string_list in enumerate(tokens):
>    for string in string_list:
>        bow_sparse[i, string_to_index[string]] = 1
> </code></pre>
> </details>


In [None]:
# we can use dok_matrix from scipy
from scipy.sparse import dok_matrix

# create the sparse matrix
bow_sparse = ...

# dictionary with pairs 'token':index
string_to_index = {...}

# populate the matrix
for ...:
    for ...:
        ...

# Convert the DOK matrix to a sparse DataFrame
bow_df = pd.DataFrame.sparse.from_spmatrix(bow_sparse, columns=vocabulary)

What does that dataframe look like? Well, really sparse...

In [None]:
bow_df

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>4. Making our first query</h3>
</div>

Time to look for a film. Let's say we are interested in watching again an old film where one guy (was it Tim Robbins?) is sent to prison although he is innocent. Even though the conditions are really harsh, he doesn't give up, and he ends up doing the warden a favour with his accounting skills. We don't remember the name of the film, so let's see if our model can find it for us.

We are going to write a text describing briefly the movie. Something like:

> *An innocent man goes to prison accused of killing his wife and her lover, but never loses the hope*

We will pass that sentence through a tokenizer. We will use the metadata we collected before and create a tokenizer from scratch with that metadata, so we will have exactly the same tokenizer we used for the rest of the texts.

In [None]:
query = "An innocent man in prison that never loses the hope starts helping the warden as accountant"

print(tokenizer)
print(arguments)

<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

Create a vector representation of the query. We will follow these steps:
1. Tokenize the query the same way we did with the texts in our dataset
2. Make a list with all the unique tokens in the query and store it in the variable <kbd>query_tokens</kbd>
3. Create a scipy sparse matrix as we did for the corpus vectors with just one row and as many columns as tokens in our vocabulary. Store it in the variable <kbd>query_sparse</kbd>
4. Populate the sparse matrix with nested for loops, as we did before
</div>

Here again, we will create a pandas DataFrame to store the query vector.

> <details>
> <summary>We are sure you don't need the solution, but just in case...</summary>
> 
> There you go. This code should work:
> 
> <pre><code>from src.normalizing import normalize, NLTKTokenizer  # <- import the tokenizer you need
>  
> # instantiate the tokenizer
> tkn = NLTKTokenizer()
> 
> # tokenize the query with the same tokenizer you used for the corpus texts
> query_tokens = list(set(normalize(text=query, tkn=tkn, **arguments)[1]))
> 
> # create the sparse matrix
> query_sparse = dok_matrix((1, len(vocabulary)), dtype=np.int32)
> 
> # populate the matrix
> for string in query_tokens:
>     if string in string_to_index:  # skip tokens that are not in the vocabulary
>         query_sparse[0, string_to_index[string]] = 1
> </code></pre>
> </details>

In [None]:
from src.normalizing import normalize, ... # <- import the tokenizer you need

# instantiate the tokenizer
tkn = ...

# tokenize the query with the same tokenizer you used for the corpus texts
query_tokens = ...

# create the sparse matrix
query_sparse = ...

# populate the matrix
for ...:
    if ...:
        ...

# Convert the DOK matrix to a sparse DataFrame
query_df = pd.DataFrame.sparse.from_spmatrix(query_sparse, columns=vocabulary)

Time to compare our vectors. We will take advantage of NumPy's optimized numerical computations, memory efficiency, and broadcasting capabilities. As pandas is built on top of NumPy, we can easily extract the vectors from our dataframes and work with them directly.

> <details>
> <summary>Sorry, you said broadcasting?</summary>
> 
> Yes, we said broadcasting.
> 
> Broadcasting is like a smart way for NumPy to handle math with arrays of different sizes. It helps make operations work even if the arrays aren't the exact same shape. This makes it easier to do calculations without having to write lots of extra code.
>
> Curious about it? Take a look here:
>
> [https://numpy.org/doc/stable/user/basics.broadcasting.html](https://numpy.org/doc/stable/user/basics.broadcasting.html)
> </details>
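
A small sketch of broadcasting in the situation we are about to face: one query row compared against several document rows. The arrays are made up and the names are hypothetical.

```python
import numpy as np

toy_matrix = np.array([[1, 0, 1],
                       [0, 1, 1],
                       [1, 1, 1]])  # shape (3, 3): three documents
toy_query = np.array([[1, 0, 1]])   # shape (1, 3): one query

# The single query row is "stretched" over all three document rows
differences = toy_matrix - toy_query              # shape (3, 3)

# One Euclidean distance per document (row)
distances = np.linalg.norm(differences, axis=1)   # shape (3,)

print(differences.shape, distances.shape)
print(distances)
```

The first document is identical to the query, so its distance is 0.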

So, that's what we are going to do:

- We will store the numerical values from our dataframes in numpy ndarrays.
- Then we will calculate the distance (or similarity) of our query vector to (or with) all the other vectors.
- We will create a copy of our original data dataframe and add a column for each metric.
- Finally, to help us better understand the model, we will add another column that shows the tokens present both in the query and in each film.

In [None]:
# getting the numpy ndarrays from our dataframes
query_vector = query_df.to_numpy()
bow_matrix = bow_df.to_numpy()

# calculating the distances and similarities (one value per document)
euclid_distances = np.linalg.norm(bow_matrix - query_vector, axis=1)
dotprod_similarities = np.dot(bow_matrix, query_vector.T).flatten()
cos_similarities = cosine_similarity(query_vector, bow_matrix).flatten()

# creating the new dataframe and adding the extra columns
results = df.loc[:, ['title', 'descriptor']].copy()
results.loc[:, 'euclid_dist'] = euclid_distances
results.loc[:, 'dot_prod_sim'] = dotprod_similarities
results.loc[:, 'cos_sim'] = cos_similarities
results.loc[:, 'common_tokens'] = df.loc[:, 'tokens'].map(lambda x: list(set(x).intersection(query_tokens)))

We can now check the best results by sorting the dataframe by metric.
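
Keep in mind that the sort direction depends on the kind of metric: for a distance the best match has the smallest value, for a similarity it has the largest. A minimal sketch with made-up vectors and hypothetical names:

```python
import numpy as np

toy_titles = ["A", "B", "C"]
toy_matrix = np.array([[1, 1, 0, 0],
                       [1, 0, 1, 1],
                       [0, 0, 1, 1]])
toy_query = np.array([1, 1, 0, 0])

# Euclidean distance of the query to every document
euclid = np.linalg.norm(toy_matrix - toy_query, axis=1)

# Cosine similarity: dot product divided by the product of the norms
cosine = (toy_matrix @ toy_query) / (
    np.linalg.norm(toy_matrix, axis=1) * np.linalg.norm(toy_query)
)

# Distances: smallest first. Similarities: largest first.
by_distance = [toy_titles[i] for i in np.argsort(euclid)]
by_similarity = [toy_titles[i] for i in np.argsort(-cosine)]

print(by_distance)    # ['A', 'B', 'C']
print(by_similarity)  # ['A', 'B', 'C']
```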

In [None]:
# choose the metric! options:
# 'euclid_dist', 'dot_prod_sim', 'cos_sim'
metric = 'euclid_dist'

# choose number of results
n = 10

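# show the results: a small distance is a good match, so distances are
# sorted in ascending order; similarities are sorted in descending order
results.sort_values(by=metric, ascending=(metric == 'euclid_dist')).head(n)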

Are you curious about the descriptor for the film we were looking for? Well, it looks like this:

In [None]:
df.loc[df.title == "The Shawshank Redemption", 'descriptor'].item()

<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

Feel free to look for other films using your own queries, or just let the system recommend you films based on what you would like to see. And try to answer these questions:
- How good are the recommendations?
- How close are the suggested films to what you were looking for?
- Which metrics perform better for this model?
</div>

<br>

<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>(Optional) Exercise</strong>

Try to change the code from the last two sections to turn the binary Bag of Words into a Term Frequency Bag of Words. It only involves changing a couple of things.

Is the tf-BoW better than the binary BoW? What do you think?
</div>


<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>5. (Optional) Bag of Words with the <kbd>gensim</kbd> package</h3>
</div>

We could do the same in a more efficient way with libraries specifically developed to deal with texts. One example is <kbd>gensim</kbd>. Gensim is a Python library designed for topic modeling and document similarity analysis. It provides tools for building, training, and using topic models such as Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). It is widely used for tasks like extracting topics from large text corpora, document similarity comparison, and creating vector representations of words and documents using techniques like Word Embeddings (Word2Vec), as we will see later.

> <details>
> <summary>More options</summary>
> Of course, there are many more options. If you want to explore the Bag of Words universe a bit further, you could take a look at <kbd>tmtoolkit</kbd>, a library developed for text mining and topic modeling. You can also combine this library with <kbd>gensim</kbd> very easily.
>
> You can find more information about tmtoolkit here:
> [https://tmtoolkit.readthedocs.io/en/latest/bow.html](https://tmtoolkit.readthedocs.io/en/latest/bow.html)
> </details>

We are going to repeat the steps we followed before to create a Bag of Words model, but now using <kbd>gensim</kbd>. This time it is going to be a Term Frequency Bag of Words. The default metric in <kbd>gensim</kbd> is the cosine similarity. 

Let's start by creating a dictionary with the tokens and then the corpus. The dictionary will contain all the unique tokens, so it will work as our vocabulary.
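
In this corpus every document is stored as a list of `(token_id, count)` pairs that contains only the tokens actually present, which is the format returned by <kbd>doc2bow</kbd>. The following plain-Python sketch mimics that format on made-up data. The names `toy_token2id` and `toy_doc2bow` are hypothetical, and <kbd>gensim</kbd> assigns its own ids, so the real numbers will differ.

```python
from collections import Counter

toy_docs = [["prison", "hope", "prison"], ["warden", "hope"]]

# Assign an integer id to every distinct token (here: in sorted order)
toy_token2id = {
    token: i
    for i, token in enumerate(sorted({t for doc in toy_docs for t in doc}))
}

def toy_doc2bow(doc):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(toy_token2id[t] for t in doc if t in toy_token2id)
    return sorted(counts.items())

toy_corpus = [toy_doc2bow(doc) for doc in toy_docs]

print(toy_token2id)  # {'hope': 0, 'prison': 1, 'warden': 2}
print(toy_corpus)    # [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
print(toy_doc2bow(["prison", "alien"]))  # [(1, 1)]
```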

In [None]:
from gensim import corpora

# Create a Dictionary object from the tokens
# store a copy in the artifacts folder
dictionary = corpora.Dictionary(tokens)
dictionary.save('../artifacts/descriptors.dict')

# Create the corpus with a sparse vector for each document
# (the dictionary is already complete, so it needs no further updates)
corpus = [dictionary.doc2bow(doc) for doc in tokens]

In our next step we need to create a (temporary) index for fast access. For that we will use the <kbd>Similarity</kbd> class.

> <details>
> <summary>Hmm, but what's indexing?</summary>
> 
> Indexing is a common technique that allows us to create an optimized data structure for quick access to information, enhancing data retrieval speed. It facilitates efficient search, sorting, and supports constraints, significantly improving overall system performance and responsiveness. In databases, indexes are crucial for optimizing query execution and join operations.
>
> But don't worry! At the end of the workshop we will take a look at indexing in a bit more detail.
> </details>
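
To make the idea concrete, here is a sketch of one common kind of index for text, the inverted index, which maps each token to the documents that contain it. It is shown only as an illustration on made-up data with hypothetical names; it is not a description of how the <kbd>Similarity</kbd> class works internally.

```python
from collections import defaultdict

toy_docs = {
    0: ["prison", "hope"],
    1: ["warden", "prison"],
    2: ["love", "hope"],
}

# Inverted index: token -> set of ids of the documents containing it
inverted_index = defaultdict(set)
for doc_id, doc_tokens in toy_docs.items():
    for token in doc_tokens:
        inverted_index[token].add(doc_id)

def candidates(query_tokens):
    """Ids of documents sharing at least one token with the query."""
    found = set()
    for token in query_tokens:
        found |= inverted_index.get(token, set())
    return sorted(found)

print(candidates(["prison"]))         # [0, 1]
print(candidates(["hope", "alien"]))  # [0, 2]
```

With such a structure a query only has to look at the documents listed under its own tokens instead of scanning the whole corpus.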

In [None]:
from gensim.test.utils import get_tmpfile
from gensim.similarities import Similarity

# get the path to the temporary file 'index'
index_tmpfile = get_tmpfile("index")

# create the index instantiating the class Similarity
index = Similarity(index_tmpfile, corpus, num_features=len(dictionary))

In our next step, we write down our query and tokenize it as we have done before. Then it needs to be vectorized.

We can then use our index to find the cosine similarity between the query and every descriptor in the corpus.

Finally we add the cosine similarities to a column in a copy of the original dataframe. 

In [None]:
query = "An innocent man in prison that never loses the hope starts helping the warden as accountant"

# normalizing the query with the same tokenizer and arguments used for the corpus
query_tokens = normalize(text=query, tkn=tkn, **arguments)[1]
query_bow = dictionary.doc2bow(query_tokens)

# get cosine similarity between
# the query and all index documents
cos_similarities = index[query_bow]

# creating the dataframe
results = df.loc[:, ['title', 'descriptor']].copy()
results.loc[:, 'cos_sim'] = cos_similarities
results.loc[:, 'common_tokens'] = df.loc[:, 'tokens'].map(lambda x: list(set(x).intersection(query_tokens)))

# show the results
results.sort_values(by='cos_sim', ascending=False).head(10)

<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

As in the previous section, feel free to look for other films using your own queries, or just let the system recommend you films based on what you would like to see. And try to answer these questions:
- How good are the recommendations from this BoW created with <kbd>gensim</kbd>?
- Are they better than the ones we coded ourselves?
</div>

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>6. Advantages and disadvantages of BoW Models</h3>
</div>

Let's talk about the pros and cons of BoW models, and see where they can be used.

#### Advantages:

> - **Simplicity**: Bag of Words (BoW) models excel in simplicity, offering an intuitive representation of text based on word frequency. This simplicity makes BoW ideal for straightforward implementation and quick deployment in various natural language processing (NLP) tasks.
> 
> - **Versatility**: BoW models exhibit versatility, making them adaptable to a spectrum of NLP applications. Their simplicity allows for easy customization, making them suitable for tasks such as text classification, sentiment analysis, and information retrieval.
> 
> - **Efficiency**: BoW representations, especially in sparse formats, contribute to computational efficiency. This efficiency becomes crucial when dealing with large text corpora, enabling the processing of extensive datasets without an overwhelming demand for computational resources.
> 
> - **Interpretability**: BoW models offer interpretable representations. Each element in the BoW vector corresponds to the frequency or presence of a specific word, providing a clear understanding of the contributing factors to the model's output.

#### Disadvantages:

> - **Loss of Word Order**: A notable limitation of BoW models is the neglect of word order and grammar. This oversight results in the loss of crucial sequential information present in the text, making BoW less suitable for tasks reliant on context.
> 
> - **Sparsity**: BoW vectors are high-dimensional and mostly zero. Stored as dense arrays they waste memory, so they need sparse storage formats, and algorithms that can work with those formats, to be processed efficiently.
> 
> - **Lack of Semantics**: BoW models lack the ability to capture semantic relationships between words. This limitation makes them less effective for tasks requiring an understanding of nuanced meanings and contextual significance.
> 
> - **Vocabulary Size**: Managing and processing extensive vocabularies can be a consideration, particularly when dealing with large and diverse datasets. The size of the vocabulary impacts both memory and computational efficiency.
> 
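
The first and third limitations are easy to reproduce. A minimal sketch with a deliberately naive whitespace tokenizer (the helper `toy_bow` is hypothetical and much simpler than the tokenizer used in this notebook):

```python
from collections import Counter

def toy_bow(text):
    """Count whitespace-separated, lower-cased tokens."""
    return Counter(text.lower().split())

# Same words, opposite meaning: the BoW representations are identical
same = toy_bow("dog bites man") == toy_bow("man bites dog")

# Same meaning, different words: the representations share nothing
shared = set(toy_bow("great film")) & set(toy_bow("excellent movie"))

print(same)    # True
print(shared)  # set()
```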

#### Applications:

> - **Text Classification**: In text classification tasks, such as sentiment analysis or spam detection, Bag of Words (BoW) models serve as a fundamental representation. Multinomial Naive Bayes classifiers are often the method of choice in this context. These classifiers operate on the BoW vectors, leveraging word frequencies to calculate probabilities and make predictions about the category or sentiment of a given document. The simplicity of BoW complements the assumptions of Naive Bayes, making this combination effective for quick and interpretable text classification.
> 
> - **Information Retrieval**: For information retrieval applications, where the goal is to build efficient search engines, BoW models find utility in Vector Space Models (VSM). BoW vectors, representing documents and queries, are used to calculate the cosine similarity, serving as a measure of relevance. This method allows search engines to quickly retrieve and rank documents based on their similarity to a given search query, making BoW a foundational component in information retrieval systems.
>
> - **Topic Modeling**: In tasks related to topic modeling, such as uncovering latent themes within a collection of documents, BoW models play a crucial role when combined with methods like Latent Dirichlet Allocation (LDA). LDA utilizes BoW representations to identify topics by analyzing the distribution of words across documents. BoW's ability to capture word frequencies is essential for LDA to uncover the underlying thematic structure within the corpus.
>
> - **Baseline Models**: As baseline models in natural language processing, BoW, when paired with Naive Bayes classifiers, offers a simple and effective solution for various applications. Whether it's a quick assessment of sentiment or classifying documents into predefined categories, the straightforwardness of BoW representations aligns well with the assumptions of Naive Bayes. This combination serves as a practical starting point in NLP tasks, providing a balance between simplicity and reasonable performance.
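
To illustrate the combination of BoW and Naive Bayes mentioned above, here is a from-scratch sketch of a multinomial Naive Bayes classifier with Laplace smoothing. The training data, labels and names are made up for the example. In practice you would more likely rely on an existing library implementation.

```python
import math
from collections import Counter

# Hypothetical toy training data: (tokens, label)
train = [
    (["great", "film", "great", "cast"], "pos"),
    (["wonderful", "story"], "pos"),
    (["boring", "film"], "neg"),
    (["terrible", "boring", "plot"], "neg"),
]

vocab = {t for tokens, _ in train for t in tokens}
labels = {label for _, label in train}

# Per class: number of documents and summed token counts (the BoW statistics)
doc_counts = Counter(label for _, label in train)
token_counts = {label: Counter() for label in labels}
for tokens, label in train:
    token_counts[label].update(tokens)

def predict(tokens):
    """Pick the class with the highest log-probability."""
    scores = {}
    for label in labels:
        total = sum(token_counts[label].values())
        # log prior of the class
        score = math.log(doc_counts[label] / len(train))
        for t in tokens:
            if t in vocab:  # ignore out-of-vocabulary tokens
                # Laplace-smoothed log likelihood of the token
                score += math.log(
                    (token_counts[label][t] + 1) / (total + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(["great", "story"]))  # pos
print(predict(["boring", "plot"]))  # neg
```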