<div style="background-color: #ffffff; color: #000000; padding: 10px;">
<img src="../media/img/kisz_logo.png" width="192" height="69"> 
<h1> Working with embeddings:
<h2>An introductory workshop with applications on Semantic Search
</div>

<div style="background-color: #f6a800; color: #ffffff; padding: 10px;">
<h2>Part 2.1 - Text Normalization
</div>

In this part we are going to see how to turn texts into a collection of smaller pieces, called tokens, that we will use for building numerical representations of our texts. While looking for good ways of tokenizing texts we will face some common problems and discuss possible solutions. At the end, we will build a tokenization pipeline and serialize both the pipeline and the tokenized texts.

We start by importing some packages.

In [None]:
# imports
import pandas as pd
from datasets import load_dataset

import warnings
warnings.filterwarnings('ignore')

# import config variables for the notebooks
from nb_config import RAW_DATA_PATH, INTERIM_DATA_PATH

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>1. Overview
</div>

We are going to work with real data and with a concrete problem in mind: retrieving information from a collection of objects, each described with a text, based on a query written by the user. Let's formulate the problem in a more precise way.




<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Challenge</strong>


As a developer for the cutting-edge movie platform HBFlix, your task is to implement a semantic search feature that lets users enter a description or query and receive a list of films that match the query as closely as possible, based on semantic similarity. The movie database at your disposal consists of around three thousand films, each accompanied by its release year and a concise publicity descriptor of the movie.
</div>

The dataset is a reduced version of [Kaggle's The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset). We have kept only film names, release years and film descriptors. Also, because some of the models that we are going to use don't perform well on big datasets, we have filtered it down to around three thousand films.

As we are going to run a lot of experiments with this data, we will save the dataframe in parquet format for easy access. We will also rename the column 'overview' to 'descriptor'.

><details>
><summary>Do you need more data?</summary>
>You can load a bigger, unfiltered version of this dataset, with the same structure but including descriptors for around 45,000 films, with this code:
>
><code>full_df = load_dataset('mt0rm0/movie_descriptors', split='train').to_pandas()</code>
></details>

In [None]:
# Loading the dataset
df = load_dataset('mt0rm0/movie_descriptors_small', split='train').to_pandas()
df.rename(columns={'overview': 'descriptor'}, inplace=True)

# save the dataframe in parquet format
df.to_parquet(RAW_DATA_PATH + 'movie_descriptors.parquet')

# show the dataframe
df

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>2. Tokenization
</div>

Tokenization is the process of breaking down a text or a sequence of characters into smaller units, often words or subwords, referred to as tokens. In Natural Language Processing (NLP), tokenization is a crucial step in preparing textual data for analysis. The resulting tokens serve as the basic building blocks for various NLP tasks, allowing algorithms to process and understand the structure of the text.

Let's start by taking a look at the texts we are going to work with, the descriptors for the movies in our list. The first one looks like this:

In [None]:
df.descriptor[0]

Time to get our hands dirty!


<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

Let's keep it simple and just split the text on whitespace. Each resulting substring will be a token. Then add to our dataframe a column called *tokens* that contains the list of tokens for each movie.
</div>

Write your code in the cell below.

><details>
><summary>Do you need some help?</summary>
><br>
>You can split strings in Python with the <kbd>split()</kbd> method applied to a string.
></details>

<br>


><details>
><summary>Maybe a bit more help?</summary>
><br>
>You can apply a method elementwise to a series or dataframe column with the <kbd>map()</kbd> method and a lambda function.
></details>

<br>


> <details>
> <summary>Got completely stuck? Here is a possible solution</summary>
> 
> This line of code would work:<br>
> <code>df['tokens'] = df['descriptor'].map(lambda x: x.split(' '))</code>
> </details>

In [None]:
df['tokens'] = ... # Your solution here

# Show the tokens
df.tokens

If we check the output for the first descriptor, we can already see some problems with this method...


<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

Take a look at the output for other descriptors. Then, take a couple of minutes to discuss with your neighbour where you have found problems and how you would fix them.
</div>

In [None]:
# change the index in the next line (any number from 0 to 2864 would do)
# to check the tokens for other films
print(df.tokens[0])

**Problems found**: 
- ...
- ...
- ...
- ...

**How to solve these problems**:
- ...
- ...
- ...
- ...

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>3. Ways to improve our tokens
</div>

We will focus now on how to improve our tokens so we can extract as much information as possible from them.

To make it a bit easier for you, we have implemented a better version of the simple tokenizer that you wrote in the last section. This tokenizer accounts for some of the problems, like:
- newline escape characters
- commas and dots at the end of a word
- empty strings/tokens

You can import this tokenizer from <kbd>src.normalizing</kbd>. Feel free to take a look at the code and try to understand how <kbd>SimpleTokenizer</kbd> works.

The following code shows how to use it.

In [None]:
from src.normalizing import SimpleTokenizer

# instantiate the SimpleTokenizer
st = SimpleTokenizer()

# apply the tokenizer to our text
df.loc[:, 'tokens'] = df.descriptor.map(lambda x: st.tokenize(x))

# show the tokens
df.tokens

We can see that they are still not perfect, but it looks much better.
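To make the three fixes listed above more concrete, here is a minimal sketch of a whitespace tokenizer that handles them. It is only an illustration: the function name `simple_tokenize` and the sample sentence are made up for this example, and the code is not the actual implementation of <kbd>SimpleTokenizer</kbd>.

```python
def simple_tokenize(text: str) -> list[str]:
    """Toy whitespace tokenizer (hypothetical, not the workshop's SimpleTokenizer)."""
    tokens = []
    # split() without arguments splits on any run of whitespace,
    # so newlines and repeated spaces never end up inside a token
    for raw in text.split():
        # remove commas and dots at the end of a word
        token = raw.rstrip(",.")
        # skip tokens that became empty after stripping
        if token:
            tokens.append(token)
    return tokens


# made-up sample text with a newline, trailing punctuation and a double space
sample = "Woody, the cowboy doll,\nis Andy's favourite toy.  Then Buzz arrives..."
print(simple_tokenize(sample))
```

Note that this sketch still leaves other problems untouched, for example parentheses, quotes or hyphenated words.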

There are, of course, many ways of tokenizing our texts, and we don't even need to implement them ourselves. Examples of common tokenizer implementations are:
- **Python**: str.split, re.split
- **NLTK**: TreebankWordTokenizer (Penn Treebank conventions), TweetTokenizer
- **spaCy**: Tokenizer class, fully customizable
- **Stanford CoreNLP**: linguistically accurate, requires a Java runtime
- **Hugging Face**: BertTokenizer

We have prepared for you in the module <kbd>src.normalizing</kbd> a standard spaCy tokenizer and the NLTK Penn Treebank tokenizer.
You can find them under the names <kbd>SpaCyTokenizer</kbd> and <kbd>NLTKTokenizer</kbd>.

In [None]:
from src.normalizing import SpaCyTokenizer, NLTKTokenizer

You can see how to use them in the next lines. Tokenizing the whole dataset would take quite a few minutes, so we are going to check only the tokens for the first movie.

> <details>
> <summary>About the SpaCyTokenizer</summary>
> 
> We have preloaded only the small spaCy model, but you can load models of different sizes by passing the parameter <kbd>size='md'</kbd> for the medium model or <kbd>size='lg'</kbd> for the largest model when instantiating the SpaCyTokenizer, as shown:
>
> <pre><code># use 'sm' for small (default)
> # 'md' for medium or 'lg' for large  
> tokenizer = SpaCyTokenizer(size='lg')
> </code></pre>
> 
> </details>



In [None]:
# Instantiate the tokenizer you want to use: SpaCyTokenizer or NLTKTokenizer
tokenizer = SpaCyTokenizer()

# Apply the tokenizer to the first descriptor as an example
tokens = tokenizer.tokenize(df.descriptor[0])
print(tokens)

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>4. Lemmatization and stemming
</div>

**Lemmatization** is a technique that involves reducing words to their base or canonical form, known as the "lemma." The lemma represents the dictionary form or the base form of a word, which is often a valid word that can be found in the language's dictionary.

Lemmatization is different from **stemming**, another text normalization technique. While stemming involves cutting off prefixes or suffixes of words to obtain a root form (which may not always be a valid word), lemmatization aims to transform words to their base form, preserving their meaning and ensuring that the resulting lemma is a valid word in the language.

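For example (stems as produced by the Porter stemmer):
- The lemma of "running" is "run" and the stem is "run"
- The lemma of "studies" is "study" and the stem is "studi" (stemming doesn't always produce valid words)
- The lemma of "better" is "good" and the stem is "better" (stemming cannot relate irregular forms)
- The lemma of "mice" is "mouse" and the stem is "mice"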
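The difference between the two techniques can be sketched with a few lines of plain Python. This is a deliberately naive illustration: `naive_stem` uses a tiny hand-written rule set (it is not the Porter or Snowball algorithm, so its stems can differ from theirs), and `toy_lemma` relies on a small hand-written lookup table instead of a real dictionary. Both names are made up for this example.

```python
# suffix -> replacement, checked in order (toy rules, not the Porter algorithm)
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# tiny hand-written lemma dictionary, for illustration only
LEMMAS = {"running": "run", "better": "good", "mice": "mouse", "studies": "study"}


def naive_stem(word: str) -> str:
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word


def toy_lemma(word: str) -> str:
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


for word in ["running", "better", "mice", "studies"]:
    print(f"{word}: stem={naive_stem(word)}, lemma={toy_lemma(word)}")
```

The stemmer only manipulates characters, so it may return strings that are not words and it cannot connect irregular forms. The lemmatizer returns valid words, but only for the forms its dictionary knows.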

We have prepared a lemmatizer and some stemmers as alternatives to our simple tokenizer:

The SpaCy lemmatizer is implemented as part of the SpaCyTokenizer in the module <kbd>src.normalizing</kbd>. **Instead of** using the method <kbd>tokenize</kbd>, you can use the method <kbd>lemmatize</kbd>. 

In [None]:
# Instantiate the SpaCyTokenizer
tokenizer = SpaCyTokenizer()

# Apply the lemmatizer to the first descriptor as an example
tokens = tokenizer.lemmatize(df.descriptor[0])
print(tokens)

The stemmers work on a token list and can be found as methods of the class <kbd>WordTools</kbd> in the <kbd>src.utils</kbd> module.

Two different stemmers have been implemented:
- The Porter stemmer (<kbd>porter_stemmer()</kbd>)
- The Snowball stemmer (<kbd>snowball_stemmer()</kbd>)

In [None]:
from src.utils import WordTools

# Instantiate the SpaCyTokenizer
tokenizer = SpaCyTokenizer()

# Apply the tokenizer to the first descriptor as an example
tokens = tokenizer.tokenize(df.descriptor[0])

# Apply the stemmer to the list of tokens
tokens = WordTools.porter_stemmer(tokens)

print(tokens)

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>5. Stop words
</div>

Stop words are common words in a language that are often filtered out during text preprocessing because they are considered to be of little value in terms of conveying meaning. Examples of stop words in English include "the," "and," "is," "of," etc. Stop words can vary depending on the language.

In natural language processing, it's common to remove stop words from text data to reduce noise and improve the efficiency of downstream tasks. The NLTK library in Python provides a list of common stop words for various languages.
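As a rough illustration of the idea, stop word removal boils down to filtering a token list against a set of words. The sketch below uses a tiny hand-picked set and a made-up function name, `remove_stop_words`; a real list, such as the one returned by NLTK's `stopwords.words('english')`, is much longer.

```python
# tiny hand-picked stop word set, for illustration only
STOP_WORDS = {"the", "and", "is", "of", "a", "in", "to"}


def remove_stop_words(tokens: list[str], stop_words: set[str] = STOP_WORDS) -> list[str]:
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["The", "story", "of", "a", "toy", "and", "his", "owner"]
print(remove_stop_words(tokens))  # ['story', 'toy', 'his', 'owner']
```

Comparing the lowercase form of each token means that "The" at the start of a sentence is removed as well. "his" survives here only because it is missing from our tiny set.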

We provide a tool for removing stop words using the NLTK stop word list. You can find it as a method of the class <kbd>WordTools</kbd> in the <kbd>src.utils</kbd> module.

In [None]:
# Instantiate the SpaCyTokenizer
tokenizer = SpaCyTokenizer()

# Apply the tokenizer to the first descriptor as an example
tokens = tokenizer.tokenize(df.descriptor[0])

# Remove the stop words
tokens = WordTools.stopword_filter(tokens)

print(tokens)

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>6. Case folding
</div>

Case folding is a text normalization technique that involves converting all the characters in a piece of text to a common, usually lowercase, form. The purpose of case folding is to ensure consistency and facilitate comparisons, as it makes the text case-insensitive.
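In plain Python, case folding a list of tokens could look like the following sketch. The function name `fold_case` is made up for this example; it uses the built-in `str.casefold()`, which is a slightly more aggressive variant of `str.lower()` intended for caseless comparisons.

```python
def fold_case(tokens: list[str]) -> list[str]:
    """Convert every token to its case-folded form."""
    return [token.casefold() for token in tokens]


print(fold_case(["Toy", "STORY", "Straße"]))  # ['toy', 'story', 'strasse']

# lower() leaves the German sharp s untouched, casefold() expands it
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
```

For purely English text both methods give the same result, so the choice only matters once other languages enter the data.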

You can find a tool for case folding as a method of the class <kbd>WordTools</kbd> in the <kbd>src.utils</kbd> module.

In [None]:
# Instantiate the SpaCyTokenizer
tokenizer = SpaCyTokenizer()

# Apply the tokenizer to the first descriptor as an example
tokens = tokenizer.tokenize(df.descriptor[0])

# Apply case folding
tokens = WordTools.case_folding(tokens)

print(tokens)

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>7. Building our own normalization pipeline
</div>

We can choose now some of these tools to create our tokens.

As the spaCy and NLTK tokenizers take considerably more time than the simple tokenizer, we are going to stick to the simple one for now. We will not stem or lemmatize, but we will remove the stop words, apply case folding and remove punctuation signs.

In [None]:
# Instantiate the SimpleTokenizer
st = SimpleTokenizer()

# Each step after the first one works on the tokens produced by the previous step.
# -n1 -r1 makes %timeit run each statement exactly once.

# Apply the tokenizer to our text
print("Tokenizing...")
%timeit -n1 -r1 df.loc[:, 'tokens'] = df.descriptor.map(lambda x: st.tokenize(x))

# Remove the stop words
print("Removing stop words...")
%timeit -n1 -r1 df.loc[:, 'tokens'] = df.tokens.map(lambda x: WordTools.stopword_filter(x))

# Case folding
print("Case folding...")
%timeit -n1 -r1 df.loc[:, 'tokens'] = df.tokens.map(lambda x: WordTools.case_folding(x))

# Removing punctuation signs
print("Removing punctuation signs...")
%timeit -n1 -r1 df.loc[:, 'tokens'] = df.tokens.map(lambda x: WordTools.punct_remover(x))

We now have a way of tokenizing our texts, and we have applied it to all our descriptors. Is that enough?

Well, the answer is no.

We will also want to tokenize our queries using exactly the same method we used to tokenize the descriptors, right?
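A small experiment shows why this matters. The sketch below compares a made-up descriptor with a made-up query, first on raw whitespace tokens and then after passing both through the same toy normalization. The function `toy_normalize` is a hypothetical stand-in, much simpler than the steps we used above.

```python
def toy_normalize(text: str) -> set[str]:
    """Hypothetical mini normalization: split, strip punctuation, fold case."""
    tokens = {word.strip(".,!?").casefold() for word in text.split()}
    tokens.discard("")  # drop tokens that consisted only of punctuation
    return tokens


# made-up texts, for illustration only
descriptor = "A cowboy doll is threatened by a new Spaceman figure."
query = "cowboy DOLL spaceman"

# without normalization only the exact string matches are found
raw_overlap = set(descriptor.split()) & set(query.split())
print(raw_overlap)  # {'cowboy'}

# with the same normalization on both sides all three query words match
normalized_overlap = toy_normalize(descriptor) & toy_normalize(query)
print(sorted(normalized_overlap))  # ['cowboy', 'doll', 'spaceman']
```

If the query were normalized differently from the descriptors, or not at all, tokens that mean the same thing would simply not meet.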

<div style="background-color: #b1063a; color: #ffffff; padding: 10px;">
<strong>Exercise</strong>

Create a function called <kbd>pipeline</kbd> that takes a text as input and returns as output a list of tokens, following the same steps we used before.
</div>

<br>

> <details>
> <summary>Not working at all? Take a look here</summary>
> 
> This could be a basic implementation of the function:
> <pre><code>
> def pipeline(text:str):
>     # instantiate the SimpleTokenizer
>     st = SimpleTokenizer()<br>
>     # apply the tokenizer to our text
>     tokens = st.tokenize(text)<br>
>     # remove the stop words
>     tokens = WordTools.stopword_filter(tokens)<br>
>     # case folding
>     tokens = WordTools.case_folding(tokens)<br>
>     # Removing punctuation signs
>     tokens = WordTools.punct_remover(tokens)<br>
>     return tokens
> </code></pre>
> </details>

In [None]:
# your code here below
def pipeline(text: str) -> list[str]:
    pass

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>8. Automating the process
</div>

Do you want to experiment with different pipelines by combining the functions above in different ways?

No worries. We have prepared for you a function called <kbd>normalize()</kbd> in the module <kbd>src.normalizing</kbd>. With that function you can choose different settings for your pipeline by adjusting the input parameters. For each text, it outputs a dictionary with the parameters and a list with the tokens.

The next code shows how it works.

In [None]:
from src.normalizing import normalize

import pandas as pd

# load the data
df = pd.read_parquet(RAW_DATA_PATH+'movie_descriptors.parquet')

# instantiate the NLTKTokenizer
tkn = NLTKTokenizer()

# normalize the texts
%timeit -r1 df.loc[:, 'tokens'] = df.descriptor.map(lambda x: normalize(x, tkn=tkn, punct_signs=True)[1])

Don't forget to save your tokenized data!

In [None]:
df.to_parquet(INTERIM_DATA_PATH+'my_tokenized_data.parquet')

<div style="background-color: #dd6108; color: #ffffff; padding: 10px;">
<h3>9. Optional: Keeping track of the tokenized data
</div>

If we want to run experiments with different settings, it is a good idea to keep track of the settings we use for each configuration, so that we can always compare them and look for the best performing combination.

> <details>
> <summary>About tokenization tracking</summary>
> 
> Tokenization tracking is part of a wider concept called artifact tracking. In the context of MLOps, artifact tracking involves managing and versioning various artifacts, such as trained models, datasets, and preprocessing scripts. It ensures reproducibility and traceability of experiments.
>
> Tokenization information, along with other preprocessing steps, can be considered artifacts. Tracking these artifacts allows you to understand how data was transformed and processed before being used in model training.
> </details>

One way of doing it would be by automatically saving metadata about the tokenization process every time we serialize one tokenized dataset to parquet. That way we can reconstruct the same tokenizer configuration we used originally and reuse it for queries or for training even more data.
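The bookkeeping part of this idea can be sketched with the standard library alone. The sketch below stores the parameters of each tokenized dataset in a single JSON file, keyed by dataset name. The helper names `log_params` and `load_params`, the file layout and the parameter names are assumptions made for this example, not the workshop's own implementation.

```python
import json
import tempfile
from pathlib import Path


def log_params(metadata_file: Path, dataset_name: str, params: dict) -> None:
    """Store (or update) the parameters used to produce `dataset_name`."""
    metadata = {}
    if metadata_file.exists():
        metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
    metadata[dataset_name] = params
    metadata_file.write_text(json.dumps(metadata, indent=2), encoding="utf-8")


def load_params(metadata_file: Path, dataset_name: str) -> dict:
    """Read back the parameters stored for `dataset_name`."""
    metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
    return metadata[dataset_name]


# demo in a temporary directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    metadata_file = Path(tmp) / "data.json"
    # hypothetical parameter names, for illustration only
    log_params(metadata_file, "tokens_v1", {"tokenizer": "simple", "stop_words": True})
    log_params(metadata_file, "tokens_v2", {"tokenizer": "nltk", "stop_words": False})
    print(load_params(metadata_file, "tokens_v1"))
```

Keeping all entries in one file makes it easy to compare configurations side by side, at the cost of having to read and rewrite the whole file on every update.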

As an example, we have implemented the function <kbd>df_pipeline()</kbd> in <kbd>src.normalizing</kbd>. It takes as input a dataframe and the same parameters as the <kbd>normalize()</kbd> function we have seen before, and outputs the dictionary with the parameters and the dataframe with the tokens in the column tokens.

In [None]:
from src.normalizing import df_pipeline

params, df = df_pipeline(df, tkn=tkn, punct_signs=True)

We can then pass the dataframe, a file name (without extension) and the parameters dictionary to <kbd>data_logger()</kbd> in <kbd>src.data</kbd>. This function will store the data as a parquet file in <code>data/interim/</code> with the given name, and will store (or update) the metadata in the file <code>data/data.json</code>.

In [None]:
from src.data import data_logger

data_logger(df, "my_tokenized_data.parquet", params)

You can later extract the data and the metadata with the following code:

<code>from src.data import data_loader

df, params = data_loader("my_tokenized_data.parquet")</code>