# Lab III - Advanced Topics

## Machine Learning II

* Andrés Castaño Licona
* Eileen Melissa Arevalo Garnica
* Moisés Alfonso Guerrero Jiménez

### Workshop III

1. In your own words, describe what vector embeddings are and what they are useful for.

**R/** Embeddings are vector representations of words in an N-dimensional space. They are dense and need far fewer dimensions than one-hot encoding, which assigns one dimension per word in the vocabulary. These embeddings capture semantic and syntactic relationships between words, allowing NLP and machine learning algorithms to work more effectively in tasks such as sentiment analysis and machine translation, among others. Embeddings are usually not the final output of the process but an intermediate step in NLP models, where they are adjusted and refined together with the other parameters during training.
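
The contrast with one-hot encoding can be illustrated with a small sketch. The three-word vocabulary and the 2-dimensional embedding values below are made up for illustration; real embeddings are learned during training and typically have hundreds of dimensions.

```python
import numpy as np

# Hypothetical three-word vocabulary (illustrative only).
vocab = ["king", "queen", "banana"]

# One-hot encoding: one dimension per word, so distinct words are always orthogonal.
one_hot = np.eye(len(vocab))

# Made-up dense embeddings in 2 dimensions; real values would be learned by a model.
dense = np.array([
    [0.90, 0.80],   # king
    [0.85, 0.90],   # queen
    [-0.70, 0.10],  # banana
])


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# One-hot vectors carry no notion of relatedness: the similarity is always 0.
one_hot_king_queen = cosine(one_hot[0], one_hot[1])

# Dense vectors can place related words close together and unrelated words apart.
dense_king_queen = cosine(dense[0], dense[1])
dense_king_banana = cosine(dense[0], dense[2])

print(one_hot_king_queen, dense_king_queen, dense_king_banana)
```

With these toy values, the one-hot similarity between "king" and "queen" is zero, while the dense vectors give them a similarity close to 1 and give "king" and "banana" a negative one.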

2. What do you think is the best distance criterion to estimate how far two embeddings (vectors) are from each other? Why?

**R/** The best distance criterion to estimate how far two embeddings are from each other is cosine similarity. Instead of calculating the distance between the two points, as the Euclidean distance does, it evaluates the angle between the two vectors.

**Advantages**

* Cosine similarity is simple: it only requires the dot product of the vectors divided by the product of their magnitudes. This simplicity leads to efficient calculations, making it suitable for real-time applications and large data sets.
* Unlike other distance-based similarity measures, cosine similarity considers the angle between vectors, which provides a more intuitive sense of similarity. Smaller angles indicate greater similarity and the measurement ranges between -1 and 1, making it easier to interpret.
* Cosine similarity is scale-invariant, meaning that it is not affected by the magnitudes of the vectors. This is especially useful in scenarios where you want to focus solely on the directionality of the vectors, rather than their length. Whether the values in your vector are in the tens or the millions, the cosine similarity will remain the same, making it versatile across different scales.
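
The scale-invariance point can be checked with a minimal sketch. The vectors below are illustrative and do not come from any model; the helper name `cosine_similarity` is our own.

```python
import numpy as np


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Illustrative vectors (not taken from any model).
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 3.0])

# Scaling one vector by a positive factor keeps the angle, and so the similarity.
sim_original = cosine_similarity(a, b)
sim_scaled = cosine_similarity(a, 1000.0 * b)

# The Euclidean distance, in contrast, grows with the scale of the vectors.
dist_original = float(np.linalg.norm(a - b))
dist_scaled = float(np.linalg.norm(a - 1000.0 * b))

print(sim_original, sim_scaled)
print(dist_original, dist_scaled)
```

Note that the invariance holds for positive scaling factors only: multiplying a vector by a negative number reverses its direction and flips the sign of the similarity.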

**References**

https://www.datastax.com/guides/what-is-cosine-similarity

3. Let us build a Q&A (question answering) system! 😀 For this, consider the following steps:
   - Pick whatever text you like, in the order of 20+ paragraphs.
   - Split that text into meaningful chunks/pieces.
   - Implement the embedding generation logic. Which tools and approaches would help you generate them easily and high-level?
   - For every question asked by the user, return a sorted list of the N chunks/pieces in your text that relate the most to the question. Do results make sense?

**R/** The selected text was taken from the BBC: **El plan de EE.UU. para que se volviera a emitir la telenovela venezolana "Kassandra" en Bosnia ante el temor de que estallara de nuevo la guerra** (https://www.bbc.com/mundo/articles/c6pd6y37vd8o)

The text was split into paragraphs using the newline separator (`\n`).

The embedding logic was implemented using the `GIST-Embedding-v0` model developed by avsolatorio and published on Hugging Face (https://huggingface.co/avsolatorio/GIST-Embedding-v0). This model "introduces a novel strategy that enhances in-batch negative selection during contrastive training through a guide model" (https://arxiv.org/pdf/2402.16829.pdf). This approach breaks away from random sampling and from the flawed assumption that all negative samples within a batch are equally valuable. According to the paper, this significantly reduces the impact of noise caused by data quality issues and improves the fine-tuning of the model.

The implementation logic is very simple:
1. Load the model.
2. Merge the question and the split text into a single list (the question must be placed in the first position of the list).
3. Create the embeddings using the `encode` function of the model.
4. Calculate the scores of the embeddings using cosine similarity and sort them in descending order.
5. Select the N chunks with the highest scores. In our case we select the three most relevant chunks.

In [1]:
# load the model
from sentence_transformers import SentenceTransformer
EMBEDDING_MODEL = SentenceTransformer("avsolatorio/GIST-Embedding-v0")

def split_text_by_paragraph(filename):
    """
    Opens a text file and splits its content at every newline character.

    Args:
        filename: The name of the text file to open.

    Returns:
        A list of strings, where each string is a non-empty paragraph
        (line) from the text file.
    """
    # assumes the file is UTF-8 encoded (the text contains accented characters)
    with open(filename, 'r', encoding='utf-8') as file:
        text_ = file.read()
    paragraphs = text_.split("\n")
    # discard empty or whitespace-only lines
    return [paragraph for paragraph in paragraphs if paragraph.strip()]

In [2]:
# load the example text
TEXT_BASE = split_text_by_paragraph('http_question_answering/question_answering/q_a/ml_models/texto_base.txt')

In [12]:

import numpy as np
from sentence_transformers import util


def get_most_related_paragraphs(model, text: list[str], question: str, number_of_chunks: int = 3):
    """
    Takes a text split into logical units and, based on a question,
    returns the units most related to the question.
    :param model: the embedding model
    :param text: the text split into text units
    :param question: the question to be answered
    :param number_of_chunks: the number of chunks to return
    :return:
    a numpy array with the units of the text most related to the question,
    sorted from most to least related
    """
    # the question goes first so that its embedding ends up at index 0
    full_text = [question] + text
    # calculate the embeddings of the question and the text units
    embeddings = model.encode(full_text, convert_to_tensor=True)
    # cosine similarity of the question vs. every text unit (scaled to percentages)
    scores = util.cos_sim(embeddings[:1], embeddings[1:]) * 100
    # indices of the text units sorted by descending score
    order = np.argsort(scores.tolist()[0])[::-1]
    # slicing past the end of the array is safe, so no length check is needed
    return np.array(text)[order[:number_of_chunks]]

In [13]:
# Example of use
question = '¿A quién interpretó Rebeca González?'
get_most_related_paragraphs(EMBEDDING_MODEL, TEXT_BASE, question)

array(['Esta joven, que había sido Gisela -interpretada por Rebeca González- en la versión original de 1973 llamada "Peregrina", y que luego fuera Raiza (Catherine Fulop) en "La muchacha del circo" de 1988, volvió bajo el nombre de Kassandra interpretada por Coraima Torres.',
       '"Comencé a hablar con un tipo que me dijo: \'Escuche, ni siquiera puedo decirle mi nombre en este momento. Hay un canal de televisión público y usted tiene un programa que está vendiendo en esta zona. Realmente lo queremos mantener en el aire porque fue retirado y la guerra se ha intensificado debido a ello, así que necesitamos su ayuda para volver a ponerlo en el aire\'", relató.',
       '"Se enamoraron de la historia y creo que les trajo paz y una sensación de normalidad. La historia de la pobreza a la riqueza siempre atrapa a la gente y ver a una persona como Coraima que tuvo muchas dificultades en sus primeros años, que la trataron muy mal y resucitó y se convirtió en reina, es hermoso, porque da espe

4. What do you think that could make these types of systems more robust in terms of semantics and functionality?

**Ways to make these systems more robust**

* To improve the selection of the paragraphs most relevant to the question, we could establish a minimum threshold that the cosine similarity score must reach for a paragraph to be taken into account.
* Using contextual text representations like BERT (including its multilingual variants, which encode text from different languages), which capture semantic similarities between sentences more effectively than traditional representations.
* Another way to make these models more robust is the integration of semantic lexicons, that is, incorporating semantic properties of the words and the complex relationships between them.

Some databases that could be used to take into account the semantic relationships between words are:

* **WordNet** is a linguistic database built by humans that contains grouped synonyms of nouns, verbs, adjectives and adverbs in English, along with semantic relationships such as synonyms, hypernyms, hyponyms, meronyms, among others.
* **FrameNet** is a lexicon that relates phrases/sentences (semantic frames) with individual words (lexical units), providing examples that show their meaning and use.
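
The minimum-threshold idea mentioned above can be sketched as follows. The helper `filter_by_threshold`, the threshold value of 0.5 and the example chunks and scores are all hypothetical; in practice the scores would be raw cosine similarities (in the range -1 to 1) and the threshold would have to be tuned on real questions.

```python
import numpy as np


def filter_by_threshold(chunks, scores, min_score=0.5, top_n=3):
    """Return up to top_n (chunk, score) pairs with score >= min_score, best first."""
    scores = np.asarray(scores, dtype=float)
    # indices from the highest to the lowest score
    order = np.argsort(scores)[::-1]
    # keep only the chunks that reach the minimum score
    kept = [(chunks[i], float(scores[i])) for i in order if scores[i] >= min_score]
    return kept[:top_n]


# Made-up chunks and scores, standing in for the output of a cosine-similarity step.
example_chunks = ["chunk A", "chunk B", "chunk C", "chunk D"]
example_scores = [0.82, 0.31, 0.64, 0.12]

relevant = filter_by_threshold(example_chunks, example_scores, min_score=0.5)
print(relevant)  # [('chunk A', 0.82), ('chunk C', 0.64)]
```

With a threshold, a question that is unrelated to the text can return fewer than N chunks, or none at all, instead of always returning the N least-bad matches.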

**References**

https://www.sciencedirect.com/science/article/pii/S0306457322000450

https://www.sciencedirect.com/science/article/pii/S0950705120300903

5. Bonus points if deployed on a local or cloud server.

In [5]:
# Start the Django development server that exposes the Q&A system over HTTP
!python http_question_answering/question_answering/manage.py runserver

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Watching for file changes with StatReloader
Performing system checks...

System check identified no issues (0 silenced).
February 29, 2024 - 00:43:03
Django version 4.2.10, using settings 'question_answering.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.

[29/Feb/2024 00:43:13] "GET / HTTP/1.1" 200 15743
^C
