# Embeddings & You - A Brief Introduction to Embeddings in Machine Learning

If you've toyed with LangChain, LlamaIndex, or even OpenAI's `ada` model - you've likely run into the word "embeddings" a few times.

They've had a recent surge in popularity due to the proliferation of Retrieval Augmented Generation, but they've been around for a very long time.

If you come from an NLP background, embeddings are something you might be intimately familiar with - otherwise, you might find the topic a bit...dense. (this attempt at a joke will make more sense later)

In all seriousness, embeddings are a powerful piece of the NLP puzzle, so let's dive in!

> NOTE: While this notebook is language/NLP-centric, embeddings have uses beyond just text!

### Why Do We Even Need Embeddings?

In order to fully understand what Embeddings are, we first need to understand why we have them:

Machine Learning algorithms, ranging from the very big to the very small, all have one thing in common:

*They need numeric inputs.*

So we need a process by which to translate the domain we live in, dominated by images, audio, language, and more, into the domain of the machine: Numbers.

Another thing we want to be able to do is capture "semantic information" about words/phrases so that we can use algorithmic approaches to determine if words are closely related or not!

So, we need to come up with a process that does these two things well:

1. Convert non-numeric data into numeric data
2. Capture potential semantic relationships between individual pieces of data
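
To make these two goals concrete, here is a minimal sketch using hand-picked toy vectors (the words, the numbers, and the `cosine_similarity` helper are illustrative assumptions, not output from any trained model). One-hot vectors satisfy the first goal only, while dense vectors can satisfy both:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors: numeric, but every pair of distinct words looks equally unrelated.
one_hot = {
    "cat": [1, 0, 0],
    "kitten": [0, 1, 0],
    "car": [0, 0, 1],
}

# Hypothetical dense vectors: related words point in similar directions.
dense = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.05],
    "car": [0.1, 0.05, 0.95],
}

for name, table in [("one-hot", one_hot), ("dense", dense)]:
    cat_kitten = cosine_similarity(table["cat"], table["kitten"])
    cat_car = cosine_similarity(table["cat"], table["car"])
    print(f"{name}: cat~kitten={cat_kitten:.3f}, cat~car={cat_car:.3f}")
```

With the one-hot table both similarities are zero, so nothing about meaning is captured. With the dense table, "cat" sits much closer to "kitten" than to "car".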

## Training Word2Vec from Scratch

Now that we have a bit of background on Embeddings - let's look at what it takes to create our own embeddings using Word2Vec!

We'll be leveraging the `gensim` library, which you can read all about [here](https://pypi.org/project/gensim/).

Before we begin training, however, we need some data!

Let's use the Wikipedia pages for Barbie and Oppenheimer as examples.

### Data Collection

We'll leverage the `wikipedia` library, and `langchain`'s `WikipediaLoader` to obtain our Wikipedia data!

In [None]:
!pip install -U -q wikipedia langchain langchain_community langchain_openai lxml

In [None]:
from langchain_community.document_loaders import WikipediaLoader

barbie_docs = WikipediaLoader(
    query="Barbie",
    load_max_docs=5,
    doc_content_chars_max=1_000_000
    ).load()

In [None]:
len(barbie_docs)

In [None]:
oppenheimer_docs = WikipediaLoader(
    query="Oppenheimer",
    load_max_docs=5,
    doc_content_chars_max=1_000_000
    ).load()

In [None]:
len(oppenheimer_docs)

Now that we have some text, we need to do some preprocessing! That's right - classic NLP!

Let's begin by cleaning up our text. We'll:

- Remove special characters
- Remove stop words
- Remove links
- Convert to lowercase
- Strip whitespace

To do this, we'll need two main modules:

- The `re` standard library module
- `nltk`, another NLP library

In [None]:
!pip install -U -q nltk

In [None]:
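import nltk

nltk.download('stopwords')
nltk.download('punkt')
# Assumption: newer NLTK releases look for 'punkt_tab' rather than 'punkt'
nltk.download('punkt_tab')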

In [None]:
from nltk.corpus import stopwords
stopwords.words('english')[:10]

#### 🏗️ Activity #1:

What should the output format of the `preprocess_text` function be?

Once you've determined the output format - please complete the code cell and ensure the appropriate format is returned.

In [None]:
import re
from typing import List
from nltk.tokenize import word_tokenize

def preprocess_text(text: str) -> List[str]:
  # remove links
  text = re.sub(r"YOUR PATTERN HERE", "", text)
  # remove all special characters (keep alphabet characters)
  text = re.sub("YOUR PATTERN HERE", " ", text)
  # tokenize text, make lowercase, and remove stop words
  stop_words = set(stopwords.words('english'))
  tokens = word_tokenize(text)
  filtered_tokens = [
      ### YOUR CODE HERE
  ]
  return filtered_tokens

Let's see how this works on some of our Wikipedia data!

In [None]:
preprocess_text(barbie_docs[0].page_content[:100])

#### 🏗️ Activity #2:

What should the output format of the `sentence_tokenization` function be?

Once you've determined the output format - please complete the code cell and ensure the appropriate format is returned.

In [None]:
from nltk.tokenize import sent_tokenize

def sentence_tokenization(text: str) -> List[List[str]]:
    # Tokenize the text into sentences
    sentences = ### YOUR CODE HERE
    # Tokenize each sentence into words and store them in a list of lists
    sentence_tokens = ### YOUR CODE HERE
    return sentence_tokens

In [None]:
sentence_tokenization(barbie_docs[0].page_content[:200])

Perfect, with that, we're ready to create our corpus!

In [None]:
corpus = []

for doc in barbie_docs:
  corpus += sentence_tokenization(doc.page_content)

for doc in oppenheimer_docs:
  corpus += sentence_tokenization(doc.page_content)

### Training Word2Vec

Now that we have our corpus set up, we can train our Word2Vec model.

Training is straightforward, thanks to `gensim`, and more can be understood about the process by reading the paper - but let's see it in code!

It's also worth considering/playing around with the `gensim` parameters.

In [None]:
!pip install -q -U gensim

#### 🏗️ Activity #3:

Set appropriate hyperparameters for the gensim `Word2Vec` model.

Please also describe what each parameter does, in your own words.

> NOTE: Documentation is available [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)

In [None]:
from gensim.models import Word2Vec

VECTOR_SIZE = ### YOUR CODE HERE
WINDOW = ### YOUR CODE HERE
MIN_COUNT = ### YOUR CODE HERE
SG = ### YOUR CODE HERE

model = Word2Vec(
    sentences=corpus,
    vector_size=VECTOR_SIZE,
    window=WINDOW,
    min_count=MIN_COUNT,
    sg=SG
    )

Blink and you'll miss it. You just trained an embeddings model!

Let's try it out and see what we did!

In [None]:
model.wv["barbie"]

Finally! We see it: An embedding in the wild.

Notice how we input a word, in this case "barbie", and we got back a vector of floats with `VECTOR_SIZE` dimensions (100 if you kept gensim's default).

Let's see if we can't get back a list of similar vectors to the vector for "barbie", and "oppenheimer"!

In [None]:
model.wv.most_similar(positive=["barbie"], topn=3)

In [None]:
model.wv.most_similar(positive=["oppenheimer"], topn=3)
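
Under the hood, `most_similar` ranks words by cosine similarity to the query vector. The sketch below reproduces that ranking idea on a tiny, hand-made vocabulary (the 2-dimensional vectors and the `rank_by_similarity` helper are assumptions for illustration, not values from our trained model):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_by_similarity(word, vectors, topn=3):
    """Return the topn (word, similarity) pairs closest to `word`, excluding itself."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors, chosen by hand for the example.
toy_vectors = {
    "barbie": [0.9, 0.1],
    "doll": [0.8, 0.2],
    "ken": [0.7, 0.4],
    "bomb": [0.1, 0.9],
    "physicist": [0.2, 0.8],
}

for neighbour, score in rank_by_similarity("barbie", toy_vectors, topn=3):
    print(f"{neighbour}: {score:.3f}")
```

A real model does the same thing over thousands of words and many more dimensions, usually with vectorised math rather than Python loops.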

Now, for the moment of truth, let's see if it can "do the thing" that is shown in every embeddings visualization ever.

In [None]:
ken_vec = model.wv["ken"]
man_vec = model.wv["man"]
mystery_vector = ken_vec - man_vec

In [None]:
model.wv.most_similar(positive=[mystery_vector], topn=3)

And there we have it - embeddings, and a demonstration of what makes them so powerful!

> Note: This is a very small corpus, so while this result is what we'd hope for, it is largely coincidental. This behaviour is expressed much better in larger corpora of text.
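
To see why vector arithmetic can work at all, here is a minimal sketch with hand-built 2-dimensional vectors whose axes we pretend mean "royalty" and "gender" (these vectors and the `nearest` helper are assumptions for the example; real embedding dimensions are not this interpretable):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical vectors: first axis ~ "royalty", second axis ~ "gender".
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
}

def nearest(target, vectors, exclude=()):
    """Word whose vector has the highest cosine similarity to `target`."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

# king - man + woman, computed element by element
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

print(target)
print(nearest(target, vectors, exclude=("king", "man", "woman")))
```

Subtracting "man" removes the gender component, and adding "woman" puts the opposite one back, which leaves a vector that points toward "queen".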

## Fine-tuning an Embedding Model with LlamaIndex

Now that we've seen where embeddings "started", as it were, let's see where they've gotten.

In this section, we'll be fine-tuning Hugging Face's [sentence transformers](https://www.sbert.net/).

Sentence Transformers leverages the work done in the [Sentence-BERT](https://arxiv.org/abs/1908.10084) paper. So while the idea of converting input text into a dense vector representation is the same, the way we got to those embeddings is a bit different.

The code is largely adapted from [this](https://medium.com/llamaindex-blog/fine-tuning-embeddings-for-rag-with-synthetic-data-e534409a3971) amazing blog post by Jerry.

In [None]:
!pip install -U -q llama-index pypdf

### Generating Synthetic Data

Of course, when considering the easiest path forward to making our embeddings model better - it's tough to resist the siren's call of OpenAI's very cheap and very powerful models.

Usually you'd need a team of people to generate high quality labelled data, so we'll shortcut that process by generating our own synthetic data!

As always, we will first need to get some base data that we want to build our RAG pipeline on top of.

In [None]:
!wget https://justcheckingonall.files.wordpress.com/2008/01/hhgtg1.pdf

In [None]:
!wget https://justcheckingonall.files.wordpress.com/2008/01/hhgtg2.pdf

In [None]:
!wget https://justcheckingonall.files.wordpress.com/2008/01/hhgtg3.pdf

In [None]:
TRAINING_FILES = ["hhgtg1.pdf", "hhgtg2.pdf"]
EVAL_FILES = ["hhgtg3.pdf"]

In [None]:
%mkdir data

We'll set some paths to help the flow. If you're doing this locally and not in Colab, this should let you resume the process across sessions!

In [None]:
TRAIN_CORPUS_FPATH = "./data/train_corpus.json"
EVAL_CORPUS_FPATH = "./data/eval_corpus.json"

Next, we'll set up a helper function to help us convert our PDFs into a corpus - which is a collection of nodes!

In [None]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.schema import MetadataMode

def load_corpus(files, verbose=False):
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f'Loaded {len(docs)} docs')

    parser = SimpleNodeParser.from_defaults()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f'Parsed {len(nodes)} nodes')

    corpus = {node.node_id: node.get_content(metadata_mode=MetadataMode.NONE) for node in nodes}
    return corpus

In [None]:
train_corpus = load_corpus(TRAINING_FILES, verbose=True)
eval_corpus = load_corpus(EVAL_FILES, verbose=True)

Let's write these data files out.

In [None]:
import json

with open(TRAIN_CORPUS_FPATH, 'w+') as f:
    json.dump(train_corpus, f)

with open(EVAL_CORPUS_FPATH, 'w+') as f:
    json.dump(eval_corpus, f)

### Preparing Fine-tuning Data

Next up, we'll leverage `gpt-3.5-turbo` to create question and answer pairs that we will use to fine-tune our embeddings model.

You could choose `gpt-4`, `claude` or substitute real human curated data for this step - but we'll see the process through with the `gpt-3.5-turbo` model as a demonstration!

In [None]:
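# llama-index-embeddings-huggingface is assumed to be needed for the "local:" embedding models used during evaluation
!pip install -qU llama-index-llms-openai llama-index-embeddings-openai llama-index-embeddings-huggingface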

In [None]:
import re
import uuid

from llama_index.llms.openai import OpenAI
from tqdm.notebook import tqdm

In [None]:
TRAIN_QUERIES_FPATH = './data/train_queries.json'
TRAIN_RELEVANT_DOCS_FPATH = './data/train_relevant_docs.json'

EVAL_QUERIES_FPATH = './data/eval_queries.json'
EVAL_RELEVANT_DOCS_FPATH = './data/eval_relevant_docs.json'

In [None]:
with open(TRAIN_CORPUS_FPATH, 'r+') as f:
    train_corpus = json.load(f)

with open(EVAL_CORPUS_FPATH, 'r+') as f:
    eval_corpus = json.load(f)

As always, we need our OpenAI API key!

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key: ")

Let's use this helper function to create our question answer pairs.

We're going to use this prompt:

```
Context information is below.
    
---------------------
{context_str}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided."
```

As you might be able to tell - we have the ability to control how many questions we generate, as well as the persona used to create the questions.

The rest of the helper function is simply parsing the questions!

In [None]:
def generate_queries(
    corpus,
    num_questions_per_chunk=2,
    prompt_template=None,
    verbose=False,
):
    """
    Automatically generate hypothetical questions that could be answered with
    doc in the corpus.
    """
    llm = OpenAI(model='gpt-3.5-turbo')

    prompt_template = prompt_template or """\
    Context information is below.

    ---------------------
    {context_str}
    ---------------------

    Given the context information and not prior knowledge.
    generate only questions based on the below query.

    You are a Teacher/ Professor. Your task is to setup \
    {num_questions_per_chunk} questions for an upcoming \
    quiz/examination. The questions should be diverse in nature \
    across the document. Restrict the questions to the \
    context information provided."
    """

    queries = {}
    relevant_docs = {}
    for node_id, text in tqdm(corpus.items()):
        query = prompt_template.format(context_str=text, num_questions_per_chunk=num_questions_per_chunk)
        response = llm.complete(query)

        result = str(response).strip().split("\n")
        questions = [
            re.sub(r"^\d+[\).\s]", "", question).strip() for question in result
        ]
        questions = [question for question in questions if len(question) > 0]

        for question in questions:
            question_id = str(uuid.uuid4())
            queries[question_id] = question
            relevant_docs[question_id] = [node_id]
    return queries, relevant_docs
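
To see what the parsing step at the end of the helper does, here is a small standalone sketch that applies the same split, `re.sub`, and filter logic to a made-up response (the `sample_response` text is a hypothetical stand-in for what the LLM might return):

```python
import re

# Hypothetical LLM output: numbered questions in slightly different styles,
# with a blank line in between.
sample_response = """1. What is the main topic of the passage?
2) Which character is introduced first?

3 Where does the opening scene take place?"""

lines = sample_response.strip().split("\n")

# Drop a leading number followed by ')', '.', or whitespace, then trim.
questions = [re.sub(r"^\d+[\).\s]", "", line).strip() for line in lines]

# Discard anything that ended up empty (for example, blank lines).
questions = [question for question in questions if len(question) > 0]

for question in questions:
    print(question)
```

Each surviving string becomes one entry in `queries`, keyed by a fresh UUID and linked back to the node it was generated from.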

> ### NOTE: The following cells take ~15min. to run - please ensure you have time for this step before continuing. I will provide the relevant data files if you wish to continue from this point! All of the data files can be found here: https://github.com/AI-Maker-Space/DataRepository

In [None]:
train_queries, train_relevant_docs = generate_queries(train_corpus)

In [None]:
eval_queries, eval_relevant_docs = generate_queries(eval_corpus)

Let's save our data in an appropriate format for later!

In [None]:
with open(TRAIN_QUERIES_FPATH, 'w+') as f:
    json.dump(train_queries, f)

with open(TRAIN_RELEVANT_DOCS_FPATH, 'w+') as f:
    json.dump(train_relevant_docs, f)

with open(EVAL_QUERIES_FPATH, 'w+') as f:
    json.dump(eval_queries, f)

with open(EVAL_RELEVANT_DOCS_FPATH, 'w+') as f:
    json.dump(eval_relevant_docs, f)

In [None]:
TRAIN_DATASET_FPATH = './data/train_dataset.json'
EVAL_DATASET_FPATH = './data/eval_dataset.json'

In [None]:
train_dataset = {
    'queries': train_queries,
    'corpus': train_corpus,
    'relevant_docs': train_relevant_docs,
}

eval_dataset = {
    'queries': eval_queries,
    'corpus': eval_corpus,
    'relevant_docs': eval_relevant_docs,
}

In [None]:
with open(TRAIN_DATASET_FPATH, 'w+') as f:
    json.dump(train_dataset, f)

with open(EVAL_DATASET_FPATH, 'w+') as f:
    json.dump(eval_dataset, f)

### Fine-tuning Our Embeddings Model

Finally, the set up is complete - and we can move on to fine-tuning our sentence transformer embedding model!

The process is simplified considerably by how amazing the Hugging Face `sentence-transformer` library is, so let's jump straight in!

In [None]:
!pip install -U -q sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer

We're going to use the `BAAI/bge-small-en` embedding model as an example, but you could use any of the `sentence-transformer` embeddings models.

In [None]:
model_id = "BAAI/bge-small-en"
model = SentenceTransformer(model_id)

In [None]:
model

Let's load our data into the desired format!

In [None]:
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

In [None]:
TRAIN_DATASET_FPATH = './data/train_dataset.json'
VAL_DATASET_FPATH = './data/eval_dataset.json'

In [None]:
with open(TRAIN_DATASET_FPATH, 'r+') as f:
    train_dataset = json.load(f)

with open(VAL_DATASET_FPATH, 'r+') as f:
    val_dataset = json.load(f)

In [None]:
dataset = train_dataset

corpus = dataset['corpus']
queries = dataset['queries']
relevant_docs = dataset['relevant_docs']

examples = []
for query_id, query in queries.items():
    node_id = relevant_docs[query_id][0]
    text = corpus[node_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

We're going to be leveraging the `MultipleNegativesRankingLoss` from `sentence_transformers` as our loss function.

You can read more about it in the docs, [here](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss).

Note that there is [research](https://arxiv.org/pdf/1705.00652.pdf) that indicates that performance generally scales with `BATCH_SIZE`, but we're going to stick with an arbitrary 10 for the example in the notebook.
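
As a rough intuition for this loss: within a batch, each query's own passage is the positive and every other passage in the batch acts as a negative, and the loss is a cross-entropy over scaled similarities. The sketch below computes that quantity in plain Python from a precomputed similarity matrix (the function name, the toy matrices, and `scale=20.0` are assumptions for illustration; this is not the library's implementation):

```python
import math

def in_batch_ranking_loss(similarities, scale=20.0):
    """Mean cross-entropy where row i's correct 'class' is column i.

    similarities[i][j] is the cosine similarity between query i and passage j.
    """
    losses = []
    for i, row in enumerate(similarities):
        logits = [scale * s for s in row]
        # Numerically stable log-sum-exp over the row
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Hypothetical batch of 3 (query, passage) pairs.
well_separated = [
    [0.90, 0.10, 0.20],
    [0.00, 0.80, 0.10],
    [0.20, 0.10, 0.95],
]
confused = [
    [0.30, 0.40, 0.35],
    [0.45, 0.30, 0.40],
    [0.35, 0.40, 0.30],
]

print(in_batch_ranking_loss(well_separated))
print(in_batch_ranking_loss(confused))
```

A larger batch gives every query more negatives to be ranked against, which is one way to read the batch size observation above.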

In [None]:
from sentence_transformers import losses

In [None]:
loss = losses.MultipleNegativesRankingLoss(model)

In [None]:
BATCH_SIZE = 10

loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

We'll set up the `InformationRetrievalEvaluator` to determine performance during training.

In [None]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

In [None]:
dataset = val_dataset

corpus = dataset['corpus']
queries = dataset['queries']
relevant_docs = dataset['relevant_docs']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

You could train for more epochs here, but for the example in the Notebook, we'll stick with 10.

In [None]:
EPOCHS = 10

Nothing left to do but #trainthatmodel!

In [None]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='exp_finetune',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50,
)
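
The `warmup_steps` value above works out to 10% of the total number of training steps. As a quick worked example, assuming a hypothetical 1,000 training pairs (your count will depend on how many questions were generated):

```python
import math

num_examples = 1_000   # hypothetical number of (query, passage) pairs
batch_size = 10
epochs = 10

# A DataLoader with the default drop_last=False yields ceil(N / batch_size) batches
batches_per_epoch = math.ceil(num_examples / batch_size)
total_steps = batches_per_epoch * epochs
warmup_steps = int(batches_per_epoch * epochs * 0.1)

print(batches_per_epoch, total_steps, warmup_steps)
```

During those first steps the learning rate ramps up gradually instead of starting at its full value.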

### Evaluating our Embeddings Models

Now that we've fine-tuned our embedding model on our data - let's see how it performs compared to the base embeddings, and OpenAI's `ada` embeddings!

In [None]:
import json
from tqdm.notebook import tqdm
import pandas as pd

from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.schema import TextNode
from llama_index.embeddings.openai import OpenAIEmbedding

In [None]:
TRAIN_DATASET_FPATH = './data/train_dataset.json'
EVAL_DATASET_FPATH = './data/eval_dataset.json'

In [None]:
with open(TRAIN_DATASET_FPATH, 'r+') as f:
    train_dataset = json.load(f)

with open(EVAL_DATASET_FPATH, 'r+') as f:
    eval_dataset = json.load(f)

We're going to leverage a "hit rate" for our evaluation.

Basically what it says on the tin, "hit rate" is just a measure of how often we retrieve the correct relevant document.

Since we have query/relevant document pairs, we can calculate this metric fairly easily.

If the top-k retrieved results contain the correct context for our query - we hit!
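
As a minimal sketch of the metric itself, here is the calculation on a handful of made-up retrieval results (the query IDs, document IDs, and the `hit_rate` helper are hypothetical and only there to show the arithmetic):

```python
def hit_rate(retrieved_by_query, expected_by_query):
    """Fraction of queries whose expected doc appears in the retrieved list."""
    hits = [
        expected_by_query[query_id] in retrieved_ids
        for query_id, retrieved_ids in retrieved_by_query.items()
    ]
    return sum(hits) / len(hits)

# Hypothetical top-2 retrieval results for four queries.
retrieved = {
    "q1": ["doc_a", "doc_b"],
    "q2": ["doc_c", "doc_a"],
    "q3": ["doc_b", "doc_d"],
    "q4": ["doc_c", "doc_b"],
}
# The single relevant document for each query.
expected = {"q1": "doc_a", "q2": "doc_b", "q3": "doc_d", "q4": "doc_c"}

print(hit_rate(retrieved, expected))
```

Here three of the four queries find their document in the top 2, so the hit rate is 0.75.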

In [None]:
def evaluate(
    dataset,
    embed_model,
    top_k=2,
    verbose=False,
):
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']

    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(
        nodes,
        show_progress=True,
        embed_model=embed_model
    )
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            'is_hit': is_hit,
            'retrieved': retrieved_ids,
            'expected': expected_id,
            'query': query_id,
        }
        eval_results.append(eval_result)
    return eval_results

#### 🏗️ Activity #4:

Describe what the `evaluate` function is doing in the above cell in natural language.

In [None]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer

def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']

    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)
    model = SentenceTransformer(model_id)
    return evaluator(model, output_path="/content/")

In [None]:
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

### Ada Results

We'll compare our results against OpenAI's `ada` model, so we'll need to load it up!

In [None]:
ada = OpenAIEmbedding(model="text-embedding-ada-002")
ada_val_results = evaluate(eval_dataset, ada)

In [None]:
df_ada = pd.DataFrame(ada_val_results)

In [None]:
hit_rate_ada = df_ada['is_hit'].mean()
hit_rate_ada

### Base Embeddings Model Results

In [None]:
bge = "local:BAAI/bge-small-en"
bge_val_results = evaluate(eval_dataset, bge)

In [None]:
df_bge = pd.DataFrame(bge_val_results)

In [None]:
hit_rate_bge = df_bge['is_hit'].mean()
hit_rate_bge

In [None]:
evaluate_st(eval_dataset, "BAAI/bge-small-en", name='bge')

### Fine-tuned Results

In [None]:
finetuned = "local:exp_finetune"
eval_results_finetuned = evaluate(eval_dataset, finetuned)

In [None]:
df_finetuned = pd.DataFrame(eval_results_finetuned)

In [None]:
hit_rate_finetuned = df_finetuned['is_hit'].mean()
hit_rate_finetuned

In [None]:
evaluate_st(eval_dataset, "exp_finetune", name='finetuned')

### Conclusion

Now we can compare the 3 embeddings models to see which performed the best!

In [None]:
df_ada['model'] = 'ada'
df_bge['model'] = 'bge'
df_finetuned['model'] = 'fine_tuned'

In [None]:
df_all = pd.concat([df_ada, df_bge, df_finetuned])
# Select the boolean 'is_hit' column first, then average it per model
df_all.groupby('model')['is_hit'].mean()

#### 🏗️ Activity #5:

Determine the difference between the two types of embedding models' dimensions.

- `text-embedding-ada-002` dimension: `ENTER DIMENSION HERE`
- BGE Small dimension: `ENTER DIMENSION HERE`

What does that communicate about our fine-tuning process, in your own words?

In [None]:
df_st_bge = pd.read_csv('/content/Information-Retrieval_evaluation_bge_results.csv')
df_st_finetuned = pd.read_csv('/content/Information-Retrieval_evaluation_finetuned_results.csv')

In [None]:
df_st_bge['model'] = 'bge'
df_st_finetuned['model'] = 'fine_tuned'
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index('model')
df_st_all

Hopefully through this process you can see just how powerful this technique is!