# GenAI/RAG in Python 2025

## Session 04. Embeddings

In [2]:
import os
import numpy as np
import pandas as pd

### 1. Sentence Transformers

Required: `pip install sentence-transformers`

In [3]:
df = pd.read_csv("_data/italian_recipes_clean.csv")

In [4]:
df

Unnamed: 0,title,receipt
0,BROTH OR SOUP STOCK,(Brodo) To obtain good broth the meat must be ...
1,BREAD SOUP,(Panata) This excellent and nutritious soup is...
2,GNOCCHI,"This is an excellent soup, but as it requires ..."
3,VEGETABLE SOUP,(Zuppa Sante) Any kind of vegetables may be us...
4,QUEEN'S SOUP,(Zuppa Regina) This is made with the white mea...
...,...,...
215,LEMON ICE,"(Gelato di limone) Granulated sugar, 3/4 lb. W..."
216,STRAWBERRY ICE,"(Gelato di fragola) Ripe strawberries, 3/4 lb...."
217,ORANGE ICE,(Gelato di aranci) Four big oranges. One lemon...
218,PISTACHE ICE CREAM,"(Gelato di pistacchi) Milk, one quart. Sugar, ..."


**Behind the scenes:** The model `all-MiniLM-L6-v2` is a distilled transformer that produces a 384-dimensional embedding for any given sentence or paragraph. Sentence-Transformers handles all the preprocessing (like tokenization) and the heavy lifting of the neural network internally. The resulting embeddings can be used for similarity comparisons, clustering, etc. The library abstracts away the complexity, letting us get embeddings in just a few lines of code.

In [5]:
# Import the SentenceTransformer class from the sentence_transformers library
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model (this will download the model if not cached)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Use the model to encode each recipe's text into an embedding vector
# We'll create a new column 'embedding' with the resulting list of floats for each recipe.
df['embedding'] = df['receipt'].apply(lambda text: model.encode(text).tolist())

# Let's print the first recipe's embedding (truncated) and its length to verify
print("Sample embedding for first recipe (truncated to 10 dims):", df['embedding'][0][:10], "...")
print("Embedding length:", len(df['embedding'][0]))

Sample embedding for first recipe (truncated to 10 dims): [-0.09900078177452087, -0.003846808336675167, 0.01569218561053276, 0.026051921769976616, -0.07733330875635147, -0.049046289175748825, -0.007966993376612663, -0.003884089644998312, 0.01709478534758091, -0.12944228947162628] ...
Embedding length: 384


In [6]:
df

Unnamed: 0,title,receipt,embedding
0,BROTH OR SOUP STOCK,(Brodo) To obtain good broth the meat must be ...,"[-0.09900078177452087, -0.003846808336675167, ..."
1,BREAD SOUP,(Panata) This excellent and nutritious soup is...,"[-0.05840875208377838, 0.019056806340813637, 0..."
2,GNOCCHI,"This is an excellent soup, but as it requires ...","[-0.03781914338469505, -0.01353275217115879, -..."
3,VEGETABLE SOUP,(Zuppa Sante) Any kind of vegetables may be us...,"[-0.09221702814102173, 0.09501173347234726, -0..."
4,QUEEN'S SOUP,(Zuppa Regina) This is made with the white mea...,"[-0.07619944959878922, -0.03227389231324196, -..."
...,...,...,...
215,LEMON ICE,"(Gelato di limone) Granulated sugar, 3/4 lb. W...","[-0.05859484523534775, -0.022969068959355354, ..."
216,STRAWBERRY ICE,"(Gelato di fragola) Ripe strawberries, 3/4 lb....","[-0.016838310286402702, -0.03356937691569328, ..."
217,ORANGE ICE,(Gelato di aranci) Four big oranges. One lemon...,"[-0.02825223095715046, 0.030553115531802177, -..."
218,PISTACHE ICE CREAM,"(Gelato di pistacchi) Milk, one quart. Sugar, ...","[-0.015352049842476845, -0.05267338082194328, ..."


In [7]:
type(df['embedding'][0])

list

In [8]:
len(df['embedding'][0])

384
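Since the embeddings are stored as plain Python lists, the similarity comparison mentioned earlier can be sketched without any extra libraries. The block below is a minimal illustration, not part of the original notebook: the helper names `cosine_similarity` and `most_similar` are hypothetical, and the three-dimensional toy vectors stand in for the real 384-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def most_similar(query, embeddings, top_k=3):
    """Return (index, score) pairs for the top_k embeddings closest to the query."""
    scores = [(i, cosine_similarity(query, emb)) for i, emb in enumerate(embeddings)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]

# Toy 3-dimensional stand-ins for the recipe embeddings (made-up values)
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]

ranked = most_similar(toy_embeddings[0], toy_embeddings, top_k=2)
print(ranked)
```

With the DataFrame above, a call such as `most_similar(df['embedding'][0], df['embedding'].tolist())` would rank all recipes against the first one; the query itself comes back first with a score of 1.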

About Sentence Transformers:

- [SentenceTransformers Documentation](https://www.sbert.net/)
- [Sentence Transformers models on the Hugging Face Hub](https://huggingface.co/models?library=sentence-transformers): 17,059 embedding models at the time of writing, and growing

### 2. spaCy (Pre-trained GloVe Vectors)

About this method: [spaCy](https://spacy.io/) is a popular open-source NLP library in Python.

Among its many features (part-of-speech tagging, named entity recognition, etc.), spaCy ships pre-trained word vectors with some of its models. We will use spaCy's English medium model (`en_core_web_md`), which has 300-dimensional GloVe vectors for words. spaCy can provide a vector for an entire document (in our case, a recipe's text) by averaging the vectors of the tokens in the text.

This is a simpler, more lightweight approach than transformers. It may not capture context as well as Sentence-Transformers, but it's fast and easy to use.
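The averaging idea can be sketched in plain Python. This is an illustration under stated assumptions rather than spaCy's actual implementation: `average_vectors` and the two-dimensional `toy_vectors` are hypothetical, and treating unknown words as zero vectors is a choice made for this sketch.

```python
def average_vectors(tokens, word_vectors, dim):
    """Average the vectors of all tokens; unknown tokens count as zero vectors."""
    if not tokens:
        return [0.0] * dim
    total = [0.0] * dim
    for token in tokens:
        vec = word_vectors.get(token, [0.0] * dim)
        for i, value in enumerate(vec):
            total[i] += value
    return [value / len(tokens) for value in total]

# Hypothetical 2-dimensional word vectors (the spaCy model uses 300 dimensions)
toy_vectors = {
    "tomato": [1.0, 0.0],
    "sauce": [0.0, 1.0],
}

doc_vector = average_vectors(["tomato", "sauce"], toy_vectors, dim=2)
reversed_vector = average_vectors(["sauce", "tomato"], toy_vectors, dim=2)
print(doc_vector, reversed_vector)
```

Both calls give the same result, which shows why an averaged representation loses word order and context: the document vector depends only on which words occur, not on how they are arranged.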

Required installation: we need to install spacy and download the English model.

```
pip install spacy
python -m spacy download en_core_web_md
```

In [9]:
import spacy

# Load the medium English model in spaCy (this takes a moment to load the model into memory)
nlp = spacy.load("en_core_web_md")

# Define a function that returns the document vector for a given text
def get_doc_embedding(text):
    doc = nlp(text)                # Process the text with spaCy (tokenization, etc.)
    vector = doc.vector            # The document's vector (average of token vectors for this model)
    return vector.tolist()         # Convert the vector (NumPy) to a list of floats

# Apply this function to each recipe in the DataFrame to create a new 'embedding' column
df['embedding'] = df['receipt'].apply(get_doc_embedding)

# Print an example embedding and its length for verification
print("Sample spaCy embedding for first recipe (10 dims):", df['embedding'][0][:10], "...")
print("Embedding length:", len(df['embedding'][0]))

Sample spaCy embedding for first recipe (10 dims): [-0.6954594254493713, 0.2128593772649765, -0.14669853448867798, -0.03319728374481201, -0.07744687795639038, 0.11704789847135544, -0.0851106271147728, -0.11118240654468536, 0.05072933807969093, 1.7072911262512207] ...
Embedding length: 300


In [10]:
type(df['embedding'][0])

list

In [11]:
len(df['embedding'][0])

300

### 3. Gensim Doc2Vec

In this approach, we'll create embeddings by training our own `Doc2Vec` model using the [Gensim](https://radimrehurek.com/gensim/) library. Gensim is a toolkit for topic modeling and vector space algorithms (it includes Word2Vec, Doc2Vec, LDA, etc.). Doc2Vec (also known as "Paragraph Vector") is an algorithm introduced by Le and Mikolov (2014) that learns vector representations for entire documents, not just individual words. It is an extension of `Word2Vec` to larger text segments. The idea is to train a neural network on our corpus such that each document is assigned a vector that helps predict its words. After training, documents with similar content should end up with similar vectors in this learned vector space.

Unlike the previous two methods, here we will **train a model on our specific dataset** (the Italian recipes). This means the embeddings might capture themes specific to our corpus (e.g., ingredients or cooking terms) but it also means we need to do a bit of setup and the results can vary based on training parameters. We'll keep it simple and use a small vector size for demonstration.

Required installation: we need gensim for this method.

```
pip install gensim
```

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Prepare training data for Doc2Vec:
documents = []
for idx, text in enumerate(df['receipt']):
    # Split the text into words (simple tokenization on whitespace)
    words = text.split()  
    # Tag each document with a unique ID (here we use the index)
    documents.append(TaggedDocument(words=words, tags=[idx]))

# Initialize and train the Doc2Vec model on our documents
# We'll use a small vector size (50 dimensions) for speed and train for 40 epochs.

model = Doc2Vec(vector_size=50, window=5, min_count=2, workers=4, epochs=40)
model.build_vocab(documents)          # Build vocabulary from our data
model.train(documents, total_examples=len(documents), epochs=model.epochs)

# Once trained, the model.dv (document vectors) holds the learned embeddings for each document.
# Let's collect the vectors for each recipe in order:
doc_vectors = [model.dv[idx] for idx in range(len(documents))]

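# Convert each vector (NumPy array) to a list of plain Python floats, then add to DataFrame
# (.tolist() yields Python floats, whereas list(vec) would keep NumPy float32 scalars)
doc_vectors = [vec.tolist() for vec in doc_vectors]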
df['embedding'] = doc_vectors

# Print the first recipe's embedding (first 10 dims) and length
print("Sample Doc2Vec embedding for first recipe (10 dims):", df['embedding'][0][:10], "...")
print("Embedding length:", len(df['embedding'][0]))

In [None]:
model.dv[0]

What the code does:

- We prepare the data for training. Gensim's `Doc2Vec` expects its input as a list of `TaggedDocument` objects, where each `TaggedDocument` is essentially a list of words plus a tag (an ID) for the document. We loop through each recipe text and split it into words on whitespace. This is a very simple tokenization: for better results one might use a smarter tokenizer that handles punctuation, lowercasing, etc., but a plain split suffices for our example. We tag each document with its index `idx`.

- We initialize a `Doc2Vec` model. **Key parameters:**

- `vector_size=50`: Sets the embedding dimensionality to 50. (You can increase this to capture more complex patterns, but 50 is fine for demonstration.)
- `window=5`: The context window size (how many words before/after to consider in the training context for predicting words).
- `min_count=2`: Ignore words that appear fewer than 2 times (this filters out very rare words).
- `workers=4`: Number of parallel threads to use (adjust based on your CPU cores).
- `epochs=40`: How many iterations (epochs) to train for. More epochs means more training but also more time.

- We call `build_vocab(documents)` to prepare the vocabulary.

- Then we train the model on our documents with `model.train(...)`. This may take a few seconds depending on corpus size and parameters. (Our dataset is small, roughly 220 recipes, so this should be quite fast even on a CPU.)

- After training, `model.dv` contains the learned document vectors. We extract each vector by index and store them in a list `doc_vectors`.

- We convert each vector to a regular list of floats (Gensim returns a NumPy array for each vector).

- We assign this list of lists to `df['embedding']`. Now each recipe in the DataFrame has a 50-dimensional embedding learned specifically from our dataset.

- We print the first embedding (truncated) and its length to verify it's 50.
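The walkthrough above mentions that a smarter tokenizer and the `min_count` threshold both affect which words the model sees. The block below is a minimal sketch of both ideas using only the standard library. The helper names `simple_tokenize` and `frequent_words` are hypothetical, the two sentences are made up for illustration, and the filter only mimics the effect of `min_count`; it is not Gensim's own code.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def frequent_words(token_lists, min_count=2):
    """Return the words that occur at least min_count times across all documents."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {word for word, count in counts.items() if count >= min_count}

# Made-up sentences, not taken from the recipe dataset
texts = [
    "Good broth, good soup!",
    "Skim the broth; serve the soup hot.",
]

naive = texts[0].split()                       # punctuation stays attached to words
tokenized = [simple_tokenize(t) for t in texts]
kept = sorted(frequent_words(tokenized, min_count=2))

print(naive)
print(tokenized[0])
print(kept)
```

With a plain `split()`, `broth,` and `broth` count as different words and `Good` differs from `good`, so fewer words reach the `min_count` threshold. To try this in the notebook, `words = simple_tokenize(text)` could replace `words = text.split()` in the training loop.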

In [None]:
type(df['embedding'][0])

In [None]:
len(df['embedding'][0])