# GenAI/RAG in Python 2025

## Session 02. Expanding the Basic RAG Framework

In [None]:
import os
import numpy as np
import pandas as pd
from openai import OpenAI

## 1. Let's grab some text... Italian cuisine, for example?

In [None]:
# Path to the CSV file
file_path = "_data/italian_recipes_clean.csv"

# Load the CSV into a Pandas DataFrame
df = pd.read_csv(file_path)

# Display some basic information
print(df.info())
print(df.head())

In [None]:
df

In [None]:
df["receipt"][0]

### We would like to build a system that...

(1) takes user input in the form of a question (e.g. "I'd like to cook something with carrots"), (2) performs a similarity search across the recipes in the `df` DataFrame, (3) retrieves and lists the five most similar recipes, and (4) combines them with a prompt sent to ChatGPT to shape the final response that is shared with the user.

## 2. Vector embeddings for similarity search

Each recipe must be converted into an embedding vector before we can run a similarity search over the collection.

We will use OpenAI's embedding models this time.

In [None]:
# Set your API key (ensure OPENAI_API_KEY is set in your environment)
api_key = os.getenv("OPENAI_API_KEY")

# Instantiate the OpenAI client with your API key  
client = OpenAI(api_key=api_key)                    

# Select the embedding model to use (as per OpenAI docs)  
model_name = "text-embedding-3-small"      

# Prepare a list to collect embedding vectors  
embeddings = []                            

# Iterate over each row in your DataFrame `df`  
for idx, row in df.iterrows():
    # Grab the recipe text for this row (the column is named "receipt")
    text = row["receipt"]  
    # If it's not a valid string, skip embedding  
    if not isinstance(text, str) or text.strip() == "":  
        embeddings.append(None)             
        continue                            

    # Call the embeddings endpoint on the client  
    resp = client.embeddings.create(        
        model=model_name,                   
        input=[text]                        
    )                                     

    # Extract the embedding vector from the response object  
    emb = resp.data[0].embedding            

    # Append that embedding vector to our list  
    embeddings.append(emb)                  

# After the loop, assign embeddings list to a new DataFrame column  
df["embedding"] = embeddings               

# Show first few rows to verify  
df.head()      

In [None]:
type(df['embedding'][0])

In [None]:
len(df['embedding'][0])
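
The loop above sends one request per recipe. The embeddings endpoint also accepts a list of inputs, so the calls can be grouped into batches. Below is a minimal sketch of that idea, not the course's method. It assumes a hypothetical `embed_fn` callable that takes a list of strings and returns one vector per string in the same order; `fake_embed` is a made-up offline stand-in so the example runs without an API key.

```python
from typing import Callable, List, Optional, Sequence


def embed_in_batches(
    texts: Sequence[object],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[Optional[List[float]]]:
    """Embed valid strings in batches; keep None for empty or non-string rows."""
    results: List[Optional[List[float]]] = [None] * len(texts)
    # Positions of the rows that actually contain text
    valid = [i for i, t in enumerate(texts) if isinstance(t, str) and t.strip()]
    for start in range(0, len(valid), batch_size):
        idx_batch = valid[start:start + batch_size]
        vectors = embed_fn([texts[i] for i in idx_batch])
        # Write each vector back to the position of its source row
        for i, vec in zip(idx_batch, vectors):
            results[i] = vec
    return results


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for an API call: one tiny vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in batch]


demo = embed_in_batches(["pasta al pomodoro", "", None, "risotto"], fake_embed, batch_size=2)
print(demo)
```

In the notebook, `embed_fn` could wrap the existing client, for example `lambda batch: [d.embedding for d in client.embeddings.create(model=model_name, input=batch).data]`, and the result could be assigned to `df["embedding"]` as before.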

## 3. Now we need a user input...

In [None]:
user_text = """
Hi! I’d like to cook a good Italian dish for lunch! I have potatoes, carrots, 
rosemary, and pork. Can you recommend a recipe and help me a bit with 
preparation tips?
"""

... and of course we need an embedding of `user_text` as well:

In [None]:
resp = client.embeddings.create(        
        model=model_name,                   
        input=[user_text]                        
    )
user_query = resp.data[0].embedding

print(type(user_query))
print(len(user_query))

## 4a. Find the most suitable examples that match the user input: Cosine Distance (Similarity)

In [None]:
# scipy has a function to compute cosine distance: cosine()
from scipy.spatial.distance import cosine

# Compute similarity scores: similarity = 1 − cosine_distance
scores = []
for emb in df["embedding"]:
    if emb is None:
        scores.append(-1.0)
    else:
        # scipy accepts any array-like input; converting to np.array
        # simply makes the vector type explicit
        scores.append(1.0 - cosine(np.array(emb), np.array(user_query)))

# Get the indices of the top 5 scores, best match first
top5 = np.argsort(scores)[-5:][::-1]
# N.B. np.argsort(scores) returns the array of indices that would
# sort scores in ascending order.
# [-5:] takes the last 5 indices of that array, i.e. the indices of
# the 5 highest scores.
# [::-1] reverses them so that the best match comes first.

# Build a single output string with titles and recipes
output_lines = []
for i in top5:
    title = df.iloc[i]["title"]
    recipe = df.iloc[i]["receipt"]
    output_lines.append(f"{title}:\n{recipe}")
prompt_recipes = "\n\n".join(output_lines)

print(prompt_recipes)

**Cosine similarity** is the cosine of the angle $\theta$ between two vectors:

$$
\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\;\|\mathbf{b}\|}
= \frac{\sum_{i=1}^n a_i\,b_i}{\sqrt{\sum_{i=1}^n a_i^2}\;\sqrt{\sum_{i=1}^n b_i^2}}
$$

The corresponding **cosine distance** is defined as one minus the similarity:

$$
d_{\text{cos}}(\mathbf{a},\mathbf{b}) = 1 - \cos\theta
$$

- In text / embedding applications, higher cosine similarity (or lower cosine distance) means vectors are more semantically aligned.
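
The per-row loop used for scoring can also be written as a single matrix operation. The sketch below is one possible vectorized version on made-up toy vectors. The function name `top_k_cosine` is hypothetical, and it assumes no row is the zero vector, since that would cause a division by zero.

```python
import numpy as np


def top_k_cosine(matrix: np.ndarray, query: np.ndarray, k: int = 5):
    """Return (indices, scores) of the k rows most similar to query, best first."""
    # Dot product of every row with the query, divided by the product of the norms
    row_norms = np.linalg.norm(matrix, axis=1)
    sims = (matrix @ query) / (row_norms * np.linalg.norm(query))
    order = np.argsort(sims)[::-1][:k]  # indices sorted by descending similarity
    return order, sims[order]


# Toy data (made up for illustration): four 3-dimensional "embeddings"
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.2, 0.0])

idx, scores = top_k_cosine(docs, query, k=2)
print(idx, scores)
```

With real data, the matrix could be built with `np.vstack` from the non-missing entries of `df["embedding"]`.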



## 4b. More distance/similarity metrics

In [None]:
from scipy.spatial import distance

### *Cosine* 

$$
\text{sim}_{\cos}(\mathbf{a}, \mathbf{b}) =
\frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}
$$

In [None]:
a = np.array([0.1, 0.3, 0.5, 0.0])
b = np.array([0.2, 0.1, 0.4, 0.3])

# distance.cosine() returns the cosine *distance*, so subtract it from 1
cos_sim = 1.0 - distance.cosine(a, b)

print("Cosine similarity:", cos_sim)

---

### *Understanding Cosine Similarity*

Cosine similarity is one of the most popular ways to measure how **similar two documents (or vectors)** are — especially in **text retrieval** or **semantic search**.

---

### Intuition

Imagine every document as a **point** (or arrow) in a multi-dimensional space.  If two documents use **similar words in similar proportions**, their arrows point in roughly the same direction — even if one is longer (has more words). Cosine similarity measures **how close their directions are**, not their lengths.
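
A small numeric check of this intuition, using made-up term counts: the second "document" has the same word proportions as the first but is three times longer. The cosine similarity stays at 1, while the straight-line (Euclidean) distance between the two points does not.

```python
import numpy as np
from scipy.spatial import distance

short_doc = np.array([1.0, 2.0, 0.0])   # made-up term counts
long_doc = 3 * short_doc                 # same proportions, three times longer

cos_sim = 1.0 - distance.cosine(short_doc, long_doc)
euc_dist = distance.euclidean(short_doc, long_doc)

print("cosine similarity:", round(cos_sim, 6))
print("Euclidean distance:", round(euc_dist, 6))
```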

---

### The Formula

For two vectors **a** and **b**:

$$
\text{similarity}_{\cos}(\mathbf{a}, \mathbf{b}) =
\frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}
$$

Where:
- $\,\mathbf{a}\cdot\mathbf{b}\,$ is the **dot product** — how much the two vectors align.
- $\,\lVert \mathbf{a}\rVert$ and $\,\lVert \mathbf{b}\rVert$ are their **magnitudes (lengths)**.
- The result is a number between **−1 and 1**; in most IR applications with non-negative vectors (e.g. term counts) it ranges from **0 to 1**:
  - **1 →** same direction (the same terms in the same proportions)
  - **0 →** orthogonal (nothing in common)
  - **values in between →** partial similarity

In `scipy.spatial.distance`, the function `distance.cosine(a, b)` returns the **cosine *distance*** — not the similarity.

$$
d_{\cos}(\mathbf{a}, \mathbf{b}) = 1 - 
\frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}
$$

- `distance.cosine(a, b)` → **distance** (0 means same direction, 1 means orthogonal)  
- `1 - distance.cosine(a, b)` → **similarity** (1 means same direction, 0 means orthogonal)

In short:
> **Cosine distance** measures *how far apart* two vectors are.  
> **Cosine similarity** measures *how aligned* they are.

**Range:**
- For **non-negative vectors** the cosine similarity ∈ [0, 1], so the **distance ∈ [0, 1]**.  
- If vectors can have **negative components**, similarity ∈ [−1, 1], so **distance ∈ [0, 2]**.
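
The ranges above can be checked on three hand-picked pairs of 2-dimensional vectors (made up for illustration): one pointing the same way, one orthogonal, and one pointing the opposite way.

```python
import numpy as np
from scipy.spatial import distance

v = np.array([1.0, 2.0])
cases = {
    "same direction": 2 * v,
    "orthogonal": np.array([-2.0, 1.0]),
    "opposite direction": -v,
}

# Cosine distance from v to each of the three vectors
cos_dist = {name: distance.cosine(v, w) for name, w in cases.items()}
for name, d in cos_dist.items():
    print(f"{name}: distance = {d:.3f}, similarity = {1 - d:.3f}")
```

The opposite-direction pair is only possible because one vector has negative components, which is why the distance can reach 2 in that case.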

---

### *Euclidean Distance*

$$
d_{\text{Euc}}(\mathbf{a}, \mathbf{b}) =
\sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$

In [None]:
euc_dist = distance.euclidean(a, b)
print("Euclidean distance:", euc_dist)

### *Understanding Euclidean Distance (L2 norm)*

Euclidean distance is the most familiar way to measure how **far apart** two things are —  
it’s the same idea as measuring the straight-line distance between two points in space.

---

### Intuition

Imagine two documents, or two data points, as **positions in space**.  
Each feature (like a word weight or an embedding dimension) is one coordinate axis.  
The Euclidean distance tells us how long the straight line is between these two points.

If two points are close together, their values are similar.  
If they’re far apart, the difference between them is large.

---

### The Formula

For two vectors **a** and **b** with $n$ components:

$$
d_{\text{Euc}}(\mathbf{a}, \mathbf{b}) =
\sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
$$

---

### Breaking it Down

- $\,a_i - b_i\,$ measures how much the two vectors differ on each coordinate.  
- Squaring each difference, $(a_i - b_i)^2$, makes every term non-negative and gives large differences more weight.  
- Summing them up adds all the little differences together.  
- Taking the square root gives the actual straight-line distance.
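
The four steps can be traced one by one with NumPy, using the same example vectors `a` and `b` as in the cells above:

```python
import numpy as np

a = np.array([0.1, 0.3, 0.5, 0.0])
b = np.array([0.2, 0.1, 0.4, 0.3])

diff = a - b                 # coordinate-wise differences
squared = diff ** 2          # squares are non-negative
total = squared.sum()        # sum of squared differences
euc_manual = np.sqrt(total)  # straight-line distance

print("differences:", diff)
print("squared:", squared)
print("sum:", total)
print("distance:", euc_manual)
```

The final value should agree with `distance.euclidean(a, b)` up to floating-point rounding.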

---

### Key Idea

- **Small distance →** the two items are very similar.  
- **Large distance →** they are quite different.  

### *Manhattan Distance*

$$
d_{\text{Man}}(\mathbf{a}, \mathbf{b}) =
\sum_{i=1}^{n} |a_i - b_i|
$$

In [None]:
man_dist = distance.cityblock(a, b)
print("Manhattan distance:", man_dist)

### *Understanding Manhattan Distance (L1 norm)*

Manhattan distance measures how far apart two vectors are by **summing the absolute differences** of their coordinates —  like moving through a city grid where you can only go along streets, not diagonally.

$$
d_{\text{Man}}(\mathbf{a}, \mathbf{b}) =
\sum_{i=1}^{n} |a_i - b_i|
$$

**Range:**
- Minimum: **0** (the vectors are identical).  
- No fixed maximum — it grows with the number of dimensions and the scale of the data.  
- If features are normalized to [0, 1], the maximum distance is the **number of dimensions** *n*.
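
A short check of both statements, reusing the example vectors from above. The second part builds the extreme case for features in [0, 1], where every coordinate differs by exactly 1.

```python
import numpy as np

a = np.array([0.1, 0.3, 0.5, 0.0])
b = np.array([0.2, 0.1, 0.4, 0.3])

# Sum of absolute coordinate differences: 0.1 + 0.2 + 0.1 + 0.3
man_manual = np.abs(a - b).sum()

# Extreme case for features in [0, 1]: all zeros versus all ones
n = 4
man_max = np.abs(np.zeros(n) - np.ones(n)).sum()

print("Manhattan distance:", man_manual)
print("Maximum for n =", n, "normalized features:", man_max)
```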

### *Jaccard Similarity (for discrete features)*

$$
\text{sim}_{\text{Jaccard}}(A, B) =
\frac{|A \cap B|}{|A \cup B|}
$$

In [None]:
a_bin = np.array([1, 0, 1, 0])
b_bin = np.array([1, 1, 0, 0])
jac_sim = 1 - distance.jaccard(a_bin, b_bin)
print("Jaccard similarity:", jac_sim)

### *Understanding Jaccard Similarity (for discrete features)*

The **Jaccard similarity** measures how much two sets (or binary feature vectors) **overlap**.

It’s ideal for comparing **discrete data**, such as:

- tags assigned to items,
- binary attributes (e.g., feature present / not present).

---

### The Formula

For two sets $A$ and $B$:

$$
\text{similarity}_{\text{Jaccard}}(A, B) =
\frac{|A \cap B|}{|A \cup B|}
$$

- $|A \cap B|$ — number of shared elements (the overlap)  
- $|A \cup B|$ — number of unique elements across both sets  

The result is a value between **0 and 1**:
- **1 →** sets are identical  
- **0 →** sets share nothing in common  

---

### Distance vs. Similarity

The **Jaccard distance** is simply the complement of the similarity:

$$
d_{\text{Jaccard}}(A, B) = 1 - \text{similarity}_{\text{Jaccard}}(A, B)
$$

So:
- **Similarity** → how much two sets *share*  
- **Distance** → how much they *differ*  

In short:
> **Jaccard similarity** measures the share of overlap.  
> **Jaccard distance** measures the share of non-overlap.
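
To connect the set formula with the binary vectors used earlier, the sketch below treats each vector as the set of positions where the feature is present and compares the set-based result with scipy's.

```python
import numpy as np
from scipy.spatial import distance

a_bin = np.array([1, 0, 1, 0], dtype=bool)
b_bin = np.array([1, 1, 0, 0], dtype=bool)

# Treat each vector as the set of positions where the feature is present
A = set(np.flatnonzero(a_bin).tolist())   # {0, 2}
B = set(np.flatnonzero(b_bin).tolist())   # {0, 1}

jac_sim_sets = len(A & B) / len(A | B)    # 1 shared position out of 3 in total
jac_sim_scipy = 1.0 - distance.jaccard(a_bin, b_bin)

print("from sets:", jac_sim_sets)
print("from scipy:", jac_sim_scipy)
```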


## 5. Handling Discrete Features in Information (Document) Retrieval

In [None]:
# Path to the CSV file
file_path = "_data/italian_recipes_features.csv"

# Load the CSV into a Pandas DataFrame
df = pd.read_csv(file_path)

# Display some basic information
print(df.info())

In [None]:
df

### User Input

> “Hey, I’m in the mood for something hearty but not too complicated. I’d love to cook a traditional Italian pasta dish, maybe with a rich tomato sauce, some garlic and olive oil, and a bit of Parmesan on top. I prefer something savory, not sweet — and ideally something that’s cooked on the stove, not baked. Any classic recipes you can recommend?”

In [None]:
user_request = """
Hey, I’m in the mood for something hearty but not too complicated. 
I’d love to cook a traditional Italian pasta dish, maybe with a rich tomato sauce, 
some garlic and olive oil, and a bit of Parmesan on top. 
I prefer something savory, not sweet — and ideally something that’s cooked on the stove, not baked. 
Any classic recipes you can recommend?
"""

### Tag User Input (via LLM)

In [None]:
prompt = f"""
You are given a user's request.  
Based on the request, output ONLY a valid JSON object with the following binary features, 
where each value must be either 0 or 1:

[
  "is_soup_broth", "is_pasta", "is_rice", "is_meat_dish", "is_fish_dish", "is_egg_dish", 
  "is_vegetable_dish", "is_dessert", "contains_pasta", "contains_rice", "contains_meat", 
  "contains_fish_seafood", "contains_egg", "contains_cheese", "contains_tomato", 
  "contains_olive_oil", "contains_garlic", "contains_wine", "contains_herbs",
  "is_boiled", "is_baked", "is_fried", "is_grilled", "is_raw_preparation", 
  "is_sauce_based", "is_slow_cooked", "has_stuffing", "served_with_sauce", "is_soup_like", 
  "is_bread_based", "is_spicy", "is_savory", "is_sweet", "contains_citrus",
  "mentions_region", "mentions_dialect_term", "is_classic_named_dish"
]

Do not include explanations, extra text, or any other formatting—only 
the JSON object with keys and binary (0/1) values for each of the above features.

User request: {user_request}
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful Italian cooking assistant."},
        {"role": "user", "content": prompt}
    ],
    temperature=1,
    # Force JSON response
    response_format={ "type": "json_object" },
    max_tokens=5000
)

tags = response.choices[0].message.content
print(tags)

#### Parse JSON

In [None]:
import json

# --- Step 1: Parse JSON ---
feature_dict = json.loads(tags)
print(feature_dict)

# Keep only the features that exist as columns in df (order matters)
feature_cols = [col for col in feature_dict if col in df.columns]

# --- Step 2: Convert LLM output to vector ---
user_vector = pd.Series(feature_dict, index=feature_cols).astype(int)
print(user_vector)
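
LLM output is not guaranteed to contain exactly the requested keys or clean 0/1 values. Below is a minimal sketch of a more defensive conversion. The short feature list, the sample reply, and the helper name `reply_to_vector` are all hypothetical and serve only as illustration. It assumes that missing keys should count as 0 and that unexpected keys should be ignored.

```python
import json
import pandas as pd

# Hypothetical short feature list and raw reply, for illustration only
expected_features = ["is_pasta", "contains_tomato", "is_baked", "is_sweet"]
raw_reply = '{"is_pasta": 1, "contains_tomato": "1", "is_sweet": 0, "is_vegan": 1}'


def reply_to_vector(raw: str, features: list) -> pd.Series:
    """Parse a JSON reply into a 0/1 Series aligned to `features`."""
    parsed = json.loads(raw)
    values = {}
    for name in features:
        # Missing keys default to 0; anything other than 1 / "1" / True counts as 0
        values[name] = 1 if parsed.get(name, 0) in (1, "1", True) else 0
    return pd.Series(values, index=features, dtype=int)


vec = reply_to_vector(raw_reply, expected_features)
print(vec)
```

Aligning to a fixed feature list, rather than to whatever keys come back, keeps the vector comparable with the columns of `df` even when the reply is incomplete.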

#### Compare to `df` to find five similar examples 

In [None]:
# --- Step 3: Extract binary feature matrix from df ---
feature_matrix = df[feature_cols].astype(int).astype(bool).values
user_bits = user_vector.astype(bool).values

# --- Step 4: Compute Jaccard similarity using scipy ---
# distance.jaccard is defined for boolean vectors and returns distance = 1 - similarity
jaccard_similarities = [1 - distance.jaccard(user_bits, row) for row in feature_matrix]

# --- Step 5: Find Top 5 most similar recipes ---
df['jaccard_similarity'] = jaccard_similarities
top5 = df.nlargest(5, 'jaccard_similarity')

print(top5[['title', 'receipt', 'jaccard_similarity']])
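
For larger tables, the row-by-row list comprehension can be replaced by array operations. The sketch below is one way to do that on made-up data. The helper `jaccard_to_all` is hypothetical, and it adopts the convention (an assumption) that two all-zero vectors have similarity 0.

```python
import numpy as np


def jaccard_to_all(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Jaccard similarity between one binary vector and every row of a binary matrix."""
    q = query.astype(bool)
    m = matrix.astype(bool)
    intersection = (m & q).sum(axis=1)   # features present in both
    union = (m | q).sum(axis=1)          # features present in at least one
    # Convention assumed here: similarity is 0 when the union is empty
    return np.divide(intersection, union, out=np.zeros(len(m)), where=union > 0)


# Made-up binary features for three recipes and one request
recipes = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
])
request = np.array([1, 1, 1, 0])

sims = jaccard_to_all(request, recipes)
print(sims)
```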

## 6. (In)determinism in LLMs: `temperature` and `top_p` parameters 

In [None]:
prompt_recipes = ""

for _, row in top5.iterrows():
    prompt_recipes += f"{row['title'].strip()}\n"
    prompt_recipes += f"{row['receipt'].strip()}\n\n"

prompt_recipes = prompt_recipes.strip()

print(prompt_recipes)

In [None]:
prompt = f"""
You are a helpful Italian cooking assistant.  
Here are some recipe examples I found that may or may not be relevant to the user's request:

{prompt_recipes}

User’s question: "{user_request}"

From the examples above:
1. Determine which recipes are *relevant* to what the user asked and which are not.
2. Discard or ignore irrelevant ones, and focus on relevant ones.
3. For each relevant example, rephrase the recipe in a more narrative, 
conversational style, adding cooking tips, alternative ingredients, variations, 
or suggestions.
4. Then produce a final response to the user: a narrative that weaves 
together those enhanced recipes (titles + steps + tips) in an engaging way.
5. Don't forget to use the original titles of the recipes.
6. Advise on more than one recipe if more than one is relevant!

Do not just list recipes — tell a story, connect to the user's question, 
and use the examples as inspirations, but enhance them.  
Make sure your response is clear, helpful, and focused on what the user wants.
"""


In [None]:
response = client.chat.completions.create(
    model="gpt-4",    # or whichever model you prefer
    messages=[
        {"role": "system", "content": "You are a helpful Italian cooking assistant."},
        {"role": "user", "content": prompt}
    ],
    temperature = 0,
    max_tokens=5000
)

reply_text = response.choices[0].message.content

print(user_request)

print(reply_text)

### Now... increase the heat (`temperature`)!

In [None]:
response = client.chat.completions.create(
    model="gpt-4",    # or whichever model you prefer
    messages=[
        {"role": "system", "content": "You are a helpful Italian cooking assistant."},
        {"role": "user", "content": prompt}
    ],
    temperature = 1.5,
    max_tokens=5000
)

reply_text = response.choices[0].message.content

print(user_request)

print(reply_text)

### Ooops...

In [None]:
response = client.chat.completions.create(
    model="gpt-4",    # or whichever model you prefer
    messages=[
        {"role": "system", "content": "You are a helpful Italian cooking assistant."},
        {"role": "user", "content": prompt}
    ],
    temperature = .75,
    max_tokens=5000
)

reply_text = response.choices[0].message.content

print(user_request)

print(reply_text)

### One more time

In [None]:
response = client.chat.completions.create(
    model="gpt-4",    # or whichever model you prefer
    messages=[
        {"role": "system", "content": "You are a helpful Italian cooking assistant."},
        {"role": "user", "content": prompt}
    ],
    temperature = 1.25,
    max_tokens=5000
)

reply_text = response.choices[0].message.content

print(user_request)

print(reply_text)

### 🔍 Definitions

* **`temperature`**
  This parameter controls how *random or deterministic* the model’s token sampling is. At a low temperature (close to 0), the model becomes nearly deterministic and consistently picks the highest-probability next token. At a higher temperature (around 1 or above), the probability distribution over possible next tokens is flattened, so tokens with lower original probabilities become more likely to be sampled.

  From the OpenAI docs:

  > “What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.”

* **`top_p`** (also known as *nucleus sampling*)
  This parameter sets a threshold on *cumulative probability mass* of tokens to consider. Rather than scaling all probabilities (as temperature does), you sort possible next tokens by their probability, then include the smallest set whose combined probability is ≥ `top_p`. The next token is sampled from that subset. So if `top_p = 0.9`, you consider the tokens that together account for 90 % of the probability mass and ignore the rest (lowest-probability tokens).
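
The flattening effect of `temperature` can be illustrated with the usual textbook formulation: divide the logits by the temperature before applying softmax. The sketch below uses made-up logits for four candidate tokens. It is an illustration of the idea, not a description of how any particular API implements sampling, and it requires a temperature strictly greater than 0.

```python
import numpy as np


def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()   # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()


# Made-up logits for four candidate tokens
logits = np.array([2.0, 1.0, 0.5, -1.0])

p_cold = softmax_with_temperature(logits, 0.2)
p_base = softmax_with_temperature(logits, 1.0)
p_hot = softmax_with_temperature(logits, 1.5)

for name, p in [("T=0.2", p_cold), ("T=1.0", p_base), ("T=1.5", p_hot)]:
    print(name, np.round(p, 3))
```

At the low temperature almost all probability mass sits on the top token, while at the higher temperature the mass spreads out over the alternatives.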

---

### Key Differences and Practical Implications

* **Mechanism**:

  * `temperature` adjusts *how flat or peaked* the probability distribution of next tokens is.
  * `top_p` restricts the *space of possible next tokens* to the most probable subset (cumulative mass) and discards the rest.

* **Effect on output**:

  * A *low* `temperature` tends to yield more predictable, consistent, and “safe” output.
  * A *high* `temperature` yields more varied, creative, and risk-taking output.
  * A *low* `top_p` (e.g., 0.1–0.3) restricts the token selection drastically → very focused/deterministic output.
  * A *high* `top_p` (e.g., 0.8–1.0) allows a broader set of token choices → more diversity.

* **When to use which**:

  * If you want **maximum control** and deterministic output (e.g., factual answers, code generation, tables) → use low `temperature` (maybe ~0–0.3) and/or low `top_p`.
  * If you want **creativity, variety, open-ended responses** (e.g., story generation, brainstorming) → use higher `temperature` (e.g., 0.7–1.0) and/or higher `top_p`.
  * Many developers follow the guideline: **use one of them**, not both aggressively. As the docs note:

    > “We generally recommend altering this or `top_p` but not both.”
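
The `top_p` mechanism described above can be sketched on a made-up next-token distribution: sort tokens by probability, keep the smallest prefix whose cumulative mass reaches `top_p`, and renormalise. The helper `nucleus_filter` is hypothetical and only illustrates the idea; a real sampler would then draw the next token from the filtered distribution.

```python
import numpy as np


def nucleus_filter(probs: np.ndarray, top_p: float) -> np.ndarray:
    """Keep the smallest set of most-probable tokens whose mass reaches top_p, then renormalise."""
    order = np.argsort(probs)[::-1]            # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    # Number of tokens needed for the cumulative mass to reach top_p
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


# Made-up next-token distribution over five tokens
probs = np.array([0.5, 0.3, 0.1, 0.06, 0.04])

narrow = nucleus_filter(probs, 0.4)    # a low top_p keeps only the top token
wide = nucleus_filter(probs, 0.85)     # a higher top_p keeps the top three

print("top_p=0.40:", np.round(narrow, 3))
print("top_p=0.85:", np.round(wide, 3))
```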

---

### Summary

* Use **`temperature`** to **scale the randomness** of token selection (how adventurous the model is).
* Use **`top_p`** to **limit the pool** of token candidates (how many possible next tokens the model may consider).
* Both influence **diversity vs. determinism** of output, but they do so via different mechanisms.
* For many use cases, adjusting **one** of them is sufficient (and simpler) rather than tweaking both simultaneously.