# ODSC 2025 RAG + Structured Generation with Outlines

In this lesson, we'll be covering how to use Retrieval-Augmented Generation (RAG) to search for product reviews relevant to a question and then generate structured data from the reviews. We'll have well-structured data at the end that we can use for later analysis.

At the end of this lesson, you'll be able to:

- Use RAG (via [Milvus](https://milvus.io/)) to search for product reviews relevant to a question
- Use LLMs and [Outlines](https://github.com/dottxt-ai/outlines) to generate structured data from the reviews to understand sentiment for various categories
- Visualize the results using PCA

The lesson is structured as follows:

1. Load the dataset
2. Preprocess the data
3. Set up the vector database for RAG
4. Use LLMs and Outlines to generate structured data from the reviews
5. Visualize the results using PCA

## Definitions

Before we get started, let's define a few terms. Some of you may not be familiar with all of these terms, but that's okay! We'll be using them throughout the lesson.

### Retrieval-augmented generation (RAG)

RAG stands for retrieval-augmented generation. It's a technique that first retrieves text relevant to a specific question and then gives that text to a large language model (LLM) so the model can generate a response grounded in it. For example, we can use RAG to search for product reviews that are relevant to a question like "How do people feel about our prices?".

RAG is a good way to find text that is "similar" to a question and place it into a language model's context window so that the model can use it to generate a response.
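
The retrieve-then-generate flow described above can be sketched in a few lines of plain Python. This is a toy, not the pipeline used in this lesson: `overlap_score` is a hypothetical stand-in for embedding similarity, and the final prompt is printed rather than sent to a model.

```python
# Toy RAG flow: retrieve the most relevant reviews, then place them in a prompt.
# `overlap_score` is a hypothetical stand-in for embedding similarity.

def overlap_score(question: str, document: str) -> int:
    """Count how many distinct question words also appear in the document."""
    question_words = set(question.lower().split())
    document_words = set(document.lower().split())
    return len(question_words & document_words)

def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents with the highest overlap score."""
    ranked = sorted(documents, key=lambda d: overlap_score(question, d), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved text into the prompt ahead of the question."""
    joined = "\n".join(f"- {doc}" for doc in context)
    return f"Reviews:\n{joined}\n\nQuestion: {question}"

toy_reviews = [
    "the price is too high for what you get",
    "my kids love building houses",
    "great price and fast shipping",
]
toy_question = "how do people feel about the price"
top_reviews = retrieve(toy_question, toy_reviews, k=2)
toy_prompt = build_prompt(toy_question, top_reviews)
print(toy_prompt)
```

Only the two price-related reviews reach the prompt; the unrelated one is left out, which is the point of the retrieval step.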

### Outlines + structured generation

Outlines is an open-source framework for structured generation. Structured generation is a technique that allows us to use a large language model (LLM) to generate text that is structured in a specific way. For example, we can use structured generation to generate a list of categories that a user's experience falls into and provide a sentiment for each category.
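
One way to picture structured generation is as filtering the model's choices at every step so that only outputs matching the required format remain possible. The sketch below is a toy illustration of that idea, not how Outlines is implemented: `fake_scores` is a made-up stand-in for a language model's next-character preferences, and the allowed labels are chosen just for the example.

```python
# Toy constrained generation: the output must be one of ALLOWED, even though
# the "model" would prefer to write something else. Everything is illustrative.
ALLOWED = ["positive", "negative", "neutral"]

def fake_scores(prefix: str) -> dict[str, float]:
    """Hypothetical model: next-character scores that favour spelling 'perfect'."""
    scores = {ch: 0.1 for ch in "abcdefghijklmnopqrstuvwxyz"}
    wanted = "perfect"
    if len(prefix) < len(wanted):
        scores[wanted[len(prefix)]] = 1.0
    return scores

def constrained_generate() -> str:
    prefix = ""
    while prefix not in ALLOWED:
        # Keep only characters that extend the prefix toward an allowed word
        valid = {
            word[len(prefix)]
            for word in ALLOWED
            if word.startswith(prefix) and len(word) > len(prefix)
        }
        scores = fake_scores(prefix)
        # Pick the highest-scoring valid character (ties broken alphabetically)
        prefix += max(sorted(valid), key=lambda ch: scores[ch])
    return prefix

label = constrained_generate()
print(label)
```

Left unconstrained, this toy model would spell "perfect"; with the filter in place it can only ever produce one of the allowed labels.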

[DIAGRAM HERE]

## Setup

You will need to install the following packages:

```bash
pip install "pymilvus[model]" "outlines[transformers]" datasets sentence-transformers scikit-learn matplotlib pandas
```

You may also install the packages from the `requirements.txt` file in the repository:

```bash
pip install -r requirements.txt
```

## Load the dataset

We'll be working with a dataset of Amazon reviews for video games, provided by the [McAuley Lab](https://cseweb.ucsd.edu/~jmcauley/).

The dataset is available on [Hugging Face](https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023).

We'll load the dataset using the `datasets` library. The download may take a minute or two (it took me about 1.5 minutes), so please be sure to start it early.

In [1]:
from datasets import load_dataset

dataset = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023", 
    "raw_review_Video_Games", 
    trust_remote_code=True,
    cache_dir="cache",
    split="full"
)
df = dataset.to_pandas()

# Change the title column to be the "review_title" column
df.rename(columns={'title': 'review_title'}, inplace=True)



Next, we need to download the metadata for the dataset. This metadata contains information about the product, such as the product title, description, and other attributes. We primarily need the product title.

Downloading the metadata may take 30-60 seconds.

In [2]:
meta_dataset = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023", 
    "raw_meta_Video_Games", 
    trust_remote_code=True,
    cache_dir="cache",
    split="full"
)

meta_df = meta_dataset.to_pandas()

# Change the title column to be the "product_title" column
meta_df.rename(columns={'title': 'product_title'}, inplace=True)


Next, we need to join the two datasets on the `parent_asin` column. This will allow us to get the product title for each review.

`parent_asin` is the product ID for the product that the review is for. Some products on Amazon have multiple versions (`asin`), so the `parent_asin` is the same for all versions of the product.
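
To see what an inner join on `parent_asin` does, here is a small sketch in plain Python. The records are made up for illustration (hypothetical IDs and titles), and the lesson itself uses `DataFrame.merge` rather than this loop.

```python
# Toy inner join on parent_asin. All IDs and titles below are made up.
toy_reviews = [
    {"asin": "A1", "parent_asin": "P1", "text": "Fun game"},
    {"asin": "A2", "parent_asin": "P1", "text": "Great on console too"},
    {"asin": "A3", "parent_asin": "P9", "text": "No matching product"},
]
toy_products = [
    {"parent_asin": "P1", "product_title": "Example Game"},
    {"parent_asin": "P2", "product_title": "Example Controller"},
]

# Index products by parent_asin so each review can look up its product
products_by_parent = {p["parent_asin"]: p for p in toy_products}

# Inner join: keep only reviews whose parent_asin has product metadata
toy_joined = [
    {**review, **products_by_parent[review["parent_asin"]]}
    for review in toy_reviews
    if review["parent_asin"] in products_by_parent
]

for row in toy_joined:
    print(row["asin"], row["parent_asin"], row["product_title"])
```

Two versions (`A1` and `A2`) share the parent `P1`, so both reviews pick up the same product title, while the review with no matching product is dropped.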

In [3]:
# Join the two datasets on the parent_asin column
joined_df = df.merge(meta_df, on='parent_asin', how='inner')
joined_df.head()

Unnamed: 0,rating,review_title,text,images_x,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,...,description,price,images_y,videos,store,categories,details,bought_together,subtitle,author
0,4.0,It’s pretty sexual. Not my fav,I’m playing on ps5 and it’s interesting. It’s...,[],B07DJWBYKP,B07DK1H3H5,AGCI7FAH4GL5FI65HYLKWTMFZ2CQ,1608186804795,0,True,...,"[Cyberpunk 2077 is an open world, an action ad...",,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': ['Cyberpunk 2077 gameplay', 'AMZN_Cy...",WARNER BROS,"[Video Games, PC, Games]","{""Release date"": ""December 10, 2020"", ""Best Se...",,,
1,5.0,Good. A bit slow,Nostalgic fun. A bit slow. I hope they don’t...,[],B00ZS80PC2,B07SRWRH5D,AGCI7FAH4GL5FI65HYLKWTMFZ2CQ,1587051114941,1,False,...,[A spectacular reimagining of one of the most ...,25.95,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': ['Watch before you play this game!',...",Square Enix,"[Video Games, PlayStation 4, Games]","{""Release date"": ""April 10, 2020"", ""Best Selle...",,,
2,5.0,... an order for my kids & they have really en...,This was an order for my kids & they have real...,[],B01FEHJYUU,B07MFMFW34,AGXVBIUFLFGMVLATYXHJYL4A5Q7Q,1490877431000,0,True,...,[Civilization VI is a game about building an e...,29.99,{'hi_res': ['https://m.media-amazon.com/images...,{'title': ['Civilization 6: Rise and Fall Revi...,2K,"[Video Games, PC, Games]","{""Release date"": ""March 18, 2018"", ""Best Selle...",,,
3,5.0,Great alt to pro controller,"These work great, They use batteries which is ...",[],B07GXJHRVK,B0BCHWZX95,AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q,1577637634017,0,True,...,[Play your favorite Nintendo Switch games like...,67.61,{'hi_res': ['https://m.media-amazon.com/images...,{'title': ['PowerA Animal Crossing Nintendo Sw...,PowerA,"[Video Games, Nintendo Switch, Accessories, Co...","{""Release date"": ""September 17, 2018"", ""Best S...",,,
4,5.0,solid product,I would recommend to anyone looking to add jus...,[],B00HUWA45W,B00HUWA45W,AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q,1427591932000,0,True,...,[],,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': [], 'url': [], 'user_id': []}",KontrolFreek,"[Video Games, Xbox One, Accessories]","{""Brand"": ""KontrolFreek"", ""Item model number"":...",,,


That's a lot of data! Let's try and find a few products that we can work with more easily.

In [4]:
# Count products 
print(joined_df.groupby(['parent_asin', 'product_title']).size().sort_values(ascending=False).head(50))


parent_asin  product_title                                                                                                                                                                                           
B01N3ASPNV   amFilm Tempered Glass Screen Protector for Nintendo Switch 2017 (2-Pack)                                                                                                                                    18105
B0BN942894   BENGOO Stereo Pro Gaming Headset for PS4, PC, Xbox One Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Wii Accessory Kits          17310
B077GG9D5D   DualShock 4 Wireless Controller for PlayStation 4 - Jet Black                                                                                                                                               15594
B000N5Z2L4   Xbox Live Gold: 1 Month Membership [Digital Code]                                                       

I've selected a few games that I think will be interesting to work with, as most people have likely heard the name. These games are also well reviewed on Amazon.

- Minecraft `B00BU3ZLJQ`
- The Legend of Zelda: Breath of the Wild `B087NNPYP3`
- Assassin's Creed IV: Black Flag `B00BN5T30E`
- The Elder Scrolls V: Skyrim (Playstation 4) `B07YBXFDYN`

I also add some timestamp parsing tools to make it easier to work with the data.

In [5]:
# Add date parsing tools
from datetime import datetime, timezone

# Filter the dataset to only include the selected games.
# .copy() gives us an independent dataframe that is safe to modify.
videogames = joined_df[joined_df['parent_asin'].isin(['B00BU3ZLJQ', 'B087NNPYP3', 'B00BN5T30E', 'B07YBXFDYN'])].copy()

# Parse the timestamp column from unix time (in milliseconds) to an ISO 8601 string.
# Assigning the column directly replaces the int64 column with strings.
videogames['timestamp'] = videogames['timestamp'].astype(int).map(lambda x: datetime.fromtimestamp(x/1000, tz=timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'))

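
The raw `timestamp` values are Unix times in milliseconds, which is why the code divides by 1000 before converting. As a quick check, here is the same conversion applied to a single value taken from the first row of the joined table shown earlier. The `to_iso_utc` helper name is just for this example.

```python
from datetime import datetime, timezone

def to_iso_utc(timestamp_ms: int) -> str:
    """Convert a Unix timestamp in milliseconds to an ISO 8601 UTC string."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime('%Y-%m-%dT%H:%M:%SZ')

iso_string = to_iso_utc(1608186804795)
print(iso_string)
```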


RAG and structured generation are both computationally expensive operations. We have to calculate an [embedding](https://platform.openai.com/docs/guides/embeddings) for each review, so we'll keep the dataset small for exploration to avoid waiting too long for results.

This is still a relatively large dataset, so let's randomly sample 1000 rows to keep the computation time manageable. If you have a GPU, you can increase the sample size.

In [6]:

# Randomly sample 1000 rows to keep the computation time manageable
videogames = videogames.sample(n=1000)


Great! Now we have a dataset of 1000 reviews for 4 games. Let's take a look at the dataset.

In [7]:
videogames.head()

Unnamed: 0,rating,review_title,text,images_x,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase,...,description,price,images_y,videos,store,categories,details,bought_together,subtitle,author
895985,5.0,Awesome on land and over sea. Rule the seas!,I have had an amazing time captaining my ship ...,[],B00BMFIXOW,B00BN5T30E,AHVIIVUSR7AFB2QOYA5ZCMTQBSJA,2013-12-06T21:59:26Z,1,False,...,"[From the Manufacturer, Assassin's Creed, ®, I...",43.89,{'hi_res': ['https://m.media-amazon.com/images...,{'title': ['See the Game Case! Great Ship Batt...,Ubisoft,"[Video Games, PC, Games]","{""Release date"": ""November 19, 2013"", ""Best Se...",,,
824903,5.0,Same old Skyrim Just Better,If you liked Skyrim then this is a great way t...,[],B01GW902DM,B07YBXFDYN,AFCB6BTUBDB4OFJWXPOITK44EZJA,2019-12-31T13:51:22Z,0,False,...,"[A true, full-length open-world game for VR ha...",39.89,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': ['Skyrim VR - PlayStation 4', 'Skyri...",Bethesda,"[Video Games, PlayStation 4, Games]","{""Release date"": ""November 17, 2017"", ""Best Se...",,,
2093905,5.0,Five Stars,Its Skyrim....What can I say,[],B01GW8XJVU,B07YBXFDYN,AGM5IFHJEJAP7ZYTQNCAXYSJSWPA,2016-11-05T00:35:11Z,0,True,...,"[A true, full-length open-world game for VR ha...",39.89,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': ['Skyrim VR - PlayStation 4', 'Skyri...",Bethesda,"[Video Games, PlayStation 4, Games]","{""Release date"": ""November 17, 2017"", ""Best Se...",,,
1578389,5.0,Five Stars,Perfect condition,[],B01GW902DM,B07YBXFDYN,AF4E4LSHDCTBGSVK3ZTB7YQPATXA,2017-01-09T15:05:33Z,0,True,...,"[A true, full-length open-world game for VR ha...",39.89,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': ['Skyrim VR - PlayStation 4', 'Skyri...",Bethesda,"[Video Games, PlayStation 4, Games]","{""Release date"": ""November 17, 2017"", ""Best Se...",,,
3624354,4.0,"Great game, Bad UI","Make no mistake, Skyrim is good...Game wise. T...",[],B004HYIAPM,B07YBXFDYN,AGSXASGRENFUMT4BNNH7UHKAQJFQ,2012-06-03T01:57:09Z,4,True,...,"[A true, full-length open-world game for VR ha...",39.89,{'hi_res': ['https://m.media-amazon.com/images...,"{'title': ['Skyrim VR - PlayStation 4', 'Skyri...",Bethesda,"[Video Games, PlayStation 4, Games]","{""Release date"": ""November 17, 2017"", ""Best Se...",,,


# Using Milvus

```bash
pip install "pymilvus[model]"
```

In [8]:
from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")

if client.has_collection(collection_name="demo_collection"):
    client.drop_collection(collection_name="demo_collection")

client.create_collection(
    collection_name="demo_collection",
    dimension=768,  # The vectors we will use in this demo have 768 dimensions
)

## Quick primer on embeddings

Embeddings are a way to represent text as a vector of numbers. This vector captures the __meaning__ of the text as the embedding model understands it -- for example, the embeddings for "cat" and "dog" are close to each other in the embedding space, while the embeddings for "cat" and "car" are far apart.

Let me show you an example. First, we need to set up our embedding model. In this case, we'll be using the [nomic-embed-text-v1](https://huggingface.co/nomic-ai/nomic-embed-text-v1) model to embed the reviews.

NOTE: every input to this embedding model must be prefixed with either `search_document: ` (for text we store) or `search_query: ` (for questions we search with). This is model-specific -- you have to check the documentation for the model you are using.

In [10]:
from sentence_transformers import SentenceTransformer

import torch

# Use the GPU if one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# https://huggingface.co/nomic-ai/nomic-embed-text-v1
embedding_model = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v1", 
    trust_remote_code=True,
    device=device
)

# NOTE: all inputs to this embedding model must be
# prefixed with search_document: or search_query: 
document_prefix = "search_document: "
query_prefix = "search_query: "


As I mentioned before, embeddings are a way to represent text as a vector of numbers. Let's take a look at the embedding space for a few documents.

Don't worry too much about what's going on in this code block -- all we're doing is 

1. Calculating the embeddings for a few documents
2. Reducing the dimensionality of the embeddings from 768 dimensions to 2 dimensions so we can visualize them

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

documents = ["cat", "dog", "king", "queen", "video game", 'minecraft', 'zelda', 'skyrim']
vectors = embedding_model.encode([document_prefix + doc for doc in documents])

# Show the principal components of the embedding space
pca = PCA(n_components=2)
transformed = pca.fit_transform(vectors)

# Create the scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(transformed[:, 0], transformed[:, 1])

# Add labels above each point
for i, doc in enumerate(documents):
    plt.annotate(doc, 
                (transformed[i, 0], transformed[i, 1]),
                xytext=(0, 10),  # 10 points vertical offset
                textcoords='offset points',
                ha='center',  # horizontal alignment
                va='bottom') # vertical alignment

plt.show()

Note that the embeddings for "cat" and "dog" are close to each other, as are "king" and "queen". Video games are close to each other and form a cluster.

When we do RAG, all we are doing is finding embeddings that are close to the query embedding. Let me add a query to the embedding space and see what happens.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

documents = ["cat", "dog", "king", "queen", "video game", 'minecraft', 'zelda', 'skyrim']
query = "What games do people like?"
vectors = embedding_model.encode([document_prefix + doc for doc in documents])
query_vector = embedding_model.encode([query_prefix + query])

# Show the principal components of the embedding space
pca = PCA(n_components=2)
transformed = pca.fit_transform(vectors)
transformed_query = pca.transform(query_vector[0].reshape(1, -1))

# Create the scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(transformed[:, 0], transformed[:, 1])

# Add labels above each point
for i, doc in enumerate(documents):
    plt.annotate(doc, 
                (transformed[i, 0], transformed[i, 1]),
                xytext=(0, 10),  # 10 points vertical offset
                textcoords='offset points',
                ha='center',  # horizontal alignment
                va='bottom') # vertical alignment

plt.scatter(transformed_query[0][0], transformed_query[0][1], color='red', marker='x', s=100)
plt.show()

In practice, we don't visualize the embedding space for RAG purposes. We use a metric called __cosine similarity__ to find the most similar documents to a query.

Here's an example of how we can calculate the cosine similarity between a query and a document.

In [None]:
import numpy as np


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for (doc, vector) in zip(documents, vectors):
    print(f"{doc:<12}{cosine_similarity(query_vector[0], vector):.2f}")


This is what vector databases like Milvus do, albeit more efficiently. You upload embedding vectors, and query the database using a query vector.
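
As a rough picture of what a vector database does at query time, the sketch below scores every stored vector against the query and returns the closest matches. It is a brute-force toy with made-up 3-dimensional vectors. The `toy_search` helper is hypothetical and is not the Milvus API, which relies on indexes to avoid comparing against every vector.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings"; the real ones in this lesson have 768 dimensions
toy_store = [
    {"id": 0, "text": "cat", "vector": [0.9, 0.1, 0.0]},
    {"id": 1, "text": "dog", "vector": [0.8, 0.2, 0.1]},
    {"id": 2, "text": "minecraft", "vector": [0.1, 0.9, 0.3]},
    {"id": 3, "text": "zelda", "vector": [0.0, 0.8, 0.5]},
]

def toy_search(query_vec: list[float], limit: int = 2) -> list[dict]:
    """Return the `limit` stored items most similar to the query vector."""
    scored = [
        {"id": item["id"], "text": item["text"],
         "score": cosine(query_vec, item["vector"])}
        for item in toy_store
    ]
    return sorted(scored, key=lambda hit: hit["score"], reverse=True)[:limit]

hits = toy_search([0.0, 1.0, 0.2], limit=2)
for hit in hits:
    print(f"{hit['text']:<12}{hit['score']:.2f}")
```

The `limit` argument plays the same role as the `limit` parameter in a real search: it caps how many of the best-scoring items come back.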

## Returning to the dataset

Now that we've introduced embeddings and vector databases, let's return to the dataset.

We want to upload embeddings to the vector database that contain as much semantic information as possible. In this case, I've chosen to create a single string for each review that contains the review title, text, product title, and rating.

In [None]:
# Preprocess the text -- compacts the product title, review title, rating, and review text into a single string
def preprocess_text(row):
    # Set this to "" to leave the product title out of the embedded text
    product_title_row = f"Review for product: {row['product_title']}"
    review_title_row = f"Review Title: {row['review_title']}"
    rating_row = f"Rating: {row['rating']}"
    review_text_row = f"Review Text: {row['text']}"
    return f"{product_title_row} \n{review_title_row} \n{rating_row} \n{review_text_row}"

videogames.loc[:, 'processed_text'] = videogames.apply(preprocess_text, axis=1)


In [None]:
print(videogames.processed_text.iloc[5])

Notice how much information is preserved in the processed text. We have the name of the product, the review title, the rating, and the review text. Any vector search will be able to use this information to find the most relevant reviews. You may wish to remove the product title to allow the embeddings to focus on the review text -- this is a good exercise for you to try!

Next, we'll upload the embeddings to the vector database. This is one of the more computationally expensive operations, so it may take a few seconds to complete depending on your hardware.

In [None]:
# Text strings to search from.
docs = videogames.processed_text.tolist()
vectors = embedding_model.encode([document_prefix + doc for doc in docs])

`vectors` contains the embeddings for each review, and `docs` contains the processed text for each review.

Let's upload the embeddings to the vector database.

In [None]:
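# Each entity has an id, a vector representation, the processed text, and
# metadata fields that we can filter on and return from searches later.
data = [{
    'id': i, 
    'vector': vectors[i].tolist(), 
    'text': docs[i],
    'review_title': videogames.review_title.iloc[i],
    'product_title': videogames.product_title.iloc[i],
    'parent_asin': videogames.parent_asin.iloc[i],
    'asin': videogames.asin.iloc[i],
    # Cast NumPy scalars to plain Python types before inserting
    'rating': float(videogames.rating.iloc[i]),
    'helpful_vote': int(videogames.helpful_vote.iloc[i]),
    'verified_purchase': bool(videogames.verified_purchase.iloc[i]),
    'timestamp': videogames.timestamp.iloc[i],
    'user_id': videogames.user_id.iloc[i],
} for i in range(len(vectors))]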

print("Data has", len(data), "entities, each with fields: ", data[0].keys())
print("Vector dim:", len(data[0]["vector"]))


Milvus uses `client.insert` to upload data to the vector database. We simply provide the collection name and the data to upload.


In [None]:
client.insert(collection_name="demo_collection", data=data)

We have now uploaded the data to the vector database, which means we can perform semantic search on the data.

We need to pick a question; the vector search will use it to decide which reviews are relevant.

Suppose we want to know about prices. Are our prices too low? Too high?

I'll propose a question that is general enough to apply to all of the products in our dataset, and I've tried to keep it from focusing on high or low prices specifically.

In [None]:
question = """
How do people feel about our prices? I'm trying to understand how people feel about our prices, 
both good and bad.
"""


Vector search compares the query vector to the vectors in the database. The query vector is the embedding for the question. With our model, we have to prepend `search_query: ` to the question.

In [None]:
# Embed the query and convert the resulting embedding to a list
query_embeddings = [
    m.tolist() 
    for m in embedding_model.encode([query_prefix + question])
]


Let's search the database using the query vector.

The `limit` parameter determines how many results we want to return -- set this to a lower number if your computer is not powerful.

In [None]:
res = client.search(
    collection_name="demo_collection",
    data=query_embeddings,
    limit=50,
    # Let's limit the search to Minecraft for now
    filter="product_title == 'Minecraft'",
    output_fields=[
        "text", 
        "product_title", 
        "parent_asin", 
        "asin", 
        "rating", 
        "timestamp", 
        "helpful_vote", 
        "verified_purchase", 
        "user_id", 
        "review_title",
        'id'
    ],
)

print(res[0][0]['entity']['text'])

Finally, let's extract the text from the search results to pass them to the LLM.

In [None]:
# Get the resulting query text
# Note: we only provided a single query, so we take the 0th result.
# Milvus search queries are lists by default.
reviews = [doc['entity']['text'] for doc in res[0]] 
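
Because every retrieved review ends up in the model's context window, it can help to cap how much text is passed along. The sketch below works on a hand-written list shaped like the search results used in this section (a list of hits, each with an `entity` dictionary). The `max_chars` budget and the `collect_reviews` helper are hypothetical and not part of Milvus or Outlines.

```python
# Hand-written stand-in for `res[0]`: the list of hits for a single query
fake_hits = [
    {"id": 0, "distance": 0.71, "entity": {"text": "Worth every penny."}},
    {"id": 1, "distance": 0.65, "entity": {"text": "Overpriced for an old game."}},
    {"id": 2, "distance": 0.40, "entity": {"text": "Arrived quickly."}},
]

def collect_reviews(hits: list[dict], max_chars: int) -> list[str]:
    """Take review texts in ranked order until the character budget is used up."""
    selected, used = [], 0
    for hit in hits:
        text = hit["entity"]["text"]
        if used + len(text) > max_chars:
            break
        selected.append(text)
        used += len(text)
    return selected

selected_reviews = collect_reviews(fake_hits, max_chars=50)
print(selected_reviews)
```

Lowering `limit` in the search has a similar effect; a character (or token) budget is simply a finer-grained way to keep prompts short on slower hardware.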


# Using Outlines


Outlines is a framework for structured generation. It allows us to use a large language model (LLM) to generate text that is structured in a specific way.

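The code below defaults to `microsoft/Phi-3.5-mini-instruct`, which generally gives good results. If your hardware is limited, try one of the smaller SmolLM2 models instead, such as `HuggingFaceTB/SmolLM2-135M-Instruct`. These are fast and cheap to run, and most computers should be able to handle them.

Smaller models can produce relatively low-quality output, however, so consider using the largest model your hardware can run comfortably.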

In [None]:
import outlines

# Choose our model. These are arranged from smallest to largest,
# so choose the one that is appropriate for your hardware.
#
# The larger models will generally give better results, but
# will also be slower and memory-intensive.

# model_string = 'HuggingFaceTB/SmolLM2-135M-Instruct'
# model_string = 'HuggingFaceTB/SmolLM2-360M-Instruct'
# model_string = 'HuggingFaceTB/SmolLM2-1.7B-Instruct' 
model_string = 'microsoft/Phi-3.5-mini-instruct' # this is a larger model that has generally good results

# Load the model
llm = outlines.models.transformers(
    model_string,
    device=device  # Same device we selected for the embedding model
)

Outlines does not add chat template tokens to the prompt, so we need to add them manually. Chat template tokens are special tokens used in instruction-tuned models to indicate system prompts, user prompts, and assistant responses.

Failing to include chat template tokens will cause the model to generate text with unpredictable results. The output will _always_ conform to the schema, but the actual text may be nonsensical or incorrect.

`transformers` provides an `AutoTokenizer` class that can be used to easily apply chat template tokens to a prompt.


In [None]:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_string)

# Simple template function
def template(user_prompt: str, system_prompt: str="You analyze reviews of a product and help analysts understand the user experience."):
    return tokenizer.apply_chat_template(
        [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}],
        tokenize=False,
        truncation=True,
        add_bos_token=True,
        add_generation_prompt=True
    )

print(template("Hello"))

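Note the special tokens that mark each part of the conversation. With a SmolLM2 model, these are `<|im_start|>system`, `<|im_start|>user`, and `<|im_end|>`; other models use different tokens (Phi-3.5, for example, uses `<|system|>`, `<|user|>`, and `<|end|>`).

Because we set `add_generation_prompt=True`, the template ends with a trailing assistant token (`<|im_start|>assistant` for SmolLM2) to indicate the beginning of the assistant's response. This can help the model understand that it is responding to the prompt. Not including this token can cause unpredictable results.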
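
To make the role of these tokens concrete, here is a hand-rolled sketch of the ChatML-style layout that `<|im_start|>` and `<|im_end|>` belong to. It only illustrates the shape of a templated prompt. The `chatml` helper is hypothetical, and in practice `tokenizer.apply_chat_template` is the safer choice because each model defines its own template.

```python
def chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render messages in a ChatML-style layout (illustrative only)."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # The trailing assistant header tells the model it is its turn to respond
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

chat_prompt = chatml([
    {"role": "system", "content": "You analyze reviews of a product."},
    {"role": "user", "content": "Hello"},
])
print(chat_prompt)
```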

Let's compare model output with and without the chat template. A `generator` is a function that takes a prompt and returns language model output. To return structured data, we typically construct the generator with `outlines.generate.json`; to return unstructured text, use `outlines.generate.text`, which is what we'll use for this comparison.

Here is how to initialize the unstructured text generator:

In [None]:
# Generator function to produce arbitrary responses
generator = outlines.generate.text(llm)

Here is the result with and without the chat template tokens:

In [None]:
print("Without chat template tokens:")
print(generator("Hello", max_tokens=10))

print("\nWith chat template tokens:")
print(generator(template("Hello"), max_tokens=10))

Note that the untemplated response is confusing and nonsensical. The templated response is consistent with our general experience of talking to chatbots and other instruction-tuned models.

## Using Outlines to extract structured data with a language model

The next part is to choose a few pieces of structured data that we would like to extract from the reviews.

We typically define the structure of the data we want to extract by defining a Pydantic class. Pydantic is a library that allows us to define strongly-typed data structures.

Outlines will force the language model to output data in the format you specify -- think of it as defining a "shape" of the response, and then allowing the model to fill in the details.
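
For contrast, this is the kind of check you would otherwise have to run on free-form model output, validating each response after the fact and retrying on failure. The sketch uses only the standard library; the field names, labels, and the `matches_shape` helper are illustrative and are not part of Outlines or Pydantic.

```python
import json
from typing import Literal, get_args

# Illustrative set of allowed labels
Sentiment = Literal["positive", "negative", "neutral", "unclear"]

def matches_shape(raw_json: str) -> bool:
    """Check that a JSON string has exactly the fields and values we expect."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and set(data) == {"sentiment", "comments"}
        and data["sentiment"] in get_args(Sentiment)
        and isinstance(data["comments"], str)
    )

good = '{"sentiment": "negative", "comments": "Users felt the price was too high."}'
bad = '{"sentiment": "meh", "comments": "Users felt the price was too high."}'
print(matches_shape(good), matches_shape(bad))
```

With structured generation, a response like `bad` cannot be produced in the first place, so this validate-and-retry loop is not needed.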

Let's take a look at a review.

In [None]:
print(reviews[2])

Great -- spot checking this, it seems to be about prices. Our vector search worked well!

Let's define the data we want to extract from the reviews. I've chosen a few categories that I think are important for understanding users' experience with the product.

- Customer service
- Gameplay
- Bugs
- Performance
- Price value

For each category, we'll want to extract a sentiment. We'll use a `Literal` type to indicate that the sentiment must be one of four options: `positive`, `negative`, `neutral`, or `unclear` (for categories the reviews don't mention). The model __must__ return a sentiment for each category.

In [None]:
from pydantic import BaseModel
from typing import Literal

class FeedbackItem(BaseModel):
    sentiment: Literal['positive', 'negative', 'neutral', 'unclear']
    comments: str

class Review(BaseModel):
    # Feedback items
    customer_service: FeedbackItem
    gameplay: FeedbackItem
    bugs: FeedbackItem
    performance: FeedbackItem
    price_value: FeedbackItem
    feedback_summary: str

    def __str__(self):
        lines = []
        # Iterate through the field names directly
        for field_name in self.model_fields.keys():
            if getattr(self, field_name) is not None:
                lines.append(f"{field_name}: {getattr(self, field_name)}")
        return "\n".join(lines)

    

We have the structure we want to extract, so let's create the generator. 

`outlines.generate.json` is a generator that takes a prompt and returns a structured response.

In this case, we know that the result of the generator will be a `Review` object.

In [None]:
# This function will take an arbitrary review string and
# return a structured `Review` object.
review_generator = outlines.generate.json(llm, Review)

We can test what it looks like with a simple input:

In [None]:
review_generator("This is a great game!")

Great. Next, we need to prompt our model, as you would in any language model context.

Note! You still have to provide good prompts. Structured generation is not a magic bullet. You will always get the structure you specify, but good prompting will improve the quality of the output.

In [None]:
def review_prompt(question: str, reviews: str):
    return template(f"""
    Please see a list of reviews of a product. Provide information to an analyst
    analyzing the product reviews.
                    
    The analyst is trying to answer the question: {question}
                    
    The analyst is interested in understanding the user experience with the product.

    For the reviews, extract sentiments and comments across the following categories:

    - customer_service
    - gameplay
    - bugs
    - performance
    - price_value

    Comments should be a single sentence that summarizes the user's experience with the product.
    These are comments for the analyst, not the users.

    Sentiments must be one of the following:

    - positive (the users said something positive about the category)
    - negative (the users said something negative about the category)
    - neutral (the users seemed balanced about the category)
    - unclear (the users did not mention the category)

    For example, if the users enjoyed the customer service but the 
    game was buggy, the categories would be [customer_service, bugs] and the
    sentiments would be [positive, negative]. 

    Lastly, provide a summary of the feedback for the product.

    Respond in JSON format, following the schema: {Review.model_json_schema()}

    # Begin review

    {reviews}
    """)

The function `review_prompt` takes a question and the review text and returns a chat-templated prompt for the language model.

We can now use the `review_generator` to generate structured data from the reviews. This is where the language model is used -- it can be slow depending on your hardware. If you need to speed it up, you can reduce the number of reviews returned by the vector search (the `limit` keyword) earlier in the lesson.

In [None]:
# Generate structured data from the reviews
reviews_all = "\n\n".join(reviews)
review_result = review_generator(review_prompt(question, reviews_all))

What do the results look like? Let's see!

In [None]:
# Print the result
print(review_result)

The model flagged a generally positive review. Notice that the model flagged too much information as positive -- the review was only about price and customer service, but the model flagged the entire review as positive. This is a good example of language models not solving all your problems! You need to prompt the model, perform evaluations, and look at your data. If you notice poor output quality, try a larger model, as larger models tend to be more accurate.

Now we can start to analyze the results. 

In [None]:
def analyze_product(question: str, product_title: str):
    # Embed the question so the search reflects what was actually asked
    question_embeddings = [
        m.tolist()
        for m in embedding_model.encode([query_prefix + question])
    ]

    search_results = client.search(
        collection_name="demo_collection",
        data=question_embeddings,
        limit=50,
        # Limit the search to the requested product
        filter=f"product_title == '{product_title}'",
        output_fields=[
            "text", 
            "product_title", 
            "parent_asin", 
            "asin", 
            "rating", 
            "timestamp", 
            "helpful_vote", 
            "verified_purchase", 
            "user_id", 
            "review_title",
            'id'
        ],
    )

    # Review text
    all_review_texts = [doc['entity']['text'] for doc in search_results[0]] 

    # Generate structured data from the reviews
    reviews_all = "\n\n".join(all_review_texts)
    review_result = review_generator(review_prompt(question, reviews_all))

    return review_result

print(analyze_product("What do people think about the price of Minecraft?", "Minecraft"))


In [None]:
# Summarize the results.
# Want to have sentiment counts by category.
import pandas as pd

# Iterate through the sentiment results
rows = []
for (llm_result, database_result) in zip(review_result, res[0]):

    # Convert the Pydantic model to a dictionary
    rdict = llm_result.model_dump()

    # Build one row per (review, category) pair
    for (k, v) in rdict.items():
        row = {
            'timestamp': database_result['entity']['timestamp'],
            'user_id': database_result['entity']['user_id'],
            'product_title': database_result['entity']['product_title'],
            'id': database_result['entity']['id'],
            'processed_text': database_result['entity']['text'],
            'category': k,
            'sentiment': v['sentiment']
        }

        # Add the row to the list
        rows.append(row)

# Pack the results into a dataframe
df = pd.DataFrame(rows)

# Convert the timestamp to a datetime object for easier analysis
df['timestamp'] = pd.to_datetime(df['timestamp'])

df.head()

We can use this dataframe as we might in any other data analysis project. For example, we can count the number of positive and negative sentiments for each category and product.

In [None]:
# General overview by product
df.groupby(['product_title', 'category', 'sentiment']).size().unstack().fillna(0)

Or, we can look at the percentage of positive feedback for each product. This can be biased by the question you ask of the database, so be careful!

In [None]:
# Calculate the percentage of positive feedback for each product
# fill_value=0 keeps products with no reviews of a given sentiment from producing NaN
df.groupby(['product_title', 'sentiment']).size().unstack(fill_value=0).apply(lambda x: x['positive'] / (x['positive'] + x['negative']), axis=1)
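
To make the arithmetic behind the positive share explicit, here is a minimal standard-library sketch. The `rows` list is hypothetical data standing in for the dataframe. In pandas, a product/sentiment combination that never occurs shows up as `NaN` after `unstack()` unless it is filled, while a `Counter` simply returns 0 for a missing key.

```python
from collections import Counter, defaultdict

# Hypothetical (product_title, sentiment) pairs standing in for dataframe rows
rows = [
    ("Game A", "positive"),
    ("Game A", "positive"),
    ("Game A", "negative"),
    ("Game B", "positive"),
    ("Game B", "positive"),
]

# Tally sentiments per product
counts = defaultdict(Counter)
for product, sentiment in rows:
    counts[product][sentiment] += 1

positive_share = {}
for product, c in counts.items():
    # Missing keys count as 0, so a product with no negative reviews
    # still gets a well-defined share
    total = c["positive"] + c["negative"]
    positive_share[product] = c["positive"] / total if total else None

print(positive_share)
```

Here "Game B" has no negative reviews at all, which is exactly the case where an unfilled pandas table would yield `NaN` instead of 1.0.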


What does the data look like when we visualize it? Suppose we want to visualize total sentiment across categories.

In [None]:
# Visualize the results
df.groupby(['category', 'sentiment']).size().unstack().plot(kind='bar', stacked=True)


We can also calculate the percentage of positive sentiment by category and product, to focus a little more on which products are doing well.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Calculate percentage of positive sentiment by category and product
sentiment_by_groups = df.groupby(['category', 'product_title'])['sentiment'].agg(
    total_reviews='count',
    positive_sentiment=lambda x: (x == 'positive').mean() * 100,
).reset_index().sort_values('product_title', ascending=False)

# Pivot so every product has exactly one value per category
# (NaN where a product has no reviews in that category)
positive_pct = sentiment_by_groups.pivot(
    index='category', columns='product_title', values='positive_sentiment'
)

# Plot as a grouped bar chart
plt.figure(figsize=(15, 8))
categories = positive_pct.index
x = np.arange(len(categories))
width = 0.2  # Width of bars

# Create bars for each product, offsetting each product's bars
for i, product in enumerate(positive_pct.columns):
    plt.bar(x + (i * width), positive_pct[product], width, label=product)

plt.xlabel('Category')
plt.ylabel('Positive Sentiment (%)')
plt.title('Positive Sentiment by Category and Product')
# Center the tick labels under each group of bars
plt.xticks(x + width * (len(positive_pct.columns) - 1) / 2, categories, rotation=45)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

# Also display the numerical results
# print("\nDetailed Sentiment Analysis:")
# print(sentiment_by_groups.sort_values(['category', 'positive_sentiment'], ascending=[True, False]))

We can also visualize the data in the embedding space. This is a good way to see what high-dimensional data looks like once it is projected down to two dimensions. We're using the same technique (principal component analysis) as we did in the vector search section.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# Create PCA
pca = PCA(n_components=2)
transformed = pca.fit_transform(vectors)

# Create color map
unique_products = videogames['product_title'].unique()
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_products)))
color_dict = dict(zip(unique_products, colors))

# Create scatter plot
plt.figure(figsize=(10, 6))
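for product in unique_products:
    # Boolean mask as a NumPy array; assumes videogames rows align with vectors
    mask = (videogames['product_title'] == product).to_numpy()
    plt.scatter(
        transformed[mask, 0],
        transformed[mask, 1],
        label=product,
        alpha=0.6,
        color=color_dict[product]
    )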

# Add the query vector
reshaped_query = np.array(query_embeddings[0]).reshape(1, -1)
transformed_query = pca.transform(reshaped_query)[0]

plt.scatter(
    transformed_query[0], 
    transformed_query[1],
    label='Query',
    color='red',
    marker='x'
)

plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('PCA of Game Reviews by Product')
plt.tight_layout()
plt.show()

We can also fit PCA on only the reviews returned by the search, which shows how the retrieved reviews sit relative to the query.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# Get the IDs from the search results
result_ids = [doc['entity']['id'] for doc in res[0]]

# Extract vectors for only the search results
# Assumes each review's id is its row index in vectors
result_vectors = np.asarray(vectors)[result_ids]

# Create PCA
pca = PCA(n_components=2)
transformed = pca.fit_transform(result_vectors)

# Get product titles for the results
result_products = [doc['entity']['product_title'] for doc in res[0]]

# Create color map
unique_products = np.unique(result_products)
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_products)))
color_dict = dict(zip(unique_products, colors))

# Create scatter plot
plt.figure(figsize=(10, 6))
for product in unique_products:
    mask = np.array(result_products) == product
    plt.scatter(
        transformed[mask, 0], 
        transformed[mask, 1],
        label=product,
        alpha=0.6,
        color=color_dict[product]
    )

# Add the query vector
reshaped_query = np.array(query_embeddings[0]).reshape(1, -1)
transformed_query = pca.transform(reshaped_query)[0]
plt.scatter(
    transformed_query[0], 
    transformed_query[1],
    label='Query',
    color='red',
    marker='x',
    s=100  # Make the X larger
)

plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.title('PCA of Retrieved Game Reviews by Product')
plt.tight_layout()
plt.show()

## Structuring every data point

The RAG approach outlined above is designed for very large datasets where structuring every record is computationally expensive. We used RAG to find a relevant subsample of the data and then produced structured output for that subsample only.

However, if you have the resources, there is a more robust approach.

You can add sentiment and feedback items to each review in your dataset (in this case, possibly millions of reviews). Structuring reviews as you insert them into the database lets you analyze them later with ordinary database queries or standard statistical tools.

Outlines is a good tool for this, but you can also use a structured generation provider like OpenAI.
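
As a rough sketch of what structuring at ingestion time could look like, the example below stores one row per review and category in an in-memory SQLite table. Everything here is an assumption for illustration: `classify_review` is a hypothetical keyword-based stand-in for the structured-generation call, and the table name, schema, and sample reviews are made up.

```python
import sqlite3


def classify_review(text: str) -> dict:
    """Hypothetical stand-in for the structured-generation call.

    A real pipeline would call the LLM here; this keyword rule only
    keeps the sketch runnable.
    """
    sentiment = "negative" if "expensive" in text.lower() else "positive"
    return {"price": {"sentiment": sentiment}}


# Made-up reviews as (review_id, text) pairs
reviews = [
    (1, "Great value, would buy again."),
    (2, "Fun game but far too expensive."),
    (3, "Cheap and cheerful."),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE review_sentiment (review_id INTEGER, category TEXT, sentiment TEXT)"
)

# Structure each review once, as it is ingested
for review_id, text in reviews:
    for category, details in classify_review(text).items():
        conn.execute(
            "INSERT INTO review_sentiment VALUES (?, ?, ?)",
            (review_id, category, details["sentiment"]),
        )
conn.commit()

# Later analysis is a plain SQL query; no LLM call is needed
sentiment_counts = conn.execute(
    "SELECT category, sentiment, COUNT(*) FROM review_sentiment "
    "GROUP BY category, sentiment ORDER BY category, sentiment"
).fetchall()
print(sentiment_counts)
```

The trade-off is cost up front: every review is processed once, but each later question becomes a cheap query rather than a new retrieval and generation pass.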

# Conclusion

In this lesson, we've covered how to use RAG to search for product reviews relevant to a question, how to use Outlines to generate structured data from those reviews, and how to summarize and visualize the results.

I hope you found this lesson useful! If you have any questions, please feel free to reach out to me on [LinkedIn](https://www.linkedin.com/in/cameron-pfiffer/).
