In [None]:
%load_ext autoreload
%autoreload 2


# LLaMaBot's `StructuredBot` in under 5 minutes

When working with LLMs, a common goal is to 
pull structured data out of unstructured text. 
Once the data is structured, 
we can use it programmatically in later steps.

In this example, we'll look at a small dataset of SciPy videos uploaded to YouTube. 
The videos are given a title and a description. 
We want to extract the name of the speaker giving the talk, 
and the topics the talk is about.
We also want to validate that the extracted data 
not only matches the structured format we expect 
but also meets some custom requirements.

### Read video descriptions

Firstly, let's look at the video descriptions file. 
It is stored as a JSON file.
We can read it into pandas by using `pd.read_json`:

In [None]:
# load in unstructured text data
import pandas as pd

df = pd.read_json("../scipy_videos.json", orient="index")
df

| | name | description | view_count |
|---|---|---|---|
| 0ALKGR0I5MA | Basic Sound Processing in Python \| SciPy 2015 ... | | 261832 |
| ZB7BZMhfPgk | Introduction to Numerical Computing with NumPy... | NumPy provides Python with a powerful array pr... | 208823 |
| v5ijNXvlC5A | Modern Time Series Analysis \| SciPy 2019 Tutor... | This tutorial will cover the newest and most s... | 199372 |
| tYYVSEHq-io | Getting Started with TensorFlow and Deep Learn... | A friendly introduction to Deep Learning, taug... | 161483 |
| xAoljeRJ3lU | A Better Default Colormap for Matplotlib \| Sci... | Complete SciPy 2015 Talk & Tutorial Playlist h... | 160912 |
| 5rNu16O3YNE | Introduction to Data Processing in Python with... | This is a tutorial for beginners on using the ... | 117260 |
| JNfxr4BQrLk | Time Series Analysis with Python Intermediate ... | Tutorial materials for the Time Series Analysi... | 113018 |
| KhAUfqhLakw | Frequentism and Bayesianism: What's the Big De... | | 106990 |
| gtejJ3RCddE | NumPy Beginner \| SciPy 2016 Tutorial \| Alexand... | Materials for this tutorial may be found here:... | 106043 |
| nq6iPZVUxZU | UMAP Uniform Manifold Approximation and Projec... | This talk will present a new approach to dimen... | 96781 |
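
To make the role of `orient="index"` concrete, here is a minimal sketch using a made-up two-record sample. It assumes the file maps each video ID to a record of column values, which is what the output above suggests; the IDs, titles, and counts below are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample in the assumed shape of scipy_videos.json:
# outer keys are video IDs, inner keys are column names.
sample_json = """
{
  "abc123": {"name": "Talk A | SciPy 2015 | Jane Doe", "description": "", "view_count": 10},
  "def456": {"name": "Talk B | SciPy 2019 | John Roe", "description": "About arrays.", "view_count": 5}
}
"""

# With orient="index", the outer keys become the row index
# and the inner keys become the columns.
sample_df = pd.read_json(io.StringIO(sample_json), orient="index")
print(sample_df.index.tolist())
print(sample_df.columns.tolist())
```

Note that some rows have an empty description, so the extraction step later on has to cope with videos where only the title carries information.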


Let's now define a Pydantic schema for the data that we wish to extract from each video entry.
This is done by defining a `BaseModel` class with field validators.

In [None]:
from typing import List, Optional
from pydantic import BaseModel, Field, field_validator


class TopicExtract(BaseModel):
    """This object stores the name of the speaker presenting the video.

    It also generates a list of topics
    that best describe what this talk is about.
    """

    speaker_name: Optional[str] = Field(
        default=None,
        description=(
            "The name of the speaker giving this talk. "
            "If there is no speaker named, leave empty."
        ),
    )
    topics: List[str] = Field(
        description=(
            "A list of up to 5 topics that this text is about. "
            "Each topic should be a description of at most 2 words. "
            "All lowercase."
        )
    )

    @field_validator("topics")
    @classmethod
    def validate_num_topics(cls, topics):
        # validate that the list of topics contains at least 1 and no more than 5 topics
        if len(topics) <= 0 or len(topics) > 5:
            raise ValueError("The list of topics must contain at least 1 and at most 5 items")
        return topics

    @field_validator("topics")
    @classmethod
    def validate_num_topic_words(cls, topics):
        # for each topic the model generated, ensure that the topic contains no more than 2 words
        for topic in topics:
            if len(topic.split()) > 2:
                # Make the validation message helpful to the LLM:
                # repeat which topic failed and remind it what it must do to pass.
                raise ValueError(
                    f'The topic "{topic}" has too many words. A topic can contain AT MOST 2 words.'
                )
        return topics
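
Before involving an LLM, it can help to check that validators like these fire as intended by constructing the model by hand. The sketch below uses a pared-down stand-in, `TopicsOnly`, which is a hypothetical class for illustration and not part of LlamaBot; it assumes Pydantic v2.

```python
from typing import List

from pydantic import BaseModel, ValidationError, field_validator


class TopicsOnly(BaseModel):
    """Pared-down stand-in for TopicExtract, used only for this illustration."""

    topics: List[str]

    @field_validator("topics")
    @classmethod
    def validate_topics(cls, topics):
        # Same two checks as above, merged into one validator for brevity.
        if not 1 <= len(topics) <= 5:
            raise ValueError("The list must contain between 1 and 5 topics")
        for topic in topics:
            if len(topic.split()) > 2:
                raise ValueError(f'The topic "{topic}" has too many words')
        return topics


# A valid payload passes through unchanged.
ok = TopicsOnly(topics=["numpy", "array processing"])
print(ok.topics)

# An invalid payload raises ValidationError, and the custom message is included.
try:
    TopicsOnly(topics=["modern time series analysis"])
except ValidationError as err:
    message = err.errors()[0]["msg"]
    print(message)
```

Checking the schema this way separates bugs in the validators from mistakes made by the model.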

Now we can initialize the `StructuredBot` and assign this model to it.

In [None]:
from llamabot import prompt, StructuredBot


@prompt
def topicbot_sysprompt() -> str:
    """You are an expert topic labeller.
    You read a video title and description
    and extract the speaker's name and the topics the video is about.
    """


# Will use the OpenAI API by default, which requires an API key.
# If you want to, you can change this to a local LLM (from Ollama)
# by specifying, say, `model_name="ollama/mistral"`.
bot = StructuredBot(
    system_prompt=topicbot_sysprompt(),
    temperature=0,
    pydantic_model=TopicExtract,
    # model_name="ollama/mistral"
)
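
The validator comments above mention making error messages helpful to the LLM, which suggests that failed validations are fed back to the model. The following is a generic validate-and-retry sketch of that idea, not LlamaBot's actual implementation: `extract_with_retries`, `Topics`, and `fake_llm` are all hypothetical names, and the fake model simply replays canned replies.

```python
from typing import Callable, List

from pydantic import BaseModel, ValidationError


class Topics(BaseModel):
    """Hypothetical minimal schema for this sketch."""

    topics: List[str]


def extract_with_retries(
    ask_llm: Callable[[str], str], text: str, max_attempts: int = 3
) -> Topics:
    """Ask for JSON, validate it, and feed validation errors back on failure."""
    prompt = text
    for _ in range(max_attempts):
        raw = ask_llm(prompt)
        try:
            return Topics.model_validate_json(raw)
        except ValidationError as err:
            # Include the error text so the next attempt can correct itself.
            prompt = (
                f"{text}\n\nYour previous answer failed validation:\n{err}\n"
                "Please try again."
            )
    raise RuntimeError(f"No valid output after {max_attempts} attempts")


# Fake LLM (an assumption for the demo): wrong on the first call, right on the second.
replies = iter(['{"topics": "numpy"}', '{"topics": ["numpy"]}'])
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return next(replies)


result = extract_with_retries(fake_llm, "video title: NumPy Beginner")
print(result.topics, len(calls))
```

Under this pattern, a specific error message such as the one naming the offending topic gives the model something concrete to fix on the next attempt.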

Now we can pass in our text and extract the topics.

In [None]:
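video_extracts = []
for _, video_row in df.iterrows():
    # combine the title and description into a single block of text
    video_text = f"video title: {video_row['name']}\nvideo description: {video_row['description']}"

    # the bot returns a validated TopicExtract instance
    extract = bot(video_text)

    video_extracts.append(extract)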

{
  "speaker_name": "Allen Downey",
  "topics": [
    "sound",
    "processing",
    "python",
    "scipy",
    "basic"
  ]
}
{
  "speaker_name": "Alex Chabot-Leclerc",
  "topics": [
    "numerical computing",
    "NumPy",
    "array processing",
    "mathematical functions",
    "matplotlib"
  ]
}
{
  "speaker_name": "Aileen Nielsen",
  "topics": [
    "time series analysis",
    "bayesian methods",
    "machine learning",
    "deep learning",
    "python implementations"
  ]
}

Let's now inspect what the topics looked like.

In [None]:
for video in video_extracts:
    print(video)

Looks pretty accurate!
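
Since the point of structured output is to use it programmatically, here is a minimal sketch of one possible downstream step: counting how often each topic appears. `Extract` is a hypothetical stand-in with the same two fields as `TopicExtract` (validators omitted), and the three records are copied from the streamed output shown earlier.

```python
from collections import Counter
from typing import List, Optional

from pydantic import BaseModel


class Extract(BaseModel):
    """Stand-in with the same two fields as TopicExtract (validators omitted)."""

    speaker_name: Optional[str] = None
    topics: List[str]


# Three extracts copied from the streamed output shown earlier.
extracts = [
    Extract(
        speaker_name="Allen Downey",
        topics=["sound", "processing", "python", "scipy", "basic"],
    ),
    Extract(
        speaker_name="Alex Chabot-Leclerc",
        topics=["numerical computing", "NumPy", "array processing",
                "mathematical functions", "matplotlib"],
    ),
    Extract(
        speaker_name="Aileen Nielsen",
        topics=["time series analysis", "bayesian methods", "machine learning",
                "deep learning", "python implementations"],
    ),
]

# model_dump() converts each Pydantic object into a plain dict.
records = [e.model_dump() for e in extracts]

# Count topic frequency, normalising case first since "NumPy" is not lowercase.
topic_counts = Counter(t.lower() for r in records for t in r["topics"])
print(topic_counts.most_common(3))
```

The same `records` list could also be passed to `pd.DataFrame` to join the extracted fields back onto the original video table.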