# Vaudeville: Structured Output

The tutorial on GitHub covers Structured Output in much more detail. This notebook gives only a brief description of what each block of code does.

## Goals

In this notebook, we set up a framework that extracts a list of Musical Moments from a given Vaudeville play. To do this, we:

* Set up a Pydantic `BaseModel` defining the aspects of a Musical Moment.
* Set up a `BaseModel` for a whole play (or scene) that we pass in, containing a list of Musical Moments.
* Bind an LLM to this structured output.
* Split the play into scenes, so the LLM analyzes one at a time.
    * With too much content at once, accuracy drops.
* Feed each scene into the LLM, and group the results back together into one object.
* Convert and export the final object, containing an entire play, as a CSV file.

## Full Workflow

Below, the full workflow is grouped into one sequence. You can change the `pdf_filepath` variable, run all the cells below, and a CSV will be exported into your directory.

### Loading the PDF

In [78]:
# Hit execute cells below once you add your pdf_filepath

<div style="
  background:#00e676;           /* bright green */
  color:#111;                   /* dark text for strong contrast */
  padding:14px 18px;
  border-left:6px solid #00C853;/* slightly darker green accent */
  border-radius:10px;
  font-weight:600;
  line-height:1.4;
">
  Once you're ready to process a PDF, upload your PDF then change the filepath below. Create a folder in your workspace called `csv_outputs`. Then restart your kernel and run all. 
</div>

In [None]:
pdf_filepath = "Files/PDFs/Scribe-Cornu_-_La_chanoinesse.pdf" 
# Place your PDF filepath here

In [None]:
# This is what your output will be named later
csv_filename = pdf_filepath.replace("Files/PDFs/","").replace(".pdf",".csv") 
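
The chained `replace` calls above only strip the folder when the PDF lives under `Files/PDFs/`. A minimal alternative, assuming the PDF may sit in any folder, is to let `pathlib` derive the name. `csv_name_for` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def csv_name_for(pdf_path: str) -> str:
    """Return the PDF's file name with its extension swapped to .csv."""
    # with_suffix swaps only the final extension; .name drops every parent folder.
    return Path(pdf_path).with_suffix(".csv").name


print(csv_name_for("Files/PDFs/Scribe-Cornu_-_La_chanoinesse.pdf"))
```

Since the export cell already calls `os.path.basename` on the name, this mainly removes the dependence on one hard-coded folder.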

In [None]:
# Setting up pdf loading

from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from typing_extensions import List, TypedDict, Optional
from langchain_core.prompts import ChatPromptTemplate


async def loadPDF(filepath: str) -> list:
    loader = PyPDFLoader(filepath)
    pages = []
    async for page in loader.alazy_load():
        pages.append(page)   
    return pages

# filepath:str = input("Please enter the filepath: ")
source = await loadPDF(pdf_filepath)

# Keep only the file name (not the folder) in each page's metadata
for page in source:
    page.metadata['source'] = Path(page.metadata['source']).name

# Join all pages into one Document for the whole play
source_content = "".join(page.page_content for page in source)
source_full = Document(page_content=source_content, metadata=source[0].metadata)

### Splitting up the PDF

Here, we have the LLM return the scene headers as they appear in the text. Then, we split on these headers.

This is not perfect and a few headers are always missed, but it works well enough to split up the text into manageable chunks for the app. It also guarantees a split cannot occur in the middle of a scene.
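
A header returned by the model may also fail to match the text exactly, for example when the PDF extraction breaks it across lines. One possible mitigation, sketched here as an assumption rather than as part of this notebook, is a whitespace-tolerant search. `find_header` is a hypothetical helper and the sample text is invented.

```python
import re


def find_header(text: str, header: str, start: int = 0) -> int:
    """Return the index of `header` in `text` at or after `start`, or -1.

    Any run of whitespace in the header may match any run of whitespace
    (including line breaks) in the text.
    """
    tokens = header.split()
    if not tokens:
        return -1
    # Escape each token so punctuation such as "." is matched literally.
    pattern = re.compile(r"\s+".join(re.escape(token) for token in tokens))
    match = pattern.search(text, start)
    return match.start() if match else -1


# Invented sample: the first scene header is broken across two lines.
sample = "ACTE I.\nSCÈNE\n PREMIÈRE.\nHENRI, seul.\nSCÈNE II.\nHENRI, MARIE."
print(find_header(sample, "SCÈNE PREMIÈRE."))  # found despite the line break
print(find_header(sample, "SCÈNE III."))       # absent header
```

A function like this could stand in for `str.find` when locating headers. It would not recover headers the model never returned.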

In [81]:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o", model_provider="openai")
processing_llm = init_chat_model("gpt-4o-mini", model_provider="openai")

The `description` argument of each `Field` is what the LLM uses to find each variable within a scene. For each variable, we want to tell it *what it is*, and *where to find it*.

In [82]:
from typing import Optional, List
from pydantic import BaseModel, Field

class Scene(BaseModel):
    """A single scene from a Vaudeville play"""

    act: int = Field(description="The act number or label as it appears in the text.")
    scene: int = Field(description="The scene number or label as it appears in the text.")
    header: str = Field(description="The exact scene header line, copied verbatim from the text.")

class FullPlay(BaseModel):
    """A full play, that has yet to be broken into individual scenes."""

    all_scenes: List[Scene] = Field(description="A list of every single scene's header and label - each as a Scene object.")

formatted_splitter_llm = processing_llm.with_structured_output(FullPlay)

Here, we're setting up the first LLM "node" in our app. This node returns the scene headers so that we can split the play up.

In [83]:
splitter_prompt_template = """
The following is the full text of a French Vaudeville play. Your job is to identify every scene boundary.
For each scene, return:
- The act number (as it appears in the text)
- The scene number (as it appears in the text)
- The exact scene header line (copy it verbatim from the text)

Return a list of objects like:
{{"act": 1, "scene": 2, "header": "..."}}

Do not attempt to count character indexes. Only return the scene headers as they appear in the text.

Play Content:

{play_text}
"""

In [84]:
def split_up_play(doc):
    """Ask the LLM for every scene header in the given Document."""
    # Fill the template with this document's text rather than relying on a global
    splitter_prompt = splitter_prompt_template.format(play_text=doc.page_content)
    return formatted_splitter_llm.invoke(splitter_prompt)

In [85]:
all_indexes = split_up_play(source_full)

This code is fairly involved, so don't worry if you can't follow every line. In essence, it finds the scene headers that the LLM provided and splits the text on them, returning a list of scenes.

In [86]:
from langchain_core.documents import Document

all_splits = []
scene_headers = all_indexes.all_scenes
full_text = source_full.page_content

# Pass 1: locate each header, searching forward from the previous match.
# A header that is not found is recorded as -1 and does not move the cursor,
# so one miss does not hide every later scene.
starts = []
cursor = 0
for scene in scene_headers:
    idx = full_text.find(scene.header, cursor)
    if idx == -1:
        print(f"Scene header not found: {scene.header}")
    else:
        cursor = idx + len(scene.header)
    starts.append(idx)

# Pass 2: slice from each located header to the next located one (or the end of the text).
# One Document is kept per header so positions still line up with all_indexes.all_scenes;
# a scene whose header was not found gets empty text, and its content stays in the previous chunk.
for i, scene in enumerate(scene_headers):
    if starts[i] == -1:
        scene_text = ""
    else:
        end_idx = next((s for s in starts[i + 1:] if s != -1), len(full_text))
        scene_text = full_text[starts[i]:end_idx]
    doc = Document(page_content=scene_text, metadata={"act": scene.act, "scene": scene.scene, "header": scene.header})
    all_splits.append(doc)
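
To see the idea in isolation, the sketch below applies the same find-and-slice approach to a made-up three-scene string. `split_on_headers` is a hypothetical helper and the sample text is invented. It returns plain strings rather than `Document` objects.

```python
def split_on_headers(text: str, headers: list[str]) -> list[str]:
    """Cut `text` into chunks that each begin with one of `headers`.

    Headers are searched in order, each from the end of the previous match.
    Headers that are not found are skipped, so their text stays in the
    preceding chunk.
    """
    starts = []
    cursor = 0
    for header in headers:
        idx = text.find(header, cursor)
        if idx != -1:
            starts.append(idx)
            cursor = idx + len(header)
    # Each chunk runs from its header to the next located header (or the end).
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


play = "SCENE I. Bonjour. SCENE II. Air: la la. SCENE III. Adieu."
for chunk in split_on_headers(play, ["SCENE I.", "SCENE II.", "SCENE III."]):
    print(repr(chunk))
```

Because every chunk starts at a header and ends at the next one, a cut can only fall on a scene boundary that the header list knows about.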

### Setting up the Pydantic object and LangGraph

This is where we define what we want back. This is usually the first thing you'll do. As you can see, we're asking it to find a whole lot of things within the MusicalMoment: characters, rhyme scheme, etc.

In [87]:
from typing import Optional
from pydantic import BaseModel, Field


# Pydantic
class MusicalMoment(BaseModel):
    """Many of these musical moments reuse some preexisting (and often well-known) melody or tune. These are variously called "melodie" or "air", and identified with a short title that refers in some way to an opera or collection of melodies from which it was drawn. The titles might include the names of works, or other characters in those original works. In the context of the plays, these tunes become the vehicle for newly composed lyrics, which are normally rhymed, and which normally follow the poetic scansion and structure of the original lyrics. Rhyme, versification and structure are thus of interest to us."""

    act: int = Field(description="The act number in which this musical moment takes place. Will be labeled at the top of the act or scene in which it takes place.")
    scene: int = Field(description="The scene number in which the musical moment takes place. Will be labeled at the top of the scene.")
    number: int = Field(description = "The index of the musical moment in the scene. For example, if this is the first musical moment in the scene, this should be 1.")
    characters: list[str] = Field(description="The character or characters who are singing (or otherwise making music) within this specific musical moment.")
    dramatic_situation: str = Field(description="the dramatic situation (a love scene, a crowd scene) in which the musical moment is occurring")
    air_or_melodie: str = Field(description="The title of the 'air' or 'melodie' on which the musical moment is based. It will be labeled in the text as 'air' or 'melodie'.")
    poetic_text: str = Field(description="The text from the music number. Do not include stage directions, only the lyrics sung by the characters in this musical moment")
    rhyme_scheme: str = Field(description = "The rhyme scheme for the poetic text in the musical moment. For example, sentences that end in 'tree' 'be' 'why' and 'high' would have a rhyme scheme of AABB.")
    poetic_form: str = Field(description="form of the poetic text, which might involve some refrain")
    end_of_line_accents: list[str] = Field(description = "the end accent for each line (masculine or féminine)")
    syllable_count_per_line: list[int] = Field(description = "The number of syllables per line. Look out for contractions and colloquialisms that might make the count of syllables less than obvious. Normally a word like ‘voilà’ would of course have 2 syllables. But the musical rhythm of a particular melodie might require that it be _sung_ as only one syllable, as would be the case if the text reads ‘v’la’. Similarly ‘mademoiselle’ would have 4 syllables in spoken French. But the musical context might make it sound like 5. Or a character speaking dialect might sing ‘Mam’zelle’, which would have only 2 (or perhaps 3) syllables.")
    irregularities: Optional[str] = Field(description="any irregularities within the musical number")
    stage_direction_or_cues: Optional[str] = Field(description="any stage directions, which tell a character what to do, but aren't a part of another character's dialogue. These are usually connected with a character’s name, and often are in some contrasting typography (italics, or in parentheses - though this may not be picked up by the filereader).  Sometimes these directions even happen in the midst of a song! In a related way there are sometimes ‘cues’ for music, or performance (as when there is an offstage sound effect, or someone is humming) Most times the stage directions appear just before or after the song text. But sometimes they appear in the midst of the texts. The directions should be reported here and not in the transcription of the poem.")
    reprise: Optional[str] = Field(description="there are sometimes directions that indicate the ‘reprise’ of some earlier number or chorus.")

class VaudevillePlay(BaseModel):
    musicalMoments: list[MusicalMoment] = Field(description="""A list of musical moments in a Vaudeville play, as MusicalMoment objects. Many of these musical moments reuse some preexisting (and often well-known) melody or tune. These are variously called "melodie" or "air", and identified with a short title that refers in some way to an opera or collection of melodies from which it was drawn. The titles might include the names of works, or other characters in those original works. In the context of the plays, these tunes become the vehicle for newly composed lyrics, which are normally rhymed, and which normally follow the poetic scansion and structure of the original lyrics. Rhyme, versification and structure are thus of interest to us.""")

structured_llm = llm.with_structured_output(VaudevillePlay)

This is our system prompt. A common technique is to tell the model that it's an *expert* in your field. It seems strange, but it often helps.

In [88]:
system_prompt = """
You are a literary analyst specializing in French Vaudeville plays from the 18th century. 
Your goal is to identify each musical moment in the text, and for each, extract detailed structured information, 
including act, scene, characters, dramatic situation, air or melodie, poetic text, rhyme scheme, poetic form, end-of-line accents, syllable count, and any irregularities. 
Some parts of the text were slightly misinterpreted by the file reader (e.g., missing spaces or strange line breaks).
"""
human_prompt = """
Given the following chunk of the play, analyze and return the musical moments as a structured VaudevillePlay object.
"""



Here, we set up a basic LangGraph sequence. This specific version of the app uses an index to go through the list of scenes, and is thus called from a for loop.

In [89]:
from typing_extensions import List, TypedDict, Optional
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system",system_prompt),
    ("human","Context:\n{context}\n\nQuestion:\n{question}")
     ])

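class State(TypedDict):
    index: int                # position of the scene in all_splits
    context: Document         # the scene text retrieved for that index
    answer: VaudevillePlay    # structured output returned by the LLM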

def check_index(state: State):
    return state

def retrieve_doc(state: State):
    document = all_splits[state["index"]]
    return {"context": document}

def generate(state: State):
    # Prefix the scene text with its act and scene label so the LLM can fill those fields
    scene = all_indexes.all_scenes[state["index"]]
    context = f'Act {scene.act}, Scene {scene.scene}:\n\n {state["context"].page_content}'
    message = prompt.invoke({"question": human_prompt, "context": context})
    response = structured_llm.invoke(message)
    return {"answer": response}

from langgraph.graph import START, StateGraph

graph_builder = StateGraph(State).add_sequence([check_index, retrieve_doc, generate])
graph_builder.add_edge(START, "check_index")
graph = graph_builder.compile()
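
As a mental model only (this is not how LangGraph is implemented), a sequence of nodes can be pictured as plain functions that each receive the current state and return a partial update, which is then merged back into the state. In the sketch below, `run_sequence` is a hypothetical name, the scene strings are invented, and the LLM call is replaced by a stand-in.

```python
from typing import Callable

StateDict = dict
Node = Callable[[StateDict], StateDict]


def run_sequence(nodes: list[Node], initial: StateDict) -> StateDict:
    """Call each node in order, merging its returned keys into the state."""
    state = dict(initial)
    for node in nodes:
        update = node(state) or {}
        state.update(update)  # a later node overwrites keys set by an earlier one
    return state


scenes = ["SCENE I. Bonjour.", "SCENE II. Air: la la."]  # invented stand-in for all_splits


def retrieve_doc(state: StateDict) -> StateDict:
    return {"context": scenes[state["index"]]}


def generate(state: StateDict) -> StateDict:
    # Stand-in for the LLM call: just report the length of the scene text.
    return {"answer": len(state["context"])}


print(run_sequence([retrieve_doc, generate], {"index": 1}))
```

Seen this way, `graph.invoke({"index": i})` starts from a state holding only `index` and comes back with `context` and `answer` filled in.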

### Analyzing the scenes and merging them together

This is the new part of the workflow. It goes through each scene, invokes the LangGraph graph, then merges the results from all of the scenes into one list. Then, it exports that list as a CSV (named after the PDF) to a `csv_outputs` folder in your directory. In the GitHub repository, you can view the results for the 7 plays I tested. Or, you can follow this [link](https://docs.google.com/spreadsheets/d/1WBVLnW_EfVwT60LsOVykd4lNaYEB0NAPvJUaMvyyBrM/edit?usp=sharing) to see the spreadsheet on Google Sheets.

The export cell creates the `csv_outputs` folder if it does not already exist.

In [90]:
def analyze_scenes(docs: List[Document]) -> List[MusicalMoment]:
    all_moments: List[MusicalMoment] = []

    for i,doc in enumerate(docs):
        response = graph.invoke({"index": i})
        moments = response["answer"].musicalMoments
        all_moments.extend(moments)
    
    return all_moments

all_moments = analyze_scenes(all_splits)
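
In the loop above, one scene that raises an error (a timeout or a malformed response, say) stops the whole run. A possible hedge, sketched here with a hypothetical `analyze_all` helper and a stand-in `analyze_one` callable in place of `graph.invoke`, is to catch the error, record the scene index, and continue.

```python
from typing import Callable


def analyze_all(n_scenes: int, analyze_one: Callable[[int], list]) -> tuple[list, list[int]]:
    """Run `analyze_one` for each scene index; collect results and failed indexes."""
    results: list = []
    failed: list[int] = []
    for i in range(n_scenes):
        try:
            results.extend(analyze_one(i))
        except Exception as exc:  # broad on purpose: any failure skips only this scene
            print(f"Scene {i} failed: {exc}")
            failed.append(i)
    return results, failed


def fake_analyze(i: int) -> list:
    # Stand-in for graph.invoke({"index": i})["answer"].musicalMoments
    if i == 1:
        raise ValueError("simulated API error")
    return [f"moment from scene {i}"]


moments, failed = analyze_all(3, fake_analyze)
print(moments, failed)
```

The list of failed indexes makes it possible to retry only those scenes afterwards instead of re-running the whole play.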

Finally, we dump each result into a dictionary, then export the list of dictionaries as a CSV file.

In [91]:
import csv
import os

# Convert all MusicalMoment objects to dicts
moments_dicts = [moment.model_dump() for moment in all_moments]

# Get all field names from the first moment
fieldnames = moments_dicts[0].keys() if moments_dicts else []

# Ensure the output folder exists before writing the CSV
output_folder = "csv_outputs"
os.makedirs(output_folder, exist_ok=True)

# Build the output path
output_path = os.path.join(output_folder, os.path.basename(csv_filename))

with open(output_path, "w", newline='', encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in moments_dicts:
        # Convert lists to strings for CSV output
        for key, value in row.items():
            if isinstance(value, list):
                row[key] = "; ".join(str(v) for v in value)
        writer.writerow(row)
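
Because list fields are flattened with `"; "` before writing, anyone reading the CSV back needs to split them again. The sketch below assumes the list-valued column names from `MusicalMoment` and uses an in-memory file with invented values in place of a real export. `LIST_COLUMNS` and `read_moments` are hypothetical names.

```python
import csv
import io

# Columns that were lists before export (per the MusicalMoment schema).
LIST_COLUMNS = {"characters", "end_of_line_accents", "syllable_count_per_line"}


def read_moments(csvfile) -> list[dict]:
    """Read exported rows and turn '; '-joined cells back into lists of strings."""
    rows = []
    for row in csv.DictReader(csvfile):
        for key in LIST_COLUMNS.intersection(row):
            cell = row[key]
            row[key] = cell.split("; ") if cell else []
        rows.append(row)
    return rows


# Invented two-column example standing in for a real export.
demo = io.StringIO("characters,syllable_count_per_line\r\nHENRI; MARIE,8; 8; 6\r\n")
print(read_moments(demo))
```

Values come back as strings, so syllable counts would still need `int(...)`. A list item that itself contains `"; "` would be split incorrectly, which is a limitation of the flat format.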

## Conclusions

This system works well, and serves as a solid template for a Structured Output app.

However, some improvements are needed for this to reach its most accurate form. Most notably, the descriptions for the variables within the Pydantic schema lack subject expertise. To reach its most effective form, an expert in these plays would have to write these descriptions.

That being said, this system is impressive as is. Reading through all 7 Vaudeville plays (~50 pages each) only took about 20 minutes of passive runtime and a few dollars with the OpenAI API. 