# Vaudeville: Structured Output

## Goals

In this notebook, we want to set up a framework to extract a list of Musical Moments from a given Vaudeville play. To do this, we: 

* Set up a Pydantic `BaseModel` defining the aspects of a Musical Moment.
* Set up a `BaseModel` for a whole play (or scene) containing a list of Musical Moments.
* Bind an LLM to this structured output.
* Split the play up into scenes, so the LLM analyzes one at a time.
    * With too much content at once, the results become less accurate.
* Feed each scene into the LLM, and group the results back together into one object.
* Convert and export the final object, containing an entire play, as a CSV file

## Setting up the chat model and pydantic schema

Here, we use gpt-4o as our model, because this is a fairly complex task.

In [21]:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o", model_provider="openai")

The `description` argument of each `Field` is what the LLM uses to find that variable within a scene. For each variable, we want to tell it *what it is*, and *where to find it*.

In [22]:
from typing import Optional
from pydantic import BaseModel, Field


# Pydantic
class MusicalMoment(BaseModel):
    """Many of these musical moments reuse some preexisting (and often well-known)  melody or tune.  These are variously called "melodie”, or “air”, and identified with a short title that refers in some way to an opera or collection of melodies from which it was drawn.  The titles might include the names of works, or other characters in those original works. In the context of the plays, these tunes become the vehicle for newly composed lyrics, which are normally rhymed, and which normally follow the poetic scansion and structure of the original lyrics.  Rhyme, versification and structure are thus of interest to us."""

    act: int = Field(description="The act number in which this musical moment takes place. Will be labeled at the top of the act or scene in which it takes place.")
    scene: int = Field(description="The scene number in which the musical moment takes place. Will be labeled at the top of the scene.")
    number: int = Field(description = "The index of the musical moment in the scene. For example, if this is the first musical moment in the scene, this should be 1.")
    characters: list[str] = Field(description="the character or characters who are singing (or otherwise making music) within this specific musical moment,")
    dramatic_situation: str = Field(description="the dramatic situation (a love scene, a crowd scene) in which the musical moment is occurring")
    air_or_melodie: str = Field(description="The title of the 'air' or 'melodie' on which the musical moment is based. It will be labeled in the text as 'air' or 'melodie'.")
    poetic_text: str = Field(description="The text from the music number. Do not include stage directions, only the lyrics sung by the characters in this musical moment")
    rhyme_scheme: str = Field(description = "The rhyme scheme for the poetic text in the musical moment. For example, sentences that end in 'tree' 'be' 'why' and 'high' would have a rhyme scheme of AABB.")
    poetic_form: str = Field(description="form of the poetic text, which might involve some refrain")
    end_of_line_accents: list[str] = Field(description = "the end accent for each line (masculine or féminine)")
    syllable_count_per_line: list[int] = Field(description = "the number of syllables per line. Look out for contractions and colloquialisms that might make the count of syllables less than obvious. Normally a word like ‘voilà’ would of course have 2 syllables. But the musical rhythm of a particular melodie might require that it be _sung_ as only one syllable, as would be the case if the text reads ‘v’la’. Similarly ‘mademoiselle’ would have 4 syllables in spoken French. But the musical context might make it sound like 5. Or a character speaking dialect might sing “Mam’zelle”, which would have only 2 (or perhaps 3) syllables.")
    irregularities: Optional[str] = Field(description="any irregularities within the musical number")
    stage_direction_or_cues: Optional[str] = Field(description="any stage directions, which tell a character what to do, but aren't a part of another character's dialogue. These are usually connected with a character’s name, and often are in some contrasting typography (italics, or in parentheses - though this may not be picked up by the filereader).  Sometimes these directions even happen in the midst of a song! In a related way there are sometimes ‘cues’ for music, or performance (as when there is an offstage sound effect, or someone is humming) Most times the stage directions appear just before or after the song text. But sometimes they appear in the midst of the texts. The directions should be reported here and not in the transcription of the poem.")
    reprise: Optional[str] = Field(description="there are sometimes directions that indicate the ‘reprise’ of some earlier number or chorus.")
    


structured_llm = llm.with_structured_output(MusicalMoment)

# structured_llm.invoke("Create a musical moment")

Below is the datatype it will return for each scene/play - a list of Musical Moments.

In [23]:
class VaudevillePlay(BaseModel):
    musicalMoments: list[MusicalMoment] = Field(description="""A list of musical moments in a Vaudeville play, as MusicalMoment objects. Many of these musical moments reuse some preexisting (and often well-known)  melody or tune.  These are variously called "melodie”, or “air”, and identified with a short title that refers in some way to an opera or collection of melodies from which it was drawn.  The titles might include the names of works, or other characters in those original works. In the context of the plays, these tunes become the vehicle for newly composed lyrics, which are normally rhymed, and which normally follow the poetic scansion and structure of the original lyrics.  Rhyme, versification and structure are thus of interest to us.""")
    

structured_llm = llm.with_structured_output(VaudevillePlay)

Here, we define the system prompt. We include a sentence about file readers missing spaces and breaking lines, so that the LLM knows to look out for it. We have found that the LLM is highly effective in correcting these inaccuracies if we mention them. 

In [None]:
system_prompt = """
You are a literary analyst specializing in French Vaudeville plays from the 19th century. 
Your goal is to identify each musical moment in the text, and for each, extract detailed structured information, 
including act, scene, characters, dramatic situation, air or melodie, poetic text, rhyme scheme, poetic form, end-of-line accents, syllable count, and any irregularities. 
Some parts of the text were slightly misinterpreted by the file reader (e.g., missing spaces or strange line breaks).
"""
human_prompt = """
Given the following chunk of the play, analyze and return the musical moments as a structured VaudevillePlay object.
"""



### Basic Test Case

Here, we simply feed it a scene to see what it pulls out. The code is commented out to ensure it isn't accidentally run again.

In [None]:
# moments = structured_llm.invoke(r"""Analyze the musical moment within the chunk of this Vaudeville play. Some parts of the text were slightly misinterpreted by the filereader, especially missing spaces. Here is the text from the play:
                      
#                       ACTE I, SCÈNE VI. 11
# DERBIGNY.
# Elle n'est pasassezmontante, puisqu'il fautvousdirelemot.
# (Pendantcequi suit, Paulinevaaufondetmetsonchapeau.)
# PAMELA.
# Onlesporte comme ça... voyez, moi?... ça avantage. (Elle
# ouvresonchâle.)
# DERBIGNY,sévèrement.
# C'estimmodeste, mademoiselle...Jevouspried'y ajouterquel
# quesmillimètres...
# PAMELA.
# Commevousvoudrez; chacun son goût... maisvousavez tort.
# DERBIGNY, sévèrement.
# Mademoiselle...jevous prie d'y ajouter quelquesmillimètres!
# PAMELA.
# Nevousfâchez pas... je la ferai montante jusqu'au boutdu
# nez... qu'est-cequeçamefait?
# AIR:Desommeillerencor, machère.
# Dubongoût,jen'suispasl'enn'mie,
# Maisçam'estbien égal,mafoi!
# Ens'env'loppantcommeun' momie,
# Mam'zelleyperdraplusquemoi;
# J'respectevosscrupul'sbarbares,
# Maisjetiensàmesopinions;
# Etjedéteste lesavares
# Quicachentleursnapoléons.
# (Elleremonte.)
# DERBIGNY.
# Eh bien! etPauline?...tuneluisouhaitespaslebonjour?
# ISIDORE.
# Si, monpapa...
# PAULINE, quiaredescendu lascène, etd'un tonrailleur.
# Oh! moi, j'ail'habitude d'ètreoubliée d'Isidore... Il metraite
# enamie...Je necomptepas.
# ISIDORE, allantàPauline.
# Bonjour,Pauline...* (RegardantPaméla.)Elleestencoremieux
# deprès.
# PAMELA, àpart.
# Qu'est-cequ'iladonc àmeregarder, ce petit?(Haut.)Etcette
# robe, mademoiselle?
# PAULINE
# Là, dansma chambre.
# FAMELA.
# Ça sera fait endeuxtemps...J'ai aissé del'étoffe endedans.
# (Eileentredanslachambrededroitetroisièmeplan.)
# *Derbigny,Pauline,Isidore;Pamélaausecond plan.
# 12 UN DOCTEUR EN HERBE.
# ISIDORE, àpart, désolé.
# Oh!...elles'enva!...*
# DERBIGNY.
# Ah!ça, avantderetourneràBriare, nous avons quelques em-
# plettesàfaire,offretonbras à Pauline. (Bas, enpassantderrière
# lui.)Tuferastapaixenroute*.
# ISIDORE,embarrassé.
# Monpapa... c'estque...
# DERBIGNY.
# Quoiencore?
# ISIDORE.
# C'estdemainl'examen.
# DERBIGNY.
# Nem'as-tupasditqueta compositionestfaite?
# ISIDORE.
# Oui, papa, oui... maison croitêtreprêt, et puis... il arrive...
# vouscomprenez...
# DERBIGNY.
# Pastrop!
# PAULINE,riant.
# MonDieu, Isidore, que vous avezl'airdrôle!.. Dites donctout
# desuitequevousavezbesoinderevoirvotretravail….. noussorti-
# ronsbiensans vous...Onnevous envoudrapas...
# ISIDORE.
# Oh! merci, mademoisellePauline!
# PAULINE.
# Iln'y apasdequoi, allez!
# DERBIGNY.
# Queneledisais-tu?...le devoir avanttout. (Ildonnelebrasà
# Pauline.)
# ELBMESNE.
# AIR: DeLuciede Lamermoor***.
# Allons,courageetpersévère;
# Soisbienattentif, etdemain,
# Avecsuccès, oui,je l'espère,
# Tupasserastonexamen.
# ISIDORE.
# Jetrembleetjemedésespère,
# Quandjesongequec'est demain
# Que,devantunjugesévère,
# Jedoispassermonexamen.
# PAULINE.
# Jevoisce quiledésespère,
# Quandjesongequec'estdemain
# Que,devantunjugesévère,
# Ildoitpassersonexamen.
# *Pauline, Derbigny, Isidore.
# **Pauline, Isidore, Derbigny.
# ***Pauline,Derbigny,Isidore,
#                       """)

In [8]:
for musicalMoment in moments.musicalMoments:
    for attribute in musicalMoment:
        print(attribute)
    print("\n")

('act', 1)
('scene', 6)
('characters', ['Pamela', 'Derbigny'])
('dramatic_situation', 'Derbigny is reprimanding Pamela for her immodest dressing style while she defends her fashion choice.')
('air_or_melodie', 'De sommeiller encor, ma chère')
('poetic_text', "Dubon goût, je n'suis pas l'enn'mie, Mais çam'est bien égal, ma foi! En s'env'loppant comme un' momie, Mam'zelle yperdr'a plusque moi; J'respecte vos scrupul's barbares, Mais je tiens à mes opinions; Et je déteste les avares Qui cachent leurs napoléons.")
('rhyme_scheme', 'ABABCDCD')
('poetic_form', 'Verse with alternating rhymes, addressing the situation at hand.')
('end_of_line_accents', ['feminine', 'masculine', 'feminine', 'masculine', 'feminine', 'masculine', 'feminine', 'masculine'])
('syllable_count_per_line', [12, 8, 12, 8, 12, 8, 12, 8])
('irregularities', None)


('act', 1)
('scene', 6)
('characters', ['Isidore', 'Pauline', 'Derbigny'])
('dramatic_situation', 'Pauline encourages Isidore to focus on his exam while Derbign

Our Pydantic object is working, and the model did a good job of extracting the requested information.
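Several of the fields describe the same lines of verse, so a quick consistency check can catch obviously malformed results. The sketch below is an illustration only: it assumes each moment has been turned into a plain dict (for example with Pydantic v2's `model_dump()`), and the helper names `per_line_lengths` and `is_consistent` are hypothetical. It compares the lengths of the per-line fields and does not judge whether the values themselves are right.

```python
def per_line_lengths(moment: dict) -> dict[str, int]:
    """Count how many lines each per-line field claims to describe."""
    return {
        # Count only letters so that 'ABAB CDCD' and 'ABABCDCD' both give 8
        "rhyme_scheme": sum(ch.isalpha() for ch in moment["rhyme_scheme"]),
        "end_of_line_accents": len(moment["end_of_line_accents"]),
        "syllable_count_per_line": len(moment["syllable_count_per_line"]),
    }


def is_consistent(moment: dict) -> bool:
    """True when every per-line field describes the same number of lines."""
    return len(set(per_line_lengths(moment).values())) == 1


# Values copied from the first musical moment printed above
sample = {
    "rhyme_scheme": "ABABCDCD",
    "end_of_line_accents": ["feminine", "masculine"] * 4,
    "syllable_count_per_line": [12, 8, 12, 8, 12, 8, 12, 8],
}
print(per_line_lengths(sample))
print(is_consistent(sample))
```

A moment that fails this check is a good candidate for manual review or for re-running that scene.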

## Loading Test Files

In [None]:
# Setting up pdf loading

from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from typing_extensions import List, TypedDict, Optional
from langchain_core.prompts import ChatPromptTemplate


async def loadPDF(filepath: str) -> list:
    loader = PyPDFLoader(filepath)
    pages = []
    async for page in loader.alazy_load():
        pages.append(page)   
    return pages

source = await loadPDF("Files/PDFs/La_Dette_d_honneur.pdf")

# Keep only the file name in the metadata, whatever directory prefix the loader recorded
for page in source:
    page.metadata['source'] = Path(page.metadata['source']).name

# Join all pages into one Document for the whole play
source_content = ""
for page in source:
    source_content += page.page_content
source_full = Document(page_content = source_content, metadata = source[0].metadata)


In [None]:
print(source_full.page_content)

## Converting a play into scene chunks

A few possible approaches here:
* Run a basic script to split the string at "Acte" or "Scene" (regex)
    * We may have to clean the input file first. This could be done with a cheap LLM call to gpt-4o-mini.
* Use a semantic splitter, which should get most boundaries right because content generally differs noticeably between scenes.
    * It could get a few things wrong, including splitting in the middle of a musical moment.
* Have the "cleaning" LLM call also split by act and scene
    * Unsuccessful approach: Having it return the start and end characters of the acts.
    * Successful approach: Have a smaller LLM model look through the play and return a list of scene headers as they appear. Then, use regex splitting to get one scene at a time.

We include two approaches. The first is an example of what not to do, kept for pedagogical purposes.
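Neither of the two is the plain regex option from the list above. For reference, that option could look roughly like the sketch below. The pattern is an assumption about what headers look like (a line beginning with "Scène" followed by "première" or a Roman numeral); real file-reader output, where spaces go missing or the header shares a line with the act label, would need a looser pattern. `SCENE_HEADER` and `split_on_scene_headers` are hypothetical names.

```python
import re

# Assumed header shape: "Scène première", "SCÈNE VI", ... at the start of a line
SCENE_HEADER = re.compile(
    r"^[ \t]*SC[ÈE]NE\s+(PREMI[ÈE]RE|[IVXLC]+)\b",
    re.IGNORECASE | re.MULTILINE,
)


def split_on_scene_headers(text: str) -> list[str]:
    """Split text into chunks that each begin at a scene header.

    Anything before the first header (title page, cast list) is dropped.
    """
    starts = [m.start() for m in SCENE_HEADER.finditer(text)]
    if not starts:
        return [text]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


sample = "Title page\nScène première\nA. Bonjour.\nScène II\nB. Adieu.\n"
chunks = split_on_scene_headers(sample)
for chunk in chunks:
    print(repr(chunk))
```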

### First approach - did not work: Character indexes

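In this approach, we ask an LLM for the start and end character indexes of each scene. This did not work, most likely because language models read text as tokens rather than characters and cannot reliably count character offsets.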

In [2]:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o", model_provider="openai")
processing_llm = init_chat_model("gpt-4o-mini", model_provider="openai")

In [10]:
from typing import Optional, List
from pydantic import BaseModel, Field

class Scene(BaseModel):
    """A single scene from a Vaudeville play"""

    act: int = Field(description="The act number. Will be labeled at the top of the act or scene in which it takes place.")
    scene: int = Field(description="The scene number within the act. Will be labeled at the top of the scene. For example, if it is Act 3 Scene 2, this value should be 2.")
    start_character_number: int = Field(description="In terms of the entire document, what is the index of the first character of the scene, including the label of the scene, and act if mentioned next to the scene number (mostly only applicable if scene 1 of an act). For example, if the whole document was 'Play 1: Title. Act 1, Scene 2. Content.' then the start index would be 15, since 'Act' starts on the 16th character.")
    end_character_number: int = Field(description="In terms of the entire document, the index of the last character of the scene. It should be the last character before the next scene or act is labeled.")

class FullPlay(BaseModel):
    """A full play, that has yet to be broken into individual scenes."""

    all_scenes: List[Scene] = Field(description="A list of every single scene's start and end - each as a Scene object containing indexes and the first and last character, so that it can easily be passed into a substring just containing the scene.")

formatted_splitter_llm = processing_llm.with_structured_output(FullPlay)

In [None]:
def split_up_play(doc):
    response = formatted_splitter_llm.invoke(f"The following is a Vaudeville play in French. Your job is to return the necessary indexes required to split it up into individual scenes. Thus, you will be looking for 'Acte' and 'Scène' throughout the text. For each scene, you will document the act, scene number (ex Act 4, Scene 1), start character index, and end character index as a Scene object, then add it to the list of scenes in the FullPlay object. All start and end indexes should be in terms of the full document; the goal is to create a substring using FullText[start_index:end_index] for each scene. After you've gone through the full document, you will return the FullPlay object.\n\nPlay Content:\n{doc.page_content}")
    return response

In [15]:
all_indexes = split_up_play(source_full)

On inspection, the returned indexes were clearly incorrect, so we tried another approach.
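One simple way to confirm that returned indexes are off is to check whether each slice actually begins at an act or scene label. The sketch below is illustrative, not the check used in this notebook. It works on plain `(start, end)` tuples, the names `HEADER_START` and `bad_spans` are hypothetical, and it only tests where each span begins, not where it ends.

```python
import re

# A slice is plausible if it starts with "Acte" or "Scène" (case-insensitive)
HEADER_START = re.compile(r"\s*(ACTE|SC[ÈE]NE)\b", re.IGNORECASE)


def bad_spans(full_text: str, spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return the spans that are out of range or do not begin with an act/scene label."""
    flagged = []
    for start, end in spans:
        in_range = 0 <= start < end <= len(full_text)
        if not in_range or not HEADER_START.match(full_text[start:end]):
            flagged.append((start, end))
    return flagged


text = "Acte I. Scène première. Bonjour. Scène II. Adieu."
second = text.index("Scène II")

correct = [(0, second), (second, len(text))]
shifted = [(0, second + 5), (second + 5, len(text))]  # off by a few characters

print(bad_spans(text, correct))
print(bad_spans(text, shifted))
```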

### Second Approach: Scene Headers and Regex

Here, we have the LLM return the scene headers as they appear in the text. Then, we split on these headers.

This is not perfect and a few headers are always missed, but it works well enough to split up the text into manageable chunks for the app. It also guarantees a split cannot occur in the middle of a scene.
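Because the model returns act and scene numbers along with each header, missed headers can often be spotted as gaps in the numbering. The following is a minimal sketch under the assumption that the scenes of each act are numbered consecutively from 1; the function name `missing_scenes` is hypothetical. It cannot detect scenes missed at the very end of an act.

```python
def missing_scenes(found: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Given (act, scene) pairs, list the pairs that are skipped in the numbering."""
    by_act: dict[int, set[int]] = {}
    for act, scene in found:
        by_act.setdefault(act, set()).add(scene)

    missing = []
    for act in sorted(by_act):
        scenes = by_act[act]
        # Every number from 1 up to the highest scene seen should be present
        for number in range(1, max(scenes) + 1):
            if number not in scenes:
                missing.append((act, number))
    return missing


found = [(1, 1), (1, 2), (1, 4), (2, 2)]
print(missing_scenes(found))
```

A missed header does not lose text: the scene is simply merged into the chunk of the scene before it.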

In [None]:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o", model_provider="openai")
processing_llm = init_chat_model("gpt-4o-mini", model_provider="openai")

In [24]:
from typing import Optional, List
from pydantic import BaseModel, Field

class Scene(BaseModel):
    """A single scene from a Vaudeville play"""

    act: int = Field(description="The act number or label as it appears in the text.")
    scene: int = Field(description="The scene number or label as it appears in the text.")
    header: str = Field(description="The exact scene header line, copied verbatim from the text.")

class FullPlay(BaseModel):
    """A full play, that has yet to be broken into individual scenes."""

    all_scenes: List[Scene] = Field(description="A list of every single scene's header and label - each as a Scene object.")

formatted_splitter_llm = processing_llm.with_structured_output(FullPlay)

In [26]:
prompt = f"""
The following is the full text of a French Vaudeville play. Your job is to identify every scene boundary.
For each scene, return:
- The act number (as it appears in the text)
- The scene number (as it appears in the text)
- The exact scene header line (copy it verbatim from the text)

Return a list of objects like:
{{"act": "...", "scene": "...", "header": "..."}}

Do not attempt to count character indexes. Only return the scene headers as they appear in the text.

Play Content: \n
{source_full.page_content}
"""

In [27]:
def split_up_play(doc):
    # Note: `prompt` above already embeds source_full.page_content, so `doc` is not used here
    response = formatted_splitter_llm.invoke(prompt)
    return response

In [28]:
all_indexes = split_up_play(source_full)

In [30]:
scenes: List[Scene] = all_indexes.all_scenes
for scene in scenes:
    print(f"Act {scene.act}, Scene {scene.scene} Header: {scene.header}")

Act 1, Scene 1 Header: Scène première
Act 1, Scene 2 Header: Scène II
Act 1, Scene 3 Header: Scène III
Act 1, Scene 4 Header: Scène IV
Act 1, Scene 5 Header: Scène V
Act 1, Scene 6 Header: Scène VI
Act 1, Scene 7 Header: Scène VII
Act 1, Scene 8 Header: Scène VIII
Act 1, Scene 9 Header: Scène IX
Act 2, Scene 1 Header: Scène première
Act 2, Scene 2 Header: Scène II
Act 2, Scene 3 Header: Scène III
Act 2, Scene 4 Header: Scène IV
Act 2, Scene 5 Header: Scène V
Act 2, Scene 6 Header: Scène VI
Act 2, Scene 7 Header: Scène VII
Act 2, Scene 8 Header: Scène VIII


In [39]:
from langchain_core.documents import Document

all_splits = []
scene_headers = all_indexes.all_scenes
full_text = source_full.page_content

prev_end_idx = 0
for i, scene in enumerate(scene_headers):
    # Find the start index of this scene's header after the previous end index
    start_idx = full_text.find(scene.header, prev_end_idx)
    if start_idx == -1:
        raise ValueError(f"Scene header not found: {scene.header}")

    # Determine the end index: start of next scene header, or end of document
    if i + 1 < len(scene_headers):
        next_start_idx = full_text.find(scene_headers[i + 1].header, start_idx + len(scene.header))
        if next_start_idx == -1:
            end_idx = len(full_text)
        else:
            end_idx = next_start_idx
    else:
        end_idx = len(full_text)

    scene_text = full_text[start_idx:end_idx]
    doc = Document(page_content=scene_text, metadata={"act": scene.act, "scene": scene.scene, "header": scene.header})
    all_splits.append(doc)
    prev_end_idx = end_idx

On inspection, the resulting splits were fairly accurate.

## Setting up the sequence

Here, we set up a basic LangGraph sequence. This specific version of the app uses an index to step through the list of scenes, and is thus called with a for loop.

In [None]:
from typing_extensions import List, TypedDict, Optional
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system",system_prompt),
    ("human","Context:\n{context}\n\nQuestion:\n{question}")
     ])

class State(TypedDict):
    index: int
    context: Document
    answer: VaudevillePlay  # structured output returned by structured_llm

def check_index(state: State):
    # all_splits is the list of per-scene Documents built above
    if "index" in state and isinstance(state["index"], int) and 0 <= state["index"] < len(all_splits):
        return state 
    raise ValueError("Need to include an index for list of documents.") 


def retrieve_doc(state: State):
    document = all_splits[state["index"]]
    return {"context": document}

def generate(state: State):
    message = prompt.invoke({"question":human_prompt,"context":state["context"].page_content})
    response = structured_llm.invoke(message)
    return {"answer": response}

from langgraph.graph import START, StateGraph

graph_builder = StateGraph(State).add_sequence([check_index, retrieve_doc, generate])
graph_builder.add_edge(START, "check_index")
graph = graph_builder.compile()
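The for loop mentioned above, which calls the graph once per scene and merges the results, might look like the sketch below. To keep it runnable without an API key, the example passes in a stand-in callable, `fake_invoke`, in place of `graph.invoke`. Both that stand-in and the helper name `collect_moments` are hypothetical.

```python
from types import SimpleNamespace
from typing import Callable


def collect_moments(invoke: Callable[[dict], dict], n_scenes: int) -> list:
    """Call `invoke` once per scene index and flatten the results into one list."""
    all_moments = []
    for index in range(n_scenes):
        result = invoke({"index": index})
        # Each answer carries the list of moments found in that one scene
        all_moments.extend(result["answer"].musicalMoments)
    return all_moments


def fake_invoke(state: dict) -> dict:
    """Stand-in for graph.invoke: returns one placeholder moment per scene."""
    play = SimpleNamespace(musicalMoments=[f"moment from scene {state['index']}"])
    return {"answer": play}


merged = collect_moments(fake_invoke, 3)
print(merged)
```

With the real graph, the call would pass `graph.invoke` and the number of scene documents instead of the stand-in.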

## Loading a Single Scene / PDF to Test

Here, we load one scene. We had to customize our sequence for this specific test, so the code 2 cells below is mostly a duplicate of the cell above.

In [None]:
scene_test = await loadPDF("Vaudeville/Tests and Outputs/Scene_1_La_Dette_d_honneur.pdf")

In [None]:
from langchain_core.documents import Document

scene_test_full = ""
for page in scene_test:
    scene_test_full += page.page_content
scene_test_final = Document(page_content = scene_test_full)
all_indexes = [scene_test_final]


from typing_extensions import List, TypedDict, Optional
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system",system_prompt),
    ("human","Context:\n{context}\n\nQuestion:\n{question}")
     ])

class State(TypedDict):
    index: int
    context: Document
    answer: VaudevillePlay  # structured output returned by structured_llm

def check_index(state: State):
    if "index" in state and isinstance(state["index"], int) and 0 <= state["index"] < len(all_indexes):
        return state 
    raise ValueError("Need to include an index for list of documents.") 


def retrieve_doc(state: State):
    document = all_indexes[state["index"]]
    return {"context": document}

def generate(state: State):
    message = prompt.invoke({"question":human_prompt,"context":state["context"].page_content})
    response = structured_llm.invoke(message)
    return {"answer": response}

from langgraph.graph import START, StateGraph

graph_builder = StateGraph(State).add_sequence([check_index, retrieve_doc, generate])
graph_builder.add_edge(START, "check_index")
graph = graph_builder.compile()

In [27]:
moments = graph.invoke({"index": 0})

In [31]:
moment = moments["answer"].musicalMoments[0]

In [75]:
moments = moments["answer"]
for musicalMoment in moments.musicalMoments:
    for attribute in musicalMoment:
        print(attribute)
    print("\n")

('act', 1)
('scene', 1)
('number', 1)
('characters', ['Pétronille'])
('dramatic_situation', 'Pétronille is recounting to Pauline how two young men, believed to be lovers, rented a room from her aunt.')
('air_or_melodie', 'De sommeiller encor, ma chère')
('poetic_text', 'Lorsqu’arrivés un jour par aventure,\nIls vinr’nt chez nous pour se loger tous deux ;\nOn vit tout d’suite à leur figure,\nQu’ça d’vait être des amoureux.\nPour qu’à leur gré tous les instants s’écoulent,\nMa tant’ s’est dit : ils s’raient mal au premier ;\nEt nuit et jour puisqu’ils roucoulent,\nJ’ m’en vas les mettre au colombier.')
('rhyme_scheme', 'ABAB CDCD')
('poetic_form', 'Quatrains')
('end_of_line_accents', ['feminine', 'masculine', 'feminine', 'masculine', 'feminine', 'masculine', 'feminine', 'masculine'])
('syllable_count_per_line', [13, 10, 9, 8, 11, 10, 10, 8])
('irregularities', None)


('act', 1)
('scene', 1)
('number', 2)
('characters', ['Pétronille'])
('dramatic_situation', 'Pétronille describes a visit

In [13]:
# FULL PLAY
moments = graph.invoke({"index": 0})

In [None]:
moments = moments["answer"]
for musicalMoment in moments.musicalMoments:
    for attribute in musicalMoment:
        print(attribute)
    print("\n")

The output has been removed because of its length. The test was successful.

## Full Workflow

Below, we have grouped the full workflow into one large sequence. To use it, change the `pdf_filepath` variable and run all cells below; a CSV file will be exported into your directory. Most of the code below is the same as above.
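The CSV export at the end of this workflow can be sketched with the standard library alone. This is one possible shape, not necessarily the export code this notebook uses. It assumes each musical moment has already been converted to a plain dict (for example with Pydantic v2's `model_dump()`), and the helper name `moments_to_csv` and the `" | "` separator for list fields are illustrative choices.

```python
import csv
import io


def moments_to_csv(rows: list[dict], out) -> None:
    """Write one CSV row per musical moment to the open text stream `out`."""
    if not rows:
        return
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    for row in rows:
        # Flatten list-valued fields (characters, accents, ...) into one cell
        flat = {
            key: " | ".join(map(str, value)) if isinstance(value, list) else value
            for key, value in row.items()
        }
        writer.writerow(flat)


rows = [
    {"act": 1, "scene": 6, "characters": ["Pamela", "Derbigny"], "rhyme_scheme": "ABABCDCD"},
]
buffer = io.StringIO()
moments_to_csv(rows, buffer)
print(buffer.getvalue())
```

To write a file instead of an in-memory buffer, open it with `open(csv_filename, "w", newline="", encoding="utf-8")` and pass the file object as `out`.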

### Loading the PDF

In [78]:
# Hit execute cells below once you add your pdf_filepath

In [1]:
pdf_filepath = "Files/PDFs/Scribe-Cornu_-_La_chanoinesse.pdf"
csv_filename = pdf_filepath.replace("Files/PDFs/","").replace(".pdf",".csv")

In [None]:
# Setting up pdf loading

from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from typing_extensions import List, TypedDict, Optional
from langchain_core.prompts import ChatPromptTemplate


async def loadPDF(filepath: str) -> list:
    loader = PyPDFLoader(filepath)
    pages = []
    async for page in loader.alazy_load():
        pages.append(page)   
    return pages

# filepath:str = input("Please enter the filepath: ")
source = await loadPDF(pdf_filepath)

for page in source:
    # Keep only the file name in the metadata
    page.metadata['source'] = Path(page.metadata['source']).name

source_content = ""
for page in source:
    source_content += page.page_content
source_full = Document(page_content = source_content, metadata = source[0].metadata)

### Splitting up the PDF

In [81]:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o", model_provider="openai")
processing_llm = init_chat_model("gpt-4o-mini", model_provider="openai")

In [82]:
from typing import Optional, List
from pydantic import BaseModel, Field

class Scene(BaseModel):
    """A single scene from a Vaudeville play"""

    act: int = Field(description="The act number or label as it appears in the text.")
    scene: int = Field(description="The scene number or label as it appears in the text.")
    header: str = Field(description="The exact scene header line, copied verbatim from the text.")

class FullPlay(BaseModel):
    """A full play, that has yet to be broken into individual scenes."""

    all_scenes: List[Scene] = Field(description="A list of every single scene's header and label - each as a Scene object.")

formatted_splitter_llm = processing_llm.with_structured_output(FullPlay)

In [83]:
prompt = f"""
The following is the full text of a French Vaudeville play. Your job is to identify every scene boundary.
For each scene, return:
- The act number (as it appears in the text)
- The scene number (as it appears in the text)
- The exact scene header line (copy it verbatim from the text)

Return a list of objects like:
{{"act": "...", "scene": "...", "header": "..."}}

Do not attempt to count character indexes. Only return the scene headers as they appear in the text.

Play Content: \n
{source_full.page_content}
"""

In [84]:
def split_up_play(doc):
    # Note: `prompt` above already embeds source_full.page_content, so `doc` is not used here
    response = formatted_splitter_llm.invoke(prompt)
    return response

In [85]:
all_indexes = split_up_play(source_full)

In [86]:
from langchain_core.documents import Document

all_splits = []
scene_headers = all_indexes.all_scenes
full_text = source_full.page_content

prev_end_idx = 0
for i, scene in enumerate(scene_headers):
    # Find the start index of this scene's header after the previous end index
    start_idx = full_text.find(scene.header, prev_end_idx)
    if start_idx == -1:
        # Skip headers that cannot be located verbatim instead of slicing from index -1
        print(f"Scene header not found: {scene.header}")
        continue

    # Determine the end index: start of next scene header, or end of document
    if i + 1 < len(scene_headers):
        next_start_idx = full_text.find(scene_headers[i + 1].header, start_idx + len(scene.header))
        if next_start_idx == -1:
            end_idx = len(full_text)
        else:
            end_idx = next_start_idx
    else:
        end_idx = len(full_text)

    scene_text = full_text[start_idx:end_idx]
    doc = Document(page_content=scene_text, metadata={"act": scene.act, "scene": scene.scene, "header": scene.header})
    all_splits.append(doc)
    prev_end_idx = end_idx

### Setting up the pydantic object and LangGraph

In [87]:
from typing import Optional
from pydantic import BaseModel, Field


# Pydantic
class MusicalMoment(BaseModel):
    """Many of these musical moments reuse some preexisting (and often well-known)  melody or tune.  These are variously called "melodie”, or “air”, and identified with a short title that refers in some way to an opera or collection of melodies from which it was drawn.  The titles might include the names of works, or other characters in those original works. In the context of the plays, these tunes become the vehicle for newly composed lyrics, which are normally rhymed, and which normally follow the poetic scansion and structure of the original lyrics.  Rhyme, versification and structure are thus of interest to us."""

    act: int = Field(description="The act number in which this musical moment takes place. Will be labeled at the top of the act or scene in which it takes place.")
    scene: int = Field(description="The scene number in which the musical moment takes place. Will be labeled at the top of the scene.")
    number: int = Field(description = "The index of the musical moment in the scene. For example, if this is the first musical moment in the scene, this should be 1.")
    characters: list[str] = Field(description="the character or characters who are singing (or otherwise making music) within this specific musical moment,")
    dramatic_situation: str = Field(description="the dramatic situation (a love scene, a crowd scene) in which the musical moment is occurring")
    air_or_melodie: str = Field(description="The title of the 'air' or 'melodie' on which the musical moment is based. It will be labeled in the text as 'air' or 'melodie'.")
    poetic_text: str = Field(description="The text from the music number. Do not include stage directions, only the lyrics sung by the characters in this musical moment")
    rhyme_scheme: str = Field(description = "The rhyme scheme for the poetic text in the musical moment. For example, sentences that end in 'tree' 'be' 'why' and 'high' would have a rhyme scheme of AABB.")
    poetic_form: str = Field(description="form of the poetic text, which might involve some refrain")
    end_of_line_accents: list[str] = Field(description = "the end accent for each line (masculine or féminine)")
    syllable_count_per_line: list[int] = Field(description = "the number of syllables per line. Look out for contractions and colloquialisms that might make the count of syllables less than obvious. Normally a word like ‘voilà’ would of course have 2 syllables. But the musical rhythm of a particular melodie might require that it be _sung_ as only one syllable, as would be the case if the text reads ‘v’la’. Similarly ‘mademoiselle’ would have 4 syllables in spoken French. But the musical context might make it sound like 5. Or a character speaking dialect might sing “Mam’zelle”, which would have only 2 (or perhaps 3) syllables.")
    irregularities: Optional[str] = Field(description="any irregularities within the musical number")
    stage_direction_or_cues: Optional[str] = Field(description="Any stage directions, which tell a character what to do, but aren't a part of another character's dialogue. These are usually connected with a character’s name, and often are in some contrasting typography (italics, or in parentheses - though this may not be picked up by the file reader). In a related way there are sometimes ‘cues’ for music, or performance (as when there is an offstage sound effect, or someone is humming). Most times the stage directions appear just before or after the song text. But sometimes they appear in the midst of a song! The directions should be reported here and not in the transcription of the poem.")
    reprise: Optional[str] = Field(description="there are sometimes directions that indicate the ‘reprise’ of some earlier number or chorus.")

class VaudevillePlay(BaseModel):
    musicalMoments: list[MusicalMoment] = Field(description="""A list of musical moments in a Vaudeville play, as MusicalMoment objects. Many of these musical moments reuse some preexisting (and often well-known)  melody or tune.  These are variously called "melodie”, or “air”, and identified with a short title that refers in some way to an opera or collection of melodies from which it was drawn.  The titles might include the names of works, or other characters in those original works. In the context of the plays, these tunes become the vehicle for newly composed lyrics, which are normally rhymed, and which normally follow the poetic scansion and structure of the original lyrics.  Rhyme, versification and structure are thus of interest to us.""")    

structured_llm = llm.with_structured_output(VaudevillePlay)

In [88]:
system_prompt = """
You are a literary analyst specializing in French Vaudeville plays from the 18th century. 
Your goal is to identify each musical moment in the text, and for each, extract detailed structured information, 
including act, scene, characters, dramatic situation, air or melodie, poetic text, rhyme scheme, poetic form, end-of-line accents, syllable count, and any irregularities. 
Some parts of the text were slightly misinterpreted by the file reader (e.g., missing spaces or strange line breaks).
"""
human_prompt = """
Given the following chunk of the play, analyze and return the musical moments as a structured VaudevillePlay object.
"""



In [89]:
from typing_extensions import List, TypedDict, Optional
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "Context:\n{context}\n\nQuestion:\n{question}"),
])

class State(TypedDict):
    index: int
    context: Document
    answer: VaudevillePlay  # structured output returned by structured_llm

def check_index(state: State):
    # Entry node: currently passes the state through unchanged
    return state

def retrieve_doc(state: State):
    document = all_splits[state["index"]]
    return {"context": document}

def generate(state: State):
    scene = all_indexes.all_scenes[state["index"]]
    # Prefix the scene text with its act and scene numbers
    context = f'Act {scene.act}, Scene {scene.scene}:\n\n {state["context"].page_content}'
    message = prompt.invoke({"question": human_prompt, "context": context})
    response = structured_llm.invoke(message)
    return {"answer": response}

from langgraph.graph import START, StateGraph

graph_builder = StateGraph(State).add_sequence([check_index, retrieve_doc, generate])
graph_builder.add_edge(START, "check_index")
graph = graph_builder.compile()
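
To make the data flow through the three nodes easier to follow, here is a minimal plain-Python sketch of the same pattern: each node receives the current state and returns a partial update, which is merged into the state before the next node runs. This is only an illustration, not LangGraph's actual implementation. The names `run_sequence`, `toy_splits`, and the `toy_*` nodes are hypothetical stand-ins, and no LLM is called.

```python
from typing import Any, Callable, Dict, List

ToyState = Dict[str, Any]
ToyNode = Callable[[ToyState], ToyState]

# Hypothetical stand-in for the list of scene documents (all_splits)
toy_splits = ["Scene one text", "Scene two text"]


def toy_check_index(state: ToyState) -> ToyState:
    # Returns an empty update, so the state is left unchanged
    return {}


def toy_retrieve_doc(state: ToyState) -> ToyState:
    # Look up the scene by index and store it under "context"
    return {"context": toy_splits[state["index"]]}


def toy_generate(state: ToyState) -> ToyState:
    # A real node would call the structured LLM here; this only echoes the context
    return {"answer": f"analyzed: {state['context']}"}


def run_sequence(nodes: List[ToyNode], initial: ToyState) -> ToyState:
    # Run each node in order, merging its partial update into the state
    state = dict(initial)
    for node in nodes:
        state.update(node(state))
    return state


result = run_sequence([toy_check_index, toy_retrieve_doc, toy_generate], {"index": 1})
print(result["answer"])
```

Only `index` has to be supplied when the sequence is invoked; `context` and `answer` are filled in by the nodes along the way, which is why the real graph can be called with just `{"index": i}`.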

### Analyzing the scenes and merging them together

The code below goes through each scene, invokes the LangGraph graph on it, and merges the musical moments from every scene into one list. It then exports that list as a CSV file (named after the source PDF) to a `csv_outputs` folder in your working directory. In the GitHub repository, you can view the 7 objects I tested. Or, you can follow this [link](https://docs.google.com/spreadsheets/d/1WBVLnW_EfVwT60LsOVykd4lNaYEB0NAPvJUaMvyyBrM/edit?usp=sharing) to see the spreadsheet on Google Sheets.

In [90]:
def analyze_scenes(docs: List[Document]) -> List[MusicalMoment]:
    all_moments: List[MusicalMoment] = []

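    for i, doc in enumerate(docs):
        # The graph looks the scene up by index in the global all_splits,
        # so docs is expected to be that same list
        response = graph.invoke({"index": i})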
        moments = response["answer"].musicalMoments
        all_moments.extend(moments)
    
    return all_moments

all_moments = analyze_scenes(all_splits)

In [91]:
import csv
import os

# Convert all MusicalMoment objects to dicts
moments_dicts = [moment.model_dump() for moment in all_moments]

# Get all field names from the first moment
fieldnames = moments_dicts[0].keys() if moments_dicts else []

# Write to CSV
# Ensure the output folder exists
output_folder = "csv_outputs"
os.makedirs(output_folder, exist_ok=True)

# Build the output path
output_path = os.path.join(output_folder, os.path.basename(csv_filename))

with open(output_path, "w", newline='', encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in moments_dicts:
        # Convert lists to strings for CSV output
        for key, value in row.items():
            if isinstance(value, list):
                row[key] = "; ".join(str(v) for v in value)
        writer.writerow(row)
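
Because list fields are flattened into `"; "`-separated strings on export, they need to be split again if the CSV is read back into Python. The following is a minimal sketch of that round trip, assuming the column names match the `MusicalMoment` fields. The helper `load_moments` and the sample row are hypothetical, and the sample is read from an in-memory buffer rather than from a file in `csv_outputs`.

```python
import csv
import io

# Columns that were lists before export (names taken from the MusicalMoment schema)
LIST_STR_FIELDS = ("characters", "end_of_line_accents")
LIST_INT_FIELDS = ("syllable_count_per_line",)


def load_moments(csvfile):
    """Read exported rows and turn the '; '-joined columns back into lists."""
    rows = []
    for row in csv.DictReader(csvfile):
        for key in LIST_STR_FIELDS:
            if key in row:
                row[key] = row[key].split("; ") if row[key] else []
        for key in LIST_INT_FIELDS:
            if key in row:
                row[key] = [int(v) for v in row[key].split("; ")] if row[key] else []
        rows.append(row)
    return rows


# Hypothetical sample data for illustration only
sample = io.StringIO(
    "characters,end_of_line_accents,syllable_count_per_line\n"
    "Pierrot; Colombine,masculine; feminine,8; 8\n"
)
moments = load_moments(sample)
print(moments[0]["characters"])
print(moments[0]["syllable_count_per_line"])
```

One limitation of this approach is that a list item which itself contains `"; "` would be split incorrectly. If that turns out to matter, serializing list columns as JSON strings would be a safer alternative.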

## Conclusions

This system is highly effective, and serves as a good template for structured-output applications.

However, some improvements are needed for it to reach its most accurate form. Most notably, the descriptions for the variables within the Pydantic schema lack subject expertise. To reach its best form, an expert in these plays would have to write these descriptions.

That being said, this system is impressive as is. Reading through all 7 Vaudeville plays (~50 pages each) only took about 20 minutes of passive run time and a few dollars with the OpenAI API. 