# RAG: Haverford College Concert Programs Fall 2009 - Spring 2022

## Goals

* Create a Retrieval-Augmented Generation (RAG) app for Haverford concert programs, 2009-2022
* Implement metadata filters to narrow the retrieved context and improve the chance of accurate responses

If the kernel restarts, do not use "Run All". Instead, run only the cell [here](full_concert_programs.ipynb#run-this-cell-to-reestablish-variables-and-keys), which reestablishes the variables and keys without reloading the PDFs.

## Imports and Initial Setup

In [13]:
# Setting up chat model
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

In [14]:
# Setting up embeddings
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [15]:
# Setting up chroma
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

## Indexing

### Loading Files

First, we have to load all of our documents as text, using LangChain's loaders. Then, we merge the CSV metadata with the documents.

In [16]:
# Setting up csv loading

from langchain_community.document_loaders.csv_loader import CSVLoader

def loadCSV(filepath: str) -> list:
    loader = CSVLoader(file_path=filepath, source_column="Filename", metadata_columns=["Category", "Year", "Term"])
    data = loader.load()
    return data


In [17]:
# Load CSV

csv_metadata = loadCSV("Files/Concert_Metadata.csv")
print(csv_metadata[0].metadata['source'])
print(csv_metadata[1].metadata['Category'])

1. Curt Cacioppo _ Ying Li Program.pdf
Chamber


Here we use PDFPlumberLoader, rather than the default PyPDFLoader, because it seemed to work better with these specific documents. Both have pros and cons.

In [18]:
# Setting up pdf loading

from pathlib import Path
from langchain_community.document_loaders import PDFPlumberLoader

async def loadPDF(filepath: str) -> list:
    loader = PDFPlumberLoader(filepath)
    pages = []
    async for page in loader.alazy_load():
        pages.append(page)   
    return pages


def get_files_from_directory(directory_path: str) -> list[str]:
    directory = Path(directory_path)
    file_paths = [str(file) for file in directory.iterdir() if file.is_file()]
    return file_paths


directory_path:str = "Files/PDFs"
files: list[str] = get_files_from_directory(directory_path)

In [19]:
# Load PDFs w/out metadata
loaded_PDFs: list = []
for file in files:
    pages = await loadPDF(file)
    loaded_PDFs.append(pages)

In [33]:
# Check PDF content
for i in range(3):
    document = loaded_PDFs[i]
    print(f"Document {i} \n")
    document_content: str = ""
    for document_page in document:
        document_content += document_page.page_content
    print(document_content[1000:1200] + "\n \n")

Document 0 

hy by Curt Cacioppo; video montage by John Thornton
Digital image and live piano
Synaesthesis I† (2015)
artwork by Ying Li; videography by John Thornton
* world premiere † U.S. premiere
‡ from the Nav
 

Document 1 

on,
Rockport, Sedona, and Sarasota. Currently cellist of the Banff award-winning Amernet String Quartet,
Ensemble-in-Residence at Florida International University in Miami, Mr. Calloway was previously
 

Document 2 

w Music Conference, Five College New Music Festival,
North American Saxophone Alliance Biennial Conference at Texas Tech University, Michigan State
University, Arizona State University, University of 
 



In [21]:
# Check PDF metadata
for i in range(3):
    document = loaded_PDFs[i]
    print(f"Document {i} \n")
    document_metadata: list = []
    for document_page in document:
        document_metadata += document_page.metadata.items()
    print(document_metadata)

Document 0 

[('source', 'Files\\PDFs\\1. Curt Cacioppo _ Ying Li Program.pdf'), ('file_path', 'Files\\PDFs\\1. Curt Cacioppo _ Ying Li Program.pdf'), ('page', 0), ('total_pages', 2), ('CreationDate', 'D:20160919154334Z'), ('Creator', 'Word'), ('Keywords', ''), ('ModDate', "D:20161101094302-04'00'"), ('Producer', 'Mac OS X 10.10.4 Quartz PDFContext'), ('Title', 'Microsoft Word - LIBRARY program.docx'), ('source', 'Files\\PDFs\\1. Curt Cacioppo _ Ying Li Program.pdf'), ('file_path', 'Files\\PDFs\\1. Curt Cacioppo _ Ying Li Program.pdf'), ('page', 1), ('total_pages', 2), ('CreationDate', 'D:20160919154334Z'), ('Creator', 'Word'), ('Keywords', ''), ('ModDate', "D:20161101094302-04'00'"), ('Producer', 'Mac OS X 10.10.4 Quartz PDFContext'), ('Title', 'Microsoft Word - LIBRARY program.docx')]
Document 1 

[('source', 'Files\\PDFs\\1. Jason Calloway Program 2017.pdf'), ('file_path', 'Files\\PDFs\\1. Jason Calloway Program 2017.pdf'), ('page', 0), ('total_pages', 2), ('Title', 'Calloway progra

So, some of this info is useful and some isn't.

We should keep:
* Source (but clean it so it's just the filename)
* Filepath
* Page
* Total pages
* CreationDate

We should remove:
* Title (not informative, e.g. `Microsoft Word - LIBRARY program.docx`)
* Author (not relevant)
* Subject (blank)
* Producer (not relevant)
* Creator (not relevant)
* ModDate (not relevant)
* Keywords (empty)

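Essentially, we want to keep the fields that are useful for filtering results and for tracing a chunk back to its source, and remove the rest. In this setup only the page text is embedded; metadata is stored alongside each chunk, so extra fields mostly add clutter to the stored records and to the context we print.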
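
As a rough illustration of the keep/remove idea, the sketch below keeps an explicit whitelist of keys instead of deleting unwanted ones. The helper name `keep_only` and the `KEEP_KEYS` set are hypothetical; the keys mirror the fields that survive in the cleaned metadata printed later in this notebook.

```python
# Hypothetical helper (not part of the notebook): keep only whitelisted metadata keys.
KEEP_KEYS = {"source", "file_path", "page", "total_pages", "CreationDate"}

def keep_only(metadata: dict, keep: set = KEEP_KEYS) -> dict:
    """Return a copy of `metadata` containing only the whitelisted keys."""
    return {key: value for key, value in metadata.items() if key in keep}

# Sample values copied from the metadata printed above
sample = {
    "source": "1. Curt Cacioppo _ Ying Li Program.pdf",
    "page": 0,
    "total_pages": 2,
    "Creator": "Word",
    "Producer": "Mac OS X 10.10.4 Quartz PDFContext",
    "Keywords": "",
}
cleaned = keep_only(sample)
print(cleaned)
```

A whitelist has the advantage that an unexpected field in one PDF cannot slip through, at the cost of having to list every field we do want.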

In [28]:
# Cleaning the sources to match with the CSV source names
# Path(...).name keeps only the filename, whether the stored path is relative or absolute
from pathlib import Path

for source in loaded_PDFs:
    for page in source:
        page.metadata['source'] = Path(page.metadata['source']).name

for i in range(5):
    source = loaded_PDFs[i]
    print(source[0].metadata['source'])

1. Curt Cacioppo _ Ying Li Program.pdf
1. Jason Calloway Program 2017.pdf
1. Jonathan Hulting-Cohen Program.pdf
1. Jordan Dodson Program.pdf
1. Matthew Bengtson Program.pdf


In [29]:
# Sorting both the CSV and PDF metadata to match
csv_metadata.sort(key=lambda x: x.metadata['source'])
loaded_PDFs.sort(key=lambda x: x[0].metadata['source'])

The following cell ensures that every CSV row is paired with the PDF of the same name.

In [30]:
# Use assert to ensure the sources match
for i in range(len(csv_metadata)):
    assert csv_metadata[i].metadata['source'] == loaded_PDFs[i][0].metadata['source'], f"Mismatch at index {i}: {csv_metadata[i].metadata['source']} != {loaded_PDFs[i][0].metadata['source']}"
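
Sorting both lists and comparing them position by position stops at the first mismatch, and one missing file shifts every later pair. As an alternative sketch (not what this notebook does), the filenames can be compared as sets so that all unmatched names on either side are reported at once. The function name `match_by_source` and the sample filenames are illustrative; the third CSV entry is made up to show a mismatch.

```python
def match_by_source(csv_sources: list, pdf_sources: list) -> tuple:
    """Return (matched, only_in_csv, only_in_pdfs), each sorted by filename."""
    csv_set, pdf_set = set(csv_sources), set(pdf_sources)
    matched = sorted(csv_set & pdf_set)
    only_in_csv = sorted(csv_set - pdf_set)
    only_in_pdfs = sorted(pdf_set - csv_set)
    return matched, only_in_csv, only_in_pdfs

# Illustrative filenames; "hypothetical missing program.pdf" does not exist in the data
csv_names = [
    "1. Jason Calloway Program 2017.pdf",
    "1. Jordan Dodson Program.pdf",
    "hypothetical missing program.pdf",
]
pdf_names = ["1. Jordan Dodson Program.pdf", "1. Jason Calloway Program 2017.pdf"]

matched, only_in_csv, only_in_pdfs = match_by_source(csv_names, pdf_names)
print(matched)
print(only_in_csv)
print(only_in_pdfs)
```

In the notebook, the inputs would be `[row.metadata['source'] for row in csv_metadata]` and `[pdf[0].metadata['source'] for pdf in loaded_PDFs]`.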

#### Cleaning PDF Metadata and Adding CSV Content

We now delete the unneeded PDF fields and copy the category, year, and term from the CSV onto every page. The category is stored under the key `Ensemble_Type`.

In [None]:
print(len(csv_metadata))
print(len(loaded_PDFs))

356
356


In [None]:
# Re-check the pairing: row i of the CSV must describe PDF i
for i, row in enumerate(csv_metadata):
    pdf_source = loaded_PDFs[i][0].metadata["source"]
    assert row.metadata["source"] == pdf_source, f"Mismatch in source for row {i}: {row.metadata['source']} != {pdf_source}"

In [34]:
irrelevant_metadata = ['Author', 'Subject', 'Producer', 'ModDate', 'Keywords', 'Creator', 'Title']

for i in range(len(loaded_PDFs)):
    document = loaded_PDFs[i]
    for page in document:
        for key in irrelevant_metadata:
            if key in page.metadata:
                del page.metadata[key]
        # Add the category and year from the CSV metadata
        assert page.metadata['source'] == csv_metadata[i].metadata['source'], "Source mismatch between PDF and CSV metadata"
        page.metadata['Ensemble_Type'] = csv_metadata[i].metadata['Category']
        page.metadata['Year'] = csv_metadata[i].metadata['Year']
        page.metadata['Term'] = csv_metadata[i].metadata['Term']
    if i < 5:
        print(document[0].metadata)

{'source': '1. Curt Cacioppo _ Ying Li Program.pdf', 'file_path': 'Files\\PDFs\\1. Curt Cacioppo _ Ying Li Program.pdf', 'page': 0, 'total_pages': 2, 'CreationDate': 'D:20160919154334Z', 'Ensemble_Type': 'Chamber', 'Year': '2016', 'Term': 'Fall'}
{'source': '1. Jason Calloway Program 2017.pdf', 'file_path': 'Files\\PDFs\\1. Jason Calloway Program 2017.pdf', 'page': 0, 'total_pages': 2, 'CreationDate': "D:20170924161250Z00'00'", 'Ensemble_Type': 'Chamber', 'Year': '2017', 'Term': 'Fall'}
{'source': '1. Jonathan Hulting-Cohen Program.pdf', 'file_path': 'Files\\PDFs\\1. Jonathan Hulting-Cohen Program.pdf', 'page': 0, 'total_pages': 2, 'CreationDate': "D:20200609163839Z00'00'", 'Ensemble_Type': 'Chamber', 'Year': '2018', 'Term': 'Fall'}
{'source': '1. Jordan Dodson Program.pdf', 'file_path': 'Files\\PDFs\\1. Jordan Dodson Program.pdf', 'page': 0, 'total_pages': 2, 'CreationDate': "D:20200609143156Z00'00'", 'Ensemble_Type': 'Chamber', 'Year': '2019', 'Term': 'Fall'}
{'source': '1. Matthew B

In [36]:
# Convert each sublist into a single document
from langchain_core.documents import Document
def convert_list_to_document(pages: list) -> Document:
    document_content: str = ""
    for page in pages:
        document_content += page.page_content
    document: Document = Document(
        page_content=document_content,
        metadata=pages[0].metadata  # Use the metadata from the first page
    )
    return document

# Convert loaded PDFs to documents
docs = []
for source in loaded_PDFs:
    doc = convert_list_to_document(source)
    docs.append(doc)  # Append the single Document object

In [37]:
print(docs[0].page_content[:1000])

“The Colors and the Sounds Respond”
music and performance Sharpless Gallery
by composer & pianist Magill Library
CURT CACIOPPO Haverford College
in conjunction with the Wednesday, September 21, 2016
7:30 PM
artwork of Ying Li
Piano music to complement the artwork on exhibit
Summer Moon over Cape Cod* (2016)
Boathouse Row (from Philadelphia Diary; 2008)
Chloe* (2014)
January Thaw (2013)
Somnamble† (after Bellini; 2015)
Parisian Room Waltz* (from Stories from the 7th Ward; 2016)
Sharon’s Song* (2015)
Drehleier Blues (after Schubert, “Der Leiermann;” 2012)
Paean* (2015)
Vaya con Dios (tango; 2007)
August Rose† (2015)
Burlesca (from Sestinamento <<Operistica>>, after Mozart; 2008)
Digital image and audio
Reflections of Flames on Wet Pavement‡ (from Impressioni venexiane,
String Quartet No. 3; 2008)
artwork: “Wuji (Infinity),” by Ying Li; video-choreography by Alexej Steinhardt HC ’99
Sorriso a Catania‡‡ (from Divertimenti in Italia, String Quartet No. 6; 2012)
artwork by Ying Li; photograp

In [38]:
# Double check metadata
for i in range(90,98):
    print(f"Document {i} Metadata: {docs[i].metadata}")

Document 90 Metadata: {'source': '18. Orchestra Program Spring 2019.pdf', 'file_path': 'Files\\PDFs\\18. Orchestra Program Spring 2019.pdf', 'page': 0, 'total_pages': 7, 'CreationDate': 'D:20190412224743Z', 'Ensemble_Type': 'Orchestra', 'Year': '2019', 'Term': 'Spring'}
Document 91 Metadata: {'source': '18. Steven Mayer-Matthew Plenk Program.pdf', 'file_path': 'Files\\PDFs\\18. Steven Mayer-Matthew Plenk Program.pdf', 'page': 0, 'total_pages': 4, 'CreationDate': "D:20180329155545Z00'00'", 'Ensemble_Type': 'Chamber', 'Year': '2018', 'Term': 'Spring'}
Document 92 Metadata: {'source': '18. Taiko _ Dance Program 2016.pdf', 'file_path': 'Files\\PDFs\\18. Taiko _ Dance Program 2016.pdf', 'page': 0, 'total_pages': 4, 'CreationDate': 'D:20160523150851Z', 'Ensemble_Type': 'Chamber', 'Year': '2016', 'Term': 'Spring'}
Document 93 Metadata: {'source': '19. Chamber Singers Rotunda Program 2015.pdf', 'file_path': 'Files\\PDFs\\19. Chamber Singers Rotunda Program 2015.pdf', 'page': 0, 'total_pages': 

### Chunking Files

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
all_splits = text_splitter.split_documents(docs)

print(f"Split {len(docs)} PDFs into {len(all_splits)} sub-documents.")

Split 356 PDFs into 3757 sub-documents.
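
To make `chunk_size`, `chunk_overlap`, and `add_start_index` concrete, here is a deliberately simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks, so its chunks vary in length); the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split `text` into fixed-size windows; returns (start_index, chunk) pairs."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

demo = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # [(0, 'abcd'), (2, 'cdef'), (4, 'efgh'), (6, 'ghij')]
```

Each chunk repeats the last two characters of the previous one, which is the role the 200-character overlap plays above: a name or phrase cut at a chunk boundary still appears whole in one of the neighbouring chunks.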


In [None]:
# Run this only once. Re-running it (for example, when reestablishing variables)
# would add every chunk to the vector store again and incur embedding charges a second time.

# document_ids = vector_store.add_documents(documents=all_splits)
# print(document_ids[:3])

['f6b62a29-03d0-4b4c-a5f8-3a841d114756', '63653de1-d878-497f-b752-385c2717fe76', '28e3571c-4a8e-4a33-8937-ccc553c5cbe3']


The above code is commented out to avoid adding the documents again, which would duplicate them in the vector store. The vector store should already be set up in `./chroma_langchain_db`.
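
One possible safeguard against duplicates, sketched here under assumptions rather than taken from this notebook, is to derive a stable ID for each chunk from its source filename and start index, so the same chunk always maps to the same ID. The namespace string and the function name `chunk_id` are hypothetical.

```python
import uuid

# Arbitrary fixed namespace for this project (an assumption of this sketch)
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "haverford-concert-programs")

def chunk_id(source: str, start_index: int) -> str:
    """Derive a stable ID from the chunk's source file and start offset."""
    return str(uuid.uuid5(NAMESPACE, f"{source}:{start_index}"))

first = chunk_id("18. Orchestra Program Spring 2019.pdf", 0)
again = chunk_id("18. Orchestra Program Spring 2019.pdf", 0)
other = chunk_id("18. Orchestra Program Spring 2019.pdf", 800)
print(first == again, first == other)  # True False
```

Such IDs could be passed through the `ids` argument of `add_documents`. How the store treats repeated IDs, and whether embeddings are recomputed (and billed) anyway, depends on the vector store version, so check that before relying on it.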

## Setting Up Retrieval and Generation

We use LangGraph and a State class to set up our RAG. LangGraph allows us to set up clear steps, and the State allows us to pass information between each step. 
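
The idea of steps that read a shared state and return partial updates can be sketched without LangGraph. The code below is a plain-Python illustration of that pattern, not LangGraph's implementation; `DemoState`, `demo_retrieve`, `demo_generate`, and `run_sequence` are hypothetical stand-ins for the real vector search and LLM call.

```python
from typing import Callable, List, TypedDict

class DemoState(TypedDict, total=False):
    question: str
    context: List[str]
    answer: str

def demo_retrieve(state: DemoState) -> dict:
    # Stand-in for a vector search: return one canned passage
    return {"context": [f"passage about: {state['question']}"]}

def demo_generate(state: DemoState) -> dict:
    # Stand-in for the LLM call: report how much context was seen
    return {"answer": f"answered from {len(state['context'])} passage(s)"}

def run_sequence(steps: List[Callable[[DemoState], dict]], state: DemoState) -> DemoState:
    """Run the steps in order, merging each partial update into the state."""
    for step in steps:
        state = {**state, **step(state)}
    return state

final = run_sequence([demo_retrieve, demo_generate], {"question": "Who conducted?"})
print(final["answer"])  # answered from 1 passage(s)
```

Each step only returns the keys it changes, and later steps see everything earlier steps wrote.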

See the graph below the 3 code cells.

In [None]:
from langchain_core.documents import Document
from typing_extensions import List, TypedDict
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use only the information provided in the context below to answer the question. If the answer is not in the context, say 'I don't know' or 'The information is not available.'"),
    ("human", "Context:\n{context}\n\nQuestion: {question}")
])

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"], k = 10)
    return {"context": retrieved_docs}

def generate(state: State):
    docs_content = "\n\n".join([doc.page_content for doc in state["context"]])
    message = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(message)
    return {"answer": response.content}


In [32]:
from langgraph.graph import START, StateGraph

graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

In [33]:
print(graph.get_graph().draw_ascii())

+-----------+  
| __start__ |  
+-----------+  
      *        
      *        
      *        
+----------+   
| retrieve |   
+----------+   
      *        
      *        
      *        
+----------+   
| generate |   
+----------+   
      *        
      *        
      *        
 +---------+   
 | __end__ |   
 +---------+   


## Run this cell to reestablish variables and keys

In [3]:
# Setting up chat model
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

# Setting up embeddings (getpass and os are already imported, and the API key is set above)

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Reestablishing our persist directory for Chroma
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Where we stored our data before
)

# Reestablishing retrieval and generation functions
from langchain_core.documents import Document
from typing_extensions import List, TypedDict

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use only the information provided in the context below to answer the question. If the answer is not in the context, say 'I don't know' or 'The information is not available.'"),
    ("human", "Context:\n{context}\n\nQuestion: {question}")
])

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"], k = 10)
    return {"context": retrieved_docs}

def generate(state: State):
    docs_content = "\n\n".join([doc.page_content for doc in state["context"]])
    message = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(message)
    return {"answer": response.content}

# Reestablishing langgraph
from langgraph.graph import START, StateGraph

graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

## The Basic RAG is set up - Let's ask some questions about our data

In [4]:
result = graph.invoke({"question": "Who has conducted the Chamber Singers over the years?"})

print(f'Answer: {result["answer"]}')
print(f'Context: {result["context"]}\n\n')

Answer: The Chamber Singers have been conducted by Thomas Lloyd and Ng Tian Hui during the academic year of 2010/11, with Lloyd being the director before taking a sabbatical and Ng Tian Hui leading in his absence. More recently, Dr. Nathan Zullinger has been the conductor.
Context: [Document(id='13720808-826c-45c8-b695-85c2b85d9445', metadata={'total_pages': 12, 'start_index': 22880, 'Year': '2011', 'Ensemble_Type': 'Choral', 'file_path': 'C:\\Users\\charl\\Documents\\VSCode\\HC All Programs\\Files\\PDFs\\30. Chamber Singers-Recalling Ariadne Program.pdf', 'source': '30. Chamber Singers-Recalling Ariadne Program.pdf', 'Term': 'Spring', 'page': 0, 'CreationDate': "D:20160524182813Z00'00'"}, page_content='academic year of 2010/11, the choir is led by Ng Tian Hui while Prof. Lloyd is away on sabbatical.\nAs the premier vocal ensemble in the bi-college community, the Chamber Singers performs challenging repertoire ranging from\nthe Renaissance to the present day in a variety of languages a

In [5]:
result = graph.invoke({"question": "Who played trumpet for the haverford-bryn mawr orchestra in 2019?"})

print(f'Answer: {result["answer"]}')
print(f'Context: {result["context"]}\n\n')

Answer: I don't know.
Context: [Document(id='a9aaff9b-6388-4acc-8c75-0d85ae7d8daa', metadata={'start_index': 17739, 'Ensemble_Type': 'Choral', 'file_path': 'C:\\Users\\charl\\Documents\\VSCode\\HC All Programs\\Files\\PDFs\\9. Chorale Program Fall 2011.pdf', 'page': 0, 'Year': '2011', 'source': '9. Chorale Program Fall 2011.pdf', 'Term': 'Fall', 'total_pages': 8, 'CreationDate': 'D:20200612115549Z'}, page_content='Haverford/Bryn Mawr Chorale Orchestra – Fall 2011\nViolin I Bass Contrabassoon\nLorenzo Rayal Jennifer Bradbury Ben Hoyle\nKiran Rajamani HC ’14 Matt Roberts\nChi Park Brent Edmondson Horns\nNatalia Banfi HC ’15 Katie Jordan\nHarp\nJennifer Horne Kristina Gannon\nAriane Giles HC ’15\nVanessa Felso BMC ’15 Sabrina Huber\nSophia Forker HC ’15\nVena Johnson Ryan Stewart\nGabriella Goodman HC ’12\nPiccolo\nTrumpets\nViolin II Katherine Barbato\nBrian Rascon\nRodolfo Leuenberger\nMatthew Thomas\nMarie Greaney HC ’14 Flute\nAaron Matthias-Long\nAngela Sulzer BJ Hillinck HC ’15\nNor

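Clearly, the retrieval step didn't pull the right documents for the question above: the first retrieved chunk comes from a Fall 2011 Chorale program, not a 2019 Orchestra program. This is where the metadata filters will come in, once we set them up.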

In [6]:
result = graph.invoke({"question": "Who are some trumpet players in the haverford-bryn mawr orchestra?"})

print(f'Answer: {result["answer"]}')
print("\nSources:")
for source in result["context"]:
    print(f'Source: {source.metadata["source"]}')

Answer: Some trumpet players in the Haverford-Bryn Mawr Orchestra include Christian Fagre HC ’16, Sam Istvan HC ’21, and Andrew Cornell HC ’20.

Sources:
Source: 19. Chorale Program Spring 2013.pdf
Source: 9. Chorale Program Fall 2011.pdf
Source: 6. Chorale Program Fall 2012.pdf
Source: HC Family Weekend 08 program.pdf
Source: 8. Chorale Program Fall 2019.pdf
Source: 7. Orchestra Program Fall 2018.pdf
Source: 2. Family Weekend Choral Program-2013.pdf
Source: 10. Chorale Program Fall 2010.pdf
Source: 8. Chorale Program Fall 2018.pdf
Source: HC Family Weekend 08 program.pdf


## Setting up Filters

As you can see, the basic pipeline works fairly well, but the similarity search doesn't always pull the right documents. A simple metadata filter, such as the year or ensemble type, should improve our results considerably without any extra calls to the API.

Setting up a filter in similarity search:

In [4]:
vector_store.similarity_search("Who was the principal trumpet player in the Haverford-Bryn Mawr Orchestra in 2019?", k=4, filter={"Ensemble_Type": "Orchestra"})

[Document(id='6178eb27-1284-4041-8289-1d804847fcf7', metadata={'total_pages': 7, 'start_index': 776, 'file_path': 'C:\\Users\\charl\\Documents\\VSCode\\HC All Programs\\Files\\PDFs\\9. Orchestra Program Fall 2010.pdf', 'source': '9. Orchestra Program Fall 2010.pdf', 'Ensemble_Type': 'Orchestra', 'Term': 'Fall', 'CreationDate': "D:20160524173625Z00'00'", 'page': 0, 'Year': '2010'}, page_content='Josh Bucheister, ’14, Associate Rachael Goldstein, ’14 Co- Principal Katie Van Aken, ’12 %\nConcertmaster Ezekiel Barnett, ’13 Assistant Principal Elizabeth Biernat, ’14@\nEthan Joseph, ’11, Assistant Concertmaster Delaney Page, ’12 Kristina Kronauer, ’13\nXinbei Guan, ’14, Assistant Concertmaster Laura Alexander, ’11 Kristina Gannon **\nSarah Capasso, ’11 Noory O, ’13\nDrew Twitchell, ’11 Erin Korth, ’13 Trumpet\nNora Schmidt, ’12 Mary Schultz, ’12 Ian Gavigan, ’14 &@\nTiffany Fritz, ’12 Seoung Won Jung, ’14 Linus Marco, ’13\nYiran Zhang , ’14 Kathryn Hayden, ’14 Chelsea Miller, ’11* %\nMarie G

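Here, we repeat the setup from earlier with filtering added, making this our new "reestablishment" cell. Run this cell after a kernel restart to get the RAG working again. The state gains an optional `filter` field, an `apply_filter` step is added to the graph, and `retrieve` passes the filter to the similarity search.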

In [7]:
# Setting up chat model
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

# Setting up embeddings (getpass and os are already imported, and the API key is set above)

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Reestablishing our persist directory for Chroma
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db",  # Where we stored our data before
)

# Reestablishing retrieval and generation functions
from langchain_core.documents import Document
from typing_extensions import List, TypedDict, Optional
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Use only the information provided in the context below to answer the question. If the answer is not in the context, say 'I don't know' or 'The information is not available.'"),
    ("human", "Context:\n{context}\n\nQuestion: {question}")
])

class State(TypedDict):
    question: str
    filter: Optional[dict]
    context: List[Document]
    answer: str

def apply_filter(state: State):
    """
    Normalize the optional metadata filter supplied with the question.
    Entries whose value is None are dropped; a missing or empty filter becomes None,
    which makes the similarity search run unfiltered.
    """
    raw_filter = state.get("filter")
    if raw_filter:
        # Drop unset entries; fall back to None if nothing remains
        cleaned = {k: v for k, v in raw_filter.items() if v is not None}
        return {"filter": cleaned or None}
    # No filter provided: search the whole collection
    return {"filter": None}

def retrieve(state: State):
    filter_dict = state["filter"] if state.get("filter") else None
    retrieved_docs = vector_store.similarity_search(state["question"], k=10, filter=filter_dict)
    return {"context": retrieved_docs}

def generate(state: State):
    docs_content = "\n\n".join([doc.page_content for doc in state["context"]])
    message = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(message)
    return {"answer": response.content}

# Reestablishing langgraph
from langgraph.graph import START, StateGraph

graph_builder = StateGraph(State).add_sequence([apply_filter, retrieve, generate])
# Start at apply_filter (the first node in the sequence) so the filter step is not skipped
graph_builder.add_edge(START, "apply_filter")
graph = graph_builder.compile()

In [3]:
result = graph.invoke({
    "question": "Who played trumpet for the orchestra in the year 2019?", "filter": {"Ensemble_Type": "Orchestra"}})

print(f'Answer: {result["answer"]}')
print("\n\nSources:")
for i, source in enumerate(result["context"]):
    print(f'Source {i+1}: {source.metadata["source"]}')

Answer: I don't know.


Sources:
Source 1: 21. Orchestra Program-Spring 2012.pdf
Source 2: 17. Orchestra Program Spring 2013.pdf
Source 3: 9. Orchestra Program Fall 2010.pdf
Source 4: 19. Orchestra Program Spring 2016.pdf
Source 5: Fall 11-19-21 Orchestra Program Fall 2021.pdf
Source 6: 7. Orchestra Program Fall 2011.pdf
Source 7: 4. Orchestra Program Fall 2012.pdf
Source 8: 5. Orchestra Program Fall 2014.pdf
Source 9: 8. Orchestra Program Fall 2016.pdf
Source 10: 27. Orchestra Program-Spring 2011.pdf


In [8]:
result = graph.invoke({
    "question": "Who played trumpet for the haverford-bryn mawr orchestra in 2019?",
    "filter": {"$and": [{"Ensemble_Type": "Orchestra"}, {"Year": "2019"}]}
})

print(f'Answer: {result["answer"]}')
print("\n\nSources:")
for i, source in enumerate(result["context"]):
    print(f'Source {i+1}: {source.metadata["source"]}')

Answer: The trumpet players for the Haverford-Bryn Mawr College Orchestra in 2019 were Jack Weinstein, HC ’23, and Sam Istvan, HC ’21, who was the principal trumpet. Additionally, Jackie Toben, BMC ’22, was also listed as a Co-Associate Principal Trumpet.


Sources:
Source 1: 7. Orchestra Program Fall 2019.pdf
Source 2: 7. Orchestra Program Fall 2019.pdf
Source 3: 18. Orchestra Program Spring 2019.pdf
Source 4: 7. Orchestra Program Fall 2019.pdf
Source 5: 18. Orchestra Program Spring 2019.pdf
Source 6: 18. Orchestra Program Spring 2019.pdf
Source 7: 7. Orchestra Program Fall 2019.pdf
Source 8: 7. Orchestra Program Fall 2019.pdf
Source 9: 18. Orchestra Program Spring 2019.pdf
Source 10: 18. Orchestra Program Spring 2019.pdf
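
The two filter shapes used above (a single condition, and several conditions combined with `$and`) can be produced by a small helper. This is a minimal sketch; the name `build_filter` is hypothetical and not part of LangChain or Chroma.

```python
from typing import Optional

def build_filter(**conditions: Optional[str]) -> Optional[dict]:
    """Build a filter dict from keyword conditions, skipping unset ones."""
    clauses = [{key: value} for key, value in conditions.items() if value is not None]
    if not clauses:
        return None            # no filter: search everything
    if len(clauses) == 1:
        return clauses[0]      # single condition: plain dict
    return {"$and": clauses}   # several conditions: combine with $and

print(build_filter(Ensemble_Type="Orchestra"))
print(build_filter(Ensemble_Type="Orchestra", Year="2019"))
print(build_filter(Year=None))
```

Note that `Year` is stored as a string in the metadata shown earlier (`'Year': '2019'`), so the filter value should be the string `"2019"` rather than the integer `2019`.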


## Results

We successfully set up a RAG app with working metadata filters. We can now "chat" with our data and look up specific information in seconds. Because the prompt tells the model to answer only from the retrieved context, a retrieval miss tends to produce "I don't know" rather than a fabricated answer. Answers are still only as reliable as the chunks that were retrieved.

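The filters were effective at reducing "I don't know" responses: the 2019 trumpet question went unanswered with no filter and with an ensemble-only filter, but was answered once we filtered on both ensemble type and year. Filters currently have to be passed in by hand; a future, streamlined version of this project could use an LLM call to interpret the question and apply filters automatically.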
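
As a rough idea of what automatic filtering could look like, the sketch below guesses a filter from the question text with simple keyword and regex rules. This is a rule-based stand-in, not the LLM-based approach described above, and it will miss many phrasings. The function name `guess_filter` and the keyword table are assumptions; the category values come from the metadata shown in the outputs earlier.

```python
import re
from typing import Optional

# Keyword -> Ensemble_Type mapping (assumed; more specific phrases come first)
ENSEMBLE_KEYWORDS = {
    "orchestra": "Orchestra",
    "chorale": "Choral",
    "chamber singers": "Choral",
    "chamber": "Chamber",
}

def guess_filter(question: str) -> Optional[dict]:
    """Guess a metadata filter from keywords and a four-digit year in the question."""
    text = question.lower()
    clauses = []
    for keyword, category in ENSEMBLE_KEYWORDS.items():
        if keyword in text:
            clauses.append({"Ensemble_Type": category})
            break  # use the first matching keyword only
    year = re.search(r"\b(20\d{2})\b", text)
    if year:
        clauses.append({"Year": year.group(1)})  # Year is stored as a string
    if not clauses:
        return None
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

guessed = guess_filter("Who played trumpet for the haverford-bryn mawr orchestra in 2019?")
print(guessed)
```

The result could be passed as the `filter` value in `graph.invoke`, in place of the hand-written dictionary.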

Overall, this project resulted in an effective and rapid way to explore our data. 