## RAG Board Game Assistant
The following code blocks build two variations of a RAG application that uses Gemini to answer questions about board game rules. The goal is a simple interface for asking rules questions, so that no one has to stop play to dig through the game manual. Both applications cite the manual page from which the information was drawn, so that further research can be done quickly if necessary.

In [None]:
#Import libraries
import os
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.documents import Document
from langchain_core.prompts import PromptTemplate
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from google import genai
from typing_extensions import List, TypedDict
from pydantic import BaseModel, Field

#### Baseline Assessment: Simple LLM ping without RAG
First, let's check how well Gemini answers our question on its own, without any retrieved context.

In [19]:
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
q = "What counts as a production building in the game Puerto Rico?"
response = client.models.generate_content(model="gemini-2.0-flash", contents=q)
print(response.text)

In Puerto Rico, a production building is any building that produces goods when the **Producer** role is selected. These are the buildings that allow you to generate resources to then be traded or shipped for victory points.

Here's a breakdown of what counts as a production building:

*   **Indigo Plant:** Produces Indigo
*   **Sugar Mill:** Produces Sugar
*   **Corn Mill:** Produces Corn
*   **Coffee Roaster:** Produces Coffee
*   **Tobacco Storage:** Produces Tobacco

**Key Considerations:**

*   **Size:** The size of the building (Small, Large) doesn't affect whether it's a production building.
*   **Not Included:** Buildings like the Guild Hall, Customs House, or University, even though they provide victory points or other benefits, are **not** production buildings because they don't directly produce goods.
*   **Only the building itself:** Buildings which improve the production of a production building, such as the Sugar Mill adding a barrel, are not production buildings themselve

The response above is mostly correct, but the LLM hallucinates a Corn Mill; in Puerto Rico, corn has no production building. It also includes more information than the question calls for (e.g. the 'Key Considerations' section). This makes a case for trying RAG to get more reliable and relevant answers.

#### Create a Vector Store
The code block below builds a vector store containing fragments from four different game manual PDFs. Alternatively, the pre-made vector store can be loaded directly using the code block after it.  

Here, PDFs are loaded page by page, so the extracted LangChain documents cannot span pages. This makes it easy for the application to reference the specific manual page used to generate the response. To change this behavior and let the model read text chunks that span multiple PDF pages, pass the mode='single' argument to the PyPDFLoader constructor.  

A chunk size of 600 characters with an overlap of 100 characters seems to work well for most manuals.
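
To make the chunk size and overlap concrete, here is a minimal pure-Python sketch of fixed-width chunking with overlap. It is a simplified illustration only: `RecursiveCharacterTextSplitter` prefers to break at separators such as paragraph and line breaks, so its chunk boundaries will differ. The `chunk_text` helper and the stand-in page text are hypothetical and not part of the notebook.

```python
def chunk_text(text: str, chunk_size: int = 600, chunk_overlap: int = 100) -> list[tuple[int, str]]:
    """Split text into fixed-width chunks whose neighbours share chunk_overlap characters.

    Returns (start_index, chunk) pairs, loosely mirroring add_start_index=True.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Stand-in for the text of one manual page (1500 characters)
page = "".join(str(i % 10) for i in range(1500))
chunks = chunk_text(page)
for start, chunk in chunks:
    print(start, len(chunk))
```

With these settings each chunk begins 500 characters after the previous one, so a rule that straddles a chunk boundary still tends to appear whole in at least one chunk.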

In [None]:
## Create vector store
google_embedding = GoogleGenerativeAIEmbeddings(google_api_key=os.getenv("GEMINI_API_KEY"), model="models/embedding-001")
vector_store = InMemoryVectorStore(embedding=google_embedding)

## Load manuals into vector store 
manual_list = [("Everdell","everdell-rulebook.pdf"),
               ("Mysterium","mysterium-rulebook.pdf"),
               ("Puerto Rico", "Puerto-Rico-Deluxe-Rules.pdf"),
               ("Ticket to Ride Europe", "ticket_to_ride_europe.pdf")]
path = "./game_manuals/" #path to directory containing game manuals

for game in manual_list:
    pdf_name = path+game[1]
    game_name = game[0]
    try:
        #load text from pdf
        loader = PyPDFLoader(pdf_name) #mode='single' to not separate by pages
        pages = []
        docs_lazy = loader.lazy_load()
        for doc in docs_lazy:
            doc.metadata['game'] = game_name
            pages.append(doc)

        ##Split text into chunks
        text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", " ", "."],
                                                    chunk_size=600,
                                                    chunk_overlap=100,
                                                    add_start_index=True)
        split_text = text_splitter.split_documents(pages)

        ##Encode chunks and add to vector store
        doc_ids = vector_store.add_documents(split_text)
    
    except Exception as err:
        #Report the failure and move on to the next manual
        print("Issue adding info from %s: %s"%(pdf_name, err))

## Save vector store to json file
vector_store.dump("board_game_vector_store.json")

In [None]:
## Load vector store from file
google_embedding = GoogleGenerativeAIEmbeddings(google_api_key=os.getenv("GEMINI_API_KEY"), model="models/embedding-001")
vector_store = InMemoryVectorStore.load("board_game_vector_store.json", google_embedding) #load is a classmethod, so no throwaway instance is needed

### RAG application version 1
This RAG implementation returns the LLM's answer to the query along with a list of the documents retrieved from the similarity search of the vector store. Compared to version 2 of the application, it produces clearer instructions but cannot filter out irrelevant retrieved documents from the vector store.

In [6]:
##Build RAG state workflow, following https://python.langchain.com/docs/tutorials/rag

#Prompt engineering
prompt_text = "You are an assistant for looking up board game rules. Use the following pieces of retrieved context from the board game manual, in addition to your own knowledge, to answer the question.\n" \
"Game: {game}\n"\
"Question: {question}\n" \
"Context: {context}"
prompt = PromptTemplate.from_template(prompt_text)

#Initialize LLM
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")

#Define application state
class State(TypedDict):
    game: str
    question: str
    context: List[Document]
    answer: str

#Retrieval step
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"], filter=lambda doc: doc.metadata.get("game")==state["game"])
    return {"context": retrieved_docs}

#Generation step
def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"game": state["game"], "question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}

#Build graph
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

#Run test question and print response
response = graph.invoke({"question":"What counts as a production building?", "game":"Puerto Rico"})
print(response["answer"])
print("\n\nReferenced manual sections:\n")
for doc in response["context"]:
    print("Page %s:\n%s\n\n"%(doc.metadata['page_label'], doc.page_content))


Based on the provided text, production buildings are buildings that are required, along with plantations, to produce certain goods. These buildings are:

*   Indigo processing plants (produce indigo dye - blue goods)
*   Sugar mills (process sugar cane into sugar - white goods)
*   Tobacco storage (shred tobacco leaves into tobacco - light brown goods)
*   Coffee roasters (roast coffee beans into coffee - dark brown goods)

There is no production building for corn.


Referenced manual sections:

Page 8:
them. Of course, the player must also have sufficient occupied plantations of the 
appropriate kind to produce the raw materials needed to produce the goods in the 
production buildings.
The buildings
A player may build only
one of each building.
Only occupied  buildings have
any use or value (except for VP
value at game end) .
The production buildings
For corn there is no
production building!
The circles on the production
buildings indicate the maximum
number of goods the building can


The application produces a correct and well-formatted answer to the question without including irrelevant information. It lists the four types of production buildings and notes that there is no production building for the fifth resource, corn. However, it returns every document gathered through the similarity search of the vector store, rather than just the most relevant one, and it prints the retrieved text with the PDF's raw line breaks, which makes it hard to read.
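
The retrieval step can be pictured with a small pure-Python sketch: keep only the chunks for the requested game, rank them by cosine similarity to the query embedding, and return the top `k`. This is an illustration under stated assumptions, not LangChain's implementation; the `search` helper, the toy three-dimensional vectors, and the page numbers are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, docs, game, k=4):
    """Return the k chunks for `game` whose vectors are most similar to query_vec."""
    candidates = [d for d in docs if d["game"] == game]  # metadata filter
    ranked = sorted(
        candidates,
        key=lambda d: cosine_similarity(query_vec, d["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings"; real embedding vectors have hundreds of dimensions.
docs = [
    {"game": "Puerto Rico", "page": 8, "vector": [0.9, 0.1, 0.0]},
    {"game": "Puerto Rico", "page": 3, "vector": [0.2, 0.8, 0.1]},
    {"game": "Everdell", "page": 5, "vector": [1.0, 0.0, 0.0]},
]
query = [1.0, 0.0, 0.0]

top = search(query, docs, game="Puerto Rico", k=1)
print([d["page"] for d in top])  # -> [8]
```

The Everdell chunk matches the query exactly but is excluded by the game filter. In the notebook, passing a smaller `k` to `similarity_search` would similarly limit how many documents reach the prompt, at the risk of dropping a chunk the answer needs.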

### RAG application version 2
This implementation uses the with_structured_output method to return the specific manual snippet used to create the response, along with its page number. Responses tend to be shorter and less creative than those from version 1.

In [17]:
##Build RAG state workflow, following https://python.langchain.com/docs/how_to/qa_citations/

#Prompt engineering
prompt_text = "You are an assistant for looking up board game rules. Use the following pieces of retrieved context from the board game manual, in addition to your own knowledge, to answer the question.\n" \
"Game: {game}\n"\
"Question: {question}\n" \
"Context: {context}"
prompt = PromptTemplate.from_template(prompt_text)

#Initialize LLM
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")

#Function to include pdf page labels with retrieved document contents
def format_docs_with_page(docs: List[Document]) -> str:
    formatted = [
        f"Page Number: {doc.metadata['page_label']}\nManual Snippet: {doc.page_content}"
        for doc in docs
    ]
    return "\n\n" + "\n\n".join(formatted)

#Define quoted answer and citation classes

class Citation(BaseModel):
    page_num: int = Field(
        ...,
        description="The page number of the SPECIFIC manual snippet which justifies the answer.",
    )
    quote: str = Field(
        ...,
        description="The VERBATIM quote from the specified manual snippet that justifies the answer.",
    )

class QuotedAnswer(BaseModel):
    """Answer the player's question based on the retrieved context from the game manual as well as your knowledge, and cite the relevant retrieved manual snippet."""
    answer: str = Field(
        ...,
        description="The answer to the user question, based on the retrieved context and your knowledge.",
    )
    citations: List[Citation] = Field(
        ..., description="Citations from the given context that justify the answer."
    )

#Define application state
class State(TypedDict):
    game: str
    question: str
    context: List[Document]
    answer: QuotedAnswer

#Retrieval step
def retrieve(state: State):
    retrieved_docs = vector_store.similarity_search(state["question"], filter=lambda doc: doc.metadata.get("game")==state["game"])
    return {"context": retrieved_docs}

#Generation step
def generate(state: State):
    formatted_docs = format_docs_with_page(state["context"])
    messages = prompt.invoke({"game": state["game"], "question": state["question"], "context": formatted_docs})
    structured_llm = llm.with_structured_output(QuotedAnswer)
    response = structured_llm.invoke(messages)
    return {"answer": response}

#Build graph
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

#Run test question and print response
response = graph.invoke({"question":"What counts as a production building?", "game":"Puerto Rico"})
print(response['answer'].answer)
print("\nCitations:")
for cite in response['answer'].citations:
    print(cite)

Production buildings are blue, white, light, and dark brown buildings that are required, together with the plantations, for the production of certain goods. These buildings are the indigo processing plants, sugar mills, tobacco storage, and coffee roasters.

Citations:
page_num=8 quote='The production buildings (blue, white, light and dark brown)\nThe production buildings are required, together with the plantations, for the\nproduction of certain goods:\n- In the indigo processing plants, the indigo plants are processed to produce\nindigo dye (blue goods).\n- In the sugar mills, sugar cane is processed into sugar (white goods).\n- In the tobacco storage, the tobacco leaves are shredded into tobacco (light brown \ngoods).\n- In the coffee roasters, the coffee beans are roasted into coffee (dark brown \ngoods).'


The application produces a correct and succinct answer to the question, with wording that closely resembles the manual's. It cites only the relevant manual snippet rather than every document retrieved from the vector store.
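
Because the schema asks the model for a verbatim quote, the citations can be spot-checked against the retrieved text. The sketch below is one possible check, not part of the notebook: `Snippet`, `Cite`, and `citation_is_grounded` are hypothetical stand-ins (plain dataclasses rather than the pydantic `Citation` model and LangChain `Document`), and whitespace is normalized on the assumption that PDF line breaks may differ between the quote and the source chunk.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    page_label: str  # page label of the retrieved chunk
    text: str        # chunk text as extracted from the PDF


@dataclass
class Cite:
    page_num: int
    quote: str


def normalize(text: str) -> str:
    """Collapse whitespace runs so PDF line breaks do not affect matching."""
    return " ".join(text.split())


def citation_is_grounded(cite: Cite, snippets: list[Snippet]) -> bool:
    """True if the quote appears in a retrieved snippet from the cited page."""
    quote = normalize(cite.quote)
    return bool(quote) and any(
        s.page_label == str(cite.page_num) and quote in normalize(s.text)
        for s in snippets
    )


snippets = [Snippet("8", "The production buildings\nFor corn there is no\nproduction building!")]
good = Cite(page_num=8, quote="For corn there is no production building!")
bad = Cite(page_num=8, quote="Corn mills produce corn.")
print(citation_is_grounded(good, snippets), citation_is_grounded(bad, snippets))  # -> True False
```

A quote that spans two overlapping chunks would fail this simple per-snippet check, so a failed match is a prompt to look at the manual page rather than proof of a hallucination.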

#### Conclusion
RAG helps generate more relevant answers to questions about board game rules. Each implementation shown here has benefits and drawbacks: version 1 gives clearer explanations but returns every retrieved document, while version 2 gives shorter answers and cites only the snippet it relied on. Both are available as Python scripts elsewhere in this repository for interactive use on the command line.