# Friends RAG

The goal of this RAG is to let a user describe a scene from a Friends episode and get back the episode's title, season, episode number, and a summary of the entire episode. This way, Friends fans can search for a specific scene and get all of this information, including a description of the whole episode, so they can remember exactly what happens in it.

In [1]:
from langchain_aws import BedrockEmbeddings, ChatBedrock
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
import pandas as pd
from langfuse.langchain import CallbackHandler
from langfuse import Langfuse
from dotenv import load_dotenv

load_dotenv()

True

### Data Loading

Using two CSV files:
- FriendsEpisodes: general information about each episode, including the season number, the episode number and a short description
- FriendsScripts: every line spoken in each episode, by character
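
The two files name the same keys differently: the episode table identifies an episode by `season` and `number`, while the script table uses `Season` and `Episode`. A minimal sketch of a sanity check for that join is shown below. It uses the standard `csv` module on two tiny in-memory samples; the rows are hypothetical placeholders (only the column names mirror the real files), and `unmatched_script_episodes` is an illustrative helper, not part of the pipeline.

```python
import csv
import io

# Hypothetical sample rows; only the column names mirror the real CSV files.
EPISODES_CSV = """season,number,episode,title
1,1,1,Sample Title A
1,2,2,Sample Title B
"""
SCRIPTS_CSV = """Season,Episode,Character,Text
1,1,Monica,Sample line one
1,1,Joey,Sample line two
1,3,Ross,Sample line three
"""


def unmatched_script_episodes(episodes_csv: str, scripts_csv: str) -> list:
    """Return (season, episode) keys found in the scripts but not in the episode table."""
    episode_keys = {
        (int(row["season"]), int(row["number"]))
        for row in csv.DictReader(io.StringIO(episodes_csv))
    }
    script_keys = {
        (int(row["Season"]), int(row["Episode"]))
        for row in csv.DictReader(io.StringIO(scripts_csv))
    }
    return sorted(script_keys - episode_keys)


missing = unmatched_script_episodes(EPISODES_CSV, SCRIPTS_CSV)
print(missing)
```

Any key reported by a check like this corresponds to a script group that has no episode metadata to attach, so it is worth looking at before building documents.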

In [2]:
def create_documents(episodes_df, scripts_df):
    all_documents = []
    
    # Create episode-level documents
    for _, episode in episodes_df.iterrows():
        episode_doc = Document(
            page_content=f"Episode: {episode['title']}\n\n{episode['description']}", # What gets embedded
            metadata={
                'season': int(episode['season']),
                'number': int(episode['number']),   # Episode within season
                'episode': int(episode['episode']), # Overall episode number
                'title': episode['title'],
                'air_date': episode['air_date'],
                'content_type': 'episode_summary',
            }
        )
        all_documents.append(episode_doc)

    # Create script-based documents by episode
    for (season, episode_in_season), group in scripts_df.groupby(['Season', 'Episode']):
        
        # Get episode generic info
        episode_info = episodes_df[
            (episodes_df['season'] == season) & 
            (episodes_df['number'] == episode_in_season)    # In episodes_df, 'number' is episode within season, but in scripts_df it's 'Episode'
        ]

        # Skip if no episode info found
        if len(episode_info) == 0:
            continue

        # There should be exactly one match  
        episode_info = episode_info.iloc[0]
        
        # Combine all dialogues of the episode
        script_text = '\n'.join([f"{row['Character']}: {row['Text']}" for _, row in group.iterrows()])
        
        # Split into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,     
            chunk_overlap=100,                 # Preserve context between chunks
            separators=["\n\n", "\n", " "],    # Prefer splitting at natural breaks
            length_function=len
        )
        
        chunks = text_splitter.split_text(script_text)
        
        # Create a document for each chunk
        for i, chunk in enumerate(chunks):
            chunk_doc = Document(
                page_content=chunk, # What gets embedded
                metadata={
                    'season': int(episode_info['season']),
                    'number': int(episode_info['number']),    # Episode within season
                    'episode': int(episode_info['episode']),  # Overall episode number  
                    'title': episode_info['title'],
                    'air_date': episode_info['air_date'],
                    'scene_number': i + 1,
                    'total_scenes': len(chunks),
                    'episode_description': episode_info['description'],
                    'content_type': 'script_scene'
                }
            )
            all_documents.append(chunk_doc)
    
    return all_documents

In [3]:
episodes_df = pd.read_csv('FriendsEpisodes.csv')
scripts_df = pd.read_csv('FriendsScripts.csv')

In [4]:
all_docs = create_documents(episodes_df, scripts_df)
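
To make the effect of `chunk_size` and `chunk_overlap` more concrete, here is a simplified, line-based sketch of chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses; `chunk_lines` is a hypothetical helper written only for illustration, it measures overlap in lines rather than characters, and the dialogue lines are placeholders.

```python
def chunk_lines(lines, chunk_size, overlap_lines):
    """Greedily pack whole lines into chunks of about chunk_size characters,
    repeating the last overlap_lines lines at the start of the next chunk."""
    chunks = []
    current = []
    for line in lines:
        candidate = "\n".join(current + [line])
        if current and len(candidate) > chunk_size:
            chunks.append("\n".join(current))
            # Carry the tail of the finished chunk over as shared context.
            current = current[-overlap_lines:] if overlap_lines > 0 else []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


# Placeholder dialogue, not taken from the scripts file.
sample = ["A: one", "B: two", "C: three", "D: four"]
chunks = chunk_lines(sample, chunk_size=14, overlap_lines=1)
for n, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {n} ---\n{chunk}")
```

Each chunk starts with the last line of the previous one, which is the same idea as the overlap used above: a scene that straddles a chunk boundary still appears with some of its surrounding dialogue.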

### Embed the documents, index the data and save it in a vector store 

I chose the amazon.titan-embed-text-v2:0 model because it is the most recent Titan text embedding model from AWS at the time of writing and, for simplicity, because I'm already using AWS services (Bedrock).

I chose FAISS for practicality, since the vector store can be saved locally.

In [None]:
REGION_NAME = ""  # Insert your AWS region name here
CREDENTIALS_PROFILE_NAME = ""  # Insert your AWS credentials profile name here
EMBEDDER_MODEL_ID = "amazon.titan-embed-text-v2:0"
EMBEDDER_MODEL_KWARGS = {
    "dimensions": 1024,
    "normalize": True
}

VECTOR_STORE_PATH = "./vector_database/"

In [6]:
embedder = BedrockEmbeddings(
    model_id=EMBEDDER_MODEL_ID,
    model_kwargs=EMBEDDER_MODEL_KWARGS,
    region_name=REGION_NAME,
    credentials_profile_name=CREDENTIALS_PROFILE_NAME
)

In [None]:
vector_store = FAISS.from_documents(documents=all_docs, embedding=embedder)
vector_store.save_local(VECTOR_STORE_PATH)

Checkpoint so far:
- ✅ Embedded all documents - Each document was converted into a 1024-dimensional vector using Titan Embeddings v2
- ✅ Indexed the data - FAISS created an index structure for fast similarity search
- ✅ Saved it locally - The vector store is persisted to disk at VECTOR_STORE_PATH

What we have now:
- A searchable vector database of all Friends episodes
- Each vector is linked to its metadata (season, episode, title, etc.)
- Ready to query!
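
Because the embeddings are requested with `"normalize": True`, each vector has unit length, so ranking by Euclidean distance and ranking by cosine similarity give the same order. The sketch below illustrates this with a brute-force search in plain Python over tiny hand-made 2-D vectors. The vectors and the `top_k` helper are hypothetical stand-ins for illustration; this is not FAISS code.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # For unit vectors the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


def top_k(query, vectors, k):
    """Return indices of the k vectors closest to query by squared L2 distance."""
    order = sorted(range(len(vectors)), key=lambda i: l2_squared(query, vectors[i]))
    return order[:k]


# Hypothetical 2-D "embeddings" standing in for the 1024-dimensional ones.
docs = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0])]
query = normalize([2.0, 1.0])

by_l2 = top_k(query, docs, k=3)
by_cos = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))[:3]
print(by_l2, by_cos)
```

Both rankings agree because, for unit vectors, the squared distance equals `2 - 2 * cosine`.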

### Retriever and Generator Pipeline

I chose Claude 3 Haiku as the LLM since it is robust enough for this task. Sonnet would probably be overkill here.

In [27]:
LLM_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"
LLM_MODEL_KWARGS = {
    "max_tokens": 1024,
    "temperature": 0.1
}

SEARCH_TYPE = "similarity"
RETRIEVER_KWARGS = {
    "k": 3 # Number of documents to retrieve
}

In [28]:
llm = ChatBedrock(
        region_name=REGION_NAME, 
        credentials_profile_name=CREDENTIALS_PROFILE_NAME,
        model_id=LLM_MODEL_ID, 
        model_kwargs=LLM_MODEL_KWARGS
    )

In [29]:
vector_store = FAISS.load_local(VECTOR_STORE_PATH, embeddings=embedder, allow_dangerous_deserialization=True)
retriever = vector_store.as_retriever(search_type=SEARCH_TYPE, search_kwargs=RETRIEVER_KWARGS)

In [30]:
PROMPT_TEMPLATE = """
You are a Friends TV show episode assistant. Based on the episode information provided, give a clear and concise response.
Don't make up any information. If the episode is not found in the context, respond with "Episode not found in the provided information."

Episode Information:
{context}

User Query: {question}

Please respond with the season number, episode number within the season, episode number of the entire series, title, and a brief description of the episode. 
Don't forget to mention if the information is based on episode summaries or script scenes.

RAG Answer:"""

In [31]:
prompt = PromptTemplate(
    template=PROMPT_TEMPLATE,
    input_variables=["context", "question"]
)

In [32]:
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
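
In the chain above, the retriever output is passed to the prompt as-is, so `{context}` is filled with the string form of a list of `Document` objects. A possible alternative, sketched below, is to format each document explicitly so that the metadata the prompt asks about (season, episode number, title, content type) is laid out predictably. `FakeDoc` is a hypothetical stand-in for LangChain's `Document` so the snippet runs on its own, and `format_docs` is an illustrative helper, not part of the original pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in exposing the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Render retrieved documents as labelled text blocks for the {context} slot."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        m = doc.metadata
        header = (
            f"Document {i} ({m.get('content_type', 'unknown')}): "
            f"Season {m.get('season')}, Episode {m.get('number')} "
            f"(overall #{m.get('episode')}), title: {m.get('title')}"
        )
        blocks.append(f"{header}\n{doc.page_content}")
    return "\n\n".join(blocks)


# Placeholder document; the values are not taken from the dataset.
example = FakeDoc(
    page_content="Ross: Sample line",
    metadata={
        "season": 1,
        "number": 2,
        "episode": 2,
        "title": "Sample Title",
        "content_type": "script_scene",
    },
)
formatted = format_docs([example])
print(formatted)
```

In the real chain a function like this could be piped after the retriever, for example `{"context": retriever | format_docs, "question": RunnablePassthrough()}`. That wiring is an assumption and has not been run against this notebook.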

In [33]:
langfuse_handler = CallbackHandler()

In [14]:
def print_response(query: str):
    print(f"User Query:\n{query}")
    response = rag_chain.invoke(query, config={"callbacks": [langfuse_handler]})
    print(f"\nRAG Response:\n{response}")
    docs = retriever.invoke(query)
    print(f"\nRetrieved {len(docs)} documents:")
    for i, doc in enumerate(docs):
        print(f"\nDocument {i+1}:\n{doc.page_content}\nMetadata: {doc.metadata}")

In [16]:
print_response("What is the episode where Rachel gets off the plane?")

User Query:
What is the episode where Rachel gets off the plane?

RAG Response:
The episode where Rachel gets off the plane is "The Last One: Part 1" (Season 10, Episode 17, Episode 235).

Based on the script scenes provided, the episode description is:
"Erica gives birth to the baby that Monica and Chandler are adopting. However, there's one small added surprise. Meanwhile, Ross and Rachel sleep together one last time before Rachel leaves the plane to tell Ross she loves him."

This information is based on the episode summaries and script scenes provided in the context.

Retrieved 5 documents:

Document 1:
Air stewardess: Miss, I can't let you off the plane.
Ross: Let her off the plane!
Air stewardess: I am afraid you are gonna have to take a seat.
Rachel: Oh, please, miss, you don't understand!
Ross: Try to understand!
Rachel: Oh, come on, miss, isn't there any way that you can just let me off...
Ross: No! No! Oh my God. Did she get off the plane? Did she get off the plane?
Rachel: I

### Evaluation with Langfuse

I used Claude 3.5 Sonnet as the evaluator since it is more robust than Haiku, the model used in the RAG pipeline itself.

In [34]:
langfuse = Langfuse()

In [35]:
test_queries = [
    "What is the episode where Rachel gets off the plane?",
    "What is the episode where Rachel sings Copacabana?",
    "Which episode features the holiday armadillo?",
    "Which episode features Rachel's trifle?",
    "What is the episode where Ross gets a tan?",
    "What are the episodes on Thanksgiving?",
    "What is the episode with the routine?",
    "What is the episode with the leather pants?",
    "What is the episode where Ross and Emily get married?",
    "What is the episode where Ross and Rachel sing to Emma the song Baby Got Back?",
    "What is the episode where Ross finds out about Monica and Chandler?",
    "What is the episode where Phoebe gives birth?",
    "What is the episode where Phoebe and Rachel try to distract Ross from seeing Monica and Chandler through the window?",
    "What is the episode with the couch?",
    "What is the episode where Marcel, Ross's monkey, first appears?"
]

In [36]:
llm_evaluator = ChatBedrock(
        region_name=REGION_NAME, 
        credentials_profile_name=CREDENTIALS_PROFILE_NAME,
        model_id="us.anthropic.claude-3-5-sonnet-20240620-v1:0", 
        model_kwargs=LLM_MODEL_KWARGS
    )

In [37]:
import re

def get_score(prompt):
    response = llm_evaluator.invoke(prompt, config={"callbacks": [langfuse_handler]}).content
    # Prefer an explicit numeric score; keyword matching alone misfires on
    # phrases such as "not fully correct".
    match = re.search(r"\b(1\.0|0\.5|0\.0)\b", response)
    if match:
        return float(match.group(1))
    text = response.lower()
    if "partial" in text or "somewhat" in text:
        return 0.5
    if "fully" in text or "highly" in text:
        return 1.0
    return 0.0

In [38]:
print("Starting evaluation...\n")

results = []

for i, query in enumerate(test_queries, 1):
    print(f"[{i}/{len(test_queries)}] {query}")
    
    # Get context with metadata and response
    retrieved_docs = retriever.invoke(query)
    context = "\n\n".join([
        f"Document {j}:\nContent: {doc.page_content}\nMetadata: {doc.metadata}"
        for j, doc in enumerate(retrieved_docs, 1)  # j avoids shadowing the outer loop's i
    ])
    response = rag_chain.invoke(query, config={"callbacks": [langfuse_handler]})
    
    # Evaluate correctness
    correctness = get_score(f"Rate if this answer correctly reflects the context. Score 1.0 if fully correct, 0.5 if partially correct, 0.0 if incorrect.\n\nContext: {context}\nAnswer: {response}\n\nScore:")
    
    # Evaluate relevance
    relevance = get_score(f"Rate if this answer addresses the question. Score 1.0 if highly relevant, 0.5 if somewhat relevant, 0.0 if off-topic.\n\nQuestion: {query}\nAnswer: {response}\n\nScore:")
    
    # Evaluate groundedness
    groundedness = get_score(f"Rate if this answer uses only the context. Score 1.0 if fully grounded, 0.5 if partially grounded, 0.0 if includes outside info.\n\nContext: {context}\nAnswer: {response}\n\nScore:")
    
    print(f"  ✓ Correctness: {correctness}, Relevance: {relevance}, Groundedness: {groundedness}")
    
    results.append({
        "query": query,
        "correctness": correctness,
        "relevance": relevance,
        "groundedness": groundedness
    })

# Summary
print("\n" + "="*80)
print("EVALUATION SUMMARY")
print("="*80)
avg_correctness = sum(r["correctness"] for r in results) / len(results)
avg_relevance = sum(r["relevance"] for r in results) / len(results)
avg_groundedness = sum(r["groundedness"] for r in results) / len(results)

print(f"Average Correctness:   {avg_correctness:.2f}/1.0")
print(f"Average Relevance:     {avg_relevance:.2f}/1.0")
print(f"Average Groundedness:  {avg_groundedness:.2f}/1.0")

print("\n✓ Done! Check your Langfuse dashboard: https://cloud.langfuse.com")

Starting evaluation...

[1/15] What is the episode where Rachel gets off the plane?
  ✓ Correctness: 0.5, Relevance: 1.0, Groundedness: 0.5
[2/15] What is the episode where Rachel sings Copacabana?
  ✓ Correctness: 1.0, Relevance: 1.0, Groundedness: 1.0
[3/15] Which episode features the holiday armadillo?
  ✓ Correctness: 1.0, Relevance: 1.0, Groundedness: 1.0
[4/15] Which episode features Rachel's trifle?
  ✓ Correctness: 1.0, Relevance: 1.0, Groundedness: 1.0
[5/15] What is the episode where Ross gets a tan?
  ✓ Correctness: 1.0, Relevance: 1.0, Groundedness: 1.0
[6/15] What are the episodes on Thanksgiving?
  ✓ Correctness: 1.0, Relevance: 1.0, Groundedness: 1.0
[7/15] What is the episode with the routine?
  ✓ Correctness: 1.0, Relevance: 1.0, Groundedness: 1.0
[8/15] What is the episode with the leather pants?
  ✓ Correctness: 1.0, Relevance: 1.0, Groundedness: 1.0
[9/15] What is the episode where Ross and Emily get married?
  ✓ Correctness: 1.0, Relevance: 1.0, Groundedness: 1.0
[