# LangChain & ChromaDB POC
* This notebook serves as a playground for integrating `LangChain` with `ChromaDB`.
* The main goals of this notebook are to:
    * Load data into ChromaDB using LangChain and OpenAI Embeddings.
    * Retrieve the data by similarity using LangChain.
* We'll use production data so that we can convert this notebook directly into scripts when ready.

## Import Libraries

In [1]:
from pathlib import Path
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings


import os
import pandas as pd
import numpy as np
import chromadb

## Data Exploration

* Before loading the data into `ChromaDB`, we need to explore it to decide whether to group multiple lines into one document or treat each line as its own document.

### Read Data


In [2]:
data_dir_path = Path("..","data")
raw_dir_path = Path(data_dir_path, "raw")
file_path = Path(raw_dir_path, "data.csv")

In [3]:
data = pd.read_csv(file_path)
data.head()

Unnamed: 0,id,season,episode,scene,line_text,speaker,deleted,title,air_date,rating,...,description,directed_by,written_by,total_lines,total_scenes,year,month,day,air_dates_diff_days,lines
0,1,1,1,1,All right Jim. Your quarterlies look very good...,Michael,False,Pilot,2005-03-24,7.5,...,The premiere episode introduces the boss and s...,Ken Kwapis,Ricky Gervais | Stephen Merchant | Greg Daniels,229,39,2005,3,3,0,Michael: All right Jim. Your quarterlies look ...
1,2,1,1,1,"Oh, I told you. I couldn't close it. So...",Jim,False,Pilot,2005-03-24,7.5,...,The premiere episode introduces the boss and s...,Ken Kwapis,Ricky Gervais | Stephen Merchant | Greg Daniels,229,39,2005,3,3,0,"Jim: Oh, I told you. I couldn't close it. So..."
2,3,1,1,1,So you've come to the master for guidance? Is ...,Michael,False,Pilot,2005-03-24,7.5,...,The premiere episode introduces the boss and s...,Ken Kwapis,Ricky Gervais | Stephen Merchant | Greg Daniels,229,39,2005,3,3,0,Michael: So you've come to the master for guid...
3,4,1,1,1,"Actually, you called me in here, but yeah.",Jim,False,Pilot,2005-03-24,7.5,...,The premiere episode introduces the boss and s...,Ken Kwapis,Ricky Gervais | Stephen Merchant | Greg Daniels,229,39,2005,3,3,0,"Jim: Actually, you called me in here, but yeah."
4,5,1,1,1,"All right. Well, let me show you how it's done.",Michael,False,Pilot,2005-03-24,7.5,...,The premiere episode introduces the boss and s...,Ken Kwapis,Ricky Gervais | Stephen Merchant | Greg Daniels,229,39,2005,3,3,0,"Michael: All right. Well, let me show you how ..."


In [4]:
## shape of data
data.shape

(59909, 21)

* Let's figure out the average number of lines and words per episode.

In [5]:
## average lines per episode
data.groupby(["season","episode"])["total_lines"].aggregate(['mean']).aggregate(["min","max","mean","std", "median"])

Unnamed: 0,mean
min,72.0
max,625.0
mean,311.682796
std,91.060547
median,291.0


* To look at word counts, we first need a word-count column for each line.
* We'll compute it on the `lines` column we added earlier (it already includes the speaker name), since that is the column we are going to use for embedding and querying.

In [6]:
data["lines_word_count"] = data["lines"].apply(lambda val: len(val.split(" ")))

In [7]:
data["line_text_word_count"] = data["line_text"].apply(lambda val: len(val.split(" ")))

In [8]:
data["lines_word_count"].aggregate(["min","max","mean","std", "median"])

min         2.000000
max       274.000000
mean       12.471966
std        13.539944
median      8.000000
Name: lines_word_count, dtype: float64

Observation:
* WHAT! There is a line with 274 words!

In [9]:
data[data["lines_word_count"] == 274]["lines"].values[0]

"Michael: I've really learned from the greats. The great improvisers, Drew Carey, Ryan Stiles, uh, the Brady guy not so much. He's more the signing, Wayne Brady. Um, Robin Williams. Oh, man, would I love to go head-to-head with him. Oh! That would be exciting. [as Robin Williams] 'Hi. I'm Mork from Ork.' Well, I'm Bork from Spork. Nanoo, nanoo. Jibelee, baloobaloo. [as Robin Williams] 'That's Good morning, Vietnam!' Well, hello to you. You know it would be... God. And you know what, sometimes when I'm watching somebody like um, like Jay Leno. He'll be half way through his step [snaps his fingers] And I will already be laughing at the punch line. He hasn't even gotten to it. He doesn't even know what it is it. So it's fun, you know it's fun having a mind that works like that. That is just a few steps ahead of... comedically ahead of like what's going on. Like I'll watch T.V. and I'll be watching a show and I will think, oh, I know someone's gonna walk in here right now and say something

* I assumed correctly that it was a Michael Scott rant :D
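
One caveat about the counts above: `val.split(" ")` splits on every single space, so a doubled space produces an empty string that is counted as a word. A minimal sketch of the difference, using a short sample line with a doubled space (the variable names are illustrative, not part of the notebook):

```python
# Sample line with a doubled space after "Fact."
line = "Jim: Fact.  Bears eat beets."

# split(" ") keeps the empty string that sits between the two spaces
naive_count = len(line.split(" "))

# split() with no argument collapses any run of whitespace
robust_count = len(line.split())

print(naive_count, robust_count)
```

If exact word counts matter later, switching the `lambda` to `len(val.split())` would be the smaller change; for rough sizing the difference is negligible.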

## Populating Vector Store

### Plan
* We will create multiple documents by dividing each episode into scenes (`pseudo-scene`, since we are just splitting the episode's lines into `total_scenes` roughly equal groups).
* We will then use the `RecursiveCharacterTextSplitter` with a `chunk_size` of 400.
    * `chunk_size` is measured in characters by default, so 400 is roughly 70 words (assuming 5 to 6 characters per word), i.e. about 6 to 7 lines at ~10 words per line.
    * We can tune the chunk size after that.
* We'll add `Episode Start` and `Episode End` markers
* We'll include speaker names in the text (the `lines` column already has them) and also in metadata
* We'll add `episode description` to improve retrieval. 
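
The pseudo-scene split in the first bullet can be sketched in plain Python. This only illustrates how a near-equal split distributes lines across chunks (the same rule `numpy.array_split` documents); `split_sizes` is a hypothetical helper, and the 229 lines / 39 scenes come from the Pilot rows in the data preview.

```python
def split_sizes(n_lines, n_chunks):
    """Sizes of near-equal chunks: the first (n_lines % n_chunks) chunks get one extra line."""
    base, extra = divmod(n_lines, n_chunks)
    return [base + 1 if i < extra else base for i in range(n_chunks)]


# Pilot episode: 229 lines spread over 39 pseudo-scenes
sizes = split_sizes(229, 39)

# number of 6-line chunks, number of 5-line chunks, and the total
print(sizes.count(6), sizes.count(5), sum(sizes))
```

So a pseudo-scene in the Pilot holds 5 or 6 lines, which will not line up with the real scene boundaries in the `scene` column.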

* First, let's make sure our dataframe is sorted by season, episode, and ID so that the document breakdown works as expected.

In [10]:
## sorting the data by season, episode, and ID
data = data.sort_values(by=["season","episode","id"])

In [11]:
data.head()

Unnamed: 0,id,season,episode,scene,line_text,speaker,deleted,title,air_date,rating,...,written_by,total_lines,total_scenes,year,month,day,air_dates_diff_days,lines,lines_word_count,line_text_word_count
0,1,1,1,1,All right Jim. Your quarterlies look very good...,Michael,False,Pilot,2005-03-24,7.5,...,Ricky Gervais | Stephen Merchant | Greg Daniels,229,39,2005,3,3,0,Michael: All right Jim. Your quarterlies look ...,15,14
1,2,1,1,1,"Oh, I told you. I couldn't close it. So...",Jim,False,Pilot,2005-03-24,7.5,...,Ricky Gervais | Stephen Merchant | Greg Daniels,229,39,2005,3,3,0,"Jim: Oh, I told you. I couldn't close it. So...",10,9
2,3,1,1,1,So you've come to the master for guidance? Is ...,Michael,False,Pilot,2005-03-24,7.5,...,Ricky Gervais | Stephen Merchant | Greg Daniels,229,39,2005,3,3,0,Michael: So you've come to the master for guid...,15,14
3,4,1,1,1,"Actually, you called me in here, but yeah.",Jim,False,Pilot,2005-03-24,7.5,...,Ricky Gervais | Stephen Merchant | Greg Daniels,229,39,2005,3,3,0,"Jim: Actually, you called me in here, but yeah.",9,8
4,5,1,1,1,"All right. Well, let me show you how it's done.",Michael,False,Pilot,2005-03-24,7.5,...,Ricky Gervais | Stephen Merchant | Greg Daniels,229,39,2005,3,3,0,"Michael: All right. Well, let me show you how ...",11,10


In [12]:
data.tail()

Unnamed: 0,id,season,episode,scene,line_text,speaker,deleted,title,air_date,rating,...,written_by,total_lines,total_scenes,year,month,day,air_dates_diff_days,lines,lines_word_count,line_text_word_count
59904,59905,9,23,112,It all seems so very arbitrary. I applied for ...,Creed,False,Finale,2013-05-16,9.8,...,Greg Daniels,522,116,2013,5,3,238,Creed: It all seems so very arbitrary. I appli...,60,59
59905,59906,9,23,113,I just feel lucky that I got a chance to share...,Meredith,False,Finale,2013-05-16,9.8,...,Greg Daniels,522,116,2013,5,3,238,Meredith: I just feel lucky that I got a chanc...,42,41
59906,59907,9,23,114,I'm happy that this was all filmed so I can ...,Phyllis,False,Finale,2013-05-16,9.8,...,Greg Daniels,522,116,2013,5,3,238,Phyllis: I'm happy that this was all filmed ...,32,31
59907,59908,9,23,115,I sold paper at this company for 12 years. My ...,Jim,False,Finale,2013-05-16,9.8,...,Greg Daniels,522,116,2013,5,3,238,Jim: I sold paper at this company for 12 years...,47,46
59908,59909,9,23,116,I thought it was weird when you picked us to m...,Pam,False,Finale,2013-05-16,9.8,...,Greg Daniels,522,116,2013,5,3,238,Pam: I thought it was weird when you picked us...,47,46


* Grouping the data by season, episode

### Preparing Documents

In [13]:
## grouping the data by season, episode
grouped_data = data.groupby(["season", "episode"])

In [14]:
# initializing list of documents
documents = []
# loop through the grouped data to create documents and metadata
for (season, episode), group in grouped_data:
    # season and episode are numpy.int64; group is a pandas.DataFrame
    # read the total number of scenes for this episode
    total_scenes = int(group["total_scenes"].iloc[0])

    # break the episode into pseudo-scene chunks; we split row positions
    # instead of the DataFrame itself, because recent pandas versions emit a
    # deprecation warning when np.array_split is applied to a DataFrame
    position_chunks = np.array_split(np.arange(len(group)), total_scenes)
    scene_chunks = [group.iloc[positions] for positions in position_chunks]

    # loop through each pseudo-scene to create a document and its metadata
    for i, chunk in enumerate(scene_chunks):
        scene_number = i + 1  # start with scene number 1
        # Step 1: Combine lines from the pseudo-scene
        # TODO Does it make sense to add a newline after each line? I don't think so, but something to confirm later.
        combined_lines = " ".join(chunk["lines"].values)

        # Step 2: Create a list of the unique speakers in this scene
        # usually there are only a few, since a pseudo-scene is often between 2 people
        speaker_list = [speaker.strip() for speaker in chunk["speaker"].unique()]

        # Step 3: Retrieve episode description
        episode_description = chunk["description"].iloc[0]

        # Step 4: Add episode start/end markers if needed
        # (two separate checks, so a single-chunk episode gets both markers)
        if i == 0:
            combined_lines = f"--- Episode Start ---\n{episode_description}\n{combined_lines}"
        if i == len(scene_chunks) - 1:
            combined_lines = f"{combined_lines}\n--- Episode End ---"

        # Step 5: Create metadata object
        ## we convert lists to comma-separated strings because ChromaDB metadata doesn't support lists
        ## will need to look into other solutions for better querying
        written_by = ",".join(writer.strip() for writer in chunk["written_by"].iloc[0].split("|"))

        metadata = {
            "season": int(season),  ## numpy.int64 -> native int for the metadata
            "episode": int(episode),
            "scene": int(scene_number),
            "speakers": ",".join(speaker_list),
            "episode_description": episode_description,
            "rating": float(chunk["rating"].iloc[0]),
            "directed_by": chunk["directed_by"].iloc[0].strip(),
            "written_by": written_by
        }

        # Step 6: Append document to the document list
        documents.append({
            "text": combined_lines,
            "metadata": metadata
        })



In [15]:
documents[0]

{'text': "--- Episode Start ---\nThe premiere episode introduces the boss and staff of the Dunder-Mifflin Paper Company in Scranton, Pennsylvania in a documentary about the workplace.\nMichael: All right Jim. Your quarterlies look very good. How are things at the library? Jim: Oh, I told you. I couldn't close it. So... Michael: So you've come to the master for guidance? Is this what you're saying, grasshopper? Jim: Actually, you called me in here, but yeah. Michael: All right. Well, let me show you how it's done. Michael: [on the phone] Yes, I'd like to speak to your office manager, please. Yes, hello. This is Michael Scott. I am the Regional Manager of Dunder Mifflin Paper Products. Just wanted to talk to you manager-a-manger. [quick cut scene] All right. Done deal. Thank you very much, sir. You're a gentleman and a scholar. Oh, I'm sorry. OK. I'm sorry. My mistake. [hangs up] That was a woman I was talking to, so... She had a very low voice. Probably a smoker, so... [Clears throat] S
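
Because the metadata stores multi-valued fields as comma-separated strings, it helps to have a matching way to turn them back into lists at query time. A minimal sketch, assuming no name contains a comma; `join_values` and `split_values` are hypothetical helpers, and the sample string is the `written_by` value from the Pilot rows.

```python
def join_values(values, sep=","):
    """Flatten a list of names into one string, trimming stray whitespace."""
    return sep.join(value.strip() for value in values)


def split_values(joined, sep=","):
    """Recover the list of names from a flattened string (empty string -> empty list)."""
    return [value for value in joined.split(sep) if value]


written_by_raw = "Ricky Gervais | Stephen Merchant | Greg Daniels"
written_by = join_values(written_by_raw.split("|"))

print(written_by)
print(split_values(written_by))
```

The round trip only holds while the separator never appears inside a value, which is one reason to revisit this encoding later.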

### Raw Text to LangChain Documents

In [16]:
## create a list of LangChain Documents
docs = [
    Document(page_content=doc["text"], metadata=doc["metadata"])
    for doc in documents
]

### Chunking

In [17]:
# Use RecursiveCharacterTextSplitter for better chunking
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,  # measured in characters (default length function is len), not words or lines
    chunk_overlap=50  # characters shared between consecutive chunks for context continuity
)

# Apply text splitter to break large chunks into smaller ones
split_docs = text_splitter.split_documents(docs)
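
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window chunker. It is not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers to break on separators such as newlines and spaces); `sliding_chunks` is a hypothetical helper that only shows that both parameters are counted in characters.

```python
def sliding_chunks(text, chunk_size=400, chunk_overlap=50):
    """Cut text into fixed-size character windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Synthetic 1000-character text (digits repeating), just to inspect window sizes
text = "".join(str(i % 10) for i in range(1000))
chunks = sliding_chunks(text)

print([len(chunk) for chunk in chunks])
# the tail of one chunk is repeated at the head of the next
print(chunks[0][-50:] == chunks[1][:50])
```

With roughly 12 words per line on average, a 400-character window holds only a handful of lines, so most pseudo-scenes end up spread over several chunks.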

### Creating Vector Embeddings & Storing in ChromaDB

In [18]:
## initialize OpenAIEmbeddings
embedding_function = OpenAIEmbeddings()

In [19]:
# Connect to ChromaDB and store documents
chroma_client = chromadb.PersistentClient(path="../db")

# reuse the client created above instead of opening the same directory a second time
vector_db = Chroma.from_documents(split_docs, embedding_function, client=chroma_client)

print("Data successfully loaded into ChromaDB!")

Data successfully loaded into ChromaDB!


### Verifying VectorDB

In [21]:
## check db collection count
vector_db._collection.count()

14998

In [23]:
## Retrieve one stored document from ChromaDB (get(limit=1) returns the first record, not a random one)
sample = vector_db._collection.get(limit=1)
print(sample)

{'ids': ['a0bf6530-ec61-401f-8918-f84fa202030c'], 'embeddings': None, 'documents': ['--- Episode Start ---\nThe premiere episode introduces the boss and staff of the Dunder-Mifflin Paper Company in Scranton, Pennsylvania in a documentary about the workplace.'], 'uris': None, 'data': None, 'metadatas': [{'directed_by': 'Ken Kwapis', 'episode': 1, 'episode_description': 'The premiere episode introduces the boss and staff of the Dunder-Mifflin Paper Company in Scranton, Pennsylvania in a documentary about the workplace.', 'rating': 7.5, 'scene': 1, 'season': 1, 'speakers': 'Michael,Jim', 'written_by': 'Ricky Gervais,Stephen Merchant,Greg Daniels'}], 'included': [<IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}


In [26]:
## Retrieve using similarity search
query = "When did Michael say 'I love you'"
results = vector_db.similarity_search(query=query, k=4)

for doc in results:
    print(doc.page_content)
    print(doc.metadata)

in love on Valentine's Day. Holly: Two people in love? Michael: I love you. Holly: Wait, wait, wait, what do you mean you love me? We've only been dating for a week. Do you mean you love me like, 'oh, hey, there's Holly. I love that girl.' Or you do you mean you love me like you love me-love me? Michael: I love you-love you.
{'directed_by': 'Greg Daniels', 'episode': 15, 'episode_description': "It's Valentine's Day, and the office is fed up with Michael and Holly's PDA, Andy helps Erin solve Gabe's riddles to find her gift, and Jim and Pam get drunk and try to find a place in the office to have sex.", 'rating': 8.4, 'scene': 23, 'season': 7, 'speakers': 'Holly,Michael', 'written_by': 'Robert Padnick'}
I'm too shy to tell you that I love you.' Michael: Pam.  Pam, you gave me your word. Ryan: [kissing Kelly against her desk] You did that for me? Kelly: Mmhmm. Ryan: Are you happy you did? Toby: Hey guys that's really inappropriate. Ryan: [kisses for a little longer]  What's up? Michael: U
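
For intuition about what a top-`k` similarity search does, here is a toy version in plain Python. The three-dimensional vectors and chunk names are made up, and cosine similarity is an assumption chosen for illustration; the distance metric and index that ChromaDB actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k vectors most similar to the query, best first."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Made-up embeddings for three chunks and one query
toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.8, 0.6, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
query_vec = [0.9, 0.1, 0.0]

print(top_k(query_vec, toy_vectors, k=2))
```

The real search works the same way in spirit: the query text is embedded with the same embedding function and compared against the stored chunk vectors.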

In [30]:
## verifying with another similarity search (no metadata filter is applied here)
results = vector_db.similarity_search(
    query="What did Dwight say about bears?",
    k=10,
)
for doc in results:
    print(doc.page_content)

Jim: [Dressed as Dwight] It's kind of blurry. [puts on his glasses] That's better. [exhales] Question.  What kind of bear is best? Dwight: That's a ridiculous question. Jim: False.  Black bear. Dwight: Well that's debatable.  There are basically two schools of thought--- Jim: Fact.  Bears eat beets.  Bears.  Beets.  Battlestar Galactica. Dwight: Bears do not--- What is going on--- What are you
[Dwight scoffs] The job was not mine to give. [sighs] Look, I need your advice on something. I am told that there are bears in the Rockies. Dwight: Where did you hear that? Obvious XM Radio? Michael: Well, I was just thinking that maybe I should keep a salami in my pocket... Dwight: Great idea.
Dwight: [whispering to Jim] Trade seats with me. Jim: No. Dwight: I've got a better angle on Pam. I can see everything. Jim: Please stop. Dwight: [grabs a spoon from Jim's coffee cup and checks behind him with it] I need a soup spoon. Dwight: Rule 17: don't turn your back on bears, men you have wronged, or
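
Since `speakers` is stored as a comma-separated string, one way to narrow results to a given speaker is to filter after retrieval. A minimal sketch on toy dictionaries shaped like the metadata printed earlier; `has_speaker` is a hypothetical helper and the chunk texts are placeholders. With real results the same check would run against `doc.metadata`.

```python
def has_speaker(metadata, speaker):
    """True if speaker is one of the comma-separated names in metadata["speakers"]."""
    names = [name.strip() for name in metadata.get("speakers", "").split(",")]
    return speaker in names


# Toy results; only the "speakers" field mirrors the real metadata layout
toy_results = [
    {"text": "chunk 1", "metadata": {"speakers": "Jim,Dwight"}},
    {"text": "chunk 2", "metadata": {"speakers": "Michael,Holly"}},
    {"text": "chunk 3", "metadata": {"speakers": "Dwight"}},
]

dwight_chunks = [r["text"] for r in toy_results if has_speaker(r["metadata"], "Dwight")]
print(dwight_chunks)
```

Matching whole names after splitting avoids false positives that a plain substring test would give, for example a search for one name matching a longer name that contains it.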