## RAG Structure for Inference

https://github.com/svpino/youtube-rag/blob/main/rag.ipynb

We use a YouTube video to:
- build a RAG dataset
- answer queries against it using LangChain

Use whichever OpenAI chat model you prefer, for example:
- GPT-3.5 Turbo
- GPT-4

In [None]:
import os
from dotenv import load_dotenv
from langchain_openai.chat_models import ChatOpenAI

load_dotenv()

# Keys are read from the environment (for example, from the .env file loaded above).
# If you paste a key in place of "YOUR KEY", delete it before committing to GitHub.
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR KEY")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")

os.environ.setdefault("PINECONE_API_KEY", "YOUR KEY")



# This is the YouTube video we're going to use. Only one assignment should be
# active at a time; the others are kept as commented-out alternatives.
# YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=cdiD-9MMpb0"  # more than 3 hours long!
# YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=ZmtvsX7aWzo"  # 1 h 48 min, Reiner Knizia, 100 games
# YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=O6C66NFCkJ8"  # 5 min, creator on monetization
# YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=SHNAw80wbUs"  # 10 min, Korean video, hard to get
YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=SBGG4WNweEc"  # Nobel Prize in Physics

# test the model
model.invoke("What MLB team won the World Series during the COVID-19 pandemic?")

Since we only want the text of the answer, let's pipe the model's response through an output parser.
`StrOutputParser` extracts the answer as a plain string.

In [120]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

# Instead of invoking the model alone, pipe it into the parser to get a string back directly:
# query --> model --> response --> parser --> result
chain = model | parser
chain.invoke("What MLB team won the World Series during the COVID-19 pandemic?")

'The Los Angeles Dodgers won the World Series during the COVID-19 pandemic in 2020.'
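
To make the `|` syntax less magical, here is a minimal sketch of how piping two steps together could work. The `Step` class, `fake_model`, and `fake_parser` are hypothetical stand-ins written for this illustration; they are not LangChain's implementation and they call no API.

```python
class Step:
    """A toy chain step: wraps a function and supports the `|` operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stand-ins: a fake "model" that wraps its reply in a dict,
# and a "parser" that pulls the text out of that reply.
fake_model = Step(lambda question: {"content": f"Echo: {question}"})
fake_parser = Step(lambda message: message["content"])

toy_chain = fake_model | fake_parser
print(toy_chain.invoke("Who won?"))  # prints "Echo: Who won?"
```

The real `model | parser` chain follows the same idea: each step's output becomes the next step's input.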

### Introducing prompt templates
We want to provide the model with some context and the question. Prompt templates are a simple way to define and reuse prompts.

In [121]:
from langchain_core.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt.format(context="Mary's sister is Susana", question="Who is Mary's sister?")

'Human: \nAnswer the question based on the context below. If you can\'t \nanswer the question, reply "I don\'t know".\n\nContext: Mary\'s sister is Susana\n\nQuestion: Who is Mary\'s sister?\n'
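
As a rough mental model, filling a prompt template is close to ordinary string formatting. The sketch below uses plain `str.format`; `TEMPLATE` and `render_prompt` are names made up for this illustration. The real `ChatPromptTemplate` also wraps the text in a human message, which is why the output above starts with `Human:`.

```python
TEMPLATE = (
    "Answer the question based on the context below. If you can't\n"
    'answer the question, reply "I don\'t know".\n\n'
    "Context: {context}\n\n"
    "Question: {question}\n"
)


def render_prompt(context: str, question: str) -> str:
    """Fill the two placeholders of the template with concrete values."""
    return TEMPLATE.format(context=context, question=question)


rendered = render_prompt("Mary's sister is Susana", "Who is Mary's sister?")
print(rendered)
```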

We can now chain the prompt with the model and the output parser. The prompt is built from the context and the question (prompt = context + question), the model answers it, and the parser returns the answer as a string.

Let's invoke the chain:

In [122]:
chain = prompt | model | parser
chain.invoke({
    "context": "Mary's sister is Susana",
    "question": "Who is Mary's sister?"
})

'Susana'

### Combining chains
We can combine different chains to create more complex workflows. For example, let's create a second chain that translates the answer from the first chain into a different language.

1) First chain: invoke the first chain to get the parsed response.
2) Second chain: the first chain's answer is inserted into the second chain's prompt, along with the target language.

Let's start by creating a new prompt template for the translation chain:

In [123]:

translation_prompt = ChatPromptTemplate.from_template(
    "Translate {answer} to {language}"
)

We can now create a new translation chain that combines the result from the first chain with the translation prompt.

Here is what the new workflow looks like:

In [124]:
from operator import itemgetter

translation_chain = (
    {"answer": chain, "language": itemgetter("language")} | translation_prompt | model | parser
)

translation_chain.invoke(
    {
        "context": "Mary's sister is Susana. She doesn't have any more siblings.",
        "question": "How many sisters does Mary have?",
        "language": "Spanish",
    }
)

'María tiene una hermana, Susana.'
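
The dictionary at the start of `translation_chain` can be read as "call every value with the same input and keep the keys". Here is a minimal sketch of that data flow in plain Python. `first_chain`, `run_mapping`, and `fill_translation_prompt` are hypothetical helpers for this illustration, and no model is called.

```python
from operator import itemgetter


def first_chain(inputs: dict) -> str:
    # Hypothetical stand-in for `prompt | model | parser`; no API call is made.
    return f"[answer to: {inputs['question']}]"


def run_mapping(mapping: dict, inputs: dict) -> dict:
    """Call every value of the mapping with the same inputs, keeping the keys."""
    return {key: step(inputs) for key, step in mapping.items()}


def fill_translation_prompt(values: dict) -> str:
    # Mirrors the "Translate {answer} to {language}" template.
    return "Translate {answer} to {language}".format(**values)


inputs = {
    "context": "Mary's sister is Susana. She doesn't have any more siblings.",
    "question": "How many sisters does Mary have?",
    "language": "Spanish",
}
values = run_mapping({"answer": first_chain, "language": itemgetter("language")}, inputs)
second_prompt = fill_translation_prompt(values)
print(second_prompt)
```

`itemgetter("language")` simply picks the `language` entry out of the input, so the second prompt receives both the first chain's answer and the target language.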

### Example: Transcribing the YouTube video
The context we want to send the model comes from a YouTube video. Let's download the video and transcribe it using OpenAI's Whisper.

YouTube video ==> text transcript


In [None]:
# Install Whisper with this command: pip install git+https://github.com/openai/whisper.git
# Install ffmpeg: apt install ffmpeg (use sudo if needed)
# pytube raises errors, maybe because a YouTube API change affected it
# (still an issue as of 2024-11), so we use pytubefix instead.
import tempfile
import whisper
from pytubefix import YouTube  # used instead of pytube, which does not work
from pytubefix.cli import on_progress


# Let's do this only if we haven't created the transcription file yet.
# YOUTUBE_VIDEO was already defined in the first cell.
if not os.path.exists("transcription.txt"):
    youtube = YouTube(YOUTUBE_VIDEO, on_progress_callback=on_progress)
    audio = youtube.streams.filter(only_audio=True).first()

    # Let's load the base model. This is not the most accurate
    # model but it's fast.
    whisper_model = whisper.load_model("base")

    with tempfile.TemporaryDirectory() as tmpdir:
        # Download the audio track into the temporary directory.
        audio_path = audio.download(output_path=tmpdir)
        transcription = whisper_model.transcribe(audio_path, fp16=False)["text"].strip()

        # Cache the transcription so later runs can skip this step.
        with open("transcription.txt", "w") as file:
            file.write(transcription)

 ↳ |████████████████████████████████████████████| 100.0%

Print the first 100 characters of the transcription to check that it looks correct:

In [128]:

with open("transcription.txt") as file:
    transcription = file.read()

transcription[:100]

'Välkommen till Kungliga vetenskap Sakademin och när presskonferensen då vi ska presentera årets Nobe'

### Using the entire transcription as context
If we try to invoke the chain using the transcription as context, the model will return an error because the context is too long.

Large language models support limited context sizes. The transcription of the video we are using is too long for the model to handle, so we need to find a different solution.

In [130]:
# This raises an error because the prompt is simply too big!
try:
    chain.invoke({
        "context": transcription,
        "question": "Is reading papers a good idea?"
    })
except Exception as e:
    print(e)
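
Before sending a long text to the model, it can help to estimate whether it will fit. The sketch below uses a rough rule of thumb of about four characters per token, which is an assumption that varies by language and tokenizer. The `context_limit` values are placeholders, not the limit of any particular model, and `estimate_tokens` and `fits_in_context` are hypothetical helpers.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; a real tokenizer will give different numbers."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(text: str, context_limit: int, reserved_for_answer: int = 500) -> bool:
    """Check the estimate against a limit, leaving room for the model's answer."""
    return estimate_tokens(text) + reserved_for_answer <= context_limit


long_text = "word " * 40_000  # 200,000 characters, standing in for a long transcript
print(estimate_tokens(long_text))                        # prints 50001
print(fits_in_context(long_text, context_limit=16_000))  # prints False
```

For an exact count you would use the tokenizer that matches your model instead of this heuristic.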

### Splitting the transcription
Since we can't use the entire transcription as the context for the model, a potential solution is to split the transcription into smaller chunks. We can then invoke the model using only the chunks that are relevant to a particular question.

Let's start by loading the transcription in memory:

In [131]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("transcription.txt")
text_documents = loader.load()
text_documents

[Document(metadata={'source': 'transcription.txt'}, page_content="Välkommen till Kungliga vetenskap Sakademin och när presskonferensen då vi ska presentera årets Nobelpris i fysisk. Välkomna till presskonferens och Royal Swedish Academy of Sciences. Vad vi vill presentar är Nobelpris i fysisk. Vi vill kippa till vår tradition och börja med presentation i Swedish och då kan vi gå in i English. Och du är välkomna att få en fråga i en länge som ska gå upp. Jag heter Hans Ellegren och är ständig sektera här på Kungliga vetenskapsakademin. Till höger om mig sitter professor Ellen Wons, ordförande i Nobelkomitén för fysisk. Mot i vänster sitter professor Anders Irbeck, leda mot de välkomitén för fysisk och expert inom älnesområdet. My name is Hans Ellegren, under Secretary General of the Royal Swedish Academy of Sciences. And to my right is professor Ellen Wons, chair of the Nobelkomitén för fysisk. And to my left is professor Anders Irbeck, member of the Nobelkomitén för fysisk och expert i

There are many different ways to split a document. For this example, we'll use a simple splitter that splits the document into chunks of a fixed size. See LangChain's Text Splitters documentation for more information about different approaches to splitting documents.

For illustration purposes, let's split the transcription into chunks of 100 characters with an overlap of 20 characters and display the first few chunks:

In [132]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text_splitter.split_documents(text_documents)[:5]

[Document(metadata={'source': 'transcription.txt'}, page_content='Välkommen till Kungliga vetenskap Sakademin och när presskonferensen då vi ska presentera årets'),
 Document(metadata={'source': 'transcription.txt'}, page_content='presentera årets Nobelpris i fysisk. Välkomna till presskonferens och Royal Swedish Academy of'),
 Document(metadata={'source': 'transcription.txt'}, page_content='Swedish Academy of Sciences. Vad vi vill presentar är Nobelpris i fysisk. Vi vill kippa till vår'),
 Document(metadata={'source': 'transcription.txt'}, page_content='vill kippa till vår tradition och börja med presentation i Swedish och då kan vi gå in i English.'),
 Document(metadata={'source': 'transcription.txt'}, page_content='vi gå in i English. Och du är välkomna att få en fråga i en länge som ska gå upp. Jag heter Hans')]
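
To see what chunk size and overlap mean, here is a minimal sketch of a plain sliding-window splitter. It is not `RecursiveCharacterTextSplitter`: that splitter prefers to break on separators such as spaces, which is why the chunks above end on word boundaries and are a bit shorter than 100 characters. `split_fixed` is a hypothetical helper that cuts at exact character positions.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample_text = "abcdefghij" * 25  # 250 characters
chunks = split_fixed(sample_text, chunk_size=100, chunk_overlap=20)
print([len(chunk) for chunk in chunks])  # prints [100, 100, 90]
print(chunks[0][-20:] == chunks[1][:20])  # prints True: neighbours share 20 characters
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one of the two neighbouring chunks, as long as it is shorter than the overlap.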

For our specific application, let's use 1000 characters instead:

In [133]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
documents = text_splitter.split_documents(text_documents)

### Finding the relevant chunks
Given a particular question, we need to find the relevant chunks from the transcription to send to the model. Here is where the idea of embeddings comes into play.

An embedding is a mathematical representation of the semantic meaning of a word, sentence, or document. It's a projection of a concept into a high-dimensional space. Embeddings have a simple characteristic: the projections of related concepts lie close to each other, while concepts with different meanings lie far apart. You can use Cohere's Embed Playground to visualize embeddings in two dimensions.

To provide the model with the most relevant chunks, we compute the similarity between the embedding of the question and the embedding of each chunk of the transcription. We can then select the chunks with the highest similarity to the question and use them as the context for the model:

In [134]:
# embeddings from an arbitrary query
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
embedded_query = embeddings.embed_query("Who is Mary's sister?")

print(f"Embedding length: {len(embedded_query)}")
print(embedded_query[:10])

Embedding length: 1536
[-0.0013731546932831407, -0.034482136368751526, -0.011498215608298779, 0.0012331805191934109, -0.0261743925511837, 0.009077209047973156, -0.015739668160676956, 0.0017250451492145658, -0.011854797601699829, -0.03325599059462547]


To illustrate how embeddings work, let's first generate the embeddings for two different sentences:

In [135]:
sentence1 = embeddings.embed_query("Mary's sister is Susana")
sentence2 = embeddings.embed_query("Pedro's mother is a teacher")

We can now compute the similarity between the query and each of the two sentences. The closer the embeddings are, the more similar the sentences will be.

We can use Cosine Similarity to calculate the similarity between the query and each of the sentences:

In [136]:
from sklearn.metrics.pairwise import cosine_similarity

query_sentence1_similarity = cosine_similarity([embedded_query], [sentence1])[0][0]
query_sentence2_similarity = cosine_similarity([embedded_query], [sentence2])[0][0]

query_sentence1_similarity, query_sentence2_similarity

(0.9173394718348284, 0.7680513114191245)
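
Cosine similarity is the dot product of two vectors divided by the product of their lengths. As a small sketch of what `cosine_similarity` computes for a single pair, here is a pure-Python version; `cosine` is a hypothetical helper, and the short vectors are made-up examples rather than real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction: prints 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # perpendicular: prints 0.0
print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction, different length: about 1.0
```

Because the result depends only on direction, not on vector length, a score closer to 1 means the two texts point the same way in embedding space.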

### Setting up a Vector Store
We need an efficient way to store document chunks and their embeddings, and to perform similarity searches at scale. To do this, we'll use a vector store.

A vector store is a database of embeddings that specializes in fast similarity searches.

To understand how a vector store works, let's create one in memory and add a few embeddings to it:

In [137]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore1 = DocArrayInMemorySearch.from_texts(
    [
        "Mary's sister is Susana",
        "John and Tommy are brothers",
        "Patricia likes white cars",
        "Pedro's mother is a teacher",
        "Lucia drives an Audi",
        "Mary has two siblings",
    ],
    embedding=embeddings,
)

We can now query the vector store to find the most similar embeddings to a given query:

In [138]:

vectorstore1.similarity_search_with_score(query="Who is Mary's sister?", k=3)

[(Document(metadata={}, page_content="Mary's sister is Susana"),
  0.9173394801008797),
 (Document(metadata={}, page_content='Mary has two siblings'),
  0.9044728659809682),
 (Document(metadata={}, page_content='John and Tommy are brothers'),
  0.8013463844876163)]
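
Conceptually, a vector store keeps each text next to its embedding and ranks the entries by similarity to the query. The sketch below shows that idea with a brute-force search. `toy_embed` is a deliberately crude stand-in (lowercase word counts) for a real embedding model, and `ToyVectorStore` is a hypothetical class, so the scores will not match the ones above.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: counts of lowercase words.
    return Counter(text.lower().replace("?", "").replace("'s", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self, texts):
        # Embed every text once, at insertion time.
        self.entries = [(text, toy_embed(text)) for text in texts]

    def similarity_search_with_score(self, query, k=3):
        # Brute force: score every entry, then keep the k best.
        query_vector = toy_embed(query)
        scored = [(text, cosine(query_vector, vector)) for text, vector in self.entries]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = ToyVectorStore([
    "Mary's sister is Susana",
    "John and Tommy are brothers",
    "Patricia likes white cars",
    "Pedro's mother is a teacher",
    "Lucia drives an Audi",
    "Mary has two siblings",
])
results = store.similarity_search_with_score("Who is Mary's sister?", k=3)
print(results[0])
```

A real vector store replaces the brute-force loop with an index so that searches stay fast as the number of entries grows.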

### Connecting the vector store to the chain
We can use the vector store to find the most relevant chunks from the transcription to send to the model. Here is how we can connect the vector store to the chain:


We need to configure a Retriever. The retriever will run a similarity search in the vector store and return the most similar documents back to the next step in the chain.

We can get a retriever directly from the vector store we created before:

In [139]:
retriever1 = vectorstore1.as_retriever()
retriever1.invoke("Who is Mary's sister?")

[Document(metadata={}, page_content="Mary's sister is Susana"),
 Document(metadata={}, page_content='Mary has two siblings'),
 Document(metadata={}, page_content='John and Tommy are brothers'),
 Document(metadata={}, page_content="Pedro's mother is a teacher")]

In [140]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

setup = RunnableParallel(context=retriever1, question=RunnablePassthrough())
setup.invoke("What color is Patricia's car?")

{'context': [Document(metadata={}, page_content='Patricia likes white cars'),
  Document(metadata={}, page_content='Lucia drives an Audi'),
  Document(metadata={}, page_content="Pedro's mother is a teacher"),
  Document(metadata={}, page_content="Mary's sister is Susana")],
 'question': "What color is Patricia's car?"}
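
The `setup` step sends the same question down two branches: the retriever turns it into documents, while the passthrough returns it unchanged. Here is a minimal sketch of that behaviour in plain Python. `run_parallel`, `toy_retriever`, and `passthrough` are hypothetical stand-ins, and the toy retriever matches on shared words instead of embeddings.

```python
def run_parallel(branches: dict, value):
    """Feed the same input to every branch and collect the results by key."""
    return {name: branch(value) for name, branch in branches.items()}


def toy_retriever(question: str) -> list[str]:
    # Hypothetical retriever: returns the facts that share a word with the question.
    facts = ["Patricia likes white cars", "Lucia drives an Audi"]
    words = set(question.lower().replace("?", "").split())
    return [fact for fact in facts if words & set(fact.lower().split())]


def passthrough(value):
    # Returns its input untouched, so the question reaches the prompt as-is.
    return value


result = run_parallel(
    {"context": toy_retriever, "question": passthrough},
    "What car does Lucia drive?",
)
print(result)
```

The resulting dictionary has exactly the two keys, `context` and `question`, that the prompt template expects.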

In [141]:
chain = setup | prompt | model | parser
chain.invoke("What color is Patricia's car?")

'White'

In [142]:
chain.invoke("What car does Lucia drive?")

'Lucia drives an Audi.'
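
In the chains above, the retriever's list of documents is placed into the prompt as-is, so the model also sees the object representation and metadata. One optional refinement, which this notebook does not use, is to join only the page text before it reaches the prompt. The `Doc` class below is a minimal stand-in for a retrieved document and `format_docs` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (not LangChain's class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join the text of each document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [Doc("Patricia likes white cars"), Doc("Lucia drives an Audi")]
print(format_docs(docs))
```

Whether this improves answers depends on the model and the data, so treat it as something to try rather than a required step.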

### Loading transcription into the vector store
We initialized the vector store with a few random strings. Let's create a new vector store using the chunks from the video transcription.

In [143]:
vectorstore2 = DocArrayInMemorySearch.from_documents(documents, embeddings)

In [148]:
chain = (
    {"context": vectorstore2.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)
chain.invoke("what was hinton's contribution?")

"Hinton's contribution was creating a learning algorithm for multilayered neural networks, specifically for feedforward structures, which made important contributions to what we now call artificial intelligence (AI)."

### Setting up Pinecone
So far we've used an in-memory vector store. In practice, we need a vector store that can handle large amounts of data and perform similarity searches at scale. For this example, we'll use Pinecone.

The first step is to create a Pinecone account, set up an index, get an API key, and set it as an environment variable PINECONE_API_KEY.
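
Before running the Pinecone cells, it can save time to check that the keys are actually set. Here is a small sketch; `missing_keys` is a hypothetical helper, and it treats the `"YOUR KEY"` placeholder used earlier in this notebook as missing.

```python
import os

PLACEHOLDER = "YOUR KEY"  # the placeholder string used in the first cell


def missing_keys(required, environ=None):
    """Return the names that are unset, empty, or still set to the placeholder."""
    environ = os.environ if environ is None else environ
    return [name for name in required if environ.get(name, "") in ("", PLACEHOLDER)]


missing = missing_keys(["OPENAI_API_KEY", "PINECONE_API_KEY"])
if missing:
    print("Set these environment variables first:", ", ".join(missing))
else:
    print("All keys are set.")
```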

Then, we can load the transcription documents into Pinecone:

Of the two cells below, run the one that uses the LangChain integration.

In [None]:
## This method is not compatible with Langchain
## https://www.whatwant.com/entry/Vector-Database-Pinecone

#import os
#import pinecone

## pinecone api (personal key)
#PINECONE_API_KEY = ''
#PINECONE_ENV = 'gcp-starter'

#pinecone.init(api_key=PINECONE_API_KEY,
#              environment=PINECONE_ENV)


In [None]:

# https://docs.pinecone.io/integrations/langchain
import os
from langchain_pinecone import PineconeVectorStore



# My Pinecone index uses 1536 dimensions to match the OpenAI embedding model;
# the dimension can be selected in Pinecone's browser interface.
index_name = "youtube-rag-index"
# `embeddings` (OpenAIEmbeddings) was already created in an earlier cell.

pinecone = PineconeVectorStore.from_documents(
    documents, embeddings, index_name=index_name
)

In [115]:
pinecone.similarity_search("What is the  problem?")[:3]

[Document(id='c131602f-ad68-4f2f-9bec-dd3ef32b2302', metadata={'source': 'transcription.txt'}, page_content='사람들 뭐 한번 해드볼까 아꺼워 아꺼워 아꺼워 으음 뭐했다고? 아꺼워 뭐 없어? 아꺼워 베리야 아까 한 마디도 못하니 이렇게 시끄러워 어? 왜 이렇게 시끄러워? 베리야 뭐라고? 아꺼워 아꺼워 아꺼워 아꺼워 아꺼워 아꺼워 아꺼워 아꺼워 아꺼워 아꺼워 아꺼워 아꺼워 아꺼워'),
 Document(id='03f74b36-09e9-4bea-8c6c-c977cdfbfff6', metadata={'source': 'transcription.txt'}, page_content='아, 잘하네 아, 잘하네 아, 잘하네 요리 안녕 해봐 서로 등도로 안녕하세요 베렛버입니다 여러분 베리가 말도 잘하고 노래도 잘하고 화도 잘 내고 그런 주만 하시죠 오늘 베리가 행무샤 친구를 만나러 갔는데 맞죠? 한마디도 못하고 잔만 찾아가고 왔니? 그렇게 시끄럽던 베리가 아, 깨서는 꿀먹어 방어리가 됩니다 오늘은 요리에 색다른 모습을 보시게 될 겁니다 몰라지 마세요 아, 봐봐 아, 봐봐 아, 봐봐 아, 봐봐 요리는 요리는 요리는 요리는 그만 그만 그만 그만 그만 그만 그만 그만 하라고 그만 요리는 요리는 요리는 어지러? 어지러 갈거야 요리 요리는 요리는 요리는 요리는 요리는 아, 잘하네 이제 그만해 이제 아이커 아, 이제 시끄러 그만해 이제 그만 시끄러 그만 그래 너 노래 잘해 이제 그만해 그만해 아, 봐봐 꽤가 아, 봐봐 아, 봐봐 아, 봐봐 아, 그래 그만해 노래 잘해 별이 그래 그만 그만 그만 이제 그만해 그만해 그만 별이 노래 잘해 이제 그만해 야 야 야 야 야 그만하라고 그만 시끄럽자고 이제 그만해 그래 너 노래 잘해 그만해 이제 아, 잘하네 아, 봐봐 아, 봐봐 아, 봐봐 아, 봐봐 아, 봐봐 아, 봐봐 아, 봐봐 아, 봐봐 아, 봐봐 아, 봐봐 내 중국에서 온 친구 마음에 들어? 이 친구가 아니

In [117]:
# Build the same chain as before, now retrieving the context from Pinecone.
chain = (
    {"context": pinecone.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)

chain.invoke("What is the problem?")

'The problem is that the speaker is finding the situation noisy and disruptive, with someone repeatedly saying "아꺼워" (akkeowo) which translates to "I\'m cold" or "It\'s annoying" in Korean.'