# Building a RAG application from scratch

Here is a high-level overview of the system we want to build:

<img src='./images/system1.png' width="1200">

Let's start by loading the environment variables we need to use.

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

# The YouTube video we want to transcribe and query
YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=lB_0hR5s41Y&ab_channel=BeerBiceps"
S3_BUCKET = 'ml-dl-demo-data'

## Setting up the model
Let's define the LLM that we'll use as part of the workflow.

In [2]:
from langchain_community.chat_models import BedrockChat
from langchain_core.messages import HumanMessage

model = BedrockChat(model_id="anthropic.claude-v2", model_kwargs={"temperature": 0.1})

We can test the model by asking a simple question.

In [3]:
messages = [
    HumanMessage(
        content="Who won the ICC Cricket World Cup 2019?"
    )
]

model.invoke(messages)

AIMessage(content="England won the ICC Cricket World Cup 2019 by defeating New Zealand in a thrilling final at Lord's in London. The scores were tied after both teams scored 241 runs in their 50 overs, and the match went to a Super Over where the scores were again tied. England won based on boundary count.")

The result from the model is an `AIMessage` instance containing the answer. We can extract this answer by chaining the model with an [output parser](https://python.langchain.com/docs/modules/model_io/output_parsers/).

Here is what chaining the model with an output parser looks like:

<img src='./images/chain1.png' width="1200">

For this example, we'll use a simple `StrOutputParser` to extract the answer as a string.

In [4]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

chain = model | parser
chain.invoke("Who won the ICC Cricket World Cup 2019?")

"England won the ICC Cricket World Cup 2019 by defeating New Zealand in a thrilling final at Lord's in London. The scores were tied after both teams scored 241 runs in their 50 overs, and the match went to a Super Over where the scores were again tied. England won based on boundary count, as they had hit more boundaries (26 to New Zealand's 17) in the match. It was the first time a World Cup final was decided on boundary count. Overall, it was an incredible match that came down to the finest of margins."

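To make the `|` operator less magical, here is a minimal pure-Python sketch of the idea: each step wraps a function, and `|` builds a new step that feeds one step's output into the next. `Step`, `FakeAIMessage`, and `fake_model` are hypothetical stand-ins invented for this illustration. They are not LangChain classes, and LangChain's real implementation does considerably more.

```python
from dataclasses import dataclass


@dataclass
class FakeAIMessage:
    # Hypothetical stand-in for the AIMessage a chat model returns.
    content: str


class Step:
    """Wraps a function so that steps can be composed with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Build a new step that runs this step first, then the next one.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# A fake "model" that echoes the question, and a parser that extracts the text.
fake_model = Step(lambda question: FakeAIMessage(content=f"Echo: {question}"))
str_parser = Step(lambda message: message.content)

chain = fake_model | str_parser
print(chain.invoke("Hello"))
```

Without the parser step, `fake_model.invoke("Hello")` returns the message object. With it, the chain returns a plain string.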
## Introducing prompt templates

We want to provide the model with some context and the question. [Prompt templates](https://python.langchain.com/docs/modules/model_io/prompts/quick_start) are a simple way to define and reuse prompts.

In [5]:
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt.format(context="Mary's sister is Susana", question="Who is Mary's sister?")

'Human: \nAnswer the question based on the context below. If you can\'t \nanswer the question, reply "I don\'t know".\n\nContext: Mary\'s sister is Susana\n\nQuestion: Who is Mary\'s sister?\n'

We can now chain the prompt with the model and the output parser.

<img src='./images/chain2.png' width="1200">

In [6]:
chain = prompt | model | parser
chain.invoke({
    "context": "Mary's sister is Susana",
    "question": "Who is Mary's sister?"
})

'Susana'

## Combining chains

We can combine different chains to create more complex workflows. For example, let's create a second chain that translates the answer from the first chain into a different language.

Let's start by creating a new prompt template for the translation chain:

In [7]:
translation_prompt = ChatPromptTemplate.from_template(
    "Translate {answer} to {language}"
)

We can now create a new translation chain that combines the result from the first chain with the translation prompt.

Here is what the new workflow looks like:

<img src='./images/chain3.png' width="1200">

In [8]:
from operator import itemgetter

translation_chain = (
    {"answer": chain, "language": itemgetter("language")} | translation_prompt | model | parser
)

translation_chain.invoke(
    {
        "context": "Mary's sister is Susana. She doesn't have any more siblings.",
        "question": "How many sisters does Mary have?",
        "language": "Hindi",
    }
)

'दिए गए संदर्भ के आधार पर, मैरी की एक बहन सुसाना है और उसके और कोई भाई-बहन नहीं हैं। इसलिए, उत्तर यह है कि मैरी की एक बहन है।'
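The dictionary at the start of `translation_chain` builds the inputs for the translation prompt: every value receives the same input dictionary and produces one key of the result. Here is a rough sketch of that step, using a hypothetical `fake_answer_chain` function in place of the real chain so that it runs without a model:

```python
from operator import itemgetter


def fake_answer_chain(inputs):
    # Hypothetical stand-in for `chain`; the real chain would call the model.
    return f"(answer to: {inputs['question']})"


# Each value is a callable that receives the full input dictionary.
steps = {"answer": fake_answer_chain, "language": itemgetter("language")}

inputs = {
    "context": "Mary's sister is Susana.",
    "question": "Who is Mary's sister?",
    "language": "Hindi",
}

# Apply every step to the same input and collect the results by key.
prompt_inputs = {key: step(inputs) for key, step in steps.items()}
print(prompt_inputs)
```

`itemgetter("language")` simply picks the `"language"` entry out of the input, which is how the target language reaches the translation prompt untouched.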

## Transcribing the YouTube Video

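The context we want to send the model comes from a YouTube video. Let's download the video and transcribe it with the `transcribe_video` helper, which takes the S3 bucket name and the video URL. The resulting file follows the JSON output format of [Amazon Transcribe](https://aws.amazon.com/transcribe/).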

In [9]:
from utils import transcribe_video

transcribe_video(s3_bucket_name=S3_BUCKET, youtube_video_url=YOUTUBE_VIDEO)

Transcription file already exists.


Let's read the transcription and display the first few characters to ensure everything works as expected.

In [10]:
import json

with open("transcription.txt", "r") as file:
    transcription = json.loads(file.read())
    transcription = transcription['results']['transcripts'][0]['transcript']

transcription[:100]

"You're a multibillionaire European founder who's moved to Gandhinagar. Yes. Why did you choose Gujar"

## Using the entire transcription as context

If we invoke the chain using the entire transcription as context, the model has to process about 115,000 characters on every call. The request below still goes through, but it takes more than 30 seconds per call.

Large Language Models support limited context sizes, so a longer video would not fit at all, and sending a whole document with every question doesn't scale. We need to find a different solution.

In [11]:
len(transcription)

114926

In [12]:
%%timeit

chain.invoke({"context": transcription,
              "question": "What matters when selecting a location for a business in India ?"
            })

34.7 s ± 521 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
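For a rough sense of scale, we can estimate how many tokens the transcription takes. The sketch below assumes about four characters per token, a common rule of thumb for English text. The real count depends on the model's tokenizer, so treat the result as an order-of-magnitude estimate only.

```python
import math


def estimate_tokens(num_characters, chars_per_token=4.0):
    """Rough token estimate; chars_per_token is an assumed average."""
    return math.ceil(num_characters / chars_per_token)


transcription_length = 114926  # the value of len(transcription) shown above
print(estimate_tokens(transcription_length))
```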


## Splitting the transcription

Since sending the entire transcription with every question isn't practical, a potential solution is to split the transcription into smaller chunks. We can then invoke the model using only the relevant chunks to answer a particular question:

<img src='./images/system2.png' width="1200">

Let's start by loading the transcription in memory:

In [13]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("transcription.txt")
text_documents = loader.load()

# text_documents

There are many different ways to split a document. For this example, we'll use a simple splitter that splits the document into chunks of a fixed size. Check [Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/) for more information about different approaches to splitting documents.

For illustration purposes, let's split the transcription into chunks of 100 characters with an overlap of 20 characters and display the first few chunks:

In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text_splitter.split_documents(text_documents)[:5]

[Document(page_content='{"jobName":"Multi-BillionairesJourneyInIndia-LeadershipCultureAndOpportunityOdooTRS386.mp41711386077', metadata={'source': 'transcription.txt'}),
 Document(page_content='TRS386.mp41711386077","accountId":"507922848584","status":"COMPLETED","results":{"transcripts":[{"tr', metadata={'source': 'transcription.txt'}),
 Document(page_content='{"transcripts":[{"transcript":"You\'re', metadata={'source': 'transcription.txt'}),
 Document(page_content="a multibillionaire European founder who's moved to Gandhinagar. Yes. Why did you choose Gujarat? In", metadata={'source': 'transcription.txt'}),
 Document(page_content='choose Gujarat? In India? We have a ruler to do is we never go to tier one cities. We always go to', metadata={'source': 'transcription.txt'})]

For our specific application, let's use 1000 characters instead:

In [15]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
documents = text_splitter.split_documents(text_documents)
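To see what `chunk_size` and `chunk_overlap` mean, here is a simplified pure-Python sketch of fixed-size splitting with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and spaces); the hypothetical `split_with_overlap` function below only slides a window over the raw characters.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size chunks; neighbors share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghij" * 25  # 250 characters
chunks = split_with_overlap(sample, chunk_size=100, chunk_overlap=20)
print([len(chunk) for chunk in chunks])
print(chunks[0][-20:] == chunks[1][:20])
```

The overlap means the end of one chunk is repeated at the start of the next, which reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.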

## Finding the relevant chunks

Given a particular question, we need to find the relevant chunks from the transcription to send to the model. Here is where the idea of **embeddings** comes into play.

An embedding is a mathematical representation of the semantic meaning of a word, sentence, or document. It's a projection of a concept in a high-dimensional space. Embeddings have a simple characteristic: the projections of related concepts lie close to each other, while concepts with different meanings lie far apart. You can use [Cohere's Embed Playground](https://dashboard.cohere.com/playground/embed) to visualize embeddings in two dimensions.

To provide the model with the most relevant chunks, we can use the embeddings of the question and the chunks of the transcription to compute the similarity between them. We can then select the chunks with the highest similarity to the question and use them as the context for the model:

<img src='./images/system3.png' width="1200">

Let's generate embeddings for an arbitrary query:

In [16]:
from langchain_community.embeddings import BedrockEmbeddings

embeddings = BedrockEmbeddings()
embedded_query = embeddings.embed_query("Berlin is in Germany")

print(f"Embedding length: {len(embedded_query)}")
print(embedded_query[:10])

Embedding length: 1536
[1.2890625, 0.4453125, 0.28320312, 0.3984375, 0.050048828, -0.123046875, 0.58984375, -0.0007247925, -0.23535156, 0.48046875]


To illustrate how embeddings work, let's first generate the embeddings for two different sentences:

In [17]:
sentence1 = embeddings.embed_query("Welcome to Frankfurt")
sentence2 = embeddings.embed_query("This is a table")

We can now compute the similarity between the query and each of the two sentences. The closer the embeddings are, the more similar the sentences will be.

We can use [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to calculate the similarity between the query and each of the sentences:

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

query_sentence1_similarity = cosine_similarity([embedded_query], [sentence1])[0][0]
query_sentence2_similarity = cosine_similarity([embedded_query], [sentence2])[0][0]

query_sentence1_similarity, query_sentence2_similarity

(0.6138958023127881, 0.2699050016319834)
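Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them rather than their magnitude. Here is a small pure-Python sketch of that formula on hand-made vectors; the function name `cosine_similarity_1d` is made up for this example and is not part of scikit-learn.

```python
import math


def cosine_similarity_1d(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity_1d([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity_1d([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity_1d([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Vectors pointing the same way score 1, unrelated (orthogonal) vectors score 0, and opposite vectors score -1, regardless of their lengths.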

## Setting up a Vector Store

We need an efficient way to store document chunks and their embeddings, and to perform similarity searches at scale. To do this, we'll use a **vector store**.

A vector store is a database of embeddings that specializes in fast similarity searches. 

<img src='./images/system4.png' width="1200">

To understand how a vector store works, let's create one backed by Amazon Aurora and add a few texts to it:

### Storing vectors in Amazon Aurora using `pgvector`

<div style="background-color: #f0f8ff; padding: 10px; border-radius: 5px; font-size: 1.1em;">
<b>Prerequisites:</b>
<ol>
    <li>Have an <b>Aurora cluster ready</b>.</li>
    <li>Create the <b>pgvector extension</b> on your Aurora PostgreSQL database (DB) cluster:
        <pre style="font-size: 1.1em;"><code>
        CREATE EXTENSION vector;
        </code></pre>
    </li>
</ol>
</div>


We can connect to the Aurora cluster and check which database we're using and which tables it contains:


```sql
-- SHOW the current database
SELECT current_database();

-- SHOW all the tables in the database
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public';
```

In [19]:
from langchain_community.vectorstores.pgvector import PGVector, DistanceStrategy

# Loading all env variables 
load_dotenv()

COLLECTION_NAME = 'rag-intro-on-aws'

# Connection string
CONNECTION_STRING = PGVector.connection_string_from_db_params(
    driver=os.getenv("PGVECTOR_DRIVER"),
    user=os.getenv("PGVECTOR_USER"),
    password=os.getenv("PGVECTOR_PASSWORD"),
    host=os.getenv("PGVECTOR_HOST"),
    port=os.getenv("PGVECTOR_PORT"),
    database=os.getenv("PGVECTOR_DATABASE"),
)

# Text embedding model
embeddings = BedrockEmbeddings()

# Create the vector store instance
vectorstore1 = PGVector(
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,
    distance_strategy=DistanceStrategy.EUCLIDEAN,
    use_jsonb=True,
)

In [20]:
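vectorstore1.add_texts([
    # Note: there is no comma after the first string, so Python joins it
    # with the second one into a single text (see the search results below).
    "Color of the bird is red"
    "The cat slept by the fire.",
    "We went to the park after school.",
    "I finished my homework early.",
    "The bird sang a beautiful song.",
    "She read a book before bed.",
    "Mary has two siblings",
    "Song was in Spanish",
])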

['c6de01da-ead0-11ee-ae82-3eaa6d286609',
 'c6de0428-ead0-11ee-ae82-3eaa6d286609',
 'c6de0496-ead0-11ee-ae82-3eaa6d286609',
 'c6de04e6-ead0-11ee-ae82-3eaa6d286609',
 'c6de052c-ead0-11ee-ae82-3eaa6d286609',
 'c6de0572-ead0-11ee-ae82-3eaa6d286609',
 'c6de05ae-ead0-11ee-ae82-3eaa6d286609']

We can now query the vector store to find the texts most similar to a given query. Since the store uses the Euclidean distance strategy, a lower score means a closer match:

In [21]:
vectorstore1.similarity_search_with_score(query="What the bird was singing", k=3)

[(Document(page_content='The bird sang a beautiful song.'),
  12.719913663751074),
 (Document(page_content='The bird sang a beautiful song.'),
  12.719913663751074),
 (Document(page_content='Color of the bird is redThe cat slept by the fire.'),
  18.77422507321166)]
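Conceptually, a similarity search ranks stored vectors by their distance to the query vector and returns the closest ones. Here is a toy in-memory sketch of that idea. `TinyVectorStore` is a hypothetical class written for this illustration, the two-dimensional vectors are hand-made rather than real embeddings, and it scans every item on each search, whereas a database such as Aurora with `pgvector` can use indexes to avoid that.

```python
import math


class TinyVectorStore:
    """Toy in-memory store: keeps (text, vector) pairs and searches by brute force."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, list(vector)))

    def search(self, query_vector, k=3):
        # Score every stored vector by Euclidean distance (lower is closer).
        scored = [
            (text, math.dist(query_vector, vector))
            for text, vector in self._items
        ]
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]


store = TinyVectorStore()
# Hand-made 2-D vectors, not real embeddings.
store.add("bird", [0.9, 0.1])
store.add("cat", [0.1, 0.9])
store.add("song", [0.8, 0.3])

for text, distance in store.search([1.0, 0.0], k=2):
    print(f"{text}: {distance:.3f}")
```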

## Connecting the vector store to the chain

We can use the vector store to find the most relevant chunks from the transcription to send to the model. Here is how we can connect the vector store to the chain:

<img src='./images/chain4.png' width="1200">

We need to configure a [Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/). The retriever will run a similarity search in the vector store and return the most similar documents back to the next step in the chain.

We can get a retriever directly from the vector store we created before: 

In [22]:
retriever1 = vectorstore1.as_retriever()
retriever1.invoke("Whats the color of the bird who was singing?")

[Document(page_content='Color of the bird is redThe cat slept by the fire.'),
 Document(page_content='Color of the bird is redThe cat slept by the fire.'),
 Document(page_content='The bird sang a beautiful song.'),
 Document(page_content='The bird sang a beautiful song.')]

Our prompt expects two parameters, "context" and "question." We can use the retriever to find the chunks we'll use as the context to answer the question.

We can create a map with the two inputs by using the [`RunnableParallel`](https://python.langchain.com/docs/expression_language/how_to/map) and [`RunnablePassthrough`](https://python.langchain.com/docs/expression_language/how_to/passthrough) classes. This will allow us to pass the context and question to the prompt as a map with the keys "context" and "question."

In [23]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

setup = RunnableParallel(context=retriever1, question=RunnablePassthrough())
setup.invoke("Whats the color of the bird who was singing?")

{'context': [Document(page_content='Color of the bird is redThe cat slept by the fire.'),
  Document(page_content='Color of the bird is redThe cat slept by the fire.'),
  Document(page_content='The bird sang a beautiful song.'),
  Document(page_content='The bird sang a beautiful song.')],
 'question': 'Whats the color of the bird who was singing?'}
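As a mental model, the setup step calls every branch with the same input and collects the results under the branch names. The sketch below imitates that behavior in plain Python. `run_parallel`, `fake_retriever`, and `passthrough` are hypothetical helpers for illustration only, and unlike this sequential loop, `RunnableParallel` can run its branches concurrently.

```python
def run_parallel(branches, value):
    """Call every branch with the same input and collect the results by key."""
    return {key: branch(value) for key, branch in branches.items()}


def fake_retriever(question):
    # Hypothetical stand-in for retriever1; returns canned text.
    return ["The bird sang a beautiful song."]


def passthrough(value):
    # Returns its input unchanged, mirroring the role RunnablePassthrough plays here.
    return value


result = run_parallel(
    {"context": fake_retriever, "question": passthrough},
    "What did the bird sing?",
)
print(result)
```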

Let's now add the setup map to the chain and run it:



In [24]:
chain = setup | prompt | model | parser
chain.invoke("Whats the color of the bird who was singing?")

'Based on the given context, I don\'t have enough information to determine the color of the bird who was singing. The context mentions a red bird, but does not specify if this red bird is the one singing. Since I cannot definitively answer the question, I must reply "I don\'t know".'

Let's invoke the chain using another example:

In [25]:
chain.invoke("Does Mary have any brothers or sisters?")

'Based on the given context, I can infer that Mary has two siblings. Since siblings refers to brothers and/or sisters, Mary must have at least one brother or sister. Therefore, the answer is yes, Mary has a brother or sister.'
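One detail worth knowing: the retriever returns a list of `Document` objects, and the chain above passes that list to the prompt as is. A common refinement is to join the chunk texts into one plain string before filling the template. The sketch below shows only that formatting step. `FakeDocument` and `format_docs` are hypothetical names for illustration, with `FakeDocument` standing in for LangChain's `Document`.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    # Hypothetical stand-in for LangChain's Document class.
    page_content: str


def format_docs(docs):
    """Join chunk texts with blank lines so the prompt receives plain text."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [FakeDocument("Mary has two siblings"), FakeDocument("Song was in Spanish")]
print(format_docs(docs))
```

Whether this improves the answers depends on the model and the prompt, so it is best treated as something to try rather than a required step.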

## Loading transcription into the vector store

We initialized the vector store with a few random strings. Let's create a new vector store using the chunks from the video transcription.

### Setting up Aurora

In [26]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores.pgvector import PGVector
import os
from dotenv import load_dotenv

# Loading all env variables
load_dotenv()

# Load the text from the file
loader = TextLoader("transcription.txt")
documents = loader.load()

# Split the text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# Initialize the embeddings
embeddings = BedrockEmbeddings()

# Set the collection name
COLLECTION_NAME = "rag-intro-yt"

# Connection String
CONNECTION_STRING = PGVector.connection_string_from_db_params(
    driver=os.getenv("PGVECTOR_DRIVER"),
    user=os.getenv("PGVECTOR_USER"),
    password=os.getenv("PGVECTOR_PASSWORD"),
    host=os.getenv("PGVECTOR_HOST"),
    port=os.getenv("PGVECTOR_PORT"),
    database=os.getenv("PGVECTOR_DATABASE"),
)

# Create the PGVector instance from the documents
db = PGVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    use_jsonb=True,
)

Let's now run a similarity search on the new vector store to make sure everything works:

In [27]:
db.similarity_search("Can you detail the speaker's journey from starting as a coder to becoming a successful entrepreneur, including the pivot in their business model?")[:3]

[Document(page_content="that. You like how you're working, right? The reason is that um uh public companies tend to refocus on the short term, you know, you have to uh publish earning codes with the, the sales number and if the numbers are good, everyone is happy, they buy the shares. If numbers are bad, people are not happy and your employees are frustrated because they have shares and it's bad. So public companies have a tendency to focus on the short term to saves, saves of the moments of the quarter and so on. I don't want that the success of FU is always to build for the long term. And I don't want to see on to look for the short term or the sales number of the quarter. One, the other thing is I'm so much focused on productivity and efficiency. Uh When you get public, you need extra layers of reporting transparency, uh anything like that. Uh So second reason I don't want that I want to be super efficient, decide right away instead of asking the board of directors. Um And third, I 

Let's set up the new chain using the Aurora-backed vector store as the retriever:

In [28]:
chain = (
    {"context": db.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)

response = chain.invoke("What are the main challenges and advantages of doing business in India, including insights on market sensitivity, price, and speed of decision-making?")

In [29]:
print(response)

Based on the context provided, some key challenges and advantages of doing business in India that were discussed include:

Challenges:

- India is a very price sensitive market. Customers focus heavily on getting the lowest price rather than prioritizing quality or service.

- Indian customers often want to do things themselves rather than buying services, in order to save money. This makes it hard to sell additional services.

- Marketing can be difficult as businesses need to build awareness from scratch when entering the Indian market. 

Advantages:

- India is very open to international businesses compared to some other markets like China. Many people speak English and the business environment is friendly.

- Decision making can be faster than other markets like Europe or Africa where sales cycles are longer. Indians want to move quickly.

- There is a big need for growth and modernization that international companies can fill, providing huge opportunities.

- The large population 