# Dependencies

In [13]:
from langchain.document_loaders import PyPDFDirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter 
from openai import OpenAI
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from pinecone import Pinecone, ServerlessSpec
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate 
from dotenv import load_dotenv
load_dotenv()
import os
import time

In [14]:
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PINECONE_API_ENV = os.getenv("PINECONE_API_ENV")
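
Since `os.getenv` silently returns `None` for a variable that is not set, a missing key would only surface later as an authentication error. A minimal sketch of an early check is given below; the helper name `require_env` is our own and not part of any library.

```python
import os

def require_env(names):
    """Return the requested environment variables as a dict, raising if any is unset or empty."""
    values = {name: os.getenv(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return values

# Hypothetical usage in this notebook:
# keys = require_env(["OPENAI_API_KEY", "PINECONE_API_KEY"])
```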

# Loading and chunking the Data
Here, we use the `PyPDFDirectoryLoader` from LangChain to load the PDF data so that it can be chunked. The first step is to point the loader at the directory containing the PDF and convert it into a list of documents. Each document corresponds to one page of the PDF and carries metadata with the source file and the page number.

In [10]:
loader = PyPDFDirectoryLoader("data")
data = loader.load()
data

[Document(page_content='W9 \ue088 React\n1W 9  -  R e a c t\nSt ar ting a r e act pr oject loc all y\nThere are various ways to bootstrap a r eact project loc ally. Vite is the most  \nwidely used one t oday.\nV i t e\nRef - https://vite.dev/guide/\nVite \ue081French word for "quick", pronounced\xa0/vit/ , like "veet") is a build t ool that \naims to provide a faster and le aner development e xperience for moder n web \nprojects. It consist s of two major p arts:\nA dev server that pr ovides\xa0 r ich f e atur e enhancement s\xa0over\xa0 nativ e ES  \nmodules , for example e xtremely fast\xa0 Hot Module R eplacement \ue081HMR\ue082 .\nA build command that bundles y our code with\xa0 R ollup, pre-configur ed to output  \nhighly optimiz ed static asset s for production.\nI n t i a l i z i n g  a  r e a c t  p r o j e c t\nnpm create vite@latest\nComponent s\nIn React, component s are the building blocks of the user int erface. The y allow \nyou to split the UI int o independent, r eusabl

Since the context window of our LLM is limited, the ideal way to handle this is to split the data into smaller chunks. This is done by the text splitter, which takes the list of documents and returns a list of chunks within the size limit we set (500 characters in this case). Because each page is loaded as a separate document, a chunk never spans two pages. We also introduce an overlap, which is the number of characters (20 here) repeated at the end of one chunk and the beginning of the next, so that context is not lost at chunk boundaries.
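
To make the role of the overlap concrete, here is a simplified sketch of fixed-size chunking over characters. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its actual chunk boundaries will differ. The function name `sliding_chunks` and the sample string are our own.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=20):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghij" * 3  # 30 characters
chunks = sliding_chunks(sample, chunk_size=12, chunk_overlap=4)

# The last 4 characters of one chunk are the first 4 characters of the next.
for previous, current in zip(chunks, chunks[1:]):
    print(repr(previous[-4:]), repr(current[:4]))
```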

Now, we should be able to do the same for text data as well.

In [55]:
text_loader = TextLoader("data/W5_Code.txt")
text_data = text_loader.load()
text_data

[Document(page_content='from pyspark.sql import SparkSession\nfrom pyspark.sql.functions import last\n\nspark = SparkSession.builder.appName("SCD Type II Merge W5 GA").getOrCreate()\n\nmaster_file_path = "gs://week-5-ga/source_data-w5.csv"\nupdate_file_path = "gs://week-5-ga/update_data-w5.csv"\noutput_file_path = "gs://week-5-ga/master_data-w5.csv"\n\nmaster_df = spark.read.csv(master_file_path, header=True, inferSchema=True)\nupdate_df = spark.read.csv(update_file_path, header=True, inferSchema=True)\n\ncombined_df = master_df.union(update_df)\n\nfinal_df = combined_df.groupBy("Customer ID").agg(\n    last("Name").alias("Name"),\n    last("Address").alias("Address"),\n    last("Membership Start Date").alias("Membership Start Date"),\n    last("Membership End Date").alias("Membership End Date")\n)\n\nfinal_df.show()\nfinal_df.write.csv(output_file_path, header=True)', metadata={'source': 'data/W5_Code.txt'})]

Next, we have to add a `page` entry (it can be 0) to the metadata of this document, as the later steps expect it to be there.

In [56]:
text_data[0].metadata["page"] = 0
text_data

[Document(page_content='from pyspark.sql import SparkSession\nfrom pyspark.sql.functions import last\n\nspark = SparkSession.builder.appName("SCD Type II Merge W5 GA").getOrCreate()\n\nmaster_file_path = "gs://week-5-ga/source_data-w5.csv"\nupdate_file_path = "gs://week-5-ga/update_data-w5.csv"\noutput_file_path = "gs://week-5-ga/master_data-w5.csv"\n\nmaster_df = spark.read.csv(master_file_path, header=True, inferSchema=True)\nupdate_df = spark.read.csv(update_file_path, header=True, inferSchema=True)\n\ncombined_df = master_df.union(update_df)\n\nfinal_df = combined_df.groupBy("Customer ID").agg(\n    last("Name").alias("Name"),\n    last("Address").alias("Address"),\n    last("Membership Start Date").alias("Membership Start Date"),\n    last("Membership End Date").alias("Membership End Date")\n)\n\nfinal_df.show()\nfinal_df.write.csv(output_file_path, header=True)', metadata={'source': 'data/W5_Code.txt', 'page': 0})]

After this, the text data follows exactly the same steps as the PDF data.
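
If several text files are loaded, the same fix can be applied to all of them at once. The sketch below assumes only that each document has a `metadata` dict, and it uses `SimpleNamespace` objects as stand-ins for LangChain `Document` objects; the helper name `ensure_page_metadata` is our own.

```python
from types import SimpleNamespace

def ensure_page_metadata(documents, default_page=0):
    """Add a 'page' entry to any document whose metadata lacks one."""
    for doc in documents:
        doc.metadata.setdefault("page", default_page)
    return documents

# Stand-ins for loaded documents: one from a text file, one from a PDF page.
docs = [
    SimpleNamespace(page_content="plain text", metadata={"source": "data/W5_Code.txt"}),
    SimpleNamespace(page_content="pdf page", metadata={"source": "data/react.pdf", "page": 3}),
]
ensure_page_metadata(docs)
print([doc.metadata["page"] for doc in docs])
```

Documents that already carry a page number are left untouched, so the helper can be applied to PDF and text documents alike.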

In [11]:
text_split = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 20)
text_chunks = text_split.split_documents(data)


text_chunks[:3]

[Document(page_content='W9 \ue088 React\n1W 9  -  R e a c t\nSt ar ting a r e act pr oject loc all y\nThere are various ways to bootstrap a r eact project loc ally. Vite is the most  \nwidely used one t oday.\nV i t e\nRef - https://vite.dev/guide/\nVite \ue081French word for "quick", pronounced\xa0/vit/ , like "veet") is a build t ool that \naims to provide a faster and le aner development e xperience for moder n web \nprojects. It consist s of two major p arts:\nA dev server that pr ovides\xa0 r ich f e atur e enhancement s\xa0over\xa0 nativ e ES', metadata={'source': 'data/react.pdf', 'page': 0}),
 Document(page_content='modules , for example e xtremely fast\xa0 Hot Module R eplacement \ue081HMR\ue082 .\nA build command that bundles y our code with\xa0 R ollup, pre-configur ed to output  \nhighly optimiz ed static asset s for production.\nI n t i a l i z i n g  a  r e a c t  p r o j e c t\nnpm create vite@latest\nComponent s\nIn React, component s are the building blocks of the user

As we can see, each chunk stays within the limit of 500 characters, with an overlap of 20 characters. Let us view the total number of chunks.

In [12]:
print(f"Length of chunks : {len(text_chunks)}")

Length of chunks : 108


# Pinecone Initialization
Now, we will use the Pinecone vector database to store the embeddings of the chunks. We create a client with the `Pinecone` class and our API key, and later use `pc.Index()` to connect to the index created for this project.

In [15]:
pc = Pinecone(api_key = PINECONE_API_KEY, environment = PINECONE_API_ENV)

Now, let us view the indexes available in the Pinecone environment.

In [16]:
pc.list_indexes()

{'indexes': [{'deletion_protection': 'disabled',
              'dimension': 1536,
              'host': 'documents-hjunc2h.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'documents',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'},
              'tags': {'embedding_model': 'text-embedding-3-small'},
              'vector_type': 'dense'}]}

This is the index we will be using to store the documents of this project.
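
Before upserting, it is worth confirming that the index matches the embedding model: `text-embedding-3-small` produces 1536-dimensional vectors by default, which agrees with the `dimension` shown above. The sketch below runs that check on a plain dict copied from the listing; the helper name `check_index` is our own, and we assume the index description has been converted to a dict first.

```python
def check_index(description, expected_dimension=1536, expected_metric="cosine"):
    """Return a list of problems found in an index description (empty if it looks fine)."""
    problems = []
    if description.get("dimension") != expected_dimension:
        problems.append(
            f"dimension is {description.get('dimension')}, expected {expected_dimension}"
        )
    if description.get("metric") != expected_metric:
        problems.append(f"metric is {description.get('metric')}, expected {expected_metric}")
    if not description.get("status", {}).get("ready", False):
        problems.append("index is not ready")
    return problems

# Values copied from the listing output above.
description = {
    "name": "documents",
    "dimension": 1536,
    "metric": "cosine",
    "status": {"ready": True, "state": "Ready"},
}
problems = check_index(description)
print(problems)
```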

# Embedding the Chunks using OpenAI text-embedding-3-small
Here, we will be using the OpenAI text-embedding-3-small model to embed the chunks, for which we need an initialised OpenAI client.

In [17]:
openAI_client = OpenAI(api_key=OPENAI_API_KEY)

Let us go ahead and set the embeddings model and a function to get the embeddings of any given text via the text-embedding-3-small model.

In [18]:
embedding_model = openAI_client.embeddings

def get_embedding(text) :
    response = embedding_model.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding
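
The function above makes one API call per text. The embeddings endpoint also accepts a list of strings as `input`, so several chunks can be embedded in a single request. Below is a sketch of a batched variant; the function name `get_embeddings_batch` and the batch size of 64 are our own assumptions, and the client is passed in as an argument so that the function does not depend on a global.

```python
def get_embeddings_batch(client, texts, model="text-embedding-3-small", batch_size=64):
    """Embed a list of texts with one API call per batch, preserving input order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        # Each returned item carries the index of its input within the batch.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings

# Hypothetical usage in this notebook:
# embeddings = get_embeddings_batch(openAI_client, [c.page_content for c in text_chunks])
```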

Now, each chunk will ideally belong to a document stored in an S3 bucket in AWS. So, for now we will assume that the **KEY** of that document is in the format given below, which we will be using in the metadata. To simplify the process, each entry in the vectorDB should have:

* **ID** : The unique ID of the document, which will be a combination of the document key and the chunk number.
* **VALUES** : The embedding of the chunk, as generated by the OpenAI text-embedding-3-small model.
* **METADATA** : The metadata of the document, which will include the document key and the chunk number.
    *  **USER_ID** : The ID of the user in our system.
    *  **DOCUMENT_TYPE** : The type of the document, either "pdf" or "text".
    *  **KEY** : The key of the document in the S3 bucket, which defines the location of the document.
    *  **CHUNK** : The text of the chunk.
    *  **PAGE_NUMBER** : The page number of the chunk in the document.
    *  **CHUNK_INDEX** : The number of the chunk in the document.

Since most of the information will be given by the backend server, for now we will use dummy values for the metadata.
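
The ID format used later in this notebook is `{KEY}_PAGE_{page}_CHUNK_{chunk}`. A small sketch of building and parsing such IDs is given below; the helper names are our own. Because every ID starts with the document key, all vectors of one document share a common prefix, which is what the prefix-based listing in the deletion section relies on.

```python
def make_vector_id(key, page_number, chunk_index):
    """Build the vector ID from the document key, page number and chunk index."""
    return f"{key}_PAGE_{page_number}_CHUNK_{chunk_index}"

def parse_vector_id(vector_id):
    """Split a vector ID back into (key, page_number, chunk_index)."""
    key, _, rest = vector_id.rpartition("_PAGE_")
    page, _, chunk = rest.partition("_CHUNK_")
    return key, int(page), int(chunk)

vector_id = make_vector_id("61f100abx/pdf/W9-react", 5, 42)
print(vector_id)
print(parse_vector_id(vector_id))
```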

In [19]:
KEY = "61f100abx/pdf/W9-react"
USER_ID = "61f100abx"
DOCUMENT_TYPE = "pdf"

Let us now create the function to create vectors in our desired format as defined above.

In [20]:
def create_vectors(text_chunks,KEY,USER_ID,DOCUMENT_TYPE):
    v = []
    chunk_num = 0
    
    for chunk in text_chunks: 
        page_num = chunk.metadata["page"]
        
        entry = {}
        entry["id"] = f"{KEY}_PAGE_{page_num}_CHUNK_{chunk_num}"
        entry["values"] = get_embedding(chunk.page_content)
        entry["metadata"] = {
            "userID" : USER_ID,
            "type" : DOCUMENT_TYPE,
            "key" : KEY,
            "chunk" : chunk.page_content,
            "page_number" : chunk.metadata["page"],
            "chunk_number" : chunk_num
        }
        
        chunk_num += 1
        v.append(entry)
        
    return v

In [21]:
vectors = create_vectors(text_chunks,KEY,USER_ID,DOCUMENT_TYPE)

With this, we have our vectors stored in the ideal format to be pushed into the vector DB. Let us now push the vectors into the vectorDB of pinecone.

# Pushing the Vectors into the Pinecone Index

In [22]:
index_name = "documents"
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

index = pc.Index(index_name)

index.upsert(
    vectors=vectors
)

{'upserted_count': 108}
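
All 108 vectors fit in a single request here, but Pinecone limits the size of one upsert request, so larger documents are better sent in batches. A minimal sketch follows; the helper names and the batch size of 100 are our own assumptions, and the function only relies on `index.upsert(vectors=...)` as used above.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def upsert_in_batches(index, vectors, batch_size=100):
    """Upsert vectors batch by batch and return how many vectors were sent."""
    sent = 0
    for batch in batched(vectors, batch_size):
        index.upsert(vectors=batch)
        sent += len(batch)
    return sent

# Hypothetical usage in this notebook:
# upsert_in_batches(index, vectors)
```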

# Querying the Vectors
We shall now query the index to check that the vectors have been stored correctly, and to see how retrieval works. For that, we will create a function which takes a text query, converts it into an embedding, and queries the Pinecone index for the stored vectors whose texts are most similar to it. The query is filtered by the user ID and the document key, so that only chunks of the requested document are returned.

In [25]:
def get_relevant_chunks(query,userID,KEY):
    query_vector = get_embedding(query)
    
    results = index.query(
        vector = query_vector,
        top_k = 10,
        include_values = False,
        include_metadata = True,
        filter={
            "userID" : userID,
            "key" : KEY
        }
    )
    
    relevant_texts = []
    for record in results['matches']:
        text = {}
        text['score'] = record['score']
        text['text'] = record['metadata']['chunk']
        text["reference"] = int(record["metadata"]["page_number"]) + 1
        relevant_texts.append(text)
    
    return relevant_texts
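
The `score` returned for each match is the cosine similarity between the query embedding and the stored embedding, since the index was created with the `cosine` metric. To illustrate what the query computes, here is a brute-force sketch in plain Python over a few toy 2-dimensional vectors. Pinecone does not scan every vector like this; the toy records and the function names are our own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, records, k=10):
    """Return the k (score, id) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vector, r["values"]), r["id"]) for r in records]
    scored.sort(reverse=True)
    return scored[:k]

# Toy records standing in for stored chunk embeddings.
records = [
    {"id": "a", "values": [1.0, 0.0]},
    {"id": "b", "values": [0.0, 1.0]},
    {"id": "c", "values": [1.0, 1.0]},
]
matches = top_k([1.0, 0.1], records, k=2)
print(matches)
```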

Finally, we can create a QA system which will take a query and return the most relevant chunks from the PDF document.

In [26]:
import sys 
while True:
    user_input = input(f"Input Prompt: ")
    if user_input=='exit':
        print( 'Exiting')
        sys.exit()
    if user_input == '':
        continue
    
    docs = get_relevant_chunks(user_input,USER_ID,KEY)
        
    for doc in docs:
        print(f"Rank {doc['score']} \n Reference {doc['reference']} \n Answer: \n {doc['text']}")
        print("------------------------")

    print("------------------------------------------------------------------------------------------------------------------------")
        

Rank 0.597881198 
 Reference 6 
 Answer: 
 The useEffect hook let s you perform side ef fects in functional component s in a 
safe, predictable way:
useEffect (() => {
  // Code here is the "effect" — this is where side effects  
happen
  fetchData ();
  // Optionally, return a cleanup function that runs when t
he component unmounts.
  return () => {
    // Cleanup code, e.g., unsubscribing from an event or c
------------------------
Rank 0.555085778 
 Reference 7 
 Answer: 
 Optional Cle anup If your side ef fect needs cle anup (e.g., unsubscr ibing 
from a WebSocket, clearing intervals), useEffect allows you to return a 
function that R eact will c all when the component unmount s or before re-
running the ef fect.
T o  r e c a p
useEffect is a Hook that let s you perform side ef fects in functional component s. 
It can be used f or data fetching, subscr iptions, or manuall y changing the DOM.
L i n k e d i n  l i k e  t o p b a r
------------------------
Rank 0.488263339 
 Referenc

SystemExit: 

With this, our retrieval pipeline is complete and we can now move on to the next step, which is sending these relevant documents to the LLM to answer our query.

# Prompt Template for the LLM
Here, we will need to define the prompt for the LLM to answer the query. The LLM will be given the query and the relevant documents, and it will be expected to return the answer to the query. 

In [27]:
query_prompt_template = """
    You are a specialised AI document analyser working at an edtech startup, and you will be assisting the users to answer their queries. You will be given 
    the top relevant documents and you have to use those to answer the query asked by the user, which will be given to you below. 
    In the relevant documents, you will be given the cosine similarity score, the reference (which is the page number where this 
    text was in the document) and the text itself. You can integrate the reference in your answer to build authenticity of your answer, 
    by precisely writing it like (reference page : page_num)
    
    \n\n User Query : {query}
    \n\n Documents : {documents}
    
    MAKE SURE YOU DO NOT ANSWER FROM ANYTHING APART FROM THE DOCUMENTS GIVEN TO YOU. 
"""

In [28]:
query_prompt = PromptTemplate(
    input_variables=["query","documents"],
    template=query_prompt_template
)
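
In this notebook the retrieved chunks are passed to the template as a list of dicts, so the `{documents}` slot is filled with the Python representation of that list. As an optional alternative, the sketch below renders the chunks as a readable block with the score and the reference page on their own line. The function name `format_documents` is our own; it expects the `score`, `reference` and `text` keys produced by `get_relevant_chunks`.

```python
def format_documents(docs):
    """Render retrieved chunks as numbered text blocks for the prompt."""
    blocks = []
    for rank, doc in enumerate(docs, start=1):
        header = f"[{rank}] score: {doc['score']:.3f} | reference page: {doc['reference']}"
        blocks.append(f"{header}\n{doc['text']}")
    return "\n\n".join(blocks)

# Toy input in the same shape as the output of get_relevant_chunks.
example_docs = [
    {"score": 0.597881198, "reference": 6, "text": "The useEffect hook lets you perform side effects."},
    {"score": 0.555085778, "reference": 7, "text": "Optional cleanup of side effects."},
]
print(format_documents(example_docs))
```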

# Initializing the LLM Client and Chain for RAG Model

In [29]:
chat = ChatOpenAI(
    temperature = 0,  
    model = "gpt-4o",
    openai_api_key = OPENAI_API_KEY
)

In [30]:
query_chain = LLMChain(
    llm=chat,
    prompt=query_prompt
)

# Q&A System using the chain

In [32]:
user_query = "could you explain useState with some examples?"
docs = get_relevant_chunks(user_query,USER_ID,KEY)

# Run the chain
response = query_chain.invoke({
    "query": user_query,
    "documents": docs
})

# Print the response
print("Response from LLM:")
print(response['text'])

Response from LLM:
The `useState` hook is a fundamental part of React that allows you to add state to functional components. It returns an array with two elements: the current state value and a function to update that state. Here's a basic example to illustrate how `useState` works:

1. **Counter Example**:
   ```javascript
   import React, { useState } from 'react';

   const Counter = () => {
     const [count, setCount] = useState(0);

     return (
       <div>
         <p>You clicked {count} times</p>
         <button onClick={() => setCount(count + 1)}>
           Click me
         </button>
       </div>
     );
   };
   ```
   In this example, `useState(0)` initializes the state variable `count` to 0. The `setCount` function is used to update the state whenever the button is clicked, incrementing the count by 1 (reference page: 1).

2. **Posts Example**:
   ```javascript
   import { useState } from "react";
   import { PostComponent } from "./Post";

   function App() {
     co

Here, we have successfully built the RAG model and the Q&A system using the chain. With this, we get the functionality to query the relevant documents and get the answer to the query.

# Deleting via Pinecone (Listing vector IDs with a prefix and then deleting them)

In [None]:
index_name = "documents"


index = pc.Index(index_name)
list1 = []
for ids in index.list(prefix=KEY):
  list1.append(ids)

In [16]:
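KEY = "674313bc5cbbf2d8da2a4649/pdf/"
# index.list() yields batches (lists) of IDs, so flatten them into one list before deleting
ids_to_delete = [vector_id for id_batch in index.list(prefix=KEY) for vector_id in id_batch]
res = index.delete(ids=ids_to_delete)
res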

{}