# Bringing the Power of Large Language Models to Enterprise: Using RAG
## Second Example: Using Verizon 10-K

## Example Case: Financial Statements Analysis Using LLMs Powered by RAG (Retrieval-Augmented Generation) v2
* Suppose we are analysts reviewing financial statements.
* The information is in PDFs or other document types.
* We want a natural-language approach to the analysis, as opposed to a traditional approach.



In [None]:
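# Pinned versions tried earlier (kept for reference):
# !pip install langchain==0.1.14
# !pip install langchain==0.3.21   # tried the newer release, but it did not work
# !pip install langchain-core
# !pip install langchain-experimental==0.0.56
# !pip install langchain-community==0.0.31
!pip install "langchain[all]"

# Packages imported later in this notebook (versions not pinned here)
!pip install -U langchain-openai langchain-community langchain-experimental
!pip install pdfplumber faiss-cpu sentence-transformers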



# Section A: Initial Setup - Load Libraries, Keys...


In [None]:
!python --version
from pprint import pprint # Pretty Print
import os
openai_api_key = os.environ["OPENAI_API_KEY"]

In [None]:
from langchain_openai import OpenAIEmbeddings  # this import was missing

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai_api_key)

In [None]:
review_text = "The product is amazing!"
review_embedding = embeddings.embed_query(review_text)
review_embedding

In [None]:
from langchain_openai import OpenAI

In [None]:
EMBEDDING_MODEL  = "text-embedding-3-small"
GENERATION_MODEL = "gpt-3.5-turbo-instruct"

llm = OpenAI(model=GENERATION_MODEL)  # OpenAI is the class imported from langchain_openai above
embed_model = OpenAIEmbeddings(model=EMBEDDING_MODEL)

# Section B: Start Interacting with LLMs Programmatically...


In [None]:
llm.invoke("What is the Capital of India")

In [None]:
llm.invoke("What is a Quarterly Revenue of Infosys in USD for recent quarter?")

# Section C: Augment with a Trusted Knowledge Source, e.g. 10-K/10-Q in PDF

In [None]:
# pdfFile = "/content/drive/MyDrive/google.pdf"
# pdfFile = "/content/drive/MyDrive/LayoutParser.pdf"
pdfFile = "/content/drive/MyDrive/InfosysPressReleaseQ32024.pdf"

from langchain_community.document_loaders import PDFPlumberLoader
loader = PDFPlumberLoader(pdfFile)
docs = loader.load()

# Check the number of pages
print("Number of pages in the PDF:",len(docs))

# Print the content of the third page (index 2) as a sample
print(docs[2].page_content)

# Section D: Pre-processing of Data for Efficient Querying

## Step D1. Split the document into Chunks
- Why is chunking needed?
   -  Large datasets and context length: a long document may not fit in the model's context window.
   -  Context-relevant responses: short, concise, and focused inputs produce better results than long contexts.

- The SemanticChunker splits text into chunks based on semantic similarity, ensuring that related content stays together in the same chunk.
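
To make the idea concrete, here is a minimal sketch of similarity-based splitting. It does not reproduce SemanticChunker's actual algorithm: the word-count "embedding", the `toy_semantic_chunks` helper, the `0.2` threshold, and the sample sentences are all made up for illustration. A new chunk starts whenever two adjacent sentences are less similar than the threshold.

```python
import math
import re
from collections import Counter

def bow_vector(sentence):
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def toy_semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk when adjacent sentences are less similar than `threshold`."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(bow_vector(prev), bow_vector(curr)) < threshold:
            chunks.append([curr])      # topic shift: open a new chunk
        else:
            chunks[-1].append(curr)    # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]

# Made-up sentences, not taken from any filing
sentences = [
    "Revenue grew in the quarter.",
    "Revenue growth was driven by large deals in the quarter.",
    "The board appointed a new director.",
    "The new director joins the board in April.",
]
chunks = toy_semantic_chunks(sentences)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```

The two revenue sentences end up in one chunk and the two board sentences in another, which is the "related content stays together" behaviour described above.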

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings  # same import path as in Step D2

text_splitter = SemanticChunker(HuggingFaceEmbeddings())
documents = text_splitter.split_documents(docs)

In [None]:
print(len(documents))  # Number of Chunks

In [None]:
# Now look at the third chunk (index 2) after chunking; its content will differ from the page printed earlier
print(documents[2].page_content)

## Step D2. Create embeddings for each text chunk
- Text (unstructured data) is converted to a numeric representation (vectors).
- The vectors are stored in a specialised, purpose-built database: a vector database (here, a FAISS index).
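
As a rough illustration of these two bullets, the sketch below turns text into fixed-length vectors and keeps them in an in-memory list. Everything in it is an assumption made for illustration: `toy_embed` is a hashed word count rather than a learned embedding, and `ToyVectorStore` is a hypothetical stand-in for a real vector store such as FAISS.

```python
import math
import zlib

DIM = 16  # tiny dimension for illustration; real embedding models use far more

def toy_embed(text, dim=DIM):
    """Hash each word into one of `dim` buckets, then L2-normalise the counts."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

class ToyVectorStore:
    """In-memory list of (vector, text) pairs; a stand-in for a vector database."""

    def __init__(self):
        self.rows = []

    def add(self, text):
        self.rows.append((toy_embed(text), text))

store = ToyVectorStore()
for chunk in ["quarterly revenue in usd", "chief financial officer statement"]:
    store.add(chunk)

print(len(store.rows), "vectors of dimension", len(store.rows[0][0]))
```

Each stored row pairs the numeric vector with the original text, so a later similarity search can return readable chunks rather than bare numbers.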

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Instantiate the embedding model
embedder = HuggingFaceEmbeddings()

# Create the vector store
vector = FAISS.from_documents(documents, embedder)

## Step D3. Test with Sample Retrieval of Data from the vector database

In [None]:
# Input
retriever = vector.as_retriever(search_type="similarity", search_kwargs={"k": 3})
retrieved_docs = retriever.invoke("What is name of the CFO?")


In [None]:
# The CFO's name should appear near the end of this chunk of text
print(retrieved_docs[0].page_content)
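
What a similarity retriever with `k=3` does can be sketched in a few lines: score every stored vector against the query vector and keep the three best. This is only an illustration of the idea, not FAISS's implementation. The three-dimensional vectors, the chunk labels, and the `top_k` helper are hypothetical.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, index, k=3):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = ((cosine(query_vec, vec), text) for vec, text in index)
    return heapq.nlargest(k, scored)

# Hypothetical hand-written vectors; a real index would hold model embeddings
index = [
    ([0.9, 0.1, 0.0], "chunk about leadership"),
    ([0.1, 0.9, 0.0], "chunk about revenue"),
    ([0.0, 0.2, 0.9], "chunk about attrition"),
    ([0.8, 0.3, 0.1], "chunk about management commentary"),
]
query_vec = [1.0, 0.0, 0.0]  # pretend this is the embedded question

results = top_k(query_vec, index, k=3)
for score, text in results:
    print(round(score, 3), text)
```

The best-scoring chunk comes first, which is why the cell above inspects `retrieved_docs[0]`.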

# Section E: Augment the LLM with the Enterprise Knowledge Source

In [None]:
from langchain.chains import RetrievalQA
from langchain.chains.llm import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.prompts import PromptTemplate

In [None]:
prompt = """
1. Use the following pieces of context to answer the question at the end.
2. If you don't know the answer, just say "I don't know"; don't make up an answer on your own.
3. Keep the answer crisp and limited to 3-4 sentences.

Context: {context}

Question: {question}

Helpful Answer:"""

In [None]:
QA_CHAIN_PROMPT = PromptTemplate.from_template(prompt)

llm_chain = LLMChain(
                  llm=llm,
                  prompt=QA_CHAIN_PROMPT,
                  callbacks=None,
                  verbose=False)

document_prompt = PromptTemplate(
    input_variables=["page_content", "source"],
    template="Context:\ncontent:{page_content}\nsource:{source}",
)

combine_documents_chain = StuffDocumentsChain(
                  llm_chain=llm_chain,
                  document_variable_name="context",
                  document_prompt=document_prompt,
                  callbacks=None,
              )

In [None]:
qa = RetrievalQA(
                  combine_documents_chain=combine_documents_chain,
                  verbose=False,
                  retriever=retriever,
                  return_source_documents=False,
              )
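
Conceptually, the "stuff" step formats each retrieved chunk with the document prompt, joins the results, and drops them into the `{context}` slot of the question prompt. The sketch below mimics that with plain string formatting. It is an approximation, not LangChain's code: the `stuff_prompt` helper, the `"\n\n"` separator, the shortened QA template, and the placeholder chunks are assumptions.

```python
DOCUMENT_TEMPLATE = "Context:\ncontent:{page_content}\nsource:{source}"

# Shortened version of the QA prompt used in this section
QA_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n\n"
    "Helpful Answer:"
)

def stuff_prompt(chunks, question, separator="\n\n"):
    """Format every chunk, join them, and fill the QA template in one pass."""
    formatted = [DOCUMENT_TEMPLATE.format(**chunk) for chunk in chunks]
    return QA_TEMPLATE.format(context=separator.join(formatted), question=question)

# Placeholder chunks; in the notebook these would come from the retriever
chunks = [
    {"page_content": "placeholder text of chunk 1", "source": "example.pdf"},
    {"page_content": "placeholder text of chunk 2", "source": "example.pdf"},
]
final_prompt = stuff_prompt(chunks, "What is name of the CFO?")
print(final_prompt)
```

Because all retrieved chunks are placed into a single prompt, the number of chunks (`k`) and their size together determine how much of the model's context window is used.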

# Section F: Kick-start Your Natural-Language-Based Analysis and Fire Away!

In [None]:
# Input prompt
# Note: deprecation warnings can be ignored as long as a response is returned
print(qa.invoke({"query": "What is name of the CFO?"})["result"])

In [None]:
pprint(qa.invoke({"query": "What is a Quarterly Revenue of Infosys in USD for recent quarter?"}))

# Section G: Take a Moment to Thank All the Wonderful AI Leaders (Researchers, Industry, etc.) for This Wonderful Pivot!