In [None]:
import sys
sys.path.append('..')

# RAG Pipeline

## Structure:

 1. Extracting text from a .pdf document
 2. Chunking 
 3. Embeddings
 4. Vector DB
 5. RAG Query 


In [None]:
# This is the path to the PDF file you want to load
path = "../data/raw/AAPL.pdf"

### 1. Extracting text from a .pdf document

For this we will use the PDFMinerLoader and UnstructuredPDFLoader.

_PDFMinerLoader_:  
This extracts the raw text embedded in the PDF document, if it is available. That is not always the case: a document that was scanned or otherwise saved as an image has no text layer. This is very unlikely for financial documents.

_UnstructuredPDFLoader_:  
In case the text is not available, this uses Optical Character Recognition (OCR) to extract text from an image.

The idea is to run both loaders over the document and then compare the results to ensure that all necessary data is available for analysis.
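
One way to make this comparison concrete is a word-level similarity score between the two extracted texts. The following is only a minimal sketch using the standard library's `difflib`. The helper names `normalize` and `extraction_similarity` and the two sample strings are hypothetical and are not part of LangChain.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into whitespace-separated words."""
    return text.lower().split()


def extraction_similarity(text_a: str, text_b: str) -> float:
    """Return a word-level similarity ratio in [0, 1] between two texts."""
    words_a, words_b = normalize(text_a), normalize(text_b)
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


# Hypothetical stand-ins for the output of the two loaders.
raw_text = "Net sales increased 8 percent\nduring fiscal 2022."
ocr_text = "Net sales increased 8 percent during fiscal 2O22."  # OCR misread "0" as "O"

score = extraction_similarity(raw_text, ocr_text)
print(f"Similarity: {score:.2f}")
```

In this notebook the inputs would be the `page_content` of the loaded documents. A score well below 1 suggests that one of the loaders missed or misread part of the document. Note that `SequenceMatcher` can become slow on very long texts, so comparing page by page may be more practical.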

In [None]:
# Load necessary modules
from langchain.document_loaders import (
    PDFMinerLoader,
    UnstructuredImageLoader,
    UnstructuredPDFLoader
)
from langchain.schema import Document

In [None]:
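# Extract the text layer embedded in the PDF
pdf_loader = PDFMinerLoader(path)
docs_text = pdf_loader.load()
print(f"PDFMinerLoader: loaded {len(docs_text)} documents from {path}")

# Extract the text with UnstructuredPDFLoader, which can fall back to OCR
ocr_loader = UnstructuredPDFLoader(path)
ocr_docs_text = ocr_loader.load()
print(f"UnstructuredPDFLoader: loaded {len(ocr_docs_text)} documents from {path}")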

### Structuring the text
To build an efficient RAG pipeline, we want to split this data into _chunks_ so that when we add them to the LLM prompt we pass only the relevant information and do not waste tokens.

A very basic implementation uses the _RecursiveCharacterTextSplitter_, also from the LangChain module.   
This text splitter takes a string as input and splits it at logical boundaries such as "\n\n" (double newline) and "\n" (single newline), trying the separators in hierarchical order from the most to the least likely semantic boundary.
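
To make the idea of a separator hierarchy concrete, here is a simplified sketch of recursive splitting in plain Python. It is not LangChain's actual implementation: the function `recursive_split` is hypothetical, it ignores `chunk_overlap`, and it drops the separators between chunks.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No meaningful separator left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces caused by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "Revenue grew.\n\nCosts fell.\nMargins improved a lot this year."
chunks = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The paragraph break is tried first, so "Revenue grew." stays intact. Only the long final sentence has to be broken at a space, which is the least meaningful boundary that still avoids cutting a word in half.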

Other options:
- _TokenTextSplitter_ (splits by tokens not characters)
- _NLTKTextSplitter_ or _SpacyTextSplitter_ (splits by sentences)
- Domain-specific splitters (e.g. Markdown / HTML)
- Regex splitters (split the text using regular expressions)
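
As a small illustration of the regex option from the list above, the following sketch splits a text into sentences with the standard `re` module. The function `split_sentences` is a hypothetical helper. Its rule is deliberately naive and would also split after abbreviations such as "Inc." when they are followed by a space.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sample = "Revenue rose. Costs fell! Did margins improve? Yes."
sentences = split_sentences(sample)
print(sentences)
```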


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, separators=["\n\n", "\n", " ", ""])
texts = splitter.split_documents(docs_text)

In [None]:
# First 10 Chunks:
texts[:10]

### Creating embeddings and storing them

Now we have the separate chunks of information that we may want to pass to a NN/LLM.   

To assess which chunks we want to use for a prompt, we need a metric to compare them to the question.   
To do this we create _embeddings_ of these chunks: high-dimensional numerical vectors.   
The Euclidean distance between two embeddings then serves as a measure of how close in meaning the texts are: the smaller the distance, the more similar they are.
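
The distance computation itself is simple. The sketch below uses made-up three-dimensional vectors in place of real embeddings, which have several hundred dimensions. The vectors and the names `question`, `chunk_a` and `chunk_b` are purely illustrative.

```python
import math


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Square root of the summed squared differences per dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy "embeddings": chunk_a points in nearly the same direction as the question.
question = [1.0, 0.0, 1.0]
chunk_a = [0.9, 0.1, 0.8]
chunk_b = [0.0, 1.0, 0.0]

dist_a = euclidean_distance(question, chunk_a)
dist_b = euclidean_distance(question, chunk_b)
print(f"distance to chunk_a: {dist_a:.3f}")
print(f"distance to chunk_b: {dist_b:.3f}")
```

Here `chunk_a` is much closer to the question than `chunk_b`, so it would be the chunk selected for the prompt.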

#### Embeddings

To create the embeddings we will use a pretrained model, _all-MiniLM-L6-v2_, which can be loaded through the _SentenceTransformerEmbeddings_ class from the LangChain module.   
(You could also use OpenAI's embedding model, but this requires an API key.)

#### Vector DB

To compare the different chunks we will store the embeddings in a vector database. One of the most commonly used ones is the open-source FAISS (Facebook AI Similarity Search). It is also available through the LangChain module.
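
Conceptually, a vector database stores vectors together with their texts and returns the entries nearest to a query vector. The toy class below shows that idea with a brute-force search. `ToyVectorStore` is a hypothetical illustration, not how FAISS works internally: FAISS uses optimized index structures to stay fast on large collections.

```python
import math


class ToyVectorStore:
    """Brute-force nearest-neighbour search over (vector, text) pairs."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._items.append((list(vector), text))

    def search(self, query: list[float], k: int = 2) -> list[str]:
        """Return the k texts whose vectors are closest to the query."""
        ranked = sorted(self._items, key=lambda item: math.dist(query, item[0]))
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add([1.0, 0.0], "Revenue by segment")
store.add([0.9, 0.2], "Net sales overview")
store.add([0.0, 1.0], "Board of directors")

results = store.search([1.0, 0.1], k=2)
print(results)
```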

In [None]:
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import FAISS

In [None]:
embeddingsmodel = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(texts, embeddingsmodel)

### LLM

Now that we have the RAG database in place, all we need is the LLM. For this we will use an open-source model from Hugging Face, here Google's _flan-t5-base_.   
The pipeline integration is already built into the LangChain module.

(Of course, you could also use OpenAI's models here.)

In [None]:
from langchain.llms import HuggingFacePipeline
from transformers import pipeline


In [None]:
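# Greedy decoding (do_sample=False) gives deterministic answers; temperature only applies when sampling
llm_pipeline = pipeline("text2text-generation", model="google/flan-t5-base", max_length=512, do_sample=False)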
llm = HuggingFacePipeline(pipeline=llm_pipeline)

### RAG Pipeline

Now that we have the LLM and the vector DB, we can bring it all together in a simple RAG pipeline.    
In the LangChain module, the _RetrievalQA_ class combines the LLM pipeline and the vector database: for each query, it retrieves the chunks closest in meaning to the query and adds them to the prompt.
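
The following sketch shows roughly what "adding the chunks to the prompt" means. The function `build_stuff_prompt` and its template wording are assumptions made for illustration. LangChain's own prompt template is worded differently.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate all retrieved chunks into a single prompt for the LLM."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks, standing in for what a retriever would return.
retrieved = [
    "Apple Inc. designs, manufactures and markets smartphones.",
    "The Company's fiscal year ends in September.",
]
prompt = build_stuff_prompt("What is the name of the company?", retrieved)
print(prompt)
```

Because every retrieved chunk goes into the same prompt, the chunk size and the number of retrieved chunks together have to fit within the model's input limit.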

In [None]:
from langchain.chains import RetrievalQA

In [None]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                  # simplest chain: puts all retrieved chunks into one prompt
    retriever=vectorstore.as_retriever()
)

## QA:

Now that everything is in place, we can ask the network questions about the documents. Try it out!

In [None]:
query = "What is the name of the company?"

In [None]:
answer = qa.run(query)
print(answer)