In [1]:
import os
from huggingface_hub import hf_hub_download
from pathlib import Path
from time import time



In [2]:
repo_dir = Path('.').absolute().parent


# Download a Hugging Face model & make an Ollama modelfile
* Install the Hugging Face CLI - [Hugging Face Hub](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli)
* Log in to Hugging Face - `huggingface-cli login --token $HUGGINGFACE_TOKEN`
* Verify the login - `huggingface-cli whoami`
* Download an LLM, specifically a GGUF one - [GGUF model download](https://www.youtube.com/watch?v=7BH4C6-HP14)
* Write a modelfile (here `mistral_lite_modelfile`) containing: `FROM ./huggingface_models/mistral-7b-instruct-v0.2.Q4_K_M.gguf`
* Create a model: `ollama create mistrallite -f mistral_lite_modelfile`
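
The download and modelfile steps above can also be scripted. This is a minimal sketch, not the exact procedure used here: `download_gguf` and `write_modelfile` are hypothetical helper names, and the repo id in the usage note is an assumption.

```python
from pathlib import Path


def download_gguf(repo_id: str, filename: str, local_dir: str) -> Path:
    """Download one GGUF file from the Hugging Face Hub (needs network access)."""
    # Imported lazily so write_modelfile works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir))


def write_modelfile(gguf_path: Path, modelfile_path: Path) -> str:
    """Write a one-line Ollama modelfile that points at a local GGUF file."""
    # An absolute path avoids any ambiguity about where the modelfile is read from.
    content = f"FROM {gguf_path.resolve().as_posix()}\n"
    modelfile_path.write_text(content, encoding="utf-8")
    return content
```

Usage would look like `write_modelfile(download_gguf("TheBloke/Mistral-7B-Instruct-v0.2-GGUF", "mistral-7b-instruct-v0.2.Q4_K_M.gguf", "huggingface_models"), Path("mistral_lite_modelfile"))`, where the repo id is assumed and should be replaced with the repository you actually download from. After that, `ollama create mistrallite -f mistral_lite_modelfile` registers the model.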

How the RAG pipeline works:

* We start with the original data source: the PDFs.
* The text is split into small chunks, each chunk is converted into an embedding, and the embeddings are stored in a vector database.
* When we ask a question, the query is converted into an embedding as well.
* Comparing the query embedding with the stored ones lets us fetch the most relevant chunks from the database.
* Those chunks are placed in a prompt together with the question, and the LLM's answer to that prompt is the final response.
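
To make the retrieval step concrete, here is a toy sketch with no external dependencies. The `embed` function is a bag-of-words stand-in for a real embedding model such as `nomic-embed-text`, and the sample chunks are invented for illustration, so only the ranking logic carries over to the real pipeline.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


# Invented sample chunks, for illustration only.
chunks = [
    "net income rose in the third quarter",
    "the bank opened a new branch",
    "provision for credit losses increased",
]
top = retrieve("third quarter net income", chunks, k=1)
print(top)  # ['net income rose in the third quarter']
```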

# Load Docs

In [3]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

DATA_PATH = r'F:\cc_data\SB'

def load_documents():
    document_loader = PyPDFDirectoryLoader(DATA_PATH)
    return document_loader.load()

In [4]:
# Load every PDF page under DATA_PATH and time the load.
start_time = time()

documents = load_documents()
print('\n Time taken: ', time() - start_time)
# documents[0]



 Time taken:  11.895373582839966


* Each document is an object holding the text content of one page of a PDF.
* Its metadata records the page number and the source file that the text came from.

In [5]:
# Extract the page_content from each document
page_contents = [doc.page_content for doc in documents]

# If you want to access the page_content of the first document
first_page_content = page_contents[0]
print(first_page_content)

 
Scotiabank  First  Quarter  Press  Release  2024    1  
 
First  Quarter  2024  Earnings  Release  
 
Scotiabank  reports  first  quarter  results  
 
All amounts  are in Canadian  dollars  and  are based  on our unaudited  Interim  Condensed  Consolidated  Financial  Statements  for the quarter  ended  January  31, 2024  and  
related  notes  prepared  in accordance  with  International  Financial  Reporting  Standards  (IFRS)  as issued  by the International  Accounting  Standards  Board  (IASB),  unless  
otherwise  noted.  Our  complete  First Quarter  2024  Report  to Shareholders,  including  our unaudited  interim  financial  statements  for the period  ended  January  31, 2024,  can 
also  be found  on the SEDAR+  website  at www.sedarplus.ca  and  on the EDG AR section  of the SEC’s  website  at www.sec.gov . Supplementary  Financial  Information  is also  
available,  together  with  the First  Quarter  2024  Report  to Shareholders  on the Investor  Relations  page  at www

## Chunk

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema.document import Document

def split_documents(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=80,
        length_function=len,
        is_separator_regex=False,
    )
    return text_splitter.split_documents(documents)
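
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and lines); `sliding_window_chunks` is a hypothetical helper that only illustrates the two parameters.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size pieces where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


pieces = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(pieces)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each piece repeats the last two characters of the previous one, which is the role the 80-character overlap plays for the 800-character chunks above: text that straddles a boundary still appears whole in at least one chunk.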

In [7]:
start_time = time()

documents = load_documents()
chunks = split_documents(documents)
print(chunks[0])

print('\n Time taken: ', time() - start_time)

page_content='Scotiabank  First  Quarter  Press  Release  2024    1  
 
First  Quarter  2024  Earnings  Release  
 
Scotiabank  reports  first  quarter  results  
 
All amounts  are in Canadian  dollars  and  are based  on our unaudited  Interim  Condensed  Consolidated  Financial  Statements  for the quarter  ended  January  31, 2024  and  
related  notes  prepared  in accordance  with  International  Financial  Reporting  Standards  (IFRS)  as issued  by the International  Accounting  Standards  Board  (IASB),  unless  
otherwise  noted.  Our  complete  First Quarter  2024  Report  to Shareholders,  including  our unaudited  interim  financial  statements  for the period  ended  January  31, 2024,  can' metadata={'source': 'F:\\cc_data\\SB\\Q124_Quarterly_Press_Release-EN.pdf', 'page': 0}

 Time taken:  11.88722825050354


## Chunk IDs

To give every chunk a unique ID, we'll combine the source path, the page number, and the chunk's index within that page.

In [8]:
def calculate_chunk_ids(chunks):

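    # This will create IDs like "SB\Q324_Quarterly_Press_Release-EN.pdf:2:6"
    # Page Source : Page Number : Chunk Index
    # Chunks from the same page must be consecutive for the index to be correct.

    last_page_id = None
    current_chunk_index = 0

    for chunk in chunks:
        source = chunk.metadata.get("source")
        # Keep the path from the 'SB' folder onward; if 'SB' is missing, keep the full path.
        source = source[max(source.find('SB'), 0):]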
        
        page = chunk.metadata.get("page")
        
        current_page_id = f"{source}:{page}"

        # If the page ID is the same as the last one, increment the index.
        if current_page_id == last_page_id:
            current_chunk_index += 1
        else:
            current_chunk_index = 0

        # Calculate the chunk ID.
        chunk_id = f"{current_page_id}:{current_chunk_index}"
        last_page_id = current_page_id

        # Add it to the chunk metadata.
        chunk.metadata["id"] = chunk_id

    return chunks
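
The same ID scheme can be checked on plain metadata dicts. This is a compact sketch: `assign_chunk_ids` is a hypothetical stand-in for `calculate_chunk_ids`, and `SB/report.pdf` is an invented source path.

```python
def assign_chunk_ids(metadatas: list[dict]) -> list[str]:
    """Build 'source:page:index' IDs; the index restarts at 0 on every new page."""
    ids = []
    last_page_id = None
    index = 0
    for meta in metadatas:
        page_id = f"{meta['source']}:{meta['page']}"
        # Same page as the previous chunk: next index. New page: start again at 0.
        index = index + 1 if page_id == last_page_id else 0
        ids.append(f"{page_id}:{index}")
        last_page_id = page_id
    return ids


# Invented metadata: two chunks from page 0, one from page 1.
metadatas = [
    {"source": "SB/report.pdf", "page": 0},
    {"source": "SB/report.pdf", "page": 0},
    {"source": "SB/report.pdf", "page": 1},
]
ids = assign_chunk_ids(metadatas)
print(ids)  # ['SB/report.pdf:0:0', 'SB/report.pdf:0:1', 'SB/report.pdf:1:0']
```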

# Embedding Functions & VectorDB

In [9]:
# Returns the embedding function. It is used in two places:
# first when we create the database,
# and again when we query it. Both must use the same embedding model.

from langchain_community.embeddings.ollama import OllamaEmbeddings
# from langchain_community.embeddings.bedrock import BedrockEmbeddings


def get_embedding_function():
    # embeddings = BedrockEmbeddings(
    #     credentials_profile_name="default", region_name="us-east-1"
    # )
    embeddings = OllamaEmbeddings(model="nomic-embed-text") # if completely local
    return embeddings

In [10]:
CHROMA_PATH = r"F:\cc_data\chroma_SB"


In [11]:
from langchain_community.vectorstores import Chroma

def add_to_chroma(chunks: list[Document]):
    # Load the existing database.
    db = Chroma(
        persist_directory=CHROMA_PATH, embedding_function=get_embedding_function()
    )

    # Calculate chunk IDs.
    chunks_with_ids = calculate_chunk_ids(chunks)

    # Add or Update the documents.
    existing_items = db.get(include=[])  # IDs are always included by default
    existing_ids = set(existing_items["ids"])
    print(f"Number of existing documents in DB: {len(existing_ids)}")

    # Only add documents that don't exist in the DB.
    new_chunks = []
    for chunk in chunks_with_ids:
        if chunk.metadata["id"] not in existing_ids:
            new_chunks.append(chunk)

    if new_chunks:
        print(f"👉 Adding new documents: {len(new_chunks)}")
        new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
        db.add_documents(new_chunks, ids=new_chunk_ids)
        # Deprecated since Chroma 0.4.x, which persists automatically; kept for older versions.
        db.persist()
    else:
        print("✅ No new documents to add")
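
The add-only-what-is-new logic can be tried out without a vector database. In this sketch a plain dict stands in for Chroma, and `add_new_only` is a hypothetical helper; the point is that running the same load twice adds nothing the second time.

```python
def add_new_only(store: dict[str, str], chunks: dict[str, str]) -> list[str]:
    """Insert chunks whose ID is not in the store yet and return the IDs that were added."""
    new_ids = [chunk_id for chunk_id in chunks if chunk_id not in store]
    for chunk_id in new_ids:
        store[chunk_id] = chunks[chunk_id]
    return new_ids


store: dict[str, str] = {}
# Invented chunk IDs and texts, for illustration only.
chunks = {"SB/report.pdf:0:0": "first chunk", "SB/report.pdf:0:1": "second chunk"}

first_run = add_new_only(store, chunks)
second_run = add_new_only(store, chunks)
print(len(first_run), len(second_run))  # 2 0
```

One consequence of this design is that a chunk whose text changed but whose ID stayed the same is not updated; it would have to be removed from the store first.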

In [12]:
start_time = time()
add_to_chroma(chunks)

print('\n Time taken: ', time() - start_time)



Number of existing documents in DB: 0
👉 Adding new documents: 278

 Time taken:  767.7068340778351




In [13]:
import shutil
import os
def clear_database():
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)

In [14]:
# clear_database()

# Running RAG 

In [15]:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

In [16]:
from langchain.prompts import ChatPromptTemplate
from langchain_community.llms.ollama import Ollama

def query_rag(query_text: str):
    # Prepare the DB.
    embedding_function = get_embedding_function()
    db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)

    # Search the DB for the k chunks most similar to the query.
    # With Chroma the returned score is a distance, so lower means more similar.
    results = db.similarity_search_with_score(query_text, k=5)

    context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    prompt = prompt_template.format(context=context_text, question=query_text)
    # print(prompt)

    model = Ollama(model="mistrallite:latest")
    response_text = model.invoke(prompt)

    sources = [doc.metadata.get("id", None) for doc, _score in results]
    formatted_response = f"Response: {response_text}\nSources: {sources}"
    print(formatted_response)
    return response_text, results
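
The prompt assembly inside `query_rag` can be looked at in isolation. This sketch uses plain `str.format` instead of `ChatPromptTemplate`, so the exact string may differ from what LangChain produces; `build_prompt` is a hypothetical helper and the chunk texts are placeholders.

```python
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""


def build_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks with a visible separator and fill in the template."""
    context = "\n\n---\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(["chunk A", "chunk B"], "What changed?")
print(prompt)
```

Printing the prompt like this before sending it to the model is a quick way to check which chunks the answer is actually grounded in.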

In [19]:
start_time = time()

query_text = 'How did quarter 3 compare with quarters 2 and 1 for the company?'
response_text, results = query_rag(query_text)

print('\n Time taken: ', time() - start_time)

Response:  According to the provided context, the company's net income attributable to equity holders increased by $102 million or 10% from Q2 2024 to Q3 2024. This was primarily due to higher revenues, partly offset by higher non-interest expenses and provision for credit losses.

Compared to Q3 2023, the company experienced a net loss of $729 million compared to a net loss of $299 million last year. The higher loss of $166 million was due mainly to lower revenues driven by higher funding costs, partly offset by higher revenue from liquid assets and a lower taxable equivalent basis gross-up as the Bank no longer claims the dividend received deduction on Canadian shares that are mark-to-market property.

Therefore, the company's financial performance improved from Q3 2023 to Q3 2024, despite experiencing a loss in both quarters. However, it should be noted that the comparison between Q3 2024 and Q2 2024 only shows an increase in net income attributable to equity holders, but no informa

In [20]:
results

[(Document(metadata={'id': 'SB\\Q324_Quarterly_Press_Release-EN.pdf:2:6', 'page': 2, 'source': 'F:\\cc_data\\SB\\Q324_Quarterly_Press_Release-EN.pdf'}, page_content='Other   \nQ3 2024  vs Q3 2023   \nNet income  attributable  to equity  holders  was a net loss of $729  million,  compared  to a net loss of $299  million  last year.   Adjusted  net income  \nattributable  to equity  holders  was a net loss of $465  million  compared  to a net loss of $299  million  last year.  The higher  loss of $166  million  was due \nmainly  to lower  revenues  driven  by higher  funding  costs . These  were  partly  offset  by higher  revenue  from  liquid  assets  and a lower  taxable  equivalent  \nbasis  (TEB)  gross -up as the Bank  no longer  claims  the dividend  received  deduction  on Canadian  shares  that  are mark -to-market  property.  The TEB  gross -\nup is offset  in income  taxes .  \nQ3 2024  vs Q2 2024'),
  384.5665222078829),
 (Document(metadata={'id': 'SB\\Q324_Quarterly_Press_Re