**Pre-requisites**
1. Install all libraries
2. Preload models (show how).

**Plan**

1. Use PDF document (e.g. Berkshire Hathaway financial report)
2. Split into chunks using the SentenceTransformers token splitter
3. Load into MongoDB
4. Search 
5. Add a prompt
6. Generate

In [1]:
# !pip install langchain
# !pip install typing-inspect==0.8.0 typing_extensions==4.5.0
# !pip install pypdf

### Pre-load Models

In [2]:
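def preload():
    # Instantiating both objects once downloads and caches the model weights.
    # Call this only after running the imports in the next cell.
    s = SentenceTransformersTokenTextSplitter()
    emb = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-cos-v1')
    
# preload()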

In [3]:
from pymongo import MongoClient
import os
from llama_cpp import Llama
import torch

# https://www.sbert.net/docs/pretrained_models.html#model-overview
# Sentence BERT, based on BERT
from sentence_transformers import SentenceTransformer

# https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.ht
# https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.SentenceTransformersTokenTextSplitter.html
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter, 
    SentenceTransformersTokenTextSplitter
)

from pypdf import PdfReader

class Object(object):
    pass

In [4]:
t = Object()

## MongoDB Config

In [5]:
t.uri = os.environ["MONGODB_URI"]
# Create a new client and connect to the server
t.client = MongoClient(t.uri)
# Send a ping to confirm a successful connection
try:
    t.client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


## Load and Parse Documents

In [6]:
# t.reader = PdfReader("data/brk-2023-q3.pdf")
t.reader = PdfReader("data/msft_2022.pdf")
t.pages = [p.extract_text().strip() for p in t.reader.pages]

Pages vary in size, so we need to split the text into chunks that fit into the model window, specifically the 256-token window used for the BERT-based embedding model.

So we'll join all pages, split the text with a recursive character splitter first, and then use the Sentence Transformers token splitter to cut the result into chunks of the right size.
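To make the windowing idea concrete, here is a minimal sketch that cuts a token sequence into fixed-size windows. It uses whitespace-separated words as a stand-in for real model tokens, which is an assumption for illustration only: the splitter used in this notebook counts tokens with the model's own tokenizer. `split_into_windows` is a hypothetical helper, not part of any library.

```python
def split_into_windows(tokens, window=256, overlap=0):
    """Split a token list into consecutive windows of at most `window` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    step = window - overlap  # how far the window start moves each time
    return [tokens[i:i + window] for i in range(0, len(tokens), step)]


# Stand-in "tokens": 600 whitespace-separated words (assumption, not real tokens).
words = ("lorem ipsum " * 300).split()
chunks = split_into_windows(words, window=256, overlap=0)
print([len(c) for c in chunks])
```

Every chunk except possibly the last is exactly `window` tokens long, which is what guarantees that no chunk exceeds the embedding model's input limit.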

In [7]:
# print(t.pages[10])

In [8]:
t.ch_splitter =  RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1024,
    chunk_overlap=0
)
t.ch_chunks = t.ch_splitter.split_text("\n".join(t.pages))
len(t.ch_chunks)

384

In [9]:
t.token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)
t.token_chunks = []
for ch in t.ch_chunks:
    t.token_chunks += t.token_splitter.split_text(ch)
len(t.token_chunks)

386

## Embedding Model

In [10]:
t.emb = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-cos-v1')

In [11]:
len(t.emb.encode(t.token_chunks[21]).tolist())

768

## Upload documents

In [12]:
t.db = t.client.rag_llama
t.coll = t.db.brk

In [13]:
list(t.coll.find().limit(10))

[]

In [14]:
t.docs = []
for t.ch in t.token_chunks:
    t.doc = {
        "text": t.ch,
        "emb": t.emb.encode(t.ch).tolist()
    }
    t.docs.append(t.doc)

In [15]:
_ = t.coll.insert_many(t.docs)

## Query Index

Index definition:

```
{
  "fields": [
    {
      "type": "vector",
      "path": "emb",
      "numDimensions": 768,
      "similarity": "dotProduct"
    }
  ]
}
```
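The index above uses `dotProduct` similarity. For unit-length vectors the dot product equals cosine similarity, so this choice is appropriate only if the stored embeddings are normalized. The `cos-v1` model used here is documented as producing normalized embeddings, but that is worth verifying on your own vectors. The sketch below shows the equivalence with plain Python lists; the helper names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two vectors of arbitrary length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


raw_a, raw_b = [3.0, 4.0], [1.0, 2.0]
unit_a, unit_b = normalize(raw_a), normalize(raw_b)

# After normalization, the plain dot product matches cosine similarity.
print(math.isclose(dot(unit_a, unit_b), cosine(raw_a, raw_b)))
```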

In [42]:
t.query = "What was the total revenue?"

t.results = t.coll.aggregate([{
    "$vectorSearch": {
        "queryVector": t.emb.encode(t.query).tolist(),
        "path": "emb",
        "numCandidates": 100,
        "limit": 8,
        "index": "vector_index"
    }
}
])
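To inspect how relevant each hit is, the pipeline can be extended with a `$project` stage that exposes the similarity score through `{"$meta": "vectorSearchScore"}`. The sketch below only builds the pipeline as plain Python data, so it runs without a database. `build_pipeline` is a hypothetical helper; the field and index names (`emb`, `text`, `vector_index`) are the ones used in this notebook.

```python
def build_pipeline(query_vector, limit=8, num_candidates=100):
    """Build a $vectorSearch aggregation pipeline that also returns scores."""
    return [
        {
            "$vectorSearch": {
                "queryVector": query_vector,
                "path": "emb",
                "numCandidates": num_candidates,
                "limit": limit,
                "index": "vector_index",
            }
        },
        {
            # Keep only the chunk text and attach the similarity score.
            "$project": {
                "_id": 0,
                "text": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# Placeholder vector with the same dimensionality as the embeddings (768).
pipeline = build_pipeline([0.0] * 768)
print(pipeline[1])
```

With a live collection this would be passed as `t.coll.aggregate(build_pipeline(t.emb.encode(t.query).tolist()))`.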

In [None]:
t.context = "\n\n".join([d['text'] for d in t.results])
# print(t.context)

## Load Llama

In [18]:
t.llm_path = "../../data/llama/llama-2-13b-chat.Q6_K.gguf"
t.layers = 50

# t.llm_path = "../../data/llama/llama-2-13b.Q6_K.gguf"
# t.layers = 50

# t.llm_path = "../../data/llama/llama-2-70b-chat.Q6_K.gguf"
# t.layers = 22

In [19]:
# https://llama-cpp-python.readthedocs.io/en/latest/
t.llm = Llama(
    model_path=t.llm_path,
    n_gpu_layers=t.layers,
    n_threads=10, 
    n_ctx=4096, 
    verbose=False
)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ../../data/llama/llama-2-13b-chat.Q6_K.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q6_K     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader:

....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 3200.00 MiB
llama_new_context_with_model: kv self size  = 3200.00 MiB
llama_build_graph: non-view tensors processed: 924/924
llama_new_context_with_model: compute buffer total size = 361.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 358.00 MiB
llama_new_context_with_model: total VRAM used: 13613.54 MiB (model: 10055.54 MiB, context: 3558.00 MiB)
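The `kv self size = 3200.00 MiB` line in the log can be reproduced by hand. The sketch assumes 40 transformer layers and a 16-bit cache (2 bytes per value), neither of which is printed in the log excerpt above; the embedding width of 5120 and `n_ctx = 4096` are taken from the log. `kv_cache_mib` is a hypothetical helper.

```python
def kv_cache_mib(n_layers, n_ctx, n_embd, bytes_per_value=2):
    """Estimate key/value cache size in MiB.

    Each layer stores one key and one value tensor of shape (n_ctx, n_embd).
    """
    total_bytes = 2 * n_layers * n_ctx * n_embd * bytes_per_value
    return total_bytes / (1024 ** 2)


# n_layers=40 and 2 bytes per value are assumptions; the rest is from the log.
print(kv_cache_mib(n_layers=40, n_ctx=4096, n_embd=5120))  # 3200.0
```

The cache grows linearly with `n_ctx`, so halving the context window would roughly halve this part of the VRAM budget.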


## Query Llama

In [20]:
def ask(prompt, temp=0.8, top_p=0.95):
    out = t.llm(
        prompt, 
        max_tokens=512, 
        stop=["Q:"], 
        temperature=temp,
        top_p=top_p,
        top_k=10,
        repeat_penalty=1.2,
        stream=True,
    )
    for c in out:
        print(c["choices"][0]["text"], end='')
    print()


Prompt Format:
```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
```
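The template above can be filled in with a small helper. This is a minimal sketch that follows the template literally, including its spacing; `build_llama2_prompt` is a hypothetical name and not part of `llama_cpp`.

```python
def build_llama2_prompt(system_prompt, user_message):
    """Fill the Llama 2 chat template with a system prompt and a user message."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


prompt = build_llama2_prompt(
    "You are a helpful expert financial research assistant.",
    "What was the total revenue?",
)
print(prompt)
```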

### Query with RAG

In [60]:
def ask_with_context(question, context):
    full_prompt = (
        f"<s>[INST]<<SYS>>\n"
        + f"You are a helpful expert financial research assistant. " 
        + f"You answer questions about information contained in a financial report. "
        + f"You will be given the user's question, and the relevant information from " 
        + f"the financial report. Answer the question using only this information." 
        + f"\n<</SYS>>\n\n"
        + f"Question: {question}\n"
        + f"Information: {context}\n"
        + f"Answer:\n"
        + f"[/INST]"
    )
#     print(full_prompt)
    ask(full_prompt)

In [66]:
def ask_with_rag(question):
    results = t.coll.aggregate([{
        "$vectorSearch": {
            "queryVector": t.emb.encode(question).tolist(),
            "path": "emb",
            "numCandidates": 100,
            "limit": 8,
            "index": "vector_index"
        }
    }])
    context = "\n\n".join([d['text'] for d in results])
#     print(context)
    ask_with_context(question, context)

In [68]:
ask_with_rag("What was the total revenue?")

  Sure! I'm happy to help you with your questions based on the information provided in the financial report. What is your question?


In [70]:
ask_with_rag("What was the operating income?")

  Sure! Based on the information provided in the financial report, I can answer your question as follows:

What was the operating income?

According to the financial report, the operating income for the year ended June 30, 2022 was $83,383 million.


In [71]:
ask_with_rag("What was the operating income in year 2021?")

  Sure! Based on the information provided, I can answer your question.

What was the operating income in year 2021?

According to the financial report, the operating income for year 2021 was $69,916 million.


In [74]:
ask_with_rag("Compare the net income in the years 2022 and 2021")

  Sure! I'd be happy to help you compare the net income in the years 2022 and 2021 based on the information provided.

According to the financial report, the net income for the year ended June 30, 2022 was $72,738 million, while the net income for the year ended June 30, 2021 was $61,271 million. This represents a 19% increase in net income from 2021 to 2022.

Is there anything else you would like to know?
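The 19% figure in the answer above can be checked directly from the two net income values the model quoted. A minimal sketch, with `pct_change` as a hypothetical helper:

```python
def pct_change(old, new):
    """Percentage change from `old` to `new`."""
    return (new - old) / old * 100


# Net income in millions of USD, as quoted in the model's answer above.
net_income_2021 = 61_271
net_income_2022 = 72_738

change = pct_change(net_income_2021, net_income_2022)
print(f"{change:.1f}%")  # 18.7%
```

The result rounds to 19%, so the percentage is consistent with the two figures the model reported.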


### Query Embedded Knowledge

In [76]:
def ask_llm(question):
    prompt = (
        f"<s>[INST]<<SYS>>\n"
        + f"You are a helpful expert financial research assistant." 
        + f"\n<</SYS>>\n\n"
        + f"Question: {question}\n"
        + f"Answer:\n"
        + f"[/INST]"
    )
    ask(prompt)

In [77]:
ask_llm("What was the total revenue of Microsoft for the fiscal year ended June 30, 2022?")

  Sure thing! According to Microsoft's latest annual report filed with the SEC on July 19, 2022, the company's total revenue for the fiscal year ended June 30, 2022 was $254.67 billion.


In [78]:
ask_llm("Compare the net income of Microsoft Corporation in the years 2022 and 2021")

  Sure, I'd be happy to help! According to the latest annual financial statements from Microsoft Corporation, the company reported a net income of $49.3 billion in fiscal year 2022 (ended June 30, 2022), compared to $41.7 billion in fiscal year 2021 (ended June 30, 2021).

So, the net income of Microsoft Corporation increased by approximately $7.6 billion or 18% from 2021 to 2022. This increase was driven by strong revenue growth in the company's Cloud and Artificial Intelligence (AI) businesses, as well as improved profitability in its Personal Computing segment.

I hope this helps! Let me know if you have any other questions or if there's anything else I can help with.
