**Pre-requisites**
1. Install all libraries
2. Preload models (show how).

**Plan**

1. Use a PDF document (e.g. a Berkshire Hathaway financial report)
2. Split using SentenceTransformer
3. Load into MongoDB
4. Search
5. Add a prompt
6. Generate

In [1]:
# !pip install langchain
# !pip install typing-inspect==0.8.0 typing_extensions==4.5.0
# !pip install pypdf

### Pre-load Models

In [2]:
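def preload():
    # Instantiating these once downloads and caches the model weights.
    from langchain.text_splitter import SentenceTransformersTokenTextSplitter
    from sentence_transformers import SentenceTransformer

    s = SentenceTransformersTokenTextSplitter()
    emb = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-cos-v1')

# preload()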

In [92]:
from pymongo import MongoClient
import os
from llama_cpp import Llama
import torch

# https://www.sbert.net/docs/pretrained_models.html#model-overview
# Sentence BERT, based on BERT
from sentence_transformers import SentenceTransformer

# https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.ht
# https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.SentenceTransformersTokenTextSplitter.html
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter, 
    SentenceTransformersTokenTextSplitter
)

from transformers import (
    AutoTokenizer
)

from pypdf import PdfReader

class Object(object):
    pass

In [4]:
t = Object()

## MongoDB Config

In [6]:
t.uri = os.environ["MONGODB_URI"]
# Create a new client and connect to the server
t.client = MongoClient(t.uri)
# Send a ping to confirm a successful connection
try:
    t.client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


In [164]:
t.db = t.client.rag_llama
t.db_name = "rag_llama"

t.coll = t.db.mdb
t.coll_name = "mdb"

## Load and Parse Documents

In [66]:
# t.reader = PdfReader("data/brk-2023-q3.pdf")
# t.reader = PdfReader("data/msft-2022.pdf")
t.reader = PdfReader(f"data/{t.coll_name}-2022.pdf")
t.pages = [p.extract_text().strip() for p in t.reader.pages]

Pages vary in size, so we need to split the text into chunks that fit the embedding model's input window; here we target 256 tokens per chunk.

We join all pages, split the text into chunks of at most 1,024 characters with `RecursiveCharacterTextSplitter`, and then use `SentenceTransformersTokenTextSplitter` to cut any chunk that still exceeds 256 tokens.
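
To make the windowing concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes a plain list of tokens as a stand-in for the tokenizer's word pieces; `chunk_tokens` is a hypothetical helper, not part of LangChain, and the real splitter counts tokens with a SentenceTransformer tokenizer.

```python
def chunk_tokens(tokens, tokens_per_chunk=256, chunk_overlap=10):
    """Split a token list into windows of at most tokens_per_chunk tokens.

    Consecutive windows share chunk_overlap tokens. Illustrative only.
    """
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be in [0, tokens_per_chunk)")
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            break  # the last window already reached the end of the input
        start += step
    return chunks


# 600 distinct placeholder tokens standing in for a tokenized document
tokens = [f"w{i}" for i in range(600)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

The overlap means the last 10 tokens of one chunk reappear at the start of the next, which reduces the chance that a sentence is cut off without any surrounding context.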

In [48]:
# print(t.pages[10])

In [49]:
t.ch_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1024,
    chunk_overlap=0
)
t.ch_chunks = t.ch_splitter.split_text("\n".join(t.pages))
len(t.ch_chunks)

573

In [87]:
t.token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=10, tokens_per_chunk=256)
t.token_chunks = []
for ch in t.ch_chunks:
    t.token_chunks.extend(t.token_splitter.split_text(ch))
len(t.token_chunks)

594

## Embedding Model

In [51]:
t.emb = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-cos-v1')

In [52]:
len(t.emb.encode(t.token_chunks[21]).tolist())

768

## Upload documents

In [53]:
list(t.coll.find().limit(10))

[]

In [54]:
t.docs = []
for t.ch in t.token_chunks:
    t.doc = {
        "text": t.ch,
        "emb": t.emb.encode(t.ch).tolist()
    }
    t.docs.append(t.doc)

In [55]:
_ = t.coll.insert_many(t.docs)

In [None]:
list(t.coll.find().limit(10))

## Query Index

Index definition:

```
{
  "fields": [
    {
      "type": "vector",
      "path": "emb",
      "numDimensions": 768,
      "similarity": "dotProduct"
    }
  ]
}
```
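
The index uses `dotProduct` similarity. For unit-length vectors the dot product equals cosine similarity, so this is a sound choice if the embedding model returns normalized vectors, which the `cos` in the model name suggests but which is worth verifying. A small sketch with made-up 3-dimensional vectors (not real embeddings):

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]
unit_a, unit_b = normalize(a), normalize(b)

# After normalization, the dot product and cosine similarity agree.
print(math.isclose(dot(unit_a, unit_b), cosine(a, b)))
```

On the real model, checking that `dot(v, v)` is close to 1.0 for `v = t.emb.encode(...).tolist()` would confirm the assumption.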

In [72]:
t.query = "What was the total revenue?"

t.results = t.coll.aggregate([{
    "$vectorSearch": {
        "queryVector": t.emb.encode(t.query).tolist(),
        "path": "emb",
        "numCandidates": 100,
        "limit": 4,
        "index": f"{t.coll_name}_vector_index"
    }
}
])
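
The pipeline above returns only the matched documents. A hedged sketch of a helper that builds the same stage and also projects the similarity score follows; `vector_search_pipeline` is a hypothetical name, and the `$project` stage with `{"$meta": "vectorSearchScore"}` is an addition that is not in the original notebook.

```python
def vector_search_pipeline(query_vector, index_name, path="emb",
                           num_candidates=100, limit=4):
    """Build an aggregation pipeline for an Atlas vector search.

    Returns a list of stages; nothing is sent to the database here.
    """
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        # Keep only the chunk text and expose the similarity score.
        {
            "$project": {
                "_id": 0,
                "text": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# Dummy 3-dimensional vector; the notebook would pass a 768-dimensional one.
pipeline = vector_search_pipeline([0.1, 0.2, 0.3], "mdb_vector_index")
print(pipeline[0]["$vectorSearch"]["limit"])
```

In this notebook the call would look like `t.coll.aggregate(vector_search_pipeline(t.emb.encode(t.query).tolist(), f"{t.coll_name}_vector_index"))`. Inspecting the scores helps judge whether low-ranked chunks are relevant enough to include in the context.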

In [73]:
t.context = "\n\n".join([d['text'] for d in t.results])
print(t.context)

, except share and per share data ) years ended january 31, 2023 2022 2021 revenue : subscription $ 1, 235, 122 $ 842, 047 $ 565, 349 services 48, 918 31, 735 25, 031

the following table presents the company ’ s revenues disaggregated by primary geographical markets, subscription product categories and services ( in thousands ) : years ended january 31, 2023 2022 2021 primary geographical markets : americas $ 781, 763 $ 527, 081 $ 361, 351 emea 361, 566 257, 846 177, 448 asia pacific 140, 711 88, 855 51, 581 total $ 1, 284, 040 $ 873, 782 $ 590, 380 subscription product categories and services : mongodb atlas - related $ 808, 263 $ 492, 287 $ 270, 805 other subscription 426, 859 349, 760 294, 544 services 48, 918 31, 735 25, 031 total $ 1, 284, 040 $ 873, 782 $ 590, 380 customers located in the united states accounted for 55 %, 54 % and 56 % of total revenue for the years ended january 31, 2023, 2022 and 2021, respectively. customers located in the united kingdom accounted for 10 % of

## Load Llama

In [74]:
# t.model_path = "../../data"
t.model_path = "../../../../data"
t.llm_path = f"{t.model_path}/llama/llama-2-13b-chat.Q6_K.gguf"
t.layers = 50

# t.llm_path = "../../data/llama/llama-2-13b.Q6_K.gguf"
# t.layers = 50

# t.llm_path = "../../data/llama/llama-2-70b-chat.Q6_K.gguf"
# t.layers = 22

In [28]:
# https://llama-cpp-python.readthedocs.io/en/latest/
t.llm = Llama(
    model_path=t.llm_path,
    n_gpu_layers=t.layers,
    n_threads=10, 
    n_ctx=4096, 
    verbose=False
)

llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ../../../../data/llama/llama-2-13b-chat.Q6_K.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q6_K     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q6_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q6_K     [  5120,  5120,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q6_K     [  5120,  5120,  

....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 3200.00 MiB
llama_new_context_with_model: kv self size  = 3200.00 MiB
llama_build_graph: non-view tensors processed: 924/924
llama_new_context_with_model: compute buffer total size = 361.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 358.00 MiB
llama_new_context_with_model: total VRAM used: 13613.54 MiB (model: 10055.54 MiB, context: 3558.00 MiB)


## Query Llama

In [75]:
def ask(prompt, temp=0.8, top_p=0.95):
    out = t.llm(
        prompt, 
        max_tokens=512, 
        stop=["Q:"], 
        temperature=temp,
        top_p=top_p,
        top_k=10,
        repeat_penalty=1.2,
        stream=True,
    )
    for c in out:
        print(c["choices"][0]["text"], end='')
    print()


Prompt Format:
```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
```
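
A minimal sketch of filling in this template from Python, assuming a single-turn conversation; `format_llama2_prompt` is a hypothetical helper name, and the layout simply mirrors the template shown above.

```python
def format_llama2_prompt(system_prompt, user_message):
    """Fill the single-turn chat template shown above."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


example = format_llama2_prompt(
    "You are a helpful expert financial research assistant.",
    "What was the total revenue?",
)
print(example)
```

Keeping the template in one function avoids small spacing differences between prompts built in different cells.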

### Query with RAG

In [150]:
t.prompt = (
    "<s>[INST] <<SYS>>\n"
    + "You are a helpful expert financial research assistant. "
    + "You answer questions about information contained in a financial report. "
    + "You will be given the user's question and the relevant information from "
    + "the financial report. Answer the question using only this information."
    + "\n<</SYS>>\n\n"
    + "Information: {context}\n"
    + "Question: {question}\n"
    + "Answer:\n"
    + "[/INST]"
)

def ask_with_context(question, context):
    full_prompt = t.prompt
    full_prompt = full_prompt.replace("{context}", context)
    full_prompt = full_prompt.replace("{question}", question)
    ask(full_prompt)

In [151]:
def find_context(question):
    results = t.coll.aggregate([{
        "$vectorSearch": {
            "queryVector": t.emb.encode(question).tolist(),
            "path": "emb",
            "numCandidates": 200,
            "limit": 8,
            "index": f"{t.coll_name}_vector_index"
        }
    }])
    result_texts = [d['text'] for d in results]
    assert len(result_texts) > 0
    context = "\n\n".join(result_texts)
    return context

In [152]:
def ask_with_rag(question):
    context = find_context(question)
    ask_with_context(question, context)

In [153]:
ask_with_rag("What was the total revenue?")

  Based on the information provided, the total revenue for the year ended January 31, 2023 was $1,284.0 million.


In [155]:
ask_with_rag("What was the operating income or loss?")

  Based on the information provided in the financial report, the company had an operating loss of $346.7 million for the year ended January 31, 2023.


In [145]:
ask_with_rag("What was the operating income or loss in year 2022?")

  Based on the information provided in the financial report, the operating loss for year 2022 was $(336,655).


In [144]:
ask_with_rag("Compare the total revenue between the years 2023 and 2022")

  Based on the information provided, the total revenue for the year ended January 31, 2023 was $1,284.0 million, while the total revenue for the year ended January 31, 2022 was $873.8 million. This represents a 47% increase in total revenue from 2022 to 2023.
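
The 47% figure can be checked against the totals in the retrieved context printed earlier ($1,284,040 thousand and $873,782 thousand). A quick arithmetic sketch:

```python
# Totals in thousands of USD, taken from the retrieved context
revenue_fy2023 = 1_284_040  # year ended January 31, 2023
revenue_fy2022 = 873_782    # year ended January 31, 2022

# Year-over-year growth as a fraction of the earlier year's revenue
growth = (revenue_fy2023 - revenue_fy2022) / revenue_fy2022
print(f"{growth:.1%}")
```

The result rounds to 47%, so the model's comparison is consistent with the source figures.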


In [158]:
ask_with_rag("What are the top 5 risks for MongoDB?")

  Based on the information provided in the financial report, the top 5 risks for MongoDB are:

1. Dependence on a single product (MongoDB Atlas) and any potential failure to meet customer demands or adverse impact on our business if it does not satisfy customer needs.
2. Intense competition in the market, which could negatively affect our ability to compete effectively and generate revenue growth.
3. Limited operating history, making it difficult to predict future results of operations and potential risks associated with rapidly changing industries.
4. Security breaches or other security incidents that may expose sensitive data, damage our reputation, and lead to financial losses.
5. Geopolitical risks, including political instability, social unrest, terrorist activities, natural disasters, and outbreaks of contagious diseases, which could negatively impact our operations, growth, and profitability.


### Query Embedded Knowledge

In [159]:
def ask_llm(question):
    prompt = (
        "<s>[INST] <<SYS>>\n"
        + "You are a helpful expert financial research assistant."
        + "\n<</SYS>>\n\n"
        + f"Question: {question}\n"
        + "Answer:\n"
        + "[/INST]"
    )
    ask(prompt)

In [160]:
ask_llm("What was the total revenue of MongoDB in the year 2023?")

  As a helpful expert financial research assistant, I can provide you with the latest available information on MongoDB's financial performance. According to MongoDB's annual report filed with the Securities and Exchange Commission (SEC) for the fiscal year ended January 31, 2023, the company's total revenue was $1,574 million.


## LangChain

In [187]:
# https://python.langchain.com/docs/integrations/vectorstores/mongodb_atlas

from langchain.chains import RetrievalQA
from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_community.embeddings import HuggingFaceEmbeddings

In [174]:
t.lang_emb = HuggingFaceEmbeddings(model_name="multi-qa-mpnet-base-cos-v1")

In [175]:
len(t.lang_emb.embed_documents(['This is a test document'])[0])

768

In [178]:
t.vector_search = MongoDBAtlasVectorSearch.from_connection_string(
    t.uri,
    f"{t.db_name}.{t.coll_name}",
    t.lang_emb,
    index_name=f"{t.coll_name}_vector_index",
)

In [186]:
t.vector_search.similarity_search(
    query="What was the total revenue?"
)

OperationFailure: PlanExecutor error during aggregation :: caused by :: embedding is not indexed as knnVector, full error: {'ok': 0.0, 'errmsg': 'PlanExecutor error during aggregation :: caused by :: embedding is not indexed as knnVector', 'code': 8, 'codeName': 'UnknownError', '$clusterTime': {'clusterTime': Timestamp(1705295715, 8), 'signature': {'hash': b'\x9al\xdc\xda4\xe7\x12\xd4\xdd\x95~\xf6\x83\x01T\x01\x83esk', 'keyId': 7272493978572816386}}, 'operationTime': Timestamp(1705295715, 8)}
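
The error mentions a field named `embedding`, while the documents uploaded earlier store their vectors under `emb`. A likely cause is that the LangChain vector store falls back to its default field names when none are given. Below is a minimal sketch of a possible fix, assuming the installed version of `MongoDBAtlasVectorSearch` accepts `embedding_key` and `text_key` keyword arguments (check the version you have); `vector_store_kwargs` is a hypothetical helper.

```python
def vector_store_kwargs(coll_name, embedding_field="emb", text_field="text"):
    """Keyword arguments that point the vector store at this notebook's fields."""
    return {
        "index_name": f"{coll_name}_vector_index",
        "embedding_key": embedding_field,  # field holding the vectors
        "text_key": text_field,            # field holding the chunk text
    }


kwargs = vector_store_kwargs("mdb")
print(kwargs)

# Intended use in this notebook (requires a live Atlas connection):
# t.vector_search = MongoDBAtlasVectorSearch.from_connection_string(
#     t.uri, f"{t.db_name}.{t.coll_name}", t.lang_emb, **kwargs
# )
```

If the error persists after the field names match, the index type expected by the installed LangChain version is the next thing to check.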