**Prerequisites**
1. Download the Llama model locally:
  1. https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/tree/main
2. Preload the Sentence Transformer models (run the preload code below)

**Plan**

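1. Load a PDF document (e.g. a financial report)
2. Split it into chunks using the SentenceTransformers token splitter
3. Embed the chunks and load them into MongoDB
4. Run a vector search for the question
5. Add the search results to a prompt
6. Generate an answer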

In [1]:
!pip install langchain
!pip install sentence-transformers
!pip install "pymongo[srv]"
!pip install typing-inspect==0.8.0 typing_extensions==4.5.0
!pip install pypdf

Collecting langchain
  Obtaining dependency information for langchain from https://files.pythonhosted.org/packages/4c/1a/16ad07ffc514944907582cf7a0f9d61cb1165a7b1bb2650e55c8b37aef19/langchain-0.1.3-py3-none-any.whl.metadata
  Downloading langchain-0.1.3-py3-none-any.whl.metadata (13 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Obtaining dependency information for jsonpatch<2.0,>=1.33 from https://files.pythonhosted.org/packages/73/07/02e16ed01e04a374e644b575638ec7987ae846d25ad97bcc9945a3ee4b0e/jsonpatch-1.33-py2.py3-none-any.whl.metadata
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-community<0.1,>=0.0.14 (from langchain)
  Obtaining dependency information for langchain-community<0.1,>=0.0.14 from https://files.pythonhosted.org/packages/5e/fe/772dd89e3d823bb944bc6428674544dd761ac667eda4c13ec94d1ebc3d05/langchain_community-0.0.15-py3-none-any.whl.metadata
  Downloading langchain_community-0.0.15-py3-none-any.whl.metadata (7.6 kB)
Coll

In [2]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python==0.2.25 --force-reinstall --upgrade --no-cache-dir

Collecting llama-cpp-python==0.2.25
  Downloading llama_cpp_python-0.2.25.tar.gz (8.8 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting typing-extensions>=4.5.0 (from llama-cpp-python==0.2.25)
  Obtaining dependency information for typing-extensions>=4.5.0 from https://files.pythonhosted.org/packages/b7/f4/6a90020cd2d93349b442bfcb657d0dc91eee65491600b2cb1d388bc98e6b/typing_extensions-4.9.0-py3-none-any.whl.metadata
  Downloading typing_extensions-4.9.0-py3-none-any.whl.metadata (3.0 kB)
Collecting numpy>=1.20.0 (from llama-cpp-python==0.2.25)
  Obtaining dependency information for numpy>=1.20.0 from https://files.pythonhosted.org/packages/a5/37/d1453c9ff4f76

In [3]:
pip install -U numpy==1.24.1

Collecting numpy==1.24.1
  Downloading numpy-1.24.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.3
    Uninstalling numpy-1.26.3:
      Successfully uninstalled numpy-1.26.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.7 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you hav

### Pre-load Models

In [4]:
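def preload():
    # Instantiating these objects downloads and caches their model files.
    # The names are resolved at call time, so run the Imports cell first.
    s = SentenceTransformersTokenTextSplitter()
    emb = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-cos-v1')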

### Imports

In [5]:
from pymongo import MongoClient
import os
from llama_cpp import Llama
from langchain_community.llms import LlamaCpp
import torch

# https://www.sbert.net/docs/pretrained_models.html#model-overview
# Sentence BERT, based on BERT
from sentence_transformers import SentenceTransformer

# https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.ht
# https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.SentenceTransformersTokenTextSplitter.html
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter, 
    SentenceTransformersTokenTextSplitter
)
from pypdf import PdfReader

import ctypes
from llama_cpp import llama_log_set
def my_log_callback(level, message, user_data):
    pass

log_callback = ctypes.CFUNCTYPE(None, ctypes.c_int, ctypes.c_char_p, ctypes.c_void_p)(my_log_callback)
llama_log_set(log_callback, ctypes.c_void_p())

# We keep all global variables in one object to avoid polluting the global namespace.
class Object(object):
    pass



In [6]:
t = Object()

In [7]:
KAGGLE = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '') != ''

## MongoDB Config

In [8]:
if KAGGLE:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    t.uri = user_secrets.get_secret("MONGODB_URI")
else:
    t.uri = os.environ["MONGODB_URI"]
# Create a new client and connect to the server
t.client = MongoClient(t.uri)
# Send a ping to confirm a successful connection
try:
    t.client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


In [9]:
t.db = t.client.rag_llama
t.coll = t.db.mdb

In [10]:
if KAGGLE:
    !wget https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q6_K.gguf    
    preload()

--2024-01-24 19:47:00--  https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q6_K.gguf
Resolving huggingface.co (huggingface.co)... 65.8.243.16, 65.8.243.46, 65.8.243.92, ...
Connecting to huggingface.co (huggingface.co)|65.8.243.16|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/8d/b1/8db1d1f73b4caa58e947ccbfe2fb27ac5e495c2ad8457ad299d15987aee3b520/5da6e8997c8fbb042d6b981270756e6fc8065e89fde5215b18ee1e93c87dba3f?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27llama-2-13b-chat.Q6_K.gguf%3B+filename%3D%22llama-2-13b-chat.Q6_K.gguf%22%3B&Expires=1706384821&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwNjM4NDgyMX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy84ZC9iMS84ZGIxZDFmNzNiNGNhYTU4ZTk0N2NjYmZlMmZiMjdhYzVlNDk1YzJhZDg0NTdhZDI5OWQxNTk4N2FlZTNiNTIwLzVkYTZlODk5N2M4ZmJiMDQyZDZiOTgxMjcwNzU2ZTZmYzgwNjVlODlmZG

## Llama Config

In [17]:
# t.model_path = "../../data"
if KAGGLE:
    t.llm_path = "/kaggle/working/llama-2-13b-chat.Q6_K.gguf"
    t.layers = 50
else:    
    t.model_path = "../../../../data"
    t.llm_path = f"{t.model_path}/llama/llama-2-13b-chat.Q6_K.gguf"
    t.layers = 50

## Load and Parse Documents

In [18]:
# t.reader = PdfReader("data/brk-2023-q3.pdf")
# t.reader = PdfReader("data/msft-2022.pdf")
if KAGGLE:
    t.reader = PdfReader(f"../input/mdb-pdf/{t.coll.name}-2022.pdf")
else:
    t.reader = PdfReader(f"data/{t.coll.name}-2022.pdf")
t.pages = [p.extract_text().strip() for p in t.reader.pages]

Pages vary in size, so we need to split the text into chunks that fit the embedding model's input window. Here we target 256 tokens per chunk.

We join all pages, split the text by characters first, and then use the SentenceTransformers token splitter to cut the result into chunks of the right token size.

In [19]:
# print(t.pages[10])

In [20]:
t.ch_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1024,
    chunk_overlap=0
)
t.ch_chunks = t.ch_splitter.split_text("\n".join(t.pages))
len(t.ch_chunks)

573

In [21]:
t.token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=10, tokens_per_chunk=256)
t.token_chunks = []
for ch in t.ch_chunks:
    t.token_chunks.extend(t.token_splitter.split_text(ch))
len(t.token_chunks)

594
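
To make the `tokens_per_chunk` and `chunk_overlap` parameters concrete, here is a minimal sketch of sliding-window chunking. It is not LangChain's implementation: the function name `window_chunks` is hypothetical, and whitespace-separated words stand in for the tokens that the real splitter gets from the model's tokenizer.

```python
from typing import List


def window_chunks(tokens: List[str], tokens_per_chunk: int, chunk_overlap: int) -> List[List[str]]:
    """Split a token list into windows of at most tokens_per_chunk tokens.

    Consecutive windows share chunk_overlap tokens.
    """
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be in [0, tokens_per_chunk)")
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + tokens_per_chunk])
        # Stop once the current window reaches the end of the input.
        if start + tokens_per_chunk >= len(tokens):
            break
        start += step
    return chunks


# Assumption: words stand in for real model tokens.
words = "one two three four five six seven eight nine ten".split()
chunks = window_chunks(words, tokens_per_chunk=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `tokens_per_chunk - chunk_overlap` tokens after the previous one, so the last token of one chunk reappears as the first token of the next.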

## Embedding Model

In [22]:
t.emb_model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-cos-v1')

In [23]:
len(t.emb_model.encode(t.token_chunks[21]).tolist())

768

## Upload documents

In [24]:
len(list(t.coll.find().limit(10)))

10

In [26]:
# _ = t.coll.insert_many(t.docs)
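
The commented-out insert above expects `t.docs`, which is not built anywhere in this notebook. A minimal sketch of how such documents might be assembled, assuming each one stores the chunk under `text` and its vector under `embedding` (the field names the queries below read). The helper `build_docs` and the stub encoder are hypothetical; in the notebook the encoder would be `t.emb_model.encode`.

```python
from typing import Callable, Iterable, List, Sequence


def build_docs(chunks: Iterable[str], encode: Callable[[str], Sequence[float]]) -> List[dict]:
    """Pair every text chunk with its embedding as a plain dict ready for insert_many."""
    docs = []
    for chunk in chunks:
        # Convert to a list of Python floats so the vector is BSON-serializable.
        vector = [float(x) for x in encode(chunk)]
        docs.append({"text": chunk, "embedding": vector})
    return docs


# Stub encoder for illustration only; it is NOT a real embedding model.
def fake_encode(text: str) -> Sequence[float]:
    return [len(text), 0.5]


docs = build_docs(["total revenue", "operating loss"], fake_encode)
print(docs[0])

# In the notebook this could look like (assumption):
# t.docs = build_docs(t.token_chunks, t.emb_model.encode)
```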

In [27]:
len(list(t.coll.find().limit(10)))

10

## Query Index

Index definition:

```
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 768,
      "similarity": "dotProduct"
    }
  ]
}
```
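
The index uses `dotProduct` similarity. For unit-length vectors the dot product equals the cosine similarity, so this choice is reasonable if the embedding model returns normalized vectors (the `cos` in the model name suggests it does, but that is an assumption worth verifying on a sample embedding). A small pure-Python sketch of the equivalence:

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


raw_a, raw_b = [3.0, 4.0], [1.0, 2.0]
unit_a, unit_b = normalize(raw_a), normalize(raw_b)

# After normalization the plain dot product matches the cosine of the raw vectors.
print(math.isclose(dot(unit_a, unit_b), cosine(raw_a, raw_b)))
```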

In [28]:
t.query = "What was the total revenue?"

t.results = t.coll.aggregate([{
    "$vectorSearch": {
        "queryVector": t.emb_model.encode(t.query).tolist(),
        "path": "embedding",
        "numCandidates": 100,
        "limit": 8,
        "index": f"{t.coll.name}_vector_index"
    }}])

t.context = "\n\n".join([d['text'] for d in t.results])

In [29]:
print(t.context[0:1000])

operations ( in thousands of u. s. dollars, except share and per share data ) years ended january 31, 2023 2022 2021 revenue : subscription $ 1, 235, 122 $ 842, 047 $ 565, 349 services 48, 918 31, 735 25, 031

the following table presents the company ’ s revenues disaggregated by primary geographical markets, subscription product categories and services ( in thousands ) : years ended january 31, 2023 2022 2021 primary geographical markets : americas $ 781, 763 $ 527, 081 $ 361, 351 emea 361, 566 257, 846 177, 448 asia pacific 140, 711 88, 855 51, 581 total $ 1, 284, 040 $ 873, 782 $ 590, 380 subscription product categories and services : mongodb atlas - related $ 808, 263 $ 492, 287 $ 270, 805 other subscription 426, 859 349, 760 294, 544 services 48, 918 31, 735 25, 031 total $ 1, 284, 040 $ 873, 782 $ 590, 380 customers located in the united states accounted for 55 %, 54 % and 56 % of total revenue for the years ended january 31, 2023, 2022 and 2021, respectively. customers located i
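
`$vectorSearch` runs an approximate nearest-neighbor search: `numCandidates` sets how many candidates are considered and `limit` how many are returned. As a mental model only, the result approximates the exact top-k search sketched below. The function name `top_k_by_dot` and the toy vectors are hypothetical.

```python
from typing import List, Sequence, Tuple


def top_k_by_dot(query: Sequence[float],
                 vectors: Sequence[Sequence[float]],
                 k: int) -> List[Tuple[int, float]]:
    """Score every vector against the query and return the k best (index, score) pairs."""
    scores = [
        (i, sum(q * v for q, v in zip(query, vec)))
        for i, vec in enumerate(vectors)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy 2-dimensional "embeddings"; real ones in this notebook have 768 dimensions.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
hits = top_k_by_dot([1.0, 0.0], vectors, k=2)
print(hits)
```

The exact search scans every stored vector, which is why an approximate index is used for larger collections.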

## Load Llama

In [30]:
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# https://python.langchain.com/docs/guides/local_llms
t.llm = LlamaCpp(
    model_path=t.llm_path,
    n_gpu_layers=t.layers,
    n_threads=10, 
    n_ctx=4096, 
    n_batch=512,
    verbose=False,
    f16_kv=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5
  Device 1: Tesla T4, compute capability 7.5


## Query Llama

In [31]:
def ask(prompt, temp=0.8, top_p=0.95):
    out = t.llm.invoke(
        prompt, 
        max_tokens=512, 
        stop=["Q:"], 
        temperature=temp,
        top_p=top_p,
        top_k=10,
        repeat_penalty=1.2,
    )
    return out

Prompt Format:
```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
```
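
A minimal sketch of filling this template in code. The helper name `format_llama2_prompt` is hypothetical; it only reproduces the layout shown above.

```python
def format_llama2_prompt(system_prompt: str, user_message: str) -> str:
    """Fill the Llama 2 chat template with a system prompt and a single user message."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


prompt = format_llama2_prompt(
    "You are a helpful expert financial research assistant.",
    "What was the total revenue?",
)
print(prompt)
```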

### Query with RAG

In [32]:
def ask_with_context(question, context):
    # The system prompt is built from fragments, so each sentence ends with a space.
    full_prompt = (
        "<s>[INST] <<SYS>>\n"
        + "You are a helpful expert financial research assistant. "
        + "You answer questions about information contained in a financial report. "
        + "You will be given the user's question, and the relevant information from "
        + "the financial report. Answer the question using only this information."
        + "\n<</SYS>>\n\n"
        + "Information: {context}\n"
        + "Question: {question}\n"
        + "Answer:\n"
        + "[/INST]"
    )
    full_prompt = full_prompt.replace("{context}", context)
    full_prompt = full_prompt.replace("{question}", question)
    return ask(full_prompt)

In [33]:
def find_context(question):
    results = t.coll.aggregate([{
        "$vectorSearch": {
            "queryVector": t.emb_model.encode(question).tolist(),
            "path": "embedding",
            "numCandidates": 200,
            "limit": 8,
            "index": f"{t.coll.name}_vector_index"
        }
    }])
    result_texts = [d['text'] for d in results]
    assert len(result_texts) > 0
    context = "\n\n".join(result_texts)
    return context

In [34]:
def ask_with_rag(question):
    context = find_context(question)
    ask_with_context(question, context)

In [35]:
%%time
ask_with_rag("What was the total revenue?")

  Sure! Based on the information provided in the financial report, the total revenue for the year ended January 31, 2023 was $1,284.0 million.

In [36]:
%%time
ask_with_rag("What was the operating income or loss?")

  Based on the information provided in the financial report, the operating loss for the years ended January 31, 2023, 2022 and 2021 was as follows:

Year Ended January 31, 2023:
Operating loss = $ (346,655)

Year Ended January 31, 2022:
Operating loss = $ (289,364)

Year Ended January 31, 2021:
Operating loss = $ (209,304)

In [37]:
%%time
ask_with_rag("What was the operating income or loss in year 2022?")

  Based on the information provided in the financial report, the operating income (loss) for the year ended January 31, 2022 was:

Operating loss: $(302,889)

In [38]:
%%time
ask_with_rag("Compare the total revenue between the years 2023 and 2022")

  Sure! Based on the information provided, the total revenue for the year ended January 31, 2023 was $1,284.0 million, while the total revenue for the year ended January 31, 2022 was $873.8 million. This represents an increase of $410.2 million or 47% from 2022 to 2023.
CPU times: user 1min 36s, sys: 112 ms, total: 1min 36s
Wall time: 1min 36s


In [39]:
%%time
ask_with_rag("What time period does the report cover?")

  Based on the information provided in the report, the period covered is the fiscal year ended January 31, 2023.
CPU times: user 47.3 s, sys: 68.9 ms, total: 47.3 s
Wall time: 47.4 s


In [40]:
%%time
ask_with_rag("Were there any changes to the executive team?")

  Based on the information provided in the financial report, there were no changes to the executive team during the period covered by the report (January 31, 2023). The signature page of the report lists the current members of the executive team, including Dev Ittycheria as President and Chief Executive Officer, Michael Gordon as Chief Operating Officer and Chief Financial Officer, Thomas Bull as Chief Accounting Officer, Tom Killalea as Director, Archana Agrawal as Director, Roelof Botha as Director, Hope Cochran as Director, Francisco D'Souza as Director, and Charles M. Hazard Jr. as Director.
CPU times: user 1min 40s, sys: 132 ms, total: 1min 40s
Wall time: 1min 40s


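### Query Embedded Knowledge

For comparison, we ask the model directly, without any retrieved context, so it can rely only on what it learned during training. Note that the revenue figure it returns below ($1,457 million) does not match the $1,284.0 million found in the report above.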

In [41]:
def ask_llm(question):
    # No retrieved context here: the model answers from its training data only.
    prompt = (
        "<s>[INST] <<SYS>>\n"
        + "You are a helpful expert financial research assistant."
        + "\n<</SYS>>\n\n"
        + f"Question: {question}\n"
        + "Answer:\n"
        + "[/INST]"
    )
    ask(prompt)

In [42]:
%%time
ask_llm("What was the total revenue of MongoDB in the year ended January 31, 2023?")

  As a helpful expert financial research assistant, I can provide you with the information you need. According to MongoDB's latest annual report filed on Form 10-K for the fiscal year ended January 31, 2023, the company's total revenue was $1,457 million.

In [43]:
%%time
ask_llm("Were there any changes to the executive team at MongoDB in the year ended January 31, 2023?")

  As a helpful expert financial research assistant, I can provide you with information on changes to the executive team at MongoDB for the year ended January 31, 2023.

According to MongoDB's annual report filed with the Securities and Exchange Commission (SEC) on February 28, 2023, there were no changes to the executive team during the fiscal year ending January 31, 2023. The Executive Team remained unchanged throughout the period.

The following individuals continue to serve as members of MongoDB's Executive Team:

1. Dev Ittycheria - President and Chief Executive Officer (CEO)
2. Eliot Horowitz - Co-Founder, Chairman of the Board, and Chief Technology Officer (CTO)
3. Dwight Merriman - Co-Founder and Head of Product
4. Kevin P. Mahaffey - Chief Information Security Officer (CISO)
5. Raj R. Rao - Chief Financial Officer (CFO)
6. Sarah A. Watts - General Counsel and Secretary
7. Matt C. Stinchcomb - Senior Vice President, Worldwide Field Operations
8. Michael J. Gordon - Senior Vice P

## LangChain

We'll use LangChain to tie this all together into a simple API.

In [44]:
# https://python.langchain.com/docs/integrations/vectorstores/mongodb_atlas

from langchain.chains import RetrievalQA
from langchain_community.vectorstores import MongoDBAtlasVectorSearch
from langchain_community.embeddings import HuggingFaceEmbeddings

In [45]:
l = Object()
l.llm = t.llm

In [46]:
l.lang_emb = HuggingFaceEmbeddings(model_name="multi-qa-mpnet-base-cos-v1")

Check that the embedding model returns vectors of the expected size, 768, which matches `numDimensions` in the index definition:

In [47]:
len(l.lang_emb.embed_documents(['This is a test document'])[0])

768

In [48]:
l.vector_search = MongoDBAtlasVectorSearch(
    t.coll, 
    l.lang_emb, 
    index_name="mdb_vector_index",
    embedding_key="embedding")

In [49]:
l.results = list(l.vector_search.max_marginal_relevance_search(
    query="What was the total revenue?",
    k = 8,
))

In [50]:
len(l.results)

8

### Make a Retriever Object

In [51]:
l.retriever = l.vector_search.as_retriever(search_kwargs={"k": 8})

### Make the end-to-end chain object

In [52]:
l.qa = RetrievalQA.from_chain_type(
    llm=l.llm, 
    retriever=l.retriever)

### Query LLM with LangChain

In [53]:
%%time
l.qa.invoke("What was the total revenue?")

 Based on the information provided, the total revenue for the year ended January 31, 2023, was $1,284.0 million.

{'query': 'What was the total revenue?',
 'result': ' Based on the information provided, the total revenue for the year ended January 31, 2023, was $1,284.0 million.'}

In [54]:
%%time
l.qa.invoke("What time period does the report cover?")

 Based on the given information, the report covers the fiscal year ended January 31, 2023.

{'query': 'What time period does the report cover?',
 'result': ' Based on the given information, the report covers the fiscal year ended January 31, 2023.'}