

* LLM - Large Language Model  
* Llama 2.0 - LLM from Meta 
* Langchain - a framework designed to simplify the creation of applications using LLMs
* Vector database - a database that organizes data through high-dimensional vectors  
* ChromaDB - vector database  
* RAG - Retrieval Augmented Generation (see below for more details)
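
To make the "vector database" and "retrieval" entries above concrete, here is a minimal sketch in plain Python. It is not how ChromaDB is implemented; the three-dimensional vectors, the texts, and the helper names `cosine_similarity` and `retrieve` are all made up for illustration. The idea is only that each text is stored next to a vector, and a query returns the texts whose vectors point in the most similar direction.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_store = [
    ("partnership deed", [0.9, 0.1, 0.0]),
    ("interest on capital", [0.7, 0.6, 0.1]),
    ("weather report", [0.0, 0.1, 0.9]),
]

top = retrieve([1.0, 0.2, 0.0], toy_store, k=2)
print(top)
```

In a RAG setup, the retrieved texts are then placed into the prompt that is sent to the LLM.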



# Installations, imports, utils

In [2]:
# !PATH=/home/ubuntu/.cargo/bin:$PATH cargo --version 
# !export cargo=/home/ubuntu/.cargo/bin/cargo
!cargo --version

#  einops==0.6.1

cargo 1.78.0 (54d8815d0 2024-03-26)


In [3]:
!pip install transformers accelerate==0.22.0 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes sentence_transformers==2.2.2 torch



In [1]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma




In [5]:
# model_id = '/home/ubuntu/RAG/models/config.json'

# device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# bnb_config = transformers.BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type='nf4',
#     bnb_4bit_use_double_quant=True,
#     bnb_4bit_compute_dtype=bfloat16
# )



# time_1 = time()
# model_config = transformers.AutoConfig.from_pretrained(
#     model_id,
# )
# model = transformers.AutoModelForCausalLM.from_pretrained(
#     model_id,
#     trust_remote_code=True,
#     config=model_config,
#     quantization_config=bnb_config,
#     device_map='auto',
# )
# tokenizer = AutoTokenizer.from_pretrained(model_id)
# time_2 = time()
# print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")

Prepare the model and the tokenizer.

In [2]:
# Update the model path to point to your local directory

def prepare_model():
    model_id = "/home/ubuntu/RAG/models/"  # Local directory containing the Gemma model files

    bnb_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=bfloat16
    )

    time_1 = time()
    model_config = transformers.AutoConfig.from_pretrained(
        model_id,
    )
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        config=model_config,
        quantization_config=bnb_config,
        device_map='auto',
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    time_2 = time()
    print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")
    return model, tokenizer

prepare_model()

Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.42s/it]


Prepare model, tokenizer: 7.125 sec.


(GemmaForCausalLM(
   (model): GemmaModel(
     (embed_tokens): Embedding(256000, 2048, padding_idx=0)
     (layers): ModuleList(
       (0-17): 18 x GemmaDecoderLayer(
         (self_attn): GemmaSdpaAttention(
           (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
           (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
           (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
           (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
           (rotary_emb): GemmaRotaryEmbedding()
         )
         (mlp): GemmaMLP(
           (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
           (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
           (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
           (act_fn): PytorchGELUTanh()
         )
         (input_layernorm): GemmaRMSNorm()
         (post_attention_layernorm):
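
The module printout above lists the shapes of the `Linear4bit` layers, which is enough for a back-of-the-envelope estimate of what 4-bit loading saves. The sketch below is rough arithmetic only: it counts just the linear layers visible in the printout, ignores the embedding table and the extra constants that quantization stores, and the helper name `size_gib` is made up for illustration.

```python
# Layer shapes read off the printed model structure.
HIDDEN = 2048      # q_proj / o_proj width
KV_DIM = 256       # k_proj / v_proj output width
MLP_DIM = 16384    # gate_proj / up_proj output width
NUM_LAYERS = 18    # "(0-17): 18 x GemmaDecoderLayer"

params_per_layer = (
    2 * HIDDEN * HIDDEN      # q_proj and o_proj
    + 2 * HIDDEN * KV_DIM    # k_proj and v_proj
    + 3 * HIDDEN * MLP_DIM   # gate_proj, up_proj, down_proj
)
linear_params = params_per_layer * NUM_LAYERS

def size_gib(num_params, bits_per_param):
    """Storage needed for num_params weights at the given precision, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

print(f"Linear-layer parameters: {linear_params:,}")
print(f"At 16 bits: {size_gib(linear_params, 16):.2f} GiB")
print(f"At  4 bits: {size_gib(linear_params, 4):.2f} GiB")
```

The 4-bit figure is a quarter of the 16-bit one by construction; actual GPU memory use will be higher because of the unquantized embedding table, activations, and quantization metadata.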

Define a helper that runs a test prompt through the model.

In [3]:
def test_model(tokenizer, model, prompt_to_test):
    """
    Perform text generation using the provided model and tokenizer.
    Args:
        tokenizer: the tokenizer
        model: the language model
        prompt_to_test (str): the prompt text to generate from
    """
    time_1 = time()
    input_ids = tokenizer.encode(prompt_to_test, return_tensors="pt").to(model.device)
    # Greedy decoding (do_sample=False), so sampling options such as top_k are omitted.
    # max_new_tokens caps the generated length; 1024 is an assumed, adjustable limit.
    sequences = model.generate(input_ids,
                               max_new_tokens=1024,
                               num_return_sequences=1,
                               no_repeat_ngram_size=2,
                               eos_token_id=tokenizer.eos_token_id,
                               do_sample=False,
                               )
    time_2 = time()
    print(f"Test inference: {round(time_2 - time_1, 3)} sec.")

    for seq in sequences:
        generated_text = tokenizer.decode(seq, skip_special_tokens=True)
        print(f"Result: {generated_text}")
        
        
model, tokenizer = prepare_model()
test_model(tokenizer, model, "Please explain what is the State of the Union address. Give just a definition. Keep it in 500 words.")


In [17]:
# test_model() prints its result and returns None, so there is nothing to assign.
test_model(tokenizer, model, "Prepare a Common Size balance sheet in csv format that can be used in the excel sheet.")

Test inference: 8.944 sec.
Result: Prepare a Common Size balance sheet in csv format that can be used in the excel sheet.

**Assets**

| Item | Amount |
|---|---|
 | Cash | 100 |

 | Accounts Receivable |  50  |
  
   | Inventory |   20   
    
     | Total Assets |    155 |


**Liabilities and Owner's Equity**
 
	|  Item |Amount |	
|:---:|:------:|
:Accounts Payable |125|	 

	 | Loan |53 |		
      
       | Owner’s Capital |30|

  	  Total Liabilities |278| 	

   	Owner’ Equity |85  

    |Total Owner Equity|   87 |



**Notes:**

* The company started with $1,058 in cash.
* They have $52 in accounts receivable. 


Please let me know if you have any other questions.


## Ingestion of data using TextLoader



In [5]:
loader = TextLoader("/home/ubuntu/RAG/datasets/TXTs/Accounts101.txt", encoding="utf8")
documents = loader.load()

## Split data into chunks



In [7]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
print(documents)
all_splits = text_splitter.split_documents(documents)
print(all_splits)

[Document(page_content="Y\nou have learnt about the preparation of financial\nstatements for a sole proprietary concern. As the\nbusiness expands, one needs more capital and\nlarger number of people to manage the business and\nshare its risks. In such a situation, people usually\nadopt the partnership form of organisation.\nAccounting for partnership firms has it’s own\npeculiarities, as the partnership firm comes into\nexistence when two or more persons come together\nto establish business and share its profits. On many\nissues affecting distribution of profits, there may not\nbe any specific agreement between the partners. In\nsuch a situation the provisions of the Indian\nPartnership Act 1932 apply. Similarly, calculation\nof interest on capital, interest on drawings and\nmaintenance of partners capital accounts have their\nown peculiarities. Not only that a variety of\nadjustments are required on the death of a partner\nor when a new partner is admitted and so on. These\npeculiar s
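
To illustrate what `chunk_size` and `chunk_overlap` mean in the cell above, here is a simplified sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraphs, lines, and spaces); the function `split_with_overlap` is a hypothetical helper that cuts at fixed character offsets, which is enough to show how consecutive chunks share a few characters.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character chunks that overlap their neighbours."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last two characters of the previous one. With `chunk_size=1000` and `chunk_overlap=20` the same idea applies at a larger scale, so a sentence cut at a chunk boundary is still partly visible in the next chunk.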

## Creating Embeddings and Storing in Vector Store

In [14]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
print(f"Embeddings: {embeddings}")

Embeddings: client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
) model_name='sentence-transformers/all-mpnet-base-v2' cache_folder=None model_kwargs={'device': 'cuda'} encode_kwargs={} multi_process=False
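
The printout above shows that the embedding model ends with mean-token pooling (`pooling_mode_mean_tokens: True`) followed by `Normalize()`. The sketch below shows those two steps on toy numbers in plain Python. It is a simplification: the real model works on 768-dimensional tensors and skips padding tokens when averaging, and the helper names `mean_pool` and `l2_normalize` are made up for illustration.

```python
import math

def mean_pool(token_vectors):
    """Average the per-token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Two hypothetical 2-dimensional token vectors.
token_vectors = [[3.0, 0.0], [3.0, 8.0]]
sentence_vector = l2_normalize(mean_pool(token_vectors))
print(sentence_vector)
```

Because the vectors are normalized to unit length, the dot product of two embeddings equals their cosine similarity.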


Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.

In [20]:


# all_embeddings = embeddings(documents)

vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")


ValueError: Expected EmbeddingFunction.__call__ to have the following signature: odict_keys(['self', 'input']), got odict_keys(['self', 'args', 'kwargs'])
Please see https://docs.trychroma.com/embeddings for details of the EmbeddingFunction interface.
Please note the recent change to the EmbeddingFunction interface: https://docs.trychroma.com/migration#migration-to-0416---november-7-2023 
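
The error message above points to a change in Chroma's `EmbeddingFunction` interface in version 0.4.16. A likely cause, stated here as an assumption rather than a verified diagnosis, is that the installed `chromadb` is newer than what the pinned `langchain==0.0.300` wrapper expects. The sketch below is a small hypothetical helper that checks which side of that boundary a version string falls on; `parse_version` and `uses_new_embedding_interface` are illustrative names, not part of either library.

```python
import re

def parse_version(version):
    """Turn a version string such as '0.4.15' into a tuple like (0, 4, 15)."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])

def uses_new_embedding_interface(chromadb_version):
    """True if the version is at or after the 0.4.16 interface change."""
    return parse_version(chromadb_version) >= (0, 4, 16)

for candidate in ["0.4.15", "0.4.16", "0.5.0"]:
    print(candidate, uses_new_embedding_interface(candidate))

# In the notebook, the installed version could be checked with:
# import chromadb
# print(uses_new_embedding_interface(chromadb.__version__))
```

If the check returns `True`, two options worth trying are installing a `chromadb` release below 0.4.16 or upgrading `langchain` to a release that supports the new interface. Neither was tested in this notebook.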


## Initialize chain

In [None]:
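# Wrap the model in a text-generation pipeline so LangChain can call it.
# max_new_tokens=512 is an assumed limit; adjust as needed.
query_pipeline = transformers.pipeline(
    "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512
)
llm = HuggingFacePipeline(pipeline=query_pipeline)

retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)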
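
The `"stuff"` chain type used above takes all retrieved chunks and places them into a single prompt together with the question. The sketch below shows the idea in plain Python. The prompt wording and the function `build_stuff_prompt` are hypothetical; LangChain uses its own template.

```python
def build_stuff_prompt(question, chunks):
    """Join every retrieved chunk into one context block ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved chunks, standing in for the retriever's output.
example_chunks = [
    "A partnership comes into existence when two or more persons share profits.",
    "Interest on capital is allowed only if the partners agree to it.",
]
prompt = build_stuff_prompt("What is a partnership?", example_chunks)
print(prompt)
```

Since every chunk goes into one prompt, this approach only works while the combined text fits within the model's context window.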

## Test the Retrieval-Augmented Generation 



In [None]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

In [None]:
query = "What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words."
test_rag(qa, query)

In [None]:
query = "What is the nation's economic status? Summarize. Keep it under 200 words."
test_rag(qa, query)

## Document sources

Let's check the document sources for the last query run.

In [None]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    # Each result is a Document with page_content and metadata attributes.
    print("Source: ", doc.metadata['source'])
    print("Text: ", doc.page_content, "\n")