<a href="https://colab.research.google.com/github/dwu12/Machine-Learning-Project/blob/main/RAG_Langchain_Chroma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Get dependencies

In [1]:
!pip install langchain-chroma
!pip install pypdf
!pip install langchain
!pip install sentence_transformers # for sentence embeddings
!pip install accelerate # for quantization model loading
!pip install bitsandbytes # for quantizing models (less storage space)
!pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference



In [2]:
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_chroma import Chroma
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available
from transformers import BitsAndBytesConfig

import torch

### Create documents

In [3]:
loader = PyPDFLoader("/content/drive/MyDrive/DL/LLM/Ace The Full Stack Data & AI Scientist.pdf")
document = loader.load()

In [4]:
for page in document:
  page.page_content = page.page_content.replace('\n', ' ')

In [5]:
### check one document
document[100]

Document(page_content='Distributed Data Parallel works in the following way: 1. At the beginning of the training, the model’s weights are initialized on one node and sent to all the other nodes (Broadcast) 2. Each node trains the same model (with the same initial weights) on a subset of the dataset. 3. Every few batches, the gradients of each node are accumulated on one node (summed up), and then sent back to all the other nodes (All-Reduce). 4. Each node updates the parameters of its local model with the gradients received using its own optimizer. 5. 5. Go back to step 2 Other techniques including: ● Fully Sharded Data Parallel (FSDP), motivated by the “ZeRO” paper - zero data overlap between GPUs ', metadata={'source': '/content/drive/MyDrive/DL/LLM/Ace The Full Stack Data & AI Scientist.pdf', 'page': 100})

### Split the document into chunks

In [6]:
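# CharacterTextSplitter splits on "\n\n" by default. Newlines were replaced with spaces above,
# so a page longer than chunk_size is likely kept as a single chunk (RecursiveCharacterTextSplitter avoids this).
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)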
docs = text_splitter.split_documents(document)
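
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking in plain Python. It is not LangChain's algorithm (which splits on separators and then merges pieces); the helper name `split_with_overlap` and the sample text are hypothetical and only illustrate how the two parameters interact.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters of stand-in text
chunks = split_with_overlap(sample, chunk_size=12, chunk_overlap=2)
print([len(c) for c in chunks])
```

With `chunk_overlap=0`, as in the cell above, the windows simply tile the text. A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when a relevant sentence straddles a chunk boundary.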

In [7]:
### check one chunk
docs[100]

Document(page_content='Attention Is All You Need (2017) The foundation model for current deep learning (Not only for Language model but also for Video, Audio, Computer Vision). Originally mainly for machine translation, use only attention mechanisms without recurrence and convolutions. Transformer has three major contribution: 1. Parallelizable ( RNN cannot be parallelizable ) 2. Reduce time to train 3. Generalizable to other tasks (Images, Audio, Video) Ensures global dependencies through attention. Transformer is also an encoder-decoder architecture that uses auto-regression to output results. Encoder contains two sublayers: 1. Multi-head self-attention mechanism 2. Simple position wise fully connected feed-forward network (MLP Layer) Decoder contains three sublayers: 1. Masked multi-head self-attention mechanism a. Why mask? i. Because decoder is auto-regression and attention mechanism will see the global information, we need mask for the information leaky 2. Multi-head cross-attent

### Embed chunks into a Chroma database

In [8]:
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [9]:
db = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")

### Try a query against the db

In [10]:
query = "Tell me about LLaVA model"
docs = db.similarity_search(query, k = 4)
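
Conceptually, a similarity search embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that ranking step with hand-written toy vectors and cosine similarity. The function names and vectors are hypothetical, and the distance metric Chroma actually uses depends on how the collection is configured, so treat this as an illustration rather than a description of its internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions
toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.0, 0.0]
print(top_k(toy_query, toy_docs, k=2))
```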

In [11]:
def word_wrap(string, n_chars=72):
    # Wrap a string at the last space before n_chars, recursing on the remainder
    if len(string) < n_chars:
        return string
    head = string[:n_chars].rsplit(' ', 1)[0]
    return head + '\n' + word_wrap(string[len(head) + 1:], n_chars)
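
As an alternative, the standard library's `textwrap.fill` does the same job without recursion. This is a sketch of a drop-in replacement; the name `word_wrap_stdlib` and the sample sentence are hypothetical.

```python
import textwrap


def word_wrap_stdlib(string, n_chars=72):
    """Wrap text so that no line is longer than n_chars characters."""
    return textwrap.fill(string, width=n_chars)


sample = "LLaVA stands for Large Language and Vision Assistant. " * 4
wrapped = word_wrap_stdlib(sample)
print(wrapped)
```

Unlike the recursive helper, `textwrap.fill` also breaks words that are longer than the line width by default, so a chunk with no spaces does not lose characters.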


In [12]:
for doc in docs:
  print(word_wrap(doc.page_content))
  print('\n')

LLaMA (2023) Code:


● Stage 2: Fine-tuning End-to-End. Both the projection matrix and LLM
are updated for two different use scenarios: ○ Visual Chat: LLaVA is
fine-tuned on the generated multimodal instruction-following data for
daily user-oriented applications. ○ Science QA: LLaVA is fine-tuned on
this multimodal reasonsing dataset for the science domain. Code:
Graph-Neural Network Graph NN (2021)


LLaVA (2023) LLaVA stands for Large Language and Vision Assistant, uses
machine-generated instruction-following data and has improved zero-shot
capabilities on new tasks in the language domain, but the idea is less
explored in the multimodal field. Contribution: 1. Multimodal Instruct
Data : Present the first attempt to use language-only GPT-4 to generate
multimodal language-image instruction-following data. 2. LLaVA Model :
Introduce LLaVA (Large Language-and-Vision Assistant), an end-to-end
trained large multimodal model that connects a vision encoder and LLM
for general-purpose visual 

### Get Model

In [13]:
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")

if gpu_memory_gb < 5.1:
    print(f"Your available GPU memory is {gpu_memory_gb}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
    # Fall back to the smallest option so the variables printed below are always defined
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 8.1:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False
    model_id = "google/gemma-2b-it"
else:  # 19 GB or more
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")

Available GPU memory: 16 GB
GPU memory: 16 | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.
use_quantization_config set to: False
model_id set to: google/gemma-2b-it
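
The same thresholds can be written as a small pure function, which is easier to test without a GPU. This is a sketch: the name `pick_gemma_config` is hypothetical, and treating the lowest tier (under 5.1 GB) as a 4-bit Gemma 2B fallback is an assumption of this sketch.

```python
def pick_gemma_config(gpu_memory_gb):
    """Return (model_id, use_quantization_config) for a GPU memory size in GB."""
    if gpu_memory_gb < 8.1:
        # Small GPUs: Gemma 2B in 4-bit (assumed fallback below 5.1 GB as well)
        return "google/gemma-2b-it", True
    if gpu_memory_gb < 19.0:
        # Mid-sized GPUs: Gemma 2B in float16
        return "google/gemma-2b-it", False
    # 19 GB or more: Gemma 7B
    return "google/gemma-7b-it", False


for gb in (4, 8, 16, 19, 24):
    print(gb, pick_gemma_config(gb))
```

Every input maps to a result, so the later `print` calls and the model loading step never see undefined variables.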


In [14]:
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_compute_dtype=torch.float16)

if (is_flash_attn_2_available()) and (torch.cuda.get_device_capability(0)[0] >= 8):
  attn_implementation = "flash_attention_2"
else:
  attn_implementation = "sdpa"

print(f"[INFO] Using attention implementation: {attn_implementation}")

# Hugging Face access token; Gemma is a gated model, so paste a token with access here
token = ''
print(f"[INFO] Using model_id: {model_id}")

# Instantiate tokenizer (tokenizer turns text into numbers ready for the model)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id, token=token)

# Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id,
                                                 token = token,
                                                 torch_dtype=torch.float16, # datatype to use, we want float16
                                                 quantization_config=quantization_config if use_quantization_config else None,
                                                 low_cpu_mem_usage=False, # use full memory
                                                 attn_implementation=attn_implementation) # which attention version to use

if not use_quantization_config: # quantization takes care of device setting automatically, so if it's not used, send model to GPU
    llm_model.to("cuda")

[INFO] Using attention implementation: sdpa
[INFO] Using model_id: google/gemma-2b-it


Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
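
To see why quantization changes which model fits, a rough estimate of the memory needed for the weights alone is parameter count times bits per parameter. The sketch below is back-of-the-envelope arithmetic only: the parameter count is an assumed round figure, not a value read from the model, and the estimate ignores activations, the KV cache, and quantization overhead.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights in GiB (weights only)."""
    return n_params * bits_per_param / 8 / 2**30


# Assumed, approximate parameter count for a "2B"-class model (illustrative only)
N_PARAMS_ASSUMED = 2.5e9

for label, bits in (("float16", 16), ("4-bit", 4)):
    print(f"{label}: ~{weight_memory_gb(N_PARAMS_ASSUMED, bits):.1f} GB")
```

Going from 16 bits to 4 bits per parameter cuts the weight footprint to a quarter, which is the saving the `BitsAndBytesConfig(load_in_4bit=True)` option is meant to provide.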



### Without RAG

In [15]:
input_text = "Tell me about LLaVA model"

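inputs = tokenizer(input_text, return_tensors="pt").to("cuda")  # holds input_ids and attention_mask

# No max_new_tokens/max_length is passed, so generation stops at the library's short default length
outputs = llm_model.generate(**inputs)
print(tokenizer.decode(outputs[0]))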



<bos>Tell me about LLaVA model.

The LLaVA model is a large language model


### With RAG

In [16]:
docs = db.similarity_search(input_text, k = 3)

In [17]:
for doc in docs:
  print(word_wrap(doc.page_content))
  print('\n')

LLaMA (2023) Code:


● Stage 2: Fine-tuning End-to-End. Both the projection matrix and LLM
are updated for two different use scenarios: ○ Visual Chat: LLaVA is
fine-tuned on the generated multimodal instruction-following data for
daily user-oriented applications. ○ Science QA: LLaVA is fine-tuned on
this multimodal reasonsing dataset for the science domain. Code:
Graph-Neural Network Graph NN (2021)


LLaVA (2023) LLaVA stands for Large Language and Vision Assistant, uses
machine-generated instruction-following data and has improved zero-shot
capabilities on new tasks in the language domain, but the idea is less
explored in the multimodal field. Contribution: 1. Multimodal Instruct
Data : Present the first attempt to use language-only GPT-4 to generate
multimodal language-image instruction-following data. 2. LLaVA Model :
Introduce LLaVA (Large Language-and-Vision Assistant), an end-to-end
trained large multimodal model that connects a vision encoder and LLM
for general-purpose visual 

In [18]:
context = "\n\n".join(doc.page_content for doc in docs)
context

'LLaMA (2023) Code:\n\n● Stage 2: Fine-tuning End-to-End. Both the projection matrix and LLM are updated for two different use scenarios: ○ Visual Chat: LLaVA is fine-tuned on the generated multimodal instruction-following data for daily user-oriented applications. ○ Science QA: LLaVA is fine-tuned on this multimodal reasonsing dataset for the science domain. Code: Graph-Neural Network Graph NN (2021)\n\nLLaVA (2023) LLaVA stands for Large Language and Vision Assistant, uses machine-generated instruction-following data and has improved zero-shot capabilities on new tasks in the language domain, but the idea is less explored in the multimodal field. Contribution: 1. Multimodal Instruct Data : Present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. 2. LLaVA Model : Introduce LLaVA (Large Language-and-Vision Assistant), an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visu

In [19]:
base_prompt = f"""Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
If you don't know the answer, just say that you don't know.
\n ####Now use the following context items to answer the user query:
{context}
\n ####Relevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""
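
Because the f-string above is evaluated once, with whatever `context` and `query` hold at that moment, asking a second question means rebuilding the string. A small function makes that explicit. This is a sketch: `build_prompt`, its shortened wording, and the numbered passage labels are hypothetical and are not the prompt used in this notebook.

```python
def build_prompt(context_items, query):
    """Assemble a RAG prompt from retrieved passages and a user query."""
    # Number each passage so the model (and a reader) can tell them apart
    context = "\n\n".join(f"[{i}] {item}" for i, item in enumerate(context_items, start=1))
    return (
        "Based on the following context items, please answer the query.\n"
        "If you don't know the answer, just say that you don't know.\n\n"
        f"Context items:\n{context}\n\n"
        f"User query: {query}\n"
        "Answer:"
    )


demo = build_prompt(
    ["LLaVA connects a vision encoder and an LLM."],
    "Tell me about LLaVA model",
)
print(demo)
```

In the notebook this could be called with `[doc.page_content for doc in docs]` and `input_text` after each retrieval, which also keeps the retrieval query and the prompt query in sync.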

In [20]:
base_prompt

"Based on the following context items, please answer the query.\nGive yourself room to think by extracting relevant passages from the context before answering the query.\nDon't return the thinking, only return the answer.\nMake sure your answers are as explanatory as possible.\nIf you don't know the answer, just say that you don't know.\n\n ####Now use the following context items to answer the user query:\nLLaMA (2023) Code:\n\n● Stage 2: Fine-tuning End-to-End. Both the projection matrix and LLM are updated for two different use scenarios: ○ Visual Chat: LLaVA is fine-tuned on the generated multimodal instruction-following data for daily user-oriented applications. ○ Science QA: LLaVA is fine-tuned on this multimodal reasonsing dataset for the science domain. Code: Graph-Neural Network Graph NN (2021)\n\nLLaVA (2023) LLaVA stands for Large Language and Vision Assistant, uses machine-generated instruction-following data and has improved zero-shot capabilities on new tasks in the langua

In [21]:
input_ids = tokenizer(base_prompt, return_tensors="pt").to("cuda")

# max_length counts the prompt tokens as well as the generated ones
outputs = llm_model.generate(**input_ids, max_length=1000)
output_text = tokenizer.decode(outputs[0])

In [22]:
output_text = output_text.replace("<bos>", "").replace("<eos>", "")
print(output_text)

Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
If you don't know the answer, just say that you don't know.

 ####Now use the following context items to answer the user query:
LLaMA (2023) Code:

● Stage 2: Fine-tuning End-to-End. Both the projection matrix and LLM are updated for two different use scenarios: ○ Visual Chat: LLaVA is fine-tuned on the generated multimodal instruction-following data for daily user-oriented applications. ○ Science QA: LLaVA is fine-tuned on this multimodal reasonsing dataset for the science domain. Code: Graph-Neural Network Graph NN (2021)

LLaVA (2023) LLaVA stands for Large Language and Vision Assistant, uses machine-generated instruction-following data and has improved zero-shot capabilities on new tasks in the language domain, b
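
The decoded text above starts with the full prompt, because a causal LM returns the input tokens followed by the generated ones. A minimal sketch for keeping only the answer is shown below. The helper `extract_answer` and the demo strings are hypothetical, and it assumes the decoded text reproduces the prompt exactly, which is not guaranteed after a tokenize/decode round trip.

```python
def extract_answer(decoded_text, prompt, special_tokens=("<bos>", "<eos>")):
    """Remove special tokens and the echoed prompt from decoded model output."""
    for token in special_tokens:
        decoded_text = decoded_text.replace(token, "")
    # Drop the prompt only if the output really begins with it
    if decoded_text.startswith(prompt):
        decoded_text = decoded_text[len(prompt):]
    return decoded_text.strip()


demo_prompt = "User query: Tell me about LLaVA model\nAnswer:"
demo_decoded = "<bos>" + demo_prompt + " LLaVA is a large multimodal model.<eos>"
print(extract_answer(demo_decoded, demo_prompt))
```

A more robust variant, if it suits the setup, is to slice the generated token ids before decoding (for example, dropping the first `input_ids["input_ids"].shape[1]` tokens of `outputs[0]`), since that does not depend on the text matching character for character.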