<a href="https://colab.research.google.com/github/aswinaus/LLM_Inference/blob/main/mistral_quant_awq_load_local_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install git+https://github.com/huggingface/transformers torch accelerate langchain langchain_huggingface datasets --quiet

Code is essentially forcing Python to always use "UTF-8" as the preferred encoding, regardless of the user's actual system settings. UTF-8 is a widely used encoding that can represent a vast range of characters from different languages. By enforcing UTF-8, you can help ensure that your code works consistently across different platforms and avoids encoding-related errors. It's a common practice for improving compatibility and preventing issues with text handling in Python programs.

In [2]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [3]:
from google.colab import userdata
HUGGING_FACE_TOKEN = userdata.get('HUGGING_FACE_TOKEN')

In [4]:
!huggingface-cli login --token $HUGGING_FACE_TOKEN

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `Agents` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `Agents`


In [5]:
from google.colab import drive
drive.mount('/content/drive')
# Download Data
data_dir = '/content/drive/MyDrive'

Mounted at /content/drive


In [6]:
# Import libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import torch
from langchain_huggingface import HuggingFacePipeline
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate
from threading import Thread

The nvidia-smi command is a utility provided by NVIDIA to query and display information about your NVIDIA GPU(s) (Graphics Processing Unit). This includes things like:

GPU model and name
Driver version
GPU utilization
Memory usage
Temperature
Power consumption
Processes running on the GPU

In [7]:
!nvidia-smi

Tue Feb 11 23:39:54 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P0             43W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [8]:
import textwrap

def wrap_text(text, width=90): #preserve_newlines
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

In [9]:
!pip install autoawq

Collecting autoawq
  Downloading autoawq-0.2.8.tar.gz (71 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.6/71.6 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting transformers<=4.47.1,>=4.45.0 (from autoawq)
  Downloading transformers-4.47.1-py3-none-any.whl.metadata (44 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.1/44.1 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
Downloading transformers-4.47.1-py3-none-any.whl (10.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.1/10.1 MB[0m [31m83.8 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: autoawq
  Building wheel for autoawq (setup.py) ... [?25l[?25hdone
  Created wheel for autoawq: filename=autoawq-0.2.8-py3-none-any.whl size=108744 sha256=4f00fc893e210f039079ff91a29af9f9dfb32271f9f4bbd5f39d449cc51baa08
  Stored in directory: /root/.cache/pip/wheels/fd/03/fe/99c1c678bfe8a

In [1]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from typing import Tuple, Optional, Union, Dict, Any
from transformers import PreTrainedModel, AutoModel, AutoTokenizer, AutoConfig
from transformers.tokenization_utils_base import PreTrainedTokenizerBase
from google.colab import drive

In [2]:
drive.mount('/content/drive')
data_dir = '/content/drive/MyDrive' # Input a data dir path from your mounted Google Drive

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
quant_path = f"/{data_dir}/LLMs/Mistral/Mistral-Small-24B-Instruct-2501"

In [4]:
local_model_path = quant_path
local_tokenizer = AutoTokenizer.from_pretrained(quant_path)
local_model = AutoModel.from_pretrained(quant_path,low_cpu_mem_usage=True)

You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

local_model.to(device) moves all the model's parameters and buffers to the specified device (in this case, device, which is set to 'cuda' if a GPU is available). Deep learning models often have a large number of parameters and require significant computational power. GPUs are designed for parallel processing and can significantly speed up the training and inference of deep learning models. By moving the model to the GPU, you leverage its computational capabilities for faster execution.

In [7]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
local_model.to(device)

MistralModel(
  (embed_tokens): Embedding(131072, 5120)
  (layers): ModuleList(
    (0-39): 40 x MistralDecoderLayer(
      (self_attn): MistralSdpaAttention(
        (q_proj): WQLinear_GEMM(in_features=5120, out_features=4096, bias=False, w_bit=4, group_size=128)
        (k_proj): WQLinear_GEMM(in_features=5120, out_features=1024, bias=False, w_bit=4, group_size=128)
        (v_proj): WQLinear_GEMM(in_features=5120, out_features=1024, bias=False, w_bit=4, group_size=128)
        (o_proj): WQLinear_GEMM(in_features=4096, out_features=5120, bias=False, w_bit=4, group_size=128)
        (rotary_emb): MistralRotaryEmbedding()
      )
      (mlp): MistralMLP(
        (gate_proj): WQLinear_GEMM(in_features=5120, out_features=32768, bias=False, w_bit=4, group_size=128)
        (up_proj): WQLinear_GEMM(in_features=5120, out_features=32768, bias=False, w_bit=4, group_size=128)
        (down_proj): WQLinear_GEMM(in_features=32768, out_features=5120, bias=False, w_bit=4, group_size=128)
        (

In [8]:
from datasets import load_dataset
import pandas as pd
dataset = load_dataset("aswinaus/tax_statistics_dataset_by_income_range", download_mode="force_redownload")
df=pd.DataFrame(dataset['train'])
df.head(10)

README.md:   0%|          | 0.00/24.0 [00:00<?, ?B/s]

(…)x_statistics_dataset_by_income_range.csv:   0%|          | 0.00/272k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/303 [00:00<?, ? examples/s]

Unnamed: 0,STATEFIPS,STATE,zipcode,Size of adjusted gross income,No of returns,No of single returns,No of joint returns,No of head of household returns,Number of electronically filed returns,Number of computer prepared paper returns,...,Number of returns with net investment income tax,Net investment income tax amount,Number of returns with tax due at time of filing,Tax due at time of filing amount,Number of returns with total overpayments,Total overpayments amount,Number of returns with overpayments refunded,Overpayments refunded amount,Number of returns with credit to next years estimated tax,Credited to next years estimated tax amount
0,1,AL,0,"$1 under $25,000",778210,491030,84770,189600,712890,30670,...,0,0,62720,51936,671860,1700965,669570,1694792,1980,3512
1,1,AL,0,"$25,000 under $50,000",525940,247140,123910,139860,481760,18960,...,0,0,85860,122569,438020,1274802,435210,1266557,3670,7410
2,1,AL,0,"$50,000 under $75,000",285700,105140,128140,44560,260570,10670,...,0,0,73980,154932,212040,575315,208470,564202,5020,13653
3,1,AL,0,"$75,000 under $100,000",179070,38820,123110,13740,164300,5020,...,0,0,51330,139065,126850,401581,123310,388749,3040,10377
4,1,AL,0,"$100,000 under $200,000",257010,28180,216740,7150,236850,8400,...,90,141,104290,460071,152790,598248,144640,539385,9180,56257
5,1,AL,0,"$200,000 or more",74810,4540,66580,530,69330,1760,...,39430,141243,40030,946610,32060,681565,22420,257633,9450,372747
6,2,AK,0,"$1 under $25,000",102820,81530,7850,10850,89210,6360,...,0,0,17720,8643,75910,133915,75430,132166,370,670
7,2,AK,0,"$25,000 under $50,000",79910,48680,14900,13510,72490,3260,...,0,0,11540,19564,68020,185024,67520,182576,740,1890
8,2,AK,0,"$50,000 under $75,000",51890,25460,17880,6840,46570,2160,...,0,0,11330,26920,40310,118691,39720,115674,830,2922
9,2,AK,0,"$75,000 under $100,000",36350,12880,19390,3190,32620,1630,...,0,0,9690,29647,26340,90982,25670,87439,680,2366


RAG pipeline implementation

In [None]:
!pip install langchain_openai langchain_community chromadb tiktoken --quiet

In [10]:
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

from langchain_core.runnables import (
    RunnableParallel,
    RunnablePassthrough
)
from langchain.schema.output_parser import StrOutputParser
from typing import Dict, Any, List
from langchain.docstore.document import Document

In [11]:
from getpass import getpass
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

In [12]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['STATEFIPS', 'STATE', 'zipcode', 'Size of adjusted gross income', 'No of returns', 'No of single returns', 'No of joint returns', 'No of head of household returns', 'Number of electronically filed returns', 'Number of computer prepared paper returns', "Number of returns with paid preparer's signature", 'Number of returns with direct deposit', 'Number of individuals', 'Total number of volunteer prepared returns', 'Number of volunteer income tax assistance VITA prepared returns', 'Number of tax counseling for the elderly  TCE prepared returns', 'Number of volunteer prepared returns with Earned Income Credit', 'Number of refund anticipation check returns', 'Number of elderly returns', 'Adjust gross income AGI', 'Number of returns with total income', 'Total income amount', 'Number of returns with salaries and wages', 'Salaries and wages amount', 'Number of returns with taxable interest', 'Taxable interest amount', 'Number of returns wit

In [13]:
#RecursiveCharacterTextSplitter for splitting the documents into chunk size and overlapp for efficient meaningful chunks
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,
    chunk_overlap=50,
)

In [37]:
#define a pad token for tokenizer and save it.
#local_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
#local_model.resize_token_embeddings(len(local_tokenizer))
#local_tokenizer.save_pretrained(quant_path)
#local_model = AutoModel.from_pretrained(quant_path,low_cpu_mem_usage=True, trust_remote_code=True)
#local_model.to(device)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

MistralModel(
  (embed_tokens): Embedding(131072, 5120)
  (layers): ModuleList(
    (0-39): 40 x MistralDecoderLayer(
      (self_attn): MistralSdpaAttention(
        (q_proj): WQLinear_GEMM(in_features=5120, out_features=4096, bias=False, w_bit=4, group_size=128)
        (k_proj): WQLinear_GEMM(in_features=5120, out_features=1024, bias=False, w_bit=4, group_size=128)
        (v_proj): WQLinear_GEMM(in_features=5120, out_features=1024, bias=False, w_bit=4, group_size=128)
        (o_proj): WQLinear_GEMM(in_features=4096, out_features=5120, bias=False, w_bit=4, group_size=128)
        (rotary_emb): MistralRotaryEmbedding()
      )
      (mlp): MistralMLP(
        (gate_proj): WQLinear_GEMM(in_features=5120, out_features=32768, bias=False, w_bit=4, group_size=128)
        (up_proj): WQLinear_GEMM(in_features=5120, out_features=32768, bias=False, w_bit=4, group_size=128)
        (down_proj): WQLinear_GEMM(in_features=32768, out_features=5120, bias=False, w_bit=4, group_size=128)
        (

In [15]:
from typing import List
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import Chroma
from sentence_transformers import SentenceTransformer
from transformers import AutoModel, AutoTokenizer
import shutil
class MyEmbedding:
    def __init__(self, model):
        self.model = SentenceTransformer(model, trust_remote_code=True)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.model.encode(text).tolist() for text in texts]

    def embed_query(self, query: str) -> List[float]:
        encoded_query = self.model.encode(query)
        return encoded_query.tolist()

database_path = f"{data_dir}/RAG/VectorDB/chroma_db_RAG_quantnew"


# Delete the existing database
database_path = f"{data_dir}/RAG/VectorDB/chroma_db_RAG_quantnew"  # Update with your actual database path
if os.path.exists(database_path):
    shutil.rmtree(database_path)

# Create the database directory
os.makedirs(database_path, exist_ok=True)



def generate_embeddings(data):
    embeddings = []
    for text in data:
        text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=40)
        chunks = text_splitter.create_documents(text)

        # Get embeddings for each chunk using local_model
        chunk_embeddings = []
        for chunk in chunks:
            encoded_input = local_tokenizer(chunk.page_content, return_tensors="pt").to(device)
            embedding = local_model(**encoded_input)[0].detach().cpu().numpy()
            chunk_embeddings.append(embedding)

        # Use an average or other aggregation method if necessary
        # For example, averaging the embeddings:
        # embedding = np.mean(chunk_embeddings, axis=0)

        chromadb = Chroma.from_documents(chunks,
                                 persist_directory=database_path,
                                 collection_name='coll_cosine',
                                 collection_metadata={"hnsw:space": "cosine"},
                                 embedding=MyEmbedding(model=local_model_path))

        chromadb.persist()

    return embeddings  # Return a list of embeddings or any desired output
#print(embeddings)

In [16]:
#Clean Unused Tensors
torch.cuda.empty_cache()
generate_embeddings(dataset)

`low_cpu_mem_usage` was None, now default to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

  chromadb.persist()


[]

In [24]:
# Input your question in string type and the relative path of your own local model.
def retrieve(user_query, model_path):
    embedding_model = MyEmbedding(model_path)

    chromadb = Chroma(embedding_function=embedding_model,
                      collection_name='coll_cosine',
                      collection_metadata={"hnsw:space": "cosine"},
                      persist_directory=database_path)

    results = chromadb.similarity_search_with_score(user_query, 10)
    print(results)
    return results[0][0].page_content

In [25]:
retrieve("For the State of AL can you get me the No of returns for Size of adjusted gross income $50,000 under $75,000",local_model_path)

`low_cpu_mem_usage` was None, now default to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]



[(Document(metadata={}, page_content='a'), 0.5884516575307391), (Document(metadata={}, page_content='i'), 0.6939342521958247), (Document(metadata={}, page_content='n'), 0.7494148097673325), (Document(metadata={}, page_content='t'), 0.7809196602126711), (Document(metadata={}, page_content='r'), 0.7923278734478486)]


'a'