<a href="https://colab.research.google.com/github/githubpradeep/notebooks/blob/main/OmniQuant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!git clone https://github.com/ChenMnZ/AutoGPTQ-bugfix.git

Cloning into 'AutoGPTQ-bugfix'...
remote: Enumerating objects: 3109, done.
remote: Counting objects: 100% (575/575), done.
remote: Compressing objects: 100% (136/136), done.
remote: Total 3109 (delta 475), reused 473 (delta 439), pack-reused 2534
Receiving objects: 100% (3109/3109), 7.64 MiB | 16.71 MiB/s, done.
Resolving deltas: 100% (2062/2062), done.


In [2]:
%cd AutoGPTQ-bugfix

/content/AutoGPTQ-bugfix


In [3]:
!sudo apt-get install -y g++

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
g++ is already the newest version (4:11.2.0-1ubuntu1).
g++ set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 18 not upgraded.


In [4]:
!pip install gekko

Collecting gekko
  Downloading gekko-1.0.6-py3-none-any.whl (12.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.2/12.2 MB 25.9 MB/s eta 0:00:00
Installing collected packages: gekko
Successfully installed gekko-1.0.6


In [None]:
!pip install -v .

In [6]:
import auto_gptq.nn_modules.qlinear.qlinear_cuda as qlinear_cuda


In [7]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [8]:
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_in_model
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import torch
import auto_gptq.nn_modules.qlinear.qlinear_cuda as qlinear_cuda
from transformers.models.falcon.modeling_falcon import FalconLinear
from tqdm import tqdm
import gc
import time

def get_named_linears(module):
    """Return {qualified_name: submodule} for every FalconLinear inside `module`."""
    return {name: m for name, m in module.named_modules() if isinstance(m, FalconLinear)}

def set_op_by_name(layer, name, new_module):
    """Replace the submodule at dotted path `name` (e.g. 'a.0.b') with `new_module`."""
    levels = name.split('.')
    mod_ = layer
    # Walk to the parent of the target; numeric components index into containers.
    for level in levels[:-1]:
        mod_ = mod_[int(level)] if level.isdigit() else getattr(mod_, level)
    setattr(mod_, levels[-1], new_module)
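
To see how `set_op_by_name` resolves a dotted module name, here is a small torch-free sketch. `SimpleNamespace` objects and a plain list stand in for `nn.Module` and `nn.ModuleList`; that substitution is an assumption made only so the example runs without a GPU or model weights. The attribute names `blocks`, `query_key_value`, `dense` and the `"quantized"` placeholder are illustrative, not taken from the Falcon model.

```python
from types import SimpleNamespace

def set_op_by_name(layer, name, new_module):
    # Same traversal as the notebook helper: walk every path component except
    # the last, indexing for numeric parts and using getattr otherwise.
    levels = name.split('.')
    parent = layer
    for level in levels[:-1]:
        parent = parent[int(level)] if level.isdigit() else getattr(parent, level)
    setattr(parent, levels[-1], new_module)

# Toy "layer" (hypothetical structure): a list of blocks plus one top-level op.
layer = SimpleNamespace(
    blocks=[SimpleNamespace(query_key_value="fp16"),
            SimpleNamespace(query_key_value="fp16")],
    dense="fp16",
)

set_op_by_name(layer, "blocks.1.query_key_value", "quantized")  # nested path
set_op_by_name(layer, "dense", "quantized")                     # single-level name

print(layer.blocks[0].query_key_value, layer.blocks[1].query_key_value, layer.dense)
```

Only the targeted entries change; the untouched block keeps its original value, which is the property the per-layer replacement loop relies on.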

In [9]:
!mkdir -p pre_quantized_models/


In [10]:
!git clone https://huggingface.co/ChenMnZ/falcon-7b-omniquant-w3a16g64 ./pre_quantized_models/falcon-7b-omniquant-w3a16g64


Cloning into './pre_quantized_models/falcon-7b-omniquant-w3a16g64'...
remote: Enumerating objects: 11, done.
remote: Total 11 (delta 0), reused 0 (delta 0), pack-reused 11
Unpacking objects: 100% (11/11), 775.56 KiB | 3.31 MiB/s, done.
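
The checkpoint name `w3a16g64` stands for 3-bit weights, 16-bit activations, and a quantization group size of 64. As a rough illustration of what group-wise weight quantization means, the sketch below applies plain min/max round-to-nearest quantization to each group of 64 values. This is a swapped-in baseline technique: it is not OmniQuant's learned calibration and does not reproduce how AutoGPTQ packs codes into integers. The function names are hypothetical.

```python
def quantize_group(weights, wbits=3):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** wbits - 1                      # largest code, 7 for 3 bits
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0      # fall back to 1.0 for a constant group
    zero = round(-w_min / scale)               # integer zero-point
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero

def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero) * scale for c in codes]

def quantize_row(row, wbits=3, group_size=64):
    """Quantize a weight row in independent groups of `group_size` values."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(g, wbits) for g in groups]

# Illustrative data: 128 evenly spaced weights, i.e. two groups of 64.
row = [i / 127 - 0.5 for i in range(128)]
packed = quantize_row(row)

restored = [w for codes, scale, zero in packed
            for w in dequantize_group(codes, scale, zero)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
all_codes = [c for codes, _, _ in packed for c in codes]
print(len(packed), min(all_codes), max(all_codes), max_err)
```

Each group carries its own scale and zero-point, so a smaller group size tracks local weight ranges more closely at the cost of more metadata per weight.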


In [11]:
model_path = './pre_quantized_models/falcon-7b-omniquant-w3a16g64'
wbits = 3        # weight bit-width ("w3")
group_size = 64  # weights per quantization group ("g64")
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
enc = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
# Build the model skeleton on the meta device so no fp16 weights are allocated.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config=config, torch_dtype=torch.float16, trust_remote_code=True)

# Swap every FalconLinear in each decoder block for a QuantLinear of matching shape.
layers = model.transformer.h
for i in tqdm(range(len(layers))):
    layer = layers[i]
    named_linears = get_named_linears(layer)
    for name, module in named_linears.items():
        q_linear = qlinear_cuda.QuantLinear(wbits, group_size, module.in_features, module.out_features, module.bias is not None, kernel_switch_threshold=128)
        q_linear.to(next(layer.parameters()).device)
        set_op_by_name(layer, name, q_linear)
torch.cuda.empty_cache()
gc.collect()
model.tie_weights()
device_map = infer_auto_device_map(model)
print("Loading pre-computed quantized weights...")
# Fill the quantized layers from the checkpoint files in model_path.
load_checkpoint_in_model(model, checkpoint=model_path, device_map=device_map, offload_state_dict=True)
print("Loading pre-computed quantized weights Successfully")

100%|██████████| 32/32 [00:00<00:00, 45.72it/s]


Loading pre-computed quantized weights...
Loading pre-computed quantized weights Successfully
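
For a sense of what `wbits = 3` and `group_size = 64` imply for storage, the arithmetic below estimates the effective bits per quantized weight. It assumes one fp16 scale and one `wbits`-wide zero-point per group, and it ignores any other per-layer metadata, biases, embeddings, and layers left unquantized. Those are assumptions for illustration, not a description of the exact checkpoint layout.

```python
def bits_per_weight(wbits=3, group_size=64, scale_bits=16, zero_bits=None):
    """Effective storage per weight: the code itself plus amortized group metadata."""
    if zero_bits is None:
        zero_bits = wbits  # assumption: zero-points use the same width as the codes
    return wbits + (scale_bits + zero_bits) / group_size

bpw = bits_per_weight()
ratio = 16 / bpw  # compression relative to fp16 storage of the same weights
print(f"{bpw:.6f} bits per weight, about {ratio:.2f}x smaller than fp16")
```

Under these assumptions the per-group metadata adds roughly 0.3 bits to each 3-bit weight, which is why larger groups compress slightly better.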


In [12]:
model.eval()
prompt = "Once upon a time there was a"
input_ids = enc(prompt, return_tensors='pt').input_ids.cuda()
model = model.cuda()
start_time = time.time()
output = model.generate(inputs=input_ids, do_sample=True, top_k=10, max_new_tokens=128)
end_time = time.time()
# output[0] holds the prompt plus the generated tokens, so this rate counts prompt tokens too.
speed = len(output[0])/(end_time-start_time)
print(enc.decode(output[0]))
print(f"speed:{speed}token/s")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
The current implementation of Falcon calls `torch.scaled_dot_product_attention` directly, this will be deprecated in the future in favor of the `BetterTransformer` API. Please install the latest optimum library with `pip install -U optimum` and call `model.to_bettertransformer()` to benefit from `torch.scaled_dot_product_attention` and future performance optimizations.


Once upon a time there was a girl who wanted to have a party for her birthday. She was so excited that she couldn't stop talking about it. Every time the topic came up she was so excited. Her birthday finally came and everyone came to celebrate with her but she wasn't really happy at her birthday. She was so sad and upset that no one understood that she wanted her party.
The girl was me.
When I started to grow up I always wanted something special when it came to celebrating my birthday. I always wanted it to be the best thing ever, but I didn't know how to have the best birthday.
I wanted everyone
speed:3.4299865540630656token/s
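
The reported speed divides the full output length by the elapsed time, and for decoder-only models the sequence returned by `generate()` includes the prompt tokens. A small helper like the one below separates the two ways of counting. The function name and the numbers in the usage lines are hypothetical, not measurements from this notebook.

```python
def tokens_per_second(prompt_len, output_len, elapsed_s, include_prompt=False):
    """Throughput from token counts and wall-clock seconds.

    `output_len` is the length of the sequence returned by generate(),
    which includes the `prompt_len` prompt tokens.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    counted = output_len if include_prompt else output_len - prompt_len
    return counted / elapsed_s

# Hypothetical numbers: 8 prompt tokens, 128 new tokens, 40 seconds.
prompt_len, output_len, elapsed_s = 8, 136, 40.0
with_prompt = tokens_per_second(prompt_len, output_len, elapsed_s, include_prompt=True)
new_only = tokens_per_second(prompt_len, output_len, elapsed_s)
print(with_prompt, new_only)
```

In the notebook, `prompt_len` would be `input_ids.shape[1]` and `output_len` would be `len(output[0])`. For short prompts the two figures are close, but the gap grows with prompt length.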


In [15]:
prompt = "What color is the sky?"
input_ids = enc(prompt, return_tensors='pt').input_ids.cuda()
model = model.cuda()
start_time = time.time()
output = model.generate(inputs=input_ids, do_sample=True, top_k=10, max_new_tokens=128)
end_time = time.time()
# As above, len(output[0]) includes the prompt tokens.
speed = len(output[0])/(end_time-start_time)
print(enc.decode(output[0]))
print(f"speed:{speed}token/s")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


What color is the sky?
It’s the first day of summer! So we went to the beach, and it was a perfect day!
We got a beach umbrella and sat right next to the water. We had lunch, ate a lot, and had fun playing in the sand and water. When the tide came in, we got to see lots of sea turtles coming to the shore to lay eggs. They were really interesting to watch because some would get up on their back legs and start scratching. They’d scratch and scratch until they got the eggs out from the sand. We watched them all the way until the tide went back out.
We
speed:3.435994102067567token/s
