## Homework: Quantizing the Facebook OPT-6.7B model with the AWQ algorithm

In [1]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "facebook/opt-6.7b"
quant_model_dir = "models/opt-6.7b-awq-homework"

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}
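
The three numeric fields in `quant_config` can be illustrated with a small pure-Python sketch. It shows only group-wise, zero-point (asymmetric) round-to-nearest quantization; it is not the AWQ algorithm itself, which additionally searches for activation-aware scaling factors before quantizing. The function names `quantize_group`, `dequantize_group`, and `fake_quantize_row` are hypothetical and not part of AutoAWQ.

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric round-to-nearest quantization of one weight group (illustrative only)."""
    q_max = 2 ** w_bit - 1                     # 15 for 4-bit weights
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0:                             # constant group: avoid division by zero
        scale = 1.0
    # The zero point is the integer code that represents the real value 0.
    zero = max(0, min(q_max, round(-w_min / scale)))
    codes = [max(0, min(q_max, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate real-valued weights."""
    return [(c - zero) * scale for c in codes]


def fake_quantize_row(row, group_size=128, w_bit=4):
    """Quantize then dequantize a row, one group of `group_size` weights at a time."""
    out = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        codes, scale, zero = quantize_group(group, w_bit)
        out.extend(dequantize_group(codes, scale, zero))
    return out
```

Each group of 128 weights gets its own `scale` and `zero`, so a smaller `q_group_size` tracks the local weight range more closely at the cost of storing more scales and zero points.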

In [2]:
# Load the model
model = AutoAWQForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)




In [3]:
# Quantize the model
model.quantize(tokenizer, quant_config=quant_config)

Repo card metadata block was not found. Setting CardData to empty.
AWQ: 100%|██████████| 32/32 [13:19<00:00, 24.99s/it]


In [4]:
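# To make `quant_config` compatible with transformers, the saved configuration must be
# modified: instantiate the quantization config with `transformers.AwqConfig`.
from transformers import AwqConfig

# Build a config that the transformers integration understands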
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# The pretrained transformers model is stored in the `model` attribute; the config must be passed as a dict
model.model.config.quantization_config = quantization_config

In [5]:
# Save the quantized model weights
model.save_quantized(quant_model_dir)
# Save the tokenizer
tokenizer.save_pretrained(quant_model_dir)

('models/opt-6.7b-awq-homework\\tokenizer_config.json',
 'models/opt-6.7b-awq-homework\\special_tokens_map.json',
 'models/opt-6.7b-awq-homework\\vocab.json',
 'models/opt-6.7b-awq-homework\\merges.txt',
 'models/opt-6.7b-awq-homework\\added_tokens.json',
 'models/opt-6.7b-awq-homework\\tokenizer.json')
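
To confirm that the modified quantization config was written to disk, one option is to read `config.json` back from the output directory. This is a minimal sketch that assumes the saved directory contains a `config.json` with a top-level `quantization_config` key; `read_quantization_config` is a hypothetical helper, not a library function.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the `quantization_config` dict from <model_dir>/config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantization_config")
```

Calling `read_quantization_config("models/opt-6.7b-awq-homework")` after the save step should return the dict built from `AwqConfig` above, if the assumption about the file layout holds.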

In [6]:
model.eval()

OptAWQForCausalLM(
  (model): OPTForCausalLM(
    (model): OPTModel(
      (decoder): OPTDecoder(
        (embed_tokens): Embedding(50272, 4096, padding_idx=1)
        (embed_positions): OPTLearnedPositionalEmbedding(2050, 4096)
        (final_layer_norm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (layers): ModuleList(
          (0-31): 32 x OPTDecoderLayer(
            (self_attn): OPTAttention(
              (k_proj): WQLinear_GEMM(in_features=4096, out_features=4096, bias=True, w_bit=4, group_size=128)
              (v_proj): WQLinear_GEMM(in_features=4096, out_features=4096, bias=True, w_bit=4, group_size=128)
              (q_proj): WQLinear_GEMM(in_features=4096, out_features=4096, bias=True, w_bit=4, group_size=128)
              (out_proj): WQLinear_GEMM(in_features=4096, out_features=4096, bias=True, w_bit=4, group_size=128)
            )
            (activation_fn): ReLU()
            (self_attn_layer_norm): LayerNorm((4096,), eps=1e-05, elementwise_affin
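
The layer shapes printed above (`in_features=4096`, `out_features=4096`, `w_bit=4`, `group_size=128`) allow a rough estimate of how much storage one quantized projection needs compared with FP16. The sketch below assumes packed 4-bit weights, one FP16 scale per group and output channel, and one packed 4-bit zero point per group and output channel; that layout is an assumption for illustration, and the bias is ignored. `wqlinear_storage_bytes` is a hypothetical helper.

```python
def wqlinear_storage_bytes(in_features, out_features, w_bit=4, group_size=128):
    """Estimate storage of one quantized linear layer (assumed layout, bias excluded)."""
    groups = in_features // group_size
    qweight = in_features * out_features * w_bit // 8   # packed low-bit weights
    scales = groups * out_features * 2                  # one FP16 scale per group/column
    zeros = groups * out_features * w_bit // 8          # packed zero points
    fp16 = in_features * out_features * 2               # unquantized FP16 baseline
    total = qweight + scales + zeros
    return {"qweight": qweight, "scales": scales, "zeros": zeros,
            "total": total, "fp16": fp16, "ratio": fp16 / total}


stats = wqlinear_storage_bytes(4096, 4096)
print(stats["total"], stats["fp16"], round(stats["ratio"], 2))
# prints: 8716288 33554432 3.85
```

Under these assumptions the per-group scales and zero points add only a few percent on top of the packed weights, so the layer shrinks by a bit less than the ideal factor of 4.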

### Loading the quantized model on a GPU

In [7]:
from transformers import pipeline

quant_model_dir = "models/opt-6.7b-awq-homework"

generator_quant = pipeline('text-generation',
                     model=quant_model_dir,
                     device=0,
                     do_sample=True,
                     num_return_sequences=3)

You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [8]:
generator_quant("Good day! I'm Bob. I would like to")


[{'generated_text': "Good day! I'm Bob. I would like to join your server.\nI've never played"},
 {'generated_text': "Good day! I'm Bob. I would like to write articles for your website. I can have"},
 {'generated_text': "Good day! I'm Bob. I would like to get closer with you by writing you personal letters"}]

In [9]:
generator_quant("The woman worked as a")

[{'generated_text': 'The woman worked as a maid and the man as a labourer, police said.\n\nA'},
 {'generated_text': 'The woman worked as a private security officer for a law enforcement branch of Immigration and Customs Enforcement. The'},
 {'generated_text': 'The woman worked as a maid (chaiwan)\nDid she know anything about the murder,'}]

In [10]:
generator_quant("Merry Christmas! I'm glad to")

[{'generated_text': "Merry Christmas! I'm glad to see that you are still alive.\nI was alive!"},
 {'generated_text': "Merry Christmas! I'm glad to hear you're alive :D\nThank you! I'm"},
 {'generated_text': "Merry Christmas! I'm glad to see a fellow Star Wars lover on here.   EDIT"}]