## Installing AutoAWQ

In [None]:
git clone https://github.com/casper-hansen/AutoAWQ.git
cd AutoAWQ
pip install -e .

In [None]:
pip install transformers datasets

- torch                          2.2.0+cu118
- triton                         3.0.0
- transformers                   4.45.2
- auto_gptq                      0.7.0+cu118
- ~autoawq                        0.2.6+cu118~
- datasets                       3.0.1
- optimum                        1.23.0

## Quantizing with AWQ

In [4]:
import os
import json
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = 'cuda:0'
model_path = "Qwen2-1.5B-Instruct"

In [5]:
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
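To make the `quant_config` fields more concrete, here is a minimal pure-Python sketch of group-wise, zero-point (asymmetric) quantization. It is not AutoAWQ's implementation: the helper names `quantize_group`, `dequantize_group`, and `quantize_row` are hypothetical, and the real library works on tensors with packed integer kernels. The sketch only illustrates how `w_bit` sets the number of integer levels, how `q_group_size` sets how many weights share one scale and zero point, and what `zero_point: True` means.

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric quantization of one group: returns integer codes, scale, zero point."""
    q_max = 2 ** w_bit - 1                     # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max or 1.0     # fall back to 1.0 for a constant group
    zero = min(q_max, max(0, round(-w_min / scale)))
    codes = [min(q_max, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero) * scale for c in codes]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Quantize and dequantize a row of weights, one group at a time."""
    approx = []
    for start in range(0, len(row), q_group_size):
        group = row[start:start + q_group_size]
        codes, scale, zero = quantize_group(group, w_bit)
        approx.extend(dequantize_group(codes, scale, zero))
    return approx


if __name__ == "__main__":
    # Illustrative weights only; real weights come from the model's linear layers.
    row = [((i * 37) % 101 - 40) / 100 for i in range(256)]
    approx = quantize_row(row)
    print("max abs error:", max(abs(a - b) for a, b in zip(row, approx)))
```

A smaller `q_group_size` gives each scale fewer weights to cover, which usually lowers the rounding error at the cost of storing more scales and zero points.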

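Load the tokenizer and the original, unquantized model. The settings in `quant_config` are not used at load time; they are applied later, when `model.quantize` is called.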

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto", safetensors=True)

AWQ quantization requires a calibration dataset.

In [4]:
from datasets import load_dataset
import pandas as pd

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("lamini/lamini_docs")

train_dataset = dataset['train']
train_df = pd.DataFrame(train_dataset)
questions_answers = train_df[['question', 'answer']]

with open('gptq_data_chat_format.jsonl', 'w') as jsonl_file:
    for index, example in questions_answers.iterrows():
        formatted_data = {
            "messages": [
                {"role": "system", "content": "You're a helpful assistant"}, 
                {"role": "user", "content": example['question']},
                {"role": "assistant", "content": example['answer']}
            ]
        }
        jsonl_file.write(json.dumps(formatted_data) + '\n')

In [7]:
def load_jsonl(path):
    # Each line is one JSON object whose "messages" field holds a conversation.
    with open(path, 'r') as file:
        return [json.loads(line)['messages'] for line in file]

eval_data_path = 'gptq_data_chat_format.jsonl'
conversations = load_jsonl(eval_data_path)

conversations[0]

[{'role': 'system', 'content': "You're a helpful assistant"},
 {'role': 'user',
  'content': 'How can I evaluate the performance and quality of the generated text from Lamini models?'},
 {'role': 'assistant',
  'content': "There are several metrics that can be used to evaluate the performance and quality of generated text from Lamini models, including perplexity, BLEU score, and human evaluation. Perplexity measures how well the model predicts the next word in a sequence, while BLEU score measures the similarity between the generated text and a reference text. Human evaluation involves having human judges rate the quality of the generated text based on factors such as coherence, fluency, and relevance. It is recommended to use a combination of these metrics for a comprehensive evaluation of the model's performance."}]

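Define a preprocessing function that renders each conversation into a single text string with the tokenizer's chat template. The result is a list of strings, not tensors; AutoAWQ tokenizes the calibration text itself. Note that the `max_len` argument is currently unused.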

In [13]:
def preprocess(dataset, max_len=2048):
    data = []
    for msg in dataset:
        text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
        data.append(text.strip())
    return data

dataset = preprocess(conversations)
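To show roughly what each calibration string looks like, the sketch below renders a conversation in the ChatML-style layout that Qwen2 chat templates use. `render_chatml` is a hypothetical helper written for illustration; the real template is defined in the tokenizer's configuration and may differ in details, so `tokenizer.apply_chat_template` remains the source of truth.

```python
def render_chatml(messages):
    """Render a list of {"role", "content"} dicts in a ChatML-style layout.

    This is an approximation for illustration only, not the tokenizer's template.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    return "".join(parts).strip()


if __name__ == "__main__":
    example = [
        {"role": "system", "content": "You're a helpful assistant"},
        {"role": "user", "content": "What is AWQ?"},
    ]
    print(render_chatml(example))
```

Because `add_generation_prompt=False` is used for calibration, the rendered text ends with the assistant's answer rather than an open assistant turn.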

Configure the log format:

In [9]:
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

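Run the quantization. AWQ uses the calibration data to collect activation statistics and to choose per-channel scaling factors that protect the most important weights, so that the quantized model stays close to the original on this data distribution.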

In [14]:
%%time
model.quantize(tokenizer, quant_config=quant_config, calib_data=dataset)

AWQ:   0%|          | 0/28 [00:00<?, ?it/s]The attention layers in this model are transitioning from computing the RoPE embeddings internally through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed `position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be removed and `position_embeddings` will be mandatory.
AWQ: 100%|██████████| 28/28 [04:18<00:00,  9.22s/it]

CPU times: user 5min 23s, sys: 23 s, total: 5min 46s
Wall time: 4min 18s
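The idea behind the calibration step can be sketched in plain Python. This is a simplified illustration of activation-aware scaling, not AutoAWQ's code: the function names `fake_quantize`, `output_error`, and `search_scales` are hypothetical, the example handles a single output neuron, and the scale is assumed to depend only on the mean activation magnitude raised to a power `alpha` that is found by grid search.

```python
def fake_quantize(weights, w_bit=4):
    """Quantize to w_bit integers with a zero point, then map back to floats."""
    q_max = 2 ** w_bit - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max or 1.0
    zero = min(q_max, max(0, round(-w_min / scale)))
    return [(min(q_max, max(0, round(w / scale) + zero)) - zero) * scale
            for w in weights]


def output_error(weights, samples, scales):
    """Mean squared error of the neuron output after scaling and quantizing."""
    # Scale weights up per input channel, quantize, and scale inputs down to match.
    quantized = fake_quantize([w * s for w, s in zip(weights, scales)])
    total = 0.0
    for x in samples:
        reference = sum(w * xi for w, xi in zip(weights, x))
        approx = sum(q * xi / s for q, xi, s in zip(quantized, x, scales))
        total += (reference - approx) ** 2
    return total / len(samples)


def search_scales(weights, samples, grid=20):
    """Grid-search alpha in [0, 1]; scales are (mean |activation|) ** alpha."""
    n = len(weights)
    act = [sum(abs(x[j]) for x in samples) / len(samples) for j in range(n)]
    best = None
    for step in range(grid + 1):
        alpha = step / grid
        scales = [max(a, 1e-4) ** alpha for a in act]
        err = output_error(weights, samples, scales)
        if best is None or err < best[0]:
            best = (err, alpha, scales)
    return best


if __name__ == "__main__":
    # Illustrative data: the first input channel has much larger activations.
    weights = [0.5, -0.2, 0.1, 0.9]
    samples = [[10.0, 0.1, 0.2, 0.1], [8.0, -0.2, 0.1, 0.3]]
    err, alpha, scales = search_scales(weights, samples)
    print("best alpha:", alpha, "error:", err)
```

With `alpha = 0` all scales equal 1, which corresponds to plain rounding, so the search can never do worse than the unscaled baseline on the calibration samples.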





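Save the quantized model and the tokenizer.

- `quant_path` is the directory where the quantized model is saved.
- `safetensors=True` saves the weights in the safetensors format, which avoids pickle-based deserialization and loads quickly.
- `shard_size="4GB"` caps the size of each saved weight shard.
- `tokenizer.save_pretrained` saves the tokenizer files alongside the quantized model.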

In [15]:
quant_path = "./Qwen2-1__5B-Instruct-AWQ"
model.save_quantized(quant_path, safetensors=True, shard_size="4GB")
tokenizer.save_pretrained(quant_path)

Note that `shard_checkpoint` is deprecated and will be removed in v4.44. We recommend you using split_torch_state_dict_into_shards from huggingface_hub library


('./Qwen2-1__5B-Instruct-AWQ/tokenizer_config.json',
 './Qwen2-1__5B-Instruct-AWQ/special_tokens_map.json',
 './Qwen2-1__5B-Instruct-AWQ/vocab.json',
 './Qwen2-1__5B-Instruct-AWQ/merges.txt',
 './Qwen2-1__5B-Instruct-AWQ/added_tokens.json',
 './Qwen2-1__5B-Instruct-AWQ/tokenizer.json')
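One quick sanity check after saving is to compare the on-disk size of the original and quantized checkpoints. The helper below is a small sketch using only the standard library; `dir_size_mb` is a hypothetical name, and the two paths in the usage section are taken from this notebook and assumed to exist locally.

```python
import os


def dir_size_mb(path):
    """Return the total size of all files under `path`, in MiB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)


if __name__ == "__main__":
    # Paths assumed from the cells above; adjust them to your environment.
    for path in ("Qwen2-1.5B-Instruct", "./Qwen2-1__5B-Instruct-AWQ"):
        print(f"{path}: {dir_size_mb(path):.1f} MiB")
```

The ratio will not be exactly 4x for a 4-bit model, since scales, zero points, and any layers left unquantized are still stored at higher precision.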

Load the quantized model with Transformers and run a quick generation test:

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "./Qwen2-1__5B-Instruct-AWQ"

model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "简单介绍一下什么是NLP？"
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


自然语言处理（Natural Language Processing，简称 NLP）是一门计算机科学和人工智能领域的分支学科，研究人如何与计算机进行交互，以及计算机如何理解人类的自然语言。NLP的目标是使机器能够“看懂”并理解人类的语言，从而实现文本识别、文本分析、文本生成等功能。

NLP主要涉及两个方面：一是语法分析，即对句子结构进行解析；二是语义分析，即理解句子的意义。此外，NLP还包括语音识别、语音合成等技术，用于将文本转换为语音或从语音中提取信息。

NLP的应用领域非常广泛，包括但不限于聊天机器人、智能客服、自动翻译、文本摘要、情感分析、垃圾邮件过滤、新闻摘要等。随着技术的发展，NLP正在改变我们的生活，帮助我们更好地理解和使用语言。


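Alternatively, load the quantized model with AutoAWQ's `from_quantized`, here with fused layers enabled:

In [11]: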
from transformers import AutoTokenizer, TextStreamer
from awq import AutoAWQForCausalLM

model_name = "./Qwen2-1__5B-Instruct-AWQ"

model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "简单介绍一下什么是NLP？"
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").input_ids.cuda()

generated_ids = model.generate(
    model_inputs,
    max_new_tokens=512,
)

print(generated_ids)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
# model_inputs is already the input_ids tensor here, so slice along the sequence axis.
generated_ids = generated_ids[:, model_inputs.shape[1]:]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Replacing layers...: 100%|██████████| 28/28 [00:08<00:00,  3.31it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


tensor([[151644,   8948,    198,   2610,    525,   1207,  16948,     11,   3465,
            553,  54364,  14817,     13,   1446,    525,    264,  10950,  17847,
             13, 151645,    198, 151644,    872,    198, 100405, 109432, 106582,
             45,  12567,  11319, 151645,    198, 151644,  77091,    198,  99795,
         102064,  54542,   9909,  54281,  11434,  28125,     11,    451,  12567,
           7552,  20412, 104455, 104111, 103799,   3837,  99652,  99556, 104564,
         100007, 115167,  54542, 103971,  99795, 102064,   9909,  29524, 104105,
           5373, 109023,   5373,   8903,  72881,  49567,  74276,     45,  12567,
         104820,  20412,  32555, 104564, 100006, 101128,   5373, 104136,   5373,
          43959,  33108,  54542, 103971,  99795, 102064,   3837, 101982, 101884,
          17340,  32648, 108221,   1773,     45,  12567,  99361,  73670, 110645,
         100694, 100650,   3837,  29524, 102182, 105395,   5373, 108704,  70538,
           5373, 104934, 101

- https://qwen.readthedocs.io/zh-cn/latest/quantization/awq.html#quantize-your-own-model-with-autoawq
- https://github.com/casper-hansen/AutoAWQ