LLM: Fine-Tune > Merge > Quantize > Infer .. on CML

Table of Contents

1. Objective
2. Summary & Benchmark Score
3. Preparation
    3.1. Dataset & Model
    3.2. CML Session
4. bigscience/bloom-1b1
    4.1. Fine-Tune (w/o Quantization) > Merge > Inference
    4.2. Quantize (GPTQ 8-bit) > Inference
5. bigscience/bloom-7b1
    5.1. Fine-Tune (w/o Quantization) > Merge > Inference
    5.2. Fine-Tune (4-bit) > Merge > Inference
    5.3. Quantize (GPTQ 8-bit) > Inference
6. tiiuae/falcon-7b
    6.1. Fine-Tune (w/o Quantization) > Merge > Inference
    6.2. Fine-Tune (8-bit) > Merge > Inference
    6.3. Quantize (GPTQ 8-bit) > Inference
7. Salesforce/codegen2-1B
    7.1. Fine-Tune (w/o Quantization) > Merge > Inference
8. Bonus: Use Custom Gradio for Inference

1. Objective

  1. To create an LLM capable of performing an AI task on a specific dataset, the traditional ML approach would be to train a model from scratch. Studies estimate that it would take nearly 300 years to train a GPT model on a single V100 GPU card, and that excludes the iterative process of testing, retraining and tuning the model to achieve satisfactory results. This is where Parameter-Efficient Fine-Tuning (PEFT) comes in handy: PEFT trains only a subset of the parameters on the defined dataset, thereby substantially decreasing the computational resources and time required.
  2. The Jupyter notebooks provided in this repository illustrate the complete lifecycle of fine-tuning a particular Transformers-based model with a specific dataset: merging the LLM with the trained adapters, quantizing the result and, ultimately, running inference with the correct prompt. The outcomes of these experiments are detailed in the following section. The target use case of the experiments is to train the model on a Text-to-SQL dataset, enabling the translation of plain English into SQL query statements (a prompt-formatting sketch follows this list).
     a. ft-trl-train.ipynb: Run the code cell-by-cell interactively to fine-tune the base model with a local dataset using the TRL (Transformer Reinforcement Learning) mechanism, merge the trained adapters with the base model and, subsequently, perform model inference to validate the results.
     b. quantize_model.ipynb: Quantize the model (post-training) to 8 or even 2 bits using the auto-gptq library.
     c. infer_Qmodel.ipynb: Run inference on the quantized model to validate the results.
     d. gradio_infer.ipynb: You may use this custom Gradio interface to compare the inference results between the base and fine-tuned model.
  3. The experiments also showcase the post-quantization outcome. Quantization allows a model to be loaded into VRAM of constrained capacity, and GPTQ is a post-training method that transforms the fine-tuned model into a smaller footprint. According to the 🤗 leaderboard, a quantized model can run inference without significant degradation of results against scoring standards such as MMLU and HellaSwag. BitsAndBytes (zero-shot quantization) helps further by applying 8-bit or even 4-bit quantization to the model in VRAM to facilitate model training.
  4. Experiments were carried out using the bloom, falcon and codegen2 models, ranging from 1B to 7B parameters. The idea is to measure the actual GPU memory consumption of each task in the PEFT fine-tuning lifecycle above. The results are detailed in the following section and can also serve as a GPU buying guide for a specific LLM use case.
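
  • As a concrete illustration of the training prompt used throughout these experiments, the sketch below builds the "# Instruction / # context / # result" template from a single Text-to-SQL record. The field names ("context", "question", "answer") are assumptions and may differ from the actual Shreyasrp/Text-to-SQL schema.

# Minimal sketch of the prompt template used in the experiments below.
# NOTE: the field names are assumptions; adjust them to the actual dataset columns.
def format_example(row: dict) -> str:
    return (
        "# Instruction:\n"
        "Use the context below to produce the result\n"
        "# context:\n"
        f"{row['context']} {row['question']}\n"
        "# result:\n"
        f"{row['answer']}"
    )

print(format_example({
    "context": "CREATE TABLE book (Title VARCHAR, Writer VARCHAR).",
    "question": "What are the titles of the books whose writer is not Dennis Lee?",
    "answer": "SELECT Title FROM book WHERE Writer <> 'Dennis Lee'",
}))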

2. Summary & Benchmark Score

  • The graph below depicts the GPU memory utilization during each stage. It is computed from the results of the experiments detailed in the tables below.

[image: GPU memory utilization by stage]

  • The tables below summarize the benchmark results when running the experiments with a single Nvidia A100-PCIE-40GB GPU on CML with OpenShift (bare metal):

  a. Time taken to fine-tune the different LLMs with 10% of the Text-to-SQL dataset (file size = 20.7 MB):

| Model       | Fine-Tune Technique | Fine-Tune Duration | Inference Result |
|-------------|---------------------|--------------------|------------------|
| bloom-1b1   | No Quantization     | ~12 mins           | Good             |
| bloom-7b1   | No Quantization     | OOM                | N/A              |
| bloom-7b1   | 4-bit BitsAndBytes  | ~83 mins           | Good             |
| falcon-7b   | No Quantization     | OOM                | N/A              |
| falcon-7b   | 8-bit BitsAndBytes  | ~65 mins           | Good             |
| codegen2-1B | No Quantization     | ~12 mins           | Bad              |

OOM = Out-Of-Memory

  b. Time taken to quantize the fine-tuned model (merged with the PEFT adapters) using the auto-GPTQ technique:

| Model     | Quantization Technique | Quantization Duration | Inference Result |
|-----------|------------------------|-----------------------|------------------|
| bloom-1b1 | auto-gptq 8-bit        | ~5 mins               | Bad              |
| bloom-7b1 | auto-gptq 8-bit        | ~35 mins              | Good             |
| falcon-7b | auto-gptq 8-bit        | ~22 mins              | Good             |

  c. The table below shows the amount of A100-PCIE-40GB GPU memory utilised during each experiment stage with the different models.

| Model       | Fine-Tune Technique | Load (Before Fine-Tune) | During Training | Inference (Merged Model) | During Quantization | Inference (8-bit GPTQ Model) |
|-------------|---------------------|-------------------------|-----------------|--------------------------|---------------------|------------------------------|
| bloom-1b1   | No Quantization     | ~4.5G                   | ~21G            | ~6G                      | ~6G                 | ~2G                          |
| bloom-7b1   | No Quantization     | ~27G                    | OOM             | N/A                      | N/A                 | N/A                          |
| bloom-7b1   | 4-bit BitsAndBytes  | ~6G                     | ~17G            | ~31G                     | ~23G                | ~9G                          |
| falcon-7b   | No Quantization     | ~28G                    | OOM             | N/A                      | N/A                 | N/A                          |
| falcon-7b   | 8-bit BitsAndBytes  | ~8G                     | ~16G            | ~28G                     | ~24G                | ~8G                          |
| codegen2-1B | No Quantization     | ~4.5G                   | ~16G            | ~5G                      | N/A                 | N/A                          |

Summary:

  1. LLM fine-tuning and quantization are VRAM-intensive activities. If you are buying a GPU for fine-tuning purposes, take note of the benchmark results above.
  2. During model training, the model states (optimizer states, gradients and parameters) contribute heavily to VRAM usage. The experiments show that a model with 1B parameters consumes more than 2GB of VRAM when loaded for inference, and VRAM consumption increases by 2x to 4x once fine-tuning/training is under way. Training a model without quantization (fp32) therefore has a high memory overhead. Try reducing the batch size in the event of an OOM error.
  3. During model inference, each billion parameters consumes roughly 4GB of memory in FP32 precision, 2GB in FP16, and 1GB in int8, all excluding additional overhead (estimated ≤ 20%); a helper encoding this rule of thumb is sketched after this list.
  4. When loading a huge model without quantization ends in an OOM error, BitsAndBytes quantization allows the model to fit into the VRAM, at the expense of lower precision. Despite that limitation, the results were acceptable, depending on the use case. As expected, the 4-bit BitsAndBytes setting took longer to train than the 8-bit BitsAndBytes setting.
  5. The auto-gptq post-training quantization mechanism helps to reduce the model size permanently.
  6. Not all pre-trained models are suitable for fine-tuning with the same dataset. The experiments show that falcon-7b and bloom-7b1 produce acceptable results, but codegen2-1B does not.
  7. CPU cores are heavily used when saving/copying the quantized model. You may enable CML's CPU bursting feature to speed up the process.
  8. GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16.
  9. During the training process using BitsAndBytes config, the forward and backward steps are done using FP16 weights and activations.
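
  • A minimal helper encoding the rule of thumb in item 3 (not part of the repository's notebooks; the 20% overhead is the estimate stated above):

# Rough VRAM estimate for inference only: parameters (in billions) x bytes per
# parameter, plus up to ~20% overhead.
def estimate_inference_vram_gb(params_b: float, bytes_per_param: int, overhead: float = 0.20) -> float:
    return params_b * bytes_per_param * (1 + overhead)

for label, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"7B model, {label}: ~{estimate_inference_vram_gb(7, bytes_per_param):.1f} GB upper bound")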

3. Preparation

3.1 Dataset & Model

  • You may download the models (using curl) into a local folder, or point to the model names in the code so that the API connects to the 🤗 site and downloads them directly:
     a. bigscience/bloom-1b1 and bigscience/bloom-7b1
     b. tiiuae/falcon-7b
     c. Salesforce/codegen2-1B

  • You may download the datasets (using curl) into a local folder, or point to the dataset names in the code so that the API connects to the 🤗 site and downloads them directly (a download sketch follows this list):
     a. Dataset for fine-tuning: Shreyasrp/Text-to-SQL
     b. Dataset for quantization: quantization requires sample data to calibrate and improve the quality of the quantization. In this benchmark test, the C4 dataset is used, as only certain datasets are supported.
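
  • If you prefer to stage everything locally first, a sketch along these lines can pull the model and dataset from 🤗 (the local paths and the split name are placeholders):

from huggingface_hub import snapshot_download
from datasets import load_dataset

# Download the base model into a local folder (path is a placeholder).
snapshot_download(repo_id="bigscience/bloom-1b1", local_dir="models/bloom-1b1")

# Download the fine-tuning dataset; the split name is an assumption.
dataset = load_dataset("Shreyasrp/Text-to-SQL", split="train")
print(dataset)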

3.2 CML Session

  • CML runs on the Kubernetes platform. When a CML session is requested, CML instructs K8s to schedule and provision a pod with the required resource profile.
     a. Create a CML project using Python 3.9 with Nvidia GPU runtime.
     b. Create a CML session (Jupyter) with a resource profile of 4 CPUs, 64 GB memory and 1 GPU.
     c. In the CML session, install the necessary Python packages.
pip install -r requirements.txt

4. bigscience/bloom-1b1

4.1. Fine-Tune (w/o Quantization) > Merge > Inference

  • In a CML session, run the Jupyter notebook ft-merge-qt.ipynb to fine-tune, merge and perform a simple inference on the merged/fine-tuned model.

  • Code Snippet:

base_model = AutoModelForCausalLM.from_pretrained(base_model, use_cache = False, device_map=device_map)
  • The output below shows the result of loading the model into VRAM, before running the fine-tuning/training code.
Base Model Memory Footprint in VRAM: 4063.8516 MB
--------------------------------------
Parameters loaded for model bloom-1b1:
Total parameters: 1065.3143 M
Trainable parameters: 1065.3143 M


Data types for loaded model bloom-1b1:
torch.float32, 1065.3143 M, 100.00 %

[image]
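
  • The footprint and parameter breakdown above can be reproduced with standard transformers/PyTorch calls on the loaded base_model; a sketch (output formatting is illustrative):

from collections import Counter

# base_model is the model object loaded by the snippet above.
print(f"Base Model Memory Footprint in VRAM: {base_model.get_memory_footprint() / 1024**2:.4f} MB")

total = sum(p.numel() for p in base_model.parameters())
trainable = sum(p.numel() for p in base_model.parameters() if p.requires_grad)
print(f"Total parameters: {total / 1e6:.4f} M")
print(f"Trainable parameters: {trainable / 1e6:.4f} M")

# Parameter count per dtype, mirroring the "Data types" lines above.
per_dtype = Counter()
for p in base_model.parameters():
    per_dtype[p.dtype] += p.numel()
for dtype, count in per_dtype.items():
    print(f"{dtype}, {count / 1e6:.4f} M, {100 * count / total:.2f} %")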

  • During fine-tuning/training:
[image]
  • It takes ~12mins to complete the training.
{'loss': 0.8376, 'learning_rate': 0.0001936370577755154, 'epoch': 2.03}
{'loss': 0.7142, 'learning_rate': 0.0001935522185458556, 'epoch': 2.03}
{'loss': 0.6476, 'learning_rate': 0.00019346737931619584, 'epoch': 2.03}
{'train_runtime': 715.2236, 'train_samples_per_second': 32.96, 'train_steps_per_second': 16.48, 'train_loss': 0.8183029612163445, 'epoch': 2.03}
Training Done
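  • For orientation, a TRL SFTTrainer setup broadly along these lines could produce training logs like those above; the LoRA targets and hyperparameters shown are assumptions, not the notebook's exact values:

from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

# Assumes base_model, tokenizer and the formatted dataset are already defined.
# All hyperparameters below are illustrative assumptions.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["query_key_value"],  # typical LoRA target for BLOOM-style models
    task_type="CAUSAL_LM",
)
training_args = TrainingArguments(
    output_dir="training_output",
    per_device_train_batch_size=2,
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=1,
)
trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset,          # dataset formatted with the prompt template shown earlier
    peft_config=lora_config,
    dataset_text_field="text",      # assumed name of the formatted text column
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
)
trainer.train()
trainer.save_model("training_output")  # writes the adapter files listed below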
  • Inside the training_output directory:
$ ls -lh
total 23M
-rw-r--r--. 1 cdsw cdsw  427 Nov  6 02:07 adapter_config.json
-rw-r--r--. 1 cdsw cdsw 9.1M Nov  6 02:07 adapter_model.bin
drwxr-xr-x. 2 cdsw cdsw   11 Nov  6 01:59 checkpoint-257
drwxr-xr-x. 2 cdsw cdsw   11 Nov  6 02:03 checkpoint-514
drwxr-xr-x. 2 cdsw cdsw   11 Nov  6 02:07 checkpoint-771
-rw-r--r--. 1 cdsw cdsw   88 Nov  6 02:07 README.md
-rw-r--r--. 1 cdsw cdsw   95 Nov  6 02:07 special_tokens_map.json
-rw-r--r--. 1 cdsw cdsw  983 Nov  6 02:07 tokenizer_config.json
-rw-r--r--. 1 cdsw cdsw  14M Nov  6 02:07 tokenizer.json
-rw-r--r--. 1 cdsw cdsw 4.5K Nov  6 02:07 training_args.bin
  • After the training is completed, merge the base model with the PEFT-trained adapters.
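  • A minimal sketch of this merge step with PEFT, assuming the adapter was saved to training_output (model paths are placeholders):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("bloom-1b1", device_map="auto")   # placeholder path
peft_model = PeftModel.from_pretrained(base, "training_output")

# Fold the LoRA adapter weights back into the base weights and save a standalone model.
merged = peft_model.merge_and_unload()
merged.save_pretrained("bloom-1b1-merged")
AutoTokenizer.from_pretrained("training_output").save_pretrained("bloom-1b1-merged")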

  • Inside the merged model directory:

$ ls -lh
total 4.0G
-rw-r--r--. 1 cdsw cdsw  777 Nov  6 02:07 config.json
-rw-r--r--. 1 cdsw cdsw  137 Nov  6 02:07 generation_config.json
-rw-r--r--. 1 cdsw cdsw 4.0G Nov  6 02:07 pytorch_model.bin
-rw-r--r--. 1 cdsw cdsw   95 Nov  6 02:07 special_tokens_map.json
-rw-r--r--. 1 cdsw cdsw  983 Nov  6 02:07 tokenizer_config.json
-rw-r--r--. 1 cdsw cdsw  14M Nov  6 02:07 tokenizer.json
  • Inside the base model directory:
$ ls -lh
total 6.0G
-rw-r--r--. 1 cdsw cdsw  693 Oct 28 02:22 config.json
-rw-r--r--. 1 cdsw cdsw 2.0G Oct 28 01:32 flax_model.msgpack
-rw-r--r--. 1 cdsw cdsw  16K Oct 28 01:27 LICENSE
-rw-r--r--. 1 cdsw cdsw 2.0G Oct 28 01:31 model.safetensors
drwxr-xr-x. 2 cdsw cdsw   11 Oct 28 01:27 onnx
-rw-r--r--. 1 cdsw cdsw 2.0G Oct 28 01:29 pytorch_model.bin
-rw-r--r--. 1 cdsw cdsw  21K Oct 28 01:27 README.md
-rw-r--r--. 1 cdsw cdsw   85 Oct 28 01:27 special_tokens_map.json
-rw-r--r--. 1 cdsw cdsw  222 Oct 28 01:33 tokenizer_config.json
-rw-r--r--. 1 cdsw cdsw  14M Oct 28 01:33 tokenizer.json
  • Load the merged model into VRAM:
Merged Model Memory Footprint in VRAM: 4063.8516 MB

Data types:
torch.float32, 1065.3143 M, 100.00 %

[image]

  • Run inference on both the fine-tuned/merged model and the base model, and compare the results.
--------------------------------------
Prompt:
# Instruction:
Use the context below to produce the result
# context:
CREATE TABLE book (Title VARCHAR, Writer VARCHAR). What are the titles of the books whose writer is not Dennis Lee?
# result:

--------------------------------------
Fine-tuned Model Result :
SELECT Title FROM book WHERE Writer <> 'Dennis Lee'
Base Model Result :
CREATE TABLE book (Title VARCHAR, Writer VARCHAR). What are the titles of the books whose writer is Dennis Lee?
# result:
CREATE TABLE book (Title VARCHAR, Writer VARCHAR). What are the titles of the books whose writer is not Dennis Lee?
# result:
CREATE TABLE book (Title VARCHAR, Writer VARCHAR). What are the titles of the books whose writer is Dennis Lee?
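  • The side-by-side comparison above can be reproduced with plain generate() calls; a sketch with placeholder model paths and illustrative generation settings:

from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = (
    "# Instruction:\n"
    "Use the context below to produce the result\n"
    "# context:\n"
    "CREATE TABLE book (Title VARCHAR, Writer VARCHAR). "
    "What are the titles of the books whose writer is not Dennis Lee?\n"
    "# result:\n"
)

def generate(model_dir: str) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=64)
    # Decode only the newly generated tokens after the prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print("Fine-tuned Model Result :", generate("bloom-1b1-merged"))  # placeholder path
print("Base Model Result       :", generate("bloom-1b1"))         # placeholder path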

4.2. Quantize (GPTQ 8-bit) > Inference

[image]

[image]

  • Time taken to quantize:
Total Seconds Taken to Quantize Using cuda:0: 282.6761214733124
  • Load the quantized model into VRAM:
cuda:0 Memory Footprint: 1400.0977 MB

Data types:
torch.float16, 385.5053 M, 100.00 %

[image]

  • Run inference on the quantized model and check the result:
--------------------------------------
Prompt:
# Instruction:
Use the context below to produce the result
# context:
CREATE TABLE book (Title VARCHAR, Writer VARCHAR). What are the titles of the books whose writer is not Dennis Lee?
# result:

--------------------------------------
Quantized Model Result :
SELECT Title FROM book WHERE Writer = 'Not Dennis Lee'
  • Inside the quantized directory:
$ ls -lh
total 1.4G
-rw-r--r--. 1 cdsw cdsw 1.4K Nov  6 02:39 config.json
-rw-r--r--. 1 cdsw cdsw  137 Nov  6 02:39 generation_config.json
-rw-r--r--. 1 cdsw cdsw 1.4G Nov  6 02:39 pytorch_model.bin
-rw-r--r--. 1 cdsw cdsw  551 Nov  6 02:39 special_tokens_map.json
-rw-r--r--. 1 cdsw cdsw  983 Nov  6 02:39 tokenizer_config.json
-rw-r--r--. 1 cdsw cdsw  14M Nov  6 02:39 tokenizer.json
  • Snippet of config.json file in the quantized model folder:
"pretraining_tp": 1,
"quantization_config": {
    "batch_size": 1,
    "bits": 8,
    "block_name_to_quantize": "transformer.h",
    "damp_percent": 0.1,
    "dataset": "c4",
    "desc_act": false,
    "disable_exllama": false,
    "group_size": 128,
    "max_input_length": null,
    "model_seqlen": 2048,
    "module_name_preceding_first_block": [...],
    "pad_token_id": null,
    "quant_method": "gptq",
    "sym": true,
    "tokenizer": null,
    "true_sequential": true,
    "use_cuda_fp16": true
}
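  • The quantization_config above is what transformers writes out after GPTQ quantization. A sketch of how such a config could be produced with the transformers GPTQConfig integration (paths are placeholders; the settings mirror the config shown):

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

merged_dir = "bloom-1b1-merged"            # placeholder path to the merged model
tokenizer = AutoTokenizer.from_pretrained(merged_dir)

gptq_config = GPTQConfig(
    bits=8,                # 8-bit weights, as in the config above
    dataset="c4",          # calibration samples are drawn from C4
    tokenizer=tokenizer,
    group_size=128,
    damp_percent=0.1,
)

# Loading with a GPTQConfig triggers post-training quantization
# (requires the optimum and auto-gptq packages).
quantized = AutoModelForCausalLM.from_pretrained(
    merged_dir, quantization_config=gptq_config, device_map="auto"
)
quantized.save_pretrained("bloom-1b1-gptq-8bit")
tokenizer.save_pretrained("bloom-1b1-gptq-8bit")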

5. bigscience/bloom-7b1

5.1. Fine-Tune (w/o Quantization) > Merge > Inference

  • In a CML session, run the Jupyter notebook ft-merge-qt.ipynb to fine-tune, merge and perform a simple inference on the merged/fine-tuned model.

  • Code Snippet:

base_model = AutoModelForCausalLM.from_pretrained(base_model, use_cache = False, device_map=device_map)
  • The output below shows the result of loading the model into VRAM, before running the fine-tuning/training code.
Base Model Memory Footprint in VRAM: 26966.1562 MB
--------------------------------------
Parameters loaded for model bloom-7b1:
Total parameters: 7069.0161 M
Trainable parameters: 7069.0161 M


Data types for loaded model bloom-7b1:
torch.float32, 7069.0161 M, 100.00 %

[image]

  • During fine-tuning/training:
OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 373.94 MiB is free. Process 1793579 has 39.02 GiB memory in use. Of the allocated memory 38.23 GiB is allocated by PyTorch, and 305.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

5.2. Fine-Tune (4-bit) > Merge > Inference

  • Code Snippet:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16"
)
base_model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config, use_cache = False, device_map=device_map)
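  • Before attaching LoRA adapters to a model loaded in 4-bit (or 8-bit), PEFT provides a preparation helper; a sketch with assumed LoRA settings (not necessarily the notebook's exact values):

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepares a k-bit loaded model for training (e.g. casts layer norms to fp32,
# enables input gradients).
base_model = prepare_model_for_kbit_training(base_model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["query_key_value"],  # assumed target for BLOOM's attention projection
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()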
  • The output below shows the result of loading the model into VRAM, before running the fine-tuning/training code.
Base Model Memory Footprint in VRAM: 4843.0781 MB
--------------------------------------
Parameters loaded for model bloom-7b1:
Total parameters: 4049.1172 M
Trainable parameters: 1028.1124 M

Data types for loaded model bloom-7b1:
torch.float16, 1029.2183 M, 25.42 %
torch.uint8, 3019.8989 M, 74.58 %

[image]

  • During fine-tuning/training:
[image]
  • It takes ~83mins to complete the training.
{'loss': 0.5777, 'learning_rate': 0.0001935522185458556, 'epoch': 2.03}
{'loss': 0.5486, 'learning_rate': 0.0001935097989310257, 'epoch': 2.03}
{'loss': 0.465, 'learning_rate': 0.00019346737931619584, 'epoch': 2.03}
{'train_runtime': 5024.8159, 'train_samples_per_second': 4.692, 'train_steps_per_second': 4.692, 'train_loss': 0.6570684858410584, 'epoch': 2.03}
Training Done
  • After training is completed, merge the base model with the PEFT-trained adapters.

  • Load the merged model into VRAM:

Merged Model Memory Footprint in VRAM: 26966.1562 MB

Data types:
torch.float32, 7069.0161 M, 100.00 %

[image]

  • Run inference on the fine-tuned/merged model and check the result:
--------------------------------------
Prompt:
# Instruction:
Use the context below to produce the result
# context:
CREATE TABLE book (Title VARCHAR, Writer VARCHAR). What are the titles of the books whose writer is not Dennis Lee?
# result:

--------------------------------------
Fine-tuned Model Result :
SELECT Title FROM book WHERE Writer <> "Dennis Lee"

5.3. Quantize (GPTQ 8-bit) > Inference

[image]
[image]
[image]
  • Time taken to quantize:
Total Seconds Taken to Quantize Using cuda:0: 2073.348790884018
  • Load the quantized model into VRAM:
cuda:0 Memory Footprint: 7861.3594 MB

Data types:
torch.float16, 1028.1124 M, 100.00 %
[image]
  • Run inference on the quantized model and check the result:
--------------------------------------
Prompt:
# Instruction:
Use the context below to produce the result
# context:
CREATE TABLE book (Title VARCHAR, Writer VARCHAR). What are the titles of the books whose writer is not Dennis Lee?
# result:

--------------------------------------
Quantized Model Result :
SELECT Title FROM book WHERE Writer <> "Dennis Lee"
  • Snippet of config.json file in the quantized model folder:
"quantization_config": {
    "batch_size": 1,
    "bits": 8,
    "block_name_to_quantize": "transformer.h",
    "damp_percent": 0.1,
    "dataset": "c4",
    "desc_act": false,
    "disable_exllama": true,
    "group_size": 128,
    "max_input_length": null,
    "model_seqlen": 2048,
    "module_name_preceding_first_block": [...],
    "pad_token_id": null,
    "quant_method": "gptq",
    "sym": true,
    "tokenizer": null,
    "true_sequential": true,
    "use_cuda_fp16": true
}

6. tiiuae/falcon-7b

6.1. Fine-Tune (w/o Quantization) > Merge > Inference

  • In a CML session, run the Jupyter notebook ft-merge-qt.ipynb to fine-tune, merge and perform a simple inference on the merged/fine-tuned model.

  • Code Snippet:

base_model = AutoModelForCausalLM.from_pretrained(base_model, use_cache = False, device_map=device_map)
  • The output below shows the result of loading the model into VRAM, before running the fine-tuning/training code.
Base Model Memory Footprint in VRAM: 26404.2729 MB
--------------------------------------
Parameters loaded for model falcon-7b:
Total parameters: 6921.7207 M
Trainable parameters: 6921.7207 M


Data types for loaded model falcon-7b:
torch.float32, 6921.7207 M, 100.00 %

[image]

  • During fine-tuning/training:
OutOfMemoryError: CUDA out of memory. Tried to allocate 1.11 GiB. GPU 0 has a total capacty of 39.39 GiB of which 345.94 MiB is free. Process 1618370 has 39.04 GiB memory in use. Of the allocated memory 37.50 GiB is allocated by PyTorch, and 1.05 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

[image]

6.2. Fine-Tune (8-bit) > Merge > Inference

  • Code Snippet:
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)
base_model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=bnb_config, use_cache = False, device_map=device_map)
  • The output below shows the result of loading the model into VRAM, before running the fine-tuning/training code.
Base Model Memory Footprint in VRAM: 6883.1384 MB
--------------------------------------
Parameters loaded for model falcon-7b:
Total parameters: 6921.7207 M
Trainable parameters: 295.7690 M


Data types for loaded model falcon-7b:
torch.float16, 295.7690 M, 4.27 %
torch.int8, 6625.9517 M, 95.73 %

[image]

  • During fine-tuning/training:
[image]
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
  • It takes ~65mins to complete the training.
{'loss': 0.5285, 'learning_rate': 0.00019198269279714942, 'epoch': 2.04}
{'loss': 0.4823, 'learning_rate': 0.00019194027318231952, 'epoch': 2.04}
{'loss': 0.4703, 'learning_rate': 0.00019189785356748962, 'epoch': 2.04}
{'train_runtime': 3911.2114, 'train_samples_per_second': 6.027, 'train_steps_per_second': 6.027, 'train_loss': 0.5239265531830902, 'epoch': 2.04}
Training Done
  • After the training is completed, merge the base model with the PEFT-trained adapters.

  • Inside the merged model directory:

$ ls -lh
total 26G
-rw-r--r--. 1 cdsw cdsw 1.2K Nov  6 04:55 config.json
-rw-r--r--. 1 cdsw cdsw  118 Nov  6 04:55 generation_config.json
-rw-r--r--. 1 cdsw cdsw 4.7G Nov  6 04:55 pytorch_model-00001-of-00006.bin
-rw-r--r--. 1 cdsw cdsw 4.7G Nov  6 04:55 pytorch_model-00002-of-00006.bin
-rw-r--r--. 1 cdsw cdsw 4.7G Nov  6 04:55 pytorch_model-00003-of-00006.bin
-rw-r--r--. 1 cdsw cdsw 4.7G Nov  6 04:55 pytorch_model-00004-of-00006.bin
-rw-r--r--. 1 cdsw cdsw 4.7G Nov  6 04:55 pytorch_model-00005-of-00006.bin
-rw-r--r--. 1 cdsw cdsw 2.7G Nov  6 04:55 pytorch_model-00006-of-00006.bin
-rw-r--r--. 1 cdsw cdsw  17K Nov  6 04:55 pytorch_model.bin.index.json
-rw-r--r--. 1 cdsw cdsw  313 Nov  6 04:55 special_tokens_map.json
-rw-r--r--. 1 cdsw cdsw 2.6K Nov  6 04:55 tokenizer_config.json
-rw-r--r--. 1 cdsw cdsw 2.7M Nov  6 04:55 tokenizer.json
  • Load the merged model into VRAM:
Merged Model Memory Footprint in VRAM: 26404.2729 MB

Data types:
torch.float32, 6921.7207 M, 100.00 %

[image]

  • Run inference on both the fine-tuned/merged model and the base model, and compare the results.
--------------------------------------
Prompt:
# Instruction:
Use the context below to produce the result
# context:
CREATE TABLE book (Title VARCHAR, Writer VARCHAR). What are the titles of the books whose writer is not Dennis Lee?
# result:

--------------------------------------
Fine-tuned Model Result :
SELECT Title FROM book WHERE Writer <> 'Dennis Lee'
Base Model Result :
Title Writer
# Explanation:
The result shows the titles of the books whose writer is not Dennis Lee.
# 5.3.3.4.4.3.4.3.4.3.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2.2

6.3. Quantize (GPTQ 8-bit) > Inference

[image]
  • Time taken to quantize:
Total Seconds Taken to Quantize Using cuda:0: 1312.4991219043732
  • Load the quantized model into VRAM:
cuda:0 Memory Footprint: 7038.3259 MB

Data types:
torch.float16, 295.7690 M, 100.00 %
[image]
  • Run inference on the quantized model and check the result:
--------------------------------------
Prompt:
# Instruction:
Use the context below to produce the result
# context:
CREATE TABLE book (Title VARCHAR, Writer VARCHAR). What are the titles of the books whose writer is not Dennis Lee?
# result:

--------------------------------------
Quantized Model Result :
SELECT Title FROM book WHERE Writer <> 'Dennis Lee'
  • Inside the quantized directory:
$ ls -lh
total 6.9G
-rw-r--r--. 1 cdsw cdsw 1.7K Nov  6 05:26 config.json
-rw-r--r--. 1 cdsw cdsw  118 Nov  6 05:26 generation_config.json
-rw-r--r--. 1 cdsw cdsw 4.7G Nov  6 05:26 pytorch_model-00001-of-00002.bin
-rw-r--r--. 1 cdsw cdsw 2.3G Nov  6 05:26 pytorch_model-00002-of-00002.bin
-rw-r--r--. 1 cdsw cdsw  61K Nov  6 05:26 pytorch_model.bin.index.json
-rw-r--r--. 1 cdsw cdsw  541 Nov  6 05:26 special_tokens_map.json
-rw-r--r--. 1 cdsw cdsw 2.6K Nov  6 05:26 tokenizer_config.json
-rw-r--r--. 1 cdsw cdsw 2.7M Nov  6 05:26 tokenizer.json
  • Snippet of config.json file in the quantized model folder:
"quantization_config": {
    "batch_size": 1,
    "bits": 8,
    "block_name_to_quantize": "transformer.h",
    "damp_percent": 0.1,
    "dataset": "c4",
    "desc_act": false,
    "disable_exllama": false,
    "group_size": 128,
    "max_input_length": null,
    "model_seqlen": 2048,
    "module_name_preceding_first_block": [...],
    "pad_token_id": null,
    "quant_method": "gptq",
    "sym": true,
    "tokenizer": null,
    "true_sequential": true,
    "use_cuda_fp16": true
},
"rope_scaling": null,
"rope_theta": 10000,
"torch_dtype": "float16",
"transformers_version": "4.35.0.dev0",
"use_cache": true,
"vocab_size": 65024

7. Salesforce/codegen2-1B

7.1. Fine-Tune (w/o Quantization) > Merge > Inference

  • In a CML session, run the Jupyter notebook ft-merge-qt.ipynb to fine-tune, merge and perform a simple inference on the merged/fine-tuned model.

  • Code Snippet:

base_model = AutoModelForCausalLM.from_pretrained(base_model, use_cache = False, device_map=device_map)
  • The output below shows the result of loading the model into VRAM, before running the fine-tuning/training code.
Base Model Memory Footprint in VRAM: 3937.0859 MB
--------------------------------------
Parameters loaded for model codegen2-1B:
Total parameters: 1015.3062 M
Trainable parameters: 1015.3062 M


Data types for loaded model codegen2-1B:
torch.float32, 1015.3062 M, 100.00 %
[image]
  • During fine-tuning/training:
[image]
  • It takes ~12mins to complete the training.
{'loss': 2.8109, 'learning_rate': 0.00019189785356748962, 'epoch': 2.04}
{'loss': 2.2957, 'learning_rate': 0.00019185543395265972, 'epoch': 2.04}
{'loss': 2.598, 'learning_rate': 0.00019181301433782982, 'epoch': 2.04}
{'train_runtime': 683.683, 'train_samples_per_second': 34.481, 'train_steps_per_second': 34.481, 'train_loss': 3.380507248720025, 'epoch': 2.04}
Training Done
  • After the training is completed, merge the base model with the PEFT-trained adapters.
  • Load the merged model into VRAM:
Merged Model Memory Footprint in VRAM: 3937.0859 MB

Data types:
torch.float32, 1015.3062 M, 100.00 %

[image]

  • Run inference on the fine-tuned/merged model and the base model:
--------------------------------------
Prompt:
# Instruction:
Use the context below to produce the result
# context:
CREATE TABLE book (Title VARCHAR, Writer VARCHAR). What are the titles of the books whose writer is not Dennis Lee?
# result:

--------------------------------------
Fine-tuned Model Result :
    Result:
    SELECT t1.name FROM table_code JOINCT (name INTEGER), How many customers who have a department?
Base Model Result :
port,,vt,(vt((var(,st#

8. Bonus: Use Custom Gradio for Inference

  • In a CML session, run the Jupyter notebook gradio_infer.ipynb to perform inference on a specific model using the custom Gradio interface.
  • This Gradio interface is designed to compare the inference results between the base model and the fine-tuned/merged model.
  • It also displays the GPU memory status after the selected model has been loaded successfully. The user experience is depicted below.
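
  • A minimal sketch of such a comparison interface (model paths and generation settings are placeholders; the actual gradio_infer.ipynb may differ):

import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_DIR, FT_DIR = "bloom-1b1", "bloom-1b1-merged"   # placeholder paths

def load(model_dir):
    tok = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")
    return tok, model

base_tok, base_model = load(BASE_DIR)
ft_tok, ft_model = load(FT_DIR)

def answer(tok, model, prompt):
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def compare(prompt):
    gpu = f"VRAM allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB"
    return answer(base_tok, base_model, prompt), answer(ft_tok, ft_model, prompt), gpu

demo = gr.Interface(
    fn=compare,
    inputs=gr.Textbox(lines=8, label="Prompt"),
    outputs=[gr.Textbox(label="Base Model Result"),
             gr.Textbox(label="Fine-tuned Model Result"),
             gr.Textbox(label="GPU memory status")],
)
demo.launch()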
