In [None]:
! pip install -q datasets transformers trl peft accelerate bitsandbytes auto-gptq optimum ctransformers[cuda] vllm

In [None]:
#from google.colab import drive
#drive.mount('/content/drive' )# force_remount=True
#%cd drive/MyDrive/Zephyr

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Zephyr


In [None]:
import torch
from datasets import load_dataset, Dataset
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig, TrainingArguments
from trl import SFTTrainer

# GGUF

Although GPTQ compresses models well, its focus on GPU inference is a disadvantage if you do not have the hardware to run it.

GGUF, the successor to GGML, is a file format for quantized models that lets you run an LLM on the **CPU** while optionally offloading some of its layers to the GPU for a speed-up.
Although CPU inference is generally slower than GPU inference, it is an incredible format for those running models on CPU or Apple devices.

With smaller and more capable models such as Mistral 7B appearing, the GGUF format might just be here to stay!
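
To get a feel for why quantization makes a 7B model practical on modest hardware, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is only a sketch: `weight_memory_gb` is a hypothetical helper, the parameter count is rounded to 7 billion as an assumption, and the estimate ignores the KV cache, activations, and the extra metadata (scales and similar) that quantized formats store.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: a "7B" model rounded to 7 billion parameters.
NUM_PARAMS = 7e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink from roughly 14 GB at 16 bits to roughly 3.5 GB at 4 bits, which is what brings a model of this size within reach of a laptop CPU or a small GPU.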

In [None]:
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer, pipeline

# Load LLM and Tokenizer
# Use `gpu_layers` to specify how many layers will be offloaded to the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-beta-GGUF",
    model_file="zephyr-7b-beta.Q4_K_M.gguf",
    model_type="mistral", gpu_layers=50, hf=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", use_fast=True
)

# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
prompt = "<|system|>\nYou are a friendly chatbot.</s>\n<|user|>\nTell me a funny joke about Large Language Models.</s>\n<|assistant|>\n"
print(prompt)

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>
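
The prompt above follows a fixed layout: each turn is a role tag, the content, and a closing `</s>`, with an open `<|assistant|>` tag at the end. Rather than writing the string by hand, it can be assembled from a list of messages. The sketch below uses a hypothetical helper, `build_zephyr_prompt`, written only to reproduce the layout shown here; in practice the tokenizer's own chat template is probably the safer choice.

```python
def build_zephyr_prompt(messages):
    """Render a list of {'role', 'content'} dicts into the prompt layout printed above."""
    parts = [f"<|{m['role']}|>\n{m['content']}</s>\n" for m in messages]
    # End with an open assistant turn so the model writes the reply.
    return "".join(parts) + "<|assistant|>\n"


messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me a funny joke about Large Language Models."},
]
prompt = build_zephyr_prompt(messages)
print(prompt)
```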



In [None]:
# We will use the same prompt as we did originally
outputs = pipe(prompt, max_new_tokens=256)
print(outputs[0]["generated_text"])

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>
Why did the Large Language Model go to the party?

To impress everyone with its vocabulary!

But unfortunately, it kept repeating the same jokes over and over again, making everyone groan and roll their eyes. The partygoers soon realized that the Large Language Model was more of a party pooper than a party animal.

Moral of the story: Just because a Large Language Model can generate a lot of words, doesn't mean it knows how to be funny or entertaining. Sometimes, less is more!
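
The pipeline output above echoes the full prompt before the completion. If only the model's reply is needed, one option is to split on the final assistant tag. This is a minimal sketch with a hypothetical helper, `extract_reply`, and it assumes the tag appears exactly as in the prompt used here.

```python
ASSISTANT_TAG = "<|assistant|>\n"


def extract_reply(generated_text: str) -> str:
    """Return the text after the last assistant tag (or the whole text if the tag is absent)."""
    # rpartition returns the full string as its last element when the tag is not found.
    return generated_text.rpartition(ASSISTANT_TAG)[2].strip()


sample = "<|user|>\nHi</s>\n<|assistant|>\nHello there!\n"
print(extract_reply(sample))
```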


In [None]:
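import gc

# Run garbage collection, then release memory cached by PyTorch's allocator.
# Objects that are still referenced (such as `model` and `pipe`) keep their
# memory until those references are deleted.
gc.collect()
torch.cuda.empty_cache()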

In [None]:
!nvidia-smi

Thu Feb 15 01:11:40 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P0              30W /  70W |   5397MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
# Install vllm dependency
#!pip install vllm

# AWQ

A new format on the block is AWQ ([Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978)), a quantization method similar to GPTQ. The two methods differ in several ways, but the most important difference is that AWQ assumes that not all weights are equally important for an LLM's performance.

In other words, a small fraction of salient weights, identified from activation statistics, is protected during quantization, which reduces the quantization loss.

As a result, the paper reports a significant speed-up compared to GPTQ whilst keeping similar, and sometimes even better, performance.
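
The core idea can be illustrated with a toy example. The AWQ paper protects salient weights by scaling their channels up before quantization and scaling the matching activations down by the same factor, so the product is unchanged but the rounding error on those weights shrinks. The sketch below is a heavy simplification under stated assumptions: the weights, activations, and scale factors are made-up toy values, the scale is hand-picked rather than searched from calibration data, and one quantization step is shared across the whole row.

```python
def quantize_rtn(weights, n_bits=4):
    """Symmetric round-to-nearest quantization, returned as dequantized floats."""
    q_max = 2 ** (n_bits - 1) - 1  # 7 for 4-bit
    step = max(abs(w) for w in weights) / q_max
    return [round(w / step) * step for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy values (assumptions): input channel 1 sees much larger activations.
weights = [7.0, 1.4, -3.0, 0.0]
activations = [0.1, 10.0, 0.1, 0.1]
scales = [1.0, 2.0, 1.0, 1.0]  # hand-picked here; AWQ searches for these

exact = dot(weights, activations)

# Plain rounding treats every weight the same.
plain = dot(quantize_rtn(weights), activations)

# Activation-aware variant: scale the salient weight up before quantizing
# and scale its activation down by the same factor.
scaled_weights = quantize_rtn([w * s for w, s in zip(weights, scales)])
scaled_activations = [x / s for x, s in zip(activations, scales)]
aware = dot(scaled_weights, scaled_activations)

err_plain = abs(exact - plain)
err_aware = abs(exact - aware)
print(f"error with plain rounding:   {err_plain:.2f}")
print(f"error with channel scaling:  {err_aware:.2f}")
```

In this toy setup the output error drops from about 4.0 to about 1.0, because the weight that multiplies the largest activation is rounded on a finer effective grid.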

In [None]:
from vllm import LLM, SamplingParams

# Sampling parameters: temperature=0.0 makes decoding greedy (deterministic)
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=256)

# Load the LLM
llm = LLM(
    model="TheBloke/zephyr-7B-beta-AWQ",
    quantization='awq',
    dtype='half',
    gpu_memory_utilization=0.95,
    max_model_len=4096
)

INFO 02-15 01:11:45 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/zephyr-7B-beta-AWQ', tokenizer='TheBloke/zephyr-7B-beta-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, seed=0)
INFO 02-15 01:11:50 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 02-15 01:12:12 llm_engine.py:322] # GPU blocks: 1732, # CPU blocks: 2048
INFO 02-15 01:12:14 model_runner.py:632] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 02-15 01:12:14 model_runner.py:636] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `

In [None]:
prompt = "<|system|>\nYou are a friendly chatbot.</s>\n<|user|>\nTell me a funny joke about Large Language Models.</s>\n<|assistant|>\n"
print(prompt)

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>



In [None]:
# Generate output based on the input prompt and sampling parameters
output = llm.generate(prompt, sampling_params)
print(output[0].outputs[0].text)

Processed prompts: 100%|██████████| 1/1 [00:08<00:00,  8.79s/it]

Why did the Large Language Model go to the party?

To network and expand its vocabulary!

Why did the Large Language Model blush?

Because it overheard another model saying it was a little too wordy!

Why did the Large Language Model get kicked out of the library?

It was being too loud and kept interrupting other models' conversations with its endless chatter!

Why did the Large Language Model get a standing ovation at the comedy club?

Because it told some really punny jokes!

Why did the Large Language Model get a job as a writer?

Because it was the most wordy model in the room!

Why did the Large Language Model get a job as a librarian?

Because it knew all the right words to shelve books in the right place!

Why did the Large Language Model get a job as a teacher?

Because it knew all the right words to help students learn and grow!

Why did the Large Language Model get a job as a lawyer?

Because it knew all the right words to argue a case in court!

Why did the Large Language M




In [None]:
# Free memory before loading the next model.
# Uncomment the `del` line to drop references to the earlier models first.
#del model, pipe, tokenizer, llm
import gc
gc.collect()
# Release memory cached by PyTorch's allocator
torch.cuda.empty_cache()

In [None]:
!nvidia-smi

Thu Feb 15 01:15:23 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   64C    P0              30W /  70W |   2237MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# GPTQ

GPTQ is a post-training quantization (PTQ) method for 4-bit quantization that focuses primarily on **GPU** inference and performance.

The idea behind the method is to compress all weights to 4 bits, layer by layer, by minimizing the mean squared error between each layer's original output and the output produced with the quantized weights. During inference, it dynamically dequantizes its weights to float16 for improved performance whilst keeping memory low.
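
To make "store in 4 bits, dequantize at inference" concrete, here is a minimal sketch of plain round-to-nearest quantization. It is not GPTQ itself: GPTQ additionally adjusts the remaining weights to compensate for each rounding error, which this baseline does not do. The helper names and the toy weights are assumptions for illustration.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in [-8, 7] plus one shared scale (round-to-nearest)."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # fall back to 1.0 if all weights are zero
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the stored codes."""
    return [c * scale for c in codes]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


weights = [0.70, -0.32, 0.12, -0.04]  # toy values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)

print("codes:   ", codes)
print("restored:", [round(r, 3) for r in restored])
print("mse:     ", mse(weights, restored))
```

Only the integer codes and the scale need to be stored. The float values are rebuilt on the fly, at the cost of the small reconstruction error measured by `mse`.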

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load LLM and Tokenizer
model_id = "TheBloke/zephyr-7B-beta-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=False,
    revision="main"
)

# Create a pipeline
pipe = pipeline(model=model, tokenizer=tokenizer, task='text-generation')

Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.


In [None]:
prompt = "<|system|>\nYou are a friendly chatbot.</s>\n<|user|>\nTell me a funny joke about Large Language Models.</s>\n<|assistant|>\n"
print(prompt)

<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>



In [None]:
# We will use the same prompt as we did originally
outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    top_p=0.95
)
print(outputs[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<|system|>
You are a friendly chatbot.</s>
<|user|>
Tell me a funny joke about Large Language Models.</s>
<|assistant|>
Why did the Large Language Model go to the party?

To make some small talk!

(Large Language Models are artificial intelligence models trained on vast amounts of text data to generate human-like responses. They are not capable of small talk in the traditional sense, but this joke plays on the idea that they can generate human-like responses.)
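
The `temperature` and `top_p` arguments used above control how the next token is drawn. The sketch below shows one common way these two settings interact: temperature rescales the scores, then nucleus (top-p) filtering keeps only the most likely tokens. It is a simplified illustration, not the code used by either library. `sample_top_p` is a hypothetical helper, the logits are toy values, and treating a temperature of 0 as greedy decoding is an assumption of this sketch.

```python
import math
import random


def sample_top_p(logits, temperature=0.1, top_p=0.95, rng=random):
    """Pick a token index using temperature scaling followed by nucleus (top-p) filtering."""
    if temperature <= 0:
        # Treated here as greedy decoding: always take the most likely token.
        return max(range(len(logits)), key=logits.__getitem__)

    # Softmax over temperature-scaled logits (shifted by the max for numerical stability).
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a four-token vocabulary
print(sample_top_p(logits, temperature=0.0))
print(sample_top_p(logits, temperature=0.1, top_p=0.95))
```

With a temperature as low as 0.1, almost all probability mass lands on the highest-scoring token, so the result is close to greedy decoding even though sampling is enabled.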
