## Load 72B AWQ Model using vLLM on L4 x4

In this notebook, we load the AWQ-quantized version of Qwen/Qwen2.5-72B-Instruct. The model card on Hugging Face can be found [here](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct-AWQ).

### Next steps

The Qwen2.5 family also includes specialized math models. A good starting point might be to AWQ-quantize their best math model: [Qwen/Qwen2.5-Math-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-72B-Instruct)!
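
As a rough illustration of that quantization step, here is a minimal sketch. It assumes the AutoAWQ package (`pip install autoawq`), which this notebook does not itself name or use. The function name `quantize_to_awq`, the output directory, and the `QUANT_CONFIG` values (4-bit weights, group size 128, as commonly seen in AutoAWQ examples) are illustrative assumptions, not the settings behind any published checkpoint.

```python
# Assumption: AutoAWQ is the quantization tool; the notebook does not specify one.
QUANT_CONFIG = {
    "zero_point": True,   # asymmetric quantization with zero points
    "q_group_size": 128,  # number of weights sharing one scale/zero point
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM",    # AutoAWQ kernel layout
}


def quantize_to_awq(model_path, quant_path, quant_config=QUANT_CONFIG):
    """Quantize a Hugging Face causal LM with AWQ and save it to quant_path."""
    # Imported lazily so that defining this helper does not require the packages.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Calibrates and quantizes the weights (uses AutoAWQ's default calibration data).
    model.quantize(tokenizer, quant_config=quant_config)

    # Save the quantized weights and the tokenizer side by side.
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
    return quant_path


# Example call (hypothetical output directory). Quantizing a 72B model needs far
# more memory and time than a typical notebook session provides:
# quantize_to_awq("Qwen/Qwen2.5-Math-72B-Instruct", "qwen2.5-math-72b-instruct-awq")
```

The resulting directory could then be passed to `LLM(...)` in place of `llm_model_pth`, provided the quantized weights fit on the available GPUs.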

The sky's the limit, happy kaggling!

In [1]:
import os
import gc
import ctypes
import warnings

import torch
from vllm import LLM, SamplingParams


In [2]:
warnings.simplefilter('ignore')

os.environ["CUDA_VISIBLE_DEVICES"]   = "0,1,2,3"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

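def clean_memory(deep=False):
    """Release unused Python, host, and CUDA memory."""
    gc.collect()                                 # Collect unreachable Python objects
    if deep:
        ctypes.CDLL("libc.so.6").malloc_trim(0)  # Return freed heap pages to the OS (glibc/Linux only)
    torch.cuda.empty_cache()                     # Release cached, unused CUDA memory held by PyTorch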

In [3]:
llm_model_pth = '/kaggle/input/qwen2.5/transformers/72b-instruct-awq/1'

In [4]:
llm = LLM(
    llm_model_pth,
    dtype="half",                # The data type for the model weights and activations
    max_num_seqs=8,              # Maximum number of sequences per iteration. Default is 256
    max_model_len=4096,          # Model context length
    trust_remote_code=True,      # Trust remote code (e.g., from HuggingFace) when downloading the model and tokenizer
    tensor_parallel_size=4,      # The number of GPUs to use for distributed execution with tensor parallelism
    gpu_memory_utilization=0.98, # Fraction (between 0 and 1) of each GPU's memory that vLLM may use for weights, activations, and KV cache
)

INFO 11-09 21:24:21 awq_marlin.py:97] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 11-09 21:24:21 config.py:905] Defaulting to use mp for distributed inference
INFO 11-09 21:24:21 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/kaggle/input/qwen2.5/transformers/72b-instruct-awq/1', speculative_config=None, tokenizer='/kaggle/input/qwen2.5/transformers/72b-instruct-awq/1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_confi

Loading safetensors checkpoint shards:   0% Completed | 0/11 [00:00<?, ?it/s]


(VllmWorkerProcess pid=1103) INFO 11-09 21:28:39 model_runner.py:1067] Loading model weights took 9.7875 GB
INFO 11-09 21:28:39 model_runner.py:1067] Loading model weights took 9.7875 GB
(VllmWorkerProcess pid=1102) INFO 11-09 21:28:39 model_runner.py:1067] Loading model weights took 9.7875 GB
(VllmWorkerProcess pid=1101) INFO 11-09 21:28:39 model_runner.py:1067] Loading model weights took 9.7875 GB
INFO 11-09 21:28:46 distributed_gpu_executor.py:57] # GPU blocks: 9178, # CPU blocks: 3276
INFO 11-09 21:28:46 distributed_gpu_executor.py:61] Maximum concurrency for 4096 tokens per request: 35.85x
INFO 11-09 21:28:49 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-09 21:28:49 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running 
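
The numbers in the log above can be cross-checked with a little arithmetic. The sketch below uses only values printed in the log, plus one assumption: a KV-cache block size of 16 tokens, which is vLLM's default but does not appear in the log. The helper names are illustrative, not part of vLLM.

```python
# Values copied from the vLLM log above.
NUM_GPU_BLOCKS = 9178         # "# GPU blocks: 9178"
MAX_MODEL_LEN = 4096          # max_model_len passed to LLM(...)
WEIGHTS_PER_GPU_GB = 9.7875   # "Loading model weights took 9.7875 GB" (per worker)
TENSOR_PARALLEL_SIZE = 4      # tensor_parallel_size passed to LLM(...)

# Assumption: default KV-cache block size of 16 tokens (not shown in the log).
BLOCK_SIZE = 16


def kv_cache_capacity_tokens(num_blocks, block_size=BLOCK_SIZE):
    """Total number of tokens the GPU KV cache can hold."""
    return num_blocks * block_size


def max_concurrency(num_blocks, max_model_len, block_size=BLOCK_SIZE):
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_capacity_tokens(num_blocks, block_size) / max_model_len


capacity = kv_cache_capacity_tokens(NUM_GPU_BLOCKS)
concurrency = max_concurrency(NUM_GPU_BLOCKS, MAX_MODEL_LEN)
total_weights_gb = WEIGHTS_PER_GPU_GB * TENSOR_PARALLEL_SIZE

print(f"KV cache capacity: {capacity} tokens")      # 146848 tokens
print(f"Maximum concurrency: {concurrency:.2f}x")   # 35.85x, matching the log
print(f"Total weights across GPUs: {total_weights_gb:.2f} GB")  # 39.15 GB
```

Under that block-size assumption the computed concurrency reproduces the logged `35.85x`. With `max_num_seqs=8`, the scheduler runs at most 8 sequences per iteration, so the cache has room to spare.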

In [9]:
sampling_params = SamplingParams(
    temperature=0.3,              # randomness of the sampling
    seed=1,                       # Seed for reproducibility
    skip_special_tokens=False,
    max_tokens=2400
)

msgs = [
    {"role": "user", "content": "give me a step-by-step explanation of the intermediate value theorem"}
]

response = llm.chat(msgs, sampling_params, use_tqdm=False)

print(response[0].outputs[0].text)

Certainly! The Intermediate Value Theorem (IVT) is a fundamental theorem in calculus that provides a way to determine if a function takes on a certain value within a given interval. Here’s a step-by-step explanation:

### Step 1: Understand the Theorem
The Intermediate Value Theorem states that if a function \( f \) is continuous on a closed interval \([a, b]\) and \( N \) is any number between \( f(a) \) and \( f(b) \), then there exists at least one number \( c \) in the interval \((a, b)\) such that \( f(c) = N \).

### Step 2: Define the Conditions
1. **Continuity**: The function \( f \) must be continuous on the closed interval \([a, b]\). This means that there are no breaks, jumps, or holes in the function within this interval.
2. **Closed Interval**: The interval \([a, b]\) is closed, meaning it includes both endpoints \( a \) and \( b \).
3. **Value \( N \)**: \( N \) is any number between \( f(a) \) and \( f(b) \). This means \( N \) lies within the range of the function value

In [11]:
sampling_params = SamplingParams(
    temperature=0.3,              # randomness of the sampling
    seed=1,                       # Seed for reproducibility
    skip_special_tokens=False,
    max_tokens=2400
)

msgs = [
    {"role": "user", "content": "what are the different steps involved in implementing Godel's theorem in python? Note that (1) I already have an array which associates each logical symbol with an integer, and (2) I already have a decoding function to pass from an integer to a formula, using parsing."}
]

response = llm.chat(msgs, sampling_params, use_tqdm=False)

print(response[0].outputs[0].text)

Implementing Gödel's theorem in Python involves several steps, primarily focusing on encoding and decoding logical statements, and then demonstrating the key aspects of the theorem, such as the existence of undecidable statements. Given that you already have an array associating each logical symbol with an integer and a decoding function, we can proceed with the following steps:

### Step 1: Define the Encoding Function
You need a function to encode a logical formula into a Gödel number. This function will use the array you provided to map each symbol to an integer and then combine these integers into a single number.

```python
def encode_formula(formula, symbol_to_int):
    """
    Encode a logical formula into a Gödel number.
    
    :param formula: A string representing the logical formula.
    :param symbol_to_int: A dictionary mapping logical symbols to integers.
    :return: The Gödel number of the formula.
    """
    # Split the formula into individual symbols
    symbols = l