# LLM inference comparison on A100 40GB: llama.cpp works best!

This notebook explores different frameworks for running LLM inference with GPU support.  
TL;DR: GGUF + llama.cpp is fastest with GPU support, followed by AWQ; GPTQ is slower. Installing vLLM broke the torch installation. ctransformers fails with an error. The classic model without quantization runs out of memory.

### Initial setup: transformers based
python3.9 -m pip install virtualenv  
python3.9 -m virtualenv .venv  
source .venv/bin/activate  
pip3 install torch torchvision torchaudio  

>>> import torch  
>>> torch.cuda.is_available()  
True  

pip3 install transformers  
pip3 install autoawq  
pip install optimum 
pip install auto-gptq  
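
Since the results below depend on which package versions pip resolved, it can help to record them. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list simply mirrors the pip commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(
        ["torch", "transformers", "autoawq", "optimum", "auto-gptq"]
    ).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this once per virtualenv makes it easier to tell later which combination of versions produced a given error.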

### Initial setup: ctransformers and llama.cpp based

python3.9 -m virtualenv .venvctrans  
source .venvctrans/bin/activate  
pip install ctransformers  
module load devel/cuda/12.0  
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python  
pip install langchain  

### Frameworks: vLLM, llama.cpp (Python), ctransformers, Hugging Face transformers; model formats: GGML, GGUF, AWQ, GPTQ, HF, bin

These are the frameworks built for running LLMs and the available model formats.  


transformers with HF format (no quantization): the process is killed because it runs out of memory:  
dmesg -T| grep -E -i -B100 'killed process': Memory cgroup out of memory: Killed process 420230 (python) total-vm:13048064kB, anon-rss:6184252kB, file-rss:14572kB, shmem-rss:4kB, UID:248630 pgtables:20528kB oom_score_adj:0
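
A rough back-of-the-envelope estimate shows why the unquantized model is so much heavier than the quantized ones. The sketch below assumes a round 7 billion parameters and counts only the weights (no activations, KV cache, or framework overhead); `weight_memory_gib` is a hypothetical helper, and the float32 row assumes the model is loaded without an explicit dtype.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7e9  # assumption: round figure for a "7B" model

for label, bits in [("float32", 32), ("float16", 16), ("4-bit quantized", 4)]:
    print(f"{label}: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the float32 weights need roughly 26 GiB, while 4-bit weights need a little over 3 GiB. The dmesg line above reports anon-rss:6184252kB, about 5.9 GiB, at the time the process was killed, which suggests the memory cgroup limit was reached long before the full model was loaded.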


AutoAWQ + AWQ: GPU inference works  
Transformers + AWQ: raises an error, do not use  

Installing vLLM breaks the torch installation:  
ImportError: libcupti.so.11.7: cannot open shared object file: No such file or directory  
After that, import torch fails. A fresh env with "pip install vllm" still shows the error. Maybe use the NVIDIA PyTorch Docker image, as recommended on the vLLM website.  
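
To narrow down an error like this, one option is to ask the dynamic loader directly whether it can open the library. This is a minimal sketch with `ctypes`; `can_load_shared_library` is a hypothetical helper, and it only checks the default loader search path, which may differ from how torch resolves the libraries it bundles.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open the given library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # Library name taken from the ImportError above
    lib = "libcupti.so.11.7"
    print(f"{lib} loadable: {can_load_shared_library(lib)}")
```

If this prints False both before and after installing vLLM, the library was probably never on the system path and the installation changed which copy torch looks for.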

GPTQ + AutoGPTQ + optimum + transformers: GPU inference works  

GGUF + ctransformers: with the CUDA 12.0 module loaded, fails with: OSError: /lib64/libm.so.6: version `GLIBC_2.29' not found  

GGUF + llama-cpp-python: works with GPU support, but module load devel/cuda/12.0 has to be run in every new console  
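
Because the module has to be loaded again in each console, a quick check before importing llama-cpp-python can save a confusing failure. The sketch below is a heuristic only: it assumes that loading the CUDA module puts `nvcc` on `PATH` or sets `CUDA_HOME`, which may not hold on every cluster, and `cuda_toolkit_visible` is a hypothetical helper.

```python
import os
import shutil


def cuda_toolkit_visible():
    """Heuristic: is nvcc on PATH or CUDA_HOME set in this shell?"""
    return shutil.which("nvcc") is not None or bool(os.environ.get("CUDA_HOME"))


if __name__ == "__main__":
    if cuda_toolkit_visible():
        print("CUDA toolkit looks available in this console.")
    else:
        print("CUDA toolkit not found; run 'module load devel/cuda/12.0' first.")
```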

### Execution time for Llama 2 7B Chat, prompt "Tell me about AI": measured from before the import to the end of generation, model already downloaded
Note: the generated output is not the same on every run, so this is not an exact comparison.  
Execution time GPTQ: 42.51317858695984 seconds  
Execution time AWQ: 22.660392999649048 seconds  
Execution time vLLM+AWQ: 14.192400693893433 seconds  
Execution time llama.cpp: 8.327299356460571 seconds  
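
The relative differences are easier to read as ratios. This small sketch divides each reported time by the fastest one; the numbers are copied from the list above, and `slowdown_vs_fastest` is a hypothetical helper.

```python
# Execution times in seconds, copied from the list above
times = {
    "GPTQ": 42.51317858695984,
    "AWQ": 22.660392999649048,
    "vLLM+AWQ": 14.192400693893433,
    "llama.cpp": 8.327299356460571,
}

fastest = min(times, key=times.get)


def slowdown_vs_fastest(times):
    """Return each time divided by the fastest time."""
    best = min(times.values())
    return {name: t / best for name, t in times.items()}


for name, factor in sorted(slowdown_vs_fastest(times).items(), key=lambda kv: kv[1]):
    print(f"{name}: {times[name]:.2f} s ({factor:.1f}x the {fastest} time)")
```

The same caveat applies to the ratios as to the raw times: the runs generated different amounts of text, so they are only a rough indication.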
### TODO
Test vLLM in Docker.

# PyTorch bin

In [1]:
# Hugging Face transformers + PyTorch bin
# no quantization, e.g. https://huggingface.co/Xwin-LM/Xwin-LM-7B-V0.1/tree/main

from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM


model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
#model = AutoModelForCausalLM.from_pretrained("Xwin-LM/Xwin-LM-7B-V0.1")
#tokenizer = AutoTokenizer.from_pretrained("Xwin-LM/Xwin-LM-7B-V0.1")
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: Hello, can you help me? "
    "ASSISTANT:"
)
inputs = tokenizer(prompt, return_tensors="pt")
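# do_sample=True is needed for temperature to take effect; without it, decoding is greedy
samples = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7)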
output = tokenizer.decode(samples[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(output) 
# Of course! I'm here to help. Please feel free to ask your question or describe the issue you're having, and I'll do my best to assist you.


In [None]:
# vllm + pytorch bin
# no quantization e.g https://huggingface.co/Xwin-LM/Xwin-LM-7B-V0.1/tree/main

from vllm import LLM, SamplingParams
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: Hello, can you help me? "
    "ASSISTANT:"
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=4096)
llm = LLM(model="Xwin-LM/Xwin-LM-7B-V0.1")
outputs = llm.generate([prompt,], sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(generated_text)

INFO 10-17 16:44:20 llm_engine.py:72] Initializing an LLM engine with config: model='Xwin-LM/Xwin-LM-7B-V0.1', tokenizer='Xwin-LM/Xwin-LM-7B-V0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)


# AWQ
The newest quantization method tested here: according to the AWQ authors, AWQ achieves a 1.45x speedup over GPTQ and is 1.85x faster than the cuBLAS FP16 implementation.  
Code is from: https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ  

In [2]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name_or_path = "TheBloke/Llama-2-7b-Chat-AWQ"

# Load model
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)

prompt = "Tell me about AI"
prompt_template=f'''[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt}[/INST]

'''

print("\n\n*** Generate:")

tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

print("Output: ", tokenizer.decode(generation_output[0]))


Fetching 13 files: 100%|██████████| 13/13 [00:00<00:00, 127995.19it/s]
Replacing layers...: 100%|██████████| 32/32 [00:02<00:00, 11.10it/s]




*** Generate:
Output:  <s> [INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
Tell me about AI[/INST]

AI, or Artificial Intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, problem-solving, and decision-making. AI technology can be used in a variety of applications, including natural language processing, image and speech recognition, and predictive analytics.
There are several types of AI, including:
1. Narrow or weak AI:
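
The same prompt template is repeated in several cells of this notebook. A small helper could build it in one place; this is a minimal sketch, `build_llama2_prompt` is a hypothetical name, and the default system prompt is shortened here for readability.

```python
# Shortened for the example; the cells in this notebook use a longer system prompt
DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."


def build_llama2_prompt(user_message, system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Fill the [INST] <<SYS>> template used in the cells of this notebook."""
    return (
        "[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n"
        f"{user_message}[/INST]\n"
        "\n"
    )


if __name__ == "__main__":
    print(build_llama2_prompt("Tell me about AI"))
```

Using one helper for every framework keeps the prompt identical across runs, which removes one source of variation from the timing comparison.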

In [3]:
# Inference can also be done using transformers' pipeline
from transformers import pipeline
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM

model_name_or_path = "TheBloke/Llama-2-7b-Chat-AWQ"

# Load model
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)

prompt = "Tell me about AI"
prompt_template=f'''[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt}[/INST]

'''

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Fetching 13 files: 100%|██████████| 13/13 [00:00<00:00, 110153.44it/s]
Replacing layers...: 100%|██████████| 32/32 [00:03<00:00,  8.87it/s]


*** Pipeline:


AttributeError: 'LlamaAWQForCausalLM' object has no attribute 'config'

In [1]:
import time

# get the start time
st = time.time()

from vllm import LLM, SamplingParams

prompt = "Tell me about AI"
prompt_template=f'''[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt}[/INST]

'''

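# max_tokens is not set here, so vLLM uses its default limit (16 tokens),
# far fewer than the 512 new tokens allowed in the other cells
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)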

llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="awq")

outputs = llm.generate(prompt_template, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    
et = time.time()

# get the execution time
elapsed_time = et - st
print('Execution time vLLM+AWQ:', elapsed_time, 'seconds')

INFO 10-17 16:47:04 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Llama-2-7b-Chat-AWQ', tokenizer='TheBloke/Llama-2-7b-Chat-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-17 16:47:04 tokenizer.py:31] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 10-17 16:47:14 llm_engine.py:207] # GPU blocks: 3858, # CPU blocks: 512


Processed prompts: 100%|█████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.79it/s]

Prompt: "[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\nTell me about AI[/INST]\n\n", Generated text: "I'm glad you're interested in learning about AI! AI"
Execution time vLLM+AWQ: 15.24031686782837 seconds
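
The start/stop timing pattern used in the cell above can be wrapped in a context manager, and dividing by the number of generated tokens gives a figure that is less sensitive to output length. This is a minimal sketch: `timed` and `tokens_per_second` are hypothetical helpers, the workload is a placeholder, and the token count in the example is made up purely to show the arithmetic.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Measure wall-clock time of the enclosed block and store it in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def tokens_per_second(n_generated_tokens, elapsed_seconds):
    """Throughput that is comparable across runs with different output lengths."""
    return n_generated_tokens / elapsed_seconds


if __name__ == "__main__":
    results = {}
    with timed("dummy workload", results):
        # Placeholder for: import framework, load model, generate
        sum(i * i for i in range(100_000))
    print(f"Execution time: {results['dummy workload']:.4f} seconds")
    # Hypothetical numbers, only to show the normalization
    print(f"{tokens_per_second(512, 8.0):.1f} tokens/s")
```

Note that timing from before the import, as done in this notebook, includes model loading, so tokens per second computed this way mixes load time and generation time.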





# GPTQ
Model: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ  
Without optimum and auto-gptq installed, loading fails with:  
ImportError: Loading a GPTQ quantized model requires optimum (`pip install optimum`) and auto-gptq library (`pip install auto-gptq`)

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/Llama-2-7b-Chat-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-64g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template=f'''[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt}[/INST]

'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])




# GGUF
https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF

In [None]:
from ctransformers import AutoModelForCausalLM

# Download the GGUF model file first (e.g. with wget) and point model_file at the local path

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7b-Chat-GGUF", model_file="/home/iai/sb7059/git/llm_test/llama-2-7b-chat.Q4_K_M.gguf", model_type="llama", gpu_layers=50)

print(llm("AI is going to"))

In [None]:
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

n_gpu_layers = 40  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.


llm = LlamaCpp(
    model_path="/home/iai/sb7059/git/llm_test/llama-2-7b-chat.Q4_K_M.gguf",
    temperature=0.7,
    max_tokens=512,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    top_p=0.95,
    top_k=40,
    callback_manager=callback_manager, 
    verbose=True, # Verbose is required to pass to the callback manager
)

prompt = "Tell me about AI"
prompt_template=f'''[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt}[/INST]

'''

print(llm(prompt_template))