# Run Large Language Models in Python

### Models & Quantization

Large language models can be prohibitively large: depending on the parameter count, the weights alone can exceed 24GB, more than fits in the VRAM of a typical consumer GPU.

To run these models locally on consumer-grade GPUs, their weights can be quantized to lower bit widths using one of several methods:

- GPTQ: All model layers are loaded into VRAM and the GPU is used for inference. Best for fast performance.
- GGUF: Successor to GGML. Inference runs on CPU + RAM. Model layers can optionally be offloaded to VRAM.
- AWQ: A newer, GPTQ-like method that offers fast 4-bit quantization with up to 3x lower memory utilization.

_Note: AutoAWQ does not yet support Mixtral_
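
To make the size argument concrete, here is a rough back-of-the-envelope sketch of weight-only memory at different bit widths. The figure of roughly 46.7 billion total parameters for Mixtral 8x7B is an assumption that does not come from this notebook, and the helper name `weight_size_gb` is purely illustrative.

```python
# Rough weight-only memory estimate: parameters * bits_per_weight / 8 bytes.
# Ignores activations, the KV cache, and per-group quantization metadata.

def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Return the approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Assumption: roughly 46.7 billion total parameters for Mixtral 8x7B.
MIXTRAL_PARAMS = 46.7e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_size_gb(MIXTRAL_PARAMS, bits):.1f} GB")
```

Under that assumption the 4-bit estimate lands a little above 23 GB, which is in line with the 23G `model.safetensors` file in the download listing further down. Real files will be somewhat larger than this estimate, presumably because of quantization metadata and any layers kept at higher precision.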

### Install Dependencies

In [5]:
!pip install -qq -U autoawq optimum huggingface-hub
!pip install -qq -U auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
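
Because `-qq` suppresses pip's output, it can be worth confirming that the packages actually landed in the kernel's environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are the ones from the pip commands above.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Distribution names taken from the pip commands above.
for name, version in installed_versions(
    ["autoawq", "optimum", "huggingface-hub", "auto-gptq"]
).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```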

### Install Model(s)

In [17]:
!sudo mkdir -p /models/Mixtral-8x7B-v0.1-GPTQ
!huggingface-cli download TheBloke/Mixtral-8x7B-v0.1-GPTQ --local-dir /models/Mixtral-8x7B-v0.1-GPTQ --local-dir-use-symlinks False
!ls -lhr /models/Mixtral-8x7B-v0.1-GPTQ

total 23G
-rwxr-xr-x 1 jovyan jovyan 482K Dec 25 23:31 tokenizer.model
-rwxr-xr-x 1 jovyan jovyan 1.8M Dec 25 23:31 tokenizer.json
-rwxr-xr-x 1 jovyan jovyan  967 Dec 25 23:31 tokenizer_config.json
-rwxr-xr-x 1 jovyan jovyan   72 Dec 25 23:31 special_tokens_map.json
-rwxr-xr-x 1 jovyan jovyan  22K Dec 25 23:31 README.md
-rwxr-xr-x 1 jovyan jovyan  185 Dec 25 23:31 quantize_config.json
-rwxr-xr-x 1 jovyan jovyan  23G Dec 25 23:37 model.safetensors
-rwxr-xr-x 1 jovyan jovyan  116 Dec 25 23:31 generation_config.json
-rwxr-xr-x 1 jovyan jovyan 2.2K Dec 25 23:31 config.json
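
A 23G download can be interrupted partway, so a quick sanity check before loading may save time. This is a small sketch that only checks for the file names shown in the listing above; the helper `missing_files` is hypothetical, and the directory path is assumed to be the `--local-dir` used in the download command.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "quantize_config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer.model",
    "tokenizer_config.json",
    "special_tokens_map.json",
]

def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]

# Assumed location: the --local-dir from the download command.
missing = missing_files("/models/Mixtral-8x7B-v0.1-GPTQ")
print("All files present" if not missing else f"Missing: {missing}")
```

This only confirms that the files exist; it does not verify their sizes or checksums.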


### Setup Model

In [18]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

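# Hub repo ID. To load the copy downloaded above instead of fetching it again,
# set this to the local path "/models/Mixtral-8x7B-v0.1-GPTQ".
model_name_or_path = "TheBloke/Mixtral-8x7B-v0.1-GPTQ"
# To use a different branch, change revision.
# For example: revision="gptq-4bit-128g-actorder_True"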
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    trust_remote_code=False,
    revision="main"
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

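prompt = "Write a story about llamas"
system_message = "You are a story writing assistant"  # defined but not used by the template below
# Mixtral-8x7B-v0.1 is a base model, so the prompt is passed through as plain text.
prompt_template = f'''{prompt}
'''

print("\n\n*** Generate:")

# Tokenize the prompt and move the tensor to the device the model was placed on.
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.to(model.device)
# Sample up to 512 new tokens using temperature, top_p (nucleus), and top_k filtering.
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))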


In [None]:
# Inference can also be done using transformers' pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])
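
Both cells above pass the same sampling parameters (`temperature=0.7`, `top_k=40`, `top_p=0.95`). As a rough illustration of what those settings do to the next-token distribution, here is a simplified pure-Python sketch. It is not the `transformers` implementation: the function name `filter_distribution` and the toy logits are made up for illustration, and `repetition_penalty` is not modeled.

```python
import math
import random

def filter_distribution(logits, temperature=0.7, top_k=40, top_p=0.95):
    """Return renormalized token probabilities after temperature, top-k, and top-p."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Top-k: keep only the k highest-scoring tokens.
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = kept[0][1]
    weights = [(tok, math.exp(score - peak)) for tok, score in kept]
    total = sum(w for _, w in weights)
    probs = [(tok, w / total) for tok, w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for tok, p in probs:
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the remaining probabilities sum to 1.
    norm = sum(p for _, p in nucleus)
    return {tok: p / norm for tok, p in nucleus}

# Toy logits for four hypothetical tokens.
toy_logits = {"llama": 3.0, "alpaca": 2.0, "camel": 0.5, "zebra": -2.0}
probs = filter_distribution(toy_logits)
# Draw one token from the filtered distribution.
next_token = random.choices(list(probs), weights=list(probs.values()))[0]
print(probs, next_token)
```

With these toy values only the two most likely tokens survive the top-p cut, so the sampler never picks the low-probability tail. Lowering `temperature` or `top_p` makes output more deterministic; raising them makes it more varied.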