# Run a Large Language Model from Python

### Models & Quantization

Depending on their parameter count, large language models can be prohibitively large: the weights alone often exceed 24 GB, which is more than the VRAM of most consumer GPUs.
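As a rough illustration of where such sizes come from, the sketch below estimates the memory needed for the weights alone as parameter count times bits per weight. The 34-billion parameter count is an illustrative figure, and the estimate ignores activations, the KV cache, and any quantization metadata, so real footprints will be somewhat larger.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Illustrative parameter count: a 34-billion-parameter model.
params = 34e9
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(params, bits):.0f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the estimate by a factor of four.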

To run these models locally on consumer-grade GPUs, their weights can be quantized to lower bit widths (for example 8-bit or 4-bit). Common quantization methods and formats include:

- GPTQ: All model layers are loaded into VRAM and the GPU is used for inference. Best for fast performance.
- GGUF: Successor to GGML. Inference runs on CPU and RAM; model layers can optionally be offloaded to VRAM.
- AWQ: A newer, GPTQ-like method that offers fast 4-bit quantization with up to 3x lower memory utilization.

_Note: AutoAWQ does not yet support Mixtral_
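To make the idea of reducing bit width concrete, here is a deliberately naive round-to-nearest quantizer in plain Python. It is not GPTQ or AWQ, both of which choose the quantized values far more carefully; the function names and sample weights are made up for illustration.

```python
def quantize(weights, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using one scale and offset."""
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to a scale of 1.0 when all weights are identical.
    scale = (w_max - w_min) / levels or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize(codes, scale, w_min):
    """Recover approximate floats from the integer codes."""
    return [c * scale + w_min for c in codes]


# Hypothetical weights, chosen only to show the round trip.
weights = [-0.42, -0.10, 0.0, 0.07, 0.31, 0.55]
codes, scale, w_min = quantize(weights, bits=4)
restored = dequantize(codes, scale, w_min)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(codes)
print(f"max round-trip error: {max_error:.4f}")
```

Each weight is stored as a 4-bit code plus a shared scale and offset, and the round-trip error is bounded by half the scale.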

### Install Dependencies

In [None]:
!pip install -q -U optimum huggingface-hub auto-gptq
!pip uninstall transformers -y
!pip install -q -U git+https://github.com/huggingface/transformers
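After installing, it can help to confirm which versions ended up in the environment. This is a small sketch using the standard library's `importlib.metadata`; the package list mirrors the install commands above, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(["transformers", "optimum", "huggingface-hub", "auto-gptq"])
for name, ver in report.items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```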

### Install Model(s)

Sample Models
- [Mixtral 8x7B GPTQ](https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ)
- [CodeLlama-34B-Instruct-GPTQ](https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GPTQ)

In [None]:
# Use the same relative path for creating, downloading into, and listing the directory
!mkdir -p TheBloke/CodeLlama-34B-Instruct-GPTQ
!huggingface-cli download TheBloke/CodeLlama-34B-Instruct-GPTQ --local-dir TheBloke/CodeLlama-34B-Instruct-GPTQ --local-dir-use-symlinks False
!ls
!ls -lhr TheBloke/CodeLlama-34B-Instruct-GPTQ
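The download can also be driven from Python with `huggingface_hub.snapshot_download` instead of the CLI. The sketch below is one way to do it; `download_args` and `download_model` are hypothetical helper names, and placing the files under `<base_dir>/<repo_id>` is an assumption rather than a requirement.

```python
from pathlib import Path


def download_args(repo_id: str, base_dir: str = ".") -> dict:
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    return {
        "repo_id": repo_id,
        "revision": "main",
        "local_dir": str(Path(base_dir) / repo_id),
    }


def download_model(repo_id: str, base_dir: str = ".") -> str:
    """Download every file in the repo and return the local directory path."""
    # Imported lazily so download_args can be used without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_args(repo_id, base_dir))


args = download_args("TheBloke/CodeLlama-34B-Instruct-GPTQ")
print(args["local_dir"])
# download_model("TheBloke/CodeLlama-34B-Instruct-GPTQ")  # uncomment to fetch the files
```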

### Setup Model

In [None]:
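from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Resolves to a local directory of this name if one exists in the working
# directory; otherwise the files are fetched from the Hugging Face Hub.
model_name_or_path = "TheBloke/CodeLlama-34B-Instruct-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",        # place the model layers on the available device(s)
    trust_remote_code=False,  # do not execute custom code shipped with the repo
    revision="main",          # Hub branch to load
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)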
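If loading fails or inference is unexpectedly slow, a first thing to check is how much VRAM each GPU has, since GPTQ inference expects all model layers to fit in VRAM. This is a minimal sketch, assuming PyTorch may or may not be installed (it returns an empty list when PyTorch or CUDA is unavailable); `gpu_memory_gb` is a hypothetical helper name.

```python
def gpu_memory_gb():
    """Return total memory per visible CUDA device in GB, or [] if none is usable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_properties(i).total_memory / 1e9
        for i in range(torch.cuda.device_count())
    ]


for index, total in enumerate(gpu_memory_gb()):
    print(f"GPU {index}: {total:.1f} GB")
```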

### Generate Output

In [None]:
prompt = "Write a Python function that checks whether a string is a palindrome."
prompt_template=f'''[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer using ```:
{prompt}
[/INST]

'''

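print("\n\n*** Generate:")

# Move the input IDs to the model's device rather than assuming a single CUDA device
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.to(model.device)
output = model.generate(
    inputs=input_ids,
    temperature=0.7,
    do_sample=True,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512,
)

# The decoded text contains the prompt followed by the generated completion
print(tokenizer.decode(output[0]))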
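The `temperature`, `top_k`, and `top_p` arguments above shape the distribution that the next token is sampled from. The sketch below shows the general idea on a toy list of logits in plain Python. It is a simplified illustration under assumed conventions, not the `transformers` implementation, and `filter_distribution` and the sample logits are made up.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=40, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i, p in zip(ranked, probs):
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {i: p / norm for i, p in kept}


# Toy logits for a five-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
candidates = filter_distribution(logits, temperature=0.7, top_k=3, top_p=0.9)
print(candidates)
```

A token would then be drawn from `candidates` in proportion to its probability, for example with `random.choices`.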

### Inference can also be done using transformers' pipeline

In [None]:
print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])
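By default the pipeline's `generated_text` includes the prompt as well as the completion (passing `return_full_text=False` to the pipeline is one way to change that), and the prompt template asks the model to wrap its code in triple backticks. The sketch below post-processes such a response in plain Python. The helper names and the sample response are hypothetical, and real model output may not follow the requested format.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def completion_only(generated: str, prompt: str) -> str:
    """Drop the echoed prompt, if present, from the start of the generated text."""
    return generated[len(prompt):] if generated.startswith(prompt) else generated


def extract_code_blocks(text: str) -> list[str]:
    """Return the contents of every fenced block in the text."""
    # An optional language tag may follow the opening fence on the same line.
    pattern = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)
    return [block.strip() for block in pattern.findall(text)]


# Made-up response text, for illustration only.
sample_prompt = "[INST] Reverse a string. [/INST]\n\n"
generated = (
    sample_prompt
    + "Here you go:\n"
    + FENCE + "python\ndef rev(s):\n    return s[::-1]\n" + FENCE
)

reply = completion_only(generated, sample_prompt)
for block in extract_code_blocks(reply):
    print(block)
```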