# How to load Open-Sourced LLMs

If you are looking for an open-source alternative to ChatGPT that runs on your local machine, large language models (LLMs) loaded in a Jupyter Notebook are a powerful and flexible option.

In this blog notebook, I will guide you through installing, configuring, and using open-source LLMs.


## Hugging Face: Your Gateway to LLMs

[Hugging Face](https://huggingface.co/) is a central platform for working with large language models. It provides a wide range of pre-trained models and tools for Natural Language Processing (NLP) tasks. Here, we are particularly interested in text generation and question answering.

**Getting Models from Hugging Face**

To find LLMs for these two tasks, open the Hugging Face model hub and filter the collection by task (for example, "Text Generation" or "Question Answering").

[Hugging Face Model Hub](https://huggingface.co/models)

## Leaderboards and Model Licensing

Hugging Face hosts the Open LLM Leaderboard, where you can compare models on common benchmarks. Additionally, the [LLSys Leaderboard](https://llsys.ai/leaderboard) is a valuable resource for assessing the performance of large language models.

When using models from Hugging Face, review each model's license to make sure it permits your intended use (for example, commercial use).
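
As a quick pre-check before opening the model page, the license can often be read from a repository's tags. The sketch below assumes the Hub exposes the license as a tag of the form `license:<name>` (common, but not guaranteed for every repository) and that `huggingface_hub` is installed for the online lookup. `license_from_tags` and `fetch_license` are hypothetical helper names.

```python
from typing import Iterable, Optional


def license_from_tags(tags: Iterable[str]) -> Optional[str]:
    """Return the license name from Hub-style tags such as 'license:mit', or None."""
    for tag in tags:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return None


def fetch_license(repo_id: str) -> Optional[str]:
    """Look up a model's tags on the Hub (needs network access) and extract the license."""
    # Imported lazily so that license_from_tags also works without the library installed
    from huggingface_hub import model_info

    info = model_info(repo_id)
    return license_from_tags(info.tags or [])


# Offline example with hand-written tags (illustrative, not fetched from the Hub)
print(license_from_tags(["text-generation", "license:mit"]))  # mit
```

Calling `fetch_license("HuggingFaceH4/zephyr-7b-beta")` performs the same check against the live Hub. A result of `None` only means no license tag was found, so the model card should still be read.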


## Model Cards

Hugging Face provides model cards for each model in their repository. These model cards offer detailed information about the models, including their capabilities, input formats, and performance characteristics.


## Types of (quantized) models

Large Language Models come in various flavors, each suited for different purposes. Here are some common types:

- **GGUF (...):** GGUF is the model file format used by `llama.cpp`, a library written in C/C++ for efficient inference of Llama models. It replaces the older GGML format, and its models can run on a **`CPU`**. You can work with GGUF models using libraries like `llama-cpp-python` and `ctransformers`, with `huggingface-hub` for downloading the model files.

- **GPTQ (...):** This quantization format loads and runs models on a **`GPU`**. To work with GPTQ models, you'll need libraries such as `auto-gptq`, `transformers`, and `optimum`.

- **AWQ (...):** For AWQ models, you can explore the `autoawq` library.

- **Foundational Models (Base Models):** These are the core models on which other LLMs are built. They are the foundation for various NLP tasks.
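
To make the list above concrete, here is a small sketch that guesses a model's format from its repository name and prints the matching install command. It assumes the common (but not universal) convention that quantized repositories end in `-GGUF`, `-GPTQ`, or `-AWQ`. `guess_format` and `LIBRARIES` are hypothetical names, and the package names are the PyPI names I believe correspond to the libraries mentioned.

```python
# Libraries named in the list above, keyed by model format
LIBRARIES = {
    "gguf": ["llama-cpp-python", "ctransformers", "huggingface-hub"],
    "gptq": ["auto-gptq", "transformers", "optimum"],
    "awq": ["autoawq"],
    "base": ["transformers", "accelerate"],
}


def guess_format(repo_id: str) -> str:
    """Guess the model format from a repo name; fall back to 'base' when no suffix matches."""
    name = repo_id.lower()
    for fmt in ("gguf", "gptq", "awq"):
        if name.endswith("-" + fmt):
            return fmt
    return "base"


for repo in ("HuggingFaceH4/zephyr-7b-beta", "TheBloke/zephyr-7B-beta-GPTQ"):
    fmt = guess_format(repo)
    print(f"{repo}: {fmt} -> pip install {' '.join(LIBRARIES[fmt])}")
```

The name-based guess is only a convenience. The model card remains the authoritative source for how a given checkpoint was quantized.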

This Jupyter Notebook will provide you with step-by-step instructions on how to load and utilize these different types of LLMs for your specific NLP tasks. Let's get started!

# Working with foundational (base) models

For this example we will work with the Zephyr model, which at the time of writing is a state-of-the-art model among those under 70B parameters: https://huggingface.co/HuggingFaceH4/zephyr-7b-beta

Dependencies:
 
```bash
pip install git+https://github.com/huggingface/transformers.git
pip install accelerate
```

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
# <|system|>
# You are a friendly chatbot who always responds in the style of a pirate.</s>
# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# Ah, me hearty matey! But yer question be a puzzler! A human cannot eat a helicopter in one sitting, as helicopters are not edible. They be made of metal, plastic, and other materials, not food!
```
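
To see what `apply_chat_template` is doing for Zephyr, the formatting can be reproduced by hand. This is a sketch inferred from the printed output above, not the tokenizer's actual template. `build_zephyr_prompt` is a hypothetical helper, and in real code the tokenizer's own template should be preferred.

```python
def build_zephyr_prompt(messages, add_generation_prompt=True):
    """Format chat messages as <|role|> blocks, each closed by </s>."""
    parts = [f"<|{m['role']}|>\n{m['content']}</s>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|assistant|>\n")  # cue the model to start its answer
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Tell me about AI"},
]
print(build_zephyr_prompt(messages))
```

Writing the format out once makes it easier to understand the hand-built prompt strings that appear on model cards for quantized variants.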

# GGUF / GGML models (CPU based)

GGUF models can be loaded with either the `llama-cpp-python` or the `ctransformers` library. Since I am used to working with `llama-cpp-python`, I will use that one:

Dependencies:

```bash
pip install llama-cpp-python huggingface-hub
```



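A minimal sketch of loading a GGUF model with `llama-cpp-python`, with `huggingface-hub` used to download the file (both need to be installed). The repository `TheBloke/zephyr-7B-beta-GGUF`, the file name `zephyr-7b-beta.Q4_K_M.gguf`, and the values for `n_ctx`, `temperature`, and `max_tokens` are my assumptions rather than something this notebook specifies, so check the repository's file list first. The helper names are hypothetical.

```python
def load_gguf_model(repo_id, filename, n_ctx=2048):
    """Download a GGUF file from the Hub (cached after the first call) and load it on the CPU."""
    # Imported lazily so the prompt helpers below can be used without these packages
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama

    model_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return Llama(model_path=model_path, n_ctx=n_ctx)


def zephyr_prompt(user_message):
    """Build a Zephyr-style prompt with an empty system message."""
    return f"<|system|>\n</s>\n<|user|>\n{user_message}</s>\n<|assistant|>\n"


def generate(llm, user_message, max_tokens=256):
    """Run one completion and return only the generated text."""
    output = llm(
        zephyr_prompt(user_message),
        max_tokens=max_tokens,
        temperature=0.7,
        stop=["</s>"],  # stop at the end-of-sequence marker
    )
    return output["choices"][0]["text"]


# Assumed repository and file name; the download is several gigabytes.
# llm = load_gguf_model("TheBloke/zephyr-7B-beta-GGUF", "zephyr-7b-beta.Q4_K_M.gguf")
# print(generate(llm, "Tell me about AI"))
```

The last two lines are left commented out because the first call downloads the model file. Smaller quantizations trade some answer quality for lower memory use.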
# GPTQ models (GPU based)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/zephyr-7B-beta-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    trust_remote_code=False,
    revision="main",
)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template = f'''<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>
'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1,
)

print(pipe(prompt_template)[0]['generated_text'])
```
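
Both calls above print the prompt together with the completion. If only the answer is wanted, the text after the last `<|assistant|>` marker can be sliced off. A small sketch, assuming the Zephyr markers used in the prompt template above. `extract_assistant_reply` is a hypothetical helper.

```python
def extract_assistant_reply(generated_text, marker="<|assistant|>", eos="</s>"):
    """Return the text after the last assistant marker, without end-of-sequence tokens."""
    reply = generated_text.rsplit(marker, 1)[-1]
    return reply.replace(eos, "").strip()


sample = "<|system|>\n</s>\n<|user|>\nTell me about AI</s>\n<|assistant|>\nAI is a broad field.</s>"
print(extract_assistant_reply(sample))  # AI is a broad field.
```

With the pipeline, the same helper can be applied to `pipe(prompt_template)[0]['generated_text']`.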