# Load a Llama model with Hugging Face (CPU only)


## **Workflow**
1. **Installation**
2. **Login**
3. **Fetching the Model**
4. **Loading the Model & Tokenizer**
5. **Llama Pipeline**
6. **Interacting with Llama**

## **Installation**

In [5]:
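# Silence the TOKENIZERS_PARALLELISM=(true | false) warning.
# `!export` only affects a throwaway subshell, so set the variable in the notebook process instead:
%env TOKENIZERS_PARALLELISM=false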

In [3]:
#!pip3 install --upgrade pip
!pip3 install -q -U transformers torch #--force-reinstall
#!pip3 install 'transformers[torch]'
#!pip3 install sentencepiece protobuf



In [4]:
!pip3 show transformers
print()
!pip3 show torch

Name: transformers
Version: 4.36.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /Users/antoine/Documents/GITHUB/Install-Llama2-locally/.venv/lib/python3.11/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 

Name: torch
Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /Users/antoine/Documents/GITHUB/Install-Llama2-locally/.venv/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Req
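
The same version check can be done from Python instead of shelling out to `pip3 show`. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages this notebook depends on.
for name in ("transformers", "torch"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```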

## **Login**

In [3]:
#!pip install --upgrade huggingface_hub

from huggingface_hub import notebook_login
notebook_login()

#!huggingface-cli login
#!huggingface-cli logout


In [None]:
!huggingface-cli whoami
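
`notebook_login()` opens an interactive widget, which is awkward in scripts. A possible non-interactive alternative is sketched below, assuming the access token has been exported in an environment variable (here `HF_TOKEN`). The helper name `login_from_env` is hypothetical; `login(token=...)` is the `huggingface_hub` function it wraps.

```python
import os


def login_from_env(var_name: str = "HF_TOKEN") -> bool:
    """Log in to the Hugging Face Hub with a token read from an environment variable.

    Returns False (and does nothing) when the variable is unset or empty.
    """
    token = os.environ.get(var_name)
    if not token:
        return False
    # Imported lazily so the helper can be defined even before the package is installed.
    from huggingface_hub import login

    login(token=token)
    return True
```

Calling `login_from_env()` at the top of a script then replaces the widget, and `!huggingface-cli whoami` can still be used to confirm the result.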

## **Fetching the Model**
Here are three ways to download the model `meta-llama/Llama-2-7b-chat-hf`.

 **1. Fetch models and tokenizers to use offline with `from_pretrained()` (requires a connection)**

In [25]:
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"
SAVE_DIR = "./models/hf-frompretrained-download/Llama-2-7b-chat-hf/"

# Download the files (requires a connection and access to the gated repository).
# AutoModelForCausalLM keeps the language-modeling head needed for text generation;
# plain AutoModel would load and save the base model without it.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Save the files to a local directory
tokenizer.save_pretrained(SAVE_DIR)
model.save_pretrained(SAVE_DIR)

# Reload the files offline
tokenizer = AutoTokenizer.from_pretrained(SAVE_DIR)
model = AutoModelForCausalLM.from_pretrained(SAVE_DIR)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

This method allows us to download only the files required to load the model and tokenizer.

 **2. Fetch models and tokenizers to use offline with `snapshot_download()` (requires a connection)**

In [7]:
from huggingface_hub import hf_hub_download, snapshot_download

MODEL_REPO = "meta-llama/Llama-2-7b-chat-hf"

model_path = snapshot_download(
    repo_id=MODEL_REPO,
    local_dir="./models/hf-snapshot-download/Llama-2-7b-chat-hf",
    cache_dir="./models/hf-snapshot-download/Llama-2-7b-chat-hf",
    local_files_only = False,
)

print(model_path)

Fetching 16 files:   0%|          | 0/16 [00:00<?, ?it/s]

/Users/antoine/Documents/GITHUB/Install-Llama2-locally/models/hf-snapshot-download/Llama-2-7b-chat-hf


This method downloads the entire content of the model's repository. We can filter files by name or extension using the `allow_patterns` and `ignore_patterns` parameters.

**`snapshot_download`**: downloads an entire repository at a given revision

**`hf_hub_download`**: downloads a single file (e.g. `config.json`).
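
To get a feel for how `allow_patterns` and `ignore_patterns` narrow a download, the sketch below mimics the filtering locally with `fnmatch`, assuming shell-style glob matching. It does not call the Hub; `select_files` is a hypothetical helper and the file names are illustrative, not an exact listing of the repository.

```python
from fnmatch import fnmatch
from typing import Iterable, List, Optional


def select_files(
    filenames: Iterable[str],
    allow_patterns: Optional[List[str]] = None,
    ignore_patterns: Optional[List[str]] = None,
) -> List[str]:
    """Keep files that match any allow pattern and no ignore pattern."""
    selected = []
    for name in filenames:
        if allow_patterns is not None and not any(fnmatch(name, p) for p in allow_patterns):
            continue
        if ignore_patterns is not None and any(fnmatch(name, p) for p in ignore_patterns):
            continue
        selected.append(name)
    return selected


# Illustrative file names only (assumption, not the real repository listing).
repo_files = [
    "config.json",
    "tokenizer.model",
    "model-00001-of-00002.safetensors",
    "pytorch_model-00001-of-00002.bin",
]

# Keep JSON, tokenizer and safetensors files, which leaves out the .bin weights.
print(select_files(repo_files, allow_patterns=["*.json", "*.model", "*.safetensors"]))
# Same result expressed as an exclusion.
print(select_files(repo_files, ignore_patterns=["*.bin"]))
```

The same pattern lists could then be passed to `snapshot_download(repo_id=MODEL_REPO, allow_patterns=[...])` to avoid downloading the weights twice in two formats.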
  
&nbsp;
 
The new cache file layout looks like this:
- The cache directory contains one subfolder per `repo_id` (namespaced by repo type)
- Inside each repo folder:
    - ***refs*** is a list of the latest known revision => commit_hash pairs
    - ***blobs*** contains the actual file blobs (identified by their git-sha or sha256, depending on whether they’re LFS files or not)
    - ***snapshots*** contains one subfolder per commit, each “commit” contains the subset of the files that have been resolved at that particular commit. Each filename is a symlink to the blob at that particular commit.
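
The layout above can be explored with a few lines of `pathlib`. The sketch below builds a tiny fake cache in a temporary directory and follows `refs/main` to the matching snapshot folder. The helper `resolve_snapshot` and the commit hash `abc123` are placeholders invented for this example.

```python
import tempfile
from pathlib import Path


def resolve_snapshot(cache_dir: Path, repo_id: str, revision: str = "main") -> Path:
    """Follow refs/<revision> to the matching snapshots/<commit_hash> folder."""
    # Model repos are stored as "models--<org>--<name>" inside the cache directory.
    repo_folder = cache_dir / ("models--" + repo_id.replace("/", "--"))
    commit_hash = (repo_folder / "refs" / revision).read_text().strip()
    return repo_folder / "snapshots" / commit_hash


# Build a fake cache to illustrate the layout (no real model files involved).
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    repo = cache / "models--meta-llama--Llama-2-7b-chat-hf"
    (repo / "refs").mkdir(parents=True)
    (repo / "blobs").mkdir()
    (repo / "snapshots" / "abc123").mkdir(parents=True)
    (repo / "refs" / "main").write_text("abc123")

    snapshot = resolve_snapshot(cache, "meta-llama/Llama-2-7b-chat-hf")
    print(snapshot.relative_to(cache))
```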

If `local_dir` is provided, the file structure from the repo will be replicated in this location. 

- `cache_dir` str, Path, (_optional_) — Path to the folder where cached files are stored.
- `local_dir` str or Path, (_optional_) — If provided, the downloaded model file will be placed under this directory.


 **3. Fetch models and tokenizers to use offline with `git clone` (requires a connection)**

Clone the repository  

````bash
cd models/hf-git-clone
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
````

## **Loading the Model & Tokenizer**

In [9]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path = model_path,
                                cache_dir="./models/hf-snapshot-download/Llama-2-7b-chat-hf",
                                local_files_only=True,
                                token=True) 

# `pipeline()` accepts a local directory, so pass the path and let it load the weights.
model = "./models/hf-snapshot-download/Llama-2-7b-chat-hf"

# model = AutoModel.from_pretrained(pretrained_model_name_or_path = model_path,
#                                 cache_dir="./models/hf-snapshot-download/Llama-2-7b-chat-hf",
#                                 local_files_only=True,
#                                 token=True)

The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.

`.from_pretrained()`: 
- `pretrained_model_name_or_path`: locally, a path to a directory containing the model weights and `config.json`, saved using `save_pretrained()`, e.g.: `./my_model_directory/`.
- `use_fast`: (_optional_) boolean, default `True` in current versions of `AutoTokenizer`: whether transformers should try to load the [fast](https://huggingface.co/learn/nlp-course/chapter6/3#fast-tokenizers-special-powers) version of the tokenizer (`True`) or use the Python one (`False`).
- `local_files_only`: (_optional_) boolean, default `False`: whether or not to only look at local files (i.e., do not try to download the model).


&nbsp;

> ***Note:*** the `config` parameter is optional when the configuration can be loaded automatically:
>    - the model is a model provided by the library (`shortcut-name`)
>    - the model was saved using `save_pretrained()` and is reloaded by supplying the save directory.
>    - the model is loaded by supplying a local directory as `pretrained_model_name_or_path` and a configuration JSON file named `config.json` is found in the directory.
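
Before loading with `local_files_only=True`, it can help to confirm that the directory really contains a `config.json`. The following is a rough sketch using only the standard library; `read_local_config` is a hypothetical helper and the temporary directory stands in for a real model folder.

```python
import json
import tempfile
from pathlib import Path


def read_local_config(model_dir: str) -> dict:
    """Return the parsed config.json from a local model directory.

    Raises FileNotFoundError if the directory has no config.json.
    """
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"No config.json found in {model_dir}")
    return json.loads(config_path.read_text())


# Stand-in model directory with a minimal config (illustrative content only).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text(json.dumps({"model_type": "llama"}))
    print(read_local_config(tmp)["model_type"])
```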

&nbsp;

See the [`from_pretrained()` documentation](https://huggingface.co/transformers/v3.0.2/main_classes/model.html#transformers.PreTrainedModel.from_pretrained).

## **Llama Pipeline**

In [10]:
from transformers import pipeline
import torch

llama_pipeline = pipeline(
    "text-generation",  # LLM task
    model=model,
    torch_dtype=torch.bfloat16, #torch.float16 is for GPU only
    device = -1, # or "cpu".
    # device_map="auto" (Do not use device_map AND device at the same time as they will conflict)
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Pipeline supports running on **CPU** or **GPU** through the `device` argument.
- `device`: int, (_optional_), defaults to -1 – Device ordinal for CPU/GPU support. Setting this to -1 runs the model on the CPU; a value >= 0 runs it on the CUDA device with that id.
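
The ordinal convention can be summarised in a small helper. This is only an illustration of the mapping described above, not code from the library; `device_name` is a made-up name, and treating every negative value as CPU is an assumption.

```python
def device_name(device: int = -1) -> str:
    """Translate a pipeline-style device ordinal into a device string."""
    if device < 0:
        # Assumption: any negative ordinal means "run on the CPU".
        return "cpu"
    return f"cuda:{device}"


print(device_name(-1))  # cpu
print(device_name(0))   # cuda:0
```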

## **Interacting with Llama**


In [11]:
# eos token
print(tokenizer.eos_token_id)

2


In [12]:
def get_llama_response(prompt: str) -> None:
    """
    Generate a response from the Llama model.

    Parameters:
        prompt (str): The user's input/question for the model.

    Returns:
        None: Prints the model's response.
    """

    sequences = llama_pipeline(
        prompt,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=256,
    )

    print(sequences, "\n\n")

    print("Llama 2 Chatbot:", sequences[0]['generated_text'])


prompt = "Which planet is the closest to the sun?"
get_llama_response(prompt)

[{'generated_text': 'Which planet is the closest to the sun?\n\nAnswer: Mercury is the closest planet to the sun, with an average distance of about 58 million kilometers (36 million miles).'}]

Llama 2 Chatbot: Which planet is the closest to the sun?

Answer: Mercury is the closest planet to the sun, with an average distance of about 58 million kilometers (36 million miles).
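
As the output shows, `generated_text` starts with the prompt itself. One way to keep only the continuation is sketched below with a hypothetical helper `strip_prompt`; the sample output is shortened from the run above. The text-generation pipeline also accepts `return_full_text=False`, which may be the simpler option.

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Remove the echoed prompt from the start of the generated text."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


# Same shape as the pipeline output, shortened for the example.
sample_prompt = "Which planet is the closest to the sun?"
sample_sequences = [
    {"generated_text": sample_prompt + "\n\nAnswer: Mercury is the closest planet to the sun."}
]

print(strip_prompt(sample_prompt, sample_sequences[0]["generated_text"]))
```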


- `do_sample`: if set to `True`, this parameter enables [decoding strategies](https://huggingface.co/docs/transformers/generation_strategies#text-generation-strategies) such as _multinomial_ sampling, _beam-search multinomial_ sampling, _Top-K_ sampling and _Top-p_ sampling. All these strategies select the next token from the probability distribution over the entire vocabulary with various strategy-specific adjustments.

- `num_return_sequences`: the number of sequence candidates to return for each input. This option is only available for the decoding strategies that support multiple sequence candidates, e.g. variations of beam search and sampling. Decoding strategies like greedy search and contrastive search return a single output sequence.

- `eos_token_id` (Union[int, List[int]]) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.
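
To make `do_sample=True` with `top_k=10` more concrete, here is a toy sketch of Top-K sampling in plain Python: keep the `k` highest-scoring tokens, turn their scores into probabilities, and draw one of them at random. It illustrates the idea only and is not how transformers implements it; `top_k_sample` and the toy logits are invented for the example.

```python
import math
import random
from typing import List, Optional


def top_k_sample(logits: List[float], k: int, rng: Optional[random.Random] = None) -> int:
    """Sample a token index from the k highest-scoring logits."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if rng is None:
        rng = random.Random()
    # Indices of the k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to those k logits (shifted by the max for numerical stability).
    max_logit = logits[top[0]]
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # toy scores for a 5-token vocabulary
rng = random.Random(0)
samples = {top_k_sample(toy_logits, k=2, rng=rng) for _ in range(100)}

# Only the two best tokens (indices 0 and 2) can ever be drawn.
print(samples <= {0, 2})  # True
```

With `k=1` the draw is deterministic and always returns the highest-scoring token, which matches greedy search.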