# Running an LLM Locally Using Hugging Face

Running an LLM locally has a few benefits:

- It may be cheaper
- Better data privacy
- You have full control over fine-tuning and quantization

In this notebook we will learn to run models locally using the Hugging Face ``transformers`` package.


Hugging Face hosts a repository of open-source LLMs. Each model lives in a Git repo that is structured according to Hugging Face specifications. This makes it easy to download and use these models in a consistent manner. We use the ``transformers`` Python package for this.

We will now run the same TinyLlama model, this time using Hugging Face. The model ID is ``"TinyLlama/TinyLlama-1.1B-Chat-v1.0"``. This maps to the model's home page ``https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0``. You can see a list of all available models [here](https://huggingface.co/models).

## Download and Load the Model
The ``AutoModelForCausalLM`` Python class downloads and loads a causal (text-generating) language model.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch

In [None]:
# The device to load the model onto. 
#
# Available device types:
# "cuda" - NVIDIA GPU
# "cpu" - Plain CPU
# "mps" - Apple silicon
device = "cuda"

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# This model requires 24GB GPU memory and 32GB RAM.
# model_name = "mistralai/Mistral-7B-Instruct-v0.1"

#Load the model into GPU
model = AutoModelForCausalLM.from_pretrained(model_name)
model.to(device)
 
tokenizer = AutoTokenizer.from_pretrained(model_name)
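
The cell above hardcodes ``device = "cuda"``, which fails on a machine without an NVIDIA GPU. A minimal sketch of choosing among the three device types listed in the comments is shown below. The helper name ``pick_device`` is hypothetical and is not part of ``torch`` or ``transformers``.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device string given what the machine offers."""
    if cuda_available:
        return "cuda"  # NVIDIA GPU
    if mps_available:
        return "mps"   # Apple silicon
    return "cpu"       # Fallback that always works


try:
    import torch

    device = pick_device(torch.cuda.is_available(),
                         torch.backends.mps.is_available())
except ImportError:
    # torch is not installed; assume CPU so the sketch still runs.
    device = "cpu"

print(device)
```

The resulting ``device`` string can be passed to ``model.to(device)`` in place of the hardcoded value.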

## Run Inference
After the model is downloaded, it is cached in the ``~/.cache/huggingface`` folder, so later loads skip the download. We can then run inference. Note: Both the model and the input tokens must be loaded onto the same device (here, the GPU). Also, the tokens are obtained as PyTorch tensors.
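
To check whether a model is already cached, it can help to know where to look. The sketch below assumes the default Hub cache layout, ``<cache root>/hub/models--<org>--<name>``, and that only the ``HF_HOME`` environment variable may override the cache root. Other overrides such as ``HF_HUB_CACHE`` are not handled. The helper name ``expected_cache_dir`` is hypothetical.

```python
import os
from pathlib import Path


def expected_cache_dir(model_id, hf_home=None):
    """Guess where the Hub cache keeps a model's files.

    Assumes the default layout: <HF_HOME>/hub/models--<org>--<name>.
    """
    if hf_home is None:
        # HF_HOME overrides the default cache root when it is set.
        hf_home = os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface")
    folder = "models--" + model_id.replace("/", "--")
    return Path(hf_home) / "hub" / folder


cache_dir = expected_cache_dir("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
print(cache_dir, "exists:", cache_dir.exists())
```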

In [None]:
streamer = TextStreamer(tokenizer)
 
messages = [
    {"role": "user", "content": "Bob is taller than Jane. Jane is taller than Kim. Is Bob taller than Kim?"}
]
 
encoded = tokenizer.apply_chat_template(
    messages, 
    add_generation_prompt=True, 
    return_tensors="pt").to(device)
 
# temperature only takes effect when sampling is enabled with do_sample=True
generated_ids = model.generate(encoded, streamer=streamer, max_new_tokens=4096, do_sample=True, temperature=0.36)

The answer is obviously wrong. This is common for small models, whose reasoning capability is limited. We can often fix the problem with Chain of Thought prompting: the prompt includes a worked example whose answer spells out the reasoning, which nudges the model to reason the same way.

In [None]:
messages = [
        {"role": "user", "content": 
"""
Tony is taller than Jane. Jane is taller than Agatha. Is Tony taller than Agatha?
"""},
    {"role": "assistant", "content": 
"""
If Jane is taller than Agatha and Tony is taller than Jane then it follows that Tony is also taller than Agatha. So, the
answer is yes, Tony is taller than Agatha.
"""},
    {"role": "user", "content": "Bob is taller than Jane. Jane is taller than Kim. Is Bob taller than Kim?"}
]
 
encoded = tokenizer.apply_chat_template(
    messages, 
    add_generation_prompt=True, 
    return_tensors="pt").to(device)
 
# temperature only takes effect when sampling is enabled with do_sample=True
generated_ids = model.generate(encoded, streamer=streamer, max_new_tokens=4096, do_sample=True, temperature=0.36)
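
The message list above follows a simple pattern: each worked example becomes a user turn followed by an assistant turn, and the real question goes last. A minimal sketch of a helper that builds such a list is shown below. The name ``build_few_shot_messages`` is hypothetical and not part of ``transformers``.

```python
def build_few_shot_messages(examples, question):
    """Turn (question, worked_answer) pairs plus a new question into chat messages."""
    messages = []
    for example_question, worked_answer in examples:
        messages.append({"role": "user", "content": example_question})
        messages.append({"role": "assistant", "content": worked_answer})
    # The real question always comes last, as a user turn.
    messages.append({"role": "user", "content": question})
    return messages


examples = [
    (
        "Tony is taller than Jane. Jane is taller than Agatha. Is Tony taller than Agatha?",
        "If Jane is taller than Agatha and Tony is taller than Jane then it follows "
        "that Tony is also taller than Agatha. So, the answer is yes.",
    ),
]

messages = build_few_shot_messages(
    examples,
    "Bob is taller than Jane. Jane is taller than Kim. Is Bob taller than Kim?",
)

for message in messages:
    print(message["role"], ":", message["content"])
```

The returned list has the same shape as the one passed to ``tokenizer.apply_chat_template`` above.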

## Shared Device Usage
In the example above we load the entire model into the GPU if available. This approach will fail if you try to load a large model and your GPU doesn't have enough VRAM. A better way to load a model is to fit as much of it as possible into the GPU and, if necessary, place the remainder in system RAM (CPU memory) and then on disk. This way, inference will be slow but at least it will work.
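
The idea of filling the fastest memory first can be illustrated with a toy placement routine. This is a simplified sketch, not the algorithm the library actually uses. The function name ``place_layers``, the layer names, and all sizes and budgets are made up for illustration.

```python
def place_layers(layer_sizes_gb, gpu_budget_gb, cpu_budget_gb):
    """Assign each layer to "cuda", "cpu", or "disk", filling the GPU first."""
    placement = {}
    remaining = {"cuda": gpu_budget_gb, "cpu": cpu_budget_gb}
    tiers = ["cuda", "cpu"]
    tier = 0
    for name, size in layer_sizes_gb.items():
        # Move to the next tier once the current one cannot hold this layer.
        while tier < len(tiers) and remaining[tiers[tier]] < size:
            tier += 1
        if tier < len(tiers):
            remaining[tiers[tier]] -= size
            placement[name] = tiers[tier]
        else:
            # Neither GPU nor CPU memory has room left.
            placement[name] = "disk"
    return placement


# Four hypothetical 2 GB layers, a 4.5 GB GPU budget, and a 2.5 GB CPU budget.
layers = {"layer.0": 2.0, "layer.1": 2.0, "layer.2": 2.0, "layer.3": 2.0}
print(place_layers(layers, gpu_budget_gb=4.5, cpu_budget_gb=2.5))
```

Layers that land on the CPU or on disk have to be moved to the GPU whenever they are needed, which is why this mode is slower than a model that fits entirely in VRAM.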

In the example below we load the ``EleutherAI/pythia-70m-deduped`` model by automatically sharing the memory of the GPU, system RAM and disk. This is a base model: it can continue a piece of text but is not tuned for question/answer style chatting.

In [None]:
#Unload the previous model
del model
del tokenizer
torch.cuda.empty_cache()

In [None]:
model_name = "EleutherAI/pythia-70m-deduped"
model = AutoModelForCausalLM.from_pretrained(model_name, 
            torch_dtype=torch.float16, 
            #This will load the model into the GPU first
            #and then use CPU RAM (and disk) if more space is needed.
            device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
inputs = tokenizer("A dog is a man's best", return_tensors="pt")

#Load the tokens into the same device as the model
inputs = inputs.to(model.device)

tokens = model.generate(**inputs)
outputs = tokenizer.decode(tokens[0])
print(outputs)

## Using Flash Attention
Flash attention is a technique that speeds up language models. It does that by reducing data movement between the GPU's large but comparatively slow high-bandwidth memory (HBM, the VRAM) and its small, fast on-chip SRAM. Flash attention works only with newer versions of CUDA. Also, Windows isn't supported as of June 2025.
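
The core trick behind flash attention is to compute softmax attention tile by tile, so that the full table of attention scores never has to be held in memory at once. The sketch below illustrates only that arithmetic, in plain Python on the CPU, for a single query vector. It omits the usual scaling and masking and is in no way the real GPU kernel. The function names and the sample numbers are made up for illustration.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: all scores are computed and held at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, but only one block of scores exists at any time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new running maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values))
```

Both functions print the same vector up to floating-point rounding, which shows that processing the keys in blocks does not change the result.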

We install flash attention as follows.

In [None]:
!pip install -U flash-attn --no-build-isolation
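
Because the install can fail on unsupported platforms, it may be useful to fall back to another attention implementation when the ``flash_attn`` package is missing. The sketch below only checks that the package is importable. It does not verify that the GPU or CUDA version is compatible, and whether a given model accepts ``"sdpa"`` depends on the installed ``transformers`` version. The helper name ``choose_attn_implementation`` is hypothetical.

```python
import importlib.util


def choose_attn_implementation():
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumed fallback: PyTorch's built-in scaled dot product attention.
    return "sdpa"


attn_implementation = choose_attn_implementation()
print(attn_implementation)
```

The returned string can be passed as the ``attn_implementation`` argument of ``from_pretrained``.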

In [None]:
model_name = "EleutherAI/pythia-70m-deduped"
model = AutoModelForCausalLM.from_pretrained(model_name, 
            torch_dtype=torch.float16, 
            device_map="auto", 
            #Enable flash attention
            attn_implementation="flash_attention_2")
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
inputs = tokenizer("A dog is a man's best", return_tensors="pt")

#Load the tokens into the same device as the model
inputs = inputs.to(model.device)

tokens = model.generate(**inputs)
outputs = tokenizer.decode(tokens[0])
print(outputs)

# Running Smaller Transformer Models
Many tasks can be done by small transformer models more cheaply and quickly than by an LLM. Below we use DistilBERT, a smaller distilled variant of Google's BERT model, fine-tuned for sentiment classification. Note that here we use the ``AutoModelForSequenceClassification`` class instead of ``AutoModelForCausalLM``.

In [None]:
from transformers import AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

#Run inference
with torch.no_grad():
    logits = model(**inputs).logits

#Get the predicted class with the highest probability
predicted_class_id = logits.argmax().item()

#Convert the class ID to plain English label
model.config.id2label[predicted_class_id]
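
The logits are raw scores, not probabilities. A softmax turns them into probabilities, which gives a confidence value alongside the label. The sketch below uses made-up logits and an assumed label mapping so that it runs without downloading the model. In the notebook, the real values would come from ``logits[0].tolist()`` and ``model.config.id2label``.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)  # Subtract the maximum for numerical stability.
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits standing in for `logits[0].tolist()`.
example_logits = [-4.1, 4.4]
# Assumed label mapping; the real one is model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 4))
```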