# Running LLMs locally from Hugging Face

A brief demo of running a few LLMs from Hugging Face on my local machine, to see what kind of answers these models give when asked the same question. We focus on three models:
* flan-t5-small
* dolly-v2-3b
* DialoGPT-large

## Relevant imports

In [120]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM, AutoModel, T5ForConditionalGeneration
import accelerate


In [107]:
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain

## Creating a model loading function

In [121]:
# Setting a model-loading function, reusable for each of the considered models
def load_model(model_id, model_type):
    """
    Load a tokenizer and model from Hugging Face's Transformers library.

    Parameters:
    - model_id (str): Identifier for the model on Hugging Face's model hub.
    - model_type (str): Type of the model ('causal', 'seq2seq', etc.).

    Returns:
    - tokenizer: The loaded tokenizer.
    - model: The loaded model.
    """
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        if model_type == 'causal':
            model = AutoModelForCausalLM.from_pretrained(model_id)
        elif model_type == 'seq2seq':
            model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
        elif model_type == 't5':
            model = T5ForConditionalGeneration.from_pretrained(model_id)
        else:
            model = AutoModel.from_pretrained(model_id)
        return tokenizer, model
    except Exception as e:
        print(f"An error occurred while loading the model: {e}")
        return None, None
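
The `if`/`elif` chain in `load_model` can also be expressed as a lookup table. The sketch below is one possible variant and not part of the original notebook: `LOADER_NAMES`, `resolve_loader_name`, and `load_model_from_table` are hypothetical names, and `transformers` is imported lazily so the mapping can be inspected without loading any model.

```python
# Hypothetical lookup table: model_type -> name of the transformers class to use.
LOADER_NAMES = {
    "causal": "AutoModelForCausalLM",
    "seq2seq": "AutoModelForSeq2SeqLM",
    "t5": "T5ForConditionalGeneration",
}


def resolve_loader_name(model_type):
    """Return the class name for model_type, falling back to AutoModel."""
    return LOADER_NAMES.get(model_type, "AutoModel")


def load_model_from_table(model_id, model_type):
    """Load a tokenizer and model, choosing the model class via the table."""
    import transformers  # imported here so the table works without the library

    model_cls = getattr(transformers, resolve_loader_name(model_type))
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
    model = model_cls.from_pretrained(model_id)
    return tokenizer, model


print(resolve_loader_name("t5"))       # T5ForConditionalGeneration
print(resolve_loader_name("unknown"))  # AutoModel
```

Supporting another model family then only needs a new dictionary entry. Unlike `load_model`, this variant lets loading errors propagate instead of returning `(None, None)`.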


## flan-t5-small

In [143]:
# Setting up a structured prompt template to format inputs and expected responses
template = """Question: {question}
Answer: """

prompt = PromptTemplate(template = template, input_variables = ["question"])
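
To see the text the model actually receives, the template can be rendered by hand. This is a minimal sketch that assumes the default f-string style of `PromptTemplate`, for which plain `str.format` should produce the same string; it does not call LangChain itself.

```python
# Same template string as in the cell above.
template = """Question: {question}
Answer: """

# Fill the {question} placeholder with plain Python string formatting.
rendered = template.format(question="What is the capital of Italy?")

# repr() makes the newline and the trailing space after "Answer:" visible.
print(repr(rendered))
```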

In [144]:
# Model setup
tokenizer, model = load_model('google/flan-t5-small', 't5')

In [145]:
# Pipeline creation
pipe = pipeline(
    "text2text-generation",
    model = model,
    tokenizer = tokenizer,
    max_length = 100
)

local_flan = HuggingFacePipeline(pipeline = pipe)

In [146]:
# Direct generation: pass the raw question, without the prompt template
print(local_flan('What is the capital of Italy?'))

rome


In [147]:
# Using LLMChain with the model
llm_chain = LLMChain(prompt = prompt,
                     llm = local_flan)

question = "What is the capital of Italy?"

print(llm_chain.run(question))

rome


## dolly-v2-3b

In [148]:
# Setting up a structured prompt template to format inputs and expected responses
template = """Question: {question}
Answer: I believe that"""

prompt = PromptTemplate(template = template, input_variables = ["question"])

In [149]:
# Model setup
tokenizer, model = load_model('databricks/dolly-v2-3b', 'causal')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [164]:
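# Pipeline creation
# Uses the `tokenizer` and `model` from the model setup cell above; these names
# are reused in other sections, so re-run that cell first if another model was loaded since.
pipe = pipeline(
    "text-generation",
    model = model,
    tokenizer = tokenizer,
    max_length = 100
)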

local_dolly = HuggingFacePipeline(pipeline = pipe)

In [165]:
# Direct generation: pass the raw question, without the prompt template
print(local_dolly('What is the capital of Italy?'))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is the capital of Italy? Rome. Rome, Italy.
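
The output above repeats the question before the answer, because a `text-generation` pipeline returns the prompt together with its continuation. A small helper can remove the echoed prompt. `strip_prompt` is a hypothetical name used only for this sketch, and the strings are copied from the output above.

```python
def strip_prompt(prompt, generated):
    """Remove the echoed prompt from the start of the generated text, if present."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated.strip()


prompt_text = "What is the capital of Italy?"
full_output = "What is the capital of Italy? Rome. Rome, Italy."

print(strip_prompt(prompt_text, full_output))  # Rome. Rome, Italy.
```

The `text-generation` pipeline also accepts `return_full_text=False`, which should have a similar effect without post-processing; that option was not tried in this notebook.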


In [167]:
# Using LLMChain with the model
llm_chain = LLMChain(prompt = prompt,
                     llm = local_dolly)

question = "What is the capital of Italy?"

print(llm_chain.run(question))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question: What is the capital of Italy?
Answer: I believe that it's the capital of Italy.
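
The `pad_token_id` notice printed above is informational: no padding token is configured, so generation falls back to the end-of-sequence token. One common way to make this explicit is to pass `pad_token_id` yourself. The sketch below only builds the keyword arguments. `generation_kwargs` is a hypothetical helper, and `fake_tokenizer` stands in for a real tokenizer, using the id `50256` from the notice. Note that for decoder-only models `max_length` counts the prompt tokens as well, whereas `max_new_tokens` limits only the generated part.

```python
from types import SimpleNamespace


def generation_kwargs(tokenizer, max_new_tokens=50):
    """Build generation settings that set the padding token explicitly."""
    return {
        "max_new_tokens": max_new_tokens,        # limit on newly generated tokens only
        "pad_token_id": tokenizer.eos_token_id,  # reuse EOS as the padding token
    }


# Stand-in object with the single attribute the helper reads.
fake_tokenizer = SimpleNamespace(eos_token_id=50256)

kwargs = generation_kwargs(fake_tokenizer)
print(kwargs)  # {'max_new_tokens': 50, 'pad_token_id': 50256}
```

Assuming the pipeline forwards extra keyword arguments to generation, these could be passed as `pipeline("text-generation", model = model, tokenizer = tokenizer, **kwargs)` in place of `max_length`.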


## DialoGPT-large

In [153]:
# Setting up a structured prompt template to format inputs and expected responses
template = """Question: {question}
Answer: I believe that"""

prompt = PromptTemplate(template = template, input_variables = ["question"])

In [154]:
# Model setup (same load_model helper as for the previous models)

tokenizer, model = load_model('microsoft/DialoGPT-large', 'causal')

In [155]:
# Pipeline creation
pipe = pipeline(
    "text-generation",
    model = model,
    tokenizer = tokenizer,
    max_length = 100
)

local_dialo = HuggingFacePipeline(pipeline = pipe)

In [156]:
# Direct generation: pass the raw question, without the prompt template
print(local_dialo('What is the capital of Italy?'))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What is the capital of Italy? Rome. Rome, Italy.


In [157]:
# Using LLMChain with the model
llm_chain = LLMChain(prompt = prompt,
                     llm = local_dialo)

question = "What is the capital city of Italy?"

print(llm_chain.run(question))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question: What is the capital city of Italy?
Answer: I believe that it's the city of Rome.
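
To put the answers side by side, the same question can be sent to every model in one loop. This is a minimal sketch: `compare_answers` is a hypothetical helper, and the lambdas are stand-ins that return the direct-generation outputs recorded above, so the block runs without loading any model.

```python
def compare_answers(models, question):
    """Ask every model the same question and collect the answers by name."""
    return {name: ask(question) for name, ask in models.items()}


# Stand-in callables returning the outputs recorded earlier in this notebook.
stub_models = {
    "flan-t5-small": lambda q: "rome",
    "dolly-v2-3b": lambda q: "What is the capital of Italy? Rome. Rome, Italy.",
    "DialoGPT-large": lambda q: "What is the capital of Italy? Rome. Rome, Italy.",
}

answers = compare_answers(stub_models, "What is the capital of Italy?")
for name, answer in answers.items():
    print(f"{name}: {answer}")
```

With the real models loaded, the dictionary values could be `local_flan`, `local_dolly`, and `local_dialo`, which are called with a question string in the cells above.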
