# Running this with LOCAL models

Now let's look at passing prompts to a model that runs locally, rather than one hosted on an external service.

We'll use the `transformers` library and two approaches:
* using the text generation `pipeline` object
* directly using a model via `AutoModelForCausalLM`

These approaches give us more control, but they also require us to look more closely at what is happening at each step.

In [None]:
from transformers import pipeline
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
import torch

We choose a model available on Hugging Face, but this can be swapped with other text generation models.

In [None]:
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# transformers and pipeline

Initialize our text generation pipeline.

In [None]:
generator = pipeline("text-generation",
                     model = model_name,
                     tokenizer = model_name)

Initialize our prompt.

In [None]:
prompt = ("Explain what a neural network is in one sentence.")

Get the output from our pipeline for this prompt.

In [None]:
outputs = generator(prompt)

What output do we get?

In [None]:
print(outputs)

We have to exercise more care here than with the API calls.  

For example, the prompt itself is included in the output.

In [None]:
print(outputs[0]["generated_text"][len(prompt):])
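Slicing by `len(prompt)` assumes the generated text really does begin with the prompt. A slightly more defensive version is sketched below. The helper name `strip_prompt` is hypothetical, and the `outputs` list is a hand-written stand-in that only mimics the list-of-dicts shape returned by the pipeline; it is not real model output.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the continuation if generated_text begins with prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Fall back to the full text rather than cutting at the wrong position.
    return generated_text

# Hand-written stand-in for the pipeline's return value (not real model output).
prompt = "Explain what a neural network is in one sentence."
outputs = [{"generated_text": prompt + " A neural network is a layered function."}]

continuation = strip_prompt(outputs[0]["generated_text"], prompt)
print(continuation)
```

The text generation pipeline also accepts `return_full_text=False`, which should return only the continuation and make the slicing unnecessary.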

We can start customizing the LLM's behavior by adjusting generation parameters such as:
* temperature
* top_p
* max number of output tokens (max_new_tokens)

In [None]:
outputs = generator(prompt,
                    max_new_tokens=64)

print(outputs[0]["generated_text"][len(prompt):])

Note that you may need to set `do_sample=True` for the temperature setting to take effect; without sampling, decoding is greedy and the temperature is ignored. For our model that is not the case, but we set it anyway so the code generalizes to other models.

In [None]:
outputs = generator(prompt,
                    max_new_tokens=64,
                    do_sample=True,
                    temperature=2.0)

print(outputs[0]["generated_text"][len(prompt):])

In [None]:
outputs = generator(prompt,
                    max_new_tokens=64,
                    do_sample=True,
                    temperature=0.1)

print(outputs[0]["generated_text"][len(prompt):])
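To see why the two temperatures behave so differently, here is a minimal sketch of temperature scaling in plain Python. The logits are made-up values for a three-token vocabulary, and `softmax_with_temperature` is a hypothetical helper, not a function from `transformers`. Dividing the logits by a temperature above 1 flattens the distribution, while a temperature below 1 sharpens it toward the most likely token.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a three-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.0]

for t in (2.0, 1.0, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At `temperature=0.1` almost all of the probability mass sits on the top token, which is why the low-temperature outputs above vary so little from run to run.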

We can bring in `top_k` too, which limits sampling to the `k` most likely candidates each time the model predicts the next token.

In [None]:
outputs = generator(prompt,
                    max_new_tokens=64,
                    do_sample=True,
                    temperature=0.1,
                    top_k=10)

print(outputs[0]["generated_text"][len(prompt):])
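The filtering behind `top_k` and `top_p` can be sketched in a few lines. This is a simplified illustration on a made-up probability list, with hypothetical helper names; the library applies these filters to logits, and its exact boundary handling may differ.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Made-up next-token probabilities for a four-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]

top_k_result = top_k_filter(probs, k=2)    # tokens 0 and 1 survive
top_p_result = top_p_filter(probs, p=0.9)  # tokens 0, 1, and 2 survive
print(top_k_result)
print(top_p_result)
```

With `k=1` only one candidate remains, so sampling always picks the most likely token regardless of temperature.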

We can also start expanding the elements in our prompt:

In [None]:
prompt = (
    "You are a helpful assistant.\n"
    "User: Explain what a neural network is in one sentence.\n"
    "Assistant:"
)

In [None]:
outputs = generator(prompt,
                    max_new_tokens=64,
                    do_sample=True,
                    temperature=0.1)

In [None]:
print(outputs[0]["generated_text"][len(prompt):])
# print(outputs[0]["generated_text"])

In [None]:
outputs = generator(prompt,
                    max_new_tokens=64,
                    do_sample=True,
                    temperature=0.1,
                    top_k=1)

In [None]:
print(outputs[0]["generated_text"][len(prompt):])

# Running models more explicitly (without pipeline)

We can be more hands-on when running models, though we then need to handle each step explicitly:
* tokenization
* generation
* conversion from tokens to output string
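Before using the real model, the three steps can be previewed with toy stand-ins. Everything in this sketch is hypothetical: the vocabulary is invented, `encode` splits on whitespace, and `generate` replays a canned continuation instead of running a model. Only the shape of the workflow carries over.

```python
# Toy vocabulary: a hypothetical stand-in for a real tokenizer's vocabulary.
vocab = ["<eos>", "a", "neural", "network", "is", "layered", "function"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def encode(text):
    """Tokenization: map whitespace-separated words to integer IDs."""
    return [token_to_id[word] for word in text.split()]

def generate(input_ids, max_new_tokens):
    """Generation: a fake 'model' that replays a canned continuation."""
    canned = [token_to_id[t] for t in ["is", "a", "layered", "function", "<eos>"]]
    return input_ids + canned[:max_new_tokens]

def decode(ids):
    """Conversion back to text: map IDs to words and join them."""
    return " ".join(vocab[i] for i in ids)

input_ids = encode("a neural network")
output_ids = generate(input_ids, max_new_tokens=4)
new_ids = output_ids[len(input_ids):]  # drop the echoed prompt tokens
response = decode(new_ids)
print(response)
```

As with the real model, the generated sequence starts with the input tokens, so the new tokens are obtained by slicing past the input length.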

Initialize our model and tokenizer:

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

An optional tweak: reuse the end-of-sequence token as the padding token in both the tokenizer and the model config:

In [None]:
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id
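The padding token matters mainly when several prompts of different lengths are batched together. The sketch below pads lists of made-up token IDs to a common length and builds the matching attention mask. `pad_batch` is a hypothetical helper, `pad_id=2` is an assumed ID chosen for illustration, and left padding is used here because it is commonly recommended for generation with decoder-only models.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-ID lists to a common length and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (width - len(seq))
        mask_pad = [0] * len(padding)  # 0 marks padding positions
        real = [1] * len(seq)          # 1 marks real tokens
        if side == "left":
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + real)
        else:
            input_ids.append(seq + padding)
            attention_mask.append(real + mask_pad)
    return input_ids, attention_mask

# Made-up token IDs; pad_id=2 is an assumed pad/EOS ID for illustration.
batch = [[5, 6, 7, 8], [9, 10]]
ids, mask = pad_batch(batch, pad_id=2)
print(ids)
print(mask)
```

For a single prompt, as in the cells that follow, no padding is added, which is why this customization is optional here.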

Specify that we run our model on a GPU if available:

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

Initialize our prompt:

In [None]:
prompt = (
    "You are a helpful assistant.\n"
    "User: Explain what a neural network is in one sentence.\n"
    "Assistant:"
)

We need to tokenize the prompt and move the resulting tensors to the same device as the model:

In [None]:
inputs = tokenizer(prompt, return_tensors="pt").to(device)

In [None]:
inputs

Get the output from our model for this prompt:

In [None]:
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.7,
    )

The model output is in token form, inside a tensor:

In [None]:
output_ids

In [None]:
output_ids[0]

It also includes the input prompt, which we can slice out:

In [None]:
output_ids[0][len(inputs['input_ids'][0]):]

To get a human-readable form, we can pass this into the tokenizer again, this time to decode the tokens:

In [None]:
response = tokenizer.decode(output_ids[0][len(inputs['input_ids'][0]):],
                            skip_special_tokens=False)
print(response)

# system and user roles

The basic technique above can be expanded in many ways.

One essential way is to use different roles for adapting the LLM behavior.

In [None]:
system_prompt = "You are a helpful, concise AI assistant."
user_prompt = "Explain what overfitting is in machine learning."

# Easiest: just stitch them together into one prompt string
full_prompt = (
    f"{system_prompt}\n\n"
    f"User: {user_prompt}\n"
    f"Assistant:"
)

full_prompt

In [None]:
system_prompt = "You are a helpful, concise AI assistant."
user_prompt = "Explain what overfitting is in machine learning."

# Easiest: just stitch them together into one prompt string
full_prompt = (
    f"{system_prompt}\n\n"
    f"User: {user_prompt}\n"
    f"Assistant:"
)

inputs = tokenizer(full_prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
    )

response_text = tokenizer.decode(output_ids[0][len(inputs['input_ids'][0]):], 
                                 skip_special_tokens=True)
print(response_text)

# Using chat template

As with the API, we can pass a list of dicts, each specifying a role and the prompt content for that role.

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful, concise AI assistant."},
    {"role": "user", "content": "Explain what overfitting is in machine learning."},
]

If we do this, however, we need to make sure the list gets translated into an appropriate syntax for whichever model we're using.

The `tokenizer` is specific to our model, and we can use its methods to do this translation.

In [None]:
tokenizer.apply_chat_template(
    messages,
    tokenize=False,           # return a plain string, not token IDs
    add_generation_prompt=True,  # append the assistant turn marker
)

In [None]:
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,           # return a plain string, not token IDs
    add_generation_prompt=True,  # append the assistant turn marker
)
print(prompt)
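To make the translation concrete, here is a hand-written sketch of what a chat template does. The function `render_chat` is hypothetical, and the role markers and end-of-turn token are assumptions modeled on one common style; the actual format is defined by each model's tokenizer and should always be taken from `apply_chat_template`.

```python
def render_chat(messages, add_generation_prompt=True, eos="</s>"):
    """Render role/content dicts into one prompt string (illustrative format)."""
    parts = []
    for message in messages:
        # Each turn: a role marker line, the content, then an end-of-turn token.
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos}\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful, concise AI assistant."},
    {"role": "user", "content": "Explain what overfitting is in machine learning."},
]

rendered = render_chat(messages)
print(rendered)
```

Comparing this against the printed output of the real template shows which markers your model actually uses.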

The output of `tokenizer.apply_chat_template(messages, ...)` is our prompt rewritten in the syntax the model expects, ready to feed into our call to the model.

That said, we still need to tokenize this prompt and move the resulting tensors to the model's device before passing them into the `generate` method.

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful, concise AI assistant."},
    {"role": "user", "content": "Explain what overfitting is in machine learning."},
]

# Build a single text prompt from messages
full_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # adds the assistant turn marker
)

inputs = tokenizer(full_prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
    )

response_text = tokenizer.decode(output_ids[0][len(inputs['input_ids'][0]):], 
                                 skip_special_tokens=True)
print(response_text)


# Connecting back to pipeline and API

We can use roles in the same way with the text generation pipeline.
* We can pass the prompt directly as a single string.
* We can also start from the list of role/content dicts, provided we first convert it to a string with `apply_chat_template`.

In [None]:
generator = pipeline("text-generation",
                     model=model_name,
                     tokenizer=model_name)

system_prompt = "You are a helpful, concise AI assistant."
user_prompt = "Give me a short explanation of gradient descent."

prompt = (
    f"{system_prompt}\n\n"
    f"User: {user_prompt}\n"
    f"Assistant:"
)

outputs = generator(
    prompt,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
)

print(outputs[0]["generated_text"][len(prompt):])

In [None]:
generator = pipeline("text-generation",
                     model=model_name,
                     tokenizer=model_name)

system_prompt = "You are a helpful, concise AI assistant."
user_prompt = "Give me a short explanation of gradient descent."

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

outputs = generator(
    prompt,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
)

print(outputs[0]["generated_text"][len(prompt):])

### API

The API is even simpler to use, in that we don't need to worry about tokenization, detokenization, or the chat template.

In [None]:
from openai import OpenAI
import os
NRP_TOK = os.environ.get('NRP_TOK')

In [None]:
client = OpenAI(api_key = NRP_TOK,
                base_url = "https://ellm.nrp-nautilus.io/v1")

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful, concise AI assistant."},
    {"role": "user", "content": "Write a one-sentence summary of the Big Bang theory."},
]

response = client.chat.completions.create(
    model='gpt-oss',   # or another chat model you have access to
    messages=messages,
    max_tokens=200,
    temperature=0.7,
)

answer = response.choices[0].message.content
print(answer)

The server handles:

* combining those messages into a single internal prompt,
* adding whatever special tokens / separators / role markers it uses,
* tokenizing that,
* running the model,
* and then detokenizing the output back to text.

You never see the equivalent of `apply_chat_template` or the exact raw prompt string; that's all abstracted away.
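For contrast with the local workflow, the sketch below builds roughly what the client sends for the call above: the role/content list plus a few sampling settings, with no prompt string and no token IDs. The field names mirror the arguments passed to `client.chat.completions.create`; treating them as the literal request body is an assumption, since the client library handles the actual wire format.

```python
import json

messages = [
    {"role": "system", "content": "You are a helpful, concise AI assistant."},
    {"role": "user", "content": "Write a one-sentence summary of the Big Bang theory."},
]

# Fields mirror the arguments passed to client.chat.completions.create above.
payload = {
    "model": "gpt-oss",
    "messages": messages,
    "max_tokens": 200,
    "temperature": 0.7,
}

body = json.dumps(payload, indent=2)
print(body)
```

Everything the local examples did by hand (templating, tokenizing, generating, decoding) happens on the server after this payload arrives.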