# Llama 3.1 Inference

This notebook demonstrates how to call a Llama 3.1 model deployed on Baseten. The code for the underlying deployment can be found at https://docs.baseten.co/performance/examples/llama-trt.

In [None]:
# Setup

import requests

MODEL_ID = "8w6yyp2q"
BASETEN_API_KEY = "" # Paste from Discord

In [None]:
# Basic inference with streaming output

# The notebook may cut off streaming output; if it does, try calling the model from a terminal instead.

messages = [
    {"role": "system", "content": "You are an expert software developer serving as a mentor at the HackMIT hackathon."},
    {"role": "user", "content": "What are some hackathon project ideas?"},
]

payload = {
    "messages": messages,
    "stream": True,
    "max_new_tokens": 2048,
    "temperature": 0.9
}

# Call model endpoint
res = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {BASETEN_API_KEY}"},
    json=payload,
    stream=True
)

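# Print the generated tokens as they are streamed.
# Setting the encoding lets requests decode incrementally, so a multi-byte
# UTF-8 character split across chunks is not corrupted.
res.encoding = "utf-8"
for content in res.iter_content(chunk_size=None, decode_unicode=True):
    print(content, end="", flush=True)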
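
A streamed response arrives as raw byte chunks, and a multi-byte UTF-8 character can be split across two of them, so decoding each chunk on its own can raise an error or garble the text. Below is a minimal sketch of incremental decoding with the standard library. `decode_utf8_stream` and the sample chunks are hypothetical names used for illustration only and are not part of the Baseten API.

```python
import codecs
from typing import Iterable, Iterator


def decode_utf8_stream(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield decoded text from byte chunks, buffering incomplete characters."""
    # The incremental decoder keeps partial characters until their
    # remaining bytes arrive in a later chunk.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush anything still buffered when the stream ends.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# "é" is two bytes in UTF-8 (0xC3 0xA9); here it is split across two chunks.
sample_chunks = [b"caf", b"\xc3", b"\xa9!"]
decoded = "".join(decode_utf8_stream(sample_chunks))
print(decoded)
```

With a live response, the same helper could wrap `res.iter_content(chunk_size=None)` in place of `sample_chunks`, assuming the endpoint streams UTF-8 text.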

In [None]:
# Basic inference without streaming

# Again, the notebook may cut off the full output.

messages = [
    {"role": "system", "content": "You are an expert software developer serving as a mentor at the HackMIT hackathon."},
    {"role": "user", "content": "What are some hackathon project ideas?"},
]

payload = {
    "messages": messages,
    "stream": False,
    "max_new_tokens": 2048,
    "temperature": 0.9
}

# Call model endpoint
res = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {BASETEN_API_KEY}"},
    json=payload,
    stream=False
)

# Print the full response once generation finishes
print(res.text)
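
The streaming and non-streaming cells differ only in the `stream` flag, so the shared pieces can be factored into a helper. The sketch below is one possible arrangement, not an official client: `build_request` and `call_model` are hypothetical names, the 120-second timeout is an arbitrary assumption, and the URL, header, and payload fields are copied from the cells above.

```python
import requests


def build_request(model_id, api_key, messages, stream,
                  max_new_tokens=2048, temperature=0.9):
    """Assemble the URL, headers, and JSON payload used in this notebook."""
    url = f"https://model-{model_id}.api.baseten.co/production/predict"
    headers = {"Authorization": f"Api-Key {api_key}"}
    payload = {
        "messages": messages,
        "stream": stream,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }
    return url, headers, payload


def call_model(model_id, api_key, messages, stream=False, timeout=120):
    """POST to the model endpoint and fail loudly on HTTP errors."""
    url, headers, payload = build_request(model_id, api_key, messages, stream)
    # timeout=120 is an assumed value; adjust it for long generations.
    res = requests.post(url, headers=headers, json=payload,
                        stream=stream, timeout=timeout)
    # Raise an exception for 4xx/5xx responses (for example, a missing API key)
    # instead of printing an error body as if it were model output.
    res.raise_for_status()
    return res
```

For example, `print(call_model(MODEL_ID, BASETEN_API_KEY, messages).text)` would reproduce the non-streaming cell.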

## What else to know

This Llama 3.1 implementation supports function calling (tool use) and structured output (JSON mode).

Check out the docs for more:

* Structured output: https://docs.baseten.co/invoke/structured-output
* Function calling: https://docs.baseten.co/invoke/function-calling

For extremely long-running queries, the deployment also supports async inference: https://docs.baseten.co/invoke/async.