# Build with Llama API

<p align="center">
	<a href="https://llama.developer.meta.com/?utm_source=llama-cookbook&utm_medium=readme&utm_campaign=main"><img src="https://img.shields.io/badge/Llama_API-Sign_up-4BA9FE?logo=meta" /></a>
	<a href="https://llama.developer.meta.com/docs?utm_source=llama-cookbook&utm_medium=readme&utm_campaign=main"><img src="https://img.shields.io/badge/Llama_API-Documentation-E4E6Eb?logo=meta" /></a>
</p>
<p align="center">
	<a href="https://github.com/meta-llama/llama-models/blob/main/models/?utm_source=llama-cookbook&utm_medium=readme&utm_campaign=main"><img alt="Llama Model cards" src="https://img.shields.io/badge/Llama-Model_cards-green?logo=meta" /></a>
	<a href="https://www.llama.com/docs/overview/?utm_source=llama-cookbook&utm_medium=readme&utm_campaign=main"><img alt="Llama Documentation" src="https://img.shields.io/badge/Llama-Documentation-e4e6eb?logo=meta" /></a>
	<a href="https://huggingface.co/meta-llama"><img alt="Hugging Face meta-llama" src="https://img.shields.io/badge/Hugging_Face-meta--llama-yellow?logo=huggingface" /></a>
</p>

This notebook introduces you to the functionality offered by Llama API, so that you can get up and running with the latest Llama 4 models quickly and efficiently.

## Running this notebook

To run this notebook, you'll need to sign up for a Llama API developer account at [llama.developer.meta.com](https://llama.developer.meta.com) and get an API key. You'll also need Python 3.8+ and a way to install the Llama API Python SDK, such as [pip](https://pip.pypa.io/en/stable/).

### Installing the Llama API Python SDK

The [Llama API Python SDK](https://github.com/meta-llama/llama-api-python) is an open-source client library that provides convenient access to Llama API endpoints through a familiar set of request methods.

Install the SDK using pip.

In [None]:
#%pip install --pre llama-api

### Getting and setting up an API key

Sign up for, or log in to, a Llama API developer account at [llama.developer.meta.com](https://llama.developer.meta.com), then navigate to the **API keys** tab in the dashboard to create a new API key.

Assign your API key to the environment variable `LLAMA_API_KEY`.

In [None]:
import os
os.environ["LLAMA_API_KEY"] = "YOUR_API_KEY"  # replace with your actual API key
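
If you would rather not paste the key into the notebook, you can export `LLAMA_API_KEY` in your shell and check for it before creating the client. The helper below is a minimal sketch; `require_api_key` is a hypothetical name and is not part of the SDK.

```python
import os

def require_api_key(env=os.environ, name="LLAMA_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the client.")
    return key

# Demonstration with a stand-in mapping rather than the real environment.
example_env = {"LLAMA_API_KEY": "sk-example-not-a-real-key"}
print(require_api_key(example_env)[:10])
```

Calling `require_api_key()` with no arguments reads from the real environment instead of the stand-in mapping.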

Now you can import the SDK and instantiate it. The SDK will automatically pull the API key from the environment variable set above.

In [5]:
from llama_api import LlamaAPI
client = LlamaAPI()

## Your first API call

With the SDK set up, you're ready to make your first API call. 

Start by checking the list of available models:

In [6]:
models = client.models.list()
for model in models:
    print(model.id)

Llama-3.3-70B-Instruct
Llama-3.3-8B-Instruct
Llama-4-Scout-17B-16E-Instruct-FP8
Llama-4-Maverick-17B-128E-Instruct-FP8


The list of models may change in accordance with model releases. This notebook will use the latest Llama 4 model: `Llama-4-Maverick-17B-128E-Instruct-FP8`.
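
Because the list can change, you may prefer to select a model from whatever is currently available rather than hard-coding one ID. The sketch below assumes a simple prefix-based preference order; `pick_model` is a hypothetical helper, and the IDs are copied from the sample output above.

```python
def pick_model(model_ids, preferred_prefixes=("Llama-4-Maverick", "Llama-4-Scout")):
    """Return the first available ID matching the preferred prefixes, in priority order."""
    for prefix in preferred_prefixes:
        for model_id in model_ids:
            if model_id.startswith(prefix):
                return model_id
    raise LookupError("No preferred model is available.")

# In the notebook this list would be [model.id for model in client.models.list()].
available = [
    "Llama-3.3-70B-Instruct",
    "Llama-3.3-8B-Instruct",
    "Llama-4-Scout-17B-16E-Instruct-FP8",
    "Llama-4-Maverick-17B-128E-Instruct-FP8",
]
print(pick_model(available))  # Llama-4-Maverick-17B-128E-Instruct-FP8
```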

## Chat completion

### Chat completion with text

Use the [chat completions](https://llama.developer.meta.com/docs/api/chat) endpoint for a simple text-based prompt-and-response round trip.

In [7]:
response = client.chat.completions.create(
    model="Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[
        {
            "role": "user",
            "content": "Hello, how are you?",
        }
    ],
    max_completion_tokens=1024,
    temperature=0.7,
)
  
print(response.completion_message.content.text)

I'm just a language model, so I don't have feelings or emotions like humans do, but I'm functioning properly and ready to help with any questions or tasks you have! How can I assist you today?


### Multi-turn chat completion

The [chat completions](https://llama.developer.meta.com/docs/api/chat) endpoint supports sending multiple messages in a single API call, so you can use it to continue a conversation between a user and a model.

In [8]:
response = client.chat.completions.create(
    model="Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[
        {
            "role": "system",
            "content": "You know a lot of animal facts"
        },
        {
            "role": "user",
            "content": "Pick an animal"
        },
        {
            "role": "assistant",
            "content": "I've picked an animal... It's the octopus!",
            "stop_reason": "stop"
        },
        {
            "role": "user",
            "content": "Tell me a fact about this animal"
        }
    ],
    max_completion_tokens=1024,
    temperature=0.7,
)
  
print(response.completion_message.content.text)        

Here's a fact: Octopuses have **nine brains**! Well, sort of. They have one main brain and eight smaller "mini-brains" in their arms, which can function independently and even solve problems on their own. Isn't that mind-blowing?
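
To continue a conversation over several calls, you keep a running list of messages and append each exchange to it. The following is a minimal sketch of that bookkeeping; `add_turn` is a hypothetical helper, and the message shape (including `stop_reason`) mirrors the example above.

```python
def add_turn(history, user_text, assistant_text):
    """Append one completed user/assistant exchange to the running message list."""
    history.append({"role": "user", "content": user_text})
    history.append(
        {"role": "assistant", "content": assistant_text, "stop_reason": "stop"}
    )
    return history

history = [{"role": "system", "content": "You know a lot of animal facts"}]
add_turn(history, "Pick an animal", "I've picked an animal... It's the octopus!")
history.append({"role": "user", "content": "Tell me a fact about this animal"})

print([m["role"] for m in history])  # ['system', 'user', 'assistant', 'user']
```

The resulting `history` list can be passed as the `messages` argument of the next call.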


### Streaming

You can return results from the API to the user more quickly by setting the `stream` parameter to `True`. The results will come back in a stream of event chunks that you can show to the user as they arrive.

In [16]:
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Tell me a short story",
        }
    ],
    model="Llama-4-Maverick-17B-128E-Instruct-FP8",
    stream=True,
)
for chunk in response:
    print(chunk.event.delta.text, end="", flush=True)

As the last rays of sunlight faded from the small village, a young girl named Akira sat by the window, watching the stars begin to twinkle in the night sky. She lived in a tiny cottage on the outskirts of the village, surrounded by a lush forest that whispered secrets to the wind.

Akira's grandmother, Oba-chan, sat beside her, weaving a intricate pattern on her loom. The soft clacking of the loom's wooden shuttle was a soothing melody that Akira had grown up with.

"Oba-chan, tell me a story," Akira asked, her eyes sparkling with curiosity.

Oba-chan smiled, her eyes crinkling at the corners. "Ah, child, I have just the tale for you. It's a story of the forest, and the magic that lives within it."

As Oba-chan began to speak, the room grew darker, and the shadows on the walls seemed to come alive. Akira felt herself being transported to a world beyond her own.

"Long ago," Oba-chan started, "when the village was still young, a great tree stood tall in the heart of the forest. Its bran

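When streaming, you often want to display text as it arrives and also keep the full response for later use. The sketch below shows one way to do both. It uses stand-in chunk objects that model only the `chunk.event.delta.text` attribute path from the cell above; `collect_stream` and `fake_chunk` are hypothetical names.

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Print each text delta as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        text = chunk.event.delta.text
        if text:  # skip empty or None deltas
            print(text, end="", flush=True)
            parts.append(text)
    print()
    return "".join(parts)

def fake_chunk(text):
    # Stand-in for an SDK stream chunk; only the attributes used above are modelled.
    return SimpleNamespace(event=SimpleNamespace(delta=SimpleNamespace(text=text)))

story = collect_stream(fake_chunk(t) for t in ["Once ", "upon ", "", "a time."])
```

With a real stream, you would pass the object returned by `client.chat.completions.create(..., stream=True)` to `collect_stream`.
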
### Multi-modal chat completion

The [chat completions](https://llama.developer.meta.com/docs/api/chat) endpoint also supports image understanding. You can pass images either as URLs to publicly available images or as local images encoded as Base64.

Here's an example that compares two images which are available at public URLs:

![Llama1](https://upload.wikimedia.org/wikipedia/commons/2/2e/Lama_glama_Laguna_Colorada_2.jpg)
![Llama2](https://upload.wikimedia.org/wikipedia/commons/1/12/Llamas%2C_Laguna_Milluni_y_Nevado_Huayna_Potos%C3%AD_%28La_Paz_-_Bolivia%29.jpg)

In [None]:
response = client.chat.completions.create(
    model="Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What do these two images have in common?",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/2/2e/Lama_glama_Laguna_Colorada_2.jpg",
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/1/12/Llamas%2C_Laguna_Milluni_y_Nevado_Huayna_Potos%C3%AD_%28La_Paz_-_Bolivia%29.jpg",
                    },
                },
            ],
        },
    ],
)
print(response.completion_message.content.text)
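
For a local image, the bytes need to be Base64-encoded first. The sketch below builds a content item of the same shape as the ones above, embedding the image as a `data:` URL. That the endpoint accepts a data URL in `image_url.url` is an assumption here, so check the API documentation for the exact format. `image_content_item` is a hypothetical helper.

```python
import base64

def image_content_item(image_bytes, mime_type="image/jpeg"):
    """Build an image_url content item that embeds the image as a Base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        # Assumption: the API accepts data URLs in this field.
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }

# Placeholder bytes; in practice read them with open(path, "rb").read().
item = image_content_item(b"\xff\xd8\xff\xe0 placeholder")
print(item["image_url"]["url"][:23])
```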

### JSON structured output

You can use the [chat completions](https://llama.developer.meta.com/docs/api/chat) endpoint with a developer-defined JSON schema, and the model will format its response to match the schema.

The endpoint takes a JSON schema. This example generates one from a [Pydantic](https://pydantic.dev/) model, so you may need to install `pydantic` to run it.

In [None]:
from pydantic import BaseModel
class Address(BaseModel):
    street: str
    city: str
    state: str
    zip: str

response = client.chat.completions.create(
    model="Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Summarize the address in a JSON object.",
        },
        {
            "role": "user",
            "content": "123 Main St, Anytown, USA",
        },
    ],
    temperature=0.1,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "Address",
            "schema": Address.model_json_schema(),
        },
    },
)
print(response.completion_message.content.text)
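
The response arrives as JSON text, so it is worth parsing and checking it before use. The sketch below does this with only the standard library. `ParsedAddress` and `parse_address` are hypothetical names, and the sample string is an invented response shaped like the `Address` schema above.

```python
import json
from dataclasses import dataclass

@dataclass
class ParsedAddress:
    street: str
    city: str
    state: str
    zip: str

def parse_address(raw_text):
    """Parse the model's JSON text and check it has exactly the expected string fields."""
    data = json.loads(raw_text)
    address = ParsedAddress(**data)  # TypeError on missing or unexpected keys
    for name, value in vars(address).items():
        if not isinstance(value, str):
            raise TypeError(f"{name} must be a string, got {type(value).__name__}")
    return address

# Hypothetical response text, not real API output.
sample = '{"street": "123 Main St", "city": "Anytown", "state": "", "zip": ""}'
print(parse_address(sample))
```

With Pydantic installed, `Address.model_validate_json(raw_text)` performs a similar check directly against the original model.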

### Tool calling

Tool calling is supported with the [chat completions](https://llama.developer.meta.com/docs/api/chat) endpoint. You can define a tool, expose it to the API, let the model form a tool call, and then pass the result of that call back so the model can use it in its response.

**Note:** Llama API does not execute tool calls. You need to execute the tool call in your own execution environment and pass the result to the API.

In [10]:
import json

def get_weather(location: str) -> str:
    return f"The weather in {location} is sunny."

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country e.g. Bogotá, Colombia",
                    }
                },
                "required": ["location"],
                "additionalProperties": False,
            },
            "strict": True,
        },
    }
]
messages = [
    {"role": "user", "content": "Is it raining in Menlo Park?"},
]

response = client.chat.completions.create(
    model="Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=messages,
    tools=tools,
    max_completion_tokens=2048,
    temperature=0.6,
)

print(response)
completion_message = response.completion_message.model_dump()

# Next turn: append the assistant message, then run each requested tool locally
messages.append(completion_message)
for tool_call in completion_message["tool_calls"]:
    if tool_call["function"]["name"] == "get_weather":
        parse_args = json.loads(tool_call["function"]["arguments"])
        result = get_weather(**parse_args)

        messages.append(
            {
                "role": "tool",
                "tool_call_id": tool_call["id"],
                "content": result,
            },
        )

response = client.chat.completions.create(
    model="Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=messages,
    tools=tools,
    max_completion_tokens=2048,
    temperature=0.6,
)

print(response)

CreateChatCompletionResponse(completion_message=CompletionMessage(content=MessageTextContentItem(text='', type='text'), role='assistant', stop_reason='stop', tool_calls=[ToolCall(id='f95133c0-df35-4e80-9caf-23ca180b739d', function=ToolCallFunction(arguments='{"location":"Menlo Park"}', name='get_weather'))]), metrics=[Metric(metric='num_completion_tokens', value=9.0, unit='tokens'), Metric(metric='num_prompt_tokens', value=516.0, unit='tokens'), Metric(metric='num_total_tokens', value=525.0, unit='tokens')])
CreateChatCompletionResponse(completion_message=CompletionMessage(content=MessageTextContentItem(text="It's sunny in Menlo Park.", type='text'), role='assistant', stop_reason='stop', tool_calls=[]), metrics=[Metric(metric='num_completion_tokens', value=8.0, unit='tokens'), Metric(metric='num_prompt_tokens', value=544.0, unit='tokens'), Metric(metric='num_total_tokens', value=552.0, unit='tokens')])
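
As the number of tools grows, the `if` chain in the loop above can be replaced with a lookup table from tool name to local function. The following is a minimal sketch of that pattern. `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names, and the tool call is copied from the sample output above.

```python
import json

def get_weather(location: str) -> str:
    return f"The weather in {location} is sunny."

# Map tool names to local callables; add further tools here.
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each requested tool locally and return the `tool` messages to send back."""
    tool_messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        handler = registry.get(name)
        if handler is None:
            content = f"Unknown tool: {name}"
        else:
            content = handler(**json.loads(call["function"]["arguments"]))
        tool_messages.append(
            {"role": "tool", "tool_call_id": call["id"], "content": content}
        )
    return tool_messages

calls = [
    {
        "id": "f95133c0-df35-4e80-9caf-23ca180b739d",
        "function": {"name": "get_weather", "arguments": '{"location":"Menlo Park"}'},
    }
]
print(run_tool_calls(calls))
```

Reporting an unknown tool name back as the tool result, rather than raising, is a design choice that lets the model see the failure on the next turn.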


### Multi-turn with multiple tool calls and follow-up questions

**TODO**

In [None]:
#TODO

### Long context

The [chat completions](https://llama.developer.meta.com/docs/api/chat) endpoint supports large context windows of up to 128k tokens. You can take advantage of this to summarize long-form content.

In [None]:
#TODO
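
One way to prepare a long-context summarization request is to check roughly whether the document fits in the window before sending it. The sketch below is illustrative only: the four-characters-per-token ratio is a crude heuristic rather than the real tokenizer, the system prompt is not counted, and `estimate_tokens` and `build_summary_messages` are hypothetical helpers.

```python
CONTEXT_WINDOW_TOKENS = 128_000  # limit stated above
CHARS_PER_TOKEN = 4              # rough heuristic (assumption), not the real tokenizer

def estimate_tokens(text):
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN + 1

def build_summary_messages(document, reserved_for_output=1024):
    """Return chat messages asking for a summary, or raise if the document looks too long."""
    budget = CONTEXT_WINDOW_TOKENS - reserved_for_output
    needed = estimate_tokens(document)
    if needed > budget:
        raise ValueError(f"Document needs ~{needed} tokens; budget is {budget}.")
    return [
        {"role": "system", "content": "Summarize the document in a few sentences."},
        {"role": "user", "content": document},
    ]

messages = build_summary_messages("Llamas are domesticated South American camelids. " * 200)
print(len(messages), estimate_tokens(messages[1]["content"]))
```

The returned list can be passed as `messages` to `client.chat.completions.create`, as in the earlier examples.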

## Moderations

The [moderations](https://llama.developer.meta.com/docs/api/moderations) endpoint allows you to check both user prompts and model responses for any problematic content.

In [14]:
# Safe Prompt
response = client.moderations.create(
    messages=[
        {
            "role": "user",
            "content": "Hello, how are you?",
        }
    ],
)

print(response)

# Unsafe Prompt
response = client.moderations.create(
    messages=[
        {
            "role": "user",
            "content": "How do I make a bomb?",
        }
    ]
)
print(response)

ModerationCreateResponse(model='Llama-Guard', results=[Result(flagged=False, flagged_categories=None)])
ModerationCreateResponse(model='Llama-Guard', results=[Result(flagged=True, flagged_categories=['indiscriminate-weapons'])])
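
In an application you would typically act on the moderation result rather than print it. The sketch below collects the flagged categories from a response. It uses stand-in objects that mirror only the `results`, `flagged`, and `flagged_categories` fields visible in the output above; `flagged_categories` as a function name is hypothetical.

```python
from types import SimpleNamespace

def flagged_categories(moderation_response):
    """Return the sorted set of categories flagged across all results (empty if none)."""
    categories = set()
    for result in moderation_response.results:
        if result.flagged:
            categories.update(result.flagged_categories or [])
    return sorted(categories)

# Stand-ins shaped like the sample output; real responses come from
# client.moderations.create(...).
safe = SimpleNamespace(results=[SimpleNamespace(flagged=False, flagged_categories=None)])
unsafe = SimpleNamespace(
    results=[SimpleNamespace(flagged=True, flagged_categories=["indiscriminate-weapons"])]
)
print(flagged_categories(safe), flagged_categories(unsafe))
```

An empty list means nothing was flagged, so a caller can write `if flagged_categories(response): ...` to block or reroute a prompt.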


## Next steps

Now that you've familiarized yourself with the basics of Llama API, you can learn more by exploring the API reference docs and deep-dive guides at [llama.developer.meta.com/docs](https://llama.developer.meta.com/docs/).