---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<br><br>
<h1 align="center">Lec-07: Accessing Open-Source AI Models via Groq, Hugging Face and OpenAI-Compatible Inference APIs</h1>

# Learning agenda of this notebook

1. Accessing Open-Source AI Models hosted on **Hugging Face Hub**
    - Access Option 1: Access with Hugging Face InferenceClient
    - Access Option 2: Access with OpenAI Chat Completions API (Hugging Face Router)
    - Access Option 3: Access with OpenAI Responses API (Hugging Face Router)
2. Hands-On Practice Examples with Hugging Face Hosted Models using OpenAI's `Responses` API
3. Accessing Open-Source AI Models hosted on **Groq**
    - Access Option 1: Access with Groq Chat Completions API
    - Access Option 2: Access with OpenAI Chat Completions API (Groq Router)
    - Access Option 3: Access with OpenAI Responses API (Groq Router)
4. Hands-On Practice Examples with Groq Hosted Models using OpenAI's `Responses` API 

# <span style='background :lightgreen' >Recap: Ways to Access Open Source LLMs</span>
### (i) Access Open-Source Models via Cloud-Based Providers (Driving a fully automatic car: everything managed for you)
* Cloud inference providers host the models for you, removing the need for GPUs, scaling infrastructure, or deployment engineering.
* You interact with models using simple HTTP calls or OpenAI-compatible APIs, making it the quickest way to use LLMs in production.
* Services like **Groq** offer ultra-fast inference on custom LPU hardware with extremely low latency for models such as Llama, Qwen, Mixtral, Whisper, etc.
* **Hugging Face Inference** provides access to 1M+ models via Serverless Inference, TGI, or Inference Endpoints: pay-as-you-go, secure, and instantly deployable.
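
As a rough illustration of the "simple HTTP calls" mentioned above, the sketch below builds an OpenAI-style chat completions request using only the Python standard library. It assumes the Hugging Face router URL used later in this notebook (`https://router.huggingface.co/v1`) and the conventional `/chat/completions` path. `build_chat_request` and `send_chat_request` are hypothetical helper names, and the token is a placeholder.

```python
import json
import urllib.request

# Assumption: the OpenAI-compatible router URL used later in this notebook.
HF_ROUTER_BASE_URL = "https://router.huggingface.co/v1"


def build_chat_request(token, model, messages, base_url=HF_ROUTER_BASE_URL):
    """Build (but do not send) an OpenAI-style chat completions request."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request, timeout=30):
    """Send the request and return the assistant's reply (needs network access)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    token="hf_xxx",  # placeholder, not a real token
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of Pakistan?"}],
)
print(request.full_url)
```

With a real token, `send_chat_request(request)` would perform the call. The client libraries used in the rest of the notebook wrap this same request/response exchange.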

### (ii) Run Open-Source Models locally using runtimes (Driving an automatic car: local but simple, no gears or engineering)

### (iii) Use Open-Source Models via Hugging Face `pipeline()` API (Driving a manual car: you see more of the mechanics, but still a car someone else built)

### (iv) Load and run models directly from Hugging Face Hub using `AutoModel/AutoTokenizer` (Opening the hood and adjusting or replacing engine components)


### (v) Fine-Tune LLMs using full fine-tuning or PEFT methods (LoRA / QLoRA / adapters) (Upgrading and re-calibrating the engine to suit your driving style)

### (vi) Build and train an AI Model from scratch using PyTorch / TensorFlow (Designing and building the entire car from raw parts: full control, full responsibility)


# <span style='background :lightgreen' >1. 🤗 Hugging Face Models: How Can You Use Them?</span>

### HF Serverless Inference (`provider="hf-inference"`)
- HF runs the model on their own GPUs with no external vendor involved.
- Only works for small, non-gated models.
- Free to use with a valid `HF_TOKEN`.
- Identifiable by the live Inference API widget on the model page.
- Examples: `microsoft/DialoGPT-medium`, `google/gemma-2-2b-it`
### 3rd-Party Vendors via HF (`provider="sambanova"`, `"together"`, `"fireworks-ai"`, `"cerebras"`, `"replicate"` etc.)
- The actual inference runs on the vendor's hardware, not HF's. HF acts only as the broker; you still use `InferenceClient` with your `HF_TOKEN`.
- Supports large and gated models that HF Serverless cannot handle.
- `provider="auto"` lets HF automatically pick the first available vendor for your model.
- Examples: `meta-llama/Llama-3.1-8B-Instruct`, `microsoft/phi-4`
### Local Download via llama.cpp / Ollama (GGUF format)
- You can download models in GGUF format to be run via local runtimes like Ollama or llama.cpp.
- These are quantized versions of popular models: much smaller in memory footprint and runnable on consumer hardware (even CPU).
- Examples: `bartowski/Meta-Llama-3.1-8B-Instruct-GGUF`, `TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF`
### Local Download via `transformers`
- You can download and run most models locally using the `transformers` library.
- Examples: `tiiuae/falcon-180B-chat`, `tiiuae/falcon-40b-instruct`

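>- Gated models: Some models, such as Llama-3 and Gemma, require you to accept a license on the HF website before downloading; otherwise the request fails with an authorization error (typically `403` for a gated repository, or `401` if the token is missing or invalid) even when your token is valid.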
    1. Click "Agree and access repository" (or similar button)
    2. Accept the license terms
    3. Wait for access approval (usually granted automatically or within hours)
    4. Once granted, the UI will indicate: *"You have been granted access to this model"* or similar confirmation

In [1]:
import os
from dotenv import load_dotenv
load_dotenv('../keys/.env', override=True) 

hf_token = os.getenv('HF_TOKEN')
if hf_token:
    print(f"Hugging Face token exists and begins with {hf_token[:7]}")
else:
    print("Hugging Face token not set")

Hugging Face token exists and begins with hf_oEyH


## Access Option 1: Access with Hugging Face InferenceClient API
```python
InferenceClient(
    model: Optional[str] = None,
    provider: Union[Literal[...], "auto", None] = None,
    token: Optional[str] = None,
    timeout: Optional[float] = None,
    headers: Optional[dict[str, str]] = None,
    cookies: Optional[dict[str, str]] = None,
    bill_to: Optional[str] = None,
    base_url: Optional[str] = None,
    api_key: Optional[str] = None,  # alias to token
    proxies: Optional[Any] = None,  # in some versions
)
```

In [2]:
# Using HF InferenceClient() and chat_completion() method
import huggingface_hub
import os
from dotenv import load_dotenv

load_dotenv('../keys/.env', override=True) 
hf_token = os.getenv('HF_TOKEN')

# Hugging Face supports multiple back-end inference providers (e.g., "hf-inference", "sambanova", "cerebras", "together", "replicate", "fireworks-ai").
# With provider="auto", HF picks the first available provider for the model automatically.
client = huggingface_hub.InferenceClient(
                        model="meta-llama/Llama-3.1-8B-Instruct",  # Model ID from the Hugging Face Hub, or a URL to a deployed inference endpoint
                        provider="auto",             
                        token=hf_token
                        )
response = client.chat_completion(
                                messages=[
                                            {"role": "system", "content": "You are a helpful assistant."},
                                            {"role": "user", "content": "What is the capital of Pakistan?"}
                                        ],
                                max_tokens=None,         # Default: None (no limit, up to model's max)
                                temperature=None,        # Default: None (provider's default, usually ~0.7-1.0)
                                top_p=1.0,               # Default: None (provider's default, usually 1.0)
                                stream=False,            # Default: False
                                stop=None,               # default: None, can provide list of stop tokens
                                presence_penalty=0.0,    # Default: None (provider's default, usually 0.0)
                                frequency_penalty=0.0    # Default: None (provider's default, usually 0.0)
                            )

print(response.choices[0].message.content) #the actual text answer you want
print(response.model) # which model produced the result
print(response.usage) #token usage details (handy for cost/efficiency if you were on OpenAI).

The capital of Pakistan is Islamabad.
llama3.1-8b
ChatCompletionOutputUsage(completion_tokens=8, prompt_tokens=48, total_tokens=56, completion_tokens_details={'accepted_prediction_tokens': 0, 'rejected_prediction_tokens': 0, 'reasoning_tokens': 0}, prompt_tokens_details={'cached_tokens': 0})


## Access Option 2: Access with OpenAI Chat Completion API (Hugging Face Router)
- Here are the key differences between the two approaches:
    - Library dependency: `InferenceClient` is part of the `huggingface_hub` package and is Hugging Face-native, while the OpenAI approach uses the `openai` package pointed at Hugging Face's router.
    - Provider routing: `InferenceClient` offers automatic provider selection with `provider="auto"` and can route through multiple inference providers (Replicate, Together AI, SambaNova, etc.), while the OpenAI client requires you to set `base_url` manually.
    - Additional features: `InferenceClient` supports multiple task types beyond chat (text-to-image, embeddings, speech processing), while the OpenAI-compatible router currently targets chat-style text generation.
    - Parameter flexibility: `InferenceClient` has an `extra_body` parameter for provider-specific settings and more flexible initialization options, while the OpenAI client is built around the standard OpenAI parameters (its own `extra_body` argument can pass extra fields through).
    - Syntax compatibility: Both return the same response structure, since `InferenceClient` exposes `client.chat.completions.create()` as an alias of `client.chat_completion()` for OpenAI compatibility.
    - Use case optimization: `InferenceClient` is optimized for the Hugging Face ecosystem with built-in provider management, while the OpenAI client approach is better if you already use OpenAI syntax across your codebase and want minimal changes.

>- **HuggingFace provides OpenAI-compatible endpoints through https://router.huggingface.co/v1**

In [3]:
# Using OpenAI's Chat Completions API
import os
from dotenv import load_dotenv
from openai import OpenAI

# Load Hugging Face token from .env
load_dotenv("../keys/.env", override=True)
hf_token = os.getenv("HF_TOKEN")

# Create an OpenAI client instance and specify the base_url as "https://router.huggingface.co/v1" (OpenAI-compatible API endpoint).
# "https://router.huggingface.co/v1" is Hugging Face's OpenAI-compatible "inference router" endpoint. It acts like a universal gateway that proxies your request to the correct model backend.
client = OpenAI(base_url="https://router.huggingface.co/v1", api_key=hf_token) 

# Use OpenAI's Chat Completions API (routed through Hugging Face)
response = client.chat.completions.create(
                                            model="meta-llama/Llama-3.1-8B-Instruct", # "openai/gpt-oss-20b:novita"
                                            messages=[
                                                        {"role": "system", "content": "You are a helpful assistant."},
                                                        {"role": "user", "content": "What is the capital of Pakistan?"}
                                                    ],
                                            temperature=1,
                                            top_p=1,
                                            max_completion_tokens=8192,
                                            reasoning_effort=None,   # "medium"
                                            stream=False
                                        )

print(response.choices[0].message.content)
print(response.model_dump_json(indent=4))

The capital of Pakistan is Islamabad.
{
    "id": "chatcmpl-cf051685-e85d-456f-8014-d0efac5a7054",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": "The capital of Pakistan is Islamabad.",
                "refusal": null,
                "role": "assistant",
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": null
            }
        }
    ],
    "created": 1771350734,
    "model": "llama3.1-8b",
    "object": "chat.completion",
    "service_tier": null,
    "system_fingerprint": "fp_5198798116a66ebf301b",
    "usage": {
        "completion_tokens": 8,
        "prompt_tokens": 48,
        "total_tokens": 56,
        "completion_tokens_details": {
            "accepted_prediction_tokens": 0,
            "audio_tokens": null,
            "reasoning_tokens": 0,
            "rejected_p

## Access Option 3: Access with OpenAI Responses API (Hugging Face Router)

In [4]:
import os
from dotenv import load_dotenv
from openai import OpenAI

# Load Hugging Face token from .env
load_dotenv("../keys/.env", override=True)
hf_token = os.getenv("HF_TOKEN")

# Create an OpenAI client instance and specify the base_url as "https://router.huggingface.co/v1" (OpenAI-compatible API endpoint).
# "https://router.huggingface.co/v1" is Hugging Face's OpenAI-compatible "inference router" endpoint. It acts like a universal gateway that proxies your request to the correct model backend.
client = OpenAI(base_url="https://router.huggingface.co/v1", api_key=hf_token)

# Use OpenAI Responses API (routed through Hugging Face)
response = client.responses.create(
                                    model="meta-llama/Llama-3.1-8B-Instruct", 
                                    input=[
                                            {"role": "developer", "content": "You are a helpful assistant."},
                                            {"role": "user", "content": "What is the capital of Pakistan?"}
                                            ],
                                    temperature=1,
                                    top_p=1,
                                    max_output_tokens=8192,
                                    stream=False
                                )

# Display the model's response
print(response.output_text)
print(response.model_dump_json(indent=4))

The capital of Pakistan is Islamabad.
{
    "id": "resp_b36c9fc193b30318cebd9b8445fb67f2855cbce7552eb379",
    "created_at": 1771350731.0,
    "error": null,
    "incomplete_details": null,
    "instructions": null,
    "metadata": null,
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "object": "response",
    "output": [
        {
            "id": "msg_3fcd528d57a40fb54912481a5c17f23c2d49f985b1f7ffac",
            "content": [
                {
                    "annotations": [],
                    "text": "The capital of Pakistan is Islamabad.",
                    "type": "output_text",
                    "logprobs": null
                }
            ],
            "role": "assistant",
            "status": "completed",
            "type": "message"
        }
    ],
    "parallel_tool_calls": null,
    "temperature": 1.0,
    "tool_choice": "auto",
    "tools": [],
    "top_p": 1.0,
    "background": null,
    "completed_at": null,
    "conversation": null,
    "max_outp

# <span style='background :lightgreen' >2. Hands-On Practice Examples with Hugging Face Hosted Models using OpenAI's `Responses` API</span>

## a. Writing a Helper Function for Convenience

In [5]:
import os
from dotenv import load_dotenv
from openai import OpenAI

# Load Hugging Face token from .env
load_dotenv("../keys/.env", override=True)
hf_token = os.getenv("HF_TOKEN")


# Create an OpenAI client instance and specify the base_url as "https://router.huggingface.co/v1" (OpenAI-compatible API endpoint).
# "https://router.huggingface.co/v1" is Hugging Face's OpenAI-compatible "inference router" endpoint. It acts like a universal gateway that proxies your request to the correct model backend.
client = OpenAI(base_url="https://router.huggingface.co/v1", api_key=hf_token) 

def ask_hf(
    user_prompt: str,
    developer_prompt: str = "You are a helpful assistant that provides concise answers.",
    model: str = "meta-llama/Llama-3.1-8B-Instruct", 
    max_output_tokens: int = 1024,
    temperature: float = 0.7,
    top_p: float = 1.0,
    stream: bool = False
):
    input_messages = [{"role": "developer", "content": developer_prompt}, {"role": "user", "content": user_prompt}]
    # Responses API call without unsupported parameters
    response = client.responses.create(
                                        model=model,
                                        input=input_messages,
                                        max_output_tokens=max_output_tokens,
                                        temperature=temperature,
                                        top_p=top_p,
                                        stream=stream
                                        )

    if stream:
        return response  # Streaming generator
    return response.output_text   # Aggregated text output
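
`response.output_text` is a convenience property of the OpenAI SDK. As a sketch of what it aggregates, the helper below walks the same `output` → `content` → `text` structure that `model_dump_json()` printed in Access Option 3. `extract_output_text` is a hypothetical name, and the sample dictionary is trimmed from that dump.

```python
def extract_output_text(response: dict) -> str:
    """Concatenate every output_text part found in a dumped Responses API result."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message output items
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                parts.append(part.get("text", ""))
    return "".join(parts)


# Trimmed copy of the structure printed earlier in the notebook.
sample = {
    "object": "response",
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "status": "completed",
            "content": [
                {
                    "type": "output_text",
                    "text": "The capital of Pakistan is Islamabad.",
                    "annotations": [],
                }
            ],
        }
    ],
}

print(extract_output_text(sample))  # The capital of Pakistan is Islamabad.
```

With a live response object, `extract_output_text(response.model_dump())` should give the same text as `response.output_text`, assuming the router returns the structure shown in that dump.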

## a. Examples (Question Answering)

In [6]:
developer_prompt = "You are an assistant that tells light-hearted jokes."
user_prompt = "Tell a light-hearted joke for an audience of Data Scientists."

response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

Why did the regression model go to therapy?

Because it was struggling to generalize its feelings.


In [7]:
developer_prompt = "You are a bedtime storyteller."
user_prompt = "Tell me a bedtime story of Ali Baba and Chalees Chor"

# Get streaming generator from Responses API
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, stream=True)

# Iterate through streaming events and only print text deltas
for event in response:
    # Each event may contain incremental text in event.delta
    if hasattr(event, "delta") and event.delta:
        print(event.delta, end="", flush=True) # prints the content from this chunk, end="" prevents adding a newline after each  piece and flush=True forces flushing output to screen

What a delightful tale I have in store for you.  In a far-off land, there existed a magical cave where treasures beyond imagination lay hidden. The cave was guarded by the infamous Chalees Chor, a group of 40 thieves who were as cunning as they were ruthless.

Ali Baba, a humble merchant, had been searching for years to find this enchanted cave. His eyes had grown dim with age, and his hands were weak from years of hard labor. But he refused to give up his quest. One day, while traversing the desert, he stumbled upon a mysterious voice that whispered to him the secret of the cave: "Open it not with a loud voice, but with a soft whisper, and the treasure shall be yours."

With trembling hands, Ali Baba made his way to the cave, repeating the magical words to himself. As he pushed open the massive stone door, a soft whisper escaped his lips, and the door creaked open. The cave was filled with glittering treasures: gold, jewels, and precious artifacts that sparkled like stars in the night
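
The loop above prints every event that carries a `delta` attribute. A slightly stricter variant, sketched below, keeps only text deltas by checking the event type. It assumes the router emits OpenAI-style `response.output_text.delta` events. `iter_text_deltas` and `collect_text` are hypothetical helpers, and the fake events stand in for a real stream.

```python
from types import SimpleNamespace


def iter_text_deltas(events):
    """Yield only the incremental text pieces from a Responses API event stream."""
    for event in events:
        # Assumption: text chunks arrive as "response.output_text.delta" events.
        if getattr(event, "type", None) == "response.output_text.delta":
            delta = getattr(event, "delta", None)
            if delta:
                yield delta


def collect_text(events):
    """Join all text deltas into the complete reply."""
    return "".join(iter_text_deltas(events))


# Fake events that mimic the shape of a stream, for offline illustration only.
fake_events = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="Once upon "),
    SimpleNamespace(type="response.output_text.delta", delta="a time"),
    SimpleNamespace(type="response.completed"),
]

print(collect_text(fake_events))  # Once upon a time
```

In the notebook, the same generator could wrap the stream returned by `ask_hf(..., stream=True)`, printing each piece with `end=""` and `flush=True` as before.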

## b. Examples (Question Answering from Different Models)

#### Asking date from "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
- The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding.
    - `Llama 4 Scout`, a 17 billion active parameter model with 16 experts, and
    - `Llama 4 Maverick`, a 17 billion active parameter model with 128 experts.

In [8]:
developer_prompt = "You are an assistant that is great at telling jokes"
user_prompt = "Tell a light-hearted joke for an audience of Data Scientists"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

Why did the linear regression model go to therapy?

Because it was feeling a little "regressed" and was struggling to find the right "fit" in its life.

(Sorry, I know it's a bit of a "statistically" weak pun, but I'm sure it's a "correlation" between humor and nerdiness that will resonate with you Data Scientists!)


In [9]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
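# No model is passed here, so the default ("meta-llama/Llama-3.1-8B-Instruct") answers; pass model="meta-llama/Llama-4-Maverick-17B-128E-Instruct" to query Maverick.
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt)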
print(response)

The date is February 17, 2024.


#### Asking date from "meta-llama/Llama-4-Scout-17B-16E-Instruct"
- The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding.
    - `Llama 4 Scout`, a 17 billion active parameter model with 16 experts, and
    - `Llama 4 Maverick`, a 17 billion active parameter model with 128 experts.

In [10]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
print(response)

I'm not sure what you're referring to, as my knowledge stopped in 1967.


#### Asking date from "meta-llama/Llama-3.2-3B-Instruct"
- A text-to-text instruction-tuned LLM (3B params) for conversational AI, Q&A, and task-oriented responses.

In [11]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model="meta-llama/Llama-3.2-3B-Instruct")
print(response)

I'm not currently able to share the date.


#### Asking date from "meta-llama/Llama-3.1-8B-Instruct"
- A text-to-text instruction-tuned LLM (8B params) for conversational AI, Q&A, and task-oriented responses.  

In [12]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model="meta-llama/Llama-3.1-8B-Instruct")
print(response)

The date I'm aware of is February 17, 2024. However, my knowledge cutoff is December 2023, I may not have information on very recent events or dates after my cutoff.


#### Asking date from "meta-llama/Meta-Llama-3-70B-Instruct"
- A text-to-text instruction-tuned LLM (70B params) used for reasoning, coding, and complex text tasks.

In [13]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model="meta-llama/Meta-Llama-3-70B-Instruct")
print(response)

I'm an AI, I don't have real-time access to the current date. However, I can suggest ways for you to find out the current date.

You can:

1. Check your device's clock or calendar app.
2. Search online for "current date" or "today's date."
3. Look at a physical calendar or planner.

If you need help with anything else, feel free to ask!


#### Asking date from "Qwen/Qwen2.5-7B-Instruct"
- A text-to-text instruction-tuned LLM (7B params) supporting multi-turn chat, reasoning, and multilingual capabilities.

In [14]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model="Qwen/Qwen2.5-7B-Instruct")
print(response)

As an AI, I don't have real-time capabilities and I don't have access to current dates. To find today's date, please check your device or online for the current date.


#### Asking date from "deepseek-ai/DeepSeek-V3.1"
- A text-to-text large LLM designed for reasoning, problem-solving, and multilingual dialogue.

In [15]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model="deepseek-ai/DeepSeek-V3.1")
print(response)

I don't have real-time capabilities, so I can't access the current date. You can check today's date on your device, phone, or computer. Let me know if you need help with anything else! 😊
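
The answers above vary because a model has no clock: it can only guess from its training data. One common workaround, sketched here as an assumption rather than part of the original notebook, is to put the current date into the developer prompt. `with_current_date` is a hypothetical helper.

```python
from datetime import date


def with_current_date(developer_prompt, today=None):
    """Append today's date (ISO format) to a developer prompt."""
    today = today or date.today()
    return f"{developer_prompt} Today's date is {today.isoformat()}."


# A fixed date keeps the example reproducible; omit `today` to use the real date.
print(with_current_date("You are a helpful assistant.", today=date(2025, 1, 5)))
# You are a helpful assistant. Today's date is 2025-01-05.
```

The result can be passed as `developer_prompt` to `ask_hf`, so the model reads the date from its context instead of guessing it.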


##  c. Question Answering from Content Passed

In [16]:
!cat ../data/names.txt

Cricket in Pakistan has always been more than just a sport - it's a source of national pride and unity. Legendary players like Imran Khan, Wasim Akram, and Shahid Afridi set high standards in the past, inspiring generations to follow. Today, stars such as Babar Azam, Shaheen Shah Afridi, and Shadab Khan carry forward the legacy, leading the national team in international tournaments with skill and determination. Their performances not only thrill fans but also keep Pakistan among the top cricketing nations of the world.

Politics in Pakistan, meanwhile, remains dynamic and often turbulent, with key figures shaping the country's direction. Leaders like Nawaz Sharif, Asif Ali Zardari, and Imran Khan have all held significant influence over the nation's governance and policies. In recent years, the political scene has seen sharp divisions, with parties such as the Pakistan Muslim League-Nawaz (PML-N), Pakistan Peoples Party (PPP), and Pakistan Tehreek-e-Insaf (PTI) competing for pow

In [17]:
with open("../data/names.txt", "r") as f:
    file_content = f.read()

user_prompt = f"Can you extract the names of the politicians from this text:\n{file_content}"
response = ask_hf(user_prompt=user_prompt)
print(response)

The politicians mentioned in the text are:

1. Imran Khan
2. Nawaz Sharif
3. Asif Ali Zardari

Note that some of the other individuals mentioned in the text (such as Babar Azam and Shaheen Shah Afridi) are not politicians, but rather cricketers.


## c. Examples (Binary Classification: Sentiment analysis, Spam detection, Medical diagnosis)

In [18]:
developer_prompt = "You are an expert who will classify a sentence as having either a Positive or Negative sentiment."
user_prompt = "I love the youtube videos of Arif, as they are very informative"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

However, I need more information about Arif. There are many individuals and channels with the name Arif on YouTube. Can you please provide more context or details about the Arif you're referring to, such as:

1. The specific topic or niche he creates content around (e.g., science, technology, entertainment, etc.)?
2. The tone of his videos (e.g., educational, entertaining, serious, etc.)?
3. Any distinctive features or characteristics of his videos (e.g., animation, interviews, experiments, etc.)?
4. Approximate number of subscribers or views of his channel?

This will help me narrow down the search and provide more accurate information about the Arif you're referring to.
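
In the run above the model did not return a label. One possible workaround, assuming the instruction in the developer message is not reaching the model reliably, is to place the instruction and the sentence together in the user message and mark the sentence with delimiters. `build_sentiment_prompt` is a hypothetical helper, and the `<sentence>` tags are an arbitrary choice of delimiter.

```python
def build_sentiment_prompt(sentence: str) -> str:
    """Combine the task instruction and the delimited input in one user message."""
    return (
        "Classify the sentiment of the sentence between the <sentence> tags "
        "as Positive or Negative. Reply with exactly one word.\n"
        f"<sentence>{sentence}</sentence>"
    )


prompt = build_sentiment_prompt(
    "I love the youtube videos of Arif, as they are very informative"
)
print(prompt)
```

The resulting string can be sent as `user_prompt` to `ask_hf`. Whether this changes the reply depends on the model and provider, so it is worth comparing both variants.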


## d. Examples (Multi-class Classification)

In [19]:
developer_prompt = "Classify product reviews into these categories: 'Electronics', 'Clothing', 'Books', 'Home & Garden', 'Sports', or 'Food'. \
Respond with only the category."
user_prompt = "This novel has an incredible plot twist that kept me reading all night"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

It sounds like the novel had a big impact on you.  A well-executed plot twist can be a great way to keep readers engaged and invested in the story.  What was the twist in the novel you read, if you don't mind me asking?
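
Because a reply can drift away from the requested format, as it did above, it can help to validate the label in code and ask again when it does not match. The sketch below assumes a callable `ask` that takes a prompt and returns text (for example, a thin wrapper around `ask_hf`). `match_category` and `classify_with_retry` are hypothetical names.

```python
CATEGORIES = ["Electronics", "Clothing", "Books", "Home & Garden", "Sports", "Food"]


def match_category(reply, categories=CATEGORIES):
    """Return the category the reply names exactly (ignoring case/quotes), else None."""
    cleaned = reply.strip().strip("'\".").casefold()
    for category in categories:
        if cleaned == category.casefold():
            return category
    return None


def classify_with_retry(ask, review, max_attempts=3):
    """Call `ask` until it returns a valid category or the attempts run out."""
    prompt = (
        "Classify the product review into one of: "
        + ", ".join(CATEGORIES)
        + ". Respond with only the category.\nReview: "
        + review
    )
    for _ in range(max_attempts):
        label = match_category(ask(prompt))
        if label is not None:
            return label
    return None


# Offline stand-in for a model: a chatty reply first, then a valid label.
canned_replies = iter(["It sounds like the novel had a big impact on you.", "'Books'"])
label = classify_with_retry(
    lambda prompt: next(canned_replies),
    "This novel has an incredible plot twist that kept me reading all night",
)
print(label)  # Books
```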


## e. Examples (Text Generation)

In [20]:
developer_prompt = "You are an expert in political science and history and have a deep understanding of the political situation of Pakistan."
user_prompt = "Write a 50-word summary about the fairness of the general elections held in Pakistan on February 08, 2024."
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, temperature=1.0)
print(response)

I can't provide a summary for an election held in February, 2024, as I do not have information that is up to date.


## f. Examples (Code Generation)

In [21]:
developer_prompt = "You are an expert in C programming."
user_prompt = "Write a C program that generates the first ten numbers of the Fibonacci sequence."
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, stream=True)

# Iterate through streaming events and only print text deltas
for event in response:
    # Each event may contain incremental text in event.delta
    if hasattr(event, "delta") and event.delta:
        print(event.delta, end="", flush=True) # prints the content from this chunk, end="" prevents adding a newline after each  piece and flush=True forces flushing output to screen

**Fibonacci Sequence Generator in C**

The following C program generates the first ten numbers in the Fibonacci sequence.

```c
#include <stdio.h>

// Function to generate Fibonacci sequence
void generateFibonacci(int n) {
    int fib[n];
    fib[0] = 0;
    fib[1] = 1;

    // Generate Fibonacci sequence
    for (int i = 2; i < n; i++) {
        fib[i] = fib[i-1] + fib[i-2];
    }

    // Print the generated Fibonacci sequence
    printf("First %d numbers in the Fibonacci sequence are: ", n);
    for (int i = 0; i < n; i++) {
        printf("%d ", fib[i]);
    }
    printf("\n");
}

int main() {
    int n = 10; // Number of Fibonacci numbers to generate
    generateFibonacci(n);
    return 0;
}
```

**Explanation**
---------------

This program uses a dynamic array `fib` to store the generated Fibonacci sequence. The `generateFibonacci` function initializes the first two elements of the sequence to 0 and 1, respectively. Then, it generates the remaining elements of the sequence using 

## g. Examples (Text Translation)

In [22]:
user_prompt = """
Please act as an expert English-to-Urdu translator and translate the given sentence from English into Urdu.
'The budget this year will have a very bad impact on the low salaried people'
"""
response = ask_hf(user_prompt=user_prompt)
print(response)




## h. Examples (Text Summarization)

In [23]:
developer_prompt = "You are an expert in the English language."

user_prompt = '''
Summarize the text below in at most 20 words:
```The Hugging Face transformers library is an incredibly versatile and powerful tool for natural language processing (NLP).
It allows users to perform a wide range of tasks such as text classification, named entity recognition, and question answering, among others.
It's an extremely popular library that's widely used by the open-source data science community.
It lowers the barrier to entry into the field by providing Data Scientists with a productive, convenient way to work with transformer models.```
'''

response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, temperature=0.2)
print(response)




## i. Examples (Named Entity Recognition)

In [24]:
developer_prompt = """You are a Named Entity Recognition specialist. Extract and classify entities from the given text into these categories only if they exist:
- name
- major
- university
- nationality
- grades
- club
Format your response as: 'Entity: [text] | Type: [category]' with each entity on a new line."""

user_prompt = '''
Zelaid Mujahid is a sophomore majoring in Data Science at University of the Punjab. \
He is a Pakistani national and has a 3.5 GPA. Mujahid is an active member of the department's AI Club. \
He hopes to pursue a career in AI after graduating.
'''

response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)




## j. Example (Grade School Math 8K (GSM8K))

In [25]:
developer_prompt = """You are an expert School math teacher. 
Consider the following text and then answer the questions of the students from this:
A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. 
For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. 
"""
user_prompt = "How much did the booth earn for 5 days after paying the rent and the cost of ingredients?"

response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model='meta-llama/Llama-3.3-70B-Instruct')  # Hugging Face Hub ID; 'llama-3.3-70b-versatile' is a Groq model ID
print(response)




In [26]:
developer_prompt = """You are an expert School math teacher. 
Consider the following text and then answer the questions of the students from this:
A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. 
For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. 
"""
user_prompt = "How much did the booth earn for 5 days after paying the rent and the cost of ingredients?"

response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt,  model='meta-llama/llama-4-maverick-17b-128e-instruct')
print(response)




In [27]:
developer_prompt = """You are an expert School math teacher. 
Consider the following text and then answer the questions of the students from this:
A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. 
For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. 
"""
user_prompt = "How much did the booth earn for 5 days after paying the rent and the cost of ingredients?"

response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model='openai/gpt-oss-20b')
print(response)
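
To judge the answers of the three models, it helps to have the expected result at hand. The sketch below works through the arithmetic using only the numbers stated in the prompt; the variable names are illustrative.

```python
# Reference arithmetic for the carnival booth problem (numbers taken from the prompt)
popcorn_per_day = 50
cotton_candy_per_day = 3 * popcorn_per_day   # "three times as much" as popcorn
days = 5
rent = 30
ingredients = 75

gross = (popcorn_per_day + cotton_candy_per_day) * days   # total sales over 5 days
net = gross - rent - ingredients                          # earnings after costs

print(f"Gross: ${gross}, Net: ${net}")   # Gross: $1000, Net: $895
```

A model response that reaches a different final figure has made a reasoning or arithmetic slip somewhere along the way.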




# <span style='background :lightgreen' >3. Accessing Open-Source AI Models via External Inference Providers</span>

## Famous External Inference Providers
- Many external inference providers host popular open-source models from Hugging Face, offering optimized infrastructure, faster inference, and often generous free tiers, so you avoid the cost of self-hosting. These providers specialize in serving models at scale with production-ready APIs.

| Provider | Best For | Model Selection | Speed | Pricing Model | Free Tier |
|----------|----------|-----------------|-------|---------------|-----------|
| **[Groq](https://console.groq.com)** | Ultra-fast inference, real-time apps | Limited but popular models | ⚡ Fastest | Per-token | Generous |
| **[Together AI](https://api.together.xyz)** | Variety, fine-tuning, batch processing | 100+ models, largest selection | Fast | Per-token (model-specific) | $25 credits |
| **[Replicate](https://replicate.com)** | Ease of use, multimodal, experimentation | 1000s of models (LLM, image, audio, video) | Moderate | Per-second compute time | Limited credits |

<h3 align="center"><div class="alert alert-success" style="margin: 20px">Groq provides ultra-fast inference for open-source models using its custom LPU hardware and offers an OpenAI-compatible API, making it easy to run models like Llama, Mixtral, Qwen, Whisper, and GPT OSS with extremely low latency and minimal setup.</div></h3>

- **Ultra-Fast AI Inference Company** - Groq (https://groq.com) is a company that provides **ultra-fast AI inference** through its specialized hardware, the LPU (Language Processing Unit).
- **Open-Source Model Platform** - Offers 17+ optimized models from Meta (Llama), Google (Gemma), DeepSeek, Alibaba (Qwen), and others - all accessible through a simple API similar to OpenAI's format.
- **Cost-Effective Alternative** - Provides a generous free tier and lower pricing than proprietary APIs like OpenAI or Anthropic, making it ideal for high-volume applications and startups on a budget.
- **Multiple AI Capabilities** - Beyond text generation, Groq supports speech-to-text (Whisper), text-to-speech (PlayAI), content moderation (Llama Guard), and security features (Prompt Guard) - all through one API.
- **No Vendor Lock-In** - Since all models are open-source, you can switch providers or self-host the same models later, giving you flexibility and control over your AI infrastructure without being tied to proprietary technology.
- **Production-Ready Performance** - Combines the quality of state-of-the-art open models (like Llama 3.3 70B) with enterprise-grade speed and reliability, making it suitable for real-time chatbots, customer service, and interactive applications.


## a. Get Groq API Key
- **Create an Account on Groq:** Go to https://console.groq.com/playground and sign up or log in with your Google account. Groq is free to try, and you can do a lot without paying a penny.
- **Generating Groq API Key:** Log in to Groq, navigate to Settings from the user menu at the top right, and create a new Groq API token. Generate a **New Token** (choose `Read` access). Save the token safely; we'll need it in our Python code.
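
Once the key is saved, a quick check that it is actually visible to Python can save debugging time later. The sketch below is a minimal illustration: `require_api_key` and `mask_key` are hypothetical helpers, and the demo value is a made-up placeholder, not a real key. In the notebook the variable would be `GROQ_API_KEY`, loaded from the `.env` file.

```python
import os

def require_api_key(var_name: str = "GROQ_API_KEY") -> str:
    """Return the key stored in the given environment variable, or raise a clear error."""
    key = (os.getenv(var_name) or "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; add it to your .env file or shell environment.")
    return key

def mask_key(key: str, visible: int = 4) -> str:
    """Show only the last few characters so the key can be identified without exposing it."""
    tail = key[-visible:] if visible > 0 else ""
    return "*" * (len(key) - len(tail)) + tail

# Demo with a made-up placeholder value (not a real key)
os.environ["DEMO_API_KEY"] = "gsk_example_placeholder_1234"
print(mask_key(require_api_key("DEMO_API_KEY")))
```

Printing only a masked form keeps the full key out of notebook outputs that may later be shared or committed.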

### **Production Models** (Recommended for Production Use)

| Company    |                               Model ID |            Parameters | Best Used For                                                           |  Context Window | Max Completion               |
| ---------- | -------------------------------------: | --------------------: | ----------------------------------------------------------------------- | --------------: | ---------------------------- |
| **Meta**   |              `llama-3.3-70b-versatile` |                   70B | General-purpose, high-quality instruction following, long-context tasks |     **131,072** | **32,768**                   |
| **Meta**   |                 `llama-3.1-8b-instant` |                    8B | Low-latency chat, high throughput / real-time use cases                 |     **131,072** | **131,072**                  |
| **Meta**   |         `meta-llama/llama-guard-4-12b` |                   12B | Safety / content-moderation guard model                                 |     **131,072** | **1,024**                    |
| **OpenAI** |                  `openai/gpt-oss-120b` |  ~120B (OSS frontier) | High-capability reasoning / production workloads where offered          |     **131,072** | **65,536**                   |
| **OpenAI** |                   `openai/gpt-oss-20b` |                  ~20B | Smaller frontier model for cost-sensitive production use                |     **131,072** | **65,536**                   |
| **OpenAI** |       `whisper-large-v3` (speech→text) |                ~1.55B | High-accuracy speech-to-text (multilingual)                             | N/A (audio model) | N/A (audio/output constraints) |
| **OpenAI** | `whisper-large-v3-turbo` (speech→text) | N/A (optimized variant) | Faster multilingual transcription (low-latency)                         | N/A (audio model) | N/A (audio/output constraints) |


### **Preview Models** (Experimental - Not for Production)


| Company                   |                                        Model ID |                Parameters | Best Used For                                           | Context Window | Max Completion |
| ------------------------- | ----------------------------------------------: | ------------------------: | ------------------------------------------------------- | -------------: | -------------- |
| **Meta**                  | `meta-llama/llama-4-maverick-17b-128e-instruct` | ~17B (Mixture of Experts) | Multimodal assistant experiments, advanced reasoning    |    **131,072** | **8,192**      |
| **Meta**                  |     `meta-llama/llama-4-scout-17b-16e-instruct` | ~17B (Mixture of Experts) | Experimental multimodal / efficient inference           |    **131,072** | **8,192**      |
| **Meta**                  |           `meta-llama/llama-prompt-guard-2-22m` |                      ~22M | Lightweight prompt-injection detection / security       |        **512** | **512**        |
| **Meta**                  |           `meta-llama/llama-prompt-guard-2-86m` |                      ~86M | Stronger prompt-injection detection                     |        **512** | **512**        |
| **Moonshot AI**           |              `moonshotai/kimi-k2-instruct-0905` | 1T total (≈32B activated) | Agentic coding, tool use, very long-context workflows   |    **262,144** | **16,384**     |
| **Alibaba / Qwen**        |                                `qwen/qwen3-32b` |                       32B | Multilingual reasoning, tool use, instruction following |    **131,072** | **40,960**     |
| **PlayAI / Groq catalog** |                                    `playai-tts` |                       N/A | Text-to-speech (general)                                |      **8,192** | **8,192**      |
| **PlayAI / Groq catalog** |                             `playai-tts-arabic` |                       N/A | Arabic text-to-speech                                   |      **8,192** | **8,192**      |


> Production models are intended for use in production environments and meet Groq's high standards for speed, quality, and reliability, while preview models are for evaluation only and may be discontinued at short notice.


>- Kimi-K2 0905: best for coding and agentic workflows that need deep reasoning
>- Compound Beta: power up multi-model workflows in a single API call
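
The "Max Completion" column matters in practice: requesting more output tokens than a model allows can cause the request to be rejected. The sketch below copies a few rows from the tables above into a lookup and caps a requested value accordingly. `MODEL_LIMITS` and `clamp_max_tokens` are hypothetical names, and the numbers are a snapshot of the tables, which Groq may change.

```python
# Limits copied from the tables above; treat them as a snapshot, not a live source
MODEL_LIMITS = {
    "llama-3.3-70b-versatile": {"context": 131_072, "max_completion": 32_768},
    "llama-3.1-8b-instant": {"context": 131_072, "max_completion": 131_072},
    "openai/gpt-oss-20b": {"context": 131_072, "max_completion": 65_536},
    "meta-llama/llama-4-maverick-17b-128e-instruct": {"context": 131_072, "max_completion": 8_192},
    "qwen/qwen3-32b": {"context": 131_072, "max_completion": 40_960},
}

def clamp_max_tokens(model: str, requested: int) -> int:
    """Cap the requested completion length at the model's listed maximum."""
    limits = MODEL_LIMITS.get(model)
    if limits is None:
        return requested  # unknown model: leave the value untouched
    return min(requested, limits["max_completion"])

print(clamp_max_tokens("meta-llama/llama-4-maverick-17b-128e-instruct", 16_384))  # 8192
print(clamp_max_tokens("llama-3.3-70b-versatile", 8_192))                          # 8192
```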

## Access Option 1: Access with Groq Chat Completions API
- Directly uses Groq's native SDK (`groq`) to call Groq-hosted models with full access to Groq-specific features like `reasoning_effort`.
- Advantages:
    - Fully compatible with all Groq model features.
    - Can leverage Groq-specific optimizations (low latency, high throughput).
    - Easy to set up and requires no OpenAI compatibility adjustments.
- Disadvantages:
    - Limited to Groq platform only.
    - Cannot easily switch to other OpenAI-compatible services without code changes.

In [28]:
#!uv add groq
!uv tree | grep groq

Resolved 291 packages in 16ms
├── groq v1.0.0


In [29]:
import os
from dotenv import load_dotenv
from groq import Groq

# Load GROQ API key from .env
load_dotenv("../keys/.env", override=True)
groq_api_key = os.getenv("GROQ_API_KEY")


# Initialize Groq client
# client = Groq(base_url="https://api.groq.com", api_key=groq_api_key)   # Optional: set the endpoint explicitly
client = Groq(api_key=groq_api_key)      # Recommended: the default API endpoint is already built into the SDK, so base_url is not needed

# Use Groq's Chat Completions API
response = client.chat.completions.create(
                                        model="llama-3.3-70b-versatile", 
                                        messages=[
                                                {"role": "system", "content": "You are an expert in LLM engineering."},
                                                {"role": "user", "content": "What is groq (a hardware company)?"}
                                                ],
                                        temperature=1,
                                        top_p=1,
                                        max_completion_tokens=8192,
                                        reasoning_effort=None,   # "medium"
                                        stream=False
                                        )

print(response.choices[0].message.content)

Groq is a hardware company that specializes in developing high-performance, artificial intelligence (AI) focused semiconductors and systems. They design and manufacture bespoke application-specific integrated circuits (ASICs) optimized for large-scale machine learning (ML) workloads.

Groq was founded in 2016 by a team of experienced engineers, including Jonathan Ross, who previously worked at Google on the Tensor Processing Unit (TPU) project. The company is headquartered in Mountain View, California.

Groq's primary focus is on building scalable, high-bandwidth, and low-latency AI accelerators that can efficiently process complex ML models. Their goal is to provide a significant boost in performance, power efficiency, and cost-effectiveness for AI workloads, such as those used in natural language processing, computer vision, and recommender systems.

Some notable features of Groq's technology include:

1. **Tensor-based architecture**: Groq's ASICs are designed to efficiently process

## Access Option 2: Access with OpenAI Chat Completions API (Groq Router)
- Uses the OpenAI Python SDK with the `chat.completions.create` endpoint, pointing `base_url` to Groq's OpenAI-compatible API. Works with almost all Groq models.
- Advantages:
    - Allows using familiar OpenAI client code with Groq models.
    - Works for developers already familiar with OpenAI SDK.
    - Supports chat-based interactions seamlessly.
- Disadvantages:
    - Not all Groq-specific features may be exposed.
    - Requires base_url override to point to Groq API.
>- **Groq provides OpenAI-compatible endpoints through https://api.groq.com/openai/v1**
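
Since the only provider-specific pieces are the base URL and the API key, switching between OpenAI-compatible providers can be reduced to a small lookup. The sketch below illustrates the idea under stated assumptions: `PROVIDERS` and `client_kwargs` are hypothetical names, the Groq URL is the one quoted above, and the Hugging Face router URL and the `HF_TOKEN` variable name are assumptions that should be checked against your own setup.

```python
import os

# Groq URL is quoted in the text above; the Hugging Face entry is an assumption
PROVIDERS = {
    "groq": {"base_url": "https://api.groq.com/openai/v1", "key_env": "GROQ_API_KEY"},
    "huggingface": {"base_url": "https://router.huggingface.co/v1", "key_env": "HF_TOKEN"},
}

def client_kwargs(provider: str) -> dict:
    """Build the keyword arguments an OpenAI-compatible client would need."""
    cfg = PROVIDERS[provider]
    return {"base_url": cfg["base_url"], "api_key": os.getenv(cfg["key_env"])}

# Usage with the OpenAI SDK (commented out so the sketch runs without the package):
# from openai import OpenAI
# client = OpenAI(**client_kwargs("groq"))

print(client_kwargs("groq")["base_url"])   # https://api.groq.com/openai/v1
```

The rest of the calling code stays the same across providers; only the model ID passed to the API call has to match what the chosen provider hosts.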

In [30]:
# Using OpenAI's Chat Completions API
import os
from dotenv import load_dotenv
from openai import OpenAI

# Load GROQ API key from .env
load_dotenv("../keys/.env", override=True)
groq_api_key = os.getenv("GROQ_API_KEY")

# The OpenAI client defaults to OpenAI's servers, so you must set base_url to Groq's OpenAI-compatible API endpoint (when using a Groq API key with the OpenAI client).
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key=groq_api_key) 

# Use OpenAI's Chat Completions API (works with almost all Groq models)
response = client.chat.completions.create(
    model="meta-llama/llama-4-maverick-17b-128e-instruct",
    messages=[
        {"role": "system", "content": "You are an expert in LLM engineering."},
        {"role": "user", "content": "What is groq (a service/cloud provider)?"}
    ],
    temperature=1,
    top_p=1,
    max_completion_tokens=8192,
    reasoning_effort=None,   # "medium"
    stream=False
)

print(response.choices[0].message.content)

Groq is a cloud-based company that provides a high-performance, scalable, and efficient computing platform specifically designed for AI and machine learning (ML) workloads, particularly for large language models (LLMs). The company has developed a unique architecture and hardware designed to accelerate the processing of complex AI models, making it an attractive solution for developers, researchers, and organizations working with AI.

Groq's technology is centered around its proprietary Language Processing Unit (LPU) architecture, which is designed to provide high performance and efficiency for LLMs and other AI workloads. The LPU is optimized for sparse and dense linear algebra, which are critical components of many AI models. The company's platform is built around this architecture, enabling fast and efficient processing of AI workloads.

As a cloud provider, Groq offers a range of services, including:

1. **Inference-as-a-Service**: Groq provides a cloud-based platform for deploying

## Access Option 3: Access with OpenAI Responses API (Groq Router)
- Uses the OpenAI Python SDK with the `responses.create` endpoint, pointing `base_url` to Groq's OpenAI-compatible API.
- Advantages:
    - Access to more advanced OpenAI-style features like reasoning effort, structured outputs, and multi-turn dialogue.
    - Can integrate easily into workflows designed for OpenAI Responses API.
- Disadvantages:
    - Only supports openai/gpt-oss-* models, not all Groq models.
    - Requires base_url override for Groq API.
    - Some chat-specific Groq features may not be available.

In [31]:
# Using OpenAI's Responses API
import os
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display

# Load GROQ API key from .env
load_dotenv("../keys/.env", override=True)
groq_api_key = os.getenv("GROQ_API_KEY")

# The OpenAI client defaults to OpenAI's servers, so you must set base_url to Groq's OpenAI-compatible API endpoint (when using a Groq API key with the OpenAI client).
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key=groq_api_key) 

# Use Responses API (only works with openai/gpt-oss models)
response = client.responses.create(
    model="openai/gpt-oss-20b",
    input=[
        {"role": "system", "content": "You are an expert in LLM engineering."},
        {"role": "user", "content": "Differentiate between LLM apps and Agentic apps"}
    ],
    temperature=1,
    top_p=1,
    max_output_tokens=8192,
    reasoning={"effort":"high"},   # "minimal", "low", "medium", "high"
    stream=False
)

#print(response.output_text)
display(Markdown(response.output_text))

## LLM Apps vs. Agentic Apps  
*How to tell them apart, why the difference matters, and what it means for building and deploying AI-powered software.*

| Feature | **LLM App** | **Agentic App** |
|--------|--------------|-----------------|
| **Primary purpose** | Direct language generation (text, code, summaries, translations, etc.) | Goal-driven problem solving that can plan, act, and learn over time |
| **Core control flow** | Prompt → LLM → Output (single-shot or few-shot) | Goal + State → Planner → Skill/Tool → LLM policy → Action → New State → Loop |
| **State handling** | Stateless (except for the prompt's context window). All "memory" lives in the input prompt or an external KV store if you manually pass it back. | Stateful. The agent keeps long-term memory (knowledge graph, vector store, history logs) and short-term working memory (current plan, context). |
| **Interaction with environment** | None. The LLM is the only output source. | Yes. The agent can call APIs, read/write files, query databases, run code, etc., via "skills" or "tools". |
| **Autonomy** | None beyond the single request. The LLM answers the prompt you gave. | High. The agent decides *what* to do next, *when* to call a tool, and *how* to modify its plan based on feedback. |
| **Planning / reasoning** | Implicitly inside the prompt (chain-of-thought, few-shot reasoning). | Explicit. A planner (e.g., RAG + LLM, hierarchical planner, or RL-derived policy) decides next steps. |
| **Typical tech stack** | Prompt-engineering, API wrappers (OpenAI, Anthropic, Cohere, etc.). | Agent framework (LangChain, ReAct, Agentic, OpenAI's tool-calling), memory back-ends, environment adapters, optional RL fine-tuning. |
| **Evaluation metrics** | Perplexity, BLEU/ROUGE, human ratings, token cost, latency. | Task-completion rate, plan efficiency, cost per task, safety violations, correctness of tool calls, user satisfaction. |
| **Deployment footprint** | One or a few API calls. Low compute overhead beyond the LLM. | Continuous loops, multiple LLM calls per task, external tool calls → higher latency, cost, orchestration complexity. |
| **Security & safety** | Mitigate hallucinations via prompt design, content filtering, and output validation. | Extra safety layers: tool-access restrictions, verification steps, self-critique loops, logging of tool calls, and monitoring for policy drift. |
| **Examples** | • Text summarizer, translation, chatbot that replies to user messages. <br>• Code generation via a prompt. | • Autogpt / BabyAGI that plans a research project and calls Google/Arxiv APIs. <br>• Customer-support agent that reads emails, queries a knowledge base, drafts replies, and escalates when needed. <br>• "Smart" document assistant that extracts, summarizes, and synthesises information from PDFs while interacting with a database. |

---

### Why the distinction matters

| Issue | LLM App | Agentic App |
|-------|---------|-------------|
| **Complexity of the problem you can solve** | Single-step, narrow tasks. | Multi-step, open-ended, dynamic tasks that require external data or actions. |
| **Control you have over the workflow** | Limited to prompt tweaks. | Full control: you can add/replace tools, change the planner, embed safety checks, or retrain the policy. |
| **Development effort** | Quick to prototype (minutes to hours). | Requires architecture design (memory, tools, planner), integration, and monitoring. |
| **Cost profile** | Predictable: tokens per request × cost per token. | Variable: often dozens of LLM calls + cost of external services. |
| **Deployment & ops** | Single-service API call. | Multi-service orchestration, state persistence, concurrency, monitoring. |
| **Safety & governance** | Mostly about output filtering. | Must guard against misuse of tools, erroneous decisions, and cascading failures. |

---

## Deep Dive: Architectural Patterns

Below is a high-level diagram of each type, annotated with key components. (Think of this as a mental map – you can draw it out if you're visual.)

```
┌─────────────────────────────────────────────┐
│           LLM APP (Prompt-driven)           │
├─────────────────────────────────────────────┤
│  User Input  →  Prompt Builder  →  LLM API  │
│                                 →  Output   │
└─────────────────────────────────────────────┘
```

```
┌───────────────────────────────────────────────────────────────┐
│           AGENTIC APP (Goal-driven)                           │
├───────────────────────────────────────────────────────────────┤
│  Goal + State (memory)   →  Planner (LLM policy)   →  Action  │
│  (e.g., call tool, update plan) →  Environment (API, DB, …)   │
│  →  State Update →  Feedback (success/failure) →  Planner     │
│  (loop)                                                       │
└───────────────────────────────────────────────────────────────┘
```

### Key Sub-components in an Agentic App

| Component | Purpose | Typical Implementation |
|-----------|---------|------------------------|
| **Goal** | The high-level objective the agent is trying to achieve (e.g., "Book a flight for Alice"). | Hard-coded, user-provided, or derived from a higher-level planner. |
| **Memory** | Stores past interactions, plans, knowledge, and environment observations. | Short-term buffer (LLM context), long-term vector store, knowledge graph. |
| **Planner** | Generates a sequence of high-level steps or sub-goals. | LLM policy, rule-based planner, RL-derived policy. |
| **Skill/Tool Interface** | Concrete actions the agent can execute (HTTP calls, DB queries, file I/O). | Function calling API, custom wrappers, external services. |
| **Controller** | Orchestrates planner → tool calls → state updates. | Finite-state machine, event loop, or custom orchestration. |
| **Self-Critique / Verification** | Checks if the last step was correct before proceeding. | LLM "critic", unit tests, external validation. |
| **Safety Guardrails** | Prevents the agent from making unsafe or malicious actions. | Whitelisting, access controls, request throttling. |

---

## Prompt-Engineering vs. Agent-Design

| LLM App | Agentic App |
|---------|-------------|
| **Prompt Engineering**: Craft a single prompt that elicits the desired answer. | **Agent Design**: Define the set of skills, memory schema, and planning logic. The prompt is just one part of the policy. |
| **Few-Shot / Chain-of-Thought**: Add examples or reasoning steps in the prompt to improve quality. | **Multi-Step Reasoning**: The LLM reasons *inside* the planner but also decides when to call a tool. |
| **Stateless**: Each request is independent. | **Stateful**: The agent maintains context across many turns. |
| **Token Budget**: Entire conversation must fit in the context window. | **Memory Off-the-Wall**: Long-term knowledge is stored externally and fetched as needed. |

---

## Practical Guidance for Developers

| What you're building | Suggested approach |
|----------------------|--------------------|
| **A chatbot that answers FAQs** | Build an LLM app with a prompt that includes a few FAQ examples. Use a single API call. |
| **A document-search assistant** | Add retrieval-augmented generation: fetch relevant passages, feed them into the prompt. Still an LLM app. |
| **A project-management AI that schedules tasks, writes emails, and queries a calendar** | Build an agentic app: <br>1. Define skills (email, calendar, to-do list). <br>2. Use a planner to break the project into steps. <br>3. Store state in a vector store. <br>4. Add safety checks for calendar events. |
| **A data-driven recommendation system** | Combine LLM for natural-language explanations with an agent that queries a recommendation engine. |

### Common Pitfalls & Fixes

| Pitfall | Fix |
|---------|-----|
| **Hallucinations** | Use self-critique, verification steps, or "tool" calls that fetch facts (e.g., database queries) before the LLM writes the final answer. |
| **Token Overrun** | Split long interactions into smaller chunks, or use a hierarchical planner that keeps only the essential context in the prompt. |
| **Unnecessary Tool Calls** | Train the planner to use a "tool-cost" penalty or reward; fine-tune the policy with RLHF. |
| **Safety Loops** | Implement a guardrails module that intercepts and blocks dangerous or disallowed actions. |
| **Cold-start for Memory** | Seed the memory with domain-specific knowledge (e.g., company policies) before user interaction. |
| **Deployment Overhead** | Use an orchestrator (e.g., Kubernetes Jobs or serverless functions) to manage agent loops, and keep the LLM inference in a separate micro-service to isolate costs. |

---

## Future Outlook

| Trend | LLM Apps | Agentic Apps |
|-------|----------|--------------|
| **Multi-Modal** | Extend prompts to include images, audio, or structured data. | Agents can orchestrate multi-modal pipelines (e.g., "read an image, summarise it, generate a caption"). |
| **Self-Reflective Agents** | N/A | Agents that can critique their own decisions, adjust strategies, and self-improve. |
| **Open-Domain Skill Acquisition** | Limited to prompt rewrites. | Agents that can discover new tools or skills dynamically (e.g., "learn to call a new API"). |
| **Safety-First Design** | Output-level filters. | Policy-level safety, e.g., a "red-team" sandbox that tests each tool call. |
| **Cost-Efficient Execution** | Token-based billing dominates. | Use LLM sparingly, rely on rule-based components, or cache results. |

---

## Bottom-Line Takeaway

* **LLM Apps** are "prompt-driven generators." They're fast to build, easy to deploy, and great for single-step tasks.  
* **Agentic Apps** are "goal-driven orchestrators." They're more complex to design, require state and tool integration, but enable sophisticated, autonomous workflows that can interact with the world.

Understanding this distinction early on helps you choose the right architecture for your product, align engineering resources, and set realistic expectations about performance, cost, and safety. Happy building!

# <span style='background :lightgreen' >4. Hands-On Practice Examples with Groq Hosted Models using OpenAI's `Responses` API</span>

## a. Writing a Function for our ease

In [32]:
# User-defined function that accesses Groq-hosted models using a Groq API key and OpenAI's Responses API
import os
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display

# Load GROQ API key from .env
load_dotenv("../keys/.env", override=True)
groq_api_key = os.getenv("GROQ_API_KEY")

# The OpenAI client defaults to OpenAI's servers, so you must set base_url to Groq's OpenAI-compatible API endpoint (when using a Groq API key with the OpenAI client).
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key=groq_api_key) 

def ask_groq(
    user_prompt: str,
    developer_prompt: str = "You are a helpful assistant that provides concise answers.",
    model: str = "llama-3.3-70b-versatile", # "openai/gpt-oss-20b",
    max_output_tokens: int | None = 1024,
    temperature: float = 0.7,
    top_p: float = 1.0,
    text: dict = {"format": {"type": "text"}},
    stream: bool = False,
    reasoning: dict | None = None
):
    
    # Prepare input messages as a list of role/content dictionaries
    input_messages = [{"role": "developer", "content": developer_prompt}, {"role": "user", "content": user_prompt}]

    # Responses API call
    response = client.responses.create(
        input=input_messages,
        model=model,
        max_output_tokens=max_output_tokens,
        temperature=temperature,
        top_p=top_p,
        text=text,
        stream=stream,
        reasoning=reasoning
    )

    
    if stream:                    # Return streaming generator if requested
        return response
    return response.output_text   # Return the aggregated text output

## a. Examples (Question Answering)

In [33]:
developer_prompt = "You are an assistant that is great at telling jokes"
user_prompt = "Tell a light-hearted joke for an audience of Data Scientists"
response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

Why did the neural network go to therapy?

Because it was struggling to converge on its emotions! (get it? converge, like in gradient descent?)


In [34]:
developer_prompt = "You are a bedtime storyteller."
user_prompt = "Tell me a bedtime story of Ali Baba and Chalees Chor"

# Get streaming generator from Responses API
response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, stream=True)

# Iterate through streaming events and only print text deltas
for event in response:
    # Each event may contain incremental text in event.delta
    if hasattr(event, "delta") and event.delta:
        print(event.delta, end="", flush=True) # prints the content from this chunk, end="" prevents adding a newline after each  piece and flush=True forces flushing output to screen

Snuggle in tight, for I have a tale to tell that will transport you to the mystical lands of Arabia. It's the legendary story of Ali Baba and the Forty Thieves.

Once upon a time, in the bustling city of Baghdad, there lived a poor woodcutter named Ali Baba. He resided with his wife and a young brother, Qasim, who had married a wealthy merchant's daughter. Qasim's wife had brought a considerable dowry into their marriage, making them relatively affluent compared to Ali Baba.

One day, while Ali Baba was out collecting firewood in the forest, he stumbled upon a group of forty robbers, known as the Chalees Chor. They were infamous for their cunning and brutality, striking fear into the hearts of all who crossed their path. The leader of the thieves, a towering figure with a black beard, ordered his men to hide their loot in a nearby cave.

As Ali Baba watched from a safe distance, the thieves used a magical phrase to open the cave: "Open Sesame!" The cave door swung open, revealing a vas
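
The loop above prints any event that carries a `delta` attribute. When the full text is also needed afterwards (for logging or post-processing), the deltas can be collected while filtering on the event type. The sketch below assumes that text chunks arrive as events of type `response.output_text.delta`; `collect_text` is a hypothetical helper, and the events are hand-made stand-ins so the example runs without a network call.

```python
from types import SimpleNamespace

def collect_text(events) -> str:
    """Concatenate only the output-text deltas from a stream of Responses API events."""
    parts = []
    for event in events:
        # Assumption: text chunks arrive as events of type "response.output_text.delta"
        if getattr(event, "type", None) == "response.output_text.delta":
            parts.append(event.delta)
    return "".join(parts)

# Hand-made stand-ins for streaming events (no API call involved)
fake_events = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="Once upon "),
    SimpleNamespace(type="response.output_text.delta", delta="a time"),
    SimpleNamespace(type="response.completed"),
]

print(collect_text(fake_events))   # Once upon a time
```

Filtering on the type rather than on the mere presence of `delta` avoids mixing in other incremental events, such as reasoning or tool-call deltas, if the chosen model emits them.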

## b. Question Answering from Content Passed

In [35]:
!cat ../data/names.txt

Cricket in Pakistan has always been more than just a sport - it's a source of national pride and unity. Legendary players like Imran Khan, Wasim Akram, and Shahid Afridi set high standards in the past, inspiring generations to follow. Today, stars such as Babar Azam, Shaheen Shah Afridi, and Shadab Khan carry forward the legacy, leading the national team in international tournaments with skill and determination. Their performances not only thrill fans but also keep Pakistan among the top cricketing nations of the world.

Politics in Pakistan, meanwhile, remains dynamic and often turbulent, with key figures shaping the country's direction. Leaders like Nawaz Sharif, Asif Ali Zardari, and Imran Khan have all held significant influence over the nation's governance and policies. In recent years, the political scene has seen sharp divisions, with parties such as the Pakistan Muslim League-Nawaz (PML-N), Pakistan Peoples Party (PPP), and Pakistan Tehreek-e-Insaf (PTI) competing for pow

In [36]:
with open("../data/names.txt", "r") as f:
    file_content = f.read()

user_prompt = f"Extract names from this text:\n{file_content}"
response = ask_groq(user_prompt=user_prompt)
print(response)

Here are the names mentioned in the text:

1. Imran Khan
2. Wasim Akram
3. Shahid Afridi
4. Babar Azam
5. Shaheen Shah Afridi
6. Shadab Khan
7. Nawaz Sharif
8. Asif Ali Zardari

These names include both cricket players and politicians.
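
The reply above is free text, so using the names in later code requires pulling them out of the numbered list. The following is a minimal sketch of one way to do that, assuming the model keeps the `1. item` layout shown above; `parse_numbered_list` and `sample_response` are hypothetical names introduced only for this illustration, and the sample is a shortened copy of the reply.

```python
import re


def parse_numbered_list(text: str) -> list:
    """Return the item text of every line shaped like '1. item' or '1) item'."""
    pattern = re.compile(r"^\s*\d+[.)]\s+(.*\S)\s*$")
    items = []
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            items.append(match.group(1))
    return items


# Shortened copy of the model reply shown above (illustrative only)
sample_response = """Here are the names mentioned in the text:

1. Imran Khan
2. Wasim Akram
3. Shahid Afridi

These names include both cricket players and politicians."""

names = parse_numbered_list(sample_response)
print(names)
```

Lines that do not match the pattern (the introduction and the closing remark) are ignored, so an empty list is a hint that the model answered in a different layout.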


In [37]:
with open("../data/names.txt", "r", encoding="utf-8") as f:  # explicit encoding keeps curly quotes and dashes intact
    file_content = f.read()

user_prompt = f"Can you extract the names of the cricket players from this text:\n{file_content}"
response = ask_groq(user_prompt=user_prompt)
print(response)

Here are the names of Cricket players mentioned in the text:

1. Imran Khan
2. Wasim Akram
3. Shahid Afridi
4. Babar Azam
5. Shaheen Shah Afridi
6. Shadab Khan

Note: Imran Khan is also mentioned as a political leader in the second part of the text, but he is primarily known for his cricketing career.


In [38]:
with open("../data/names.txt", "r", encoding="utf-8") as f:  # explicit encoding keeps curly quotes and dashes intact
    file_content = f.read()

user_prompt = f"Can you categorize the following text:\n{file_content}"
response = ask_groq(user_prompt=user_prompt)
print(response)

The text can be categorized as:

1. Informative/Descriptive: The text provides information about two main topics in Pakistan: cricket and politics. It describes the significance of cricket in Pakistani culture and the current state of politics in the country.

2. Cultural/Current Events: The text touches on the cultural significance of cricket in Pakistan and provides an overview of the current political landscape, making it relevant to cultural and current events categories.

3. Non-fiction/Expository: The text is written in a formal, expository style, providing factual information about cricket and politics in Pakistan, without expressing a personal opinion or bias.

4. Country/Society: The text can also be categorized under country or society, as it provides an overview of two important aspects of Pakistani society: sports and politics.


## c. Examples (Binary Classification: Sentiment analysis, Spam detection, Medical diagnosis)

In [39]:
user_prompt = """
Categorize the sentence 'The delivery was delayed and the product arrived damaged.' into one of the following categories:
Positive
Negative
Answer with just the category; no explanation is needed.
"""
response = ask_groq(user_prompt=user_prompt)
print(response)

Negative
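
Even when asked for "just the category", a model may add whitespace, punctuation, or a full sentence. The sketch below shows one possible way to map a raw reply onto the allowed labels before using it downstream; `normalize_label` is a hypothetical helper written for this illustration and is not part of any SDK.

```python
from typing import List, Optional


def normalize_label(response: str, allowed: List[str]) -> Optional[str]:
    """Map a raw model reply onto one of the allowed labels, or return None."""
    # Drop surrounding whitespace, then stray quotes and full stops
    cleaned = response.strip().strip(".'\"").lower()
    for label in allowed:
        if cleaned == label.lower():
            return label
    return None  # the reply was not exactly one of the allowed labels


labels = ["Positive", "Negative"]
print(normalize_label(" Negative.\n", labels))
print(normalize_label("I think it is negative", labels))
```

Returning `None` for anything that is not an exact label keeps the check strict; the caller can then decide whether to retry the request or to log the unexpected reply.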


## d. Examples (Multi-class Classification)

In [40]:
user_prompt = """
Categorize the sentence 'The movie had great visuals but the plot was confusing and boring' into one of the following categories:
Positive
Negative
Neutral
Answer with just the category; no explanation is needed.
"""
response = ask_groq(user_prompt=user_prompt)
print(response)

Neutral


## e. Examples (Text Generation)

In [41]:
developer_prompt = "You are an expert in political science and history with a deep understanding of the political situation in Pakistan."
user_prompt = "Write a 50-word summary of the fairness of the general elections held in Pakistan on February 08, 2024."
response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, temperature=1.0)
print(response)

"Elections held in Pakistan on February 08, 2024, were largely deemed fair, with minimal reported irregularities. Independent observers praised the transparency and efficiency of the electoral process, despite some pre-poll concerns and opposition claims of potential manipulation, overall outcome was considered credible by international monitors and local stakeholders."


## f. Examples (Code Generation)

In [42]:
developer_prompt = "You are an expert in C programming."
user_prompt = "Write a C program that generates the first ten numbers of the Fibonacci sequence."

# Get streaming generator from Responses API
response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, stream=True)

# Iterate through streaming events and only print text deltas
for event in response:
    # Each event may contain incremental text in event.delta
    if hasattr(event, "delta") and event.delta:
        # Print this chunk's text: end="" avoids a newline after each piece,
        # and flush=True pushes the output to the screen immediately
        print(event.delta, end="", flush=True)

Here is a simple C program that generates the first ten numbers of the Fibonacci sequence:

```c
#include <stdio.h>

// Function to generate Fibonacci sequence
void generateFibonacci(int n) {
    int num1 = 0, num2 = 1;

    // Print the first two numbers of the sequence
    printf("%d, %d, ", num1, num2);

    // Generate the rest of the sequence
    for (int i = 3; i <= n; i++) {
        int nextNum = num1 + num2;
        printf("%d, ", nextNum);
        num1 = num2;
        num2 = nextNum;
    }
}

int main() {
    int n = 10;  // Number of terms in the sequence
    printf("The first %d numbers of the Fibonacci sequence are: ", n);
    generateFibonacci(n);
    return 0;
}
```

When you run this program, it will output:

```
The first 10 numbers of the Fibonacci sequence are: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34,
```

This program works by starting with the first two numbers in the Fibonacci sequence (0 and 1), and then using a loop to generate the rest of the sequence. It does this by a
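
The streaming loop above prints each piece as it arrives but does not keep the full text. The following sketch shows one way to do both, using stand-in event objects so that it runs without an API key. `collect_stream` and `fake_events` are hypothetical names for this illustration; real events come from the SDK and carry more fields than the two used here.

```python
from types import SimpleNamespace


def collect_stream(events) -> str:
    """Print each text delta as it arrives and return the accumulated text."""
    parts = []
    for event in events:
        # Only some event types carry a text delta, so look it up defensively
        delta = getattr(event, "delta", None)
        if isinstance(delta, str) and delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    print()  # finish the line once the stream ends
    return "".join(parts)


# Stand-in events that mimic the shape of a streamed response (assumption)
fake_events = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="Hello, "),
    SimpleNamespace(type="response.output_text.delta", delta="world!"),
    SimpleNamespace(type="response.completed"),
]

full_text = collect_stream(fake_events)
```

Keeping the joined string makes it possible to save or post-process the generated program after the stream has finished.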

## g. Examples (Text Translation)

In [43]:
user_prompt = """
Please act as an expert English-to-Urdu translator and translate the given sentence from English into Urdu.
'The budget this year will have a very bad impact on low-salaried people'
"""
response = ask_groq(user_prompt=user_prompt, model='meta-llama/llama-4-maverick-17b-128e-instruct')
print(response)

اس سال کے بجٹ کا کم تنخواہ لینے والے لوگوں پر بہت برا اثر پڑے گا۔


## h. Examples (Text Summarization)

In [44]:
developer_prompt = "You are an expert of English language."

user_prompt = f'''
Summarize the text below in at most 20 words:
```The Hugging Face transformers library is an incredibly versatile and powerful tool for natural language processing (NLP).
It allows users to perform a wide range of tasks such as text classification, named entity recognition, and question answering, among others.
It's an extremely popular library that's widely used by the open-source data science community.
It lowers the barrier to entry into the field by providing Data Scientists with a productive, convenient way to work with transformer models.```
'''

response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, temperature=1.0)
print(response)

Hugging Face transformers: a powerful NLP tool.
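
A length limit stated in a prompt is a request, not a guarantee, so it can be worth checking the reply in code. The sketch below counts whitespace-separated words, which is only an approximation of how a reader would count them; `within_word_limit` is a hypothetical helper introduced for this illustration.

```python
def within_word_limit(text: str, max_words: int) -> bool:
    """Return True when text has at most max_words whitespace-separated words."""
    return len(text.split()) <= max_words


# The summary returned by the model in the cell above
summary = "Hugging Face transformers: a powerful NLP tool."
print(len(summary.split()), within_word_limit(summary, 20))
```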


In [45]:
developer_prompt = "You are a helpful assistant skilled in text summarization, translation to Urdu, and Python programming. You provide clear, accurate responses and follow instructions precisely."

text = '''
Our solar system, a celestial dance of eight planets, each with its unique character and charm, orbits around our radiant Sun.
Closest to the Sun, Mercury, the smallest planet, darts swiftly, its metallic surface reflecting the Sun's intense glare.
Venus, Earth's twin, cloaked in a dense atmosphere, harbors scorching temperatures and acidic clouds.
Earth, our oasis of life, teems with diverse ecosystems, its oceans and landforms sculpted by the forces of nature.
Mars, the Red Planet, bears the scars of ancient volcanoes and the promise of potential life.
Beyond the asteroid belt, Jupiter and Saturn, the gas giants, reign supreme, their vast atmospheres swirling with storms and adorned with rings of ice and dust.
Uranus and Neptune, the ice giants, tilt at odd angles, their atmospheres frigid and their depths still shrouded in mystery.
Each planet, a celestial masterpiece, plays a vital role in the intricate symphony of our solar system.'''

user_prompt = f'''
Please complete the following two tasks based on the text provided below:

Task 1: Summarize the text in 2-3 sentences, then translate that summary into Urdu.

Task 2: Create a Python list containing all planet names mentioned in the text.

Text: ```{text}```

Please format your response as:
**Summary:** [English summary]
**Urdu Translation:** [Urdu translation]
**Python List:** [Python code with planet names]
'''

response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, model='meta-llama/llama-4-maverick-17b-128e-instruct', temperature=0.3)
print(response)

**Summary:** Our solar system consists of eight planets, each with unique characteristics, orbiting around the Sun. The planets vary in size, atmosphere, and temperature, ranging from Mercury's swift orbit to the gas giants Jupiter and Saturn. The planets play a vital role in the symphony of our solar system.

**Urdu Translation:** ہمارا شمسی نظام آٹھ سیاروں پر مشتمل ہے، ہر ایک اپنی انفرادیت کے ساتھ، سورج کے گرد گردش کرتا ہے۔ سیارے اپنے سائز، فضا اور درجہ حرارت میں مختلف ہوتے ہیں، جیسے کہ عطارِد کی تیز رفتار گردش سے لے کر گیس کے عظیم سیارے مشتری اور زحل تک۔ سیارے ہمارے شمسی نظام کی سمفنی میں اہم کردار ادا کرتے ہیں۔

**Python List:**
```python
planet_names = [
    "Mercury",
    "Venus",
    "Earth",
    "Mars",
    "Jupiter",
    "Saturn",
    "Uranus",
    "Nep

## i. Examples (Named Entity Recognition)

In [46]:
developer_prompt = """You are a Named Entity Recognition specialist. Extract and classify entities from the given text into these categories only if they exist:
- name
- major
- university
- nationality
- grades
- club
Format your response as: 'Entity: [text] | Type: [category]' with each entity on a new line."""

user_prompt = '''
Zelaid Mujahid is a sophomore majoring in Data Science at the University of the Punjab. \
He is a Pakistani national and has a 3.5 GPA. Mujahid is an active member of the department's AI Club. \
He hopes to pursue a career in AI after graduating.
'''
response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, model='meta-llama/llama-4-maverick-17b-128e-instruct', temperature=0.3)
print(response)

Here is a summary of Zelaid Mujahid's profile:

**Name:** Zelaid Mujahid
**Nationality:** Pakistani
**University:** University of the Punjab
**Major:** Data Science
**Year:** Sophomore
**GPA:** 3.5
**Extracurricular Activity:** Active member of the AI Club
**Career Aspiration:** Pursue a career in AI after graduating
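
The developer prompt asks for lines of the form `Entity: [text] | Type: [category]`, which is easy to parse when the model complies. The following is a minimal sketch of such a parser; `parse_entities` and `sample_reply` are hypothetical, and the sample is a hand-written reply in the requested format, not actual model output.

```python
def parse_entities(text: str) -> dict:
    """Group 'Entity: X | Type: Y' lines into a {category: [values]} mapping."""
    entities = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        left, right = (part.strip() for part in line.split("|", 1))
        if not left.startswith("Entity:") or not right.startswith("Type:"):
            continue  # skip anything that is not in the requested format
        value = left.split(":", 1)[1].strip()
        category = right.split(":", 1)[1].strip().lower()
        entities.setdefault(category, []).append(value)
    return entities


# Hand-written example in the requested format (not real model output)
sample_reply = """Entity: Zelaid Mujahid | Type: name
Entity: Data Science | Type: major
Entity: 3.5 | Type: grades"""

print(parse_entities(sample_reply))
```

A reply that ignores the format, such as the profile summary printed above, yields an empty dictionary, which gives a simple signal that the format instruction was not followed.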


## j. Example (Grade School Math 8K (GSM8K))

In [47]:
developer_prompt = """You are an expert school math teacher.
Consider the following text, then answer the students' questions about it:
A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. 
For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. 
"""
user_prompt = "How much did the booth earn for 5 days after paying the rent and the cost of ingredients?"

response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt,  model='llama-3.3-70b-versatile')
print(response)

To determine how much the booth earned, we need more information such as:

1. The daily revenue of the booth
2. The daily cost of ingredients
3. The total rent paid for 5 days

Without this information, it's impossible to calculate the earnings of the booth. If you provide the necessary data, I can help you with the calculation.


In [48]:
developer_prompt = """You are an expert school math teacher.
Consider the following text, then answer the students' questions about it:
A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. 
For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. 
"""
user_prompt = "How much did the booth earn for 5 days after paying the rent and the cost of ingredients?"

response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt,  model='meta-llama/llama-4-maverick-17b-128e-instruct')
print(response)

## Step 1: Calculate the total earnings for 5 days
First, we need to determine the total earnings of the booth over the 5-day period. Let's assume the booth earns $200 per day. So, the total earnings for 5 days = $200 * 5 = $1000.

## Step 2: Calculate the total rent paid for 5 days
Next, we need to find out the total rent paid for the 5 days. Let's assume the daily rent is $50. So, the total rent for 5 days = $50 * 5 = $250.

## Step 3: Calculate the total cost of ingredients for 5 days
Now, we need to determine the total cost of ingredients for the 5 days. Let's assume the daily cost of ingredients is $30. So, the total cost of ingredients for 5 days = $30 * 5 = $150.

## Step 4: Calculate the total earnings after paying the rent and the cost of ingredients
To find the earnings after paying the rent and the cost of ingredients, we need to subtract the total rent and the total cost of ingredients from the total earnings. So, the earnings after deductions = total earnings - (total rent

In [49]:
developer_prompt = """You are an expert school math teacher.
Consider the following text, then answer the students' questions about it:
A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. 
For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. 
"""
user_prompt = "How much did the booth earn for 5 days after paying the rent and the cost of ingredients?"

response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, model='openai/gpt-oss-20b')
print(response)

**Step-by-step solution**

1. **Daily earnings from popcorn**  
   The booth sells popcorn for **$50 each day**.

2. **Daily earnings from cotton candy**  
   It makes *three times* as much from cotton candy.  
   \[
   3 \times \$50 = \$150 \text{ per day}
   \]

3. **Total daily revenue**  
   \[
   \$50 \;(\text{popcorn}) + \$150 \;(\text{cotton candy}) = \$200 \text{ per day}
   \]

4. **Revenue for the 5-day activity**  
   \[
   5 \text{ days} \times \$200 \text{ per day} = \$1,000
   \]

5. **Total costs**  
   - Rent: **$30**  
   - Ingredients: **$75**  
   \[
   \$30 + \$75 = \$105
   \]

6. **Net earnings after costs**  
   \[
   \$1,000 - \$105 = \$895
   \]

---

**Answer:** The booth earned **$895** for the 5-day activity after paying rent and ingredient costs.
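
Since the three models gave three different replies to the same question, it helps to check the arithmetic independently. The short sketch below recomputes the answer directly from the figures in the problem statement; the variable names are chosen for this illustration only.

```python
# Figures taken from the problem statement
popcorn_per_day = 50
cotton_candy_per_day = 3 * popcorn_per_day  # "three times as much"
days = 5
rent = 30
ingredients = 75

# Revenue over the whole activity, then subtract the one-off costs
revenue = days * (popcorn_per_day + cotton_candy_per_day)
net_earnings = revenue - (rent + ingredients)

print(f"Revenue: ${revenue}, net earnings: ${net_earnings}")
```

The result matches the $895 reported by `openai/gpt-oss-20b` in the last cell.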
