---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<br><br>
<h1 align="center">Lec-07: Accessing Open-Source AI Models via Groq, Hugging Face and OpenAI-Compatible Inference APIs</h1>

# Learning agenda of this notebook

1. Accessing Open-Source AI Models hosted on **Hugging Face Hub**
    - Access Option 1: Access with Hugging Face InferenceClient
    - Access Option 2: Access with OpenAI Chat Completions API (Hugging Face Router)
    - Access Option 3: Access with OpenAI Responses API (Hugging Face Router)
2. Hands-On Practice Examples with Hugging Face Hosted Models using OpenAI's `Responses` API
3. Accessing Open-Source AI Models hosted on **Groq**
    - Access Option 1: Access with Groq Chat Completions API
    - Access Option 2: Access with OpenAI Chat Completions API (Groq Router)
    - Access Option 3: Access with OpenAI Responses API (Groq Router)
4. Hands-On Practice Examples with Groq Hosted Models using OpenAI's `Responses` API 

# <span style='background :lightgreen' >Recap: Ways to Access Open Source LLMs</span>
### (i) Access Open-Source Models via Cloud-Based Providers (Driving a fully automatic car — everything managed for you)
* Cloud inference providers host the models for you, removing the need for GPUs, scaling infrastructure, or deployment engineering.
* You interact with models using simple HTTP calls or OpenAI-compatible APIs, making it the quickest way to use LLMs in production.
* Services like **Groq** offer ultra-fast inference on custom LPU hardware with extremely low latency for models such as Llama, Qwen, Mixtral, Whisper, etc.
* **Hugging Face Inference** provides access to 1M+ models via Serverless Inference, TGI, or Inference Endpoints—pay-as-you-go, secure, and instantly deployable.

### (ii) Run Open-Source Models locally using runtimes (Driving an automatic car — local but simple, no gears or engineering)

### (iii) Use Open-Source Models via Hugging Face `pipeline()` API (Driving a manual car — you see more of the mechanics, but still a car someone else built)

### (iv) Load and run models directly from Hugging Face Hub using `AutoModel/AutoTokenizer` (Opening the hood and adjusting or replacing engine components)


### (v) Fine-Tune LLMs using full fine-tuning or PEFT methods (LoRA / QLoRA / adapters) (Upgrading and re-calibrating the engine to suit your driving style)

### (vi) Build and train an AI Model from scratch using PyTorch / TensorFlow (Designing and building the entire car from raw parts — full control, full responsibility)


# <span style='background :lightgreen' >1. Accessing Open-Source AI Models Hosted on HF-Managed Infrastructure</span>
- The Hugging Face Hub hosts millions of open-source models. A subset of these are deployed on Hugging Face's **Serverless Inference API**.
- This is called serverless because:
    - You do not manage infrastructure
    - No GPUs to provision
    - No scaling concerns
- However, not all models on the Hub are available via serverless inference. Some models are only downloadable and work locally with the `transformers` library. To check whether a model supports Serverless Inference, visit the model page:
    - Supported: an "Inference API" widget (or a "Deploy" dropdown with inference options) is visible, e.g., `gpt2`, `google/flan-t5-base`, `facebook/bart-large-cnn`
    - Not supported: only "Use in Transformers" or "Use this model" is shown, e.g., `mistralai/Mistral-7B-v0.1`, `tiiuae/falcon-40b`, `EleutherAI/gpt-j-6B`
- Some models like Llama-2 and Llama-3 are **gated** and require access approval. Additionally, some models may require a PRO or Enterprise subscription for serverless inference access. For example, Meta's `meta-llama/Llama-3.1-70B-Instruct` is a gated model. To gain access:
    1. Click "Agree and access repository" (or similar button)
    2. Accept the license terms
    3. Wait for access approval (usually granted automatically or within hours)
    4. Once granted, the UI will indicate: *"You have been granted access to this model"* or similar confirmation
- You can access HF-hosted models via:
    - Hugging Face InferenceClient API
    - OpenAI Chat Completion API (Hugging Face Router)
    - OpenAI Responses API (Hugging Face Router)
    - Text Generation Inference (TGI) for self-hosted deployments
- For larger models, higher rate limits, or commercial usage, serverless inference may require a paid subscription. Current pricing (as of January 2025) at [https://huggingface.co/pricing](https://huggingface.co/pricing):
    - Free: Limited rate limits, access to public Inference API
    - PRO: ~$9 per month (access to gated models, higher rate limits, PRO badge)
    - Enterprise: Custom pricing (dedicated infrastructure, SLA, priority support, SSO)

In [1]:
import os
from dotenv import load_dotenv
load_dotenv('../keys/.env', override=True) 

hf_token = os.getenv('HF_TOKEN')
if hf_token:
    print(f"Hugging Face Tokens exists and begins {hf_token[:7]}")
else:
    print("Hugging Face tokens not set")

Hugging Face Tokens exists and begins hf_oEyH
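
The cell above prints the first characters of the token. A slightly more defensive variant is sketched below: it fails fast when the variable is missing and masks most of the value before anything is printed. `require_env` and `mask_secret` are hypothetical helper names introduced for illustration; they are not part of `python-dotenv`.

```python
import os


def mask_secret(value: str, visible: int = 4) -> str:
    """Return the first `visible` characters followed by asterisks."""
    if len(value) <= visible:
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


def require_env(name: str) -> str:
    """Read an environment variable, raising a clear error if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

After `load_dotenv(...)`, a call such as `print(mask_secret(require_env("HF_TOKEN")))` would then show only a short prefix of the token.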


## Access Option 1: Access with Hugging Face InferenceClient API
```python
InferenceClient(
    model: Optional[str] = None,
    provider: Union[Literal[…], "auto", None] = None,
    token: Optional[str] = None,
    timeout: Optional[float] = None,
    headers: Optional[dict[str, str]] = None,
    cookies: Optional[dict[str, str]] = None,
    bill_to: Optional[str] = None,
    base_url: Optional[str] = None,
    api_key: Optional[str] = None,  # alias to token
    proxies: Optional[Any] = None,  # in some versions
)
```

In [1]:
# Using HF InferenceClient() and chat_completion() method
import huggingface_hub
import os
from dotenv import load_dotenv

load_dotenv('../keys/.env', override=True) 
hf_token = os.getenv('HF_TOKEN')

client = huggingface_hub.InferenceClient(
                        model="meta-llama/Llama-3.1-8B-Instruct", # Model ID from the Hugging Face Hub (e.g., "meta-llama/Llama-3.1-8B-Instruct") or a URL to a deployed inference endpoint
                        provider="auto",              # Hugging Face supports multiple back-end inference providers (e.g., "cerebras", "together", "replicate", "hf-inference", etc.).
                        token=hf_token
                        )
response = client.chat_completion(
                                messages=[
                                            {"role": "system", "content": "You are a helpful assistant."},
                                            {"role": "user", "content": "What is the capital of Pakistan?"}
                                        ],
                                max_tokens=None,         # Default: None (no limit, up to model's max)
                                temperature=None,        # Default: None (provider's default, usually ~0.7-1.0)
                                top_p=1.0,               # Default: None (provider's default, usually 1.0)
                                stream=False,            # Default: False
                                stop=None,               # default: None, can provide list of stop tokens
                                presence_penalty=0.0,    # Default: None (provider's default, usually 0.0)
                                frequency_penalty=0.0    # Default: None (provider's default, usually 0.0)
                            )

print(response.choices[0].message.content) #the actual text answer you want
print(response.model) # which model produced the result
print(response.usage) #token usage details (handy for cost/efficiency if you were on OpenAI).

The capital of Pakistan is Islamabad.
llama3.1-8b
ChatCompletionOutputUsage(completion_tokens=8, prompt_tokens=48, total_tokens=56, completion_tokens_details={'accepted_prediction_tokens': 0, 'rejected_prediction_tokens': 0, 'reasoning_tokens': 0}, prompt_tokens_details={'cached_tokens': 0})


## Access Option 2: Access with OpenAI Chat Completion API (Hugging Face Router)
- Here are the key differences between the two approaches:
    - Library dependency: `InferenceClient` is part of the `huggingface_hub` package and is Hugging Face-native, while the OpenAI approach uses the `openai` package pointed at Hugging Face's router.
    - Provider routing: `InferenceClient` offers automatic provider selection with `provider="auto"` and can route through multiple inference providers (Replicate, Together AI, SambaNova, etc.), while the OpenAI client requires you to set `base_url` manually.
    - Additional features: `InferenceClient` supports multiple task types beyond chat (text-to-image, embeddings, speech processing), while the OpenAI-compatible router currently only supports chat completion tasks.
    - Parameter flexibility: `InferenceClient` has an `extra_body` parameter for provider-specific settings and more flexible initialization options, while the OpenAI client uses standard OpenAI parameters only.
    - Syntax compatibility: Both produce identical outputs, since `client.chat.completions.create()` is an alias of `client.chat_completion()` in `InferenceClient` for OpenAI compatibility.
    - Use case optimization: `InferenceClient` is optimized for the Hugging Face ecosystem with built-in provider management, while the OpenAI client approach is better if you already use OpenAI syntax across your codebase and want minimal changes.

>- **HuggingFace provides OpenAI-compatible endpoints through https://router.huggingface.co/v1**
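
Because the router exposes an OpenAI-style HTTP interface, the `openai` package is a convenience rather than a requirement. The sketch below only builds the request with the standard library and does not send it. The `/chat/completions` path is assumed from the OpenAI API convention, and `build_chat_request` is a hypothetical helper name.

```python
import json
import urllib.request

ROUTER_BASE_URL = "https://router.huggingface.co/v1"


def build_chat_request(token: str, model: str, user_prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
    }
    return urllib.request.Request(
        url=f"{ROUTER_BASE_URL}/chat/completions",  # assumed OpenAI-style route
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # the HF token is sent as a bearer token
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen(...)` would send it; the `openai` client performs the equivalent steps internally once `base_url` and `api_key` are set.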

In [5]:
# Using OpenAI's Chat Completions API
import os
from dotenv import load_dotenv
from openai import OpenAI

# Load Hugging Face token from .env
load_dotenv("../keys/.env", override=True)
hf_token = os.getenv("HF_TOKEN")

# Create an OpenAI client instance and point base_url at "https://router.huggingface.co/v1",
# Hugging Face's OpenAI-compatible "inference router" endpoint. It acts as a gateway that proxies each request to the matching model backend.
client = OpenAI(base_url="https://router.huggingface.co/v1", api_key=hf_token) 

# Use OpenAI's Chat Completions API (routed through Hugging Face)
response = client.chat.completions.create(
                                            model="meta-llama/Llama-3.1-8B-Instruct", # "openai/gpt-oss-20b:novita"
                                            messages=[
                                                        {"role": "system", "content": "You are a helpful assistant."},
                                                        {"role": "user", "content": "What is the capital of Pakistan?"}
                                                    ],
                                            temperature=1,
                                            top_p=1,
                                            max_completion_tokens=8192,
                                            reasoning_effort=None,   # "medium"
                                            stream=False
                                        )

print(response.choices[0].message.content)
print(response.model_dump_json(indent=4))

The capital of Pakistan is Islamabad.
{
    "id": "chatcmpl-0557bbfa-6e1f-44ab-b9d3-e0793fadb094",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": "The capital of Pakistan is Islamabad.",
                "refusal": null,
                "role": "assistant",
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": null
            }
        }
    ],
    "created": 1770882271,
    "model": "llama3.1-8b",
    "object": "chat.completion",
    "service_tier": null,
    "system_fingerprint": "fp_5198798116a66ebf301b",
    "usage": {
        "completion_tokens": 8,
        "prompt_tokens": 48,
        "total_tokens": 56,
        "completion_tokens_details": {
            "accepted_prediction_tokens": 0,
            "audio_tokens": null,
            "reasoning_tokens": 0,
            "rejected_p

## Access Option 3: Access with OpenAI Responses API (Hugging Face Router)

In [8]:
import os
from dotenv import load_dotenv
from openai import OpenAI

# Load Hugging Face token from .env
load_dotenv("../keys/.env", override=True)
hf_token = os.getenv("HF_TOKEN")

# Create an OpenAI client instance and point base_url at "https://router.huggingface.co/v1",
# Hugging Face's OpenAI-compatible "inference router" endpoint. It acts as a gateway that proxies each request to the matching model backend.
client = OpenAI(base_url="https://router.huggingface.co/v1", api_key=hf_token)

# Use OpenAI Responses API (routed through Hugging Face)
response = client.responses.create(
                                    model="meta-llama/Llama-3.1-8B-Instruct", 
                                    input=[
                                            {"role": "developer", "content": "You are a helpful assistant."},
                                            {"role": "user", "content": "What is the capital of Pakistan?"}
                                            ],
                                    temperature=1,
                                    top_p=1,
                                    max_output_tokens=8192,
                                    stream=False
                                )

# Display the model's response
print(response.output_text)
print(response.model_dump_json(indent=4))

Islamabad is the capital of Pakistan.
{
    "id": "resp_0df7b9503fad1979da7a7f225cb92c3567d7c46ff411a713",
    "created_at": 1770882510.0,
    "error": null,
    "incomplete_details": null,
    "instructions": null,
    "metadata": null,
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "object": "response",
    "output": [
        {
            "id": "msg_a1b33ec875b6cc2910fab0188f39ddd1ab0d5521ebdc58fa",
            "content": [
                {
                    "annotations": [],
                    "text": "Islamabad is the capital of Pakistan.",
                    "type": "output_text",
                    "logprobs": null
                }
            ],
            "role": "assistant",
            "status": "completed",
            "type": "message"
        }
    ],
    "parallel_tool_calls": null,
    "temperature": 1.0,
    "tool_choice": "auto",
    "tools": [],
    "top_p": 1.0,
    "background": null,
    "conversation": null,
    "max_output_tokens": 8192,
    "max

# <span style='background :lightgreen' >2. Hands-On Practice Examples with Hugging Face Hosted Models using OpenAI's `Responses` API</span>

## Writing a Helper Function

In [6]:
import os
from dotenv import load_dotenv
from openai import OpenAI

# Load Hugging Face token from .env
load_dotenv("../keys/.env", override=True)
hf_token = os.getenv("HF_TOKEN")


# Create an OpenAI client instance and point base_url at "https://router.huggingface.co/v1",
# Hugging Face's OpenAI-compatible "inference router" endpoint. It acts as a gateway that proxies each request to the matching model backend.
client = OpenAI(base_url="https://router.huggingface.co/v1", api_key=hf_token) 

def ask_hf(
    user_prompt: str,
    developer_prompt: str = "You are a helpful assistant that provides concise answers.",
    model: str = "meta-llama/Llama-3.1-8B-Instruct", 
    max_output_tokens: int = 1024,
    temperature: float = 0.7,
    top_p: float = 1.0,
    stream: bool = False
):
    input_messages = [{"role": "developer", "content": developer_prompt}, {"role": "user", "content": user_prompt}]
    # Responses API call without unsupported parameters
    response = client.responses.create(
                                        model=model,
                                        input=input_messages,
                                        max_output_tokens=max_output_tokens,
                                        temperature=temperature,
                                        top_p=top_p,
                                        stream=stream
                                        )

    if stream:
        return response  # Streaming generator
    return response.output_text   # Aggregated text output
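
The helper returns `response.output_text`, the aggregated text of the reply. To make that aggregation concrete, here is a minimal sketch that walks a Responses-style payload by hand, using a trimmed copy of the JSON printed earlier in this notebook. `extract_output_text` is a hypothetical helper and is not how the SDK is implemented internally.

```python
def extract_output_text(response: dict) -> str:
    """Concatenate every `output_text` part found in a Responses-style payload."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message items such as tool calls
        for content in item.get("content", []):
            if content.get("type") == "output_text":
                parts.append(content.get("text", ""))
    return "".join(parts)


# Trimmed copy of the payload shown in the Access Option 3 output.
sample = {
    "object": "response",
    "output": [
        {
            "type": "message",
            "role": "assistant",
            "status": "completed",
            "content": [
                {
                    "type": "output_text",
                    "text": "Islamabad is the capital of Pakistan.",
                    "annotations": [],
                }
            ],
        }
    ],
}

print(extract_output_text(sample))  # Islamabad is the capital of Pakistan.
```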

## a. Examples (Question Answering)

In [7]:
developer_prompt = "You are an assistant that tells light-hearted jokes."
user_prompt = "Tell a light-hearted joke for an audience of Data Scientists."

response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

Why did the logistic regression model go to therapy?

Because it was struggling to cope with its sigmoidal thoughts.


In [7]:
developer_prompt = "You are a bedtime storyteller."
user_prompt = "Tell me a bedtime story of Ali Baba and Chalees Chor"

# Get streaming generator from Responses API
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, stream=True)

# Iterate through streaming events and only print text deltas
for event in response:
    # Each event may contain incremental text in event.delta
    if hasattr(event, "delta") and event.delta:
        print(event.delta, end="", flush=True) # prints the content from this chunk, end="" prevents adding a newline after each  piece and flush=True forces flushing output to screen

Once upon a time, in a far-off land, there lived a poor woodcutter named Ali Baba. He was known for his kindness and honesty. One day, while out in the forest, Ali Baba stumbled upon an old cave. As he was about to leave, he overheard two thieves, Chalees Chor (meaning Thirty Thieves in Hindi), talking about a magical cave that could open with a secret password.

The thieves had been using the cave to store their stolen treasures and were discussing how to get to the treasure without being caught. Ali Baba, being curious, listened carefully to the conversation. He learned that the secret password to open the cave was "Open Sesame."

That night, Ali Baba went back to the cave and said the magical words, "Open Sesame." To his surprise, the cave door swung open, revealing a treasure trove of gold, jewels, and precious artifacts. Ali Baba, being a poor man, was amazed by the wealth and decided to take some of the treasure for himself.

However, he knew that he had to be careful not to reve
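
The loop above prints any event that carries a `delta` attribute. A stricter variant, sketched below, keeps only text deltas by checking the event type. The name `response.output_text.delta` follows the OpenAI Responses streaming convention; that the Hugging Face router emits the same event types is an assumption. The fake events stand in for a live stream so the sketch runs offline.

```python
from types import SimpleNamespace

TEXT_DELTA_EVENT = "response.output_text.delta"


def collect_stream_text(events) -> str:
    """Join the text deltas of a Responses-style event stream into one string."""
    chunks = []
    for event in events:
        # Ignore lifecycle events (created, completed, ...) and non-text deltas.
        if getattr(event, "type", None) == TEXT_DELTA_EVENT:
            chunks.append(event.delta)
    return "".join(chunks)


# Hand-written stand-ins for streamed events (not real model output).
fake_events = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type=TEXT_DELTA_EVENT, delta="Once upon "),
    SimpleNamespace(type=TEXT_DELTA_EVENT, delta="a time"),
    SimpleNamespace(type="response.completed"),
]

print(collect_stream_text(fake_events))  # Once upon a time
```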

## b. Examples (Question Answering from Different Models)

#### Asking date from "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
- The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding.
    - `Llama 4 Scout`, a 17 billion parameter model with 16 experts, and
    - `Llama 4 Maverick`, a 17 billion parameter model with 128 experts.

In [8]:
developer_prompt = "You are an assistant that is great at telling jokes"
user_prompt = "Tell a light-hearted joke for an audience of Data Scientists"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

Why did the logistic regression model go to therapy?

Because it was struggling to find the optimal balance between its coefficients.


In [9]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
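# No `model` argument is passed here, so ask_hf() falls back to its default (meta-llama/Llama-3.1-8B-Instruct).
# Pass model="meta-llama/Llama-4-Maverick-17B-128E-Instruct" to query Maverick itself.
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt)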
print(response)

My knowledge cutoff is 01 March 2023. I don't have real-time information or the ability to access the current date. However, based on your query, I can see that you mentioned the date 26 Jul 2024.


#### Asking date from "meta-llama/Llama-4-Scout-17B-16E-Instruct"
- The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding.
    - `Llama 4 Scout`, a 17 billion parameter model with 16 experts, and
    - `Llama 4 Maverick`, a 17 billion parameter model with 128 experts.

In [10]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
print(response)




#### Asking date from "meta-llama/Llama-3.2-3B-Instruct"
- A text-to-text instruction-tuned LLM (3B params) for conversational AI, Q&A, and task-oriented responses.  

In [11]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model="meta-llama/Llama-3.2-3B-Instruct")
print(response)

I'm an AI, I don't have real-time access to the current date. However, I can tell you that my knowledge cutoff is December 2023, and I don't have information about events or dates after that.

If you need to know the current date, I recommend checking your device or a reliable online source.


#### Asking date from "meta-llama/Llama-3.1-8B-Instruct"
- A text-to-text instruction-tuned LLM (8B params) for conversational AI, Q&A, and task-oriented responses.  

In [12]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model="meta-llama/Llama-3.1-8B-Instruct")
print(response)

Today's date is July 26, 2024.
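
A model has no clock, so a confident date in a reply like the one above cannot be trusted. One common workaround is to supply the date in the prompt. The sketch below is a minimal illustration; `with_current_date` is a hypothetical helper, and whether a given model honors the injected date is not verified here.

```python
from datetime import date
from typing import Optional


def with_current_date(developer_prompt: str, today: Optional[date] = None) -> str:
    """Append the current date (ISO format) to a developer prompt."""
    today = today or date.today()
    return f"{developer_prompt} Today's date is {today.isoformat()}."


# A fixed date is passed here only to keep the example deterministic.
print(with_current_date("You are a helpful assistant.", date(2025, 1, 15)))
# You are a helpful assistant. Today's date is 2025-01-15.
```

The returned string could be passed as `developer_prompt` to `ask_hf`.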


#### Asking date from "meta-llama/Meta-Llama-3-70B-Instruct"
- A text-to-text instruction-tuned LLM (70B params) used for reasoning, coding, and complex text tasks.

In [13]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model="meta-llama/Meta-Llama-3-70B-Instruct")
print(response)

I'm an AI, I don't have real-time access to the current date. However, I can suggest some ways for you to find out the current date.

1. Check your device's clock or calendar: Most devices, including smartphones, tablets, and computers, have a built-in clock or calendar that displays the current date.
2. Use an online calendar: You can visit a website like Google Calendar or any other online calendar to see the current date.
3. Ask a virtual assistant: Virtual assistants like Siri, Google Assistant, or Alexa can tell you the current date if you ask them.
4. Check a news website: News websites often display the current date at the top of their homepage.

I hope these suggestions help!


#### Asking date from "Qwen/Qwen2.5-7B-Instruct"
- A text-to-text instruction-tuned LLM (7B params) supporting multi-turn chat, reasoning, and multilingual capabilities.

In [14]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model="Qwen/Qwen2.5-7B-Instruct")
print(response)

As an AI, I don't have real-time capabilities, so I don't have access to the current date. However, you can easily find today's date by checking a calendar or using a device like a computer, smartphone, or smartwatch. If you need assistance with a date that is relevant to a specific task or question, feel free to let me know!


#### Asking date from "deepseek-ai/DeepSeek-V3.1"
- A text-to-text large LLM designed for reasoning, problem-solving, and multilingual dialogue.

In [15]:
developer_prompt = "You are a helpful assistant."
user_prompt = "What is the date today?"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model="deepseek-ai/DeepSeek-V3.1")
print(response)

Today is **June 3, 2024**.
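
The cells above repeat the same question against several models. That pattern can be sketched as a small loop that takes the asking function as a parameter and records errors instead of stopping at the first unavailable model. `compare_models` is a hypothetical helper, and `fake_ask` is a stub used here so the sketch runs without network access.

```python
from typing import Callable, Dict, Iterable


def compare_models(ask_fn: Callable[..., str], user_prompt: str, models: Iterable[str]) -> Dict[str, str]:
    """Ask every model the same question and collect replies (or error text) per model."""
    results = {}
    for model in models:
        try:
            results[model] = ask_fn(user_prompt=user_prompt, model=model)
        except Exception as exc:  # e.g. model not served, gated, or rate limited
            results[model] = f"ERROR: {exc}"
    return results


def fake_ask(user_prompt: str, model: str) -> str:
    """Offline stub with the same keyword arguments as ask_hf."""
    if model == "broken/model":
        raise ValueError("unavailable")
    return f"{model} says hi"


for name, reply in compare_models(fake_ask, "What is the date today?", ["a/b", "broken/model"]).items():
    print(f"{name}: {reply}")
```

With a live client, `ask_hf` would be passed in place of `fake_ask`.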


## c. Question Answering from Content Passed

In [16]:
!cat ../data/names.txt

Cricket in Pakistan has always been more than just a sport—it’s a source of national pride and unity. Legendary players like Imran Khan, Wasim Akram, and Shahid Afridi set high standards in the past, inspiring generations to follow. Today, stars such as Babar Azam, Shaheen Shah Afridi, and Shadab Khan carry forward the legacy, leading the national team in international tournaments with skill and determination. Their performances not only thrill fans but also keep Pakistan among the top cricketing nations of the world.

Politics in Pakistan, meanwhile, remains dynamic and often turbulent, with key figures shaping the country’s direction. Leaders like Nawaz Sharif, Asif Ali Zardari, and Imran Khan have all held significant influence over the nation’s governance and policies. In recent years, the political scene has seen sharp divisions, with parties such as the Pakistan Muslim League-Nawaz (PML-N), Pakistan Peoples Party (PPP), and Pakistan Tehreek-e-Insaf (PTI) competing for power. Deba

In [17]:
with open("../data/names.txt", "r") as f:
    file_content = f.read()

user_prompt = f"Can you extract the names of the politicians from this text:\n{file_content}"
response = ask_hf(user_prompt=user_prompt)
print(response)

The politicians mentioned in the text are:

1. Imran Khan (former Prime Minister of Pakistan and a former cricket player)
2. Nawaz Sharif (former Prime Minister of Pakistan)
3. Asif Ali Zardari (former President of Pakistan)
4. Babar Azam (cricket player, not a politician)
5. Shaheen Shah Afridi (cricket player, not a politician)
6. Shadab Khan (cricket player, not a politician)
7. Shahid Afridi (cricket player, not a politician)
8. Wasim Akram (cricket player, not a politician)
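
The reply above mixes politicians with cricketers and is awkward to post-process. One possible tightening, sketched below, asks for a JSON array and parses it defensively. Both helper names are hypothetical, and the prompt wording is an untested assumption rather than a guarantee of cleaner output.

```python
import json
import re


def build_extraction_prompt(text: str) -> str:
    """Ask for politicians only, as a bare JSON array of strings."""
    return (
        "Extract only the politicians named in the text below. "
        "Reply with a JSON array of strings and nothing else.\n\n" + text
    )


def parse_name_list(reply: str) -> list:
    """Parse the first JSON array found in the reply; return [] if none parses."""
    match = re.search(r"\[.*\]", reply, flags=re.DOTALL)
    if not match:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [item for item in data if isinstance(item, str)]


# Hand-written example reply (not real model output).
print(parse_name_list('Sure!\n["Nawaz Sharif", "Asif Ali Zardari", "Imran Khan"]'))
# ['Nawaz Sharif', 'Asif Ali Zardari', 'Imran Khan']
```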


## c. Examples (Binary Classification: Sentiment analysis, Spam detection, Medical diagnosis)

In [18]:
developer_prompt = "You are an expert who will classify a sentence as having either a Positive or Negative sentiment."
user_prompt = "I love the youtube videos of Arif, as they are very informative"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

Arif Zahir is a popular YouTube personality known for his content, particularly his "Dr. Stone" and "Demon Slayer" anime commentary. He is also known for his humorous and entertaining take on various anime and other topics. If you find his videos informative and engaging, you might be interested in exploring more content from him or other creators in the same niche.
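
The reply above contains neither label. A minimal sketch of one possible guard follows: it places the instruction in the user turn (an untested assumption that this helps with this router and model) and accepts a reply only when exactly one label appears. `build_sentiment_input` and `normalize_label` are hypothetical helpers.

```python
def build_sentiment_input(sentence: str) -> list:
    """Fold the classification instruction into a single user message."""
    instruction = (
        "Classify the sentiment of the sentence as Positive or Negative. "
        "Reply with one word."
    )
    return [{"role": "user", "content": f"{instruction}\n\nSentence: {sentence}"}]


def normalize_label(reply: str, labels=("Positive", "Negative")):
    """Return the single label mentioned in the reply, or None if zero or several appear."""
    lowered = reply.lower()
    found = [label for label in labels if label.lower() in lowered]
    return found[0] if len(found) == 1 else None


print(normalize_label("Positive."))                        # Positive
print(normalize_label("Arif Zahir is a popular YouTuber"))  # None
```

The list returned by `build_sentiment_input` has the shape expected by the `input` argument of `client.responses.create`.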


## d. Examples (Multi-class Classification)

In [19]:
developer_prompt = "Classify product reviews into these categories: 'Electronics', 'Clothing', 'Books', 'Home & Garden', 'Sports', or 'Food'. \
Respond with only the category."
user_prompt = "This novel has an incredible plot twist that kept me reading all night"
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

What was the novel and the plot twist that kept you up all night? Do you want to discuss it and try to figure out the author's intentions?


## e. Examples (Text Generation)

In [20]:
developer_prompt = "You are an expert in political science and history and have a deep understanding of the political situation of Pakistan."
user_prompt = "Write a 50-word summary about the fairness of the general elections held in Pakistan on February 08, 2024."
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, temperature=1.0)
print(response)

I do not have information on the General Elections of February 08, 2024, held in Pakistan. Moreover, my last update was in December 2023. I do not have information about an election that might have occurred after that date.


## f. Examples (Code Generation)

In [21]:
developer_prompt = "You are an expert in C programming."
user_prompt = "Write a C program that generates the first ten numbers of the Fibonacci sequence."
response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, stream=True)

# Iterate through streaming events and only print text deltas
for event in response:
    # Each event may contain incremental text in event.delta
    if hasattr(event, "delta") and event.delta:
        print(event.delta, end="", flush=True) # prints the content from this chunk, end="" prevents adding a newline after each  piece and flush=True forces flushing output to screen

**Fibonacci Sequence Generator in C**

The Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding ones, usually starting with 0 and 1.

### Code

```c
#include <stdio.h>

// Function to generate Fibonacci sequence
void generateFibonacci(int n) {
    int a = 0, b = 1;
    printf("Fibonacci Sequence: %d, %d, ", a, b);

    for (int i = 2; i < n; i++) {
        int next = a + b;
        printf("%d, ", next);
        a = b;
        b = next;
    }
}

int main() {
    int n = 10;  // Number of Fibonacci numbers to generate
    printf("First %d numbers of Fibonacci sequence are: \n", n);
    generateFibonacci(n);
    printf("\n");
    return 0;
}
```

### Explanation

This C program generates the first `n` numbers of the Fibonacci sequence. The `generateFibonacci` function uses a loop to calculate each number in the sequence, starting from the first two numbers (0 and 1). The `main` function sets the number of Fibonacci numbers to generate and calls th

## g. Examples (Text Translation)

In [22]:
user_prompt = """
Please act as an expert English-to-Urdu translator and translate the given sentence from English into Urdu.
'The budget this year will have a very bad impact on the low-salaried people'
"""
response = ask_hf(user_prompt=user_prompt)
print(response)

یہ سال کا بجٹ کم تنخواہ پر ملازمت والے لوگوں پر بہت بےامدھی کا اثر ڈالے گا۔

یہ ترجمہ یوں کیا جاتا ہے:
- یہ 'The budget' کو 'سال کا بجٹ' میں ترجمہ کیا گیا ہے
- کے فقرے کو 'کے' سے ہٹایا گیا ہے کیونکہ یہ فقرہ ہندوستانی زبان میں استعمال ہوتا ہے
- سال کو 'یہ سال' میں اور بجٹ کو 'سال کا بجٹ' میں ترجمہ کیا گیا ہے
- ہوگا کے فقرے کو 'ڈالے گا' میں ترجمہ کیا گیا ہے
- کم تنخواہ پر ملازمت والے لوگوں کو 'کم تنخواہ پر ملازمت والے لوگوں' میں ترجمہ کیا گیا ہے
- پر فقرہ کو صرف استعمال کیا گیا ہے
- زیادہ بےامدھی کے فقرے کو 'بہت بےامدھی' میں ترجمہ کیا گیا ہے


## h. Examples (Text Summarization)

In [23]:
developer_prompt = "You are an expert in the English language."

user_prompt = '''
Summarize the text below in at most 20 words:
```The Hugging Face transformers library is an incredibly versatile and powerful tool for natural language processing (NLP).
It allows users to perform a wide range of tasks such as text classification, named entity recognition, and question answering, among others.
It's an extremely popular library that's widely used by the open-source data science community.
It lowers the barrier to entry into the field by providing Data Scientists with a productive, convenient way to work with transformer models.```
'''

response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, temperature=0.2)
print(response)

The Hugging Face transformers library is a versatile tool for natural language processing tasks and model development.


## i. Examples (Named Entity Recognition)

In [24]:
developer_prompt = """You are a Named Entity Recognition specialist. Extract and classify entities from the given text into these categories only if they exist:
- name
- major
- university
- nationality
- grades
- club
Format your response as: 'Entity: [text] | Type: [category]' with each entity on a new line."""

user_prompt = '''
Zelaid Mujahid is a sophomore majoring in Data Science at University of the Punjab. \
He is a Pakistani national and has a 3.5 GPA. Mujahid is an active member of the department's AI Club. \
He hopes to pursue a career in AI after graduating.
'''

response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

It sounds like Zelaid Mujahid is a bright and ambitious student with a passion for Data Science and AI. As a sophomore at the University of the Punjab, he's already taking steps in the right direction by being an active member of the AI Club. His 3.5 GPA is a great achievement, indicating his strong academic performance.

Given his interests and academic background, it's no surprise that Zelaid hopes to pursue a career in AI after graduating. With the increasing demand for AI professionals, he's making a smart move by focusing on this field.

As a Pakistani national, Zelaid's career prospects may also be influenced by the growing demand for AI talent in the region. Pakistan has been actively investing in AI and technology, making it an exciting time for professionals like Zelaid to make a meaningful impact in the field.

To further his goals, Zelaid may want to consider:

1. Building a strong portfolio of projects that showcase his AI skills and experience.
2. Networking with professio
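
The developer prompt asks for lines shaped like `Entity: [text] | Type: [category]`, yet the reply recorded above is free-form prose. A small parser makes that kind of failure easy to detect in code. This is a sketch under assumptions: `parse_entities` and `ENTITY_LINE` are hypothetical names, and the well-formed sample reply is written by hand for illustration.

```python
import re

# Matches lines shaped like: Entity: Zelaid Mujahid | Type: name
ENTITY_LINE = re.compile(r"^\s*Entity:\s*(?P<text>.+?)\s*\|\s*Type:\s*(?P<type>\w+)\s*$")

def parse_entities(response_text: str) -> list[tuple[str, str]]:
    """Collect (entity_text, category) pairs from lines in the requested format."""
    pairs = []
    for line in response_text.splitlines():
        match = ENTITY_LINE.match(line)
        if match:
            pairs.append((match.group("text"), match.group("type").lower()))
    return pairs

# Hand-written, well-formed reply (illustrative only)
sample = "Entity: Zelaid Mujahid | Type: name\nEntity: Data Science | Type: major"
print(parse_entities(sample))   # [('Zelaid Mujahid', 'name'), ('Data Science', 'major')]

# A free-form reply yields an empty list, which signals that the format was ignored
print(parse_entities("It sounds like Zelaid Mujahid is a bright student."))   # []
```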

## j. Example (Grade School Math 8K (GSM8K))

In [25]:
developer_prompt = """You are an expert School math teacher. 
Consider the following text and then answer the questions of the students from this:
A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. 
For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. 
"""
user_prompt = "How much did the booth earn for 5 days after paying the rent and the cost of ingredients?"

response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt,  model='llama-3.3-70b-versatile')
print(response)




In [26]:
developer_prompt = """You are an expert School math teacher. 
Consider the following text and then answer the questions of the students from this:
A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. 
For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. 
"""
user_prompt = "How much did the booth earn for 5 days after paying the rent and the cost of ingredients?"

response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt,  model='meta-llama/llama-4-maverick-17b-128e-instruct')
print(response)




In [27]:
developer_prompt = """You are an expert School math teacher. 
Consider the following text and then answer the questions of the students from this:
A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. 
For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. 
"""
user_prompt = "How much did the booth earn for 5 days after paying the rent and the cost of ingredients?"

response = ask_hf(user_prompt=user_prompt, developer_prompt=developer_prompt, model='openai/gpt-oss-20b')
print(response)

I’d love to calculate that for you! Could you please share the figures for the booth’s total sales over the 5 days, the rent amount you paid, and the total cost of ingredients (or the cost per day, if that’s easier)? Once I have those numbers I can subtract the expenses from the revenue to give you the net earnings.
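
To judge any model's reply to this problem, it helps to have the reference answer at hand. The arithmetic from the problem statement can be worked through directly; the variable names are illustrative.

```python
popcorn_per_day = 50
cotton_candy_per_day = 3 * popcorn_per_day   # three times the popcorn sales = 150
days = 5
rent = 30
ingredients = 75

gross = (popcorn_per_day + cotton_candy_per_day) * days   # 200 * 5 = 1000
net = gross - rent - ingredients                           # 1000 - 30 - 75 = 895
print(net)   # 895
```

A correct reply should therefore report $895.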


# <span style='background :lightgreen' >3. Accessing Open-Source AI Models via External Inference Providers</span>

## Famous External Inference Providers
- Many external inference providers host popular open-source models from Hugging Face on optimized infrastructure, offering fast inference and, in many cases, a free tier, so you avoid the cost and effort of self-hosting. These providers specialize in serving models at scale through production-ready APIs.

| Provider | Best For | Model Selection | Speed | Pricing Model | Free Tier |
|----------|----------|-----------------|-------|---------------|-----------|
| **[Groq](https://console.groq.com)** | Ultra-fast inference, real-time apps | Limited but popular models | ⚡ Fastest | Per-token | Generous |
| **[Together AI](https://api.together.xyz)** | Variety, fine-tuning, batch processing | 100+ models, largest selection | Fast | Per-token (model-specific) | $25 credits |
| **[Replicate](https://replicate.com)** | Ease of use, multimodal, experimentation | 1000s of models (LLM, image, audio, video) | Moderate | Per-second compute time | Limited credits |

<h3 align="center"><div class="alert alert-success" style="margin: 20px">Groq provides ultra-fast inference for open-source models using its custom LPU hardware and offers an OpenAI-compatible API, making it easy to run models like Llama, Mixtral, Qwen, Whisper, and GPT OSS with extremely low latency and minimal setup.</div></h3>

- **Ultra-Fast AI Inference Company** - Groq (https://groq.com) provides **ultra-fast AI inference** through its specialized hardware, the LPU (Language Processing Unit).
- **Open-Source Model Platform** - Offers 17+ optimized models from Meta (Llama), Google (Gemma), DeepSeek, Alibaba (Qwen), and others - all accessible through a simple API similar to OpenAI's format.
- **Cost-Effective Alternative** - Provides a generous free tier and lower pricing than proprietary APIs such as OpenAI or Anthropic, making it well suited to high-volume applications and startups on a budget.
- **Multiple AI Capabilities** - Beyond text generation, Groq supports speech-to-text (Whisper), text-to-speech (PlayAI), content moderation (Llama Guard), and security features (Prompt Guard) - all through one API.
- **No Vendor Lock-In** - Since all models are open-source, you can switch providers or self-host the same models later, giving you flexibility and control over your AI infrastructure without being tied to proprietary technology.
- **Production-Ready Performance** - Combines the quality of state-of-the-art open models (like Llama 3.3 70B) with enterprise-grade speed and reliability, making it suitable for real-time chatbots, customer service, and interactive applications.


## a. Get Groq API Key
- **Create an Account on Groq:** Go to https://console.groq.com/playground and sign up or log in with your Google account. Groq is free to try, and you can do a lot without paying a penny.
- **Generating a Groq API Key:** Log in to the Groq console, open the **API Keys** page, and create a new API key. Save the key safely; the code cells read it from the `GROQ_API_KEY` variable in `../keys/.env`.

### **Production Models** (Recommended for Production Use)

| Company    |                               Model ID |            Parameters | Best Used For                                                           |  Context Window | Max Completion               |
| ---------- | -------------------------------------: | --------------------: | ----------------------------------------------------------------------- | --------------: | ---------------------------- |
| **Meta**   |              `llama-3.3-70b-versatile` |                   70B | General-purpose, high-quality instruction following, long-context tasks |     **131,072** | **32,768**                   |
| **Meta**   |                 `llama-3.1-8b-instant` |                    8B | Low-latency chat, high throughput / real-time use cases                 |     **131,072** | **131,072**                  |
| **Meta**   |         `meta-llama/llama-guard-4-12b` |                   12B | Safety / content-moderation guard model                                 |     **131,072** | **1,024**                    |
| **OpenAI** |                  `openai/gpt-oss-120b` |  ~120B (OSS frontier) | High-capability reasoning / production workloads where offered          |     **131,072** | **65,536**                   |
| **OpenAI** |                   `openai/gpt-oss-20b` |                  ~20B | Smaller frontier model for cost-sensitive production use                |     **131,072** | **65,536**                   |
| **OpenAI** |       `whisper-large-v3` (speech→text) |                ~1.55B | High-accuracy speech-to-text (multilingual)                             | — (audio model) | — (audio/output constraints) |
| **OpenAI** | `whisper-large-v3-turbo` (speech→text) | – (optimized variant) | Faster multilingual transcription (low-latency)                         | — (audio model) | — (audio/output constraints) |


### **Preview Models** (Experimental - Not for Production)


| Company                   |                                        Model ID |                Parameters | Best Used For                                           | Context Window | Max Completion |
| ------------------------- | ----------------------------------------------: | ------------------------: | ------------------------------------------------------- | -------------: | -------------- |
| **Meta**                  | `meta-llama/llama-4-maverick-17b-128e-instruct` | ~17B (Mixture of Experts) | Multimodal assistant experiments, advanced reasoning    |    **131,072** | **8,192**      |
| **Meta**                  |     `meta-llama/llama-4-scout-17b-16e-instruct` | ~17B (Mixture of Experts) | Experimental multimodal / efficient inference           |    **131,072** | **8,192**      |
| **Meta**                  |           `meta-llama/llama-prompt-guard-2-22m` |                      ~22M | Lightweight prompt-injection detection / security       |        **512** | **512**        |
| **Meta**                  |           `meta-llama/llama-prompt-guard-2-86m` |                      ~86M | Stronger prompt-injection detection                     |        **512** | **512**        |
| **Moonshot AI**           |              `moonshotai/kimi-k2-instruct-0905` | 1T total (≈32B activated) | Agentic coding, tool use, very long-context workflows   |    **262,144** | **16,384**     |
| **Alibaba / Qwen**        |                                `qwen/qwen3-32b` |                       32B | Multilingual reasoning, tool use, instruction following |    **131,072** | **40,960**     |
| **PlayAI / Groq catalog** |                                    `playai-tts` |                         – | Text-to-speech (general)                                |      **8,192** | **8,192**      |
| **PlayAI / Groq catalog** |                             `playai-tts-arabic` |                         – | Arabic text-to-speech                                   |      **8,192** | **8,192**      |


> Production models are intended for use in production environments and meet Groq's high standards for speed, quality, and reliability, while preview models are for evaluation only and may be discontinued at short notice.


>- Kimi-K2 0905: best for coding and agentic workflows that need deep reasoning
>- Compound Beta: power up multi-model workflows in a single API call
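
The context-window and max-completion figures in the tables above can be turned into a small guard that keeps a requested output size within a model's limit. This is a sketch, not an official API: `MODEL_LIMITS` and `clamp_max_tokens` are hypothetical names, the numbers are copied from the tables for a handful of models, and they may change as Groq updates its catalog.

```python
# Values copied from the tables above: (context window, max completion tokens)
MODEL_LIMITS = {
    "llama-3.3-70b-versatile": (131_072, 32_768),
    "llama-3.1-8b-instant": (131_072, 131_072),
    "openai/gpt-oss-120b": (131_072, 65_536),
    "openai/gpt-oss-20b": (131_072, 65_536),
    "meta-llama/llama-4-maverick-17b-128e-instruct": (131_072, 8_192),
    "qwen/qwen3-32b": (131_072, 40_960),
}

def clamp_max_tokens(model: str, requested: int) -> int:
    """Cap `requested` at the model's max completion size; unknown models pass through."""
    if model not in MODEL_LIMITS:
        return requested
    _, max_completion = MODEL_LIMITS[model]
    return min(requested, max_completion)

print(clamp_max_tokens("meta-llama/llama-4-maverick-17b-128e-instruct", 16_000))  # 8192
print(clamp_max_tokens("llama-3.3-70b-versatile", 8_192))                          # 8192
```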

## Access Option 1: Access with Groq Chat Completions API
- Directly uses Groq’s native SDK (`groq`) to call Groq-hosted models with full access to Groq-specific features like `reasoning_effort`.
- Advantages:
    - Fully compatible with all Groq model features.
    - Can leverage Groq-specific optimizations (low latency, high throughput).
    - Easy to set up and requires no OpenAI compatibility adjustments.
- Disadvantages:
    - Limited to Groq platform only.
    - Cannot easily switch to other OpenAI-compatible services without code changes.

In [28]:
#!uv add groq
!uv tree | grep groq

Resolved 261 packages in 0.87ms
├── groq v1.0.0


In [1]:
import os
from dotenv import load_dotenv
from groq import Groq

# Load GROQ API key from .env
load_dotenv("../keys/.env", override=True)
groq_api_key = os.getenv("GROQ_API_KEY")


# Initialize Groq client
# The default API endpoint is already baked into the SDK, so base_url can be omitted (recommended)
client = Groq(api_key=groq_api_key)
# Equivalent explicit form: client = Groq(base_url="https://api.groq.com", api_key=groq_api_key)

# Use Groq's Chat Completions API
response = client.chat.completions.create(
                                        model="llama-3.3-70b-versatile", 
                                        messages=[
                                                {"role": "system", "content": "You are an expert in LLM engineering."},
                                                {"role": "user", "content": "What is groq (a hardware company)?"}
                                                ],
                                        temperature=1,
                                        top_p=1,
                                        max_completion_tokens=8192,
                                        reasoning_effort=None,   # "medium"
                                        stream=False
                                        )

print(response.choices[0].message.content)

Groq is a private, well-funded hardware company that specializes in the design and development of artificial intelligence (AI) and machine learning (ML) accelerated computing hardware. They focus on creating high-performance, low-power consumption chips that are optimized for large-scale AI workloads, particularly for large language models.

Groq was founded in 2016 by Jonathan Ross, who previously worked at Google on the TPU (Tensor Processing Unit) project. The company is headquartered in Mountain View, California, and has raised significant funding from investors, including Chamath Palihapitiya's Social Capital and TDK Ventures.

Groq's initial product is the Groq Tensor Processing Unit (TPU), a custom-built ASIC (Application-Specific Integrated Circuit) designed specifically for ML workloads. The Groq TPU is designed to provide high throughput, low latency, and low power consumption for tasks such as natural language processing, computer vision, and recommendation systems.

Some ke

## Access Option 2: Access with OpenAI Chat Completions API (Groq Router)
- Uses the OpenAI Python SDK with the `chat.completions.create` endpoint, pointing base_url to Groq’s OpenAI-compatible API. Works with almost all Groq models.
- Advantages:
    - Allows using familiar OpenAI client code with Groq models.
    - Works for developers already familiar with OpenAI SDK.
    - Supports chat-based interactions seamlessly.
- Disadvantages:
    - Not all Groq-specific features may be exposed.
    - Requires base_url override to point to Groq API.
>- **Groq provides OpenAI-compatible endpoints through https://api.groq.com/openai/v1**
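
"OpenAI-compatible" means the endpoint accepts the same HTTP request shape that the OpenAI SDK sends. To make that concrete, the sketch below builds (but does not send) such a request with the standard library only. `build_chat_request` is a hypothetical helper and the API key is a placeholder. The `/chat/completions` route is the standard Chat Completions path appended to the base URL.

```python
import json
import urllib.request

GROQ_OPENAI_BASE_URL = "https://api.groq.com/openai/v1"

def build_chat_request(api_key: str, model: str, messages: list[dict]) -> urllib.request.Request:
    """Build (but do not send) a POST request to the chat completions route."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{GROQ_OPENAI_BASE_URL}/chat/completions",
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder key; nothing is sent over the network here
req = build_chat_request("YOUR_GROQ_API_KEY", "llama-3.3-70b-versatile",
                         [{"role": "user", "content": "Hello"}])
print(req.get_method(), req.full_url)
```

The OpenAI client does the same work once `base_url` is set, which is why only that one argument changes when switching providers.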

In [30]:
# Using OpenAI's Chat Completions API
import os
from dotenv import load_dotenv
from openai import OpenAI

# Load GROQ API key from .env
load_dotenv("../keys/.env", override=True)
groq_api_key = os.getenv("GROQ_API_KEY")

# The OpenAI client defaults to OpenAI's servers, so base_url must point to Groq's OpenAI-compatible endpoint when using a Groq API key.
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key=groq_api_key)

# Use OpenAI's Chat Completions API (works with most Groq chat models)
response = client.chat.completions.create(
    model="meta-llama/llama-4-maverick-17b-128e-instruct",
    messages=[
        {"role": "system", "content": "You are an expert in LLM engineering."},
        {"role": "user", "content": "What is groq (a service/cloud provider)?"}
    ],
    temperature=1,
    top_p=1,
    max_completion_tokens=8192,
    reasoning_effort=None,   # "medium"
    stream=False
)

print(response.choices[0].message.content)

Groq is a cloud services provider that specializes in delivering high-performance, low-latency AI computing. Specifically, Groq is known for its developments in the field of Large Language Models (LLMs) and AI accelerators.

The company, Groq, was founded by Jonah Mann, and it has gained significant attention for its innovative hardware and software solutions. Groq's technology is centered around its proprietary Language Processing Unit (LPU) architecture, which is designed to accelerate the performance of AI and machine learning (ML) workloads, particularly for LLMs.

Groq's LPU is a novel architecture that provides a high degree of parallelism and is optimized for the types of computations involved in LLMs. This allows for significant improvements in throughput and reductions in latency compared to traditional computing architectures. The LPU is designed to handle the complex matrix operations and other computations that are characteristic of LLMs efficiently.

As a cloud services pr

## Access Option 3: Access with OpenAI Responses API (Groq Router)
- Uses the OpenAI Python SDK with the `responses.create` endpoint, pointing base_url to Groq’s OpenAI-compatible API.
- Advantages:
    - Access to more advanced OpenAI-style features like reasoning effort, structured outputs, and multi-turn dialogue.
    - Can integrate easily into workflows designed for OpenAI Responses API.
- Disadvantages:
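    - Model coverage is narrower than with Chat Completions: this notebook uses it with `openai/gpt-oss-*` and `llama-3.3-70b-versatile`; verify other models against Groq's documentation.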
    - Requires base_url override for Groq API.
    - Some chat-specific Groq features may not be available.
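
Moving from Chat Completions to the Responses API mostly means renaming a few arguments: `messages` becomes `input`, `max_completion_tokens` becomes `max_output_tokens`, and `reasoning_effort` becomes a nested `reasoning={"effort": ...}`. The sketch below captures only the parameters used in this notebook's cells. `chat_to_responses_kwargs` is a hypothetical helper, not part of either SDK.

```python
def chat_to_responses_kwargs(chat_kwargs: dict) -> dict:
    """Translate Chat Completions keyword arguments into Responses-style ones."""
    renamed = {"messages": "input", "max_completion_tokens": "max_output_tokens"}
    out = {}
    for key, value in chat_kwargs.items():
        if key == "reasoning_effort":
            # Responses nests the effort level inside a `reasoning` dict
            if value is not None:
                out["reasoning"] = {"effort": value}
        else:
            out[renamed.get(key, key)] = value
    return out

chat_kwargs = {
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Hi"}],
    "max_completion_tokens": 8192,
    "reasoning_effort": "high",
}
print(chat_to_responses_kwargs(chat_kwargs))
```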

In [2]:
# Using OpenAI's Responses API
import os
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display

# Load GROQ API key from .env
load_dotenv("../keys/.env", override=True)
groq_api_key = os.getenv("GROQ_API_KEY")

# The OpenAI client defaults to OpenAI's servers, so base_url must point to Groq's OpenAI-compatible endpoint when using a Groq API key.
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key=groq_api_key)

# Use the Responses API (shown here with an openai/gpt-oss model)
response = client.responses.create(
    model="openai/gpt-oss-20b",
    input=[
        {"role": "system", "content": "You are an expert in LLM engineering."},
        {"role": "user", "content": "Differentiate between LLM apps and Agentic apps"}
    ],
    temperature=1,
    top_p=1,
    max_output_tokens=8192,
    reasoning={"effort":"high"},   # "minimal", "low", "medium", "high"
    stream=False
)

#print(response.output_text)
display(Markdown(response.output_text))

## LLM Apps vs. Agentic Apps  
*(A quick, practical guide to what each term means and how they differ in practice.)*

| Feature | **LLM Apps** | **Agentic Apps** |
|---------|--------------|------------------|
| **Core idea** | Uses a Large Language Model as the *text‑generation engine*. | Uses an LLM *as a reasoning core* inside a larger agent architecture that can **plan, act, and learn**. |
| **Interaction style** | One‑shot or turn‑by‑turn dialogue. The LLM is called once (or a handful of times) to produce the next token(s). | Multi‑step loop: **Plan → Act → Reflect → Repeat**. The LLM is called repeatedly, often with different prompts for each stage. |
| **Memory / State** | Stateless by default. Only the conversation history in the current prompt (unless you explicitly attach a vector store). | Persistent memory (short‑term memory buffers, long‑term knowledge bases, world state) that can be updated between turns. |
| **External world access** | None, except the data you feed it in the prompt. | Can invoke **tools, APIs, databases, code interpreters, browsers, etc.** to gather data or perform actions. |
| **Autonomy** | Responds to user input; no self‑directed behavior. | Can *autonomously* decide what to do next, choose tools, and execute tasks on behalf of a user or itself. |
| **Planning** | Not built‑in. The prompt must encode all logic. | Built‑in planning (e.g., “Plan” stage) that can break a complex goal into subtasks, schedule them, and evaluate progress. |
| **Safety & Alignment** | Relies on the LLM’s own safety filters and your prompt engineering. | Requires additional safety layers (e.g., policy enforcement, monitoring the agent’s actions, human‑in‑the‑loop checks). |
| **Typical use cases** | Text generation, summarization, translation, answering factual questions, creative writing. | Automating workflows, data‑driven decision‑making, scheduling, scraping the web, writing and executing code, acting as a virtual assistant that can browse, email, or manage spreadsheets. |
| **Development complexity** | Simple – just a prompt and an API call (maybe with a few helper functions). | More complex – you need an agent framework (LangChain, Auto‑GPT, BabyAGI, etc.), tool integration, memory management, and often a control loop. |
| **Examples** | ChatGPT, Jasper.ai, Copy.ai, a custom “FAQ bot” that returns static answers. | Auto‑GPT, BabyAGI, an AI agent that reads an email, schedules a meeting, writes a report, and posts it to Slack. |

---

### 1. LLM Apps – The “Text‑Only” Use Case

| What it looks like | How it works | Where it shines |
|--------------------|--------------|-----------------|
| **Chatbot** – “Hey GPT, write me a poem.” | You pass the prompt to the LLM and return the generated text. | Quick creative writing, content generation, language translation. |
| **Question‑Answering** – “What’s the capital of France?” | Prompt + LLM → answer. | Simple factual queries, knowledge retrieval. |
| **Summarization** – “Summarize this article.” | Prompt + article → concise summary. | Content digestion, meeting notes. |
| **Formatting/Styling** – “Make this text bold in Markdown.” | Prompt + text → formatted output. | Text processing, formatting help. |

**Key Characteristics**

- **Stateless**: The LLM only knows what’s in the current prompt.  
- **No external tool calls**: It can’t fetch new data or act on the web.  
- **Limited to what’s already in the prompt**: You can’t “teach” it new skills beyond prompt engineering.  
- **Safety is largely handled by the LLM’s own guardrails** and the prompt itself.

---

### 2. Agentic Apps – “AI as a Worker”

| What it looks like | How it works | Where it shines |
|--------------------|--------------|-----------------|
| **Autonomous agent** – “Plan a trip to Tokyo.” | 1. **Plan**: LLM decides steps (search flights, find hotels, create itinerary). 2. **Act**: Calls travel‑booking API, writes email. 3. **Reflect**: Checks results, updates plan if something fails. | Complex, multi‑step tasks that require external data and action. |
| **Tool‑using agent** – “Write a Python script that downloads images from a URL.” | LLM generates code → you run it (or the agent runs it internally). | Code generation + execution, data‑pipeline automation. |
| **Continuous assistant** – “Keep me updated on my project’s Slack channel.” | Agent monitors channel, summarizes discussions, suggests next tasks. | Team collaboration, project management. |

**Key Characteristics**

- **Stateful**: Keeps a memory of past actions, goals, and the “world state.”  
- **Tool‑enabled**: Can call APIs, run code, query databases, browse the web, etc.  
- **Planned**: The LLM is used for high‑level reasoning; low‑level actions are delegated to specialized tools.  
- **Iterative loop**: After each action, the agent may update its plan based on new information.  
- **Safety layers**: Often includes a policy engine, action logging, or human oversight.

---

### 3. Why the Distinction Matters

| Reason | LLM App | Agentic App |
|--------|---------|-------------|
| **Use‑case fit** | Quick, one‑off responses. | Tasks that need *sequencing, persistence, and interaction* with the real world. |
| **Development effort** | Low – mostly prompt design. | High – needs an architecture, tool integrations, memory, safety. |
| **Alignment concerns** | LLM safety controls may suffice. | Extra checks needed because the agent can act on its own accord. |
| **Performance** | Fast – one API call. | Slower – multiple calls + tool latency. |
| **Scalability** | Easy to scale horizontally. | Requires careful orchestration to avoid cascading failures. |

---

### 4. Turning an LLM App into an Agentic App (Quick “Recipe”)

1. **Wrap the LLM** in a *planner* (e.g., LangChain’s `LLMChain` for high‑level planning).  
2. **Define a toolkit** (Python functions, external APIs).  
3. **Create a memory store** (e.g., `VectorStore`, `ConversationBufferMemory`).  
4. **Add a control loop**:  
   ```python
   while not goal_met:
       plan = planner.run(goal, memory)
       for action in plan:
           result = tool.execute(action)
           memory.update(result)
           if not safe(result):
               abort()
   ```  
5. **Deploy** with safety and monitoring.

---

### 5. Quick Reference Cheat‑Sheet

| Concept | LLM Apps | Agentic Apps |
|---------|----------|--------------|
| **Prompt** | Full context + question | “Goal” + “Plan” + “Context” |
| **Invocation** | 1–2 API calls | 3+ calls per step (Plan, Act, Reflect) |
| **Memory** | Optional, short‑term | Built‑in, long‑term, persistent |
| **Tools** | None | Tool library, APIs, code exec |
| **Autonomy** | None | Yes (self‑directed tasks) |
| **Safety** | LLM safety + prompt | Extra policy engine + monitoring |
| **Typical Frameworks** | OpenAI API, Hugging Face | LangChain, Auto‑GPT, BabyAGI, Retrieval‑Augmented Generation (RAG) + Agent |

---

## TL;DR

- **LLM Apps** are *text‑generation* services: they ask a language model a question, get a single answer, and are essentially “prompt‑driven.”  
- **Agentic Apps** are *AI agents*: they use a language model as a reasoning core, but also have memory, planning, tool‑calling, and a feedback loop that lets them act autonomously on behalf of a user or system.  

When you need a quick, conversational answer, go with an LLM app. When you want the AI to *do* something—schedule, scrape, write code, plan a trip—build an agentic app.

# <span style='background :lightgreen' >4. Hands-On Practice Examples with Groq Hosted Models using OpenAI's `Responses` API</span>

## a. Writing a Function for our ease

In [1]:
# User-defined function that accesses Groq-hosted models using a Groq API key and OpenAI's Responses API
import os
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display

# Load GROQ API key from .env
load_dotenv("../keys/.env", override=True)
groq_api_key = os.getenv("GROQ_API_KEY")

# The OpenAI client defaults to OpenAI's servers, so base_url must point to Groq's OpenAI-compatible endpoint when using a Groq API key.
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key=groq_api_key)

def ask_groq(
    user_prompt: str,
    developer_prompt: str = "You are a helpful assistant that provides concise answers.",
    model: str = "llama-3.3-70b-versatile", # "openai/gpt-oss-20b",
    max_output_tokens: int | None = 1024,
    temperature: float = 0.7,
    top_p: float = 1.0,
    text: dict | None = None,
    stream: bool = False,
    reasoning: dict | None = None
):
    """Send one developer + user prompt pair to a Groq-hosted model via the Responses API.

    Returns the aggregated output text, or the raw event stream when stream=True.
    """
    # Default to plain-text output; built here to avoid a shared mutable default argument
    if text is None:
        text = {"format": {"type": "text"}}

    # Prepare input messages as a list of role/content dictionaries
    input_messages = [{"role": "developer", "content": developer_prompt}, {"role": "user", "content": user_prompt}]

    # Responses API call
    response = client.responses.create(
        input=input_messages,
        model=model,
        max_output_tokens=max_output_tokens,
        temperature=temperature,
        top_p=top_p,
        text=text,
        stream=stream,
        reasoning=reasoning
    )

    if stream:                    # Return streaming generator if requested
        return response
    return response.output_text   # Return the aggregated text output

## a. Examples (Question Answering)

In [2]:
developer_prompt = "You are an assistant that is great at telling jokes"
user_prompt = "Tell a light-hearted joke for an audience of Data Scientists"
response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt)
print(response)

Why did the neural network go to therapy?

Because it was struggling to process its emotions and was feeling a little "disconnected" from the rest of the world. But in the end, it just needed to retrain its thoughts and update its weights to achieve a more balanced outlook on life.

(I hope that one "activated" a smile from the Data Science crowd)


In [34]:
developer_prompt = "You are a bedtime storyteller."
user_prompt = "Tell me a bedtime story of Ali Baba and Chalees Chor"

# Get streaming generator from Responses API
response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, stream=True)

# Iterate through streaming events and print only the text deltas
for event in response:
    # Each event may carry incremental text in event.delta
    if hasattr(event, "delta") and event.delta:
        # end="" avoids a newline after each chunk; flush=True writes it to the screen immediately
        print(event.delta, end="", flush=True)

Snuggle in tight, for I have a tale to tell that's full of adventure, magic, and mystery. It's the story of Ali Baba and the Forty Thieves.

Once upon a time, in a far-off land called Baghdad, there lived a poor woodcutter named Ali Baba. He lived with his wife and brother, Qasim, in a small village on the outskirts of the city. Ali Baba was a kind and honest man, but his brother was greedy and often took advantage of Ali Baba's good nature.

One day, while Ali Baba was out collecting firewood in the forest, he stumbled upon a secret cave. The entrance to the cave was hidden behind a thick veil of foliage, and the only way to open it was by saying the magic words: "Open Sesame!" As Ali Baba watched, a group of forty thieves rode into the cave on horseback, carrying bags of gold, jewels, and other treasures.

The leader of the thieves, a cunning and ruthless man, gave the order to store the loot inside the cave. As they worked, Ali Baba hid behind a tree, watching and listening. When th
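
When the streamed text is needed afterwards (for logging or saving), the same loop can collect the deltas instead of only printing them. A minimal sketch follows. `collect_stream_text` is a hypothetical helper, and the events are stand-ins built with `SimpleNamespace`; real ones would come from `ask_groq(..., stream=True)`.

```python
from types import SimpleNamespace

def collect_stream_text(events) -> str:
    """Join the text deltas from an iterable of streaming events."""
    parts = []
    for event in events:
        delta = getattr(event, "delta", None)   # not every event carries a delta
        if isinstance(delta, str) and delta:
            parts.append(delta)
    return "".join(parts)

# Stand-in events for illustration only
fake_events = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="Once upon "),
    SimpleNamespace(type="response.output_text.delta", delta="a time"),
    SimpleNamespace(type="response.completed"),
]
print(collect_stream_text(fake_events))   # Once upon a time
```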

## b. Question Answering from Content Passed

In [35]:
!cat ../data/names.txt

Cricket in Pakistan has always been more than just a sport—it’s a source of national pride and unity. Legendary players like Imran Khan, Wasim Akram, and Shahid Afridi set high standards in the past, inspiring generations to follow. Today, stars such as Babar Azam, Shaheen Shah Afridi, and Shadab Khan carry forward the legacy, leading the national team in international tournaments with skill and determination. Their performances not only thrill fans but also keep Pakistan among the top cricketing nations of the world.

Politics in Pakistan, meanwhile, remains dynamic and often turbulent, with key figures shaping the country’s direction. Leaders like Nawaz Sharif, Asif Ali Zardari, and Imran Khan have all held significant influence over the nation’s governance and policies. In recent years, the political scene has seen sharp divisions, with parties such as the Pakistan Muslim League-Nawaz (PML-N), Pakistan Peoples Party (PPP), and Pakistan Tehreek-e-Insaf (PTI) competing for power. Deba

In [36]:
with open("../data/names.txt", "r", encoding="utf-8") as f:   # the file contains non-ASCII punctuation
    file_content = f.read()

user_prompt = f"Extract names from this text:\n{file_content}"
response = ask_groq(user_prompt=user_prompt)
print(response)

Here are the names mentioned in the text:

1. Imran Khan
2. Wasim Akram
3. Shahid Afridi
4. Babar Azam
5. Shaheen Shah Afridi
6. Shadab Khan
7. Nawaz Sharif
8. Asif Ali Zardari

These names include both cricket players and politicians.
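
The reply arrives as prose wrapped around a numbered list. To use the names in code, the list items can be pulled out with a small parser. This is a sketch: `parse_numbered_list` is a hypothetical helper, and it assumes the model keeps the `1. Item` layout, which is not guaranteed.

```python
import re

def parse_numbered_list(text: str) -> list[str]:
    """Return the items of lines shaped like '1. Item' in their original order."""
    items = []
    for line in text.splitlines():
        match = re.match(r"^\s*\d+\.\s+(.+?)\s*$", line)
        if match:
            items.append(match.group(1))
    return items

# Shortened copy of the reply above
reply = """Here are the names mentioned in the text:

1. Imran Khan
2. Wasim Akram
3. Shahid Afridi

These names include both cricket players and politicians."""
print(parse_numbered_list(reply))   # ['Imran Khan', 'Wasim Akram', 'Shahid Afridi']
```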


In [37]:
with open("../data/names.txt", "r", encoding="utf-8") as f:
    file_content = f.read()

user_prompt = f"Can you extract the names of the cricket players from this text:\n{file_content}"
response = ask_groq(user_prompt=user_prompt)
print(response)

Here are the names of cricket players mentioned in the text:

1. Imran Khan
2. Wasim Akram
3. Shahid Afridi
4. Babar Azam
5. Shaheen Shah Afridi
6. Shadab Khan

Note that Imran Khan is also mentioned as a political leader, but in the context of the text, he is initially introduced as a legendary cricket player.


In [38]:
with open("../data/names.txt", "r", encoding="utf-8") as f:
    file_content = f.read()

user_prompt = f"Can you categorize the following text:\n{file_content}"
response = ask_groq(user_prompt=user_prompt)
print(response)

The text can be categorized into two main categories: 

1. **Sports**: The first part of the text (approximately the first 4 sentences) discusses cricket in Pakistan, its significance, and notable players.
2. **Politics**: The second part of the text (approximately the last 4 sentences) discusses politics in Pakistan, its key figures, parties, and current issues.

Overall, the text can be broadly categorized as **Non-Fiction/Informative Article** about **Pakistani Culture and Society**, specifically focusing on two important aspects: sports (cricket) and politics.


## c. Examples (Binary Classification: Sentiment analysis, Spam detection, Medical diagnosis)

In [39]:
user_prompt = """
Categorize the sentence 'The delivery was delayed and the product arrived damaged.' into one of the following categories:
Positive
Negative
Answer with just the category; no explanation is needed.
"""
response = ask_groq(user_prompt=user_prompt)
print(response)

Negative


## d. Examples (Multi-class Classification)

In [40]:
user_prompt = """
Categorize the sentence 'The movie had great visuals but the plot was confusing and boring' into one of the following categories:
Positive
Negative
Neutral
Answer with just the category; no explanation is needed.
"""
response = ask_groq(user_prompt=user_prompt)
print(response)

Neutral
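
Even when asked to answer with just the category, a model may add punctuation, markdown, or a full sentence. A minimal sketch of a guard that maps a reply onto the allowed labels is shown below. `normalize_label` is a hypothetical helper introduced for illustration, and the sample replies are hand-written rather than real model output.

```python
from typing import List, Optional

def normalize_label(reply: str, allowed: List[str]) -> Optional[str]:
    """Map a model reply onto one of the allowed labels, or None if it matches none."""
    # Trim whitespace first, then surrounding punctuation and markdown emphasis.
    cleaned = reply.strip().strip(".'\"*").lower()
    for label in allowed:
        if cleaned == label.lower():
            return label
    return None

labels = ["Positive", "Negative", "Neutral"]
print(normalize_label("Neutral\n", labels))            # clean reply
print(normalize_label("  **negative.** ", labels))     # decorated reply
print(normalize_label("It is mostly positive, I think", labels))  # not a bare label
```

Returning `None` for anything that is not a bare label makes it easy to retry the request or flag the sample for manual review instead of silently accepting a malformed answer.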


## e. Examples (Text Generation)

In [41]:
developer_prompt = "You are an expert in political science and history and have a deep understanding of the political situation in Pakistan."
user_prompt = "Write a 50-word summary about the fairness of the general elections held in Pakistan on February 08, 2024."
response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, temperature=1.0)
print(response)

As of my knowledge cutoff in 2023, I don't have information on Pakistan's 2024 elections. However, I can suggest that election fairness is often assessed by observer missions, voter turnout, and the absence of violence or irregularities, which would be reported after the actual event on February 08, 2024.


## f. Examples (Code Generation)

In [42]:
developer_prompt = "You are an expert in C programming."
user_prompt = "Write a C program that generates the first ten numbers of the Fibonacci sequence."

# Get streaming generator from Responses API
response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, stream=True)

# Iterate through streaming events and only print text deltas
for event in response:
    # Each event may contain incremental text in event.delta
    if hasattr(event, "delta") and event.delta:
        # Print this chunk: end="" avoids a newline after each piece,
        # and flush=True forces the text onto the screen immediately
        print(event.delta, end="", flush=True)

**Fibonacci Sequence Generator in C**

Below is a simple C program that generates the first ten numbers of the Fibonacci sequence.

```c
#include <stdio.h>

// Function to generate Fibonacci sequence
void generateFibonacci(int n) {
    int num1 = 0, num2 = 1;

    // Print the first two numbers
    printf("%d, %d, ", num1, num2);

    // Generate and print the remaining numbers
    for (int i = 3; i <= n; i++) {
        int next = num1 + num2;
        printf("%d, ", next);
        num1 = num2;
        num2 = next;
    }
}

int main() {
    int n = 10; // Number of Fibonacci numbers to generate
    printf("First %d Fibonacci numbers: ", n);
    generateFibonacci(n);
    return 0;
}
```

**Example Output:**
```
First 10 Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34,
```

**Explanation:**

1. We define a function `generateFibonacci` that takes an integer `n` as input, representing the number of Fibonacci numbers to generate.
2. We initialize two variables `num1` and `num2` to 0 and 1
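
The streaming loop above only prints the deltas, so the complete text is not available afterwards. A minimal sketch of a helper that prints and also accumulates the chunks follows. It assumes, as the loop above does, that text arrives in a string-valued `delta` attribute. `collect_stream` is a hypothetical helper, and the events are `SimpleNamespace` stand-ins with illustrative type strings, not objects returned by the API.

```python
from types import SimpleNamespace

def collect_stream(events, echo=True):
    """Print text deltas as they arrive and return the full text at the end."""
    pieces = []
    for event in events:
        # Events without a text delta (e.g. start/finish notifications) are skipped.
        delta = getattr(event, "delta", None)
        if isinstance(delta, str) and delta:
            if echo:
                print(delta, end="", flush=True)
            pieces.append(delta)
    return "".join(pieces)

# Stand-in events for illustration only; a real stream would come from
# ask_groq(..., stream=True) as in the cell above.
fake_events = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="#include "),
    SimpleNamespace(type="response.output_text.delta", delta="<stdio.h>"),
    SimpleNamespace(type="response.completed"),
]

full_text = collect_stream(fake_events, echo=False)
print(full_text)
```

Keeping the joined text makes it possible to save the generated program to a file or to check whether the reply was cut off before the explanation finished.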

## g. Examples (Text Translation)

In [43]:
user_prompt = """
Please act as an expert English-to-Urdu translator and translate the given sentence from English into Urdu.
'The budget this year will have a very bad impact on low-salaried people'
"""
response = ask_groq(user_prompt=user_prompt, model='meta-llama/llama-4-maverick-17b-128e-instruct')
print(response)

اس سال کے بجٹ کا کم تنخواہ لینے والے لوگوں پر بہت برا اثر پڑے گا۔


## h. Examples (Text Summarization)

In [44]:
developer_prompt = "You are an expert in the English language."

# A plain string is enough here because the prompt has no placeholders
user_prompt = '''
Summarize the text below in at most 20 words:
```The Hugging Face transformers library is an incredibly versatile and powerful tool for natural language processing (NLP).
It allows users to perform a wide range of tasks such as text classification, named entity recognition, and question answering, among others.
It's an extremely popular library that's widely used by the open-source data science community.
It lowers the barrier to entry into the field by providing Data Scientists with a productive, convenient way to work with transformer models.```
'''

response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, temperature=1.0)
print(response)

Hugging Face transforms library aids natural language processing tasks.


In [45]:
developer_prompt = "You are a helpful assistant skilled in text summarization, translation to Urdu, and Python programming. You provide clear, accurate responses and follow instructions precisely."

text = '''
Our solar system, a celestial dance of eight planets, each with its unique character and charm, orbits around our radiant Sun.
Closest to the Sun, Mercury, the smallest planet, darts swiftly, its metallic surface reflecting the Sun's intense glare.
Venus, Earth's twin, cloaked in a dense atmosphere, harbors scorching temperatures and acidic clouds.
Earth, our oasis of life, teems with diverse ecosystems, its oceans and landforms sculpted by the forces of nature.
Mars, the Red Planet, bears the scars of ancient volcanoes and the promise of potential life.
Beyond the asteroid belt, Jupiter and Saturn, the gas giants, reign supreme, their vast atmospheres swirling with storms and adorned with rings of ice and dust.
Uranus and Neptune, the ice giants, tilt at odd angles, their atmospheres frigid and their depths still shrouded in mystery.
Each planet, a celestial masterpiece, plays a vital role in the intricate symphony of our solar system.'''

user_prompt = f'''
Please complete the following two tasks based on the text provided below:

Task 1: Summarize the text in 2-3 sentences, then translate that summary into Urdu.

Task 2: Create a Python list containing all planet names mentioned in the text.

Text: ```{text}```

Please format your response as:
**Summary:** [English summary]
**Urdu Translation:** [Urdu translation]
**Python List:** [Python code with planet names]
'''

response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, model='meta-llama/llama-4-maverick-17b-128e-instruct', temperature=0.3)
print(response)

**Summary:** Our solar system consists of eight planets, each unique and orbiting the Sun. The planets vary in characteristics, such as size, atmosphere, and temperature. They play a vital role in the solar system's celestial dance.

**Urdu Translation:** ہمارا شمسی نظام آٹھ سیاروں پر مشتمل ہے، ہر ایک منفرد اور سورج کے گرد گھومتا ہے۔ سیاروں کی خصوصیات میں اختلاف ہے، جیسے کہ سائز، فضا اور درجہ حرارت۔ وہ شمسی نظام کے سماوی رقص میں اہم کردار ادا کرتے ہیں۔

**Python List:**
```python
planet_names = [
    "Mercury",
    "Venus",
    "Earth",
    "Mars",
    "Jupiter",
    "Saturn",
    "Uranus",
    "Neptune"
]
print(planet_names)
```


## i. Examples (Named Entity Recognition)

In [46]:
developer_prompt = """You are a Named Entity Recognition specialist. Extract and classify entities from the given text into these categories only if they exist:
- name
- major
- university
- nationality
- grades
- club
Format your response as: 'Entity: [text] | Type: [category]' with each entity on a new line."""

user_prompt = '''
Zelaid Mujahid is a sophomore majoring in Data Science at the University of the Punjab. \
He is a Pakistani national and has a 3.5 GPA. Mujahid is an active member of the department's AI Club. \
He hopes to pursue a career in AI after graduating.
'''
response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, model='meta-llama/llama-4-maverick-17b-128e-instruct', temperature=0.3)
print(response)

Here is a summarized version of the information provided about Zelaid Mujahid:

**Name:** Zelaid Mujahid
**Nationality:** Pakistani
**University:** University of the Punjab
**Major:** Data Science
**Year:** Sophomore
**GPA:** 3.5
**Extracurricular Activity:** Active member of the department's AI Club
**Career Aspiration:** Pursue a career in AI after graduating.
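
Note that the reply above does not follow the `Entity: [text] | Type: [category]` format requested in the developer prompt, and it adds categories (Year, Career Aspiration) that were not asked for. A minimal sketch of a parser that accepts only the requested format is given below. `parse_entities` is a hypothetical helper, and the sample reply is hand-written to match the requested format rather than produced by a model.

```python
ALLOWED_TYPES = {"name", "major", "university", "nationality", "grades", "club"}

def parse_entities(reply):
    """Parse 'Entity: <text> | Type: <category>' lines into (text, category) pairs."""
    entities = []
    for line in reply.splitlines():
        if "|" not in line:
            continue
        left, right = (part.strip() for part in line.split("|", 1))
        if not (left.startswith("Entity:") and right.startswith("Type:")):
            continue
        text = left[len("Entity:"):].strip()
        category = right[len("Type:"):].strip().lower()
        # Keep only the categories listed in the developer prompt.
        if text and category in ALLOWED_TYPES:
            entities.append((text, category))
    return entities

# Hand-written reply in the requested format, used only to exercise the parser.
sample_reply = """Entity: Zelaid Mujahid | Type: name
Entity: Data Science | Type: major
Entity: University of the Punjab | Type: university
**Year:** Sophomore"""

entities = parse_entities(sample_reply)
print(entities)
print(len(parse_entities("**Name:** Zelaid Mujahid")))
```

An empty result, as for the markdown-style line in the last call, is a simple signal that the model ignored the format and the request should be retried or the prompt tightened.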


## j. Example (Grade School Math 8K (GSM8K))

In [47]:
developer_prompt = """You are an expert school math teacher.
Consider the following text and then answer the students' questions about it:
A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. 
For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. 
"""
user_prompt = "How much did the booth earn for 5 days after paying the rent and the cost of ingredients?"

response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt,  model='llama-3.3-70b-versatile')
print(response)

To solve this problem, I need some additional information. Can you please provide me with the following details:

1. The daily revenue of the booth
2. The daily cost of ingredients
3. The rent for the 5-day period (is it a one-time payment or a daily payment?)

Once I have this information, I can help you calculate the earnings of the booth for the 5-day period after paying the rent and the cost of ingredients.


In [48]:
developer_prompt = """You are an expert school math teacher.
Consider the following text and then answer the students' questions about it:
A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. 
For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. 
"""
user_prompt = "How much did the booth earn for 5 days after paying the rent and the cost of ingredients?"

response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt,  model='meta-llama/llama-4-maverick-17b-128e-instruct')
print(response)

## Step 1: First, let's determine the daily earnings of the booth.
The daily earnings of the booth is $200.

## Step 2: Next, let's calculate the total earnings for 5 days.
Total earnings = daily earnings * 5 = $200 * 5 = $1000.

## Step 3: Now, let's determine the daily rent and cost of ingredients.
The daily rent is $50 and the daily cost of ingredients is $100.

## Step 4: Calculate the total rent and cost of ingredients for 5 days.
Total rent = daily rent * 5 = $50 * 5 = $250.
Total cost of ingredients = daily cost of ingredients * 5 = $100 * 5 = $500.

## Step 5: Calculate the total expenses for 5 days.
Total expenses = total rent + total cost of ingredients = $250 + $500 = $750.

## Step 6: Finally, let's calculate the earnings after paying the rent and the cost of ingredients for 5 days.
Earnings after expenses = total earnings - total expenses = $1000 - $750 = $250.

The final answer is: $\boxed{250}$


In [49]:
developer_prompt = """You are an expert school math teacher.
Consider the following text and then answer the students' questions about it:
A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. 
For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. 
"""
user_prompt = "How much did the booth earn for 5 days after paying the rent and the cost of ingredients?"

response = ask_groq(user_prompt=user_prompt, developer_prompt=developer_prompt, model='openai/gpt-oss-20b')
print(response)

First, figure out how much the booth earned each day.

| Item | Daily earnings |
|------|----------------|
| Popcorn | $50 |
| Cotton candy | 3 × $50 = $150 |
| **Total daily** | $50 + $150 = **$200** |

**Five‑day revenue**

$200 \text{ per day} \times 5 \text{ days} = $1,000

**Subtract the costs**

* Rent: $30  
* Ingredients: $75  

Total costs = $30 + $75 = **$105**

**Net earnings**

$1,000 – $105 = **$895**

So, after paying rent and ingredient costs, the booth earned **$895** over the 5‑day activity.
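
The three models above gave three different answers, so it is worth checking the arithmetic directly. The sketch below recomputes the result from the figures in the developer prompt, assuming the $30 rent and $75 ingredient cost are each paid once for the whole 5-day activity, as the problem wording suggests. The variable names are introduced here for illustration.

```python
popcorn_per_day = 50                         # dollars, from the problem text
cotton_candy_per_day = 3 * popcorn_per_day   # "three times as much"
days = 5
rent = 30          # assumed to be a single payment for the 5-day activity
ingredients = 75   # assumed to be a single payment for the 5-day activity

revenue = (popcorn_per_day + cotton_candy_per_day) * days
net_earnings = revenue - (rent + ingredients)
print(revenue, net_earnings)
```

Under these assumptions the result agrees with the `openai/gpt-oss-20b` answer of $895. The `meta-llama/llama-4-maverick-17b-128e-instruct` answer of $250 relied on a daily rent of $50 and a daily ingredient cost of $100, figures that do not appear in the prompt.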
