# 🌟 LLM Inference Hands-On (Day 5) 🚀

Welcome to the **Day 5 Hands-On Session**!  
In this notebook, we will explore **practical applications of Large Language Models (LLMs) and Multimodal AI** using the Hugging Face Inference API.  

We’ll go beyond simple text generation and try out exciting capabilities such as:
- 🖼️ **Text-to-Image Generation** – turning words into stunning visuals  
- 🎥 **Text-to-Video** – creating short clips from prompts  
- 🎙️ **Speech Recognition (ASR)** – converting audio to text  
- 🖼️ **Image Classification** – detecting if content is safe or not  
- 🔀 **Image-to-Image Editing** – transforming one image into another  
- 🧠 **Multimodal Models** – combining text + image for richer AI interactions  

💡 The goal is to give you hands-on experience with **different AI model types**, while keeping things simple and interactive.  
All examples will use Hugging Face’s **InferenceClient**, which allows us to call powerful models with just a few lines of code.

---

## 🔐 Handling API Tokens

For simplicity, we hardcode the Hugging Face API token here. This is not recommended for production or shared notebooks; use an environment variable or a `.env` file instead for better security.

In [None]:
HF_TOKEN = "YOUR TOKEN HERE"
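
As a safer alternative to hardcoding, the token can be read from an environment variable. The following is a minimal sketch: the helper name `load_hf_token` and the variable name `HF_TOKEN` are assumptions chosen for illustration, not part of the Hugging Face API.

```python
import os


def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the API token stored in the given environment variable.

    The helper name and the default variable name are illustrative
    assumptions; use whatever name your environment defines.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of sending an empty token
        raise RuntimeError(
            f"Environment variable {env_var!r} is not set or is empty."
        )
    return token
```

If the variable is exported in the shell that launches Jupyter, `HF_TOKEN = load_hf_token()` can replace the hardcoded assignment.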

## Text-to-Image Generation

Let’s explore how to generate images from text prompts using Inference Providers. We’ll use **black-forest-labs/FLUX.1-dev**, a state-of-the-art diffusion model that produces highly detailed, photorealistic images.

In [None]:
from huggingface_hub import InferenceClient
from IPython.display import Image, display
import os

client = InferenceClient(api_key=HF_TOKEN)

image = client.text_to_image(
    prompt="Astronaut riding a horse",
    model="black-forest-labs/FLUX.1-dev"
)

# Save the generated image
image.save("generated_image.png")
display(Image(filename="generated_image.png"))

## Image-to-Image Editing

Edit an existing image using a model that supports image-to-image manipulation. For example, you can prompt the model to: *Turn the cat into a tiger.*

In [None]:
from huggingface_hub import InferenceClient
from IPython.display import Image, display  # IPython's Image accepts filename=; PIL's Image module does not
import os

client = InferenceClient(provider="fal-ai", api_key=HF_TOKEN)

with open("cat.png", "rb") as image_file:
    input_image = image_file.read()

# output is a PIL.Image object
image = client.image_to_image(
    input_image,
    prompt="Turn the cat into a tiger.",
    model="Qwen/Qwen-Image-Edit",
)

image.save("transformed_cat_to_tiger.png")
display(Image(filename="transformed_cat_to_tiger.png"))

## Text-to-Video

Generate short video clips from text descriptions using a model that supports **Text-to-Video** synthesis. For example, you can prompt: *Create a 10-second video of a sunset over the ocean with soft waves crashing on the shore.*

In [None]:
from huggingface_hub import InferenceClient
from IPython.display import Video, display
import os

client = InferenceClient(provider="replicate", api_key=HF_TOKEN)

video = client.text_to_video(
    prompt="A young man walking on the street",
    model="Wan-AI/Wan2.2-T2V-A14B",
)

# Save the video to a file (since the response is in bytes, we write it to a file)
with open("generated_video.mp4", "wb") as f:
    f.write(video)

# Display the video in the notebook
display(Video(filename="generated_video.mp4"))

## Image Classification — NSFW Detection

Classify images to detect whether they contain unsafe or inappropriate content. This is commonly used for content moderation on social media platforms.

In [None]:
from huggingface_hub import InferenceClient
import os

client = InferenceClient(provider="hf-inference", api_key=HF_TOKEN)

# Classify the image; the client returns a list of predictions, each with a label and a score
output = client.image_classification("cat.png", model="Falconsai/nsfw_image_detection")

# Print every predicted label with its score
for prediction in output:
    print(f"Label: {prediction.label}, Score: {prediction.score:.4f}")
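
For moderation, the raw predictions usually need to be turned into a yes/no decision. Here is a minimal sketch, assuming each prediction carries a `label` and a numeric `score`, and assuming the model's unsafe label is named `nsfw` (check the model card for the actual label names). The helper name `is_unsafe`, the 0.5 threshold, and the sample predictions are all hypothetical.

```python
def is_unsafe(predictions, unsafe_label="nsfw", threshold=0.5):
    """Return True if the unsafe label's score meets the threshold.

    `predictions` is a list of dicts with "label" and "score" keys.
    The label name and threshold are assumptions; tune them per model.
    """
    for prediction in predictions:
        label_matches = prediction["label"].lower() == unsafe_label
        if label_matches and prediction["score"] >= threshold:
            return True
    return False


# Hypothetical predictions, made up for illustration (not real model output)
sample = [
    {"label": "normal", "score": 0.97},
    {"label": "nsfw", "score": 0.03},
]
print(is_unsafe(sample))
```

To use it with the client's result, the predictions can first be converted to plain dicts, for example `[{"label": p.label, "score": p.score} for p in output]`.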

## Conversational VLM

Interact with **Vision-Language Models (VLMs)** that can process both images and text in a conversation.

In [None]:
import os
from huggingface_hub import InferenceClient

client = InferenceClient(provider="nscale", api_key=HF_TOKEN)

completion = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one sentence."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                }
            ]
        }
    ],
)

# Print only the generated text rather than the whole message object
print(completion.choices[0].message.content)
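
The request above points to a publicly hosted image. For a local file, one common approach is to embed the image as a base64 data URL in the same `image_url` field. A minimal sketch follows; the helper names `image_to_data_url` and `build_vision_message` are hypothetical, and whether a given provider accepts data URLs is an assumption to verify against its documentation.

```python
import base64
import mimetypes


def image_to_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URL, inferring the MIME type from the filename."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_vision_message(text: str, image_url: str) -> dict:
    """Build one user message that combines a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
```

After reading a file such as `cat.png` in binary mode, the result of `build_vision_message(...)` can be passed as the single element of `messages` in `client.chat.completions.create`.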

## 📌 Summary

In this hands-on session, we explored various practical applications of **Large Language Models (LLMs)** and **Multimodal AI** using the **Hugging Face Inference API**. Here’s a recap of what we covered:

1. **Text-to-Image Generation**: Generate realistic images from text descriptions.
2. **Image-to-Image Editing**: Transform an existing image with a new prompt (e.g., *Turn a cat into a tiger*).
3. **Text-to-Video**: Create video clips from text prompts.
4. **Automatic Speech Recognition (ASR)**: Transcribe spoken language into text.
5. **Image Classification — NSFW Detection**: Detect explicit content in images.
6. **Conversational VLM**: Engage with vision-language models for multimodal conversations (text + image).
7. **Translation**: Translate text between languages (e.g., Hindi to English using Helsinki model).

💡 **Next Steps**: Explore more models on the Hugging Face Hub, experiment with different use cases, and try out additional AI capabilities!

For more APIs, check out the official documentation for **Inference Providers**:  
[Hugging Face Inference Providers](https://huggingface.co/docs/inference-providers)