# NVIDIA Nemotron Nano 2 VL on Fireworks AI

This notebook demonstrates how to use NVIDIA Nemotron Nano 2 VL, a 12B multimodal reasoning model deployed on Fireworks AI, for document intelligence and product catalog cleansing.


## Table of Contents

1. [Introduction to NVIDIA Nemotron Nano 2 VL](#1-introduction-to-nvidia-nemotron-nano-2-vl)
2. [Setting Up On-Demand Deployment on Fireworks AI](#2-setting-up-on-demand-deployment-on-fireworks-ai)
3. [Performance and Speed Metrics](#3-performance-and-speed-metrics)
4. [Use Cases](#4-use-cases)
   - [Document Intelligence](#41-document-intelligence)
   - [Product Catalog Cleansing](#42-product-catalog-cleansing)

## 1. Introduction to NVIDIA Nemotron Nano 2 VL

### Overview

NVIDIA Nemotron Nano 2 VL is an open 12B multimodal reasoning model for video understanding and document intelligence. It is built on a hybrid transformer-Mamba architecture that aims to combine the strengths of both designs:

- **Accuracy on par with transformer-only models**
- **Lower memory and compute usage from the Mamba architecture**
- **Higher token throughput and lower latency**

### Key Features

#### Highest Accuracy
- Trained on NVIDIA-curated, high-quality synthetic data
- Best-in-class accuracy for:
  - Character recognition (OCR)
  - Chart reasoning
  - Image understanding
  - Video understanding
  - Document intelligence
- **73.2 average score**, versus 64.2 for the current top VL model, across benchmarks including MMMU, MathVista, AI2D, OCRBench, OCRBench-v2, OCR-Reasoning, ChartQA, DocVQA, and Video-MME

#### Highest Efficiency
- **Up to 10x higher throughput** compared to Llama Nemotron Nano VL
- Efficient Video Sampling (EVS) for processing longer videos
- Lower total cost of inference


### Primary Use Cases

1. **AI Assistant with Document Intelligence**
   - Customer service (dashboards, screenshots, docs)
   - IT, finance, insurance, healthcare forms

2. **Content Ingestion**
   - Product catalog cleansing
   - Dense captioning of images/videos

3. **Multi-modal Applications**
   - RAG systems over complex documents, figures, and graphs
   - Agentic apps and services

## 2. Setting Up On-Demand Deployment on Fireworks AI

### Prerequisites

Before getting started, you'll need:
- A Fireworks AI account ([sign up here](https://fireworks.ai))
- An API key from the Fireworks AI dashboard
- The `firectl` CLI, installed and authenticated (used below to create the deployment)
- Python 3.8 or higher

### Installation

Install the required packages:

In [None]:
!./setup.sh

### Environment Setup

Load your Fireworks AI API key from the `FIREWORKS_API_KEY` environment variable (or a local `.env` file):

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

FIREWORKS_API_KEY = os.getenv('FIREWORKS_API_KEY')

if not FIREWORKS_API_KEY:
    raise ValueError("Please set FIREWORKS_API_KEY environment variable")

### Create on-demand deployment for NVIDIA Nemotron Nano 2 VL

Create an [on-demand deployment](https://fireworks.ai/docs/getting-started/ondemand-quickstart#on-demand-quickstart) to run the tests and benchmarks below:

In [None]:
# TODO: Update to NVIDIA Nemotron Nano 2 VL when available
! firectl create deployment accounts/fireworks/models/qwen2-vl-72b-instruct --min-replica-count 1 --max-replica-count 1 --accelerator-type NVIDIA_H100_80GB

In [None]:
# Check the deployment status; replace <DEPLOYMENT-ID> with the ID printed by the create command
! firectl get deployment <DEPLOYMENT-ID>

### Helper Functions

Utility functions for encoding images and making API calls:

In [None]:
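import base64

from openai import OpenAI  # requires the `openai` package

# Assumption: the notebook never defines `client` or `MODEL_NAME`. This sketch
# calls Fireworks through its OpenAI-compatible endpoint.
client = OpenAI(
    api_key=FIREWORKS_API_KEY,
    base_url="https://api.fireworks.ai/inference/v1",
)

# Placeholder matching the stand-in model deployed above; replace it with the
# identifier of your own deployment.
MODEL_NAME = "accounts/fireworks/models/qwen2-vl-72b-instruct"


def encode_image_to_base64(image_path):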
    """
    Encode an image file to base64 string.
    
    Args:
        image_path: Path to the image file
        
    Returns:
        Base64 encoded string of the image
    """
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


def encode_image_url(image_url):
    """
    Download and encode an image from a URL.
    
    Args:
        image_url: URL of the image
        
    Returns:
        Base64 encoded string of the image
    """
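    import requests

    # Time out rather than hang, and fail loudly on HTTP errors instead of
    # base64-encoding an error page.
    response = requests.get(image_url, timeout=30)
    response.raise_for_status()
    return base64.b64encode(response.content).decode('utf-8')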


def analyze_image(image_source, prompt, is_url=False, max_tokens=1000):
    """
    Analyze an image using the VLM model.
    
    Args:
        image_source: Path to image file or URL
        prompt: Text prompt for the model
        is_url: Whether image_source is a URL
        max_tokens: Maximum tokens in response
        
    Returns:
        Model response text
    """
    # Encode image
    if is_url:
        image_b64 = encode_image_url(image_source)
    else:
        image_b64 = encode_image_to_base64(image_source)
    
    # Make API call
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_b64}"
                    },
                },
            ],
        }],
        max_tokens=max_tokens,
    )
    
    return response.choices[0].message.content


def analyze_multiple_images(image_sources, prompt, are_urls=False, max_tokens=1500):
    """
    Analyze multiple images using the VLM model.
    
    Args:
        image_sources: List of image paths or URLs
        prompt: Text prompt for the model
        are_urls: Whether image_sources contains URLs
        max_tokens: Maximum tokens in response
        
    Returns:
        Model response text
    """
    content = [{"type": "text", "text": prompt}]
    
    # Add all images
    for image_source in image_sources:
        if are_urls:
            image_b64 = encode_image_url(image_source)
        else:
            image_b64 = encode_image_to_base64(image_source)
        
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{image_b64}"
            },
        })
    
    # Make API call
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": content}],
        max_tokens=max_tokens,
    )
    
    return response.choices[0].message.content


print("Helper functions defined successfully!")
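
The helpers above label every image as `image/jpeg`, although later examples point at `.png` files. Below is a minimal sketch of a MIME-aware alternative, assuming the file extension can be trusted. `build_data_url` is a hypothetical helper name introduced here, not part of any Fireworks SDK.

```python
import base64
import mimetypes


def build_data_url(image_path, default_mime="image/jpeg"):
    """Return a data URL whose MIME type is guessed from the file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = default_mime  # fall back when the extension is unknown
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string could replace the hard-coded `f"data:image/jpeg;base64,{image_b64}"` value in the `image_url` entries for local files.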

### Basic Usage Example

Quick example to verify the setup works:

In [None]:
# Example with a URL image
test_image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

result = analyze_image(
    image_source=test_image_url,
    prompt="Describe this image in detail.",
    is_url=True
)

print("Model Response:")
print(result)

## 3. Performance and Speed Metrics

### Benchmarking Setup

Let's measure the performance characteristics of the model on Fireworks AI.

In [None]:
## TODO: Add / clone llm_bench and run benchmarks vs other VL models

### Run Performance Benchmark

In [None]:
## TODO
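
Until the `llm_bench` run is added, a rough single-stream measurement can be taken with a small timing harness. This is a sketch under stated assumptions, not a replacement for a proper load test: `benchmark` is a hypothetical helper, and `call` stands for any zero-argument function that performs one request and returns its completion-token count.

```python
import statistics
import time


def benchmark(call, runs=5):
    """Time `call` sequentially and summarize latency and output throughput."""
    latencies, tokens = [], []
    for _ in range(runs):
        start = time.perf_counter()
        completion_tokens = call()  # one request; returns tokens generated
        latencies.append(time.perf_counter() - start)
        tokens.append(completion_tokens)
    total_time = sum(latencies)
    return {
        "runs": runs,
        "mean_latency_s": statistics.mean(latencies),
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
        # Output tokens per second across all runs (sequential, one stream)
        "tokens_per_s": sum(tokens) / total_time if total_time > 0 else float("nan"),
    }
```

For example, `call` could wrap `client.chat.completions.create(...)` and return `response.usage.completion_tokens`, assuming the response reports token usage. Because the requests run one after another, the result reflects per-request latency rather than the deployment's peak throughput under concurrency.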

### Expected Performance with Nemotron Nano 2 VL

Based on NVIDIA's specifications (TODO: update with the benchmark results above), Nemotron Nano 2 VL offers:

- **Up to 10x higher throughput** compared to Llama Nemotron Nano VL
- **Efficient Video Sampling (EVS)** for processing longer videos
- **128k context length** support
- **Low-latency inference** thanks to the hybrid transformer-Mamba architecture

**Note:** Actual performance metrics will be updated once the model is available on Fireworks AI.

## 4. Use Cases

### 4.1 Document Intelligence

Nemotron Nano 2 VL excels at understanding complex documents including:
- Forms and invoices
- Charts and graphs
- Screenshots and dashboards
- Healthcare and insurance documents
- Financial reports

Let's explore document intelligence capabilities:

#### OCR and Text Extraction

Extract text from documents with high accuracy:

In [None]:
# Example: Extract text from a document image
# Replace with your document image URL or path
document_image = "https://example.com/sample-invoice.jpg"  # TODO: Add real example

ocr_prompt = """
Extract all text from this document. Organize the output in a structured format.
Maintain the original layout and hierarchy where possible.
"""

# Uncomment when you have a document image:
# result = analyze_image(document_image, ocr_prompt, is_url=True)
# print(result)

print("OCR example ready. Add a document image URL to test.")

#### Chart and Graph Analysis

Understanding complex charts and extracting insights:

In [None]:
# Example: Analyze a chart or graph
chart_prompt = """
Analyze this chart and provide:
1. The type of chart/visualization
2. Key data points and trends
3. Main insights or conclusions
4. Any notable patterns or anomalies
"""

# Uncomment when you have a chart image:
# chart_image = "path/to/chart.png"
# result = analyze_image(chart_image, chart_prompt)
# print(result)

print("Chart analysis example ready. Add a chart image to test.")

#### Form Understanding and Data Extraction

Extract structured data from forms:

In [None]:
import json

# Example: Extract structured data from a form
form_prompt = """
Extract all information from this form and return it as a JSON object.
Include field names as keys and the corresponding values.
If a field is empty, mark it as null.
"""

# Uncomment when you have a form image:
# form_image = "path/to/form.png"
# result = analyze_image(form_image, form_prompt, max_tokens=2000)
# 
# # Parse JSON response
# try:
#     form_data = json.loads(result)
#     print("Extracted Form Data:")
#     print(json.dumps(form_data, indent=2))
# except json.JSONDecodeError:
#     print("Response:", result)

print("Form extraction example ready. Add a form image to test.")
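
Models often wrap JSON replies in a Markdown code fence, which makes a plain `json.loads` call fail. The sketch below tolerates that case. `parse_json_response` is a hypothetical helper, and it assumes the reply contains nothing besides the (optionally fenced) JSON.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_response(text):
    """Parse a model reply as JSON, tolerating a surrounding code fence.

    Returns the parsed value, or None when the reply is not valid JSON.
    """
    cleaned = text.strip()
    if cleaned.startswith(FENCE) and cleaned.endswith(FENCE):
        first_newline = cleaned.find("\n")
        if first_newline == -1:
            return None  # a lone fence with no body
        # Drop the opening fence line (it may carry a "json" tag) and the closing fence
        cleaned = cleaned[first_newline + 1 : -len(FENCE)].strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None
```

A `None` result signals that the reply should be inspected manually or the request retried with a stricter prompt.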

### 4.2 Product Catalog Cleansing

Use Nemotron Nano 2 VL to clean and enrich product catalog data by:
- Extracting product attributes from images
- Generating accurate descriptions
- Identifying missing or incorrect information
- Categorizing products automatically
- Quality checking product listings
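
As an illustration of identifying missing or incorrect information, the sketch below compares an existing catalog record with attributes extracted from an image. `find_catalog_issues` and the field names are hypothetical, and the comparison is a simple case-insensitive string match, so treat it as a starting point rather than a complete cleansing rule set.

```python
def find_catalog_issues(catalog_record, extracted_attributes):
    """Compare a catalog record with image-derived attributes.

    Returns a dict with:
      - "missing": attributes seen in the image but empty or absent in the catalog
      - "conflicts": attributes where both sides have a value and they differ
    """
    missing, conflicts = {}, {}
    for field, extracted in extracted_attributes.items():
        if extracted is None:
            continue  # the model could not determine this attribute
        current = catalog_record.get(field)
        if current is None or str(current).strip() == "":
            missing[field] = extracted
        elif str(current).strip().lower() != str(extracted).strip().lower():
            conflicts[field] = {"catalog": current, "image": extracted}
    return {"missing": missing, "conflicts": conflicts}
```

Conflicts are reported rather than overwritten, so a reviewer can decide whether the catalog or the model is wrong.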

#### Product Attribute Extraction

Extract detailed attributes from product images:

In [None]:
## TODO add pydantic class and example
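
Pending the pydantic example, here is a minimal sketch of an attribute schema using stdlib `dataclasses` as a stand-in. The `ProductAttributes` fields, `attributes_from_reply`, and `ATTRIBUTE_PROMPT` are all hypothetical and should be adapted to your catalog.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class ProductAttributes:
    """Hypothetical attribute schema for one product image."""
    category: str
    color: Optional[str] = None
    material: Optional[str] = None
    brand: Optional[str] = None
    visible_text: List[str] = field(default_factory=list)


def attributes_from_reply(reply: dict) -> ProductAttributes:
    """Build ProductAttributes from a parsed JSON reply, ignoring unknown keys."""
    known = {f.name for f in fields(ProductAttributes)}
    kept = {key: value for key, value in reply.items() if key in known}
    if not kept.get("category"):
        raise ValueError("reply is missing the required 'category' field")
    return ProductAttributes(**kept)


# Prompt derived from the schema so the two cannot drift apart
ATTRIBUTE_PROMPT = (
    "Describe the product in this image as a JSON object with the keys: "
    + ", ".join(f.name for f in fields(ProductAttributes))
    + ". Use null for anything you cannot see."
)
```

`ATTRIBUTE_PROMPT` can be passed to `analyze_image`, and the reply, once parsed into a dict (for example with `json.loads`), handed to `attributes_from_reply`. Unlike pydantic, this stand-in does not check value types.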

#### Automated Product Description Generation

Generate SEO-friendly product descriptions:

In [None]:
## TODO add pydantic class and example
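
One possible shape for this step is to build the prompt from attributes that are already known and then check the reply against a word budget. Both `build_description_prompt` and `within_word_limit` are hypothetical helpers, and the prompt wording is only a starting point.

```python
def build_description_prompt(attributes, max_words=80, keywords=()):
    """Compose a prompt asking for a description grounded in known attributes."""
    facts = [f"- {name}: {value}" for name, value in attributes.items() if value]
    lines = [
        f"Write a product description of at most {max_words} words for the product in this image.",
        "Use only what is visible in the image and the facts below; do not invent specifications.",
        "Known facts:",
        *(facts or ["- (none provided)"]),
    ]
    if keywords:
        lines.append("Work these search keywords in naturally: " + ", ".join(keywords))
    return "\n".join(lines)


def within_word_limit(description, max_words=80):
    """Check a generated description against the word budget."""
    return len(description.split()) <= max_words
```

The prompt can be sent with `analyze_image(product_image, prompt)`, and descriptions that fail `within_word_limit` can be regenerated or flagged for review.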

## Conclusion

This notebook demonstrated how to use NVIDIA Nemotron Nano 2 VL on Fireworks AI for:

1. **Document Intelligence**: OCR, chart analysis, and form extraction
2. **Product Catalog Cleansing**: Attribute extraction and description generation

### Key Advantages of Nemotron Nano 2 VL

- **Best-in-class accuracy** for OCR, charts, and document understanding (73.2 avg benchmark score)
- **Up to 10x higher throughput** compared to Llama Nemotron Nano VL (TODO: update with actual benchmarks)
- **128k context length** for processing long documents and videos
- **Hybrid transformer-Mamba architecture** for efficiency
- **Open weights and permissive license** for customization

### Next Steps

1. Add your own document and product images to test the examples
2. Integrate the model into your production pipelines

### Resources

- [VLM on Fireworks AI Documentation](https://fireworks.ai/docs/guides/querying-vision-language-models#querying-vision-language-models)
- [Model on Hugging Face](https://huggingface.co) (TODO: add actual link)
