# What Is Modal?
- Modal is a cloud-native, serverless compute platform designed specifically for AI and data teams. 
- It lets you "bring your own code" and run CPU-, GPU-, and memory-intensive workloads at scale without managing infrastructure.

# Core Features

- Sub-second container starts
  - Rust-based container stack for lightning-fast cold boots.
- Seamless autoscaling
  - Scales from zero to thousands of GPUs (or CPU nodes) to handle unpredictable loads.
- Fast model loading
  - Optimized file system loads gigabytes of model weights in seconds.
- Bring-your-own code & frameworks
  - Deploy custom models, Hugging Face pipelines, or any Python/Rust/C++ code without changing your codebase.
- Flexible environments
  - Use prebuilt Python containers or supply your own Docker images; provision A100/H100 GPUs on demand.
- Integrated data volumes
  - Mount S3, R2, or other object storage as persistent volumes for datasets, model weights, or experiment outputs. 
- Observability & integrations
  - Export logs/traces via OpenTelemetry to Datadog, New Relic, etc., for real-time monitoring.

# Architecture & Workflow


- Define a Modal Function
  - Create an app with `app = modal.App(...)` and decorate Python functions (or methods) with `@app.function` to specify compute resources (CPU/GPU, memory, timeout).
- Local Testing
  - Invoke the same functions locally for rapid iteration.
- Deployment
  - Push to Modal with a single CLI command; the platform packages your code, dependencies, and runtime.
- Execution
  - Modal schedules workloads in its serverless fleet, spins up containers on demand, runs your code, then scales to zero.
- Scaling & Load Balancing
  - Autoscaling across containers abstracts away horizontal scaling complexities.

# Typical Use Cases

- LLM Inference
  - Deploy chatbots or embedding services using custom or open-source language models.
- Fine-tuning / Training
  - Run hyperparameter sweeps on A100/H100 GPUs without queue times; pay per second.
- Batch Data Processing
  - Fan-out parallel jobs for dataset preprocessing, feature extraction, or vector indexing.
- Research & Prototyping
  - Spin up experiments quickly, test new architectures without worrying about infra setup.

# Pricing & Credits

- Pay-as-you-go: billed by the second for CPU, GPU, and memory usage.
- Free Tier & Credits: startups can apply for up to $50 K in free credits; personal trial includes free hours of CPU/GPU.
- Cost Efficiency: no idle-resource charges—containers scale to zero when idle.
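
To make per-second billing concrete, here is a small sketch. The rates are hypothetical placeholders, not Modal's published prices, and `Rates` and `run_cost` are illustrative names rather than part of any SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical per-second prices in USD; substitute the current published rates."""
    gpu_per_s: float = 0.001          # one GPU
    cpu_core_per_s: float = 0.00004   # one CPU core
    mem_gib_per_s: float = 0.000007   # one GiB of memory


def run_cost(seconds: float, gpus: int, cpu_cores: float, mem_gib: float,
             rates: Rates = Rates()) -> float:
    """Cost of one run: each resource is billed for every second the container is up."""
    per_second = (gpus * rates.gpu_per_s
                  + cpu_cores * rates.cpu_core_per_s
                  + mem_gib * rates.mem_gib_per_s)
    return seconds * per_second


# A 90-second job on 1 GPU, 4 CPU cores and 16 GiB of memory.
print(f"${run_cost(90, gpus=1, cpu_cores=4, mem_gib=16):.4f}")
```

Because the cost is linear in seconds, a container that has scaled to zero contributes nothing to the bill.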

# Example Usage

In [1]:
import modal

Set up your Modal tokens. The cell below is the same as running `modal setup` from the command line: it connects to Modal and installs your tokens.

In [2]:
!modal setup

The web browser should have opened for you to authenticate and get an API token.
If it didn't, please copy this URL into your web browser manually:

https://modal.com/token-flow/tf-0Pxo0paz7NCNUmZzmaRv4d

Waiting for token flow to complete...

In [3]:
!modal token new

The web browser should have opened for you to authenticate and get an API token.
If it didn't, please copy this URL into your web browser manually:

https://modal.com/token-flow/tf-titplyGscUrKr9KRKK1jkQ

Waiting for token flow to complete...

Copy the token ID and token secret from the token file that the command writes (typically `.modal.toml` in your home directory) and add them to your `.env` file.

In [4]:
from dotenv import load_dotenv
load_dotenv()

True
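
The Modal client can read credentials from the `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` environment variables, so a quick check after `load_dotenv()` can catch a missing entry early. This is a sketch: `missing_vars` is a hypothetical helper written for this note, and `HF_TOKEN` is included only because a later cell reads it.

```python
import os
from typing import Iterable, Mapping


def missing_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are unset or empty in env."""
    return [name for name in names if not env.get(name)]


required = ["MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET", "HF_TOKEN"]
missing = missing_vars(required)
print("Missing:", ", ".join(missing) if missing else "none")
```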

In [5]:
from hello import app, hello

In [6]:
with app.run():
    reply = hello.local()  # This will run the hello function locally
reply

'Hello from Pune, Maharashtra, IN!!'

In [7]:
with app.run():
    reply = hello.remote()  # This will run the hello function in the Modal cloud
reply

'Hello from Amsterdam, North Holland, NL!!'

Register your Hugging Face token as a secret on `modal.com`.

In [8]:
# First check if you can access the Hugging Face API
import requests
import os

token = os.getenv("HF_TOKEN")
headers = {"Authorization": f"Bearer {token}"}
response = requests.get(
    "https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/resolve/main/config.json",
    headers=headers
)
print(f"Status: {response.status_code}")
print(f"Response: {response.text[:200]}")

Status: 200
Response: {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4


In [10]:
from llama import app, generate

Troubleshooting, in case you get errors:
- Make sure your Hugging Face account has been granted access to the model repository.
- The Hugging Face token should have `Read` permission.
- Restart the notebook kernel.
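
A rough way to turn the status code from the access check into a hint is sketched below. The mapping reflects how gated Hugging Face repositories commonly respond, but treat it as an assumption rather than documented behavior; `diagnose` is a hypothetical helper.

```python
def diagnose(status_code: int) -> str:
    """Map an HTTP status from the config.json request to a likely cause."""
    hints = {
        200: "Access OK.",
        401: "Token missing or invalid: check HF_TOKEN in .env and restart the kernel.",
        403: "Token is valid but access to the gated repository has not been granted.",
        404: "Repository or file not found: check the model id and file path.",
    }
    return hints.get(status_code, f"Unexpected status {status_code}; inspect the response body.")


print(diagnose(403))
```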

In [11]:
with modal.enable_output():
    with app.run():
        result = generate.remote("Life is a mystery, everyone must stand alone, I hear")
result

'<|begin_of_text|>Life is a mystery, everyone must stand alone, I hear you call my name,'

# Ephemeral vs. Deployed (Persistent) Mode

When you build LLM-powered applications on Modal, you can choose between two primary execution modes:
- Ephemeral Mode
- Deployed (Persistent) Mode

## Comparative Summary

| Feature                            | Ephemeral Mode                                                    | Deployed Mode                                                               |
| ---------------------------------- | ----------------------------------------------------------------- | --------------------------------------------------------------------------- |
| **Container Lifetime**             | Short-lived (per invocation)                                      | Long-lived (always on until scaled to zero or undeployed)                   |
| **Startup Latency**                | ~5–15 s cold start per invocation                                 | < 1 s; cold start only when scaling out, later requests use warm containers |
| **State Persistence (In-Process)** | None (stateless)                                                  | Temporary (can cache weights/tokenizer for session)                         |
| **Billing**                        | Pay per second of execution                                       | Pay per second while containers run, including idle time if kept warm       |
| **Autoscaling Granularity**        | “Scale = # concurrent invocations” — automated and instant        | Policy-based: min/max replicas, CPU/GPU utilization rules                   |
| **Typical Use Cases**              | Batch jobs, event-driven tasks, experimentation, low-traffic APIs | Production APIs, chatbots, high QPS inference services                      |
| **Operational Complexity**         | Minimal (just write a function and call it)                       | Moderate (define services, endpoints, scaling, networking)                  |
| **Cold Start Impact on UX**        | Significant if synchronous; may need async workflows              | Minimal (< 1 s); suitable for interactive UIs and low-latency SLAs          |

## Ephemeral Mode

### Definition
- Ephemeral: Each invocation spins up a fresh, short-lived container (a “Modal function”) to run your code, then immediately tears it down when the process finishes.
- Typically used for ad hoc tasks: one-off inference calls, batch jobs, CI/CD, experimentation.


### Lifecycle and Architecture

- Invocation Trigger
  - You call a Modal function (e.g., via Python SDK or CLI).
  - Modal provisions a new container instance, pulls your code and dependencies, and boots it.
- Execution
  - The container runs your Python code (e.g., a call to a Hugging Face transformer or OpenAI API).
  - GPU/CPU allocation is done on demand (you specify machine type when defining the function).
- Teardown
  - Once the function completes (either returns or times out), Modal automatically destroys the container.
  - All state in memory (and ephemeral disk) is lost—nothing is persisted unless you explicitly write to external storage (e.g., S3, Cloud SQL, etc.).

### Key Characteristics

- Statelessness
  - No persistent filesystem or in-memory state.
  - Great for simple “request → response” flows.
- Cold-Start Latency:
  - Typical container boot time: 5–15 seconds (varies by image size, GPU allocation, etc.).
  - Suitable for workloads that can tolerate cold-start latency.
- Auto-Scaling:
  - Modal automatically scales out: multiple simultaneous function invocations spin up multiple containers in parallel.
  - No need to manage VM pools or auto-scaler rules.
- Billing Model:
  - You pay for CPU (or GPU) and memory *only while your container is running.*
  - Cheaper for infrequent workloads but may be less cost-efficient for sustained high QPS (Queries Per Second).
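
The billing trade-off above can be estimated with a little arithmetic. This sketch compares paying only for busy seconds (pessimistically assuming every request also pays for a cold start) against keeping one container up all day. The rate and timings are hypothetical, and the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def on_demand_daily_cost(requests_per_day: int, run_s: float, cold_start_s: float,
                         rate_per_s: float) -> float:
    """Pay only while running; assume every request also pays for a cold start."""
    return requests_per_day * (run_s + cold_start_s) * rate_per_s


def always_on_daily_cost(rate_per_s: float) -> float:
    """One container kept up for the whole day, busy or idle."""
    return SECONDS_PER_DAY * rate_per_s


def break_even_requests(run_s: float, cold_start_s: float) -> float:
    """Requests per day at which both options cost the same (the rate cancels out)."""
    return SECONDS_PER_DAY / (run_s + cold_start_s)


rate = 0.001  # hypothetical USD per second for one GPU container
print(on_demand_daily_cost(100, run_s=2, cold_start_s=10, rate_per_s=rate))
print(always_on_daily_cost(rate))
print(break_even_requests(run_s=2, cold_start_s=10))
```

With these placeholder timings the break-even point is 7,200 requests per day. Real deployments reuse warm containers between nearby requests, so the on-demand figure here is an upper bound.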

### Use Cases for LLM Apps
- Batch Inference Pipelines:
  - Run an LLM over a large dataset (e.g., summarization of thousands of documents overnight).
- Back-end Hooks:
  - Data preprocessing, custom tokenization, embedding generation, or occasional jobs triggered by events (e.g., user uploads a document, spin up a function to analyze sentiment using an LLM).
- Experiments/Proof-of-Concepts:
  - Quickly test a new LLM model or prompt without standing up persistent infrastructure.
- Cost-Sensitive, Low-Traffic Endpoints:
  - If you expect < 100 requests/day, ephemeral avoids paying for idle resources.

### Advantages & Limitations

| Aspect                  | Advantages                                               | Limitations                                                           |
| ----------------------- | -------------------------------------------------------- | --------------------------------------------------------------------- |
| **Simplicity**          | — No cluster management.<br>— Zero ops overhead.         | — Cold-start latency.<br>— Stateless, so must externalize state.      |
| **Scalability**         | — Infinite scale (Modal auto-scales containers).         | — Each new request may incur provisioning delay.                      |
| **Cost Efficiency**     | — Pay only when running.                                 | — Becomes expensive if containers run constantly.                     |
| **State Management**    | — Encourages decoupled, event-driven design.             | — No in-process caching or persistent memory.                         |
| **Resource Allocation** | — Select GPU/CPU per function; Modal handles allocation. | — If you need a warmed GPU across calls, ephemeral may be suboptimal. |

## Deployed (Persistent) Mode

### Definition
- Deployed: You create and “deploy” a long-lived endpoint (an HTTP service) or background worker that continuously runs inside Modal. The container stays alive, listening for incoming requests or processing messages.
- Ideal for production-grade LLM inference endpoints, real-time chatbots, and apps requiring low latency.

### Lifecycle and Architecture

- Deployment Step
  - You define a Modal “App” or “Service” in code (e.g., using the `@modal.web_endpoint` decorator, renamed `@modal.fastapi_endpoint` in newer SDK releases, or by specifying a persistent `modal.App` deployment).
- Steady State
  - The container(s) remain running 24/7, load-balanced by Modal’s infrastructure.
  - You can configure autoscaling rules (min/max replicas, concurrency per container).
- Requests Handling
  - Clients (e.g., a frontend or API consumer) send HTTP calls. Modal’s load balancer routes traffic to a warm container, yielding < 100 ms response for simple functions, or ~1 s for LLM calls on GPU.
- Updates
  - When you push a code change or new container image, Modal rolls out updated instances with zero downtime (blue/green or rolling updates).
- Shutdown (Optional)
  - You can manually “undeploy” or scale to zero if you no longer need the service.

### Key Characteristics

- Statefulness (Intermediate):
  - While containers are long-lived, you still shouldn’t rely on in-process memory for permanent storage.
  - However, you can maintain caches or keep loaded LLM weights in GPU memory across multiple requests for low-latency inference.
- Minimal Cold-Start:
  - Because containers stay warm, you avoid repeated startup delays. Only scale-out (adding new replicas) may incur a slight delay.
- Autoscaling
  - Modal lets you specify concurrency targets or CPU/GPU utilization thresholds. If usage surges, it brings up new replicas automatically.
- Billing Model
  - You pay for the allocated CPU/GPU and memory for as long as the containers are kept running, regardless of request volume.
  - More cost-effective for sustained, predictable traffic.
- Network Endpoint
  - Exposes a stable public (or private) URL.
  - TLS termination, authentication (via API keys or JWT), and custom domains are supported.
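
The concurrency-target autoscaling mentioned in the list above boils down to a round-up-and-clamp calculation. The function below is a generic illustration of that idea, not Modal's actual scheduler; the name `desired_replicas` and its defaults are hypothetical.

```python
import math


def desired_replicas(in_flight: int, target_concurrency: int,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Replicas needed so that each one handles at most target_concurrency requests."""
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    needed = math.ceil(in_flight / target_concurrency)
    # Clamp to the configured bounds; min_replicas=0 allows scale-to-zero.
    return max(min_replicas, min(needed, max_replicas))


for load in (0, 1, 25, 500):
    print(load, "->", desired_replicas(load, target_concurrency=8))
```

Setting `min_replicas` above zero trades idle cost for the absence of cold starts, which is the core difference between the two modes compared in this section.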

### Use Cases for LLM Apps


- Chatbots & Conversational AI
  - Real-time, low-latency interactions.
  - Keep LLM weights loaded on GPU to serve multiple user messages rapidly.
- High-Traffic Inference Services
  - ≥ 100 requests/hour.
  - Constant stream of inference requests (e.g., summarizing customer support tickets as they arrive).
- Microservices in Production
  - An LLM “summarization” microservice consumed by other services (front-end, mobile apps).
  - Integrate with other endpoints—Modal handles the HTTP routing.
- Fine-tuned LLM Hosting
  - You have a custom-fine-tuned model in a private S3 bucket; deploy it once, keep it loaded, and serve calls with minimal latency.

### Advantages & Limitations

| Aspect               | Advantages                                                                                                                                 | Limitations                                                                |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------- |
| **Latency**          | — Warm containers: < 100 ms cold start for the first request; < 10 ms subsequent (excluding LLM inference time).<br>— Ideal for real-time. | — Some cost of idle resources if traffic is intermittent.                  |
| **Cost Efficiency**  | — Economical for ≥ 100 RPS (requests per second) or continuous load.                                                                       | — Wasted hours for idle CPU/GPU if traffic dips below thresholds.          |
| **Scaling Controls** | — Autoscaling policies (min/max replicas, concurrency).<br>— Predictable capacity.                                                         | — More config complexity—must tune autoscaling thresholds to avoid thrash. |
| **State Management** | — Can cache tokenizer artifacts, keep weights in memory.                                                                                   | — Not permanent storage (still need external DB for true persistence).     |
| **DevOps Overhead**  | — Modal handles rolling updates, health checks, and LB.                                                                                    | — Slightly more initial setup (define endpoints, scaling rules).           |

# In Practice

In [12]:
# You can also run "modal deploy -m pricer_service" at the command line in an activated environment
!modal deploy -m pricer_service

Creating objects...
└── Creating mount c:\TFS\Study\machine-learning\pricer_service.py: Uploaded 0/1 files
└── Creating mount c:\TFS\Study\machine-learning\pricer_service.py: Finalizing index of 1 files
Building image im-FSg4WGzWI8V8fp80FPTVhQ

=> Step 0: FROM base

=> Step 1: RUN python -m pip install accelerate bitsandbytes peft torch transformers

In [13]:
pricer = modal.Function.from_name("pricer-service", "price")

In [15]:
pricer.remote("Quadcast HyperX condenser mic, connects via usb-c to your computer for crystal clear audio")

133.0