<a href="https://colab.research.google.com/github/ayagup/stablediffusion/blob/main/SDXL%20Inferencing%20with%20Maxdiffusion%20on%20TPU%20v6e-1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cells
A notebook is a list of cells. Cells contain either explanatory text or executable code and its output. Click a cell to select it.

## Code cells
Below is a **code cell**. Once the toolbar button indicates CONNECTED, click in the cell to select it and execute the contents in the following ways:

* Click the **Play icon** in the left gutter of the cell;
* Type **Cmd/Ctrl+Enter** to run the cell in place;
* Type **Shift+Enter** to run the cell and move focus to the next cell (adding one if none exists); or
* Type **Alt+Enter** to run the cell and insert a new code cell immediately below it.

There are additional options for running some or all cells in the **Runtime** menu.


In [None]:
# a = 10
# a

## Text cells
This is a **text cell**. You can **double-click** to edit this cell. Text cells
use markdown syntax. To learn more, see our [markdown
guide](/notebooks/markdown_guide.ipynb).

You can also add math to text cells using [LaTeX](http://www.latex-project.org/)
to be rendered by [MathJax](https://www.mathjax.org). Just place the statement
within a pair of **\$** signs. For example `$\sqrt{3x-1}+(1+x)^2$` becomes
$\sqrt{3x-1}+(1+x)^2.$


## Adding and moving cells
You can add new cells by using the **+ CODE** and **+ TEXT** buttons that show when you hover between cells. These buttons are also in the toolbar above the notebook where they can be used to add a cell below the currently selected cell.

You can move a cell by selecting it and clicking **Cell Up** or **Cell Down** in the top toolbar.

Consecutive cells can be selected by "lasso selection" by dragging from outside one cell and through the group.  Non-adjacent cells can be selected concurrently by clicking one and then holding down Ctrl while clicking another.  Similarly, using Shift instead of Ctrl will select all intermediate cells.

# Working with Python
Colaboratory is built on top of [Jupyter Notebook](https://jupyter.org/). Below are some examples of convenience functions provided.

Long-running Python processes can be interrupted. Run the following cell and select **Runtime -> Interrupt execution** (*hotkey: Cmd/Ctrl-M I*) to stop execution.

In [None]:
# import time
# print("Sleeping")
# time.sleep(30) # sleep for a while; interrupt me!
# print("Done Sleeping")

## System aliases

Jupyter includes shortcuts for common operations, such as `ls`:

In [None]:
# !ls /bin

That `!ls` probably generated a large output. You can select the cell and clear the output by either:

1. Clicking on the clear output button (x) in the toolbar above the cell; or
2. Right clicking the left gutter of the output area and selecting "Clear output" from the context menu.

Execute any other process using `!`, with string interpolation from Python variables. Note that the result can be assigned to a variable:

In [None]:
# # In https://github.com/ipython/ipython/pull/10545, single quote strings are ignored
# message = 'Colaboratory is great!'
# foo = !unset message && echo -e '{message}\n{message}\n'$message"\n$message"
# foo
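
Outside IPython, the same pattern (interpolate a Python value into a command, then capture the output as a list of lines) can be approximated with the standard library. This is a minimal sketch, not a description of what Colab does internally; `run_lines` is a hypothetical helper name introduced for illustration.

```python
import shlex
import subprocess


def run_lines(command: str) -> list[str]:
    """Run a shell command and return its stdout as a list of lines."""
    completed = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return completed.stdout.splitlines()


message = "Colaboratory is great!"
# shlex.quote keeps the interpolated value a single, safely quoted shell word.
foo = run_lines(f"echo {shlex.quote(message)}")
print(foo)
```

Quoting the interpolated value explicitly avoids the surprises that the commented cell above demonstrates, where the shell and IPython each expand parts of the string.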

## Magics
Colaboratory shares the notion of magics from Jupyter. Magics are shorthand annotations that change how a cell's text is executed. To learn more, see [Jupyter's magics page](http://nbviewer.jupyter.org/github/ipython/ipython/blob/1.x/examples/notebooks/Cell%20Magics.ipynb).


In [None]:
# %%html
# <marquee style='width: 30%; color: blue;'><b>Whee!</b></marquee>

In [None]:
# %%html
# <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 450 400" width="200" height="200">
#   <rect x="80" y="60" width="250" height="250" rx="20" style="fill:red; stroke:black; fill-opacity:0.7" />
#   <rect x="180" y="110" width="250" height="250" rx="40" style="fill:blue; stroke:black; fill-opacity:0.5;" />
# </svg>

## Automatic completions and exploring code

Colab provides automatic completions to explore attributes of Python objects, as well as to quickly view documentation strings. As an example, first run the following cell to import the [`numpy`](http://www.numpy.org) module.

In [None]:
# import numpy as np

If you now insert your cursor after `np` and press **Period** (`.`), you will see the list of available completions within the `np` module. Completions can be opened again by using **Ctrl+Space**.

In [None]:
# np

If you type an open parenthesis after any function or class in the module, you will see a pop-up of its documentation string:

In [None]:
# np.ndarray

The documentation can be opened again using **Ctrl+Shift+Space**, or you can view the documentation for a method by hovering over its name.

When hovering over the method name the `Open in tab` link will open the documentation in a persistent pane. The `View source` link will navigate to the source code for the method.
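
Completions and documentation pop-ups have programmatic counterparts that work in any Python session. The sketch below uses the standard-library `json` module as a stand-in for `numpy`, purely so that it runs without extra dependencies: `dir()` lists the attributes a completion menu would offer, and `inspect.getdoc()` returns the docstring a pop-up would show.

```python
import inspect
import json

# Public attributes of the module, similar to what the completion list shows.
public_names = [name for name in dir(json) if not name.startswith("_")]
print(public_names)

# First line of the docstring for one function in the module.
summary = inspect.getdoc(json.dumps).splitlines()[0]
print(summary)
```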

## Exception Formatting

Exceptions are formatted nicely in Colab outputs:

In [None]:
# x = 1
# y = 4
# z = y/(1-x)
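
Once uncommented, the cell above raises `ZeroDivisionError`, because `1 - x` evaluates to zero. To capture the same traceback as plain text (for logging, for instance) rather than relying on the notebook's rendering, the standard `traceback` module can be used. A minimal sketch:

```python
import traceback

x = 1
y = 4
try:
    z = y / (1 - x)
except ZeroDivisionError:
    # format_exc() returns the traceback of the exception being handled.
    formatted = traceback.format_exc()
    print(formatted)
```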

## Rich, interactive outputs
Until now all of the generated outputs have been text, but they can be more interesting, like the chart below.

In [None]:
# import numpy as np
# from matplotlib import pyplot as plt

# ys = 200 + np.random.randn(100)
# x = [x for x in range(len(ys))]

# plt.plot(x, ys, '-')
# plt.fill_between(x, ys, 195, where=(ys > 195), facecolor='g', alpha=0.6)

# plt.title("Fills and Alpha Example")
# plt.show()

# Integration with Drive

Colaboratory is integrated with Google Drive. It allows you to share, comment, and collaborate on the same document with multiple people:

* The **SHARE** button (top-right of the toolbar) allows you to share the notebook and control permissions set on it.

* **File->Make a Copy** creates a copy of the notebook in Drive.

* **File->Save** saves the file to Drive. **File->Save and checkpoint** pins the version so it doesn't get deleted from the revision history.

* **File->Revision history** shows the notebook's revision history.

## Commenting on a cell
You can comment on a Colaboratory notebook like you would on a Google Document. Comments are attached to cells, and are displayed next to the cell they refer to. If you have **comment-only** permissions, you will see a comment button on the top right of the cell when you hover over it.

If you have edit or comment permissions you can comment on a cell in one of three ways:

1. Select a cell and click the comment button in the toolbar above the top-right corner of the cell.
2. Right click a text cell and select **Add a comment** from the context menu.
3. Use the shortcut **Ctrl+Shift+M** to add a comment to the currently selected cell.

You can resolve and reply to comments, and you can target comments to specific collaborators by typing *+[email address]* (e.g., `+user@domain.com`). Addressed collaborators will be emailed.

The Comment button in the top-right corner of the page shows all comments attached to the notebook.

In [None]:
import os
from google.colab import auth

# Authenticate if using Cloud TPU resources outside the local Colab instance
auth.authenticate_user()

In [None]:
# Install the specialized vLLM build for TPUs
!pip install vllm-tpu

In [None]:
# LLM must be imported before its signature can be inspected
from vllm import LLM
help(LLM.__init__)

In [None]:
# `!export` would only affect a throwaway subshell; `%env` persists for later cells.
%env MAX_MODEL_LEN=2048
# TP is the number of chips
%env TP=1
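
Each `!` line runs in its own subshell, so a variable exported there is gone by the time the next command starts, whereas a variable set on the kernel's own environment is inherited by every later child process. The sketch below demonstrates the difference with `subprocess`. `DEMO_MAX_MODEL_LEN` is a hypothetical variable name chosen so the demonstration does not interfere with real settings.

```python
import os
import subprocess

os.environ.pop("DEMO_MAX_MODEL_LEN", None)  # start from a clean state

# Exporting inside one shell invocation does not outlive that shell.
subprocess.run("export DEMO_MAX_MODEL_LEN=2048", shell=True, check=True)
after_export = subprocess.run(
    "echo ${DEMO_MAX_MODEL_LEN:-unset}", shell=True, capture_output=True, text=True
).stdout.strip()

# Setting os.environ changes the parent process, so children inherit the value.
os.environ["DEMO_MAX_MODEL_LEN"] = "2048"
after_environ = subprocess.run(
    "echo ${DEMO_MAX_MODEL_LEN:-unset}", shell=True, capture_output=True, text=True
).stdout.strip()

print(after_export, after_environ)
```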

In [None]:
# !VLLM_LOGGING_LEVEL=DEBUG VLLM_TRACE_FUNCTION=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
!vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --seed 42 \
    --disable-log-requests \
    --no-enable-prefix-caching \
    --async-scheduling \
    --gpu-memory-utilization 0.98 \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 128 \
    --tensor-parallel-size 1 \
    --max-model-len 4096

In [None]:
from huggingface_hub import login
from google.colab import userdata

# Automatically retrieves the token from your Colab Secrets
hf_token = userdata.get('HF_TOKEN')
login(hf_token)

In [None]:
!curl http://127.0.0.1:8000/ping
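
A single `curl` only succeeds once the server has finished loading the model, which can take several minutes. A small polling helper makes the wait explicit. This is a minimal sketch using only the standard library; `wait_for_server` and its parameters are hypothetical names, and it assumes the server was started in a way that leaves the notebook free to run other cells.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout_s: float = 900.0, interval_s: float = 5.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections (or not healthy) yet
        time.sleep(interval_s)
    return False
```

For the server address used above, the call would be `wait_for_server("http://127.0.0.1:8000/ping")`, which returns `True` as soon as the endpoint responds and `False` if the timeout expires first.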

In [None]:
import os
import sys
import tempfile
from vllm import LLM, SamplingParams

# Set environment variables
os.environ['VLLM_LOGGING_LEVEL'] = 'INFO'

# Workaround for Jupyter/Colab: Create a wrapper for stdout that supports fileno()
class StdoutWrapper:
    """Wrapper for sys.stdout that provides a fileno() method for Jupyter compatibility"""
    def __init__(self, original_stdout):
        self.original_stdout = original_stdout
        self._temp_file = tempfile.TemporaryFile(mode='w+', buffering=1)

    def fileno(self):
        return self._temp_file.fileno()

    def __getattr__(self, name):
        return getattr(self.original_stdout, name)

    def __del__(self):
        try:
            self._temp_file.close()
        except Exception:
            pass

# Patch stdout/stderr before vLLM initialization
print("🔧 Applying Jupyter compatibility patches...")
original_stdout = sys.stdout
original_stderr = sys.stderr

sys.stdout = StdoutWrapper(original_stdout)
sys.stderr = StdoutWrapper(original_stderr)

try:
    print("=" * 80)
    print("Initializing vLLM Engine")
    print("=" * 80)

    print(f"\n📡 Engine Configuration:")
    print(f"  - Model: meta-llama/Llama-3.1-8B-Instruct")
    print(f"  - Max Model Length: 4096 tokens")
    print(f"  - Max Sequences: 128")
    print(f"  - GPU Memory: 98%")
    print("\n⚙️  Initializing engine (this may take 10-15 minutes for model compilation)...\n")

    # Create the LLM engine
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        seed=42,
        gpu_memory_utilization=0.98,
        max_model_len=4096,
        max_num_batched_tokens=4096,
        max_num_seqs=128,
        tensor_parallel_size=1,
    )

    print("\n✅ Engine initialized successfully!")
    print("\n" + "=" * 80)
    print("Engine ready for inference!")
    print("=" * 80)

finally:
    # Restore original stdout/stderr
    sys.stdout = original_stdout
    sys.stderr = original_stderr

# Example usage function
def generate_text(prompt: str, temperature: float = 0.7, max_tokens: int = 2048):
    """Generate text using the vLLM engine"""
    sampling_params = SamplingParams(
        temperature=temperature,
        max_tokens=max_tokens,
    )

    outputs = llm.generate([prompt], sampling_params)
    return outputs[0].outputs[0].text

print("\n📝 Example Usage:")
print("""
# Generate text directly
prompt = "Explain quantum computing in simple terms:"
result = generate_text(prompt, temperature=0.7, max_tokens=512)
print(result)
""")
print("\n✅ Engine is ready! Use the generate_text() function above to generate text.")

In [None]:
prompt = "Explain quantum computing in simple terms. Respond in .md format."
result = generate_text(prompt, temperature=0.7, max_tokens=4096)
print(result)
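
The `StdoutWrapper` defined earlier exists because, as its comment says, the notebook's stdout may not support `fileno()`. The sketch below reproduces that situation with an `io.StringIO` stand-in and packages the patch-and-restore steps as a context manager, so the streams are restored even if initialization fails. `FilenoWrapper` and `patched_stdio` are hypothetical names; this is one possible arrangement, not the approach vLLM itself prescribes.

```python
import contextlib
import io
import sys
import tempfile


class FilenoWrapper:
    """Delegate to a stream but answer fileno() from a real temporary file."""

    def __init__(self, stream):
        self._stream = stream
        self._temp_file = tempfile.TemporaryFile(mode="w+")

    def fileno(self):
        return self._temp_file.fileno()

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself.
        return getattr(self._stream, name)

    def close_backing_file(self):
        self._temp_file.close()


@contextlib.contextmanager
def patched_stdio():
    """Temporarily replace sys.stdout/sys.stderr with fileno-capable wrappers."""
    saved = sys.stdout, sys.stderr
    wrappers = FilenoWrapper(saved[0]), FilenoWrapper(saved[1])
    sys.stdout, sys.stderr = wrappers
    try:
        yield
    finally:
        sys.stdout, sys.stderr = saved
        for wrapper in wrappers:
            wrapper.close_backing_file()


# A stream without a real file descriptor, standing in for a notebook's stdout.
fake_notebook_stdout = io.StringIO()
try:
    fake_notebook_stdout.fileno()
except io.UnsupportedOperation:
    print("plain stream: fileno() unsupported")

wrapped = FilenoWrapper(fake_notebook_stdout)
print("wrapped stream: fileno() ->", type(wrapped.fileno()).__name__)
wrapped.write("still delegates writes\n")
wrapped.close_backing_file()
```

With this helper, engine construction could be placed inside `with patched_stdio():` in place of the explicit `try`/`finally` block.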

In [None]:
import os
import sys
import tempfile
from vllm import LLM, SamplingParams

# Set environment variables
os.environ['VLLM_LOGGING_LEVEL'] = 'INFO'

# Workaround for Jupyter/Colab: Create a wrapper for stdout that supports fileno()
class StdoutWrapper:
    """Wrapper for sys.stdout that provides a fileno() method for Jupyter compatibility"""
    def __init__(self, original_stdout):
        self.original_stdout = original_stdout
        self._temp_file = tempfile.TemporaryFile(mode='w+', buffering=1)

    def fileno(self):
        return self._temp_file.fileno()

    def __getattr__(self, name):
        return getattr(self.original_stdout, name)

    def __del__(self):
        try:
            self._temp_file.close()
        except Exception:
            pass

# Patch stdout/stderr before vLLM initialization
print("🔧 Applying Jupyter compatibility patches...")
original_stdout = sys.stdout
original_stderr = sys.stderr

sys.stdout = StdoutWrapper(original_stdout)
sys.stderr = StdoutWrapper(original_stderr)

try:
    print("=" * 80)
    print("Initializing vLLM Engine")
    print("=" * 80)

    print(f"\n📡 Engine Configuration:")
    print(f"  - Model: Qwen/Qwen3-32B")
    print(f"  - Max Model Length: 4096 tokens")
    print(f"  - Max Sequences: 256")
    print(f"  - GPU Memory: 98%")
    print("\n⚙️  Initializing engine (this may take 10-15 minutes for model compilation)...\n")

    # Create the LLM engine
    llm = LLM(
        model="Qwen/Qwen3-32B",
        seed=42,
        gpu_memory_utilization=0.98,
        max_model_len=4096,
        max_num_batched_tokens=2048,
        max_num_seqs=256,
        tensor_parallel_size=1,
    )

    print("\n✅ Engine initialized successfully!")
    print("\n" + "=" * 80)
    print("Engine ready for inference!")
    print("=" * 80)

finally:
    # Restore original stdout/stderr
    sys.stdout = original_stdout
    sys.stderr = original_stderr

# Example usage function
def generate_text(prompt: str, temperature: float = 0.7, max_tokens: int = 2048):
    """Generate text using the vLLM engine"""
    sampling_params = SamplingParams(
        temperature=temperature,
        max_tokens=max_tokens,
    )

    outputs = llm.generate([prompt], sampling_params)
    return outputs[0].outputs[0].text

print("\n📝 Example Usage:")
print("""
# Generate text directly
prompt = "Explain quantum computing in simple terms:"
result = generate_text(prompt, temperature=0.7, max_tokens=512)
print(result)
""")
print("\n✅ Engine is ready! Use the generate_text() function above to generate text.")

In [None]:
# !VLLM_LOGGING_LEVEL=DEBUG VLLM_TRACE_FUNCTION=1 vllm serve Qwen/Qwen3-14B \
!vllm serve Qwen/Qwen3-14B \
    --seed 42 \
    --disable-log-requests \
    --no-enable-prefix-caching \
    --async-scheduling \
    --gpu-memory-utilization 0.98 \
    --max-num-batched-tokens 2048 \
    --max-num-seqs 256 \
    --tensor-parallel-size 1 \
    --max-model-len 8192

In [None]:
# !VLLM_LOGGING_LEVEL=DEBUG VLLM_TRACE_FUNCTION=1 vllm serve meta-llama/CodeLlama-7b-Instruct-hf \
!vllm serve meta-llama/CodeLlama-7b-Instruct-hf \
    --seed 42 \
    --disable-log-requests \
    --no-enable-prefix-caching \
    --async-scheduling \
    --gpu-memory-utilization 0.98 \
    --max-num-batched-tokens 2048 \
    --max-num-seqs 256 \
    --tensor-parallel-size 1 \
    --max-model-len 4096

In [None]:
# !VLLM_LOGGING_LEVEL=DEBUG VLLM_TRACE_FUNCTION=1 vllm serve microsoft/phi-4 \
!vllm serve microsoft/phi-4 \
    --seed 42 \
    --disable-log-requests \
    --no-enable-prefix-caching \
    --async-scheduling \
    --gpu-memory-utilization 0.98 \
    --max-num-batched-tokens 2048 \
    --max-num-seqs 256 \
    --tensor-parallel-size 1 \
    --max-model-len 4096

In [None]:
!git clone https://github.com/AI-Hypercomputer/maxdiffusion.git

In [None]:
!cd maxdiffusion && bash setup.sh MODE=stable DEVICE=tpu

In [None]:
# !cd maxdiffusion && pip install -e .
!cd maxdiffusion && pip install -r requirements.txt

In [None]:
!cd maxdiffusion && export PYTHONPATH=$PYTHONPATH:$(pwd)/src && python -m maxdiffusion.generate_video \
  src/maxdiffusion/configs/svd_xt_v6e.yml \
  image_path="./input_image.png" \
  output_path="./output_video.mp4"

In [None]:
!export PYTHONPATH=$PYTHONPATH:/content/maxdiffusion/src && python3 -m maxdiffusion.generate_video /content/maxdiffusion/src/maxdiffusion/configs/svd_xt_v6e.yml

In [None]:
# Running in Colab with TPU v6e-1
import os
import sys

# 1. Add the 'src' directory to the system path so Python can find 'maxdiffusion'
# This is the Python-native equivalent of setting PYTHONPATH
sys.path.append('/content/maxdiffusion/src')

# 2. Set the environment variables required for TPU v6e performance
os.environ["LIBTPU_INIT_ARGS"] = "--xla_tpu_rwb_fusion=false --xla_tpu_dot_dot_fusion_duplicated=true"
os.environ["PYTHONPATH"] = "/content/maxdiffusion/src"

# 3. Run the generation script by its full file path
# %run /content/maxdiffusion/src/maxdiffusion/generate_sdxl.py /content/maxdiffusion/src/maxdiffusion/configs/svd_xt_v6e.yml
%run /content/maxdiffusion/src/maxdiffusion/generate_sdxl.py /content/maxdiffusion/src/maxdiffusion/configs/base_xl.yml
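
Re-running the setup cell appends the same directory to `sys.path` each time and overwrites any existing `PYTHONPATH`. A helper that checks before adding keeps repeated runs harmless. This is a minimal sketch: `prepare_maxdiffusion_env` is a hypothetical name, and appending to `PYTHONPATH` rather than replacing it is a design choice of this sketch, not something the original cell does.

```python
import os
import sys


def prepare_maxdiffusion_env(src_dir: str, libtpu_args: str) -> None:
    """Make `src_dir` importable and export TPU flags, without duplicate entries."""
    if src_dir not in sys.path:
        sys.path.append(src_dir)

    # Preserve whatever PYTHONPATH already holds and add src_dir only once.
    existing = os.environ.get("PYTHONPATH", "")
    parts = [part for part in existing.split(os.pathsep) if part]
    if src_dir not in parts:
        parts.append(src_dir)
    os.environ["PYTHONPATH"] = os.pathsep.join(parts)

    os.environ["LIBTPU_INIT_ARGS"] = libtpu_args


prepare_maxdiffusion_env(
    "/content/maxdiffusion/src",
    "--xla_tpu_rwb_fusion=false --xla_tpu_dot_dot_fusion_duplicated=true",
)
```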

In [None]:
# Running in Colab with TPU v6e-1
import os
import sys

# 1. Add the 'src' directory to the system path so Python can find 'maxdiffusion'
# This is the Python-native equivalent of setting PYTHONPATH
sys.path.append('/content/maxdiffusion/src')

# 2. Set the environment variables required for TPU v6e performance
os.environ["LIBTPU_INIT_ARGS"] = "--xla_tpu_rwb_fusion=false --xla_tpu_dot_dot_fusion_duplicated=true"
os.environ["PYTHONPATH"] = "/content/maxdiffusion/src"

# 3. Run the generation script by its full file path
# %run /content/maxdiffusion/src/maxdiffusion/generate_sdxl.py /content/maxdiffusion/src/maxdiffusion/configs/svd_xt_v6e.yml
%run /content/maxdiffusion/src/maxdiffusion/generate_sdxl.py /content/maxdiffusion/src/maxdiffusion/configs/ltx_video.yml
