# RemBG: Naive Video Background Removal Pipeline

## What this notebook demonstrates

This notebook removes the background from every frame of a video using **rembg**, a library built on top of neural segmentation models.

### How background removal works

Traditional background removal (like green screens) relies on a known, uniform colour.  
AI-based removal works differently: a neural network looks at the whole image and predicts which pixels belong to the *foreground subject* (a person, object, etc.) and which belong to the *background*. This is called **salient object detection**.

The default model used here is **U2Net** (Qin et al., 2020). It was trained on large datasets of images with annotated foregrounds and learns to produce a soft *alpha mask*, a grayscale image where white = keep and black = remove. The mask is then applied to the original frame to produce a transparent PNG.

### Why "naive"?

Each frame is processed **independently**: the model has no memory of the previous frame. This means that even tiny differences in lighting or subject position cause the mask to shift slightly between frames, producing visible **flickering or jitter** in the output video.

This is an intentional limitation of this demonstrator. Fixing it requires **temporal consistency** techniques (e.g. optical flow, cross-frame attention) explored in other notebooks.
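
One way to put a number on the flicker described above is to compare the masks of consecutive frames. The sketch below is an illustration only and is not part of the original pipeline: it assumes the masks are already loaded as NumPy arrays with values from 0 to 255, and the helper name `mask_iou` and the threshold of 128 are hypothetical choices.

```python
import numpy as np

def mask_iou(mask_a, mask_b, threshold=128):
    """Intersection-over-union of two alpha masks after binarising them."""
    a = np.asarray(mask_a) >= threshold
    b = np.asarray(mask_b) >= threshold
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty: treat them as identical
    return float(np.logical_and(a, b).sum() / union)

# Two toy 4x4 masks: the subject's edge shifts by one pixel column.
frame_1 = np.zeros((4, 4), dtype=np.uint8)
frame_1[:, :2] = 255
frame_2 = np.zeros((4, 4), dtype=np.uint8)
frame_2[:, :3] = 255

print(mask_iou(frame_1, frame_2))
```

A value of 1.0 means the two masks are identical. On a mostly static subject, dips in this score between neighbouring frames correspond to the jitter you see in the output.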

**Steps in this notebook:**
1. Install dependencies
2. Create working folders
3. Upload a video
4. Extract frames with FFmpeg
5. Apply RemBG to each frame individually
6. Reassemble frames into a video
7. Observe the flickering artefact

## Step 1: Install dependencies

| Package | Role |
|---|---|
| `rembg` | Background removal library. Wraps the U2Net model and exposes a simple `remove()` function. |
| `ffmpeg-python` | Python bindings for FFmpeg. Used to extract frames from a video and reassemble them afterwards. |
| `onnxruntime` | Runtime engine for ONNX models. U2Net is distributed as an `.onnx` file, which `rembg` downloads automatically on first use. ONNX (Open Neural Network Exchange) is a portable model format that runs on CPU or GPU without requiring a specific training framework like PyTorch. |

> **GPU note:** By default `onnxruntime` runs on CPU. Installing `onnxruntime-gpu` instead enables CUDA acceleration and speeds up processing significantly on longer videos.
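
To check which backend is actually available, `onnxruntime` exposes `get_available_providers()`. The sketch below wraps it in a small helper; `pick_providers` is a hypothetical name, and the fallback to CPU when the package is missing is only there so the snippet runs before installation.

```python
def pick_providers(available):
    """Order execution providers so CUDA is tried first when present."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]

try:
    import onnxruntime as ort
    available = ort.get_available_providers()
except ImportError:
    # onnxruntime is not installed yet: assume CPU only for this illustration
    available = ["CPUExecutionProvider"]

print(pick_providers(available))
```

If `CUDAExecutionProvider` does not appear in the printed list, inference will run on CPU.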

In [None]:
# 📦 Install dependencies
%pip install rembg ffmpeg-python onnxruntime
# apt-get is a shell command, so it needs the "!" prefix rather than "%"
!apt-get install -y ffmpeg


## Step 2: Create working folders

We work with individual image files rather than the video directly, so we need two staging folders:

- `frames/`: raw frames extracted from the input video
- `output_frames/`: frames after background removal

Both are temporary; they are not the final output.

In [None]:
# 📁 Create folders
import os
for folder in ["frames", "output_frames"]:
    os.makedirs(folder, exist_ok=True)

## Step 3: Upload your video

This cell uses the Colab file uploader. Upload an MP4 or MOV clip; **shorter clips (5–15 seconds) are recommended** to keep processing time reasonable on CPU.

**Tips for best results:**
- The subject should be clearly separated from the background (e.g. a person against a wall).
- Avoid very busy or textured backgrounds, since U2Net can struggle with complex scenes.
- Consistent lighting across the clip reduces mask instability.

In [None]:
# ⬆️ Upload your video
from google.colab import files
uploaded = files.upload()

Saving 20250409_1305_Gentle Dance Breeze_simple_compose_01jrd3jwyefq7v0vbypcqm82w9.mp4 to 20250409_1305_Gentle Dance Breeze_simple_compose_01jrd3jwyefq7v0vbypcqm82w9.mp4


## Step 4: Extract frames from the video

A video is just a sequence of still images (frames) displayed in rapid succession (typically 24–60 frames per second). To process it with a per-image model we must:

1. **Decode** the video into individual frames: FFmpeg reads the compressed video stream and outputs one PNG file per frame.
2. **Process** each frame.
3. **Re-encode** the processed frames back into a video.

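`qscale=2` passes FFmpeg's `-qscale` option, a quality scale for lossy encoders such as JPEG where lower values mean higher quality (range 1–31). PNG is lossless, so the option should make no difference to the frames extracted here; it only matters if you switch the output to `.jpg`. The lossless PNG frames already preserve the fine detail in hair and edges that the model needs to segment accurately.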

> The frame filenames (`frame_00001.png`, `frame_00002.png`, …) encode the order. This ordering is essential when reassembling the video later.
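
The zero padding produced by `%05d` is what makes a plain string sort match the frame order. A small illustration with made-up frame numbers:

```python
# Zero-padded names sort correctly as plain strings; unpadded names do not.
padded = [f"frame_{i:05d}.png" for i in (1, 2, 10, 100)]
unpadded = [f"frame_{i}.png" for i in (1, 2, 10, 100)]

print(sorted(padded))    # stays in numeric order
print(sorted(unpadded))  # frame_10 and frame_100 jump ahead of frame_2
```

This is why Step 5 can rely on `sorted(os.listdir(...))` to visit frames in sequence.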

In [None]:
# 🎞️ Extract frames from the video
import ffmpeg

input_video = list(uploaded.keys())[0]
(
    ffmpeg
    .input(input_video)
    .output('frames/frame_%05d.png', qscale=2)
    .run()
)

(None, None)

## Step 5: Apply background removal frame by frame

`rembg.remove()` takes raw image bytes and returns a PNG with an **alpha channel** added. The alpha channel is a fourth pixel channel (RGBA) that controls transparency:
- **255 (white)** = fully opaque: this pixel belongs to the foreground subject.
- **0 (black)** = fully transparent: this pixel is background and should be discarded.
- Values in between produce soft, semi-transparent edges (important for hair and fur).
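
The effect of these alpha values can be illustrated with the standard "over" compositing formula, `out = a * foreground + (1 - a) * background`, where `a` is the alpha value divided by 255. The sketch below applies it to a toy row of three pixels; the function name `composite_over` and the colours are hypothetical.

```python
import numpy as np

def composite_over(rgba, background_rgb):
    """Blend an RGBA image over a solid background colour."""
    rgba = np.asarray(rgba, dtype=np.float32)
    alpha = rgba[..., 3:4] / 255.0          # scale alpha to the range 0..1
    foreground = rgba[..., :3]
    background = np.asarray(background_rgb, dtype=np.float32)
    blended = alpha * foreground + (1.0 - alpha) * background
    return np.rint(blended).astype(np.uint8)

# One row of three red pixels: opaque, half-transparent, fully transparent.
row = np.array([[[255, 0, 0, 255], [255, 0, 0, 128], [255, 0, 0, 0]]], dtype=np.uint8)
result = composite_over(row, background_rgb=(0, 0, 255))
print(result[0])
```

The opaque pixel stays red, the transparent one takes the blue background, and the middle pixel is an even mix of the two.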

Internally, `remove()`:
1. Resizes the image to the model's input resolution.
2. Runs a forward pass through U2Net, producing a probability map (the "saliency map").
3. Post-processes the map into a binary-ish mask with soft edges.
4. Applies the mask to the original image as an alpha channel.
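
Steps 3 and 4 can be sketched with NumPy. This is a simplified illustration under stated assumptions, not rembg's actual implementation: it assumes the saliency map has already been resized back to the frame's resolution, uses min-max normalisation as the post-processing step, and the name `saliency_to_rgba` is hypothetical.

```python
import numpy as np

def saliency_to_rgba(frame_rgb, saliency):
    """Attach a saliency map to an RGB frame as its alpha channel."""
    saliency = np.asarray(saliency, dtype=np.float32)
    lo, hi = saliency.min(), saliency.max()
    # Min-max normalise to 0..1; a completely flat map becomes fully transparent.
    if hi > lo:
        norm = (saliency - lo) / (hi - lo)
    else:
        norm = np.zeros_like(saliency)
    alpha = np.rint(norm * 255).astype(np.uint8)
    # Stack the mask behind the three colour channels to get RGBA.
    return np.dstack([np.asarray(frame_rgb, dtype=np.uint8), alpha])

frame = np.full((2, 2, 3), 200, dtype=np.uint8)   # toy 2x2 grey frame
saliency = np.array([[0.0, 1.0], [0.2, 0.0]])     # toy model output
rgba = saliency_to_rgba(frame, saliency)
print(rgba.shape)
print(rgba[..., 3])
```

The colour channels are left untouched; only the fourth channel carries the model's prediction.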

On first run the model weights (`u2net.onnx`, ~176 MB) are downloaded and cached in `~/.u2net/`. Subsequent runs use the cache.

> **Performance note:** Each frame is an independent model call. For a 30 fps, 10-second clip that is 300 inference passes. This is why GPU acceleration matters for real-time or near-real-time use.
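
The arithmetic in the note can be turned into a rough time estimate. The per-frame cost below is a placeholder, not a measurement; time a few frames on your own machine and substitute the result. Both function names are hypothetical.

```python
def inference_passes(fps, duration_s):
    """Number of per-frame model calls for a clip."""
    return round(fps * duration_s)

def estimated_minutes(fps, duration_s, seconds_per_frame):
    """Rough wall-clock estimate; seconds_per_frame must be measured locally."""
    return inference_passes(fps, duration_s) * seconds_per_frame / 60

print(inference_passes(30, 10))        # 300 passes, as in the note above
print(estimated_minutes(30, 10, 1.0))  # assuming a hypothetical 1 s per frame
```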

In [None]:
# ✂️ Apply RemBG frame-by-frame
from rembg import remove
import os

input_dir = "frames"
output_dir = "output_frames"

for filename in sorted(os.listdir(input_dir)):
    if filename.endswith(".png"):
        with open(os.path.join(input_dir, filename), "rb") as inp:
            input_data = inp.read()
            output_data = remove(input_data)
        with open(os.path.join(output_dir, filename), "wb") as out:
            out.write(output_data)

Downloading data from 'https://github.com/danielgatis/rembg/releases/download/v0.0.0/u2net.onnx' to file '/root/.u2net/u2net.onnx'.


In [None]:
# 🔍 Inspect a single frame, before and after background removal
# This cell visualises the alpha mask alongside the original and processed frame.
import os
import matplotlib.pyplot as plt
from PIL import Image

sample_file = sorted(os.listdir("frames"))[0]

original = Image.open(f"frames/{sample_file}").convert("RGB")
processed = Image.open(f"output_frames/{sample_file}").convert("RGBA")

# The alpha channel is the mask predicted by U2Net
alpha_mask = processed.split()[-1]  # Extract alpha channel as greyscale image

# Composite the processed frame over a white background so transparency is visible
white_bg = Image.new("RGB", processed.size, (255, 255, 255))
white_bg.paste(processed, mask=alpha_mask)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
axes[0].imshow(original);       axes[0].set_title("Original frame")
axes[1].imshow(alpha_mask, cmap="gray"); axes[1].set_title("Alpha mask (U2Net output)")
axes[2].imshow(white_bg);       axes[2].set_title("Result (composited on white)")
for ax in axes:
    ax.axis("off")
plt.suptitle(f"Frame: {sample_file}", y=1.02)
plt.tight_layout()
plt.show()

## Step 6: Reassemble the video

FFmpeg reads the numbered PNGs in order and encodes them into an MP4:

- **`vcodec=libx264`**: H.264, the most widely compatible video codec. It plays in browsers, on phones, and in virtually every video player.
- **`pix_fmt=yuv420p`**: the pixel format that most H.264 players expect. It does **not** support an alpha channel, so the transparency from rembg is lost and the removed background appears **black** in the output. To preserve transparency you would use `libvpx-vp9` with `yuva420p` and output a `.webm` file.
- **`framerate=30`**: the rate at which FFmpeg reads the PNG sequence. Step 4 extracts every frame of the source, so this value must match the source video's frame rate; otherwise the output plays at the wrong speed.
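
Rather than hard-coding 30, the source frame rate can be read from the input file. ffprobe reports it as a fraction string such as `30000/1001`; the sketch below shows one way to convert that string, with the helper name `parse_frame_rate` being hypothetical. The commented lines assume `ffmpeg.probe` from `ffmpeg-python` and the `r_frame_rate` field of the video stream.

```python
from fractions import Fraction

def parse_frame_rate(rate):
    """Convert an ffprobe-style rate string such as '30000/1001' to a float."""
    return float(Fraction(rate))

print(parse_frame_rate("30/1"))
print(round(parse_frame_rate("30000/1001"), 3))

# In the notebook, the rate string could be read like this (not run here):
#   info = ffmpeg.probe(input_video)
#   video = next(s for s in info["streams"] if s["codec_type"] == "video")
#   fps = parse_frame_rate(video["r_frame_rate"])
```

The resulting value can then be passed as `framerate=fps` when reading the processed frames.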

> **Watch for flickering.** Play the output video and notice how the mask boundary shifts between frames. This is the core artefact of frame-independent processing. Compare it to your input and ask yourself: what information from adjacent frames could the model use to stabilise the mask?

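In [None]:
# 🧱 Rebuild video from processed frames
from IPython.display import Video

output_name = "output.mp4"
(
    ffmpeg
    .input('output_frames/frame_%05d.png', framerate=30)
    .output(output_name, vcodec='libx264', pix_fmt='yuv420p')
    .run(overwrite_output=True)  # overwrite instead of prompting when the cell is re-run
)
# embed=True inlines the file so it plays inside the notebook
Video(output_name, embed=True)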

## Reflection: What could be improved?

| Limitation | Possible solution |
|---|---|
| Flickering (no temporal consistency) | Propagate the previous frame's mask via optical flow to constrain the current prediction |
| Slow per-frame inference | Batch frames through the model; use GPU via `onnxruntime-gpu` |
| Black background in output | Export as WebM with alpha channel, or composite over a custom background |
| U2Net struggles with complex edges | Try `birefnet-general` (a newer, higher-quality model available in rembg) |
| Hard mask edges | `rembg` supports alpha matting post-processing (`alpha_matting=True`) for softer transitions |
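
As a much simpler stand-in for the optical-flow approach in the first row, the masks can be blended over time with an exponential moving average. This is a different and cruder technique: it damps flicker but leaves trails behind fast-moving subjects. The sketch assumes the masks are NumPy arrays with values from 0 to 255; `smooth_masks` and `weight` are hypothetical names.

```python
import numpy as np

def smooth_masks(masks, weight=0.6):
    """Exponential moving average over a sequence of alpha masks.

    weight is the share given to the current frame; lower values smooth
    more but lag further behind motion.
    """
    smoothed = []
    running = None
    for mask in masks:
        current = np.asarray(mask, dtype=np.float32)
        if running is None:
            running = current  # the first frame has no history to blend with
        else:
            running = weight * current + (1.0 - weight) * running
        smoothed.append(np.rint(running).astype(np.uint8))
    return smoothed

# A single pixel that flickers between opaque and transparent.
flicker = [np.array([[v]], dtype=np.uint8) for v in (255, 0, 255, 0)]
values = [int(m[0, 0]) for m in smooth_masks(flicker, weight=0.5)]
print(values)
```

The swings between frames shrink from the full 0 to 255 range to a narrower band, which is the trade-off this baseline makes against responsiveness.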

These are the problems that production tools like DaVinci Resolve, Adobe After Effects, and cloud APIs solve, each with their own trade-offs between speed, quality, and cost.