This project builds on Meta FAIR's Segment Anything Model 2 (SAM 2) and adds ModFlows color style transfer and a Streamlit interactive front end: you can use SAM2 to mark a subject region and apply the color style of one or more style images only within that region; frame-by-frame video processing is also supported.
- Paper and resources (original SAM 2)

SAM 2 is a foundation model for promptable segmentation in both images and videos; it uses a streaming memory architecture to support real-time video processing and provides separate inference APIs for images and videos.

On top of the official SAM 2, this project adds:
- ModFlows color style blending
  - Path: `src/sam2_style_pipeline.py`
  - What it does: uses the mask produced by SAM2 to apply style transfer only to the foreground, background, or whole image; supports weighted blending of multiple style images.
- Streamlit interactive UI
  - Path: `streamlit_app.py`
  - What it does: upload a content image/video and multiple style images, click the subject on a canvas as a SAM2 prompt, adjust blending parameters, and export results (image and video).
- Example scripts
  - Path: `test.py`: minimal runnable example (single image).
  - Path: `run_inference.py`: batch-process the images in a folder.
Requirements (recommended)
- Python 3.10+
- PyTorch (install according to your GPU/CUDA version)
- A matching torchvision version
- See `requirements.txt` for the remaining packages
Installation

```shell
# It is recommended to create a clean environment with Conda / venv
pip install -r requirements.txt

# Install a matching PyTorch build (CUDA 12.1 shown; adjust to your environment)
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
```

Note: Windows users who hit CUDA compilation issues should follow the official recommendation and install/run the project under WSL (Ubuntu).
✨ Automatic download: this project ships with a complete auto-download feature. On first run, the system checks for and downloads any missing SAM2 model files, so no manual setup is needed.
- Automatically creates the required directory structure (e.g. `sam2/checkpoints/`)
- Automatically detects missing model files
- Automatically downloads them from the official servers (with a progress bar)
- Continues execution automatically once downloads finish
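The detection step above can be sketched in a few lines. This is an illustrative sketch, not this project's actual code: `find_missing_checkpoints`, `ensure_checkpoints`, and the checkpoint URL base are assumptions (the base URL shown is the one the official SAM 2.1 download script uses, but verify it against `download_ckpts.sh`):

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed base URL of the official SAM 2.1 checkpoints; check download_ckpts.sh.
BASE_URL = "https://dl.fbaipublicfiles.com/segment_anything_2/092824/"

def find_missing_checkpoints(ckpt_dir, names):
    """Return the checkpoint file names not yet present in ckpt_dir."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)  # create sam2/checkpoints/ if needed
    return [n for n in names if not (ckpt_dir / n).exists()]

def ensure_checkpoints(ckpt_dir, names):
    """Download any missing checkpoints before the pipeline starts."""
    for name in find_missing_checkpoints(ckpt_dir, names):
        print(f"downloading {name} ...")
        urlretrieve(BASE_URL + name, str(Path(ckpt_dir) / name))
```

Calling `ensure_checkpoints("sam2/checkpoints", ["sam2.1_hiera_tiny.pt"])` at startup is then a no-op once the files are in place.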
By default the project uses `sam2/checkpoints/sam2.1_hiera_tiny.pt` with the matching config `sam2/configs/sam2.1/sam2.1_hiera_t.yaml`. To switch models, edit `get_style_sam2_settings()` in `src/sam2_style_pipeline.py`.
Supported models (auto-downloaded):
- SAM 2.1: `sam2.1_hiera_tiny.pt`, `sam2.1_hiera_small.pt`, `sam2.1_hiera_base_plus.pt`, `sam2.1_hiera_large.pt`
- SAM 2.0: `sam2_hiera_tiny.pt`, `sam2_hiera_small.pt`, `sam2_hiera_base_plus.pt`, `sam2_hiera_large.pt`
Manual download (optional): you can use the official one-shot script:

```shell
cd checkpoints && ./download_ckpts.sh && cd ..
```

or download the SAM 2.1 checkpoints individually.
Note: the ModFlows encoder model (`checkpoints/2024.04.28 14-08-55_merged_8195_encoder_epoch_700000.pt`) is not auto-downloaded; the system checks for the file at use time and prompts you if it is missing. Because the file is large, it is hosted on GitHub via Git LFS and is fetched automatically when you clone the repository.
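The use-time check described above amounts to a simple guard. A minimal sketch (the function name `require_encoder` is hypothetical; the real check lives inside the pipeline code):

```python
from pathlib import Path

# Hypothetical guard around the ModFlows encoder weights path.
ENCODER_CKPT = Path("checkpoints/2024.04.28 14-08-55_merged_8195_encoder_epoch_700000.pt")

def require_encoder(path=ENCODER_CKPT):
    """Raise a clear error if the ModFlows encoder weights are missing."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"ModFlows encoder not found at {path}. "
            "Pull it via Git LFS (git lfs pull) or download it manually."
        )
    return path
```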
Launch Streamlit:

```shell
streamlit run streamlit_app.py
```

Steps:
- Upload a content image (or mp4 video) and style images: at least one of the three style slots (shared/foreground/background) must be provided, and each slot accepts multiple images for blending. The "shared style" slot is not mandatory; foreground-only or background-only styles also work.
- Check "interactively click the subject" and click the target foreground points on the canvas.
- Adjust the parameters:
  - Apply scope: foreground / background / whole image / foreground + background
  - strength (style strength), enc_steps (encoder steps), blend_alpha (mask blending ratio)
  - Multi-style weights (shared/foreground/background can each be weighted independently)
- Press "Start stylization" to produce:
  - the original image
  - the segmentation mask
  - the stylized result (image or video)
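Conceptually, the per-slot weights and `blend_alpha` combine as below. This is a simplified numpy sketch of the idea, not this project's exact implementation:

```python
import numpy as np

def normalize_weights(weights):
    """Scale style weights so they sum to 1 (e.g. [2, 1, 1] -> [0.5, 0.25, 0.25])."""
    w = np.asarray(weights, dtype=np.float64)
    return w / w.sum()

def blend_with_mask(content, stylized, mask, blend_alpha=1.0):
    """Mix the stylized image back into the content image inside the mask.

    content, stylized: float arrays of shape (H, W, 3) in [0, 1]
    mask: float array of shape (H, W) in [0, 1], e.g. from SAM2
    blend_alpha: how strongly stylized pixels replace the originals
    """
    m = (mask * blend_alpha)[..., None]  # broadcast over the color channels
    return m * stylized + (1.0 - m) * content
```

With `blend_alpha=1.0` the mask fully selects the stylized pixels; lower values fade the effect even inside the mask.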
If the canvas background does not show, update to the latest code (a compatibility patch is built in), or force-refresh your browser and try again.
Starting with this version, the video pipeline computes and displays FPS statistics on the fly (mean/median, min/max), along with the resolution, number of processed frames, and total runtime. These numbers appear below the video result in Streamlit.
- Processing speed (FPS) varies noticeably with resolution. The table below provides reference columns for different resolutions; fill them in with measurements from your own hardware:

| Resolution | Mean FPS | Median FPS | Notes |
|---|---|---|---|
| 480p | | | |
| 720p | | | SAM 2 Tiny weights recommended |
| 1080p | | | |
Although both ModFlows and SAM 2 are deep learning models, using the Tiny SAM 2 weights keeps inference speed reasonable for 720p video. Actual numbers depend on your GPU/CPU, PyTorch/CUDA versions, and parameters such as enc_steps/strength.
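The reported statistics can be reproduced from per-frame wall-clock times. A minimal sketch (the helper name `fps_stats` is an illustration, not this repo's API):

```python
import statistics

def fps_stats(frame_times):
    """Summarize per-frame processing times (in seconds) as FPS statistics."""
    fps = [1.0 / t for t in frame_times if t > 0]
    return {
        "mean_fps": statistics.fmean(fps),
        "median_fps": statistics.median(fps),
        "min_fps": min(fps),
        "max_fps": max(fps),
        "frames": len(fps),
        "total_seconds": sum(frame_times),
    }
```

In the processing loop you would record a `time.perf_counter()` delta around each frame and pass the list of deltas to this helper.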
Single image (minimal example):
```python
from src.sam2_style_pipeline import stylize_image_with_sam2, get_style_sam2_settings

settings = get_style_sam2_settings()
blended, orig_im, mask_im = stylize_image_with_sam2(
    content_path="data/content/1.jpg",
    style_paths=["data/style/1.jpg"],      # shared styles (multiple allowed, optional)
    fg_style_paths=["data/style/2.jpg"],   # foreground-only styles (optional)
    bg_style_paths=["data/style/3.jpg"],   # background-only styles (optional)
    point_coords=[[320, 240]],             # SAM2 prompt: foreground point coordinates
    point_labels=[1],                      # 1 = foreground
    output_path="data/output/stylized_1.jpg",
    settings=settings,
    style_weights=[1.0],                   # shared style weights (optional)
    fg_style_weights=[1.0],                # foreground weights (optional)
    bg_style_weights=[1.0],                # background weights (optional)
    apply_scope="both",                    # foreground / background / all / both
)
```

Frame-by-frame video:
```python
from src.sam2_style_pipeline import stylize_video_with_sam2, get_style_sam2_settings

settings = get_style_sam2_settings()
stylize_video_with_sam2(
    video_path="data/content/298103.mp4",
    style_paths=["data/style/1.jpg"],
    point_coords=[[320, 240]],
    point_labels=[1],
    output_path="data/output/stylized_video.mp4",
    settings=settings,
    apply_scope="foreground",
)
```

Project layout:
- `src/sam2_style_pipeline.py`
  - `stylize_image_with_sam2`: image pipeline (multi-style compositing; foreground/background independently or full-image).
  - `stylize_video_with_sam2`: frame-by-frame video processing (reuses the same prompts and foreground/background style split).
  - `get_style_sam2_settings`: central place to adjust the SAM2/encoder weights, device, blending parameters, etc.
- `streamlit_app.py`: interactive front end (upload, clicking, parameter tuning, result download).
- `src/inference.py`, `src/encoder.py`: ModFlows implementation.
- `sam2/`: official SAM 2 code and configs.
- `checkpoints/`: model weights (SAM2 and ModFlows).
- `data/content`, `data/style`, `data/output`: input/output folders.
- Hydra cannot find the config (`MissingConfigException`): make sure `sam_config` uses a package-relative path (already set to `configs/sam2.1/sam2.1_hiera_t.yaml` in `get_style_sam2_settings()`), and that the `sam2/` directory exists in this project.
- Importing `iopath` fails: install it with `pip install iopath==0.1.10`. This project also adds a fallback in `hieradet.py` so a missing iopath does not crash the app outright; installing iopath is still recommended for full functionality.
- `image_to_url` missing due to Streamlit version differences: a compatibility layer has been added to the front end; if the canvas background still does not show, upgrade/restart and try again.
- PyTorch cannot install a matching `+cuXXX` build: use the official index-url for your CUDA version, or install the CPU build directly; the requirements file does not pin a `+cu` suffix.
Image:

```python
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2.1_hiera_tiny.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_t.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

with torch.inference_mode():
    predictor.set_image(<your_image_numpy>)
    masks, _, _ = predictor.predict(<input_prompts>)
```

Video:
```python
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "./checkpoints/sam2.1_hiera_tiny.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_t.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode():
    state = predictor.init_state(<your_video_path>)
    # add prompts and get the output for that frame immediately
    frame_idx, object_ids, masks = predictor.add_new_points_or_box(state, <your_prompts>)
    # propagate through the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...
```

- SAM 2 model weights, demo (front end and back end), and training code: Apache-2.0 (see the upstream LICENSE)
- The remaining code in this project keeps its original license; if you use ModFlows, follow its license terms for use and citation.
If this project helps your research or product, please cite the original SAM 2 and ModFlows work.

SAM 2: see the BibTeX on the official paper page.

ModFlows (color style transfer):
@inproceedings{larchenko2024color,
title={Color Style Transfer with Modulated Flows},
author={Larchenko, Maria and Lobashev, Alexander and Guskov, Dmitry and Palyulin, Vladimir Vladimirovich},
booktitle={ICML 2024 Workshop on Structured Probabilistic Inference \& Generative Modeling}
}
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer
[Paper] [Project] [Demo] [Dataset] [Blog] [BibTeX]
Segment Anything Model 2 (SAM 2) is a foundation model towards solving promptable visual segmentation in images and videos. We extend SAM to video by considering images as a video with a single frame. The model design is a simple transformer architecture with streaming memory for real-time video processing. We build a model-in-the-loop data engine, which improves model and data via user interaction, to collect our SA-V dataset, the largest video segmentation dataset to date. SAM 2 trained on our data provides strong performance across a wide range of tasks and visual domains.
12/11/2024 -- full model compilation for a major VOS speedup and a new SAM2VideoPredictor to better handle multi-object tracking
- We now support `torch.compile` of the entire SAM 2 model on videos, which can be turned on by setting `vos_optimized=True` in `build_sam2_video_predictor`, leading to a major speedup for VOS inference.
- We update the implementation of `SAM2VideoPredictor` to support independent per-object inference, allowing us to relax the assumption of prompting for multi-object tracking and adding new objects after tracking starts.
- See `RELEASE_NOTES.md` for full details.
09/30/2024 -- SAM 2.1 Developer Suite (new checkpoints, training code, web demo) is released
- A new suite of improved model checkpoints (denoted as SAM 2.1) are released. See Model Description for details.
- To use the new SAM 2.1 checkpoints, you need the latest model code from this repo. If you have installed an earlier version of this repo, please first uninstall the previous version via `pip uninstall SAM-2`, pull the latest code from this repo (with `git pull`), and then reinstall the repo following Installation below.
- The training (and fine-tuning) code has been released. See `training/README.md` on how to get started.
- The frontend + backend code for the SAM 2 web demo has been released. See `demo/README.md` for details.
SAM 2 needs to be installed first before use. The code requires python>=3.10, as well as torch>=2.5.1 and torchvision>=0.20.1. Please follow the instructions here to install both PyTorch and TorchVision dependencies. You can install SAM 2 on a GPU machine using:
```shell
git clone https://github.com/facebookresearch/sam2.git && cd sam2
pip install -e .
```

If you are installing on Windows, it's strongly recommended to use Windows Subsystem for Linux (WSL) with Ubuntu.

To use the SAM 2 predictor and run the example notebooks, `jupyter` and `matplotlib` are required and can be installed by:

```shell
pip install -e ".[notebooks]"
```

Note:
- It's recommended to create a new Python environment via Anaconda for this installation and install PyTorch 2.5.1 (or higher) via `pip` following https://pytorch.org/. If you have a PyTorch version lower than 2.5.1 in your current environment, the installation command above will try to upgrade it to the latest PyTorch version using `pip`.
- The step above requires compiling a custom CUDA kernel with the `nvcc` compiler. If it isn't already available on your machine, please install the CUDA toolkits with a version that matches your PyTorch CUDA version.
- If you see a message like `Failed to build the SAM 2 CUDA extension` during installation, you can ignore it and still use SAM 2 (some post-processing functionality may be limited, but it doesn't affect the results in most cases).
Please see INSTALL.md for FAQs on potential issues and solutions.
First, we need to download a model checkpoint. All the model checkpoints can be downloaded by running:
```shell
cd checkpoints && \
./download_ckpts.sh && \
cd ..
```

or individually from:

(note that these are the improved checkpoints denoted as SAM 2.1; see Model Description for details.)
Then SAM 2 can be used in a few lines as follows for image and video prediction.
SAM 2 has all the capabilities of SAM on static images, and we provide image prediction APIs that closely resemble SAM for image use cases. The SAM2ImagePredictor class has an easy interface for image prompting.
```python
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(<your_image>)
    masks, _, _ = predictor.predict(<input_prompts>)
```

Please refer to the examples in image_predictor_example.ipynb (also in Colab here) for static image use cases.
SAM 2 also supports automatic mask generation on images just like SAM. Please see automatic_mask_generator_example.ipynb (also in Colab here) for automatic mask generation in images.
For promptable segmentation and tracking in videos, we provide a video predictor with APIs for example to add prompts and propagate masklets throughout a video. SAM 2 supports video inference on multiple objects and uses an inference state to keep track of the interactions in each video.
```python
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)
    # add new prompts and instantly get the output on the same frame
    frame_idx, object_ids, masks = predictor.add_new_points_or_box(state, <your_prompts>)
    # propagate the prompts to get masklets throughout the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...
```

Please refer to the examples in video_predictor_example.ipynb (also in Colab here) for details on how to add click or box prompts, make refinements, and track multiple objects in videos.
Alternatively, models can also be loaded from Hugging Face (requires pip install huggingface_hub).
For image prediction:
```python
import torch
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(<your_image>)
    masks, _, _ = predictor.predict(<input_prompts>)
```

For video prediction:
```python
import torch
from sam2.sam2_video_predictor import SAM2VideoPredictor

predictor = SAM2VideoPredictor.from_pretrained("facebook/sam2-hiera-large")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)
    # add new prompts and instantly get the output on the same frame
    frame_idx, object_ids, masks = predictor.add_new_points_or_box(state, <your_prompts>)
    # propagate the prompts to get masklets throughout the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...
```

The table below shows the improved SAM 2.1 checkpoints released on September 29, 2024.
| Model | Size (M) | Speed (FPS) | SA-V test (J&F) | MOSE val (J&F) | LVOS v2 (J&F) |
|---|---|---|---|---|---|
| sam2.1_hiera_tiny (config, checkpoint) | 38.9 | 91.2 | 76.5 | 71.8 | 77.3 |
| sam2.1_hiera_small (config, checkpoint) | 46 | 84.8 | 76.6 | 73.5 | 78.3 |
| sam2.1_hiera_base_plus (config, checkpoint) | 80.8 | 64.1 | 78.2 | 73.7 | 78.2 |
| sam2.1_hiera_large (config, checkpoint) | 224.4 | 39.5 | 79.5 | 74.6 | 80.6 |
The previous SAM 2 checkpoints released on July 29, 2024 can be found as follows:
| Model | Size (M) | Speed (FPS) | SA-V test (J&F) | MOSE val (J&F) | LVOS v2 (J&F) |
|---|---|---|---|---|---|
| sam2_hiera_tiny (config, checkpoint) | 38.9 | 91.5 | 75.0 | 70.9 | 75.3 |
| sam2_hiera_small (config, checkpoint) | 46 | 85.6 | 74.9 | 71.5 | 76.4 |
| sam2_hiera_base_plus (config, checkpoint) | 80.8 | 64.8 | 74.7 | 72.8 | 75.8 |
| sam2_hiera_large (config, checkpoint) | 224.4 | 39.7 | 76.0 | 74.6 | 79.8 |
Speed measured on an A100 with torch 2.5.1, cuda 12.4. See benchmark.py for an example on benchmarking (compiling all the model components). Compiling only the image encoder can be more flexible and also provide (a smaller) speed-up (set compile_image_encoder: True in the config).
See sav_dataset/README.md for details.
You can train or fine-tune SAM 2 on custom datasets of images, videos, or both. Please check the training README on how to get started.
We have released the frontend + backend code for the SAM 2 web demo (a locally deployable version similar to https://sam2.metademolab.com/demo). Please see the web demo README for details.
The SAM 2 model checkpoints, SAM 2 demo code (front-end and back-end), and SAM 2 training code are licensed under Apache 2.0, however the Inter Font and Noto Color Emoji used in the SAM 2 demo code are made available under the SIL Open Font License, version 1.1.
See contributing and the code of conduct.
The SAM 2 project was made possible with the help of many contributors (alphabetical):
Karen Bergan, Daniel Bolya, Alex Bosenberg, Kai Brown, Vispi Cassod, Christopher Chedeau, Ida Cheng, Luc Dahlin, Shoubhik Debnath, Rene Martinez Doehner, Grant Gardner, Sahir Gomez, Rishi Godugu, Baishan Guo, Caleb Ho, Andrew Huang, Somya Jain, Bob Kamma, Amanda Kallet, Jake Kinney, Alexander Kirillov, Shiva Koduvayur, Devansh Kukreja, Robert Kuo, Aohan Lin, Parth Malani, Jitendra Malik, Mallika Malhotra, Miguel Martin, Alexander Miller, Sasha Mitts, William Ngan, George Orlin, Joelle Pineau, Kate Saenko, Rodrick Shepard, Azita Shokrpour, David Soofian, Jonathan Torres, Jenny Truong, Sagar Vaze, Meng Wang, Claudette Ward, Pengchuan Zhang.
Third-party code: we use a GPU-based connected component algorithm adapted from cc_torch (with its license in LICENSE_cctorch) as an optional post-processing step for the mask predictions.
If you use SAM 2 or the SA-V dataset in your research, please use the following BibTeX entry.
@article{ravi2024sam2,
title={SAM 2: Segment Anything in Images and Videos},
author={Ravi, Nikhila and Gabeur, Valentin and Hu, Yuan-Ting and Hu, Ronghang and Ryali, Chaitanya and Ma, Tengyu and Khedr, Haitham and R{\"a}dle, Roman and Rolland, Chloe and Gustafson, Laura and Mintun, Eric and Pan, Junting and Alwala, Kalyan Vasudev and Carion, Nicolas and Wu, Chao-Yuan and Girshick, Ross and Doll{\'a}r, Piotr and Feichtenhofer, Christoph},
journal={arXiv preprint arXiv:2408.00714},
url={https://arxiv.org/abs/2408.00714},
year={2024}
}
