
how can I use GPU in write_videofile #2011

Open
TANGnlp0711 opened this issue Jul 13, 2023 · 9 comments
Labels
question Questions regarding functionality, usage

Comments

@TANGnlp0711

@tburrows13 @mgaitan

I rewrote ffmpeg_writer.py to add `-hwaccel nvdec` at line 97:
```python
cmd = [
    FFMPEG_BINARY,
    "-hwaccel", "nvdec",
    "-y",
    "-loglevel",
    "error" if logfile == sp.PIPE else "info",
    "-f",
    "rawvideo",
    "-vcodec",
    "rawvideo",
    "-s",
    "%dx%d" % (size[0], size[1]),
    "-pix_fmt",
    pix_fmt,
    "-r",
    "%.02f" % fps,
    "-an",
    "-i",
    "-",
]
if audiofile is not None:
    cmd.extend(["-i", audiofile, "-acodec", "copy"])
cmd.extend(["-vcodec", codec, "-preset", preset])
if ffmpeg_params is not None:
    cmd.extend(ffmpeg_params)
if bitrate is not None:
    cmd.extend(["-b", bitrate])
```

Then I call write_videofile like this:

```python
video_clips.write_videofile(
    file_name,
    temp_audiofile=file_name.replace(VIDEO_EXT_NAME, ".mp3"),
    fps=24,
    codec="h264_nvenc",
)
```
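For context, in ffmpeg's CLI `-hwaccel` is an *input* option: it only accelerates decoding, so it cannot speed up encoding of the rawvideo frames MoviePy pipes in. The encoder is whatever `-vcodec` names after the input. A minimal sanity check on the flag placement (illustrative sketch, not MoviePy's actual code; `"ffmpeg"` stands in for `FFMPEG_BINARY`):

```python
# Illustrative sketch: -hwaccel is an input option and only accelerates
# decoding. MoviePy pipes raw frames in, so there is nothing to
# hardware-decode; encoding is chosen by the -vcodec after "-i".
cmd = ["ffmpeg", "-hwaccel", "nvdec", "-y", "-f", "rawvideo", "-i", "-"]
cmd += ["-vcodec", "h264_nvenc", "out.mp4"]

# Input options must precede "-i"; output (encoder) options follow it.
assert cmd.index("-hwaccel") < cmd.index("-i") < cmd.index("-vcodec")
print("flag order OK")
```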

The GPU memory is being occupied, but the GPU utilization is almost negligible. As a result, the time taken to write the video does not show any significant improvement.

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 37C P0 34W / 70W | 216MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3427015 C /usr/local/bin/ffmpeg 211MiB |
+-----------------------------------------------------------------------------+
```

@TANGnlp0711 TANGnlp0711 added the question Questions regarding functionality, usage label Jul 13, 2023
@sixyang

sixyang commented Sep 11, 2023

Hello, the bottleneck is not write_videofile itself; it is the for-loop and the iter_frames function.

@antsmallant

> Hello, the bottleneck is not write_videofile itself; it is the for-loop and the iter_frames function.

Yeah, look into ffmpeg_writer.py: in the write_frame function, img_array.tobytes() takes about 90% of the total time of write_videofile; this is the bottleneck.
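A quick way to see this locally (illustrative micro-benchmark, numbers vary by machine): time `ndarray.tobytes()`, which copies the whole frame, against a zero-copy `memoryview` over the same buffer.

```python
import time
import numpy as np

# One 1080p RGB frame, the shape write_frame deals with (~6 MB).
frame = np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8)

t0 = time.perf_counter()
for _ in range(100):
    data = frame.tobytes()       # copies the full buffer on every call
copy_time = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(100):
    view = memoryview(frame)     # zero-copy view over the same buffer
view_time = time.perf_counter() - t0

print(f"tobytes: {copy_time:.3f}s  memoryview: {view_time:.6f}s  (100 frames)")
```

If the frame array is C-contiguous, `proc.stdin.write()` accepts a memoryview directly, so the copy can in principle be skipped; whether that is safe in MoviePy depends on where the array comes from.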

@sixyang

sixyang commented Feb 5, 2024

> Hello, the bottleneck is not write_videofile itself; it is the for-loop and the iter_frames function.

> Yeah, look into ffmpeg_writer.py: in the write_frame function, img_array.tobytes() takes about 90% of the total time of write_videofile; this is the bottleneck.

You can use torch to accelerate it: in moviepy/video/tools/drawing.py, replace blit with blit_gpu, as follows:

```python
import numpy as np
import torch


def blit_gpu(im1, im2, pos=None, mask=None, ismask=False):
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    if pos is None:
        pos = [0, 0]

    xp, yp = pos
    x1 = max(0, -xp)
    y1 = max(0, -yp)
    h1, w1 = im1.shape[:2]
    h2, w2 = im2.shape[:2]
    xp2 = min(w2, xp + w1)
    yp2 = min(h2, yp + h1)
    x2 = min(w1, w2 - xp)
    y2 = min(h1, h2 - yp)
    xp1 = max(0, xp)
    yp1 = max(0, yp)

    if (xp1 >= xp2) or (yp1 >= yp2):
        return im2

    if not isinstance(im1, torch.Tensor):               # 5.43 ms per loop / 100 loops
        im1 = torch.tensor(im1, device=device)
    if not isinstance(im2, torch.Tensor):
        im2 = torch.tensor(im2, device=device)

    blitted = im1[y1:y2, x1:x2]

    new_im2 = im2.clone()

    if mask is None:
        new_im2[yp1:yp2, xp1:xp2] = blitted
    else:
        if not isinstance(mask, torch.Tensor):          # 2.71 ms per loop / 10 loops
            mask = torch.tensor(mask[y1:y2, x1:x2], device=device)  # 1.45 ms / 100 loops
        else:
            mask = mask[y1:y2, x1:x2]
        if len(im1.shape) == 3:
            mask = mask.unsqueeze(-1).repeat(1, 1, 3)
        blit_region = new_im2[yp1:yp2, xp1:xp2]
        new_im2[yp1:yp2, xp1:xp2] = mask * blitted + (1 - mask) * blit_region

    # return new_im2.cpu().numpy().astype("uint8") if not ismask else new_im2.cpu().numpy()   # 6.13 ms / 100 loops
    return new_im2 if not ismask else new_im2
```

Then modify moviepy/video/VideoClip.py line 565 to `return blit_gpu(img, picture, pos, mask=mask, ismask=self.ismask)`. This helps a lot, provided you have a GPU.
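For readers without a GPU (or without torch installed), here is a plain-NumPy reference with the same geometry, useful for checking blit_gpu's output on small arrays. `blit_numpy` is a hypothetical name for this sketch, not part of MoviePy:

```python
import numpy as np

def blit_numpy(im1, im2, pos=None, mask=None):
    # Reference implementation of the masked paste above, for verifying
    # blit_gpu on small arrays. pos is the (x, y) offset of im1 inside im2.
    if pos is None:
        pos = [0, 0]
    xp, yp = pos
    x1, y1 = max(0, -xp), max(0, -yp)
    h1, w1 = im1.shape[:2]
    h2, w2 = im2.shape[:2]
    xp1, yp1 = max(0, xp), max(0, yp)
    xp2, yp2 = min(w2, xp + w1), min(h2, yp + h1)
    x2, y2 = min(w1, w2 - xp), min(h1, h2 - yp)
    if xp1 >= xp2 or yp1 >= yp2:
        return im2
    blitted = im1[y1:y2, x1:x2]
    out = im2.copy()
    if mask is None:
        out[yp1:yp2, xp1:xp2] = blitted
    else:
        m = mask[y1:y2, x1:x2]
        if im1.ndim == 3:
            m = m[:, :, None]          # broadcast mask over RGB channels
        region = out[yp1:yp2, xp1:xp2]
        out[yp1:yp2, xp1:xp2] = m * blitted + (1 - m) * region
    return out
```

Running both on the same random inputs and comparing with `np.allclose` is a cheap way to confirm the torch version is a drop-in replacement.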

@keikoro
Collaborator

keikoro commented Feb 10, 2024

Please always include your specs, as requested in our issue templates – MoviePy version, platform used, etc.

Code samples and logs should be code-formatted for better readability.

@JasonChoate

> This works a lot, provided that you have a GPU

This is giving me the following error with my RTX 3070:

```
File "C:\Python311\Lib\site-packages\moviepy\Clip.py", line 474, in iter_frames
    frame = frame.astype(dtype)
            ^^^^^^^^^^^^
AttributeError: 'Tensor' object has no attribute 'astype'. Did you mean: 'dtype'?
```

I really do appreciate the thought being put into this though, being able to utilize a GPU to help mitigate this bottleneck would be massive.

@zhangdanq

> This is giving me the following error with my RTX 3070:
>
> ```
> File "C:\Python311\Lib\site-packages\moviepy\Clip.py", line 474, in iter_frames
>     frame = frame.astype(dtype)
>             ^^^^^^^^^^^^
> AttributeError: 'Tensor' object has no attribute 'astype'. Did you mean: 'dtype'?
> ```

You can use:

```python
return new_im2.cpu().numpy().astype("uint8") if not ismask else new_im2.cpu().numpy()  # 6.13 ms / 100 loops
```

@icynare

icynare commented Apr 18, 2024

It works for me! 3 times faster.

> ```
> File "C:\Python311\Lib\site-packages\moviepy\Clip.py", line 474, in iter_frames
>     frame = frame.astype(dtype)
>             ^^^^^^^^^^^^
> AttributeError: 'Tensor' object has no attribute 'astype'. Did you mean: 'dtype'?
> ```

@JasonChoate As for this error, just modify the iter_frames function in Clip.py as follows:

```python
if (dtype is not None) and (frame.dtype != dtype):
    # frame = frame.astype(dtype)
    frame = frame.cpu().numpy().astype(dtype)
```
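A slightly more defensive variant of that fix (sketch; `as_numpy` is a hypothetical helper, not MoviePy API) duck-types on the `.cpu` attribute, so iter_frames keeps working whether blit returns a NumPy array or a torch tensor, without importing torch unconditionally:

```python
import numpy as np

def as_numpy(frame, dtype=None):
    # torch tensors expose .cpu()/.numpy(); NumPy arrays do not, so detect
    # the tensor case by attribute instead of requiring torch to be installed.
    if hasattr(frame, "cpu"):
        frame = frame.cpu().numpy()
    if dtype is not None and frame.dtype != dtype:
        frame = frame.astype(dtype)
    return frame
```

In iter_frames, `frame = frame.astype(dtype)` would then become `frame = as_numpy(frame, dtype)`.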

@maxin9966

> You can use torch to accelerate it: in moviepy/video/tools/drawing.py, replace blit with blit_gpu [...] This works a lot, provided that you have a GPU

@sixyang Is this fully utilizing the NVENC of 40-series GPUs?

@notmmao

notmmao commented May 11, 2024

> Hello, the bottleneck is not write_videofile itself; it is the for-loop and the iter_frames function.

In my case, I used VizTracer for measurements and found that iter_frames averages 500 ms per frame, while write_frame averages 2 ms per frame.

```python
from viztracer import VizTracer

with VizTracer(ignore_frozen=True, ignore_c_function=True) as _:
    final_clip.write_videofile(
        f"{fn}.mp4",
        # threads=16,        # ffmpeg is not the bottleneck
        codec="h264_nvenc",  # 2 ms per frame, not the bottleneck
        write_logfile=f"{fn}.log",
    )
```
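Without VizTracer, the same split can be measured with a few lines of stdlib timing (sketch; `get_frame` and `write` stand in for the clip's frame generator and the ffmpeg pipe write):

```python
import time

def profile_frames(get_frame, n_frames, write=lambda data: None):
    # Accumulates time spent generating frames vs. writing them, mirroring
    # the iter_frames / write_frame split measured above.
    gen_time = write_time = 0.0
    for i in range(n_frames):
        t0 = time.perf_counter()
        frame = get_frame(i)
        gen_time += time.perf_counter() - t0
        t0 = time.perf_counter()
        write(frame)
        write_time += time.perf_counter() - t0
    return gen_time / n_frames, write_time / n_frames
```

For example, `profile_frames(lambda i: clip.get_frame(i / fps), n)` against a `write` that calls `proc.stdin.write` shows which side dominates.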

(VizTracer screenshots showing the write_frame and iter_frames timings)
