
Make .detect_video more memory efficient #139

Closed
ejolly opened this issue Sep 27, 2022 · 5 comments

@ejolly
Contributor

ejolly commented Sep 27, 2022

@ljchang after chatting with @TiankangXie it looks like we can fairly easily roll our own read_video function, because torchvision also provides a lower-level API with its VideoReader class.

Just like in their examples, we can write a function that wraps the next(reader) calls and returns a generator, so that we load at most batch_size frames into memory on each loop iteration. That way even long videos shouldn't be a problem on low RAM/VRAM machines, and more memory simply allows for bigger batch sizes.
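Something like this minimal sketch (the helper name is hypothetical, and it assumes torchvision was built with video_reader support):

import torch
from torchvision.io import VideoReader

def read_video_batches(path, batch_size):
    """Yield tensors of shape (<= batch_size, C, H, W), one batch at a time."""
    reader = VideoReader(path, "video")
    batch = []
    for frame in reader:  # each item is a dict with "data" and "pts" keys
        batch.append(frame["data"])
        if len(batch) == batch_size:
            yield torch.stack(batch)
            batch = []
    if batch:  # flush any leftover frames at the end of the video
        yield torch.stack(batch)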

The downside of trying to get this to work right now is that torch needs to be compiled with support for it and requires a working ffmpeg install:

*** RuntimeError: Not compiled with video_reader support, to enable video_reader support, please install ffmpeg (version 4.2 is currently supported) and build torchvision from source.
Traceback (most recent call last):
  File "/Users/Esh/anaconda3/envs/py-feat/lib/python3.8/site-packages/torchvision/io/__init__.py", line 130, in __init__
    raise RuntimeError(

So it seems like the real cost of rolling our own solution with VideoReader, until torch offers a more memory-efficient read_video(), is an added dependency on ffmpeg and potentially more installation hassle. Alternatively, we could try a different library or solution for loading video frames. From a brief search on GitHub it looks like there are lots of custom third-party solutions, because this isn't quite a "solved" problem. But most libraries "cheat" a bit IMO, e.g. by expecting that you've pre-saved each frame as a separate image file on disk.

@ejolly ejolly created this issue from a note in Refactor Detection Module (Tasks) Sep 27, 2022
@ejolly ejolly changed the title More memory efficient detect_video Make .detect_video more memory efficient Sep 27, 2022
@maltelueken

Hi,

first of all, really great work! I was very happy to see the v0.5 release.

I ran into this issue when using your VideoDataset implementation. This is how I solved it for now: https://github.com/mexca/mexca/blob/main/mexca/video.py#L36

The advantage of this solution is that it depends neither on pytorch's VideoReader, which requires building torchvision from source and currently only seems to work on Linux, nor on torchvision.datasets.video_utils.VideoClips, which does not seem to work well with batching. A disadvantage is that the entire video first needs to be decoded to read the timestamps, which can take a couple of minutes for longer videos (this is also necessary with VideoClips, btw).
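For illustration, the timestamp pass amounts to something like this (a simplified sketch using PyAV, which may differ from the linked implementation):

import av

def scan_timestamps(path):
    # Decode every frame once just to collect presentation timestamps;
    # this is the slow, one-time pass mentioned above for long videos.
    with av.open(path) as container:
        stream = container.streams.video[0]
        return [float(frame.pts * stream.time_base)
                for frame in container.decode(stream)]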

I hope this may be helpful.

@ejolly
Contributor Author

ejolly commented Jun 22, 2023

Thanks @maltelueken, this is super helpful! We're currently trying an approach in #170 to "lazy" load video frames using pyav. Would love your thoughts/any testing you might be able to do with this approach!

@juaninachon

@ejolly @ljchang Hi, I would like to understand why the time it takes to process each iteration increases ~linearly with video length. With batch_size=1, it starts at 1.00s/it. By frame 150 it has already doubled, and by frame 300 it reaches 3.10s/it. I think this makes running detect_video on large files impractical. Prediction-wise, should it make a difference if I split the video into 60s or 10s chunks? Thanks in advance for any recommendations.

Kind regards.
Juan

My system: i7-7700, 50 GB RAM, RTX 3060 (12 GB VRAM), Ubuntu 22.04, feat version '0.6.1'

My detector:
Detector(face_model="retinaface",
         landmark_model="mobilefacenet",
         au_model="xgb",
         emotion_model="resmasknet",
         facepose_model="img2pose",
         identity_model="facenet",
         device="cuda")

.detect_video(video_path=mp4_file,
              skip_frames=None,
              output_size=700,
              batch_size=1,
              num_workers=0,
              pin_memory=False,
              face_detection_threshold=0.5,
              face_identity_threshold=0.8)

My file (ffmpeg -i -f):
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'C3A02A_000.mp4':
Metadata:
major_brand : isom
minor_version : 512
compatible_brands: isomiso2avc1mp41
encoder : Lavf58.76.100
Duration: 00:01:00.08, start: 0.000000, bitrate: 30200 kb/s
Stream #0:0(eng): Video: h264 (High) (avc1 / 0x31637661), yuvj420p(pc, bt709), 1920x1080 [SAR 1:1 DAR 16:9], 30009 kb/s, 59.94 fps, 59.94 tbr, 60k tbn, 119.88 tbc (default)
Metadata:
handler_name : GoPro AVC
vendor_id : [0][0][0][0]
timecode : 14:38:07:29
Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 189 kb/s (default)
Metadata:
handler_name : GoPro AAC
vendor_id : [0][0][0][0]
Stream #0:2(eng): Data: none (tmcd / 0x64636D74)
Metadata:
handler_name : GoPro AVC
timecode : 14:38:07:29
Stream mapping:
Stream #0:0 -> #0:0 (h264 (native) -> wrapped_avframe (native))
Stream #0:1 -> #0:1 (aac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
Metadata:
major_brand : isom
minor_version : 512
compatible_brands: isomiso2avc1mp41
encoder : Lavf58.76.100
Stream #0:0(eng): Video: wrapped_avframe, yuvj420p(pc, bt709, progressive), 1920x1080 [SAR 1:1 DAR 16:9], q=2-31, 200 kb/s, 59.94 fps, 59.94 tbn (default)
Metadata:
handler_name : GoPro AVC
vendor_id : [0][0][0][0]
timecode : 14:38:07:29
encoder : Lavc58.134.100 wrapped_avframe
Stream #0:1(eng): Audio: pcm_s16le, 48000 Hz, stereo, s16, 1536 kb/s (default)
Metadata:
handler_name : GoPro AAC
vendor_id : [0][0][0][0]
encoder : Lavc58.134.100 pcm_s16le
frame= 3600 fps=429 q=-0.0 Lsize=N/A time=00:01:00.06 bitrate=N/A speed=7.16x

@ejolly
Contributor Author

ejolly commented Jan 2, 2024

Hey @juaninachon, this was a conscious design decision on our part until PyTorch natively handles videos in a more memory-efficient way without additional installation overhead. By default, PyTorch tries to read every frame of a video file into RAM at once before passing batches to the GPU. A lot of our users complained that pyfeat would crash silently because they were running out of memory when processing videos. Our current solution "seeks" to each frame in the video and discards previous frames before passing batches to the GPU. What you're noticing is the linearly increasing "seek" time, which we decided was a better trade-off than running out of memory until we or torch have a more efficient native solution.
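Conceptually the per-frame loading works like this (an illustrative sketch in pyav, not our exact code):

import av

def load_frame(path, idx):
    # Illustrative only: decode from the start and discard frames until we
    # reach the one we want. Reaching frame idx costs O(idx) decode work,
    # which is why iteration time grows with position in the video.
    with av.open(path) as container:
        stream = container.streams.video[0]
        for i, frame in enumerate(container.decode(stream)):
            if i == idx:
                return frame.to_ndarray(format="rgb24")
    raise IndexError(idx)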

Prediction-wise, there should be no difference between cutting the video into segments and processing them independently, so go for it if that helps speed things up!
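For example (a hypothetical sketch, reusing the detector from your snippet above and assuming the chunks were pre-split, e.g. with ffmpeg's segment muxer):

import glob
import pandas as pd

# Run detection on each chunk independently and stitch the results together.
results = [detector.detect_video(video_path=f, batch_size=1)
           for f in sorted(glob.glob("chunks/*.mp4"))]
combined = pd.concat(results, ignore_index=True)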

@juaninachon

juaninachon commented Jan 4, 2024 via email
