
video stream size prediction: measure input video stream size #91

Open · alexheretic opened this issue Dec 5, 2022 · 7 comments

@alexheretic
Owner

Currently we have a couple of methods that use the measured sample encode percent to predict the final encoded video stream size of the whole input:

  • input_file_size * encode_percent — the original method. It can work fairly well when the video stream is the vast majority of the file size, but falls down otherwise, e.g. when there are multiple large audio tracks.
  • encode_size * input_duration / sample_duration — this can be better where the previous method would over-estimate. However, sample sizes tend to be a worse approximation of average size than they are VMAF indicators, so this is also not especially accurate.

Since the first method's main problem is over-estimating, we take the minimum of these two predictions, as sketched below.
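
For illustration, a minimal sketch of that combination (hypothetical names, not ab-av1's actual code):

```rust
/// Combine the two predictions by taking the minimum, since the
/// whole-file method's main failure mode is over-estimating.
/// Sizes in bytes, durations in seconds; names are illustrative.
fn predicted_encode_size(
    input_file_size: u64,
    encode_percent: f64, // encoded sample size / lossless sample size
    encode_size: u64,    // encoded sample size
    input_duration: f64,
    sample_duration: f64,
) -> u64 {
    let by_file_size = input_file_size as f64 * encode_percent;
    let by_duration = encode_size as f64 * input_duration / sample_duration;
    by_file_size.min(by_duration) as u64
}
```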

A more accurate approach would be to use the input video stream size: input_vstream_size * encode_percent directly works around the issues with the first approach.

The problem is that to calculate the input video stream size we need to fully scan the file with ffmpeg. This could be undesirable when the input is on slower storage: it's a relatively expensive operation, and it only improves the predicted video stream size output, which is only shown for crf-search.

On the other hand, we may be able to scan the input concurrently with the crf-search, since it isn't a CPU-intensive operation. So it may be worth it, depending on how valuable the predicted size is to the user.
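
For reference, one way such a scan could work is to sum the video stream's packet sizes with ffprobe. A rough sketch (this still reads the whole file; the helper name is hypothetical):

```rust
use std::process::Command;

/// Measure the input video stream size by summing all video packet
/// sizes reported by ffprobe. Reads the entire file, so it can be
/// slow on spinning disks or network storage.
fn input_vstream_size(input: &str) -> std::io::Result<u64> {
    let out = Command::new("ffprobe")
        .args([
            "-v", "error",
            "-select_streams", "v:0",
            "-show_entries", "packet=size",
            "-of", "csv=p=0",
            input,
        ])
        .output()?;
    // One packet size (in bytes) per output line; sum them all.
    Ok(String::from_utf8_lossy(&out.stdout)
        .lines()
        .filter_map(|l| l.trim().parse::<u64>().ok())
        .sum())
}
```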

@iPaulis

iPaulis commented Dec 6, 2022

As you said, the current method is not a consistently accurate approximation. It depends a lot on the source file itself, on how the encoding efficiency affects those specific samples, and on how many samples are created; the more samples, the more accurate the approximation, but with only a few of them, as for short videos, results may get skewed.

Here is an idea to achieve a more reliable, consistent and accurate final encoded video stream size prediction based on an improved approximation, without having to scan the file at all.
It should also be fairly simple to implement and would work better regardless of the video duration or number of samples. It should always give a prediction lower than the whole input_filesize method (unless the file has no audio stream, in which case the result would be the same), so I think it shouldn't be necessary to pick the minimum between the 2 methods.

The aim is to get the percentage, and thus the size, of the video stream out of the whole input file. That is very difficult to calculate from the sample video streams, because video bitrate varies a lot from sample to sample, but it should be more easily achievable via the audio streams.
Unlike video streams, the vast majority of audio streams in video files are constant bitrate, and even when one is VBR the bitrate variation is not that high (it also affects the final prediction less, as the audio streams are smaller). That means that, in theory, a single lossless sample that includes all streams should be enough, as the audio stream sizes should be almost the same in any sample.

So, ab-av1 only needs to create one full sample (maybe 2, just to verify that the audio streams are very similar in size, though that may be unnecessary, or only useful when audio is VBR) with ffmpeg -i INPUT -map 0 -c copy -c:a copy OUTPUT or a similar command based on the current parameters (although in my test with only -map 0 the split took forever).
The important part is to get all streams at the same timestamp as one of the regular lossless samples, so the video streams match. It's better if all subtitles are included too: sometimes there are over 15 tracks, often in the image-based PGS format which takes up about 20-30MB per track for a whole movie, so in some source files 350MB could be just subtitles, and that would affect the result. Then we take the difference:
not_vstreams_sample_size = full_lossless_sample_size - vstream_lossless_sample_size

As the bitrate of those streams is mostly constant, we can extrapolate to the whole file (this should be more consistent than the current method):
not_vstreams_full_size = not_vstreams_sample_size * input_duration / sample_duration

Now we should be able to get a closer approximation to vstream_size:
approx_vstream_full_size = input_file_size - not_vstreams_full_size

encode_percent = encode_sample_size / lossless_sample_size
final_encoded_size_prediction = approx_vstream_full_size * encode_percent
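
Putting the whole chain together, a minimal sketch using the variable names above (sizes in bytes, durations in seconds; illustrative only):

```rust
/// Predict the final encoded video stream size from one lossless
/// sample containing all streams and one containing only video.
fn final_encoded_size_prediction(
    input_file_size: f64,
    input_duration: f64,
    sample_duration: f64,
    full_lossless_sample_size: f64,    // sample with all streams
    vstream_lossless_sample_size: f64, // video-only lossless sample
    encode_sample_size: f64,           // encoded video-only sample
) -> f64 {
    // Audio/subtitle streams are roughly constant bitrate, so their
    // sample size extrapolates linearly to the whole file.
    let not_vstreams_sample_size =
        full_lossless_sample_size - vstream_lossless_sample_size;
    let not_vstreams_full_size =
        not_vstreams_sample_size * input_duration / sample_duration;
    let approx_vstream_full_size = input_file_size - not_vstreams_full_size;
    let encode_percent = encode_sample_size / vstream_lossless_sample_size;
    approx_vstream_full_size * encode_percent
}
```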

That is the main idea. I hope it does not complicate things too much, because I believe it should give considerably better results in a greater variety of scenarios. There are already a couple of files I would like to test this with; I did a rough manual calculation, and in my limited testing it was more accurate every time with just a single 20s full sample as the only reference. It seems promising.

What do you think?
Maybe you had already thought of this; I hope you didn't discard the idea for some reason.

@alexheretic
Owner Author

That's an interesting idea, though I suspect it'll end up being an approximation of similar accuracy to the current encoded-sample-size * duration calculation. The sample vstream to audio stream proportion will fluctuate in the same way as the sample vstream size (which is the cause of the current inaccuracy). It will also mean writing more samples, which is mildly undesirable. We could do better by just including all streams in the samples (or the first sample) and scanning them to get the stream sizes.

However, I think I'd prefer doing an async scan of the input which will produce the video stream size. This option will be more accurate and I think simpler.

@iPaulis

iPaulis commented Dec 6, 2022

It is more accurate than the current method because the audio streams are not affected by the video stream's fluctuation (the source of the current inaccuracy).

I’m not saying to use video-to-audio proportions, which do fluctuate the same way and would keep the same source of inaccuracy, but to use the absolute sizes of the audio streams, taking advantage of the constant-bitrate nature of most audio and subtitle codecs, as shown in the equations I wrote above.
Using absolute numbers you don’t carry the video fluctuation inaccuracies with you, which is what we want to avoid. Thanks to that alone, this method is already more consistent.

It only requires creating 1, at most 2, extra samples with all the streams. Then the absolute size of all non-video streams can be obtained with a simple subtraction: full sample minus video sample.
This way you can completely avoid having to scan and parse anything.

The scanning method is of course the most accurate, without a doubt, but it needs extensive HDD operations on big files and may take more than 10 minutes to complete, as it did for one of my files. And the bigger the file, the longer it takes, e.g. with higher-resolution 4K files, uncompressed raw video, and so on.

I honestly think this new approximation could work and prove to be more reliable and accurate, so scanning could be avoided entirely.
I thought you preferred to avoid scanning, so this new method could be worth a try.

This idea is a different kind of approach and should give a noticeably more accurate approximation; it did in my tests. I can show you an example with real numbers later, but trust me, it is noticeably different from the current method. (Unless the current method was already spot on for a specific file, in which case the results are similar; but for files where the current method deviated, this new approach makes a good difference.)

@iPaulis

iPaulis commented Dec 6, 2022

Some rough tests

Audio sample size approx. method (same equations as above).

Source file 27.3GB = 19.1GB vstream + 8.2GB other streams (12 audio + 13 subtitle streams)

Real encoded video stream size is 6.1GB:

  • Whole input size prediction method: crf 18 VMAF 96.51 predicted full encode size 9.37 GiB (32%) taking 2 hours
  • Encode sample size approx. method: crf 18 VMAF 96.51 predicted video stream size 5.90 GiB (32%) taking 2 hours
  • Audio stream size approx. method (to reach vstream size by subtraction):
    not_vstreams_lossless_sample_size = 76821KB - 56002KB = 20819KB
    not_vstreams_full_size = 20819KB * 8460s / 20s = 8806437KB / 1024 = 8600.04MB / 1024 = 8.399GB
    approx_vstream_full_size = 27.3GB - 8.399GB = 18.902GB
    final_encoded_size_prediction = 18.902GB * 32% = 6.049GB

Previous prediction was fine, but this one is better.

Source file 24.3GB = 13.3GB vstream + 11GB other streams (1 secondary video + 12 audio + 17 subtitle streams)

Real encoded video stream size is 6.0GB:

  • Whole input size prediction method: crf 17 VMAF 96.47 predicted full encode size 11.97 GiB (45%) taking 2 hours
  • Encode sample size approx. method: crf 17 VMAF 96.47 predicted video stream size 6.81 GiB (45%) taking 2 hours
  • Audio stream size approx. method (to reach vstream size by subtraction):
    not_vstreams_lossless_sample_size = 56283KB - 27413KB = 28870KB
    not_vstreams_full_size = 28870KB * 8280s / 20s = 11952180KB / 1024 = 11672.05MB / 1024 = 11.399GB
    approx_vstream_full_size = 24.3GB - 11.399GB = 12.90GB
    final_encoded_size_prediction = 12.90GB * 45% = 5.81GB

The previous prediction was not very good; the new one is closer.

I'll test a bit more with less complex source files to see how it behaves, but from what I've seen this is a more accurate method.

@iPaulis

iPaulis commented Dec 7, 2022

Source file 22.9GB = 18.4GB vstream + 4.5GB other streams (2 audio + 2 subtitle streams)

Real encoded video stream size is 3.86GB:

  • Whole input size prediction method: crf 18 VMAF 98.30 predicted full encode size 4.90 GiB (21%) taking 2 hours
  • Encode sample size approx. method: crf 18 VMAF 98.30 predicted video stream size 4.01 GiB (21%) taking 2 hours
  • Audio stream size approx. method (to reach vstream size by subtraction):
    not_vstreams_lossless_sample_size = 62319KB - 50684KB = 11635KB
    not_vstreams_full_size = 11635KB * 7219s / 20s = 4199653.25KB / 1024 = 4101.22MB / 1024 = 4.01GB
    approx_vstream_full_size = 22.9GB - 4.01GB = 18.895GB
    final_encoded_size_prediction = 18.895GB * 21% = 3.97GB

This case should have been a bit challenging for the new method, as both audio streams were variable bitrate (unless I was extremely lucky with the random sample I created), but even so the results were even better than the previous method's, which were already pretty good.


Source file 11.4GB = 10.3GB vstream + 1.1GB other streams (2 audio + 6 subtitle streams)

Real encoded video stream size is 2.27GB:

  • Whole input size prediction method: crf 17 VMAF 97.09 predicted video stream size 2.57 GiB (22%) taking 2 hours
  • Encode sample size approx. method: crf 17 VMAF 97.09 predicted video stream size 2.51 GiB (22%) taking 2 hours
  • Audio stream size approx. method (to reach vstream size by subtraction):
    not_vstreams_lossless_sample_size = 17879KB - 15079KB = 2800KB
    not_vstreams_full_size = 2800KB * 7560s / 20s = 1058400KB / 1024 = 1033.594MB / 1024 = 1.01GB
    approx_vstream_full_size = 11.4GB - 1.01GB = 10.39GB
    final_encoded_size_prediction = 10.39GB * 22% = 2.29GB

This would be the best-case scenario, because both audio streams are constant bitrate, and it shows in how accurate the result is.


Source file 14.2GB = 11.2GB vstream + 3GB other streams (4 audio + 8 subtitle streams)

Real encoded video stream size is 5.15GB:

  • Whole input size prediction method: crf 18-19 VMAF 97.07-97.41 predicted video stream size 5.84-7.28 GiB (41-51%) taking 2 hours
  • Encode sample size approx. method: crf 18.5 VMAF 97.24 predicted video stream size 6.11 GiB (46%) taking 2 hours
  • Audio stream size approx. method (to reach vstream size by subtraction):
    not_vstreams_lossless_sample_size = 13890KB - 6000KB = 7890KB
    not_vstreams_full_size = 7890KB * 7687s / 20s = 3032521.5KB / 1024 = 2961.45MB / 1024 = 2.89GB
    approx_vstream_full_size = 14.2GB - 2.89GB = 11.31GB
    final_encoded_size_prediction = 11.31GB * 46% = 5.2GB

Another easy case for the new method, as all the audio streams are CBR, but thanks to that it offered a far better approximation than the current method, which overestimated the encoded size.


Source file 17.9GB = 16.9GB vstream + 1GB other streams (2 audio + 3 subtitle streams)

Real encoded video stream size is 2.37GB:

  • Whole input size prediction method: crf 17 VMAF 95.46 predicted full encode size 2.59 GiB (14%) taking 2 hours
  • Encode sample size approx. method: crf 17 VMAF 95.46 predicted video stream size 2.52 GiB (14%) taking 2 hours
  • Audio stream size approx. method (to reach vstream size by subtraction):
    not_vstreams_lossless_sample_size = 37903KB - 36333KB = 1570KB
    not_vstreams_full_size = 1570KB * 9326s / 20s = 732091KB / 1024 = 714.93MB / 1024 = 0.698GB
    approx_vstream_full_size = 17.9GB - 0.698GB = 17.2GB
    final_encoded_size_prediction = 17.2GB * 14% = 2.41GB

It seems that the smaller the audio streams (the most common situation), the more accurate this method becomes, but even in extreme conditions with 12 audio streams the results were pretty good, better than the current predictions.

I'm more convinced the more I test it; you may try it and see for yourself.

PS: By the way, the new results cache system is wonderful.

@iPaulis

iPaulis commented Dec 9, 2022

OK, one last test; I tried to find one of the most complex files I have.
I also simplified the process: the non-video streams' sample size can be taken with a single ffmpeg command, so it is not necessary to compare against another sample containing the video stream. That allows taking a sample from any random point in the file, of any duration, so if necessary the sample duration could be extended for more accurate results.

However, I did all my tests with a single random 20s sample per test and the results were already pretty good, so unless a particular file is exceptionally difficult, e.g. with lots of variable-bitrate audio streams, that should be enough.

This is the ffmpeg command (apparently placing the seeking parameters before the input accelerates the process very significantly): ffmpeg -ss 1100 -t 20 -i ".\00000.m2ts" -map 0:a -map 0:s -c copy "non_vstreams_sample.mkv"
It produces a sample file with all audio+subtitle streams and no video, so the file size of that sample is all we need to start calculating the approximation, as in the sketch below.
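
A minimal sketch of this simplified variant (illustrative names, not ab-av1's API; it mirrors the -map 0:a/-map 0:s command above, using 0:s? to tolerate files without subtitle streams):

```rust
use std::process::Command;

/// Extract a short audio+subtitle-only sample with stream copy, then
/// predict the final video stream size from that sample's size alone.
fn predict_via_non_video_sample(
    input: &str,
    start_secs: u32,      // random sample start position
    sample_secs: u32,     // e.g. 20
    input_file_size: u64, // bytes
    input_duration: f64,  // seconds
    encode_percent: f64,  // from the usual crf-search sample encodes
) -> std::io::Result<f64> {
    let sample = "non_vstreams_sample.mkv";
    // -ss/-t before -i use fast input seeking; `0:s?` makes the
    // subtitle mapping optional for files without subtitle streams.
    Command::new("ffmpeg")
        .arg("-ss").arg(start_secs.to_string())
        .arg("-t").arg(sample_secs.to_string())
        .arg("-i").arg(input)
        .args(["-map", "0:a", "-map", "0:s?", "-c", "copy", sample])
        .status()?;
    let non_vstreams_sample_size = std::fs::metadata(sample)?.len() as f64;
    // Extrapolate the roughly constant-bitrate non-video streams to
    // the whole file, then subtract to approximate the video stream.
    let non_vstreams_full_size =
        non_vstreams_sample_size * input_duration / f64::from(sample_secs);
    let approx_vstream_full_size = input_file_size as f64 - non_vstreams_full_size;
    Ok(approx_vstream_full_size * encode_percent)
}
```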

Source file 29.9GB = 22.4GB vstream + 7.5GB other streams (13 audio + 28 subtitle streams)

Real encoded video stream size is 4.26GB:

  • Whole input size prediction method: crf 17 VMAF 95.32 predicted full encode size 5.78 GiB (19%) taking 2 hours
  • Encode sample size approx. method: crf 17 VMAF 95.32 predicted video stream size 3.49 GiB (19%) taking 89 minutes
  • Audio stream size approx. method (to reach vstream size by subtraction):
    non_vstreams_sample.mkv = 16897KB
    non_vstreams_full_size = 16897KB * 9225s / 20s = 7793741.25KB / 1024 = 7611.08MB / 1024 = 7.433GB
    approx_vstream_full_size = 29.9GB - 7.433GB = 22.467GB
    final_encoded_size_prediction = 22.467GB * 19% = 4.27GB

I hope you find this method effective; it really works and requires minimal processing.

@alexheretic
Owner Author

Thanks! You've convinced me it's worth trying this out. Perhaps I can just include audio & subs in all lossless samples and parse the audio+sub stream sizes from the ffmpeg output to make the calculation. Videos can also have attachments, like fonts, but that's probably something we can look at more closely later.
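
For what it's worth, ffmpeg prints a per-stream-type size summary at the end of a run; a rough sketch of pulling sizes out of that line (the format shown is assumed from typical ffmpeg output and can vary between versions):

```rust
/// Parse ffmpeg's end-of-run size summary line, e.g.
/// "video:79884kB audio:6462kB subtitle:312kB other streams:0kB ..."
/// Returns (video_kb, audio_kb, subtitle_kb). Sketch only.
fn parse_ffmpeg_size_summary(line: &str) -> Option<(u64, u64, u64)> {
    let grab = |key: &str| -> Option<u64> {
        // Take the digits between the key and the following "kB".
        let rest = &line[line.find(key)? + key.len()..];
        let end = rest.find("kB")?;
        rest[..end].trim().parse().ok()
    };
    Some((grab("video:")?, grab("audio:")?, grab("subtitle:")?))
}
```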

I'll test using this approach when I have some time.
