
[Question/Suggestion] How is the estimated encode size calculated? #79

Closed
iPaulis opened this issue Nov 28, 2022 · 13 comments · Fixed by #81

Comments

@iPaulis

iPaulis commented Nov 28, 2022

I'm quite confused about the encoded size estimations I'm getting for my videos with crf-search.
I understand they are just predictions based on a limited number of samples, but they don't get near the real encoded size of my files, sometimes only by a little, but often by a huge margin, so I've been trying to understand how it works and where the inaccuracy comes from.
At first I thought it could be caused by differences in encoding parameters or something similar, but I've since noticed it depends more on the structure of the source file: how many tracks it has and what share of the whole file size belongs to the video track versus the other tracks.

Correct me if I'm wrong, but I'm assuming ab-av1 is comparing the encoding efficiency/percentage between the lossless samples and the encoded samples, and then applying that resulting percentage to the input source file size to get the final predicted encoding size.

For example, when running crf-search, this is what ab-av1 is predicting for one of my full bluray files:

crf 21 VMAF 95.24 predicted full encode size 5.26 GiB (18%) taking 2 hours

The source file is 29.2 GB, so 18% of that comes to roughly 5.26 GB (≈5,256 MB), matching the prediction.

The problem I'm finding is that about 10 GB of those 29.2 GB are multiple audio tracks, and the video track is only 19.1 GB, so the prediction is inaccurate, because the audio tracks are not going to be encoded the same way as the video track, which is what ab-av1 is actually measuring. Besides, many of those tracks would probably be discarded and others re-encoded; sometimes the same language comes in two tracks with different audio codecs and quality, and so on. There is no way to know all of that for every source file and every use case.

My point is that the total size of the final file depends entirely on what the user decides to do with the other tracks, and ab-av1 cannot predict that, so I think it should focus on reporting what it can actually predict, which is what is actually being measured with the video encoders and presets: the estimated size of the encoded video track only. Otherwise it gets confusing; it is not clear what that estimation represents, neither the whole file size nor only the video track size, at least as it stands right now, if I'm not mistaken.

So, continuing with the example, I believe a more realistic prediction of the encode size would be 18% of the video track only (19.1 GB), which comes to 3.44 GB.
Then, if I decided to keep all audio tracks untouched, the final file would weigh around 13.5 GB, far from the currently predicted 5.26 GB, and if I decided to keep only one language, depending on the quality of that track, it would result in a file of around 4 GB, also not close to the current prediction.
Users can take that into account and treat ab-av1's prediction as a baseline for the video track only, which is quite helpful already.
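
To spell out the arithmetic I mean (a throwaway snippet with my own rounded numbers from above, not anything ab-av1 does internally):

// comparing the current whole-file prediction with the video-track-only one,
// using the example figures above
fn main() {
    let reduction = 0.18;         // predicted size reduction from crf-search
    let whole_file_gb = 29.2;     // full input: video + ~10 GB of audio tracks
    let video_track_gb = 19.1;    // video track only

    println!("whole-file based:  {:.2} GB", whole_file_gb * reduction);  // ~5.26 (current)
    println!("video-track based: {:.2} GB", video_track_gb * reduction); // ~3.44 (suggested)
}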

Would it be possible for ab-av1 to get the size of only the video track from the whole file and then make the encoding prediction based on that? The result would be more accurate, and it would make much more sense in my opinion.

Thank you.

@alexheretic
Owner

alexheretic commented Nov 28, 2022

You're correct in that this could definitely be improved. It's currently simply comparing the samples' total video stream size to the encoded versions. So audio is completely ignored, which can kinda work when the video stream is a big enough proportion of the total size.

The percentage itself should be a decent estimate of the video stream compression, it's just the predicted full size which is skewed depending on audio stream size.

Improving this may have some difficulties as I don't think we can always easily find the size of the video stream without reading through the whole input video. We would rather avoid that. crf-search also doesn't "know" if you're going to re-encode the audio.

Ideas:

  • Take the encoded sample size and multiply by duration to get a predicted encoded video stream size only. Then change the text: ... predicted video stream size xGiB (y%) taking ....
  • Or keep trying to predict the entire thing by taking the lossless samples (which are video only) and scaling their size by duration to get the approx input video stream size. The prediction would then be input-size - approx-input-vsize + approx-encode-vsize

The first is simplest. But either would be better than the current approach :)
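
Roughly, in code terms (all names here are made up for illustration, not actual ab-av1 internals):

/// Idea 1: extrapolate the encoded samples to a predicted *video stream* size only.
fn predicted_video_stream_size(
    encoded_samples_bytes: u64,
    samples_duration_s: f64,
    input_duration_s: f64,
) -> u64 {
    (encoded_samples_bytes as f64 * (input_duration_s / samples_duration_s)) as u64
}

/// Idea 2: still predict the full file, by also extrapolating the lossless samples
/// (video only) to approximate the input video stream size.
fn predicted_full_size(
    input_size: u64,
    lossless_samples_bytes: u64,
    encoded_samples_bytes: u64,
    samples_duration_s: f64,
    input_duration_s: f64,
) -> u64 {
    let scale = input_duration_s / samples_duration_s;
    let approx_input_vsize = (lossless_samples_bytes as f64 * scale) as u64;
    let approx_encode_vsize = (encoded_samples_bytes as f64 * scale) as u64;
    // saturate in case the approximated input vstream overshoots the whole file size
    input_size.saturating_sub(approx_input_vsize) + approx_encode_vsize
}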

@iPaulis
Author

iPaulis commented Nov 29, 2022

Oh, I had hoped there was some way to get the video stream size from the metadata with ffprobe or something, although video files don't always contain that kind of metadata.
How about reading the bitrate from the metadata? Given an average bitrate for the whole video stream and knowing the duration, it could be used to calculate the size.
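
Something along these lines with ffprobe might work where the container actually exposes the values (mkv often only reports bit rate and duration at the format level or in tags, so this would be best-effort only; the function name is mine):

use std::process::Command;

/// Best-effort: size ≈ bit_rate / 8 * duration for the first video stream.
/// Returns None when ffprobe doesn't report one of the values.
fn approx_video_stream_size(input: &str) -> Option<u64> {
    let out = Command::new("ffprobe")
        .args([
            "-v", "error",
            "-select_streams", "v:0",
            "-show_entries", "stream=bit_rate,duration",
            "-of", "default=noprint_wrappers=1",
            input,
        ])
        .output()
        .ok()?;
    let text = String::from_utf8(out.stdout).ok()?;
    let (mut bit_rate, mut duration) = (None, None);
    for line in text.lines() {
        if let Some(v) = line.strip_prefix("bit_rate=") {
            bit_rate = v.parse::<f64>().ok();
        } else if let Some(v) = line.strip_prefix("duration=") {
            duration = v.parse::<f64>().ok();
        }
    }
    Some((bit_rate? / 8.0 * duration?) as u64)
}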

If there are cases where that video stream size (or something equivalent) can easily be read from the file, I would use that method for those files. Even if it is not always possible, at least some files could benefit from greatly improved prediction accuracy.
Then, when there is no simple way to get real size data to work with, I guess an approximation is the next best thing for those cases.
Like plan A and plan B: since we know plan A will sometimes fail, the priority would be to develop plan B, which always works (although less accurately), and plan A could be left as a future enhancement.

I wouldn't try to predict the entire thing, audio included, because it gets unnecessarily complicated: we don't know what the user will choose to do with the audio streams, so depending on the source file the prediction could be unhelpful in many cases. Besides, I believe the idea behind ab-av1 is to show a fast prediction of the encoding gains we could get, so I would focus only on the video stream, as adding audio to the mix would dilute the gains the video encoder offers for a given file, which is the most interesting part in my opinion.

@iPaulis
Author

iPaulis commented Nov 29, 2022

I've been thinking about the best approach to get a reasonably accurate prediction without over-complicating things.
I would go the simplest way, as you said, but rather than calculating the prediction from the encoded samples (because encoders are more or less efficient for certain parts of the video, some encoded samples could skew the prediction too much), I would do it from the lossless samples just once, which would serve as a baseline for the video stream size. That way you could stick to the method you are already using with minimal changes, and use the same baseline for every calculation, which I believe should work more reliably and give more consistent results:
prediction = size * encoded_percent (which I guess is an average across the samples?) / 100

The idea is to replace the current whole input size with just the video stream size, whether that is the real one, if available, or an approximation.
That approximation could be, as you said, (lossless sample_size / sample_duration) * input_duration, but with only a few samples even for a full-length movie, I'm not sure an average of the samples' sizes is the best method, as the bitrate can be unevenly distributed throughout the video, so it tends to deviate quite a bit and the result gets skewed.
I've tried manually calculating the video stream size using the average of the lossless samples for 2 movies, 11 samples per movie, and the final predicted size was off by ±10-15%. With fewer samples the prediction could be less accurate.
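
What I did by hand was essentially this (illustration only, with made-up sample numbers; not ab-av1 code):

// extrapolate each lossless sample's bytes-per-second to the full duration and
// look at the spread of the per-sample estimates vs their mean
fn main() {
    // hypothetical: (lossless sample size in bytes, sample duration in seconds)
    let lossless_samples: &[(u64, f64)] = &[
        (39_000_000, 20.0),
        (31_000_000, 20.0),
        (46_000_000, 20.0),
    ];
    let input_duration_s = 7200.0;

    let estimates: Vec<f64> = lossless_samples
        .iter()
        .map(|&(bytes, secs)| bytes as f64 / secs * input_duration_s)
        .collect();
    let mean = estimates.iter().sum::<f64>() / estimates.len() as f64;

    // when the bitrate is unevenly distributed the per-sample estimates spread a lot,
    // which is where the ±10-15% error I saw comes from
    for (i, est) in estimates.iter().enumerate() {
        println!("sample {}: {:.2} GiB", i + 1, est / 1024f64.powi(3));
    }
    println!("mean estimate: {:.2} GiB", mean / 1024f64.powi(3));
}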

Perhaps the deviation could be lowered by applying weighted means instead of plain arithmetic means, so that the most common sample sizes weigh more than the odd high/low-size samples.
How the encoded percent is calculated could be another source of deviation for the final predicted size; a weighted mean could also help there when computing the encoded percent over the whole set of samples.

But those are just accuracy details for later; as you said, either way an approximate prediction will be more representative of a real encode size than the current method, so changing that is the first step and would be a helpful improvement.

@alexheretic
Owner

My instinct is to keep it simple for now; I don't think predicting the encoded size is "core functionality" anyway. So shifting the goalposts to just predicting the video stream size means this is still kinda useful but should be much more consistent in its accuracy.

I've implemented this in #81 & I'll run some tests to see how it changes things.

@alexheretic
Owner

Test 1

tmp1.mp4 size 3768896485, ffmpeg full read video:3621025kB

Video is ~96% of file size.

v0.5.0

$ ab-av1 sample-encode -i tmp1.mp4 --preset 12 --crf 30
VMAF 95.23 predicted full encode size 1.08 GiB (31%) taking 9 minutes

pr-81

$ ab-av1 sample-encode -i tmp1.mp4 --preset 12 --crf 30
VMAF 95.23 predicted video stream size 1.11 GiB (31%) taking 9 minutes

Using video stream

video stream prediction: 3621025000 * 0.31 = 1.05 GiB

Encode

After encoding: ffmpeg full read video:1204457kB

Video size 1.12 GiB, file size 1.21 GiB

With a much larger video / audio ratio all 3 methods are fairly close to each other.
pr-81 wins mostly because it states "video stream size" which they're all close enough to.

Test 2

tmp2.mkv size 1704472495, ffmpeg full read video:1253044kB

Video is ~74% of size (file has dual flac audio & ttf attachments)

v0.5.0

ab-av1 sample-encode -i tmp2.mkv --preset 12 --crf 35
VMAF 95.29 predicted full encode size 228.51 MiB (14%) taking 4 minutes

pr-81

$ ab-av1 sample-encode -i tmp2.mkv --preset 12 --crf 30
VMAF 95.29 predicted video stream size 162.87 MiB (14%) taking 4 minutes

Using video stream

video stream prediction: 1253044000 * 0.14 = 167 MiB

Encode

After encoding: ffmpeg full read video:141851kB

Video size 135 MiB, file size 540 MiB

Here we see the existing approach really fall down as a significant amount of data is not video.
The pr-81 approach seems decent enough here.

Remarks

#81 definitely seems to be an improvement, and mainly just by trying to predict video stream size instead of full encoded file size.

Using the full video stream size is hard to do (since it requires a full scan or flaky per-file metadata) and doesn't seem to help that much, since we still have to use the encoded sample reduction percent. Ultimately the samples' encoded size seems much more volatile than the sample VMAF (the primary use), which does make sense for crf encoding, where size is the variable. So the sample size really is a prediction; I don't think it can ever be super accurate.

@alexheretic
Owner

Just to follow up on the sample size vs sample VMAF volatility, I tried test 2 with different sample counts.

$ ab-av1 sample-encode -i tmp2.mkv --preset 12 --crf 35 --sample-every 6m
- Sample 1 (10%) vmaf 95.29
- Sample 2 (15%) vmaf 94.67
- Sample 3 (10%) vmaf 94.41
- Sample 4 (8%) vmaf 95.39
VMAF 94.94 predicted video stream size 102.06 MiB (11%) taking 4 minutes
$ ab-av1 sample-encode -i tmp2.mkv --preset 12 --crf 35 --sample-every 3m
- Sample 1 (23%) vmaf 92.57
- Sample 2 (13%) vmaf 95.60
- Sample 3 (14%) vmaf 94.57
- Sample 4 (15%) vmaf 94.86
- Sample 5 (13%) vmaf 94.83
- Sample 6 (13%) vmaf 95.49
- Sample 7 (13%) vmaf 95.51
- Sample 8 (7%) vmaf 95.41
VMAF 94.85 predicted video stream size 180.02 MiB (15%) taking 4 minutes

VMAF seems relatively stable here (which is the point of sample crf encoding), while the size is more volatile. Ultimately the sample-encode job is to estimate VMAF; the predicted reduction is a nice-to-have that can tell you when the encode isn't worth it. We also want to waste as little time as possible in the sample-encode stages, so I think the tradeoff makes sense.

@alexheretic
Owner

Here are the values doing a full pass, encoding the whole video stream as a single sample.

$ ab-av1 sample-encode -i tmp2.mkv --preset 12 --crf 35 --sample-every 1s
- Sample 1 (9%) vmaf 94.76
VMAF 94.76 predicted video stream size 138.98 MiB (9%) taking 3 minutes

Since this encodes the whole thing it should be 100% accurate, but it doesn't seem to be (139 vs 135). Presumably this is because we're measuring encoded sample file size which may include a little extra container overhead. I doubt this is worth optimising though, since it won't introduce much inaccuracy compared to general sample size variance.
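
For reference, measuring the encoded sample's video stream size instead of its file size could look something like summing the video packet sizes with ffprobe (just a sketch, not something ab-av1 does today; the function name is made up):

use std::process::Command;

/// Sum the sizes of all video packets to approximate the video stream size
/// without container overhead. Error handling elided.
fn video_stream_bytes(path: &str) -> Option<u64> {
    let out = Command::new("ffprobe")
        .args([
            "-v", "error",
            "-select_streams", "v:0",
            "-show_entries", "packet=size",
            "-of", "csv=p=0",
            path,
        ])
        .output()
        .ok()?;
    String::from_utf8(out.stdout)
        .ok()?
        .lines()
        .map(|l| l.trim().parse::<u64>().ok())
        .sum::<Option<u64>>()
}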

@alexheretic
Owner

alexheretic commented Nov 29, 2022

I'm a little unsure if ffmpeg's "kB" is 1000 or 1024. It should mean 1000, but I have it in main as 1024 and testing more I actually think it is 1024...

... In which case the actual encoded size was indeed 138.52 MiB so the full pass is pretty much bang on.
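(That is, 141851 kB is ≈ 135.3 MiB if kB means 1000 bytes, but ≈ 138.5 MiB if it means 1024, which matches the corrected figure.)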

@iPaulis
Author

iPaulis commented Nov 29, 2022

I was just about to warn about the container overhead issue. The inaccuracy will depend on the container used (mkv/mp4) and the lossless samples' size. The larger the sample, the smaller the relative overhead that then gets multiplied over the whole duration, and the opposite also holds.
That could be mostly insignificant, although noticeable, if the source file is a bdremux or a similarly large high-quality video (with my full blurays, for example, it is just a 2-3 percent overhead with lossless mkv samples of around 39 MB), but for smaller source files the container overhead could get bigger and skew the results upwards. Low-quality (low bitrate and size) source files of long duration could be especially affected, always getting higher estimations than they should.

For example, taking an mp4 file with a 1.18 GB video stream, the lossless sample is about 29 MB with a video stream of almost 27 MB, which represents a container overhead of around 6%.
Testing with another .mp4 file (855 MB video stream), the lossless mp4 sample is about 28 MB with a video stream of about 25 MB, so that is already a container overhead of around 9%. That could start skewing the results.

@alexheretic
Owner

It's true that measuring the sample stream size would be more accurate; I'm just not convinced it'll significantly improve the encode percent predictions. I'll probably make a separate issue to investigate.

When I calculated the container overhead for my examples it was less than 1% (after I correctly figured kB as 1024). Perhaps I need some more test cases.

@alexheretic
Owner

I've raised #82 to follow up on the sample vstream measuring idea.

With that pushed to later, I think #81 will resolve this issue, would you agree @iPaulis or am I missing other potential improvements?

@iPaulis
Author

iPaulis commented Nov 29, 2022

From my limited testing, there also seems to be a significant difference between containers, with the mkv container always having a smaller overhead, usually around 2-3% for very small 1-2 MB samples, even for samples < 0.5 MB, while mp4 varies a lot more depending on the sample size, as explained above.

In fact, mkv seems to be very consistent in that 2-3% container overhead, so it could very easily be corrected for to get a more accurate video stream size. It is an almost insignificant percentage, but it seems so consistent and easy to predict that it could be worth compensating with a fixed percentage overhead to get a closer prediction.
Here is some info I found from Matroska testing the overhead of their container. They conclude that even for low-bitrate audio-only or low-bitrate video-only files, mkv can keep the overhead below 3%.
Since those tests are worst-case scenarios, I think it would be safe to assume a general 2% average overhead and apply that correction to the prediction of the vstream size.
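
Just to illustrate what I mean by compensating (the 2% is my assumed average, nothing measured by ab-av1, and the function name is mine):

/// Strip an assumed fixed container overhead from a lossless sample file size
/// before extrapolating it to a video stream size estimate.
fn corrected_vstream_estimate(sample_file_bytes: u64, assumed_overhead: f64) -> u64 {
    (sample_file_bytes as f64 / (1.0 + assumed_overhead)) as u64
}

// e.g. corrected_vstream_estimate(39_000_000, 0.02) ≈ 38_235_294 bytes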

.mp4 seems to be much more unpredictable, and quite risky actually. I've seen results from 40% to 0% overhead, so depending on the file it could completely destroy the prediction.

For example, this is the info of a source file (it is a quite standard h264 mp4 file):

Video

ID : 1
Format : AVC
Format/Info : Advanced Video Codec
Format profile : Main@L3.1
Format settings : CABAC / 1 Ref Frames
Format settings, CABAC :
Format settings, Reference frames : 1 frame
Codec ID : avc1
Codec ID/Info : Advanced Video Coding
Duration : 28 min 59 s
Bit rate : 3 733 kb/s
Width : 1 280 pixels
Height : 720 pixels
Display aspect ratio : 16:9
Frame rate mode : Constant
Frame rate : 25.000 FPS
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8 bits
Scan type : Progressive
Bits/(Pixel*Frame) : 0.162
Stream size : 774 MiB (94%)
Language : Russian
Codec configuration box : avcC

Audio

ID : 2
Format : AAC LC
Format/Info : Advanced Audio Codec Low Complexity
Codec ID : mp4a-40-2
Duration : 28 min 59 s
Bit rate mode : Variable
Bit rate : 254 kb/s
Maximum bit rate : 320 kb/s
Channel(s) : 2 channels
Channel layout : L R
Sampling rate : 48.0 kHz
Frame rate : 46.875 FPS (1024 SPF)
Compression mode : Lossy
Stream size : 52.8 MiB (6%)
Language : Russian
Default :
Alternate group : 1

And this is one of the lossless samples created by ab-av1 for that mp4:

Stream size : 4.91 MiB (57%)
Original tracks size : 8.57 MiB (100%)

With that kind of big overhead, it could create a scenario where the approximate video stream size is higher than the actual whole input file size (quite easy to happen if the source file has very small audio streams). You should probably add a safeguard for that and check that the vstream estimate never gets higher than the input file; if it does, use the input size instead.
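
In code terms the safeguard is just a clamp, something like (a sketch with names of my own):

/// The approximated video stream can never be larger than the whole input file.
fn clamp_vstream_estimate(approx_vstream_bytes: u64, input_file_bytes: u64) -> u64 {
    approx_vstream_bytes.min(input_file_bytes)
}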

It does not seem necessary to do that for mkv, with such a small overhead, but mp4 is too unpredictable to get reliable results. I wonder if you should even consider changing the output container of the lossless samples to mkv, even for mp4 source files. That could completely resolve this inaccuracy and unpredictability issue of mp4.
I recommend you test it yourself to see whether you get the same kind of overhead variation from the mp4 container.

@iPaulis
Author

iPaulis commented Nov 29, 2022

I've raised #82 to follow up on the sample vstream measuring idea.

With that pushed to later, I think #81 will resolve this issue, would you agree @iPaulis or am I missing other potential improvements?

Sure, I agree it is already a good improvement. I can't think of any other potential improvements right now, apart from the overhead correction.
As you said, we can continue testing that in #82, but as I pointed out in my previous post, I'm more worried about the mp4 container overhead; given my initial results, I'd be extra careful or there could be very weird results.
