[Question/Suggestion] How is the estimated encode size calculated? #79
Comments
You're correct that this could definitely be improved. It's currently just comparing the samples' total size to the encoded versions. So audio is completely ignored, which can kinda work when the video stream is a big enough proportion of the total size. The percentage itself should be a decent estimate of the video stream compression; it's just the predicted full size which is skewed depending on audio stream size. Improving this may have some difficulties, as I don't think we can always easily find the size of the video stream without reading through the whole input video. We would rather avoid that. crf-search also doesn't "know" if you're going to re-encode the audio. Ideas:
The first is simplest. But either would be better than the current approach :)
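As I understand the current behaviour, the estimate reduces to something like this sketch (function and variable names are mine, not from the ab-av1 codebase):

```python
def predict_encoded_size_v050(input_file_bytes, lossless_sample_bytes, encoded_sample_bytes):
    """Current approach, roughly: the samples' compression ratio is applied
    to the WHOLE input file, so audio and attachments are implicitly assumed
    to shrink by the same factor as the video."""
    ratio = sum(encoded_sample_bytes) / sum(lossless_sample_bytes)
    return ratio, int(input_file_bytes * ratio)

# Hypothetical 1 GB file whose samples compress 5x:
ratio, predicted = predict_encoded_size_v050(
    input_file_bytes=1_000_000_000,
    lossless_sample_bytes=[50_000_000, 48_000_000],
    encoded_sample_bytes=[10_000_000, 9_600_000],
)
# ratio = 0.2, predicted = 200 MB -- even if half the file were audio
```

The ratio itself stays meaningful; it's multiplying by the whole-file size that skews the prediction when audio is a large fraction.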
Oh, I hoped there was some way to get the video stream size from the metadata with ffprobe or something, although not all video files contain that kind of metadata. If there are cases where the video stream size can be read easily from the file, I would use that method for those files. Even if it is not always possible, at least some files could benefit from greatly improved prediction accuracy. I wouldn't try to predict the entire thing, audio included, because it gets unnecessarily complicated: we don't know what the user will choose to do with the audio streams, so depending on the source file the prediction could be unhelpful in many cases. Besides, I believe the idea behind ab-av1 is to show a fast prediction of the encoding gains we could get, so I would focus only on the video stream; adding audio to the mix would dilute the gains that the video encoder is offering for a given file, which is the most interesting part in my opinion.
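For the metadata route, here is a sketch of pulling a per-stream size out of ffprobe's JSON output. Matroska files muxed with statistics tags carry a `NUMBER_OF_BYTES` tag per stream (sometimes with a language suffix like `NUMBER_OF_BYTES-eng`); the exact tag keys and the bitrate fallback here are assumptions about typical files, not a guaranteed API:

```python
def video_stream_bytes(ffprobe_json: dict):
    """Try to read the video stream size from metadata already in the file.
    Returns None when the container doesn't carry that information.

    Typical use (assumed command shape, not verified for every build):
      probe = json.loads(subprocess.run(
          ["ffprobe", "-v", "error", "-show_streams", "-of", "json", path],
          capture_output=True, text=True).stdout)
    """
    for stream in ffprobe_json.get("streams", []):
        if stream.get("codec_type") != "video":
            continue
        # mkvmerge-style statistics tags, possibly language-suffixed.
        for key, value in stream.get("tags", {}).items():
            if key.startswith("NUMBER_OF_BYTES"):
                return int(value)
        # Fallback: bit_rate (bits/s) * duration (s), when both are present.
        if "bit_rate" in stream and "duration" in stream:
            return int(int(stream["bit_rate"]) * float(stream["duration"]) / 8)
    return None
```

This matches the "use it where possible" idea: files with stats tags get an exact size, some others get a bitrate-derived approximation, and the rest return None.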
I've been thinking about what the best approach could be to get a reasonably accurate prediction without over-complicating things. The idea is to replace the current whole-input size with just the video stream size, whether a real one, if available, or an approximation. Perhaps the deviation could be lowered by applying weighted means instead of plain arithmetic means, so that typical sample sizes weigh more than the odd high/low outliers. But those are accuracy details for later; as you said, either way an approximate video-stream prediction will be more representative of a real encode size than the current method, so changing that is the first step and would be a helpful improvement.
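On the weighted-mean idea: weighting each sample's compression ratio by its lossless size (which is equivalent to just dividing the size totals) already damps outlier samples compared to a plain mean of per-sample ratios. A hypothetical sketch with made-up sample sizes:

```python
def weighted_ratio(samples):
    """Weight each sample's compression ratio by its lossless size, so a
    short or atypical sample pulls the estimate around less than it would
    under a plain arithmetic mean of the per-sample ratios.

    samples: list of (lossless_bytes, encoded_bytes) pairs."""
    total_lossless = sum(l for l, _ in samples)
    total_encoded = sum(e for _, e in samples)
    return total_encoded / total_lossless

# Two normal samples compressing 5x, plus one tiny odd sample at 2x:
samples = [(40_000_000, 8_000_000), (38_000_000, 7_600_000), (2_000_000, 1_000_000)]
# plain mean of ratios: (0.2 + 0.2 + 0.5) / 3 = 0.30
# size-weighted ratio:  16.6 MB / 80 MB      = 0.2075
```

The tiny outlier barely moves the weighted figure, while it drags a plain mean of ratios from 0.2 up to 0.3.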
My instinct is to keep it simple for now; I don't think predicting the encoded size is "core functionality" anyway. Shifting the goalposts to just predicting the video stream size means this is still kinda useful but should be much more consistent in its accuracy. I've implemented this in #81 & I'll run some tests to see how it changes things.
Test 1

tmp1.mp4, size 3768896485 bytes; ffmpeg full read: video:3621025kB. Video is ~96% of the file size.

v0.5.0

pr-81

Using video stream

video stream prediction: 3621025000 * 0.31 = 1.05 GiB

Encode

After encoding: ffmpeg full read: video:1204457kB. Video size 1.12 GiB, file size 1.21 GiB.

With a much larger video / audio ratio, all 3 methods are fairly close to each other.

Test 2

tmp2.mkv, size 1704472495 bytes; ffmpeg full read: video:1253044kB. Video is ~74% of the file size (the file has dual flac audio & ttf attachments).

v0.5.0

pr-81

Using video stream

video stream prediction: 1253044000 * 0.14 = 167 MiB

Encode

After encoding: ffmpeg full read: video:141851kB. Video size 135 MiB, file size 540 MiB.

Here we see the existing approach really fall down, as a significant amount of the data is not video.

Remarks

#81 definitely seems to be an improvement, mainly just from trying to predict the video stream size instead of the full encoded file size. Using the true full video stream size is hard (it requires a full scan or flaky per-file metadata) and doesn't seem to help that much, since we still have to use the encoded sample reduction percent. Ultimately the samples' encoded size seems much more volatile than the sample VMAF (the primary use), which does make sense in crf encoding. So sample size really is a prediction; I don't think it can ever be super accurate.
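The arithmetic behind the Test 2 video-stream prediction, as a quick check:

```python
def predict_video_size(video_stream_bytes, sample_reduction):
    """pr-81-style estimate: apply the encoded-sample reduction percent
    to the (approximate) video stream size only, not the whole file."""
    return video_stream_bytes * sample_reduction

MIB = 1024 ** 2
predicted = predict_video_size(1_253_044_000, 0.14)
print(f"{predicted / MIB:.0f} MiB")  # ~167 MiB, vs 135 MiB actually encoded
```

The remaining ~30 MiB gap is the sample-to-full-encode variance discussed below, not the stream-size substitution itself.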
Just to follow up on the sample size vs sample VMAF volatility, I tried test 2 with different sample counts.
VMAF seems relatively stable here (which is the point of sample crf encoding), while the size is more volatile. Ultimately the sample-encode job is to estimate VMAF, and the predicted reduction is a nice-to-have that can tell you when the encode isn't worth it. We also want to waste as little time as possible in the sample-encode stages, so I think the tradeoff makes sense.
Here are the values doing a full pass, encoding the whole video stream as a single sample.
Since this encodes the whole thing it should be 100% accurate, but it doesn't seem to be (139 vs 135). Presumably this is because we're measuring encoded sample file size, which may include a little extra container overhead. I doubt this is worth optimising though, since it won't introduce much inaccuracy compared to general sample size variance.
I'm a little unsure whether ffmpeg's "kB" is 1000 or 1024. It should mean 1000, but I have it in main as 1024, and testing more I actually think it is 1024... In which case the actual encoded size was indeed 138.52 MiB, so the full pass is pretty much bang on.
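For the record, the two readings of ffmpeg's `video:141851kB` differ by about 2.4%, which accounts for the 135 vs ~138.5 MiB discrepancy above:

```python
MIB = 1024 ** 2
ffmpeg_kb = 141851  # ffmpeg's final report: "video:141851kB"

as_kb_1000 = ffmpeg_kb * 1000 / MIB  # ~135.28 MiB if kB means 1000 bytes
as_kib_1024 = ffmpeg_kb * 1024 / MIB  # ~138.53 MiB if kB really means KiB
```

Reading it as 1024 lines up with the measured encode, matching the conclusion here.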
I was just about to warn you about the container overhead issue. The inaccuracy will depend on the container used (mkv/mp4) and the lossless samples' size: the larger the sample, the smaller the relative overhead that gets introduced and then multiplied over the whole duration, but the opposite also happens. For example, taking an mp4 file with a 1.18GB video stream, the lossless sample is about 29MB with a video stream of almost 27MB, which represents a container overhead of around 6%.
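Measuring that overhead is straightforward once both sizes are known; a hypothetical helper for the correction idea, plugging in approximate numbers from the example above:

```python
def container_overhead(sample_file_bytes, stream_bytes):
    """Fraction of the sample file that is container overhead rather than
    raw stream data."""
    return (sample_file_bytes - stream_bytes) / stream_bytes

def corrected_stream_bytes(sample_file_bytes, assumed_overhead):
    """If a container's overhead is consistent, divide it back out to
    approximate the stream size from the sample file size."""
    return round(sample_file_bytes / (1 + assumed_overhead))

# ~29MB mp4 sample holding ~27MB of video:
print(f"{container_overhead(29_000_000, 27_300_000):.1%}")  # ~6.2%
```

A fixed `assumed_overhead` only makes sense where the overhead really is consistent; per the discussion below, that may hold for mkv but not mp4.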
It's true that measuring the sample stream size would be more accurate, I'm just not convinced it'll significantly improve the encode percent predictions. I'll probably make a separate issue to investigate. When I calculated the container overhead for my examples it was less than 1% (after I correctly figured kB as 1024). Perhaps I need some more test cases.
From my limited testing, there also seems to be a significant difference between containers, with the mkv container always having a smaller overhead, usually around 2-3% for very small 1-2MB samples, even for samples < 0.5MB, while mp4 varies a lot more depending on the sample size, as explained above. In fact, mkv seems very consistent in that 2-3% container overhead, so it could easily be corrected for to get a more accurate video stream size. It is an almost insignificant percentage, but it seems so consistent and easy to predict that it could be worth compensating with a fixed percentage overhead to get a closer prediction. mp4 seems much more unpredictable, and quite risky actually: I've seen results from 40% to 0% overhead, so depending on the file it could completely destroy the prediction. For example, this is the info of a source file (a quite standard h264 mp4 file):
And this is one of the lossless samples created by ab-av1 for that mp4:
With those kinds of big overheads, it could create a scenario where the approximate video stream size is higher than the actual whole input file size (quite easy to happen if the source file has very small audio streams). You should probably add a safeguard and check that the vstream estimate never exceeds the input file size, and if it does, use the input size instead. That does not seem necessary for mkv with such a small overhead, but mp4 is too unpredictable to get reliable results. I'm not sure if you should even consider changing the output container of the lossless samples to mkv, even for mp4 source files. That could completely resolve this inaccuracy and unpredictability issue with mp4.
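The suggested safeguard could be a one-line clamp (sketch, names are illustrative):

```python
def safe_video_stream_estimate(estimated_stream_bytes, input_file_bytes):
    """An overhead-derived estimate can never legitimately exceed the
    whole input file, so clamp it to the input size."""
    return min(estimated_stream_bytes, input_file_bytes)

# An mp4 whose container overhead inflates the estimate past the
# whole-file size falls back to the input size; a sane estimate passes through.
```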
Sure, I agree it is already a good improvement. I can't think of any other potential improvements right now, apart from the overhead correction.
I'm quite confused about the encoded size estimations I'm getting for my videos with crf-search.
I understand they are just predictions based on a limited number of samples, but they don't come near the real encoded size of my files; sometimes the miss is small, but many times it's off by a huge margin, so I've been trying to understand how it works and where the inaccuracy comes from.
At first, I thought it could be caused by differences in encoding parameters or something like that, but now I've noticed it depends more on the makeup of the source file: how many tracks it has and what percentage of the whole file size corresponds to the video track versus the other tracks.
Correct me if I'm wrong, but I'm assuming ab-av1 compares the encoding efficiency/percentage between the lossless samples and the encoded samples, and then applies that resulting percentage to the input source file size to get the final predicted encode size.
For example, when running crf-search, this is what ab-av1 is predicting for one of my full bluray files:
The source file is 29.2GB, so 18% of that comes to about 5.26GB, as predicted.
The problem I'm finding is that about 10GB out of those 29.2GB are multiple audio tracks, and the video track is only 19.1GB, so the prediction is inaccurate: the audio tracks are not going to be encoded the same way as the video track, which is what ab-av1 is actually measuring. Besides, many of those tracks would probably be discarded and others re-encoded; sometimes the same language comes in two tracks with different codecs and quality, and so on. There is no way to know all of that for every source file and every use case.
My point is that the total size of the final file depends entirely on what the user decides to do with the other tracks, and ab-av1 cannot predict that. So I think it should focus on providing what it can actually predict with the video encoders and presets: the estimated size of the encoded video track only. Otherwise it gets confusing; it is not clear what the estimation represents, neither the whole file size nor only the video track size, at least as it is right now, if I'm not mistaken.
So, continuing with the example, I believe a more realistic and accurate prediction of the encode size would be 18% of the video track only (19.1GB), which results in 3.44GB.
Then, if I decided to keep all audio tracks untouched, the final file would weigh around 13.5GB, far from the currently predicted 5.26GB; or if I decided to keep only one language, depending on the quality of that track, it would result in a file around 4GB, also not close to the current prediction.
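The arithmetic in this example, spelled out (sizes in GB, taken from the figures above):

```python
source, video = 29.2, 19.1   # whole bluray file vs its video track
audio = source - video       # ~10.1 GB of audio tracks
reduction = 0.18             # crf-search's predicted 18%

current_prediction = source * reduction  # ~5.26 GB: ratio applied to whole file
video_only = video * reduction           # ~3.44 GB: ratio applied to video track
keep_all_audio = video_only + audio      # ~13.5 GB final file, audio untouched
```

The whole-file prediction (5.26GB) matches neither realistic outcome (13.5GB keeping all audio, ~4GB keeping one track), which is the confusion being described.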
Users can take that into account and consider ab-av1's prediction as a baseline only for the video track, which is quite helpful already.
Would it be possible for ab-av1 to get the size of only the video track from the whole file and then make the encoding prediction based on that? The result would be more accurate, and it would make much more sense in my opinion.
Thank you.