This notebook compares the following two ZIM files. Both are full scrapes of Abkhazian Wikipedia.

- wikipedia_ab_all_maxi_2024-06.zim, taken with `mwoffliner` 1.13, size: 26588 KB
- wikipedia_ab_all_maxi_2024-07.zim, taken with `mwoffliner` 1.14, size: 36892 KB

We are attempting to reconcile the ~39% increase in size (10304 KB on top of the June file's 26588 KB).

The file `comparison.tsv`, included with this notebook, contains three columns:

1. `path`: the content path of the item, which may appear in only one ZIM or in both
2. `june`: the size in bytes of the content in the June ZIM, if it exists there, NaN/NULL otherwise
3. `july`: the size in bytes of the content in the July ZIM, if it exists there, NaN/NULL otherwise
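
A table with this shape could be produced by an outer join of two path-to-size listings. The sketch below is a guess at the general approach, not the script that generated `comparison.tsv`; the `june_sizes` and `july_sizes` dictionaries are hypothetical stand-ins for whatever tool enumerated the ZIM entries.

```python
import pandas as pd

# Hypothetical per-ZIM listings: content path -> size in bytes.
june_sizes = {"A/Page_one": 1200, "I/photo.jpg.webp": 5000, "A/Removed_page": 300}
july_sizes = {"A/Page_one": 1500, "I/photo.jpg.webp": 9000, "A/New_page": 700}

june = pd.Series(june_sizes, name="june").rename_axis("path").reset_index()
july = pd.Series(july_sizes, name="july").rename_axis("path").reset_index()

# An outer join keeps paths found in either ZIM; the missing side becomes NaN.
comparison = june.merge(july, on="path", how="outer")

# Serialize with tabs, matching the three-column layout described above.
tsv_text = comparison.to_csv(sep="\t", index=False)
print(tsv_text)
```

The outer join is what makes items that exist in only one ZIM visible, at the cost of turning the size columns into floats so they can hold NaN.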

In [53]:
print(f'Total difference in file system {36892 - 26588} KB')

Total difference in file system 10304 KB
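
The same two sizes also give the relative growth. This is plain arithmetic on the figures listed at the top of the notebook; the variable names are only for illustration.

```python
june_kb = 26588
july_kb = 36892

growth_kb = july_kb - june_kb
# Growth relative to the June file, which is what "increase" normally means.
increase_vs_june = growth_kb / june_kb
# Growth as a share of the July file; a different, smaller ratio.
share_of_july = growth_kb / july_kb

print(f"Growth: {growth_kb} KB")
print(f"Increase relative to June: {increase_vs_june:.1%}")
print(f"Growth as a share of July: {share_of_july:.1%}")
```

The two ratios answer different questions, so it is worth stating which baseline a quoted percentage uses.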


In [55]:
import pandas as pd

df = pd.read_csv('comparison.tsv', sep='\t')
print('Total rows: ', df.shape[0])

df['diff'] = df['july'] - df['june']
in_both = df['diff'][(df['june'] > 0) & (df['july'] > 0)]
print('Rows in both ZIMs: ', in_both.shape[0])
in_both.describe()

Total rows:  11066
Rows in both ZIMs:  10315


count     10315.000000
mean      -1598.346389
std       10047.443497
min      -44852.000000
25%        -246.000000
50%         199.000000
75%        1084.000000
max      159898.000000
Name: diff, dtype: float64

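We see that on average, items actually got smaller between June and July: the mean difference is about -1598 bytes. The median, however, is positive (199 bytes), so at least half of the items grew. The negative mean must therefore come from a smaller group of items that shrank by a lot.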
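
One way to check how a negative mean and a positive median fit together is to split the differences by sign and compare counts and totals. This is a minimal sketch on made-up numbers; `sign_breakdown` is a hypothetical helper, and on the real data it would be called as `sign_breakdown(in_both)`.

```python
import pandas as pd

def sign_breakdown(diff: pd.Series) -> pd.DataFrame:
    """Count, total, and median of the positive, negative, and zero differences."""
    sign = diff.apply(lambda d: "grew" if d > 0 else ("shrank" if d < 0 else "unchanged"))
    return diff.groupby(sign).agg(["count", "sum", "median"])

# Toy data: most items grow a little, a few shrink a lot.
toy = pd.Series([200.0, 150.0, 300.0, 100.0, -5000.0, -4000.0, 0.0])
summary = sign_breakdown(toy)
print(summary)
print("mean:", toy.mean(), "median:", toy.median())
```

In the toy data more items grow than shrink, yet the mean is negative because the few items that shrink do so by much larger amounts.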

In [57]:
df['diff'].sum() / 1024

np.float64(-16100.5302734375)

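Surprisingly, the summed difference from June to July is negative: about -16100 KB (-15.7 MB)! Despite the file sizes, the items in the July ZIM add up to *less* data. This total only covers items with a size in both ZIMs; the remaining 751 rows (11066 - 10315) are not counted.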
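
Because `Series.sum()` skips NaN by default, rows for items found in only one ZIM have a NaN `diff` and drop out of that total. They can be tallied separately. A sketch on toy data, assuming the `june` and `july` columns hold NaN for missing items as described at the top of the notebook:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "path": ["A/kept", "A/removed", "A/added", "I/kept.webp"],
    "june": [1000.0, 400.0, np.nan, 2000.0],
    "july": [900.0, np.nan, 700.0, 2600.0],
})
toy["diff"] = toy["july"] - toy["june"]

only_june = toy["july"].isna() & toy["june"].notna()
only_july = toy["june"].isna() & toy["july"].notna()

changed_bytes = toy["diff"].sum()                 # NaN rows are skipped
removed_bytes = toy.loc[only_june, "june"].sum()  # present in June only
added_bytes = toy.loc[only_july, "july"].sum()    # present in July only
net_bytes = changed_bytes + added_bytes - removed_bytes

print(changed_bytes, removed_bytes, added_bytes, net_bytes)
```

Applied to the real `df`, the three terms would show whether added or removed items shift the overall balance.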

In [8]:
df.nlargest(10, 'diff')

,path,june,july,diff
9947,I/Lashkendar_temple_ruins.JPG.webp,10952.0,170850.0,159898.0
10321,I/RR5109-0021R_100-летие_Российского_футбола.gif,21630.0,179841.0,158211.0
10978,I/Большой_театр_1883.gif,19983.0,174006.0,154023.0
10227,I/Paliurus_fg01.jpg.webp,25804.0,170058.0,144254.0
11064,X/fulltext/xapian,7757824.0,7897088.0,139264.0
9740,I/Hovenia_dulcis.jpg.webp,26482.0,157642.0,131160.0
8677,I/Carmen_-_illustration_by_Luc_for_Journal_Amu...,18916.0,144298.0,125382.0
10607,I/Tsebelda_iconostasis.jpg.webp,18596.0,142832.0,124236.0
8415,I/Ambara_church_ruins_in_Abkhazia%2C_1899.jpg....,10896.0,119938.0,109042.0
8706,I/Christos_Acheiropoietos.jpg.webp,6150.0,111004.0,104854.0


In [9]:
df.nsmallest(10, 'diff')

,path,june,july,diff
3038,A/Анатуралтә_ахыҧхьаӡара,72555.0,27703.0,-44852.0
3039,A/Анатуралтә_ахыԥхьаӡара,72555.0,27703.0,-44852.0
4345,A/Иԥсабаратәу_ахыԥхьаӡара,72555.0,27703.0,-44852.0
3627,A/Белоруссиа_ақалақьқәа,209414.0,175008.0,-34406.0
3628,A/Белоруссиа_ақалақьқәа_рсиа,209414.0,175008.0,-34406.0
3630,A/Белорустәыла_ақалақьқәа,209414.0,175008.0,-34406.0
2904,A/Акириллица,34094.0,4251.0,-29843.0
4402,A/Кириллица,34094.0,4251.0,-29843.0
3860,A/Гъь,29269.0,2739.0,-26530.0
5808,A/Ѹ,29191.0,2677.0,-26514.0


In [41]:
webps = df[df['path'].str.endswith('.webp') & (df['june'] > 0) & (df['july'] > 0)]
print('Num webps', webps.shape[0])
webps['diff'].describe()

Num webps 2392


count      2392.000000
mean       2969.474080
std       12679.723609
min      -20758.000000
25%         -38.000000
50%           0.000000
75%           8.000000
max      159898.000000
Name: diff, dtype: float64

In [40]:
webps['diff'][webps['diff'] > 0].describe()

count       681.000000
mean      10661.333333
std       21949.767232
min           2.000000
25%          24.000000
50%          94.000000
75%       13028.000000
max      159898.000000
Name: diff, dtype: float64

In [44]:
webps['diff'][webps['diff'] < 0].describe()

count      868.000000
mean      -181.320276
std        723.304779
min     -20758.000000
25%       -242.000000
50%        -86.000000
75%        -28.000000
max         -2.000000
Name: diff, dtype: float64

In [17]:
large_diff_webps = webps[webps['diff'].abs() > 1000]
large_diff_webps['diff'].describe()

count       242.000000
mean      29735.652893
std       28186.760915
min      -20758.000000
25%       10289.500000
50%       21394.000000
75%       38107.000000
max      159898.000000
Name: diff, dtype: float64

In [23]:
webps['diff'].sum() / 1024

np.float64(6936.505859375)

The webp items present in both ZIMs grew by about 6936 KB (~6.77 MB) in total, which is roughly two-thirds of the 10304 KB discrepancy between the ZIM files. This isn't the whole story, though, as the non-webp items show.
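
The top-ten table above also contains `.gif` images, which the `.webp` filter leaves in the non-webp group. Grouping by namespace prefix and file extension attributes the difference to every kind of item at once. This is a sketch, assuming paths follow the `namespace/name` pattern seen in the tables (`A/`, `I/`, `X/`); `diff_by_kind` is a hypothetical helper, and its extension rule (a trailing dot followed by letters or digits) is a simplification that can misclassify article titles ending in such a suffix.

```python
import pandas as pd

def diff_by_kind(df: pd.DataFrame) -> pd.DataFrame:
    """Sum size differences in KB per (namespace, extension) pair."""
    both = df.dropna(subset=["june", "july"]).copy()
    parts = both["path"].str.split("/", n=1)
    both["namespace"] = parts.str[0]
    name = parts.str[-1]
    # Last dot-suffix of the name, lower-cased; "(none)" when there is no suffix.
    ext = name.str.extract(r"\.([A-Za-z0-9]+)$", expand=False)
    both["ext"] = ext.str.lower().fillna("(none)")
    both["diff"] = both["july"] - both["june"]
    out = both.groupby(["namespace", "ext"])["diff"].agg(["count", "sum"])
    out["sum"] = out["sum"] / 1024  # bytes -> KB
    return out.rename(columns={"sum": "diff_kb"}).sort_values("diff_kb")

# Toy rows shaped like comparison.tsv.
toy = pd.DataFrame({
    "path": ["A/Page", "I/a.jpg.webp", "I/b.gif", "X/fulltext/xapian", "I/c.JPG.webp"],
    "june": [2048.0, 1024.0, 1024.0, 4096.0, 1024.0],
    "july": [1024.0, 3072.0, 2048.0, 4096.0, 2048.0],
})
result = diff_by_kind(toy)
print(result)
```

On the real data this would be called as `diff_by_kind(df)`, giving one row per kind of item instead of a single webp versus non-webp split.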

In [37]:
non_webps = df[(~df['path'].str.endswith('.webp')) & (df['june'] > 0) & (df['july'] > 0)]
non_webps['diff'].describe()

count      7923.000000
mean      -2977.398082
std        8643.249370
min      -44852.000000
25%        -487.500000
50%         773.000000
75%        1084.000000
max      158211.000000
Name: diff, dtype: float64

In [None]:
# Sanity check: webp and non-webp items together cover every item present in both ZIMs.
print(non_webps.shape[0], webps.shape[0], in_both.shape[0])
assert non_webps.shape[0] + webps.shape[0] == in_both.shape[0]

7923 2392 10315

In [58]:
non_webps['diff'].sum() / 1024

np.float64(-23037.0361328125)

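So the difference attributable to non-webp files is about -22.5 MB (-23037 KB). One possible reason the ZIM file still grew: text in ZIM files is compressed on disk, so a drop in raw text size saves comparatively little space, while webps are already compressed before they are stored in the ZIM, so their growth shows up on disk almost in full.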
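
The effect can be illustrated with a general-purpose compressor. The sketch below uses `zlib` from the standard library purely as a stand-in, since which codec and settings these ZIMs were built with is not something this notebook checks, and random bytes stand in for an already-compressed image.

```python
import os
import zlib

# Repetitive markup, roughly like rendered article HTML.
row = '<tr><td class="cell">value</td><td class="cell">value</td></tr>\n'
text = (row * 2000).encode("utf-8")
# Random bytes behave like already-compressed image data: little redundancy left.
image_like = os.urandom(len(text))

text_ratio = len(zlib.compress(text)) / len(text)
image_ratio = len(zlib.compress(image_like)) / len(image_like)

print(f"text compresses to {text_ratio:.1%} of its raw size")
print(f"image-like data compresses to {image_ratio:.1%} of its raw size")
```

Real article HTML is far less repetitive than this toy input, so the ratio for actual text would be less extreme, but the asymmetry between the two kinds of data is the point.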

**Overall, there's no clear explanation for why the file sizes are so different, except possibly for some discrepancy in image processing (which was our initial guess).**