forked from Aloisius/ia-web-commons
Closed
The WAT extracts also include "data:" URLs in the "Links" list. Data URLs, especially those encoding images in Base64, occupy a non-trivial share of WAT storage. For many tasks, including the construction of webgraphs, they are useless.
This issue is mainly meant to reach a decision on whether to:
- continue to include "data:" URLs
- stop including them
- truncate "data:" URLs after x bytes
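To make the three options concrete, here is a minimal Python sketch of how each would treat a single link. It is an illustration only, not code from ia-web-commons: the function name `apply_data_url_policy`, the policy names, and the default limit of 64 bytes are all assumptions made for the example.

```python
from typing import Optional

DATA_SCHEME = "data:"


def apply_data_url_policy(url: str, policy: str = "truncate",
                          max_bytes: int = 64) -> Optional[str]:
    """Return the link to store, or None if the link should be dropped.

    policy is one of "keep", "drop", "truncate" (hypothetical names).
    max_bytes plays the role of "x" in the truncate option.
    """
    # URI schemes are case-insensitive, so compare the prefix in lower case.
    if url[:len(DATA_SCHEME)].lower() != DATA_SCHEME:
        return url  # links that are not data: URLs are never touched
    if policy == "keep":
        return url
    if policy == "drop":
        return None
    if policy == "truncate":
        encoded = url.encode("utf-8")
        if len(encoded) <= max_bytes:
            return url
        # Cut at the byte limit; a multi-byte character split by the cut
        # is discarded instead of raising a decoding error.
        return encoded[:max_bytes].decode("utf-8", errors="ignore")
    raise ValueError(f"unknown policy: {policy!r}")
```

With a limit of 22 bytes, for example, `data:image/gif;base64,R0lGODlhAQABAAAAACw=` is cut down to `data:image/gif;base64,`, so the media type stays visible while the payload is discarded.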
Counts of "data:" URL types from a single WAT file:
25602 data:image/gif;base64,
18019 data:image/svg+xml,
10560 data:image/png;base64,
7683 data:image/svg+xml;base64,
2988 data:text/javascript;base64,
785 data:image/svg+xml;charset=utf-8,
391 data:application/x-font-woff;charset=utf-8;base64,
264 data:image/webp;base64,
226 data:image/svg+xml;charset=UTF-8,
193 data:image/jpeg;base64,
192 data:image/svg+xml;charset=utf8,
146 data:image/jpg;base64,
128 data:text/javascript,
67 data:image/svg+xml;utf8,
44 data:image/svg+xml;charset=US-ASCII,
38 data:image/svg+xml;utf8;base64,
35 data:application/font-woff;charset=utf-8;base64,
35 data:TemplateStyles:r1238218222",
28 data:text/css;base64,
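A tally like the one above could be produced roughly as follows. This is a sketch under assumptions, not the method actually used for the counts: it presumes the link URLs have already been pulled out of the WAT records into an iterable of strings, and the helper names `data_url_prefix` and `count_data_url_types` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def data_url_prefix(url: str) -> Optional[str]:
    """Return the part of a data: URL up to and including the first comma.

    Returns None for links that are not data: URLs. A data: URL without
    any comma is returned unchanged.
    """
    if url[:5].lower() != "data:":
        return None
    header, comma, _payload = url.partition(",")
    return header + comma


def count_data_url_types(links: Iterable[str]) -> Counter:
    """Count data: URLs by their media-type prefix."""
    counts: Counter = Counter()
    for url in links:
        prefix = data_url_prefix(url)
        if prefix is not None:
            counts[prefix] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "data:image/gif;base64,AAAA",
        "data:image/gif;base64,BBBB",
        "data:image/svg+xml,<svg/>",
        "https://example.org/",
    ]
    # Print the most frequent prefixes first, in the same layout as the list.
    for prefix, n in count_data_url_types(sample).most_common():
        print(n, prefix)
```

Grouping by everything up to the first comma keeps the media type and encoding parameters together, which is why variants such as `charset=utf-8` and `charset=UTF-8` show up as separate rows.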