
WAT: do not include data URLs #48

@sebastian-nagel


The WAT extracts also include data: URLs in the "Links" list. Data URLs, especially those encoding images in Base64, take up a non-trivial amount of WAT storage. For many tasks, including the construction of webgraphs, they are useless.

This issue is mostly meant to reach a decision on whether to

  1. continue to include "data:" URLs
  2. stop including them
  3. truncate "data:" URLs after x bytes
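
To make the three options concrete, here is a minimal Python sketch applied to a list of link records. It is not the WAT extractor's code: the function names, the `policy` values, and the `max_bytes` default are all hypothetical, and it assumes each link is a dict with a `"url"` key, roughly as in the WAT "Links" list.

```python
from typing import Dict, List

POLICIES = ("keep", "drop", "truncate")  # hypothetical names for options 1-3


def is_data_url(url: str) -> bool:
    # URL schemes are case-insensitive, so compare in lower case.
    return url.lstrip().lower().startswith("data:")


def filter_links(links: List[Dict[str, str]], policy: str = "drop",
                 max_bytes: int = 64) -> List[Dict[str, str]]:
    """Apply one of the three options to a list of link records."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    result = []
    for link in links:
        url = link.get("url", "")
        if policy == "keep" or not is_data_url(url):
            result.append(link)  # option 1, or not a data: URL at all
        elif policy == "truncate":
            # Option 3: cut after max_bytes of the UTF-8 encoding and
            # discard a trailing partial multi-byte character, if any.
            cut = url.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
            result.append({**link, "url": cut})
        # policy == "drop" (option 2): skip the link entirely
    return result
```

With option 3, a limit of a few dozen bytes would usually keep the media type prefix (for example `data:image/gif;base64,`) while discarding the payload, so the type information stays available for statistics.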

Counts of "data:" URL types from one single WAT file:

25602   data:image/gif;base64,
18019   data:image/svg+xml,
10560   data:image/png;base64,
7683    data:image/svg+xml;base64,
2988    data:text/javascript;base64,
785     data:image/svg+xml;charset=utf-8,
391     data:application/x-font-woff;charset=utf-8;base64,
264     data:image/webp;base64,
226     data:image/svg+xml;charset=UTF-8,
193     data:image/jpeg;base64,
192     data:image/svg+xml;charset=utf8,
146     data:image/jpg;base64,
128     data:text/javascript,
67      data:image/svg+xml;utf8,
44      data:image/svg+xml;charset=US-ASCII,
38      data:image/svg+xml;utf8;base64,
35      data:application/font-woff;charset=utf-8;base64,
35      data:TemplateStyles:r1238218222",
28      data:text/css;base64,
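
Counts like the ones above could be reproduced with a short Python sketch along these lines. This is an assumption about the method, not the command actually used for the table: it takes an iterable of URL strings (how they are read from the WAT file is left out) and treats everything up to and including the first comma as the "type" prefix. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def data_url_prefix(url: str) -> Optional[str]:
    """Return the part of a data: URL up to and including the first comma."""
    if not url.lower().startswith("data:"):
        return None
    head, sep, _payload = url.partition(",")
    # If there is no comma, sep is empty and the whole URL is returned.
    return head + sep


def count_data_url_types(urls: Iterable[str]) -> Counter:
    """Count data: URLs by prefix, keeping the original letter case."""
    prefixes = (data_url_prefix(u) for u in urls)
    return Counter(p for p in prefixes if p is not None)


if __name__ == "__main__":
    sample = [
        "data:image/gif;base64,R0lGODlhAQABAAAAACw=",
        "data:image/svg+xml,<svg/>",
        "https://example.com/",
    ]
    for prefix, n in count_data_url_types(sample).most_common():
        print(f"{n}\t{prefix}")
```

Because the prefix is not normalized, variants such as `charset=utf-8` and `charset=UTF-8` are counted separately, as in the table.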
