Lazy loading #87

Merged
merged 21 commits into from
Jan 28, 2019

Collaborator

@remusao remusao commented Jan 15, 2019

This PR implements a different, more compact representation for the engine. In a nutshell, it is now possible to load the serialized version of the Engine (as a typed array) and immediately start using it to match resources, with almost no loading cost. Filters and resources are loaded lazily when they are needed, and some are cached in memory (e.g. the small subset of filters needed on the visited domains).
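
To illustrate the idea, here is a simplified sketch of lazy access over a typed array. It is not the layout used in this PR: the `serializeBuckets` helper, the `LazyBuckets` class and the bucket format are all made up for the example. "Loading" only stores a reference to the array, and a bucket is decoded and cached the first time it is requested.

```typescript
// Hypothetical layout (an assumption for this sketch, not the engine's format):
//   [0]             number of buckets N
//   [1 .. N]        offset of each bucket inside the array
//   at each offset: length L, followed by L filter ids
function serializeBuckets(buckets: number[][]): Uint32Array {
  const total =
    1 + buckets.length + buckets.reduce((n, b) => n + 1 + b.length, 0);
  const out = new Uint32Array(total);
  out[0] = buckets.length;
  let cursor = 1 + buckets.length;
  buckets.forEach((bucket, i) => {
    out[1 + i] = cursor; // remember where this bucket starts
    out[cursor++] = bucket.length;
    for (const id of bucket) {
      out[cursor++] = id;
    }
  });
  return out;
}

class LazyBuckets {
  private readonly view: Uint32Array;
  private readonly cache = new Map<number, number[]>();

  // Construction does no decoding: it only keeps a reference to the array.
  constructor(view: Uint32Array) {
    this.view = view;
  }

  get cachedCount(): number {
    return this.cache.size;
  }

  // Decode a bucket on first access, then serve it from the cache.
  get(index: number): number[] {
    const cached = this.cache.get(index);
    if (cached !== undefined) {
      return cached;
    }
    const offset = this.view[1 + index];
    const length = this.view[offset];
    const bucket = Array.from(
      this.view.subarray(offset + 1, offset + 1 + length),
    );
    this.cache.set(index, bucket);
    return bucket;
  }
}

const serialized = serializeBuckets([[1, 2, 3], [], [42]]);
const lazy = new LazyBuckets(serialized); // nothing decoded yet
console.log(lazy.get(2), lazy.cachedCount); // only bucket 2 is in memory
```

In this sketch, memory grows only with the buckets actually touched, which is the same kind of behaviour the memory benchmarks in this description measure before and after browsing.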

Changelog:

  • [BREAKING] serialization module has been removed; instead, each class now
    provides a serialize method as well as a static deserialize method.
  • [BREAKING] FiltersEngine now exposes different methods for update:
    update, which expects a diff of filters, updateList and
    updateResources. This API should be clearer and allows using the
    adblocker without managing filter lists.
  • [BREAKING] ReverseIndex's API dropped the use of a callback to specify
    filters and instead expects a list of filters.
  • [BREAKING] parsing and matching filters can now be done using methods of
    the filter classes directly instead of free functions. For example,
    NetworkFilter has parse and match methods (with the same expected
    arguments).
  • ReverseIndex is now implemented using a very compact
    representation (stored in a typed array).
  • toString method of filters should now be more accurate.
  • Addition of numerous unit tests (coverage is now >90%)
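
As a rough illustration of the first changelog item, the sketch below shows the general shape of a class with a `serialize` method and a static `deserialize` method writing to and reading from a typed array. `ExampleFilter`, its two fields and the method signatures are assumptions made for this example; the actual signatures in the code base may differ.

```typescript
// Hypothetical class: the name, fields and buffer layout are illustrative only.
class ExampleFilter {
  // Number of 32-bit slots one filter occupies in the buffer.
  static readonly SLOTS = 2;

  // Rebuild a filter from the slots starting at `offset`.
  static deserialize(buffer: Uint32Array, offset: number): ExampleFilter {
    return new ExampleFilter(buffer[offset], buffer[offset + 1]);
  }

  readonly mask: number;
  readonly hostnameHash: number;

  constructor(mask: number, hostnameHash: number) {
    // Force both values into the unsigned 32-bit range used by the buffer.
    this.mask = mask >>> 0;
    this.hostnameHash = hostnameHash >>> 0;
  }

  // Write this filter at `offset` and return the offset of the next free slot.
  serialize(buffer: Uint32Array, offset: number): number {
    buffer[offset] = this.mask;
    buffer[offset + 1] = this.hostnameHash;
    return offset + ExampleFilter.SLOTS;
  }
}

const filters = [
  new ExampleFilter(0b101, 0xdeadbeef),
  new ExampleFilter(0b010, 12345),
];
const buffer = new Uint32Array(filters.length * ExampleFilter.SLOTS);
let offset = 0;
for (const filter of filters) {
  offset = filter.serialize(buffer, offset);
}

// Read back only the second filter, without touching the first one.
const restored = ExampleFilter.deserialize(buffer, ExampleFilter.SLOTS);
console.log(restored.mask, restored.hostnameHash);
```

Keeping both directions on the class itself means the code that knows the field layout lives in one place, which is the motivation for dropping a separate serialization module.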

All the benchmarks below are using one list: Easylist. Its raw size is 2.7 MB.

Loading time benchmarks

  • adblocker with lazy loading in Chrome + Easylist loaded:

    • size of serialized engine: 3.1 MB
    • time to deserialize: 1-5 ms (loading time!)
    • time to parse filters + serialize: 1000 ms
  • adblocker on master branch:

    • size of serialized engine: 3.3 MB
    • time to deserialize: 280-400 ms
    • time to parse filters + serialize: 980 ms

Memory usage benchmarks

Memory usage compared with the master branch and with other popular content blockers. These measurements were performed with only Easylist loaded:

  • adblocker with lazy loading (this PR):
    • initial: 5.9 MB
    • after browsing[1]: 9.8 MB
    • after restart: 5.9 MB

Note: 1.3 MB is used by tldts.

  • adblocker on master branch:

    • initial: 27 MB
    • after browsing[1]: 28 MB
    • after restart: 24 MB
  • uBlock Origin

    • initial: 11.8 MB
    • after browsing[1]: 13.2 MB
    • after restart: 11.6 MB
  • AdblockPlus

    • initial: 14.6 MB (after updating the lists, memory spiked to around 100 MB, then 70 MB after GC, which might suggest a memory leak?)
    • after browsing[1]: 18.6 MB

[1] After browsing a few pages: spiegel.de, bild.de, lemonde.fr.

Synthetic benchmarks (Node.js)

These are the results from the test suite running in Node.js:

  • adblocker lazy loading:

    • cosmetic filters parsing: 16 ops/sec
    • network filters parsing: 15 ops/sec
    • engine init: 2 ops/sec
    • string hashing: 88 ops/sec
    • string tokenizing: 15 ops/sec
    • engine serialization: 798 ops/sec
    • engine deserialization: 2412 ops/sec
    • request matching: 0.028 ms/request (average)
  • adblocker on master branch:

    • cosmetic filters parsing: 36 ops/sec
    • network filters parsing: 16 ops/sec
    • engine init: 3 ops/sec
    • string hashing: 88 ops/sec
    • string tokenizing: 15 ops/sec
    • engine serialization: 55 ops/sec
    • engine deserialization: 9 ops/sec
    • request matching: 0.008 ms/request (average)

We observe that the time to initialize the engine from scratch (parsing the lists, etc.) is higher with lazy loading. The performance is still reasonable, though, and this trade-off is acceptable:

  • Initialization happens rarely
  • We can circumvent it by shipping the serialized array directly to clients for instant loading

We also see that the raw performance of matching requests is now ~3.5 times slower (0.028 ms vs. 0.008 ms per request). This is hopefully a temporary trade-off and there are ways to regain the initial performance there as well (in a subsequent PR, as this one is already quite massive).
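
For reference, the ratios behind these observations can be recomputed from the Node.js numbers listed above. The snippet below is only arithmetic on those figures; the `BenchmarkRow` type and the variable names are made up for the example.

```typescript
interface BenchmarkRow {
  name: string;
  lazy: number; // ops/sec with lazy loading (this PR)
  master: number; // ops/sec on the master branch
}

// Numbers copied from the synthetic benchmark results in this description.
const opsPerSec: BenchmarkRow[] = [
  { name: "cosmetic filters parsing", lazy: 16, master: 36 },
  { name: "engine init", lazy: 2, master: 3 },
  { name: "engine serialization", lazy: 798, master: 55 },
  { name: "engine deserialization", lazy: 2412, master: 9 },
];

// A value above 1 means lazy loading is faster, below 1 means it is slower.
function speedup(row: BenchmarkRow): number {
  return row.lazy / row.master;
}

// Request matching is reported in ms/request, so here lower is better.
const matchingMsLazy = 0.028;
const matchingMsMaster = 0.008;
const matchingSlowdown = matchingMsLazy / matchingMsMaster;

for (const row of opsPerSec) {
  console.log(`${row.name}: ${speedup(row).toFixed(2)}x`);
}
console.log(`request matching: ${matchingSlowdown.toFixed(1)}x slower`);
```

Deserialization comes out 268 times faster (2412 / 9), while request matching is roughly 3.5 times slower (0.028 / 0.008).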

@remusao remusao added the WIP label Jan 15, 2019
@remusao remusao force-pushed the lazy-loading branch 2 times, most recently from 1a5b4b2 to ed42a77 Compare January 15, 2019 14:06
@remusao remusao merged commit 2b90a75 into ghostery:master Jan 28, 2019
@remusao remusao mentioned this pull request Jan 28, 2019