feat: implement $replace modifier, improve option parsing #3897

Draft
wants to merge 6 commits into
base: master

seia-soto
Member

@seia-soto seia-soto commented Apr 9, 2024

fixes #3886, built on top of #3887

https://adguard.com/kb/general/ad-filtering/create-own-filters/#replace-modifier

also includes:


Why rawOptions?

$replace is not the only option we need to handle for the HTML filtering capability of network filters. Most network filter options of this kind carry a pattern definition in their value. Keeping the raw option value avoids adding a new field for each of them in the future.

@seia-soto seia-soto changed the title from "fix: properly find the filter options index" to "feat: implement $replace modifier" Apr 9, 2024
@seia-soto
Member Author

Note: only option was set in the test.

@seia-soto
Member Author

seia-soto commented Apr 9, 2024

  • Parsing $replace modifier
  • Validating components of the $replace modifier
  • Integrate with html filtering api (StreamingHtmlFilter)
  • E2E Testing (moved to Add E2E testing for adblockers #4050)

const [, rawRegexp, replacement, modifiers] = splitUnescaped(optionValue, '/');
const regexp = new RegExp(rawRegexp, modifiers);

return [regexp, replacement];
Member Author

@seia-soto seia-soto Apr 9, 2024

If a filter does not need to handle multiple HTML filtering modifiers, we can build the regex and replacement at parse time instead of at runtime, and throw an error if multiple HTML filtering modifiers are found.
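To make the parse-time idea concrete, here is a minimal sketch of turning a `$replace` value of the form `/regexp/replacement/modifiers` into a `[RegExp, string]` pair. It is not the code from this PR: `parseReplaceValue` and this version of `splitUnescaped` are written for illustration only, and the sketch returns `null` instead of throwing when the value is malformed.

```typescript
type HTMLModifier = [RegExp, string];

// Illustrative helper: split `text` on every `separator` that is not escaped
// with a backslash. Escape sequences are kept verbatim in the output parts.
function splitUnescaped(text: string, separator: string): string[] {
  const parts: string[] = [];
  let current = '';
  for (let i = 0; i < text.length; i += 1) {
    const char = text[i];
    if (char === '\\' && i + 1 < text.length) {
      // Copy the backslash and the escaped character, then skip past both.
      current += char + text[i + 1];
      i += 1;
    } else if (char === separator) {
      parts.push(current);
      current = '';
    } else {
      current += char;
    }
  }
  parts.push(current);
  return parts;
}

// Illustrative parser: validate the option value once, at parse time.
function parseReplaceValue(optionValue: string): HTMLModifier | null {
  const parts = splitUnescaped(optionValue, '/');
  // A well-formed value splits into ['', regexp, replacement, modifiers].
  if (parts.length !== 4 || parts[0] !== '' || parts[1] === '') {
    return null;
  }
  const [, rawRegexp, replacement, modifiers] = parts;
  try {
    return [new RegExp(rawRegexp, modifiers), replacement];
  } catch {
    // Invalid pattern or flags: reject the filter instead of failing later.
    return null;
  }
}

const parsed = parseReplaceValue('/video\\.maxPop/0/');
console.log(parsed);
```

Note that this sketch leaves escape sequences in the replacement untouched, so a replacement containing `\/` would need an extra unescaping step before use.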

@chrmod chrmod added the PR: New Feature 🚀 Increment minor version when merged label Apr 15, 2024
@remusao
Collaborator

remusao commented Apr 22, 2024

@seia-soto Could you share more information on the $replace option? In particular:

  1. How many filters rely on this option currently?
  2. Can you list some (or all) of them as examples explaining the behavior?

Thanks, I think it will help review the implementation.

@@ -2444,4 +2444,12 @@ describe('scriptlets arguments parsing', () => {
);
}
});

it('parses replace modifier', () => {
Collaborator

It would be nice to add more tests since anything related to HTML filtering is critical to get right. If possible, let's add one test case for each existing filter in the lists used so far (I assume there are not so many). Let's particularly try to find corner cases of the behavior of this new $replace option.

We also need more end-to-end tests on the whole HTML filtering.

packages/adblocker/src/filters/network.ts (outdated, resolved)
return nextIndex;
}

function splitUnescaped(text: string, character: string) {
Collaborator

Let's make sure we add enough test cases to cover this (and above) functions.

return null;
}

public isHtmlFilteringRule(): boolean {
Collaborator

Thought: I wonder if it could make sense to also use the NetworkFilter abstraction to represent the ^script-text html filters?

Member Author

I was also thinking of a similar system. However, I don't know whether it's possible to build a proper regexp filtering system on streaming data.

Collaborator

The current implementation already supports regexps from what I can tell.

For example this filter (already supported):

##^script:has-text(/innerHTML.*appendChild/)

Would be equivalent to (if my understanding is correct):

$replace=/innerHTML.*appendChild//

So the current mechanism needs to be extended but it seems doable without changing to a different framework.

Member Author

As far as I know, the current implementation buffers each script tag in full until it is closed, so we know where that part starts and ends. However, $replace is applied to the full content, which means we might need the full text.
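The concern about streaming can be illustrated with a small sketch. It compares applying a regexp to each chunk as it arrives with applying it to the fully buffered text. The function names and the sample pattern are hypothetical and are not taken from StreamingHtmlFilter.

```typescript
// Apply the replacement to each chunk independently, as a naive streaming
// implementation might do.
function replacePerChunk(chunks: string[], regexp: RegExp, replacement: string): string {
  return chunks.map((chunk) => chunk.replace(regexp, replacement)).join('');
}

// Buffer everything first, then apply the replacement to the full text.
function replaceBuffered(chunks: string[], regexp: RegExp, replacement: string): string {
  return chunks.join('').replace(regexp, replacement);
}

// The match "foo-bar" is split across the chunk boundary.
const chunks = ['<script>foo-', 'bar</script>'];
const pattern = /foo-bar/g;

console.log(replacePerChunk(chunks, pattern, '')); // match is missed
console.log(replaceBuffered(chunks, pattern, '')); // match is removed
```

A match that spans a chunk boundary is only found once both halves are in the same buffer, which is why a full-text or carefully windowed approach is needed for this kind of modifier.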

@seia-soto
Member Author

@seia-soto Could you share more information on the $replace option? In particular:

  1. How many filters rely on this option currently?
  2. Can you list some (or all) of them as examples explaining the behavior?

Thanks, I think it will help review the implementation.

@remusao From the perspective of the adblocker library, $replace matters much less than implementing the full HTML filtering capability. I'd like to build an integrated system for HTML filtering because most of it is handled with regexps. From the perspective of the Ghostery product, however, this is high priority since these filters play an important role in blocking on YouTube.

@chrmod
Member

chrmod commented Apr 29, 2024

All uBO filters using replace that are currently in use:

filters/filters-2024.txt
30:||alliptvlinks.com/tktk-content/plugins/$script,1p,replace=/\bconst now.+?, 100/clearInterval(timer);resolve();}, 100/gms
579:/theme/002/js/application.js?2.0|$script,1p,replace=/video\.maxPop/0/

filters/unbreak.txt
4802:||s3media.247sports.com/Scripts/Bundle/*/videoPlayer.js^$script,1p,replace=/;if\(!\([a-z]+\|\|\(null===[^{]+/;if(false)/

filters/filters-2023.txt
2931:||dehlinks.ir/link_download.php?Mozojadid_Id=$doc,replace=/content="15;/content="0;/
3017:||rekidai-info.github.io/_app/immutable/components/pages/index/_page.svelte-$script,replace=/try\{.*?catch.*?push\(\)\}catch\{//
3018:||rekidai-info.github.io/_app/immutable/components/pages/index/_page.svelte-$script,replace=/throw new Error\("Error Loading Rekidai Data."\)\}throw new Error\("Ad block detected."\)//
5289:||veev.to/assets/videoplayer/*.js$script,replace=/\bhttps:\/\/pagead2\.googlesyndication\.com\/pagead\/js\/adsbygoogle\.js/https:\/\/veev.to\/assets\/videoplayer\/17c088d.js/

filters/filters-2022.txt
3502:||theappstore.org/script.js?v=$script,1p,replace=/result\.length \> 10000/result.length < 10000/g
3606:/loader.min.js$xhr,script,domain=loawa.com|ygosu.com|sportalkorea.com|enetnews.co.kr|edaily.co.kr|economist.co.kr|etoday.co.kr|hankyung.com|isplus.com|hometownstation.com|inven.co.kr|honkailab.com|warcraftrumbledeck.com|genshinlab.com|thestockmarketwatch.com|thephoblographer.com|issuya.com|dogdrip.net|worldhistory.org|bamgosu.site,replace=/\)\{var [a-z]{1,2},[a-z]{1,2},[a-z]{1,2},[a-z]{1,2}\=[a-z]{2};return [a-z]\(\)/){return;/g
3607:/loader.min.js$xhr,script,domain=loawa.com|ygosu.com|sportalkorea.com|enetnews.co.kr|edaily.co.kr|economist.co.kr|etoday.co.kr|hankyung.com|isplus.com|hometownstation.com|inven.co.kr|honkailab.com|warcraftrumbledeck.com|genshinlab.com|thestockmarketwatch.com|thephoblographer.com|issuya.com|dogdrip.net|worldhistory.org|bamgosu.site,replace=/\)\{var [a-z]{1,2},[a-z]{1,2},[a-z]{1,2};.*?return [a-z]\(\)/){return; return c()/g
3608:/loader.min.js$xhr,script,domain=loawa.com|ygosu.com|sportalkorea.com|enetnews.co.kr|edaily.co.kr|economist.co.kr|etoday.co.kr|hankyung.com|isplus.com|hometownstation.com|inven.co.kr|honkailab.com|warcraftrumbledeck.com|genshinlab.com|thestockmarketwatch.com|thephoblographer.com|issuya.com|dogdrip.net|worldhistory.org,replace=/\.mark\(\(function [a-z0-9]{1,2}\([a-z0-9]{1,2},[a-z0-9]{1,2}\){var.*\]\]\)\}\)\)\),/.mark((function neutralized(a,b){var none = false;}))),/g
4298:||bitcotasks.com/assets/js/mainjs.php$script,1p,replace=/entry.duration > 0/entry.duration < 10/

filters/quick-fixes.txt
129:||d3lj2s469wtjp0.cloudfront.net/build/js/public/$script,3p,replace=/\{try\{.*?clip-path.*?catch\(/{try{}catch(/,domain=puzzle-loop.com|puzzle-words.com|puzzle-chess.com|puzzle-thermometers.com|puzzle-norinori.com|puzzle-minesweeper.com|puzzle-slant.com|puzzle-lits.com|puzzle-galaxies.com|puzzle-tents.com|puzzle-battleships.com|puzzle-pipes.com|puzzle-hitori.com|puzzle-heyawake.com|puzzle-shingoki.com|puzzle-masyu.com|puzzle-stitches.com|puzzle-aquarium.com|puzzle-tapa.com|puzzle-star-battle.com|puzzle-kakurasu.com|puzzle-skyscrapers.com|puzzle-futoshiki.com|puzzle-shakashaka.com|puzzle-kakuro.com|puzzle-jigsaw-sudoku.com|puzzle-killer-sudoku.com|puzzle-binairo.com|puzzle-nonograms.com|puzzle-sudoku.com|puzzle-light-up.com|puzzle-bridges.com|puzzle-shikaku.com|puzzle-nurikabe.com|puzzle-dominosa.com
139:||statics.1mv.xyz/statics/*.js|$script,3p,replace=/;return _0x[a-z0-9]+\['[_a-z]+'\]\['s'\]/;return false/
140:||statics.1mv.xyz/statics/*.js|$script,3p,replace=/;if\(null!==\(_0x[a-z0-9]+=this\['[_a-z]+'\]\)[^)]+\)return;/;if(true)return;/
153:||in-jpn.com^$script,replace=/var w_status[\s\S\n]+?doSakigake\(\);[\s\S\n]+?\}//,badfilter
154:||in-jpn.com^$script,replace=/var w_\w+[\s\S\n]+?doSakigake\(\);[\s\S\n]+?\}//

filters/annoyances-others.txt
396:||www.facebook.com/api/graphql/$xhr,replace=/\{"brs_content_label":[^,]+,"category":"ENGAGEMENT[^\n]+"cursor":"[^"]+"\}/{}/g
7177:||solarmovie.vip/js/$script,1p,replace=/\(\{checkers\:.*?\]\}\)/({checkers:[]})/g
7484:||tver.jp/_next/static/chunks/$replace=/e\?(e\(\):\(n\.play\(\))/!1?\$1/,script

filters/filters.txt
25:||www.youtube.com/playlist?list=$xhr,1p,replace=/"adPlacements.*?([A-Z]"\}|"\}{2\,4})\}\]\,//
26:||www.youtube.com/playlist?list=$xhr,1p,replace=/"adSlots.*?\}\]\}\}\]\,//
27:||www.youtube.com/watch?v=$xhr,1p,replace=/"adPlacements.*?([A-Z]"\}|"\}{2\,4})\}\]\,//
28:||www.youtube.com/watch?v=$xhr,1p,replace=/"adSlots.*?\}\]\}\}\]\,//
29:||www.youtube.com/youtubei/v1/player?$xhr,1p,replace=/"adPlacements.*?([A-Z]"\}|"\}{2\,4})\}\]\,//
30:||www.youtube.com/youtubei/v1/player?$xhr,1p,replace=/"adSlots.*?\}\]\}\}\]\,//
489:||www.facebook.com/api/graphql/$xhr,replace=/\{"brs_content_label":[^,]+,"category":"SPONSORED"[^\n]+"cursor":"[^"]+"\}/{}/
490:||www.facebook.com/api/graphql/$xhr,replace=/\{"node":\{"role":"SEARCH_ADS"[^\n]+?cursor":[^}]+\}/{}/g
491:||www.facebook.com/api/graphql/$xhr,replace=/\{"node":\{"__typename":"MarketplaceFeedAdStory"[^\n]+?"cursor":(?:null|"\{[^\n]+?\}"|[^\n]+?MarketplaceSearchFeedStoriesEdge")\}/{}/g
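As a rough illustration of what these filters do once parsed, the sketch below applies two of the simpler patterns from the list to made-up response bodies. The `applyModifiers` helper and both sample inputs are assumptions for the example; only the patterns and replacements come from the filters above.

```typescript
type HTMLModifier = [RegExp, string];

// Hypothetical helper: apply each [regexp, replacement] pair in order.
function applyModifiers(text: string, modifiers: HTMLModifier[]): string {
  return modifiers.reduce(
    (acc, [regexp, replacement]) => acc.replace(regexp, replacement),
    text,
  );
}

// From: ||dehlinks.ir/...$doc,replace=/content="15;/content="0;/
const refresh: HTMLModifier = [new RegExp('content="15;'), 'content="0;'];

// From: ||theappstore.org/script.js?v=$script,1p,replace=/result\.length \> 10000/result.length < 10000/g
const threshold: HTMLModifier = [
  new RegExp('result\\.length \\> 10000', 'g'),
  'result.length < 10000',
];

// Made-up inputs that merely resemble what such filters target.
const doc = '<meta http-equiv="refresh" content="15;url=/next">';
const script = 'if (result.length > 10000) { block(); }';

console.log(applyModifiers(doc, [refresh]));
console.log(applyModifiers(script, [threshold]));
```

In the first case the meta refresh delay drops from 15 seconds to 0; in the second the comparison is inverted so the guarded branch is no longer taken for large results.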

@seia-soto
Member Author

Just a note: better filter selection should be done from performHTMLFiltering

@chrmod
Member

chrmod commented Apr 29, 2024

Current matching logic in Ghostery 10 for Firefox:
https://github.com/ghostery/ghostery-extension/blob/d2542406174fb59ff939095b6d6d925bea79a3b9/extension-manifest-v3/src/background/adblocker.js#L356

will have to be changed from:

  1. for main frames - apply html cosmetic filters
  2. for all other - match network filter

to:

  1. match all network filters and html cosmetic filters
  2. for main frames - apply html cosmetic filters and html network filters
  3. for all other - block if block network filter matched, filter html if any html filter matched
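The proposed flow can be sketched as a small decision function. This is only one reading of the three steps above: the `MatchResult` and `Action` shapes and the `decide` function are hypothetical and do not correspond to the Ghostery extension code or to the adblocker library API.

```typescript
type HTMLModifier = [RegExp, string];

// Hypothetical result of step 1: everything that matched for a request.
interface MatchResult {
  blocked: boolean; // a blocking network filter matched
  htmlSelectors: string[]; // matched HTML cosmetic filters
  htmlModifiers: HTMLModifier[]; // matched HTML network filters such as $replace
}

type Action =
  | { type: 'block' }
  | { type: 'filter-html'; selectors: string[]; modifiers: HTMLModifier[] }
  | { type: 'allow' };

function hasHtmlFilters(match: MatchResult): boolean {
  return match.htmlSelectors.length > 0 || match.htmlModifiers.length > 0;
}

function decide(isMainFrame: boolean, match: MatchResult): Action {
  const filterHtml: Action = {
    type: 'filter-html',
    selectors: match.htmlSelectors,
    modifiers: match.htmlModifiers,
  };
  if (isMainFrame) {
    // Step 2: main frames get HTML cosmetic filters and HTML network filters.
    return hasHtmlFilters(match) ? filterHtml : { type: 'allow' };
  }
  // Step 3: other requests are blocked if a blocking filter matched,
  // otherwise their body is filtered when any HTML filter matched.
  if (match.blocked) {
    return { type: 'block' };
  }
  return hasHtmlFilters(match) ? filterHtml : { type: 'allow' };
}

console.log(decide(false, { blocked: false, htmlSelectors: [], htmlModifiers: [[/a/, 'b']] }));
```

The main difference from the current logic is that matching happens up front for every request, so a sub-resource can end up with its response rewritten instead of being either blocked or left alone.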

@seia-soto
Member Author

Note: This PR requires additional updates coming from: seia-soto#3

@seia-soto seia-soto force-pushed the support-replace-mod branch 6 times, most recently from 2d34e34 to 18bd35b, June 6, 2024 13:44
@seia-soto seia-soto changed the title from "feat: implement $replace modifier" to "feat: implement $replace modifier, improve option parsing" Jun 13, 2024
@seia-soto
Member Author

seia-soto commented Jun 13, 2024

This also requires a performance improvement.

The previous result is not valid because the logic had a parsing issue. Below are the corrected values.

Ratio offset from initialisation time: 1.1422831189.
Corrected value for the list parsing time of ghostery:master: 111,494.7324346903.

(seia-soto:support-replace-mod)

Avg serialization time (100 samples): 193.007 μs
Avg deserialization time (100 samples): 35.182 μs
Serialized size: 1,866 KiB
List parsing time: 140,669.292 μs (~26% incr)
Initialization time: 23,179.875 μs

(ghostery:master)
Avg serialization time (100 samples): 270.544 μs
Avg deserialization time (100 samples): 22.112 μs
Serialized size: 1,866 KiB
List parsing time: 97,606.916 μs
Initialization time: 20,292.583 μs

*sha:dc9d1b6a0711c22464c6e3006338165cc26d4ba2
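The corrected figures appear to follow from scaling the master parsing time by the ratio of the two initialization times. The sketch below reproduces that arithmetic from the numbers reported above; the variable names are illustrative, and the derivation itself is an assumption that happens to match the stated values.

```typescript
// Values copied from the benchmark output above (microseconds).
const initBranch = 23179.875; // seia-soto:support-replace-mod
const initMaster = 20292.583; // ghostery:master
const parseBranch = 140669.292;
const parseMaster = 97606.916;

// Ratio offset from initialization time (reported as 1.1422831189).
const ratio = initBranch / initMaster;

// Master parsing time scaled by that ratio (reported as 111,494.7324346903).
const correctedParseMaster = parseMaster * ratio;

// Relative increase of the branch over the corrected master value.
const increasePercent = (parseBranch / correctedParseMaster - 1) * 100;

console.log(ratio.toFixed(6), correctedParseMaster.toFixed(2), increasePercent.toFixed(1));
```

Under this reading, the "~26% incr" figure compares the branch against the corrected master value rather than against the raw 97,606.916 μs measurement.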

packages/adblocker-webextension/adblocker.ts (outdated, resolved)
packages/adblocker/test/utils.test.ts (resolved)
packages/adblocker/test/parsing.test.ts (outdated, resolved)
it('removes matching modifiers', () => {
const modifiers: HTMLModifier[] = [[new RegExp('"trackingParam":"(\\w+)"'), '"$1":""']];

expect(filter(`{"trackingParam":"a"}`, [], modifiers)).to.be.eql(`{"a":""}`);
Collaborator

Let's have more tests with fragments that look like real HTML.

packages/adblocker/src/utils.ts (outdated, resolved)
packages/adblocker-webextension/adblocker.ts (outdated, resolved)
packages/adblocker/src/filters/network.ts (outdated, resolved)
packages/adblocker/src/filters/network.ts (outdated, resolved)
@@ -212,6 +217,12 @@ describe('html-filtering', () => {
).to.equal(`
<!DOCTYPE html><html lang="en"><head><div id="2x-container"><<script src="https://www.redditstatic.com/desktop2x/js/ads.js"></script><script id="data">window.___r = {"accountManagerModalData":{}},;</script><script defer="" src="https://www.example.com/foo.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/RedesignContentFonts.509eef5d33306bd3b0d5.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/RedesignOldContentFonts.e450653685d17337cac6.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/vendors~Chat~Governance~Reddit.503ee0c2d353daa60d6e.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/vendors~Governance~Reddit.7e2adb288af56de67f65.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/vendors~Poll~Reddit.5f77a82de48fbb3beb21.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/vendors~EconHelperActions~Reddit.ae3c9f7d5b30b3be7151.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/vendors~Reddit.bb2ade21a865dbd52f3f.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/Chat~Governance~Reddit.19024d94a81678cf79e8.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/Governance~Reddit.98e55a3111b273b2f5dd.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/AdminCommunityTopics~Reddit.f091b12b417d6343dc18.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/Reddit.dc10f78afef6b219b26f.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/vendors~CollectionCommentsPage~CommentsPage~Explore~Frontpage~GovernanceReleaseNotesModal~ModListing~afc2720f.c6d86939d4bd0e144927.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/vendors~Chat~ChatMessageInput~CollectionCommentsPage~CommentsPage~Frontpage~PostCreation~RedesignCha~0aefb917.e6923ac4e90b854a1995.js"></script><script defer="" 
src="https://www.redditstatic.com/desktop2x/ChatMessageInput~ChatPost~CollectionCommentsPage~CommentsPage~Explore~Frontpage~GovernanceReleaseNot~3a34166c.dab3a37bed364deddf0e.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/ChatPost~CollectionCommentsPage~CommentsPage~Explore~Frontpage~GovernanceReleaseNotesModal~ModListin~44a849ee.85ccf598d319cc749b92.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/CollectionCommentsPage~CommentsPage~Explore~Frontpage~ModListing~ModQueuePages~ModerationPages~Multi~33b955cc.9c1942fb8eb4378e467c.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/CollectionCommentsPage~CommentsPage~Explore~Frontpage~GovernanceReleaseNotesModal~ModListing~ModQueu~900871b8.509ec80000f4968bb546.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/ChatPost~CollectionCommentsPage~CommentsPage~Frontpage~ModListing~ModQueuePages~ModerationPages~Mult~8849df7b.067fda741fb3f181c83f.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/CollectionCommentsPage~CommentsPage~Explore~Frontpage~ModListing~ModQueuePages~ModerationPages~Multi~5f2f5c2a.9e1690590f39e0d92f45.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/ChatPost~CollectionCommentsPage~CommentsPage~Frontpage~ModListing~ModQueuePages~Multireddit~Original~029c3338.33a759111dafa848481d.js"></script><script defer="" src="https://www.redditstatic.com/desktop2x/CommentsPage.dce215cbd6a2ed7e9969.js"></script></body></html>`);
});

it('removes matching modifiers', () => {
Member

@chrmod chrmod Jun 27, 2024

A few things to improve here:

  • HTMLSelector tests should include HTMLModifiers to ensure nothing breaks when both are applied
  • HTMLModifier tests should cover both a complete buffer and a buffer that is being streamed
  • HTMLModifier tests should include full-document tests, which can run on top of the document already present in the tests

https://adguard.com/kb/general/ad-filtering/create-own-filters/#replace-modifier

feat: shared option value

feat: html modifiers

test: replace modifier

test: escaped string utils

chore: replace modifierOptionValue to optionValue

chore: remove unintended logger

feat: validate replace option value on parse

feat: html filter index for network filter bucket

chore: check html modifiers on filtering req html

fix: splitUnescaped

test: add replace filters

fix: add export of HTMLModifier

feat: proper option parser for replace modifier (#3)

* fix: properly find the filter options index

* test: validate the function with real world example

* feat: use lastIndexOf position instead of text slicing

* chore: check lastIndex out of bound

* feat: support `$replace` modifier

https://adguard.com/kb/general/ad-filtering/create-own-filters/#replace-modifier

* feat: shared option value

* feat: html modifiers

* test: replace modifier

* test: escaped string utils

* chore: replace modifierOptionValue to optionValue

* chore: remove unintended logger

* feat: validate replace option value on parse

* feat: html filter index for network filter bucket

* chore: check html modifiers on filtering req html

* fix: splitUnescaped

* test: add replace filters

* feat: proper option parser for replace modifier

* chore: move string methods to utils.ts

* chore: add comments to the functions

* chore: remove unused imports

fix: invalid syntax declaration

chore: optimise code flow

chore: reduce loops and footprint

chore: remove unnecessary condition

It's already decided before this step

chore: force line end to stop parsing additional options

fix(test): irregular access of replace modifier array

test: parsing regexp test

fix(test): invalid cases and evals

chore: keep isRedirectRule flag

chore: obsolete splitUnescaped

chore: clean up

- remove duplicate function defs
- scope each filter testing of replace modifier
- clean up imports

fix: chance of dropping unhandled buffer

test: more html modifier scenarios
seia-soto and others added 5 commits July 17, 2024 21:12
! uBlockOrigin/uAssets#22620
alliptvlinks.com##+js(no-xhr-if, /doubleclick|googlesyndication/, length:10)
||alliptvlinks.com/tktk-content/plugins/$script,1p,replace=/\bconst now.+?, 100/clearInterval(timer);resolve();}, 100/gms
@seia-soto
Member Author

A quick look at the CLI introduced in 2df01e7 (#3897):

# Basic compression/decompression (brotli)
yarn workspace @cliqz/adblocker test-samples-compression compress --all
yarn workspace @cliqz/adblocker test-samples-compression decompress --all
yarn workspace @cliqz/adblocker test-samples-compression compress ./path_to_file
yarn workspace @cliqz/adblocker test-samples-compression decompress ./path_to_file

# This outputs the filename to save (removes special chars for file system)
yarn workspace @cliqz/adblocker test-samples-compression serialize-url https://...

# This cleans uncompressed files (files without `.br` suffix)
yarn workspace @cliqz/adblocker test-samples-compression clean

Labels
PR: New Feature 🚀 Increment minor version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support $replace
3 participants