Skip to content
This repository has been archived by the owner. It is now read-only.

Use etld plus one matching for 3p #172

Open
wants to merge 13 commits into
base: master
from

Conversation

@pes10k
Copy link
Collaborator

pes10k commented Feb 4, 2019

fixes #171

@pes10k pes10k requested a review from bbondy Feb 4, 2019
'www.example.org'
)
assert.equal(queryResult.matches, true)
})

This comment has been minimized.

Copy link
@diracdeltas

diracdeltas Feb 6, 2019

Member

i suggest adding some tests that use non-trivial etld+1's like, example.co.uk and example.githubusercontent.com

This comment has been minimized.

Copy link
@pes10k

pes10k Feb 6, 2019

Author Collaborator

There are many, but there at the C++ level. They're in this file https://github.com/brave/ad-block/blob/use-etld-plus-one-matching-for-3p/test/etld_test.cc

This comment has been minimized.

Copy link
@diracdeltas

diracdeltas Feb 6, 2019

Member

i see thanks

@diracdeltas
Copy link
Member

diracdeltas commented Feb 6, 2019

general question about this approach:

is it possible to just use the etld+1 parsing from Chromium? see https://cs.chromium.org/chromium/src/net/base/registry_controlled_domains/registry_controlled_domain.h?q=getdomainandregistry&dr=CSs

@pes10k
Copy link
Collaborator Author

pes10k commented Feb 6, 2019

general question about this approach:

is it possible to just use the etld+1 parsing from Chromium? see https://cs.chromium.org/chromium/src/net/base/registry_controlled_domains/registry_controlled_domain.h?q=getdomainandregistry&dr=CSs

We could, but then we'd loose the ability to run in node (which has been very valuable for crawling / measurement, getting other folks to use the code, debugging, etc)

@@ -0,0 +1,51 @@
/* Copyright (c) 2018 The Brave Software Team. Distributed under the MPL2

This comment has been minimized.

Copy link
@fmarier

fmarier Feb 9, 2019

Member

nit: Should be 2019. This applies to all of the new files added in this PR.

This comment has been minimized.

Copy link
@pes10k

pes10k Feb 9, 2019

Author Collaborator

fixed with 81cc4f4 :)

This comment has been minimized.

Copy link
@fmarier

fmarier Feb 9, 2019

Member

Maybe I'm confused about how GitHub is displaying the latest version of your PR, but it seems like you fixed all files, except for etld/domain.cc?

This comment has been minimized.

Copy link
@pes10k

pes10k Feb 9, 2019

Author Collaborator

No, i goofed, apologies for the butter fingers. Fixed now

@fmarier
Copy link
Member

fmarier commented Feb 9, 2019

I've got a question about the build process since I don't actually know when the build step takes place: if we pull down the list at build time, is there a reason to have it checked into the repo?

@pes10k
Copy link
Collaborator Author

pes10k commented Feb 9, 2019

I've got a question about the build process since I don't actually know when the build step takes place: if we pull down the list at build time, is there a reason to have it checked into the repo?

You're right no need for this. I removed it from the .gitignore previously, now also removed it from the set of tracked files. Should be good now

@pes10k
Copy link
Collaborator Author

pes10k commented Feb 9, 2019

@bbondy my code expects the public suffix list to be in a known location, and lazily parses the list on first use (i.e. at etld/data/<list>.dat). I have no idea if this will work when rolled into the larger browser. I'm just not familiar enough with the build process. Can you double check that aspect?

@pes10k pes10k force-pushed the use-etld-plus-one-matching-for-3p branch from be4c62e to 33ff5d9 Feb 9, 2019
.gitignore Outdated
.vscode

# These files are either fetched at build time, or generated from the build
etld/data/public_suffix_list.h

This comment has been minimized.

Copy link
@fmarier

fmarier Feb 11, 2019

Member

nit: I usually try to make these absolute paths with respect to the location of the .gitignore file (i.e. /etld/data/public_suffix.h) in order to avoid unexpected matches somewhere else in the codebase. Not very likely in this case, so feel free to ignore this suggestion, but a good habit IMHO.

Makefile Outdated
@@ -5,7 +5,9 @@
.PHONY: clean

build:
./node_modules/.bin/node-gyp configure && ./node_modules/.bin/node-gyp build
curl -s https://publicsuffix.org/list/public_suffix_list.dat -o etld/data/public_suffix_list.dat

This comment has been minimized.

Copy link
@fmarier

fmarier Feb 11, 2019

Member

Kudos for not using -k here :)

nit: If you want to make that curl line even stricter, you could also throw in --tlsv1.2 to enforce a minimum level of TLS.

constructorArgs.push(isException ? "true" : "false");

const wrappedLabels = labels.map(JSON.stringify);
constructorArgs.push("{" + wrappedLabels.join(", ") + "}");

This comment has been minimized.

Copy link
@fmarier

fmarier Feb 11, 2019

Member

Should the labels be surrounded by " in case they're not all bare words? Also, is it possible they include " characters that should be escaped?

}

if (previous != 0) {
labels_.push_back(string.substr(previous, current - previous));

This comment has been minimized.

Copy link
@fmarier

fmarier Feb 11, 2019

Member

I believe this call will fail if you get invalid input like: abcd.efgh. (note the trailing dot). Might be worth adding a test for it if there isn't one already.

@pes10k pes10k force-pushed the use-etld-plus-one-matching-for-3p branch 2 times, most recently from 2919690 to 8d72272 Feb 22, 2019
@pes10k
Copy link
Collaborator Author

pes10k commented Feb 22, 2019

@bbondy this is now ready for review again. The ways to enable the eTLD+1 checking (by parsing a public suffix list) are:

  1. when using the check.js script, use the new -P, --public-suffix-rules-path option, and point it to a text including public suffix rules.
  2. use the js AdBlockClient.parsePublicSuffixRules method and give it a string containing public suffix rules
  3. use the C++ AdBlockClient::parsePublicSuffixRules method with a char* / std::string of rules
  4. use either the C++ or JS deserialize methods with a dat file that includes public suffix rules data (serializing after doing 1, 2 or 3 will include the public suffix rule data in the .dat).
@pes10k pes10k force-pushed the use-etld-plus-one-matching-for-3p branch 2 times, most recently from d1ada9e to 86fb8a9 Feb 22, 2019
Copy link
Member

bbondy left a comment

I added a commit to fix npm run perf, this is unfortunately currently regressing perf by 4-5x overall though. We'll need to optimize that so it's trivially the same in speed for matching.

@pes10k pes10k force-pushed the use-etld-plus-one-matching-for-3p branch from 0fcdfc4 to 76bdb66 Feb 25, 2019
@pes10k pes10k requested a review from bbondy Feb 26, 2019
while (current != std::string::npos) {
current_label = label_text.substr(previous, current - previous);
if (current_label.length() == 0) {
throw PublicSuffixRuleInputException(

This comment has been minimized.

Copy link
@bbondy

bbondy Feb 27, 2019

Member

Chromium is built with exceptions disabled, and this would be the first exception.
I'd recommend instead passing in a pointer to a vector and then filling it. And make the return value the result, and propagate failures.

// If don't include any trailing whitespace, if there is any.
current_label = label_text.substr(previous, current - previous);
if (current_label == "") {
throw PublicSuffixRuleInputException(

This comment has been minimized.

Copy link
@bbondy

bbondy Feb 27, 2019

Member

ditto exceptions

PublicSuffixRule::PublicSuffixRule(const std::string& rule_text) {
std::string trimmed_rule_text(trim_to_whitespace(rule_text));
if (trimmed_rule_text.length() == 0) {
throw PublicSuffixRuleInputException(

This comment has been minimized.

Copy link
@bbondy

bbondy Feb 27, 2019

Member

ditto exceptions

break;

case '/':
throw PublicSuffixRuleInputException(

This comment has been minimized.

Copy link
@bbondy

bbondy Feb 27, 2019

Member

ditto exceptions

before(function () {
this.client = new AdBlockClient()
this.client.parse('||bannersnack.com^$third-party')
const etldRules = fs.readFileSync('./test/data/public_suffix_list.dat', 'utf8')

This comment has been minimized.

Copy link
@bbondy

bbondy Feb 27, 2019

Member

Can we do some similar tests for when we don't call parsePublicSuffixRules and it falls back to the warning with FQDN?

This comment has been minimized.

Copy link
@bbondy

bbondy Feb 27, 2019

Member

You can probably just add a for loop which loops twice using different clients around all the it calls.

this.client.parsePublicSuffixRules(etldRules)
})
it('consider eTLD+1 domains as 1p', function () {
const altSubDomainQuery = this.client.findMatchingFilters(

This comment has been minimized.

Copy link
@bbondy

bbondy Feb 27, 2019

Member

could we add tests for matches too?

Copy link
Member

bbondy left a comment

Please squash these 3 commits together by using git rebase -i

  • linter fixes / cleanup
  • further linter fixes / cleanup
  • even more lint cleanup
@pes10k pes10k force-pushed the use-etld-plus-one-matching-for-3p branch from d25f24d to ad0dab0 Feb 28, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

4 participants
You can’t perform that action at this time.