Skip to content
This repository has been archived by the owner. It is now read-only.

Filter out rare rules #109

Closed
wants to merge 3 commits into from
Closed

Filter out rare rules #109

wants to merge 3 commits into from

Conversation

@pes10k
Copy link
Collaborator

pes10k commented Jun 5, 2018

Filters out EasyList rules that are not see in the crawl of Alexa1k (depth 2, breath 3), in the global, CN, DE and RU regions.

A couple of pieces that'd need to be worked out here:

  1. the "rarely used" rules are pulled from https://s3.amazonaws.com/com.brave.research.public/brave-unused-filters.txt. Might be better to put behind fastly or similar
  2. That is generated by this lambda function. I have it running in Lambda under the "research" (i.e. non-production) AWS account, but would probably be good to move it somewhere else (and have it run regularly / cron'ed / etc somewhere). I'm not sure what the best place is for this. Whats the best way of getting this run once a day / week?
  3. Since Brave doesn't apply CSS filters in anyway currently, every CSS rule is in the unused filters list. This is correct for now, but will need to be changed going forward when we get that added in.
  4. Should we kick out to Alexa10k? Other regions?
  5. Extend to EasyPrivacy?
  6. @diracdeltas had a great idea for a possible further optimization:
    • Run the small subset of observed filters in blocking mode
    • Run the full set of EasyList / EasyPrivacy filters in non-blocking mode, and add hits to the blocking set.

I think @diracdeltas 's idea is great, but would take bigger changes and maybe not worth blocking this on. What do you think about doing that as a possible future change (and / or possible good first change for one of the interns working on blocking)?

pes10k added 2 commits Jun 5, 2018
Copy link
Member

bbondy left a comment

Some minor refactoring, but overall lgtm.

lib/util.js Outdated
@@ -60,7 +60,8 @@ const getSingleListDataFromSingleURL = (listURL, filter) => {
body = filter(body)
}
body = sanitizeABPInput(body)

This comment has been minimized.

Copy link
@bbondy

bbondy Jun 12, 2018

Member

sanitizeABPInput was originally meant to be for including all things that modify ABP input so you could not worry about where it comes from. I think it'd be good to add filterRareRulesPromise inside there. You could convert it to a promise as well if you want. You'd just add another filter predicate there and fetch the list on startup of the tool.

return
}

console.log(`Fetching unused rules data from ${rareRulesUrl}...`)

This comment has been minimized.

Copy link
@bbondy

bbondy Jun 12, 2018

Member

I'd split this into its own function

…pen automatically to clients, convert sanitizeABPInput -> sanitizeABPInputPromise
@bbondy
bbondy approved these changes Jun 20, 2018
Copy link
Member

bbondy left a comment

This will be live once we release 0.23.x (will be our next major release soon) and will be live on brave-core builds (which is not risky because only internal). It would be good to have @lukemulks test this once builds are available -- @darkdh is working on it. You can merge now though. Just don't put it on v3 branch.

@pes10k
Copy link
Collaborator Author

pes10k commented Jun 21, 2018

I've raised the issue with #devops about where this should ultimately live and will update this PR and merge once the "how will this live on production" issues get worked out

brave/devops#90

@pes10k
Copy link
Collaborator Author

pes10k commented Dec 13, 2018

This is all managed elsewhere, by different projects now, so closing here

@pes10k pes10k closed this Dec 13, 2018
@bbondy
Copy link
Member

bbondy commented Dec 13, 2018

I'd still like there to be a 2 pass check though for desktop. Should we use this to track that?

@pes10k
Copy link
Collaborator Author

pes10k commented Dec 13, 2018

ah, gotcha. What would you think of flipping the script, and instead of pulling down a list of of rules to filter out, pulling down just a set of rules to include.

Once the latest lambda deploy happens, I'll be writing a archive/summary-latest.json to the adblock-data S3. That includes the rule, which of the 5 global lists the rule is defined in, and the count of times the rule was used in the last two weeks. Finding the intersection of the EL and EP rules in that data, and EL + EP in general, would be an easy option. Might be easier than having a "filter out the unused stuff" option. What do you think?

@bbondy
Copy link
Member

bbondy commented Dec 19, 2018

rules to include is perfectly fine with me too. Actually slightly easier even.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants
You can’t perform that action at this time.