[Plugins] Content Blocker Plugin that automatically detects and remove any Ad, Tracker, Malicious, Adult content from HTML #378

Closed
abhinavsingh opened this issue Jun 18, 2020 · 11 comments
Labels
Proposal Proposals for futuristic feature requests

Comments

@abhinavsingh
Owner

Describe the solution you'd like
Enabling this plugin will automatically strip ads, tracking pixels and malicious content from HTML pages.

This plugin will asynchronously download ad block rules used for content blocking.

@abhinavsingh abhinavsingh added the Proposal Proposals for futuristic feature requests label Jun 18, 2020
@abhinavsingh abhinavsingh self-assigned this Jun 18, 2020
@abhinavsingh abhinavsingh changed the title [Content Blocker] Plugin that automatically detects and remove any Ad, Tracker, Malicious, Adult content from HTML [Plugin] Content Blocker Plugin that automatically detects and remove any Ad, Tracker, Malicious, Adult content from HTML Jun 18, 2020
@abhinavsingh abhinavsingh changed the title [Plugin] Content Blocker Plugin that automatically detects and remove any Ad, Tracker, Malicious, Adult content from HTML [Plugins] Content Blocker Plugin that automatically detects and remove any Ad, Tracker, Malicious, Adult content from HTML Jun 18, 2020
@abhinavsingh abhinavsingh removed their assignment Jun 18, 2020
@mikenye
Contributor

mikenye commented Jul 5, 2020

Hi @abhinavsingh,

I’ve recently been considering putting together an ad-blocker for AppleTV and other devices to block things like YouTube and Twitch ads. I came across proxy.py and have been reading about the plugin framework and this looks to be a good way to do this. I thought I’d review the issues to see if anyone else was working on this, and found this issue. :-)

My thoughts were to do something along the lines of:

  • transparently redirect all HTTP/HTTPS traffic through an instance of proxy.py
  • have a list of domains to be “active” on. Domains not in this list will bypass the proxy (if HTTPS, will use CONNECT instead of TLS Interception, to keep things like internet banking un-molested).
  • Requests within “active” domains will then be checked over some kind of block list (likely regex) for objects to be blocked.
  • Perhaps there are several different block actions. For example:
    • “block_404” could block the object with a 404 error
    • “block_503” could return HTTP 503 to mimic a server error
    • “serve_blank_video” action could serve (via proxy.py’s built in web server) a short, blank video instead of a video ad so as to try and prevent breaking the client application...
    • and so on...
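
A minimal sketch of the flow described above, written as plain Python rather than against proxy.py's plugin API. The action names come from the list; the domains, patterns and the `choose_action` helper are hypothetical and only illustrate the "active domains first, then first matching rule wins" idea.

```python
import re
from typing import List, Optional, Pattern, Tuple

# Action names taken from the list above.
BLOCK_404 = 'block_404'
BLOCK_503 = 'block_503'
SERVE_BLANK_VIDEO = 'serve_blank_video'

# Illustrative values only (assumptions); not a real block list.
ACTIVE_DOMAINS = {'video.example.com'}
RULES: List[Tuple[Pattern[str], str]] = [
    # More specific rules come first because the first match wins.
    (re.compile(r'/ads/.*\.mp4$'), SERVE_BLANK_VIDEO),
    (re.compile(r'/ads/'), BLOCK_404),
    (re.compile(r'/track(er|ing)/'), BLOCK_503),
]


def choose_action(host: str, path: str) -> Optional[str]:
    """Return the action of the first matching rule, or None to pass through."""
    if host not in ACTIVE_DOMAINS:
        # Not an "active" domain: leave the request untouched.
        return None
    for pattern, action in RULES:
        if pattern.search(path):
            return action
    return None
```

A plugin would then map the returned action name to a concrete response (a 404, a 503, or a locally served blank video).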

Looking at the plugin syntax, I think I can come up with a proof of concept plugin, however the only part that isn’t immediately straightforward to me is how to enable/disable TLS interception on a per request basis... Are you able to point me in the right direction here?

Any other thoughts/ideas?

Thanks!

@mikenye
Contributor

mikenye commented Jul 5, 2020

Is it as easy as changing self.request.method to CONNECT for HTTPS requests not in “active” domains?

@abhinavsingh
Owner Author

Hi @mikenye

Welcome, and thanks for considering proxy.py. My initial thoughts for this plugin were in line with what you described. The plan was to maintain a list of ad-block rules, for example by downloading the industry-standard rules from https://easylist.to/. Afaik, these rules are mostly regex based. However, as you identified, this strategy requires TLS interception for all requests, even those to trusted domains.

Unfortunately, as of today, proxy.py doesn't support selective TLS interception, but we should be able to add this feature easily. Currently, TLS interception is decided by the presence of the --ca-* flags. If they are present, proxy.py core goes ahead and performs TLS interception. For example, see this block of code within server.py where TLS interception kicks in:

if self.flags.tls_interception_enabled():
    # Perform SSL/TLS handshake with upstream
    self.wrap_server()
    # Generate certificate and perform handshake with client
    try:
        # wrap_client also flushes client data before wrapping
        # sending to client can raise, handle expected exceptions
        self.wrap_client()
    except subprocess.TimeoutExpired as e:  # Popen communicate timeout
        logger.exception('TimeoutExpired during certificate generation', exc_info=e)
        return True
    except BrokenPipeError:
        logger.error(
            'BrokenPipeError when wrapping client')
        return True
    except OSError as e:
        logger.exception(
            'OSError when wrapping client', exc_info=e)
        return True
    # Update all plugin connection reference
    for plugin in self.plugins.values():
        plugin.client._conn = self.client.connection
    return self.client.connection

To enable selective TLS interception, we will have to modify the above block of code. There are a couple of strategies we can follow:

  1. Plugin callback (recommended) -- Introduce another callback for plugins which, when called, returns True if TLS interception must be performed.
  2. Flag based -- Introduce startup flags which can define allowed or blocked domains for TLS interception.
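
A rough sketch of how the plugin-callback strategy could look. Everything here is an assumption for illustration: the `should_intercept` callback name, the `InterceptPolicyPlugin` base class and the `tls_interception_wanted` helper are hypothetical and are not part of proxy.py's API as discussed in this thread.

```python
from typing import Iterable, Set


class InterceptPolicyPlugin:
    """Hypothetical plugin base class with the proposed callback."""

    def should_intercept(self, host: str) -> bool:
        # Default keeps today's behaviour: intercept everything.
        return True


class ActiveDomainsPlugin(InterceptPolicyPlugin):
    """Intercept only hosts in an explicit "active" list."""

    def __init__(self, active_domains: Iterable[str]) -> None:
        self.active_domains: Set[str] = {d.lower() for d in active_domains}

    def should_intercept(self, host: str) -> bool:
        return host.lower() in self.active_domains


def tls_interception_wanted(
        ca_flags_present: bool,
        plugins: Iterable[InterceptPolicyPlugin],
        host: str) -> bool:
    """Combine the existing flag check with the per-request plugin callback."""
    if not ca_flags_present:
        # Without the --ca-* flags interception is impossible, as today.
        return False
    # Any plugin can veto interception; with no plugins this stays True.
    return all(p.should_intercept(host) for p in plugins)
```

In this sketch the core would call `tls_interception_wanted` where it currently checks only the flags, so a host outside the active list is tunnelled untouched.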

Let me know what you think of the above. Thank you!!!

@mikenye
Contributor

mikenye commented Jul 6, 2020

I think the plugin callback is the best method, as it is more versatile.

I'd love to help with implementing this, but I'm not sure where to start as my python isn't anywhere near as strong as yours!

@abhinavsingh
Owner Author

@mikenye Makes sense. Do you want to initiate the content blocker plugin? If yes, let's kick-start one in the current state, i.e. with all requests intercepted. I'll revisit it in a week or so to add ad-hoc TLS interception capabilities. Not much will change from the plugin's perspective: the plugin can simply return True only for active domains.

I have opened a separate issue to track ad-hoc TLS interception capability. #391

@mikenye
Contributor

mikenye commented Jul 7, 2020

Yes! I’ll start working on this when I get home from work. :-)

@mikenye
Contributor

mikenye commented Jul 8, 2020

@abhinavsingh how's this for a first attempt at providing basic functionality: https://github.com/mikenye/proxy.py/blob/develop/proxy/plugin/filter_by_url_regex.py

I've tested it and it works quite well, though obviously the filter list will need to be hugely expanded. It blocks most video ads on twitch.tv however it does cause twitch to buffer (future problem to look at).

My thoughts are:

  • If using regex, we probably want to pre-compile the patterns on init to speed up the checking process
  • We likely need to formulate a schema for the block list before we go too much further with this, or we revert to a pre-existing list (such as easylist as-per your initial comments - I need to learn how to parse this)
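
To illustrate the pre-compilation point, here is a small sketch assuming a block list shaped as a list of dicts with a `regex` key. That schema and the `UrlFilter` class are assumptions for illustration, not the linked plugin's actual code.

```python
import re
from typing import Dict, List, Pattern


class UrlFilter:
    """Compile every pattern once at construction time, not per request."""

    def __init__(self, rules: List[Dict[str, str]]) -> None:
        # re.compile runs once per rule here; request handling only calls
        # .search() on the already compiled pattern objects.
        self._compiled: List[Pattern[str]] = [
            re.compile(rule['regex']) for rule in rules
        ]

    def is_blocked(self, url: str) -> bool:
        return any(pattern.search(url) for pattern in self._compiled)


# Hypothetical rule entry; 'notes' is free text and is ignored by the filter.
url_filter = UrlFilter([
    {'regex': r'ads\.example\.com/', 'notes': 'example ad host'},
])
```

Python's `re` module caches recently used patterns internally, so the gain from compiling up front is mainly avoiding the cache lookup and any evictions once the list grows large.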

Keen for your feedback!

If you're happy with this I can submit a pull request.

@abhinavsingh
Owner Author

@mikenye This looks perfect to start with.

> It blocks most video ads on twitch.tv however it does cause twitch to buffer (future problem to look at).

Going forward, we can look into stripping ad code blocks out of the response chunks (I believe this is what most adblocker plugins do, but it needs more investigation). With the ad code stripped out, browsers won't make any outgoing requests for ads. If we instead return an error code for ad requests, users can experience unexpected behavior in the browser (e.g. video buffering). We'll of course find out more as we use this plugin in real-world scenarios :)

> My thoughts are:
>
>   • If using regex, we probably want to pre-compile the patterns on init to speed up the checking process

👍 As this list becomes huge, we can even use an underlying data structure to hold the rule list, reducing the number of comparisons made per request.
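
One possible data structure for this, sketched under the assumption that each rule can be keyed by the host it applies to: a dict from host to the compiled path patterns for that host, so a request is only compared against rules for its own host. The `HostIndexedRules` class and its methods are hypothetical.

```python
import re
from collections import defaultdict
from typing import DefaultDict, List, Pattern


class HostIndexedRules:
    """Group path patterns by host so lookups skip unrelated rules."""

    def __init__(self) -> None:
        self._by_host: DefaultDict[str, List[Pattern[str]]] = defaultdict(list)

    def add(self, host: str, path_regex: str) -> None:
        self._by_host[host.lower()].append(re.compile(path_regex))

    def candidates(self, host: str) -> List[Pattern[str]]:
        # .get avoids creating empty entries for hosts without rules.
        return self._by_host.get(host.lower(), [])

    def is_blocked(self, host: str, path: str) -> bool:
        return any(p.search(path) for p in self.candidates(host))


rules = HostIndexedRules()
rules.add('ads.example.com', r'^/')
rules.add('video.example.com', r'^/ads/')
```

Rules that are not tied to a single host would still need a fallback list that is checked for every request.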

>   • We likely need to formulate a schema for the block list before we go too much further with this, or we revert to a pre-existing list (such as easylist as-per your initial comments - I need to learn how to parse this)

I'd recommend we use an industry-standard schema as the source of truth. If necessary, we can transform the rules into a convenient data structure to speed things up. I created a separate issue to add a cron-style feature, see #392. The thought there was that the ad blocker plugin can configure a cron job that downloads the rule list every N hours or so. For starters, we can perform a one-time download on startup, or refresh after every N requests served.
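
The "download on startup, then refresh after every N requests" starter approach could be sketched as follows. The `RuleStore` class is hypothetical, and the download itself is injected as a `fetch` callable so the sketch makes no assumption about the rule list's URL or format.

```python
from typing import Callable, List


class RuleStore:
    """Hold the current rule list and refresh it every N served requests."""

    def __init__(self, fetch: Callable[[], List[str]], refresh_every: int) -> None:
        if refresh_every < 1:
            raise ValueError('refresh_every must be >= 1')
        self._fetch = fetch
        self._refresh_every = refresh_every
        self._served = 0
        # One-time download on startup.
        self.rules: List[str] = fetch()

    def on_request(self) -> None:
        """Call once per served request; refreshes on every Nth call."""
        self._served += 1
        if self._served >= self._refresh_every:
            self._served = 0
            self.rules = self._fetch()
```

This version fetches synchronously inside `on_request`, which would stall the request that triggers the refresh. A real plugin would more likely hand the download to a background task and swap `rules` once it completes.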

> If you're happy with this I can submit a pull request.

Let's do it. Happy to get this in and make this plugin robust over time.

@abhinavsingh
Owner Author

abhinavsingh commented Jul 8, 2020

  1. Do run make autopep8 to keep the code style consistent.
  2. It's also a good idea to run make before sending out the PR, to avoid CI failures. See https://github.com/abhinavsingh/proxy.py/#development-guide to set up git pre-commit hooks.

@mikenye
Contributor

mikenye commented Jul 10, 2020

I've submitted the pull request. There are lots of commits (mostly as I was figuring out how things work), so I fully understand if you'd like me to tidy this up and resubmit. Just let me know.

@abhinavsingh
Owner Author

Closing this for now.

  1. We already have FilterByURLRegexPlugin
  2. Additionally, now we have CloudflareDnsResolverPlugin which can be used for malware and adult content protection
