[Plugins] Content Blocker Plugin that automatically detects and remove any Ad, Tracker, Malicious, Adult content from HTML #378

abhinavsingh · 2020-06-18T11:59:32Z

Describe the solution you'd like
Enabling this plugin will automatically strip any ad, tracking pixels or malicious content from HTML pages.

This plugin will asynchronously download ad block rules used for content blocking.

mikenye · 2020-07-05T13:15:58Z

Hi @abhinavsingh,

I’ve recently been considering putting together an ad-blocker for AppleTV and other devices to block things like YouTube and Twitch ads. I came across proxy.py and have been reading about the plugin framework and this looks to be a good way to do this. I thought I’d review the issues to see if anyone else was working on this, and found this issue. :-)

My thoughts were to do something along the lines of:

- Transparently redirect all HTTP/HTTPS traffic through an instance of proxy.py.
- Have a list of domains to be “active” on. Domains not in this list will bypass the proxy (if HTTPS, will use CONNECT instead of TLS Interception, to keep things like internet banking un-molested).
- Requests within “active” domains will then be checked over some kind of block list (likely regex) for objects to be blocked.
- Perhaps there are several different block actions. For example:
  - “block_404” could block the object with a 404 error
  - “block_503” could return HTTP 503 to mimic a server error
  - “serve_blank_video” action could serve (via proxy.py’s built in web server) a short, blank video instead of a video ad so as to try and prevent breaking the client application...
  - and so on...
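
The flow in the list above could be sketched roughly as follows. This is only an illustration in plain Python: `ACTIVE_DOMAINS`, `BLOCK_RULES`, `ACTION_STATUS`, `is_active` and `decide` are hypothetical names, the domains and patterns are placeholders, and nothing here uses the proxy.py plugin API.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical configuration: domains the filter is "active" on.
ACTIVE_DOMAINS = {"twitch.tv", "youtube.com"}

# Hypothetical rule table: (URL regex, action name). Placeholder patterns.
BLOCK_RULES = [
    (re.compile(r"/ads?/"), "block_404"),
    (re.compile(r"/tracking/pixel"), "block_503"),
]

# Status code for each status-style action (assumed mapping).
ACTION_STATUS = {"block_404": 404, "block_503": 503}


def is_active(host: str) -> bool:
    """True if host is an active domain or a subdomain of one."""
    host = host.lower()
    return any(host == d or host.endswith("." + d) for d in ACTIVE_DOMAINS)


def decide(url: str) -> Optional[str]:
    """Return the block action for a URL, or None to let it through."""
    parts = urlsplit(url)
    if not is_active(parts.hostname or ""):
        return None  # not an active domain: bypass the block list entirely
    for pattern, action in BLOCK_RULES:
        if pattern.search(url):
            return action  # first matching rule wins
    return None
```

A caller would map the returned action to a response, for example `ACTION_STATUS[action]` for the status-code actions; something like “serve_blank_video” would need its own handler.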

Looking at the plugin syntax, I think I can come up with a proof of concept plugin, however the only part that isn’t immediately straightforward to me is how to enable/disable TLS interception on a per request basis... Are you able to point me in the right direction here?

Any other thoughts/ideas?

Thanks!

mikenye · 2020-07-05T13:44:14Z

Is it as easy as changing self.request.method to CONNECT for HTTPS requests not in “active” domains?

abhinavsingh · 2020-07-05T16:01:28Z

Hi @mikenye

Welcome and thanks for considering proxy.py. My initial thoughts behind this plugin were in line with what you described. The plan was to maintain a list of ad-block rules, for example by downloading industry-standard rules from https://easylist.to/. Afaik, these rules are mostly regex based. However, as you identified, this strategy will require TLS interception for all requests, even for trusted domains.

Unfortunately, as of today, proxy.py doesn't support selective TLS interception. But we should be able to add this feature easily. Currently, TLS interception is decided based upon the presence of --ca-* flags. If present, proxy.py core goes ahead and performs TLS interception. For example, check this block of code within server.py where TLS interception kicks in:

proxy.py/proxy/http/proxy/server.py

Lines 283 to 305 in 1b0ed92

    
    if self.flags.tls_interception_enabled():
        # Perform SSL/TLS handshake with upstream
        self.wrap_server()
        # Generate certificate and perform handshake with client
        try:
            # wrap_client also flushes client data before wrapping
            # sending to client can raise, handle expected exceptions
            self.wrap_client()
        except subprocess.TimeoutExpired as e:  # Popen communicate timeout
            logger.exception('TimeoutExpired during certificate generation', exc_info=e)
            return True
        except BrokenPipeError:
            logger.error(
                'BrokenPipeError when wrapping client')
            return True
        except OSError as e:
            logger.exception(
                'OSError when wrapping client', exc_info=e)
            return True
        # Update all plugin connection reference
        for plugin in self.plugins.values():
            plugin.client._conn = self.client.connection
        return self.client.connection

To enable selective TLS interception, we will have to modify the above block of code. There are a couple of strategies we can follow:

- Plugin callback (recommended) -- Introduce another callback for plugins which, when called, returns True if TLS interception must be performed.
- Flag based -- Introduce startup flags which define allowed or blocked domains for TLS interception.
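
As a rough sketch of the plugin-callback strategy, assuming a hypothetical callback named `should_intercept` (not an existing proxy.py API) and a hypothetical helper `tls_interception_wanted` standing in for the check in server.py:

```python
from typing import Iterable, Sequence, Set


class InterceptPolicyPlugin:
    """Hypothetical plugin base exposing the proposed callback."""

    def should_intercept(self, host: str) -> bool:
        # Default keeps the current behaviour: intercept everything.
        return True


class ActiveDomainsPlugin(InterceptPolicyPlugin):
    """Intercept only hosts that fall under a configured set of domains."""

    def __init__(self, active_domains: Iterable[str]) -> None:
        self.active_domains: Set[str] = {d.lower() for d in active_domains}

    def should_intercept(self, host: str) -> bool:
        host = host.lower()
        return any(
            host == d or host.endswith("." + d) for d in self.active_domains
        )


def tls_interception_wanted(
    ca_flags_present: bool,
    plugins: Sequence[InterceptPolicyPlugin],
    host: str,
) -> bool:
    """Intercept only if the CA flags are set and every plugin agrees."""
    if not ca_flags_present:
        return False
    return all(p.should_intercept(host) for p in plugins)
```

Here any plugin returning False turns interception off for that connection, so a non-active HTTPS host would fall back to a plain CONNECT tunnel. Whether one plugin should be able to veto the others is a design choice this sketch simply assumes.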

Let me know what you think of the above. Thank you!!!

mikenye · 2020-07-06T04:24:19Z

I think the plugin callback is the best method, as it is more versatile.

I'd love to help with implementing this, but I'm not sure where to start as my python isn't anywhere near as strong as yours!

abhinavsingh · 2020-07-07T08:40:49Z

@mikenye Makes sense. Do you want to initiate the content blocker plugin? If yes, let's kick start one in the current state, i.e. all requests will be intercepted. I'll revisit it in a week or so to add ad-hoc TLS interception capabilities. Not much will change from the plugin's perspective. The plugin can simply return True only for active domains.

I have opened a separate issue to track ad-hoc TLS interception capability. #391

mikenye · 2020-07-07T08:42:30Z

Yes! I’ll start working on this when I get home from work. :-)

mikenye · 2020-07-08T02:46:01Z

@abhinavsingh how's this for a first attempt at providing basic functionality: https://github.com/mikenye/proxy.py/blob/develop/proxy/plugin/filter_by_url_regex.py

I've tested it and it works quite well, though obviously the filter list will need to be hugely updated. It blocks most video ads on twitch.tv; however, it does cause twitch to buffer (future problem to look at).

My thoughts are:

- If using regex, we probably want to pre-compile the patterns on init to speed up the checking process
- We likely need to formulate a schema for the block list before we go too much further with this, or we revert to a pre-existing list (such as easylist as-per your initial comments - I need to learn how to parse this)
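
For reference, a minimal sketch touching both points: it compiles patterns once at load time, and it converts a small subset of the Adblock Plus style filter syntax (which EasyList is written in) into Python regexes. The function names `filter_to_regex` and `load_filters` are made up for illustration. Only `||domain^` anchors, `*` wildcards and `^` separators are handled; comments, exception rules, element-hiding rules, rules with `$` options, `|` anchors and raw `/regex/` rules are skipped, so this is far from a complete parser.

```python
import re
from typing import Iterable, List, Optional, Pattern

# "^" in a filter matches a separator character or the end of the URL.
_SEPARATOR = r"(?:[^\w\-.%]|$)"
# "||" anchors at the start of a host name, allowing any scheme/subdomain.
_DOMAIN_ANCHOR = r"^[a-z][a-z0-9+.\-]*://(?:[^/?#]*\.)?"


def filter_to_regex(line: str) -> Optional[Pattern[str]]:
    """Convert one filter line to a compiled regex, or None if unsupported."""
    line = line.strip()
    if not line or line.startswith(("!", "[", "@@")):
        return None  # blank, comment, header or exception rule
    if "#" in line or "$" in line:
        return None  # element hiding or option rules (simplification)
    if len(line) > 1 and line.startswith("/") and line.endswith("/"):
        return None  # raw regex rule, not handled in this sketch
    anchored = line.startswith("||")
    body = line[2:] if anchored else line
    if "|" in body:
        return None  # single-pipe anchors are not handled here
    parts = []
    for ch in body:
        if ch == "*":
            parts.append(".*")
        elif ch == "^":
            parts.append(_SEPARATOR)
        else:
            parts.append(re.escape(ch))
    prefix = _DOMAIN_ANCHOR if anchored else ""
    return re.compile(prefix + "".join(parts))


def load_filters(lines: Iterable[str]) -> List[Pattern[str]]:
    """Compile every supported line once, up front."""
    compiled = (filter_to_regex(line) for line in lines)
    return [p for p in compiled if p is not None]
```

A request URL would then be blocked if any pattern from `load_filters(...)` matches via `pattern.search(url)`.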

Keen for your feedback!

If you're happy with this I can submit a pull request.

abhinavsingh · 2020-07-08T06:40:25Z

@mikenye This looks perfect to start with.

> It blocks most video ads on twitch.tv however it does cause twitch to buffer (future problem to look at).

Going forward we can look into stripping out ad code blocks from the response chunks (I believe this is what most adblocker plugins do, needs more investigation). By stripping out the ad code, browsers won't make any outgoing requests for ads. If we return an error code for ad requests, users can experience unexpected behavior in the browser (e.g. video buffering). We'll of course find out more as we use this plugin in real-world scenarios :)

> My thoughts are:
>
> If using regex, we probably want to pre-compile the patterns on init to speed up the checking process

👍 As this list becomes huge we can even use an underlying data structure to hold the rule list, reducing number of comparisons made per request.
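
One possible shape for such a data structure, sketched under the assumption that rules can be keyed by the host they apply to (the class name `HostIndexedRules` is illustrative only): a dictionary lookup on the host and its parent domains selects a small bucket of pre-compiled patterns, so a request is compared against only the rules for its own domain rather than the whole list.

```python
import re
from collections import defaultdict
from typing import DefaultDict, List, Pattern


class HostIndexedRules:
    """Hypothetical rule store: path regexes bucketed by host."""

    def __init__(self) -> None:
        self._by_host: DefaultDict[str, List[Pattern[str]]] = defaultdict(list)

    def add(self, host: str, path_regex: str) -> None:
        # Compile once when the rule is added, not on every request.
        self._by_host[host.lower()].append(re.compile(path_regex))

    def is_blocked(self, host: str, path: str) -> bool:
        # Walk from the full host up through its parent domains:
        # "a.b.example.com" -> "b.example.com" -> "example.com" -> "com".
        labels = host.lower().split(".")
        for i in range(len(labels)):
            candidate = ".".join(labels[i:])
            # .get() avoids creating empty buckets for unknown hosts.
            for pattern in self._by_host.get(candidate, ()):
                if pattern.search(path):
                    return True
        return False
```

Rules that are not tied to a host would still need a separate, unindexed list.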

> We likely need to formulate a schema for the block list before we go too much further with this, or we revert to a pre-existing list (such as easylist as-per your initial comments - I need to learn how to parse this)

I'd recommend we use an industry-standard schema as the source of truth. If necessary, we can transform the rules into a convenient data structure to speed things up. I created a separate issue to add a cron-style feature, see #392. The thought there was that the ad blocker plugin can configure a cron job that downloads the rule list every N hours or so. For starters, we can perform a one-time download on startup or after every N requests served.
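
The "one-time download on startup or after every N requests" starting point might look something like this. It is a sketch only: `RefreshingRules` is a hypothetical name, and the `loader` callable is a stand-in for whatever actually fetches and parses the rule list.

```python
from typing import Callable, List


class RefreshingRules:
    """Hypothetical holder that reloads its rules every N requests."""

    def __init__(self, loader: Callable[[], List[str]], refresh_every: int) -> None:
        if refresh_every < 1:
            raise ValueError("refresh_every must be >= 1")
        self._loader = loader
        self._refresh_every = refresh_every
        self._served = 0
        self.rules: List[str] = loader()  # one-time load on startup

    def on_request(self) -> List[str]:
        """Count one served request; reload once N have been served."""
        self._served += 1
        if self._served >= self._refresh_every:
            self.rules = self._loader()
            self._served = 0
        return self.rules
```

Note that the loader runs synchronously here, so the request that triggers a refresh waits for it. A real plugin would likely want the download to happen off the request path.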

> If you're happy with this I can submit a pull request.

Let's do it. Happy to get this in and make this plugin robust over time.

abhinavsingh · 2020-07-08T06:44:34Z

Do run make autopep8 to keep the code style consistent.
It's also a good idea to run make before sending out the PR, to avoid CI failures. See https://github.com/abhinavsingh/proxy.py/#development-guide to set up git pre-commit hooks.

mikenye · 2020-07-10T02:27:32Z

I've submitted the pull request. There are lots of commits (mostly as I was figuring out how things work), so I fully understand if you'd like me to tidy this up and resubmit. Just let me know.

abhinavsingh · 2021-11-08T08:10:36Z

Closing this for now.

We already have FilterByURLRegexPlugin
Additionally, now we have CloudflareDnsResolverPlugin which can be used for malware and adult content protection

abhinavsingh added the Proposal label (Proposals for futuristic feature requests) Jun 18, 2020

abhinavsingh self-assigned this Jun 18, 2020

abhinavsingh changed the title ~~[Content Blocker] Plugin that automatically detects and remove any Ad, Tracker, Malicious, Adult content from HTML~~ [Plugin] Content Blocker Plugin that automatically detects and remove any Ad, Tracker, Malicious, Adult content from HTML Jun 18, 2020

abhinavsingh changed the title ~~[Plugin] Content Blocker Plugin that automatically detects and remove any Ad, Tracker, Malicious, Adult content from HTML~~ [Plugins] Content Blocker Plugin that automatically detects and remove any Ad, Tracker, Malicious, Adult content from HTML Jun 18, 2020

abhinavsingh removed their assignment Jun 18, 2020

abhinavsingh mentioned this issue Jul 7, 2020

[TLSInterception] Ability to enable TLS interception on demand #391

Closed

mikenye mentioned this issue Jul 10, 2020

Add plugin "FilterByURLRegexPlugin" #397

Merged

abhinavsingh closed this as completed Nov 8, 2021
