# hide
title: Alice visits a website (DRAFT)
tags: privacy data ads

In [1]:
# hide
import re
import sys
import json
import urllib.parse
sys.path.insert(0, "..")

import pandas as pd
import numpy as np
import graphviz
from har import *

In [2]:
# hide-code
class Graph:
    
    THEMES = {
        "user": {"fillcolor": "#a0f0a0"},
        "browser": {"fillcolor": "#e0e0e0"},
        "urchin": {"fillcolor": "#f0a0a0"},
        "website": {"shape": "box", "fillcolor": "#e0e0f0"},
        "static": {"shape": "house", "fillcolor": "#e0e0f0"},
        "static-urchin": {"shape": "house", "fillcolor": "#f0a0a0"},
        "pod": {"shape": "note", "fillcolor": "#a0f0f0"},
    }
    
    def __init__(self, **kwargs):
        kwargs.setdefault("engine", "neato")
        self.dot = graphviz.Digraph(**kwargs)
        self.dot.attr("graph", size="10,10")
        self.default_kwargs = {
            "fontname": "Helvetica"
        }
    
    def node(self, id: str, label: str, theme: str = None, **kwargs):
        merged_kwargs = self.default_kwargs.copy()
        merged_kwargs.update(kwargs)
        if theme in self.THEMES:
            merged_kwargs.update(self.THEMES[theme])
                
        merged_kwargs.setdefault("style", "filled")
        self.dot.node(id, label, **merged_kwargs)
    
    def edge(self, *args, **kwargs):
        merged_kwargs = self.default_kwargs.copy()
        merged_kwargs.update(kwargs)
        merged_kwargs.setdefault("fontsize", "12")
        self.dot.edge(*args, **merged_kwargs)
    
    def request(self, n1: str, n2: str, label: str = None, **kwargs):
        kwargs.setdefault("taillabel", label)
        kwargs.setdefault("color", "#a8a8a8")
        kwargs.setdefault("fontcolor", "#a8a8a8")
        #kwargs.setdefault("labeldistance", "2.5")
        self.edge(n1, n2, **kwargs)
    
    def response(self, n1: str, n2: str, label: str = None, **kwargs):
        kwargs.setdefault("taillabel", label)
        kwargs.setdefault("color", "#404070")
        kwargs.setdefault("fontcolor", "#404070")
        #kwargs.setdefault("labeldistance", "2.5")
        self.edge(n1, n2, **kwargs)
        
    def display(self):
        from IPython.display import display, HTML
        svg = self.dot._repr_svg_()
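        marker = 'DTD/svg11.dtd">'
        # Drop the XML/DOCTYPE preamble so that only the <svg> element is embedded.
        svg = svg[svg.index(marker) + len(marker):]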
        display(HTML(svg))

In [3]:
# hide-code
def alice_graph(with_static=True, with_urchin=False, alice_length=None, urchin_length="2.0"):
    g = Graph()
    g.node("a", "Alice", "user")
    g.node("b", "browser", "browser")
    g.node("1", "website", "website")
    if with_static:
        g.node("1c", "website CDN", "static")
    
    g.request("a", "b", "click", len=alice_length)
    g.response("b", "a", "show")
    g.request("b", "1", len="1.5")
    g.response("1", "b", "html/js")
    if with_static:
        g.request("b", "1c", len="1.5")
        g.response("1c", "b", "img/media")
    
    if with_urchin:
        g.node("u", "Urchin", "urchin")
        g.request("b", "u", "Hey, i'm visiting website", len=urchin_length)
        g.response("u", "b", "display ads")
    return g

Alice wants to visit a website. She clicks something in her browser and up comes the page.

In [4]:
# hide-code
alice_graph(with_static=False).display()

To keep the picture clear, the [Internet Service Providers](https://en.wikipedia.org/wiki/ISP) that actually carry the content are not shown.

The website's server delivers some [HTML](https://en.wikipedia.org/wiki/HTML) and other files to show the page. Huge files, or stuff that is needed quickly everywhere, may come from a [Content Delivery Network](https://en.wikipedia.org/wiki/Content_delivery_network) that the website runs itself, pays for, or simply uses for free.

Take [jQuery](https://jquery.com/), for example: a [JavaScript](https://en.wikipedia.org/wiki/JavaScript) library that helps make web pages more interactive and fancy, and which may have saved JavaScript as the language of web browsers from being replaced by something more convenient. This library can be downloaded at any time from
- `https://cdnjs.cloudflare.com/ajax/libs/jquery/3.5.1/jquery.min.js` 
- `https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js` 
- or a dozen other places. 

So when someone builds a website with jQuery, the file can be delivered along with the rest of the page, or the browser can fetch it from one of the *CDN*s. *They* say that this is

> *Good for web performance because the browser might already have it in its [cache](https://en.wikipedia.org/wiki/Web_cache) and does not necessarily have to load it.*

In [5]:
# hide-code
alice_graph(alice_length="2.5").display()

If the browser does load the file, though, it sends some information to the CDN, such as
- Alice's current [IP address](https://en.wikipedia.org/wiki/IP_address),
- any associated [cookies](https://en.wikipedia.org/wiki/HTTP_cookie),
- and potentially the name of the website Alice is visiting, for example [https://anonyme-alkoholiker.de](https://anonyme-alkoholiker.de).
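
To make the list above more concrete, here is a minimal sketch of such a request, built with Python's standard library but never actually sent. The cookie and user agent values are made up for illustration. Note that the IP address is not a header at all: the CDN sees it from the network connection itself. Also, depending on the referrer policy in effect, a browser may send only the origin of the page instead of the full address.

```python
import urllib.request

# Illustrative values only; nothing here comes from a real capture.
PAGE_URL = "https://anonyme-alkoholiker.de/"
CDN_URL = "https://cdnjs.cloudflare.com/ajax/libs/jquery/3.5.1/jquery.min.js"

# Build (but do not send) the kind of request a browser might make for the CDN file.
request = urllib.request.Request(
    CDN_URL,
    headers={
        "Referer": PAGE_URL,                   # the page that embeds the file
        "Cookie": "visitor_id=abc123",         # made-up cookie set earlier by the CDN
        "User-Agent": "Mozilla/5.0 (example)", # made-up browser identification
    },
)

print("Host contacted:", request.host)
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Every one of these header lines travels to the CDN each time the file is not served from the browser's cache.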

Some websites simply don't do this. They deliver everything by themselves. Other websites do this extensively and require ten other services to deliver [CSS](https://en.wikipedia.org/wiki/CSS) styles, fonts, images, scripts and whatever else. The Alcoholics Anonymous website mentioned above actually delivers its own jQuery, but a sub-library of jQuery requires the CSS from `https://code.jquery.com/ui/1.11.4/themes/smoothness/jquery-ui.css?ver=5.6.2`.
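
One way to check which other services a page relies on is to scan its HTML for resources that live on a different host. The following is a rough sketch using only the standard library. The HTML snippet is made up, the helper names are hypothetical, and a real page can also pull in third parties from inside its scripts and stylesheets, which this simple scan does not see.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ResourceHosts(HTMLParser):
    """Collect the hosts of resources referenced via src/href attributes."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        # Only tags that make the browser fetch something; plain links (<a>) are skipped.
        if tag not in ("script", "link", "img", "iframe"):
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                host = urlparse(value).hostname
                if host:  # relative URLs have no host: they stay on the site itself
                    self.hosts.add(host)


# Made-up page for illustration.
EXAMPLE_HTML = """
<html><head>
  <script src="/js/jquery.min.js"></script>
  <link rel="stylesheet"
        href="https://code.jquery.com/ui/1.11.4/themes/smoothness/jquery-ui.css?ver=5.6.2">
</head><body><img src="/img/logo.png"></body></html>
"""


def third_party_hosts(html, own_host):
    """Return all referenced hosts other than the website's own host."""
    parser = ResourceHosts()
    parser.feed(html)
    return sorted(host for host in parser.hosts if host != own_host)


print(third_party_hosts(EXAMPLE_HTML, "anonyme-alkoholiker.de"))
```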

Anyway, that's how websites get delivered. And, by the way, web development is complicated and expensive, and websites need revenue, so they place advertising on their pages.

In [6]:
# hide-code
alice_graph(alice_length="1.0", with_urchin=True).display()

The [Urchin](https://en.wikipedia.org/wiki/Urchin_\(software\)) might actually be a multinational corporation. In any case, the website owner allows this other network to place [banner ads](https://en.wikipedia.org/wiki/Web_banner), [web analytics](https://en.wikipedia.org/wiki/Web_analytics) and possibly *unknown* things on his own page.

The owner of the website, let's call him Bob, gets some money each time Alice clicks on an ad. He can also study Alice's reasons for visiting his page and what she looked at there. If he wants to see the statistics of all the visits to his page, he goes to Urchin's website.

In [7]:
# hide-code
g = alice_graph(with_urchin=True, urchin_length="1.5")

g.node("o", "Bob", "user")
g.node("b2", "browser", "browser")
g.node("ud", "Urchin's\nSecret\nData", "urchin")
g.node("uw", "Urchin's website", "website")

g.request("u", "ud", "stash data", len="1.5")

#g.request("b2", "u", "Hey, i'm visiting website", len="1.5")
g.response("u", "b2", "display ads")

g.request("o", "b2", "click", len="2.0")
g.response("b2", "o", "show")
g.request("b2", "uw", len="2.0")
g.response("uw", "b2", "html/js")
g.request("uw", "ud", len="2.0")
g.response("ud", "uw", "deliver data for Bob")

g.request("b2", "u", "I'm clicking this", len="1.5")
#g.response("u", "b2", "display ads")

g.display()

On Urchin's website, Bob can see some of the data that has been gathered for him, but no more. While Urchin may have data about a million websites, it only shows Bob the stuff related to his own website (and only the stuff that is *legal*). Bob's browser has no direct connection to Urchin's secret data. But, of course, Bob's clicks on Urchin's website are collected just like anybody else's.
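
The difference between what Bob sees and what Urchin knows can be sketched as a simple filter over one shared pile of records. Everything below is made up for illustration: the record fields, the site names and the `report_for` helper are assumptions, not any real tracker's data model.

```python
from collections import Counter

# Made-up records standing in for what a tracker might store.
RECORDS = [
    {"site": "bobs-site.example", "visitor": "alice", "page": "/"},
    {"site": "bobs-site.example", "visitor": "alice", "page": "/shop"},
    {"site": "bobs-site.example", "visitor": "carol", "page": "/"},
    {"site": "other-site.example", "visitor": "alice", "page": "/forum"},
]


def report_for(site, records):
    """Page-view counts for one site; records of other sites are left out."""
    return Counter(r["page"] for r in records if r["site"] == site)


# What Bob gets to see: statistics about his own website only.
print(dict(report_for("bobs-site.example", RECORDS)))

# What the tracker itself can do: follow one visitor across all websites.
print(sorted({r["site"] for r in RECORDS if r["visitor"] == "alice"}))
```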

Bob thinks: 

> *This is great, I get some money for running my website and I can see what people typed into Google before they came to my website, and how long they stayed on each page, etc...*

Urchin thinks: 

> *This is great, another website included our advertising/analytics framework, so we can build better statistics, track more web users and deliver ads that get clicked more often by individual people so our customers that place ads through our system will pay more.*

Alice probably just thinks: 

> *I like the website, only the ads are a bit annoying.*

Personally, I think:

> *This is a downward spiral to corporate hell*

[Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee), the inventor of the World Wide Web (often loosely called the *Inventor of the Internet*), thinks:

> *The whole internet has gone wrong. Let's develop something new that puts our own data back into our own control.*

So his idea is that your personal data, like browsing and online-shopping behaviour or medical data, is stored in a [Pod](https://docs.inrupt.com/developer-tools/javascript/client-libraries/reference/glossary/?highlight=personal%20data#term-Pod), which is a secured storage space on some server.

Alice [registers](https://docs.inrupt.com/developer-tools/javascript/client-libraries/tutorial/getting-started/?highlight=register#register-your-pod-and-create-your-profile) a Pod and can then give read or write access for particular details to other entities via 
[Access Control Policies (ACP)](https://github.com/solid/authorization-panel/tree/master/proposals/acp). Next time Alice visits the website, it might say:

*Urchin wants to have access to the following personal data:*
- [x] Visits to this website
- [x] Clicks on this website
- [x] List of items you bought in the last 30 days
- [x] Your calendar
- [x] Your [Steam](https://en.wikipedia.org/wiki/Steam_\(service\)) records
- [x] Your medical history

*Allow?*

Alice might not like that at all and decline. She's still annoyed by the ads on the website, maybe even more so because they are no longer [personalized](https://en.wikipedia.org/wiki/Personalized_marketing), but the list looked too frightening.
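
The idea of granting access per category of data can be sketched with a toy model. This is emphatically not the Solid or Inrupt API, and it does not implement Access Control Policies. The `Pod` class, its methods and the stored values are all hypothetical and only illustrate the principle that nothing is readable unless Alice has allowed it.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Toy model of a personal data store (hypothetical, not a real Pod API)."""

    owner: str
    data: dict = field(default_factory=dict)
    grants: dict = field(default_factory=dict)  # requester -> set of readable categories

    def allow(self, requester, categories):
        """The owner grants read access to the given categories."""
        self.grants.setdefault(requester, set()).update(categories)

    def read(self, requester, category):
        """Return the data only if the owner has granted access to it."""
        if category not in self.grants.get(requester, set()):
            raise PermissionError(f"{requester} may not read {category!r}")
        return self.data[category]


# Made-up contents for illustration.
pod = Pod("Alice", data={
    "visits": ["website.example"],
    "medical history": ["(private)"],
})

# Alice ticks only one of the boxes.
pod.allow("Urchin", {"visits"})

print(pod.read("Urchin", "visits"))
try:
    pod.read("Urchin", "medical history")
except PermissionError as error:
    print("denied:", error)
```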


In [8]:
# hide-code
g = alice_graph(alice_length="2.", with_static=False)
g.node("pod", "Alice's\npersonal\ndata", "pod")
g.node("u", "Urchin", "urchin")

g.request("b", "pod", "sign up", len="2.0")
g.request("b", "pod", "I'm visiting website", len="2.0")
g.request("b", "u", "Allow some access", len="2.0")
g.response("u", "b", "deliver ads")
g.request("u", "pod", "request data", len="2.0")
g.response("pod", "u", "deliver data")

g.display()

Let's assume that the technology of Pods is completely secure and only Alice herself decides what goes in and what comes out and to whom. Still, we have Urchin lingering on the page, freely executing JavaScript and collecting data about Alice as usual.

So this is not merely a technological issue. A community like Tim Berners-Lee's [Inrupt](https://inrupt.com/about) must convince the major website owners to restrict the Urchin's actions to those allowed by Alice's profile. This has started with the [General Data Protection Regulation](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) in Europe, but it is nowhere near useful at the moment.

How can we trust Urchin or Bob? Well, if Alice's Pod contains **all** of her data at some point in the future, the Urchins will want access, so they'll need to play nice. We will see.

So that's what the web should look like, and this is what it looks like right now:

In [9]:
# hide-code
HOST_THEMES = {
    "website": [
        re.compile("website"),
        re.compile(r"redd\.?it"),
        re.compile(r"futalis\.de"),
    ],
    "static": [
        re.compile("cdn"),
    ],
    "static-urchin": [
        re.compile("gstatic"),
    ],
    "urchin": [
        re.compile("aaxdetect"),
        re.compile("adform"),
        re.compile("ads"),
        re.compile(r"awin1\.com"),
        re.compile("doubleclick"),
        re.compile("google"),
        re.compile("insightexpressai"),
        re.compile("redintelligence"),
        re.compile(r"m-t\.io"),
        re.compile(r"openx\."),
        re.compile("quantcount"),
        re.compile("quantserve"),
        re.compile("scorecardresearch"),
        re.compile("webgains"),
    ],
}

def get_host_theme(host: str):
    """Return the theme of the first pattern group that matches `host`, or None."""
    for key, regs in HOST_THEMES.items():
        for r in regs:
            if r.search(host):
                return key
    return None

g = Graph(engine="fdp")
g.node("b", "browser", "browser")
g.node("a", "Alice", "user")
g.request("a", "b", label="click")
g.response("b", "a", label="present")

har = HarFile("hars/ebay/*reddit*.har")

for data in har.connections():
    host = data["host"]
    g.node(host, host, get_host_theme(host))
    g.request("b", host, arrowsize=str(1+data["strength"]*2))
    if data["res"]:
        label = "/".join(
            sorted(data["res_type"].keys(), key=lambda k: -data["res_type"][k])
        )
        g.response(host, "b", label=label)

g.display()

Alice just opened reddit.com without any script or ad blockers, clicked the *Yes-allow-all-just-leave-me-alone* button and browsed for 5 minutes.
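
The graph above is built from a recorded browsing session in [HAR](https://en.wikipedia.org/wiki/HAR_\(file_format\)) format, read with a helper module that is not shown here. As a rough stand-in for that helper, the following sketch counts requests per host using only the standard library. It assumes the usual HAR layout (`log` → `entries` → `request` → `url`), and the example document is tiny and made up.

```python
import json
from collections import Counter
from urllib.parse import urlparse

# A tiny, made-up HAR document; real files contain many more fields per entry.
EXAMPLE_HAR = json.loads("""
{"log": {"entries": [
  {"request": {"url": "https://website.example/"},
   "response": {"content": {"mimeType": "text/html"}}},
  {"request": {"url": "https://cdn.website.example/logo.png"},
   "response": {"content": {"mimeType": "image/png"}}},
  {"request": {"url": "https://ads.urchin.example/pixel.gif?page=home"},
   "response": {"content": {"mimeType": "image/gif"}}},
  {"request": {"url": "https://ads.urchin.example/banner.js"},
   "response": {"content": {"mimeType": "application/javascript"}}}
]}}
""")


def requests_per_host(har):
    """Count how many requests the browser sent to each host."""
    return Counter(
        urlparse(entry["request"]["url"]).hostname
        for entry in har["log"]["entries"]
    )


for host, count in requests_per_host(EXAMPLE_HAR).most_common():
    print(f"{count:3d}  {host}")
```

A file like this can be saved from the network tab of a browser's developer tools and loaded with `json.load` instead of the inline example.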

`tp` stands for [tracking pixel](https://en.wikipedia.org/wiki/Tracking_pixels): the browser requests *an image*, but the image itself is meaningless. What matters is the request, which carries a lot of information, much more than what comes back.
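
How a tracking pixel carries information can be sketched in a few lines. The tracker host, the parameter names and the values below are hypothetical. The point is only that the data rides along in the address of the image, so the tracker can read it back from its own request log no matter what image it returns.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def pixel_url(page, referrer, screen):
    """Build the address of a 'pixel' image with the collected data in the query."""
    query = urlencode({"page": page, "ref": referrer, "screen": screen})
    return f"https://urchin.example/pixel.gif?{query}"  # hypothetical tracker host


# Browser side: a script on the page assembles the address and requests the "image".
url = pixel_url(
    page="https://website.example/article",
    referrer="https://search.example/?q=why+am+i+sad",
    screen="1920x1080",
)
print(url)

# Tracker side: the same data is recovered from the requested address.
received = parse_qs(urlparse(url).query)
for key, values in received.items():
    print(f"{key}: {values[0]}")
```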