# hide
title: Alice visits a website (DRAFT)
tags: privacy data ads

In [1]:
# hide
import re
import sys
import json
import urllib.parse
sys.path.insert(0, "../..")
sys.path.insert(0, "..")

import pandas as pd
import numpy as np
from har import *
from nbgraph import Graph as GraphBase

In [2]:
# hide-code
class Graph(GraphBase):
    
    THEMES = {
        "user": {"fillcolor": "#a0f0a0"},
        "browser": {"fillcolor": "#e0e0e0"},
        "urchin": {"fillcolor": "#f0a0a0"},
        "website": {"shape": "box", "fillcolor": "#e0e0f0"},
        "static": {"shape": "house", "fillcolor": "#e0e0f0"},
        "static-urchin": {"shape": "house", "fillcolor": "#f0a0a0"},
        "pod": {"shape": "note", "fillcolor": "#a0f0f0"},
    }
    
    def node(self, id: str, label: str, theme: str = None, **kwargs):
        if theme in self.THEMES:
            for key, value in self.THEMES[theme].items(): 
                kwargs.setdefault(key, value)
        super().node(id, label, **kwargs)
        
    def request(self, n1: str, n2: str, label: str = None, **kwargs):
        kwargs.setdefault("taillabel", label)
        kwargs.setdefault("color", "#a8a8a8")
        kwargs.setdefault("fontcolor", "#a8a8a8")
        #kwargs.setdefault("labeldistance", "2.5")
        self.edge(n1, n2, **kwargs)
    
    def response(self, n1: str, n2: str, label: str = None, **kwargs):
        kwargs.setdefault("taillabel", label)
        kwargs.setdefault("color", "#404070")
        kwargs.setdefault("fontcolor", "#404070")
        #kwargs.setdefault("labeldistance", "2.5")
        self.edge(n1, n2, **kwargs)

In [3]:
# hide-code
def alice_graph(with_static=True, with_urchin=False, alice_length=None, urchin_length="2.0"):
    g = Graph()
    g.node("a", "Alice", "user")
    g.node("b", "browser", "browser")
    g.node("1", "website", "website")
    if with_static:
        g.node("1c", "website CDN", "static")
    
    g.request("a", "b", "click", len=alice_length)
    g.response("b", "a", "show")
    g.request("b", "1", len="1.5")
    g.response("1", "b", "html/js")
    if with_static:
        g.request("b", "1c", len="1.5")
        g.response("1c", "b", "img/media")
    
    if with_urchin:
        g.node("u", "Urchin", "urchin")
        g.request("b", "u", "Hey, I'm visiting website", len=urchin_length)
        g.response("u", "b", "display ads")
    return g

Alice wants to visit a website. She clicks something in her browser and up comes the page.

In [4]:
# hide-code
alice_graph(with_static=False).display()

For clarity, the [Internet Service Providers](https://en.wikipedia.org/wiki/ISP) that actually carry the content are not shown.

The website's server delivers some [HTML](https://en.wikipedia.org/wiki/HTML) and other files to show the page. Huge files, or files that are needed quickly everywhere, may come from a [Content Delivery Network](https://en.wikipedia.org/wiki/Content_delivery_network) that the website runs, pays for, or simply uses for free.

One example is [jQuery](https://jquery.com/), a [JavaScript](https://en.wikipedia.org/wiki/JavaScript) framework that helps make web pages more interactive and fancy, and which may have saved JavaScript as the language of web browsers from being overrun by something more convenient. This framework can be downloaded at any time from 
- `https://cdnjs.cloudflare.com/ajax/libs/jquery/3.5.1/jquery.min.js` 
- `https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js` 
- or a dozen other places. 

So when someone creates a website using jQuery, the file can be delivered along with the rest of the page, or the browser can fetch it from one of the *CDN*s. *They* say that it's 

> *Good for web performance because the browser might already have it in [cache](https://en.wikipedia.org/wiki/Web_cache) and does not necessarily have to load it.*

In [5]:
# hide-code
alice_graph(alice_length="2.5").display()

If the browser does load it, though, it sends some information to the CDN, such as 
- Alice's current [IP address](https://en.wikipedia.org/wiki/IP_address), 
- any associated [Cookies](https://en.wikipedia.org/wiki/HTTP_cookie) 
- and potentially the name of the website Alice is visiting, for example [https://anonyme-alkoholiker.de](https://anonyme-alkoholiker.de).
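
What the CDN gets to see can be sketched with Python's standard library. The request below is only built, never sent. The cookie value is a made-up placeholder, and the IP address is not a header at all: the CDN learns it from the network connection itself.

```python
import urllib.request

# Build (but do not send) the request a browser might make for the CDN copy of jQuery.
# The Cookie value is a made-up placeholder.
request = urllib.request.Request(
    "https://cdnjs.cloudflare.com/ajax/libs/jquery/3.5.1/jquery.min.js",
    headers={
        "Referer": "https://anonyme-alkoholiker.de/",
        "Cookie": "visitor_id=made-up-example",
    },
)

# Everything in here is readable by the CDN, in addition to Alice's IP address.
seen_by_cdn = dict(request.header_items())
for name, value in seen_by_cdn.items():
    print(f"{name}: {value}")
```

Whether the `Referer` carries the full page address or only the site's origin depends on the browser and on the referrer policy the website sets.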

Some websites simply don't do this. They deliver everything by themselves. Other websites do this extensively and require 10 other services to deliver [CSS](https://en.wikipedia.org/wiki/CSS) styles, fonts, images, scripts and whatever. The Alcoholics Anonymous website mentioned above actually delivers its own jQuery, but a sub-library of jQuery requires the CSS from `https://code.jquery.com/ui/1.11.4/themes/smoothness/jquery-ui.css?ver=5.6.2`.
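
To get a feeling for how many other services a page relies on, one can compare the host of every loaded resource with the host of the page. This is only a rough sketch: the function name `third_party_hosts` is made up for this example, and comparing host names exactly also counts a site's own subdomains as foreign.

```python
from urllib.parse import urlsplit


def third_party_hosts(page_url, resource_urls):
    """Return the sorted hosts in resource_urls that differ from the page's own host."""
    page_host = urlsplit(page_url).hostname
    hosts = {urlsplit(url).hostname for url in resource_urls}
    return sorted(hosts - {page_host})


resources = [
    "https://anonyme-alkoholiker.de/",
    "https://code.jquery.com/ui/1.11.4/themes/smoothness/jquery-ui.css?ver=5.6.2",
]
print(third_party_hosts("https://anonyme-alkoholiker.de", resources))
```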

But that's how websites get delivered. And since web development is complicated and expensive, websites need some revenue, so they place advertising on their pages.

In [6]:
# hide-code
alice_graph(alice_length="1.0", with_urchin=True).display()

The [Urchin](https://en.wikipedia.org/wiki/Urchin_\(software\)) might actually be a multinational corporation, but in any case the website owner allows this other network to place [banner ads](https://en.wikipedia.org/wiki/Web_banner), [web analytics](https://en.wikipedia.org/wiki/Web_analytics) and possibly *unknown* things into its own page. 

The owner of the website, let's call him Bob, gets some money for each of Alice's clicks on a commercial. He can also study Alice's reasons for visiting his page and what she looked at there. If he wants to see the statistics of all the visits to his page, he goes to Urchin's website.    

In [7]:
# hide-code
g = alice_graph(with_urchin=True, urchin_length="1.5")

g.node("o", "Bob", "user")
g.node("b2", "browser", "browser")
g.node("ud", "Urchin's\nSecret\nData", "urchin")
g.node("uw", "Urchin's website", "website")

g.request("u", "ud", "stash data", len="1.5")

#g.request("b2", "u", "Hey, i'm visiting website", len="1.5")
g.response("u", "b2", "display ads")

g.request("o", "b2", "click", len="2.0")
g.response("b2", "o", "show")
g.request("b2", "uw", len="2.0")
g.response("uw", "b2", "html/js")
g.request("uw", "ud", len="2.0")
g.response("ud", "uw", "deliver data for Bob")

g.request("b2", "u", "I'm clicking this", len="1.5")
#g.response("u", "b2", "display ads")

g.display()

On Urchin's website, Bob can see some of the data that has been gathered for him, but no more. While Urchin may have data about a million websites, it only shows Bob the stuff related to his own website (and only the stuff that is *legal*). Bob's browser has no direct connection to Urchin's secret data. But, of course, Bob's clicks on Urchin's website are collected just like anybody else's. 

Bob thinks: 

> *This is great, I get some money for running my website and I can see what people typed into Google before they came to my website, how long they stayed on each page, etc.*

Urchin thinks: 

> *This is great, another website included our advertising/analytics framework, so we can build better statistics, track more web users and deliver ads that get clicked more often by individual people so our customers that place ads through our system will pay more.*

Alice probably just thinks: 

> *I like the website, only the ads are a bit annoying.*

Personally, I think:

> *This is a downward spiral to corporate hell*

And in fact, the internet looks like this right now:

In [8]:
# hide-code
# Patterns are raw strings so that "\." is a regex escape and not an
# invalid Python string escape. Literal dots in host names are escaped.
HOST_THEMES = {
    "website": [
        re.compile(r"website"),
        re.compile(r"redd\.?it"),
        re.compile(r"futalis\.de"),
    ],
    "static": [
        re.compile(r"cdn"),
    ],
    "static-urchin": [
        re.compile(r"gstatic"),
    ],
    "urchin": [
        re.compile(r"aaxdetect"),
        re.compile(r"adform"),
        re.compile(r"ads"),
        re.compile(r"awin1\.com"),
        re.compile(r"doubleclick"),
        re.compile(r"google"),
        re.compile(r"insightexpressai"),
        re.compile(r"redintelligence"),
        re.compile(r"m-t\.io"),
        re.compile(r"openx\."),
        re.compile(r"quantcount"),
        re.compile(r"quantserve"),
        re.compile(r"scorecardresearch"),
        re.compile(r"webgains"),
    ],
}

def get_host_theme(host: str):
    """Return the first theme whose pattern matches the host, or None."""
    for key, regs in HOST_THEMES.items():
        for r in regs:
            if r.search(host):
                return key
    return None

g = Graph(engine="fdp")
g.node("b", "browser", "browser")
g.node("a", "Alice", "user")
g.request("a", "b", label="click")
g.response("b", "a", label="present")

har = HarFile("hars/alice-visits-reddit-stripped.har")

actions = har.get_actions()
max_v = max(a["receive_params"] / a["count"] for a in actions.values())
evilness = {
    host: a["receive_params"] / max_v
    for host, a in actions.items()
}

for data in har.connections():
    host = data["host"]
    theme = get_host_theme(host)
    evil = evilness[host]
    act = actions[host]
    tooltip = []
    
    kwargs = {}
    if theme and "urchin" in theme:
        if act['receive_params'] // 1024:
            tooltip.append(f"{act['receive_params'] // 1024}kb parameters received")
        if act["send_tp"]:
            evil += .3
            tooltip.append(f"{act['send_tp']}x tracking-pixels")
        if act["send_js_canvas"]:
            evil += .5
            tooltip.append(f"{act['send_js_canvas']}x canvas-fingerprinting")
        evil += max(0, act["in_out_ratio"]-1)
        evil = max(.1, min(1, evil))
        color = [1, 1-evil*.3, 1-evil*.3]
        color = "#" + "".join("%02x" % max(0, int(c*255)) for c in color)
        kwargs["fillcolor"] = color
    
    tooltip.insert(0, host)
    tooltip = "\n".join(tooltip)
    
    g.node(host, host, theme, tooltip=tooltip, **kwargs)
    g.request("b", host, arrowsize=str(1+data["strength"]*2))
    if data["res"]:
        label = "/".join(
            sorted(data["res_type"].keys(), key=lambda k: -data["res_type"][k])
        )
        g.response(host, "b", label=label, labeldistance="3.")

for host, data in har.get_dependency_graph(with_text_match=False).items():
    for host_to in data["to"]:
        g.edge(host, host_to, color="#c0a0a0")

g.display()

Alice just opened **reddit.com** without any script/ad-blockers, clicked the *Yes-allow-all-just-leave-me-alone* button and browsed for 5 minutes. You can get an [HTTP Archive](https://en.wikipedia.org/wiki/HAR_\(file_format\)) file of Alice's session [here](https://github.com/defgsus/blog/src/har_research/hars/alice-visits-reddit-stripped.har).
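
A HAR file is plain JSON, so it can also be inspected without the helper module used in this notebook. The sketch below relies only on the `log.entries[].request.url` layout of the HAR format. The function name `requests_per_host` and the three example URLs are made up for illustration.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def requests_per_host(har: dict) -> Counter:
    """Count the requests in a parsed HAR document, grouped by host name."""
    return Counter(
        urlsplit(entry["request"]["url"]).hostname
        for entry in har["log"]["entries"]
    )


# A tiny hand-written stand-in for a real file. To inspect a real session,
# parse the downloaded file with json.load() instead.
example = json.loads("""
{"log": {"entries": [
    {"request": {"url": "https://www.reddit.com/"}},
    {"request": {"url": "https://static.example/script.js"}},
    {"request": {"url": "https://www.reddit.com/r/all/"}}
]}}
""")
print(requests_per_host(example).most_common())
```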

The red connection lines mean that the content loaded from server **A** requested more content from server **B**. For example, the contents loaded from **googlesyndication.com** (whatever that actually is) requested more content from **retailads.net**, **quantserve.com**, **webgains.com**, etc. 
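
One plausible way to find such connections, not necessarily the one used for the graph above, is to look at the `Referer` header of each request in the HAR file: it names the document that caused the request. The function name `referer_edges` and the `.example` hosts are made up for this sketch.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def referer_edges(entries):
    """Map each referring host to the set of other hosts requested on its behalf."""
    edges = defaultdict(set)
    for entry in entries:
        request = entry["request"]
        target = urlsplit(request["url"]).hostname
        for header in request.get("headers", []):
            if header["name"].lower() == "referer":
                source = urlsplit(header["value"]).hostname
                # Skip requests that stay on the same host.
                if source and source != target:
                    edges[source].add(target)
    return dict(edges)


entries = [  # hand-written entries, not taken from Alice's session
    {"request": {
        "url": "https://server-b.example/banner.js",
        "headers": [{"name": "Referer", "value": "https://server-a.example/frame.html"}],
    }},
    {"request": {
        "url": "https://server-a.example/style.css",
        "headers": [{"name": "Referer", "value": "https://server-a.example/frame.html"}],
    }},
]
print(referer_edges(entries))
```

This misses requests where the browser strips or shortens the `Referer`, so the result is a lower bound on the real dependencies.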

The more intense the color of an Urchin, the more information got transmitted to that server. This includes [Tracking Pixels](https://en.wikipedia.org/wiki/Tracking_pixels): the browser requests *an image*, but the image itself is meaningless. The information in the request is the true content of that transmission. It might contain the results of [canvas fingerprinting](https://en.wikipedia.org/wiki/Canvas_fingerprinting) and other techniques that identify Alice's browsing device, regardless of deleted cookies or a changed browser user profile. 
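
How a tracking pixel carries data can be shown by taking such a request apart. The URL below is invented for illustration, including the host and all parameter names. The point is only that the query string, not the returned image, is the payload.

```python
from urllib.parse import urlsplit, parse_qs

# A made-up pixel URL; host and parameter names are placeholders.
pixel_url = (
    "https://urchin.example/pixel.gif"
    "?page=https%3A%2F%2Fwww.reddit.com%2F&screen=1280x617&uid=made-up-id"
)

# parse_qs decodes the percent-encoding and returns a list of values per key.
payload = {key: values[0] for key, values in parse_qs(urlsplit(pixel_url).query).items()}
for key, value in payload.items():
    print(key, "=", value)
```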

Some top-notch methods for profiling are presented on [browserleaks.com](https://browserleaks.com/). (You should either be okay with Google ads or have your ad blockers up to date before visiting the page.)

So what is this *information* that is passed? Here is what Google's **doubleclick.net** regularly receives from Alice's browser:

In [9]:
# hide-code
def dump_params(e: dict):
    for p in e["request"]["queryString"]:
        if len(p["value"]) > 2:
            print(p["name"], "=", p["value"])
dump_params(har.filtered({"request.url": "doubleclick", "response.content.size": lambda s: s < 1000})[0])

xai = AKAOjsvbFEnw9d44qIKiGbLQnehTRfDxz5CoAAn6OnbsoelS7slvu7Rpox3J1a1oKw2jYP-vykNA9NOEljMZEuncJEj920OzdGNfaYCxJNeBUzvWWf0WMnlIAfow3GbMd7k5CR8ojdUWsCqvq-2jeAoMA1ZOZl6R7QMlFpxRs1oaz70eb2Hk4QsPu29f_Ingx1_hGnsgU3PePlBDbixnV7Lb8FCQ25iuUEwam1uQTy83kmnsbcyLdylrks9_GHJ_OfbvmtQy9N42eC6K2Ye-PF4coLQg15C9VIq3ZZxrf9IGlOziUNAJTb8cqQINLv04gj-BgCxlk_sU
sai = AMfl-YSqvI-2l3RwDTbIbkzSIzQwB1byK-wALfER95GT6QFW4OGumcGp-lY1SWzKNkedtVw66oU_JYDzDd6x1Pa6JJmGyQiYmapdvTLh5Ry-5tS8mTyEPfsl4YQ8s1hqfeo
sig = Cg0ArKJSzKjrJDYGyymXEAE


And **webgains.com**:

In [10]:
# hide-code
dump_params(har.filtered({"request.url": "webgains.com", "response.content.size": lambda s: s < 1000})[0])

callback = hitCallback
wgpayload = FOa44iFBBNlY5Du4UXuKrnZ2CI9XkPrwXjm_YrJFW73AuyPB884akiEocEcEJ1w.Cs5uQ1szHVyVxFAk.rpwoNJ9z4oYYLzZKyJcbZpMIrkJXTiEocEcEJ1w.7bhpfze1r6zdstlDJFW73E4QCwby91Sp0alnjk3nKxUC54725H5UWBL6hqeFV.Ld_lHVxX_AD_AKtgtIzZzQmpRnoyDDbbaMrjbQKBcCdDSI6KUMnGWpwoNSUC56MnGWVQdg3ZLQ0F42p9..DgcOQ_i.uJtHoqvynx9MsFyxYMAqJkL6f1BSypw.5B0KB8D1Re4GSr_U_9zWuz3YMJ5tTma1kW0SX3NlY5DtTpuy.DQJ
wgcookie = {"wgifp12595":["99582","12595","723181","","1614816668","https%3A%2F%2F365534f451099c0661cae2249111d71b.safeframe.googlesyndication.com%2Fsafeframe%2F1-0-37%2Fhtml%2Fcontainer.html","","","1770336668","62681100006646100710680011523028"]}
wgchecksum = da59b14c869d4fb06f7ea8f903687b18
userIP = 78.54.127.247
wgtime = 1614816668


(*wgchecksum* could actually be a canvas fingerprint, judging by the name, because their fingerprinting code comes from 
`https://analytics-wg.webgains.io/tech-essence-clk.min.js`)

**aaxads.com**:

In [11]:
# hide-code
dump_params(har.filtered({"request.url": "aaxads", "response.content.size": lambda s: s < 1000})[1])
#har.filtered({"request.url": "aaxads", "response.content.size": lambda s: s < 1000}).dump()

___stu13p = aveoaamactga5dnnuee25ti2rm86bcrodqacb
lwbsh = AAX
dewh = SSP_CLIENT_gcp_eweu
dgw = desktop
flg = AAX763KC6
fw = BERLIN
skw = 617
slg = 8PR6YK195
gq = reddit.com
vhuyqdph = rtb-nv-dcos-ssp-10-6-34-207-6203
vyu = 030308_203_030312_71_ssp
yk = 617
yz = 1280
ylg = 00001614816541085013121943041473
vvsDeExfnhw = CONTROL
gdss = green
jgivwu = Y-N
xvs_ogi = false
xvs_vwulqj = 1YN-
jixqgo = 1200
jwg = 100
qjixqgo = 1200
ugo = 800
ghqg = 535
uhtxuo = https://www.reddit.com/
nzui = https://www.google.com/?&


**googlesyndication.com**:

In [12]:
# hide-code
dump_params(har.filtered({"request.url": "googlesyndication", "request.queryString": lambda s: len(s) >= 4})[1])

id = sodar2
v = 221
li = gpt_2021022501
jk = 1725576623835237
bg = !1tWl1ZbNAAWsVXnBrDsAKQB2-DxaIGAokhMs2ErfKFwljupXH0xkdsydCe9lAnMCojNvN6PKv4uhAgAAANNSAAAACGgBBwoA7V6tQWriezhBsWP1wv5WVoejfVT3YzYgFzhTVpjH3BoKgKTnDckqhfRSjjcrOgQnhKnDVRop-dQfRmYWRJdFlPwrIXCtL_RohVpoWCLtpp3o42m4yGGp6qkRAxEsoCh7ZAUMaEL3O6m27BJuvpgeUWZdJkJoGWszTvsaE0ULl4ApaDUSZzw_xaPc1iP9YXAJ_oRDB1PuTquxS0pZ4hz1Dgwfdcyk0PHLVMMTJsbRzE2eHHpXafMy-mcB0CCuNy87z2Svux-aNIp8lLhSntyFwf2UJQB1M0-o_STNlc6XTHiwCCZxdQugmLrBfkqzeJkB5e8ilj1hzcMZE0_4CQHo_OylhSyBApPFyFKzpFEIUHDmAvZWI22eotbX0fjMn7_3DHt-LJFk2mjjAmFpDlRnKNhhNaFU74JpZP1-dXfPAUZuf63cFZtSt0u7UZwi1VKepWhqTwz4a8fHVWdAxICs0EiRwFL1u8MFQiNBZV8nXVsGZQ-6q9-vljfQKJ-bveeeBalPRZs7uMEVhswHMhTsJJWElOpnROf4E87UtvGs9QaPBfcJZxtiBGDwCZ-OUT-LRos_q-DN4Ek7Nvdt--taDkmOO1TPlIpKl9pwIwQYKPBcjDE3GS1S4m9YlKfBg14pZCQr0e3wjBqKavfJhsndJt_ySAk4b7Qm1lB4nZbRPQb5CJxzwlN83xEXxTHAm64iv24GmEGYx-MOwO4wQx6p6kalywvxy0UDAssrJffDymYic5u4zYx3fCcgugEFA24SCgwvUx8eH59ndCR1cdHlrE8JEoMT2lY7KDWoNBpnAYus85In2S7mukCgLvp4BgemOKF-

And so on... 

There is no way to tell what the encrypted messages actually contain without [reverse engineering](https://en.wikipedia.org/wiki/Reverse_engineering) the service, which is not entirely legal and very cumbersome, as the JavaScript code is almost always [minified](https://en.wikipedia.org/wiki/Minification_\(programming\)) and [obfuscated](https://en.wikipedia.org/wiki/Obfuscation_\(software\)), as if *they* had something to hide:

In [13]:
# hide-code
text = har.filtered({"request.url": "doubleclick", "response.content.mimeType": "javascript"})[0]["response"]["content"]["text"]
print(text[text.index("Apache-2.0\n*/\n")+14:1000] + "...")

var ba,aa,da,ea,fa,ja,la,na,ka,oa,pa,qa,ta,va,xa,za,Ba,Da,Ea,Ga,Ia,Ja,Ka,La,Ma,Oa,Pa,Ya,ab,ib,lb,ub,vb,xb,Eb,Hb,Ib,Kb,Nb,Ob,Lb,Sb,Ub,cc,ec,gc,ic,jc,mc,nc,oc,pc,rc,sc,tc,vc,wc,yc,Ac,Ec,Hc,Ic,Nc,Qc,Rc,Sc,Vc,Xc,Zc,$c,ad,dd,id,jd,kd,ld,md,nd,od,qd,td,yd,I,zd,Ad,Bd,Cd,Dd,y,Ed,Fd,Gd,Kc,Hd,Id,Md,Nd,Od,ce,de,be,ae,ee,fe,ge,ia,he,ie,je,ke,Dc,L;ba=function(a,b){b=aa(a,b);return 0>b?null:"string"===typeof a?a.charAt(b):a[b]};aa=function(a,b){for(var c=a.length,d="string"===typeof a?a.split(""):a,e=0;e<c;e++)if(e in d&&b.call(void 0,d[e],e,a))return e;return-1};_.ca=function(a,b){return 0<=Array.prototype.indexOf.call(a,b,void 0)};da=function(a,b){b=Array.prototype.indexOf.call(a,b,void 0);var c;(c=0<=b)&&Array.prototype.splice.call(a,b,1);return c};ea=function(a){var b=a.length;if(0<b){for(var c=Array(b),d=0;d<b;d++)c[d]=a[d];return c}return[]};fa=function(a,b,c){return 2>=arguments.length?Array.proto...


Am I alone in my concern? 

Seems like [Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee), the inventor of the World Wide Web, is concerned as well. His idea is that Alice's personal data, like browsing and shopping behaviour or even medical data, is voluntarily stored in a [Pod](https://docs.inrupt.com/developer-tools/javascript/client-libraries/reference/glossary/?highlight=personal%20data#term-Pod), which is a secured storage space on some server. 

Alice [registers](https://docs.inrupt.com/developer-tools/javascript/client-libraries/tutorial/getting-started/?highlight=register#register-your-pod-and-create-your-profile) a Pod and can then give read or write access for particular details to other entities via 
[Access Control Policies (ACP)](https://github.com/solid/authorization-panel/tree/master/proposals/acp). Next time Alice visits the website, it might say:

*Urchin wants to have access to the following personal data:*

- [x] Visits to this website and all affiliated websites
- [x] Clicks on this website and all affiliated websites
- [x] List of items you shopped in the last 30 days
- [x] Your calendar
- [x] Your [Steam](https://en.wikipedia.org/wiki/Steam_\(service\)) records
- [x] Your medical history

*Allow?*

Alice might not like that at all and decline. She's still annoyed by the commercials on the website, maybe even more so because they are not [personalized](https://en.wikipedia.org/wiki/Personalized_marketing) any more, but the list looked too frightening.
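
The idea of granting access per detail can be modelled in a few lines. This is a toy model only: the class `Pod` and its methods are invented here and have nothing to do with the real Solid or ACP interfaces. It just shows that a reader gets exactly the items Alice allowed and nothing else.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Toy stand-in for a Pod: data items plus per-agent read grants."""
    data: dict
    grants: dict = field(default_factory=dict)  # agent name -> set of readable keys

    def allow(self, agent: str, *keys: str) -> None:
        """Let the agent read the given keys."""
        self.grants.setdefault(agent, set()).update(keys)

    def read(self, agent: str, key: str):
        """Return the item if the agent was granted access, else raise."""
        if key not in self.grants.get(agent, set()):
            raise PermissionError(f"{agent} may not read {key!r}")
        return self.data[key]


pod = Pod(data={"visits": ["website"], "medical history": "private"})
pod.allow("Urchin", "visits")

print(pod.read("Urchin", "visits"))
try:
    pod.read("Urchin", "medical history")
except PermissionError as error:
    print("declined:", error)
```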


In [14]:
# hide-code
g = alice_graph(alice_length="2.", with_static=False)
g.node("pod", "Alice's\npersonal\ndata", "pod")
g.node("u", "Urchin", "urchin")

g.request("b", "pod", "sign up", len="2.0")
g.request("b", "pod", "I'm visiting website", len="2.0")
g.request("b", "u", "Allow some access", len="2.0")
g.response("u", "b", "deliver ads")
g.request("u", "pod", "request data", len="2.0")
g.response("pod", "u", "deliver data")

g.display()

Let's assume that the technology of Pods is completely secure and only Alice herself decides what goes in, what comes out and to whom. We still have Urchin lingering on the page, freely executing JavaScript and collecting data about Alice as usual.

So this is not merely a technological issue. A community like Tim Berners-Lee's [Inrupt](https://inrupt.com/about) must convince the major website owners to restrict Urchin's actions to those allowed by Alice's profile. This has started with the [General Data Protection Regulation](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) in Europe, but it is nowhere near useful at the moment. 

How can we trust Urchin or Bob? Well, if Alice's Pod eventually contains more data than Urchin is able to collect, it will want to have access, so it will need to play nice. We will see.

What other options are there?