# Crawl Data Analysis: Extracting Patterns

This notebook uses more tailored techniques to target specific "patterns" we want to extract from the crawl data. The patterns we target here were informed by the results of our preliminary clustering approaches, as well as by what we observed manually on the sites in our dataset.

This notebook is written for Python 2.7 and requires several packages (see the `import`s below). It also assumes it is running on cycles - see the `clustering-initial-analysis/clustering.ipynb` notebook for a reference on how to run Jupyter Notebooks remotely on cycles.

In [1]:
from __future__ import print_function
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from tqdm import tqdm
import numpy as np
import os
import pandas as pd

## Experiment: Runtime of reading from database vs reading from JSON

Read from the database, writing each row to a JSON file as we go.

In [2]:
from urlparse import urlparse
import json
import sqlite3

db = '/n/fs/darkpatterns/final-crawl/webtap/webtap.sqlite'
con = sqlite3.connect(db)
con.row_factory = sqlite3.Row
cur = con.cursor()

query = """select * from
    segments as sg left join site_visits as sv on sv.visit_id = sg.visit_id
    where lower(sg.node_name) <> 'body' and sg.inner_text <> ''
    limit 25000
"""

segments_json = 'output/segments_webtap.json'

with open(segments_json, 'w') as f:
    for row in tqdm(cur.execute(query)):
        domain = urlparse(row['site_url']).hostname
        row_d = dict(row)
        row_d['domain'] = domain
        json.dump(row_d, f)
        f.write('\n')

25000it [00:14, 1692.19it/s]


Read from JSON file.

In [3]:
with open(segments_json, 'r') as f:
    for line in tqdm(f):
        segment = json.loads(line)

25000it [00:18, 1337.69it/s]


**Conclusion**: Reading from the JSON file was not faster than reading from the database in this sample: 25,000 rows took about 18 s from JSON (about 1,338 rows/s) versus about 14 s from the database (about 1,692 rows/s), even though the database loop also wrote each row out to the JSON file. The sample is small relative to the full database (about 7.5M rows), so it may not have been enough data to expose a real difference. In any case, space constraints on my account on cycles rule out writing the entire contents of the database to a JSON file on disk, so we have to read from the database.
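One confound in the timing above is that the database loop also writes each row to disk. A minimal sketch of a cleaner comparison follows, in which the JSON file is written outside of any timed region. The in-memory table, its two-column schema, and the helper names `time_db_read` and `time_json_read` are assumptions made for illustration; they are not part of the crawl database.

```python
import json
import os
import sqlite3
import tempfile
import time


def time_db_read(con, query):
    """Iterate over every row of the query and return (row_count, seconds)."""
    start = time.time()
    count = 0
    for _ in con.execute(query):
        count += 1
    return count, time.time() - start


def time_json_read(path):
    """Parse every line of a JSON-lines file and return (row_count, seconds)."""
    start = time.time()
    count = 0
    with open(path, 'r') as f:
        for line in f:
            json.loads(line)
            count += 1
    return count, time.time() - start


# Toy stand-in for the crawl database (hypothetical schema and data).
demo_con = sqlite3.connect(':memory:')
demo_con.execute('create table segments (visit_id integer, inner_text text)')
demo_con.executemany('insert into segments values (?, ?)',
                     [(i, 'text %d' % i) for i in range(1000)])

# Write the JSON file first, outside of any timed region.
fd, demo_path = tempfile.mkstemp(suffix='.json')
with os.fdopen(fd, 'w') as f:
    for visit_id, inner_text in demo_con.execute('select * from segments'):
        json.dump({'visit_id': visit_id, 'inner_text': inner_text}, f)
        f.write('\n')

db_count, db_secs = time_db_read(demo_con, 'select * from segments')
json_count, json_secs = time_json_read(demo_path)
os.remove(demo_path)

print('db: %d rows in %.4fs; json: %d rows in %.4fs'
      % (db_count, db_secs, json_count, json_secs))
```

On the real data, the same two helpers could be pointed at the `webtap.sqlite` connection and at `segments_json`, ideally over several repeated runs, since disk caching can dominate a single measurement.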

## Popups

In this section, we extract popup dialogs from the data and inspect various features of them.

In [13]:
db = '/n/fs/darkpatterns/final-crawl/webtap/webtap.sqlite'
con = sqlite3.connect(db)
con.row_factory = sqlite3.Row
cur = con.cursor()

query = """select * from
    segments as sg left join site_visits as sv on sv.visit_id = sg.visit_id
    where lower(sg.node_name) <> 'body' and sg.inner_text <> ''
    limit 50000
"""

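# Collect (url, domain, text, z-index) for every segment. Despite the name,
# `popups` holds all segments; no popup filter has been applied yet.
popups = []
for row in tqdm(cur.execute(query)):
    url = row['site_url']
    domain = urlparse(url).hostname
    text = row['inner_text']
    style = json.loads(row['style'])
    z = style['z-index']

    popups.append((url, domain, text, z))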

50000it [00:34, 1469.31it/s]


Group by z-index and report the distribution of z-indices over the segments.

In [29]:
from collections import defaultdict

popups_by_z = defaultdict(int)
for url, domain, text, z in popups:
    popups_by_z[str(z)] += 1

# Sort (z-index, count) pairs by count, most frequent first.
sorted_zs = sorted(popups_by_z.items(), key=lambda pair: pair[1], reverse=True)

for z, count in sorted_zs:
    print('%s: %d' % (z, count))

auto: 47174
1: 1172
9999: 271
2: 174
50: 126
300: 115
200: 99
0: 98
10: 90
9999999: 58
9: 47
999: 46
99: 44
3: 41
11: 41
5: 39
100: 35
1000: 35
4: 26
6: 26
999999: 20
-9: 19
90: 18
-1: 18
999999999: 17
99999: 16
600: 15
10000: 8
2147483647: 8
1046: 8
999998: 7
7: 6
1001: 6
10001: 5
99990: 5
996: 5
30: 5
20: 4
1000000: 4
500: 4
201: 4
15: 4
2000: 4
2001: 4
1006: 4
150: 3
597: 3
110: 2
4000: 2
-100: 2
1003: 2
99999999: 1
22: 1
8: 1
108: 1
125: 1
101: 1
9999998: 1
16: 1
1060: 1
54: 1
100020: 1


This indicates that the typical segment has a z-index of "auto" (47,174 of the 50,000 segments), which puts it at the same depth as most of the page. Sites use many different z-index values to achieve popups, but the convention seems to be a value > 0 for popups and, possibly, a value < 0 for hidden content. We did not expect to see negative values and should explore this further.
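To make the sign-based reading of the distribution easier to check, the raw values can be collapsed into a few buckets. The following is a minimal sketch; the helper names `parse_z` and `z_bucket` are hypothetical, and the sample values are a handful taken from the output above rather than the full dataset. Any non-numeric value is grouped with "auto", which is an assumption about how such values should be treated.

```python
from collections import Counter


def parse_z(z):
    """Return the z-index as an int, or None for 'auto' and other non-numeric values."""
    try:
        return int(z)
    except (TypeError, ValueError):
        return None


def z_bucket(z):
    """Map a raw z-index to 'auto', 'negative', 'zero', or 'positive'.

    Non-numeric values other than 'auto' also land in the 'auto' bucket.
    """
    value = parse_z(z)
    if value is None:
        return 'auto'
    if value < 0:
        return 'negative'
    if value == 0:
        return 'zero'
    return 'positive'


# A few z-index values seen in the distribution above, as strings or ints.
sample_zs = ['auto', '1', '9999', '0', '-9', '-1', 2147483647]
bucket_counts = Counter(z_bucket(z) for z in sample_zs)
print(bucket_counts.most_common())
```

Applied to the fourth element of each tuple in `popups`, this would show how many segments fall on each side of zero without having to scan the full list of distinct values.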

**Goal** (2:36pm 2/26/19): Emulate what `dismiss_dialog.js` does to build a filter for popups. Then run the segments through this filter and inspect the results.
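As a placeholder until that filter exists, a first-pass heuristic based only on the observations above might look like the sketch below. It does not reproduce `dismiss_dialog.js`, whose logic is not shown in this notebook. The function name `is_popup_candidate`, the `min_z` threshold, and the use of a `position` key in the style dictionary are all assumptions; only `z-index` is known to be present in the stored style.

```python
def is_popup_candidate(style, min_z=1):
    """Hypothetical heuristic: keep segments with a numeric z-index >= min_z.

    If the style also records a 'position' (an assumption), statically
    positioned segments are rejected, since z-index has no effect on them.
    """
    try:
        z_value = int(style.get('z-index', 'auto'))
    except (TypeError, ValueError):
        # 'auto' and other non-numeric values are never candidates.
        return False
    if z_value < min_z:
        return False
    return style.get('position') != 'static'


# Hypothetical sample segments as (inner_text, style) pairs.
sample_segments = [
    ('Subscribe to our newsletter', {'z-index': '9999', 'position': 'fixed'}),
    ('Main article text', {'z-index': 'auto', 'position': 'static'}),
    ('Hidden helper', {'z-index': '-1', 'position': 'absolute'}),
    ('Sticky header', {'z-index': 10}),
]

candidates = [text for text, style in sample_segments
              if is_popup_candidate(style)]
print(candidates)
```

The sticky header passing the filter illustrates the main limitation: a positive z-index alone cannot separate popups from other layered elements, so a real filter would likely need further signals such as size or placement on the page.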