# Crawl Data Analysis: Clustering

This notebook applies several clustering techniques to our web crawl data. It was written for Python 2.7 and assumes it is running on cycles. To view and edit the notebook remotely:

- Clone the GitHub repo to cycles (e.g. spin.cs.princeton.edu)
- Start this notebook. Jupyter is not installed globally, but you can install it locally with `pip install --user jupyter`. Run the notebook inside a tmux session: `tmux`, then `cd [this directory]`, then `jupyter notebook --no-browser --port 8889` (any port works, but the remaining steps assume 8889). Copy the URL that Jupyter prints; this is the URL you will open in your browser. Then press Ctrl-B, D to detach from the tmux session, and log out of cycles.
- On your local machine, forward your local port 8889 to the remote port 8889 on cycles: `ssh -L 8889:localhost:8889 [netid]@spin.cs.princeton.edu`
- Now you can open the notebook in your browser by pasting the link you copied earlier.

In [1]:
from __future__ import print_function
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import os

## Read from database

Read the crawl data from the database. The `site_visits` table is loaded into a DataFrame here; the `segments` table is joined to it later in a streaming query rather than loaded in full.

In [2]:
import sqlite3
import pandas as pd

# db = '/n/fs/darkpatterns/crawl/2018-12-08_segmentation_pilot2/2018-12-08_segmentation_pilot2.sqlite'
db = '/mnt/ssd/amathur/20190206-205000_segmentation_pilot/20190206-205000_segmentation_pilot.sqlite'
con = sqlite3.connect(db)
site_visits = pd.read_sql_query('''SELECT * from site_visits''', con)

In [3]:
print('Number of site visits: %s' % str(site_visits.shape))
print('site_visits columns: %s' % str(list(site_visits.columns.values)))

Number of site visits: (26496, 3)
site_visits columns: ['visit_id', 'crawl_id', 'site_url']


Report how many unique domains we have.

In [4]:
from urlparse import urlparse

site_visits['domain'] = site_visits['site_url'].apply(lambda x: urlparse(x).netloc)
grouped = site_visits.groupby(['domain']).count().sort_values('visit_id', ascending=False)

In [5]:
print('Number of unique domains: %s' % str(grouped.shape[0]))

Number of unique domains: 5799
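
The domain count above can be sketched without pandas. The snippet below is a minimal illustration, assuming a handful of made-up URLs (they are not from the crawl). It also shows that `netloc` keeps the port while `hostname` drops it, which matters because the cells in this notebook use both attributes.

```python
try:
    from urlparse import urlparse  # Python 2.7, as used in this notebook
except ImportError:
    from urllib.parse import urlparse  # Python 3
from collections import Counter

# Hypothetical URLs, for illustration only.
urls = [
    'http://example.com/a',
    'http://example.com/b',
    'https://shop.example.org:8443/cart',
]

# Count visits per network location, most frequent first.
domain_counts = Counter(urlparse(u).netloc for u in urls)
print(domain_counts.most_common())
print('Number of unique domains: %d' % len(domain_counts))

# netloc includes an explicit port; hostname does not.
parsed = urlparse(urls[2])
print(parsed.netloc, parsed.hostname)
```

If ports appear in the crawled URLs, grouping by `netloc` and deduplicating by `hostname` can give slightly different groupings.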


In [6]:
# The segments table is not loaded into memory; it is streamed row by row instead.
# segments = pd.read_sql_query('''SELECT * from segments''', con)

In [7]:
from urlparse import urlparse
from collections import defaultdict
import binascii
import json
from tqdm import tqdm

DB_NUM = 1  # odin crawl
# DB_NUM = 2  # webtap crawl

con = sqlite3.connect(db)
con.row_factory = sqlite3.Row
cur = con.cursor()

query = """SELECT sv.site_url, sv.visit_id,
    sg.id, sg.node_name, sg.node_id, sg.top, sg.left, sg.width, sg.height, 
    sg.num_buttons, sg.num_imgs, sg.num_anchors,
    TRIM(sg.inner_text) as inner_text, TRIM(sg.longest_text) as longest_text
    FROM segments as sg LEFT JOIN site_visits as sv
    ON sv.visit_id = sg.visit_id WHERE
    LOWER(sg.node_name) <> 'body' AND TRIM(sg.inner_text) <> ''
    """
# seen_checksums = defaultdict(set)
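
To see what the join and its filters keep, here is a minimal sketch against an in-memory SQLite database. The table contents are invented for illustration, and the query is a trimmed-down version of the one above (fewer columns, plus an `ORDER BY` for a stable result). Iterating over the cursor yields one row at a time, which is what allows streaming instead of loading the whole table.

```python
import sqlite3

con_demo = sqlite3.connect(':memory:')
con_demo.row_factory = sqlite3.Row  # rows can be indexed by column name
con_demo.executescript("""
    CREATE TABLE site_visits (visit_id INTEGER, site_url TEXT);
    CREATE TABLE segments (id INTEGER, visit_id INTEGER,
                           node_name TEXT, inner_text TEXT);
    INSERT INTO site_visits VALUES (1, 'http://example.com/');
    INSERT INTO segments VALUES (10, 1, 'BODY', 'whole page text');
    INSERT INTO segments VALUES (11, 1, 'DIV', '  Only 3 left!  ');
    INSERT INTO segments VALUES (12, 1, 'SPAN', '   ');
""")

demo_query = """SELECT sv.site_url, sg.id, sg.node_name,
    TRIM(sg.inner_text) AS inner_text
    FROM segments AS sg LEFT JOIN site_visits AS sv
    ON sv.visit_id = sg.visit_id WHERE
    LOWER(sg.node_name) <> 'body' AND TRIM(sg.inner_text) <> ''
    ORDER BY sg.id"""

# The body segment and the whitespace-only segment are filtered out.
kept = []
for row in con_demo.execute(demo_query):
    kept.append((row['id'], row['inner_text']))
print(kept)
```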

In [8]:
segment_json = "segments_odin.json"
# segment_json = "segments_webtap.json"

In [None]:
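import re

# Start from a clean output file; ignore the error if it does not exist yet.
try:
    os.remove(segment_json)
    print("Removed %s " % segment_json)
except OSError:
    pass

# Per-hostname CRC32 checksums of the segment texts already written.
seen_checksums = defaultdict(set)
with open(segment_json, "a") as f:
    for row in tqdm(cur.execute(query)):
        # Replace digit runs with a placeholder and flatten line breaks.
        # (str.replace does not interpret regular expressions, so use re.sub.)
        inner_processed = re.sub(r'\d+', 'DPNUM', row["inner_text"])
        inner_processed = inner_processed.replace('\n', ' ').replace('\r', '')
        hostname = urlparse(row["site_url"]).hostname
        inner_processed_crc = binascii.crc32(inner_processed.encode('utf-8'))
        # Skip text already seen on this hostname.
        if inner_processed_crc in seen_checksums[hostname]:
            continue
        seen_checksums[hostname].add(inner_processed_crc)
        row_d = dict(row)
        row_d["inner_processed"] = inner_processed
        del row_d["inner_text"]
        # Write the processed record, not the raw row, as one JSON object per line.
        json.dump(row_d, f)
        f.write("\n")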

Removed segments_odin.json 


1822950it [04:09, 7319.93it/s]
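
The per-hostname deduplication can be tried on a few records in isolation. This is a minimal sketch, assuming the `\d+` replacement is meant as a regular-expression substitution; the `normalize` helper and the sample records are hypothetical and not part of the notebook.

```python
import binascii
import re
from collections import defaultdict

def normalize(text):
    """Replace digit runs with DPNUM and flatten line breaks (hypothetical helper)."""
    text = re.sub(r'\d+', 'DPNUM', text)
    return text.replace('\n', ' ').replace('\r', '')

# Made-up (hostname, inner_text) pairs, for illustration only.
records = [
    ('example.com', 'Only 3 left!'),
    ('example.com', 'Only 12 left!'),   # same text after normalization
    ('example.org', 'Only 3 left!'),    # same text, different hostname
    ('example.com', 'Free\nshipping'),
]

seen = defaultdict(set)  # hostname -> CRC32 checksums already kept
unique = []
for hostname, text in records:
    processed = normalize(text)
    crc = binascii.crc32(processed.encode('utf-8'))
    if crc in seen[hostname]:
        continue  # duplicate on this hostname
    seen[hostname].add(crc)
    unique.append((hostname, processed))

print(unique)
```

Storing a 32-bit checksum rather than the full text keeps memory use low across millions of segments, at the cost of occasionally dropping a distinct segment whose checksum collides with an earlier one on the same hostname.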

In [None]:
! wc -l segments_odin.json