# Analysis with `crawl`

`crawl` was designed with BigQuery in mind as the target for data analysis. Its implementation is flexible, so it shouldn't be hard to get it working with other tools as well. This notebook is a tutorial on using BigQuery to analyze your crawl data.

To make things easier, we'll use the BigQuery Jupyter extension. Most of the analysis here will just use BigQuery's [Standard SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/).

In [1]:
%load_ext google.cloud.bigquery

## Basic query

If you want to know about a single page, or a group of pages, a simple query will answer most questions.

In [2]:
%%bigquery
SELECT Address.Full,
       Depth
FROM `be-analysis.crawl.distilled`
LIMIT 5

| | Full | Depth |
|---|---|---|
| 0 | https://www.distilled.net/store/profile/login/... | 1 |
| 1 | https://www.distilled.net/store/profile/login/... | 2 |
| 2 | https://www.distilled.net/store/profile/login/... | 2 |
| 3 | https://www.distilled.net/store/profile/login/... | 2 |
| 4 | https://www.distilled.net/store/profile/login/... | 2 |


## Grouping

`StatusCode`, `Depth`, and `BodyTextHash` are great candidates for grouping.

In [3]:
%%bigquery
SELECT Depth,
       COUNT(*) AS N
FROM `be-analysis.crawl.distilled`
GROUP BY Depth
ORDER BY Depth ASC
LIMIT 5

| | Depth | N |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 31 |
| 2 | 2 | 445 |
| 3 | 3 | 1394 |
| 4 | 4 | 1448 |


## Accessing nested data with UNNEST

Some characteristics of a page occur exactly once, while others occur a variable number of times. For instance, every page should have a single title tag; if a page happens to have more than one, we only concern ourselves with the first. Pages can have any number of links, on the other hand, from none to many, and we concern ourselves with all of them. Links aren't the only example: `hreflang` tags behave the same way, and so does metadata about the crawler's request, since the server returns a variable number of headers.

BigQuery handles this possibility with _repeated fields_.
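To make the idea concrete, here is a minimal plain-Python sketch of what flattening a repeated field amounts to. The rows, the `example.com` URLs, and the `unnest` helper are all hypothetical, and links are simplified to plain strings rather than the nested records the crawler stores; the sketch only mirrors the behavior of `FROM table, UNNEST(field)`, it is not how BigQuery implements it.

```python
# Hypothetical rows: each page carries a repeated `Links` field (a list).
pages = [
    {"Address": "https://example.com/",
     "Links": ["https://example.com/a", "https://example.com/b"]},
    {"Address": "https://example.com/a",
     "Links": ["https://example.com/"]},
    {"Address": "https://example.com/b",
     "Links": []},  # a page with no links at all
]


def unnest(rows, field):
    """Yield one (row, element) pair per element of the repeated field.

    Rows whose repeated field is empty produce no output, which mirrors
    the cross join performed by `FROM table, UNNEST(field)`.
    """
    for row in rows:
        for element in row[field]:
            yield row, element


# One output row per link, with the parent page's fields still available.
link_pairs = [(row["Address"], target) for row, target in unnest(pages, "Links")]
for source, target in link_pairs:
    print(source, "->", target)
```

Three pages become three link rows here, and the page without links disappears from the flattened result.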

## Understanding relationships with JOIN

Repeated fields let us answer many common questions, but not all of them. Often, once we've established that a relationship between two pages exists, we need to understand something about the _other_ page. Data about the source page is easy to access — all relevant fields are immediately available. But if a page links to another, how do we get the StatusCode field for the target page?

This requires a maneuver called a "self-join". It can be a bit hard to wrap your head around at first. However, once you understand this pattern, most other questions you have can be answered this way.

I'll present a sample query and then explain.
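As a stand-in illustration of the pattern, the following sketch runs a self-join with Python's built-in `sqlite3` module rather than BigQuery. Everything in it is an assumption made for the example: the `pages` and `links` tables, their column names, and the `example.com` URLs. SQLite has no repeated fields, so the links live in a separate table instead of being unnested.

```python
import sqlite3

# In-memory database with hypothetical tables; not the crawler's real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (address TEXT PRIMARY KEY, status_code INTEGER);
    CREATE TABLE links (source TEXT, target TEXT);
    INSERT INTO pages VALUES
        ('https://example.com/', 200),
        ('https://example.com/old', 301),
        ('https://example.com/missing', 404);
    INSERT INTO links VALUES
        ('https://example.com/', 'https://example.com/old'),
        ('https://example.com/', 'https://example.com/missing'),
        ('https://example.com/old', 'https://example.com/');
""")

# `pages` appears twice under different aliases: once as the linking page
# and once as the page being linked to. That second appearance is the
# self-join, and it is what exposes the target's status code.
rows = conn.execute("""
    SELECT source.address, target.address, target.status_code
    FROM pages AS source
    JOIN links ON links.source = source.address
    JOIN pages AS target ON target.address = links.target
    WHERE target.status_code != 200
    ORDER BY target.address
""").fetchall()

for source_address, target_address, target_status in rows:
    print(source_address, "->", target_address, target_status)
```

The key move is aliasing the same table twice, so fields of the source page and fields of the target page can sit side by side in one result row.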

## Analysis with Python

### Do you really need Python?

For most analyses, SQL should be sufficient. If you structure your queries wisely, it will also be efficient. However, some computations are awkward to express in SQL, and anything involving recursion is the usual example. But think long and hard about whether you really need recursion.

For instance — if you want to know where redirect chains are, do you really need to see the whole chain of redirects? That would require recursion. Or could you get away with having a list of URLs that redirect to pages with a 3XX status code? That can be done without recursion.
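The non-recursive version of that question can be sketched in a few lines of plain Python. The data layout below (a dictionary from URL to status code and redirect target) and the helper names are hypothetical, chosen only to show that a single lookup per page is enough.

```python
# Hypothetical crawl rows: URL -> (status code, redirect target or None).
pages = {
    "https://example.com/a": (301, "https://example.com/b"),
    "https://example.com/b": (302, "https://example.com/c"),
    "https://example.com/c": (200, None),
    "https://example.com/d": (301, "https://example.com/c"),
}


def is_redirect(status):
    """True for any 3XX status code."""
    return 300 <= status < 400


def redirects_to_redirects(pages):
    """Return URLs whose redirect target is itself a 3XX page.

    Each page needs one lookup of its target (the equivalent of one join),
    not a walk along the whole chain, so no recursion is involved.
    """
    found = []
    for url, (status, target) in pages.items():
        if not is_redirect(status) or target not in pages:
            continue
        target_status, _ = pages[target]
        if is_redirect(target_status):
            found.append(url)
    return sorted(found)


chained = redirects_to_redirects(pages)
print(chained)
```

Only `/a` is reported, because it is the only redirect that lands on another redirect; `/b` and `/d` both land on a page that returns 200.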

If you're convinced you need to use Python to process your data set, well...

### OK, here's an example.

Imagine that our data set does _not_ include `Depth`. Calculating the distance of a page from some starting page (usually the home page) would then require treating the entire crawl as a graph. The links between pages are the edges of this graph. We would perform a breadth-first search of this graph, incrementing a depth variable each time we complete a "level" without finding the page we're looking for.

This only requires two steps: first, execute a named query to associate the query results with a Python variable. Then, do something with that variable.

The first line in the next box is `%%bigquery link_graph`. That means the result of the query will be stored in the Python variable `link_graph` in subsequent cells.

In [4]:
%%bigquery link_graph
SELECT source.Address.Full AS SourceAddress,
       target.Address.Full AS TargetAddress
FROM `be-analysis.crawl.distilled` AS source, UNNEST(Links) AS target

| | SourceAddress | TargetAddress |
|---|---|---|
| 0 | https://www.distilled.net/manifesto/ | https://www.distilled.net/ |
| 1 | https://www.distilled.net/events/searchlove-bo... | https://www.distilled.net/events/searchlove-bo... |
| 2 | https://www.distilled.net/events/searchlove-bo... | https://www.distilled.net/events/searchlove-bo... |
| 3 | https://www.distilled.net/events/searchlove-bo... | https://www.distilled.net/ |
| 4 | https://www.distilled.net/events/searchlove-bo... | https://twitter.com/search?q=%23searchlove |
| 5 | https://www.distilled.net/events/searchlove-bo... | https://www.distilled.net/events/searchlove-bo... |
| 6 | https://www.distilled.net/events/searchlove-bo... | https://www.distilled.net/events/searchlove-bo... |
| 7 | https://www.distilled.net/events/searchlove-bo... | https://www.distilled.net/events/searchlove-bo... |
| 8 | https://www.distilled.net/events/searchlove-bo... | https://www.distilled.net/store/profile/passwo... |
| 9 | https://www.distilled.net/events/searchlove-bo... | https://www.distilled.net/store/profile/regist... |


Now we execute the breadth-first search, keeping track of which pages we've seen and how many levels we've traversed. When we see a page for the first time, we put an entry into a dictionary. The key is the URL and the value is the depth at which the page was found.

This will take approximately three years to complete — that's why the crawler includes `Depth` as a field by default!

In [6]:
from collections import deque

def edges(graph, node):
    # Yield every outgoing link (one DataFrame row) whose source is `node`.
    for edge in graph.loc[graph["SourceAddress"] == node].itertuples():
        yield edge

def breadth_first_search(graph, source):
    # Map each reachable URL to its click depth from `source`.
    depths = {source: 0}
    q = deque([source])
    while q:
        # Take from the left so the deque acts as a FIFO queue;
        # pop() would take the newest entry and make this depth-first.
        source = q.popleft()
        depth = depths[source]
        for edge in edges(graph, source):
            target = edge.TargetAddress
            if target not in depths:
                depths[target] = depth + 1
                q.append(target)
    return depths

depths = breadth_first_search(link_graph, "https://www.distilled.net/")

OK, more like 3 minutes. But that's for a site with this many pages:

In [8]:
%%bigquery
SELECT COUNT(Address)
FROM `be-analysis.crawl.distilled`

| | f0_ |
|---|---|
| 0 | 6887 |


And to prove it works, an example result:

In [11]:
depths['https://www.distilled.net/resources/posts/marketing/']

6
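Most of the run time above likely comes from the `edges` helper, which filters the whole DataFrame once per visited page. A possible variation, sketched here under assumptions, groups the edges by source once and then searches over a plain dictionary. The function names and the tiny edge list are hypothetical; with the real data, the edge pairs could be built from `zip(link_graph["SourceAddress"], link_graph["TargetAddress"])`.

```python
from collections import defaultdict, deque


def build_adjacency(edge_pairs):
    """Group (source, target) pairs into a source -> [targets] mapping."""
    adjacency = defaultdict(list)
    for source, target in edge_pairs:
        adjacency[source].append(target)
    return adjacency


def breadth_first_depths(adjacency, start):
    """Return the minimum number of clicks from `start` to each reachable node."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()  # first in, first out keeps the search level by level
        # .get() avoids adding empty entries for pages with no outgoing links.
        for target in adjacency.get(node, ()):
            if target not in depths:
                depths[target] = depths[node] + 1
                queue.append(target)
    return depths


# Hypothetical edge list standing in for the crawl's link graph.
example_edges = [
    ("/", "/a"), ("/", "/b"), ("/a", "/c"),
    ("/b", "/c"), ("/c", "/d"), ("/d", "/"),
]
example_depths = breadth_first_depths(build_adjacency(example_edges), "/")
print(example_depths)
```

Building the dictionary costs one pass over the edges, after which each page's outgoing links are a single lookup instead of a scan of every row.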