# Archive

Some of the notebooks here examine the HTML served by the domains identified by DomainTools. To make that work a bit faster we're going to create an archive of the web content using the [WARC](https://en.wikipedia.org/wiki/Web_ARChive) format, so we don't have to keep going back to the web to fetch it. The [warcio](https://github.com/webrecorder/warcio) library lets us easily write and read WARC data, which is essentially a concatenation of all the web content along with the HTTP request and response headers that document when and how it was requested. The HTTP metadata is important for analyzing where things are redirecting to. warcio also works nicely with the [requests](https://requests.readthedocs.io/en/master/) library, which makes fetching the content easy. So let's start by importing those.

In [2]:
# warcio requires capture_http to be imported before requests
from warcio.capture_http import capture_http
import requests
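
To give a feel for the read side, here is a minimal sketch (not part of the original notebook) that lists the target URI and HTTP status code of every response record in a WARC stream. The helper name `list_responses` is made up for illustration; `ArchiveIterator`, `rec_type`, `rec_headers.get_header` and `http_headers.get_statuscode` are the calls shown in the warcio README.

```python
def list_responses(stream):
    """Return (target_uri, status_code) for each response record in a WARC stream."""
    # Imported here so the helper is self-contained.
    from warcio.archiveiterator import ArchiveIterator

    results = []
    for record in ArchiveIterator(stream):
        # Skip request, warcinfo and other non-response records.
        if record.rec_type != 'response':
            continue
        uri = record.rec_headers.get_header('WARC-Target-URI')
        # http_headers may be absent on records that carry no HTTP payload.
        status = record.http_headers.get_statuscode() if record.http_headers else None
        results.append((uri, status))
    return results
```

Assuming `path` points at a `.warc.gz` file, calling it would look like `with open(path, 'rb') as stream: list_responses(stream)`. The status code comes back as a string such as `'200'`.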

We also need the DomainTools data:

In [3]:
import pandas

df = pandas.read_csv('data/domaintools/2020-04-07.csv.gz',
    parse_dates=['created'], 
    sep='\t',
    names=['domain', 'created', 'risk']
)
df.head()

| | domain | created | risk |
|---|---|---|---|
| 0 | covid-19vacuna.net | 2020-04-06 | 99 |
| 1 | coronafiji.online | 2020-04-06 | 99 |
| 2 | thecoronacon.com | 2020-04-06 | 99 |
| 3 | covid19simplecottonmasksforhospitals.fail | 2020-04-06 | 99 |
| 4 | covidplasmasavealife.com | 2020-04-06 | 99 |


Since all we have is a domain, we will need to fish around a little to see whether the website is served over http or https, and whether a www prefix is needed. These two functions do the fishing around.

The first one, `get`, fetches a URL over HTTP and returns a URL if the response was a 200 OK. If the response was not 200 OK, or some kind of exception was thrown, it returns None. Since requests follows redirects, the URL that is returned could be different from the URL that was supplied.

In [4]:
def get(url):
    try:
        resp = requests.get(url, timeout=5)
        if resp.status_code == 200:
            # return the final url after potential redirects 
            return resp.url
        else:
            return None
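    except Exception:
        # timeouts, DNS failures, TLS errors etc. all count as "not found";
        # unlike a bare except, this still lets Ctrl-C interrupt the crawl
        return None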

You can give `get_homepage` a domain and it will attempt to get the "home page" for that domain by first trying the https variant, then the http variant, and then trying both again with a "www" prefix on the domain. The first variant to return content causes the function to return the found URL. Note that the found URL could be different from the URL that was checked, because HTTP redirects may be involved (often the case with malware).

In [5]:
def get_homepage(domain):  
    urls = [
        "https://" + domain,
        "http://" + domain,
        "https://www." + domain,
        "http://www." + domain
    ]
        
    for url in urls:
        found_url = get(url)
        if found_url:
            return found_url
    
    return None
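
The fallback order can be exercised without touching the network by passing the fetch function in as an argument. This is a sketch for illustration rather than the notebook's own code: `candidate_urls`, `first_reachable` and the stub `fake_get` are hypothetical names.

```python
def candidate_urls(domain):
    """URL variants in the order described above: https, http, then both with www."""
    return [
        scheme + prefix + domain
        for prefix in ('', 'www.')
        for scheme in ('https://', 'http://')
    ]


def first_reachable(domain, fetch):
    """Return the first truthy result of fetch(url) over the candidates, else None."""
    for url in candidate_urls(domain):
        found = fetch(url)
        if found:
            return found
    return None


# Stub fetcher: pretend only the http://www. variant answers.
def fake_get(url):
    return url + '/' if url.startswith('http://www.') else None


print(candidate_urls('example.com'))
print(first_reachable('example.com', fake_get))  # http://www.example.com/
```

Passing the real `get` function as `fetch` would give the same behavior as `get_homepage`.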

Now we just need to iterate through the domains, fetch them from the web, and write them to a WARC file that we are going to name using the current date. We are also going to keep track of the URL where we found content.

In [None]:
import sys
import pathlib
import datetime

today = datetime.date.today().strftime('%Y-%m-%d')
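warc_file = pathlib.Path('data') / "warc" / "domaintools-{}.warc.gz".format(today)
# make sure the output directory exists before the WARC file is opened
warc_file.parent.mkdir(parents=True, exist_ok=True)
print("writing to {}".format(warc_file))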

found = {}
with capture_http(warc_file.as_posix()):
    for domain in df.domain:
        url = get_homepage(domain)
        if url:
            sys.stdout.write('+')
        else:
            sys.stdout.write('-')
        found[domain] = url

writing to data/warc/domaintools-2020-04-08.warc.gz
++--+++++++-++-++--++++++-+++-++++++-+-+-+++
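
The `+`/`-` trace scrolls by quickly, but the same information is kept in the `found` dict. As a rough sketch, a tally could look like the following; `summarize_found` is a hypothetical helper and the sample dict is made up to stand in for the real results.

```python
def summarize_found(found):
    """Count domains with and without a reachable home page."""
    hits = sum(1 for url in found.values() if url)
    misses = sorted(domain for domain, url in found.items() if not url)
    return {'found': hits, 'missing': len(misses), 'missing_domains': misses}


# Made-up sample standing in for the real `found` dict.
sample = {
    'example.com': 'https://example.com/',
    'example.org': None,
    'example.net': 'http://www.example.net/',
}
print(summarize_found(sample))
# {'found': 2, 'missing': 1, 'missing_domains': ['example.org']}
```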