# ArchiveTeam Growth

This notebook lets you calculate the growth of the data deposited at the Internet Archive by ArchiveTeam. It uses the Internet Archive API to look through the metadata for all of their items.

## Install

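To get started you'll need the [internetarchive](https://github.com/jjjake/internetarchive) Python package (for example via `pip install internetarchive`), which provides the `ia` command line tool. Then you'll need to save your Internet Archive account login details. If you don't have an account, go over to archive.org and create one, and then: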

    % ia configure
    
## Internet Archive Metadata

Now we're ready to load and use the [internetarchive](https://github.com/jjjake/internetarchive) Python extension:
    

In [1]:
import internetarchive as ia

Internet Archive organizes their stuff according to *collections*, which can contain one or many *items*. Each item can then contain one or many files. Collections and items have identifiers that uniquely identify them. For example, if you know the item identifier `archiveteam_tumblr20181221175524` you can go get it and print its metadata:

In [2]:
from pprint import pprint

item = ia.get_item('archiveteam_tumblr20181221175524')
pprint(item.metadata)

{'addeddate': '2018-12-21 19:03:50',
 'collection': ['archiveteam_tumblr', 'archiveteam'],
 'date': '2018-12',
 'identifier': 'archiveteam_tumblr20181221175524',
 'language': 'eng',
 'mediatype': 'web',
 'publicdate': '2018-12-21 19:03:50',
 'title': 'Archive Team Tumblr Tumbledown: 20181221175524',
 'uploader': 'me@harrycross.me'}


Notice that the item identifier `archiveteam_tumblr20181221175524` contains a date/time? We're going to take a bit of a leap here and assume that the date contained in there is the date that the WARC data was collected from SavePageNow. This might in fact not be the case, unless we learn more from Internet Archive about the provenance of this data.
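
If you wanted to pull that date/time out of the identifier, here is a minimal sketch. It assumes the identifier ends in a 14-digit `YYYYMMDDHHMMSS` timestamp, which is only a guess based on the one identifier shown above, and `timestamp_from_identifier` is a made-up helper name, not part of the internetarchive library.

```python
import re
from datetime import datetime

# Assumption: the identifier ends in a 14-digit YYYYMMDDHHMMSS timestamp.
TIMESTAMP_PATTERN = re.compile(r'(\d{14})$')


def timestamp_from_identifier(item_id):
    """Return the trailing timestamp as a datetime, or None if there isn't one."""
    m = TIMESTAMP_PATTERN.search(item_id)
    if m is None:
        return None
    try:
        return datetime.strptime(m.group(1), '%Y%m%d%H%M%S')
    except ValueError:
        # Fourteen digits that don't form a valid date/time.
        return None


print(timestamp_from_identifier('archiveteam_tumblr20181221175524'))
# 2018-12-21 17:55:24
```

Returning `None` instead of raising makes it easy to skip identifiers that don't follow the pattern.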

There's actually much more detailed metadata available for the files in the item. Here's how many files are in the item:

In [3]:
print(len(item.item_metadata['files']))

7


We can see what metadata is available for the first file:

In [4]:
pprint(item.item_metadata['files'][0])

{'btih': 'fb9192f71567e55112fdcc8bbec9961431a94d21',
 'crc32': '988d5423',
 'format': 'Archive BitTorrent',
 'md5': 'd5545927ddd7f87e06d5985553c568f3',
 'mtime': '1545421126',
 'name': 'archiveteam_tumblr20181221175524_archive.torrent',
 'sha1': '08b02a02f9c601d0d7d42a2adc409f2a5197e422',
 'size': '258891',
 'source': 'metadata'}


See the *size* property? That's the size in bytes of the file.
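
Note that *size* comes back as a string, and it isn't guaranteed to be present on every file record. Here is a small sketch of adding the sizes up from a list of file dictionaries shaped like the one above. The `total_size` helper and its `suffix` filter are my own illustration, and the second and third file entries are made up.

```python
def total_size(files, suffix=None):
    """Sum the 'size' of each file record, optionally only for names ending in suffix."""
    total = 0
    for f in files:
        if 'size' not in f:
            # Some records may not report a size; skip them.
            continue
        if suffix is not None and not f.get('name', '').endswith(suffix):
            continue
        total += int(f['size'])  # sizes are strings in the metadata
    return total


files = [
    {'name': 'archiveteam_tumblr20181221175524_archive.torrent', 'size': '258891'},
    # Hypothetical entries, made up for illustration:
    {'name': 'example.warc.gz', 'size': '1000'},
    {'name': 'example_files.xml'},
]

print(total_size(files))                     # 259891
print(total_size(files, suffix='.warc.gz'))  # 1000
```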

## Fetch the Data

So now we know enough to write a function that, given an item's identifier, returns the date the item was added and the total size of its files.

In [5]:
import re

def item_summary(item_id):
    item = ia.get_item(item_id)

    # Add up the size of every file that reports one.
    size = 0
    for file in item.item_metadata['files']:
        if 'size' in file:
            size += int(file['size'])

    # addeddate looks like '2018-12-21 19:03:50'; keep just the date part.
    m = re.match(r'(\d\d\d\d-\d\d-\d\d)', item.item_metadata['metadata']['addeddate'])
    date = m.group(1)

    return date, size

Let's give it a try:

In [6]:
print(item_summary('archiveteam_tumblr20181221175524'))

('2018-12-21', 53766825137)


The internetarchive Python library doesn't offer an abstraction for collections. But it does provide a way to search for a collection and iterate through the items that it contains. If you know the name of your collection you can iterate through it pretty easily. So let's see how many items are contained in the collection:

In [7]:
num_items = len(ia.search_items('collection:archiveteam'))
print(num_items)

493237


That's quite a few. If it takes a second to get the metadata for each item we are going to need to wait a while:

In [8]:
print(num_items / 60 / 60.0, 'hours')

137.0102777777778 hours
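
The same back-of-the-envelope estimate is a bit easier to read as days and hours. This is just a sketch using the standard library; `estimate_duration` is a made-up helper, and one second per item is the same guess as above, not a measurement.

```python
from datetime import timedelta


def estimate_duration(num_items, seconds_per_item=1.0):
    """Rough wall-clock estimate for fetching metadata one item at a time."""
    return timedelta(seconds=num_items * seconds_per_item)


estimate = estimate_duration(493237)
print(estimate)  # 5 days, 17:00:37
```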


Ok, so let's go through each item, get the size and day for the item, and store them in a dictionary. Since there may be more than one item in a day it's important to add to the existing value for a date instead of simply storing it.

In [None]:
import os

sizes = {}

if not os.path.isfile('ArchiveTeam.csv'):

    for result in ia.search_items('collection:archiveteam'):
        date, size = item_summary(result['identifier'])
        sizes[date] = sizes.get(date, 0) + size
        print(result['identifier'], date, size, sizes[date])
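
To see the accumulation step on its own, without hitting the API, here is a small sketch that does the same per-day adding with a `defaultdict`. The `accumulate_sizes` name is mine, and the second and third sample rows are made up; only the first comes from the item looked at earlier.

```python
from collections import defaultdict


def accumulate_sizes(summaries):
    """Add up sizes per date from an iterable of (date, size) pairs."""
    sizes = defaultdict(int)
    for date, size in summaries:
        # Several items can share a date, so add rather than overwrite.
        sizes[date] += size
    return dict(sizes)


summaries = [
    ('2018-12-21', 53766825137),  # the item summarized earlier
    ('2018-12-21', 1000),         # made up
    ('2018-12-22', 2000),         # made up
]

print(accumulate_sizes(summaries))
# {'2018-12-21': 53766826137, '2018-12-22': 2000}
```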

Now we can write out the *sizes* dictionary to a CSV file, where every row is a date. This way we won't need to fetch it every time we run this notebook.

In [None]:
if sizes:

    import csv

    dates = sorted(sizes.keys())

    with open('ArchiveTeam.csv', 'w', newline='') as output:
        writer = csv.writer(output)
        writer.writerow(['date', 'size'])
        for date in dates:
            writer.writerow([date, sizes[date]])
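
If you want to convince yourself that the CSV holds everything needed, here is a sketch of a round trip using an in-memory buffer instead of a file. `write_sizes` and `read_sizes` are hypothetical helpers and the numbers are made up.

```python
import csv
import io


def write_sizes(sizes, output):
    """Write one row per date, sorted by date, with a header row."""
    writer = csv.writer(output)
    writer.writerow(['date', 'size'])
    for date in sorted(sizes):
        writer.writerow([date, sizes[date]])


def read_sizes(source):
    """Read the rows back into a {date: size} dictionary."""
    return {row['date']: int(row['size']) for row in csv.DictReader(source)}


sizes = {'2018-12-22': 2000, '2018-12-21': 3000}  # made-up values

buffer = io.StringIO()
write_sizes(sizes, buffer)
buffer.seek(0)

print(read_sizes(buffer) == sizes)  # True
```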

## Analyze the Data

Now that we have our CSV of data we can analyze it a bit with [pandas](https://pandas.pydata.org/) and maybe generate a useful graph. First we'll import pandas and load in the data as a pandas DataFrame.

*Aside: I'm still learning pandas, and this is not meant to be a tutorial. If you want to learn more about all the amazing stuff you can do with it you'll want to spend some time in their excellent [tutorial](https://pandas.pydata.org/pandas-docs/stable/dsintro.html). If you work with R at all it should look pretty familiar. If not, treat this as just dipping your toe in to test the water. That's what I'm doing. If you do know pandas and notice a better way of doing any of this please let me know!*

In [None]:
import pandas as pd

sizes = pd.read_csv('ArchiveTeam.csv', index_col=0, parse_dates=True)
sizes.head()

It looks like the liveweb data started saving back in 2011. So now we've got a DataFrame that is indexed by day, with one Series *size* that contains the bytes. I don't know about you, but I find it difficult to think of size in terms of bytes. So let's use pandas to calculate a gigabyte column for us using the bytes:

In [None]:
sizes = sizes.assign(gb=lambda x: x['size'] / 1024 ** 3)
sizes.head()
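
One thing worth knowing: dividing by `1024 ** 3` strictly speaking gives gibibytes (GiB), while a decimal gigabyte (GB) is `10 ** 9` bytes. Here is a quick sketch of the difference, using the size of the single item from earlier. The helper names are made up for illustration.

```python
GIB = 1024 ** 3  # binary gigabyte (gibibyte)
GB = 10 ** 9     # decimal gigabyte


def bytes_to_gib(num_bytes):
    return num_bytes / GIB


def bytes_to_gb(num_bytes):
    return num_bytes / GB


size = 53766825137  # the item summarized earlier

print(round(bytes_to_gib(size), 2))  # 50.07
print(round(bytes_to_gb(size), 2))   # 53.77
```

Either unit is fine for a growth chart as long as the axis label matches the one you used.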

Now we can tell pandas to calculate some statistics on our data:

In [None]:
sizes.describe()

## Visualize the Data

Since we have thousands of days, it might be useful to see the stats by month rather than by day. That's not too hard to do since our DataFrame has a date index, and pandas' support for [timeseries](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries) data allows us to resample the DataFrame on a monthly basis:

In [None]:
sizes_by_month = sizes.resample('M').sum()
sizes_by_month.head()
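
If resampling feels a bit magical, here is roughly what a monthly sum amounts to, sketched with plain dictionaries. It is not what pandas does internally: this version labels months as `YYYY-MM` strings rather than month-end timestamps, and it doesn't fill in months that have no data. `monthly_totals` is a made-up name and the daily numbers are invented.

```python
from collections import defaultdict


def monthly_totals(daily):
    """Sum {'YYYY-MM-DD': size} values into {'YYYY-MM': size}."""
    totals = defaultdict(int)
    for date, size in daily.items():
        month = date[:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[month] += size
    return dict(sorted(totals.items()))


daily = {'2018-10-30': 10, '2018-10-31': 20, '2018-11-01': 5}  # made-up values

print(monthly_totals(daily))
# {'2018-10': 30, '2018-11': 5}
```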

In [None]:
%matplotlib inline

sizes_by_month['gb'].plot()

Kinda cool right!? The dip at the end is the result of me running the data collection at the beginning of November. So let's remove that:

In [None]:
# Use a pandas Timestamp so the label matches the DatetimeIndex entry for November.
sizes_without_nov = sizes_by_month['gb'].drop(pd.Timestamp('2018-11-30'))

plot = sizes_without_nov.plot(figsize=(12, 5))
plot.set_xlabel('Year')
plot.set_ylabel('Gigabytes per Month')
plot.set_title('Save-Page-Now Ingest Rate')
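
Rather than hard-coding the month to drop, you could drop every month that wasn't finished when the data was collected. Here is a sketch with plain dictionaries keyed by `YYYY-MM`. The `drop_incomplete_months` helper, the sample numbers, and the exact collection date are all made up for illustration.

```python
from datetime import date


def drop_incomplete_months(monthly, collected_on):
    """Keep only the months that ended before the collection date's month."""
    cutoff = collected_on.strftime('%Y-%m')
    # 'YYYY-MM' strings sort chronologically, so string comparison is enough.
    return {month: size for month, size in monthly.items() if month < cutoff}


monthly = {'2018-09': 40, '2018-10': 30, '2018-11': 5}  # made-up values

print(drop_incomplete_months(monthly, date(2018, 11, 2)))
# {'2018-09': 40, '2018-10': 30}
```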