Home

ekgade edited this page Nov 10, 2014 · 5 revisions

Hello,

This wiki contains information about the .gov data from the Internet Archive for the PoliInformatics Conference, November 2014. The instructions on this page deal with a subset of the nearly 90 terabytes of web archive data from the .gov domain that the Internet Archive has captured (ranging from 1995 to 2013). As a sample for this conference, I searched for a list of climate change related words (see the PhpPgAdmin wiki for details). I have created two different data sources from this search.

The first is the full text of every capture that had at least one mention of a climate change related word. I piped the title of the page, the date the capture was taken, the domain/URL, and the page content (along with checksum, code and description fields) to an Amazon RDS instance and created a GUI front end to make accessing the data a bit easier (see the PhpPgAdmin wiki). It is 252GB of data, so commands are slow to run (though much faster than running a command across all 90 terabytes on the .gov cluster). As a result, I created a TEST dataset with only 3.5GB of data. I suggest you use it to test any and all commands, and to get a sense of what sort of data is available and what it looks like, before running anything against the whole dataset.

The second is an aggregation of counts per word and per domain, grouped by month and year. That data is hosted on my github page as FullCounts8Nov_2. It has five columns: year, month, URL group, regEx pattern/word of interest, and count. The URL groups are simply the major departments of the US government, plus the House, Senate and White House. The regEx patterns are the list of words related to climate change plus "total", which is the total number of words that were captured during a given time period. Counts are simply the number of times that regular expression appeared in a given month for a given URL group. To get the total use of a term, aggregate across URL groups. To get the total words used per URL root, aggregate across terms. The code used to extract this set of captures from the .gov dataset is available on my (messy) github repository as 6Nov_climatechange_fullcounts.pig, and the python functions it calls are there under cilmate_6Nov.py. These are then aggregated using the script trend2.pig.
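
As a rough illustration of the two aggregations described above, here is a minimal Python sketch. It is not the code used to build the dataset. It assumes the counts have already been loaded into memory as tuples in the five-column order (year, month, URL group, term, count). The sample rows, the URL group names, the function names, and the choice to leave the "total" rows out of the per-URL sum are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows in the five-column layout described above:
# (year, month, url_group, term, count). All values are made up.
SAMPLE_ROWS = [
    (2008, 1, "epa.gov", "climate change", 120),
    (2008, 1, "energy.gov", "climate change", 30),
    (2008, 1, "epa.gov", "global warming", 50),
    (2008, 1, "epa.gov", "total", 10000),
    (2008, 1, "energy.gov", "total", 5000),
]


def term_totals(rows):
    """Aggregate across URL groups: one count per (year, month, term)."""
    totals = defaultdict(int)
    for year, month, _url_group, term, count in rows:
        totals[(year, month, term)] += count
    return dict(totals)


def url_group_totals(rows, skip_total=True):
    """Aggregate across terms: one count per (year, month, url_group).

    Assumption: the "total" rows (all words captured) are skipped by
    default so they are not added to the climate-term counts.
    """
    totals = defaultdict(int)
    for year, month, url_group, term, count in rows:
        if skip_total and term == "total":
            continue
        totals[(year, month, url_group)] += count
    return dict(totals)


if __name__ == "__main__":
    for key, value in sorted(term_totals(SAMPLE_ROWS).items()):
        print(key, value)
    for key, value in sorted(url_group_totals(SAMPLE_ROWS).items()):
        print(key, value)
```

Dividing a term's count by the matching "total" count for the same month gives a relative frequency, which may be easier to compare across years than the raw counts.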