# Analyzing US Museum Websites with Web Archives

This notebook was created during the [Archives Unleashed] event held in Washington DC on June 14-15.

## Background

[IMLS] maintains the [Museum Universe Data File][Museums Universal Data File], a census of museums in the United States. Conveniently, the database is available for download as a [dataset] in CSV format. The [Archive-It][ArchiveIt] folks at the Internet Archive pulled the website URLs from the CSV and used them to create a seed list for a (big) web crawl. The 10 gigabytes of [CDX files] for those crawls were made available as a dataset for use during Archives Unleashed.

With your archivist hat on (if you have one) you can think of [CDX] files as a *finding aid* for the web crawl data, which is stored in [WARC] files. They document which URLs were requested, what the [response code] was (404 Not Found, 500 Internal Server Error, 200 OK, etc.), the size of the response, the media type of the response (text/html, image/jpeg, etc.), as well as other metadata that helps in seeking within the CDX file and in the corresponding WARC file to get the actual content.
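
To make the finding-aid idea concrete, here is a minimal sketch of reading one CDX line in Python. It assumes the common 11-field, space-delimited layout (`CDX N b a m s k r M S V g`); the files in this dataset may use a different field order, which the header line at the top of each file would state. The sample line and the `parse_cdx_line` helper are made up for illustration.

```python
from collections import namedtuple

# Field names for the assumed 11-column layout "CDX N b a m s k r M S V g".
# Check the header line of your own CDX files before relying on this order.
CdxRecord = namedtuple(
    "CdxRecord",
    "urlkey timestamp url mimetype status digest redirect meta size offset filename",
)


def parse_cdx_line(line):
    """Split one space-delimited CDX line into a CdxRecord, or return None."""
    parts = line.strip().split(" ")
    if len(parts) != len(CdxRecord._fields):
        return None  # header line, blank line, or a different layout
    return CdxRecord(*parts)


# Hypothetical record, not taken from the real crawl.
sample = (
    "org,example)/about 20160301120000 http://www.example.org/about "
    "text/html 200 AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHH - - 5120 1048576 "
    "EXAMPLE-20160301.warc.gz"
)
record = parse_cdx_line(sample)
print(record.url, record.status, record.mimetype, record.size)
```

The `offset` and `filename` fields are what let a reader jump straight to the matching record in the WARC file.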

## Processing

We thought it might be interesting to augment the IMLS data file with information from the crawl, such as the total number of pages on the museum's website, the total size of resources hosted, the number of missing resources, etc. In addition, we thought it could be interesting to look at the data through the lens of the museum's income.

To augment the IMLS table with data from the CDX file, we chose to read the CDX file and count attributes of each request (status code, mimetype, size, etc.) using [Redis]. Unfortunately, we found out during processing that there are a great many duplicate URLs in the crawl: a given URL may have been crawled many times. So we needed to tally just the first occurrence of each resource and ignore the rest, so that our results don't favor resources that were requested many times by the crawler. You can see the resulting program in [index.py].

The Redis database is populated with the following keys:

* `hosts`: a sorted set of hostnames where the score is the number of pages at that hostname
* `size`: a sorted set of hostnames where the score is the total number of bytes retrieved from the hostname
* `mime-{mimetype}`: a sorted set of hostnames where the score is the total number of pages at that hostname with a given mimetype: e.g. text/html
* `status-{status code}`: a sorted set of hostnames where the score is the total number of pages that responded with a given HTTP status code: e.g. 200.
* `robots`: a set of hostnames which have robots.txt files
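
The real logic lives in index.py, which is not reproduced here. As a rough sketch of the first-occurrence tallying described above, the following uses plain Python counters in place of Redis sorted sets and a set of seen URLs for deduplication. The `tally` function, the tuple layout of the records, the rule that only a 200 response counts as having a robots.txt, and the sample data are all assumptions for illustration.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def tally(records):
    """Count each URL once, using the same key names as the list above.

    `records` is an iterable of (url, mimetype, status, size) tuples; this
    shape is an assumption, not the format index.py actually reads.
    """
    seen = set()                    # URLs already counted
    counts = defaultdict(Counter)   # key name -> Counter of hostname -> score
    robots = set()                  # hostnames with a robots.txt

    for url, mimetype, status, size in records:
        if url in seen:
            continue                # ignore repeat crawls of the same URL
        seen.add(url)

        parts = urlsplit(url)
        host = parts.hostname
        counts["hosts"][host] += 1
        counts["size"][host] += int(size)
        counts["mime-" + mimetype][host] += 1
        counts["status-" + status][host] += 1
        # Assumption: a robots.txt only counts if it was actually served.
        if parts.path == "/robots.txt" and status == "200":
            robots.add(host)

    return counts, robots


# Made-up records: the second one is a duplicate and should be skipped.
records = [
    ("http://www.example.org/", "text/html", "200", "1000"),
    ("http://www.example.org/", "text/html", "200", "1000"),
    ("http://www.example.org/missing", "text/html", "404", "200"),
    ("http://www.example.org/robots.txt", "text/plain", "200", "50"),
]
counts, robots = tally(records)
print(counts["hosts"]["www.example.org"])
print(counts["status-404"]["www.example.org"])
```

With a Redis server, each counter increment would correspond to a `ZINCRBY` on the matching sorted set and each set addition to an `SADD`.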

This indexing process took about 10 hours! If you want to play with the Redis instance yourself you can find it in the Git repository: [dump.rdb]. Once the data has been processed into Redis you can run [augment.py], which reads the existing CSV file and writes out a new CSV with these additional columns:

* pages: number of *unique* pages requested
* size: total size in bytes received from a website
* 404: number of Not Found errors at a website
* 500: number of Internal Server Errors at a website
* robots: whether the website had a robots.txt file

More columns could be added if desired, such as the number of images (jpeg, tiff, etc.). This notebook uses the [augmented CSV dataset] to examine a few things.
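
augment.py is likewise not shown in this notebook. A minimal sketch of the augmentation step might look like the following. It assumes the tallies are already available as plain dictionaries keyed by hostname (rather than read from Redis), and it uses made-up rows and a hypothetical `augment` function.

```python
import csv
import io


def augment(rows, pages, size, not_found, server_error, robots):
    """Yield each input row with the five crawl columns added.

    Hosts that never appear in the crawl tallies get zeros and False.
    """
    for row in rows:
        host = row["URL"]
        out = dict(row)
        out["Pages"] = pages.get(host, 0)
        out["Size"] = size.get(host, 0)
        out["404"] = not_found.get(host, 0)
        out["500"] = server_error.get(host, 0)
        out["Robots"] = host in robots
        yield out


# Made-up input standing in for the IMLS CSV and the crawl tallies.
source = io.StringIO(
    "Name,URL,Income\n"
    "EXAMPLE MUSEUM,www.example.org,1000\n"
    "UNCRAWLED MUSEUM,www.example.net,2000\n"
)
augmented = list(augment(
    csv.DictReader(source),
    pages={"www.example.org": 3},
    size={"www.example.org": 1250},
    not_found={"www.example.org": 1},
    server_error={},
    robots={"www.example.org"},
))

# Write the augmented rows back out as CSV (to a string here, not a file).
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=list(augmented[0].keys()))
writer.writeheader()
writer.writerows(augmented)
print(output.getvalue())
```

Defaulting to zero for hosts missing from the tallies is one possible choice; it would explain rows in the table below that show 0 pages, but that is a guess about how augment.py behaves.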

## Caveats

* Any website URLs listed in the IMLS data file that were at a path within a larger site were ignored in this analysis, because it is not clear where one website begins and another ends unless there is a naked domain. So a website identified as www.example.org/museum/ would not be included, but www.example.org would be.
* Naked domains give only a limited view of a particular organization's web space. The organization may have a lot of content at another domain. For example, a website at www.example.org might have a lot of images hosted at images.example.org.
* The web crawl ran for 3 months, but still hasn't completed. According to [Jefferson] it apparently is stuck in some NFL website at the moment.
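
The first caveat can be expressed as a small filter. This sketch treats a listed website as usable only when it has no path beyond `/`; the `is_naked_domain` helper is hypothetical and is not taken from index.py or augment.py.

```python
from urllib.parse import urlsplit


def is_naked_domain(website):
    """Return True when `website` is just a hostname with no path.

    Accepts values with or without a scheme, e.g. "www.example.org" or
    "http://www.example.org/".
    """
    if "://" not in website:
        website = "http://" + website   # urlsplit needs a scheme to find the host
    parts = urlsplit(website)
    return bool(parts.hostname) and parts.path in ("", "/") and not parts.query


for site in ["www.example.org", "www.example.org/museum/", "http://www.example.org/"]:
    print(site, is_naked_domain(site))
```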


[augmented CSV dataset]: imls-2015-augmented.csv
[Jefferson]: https://twitter.com/jefferson_bail
[IMLS]: http://www.imls.gov
[Redis]: http://redis.io
[ArchiveIt]: http://archiveit.org
[Archives Unleashed]: http://archivesunleashed.com
[Museums Universal Data File]: https://www.imls.gov/research-evaluation/data-collection/museum-universe-data-file
[dataset]: https://data.imls.gov/Museum/Museum-Universe-Data-File-FY-2015-Q1/bqh6-bapa
[CDX files]: http://qa-server.us.archive.org/vinay-misc/imls-cdx/cdx-manifest.txt
[WARC]: https://en.wikipedia.org/wiki/Web_ARChive
[CDX]: https://archive.org/web/researcher/cdx_file_format.php
[response code]: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
[index.py]: index.py
[augment.py]: augment.py
[dump.rdb]: dump.rdb

## Analysis

Let's get started by importing a few things. We're going to use pandas, and maybe view some things with matplotlib.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib notebook


In [2]:
imls = pd.read_csv("imls-2015-augmented.csv")

So here's what our table looks like:

In [3]:
imls.head()

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots
0,ARTSREVIVE,artsrevive.com,138290.0,0,0,0,0,False
1,REV BIRMINGHAM INC,www.revbirmingham.org,1695242.0,0,0,0,0,False
2,BELLINGRATH GARDENS AND HOME,www.bellingrath.org,3422595.0,2369,866250739,10,0,True
3,PIONEER MUSEUM OF ALABAMA ASSN,www.pioneer-museum.org,103113.0,125,103624363,2,0,True
4,ALABAMA SPORTS HALL OF FAME,ashof.org,552152.0,2812,41542788,27,0,False


## Neoliberal Museum Analysis

It's not entirely clear what the income column represents in the IMLS data, so perhaps this isn't a good idea. Still, since we have the income for each museum and the total size (bytes) retrieved from its website, we can calculate the ratio of gigabytes to income, or *gigabytes per dollar*. Then we can sort the table so that the institutions publishing the most data relative to their income come first.

In [4]:
imls = imls.assign(size_gb = imls['Size'] / 1024 / 1024 / 1024)
imls = imls.assign(size_ratio = imls["size_gb"]/ imls["Income"])
imls.sort_values(by="size_ratio", ascending=False)

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots,size_gb,size_ratio
5516,COLLEGE OF EASTERN UTAH,www.ceu.edu,336.0,35333,2385968467,184,0,True,2.222106,0.006613
507,VALLEY INSTITUTE VISUAL ART,www.vivagallery.org,22.0,2394,135252540,9,0,False,0.125964,0.005726
3034,SHELBY COUNTY HISTORICAL SOCIETY INC,shelby.mogenweb.org,7156.0,13170,27640930782,94,0,False,25.742623,0.003597
143,TEMPE HISTORICAL SOCIETY,www.tempe.gov,6053.0,211267,8209173685,641,112,True,7.645389,0.001263
337,ALLIED GARDENS PARK & RECREATION COUNCIL,www.sandiego.gov,113766.0,68640,89461797984,4398,0,True,83.317792,0.000732
1869,ALLEN COUNTY COURTHOUSE PRESERVATION TRUST INC,in.gov,91289.0,197162,63544413789,15248,0,True,59.180347,0.000648
2841,MUSKEGON MUSEUM OF ART FOUNDATION,www.muskegonartmuseum.org,731.0,1350,490731139,88,0,True,0.457029,0.000625
1568,MUSEUM OF CLASSIC CHICAGO TELEVISION,www.fuzzymemories.tv,12385.0,1469756,6058835491,0,1,True,5.642730,0.000456
3526,FOUNDATION FOR THE CHAPEL OF SACRED MIRRORS LTD,www.cosm.org,3815.0,3566,1309523066,41,0,True,1.219588,0.000320
2008,SHAWNEE INDIAN MISSION FOUNDATION,www.kshs.org,60952.0,89964,18826671716,416,1,True,17.533704,0.000288


This is a bit weird because there are institutions that have an income of \$22.00. So let's look only at museums that have an income of more than \$100,000.00.

In [5]:
large_imls = imls[imls['Income'] > 100000]
large_imls.sort_values(by='size_ratio', ascending=False)

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots,size_gb,size_ratio
337,ALLIED GARDENS PARK & RECREATION COUNCIL,www.sandiego.gov,113766.0,68640,89461797984,4398,0,True,83.317792,0.000732
2025,RILEY COUNTY HISTORICAL SOCIETY,www.kshs.org,113926.0,89964,18826671716,416,1,True,17.533704,0.000154
3946,ALUMNI SOCIETY OF THE SCHOOL OF VISUAL ARTS,www.sva.edu,393953.0,59904,48091618879,256,18,True,44.788810,0.000114
2374,JEFFERSON PATTERSON PARK AND MUSEUM,www.jefpat.org,162410.0,76441,16582683538,202,7,True,15.443828,0.000095
4173,TOBACCO FARM LIFE MUSEUM INC,www.tobaccofarmlifemuseum.org,153719.0,1133369,6842997344,2,62640,True,6.373038,0.000041
5211,TEXAS ARCHIVE OF THE MOVING IMAGE,www.texasarchive.org,314066.0,124893,12360836326,247,1,False,11.511926,0.000037
1660,ELMHURST HISTORICAL MUSEUM,www.elmhurst.org,156031.0,21317,5678932205,215,48,True,5.288918,0.000034
4594,GILBERT HOUSE CHILDRENS MUSEUM INC,www.acgilbert.org,618496.0,1331673,22090329664,5,0,False,20.573223,0.000033
710,HERITAGE MUSEUM OF ORANGE COUNTY,www.heritagemuseumoc.org,359685.0,435970,11838237402,62,24,True,11.025218,0.000031
1214,AMELIA ISLAND MUSEUM OF HISTORY INC,www.ameliamuseum.org,370153.0,1083793,8932488171,69,0,True,8.319028,0.000022


We can also calculate a similar ratio for the number of pages per dollar, and get the top results.

In [6]:
imls = imls.assign(pages_ratio = imls['Pages'] / imls['Income'])
imls.sort_values(by='pages_ratio', ascending=False)

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots,size_gb,size_ratio,pages_ratio
1568,MUSEUM OF CLASSIC CHICAGO TELEVISION,www.fuzzymemories.tv,12385.0,1469756,6058835491,0,1,True,5.642730,0.000456,118.672265
507,VALLEY INSTITUTE VISUAL ART,www.vivagallery.org,22.0,2394,135252540,9,0,False,0.125964,0.005726,108.818182
5516,COLLEGE OF EASTERN UTAH,www.ceu.edu,336.0,35333,2385968467,184,0,True,2.222106,0.006613,105.157738
143,TEMPE HISTORICAL SOCIETY,www.tempe.gov,6053.0,211267,8209173685,641,112,True,7.645389,0.001263,34.902858
5909,ART WORKS AROUND TOWN INC,www.artworksaroundtown.com,32009.0,722936,5414788767,7,0,False,5.042915,0.000158,22.585398
3375,FRIENDS OF ABRAHAM STAATS HOUSE INC,staatshouse.org,2173.0,36787,484321818,1495,0,True,0.451060,0.000208,16.929130
386,TREASURE ISLAND MUSEUM ASSOCIATION,www.treasureislandmuseum.org,40718.0,573052,1631136627,42,0,False,1.519114,0.000037,14.073677
6035,OSHKOSH PUBLIC MUSEUM AUXILIARY INC,www.oshkoshmuseum.org,2419.0,22765,360671584,14,0,False,0.335902,0.000139,9.410914
4173,TOBACCO FARM LIFE MUSEUM INC,www.tobaccofarmlifemuseum.org,153719.0,1133369,6842997344,2,62640,True,6.373038,0.000041,7.372992
3993,BUNGERS SURFING FOUNDATION,www.bungersurf.com,400.0,2850,99990027,143,0,False,0.093123,0.000233,7.125000


## Integrity

Since we have the total pages requested from a domain, and the total number of 404s at that domain, we can calculate a new column, *missing_ratio*. The lower this ratio, the better the link integrity of the website. We can then sort our table in ascending order and see which museum websites have the best link integrity.

In [7]:
imls = imls.assign(missing_ratio = imls["404"] / imls["Pages"])
imls.sort_values(by='missing_ratio')

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots,size_gb,size_ratio,pages_ratio,missing_ratio
4091,CATAWBA SCIENCE CENTER INC,www.catawbascience.org,2129313.0,1,426,0,0,True,3.967434e-07,1.863246e-13,4.696350e-07,0.0
3882,MALCOLM X DR BETTY SHABAZZ MEMORIAL EDUCATIONA...,www.theshabazzcenter.net,312303.0,1549864,6849435876,0,12,True,6.379034e+00,2.042579e-05,4.962693e+00,0.0
1302,MENNELLO MUSEUM OF AMERICAN ART FRIENDS INC,www.mennellomuseum.org,379840.0,3,1319,0,0,False,1.228414e-06,3.234031e-12,7.898062e-06,0.0
3880,MUSEUM OF BIBLICAL ART,www.mobia.org,2193466.0,4,1858,0,0,False,1.730397e-06,7.888872e-13,1.823598e-06,0.0
1307,SOUTHWEST FLORIDA HOLOCAUST MUSEUM INC,www.hmswfl.org,354750.0,82,3778787,0,0,True,3.519270e-03,9.920422e-09,2.311487e-04,0.0
1309,LIGHTNER MUSEUM OF HOBBIES,www.lightnermuseum.org,722968.0,471,247067268,0,0,True,2.300993e-01,3.182704e-07,6.514811e-04,0.0
3867,FENTON HISTORICAL SOCIETY,www.fentonhistorycenter.org,200300.0,266032,3768011150,0,1107,True,3.509234e+00,1.751989e-05,1.328168e+00,0.0
3864,GLENN H CURTISS MUSEUM OF LOCAL HISTORY INC,www.glennhcurtissmuseum.org,540702.0,937,11292280,0,0,False,1.051676e-02,1.945019e-08,1.732932e-03,0.0
3861,AURORA HISTORICAL SOCIETY,www.aurorahistoricalsociety.com,20734.0,199,1156683,0,0,True,1.077245e-03,5.195548e-08,9.597762e-03,0.0
3859,JAMESTOWN AUDUBON SOCIETY,www.jamestownaudubon.org,700167.0,4,2312,0,0,False,2.153218e-06,3.075292e-12,5.712923e-06,0.0


Some of these rows aren't interesting because the websites had only a handful of pages (the first row has just 1). But check out the [Malcolm X and Dr Betty Shabazz Memorial and Education Center](http://www.theshabazzcenter.net/), which has 1,549,864 pages in the table above and **ZERO** 404 errors.

We can limit our results to websites with more than 100,000 pages.

In [8]:
large_imls = imls[imls['Pages'] > 100000]
large_imls.sort_values(by='missing_ratio')

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots,size_gb,size_ratio,pages_ratio,missing_ratio
1402,MUSEUM OF AVIATION AT ROBINS AIR FORCE BASE GE...,www.museumofaviation.org,3.020252e+06,971065,10182593941,0,0,True,9.483280,3.139897e-06,0.321518,0.000000
3686,ANTHOLOGY FILM ARCHIVES,www.anthologyfilmarchives.org,7.898710e+05,1572883,8366301668,0,3,True,7.791726,9.864555e-06,1.991316,0.000000
4541,COLUMBIA RIVER MARITIME MUSEUM ENDOWMENT TR,www.crmm.org,1.840690e+05,162723,316522062,0,0,True,0.294784,1.601487e-06,0.884033,0.000000
2143,BRENNAN HOUSE INC,www.thebrennanhouse.org,1.122500e+05,330311,1176929799,0,0,True,1.096101,9.764822e-06,2.942637,0.000000
1568,MUSEUM OF CLASSIC CHICAGO TELEVISION,www.fuzzymemories.tv,1.238500e+04,1469756,6058835491,0,1,True,5.642730,4.556100e-04,118.672265,0.000000
3867,FENTON HISTORICAL SOCIETY,www.fentonhistorycenter.org,2.003000e+05,266032,3768011150,0,1107,True,3.509234,1.751989e-05,1.328168,0.000000
3882,MALCOLM X DR BETTY SHABAZZ MEMORIAL EDUCATIONA...,www.theshabazzcenter.net,3.123030e+05,1549864,6849435876,0,12,True,6.379034,2.042579e-05,4.962693,0.000000
5443,ARTS COUNCIL WICHITA FLS AREA,www.kempcenter.org,9.257160e+05,134837,2137654982,0,1,True,1.990846,2.150602e-06,0.145657,0.000000
4937,ALLISON-ANTRIM MUSEUM INC,www.greencastlemuseum.org,2.663860e+05,396074,3676957484,0,0,True,3.424434,1.285516e-05,1.486842,0.000000
2484,JOHN WOODMAN HIGGINS ARMORY INC,www.higgins.org,1.392623e+06,1708668,2952065786,0,0,True,2.749326,1.974207e-06,1.226942,0.000000


Hopefully this provides a little glimmer of an example of how web archives could be used to study museum websites.