# Analyzing US Museum Websites with Web Archives

This notebook was created during the [Archives Unleashed] event held in Washington DC on June 14-15.

## Background

[IMLS] maintains the [Museum Universe Data File][Museums Universal Data File], a census of museums in the United States. Conveniently, the database is available for download as a [dataset] in CSV format. The [ArchiveIt] folks at the Internet Archive pulled the website URLs from the CSV and used them as the seed list for a (big) web crawl. The 10 gigabytes of [CDX files] for those crawls were made available as a dataset for use during Archives Unleashed.

With your archivist hat on (if you have one) you can think of [CDX] files as a *finding aid* for the web crawl data, the [WARC] files. They document which URLs were requested, what the [response code] was (404 Not Found, 500 Internal Server Error, 200 OK, etc.), the size of the response, the media type of the response (text/html, image/jpeg, etc.), and other metadata that helps in seeking within the CDX file and in the corresponding WARC file to get the actual content.
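
To make the finding-aid idea concrete, here is a minimal sketch of reading one CDX line in Python. It assumes the common 11-field, space-separated layout (header ` CDX N b a m s k r M S V g`); the files in this dataset may use a different field order. The `parse_cdx_line` helper and the sample line are made up for illustration.

```python
from collections import namedtuple

# Assumed field order for the common 11-column CDX layout
# (header: " CDX N b a m s k r M S V g"). Check the first line
# of a real CDX file before relying on this.
CdxRecord = namedtuple("CdxRecord", [
    "urlkey", "timestamp", "original", "mimetype", "status",
    "digest", "redirect", "metatags", "length", "offset", "filename",
])

def parse_cdx_line(line):
    """Split one CDX line into a CdxRecord; return None for header or malformed lines."""
    if line.startswith(" CDX") or line.startswith("CDX"):
        return None
    parts = line.split()
    if len(parts) != len(CdxRecord._fields):
        return None
    return CdxRecord(*parts)

# Made-up example line, not taken from the crawl.
sample = ("org,example)/about 20160301120000 http://example.org/about "
          "text/html 200 ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - 1043 2048 "
          "EXAMPLE-00000.warc.gz")
record = parse_cdx_line(sample)
print(record.status, record.mimetype, record.length)
```

All fields come back as strings, so numeric fields such as `length` need an explicit `int()` before they are summed.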

## Processing

We thought that perhaps it might be interesting to augment the IMLS data file with information from the crawl, such as the total number of pages on the museum's website, the total size of resources hosted, the number of missing resources, etc. In addition we thought it could be interesting to look at the data through the lens of the museum's income.

To augment the IMLS table with data from the CDX files I chose to read the CDX files and count attributes of each request (status code, mimetype, size, etc.) using [Redis]. Unfortunately we found out during processing that there are a great many duplicate URLs in the crawl: a given URL may have been crawled many times. So we needed to tally only the first occurrence of each resource and ignore the rest, so that our results don't favor resources that were requested many times by the crawler. You can see the resulting program in [index.py].
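
The first-occurrence rule can be sketched without Redis. The following uses in-memory Python structures in place of Redis, so it is not the code in index.py; the `tally` function, the record tuples, and the choice to count error responses toward pages and size are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def tally(records):
    """Count pages, bytes, 404s and 500s per host, using only the first capture of each URL.

    `records` is an iterable of (url, status, size) tuples.
    """
    seen = set()
    stats = defaultdict(lambda: {"pages": 0, "size": 0, "404": 0, "500": 0})
    for url, status, size in records:
        if url in seen:
            continue  # later captures of the same URL are ignored
        seen.add(url)
        host_stats = stats[urlsplit(url).netloc]
        host_stats["pages"] += 1
        host_stats["size"] += size
        if status in ("404", "500"):
            host_stats[status] += 1
    return stats

# Made-up captures: the first URL was crawled twice.
captures = [
    ("http://example.org/", "200", 1000),
    ("http://example.org/", "200", 1000),
    ("http://example.org/missing", "404", 200),
]
result = tally(captures)
print(dict(result["example.org"]))
```

Holding every URL in a Python set may not be practical for 10 gigabytes of CDX data, which is one reason an external store can be attractive.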

Once the data has been processed into Redis you can run [augment.py], which reads the existing CSV file and writes out a new CSV with these additional columns:

* pages: number of *unique* pages requested
* size: total size in bytes received from a domain
* 404: number of Not Found errors at a domain
* 500: number of Internal Server Errors at a domain
* robots: 

More columns could be added, such as number of images (jpeg, tiff, etc) if desirable. This notebook uses the [augmented CSV dataset] to examine a few things.
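
The augmentation step amounts to a join between the CSV rows and the per-domain tallies. Here is a minimal sketch, assuming the tallies are available as a plain dictionary keyed by host name (augment.py reads them from Redis instead). The `augment` function, the column names, and the data are made up, and the robots column is left out because its definition is not given above.

```python
import csv
import io

EXTRA_COLUMNS = ["pages", "size", "404", "500"]

def augment(reader, out_file, stats, url_column="URL"):
    """Copy rows from a csv.DictReader to out_file, adding per-host tallies."""
    fieldnames = list(reader.fieldnames) + EXTRA_COLUMNS
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        host_stats = stats.get(row[url_column], {})
        for column in EXTRA_COLUMNS:
            # Hosts with no tallies get zeros.
            row[column] = host_stats.get(column, 0)
        writer.writerow(row)

# Made-up input standing in for the IMLS CSV and the Redis tallies.
source = io.StringIO(
    "Name,URL\nEXAMPLE MUSEUM,example.org\nOTHER MUSEUM,other.example\n"
)
stats = {"example.org": {"pages": 2, "size": 1200, "404": 1, "500": 0}}
output = io.StringIO()
augment(csv.DictReader(source), output, stats)
print(output.getvalue())
```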

## Caveats

* These results are preliminary: index.py is still running after 10 hours! It appears to be about 75% done, so these results are partial. Maybe I should have tried [ArchiveSpark] or something. Sigh. I'll update the notebook with the latest CSV when it completes.
* Any website URLs that were at a path within a larger site were ignored in this analysis, because it is not clear where one website begins and another ends unless there is a naked domain.
* Naked domains give only a limited view of a particular organization's web space. An organization could have a lot of content at another domain.
* The web crawl ran for 3 months, but still hasn't completed. According to [Jefferson] it apparently is stuck in some NFL website at the moment.
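
The naked-domain rule from the caveats can be sketched as a small check. This is an assumption about how such a filter might look, not the code used for the analysis; `is_naked_domain` and the example URLs other than www.bellingrath.org are hypothetical.

```python
from urllib.parse import urlsplit

def is_naked_domain(url):
    """Return True when the URL has no path beyond an optional trailing slash."""
    # urlsplit only recognizes the host if a scheme or "//" is present,
    # so add one for bare values such as "www.bellingrath.org".
    if "//" not in url:
        url = "//" + url
    parts = urlsplit(url)
    return bool(parts.netloc) and parts.path in ("", "/") and not parts.query

for url in ["www.bellingrath.org", "http://example.org/", "www.example.edu/museum"]:
    print(url, is_naked_domain(url))
```

Under this check the third URL would be skipped, since the museum is only one part of a larger site.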


[augmented CSV dataset]: imls-2015-augmented.csv
[Jefferson]: https://twitter.com/jefferson_bail
[IMLS]: http://www.imls.gov
[Redis]: http://redis.io
[ArchiveIt]: http://archiveit.org
[Archives Unleashed]: http://archivesunleashed.com
[Museums Universal Data File]: https://www.imls.gov/research-evaluation/data-collection/museum-universe-data-file
[dataset]: https://data.imls.gov/Museum/Museum-Universe-Data-File-FY-2015-Q1/bqh6-bapa
[CDX files]: http://qa-server.us.archive.org/vinay-misc/imls-cdx/cdx-manifest.txt
[WARC]: https://en.wikipedia.org/wiki/Web_ARChive
[CDX]: https://archive.org/web/researcher/cdx_file_format.php
[response code]: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
[index.py]: index.py
[augment.py]: augment.py

## Analysis

Let's get started by importing a few things. We're going to use pandas, and maybe view some things with matplotlib.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib notebook


In [2]:
imls = pd.read_csv("imls-2015-augmented.csv")

So here's what our table looks like:

In [3]:
imls.head()

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots
0,ARTSREVIVE,artsrevive.com,138290.0,0,0,0,0,False
1,REV BIRMINGHAM INC,www.revbirmingham.org,1695242.0,0,0,0,0,False
2,BELLINGRATH GARDENS AND HOME,www.bellingrath.org,3422595.0,2101,768279385,10,0,True
3,PIONEER MUSEUM OF ALABAMA ASSN,www.pioneer-museum.org,103113.0,109,69725620,2,0,True
4,ALABAMA SPORTS HALL OF FAME,ashof.org,552152.0,2230,33108465,24,0,False


## Neoliberal Museum Analysis

It's not entirely clear what the income column represents in the IMLS data, so perhaps this isn't a good idea. Still, since we have the income for each museum and the total size (bytes) retrieved from its website, we can calculate a ratio: *gigabytes per dollar*. Then we can sort by it to see which institutions are publishing the most data relative to their income.

In [4]:
# Convert bytes to gigabytes (1024**3 bytes each), then divide by income.
imls = imls.assign(size_gb=imls["Size"] / 1024 / 1024 / 1024)
imls = imls.assign(size_ratio=imls["size_gb"] / imls["Income"])
imls.sort_values(by="size_ratio", ascending=False)

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots,size_gb,size_ratio
507,VALLEY INSTITUTE VISUAL ART,www.vivagallery.org,22.0,2153,123098457,8,0,False,0.114644,0.005211
5516,COLLEGE OF EASTERN UTAH,www.ceu.edu,336.0,22676,1589230787,138,0,True,1.480087,0.004405
3034,SHELBY COUNTY HISTORICAL SOCIETY INC,shelby.mogenweb.org,7156.0,10980,22380537929,94,0,False,20.843500,0.002913
143,TEMPE HISTORICAL SOCIETY,www.tempe.gov,6053.0,119784,5044019549,467,78,True,4.697609,0.000776
337,ALLIED GARDENS PARK & RECREATION COUNCIL,www.sandiego.gov,113766.0,53357,74895087335,3093,0,True,69.751486,0.000613
1869,ALLEN COUNTY COURTHOUSE PRESERVATION TRUST INC,in.gov,91289.0,152042,50989630130,11144,0,True,47.487794,0.000520
2841,MUSKEGON MUSEUM OF ART FOUNDATION,www.muskegonartmuseum.org,731.0,1255,368217669,76,0,True,0.342929,0.000469
3526,FOUNDATION FOR THE CHAPEL OF SACRED MIRRORS LTD,www.cosm.org,3815.0,3197,1134064280,31,0,True,1.056180,0.000277
1568,MUSEUM OF CLASSIC CHICAGO TELEVISION,www.fuzzymemories.tv,12385.0,813857,3353868160,0,1,True,3.123533,0.000252
5266,MUSEUM OF COMPUTER CULTURE,www.computerculture.org,883.0,2359,229946688,295,0,True,0.214155,0.000243


This is a bit weird because there are institutions that have an income of \$22.00. So let's look at museums that have an income of more than \$100,000.00.

In [5]:
large_imls = imls[imls['Income'] > 100000]
large_imls.sort_values(by='size_ratio', ascending=False)

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots,size_gb,size_ratio
337,ALLIED GARDENS PARK & RECREATION COUNCIL,www.sandiego.gov,113766.0,53357,74895087335,3093,0,True,69.751486,0.000613
2025,RILEY COUNTY HISTORICAL SOCIETY,www.kshs.org,113926.0,60364,14781730183,342,0,True,13.766559,0.000121
3946,ALUMNI SOCIETY OF THE SCHOOL OF VISUAL ARTS,www.sva.edu,393953.0,49712,38945037160,234,0,True,36.270392,0.000092
2374,JEFFERSON PATTERSON PARK AND MUSEUM,www.jefpat.org,162410.0,46033,8462341995,129,3,True,7.881170,0.000049
5211,TEXAS ARCHIVE OF THE MOVING IMAGE,www.texasarchive.org,314066.0,95588,12108190149,233,1,False,11.276631,0.000036
1660,ELMHURST HISTORICAL MUSEUM,www.elmhurst.org,156031.0,16064,4171599640,180,16,True,3.885105,0.000025
4173,TOBACCO FARM LIFE MUSEUM INC,www.tobaccofarmlifemuseum.org,153719.0,624919,3812936433,2,30253,True,3.551074,0.000023
4594,GILBERT HOUSE CHILDRENS MUSEUM INC,www.acgilbert.org,618496.0,809159,13379682516,5,0,False,12.460800,0.000020
710,HERITAGE MUSEUM OF ORANGE COUNTY,www.heritagemuseumoc.org,359685.0,267800,7252557567,58,2,True,6.754471,0.000019
1707,ELGIN AREA HISTORICAL SOCIETY,www.cityofelgin.org,182775.0,8963,3431885338,217,1,True,3.196192,0.000017


We can also calculate a similar ratio for the number of pages per dollar, and get the top results.

In [6]:
imls = imls.assign(pages_ratio = imls['Pages'] / imls['Income'])
imls.sort_values(by='pages_ratio', ascending=False)

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots,size_gb,size_ratio,pages_ratio
507,VALLEY INSTITUTE VISUAL ART,www.vivagallery.org,22.0,2153,123098457,8,0,False,0.114644,0.005211,97.863636
5516,COLLEGE OF EASTERN UTAH,www.ceu.edu,336.0,22676,1589230787,138,0,True,1.480087,0.004405,67.488095
1568,MUSEUM OF CLASSIC CHICAGO TELEVISION,www.fuzzymemories.tv,12385.0,813857,3353868160,0,1,True,3.123533,0.000252,65.713121
143,TEMPE HISTORICAL SOCIETY,www.tempe.gov,6053.0,119784,5044019549,467,78,True,4.697609,0.000776,19.789195
3375,FRIENDS OF ABRAHAM STAATS HOUSE INC,staatshouse.org,2173.0,32323,431534924,1317,0,True,0.401898,0.000185,14.874827
5909,ART WORKS AROUND TOWN INC,www.artworksaroundtown.com,32009.0,449244,3369816321,7,0,False,3.138386,0.000098,14.034928
386,TREASURE ISLAND MUSEUM ASSOCIATION,www.treasureislandmuseum.org,40718.0,362125,1102337129,41,0,False,1.026631,0.000025,8.893487
6035,OSHKOSH PUBLIC MUSEUM AUXILIARY INC,www.oshkoshmuseum.org,2419.0,18387,291604163,9,0,False,0.271578,0.000112,7.601075
3993,BUNGERS SURFING FOUNDATION,www.bungersurf.com,400.0,2400,93915092,137,0,False,0.087465,0.000219,6.000000
1280,CIVIC MEDIA CENTER AND LIBRARY INC,www.civicmediacenter.org,58541.0,249815,2191390273,15,0,True,2.040891,0.000035,4.267351


## Integrity

Since we have the total pages requested from a domain and the total number of 404s at that domain, we can calculate a new column, *missing_ratio*. The lower this ratio, the better the link integrity of the website. We can then sort our table in ascending order and see which museum websites have the best link integrity.

In [7]:
imls = imls.assign(missing_ratio = imls["404"] / imls["Pages"])
imls.sort_values(by='missing_ratio')

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots,size_gb,size_ratio,pages_ratio,missing_ratio
2566,NEW BEDFORD MUSEUM AND ART CENTER INC,www.newbedfordartmuseum.org,328130.0,3,1989,0,0,True,1.852401e-06,5.645325e-12,0.000009,0.0
3884,STEEL PLANT MUSEUM INC,www.steelpltmuseum.org,42806.0,3,1540,0,0,False,1.434237e-06,3.350551e-11,0.000070,0.0
3882,MALCOLM X DR BETTY SHABAZZ MEMORIAL EDUCATIONA...,www.theshabazzcenter.net,312303.0,867395,3829017053,0,4,True,3.566050e+00,1.141856e-05,2.777415,0.0
848,SANGRE DE CRISTO ARTS & CONFERENCE CENTER,sangredecristoarts.org,1494933.0,3,1523,0,0,False,1.418404e-06,9.488079e-13,0.000002,0.0
1522,NEZ PERCE COUNTY HISTORICAL SOCIETY INCORPORATED,www.npchistsoc.org,62816.0,2,1549,0,0,True,1.442619e-06,2.296578e-11,0.000032,0.0
850,CHILDRENS MUSEUM OF DENVER INC,www.mychildsmuseum.org,5828008.0,642,66545235,0,0,True,6.197508e-02,1.063401e-08,0.000110,0.0
852,PUEBLO ZOOLOGICAL SOCIETY,www.pueblozoo.org,1662497.0,3,31355,0,0,True,2.920162e-05,1.756492e-11,0.000002,0.0
854,FOUR MILE HISTORIC PARK INC,www.fourmilepark.org,524397.0,134,62061110,0,0,True,5.779891e-02,1.102198e-07,0.000256,0.0
3880,MUSEUM OF BIBLICAL ART,www.mobia.org,2193466.0,4,1858,0,0,False,1.730397e-06,7.888872e-13,0.000002,0.0
3867,FENTON HISTORICAL SOCIETY,www.fentonhistorycenter.org,200300.0,175516,2523574453,0,723,True,2.350262e+00,1.173371e-05,0.876266,0.0


The first two rows aren't interesting because the websites only had 3 pages. But check out the [Malcolm X and Dr Betty Shabazz Memorial and Education Center](http://www.theshabazzcenter.net/), which has 867,395 pages and **ZERO** 404 errors.

We can limit our results to websites with more than 100,000 pages.

In [8]:
large_imls = imls[imls['Pages'] > 100000]
large_imls.sort_values(by='missing_ratio')

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots,size_gb,size_ratio,pages_ratio,missing_ratio
2484,JOHN WOODMAN HIGGINS ARMORY INC,www.higgins.org,1.392623e+06,949328,1640204637,0,0,True,1.527560,1.096894e-06,0.681683,0.000000
3882,MALCOLM X DR BETTY SHABAZZ MEMORIAL EDUCATIONA...,www.theshabazzcenter.net,3.123030e+05,867395,3829017053,0,4,True,3.566050,1.141856e-05,2.777415,0.000000
3867,FENTON HISTORICAL SOCIETY,www.fentonhistorycenter.org,2.003000e+05,175516,2523574453,0,723,True,2.350262,1.173371e-05,0.876266,0.000000
3686,ANTHOLOGY FILM ARCHIVES,www.anthologyfilmarchives.org,7.898710e+05,868477,4634327128,0,2,True,4.316053,5.464251e-06,1.099518,0.000000
5946,WISCONSIN HISTORICAL FOUNDATION,www.wisconsinhistory.org,5.150257e+06,171272,3817285783,0,5,True,3.555124,6.902810e-07,0.033255,0.000000
4937,ALLISON-ANTRIM MUSEUM INC,www.greencastlemuseum.org,2.663860e+05,226083,2126782350,0,0,True,1.980720,7.435527e-06,0.848705,0.000000
1568,MUSEUM OF CLASSIC CHICAGO TELEVISION,www.fuzzymemories.tv,1.238500e+04,813857,3353868160,0,1,True,3.123533,2.522029e-04,65.713121,0.000000
1402,MUSEUM OF AVIATION AT ROBINS AIR FORCE BASE GE...,www.museumofaviation.org,3.020252e+06,548826,5808218208,0,0,True,5.409325,1.791018e-06,0.181715,0.000000
2143,BRENNAN HOUSE INC,www.thebrennanhouse.org,1.122500e+05,263252,938682466,0,0,True,0.874216,7.788117e-06,2.345229,0.000000
2441,PROPRIETORS OF THE SALEM ATHENAEUM,www.salemathenaeum.net,1.508943e+06,406129,4100949867,1,0,True,3.819307,2.531114e-06,0.269148,0.000002


Hopefully this provides a little glimmer of how web archives could be used to study museum websites.