# Analyzing IMLS Funded Websites

This notebook was created during the [Archives Unleashed] event held in Washington DC on June 14-15.

The Institute of Museum and Library Services (IMLS) maintains the [Museum Universe Data File][Museums Universal Data File], a census of museums in the United States. Conveniently, the database is available for download as a [dataset] in CSV format. The Archive-It team at the Internet Archive pulled the website URLs from the CSV and used them as the seed list for a (big) web crawl. The 10 gigabytes of [CDX files] for those crawls were made available as a dataset for use during Archives Unleashed.

If you are an archivist, you can think of [CDX] files as a finding aid for the [WARC] data files produced by a web crawl. They document which URLs were requested, what the [response code] was (404 Not Found, 500 Internal Server Error, 200 OK, etc.), the size of the response, the media type of the response (text/html, image/jpeg, etc.), as well as other metadata that helps in seeking within the CDX file and in the WARC file that it describes.
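
To make the shape of a CDX record concrete, here is a minimal sketch of a line parser. It assumes the common 11-column, space-delimited layout announced by a header line of ` CDX N b a m s k r M S V g`; the files used here may be laid out differently, so check their header first. The sample line and the names `CdxRecord` and `parse_cdx_line` are made up for illustration.

```python
from typing import NamedTuple, Optional


class CdxRecord(NamedTuple):
    """One capture, assuming the 11-column ' CDX N b a m s k r M S V g' layout."""
    urlkey: str     # N: canonicalized ("massaged") URL
    timestamp: str  # b: capture date, YYYYMMDDhhmmss
    original: str   # a: original URL
    mimetype: str   # m: media type of the response
    status: str     # s: HTTP response code ("-" when absent)
    digest: str     # k: checksum
    redirect: str   # r: redirect target
    meta: str       # M: meta tags
    length: str     # S: compressed record size
    offset: str     # V: offset of the record in the compressed file
    filename: str   # g: name of the file holding the record


def parse_cdx_line(line: str) -> Optional[CdxRecord]:
    """Return a CdxRecord, or None for header, blank, or malformed lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("CDX "):
        return None
    fields = stripped.split()
    if len(fields) != 11:
        return None
    return CdxRecord(*fields)


# Hypothetical sample line, not taken from the IMLS crawl.
sample = (
    "org,example)/ 20160301120000 http://example.org/ text/html 200 "
    "EXAMPLEDIGEST - - 1234 5678 example.warc.gz"
)
rec = parse_cdx_line(sample)
print(rec.original, rec.status, rec.mimetype)
```

All fields are kept as strings because status and size can be `-` in some records; convert them to integers only after checking.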

We thought it might be interesting to augment the IMLS data file with information from the crawl, such as the total number of pages on the museum's website, the total size of resources hosted, and the number of missing resources. We also thought it could be interesting to look at the data through the lens of the museum's income.
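
The augmentation described above could be sketched roughly as follows. This is not necessarily how `imls-2015-augmented.csv` was built: the capture records, host names, and the rule that a "page" is a 200 response with media type `text/html` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical per-capture records (one row per CDX line); values are made up.
captures = pd.DataFrame({
    "host": ["a.example.org", "a.example.org", "a.example.org", "b.example.org"],
    "mimetype": ["text/html", "image/jpeg", "text/html", "text/html"],
    "status": [200, 200, 404, 500],
    "size": [1000, 5000, 300, 200],
})

# Assumption: a "page" is a successful (200) text/html capture.
captures["is_page"] = (captures["status"] == 200) & (captures["mimetype"] == "text/html")
captures["is_404"] = captures["status"] == 404
captures["is_500"] = captures["status"] == 500

# One summary row per host. Column names that are not valid Python
# identifiers ("404", "500") are passed through a dict.
summary = (
    captures.groupby("host")
    .agg(
        Pages=("is_page", "sum"),
        Size=("size", "sum"),
        **{"404": ("is_404", "sum"), "500": ("is_500", "sum")},
    )
    .reset_index()
)

# Hypothetical museum list; a left join keeps museums with no captures.
museums = pd.DataFrame({
    "Name": ["MUSEUM A", "MUSEUM C"],
    "URL": ["a.example.org", "c.example.org"],
})
augmented = museums.merge(summary, left_on="URL", right_on="host", how="left")
augmented = augmented.drop(columns="host")

# Museums without any captures get zeros instead of NaN.
counts = ["Pages", "Size", "404", "500"]
augmented[counts] = augmented[counts].fillna(0).astype(int)
print(augmented)
```

The left join is a deliberate choice: museums whose sites were never captured stay in the table with zero counts rather than disappearing.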

[Archives Unleashed]: http://archivesunleashed.com
[Museums Universal Data File]: https://www.imls.gov/research-evaluation/data-collection/museum-universe-data-file
[dataset]: https://data.imls.gov/Museum/Museum-Universe-Data-File-FY-2015-Q1/bqh6-bapa
[CDX files]: http://qa-server.us.archive.org/vinay-misc/imls-cdx/cdx-manifest.txt
[WARC]: https://en.wikipedia.org/wiki/Web_ARChive
[CDX]: https://archive.org/web/researcher/cdx_file_format.php
[response code]: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

In [8]:
%matplotlib notebook

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [9]:
imls = pd.read_csv("imls-2015-augmented.csv")

So here's what our table looks like:

In [10]:
imls.head()

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots
0,ARTSREVIVE,artsrevive.com,138290.0,0,0,0,0,False
1,REV BIRMINGHAM INC,www.revbirmingham.org,1695242.0,0,0,0,0,False
2,BELLINGRATH GARDENS AND HOME,www.bellingrath.org,3422595.0,2101,768279385,10,0,True
3,PIONEER MUSEUM OF ALABAMA ASSN,www.pioneer-museum.org,103113.0,109,69725620,2,0,True
4,ALABAMA SPORTS HALL OF FAME,ashof.org,552152.0,2230,33108465,24,0,False


Since we have each museum's income and the total size (in bytes) retrieved from its website, we can calculate a ratio of data per dollar. Sorting by that ratio in descending order puts the institutions that publish the most data relative to their income at the top: *gigabytes per dollar*.

In [26]:
# Convert bytes to gigabytes (1 GB = 1024**3 bytes here).
imls = imls.assign(size_gb=imls["Size"] / 1024**3)
# Gigabytes retrieved per dollar of income.
imls = imls.assign(size_ratio=imls["size_gb"] / imls["Income"])
imls.sort_values(by="size_ratio", ascending=False).head(10)

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots,size_ratio,pages_ratio,size_gb
507,VALLEY INSTITUTE VISUAL ART,www.vivagallery.org,22.0,2153,123098457,8,0,False,0.005211,97.863636,0.114644
5516,COLLEGE OF EASTERN UTAH,www.ceu.edu,336.0,22676,1589230787,138,0,True,0.004405,67.488095,1.480087
3034,SHELBY COUNTY HISTORICAL SOCIETY INC,shelby.mogenweb.org,7156.0,10980,22380537929,94,0,False,0.002913,1.534377,20.843500
143,TEMPE HISTORICAL SOCIETY,www.tempe.gov,6053.0,119784,5044019549,467,78,True,0.000776,19.789195,4.697609
337,ALLIED GARDENS PARK & RECREATION COUNCIL,www.sandiego.gov,113766.0,53357,74895087335,3093,0,True,0.000613,0.469007,69.751486
1869,ALLEN COUNTY COURTHOUSE PRESERVATION TRUST INC,in.gov,91289.0,152042,50989630130,11144,0,True,0.000520,1.665502,47.487794
2841,MUSKEGON MUSEUM OF ART FOUNDATION,www.muskegonartmuseum.org,731.0,1255,368217669,76,0,True,0.000469,1.716826,0.342929
3526,FOUNDATION FOR THE CHAPEL OF SACRED MIRRORS LTD,www.cosm.org,3815.0,3197,1134064280,31,0,True,0.000277,0.838008,1.056180
1568,MUSEUM OF CLASSIC CHICAGO TELEVISION,www.fuzzymemories.tv,12385.0,813857,3353868160,0,1,True,0.000252,65.713121,3.123533
5266,MUSEUM OF COMPUTER CULTURE,www.computerculture.org,883.0,2359,229946688,295,0,True,0.000243,2.671574,0.214155
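
One caveat with a per-dollar ratio is that a zero income divides to infinity and a missing income to NaN, and either can distort the ranking. The sketch below shows one way to guard against that. The first two rows reuse values from the `imls.head()` output above; the zero-income row is hypothetical and added only to exercise the guard.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Name": [
        "BELLINGRATH GARDENS AND HOME",
        "PIONEER MUSEUM OF ALABAMA ASSN",
        "HYPOTHETICAL ZERO-INCOME MUSEUM",  # made-up row, not in the dataset
    ],
    "Income": [3422595.0, 103113.0, 0.0],
    "Size": [768279385, 69725620, 1024**3],
})

# Bytes to gigabytes.
sample["size_gb"] = sample["Size"] / 1024**3

# Treat a zero income as missing so the ratio becomes NaN rather than inf.
income = sample["Income"].replace(0, np.nan)
sample["size_ratio"] = sample["size_gb"] / income

# Rank only the rows that have a usable ratio.
top = sample.dropna(subset=["size_ratio"]).nlargest(2, "size_ratio")
print(top[["Name", "size_gb", "size_ratio"]])
```

Whether to drop such rows or report them separately is a judgment call; dropping them simply keeps the ranking finite.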


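We can calculate a similar ratio for the number of pages per dollar and look at the top results. Note that `size_ratio` in the output below is in bytes per dollar rather than gigabytes per dollar: this cell (In [23]) was run before the gigabyte conversion (In [26]).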

In [23]:
# Pages crawled per dollar of income.
imls = imls.assign(pages_ratio=imls["Pages"] / imls["Income"])
imls.sort_values(by="pages_ratio", ascending=False).head(10)

Unnamed: 0,Name,URL,Income,Pages,Size,404,500,Robots,size_ratio,pages_ratio
507,VALLEY INSTITUTE VISUAL ART,www.vivagallery.org,22.0,2153,123098457,8,0,False,5.595384e+06,97.863636
5516,COLLEGE OF EASTERN UTAH,www.ceu.edu,336.0,22676,1589230787,138,0,True,4.729854e+06,67.488095
1568,MUSEUM OF CLASSIC CHICAGO TELEVISION,www.fuzzymemories.tv,12385.0,813857,3353868160,0,1,True,2.708008e+05,65.713121
143,TEMPE HISTORICAL SOCIETY,www.tempe.gov,6053.0,119784,5044019549,467,78,True,8.333090e+05,19.789195
3375,FRIENDS OF ABRAHAM STAATS HOUSE INC,staatshouse.org,2173.0,32323,431534924,1317,0,True,1.985895e+05,14.874827
5909,ART WORKS AROUND TOWN INC,www.artworksaroundtown.com,32009.0,449244,3369816321,7,0,False,1.052772e+05,14.034928
386,TREASURE ISLAND MUSEUM ASSOCIATION,www.treasureislandmuseum.org,40718.0,362125,1102337129,41,0,False,2.707248e+04,8.893487
6035,OSHKOSH PUBLIC MUSEUM AUXILIARY INC,www.oshkoshmuseum.org,2419.0,18387,291604163,9,0,False,1.205474e+05,7.601075
3993,BUNGERS SURFING FOUNDATION,www.bungersurf.com,400.0,2400,93915092,137,0,False,2.347877e+05,6.000000
1280,CIVIC MEDIA CENTER AND LIBRARY INC,www.civicmediacenter.org,58541.0,249815,2191390273,15,0,True,3.743343e+04,4.267351
