# ftp.census.gov Coverage

This notebook uses a listing of files from `ftp2.census.gov` to determine how many of them were archived by the Wayback Machine at their web "frontend" `www2.census.gov`. The list was created by Andrew Berger using lftp to copy all the files from ftp2.census.gov.

## Get a DataFrame

Create a DataFrame from the ISO-8859-1 encoded list of files.

In [1]:
import gzip
import pandas

# Read the gzipped listing line by line and close the file when done.
with gzip.open('data/all-downloaded-census-2025-03-29.txt.gz') as fh:
    urls = [line.strip().decode("iso-8859-1") for line in fh]
df = pandas.DataFrame({"ftp_url": urls})
df

Unnamed: 0,ftp_url
0,ftp2.census.gov/econ/esp/2012/esp2012_table7.xlsx
1,ftp2.census.gov/econ/esp/2012/Table 4.xlsx
2,ftp2.census.gov/econ/esp/2012/esp2012_table6.xlsx
3,ftp2.census.gov/econ/esp/2012/Table 6.xlsx
4,ftp2.census.gov/econ/esp/2012/esp2012_table8.xlsx
...,...
4496866,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...
4496867,ftp2.census.gov/acs2011_1yr/summaryfile/ACS201...
4496868,ftp2.census.gov/acs2011_1yr/summaryfile/ACS_20...
4496869,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...


Use the FTP URL to create the respective Web URL:

In [3]:
import re

df['web_url'] = df.ftp_url.apply(lambda s: re.sub('^ftp2', 'https://www2', s))
df

Unnamed: 0,ftp_url,web_url
0,ftp2.census.gov/econ/esp/2012/esp2012_table7.xlsx,https://www2.census.gov/econ/esp/2012/esp2012_...
1,ftp2.census.gov/econ/esp/2012/Table 4.xlsx,https://www2.census.gov/econ/esp/2012/Table 4....
2,ftp2.census.gov/econ/esp/2012/esp2012_table6.xlsx,https://www2.census.gov/econ/esp/2012/esp2012_...
3,ftp2.census.gov/econ/esp/2012/Table 6.xlsx,https://www2.census.gov/econ/esp/2012/Table 6....
4,ftp2.census.gov/econ/esp/2012/esp2012_table8.xlsx,https://www2.census.gov/econ/esp/2012/esp2012_...
...,...,...
4496866,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...,https://www2.census.gov/acs2011_1yr/summaryfil...
4496867,ftp2.census.gov/acs2011_1yr/summaryfile/ACS201...,https://www2.census.gov/acs2011_1yr/summaryfil...
4496868,ftp2.census.gov/acs2011_1yr/summaryfile/ACS_20...,https://www2.census.gov/acs2011_1yr/summaryfil...
4496869,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...,https://www2.census.gov/acs2011_1yr/summaryfil...
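
Some of the paths contain spaces (for example `Table 4.xlsx`). This notebook does not test whether the Wayback lookup needs them percent-encoded, so the following is only a sketch of how the mapping could encode the path if that turns out to matter. The helper name `to_web_url` is hypothetical, and `quote` encodes non-ASCII characters as UTF-8 by default, which is an assumption about how the web server expects them.

```python
import re
from urllib.parse import quote


def to_web_url(ftp_url: str) -> str:
    """Map an ftp2.census.gov path to a percent-encoded www2 URL (hypothetical helper)."""
    # Split the host from the path so that only the path gets encoded.
    host, _, path = ftp_url.partition("/")
    host = re.sub(r"^ftp2", "www2", host)
    # quote() leaves "/" untouched by default and turns a space into %20.
    return f"https://{host}/{quote(path)}"


print(to_web_url("ftp2.census.gov/econ/esp/2012/Table 4.xlsx"))
```

For paths without special characters the result is the same as the `re.sub` version above.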


## Lookup in Wayback

To look up a URL in the Wayback Machine we can use the [wayback](https://wayback.readthedocs.io/en/stable/index.html) module which talks to the Wayback Machine's [CDX API](https://archive.org/developers/wayback-cdx-server.html).

In [9]:
import wayback

wb = wayback.WaybackClient()

In [10]:
df.web_url[0]

'https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx'

In [11]:
for result in wb.search('https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx'):
    print(result)

CdxRecord(key='gov,census)/econ/esp/2012/esp2012_table7.xlsx', timestamp=datetime.datetime(2017, 7, 22, 6, 4, 38, tzinfo=datetime.timezone.utc), url='http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx', mime_type='unk', status_code=302, digest='3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ', length=418, raw_url='https://web.archive.org/web/20170722060438id_/http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx', view_url='https://web.archive.org/web/20170722060438/http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx')
CdxRecord(key='gov,census)/econ/esp/2012/esp2012_table7.xlsx', timestamp=datetime.datetime(2017, 8, 17, 11, 48, 24, tzinfo=datetime.timezone.utc), url='https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx', mime_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', status_code=200, digest='6Y6373IAFSDMMIOTXDXHGACICLMA2Y4J', length=30557, raw_url='https://web.archive.org/web/20170817114824id_/https://www2.census.gov/econ/esp/2012/esp2012_table7.xls

This shows that there were seven snapshots taken of this URL. However, it's important to look at the `status_code` for each snapshot. The first one is a `302` redirect to a page that was not archived.

* https://web.archive.org/web/20170722060438id_/http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx

The 5th and 6th snapshots are `403` access denied errors.

* https://web.archive.org/web/20241215202549id_/https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx
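
To see at a glance how the snapshots for a URL break down by status, the records can be tallied. This is a minimal sketch: `Record` is a stand-in for the library's `CdxRecord` carrying only the two fields used here, and the three entries are the snapshots mentioned above, not the full result set.

```python
from collections import Counter, namedtuple

# Stand-in for wayback's CdxRecord with only the fields used here (hypothetical).
Record = namedtuple("Record", ["timestamp", "status_code"])


def status_counts(records):
    """Count how many snapshots came back with each HTTP status code."""
    return Counter(r.status_code for r in records)


# Three of the snapshots discussed above, reduced to timestamp and status.
records = [
    Record("20170722060438", 302),
    Record("20170817114824", 200),
    Record("20241215202549", 403),
]

counts = status_counts(records)
print(counts[200], "of", sum(counts.values()), "snapshots are 200 OK")
```

With a live client the same function should accept the iterator returned by `wb.search(url)`, since it only reads `status_code`.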

So we need a function that will return the timestamps for all 200 OK responses, or an empty list in the case of the URL not being archived in the Wayback Machine.

In [14]:
def snapshots(url):
    print(f"looking up {url}")
    return [result.timestamp for result in wb.search(url) if result.status_code == 200]

In [15]:
snapshots('https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx')

looking up https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx


[datetime.datetime(2017, 8, 17, 11, 48, 24, tzinfo=datetime.timezone.utc),
 datetime.datetime(2021, 3, 26, 18, 7, 8, tzinfo=datetime.timezone.utc),
 datetime.datetime(2025, 3, 24, 6, 42, 7, tzinfo=datetime.timezone.utc)]

In [16]:
snapshots('https://www2.census.gov/econ/esp/2012/esp2012_table7-bogus.xlsx')

looking up https://www2.census.gov/econ/esp/2012/esp2012_table7-bogus.xlsx


[]

## Sample the Data

There are 4,496,871 URLs. If we checked one per second, one at a time, it would take 52 days. In this notebook we're just going to sample them to get a sense of the coverage. 
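
The 52-day figure is just the URL count divided by the number of seconds in a day:

```python
total_urls = 4_496_871
seconds_per_day = 24 * 60 * 60  # 86,400 seconds

# One lookup per second, strictly sequential.
days = total_urls / seconds_per_day
print(f"{days:.2f} days")
```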

According to [this](https://www.calculator.net/sample-size-calculator.html?type=1&cl=95&ci=5&pp=50&ps=4496871&x=Calculate) calculator, if we want 95% confidence with 5% margin of error we can randomly sample 385 URLs. This should be good enough to get a sense of the coverage to start.
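
The 385 figure can be reproduced with the usual sample size formula for a proportion plus a finite population correction. This is a sketch of the standard textbook calculation; we assume, but have not checked, that the linked calculator does something equivalent. The function name `sample_size` is our own.

```python
import math


def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    # Infinite-population size; p=0.5 is the most conservative choice.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Correct for the finite population, then round up to a whole URL.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(sample_size(4_496_871))
```

With a population this large the correction barely changes the result, which is why 385 is the familiar answer for 95% confidence and a 5% margin.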

In [17]:
sample = df.sample(385)
sample

Unnamed: 0,ftp_url,web_url
2242842,ftp2.census.gov/geo/maps/DC2020/DC20BLK/st38_n...,https://www2.census.gov/geo/maps/DC2020/DC20BL...
1061509,ftp2.census.gov/geo/tiger/TIGER_RD18/STATE/48_...,https://www2.census.gov/geo/tiger/TIGER_RD18/S...
3092520,ftp2.census.gov/programs-surveys/acs/data/cust...,https://www2.census.gov/programs-surveys/acs/d...
2428354,ftp2.census.gov/geo/maps/pl10map/cou_blk/st29_...,https://www2.census.gov/geo/maps/pl10map/cou_b...
1157868,ftp2.census.gov/geo/pvs/tiger2010st/28_Mississ...,https://www2.census.gov/geo/pvs/tiger2010st/28...
...,...,...
3280494,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...
1928425,ftp2.census.gov/geo/maps/blk2000/st42_Pennsylv...,https://www2.census.gov/geo/maps/blk2000/st42_...
2490152,ftp2.census.gov/geo/maps/trt1990/st39_Ohio/390...,https://www2.census.gov/geo/maps/trt1990/st39_...
2961891,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...


And now we can apply our function to it to create a new column holding the archive timestamps if available:

In [None]:
sample['archived'] = sample.web_url.apply(snapshots)
sample

looking up https://www2.census.gov/geo/maps/DC2020/DC20BLK/st38_nd/place/p3821520_east_dunseith/DC20BLK_P3821520_BLK2MS.txt
looking up https://www2.census.gov/geo/tiger/TIGER_RD18/STATE/48_TEXAS/48031/tl_rd22_48031_edges.zip
looking up https://www2.census.gov/programs-surveys/acs/data/custom_tabulation/ST404/2019 ACS 1-year/Detailed Tables/000-Legal-Services-Areas/B28009F_000.csv
looking up https://www2.census.gov/geo/maps/pl10map/cou_blk/st29_mo/c29145_newton/PL10BLK_C29145_013.pdf
looking up https://www2.census.gov/geo/pvs/tiger2010st/28_Mississippi/28135/tl_2010_28135_featnames.zip
looking up https://www2.census.gov/programs-surveys/acs/data/archive/multiyear_estimates_study/5 Year Profile/2001-2005/Arizona/County Level/Tract Level/Census Tract 004629 Pima County.xls
looking up https://www2.census.gov/geo/maps/dc10map/tract/st49_ut/c49053_washington/DC10CT_C49053_001.pdf
looking up https://www2.census.gov/geo/maps/dc10map/GUBlock/st30_mt/cousub/cs3008992688_plains/DC10BLK_CS30089926

In [19]:
sample

Unnamed: 0,ftp_url,web_url,archived
2242842,ftp2.census.gov/geo/maps/DC2020/DC20BLK/st38_n...,https://www2.census.gov/geo/maps/DC2020/DC20BL...,[]
1061509,ftp2.census.gov/geo/tiger/TIGER_RD18/STATE/48_...,https://www2.census.gov/geo/tiger/TIGER_RD18/S...,[]
3092520,ftp2.census.gov/programs-surveys/acs/data/cust...,https://www2.census.gov/programs-surveys/acs/d...,[]
2428354,ftp2.census.gov/geo/maps/pl10map/cou_blk/st29_...,https://www2.census.gov/geo/maps/pl10map/cou_b...,[]
1157868,ftp2.census.gov/geo/pvs/tiger2010st/28_Mississ...,https://www2.census.gov/geo/pvs/tiger2010st/28...,[]
...,...,...,...
3280494,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...,[]
1928425,ftp2.census.gov/geo/maps/blk2000/st42_Pennsylv...,https://www2.census.gov/geo/maps/blk2000/st42_...,[]
2490152,ftp2.census.gov/geo/maps/trt1990/st39_Ohio/390...,https://www2.census.gov/geo/maps/trt1990/st39_...,[]
2961891,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[]
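
The `apply` call above makes 385 network requests in sequence, and if any one lookup raises an exception the column is never assigned. A minimal sketch of a wrapper follows, assuming we would rather record a failure as `None` than stop. `safe_lookup`, `retries` and `delay` are hypothetical names and not part of the wayback library.

```python
import time


def safe_lookup(lookup, url, retries=3, delay=1.0):
    """Call lookup(url), retrying on any exception; return None if every attempt fails."""
    for attempt in range(retries):
        try:
            return lookup(url)
        except Exception:
            # Wait a little longer after each failed attempt.
            time.sleep(delay * (attempt + 1))
    return None
```

It could be used as `sample.web_url.apply(lambda u: safe_lookup(snapshots, u))`. A `None` result then means "lookup failed", which is different from `[]` ("not archived") and would need to be handled before counting snapshots.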


Create a column to make it easy to see the counts of snapshots per URL:

In [48]:
sample['num_snapshots'] = sample['archived'].apply(len)
sample

Unnamed: 0,ftp_url,web_url,archived,num_snapshots
2242842,ftp2.census.gov/geo/maps/DC2020/DC20BLK/st38_n...,https://www2.census.gov/geo/maps/DC2020/DC20BL...,[],0
1061509,ftp2.census.gov/geo/tiger/TIGER_RD18/STATE/48_...,https://www2.census.gov/geo/tiger/TIGER_RD18/S...,[],0
3092520,ftp2.census.gov/programs-surveys/acs/data/cust...,https://www2.census.gov/programs-surveys/acs/d...,[],0
2428354,ftp2.census.gov/geo/maps/pl10map/cou_blk/st29_...,https://www2.census.gov/geo/maps/pl10map/cou_b...,[],0
1157868,ftp2.census.gov/geo/pvs/tiger2010st/28_Mississ...,https://www2.census.gov/geo/pvs/tiger2010st/28...,[],0
...,...,...,...,...
3280494,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...,[],0
1928425,ftp2.census.gov/geo/maps/blk2000/st42_Pennsylv...,https://www2.census.gov/geo/maps/blk2000/st42_...,[],0
2490152,ftp2.census.gov/geo/maps/trt1990/st39_Ohio/390...,https://www2.census.gov/geo/maps/trt1990/st39_...,[],0
2961891,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[],0


Despite the truncated view above, there are some with snapshots:

In [49]:
sample[sample.num_snapshots > 0]

Unnamed: 0,ftp_url,web_url,archived,num_snapshots
1805247,ftp2.census.gov/geo/maps/dc10map/tract/st49_ut...,https://www2.census.gov/geo/maps/dc10map/tract...,"[2017-08-24 21:40:04+00:00, 2020-11-01 09:31:2...",2
1327356,ftp2.census.gov/geo/maps/blk1990/st13_Georgia/...,https://www2.census.gov/geo/maps/blk1990/st13_...,[2025-03-24 21:53:21+00:00],1
4237752,ftp2.census.gov/programs-surveys/nychvs/tables...,https://www2.census.gov/programs-surveys/nychv...,[2025-03-03 21:08:25+00:00],1
2979766,ftp2.census.gov/census_2000/datasets/AIAN_SF/N...,https://www2.census.gov/census_2000/datasets/A...,"[2008-11-05 10:26:22+00:00, 2015-09-22 15:30:3...",4
1904405,ftp2.census.gov/geo/maps/blk2000/st47_Tennesse...,https://www2.census.gov/geo/maps/blk2000/st47_...,[2025-03-20 17:03:16+00:00],1
...,...,...,...,...
693091,ftp2.census.gov/geo/tiger/TIGER2018/FEATNAMES/...,https://www2.census.gov/geo/tiger/TIGER2018/FE...,"[2020-10-30 16:05:47+00:00, 2021-05-21 04:57:0...",2
1897217,ftp2.census.gov/geo/maps/blk2000/st13_Georgia/...,https://www2.census.gov/geo/maps/blk2000/st13_...,[2025-03-26 00:42:20+00:00],1
1871975,ftp2.census.gov/geo/maps/blk2000/st20_Kansas/P...,https://www2.census.gov/geo/maps/blk2000/st20_...,[2025-03-31 03:50:07+00:00],1
2911663,ftp2.census.gov/census_2000/datasets/109_Congr...,https://www2.census.gov/census_2000/datasets/1...,[2015-10-20 08:37:18+00:00],1


Let's double-check a few that have zero snapshots by looking at them manually...

In [58]:
# None disables column truncation so the full URLs are shown.
pandas.set_option('display.max_colwidth', None)

sample[sample['num_snapshots'] == 0][0:10].web_url

2242842    https://www2.census.gov/geo/maps/DC2020/DC20BLK/st38_nd/place/p3821520_east_dunseith/DC20BLK_P3821520_BLK2MS.txt                                                                 
1061509    https://www2.census.gov/geo/tiger/TIGER_RD18/STATE/48_TEXAS/48031/tl_rd22_48031_edges.zip                                                                                        
3092520    https://www2.census.gov/programs-surveys/acs/data/custom_tabulation/ST404/2019 ACS 1-year/Detailed Tables/000-Legal-Services-Areas/B28009F_000.csv                               
2428354    https://www2.census.gov/geo/maps/pl10map/cou_blk/st29_mo/c29145_newton/PL10BLK_C29145_013.pdf                                                                                    
1157868    https://www2.census.gov/geo/pvs/tiger2010st/28_Mississippi/28135/tl_2010_28135_featnames.zip                                                                                     
3292122    https://www2.census.gov/programs-surveys/acs

So https://www2.census.gov/geo/maps/pl10map/cou_blk/st29_mo/c29145_newton/PL10BLK_C29145_013.pdf is available on the web, but it does appear to be missing from the Wayback Machine: https://web.archive.org/web/20250000000000*/https://www2.census.gov/geo/maps/pl10map/cou_blk/st29_mo/c29145_newton/PL10BLK_C29145_013.pdf

The same is true of https://www2.census.gov/geo/maps/DC2020/DC20BLK/st38_nd/place/p3821520_east_dunseith/DC20BLK_P3821520_BLK2MS.txt which hasn't been archived: https://web.archive.org/web/20250000000000*/https://www2.census.gov/geo/maps/DC2020/DC20BLK/st38_nd/place/p3821520_east_dunseith/DC20BLK_P3821520_BLK2MS.txt

So it appears that the lookup is working correctly...
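
The manual check links above all follow the same pattern, so they can be generated for any URL in the sample. A small sketch, with `calendar_url` being a name of our own:

```python
def calendar_url(url: str, year: int = 2025) -> str:
    """Build the Wayback Machine calendar-view link for a URL, as used in the checks above."""
    # The timestamp slot is the year followed by ten zeros, then a "*" wildcard.
    return f"https://web.archive.org/web/{year}0000000000*/{url}"


print(calendar_url(
    "https://www2.census.gov/geo/maps/pl10map/cou_blk/st29_mo/c29145_newton/PL10BLK_C29145_013.pdf"
))
```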

## Findings

What percentage of our sample have snapshots?

In [60]:
len(sample[sample.num_snapshots > 0]) / len(sample)

0.4623376623376623

So based on our sample we can say, with 95% confidence and a 5% margin of error, that only about 46% of the Census FTP URLs have a `200 OK` snapshot in the Wayback Machine. This is a bit of a surprise.
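
To make the margin of error concrete, here is a sketch of a normal-approximation (Wald) interval for the observed proportion. The count of 178 is what the fraction printed above implies for a sample of 385. The choice of the Wald interval is ours, and other interval methods would give slightly different bounds.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin


# 178 of the 385 sampled URLs had at least one 200 OK snapshot.
low, high = wald_interval(178, 385)
print(f"{low:.3f} to {high:.3f}")
```

This puts the coverage somewhere between roughly 41% and 51%, consistent with the 5% margin the sample size was chosen for.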