# Census FTP Archival Coverage

This notebook uses a listing of files from `ftp2.census.gov` to determine how many of them were archived by the Wayback Machine at the FTP site's web "frontend" `www2.census.gov`. The list was created by Andrew Berger using lftp to copy all the files from ftp2.census.gov, and then using `ls` to list the files.

## Get a DataFrame

Create a DataFrame from the ISO-8859-1 encoded list of files.

In [1]:
import gzip
import pandas

urls = [url.strip().decode("iso-8859-1") for url in gzip.open('data/all-downloaded-census-2025-03-29.txt.gz')]
df = pandas.DataFrame({"ftp_url": urls})
df

Unnamed: 0,ftp_url
0,ftp2.census.gov/econ/esp/2012/esp2012_table7.xlsx
1,ftp2.census.gov/econ/esp/2012/Table 4.xlsx
2,ftp2.census.gov/econ/esp/2012/esp2012_table6.xlsx
3,ftp2.census.gov/econ/esp/2012/Table 6.xlsx
4,ftp2.census.gov/econ/esp/2012/esp2012_table8.xlsx
...,...
4496866,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...
4496867,ftp2.census.gov/acs2011_1yr/summaryfile/ACS201...
4496868,ftp2.census.gov/acs2011_1yr/summaryfile/ACS_20...
4496869,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...


Use the FTP URL to create the respective Web URL:

In [2]:
import re

df['web_url'] = df.ftp_url.apply(lambda s: re.sub('^ftp2', 'https://www2', s))
df

Unnamed: 0,ftp_url,web_url
0,ftp2.census.gov/econ/esp/2012/esp2012_table7.xlsx,https://www2.census.gov/econ/esp/2012/esp2012_...
1,ftp2.census.gov/econ/esp/2012/Table 4.xlsx,https://www2.census.gov/econ/esp/2012/Table 4....
2,ftp2.census.gov/econ/esp/2012/esp2012_table6.xlsx,https://www2.census.gov/econ/esp/2012/esp2012_...
3,ftp2.census.gov/econ/esp/2012/Table 6.xlsx,https://www2.census.gov/econ/esp/2012/Table 6....
4,ftp2.census.gov/econ/esp/2012/esp2012_table8.xlsx,https://www2.census.gov/econ/esp/2012/esp2012_...
...,...,...
4496866,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...,https://www2.census.gov/acs2011_1yr/summaryfil...
4496867,ftp2.census.gov/acs2011_1yr/summaryfile/ACS201...,https://www2.census.gov/acs2011_1yr/summaryfil...
4496868,ftp2.census.gov/acs2011_1yr/summaryfile/ACS_20...,https://www2.census.gov/acs2011_1yr/summaryfil...
4496869,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...,https://www2.census.gov/acs2011_1yr/summaryfil...


## Lookup in Wayback

To look up a URL in the Wayback Machine we can use the [wayback](https://wayback.readthedocs.io/en/stable/index.html) module which talks to the Wayback Machine's [CDX API](https://archive.org/developers/wayback-cdx-server.html).

In [26]:
import wayback

wb = wayback.WaybackClient()

In [27]:
df.web_url[0]

'https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx'

In [28]:
for result in wb.search('https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx'):
    print(result)

CdxRecord(key='gov,census)/econ/esp/2012/esp2012_table7.xlsx', timestamp=datetime.datetime(2017, 7, 22, 6, 4, 38, tzinfo=datetime.timezone.utc), url='http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx', mime_type='unk', status_code=302, digest='3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ', length=418, raw_url='https://web.archive.org/web/20170722060438id_/http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx', view_url='https://web.archive.org/web/20170722060438/http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx')
CdxRecord(key='gov,census)/econ/esp/2012/esp2012_table7.xlsx', timestamp=datetime.datetime(2017, 8, 17, 11, 48, 24, tzinfo=datetime.timezone.utc), url='https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx', mime_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', status_code=200, digest='6Y6373IAFSDMMIOTXDXHGACICLMA2Y4J', length=30557, raw_url='https://web.archive.org/web/20170817114824id_/https://www2.census.gov/econ/esp/2012/esp2012_table7.xls

This shows that there were seven snapshots taken of this URL. However, it's important to look at the `status_code` for each snapshot. The first one is a `302` redirect to a page that was not archived.

* https://web.archive.org/web/20170722060438id_/http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx

The 5th and 6th snapshots are `403` access denied errors.

* https://web.archive.org/web/20241215202549id_/https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx
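To see at a glance how many snapshots fall under each status, the records can be tallied by `status_code`. This is only a minimal sketch: `status_counts` is a hypothetical helper (not part of the `wayback` module), and the stand-in records are made up for illustration rather than being real CDX results.

```python
from collections import Counter
from types import SimpleNamespace


def status_counts(records):
    """Count records by status_code (hypothetical helper, not part of wayback)."""
    return Counter(record.status_code for record in records)


# Made-up stand-ins for CdxRecord objects; only status_code is used here.
fake_records = [SimpleNamespace(status_code=code) for code in (302, 200, 200, 403)]
print(status_counts(fake_records))
```

In the notebook the same helper could presumably be applied to live results, for example `status_counts(wb.search(df.web_url[0]))`.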

So we need a function that will return the timestamps for all 200 OK responses, or an empty list in the case of the URL not being archived in the Wayback Machine. I'm going to have it cache results in case we happen to look up the same URL more than once. It is helpful to catch exceptions from the underlying requests module, since the CDX API can sometimes be erratic and close connections.

In [35]:
import time
from functools import cache

# needed for the requests.exceptions.RequestException handler below
import requests

@cache
def snapshots(url):
    # flagging our wayback client as global lets us recreate it if we hit an exception
    global wb
    
    tries = 0
    while tries < 20:
        try:
            tries += 1
            time.sleep(0.5)
            results = [result.timestamp for result in wb.search(url) if result.status_code == 200]
            print(url, len(results))
            return results
        except requests.exceptions.RequestException as e:
            print(f'caught exception {e} on try {tries}')
            wb = wayback.WaybackClient()
            time.sleep(1 * tries)

In [36]:
snapshots('https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx')

https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx 3


[datetime.datetime(2017, 8, 17, 11, 48, 24, tzinfo=datetime.timezone.utc),
 datetime.datetime(2021, 3, 26, 18, 7, 8, tzinfo=datetime.timezone.utc),
 datetime.datetime(2025, 3, 24, 6, 42, 7, tzinfo=datetime.timezone.utc)]

In [37]:
snapshots('https://www2.census.gov/econ/esp/2012/esp2012_table7-bogus.xlsx')

https://www2.census.gov/econ/esp/2012/esp2012_table7-bogus.xlsx 0


[]

## Sample the Data

There are 4,496,871 URLs. If we checked one per second, one at a time, it would take 52 days. Even that estimate is optimistic, because some requests take multiple seconds. They could be run in parallel, but this would put strain on the CDX API and we might get temporarily blocked by the Internet Archive. So, in this notebook we're going to sample the URLs to get a sense of the coverage, rather than looking at all of them.
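The 52-day figure can be checked with a quick calculation. This sketch assumes exactly one second per lookup, which is the best case described above; `seconds_per_lookup` is just an illustrative name.

```python
from datetime import timedelta

total_urls = 4_496_871      # URL count quoted above
seconds_per_lookup = 1      # assumed best case; real lookups are often slower

estimate = timedelta(seconds=total_urls * seconds_per_lookup)
print(estimate.days)        # whole days needed for sequential lookups
```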

According to [this](https://www.calculator.net/sample-size-calculator.html?type=1&cl=95&ci=5&pp=50&ps=4496871&x=Calculate) calculator, if we want 95% confidence with 5% margin of error we can randomly sample 385 URLs out of the 4,496,871. This should be good enough to get a sense of the coverage to start.
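For reference, the 385 figure can be reproduced with the standard sample-size formula for a proportion (Cochran's formula with a finite population correction). This is a sketch of that textbook calculation, presumably the same one the calculator performs; the function and parameter names are my own.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, proportion=0.5):
    """Sample size for estimating a proportion (illustrative helper).

    proportion=0.5 is the most conservative assumption, since it
    maximizes the variance p * (1 - p).
    """
    # two-sided z-score for the requested confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Cochran's formula for an effectively infinite population
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    # finite population correction
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(sample_size(4_496_871))
```

With a population this large the correction barely matters: the uncorrected value is already about 384.1, which rounds up to 385.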

In [17]:
sample = df.sample(385)
sample

Unnamed: 0,ftp_url,web_url
2242842,ftp2.census.gov/geo/maps/DC2020/DC20BLK/st38_n...,https://www2.census.gov/geo/maps/DC2020/DC20BL...
1061509,ftp2.census.gov/geo/tiger/TIGER_RD18/STATE/48_...,https://www2.census.gov/geo/tiger/TIGER_RD18/S...
3092520,ftp2.census.gov/programs-surveys/acs/data/cust...,https://www2.census.gov/programs-surveys/acs/d...
2428354,ftp2.census.gov/geo/maps/pl10map/cou_blk/st29_...,https://www2.census.gov/geo/maps/pl10map/cou_b...
1157868,ftp2.census.gov/geo/pvs/tiger2010st/28_Mississ...,https://www2.census.gov/geo/pvs/tiger2010st/28...
...,...,...
3280494,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...
1928425,ftp2.census.gov/geo/maps/blk2000/st42_Pennsylv...,https://www2.census.gov/geo/maps/blk2000/st42_...
2490152,ftp2.census.gov/geo/maps/trt1990/st39_Ohio/390...,https://www2.census.gov/geo/maps/trt1990/st39_...
2961891,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...


And now we can apply our function to it to create a new column holding the archive timestamps if available:

In [None]:
sample['archived'] = sample.web_url.apply(snapshots)
sample

looking up https://www2.census.gov/geo/maps/DC2020/DC20BLK/st38_nd/place/p3821520_east_dunseith/DC20BLK_P3821520_BLK2MS.txt
looking up https://www2.census.gov/geo/tiger/TIGER_RD18/STATE/48_TEXAS/48031/tl_rd22_48031_edges.zip
looking up https://www2.census.gov/programs-surveys/acs/data/custom_tabulation/ST404/2019 ACS 1-year/Detailed Tables/000-Legal-Services-Areas/B28009F_000.csv
looking up https://www2.census.gov/geo/maps/pl10map/cou_blk/st29_mo/c29145_newton/PL10BLK_C29145_013.pdf
looking up https://www2.census.gov/geo/pvs/tiger2010st/28_Mississippi/28135/tl_2010_28135_featnames.zip
looking up https://www2.census.gov/programs-surveys/acs/data/archive/multiyear_estimates_study/5 Year Profile/2001-2005/Arizona/County Level/Tract Level/Census Tract 004629 Pima County.xls
looking up https://www2.census.gov/geo/maps/dc10map/tract/st49_ut/c49053_washington/DC10CT_C49053_001.pdf
looking up https://www2.census.gov/geo/maps/dc10map/GUBlock/st30_mt/cousub/cs3008992688_plains/DC10BLK_CS30089926

In [19]:
sample

Unnamed: 0,ftp_url,web_url,archived
2242842,ftp2.census.gov/geo/maps/DC2020/DC20BLK/st38_n...,https://www2.census.gov/geo/maps/DC2020/DC20BL...,[]
1061509,ftp2.census.gov/geo/tiger/TIGER_RD18/STATE/48_...,https://www2.census.gov/geo/tiger/TIGER_RD18/S...,[]
3092520,ftp2.census.gov/programs-surveys/acs/data/cust...,https://www2.census.gov/programs-surveys/acs/d...,[]
2428354,ftp2.census.gov/geo/maps/pl10map/cou_blk/st29_...,https://www2.census.gov/geo/maps/pl10map/cou_b...,[]
1157868,ftp2.census.gov/geo/pvs/tiger2010st/28_Mississ...,https://www2.census.gov/geo/pvs/tiger2010st/28...,[]
...,...,...,...
3280494,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...,[]
1928425,ftp2.census.gov/geo/maps/blk2000/st42_Pennsylv...,https://www2.census.gov/geo/maps/blk2000/st42_...,[]
2490152,ftp2.census.gov/geo/maps/trt1990/st39_Ohio/390...,https://www2.census.gov/geo/maps/trt1990/st39_...,[]
2961891,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[]


Create a column to make it easy to see the counts of snapshots per URL:

In [48]:
sample['num_snapshots'] = sample['archived'].apply(len)
sample

Unnamed: 0,ftp_url,web_url,archived,num_snapshots
2242842,ftp2.census.gov/geo/maps/DC2020/DC20BLK/st38_n...,https://www2.census.gov/geo/maps/DC2020/DC20BL...,[],0
1061509,ftp2.census.gov/geo/tiger/TIGER_RD18/STATE/48_...,https://www2.census.gov/geo/tiger/TIGER_RD18/S...,[],0
3092520,ftp2.census.gov/programs-surveys/acs/data/cust...,https://www2.census.gov/programs-surveys/acs/d...,[],0
2428354,ftp2.census.gov/geo/maps/pl10map/cou_blk/st29_...,https://www2.census.gov/geo/maps/pl10map/cou_b...,[],0
1157868,ftp2.census.gov/geo/pvs/tiger2010st/28_Mississ...,https://www2.census.gov/geo/pvs/tiger2010st/28...,[],0
...,...,...,...,...
3280494,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...,[],0
1928425,ftp2.census.gov/geo/maps/blk2000/st42_Pennsylv...,https://www2.census.gov/geo/maps/blk2000/st42_...,[],0
2490152,ftp2.census.gov/geo/maps/trt1990/st39_Ohio/390...,https://www2.census.gov/geo/maps/trt1990/st39_...,[],0
2961891,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[],0


Despite the truncated view above, there are some with snapshots:

In [49]:
sample[sample.num_snapshots > 0]

Unnamed: 0,ftp_url,web_url,archived,num_snapshots
1805247,ftp2.census.gov/geo/maps/dc10map/tract/st49_ut...,https://www2.census.gov/geo/maps/dc10map/tract...,"[2017-08-24 21:40:04+00:00, 2020-11-01 09:31:2...",2
1327356,ftp2.census.gov/geo/maps/blk1990/st13_Georgia/...,https://www2.census.gov/geo/maps/blk1990/st13_...,[2025-03-24 21:53:21+00:00],1
4237752,ftp2.census.gov/programs-surveys/nychvs/tables...,https://www2.census.gov/programs-surveys/nychv...,[2025-03-03 21:08:25+00:00],1
2979766,ftp2.census.gov/census_2000/datasets/AIAN_SF/N...,https://www2.census.gov/census_2000/datasets/A...,"[2008-11-05 10:26:22+00:00, 2015-09-22 15:30:3...",4
1904405,ftp2.census.gov/geo/maps/blk2000/st47_Tennesse...,https://www2.census.gov/geo/maps/blk2000/st47_...,[2025-03-20 17:03:16+00:00],1
...,...,...,...,...
693091,ftp2.census.gov/geo/tiger/TIGER2018/FEATNAMES/...,https://www2.census.gov/geo/tiger/TIGER2018/FE...,"[2020-10-30 16:05:47+00:00, 2021-05-21 04:57:0...",2
1897217,ftp2.census.gov/geo/maps/blk2000/st13_Georgia/...,https://www2.census.gov/geo/maps/blk2000/st13_...,[2025-03-26 00:42:20+00:00],1
1871975,ftp2.census.gov/geo/maps/blk2000/st20_Kansas/P...,https://www2.census.gov/geo/maps/blk2000/st20_...,[2025-03-31 03:50:07+00:00],1
2911663,ftp2.census.gov/census_2000/datasets/109_Congr...,https://www2.census.gov/census_2000/datasets/1...,[2015-10-20 08:37:18+00:00],1


Let's double-check a few that have zero snapshots by looking at them manually...

In [58]:
pandas.set_option('max_colwidth', 0)

sample[sample['num_snapshots'] == 0][0:10].web_url

2242842    https://www2.census.gov/geo/maps/DC2020/DC20BLK/st38_nd/place/p3821520_east_dunseith/DC20BLK_P3821520_BLK2MS.txt                                                                 
1061509    https://www2.census.gov/geo/tiger/TIGER_RD18/STATE/48_TEXAS/48031/tl_rd22_48031_edges.zip                                                                                        
3092520    https://www2.census.gov/programs-surveys/acs/data/custom_tabulation/ST404/2019 ACS 1-year/Detailed Tables/000-Legal-Services-Areas/B28009F_000.csv                               
2428354    https://www2.census.gov/geo/maps/pl10map/cou_blk/st29_mo/c29145_newton/PL10BLK_C29145_013.pdf                                                                                    
1157868    https://www2.census.gov/geo/pvs/tiger2010st/28_Mississippi/28135/tl_2010_28135_featnames.zip                                                                                     
3292122    https://www2.census.gov/programs-surveys/acs

So https://www2.census.gov/geo/maps/pl10map/cou_blk/st29_mo/c29145_newton/PL10BLK_C29145_013.pdf is available on the web, but it does appear to be missing from the Wayback Machine: https://web.archive.org/web/20250000000000*/https://www2.census.gov/geo/maps/pl10map/cou_blk/st29_mo/c29145_newton/PL10BLK_C29145_013.pdf

The same is true of https://www2.census.gov/geo/maps/DC2020/DC20BLK/st38_nd/place/p3821520_east_dunseith/DC20BLK_P3821520_BLK2MS.txt which hasn't been archived: https://web.archive.org/web/20250000000000*/https://www2.census.gov/geo/maps/DC2020/DC20BLK/st38_nd/place/p3821520_east_dunseith/DC20BLK_P3821520_BLK2MS.txt

So it appears that the lookup is working correctly...

## Initial Findings

What percentage of our sample have snapshots?

In [60]:
len(sample[sample.num_snapshots > 0]) / len(sample)

0.4623376623376623

So based on our sample we can say, with 95% confidence and a 5% margin of error, that only about 46% of the Census FTP URLs have a `200 OK` snapshot in the Wayback Machine. This is a bit of a surprise.
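To make the margin of error concrete, here is a small sketch that turns the sample proportion into an approximate 95% interval using the normal-approximation (Wald) formula. The count of 178 is not printed anywhere above; it is the value implied by 0.4623... × 385.

```python
import math

n = 385                 # sample size
archived = 178          # implied by the proportion above: 178 / 385 = 0.4623...
p = archived / n

z = 1.96                # z-score for 95% confidence
margin = z * math.sqrt(p * (1 - p) / n)   # Wald margin of error
low, high = p - margin, p + margin

print(f"{p:.3f} +/- {margin:.3f} ({low:.3f} to {high:.3f})")
```

Under these assumptions the true coverage plausibly lies between roughly 41% and 51%.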

## Stratified Sampling

The files are not evenly distributed across directories, and in some cases the directories correspond to specific Census datasets. Since we have sampled from all of them, missing files in one large directory may be unduly influencing the results.
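One way to see how unevenly the files are spread is to count URLs by top-level directory. The sketch below uses only the standard library and a handful of example paths of the kind shown in the listings above; `top_level_counts` is a hypothetical helper.

```python
from collections import Counter


def top_level_counts(ftp_urls):
    """Count URLs by the first path segment after the host (illustrative helper)."""
    # 'ftp2.census.gov/geo/tiger/...' splits into ['ftp2.census.gov', 'geo', ...]
    return Counter(url.split("/")[1] for url in ftp_urls)


# A few example paths for illustration; the real list has millions of entries.
example_urls = [
    "ftp2.census.gov/econ/esp/2012/esp2012_table7.xlsx",
    "ftp2.census.gov/econ/esp/2012/Table 4.xlsx",
    "ftp2.census.gov/geo/tiger/TIGER2024/FACES/tl_2024_28127_faces.zip",
    "ftp2.census.gov/geo/tiger/TIGER2024/BG/tl_2024_34_bg.zip",
    "ftp2.census.gov/geo/maps/pl10map/cou_blk/st29_mo/c29145_newton/PL10BLK_C29145_013.pdf",
]
print(top_level_counts(example_urls).most_common())
```

On the full DataFrame something like `df.ftp_url.str.split('/').str[1].value_counts()` should give the same breakdown.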

The [Big Local News](https://biglocalnews.org/content/about/) project at Stanford has expressed interest in archiving specific directories, so it is helpful to be able to analyze these specifically to ascertain how important it is to archive them. 

* programs-surveys/acs/summary_file/ 
* programs-surveys/decennial/ 
* geo/tiger/TIGER2024/ 
* geo/tiger/TIGER2023/ 
* geo/tiger/TIGER2022/

In [19]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/programs-surveys/acs/summary_file')])

415122

In [20]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/programs-surveys/decennial/')])

20653

In [21]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2024/')])

33224

In [22]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2023/')])

33165

In [23]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2022/')])

33163

To speed up these per-directory tests it's helpful to create a function that packages up the sampling and lookup steps from above. A sample size of 385 still works with these smaller population sizes, giving us at least the same level of confidence and margin of error.
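As a rough check on that claim, the worst-case margin of error for a sample of 385 can be computed for each subset size counted above, including the finite population correction. This is a sketch of the textbook formula; `margin_of_error` and the dictionary keys are illustrative names.

```python
import math


def margin_of_error(n, population, z=1.96, proportion=0.5):
    """Worst-case margin of error for a sample of n drawn from a finite population."""
    # finite population correction: shrinks the margin as n approaches the population
    fpc = (population - n) / (population - 1)
    return z * math.sqrt(proportion * (1 - proportion) / n * fpc)


# subset sizes taken from the counts above
subset_sizes = {
    "acs/summary_file": 415_122,
    "decennial": 20_653,
    "TIGER2024": 33_224,
    "TIGER2023": 33_165,
    "TIGER2022": 33_163,
}

for name, size in subset_sizes.items():
    print(name, round(margin_of_error(385, size), 4))
```

Each value comes out just under 0.05, and the smaller the subset, the slightly smaller the margin.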

In [38]:
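def make_sample(df, n=385):
    """Randomly sample n rows and look up the Wayback snapshots for each web_url."""
    sample = df.sample(n)
    # one CDX lookup per sampled URL, so this is the slow step
    sample['archived'] = sample.web_url.apply(snapshots)
    return sample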

Now the different samples for each subset can be created. This will take some time to run.

In [39]:
summary_file = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/programs-surveys/acs/summary_file')])

https://www2.census.gov/programs-surveys/acs/summary_file/2016/data/5_year_seq_by_state/WestVirginia/All_Geographies_Not_Tracts_Block_Groups/20165wv0042000.zip 0
https://www2.census.gov/programs-surveys/acs/summary_file/2020/data/5_year_seq_by_state/Iowa/Tracts_Block_Groups_Only/20205ia0050000.zip 0
https://www2.census.gov/programs-surveys/acs/summary_file/2020/data/5_year_seq_by_state/Illinois/All_Geographies_Not_Tracts_Block_Groups/20205il0140000.zip 0
https://www2.census.gov/programs-surveys/acs/summary_file/2014/data/5_year_seq_by_state/Florida/All_Geographies_Not_Tracts_Block_Groups/20145fl0062000.zip 0
https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/computer_internet_tables/B28009B_39.csv.gz 3
https://www2.census.gov/programs-surveys/acs/summary_file/2018/data/5_year_seq_by_state/NorthDakota/All_Geographies_Not_Tracts_Block_Groups/20185nd0063000.zip 2
https://www2.census.gov/programs-surveys/acs/summary_file/2013/data/1_year_seq_by_state/Montana/20131mt0017000

In [40]:
decennial = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/programs-surveys/decennial/')])

https://www2.census.gov/programs-surveys/decennial/2010/data/10-Island_Areas_Detailed_Cross_Tabulations/Virgin_Islands/FOR DCT-1 SMOC User Note for FTP v2.doc 0
https://www2.census.gov/programs-surveys/decennial/2010/data/01-Redistricting_File--PL_94-171/Illinois/0README_PL.doc 0
https://www2.census.gov/programs-surveys/decennial/2000/phc/phc-t-32/tab01-UT.xls 2
https://www2.census.gov/programs-surveys/decennial/coverage-measurement/tables/2010/ccm-results-alaska/plc0203000.pdf 2
https://www2.census.gov/programs-surveys/decennial/coverage-measurement/stage-coverage_measurement/pdfs/florida/plc1225175.pdf 0
https://www2.census.gov/programs-surveys/decennial/2020/program-management/pmr-materials/2013-09-24/2020-census-9-24-13-pmr-materials-final.zip 4
https://www2.census.gov/programs-surveys/decennial/tables/2000/stp-159/national/stp159-bolivia.pdf 2
https://www2.census.gov/programs-surveys/decennial/tables/2000/county-to-county-worker-flow-files/2kwrkco_az.txt 1
https://www2.census.gov/

In [None]:
tiger2024 = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2024/')])

https://www2.census.gov/geo/tiger/TIGER2024/FACES/tl_2024_28127_faces.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/AREAWATER/tl_2024_05065_areawater.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/FEATNAMES/tl_2024_32033_featnames.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/LINEARWATER/tl_2024_19195_linearwater.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/EDGES/tl_2024_06029_edges.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/ADDRFEAT/tl_2024_45067_addrfeat.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/BG/tl_2024_34_bg.zip 2
https://www2.census.gov/geo/tiger/TIGER2024/ADDR/tl_2024_01019_addr.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/LINEARWATER/tl_2024_08049_linearwater.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/AREAWATER/tl_2024_55067_areawater.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/EDGES/tl_2024_27093_edges.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/ADDR/tl_2024_48043_addr.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/ADD

In [None]:
tiger2023 = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2023/')])

In [None]:
tiger2022 = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2022/')])

The total size of these subsets relative to all the URLs:

In [None]:
(415122 + 20653 + 33224 + 33165 + 33163) / len(df)