# Census FTP Archival Coverage

This notebook uses a listing of files from `ftp2.census.gov` to determine how many of them were archived by the Wayback Machine at the FTP site's web "frontend" `www2.census.gov`. The list was created by Andrew Berger using lftp to copy all the files from ftp2.census.gov, and then using `ls` to list the files.

## Get a DataFrame

Create a DataFrame from the ISO-8859-1 encoded list of files.

In [1]:
import gzip
import pandas

urls = [url.strip().decode("iso-8859-1") for url in gzip.open('data/all-downloaded-census-2025-03-29.txt.gz')]
df = pandas.DataFrame({"ftp_url": urls})
df

Unnamed: 0,ftp_url
0,ftp2.census.gov/econ/esp/2012/esp2012_table7.xlsx
1,ftp2.census.gov/econ/esp/2012/Table 4.xlsx
2,ftp2.census.gov/econ/esp/2012/esp2012_table6.xlsx
3,ftp2.census.gov/econ/esp/2012/Table 6.xlsx
4,ftp2.census.gov/econ/esp/2012/esp2012_table8.xlsx
...,...
4496866,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...
4496867,ftp2.census.gov/acs2011_1yr/summaryfile/ACS201...
4496868,ftp2.census.gov/acs2011_1yr/summaryfile/ACS_20...
4496869,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...


Use the FTP URL to create the respective Web URL:

In [2]:
import re

df['web_url'] = df.ftp_url.apply(lambda s: re.sub('^ftp2', 'https://www2', s))
df

Unnamed: 0,ftp_url,web_url
0,ftp2.census.gov/econ/esp/2012/esp2012_table7.xlsx,https://www2.census.gov/econ/esp/2012/esp2012_...
1,ftp2.census.gov/econ/esp/2012/Table 4.xlsx,https://www2.census.gov/econ/esp/2012/Table 4....
2,ftp2.census.gov/econ/esp/2012/esp2012_table6.xlsx,https://www2.census.gov/econ/esp/2012/esp2012_...
3,ftp2.census.gov/econ/esp/2012/Table 6.xlsx,https://www2.census.gov/econ/esp/2012/Table 6....
4,ftp2.census.gov/econ/esp/2012/esp2012_table8.xlsx,https://www2.census.gov/econ/esp/2012/esp2012_...
...,...,...
4496866,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...,https://www2.census.gov/acs2011_1yr/summaryfil...
4496867,ftp2.census.gov/acs2011_1yr/summaryfile/ACS201...,https://www2.census.gov/acs2011_1yr/summaryfil...
4496868,ftp2.census.gov/acs2011_1yr/summaryfile/ACS_20...,https://www2.census.gov/acs2011_1yr/summaryfil...
4496869,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...,https://www2.census.gov/acs2011_1yr/summaryfil...
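
The `re.sub` call above leaves any string that does not begin with `ftp2` unchanged, so a malformed line in the listing would pass through silently. A minimal sketch of a stricter mapping follows; the helper name `to_web_url` and the two constants are hypothetical and not part of the original notebook.

```python
# Hypothetical constants describing the two hostnames used in this notebook.
FTP_PREFIX = "ftp2.census.gov/"
WEB_PREFIX = "https://www2.census.gov/"


def to_web_url(ftp_url: str) -> str:
    """Map an ftp2.census.gov path to its www2.census.gov web URL."""
    if not ftp_url.startswith(FTP_PREFIX):
        # Fail loudly instead of returning the input unchanged.
        raise ValueError(f"unexpected FTP path: {ftp_url!r}")
    return WEB_PREFIX + ftp_url[len(FTP_PREFIX):]


print(to_web_url("ftp2.census.gov/econ/esp/2012/esp2012_table7.xlsx"))
```

It could be applied the same way as the lambda, for example `df.ftp_url.apply(to_web_url)`.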


## Lookup in Wayback

To look up a URL in the Wayback Machine we can use the [wayback](https://wayback.readthedocs.io/en/stable/index.html) module which talks to the Wayback Machine's [CDX API](https://archive.org/developers/wayback-cdx-server.html).

In [3]:
import wayback

wb = wayback.WaybackClient()

In [4]:
df.web_url[0]

'https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx'

In [5]:
for result in wb.search('https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx'):
    print(result)

CdxRecord(key='gov,census)/econ/esp/2012/esp2012_table7.xlsx', timestamp=datetime.datetime(2017, 7, 22, 6, 4, 38, tzinfo=datetime.timezone.utc), url='http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx', mime_type='unk', status_code=302, digest='3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ', length=418, raw_url='https://web.archive.org/web/20170722060438id_/http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx', view_url='https://web.archive.org/web/20170722060438/http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx')
CdxRecord(key='gov,census)/econ/esp/2012/esp2012_table7.xlsx', timestamp=datetime.datetime(2017, 8, 17, 11, 48, 24, tzinfo=datetime.timezone.utc), url='https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx', mime_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', status_code=200, digest='6Y6373IAFSDMMIOTXDXHGACICLMA2Y4J', length=30557, raw_url='https://web.archive.org/web/20170817114824id_/https://www2.census.gov/econ/esp/2012/esp2012_table7.xls

This shows that there were seven snapshots taken of this URL. However, it's important to look at the `status_code` for each snapshot. The first one is a `302` redirect to a page that was not archived.

* https://web.archive.org/web/20170722060438id_/http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx

The 5th and 6th snapshots are `403` access denied errors.

* https://web.archive.org/web/20241215202549id_/https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx
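
One quick way to see the mix of responses for a URL is to tally the `status_code` of each record. The sketch below assumes only that each record has a `status_code` attribute, as the `CdxRecord` output above shows; the helper name `status_counts` and the stand-in records are hypothetical.

```python
from collections import Counter
from types import SimpleNamespace


def status_counts(records):
    """Count snapshots per HTTP status code."""
    return Counter(record.status_code for record in records)


# Hypothetical stand-ins for CdxRecord objects; only status_code is used.
fake_records = [SimpleNamespace(status_code=code) for code in (302, 200, 403, 200)]
print(status_counts(fake_records))
```

In the notebook the same helper could be given the search results directly, for example `status_counts(wb.search(df.web_url[0]))`.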

So we need a function that will return the timestamps for all 200 OK responses, or an empty list in the case of the URL not being archived in the Wayback Machine. I'm going to have it cache results in case we happen to look up the same URL more than once. It is helpful to catch exceptions from the underlying requests module, since the CDX API can sometimes be erratic and close connections.

In [6]:
import time
from functools import cache

import requests

@cache
def snapshots(url):
    # flagging our wayback client as global lets us recreate it if we hit an exception
    global wb
    
    tries = 0
    while tries < 20:
        try:
            tries += 1
            time.sleep(0.5)
            results = [result.timestamp for result in wb.search(url) if result.status_code == 200]
            print(url, len(results))
            return results
        except requests.exceptions.RequestException as e:
            print(f'caught exception {e} on try {tries}')
            wb = wayback.WaybackClient()
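            time.sleep(1 * tries)

    # every attempt failed: raise rather than return None, which would break len() later
    raise RuntimeError(f'gave up on {url} after {tries} tries')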

In [7]:
snapshots('https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx')

https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx 3


[datetime.datetime(2017, 8, 17, 11, 48, 24, tzinfo=datetime.timezone.utc),
 datetime.datetime(2021, 3, 26, 18, 7, 8, tzinfo=datetime.timezone.utc),
 datetime.datetime(2025, 3, 24, 6, 42, 7, tzinfo=datetime.timezone.utc)]

In [8]:
snapshots('https://www2.census.gov/econ/esp/2012/esp2012_table7-bogus.xlsx')

https://www2.census.gov/econ/esp/2012/esp2012_table7-bogus.xlsx 0


[]
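
The retry behaviour inside `snapshots` can also be separated from the lookup itself. The following is a minimal sketch of that pattern, not the notebook's implementation: the `retry` helper, its parameters, and the `flaky` function are all hypothetical, and the delay grows linearly with the attempt number as in the function above.

```python
import time


def retry(func, *args, tries=5, base_delay=0.0, exceptions=(Exception,)):
    """Call func(*args), sleeping base_delay * attempt between failed attempts."""
    for attempt in range(1, tries + 1):
        try:
            return func(*args)
        except exceptions as error:
            if attempt == tries:
                raise  # out of attempts: surface the last error
            print(f"caught exception {error} on try {attempt}")
            time.sleep(base_delay * attempt)


# Hypothetical flaky function: fails twice, then succeeds.
calls = {"count": 0}


def flaky(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection closed")
    return [url]


result = retry(flaky, "https://www2.census.gov/", tries=5)
print(result, calls["count"])
```

Re-raising on the final attempt keeps a failed lookup distinct from a URL that genuinely has no snapshots, which matters when the results are counted later.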

## Sample the Data

There are 4,496,871 URLs. Checking them one at a time at one per second would take about 52 days, and that estimate is optimistic because some requests take multiple seconds. The lookups could be run in parallel, but this would put strain on the CDX API and we might get temporarily blocked by the Internet Archive. So, in this notebook we're going to sample the URLs to get a sense of the coverage, rather than looking at all of them.

According to [this](https://www.calculator.net/sample-size-calculator.html?type=1&cl=95&ci=5&pp=50&ps=4496871&x=Calculate) calculator, if we want 95% confidence with 5% margin of error we can randomly sample 385 URLs out of the 4,496,871. This should be good enough to get a sense of the coverage to start.
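
The figure of 385 can be reproduced by hand. The sketch below uses Cochran's sample size formula with a finite population correction, assuming z = 1.96 for 95% confidence and the most conservative proportion p = 0.5; the linked calculator may apply a slightly different form of the correction, and the function name `sample_size` is hypothetical.

```python
import math


def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's sample size with a finite population correction."""
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it slightly to account for the finite population.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(sample_size(4_496_871))
```

For a population this large the correction barely matters: the uncorrected value is about 384.16, which rounds up to 385.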

In [9]:
sample = df.sample(385)
sample

Unnamed: 0,ftp_url,web_url
3413433,ftp2.census.gov/programs-surveys/acs/summary_f...,https://www2.census.gov/programs-surveys/acs/s...
4387421,ftp2.census.gov/EEO_2014_2018/EEO_Tables_By_Al...,https://www2.census.gov/EEO_2014_2018/EEO_Tabl...
2110840,ftp2.census.gov/geo/maps/blk2000/st27_Minnesot...,https://www2.census.gov/geo/maps/blk2000/st27_...
1102668,ftp2.census.gov/geo/tiger/TIGERrd13/FACES/tl_r...,https://www2.census.gov/geo/tiger/TIGERrd13/FA...
2627584,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...
...,...,...
3951886,ftp2.census.gov/programs-surveys/cps/tables/po...,https://www2.census.gov/programs-surveys/cps/t...
3256668,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...
383015,ftp2.census.gov/geo/tiger/TIGER2008/24_MARYLAN...,https://www2.census.gov/geo/tiger/TIGER2008/24...
2371708,ftp2.census.gov/geo/maps/pl10map/cou_blk/st46_...,https://www2.census.gov/geo/maps/pl10map/cou_b...


And now we can apply our function to it to create a new column holding the archive timestamps if available:

In [10]:
sample['archived'] = sample.web_url.apply(snapshots)
sample

https://www2.census.gov/programs-surveys/acs/summary_file/2008/data/3_year/California/20083ca0138000.zip 0
https://www2.census.gov/EEO_2014_2018/EEO_Tables_By_All_Areas/JSON-Files/ALL5R/160/eeoall5r_16000us1208150_databytype.json 0
https://www2.census.gov/geo/maps/blk2000/st27_Minnesota/County/27037_Dakota/CBC27037_010.pdf 0
https://www2.census.gov/geo/tiger/TIGERrd13/FACES/tl_rd13_56031_faces.zip 3
https://www2.census.gov/census_2000/datasets/Summary_File_4/West_Virginia/wv52002_uf4.zip 1
https://www2.census.gov/geo/maps/pl10map/cou_blk/st16_id/c16035_clearwater/PL10BLK_C16035_016.pdf 0
https://www2.census.gov/census_2000/datasets/Summary_File_4/North_Dakota/nd21522_uf4.zip 3
https://www2.census.gov/census_2000/datasets/Summary_File_4/Kentucky/ky50436_uf4.zip 3
https://www2.census.gov/programs-surveys/trade/indicators/2007/12/exh16a.xls 0
https://www2.census.gov/geo/maps/dc10map/GUBlock/st47_tn/place/p4710440_calhoun/DC10BLK_P4710440_BLK2MS.txt 0
https://www2.census.gov/programs-surve

Unnamed: 0,ftp_url,web_url,archived
3413433,ftp2.census.gov/programs-surveys/acs/summary_f...,https://www2.census.gov/programs-surveys/acs/s...,[]
4387421,ftp2.census.gov/EEO_2014_2018/EEO_Tables_By_Al...,https://www2.census.gov/EEO_2014_2018/EEO_Tabl...,[]
2110840,ftp2.census.gov/geo/maps/blk2000/st27_Minnesot...,https://www2.census.gov/geo/maps/blk2000/st27_...,[]
1102668,ftp2.census.gov/geo/tiger/TIGERrd13/FACES/tl_r...,https://www2.census.gov/geo/tiger/TIGERrd13/FA...,"[2020-10-18 11:29:23+00:00, 2021-05-06 22:43:2..."
2627584,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[2021-05-13 09:37:43+00:00]
...,...,...,...
3951886,ftp2.census.gov/programs-surveys/cps/tables/po...,https://www2.census.gov/programs-surveys/cps/t...,[2025-03-03 22:52:51+00:00]
3256668,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...,[]
383015,ftp2.census.gov/geo/tiger/TIGER2008/24_MARYLAN...,https://www2.census.gov/geo/tiger/TIGER2008/24...,[]
2371708,ftp2.census.gov/geo/maps/pl10map/cou_blk/st46_...,https://www2.census.gov/geo/maps/pl10map/cou_b...,[]


In [11]:
sample

Unnamed: 0,ftp_url,web_url,archived
3413433,ftp2.census.gov/programs-surveys/acs/summary_f...,https://www2.census.gov/programs-surveys/acs/s...,[]
4387421,ftp2.census.gov/EEO_2014_2018/EEO_Tables_By_Al...,https://www2.census.gov/EEO_2014_2018/EEO_Tabl...,[]
2110840,ftp2.census.gov/geo/maps/blk2000/st27_Minnesot...,https://www2.census.gov/geo/maps/blk2000/st27_...,[]
1102668,ftp2.census.gov/geo/tiger/TIGERrd13/FACES/tl_r...,https://www2.census.gov/geo/tiger/TIGERrd13/FA...,"[2020-10-18 11:29:23+00:00, 2021-05-06 22:43:2..."
2627584,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[2021-05-13 09:37:43+00:00]
...,...,...,...
3951886,ftp2.census.gov/programs-surveys/cps/tables/po...,https://www2.census.gov/programs-surveys/cps/t...,[2025-03-03 22:52:51+00:00]
3256668,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...,[]
383015,ftp2.census.gov/geo/tiger/TIGER2008/24_MARYLAN...,https://www2.census.gov/geo/tiger/TIGER2008/24...,[]
2371708,ftp2.census.gov/geo/maps/pl10map/cou_blk/st46_...,https://www2.census.gov/geo/maps/pl10map/cou_b...,[]


Create a column to make it easy to see the counts of snapshots per URL:

In [12]:
sample['num_snapshots'] = sample['archived'].apply(len)
sample

Unnamed: 0,ftp_url,web_url,archived,num_snapshots
3413433,ftp2.census.gov/programs-surveys/acs/summary_f...,https://www2.census.gov/programs-surveys/acs/s...,[],0
4387421,ftp2.census.gov/EEO_2014_2018/EEO_Tables_By_Al...,https://www2.census.gov/EEO_2014_2018/EEO_Tabl...,[],0
2110840,ftp2.census.gov/geo/maps/blk2000/st27_Minnesot...,https://www2.census.gov/geo/maps/blk2000/st27_...,[],0
1102668,ftp2.census.gov/geo/tiger/TIGERrd13/FACES/tl_r...,https://www2.census.gov/geo/tiger/TIGERrd13/FA...,"[2020-10-18 11:29:23+00:00, 2021-05-06 22:43:2...",3
2627584,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[2021-05-13 09:37:43+00:00],1
...,...,...,...,...
3951886,ftp2.census.gov/programs-surveys/cps/tables/po...,https://www2.census.gov/programs-surveys/cps/t...,[2025-03-03 22:52:51+00:00],1
3256668,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...,[],0
383015,ftp2.census.gov/geo/tiger/TIGER2008/24_MARYLAN...,https://www2.census.gov/geo/tiger/TIGER2008/24...,[],0
2371708,ftp2.census.gov/geo/maps/pl10map/cou_blk/st46_...,https://www2.census.gov/geo/maps/pl10map/cou_b...,[],0


Despite the truncated view above, there are some with snapshots:

In [13]:
sample[sample.num_snapshots > 0]

Unnamed: 0,ftp_url,web_url,archived,num_snapshots
1102668,ftp2.census.gov/geo/tiger/TIGERrd13/FACES/tl_r...,https://www2.census.gov/geo/tiger/TIGERrd13/FA...,"[2020-10-18 11:29:23+00:00, 2021-05-06 22:43:2...",3
2627584,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[2021-05-13 09:37:43+00:00],1
2860546,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,"[2008-11-06 05:42:50+00:00, 2021-04-21 20:16:2...",3
2567471,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,"[2021-05-02 21:38:05+00:00, 2024-12-12 00:07:2...",3
864104,ftp2.census.gov/geo/tiger/TIGER2019/FACESAH/tl...,https://www2.census.gov/geo/tiger/TIGER2019/FA...,"[2020-10-19 00:12:39+00:00, 2021-05-10 14:42:2...",3
...,...,...,...,...
2857037,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,"[2008-11-07 19:20:58+00:00, 2021-05-08 05:47:3...",2
859939,ftp2.census.gov/geo/tiger/TIGER2019/ADDRFEAT/t...,https://www2.census.gov/geo/tiger/TIGER2019/AD...,"[2020-10-30 17:28:43+00:00, 2023-01-03 08:21:3...",4
2119657,ftp2.census.gov/geo/maps/blk2000/st04_Arizona/...,https://www2.census.gov/geo/maps/blk2000/st04_...,"[2003-05-17 03:55:27+00:00, 2025-03-25 14:33:2...",2
2642357,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[2025-03-31 01:12:26+00:00],1


Let's double check a few that have zero snapshots by looking at them manually...

In [14]:
pandas.set_option('max_colwidth', 0)

sample[sample['num_snapshots'] == 0][0:10].web_url

3413433    https://www2.census.gov/programs-surveys/acs/summary_file/2008/data/3_year/California/20083ca0138000.zip                                                         
4387421    https://www2.census.gov/EEO_2014_2018/EEO_Tables_By_All_Areas/JSON-Files/ALL5R/160/eeoall5r_16000us1208150_databytype.json                                       
2110840    https://www2.census.gov/geo/maps/blk2000/st27_Minnesota/County/27037_Dakota/CBC27037_010.pdf                                                                     
2416220    https://www2.census.gov/geo/maps/pl10map/cou_blk/st16_id/c16035_clearwater/PL10BLK_C16035_016.pdf                                                                
4102627    https://www2.census.gov/programs-surveys/trade/indicators/2007/12/exh16a.xls                                                                                     
1471738    https://www2.census.gov/geo/maps/dc10map/GUBlock/st47_tn/place/p4710440_calhoun/DC10BLK_P4710440_BLK2MS.txt                 

So https://www2.census.gov/geo/maps/pl10map/cou_blk/st29_mo/c29145_newton/PL10BLK_C29145_013.pdf is available on the web but does appear to be missing from the Wayback Machine: https://web.archive.org/web/20250000000000*/https://www2.census.gov/geo/maps/pl10map/cou_blk/st29_mo/c29145_newton/PL10BLK_C29145_013.pdf

The same is true of https://www2.census.gov/geo/maps/DC2020/DC20BLK/st38_nd/place/p3821520_east_dunseith/DC20BLK_P3821520_BLK2MS.txt which hasn't been archived: https://web.archive.org/web/20250000000000*/https://www2.census.gov/geo/maps/DC2020/DC20BLK/st38_nd/place/p3821520_east_dunseith/DC20BLK_P3821520_BLK2MS.txt

So it appears that the lookup is working correctly...

In [44]:
sample.to_csv('data/census-sample.csv', index=False)

## Initial Findings

What percentage of our sample have snapshots?

In [15]:
len(sample[sample.num_snapshots > 0]) / len(sample)

0.4675324675324675

So based on our sample we can say, with 95% confidence and a 5% margin of error, that only about 47% of the Census FTP URLs have a snapshot in the Wayback Machine. This is a bit of a surprise.
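
To make the margin of error concrete, the interval around the observed proportion can be computed directly. This is a rough sketch using the normal-approximation (Wald) interval; the count of 180 archived URLs is derived from the reported fraction (0.4675... × 385), and the function name `wald_interval` is hypothetical.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# 180 of the 385 sampled URLs had at least one snapshot (derived from the output above).
low, high = wald_interval(180, 385)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 42% to 52%, consistent with the 5% margin used to choose the sample size.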

## Stratified Sampling

The full set of files is not equally distributed across directories, and in some cases the directories correspond to specific Census datasets. Since we sampled from all of them, missing files in one large directory may be unduly influencing the results.

The [Big Local News](https://biglocalnews.org/content/about/) project at Stanford has expressed interest in archiving specific directories, so it is helpful to be able to analyze these specifically to ascertain how important it is to archive them. 

* programs-surveys/acs/summary_file/ 
* programs-surveys/decennial/ 
* geo/tiger/TIGER2024/ 
* geo/tiger/TIGER2023/ 
* geo/tiger/TIGER2022/

In [16]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/programs-surveys/acs/summary_file')])

415122

In [17]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/programs-surveys/decennial/')])

20653

In [18]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2024/')])

33224

In [19]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2023/')])

33165

In [20]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2022/')])

33163
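
The five counts above repeat the same `startswith` test, so they could also be gathered in one helper. This is a small sketch on a made-up list of paths; the helper name `count_by_prefix` and the demo paths are hypothetical, while the prefixes are the ones used in the cells above.

```python
PREFIXES = [
    "ftp2.census.gov/programs-surveys/acs/summary_file",
    "ftp2.census.gov/programs-surveys/decennial/",
    "ftp2.census.gov/geo/tiger/TIGER2024/",
    "ftp2.census.gov/geo/tiger/TIGER2023/",
    "ftp2.census.gov/geo/tiger/TIGER2022/",
]


def count_by_prefix(urls, prefixes):
    """Return {prefix: number of URLs that start with that prefix}."""
    urls = list(urls)  # allow any iterable, including a pandas Series
    return {prefix: sum(url.startswith(prefix) for url in urls) for prefix in prefixes}


# Hypothetical paths, for illustration only.
demo_urls = [
    "ftp2.census.gov/geo/tiger/TIGER2024/ROADS/a.zip",
    "ftp2.census.gov/geo/tiger/TIGER2024/EDGES/b.zip",
    "ftp2.census.gov/programs-surveys/decennial/c.pdf",
    "ftp2.census.gov/econ/d.xlsx",
]
for prefix, count in count_by_prefix(demo_urls, PREFIXES).items():
    print(prefix, count)
```

In the notebook it could be called as `count_by_prefix(df.ftp_url, PREFIXES)`. Note that the first prefix has no trailing slash, matching the cell above, so it would also match any sibling path that merely begins with `summary_file`.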

To speed things up with these sample tests it's helpful to create a function that will package up the logic we did above. A sample size of 385 will still work with these population sizes to give us the same level of confidence and margin of error.

In [21]:
def make_sample(df, n=385):
    sample = df.sample(n)
    sample['archived'] = sample.web_url.apply(snapshots)
    sample['num_snapshots'] = sample['archived'].apply(len)
    return sample

Now the different samples for each subset can be created. This will take some time to run.

In [22]:
summary_file = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/programs-surveys/acs/summary_file')])

https://www2.census.gov/programs-surveys/acs/summary_file/2015/data/5_year_seq_by_state/NewYork/All_Geographies_Not_Tracts_Block_Groups/20155ny0066000.zip 0
https://www2.census.gov/programs-surveys/acs/summary_file/2015/data/5_year_seq_by_state/Wyoming/Tracts_Block_Groups_Only/20155wy0016000.zip 0
https://www2.census.gov/programs-surveys/acs/summary_file/2015/data/1_year_seq_by_state/Oregon/20151or0014000.zip 0
https://www2.census.gov/programs-surveys/acs/summary_file/2022/table-based-SF/data/1YRData/acsdt1y2022-b11006.dat 0
https://www2.census.gov/programs-surveys/acs/summary_file/2011/data/5_year_seq_by_state/Kentucky/All_Geographies_Not_Tracts_Block_Groups/20115ky0112000.zip 0
https://www2.census.gov/programs-surveys/acs/summary_file/2018/data/1_year_seq_by_state/Nevada/20181nv0097000.zip 3
https://www2.census.gov/programs-surveys/acs/summary_file/2012/data/5_year_seq_by_state/WestVirginia/Tracts_Block_Groups_Only/20125wv0080000.zip 0
https://www2.census.gov/programs-surveys/acs/sum

In [23]:
decennial = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/programs-surveys/decennial/')])

https://www2.census.gov/programs-surveys/decennial/coverage-measurement/stage-coverage_measurement/pdfs/maryland/hucou24013.pdf 0
https://www2.census.gov/programs-surveys/decennial/2020/program-management/pmr-materials/2018-01-26/2-welcome-high-level-updates.pdf 4
https://www2.census.gov/programs-surveys/decennial/2000/technical-documentation/complete-tech-docs/summary-files/congressional-districts/cd109h.pdf 3
https://www2.census.gov/programs-surveys/decennial/2000/phc/phc-t-39/tab02-hi-02.csv 0
https://www2.census.gov/programs-surveys/decennial/tables/1990/1990-census-disability/tab1ind.txt 0
https://www2.census.gov/programs-surveys/decennial/2020/resources/language-materials/templates/instructions-language-guide-template.pdf 4
https://www2.census.gov/programs-surveys/decennial/coverage-measurement/tables/2010/ccm-results-south-carolina/hucou45077.pdf 3
https://www2.census.gov/programs-surveys/decennial/2000/phc/phc-t-39/tab03-ma-02.xls 1
https://www2.census.gov/programs-surveys/dece

In [24]:
tiger2024 = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2024/')])

https://www2.census.gov/geo/tiger/TIGER2024/ADDRFN/tl_2024_25015_addrfn.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/ADDRFEAT/tl_2024_54085_addrfeat.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/ROADS/tl_2024_29161_roads.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/AREAWATER/tl_2024_25011_areawater.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/ROADS/tl_2024_29041_roads.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/EDGES/tl_2024_05057_edges.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/FACESAH/tl_2024_21103_facesah.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/FACESAH/tl_2024_28145_facesah.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/ADDRFEAT/tl_2024_17163_addrfeat.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/FEATNAMES/tl_2024_29079_featnames.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/ADDR/tl_2024_35039_addr.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/LINEARWATER/tl_2024_33001_linearwater.zip 2
https://www2.census.gov/geo/tiger/TIGER2

In [25]:
tiger2023 = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2023/')])

https://www2.census.gov/geo/tiger/TIGER2023/FACES/tl_2023_55009_faces.zip 3
https://www2.census.gov/geo/tiger/TIGER2023/LINEARWATER/tl_2023_08027_linearwater.zip 2
https://www2.census.gov/geo/tiger/TIGER2023/EDGES/tl_2023_72141_edges.zip 1
https://www2.census.gov/geo/tiger/TIGER2023/ADDR/tl_2023_18061_addr.zip 1
https://www2.census.gov/geo/tiger/TIGER2023/EDGES/tl_2023_05133_edges.zip 2
https://www2.census.gov/geo/tiger/TIGER2023/EDGES/tl_2023_39103_edges.zip 1
https://www2.census.gov/geo/tiger/TIGER2023/FACESAH/tl_2023_35028_facesah.zip 1
https://www2.census.gov/geo/tiger/TIGER2023/FACES/tl_2023_29129_faces.zip 2
https://www2.census.gov/geo/tiger/TIGER2023/ADDRFEAT/tl_2023_36053_addrfeat.zip 0
https://www2.census.gov/geo/tiger/TIGER2023/FACES/tl_2023_01091_faces.zip 1
https://www2.census.gov/geo/tiger/TIGER2023/FACESAH/tl_2023_56033_facesah.zip 1
https://www2.census.gov/geo/tiger/TIGER2023/FEATNAMES/tl_2023_48073_featnames.zip 1
https://www2.census.gov/geo/tiger/TIGER2023/AREAWATER/tl

In [26]:
tiger2022 = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2022/')])

https://www2.census.gov/geo/tiger/TIGER2022/LINEARWATER/tl_2022_08011_linearwater.zip 2
https://www2.census.gov/geo/tiger/TIGER2022/EDGES/tl_2022_19057_edges.zip 3
https://www2.census.gov/geo/tiger/TIGER2022/EDGES/tl_2022_08075_edges.zip 2
https://www2.census.gov/geo/tiger/TIGER2022/ROADS/tl_2022_19027_roads.zip 3
https://www2.census.gov/geo/tiger/TIGER2022/AREAWATER/tl_2022_01093_areawater.zip 3
https://www2.census.gov/geo/tiger/TIGER2022/FEATNAMES/tl_2022_72145_featnames.zip 2
https://www2.census.gov/geo/tiger/TIGER2022/AREAWATER/tl_2022_28089_areawater.zip 2
https://www2.census.gov/geo/tiger/TIGER2022/ROADS/tl_2022_16067_roads.zip 1
https://www2.census.gov/geo/tiger/TIGER2022/EDGES/tl_2022_48257_edges.zip 3
https://www2.census.gov/geo/tiger/TIGER2022/ADDRFEAT/tl_2022_16085_addrfeat.zip 2
https://www2.census.gov/geo/tiger/TIGER2022/ADDRFN/tl_2022_72149_addrfn.zip 2
https://www2.census.gov/geo/tiger/TIGER2022/ROADS/tl_2022_53041_roads.zip 3
https://www2.census.gov/geo/tiger/TIGER2022/

The subset size in total relative to all the URLs:

In [27]:
(415122 + 20653 + 33224 + 33165 + 33163) / len(df)

0.11904433104707696

So what do the percentages look like for these subsets? Each cell below computes the fraction of the sample with no snapshots at all. Let's also save the samples, since they took some time to collect.

In [39]:
summary_file.to_csv('data/census-summary-file.csv', index=False)
len(summary_file[summary_file.num_snapshots == 0]) / 385

0.7714285714285715

In [40]:
decennial.to_csv('data/census-decennial.csv', index=False)
len(decennial[decennial.num_snapshots == 0]) / 385

0.37402597402597404

In [41]:
tiger2024.to_csv('data/census-tiger2024.csv', index=False)
len(tiger2024[tiger2024.num_snapshots == 0]) / 385

0.6051948051948052

In [42]:
tiger2023.to_csv('data/census-tiger2023.csv', index=False)
len(tiger2023[tiger2023.num_snapshots == 0]) / 385

0.14025974025974025

In [43]:
tiger2022.to_csv('data/census-tiger2022.csv', index=False)
len(tiger2022[tiger2022.num_snapshots == 0]) / 385

0.007792207792207792
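
Because the cells above report the share of each sample with no snapshots, it can help to flip them into archived percentages and view them side by side. The sketch below simply copies the fractions printed above, rounded to four places; the dictionary name `unarchived` is hypothetical.

```python
# Fraction of each 385-URL sample with zero snapshots, copied from the outputs above.
unarchived = {
    "summary_file": 0.7714,
    "decennial": 0.3740,
    "tiger2024": 0.6052,
    "tiger2023": 0.1403,
    "tiger2022": 0.0078,
}

# The archived share is the complement of the unarchived share.
archived = {name: 1 - fraction for name, fraction in unarchived.items()}

for name, share in archived.items():
    print(f"{name:<13} {share:6.1%} archived")
```

Each of these shares carries the same sampling uncertainty as the overall estimate, so small differences between subsets should be read with care.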