# Census FTP Archival Coverage

This notebook uses a listing of files from `ftp2.census.gov` to determine how many of them were archived by the Wayback Machine at the FTP site's web "frontend", `www2.census.gov`. The list was created by Andrew Berger using lftp to copy all the files from ftp2.census.gov, and then using `ls` to list the files.

## Get a DataFrame

Create a DataFrame from the windows-1252 encoded list of files (maybe it's a Windows filesystem behind the FTP server):

In [1]:
import gzip
import pandas

urls = [url.strip().decode("windows-1252") for url in gzip.open('data/all-downloaded-census-2025-03-29.txt.gz')]
df = pandas.DataFrame({"ftp_url": urls})
df

Unnamed: 0,ftp_url
0,ftp2.census.gov/econ/esp/2012/esp2012_table7.xlsx
1,ftp2.census.gov/econ/esp/2012/Table 4.xlsx
2,ftp2.census.gov/econ/esp/2012/esp2012_table6.xlsx
3,ftp2.census.gov/econ/esp/2012/Table 6.xlsx
4,ftp2.census.gov/econ/esp/2012/esp2012_table8.xlsx
...,...
4496866,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...
4496867,ftp2.census.gov/acs2011_1yr/summaryfile/ACS201...
4496868,ftp2.census.gov/acs2011_1yr/summaryfile/ACS_20...
4496869,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...


Use the FTP URL to create the respective Web URL:

In [2]:
import re

df['web_url'] = df.ftp_url.apply(lambda s: re.sub('^ftp2', 'https://www2', s))
df

Unnamed: 0,ftp_url,web_url
0,ftp2.census.gov/econ/esp/2012/esp2012_table7.xlsx,https://www2.census.gov/econ/esp/2012/esp2012_...
1,ftp2.census.gov/econ/esp/2012/Table 4.xlsx,https://www2.census.gov/econ/esp/2012/Table 4....
2,ftp2.census.gov/econ/esp/2012/esp2012_table6.xlsx,https://www2.census.gov/econ/esp/2012/esp2012_...
3,ftp2.census.gov/econ/esp/2012/Table 6.xlsx,https://www2.census.gov/econ/esp/2012/Table 6....
4,ftp2.census.gov/econ/esp/2012/esp2012_table8.xlsx,https://www2.census.gov/econ/esp/2012/esp2012_...
...,...,...
4496866,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...,https://www2.census.gov/acs2011_1yr/summaryfil...
4496867,ftp2.census.gov/acs2011_1yr/summaryfile/ACS201...,https://www2.census.gov/acs2011_1yr/summaryfil...
4496868,ftp2.census.gov/acs2011_1yr/summaryfile/ACS_20...,https://www2.census.gov/acs2011_1yr/summaryfil...
4496869,ftp2.census.gov/acs2011_1yr/summaryfile/Sequen...,https://www2.census.gov/acs2011_1yr/summaryfil...


## Lookup in Wayback

To look up a URL in the Wayback Machine we can use the [wayback](https://wayback.readthedocs.io/en/stable/index.html) module which talks to the Wayback Machine's [CDX API](https://archive.org/developers/wayback-cdx-server.html).

In [3]:
import wayback

wb = wayback.WaybackClient()

In [4]:
df.web_url[0]

'https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx'

In [5]:
for result in wb.search('https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx'):
    print(result)

CdxRecord(key='gov,census)/econ/esp/2012/esp2012_table7.xlsx', timestamp=datetime.datetime(2017, 7, 22, 6, 4, 38, tzinfo=datetime.timezone.utc), url='http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx', mime_type='unk', status_code=302, digest='3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ', length=418, raw_url='https://web.archive.org/web/20170722060438id_/http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx', view_url='https://web.archive.org/web/20170722060438/http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx')
CdxRecord(key='gov,census)/econ/esp/2012/esp2012_table7.xlsx', timestamp=datetime.datetime(2017, 8, 17, 11, 48, 24, tzinfo=datetime.timezone.utc), url='https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx', mime_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', status_code=200, digest='6Y6373IAFSDMMIOTXDXHGACICLMA2Y4J', length=30557, raw_url='https://web.archive.org/web/20170817114824id_/https://www2.census.gov/econ/esp/2012/esp2012_table7.xls

This shows that there were seven snapshots taken of this URL. However, it's important to look at the `status_code` for each snapshot. The first one is a `302` redirect to a page that was not archived.

* https://web.archive.org/web/20170722060438id_/http://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx

The 5th and 6th snapshots are `403` access denied errors.

* https://web.archive.org/web/20241215202549id_/https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx

So we need a function that will return the timestamps for all 200 OK responses, or an empty list in the case of the URL not being archived in the Wayback Machine. I'm going to have it cache results in case we happen to look up the same URL more than once. It is helpful to catch exceptions from the underlying requests module, since the CDX API can sometimes be erratic and close connections.

In [6]:
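import time
from functools import cache

import requests

@cache
def snapshots(url):
    # flagging our wayback client as global lets us recreate it if we hit an exception
    global wb
    
    tries = 0
    while tries < 20:
        try:
            tries += 1
            time.sleep(0.5)
            # keep only the timestamps of snapshots that were 200 OK responses
            results = [result.timestamp for result in wb.search(url) if result.status_code == 200]
            print(url, len(results))
            return results
        except requests.exceptions.RequestException as e:
            print(f'caught exception {e} on try {tries}')
            wb = wayback.WaybackClient()
            # back off a little longer after each failed try
            time.sleep(1 * tries)

    # fail loudly rather than return (and cache) None, which would break len() downstream
    raise RuntimeError(f'giving up on {url} after {tries} tries')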

In [7]:
snapshots('https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx')

https://www2.census.gov/econ/esp/2012/esp2012_table7.xlsx 3


[datetime.datetime(2017, 8, 17, 11, 48, 24, tzinfo=datetime.timezone.utc),
 datetime.datetime(2021, 3, 26, 18, 7, 8, tzinfo=datetime.timezone.utc),
 datetime.datetime(2025, 3, 24, 6, 42, 7, tzinfo=datetime.timezone.utc)]

In [8]:
snapshots('https://www2.census.gov/econ/esp/2012/esp2012_table7-bogus.xlsx')

https://www2.census.gov/econ/esp/2012/esp2012_table7-bogus.xlsx 0


[]

## Sample the Data

There are 4,496,871 URLs. If we checked one per second, one at a time, it would take 52 days. That estimate is optimistic because some requests take multiple seconds. The requests could be run in parallel, but that would put strain on the CDX API and we might get temporarily blocked by the Internet Archive. So in this notebook we're going to sample the URLs to get a sense of the coverage, rather than looking at all of them.
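
The 52-day figure follows directly from the URL count. A quick check, assuming exactly one second per lookup and no retries (the names below are illustrative and not part of the notebook):

```python
TOTAL_URLS = 4_496_871        # number of URLs in the listing
SECONDS_PER_LOOKUP = 1.0      # assumed average; real lookups are often slower
SECONDS_PER_DAY = 24 * 60 * 60

def sequential_days(n_urls, seconds_each):
    """Days needed to check n_urls one at a time."""
    return n_urls * seconds_each / SECONDS_PER_DAY

days = sequential_days(TOTAL_URLS, SECONDS_PER_LOOKUP)
print(f"{days:.1f} days")
```

Doubling `SECONDS_PER_LOOKUP` doubles the estimate, which is why sampling is attractive here.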

According to [this sample size calculator](https://www.calculator.net/sample-size-calculator.html?type=1&cl=95&ci=5&pp=50&ps=4496871&x=Calculate), if we want 95% confidence with a 5% margin of error we can randomly sample 385 URLs out of the 4,496,871. This should be good enough to get a sense of the coverage to start.
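
The 385 figure can be reproduced by hand. The sketch below assumes the calculator uses the standard sample size formula for a proportion, n = z² · p(1 − p) / e², with z = 1.96 for 95% confidence and the most conservative guess p = 0.5. The function name is made up for illustration.

```python
import math

def proportion_sample_size(z=1.96, margin=0.05, p=0.5):
    """Sample size for estimating a proportion in a very large population."""
    return z**2 * p * (1 - p) / margin**2

n0 = proportion_sample_size()
n = math.ceil(n0)   # round up, since we cannot sample a fraction of a URL
print(n0, n)
```

With millions of URLs the population is effectively infinite for this purpose, so no finite population adjustment changes the rounded result.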

In [9]:
sample = df.sample(385)
sample

Unnamed: 0,ftp_url,web_url
635923,ftp2.census.gov/geo/tiger/TIGER2016/AREAWATER/...,https://www2.census.gov/geo/tiger/TIGER2016/AR...
2777086,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...
3123666,ftp2.census.gov/programs-surveys/acs/data/eeo_...,https://www2.census.gov/programs-surveys/acs/d...
2747981,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...
1713886,ftp2.census.gov/geo/maps/dc10map/GUBlock/st56_...,https://www2.census.gov/geo/maps/dc10map/GUBlo...
...,...,...
2092407,ftp2.census.gov/geo/maps/blk2000/st48_Texas/Co...,https://www2.census.gov/geo/maps/blk2000/st48_...
3281860,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...
3638931,ftp2.census.gov/programs-surveys/acs/summary_f...,https://www2.census.gov/programs-surveys/acs/s...
190622,ftp2.census.gov/geo/tiger/TIGER2007FE/19_IOWA/...,https://www2.census.gov/geo/tiger/TIGER2007FE/...


And now we can apply our function to it to create a new column holding the archive timestamps if available:

In [10]:
sample['archived'] = sample.web_url.apply(snapshots)
sample

https://www2.census.gov/geo/tiger/TIGER2016/AREAWATER/tl_2016_55107_areawater.zip 5
https://www2.census.gov/census_2000/datasets/Summary_File_4/North_Carolina/nc55336_uf4.zip 1
https://www2.census.gov/programs-surveys/acs/data/eeo_tabulation/EEO_2014_2018/EEO_Tables_By_All_Areas/JSON-Files/CIT6R/050/eeocit6r_05000us21155_databytype.json 0
https://www2.census.gov/census_2000/datasets/Summary_File_4/Missouri/mo54920_uf4.zip 0
https://www2.census.gov/geo/maps/dc10map/GUBlock/st56_wy/county/c56037_sweetwater/DC10BLK_C56037_014.pdf 0
https://www2.census.gov/econ2007/Reference_materials/htm files/33441100.htm 3
https://www2.census.gov/geo/tiger/TIGER2022/AREAWATER/tl_2022_38001_areawater.zip 2
https://www2.census.gov/geo/tiger/TIGER2008/46_SOUTH_DAKOTA/46091_Marshall_County/tl_2008_46091_facesal.zip 0
https://www2.census.gov/census_2000/datasets/Summary_File_4/Wisconsin/wi58215_uf4.zip 1
https://www2.census.gov/geo/tiger/TIGER2007FE/21_KENTUCKY/21121_Knox/fe_2007_21121_tabblock00.zip 0
https

Unnamed: 0,ftp_url,web_url,archived
635923,ftp2.census.gov/geo/tiger/TIGER2016/AREAWATER/...,https://www2.census.gov/geo/tiger/TIGER2016/AR...,"[2020-10-21 10:23:23+00:00, 2021-05-11 16:30:3..."
2777086,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[2025-04-10 20:28:31+00:00]
3123666,ftp2.census.gov/programs-surveys/acs/data/eeo_...,https://www2.census.gov/programs-surveys/acs/d...,[]
2747981,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[]
1713886,ftp2.census.gov/geo/maps/dc10map/GUBlock/st56_...,https://www2.census.gov/geo/maps/dc10map/GUBlo...,[]
...,...,...,...
2092407,ftp2.census.gov/geo/maps/blk2000/st48_Texas/Co...,https://www2.census.gov/geo/maps/blk2000/st48_...,[]
3281860,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...,[]
3638931,ftp2.census.gov/programs-surveys/acs/summary_f...,https://www2.census.gov/programs-surveys/acs/s...,[]
190622,ftp2.census.gov/geo/tiger/TIGER2007FE/19_IOWA/...,https://www2.census.gov/geo/tiger/TIGER2007FE/...,[]


In [11]:
sample

Unnamed: 0,ftp_url,web_url,archived
635923,ftp2.census.gov/geo/tiger/TIGER2016/AREAWATER/...,https://www2.census.gov/geo/tiger/TIGER2016/AR...,"[2020-10-21 10:23:23+00:00, 2021-05-11 16:30:3..."
2777086,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[2025-04-10 20:28:31+00:00]
3123666,ftp2.census.gov/programs-surveys/acs/data/eeo_...,https://www2.census.gov/programs-surveys/acs/d...,[]
2747981,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[]
1713886,ftp2.census.gov/geo/maps/dc10map/GUBlock/st56_...,https://www2.census.gov/geo/maps/dc10map/GUBlo...,[]
...,...,...,...
2092407,ftp2.census.gov/geo/maps/blk2000/st48_Texas/Co...,https://www2.census.gov/geo/maps/blk2000/st48_...,[]
3281860,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...,[]
3638931,ftp2.census.gov/programs-surveys/acs/summary_f...,https://www2.census.gov/programs-surveys/acs/s...,[]
190622,ftp2.census.gov/geo/tiger/TIGER2007FE/19_IOWA/...,https://www2.census.gov/geo/tiger/TIGER2007FE/...,[]


Create a column to make it easy to see the counts of snapshots per URL:

In [12]:
sample['num_snapshots'] = sample['archived'].apply(len)
sample

Unnamed: 0,ftp_url,web_url,archived,num_snapshots
635923,ftp2.census.gov/geo/tiger/TIGER2016/AREAWATER/...,https://www2.census.gov/geo/tiger/TIGER2016/AR...,"[2020-10-21 10:23:23+00:00, 2021-05-11 16:30:3...",5
2777086,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[2025-04-10 20:28:31+00:00],1
3123666,ftp2.census.gov/programs-surveys/acs/data/eeo_...,https://www2.census.gov/programs-surveys/acs/d...,[],0
2747981,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[],0
1713886,ftp2.census.gov/geo/maps/dc10map/GUBlock/st56_...,https://www2.census.gov/geo/maps/dc10map/GUBlo...,[],0
...,...,...,...,...
2092407,ftp2.census.gov/geo/maps/blk2000/st48_Texas/Co...,https://www2.census.gov/geo/maps/blk2000/st48_...,[],0
3281860,ftp2.census.gov/programs-surveys/acs/data/arch...,https://www2.census.gov/programs-surveys/acs/d...,[],0
3638931,ftp2.census.gov/programs-surveys/acs/summary_f...,https://www2.census.gov/programs-surveys/acs/s...,[],0
190622,ftp2.census.gov/geo/tiger/TIGER2007FE/19_IOWA/...,https://www2.census.gov/geo/tiger/TIGER2007FE/...,[],0


Despite the truncated view above, there are some with snapshots:

In [13]:
sample[sample.num_snapshots > 0]

Unnamed: 0,ftp_url,web_url,archived,num_snapshots
635923,ftp2.census.gov/geo/tiger/TIGER2016/AREAWATER/...,https://www2.census.gov/geo/tiger/TIGER2016/AR...,"[2020-10-21 10:23:23+00:00, 2021-05-11 16:30:3...",5
2777086,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[2025-04-10 20:28:31+00:00],1
4296840,ftp2.census.gov/econ2007/Reference_materials/h...,https://www2.census.gov/econ2007/Reference_mat...,"[2021-04-01 16:20:21+00:00, 2024-12-04 08:41:1...",3
778119,ftp2.census.gov/geo/tiger/TIGER2022/AREAWATER/...,https://www2.census.gov/geo/tiger/TIGER2022/AR...,"[2022-11-30 16:37:40+00:00, 2025-03-22 14:42:4...",2
2765753,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[2025-04-03 12:51:56+00:00],1
...,...,...,...,...
1225841,ftp2.census.gov/geo/pvs/53/partnership_shapefi...,https://www2.census.gov/geo/pvs/53/partnership...,[2025-04-12 07:29:42+00:00],1
3776022,ftp2.census.gov/programs-surveys/acs/summary_f...,https://www2.census.gov/programs-surveys/acs/s...,"[2023-12-11 22:39:38+00:00, 2025-04-09 07:34:4...",2
2614625,ftp2.census.gov/census_2000/datasets/Summary_F...,https://www2.census.gov/census_2000/datasets/S...,[2021-04-15 09:31:16+00:00],1
1094764,ftp2.census.gov/geo/tiger/TIGER2020/LINEARWATE...,https://www2.census.gov/geo/tiger/TIGER2020/LI...,[2021-04-10 16:43:49+00:00],1


Let's double-check a few that have zero snapshots by looking at them manually...

In [14]:
pandas.set_option('max_colwidth', 0)

sample[sample['num_snapshots'] == 0][0:10].web_url

3123666    https://www2.census.gov/programs-surveys/acs/data/eeo_tabulation/EEO_2014_2018/EEO_Tables_By_All_Areas/JSON-Files/CIT6R/050/eeocit6r_05000us21155_databytype.json
2747981    https://www2.census.gov/census_2000/datasets/Summary_File_4/Missouri/mo54920_uf4.zip                                                                             
1713886    https://www2.census.gov/geo/maps/dc10map/GUBlock/st56_wy/county/c56037_sweetwater/DC10BLK_C56037_014.pdf                                                         
405747     https://www2.census.gov/geo/tiger/TIGER2008/46_SOUTH_DAKOTA/46091_Marshall_County/tl_2008_46091_facesal.zip                                                      
172984     https://www2.census.gov/geo/tiger/TIGER2007FE/21_KENTUCKY/21121_Knox/fe_2007_21121_tabblock00.zip                                                                
432295     https://www2.census.gov/geo/tiger/TIGER2008/35_NEW_MEXICO/35039_Rio_Arriba_County/tl_2008_35039_addrfn.zip                  

So https://www2.census.gov/programs-surveys/cps/tables/archive/decommissioned-after-2020/pov-03/2007/pov03_185_5.xls is available on the web, but it does appear to be missing from the Wayback Machine: https://web.archive.org/web/20250000000000*/https://www2.census.gov/programs-surveys/cps/tables/archive/decommissioned-after-2020/pov-03/2007/pov03_185_5.xls

https://www2.census.gov/geo/maps/dc10map/GUBlock/st17_il/cousub/cs1709522879_elba/DC10BLK_CS1709522879_BLK2MS.txt is available on the web, and appears to be in the Wayback Machine: https://web.archive.org/web/20250000000000*/https://www2.census.gov/geo/maps/dc10map/GUBlock/st17_il/cousub/cs1709522879_elba/DC10BLK_CS1709522879_BLK2MS.txt. However, what has been archived is actually a 302 redirect to something that has not been archived.

So it appears that the lookup is working correctly...

In [15]:
sample.to_csv('data/census-sample.csv', index=False)

## Initial Findings

What percentage of our sample have snapshots?

In [16]:
len(sample[sample.num_snapshots > 0]) / len(sample)

0.5428571428571428

So, based on our sample, we can say with 95% confidence and a 5% margin of error that only about 54% of the Census FTP URLs have a snapshot in the Wayback Machine. This is a bit of a surprise.
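
To see what the margin of error means for this particular result, here is a minimal sketch using the normal approximation for a proportion. The count of 209 is not printed in the notebook; it is inferred from 0.5428571428571428 × 385, and the function name is illustrative.

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Point estimate and normal-approximation interval for a proportion."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, p - margin, p + margin

# 209 of the 385 sampled URLs had at least one 200 OK snapshot (inferred count)
p, low, high = proportion_interval(209, 385)
print(f"{p:.3f} (95% interval {low:.3f} to {high:.3f})")
```

Under these assumptions the interval spans roughly 49% to 59%, so even the optimistic end leaves a large share of files without a usable snapshot.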

## Stratified Sampling

The full set of files is not evenly distributed across directories, and in some cases the directories correspond to specific Census datasets. Since we have sampled from all of them, missing files in one large directory may be overly influencing the results.
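
One way to see how uneven the distribution is would be to count files by their first path segment. This is a sketch on a few toy URLs with hypothetical file names, not the real listing, and `top_level_counts` is an illustrative helper rather than something the notebook defines.

```python
from collections import Counter

def top_level_counts(ftp_urls):
    """Count entries per first path segment after the hostname."""
    return Counter(url.split('/')[1] for url in ftp_urls if '/' in url)

# toy data: hypothetical file names under real top-level directories
toy_urls = [
    "ftp2.census.gov/geo/tiger/TIGER2022/ADDR/a.zip",
    "ftp2.census.gov/geo/maps/b.pdf",
    "ftp2.census.gov/econ/esp/2012/c.xlsx",
]
counts = top_level_counts(toy_urls)
print(counts.most_common())
```

Because iterating over a pandas Series yields its values, the same helper could be called as `top_level_counts(df.ftp_url)` on the DataFrame built earlier.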

The [Big Local News](https://biglocalnews.org/content/about/) project at Stanford has expressed interest in archiving specific directories, so it is helpful to be able to analyze these specifically to ascertain how important it is to archive them. 

* programs-surveys/acs/summary_file/ 
* programs-surveys/decennial/ 
* geo/tiger/TIGER2024/ 
* geo/tiger/TIGER2023/ 
* geo/tiger/TIGER2022/

In [17]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/programs-surveys/acs/summary_file')])

415122

In [18]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/programs-surveys/decennial/')])

20653

In [19]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2024/')])

33224

In [20]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2023/')])

33165

In [21]:
len(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2022/')])

33163

To speed things up with these sample tests it's helpful to create a function that will package up the logic we did above. A sample size of 385 will still work with these population sizes to give us the same level of confidence and margin of error.

In [22]:
def make_sample(df, n=385):
    sample = df.sample(n)
    sample['archived'] = sample.web_url.apply(snapshots)
    sample['num_snapshots'] = sample['archived'].apply(len)
    return sample
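
The claim that 385 is still enough for these smaller populations can be checked with a finite population correction. This sketch assumes the usual correction n = n0 / (1 + (n0 − 1) / N) applied to the standard proportion formula; the helper name and the dictionary labels are illustrative, while the population sizes are the counts computed above.

```python
import math

def corrected_sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Sample size for a proportion, adjusted for a finite population."""
    n0 = z**2 * p * (1 - p) / margin**2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# population sizes taken from the counts above
populations = {
    "summary_file": 415122,
    "decennial": 20653,
    "tiger2024": 33224,
    "tiger2023": 33165,
    "tiger2022": 33163,
}
required = {name: corrected_sample_size(size) for name, size in populations.items()}
print(required)
```

Every subset needs slightly fewer than 385 URLs under this formula, so reusing 385 errs on the safe side.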

Now the different samples for each subset can be created. This will take some time to run.

In [23]:
summary_file = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/programs-surveys/acs/summary_file')])

https://www2.census.gov/programs-surveys/acs/summary_file/2012/data/1_year_seq_by_state/Florida/20121fl0129000.zip 0
https://www2.census.gov/programs-surveys/acs/summary_file/2020/data/5_year_seq_by_state/Wyoming/All_Geographies_Not_Tracts_Block_Groups/20205wy0066000.zip 0
https://www2.census.gov/programs-surveys/acs/summary_file/2020/data/5_year_seq_by_state/Connecticut/Tracts_Block_Groups_Only/20205ct0139000.zip 0
https://www2.census.gov/programs-surveys/acs/summary_file/2015/data/5_year_seq_by_state/NewMexico/All_Geographies_Not_Tracts_Block_Groups/20155nm0020000.zip 0
https://www2.census.gov/programs-surveys/acs/summary_file/2019/data/1_year_seq_by_state/WestVirginia/20191wv0108000.zip 1
https://www2.census.gov/programs-surveys/acs/summary_file/2009/data/1_year_seq_by_state/IL/20091il0033000.zip 0
https://www2.census.gov/programs-surveys/acs/summary_file/2009/data/5_year_seq_by_state/Wyoming/Tracts_Block_Groups_Only/20095wy0002000.zip 0
https://www2.census.gov/programs-surveys/acs/

In [24]:
decennial = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/programs-surveys/decennial/')])

https://www2.census.gov/programs-surveys/decennial/coverage-measurement/stage-coverage_measurement/pdfs/california/huplc0682590.pdf 0
https://www2.census.gov/programs-surveys/decennial/coverage-measurement/tables/2010/ccm-results-nebraska/cou31109.pdf 2
https://www2.census.gov/programs-surveys/decennial/tables/2000/county-to-county-worker-flow-files/2kresco_mi.zip 0
https://www2.census.gov/programs-surveys/decennial/2000/program-management/2-count/statistics-in-schools/esl-adult-literacy/esladultint.pdf 0
https://www2.census.gov/programs-surveys/decennial/2000/phc/phc-t-39/tab03-nc-02.xls 1
https://www2.census.gov/programs-surveys/decennial/coverage-measurement/stage-coverage_measurement/pdfs/florida/hucou12019.pdf 0
https://www2.census.gov/programs-surveys/decennial/2010/program-management/5-review/cqr/cqr-submission-guidelines-final-rev8-1-11.pdf 5
https://www2.census.gov/programs-surveys/decennial/2000/phc/phc-t-39/tab07-mn-03.csv 2
https://www2.census.gov/programs-surveys/decennial

In [25]:
tiger2024 = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2024/')])

https://www2.census.gov/geo/tiger/TIGER2024/ADDR/tl_2024_51081_addr.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/LINEARWATER/tl_2024_01021_linearwater.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/FACES/tl_2024_24029_faces.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/EDGES/tl_2024_45067_edges.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/ROADS/tl_2024_37193_roads.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/EDGES/tl_2024_39113_edges.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/ADDR/tl_2024_45047_addr.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/EDGES/tl_2024_46043_edges.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/ADDR/tl_2024_27025_addr.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/ADDRFN/tl_2024_47075_addrfn.zip 0
https://www2.census.gov/geo/tiger/TIGER2024/FEATNAMES/tl_2024_48141_featnames.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/ADDRFN/tl_2024_42035_addrfn.zip 1
https://www2.census.gov/geo/tiger/TIGER2024/ADDRFN/tl_2024_48023_addrf

In [26]:
tiger2023 = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2023/')])

https://www2.census.gov/geo/tiger/TIGER2023/ADDR/tl_2023_13081_addr.zip 2
https://www2.census.gov/geo/tiger/TIGER2023/ADDRFEAT/tl_2023_21209_addrfeat.zip 3
https://www2.census.gov/geo/tiger/TIGER2023/FEATNAMES/tl_2023_13127_featnames.zip 2
https://www2.census.gov/geo/tiger/TIGER2023/FACES/tl_2023_12015_faces.zip 1
https://www2.census.gov/geo/tiger/TIGER2023/FEATNAMES/tl_2023_13209_featnames.zip 1
https://www2.census.gov/geo/tiger/TIGER2023/ROADS/tl_2023_06057_roads.zip 3
https://www2.census.gov/geo/tiger/TIGER2023/LINEARWATER/tl_2023_48285_linearwater.zip 1
https://www2.census.gov/geo/tiger/TIGER2023/FEATNAMES/tl_2023_40135_featnames.zip 2
https://www2.census.gov/geo/tiger/TIGER2023/EDGES/tl_2023_06081_edges.zip 2
https://www2.census.gov/geo/tiger/TIGER2023/LINEARWATER/tl_2023_13063_linearwater.zip 3
https://www2.census.gov/geo/tiger/TIGER2023/ADDRFN/tl_2023_42065_addrfn.zip 0
https://www2.census.gov/geo/tiger/TIGER2023/ADDRFEAT/tl_2023_19175_addrfeat.zip 2
https://www2.census.gov/geo/

In [27]:
tiger2022 = make_sample(df[df.ftp_url.str.startswith('ftp2.census.gov/geo/tiger/TIGER2022/')])

https://www2.census.gov/geo/tiger/TIGER2022/FACESAH/tl_2022_19111_facesah.zip 3
https://www2.census.gov/geo/tiger/TIGER2022/ADDR/tl_2022_18023_addr.zip 1
https://www2.census.gov/geo/tiger/TIGER2022/ADDR/tl_2022_12079_addr.zip 3
https://www2.census.gov/geo/tiger/TIGER2022/ADDRFN/tl_2022_08021_addrfn.zip 1
https://www2.census.gov/geo/tiger/TIGER2022/ADDR/tl_2022_29153_addr.zip 2
https://www2.census.gov/geo/tiger/TIGER2022/ADDRFN/tl_2022_39135_addrfn.zip 2
https://www2.census.gov/geo/tiger/TIGER2022/ADDRFN/tl_2022_27135_addrfn.zip 2
https://www2.census.gov/geo/tiger/TIGER2022/AREAWATER/tl_2022_12073_areawater.zip 2
https://www2.census.gov/geo/tiger/TIGER2022/AREAWATER/tl_2022_72073_areawater.zip 3
https://www2.census.gov/geo/tiger/TIGER2022/EDGES/tl_2022_29085_edges.zip 2
https://www2.census.gov/geo/tiger/TIGER2022/FACES/tl_2022_08029_faces.zip 3
https://www2.census.gov/geo/tiger/TIGER2022/ADDRFN/tl_2022_31173_addrfn.zip 2
https://www2.census.gov/geo/tiger/TIGER2022/AREAWATER/tl_2022_3603

The combined size of these subsets relative to all the URLs:

In [28]:
(415122 + 20653 + 33224 + 33165 + 33163) / len(df)

0.11904433104707696

So what do the percentages look like for these subsets? Let's also save them, since they took some time to calculate.

In [29]:
summary_file.to_csv('data/census-summary-file.csv', index=False)
len(summary_file[summary_file.num_snapshots > 0]) / 385

0.23896103896103896

In [30]:
decennial.to_csv('data/census-decennial.csv', index=False)
len(decennial[decennial.num_snapshots > 0]) / 385

0.6233766233766234

In [31]:
tiger2024.to_csv('data/census-tiger2024.csv', index=False)
len(tiger2024[tiger2024.num_snapshots > 0]) / 385

0.4909090909090909

In [32]:
tiger2023.to_csv('data/census-tiger2023.csv', index=False)
len(tiger2023[tiger2023.num_snapshots > 0]) / 385

0.8857142857142857

In [33]:
tiger2022.to_csv('data/census-tiger2022.csv', index=False)
len(tiger2022[tiger2022.num_snapshots > 0]) / 385

0.9974025974025974
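
To compare the subsets in terms of files rather than percentages, the coverage fractions can be scaled up to each subset's size. This is only a rough extrapolation: every fraction carries the sample's margin of error of about five points, and the labels and helper below are illustrative.

```python
# (files in subset, fraction of the 385-URL sample with at least one snapshot),
# both copied from the cells above
reported = {
    "acs/summary_file": (415122, 0.23896103896103896),
    "decennial": (20653, 0.6233766233766234),
    "TIGER2024": (33224, 0.4909090909090909),
    "TIGER2023": (33165, 0.8857142857142857),
    "TIGER2022": (33163, 0.9974025974025974),
}

def estimated_unarchived(files, coverage):
    """Extrapolate the sample's uncovered share to the whole subset."""
    return round(files * (1 - coverage))

estimates = {name: estimated_unarchived(*values) for name, values in reported.items()}
for name, missing in sorted(estimates.items(), key=lambda item: -item[1]):
    print(f"{name:18} ~{missing:,} files without a 200 OK snapshot")
```

On these numbers the ACS summary files dominate: that subset alone accounts for far more unarchived files than the other four combined.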