# StatsCan shapefile processing
*April 22, 2022*

This notebook takes Statistics Canada (StatsCan) census shapefiles and processes them into maps ready for import into Datawrapper. First, we import geopandas and pandas, plus `re` for regular expressions and `warnings` to suppress some annoying warning messages.

In [1]:
import geopandas
import pandas as pd
import warnings
import re

warnings.filterwarnings("ignore")

Now we read in the latest StatsCan census boundary files, and convert the coordinate system to EPSG:4326, which is what Datawrapper likes.

In [2]:
tracts = (geopandas
          .read_file("https://www12.statcan.gc.ca/census-recensement/2021/geo/sip-pis/boundary-limites/files-fichiers/lct_000b21a_e.zip")
          .to_crs("EPSG:4326")
          )

In [3]:
diss_areas = (geopandas
          .read_file("https://www12.statcan.gc.ca/census-recensement/2021/geo/sip-pis/boundary-limites/files-fichiers/lda_000b21a_e.zip")
          .to_crs("EPSG:4326")
          )

In [4]:
subdivisions = (geopandas
          .read_file("https://www12.statcan.gc.ca/census-recensement/2021/geo/sip-pis/boundary-limites/files-fichiers/lcsd000b21a_e.zip")
          .to_crs("EPSG:4326")
          )

We'll add two columns here, pulling the CMA code and the CA code from the DGUID for later use.

In [5]:
tracts["CMA_CODE"] = tracts["DGUID"].str[9:12]
# Raw string so that "\." is read as a regex escape, not a Python string escape.
tracts["CA_CODE"] = tracts["DGUID"].apply(lambda x: re.search(r"[0-9]{3}(?=\.)", x).group(0))

tracts

Unnamed: 0,CTUID,DGUID,CTNAME,LANDAREA,PRUID,geometry,CMA_CODE,CA_CODE
0,5370001.08,2021S05075370001.08,0001.08,1.6383,35,"POLYGON ((-79.85362 43.19320, -79.85380 43.192...",537,001
1,0010002.00,2021S05070010002.00,0002.00,1.9638,10,"POLYGON ((-52.72050 47.55154, -52.71877 47.550...",001,002
2,5370001.09,2021S05075370001.09,0001.09,1.9699,35,"POLYGON ((-79.85586 43.18791, -79.85592 43.187...",537,001
3,5370120.02,2021S05075370120.02,0120.02,76.9650,35,"POLYGON ((-79.94562 43.16920, -79.94638 43.167...",537,120
4,0010006.00,2021S05070010006.00,0006.00,1.0467,10,"POLYGON ((-52.71107 47.56251, -52.71143 47.562...",001,006
...,...,...,...,...,...,...,...,...
6242,5591003.00,2021S05075591003.00,1003.00,227.6981,35,"POLYGON ((-82.63729 42.14866, -82.63820 42.136...",559,003
6243,5591004.00,2021S05075591004.00,1004.00,18.3792,35,"POLYGON ((-82.69416 42.05046, -82.69498 42.035...",559,004
6244,5800300.00,2021S05075800300.00,0300.00,314.4614,35,"POLYGON ((-80.41584 46.44983, -80.41636 46.444...",580,300
6245,6020800.00,2021S05076020800.00,0800.00,8.6987,46,"POLYGON ((-97.02578 49.59204, -97.02580 49.591...",602,800
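The two columns above can also be derived by splitting the DGUID into named parts. The sketch below assumes the DGUID layout is a four-digit vintage year, a one-letter type, a four-digit schema and then the geographic ID. The helper names `split_dguid` and `tract_codes` are hypothetical and are not part of the notebook.

```python
import re

# Assumed layout: 4-digit vintage, 1-letter type, 4-digit schema, then the geographic ID.
DGUID_RE = re.compile(
    r"^(?P<vintage>\d{4})(?P<type>[A-Z])(?P<schema>\d{4})(?P<geo_id>.+)$"
)


def split_dguid(dguid: str) -> dict:
    """Split a DGUID string into its assumed components."""
    match = DGUID_RE.match(dguid)
    if match is None:
        raise ValueError(f"Not a recognisable DGUID: {dguid!r}")
    return match.groupdict()


def tract_codes(dguid: str) -> tuple:
    """Return (CMA_CODE, CA_CODE) for a census tract DGUID."""
    geo_id = split_dguid(dguid)["geo_id"]
    cma_code = geo_id[:3]                  # same characters as dguid[9:12]
    ca_code = geo_id.split(".")[0][-3:]    # last three digits before the decimal point
    return cma_code, ca_code


print(split_dguid("2021S05075370120.02"))
print(tract_codes("2021S05075370120.02"))
```

For the tract in row 3 of the table, this gives `537` and `120`, matching the `CMA_CODE` and `CA_CODE` columns.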


Next, we read in two pre-prepared tables: a list of provinces keyed by PRUID, and a list that matches the names of CMAs and CAs to their DGUIDs.

In [6]:
province_list = pd.read_csv("./data/provinces.csv").astype(str).set_index("PRUID")
cma_list = pd.read_csv("./data/cmas.csv").astype(str).set_index("ID")

In [7]:
cma_list

ID,NAME,TYPE,PROVINCE
2021S0503001,St. John's,CMA,NL
2021S0503205,Halifax,CMA,NS
2021S0503305,Moncton,CMA,NB
2021S0503310,Saint John,CMA,NB
2021S0503320,Fredericton,CMA,NB
...,...,...,...
2021S0504970,Prince George,CA,BC
2021S0504975,Dawson Creek,CA,BC
2021S0504977,Fort St. John,CA,BC
2021S0504990,Whitehorse,CA,YK


In [8]:
cma_list[cma_list["NAME"].str.contains("Grand")]

ID,NAME,TYPE,PROVINCE
2021S0504010,Grand Falls-Windsor,CA,NL
2021S0504850,Grande Prairie,CA,AB


In [9]:
tracts[tracts["DGUID"].str.contains("850")]

Unnamed: 0,CTUID,DGUID,CTNAME,LANDAREA,PRUID,geometry,CMA_CODE,CA_CODE
1832,4210850.03,2021S05074210850.03,850.03,3.6804,24,"POLYGON ((-71.27536 46.74079, -71.27304 46.739...",421,850
1837,4210850.04,2021S05074210850.04,850.04,2.3087,24,"POLYGON ((-71.28403 46.73235, -71.28617 46.729...",421,850
1847,4210850.05,2021S05074210850.05,850.05,13.1479,24,"POLYGON ((-71.29949 46.72010, -71.29937 46.720...",421,850
2606,8500001.0,2021S05078500001.00,1.0,11.123,48,"POLYGON ((-118.84592 55.20558, -118.84580 55.2...",850,1
2607,8500002.0,2021S05078500002.00,2.0,2.5301,48,"POLYGON ((-118.76930 55.19974, -118.75649 55.1...",850,2
2608,8500003.0,2021S05078500003.00,3.0,7.4748,48,"POLYGON ((-118.73076 55.19974, -118.73070 55.1...",850,3
2611,8500004.0,2021S05078500004.00,4.0,2.2023,48,"POLYGON ((-118.77593 55.18538, -118.76921 55.1...",850,4
2612,8500005.0,2021S05078500005.00,5.0,1.821,48,"POLYGON ((-118.80155 55.16713, -118.80057 55.1...",850,5
2613,8500006.0,2021S05078500006.00,6.0,8.2066,48,"POLYGON ((-118.76900 55.17128, -118.76901 55.1...",850,6
2615,8500007.0,2021S05078500007.00,7.0,2.2604,48,"POLYGON ((-118.79821 55.15619, -118.79546 55.1...",850,7


Let's take a peek at the tracts table.

In [10]:
tracts.head(5)

Unnamed: 0,CTUID,DGUID,CTNAME,LANDAREA,PRUID,geometry,CMA_CODE,CA_CODE
0,5370001.08,2021S05075370001.08,1.08,1.6383,35,"POLYGON ((-79.85362 43.19320, -79.85380 43.192...",537,1
1,10002.0,2021S05070010002.00,2.0,1.9638,10,"POLYGON ((-52.72050 47.55154, -52.71877 47.550...",1,2
2,5370001.09,2021S05075370001.09,1.09,1.9699,35,"POLYGON ((-79.85586 43.18791, -79.85592 43.187...",537,1
3,5370120.02,2021S05075370120.02,120.02,76.965,35,"POLYGON ((-79.94562 43.16920, -79.94638 43.167...",537,120
4,10006.0,2021S05070010006.00,6.0,1.0467,10,"POLYGON ((-52.71107 47.56251, -52.71143 47.562...",1,6


Because CMAs and CAs are coded a bit differently, we need to extract each code from the DGUID in a different way. To get the CA code, we target the last three digits before the decimal point. The cell below, currently commented out, would strip the decimal portion of the DGUID to make this easier later on.

In [11]:
# tracts["DGUID"] = tracts["DGUID"].str.replace("\.{1}[0-9A-B]+", "")

# tracts

First, we want to simplify our polygons a bit. Datawrapper has an upload size limit of 2 MB, so we use `.simplify()` to reduce the file size to an acceptable level. Let's first define a list of the codes of CMAs that are very large.

In [12]:
big_cities = [
    "535",
    "462",
    "559",
    "933",
    "205",
    "521",
    "421",
    "505",
    "835",
    "537"
]

Now, we simplify.

In [13]:
simple_tracts = tracts.copy()

simple_tracts.loc[simple_tracts["CMA_CODE"].isin(big_cities), "geometry"] = tracts["geometry"].simplify(tolerance=0.0001)

simple_tracts

Unnamed: 0,CTUID,DGUID,CTNAME,LANDAREA,PRUID,geometry,CMA_CODE,CA_CODE
0,5370001.08,2021S05075370001.08,0001.08,1.6383,35,"POLYGON ((-79.85362 43.19320, -79.85380 43.192...",537,001
1,0010002.00,2021S05070010002.00,0002.00,1.9638,10,"POLYGON ((-52.72050 47.55154, -52.71877 47.550...",001,002
2,5370001.09,2021S05075370001.09,0001.09,1.9699,35,"POLYGON ((-79.85586 43.18791, -79.85592 43.187...",537,001
3,5370120.02,2021S05075370120.02,0120.02,76.9650,35,"POLYGON ((-79.94562 43.16920, -79.95305 43.152...",537,120
4,0010006.00,2021S05070010006.00,0006.00,1.0467,10,"POLYGON ((-52.71107 47.56251, -52.71143 47.562...",001,006
...,...,...,...,...,...,...,...,...
6242,5591003.00,2021S05075591003.00,1003.00,227.6981,35,"POLYGON ((-82.63729 42.14866, -82.64087 42.098...",559,003
6243,5591004.00,2021S05075591004.00,1004.00,18.3792,35,"POLYGON ((-82.69416 42.05046, -82.69498 42.035...",559,004
6244,5800300.00,2021S05075800300.00,0300.00,314.4614,35,"POLYGON ((-80.41584 46.44983, -80.41636 46.444...",580,300
6245,6020800.00,2021S05076020800.00,0800.00,8.6987,46,"POLYGON ((-97.02578 49.59204, -97.02580 49.591...",602,800
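To get a feel for what `tolerance` does, here is a small pure-Python sketch of the Douglas-Peucker algorithm, which the geopandas documentation names as the method behind `.simplify()`. It illustrates the idea only and is not the code geopandas runs. The function names are hypothetical, and points are plain `(x, y)` tuples.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment (a == b), e.g. a closed ring's start and end point.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Drop vertices that lie within `tolerance` of the simplified line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        dist = point_segment_distance(points[i], first, last)
        if dist > max_dist:
            index, max_dist = i, dist
    if max_dist <= tolerance:
        # Every intermediate vertex is close enough: keep only the endpoints.
        return [first, last]
    # Otherwise keep the farthest vertex and simplify both halves.
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right


line = [(0, 0), (1, 0.00005), (2, 0), (3, 1), (4, 0)]
print(douglas_peucker(line, tolerance=0.0001))
```

With a tolerance of 0.0001, only the vertex at `(1, 0.00005)` is dropped, because it sits within the tolerance of the line between its neighbours. Since the data is in EPSG:4326, the tolerance is in degrees: 0.0001 degrees of latitude is roughly 11 metres.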


Now, we iterate through every CMA and CA in our list and match its DGUID against the tract DGUIDs in our shapefile. Then, if there's data for that CMA or CA, we output the file as a GeoJSON.

In [27]:
simple_tracts["code"] = "0" + simple_tracts["CA_CODE"]

simple_tracts

Unnamed: 0,CTUID,DGUID,CTNAME,LANDAREA,PRUID,geometry,CMA_CODE,CA_CODE,code
0,5370001.08,2021S05075370001.08,0001.08,1.6383,35,"POLYGON ((-79.85362 43.19320, -79.85380 43.192...",537,001,0001
1,0010002.00,2021S05070010002.00,0002.00,1.9638,10,"POLYGON ((-52.72050 47.55154, -52.71877 47.550...",001,002,0002
2,5370001.09,2021S05075370001.09,0001.09,1.9699,35,"POLYGON ((-79.85586 43.18791, -79.85592 43.187...",537,001,0001
3,5370120.02,2021S05075370120.02,0120.02,76.9650,35,"POLYGON ((-79.94562 43.16920, -79.95305 43.152...",537,120,0120
4,0010006.00,2021S05070010006.00,0006.00,1.0467,10,"POLYGON ((-52.71107 47.56251, -52.71143 47.562...",001,006,0006
...,...,...,...,...,...,...,...,...,...
6242,5591003.00,2021S05075591003.00,1003.00,227.6981,35,"POLYGON ((-82.63729 42.14866, -82.64087 42.098...",559,003,0003
6243,5591004.00,2021S05075591004.00,1004.00,18.3792,35,"POLYGON ((-82.69416 42.05046, -82.69498 42.035...",559,004,0004
6244,5800300.00,2021S05075800300.00,0300.00,314.4614,35,"POLYGON ((-80.41584 46.44983, -80.41636 46.444...",580,300,0300
6245,6020800.00,2021S05076020800.00,0800.00,8.6987,46,"POLYGON ((-97.02578 49.59204, -97.02580 49.591...",602,800,0800


In [41]:
# Use area_id/area_type rather than id/type to avoid shadowing Python builtins.
for area_id in cma_list.index.unique():

    name = cma_list.at[area_id, "NAME"].strip().lower().replace(" ", "")
    area_type = cma_list.at[area_id, "TYPE"]
    prov_name = cma_list.at[area_id, "PROVINCE"]

    if area_type != "CMA":
        # Debug output: each CA's name and the characters at positions 8-10 of its ID.
        print(name)
        print(area_id[8:11])

    # Tract DGUIDs carry the area's three-digit code right after the schema digits,
    # so the same pattern works for both CMAs and CAs.
    data = (simple_tracts
            .loc[simple_tracts["DGUID"].str.contains("2021S050[0-9]{1}" + str(area_id[-3:]), regex=True), :]
            )

    if len(data) > 0:
        data.to_file(f"./data/cities/{prov_name.lower()}-{area_type.lower()}-{name}.geojson", driver='GeoJSON')


grandfalls-windsor
401
gander
401
cornerbrook
401
charlottetown
410
summerside
411
kentville
421
truro
421
newglasgow
422
capebreton
422
bathurst
432
miramichi
432
campbellton
433
edmundston
433
matane
440
rimouski
440
rivière-du-loup
440
baie-comeau
440
alma
441
dolbeau-mistassini
441
sept-îles
441
sainte-marie
442
saint-georges
442
thetfordmines
443
cowansville
443
victoriaville
444
shawinigan
444
granby
445
saint-hyacinthe
445
sorel-tracy
445
joliette
445
salaberry-de-valleyfield
446
sainte-agathe-des-monts
446
lachute
446
val-d'or
448
amos
448
rouyn-noranda
448
cornwall
450
hawkesbury
450
brockville
451
pembroke
451
petawawa
451
cobourg
452
porthope
452
kawarthalakes
453
centrewellington
453
ingersoll
453
woodstock
454
tillsonburg
454
norfolk
454
stratford
455
chatham-kent
455
sarnia
456
essa
456
wasagabeach
456
owensound
456
collingwood
456
orillia
456
midland
457
northbay
457
elliotlake
458
timmins
458
saultste.marie
459
kenora
459
winkler
460
steinbach
460
portagelaprairie
460
b
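The matching and file-naming logic in the loop above can be pulled out into two small helpers, which makes it easy to check without loading any shapefiles. This is only a sketch: `tract_pattern` and `output_path` are hypothetical names, and the pattern is the same one the loop builds.

```python
import re


def tract_pattern(area_id: str) -> str:
    """Regex for tract DGUIDs whose geographic ID starts with the area's 3-digit code."""
    return "2021S050[0-9]{1}" + re.escape(area_id[-3:])


def output_path(province: str, area_type: str, name: str) -> str:
    """Build the GeoJSON path the same way the loop does."""
    slug = name.strip().lower().replace(" ", "")
    return f"./data/cities/{province.lower()}-{area_type.lower()}-{slug}.geojson"


grande_prairie = tract_pattern("2021S0504850")

# A Grande Prairie tract matches...
print(bool(re.search(grande_prairie, "2021S05078500001.00")))
# ...but a tract numbered 850 in CMA 421 does not.
print(bool(re.search(grande_prairie, "2021S05074210850.03")))

print(output_path("NL", "CA", "Grand Falls-Windsor"))
```

Anchoring the code to the `2021S050` prefix is what separates this from the plain `str.contains("850")` check in cell 9, which also returned tracts from CMA 421.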

That's all for the city maps. The repo should now be populated with a set of usable GeoJSON files for Datawrapper maps, based on the most recent data.

## Provincial maps

We're also going to produce maps for provinces using subdivisions and dissemination areas. Note that some of these files will be too big for auto import into Datawrapper, and will have to be manually shrunk using [Mapshaper](https://mapshaper.org/).

In [None]:
for prov_id in province_list.index.unique():
    
    prov_name = province_list.at[prov_id, "NAME"].strip().lower().replace(" ", "")

    # Create the dissemination area map for this province.
    # .copy() means we modify an independent GeoDataFrame, not a slice of the original.
    da = diss_areas[diss_areas["PRUID"] == prov_id].copy()
    da["geometry"] = da["geometry"].simplify(tolerance=0.01)
    da.to_file(f"./data/provinces/{prov_name}-da.geojson", driver='GeoJSON')
    
    # Create the census subdivision map for this province.
    csd = subdivisions[subdivisions["PRUID"] == prov_id].copy()
    csd["geometry"] = csd["geometry"].simplify(tolerance=0.01)
    csd.to_file(f"./data/provinces/{prov_name}-csd.geojson", driver='GeoJSON')
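To find out which of the exported files need a trip through Mapshaper, a quick size check along these lines might help. It is a sketch under two assumptions: that the 2 MB limit mentioned earlier means 2 × 1024 × 1024 bytes, and that the files sit in a single folder. The helper name `oversized_files` is hypothetical.

```python
from pathlib import Path

# Assumption: the 2 MB upload limit is interpreted as 2 * 1024 * 1024 bytes.
SIZE_LIMIT_BYTES = 2 * 1024 * 1024


def oversized_files(folder, limit=SIZE_LIMIT_BYTES):
    """Return (file name, size in bytes) for each GeoJSON file larger than `limit`."""
    return sorted(
        (path.name, path.stat().st_size)
        for path in Path(folder).glob("*.geojson")
        if path.stat().st_size > limit
    )


# Lists nothing if the folder is missing or every file is under the limit.
for file_name, size in oversized_files("./data/provinces"):
    print(f"{file_name}: {size / 1024 / 1024:.1f} MB")
```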

\-30\-