# StatsCan shapefile processing
*April 22, 2022*

This notebook takes Statistics Canada (StatsCan) census shapefiles and processes them into maps ready for import into Datawrapper. First, we import geopandas and pandas, plus the `warnings` module to suppress some annoying warning messages.

In [1]:
import geopandas
import pandas as pd
import warnings

# Silence all warnings for the rest of the notebook
warnings.filterwarnings("ignore")
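
Ignoring every warning is a blunt tool, since it can also hide messages worth reading. As a hedged alternative (not what this notebook does), the sketch below restricts the filter to one category. The `noisy` function is a hypothetical stand-in for any library call that emits warnings.

```python
import warnings


def noisy():
    """Hypothetical stand-in for a library call that emits two kinds of warnings."""
    warnings.warn("old API in use", DeprecationWarning)
    warnings.warn("something worth seeing", UserWarning)
    return "done"


# record=True collects the warnings that pass the filters instead of printing them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Suppress only deprecation notices; other categories still come through
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    result = noisy()

remaining = [str(w.message) for w in caught]
```

Here only the `UserWarning` ends up in `remaining`, and the filter changes are undone when the `with` block exits.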

Now we read in the latest StatsCan census boundary file and convert its coordinate system to EPSG:4326 (WGS 84 longitude/latitude), which is what Datawrapper likes.

In [2]:
tracts = (geopandas
          .read_file("https://www12.statcan.gc.ca/census-recensement/2021/geo/sip-pis/boundary-limites/files-fichiers/lcma000b21a_e.zip")
          .to_crs("EPSG:4326")
          )

Next, we read in two pre-prepared lookup tables: a list of provinces keyed by PRUID, and a list of census metropolitan areas (CMAs) and census agglomerations (CAs) that matches their names to DGUIDs.

In [3]:
province_list = pd.read_csv("./data/provinces.csv").astype(str).set_index("PRUID")
cma_list = pd.read_csv("./data/cmas.csv").astype(str).set_index("ID")

Let's take a peek at the tracts table.

In [7]:
tracts.sample(20)

index,CTUID,DGUID,CTNAME,LANDAREA,PRUID,geometry
4664,9330228.03,2021S05079330228.03,228.03,0.3275,59,"POLYGON ((-123.01329 49.23275, -123.01262 49.2..."
2011,9330038.0,2021S05079330038.00,38.0,0.589,59,"POLYGON ((-123.08980 49.25669, -123.08985 49.2..."
476,4620257.0,2021S05074620257.00,257.0,1.3088,24,"POLYGON ((-73.60917 45.56420, -73.60950 45.563..."
683,8250206.07,2021S05078250206.07,206.07,2.4784,48,"POLYGON ((-114.01340 51.28668, -114.01339 51.2..."
1040,5350162.0,2021S05075350162.00,162.0,0.5757,35,"POLYGON ((-79.44502 43.68579, -79.44611 43.685..."
2365,5320202.16,2021S05075320202.16,202.16,60.7246,35,"POLYGON ((-78.64099 43.94428, -78.63453 43.929..."
5831,5350620.23,2021S05075350620.23,620.23,2.7492,35,"POLYGON ((-79.86711 43.48642, -79.86650 43.486..."
2348,5410110.0,2021S05075410110.00,110.0,99.6579,35,"POLYGON ((-80.36974 43.55394, -80.36977 43.552..."
3137,5370217.02,2021S05075370217.02,217.02,2.1841,35,"POLYGON ((-79.77663 43.34630, -79.77742 43.345..."
2553,4210700.0,2021S05074210700.00,700.0,44.5657,24,"POLYGON ((-70.95024 46.85450, -70.93067 46.840..."


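Next, we simplify our polygons a bit. Datawrapper has an upload size limit of 2 MB, so we use `.simplify()` to reduce the file size to an acceptable level. The tolerance is expressed in the units of the coordinate system, so here it is in degrees (0.0001° of latitude is roughly 11 m).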

In [5]:
simple_tracts = tracts.copy()

simple_tracts["geometry"] = tracts["geometry"].simplify(tolerance=0.0001)
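
To give a feel for what the tolerance does, here is a minimal pure-Python sketch of Douglas-Peucker line simplification, the algorithm the geopandas documentation names for `.simplify()`. It is an illustration only, not the library's implementation: it works on a plain list of coordinate pairs, and the function names and sample line are made up for this example.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Drop points that lie within `tolerance` of the simplified line."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord joining the endpoints
    distances = [point_segment_distance(p, points[0], points[-1])
                 for p in points[1:-1]]
    max_dist = max(distances)
    index = distances.index(max_dist) + 1
    if max_dist <= tolerance:
        # Every interior point is close enough to the chord: keep only the ends
        return [points[0], points[-1]]
    # Otherwise keep the farthest point and simplify both halves recursively
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right


# Made-up line: the second point is a 0.00005-degree wiggle, the fourth a real corner
line = [(0.0, 0.0), (1.0, 0.00005), (2.0, 0.0), (3.0, 1.0), (4.0, 0.0)]
simplified = douglas_peucker(line, tolerance=0.0001)
```

With the same tolerance as above, the tiny wiggle is dropped while the real corner at `(3.0, 1.0)` survives. A larger tolerance removes more vertices and shrinks the file further, at the cost of coarser boundaries.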

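Now, we iterate through every CMA and CA in our list and take the last three characters of its ID. Appending those to `2021S0507` gives the prefix shared by the DGUIDs of all census tracts in that CMA or CA. If any tracts in our shapefile match, we write them out as a GeoJSON file.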

In [6]:
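for cma_id in cma_list.index.unique():

    # Build a file-friendly name: lowercase, no spaces
    name = cma_list.at[cma_id, "NAME"].strip().lower().replace(" ", "")
    # Keep the last three characters of the ID (the CMA/CA code)
    code = cma_id[-3:]

    # Tract DGUIDs start with "2021S0507" followed by the CMA/CA code
    data = simple_tracts.loc[simple_tracts["DGUID"].str.startswith("2021S0507" + code), :]

    # Only write a file if this CMA/CA has tracts
    if len(data) > 0:
        data.to_file(f"./data/cities/{name}.geojson", driver="GeoJSON")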
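
The matching logic can be seen in isolation with plain strings. In the sketch below, the tract DGUIDs are copied from the sample output earlier in the notebook. The CMA identifier `2021S0503933` is a hypothetical value, assumed only to end in the three-digit CMA code, and `tract_prefix` is a helper invented for this example.

```python
def tract_prefix(cma_id: str) -> str:
    """Build the census-tract DGUID prefix from the last three characters of a CMA/CA ID."""
    return "2021S0507" + cma_id[-3:]


# Tract DGUIDs taken from the sample of the tracts table
tract_dguids = [
    "2021S05079330228.03",
    "2021S05079330038.00",
    "2021S05074620257.00",
    "2021S05075350162.00",
]

# Hypothetical CMA ID; only its last three characters matter here
prefix = tract_prefix("2021S0503933")
matches = [dguid for dguid in tract_dguids if dguid.startswith(prefix)]
```

Only the two tracts whose DGUID begins with `2021S0507933` are selected, which is the same test the loop applies to the whole `DGUID` column.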
    

That's all. The repo should now be populated with a set of usable GeoJSON files for Datawrapper maps, based on the most recent census data.
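
As an optional sanity check, the output files can be compared against the 2 MB upload limit mentioned earlier. This is a minimal sketch using only the standard library. The helper name `oversized_files` is invented here, and treating 2 MB as 2 × 1024 × 1024 bytes is an assumption about how Datawrapper counts.

```python
from pathlib import Path
import tempfile

# Assumption: the 2 MB limit is counted as 2 * 1024 * 1024 bytes
LIMIT_BYTES = 2 * 1024 * 1024


def oversized_files(folder, limit=LIMIT_BYTES):
    """Return (filename, size in bytes) for each .geojson file in folder above the limit."""
    return sorted(
        (path.name, path.stat().st_size)
        for path in Path(folder).glob("*.geojson")
        if path.stat().st_size > limit
    )


# Demonstration on throwaway files rather than the real output folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "small.geojson").write_text('{"type": "FeatureCollection", "features": []}')
    (Path(tmp) / "big.geojson").write_bytes(b" " * (LIMIT_BYTES + 1))
    too_big = oversized_files(tmp)
```

In this notebook the call would be `oversized_files("./data/cities")`. Any file it reports is a candidate for a larger `.simplify()` tolerance.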

\-30\-