# Lab02: Canadian Census data

To start this lab, you have 3 options to acquire the data.

## Option 1: Using Censusmapper and R
This option works best if you have some prior R experience. You will not have to write any code of your own for this part, although you will have to create a Censusmapper API key if you don't have one already. Use [this link](https://censusmapper.ca/users/sign_in) to do so. There is documentation and a vignette to get started available [here](https://cran.r-project.org/web/packages/cancensus/vignettes/cancensus.html). 

When you have the API key and a familiarity with the package, you can run the script `lab02_download_data.r` to create the necessary datasets. As you run through it, try to identify which of the arguments to `get_census` you could change if you wanted to download a different geography, time period, or set of census variables. 

## Option 2: Using Statcan
This option works best if you have a fast internet connection. From [this site](https://www12.statcan.gc.ca/census-recensement/2021/dp-pd/prof/details/download-telecharger.cfm?Lang=E&SearchText=toronto&DGUIDlist=2021A00053520005&GENDERlist=1&STATISTIClist=1&HEADERlist=50,30,6,20,9,1), select the Comprehensive download files dropdown and download the "Census metropolitan areas (CMAs), tracted census agglomerations (CAs) and census tracts (CTs)" file. 

## Option 3: Use the provided data
This option works best if you don't know or don't want to learn R and do not want to download a large file of all census tracts in Canada.

## Working with Census data

In [121]:
import pandas as pd
import numpy as np
import re

In [122]:
# read in 2021 census data
data_21 = pd.read_csv('~/git/cp101.github.io/labs/lab02/census21_data.csv')
data_21

Unnamed: 0.1,Unnamed: 0,GeoUID,Type,Region Name,Area (sq km),Population,Dwellings,Households,CMA_UID,PR_UID,...,"v_CA21_954: $45,000 to $49,999","v_CA21_955: $50,000 to $59,999","v_CA21_956: $60,000 to $69,999","v_CA21_957: $70,000 to $79,999","v_CA21_958: $80,000 to $89,999","v_CA21_959: $90,000 to $99,999","v_CA21_960: $100,000 and over","v_CA21_961: $100,000 to $124,999","v_CA21_962: $125,000 to $149,999","v_CA21_963: $150,000 and over"
0,1,5350001.00,CT,1.00,6.8192,599,253,235,35535,,...,10.0,10.0,15.0,20.0,10.0,10.0,130.0,20.0,25.0,80.0
1,2,5350002.00,CT,2.00,3.3926,604,294,284,35535,,...,15.0,20.0,20.0,25.0,25.0,15.0,95.0,40.0,20.0,35.0
2,3,5350003.00,CT,3.00,0.9455,457,279,265,35535,,...,5.0,25.0,25.0,20.0,25.0,10.0,85.0,35.0,15.0,30.0
3,4,5350004.00,CT,4.00,0.3404,6306,3620,3276,35535,,...,185.0,335.0,275.0,200.0,160.0,110.0,395.0,165.0,100.0,135.0
4,5,5350005.00,CT,5.00,0.3764,6957,4235,3720,35535,,...,160.0,340.0,295.0,275.0,235.0,170.0,830.0,315.0,195.0,320.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
580,581,5350378.26,CT,378.26,1.6267,4867,1234,1213,35535,,...,35.0,45.0,65.0,90.0,90.0,85.0,725.0,220.0,190.0,310.0
581,582,5350378.27,CT,378.27,19.9096,5547,1565,1526,35535,,...,55.0,115.0,105.0,125.0,140.0,125.0,740.0,275.0,195.0,265.0
582,583,5350378.28,CT,378.28,2.2278,6946,2017,1931,35535,,...,60.0,130.0,115.0,130.0,135.0,130.0,950.0,295.0,235.0,425.0
583,584,5350802.01,CT,802.01,2.1742,4011,1328,1303,35535,,...,30.0,70.0,65.0,75.0,75.0,90.0,750.0,205.0,200.0,350.0


In [123]:
# fix missingness and data types
data_21 = data_21.fillna(0)
data_21 = data_21.replace({'NA': 0})
data_21 = data_21.replace({'': 0})
data_21.iloc[:,4:] = data_21.iloc[:,4:].apply(pd.to_numeric)
data_21["GeoUID"] = data_21["GeoUID"].astype(str)

In [124]:
# convert whitespace, parens, commas to underscore in column names
data_21.columns = data_21.columns.str.replace(" |\\(|\\)|,", "_")
data_21 = data_21.drop(columns = ['Unnamed:_0','v_CA21_923:_Number_of_after-tax_income_recipients_aged_15_years_and_over_in_private_households_in_2019',
       'v_CA21_924:_Under_$5_000', 'v_CA21_925:_$5_000_to_$9_999',
       'v_CA21_926:_$10_000_to_$14_999', 'v_CA21_927:_$15_000_to_$19_999',
       'v_CA21_928:_$20_000_to_$24_999', 'v_CA21_929:_$25_000_to_$29_999',
       'v_CA21_930:_$30_000_to_$34_999', 'v_CA21_931:_$35_000_to_$39_999',
       'v_CA21_932:_$40_000_to_$44_999', 'v_CA21_933:_$45_000_to_$49_999',
       'v_CA21_934:_$50_000_to_$59_999', 'v_CA21_935:_$60_000_to_$69_999',
       'v_CA21_936:_$70_000_to_$79_999', 'v_CA21_937:_$80_000_to_$89_999',
       'v_CA21_938:_$90_000_to_$99_999', 'v_CA21_939:_$100_000_and_over',
       'v_CA21_940:_$100_000_to_$124_999', 'v_CA21_941:_$125_000_to_$149_999',
       'v_CA21_942:_$150_000_to_$199_999', 'v_CA21_943:_$200_000_and_over',
        'v_CA21_944:_Household_after-tax_income_groups_in_2020_for_private_households',
       'v_CA21_945:_Under_$5_000', 'v_CA21_946:_$5_000_to_$9_999',
       'v_CA21_947:_$10_000_to_$14_999', 'v_CA21_948:_$15_000_to_$19_999',
       'v_CA21_949:_$20_000_to_$24_999', 'v_CA21_950:_$25_000_to_$29_999',
       'v_CA21_951:_$30_000_to_$34_999', 'v_CA21_952:_$35_000_to_$39_999',
       'v_CA21_953:_$40_000_to_$44_999', 'v_CA21_954:_$45_000_to_$49_999',
       'v_CA21_955:_$50_000_to_$59_999', 'v_CA21_956:_$60_000_to_$69_999',
       'v_CA21_957:_$70_000_to_$79_999', 'v_CA21_958:_$80_000_to_$89_999',
       'v_CA21_959:_$90_000_to_$99_999', 'v_CA21_960:_$100_000_and_over',
       'v_CA21_961:_$100_000_to_$124_999', 'v_CA21_962:_$125_000_to_$149_999',
       'v_CA21_963:_$150_000_and_over'])

  data_21.columns = data_21.columns.str.replace(" |\\(|\\)|,", "_")


In [125]:
data_21.columns

Index(['Unnamed:_0', 'GeoUID', 'Type', 'Region_Name', 'Area__sq_km_',
       'Population', 'Dwellings', 'Households', 'CMA_UID', 'PR_UID', 'CSD_UID',
       'CD_UID',
       'v_CA21_4872:_Total_-_Visible_minority_for_the_population_in_private_households',
       'v_CA21_4875:_Total_visible_minority_population',
       'v_CA21_4878:_South_Asian', 'v_CA21_4881:_Chinese',
       'v_CA21_4884:_Black', 'v_CA21_4887:_Filipino', 'v_CA21_4890:_Arab',
       'v_CA21_4893:_Latin_American', 'v_CA21_4896:_Southeast_Asian',
       'v_CA21_4899:_West_Asian', 'v_CA21_4902:_Korean',
       'v_CA21_4905:_Japanese', 'v_CA21_4908:_Visible_minority__n.i.e.',
       'v_CA21_4911:_Multiple_visible_minorities',
       'v_CA21_4914:_Not_a_visible_minority',
       'v_CA21_7632:_Total_-_Main_mode_of_commuting_for_the_employed_labour_force_aged_15_years_and_over_with_a_usual_place_of_work_or_no_fixed_workplace_address',
       'v_CA21_7635:_Car__truck_or_van', 'v_CA21_7644:_Public_transit',
       'v_CA21_7647:

In [126]:
# read in 2006 data
data_06 = pd.read_csv('~/git/cp101.github.io/labs/lab02/census06_data.csv')
data_06

Unnamed: 0.1,Unnamed: 0,GeoUID,Type,Region Name,Area (sq km),Population,Dwellings,Households,CMA_UID,PR_UID,...,"v_CA06_1990: $10,000 to $19,999","v_CA06_1991: $20,000 to $29,999","v_CA06_1992: $30,000 to $39,999","v_CA06_1993: $40,000 to $49,999","v_CA06_1994: $50,000 to $59,999","v_CA06_1995: $60,000 to $69,999","v_CA06_1996: $70,000 to $79,999","v_CA06_1997: $80,000 to $89,999","v_CA06_1998: $90,000 to $99,999","v_CA06_1999: $100,000 and over"
0,1,5350001.00,CT,Toronto,6.62223,571,245,231,35535,35,...,10.0,10.0,35.0,20.0,25.0,20.0,15.0,15.0,15.0,50.0
1,2,5350002.00,CT,Toronto,3.26165,627,273,262,35535,35,...,30.0,10.0,35.0,10.0,10.0,40.0,15.0,20.0,25.0,70.0
2,3,5350003.00,CT,Toronto,0.93043,0,1,1,35535,35,...,,,,,,,,,,
3,4,5350004.00,CT,Toronto,0.34390,6861,3614,3335,35535,35,...,785.0,665.0,430.0,295.0,185.0,175.0,135.0,95.0,50.0,150.0
4,5,5350005.00,CT,Toronto,0.37841,5089,2575,2413,35535,35,...,660.0,350.0,220.0,155.0,140.0,140.0,90.0,45.0,65.0,135.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
526,527,5350378.22,CT,Toronto,0.53116,3472,880,862,35535,35,...,10.0,60.0,85.0,50.0,65.0,70.0,75.0,85.0,50.0,255.0
527,528,5350378.23,CT,Toronto,1.53841,4318,1309,1282,35535,35,...,105.0,185.0,75.0,90.0,70.0,105.0,90.0,90.0,45.0,355.0
528,529,5350378.24,CT,Toronto,2.52091,5886,1947,1861,35535,35,...,180.0,340.0,145.0,220.0,140.0,105.0,85.0,110.0,45.0,330.0
529,530,5350802.01,CT,Toronto,2.22576,4065,1269,1242,35535,35,...,30.0,70.0,45.0,60.0,105.0,95.0,115.0,130.0,125.0,445.0


In [127]:
# fix missingness and data types
data_06 = data_06.fillna(0)
data_06 = data_06.replace({'NA': 0})
data_06 = data_06.replace({'': 0})
data_06.iloc[:,4:] = data_06.iloc[:,4:].apply(pd.to_numeric)
data_06["GeoUID"] = data_06["GeoUID"].astype(str)

In [130]:
# convert whitespace, parens, commas to underscore in column names
data_06.columns = data_06.columns.str.replace(" |\\(|\\)|,", "_")
data_06 = data_06.drop(columns = ['Unnamed:_0','v_CA06_1988:_Household_income_in_2005_of_private_households_-_20%_sample_data',
       'v_CA06_1989:_Under_$10_000', 'v_CA06_1990:_$10_000_to_$19_999',
       'v_CA06_1991:_$20_000_to_$29_999', 'v_CA06_1992:_$30_000_to_$39_999',
       'v_CA06_1993:_$40_000_to_$49_999', 'v_CA06_1994:_$50_000_to_$59_999',
       'v_CA06_1995:_$60_000_to_$69_999', 'v_CA06_1996:_$70_000_to_$79_999',
       'v_CA06_1997:_$80_000_to_$89_999', 'v_CA06_1998:_$90_000_to_$99_999',
       'v_CA06_1999:_$100_000_and_over'])

  data_06.columns = data_06.columns.str.replace(" |\\(|\\)|,", "_")


In [131]:
data_06.columns

Index(['Unnamed:_0', 'GeoUID', 'Type', 'Region_Name', 'Area__sq_km_',
       'Population', 'Dwellings', 'Households', 'CMA_UID', 'PR_UID', 'CSD_UID',
       'CD_UID',
       'v_CA06_1302:_Total_population_by_visible_minority_groups_-_20%_sample_data',
       'v_CA06_1303:_Total_visible_minority_population',
       'v_CA06_1304:_Chinese', 'v_CA06_1305:_South_Asian',
       'v_CA06_1306:_Black', 'v_CA06_1307:_Filipino',
       'v_CA06_1308:_Latin_American', 'v_CA06_1309:_Southeast_Asian',
       'v_CA06_1310:_Arab', 'v_CA06_1311:_West_Asian', 'v_CA06_1312:_Korean',
       'v_CA06_1313:_Japanese', 'v_CA06_1314:_Visible_minority__n.i.e.',
       'v_CA06_1315:_Multiple_visible_minority',
       'v_CA06_1316:_Not_a_visible_minority',
       'v_CA06_1100:_Total_employed_labour_force_15_years_and_over_with_usual_place_of_work_or_no_fixed_workplace_address_by_mode_of_transportation_-_20%_sample_data',
       'v_CA06_1101:_Car__truck__van__as_driver',
       'v_CA06_1102:_Car__truck__van__as_pas

What do you notice about the column names between the two datasets? How would you be able to compare them to one another? You're working with what is conceptually the same measurement between two separate census, but many of the column names do not match. 

When you have 60 or so variables to compare and you know the groups you've queried, it is often sufficient to visually compare your two datasets. But there may come a day in which you are working with
all 535 census tracts rather than just the one, or hundreds of variables, each of which is liable to have a slightly different name over censuses. You can programmatically identify columns of comparison using string operations in Python. 

In [132]:
# remove the leading part of the vector
data_06 = data_06.rename(columns = {x: re.sub(r"v_CA\d{2}_\d+:_", "", x) for x in data_06.columns.tolist()})
data_06

Unnamed: 0,Unnamed:_0,GeoUID,Type,Region_Name,Area__sq_km_,Population,Dwellings,Households,CMA_UID,PR_UID,...,Bicycle,Motorcycle,Taxicab,Other_method,Total_number_of_private_households_by_household_size_-_100%_data,1_person,2_persons,3_persons,4_to_5_persons,6_or_more_persons
0,1,5350001.0,CT,Toronto,6.62223,571,245,231,35535,35,...,40.0,0.0,0.0,0.0,230.0,65.0,70.0,45.0,40.0,5.0
1,2,5350002.0,CT,Toronto,3.26165,627,273,262,35535,35,...,80.0,0.0,0.0,15.0,260.0,85.0,85.0,40.0,55.0,5.0
2,3,5350003.0,CT,Toronto,0.93043,0,1,1,35535,35,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,5350004.0,CT,Toronto,0.34390,6861,3614,3335,35535,35,...,170.0,0.0,15.0,30.0,3335.0,1575.0,970.0,385.0,345.0,55.0
4,5,5350005.0,CT,Toronto,0.37841,5089,2575,2413,35535,35,...,155.0,10.0,0.0,35.0,2415.0,1175.0,630.0,300.0,260.0,45.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
526,527,5350378.22,CT,Toronto,0.53116,3472,880,862,35535,35,...,15.0,0.0,0.0,0.0,860.0,30.0,120.0,190.0,385.0,130.0
527,528,5350378.23,CT,Toronto,1.53841,4318,1309,1282,35535,35,...,10.0,0.0,0.0,10.0,1285.0,145.0,340.0,270.0,385.0,140.0
528,529,5350378.24,CT,Toronto,2.52091,5886,1947,1861,35535,35,...,0.0,10.0,0.0,20.0,1860.0,340.0,490.0,360.0,515.0,155.0
529,530,5350802.01,CT,Toronto,2.22576,4065,1269,1242,35535,35,...,0.0,0.0,0.0,0.0,1240.0,125.0,350.0,260.0,460.0,50.0


In [133]:
data_21 = data_21.rename(columns = {x: re.sub(r"v_CA\d{2}_\d+:_", "", x) for x in data_21.columns.tolist()})
data_21

Unnamed: 0,Unnamed:_0,GeoUID,Type,Region_Name,Area__sq_km_,Population,Dwellings,Households,CMA_UID,PR_UID,...,Public_transit,Walked,Bicycle,Other_method,Private_households_by_household_size,1_person,2_persons,3_persons,4_persons,5_or_more_persons
0,1,5350001.0,CT,1.00,6.8192,599,253,235,35535,0.0,...,25.0,30.0,0.0,0.0,235,45,80,55,40,10
1,2,5350002.0,CT,2.00,3.3926,604,294,284,35535,0.0,...,25.0,15.0,50.0,0.0,280,90,125,35,25,10
2,3,5350003.0,CT,3.00,0.9455,457,279,265,35535,0.0,...,40.0,0.0,10.0,0.0,265,135,85,30,10,5
3,4,5350004.0,CT,4.00,0.3404,6306,3620,3276,35535,0.0,...,925.0,250.0,160.0,80.0,3280,1700,910,350,215,100
4,5,5350005.0,CT,5.00,0.3764,6957,4235,3720,35535,0.0,...,725.0,310.0,120.0,70.0,3720,1845,1170,410,195,95
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
580,581,5350378.26,CT,378.26,1.6267,4867,1234,1213,35535,0.0,...,215.0,10.0,0.0,40.0,1210,55,190,225,320,425
581,582,5350378.27,CT,378.27,19.9096,5547,1565,1526,35535,0.0,...,395.0,15.0,0.0,45.0,1525,140,280,315,380,415
582,583,5350378.28,CT,378.28,2.2278,6946,2017,1931,35535,0.0,...,420.0,35.0,0.0,50.0,1930,190,395,430,410,505
583,584,5350802.01,CT,802.01,2.1742,4011,1328,1303,35535,0.0,...,250.0,15.0,0.0,65.0,1300,165,410,250,310,170


In [143]:
# create sets of the two columns
# their intersection are columns that exist in both, their set difference is what is present in one but not the other
set_21 = {x for x in data_21.columns.tolist()}
set_06 = {x for x in data_06.columns.tolist()}

In [144]:
# present in both
set_21 & set_06

{'1_person',
 '2_persons',
 '3_persons',
 '4_or_more_persons',
 'Arab',
 'Area__sq_km_',
 'Bicycle',
 'Black',
 'CD_UID',
 'CMA_UID',
 'CSD_UID',
 'Car__truck_or_van',
 'Census_year',
 'Chinese',
 'Dwellings',
 'Filipino',
 'GeoUID',
 'Households',
 'Japanese',
 'Korean',
 'Latin_American',
 'Not_a_visible_minority',
 'Other_commute_method',
 'PR_UID',
 'Population',
 'Public_transit',
 'Region_Name',
 'South_Asian',
 'Southeast_Asian',
 'Total_visible_minority_population',
 'Type',
 'Unnamed:_0',
 'Visible_minority__n.i.e.',
 'Walked',
 'West_Asian',
 'hh_size_denom',
 'labour_force_denom',
 'vm_groups_denom'}

In [136]:
# present in CA21 but not CA06
set_21 - set_06

{'4_persons',
 '5_or_more_persons',
 'Car__truck_or_van',
 'Multiple_visible_minorities',
 'Private_households_by_household_size',
 'Total_-_Main_mode_of_commuting_for_the_employed_labour_force_aged_15_years_and_over_with_a_usual_place_of_work_or_no_fixed_workplace_address',
 'Total_-_Visible_minority_for_the_population_in_private_households'}

In [137]:
# present in CA06 but not CA21
set_06 - set_21

{'4_to_5_persons',
 '6_or_more_persons',
 'Car__truck__van__as_driver',
 'Car__truck__van__as_passenger',
 'Motorcycle',
 'Multiple_visible_minority',
 'Taxicab',
 'Total_employed_labour_force_15_years_and_over_with_usual_place_of_work_or_no_fixed_workplace_address_by_mode_of_transportation_-_20%_sample_data',
 'Total_number_of_private_households_by_household_size_-_100%_data',
 'Total_population_by_visible_minority_groups_-_20%_sample_data'}

In [138]:
# combine income categories in CA21 - skip this section until income data is addressed
#data_21['Under_$10_000'] = data_21[['Under_$5_000', '$5_000_to_$9_999']].sum(axis = 1)
#data_21['$10_000_to_$19_999'] = data_21[['$10_000_to_$14_999', '$15_000_to_$19_999',]].sum(axis = 1)
#data_21['$20_000_to_$29_999'] = data_21[['$20_000_to_$24_999', '$25_000_to_$29_999',]].sum(axis = 1)
#data_21['$30_000_to_$39_999'] = data_21[['$30_000_to_$34_999', '$35_000_to_$39_999',]].sum(axis = 1)
#data_21['$40_000_to_$49_999'] = data_21[['$40_000_to_$44_999', '$45_000_to_$49_999',]].sum(axis = 1)
#data_21.iloc[:,-5:]

In [140]:
# change household size groups to 1, 2, 3, and 4 or more persons
data_21["4_or_more_persons"] = data_21[['4_persons','5_or_more_persons']].sum(axis = 1)
data_06["4_or_more_persons"] = data_06[['4_to_5_persons','6_or_more_persons']].sum(axis = 1)

In [141]:
# group mode of transport categories to other, rename multiple visible minority cols, rename total cols
data_06['Other_commute_method'] = data_06[['Other_method', 'Motorcycle', 'Taxicab']].sum(axis = 1)
data_06['Car__truck_or_van'] = data_06[['Car__truck__van__as_driver', 'Car__truck__van__as_passenger']].sum(axis = 1)
data_06 = data_06.rename(columns = {'Multiple visible minority': 'Multiple visible minorities',
                                    
                                    'Total_employed_labour_force_15_years_and_over_with_usual_place_of_work_or_no_fixed_workplace_address_by_mode_of_transportation_-_20%_sample_data' : 'labour_force_denom',
                                    'Total_number_of_private_households_by_household_size_-_100%_data' : 'hh_size_denom',
                                    #'Household_income_in_2005_of_private_households_-_20%_sample_data' : 'hh_income_denom',
                                    'Total_population_by_visible_minority_groups_-_20%_sample_data' : 'vm_groups_denom'})

data_21 = data_21.rename(columns = {'Other_method' : 'Other_commute_method',
                                    'Total_-_Main_mode_of_commuting_for_the_employed_labour_force_aged_15_years_and_over_with_a_usual_place_of_work_or_no_fixed_workplace_address' : 'labour_force_denom',
                                    'Private_households_by_household_size' : 'hh_size_denom',
                                   # 'Household_after-tax_income_groups_in_2020_for_private_households' : 'hh_income_denom',
                                    'Total_-_Visible_minority_for_the_population_in_private_households' : 'vm_groups_denom'})


In [142]:
data_06['Census_year'] = "CA06"
data_21['Census_year'] = "CA21"

In [156]:
# concatenate the dataframes
all_data = pd.concat([data_06[data_06.columns[data_06.columns.isin(data_21.columns)]],data_21[data_21.columns[data_21.columns.isin(data_06.columns)]]])
all_data

Unnamed: 0,Unnamed:_0,GeoUID,Type,Region_Name,Area__sq_km_,Population,Dwellings,Households,CMA_UID,PR_UID,...,Walked,Bicycle,hh_size_denom,1_person,2_persons,3_persons,4_or_more_persons,Other_commute_method,Car__truck_or_van,Census_year
0,1,5350001.0,CT,Toronto,6.62223,571,245,231,35535,35.0,...,40.0,40.0,230.0,65.0,70.0,45.0,45.0,0.0,145.0,CA06
1,2,5350002.0,CT,Toronto,3.26165,627,273,262,35535,35.0,...,15.0,80.0,260.0,85.0,85.0,40.0,60.0,15.0,50.0,CA06
2,3,5350003.0,CT,Toronto,0.93043,0,1,1,35535,35.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,CA06
3,4,5350004.0,CT,Toronto,0.34390,6861,3614,3335,35535,35.0,...,275.0,170.0,3335.0,1575.0,970.0,385.0,400.0,45.0,795.0,CA06
4,5,5350005.0,CT,Toronto,0.37841,5089,2575,2413,35535,35.0,...,205.0,155.0,2415.0,1175.0,630.0,300.0,305.0,45.0,535.0,CA06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
580,581,5350378.26,CT,378.26,1.62670,4867,1234,1213,35535,0.0,...,10.0,0.0,1210.0,55.0,190.0,225.0,745.0,40.0,1400.0,CA21
581,582,5350378.27,CT,378.27,19.90960,5547,1565,1526,35535,0.0,...,15.0,0.0,1525.0,140.0,280.0,315.0,795.0,45.0,1610.0,CA21
582,583,5350378.28,CT,378.28,2.22780,6946,2017,1931,35535,0.0,...,35.0,0.0,1930.0,190.0,395.0,430.0,915.0,50.0,1690.0,CA21
583,584,5350802.01,CT,802.01,2.17420,4011,1328,1303,35535,0.0,...,15.0,0.0,1300.0,165.0,410.0,250.0,480.0,65.0,855.0,CA21


In [160]:
# pivot to view change over time by variable
compare_years = all_data.drop(columns = ["Unnamed:_0", "Region_Name", "Type"]).pivot(columns = "Census_year", index = "GeoUID")
compare_years

Unnamed: 0_level_0,Area__sq_km_,Area__sq_km_,Population,Population,Dwellings,Dwellings,Households,Households,CMA_UID,CMA_UID,...,2_persons,2_persons,3_persons,3_persons,4_or_more_persons,4_or_more_persons,Other_commute_method,Other_commute_method,Car__truck_or_van,Car__truck_or_van
Census_year,CA06,CA21,CA06,CA21,CA06,CA21,CA06,CA21,CA06,CA21,...,CA06,CA21,CA06,CA21,CA06,CA21,CA06,CA21,CA06,CA21
GeoUID,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
5350001.0,6.62223,6.8192,571.0,599.0,245.0,253.0,231.0,235.0,35535.0,35535.0,...,70.0,80.0,45.0,55.0,45.0,50.0,0.0,0.0,145.0,110.0
5350002.0,3.26165,3.3926,627.0,604.0,273.0,294.0,262.0,284.0,35535.0,35535.0,...,85.0,125.0,40.0,35.0,60.0,35.0,15.0,0.0,50.0,35.0
5350003.0,0.93043,0.9455,0.0,457.0,1.0,279.0,1.0,265.0,35535.0,35535.0,...,0.0,85.0,0.0,30.0,0.0,15.0,0.0,0.0,0.0,130.0
5350004.0,0.34390,0.3404,6861.0,6306.0,3614.0,3620.0,3335.0,3276.0,35535.0,35535.0,...,970.0,910.0,385.0,350.0,400.0,315.0,45.0,80.0,795.0,595.0
5350005.0,0.37841,0.3764,5089.0,6957.0,2575.0,4235.0,2413.0,3720.0,35535.0,35535.0,...,630.0,1170.0,300.0,410.0,305.0,290.0,45.0,70.0,535.0,670.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5350378.26,,1.6267,,4867.0,,1234.0,,1213.0,,35535.0,...,,190.0,,225.0,,745.0,,40.0,,1400.0
5350378.27,,19.9096,,5547.0,,1565.0,,1526.0,,35535.0,...,,280.0,,315.0,,795.0,,45.0,,1610.0
5350378.28,,2.2278,,6946.0,,2017.0,,1931.0,,35535.0,...,,395.0,,430.0,,915.0,,50.0,,1690.0
5350802.01,2.22576,2.1742,4065.0,4011.0,1269.0,1328.0,1242.0,1303.0,35535.0,35535.0,...,350.0,410.0,260.0,250.0,510.0,480.0,0.0,65.0,1285.0,855.0


In [161]:
compare_years.columns

MultiIndex([(                     'Area__sq_km_', 'CA06'),
            (                     'Area__sq_km_', 'CA21'),
            (                       'Population', 'CA06'),
            (                       'Population', 'CA21'),
            (                        'Dwellings', 'CA06'),
            (                        'Dwellings', 'CA21'),
            (                       'Households', 'CA06'),
            (                       'Households', 'CA21'),
            (                          'CMA_UID', 'CA06'),
            (                          'CMA_UID', 'CA21'),
            (                           'PR_UID', 'CA06'),
            (                           'PR_UID', 'CA21'),
            (                          'CSD_UID', 'CA06'),
            (                          'CSD_UID', 'CA21'),
            (                           'CD_UID', 'CA06'),
            (                           'CD_UID', 'CA21'),
            (                  'vm_groups_denom', 'CA06'

In [164]:
# you can subset on census tract(s) and column name(s)
compare_years.filter(regex='\.\d{2}$', axis = 0)[["labour_force_denom", "Car__truck_or_van", "Public_transit", "Walked", "Bicycle", "Other_commute_method"]]

Unnamed: 0_level_0,labour_force_denom,labour_force_denom,Car__truck_or_van,Car__truck_or_van,Public_transit,Public_transit,Walked,Walked,Bicycle,Bicycle,Other_commute_method,Other_commute_method
Census_year,CA06,CA21,CA06,CA21,CA06,CA21,CA06,CA21,CA06,CA21,CA06,CA21
GeoUID,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
5350007.01,1715.0,1075.0,505.0,345.0,930.0,470.0,170.0,110.0,90.0,115.0,20.0,30.0
5350007.02,2480.0,1350.0,890.0,460.0,1260.0,580.0,220.0,160.0,95.0,95.0,10.0,55.0
5350008.01,,2485.0,,1340.0,,555.0,,355.0,,105.0,,125.0
5350008.02,,3065.0,,1475.0,,760.0,,515.0,,125.0,,190.0
5350010.01,2960.0,1430.0,1205.0,685.0,1125.0,335.0,435.0,255.0,145.0,80.0,50.0,80.0
...,...,...,...,...,...,...,...,...,...,...,...,...
5350378.26,,1675.0,,1400.0,,215.0,,10.0,,0.0,,40.0
5350378.27,,2070.0,,1610.0,,395.0,,15.0,,0.0,,45.0
5350378.28,,2195.0,,1690.0,,420.0,,35.0,,0.0,,50.0
5350802.01,1930.0,1195.0,1285.0,855.0,580.0,250.0,65.0,15.0,0.0,0.0,0.0,65.0


What might the values of NA mean for each respective year? Did that census tract exist?

Using pivot tables, how might you identify tracts that experienced the greatest amount of change for a certain variable?