# Lab02: Canadian Census data

To start this lab, you have three options for acquiring the data.

## Option 1: Using Censusmapper and R
This option works best if you have some prior R experience. You will not have to write any code of your own for this part, but you will need a Censusmapper API key; if you don't have one already, create one at [this link](https://censusmapper.ca/users/sign_in). Documentation and a getting-started vignette for the `cancensus` package are available [here](https://cran.r-project.org/web/packages/cancensus/vignettes/cancensus.html).

Once you have the API key and are familiar with the package, you can run the script `lab02_download_data.r`.

## Working with Census data

In [17]:
librarian::shelf(cancensus, stringr, tidyverse)
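
The cells that follow are written in Python with pandas and expect a DataFrame named `can_census_data_ct` to exist. The sketch below shows one way such a frame could be loaded, assuming the download script wrote its result to a CSV file. The inline text is a made-up stand-in for that file (the real file name and values are not given in this lab), so treat it as an illustration rather than the lab's actual loading step.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV produced by the download script;
# the rows below are invented purely to make the example runnable.
csv_text = """,GeoUID,Type,Region Name,Area (sq km),Population
1,5350001.00,CT,1.00,1.5,100
2,5350002.00,CT,2.00,2.5,200
"""

# Read GeoUID as text so tract IDs keep their trailing zeros.
can_census_data_ct = pd.read_csv(io.StringIO(csv_text), dtype={"GeoUID": str})

# The unnamed first column of the CSV is labelled "Unnamed: 0" by pandas.
print(can_census_data_ct.columns.tolist())
```

With a real file, `io.StringIO(csv_text)` would be replaced by the path to that file.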

In [31]:
# view columns
can_census_data_ct.columns.values

array(['Unnamed:_0', 'GeoUID', 'Type', 'Region_Name', 'Area__sq_km_',
       'Population', 'Dwellings', 'Households', 'CMA_UID', 'PR_UID',
       'CSD_UID', 'CD_UID', 'v_CA21_1:_Population__2021',
       'v_CA21_2:_Population__2016',
       'v_CA21_3:_Population_percentage_change__2016_to_2021',
       'v_CA21_4:_Total_private_dwellings',
       'v_CA21_5:_Private_dwellings_occupied_by_usual_residents',
       'v_CA21_6:_Population_density_per_square_kilometre',
       'v_CA21_7:_Land_area_in_square_kilometres',
       'v_CA21_8:_Total_-_Age', 'v_CA21_9:_Total_-_Age',
       'v_CA21_10:_Total_-_Age', 'v_CA21_11:_0_to_14_years',
       'v_CA21_12:_0_to_14_years', 'v_CA21_13:_0_to_14_years',
       'v_CA21_14:_0_to_4_years', 'v_CA21_15:_0_to_4_years',
       'v_CA21_16:_0_to_4_years', 'v_CA21_17:_Under_1_year',
       'v_CA21_18:_Under_1_year', 'v_CA21_19:_Under_1_year',
       'v_CA21_32:_5_to_9_years', 'v_CA21_33:_5_to_9_years',
       'v_CA21_34:_5_to_9_years', 'v_CA21_50:_10_to_14_ye

In [4]:
can_census_data_ct.columns

Index(['Unnamed: 0', 'GeoUID', 'Type', 'Region Name', 'Area (sq km)',
       'Population', 'Dwellings', 'Households', 'CMA_UID', 'PR_UID',
       ...
       'v_CA21_4902: Korean', 'v_CA21_4905: Japanese',
       'v_CA21_4908: Visible minority, n.i.e.',
       'v_CA21_4911: Multiple visible minorities',
       'v_CA21_4914: Not a visible minority',
       'v_CA21_4389: Total - Citizenship for the population in private households',
       'v_CA21_4392: Canadian citizens',
       'v_CA21_4395: Canadian citizens aged under 18',
       'v_CA21_4398: Canadian citizens aged 18 and over',
       'v_CA21_4401: Not Canadian citizens'],
      dtype='object', length=221)

In [20]:
# fix missingness and data types
can_census_data_ct = can_census_data_ct.fillna(0)
can_census_data_ct = can_census_data_ct.replace({'NA': 0})
can_census_data_ct = can_census_data_ct.replace({'': 0})
can_census_data_ct.iloc[:,4:] = can_census_data_ct.iloc[:,4:].apply(pd.to_numeric)
can_census_data_ct["GeoUID"] = can_census_data_ct["GeoUID"].astype(str)

In [21]:
# convert whitespace, parens, commas to underscore in column names
can_census_data_ct.columns = can_census_data_ct.columns.str.replace(" |\\(|\\)|,", "_", regex = True)
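
To see what the renaming pattern does, here is a small sketch on two of the column names listed earlier. It assumes pandas 2.0 or later, where `str.replace` treats the pattern as a literal string unless `regex=True` is passed; the frame `df` is a throwaway name used only for this illustration.

```python
import pandas as pd

# Empty toy frame carrying two of the original census column names.
df = pd.DataFrame(columns=["Area (sq km)", "v_CA21_4908: Visible minority, n.i.e."])

# With regex=True the "|" acts as alternation, so every space, parenthesis,
# or comma is replaced by exactly one underscore.
df.columns = df.columns.str.replace(r" |\(|\)|,", "_", regex=True)

print(df.columns.tolist())
# ['Area__sq_km_', 'v_CA21_4908:_Visible_minority__n.i.e.']
```

Adjacent matches each produce their own underscore, which is why names such as `Area__sq_km_` contain doubled underscores.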


In [22]:
# handle accidental division by zero
def div_0(n,d):
    # return 0 instead of raising when the denominator is zero
    try:
        return n/d
    except ZeroDivisionError:
        return 0
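
A short sketch of how a helper like `div_0` behaves, using a version that catches only `ZeroDivisionError`. It also illustrates a caveat: dividing one pandas Series by another does not raise on zero, so the `try`/`except` never fires there. The `where(...).fillna(0)` guard is one possible vectorized alternative and is an assumption, not part of the lab; the Series values are invented.

```python
import math
import pandas as pd

def div_0(n, d):
    """Return n / d, or 0 when d is zero (scalar inputs)."""
    try:
        return n / d
    except ZeroDivisionError:
        return 0

print(div_0(1, 0))   # scalar division by zero is caught
print(div_0(6, 3))

# Invented per-tract counts, just to show the Series behaviour.
renters = pd.Series([10.0, 5.0, 0.0])
households = pd.Series([20.0, 0.0, 0.0])

# Element-wise division yields inf or NaN rather than raising.
raw = renters / households

# Vectorized guard: mask zero denominators, divide, then fill the gaps with 0.
share = (renters / households.where(households != 0)).fillna(0)
print(share.tolist())
```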

## Exercise 1

Using `sort_values(...)`, identify which census tract in Toronto has the largest population.

In [24]:
# solution: it's tract 5350012.01
can_census_data_ct.sort_values(by = "Population", ascending = False)

Unnamed: 0,Unnamed:_0,GeoUID,Type,Region_Name,Area__sq_km_,Population,Dwellings,Households,CMA_UID,PR_UID,...,v_CA21_4902:_Korean,v_CA21_4905:_Japanese,v_CA21_4908:_Visible_minority__n.i.e.,v_CA21_4911:_Multiple_visible_minorities,v_CA21_4914:_Not_a_visible_minority,v_CA21_4389:_Total_-_Citizenship_for_the_population_in_private_households,v_CA21_4392:_Canadian_citizens,v_CA21_4395:_Canadian_citizens_aged_under_18,v_CA21_4398:_Canadian_citizens_aged_18_and_over,v_CA21_4401:_Not_Canadian_citizens
16,17,5350012.01,CT,12.01,0.4305,13554,8969,7893,35535,0.0,...,240.0,80.0,155.0,410.0,4965.0,13525.0,9265.0,1000.0,8265.0,4260.0
158,159,5350128.02,CT,128.02,0.5187,12611,8109,6896,35535,0.0,...,680.0,120.0,95.0,260.0,6330.0,12140.0,8565.0,1045.0,7520.0,3570.0
9,10,5350008.02,CT,8.02,1.9151,11591,7827,7022,35535,0.0,...,140.0,65.0,125.0,305.0,5690.0,11540.0,9420.0,815.0,8605.0,2120.0
23,24,5350016.0,CT,16.00,0.6317,11144,7596,6845,35535,0.0,...,165.0,75.0,55.0,325.0,6040.0,10845.0,9310.0,720.0,8590.0,1530.0
86,87,5350065.02,CT,65.02,0.1434,11120,5692,5135,35535,0.0,...,80.0,20.0,95.0,300.0,2050.0,11110.0,7925.0,1635.0,6285.0,3190.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10,11,5350009.0,CT,9.00,0.1032,520,176,170,35535,0.0,...,0.0,0.0,0.0,20.0,125.0,280.0,230.0,30.0,205.0,55.0
2,3,5350003.0,CT,3.00,0.9455,457,279,265,35535,0.0,...,0.0,0.0,0.0,25.0,245.0,520.0,400.0,50.0,350.0,115.0
544,545,5350376.06,CT,376.06,2.1919,428,252,222,35535,0.0,...,0.0,0.0,0.0,0.0,0.0,275.0,245.0,0.0,245.0,35.0
244,245,5350205.0,CT,205.00,0.7448,144,1,0,35535,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
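
Sorting the whole table works, but when only the top row is needed, `idxmax` or `nlargest` gets there more directly. The sketch below uses three GeoUID and Population pairs copied from the sorted output above; the frame name `tracts` is hypothetical.

```python
import pandas as pd

# Three tracts taken from the sorted output above (GeoUID and Population only).
tracts = pd.DataFrame({
    "GeoUID": ["5350128.02", "5350205.0", "5350012.01"],
    "Population": [12611, 144, 13554],
})

# idxmax returns the index label of the largest value; loc fetches that row.
largest = tracts.loc[tracts["Population"].idxmax()]
print(largest["GeoUID"], largest["Population"])

# nlargest is shorthand for "sort descending and keep the first n rows".
top2 = tracts.nlargest(2, "Population")
print(top2["GeoUID"].tolist())
```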


## Exercise 2

Calculate total population, total number of dwellings, and the total number of renter and owner private households in Toronto. 

In [25]:
# total pop
can_census_data_ct["Population"].sum()

2794356

In [26]:
# total dwellings
can_census_data_ct["Dwellings"].sum()

1253238

In [28]:
# total renter private households
can_census_data_ct["v_CA21_4239:_Renter"].sum()

557940.0

In [29]:
# total owner private households
can_census_data_ct["v_CA21_4238:_Owner"].sum()

602835.0

In [30]:
# can also calculate them all at once
can_census_data_ct[["Population", "Dwellings", "v_CA21_4239:_Renter", "v_CA21_4238:_Owner"]].sum()

Population             2794356.0
Dwellings              1253238.0
v_CA21_4239:_Renter     557940.0
v_CA21_4238:_Owner      602835.0
dtype: float64

In [33]:
# compute the proportion of private households that are renters
can_census_data_ct["v_CA21_4239:_Renter"].sum() / can_census_data_ct["v_CA21_4237:_Total_-_Private_households_by_tenure"].sum()

0.4806222919017633

## Exercise 3

### Common functions and calculations

In [86]:
# read in the lemonade data
lemonade = pd.read_csv("lemonade_sales.csv")
lemonade

Unnamed: 0.1,Unnamed: 0,Day,AM Sale Count (Lemonades Sold),AM Profit,PM Sale Count (Lemonades Sold),PM Profit
0,0,2019-01-07 00:00:00,4,10.0,3,8.5
1,1,2019-01-08 00:00:00,2,5.25,2,5.0
2,2,2019-01-09 00:00:00,2,6.0,6,14.25
3,3,2019-01-10 00:00:00,5,12.5,7,13.75
4,4,2019-01-11 00:00:00,2,5.0,1,2.5
5,5,2019-01-12 00:00:00,3,7.5,4,9.75
6,6,2019-01-13 00:00:00,5,13.25,4,10.5
7,7,2019-01-14 00:00:00,5,12.25,5,12.25
8,8,2019-01-15 00:00:00,1,2.5,3,6.0
9,9,2019-01-16 00:00:00,2,4.5,3,5.0


In [87]:
# data type conversion - looks good
# ok to drop Unnamed: 0 col
# partition the data into two weeks
lemonade["day_datetime"] = pd.to_datetime(lemonade["Day"])
lemonade.drop(columns = "Unnamed: 0", inplace = True)
lemonade['week_num'] = (lemonade['day_datetime'] >= "2019-01-14") + 1
lemonade

Unnamed: 0,Day,AM Sale Count (Lemonades Sold),AM Profit,PM Sale Count (Lemonades Sold),PM Profit,day_datetime,week_num
0,2019-01-07 00:00:00,4,10.0,3,8.5,2019-01-07,1
1,2019-01-08 00:00:00,2,5.25,2,5.0,2019-01-08,1
2,2019-01-09 00:00:00,2,6.0,6,14.25,2019-01-09,1
3,2019-01-10 00:00:00,5,12.5,7,13.75,2019-01-10,1
4,2019-01-11 00:00:00,2,5.0,1,2.5,2019-01-11,1
5,2019-01-12 00:00:00,3,7.5,4,9.75,2019-01-12,1
6,2019-01-13 00:00:00,5,13.25,4,10.5,2019-01-13,1
7,2019-01-14 00:00:00,5,12.25,5,12.25,2019-01-14,2
8,2019-01-15 00:00:00,1,2.5,3,6.0,2019-01-15,2
9,2019-01-16 00:00:00,2,4.5,3,5.0,2019-01-16,2
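
The `week_num` line relies on a small trick: the comparison produces booleans, and adding 1 turns `False` into 1 and `True` into 2. A minimal sketch on three of the dates above, followed by a variant that would extend to more than two weeks (the start date `2019-01-07` is taken from the first row of the table; the variable names are illustrative):

```python
import pandas as pd

days = pd.Series(pd.to_datetime(["2019-01-13", "2019-01-14", "2019-01-16"]))

# Boolean comparison, then +1: False -> 1 (week 1), True -> 2 (week 2).
week_num = (days >= "2019-01-14") + 1
print(week_num.tolist())  # [1, 2, 2]

# More general form: whole weeks elapsed since the first day, plus one.
start = pd.Timestamp("2019-01-07")
week_num_general = (days - start).dt.days // 7 + 1
print(week_num_general.tolist())  # [1, 2, 2]
```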


In [88]:
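# compute: totals (AM & PM)
# note: iloc[:,2:] starts at "AM Profit", so "AM Sale Count" (position 1) is not included
# numeric_only skips the datetime column, which cannot be summed
lemonade.iloc[:,2:].sum(numeric_only = True)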


AM Profit                          99.75
PM Sale Count (Lemonades Sold)     48.00
PM Profit                         112.75
week_num                           21.00
dtype: float64

In [89]:
# compute: totals (AM + PM) of sales
am_pm_sales = lemonade[lemonade.columns.values[lemonade.columns.str.contains("Sale Count")]].sum()
am_pm_sales.sum()

86

In [90]:
# compute: totals (AM + PM) of profits
am_pm_profit = lemonade[lemonade.columns.values[lemonade.columns.str.contains("Profit")]].sum()
am_pm_profit.sum()

212.5

In [91]:
# pct of total sales made in AM, PM
am_pm_sales / am_pm_sales.sum()

AM Sale Count (Lemonades Sold)    0.44186
PM Sale Count (Lemonades Sold)    0.55814
dtype: float64

In [92]:
# pct of total profit made in AM, PM
am_pm_profit / am_pm_profit.sum()

AM Profit    0.469412
PM Profit    0.530588
dtype: float64

In [93]:
# the averages of week 1, week 2, am, pm for sales, profits
# numeric_only skips the text and datetime columns, which cannot be averaged
lemonade.groupby('week_num').mean(numeric_only = True)


week_num,AM Sale Count (Lemonades Sold),AM Profit,PM Sale Count (Lemonades Sold),PM Profit
1,3.285714,8.5,3.857143,9.178571
2,2.142857,5.75,3.0,6.928571


In [102]:
# compute daily totals for sales and profits
lemonade['total_sales'] = lemonade[lemonade.columns.values[lemonade.columns.str.contains("Sale Count")]].sum(axis = 1)
lemonade['total_profit'] = lemonade[lemonade.columns.values[lemonade.columns.str.contains("Profit")]].sum(axis = 1)
lemonade

Unnamed: 0,Day,AM Sale Count (Lemonades Sold),AM Profit,PM Sale Count (Lemonades Sold),PM Profit,day_datetime,week_num,total_sales,total_profit
0,2019-01-07 00:00:00,4,10.0,3,8.5,2019-01-07,1,7,18.5
1,2019-01-08 00:00:00,2,5.25,2,5.0,2019-01-08,1,4,10.25
2,2019-01-09 00:00:00,2,6.0,6,14.25,2019-01-09,1,8,20.25
3,2019-01-10 00:00:00,5,12.5,7,13.75,2019-01-10,1,12,26.25
4,2019-01-11 00:00:00,2,5.0,1,2.5,2019-01-11,1,3,7.5
5,2019-01-12 00:00:00,3,7.5,4,9.75,2019-01-12,1,7,17.25
6,2019-01-13 00:00:00,5,13.25,4,10.5,2019-01-13,1,9,23.75
7,2019-01-14 00:00:00,5,12.25,5,12.25,2019-01-14,2,10,24.5
8,2019-01-15 00:00:00,1,2.5,3,6.0,2019-01-15,2,4,8.5
9,2019-01-16 00:00:00,2,4.5,3,5.0,2019-01-16,2,5,9.5
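
The column selection used for the daily totals can also be written with `DataFrame.filter(like=...)`, which keeps the columns whose names contain a given substring. This is an alternative sketch, not the lab's own code; it rebuilds the first three days of the table above so it can run on its own.

```python
import pandas as pd

# First three days of the lemonade table shown above.
lemonade = pd.DataFrame({
    "AM Sale Count (Lemonades Sold)": [4, 2, 2],
    "AM Profit": [10.0, 5.25, 6.0],
    "PM Sale Count (Lemonades Sold)": [3, 2, 6],
    "PM Profit": [8.5, 5.0, 14.25],
})

# axis=1 sums across columns, giving one total per row (per day).
lemonade["total_sales"] = lemonade.filter(like="Sale Count").sum(axis=1)
lemonade["total_profit"] = lemonade.filter(like="Profit").sum(axis=1)

print(lemonade[["total_sales", "total_profit"]])
```

The totals for these three days match the first three rows of the output above.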


In [106]:
# the daily average sales
lemonade['total_sales'].mean()

6.142857142857143

In [107]:
# the daily average profit
lemonade['total_profit'].mean()

15.178571428571429

In [108]:
# the maximum sales in a day (AM + PM)
lemonade['total_sales'].max()

12

In [109]:
# the maximum profit in a day (AM + PM)
lemonade['total_profit'].max()

26.25

In [110]:
# the minimum sales in a day (AM + PM)
lemonade['total_sales'].min()

2

In [111]:
# the minimum profit in a day (AM + PM)
lemonade['total_profit'].min()

3.5

In [113]:
# command to compute these summary statistics and more all at once
lemonade[['total_sales', 'total_profit']].describe()

Unnamed: 0,total_sales,total_profit
count,14.0,14.0
mean,6.142857,15.178571
std,2.905092,7.092404
min,2.0,3.5
25%,4.0,9.6875
50%,6.0,14.875
75%,7.75,19.875
max,12.0,26.25
