# Lecture 4: Exploring Canadian Census Data

## The Census in Canada

- Censuses have been conducted in Canada *since 1666* - when the population of then "New France" was only 3,215!

- In 1871, the brand-new Canadian government began running a census every 10 years (later changed to **every 5 years**)

- Today, the Census of Population is managed by **Statistics Canada** - a government agency responsible for collecting info about Canada's population and industry.

**Why do governments care so much about conducting censuses?**

Every 5 years, Statistics Canada sends out two different census questionnaires:

1. **Short Form Questionnaire (75% of the population)**: ~15 questions
    - Age
    - Gender
    - Marital status
    - Household size
    - Knowledge of official languages

2. **Long Form Questionnaire (25% of the population)**: 60+ questions including all of the above, plus:
    - Education
    - Employment
    - Commuting
    - Mobility
    - Ethnic origin/indigenous identity
    - Religion
    - Housing costs
    - And more!

You can use public Census data to answer questions like:

- How many people in Canada live by themselves?

- How many people in Ontario can speak French?

- How many people in Toronto have a university degree?

- How many people in your neighbourhood live in apartments?

## Public Use Microdata File (PUMF)

The Public Use Microdata File is a special dataset that provides a detailed sample of individual responses to the census! 

- Each row shows how a specific household *actually* responded to the long-form questionnaire

- **1% Sample:** Each row is intended to stand in for ~100 actual households

Let's use the PUMF to answer a question: **What share of households in the Toronto metro area own their home?**

This question can be represented through the following formula:

$${\text{Percent}_\text{Own}} = \frac{\text{Number of Toronto respondents that own their home}}{\text{Number of Toronto respondents}}\times 100 $$
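
To make the formula concrete, here is a tiny worked example. The counts below are made up purely for illustration and are not PUMF results; the `toy_` variable names are also just placeholders.

```python
# Hypothetical counts, for illustration only (not real census figures)
toy_toronto_owners = 5    # Toronto respondents who own their home
toy_toronto_total = 20    # all Toronto respondents

# Same formula as above: owners divided by all respondents, times 100
toy_percent_own = toy_toronto_owners / toy_toronto_total * 100
print(toy_percent_own)    # 25.0
```
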

We'll start by importing `pandas` as `pd`, and reading our data file called `data_donnees_2021_hier_v2.csv`

In [None]:
import pandas as pd

pumf_data = pd.read_csv("data_donnees_2021_hier_v2.csv")

Let's start by looking at the first few rows of our data using `head()`:

In [None]:
# First five rows of our table
pumf_data.head()

There's a lot going on in this table! So much that it isn't even showing us all of the columns.

We can get the dimensions of `pumf_data` using `.shape`:

In [None]:
pumf_data.shape

There are 149,789 rows and 106 columns. In other words, we have:

- Responses for nearly 150,000 individual households (around 1% of all households in Canada in 2021)

- **106** different pieces of information about each of these households
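
Since `.shape` is just a pair of numbers (rows first, then columns), we can also unpack it into two variables. Here is a minimal sketch on a small made-up table; the `toy` data frame and its values are assumptions for illustration, not part of the PUMF.

```python
import pandas as pd

# Toy table standing in for the PUMF (all values are made up)
toy = pd.DataFrame({"HH_ID": [1, 2, 3],
                    "CMA": [535, 462, 535],
                    "TENUR": [1, 2, 1]})

# .shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = toy.shape
print(f"{n_rows} rows and {n_cols} columns")
```
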

This is a bit more information than we need right now! Let's limit this to the pieces of information we need to answer our question: where they live, and whether they own their home.

To create an easy-to-use data set we will only keep the following columns:

- `HH_ID`:      unique household ID
- `CMA`:        Census Metropolitan Area
- `TENUR`:      housing tenure (whether someone owns or rents)

You can see the full codebook <a href="2021 Census Hierarchical PUMF User Guide_V2.pdf" target="_blank">here</a>.

In [None]:
important_columns = ["HH_ID", "CMA", "TENUR"] 

subset_pumf_data = pumf_data[important_columns]

subset_pumf_data.head()
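
Another option, if we already know which columns we need, is to ask `pd.read_csv()` to load only those columns using its `usecols` argument. The sketch below uses a tiny in-memory stand-in for the file; the `EXTRA` column and all of the values are made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the CSV file; the EXTRA column and all values are made up
toy_csv = io.StringIO("HH_ID,CMA,TENUR,EXTRA\n1,535,1,9\n2,462,2,9\n")

# Only the listed columns are read; EXTRA is skipped entirely
toy_subset = pd.read_csv(toy_csv, usecols=["HH_ID", "CMA", "TENUR"])
print(toy_subset.columns.tolist())
```

With a file as wide as the PUMF, this can save memory, but reading everything first (as we did) makes it easier to explore what is available.
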

Let's also **rename** these columns so they are easier to understand. We can do this by creating a **dictionary** that maps each original column name to its new name, and then passing that dictionary to the `.rename()` method, which returns a data frame with the new column names:

In [None]:
new_column_names = {"HH_ID": "Household ID",
                    "CMA": "Metro Area",
                    "TENUR": "Housing Tenure"}
                    
subset_pumf_data = subset_pumf_data.rename(columns = new_column_names)

subset_pumf_data.head()

To answer our question, the next thing we need to figure out is who lives in the Toronto metropolitan area. We can do this by looking up the Census Metropolitan Area variable (`CMA`) in the data dictionary, which gives us the following information:

Code | Description   | Unweighted | Weighted   |
-----|---------------|------------|------------|
462  | Montréal      | 41,978     | 4,202,493  |
535  | Toronto       | 61,094     | 6,134,913  |
825  | Calgary       | 14,589     | 1,467,583  |
835  | Edmonton      | 13,880     | 1,396,261  |
933  | Vancouver     | 25,884     | 2,600,855  |
999  | Other         | 204,490    | 20,526,372 |
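
As an aside, the codes in this table can be turned into readable names with a dictionary and the `.map()` method. This is only a sketch: the `cma_labels` dictionary is copied from the table, while `toy_codes` is a made-up sample rather than real PUMF rows.

```python
import pandas as pd

# CMA codes and names copied from the codebook table
cma_labels = {462: "Montréal", 535: "Toronto", 825: "Calgary",
              835: "Edmonton", 933: "Vancouver", 999: "Other"}

# Made-up sample of codes, for illustration only
toy_codes = pd.Series([535, 462, 999, 535])

# .map() looks up each code in the dictionary
toy_names = toy_codes.map(cma_labels)
print(toy_names.tolist())
```
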

To find only Toronto residents, we need to create a **boolean** column (a series of `True`/`False` values) that is `True` where the response is associated with the Toronto metro area (`Metro Area == 535`), and `False` otherwise.

In [None]:
toronto = (subset_pumf_data["Metro Area"] == 535)

toronto

We can use `.sum()` to find out how many responses in `toronto` are `True`:

In [None]:
toronto.sum()
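
Why does `.sum()` work on `True`/`False` values? Python treats `True` as 1 and `False` as 0, so adding them up counts the `True` values. A minimal sketch with made-up codes (the `toy_` names and values are assumptions for illustration):

```python
import pandas as pd

# Made-up metro area codes, for illustration only
toy_metro = pd.Series([535, 462, 535, 999, 535])

toy_is_toronto = (toy_metro == 535)

print(toy_is_toronto.sum())    # number of True values (3 here)
print(toy_is_toronto.mean())   # fraction of values that are True
```

The same trick means `.mean()` gives the *share* of rows that are `True`, which will be handy when we calculate percentages.
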

We also need to figure out which households own their home. We can figure this out via the variable `Housing Tenure`, which can be found in the codebook under its original name `TENUR`. This variable has options for "Owner", "Renter", and "Not Available":

Code | Description   | Unweighted | Weighted   |
-----|---------------|------------|------------|
1    | Owner         | 259,358    | 26,035,043 |
2    | Renter        | 102,512    | 10,288,906 |
8    | Not available | 45         | 4,527      |

We are interested in the share of Toronto households that are **owners**, so we want cases where `Housing Tenure == 1`.

In [None]:
owner = (subset_pumf_data["Housing Tenure"] == 1)

owner

Now we can find the total number of homeowners in the dataset:

In [None]:
owner.sum()
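
If we want to see how many rows fall under *every* code at once, rather than just the owners, `.value_counts()` is one way to do it. The sketch below uses made-up tenure codes, not the real PUMF column.

```python
import pandas as pd

# Made-up tenure codes (1 = Owner, 2 = Renter, 8 = Not available)
toy_tenure = pd.Series([1, 2, 1, 1, 8, 2])

# Counts how many times each distinct code appears
toy_counts = toy_tenure.value_counts()
print(toy_counts)
```
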

But how do we specifically find owners *in Toronto*? Let's look at the first few values of `toronto` and `owner` using `.head()`:

In [None]:
toronto.head()

In [None]:
owner.head()

We can also put the two together side by side so that we can compare them directly. Which of these rows do we want to have the value `True`?

In [None]:
pd.concat(objs = [toronto.head(), owner.head()], 
          keys = ["Toronto", "Homeowner"],
          axis = 1)

We only want the value to be `True` in cases where both `toronto` **and** `owner` are `True` (such as index 3). For this, we need `&`:

In [None]:
toronto_owner = (toronto & owner)

toronto_owner.head()

In [None]:
pd.concat(objs = [toronto.head(), owner.head(), toronto_owner.head()], 
          keys = ["Toronto", "Homeowner", "Toronto Homeowner"],
          axis = 1)
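
Two things are worth knowing about `&`. First, it compares the two series row by row; the Python keyword `and` does not do this and raises an error when used on a whole series. Second, if we write both conditions on one line, each condition needs its own parentheses. Here is a small sketch on a made-up table (the `toy` data frame is an assumption for illustration):

```python
import pandas as pd

# Made-up codes, for illustration only
toy = pd.DataFrame({"Metro Area": [535, 535, 462, 999],
                    "Housing Tenure": [1, 2, 1, 1]})

# Each comparison gets its own parentheses because & binds
# more tightly than == in Python
toy_both = (toy["Metro Area"] == 535) & (toy["Housing Tenure"] == 1)
print(toy_both.tolist())   # [True, False, False, False]
```
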

Now we have the information we need to answer our question! We just need to fill in the following formula:

$${\text{Percent}_\text{Own}} = 100\times \frac{\text{Number of Toronto respondents that own their home}}{\text{Number of Toronto respondents}} $$

Let's calculate our answer and present it as a rounded percentage in a sentence:

In [None]:
percent_own = 100 * toronto_owner.sum() / toronto.sum()
print(f"The share of Toronto households that own their home is {round(percent_own, 2)}%")
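
As a sanity check, the same share can be reached a second way: keep only the Toronto rows, then take the mean of the owner column. The sketch below compares both routes on a small made-up table; all `toy_` names and values are assumptions for illustration, and it counts rows without applying any survey weights.

```python
import pandas as pd

# Made-up codes, for illustration only
toy = pd.DataFrame({"Metro Area": [535, 535, 462, 535, 999],
                    "Housing Tenure": [1, 2, 1, 1, 2]})
toy_toronto = (toy["Metro Area"] == 535)
toy_owner = (toy["Housing Tenure"] == 1)

# Route 1: count Toronto owners, then divide by the Toronto count
share_counts = 100 * (toy_toronto & toy_owner).sum() / toy_toronto.sum()

# Route 2: keep only Toronto rows, then take the mean of the owner column
share_mean = 100 * toy_owner[toy_toronto].mean()

print(round(share_counts, 2), round(share_mean, 2))
```

Both routes divide the number of Toronto owners by the number of Toronto rows, so they should agree.
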