In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the raw population data
file_path = "../data/raw/raw_population_data.csv"
raw_population_data = pd.read_csv(file_path)

# Inspect first 5 rows and table information
print(raw_population_data.head())
print(raw_population_data.info())

  Population and dwelling counts: Canada and census subdivisions (municipalities) 1  \
0                              Frequency: Occasional                                  
1                               Table: 98-10-0002-01                                  
2                           Release date: 2022-02-09                                  
3  Geography: Canada, Province or territory, Cens...                                  
4  Universe: All persons, 2021 and 2016 censuses ...                                  

  Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7  
0        NaN        NaN        NaN        NaN        NaN        NaN        NaN  
1        NaN        NaN        NaN        NaN        NaN        NaN        NaN  
2        NaN        NaN        NaN        NaN        NaN        NaN        NaN  
3        NaN        NaN        NaN        NaN        NaN        NaN        NaN  
4        NaN        NaN        NaN        NaN        NaN        NaN     

The first rows of the file contain metadata rather than data, so we need to remove them. Let's do that now.

In [3]:
raw_population_data = raw_population_data.iloc[9:]

# Verify the new data
print(raw_population_data)

     Population and dwelling counts: Canada and census subdivisions (municipalities) 1  \
9                                        Admirals Beach                                  
10                                            Aquaforte                                  
11                                        Arnold's Cove                                  
12                                             Avondale                                  
13                                              Bauline                                  
...                                                 ...                                  
5192                                                NaN                                  
5193                                                NaN                                  
5194                                                NaN                                  
5195  How to cite: Statistics Canada. Table 98-10-00...                                  
5196  http

Let's also get rid of the trailing rows of metadata. First we print the tail of the data to see what is there, then we drop the last 27 rows and print the tail again to confirm.

In [4]:
print(raw_population_data.tail())
raw_population_data = raw_population_data.iloc[:-27]
print(raw_population_data.tail())

     Population and dwelling counts: Canada and census subdivisions (municipalities) 1  \
5192                                                NaN                                  
5193                                                NaN                                  
5194                                                NaN                                  
5195  How to cite: Statistics Canada. Table 98-10-00...                                  
5196  https://www150.statcan.gc.ca/t1/tbl1/en/tv.act...                                  

     Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6  \
5192        NaN        NaN        NaN        NaN        NaN        NaN   
5193        NaN        NaN        NaN        NaN        NaN        NaN   
5194        NaN        NaN        NaN        NaN        NaN        NaN   
5195        NaN        NaN        NaN        NaN        NaN        NaN   
5196        NaN        NaN        NaN        NaN        NaN        NaN   

     Unnamed: 
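
As an alternative to slicing with iloc after loading, the metadata rows could in principle be skipped while reading the file. The following is only a minimal sketch on a made-up miniature CSV: the contents of `toy_csv` and the counts passed to `skiprows` and `skipfooter` are assumptions for illustration and would have to be matched to the real file.

```python
import io

import pandas as pd

# Hypothetical miniature of the raw file: 2 metadata lines, a header row,
# 2 data rows, and 2 footer lines. None of this is the real data.
toy_csv = io.StringIO(
    "Population and dwelling counts\n"
    "Frequency: Occasional\n"
    "Geography,Population 2021\n"
    "Admirals Beach,97\n"
    "Aquaforte,74\n"
    "How to cite: Statistics Canada\n"
    "Footer line two\n"
)

toy = pd.read_csv(
    toy_csv,
    skiprows=2,       # leading metadata lines (count assumed for the toy data)
    skipfooter=2,     # trailing metadata lines (count assumed for the toy data)
    engine="python",  # skipfooter is only supported by the Python parser engine
)
print(toy)
```

The trade-off is that the row counts are still hard-coded, so printing the head and tail as done above remains a useful check either way.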

With the metadata rows removed, we can clean the data. We check whether any cells have a value of NaN (empty cells), ".." or "..." for "not available for this time period" or "not applicable", respectively (in the words of Statistics Canada - https://www.statcan.gc.ca/en/concepts/definitions/guide-symbol), or "x"/"X" (censored data). We can just set these values to 0, as they are effectively the same for the purpose of this project. Cities with these values will most likely be filtered out of our data later in this project, so I'm not overly concerned with preserving the distinction between them.

In [5]:
# Replace values of ".." and "..." with 0
raw_population_data = raw_population_data.replace(["..", "..."], 0)

print(raw_population_data.isnull().all(axis=1).sum()) # Counts rows in which every cell is NaN; 0 means no completely empty rows remain

0
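
To make clear what this kind of check does and does not cover, here is a minimal sketch on a made-up table. The data and the names `toy`, `placeholders`, `symbol_counts`, `rows_all_nan` and `rows_any_nan` are all hypothetical. It counts each placeholder symbol and contrasts "rows where every cell is NaN" with "rows where at least one cell is NaN".

```python
import numpy as np
import pandas as pd

# Made-up table containing each kind of placeholder once.
toy = pd.DataFrame(
    {
        "Geography": ["A", "B", "C", "D"],
        "Population (2021)": ["97", "..", "x", np.nan],
        "Population (2016)": ["135", "...", "80", "74"],
    }
)

placeholders = ["..", "...", "x", "X"]

# Count how often each placeholder symbol occurs anywhere in the table.
symbol_counts = {
    symbol: int(toy.isin([symbol]).sum().sum()) for symbol in placeholders
}

# Rows in which every cell is NaN (completely empty rows).
rows_all_nan = int(toy.isnull().all(axis=1).sum())

# Rows in which at least one cell is NaN.
rows_any_nan = int(toy.isnull().any(axis=1).sum())

print(symbol_counts)
print(rows_all_nan, rows_any_nan)
```

In this toy table the two NaN counts differ, which is why a result of 0 from the first check alone does not rule out isolated empty cells.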


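The check shows that no row is completely empty. It does not test individual cells, but any remaining NaN cells are filled with 0 during the numeric conversion later in this notebook, so we skip the separate NaN and x/X replacement used in the commuting data cleaning process. Next, let's rename the columns to accurately describe the data.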

In [6]:
new_columns = ["Geography", "Geographic Area Type", "Province or Territory Abbreviation", "Population (2021)", 
               "Population (2016)", "Population Percentage Change (2016->2021)", "Land area (2021) (km^2)",
               "Population density per km^2 (2021)"]

# Set the new columns
raw_population_data.columns = new_columns

# Verify the columns are correct
print(raw_population_data.head())

         Geography Geographic Area Type Province or Territory Abbreviation  \
9   Admirals Beach                    T                               N.L.   
10       Aquaforte                    T                               N.L.   
11   Arnold's Cove                    T                               N.L.   
12        Avondale                    T                               N.L.   
13         Bauline                    T                               N.L.   

   Population (2021) Population (2016)  \
9                 97               135   
10                74                80   
11               964               949   
12               584               641   
13               412               452   

   Population Percentage Change (2016->2021) Land area (2021) (km^2)  \
9                                      -28.1                    24.2   
10                                      -7.5                    6.88   
11                                       1.6                  
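
Since assigning a list to `.columns` relies purely on position, it can help to look at the old-to-new pairing before applying it. A minimal sketch on a made-up three-column table follows; `toy`, `toy_new_columns` and `mapping` are hypothetical names, and the real table has eight columns.

```python
import pandas as pd

# Made-up table with placeholder column names, standing in for the raw data.
toy = pd.DataFrame(
    [["Alpha", "T", 97], ["Beta", "T", 74]],
    columns=["Unnamed: 0", "Unnamed: 1", "Unnamed: 2"],
)
toy_new_columns = ["Geography", "Geographic Area Type", "Population (2021)"]

# Fail early if the number of new names does not match the number of columns.
if len(toy_new_columns) != toy.shape[1]:
    raise ValueError("Number of new column names does not match the table")

# Build an explicit old -> new mapping so the pairing can be inspected.
mapping = dict(zip(toy.columns, toy_new_columns))
print(mapping)

renamed = toy.rename(columns=mapping)
print(renamed.columns.tolist())
```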

At this point, I don't think the "Geographic Area Type" column will be useful for our analysis, so we drop it to keep the focus on the more relevant data. This step can be undone if we need the geographic area type at a later date.

In [7]:
raw_population_data = raw_population_data.drop(columns="Geographic Area Type")
print(raw_population_data.head())

         Geography Province or Territory Abbreviation Population (2021)  \
9   Admirals Beach                               N.L.                97   
10       Aquaforte                               N.L.                74   
11   Arnold's Cove                               N.L.               964   
12        Avondale                               N.L.               584   
13         Bauline                               N.L.               412   

   Population (2016) Population Percentage Change (2016->2021)  \
9                135                                     -28.1   
10                80                                      -7.5   
11               949                                       1.6   
12               641                                      -8.9   
13               452                                      -8.8   

   Land area (2021) (km^2) Population density per km^2 (2021)  
9                     24.2                                  4  
10                    6.

Let's also convert the quantitative columns to numeric data types to prepare them for EDA at a later stage of this project. First we check the current types, then we convert.

In [8]:
print(raw_population_data.dtypes)

Geography                                    object
Province or Territory Abbreviation           object
Population (2021)                            object
Population (2016)                            object
Population Percentage Change (2016->2021)    object
Land area (2021) (km^2)                      object
Population density per km^2 (2021)           object
dtype: object


In [9]:
# Convert specific columns to numeric types
columns_to_convert = [
    "Population (2021)",
    "Population (2016)",
    "Population Percentage Change (2016->2021)",
    "Land area (2021) (km^2)",
    "Population density per km^2 (2021)"
]

# Convert only the necessary columns
for column in columns_to_convert:
    if column in raw_population_data.columns:
        if raw_population_data[column].dtype == "object":  # Check if column is string-like
            raw_population_data[column] = pd.to_numeric(
                raw_population_data[column].str.replace(",", "").fillna(0), errors="coerce"
            )
        else:
            print(f"Skipping column {column}: already numeric.")
    else:
        print(f"Column {column} does not exist in the DataFrame!")

# Verify the data types
print(raw_population_data.dtypes)
print(raw_population_data.head())

Geography                                     object
Province or Territory Abbreviation            object
Population (2021)                              int64
Population (2016)                              int64
Population Percentage Change (2016->2021)    float64
Land area (2021) (km^2)                      float64
Population density per km^2 (2021)           float64
dtype: object
         Geography Province or Territory Abbreviation  Population (2021)  \
9   Admirals Beach                               N.L.                 97   
10       Aquaforte                               N.L.                 74   
11   Arnold's Cove                               N.L.                964   
12        Avondale                               N.L.                584   
13         Bauline                               N.L.                412   

    Population (2016)  Population Percentage Change (2016->2021)  \
9                 135                                      -28.1   
10                 80 
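
The conversion step can also be tried out in isolation on a few values. The sketch below uses a hypothetical helper, `to_number`, and a made-up Series. Unlike the loop above, it leaves unparseable cells as NaN instead of filling them with 0, so that they can be counted; that difference is a deliberate assumption for illustration.

```python
import pandas as pd


def to_number(column: pd.Series) -> pd.Series:
    """Strip thousands separators from strings, then convert to numbers.

    Cells that cannot be parsed (for example "x") become NaN.
    """
    cleaned = column.map(
        lambda value: value.replace(",", "") if isinstance(value, str) else value
    )
    return pd.to_numeric(cleaned, errors="coerce")


# Made-up values: strings with thousands separators, a negative decimal,
# an integer 0 left over from the placeholder replacement, and an "x".
toy = pd.Series(["1,658", "209,937", "-28.1", 0, "x"], dtype="object")
converted = to_number(toy)

print(converted)
print("Unparseable cells:", int(converted.isna().sum()))
```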

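Our data is now nearly complete. Before exporting, we check for duplicate place names in the "Geography" column and keep only the first occurrence of each. Then we export the cleaned data as a .csv file to our processed data folder.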

In [10]:
# Check for duplicates in the population data
duplicate_population = raw_population_data[raw_population_data.duplicated(subset="Geography", keep=False)]
print(f"Duplicate entries in population data: {duplicate_population}")

Duplicate entries in population data:         Geography Province or Territory Abbreviation  Population (2021)  \
85    South River                               N.L.                674   
91     St. Mary's                               N.L.                313   
98       Victoria                               N.L.               1658   
183     Deer Lake                               N.L.               4864   
191       Hampden                               N.L.                439   
...           ...                                ...                ...   
4708     Richmond                               B.C.             209937   
4801    Armstrong                               B.C.               5323   
5069       Dawson                               Y.T.               1577   
5080         Mayo                               Y.T.                188   
5157  Clyde River                               Nvt.               1181   

      Population (2016)  Population Percentage Change (2016->

In [11]:
raw_population_data = raw_population_data.drop_duplicates(subset="Geography", keep="first")

In [12]:
# Check for duplicates in the population data
duplicate_population = raw_population_data[raw_population_data.duplicated(subset="Geography", keep=False)]
print(f"Duplicate entries in population data: {duplicate_population}")

Duplicate entries in population data: Empty DataFrame
Columns: [Geography, Province or Territory Abbreviation, Population (2021), Population (2016), Population Percentage Change (2016->2021), Land area (2021) (km^2), Population density per km^2 (2021)]
Index: []
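
Note that the duplicate check above matches on the place name alone, so places that share a name across different provinces or territories count as duplicates and only the first is kept. If that ever turns out to be undesirable, one option is to match on name and province together. A minimal sketch on made-up data (the place names "Alpha" and "Beta" and provinces "P1" and "P2" are hypothetical):

```python
import pandas as pd

# Made-up table: "Alpha" exists in two provinces, "Beta" is a true duplicate.
toy = pd.DataFrame(
    {
        "Geography": ["Alpha", "Alpha", "Beta", "Beta"],
        "Province or Territory Abbreviation": ["P1", "P2", "P1", "P1"],
        "Population (2021)": [100, 200, 300, 300],
    }
)

key = ["Geography", "Province or Territory Abbreviation"]

# Rows flagged when matching on the name only.
flagged_by_name = int(toy.duplicated(subset="Geography", keep=False).sum())

# Rows flagged when matching on name and province together.
flagged_by_key = int(toy.duplicated(subset=key, keep=False).sum())

# Keep the first row for each (name, province) pair.
deduplicated = toy.drop_duplicates(subset=key, keep="first")

print(flagged_by_name, flagged_by_key)
print(deduplicated)
```

Whether the composite key is appropriate depends on how this table is joined to the other data later in the project, so this is a suggestion rather than a correction.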


In [13]:
cleaned_population_data = raw_population_data

cleaned_population_data.to_csv("../data/processed/cleaned_population_data.csv", index=False)
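
As a final sanity check, the exported file can be read back and compared with the DataFrame in memory. The sketch below demonstrates the idea on a made-up table written to a temporary directory, so it does not touch the project folders; `toy`, `out_path` and `reloaded` are hypothetical names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the cleaned data.
toy = pd.DataFrame(
    {
        "Geography": ["Alpha", "Beta"],
        "Population (2021)": [97, 74],
        "Land area (2021) (km^2)": [24.2, 6.88],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "cleaned_population_data.csv"
    toy.to_csv(out_path, index=False)  # same export call as above
    reloaded = pd.read_csv(out_path)   # read the file back while it exists

# Raises an AssertionError if values, columns or dtypes differ.
pd.testing.assert_frame_equal(reloaded, toy)
print("Round trip OK:", reloaded.shape)
```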