<a href="https://colab.research.google.com/github/cwils021/Census-Data-Wrangling/blob/main/2016Census_data_wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2016 StatCan Census Data (Age and Gender of the Population of Canada, Provinces and Territories, Census Divisions, Census Subdivisions and Dissemination Areas)

## Requirements Before Running the Script

1. Download CSV data from StatCan [Age Data Table](https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Rp-eng.cfm?TABID=4&LANG=E&A=R&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=01&GL=-1&GID=1159582&GK=1&GRP=1&O=D&PID=109526&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2016&THEME=115&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0)
2. Rename the columns to match what the script expects (see the table below): remove whitespace from the names and drop empty columns such as the notes column. I did this quickly in Excel while examining the file.

|                     Old Column Name                     |  New Column Name  |
| :-----------------------------------------------------: | :---------------: |
|                       CENSUS_YEAR                       |    CENSUS_YEAR    |
|                     GEO_CODE (POR)                      |     GEO_CODE      |
|                        GEO_LEVEL                        |     GEO_LEVEL     |
|                        GEO_NAME                         |     GEO_NAME      |
|                           GNR                           |        GNR        |
|                    DATA_QUALITY_FLAG                    | DATA_QUALITY_FLAG |
|                      CSD_TYPE_NAME                      |   CSD_TYPE_NAME   |
|                      ALT_GEO_CODE                       |   ALT_GEO_CODE    |
|    DIM: Age (in single years) and average  age (127)    |      AGE_CAT      |
| Member ID: Age (in single years) and  average age (127) |    AGE_CAT_ID     |
|   Notes: Age (in single years) and average  age (127)   |  **DROP COLUMN**  |
|        Dim: Sex (3): Member ID: [1]: Total - Sex        |       TOTAL       |
|           Dim: Sex (3): Member ID: [2]: Male            |    TOTAL_MALE     |
|          Dim: Sex (3): Member ID: [3]: Female           |   TOTAL_FEMALE    |
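
Step 2 can also be done in pandas rather than Excel. The following is a minimal sketch, assuming the downloaded CSV headers match the "Old Column Name" entries in the table above apart from spacing; `COLUMN_RENAMES`, `NOTES_COLUMN`, and `clean_columns` are hypothetical names that are not part of the original script.

```python
import pandas as pd

# Assumption: after collapsing repeated spaces, the CSV headers match these keys.
COLUMN_RENAMES = {
    'GEO_CODE (POR)': 'GEO_CODE',
    'DIM: Age (in single years) and average age (127)': 'AGE_CAT',
    'Member ID: Age (in single years) and average age (127)': 'AGE_CAT_ID',
    'Dim: Sex (3): Member ID: [1]: Total - Sex': 'TOTAL',
    'Dim: Sex (3): Member ID: [2]: Male': 'TOTAL_MALE',
    'Dim: Sex (3): Member ID: [3]: Female': 'TOTAL_FEMALE',
}
NOTES_COLUMN = 'Notes: Age (in single years) and average age (127)'

def clean_columns(df):
    # Collapse repeated whitespace so header spacing differences do not matter.
    df = df.rename(columns=lambda name: ' '.join(name.split()))
    # errors='ignore' keeps this safe if the notes column was already removed.
    df = df.drop(columns=[NOTES_COLUMN], errors='ignore')
    # Columns not listed in COLUMN_RENAMES (e.g. CENSUS_YEAR) keep their names.
    return df.rename(columns=COLUMN_RENAMES)
```

Calling `clean_columns(raw_df)` right after loading the file would then replace the manual Excel step, provided the headers really do match.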



## Long to Wide Script

### Load File as DataFrame

In [None]:
# imports
from google.colab import files
import pandas as pd
from functools import reduce

# TODO: implement chunking for the full dataset to speed up file upload

raw_data = files.upload()
filename = str(list(raw_data.keys())[0])

In [None]:
raw_df = pd.read_csv(filename, dtype=str)
# sanity check
raw_df.head(10)
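
The chunking noted in the imports cell could look roughly like the sketch below. It is only an illustration: `read_filtered` and its parameters are hypothetical, the chunk size is arbitrary, and chunked reading with `pd.read_csv(..., chunksize=...)` limits memory use while parsing rather than shortening the upload itself. It assumes the file already has the renamed `AGE_CAT_ID` column.

```python
import io
import pandas as pd

def read_filtered(csv_source, keep_ids, chunksize=100_000):
    # Values are read as strings (dtype=str), so compare against string IDs.
    keep = {str(k) for k in keep_ids}
    parts = []
    # With chunksize set, read_csv yields DataFrames of at most chunksize rows.
    for chunk in pd.read_csv(csv_source, dtype=str, chunksize=chunksize):
        parts.append(chunk[chunk['AGE_CAT_ID'].isin(keep)])
    return pd.concat(parts, ignore_index=True)

# Toy usage with an in-memory CSV standing in for the uploaded file.
toy_csv = "GEO_CODE,AGE_CAT_ID\n01,1\n01,2\n01,3\n10,1\n10,4\n"
filtered = read_filtered(io.StringIO(toy_csv), keep_ids=[1, 3], chunksize=2)
```

Filtering each chunk before concatenating means only the rows for the wanted age groups are ever held in memory together.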

### Define WideDataRow Class

In [3]:
class WideDataRow:
  def __init__(self, long_data_row):
    self.year = long_data_row['CENSUS_YEAR']
    self.geo_code = long_data_row['GEO_CODE']
    self.geo_level = long_data_row['GEO_LEVEL']
    self.geo_name = long_data_row['GEO_NAME']
    self.gnr = long_data_row['GNR']
    self.data_quality = long_data_row['DATA_QUALITY_FLAG']
    self.csd_type = long_data_row['CSD_TYPE_NAME']
    self.alt_geo_code = long_data_row['ALT_GEO_CODE']
    self.age_groups_total = {}
    self.age_groups_male = {}
    self.age_groups_female = {}

  def add_age_data(self, long_data_row):
    self.age_groups_total[long_data_row['AGE_CAT']] = long_data_row['TOTAL']
    self.age_groups_male[long_data_row['AGE_CAT'] + '_male'] = long_data_row['TOTAL_MALE']
    self.age_groups_female[long_data_row['AGE_CAT'] + '_female'] = long_data_row['TOTAL_FEMALE']

### Define an Age Group Reference Dict

In [4]:
age_group_key_ref = {
   1: 'Total - Age',
   3: '0 to 4 years',
   9: '5 to 9 years',
  15: '10 to 14 years',
  22: '15 to 19 years',
  28: '20 to 24 years',
  34: '25 to 29 years',
  40: '30 to 34 years',
  46: '35 to 39 years',
  52: '40 to 44 years',
  58: '45 to 49 years',
  64: '50 to 54 years',
  70: '55 to 59 years',
  76: '60 to 64 years',
  83: '65 to 69 years',
  89: '70 to 74 years',
  95: '75 to 79 years',
 101: '80 to 84 years',
 108: '85 to 89 years',
 114: '90 to 94 years',
 120: '95 to 99 years',
 126: '100 years and over'
}

### Iterate Over Raw Data Rows with parse_data()

Here we define the function parse_data, which iterates over the rows of the raw data and builds one WideDataRow object per GEO_CODE, keeping only the age groups listed in the reference dict. It returns a dict of WideDataRow objects keyed by GEO_CODE.

In [5]:
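def parse_data(raw_df, ref_dict):
  # raw_df is loaded with dtype=str, so AGE_CAT_ID holds strings;
  # compare against string versions of the integer keys in ref_dict
  ref_ids = {str(key) for key in ref_dict}
  wide_data = {}

  for _, row in raw_df.iterrows():
    if row['AGE_CAT_ID'] in ref_ids:
      geo_code = row['GEO_CODE']
      # create the wide row the first time a GEO_CODE is seen
      if geo_code not in wide_data:
        wide_data[geo_code] = WideDataRow(row)
      wide_data[geo_code].add_age_data(row)

  return wide_data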

### Return Wide Format of Age Population Data

The function wide_rows iterates over the WideDataRow objects in the parsed data (the values of the dict returned by parse_data), calls obj_to_row on each one, and returns the resulting single-row DataFrames concatenated into one DataFrame.

In [6]:
def obj_to_row(wide_obj):
  meta = pd.Series({
                   'year': wide_obj.year,
                   'geo_code': wide_obj.geo_code,
                   'geo_level': wide_obj.geo_level,
                   'geo_name': wide_obj.geo_name,
                   'gnr': wide_obj.gnr,
                   'data_quality': wide_obj.data_quality,
                   'csd_type': wide_obj.csd_type,
                   'alt_geo_code': wide_obj.alt_geo_code
                   }).to_frame().transpose()


  total_ages = pd.Series(wide_obj.age_groups_total,dtype=int).to_frame().transpose()
  male_ages = pd.Series(wide_obj.age_groups_male, dtype=int).to_frame().transpose()
  female_ages = pd.Series(wide_obj.age_groups_female, dtype=int).to_frame().transpose()

  series_list = [meta, total_ages, male_ages, female_ages]

  combined = reduce(lambda left, right: pd.merge(left, right, left_index=True, right_index=True ), series_list)
  return combined

In [None]:
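def wide_rows(parsed_data):
  # build one single-row DataFrame per WideDataRow object
  rows = [obj_to_row(row_obj) for row_obj in parsed_data]

  # drop=True discards the old index, which is 0 for every single-row frame
  return pd.concat(rows).reset_index(drop=True)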

# sanity check
wide_rows(parse_data(raw_df, age_group_key_ref).values()).head(10)
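
For comparison, the same long-to-wide reshape can be expressed with pandas' built-in `DataFrame.pivot`. This is a sketch on made-up toy data, not the notebook's method: `pivot_wide` and the toy values are hypothetical, and it assumes pandas 1.1 or later (for a list-valued `index`) and that each GEO_CODE and AGE_CAT pair appears only once.

```python
import pandas as pd

# Toy long-format data; the numbers are invented for illustration only.
toy = pd.DataFrame({
    'GEO_CODE': ['01', '01', '10', '10'],
    'GEO_NAME': ['Canada', 'Canada', 'NL', 'NL'],
    'AGE_CAT': ['Total - Age', '0 to 4 years', 'Total - Age', '0 to 4 years'],
    'TOTAL': ['100', '10', '50', '5'],
    'TOTAL_MALE': ['49', '5', '24', '2'],
    'TOTAL_FEMALE': ['51', '5', '26', '3'],
})

def pivot_wide(long_df):
    wide = long_df.pivot(
        index=['GEO_CODE', 'GEO_NAME'],
        columns='AGE_CAT',
        values=['TOTAL', 'TOTAL_MALE', 'TOTAL_FEMALE'],
    )
    # Flatten the (value column, age group) MultiIndex into names that follow
    # the script's convention: plain, '_male', and '_female' suffixes.
    suffix = {'TOTAL': '', 'TOTAL_MALE': '_male', 'TOTAL_FEMALE': '_female'}
    wide.columns = [age + suffix[value] for value, age in wide.columns]
    # Counts were read as strings, so convert them before restoring the index.
    return wide.astype(int).reset_index()

wide = pivot_wide(toy)
```

The class-based approach above makes each step explicit, while the pivot version avoids the row-by-row loop, which may matter on the full dataset.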

