In [4]:
import pandas as pd
import numpy as np
import altair as alt
import os

from datetime import datetime
from datetime import timedelta


#os.chdir('Masters/capstone/w2020-data599-capstone-projects-statistics-canada-covid-19')
# os.chdir('data')
os.getcwd()
cews = pd.read_csv('CEWS_SSUC_DB_En_v1.0.csv', encoding = "ISO-8859-1") 
cews.head()

Unnamed: 0,Start_date_of_CEWS_period,RegionCode,RegionName,RuralUrbanFlag,CMACAFlag,IndustryCode,IndustryName,Number_business_locations,Subsidy_amount,Supported_employees,CEWS_rehire_count
0,2020-03-15,10,Newfoundland and Labrador,Not applicable,Not applicable,11,"Agriculture, forestry, fishing and hunting",30,823000,362,0
1,2020-03-15,10,Newfoundland and Labrador,Not applicable,Not applicable,111,Crop production,10,X,90,0
2,2020-03-15,10,Newfoundland and Labrador,Not applicable,Not applicable,112,Animal production and aquaculture,10,X,X,0
3,2020-03-15,10,Newfoundland and Labrador,Not applicable,Not applicable,113,Forestry and logging,5,X,X,0
4,2020-03-15,10,Newfoundland and Labrador,Not applicable,Not applicable,114,"Fishing, hunting and trapping",5,X,X,0


In [None]:
"This is Eric's first look at the data, where he learned about the structure and factor levels of each column. \
He altered the dataset to include two new columns, GeoAggregation and IndustryAggregation, which made the dataset much easier to work with \
by allowing us to query a single level in each column to ensure no duplication of information. \
Suppressed values, indicated with 'X', were replaced by null."

# Big Questions to Answer
- Is there a systematic way that data was suppressed which could bias our analysis?
- Does the aggregated data account for all subsidies, or just the ones which weren't suppressed?
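
One way to probe the second question is to compare a sector row against the sum of its visible subsector rows. The sketch below uses made-up numbers and a toy frame, not the real CEWS file; a positive gap would suggest that the aggregate includes amounts that are suppressed at the finer level.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up numbers: one sector ("11") and its three subsectors
toy = pd.DataFrame({
    "IndustryCode": ["11", "111", "112", "113"],
    "Subsidy": [1000.0, 600.0, np.nan, 150.0],  # NaN stands for a suppressed 'X'
})

# Subsidy reported on the sector (aggregate) row
parent = toy.loc[toy["IndustryCode"] == "11", "Subsidy"].iloc[0]

# Sum of the subsector rows that were not suppressed (NaN is skipped by sum)
children = toy.loc[toy["IndustryCode"].str.fullmatch(r"11\d"), "Subsidy"]
visible_sum = children.sum()

# Amount present in the aggregate but not visible at the finer level
gap = parent - visible_sum
print(gap)
```

On the real data the same comparison would need to be repeated per region and period before drawing any conclusion.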

## What is the data?
The dataframe is not structured in a typical tabular form. Instead, various layers of aggregation are all included in the same dataframe. Some rows show the aggregate of all subsidies within a province, or within a province in a specific industry, or within a province and within a specific subset of an industry. As such, any aggregation and analysis done on the whole dataframe is meaningless, and the frame needs to be split into a number of different datasets, which I have done below.

## How do we read it?
The region and industry codes are both important: they tell us which layer of aggregation a row belongs to. The first two digits of RegionCode show which province the row is associated with. For example, any row whose code starts with "10" is within Newfoundland and Labrador. If the code has only two digits, the row is aggregated to the level of the province. If it is a longer number, like 1024234, it is aggregated to the level of a town or county.
- First two digits correspond to province
- Second two digits to a regional municipality (These only exist for small towns and rural areas)
- Last three digits correspond to the specific city

In general, if this column is two digits, we are looking at aggregated data for the province. If it is 5 digits, we are looking at a CMA/CA (defined below), and if it is 7 digits, it is a small town or rural area which may be a subsection of a CMA/CA, or a rural area not included.

More information on this coding scheme here: https://www.statcan.gc.ca/eng/subjects/standard/sgc/2016/introduction
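
The digit layout described above can be sketched as a small helper. This is only an illustration that assumes the 2/2/3 layout as described in this notebook; `split_region_code` and its dictionary keys are hypothetical names that are not used elsewhere.

```python
def split_region_code(code: str) -> dict:
    """Split a RegionCode into the parts described above (hypothetical helper)."""
    if not code.isdigit():
        # e.g. "TOTAL", "rural", "10-urban"
        return {"level": "non-numeric"}
    if len(code) == 2:
        return {"level": "province", "province": code}
    if len(code) == 5:
        return {"level": "CMA/CA", "province": code[:2], "cma_ca": code[2:]}
    if len(code) == 7:
        return {
            "level": "town/rural area",
            "province": code[:2],
            "municipality": code[2:4],
            "city": code[4:],
        }
    return {"level": "unknown"}


print(split_region_code("1007023"))  # the Bonavista code that appears in the data
```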

The IndustryCode is similar. Two-digit numbers represent general industries. For example, any row whose code starts with 44 or 45 is a retail industry, and the rows with the value "44-45" in this column are aggregates of all retail businesses. Longer values, like 4456, would correspond to one specific subset of the retail industry (like "automotive retail", for example).

Information on industry codes and subsets here: https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=1181553

As such, for any analysis we wish to perform, we must subset the data appropriately so we are only looking at the level of aggregation we are interested in. If we include both the rows which have "44-45" and the rows which have "44XXX/45XXX" in the IndustryCode column, we will be counting all of those subsidies twice.
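
The double counting is easy to see on a toy example. The numbers below are made up, and the frame only mimics the retail rows of a single region and period.

```python
import pandas as pd

# Toy rows: the retail aggregate "44-45" plus three of its subsectors
toy = pd.DataFrame({
    "IndustryCode": ["44-45", "441", "442", "443"],
    "Subsidy": [600.0, 300.0, 200.0, 100.0],
})

# Sector-level rows are two digits, or a two-digit range such as "44-45"
is_sector = toy["IndustryCode"].str.fullmatch(r"\d{2}(-\d{2})?")

naive_sum = toy["Subsidy"].sum()                      # mixes both levels
sector_sum = toy.loc[is_sector, "Subsidy"].sum()      # aggregate row only
subsector_sum = toy.loc[~is_sector, "Subsidy"].sum()  # detail rows only

print(naive_sum, sector_sum, subsector_sum)
```

Summing everything reports twice the true amount, while either level on its own gives the same total.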

## What about the suppressed data?
Suppressed data seems to typically exist at the more granular levels. For example, the rows which correspond to "all manufacturing in Ontario" are unlikely to be suppressed, but "tobacco product manufacturing within St. John's" will likely be suppressed. I imagine this is because only a couple of businesses fit this description, so it would be extremely easy to work backward from the data to figure out how much money specific businesses were given (think back to data ethics).
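
The impression that suppression concentrates at granular levels could be checked by measuring the share of suppressed values per level. The sketch below uses a toy frame and takes the length of RegionCode as a rough stand-in for granularity; both are assumptions made for illustration only.

```python
import pandas as pd

# Toy rows: 'X' marks a suppressed value, as in the raw file
toy = pd.DataFrame({
    "RegionCode": ["10", "10", "10001", "10001", "1007023", "1007023"],
    "Subsidy": ["823000", "X", "5556000", "X", "X", "X"],
})

# errors="coerce" turns 'X' (and any other non-numeric text) into NaN
toy["Subsidy"] = pd.to_numeric(toy["Subsidy"], errors="coerce")

# Share of suppressed rows for each code length (2 = province, 5 = CMA/CA, 7 = town)
suppressed_share = (
    toy["Subsidy"].isna()
    .groupby(toy["RegionCode"].str.len())
    .mean()
)
print(suppressed_share)
```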

## What are the variables?
- Most are pretty self-evident, aside from the region and industry codes explained above
- CMACAFlag/ CensusLevel: One of three levels. CMA = "Census Metropolitan Area" (large city), CA = "Census Agglomeration" (smaller city/town), "Not applicable" = anything else, or the row is not at the level of county (i.e., a full province row).

From the StatCan website: https://www.statcan.gc.ca/eng/subjects/standard/sgc/2016/definitions

"A census metropolitan area (CMA) or a census agglomeration (CA) is formed by one or more adjacent municipalities centred on a population centre (known as the core). A CMA must have a total population of at least 100,000 of which 50,000 or more must live in the core, based on adjusted data from the previous census. A CA must have a core population of at least 10,000, also based on data from the previous census. To be included in the CMA or CA, other adjacent municipalities must have a high degree of integration with the core, as measured by commuting flows derived from data on place of work from the previous census.

If the population of the core of a CA falls below 10,000, the CA is retired from the next census. However, once an area becomes a CMA, it is retained as a CMA even if its total population falls below 100,000 or the population of its core falls below 50,000. All areas inside the CMA or CA that are not population centres are rural areas.

When a CA has a core of at least 50,000, based on the previous Census of Population, it is subdivided into census tracts. Census tracts are maintained for the CA even if the population of the core subsequently falls below 50,000. All CMAs are subdivided into census tracts."
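
The population thresholds in the quoted definition can be written as a short rule. This is a simplified sketch: it ignores the commuting-flow integration test and the rule that a CMA is retained once created, and `classify_area` is a hypothetical helper, not part of any StatCan tooling.

```python
def classify_area(total_population: int, core_population: int) -> str:
    """Apply only the population thresholds from the quoted definition."""
    # CMA: at least 100,000 in total, of which 50,000 or more live in the core
    if total_population >= 100_000 and core_population >= 50_000:
        return "CMA"
    # CA: core population of at least 10,000
    if core_population >= 10_000:
        return "CA"
    return "Not applicable"


print(classify_area(150_000, 60_000))
print(classify_area(40_000, 12_000))
print(classify_area(9_000, 5_000))
```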

# Additional Data Sources
I created a folder called more_data where I'm storing this stuff.
### Population data
- Two files, AgeGroups-Groupesdage-PR-PR.csv and AgeGroups-Groupesdage-CMACA-RMRAR.csv which contain population data subdivided by province and CMA/CA respectively
- Found here: https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/covid19/index-eng.cfm

### Industry Data
- gdpByIndustry.csv contains data on Canada's GDP broken down by industry, in Feb 2020 and Feb 2021.
- Found here: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=3610043402&pickMembers%5B0%5D=2.1&pickMembers%5B1%5D=3.1&cubeTimeFrame.startMonth=02&cubeTimeFrame.startYear=2021&referencePeriods=20210201%2C20210201

### Province Data
- gdpByProvince.csv is self-explanatory

In [6]:
# Copied from Vicens EDA

# Creating a copy of the database to work on and handle new values
cl_cews = cews.replace('X', np.nan)

# Renaming columns to more familiar names

# I tend to find that spaces make wrangling far more difficult, and introducing them is probably more trouble than
# it's worth

cl_cews.rename(columns={'Start_date_of_CEWS_period': 'Period',
                        'RegionCode' : 'RegionCode', 
                        'RegionName' : 'Region', 
                        'RuralUrbanFlag' : 'GeographicClassification',
                        'CMACAFlag' : 'CensusLevel', 
                        'IndustryCode': 'IndustryCode', 
                        'IndustryName': 'Industry', 
                        'Number_business_locations' : 'BusinessLocations', 
                        'Subsidy_amount': 'Subsidy', 
                        'Supported_employees' : 'SupportedEmployees', 
                        'CEWS_rehire_count': 'RehiredEmployees'}, inplace=True)

# Correcting data types
num_cols = ['BusinessLocations', 'Subsidy', 'SupportedEmployees', 'RehiredEmployees']
# np.float was removed from recent NumPy releases; the builtin float is equivalent
cl_cews[num_cols] = cl_cews[num_cols].replace(',', '', regex=True).astype(float)
cl_cews.Period = pd.to_datetime(cl_cews.Period)

# Re-mapping code values for analysis and changing data types

# This is very questionable, since the XX-XX lines seem to act as dividers between industry type, and are not actual data points
# This also doesn't really need to be int type, since it acts as a numeric index of industries...
# All the subsetting we want to do can be done through string methods (i.e., "where first two characters are 44 or 45" corresponds to 
# various subsets of "retail trade")

#dic_ind = {'31-33' : '31', '44-45' : '44', '48-49' : '48', 'TOTAL' : '1'}
#cl_cews['Industry Code'].replace(dic_ind, inplace=True)

# We probably don't actually want this as an integer, since it is a key
#cl_cews['Industry Code'] = cl_cews['Industry Code'].astype(int)


# This is also probably better accomplished with string methods during wrangling instead of altering the table.
# Example, we know that any region with first two numbers "10" corresponds to Newfoundland, "12" to Nova Scotia, etc.

# cl_cews['Region Code'] = cl_cews['Region Code'].str.split('-').str[0].str.rstrip()
#dic_reg = {'TOTAL' : '1', 'rural' : '2', 'urban' : '3'}
#cl_cews['Region Code'].replace(dic_reg, inplace=True)
#cl_cews['Region Code'] = cl_cews['Region Code'].astype(int)

# I'm not sure what the rationale for this line was, but it breaks the data by, for example, turning all towns starting with "Saint-" into one group 

#cl_cews['Region'] = cl_cews['Region'].str.split('-').str[0].str.rstrip()

cl_cews.sample(5)

Unnamed: 0,Period,RegionCode,Region,GeographicClassification,CensusLevel,IndustryCode,Industry,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
271839,2020-06-07,3560010,Kenora,URBAN,Not applicable,448,Clothing and clothing accessories stores,5.0,11000.0,9.0,0.0
52761,2020-03-15,4801006,Medicine Hat,URBAN,Not applicable,99,Other and Missing NAICS,95.0,1023000.0,499.0,5.0
355518,2020-07-05,4814019,Hinton,RURAL,Not applicable,443,Electronics and appliance stores,5.0,,,0.0
324569,2020-07-05,2473030,Bois-des-Filion,URBAN,Not applicable,236,Construction of buildings,5.0,,,0.0
150470,2020-05-10,1007023,Bonavista,RURAL,Not applicable,447,Gasoline stations,5.0,,,0.0


In [7]:
cl_cews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 562491 entries, 0 to 562490
Data columns (total 11 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   Period                    562491 non-null  datetime64[ns]
 1   RegionCode                562491 non-null  object        
 2   Region                    562491 non-null  object        
 3   GeographicClassification  562491 non-null  object        
 4   CensusLevel               562491 non-null  object        
 5   IndustryCode              562491 non-null  object        
 6   Industry                  562491 non-null  object        
 7   BusinessLocations         562491 non-null  float64       
 8   Subsidy                   182793 non-null  float64       
 9   SupportedEmployees        188423 non-null  float64       
 10  RehiredEmployees          562491 non-null  float64       
dtypes: datetime64[ns](1), float64(4), object(6)
memory usage: 47.2+ M

# Data Wrangling

- This dataset is currently not in a usable tabular format. We have TOTAL data which is included as rows, along with individual data.
- We need to break this dataset down into different datasets at each granularity.

## Full country data

In [5]:
# Summary for all of Canada
countryData = cl_cews.query("RegionCode == 'TOTAL'")
# Summary for all of Canada broken down by urban/rural 
countryUrbanRural = cl_cews.query("RegionCode == 'rural' or RegionCode == 'urban'")
countryUrbanRural.sample(5)

Unnamed: 0,Period,RegionCode,Region,GeographicClassification,CensusLevel,IndustryCode,Industry,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
438204,2020-08-02,urban,Canada,URBAN,Not applicable,522,Credit intermediation and related activities,1445.0,,12600.0,15.0
225258,2020-05-10,urban,Canada,URBAN,Not applicable,334,Computer and electronic product manufacturing,675.0,59956000.0,21430.0,50.0
225190,2020-05-10,rural,Canada,RURAL,Not applicable,51,Information and cultural industries,565.0,7777000.0,3283.0,50.0
503436,2020-08-30,rural,Canada,RURAL,Not applicable,712,Heritage institutions,110.0,1275000.0,1216.0,5.0
225178,2020-05-10,rural,Canada,RURAL,Not applicable,454,Non-store retailers,260.0,4341000.0,1585.0,60.0


In [6]:
# At granularity of province 
provinceData = cl_cews[cl_cews["RegionCode"].str.match(r"^\d{2}$")]

# Granularity of province by urban/rural
provinceUrbanRural = cl_cews[cl_cews["RegionCode"].str.match(r"^\d{2}-\w*")]
provinceUrbanRural.head()

Unnamed: 0,Period,RegionCode,Region,GeographicClassification,CensusLevel,IndustryCode,Industry,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
99,2020-03-15,10-rural,Newfoundland and Labrador - rural part,RURAL,Not applicable,11,"Agriculture, forestry, fishing and hunting",20.0,,289.0,0.0
100,2020-03-15,10-rural,Newfoundland and Labrador - rural part,RURAL,Not applicable,111,Crop production,5.0,39000.0,38.0,0.0
101,2020-03-15,10-rural,Newfoundland and Labrador - rural part,RURAL,Not applicable,112,Animal production and aquaculture,10.0,,201.0,0.0
102,2020-03-15,10-rural,Newfoundland and Labrador - rural part,RURAL,Not applicable,113,Forestry and logging,5.0,,,0.0
103,2020-03-15,10-rural,Newfoundland and Labrador - rural part,RURAL,Not applicable,114,"Fishing, hunting and trapping",0.0,,,0.0


In [8]:
# Granularity of city/ county

# When the region code is 7 digits, it references a particular town/ rural area.
countyData = cl_cews[cl_cews["RegionCode"].str.match(r"^\d{7}$")]

# If it is 5 digits, it references a CMA/CA
cityData = cl_cews[cl_cews["RegionCode"].str.match(r"^\d{5}$")]

Further wrangling needs to be done to filter by industry code before this data is useable, since there are still rows which act as aggregated data for certain sectors.

# Part 1: Group by Region Name
## Potential Questions
- In each region, how many subsidies were given out relative to the number of businesses?
- What is the distribution of subsidy amounts?
- What is the total subsidy amount? This may be impossible due to suppressed data 

In [9]:
cityData.sample(20)

Unnamed: 0,Period,RegionCode,Region,GeographicClassification,CensusLevel,IndustryCode,Industry,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
45110,2020-03-15,35595,Thunder Bay,URBAN,CMA,621,Ambulatory health care services,95.0,1098000.0,539.0,0.0
243042,2020-06-07,24406,Baie-Comeau,URBAN,CA,238,Specialty trade contractors,15.0,696000.0,247.0,0.0
242828,2020-06-07,24403,Matane,URBAN,CA,323,Printing and related support activities,0.0,,,0.0
117075,2020-04-12,35501,Cornwall,URBAN,CA,237,Heavy and civil engineering construction,5.0,,,0.0
438574,2020-08-30,10001,St. John's,URBAN,CMA,517,Telecommunications,25.0,172000.0,737.0,0.0
413342,2020-08-02,35571,Midland,URBAN,CA,238,Specialty trade contractors,15.0,227000.0,112.0,0.0
222964,2020-05-10,59935,Victoria,URBAN,CMA,491,Postal service,0.0,,,0.0
92488,2020-04-12,24462,Montréal,URBAN,CMA,62,Health care and social assistance,2775.0,30478000.0,14313.0,390.0
42570,2020-03-15,35532,Oshawa,URBAN,CMA,48-49,Transportation and warehousing,50.0,1858000.0,760.0,5.0
146762,2020-04-12,59970,Prince George,URBAN,CA,41,Wholesale trade,85.0,3677000.0,1235.0,10.0


In [10]:
# We see that at the granularity of city/county, the most subsidies were given out in Toronto, Montreal, and Calgary.

countyData.query("IndustryCode == 'TOTAL'").groupby("Region").sum(numeric_only=True).sort_values("Subsidy").tail(10)

Unnamed: 0_level_0,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Québec,37915.0,923959000.0,548780.0,3800.0
Winnipeg,40270.0,967445000.0,580463.0,3950.0
Vaughan,34905.0,1038245000.0,522958.0,3740.0
Ottawa,61195.0,1410636000.0,791687.0,6550.0
Vancouver,67860.0,1568579000.0,850570.0,7430.0
Edmonton,78560.0,1977283000.0,1040977.0,7730.0
Mississauga,60160.0,2226247000.0,1198339.0,6230.0
Calgary,106690.0,3427259000.0,1616499.0,10360.0
Montréal,117070.0,3558849000.0,1907631.0,12885.0
Toronto,199360.0,5431794000.0,2881962.0,20735.0


In [11]:
countyData.query("IndustryCode == 'TOTAL'").groupby(["Region", "Period"]).sum(numeric_only=True).sort_values("Subsidy").tail(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
Region,Period,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Mississauga,2020-04-12,8945.0,386619000.0,154632.0,950.0
Calgary,2020-08-02,12890.0,405949000.0,213065.0,905.0
Toronto,2020-08-30,21885.0,421999000.0,371709.0,1740.0
Montréal,2020-08-02,14930.0,459484000.0,262435.0,1210.0
Montréal,2020-03-15,14110.0,463035000.0,211715.0,1780.0
Calgary,2020-07-05,13530.0,468846000.0,225556.0,1265.0
Montréal,2020-07-05,15665.0,485606000.0,273017.0,1565.0
Calgary,2020-03-15,14000.0,494214000.0,199503.0,1215.0
Calgary,2020-06-07,13665.0,513877000.0,197025.0,1715.0
Montréal,2020-06-07,15455.0,534376000.0,231135.0,2100.0


In [12]:
# Total subsidies released by province in each period
provinceData.query("IndustryCode == 'TOTAL'").sort_values("Subsidy")

Unnamed: 0,Period,RegionCode,Region,GeographicClassification,CensusLevel,IndustryCode,Industry,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
562011,2020-09-27,62,Nunavut,Not applicable,Not applicable,TOTAL,All Industries,65.0,1.311000e+06,1369.0,0.0
561348,2020-09-27,60,Yukon,Not applicable,Not applicable,TOTAL,All Industries,220.0,2.007000e+06,2322.0,5.0
502997,2020-08-30,62,Nunavut,Not applicable,Not applicable,TOTAL,All Industries,105.0,2.105000e+06,1962.0,0.0
368176,2020-07-05,62,Nunavut,Not applicable,Not applicable,TOTAL,All Industries,110.0,3.847000e+06,2431.0,0.0
437670,2020-08-02,62,Nunavut,Not applicable,Not applicable,TOTAL,All Industries,110.0,3.958000e+06,2475.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
328627,2020-07-05,35,Ontario,Not applicable,Not applicable,TOTAL,All Industries,108330.0,2.919833e+09,1678079.0,10695.0
29723,2020-03-15,35,Ontario,Not applicable,Not applicable,TOTAL,All Industries,112875.0,3.080925e+09,1437847.0,10915.0
256781,2020-06-07,35,Ontario,Not applicable,Not applicable,TOTAL,All Industries,109055.0,3.406655e+09,1478984.0,14355.0
104526,2020-04-12,35,Ontario,Not applicable,Not applicable,TOTAL,All Industries,126015.0,3.699334e+09,1504965.0,13795.0


In [13]:
# Subsidies given out at each period 
provinceData.query("IndustryCode == 'TOTAL'").groupby("Period").sum(numeric_only=True).sort_values("Subsidy")

Unnamed: 0_level_0,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
Period,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2020-09-27,202310.0,2618873000.0,3154473.0,12620.0
2020-08-30,240720.0,3850674000.0,3768917.0,16345.0
2020-08-02,272555.0,6724694000.0,4163167.0,19510.0
2020-03-15,294070.0,7459996000.0,3489214.0,30520.0
2020-07-05,288945.0,7495102000.0,4322785.0,25855.0
2020-06-07,291545.0,8695344000.0,3836158.0,36460.0
2020-04-12,336595.0,9111729000.0,3743506.0,38945.0
2020-05-10,331865.0,9647709000.0,4011166.0,44420.0


In [92]:
# Essentially the same totals come out of countryData (differences are within rounding), which is a good sanity check
countryData.query("IndustryCode == 'TOTAL'").groupby("Period").sum(numeric_only=True).sort_values("Subsidy")

Unnamed: 0_level_0,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
Period,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2020-09-27,202310.0,2618872000.0,3154474.0,12620.0
2020-08-30,240720.0,3850675000.0,3768917.0,16345.0
2020-08-02,272555.0,6724693000.0,4163167.0,19510.0
2020-03-15,294070.0,7459996000.0,3489214.0,30520.0
2020-07-05,288945.0,7495102000.0,4322785.0,25855.0
2020-06-07,291545.0,8695346000.0,3836158.0,36460.0
2020-04-12,336595.0,9111729000.0,3743506.0,38945.0
2020-05-10,331865.0,9647710000.0,4011166.0,44420.0


In [104]:
# Totals are nearly twice as large when we use countyData, which implies we are missing a necessary subsetting operation 
# Some regions are counted twice, once in a CMA/CA row, and once in the more specific 7-digit region code
countyData.query("IndustryCode == 'TOTAL'").groupby("Period").sum(numeric_only=True).sort_values("Subsidy")

Unnamed: 0_level_0,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
Period,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2020-09-27,377995.0,4797011000.0,5794583.0,23745.0
2020-08-30,447925.0,7049674000.0,6905927.0,30910.0
2020-08-02,506190.0,12297230000.0,7599863.0,36840.0
2020-03-15,547145.0,13678600000.0,6409662.0,57075.0
2020-07-05,535985.0,13802490000.0,7895686.0,48665.0
2020-06-07,540660.0,15864460000.0,7025561.0,68395.0
2020-04-12,623870.0,16701660000.0,6849018.0,72835.0
2020-05-10,614050.0,17614720000.0,7319201.0,82675.0


In [106]:
# This is closer, but still off
countyData.query("IndustryCode == 'TOTAL' and (CensusLevel != 'CMA' and CensusLevel != 'CA')").groupby("Period").sum(numeric_only=True).sort_values("Subsidy")

Unnamed: 0_level_0,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
Period,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2020-09-27,202475.0,2494895000.0,3024864.0,12620.0
2020-08-30,240950.0,3696197000.0,3623024.0,16345.0
2020-08-02,272810.0,6470056000.0,4003707.0,19510.0
2020-03-15,294335.0,7141531000.0,3357020.0,30530.0
2020-07-05,289205.0,7291501000.0,4158405.0,25860.0
2020-06-07,291795.0,8355276000.0,3700352.0,36480.0
2020-04-12,336865.0,8777934000.0,3610968.0,38950.0
2020-05-10,332135.0,9303807000.0,3872989.0,44450.0


In [46]:
provinceData.query("IndustryCode != 'TOTAL'").sort_values("Subsidy")

Unnamed: 0,Period,RegionCode,Region,GeographicClassification,CensusLevel,IndustryCode,Industry,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
70669,2020-03-15,62,Nunavut,Not applicable,Not applicable,454,Non-store retailers,0.0,0.0,0.0,0.0
436845,2020-08-02,60,Yukon,Not applicable,Not applicable,444,Building material and garden equipment and sup...,0.0,0.0,0.0,0.0
272026,2020-06-07,46,Manitoba,Not applicable,Not applicable,313,Textile mills,0.0,0.0,,0.0
256736,2020-06-07,35,Ontario,Not applicable,Not applicable,482,Rail transportation,0.0,0.0,0.0,0.0
367780,2020-07-05,61,Northwest Territories,Not applicable,Not applicable,448,Clothing and clothing accessories stores,5.0,5000.0,20.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
562006,2020-09-27,62,Nunavut,Not applicable,Not applicable,71,"Arts, entertainment and recreation",0.0,,,0.0
562007,2020-09-27,62,Nunavut,Not applicable,Not applicable,712,Heritage institutions,0.0,,,0.0
562008,2020-09-27,62,Nunavut,Not applicable,Not applicable,72,Accommodation and food services,5.0,,120.0,0.0
562009,2020-09-27,62,Nunavut,Not applicable,Not applicable,721,Accommodation services,5.0,,,0.0


# Part 2: Group by Industry

## Potential Questions

- Which industries received the most subsidies? Which received the least?

- What is the interaction between industry and region?

- How many employees were supported within each industry?
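
A first pass at these questions could rank industries at a single aggregation level, so that sector and subsector rows are never mixed. The sketch below uses a toy frame with made-up numbers; `SubsidyPerEmployee` is a hypothetical derived column.

```python
import pandas as pd

# Toy rows across two periods; "441" is a subsector and must be excluded
toy = pd.DataFrame({
    "Industry": ["Retail trade", "Motor vehicle and parts dealers", "Construction",
                 "Retail trade", "Construction", "Wholesale trade"],
    "IndustryCode": ["44-45", "441", "23", "44-45", "23", "41"],
    "Subsidy": [600.0, 250.0, 900.0, 400.0, 300.0, 500.0],
    "SupportedEmployees": [300.0, 100.0, 200.0, 200.0, 100.0, 100.0],
})

# Keep sector-level rows only: two digits, or a range such as "44-45"
sectors = toy[toy["IndustryCode"].str.fullmatch(r"\d{2}(-\d{2})?")]

by_industry = (
    sectors.groupby("Industry")[["Subsidy", "SupportedEmployees"]]
    .sum()
    .sort_values("Subsidy", ascending=False)
)
by_industry["SubsidyPerEmployee"] = by_industry["Subsidy"] / by_industry["SupportedEmployees"]
print(by_industry)
```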

In [60]:
# Look only at retail
# We should try and find a directory for these codes.

countyData[countyData["IndustryCode"].str.match("^4[45]")]

Unnamed: 0,Period,RegionCode,Region,GeographicClassification,CensusLevel,IndustryCode,Industry,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
327,2020-03-15,10001,St. John's,URBAN,CMA,44-45,Retail trade,340.0,5556000.0,2976.0,55.0
328,2020-03-15,10001,St. John's,URBAN,CMA,441,Motor vehicle and parts dealers,45.0,2344000.0,994.0,10.0
329,2020-03-15,10001,St. John's,URBAN,CMA,442,Furniture and home furnishings stores,20.0,224000.0,154.0,0.0
330,2020-03-15,10001,St. John's,URBAN,CMA,443,Electronics and appliance stores,10.0,65000.0,37.0,5.0
331,2020-03-15,10001,St. John's,URBAN,CMA,444,Building material and garden equipment and sup...,15.0,370000.0,177.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
561951,2020-09-27,6106023,Yellowknife,URBAN,Not applicable,447,Gasoline stations,0.0,0.0,0.0,0.0
561952,2020-09-27,6106023,Yellowknife,URBAN,Not applicable,448,Clothing and clothing accessories stores,5.0,5000.0,,0.0
561953,2020-09-27,6106023,Yellowknife,URBAN,Not applicable,451,"Sporting goods, hobby, book and music stores",0.0,,,0.0
561954,2020-09-27,6106023,Yellowknife,URBAN,Not applicable,453,Miscellaneous store retailers,5.0,,18.0,0.0


In [63]:
# Example: Total subsidies within each province for all retail businesses
provinceData[provinceData["IndustryCode"].str.match("^44-45")].groupby("Region").sum(numeric_only=True)

Unnamed: 0_level_0,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Alberta,35830.0,586371000.0,387571.0,3565.0
British Columbia,45205.0,629746000.0,411924.0,6330.0
Manitoba,8540.0,114763000.0,81172.0,775.0
New Brunswick,7450.0,80123000.0,58537.0,895.0
Newfoundland and Labrador,5885.0,57839000.0,40344.0,900.0
Northwest Territories,255.0,3957000.0,2063.0,5.0
Nova Scotia,8805.0,112491000.0,76483.0,1245.0
Nunavut,70.0,502000.0,801.0,0.0
Ontario,109805.0,1592659000.0,1091386.0,13975.0
Prince Edward Island,1815.0,18406000.0,14109.0,300.0


# Part 3: Group by geographic classification

Note that this was already researched somewhat extensively in the report StatCan did, at least in regard to rural classifications.
## Potential questions

In [96]:
# Example: Compare total subsidies in each province by rural/urban split
provinceUrbanRural[provinceUrbanRural["IndustryCode"] == "TOTAL"]

Unnamed: 0,Period,RegionCode,Region,GeographicClassification,CensusLevel,IndustryCode,Industry,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees
188,2020-03-15,10-rural,Newfoundland and Labrador - rural part,RURAL,Not applicable,TOTAL,All Industries,1150.0,17397000.0,8735.0,130.0
283,2020-03-15,10-urban,Newfoundland and Labrador - urban part,URBAN,Not applicable,TOTAL,All Industries,2685.0,54290000.0,28016.0,255.0
2718,2020-03-15,11-rural,Prince Edward Island - rural part,RURAL,Not applicable,TOTAL,All Industries,425.0,5608000.0,2985.0,30.0
2800,2020-03-15,11-urban,Prince Edward Island - urban part,URBAN,Not applicable,TOTAL,All Industries,980.0,17205000.0,9442.0,85.0
4101,2020-03-15,12-rural,Nova Scotia - rural part,RURAL,Not applicable,TOTAL,All Industries,2020.0,28167000.0,15461.0,170.0
...,...,...,...,...,...,...,...,...,...,...,...
561373,2020-09-27,60-rural,Yukon - rural part,RURAL,Not applicable,TOTAL,All Industries,35.0,222000.0,235.0,0.0
561432,2020-09-27,60-urban,Yukon - urban part,URBAN,Not applicable,TOTAL,All Industries,185.0,1784000.0,2087.0,5.0
561736,2020-09-27,61-rural,Northwest Territories - rural part,RURAL,Not applicable,TOTAL,All Industries,65.0,2757000.0,1551.0,0.0
561792,2020-09-27,61-urban,Northwest Territories - urban part,URBAN,Not applicable,TOTAL,All Industries,130.0,2025000.0,1972.0,10.0
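
To compare the two parts of each province side by side, the rural/urban rows can be pivoted into columns. This is a minimal sketch on a toy frame with made-up numbers; `ProvinceCode`, `Part` and `UrbanShare` are hypothetical column names.

```python
import pandas as pd

# Toy rows shaped like the province-level urban/rural totals
toy = pd.DataFrame({
    "RegionCode": ["10-rural", "10-urban", "11-rural", "11-urban"],
    "Subsidy": [100.0, 300.0, 50.0, 150.0],
})

# "10-rural" -> province code "10" and part "rural"
toy[["ProvinceCode", "Part"]] = toy["RegionCode"].str.split("-", n=1, expand=True)

# One row per province, one column per part
split = toy.pivot_table(index="ProvinceCode", columns="Part", values="Subsidy", aggfunc="sum")
split["UrbanShare"] = split["urban"] / (split["rural"] + split["urban"])
print(split)
```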


# Try to make all data easily subsettable in one dataframe

In [116]:
# Add columns which indicate the level of abstraction
cl_cews["GeoAggregation"] = np.nan
cl_cews["IndustryAggregation"] = np.nan

In [135]:
import regex as re

True

In [174]:
# Function to decide abstraction level of geography
import re

def getGeo(RegionCode):
    # 2 digits indicates a province
    if re.fullmatch(r"\d{2}", RegionCode):
        return "Province"

    # When the region code is 7 digits, it references a particular town/ rural area.
    elif re.fullmatch(r"\d{7}", RegionCode):
        return "CSD"

    # If it is 5 digits, it references a CMA/CA
    elif re.fullmatch(r"\d{5}", RegionCode):
        # Codes such as 60000 are "rural part excluding undetermined CSDs" rows
        if re.fullmatch(r"6\d000", RegionCode):
            return "Unknown Rural"
        else:
            return "CMA/CA"

    # Urban or rural split within a province, e.g. "10-rural"
    elif re.fullmatch(r"\d{2}-\w*", RegionCode):
        return "urban/rural by province"

    # Full-country rows: "TOTAL", or the urban/rural split for the whole country
    elif re.fullmatch(r"\w*", RegionCode):
        if RegionCode == "TOTAL":
            return "Canada"
        else:
            return "urban/rural by country"

    # Any other code falls through and returns None

cl_cews["GeoAggregation"] = cl_cews["RegionCode"].apply(getGeo)

In [175]:
# Function to decide abstraction of Industry

def getIndustry(IndustryCode):
    # All industries in the region
    if IndustryCode == "TOTAL":
        return "All industries"

    # Level 1 classifications: two digits, or a two-digit range such as "44-45"
    elif re.fullmatch(r"\d{2}(-\d{2})?", IndustryCode):
        return "Level 1"

    # Level 2 subclassifications: three digits
    elif re.fullmatch(r"\d{3}", IndustryCode):
        return "Level 2"

    # Any other code falls through and returns None

cl_cews["IndustryAggregation"] = cl_cews["IndustryCode"].apply(getIndustry)
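
With more than half a million rows, a vectorized labeller may be faster than `apply`. The sketch below is one possible alternative using `numpy.select` on a toy Series; note that unmatched codes are labelled "Unlabelled" here, which is an assumption that differs from the `None` returned by the function above.

```python
import numpy as np
import pandas as pd

# Toy codes; "9999" stands for a value that matches none of the rules
codes = pd.Series(["TOTAL", "11", "44-45", "111", "9999"])

# One boolean mask per rule, checked in order
conditions = [
    (codes == "TOTAL").to_numpy(dtype=bool),
    codes.str.fullmatch(r"\d{2}(-\d{2})?").to_numpy(dtype=bool),
    codes.str.fullmatch(r"\d{3}").to_numpy(dtype=bool),
]
labels = ["All industries", "Level 1", "Level 2"]

industry_level = pd.Series(
    np.select(conditions, labels, default="Unlabelled"),
    index=codes.index,
)
print(industry_level.tolist())
```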

In [176]:
set(cl_cews["GeoAggregation"])

{'CMA/CA',
 'CSD',
 'Canada',
 'Province',
 'Unknown Rural',
 'urban/rural by country',
 'urban/rural by province'}

In [171]:
cl_cews[cl_cews["RegionCode"] == "TOTAL"]

Unnamed: 0,Period,RegionCode,Region,GeographicClassification,CensusLevel,IndustryCode,Industry,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees,GeoAggregation,IndustryAggregation
70943,2020-03-15,TOTAL,Canada,Not applicable,Not applicable,11,"Agriculture, forestry, fishing and hunting",3730.0,7.810100e+07,35240.0,165.0,Canada,Level 1
70944,2020-03-15,TOTAL,Canada,Not applicable,Not applicable,111,Crop production,1405.0,3.662100e+07,17555.0,60.0,Canada,Level 2
70945,2020-03-15,TOTAL,Canada,Not applicable,Not applicable,112,Animal production and aquaculture,975.0,1.657700e+07,7212.0,40.0,Canada,Level 2
70946,2020-03-15,TOTAL,Canada,Not applicable,Not applicable,113,Forestry and logging,545.0,1.389700e+07,5418.0,30.0,Canada,Level 2
70947,2020-03-15,TOTAL,Canada,Not applicable,Not applicable,114,"Fishing, hunting and trapping",335.0,2.301000e+06,1305.0,15.0,Canada,Level 2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
562271,2020-09-27,TOTAL,Canada,Not applicable,Not applicable,72,Accommodation and food services,30000.0,2.933060e+08,581203.0,3175.0,Canada,Level 1
562272,2020-09-27,TOTAL,Canada,Not applicable,Not applicable,721,Accommodation services,3745.0,8.656600e+07,89693.0,320.0,Canada,Level 2
562273,2020-09-27,TOTAL,Canada,Not applicable,Not applicable,722,Food services and drinking places,26255.0,2.067400e+08,491511.0,2855.0,Canada,Level 2
562274,2020-09-27,TOTAL,Canada,Not applicable,Not applicable,99,Other and Missing NAICS,24235.0,1.752390e+08,212785.0,1360.0,Canada,Level 1


In [173]:
cl_cews.query("GeoAggregation == 'Province' and Industry == 'Forestry and logging'")

Unnamed: 0,Period,RegionCode,Region,GeographicClassification,CensusLevel,IndustryCode,Industry,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees,GeoAggregation,IndustryAggregation
3,2020-03-15,10,Newfoundland and Labrador,Not applicable,Not applicable,113,Forestry and logging,5.0,,,0.0,Province,Level 2
2565,2020-03-15,11,Prince Edward Island,Not applicable,Not applicable,113,Forestry and logging,5.0,22000.0,9.0,0.0,Province,Level 2
3906,2020-03-15,12,Nova Scotia,Not applicable,Not applicable,113,Forestry and logging,30.0,312000.0,136.0,0.0,Province,Level 2
6637,2020-03-15,13,New Brunswick,Not applicable,Not applicable,113,Forestry and logging,50.0,,235.0,5.0,Province,Level 2
10515,2020-03-15,24,Quebec,Not applicable,Not applicable,113,Forestry and logging,170.0,2395000.0,987.0,15.0,Province,Level 2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
528434,2020-09-27,35,Ontario,Not applicable,Not applicable,113,Forestry and logging,20.0,291000.0,,0.0,Province,Level 2
541820,2020-09-27,46,Manitoba,Not applicable,Not applicable,113,Forestry and logging,0.0,,,0.0,Province,Level 2
543862,2020-09-27,47,Saskatchewan,Not applicable,Not applicable,113,Forestry and logging,0.0,,,0.0,Province,Level 2
547169,2020-09-27,48,Alberta,Not applicable,Not applicable,113,Forestry and logging,25.0,,556.0,5.0,Province,Level 2


In [159]:
cl_cews.query("GeoAggregation == 'Unknown Rural' and IndustryAggregation == 'Level 1' and CensusLevel == 'Not applicable'")

Unnamed: 0,Period,RegionCode,Region,GeographicClassification,CensusLevel,IndustryCode,Industry,BusinessLocations,Subsidy,SupportedEmployees,RehiredEmployees,GeoAggregation,IndustryAggregation
69989,2020-03-15,60000,Yukon - rural part excluding undetermined CSDs,RURAL,Not applicable,21,"Mining, quarrying, and oil and gas extraction",0.0,,,0.0,Unknown Rural,Level 1
69992,2020-03-15,60000,Yukon - rural part excluding undetermined CSDs,RURAL,Not applicable,23,Construction,5.0,74000.0,27.0,0.0,Unknown Rural,Level 1
69995,2020-03-15,60000,Yukon - rural part excluding undetermined CSDs,RURAL,Not applicable,41,Wholesale trade,0.0,,,0.0,Unknown Rural,Level 1
69997,2020-03-15,60000,Yukon - rural part excluding undetermined CSDs,RURAL,Not applicable,44-45,Retail trade,10.0,67000.0,31.0,5.0,Unknown Rural,Level 1
70003,2020-03-15,60000,Yukon - rural part excluding undetermined CSDs,RURAL,Not applicable,48-49,Transportation and warehousing,0.0,,,0.0,Unknown Rural,Level 1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
562054,2020-09-27,62000,Nunavut - rural part excluding undetermined CSDs,RURAL,Not applicable,55,Management of companies and enterprises,0.0,,,0.0,Unknown Rural,Level 1
562055,2020-09-27,62000,Nunavut - rural part excluding undetermined CSDs,RURAL,Not applicable,56,"Administrative and support, waste management a...",0.0,,,0.0,Unknown Rural,Level 1
562057,2020-09-27,62000,Nunavut - rural part excluding undetermined CSDs,RURAL,Not applicable,62,Health care and social assistance,5.0,,,0.0,Unknown Rural,Level 1
562060,2020-09-27,62000,Nunavut - rural part excluding undetermined CSDs,RURAL,Not applicable,71,"Arts, entertainment and recreation",0.0,,,0.0,Unknown Rural,Level 1


In [177]:
# Export dataset

cl_cews.to_csv("CEWS_with_subset_labels.csv")
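
When the exported file is read back, it may help to pin the code columns to strings and parse the dates explicitly, since a subset containing only numeric codes would otherwise be read as integers. The sketch below round-trips a toy frame through an in-memory buffer rather than the real file; writing with `index=False` is a choice made here so that no unnamed index column appears on reload.

```python
import io

import pandas as pd

# Toy frame with the same kinds of columns as the exported dataset
toy = pd.DataFrame({
    "Period": pd.to_datetime(["2020-03-15", "2020-03-15"]),
    "RegionCode": ["10", "1007023"],
    "IndustryCode": ["11", "447"],
    "Subsidy": [823000.0, None],
})

# Write to an in-memory buffer instead of disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Keep codes as strings and restore the datetime column
restored = pd.read_csv(
    buffer,
    dtype={"RegionCode": str, "IndustryCode": str},
    parse_dates=["Period"],
)
print(restored.dtypes)
```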

In [None]:
# Incorporate other data sources

- Find sources for population, GDP, etc so we can find per-capita values
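
Once a population table is available, per-capita values come down to a merge and a division. The sketch below uses two toy frames with made-up numbers and hypothetical column names (`Population`, `SubsidyPerCapita`); the real population files will likely need renaming and cleaning first.

```python
import pandas as pd

# Toy province-level totals and populations (made-up values)
subsidies = pd.DataFrame({
    "Region": ["Province A", "Province B"],
    "Subsidy": [1_000_000.0, 250_000.0],
})
population = pd.DataFrame({
    "Region": ["Province A", "Province B"],
    "Population": [500_000, 50_000],
})

# validate raises if a region appears more than once on either side
merged = subsidies.merge(population, on="Region", how="left", validate="one_to_one")
merged["SubsidyPerCapita"] = merged["Subsidy"] / merged["Population"]
print(merged)
```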