![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fdata-viz-of-the-week&branch=main&subPath=price-comparison-Canada/price-comparison-Canada.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Callysto’s Weekly Data Visualization

## Price Comparison Across Canada 

### Recommended Grade levels: 6-12
<br>

### Instructions
#### “Run” the cells to see the graphs
Click “Cell” and select “Run All”.<br> This will import the data and run all the code, so you can see this week's data visualization. Scroll to the top after you’ve run the cells.<br> 

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

**You don’t need to do any coding to view the visualizations**.
The plots generated in this notebook are interactive. You can hover over and click on elements to see more information. 

Email contact@callysto.ca if you experience issues.

### About this Notebook

Callysto's Weekly Data Visualization is a learning resource that aims to develop data literacy skills. We provide Grades 5-12 teachers and students with a data visualization, like a graph, to interpret. This companion resource walks learners through how the data visualization is created and interpreted by a data scientist. 

The steps of the data analysis process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer? 
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data, so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer the question. This includes creating visualizations. 
5. Interpret - Describe what's happening in the data visualization. 
6. Communicate - Explain how the evidence answers the question. 

# Questions

**Have the prices of different products increased significantly in recent months?**<br>
**How do recent prices compare with those at the beginning of the COVID-19 health crisis?**<br>
**How do the price changes compare across the provinces of Canada?**

As of May 2022, prices in Canadian grocery stores are soaring, and the trend looks set to continue. The surge is attributed to a number of factors, including the effects of COVID-19, serious supply chain constraints and ongoing climate change, among others. Does the available data back this statement? If so, have all the provinces in Canada been affected in the same way?


### Goal
Our goal is to investigate how the prices of different products have changed across Canada in recent years. We will use interactive line graphs and choropleth maps to visualize and analyze the price changes.

# Gather

### Code:
The code below will import the Python programming libraries we need to gather and organize the data to answer our questions.

In [None]:
%pip install -q pyodide_http plotly nbformat geopandas pyproj
import pyodide_http
pyodide_http.patch_all()
import pandas as pd
import geopandas as gpd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import pyproj
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
from IPython.display import Markdown as md

### Data:

The following two datasets are available from [Statistics Canada](https://www.statcan.gc.ca/en/start):

[Dataset 1](https://open.canada.ca/data/en/dataset/8015bcc6-401d-4927-a447-bb35d5dfcc91): data for Canada and provinces starting January 2017

[Dataset 2](https://open.canada.ca/data/en/dataset/8e35c016-87a6-4dd5-b089-e6dfc9bc0e76): historical data only for Canada starting January 1995

The prices reported in these two datasets are retail prices including tax. Both datasets can be downloaded in CSV (comma-separated values) format. We will use the first dataset in this notebook because we want to look at recent changes in prices. The data can be read in two ways, as detailed below:

- Method-1: Read data directly from the web address. This is the default method as the dataset is continuously updated. The url and/or the csv file name may need to be updated.

- Method-2: If Method-1 does not work for any reason, we read a saved copy of the csv file from the Callysto GitHub repository instead. This copy is not updated automatically, so it may be older than the file published on the web.
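
The read-with-fallback idea behind the two methods can be tried without any network access. The following is only a minimal sketch using the Python standard library: the archive is built in memory, and the file name `example.csv`, the two rows and the helper names `rows_from_zip` and `load_with_fallback` are made up for illustration. They are not part of the notebook or of the Statistics Canada data.

```python
import csv
import io
import zipfile


def rows_from_zip(zip_bytes, member_name):
    """Return the rows of a CSV file stored inside a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


def load_with_fallback(primary, fallback):
    """Try the primary loader; if it raises, use the fallback loader instead."""
    try:
        return primary(), "primary"
    except Exception:
        return fallback(), "fallback"


# Build a tiny zip archive in memory (hypothetical file name and rows).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "example.csv",
        "REF_DATE,GEO,VALUE\n2017-01,Canada,1.50\n2017-02,Canada,1.55\n",
    )
zip_bytes = buffer.getvalue()

# Normal case: the primary loader succeeds.
rows, source = load_with_fallback(
    lambda: rows_from_zip(zip_bytes, "example.csv"),
    lambda: [],
)


def failing_loader():
    """Stand-in for a download that does not work."""
    raise OSError("simulated download failure")


# Failure case: the primary loader raises, so the fallback is used.
backup_rows, backup_source = load_with_fallback(failing_loader, lambda: rows)
```

Returning the name of the loader that succeeded plays the same role as the `method1` flag in the notebook: it lets later code report which source the data came from.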

### Import the data

In [None]:
method1 = True
try: # try method-1
    print("Try to read the most up-to-date published data on the web ...")
    # url of the zip file that includes the data file
    url = urlopen("https://www150.statcan.gc.ca/n1/tbl/csv/18100245-eng.zip")
    # download Zipfile and create pandas DataFrame from the csv file inside it
    zipfile = ZipFile(BytesIO(url.read()))
    product_data = pd.read_csv(zipfile.open('18100245.csv'))
except Exception: # if that fails, read the copy of the file saved in the GitHub repo
    method1 = False
    print("Failed to read the most recent published data on the web. Reading the csv file on GitHub ...")
    # create pandas DataFrame from the csv file saved on GitHub
    product_data = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/price-comparison-Canada/data/18100245.csv')
    
if method1: print("The most up-to-date data is successfully loaded.")

### Comment on the data
Let's look at the dataset structure, the number of samples and some sample data.

In [None]:
# look at the dimensions of the dataset
n_samples, n_attributes = product_data.shape[0], product_data.shape[1]
md("There are $%i$ data samples each having $%i$ attributes shown in columns."%(n_samples, n_attributes))

In [None]:
# look at some data samples
product_data.head(5)

Let's look at the unique dates, regions and products for which we have data in this dataset:

In [None]:
n_dates, n_regions, n_products = product_data['REF_DATE'].nunique(), product_data['GEO'].nunique(), product_data['Products'].nunique()
md("There are $%i$ unique dates, $%i$ unique regions and $%i$ unique products in the dataset."%(n_dates, n_regions, n_products))

In [None]:
# unique dates
product_data['REF_DATE'].unique()

In [None]:
md("The data is available from $%s$ to $%s$"%(product_data['REF_DATE'].unique()[0], product_data['REF_DATE'].unique()[-1]))

In [None]:
# unique regions
product_data['GEO'].unique()

Prices are reported for the 10 provinces. The average prices for *Canada* are also reported. Unfortunately, the prices for the three territories (*Northwest Territories*, *Nunavut* and *Yukon*) are missing from this dataset.

In [None]:
# unique products
product_data['Products'].unique()

Finally, we see that a significant portion of the products in this dataset are food items.

In the next section, we will look more closely at the dataset and prepare it for our analysis.

# Organize

The code below will tidy the data so that we can analyze it. This is a quality control step: we examine the data to detect anything odd (e.g. structure, missing values), fix the oddities, and check that the fixes worked. Also, based on the information from the *Gather* section, we will modify the dataset in a few steps to make it ready for the analysis that answers our questions.
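
One way to detect missing values of the kind mentioned above is to compare the records that are present with every possible combination of date, region and product. The following is a minimal sketch in plain Python, assuming a handful of made-up records; the values are placeholders, not rows from the dataset.

```python
import itertools

# Made-up records (date, region, product) for illustration only.
records = [
    ("2017-01", "Canada", "Milk, 1 litre"),
    ("2017-01", "Alberta", "Milk, 1 litre"),
    ("2017-01", "Canada", "Milk, 4 litres"),
    ("2017-02", "Canada", "Milk, 1 litre"),
    ("2017-02", "Alberta", "Milk, 1 litre"),
    ("2017-02", "Canada", "Milk, 4 litres"),
]

# Unique values of each attribute.
dates = sorted({record[0] for record in records})
regions = sorted({record[1] for record in records})
products = sorted({record[2] for record in records})

# Every combination that a complete dataset would contain.
expected = set(itertools.product(dates, regions, products))
n_expected = len(dates) * len(regions) * len(products)

# Combinations that are expected but not present.
missing = sorted(expected - set(records))

print(n_expected, "combinations expected,", len(records), "records found")
for combination in missing:
    print("missing:", combination)
```

If the number of records equals the number of expected combinations, nothing is missing. Otherwise, the `missing` list shows exactly which product and region lack data.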

### Identify irrelevant information

Looking at the column names (attributes), the attributes of the dataset that are potentially relevant to our investigation are `REF_DATE`, `GEO`, `Products`, `UOM` and `VALUE`. As such, there is a lot of information in the dataset that can be removed. Among these $5$ attributes, we can further check whether the same unit of measure (the `UOM` column) is used for all data samples:

In [None]:
# unique units of measure
product_data['UOM'].unique()

Only $1$ unit of measure (*Canadian dollars*) is used for reporting the prices and thus, the `UOM` column can also be removed from the dataset.

### Check for potential missing values


In [None]:
n_combinations = n_dates * n_regions * n_products
md("We next check for any missing data. Comparing the number of possible combinations of unique dates, unique regions and unique products ($%i * %i * %i = %i$) with the total sample size ($%i$), we find that there is one product (`Milk, 4 litres`) for which the prices are not reported in one region (`Newfoundland and Labrador`). Since there are other similar products in the dataset (`Milk, 1 litre` and `Milk, 2 litres`), we can simply remove this product from our dataset."%(n_dates, n_regions, n_products, n_combinations, n_samples))

### Modify data

To simplify the data visualization process, we drop unwanted columns and rows and rename some columns:

In [None]:
# modify original data
product_data.drop(columns=['DGUID', 'UOM', 'UOM_ID', 'SCALAR_FACTOR', 'SCALAR_ID', 'STATUS', 'SYMBOL', 
                           'TERMINATED', 'DECIMALS', 'VECTOR', 'COORDINATE'], inplace=True)
product_data.rename(columns={'REF_DATE':'Date', 'VALUE':'Price', 'GEO':'Region', 'Products':'Product'}, inplace=True)
product_data.drop(product_data[product_data['Product'] == 'Milk, 4 litres'].index, inplace=True)
product_data.reset_index(drop=True, inplace=True)
# create new dataframe without the Canada prices for the choropleth map
product_data_prov = product_data.loc[product_data['Region'] != 'Canada'].reset_index(drop=True)
product_data.head()

We removed some columns and rows that were not needed for our data analysis. We also renamed some data attributes to access the data more easily without the need to check the attributes' names. Finally, to create choropleth maps for which only the province data is relevant, we needed to create a copy of the original dataset with the data pertaining to Canada removed.

# Explore

The code cells below will be used to visualize the data in different forms to help us answer our questions.

### Price changes over time for each province

First, we look at how the price of each product has changed over recent years in each province. You can select a specific product from the dropdown menu. The products are ordered alphabetically. The easiest way to navigate through the menu items is to drag the scroll bar. To show or hide the prices for a province, click on its name in the legend.
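
The dropdown menu works by switching individual traces (lines) on and off. As a minimal sketch of that idea, assuming three hypothetical products and three regions (the names are placeholders, not values from the dataset), the visibility list for one menu entry could be built like this. No plotting library is needed to follow the logic:

```python
products = ["Apples", "Bananas", "Carrots"]   # hypothetical product names
regions = ["Canada", "Alberta", "Ontario"]    # hypothetical subset of regions
default_region = "Alberta"


def visibility_for_product(selected_product):
    """Return one entry per trace.

    True         -> the line is drawn
    'legendonly' -> the line is hidden but can be switched on from the legend
    False        -> the line is not shown at all
    """
    visibility = []
    for product in products:        # traces are created product by product...
        for region in regions:      # ...and region by region within each product
            if product != selected_product:
                visibility.append(False)
            elif region == default_region:
                visibility.append(True)
            else:
                visibility.append("legendonly")
    return visibility


mask = visibility_for_product("Bananas")
print(mask)
```

The list has one entry for every product and region pair, in the same order in which the traces are added to the figure, so each menu button only needs to hand its own list to the plot.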

In [None]:
default_region = 'Alberta'
regions = product_data['Region'].unique()
products = np.sort(product_data['Product'].unique())
default_product = products[0]
markers = ["circle","square","diamond","cross","x","triangle-up","triangle-down","triangle-left","triangle-right","star","hourglass"]

# plot for all combinations of products and regions
fig = go.Figure()
for product in products:
    for iregion, region in enumerate(regions):
        if region == 'Canada':
            color = px.colors.sequential.gray[0]
        else:
            color = px.colors.sequential.Viridis[iregion - 1]
        product_data_trimmed = product_data[(product_data["Region"] == region) & 
                                            (product_data['Product'] == product)]
        
        fig.add_trace(go.Scatter(x=product_data_trimmed["Date"], y=product_data_trimmed["Price"], 
                                 mode="lines+markers", name=region, uid=product + ' (' + region + ')',
                                 line=dict(color=color), marker_symbol=markers[iregion], marker_size=10))

# set the default view: only show the graph for the default product and default region
fig.for_each_trace(
    lambda trace: trace.update(visible='legendonly') if (default_product in trace.uid) 
    else (trace.update(visible=False)))

fig.for_each_trace(
    lambda trace: trace.update(visible=True) if ((default_product in trace.uid) & (default_region in trace.uid)) else ())
#

# create buttons for all products
product_buttons = []
n_traces = len(products) * len(regions)
for id_prod, product in enumerate(products):
    vis = [False] * n_traces
    vis_update = list(regions == default_region)
    vis_update = ['legendonly' if i == False else i for i in vis_update]
    vis[(id_prod * len(regions)):((id_prod + 1) * len(regions))] = vis_update
    product_buttons.append(dict(label=product, method='update',
                                args=[{"visible":vis}, {"title": "Price of " + product}]))

fig.update_layout(xaxis_title="Date", yaxis_title="Price (CAD)",
                  title_text="Price of " + default_product, hovermode="x unified", 
                  yaxis={"tickformat":"0.2f"}, legend=dict(bgcolor='rgba(0,0,0,0)'),
                  updatemenus=[dict(active=0, buttons=product_buttons, x=0.25, y=1)])
#fig.write_html("visualizations/fig1-PricesOverTime.html")
fig.show()

After playing around with the above plot, you may agree that looking at absolute prices is not the best way to investigate the data. For instance, a quick look at the price of `Apple juice, 2 litres` might suggest a huge drop in August 2021; however, a closer look at the y-axis values shows that the decrease was only about 20%. It is also hard to tell which product was more affected at a given time. For instance, the price of `Oranges` changed much more dramatically than that of `Apple juice, 2 litres` and has almost doubled since January 2017, but this is not obvious from a quick look at the above graphs.

A better way to investigate the data is to plot the price of each product normalized by (divided by) its reference value. Here, we take the Canadian average price in January 2017 as the reference value. The code below creates a modified plot.

In [None]:
regions = product_data['Region'].unique()
products = np.sort(product_data['Product'].unique())
default_prod = products[0]
default_reg = 'Alberta'
ref_reg = 'Canada'
ref_date = '2017-01'
markers = ["circle","square","diamond","cross","x","triangle-up","triangle-down","triangle-left","triangle-right","star","hourglass"]

# plot for all combinations of products and regions
ref_prices = {}
fig = go.Figure()
for product in products:
    ref_price = product_data.loc[(product_data['Date'] == ref_date) &
                                 (product_data["Region"] == ref_reg) &
                                 (product_data['Product'] == product), 'Price'].iloc[0]
    ref_prices.update({product:ref_price})
    for iregion, region in enumerate(regions):
        if region == 'Canada':
            color = px.colors.sequential.gray[0]
        else:
            color = px.colors.sequential.Viridis[iregion - 1]
        product_data_trimmed = product_data[(product_data["Region"] == region) &
                                            (product_data['Product'] == product)]

        fig.add_trace(go.Scatter(x=product_data_trimmed["Date"], y=product_data_trimmed["Price"].div(ref_price),
                                 mode="lines+markers", name=region, uid=product + ' (' + region + ')',
                                 line=dict(color=color), marker_symbol=markers[iregion], marker_size=10))

# set the default view: only show the graph for the default product and default region
fig.for_each_trace(
    lambda trace: trace.update(visible='legendonly') if (default_prod in trace.uid)
    else (trace.update(visible=False)))

fig.for_each_trace(
    lambda trace: trace.update(visible=True) if ((default_prod in trace.uid) & (default_reg in trace.uid)) else ())

# create buttons for all products
product_buttons = []
n_traces = len(products) * len(regions)
for id_prod, product in enumerate(products):
    vis = [False] * n_traces
    vis_update = list(regions == default_reg)
    vis_update = ['legendonly' if i == False else i for i in vis_update]
    vis[(id_prod * len(regions)):((id_prod + 1) * len(regions))] = vis_update

    product_buttons.append(dict(label=product, method='update',
                                args=[{"visible":vis}, {"title": "Normalized Price of " + product +
                                                        "<br>(" + ref_reg + " Average on " + ref_date + "=" +
                                                        str(ref_prices[product]) + " CAD)"}]))

fig.update_layout(xaxis_title="Date", yaxis_title="Normalized Price",
                  title_text="Normalized Price of " + default_prod +
                  "<br>(" + ref_reg + " Average on " + ref_date + "=" + str(ref_prices[default_prod]) + " CAD)",
                  hovermode="x unified", height=600,
                  yaxis={"tickformat":"0.2f"}, legend=dict(x=1.0, y=1, bgcolor='rgba(0,0,0,0)'),
                  updatemenus=[dict(active=0, buttons=product_buttons, x=1, y=1.15)])
fig.update_yaxes(range=[0.0, 2.4])
#fig.write_html("visualizations/fig2-NormalizedPricesOverTime.html")
fig.show()

Looking at the revised plot, we quickly see that the price of `Apple juice, 2 litres` has not actually changed much in recent years. The price of `Oranges`, on the other hand, has increased by approximately 80%.
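
To make the normalized axis concrete: a normalized value of 1 means "the same as the reference price", and a value of 1.8 means "80% higher than the reference price". The following short sketch shows the arithmetic, using made-up prices that are not taken from the dataset:

```python
# Made-up prices in CAD, for illustration only (not taken from the dataset).
prices = {"2017-01": 2.00, "2019-01": 2.50, "2022-01": 3.60}
reference_date = "2017-01"

reference_price = prices[reference_date]

# Normalized price = price divided by the reference price.
normalized = {date: price / reference_price for date, price in prices.items()}

# Percent change relative to the reference = (normalized price - 1) * 100.
percent_change = {
    date: round((value - 1.0) * 100, 1) for date, value in normalized.items()
}

for date in prices:
    print(date, "normalized:", round(normalized[date], 2),
          "change:", percent_change[date], "%")
```

Because every product starts at 1 on the reference date, products with very different absolute prices can be compared on the same scale.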

Another useful way to visualize the data is to plot the price changes of different products for a specific region. This helps us quickly see which products are more affected in that region. Each product's price is normalized by its value in January 2017 in that region. You can select a specific region from the dropdown menu. To show or hide the prices for a product, click on its name in the legend.
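
Switching by region needs a slightly different selection, because the traces of one region are spread through the trace list rather than stored next to each other. A minimal sketch with placeholder names, assuming the traces are created product by product and region by region within each product:

```python
products = ["Apples", "Bananas"]              # hypothetical product names
regions = ["Canada", "Alberta", "Ontario"]    # hypothetical subset of regions

# Trace labels in the order nested loops create them (products outer, regions inner).
trace_labels = [f"{product} ({region})" for product in products for region in regions]


def traces_for_region(region):
    """Return a True/False list that marks every trace belonging to one region."""
    region_index = regions.index(region)
    mask = [False] * len(trace_labels)
    # Start at this region's position and jump ahead by len(regions) each time.
    mask[region_index::len(regions)] = [True] * len(products)
    return mask


alberta_mask = traces_for_region("Alberta")
alberta_traces = [label for label, shown in zip(trace_labels, alberta_mask) if shown]
print(alberta_traces)
```

The step in the slice equals the number of regions, because that is how far apart two traces of the same region sit in the list.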

In [None]:
regions = product_data['Region'].unique()
products = np.sort(product_data['Product'].unique())
default_prod = products[0]
default_reg = regions[0]
ref_reg = 'Canada'
ref_date = '2017-01'

# plot for all combinations of products and regions
fig = go.Figure()
n_traces = len(products) * len(regions)
for id_prod, product in enumerate(products):
    for id_reg, region in enumerate(regions):
        product_data_trimmed = product_data[(product_data["Region"] == region) & 
                                            (product_data['Product'] == product)]
        
        ref_price = product_data.loc[(product_data['Date'] == ref_date) & 
                                     (product_data["Region"] == region) & 
                                     (product_data['Product'] == product), 'Price'].iloc[0]
            
        fig.add_trace(go.Scatter(x=product_data_trimmed["Date"], y=product_data_trimmed["Price"].div(ref_price), 
                                 mode="lines", name=product, uid=product + ' (' + region + ')'))

# set the default view: only show the graph for the default product and default region
fig.for_each_trace(
    lambda trace: trace.update(visible='legendonly') if (default_reg in trace.uid) 
    else (trace.update(visible=False)))

fig.for_each_trace(
    lambda trace: trace.update(visible=True) if ((default_prod in trace.uid) & (default_reg in trace.uid)) else ())
    
fig.update_layout(xaxis_title="Date", yaxis_title="Normalized Price",
                  title_text="Normalized Prices in " + default_reg, 
                  yaxis={"tickformat":"0.2f"})

fig.update_yaxes(range=[0.0, 2.4])

# create buttons for all regions
region_buttons = []
n_traces = len(products) * len(regions)
for id_reg, region in enumerate(regions):
    vis = [False] * n_traces
    vis_update = list(products == default_prod)
    vis_update = ['legendonly' if i == False else i for i in vis_update]
    vis[id_reg:n_traces:len(regions)] = vis_update
        
    region_buttons.append(dict(label=region, method='update', 
                               args=[{"visible":vis}, {"title": "Normalized Prices in " + region}]))
    
fig.update_layout(updatemenus=[dict(active=0, buttons=region_buttons, x=1.25, y=1.2)], 
                  annotations=[dict(text="Region:",x=0.88, y=1.18, yref="paper", xref="paper", align="left", 
                                    showarrow=False)])
#fig.write_html("visualizations/fig3-NormalizedPricesOverTime-Regions.html")
fig.show()

### Price distribution across provinces at different times

Another useful type of plot for this dataset is the choropleth map, which shows the data on a regional map. This is a great way to quickly get an idea of how prices vary across different geographical regions.

The following code cell plots the most recent prices of different products across Canada. You can select a specific product from the dropdown menu.
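
Before a map can be drawn, the long table (one row per date, region and product) has to be reshaped so that each region has one row and each product has one column. The notebook does this with the pandas `pivot` method. The plain-Python sketch below only illustrates the idea, using made-up prices and taking the latest date to be the largest `YYYY-MM` string, which is an assumption of this sketch.

```python
# Made-up long-format rows (date, region, product, price in CAD); not real data.
long_rows = [
    ("2022-04", "Alberta", "Apples", 4.10),
    ("2022-04", "Alberta", "Bananas", 1.60),
    ("2022-04", "Ontario", "Apples", 3.90),
    ("2022-04", "Ontario", "Bananas", 1.50),
    ("2022-03", "Ontario", "Apples", 3.80),
]

# Dates written as "YYYY-MM" sort in time order, so max() gives the latest one.
latest_date = max(row[0] for row in long_rows)

# Wide format: one entry per region, holding one price per product.
wide = {}
for date, region, product, price in long_rows:
    if date == latest_date:
        wide.setdefault(region, {})[product] = price

for region in sorted(wide):
    print(region, wide[region])
```

In the wide layout, picking a product is the same as picking one column, which is what each menu button on the map does when it swaps the values being coloured.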

In [None]:
# read the shape file
provinces = gpd.read_file('shape-files/gpr_000a11a_e.shp')
provinces.drop(columns=['PRUID', 'PRNAME', 'PRFNAME', 'PREABBR', 'PRFABBR'], inplace=True)
provinces.rename(columns={'PRENAME':'Region'}, inplace=True)
provinces.to_crs(pyproj.CRS.from_epsg(4326), inplace=True)

# prepare variables to plot
products = np.sort(product_data_prov['Product'].unique())
dates = product_data_prov['Date'].unique()
default_prod = products[0]
date_f = dates[-1]

product_data_prov_piv = product_data_prov[product_data_prov['Date'] == date_f].pivot(index='Region', columns='Product', values='Price')
df_merged = provinces.merge(product_data_prov_piv, left_on=['Region'], right_on=['Region'])
df_merged.set_index('Region', inplace=True)

# plot
fig = px.choropleth(df_merged, geojson=df_merged.geometry, locations=df_merged.index, color=default_prod, color_continuous_scale="Viridis")

fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(title_text="Price of " + default_prod + ' on ' + date_f)
fig.update(layout = dict(title=dict(x=0.1, y=0.9)))
fig.update_layout(margin={"r":0,"t":30,"l":10,"b":10})
fig.update_coloraxes(colorbar={'title':'Price (CAD)'})

# create buttons for all products
product_buttons = []
for id_prod, product in enumerate(products):        
    product_buttons.append(dict(label=product, method='update', args=[{"z":[df_merged[product]]}, {"title": "Price of " + product + ' on ' + date_f}]))

fig.update_layout(updatemenus=[dict(active=0, buttons=product_buttons, x=1, y=1)], 
                  annotations=[dict(text="Product:", x=0.6, y=0.97, yref="paper", xref="paper", align="left", showarrow=False)])  
#fig.write_html("visualizations/fig4-PricesMap.html")
fig.show()

# Interpret

First of all, we can clearly see a recurring yearly pattern of price increases and decreases for seasonal products such as apples, cabbage, pears, cucumber, romaine lettuce and tomatoes.

Moreover, looking at the price trends in different provinces as of May 2022, we see that the majority of the products have indeed increased in price in recent months. Only some of them, such as beef, coffee and canned goods, have reached the same high prices seen in the early months of the COVID-19 pandemic.

Products like canned tomatoes, canned tuna, cucumber and sweet potato experienced a price increase only at the beginning of the pandemic. In addition, frozen products such as frozen corn, frozen french fried potatoes and frozen mixed vegetables did not experience any price increase in recent months or during the pandemic.

The trends are more or less the same for most of the provinces, although some provinces are more affected than others.

Finally, the interactive choropleth map helps us quickly compare the prices of different products across Canada.

# Communicate

Below are some writing prompts to help you reflect on the new information presented by the data. When you look at the evidence, think about what you perceive about the information. Is this perception based on what the evidence shows? If others were to view it, what perceptions might they have?

- I used to think ____________________ but now I know ____________________. 
- I wish I knew more about ____________________. 
- This visualization reminds me of ____________________. 
- I really like ____________________.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)