# Data visualisation

This lab will generate interactive visualisations of crop yield data for wheat and canola collected by a harvester from a field in Western Australia. This lab will provide an introduction to:

* interactive visualisations using Plotly Express
* using figures to represent and explore different features of a dataset
* using colour to visualise patterns in a dataset

## Setup

### Run the labs

You can run the labs locally on your machine or you can use cloud environments provided by Google Colab. **If you're working with Google Colab be aware that your sessions are temporary and you'll need to take care to save, back up, and download your work.**

<a href="https://colab.research.google.com/github/geog3300-agri3003/coursebook/blob/main/docs/notebooks/week-2_1.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Download data

If you need to download the data for this lab, run the following code snippet. 

In [None]:
import os
import subprocess

if "data_lab-2_1" not in os.listdir(os.getcwd()):
    subprocess.run('wget "https://github.com/geog3300-agri3003/lab-data/raw/main/data_lab-2_1.zip"', shell=True, capture_output=True, text=True)
    subprocess.run('unzip "data_lab-2_1.zip"', shell=True, capture_output=True, text=True)
    if "data_lab-2_1" not in os.listdir(os.getcwd()):
        print("Has a directory called data_lab-2_1 been downloaded and placed in your working directory? If not, try re-executing this code chunk")
    else:
        print("Data download OK")

## What is a figure?

Data visualisation is the process of relating data values to elements of a figure on a computer display. 

The *Grammar of Graphics* is an underlying model that describes the mapping of data values to the visual elements of a figure. It provides a consistent framework for guiding us in how to take our data values and convert them into a figure that effectively represents the data and conveys the messages and insights we seek to communicate. 

In the *Grammar of Graphics* a plot comprises *data* and a *mapping*. The *mapping* (not cartographic here) is a formal description of how data values map onto elements of a figure. The elements of a figure are termed *aesthetics* and consist of:

* **layers** - geometric elements that represent data values such as points (e.g. for scatter plots), lines (e.g. for lines of best fit), and polygons (e.g. for histograms or bar plots).
* **scales** - relate data values to visual display properties such as colour (e.g. a blue to red colour palette for temperature), size (e.g. larger points for larger numbers), position (e.g. location on axes), or shapes (e.g. using triangles for group A and circles for group B). Scales are used to draw axes and legends for figures. 
* **coords** - coordinate systems are used to map data values onto locations on the figure. On most 2D figures the x- and y-axes describe the coordinate space and on maps latitude and longitude describe the coordinate space (or you can use different coordinate reference systems). 
* **theme** - the background styling of the figure such as fonts for labels and background colours. 

![](https://github.com/geog3300-agri3003/coursebook/raw/main/docs/img/week-2-what-is-a-figure.jpg)

Reading the <a href="http://vita.had.co.nz/papers/layered-grammar.pdf" target="_blank">A Layered Grammar of Graphics</a> paper by Hadley Wickham provides a detailed description of the core concepts for designing high-quality data visualisations. 

### Interactive visualisations

Interactive visualisations are important tools for exploring complex and multidimensional data. They let users interactively generate different views of a dataset and so quickly develop an understanding of its structure and patterns. 

Generally, interactive visualisations are controlled by user input from mouse events (click, drag, hover), and, in response to mouse events, change what data and information is rendered on the computer display. 

Interactive visualisations are important tools for both exploratory analysis and for communicating the results of analysis to a wider audience. For exploratory analysis, the rapid feedback provided by interactive visualisations allows analysts to quickly build up an understanding of the datasets they are working with, spot noise or missing data, refine and develop hypotheses and research questions, and select suitable analytical and statistical tools for further work. Interactive visualisations are useful for communication as they enable active engagement with your datasets and the message you are conveying in a user-friendly and non-technical manner. 

Here, we will be using <a href="https://plotly.com/python/plotly-express/" target="_blank">Plotly Express</a> to develop interactive visualisations. Plotly Express is a Python module that contains functions that convert data in Python programs into interactive visualisations that can be rendered in web browser based environments. 

Plotly Express has several useful features for developing interactive visualisations:

* functions to generate a range of figure types to explore spatial and non-spatial data (see the <a href="https://plotly.com/python/plotly-express/#gallery" target="_blank">gallery</a>)
* consistent API for functions used to generate the figures (i.e. if you learn the syntax and format to generate scatter plots it can be applied to generate histograms, density plots, bar plots, violin plots, web maps, etc.)
* simple and intuitive functions to generate the figures (i.e. produce complex interactive figures with a single line of code)

## Import modules

In [None]:
# Import modules
import os
import pandas as pd
import geopandas as gpd
import plotly.express as px
import numpy as np
import plotly.io as pio

# setup renderer
if 'google.colab' in str(get_ipython()):
    pio.renderers.default = "colab"
else:
    pio.renderers.default = "jupyterlab"

## Data input

Let's read in some wheat and canola yield data collected by a harvester into a GeoPandas `GeoDataFrame`. The canola data corresponds to variety *43Y23 RR* and the wheat data corresponds to variety *ninja*. We'll demonstrate how to create interactive visualisations using Plotly Express by generating a simple widget that displays the distribution of wheat and canola yields. 

In [None]:
# Load the crop yield data
crop_yield_data_path = os.path.join(os.getcwd(), "data_lab-2_1")

# Read the canola and wheat crop yield data
canola_fpath = os.path.join(crop_yield_data_path, "bf66-canola-yield-max-vi_sampled.geojson")
canola_gdf = gpd.read_file(canola_fpath)
wheat_fpath = os.path.join(crop_yield_data_path, "bf66-wheat-yield-max-vi_sampled.geojson")
wheat_gdf = gpd.read_file(wheat_fpath)

# Combine (stack) the geojson files into one GeoDataFrame
gdf = pd.concat([canola_gdf, wheat_gdf], axis=0)
gdf.head()

Displaying the `head` of the `GeoDataFrame` `gdf` shows that we are working with tabular data. There is a `geometry` column, which stores the geographic location that the attributes in each row correspond to. Other columns of note are:

* `DryYield` - crop yield values for each location (tonnes / ha)
* `Variety` - *43Y23 RR* indicates canola and *ninja* indicates wheat
* `gndvi` - green normalised difference vegetation index, a satellite derived measure of greenness
* `ndyi` - normalised difference yellowness index, a satellite derived measure of yellowness
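
Before plotting, it can help to summarise a column per crop variety. The sketch below uses a small stand-in table with made-up values that borrows the column names listed above; with the lab data you would apply the same `groupby()` call to `gdf`.

```python
import pandas as pd

# Stand-in table with made-up values; the lab data is stored in `gdf`
demo_df = pd.DataFrame({
    "Variety": ["43Y23 RR", "43Y23 RR", "ninja", "ninja"],
    "DryYield": [1.2, 1.6, 2.9, 3.3],
    "gndvi": [0.55, 0.60, 0.72, 0.75],
})

# Number of records and yield summary statistics per variety
summary = demo_df.groupby("Variety")["DryYield"].agg(["count", "mean", "min", "max"])
print(summary)
```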

## Interactive visualisations with Plotly Express

Now, let's unpick the syntax for specifying a Plotly Express visualisation. The functions to generate interactive figures are part of the `plotly.express` module which we've imported into our program as `px`. 

`px.<function name>()` is how we'll access the function to generate a given figure. For example, to generate a histogram we call `px.histogram()` (if we wanted to generate a scatter plot we'd call `px.scatter()`, if we wanted to generate a line chart we'd call `px.line()`, if we wanted to generate a pie chart we'd call `px.pie()` - you get the pattern ...).

Next, we need to pass data into the function that will be rendered on the computer display and specify arguments to map data values to elements on the figure. The <a href="https://plotly.com/python-api-reference/plotly.express.html" target="_blank">Plotly Express documentation</a> lists functions that can be used to generate figures and their parameters. 

Parameters for the `px.histogram()` function include:

* `data_frame` - a `DataFrame` object containing the data to render on the histogram
* `x` - specifies the column in the `DataFrame` to be mapped on the x-axis of the figure  
* `color` - a column whose values are used to assign colours to *marks* (elements) on the display
* `marginal` - either *violin*, *box*, *rug*, or *histogram* that shows the distribution of the data
* `hover_data` - list of column names with values that will be shown in a popup when the cursor hovers over a record on the display

Use the *Zoom* tool to control what data is visualised and focus the figure on where most of the data is distributed. 

In [None]:
fig = px.histogram(
    data_frame=gdf, 
    x="DryYield", 
    color="Variety", 
    marginal="box", 
    hover_data=["DryYield", "Elevation", "WetMass"])
fig.show()

There are more options that you can use to configure a histogram <a href="https://plotly.com/python-api-reference/generated/plotly.express.histogram.html#plotly.express.histogram" target="_blank">here</a>. 

#### Recap quiz

**Look up the `range_x` parameter and consider how it could be used to remove the influence of outliers on the figure. Have a go at using it to restrict the range of values mapped to the x-axis.**

In [None]:
## ADD CODE HERE ##

<details>
    <summary><b>answer</b></summary>
    
```python
fig = px.histogram(
    data_frame=gdf, 
    x="DryYield", 
    color="Variety", 
    marginal="box", 
    range_x=[0, 7],
    hover_data=["DryYield", "Elevation", "WetMass"])
fig.show()
```
</details>

Let's have a go at generating a scatter plot to consolidate our understanding of how to map variables in our data to elements of a graphic. The documentation for scatter plots is <a href="https://plotly.com/python-api-reference/generated/plotly.express.scatter.html#plotly.express.scatter" target="_blank">here</a> and you should notice similarities in how we set up a scatter plot to a histogram. 

Let's use a scatter plot to see if there is a relationship between crop yield and elevation. We are plotting two variables here so we need to use the `y` parameter to specify what column in our `GeoDataFrame` will be mapped onto the y-axis. 

We can use the `marginal_x` and `marginal_y` parameters to attach plots to the x- and y-axes that show the distributions of variables mapped to each axis. 

Finally, we're going to use the `opacity` argument here to make the point elements on the figure semi-transparent; this will help reveal more information about the density of data values. 

**Both canola and wheat crop yield data are displayed. To see the relationship between one crop type's yield and elevation, click the other variety in the legend to hide it, or double-click a variety to show it on its own.**

In [None]:
fig = px.scatter(
    data_frame=gdf, 
    x="DryYield", 
    y="Elevation", 
    color="Variety", 
    opacity=0.25, 
    marginal_x="box", 
    marginal_y="violin")
fig.show()

#### Recap quiz

**Can you limit the range of x-axis values to focus the figure on where most of the data is concentrated and remove the effect of outliers? (hint, you'll need to remove the `marginal_x` argument).**

In [None]:
## ADD CODE HERE ##

<details>
    <summary><b>answer</b></summary>

```python
fig = px.scatter(
    data_frame=gdf, 
    x="DryYield", 
    y="Elevation", 
    color="Variety", 
    range_x=[0,10],
    opacity=0.25,
    marginal_y="violin")
fig.show()
```
</details>

### Adding layers

The scatter plot we have generated above has a layer of points for the scatter plot and layers of geometric elements for the box and violin plots. However, each of these layers is rendered on its own subplot. 

There are often times when we want to overlay layers on the same plot. A common example of this is adding a trendline to a scatter plot to help the viewer see patterns and relationships in the data. If we refer back to the documentation for <a href="https://plotly.com/python-api-reference/generated/plotly.express.scatter.html#plotly.express.scatter" target="_blank">scatter plots</a> we can see there is a `trendline` parameter. We can use this parameter to specify the kind of trendline we'd like to draw on our scatter plot:

* `ols`: ordinary least squares (or linear line of best fit)
* `lowess`: locally weighted scatterplot smoothing line
* `rolling`: rolling average or rolling median line

Let's generate a scatter plot with a trendline to explore the relationship between the green normalised difference vegetation index (GNDVI, a satellite derived measure of vegetation greenness) and crop yield. Generally, higher maximum growing season GNDVI values are correlated with higher crop yields. 

If you hover your cursor over the trendline it will show you the equation for the trendline. You will also notice that we've used the `range_x` and `range_y` parameters to focus the figure on the region where most of the data is concentrated and clip the outliers from the view. 

In [None]:
fig = px.scatter(
    data_frame=gdf, 
    x="gndvi", 
    y="DryYield", 
    color="Variety", 
    opacity=0.05, 
    range_y=[0.1, 6], 
    range_x=[0.3, 0.9], 
    marginal_x="box", 
    marginal_y="box", 
    trendline="ols"
)
fig.show()
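
The equation shown for an `ols` trendline describes a straight line fitted by least squares (Plotly Express relies on the `statsmodels` package to fit it). As a rough sketch of what that fit computes, the example below uses NumPy's `polyfit()` on made-up GNDVI and yield values; it is a stand-in for illustration, not the code Plotly runs.

```python
import numpy as np

# Made-up values, for illustration only
gndvi = np.array([0.45, 0.55, 0.62, 0.70, 0.78, 0.85])
dry_yield = np.array([1.1, 1.6, 2.1, 2.4, 3.0, 3.2])

# Fit a straight line (degree 1 polynomial) by least squares
slope, intercept = np.polyfit(gndvi, dry_yield, deg=1)
predicted = slope * gndvi + intercept

# R squared: proportion of the variance in yield explained by the line
ss_residual = ((dry_yield - predicted) ** 2).sum()
ss_total = ((dry_yield - dry_yield.mean()) ** 2).sum()
r_squared = 1 - ss_residual / ss_total

print(f"DryYield = {slope:.2f} * gndvi + {intercept:.2f} (R squared = {r_squared:.2f})")
```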

#### Recap quiz

<details>
    <summary><b>Generally, it seems that maximum growing season GNDVI is higher for the wheat (Ninja) crop than canola (43Y23 RR). Can you think of an explanation for this?</b></summary>
Canola canopies are characterised by yellow flowers which could reduce their greenness during the growing season. 
</details>

### Facets

So far we have distinguished groups of data points on the same figure by using a unique colour per group. However, this can lead to cluttered figures which obscure important variation in the data. To avoid clutter we can create faceted figures, where multiple subplots of the same type are generated, the subplots share axes, and a different subset (group) of the data is rendered on each subplot. 

<a href="https://clauswilke.com/dataviz/multi-panel-figures.html" target="_blank">Wilke (2019)</a> distinguishes between faceted figures and compound figures. **Compound figures** are multiple figure types (e.g. scatter plots, histograms, maps), possibly of different datasets, combined into one figure. A key feature of a compound figure is that the subplots do not need to be arranged in a regular grid. The figures above with violin and box plots aligned on the margins of a scatter plot are examples of compound figures.

In contrast, **facet plots** consist of subplots of the same type, showing subsets of the same dataset, and are arranged on a regular grid. You might see the term trellis or lattice plots used to describe facet plots. To ensure correct interpretation of faceted figures it is important that the axes on all plots share the same range and scalings. 

Let's create a faceted figure that shows the relationship between crop yield and the normalised difference yellowness index (NDYI) for each crop type side-by-side. The NDYI is a spectral index computed from remote sensing data as a mathematical combination of green and blue reflectance values. Higher NDYI values are associated with a yellower land surface. The NDYI is often used to monitor canola flowering. 
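
For reference, the NDYI is commonly defined as a normalised difference of the green and blue bands, (green - blue) / (green + blue). The sketch below computes it for made-up reflectance values; the helper function `normalised_difference()` is hypothetical, and the lab data already contains a precomputed `ndyi` column.

```python
import numpy as np

def normalised_difference(band_a, band_b):
    """Return (band_a - band_b) / (band_a + band_b), element by element."""
    band_a = np.asarray(band_a, dtype=float)
    band_b = np.asarray(band_b, dtype=float)
    return (band_a - band_b) / (band_a + band_b)

# Made-up reflectance values: the surface gets yellower from left to right
green = np.array([0.10, 0.14, 0.18])
blue = np.array([0.06, 0.05, 0.04])

ndyi = normalised_difference(green, blue)
print(ndyi.round(3))
```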

We can use the `facet_row` parameter to align subplots on separate rows or the `facet_col` parameter to align the subplots on separate columns. We specify a column in our `GeoDataFrame` to use to create the facets. The dataset is split into subsets using unique values in the specified column and each subset is rendered on a subplot. Here, we pass in the `Variety` column to split the data by crop type.  

In [None]:
fig = px.scatter(
    data_frame=gdf, 
    x="ndyi", 
    y="DryYield", 
    facet_col="Variety", 
    opacity=0.05, 
    range_y=[0.1, 5], 
    range_x=[0.1, 0.6], 
    trendline="ols"
)
fig.show()

## Selecting the "right" figure

Chapter 5 of <a href="https://clauswilke.com/dataviz/directory-of-visualizations.html" target="_blank">Wilke (2019)</a> provides a *directory of visualisations* which serves as a useful guide for selecting the correct visualisation for different types of data.


## Using Colour

A colour scale is used to map data values to colours on the display. <a href="https://clauswilke.com/dataviz/color-basics.html" target="_blank">Wilke (2019)</a> outlines three uses of colour on figures:

* colour to represent data values (e.g. using red shades for low precipitation and blue shades for high precipitation)
* colour to distinguish groups (e.g. using green for forest on a land cover map, blue for water, orange-red for desert, etc.)
* colour to highlight (e.g. using colour to highlight particular features on your visualisation)

We can broadly characterise colour scales as being either continuous or qualitative. 

### Continuous palettes

Continuous colour scales can be either sequential or diverging and are typically used when using colour to represent data values (often numeric continuous variables). Continuous colour scales can be used to visualise variation in attributes of vector geospatial data on choropleth maps and variation in attributes of raster data as surfaces. 

#### Sequential palettes

A sequential colour scale is a palette which consists of a single hue such as light green to dark green or light red to dark red. Multi-hue sequential colour scales often consist of hues that imply an intuitive and increasing order to the colours such as light yellows to dark red.

Plotly Express provides a range of inbuilt sequential colour scales: 

In [None]:
fig = px.colors.sequential.swatches_continuous()
fig.show()

Let's use a sequential colour palette to visualise monthly precipitation over the field since 1981. The precipitation data is obtained from the <a href="https://developers.google.com/earth-engine/datasets/catalog/IDAHO_EPSCOR_TERRACLIMATE#bands" target="_blank">TerraClimate: Monthly Climate and Climatic Water Balance for Global Terrestrial Surfaces</a> dataset. 

Use the pandas `read_csv()` function to read in the precipitation data. Inside the `CSV` file each row represents a month-year combination and stores a monthly precipitation total in mm. 

In [None]:
# read in the monthly precipitation data
climate_data_path = os.path.join(os.getcwd(), "data_lab-2_1")
precip_df = pd.read_csv(os.path.join(climate_data_path, "bf66-terra-precip-monthly.csv"))
precip_df["month"] = precip_df["month"].astype(str)
precip_df["year"] = precip_df["year"].astype(str)
precip_df.head()

We can create a heatmap to visualise monthly precipitation over time, using a colour palette where darker blue shades indicate wetter months. Note how we pass the colour palette `Blues` as an argument to the `color_continuous_scale` parameter. 

In [None]:
fig = px.density_heatmap(
    precip_df,
    x="year", 
    y="month", 
    z="pr", 
    histfunc="sum",
    nbinsy=12,
    color_continuous_scale="Blues",
    range_color=(0, 75),
)
fig.show()
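
With `histfunc="sum"`, the heatmap colours each cell by the sum of the `pr` values that fall in that year and month. A rough way to inspect the same grid as a table is a pandas pivot table, sketched here on a made-up stand-in for `precip_df`.

```python
import pandas as pd

# Stand-in with made-up values, using the same column names as `precip_df`
demo_precip = pd.DataFrame({
    "year": ["1981", "1981", "1982", "1982"],
    "month": ["1", "2", "1", "2"],
    "pr": [12.0, 30.5, 8.0, 41.0],
})

# Rows are months, columns are years, cells are summed precipitation (mm)
grid = demo_precip.pivot_table(index="month", columns="year", values="pr", aggfunc="sum")
print(grid)
```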

#### Recap quiz

**Can you create a heatmap of monthly precipitation over time using a <code>YlGnBu</code> colour palette?**

In [None]:
## ADD CODE HERE

<details>
    <summary><b>answer</b></summary>

```python
fig = px.density_heatmap(
    precip_df,
    x="year", 
    y="month", 
    z="pr", 
    histfunc="sum",
    nbinsy=12,
    color_continuous_scale="YlGnBu",
    range_color=(0, 75),
)
fig.show()
```
</details>

#### Recap quiz

**Using the `GeoDataFrame` `gdf` of crop yield values, can you create a scatter plot of crop yield (the `DryYield` column) and GNDVI (the `gndvi` column) and assign green shades to the points which reflect their GNDVI values? Tips: look up the `color`, `color_continuous_scale`, and `range_color` parameters of the `scatter()` function in the <a href="https://plotly.com/python-api-reference/generated/plotly.express.scatter.html#plotly.express.scatter" target="_blank">API docs</a>.**

In [None]:
## ADD CODE HERE

<details>
    <summary><b>answer</b></summary>

```python
fig = px.scatter(
    data_frame=gdf, 
    x="gndvi", 
    y="DryYield", 
    facet_col="Variety", 
    opacity=0.25, 
    range_y=[0.1, 5], 
    range_x=[0.4, 0.9], 
    color="gndvi",
    color_continuous_scale="Greens",
    range_color=(0.4, 0.8),
)
fig.show()
```
</details>

#### Diverging palettes

Diverging colour scales are used to represent data values deviating in two directions. Often a light colour (e.g. white) is used as the mid-point of a diverging colour scale with gradients of intensifying colour away from this mid-point. A common example of diverging colour scales is climate or weather anomalies, where dry or hot years are represented with red colours and wet and cool years are represented with blue colours. Average conditions are often a pale red, pale blue, or white.

Plotly also provides a range of diverging colour palettes we can use:

In [None]:
fig = px.colors.diverging.swatches_continuous()
fig.show()

We can use a diverging colour palette to visualise the same precipitation data. Monthly precipitation values are converted to z-scores, which represent deviations in monthly precipitation away from the mean. A z-score of zero represents average rainfall and can be used as the mid-point for a diverging colour palette. Here, we can use a red-to-blue colour palette, with drier months represented by red shades. 

In [None]:
# compute average rainfall and standard deviation of rainfall to compute z scores
# use z score as 0 for the mid-point of a diverging colour palette
avg_pr = precip_df["pr"].mean()
std_pr = precip_df["pr"].std()
precip_df.loc[:, "z_score"] = (precip_df.loc[:, "pr"] - avg_pr) / std_pr

fig = px.density_heatmap(
    precip_df,
    x="year", 
    y="month", 
    z="z_score", 
    histfunc="sum",
    nbinsy=12,
    color_continuous_scale="RdBu",
    color_continuous_midpoint=0,
)
fig.show()
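
To make the z-score step concrete, here is a small worked sketch on made-up precipitation totals. Values below the mean get negative z-scores and values above it get positive z-scores. Note that the pandas `std()` method computes the sample standard deviation by default.

```python
import pandas as pd

# Made-up monthly precipitation totals (mm), for illustration only
pr = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# z-score: distance from the mean in units of standard deviation
z_score = (pr - pr.mean()) / pr.std()
print(z_score.round(2).tolist())
```

The middle value equals the mean, so its z-score is zero and it would sit at the mid-point of a diverging colour palette.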

### Qualitative palettes

Qualitative (or discrete) colour scales should be used to represent groups or categorical data (i.e. data where there is no logical ordering). Thus, qualitative colour scales should not represent gradients of light to dark or use colours that can be interpreted as having an implied ordering. Often, it is sensible to select colours that relate to the category (e.g. on land cover maps using green for vegetated categories, blue for water etc.). 