Erika Duan 2022-12-11
# Load required packages -------------------------------------------------------
if (!require("pacman")) install.packages("pacman")
pacman::p_load(here,
janitor,
rsdmx,
clock,
tidyverse)
Data analysts can find themselves stuck in a cycle of ad hoc analytical requests, which prevents them from developing productionisation skills. Productionisation can be thought of as the creation of automatable data analysis workflows, which output data products to business users.
An automated workflow promotes code longevity and frees your time from performing repetitive manual processes. It is particularly useful when business areas require analyses on defined data sources.
In this example, I will be using the monthly labour force data segregated by measure (employment status) and region from the Australian Bureau of Statistics (ABS) to explore R report productionisation tips.
Note: This tutorial uses the R base pipe |>
, which can be accessed
via Tools/Global Options for R version 4.1.0 and above. Using |>
instead of %>%
hypothetically reduces the number of R package
dependencies for your automated report and this is a good housekeeping
practice when creating CI/CD pipelines.
I am using the publicly available ABS Labour Force dataset and consuming data through the ABS Data Explorer API.
Datasets from the ABS Data Explorer API can be read as a SDMX format, which seems to store data and metadata in a heavily nested XML or JSON format. The ABS currently provides one URL for the dataset itself and one for the metadata.
# Connect to Labour Force API --------------------------------------------------
# Apply all filters and then select the developer API tab for the data query url
data_url <- "https://api.data.abs.gov.au/data/ABS,LF,1.0.0/M2+M1.2+1+3.1599.20+30.AUS.M?startPeriod=2019-01&dimensionAtObservation=AllDimensions"
# Obtain data as tibble data frame
labour_force <- readSDMX(data_url) |>
as_tibble() |>
clean_names()
# Preview labour force dataset
labour_force |>
head(3) |>
knitr::kable()
time_period | measure | sex | age | tsest | region | freq | obs_value | unit_measure | unit_mult | decimals | obs_comment |
---|---|---|---|---|---|---|---|---|---|---|---|
2019-01 | M1 | 3 | 1599 | 30 | AUS | M | 8734.379 | NUM | 3 | 1 | NA |
2019-02 | M1 | 3 | 1599 | 30 | AUS | M | 8749.254 | NUM | 3 | 1 | NA |
2019-03 | M1 | 3 | 1599 | 30 | AUS | M | 8765.281 | NUM | 3 | 1 | NA |
# Obtain metadata as tibble data frame -----------------------------------------
# The metadata URL can be found via the developer API tab under structure query
schema_url <- "https://api.data.abs.gov.au/dataflow/ABS/LF/1.0.0?references=all"
schema_labour_force <- readSDMX(schema_url)
# List metadata S4 object attributes
slotNames(schema_labour_force)
#> [1] "organisationSchemes" "concepts" "codelists" "datastructures" "xmlObj" "schema"
#> [7] "header" "footer"
# Obtain list of column names from the codelist attribute
codelists_labour_force <- slot(schema_labour_force, "codelists")
map_chr(slot(codelists_labour_force, "codelists"), ~slot(.x, "id"))
#> [1] "CL_DECIMALS" "CL_FREQ" "CL_LF_AGE" "CL_LF_MEASURE" "CL_OBS_STATUS" "CL_SEX" "CL_STATE" "CL_TSEST"
#> [9] "CL_UNIT_MEASURE" "CL_UNIT_MULT"
# Print metadata as tibble data frame -----------------------------------------
# Obtain a data dictionary from the concepts attribute
slot(schema_labour_force, "concepts") |>
as_tibble() |>
unite("Name", Name.en, en, na.rm = TRUE) |>
knitr::kable()
id | Name | Description.en |
---|---|---|
UNIT_MEASURE | Unit of Measure | Base unit in which the observation value is expressed. |
UNIT_MULT | Unit of Multiplier | The multiple of the unit of measure for which the data is expressed. |
OBS_STATUS | Observation Status | NA |
OBS_COMMENT | Observation Comment | NA |
REPYEAREND | Reporting Year End Day | A month day that specifies the ending day of the reporting periods (e.g. YYYY-A1) within the datastucture. |
BASE_PERIOD | Reference Base Period | Reference Period that the value has been based against. |
Note: Some metadata for the labour force dataset seems to be missing
from concepts
, i.e. M1 for employed full-time and M2 for employed
part-time under Measure, and 1 for males, 2 for females and 3 for
persons under Sex (NB: a non-binary gender variable would be useful to
include).
We first need to identify the event that we are interested in reporting. For example, the labour force dataset contains information on the number of full-time versus part-time individuals, which can be further subdivided by sex.
Imagine that we are interested in visualising the monthly change in the number of part-time females in the labour force since January 2019. The standard way of doing this analysis is to hard code our variables of interest.
# Filter dataset by part-time females ------------------------------------------
females_part_time <- labour_force |>
filter(sex == 2, # 2 represents females
measure == "M2", # M2 represents part-time employed
tsest == 20) |> # 20 represents seasonally adjusted
mutate(time_period = as.Date(paste0(time_period, "-01"), format = "%Y-%m-%d"),
last_obs_value = lag(obs_value),
change_obs_value = case_when(
is.na(last_obs_value) ~ 0,
TRUE ~ obs_value - last_obs_value)) |>
select(time_period,
obs_value,
change_obs_value)
# Plot time series -------------------------------------------------------------
females_part_time |>
ggplot(aes(x = time_period,
y = change_obs_value)) +
geom_line() +
geom_vline(xintercept = as.Date("2020-02-01"),
colour = "firebrick",
linetype = "dashed") +
annotate("label", x = as.Date("2020-02-01"), y = 150,
label = "COVID-19", color = "firebrick") +
labs(title = "Change in female part-time labour force participation",
x = NULL,
y = "Individuals (1000s)") +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(linetype = "dotted"),
panel.grid.minor.y = element_line(linetype = "dotted"),
plot.title = element_text(hjust = 0.5))
To use parameters instead of hard coded variables, we first list our parameters of interest in the YAML header of our R Markdown report.
We can then reference the parameter keys directly in our R code. I recommend choosing short but descriptive strings as parameter values, as these can also be printed in the plot title (unlike the example below).
# Replace code with parameter keys ---------------------------------------------
females_part_time <- labour_force |>
filter(sex == params$sex, # 2 represents females
measure == params$measure, # M2 represents part-time employed
tsest == params$tsest) |> # 20 represents seasonally adjusted
mutate(time_period = as.Date(paste0(time_period, "-01"), format = "%Y-%m-%d"),
last_obs_value = lag(obs_value),
change_obs_value = case_when(
is.na(last_obs_value) ~ 0,
TRUE ~ obs_value - last_obs_value)) |>
select(time_period,
obs_value,
change_obs_value)
# Plot time series -------------------------------------------------------------
females_part_time |>
ggplot(aes(x = time_period,
y = change_obs_value)) +
geom_line() +
geom_vline(xintercept = as.Date("2020-02-01"),
colour = "firebrick",
linetype = "dashed") +
annotate("label", x = as.Date("2020-02-01"), y = 150,
label = "COVID-19", color = "firebrick") +
labs(title = paste0("Labour force changes for measure ",
params$measure,
" and sex ",
params$sex),
x = NULL,
y = "Individuals (1000s)") +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(linetype = "dotted"),
panel.grid.minor.y = element_line(linetype = "dotted"),
plot.title = element_text(hjust = 0.5))
- A great presentation by Thomas Mock on advanced R Markdown features.
- A great blog post by Emily Riederer on data analysis productionisation in R.
- A great blog post by Schaun Wheeler on the importance of data science productionisation.
- A
guide
on using the
rsdmx
package to consume SDMX formatted datasets via an API call.