# Explore the time-series of metrics in the pre-test period

#### Aim

Explore the characteristics of the time-series of the main evaluation metrics (in terms of trends and seasonality) to determine how they should be modelled.

#### Background

The time-series data was created by the `notebooks/2021_tests/gather_metrics.sql` query for the pre-intervention time period (segments A0 to A3), from 11-Oct-2021 to 19-Nov-2021, and saved in the BigQuery table `govuk-bigquery-analytics.datascience.related_links_20211011_20211119_pre_test_data`.


**IMPORTANT**: This notebook uses an R kernel.

#### How to set up Jupyter Notebook for R

These instructions assume that you already have a working Python environment for your local repository of this project, and Jupyter Notebook already installed in that environment that you can execute from your Terminal.

1. Install R 

   If not already installed, see https://cloud.r-project.org/index.html
   

2. Install R kernel for Jupyter Notebook

    In your Terminal (note: not in RStudio, not in the R GUI):
    
    - Launch R by entering `R` on the command line.

    - You should now be using R from your Terminal. Run:
    ```
    install.packages('IRkernel')
    IRkernel::installspec()
    ```

    Done! You can now quit R by entering `q()`.

If you now launch Jupyter Notebook, you'll have the option to choose `R` as the kernel.


### Setting things up

In [None]:
# Install packages, if they aren't already available.
# This can take a minute or two.
packages <- c("bigrquery", "tidyverse", "plotly", "gridExtra", "tsibble", "feasts", "DT", "TTR")
install.packages(setdiff(packages, rownames(installed.packages())), quiet = TRUE) 

In [None]:
for(pckg in packages){
    suppressPackageStartupMessages(library(pckg, character.only = TRUE))
}

In [None]:
# Authenticate with a service-account key.
# Replace the path below with /path/to/your/service-account.json
bq_auth(path = "/Users/alessiatosi/Secrets/govuk-bigquery-analytics-service-credentials.json")

In [None]:
# Make plots wider 
options(repr.plot.width=15, repr.plot.height=8)

In [None]:
# create custom plotting theme
theme_custom <- theme(plot.title = element_text(face = "bold", hjust = 0.5, size=18),
                      plot.subtitle = element_text(size=14),
                      axis.text.y = element_text(colour = 'black', size = 12), 
                      axis.title.y = element_text(size = 16, hjust = 0.5, vjust = 0.2),
                      axis.text.x = element_text(colour = 'black', size = 12), 
                      axis.title.x = element_text(size = 16, hjust = 0.5, vjust = 0.2),
                      panel.background = element_blank(),
                      axis.line = element_line(colour = "black"),
                      legend.position = "bottom",
                      legend.direction = "horizontal")

### Get the data

In [None]:
#billing <- "govuk-xgov" # replace this with your project ID 
project <- "govuk-bigquery-analytics"
sql <- "SELECT * FROM `govuk-bigquery-analytics.datascience.related_links_20211011_20211119_pre_test_data`"

tb <- bq_table_download(bq_project_query(project, sql))

In [None]:
tb

### Data pre-processing

In [None]:
# cast date as a date type variable
tb$date <- as.Date(strptime(tb$date, "%Y%m%d"))

In [None]:
tb <- tb %>% 
    arrange(date)
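
Before plotting, it can be worth confirming that the daily series has no missing or duplicated dates, since gaps would distort the seasonality analysis later on. The following is a minimal sketch in base R. The helper `check_daily_coverage()` is hypothetical and the dates are made up for illustration; neither comes from the real table.

```r
# Hypothetical helper: report missing and duplicated days in a vector of dates.
check_daily_coverage <- function(dates) {
    dates <- sort(as.Date(dates))
    # Every calendar day between the first and last observation
    expected <- seq(min(dates), max(dates), by = "day")
    list(
        missing = expected[!expected %in% dates],
        duplicated = unique(dates[duplicated(dates)])
    )
}

# Illustrative dates only: 13-Oct-2021 is absent and 14-Oct-2021 appears twice
example_dates <- as.Date(c("2021-10-11", "2021-10-12", "2021-10-14", "2021-10-14"))
coverage <- check_daily_coverage(example_dates)
coverage
```

On the real data, the same check could be run as `check_daily_coverage(tb$date)`; both elements of the returned list should be empty for a complete daily series.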

### Plotting the time series of data

Here we plot the time series for all the metrics. Later in the notebook we explore seasonality and trends in the time series of the two main evaluation metrics:
- Proportion of visitors who click a related link (RL) at least once
- Proportion of repeated-clicker visitors (those who, having clicked on a RL, click on others)

In [None]:
plot_timeseries <- function(data, ts_var="", title="", x_title=""){
    #'@param data (data.frame) : dataset  
    #'@param ts_var (character string) : name of the variable containing the time-series data
    #'@param title (character string) : plot title
    #'@param x_title (character string) : y-axis title (despite its name, this is passed to `ylab()`)
    #'@return time-series plot
    
    if(!"date" %in% colnames(data)) stop(paste0("column `date` is missing from dataset"))
    
    sym_ts_var <- dplyr::sym(ts_var)
    
    data %>% 
    ggplot2::ggplot(., aes(date, !!sym_ts_var)) +
    geom_point(size=2) +
    geom_line(size=1) +
    #geom_smooth(method="lm", colour="blue") +
    geom_smooth(method = "loess", formula=y~x, colour="red", se=TRUE) +
    geom_vline(aes(xintercept = as.Date("20211111", "%Y%m%d")), col="blue", linetype=2) +
    geom_vline(aes(xintercept = as.Date("20211025", "%Y%m%d")), col="blue", linetype=2) +
    labs(
        title = title,
        subtitle = "Pre-intervention time series") +
    ylab(x_title) +
    theme_custom
    }

In [None]:
plot_timeseries(data=tb,
               ts_var="pc_visitors_used_rl",
               title="Proportion of visitors who clicked on at least 1 related link",
               x_title="Proportion of visitors")

In [None]:
plot_timeseries(data=tb,
               ts_var="pc_visitors_that_clicked_navigation",
               title="Proportion of visitors who clicked on a navigation element",
               x_title="Proportion of visitors")

In [None]:
plot_timeseries(data=tb,
               ts_var="pc_visitors_2_or_more_rl",
               title="Proportion of visitors who clicked 2 or more related links",
               x_title="Proportion of visitors")

In [None]:
plot_timeseries(data=tb,
               ts_var="pc_visitors_that_used_search",
               title="Proportion of visitors who used internal search",
               x_title="Proportion of visitors")

In [None]:
plot_timeseries(data=tb,
               ts_var="pc_visitors_returning_to_rl",
               title="Proportion of repeated-clicker visitors (those who clicked one RL and then clicked another)",
               x_title="Proportion of visitors")

### Trends, seasonality and stationarity

- **Trend**: whether and when there is an overall increasing or decreasing pattern in our observations over time
- **Seasonality**: whether and when there are repeating patterns in the series at fixed and known periods (e.g., weekly)
- **Stationarity**: when a time-series has constant mean, variance and autocovariance over time
Put another way, a time-series is **stationary** when it has neither trend nor seasonality, and has constant variance over time. Typically, this means that when you plot the values over time, the series is roughly horizontal (though some cyclic behaviour is possible) with constant variance.

- **Remainder/random noise**: leftover of original time-series after trend and seasonality are removed
- **Autocorrelation**: the strength of the relationship between a variable and its observations at prior time-periods
The **autocorrelation function** (ACF) gives the correlation of a **stationary** time-series with its lags (meaning its observations at prior time-periods), and is usually shown as a plot. It can be used to suggest the order of a **moving-average model**, *q*. As a rule of thumb, *q* is the last lag at which the **autocorrelation** value lies outside the 95% **confidence interval** before the values drop back inside it, as indicated by the blue dotted line in the corresponding **ACF** plot.

- **Partial autocorrelation**: the strength of the relationship between an observation in a time-series and its observations at prior time-periods, with the relationships of intervening observations removed. **Partial autocorrelation** differs from **autocorrelation** because the latter comprises both *direct* and *indirect* correlations, whereas the former removes the *indirect* correlations. (Indirect correlations are a linear function of the correlation of the observation with observations at intervening time periods.) It can be used to obtain the order of an auto-regressive model, *p*.

We explore **trend** to identify whether the evaluation metrics have drifted over the pre-test period, which matters for the interrupted time series analysis. We explore the **ACF** and **PACF** to inform our choice of statistical method for modelling the time-series data.
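
To make these definitions concrete, here is a minimal sketch on simulated data (not the related-links metrics). It builds a 42-day series with a weekly pattern plus noise, then uses base R's `acf()` and `pacf()` to compute the correlations and the approximate 95% bounds that ACF and PACF plots draw. The series and its parameters are assumptions chosen purely for illustration.

```r
set.seed(42)

n_days <- 42                                             # six full weeks of daily data
weekly_pattern <- c(0, 0.2, 0.3, 0.3, 0.1, -0.4, -0.5)   # made-up day-of-week effects
y <- rep(weekly_pattern, length.out = n_days) + rnorm(n_days, sd = 0.05)

# Autocorrelation and partial autocorrelation up to lag 14, without plotting
acf_res  <- acf(y, lag.max = 14, plot = FALSE)
pacf_res <- pacf(y, lag.max = 14, plot = FALSE)

# Approximate 95% bounds used in ACF/PACF plots: +/- 1.96 / sqrt(n)
bound <- qnorm(0.975) / sqrt(n_days)

# acf() reports lag 0 first (always 1); pacf() starts at lag 1
acf_by_lag <- setNames(as.numeric(acf_res$acf)[-1], 1:14)
significant_lags <- names(acf_by_lag)[abs(acf_by_lag) > bound]

round(acf_by_lag, 2)
significant_lags
```

With a weekly pattern this strong, the autocorrelation at lag 7 should stand well above the bound, which is the signature of weekly seasonality to look for in the plots below.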


In [None]:
# convert to time-series object
tb <- tb %>%
    tsibble::as_tsibble(index = date)

In [None]:
plot_SLT <- function(data, ts_var="", title_ts_var=""){
    #'@param data (data.frame) : dataset  
    #'@param ts_var (character string) : name of the variable containing the time-series data
    #'@param title_ts_var (character string) : Plain English description of time-series variable
    #'@return list of two plots: the time/ACF/PACF display and the STL decomposition
    
    if(!"date" %in% colnames(data)) stop(paste0("column `date` is missing from dataset"))
    
    sym_ts_var <- dplyr::sym(ts_var)
    
    decomp <- data %>% model(STL(!!sym_ts_var)) %>% components()
    
    p1 <- data %>%
        feasts::gg_tsdisplay(y = !!sym_ts_var, plot_type = "partial") + 
        labs(title = paste(title_ts_var, "- Time, ACF and PACF plots"))
    
    p2 <- decomp %>% autoplot()
    
    list(p1, p2)
    }
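
For intuition about what the decomposition returns, the sketch below applies base R's `stats::stl()` (rather than the `feasts::STL()` model used above) to a simulated daily series with an assumed weekly period. The data and the `s.window = "periodic"` setting are illustrative choices, not the settings used in this analysis.

```r
set.seed(1)

n_days <- 42
trend_part    <- seq(10, 11, length.out = n_days)   # gentle upward drift
seasonal_part <- rep(c(0, 0.2, 0.3, 0.3, 0.1, -0.4, -0.5), length.out = n_days)
y <- trend_part + seasonal_part + rnorm(n_days, sd = 0.05)

# frequency = 7 tells stl() that the seasonal period is one week of daily data
y_ts <- ts(y, frequency = 7)
fit <- stl(y_ts, s.window = "periodic")

# One column per component: seasonal, trend and remainder
components_mat <- fit$time.series
components_mat[1:7, ]

# The three components add back up to the original series
all.equal(as.numeric(rowSums(components_mat)), y)
```

The remainder column is what is left once the estimated trend and weekly pattern are subtracted, so large or structured remainders suggest the decomposition is missing something.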

#### Proportion of visitors who click RL at least once

In [None]:
plot_SLT(tb, "pc_visitors_used_rl", "Proportion of visitors who clicked RL at least once")

#### Proportion of repeated-clicker visitors

In [None]:
plot_SLT(tb, "pc_visitors_returning_to_rl", 
         "Proportion of visitors who clicked on more RLs after having clicked on one")

#### Proportion of visitors who clicked on a navigation element while on a RL page

In [None]:
plot_SLT(tb, "pc_visitors_that_clicked_navigation", 
         "Proportion of visitors who clicked on a navigation element")

## Conclusions

The time series of our two main evaluation metrics display both weekly seasonality and an upward, not fully linear trend. We will try to account for both when modelling the time series as part of the interrupted time series analysis.
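
As a rough illustration of what accounting for both features could look like, the sketch below fits a linear model with day-of-week indicators and a quadratic time trend to simulated data. This is an assumption about one possible approach, not the model used in the interrupted time series analysis. The data, the variable names and the quadratic form are all made up for the example.

```r
set.seed(7)

# Same calendar span as the pre-test period, but simulated values
dates <- seq(as.Date("2021-10-11"), as.Date("2021-11-19"), by = "day")
t <- seq_along(dates)

# format(..., "%u") gives the weekday as "1" (Monday) to "7" (Sunday),
# which avoids the locale-dependent labels returned by weekdays()
dow <- format(dates, "%u")

sim <- data.frame(
    date = dates,
    t = t,
    dow = factor(dow),
    # Curved upward trend, a weekend dip and random noise
    y = 10 + 0.03 * t - 0.0003 * t^2 +
        ifelse(dow %in% c("6", "7"), -0.4, 0) +
        rnorm(length(dates), sd = 0.05)
)

# Day-of-week dummies capture weekly seasonality; poly(t, 2) allows a curved trend
fit <- lm(y ~ poly(t, 2) + dow, data = sim)
summary(fit)$coefficients
```

In an interrupted time series setting, terms for the intervention (for example a level shift and a change in slope) would be added on top of these, and the residuals checked for remaining autocorrelation.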