
# GA4 (Document Loads) Forecasting

**Problem to Solve**

The document loads data is reported weekly, with each week starting Monday at 00:00:00.  This makes the visuals show what looks like a drastic decrease in loads during the current week.  The rolling weekly data shows 2-3k total document loads per week, but the Monday morning report (run at 3 AM) shows only 78 document loads for this week because it only sees 3 hours of the week.
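To make the effect concrete, here is a minimal sketch using made-up data. The constant rate of 15 loads per hour and the helper names `week_start` and `weekly_totals` are assumptions for illustration only, not part of the reporting pipeline.

```python
from datetime import datetime, timedelta

def week_start(ts: datetime) -> datetime:
    """Return Monday 00:00:00 of the week containing ts."""
    monday = ts - timedelta(days=ts.weekday())
    return monday.replace(hour=0, minute=0, second=0, microsecond=0)

def weekly_totals(events, as_of):
    """Sum loads per Monday-start week, using only events before as_of."""
    totals = {}
    for ts, loads in events:
        if ts < as_of:
            key = week_start(ts).date()
            totals[key] = totals.get(key, 0) + loads
    return totals

# Hypothetical data: a constant 15 loads per hour for two full weeks.
start = datetime(2023, 3, 6)  # a Monday
events = [(start + timedelta(hours=h), 15) for h in range(14 * 24)]

# The report runs on the second Monday at 3 AM, so it sees only 3 hours of that week.
report_time = datetime(2023, 3, 13, 3)
totals = weekly_totals(events, report_time)
for week, loads in sorted(totals.items()):
    print(week, loads)
```

The completed week sums all 168 hours while the current week sums only 3, so the plot drops sharply even though the underlying rate never changed.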

**Solution to Develop**

Solve this problem with an enhancement!  Use a time series forecasting model to forecast the expected number of document loads for this week (and next), and overlay the forecast on the plot as a goal.

**Approach**

Use BigQuery ML with `model_type = 'ARIMA_PLUS'` to create the forecasting model. Use a forecast horizon of 2 weeks and retrain the model each Monday.  Each new run replaces the second week of the previous forecast with the latest forecast.
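The overlap between consecutive Monday runs can be sketched with plain dates. This is only an illustration of the schedule described above; `forecast_window` is a hypothetical helper and the run dates are example values.

```python
from datetime import date, timedelta

HORIZON_DAYS = 14  # the 2-week forecast horizon

def forecast_window(run_date: date) -> list[date]:
    """Dates covered by a forecast created on a Monday run."""
    if run_date.weekday() != 0:
        raise ValueError("the forecast is expected to run on a Monday")
    return [run_date + timedelta(days=i) for i in range(HORIZON_DAYS)]

first = forecast_window(date(2023, 3, 6))
second = forecast_window(date(2023, 3, 13))

# The second week of the first run is the first week of the second run.
overlap = sorted(set(first) & set(second))
print(overlap[0], "to", overlap[-1], f"({len(overlap)} days)")
```

Each date is therefore forecast twice: once as "next week" and again, with one more week of actuals, as "this week".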

---
## Colab Setup
To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/architectures/tracking/setup/ga4/GA4%20Forecasting.ipynb) and run the cells in this section.  Otherwise, skip this section. 

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'vertex-ai-mlops-369716' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

Updated property [core/project].


---
## Setup

In [3]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'vertex-ai-mlops-369716'

In [4]:
BQ_PROJECT = PROJECT_ID

In [5]:
from google.cloud import bigquery

In [6]:
bq = bigquery.Client(project = PROJECT_ID)

## Build Forecasting Model

This approach to forecasting is covered in detail in the notebook [BQML Univariate Forecasting with ARIMA+](../../../../Applied%20Forecasting/2%20-%20BQML%20Univariate%20Forecasting%20with%20ARIMA+.ipynb).


In [None]:
query = f"""
    CREATE OR REPLACE MODEL `{BQ_PROJECT}.reporting.daily_loads_forecast`
    OPTIONS
      (model_type = 'ARIMA_PLUS',
       time_series_timestamp_col = 'event_date',
       time_series_data_col = 'loads',
       data_frequency = 'DAILY',
       auto_arima_max_order = 5,
       holiday_region = 'US',
       horizon = 14
      ) AS
    SELECT event_date, SUM(loads) as loads
    FROM `{BQ_PROJECT}.reporting.daily_loads`
    WHERE event_date < CURRENT_DATE() # runs on mondays for weeks ending on sundays
    GROUP BY event_date
"""
job = bq.query(query)
job.result()
print(job.state, (job.ended-job.started).total_seconds())

DONE 28.351


## Review the Forecast

In [None]:
query = f"""
    SELECT *, 
        EXTRACT(DATE FROM time_series_timestamp) AS event_date
    FROM ML.EXPLAIN_FORECAST(
        MODEL `{BQ_PROJECT}.reporting.daily_loads_forecast`,
        STRUCT(14 AS horizon, 0.95 AS confidence_level))
    WHERE time_series_type = 'forecast'
    ORDER BY event_date
"""
explain = bq.query(query).to_dataframe()

In [None]:
explain

Unnamed: 0,time_series_timestamp,time_series_type,time_series_data,time_series_adjusted_data,standard_error,confidence_level,prediction_interval_lower_bound,prediction_interval_upper_bound,trend,seasonal_period_yearly,seasonal_period_quarterly,seasonal_period_monthly,seasonal_period_weekly,seasonal_period_daily,holiday_effect,spikes_and_dips,step_changes,event_date
0,2023-03-06 00:00:00+00:00,forecast,439.413839,439.413839,75.157265,0.95,292.371799,586.455879,357.571591,,,,81.842248,,,,,2023-03-06
1,2023-03-07 00:00:00+00:00,forecast,419.042945,419.042945,75.155442,0.95,272.004472,566.081418,290.483868,,,,128.559077,,,,,2023-03-07
2,2023-03-08 00:00:00+00:00,forecast,339.79206,339.79206,75.694538,0.95,191.698868,487.885252,327.491,,,,12.30106,,,,,2023-03-08
3,2023-03-09 00:00:00+00:00,forecast,448.505123,448.505123,84.966313,0.95,282.272093,614.738153,376.412086,,,,72.093037,,,,,2023-03-09
4,2023-03-10 00:00:00+00:00,forecast,376.396008,376.396008,95.0241,0.95,190.485338,562.306677,389.752299,,,,-13.356292,,,,,2023-03-10
5,2023-03-11 00:00:00+00:00,forecast,219.626673,219.626673,95.0241,0.95,33.716004,405.537343,391.583197,,,,-171.956523,,,,,2023-03-11
6,2023-03-12 00:00:00+00:00,forecast,280.483508,280.483508,95.0241,0.95,94.572838,466.394177,393.414094,,,,-112.930586,,,,,2023-03-12
7,2023-03-13 00:00:00+00:00,forecast,478.564736,478.564736,95.0241,0.95,292.654066,664.475406,395.244991,,,,83.319745,,,,,2023-03-13
8,2023-03-14 00:00:00+00:00,forecast,525.443301,525.443301,95.0241,0.95,339.532632,711.353971,397.075888,,,,128.367413,,,,,2023-03-14
9,2023-03-15 00:00:00+00:00,forecast,409.777768,409.777768,95.0241,0.95,223.867098,595.688438,398.906786,,,,10.870983,,,,,2023-03-15


## Integrate the data with the daily_loads table

**Issue**
- The `daily_loads` table has one row per `file` per `event_date`
- This forecast is an overall total for each `event_date`
- The correct way would be hierarchical forecasting, but I am not as concerned with actual file-level forecasting.  Really just want the data to show up as target goals on a report!

**Approach**
- Get a distinct list of files from the last 28 days with at least 28 loads, along with the percentage of those loads that came from each file.  The percentages add to 100% and represent the recent distribution of loads.
- Join the distribution with the forecast, multiply the percent*forecast to get the share of the forecast.
- What about confidence intervals....
  - still focused on a target line for reporting, do we really need the intervals?
  - keep the full interval width but center it around the adjustment (remember to truncate at zero for this)?
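As a rough sketch of the allocation idea, the snippet below splits one overall forecast across files by recent share and scales the interval half-width by the same share. Truncating the lower bound at zero is one possible answer to the question above, not something the queries in this notebook do. The function names, file names, and numbers are all hypothetical.

```python
from statistics import NormalDist

def file_shares(loads_by_file, min_loads):
    """Share of recent loads per file, keeping only files with enough loads."""
    kept = {f: n for f, n in loads_by_file.items() if n >= min_loads}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {f: n / total for f, n in kept.items()}

def allocate(forecast, standard_error, shares, confidence=0.95):
    """Split an overall forecast across files; clip lower bounds at zero."""
    # Two-sided normal quantile, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    rows = {}
    for f, share in shares.items():
        center = forecast * share
        half_width = z * standard_error * share
        rows[f] = (center, max(0.0, center - half_width), center + half_width)
    return rows

# Hypothetical recent loads; c.ipynb falls below the threshold and is dropped.
recent = {"a.ipynb": 60, "b.ipynb": 40, "c.ipynb": 5}
shares = file_shares(recent, min_loads=28)
rows = allocate(forecast=400.0, standard_error=300.0, shares=shares)
for f, (loads, lower, upper) in rows.items():
    print(f, round(loads, 1), round(lower, 1), round(upper, 1))
```

Because the shares sum to 1, the per-file forecasts add back up to the overall forecast. The clipped bounds no longer add up to the overall interval, which is acceptable if the goal is only a target line on a report.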

**Integrating with Existing Table**
- add a column `row_type` with values 'actual' and 'forecast'
  - update the notebook 'GA4 Reporting' to incorporate this column in the table and in the incremental daily loads, which use `row_type = 'actual'`
- forecast data goes in as new rows with `row_type = 'forecast'`

In [7]:
query = f"""
  WITH
    FORECAST AS (
      SELECT
        time_series_data as loads, 
        standard_error,
        EXTRACT(DATE FROM time_series_timestamp) AS event_date
      FROM ML.EXPLAIN_FORECAST(
        MODEL `{BQ_PROJECT}.reporting.daily_loads_forecast`,
        STRUCT(14 AS horizon, 0.95 AS confidence_level)) 
      WHERE time_series_type = 'forecast'
    ),
    ACTUALS AS (
      SELECT
        file,
        SUM(loads) as loads
      FROM `{BQ_PROJECT}.reporting.daily_loads`
      WHERE
        event_date < CURRENT_DATE()
        and event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
        and row_type = 'actual'
      GROUP BY file
    ),
    DISTRIB AS (
      SELECT
        file,
        loads / SUM(loads) OVER() as percent
      FROM ACTUALS
      WHERE loads >= 28
    )
  SELECT
    event_date,
    file,
    loads*percent as loads,
    loads*percent - 1.96*standard_error*percent as lower_bound,
    loads*percent + 1.96*standard_error*percent as upper_bound
  FROM FORECAST CROSS JOIN DISTRIB
  ORDER BY event_date, loads
"""
bq.query(query = query).to_dataframe()

Unnamed: 0,event_date,file,loads,lower_bound,upper_bound
0,2023-03-20,statmike/vertex-ai-mlops/04 - scikit-learn/04d...,0.836042,0.417820,1.254263
1,2023-03-20,statmike/vertex-ai-mlops/architectures/trackin...,0.865900,0.432743,1.299058
2,2023-03-20,statmike/vertex-ai-mlops/Applied Forecasting/B...,0.865900,0.432743,1.299058
3,2023-03-20,statmike/vertex-ai-mlops/architectures/trackin...,0.865900,0.432743,1.299058
4,2023-03-20,statmike/vertex-ai-mlops/03 - BigQuery ML (BQM...,0.925617,0.462587,1.388648
...,...,...,...,...,...
653,2023-04-02,statmike/vertex-ai-mlops/04 - scikit-learn/rea...,4.213703,-3.965629,12.393036
654,2023-04-02,statmike/vertex-ai-mlops/00 - Setup/readme.md,4.854447,-4.568650,14.277544
655,2023-04-02,statmike/vertex-ai-mlops/02 - Vertex AI AutoML...,6.190234,-5.825796,18.206264
656,2023-04-02,statmike/vertex-ai-mlops/Dev/BQML Predictions ...,12.858311,-12.101302,37.817923


## Alter Existing table to make room for forecast

The current table `{BQ_PROJECT}.reporting.daily_loads` needs more columns: `row_type`, `lower_bound`, `upper_bound`.  Also, change data type of `loads` column to FLOAT64. Fill value for `row_type` with 'actual' for all existing rows.

In [None]:
query = f"""
ALTER TABLE `{BQ_PROJECT}.reporting.daily_loads`
  ADD COLUMN IF NOT EXISTS row_type STRING,
  ADD COLUMN IF NOT EXISTS lower_bound FLOAT64,
  ADD COLUMN IF NOT EXISTS upper_bound FLOAT64,
  ALTER COLUMN loads SET DATA TYPE FLOAT64
;
UPDATE `{BQ_PROJECT}.reporting.daily_loads`
  SET row_type = 'actual'
  WHERE 1=1
;
"""
job = bq.query(query = query)
job.result()
print(job.state, (job.ended-job.started).total_seconds())

DONE 3.656


## Append Forecast to Existing Table

Run the first forecast (done above) and add values to `{BQ_PROJECT}.reporting.daily_loads` with `row_type = 'forecast'`.

In [8]:
query = f"""
INSERT INTO `{BQ_PROJECT}.reporting.daily_loads`
  WITH
    FORECAST AS (
      SELECT
        time_series_data as loads, 
        standard_error,
        EXTRACT(DATE FROM time_series_timestamp) AS event_date
      FROM ML.EXPLAIN_FORECAST(
        MODEL `{BQ_PROJECT}.reporting.daily_loads_forecast`,
        STRUCT(14 AS horizon, 0.95 AS confidence_level)) 
      WHERE time_series_type = 'forecast'
    ),
    ACTUALS AS (
      SELECT
        file,
        SUM(loads) as loads
      FROM `{BQ_PROJECT}.reporting.daily_loads`
      WHERE
        event_date < CURRENT_DATE()
        and event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
        and row_type = 'actual'
      GROUP BY file
    ),
    DISTRIB AS (
      SELECT
        file,
        loads / SUM(loads) OVER() as percent
      FROM ACTUALS
      WHERE loads >= 28
    )
  SELECT
    event_date,
    file,
    loads*percent as loads,
    'forecast' as row_type,
    loads*percent - 1.96*standard_error*percent as lower_bound,
    loads*percent + 1.96*standard_error*percent as upper_bound
  FROM FORECAST CROSS JOIN DISTRIB
  ORDER BY event_date, loads
"""
job = bq.query(query = query)
job.result()
print(job.state, (job.ended-job.started).total_seconds())

DONE 1.956


## Updated Daily Loads Increment Query - BQ Scheduled Query

Modify the BQ Scheduled Query developed in `GA4 Reporting` to account for `row_type = 'actual'`.  It should not impact rows with `row_type = 'forecast'`.

In [None]:
BQ_DATASET = 'analytics_343629755'

In [None]:
query = f"""
DELETE
  FROM `{BQ_PROJECT}.reporting.daily_loads`
  WHERE
    event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    AND row_type = 'actual';
INSERT INTO `{BQ_PROJECT}.reporting.daily_loads`
  WITH
    RAW AS (
      SELECT event_timestamp, event_params
      FROM `{BQ_PROJECT}.{BQ_DATASET}.events_intraday_*`
      WHERE
        _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
    ),
    DAILY AS (
      SELECT
        event_timestamp,
        STRING_AGG(keys.value.string_value, '/' ORDER BY keys.key) as file
      FROM RAW, UNNEST(event_params) as keys
      WHERE keys.key IN ('page_location', 'page_title')
      GROUP BY event_timestamp
    )
  SELECT
    EXTRACT(DATE FROM TIMESTAMP_MICROS(event_timestamp) AT TIME ZONE '-05') as event_date, 
    file, 
    COUNT(*) as loads,
    'actual' as row_type,
    NULL as lower_bound,
    NULL as upper_bound
  FROM DAILY
  GROUP BY event_date, file
"""
print(query)


DELETE
  FROM `vertex-ai-mlops-369716.reporting.daily_loads`
  WHERE
    event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    AND row_type = 'actual';
INSERT INTO `vertex-ai-mlops-369716.reporting.daily_loads`
  WITH
    RAW AS (
      SELECT event_timestamp, event_params
      FROM `vertex-ai-mlops-369716.analytics_343629755.events_intraday_*`
      WHERE
        _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
    ),
    DAILY AS (
      SELECT
        event_timestamp,
        STRING_AGG(keys.value.string_value, '/' ORDER BY keys.key) as file
      FROM RAW, UNNEST(event_params) as keys
      WHERE keys.key IN ('page_location', 'page_title')
      GROUP BY event_timestamp
    )
  SELECT
    EXTRACT(DATE FROM TIMESTAMP_MICROS(event_timestamp) AT TIME ZONE '-05') as event_date, 
    file, 
    COUNT(*) as loads,
    'actual' as row_type,
    NULL as lower_bound,
    NULL as upper_bound
  FROM DAILY
  GROUP BY event_date, file



In [None]:
job = bq.query(query = query)
job.result()
print(job.state, (job.ended-job.started).total_seconds())

DONE 5.38


## New Weekly Forecast Increment Query - BQ Scheduled Query

Make a new BQ Scheduled Query that:
- creates the forecasting model with a 2-week horizon
- deletes forecast rows for today forward, keeping last week's forecasted values
- inserts 2 weeks of forecast rows

Details:
- BQ Scheduled Query
- name = `daily_loads - forecast weekly`
- schedule: every Monday at 3 AM EST (08:00 UTC)
- email on failure
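The delete-then-insert step can be simulated in memory to check that last week's forecast survives a refresh. This is a simplified sketch: it uses one row per date rather than one row per file per date, and `refresh_forecast` and `make_forecast` are hypothetical helpers, not part of the scheduled query.

```python
from datetime import date, timedelta

def make_forecast(run_date, value, horizon=14):
    """Build placeholder forecast rows starting on run_date."""
    return [
        {"event_date": run_date + timedelta(days=i), "row_type": "forecast", "loads": value}
        for i in range(horizon)
    ]

def refresh_forecast(rows, run_date, new_forecast):
    """Drop forecast rows dated run_date or later, then append the new forecast."""
    kept = [
        r for r in rows
        if not (r["row_type"] == "forecast" and r["event_date"] >= run_date)
    ]
    return kept + new_forecast

# First run on Monday 2023-03-06, second run one week later.
table = make_forecast(date(2023, 3, 6), 100.0)
table = refresh_forecast(table, date(2023, 3, 13), make_forecast(date(2023, 3, 13), 120.0))

by_date = {r["event_date"]: r["loads"] for r in table}
print(len(table), by_date[date(2023, 3, 12)], by_date[date(2023, 3, 13)])
```

After the second run the table holds the first week of the old forecast plus the full new two-week forecast, with no duplicate dates.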


In [9]:
query = f"""
    # Create the forecast model using data before today
    CREATE OR REPLACE MODEL `{BQ_PROJECT}.reporting.daily_loads_forecast`
    OPTIONS
      (model_type = 'ARIMA_PLUS',
       time_series_timestamp_col = 'event_date',
       time_series_data_col = 'loads',
       data_frequency = 'DAILY',
       auto_arima_max_order = 5,
       holiday_region = 'US',
       horizon = 14
      ) AS
    SELECT event_date, SUM(loads) as loads
    FROM `{BQ_PROJECT}.reporting.daily_loads`
    WHERE
      event_date < CURRENT_DATE() # runs on mondays for weeks ending on sundays
      AND row_type = 'actual'
    GROUP BY event_date;

    # drop row for forecast for today forward (preserve last weeks forecast)
    DELETE
      FROM `{BQ_PROJECT}.reporting.daily_loads`
      WHERE
        event_date >= CURRENT_DATE()
        AND row_type = 'forecast';

    # insert forecast rows for today forward - 2 week horizon
    INSERT INTO `{BQ_PROJECT}.reporting.daily_loads`
      WITH
        FORECAST AS (
          SELECT
            time_series_data as loads, 
            standard_error,
            EXTRACT(DATE FROM time_series_timestamp) AS event_date
          FROM ML.EXPLAIN_FORECAST(
            MODEL `{BQ_PROJECT}.reporting.daily_loads_forecast`,
            STRUCT(14 AS horizon, 0.95 AS confidence_level)) 
          WHERE time_series_type = 'forecast'
        ),
        ACTUALS AS (
          SELECT
            file,
            SUM(loads) as loads
          FROM `{BQ_PROJECT}.reporting.daily_loads`
          WHERE
            event_date < CURRENT_DATE()
            and event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
            and row_type = 'actual'
          GROUP BY file
        ),
        DISTRIB AS (
          SELECT
            file,
            loads / SUM(loads) OVER() as percent
          FROM ACTUALS
          WHERE loads >= 28
        )
      SELECT
        event_date,
        file,
        loads*percent as loads,
        'forecast' as row_type,
        loads*percent - 1.96*standard_error*percent as lower_bound,
        loads*percent + 1.96*standard_error*percent as upper_bound
      FROM FORECAST CROSS JOIN DISTRIB
      ORDER BY event_date, loads;
"""
print(query)


    # Create the forecast model using data before today
    CREATE OR REPLACE MODEL `vertex-ai-mlops-369716.reporting.daily_loads_forecast`
    OPTIONS
      (model_type = 'ARIMA_PLUS',
       time_series_timestamp_col = 'event_date',
       time_series_data_col = 'loads',
       data_frequency = 'DAILY',
       auto_arima_max_order = 5,
       holiday_region = 'US',
       horizon = 14
      ) AS
    SELECT event_date, SUM(loads) as loads
    FROM `vertex-ai-mlops-369716.reporting.daily_loads`
    WHERE
      event_date < CURRENT_DATE() # runs on mondays for weeks ending on sundays
      AND row_type = 'actual'
    GROUP BY event_date;

    # drop row for forecast for today forward (preserve last weeks forecast)
    DELETE
      FROM `vertex-ai-mlops-369716.reporting.daily_loads`
      WHERE
        event_date >= CURRENT_DATE()
        AND row_type = 'forecast';

    # insert forecast rows for today forward - 2 week horizon
    INSERT INTO `vertex-ai-mlops-369716.reporting.daily_load

In [None]:
job = bq.query(query = query)
job.result()
print(job.state, (job.ended-job.started).total_seconds())

DONE 18.226


## Add fitted forecast history to the table

For this first forecast run, add the fitted forecast history to the `daily_loads` table.  Going forward, leave these rows in place rather than replacing them.


In [None]:
query = f"""
    SELECT *, 
        EXTRACT(DATE FROM time_series_timestamp) AS event_date
    FROM ML.EXPLAIN_FORECAST(
        MODEL `{BQ_PROJECT}.reporting.daily_loads_forecast`,
        STRUCT(14 AS horizon, 0.95 AS confidence_level))
    #WHERE time_series_type = 'forecast'
    ORDER BY event_date
"""
explain = bq.query(query).to_dataframe()

In [None]:
explain

Unnamed: 0,time_series_timestamp,time_series_type,time_series_data,time_series_adjusted_data,standard_error,confidence_level,prediction_interval_lower_bound,prediction_interval_upper_bound,trend,seasonal_period_yearly,seasonal_period_quarterly,seasonal_period_monthly,seasonal_period_weekly,seasonal_period_daily,holiday_effect,spikes_and_dips,step_changes,event_date
0,2022-11-27 00:00:00+00:00,history,62.000000,62.000000,74.095682,,,,0.000000,,,,-113.547110,,,,,2022-11-27
1,2022-11-28 00:00:00+00:00,history,280.000000,238.018260,74.095682,,,,177.377815,,,,60.640445,,,,,2022-11-28
2,2022-11-29 00:00:00+00:00,history,396.000000,335.466761,74.095682,,,,202.949285,,,,132.517476,,,,,2022-11-29
3,2022-11-30 00:00:00+00:00,history,152.000000,249.890910,74.095682,,,,218.020991,,,,31.869919,,,,,2022-11-30
4,2022-12-01 00:00:00+00:00,history,327.000000,260.687555,74.095682,,,,195.142938,,,,65.544617,,,,,2022-12-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108,2023-03-15 00:00:00+00:00,forecast,409.777768,409.777768,95.024100,0.95,223.867098,595.688438,398.906786,,,,10.870983,,,,,2023-03-15
109,2023-03-16 00:00:00+00:00,forecast,473.354436,473.354436,95.024100,0.95,287.443766,659.265106,400.737683,,,,72.616754,,,,,2023-03-16
110,2023-03-17 00:00:00+00:00,forecast,393.766263,393.766263,95.024100,0.95,207.855593,579.676933,402.568580,,,,-8.802317,,,,,2023-03-17
111,2023-03-18 00:00:00+00:00,forecast,227.207477,227.207477,95.024100,0.95,41.296807,413.118147,404.399477,,,,-177.192000,,,,,2023-03-18


In [10]:
query = f"""
    SELECT 
        EXTRACT(DATE FROM time_series_timestamp) AS event_date,
        'forecast overall' as file,
        time_series_adjusted_data as loads,
        'forecast' as row_type,
        prediction_interval_lower_bound as lower_bound,
        prediction_interval_upper_bound as upper_bound
    FROM ML.EXPLAIN_FORECAST(
        MODEL `{BQ_PROJECT}.reporting.daily_loads_forecast`,
        STRUCT(14 AS horizon, 0.95 AS confidence_level))
    WHERE time_series_type = 'history'
    ORDER BY event_date
"""
print(query)


    SELECT 
        EXTRACT(DATE FROM time_series_timestamp) AS event_date,
        'forecast overall' as file,
        time_series_adjusted_data as loads,
        'forecast' as row_type,
        prediction_interval_lower_bound as lower_bound,
        prediction_interval_upper_bound as upper_bound
    FROM ML.EXPLAIN_FORECAST(
        MODEL `vertex-ai-mlops-369716.reporting.daily_loads_forecast`,
        STRUCT(14 AS horizon, 0.95 AS confidence_level))
    WHERE time_series_type = 'history'
    ORDER BY event_date



In [11]:
bq.query(query = query).to_dataframe()

Unnamed: 0,event_date,file,loads,row_type,lower_bound,upper_bound
0,2022-11-27,forecast overall,62.000000,forecast,,
1,2022-11-28,forecast overall,236.187373,forecast,,
2,2022-11-29,forecast overall,333.191033,forecast,,
3,2022-11-30,forecast overall,244.476569,forecast,,
4,2022-12-01,forecast overall,255.906663,forecast,,
...,...,...,...,...,...,...
108,2023-03-15,forecast overall,321.416249,forecast,,
109,2023-03-16,forecast overall,357.797028,forecast,,
110,2023-03-17,forecast overall,398.623647,forecast,,
111,2023-03-18,forecast overall,307.159428,forecast,,


In [12]:
query = f"""
INSERT INTO `{BQ_PROJECT}.reporting.daily_loads`
    SELECT 
        EXTRACT(DATE FROM time_series_timestamp) AS event_date,
        'forecast overall' as file,
        time_series_adjusted_data as loads,
        'forecast' as row_type,
        prediction_interval_lower_bound as lower_bound,
        prediction_interval_upper_bound as upper_bound
    FROM ML.EXPLAIN_FORECAST(
        MODEL `{BQ_PROJECT}.reporting.daily_loads_forecast`,
        STRUCT(14 AS horizon, 0.95 AS confidence_level))
    WHERE time_series_type = 'history'
    ORDER BY event_date
"""
job = bq.query(query = query)
job.result()
print(job.state, (job.ended-job.started).total_seconds())

DONE 1.527
