Scripts to dump mozaggregator to BigQuery
mozaggregator2bq

A set of scripts for loading Firefox Telemetry aggregates into BigQuery. These aggregates power the Telemetry Dashboard and Evolution Viewer.

Overview

Install the required dependencies. Here, we use a Google Cloud VM (an n1-standard-4 instance) running a CentOS 8 image.

sudo dnf install \
    git \
    postgresql \
    java-1.8.0-openjdk \
    jq \
    tmux

Start a new tmux session, which can be detached and reattached from a new SSH session. Then verify that pg_dump is available and set up a Python virtual environment.

tmux

# pg_dump is included in postgresql or postgresql-client
pg_dump --version
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Interacting with the database

To start a psql instance with the read-only replica of the production Postgres instance, run the following commands. Ensure that you have the appropriate AWS credentials.

source scripts/export_credentials_s3

PGPASSWORD=$POSTGRES_PASS psql \
    --host="$POSTGRES_HOST" \
    --username="$POSTGRES_USER" \
    --dbname="$POSTGRES_DB"

Some example queries:

-- list all aggregates by build_id
select tablename
from pg_catalog.pg_tables
where schemaname='public' and tablename like 'build_id%';

--  build_id_aurora_0_20130414
--  build_id_aurora_0_20150128
--  build_id_aurora_0_20150329
--  build_id_aurora_1_20130203
--  build_id_aurora_1_20150604
-- ...

-- list all aggregates by submission_date
select tablename
from pg_catalog.pg_tables
where schemaname='public' and tablename like 'submission_date%';

--  submission_date_beta_1_20151027
--  submission_date_nightly_40_20151029
--  submission_date_beta_39_20151027
--  submission_date_nightly_1_20151025
--  submission_date_nightly_39_20151031
-- ...
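The table names encode the aggregate type, channel, major version, and date, so they can be parsed mechanically. A minimal sketch of such a parser (the regex and function name are illustrative, not part of this repository):

```python
import re

# Table names look like: submission_date_beta_1_20151027
#                        build_id_aurora_0_20130414
TABLE_RE = re.compile(
    r"^(?P<agg>submission_date|build_id)_"
    r"(?P<channel>[a-z]+)_"
    r"(?P<version>\d+)_"
    r"(?P<date>\d{8})$"
)

def parse_table_name(name):
    """Split an aggregate table name into its components."""
    m = TABLE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized table name: {name}")
    return m.groupdict()

print(parse_table_name("submission_date_beta_1_20151027"))
# → {'agg': 'submission_date', 'channel': 'beta', 'version': '1', 'date': '20151027'}
```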

Database dumps by aggregate type and date

To start dumping data, run the following commands.

source scripts/export_credentials_s3

time DATA_DIR=data AGGREGATE_TYPE=submission DS_NODASH=20191201 scripts/pg_dump_by_day
# 23.92s user 1.97s system 39% cpu 1:05.48 total

time DATA_DIR=data AGGREGATE_TYPE=build_id DS_NODASH=20191201 scripts/pg_dump_by_day
# 3.47s user 0.49s system 24% cpu 16.188 total

This should result in gzipped files in the following hierarchy.

data
├── [  96]  build_id
│   └── [ 128]  20191201
│       ├── [8.4M]  474306.dat.gz
│       └── [1.6K]  toc.dat
└── [  96]  submission
    └── [3.2K]  20191201
        ├── [ 74K]  474405.dat.gz
        ├── [ 48K]  474406.dat.gz
        ....
        ├── [1.8M]  474504.dat.gz
        └── [ 93K]  toc.dat

4 directories, 103 files

See the pg_dump documentation for details on the file format.

$ gzip -cd data/submission/20191201/474405.dat.gz | head -n3
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_INSTANTIATED_FLAG", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"}    {0,2,0,2,2}
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_CONSUMERS", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"}    {0,0,0,0,0,0,0,0,0,0,2,0,20,2}
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_ISIMPLEDOM_USAGE_FLAG", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"}        {2,0,0,0,2}
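Each data line pairs a JSON object of dimensions with a Postgres array literal of the aggregate, separated by a tab. A small stdlib-only parser, assuming that two-column layout (the function name is illustrative):

```python
import json

def parse_dump_line(line):
    """Split one dump line into its dimensions dict and aggregate values.

    Each line is a tab-separated pair: a JSON object of dimensions, and a
    Postgres array literal such as {0,2,0,2,2}.
    """
    dims_json, aggregate = line.rstrip("\n").split("\t")
    dims = json.loads(dims_json)
    # Convert the Postgres array literal into a list of ints.
    values = [int(v) for v in aggregate.strip("{}").split(",")]
    return dims, values

line = '{"os": "Windows_NT", "metric": "A11Y_INSTANTIATED_FLAG"}\t{0,2,0,2,2}'
dims, values = parse_dump_line(line)
print(values)  # → [0, 2, 0, 2, 2]
```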

Running a notebook

Ensure that the data directory at the top level of the repository matches the one referenced in the notebook, then run the following script.

bin/start-jupyter

This script can be modified to set various Spark configuration parameters, including the default parallelism and the amount of executor memory.

Processing pg_dump into parquet

Run the following script to transform the data dumps into Parquet, where the JSON fields have been transformed into appropriate columns and arrays.

bin/submit-local scripts/pg_dump_to_parquet.py \
    --input-dir data/submission_date/20191201 \
    --output-dir data/parquet/submission_date/20191201
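The flattening performed by pg_dump_to_parquet.py can be illustrated in plain Python: each JSON dimension key becomes a column, and the aggregate is carried along as its own field. This is a stdlib-only sketch of the same transformation, not the repository's Spark implementation; the column list and function name are assumptions based on the dump format shown above.

```python
import gzip
import json
import tempfile

# Dimension columns observed in the dump, plus the aggregate itself.
COLUMNS = ["os", "child", "label", "metric", "osVersion",
           "application", "architecture", "aggregate"]

def rows_from_dump(path):
    """Yield one flat dict per record in a gzipped dump file.

    JSON dimension keys become columns; the Postgres array literal is
    kept as a string. Missing dimensions come out as None.
    """
    with gzip.open(path, "rt") as fh:
        for raw in fh:
            dims_json, aggregate = raw.rstrip("\n").split("\t")
            row = json.loads(dims_json)
            row["aggregate"] = aggregate
            yield {col: row.get(col) for col in COLUMNS}

# Self-contained demo: write a one-record dump, then flatten it.
with tempfile.NamedTemporaryFile(suffix=".dat.gz", delete=False) as tmp:
    with gzip.open(tmp.name, "wt") as fh:
        fh.write('{"os": "Windows_NT", "metric": "A11Y_CONSUMERS"}\t{0,0,2}\n')
    print(next(rows_from_dump(tmp.name))["aggregate"])  # → {0,0,2}
```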

Running backfill

The bin/backfill script will dump data from the Postgres database, transform the data into Parquet, and load the data into a BigQuery table. The current schema for the table is as follows:

Field name Type Mode
ingest_date DATE REQUIRED
aggregate_type STRING NULLABLE
ds_nodash STRING NULLABLE
channel STRING NULLABLE
version STRING NULLABLE
os STRING NULLABLE
child STRING NULLABLE
label STRING NULLABLE
metric STRING NULLABLE
osVersion STRING NULLABLE
application STRING NULLABLE
architecture STRING NULLABLE
aggregate STRING NULLABLE

There is one table for the build ID aggregates and one for the submission date aggregates. Build IDs are truncated to the nearest date.
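Since a build ID is a timestamp of the form YYYYMMDDhhmmss, truncating to the nearest date amounts to keeping the first eight digits. A one-line illustration (the build ID value and function name are made up for the example):

```python
def truncate_build_id(build_id):
    """Truncate a YYYYMMDDhhmmss build id to its YYYYMMDD date prefix."""
    return build_id[:8]

print(truncate_build_id("20191201094711"))  # → 20191201
```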

It may be useful to use a small entry-point script.

#!/bin/bash
set -x
start=2015-06-01
end=2020-04-01
while ! [[ $start > $end ]]; do
    rm -rf data  # -f so the first iteration succeeds when data does not exist yet
    up_to=$(date -d "$start + 1 month" +%F)
    START_DS=$start END_DS=$up_to bash -x bin/backfill
    start=$up_to
done