Revitalize the collection of PUDL usage metrics #128

jdangerx · 2024-06-10T19:44:27Z

Overview

In order to better trace the development of PUDL, the success of our outreach efforts and the effects of our new Superset instance, we need to revitalize the pudl-usage-metrics repository and collect usage metrics from the following sources:

S3
Datasette (until retirement)
Superset
Zenodo
Kaggle
Github

We're interested in the following types of metrics to start:

how many different IPs are accessing our data via each method?
what tables and versions of the data are people accessing?

As a first step, we should be able to ETL the logs and metrics from each of these data sources and get a weekly summary that we can look at. As a second step, we want to hook up our metrics to a private Superset dataset and build some dashboards for easy interpretation.

Out of scope

Google Analytics - we already have a dashboard we can use to look at these metrics!
Migrating out of CloudSQL to another storage backend
Setting up a permanent dagster server with sensors and schedules

Infrastructure

The pudl-usage-metrics repository hasn't been maintained for a while. We'll need to get it up to speed to support this development work.

Infrastructure tasks

Give feedback

Update packages Move from setup.py to pyproject.toml, switch to using ruff and tox, bump python version and update pre-commit hooks #119
The repo was created with an older version of the cheshire template. Update the packaging methodology and pre-commit hooks. Replace setup.py with pyproject.toml and conda for mamba. Move from setup.py to pyproject.toml, switch to using ruff and tox, bump python version and update pre-commit hooks #119
Update Python version to 3.12 Move from setup.py to pyproject.toml, switch to using ruff and tox, bump python version and update pre-commit hooks #119
Add pudl usage metrics gcp infrastructure pudl#3841

0 of 7

cloud superset
Options

S3 Logs

Our main programmatic access method. S3 logs are currently mirrored to a GCS bucket. Each request produces one log.

S3 Usage Metrics

Give feedback

If config to mirror S3 bucket to GCS is in S3, start with getting existing code to run locally on latest S3 logs
Put in devtools and run using a GHA weekly? Or update the usage metrics repo? Decide
Clean up pudl-usage-metrics repo: see Revamp pudl-usage-metrics repo #118
Configure maximum run concurrency
Update existing code and add to pudl-usage-metrics repo
Create a Dagster ETL pipeline for S3 usage metrics #140

8 of 8

s3
Automate local S3 ETL in cloud #147

13 of 13

github_actions s3
Options

Datasette

While we're planning to retire Datasette, it'd still be helpful to understand the history of usage and to see how usage changes during the transition to Superset. The log ETL that exists in pudl-usage-metrics hasn't worked since the transition to fly.io.

fly.io currently doesn't retain logs for a long time so we need to use the https://github.com/superfly/fly-log-shipper fly log shipper to send logs to S3.

It also doesn't log out the IP address of the datasette requests - guessing that the IP currently logged is the load balancer IP. Usually the load balancer includes some sort of "forwarded this request from original IP" information in the headers, so we should be able to extract that somehow. Seems like we can't configure the datasette access logs so we'll need to set it up behind something we can configure, like NGINX.

Datasette Tasks

Give feedback

Ship datasette metrics from fly.io using fly-log-shipper #148

0 of 2

datasette
write a script to parse logs into timestamp/IP/resource records
compare & contrast to old Datasette logs
Update datasette ETL to use assets rather than ops, and to conform to the format of the S3 processing
Options

Superset

We're slowly deploying a new data visualization tool! It'll give us a lot of usage information, which we should process and handle. See https://engineering.hometogo.com/monitor-superset-usage-via-superset-c7f9fba79525 for a template.

Superset Tasks

Give feedback

Figure out how Superset logs usage metrics and how long usage metrics are retained for.
If retention timeline is short, figure out strategy for archiving logs periodically.
See https://github.com/hometogo/hometogo-data-code-snippets/tree/main/superset/usage_dashboard
Options

Zenodo

Zenodo API calls return stats on views and downloads for a record at a particular point in time. We should periodically (weekly?) collect stats on all of our archives on Zenodo and archive them for later processing.

Zenodo Tasks

Give feedback

Write a script to collect stats for all Zenodo archives and save them to a GCS bucket
Write an ETL script to combine and process these archived records
Options

Kaggle

Kaggle collects data on views and downloads through its dataset metadata JSON, accessible through the api.metadata_get(KAGGLE_OWNER, KAGGLE_DATASET) call from the KaggleApi. Like Zenodo, this is data reported at the time of query, so we'll need to archive these metrics to see changes over time.

Kaggle

Give feedback

Write a GHA to archive the metadata file daily into a Kaggle Metrics GCS bucket
ETL this data, keeping only the relevant metrics fields: ["info.usabilityRatingNullable", "info.totalViews", "info.totalVotes", "info.totalDownloads"]
Investigate whether it is possible to pull kernel/notebook metrics as well
Options

Github

Migrate our Github metrics archiving from the business repository, and add it to our ETL.

Github

Give feedback

Move Github metric archive out of business repo #161

2 of 4

github_actions
ETL this data, keeping only the relevant metrics fields
Options

Reporting and Visualization

Once the data is processed, we'll need to analyze and report on metrics of interest in order to interpret changes in usage and highlight trends of interest.

Some interesting references for Superset usage dashboards can be found here.

Reporting and Summaries

Give feedback

Define usage metrics of interest across data sources
Write a script to output a weekly usage summary
Write a GHA to produce and upload this weekly summary to GCS every week
Hook up the database to Superset
Create some dashboards
Options

The text was updated successfully, but these errors were encountered:

jdangerx · 2024-07-01T19:52:56Z

We should timebox this to 5h and prioritize getting S3 parquet logs because of the possibility of replacing datasette altogether.

bendnorman · 2024-07-02T22:02:13Z

I think revamping the pudl-usage-metrics repo will take some work. Maybe we can simplify the task by "disabling" the current metrics in the ETL:

old datasette logs from the Cloud Run days which are probably still helpful for us
intake catalog logs which we never really utilized

and integrate just the s3 logs since those are the highest value / most relevant rn. I opened a PR with my janky s3 log download script and notebook.

The ETl generally works like this:

Pull some logs from GCS
Does some cleaning with pandas and dagster
Load the cleaned logs into Cloud SQL postgres

I have a github action that processes the latest logs and loads them to Cloud SQL. Cloud SQL is kind of expensive so it might make more sense to use BQ.

I think it makes sense to create a quick design doc for the usage metrics revamp, given there is a lot we could do.

e-belfer · 2024-08-14T15:22:28Z

I've updated this issue to be an epic reflecting all our logs and possible workflows, and have tried to structure out smaller steps in the tasklists.

zaneselvans · 2024-09-16T14:57:13Z

@e-belfer was this issue supposed to get closed by #162?

e-belfer · 2024-09-16T15:14:45Z

Definitely not!

This was referenced Jun 12, 2024

Add staging environment catalyst-cooperative/pudl#3666

Merged

Log datasette access IPs catalyst-cooperative/pudl#3669

Merged

e-belfer self-assigned this Jul 1, 2024

e-belfer transferred this issue from catalyst-cooperative/pudl Jul 17, 2024

e-belfer mentioned this issue Jul 31, 2024

Create a Dagster ETL pipeline for S3 usage metrics #140

Merged

e-belfer added Epic github_actions Pull requests that update GitHub Actions code s3 Relating to S3 usage metrics datasette Relating to Datasette usage metrics superset Relating to Superset usage metrics labels Aug 14, 2024

e-belfer changed the title ~~Get basic user metrics we technically have access to~~ Revitalize the collection of PUDL usage metrics Aug 14, 2024

e-belfer mentioned this issue Aug 14, 2024

Revamp pudl-usage-metrics repo #118

Closed

e-belfer linked a pull request Sep 13, 2024 that will close this issue

Archive raw Kaggle and Github metrics daily #162

Merged

e-belfer closed this as completed in #162 Sep 16, 2024

e-belfer reopened this Sep 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revitalize the collection of PUDL usage metrics #128

Revitalize the collection of PUDL usage metrics #128

jdangerx commented Jun 10, 2024 •

edited by e-belfer

Loading

Infrastructure tasks

S3 Usage Metrics

Datasette Tasks

Superset Tasks

Zenodo Tasks

Kaggle

Github

Reporting and Summaries

jdangerx commented Jul 1, 2024

bendnorman commented Jul 2, 2024

e-belfer commented Aug 14, 2024

zaneselvans commented Sep 16, 2024

e-belfer commented Sep 16, 2024

Revitalize the collection of PUDL usage metrics #128

Revitalize the collection of PUDL usage metrics #128

Comments

jdangerx commented Jun 10, 2024 • edited by e-belfer Loading

Overview

Out of scope

Infrastructure

Infrastructure tasks

S3 Logs

S3 Usage Metrics

Datasette

Datasette Tasks

Superset

Superset Tasks

Zenodo

Zenodo Tasks

Kaggle

Kaggle

Github

Github

Reporting and Visualization

Reporting and Summaries

jdangerx commented Jul 1, 2024

bendnorman commented Jul 2, 2024

e-belfer commented Aug 14, 2024

zaneselvans commented Sep 16, 2024

e-belfer commented Sep 16, 2024

jdangerx commented Jun 10, 2024 •

edited by e-belfer

Loading