
Datablock Storage: replace firebase for applab data features #56279

Merged
cnbrenci merged 454 commits into staging from unfirebase on Mar 21, 2024
Conversation

@snickell
Contributor

@snickell snickell commented Feb 5, 2024

Datablock Storage

This PR implements a Rails/MySQL backend to the Applab data features currently backed with Firebase:

  1. The Data Library:
    image
  2. The Data blocks:
    image
  3. The data browser, including KVP editing, Table editing, etc:
    image

This PR does not immediately switch to Datablock Storage: when it is merged, 0% of Applab projects will use it by default. Instead, we envision a gradual rollout concurrent with bugfixing and optimization; see "Rollout plan" below.

Motivation

This project is part of the "Eng Excellence" series. The primary motivation is cost: Firebase is on track to cost us $20k/month, and that figure rises significantly each year. A secondary motivation is consolidating the technology we use: Datablock Storage uses our existing MySQL DB. This provides small advantages, like local dev using your existing DB. With Firebase, all local dev actually shares a single DB, leading to weird behavior: editing the dataset manifests in local dev would actually modify production (!!!). Firebase provides very cool live subscription features, but in practice we did not use them in our curriculum.

File changes in this PR fall into one of three categories:

  1. New MySQL-backed storage implementation, including new models, a controller, and a javascript API.
  2. Refactors to fully encapsulate Firebase-related calls behind an interface that could be implemented in parallel in datablockStorage.js, allowing both code paths to exist alongside each other, switching on configuration when initializing the data store.
    • New methods added to firebaseStorage.js. These are extracted from Firebase-specific code scattered through JSX and applab.js, creating an interface that can then be implemented in datablockStorage.js.
    • storage.js breaks the direct coupling between storage-accessing code and firebaseStorage.js. Many of the changes to files outside this core set simply replace a hardcoded import FirebaseStorage with import {storageBackend} from '../storage'.
  3. Todos and references to follow-up issues, plus work to be done after a full rollout to remove Firebase code (// TODO: post-firebase-cleanup): Remove TODO: post-firebase-cleanup items after full deploy #56994

UX Changes

Ideally this is a backend-only change without student-facing, curriculum-facing or teacher-facing changes.

New features

  • Rather than copying static tables each time they are selected in the dataset browser, Datablock Storage implements copy-on-write semantics. A table is copied only when it is modified. This is because we analyzed the 1TB of JSON we currently have in Firebase and discovered that the majority of it is unchanged copies of our static tables.
  • As a result, static tables and "current" tables (covid, spotify, daily weather) are handled equivalently.
  • An optimized getColumn block doesn't fetch the whole table client-side every time. This is important because curriculum has told us it's the main block they use, and many datasets, like Ramen, are challenging on student machines as a result (timing out, etc.).
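The copy-on-write behavior described above can be sketched in plain Ruby. This is an illustrative in-memory model, not the actual ActiveRecord implementation; the class and method names here are hypothetical:

```ruby
# In-memory sketch of copy-on-write table semantics.
# A project table is either a pointer to a shared (static) table,
# or an owned table with its own private rows.

SHARED_TABLES = {
  'covid' => [{'id' => 1, 'cases' => 100}]
}

class ProjectTable
  def initialize(shared_name)
    @shared_name = shared_name  # non-nil => we are a pointer
    @rows = nil                 # populated only on first write
  end

  # Reads follow the pointer, so unmodified tables cost no extra storage.
  def rows
    @rows || SHARED_TABLES[@shared_name]
  end

  # The first write triggers the copy; later writes reuse the private copy.
  def insert(row)
    @rows ||= SHARED_TABLES[@shared_name].map(&:dup)  # copy-on-write
    @shared_name = nil
    @rows << row
  end
end

t = ProjectTable.new('covid')
t.rows                             # still reading the shared copy
t.insert('id' => 2, 'cases' => 5)  # copies, then appends
t.rows.size                        # => 2; the shared table is untouched
```

This is why selecting a static table in the dataset browser becomes nearly free: until a student mutates the table, only the pointer row exists.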

Rollout plan

This code implements the core of the student-facing data-in-applab experience, but does NOT include the curriculum-facing dataset editing interfaces. During rollout, we'll ask curriculum authors to notify us if they edit datasets (this only happens a few times a year, and we've already started the conversation with them) and we'll manually sync the changes to Datablock Storage.

While the AP CSP create task may result in delaying the final rollout until it's completed, we would like to do at least a test-the-waters rollout prior to it. It's possible we'll decide at that point to fully switch back to Firebase storage, or we might feel confident to go further, depending on how things look.

  1. Test the waters: this PR does NOT switch to Datablock Storage by default. Instead, a DCDO flag controls the percentage of newly created projects that will use Datablock Storage (defaulting to 0%). Initially, we plan to do a 1-day trial with <1% of projects created that day using Datablock Storage. We'll triage any issues we find and analyze the performance impact on our DB and rails backend.
  2. Gradual rollout, fixing and optimizing as needed: from there, we'll determine what we need to optimize now and what we need to fix now, and do so if necessary. Then, slowly, day by day or week by week, we'll increase the percentage of new projects that run on Datablock Storage until 100% are on the new backend.
  3. Once we've got a majority of new projects running on Datablock Storage, we'll migrate the Dataset editor code to Datablock Storage by default.
  4. Migrate old projects: once 100% of new projects are on Datablock Storage, we'll start converting old projects. This process will include deduplicating the many copies of our static tables (currently most of the 1TB of data stored in firebase is copies). More information on migration work can be found here
  5. Post-firebase-cleanup: we've marked extensively throughout the code what to remove when Firebase is no longer needed. After everything is migrated, we'll do a followup PR that removes all Firebase code paths.
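The percentage gate in step 1 could be sketched like this. The flag name and helper are hypothetical; the real code would read the percentage via a DCDO lookup at project-creation time:

```ruby
# Decide, when a new project is created, whether it uses Datablock Storage.
# rollout_percent would come from something like
# DCDO.get('datablock_storage_rollout_percent', 0) in the real code
# (flag name is an assumption); it defaults to 0, so nothing changes
# until the flag is raised.
def use_datablock_storage?(rollout_percent, roll: rand(100))
  roll < rollout_percent  # e.g. rollout_percent = 1 enables ~1% of new projects
end

use_datablock_storage?(0)    # => false: flag at default, no projects migrate
use_datablock_storage?(100)  # => true: full rollout
```

Because the choice is made once per project at creation, raising the flag only affects newly created projects, matching the rollout plan above.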

Data Model

The data model is very simple. We initially thought we might have to try fancy things to deal with the scale of data involved, but we tried importing all 1TB of Firebase data into various MySQL schemas and found a very basic approach to be performant (various queries and data structures were timed and optimized, leading to the ActiveRecord calls we have now).

  • Table/Record data: a single row per table in DatablockStorageTables holds table metadata, including its name, the columns on the table (a JSON blob), and an indicator of whether this is a pointer to a table or a table itself. If it's a pointer, row data will be loaded from the corresponding "shared table". Table records are stored in the DatablockStorageRecords table, one record per row, as a record_id and a JSON-typed column, keyed by a composite key: project_id and table_name.
  • KVP data: a single row-per-kvp in the DatablockStorageKvps table
  • Data Library manifest data: this is stored as a singleton JSON row in the DatablockStorageLibraryManifest table. We decided not to decompose this data because it's already edited, saved, and loaded/used as a single JSON blob in the existing datasets controller.

The most challenging issue was auto-incrementing the record_id relative to the composite key. For this, we lock the first row in the table while computing a new record ID.
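A rough sketch of that allocation scheme, in plain Ruby. This is illustrative only: the real implementation uses a MySQL row lock via ActiveRecord, for which a Mutex stands in here, and the class name is hypothetical:

```ruby
# Records are keyed by (project_id, table_name, record_id); record_id must
# auto-increment per table, which MySQL's AUTO_INCREMENT can't do for a
# composite key. The real code locks the first row of the table while
# computing max(record_id) + 1 so concurrent inserts can't collide.

class RecordStore
  def initialize
    @records = Hash.new { |h, k| h[k] = {} }  # (project_id, table) => {record_id => json}
    @lock = Mutex.new                         # stands in for the DB row lock
  end

  def insert(project_id, table, json)
    @lock.synchronize do
      table_records = @records[[project_id, table]]
      record_id = (table_records.keys.max || 0) + 1  # next id for THIS table only
      table_records[record_id] = json
      record_id
    end
  end
end

store = RecordStore.new
store.insert(1, 'words', {'word' => 'hi'})   # => 1
store.insert(1, 'words', {'word' => 'yo'})   # => 2
store.insert(2, 'words', {'word' => 'new'})  # => 1 (ids are independent per project/table)
```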

Data Migrations

At @bethanyaconnor 's suggestion, we've split the migrations out into a separate PR to be applied first: #57392

Followup Work

This PR implements the core student-facing data features. Anticipated followups include possible backend optimizations and UX bug fixes. Followup work is being tracked on a GitHub kanban board.

Major required followups include:

  1. Rate limiting checks
  2. Converting the datasets controller and associated code to Datablock Storage

Release critical issues can be found here: https://github.com/orgs/code-dot-org/projects/4/views/10

Co-authored-by: Cassi Brenci cassi.brenci@code.org

@snickell snickell requested a review from a team as a code owner February 5, 2024 10:17
@snickell
Contributor Author

snickell commented Feb 5, 2024

This PR is a continuation of #54643, which was mis-closed by the Git LFS migration (#55759).

Previous Comments:

  • snickell commented - Another way of looking at this is not just where we use "firebase" in method names, but where do we actually use the firebase(TM) library. That's limited to two files:
image
  • snickell commented - This is the one function used outside of apps/storage/firebaseMetadata.js that I wonder if we could easily handle through SQL; it feels like a firebase "subscription"... is it?
export function onColumnsChange(database, tableName, callback) {
  getColumnsRef(database, tableName).on('value', snapshot => {
    const columnsData = snapshot.val() || {};
    const columnNames = _.values(columnsData).map(column => column.columnName);
    callback(columnNames);
  });
}

In applab.js this is just used to update data preview pages when columns are renamed; that doesn't seem to necessitate a full firebase-level DB.

The only thing that looks like it might need subscription like functionality is onRecordEvent:

onRecordEvent("MyNewTable", function(record, eventType) {
  console.log(eventType);
  console.log(record);
});
  • All of our APIs update one record at a time, using the record ID, so it's easy on the backend to know exactly what change message to publish
  • onRecordEvent gives you the JSON record contents, including the record.id, and whether it was a create, update, or delete event. Events are all about ONE ROW.
  • snickell commented - Idea: could we /just/ use redis? Like actually use that as our DB?
  • snickell commented - Discussing with @cnbrenci we naively might consider four approaches:
  1. Back our data feature with MySQL and deprecate support for onRecordEvent (see question 1)
  2. Back our data feature with MySQL, applab projects hold an open websocket (or poll???), and use a redis pubsub channel to broadcast (tableId, recordId) change events (or potentially the whole row value, either way), and then rails backend to the websocket sends the new row data to the client along with a changeEvent
  3. Like option 2, but back our data feature directly with redis, skipping the whole MySQL-in-the-middle bit. Like Aurora for MySQL, AWS has a hosted durable redis option, and the pricing is very reasonable: something like $600/mo would get us 12gbps of redis bandwidth.

Questions to answer:

  1. How actively used by student projects is onRecordEvent? How many calls to onRecordEvent do we see per day? How many projects use it? TODO: download all student code that uses onRecordEvent and skim through it to see how frequent it is overall, how it's generally used, etc.
  2. How many concurrent applab users do we see? Can we hold that many websockets open with our existing rails infra? If we don't have an approach to holding lots of websockets open, is there one we can implement? How hard is it?
  • snickell commented - Per a first convo with product, we may be willing to just deprecate onRecordEvent, breaking existing apps that use it. Conventional wisdom is that it's mainly used by chat apps that are technically not allowed under our terms of service anyway, which might explain some of the high cost issues with firebase.

I'd personally like to see how widely used the onRecordEvent block is in Applab projects that have been used/run/loaded in, say, the last 3 months. Ideally "used" would mean loaded even just to run the student applab app as a user, vs developed on; but if we can't get "when was this last run by anyone", we could use "code was changed in the last 6 months" as a proxy.

From this we can compute "percent of active applab projects using the onRecordEvent block", which can then inform a decision to just deprecate onRecordEvent, making option 1 above viable and avoiding the need to hold many websockets open to our rails server (expensive and possibly not scalable).

require_relative './block_finder'
require 'active_support/time'

# Hypothetical entry point; block_finder is sketched, not yet written:
block_finder(block_name: 'onRecord', game: 'applab', last_active_after: 90.days.ago.to_date)

@cnbrenci and I tracked down where applab student source is being saved, and it's in main.json files in S3, e.g.: https://s3.console.aws.amazon.com/s3/object/cdo-v3-sources?region=us-east-1&prefix=sources/50000063/131542020/main.json

  • snickell commented - Say we start by writing a script to determine percent_applabs_actively_using_on_record_event():

def sample_random_student_source(game:, last_active_after:)
  # TO IMPLEMENT
end

def source_contains_block(source, block_type)
  # TO IMPLEMENT
end

def days_to_seconds(days)
  days * 24 * 60 * 60
end

def percent_applabs_actively_using_on_record_event
  num_samples = 1000
  num_samples_using = 0
  num_samples.times do
    source = sample_random_student_source(game: 'applab', last_active_after: Time.now - days_to_seconds(90))
    num_samples_using += 1 if source_contains_block(source, 'onRecordEvent')
  end
  num_samples_using.to_f / num_samples
end

puts "Percent of recent applab projects using 'onRecordEvent':"
puts(percent_applabs_actively_using_on_record_event)


  • implement sample_random_student_source()
    • research (then implement) how to get list of student applab sources active in last N days using rails DB models in dashboard/app/models
    • research (then implement) how to get S3 path to cdo-v3-sources bucket for the main.json given applab source (using more rails DB models I think?)
    • fetch relevant main.json path from S3, possibly using lib/cdo/aws/s3.rb
  • implement source_contains_block()
    • parse main.json, determine if it contains the 'onRecordEvent' block
  • get relevant S3 credentials, and run sampler on real student source code
  • snickell commented - The "Projects" table (project.rb, Project model) has all the data we need:
  • The projects table includes both updated_at and project_type="applab"
  • the ID column of the Projects table is the "storage_app_id", which is the SECOND number in the S3 paths
  • the storage_id column of the projects table is the FIRST number in the S3 paths

So we can go:

project = Project.where(project_type: 'applab', updated_at: 30.days.ago...).take!
s3_path_prefix = "sources" # or in dev: "sources_development"
s3_path = "#{s3_path_prefix}/#{project.storage_id}/#{project.id}/main.json"
s3_bucket = "cdo-v3-sources"
source_code = AWS::S3.download_from_bucket(s3_bucket, s3_path)

Now we need to figure out how to do download_from_bucket()

  • cnbrenci commented - > project = Project.where(project_type: 'applab', updated_at: 30.days.ago...).take!

s3_path_prefix = "sources" # or in dev: "sources_development"
s3_path = "#{s3_path_prefix}/#{project.storage_id}/#{project.id}/main.json"
s3_bucket = "cdo-v3-sources"
source_code = AWS::S3.download_from_bucket(s3_bucket, s3_path)

Confirmed that this does work (I successfully downloaded the source)

  • cnbrenci commented - open Q to look into: can Project.state be used to determine if the project is active rather than updated time? (I see in my dev env all my projects are state='active')
  • cnbrenci commented - First pass on full script after everything we figured out yesterday.

The last step would be we need to run it against the prod dashboard DB with the "sources" s3_path_prefix instead of "sources_development"

#!/usr/bin/env ruby
require_relative('../config/environment')

# Gets a sampling of applab source code from s3
# computes the percentage of those sources that use the onRecordEvent block

def percent_projects_actively_using_block(project_type, block_type, last_active_after, sample_size)
  projects = Project.where(project_type: project_type, updated_at: last_active_after...).take(sample_size)
  s3_path_prefix = "sources_development" # on prod: "sources"
  s3_bucket = "cdo-v3-sources"

  projects_have_block = projects.map do |project|
    s3_path = "#{s3_path_prefix}/#{project.storage_id}/#{project.id}/main.json"
    file_contents = AWS::S3.download_from_bucket(s3_bucket, s3_path)
    JSON.parse(file_contents)["source"].include?("#{block_type}(")
  end

  percent_using_block = (projects_have_block.count(true).to_f / projects_have_block.size) * 100
  puts "Percentage of #{project_type} projects using the #{block_type} block is #{percent_using_block}"
end

def percent_applabs_actively_using_on_record_event
  num_samples = 1000
  timeframe = 30.days.ago
  percent_projects_actively_using_block("applab", "onRecordEvent", timeframe, num_samples)
end

puts "Percent of recent applab projects using 'onRecordEvent':"
puts(percent_applabs_actively_using_on_record_event)
  • snickell commented - Nice!! Curious what percent we're gonna end up with 🤞 when we run on production.

  • We should probably use random samples rather than whatever MySQL happens to give as a default ordering. I generally avoid DB-specific SQL, but since this is just a one-off script: Project.where(...).order('RAND()').take(sample_size) (https://www.geeksforgeeks.org/sql-select-random/; note that MySQL's function is RAND(), not RANDOM()). This does require the DB to internally walk the entire set of matching rows, join each with a random number, and then sort N rows. So it would probably suck if we had a million results... but in reality, I'm betting this will work in our case if "applabs updated in the last N days" isn't too huge.

  • In reality, on prod we might want to do 10k samples instead of 1k, since I'm guessing we're going to get a value that's less than 1%.

  • Thinking 90 days might be a good date range.

  • I think we should save each file_contents to a .json file, and keep a copy of those in a tarball or whatever, because the next question after we get a percentage is probably for you and me to pick a dozen or so "uses onRecord" code bits and loosely read them to find out WHAT it's being used for (notably: "is it a chat app?", since we don't consider nixing things that are against our TOS to be a deprecation)

I guess a next step is figuring out how/where we can run this to have production DB creds (preferably read-only) available. One option, which might actually be simplest, is to run this on our production-console instance.

  • snickell commented - We sampled the prod DB to find instances of the onRecordEvent block being used in applab projects. Where we found instances, we read the code to see if they were broken or TOS violations.

TL;DR: 0.002% of applab projects (that aren't stubs or chat apps) created in the last 2 months use the onRecordEvent block, with a sample size of 100k.

  1. Sampling applab projects created in the last 2 months, 6 out of 100,000 used onRecordEvent. We read the source code for those 6 projects and found that 4 of the 6 were chat apps or non-functional stub code. Thus only 2 out of 100,000, or 0.002%, of this sample would be of concern.

    • "who needs assistance" crud app, 1200LOC
    • daily habit tracker, 160LOC (single-user: could be easily written w/o onRecordEvent)
    • 3x non-functional stub code
    • 1x chat app
  2. In case onRecordEvent was more heavily used in older projects with active current usage, we also sampled older applab projects updated in the last 3 months: 7 out of 9,000 used onRecordEvent. We read the source code for the 7 projects and found that 5 of the 7 were either chat apps or non-functional stub code. Thus only 2 out of 9,000, or 0.02%, of this sample would be of concern if we remove the onRecordEvent block.

    • 4x chat app
    • 1x non-functional stub code
    • "multiplayer farming simulator", 600LOC
    • math game (used for high score table), 200LOC

Also worth noting that none of the apps appear to be finished to an extent that makes it likely anyone is actively using them in a multiplayer context, including the chat apps.

  • snickell commented - Product noticed (TY @moneppo) we hadn't considered that students use different blocks throughout the year as they complete different curriculum.

So we went back to the sampling, and grabbed 105k samples covering all of 2022...

We scanned 2022 (samples from each month), and we found 3 "valid" apps in 2022 out of 105k samples (=0.003%, about the same as the "last two months" sample). "Valid" meant apps that weren't either chat apps, or stub code (e.g. "drag some blocks out"). We found 11 chat apps out of the 105k, so chat apps were 0.01%, though some of the 11 are variations on the same "WUT by Adam" chat.

So basically, it looks like our rate of apps that use the onRecordEvent block is ~0.002-0.003% (pretty much in the noise, given how few examples there were even with 100k samples each time).

  • snickell commented - We couldn't figure out how people were writing gamelab based chat apps. I found the "secret" gamelab APIs thanks to Hannah Nees giving me a great sample of "banned chat apps", including samples from Gamelab. These use "setKeyValue" and "getKeyValue" functions.... which aren't exposed as blocks (?!?).

Relevant code is here: https://github.com/code-dot-org/code-dot-org/blob/e6905b0c899383ca2c7e43885f847cedf84478b2/apps/src/p5lab/gamelab/commands.js/#L25-L47

The PR that added them is here: #11951

@davidsbailey any chance you remember why we were adding these APIs without adding them as blocks?

  • davidsbailey commented - > @davidsbailey any chance you remember why we were adding these APIs without adding them as blocks?

no, sadly I do not think I was involved in this decision.

  • snickell commented - Next we should start implementing more methods from the firebase interface we extracted, and get corresponding tests passing.
  • We should also consider adding tests that actually create student source using the data blocks, thereby testing end to end, from the JS API through the Rails controller. I think these tests, while a little bulkier, will catch many more (and more helpful) potential breakages in the future.

Previous Reviews:

snickell and others added 18 commits February 8, 2024 21:46
Co-authored-by: Cassi Brenci <cnbrenci@users.noreply.github.com>
Co-authored-by: Cassi Brenci <cnbrenci@users.noreply.github.com>
Co-authored-by: Cassi Brenci <cnbrenci@users.noreply.github.com>
Co-authored-by: Cassi Brenci <cnbrenci@users.noreply.github.com>
DatablockStorage: switch to fat models
…ethods

Re-group DatablockStorage API to look more REST-ish
Firebase still uses type argument, since it can't check on the backend.
Will permit future search cleanups to be more confident.
Rework DatablockStorage methods to not use Firebase-isms
# Table Column API, Table Record API, Library Manifest API & Project API
#
# More details can be found in the PR that initially created Datablock Storage:
# https://github.com/code-dot-org/code-dot-org/pull/56279
Contributor

As I was reading through and trying to learn about the data model, the main question in my head was "How are we doing both a KV store and a more traditional table (Relational DB?) at the same time?"

I couldn't find much about the KV store in the documentation (comments, issues, or PR description). In retrospect, this is probably because it's actually the much simpler of the two. But when I was learning, I assumed it would be the more complex, because we're using MySQL as our underlying DB, and in my mind that's more similar to the tables students are making than to the KV stores they're making.

It would be really helpful to have a bit more description of the KV store that helps me know I shouldn't overthink it and it's pretty straightforward. A few main possibilities that come to mind are in the PR description, the header of this file, or in the Key-Value-Pair API section below, as those are where I looked for details before tracing through the code.

Contributor

I think the best place to look for an answer is the model file for the key-value pairs model (datablock_storage_kvp.rb). This file describes the schema of the table where we're storing the data and the implementation of each of the methods we're calling on the model, and is guaranteed not to be out of date. It sounds like you found your way to the issue describing the models, but it was out of date and caused some confusion with channel_id vs project_id; you eventually did find your way to the model file and the answers you were looking for. Do you have any suggestions on what we could say here to lead you toward looking at the model files first?
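To make the "it really is that simple" point concrete, the KVP store boils down to something like this. This is an illustrative in-memory sketch, not the actual datablock_storage_kvp.rb code; the real model is an ActiveRecord table with one row per key-value pair, keyed by project and key:

```ruby
require 'json'

# Minimal in-memory model of the Datablock Storage key-value store:
# one row per (project_id, key), with the value stored as JSON.
class KvpStore
  def initialize
    @rows = {}  # (project_id, key) => JSON-encoded value
  end

  def set_key_value(project_id, key, value)
    @rows[[project_id, key]] = JSON.generate(value)  # upsert by composite key
  end

  def get_key_value(project_id, key)
    json = @rows[[project_id, key]]
    json && JSON.parse(json)
  end
end

kvps = KvpStore.new
kvps.set_key_value(42, 'high_score', 9000)
kvps.get_key_value(42, 'high_score')  # => 9000
```

There is no hidden complexity beyond this: the tables feature carries all the interesting schema, while KVPs are a flat upsert/lookup by composite key.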

Contributor

@jmkulwik jmkulwik left a comment


Looks great! Two main thoughts, neither blocking for this PR:

  • Are we doing any sort of rate limiting? Should we be?
  • I naively expected the KV store implementation to be the more complicated of the two, so I kept looking for hidden complexity. It'd be helpful to have some comments explaining how it works, so the reader knows it's actually pretty simple.

cnbrenci pushed a commit that referenced this pull request Mar 20, 2024
This PR splits the DB migrations out from the main Datablock Storage PR: #56279
@snickell
Contributor Author

> Are we doing any sort of rate limiting? Should we be?

We have a followup issue for rate limiting; it's probably the highest-priority followup for us after our experiment. I was curious to see natural traffic patterns before we decided which way we wanted to rate limit.

Contributor

@cnbrenci cnbrenci left a comment


🔥🧊🍾🚀

@cnbrenci cnbrenci merged commit add2567 into staging Mar 21, 2024
@cnbrenci cnbrenci deleted the unfirebase branch March 21, 2024 01:19
app_options
end

def firebase_options
Contributor

Is this the line that changed to cause the name error?

app_options[:legacyShareStyle] = true if @legacy_share_style
app_options[:isMobile] = true if browser.mobile?
app_options[:labUserId] = lab_user_id if @game == Game.applab || @game == Game.gamelab
app_options.merge!(firebase_options)
Contributor

or more specifically, here?

cnbrenci added a commit that referenced this pull request Mar 22, 2024
cnbrenci added a commit that referenced this pull request Mar 22, 2024
…56279)" (#57471)

* Revert "Datablock Storage: replace firebase for applab data features (#56279)"

This reverts commit add2567.

* Include manual revert of 3d22e61
cnbrenci added a commit that referenced this pull request Mar 22, 2024
snickell added a commit that referenced this pull request Mar 22, 2024
Revert "Revert "Datablock Storage: replace firebase for applab data features (#56279)"

Labels

unfirebase https://github.com/orgs/code-dot-org/projects/4

6 participants