Datablock Storage: replace firebase for applab data features#56279
This PR is a continuation of #54643, which was mis-closed by the Git LFS migration (#55759). Previous Comments:
In applab.js this is just used to update data preview pages when columns are renamed, which doesn't seem to necessitate a full Firebase-level DB.
The only thing that looks like it might need subscription like functionality is
Questions to answer:
I'd personally like to see how widely used the From this we can compute: "percent of active applab projects using the onRecord event block", which can then inform a decision to just deprecate
@cnbrenci and I tracked down where applab student source is being saved, and it's in main.json files in S3, e.g.: https://s3.console.aws.amazon.com/s3/object/cdo-v3-sources?region=us-east-1&prefix=sources/50000063/131542020/main.json
So we can go: Now we need to figure out how to do
Confirmed that this does work (I successfully downloaded the source)
The last step would be to run it against the prod dashboard DB with the "sources" s3_path_prefix instead of "sources_development":

#!/usr/bin/env ruby
require_relative('../config/environment')

# Gets a sampling of project source code from S3 and computes the
# percentage of those sources that use the given block (e.g. onRecordEvent).
# Prints the percentage and also returns it.
def percent_projects_actively_using_block(project_type, block_type, last_active_after, sample_size)
  projects = Project.where(project_type: project_type, updated_at: last_active_after...).take(sample_size)
  s3_path_prefix = "sources_development"
  # s3_path_prefix = "sources" # or in dev: "sources_development"
  s3_bucket = "cdo-v3-sources"
  projects_have_block = projects.map do |project|
    s3_path = "#{s3_path_prefix}/#{project.storage_id}/#{project.id}/main.json"
    file_contents = AWS::S3.download_from_bucket(s3_bucket, s3_path)
    # Match on the call syntax of the requested block, e.g. "onRecordEvent("
    JSON.parse(file_contents)["source"].include?("#{block_type}(")
  end
  percent_using_block = (projects_have_block.count(true).to_f / projects_have_block.size) * 100
  puts "Percentage of #{project_type} projects using the #{block_type} block is #{percent_using_block}"
  percent_using_block
end

def percent_applabs_actively_using_on_record_event
  num_samples = 1000
  timeframe = 30.days.ago
  percent_projects_actively_using_block("applab", "onRecordEvent", timeframe, num_samples)
end

puts "Percent of recent applab projects using 'onRecordEvent':"
puts(percent_applabs_actively_using_on_record_event)

I guess a next step is figuring out how/where we can run this to have production DB creds (preferably read-only) available. One option, which might actually be simplest, is to run this on our
TL;DR: 0.002% of applab projects (that aren't stubs or chat apps) created in the last 2 months use the
Also worth noting: none of the apps appear to be finished to an extent where I think it's likely anyone is actively using them in a multiplayer context, including the chat apps.
So we went back to the sampling and grabbed 105k samples covering all of 2022 (samples from each month). We found 3 "valid" apps out of the 105k samples (=0.003%, about the same as the "last two months" sample). "Valid" means apps that were neither chat apps nor stub code (e.g. "drag some blocks out"). We found 11 chat apps out of the 105k, so chat apps were about 0.01%, though some of the 11 are variations on the same "WUT by Adam" chat. So basically, our rate of apps that use the onRecordEvent block looks like ~0.002-0.003% (pretty much in the noise, given how few examples there were even with ~100k samples each time).
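For anyone double-checking the percentages, here is a rough sketch of the arithmetic using the counts quoted in this comment (3 valid apps and 11 chat apps out of 105k samples). The helper name `percent_of_sample` is made up for illustration and is not part of the script above.

```ruby
# Hypothetical helper (not from the codebase): turns a match count and a
# sample size into a percentage.
def percent_of_sample(matches, sample_size)
  raise ArgumentError, "sample_size must be positive" unless sample_size.positive?

  (matches.to_f / sample_size) * 100
end

# Counts quoted in the comment above.
SAMPLE_SIZE = 105_000
valid_rate = percent_of_sample(3, SAMPLE_SIZE)
chat_rate = percent_of_sample(11, SAMPLE_SIZE)

puts format("valid apps: %.3f%%", valid_rate)
puts format("chat apps: %.2f%%", chat_rate)
```

With only 3 matches in the sample, a difference of a single app moves the rate by roughly 0.001 percentage points, which is why the 0.002% and 0.003% figures are effectively the same result.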
Relevant code is here: https://github.com/code-dot-org/code-dot-org/blob/e6905b0c899383ca2c7e43885f847cedf84478b2/apps/src/p5lab/gamelab/commands.js/#L25-L47 The PR that added them is here: #11951 @davidsbailey any chance you remember why we were adding these APIs without adding them as blocks?
no, sadly I do not think I was involved in this decision.
Previous Reviews:
Co-authored-by: Cassi Brenci <cnbrenci@users.noreply.github.com>
DatablockStorage: switch to fat models
…ethods Re-group DatablockStorage API to look more REST-ish
Firebase still uses type argument, since it can't check on the backend.
Will permit future search cleanups to be more confident.
Rework DatablockStorage methods to not use Firebase-isms
…aticTable and FirebaseStorage.addCurrentTableToProject
…r testing everything else
Datablock Storage: use square corners for [Data] to flag "in experiment"
# Table Column API, Table Record API, Library Manifest API & Project API
#
# More details can be found in the PR that initially created Datablock Storage:
# https://github.com/code-dot-org/code-dot-org/pull/56279
As I was reading through and trying to learn about the data model, the main question in my head was "How are we doing both a KV store and a more traditional table (Relational DB?) at the same time?"
I couldn't find much about the KV store in the documentation (comments, issues, or PR description). In retrospect, this is probably because it's actually the much simpler of the two. But when I was learning, I assumed it would be the more complex one, because we're using MySQL as our underlying DB, and in my mind that's more similar to the tables students are making than to the KV stores they're making.
It would be really helpful to have a bit more description of the KV store that helps me know I shouldn't overthink it and it's pretty straightforward. A few main possibilities that come to mind are in the PR description, the header of this file, or in the Key-Value-Pair API section below, as those are where I looked for details before tracing through the code.
I think the best place to look to answer that would be the model file for the key-value pairs model (datablock_storage_kvp.rb). This file describes the schema of the table where we're storing the data and the implementation of each of the methods we're calling here on the model, and it is guaranteed not to be out of date. It sounds like you found your way to the issue describing the models, but it was out of date and caused some confusion about channel_id vs project_id. You did eventually find your way to the model file and the answers you were looking for. Do you have any suggestions on what we could say here to lead you toward looking at the model files first?
jmkulwik left a comment
Looks great! Two main thoughts, neither blocking for this PR:
- Are we doing any sort of rate limiting? Should we be?
- I naively expected the KV store implementation to be the more complicated of the two, so I kept looking for hidden complexity. It'd be helpful to have some comments explaining how it works, so the reader knows that it's actually pretty simple.
This PR splits the DB migrations out from the main Datablock Storage PR: #56279
We have a followup issue for rate limiting. It's probably the highest-priority followup for us after our experiment; I was curious to see natural traffic patterns before we decided which way we wanted to rate limit.
app_options
end

def firebase_options
Is this the line that changed to cause the name error?
app_options[:legacyShareStyle] = true if @legacy_share_style
app_options[:isMobile] = true if browser.mobile?
app_options[:labUserId] = lab_user_id if @game == Game.applab || @game == Game.gamelab
app_options.merge!(firebase_options)
or more specifically, here?
Revert "Revert "Datablock Storage: replace firebase for applab data features (#56279)"


Datablock Storage
This PR implements a Rails/MySQL backend to the Applab data features currently backed with Firebase:
This PR does not immediately switch to Datablock Storage. In fact, when this PR is merged, 0% of Applab projects will use it by default. Instead, a gradual rollout concurrent with bugfixing and optimization is envisioned; see "Rollout plan" below.
Motivation
This project is part of the "Eng Excellence" series. The primary motivation is cost: Firebase is on track to cost us $20k/month, and that rises significantly each year. The secondary motivation is consolidating the technology we use: Datablock Storage uses our existing MySQL DB. This provides small advantages like local dev using your existing DB, whereas with Firebase all local dev actually shares a DB, leading to weird behavior like edits to the dataset manifests in local dev actually modifying production (!!!). Firebase provides very cool live subscription features, but in practice we did not use these in our curriculum.
File changes in this PR fall into one of three categories:
- datablockStorage.js, allowing for the approach of both code paths existing alongside each other, switching on configuration when initializing the data store.
- import FirebaseStorage import with import {storageBackend} from '../storage'
- // TODO: post-firebase-cleanup): Remove TODO: post-firebase-cleanup items after full deploy #56994
UX Changes
Ideally this is a backend-only change without student-facing, curriculum-facing or teacher-facing changes.
New features
- getColumn block doesn't fetch the whole table client-side every time. This is important because curriculum has told us this is the main block they use, and many datasets like Ramen are challenging on student machines as a result (timing out, etc).
Rollout plan
This code implements the core of the student-facing data-in-applab experience, but does NOT include curriculum-facing dataset editing interfaces. During rollout, we'll request curriculum authors to notify us if they edit datasets (only happens a few times a year and we've already started the conversation with them) and we'll manually sync the changes to Datablock Storage.
While the AP CSP create task may result in delaying the final rollout until it's completed, we would like to at least do a test-the-waters rollout prior to it. It's possible we'll decide at that point to fully switch back to Firebase storage, or we might feel confident to go further, depending on how things look.
Data Model
The data model is very simple. We initially thought we might have to try fancy things to deal with the scale of data involved, but we tried importing all 1TB of Firebase data into various MySQL schemas and found a very basic approach to be performant (various query and data structures were timed and optimized, leading to the ActiveRecord calls we have now).
The most challenging issue was dealing with auto-incrementing the record_id relative to the composite key. For this, we lock the first row in the table when computing a new record ID.
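To illustrate the idea of locking before computing a new record ID, here is a minimal in-memory sketch. It is an analogy only, not the code in this PR: a per-table Mutex stands in for the MySQL row lock, the class name `RecordIdAllocator` is hypothetical, and the "max existing ID + 1, starting at 1" rule is an assumption made for the example.

```ruby
# In-memory analogy only. The real implementation locks a row in MySQL;
# here a Mutex per (project_id, table_name) plays that role.
class RecordIdAllocator
  def initialize
    @guard = Mutex.new # protects the @tables hash itself
    @tables = {}       # [project_id, table_name] => {lock:, rows:}
  end

  # Stores data under the next record_id for this table and returns the id.
  def create_record(project_id, table_name, data)
    table = table_for([project_id, table_name])
    # Holding the per-table lock makes "read max id, then insert" atomic,
    # so two concurrent writers can never pick the same record_id.
    table[:lock].synchronize do
      rows = table[:rows]
      record_id = (rows.keys.max || 0) + 1 # assumption: ids start at 1
      rows[record_id] = data
      record_id
    end
  end

  private

  def table_for(key)
    @guard.synchronize { @tables[key] ||= {lock: Mutex.new, rows: {}} }
  end
end

allocator = RecordIdAllocator.new
threads = 8.times.map do |i|
  Thread.new { allocator.create_record(1, "ramen", {"n" => i}) }
end
ids = threads.map(&:value)
puts ids.sort.inspect
```

Record IDs are scoped to a single table within a single project, so a different project (or a different table) starts its own sequence. Without the lock, two writers could both read the same maximum ID and collide on insert.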
Data Migrations
At @bethanyaconnor's suggestion, we've split the migrations out into a separate PR to be applied first: #57392
Followup Work
This PR implements the core student-facing data features. Anticipated followups include possible backend optimizations and UX bug fixes. Followup work is being tracked on a GitHub kanban board.
Major required followups include:
Release critical issues can be found here: https://github.com/orgs/code-dot-org/projects/4/views/10
Co-authored-by: Cassi Brenci cassi.brenci@code.org