
Design performant schema for formerly-Firebase data #55344

Closed
snickell opened this issue Dec 8, 2023 · 5 comments
snickell commented Dec 8, 2023

Part of the firebase deprecation project: #55084

We have approximately 4 billion records currently in Firebase (by record we mean one row in a student project dataset), spread across millions of student data tables (contained in millions of applab projects).

Now that we have full data imports going (#55189), we can start performance-testing possible schemas for storing this data in MySQL (a rough sketch of the two extremes follows the list below):

  1. Store each student-channel (= student project) in a MySQL row
  2. Store each student-table (multiple tables per channel) in a JSON MySQL row
  3. Store each student-table in a TEXT MySQL row?
  4. Store each student-record (multiple records per table) in a TEXT MySQL row
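
For concreteness, here's a rough sketch of what the two extremes (options 1 and 4) might look like as MySQL tables. All names, types, and sizes here are illustrative assumptions, not decisions:

```sql
-- Option 1: one row per student channel; the whole project's data serialized together.
CREATE TABLE channels (
  channel_id VARCHAR(22) NOT NULL PRIMARY KEY,
  data       LONGTEXT    NOT NULL  -- all tables and records for the channel, serialized as JSON
) DEFAULT CHARSET=utf8mb3;

-- Option 4: one row per student record.
CREATE TABLE records (
  channel_id VARCHAR(22)  NOT NULL,
  table_name VARCHAR(768) NOT NULL,
  record_id  INT          NOT NULL,
  json       TEXT         NOT NULL,  -- one record's JSON
  PRIMARY KEY (channel_id, table_name, record_id)
) DEFAULT CHARSET=utf8mb3;  -- 3-byte utf8; see the charset discussion further down
```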
snickell commented Dec 8, 2023

Basic SELECT statements on the record-per-row table unfirebase.records show excellent performance:
[screenshot: SELECT query timings]

I believe (?) this is only half the total rows (and includes stock table data), but should be generally representative:
[screenshot: row counts]

Indices are on a composite PK (channel_id, table_name, record_id) as well as on (channel_id), and take about 20% of storage space. Queries that filter on channel_id take approximately 0.25s over 1.2 billion rows.
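
For the curious, this is the shape of query being timed, as a sketch (the channel ID value here is made up):

```sql
-- Served by the leftmost prefix of the composite PK (channel_id, table_name, record_id),
-- so it stays ~0.25s even over 1.2 billion rows.
SELECT record_id, json
FROM unfirebase.records
WHERE channel_id = 'AbCdEfGhIjKlMnOpQrStUv';
```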

snickell commented

Row-per-record schema with table dedupe seems to give favorable query performance (for the queries we see in apps/src/storage/firebaseStorage.js) and reasonable storage:

  1. Uses row-per-record for the table schema
  2. Dedupes whole tables, and can even dedupe tables when running in KIND_OF_ROW=Record. This reduces total DB size to 1/5th of what it was before.
  3. Stores records as the MySQL JSON column type. I find this reassuring because it means 100% of the individual records have passed MySQL's JSON validation... = "our data is probably imported correctly 😅"
  4. Only takes 370GB on-disk in MySQL for the whole DB, including indices that perform well!
  5. Schema shows good performance: fetching a full table of records takes ~0.25s (= readRecord block), individual record inserts take 0.14s (= createRecord block), and individual record updates also take 0.14s (= updateRecord block); sketches of these three queries follow the screenshot below.

[screenshot: block query timings]
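
Sketches of the three access patterns timed above, assuming the row-per-record table described in this thread (the real service behind firebaseStorage.js goes through an API layer, so the exact SQL may differ):

```sql
-- readRecord: fetch a full table of records (~0.25s)
SELECT record_id, json
FROM records
WHERE channel_id = ? AND table_name = ?;

-- createRecord: insert a single record (~0.14s)
INSERT INTO records (channel_id, table_name, record_id, json)
VALUES (?, ?, ?, ?);

-- updateRecord: update a single record in place (~0.14s)
UPDATE records
SET json = ?
WHERE channel_id = ? AND table_name = ? AND record_id = ?;
```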

Storing records as a JSON column gives us confidence in their validity. It turns out that MySQL parses JSON using RapidJSON, the same parser we are using. As a result, we were able to pre-parse each record (which slows importing, but is worth it) before sending the bulk dumps to MySQL, and drop rows/records that don't parse.

JSON validation resulted in dropping only 450 out of 600 million (deduped) records. All seemed like attempts to do exotic/fun things with Unicode, and all but two records appeared to be from chat apps. The other two appeared to be entries in a high-score table for a game. Note that when validation fails, we're not dropping the entire table, just the exact records that don't validate.

snickell commented

Deduplication leaves us with an additional requirement for the schema: a way to indicate that a table is a "pointer" to a stock table.

snickell commented Dec 15, 2023

@cnbrenci and I discussed the schema; here's where we landed:
Records table:

channelID: VARCHAR(22),
tableName: VARCHAR(768),
recordID: INT,
json: JSON

Tables table:

channelID: VARCHAR(22),
tableName: VARCHAR(768),
columns: JSON,
isSharedTable: VARCHAR(768), /* if it points to a shared/stock table, load records using this table name instead */

columns (JSON) might look like ["id", "Word"]. Note this assumes the column IDs (like -Mw7ENQB6uKfQYc0kI8U) present in Firebase are not actually used. It doesn't look like they are, e.g. firebaseStorage.js references columns by name, not by ID.

KeyValuePairs table:

channelID: VARCHAR(22),
key: VARCHAR(768), 
value: VARCHAR(4096), -- v3.config.channels.maxPropertySize, see `rules.bolt`
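
Putting the above together, a minimal DDL sketch. The snake_case names, NULL-ability, primary keys, and charset are my assumptions, and the 768-character sizes are revised down to 700 in a later comment:

```sql
CREATE TABLE records (
  channel_id VARCHAR(22)  NOT NULL,
  table_name VARCHAR(768) NOT NULL,
  record_id  INT          NOT NULL,
  json       JSON         NOT NULL,
  PRIMARY KEY (channel_id, table_name, record_id)
) DEFAULT CHARSET=utf8mb3;  -- later changed to utf8mb4, see below

CREATE TABLE tables (
  channel_id      VARCHAR(22)  NOT NULL,
  table_name      VARCHAR(768) NOT NULL,
  columns         JSON         NOT NULL,  -- e.g. ["id", "Word"]
  is_shared_table VARCHAR(768) NULL,      -- if set, load records under this shared table name instead
  PRIMARY KEY (channel_id, table_name)
) DEFAULT CHARSET=utf8mb3;

CREATE TABLE key_value_pairs (
  channel_id VARCHAR(22)   NOT NULL,
  `key`      VARCHAR(768)  NOT NULL,      -- `key` is a reserved word in MySQL, hence the backticks
  value      VARCHAR(4096) NOT NULL,      -- v3.config.channels.maxPropertySize, see rules.bolt
  PRIMARY KEY (channel_id, `key`)
) DEFAULT CHARSET=utf8mb3;
```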

VARCHAR sizes are derived from the Firebase validation rules:

maxPropertySize: 4096 // max value length of a key-value pair
maxRecordSize: 4096 // max length of a record
maxTableCount: 10 // this doesn't seem to be enforced in rules.bolt?
maxTableRows: 20000 // max number of records inside a table
maxKeySize: 768 // this is a Firebase limit

Stock/shared tables are also stored in Records and Tables, but use a fixed channelID: shared

We propose dropping firebaseChannelIdSuffix. It was being used to create per-developer separation in the shared cdo-v3-dev Firebase instance, e.g. firebase_channel_id_suffix: -DEVELOPMENT-<%=ENV['USER']%>, and is also used by the test setup (firebase_channel_id_suffix: '<%=ci ? "-#{circle_run_identifier}" : ''%>') for similar purposes. However, we believe it is no longer required, since in dev we will be able to run the DB locally on MySQL rather than pointing all developers at a shared Firebase instance.

For data export, we can sign a "what channel ID should you have access to" token, and put that in the data export.

  • /projects/applab//export_config (Enable applab data APIs in exported projects #22522)
    • currently returns (firebaseName, firebaseAuthToken, and firebaseChannelIdSuffix)
    • we can instead return (dbName, auth_token=signed(channelID)), where dbName is like ${unfirebase}_development or ${unfirebase}_test

We may still have to add rate limiting to the MySQL schema, see #55481

cnbrenci commented Feb 27, 2024

We've changed the tableName and key limits from 768 -> 700. We had to do this because we had to change the charset from utf8mb3 to utf8mb4, and the max index key length in MySQL (InnoDB) is 3072 bytes. At utf8mb3, the total comes out to 3x768 = 2304 bytes, which fits within the limit, but with utf8mb4, 4x768 = 3072 bytes hits the limit.

Given that keys in Firebase have a max size of 768 bytes and carry a 69-character prefix which we are not storing in the key field in MySQL, we can safely shorten the key, confident that we'll still be able to migrate all the data into the smaller key size.
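
The byte arithmetic, for reference (the 3072-byte figure is InnoDB's index key length limit; the ALTER below is a hypothetical illustration, not the actual migration):

```sql
-- utf8mb3: up to 3 bytes/char -> 768 * 3 = 2304 bytes (fits under 3072)
-- utf8mb4: up to 4 bytes/char -> 768 * 4 = 3072 bytes (hits the limit)
-- utf8mb4: up to 4 bytes/char -> 700 * 4 = 2800 bytes (fits, with headroom)
ALTER TABLE key_value_pairs
  MODIFY `key` VARCHAR(700) CHARACTER SET utf8mb4 NOT NULL;
```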

snickell added a commit that referenced this issue Feb 27, 2024
Default unicode type for our DB is utf8, which in mysql is (weirdly) 3-byte. This doesn't permit emojis in KVP key names, or in table names. To match firebase we need to support full 4-byte UTF-8.

#55344 (comment)

Co-authored-by: Cassi Brenci <cnbrenci@users.noreply.github.com>