
Design performant schema for formerly-Firebase data #55344

Closed
snickell opened this issue Dec 8, 2023 · 5 comments
snickell commented Dec 8, 2023

Part of the firebase deprecation project: #55084

We have approximately 4 billion records currently in Firebase (by record we mean one row in a student project dataset), spread across millions of student data tables (contained in millions of applab projects).

Now that we have full data imports going (#55189), we can start performance-testing possible schemas for storing this data in MySQL (a rough sketch of the two extremes follows the list below):

  1. Store each student-channel (= student project) in a MySQL row
  2. Store each student-table (multiple tables per channel) in a JSON MySQL row
  3. Store each student-table in a TEXT MySQL row?
  4. Store each student-record (multiple records per table) in a TEXT MySQL row
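
For concreteness, here's a rough sketch of what the two extremes (options 1 and 4) might look like as MySQL tables. All names, types, and sizes here are illustrative assumptions, not decisions:

```sql
-- Option 1: one row per student channel; the whole project's data serialized together.
CREATE TABLE channels (
  channel_id VARCHAR(22) NOT NULL PRIMARY KEY,
  data       LONGTEXT    NOT NULL  -- all tables and records for the channel, serialized as JSON
) DEFAULT CHARSET=utf8mb3;

-- Option 4: one row per student record.
CREATE TABLE records (
  channel_id VARCHAR(22)  NOT NULL,
  table_name VARCHAR(768) NOT NULL,
  record_id  INT          NOT NULL,
  json       TEXT         NOT NULL,  -- one record's JSON
  PRIMARY KEY (channel_id, table_name, record_id)
) DEFAULT CHARSET=utf8mb3;  -- 3-byte utf8; see the charset discussion further down
```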
snickell commented Dec 8, 2023

Basic SELECT statements on the record-per-row table unfirebase.records show excellent performance:
[screenshot: SELECT query timings]

I believe (?) this is only half the total rows (and includes stock table data), but should be generally representative:
[screenshot: row counts]

Indices are on a composite PK (channel_id, table_name, record_id) as well as on (channel_id), and take about 20% of storage space. Queries that filter on channel_id take approximately 0.25s over 1.2 billion rows.
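
For the curious, this is the shape of query being timed, as a sketch (the channel ID value here is made up):

```sql
-- Served by the leftmost prefix of the composite PK (channel_id, table_name, record_id),
-- so it stays ~0.25s even over 1.2 billion rows.
SELECT record_id, json
FROM unfirebase.records
WHERE channel_id = 'AbCdEfGhIjKlMnOpQrStUv';
```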

snickell commented

Row-per-record schema with table dedupe seems to give favorable query performance (for the queries we see in apps/src/storage/firebaseStorage.js) and reasonable storage:

  1. Uses row-per-record for the table schema
  2. Dedupes whole tables, and can even dedupe tables when running in KIND_OF_ROW=Record. This reduces total DB size to 1/5th of what it was before.
  3. Stores records as the MySQL JSON column type. I find this reassuring because it means 100% of the individual records have passed MySQL's JSON validation... = "our data is probably imported correctly 😅"
  4. Only takes 370GB on-disk in MySQL for the whole DB, including indices that perform well!
  5. Schema shows good performance: fetching a full table of records takes ~0.25s (= readRecord block), individual record inserts take 0.14s (= createRecord block), and individual record updates also take 0.14s (= updateRecord block); sketches of these three queries follow the screenshot below.

[screenshot: block query timings]
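
Sketches of the three access patterns timed above, assuming the row-per-record table described in this thread (the real service behind firebaseStorage.js goes through an API layer, so the exact SQL may differ):

```sql
-- readRecord: fetch a full table of records (~0.25s)
SELECT record_id, json
FROM records
WHERE channel_id = ? AND table_name = ?;

-- createRecord: insert a single record (~0.14s)
INSERT INTO records (channel_id, table_name, record_id, json)
VALUES (?, ?, ?, ?);

-- updateRecord: update a single record in place (~0.14s)
UPDATE records
SET json = ?
WHERE channel_id = ? AND table_name = ? AND record_id = ?;
```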

Storing records as a JSON column gives us confidence in their validity. It turns out that MySQL parses JSON using RapidJSON, the same parser we are using. As a result, we were able to pre-parse each record (which slows importing, but is worth it) before sending the bulk dumps to MySQL, and drop rows/records that don't parse.

JSON validation resulted in dropping only 450 out of 600 million (deduped) records. All seemed like attempts to do exotic/fun things with Unicode, and all but two records appeared to be from chat apps. The other two appeared to be entries in a high-score table for a game. Note that when validation fails, we're not dropping the entire table, just the exact records that don't validate.

snickell commented

Deduplication leaves us with an additional requirement for the schema: a way to indicate that a table is a "pointer" to a stock table.

snickell commented Dec 15, 2023

@cnbrenci and I discussed the schema; here's where we landed:
Records table:

channelID: VARCHAR(22),
tableName: VARCHAR(768),
recordID: INT,
json: JSON

Tables table:

channelID: VARCHAR(22),
tableName: VARCHAR(768),
columns: JSON,
isSharedTable: VARCHAR(768), /* if it points to a shared/stock table, load records using this table name instead */

columns (JSON) might look like ["id", "Word"]. Note this assumes the column IDs (like -Mw7ENQB6uKfQYc0kI8U) present in Firebase are not actually used. It doesn't look like they are, e.g. firebaseStorage.js references columns by name, not by ID.

KeyValuePairs table:

channelID: VARCHAR(22),
key: VARCHAR(768), 
value: VARCHAR(4096), -- v3.config.channels.maxPropertySize, see `rules.bolt`
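
Putting the above together, a minimal DDL sketch. The snake_case names, NULL-ability, primary keys, and charset are my assumptions, and the 768-character sizes are revised down to 700 in a later comment:

```sql
CREATE TABLE records (
  channel_id VARCHAR(22)  NOT NULL,
  table_name VARCHAR(768) NOT NULL,
  record_id  INT          NOT NULL,
  json       JSON         NOT NULL,
  PRIMARY KEY (channel_id, table_name, record_id)
) DEFAULT CHARSET=utf8mb3;  -- later changed to utf8mb4, see below

CREATE TABLE tables (
  channel_id      VARCHAR(22)  NOT NULL,
  table_name      VARCHAR(768) NOT NULL,
  columns         JSON         NOT NULL,  -- e.g. ["id", "Word"]
  is_shared_table VARCHAR(768) NULL,      -- if set, load records under this shared table name instead
  PRIMARY KEY (channel_id, table_name)
) DEFAULT CHARSET=utf8mb3;

CREATE TABLE key_value_pairs (
  channel_id VARCHAR(22)   NOT NULL,
  `key`      VARCHAR(768)  NOT NULL,      -- `key` is a reserved word in MySQL, hence the backticks
  value      VARCHAR(4096) NOT NULL,      -- v3.config.channels.maxPropertySize, see rules.bolt
  PRIMARY KEY (channel_id, `key`)
) DEFAULT CHARSET=utf8mb3;
```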

VARCHAR sizes are derived from the Firebase validation rules:

maxPropertySize: 4096 // max value length of a key-value pair
maxRecordSize: 4096 // max length of a record
maxTableCount: 10 // this doesn't seem to be enforced in rules.bolt?
maxTableRows: 20000 // max number of records inside a table
maxKeySize: 768 // this is a Firebase limit

Stock/shared tables are also stored in Records and Tables, but use a fixed channelID: shared

We propose dropping firebaseChannelIdSuffix. It was being used to create per-developer separation in the shared cdo-v3-dev Firebase instance, e.g. firebase_channel_id_suffix: -DEVELOPMENT-<%=ENV['USER']%>, and is also used by the test setup (firebase_channel_id_suffix: '<%=ci ? "-#{circle_run_identifier}" : ''%>') for similar purposes. However, we believe it is no longer required, since in dev we will be able to run the DB locally on MySQL rather than pointing all developers at a shared Firebase instance.

For data export, we can sign a "what channel ID should you have access to" token, and put that in the data export.

  • /projects/applab//export_config (Enable applab data APIs in exported projects #22522)
    • currently returns (firebaseName, firebaseAuthToken, and firebaseChannelIdSuffix)
    • we can instead return (dbName, auth_token=signed(channelID)), where dbName is like ${unfirebase}_development or ${unfirebase}_test

We may still have to add rate limiting to the MySQL schema, see #55481

cnbrenci commented Feb 27, 2024

We've changed the tableName and key limits from 768 -> 700. We had to do this because we had to change the charset from utf8mb3 to utf8mb4, and the max index key length in MySQL (InnoDB) is 3072 bytes. At utf8mb3, the total comes out to 3x768 = 2304 bytes, which fits within the limit, but with utf8mb4, 4x768 = 3072 bytes hits the limit.

Given that keys in Firebase have a max size of 768 bytes and carry a 69-character prefix which we are not storing in the key field in MySQL, we can safely shorten the key, confident that we'll still be able to migrate all the data into the smaller key size.
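
The byte arithmetic, for reference (the 3072-byte figure is InnoDB's index key length limit; the ALTER below is a hypothetical illustration, not the actual migration):

```sql
-- utf8mb3: up to 3 bytes/char -> 768 * 3 = 2304 bytes (fits under 3072)
-- utf8mb4: up to 4 bytes/char -> 768 * 4 = 3072 bytes (hits the limit)
-- utf8mb4: up to 4 bytes/char -> 700 * 4 = 2800 bytes (fits, with headroom)
ALTER TABLE key_value_pairs
  MODIFY `key` VARCHAR(700) CHARACTER SET utf8mb4 NOT NULL;
```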

snickell added a commit that referenced this issue Feb 27, 2024
Default unicode type for our DB is utf8, which in mysql is (weirdly) 3-byte. This doesn't permit emojis in KVP key names, or in table names. To match firebase we need to support full 4-byte UTF-8.

#55344 (comment)

Co-authored-by: Cassi Brenci <cnbrenci@users.noreply.github.com>