[New Feature] Seeking alternative to Redis as Online Storage. BigTable research #48
Will it become a performance issue, considering some tables will be bigger than others?
If I understand correctly, there will be two queries to BigTable: one to retrieve the values and a second one to retrieve the schema. Only after that can the particular row be decoded. Isn't that going to be a slow process?
The assumption is that the schema is not updated very frequently and thus can be cached on the serving side. And the reference is a hash of the schema itself.
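A minimal sketch of the kind of serving-side cache being described here, assuming a hypothetical `fetch_schema_row` helper that reads the `schema#<hash>` row from BigTable (all names are illustrative, not the actual Feast implementation):

```python
# Illustrative serving-side schema cache, keyed by the schema hash that
# prefixes each stored row value.
schema_cache: dict = {}  # schema hash (bytes) -> parsed schema

def fetch_schema_row(table, row_key: bytes):
    """Hypothetical helper: read the schema row from BigTable (not shown)."""
    raise NotImplementedError

def get_schema(table, schema_hash: bytes):
    """Return the schema for a given hash, hitting BigTable only on a miss."""
    schema = schema_cache.get(schema_hash)
    if schema is None:
        # One extra read per *new* schema version; afterwards it's in memory.
        schema = fetch_schema_row(table, b"schema#" + schema_hash)
        schema_cache[schema_hash] = schema
    return schema
```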
Alternatively, we could add the entity key into the row key, but wouldn't that increase the row size?
For one Feast request we're always gonna need only one table. On the balancing note, I was thinking of separating tables into different BigTable instances, probably using Feast projects (one instance per project, e.g.). That would also help us scale them independently, since some projects are used more than others.
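A rough sketch of that routing idea, with made-up project and instance names (this is just one way to express it, not a committed design):

```python
# Hypothetical routing of Feast projects to dedicated BigTable instances,
# so heavily used projects can be scaled independently of the rest.
PROJECT_TO_INSTANCE = {
    "driver_project": "feast-bt-drivers",
    "merchant_project": "feast-bt-merchants",
}
DEFAULT_INSTANCE = "feast-bt-shared"

def instance_for(project: str) -> str:
    # Small / low-traffic projects share a default instance.
    return PROJECT_TO_INSTANCE.get(project, DEFAULT_INSTANCE)
```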
At Gojek we extensively use Feast in production with Redis Cluster as our storage layer. The capacity of our cluster is currently 1 TB, which is not too much, right? But we have already started to bump into some limitations.
That being said, we want to explore possible alternatives and try to preserve the best possible performance (read latency) while simplifying scaling and increasing durability.
BigTable (research)
There are two main factors that impact read performance in BigTable. (This is under the assumption that 100% of Feast reads are random access, since we have no knowledge about the distribution of ingested entities, nor about the distribution of entities in `get_online_features` requests.) Let's focus on row size.
Some facts that we know about BigTable
Facts (3) and (4) mean that each cell costs approx. 12 bytes of metadata.
We also conducted several experiments to better understand the underlying storage model and to find an approach that makes each individual stored row as compact as possible. We generated a test dataset consisting of 20 features (type float, 4 bytes each) per entity key of length 20 bytes, i.e. 100 bytes per row. Hence, the effective size of the raw dataset is 1 GB (10M rows).
- Experiment A. Every feature stored as a separate column (within one column family). Qualifier length is 10 bytes. Table size after ingestion: 4.4 GB.
- Experiment B. Every feature stored as a separate column (within one column family). Qualifier length is 30 bytes. Table size after ingestion: 5.4 GB.
- Experiment C. Same as Experiment A, but with a longer column family name. Table size after ingestion: 4.4 GB.
- Experiment D. All features stored in a single byte array. Qualifier is empty. Table size after ingestion: 1.4 GB.
- Experiment E. The 20 float features replaced with 10 double features (same effective dataset size, but only 10 cells per row; qualifier length still 10 bytes). Table size after ingestion: 2.0 GB.
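A back-of-envelope model reproduces the ordering of these results; it treats the ~12 bytes of per-cell metadata noted above as exact and ignores compression and row-key overhead, so the absolute numbers differ from the measurements:

```python
def estimated_row_bytes(cells: int, qualifier_len: int, value_len: int) -> int:
    """Rough uncompressed per-row estimate: every cell pays its qualifier,
    its value, and ~12 bytes of BigTable metadata."""
    return cells * (12 + qualifier_len + value_len)

print(estimated_row_bytes(20, 10, 4))  # Experiment A: 520 bytes vs 100 bytes of raw data
print(estimated_row_bytes(20, 30, 4))  # Experiment B: 920 bytes
print(estimated_row_bytes(10, 10, 8))  # Experiment E: 300 bytes
print(estimated_row_bytes(1, 0, 80))   # Experiment D: 92 bytes -- packing wins
```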
Conclusions:
- Qualifier length matters when values are small (A vs B), while the column family name length does not (A vs C).
- Fewer cells per row means less metadata overhead (E vs A).
- Packing all features into a single cell is by far the most compact option (D).
Proposed design
Taking into account these factors, we propose the following design:
A. Use entities as the table name
B. Use entity values as the row key
C. Store all features (from a single Feature Table) as a single byte array (one column with an empty qualifier)
D. Prepend the byte array with a schema reference (a 4/8-byte hash of the schema)
E. Store the schema under a separate key (`schema#<hash>`) in the same table

In our implementation (#46) we also plan to use Avro as the serializer (although we leave room for alternatives).
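As a concrete sketch of the write path under this design, here is how a row value could be built with `fastavro`, using the example schema from below; the hashing scheme shown (truncated SHA-256 of the schema JSON) is just one possible choice, not necessarily what #46 will do:

```python
import hashlib
import io
import json

import fastavro  # assumption: any Avro library would do

SCHEMA_DICT = {
    "type": "record",
    "name": "Features",
    "fields": [
        {"name": "money_spent", "type": "double"},
        {"name": "orders_amount", "type": "int"},
    ],
}
SCHEMA = fastavro.parse_schema(SCHEMA_DICT)

# 4-byte schema reference: truncated SHA-256 of the canonical schema JSON.
SCHEMA_HASH = hashlib.sha256(
    json.dumps(SCHEMA_DICT, sort_keys=True).encode()
).digest()[:4]

def encode_row_value(features: dict) -> bytes:
    """Build the single-cell value: schema hash prefix + schemaless Avro body."""
    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, SCHEMA, features)
    return SCHEMA_HASH + buf.getvalue()

# e.g. the cell stored under row key <customer_id><merchant_id>:
value = encode_row_value({"money_spent": 42.5, "orders_amount": 7})
```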
Example
Table: customer_id__merchant_id (TTL=300)

Row value:

0x00 0x00 0xAA 0xAA   0x00 0x00 0x01 ... 0x01
^ schema reference ^  ^ avro-serialized features ^

Schema (stored under schema#<hash>):

{"type": "record", "name": "Features", "fields": [{"name": "money_spent", "type": "double"}, {"name": "orders_amount", "type": "int"}]}