docs: add the design document of Warehouse #114

huachaohuang · 2021-11-17T13:09:13Z

The semantics of Warehouse are useful for building analytical databases like Snowflake and Delta Lake.

zojw · 2021-11-18T13:09:10Z

docs/warehouse.md

+When a client connects to Warehouse, it gets the last version from Warehouse as its base version and then subscribes to delta versions.
+When a delta version arrives, the client applies it to the base version to catch up with Warehouse.
+The client can maintain a list of live versions for ongoing queries and release a version once it is no longer required.
+Warehouse guarantees that objects recorded in all client versions remain valid until the corresponding versions are released.


How does Warehouse know about "all client versions"? A client may leave temporarily or be cut off by a network partition, so it does not seem simple to know that all live clients have released a version :)

I think the implementation will be similar to a distributed lock or leases in etcd. Clients need to communicate with Warehouse to keep their versions alive.
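A minimal sketch of the lease idea, assuming each client pins versions under a time-to-live and renews them with keep-alive calls. `VersionLeases` and its methods are hypothetical names for illustration, not part of the design document; a partitioned client simply stops renewing, so its versions expire on their own.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical lease table: (client id, version) -> lease expiry.
pub struct VersionLeases {
    ttl: Duration,
    expires_at: HashMap<(u64, u64), Instant>,
}

impl VersionLeases {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, expires_at: HashMap::new() }
    }

    /// Called when a client pins a version and on every keep-alive.
    pub fn keep_alive(&mut self, client: u64, version: u64, now: Instant) {
        self.expires_at.insert((client, version), now + self.ttl);
    }

    /// Called when a client explicitly releases a version.
    pub fn release(&mut self, client: u64, version: u64) {
        self.expires_at.remove(&(client, version));
    }

    /// Drops expired leases and returns the oldest version still pinned.
    /// Objects referenced only by older versions are candidates for purging.
    pub fn oldest_live_version(&mut self, now: Instant) -> Option<u64> {
        self.expires_at.retain(|_, expiry| *expiry > now);
        self.expires_at.keys().map(|(_, version)| *version).min()
    }
}
```

With this shape, Warehouse never needs to know whether a silent client has left or is partitioned; it only needs to compare lease expiries against its own clock.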

zojw · 2021-11-18T13:25:30Z

docs/warehouse.md

+It maintains multiple versions of metadata. Each version represents a snapshot of metadata at a specific time.
+Each metadata transaction (add or remove objects) creates a delta version that transforms the last version into a new one.
+When a client connects to Warehouse, it gets the last version from Warehouse as its base version and then subscribes to delta versions.
+When a delta version arrives, the client applies it to the base version to catch up with Warehouse.


IMO, is the reverse-synced version info used only to let the client release versions? What about queries and modifications that arrive at the client while it has not yet caught up? I guess they should not care about those reverse-synced versions? (If so, they may read stale data and be inconsistent across different Warehouse client components.)

I don't think I understand the questions. Specifically, what does "reverse synced version" mean?
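For reference, the base-plus-delta mechanism quoted in this thread could look roughly like the sketch below. The `Version` and `Delta` layouts are assumptions, and so is the rule that delta ids are consecutive; the design document does not specify either.

```rust
use std::collections::BTreeSet;

/// A snapshot of metadata: the set of object names visible at `id`.
#[derive(Clone, Debug, PartialEq)]
pub struct Version {
    pub id: u64,
    pub objects: BTreeSet<String>,
}

/// The result of one metadata transaction (hypothetical layout).
pub struct Delta {
    pub id: u64,
    pub added: Vec<String>,
    pub removed: Vec<String>,
}

impl Version {
    /// Applies `delta` and returns the next version, leaving `self` intact
    /// so that ongoing queries can keep reading the older snapshot.
    /// Assumes consecutive ids: a gap means an update was missed.
    pub fn apply(&self, delta: &Delta) -> Result<Version, String> {
        if delta.id != self.id + 1 {
            return Err(format!("expected delta {}, got {}", self.id + 1, delta.id));
        }
        let mut objects = self.objects.clone();
        for name in &delta.removed {
            objects.remove(name);
        }
        for name in &delta.added {
            objects.insert(name.clone());
        }
        Ok(Version { id: delta.id, objects })
    }
}
```

Because `apply` returns a new value, a client that has not caught up still serves queries from a consistent, if older, snapshot.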

tisonkun

Thanks for submitting this PR! Comments inline.

Also, a `\n` in Markdown without a blank line won't start a new line. You can check the rendered version, which I think is the primarily used version and the one we should take care of. I don't insert a line break without a blank line :P

tisonkun · 2021-11-20T16:13:20Z

docs/warehouse.md

+Warehouse stores object data in Storage and stores object metadata in Manifest.
+
+To add objects to Warehouse, a client uploads objects to Storage first and then commits the uploaded objects to Warehouse.
+To delete objects from Warehouse, a client commits the to be deleted objects to Warehouse and relies on Warehouse to delete those objects.


Suggested change

To delete objects from Warehouse, a client commits the to be deleted objects to Warehouse and relies on Warehouse to delete those objects.

To delete objects from Warehouse, a client commits the to-be-deleted objects to Warehouse and relies on Warehouse to delete those objects.

tisonkun · 2021-11-20T16:15:14Z

docs/warehouse.md

+To add objects to Warehouse, a client uploads objects to Storage first and then commits the uploaded objects to Warehouse.
+To delete objects from Warehouse, a client commits the to be deleted objects to Warehouse and relies on Warehouse to delete those objects.


In this way, a client has to talk to both Storage and Warehouse, so a user has to keep both usages in mind. It seems a bit awkward. I'd imagine that with a Warehouse facade, a user can talk to Warehouse only.

Makes sense. We can hide Storage behind the abstraction.

With #140 and #143, a client (engine) is still holding all of:

- Kernel (Warehouse here)
- Stream
- Journal

```rust
#[derive(Clone)]
pub struct Engine<K: Kernel> {
    kernel: K,
    stream: K::Stream,
    bucket: K::Bucket,
    // ...
}
```
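One way to read "hide Storage behind the abstraction" is a facade whose methods do the upload and the commit internally. This is a minimal sketch under that assumption, with an in-memory stand-in for Storage; `put_object`, `get_object`, and `delete_object` are hypothetical names, not the Engula API.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for Storage: object name -> bytes (hypothetical, in-memory).
#[derive(Default)]
pub struct Storage {
    objects: HashMap<String, Vec<u8>>,
}

/// Facade: clients call Warehouse only; Storage is a private detail.
#[derive(Default)]
pub struct Warehouse {
    storage: Storage,
    committed: BTreeSet<String>,
}

impl Warehouse {
    /// Uploads to Storage first, then commits the metadata.
    pub fn put_object(&mut self, name: &str, data: Vec<u8>) {
        self.storage.objects.insert(name.to_string(), data);
        self.committed.insert(name.to_string());
    }

    /// Only committed objects are visible to readers.
    pub fn get_object(&self, name: &str) -> Option<&[u8]> {
        if !self.committed.contains(name) {
            return None;
        }
        self.storage.objects.get(name).map(|data| data.as_slice())
    }

    /// Commits the deletion; the bytes stay until garbage collection runs.
    pub fn delete_object(&mut self, name: &str) -> bool {
        self.committed.remove(name)
    }
}
```

The trade-off is that object data now flows through the facade, whereas the two-step protocol in the document lets clients upload to Storage directly.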

tisonkun · 2021-11-20T16:16:00Z

docs/warehouse.md

+To add objects to Warehouse, a client uploads objects to Storage first and then commits the uploaded objects to Warehouse.
+To delete objects from Warehouse, a client commits the to be deleted objects to Warehouse and relies on Warehouse to delete those objects.
+It is possible that a client fails to upload an object or fails to commit the uploaded objects. In this case, the corresponding objects become obsolete.
+Warehouse relies on garbage collection to purge obsoleted or to be deleted objects in the end.


Suggested change

Warehouse relies on garbage collection to purge obsoleted or to be deleted objects in the end.

Warehouse relies on garbage collection to purge obsoleted or to-be-deleted objects eventually.
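As an illustration of what the garbage collector has to decide, the sketch below selects objects that no live version references. The grace period is an assumption added here so that objects uploaded but not yet committed are not purged too early; `StoredObject` and `purgeable` are hypothetical names.

```rust
use std::collections::HashSet;

/// An object found in Storage, with its age in seconds (hypothetical fields).
pub struct StoredObject {
    pub name: String,
    pub age_secs: u64,
}

/// Returns the names of objects that garbage collection may purge:
/// those referenced by no live version and older than `grace_secs`.
pub fn purgeable(
    stored: &[StoredObject],
    live: &HashSet<String>,
    grace_secs: u64,
) -> Vec<String> {
    stored
        .iter()
        .filter(|o| !live.contains(&o.name) && o.age_secs > grace_secs)
        .map(|o| o.name.clone())
        .collect()
}
```

Both failed uploads and committed deletions end up in the same state, present in Storage but absent from every live version, so a single pass covers both cases.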

tisonkun · 2021-11-20T16:23:44Z

docs/warehouse.md

+When a client connects to Warehouse, it gets the last version from Warehouse as its base version and then subscribes to delta versions.
+When a delta version arrives, the client applies it to the base version to catch up with Warehouse.


So will the client hold a connection to subscribe to updates per object or per Warehouse? It seems like something that can help in implementing materialized views. We can think more in this direction.

I think it should be bucket-level. And maybe we should only support bucket-level transactions too.

I think bucket level is too coarse. Could Warehouse also adopt a database concept like Hive's? Though that seems more complicated.

I did consider using database concepts here. But it seems that Warehouse is more like an object management abstraction. So it may be easier to understand if we refer to the bucket/object concepts from Storage.

A bucket-level concept may cause many clients to fetch a lot of recent metadata updates that they do not need.

Hmm, I think most applications will map one database to one bucket. I need to make it clear that the bucket concept here may not map to an actual bucket in cloud storage: it is fine for Warehouse to map multiple of its buckets to a single bucket in Storage. Does that sound confusing? Maybe I should consider using "database" instead of "bucket" here 🤔

Yes, that will be clearer. S3's bucket is what I am more familiar with. Thanks for the feedback.

tisonkun · 2021-11-20T16:29:37Z

docs/warehouse.md

+
+## Architecture
+
+![Architecture](images/warehouse-architecture.drawio.svg)


From the graph, it seems the client talks to Warehouse directly. Will Warehouse become a new kind of unit? I'd like to confirm whether it's at the same level as Storage or a component inside Compute.

Good question. Warehouse is between Storage and Compute. I am considering running Warehouse and Manifest in the same unit. There are still some designs to be done in the overall architecture and microunit.

SGZW · 2021-11-21T16:45:10Z

docs/warehouse.md

+
+Warehouse stores object data in Storage and stores object metadata in Manifest.
+
+To add objects to Warehouse, a client uploads objects to Storage first and then commits the uploaded objects to Warehouse.


How are concurrent upload conflicts resolved? Does the Warehouse unit maintain a transaction manager?

The assumption is that clients will not upload the same object concurrently. This can be achieved by assigning a unique name to each object (for example, a UUID or a sequence number issued by Warehouse).
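A minimal sketch of the sequence-number option, assuming Warehouse owns a single atomic counter. `ObjectNames` and the `.obj` suffix are hypothetical; the point is only that every caller receives a distinct name, so two uploads never target the same object.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical name allocator owned by Warehouse.
pub struct ObjectNames {
    next: AtomicU64,
}

impl ObjectNames {
    pub fn new() -> Self {
        Self { next: AtomicU64::new(0) }
    }

    /// Returns a name that no other caller receives from this allocator.
    /// `fetch_add` is atomic, so concurrent callers get distinct numbers.
    pub fn allocate(&self) -> String {
        let seq = self.next.fetch_add(1, Ordering::Relaxed);
        // Zero-padded so that names sort in allocation order.
        format!("{:020}.obj", seq)
    }
}
```

A UUID would avoid the round trip to Warehouse, at the cost of names that no longer sort by creation order.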

SGZW · 2021-11-21T16:49:50Z

docs/warehouse.md

+When a client connects to Warehouse, it gets the last version from Warehouse as its base version and then subscribes to delta versions.
+When a delta version arrives, the client applies it to the base version to catch up with Warehouse.


I think bucket level is too coarse. Could Warehouse also adopt a database concept like Hive's? Though that seems more complicated.

huachaohuang · 2021-11-23T09:14:38Z

Thanks for everyone's feedback. While the design is far from perfect, I think it's still better to have it landed and move the design forward. So I am going to merge it now; some wording can be improved later.

huachaohuang added this to the Version 0.2 milestone Nov 17, 2021

huachaohuang marked this pull request as draft November 17, 2021 13:09

huachaohuang mentioned this pull request Nov 17, 2021

platform: add aws s3 storage #99

Merged

huachaohuang added 2 commits November 18, 2021 17:09

docs: add the design document of Warehouse

4a859c6

Add more detailed descriptions

0cf2058

huachaohuang marked this pull request as ready for review November 18, 2021 09:09

zojw reviewed Nov 18, 2021

tisonkun reviewed Nov 20, 2021

SGZW reviewed Nov 22, 2021

huachaohuang merged commit 80a05fd into engula:main Nov 23, 2021

huachaohuang deleted the docs branch November 23, 2021 09:15
