plateau structures your data using these concepts:
- One whole unit of data that plateau manages is called a dataset.
- A dataset consists of one or more tables that each have a schema.
- Table rows are partitioned by any number of columns: rows with the same combination of values in these columns are grouped together in one partition.
- A partition consists of one or more Parquet files, each containing a chunk of rows that were written at one time.
- plateau can also generate an index for any number of columns, which speeds up finding the Parquet files that are relevant for specific values of an indexed column.
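
The partitioning concept above can be illustrated with a minimal sketch in plain Python. This is not plateau's implementation or API: the helper `group_by_partition` and the column names `country`, `year`, and `sales` are hypothetical and only show how rows sharing the same combination of partition column values end up in the same group.

```python
from collections import defaultdict


def group_by_partition(rows, partition_on):
    """Group rows by the values they hold in the partition columns.

    Hypothetical helper for illustration only, not part of plateau.
    """
    partitions = defaultdict(list)
    for row in rows:
        # The partition key is the combination of values in the partition columns.
        key = tuple((column, row[column]) for column in partition_on)
        partitions[key].append(row)
    return dict(partitions)


# Made-up example rows.
rows = [
    {"country": "DE", "year": 2023, "sales": 10},
    {"country": "DE", "year": 2024, "sales": 12},
    {"country": "FR", "year": 2023, "sales": 7},
    {"country": "DE", "year": 2023, "sales": 3},
]

partitions = group_by_partition(rows, partition_on=["country", "year"])
for key, members in partitions.items():
    # Render the key in the <column>=<value> style used for partition folders.
    label = "/".join(f"{column}={value}" for column, value in key)
    print(label, len(members))
```

Partitioning by zero columns puts all rows into a single group, which corresponds to a table without any partition folders.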
A general plateau storage layout thus looks as follows:
─ <dataset_uuid>.by-dataset-metadata.json
─ <dataset_uuid>/
  ├── <table1>/
  │   ├── _common_metadata
  │   ├── <partition1>=value/
  │   │   ├── <partition2>=value/
  │   │   │   ├ ...
  │   │   │   ├── <partitionN>=value/
  │   │   │   │   ├── df1.parquet
  │   │   │   │   ├── df2.parquet
  │   │   │   │   └── ...
  │   │   │   ├── <partitionN>=value/
  │   │   │   │   ├── df1.parquet
  │   │   │   │   ├── df2.parquet
  │   │   │   │   └── ...
  │   │   │   └── ...
  │   │   ├── <partition2>=value/
  │   │   │   ├ ...
  │   │   │   ├── <partitionN>=value/
  │   │   │   │   ├── df1.parquet
  │   │   │   │   ├── df2.parquet
  │   │   │   │   └── ...
  │   │   │   ├── <partitionN>=value/
  │   │   │   │   ├── df1.parquet
  │   │   │   │   ├── df2.parquet
  │   │   │   │   └── ...
  │   │   │   └── ...
  │   │   └── <partition2>=value/ ...
  │   ├── <partition1>=value/ ...
  │   └── <partition1>=value/ ...
  ├── <table2>/ ...
  ├── <table3>/ ...
  └── indices/
      ├── <index_column1>/
      │   └── <timestamp>.by-dataset-index.parquet
      ├── <index_column2>/ ...
      └── <index_column3>/ ...
Where:
- <dataset_uuid>.by-dataset-metadata.json contains the DatasetMetadata you have seen above.
- <tableN> contains the data of one table in the dataset, partitioned by N >= 0 columns. The directory structure will be N folders deep.
- _common_metadata contains the table schema of dfN.parquet. It is always identical for all Parquet files of a table.
- indices contains a database index for each index column, for quick lookup of rows where the column value matches a given value.
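
To make the layout and the role of an index more concrete, here is a small sketch in plain Python. It does not use plateau and does not reproduce the format of the index Parquet files. The helpers `partition_dir` and `build_index`, the dataset name `my_dataset`, and the columns `country`, `year`, and `customer` are all assumptions chosen for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_dir(dataset_uuid, table, partition_values):
    """Build the directory of one partition: one folder per partition column.

    Hypothetical helper for illustration only, not part of plateau.
    """
    path = PurePosixPath(dataset_uuid) / table
    for column, value in partition_values:
        path = path / f"{column}={value}"
    return path


def build_index(file_values):
    """Map each value of the indexed column to the files that contain it.

    Hypothetical in-memory stand-in for an index, for illustration only.
    """
    index = defaultdict(set)
    for file_path, values in file_values.items():
        for value in values:
            index[value].add(file_path)
    return dict(index)


de_2023 = partition_dir("my_dataset", "table1", [("country", "DE"), ("year", 2023)])
fr_2023 = partition_dir("my_dataset", "table1", [("country", "FR"), ("year", 2023)])

# Made-up values of an indexed column "customer" found in each Parquet file.
file_values = {
    str(de_2023 / "df1.parquet"): ["alice", "bob"],
    str(de_2023 / "df2.parquet"): ["carol"],
    str(fr_2023 / "df1.parquet"): ["bob"],
}

customer_index = build_index(file_values)

# Only the files listed for "bob" need to be read to find matching rows.
print(sorted(customer_index["bob"]))
```

With two partition columns, each partition directory sits two folders below the table directory, matching the "N folders deep" rule. The index lookup narrows a query to the files that can contain the requested value, so the remaining files do not have to be opened.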