Migrate table definitions from Hive to Mimir Metadata #321
A good question for discussion is how to implement table mutators in Mimir. Clearly, updates to tables should be registered from Mimir's side. However, it's worth asking whether such updates should be propagated back to the source. Tagging @mrb24 and @lordpretzel for feedback. There's really only one major argument for propagating: it keeps the view of the data from the backend in sync with the view of the data from Mimir. Conversely...
Which leaves the question of how to implement this simulation of updates. My own proposal would be to adopt a GProM-like versioning scheme, where each version is defined as a view over the previous version of the table. This means that we'd need to keep a log of updates to the table in the metadata backend. This could become a performance bottleneck in the longer term, but we could both keep a log and cache the most recent version of the table.
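To make the view-over-previous-version idea concrete, here is a minimal sketch over a Spark DataFrame. `CellUpdate`, `applyLog`, and the assumption that each logged update touches a single cell identified by a key column are illustrative choices, not the actual Mimir update log:

```scala
import org.apache.spark.sql.{DataFrame, functions => F}

// Hypothetical log entry: a single-cell update, identified by a row key,
// the column being changed, and the new value (kept simple as a string).
case class CellUpdate(rowKey: String, column: String, newValue: String)

object VersionedTable {
  // Each logged update defines the next "version" as a view (projection)
  // over the previous one; replaying the whole log reconstructs the
  // current version without ever rewriting the base table.
  def applyLog(base: DataFrame, log: Seq[CellUpdate], keyColumn: String): DataFrame =
    log.foldLeft(base) { (prevVersion, update) =>
      prevVersion.withColumn(
        update.column,
        F.when(F.col(keyColumn) === update.rowKey, F.lit(update.newValue))
         .otherwise(F.col(update.column))
      )
    }
}
```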
By virtualizing the updates, we can batch them before we apply them. We can also combine this with other lossless versioning at the storage level; e.g., we can make the storage versioned by adding a version column. Updates are kept in the log forever, but every X updates (e.g., X = 1000) we append the new row versions to the table stored in HDFS. This would work like a DB with time-travel plus an audit log, which GProM uses in Oracle to replay history with or without provenance tracking. The question is of course what the performance overhead is of reenacting up to X updates with every query. In C, the compilation of 1000 updates is not a bottleneck. Also, I think that if most or all of the updates come from the spreadsheet, this simplifies the compilation problem (only the last update to a cell has to be applied; updates typically affect one cell and do not have complex SET expressions).
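A sketch of the compaction step under the same assumptions: every X updates the reenacted row versions are appended to versioned storage with a version column, while the log itself is retained. The names and the parquet-on-HDFS choice are illustrative, not a committed design:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, functions => F}

object VersionCompaction {
  // Illustrative threshold: compact once the log holds X pending updates.
  val X = 1000

  // Append the reenacted row versions to versioned storage (e.g., a parquet
  // directory on HDFS), tagging every row with the version it belongs to.
  // The update log itself is kept forever; this just caches a snapshot so
  // queries only need to reenact the updates logged since `version`.
  def compact(currentVersion: DataFrame, version: Long, storagePath: String): Unit =
    currentVersion
      .withColumn("version", F.lit(version))
      .write
      .mode(SaveMode.Append)
      .parquet(storagePath)
}
```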
After exploring the code a bit, it seems like this should be easily doable if broken up into a read-only step and a read-write step. For the first, read-only step, we would support only
At present
I propose the following changes:
Examples of SchemaProviders include:
A schema provider would be registered with
For step two, we can define traits that indicate that the provider can accept bulk writes / view materialization, or updates/inserts/deletes.
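As a sketch of what those capability traits might look like (all names and signatures here are hypothetical, not an existing Mimir API):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical base interface: read-only access to tables and their schemas.
trait SchemaProvider {
  def listTables: Seq[String]
  def tableSchema(table: String): Option[Seq[(String, String)]] // (column, type)
}

// Capability mix-ins for step two: a provider opts in by extending them.
trait MaterializationCapable extends SchemaProvider {
  def materialize(table: String, data: DataFrame): Unit
}

trait MutableTableCapable extends SchemaProvider {
  def insert(table: String, rows: DataFrame): Unit
  def update(table: String, set: Map[String, String], where: String): Unit
  def delete(table: String, where: String): Unit
}
```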
A quick chat with @mrb24 led to a revision of the SchemaProvider API. The limiting factor is that, depending on the situation, different interfaces might be appropriate: in some cases we may want the full query provenance (i.e., we want a view), while in others we want just the dataframe. The idea would be to expose each access path explicitly:
For any caller that just wants one particular format, we can provide translation utilities.
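Roughly the shape I have in mind, as a sketch only (the names are hypothetical, and `Operator` stands in for whatever type carries the provenance-bearing view):

```scala
import org.apache.spark.sql.DataFrame

// Placeholder for Mimir's relational-algebra AST: whatever type carries
// full query provenance in the real system.
sealed trait Operator
case class TableRef(provider: String, table: String) extends Operator

// Each access path is exposed explicitly; a provider implements whichever
// ones it can answer natively and returns None for the others.
trait TableAccess {
  def view(table: String): Option[Operator]        // provenance-carrying view
  def dataframe(table: String): Option[DataFrame]  // just the data
}

// Translation utility for callers that want one particular format: fall back
// to compiling the view when the provider has no native DataFrame path.
object TableAccessUtils {
  def dataframeOf(access: TableAccess, table: String,
                  compile: Operator => DataFrame): Option[DataFrame] =
    access.dataframe(table).orElse(access.view(table).map(compile))
}
```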
I would like to decouple loading data into Mimir from the process of data staging. My proposal is as follows: first, the LOAD command (and related mechanisms within Mimir) does nothing more than create links. You give it a URI, and the corresponding dataset becomes visible within Mimir (analogous to Spark's load() command).

As I see it, the primary use case for staging is Vizier. For uploaded files, at least, it makes a lot more sense to have Vizier handle staging directly (i.e., files get streamed directly into S3/HDFS rather than going through Mimir). This makes it possible to avoid the redundant copy as the data goes through the local filesystem. For URLs and other network resources like Google Sheets, Spark already seems to do some caching internally. If necessary, we could materialize one of the views used in data loading, transparent to the user. What do you think @mrb24 ?
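For the link-only LOAD, a minimal sketch of what the bookkeeping could look like. The registry, `TableLink`, and the signatures are just for illustration (the real `LoadedTables` will differ); the lazy `spark.read.format(...).load(...)` call is standard Spark:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.collection.mutable

// Hypothetical link record: nothing is copied or staged; we just remember
// where the data lives and how to read it.
case class TableLink(url: String, format: String, options: Map[String, String])

class LoadedTables(spark: SparkSession) {
  private val links = mutable.Map[String, TableLink]()

  // LOAD: record the link; the dataset becomes visible without touching Spark.
  def load(name: String, url: String, format: String,
           options: Map[String, String] = Map.empty): Unit =
    links(name) = TableLink(url, format, options)

  // Build the DataFrame lazily, only when a query actually needs the table.
  def dataframe(name: String): Option[DataFrame] =
    links.get(name).map { link =>
      spark.read.format(link.format).options(link.options).load(link.url)
    }
}
```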
As a point of curiosity: with Mimir handling table definitions, is it safe to drop the
- Work towards getting `SchemaProvider`s integrated into the system as per issue #321
- SparkBackend code has been split between:
  - `mimir.data.SparkSchemaProvider`: Access to Spark's Derby/Hive backend data repository
  - `mimir.exec.spark.*` and `mimir.exec.Compiler`: Code related to Spark compilation and execution management.
- Database no longer takes a backend parameter (just a metadata backend). Spark is now the only query processing infrastructure used.
- Code throughout has been re-factored to manage a global spark context (TODO: further refactor to make the global spark context a feature of Database)
- OperatorTranslation has been renamed to `RAToSpark` for consistency with the other translation classes.
Most of the immutable data access functionality is alive and kicking. In particular, see the commit notes for:
I'm moving on to implementing mutability. In particular
**Where do materialized views live?** I see several places where materialized views can live.
Hive makes sense when Mimir is connected to a remote Spark instance (e.g., when it's used in Vizier), but carries many of the same limitations that this ticket is trying to address when Mimir is run locally. The metadata backend is a second potential option, but materialized views are bulk data rather than simple key-value associations. This will hurt if, for example, we ever want to use Git or Ground for metadata. The local filesystem has the advantage that it also makes Mimir's views accessible from outside of Mimir once they're materialized (the same goes for Hive). In other words, we want the materialized view target to be configurable (either LoadedTables, Hive, or eventually HDFS). Moreover, it should just be a slightly more powerful form of a schema provider, so I plan to make
The database (view manager?) would then be configured at start-up to use one of these providers. Part of the configuration would also involve deciding on a format for these files. Hive already defaults to parquet, and I think I'm going to do the same with LoadedTables.

**Where does bulk model-related metadata live?** A related problem arises with some lenses and adaptive schemas. For example, the (now defunct) DiscalaAbadi adaptive schema needed to materialize several tables of derived state (the FD graph). I see the solution as being similar: the database has a pre-configured bulk storage target, and we just use that.

**How/Do we allow write-through access to Hive/HDFS/S3?** Since we're no longer staging, a reasonable question is whether we want to allow users to dump data back into Hive or similar data stores (i.e.,

**Support for Updates** Punting on this for now, but when it gets implemented it'll be along the lines of an
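Concretely, the configurable materialization target could be as simple as the following sketch. The trait and method names are placeholders (`MaterializedTableProvider` is borrowed from the discussion further down), and the two targets just illustrate the LoadedTables-parquet vs. Hive split:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical interface: a schema provider that can also accept bulk writes,
// used as the configurable target for materialized views and bulk model state.
trait MaterializedTableProvider {
  def createStoredTableAs(data: DataFrame, name: String): Unit
}

// Target 1: parquet files under a local (or HDFS) directory, so materialized
// views stay readable from outside of Mimir.
class FileMaterializedTableProvider(basePath: String) extends MaterializedTableProvider {
  def createStoredTableAs(data: DataFrame, name: String): Unit =
    data.write.mode(SaveMode.Overwrite).parquet(s"$basePath/$name")
}

// Target 2: Hive's warehouse, for deployments connected to a remote Spark.
class HiveMaterializedTableProvider extends MaterializedTableProvider {
  def createStoredTableAs(data: DataFrame, name: String): Unit =
    data.write.mode(SaveMode.Overwrite).saveAsTable(name)
}
```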
The final question is how to implement staging: locally caching remote resources. Specifically, we'd add an option to the
when
There seem to be two ways to accomplish this:
I'm honestly not sure how the latter approach would be implemented for SparkSchemaProvider, since Spark kind of assumes that everything is a Dataset or RDD. We'd have to go around Spark to whatever datastore (Hive, Derby, etc...) it's using on the backend -- and that assumes that the backend even supports raw file access. It might be more appropriate to define something like:
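Something along these lines, purely for the sake of discussion: the names and the signature are placeholders, and the idea is just raw-file staging keyed by URL, as described below.

```scala
import java.io.InputStream
import java.net.URL
import java.nio.file.{Files, Paths, StandardCopyOption}

// Hypothetical raw-file staging interface: take a remote resource and return
// a URL under which a locally cached copy can be re-read later. This goes
// around Spark entirely, so it works even when the backend has no notion of
// raw file access.
trait StagingProvider {
  def stage(source: URL, nameHint: Option[String]): String
}

// Simplest possible target: a directory on the local filesystem.
class LocalFileStaging(directory: String) extends StagingProvider {
  def stage(source: URL, nameHint: Option[String]): String = {
    val fileName = nameHint.getOrElse(Paths.get(source.getPath).getFileName.toString)
    val target = Paths.get(directory, fileName)
    val in: InputStream = source.openStream()
    try { Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING) }
    finally { in.close() }
    target.toUri.toString
  }
}
```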
I'm a little worried that this overlaps in purpose with
However, I can't quite square the two. There's a major distinction (and maybe this should be reflected in the names) in that StagingProvider is for raw-file storage (staged files are accessed by URL), while BulkStorageProvider is specifically for storing tables (stored tables are immediately visible within Mimir).
Action items:
Of course, we can't just use raw staging. Parquet files don't download easily (multiple files), and some formats (e.g., Google Sheets) just straight up don't download locally. For these, the only practical approach is to materialize the corresponding dataframe.

Doing this through ViewManager is tempting. However, ViewManager also tries to materialize supporting metadata (e.g., taint columns), which will not exist here. Worse, if metadata columns are requested that do not exist in the materialized view, ViewManager falls back to running the source query. This is not behavior that we want. We need a stand-alone mechanism for materializing the dataframe.

The natural approach would be to use the database's preferred MaterializedTableProvider. This has the advantage of being simpler to implement, but creates a behavioral fork based on the table format: some staged tables go down one path, while other staged tables go down another. It also places some loaded tables in the

A more stable way to implement this would be to use the
Action Items:
Modulo making sure test cases pass, this branch seems to be feature complete.
Closed with the #338 merge.
Presently, Mimir's Spark backend uses Hive to store table definitions. This means that:
It would be useful to create an infrastructure within Mimir for manually managing table definitions / table schemas, etc... (with the option of reading from Hive as well if configured to do so). Specifically:

- `CREATE TABLE` should no longer invoke Spark (purely meta-data based; see the sketch after this list)
- Table mutators (`UPDATE`, `INSERT`, etc...) should no longer invoke Spark
- `LOAD` should no longer trigger a write to Spark (though the schema detector pipeline will still read from Spark). Notably, `LOAD` should store the source URL, source format, and any other information needed to create a Spark DataFrame
- `CREATE MATERIALIZED VIEW` and `ALTER VIEW MATERIALIZE` should materialize data to a configurable location (e.g., local filesystem / Hive / S3) in a configurable format (e.g., scala-native binary / csv / json / etc...)
- The query compiler (`mimir.compiler.Compiler`) should dynamically create the appropriate type of Spark DataFrame as needed.
- The `schema` field used by Adaptive Schemas should be used to hide these imports behind a distinct namespace.
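As a rough sketch of the first point (and of how the compiler could derive the Spark schema on demand), with made-up names and a deliberately simplified type mapping:

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import scala.collection.mutable

// Hypothetical metadata-only catalog: CREATE TABLE becomes a pure metadata
// write (no Spark/Hive involved), and a Spark schema is derived on demand
// when the compiler needs to build a DataFrame for the table.
class MetadataCatalog {
  private val definitions = mutable.Map[String, Seq[(String, String)]]() // name -> (column, type)

  // CREATE TABLE foo(a int, b string): just record the column definitions.
  def createTable(name: String, columns: Seq[(String, String)]): Unit =
    definitions(name) = columns

  // Deliberately simplified type mapping, enough for the sketch.
  private def sparkType(t: String) = t.toLowerCase match {
    case "int" | "integer" => IntegerType
    case _                 => StringType
  }

  // Expose the definition as a Spark StructType only when it is needed.
  def sparkSchema(name: String): Option[StructType] =
    definitions.get(name).map { cols =>
      StructType(cols.map { case (col, t) => StructField(col, sparkType(t)) })
    }
}
```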