Semi-automatic Hive Metastore sync (#41) #63

Merged
merged 3 commits into master from feature/hive-metastore-sync-issue-41 on Mar 2, 2016

Conversation

@velvia (Member) commented Feb 19, 2016

This PR adds an optional feature that automatically syncs FiloDB datasets into a Hive Metastore database.

- Uncomment/enable the `filodb.hive.database-name` config setting to specify the Hive metastore database to sync the FiloDB tables to
- The automatic sync happens at the end of ingesting a dataset into FiloDB using the Spark data source write API
- Manual syncing can be triggered in the Spark shell, or in any Spark app that has the FiloDB jar linked in, by calling `filodb.spark.syncToHive(sqlContext)` (see the sketch below)
- The registered tables are tied to the Spark data source, and thus only work through Spark, Spark SQL, the Spark Thrift Server, etc.

This change should allow BI tools to be used with the Spark Thrift Server via a JDBC connection, without the need to create temporary tables every time.
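For example, a manual sync from the Spark shell might look like the following. This is a sketch only: the dataset name `gdelt` and the database name are illustrative, and only `filodb.spark.syncToHive(sqlContext)` and `filodb.hive.database-name` come from this PR.

```scala
// Assumed config (e.g. in application.conf, or via -Dfilodb.hive.database-name=...):
//   filodb.hive.database-name = "filodb"   // illustrative database name

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)  // sc is the SparkContext provided by the Spark shell

// Manually sync all FiloDB datasets into the configured Hive metastore database
filodb.spark.syncToHive(sqlContext)

// The synced tables can then be queried through Spark SQL / the Thrift server
sqlContext.sql("SELECT count(*) FROM gdelt").show()
```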

@velvia (Member, Author) commented Feb 19, 2016

I tested it locally, but I'm not quite sure how to create a Hive database.
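For reference, one common way to create a Hive database is plain HiveQL issued through a HiveContext; the name `filodb` below is an assumption and should match the configured `filodb.hive.database-name`:

```scala
// Standard HiveQL; "filodb" is an illustrative database name
sqlContext.sql("CREATE DATABASE IF NOT EXISTS filodb")
```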

@velvia mentioned this pull request on Feb 19, 2016
velvia added a commit that referenced this pull request on Mar 2, 2016:

Semi-automatic Hive Metastore sync (#41)
@velvia merged commit ad52e08 into master on Mar 2, 2016
@velvia deleted the feature/hive-metastore-sync-issue-41 branch on March 2, 2016 23:48
velvia pushed a commit that referenced this pull request on Feb 14, 2018:

…definition (#63)

This is a pretty huge refactor. Gone are `Projection` and `RichProjection`, and in their place we have a completely revamped and much simpler Dataset class.

- The Dataset definition is now static.  Columns cannot be added after the fact like before.
- Unlike the complex Projection abstraction, which was based on a single projection of a Dataset combined with input columns, a Dataset now simply consists of data columns and partition columns, which are static and do not change
- Computed column support has been removed
- Column IDs are properly created on Dataset creation.  They are used in place of names in the query API now, and in the C* chunks table.
- When specifying the Dataset definition, you pass in "columnName:type" strings, which are validated (see the sketch after this list)
- It is now the responsibility of the input source / `IngestionStream` to map the input schema into partition and data columns matching the Dataset definition
- `IngestRecord` now requires an `IngestRouting`, which maps a flat column schema (like in CSV, Spark Rows, etc.) into an `IngestRecord`, figuring out the routing from input columns into partition/data columns. This is much, much cleaner, since the burden and complexity of input schema routing is removed from Dataset (formerly Projection)
- version is removed from all APIs and Cassandra table schemas as well
- The format of the ingestion source config and the dataset definition config have both changed
- Remove ALL the unused imports!!  This is now enforced by the compiler.
- SetupDataset no longer takes a list of column names, due to the simpler Dataset definition and the `IngestionStream`'s responsibility to map the incoming schema to the Dataset's dataColumns schema
- As a result, there is no longer a need to first get the list of CSV file header column names; this can be done automatically  :)

- New Cassandra schema; please drop old keyspaces
- Kafka: RecordConverter API change
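To illustrate the validated "columnName:type" style of Dataset definition described above, here is a hypothetical sketch; the exact `Dataset` factory signature and the column names are assumptions, not taken from this commit:

```scala
// Hypothetical sketch: the factory signature and names below are assumptions,
// shown only to illustrate the validated "columnName:type" string format.
import filodb.core.metadata.Dataset

val dataset = Dataset(
  "gdelt",                              // dataset name (illustrative)
  Seq("monthYear:long"),                // partition columns as "name:type" strings
  Seq("eventId:int", "avgTone:double")  // data columns as "name:type" strings
)
```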