Semi-automatic Hive Metastore sync (#41) #63

Merged
merged 3 commits into master from feature/hive-metastore-sync-issue-41 on Mar 2, 2016

Conversation

@velvia (Member) commented Feb 19, 2016

This PR adds an optional feature that automatically syncs FiloDB datasets into a Hive Metastore database.

- Uncomment/enable the `filodb.hive.database-name` config setting to specify the Hive metastore database to sync the FiloDB tables to
- The automatic sync happens at the end of ingesting a dataset into FiloDB using the Spark data source write API
- Manual syncing can be triggered in the Spark shell, or in any Spark app that has the FiloDB jar linked in, by calling `filodb.spark.syncToHive(sqlContext)` (see the sketch below)
- The registered tables are tied to the Spark data source, and thus only work through Spark, Spark SQL, the Spark Thrift Server, etc.

This change should allow BI tools to be used with the Spark Thrift Server via a JDBC connection, without the need to create temporary tables every time.
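For example, a manual sync from the Spark shell might look like the following. This is a sketch only: the dataset name `gdelt` and the database name are illustrative, and only `filodb.spark.syncToHive(sqlContext)` and `filodb.hive.database-name` come from this PR.

```scala
// Assumed config (e.g. in application.conf, or via -Dfilodb.hive.database-name=...):
//   filodb.hive.database-name = "filodb"   // illustrative database name

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)  // sc is the SparkContext provided by the Spark shell

// Manually sync all FiloDB datasets into the configured Hive metastore database
filodb.spark.syncToHive(sqlContext)

// The synced tables can then be queried through Spark SQL / the Thrift server
sqlContext.sql("SELECT count(*) FROM gdelt").show()
```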

@velvia (Member, Author) commented Feb 19, 2016

I tested it locally, but I'm not quite sure how to create a Hive database.
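For reference, one common way to create a Hive database is plain HiveQL issued through a HiveContext; the name `filodb` below is an assumption and should match the configured `filodb.hive.database-name`:

```scala
// Standard HiveQL; "filodb" is an illustrative database name
sqlContext.sql("CREATE DATABASE IF NOT EXISTS filodb")
```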

@velvia mentioned this pull request on Feb 19, 2016
velvia added a commit that referenced this pull request on Mar 2, 2016:

Semi-automatic Hive Metastore sync (#41)
@velvia merged commit ad52e08 into master on Mar 2, 2016
@velvia deleted the feature/hive-metastore-sync-issue-41 branch on March 2, 2016 23:48
velvia pushed a commit that referenced this pull request on Feb 14, 2018:

…definition (#63)

This is a pretty huge refactor. Gone are `Projection` and `RichProjection`, and in their place we have a completely revamped and much simpler Dataset class.

- The Dataset definition is now static.  Columns cannot be added after the fact like before.
- Unlike the complex Projection abstraction, which was based on a single projection of a Dataset combined with input columns, a Dataset now simply consists of data columns and partition columns, which are static and do not change
- Computed column support has been removed
- Column IDs are properly created on Dataset creation.  They are used in place of names in the query API now, and in the C* chunks table.
- When specifying the Dataset definition, you pass in "columnName:type" strings, which are validated (see the sketch after this list)
- It is now the responsibility of the input source / `IngestionStream` to map the input schema into partition and data columns matching the Dataset definition
- `IngestRecord` now requires an `IngestRouting`, which maps a flat column schema (like in CSV, Spark Rows, etc.) into an `IngestRecord`, figuring out the routing from input columns into partition/data columns. This is much, much cleaner, since the burden and complexity of input schema routing is removed from Dataset (formerly Projection)
- version is removed from all APIs and Cassandra table schemas as well
- The format of the ingestion source config and the dataset definition config have both changed
- Remove ALL the unused imports!!  This is now enforced by the compiler.
- SetupDataset no longer takes a list of column names, due to the simpler Dataset definition and the `IngestionStream`'s responsibility to map the incoming schema to the Dataset's dataColumns schema
- As a result, there is no longer a need to first get the list of CSV file header column names; this can be done automatically  :)

- New Cassandra schema; please drop old keyspaces
- Kafka: RecordConverter API change
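To illustrate the validated "columnName:type" style of Dataset definition described above, here is a hypothetical sketch; the exact `Dataset` factory signature and the column names are assumptions, not taken from this commit:

```scala
// Hypothetical sketch: the factory signature and names below are assumptions,
// shown only to illustrate the validated "columnName:type" string format.
import filodb.core.metadata.Dataset

val dataset = Dataset(
  "gdelt",                              // dataset name (illustrative)
  Seq("monthYear:long"),                // partition columns as "name:type" strings
  Seq("eventId:int", "avgTone:double")  // data columns as "name:type" strings
)
```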