
Hive as an External Data Source #4826

Closed
2 of 4 tasks
BohuTANG opened this issue Apr 13, 2022 · 11 comments
Labels
C-feature Category: feature
Milestone
v0.8

Comments

@BohuTANG (Member) commented Apr 13, 2022

Summary

To integrate with Hive, Databend has two options:

  1. Introduce a new Hive database engine, like the GitHub engine did (see "Using the GitHub database engine"). This approach is not flexible, for example: the parameter settings, and if we have many Hive databases they need to be created multiple times.
  2. Introduce a new Hive external stage. This is more flexible and scalable, treating Hive as an object rather than as a database.

We prefer the second way.
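For illustration only, a rough sketch of how the two approaches could differ on the SQL side; the syntax here is hypothetical (the engine name, its parameters, and the hive prefix are assumptions, not something this issue specifies). The point is that a database engine has to be created and parameterized once per Hive database, while an external object is configured once and then addresses every Hive database through it.

    -- Approach 1 (database engine, hypothetical syntax): each Hive database is
    -- created separately, repeating the connection parameters every time.
    CREATE DATABASE hive_db1 ENGINE = HIVE(thrift_address = '127.0.0.1:9083');
    CREATE DATABASE hive_db2 ENGINE = HIVE(thrift_address = '127.0.0.1:9083');

    -- Approach 2 (external object, hypothetical syntax): Hive is configured once
    -- and all of its databases and tables are addressed through a single prefix.
    SELECT * FROM hive.db1.t1;
    SELECT * FROM hive.db2.t2;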

Tasks:

BohuTANG added the C-feature (Category: feature) label on Apr 13, 2022
BohuTANG mentioned this issue on Apr 13, 2022
@zhang2014 (Member)

Add task: implement Hive-supported file formats (https://cwiki.apache.org/confluence/display/Hive/FileFormats).

@BohuTANG (Member, Author)

@zhang2014, we will prioritize supporting the Parquet/JSON/CSV formats, which Databend already supports.

@FANNG1 (Collaborator) commented Apr 13, 2022

Excited to see the progress; it would be very useful for accelerating queries on existing Hive data.
There could be some work needed for Hive SQL compatibility (syntax, data types, functions, UDFs).

@Xuanwo (Member) commented May 7, 2022

OpenDAL has added HDFS support; see #5215 for the status of this feature.

@BohuTANG (Member, Author) commented May 7, 2022

Let's bump OpenDAL 🚀

Xuanwo added this to the v0.8 milestone on May 20, 2022
@FANNG1 (Collaborator) commented May 21, 2022

Could we do a simple Hive query now?

@BohuTANG (Member, Author)

> Could we do a simple Hive query now?

Yes, there is already a simple PoC:
https://github.com/datafuselabs/databend/blob/main/tests/suites/2_stateful_hive/00_basics/00_0000_hms_basics.sql

@dantengsky knows more about it; comments are welcome :)
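The kind of statement the PoC exercises looks roughly like the following; this is a minimal sketch with a made-up table name, and the authoritative statements are in the linked test file.

    -- query a table registered in the Hive metastore through the hive catalog
    SELECT * FROM hive.default.some_hive_table;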

@FANNG1 (Collaborator) commented May 21, 2022

Great! @dantengsky, could you give some instructions to set up the basic environment? I'd like to join your work and help improve it.

@dantengsky (Member) commented May 22, 2022

> Great! @dantengsky, could you give some instructions to set up the basic environment?

great :)

  1. Build databend with the hive feature enabled, e.g.

         cargo build --bin databend-query --features hive

  2. Bring up the Hive metastore.
     There is a docker-compose yml, used in the CI workflow, which may be a simple starting point:

         - name: Hive Cluster Setup
           shell: bash
           run: |
             docker-compose -f "./docker/it-hive/hive-docker-compose.yml" up -d

     but it may bring up too many services ... :D

  3. By default, databend will attach the default catalog of the HMS (Hive metastore) to databend's hive catalog.

     Config of the HMS:

         impl Default for HiveCatalogConfig {
             fn default() -> Self {
                 Self {
                     meta_store_address: "127.0.0.1:9083".to_string(),
                     protocol: ThriftProtocol::Binary,
                 }
             }
         }

     where the catalog of the HMS is attached:

         #[cfg(feature = "hive")]
         fn register_external_catalogs(&mut self, conf: &Config) -> Result<()> {
             let hms_address = &conf.catalog.meta_store_address;
             if !hms_address.is_empty() {
                 // register the hive catalog under the name CATALOG_HIVE
                 let hive_catalog: Arc<dyn Catalog> = Arc::new(HiveCatalog::try_create(hms_address)?);
                 self.catalogs.insert(CATALOG_HIVE.to_owned(), hive_catalog);
             }
             Ok(())
         }

  4. A simple test case (see the sketch at the end of this comment).

  5. Hive table skeleton.

     The read method (for now it only returns an empty block with the table's schema):

         async fn read(
             &self,
             _ctx: Arc<QueryContext>,
             _plan: &ReadDataSourcePlan,
         ) -> Result<SendableDataBlockStream> {
             // skeleton: build a single empty block carrying the table schema
             let block = DataBlock::empty_with_schema(self.table_info.schema());
             Ok(Box::pin(DataBlockStream::create(
                 self.table_info.schema(),
                 None,
                 vec![block],
             )))
         }

     • from the _plan, we can get the name of the catalog
     • from the _ctx, we can get the catalog (QueryContext::get_catalog), which will give back an instance of the hive catalog

the origin PR: #4947
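To make step 4 concrete, here is a hypothetical end-to-end check; all table and column names are assumptions, not taken from the issue or the test suite. Create and populate a table on the Hive side, then read it back through the hive catalog registered in step 3. With the skeleton read method from step 5 the query returns an empty result; once the real read path is implemented it should return the inserted rows.

    -- On the Hive side (e.g. via beeline), backed by the metastore from step 2:
    CREATE TABLE IF NOT EXISTS t_smoke (id INT, name STRING) STORED AS PARQUET;
    INSERT INTO t_smoke VALUES (1, 'a'), (2, 'b');

    -- On the Databend side, through the attached hive catalog:
    SELECT id, name FROM hive.default.t_smoke ORDER BY id;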

@FANNG1 (Collaborator) commented Jun 10, 2022

PR #5895 adds the basic ability to query a simple Hive table, with the following limitations:

  1. partitioned tables are not supported, and reading directories under the table location is not supported
  2. only int and string columns can be queried
  3. only the first row group of a Parquet file is parsed
  4. only small Parquet files can be parsed; large Parquet files may use delta encoding, which is not supported by arrow2 yet
  5. predicate pushdown is not supported

To get Hive SQL running quickly, I made some decisions:

  1. use the file, not the row group, as the Hive partition unit
  2. create a HiveParquetBlockReader instead of reusing BlockReader, because in the current implementation the executor has to parse the Parquet metadata and read more than just one row group for Hive; if we used the row group as the partition unit, we could reuse BlockReader and remove HiveParquetBlockReader
  3. add some Hive table fields (table location & partition keys) to the table meta info; maybe we should create a HiveTableMetaInfo and a DatabendTableMetaInfo? Some abstraction and design work is needed.

But maybe we can address the above limitations and problems step by step (see the example sketched after this comment).

As I'm a newbie to Rust and Databend, feel free to give your advice on the design and the implementation. cc @BohuTANG @dantengsky
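For concreteness, a query that stays inside the current limits might look like the sketch below; the table and column names are made up for illustration. It targets a non-partitioned Hive table backed by a small Parquet file and selects only int and string columns; the filter is evaluated in Databend after the read, since predicate pushdown is not implemented yet.

    -- expected to work under the current limitations:
    -- non-partitioned table, small Parquet file, int/string columns only
    SELECT user_id, user_name
    FROM hive.default.users_flat
    WHERE user_name = 'alice';  -- filtered in Databend, not pushed down to the scan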

@BohuTANG (Member, Author)

The basic framework for Hive support has been completed, so this issue can be closed. We can track the remaining Hive tasks in another issue.
