
Hive as an External Data Source #4826

Closed
2 of 4 tasks
BohuTANG opened this issue Apr 13, 2022 · 11 comments
Labels
C-feature Category: feature
Milestone
v0.8

Comments

@BohuTANG (Member) commented Apr 13, 2022

Summary

To integrate with Hive, Databend has two options:

  1. Introduce a new Hive database engine, like the GitHub engine did (see "Using the GitHub database engine"). This approach is not flexible, for example: the parameter settings, and if we have many Hive databases they need to be created multiple times.
  2. Introduce a new Hive external stage. This is more flexible and scalable, treating Hive as an object rather than as a database.

We prefer the second way.
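For illustration only, a rough sketch of how the two approaches could differ on the SQL side; the syntax here is hypothetical (the engine name, its parameters, and the hive prefix are assumptions, not something this issue specifies). The point is that a database engine has to be created and parameterized once per Hive database, while an external object is configured once and then addresses every Hive database through it.

    -- Approach 1 (database engine, hypothetical syntax): each Hive database is
    -- created separately, repeating the connection parameters every time.
    CREATE DATABASE hive_db1 ENGINE = HIVE(thrift_address = '127.0.0.1:9083');
    CREATE DATABASE hive_db2 ENGINE = HIVE(thrift_address = '127.0.0.1:9083');

    -- Approach 2 (external object, hypothetical syntax): Hive is configured once
    -- and all of its databases and tables are addressed through a single prefix.
    SELECT * FROM hive.db1.t1;
    SELECT * FROM hive.db2.t2;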

Tasks:

BohuTANG added the C-feature (Category: feature) label on Apr 13, 2022
BohuTANG mentioned this issue on Apr 13, 2022
@zhang2014 (Member)

Add task: implement Hive-supported file formats (https://cwiki.apache.org/confluence/display/Hive/FileFormats).

@BohuTANG (Member, Author)

@zhang2014, we will prioritize supporting the Parquet/JSON/CSV formats, which Databend already supports.

@FANNG1 (Collaborator) commented Apr 13, 2022

Excited to see the progress; it would be very useful for accelerating queries on existing Hive data.
There could be some work needed for Hive SQL compatibility (syntax, data types, functions, UDFs).

@Xuanwo (Member) commented May 7, 2022

OpenDAL has added HDFS support; see #5215 for the status of this feature.

@BohuTANG (Member, Author) commented May 7, 2022

Let's bump OpenDAL 🚀

Xuanwo added this to the v0.8 milestone on May 20, 2022
@FANNG1 (Collaborator) commented May 21, 2022

Could we do a simple Hive query now?

@BohuTANG (Member, Author)

> Could we do a simple Hive query now?

Yes, there is already a simple PoC:
https://github.com/datafuselabs/databend/blob/main/tests/suites/2_stateful_hive/00_basics/00_0000_hms_basics.sql

@dantengsky knows more about it; comments are welcome :)
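The kind of statement the PoC exercises looks roughly like the following; this is a minimal sketch with a made-up table name, and the authoritative statements are in the linked test file.

    -- query a table registered in the Hive metastore through the hive catalog
    SELECT * FROM hive.default.some_hive_table;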

@FANNG1 (Collaborator) commented May 21, 2022

Great! @dantengsky, could you give some instructions to set up the basic environment? I'd like to join your work and help improve it.

@dantengsky (Member) commented May 22, 2022

> Great! @dantengsky, could you give some instructions to set up the basic environment?

great :)

  1. Build databend with the hive feature enabled, e.g.

         cargo build --bin databend-query --features hive

  2. Bring up the Hive metastore.
     There is a docker-compose yml, used in the CI workflow, which may be a simple starting point:

         - name: Hive Cluster Setup
           shell: bash
           run: |
             docker-compose -f "./docker/it-hive/hive-docker-compose.yml" up -d

     but it may bring up too many services ... :D

  3. By default, databend will attach the default catalog of the HMS (Hive metastore) to databend's hive catalog.

     Config of the HMS:

         impl Default for HiveCatalogConfig {
             fn default() -> Self {
                 Self {
                     meta_store_address: "127.0.0.1:9083".to_string(),
                     protocol: ThriftProtocol::Binary,
                 }
             }
         }

     where the catalog of the HMS is attached:

         #[cfg(feature = "hive")]
         fn register_external_catalogs(&mut self, conf: &Config) -> Result<()> {
             let hms_address = &conf.catalog.meta_store_address;
             if !hms_address.is_empty() {
                 // register the hive catalog under the name CATALOG_HIVE
                 let hive_catalog: Arc<dyn Catalog> = Arc::new(HiveCatalog::try_create(hms_address)?);
                 self.catalogs.insert(CATALOG_HIVE.to_owned(), hive_catalog);
             }
             Ok(())
         }

  4. A simple test case (see the sketch at the end of this comment).

  5. Hive table skeleton.

     The read method (for now it only returns an empty block with the table's schema):

         async fn read(
             &self,
             _ctx: Arc<QueryContext>,
             _plan: &ReadDataSourcePlan,
         ) -> Result<SendableDataBlockStream> {
             // skeleton: build a single empty block carrying the table schema
             let block = DataBlock::empty_with_schema(self.table_info.schema());
             Ok(Box::pin(DataBlockStream::create(
                 self.table_info.schema(),
                 None,
                 vec![block],
             )))
         }

     • from the _plan, we can get the name of the catalog
     • from the _ctx, we can get the catalog (QueryContext::get_catalog), which will give back an instance of the hive catalog

the origin PR: #4947
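To make step 4 concrete, here is a hypothetical end-to-end check; all table and column names are assumptions, not taken from the issue or the test suite. Create and populate a table on the Hive side, then read it back through the hive catalog registered in step 3. With the skeleton read method from step 5 the query returns an empty result; once the real read path is implemented it should return the inserted rows.

    -- On the Hive side (e.g. via beeline), backed by the metastore from step 2:
    CREATE TABLE IF NOT EXISTS t_smoke (id INT, name STRING) STORED AS PARQUET;
    INSERT INTO t_smoke VALUES (1, 'a'), (2, 'b');

    -- On the Databend side, through the attached hive catalog:
    SELECT id, name FROM hive.default.t_smoke ORDER BY id;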

@FANNG1 (Collaborator) commented Jun 10, 2022

PR #5895 adds the basic ability to query a simple Hive table, with the following limitations:

  1. partitioned tables are not supported, and reading directories under the table location is not supported
  2. only int and string columns can be queried
  3. only the first row group of a Parquet file is parsed
  4. only small Parquet files can be parsed; large Parquet files may use delta encoding, which is not supported by arrow2 yet
  5. predicate pushdown is not supported

To get Hive SQL running quickly, I made some decisions:

  1. use the file, not the row group, as the Hive partition unit
  2. create a HiveParquetBlockReader instead of reusing BlockReader, because in the current implementation the executor has to parse the Parquet metadata and read more than just one row group for Hive; if we used the row group as the partition unit, we could reuse BlockReader and remove HiveParquetBlockReader
  3. add some Hive table fields (table location & partition keys) to the table meta info; maybe we should create a HiveTableMetaInfo and a DatabendTableMetaInfo? Some abstraction and design work is needed.

But maybe we can address the above limitations and problems step by step (see the example sketched after this comment).

As I'm a newbie to Rust and Databend, feel free to give your advice on the design and the implementation. cc @BohuTANG @dantengsky
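For concreteness, a query that stays inside the current limits might look like the sketch below; the table and column names are made up for illustration. It targets a non-partitioned Hive table backed by a small Parquet file and selects only int and string columns; the filter is evaluated in Databend after the read, since predicate pushdown is not implemented yet.

    -- expected to work under the current limitations:
    -- non-partitioned table, small Parquet file, int/string columns only
    SELECT user_id, user_name
    FROM hive.default.users_flat
    WHERE user_name = 'alice';  -- filtered in Databend, not pushed down to the scan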

@BohuTANG (Member, Author)

The basic framework for Hive support has been completed, so this issue can be closed. We can track the remaining Hive tasks in another issue.
