
Enable Partition Discovery for Broker Load #1582

Closed
yuanlihan opened this issue Aug 5, 2019 · 10 comments
yuanlihan (Contributor) commented Aug 5, 2019

Broker Load cannot parse partition columns and cannot recursively list files

When users load data from HDFS files written by Spark jobs, they usually need to extract the partition columns that were specified through the write options of the Spark API. Recursively listing the files under a path (directory) is also needed, because there are usually multiple partition columns.

For example, suppose we load data from this source:

  • base path: hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/
  • partition columns to be extracted: city and utc_date
  • input path (dir): hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/city=beijing/utc_date=2019-06-26
  • actual files: [hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/city=beijing/utc_date=2019-06-26/0000.csv, hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/city=beijing/utc_date=2019-06-26/0001.csv, ...]
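The extraction described above can be sketched in Python (a hypothetical helper, not Doris code; the function name and dict return type are illustrative):

```python
def discover_partitions(base_path, file_path):
    """Extract Hive/Spark-style key=value partition columns from the part
    of file_path below base_path (a sketch of Spark's Partition Discovery)."""
    relative = file_path[len(base_path):].strip("/")
    columns = {}
    for segment in relative.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # only key=value directory segments encode partition columns
            columns[key] = value
    return columns

base = "hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/"
f = base + "city=beijing/utc_date=2019-06-26/0000.csv"
print(discover_partitions(base, f))  # {'city': 'beijing', 'utc_date': '2019-06-26'}
```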

Expected Broker Load Interface for the Partition Discovery Feature

            DATA INFILE
            (
            "file_path1"[, file_path2, ...]
            )
            [NEGATIVE]
            INTO TABLE `table_name`
            [PARTITION (p1, p2)]
            [COLUMNS TERMINATED BY "column_separator"]
            [FORMAT AS "file_type"]
            [COLUMNS FROM PATH AS (columns_from_path)]
            [(column_list)]
            [SET (k1 = func(k2))]
  • file_path: the path of the files to be loaded, e.g. hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/city=beijing/utc_date=2019-06-26/*
  • columns_from_path: the partition columns to be extracted from the file path.

Partition columns in the file path are parsed from right to left.
eg1: from the file path hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/city=beijing/utc_date=2019-06-26/0001.csv, column utc_date is extracted first, then city.
eg2: from the file path hdfs://hdfs_host:hdfs_port/user/palo/data/input/k1=value/dir/k1=value1/001.csv, k1=value1 is extracted (the rightmost occurrence wins).
An error is reported if a partition column specified in columns_from_path is not found in the file path.
Last but not least, columns from path are also compatible with the column-mapping operations of the SET clause, just like column_list.
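A minimal sketch of this right-to-left rule (hypothetical Python, not the actual FE implementation; the rightmost occurrence of each key wins, and a missing column raises an error):

```python
def parse_columns_from_path(file_path, columns_from_path):
    """Scan directory segments right to left; each requested column takes
    the value of its rightmost key=value occurrence in the path."""
    segments = file_path.rstrip("/").split("/")[:-1]  # drop the file name
    values = {}
    for segment in reversed(segments):
        key, sep, value = segment.partition("=")
        if sep and key in columns_from_path and key not in values:
            values[key] = value
    missing = [c for c in columns_from_path if c not in values]
    if missing:
        raise ValueError("partition column(s) not found in path: %s" % missing)
    return values

# eg2 above: the rightmost k1=... segment wins
path = "hdfs://hdfs_host:hdfs_port/user/palo/data/input/k1=value/dir/k1=value1/001.csv"
print(parse_columns_from_path(path, ["k1"]))  # {'k1': 'value1'}
```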

imay (Contributor) commented Aug 5, 2019

@yuanlihan

I prefer the first option, but I think some changes are needed in your version.

  1. We can still use INFILE, since we support wildcards in paths. To load a directory, the user can specify a path like "base_dir/*", which is also consistent with the previous usage.
  2. I think we should keep the number of keywords as small as possible, so PARTITIONED_COLUMNS and BASE_DIR should be reconsidered. For example, PARTITION COLUMNS or BASE DIR?
  3. PARTITIONED_COLUMNS has no relation to PARTITION, which may confuse users. Is there a better choice?

yuanlihan (Author) commented:

> I prefer the first option, but I think some changes are needed in your version.
>
> 1. We can still use INFILE, since we support wildcards in paths. To load a directory, the user can specify a path like "base_dir/*", which is also consistent with the previous usage.
> 2. I think we should keep the number of keywords as small as possible, so PARTITIONED_COLUMNS and BASE_DIR should be reconsidered. For example, PARTITION COLUMNS or BASE DIR?
> 3. PARTITIONED_COLUMNS has no relation to PARTITION, which may confuse users. Is there a better choice?

@imay

  1. Currently a wildcard path only lists files directly under the directory, but we need to list files recursively. Can we change the logic of a wildcard path (e.g. "base_dir/*") to list files recursively?
  2. What about changing [BASE_PATH AS "base_path"] to [PATH START WITH "base_path"] and [PARTITIONED_COLUMNS AS (partitioned_column_list)] to [COLUMNS FROM PATH AS (columns_from_path)]? Then we only need to add one keyword, PATH.

imay (Contributor) commented Aug 5, 2019

> 1. Currently a wildcard path only lists files directly under the directory, but we need to list files recursively. Can we change the logic of a wildcard path (e.g. "base_dir/*") to list files recursively?

I think we can support listing paths like "base_dir/*/*/*".

> 2. What about changing [BASE_PATH AS "base_path"] to [PATH START WITH "base_path"] and [PARTITIONED_COLUMNS AS (partitioned_column_list)] to [COLUMNS FROM PATH AS (columns_from_path)]? Then we only need to add one keyword, PATH.

I think this syntax is good.

yuanlihan (Author) commented:

> I think we can support listing paths like "base_dir/*/*/*".

This syntax seems a little weird. What about supporting recursive listing of a path (e.g. "base_dir/" or "base_dir/*") only if the user specifies columns_from_path via [COLUMNS FROM PATH AS (columns_from_path)]? That would rarely conflict with previous usage.

imay (Contributor) commented Aug 6, 2019

> > I think we can support listing paths like "base_dir/*/*/*".
>
> This syntax seems a little weird. What about supporting recursive listing of a path (e.g. "base_dir/" or "base_dir/*") only if the user specifies columns_from_path via [COLUMNS FROM PATH AS (columns_from_path)]? That would rarely conflict with previous usage.

If you think the wildcard is weird, we can keep DATA INDIR and drop the [PATH START WITH "base_path"] clause. If users specify the columns_from_path clause, we recurse through the directory according to it; if they don't, we only traverse one level.

For example, if a user specifies DATA INDIR("/path/to/dir") with COLUMNS FROM PATH AS (k1, k2), we will traverse two directory levels to find the files to import, getting paths like "/path/to/dir/k1=1/k2=2/file1". If the user doesn't specify the COLUMNS FROM PATH clause, we will only traverse one level and get files like "/path/to/dir/file1".

And for DATA INFILE with COLUMNS FROM PATH, we will try to parse partition columns from the user-specified file path, and return an error to the user if no match is found.
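The traversal rule above can be sketched as follows (assumed Python, using local directories in place of HDFS; with N columns in columns_from_path we descend N levels, otherwise we take files directly under the directory):

```python
import os

def list_load_files(base_dir, columns_from_path):
    """List files to import: descend one directory level per entry in
    columns_from_path, then collect the regular files at that depth."""
    dirs = [base_dir]
    for _ in columns_from_path:          # one level per partition column
        dirs = [os.path.join(d, name)
                for d in dirs
                for name in sorted(os.listdir(d))
                if os.path.isdir(os.path.join(d, name))]
    return [os.path.join(d, name)
            for d in dirs
            for name in sorted(os.listdir(d))
            if os.path.isfile(os.path.join(d, name))]
```

With COLUMNS FROM PATH AS (k1, k2) this yields paths like "/path/to/dir/k1=1/k2=2/file1"; without the clause, only files like "/path/to/dir/file1".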

yuanlihan (Author) commented:

@imay
I have updated the interface design above.

imay (Contributor) commented Aug 6, 2019

> @imay
> I have updated the interface design above.

Good job, I agree with this interface.

EmmyMiao87 (Contributor) commented:

In the future, the columns from path will also participate in column mapping. So I think the grammar of the load statement should be: columns terminated by 'xxx' (tmp_k1) columns from path (tmp_k2) set (k1=tmp_k1+tmp_k2).

yuanlihan (Author) commented:

> In the future, the columns from path will also participate in column mapping. So I think the grammar of the load statement should be: columns terminated by 'xxx' (tmp_k1) columns from path (tmp_k2) set (k1=tmp_k1+tmp_k2).

@EmmyMiao87
Sure, columns from path are also compatible with the column-mapping operations of the SET clause. I have added this to the interface design doc.

imay pushed a commit that referenced this issue Aug 19, 2019
Currently, we do not support parsing columns encoded in the file path, e.g. extracting column k1 from the file path /path/to/dir/k1=1/xxx.csv.

This patch makes it possible to parse columns from the file path, as in Spark's Partition Discovery.

The patch parses partition columns in BrokerScanNode.java and saves the parsing result for each file path as a property of TBrokerRangeDesc, so the broker reader on the BE can read the value of each specified partition column.
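The flow described in the commit can be sketched as follows (hypothetical Python; TBrokerRangeDesc is the real Thrift struct on the FE/BE boundary, modeled here as a plain dict):

```python
def build_range_desc(file_path, columns_from_path):
    """FE-side sketch: parse the partition-column values out of one file
    path and attach them to the scan-range descriptor, so the BE broker
    reader can emit them as column values without re-parsing the path."""
    segments = file_path.rstrip("/").split("/")[:-1]  # drop the file name
    values = {}
    for segment in reversed(segments):   # rightmost occurrence wins
        key, sep, value = segment.partition("=")
        if sep and key in columns_from_path and key not in values:
            values[key] = value
    if set(values) != set(columns_from_path):
        raise ValueError("could not match all columns_from_path in " + file_path)
    return {"path": file_path,
            "columns_from_path": [values[c] for c in columns_from_path]}
```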
yuanlihan (Author) commented:

Added

morningman pushed a commit that referenced this issue Apr 12, 2023