
Enable Partition Discovery for Broker Load #1582

Closed
yuanlihan opened this issue Aug 5, 2019 · 10 comments
yuanlihan (Contributor) commented Aug 5, 2019

Broker Load cannot parse partition columns and cannot recursively list files

When users load data from HDFS files written by Spark jobs, they usually need to extract the partition columns that were specified through the write options of the Spark API. Recursively listing the files under a path (directory) is also needed, because there are usually multiple partition columns.

For example, suppose we load data from this source:

  • base path: hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/
  • partition columns to be extracted: city and utc_date
  • input path (dir): hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/city=beijing/utc_date=2019-06-26
  • actual files: [hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/city=beijing/utc_date=2019-06-26/0000.csv, hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/city=beijing/utc_date=2019-06-26/0001.csv, ...]
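The extraction described above can be sketched in Python (a hypothetical helper, not Doris code; the function name and dict return type are illustrative):

```python
def discover_partitions(base_path, file_path):
    """Extract Hive/Spark-style key=value partition columns from the part
    of file_path below base_path (a sketch of Spark's Partition Discovery)."""
    relative = file_path[len(base_path):].strip("/")
    columns = {}
    for segment in relative.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # only key=value directory segments encode partition columns
            columns[key] = value
    return columns

base = "hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/"
f = base + "city=beijing/utc_date=2019-06-26/0000.csv"
print(discover_partitions(base, f))  # {'city': 'beijing', 'utc_date': '2019-06-26'}
```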

Expected Broker Load Interface for the Partition Discovery Feature

            DATA INFILE
            (
            "file_path1"[, file_path2, ...]
            )
            [NEGATIVE]
            INTO TABLE `table_name`
            [PARTITION (p1, p2)]
            [COLUMNS TERMINATED BY "column_separator"]
            [FORMAT AS "file_type"]
            [COLUMNS FROM PATH AS (columns_from_path)]
            [(column_list)]
            [SET (k1 = func(k2))]
  • file_path: the path of the files to be loaded, e.g. hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/city=beijing/utc_date=2019-06-26/*
  • columns_from_path: the partition columns to be extracted from the file path.

Partition columns in the file path are parsed from right to left.
eg1: from the file path hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/city=beijing/utc_date=2019-06-26/0001.csv, column utc_date is extracted first, then city.
eg2: from the file path hdfs://hdfs_host:hdfs_port/user/palo/data/input/k1=value/dir/k1=value1/001.csv, k1=value1 is extracted (the rightmost occurrence wins).
An error is reported if a partition column specified in columns_from_path is not found in the file path.
Last but not least, columns from path are also compatible with the column-mapping operations of the SET clause, just like column_list.
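A minimal sketch of this right-to-left rule (hypothetical Python, not the actual FE implementation; the rightmost occurrence of each key wins, and a missing column raises an error):

```python
def parse_columns_from_path(file_path, columns_from_path):
    """Scan directory segments right to left; each requested column takes
    the value of its rightmost key=value occurrence in the path."""
    segments = file_path.rstrip("/").split("/")[:-1]  # drop the file name
    values = {}
    for segment in reversed(segments):
        key, sep, value = segment.partition("=")
        if sep and key in columns_from_path and key not in values:
            values[key] = value
    missing = [c for c in columns_from_path if c not in values]
    if missing:
        raise ValueError("partition column(s) not found in path: %s" % missing)
    return values

# eg2 above: the rightmost k1=... segment wins
path = "hdfs://hdfs_host:hdfs_port/user/palo/data/input/k1=value/dir/k1=value1/001.csv"
print(parse_columns_from_path(path, ["k1"]))  # {'k1': 'value1'}
```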

imay (Contributor) commented Aug 5, 2019

@yuanlihan

I prefer the first option, but I think some changes are needed in your version.

  1. We can still use INFILE, since we support wildcards in paths. To load a directory, the user can specify a path like "base_dir/*", which is also consistent with the previous usage.
  2. I think we should keep the number of keywords as small as possible, so PARTITIONED_COLUMNS and BASE_DIR should be reconsidered. For example, PARTITION COLUMNS or BASE DIR?
  3. PARTITIONED_COLUMNS has no relation to PARTITION, which may confuse users. Is there a better choice?

yuanlihan (Author) commented:

> I prefer the first option, but I think some changes are needed in your version.
>
> 1. We can still use INFILE, since we support wildcards in paths. To load a directory, the user can specify a path like "base_dir/*", which is also consistent with the previous usage.
> 2. I think we should keep the number of keywords as small as possible, so PARTITIONED_COLUMNS and BASE_DIR should be reconsidered. For example, PARTITION COLUMNS or BASE DIR?
> 3. PARTITIONED_COLUMNS has no relation to PARTITION, which may confuse users. Is there a better choice?

@imay

  1. Currently a wildcard path only lists files directly under the directory, but we need to list files recursively. Can we change the logic of a wildcard path (e.g. "base_dir/*") to list files recursively?
  2. What about changing [BASE_PATH AS "base_path"] to [PATH START WITH "base_path"] and [PARTITIONED_COLUMNS AS (partitioned_column_list)] to [COLUMNS FROM PATH AS (columns_from_path)]? Then we only need to add one keyword, PATH.

imay (Contributor) commented Aug 5, 2019

> 1. Currently a wildcard path only lists files directly under the directory, but we need to list files recursively. Can we change the logic of a wildcard path (e.g. "base_dir/*") to list files recursively?

I think we can support listing paths like "base_dir/*/*/*".

> 2. What about changing [BASE_PATH AS "base_path"] to [PATH START WITH "base_path"] and [PARTITIONED_COLUMNS AS (partitioned_column_list)] to [COLUMNS FROM PATH AS (columns_from_path)]? Then we only need to add one keyword, PATH.

I think this syntax is good.

yuanlihan (Author) commented:

> I think we can support listing paths like "base_dir/*/*/*".

This syntax seems a little weird. What about supporting recursive listing of a path (e.g. "base_dir/" or "base_dir/*") only if the user specifies columns_from_path via [COLUMNS FROM PATH AS (columns_from_path)]? That would rarely conflict with previous usage.

imay (Contributor) commented Aug 6, 2019

> > I think we can support listing paths like "base_dir/*/*/*".
>
> This syntax seems a little weird. What about supporting recursive listing of a path (e.g. "base_dir/" or "base_dir/*") only if the user specifies columns_from_path via [COLUMNS FROM PATH AS (columns_from_path)]? That would rarely conflict with previous usage.

If you think the wildcard is weird, we can keep DATA INDIR and drop the [PATH START WITH "base_path"] clause. If users specify the columns_from_path clause, we recurse through the directory according to it; if they don't, we only traverse one level.

For example, if a user specifies DATA INDIR("/path/to/dir") with COLUMNS FROM PATH AS (k1, k2), we will traverse two directory levels to find the files to import, getting paths like "/path/to/dir/k1=1/k2=2/file1". If the user doesn't specify the COLUMNS FROM PATH clause, we will only traverse one level and get files like "/path/to/dir/file1".

And for DATA INFILE with COLUMNS FROM PATH, we will try to parse partition columns from the user-specified file path, and return an error to the user if no match is found.
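The traversal rule above can be sketched as follows (assumed Python, using local directories in place of HDFS; with N columns in columns_from_path we descend N levels, otherwise we take files directly under the directory):

```python
import os

def list_load_files(base_dir, columns_from_path):
    """List files to import: descend one directory level per entry in
    columns_from_path, then collect the regular files at that depth."""
    dirs = [base_dir]
    for _ in columns_from_path:          # one level per partition column
        dirs = [os.path.join(d, name)
                for d in dirs
                for name in sorted(os.listdir(d))
                if os.path.isdir(os.path.join(d, name))]
    return [os.path.join(d, name)
            for d in dirs
            for name in sorted(os.listdir(d))
            if os.path.isfile(os.path.join(d, name))]
```

With COLUMNS FROM PATH AS (k1, k2) this yields paths like "/path/to/dir/k1=1/k2=2/file1"; without the clause, only files like "/path/to/dir/file1".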

yuanlihan (Author) commented:

@imay
I have updated the interface design above.

imay (Contributor) commented Aug 6, 2019

> @imay
> I have updated the interface design above.

Good job, I agree with this interface.

EmmyMiao87 (Contributor) commented:

In the future, the columns from path will also participate in column mapping. So I think the grammar of the load statement should be: columns terminated by 'xxx' (tmp_k1) columns from path (tmp_k2) set (k1=tmp_k1+tmp_k2).

yuanlihan (Author) commented:

> In the future, the columns from path will also participate in column mapping. So I think the grammar of the load statement should be: columns terminated by 'xxx' (tmp_k1) columns from path (tmp_k2) set (k1=tmp_k1+tmp_k2).

@EmmyMiao87
Sure, columns from path are also compatible with the column-mapping operations of the SET clause. I have added this to the interface design doc.

imay pushed a commit that referenced this issue Aug 19, 2019
Currently, we do not support parsing columns encoded in the file path, e.g. extracting column k1 from the file path /path/to/dir/k1=1/xxx.csv.

This patch makes it possible to parse columns from the file path, as in Spark's Partition Discovery.

The patch parses partition columns in BrokerScanNode.java and saves the parsing result for each file path as a property of TBrokerRangeDesc, so the broker reader on the BE can read the value of each specified partition column.
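The flow described in the commit can be sketched as follows (hypothetical Python; TBrokerRangeDesc is the real Thrift struct on the FE/BE boundary, modeled here as a plain dict):

```python
def build_range_desc(file_path, columns_from_path):
    """FE-side sketch: parse the partition-column values out of one file
    path and attach them to the scan-range descriptor, so the BE broker
    reader can emit them as column values without re-parsing the path."""
    segments = file_path.rstrip("/").split("/")[:-1]  # drop the file name
    values = {}
    for segment in reversed(segments):   # rightmost occurrence wins
        key, sep, value = segment.partition("=")
        if sep and key in columns_from_path and key not in values:
            values[key] = value
    if set(values) != set(columns_from_path):
        raise ValueError("could not match all columns_from_path in " + file_path)
    return {"path": file_path,
            "columns_from_path": [values[c] for c in columns_from_path]}
```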
yuanlihan (Author) commented:

Added

morningman pushed a commit that referenced this issue Apr 12, 2023