Enable Partition Discovery for Broker Load #1582
Comments
I prefer the first option. However, I think there should be some changes in your version.
I think we can support listing paths like
I think this syntax is good.
This syntax seems a little weird. What about supporting recursive listing of files under a path (e.g., "base_dir/" or "base_dir/*") only when users specify columns_from_path via [COLUMNS FROM PATH AS (columns_from_path)]? That would rarely conflict with previous usage.
If you think the wildcard is weird, I think we can keep it. For example, if the user specifies it. And for
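To make the recursive-listing behavior under discussion concrete, here is a minimal local-filesystem sketch (hypothetical class and method names; the real broker would list files over HDFS through the broker RPC interface, not java.nio):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the "base_dir/" / "base_dir/*" semantics:
// when the user opts in, recursively list every regular file under
// base_dir instead of treating the path as a single file.
public class RecursiveLister {
    public static List<Path> listFiles(Path baseDir) throws IOException {
        try (Stream<Path> stream = Files.walk(baseDir)) {
            return stream.filter(Files::isRegularFile)
                         .sorted()
                         .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny partitioned layout: base_dir/k1=1/0001.csv + base_dir/top.csv
        Path base = Files.createTempDirectory("base_dir");
        Files.createDirectories(base.resolve("k1=1"));
        Files.createFile(base.resolve("k1=1").resolve("0001.csv"));
        Files.createFile(base.resolve("top.csv"));
        System.out.println(listFiles(base).size()); // prints 2
    }
}
```

Only regular files are returned, so intermediate partition directories such as k1=1 never show up as load targets.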
@imay
Good job, I agree with this interface.
In the future, the columns from path will also participate in column mapping. So I think the grammar of the load stmt is
@EmmyMiao87 |
Currently, we do not support parsing encoded/compressed columns from the file path, e.g., extracting column k1 from the file path /path/to/dir/k1=1/xxx.csv. This patch is able to parse columns from the file path as in Spark (Partition Discovery). It parses partition columns in BrokerScanNode.java and saves the parsing result for each file path as a property of TBrokerRangeDesc, so that the broker reader on the BE side can read the value of the specified partition column.
Added
…dbs to resolve class conflicts (#1582)
Cannot parse partitioned columns or recursively list files when using Broker Load
When users try to load data from HDFS files written by Spark jobs, they usually need to extract partitioned columns, which are specified by the write options of the Spark API. Recursively listing the files of a path (dir) is also needed, because there are usually multiple partitioned columns.
Ex: We try to load data from the source:
Expected Interface of Broker Load for the Feature of Partition Discovery
We will parse partitioned columns in the file path from right to left.
eg1: first extract column utc_date and then city from file path hdfs://hdfs_host:hdfs_port/user/palo/data/input/dir/city=beijing/utc_date=2019-06-26/0001.csv
eg2: extract k1=value1 from file path hdfs://hdfs_host:hdfs_port/user/palo/data/input/k1=value/dir/k1=value1/001.csv
An error is reported if a partitioned column specified in columns_from_path is not found in the file path.
Last but not least, columns from path are also compatible with the column mapping operations of the SET statement, just like column_list.
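The right-to-left extraction rule above can be sketched roughly as follows (a hypothetical illustration with made-up names; the actual parsing lives in BrokerScanNode.java and its results are stored in TBrokerRangeDesc):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch (not the actual Doris code): extract "key=value"
// partition columns from a file path, scanning directory segments from
// right to left so that the rightmost occurrence of a key wins (eg2 above).
public class PathPartitionParser {
    public static Map<String, String> parse(String filePath) {
        Map<String, String> columns = new LinkedHashMap<>();
        String[] segments = filePath.split("/");
        // Skip the file name itself (last segment); walk dirs right to left.
        for (int i = segments.length - 2; i >= 0; i--) {
            int eq = segments[i].indexOf('=');
            if (eq > 0) {
                String key = segments[i].substring(0, eq);
                // putIfAbsent keeps the first hit, i.e. the rightmost segment.
                columns.putIfAbsent(key, segments[i].substring(eq + 1));
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        Map<String, String> cols = parse(
            "hdfs://hdfs_host:hdfs_port/user/palo/data/input/k1=value/dir/k1=value1/001.csv");
        System.out.println(cols); // prints {k1=value1}
    }
}
```

Checking the returned map against the user's columns_from_path list would then give the "partitioned column not found" error described above.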