Merge Into Performance to Iceberg Table #3607
The match expressions are not pushed down and are only applied to the result of the "ON" join clause. This means any predicates you want pushed down need to occur in the "ON" clause. So in your case you can just add the "part_date_col" predicate to the "ON" restriction and it should push down as you expect.
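A minimal sketch of that suggestion, reusing the table and column names from the sample query later in this thread and assuming a SparkSession named `spark` with the `source` temp view registered. The date range is only an illustration, and this is equivalent to the original query only if rows that match on the join keys always fall inside that range.

```python
# Hedged sketch: same MERGE as in this thread, but with the partition-column
# predicate moved into the ON clause so it can be pushed down to the target scan.
spark.sql("""
    MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab target
    USING source
    ON  target.col1 = source.col1
    AND target.col2 = source.col2
    AND target.col3 = source.col3
    AND target.part_date_col BETWEEN '2021-01-01' AND '2021-01-16'
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```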
@RussellSpitzer is it possible for you to elaborate a bit more on what you mean by pushdown? Also, based on your suggestion, should the merge query look like this?
Btw, how does this work with hidden partitions?
It works with hidden partitions as well. The MERGE INTO is performed by first doing a join of Target vs Source where (ON CLAUSE) holds. Matched clauses are then applied to specific rows in the result of this join. Because they are not applied universally, the predicates inside the match clauses cannot be pushed down to "source" or "target". This is why any pushdown clauses must be in the "ON". This works with hidden partitioning as well, just as it would in a normal query. If the predicate is on a column that has been partitioned, we transform the predicate into the value that was used in partitioning. Certain predicates cannot be transformed though and require a full scan. For example if you say … But if the partitioning was …
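To make the hidden-partitioning point concrete, here is a hedged, hypothetical example; the table name, columns, and transforms below are invented for illustration, not taken from this issue. With days(ts) partitioning, a plain timestamp predicate can be rewritten into day values and pruned; with bucket(16, id), an equality on id can be pruned but a range on id cannot, so it falls back to scanning every bucket. Assuming a SparkSession named `spark` with an Iceberg catalog:

```python
# Hypothetical table with hidden partitioning on a day transform and a bucket transform.
spark.sql("""
    CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP, payload STRING)
    USING iceberg
    PARTITIONED BY (days(ts), bucket(16, id))
""")

# ts is a regular column; Iceberg rewrites this range into day values,
# so only the matching daily partitions are scanned.
spark.sql("SELECT * FROM demo.db.events WHERE ts >= TIMESTAMP '2021-01-01 00:00:00'")

# Equality on id maps to a single bucket, so bucket pruning applies.
spark.sql("SELECT * FROM demo.db.events WHERE id = 42")

# A range on id cannot be mapped onto buckets, so every bucket must be scanned.
spark.sql("SELECT * FROM demo.db.events WHERE id > 42")
```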
Thanks @RussellSpitzer, this helped. Will close this ticket.
Hi @RussellSpitzer, sorry to jump on this thread, but I have a question regarding your last message, if you could elaborate, as I think I'm missing a step:
Surely, …
Thanks!
I think I may have misled you by oversimplifying. The user here still only writes queries using their exact restrictions; Iceberg then uses those restrictions to create restrictions which match the partitioning. For example, Iceberg knows a specific timestamp can only occur in a certain day and it can use that information to limit the files read. Iceberg doesn't disregard the original predicate, that stays with the execution engine for actually evaluating rows, but Iceberg can still use this timestamp for partition pruning and file evaluation. For example, say you are looking for …

First thing we do is look at our manifest_list file, see docs in the spec. We load up the partition spec for the given spec ID and transform the original predicate into one that matches that spec. If our …

With this new transform we evaluate all the … Once we have a list of all the possible manifest files that may have hits, we play this game again. Now we check against …

Here we can do two steps of evaluation for each individual data file. First we can use the transformed predicate (…

The scan then consists of all data files which we know may have our given row; this is transformed into a set of tasks for whatever execution engine is in use and evaluated. The execution engine then will use its own logic to filter individual rows with the original predicates (YMMV based on engine-specific implementations).

So what happens if we cannot transform a predicate into the partition spec? Or what if a data file was inserted into the table when it was unpartitioned? In both of these cases we default to "this file may contain the row we are looking for" and return it to the engine. For example, suppose you have a predicate …

My big TLDR here is: as a user you query on normal columns, and Iceberg attempts to transform your predicates into ones that match the partitioning of the files within the table to prune out files. When Iceberg cannot transform the predicates it simply assumes there may be a match and returns those files to the execution engine, which does the actual row-level filtering.
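A rough, self-contained Python sketch of the metadata walk described above. This is not Iceberg's API; the names and structures are invented, and it models only the simple case of one timestamp column, days(ts) partitioning, and a BETWEEN-style predicate:

```python
# Simplified model of the two-level pruning described above (NOT Iceberg's API).
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DataFile:
    path: str
    partition_day: Optional[str]   # value of days(ts) for this file, or None if unpartitioned
    min_ts: datetime               # per-file lower/upper bounds for the ts column
    max_ts: datetime

@dataclass
class Manifest:
    min_day: Optional[str]         # partition summary across all files in this manifest
    max_day: Optional[str]
    files: list

def to_day(ts: datetime) -> str:
    # Stand-in for Iceberg's days() partition transform.
    return ts.strftime("%Y-%m-%d")

def plan_files(manifest_list, lower_ts, upper_ts):
    """Plan a scan for the predicate: lower_ts <= ts <= upper_ts."""
    # Transform the row-level predicate into one on the partition spec.
    lo_day, hi_day = to_day(lower_ts), to_day(upper_ts)
    candidates = []
    for m in manifest_list:
        # Step 1: prune whole manifests using the partition summaries in the manifest list.
        if m.min_day is not None and (m.max_day < lo_day or m.min_day > hi_day):
            continue
        for f in m.files:
            # Step 2a: prune individual data files by their partition value.
            if f.partition_day is not None and not (lo_day <= f.partition_day <= hi_day):
                continue
            # Step 2b: prune by per-file column stats using the original predicate.
            if f.max_ts < lower_ts or f.min_ts > upper_ts:
                continue
            # Otherwise the file "may contain the row we are looking for": keep it.
            candidates.append(f.path)
    # The execution engine re-applies the original predicate row by row on these files.
    return candidates
```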
Hi, sorry to jump in but this thread is gold for understanding how partition filters work with … I have one question regarding bucketed columns. It is clear in this thread that we should use … Can't we inject just the bucketing info into the ON clause with the already transformed "column"? For example, in this table we could inject, if possible, 8 values and not thousands (I suppose the SQL query has some size limits in Spark and we may hit those).
I know that I could build 'n' …
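For reference, a hedged sketch of the brute-force alternative (not from this thread, and reusing this thread's table and column names only for illustration): when the set of join keys in the source is small, you can collect it and add an IN predicate on the raw column to the ON clause. Iceberg can project equality/IN predicates on a bucket-partitioned column onto bucket values, so file pruning still happens without computing bucket numbers by hand, at the cost of a potentially large SQL statement.

```python
# Hedged sketch: inject the source's distinct join keys into the ON clause.
# Caveat: with thousands of keys the generated SQL gets large, as noted above.
keys = [row.col1 for row in spark.table("source").select("col1").distinct().collect()]
in_list = ", ".join(f"'{k}'" for k in keys)   # assumes string keys; quote/escape as needed

spark.sql(f"""
    MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab target
    USING source
    ON target.col1 = source.col1 AND target.col1 IN ({in_list})
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```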
Hi @RussellSpitzer, does it mean that in the following case …
where … Can I expect almost the same performance as for …
In the above example you would only be able to get dynamic pushdown, so it depends on the partitioning. Column = Column means there is no static pushdown, only runtime filtering.
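To make that distinction concrete, a hedged illustration reusing column names from this thread (only the ON clauses are shown):

```python
# No static pushdown: both sides are columns, so the planner has no literal value
# to prune partitions with at planning time; only runtime filtering driven by the
# source's join keys can narrow the target scan.
on_runtime_only = "target.part_date_col = source.part_date_col AND target.col1 = source.col1"

# Static pushdown possible: a literal range on the target's partition column can be
# evaluated against table metadata before the join runs.
on_static = ("target.col1 = source.col1 "
             "AND target.part_date_col BETWEEN '2021-01-01' AND '2021-01-16'")
```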
Hi Team,
Can you suggest some way to tune a slow-running MERGE query? It is taking ~20 minutes to upsert 1.5 million records.
Sample MERGE query:
df.createOrReplaceTempView("source")
df.cache()
MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab target
USING (SELECT * FROM source)
ON target.col1 = source.col1 AND target.col2 = source.col2 AND target.col3 = source.col3
WHEN MATCHED AND part_date_col between '2021-01-01' and '2021-01-16' THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
The source dataset is a temporary view with 1.5 million records containing data between '2021-01-01' and '2021-01-16'.
The target Iceberg table is partitioned by day and has 60 partitions. The source dataset will only upsert a few trailing partitions of the target table, but from the Spark UI it looks like the merge is touching all the partitions instead of just the partitions between '2021-01-01' and '2021-01-16'.
Let me know if you need any further details.
Regards
Joyan