Iceberg has support for hidden partitioning: data written to a partitioned table is split up based on the column and the applied transform. A partitioned table can tremendously speed up queries, since each ManifestList keeps a summary of the partition ranges; at a high level, files can be pruned based on those ranges.

It is important to note that partitions can evolve over time with Iceberg. For example, when a table is updated from daily to hourly partitions due to increased data volume, new data will be written using the new partitioning strategy. The old data will not change and will remain partitioned on a daily basis until it is rewritten.
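As a sketch of what spec evolution means for planning: files written under different specs coexist in the same table and must be grouped by the spec they were written with, since each spec needs its own evaluator. This is a toy model with made-up dict fields, not the PyIceberg API:

```python
from collections import defaultdict

# Toy model of data files written under evolving partition specs
# (illustrative only; the field names here are invented).
data_files = [
    {"path": "f1.parquet", "spec_id": 0, "partition": {"day": "2019-01-01"}},
    {"path": "f2.parquet", "spec_id": 0, "partition": {"day": "2019-01-02"}},
    # after evolving from daily to hourly partitioning, new files use spec 1
    {"path": "f3.parquet", "spec_id": 1, "partition": {"hour": "2019-01-03-00"}},
]

# During planning, group files by the spec they were written with,
# since each spec needs its own partition evaluator.
files_by_spec = defaultdict(list)
for f in data_files:
    files_by_spec[f["spec_id"]].append(f["path"])

assert dict(files_by_spec) == {
    0: ["f1.parquet", "f2.parquet"],
    1: ["f3.parquet"],
}
```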
Pruning of manifest files

Manifest list:

Since there can be many different partition strategies over time, the evaluators are constructed lazily in PyIceberg. Based on the 502: partition_spec_id field from the manifest list, we look up the partition spec and construct an evaluator:
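The lazy, per-spec construction can be sketched as a cache keyed by spec id, so an evaluator is only built the first time a given spec is encountered. Here evaluator_for_spec and PARTITION_SPECS are hypothetical placeholders, not PyIceberg internals:

```python
from functools import lru_cache

# Hypothetical stand-in for the table's known partition specs.
PARTITION_SPECS = {0: "day(ts)", 1: "hour(ts)"}

calls = []  # records which spec ids actually triggered construction

@lru_cache(maxsize=None)
def evaluator_for_spec(spec_id: int):
    # Building an evaluator can be expensive, so it happens at most once
    # per spec id; subsequent lookups hit the cache.
    calls.append(spec_id)
    spec = PARTITION_SPECS[spec_id]
    return lambda partition: f"evaluate {partition} against {spec}"

evaluator_for_spec(0)
evaluator_for_spec(0)  # cached: the spec-0 evaluator is built only once
assert calls == [0]
```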
This requires the original schema, since the filter references the original fields. Consider the following schema:
```
schema {
  1: id int
}
```
For example, take the filter id = 10 on a table with bucket(id) partitioning. Then the spec is:
```
[
  1000: id: bucket[22](1)
]
```
The PartitionSpec has a single field that starts at field-id 1000. It references field-id 1, which corresponds to id. This way the filtering is decoupled from the naming.
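To make the shape concrete, here is a minimal sketch of a partition field that buckets schema field-id 1, looked up by source id rather than by name. The classes are made-up stand-ins, not PyIceberg's, and zlib.crc32 stands in for the 32-bit Murmur3 hash Iceberg actually uses, so the bucket numbers will not match Iceberg's:

```python
import struct
import zlib
from dataclasses import dataclass

@dataclass(frozen=True)
class PartitionField:
    field_id: int   # id of the partition field itself; numbering starts at 1000
    source_id: int  # schema field-id the partition value is derived from
    transform: str
    name: str

# The spec above: partition field 1000 buckets schema field 1 ("id") into 22 buckets.
spec = [PartitionField(field_id=1000, source_id=1, transform="bucket[22]", name="id")]

def fields_by_source_id(source_id: int) -> list:
    # Filters are bound to schema field-ids, so partition fields are found
    # via source_id rather than via the (renameable) column name.
    return [f for f in spec if f.source_id == source_id]

def toy_bucket(value: int, num_buckets: int) -> int:
    # Iceberg hashes the value's byte encoding with 32-bit Murmur3, masks
    # the sign bit, and takes the result modulo num_buckets; crc32 is only
    # a deterministic stand-in here.
    h = zlib.crc32(struct.pack("<q", value))
    return (h & 0x7FFFFFFF) % num_buckets

assert [f.field_id for f in fields_by_source_id(1)] == [1000]
assert 0 <= toy_bucket(10, 22) < 22
```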
The most important part is where the filtering happens:
```python
class InclusiveProjection(ProjectionEvaluator):
    def visit_bound_predicate(self, predicate: BoundPredicate[Any]) -> BooleanExpression:
        parts = self.spec.fields_by_source_id(predicate.term.ref().field.field_id)

        result: BooleanExpression = AlwaysTrue()
        for part in parts:
            # consider (d = 2019-01-01) with bucket(7, d) and bucket(5, d)
            # projections: b1 = bucket(7, '2019-01-01') = 5, b2 = bucket(5, '2019-01-01') = 0
            # any value where b1 != 5 or any value where b2 != 0 cannot be the '2019-01-01'
            #
            # similarly, if partitioning by day(ts) and hour(ts), the more restrictive
            # projection should be used. ts = 2019-01-01T01:00:00 produces day=2019-01-01 and
            # hour=2019-01-01-01. the value will be in 2019-01-01-01 and not in 2019-01-01-02.
            incl_projection = part.transform.project(name=part.name, pred=predicate)
            if incl_projection is not None:
                result = And(result, incl_projection)

        return result
```
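The fold over the projections can be illustrated with toy expression nodes. A row can only match if every projection holds, hence the And-chain seeded with AlwaysTrue. AlwaysTrue and And here are minimal stand-ins for PyIceberg's expression classes:

```python
from dataclasses import dataclass

class AlwaysTrue:
    # Neutral element of the conjunction: prunes nothing on its own.
    def __repr__(self):
        return "AlwaysTrue()"

@dataclass
class And:
    left: object
    right: object

# Two bucket partitions over the same source column (the bucket(7, d) /
# bucket(5, d) example from the comment above) project to two predicates.
projections = ["bucket7_d = 5", "bucket5_d = 0"]

result = AlwaysTrue()
for p in projections:
    result = And(result, p)

# result is And(And(AlwaysTrue(), "bucket7_d = 5"), "bucket5_d = 0")
assert result.right == "bucket5_d = 0"
assert result.left.right == "bucket7_d = 5"
assert isinstance(result.left.left, AlwaysTrue)
```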
Here we start with AlwaysTrue(), so a manifest is included until there is evidence that it is not relevant. The projection returns a callable that can be bound to the row_filter:
```python
def project(self, expr: BooleanExpression) -> BooleanExpression:
    # projections assume that there are no NOT nodes in the expression tree. to ensure that this
    # is the case, the expression is rewritten to push all NOT nodes down to the expression
    # leaf nodes.
    # this is necessary to ensure that the default expression returned when a predicate can't be
    # projected is correct.
    return visit(bind(self.schema, rewrite_not(expr), self.case_sensitive), self)
```
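What pushing NOT down to the leaves looks like can be sketched on a toy expression tree of nested tuples. Note that the real rewrite also eliminates NOT over a leaf by negating the predicate itself (e.g. not(a = b) becomes a != b); this sketch leaves leaf negations alone:

```python
# Toy expression tree: ("not", x), ("and", l, r), ("or", l, r), ("leaf", name).
def rewrite_not(expr):
    op = expr[0]
    if op == "not":
        inner = expr[1]
        if inner[0] == "not":            # not(not(x)) -> x
            return rewrite_not(inner[1])
        if inner[0] == "and":            # De Morgan: not(a and b) -> not(a) or not(b)
            return ("or",
                    rewrite_not(("not", inner[1])),
                    rewrite_not(("not", inner[2])))
        if inner[0] == "or":             # De Morgan: not(a or b) -> not(a) and not(b)
            return ("and",
                    rewrite_not(("not", inner[1])),
                    rewrite_not(("not", inner[2])))
        return expr                      # NOT over a leaf: left as-is in this sketch
    if op in ("and", "or"):
        return (op, rewrite_not(expr[1]), rewrite_not(expr[2]))
    return expr

assert rewrite_not(("not", ("and", ("leaf", "a"), ("leaf", "b")))) == \
    ("or", ("not", ("leaf", "a")), ("not", ("leaf", "b")))
```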
Pruning of the manifests
Manifest file:
When the manifest is considered relevant based on the range, the same partition projection is re-used. We create a schema from the partition spec, create a struct from the filter (struct<int>), and build an evaluator on top of it to see whether there are rows that might match, based on the partition specification.
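The range check itself can be sketched like this: each manifest carries lower/upper bounds per partition field, and a manifest is only pruned when the predicate provably cannot overlap its range. The dict structure is illustrative only; PyIceberg stores these bounds in the partition summaries:

```python
# Toy manifest entries with per-partition-field bounds (invented structure).
manifests = [
    {"path": "m1.avro", "lower": 0, "upper": 9},
    {"path": "m2.avro", "lower": 10, "upper": 19},
    {"path": "m3.avro", "lower": 20, "upper": 29},
]

def might_contain(manifest, value):
    # Inclusive check: only prune when the value is provably outside the
    # range; when in doubt, the manifest must be kept.
    return manifest["lower"] <= value <= manifest["upper"]

# For the filter id = 10, only the manifest whose range covers 10 survives.
kept = [m["path"] for m in manifests if might_contain(m, 10)]
assert kept == ["m2.avro"]
```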
I hope this helps. I'm happy to provide more context, and when I get the time, I'll see if I can brush up my C++ skills.