Skip to content

Conversation

@FangYongs
Copy link
Contributor

@FangYongs FangYongs commented Nov 14, 2022

Currently, the table store uses the latest schema id to read the data file meta. When the schema evolves, it will cause errors, for example:

  1. the schema of underlying data is [1->a, 2->b, 3->c, 4->d] and schema id is 0, where 1/2/3/4 is field id and a/b/c/d is field name
  2. After schema evolution, schema id is 1, and the new schema is [1->a, 3->c, 5->f, 6->b, 7->g]

When table store reads the field stats from data file meta, it should mapping schema 1 to 0 according to their field ids.

This PR will read and parse the data according to the schema id in the meta file when reading the data file meta, and create index mapping from the table schema and the meta schema, so that the table store can read the correct file meta data through its latest schema.

The main codes are as follows:

  1. Added SchemaFieldTypeExtractor to extract key fields for ChangelogValueCountFileStoreTable and ChangelogWithKeyFileStoreTable
  2. Added SchemaEvolutionUtil to create index mapping from table schema to meta file schema
  3. Updated FieldStatsArraySerializer to read field stats with given index mapping

The main tests include:

  1. Added SchemaEvolutionUtilTest to create index mapping between two schemas.
  2. Added FieldStatsArraSerializerTest to read meta from table schema
  3. Added AppendOnlyTableFileMetaFilterTest, ChangelogValueCountFileMetaFilterTest and ChangelogWithKeyFileMetaFilterTest to filter old field, new field, partition field and primary key in data file meta in table scan.

@FangYongs
Copy link
Contributor Author

Hi @JingsongLi @tsreaper Can you help to review this PR when you're free? THX

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks very nice! I will take a deep look~

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments

@FangYongs
Copy link
Contributor Author

Thanks @JingsongLi I have updated the codes

@JingsongLi
Copy link
Contributor

Thanks for the update. We can improve Extractor a little bit.

@FangYongs
Copy link
Contributor Author

Thanks for the update. We can improve Extractor a little bit.

Done

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@JingsongLi JingsongLi merged commit 157966f into apache:master Nov 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants