Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec: Inconsistency around files_count #5338

Merged
merged 1 commit into from
Oct 5, 2023

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Jul 22, 2022

When building the Manifest mappers for Python, @rdblue noticed that the added_data_files_count should be added_files_count according to the spec.

However, this field is written in Java as added_data_files_count to Avro by the Java implementation:

Types.NestedField ADDED_FILES_COUNT = optional(504, "added_data_files_count", Types.IntegerType.get(),
"Added entry count");
Types.NestedField EXISTING_FILES_COUNT = optional(505, "existing_data_files_count", Types.IntegerType.get(),
"Existing entry count");
Types.NestedField DELETED_FILES_COUNT = optional(506, "deleted_data_files_count", Types.IntegerType.get(),
"Deleted entry count");

Luckily this doesn't affect the reading/writing because it is position based.

However, it is confusing. I think we should resolve this. We could either do this by changing the Java impl, which probably works, but we could also change the spec. I know that this isn't something lightweight, but we could take it into consideration.

I think we should also update the references in the code to {added,existing,deleted}_data_files_count to make everything consistent and avoid confusion in the future.

Closes #8684

.palantir/revapi.yml Outdated Show resolved Hide resolved
@Fokko Fokko force-pushed the fd-inconsitency-between-spec-and-impl branch 2 times, most recently from 978fa31 to 1667773 Compare August 1, 2022 07:13
@rdblue
Copy link
Contributor

rdblue commented Aug 7, 2022

@Fokko, sorry, but I think I was wrong in my initial review. I think instead of renaming these, we actually want to rename the fields that we write so that they match the spec. Because we use the same column for both DataFile and DeleteFile to count the number of ADDED rows, we should use a neutral name so added_files_count is correct.

Looks like the problem here is that the original added_data_files_count is used instead of added_files_count, not that the variable names don't match.

This is what I meant by this comment:

I think we can also update the field names. Those won't break anything because the only names that matter are the ones in the read schema.

I'm not sure why I thought the constant renames looked good though, since that would rename the constants to match the out-of-date manifest field names, which we should update instead.

@Fokko Fokko force-pushed the fd-inconsitency-between-spec-and-impl branch 3 times, most recently from 3dd4b02 to 95e8c27 Compare September 11, 2022 19:49
@Fokko Fokko closed this Nov 17, 2022
@Fokko Fokko force-pushed the fd-inconsitency-between-spec-and-impl branch from 95e8c27 to b3eaf0c Compare November 17, 2022 13:27
@github-actions github-actions bot removed the API label Nov 17, 2022
@Fokko Fokko reopened this Nov 17, 2022
@github-actions github-actions bot added the API label Nov 17, 2022
@Fokko Fokko force-pushed the fd-inconsitency-between-spec-and-impl branch from 922bb49 to 4c65ccb Compare November 17, 2022 16:03
@Fokko Fokko added the Specification Issues that may introduce spec changes. label Oct 5, 2023
@Fokko
Copy link
Contributor Author

Fokko commented Oct 5, 2023

Thanks for the review @rdblue 🙌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API core Specification Issues that may introduce spec changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spec disagrees with implementation on field names
4 participants