Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16651 : [Python] Casting Table to new schema ignores nullability of fields #14048

Merged

Conversation

kshitij12345
Copy link
Contributor

@kshitij12345 kshitij12345 commented Sep 5, 2022

table = pa.table({'a': [None, 1], 'b': [None, True]})
new_schema = pa.schema([pa.field("a", "int64", nullable=True), pa.field("b", "bool", nullable=False)])
casted = table.cast(new_schema)

Now leads to

RuntimeError: Casting field 'b' with null values to non-nullable

@github-actions
Copy link

github-actions bot commented Sep 5, 2022

@github-actions
Copy link

github-actions bot commented Sep 5, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@kshitij12345
Copy link
Contributor Author

cc: @AlenkaF @jorisvandenbossche

Copy link
Member

@AlenkaF AlenkaF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for solving this @kshitij12345, looks great to me!

@jorisvandenbossche
Copy link
Member

We should probably check if a field actually has nulls, instead of only the nullability flag of the field?
Meaning: I think it should still be possible to cast a nullable field to non-nullable, if that field has no nulls in it?

@kshitij12345
Copy link
Contributor Author

Wouldn't that be much slower if we have to check if the field has null in it's data or is there meta-data stored around that?

@jorisvandenbossche
Copy link
Member

Yeah, that can indeed be slower, so that is certainly a trade-off to make.

An array can store an optional null_count, which is cached once known, so getting the null count _can_be fast (depending on whether it is already available).

@kshitij12345
Copy link
Contributor Author

@jorisvandenbossche Have updated to check the null_count. PTAL :)

@kshitij12345 kshitij12345 marked this pull request as ready for review September 6, 2022 16:27
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update, looking good! Just a small optimization comment

@@ -3401,6 +3401,9 @@ cdef class Table(_PandasConvertible):
.format(self.schema.names, target_schema.names))

for column, field in zip(self.itercolumns(), target_schema):
if column.null_count > 0 and not field.nullable:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if column.null_count > 0 and not field.nullable:
if not field.nullable and column.null_count > 0:

Switching the order will avoid checking the null_count (potentially expensive) if the field is nullable (which will be the most common case, since this is the default)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! Thanks!

@@ -3401,6 +3401,9 @@ cdef class Table(_PandasConvertible):
.format(self.schema.names, target_schema.names))

for column, field in zip(self.itercolumns(), target_schema):
if column.null_count > 0 and not field.nullable:
raise RuntimeError("Casting field {!r} with null values to non-nullable"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be a ValueError

@kshitij12345
Copy link
Contributor Author

@jorisvandenbossche Have addressed the review. Thanks! PTAL :)

@jorisvandenbossche jorisvandenbossche merged commit df121b7 into apache:master Sep 8, 2022
@kshitij12345 kshitij12345 deleted the dev/strict/table-casting branch September 8, 2022 12:45
@ursabot
Copy link

ursabot commented Sep 8, 2022

Benchmark runs are scheduled for baseline = 43670af and contender = df121b7. df121b7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.37% ⬆️0.1%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] df121b7f ec2-t3-xlarge-us-east-2
[Failed] df121b7f test-mac-arm
[Failed] df121b7f ursa-i9-9960x
[Finished] df121b7f ursa-thinkcentre-m75q
[Finished] 43670af0 ec2-t3-xlarge-us-east-2
[Failed] 43670af0 test-mac-arm
[Failed] 43670af0 ursa-i9-9960x
[Finished] 43670af0 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
…y of fields (apache#14048)

```python
table = pa.table({'a': [None, 1], 'b': [None, True]})
new_schema = pa.schema([pa.field("a", "int64", nullable=True), pa.field("b", "bool", nullable=False)])
casted = table.cast(new_schema)

```

Now leads to
```
RuntimeError: Casting field 'b' with null values to non-nullable
```

Authored-by: kshitij12345 <kshitijkalambarkar@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022
…y of fields (apache#14048)

```python
table = pa.table({'a': [None, 1], 'b': [None, True]})
new_schema = pa.schema([pa.field("a", "int64", nullable=True), pa.field("b", "bool", nullable=False)])
casted = table.cast(new_schema)

```

Now leads to
```
RuntimeError: Casting field 'b' with null values to non-nullable
```

Authored-by: kshitij12345 <kshitijkalambarkar@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants