Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Python] Basic conversion of RecordBatch to Arrow Tensor - add option to cast NULL to NaN #40061

Closed
Tracked by #40058
AlenkaF opened this issue Feb 13, 2024 · 1 comment

Comments

@AlenkaF
Copy link
Member

AlenkaF commented Feb 13, 2024

Describe the enhancement requested

This issue is a part of #40058 adds option to cast NULL to NaN mimicking the cast from ConvertIntegerWithNulls.

Component(s)

C++, Python

jorisvandenbossche added a commit that referenced this issue Mar 29, 2024
…or - add option to cast NULL to NaN (#40803)

### Rationale for this change

The conversion from `RecordBatch` to `Tensor` class exists but it doesn't support record batches with validity bitmaps. This PR adds support for an option to convert null values to NaN.

### What changes are included in this PR?

This PR adds a `nul_to_nan` option in `RecordBatch::ToTensor` so that null values are converted to NaN in the resulting `Tensor`. This for example works:

```python
>>> import pyarrow as pa
>>> batch = pa.record_batch(
...     [
...         pa.array([1, 2, 3, 4, None], type=pa.int32()),
...         pa.array([10, 20, 30, 40, None], type=pa.float32()),
...     ], names = ["a", "b"]
... )

>>> batch
pyarrow.RecordBatch
a: int32
b: float
----
a: [1,2,3,4,null]
b: [10,20,30,40,null]

>>> batch.to_tensor(null_to_nan=True)
<pyarrow.Tensor>
type: double
shape: (5, 2)
strides: (8, 40)

>>> batch.to_tensor(null_to_nan=True).to_numpy()
array([[ 1., 10.],
       [ 2., 20.],
       [ 3., 30.],
       [ 4., 40.],
       [nan, nan]])
```
but default would raise:

```python
>>> batch.to_tensor()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3421, in pyarrow.lib.RecordBatch.to_tensor
    a: int32
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
    raise convert_status(status)
pyarrow.lib.ArrowTypeError: Can only convert a RecordBatch with no nulls. Set null_to_nan to true to convert nulls to nan
```

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #40061

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche added this to the 16.0.0 milestone Mar 29, 2024
@jorisvandenbossche
Copy link
Member

Issue resolved by pull request 40803
#40803

tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
…w Tensor - add option to cast NULL to NaN (apache#40803)

### Rationale for this change

The conversion from `RecordBatch` to `Tensor` class exists but it doesn't support record batches with validity bitmaps. This PR adds support for an option to convert null values to NaN.

### What changes are included in this PR?

This PR adds a `nul_to_nan` option in `RecordBatch::ToTensor` so that null values are converted to NaN in the resulting `Tensor`. This for example works:

```python
>>> import pyarrow as pa
>>> batch = pa.record_batch(
...     [
...         pa.array([1, 2, 3, 4, None], type=pa.int32()),
...         pa.array([10, 20, 30, 40, None], type=pa.float32()),
...     ], names = ["a", "b"]
... )

>>> batch
pyarrow.RecordBatch
a: int32
b: float
----
a: [1,2,3,4,null]
b: [10,20,30,40,null]

>>> batch.to_tensor(null_to_nan=True)
<pyarrow.Tensor>
type: double
shape: (5, 2)
strides: (8, 40)

>>> batch.to_tensor(null_to_nan=True).to_numpy()
array([[ 1., 10.],
       [ 2., 20.],
       [ 3., 30.],
       [ 4., 40.],
       [nan, nan]])
```
but default would raise:

```python
>>> batch.to_tensor()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3421, in pyarrow.lib.RecordBatch.to_tensor
    a: int32
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
    raise convert_status(status)
pyarrow.lib.ArrowTypeError: Can only convert a RecordBatch with no nulls. Set null_to_nan to true to convert nulls to nan
```

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#40061

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
tolleybot pushed a commit to tmct/arrow that referenced this issue May 4, 2024
…w Tensor - add option to cast NULL to NaN (apache#40803)

### Rationale for this change

The conversion from `RecordBatch` to `Tensor` class exists but it doesn't support record batches with validity bitmaps. This PR adds support for an option to convert null values to NaN.

### What changes are included in this PR?

This PR adds a `nul_to_nan` option in `RecordBatch::ToTensor` so that null values are converted to NaN in the resulting `Tensor`. This for example works:

```python
>>> import pyarrow as pa
>>> batch = pa.record_batch(
...     [
...         pa.array([1, 2, 3, 4, None], type=pa.int32()),
...         pa.array([10, 20, 30, 40, None], type=pa.float32()),
...     ], names = ["a", "b"]
... )

>>> batch
pyarrow.RecordBatch
a: int32
b: float
----
a: [1,2,3,4,null]
b: [10,20,30,40,null]

>>> batch.to_tensor(null_to_nan=True)
<pyarrow.Tensor>
type: double
shape: (5, 2)
strides: (8, 40)

>>> batch.to_tensor(null_to_nan=True).to_numpy()
array([[ 1., 10.],
       [ 2., 20.],
       [ 3., 30.],
       [ 4., 40.],
       [nan, nan]])
```
but default would raise:

```python
>>> batch.to_tensor()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3421, in pyarrow.lib.RecordBatch.to_tensor
    a: int32
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
    raise convert_status(status)
pyarrow.lib.ArrowTypeError: Can only convert a RecordBatch with no nulls. Set null_to_nan to true to convert nulls to nan
```

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#40061

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
rok pushed a commit to tmct/arrow that referenced this issue May 8, 2024
…w Tensor - add option to cast NULL to NaN (apache#40803)

### Rationale for this change

The conversion from `RecordBatch` to `Tensor` class exists but it doesn't support record batches with validity bitmaps. This PR adds support for an option to convert null values to NaN.

### What changes are included in this PR?

This PR adds a `nul_to_nan` option in `RecordBatch::ToTensor` so that null values are converted to NaN in the resulting `Tensor`. This for example works:

```python
>>> import pyarrow as pa
>>> batch = pa.record_batch(
...     [
...         pa.array([1, 2, 3, 4, None], type=pa.int32()),
...         pa.array([10, 20, 30, 40, None], type=pa.float32()),
...     ], names = ["a", "b"]
... )

>>> batch
pyarrow.RecordBatch
a: int32
b: float
----
a: [1,2,3,4,null]
b: [10,20,30,40,null]

>>> batch.to_tensor(null_to_nan=True)
<pyarrow.Tensor>
type: double
shape: (5, 2)
strides: (8, 40)

>>> batch.to_tensor(null_to_nan=True).to_numpy()
array([[ 1., 10.],
       [ 2., 20.],
       [ 3., 30.],
       [ 4., 40.],
       [nan, nan]])
```
but default would raise:

```python
>>> batch.to_tensor()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3421, in pyarrow.lib.RecordBatch.to_tensor
    a: int32
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
    raise convert_status(status)
pyarrow.lib.ArrowTypeError: Can only convert a RecordBatch with no nulls. Set null_to_nan to true to convert nulls to nan
```

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#40061

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
rok pushed a commit to tmct/arrow that referenced this issue May 8, 2024
…w Tensor - add option to cast NULL to NaN (apache#40803)

### Rationale for this change

The conversion from `RecordBatch` to `Tensor` class exists but it doesn't support record batches with validity bitmaps. This PR adds support for an option to convert null values to NaN.

### What changes are included in this PR?

This PR adds a `nul_to_nan` option in `RecordBatch::ToTensor` so that null values are converted to NaN in the resulting `Tensor`. This for example works:

```python
>>> import pyarrow as pa
>>> batch = pa.record_batch(
...     [
...         pa.array([1, 2, 3, 4, None], type=pa.int32()),
...         pa.array([10, 20, 30, 40, None], type=pa.float32()),
...     ], names = ["a", "b"]
... )

>>> batch
pyarrow.RecordBatch
a: int32
b: float
----
a: [1,2,3,4,null]
b: [10,20,30,40,null]

>>> batch.to_tensor(null_to_nan=True)
<pyarrow.Tensor>
type: double
shape: (5, 2)
strides: (8, 40)

>>> batch.to_tensor(null_to_nan=True).to_numpy()
array([[ 1., 10.],
       [ 2., 20.],
       [ 3., 30.],
       [ 4., 40.],
       [nan, nan]])
```
but default would raise:

```python
>>> batch.to_tensor()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3421, in pyarrow.lib.RecordBatch.to_tensor
    a: int32
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
    raise convert_status(status)
pyarrow.lib.ArrowTypeError: Can only convert a RecordBatch with no nulls. Set null_to_nan to true to convert nulls to nan
```

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#40061

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…w Tensor - add option to cast NULL to NaN (apache#40803)

### Rationale for this change

The conversion from `RecordBatch` to `Tensor` class exists but it doesn't support record batches with validity bitmaps. This PR adds support for an option to convert null values to NaN.

### What changes are included in this PR?

This PR adds a `nul_to_nan` option in `RecordBatch::ToTensor` so that null values are converted to NaN in the resulting `Tensor`. This for example works:

```python
>>> import pyarrow as pa
>>> batch = pa.record_batch(
...     [
...         pa.array([1, 2, 3, 4, None], type=pa.int32()),
...         pa.array([10, 20, 30, 40, None], type=pa.float32()),
...     ], names = ["a", "b"]
... )

>>> batch
pyarrow.RecordBatch
a: int32
b: float
----
a: [1,2,3,4,null]
b: [10,20,30,40,null]

>>> batch.to_tensor(null_to_nan=True)
<pyarrow.Tensor>
type: double
shape: (5, 2)
strides: (8, 40)

>>> batch.to_tensor(null_to_nan=True).to_numpy()
array([[ 1., 10.],
       [ 2., 20.],
       [ 3., 30.],
       [ 4., 40.],
       [nan, nan]])
```
but default would raise:

```python
>>> batch.to_tensor()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3421, in pyarrow.lib.RecordBatch.to_tensor
    a: int32
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
    raise convert_status(status)
pyarrow.lib.ArrowTypeError: Can only convert a RecordBatch with no nulls. Set null_to_nan to true to convert nulls to nan
```

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#40061

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants