[C++][Python] Basic conversion of RecordBatch to Arrow Tensor - add support for row-major #40866

Closed
Tracked by #40058
AlenkaF opened this issue Mar 28, 2024 · 2 comments

Comments

@AlenkaF
Member

AlenkaF commented Mar 28, 2024

Describe the enhancement requested

This issue is part of #40058. It adds an option to construct a row-major Tensor from a RecordBatch; row-major is the layout most often used when working with tensors.
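For illustration only, here is a minimal pure-Python sketch of what the two layouts mean for the order of values in memory. It does not use Arrow, and the names `columns`, `row_major_flat` and `col_major_flat` are made up for this example.

```python
# Two equal-length columns, standing in for the columns of a RecordBatch.
a = [1, 2, 3]
b = [10, 20, 30]
columns = [a, b]

# Row-major: the values of one row sit next to each other in memory.
row_major_flat = [col[i] for i in range(len(a)) for col in columns]

# Column-major: the values of one column sit next to each other in memory.
col_major_flat = [value for col in columns for value in col]

print(row_major_flat)  # [1, 10, 2, 20, 3, 30]
print(col_major_flat)  # [1, 2, 3, 10, 20, 30]
```

The column-major order keeps each column's values together, which matches the way the input columns are stored, while the row-major order interleaves them.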

Component(s)

C++, Python

@jorisvandenbossche
Member

Although this is a new feature and not a bug fix, it changes the behaviour of a feature newly introduced for 16.0. I would therefore propose including it in 16.0 as well, so that we avoid making a breaking change to the new feature in the very next release.

jorisvandenbossche added a commit that referenced this issue Apr 10, 2024
…or - add support for row-major (#40867)

### Rationale for this change

The conversion from `RecordBatch` to `Tensor` now exists, but it doesn't support a row-major `Tensor` as output. This PR adds an option to construct a row-major `Tensor`.

### What changes are included in this PR?

This PR adds a `row_major` option to `RecordBatch::ToTensor` so that a row-major `Tensor` can be constructed. The default conversion will be row-major. For example, the following works:

```python
>>> import pyarrow as pa
>>> import numpy as np

>>> arr1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> arr2 = [10, 20, 30, 40, 50, 60, 70, 80, 90]
>>> batch = pa.RecordBatch.from_arrays(
...     [
...         pa.array(arr1, type=pa.uint16()),
...         pa.array(arr2, type=pa.int16()),
... 
...     ], ["a", "b"]
... )

# Row-major

>>> batch.to_tensor()
<pyarrow.Tensor>
type: int32
shape: (9, 2)
strides: (8, 4)

>>> batch.to_tensor().to_numpy().flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

# Column-major

>>> batch.to_tensor(row_major=False)
<pyarrow.Tensor>
type: int32
shape: (9, 2)
strides: (4, 36)

>>> batch.to_tensor(row_major=False).to_numpy().flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
```
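The strides in the output above can be checked by hand. The sketch below is not Arrow's implementation, and `contiguous_strides` is a hypothetical helper written only for this illustration. It computes byte strides for a densely packed array, assuming a 4-byte item size for the `int32` result type shown above.

```python
def contiguous_strides(shape, itemsize, row_major=True):
    """Byte strides for a densely packed array of the given shape."""
    strides = [0] * len(shape)
    step = itemsize
    # Row-major: the last axis varies fastest.
    # Column-major: the first axis varies fastest.
    axes = reversed(range(len(shape))) if row_major else range(len(shape))
    for axis in axes:
        strides[axis] = step
        step *= shape[axis]
    return tuple(strides)


# Shape (9, 2) with 4-byte int32 values, as in the example above.
row_major_strides = contiguous_strides((9, 2), itemsize=4)
col_major_strides = contiguous_strides((9, 2), itemsize=4, row_major=False)

print(row_major_strides)  # (8, 4)
print(col_major_strides)  # (4, 36)
```

In the row-major case, moving to the next row skips 2 values of 4 bytes each (8 bytes). In the column-major case, moving to the next column skips 9 values of 4 bytes each (36 bytes).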

### Are these changes tested?

Yes, in C++ and Python.

### Are there any user-facing changes?

No.
* GitHub Issue: #40866

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Member

Issue resolved by pull request 40867
#40867

raulcd pushed a commit that referenced this issue Apr 11, 2024
…or - add support for row-major (#40867)

vibhatha pushed a commit to vibhatha/arrow that referenced this issue Apr 15, 2024
…w Tensor - add support for row-major (apache#40867)

tolleybot pushed a commit to tmct/arrow that referenced this issue May 2, 2024
…w Tensor - add support for row-major (apache#40867)

tolleybot pushed a commit to tmct/arrow that referenced this issue May 4, 2024
…w Tensor - add support for row-major (apache#40867)

rok pushed a commit to tmct/arrow that referenced this issue May 8, 2024
…w Tensor - add support for row-major (apache#40867)

rok pushed a commit to tmct/arrow that referenced this issue May 8, 2024
…w Tensor - add support for row-major (apache#40867)

vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…w Tensor - add support for row-major (apache#40867)
