[Python] Add a mask argument to pyarrow.StructArray.from_arrays #28425

Closed
asfimport opened this issue May 7, 2021 · 6 comments

Comments

@asfimport

The Python API for creating a StructArray from a list of arrays doesn't allow passing a missing-value mask.

At the moment, the only way to create a StructArray with missing values is to call pyarrow.array with a list of tuples.

>>> pyarrow.array(
    [
        None,
        (1, "foo"),
    ],
    type=pyarrow.struct(
        [pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
    )
)
-- is_valid:
  [
    false,
    true
  ]
-- child 0 type: int64
  [
    0,
    1
  ]
-- child 1 type: string
  [
    "",
    "foo"
  ]
>>> pyarrow.StructArray.from_arrays(
    [
        [None, 1],
        [None, "foo"]
    ],
    fields=[pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
)
-- is_valid: all not null
-- child 0 type: int64
  [
    null,
    1
  ]
-- child 1 type: string
  [
    null,
    "foo"
  ]

The C++ API allows it, so it should be easy to add.

See this Stack Overflow question.

Reporter: &res / @0x26res
Assignee: Weston Pace / @westonpace

PRs and other links:

Note: This issue was originally created as ARROW-12677. Please see the migration documentation for further details.

@asfimport

&res / @0x26res:
@westonpace thanks for looking into this.

I'm not sure if it's the right place to mention that, but I now have the same issue with ListArray, and I'm wondering if it'd be worth doing the same changes there.

Here's an example where I have a list of structs, but some of the lists are null:

  • Using pyarrow.array (works, but requires turning columns into rows)

    list_of_struct = pyarrow.list_(
        pyarrow.struct([pyarrow.field("foo", pyarrow.string())])
    )
    array = pyarrow.array(
        [[("hello",), ("World",)], [], None, [None, ("foo",), ("bar",)]],
        type=list_of_struct,
    )
    print(array) 

    [
      -- is_valid: all not null
      -- child 0 type: string
        [
          "hello",
          "World"
        ],
      -- is_valid: all not null
      -- child 0 type: string
        [],
      null,
      -- is_valid:
          [
            false,
            true,
            true
          ]
      -- child 0 type: string
        [
          "",
          "foo",
          "bar"
        ]
    ]
       
  • Using ListArray.from_arrays (it's not possible to mark a list as null; it falls back to empty)

    struct_type = pyarrow.struct([pyarrow.field("foo", pyarrow.string())])
    foo = pyarrow.array(["hello", "World", None, "foo", "bar"])
    validity_mask = pyarrow.array([True, True, False, True, True])
    validity_bitmask = validity_mask.buffers()[1]
    struct_array = pyarrow.StructArray.from_buffers(
        struct_type, len(foo), [validity_bitmask], children=[foo]
    )
    list_array = pyarrow.ListArray.from_arrays(
        offsets=[0, 2, 2, 2, 5], values=struct_array
    )

    [
      -- is_valid: all not null
      -- child 0 type: string
        [
          "hello",
          "World"
        ],
      -- is_valid: all not null
      -- child 0 type: string
        [],
      -- is_valid: all not null
      -- child 0 type: string
        [],
      -- is_valid:
          [
            false,
            true,
            true
          ]
      -- child 0 type: string
        [
          null,
          "foo",
          "bar"
        ]
    ]
       
  • Using the "from_buffers" workaround (it works, but it's not a great API):

    struct_type = pyarrow.struct([pyarrow.field("foo", pyarrow.string())])
    foo_values = pyarrow.array(["hello", "World", None, "foo", "bar"])
    struct_validity_mask = pyarrow.array([True, True, False, True, True])
    struct_validity_bitmask = struct_validity_mask.buffers()[1]
    struct_array = pyarrow.StructArray.from_buffers(
        struct_type,
        len(foo_values),
        [struct_validity_bitmask],
        children=[foo_values],
    )
    
    list_validity_mask = pyarrow.array([True, True, False, True])
    list_validity_buffer = list_validity_mask.buffers()[1]
    list_offsets_buffer = pyarrow.array([0, 2, 2, 2, 5], pyarrow.int32()).buffers()[1]
    
    list_array = pyarrow.ListArray.from_buffers(
        type=pyarrow.list_(struct_type),
        length=4,
        buffers=[list_validity_buffer, list_offsets_buffer, ],
        children=[struct_array],
    )
    print(list_array)

    [
      -- is_valid: all not null
      -- child 0 type: string
        [
          "hello",
          "World"
        ],
      -- is_valid: all not null
      -- child 0 type: string
        [],
      null,
      -- is_valid:
          [
            false,
            true,
            true
          ]
      -- child 0 type: string
        [
          null,
          "foo",
          "bar"
        ]
    ]

@asfimport

&res / @0x26res:

StructArray 100% cheaply from existing arrays with from_arrays

Are you saying this because a copy of the inverted mask array is required? Or is there another overhead?

I guess if we have to copy the array, then memory_pool should be added to 'from_arrays' to be consistent, but it would make things confusing.

Personally I'm happy with using from_buffers. The API isn't great, but once you've figured it out it's fine.

 

@asfimport

Joris Van den Bossche / @jorisvandenbossche:

Are you saying this because a copy of the inverted mask array is required?

Indeed, inverting the mask is the "overhead" I was pointing at.

@asfimport

Weston Pace / @westonpace:

Using ListArray.from_arrays (it's not possible to mark a list as null; it falls back to empty)

It's odd, but you can do it by putting a null in the offsets array.  I added some examples to ListArray.from_arrays as part of the PR.

@asfimport

Weston Pace / @westonpace:
Although I'm not opposed to accepting a mask as well.  I could probably raise invalid if offsets.null_count > 0 and a mask is specified.  @jorisvandenbossche any opinion?

@asfimport

David Li / @lidavidm:
Issue resolved by pull request 10272
#10272

asfimport added this to the 5.0.0 milestone Jan 11, 2023