[Python] Test failures on 32-bit x86 #40153

Closed
mgorny opened this issue Feb 20, 2024 · 16 comments · Fixed by #40158 or #40165
Comments

@mgorny
Contributor

mgorny commented Feb 20, 2024

Describe the bug, including details regarding any error messages, version, and platform.

When running the test suite on 32-bit x86, I'm getting the following test failures:

FAILED tests/test_array.py::test_dictionary_to_numpy - TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
FAILED tests/test_io.py::test_python_file_large_seeks - assert 5 == ((2 ** 32) + 5)
FAILED tests/test_io.py::test_memory_map_large_seeks - OSError: Read out of bounds (offset = 4294967301, size = 5) in file of size 10
FAILED tests/test_pandas.py::TestConvertStructTypes::test_from_numpy_nested - AssertionError: assert 8 == 12
FAILED tests/test_schema.py::test_schema_sizeof - assert 28 > 30
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_string - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_large_string - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_string_with_missing - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_categorical - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_empty_dataframe - OverflowError: Python int too large to convert to C ssize_t
Tracebacks
============================================================== FAILURES ===============================================================
______________________________________________________ test_dictionary_to_numpy _______________________________________________________

obj = array([13.7, 11. ]), method = 'take', args = (array([0, 1, 1, 0], dtype=int64),)
kwds = {'axis': None, 'mode': 'raise', 'out': None}, bound = <built-in method take of numpy.ndarray object at 0xeaca6ad0>

    def _wrapfunc(obj, method, *args, **kwds):
        bound = getattr(obj, method, None)
        if bound is None:
            return _wrapit(obj, method, *args, **kwds)
    
        try:
>           return bound(*args, **kwds)
E           TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

args       = (array([0, 1, 1, 0], dtype=int64),)
bound      = <built-in method take of numpy.ndarray object at 0xeaca6ad0>
kwds       = {'axis': None, 'mode': 'raise', 'out': None}
method     = 'take'
obj        = array([13.7, 11. ])

/usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: TypeError

During handling of the above exception, another exception occurred:

    def test_dictionary_to_numpy():
        expected = pa.array(
            ["foo", "bar", None, "foo"]
        ).to_numpy(zero_copy_only=False)
        a = pa.DictionaryArray.from_arrays(
            pa.array([0, 1, None, 0]),
            pa.array(['foo', 'bar'])
        )
        np.testing.assert_array_equal(a.to_numpy(zero_copy_only=False),
                                      expected)
    
        with pytest.raises(pa.ArrowInvalid):
            # If this would be changed to no longer raise in the future,
            # ensure to test the actual result because, currently, to_numpy takes
            # for granted that when zero_copy_only=True there will be no nulls
            # (it's the decoding of the DictionaryArray that handles the nulls and
            # this is only activated with zero_copy_only=False)
            a.to_numpy(zero_copy_only=True)
    
        anonulls = pa.DictionaryArray.from_arrays(
            pa.array([0, 1, 1, 0]),
            pa.array(['foo', 'bar'])
        )
        expected = pa.array(
            ["foo", "bar", "bar", "foo"]
        ).to_numpy(zero_copy_only=False)
        np.testing.assert_array_equal(anonulls.to_numpy(zero_copy_only=False),
                                      expected)
    
        with pytest.raises(pa.ArrowInvalid):
            anonulls.to_numpy(zero_copy_only=True)
    
        afloat = pa.DictionaryArray.from_arrays(
            pa.array([0, 1, 1, 0]),
            pa.array([13.7, 11.0])
        )
        expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
>       np.testing.assert_array_equal(afloat.to_numpy(zero_copy_only=True),
                                      expected)

a          = <pyarrow.lib.DictionaryArray object at 0xeafe6ed0>

-- dictionary:
  [
    "foo",
    "bar"
  ]
-- indices:
  [
    0,
    1,
    null,
    0
  ]
afloat     = <pyarrow.lib.DictionaryArray object at 0xeafe6fb0>

-- dictionary:
  [
    13.7,
    11
  ]
-- indices:
  [
    0,
    1,
    1,
    0
  ]
anonulls   = <pyarrow.lib.DictionaryArray object at 0xeafe6e60>

-- dictionary:
  [
    "foo",
    "bar"
  ]
-- indices:
  [
    0,
    1,
    1,
    0
  ]
expected   = array([13.7, 11. , 11. , 13.7])

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_array.py:823: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/array.pxi:1590: in pyarrow.lib.Array.to_numpy
    ???
/usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:192: in take
    return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
        a          = array([13.7, 11. ])
        axis       = None
        indices    = array([0, 1, 1, 0], dtype=int64)
        mode       = 'raise'
        out        = None
/usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:68: in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
        args       = (array([0, 1, 1, 0], dtype=int64),)
        bound      = <built-in method take of numpy.ndarray object at 0xeaca6ad0>
        kwds       = {'axis': None, 'mode': 'raise', 'out': None}
        method     = 'take'
        obj        = array([13.7, 11. ])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

obj = array([13.7, 11. ]), method = 'take', args = (array([0, 1, 1, 0], dtype=int64),)
kwds = {'axis': None, 'mode': 'raise', 'out': None}, wrap = <built-in method __array_wrap__ of numpy.ndarray object at 0xeaca6ad0>

    def _wrapit(obj, method, *args, **kwds):
        try:
            wrap = obj.__array_wrap__
        except AttributeError:
            wrap = None
>       result = getattr(asarray(obj), method)(*args, **kwds)
E       TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

args       = (array([0, 1, 1, 0], dtype=int64),)
kwds       = {'axis': None, 'mode': 'raise', 'out': None}
method     = 'take'
obj        = array([13.7, 11. ])
wrap       = <built-in method __array_wrap__ of numpy.ndarray object at 0xeaca6ad0>

/usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:45: TypeError
____________________________________________________ test_python_file_large_seeks _____________________________________________________

    def test_python_file_large_seeks():
        def factory(filename):
            return pa.PythonFile(open(filename, 'rb'))
    
>       check_large_seeks(factory)

factory    = <function test_python_file_large_seeks.<locals>.factory at 0xe13b6de8>

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:262: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

file_factory = <function test_python_file_large_seeks.<locals>.factory at 0xe13b6de8>

    def check_large_seeks(file_factory):
        if sys.platform in ('win32', 'darwin'):
            pytest.skip("need sparse file support")
        try:
            filename = tempfile.mktemp(prefix='test_io')
            with open(filename, 'wb') as f:
                f.truncate(2 ** 32 + 10)
                f.seek(2 ** 32 + 5)
                f.write(b'mark\n')
            with file_factory(filename) as f:
>               assert f.seek(2 ** 32 + 5) == 2 ** 32 + 5
E               assert 5 == ((2 ** 32) + 5)
E                +  where 5 = <bound method NativeFile.seek of <pyarrow.PythonFile closed=False own_file=False is_seekable=True is_writable=False is_readable=True>>(((2 ** 32) + 5))
E                +    where <bound method NativeFile.seek of <pyarrow.PythonFile closed=False own_file=False is_seekable=True is_writable=False is_readable=True>> = <pyarrow.PythonFile closed=False own_file=False is_seekable=True is_writable=False is_readable=True>.seek

f          = <pyarrow.PythonFile closed=True own_file=False is_seekable=True is_writable=False is_readable=True>
file_factory = <function test_python_file_large_seeks.<locals>.factory at 0xe13b6de8>
filename   = '/var/tmp/portage/dev-python/pyarrow-15.0.0/temp/test_ioj_p6zuld'

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:49: AssertionError
_____________________________________________________ test_memory_map_large_seeks _____________________________________________________

    def test_memory_map_large_seeks():
>       check_large_seeks(pa.memory_map)


../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:1140: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:51: in check_large_seeks
    assert f.read(5) == b'mark\n'
        f          = <pyarrow.MemoryMappedFile closed=True own_file=False is_seekable=True is_writable=False is_readable=True>
        file_factory = <cyfunction memory_map at 0xf228e778>
        filename   = '/var/tmp/portage/dev-python/pyarrow-15.0.0/temp/test_iozl2wxbou'
pyarrow/io.pxi:409: in pyarrow.lib.NativeFile.read
    ???
pyarrow/error.pxi:154: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OSError: Read out of bounds (offset = 4294967301, size = 5) in file of size 10


pyarrow/error.pxi:91: OSError
____________________________________________ TestConvertStructTypes.test_from_numpy_nested ____________________________________________

self = <pyarrow.tests.test_pandas.TestConvertStructTypes object at 0xeb535d90>

    def test_from_numpy_nested(self):
        # Note: an object field inside a struct
        dt = np.dtype([('x', np.dtype([('xx', np.int8),
                                       ('yy', np.bool_)])),
                       ('y', np.int16),
                       ('z', np.object_)])
        # Note: itemsize is not a multiple of sizeof(object)
>       assert dt.itemsize == 12
E       AssertionError: assert 8 == 12
E        +  where 8 = dtype([('x', [('xx', 'i1'), ('yy', '?')]), ('y', '<i2'), ('z', 'O')]).itemsize

dt         = dtype([('x', [('xx', 'i1'), ('yy', '?')]), ('y', '<i2'), ('z', 'O')])
self       = <pyarrow.tests.test_pandas.TestConvertStructTypes object at 0xeb535d90>

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_pandas.py:2604: AssertionError
_________________________________________________________ test_schema_sizeof __________________________________________________________

    def test_schema_sizeof():
        schema = pa.schema([
            pa.field('foo', pa.int32()),
            pa.field('bar', pa.string()),
        ])
    
>       assert sys.getsizeof(schema) > 30
E       assert 28 > 30
E        +  where 28 = <built-in function getsizeof>(foo: int32\nbar: string)
E        +    where <built-in function getsizeof> = sys.getsizeof

schema     = foo: int32
bar: string

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_schema.py:684: AssertionError
____________________________________________________ test_pandas_roundtrip_string _____________________________________________________

    @pytest.mark.pandas
    def test_pandas_roundtrip_string():
        # See https://github.com/pandas-dev/pandas/issues/50554
        if Version(pd.__version__) < Version("1.6"):
            pytest.skip("Column.size() bug in pandas")
    
        arr = ["a", "", "c"]
        table = pa.table({"a": pa.array(arr)})
    
        from pandas.api.interchange import (
            from_dataframe as pandas_from_dataframe
        )
    
        pandas_df = pandas_from_dataframe(table)
>       result = pi.from_dataframe(pandas_df)

arr        = ['a', '', 'c']
pandas_df  =    a
0  a
1   
2  c
pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
table      = pyarrow.Table
a: string
----
a: [["a","","c"]]

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:159: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
        allow_copy = True
        df         =    a
0  a
1   
2  c
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136: in _from_dataframe
    batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
        allow_copy = True
        batches    = []
        chunk      = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda1c61f0>
        df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda1c61f0>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182: in protocol_df_chunk_to_pyarrow
    columns[name] = column_to_array(col, allow_copy)
        allow_copy = True
        col        = <pandas.core.interchange.column.PandasColumn object at 0xda1c65b0>
        columns    = {}
        df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda1c61f0>
        dtype      = <DtypeKind.STRING: 21>
        name       = 'a'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
    data = buffers_to_array(buffers, data_type,
        allow_copy = True
        buffers    = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 3879523528, 'device': 'CPU'}),
          (<DtypeKind.STRING: 21>, 8, 'u', '=')),
 'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1530035680, 'device': 'CPU'}),
             (<DtypeKind.INT: 0>, 64, 'l', '=')),
 'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1529980112, 'device': 'CPU'}),
              (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
        col        = <pandas.core.interchange.column.PandasColumn object at 0xda1c65b0>
        data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
    data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
        _          = (<DtypeKind.STRING: 21>, 8, 'u', '=')
        allow_copy = True
        buffers    = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 3879523528, 'device': 'CPU'}),
          (<DtypeKind.STRING: 21>, 8, 'u', '=')),
 'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1530035680, 'device': 'CPU'}),
             (<DtypeKind.INT: 0>, 64, 'l', '=')),
 'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1529980112, 'device': 'CPU'}),
              (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
        data_buff  = PandasBuffer({'bufsize': 2, 'ptr': 3879523528, 'device': 'CPU'})
        data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
        describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
        length     = 3
        offset     = 0
        offset_buff = PandasBuffer({'bufsize': 32, 'ptr': 1530035680, 'device': 'CPU'})
        offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
        validity_buff = PandasBuffer({'bufsize': 3, 'ptr': 1529980112, 'device': 'CPU'})
        validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OverflowError: Python int too large to convert to C ssize_t


pyarrow/io.pxi:1990: OverflowError
_________________________________________________ test_pandas_roundtrip_large_string __________________________________________________

    @pytest.mark.pandas
    def test_pandas_roundtrip_large_string():
        # See https://github.com/pandas-dev/pandas/issues/50554
        if Version(pd.__version__) < Version("1.6"):
            pytest.skip("Column.size() bug in pandas")
    
        arr = ["a", "", "c"]
        table = pa.table({"a_large": pa.array(arr, type=pa.large_string())})
    
        from pandas.api.interchange import (
            from_dataframe as pandas_from_dataframe
        )
    
        if Version(pd.__version__) >= Version("2.0.1"):
            pandas_df = pandas_from_dataframe(table)
>           result = pi.from_dataframe(pandas_df)

arr        = ['a', '', 'c']
pandas_df  =   a_large
0       a
1        
2       c
pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
table      = pyarrow.Table
a_large: large_string
----
a_large: [["a","","c"]]

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:189: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
        allow_copy = True
        df         =   a_large
0       a
1        
2       c
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136: in _from_dataframe
    batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
        allow_copy = True
        batches    = []
        chunk      = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda103a10>
        df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda103a10>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182: in protocol_df_chunk_to_pyarrow
    columns[name] = column_to_array(col, allow_copy)
        allow_copy = True
        col        = <pandas.core.interchange.column.PandasColumn object at 0xda1033d0>
        columns    = {}
        df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda103a10>
        dtype      = <DtypeKind.STRING: 21>
        name       = 'a_large'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
    data = buffers_to_array(buffers, data_type,
        allow_copy = True
        buffers    = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 3879522800, 'device': 'CPU'}),
          (<DtypeKind.STRING: 21>, 8, 'u', '=')),
 'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1480303312, 'device': 'CPU'}),
             (<DtypeKind.INT: 0>, 64, 'l', '=')),
 'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1478277616, 'device': 'CPU'}),
              (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
        col        = <pandas.core.interchange.column.PandasColumn object at 0xda1033d0>
        data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
    data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
        _          = (<DtypeKind.STRING: 21>, 8, 'u', '=')
        allow_copy = True
        buffers    = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 3879522800, 'device': 'CPU'}),
          (<DtypeKind.STRING: 21>, 8, 'u', '=')),
 'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1480303312, 'device': 'CPU'}),
             (<DtypeKind.INT: 0>, 64, 'l', '=')),
 'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1478277616, 'device': 'CPU'}),
              (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
        data_buff  = PandasBuffer({'bufsize': 2, 'ptr': 3879522800, 'device': 'CPU'})
        data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
        describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
        length     = 3
        offset     = 0
        offset_buff = PandasBuffer({'bufsize': 32, 'ptr': 1480303312, 'device': 'CPU'})
        offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
        validity_buff = PandasBuffer({'bufsize': 3, 'ptr': 1478277616, 'device': 'CPU'})
        validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OverflowError: Python int too large to convert to C ssize_t


pyarrow/io.pxi:1990: OverflowError
______________________________________________ test_pandas_roundtrip_string_with_missing ______________________________________________

    @pytest.mark.pandas
    def test_pandas_roundtrip_string_with_missing():
        # See https://github.com/pandas-dev/pandas/issues/50554
        if Version(pd.__version__) < Version("1.6"):
            pytest.skip("Column.size() bug in pandas")
    
        arr = ["a", "", "c", None]
        table = pa.table({"a": pa.array(arr),
                          "a_large": pa.array(arr, type=pa.large_string())})
    
        from pandas.api.interchange import (
            from_dataframe as pandas_from_dataframe
        )
    
        if Version(pd.__version__) >= Version("2.0.2"):
            pandas_df = pandas_from_dataframe(table)
>           result = pi.from_dataframe(pandas_df)

arr        = ['a', '', 'c', None]
pandas_df  =      a a_large
0    a       a
1             
2    c       c
3  NaN     NaN
pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
table      = pyarrow.Table
a: string
a_large: large_string
----
a: [["a","","c",null]]
a_large: [["a","","c",null]]

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:227: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
        allow_copy = True
        df         =      a a_large
0    a       a
1             
2    c       c
3  NaN     NaN
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136: in _from_dataframe
    batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
        allow_copy = True
        batches    = []
        chunk      = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda15b850>
        df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda15b850>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182: in protocol_df_chunk_to_pyarrow
    columns[name] = column_to_array(col, allow_copy)
        allow_copy = True
        col        = <pandas.core.interchange.column.PandasColumn object at 0xda103210>
        columns    = {}
        df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xda15b850>
        dtype      = <DtypeKind.STRING: 21>
        name       = 'a'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
    data = buffers_to_array(buffers, data_type,
        allow_copy = True
        buffers    = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 3879523744, 'device': 'CPU'}),
          (<DtypeKind.STRING: 21>, 8, 'u', '=')),
 'offsets': (PandasBuffer({'bufsize': 40, 'ptr': 1469510752, 'device': 'CPU'}),
             (<DtypeKind.INT: 0>, 64, 'l', '=')),
 'validity': (PandasBuffer({'bufsize': 4, 'ptr': 1475420176, 'device': 'CPU'}),
              (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
        col        = <pandas.core.interchange.column.PandasColumn object at 0xda103210>
        data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
    data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
        _          = (<DtypeKind.STRING: 21>, 8, 'u', '=')
        allow_copy = True
        buffers    = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 3879523744, 'device': 'CPU'}),
          (<DtypeKind.STRING: 21>, 8, 'u', '=')),
 'offsets': (PandasBuffer({'bufsize': 40, 'ptr': 1469510752, 'device': 'CPU'}),
             (<DtypeKind.INT: 0>, 64, 'l', '=')),
 'validity': (PandasBuffer({'bufsize': 4, 'ptr': 1475420176, 'device': 'CPU'}),
              (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
        data_buff  = PandasBuffer({'bufsize': 2, 'ptr': 3879523744, 'device': 'CPU'})
        data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
        describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
        length     = 4
        offset     = 0
        offset_buff = PandasBuffer({'bufsize': 40, 'ptr': 1469510752, 'device': 'CPU'})
        offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
        validity_buff = PandasBuffer({'bufsize': 4, 'ptr': 1475420176, 'device': 'CPU'})
        validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OverflowError: Python int too large to convert to C ssize_t


pyarrow/io.pxi:1990: OverflowError
__________________________________________________ test_pandas_roundtrip_categorical __________________________________________________

    @pytest.mark.pandas
    def test_pandas_roundtrip_categorical():
        if Version(pd.__version__) < Version("2.0.2"):
            pytest.skip("Bitmasks not supported in pandas interchange implementation")
    
        arr = ["Mon", "Tue", "Mon", "Wed", "Mon", "Thu", "Fri", "Sat", None]
        table = pa.table(
            {"weekday": pa.array(arr).dictionary_encode()}
        )
    
        from pandas.api.interchange import (
            from_dataframe as pandas_from_dataframe
        )
        pandas_df = pandas_from_dataframe(table)
>       result = pi.from_dataframe(pandas_df)

arr        = ['Mon', 'Tue', 'Mon', 'Wed', 'Mon', 'Thu', 'Fri', 'Sat', None]
pandas_df  =   weekday
0     Mon
1     Tue
2     Mon
3     Wed
4     Mon
5     Thu
6     Fri
7     Sat
8     NaN
pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
table      = pyarrow.Table
weekday: dictionary<values=string, indices=int32, ordered=0>
----
weekday: [  -- dictionary:
["Mon","Tue","Wed","Thu","Fri","Sat"]  -- indices:
[0,1,0,2,0,3,4,5,null]]

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:257: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
        allow_copy = True
        df         =   weekday
0     Mon
1     Tue
2     Mon
3     Wed
4     Mon
5     Thu
6     Fri
7     Sat
8     NaN
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136: in _from_dataframe
    batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
        allow_copy = True
        batches    = []
        chunk      = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xd9e217f0>
        df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xd9e217f0>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:186: in protocol_df_chunk_to_pyarrow
    columns[name] = categorical_column_to_dictionary(col, allow_copy)
        allow_copy = True
        col        = <pandas.core.interchange.column.PandasColumn object at 0xda180550>
        columns    = {}
        df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xd9e217f0>
        dtype      = <DtypeKind.CATEGORICAL: 23>
        name       = 'weekday'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:293: in categorical_column_to_dictionary
    dictionary = column_to_array(cat_column)
        allow_copy = True
        cat_column = <pandas.core.interchange.column.PandasColumn object at 0xda1801d0>
        categorical = {'categories': <pandas.core.interchange.column.PandasColumn object at 0xda1801d0>,
 'is_dictionary': True,
 'is_ordered': False}
        col        = <pandas.core.interchange.column.PandasColumn object at 0xda180550>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
    data = buffers_to_array(buffers, data_type,
        allow_copy = True
        buffers    = {'data': (PandasBuffer({'bufsize': 18, 'ptr': 3659006432, 'device': 'CPU'}),
          (<DtypeKind.STRING: 21>, 8, 'u', '=')),
 'offsets': (PandasBuffer({'bufsize': 56, 'ptr': 1466456352, 'device': 'CPU'}),
             (<DtypeKind.INT: 0>, 64, 'l', '=')),
 'validity': (PandasBuffer({'bufsize': 6, 'ptr': 1477427216, 'device': 'CPU'}),
              (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
        col        = <pandas.core.interchange.column.PandasColumn object at 0xda1801d0>
        data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
    data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
        _          = (<DtypeKind.STRING: 21>, 8, 'u', '=')
        allow_copy = True
        buffers    = {'data': (PandasBuffer({'bufsize': 18, 'ptr': 3659006432, 'device': 'CPU'}),
          (<DtypeKind.STRING: 21>, 8, 'u', '=')),
 'offsets': (PandasBuffer({'bufsize': 56, 'ptr': 1466456352, 'device': 'CPU'}),
             (<DtypeKind.INT: 0>, 64, 'l', '=')),
 'validity': (PandasBuffer({'bufsize': 6, 'ptr': 1477427216, 'device': 'CPU'}),
              (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
        data_buff  = PandasBuffer({'bufsize': 18, 'ptr': 3659006432, 'device': 'CPU'})
        data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
        describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
        length     = 6
        offset     = 0
        offset_buff = PandasBuffer({'bufsize': 56, 'ptr': 1466456352, 'device': 'CPU'})
        offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
        validity_buff = PandasBuffer({'bufsize': 6, 'ptr': 1477427216, 'device': 'CPU'})
        validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OverflowError: Python int too large to convert to C ssize_t


pyarrow/io.pxi:1990: OverflowError
________________________________________________________ test_empty_dataframe _________________________________________________________

    def test_empty_dataframe():
        schema = pa.schema([('col1', pa.int8())])
        df = pa.table([[]], schema=schema)
        dfi = df.__dataframe__()
>       assert pi.from_dataframe(dfi) == df

df         = pyarrow.Table
col1: int8
----
col1: [[]]
dfi        = <pyarrow.interchange.dataframe._PyArrowDataFrame object at 0xd98381d0>
schema     = col1: int8

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:522: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
        allow_copy = True
        df         = <pyarrow.interchange.dataframe._PyArrowDataFrame object at 0xd98381d0>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:140: in _from_dataframe
    batch = protocol_df_chunk_to_pyarrow(df)
        allow_copy = True
        batches    = []
        df         = <pyarrow.interchange.dataframe._PyArrowDataFrame object at 0xd96e41b0>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182: in protocol_df_chunk_to_pyarrow
    columns[name] = column_to_array(col, allow_copy)
        allow_copy = True
        col        = <pyarrow.interchange.column._PyArrowColumn object at 0xd96a6650>
        columns    = {}
        df         = <pyarrow.interchange.dataframe._PyArrowDataFrame object at 0xd96e41b0>
        dtype      = <DtypeKind.INT: 0>
        name       = 'col1'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
    data = buffers_to_array(buffers, data_type,
        allow_copy = True
        buffers    = {'data': (PyArrowBuffer({'bufsize': 0, 'ptr': 4122363392, 'device': 'CPU'}),
          (<DtypeKind.INT: 0>, 8, 'c', '=')),
 'offsets': None,
 'validity': None}
        col        = <pyarrow.interchange.column._PyArrowColumn object at 0xd96a6650>
        data_type  = (<DtypeKind.INT: 0>, 8, 'c', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
    data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
        _          = (<DtypeKind.INT: 0>, 8, 'c', '=')
        allow_copy = True
        buffers    = {'data': (PyArrowBuffer({'bufsize': 0, 'ptr': 4122363392, 'device': 'CPU'}),
          (<DtypeKind.INT: 0>, 8, 'c', '=')),
 'offsets': None,
 'validity': None}
        data_buff  = PyArrowBuffer({'bufsize': 0, 'ptr': 4122363392, 'device': 'CPU'})
        data_type  = (<DtypeKind.INT: 0>, 8, 'c', '=')
        describe_null = (<ColumnNullType.NON_NULLABLE: 0>, None)
        length     = 0
        offset     = 0
        offset_buff = None
        validity_buff = None
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OverflowError: Python int too large to convert to C ssize_t


pyarrow/io.pxi:1990: OverflowError

Full build & test log (2.5M): pyarrow.txt

This is arrow 15.0.0 on Gentoo, in an x86 systemd-nspawn container. I used -O2 -march=pentium-m -mfpmath=sse -pipe as compiler flags to rule out i387-specific issues.

>>> pyarrow.show_info()
pyarrow version info
--------------------
Package kind              : not indicated
Arrow C++ library version : 15.0.0  
Arrow C++ compiler        : GNU 13.2.1
Arrow C++ compiler flags  : -O2 -march=pentium-m -mfpmath=sse -pipe
Arrow C++ git revision    :         
Arrow C++ git description :         
Arrow C++ build type      : relwithdebinfo

Platform:
  OS / Arch           : Linux x86_64
  SIMD Level          : avx2    
  Detected SIMD Level : avx2    

Memory:
  Default backend     : system  
  Bytes allocated     : 0 bytes 
  Max memory          : 0 bytes 
  Supported Backends  : system  

Optional modules:
  csv                 : Enabled 
  cuda                : -       
  dataset             : Enabled 
  feather             : Enabled 
  flight              : -       
  fs                  : Enabled 
  gandiva             : -       
  json                : Enabled 
  orc                 : -       
  parquet             : Enabled 

Filesystems:
  GcsFileSystem       : -       
  HadoopFileSystem    : Enabled 
  S3FileSystem        : -       

Compression Codecs:
  brotli              : Enabled 
  bz2                 : Enabled 
  gzip                : Enabled 
  lz4_frame           : Enabled 
  lz4                 : Enabled 
  snappy              : Enabled 
  zstd                : Enabled 

Some of these might be problems inside pandas. I'm going to file a bug about the test failures there in a minute, and link it here afterwards.

Component(s)

Python

@mgorny
Contributor Author

mgorny commented Feb 20, 2024

pandas counterpart: pandas-dev/pandas#57523

@pitrou
Member

pitrou commented Feb 20, 2024

Perhaps you want to submit a PR for this? From what I can tell, this should be mostly a matter of fixing the tests.

@mgorny
Contributor Author

mgorny commented Feb 20, 2024

I can try, but I can't promise I'll come up with anything that makes sense. I have almost no knowledge of NumPy, and none of Arrow/pandas, so it'll be all guesswork.

mgorny added a commit to mgorny/apache-arrow that referenced this issue Feb 20, 2024
…t platforms

Use `uintptr_t` rather than `intptr_t` to fix `OverflowError`, visible
e.g. when running `tests/interchange/test_conversion.py` tests on 32-bit
platforms.
@pitrou
Member

pitrou commented Feb 20, 2024

Also cc @jorisvandenbossche

@mgorny
Contributor Author

mgorny commented Feb 20, 2024

Looking at the other failures:

FAILED tests/test_array.py::test_dictionary_to_numpy - TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

I haven't figured this one out yet; it might be a bug in pandas. From what I can see, we're constructing two arrays, and they end up having different types inside.

FAILED tests/test_io.py::test_python_file_large_seeks - assert 5 == ((2 ** 32) + 5)
FAILED tests/test_io.py::test_memory_map_large_seeks - OSError: Read out of bounds (offset = 4294967301, size = 5) in file of size 10

This could be a bug in pyarrow (or Arrow itself): either it wasn't built with Large File Support, or it doesn't use the correct types somewhere in between.

FAILED tests/test_pandas.py::TestConvertStructTypes::test_from_numpy_nested - AssertionError: assert 8 == 12
FAILED tests/test_schema.py::test_schema_sizeof - assert 28 > 30

I think the first one is because np.object_ is 32-bit rather than 64-bit, but I'm not sure about the second one.
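
Both guesses are easy to sanity-check from plain Python/NumPy; a minimal sketch (the arithmetic mirrors the failing asserts above):

import numpy as np

# 32-bit truncation reproduces both large-seek failures exactly:
offset = 2 ** 32 + 5
print(offset & 0xFFFFFFFF)  # 5 -> matches "assert 5 == ((2 ** 32) + 5)"
print(offset)               # 4294967301 -> the out-of-bounds read offset

# Object fields are pointer-sized, so on a 32-bit build the nested struct
# dtype from test_from_numpy_nested packs into 8 bytes instead of 12:
print(np.dtype(np.object_).itemsize)  # 8 on 64-bit builds, 4 on 32-bit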

mgorny added a commit to mgorny/apache-arrow that referenced this issue Feb 20, 2024
Update the size assumptions in tests to account for size differences
on 32-bit platforms: `np.object_` is 4 bytes rather than 8 bytes,
and `pa.schema` is half the size.
@kou kou changed the title from "Test failures on 32-bit x86" to "[Python] Test failures on 32-bit x86" on Feb 20, 2024
pitrou pushed a commit that referenced this issue Feb 20, 2024
…forms (#40158)

Use `uintptr_t` rather than `intptr_t` to fix `OverflowError`, visible e.g. when running `tests/interchange/test_conversion.py` tests on 32-bit platforms.

### Rationale for this change

This fixes the `OverflowError`s from #40153, and makes `pyarrow/tests/interchange/` all pass on 32-bit x86.

### What changes are included in this PR?

- change the type used to store pointers from `intptr_t` to `uintptr_t` to provide coverage for pointers above `0x80000000`.

### Are these changes tested?

These changes are covered by the tests in `pyarrow/tests/interchange`.

### Are there any user-facing changes?

It fixes an `OverflowError` that can be triggered by working with pandas data types, and possibly more (though I'm not sure whether this qualifies as a "crash").

* Closes: #40153

Authored-by: Michał Górny <mgorny@gentoo.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
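
For illustration, a minimal sketch of the failure mode, assuming a 32-bit build where intptr_t is 32 bits wide (the address below is taken from the traceback):

ptr = 3879523528  # a data buffer address from the traceback, above 2**31 - 1

INT32_MAX = 2 ** 31 - 1
UINT32_MAX = 2 ** 32 - 1

# A signed 32-bit intptr_t cannot hold this value, so the conversion raises
# "OverflowError: Python int too large to convert to C ssize_t":
print(ptr > INT32_MAX)    # True -> overflows a signed 32-bit type
print(ptr <= UINT32_MAX)  # True -> fits an unsigned 32-bit uintptr_t
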
pitrou added a commit that referenced this issue Feb 20, 2024
### What changes are included in this PR?

Add a Debian-based i386 test build for Python, similar to the existing one for C++.

### Are these changes tested?

Yes. The test suite step in the new build will fail until GH-40153 is entirely fixed.

### Are there any user-facing changes?

No.
* Closes: #40159

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@mgorny
Contributor Author

mgorny commented Feb 21, 2024

Should we reopen the bug for the remaining issues or should I file a new one?

@pitrou
Member

pitrou commented Feb 21, 2024

Oops, sorry, I didn't mean to close it. Let's reopen it.

@pitrou pitrou reopened this Feb 21, 2024
pitrou pushed a commit that referenced this issue Feb 21, 2024
### Rationale for this change

This fixes two tests on 32-bit platforms (tested on x86 specifically).

### What changes are included in this PR?

- update the `np.object_` size assumption to 4 bytes on 32-bit platforms
- update the `pa.schema` size assumptions to be half as large on 32-bit platforms

### Are these changes tested?

The changes fix tests.

### Are there any user-facing changes?

Only test fixes.

* Closes: #40153

Authored-by: Michał Górny <mgorny@gentoo.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
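
A sketch of what such platform-conditional expectations can look like in test code (the exact form used by the PR is an assumption here):

import sys
import numpy as np

IS_64BIT = sys.maxsize > 2 ** 32  # False on 32-bit builds

# np.object_ fields are pointer-sized, so the nested struct dtype shrinks:
dt = np.dtype([('x', np.dtype([('xx', np.int8), ('yy', np.bool_)])),
               ('y', np.int16), ('z', np.object_)])
assert dt.itemsize == (12 if IS_64BIT else 8)
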
@pitrou pitrou reopened this Feb 21, 2024
@pitrou
Member

pitrou commented Feb 21, 2024

I've opened #40176 for the large file-related failures.

@mgorny
Contributor Author

mgorny commented Feb 21, 2024

By the way, I've gotten a reply that the two cases of `ValueError: putmask: output array is read-only` in pandas-dev/pandas#57523 "could potentially be related to bugs in Arrow".

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…n build (apache#40176)

### Rationale for this change

Python large file tests fail on 32-bit platforms.

### What changes are included in this PR?

1. Fix passing `int64_t` position to the Python file methods when a Python file object is wrapped in an Arrow `RandomAccessFile`
2. Disallow creating a `MemoryMappedFile` spanning more than the `size_t` maximum, instead of silently truncating its length

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: apache#40153

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
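
A rough Python-level sketch of the second guard; the helper name is hypothetical, and the real change lives in the C++ MemoryMappedFile:

import sys

SIZE_T_MAX = 2 * sys.maxsize + 1  # the platform's size_t maximum

def check_mappable(length):
    # Refuse lengths that a 32-bit size_t would silently truncate,
    # rather than mapping a shorter region than requested.
    if length > SIZE_T_MAX:
        raise OSError(f"requested mapping of {length} bytes exceeds size_t")
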
@pitrou
Member

pitrou commented Feb 29, 2024

Taken together, PRs #40293, #40294 and #40295 should fix the remaining failures.

@mgorny
Contributor Author

mgorny commented Feb 29, 2024

Thanks a lot! I can confirm that they fix the remaining issues.

@thesamesam

Thank you for putting the time into this.

pitrou added a commit that referenced this issue Feb 29, 2024
### Rationale for this change

`Array.to_numpy` calls `np.take` to linearize dictionary arrays. This fails on 32-bit NumPy builds because we hand NumPy 64-bit indices that it cannot safely downcast to its 32-bit index type.

### What changes are included in this PR?

Avoid calling `np.take`, instead using our own dictionary decoding routine.

### Are these changes tested?

Yes. A test failure is fixed on 32-bit.

### Are there any user-facing changes?

No.
* GitHub Issue: #40153

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
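
The underlying incompatibility is easy to reproduce with NumPy alone; a sketch (this succeeds on 64-bit builds, where np.intp is int64):

import numpy as np

values = np.array([13.7, 11.0])
indices = np.array([0, 1, 1, 0], dtype=np.int64)

# On 32-bit builds np.intp is int32, so int64 indices fail the 'safe' cast:
#   TypeError: Cannot cast array data from dtype('int64') to dtype('int32')
print(values.take(indices))
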
jorisvandenbossche pushed a commit that referenced this issue Mar 5, 2024
…ms (#40294)

### Rationale for this change

`Tensor.__getbuffer__` would silently assume that `Py_ssize_t` is the same width as `int64_t`, which is true only on 64-bit platforms.

### What changes are included in this PR?

Create an internal buffer of `Py_ssize_t` values mirroring a Tensor's shape and strides, to avoid relying on the aforementioned assumption.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

* GitHub Issue: #40153

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
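
A Python-level sketch of the mirroring idea, assuming 64-bit shape/stride values coming in (the actual fix allocates the Py_ssize_t array in Cython):

import sys

def mirror_as_ssize(values):
    # Copy 64-bit extents into native-width integers, refusing values that
    # would not round-trip instead of silently truncating them.
    out = []
    for v in values:
        if not (-sys.maxsize - 1 <= v <= sys.maxsize):
            raise OverflowError(f"{v} does not fit in Py_ssize_t")
        out.append(v)
    return out

shape, strides = mirror_as_ssize([3, 4]), mirror_as_ssize([32, 8])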
pitrou added a commit that referenced this issue Mar 5, 2024
### Rationale for this change

`test_gdb.py` would fail on 32-bit platforms because the gdb extension errors out when a timestamp value is larger than the platform's `time_t` can represent.

### What changes are included in this PR?

1. Catch `OverflowError` from the Python datetime module when trying to format a timestamp
2. Tweak the expected test results on 32-bit

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #40153

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
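
Roughly what the formatting fallback amounts to (a sketch; the function name is assumed, not the extension's actual code):

import datetime

def format_timestamp(seconds):
    # Values beyond the platform's time_t range make the datetime module
    # raise OverflowError (or OSError); fall back to the raw value.
    try:
        return datetime.datetime.utcfromtimestamp(seconds).isoformat()
    except (OverflowError, OSError, ValueError):
        return f"<timestamp out of range: {seconds}>"
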
@pitrou pitrou added this to the 16.0.0 milestone Mar 5, 2024
@pitrou
Copy link
Member

pitrou commented Mar 5, 2024

Issue resolved by pull request #40293

@pitrou pitrou closed this as completed Mar 5, 2024