
[Python] Round-trip type-conversion bug list object in pd.DataFrame #34574

Open
MMCMA opened this issue Mar 15, 2023 · 1 comment
Comments


MMCMA commented Mar 15, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I am trying to convert a pandas DataFrame into a pyarrow Table and back. However, the cell type is lost: the list comes back as a numpy.ndarray.

import pandas as pd
import pyarrow as pa

df = pd.DataFrame(dict(a=[[1,2,3],]))
type(df['a'].iloc[0])
# Out[1]: list

round_trip_df = pa.Table.from_pandas(df).to_pandas()
type(round_trip_df['a'].iloc[0])
# Out[2]: numpy.ndarray
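
As a possible workaround (a sketch on my side, not a pyarrow option), the ndarray cells can be mapped back to Python lists after to_pandas():

import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(dict(a=[[1, 2, 3]]))
round_trip_df = pa.Table.from_pandas(df).to_pandas()

# Map each ndarray cell back to a Python list (leave other values untouched).
round_trip_df['a'] = round_trip_df['a'].apply(
    lambda v: v.tolist() if isinstance(v, np.ndarray) else v
)
type(round_trip_df['a'].iloc[0])
# list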

Here is the output of pd.show_versions():

INSTALLED VERSIONS

commit : 2e218d10984e9919f0296931d92ea851c6a6faf5
python : 3.10.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.5.3
numpy : 1.23.5
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 64.0.2
pip : 23.0
Cython : 0.29.33
pytest : 7.2.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 8.10.0
pandas_datareader: 0.10.0
bs4 : 4.11.2
bottleneck : None
brotli :
fastparquet : None
fsspec : 2023.1.0
gcsfs : None
matplotlib : 3.7.0
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : 3.1.1
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.1
snappy : None
sqlalchemy : 1.4.46
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Component(s)

Python

@kou kou changed the title Round-trip type-conversion bug list object in pd.DataFrame [Python] Round-trip type-conversion bug list object in pd.DataFrame Mar 15, 2023

AlenkaF commented Mar 21, 2023

If I am reading the C++ code correctly, this is the current design for the conversion of lists:

Status ConvertListsLike(PandasOptions options, const ChunkedArray& data,
                        PyObject** out_values) {
  // Get column of underlying value arrays
  ArrayVector value_arrays;
  for (int c = 0; c < data.num_chunks(); c++) {
    const auto& arr = checked_cast<const ListArrayT&>(*data.chunk(c));
    // values() does not account for offsets, so we need to slice into it.
    // We can't use Flatten(), because it removes the values behind a null list
    // value, and that makes the offsets into original list values and our
    // flattened_values array different.
    std::shared_ptr<Array> flattened_values = arr.values()->Slice(
        arr.value_offset(0), arr.value_offset(arr.length()) - arr.value_offset(0));
    if (arr.value_type()->id() == Type::EXTENSION) {
      const auto& arr_ext = checked_cast<const ExtensionArray&>(*flattened_values);
      value_arrays.emplace_back(arr_ext.storage());
    } else {
      value_arrays.emplace_back(flattened_values);
    }
  }
  using ListArrayType = typename ListArrayT::TypeClass;
  const auto& list_type = checked_cast<const ListArrayType&>(*data.type());
  auto value_type = list_type.value_type();
  if (value_type->id() == Type::EXTENSION) {
    value_type = checked_cast<const ExtensionType&>(*value_type).storage_type();
  }
  auto flat_column = std::make_shared<ChunkedArray>(value_arrays, value_type);
  options = MakeInnerOptions(std::move(options));
  OwnedRefNoGIL owned_numpy_array;
  RETURN_NOT_OK(ConvertChunkedArrayToPandas(options, flat_column, nullptr,
                                            owned_numpy_array.ref()));
  PyObject* numpy_array = owned_numpy_array.obj();
  DCHECK(PyArray_Check(numpy_array));
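
In Python terms, the effect of that code path is roughly the following (a simplified sketch, not the actual implementation): the flattened child values are converted once into a single numpy array, and each row then gets a slice of it, which is why the cells come back as numpy.ndarray.

import pyarrow as pa

arr = pa.array([[1, 2, 3], [4, 5]])  # ListArray with int64 values

# Convert the flattened child values once...
flat_values = arr.values.to_numpy()
offsets = arr.offsets.to_numpy()

# ...then each row is just a slice of that flat numpy array.
rows = [flat_values[offsets[i]:offsets[i + 1]] for i in range(len(arr))]
type(rows[0])
# numpy.ndarray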

We do, however, store the pandas_dtype in schema metadata:

>>> table = pa.Table.from_pandas(df)
>>> table.schema.metadata
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 1, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "list[int64]", ...
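
That metadata is JSON, so the recorded pandas_type for the column can be read back directly (continuing the example above):

import json

# The pandas-specific metadata lives under the b'pandas' key as JSON bytes.
pandas_meta = json.loads(table.schema.metadata[b'pandas'])
pandas_meta['columns'][0]['pandas_type']
# 'list[int64]'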
