[Python] Massive performance deterioration with pandas 2.1.1 vs. 1.5.3 when calling pa.Table.from_pandas() #38260

MMCMA · 2023-10-13T10:40:55Z

Describe the bug, including details regarding any error messages, version, and platform.

We experience a massive drop in performance when invoking pa.Table.from_pandas() with pandas 2.1.1 compared to pandas 1.5.3.
In this example, the conversion time increased from roughly 2.9 seconds to 16.2 seconds. In our data application the problem is even more dramatic because the dataframe is larger; the runtime seems very sensitive to the number of columns. Doubling the number of columns (num_cols=20000 vs. num_cols=40000) roughly quadruples the compute time. With pandas 1.5.3 the compute time scales closer to linearly with the number of columns. I am not sure whether this should also be raised with pandas.

import pyarrow as pa
import pandas as pd
import numpy as np
import timeit

num_cols = 20000
num_dates = 8800
dates = pd.date_range(start='19900101', freq='b', periods=num_dates)
data = np.random.randint(low=0, high=10, size=(num_dates, num_cols))
df = pd.DataFrame(data, index=dates)

tic = timeit.default_timer()
pa.Table.from_pandas(df, preserve_index=True)
total_time = timeit.default_timer() - tic
print(f'Conversion from pandas to pyarrow took {total_time} seconds')

Component(s)

Python

raulcd · 2023-10-13T16:10:32Z

Hi, could you share the version of pyarrow used and the Python version?

danepitkin · 2023-10-13T16:32:51Z

I am able to reproduce it locally. The behavior changes only when updating python/pandas. I'm not sure what the root cause is though.

$ python arrow-38260.py 20000
python version:  3.9.18
pyarrow version: 13.0.0
pandas version:  1.5.3
numpy version:   1.26.0
Conversion from pandas to pyarrow took 1.114816792 seconds for 20000 columns
$ python arrow-38260.py 40000
python version:  3.9.18
pyarrow version: 13.0.0
pandas version:  1.5.3
numpy version:   1.26.0
Conversion from pandas to pyarrow took 2.4374076250000005 seconds for 40000 columns

$ python arrow-38260.py 20000
python version:  3.12.0
pyarrow version: 13.0.0
pandas version:  2.1.1
numpy version:   1.26.0
Conversion from pandas to pyarrow took 5.036314583034255 seconds for 20000 columns
$ python arrow-38260.py 40000
python version:  3.12.0
pyarrow version: 13.0.0
pandas version:  2.1.1
numpy version:   1.26.0
Conversion from pandas to pyarrow took 19.435286541993264 seconds for 40000 columns

Here's the modified version of the above script I used:

import argparse
import platform
import timeit

import numpy as np
import pandas as pd
import pyarrow as pa

parser = argparse.ArgumentParser()
parser.add_argument("num_cols", type=int)
args = parser.parse_args()

num_cols = args.num_cols
num_dates = 8800
dates = pd.date_range(start='19900101', freq='b', periods=num_dates)
data = np.random.randint(low=0, high=10, size=(num_dates, num_cols))
df = pd.DataFrame(data, index=dates)

tic = timeit.default_timer()
pa.Table.from_pandas(df, preserve_index=True)
total_time = timeit.default_timer() - tic
print(f'python version:  {platform.python_version()}')
print(f'pyarrow version: {pa.__version__}')
print(f'pandas version:  {pd.__version__}')
print(f'numpy version:   {np.__version__}')
print(f'Conversion from pandas to pyarrow took {total_time} seconds for {num_cols} columns')
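
As a rough check of the scaling in the timings above, one can estimate the exponent k in "time ~ num_cols**k" from each pair of runs. This is only a back-of-the-envelope sketch with two data points per pandas version, and the two environments also differ in Python version (3.9.18 vs. 3.12.0). The helper name scaling_exponent is made up for illustration.

```python
import math


def scaling_exponent(cols_a, time_a, cols_b, time_b):
    """Estimate k in time ~ cols**k from two (cols, time) measurements."""
    return math.log(time_b / time_a) / math.log(cols_b / cols_a)


# Timings (in seconds) copied from the runs reported above.
k_pandas_153 = scaling_exponent(20000, 1.114816792, 40000, 2.4374076250000005)
k_pandas_211 = scaling_exponent(20000, 5.036314583034255, 40000, 19.435286541993264)

print(f"pandas 1.5.3: k ~ {k_pandas_153:.2f}")
print(f"pandas 2.1.1: k ~ {k_pandas_211:.2f}")
```

The exponent comes out close to 1 (roughly linear) for pandas 1.5.3 and close to 2 (roughly quadratic) for pandas 2.1.1, which is consistent with the "2x columns, 4x time" observation in the original report.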

amoeba · 2023-10-13T17:52:13Z

Here are two flamegraphs produced from py-spy. It looks like the major difference between the two is in the proportion of time spent in dataframe_to_arrays.

python version:  3.11.6
pyarrow version: 11.0.0
pandas version:  1.5.3
numpy version:   1.26.0
Conversion from pandas to pyarrow took 1.1433022079290822 seconds for 20000 columns

python version:  3.11.6
pyarrow version: 13.0.0
pandas version:  2.1.1
numpy version:   1.26.0
Conversion from pandas to pyarrow took 3.7711586660007015 seconds for 20000 columns

anjakefala · 2023-10-13T18:19:09Z

One thing I noticed: _get_columns_to_convert (pandas_compat.py(573)) is significantly faster with pandas 1.5.3. Doing more digging!

anjakefala · 2023-10-13T18:29:25Z

This is the massively slower part of the code, based on pandas version:

    for name in columns:
        tic = timeit.default_timer()
        col = df[name]
        name = _column_name_to_strings(name)

        if _pandas_api.is_sparse(col):
            raise TypeError(
                "Sparse pandas data (column {}) not supported.".format(name))

        columns_to_convert.append(col)
        convert_fields.append(None)
        column_names.append(name)

anjakefala · 2023-10-13T18:37:57Z

The column lookup (col = df[name]) is where the performance drops. It is roughly 3 orders of magnitude slower. We are talking about a difference of maybe 100 microseconds per lookup, but that adds up to seconds when you have a lot of columns (for example, 20000 columns x 100 microseconds is about 2 seconds).

From this investigation, it seems like the bug is on pandas' end. I will research if there is an existing open issue.
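
To check the per-lookup cost in a given environment without involving pyarrow, a small micro-benchmark along these lines could be used. It is a minimal sketch, not the code used in this investigation: the function name mean_column_lookup_seconds and the chosen sizes are assumptions for illustration, and it keeps every looked-up column in a list, mimicking the columns_to_convert.append(col) step in the loop above.

```python
import time

import numpy as np
import pandas as pd


def mean_column_lookup_seconds(num_cols, num_rows=100):
    """Return the mean wall-clock time (seconds) of one df[name] lookup."""
    data = np.random.randint(low=0, high=10, size=(num_rows, num_cols))
    df = pd.DataFrame(data)

    kept = []  # keep each column alive, like columns_to_convert in pyarrow
    start = time.perf_counter()
    for name in df.columns:
        kept.append(df[name])  # the lookup identified above as the slow step
    elapsed = time.perf_counter() - start
    return elapsed / num_cols


if __name__ == "__main__":
    print(f"pandas version: {pd.__version__}")
    for n in (1000, 2000, 4000):
        per_lookup_us = mean_column_lookup_seconds(n) * 1e6
        print(f"{n} columns: {per_lookup_us:.1f} microseconds per lookup")
```

If the time per lookup stays roughly flat as the column count grows, the total loop time is linear in the number of columns; if it grows with the column count, the total becomes superlinear, as seen in the timings reported earlier in this thread.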

jorisvandenbossche · 2023-10-14T13:06:41Z

I was going to comment yesterday that this is quite likely an issue on the pandas side, which is in the meantime confirmed by the comments above. And coincidentally, I was now looking at a perf regression report in pandas (pandas-dev/pandas#55245) that shows the same culprit as in the py-spy image from @amoeba above: Manager.iget, which is what is used under the hood to access a column.

So it's indeed repeated column lookup in wide dataframes that has become significantly slower. It's a regression in 2.1.0 -> 2.1.1, and caused by pandas-dev/pandas#55008 (comment)

assignUser · 2023-10-16T14:33:37Z

Thanks for the thorough investigation everyone, great work! If this is a confirmed pandas issue, should we close this?

MMCMA added the Type: bug label Oct 13, 2023

github-actions bot added the Component: Python label Oct 13, 2023

assignUser added the Priority: Blocker Marks a blocker for the release label Oct 13, 2023

raulcd changed the title ~~Massive performance deterioation with pandas 2.1.1 vs. 1.5.3 when calling pa.Table.from_pandas()~~ [Python] Massive performance deterioation with pandas 2.1.1 vs. 1.5.3 when calling pa.Table.from_pandas() Oct 13, 2023

raulcd changed the title ~~[Python] Massive performance deterioation with pandas 2.1.1 vs. 1.5.3 when calling pa.Table.from_pandas()~~ [Python] Massive performance deterioration with pandas 2.1.1 vs. 1.5.3 when calling pa.Table.from_pandas() Oct 13, 2023

jorisvandenbossche removed the Priority: Blocker Marks a blocker for the release label Oct 14, 2023

jorisvandenbossche mentioned this issue Oct 14, 2023 in pandas-dev/pandas#55008 "CoW: Clear dead references every time we add a new one" (merged)

jorisvandenbossche closed this as completed Oct 17, 2023
