
[Python] Possible performance regression in Feather read/write path #18037

Closed
asfimport opened this issue Jan 30, 2018 · 16 comments
asfimport commented Jan 30, 2018

See discussion in wesm/feather#329. Needs to be investigated

Reporter: Wes McKinney / @wesm
Assignee: Antoine Pitrou / @pitrou

Note: This issue was originally created as ARROW-2059. Please see the migration documentation for further details.


Jingyuan Wang / @alphalfalfa:
Here is what I've done. I simply repeated the 1M rows to create 30M- and 100M-row test CSV files, then repeated the process of reading from CSV, writing to Feather, and reading from Feather, timing each part. I repeated each measurement 10 times for the four combinations of (Python 2.7, Python 3.6) x (feather-format 0.3.1, feather-format 0.4.0).

Processing the 100M-row files failed on my laptop (16 GB memory) in every configuration except Python 2.7 with feather-format 0.3.1.

The measurements for 1M rows are as follows:

|python version|feather version|# rows|write feather (s)|read feather (s)|
|-|-|-|-|-|
|2.7|0.3.1|1M|0.06216781139|0.05903599262|
|2.7|0.4.0|1M|0.1335380793|0.04576666355|
|3.6|0.3.1|1M|0.07768514156|0.09041910172|
|3.6|0.4.0|1M|0.08690385818|0.05801310539|


The measurements for 30M rows are as follows:

|python version|feather version|# rows|write feather (s)|read feather (s)|
|-|-|-|-|-|
|2.7|0.3.1|30M|1.747310066|2.35606482|
|2.7|0.4.0|30M|3.5653723|1.934461188|
|3.6|0.3.1|30M|2.407458949|2.811572456|
|3.6|0.4.0|30M|2.925034189|1.852504301|



From both tables, performance of writing to Feather did degrade from 0.3.1 to 0.4.0, with Python 2 affected more dramatically. Reading Feather files was actually faster with the newer Feather version.



One other thing: I noticed that feather-format 0.3.1 does not even depend on Arrow, so the performance difference involves more than the Arrow version upgrade. I also think we need some thorough benchmarks for Arrow, or do we already have them?


Jingyuan Wang / @alphalfalfa:
Stack trace of the failure when processing 100M-row data (feather-0.4.0):


Traceback (most recent call last):
  File "test.py", line 44, in <module>
    elapsed_ts.append(test(df, i))
  File "test.py", line 18, in test
    feather.write_dataframe(df, fthfilename)
  File "/home/jingyuan/anaconda3/envs/arrow-2059-feather-0.4.0/lib/python3.6/site-packages/pyarrow/feather.py", line 100, in write_feather
    writer.write(df)
  File "/home/jingyuan/anaconda3/envs/arrow-2059-feather-0.4.0/lib/python3.6/site-packages/pyarrow/feather.py", line 80, in write
    batch = RecordBatch.from_pandas(df, preserve_index=False)
  File "table.pxi", line 705, in pyarrow.lib.RecordBatch.from_pandas
  File "table.pxi", line 739, in pyarrow.lib.RecordBatch.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
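The TypeError above happens because a very large string column is converted into a multi-chunk ChunkedArray, while RecordBatch.from_pandas at the time required single-chunk arrays (see ARROW-1907 referenced below). As a toy illustration of why a binary builder must split large data into chunks (this is not Arrow's actual code; the size cap and names are made up):

```python
def chunk_by_size(values, cap):
    """Split a list of byte strings into chunks whose total payload
    stays under `cap` bytes (a toy model of a chunked binary array)."""
    chunks, current, size = [], [], 0
    for v in values:
        # Start a new chunk once the current one would exceed the cap.
        if current and size + len(v) > cap:
            chunks.append(current)
            current, size = [], 0
        current.append(v)
        size += len(v)
    if current:
        chunks.append(current)
    return chunks

# With a tiny cap, three 4-byte strings split into two chunks.
parts = chunk_by_size([b"aaaa", b"bbbb", b"cccc"], cap=8)
```

A real Arrow binary column hits the analogous limit when its offsets overflow a 32-bit signed integer (around 2 GiB of string data), which is why a 100M-row string column ends up chunked.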

 


Jingyuan Wang / @alphalfalfa:
Stack trace of the failure when processing 100M-row data (feather-0.3.1):


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 9: invalid continuation byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test.py", line 44, in <module>
    elapsed_ts.append(test(df, i))
  File "test.py", line 20, in test
    df = feather.read_dataframe(fthfilename)
  File "/home/jingyuan/anaconda3/envs/arrow-2059-feather-0.3.1/lib/python3.6/site-packages/feather/api.py", line 93, in read_dataframe
    arr = col.read()
SystemError: <built-in method read of feather.ext.Column object at 0x7f5fdc958a40> returned a result with an error set


Wes McKinney / @wesm:
Thanks @alphalfalfa – this suggests that we need to optimize the string handling path (since strings are different in py2 vs py3, this explains why the performance differs so much). The write failure with the 100M-row table can't be fixed at the moment (see ARROW-1907); as soon as we have R bindings for Arrow (and thus the current iteration of the Feather format code), we can fix it.


Antoine Pitrou / @pitrou:
Here is a profile tree obtained by exercising the write path on Python 3.6:

   - arrow::py::NumPyConverter::ConvertObjectStrings()
      - 80,27% arrow::py::AppendObjectStrings(tagPyArrayObject*, tagPyArrayObject*, long, bool, arrow::StringBuilder*, long*, bool*)
         - 50,74% arrow::py::internal::BuilderAppend(arrow::StringBuilder*, _object*, bool, bool*)
            - 24,95% arrow::BinaryBuilder::Append(unsigned char const*, int)
                 7,43% arrow::BinaryBuilder::AppendNextOffset()
               + 6,28% arrow::BufferBuilder::Resize(long, bool)
                 2,30% __memcpy_avx_unaligned
                 0,71% arrow::ArrayBuilder::Reserve(long)
              6,16% PyUnicode_AsUTF8AndSize
            + 4,37% PyErr_Occurred
         - 16,70% arrow::py::internal::PandasObjectIsNull(_object*)
            - 8,29% arrow::py::internal::PyDecimal_Check(_object*)
                 PyType_IsSubtype
            - 4,59% arrow::py::internal::PyFloat_IsNaN(_object*)
                 PyType_IsSubtype
           2,51% PyArray_MultiplyList
           2,41% PyType_IsSubtype
      + 1,57% arrow::ArrayBuilder::Finish(std::shared_ptr<arrow::Array>*)


Wes McKinney / @wesm:
Thanks @pitrou, do you have some tricks for creating profiles like that (I have made FlameGraphs but not much more)?


Antoine Pitrou / @pitrou:
I used Linux "perf" on a benchmark script, then ran "perf report" to get call trees, and then manually fed some of that output into "c++filt".

The most accessible resource I've found about the "perf" utility is http://www.brendangregg.com/perf.html
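As a sketch of that workflow (the script name and the exact flags here are placeholders, not taken from the thread):

```shell
# Record call stacks while the benchmark runs; DWARF unwinding often
# gives the most complete stacks for C++ extension modules.
perf record --call-graph=dwarf -- python bench.py

# Print an expanded call tree from the recorded perf.data.
perf report --stdio -g

# Pipe through c++filt to demangle any raw C++ symbol names.
perf report --stdio | c++filt | less
```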


Antoine Pitrou / @pitrou:
There doesn't seem to be anything obvious left to do here. Most of the remaining time is spent converting strings around (in Python 3, that means bytes to unicode, or the reverse). There's probably some overhead elsewhere, but nothing stands out.
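To get a rough feel for that string-conversion cost in isolation, a stdlib-only micro-benchmark of UTF-8 encoding could look like the sketch below (the string contents and counts are made up for illustration; this is not the Arrow code path itself):

```python
import timeit

def encode_all(strings):
    """Encode every str to UTF-8 bytes, roughly what a writer must do
    once per object-dtype string cell."""
    return [s.encode("utf-8") for s in strings]

# A timestamp-like string similar to the test CSV's last column.
strings = ["2016-10-03T13:14:22.326Z"] * 100_000
seconds = timeit.timeit(lambda: encode_all(strings), number=10)
```

Comparing this against the per-row write times in the tables above gives a sense of how much of the total is pure encode/decode work versus builder overhead.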


Wes McKinney / @wesm:
Thanks @pitrou for investigating. I agree there doesn't seem to be much more to do. I'll review the patch you opened and resolve this once it's merged (I'll quickly check the write perf to see what things look like on my end after that).


Antoine Pitrou / @pitrou:
The patch was already merged actually (see #2258 ).


Wes McKinney / @wesm:
Oh great, thanks =) Well, I'll run some perf tests locally, post the results here if anything interesting turns up, and then resolve this.


Wes McKinney / @wesm:
I ran some benchmarks locally (quad-core Xeon E3-1535M, Ubuntu 14.04). There is still a slight ~5-10% write performance regression, but reading is faster.

Feather 0.3.1:

# WRITE
$ python bench.py
Elapsed: 15.497231721878052 seconds
Average: 1.549723172187805

# READ
$ python bench.py 
Elapsed: 9.88158106803894 seconds
Average: 0.988158106803894

Feather 0.4.0

# WRITE
$ python bench.py
Elapsed: 16.36524486541748 seconds
Average: 1.636524486541748

# READ
$ python bench.py 
Elapsed: 7.4859395027160645 seconds
Average: 0.7485939502716065
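Working out the ratios from those averages (just arithmetic on the numbers above):

```python
# Average per-iteration times from the benchmark runs above.
write_old, write_new = 1.549723172187805, 1.636524486541748
read_old, read_new = 0.988158106803894, 0.7485939502716065

# 0.4.0 write is ~5.6% slower; 0.4.0 read is ~24% faster.
write_regression_pct = (write_new / write_old - 1) * 100
read_speedup_pct = (1 - read_new / read_old) * 100
```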

Here's the benchmarking script so people can run their own experiments. It would be useful to look at the perf output more closely and see what else we can do to make things faster:

import io
import pickle
import time

import feather
import pandas as pd


def generate_example():
    buf = io.StringIO("""07300003030539,42198997,-1,2016-10-03T13:14:22.326Z
    41130003053286,42224636,-1,2016-09-20T19:31:51.196Z
    """)

    table = pd.read_csv(buf, header=None)
    table = pd.concat([table] * 5000, axis=0, ignore_index=True)
    table = pd.concat([table] * 1000, axis=0, ignore_index=True)

    with open('example.pkl', 'wb') as f:
        pickle.dump(table, f)


def _get_time():
    return time.clock_gettime(time.CLOCK_REALTIME)


class Timer:

    def __init__(self, iterations):
        self.start_time = _get_time()
        self.iterations = iterations

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        elapsed = _get_time() - self.start_time
        print("Elapsed: {0} seconds\nAverage: {1}"
              .format(elapsed, elapsed / self.iterations))


def feather_write_bench(iterations=10):
    with open('example.pkl', 'rb') as f:
        data = pickle.load(f)

    with Timer(iterations):
        for i in range(iterations):
            feather.write_dataframe(data, 'example.fth')


def feather_read_bench(iterations=10):
    import gc
    gc.disable()
    with Timer(iterations):
        for i in range(iterations):
            feather.read_dataframe('example.fth')
    gc.enable()


# generate_example()
# feather_write_bench()
# feather_read_bench()

To use:

  • Run generate_example() to create the data file
  • Run write benchmarks with feather_write_bench()
  • Run read benchmarks with feather_read_bench()


Wes McKinney / @wesm:
I just attached a FlameGraph (perf.svg) created after the work in ARROW-2814.

Some notes:

  • Buffer trimming in BinaryBuilder::FinishInternal is taking up a good amount of time
  • perf isn't recording what symbols are under the libpthread calls; that would be interesting to know.


Antoine Pitrou / @pitrou:
If you get a lot of unknown symbols in the call stacks, in my experience it can help to change the way call stacks are recorded by "perf record". On my machine, for example, only "--call-graph=dwarf" gives good results.


Wes McKinney / @wesm:
Ah great, I will try that out


Uwe Korn / @xhochy:
Also, adding -fno-omit-frame-pointer makes the stacks much clearer. Sadly this comes with a bit of overhead, so one needs to be careful to take the final measurements without this flag.
