[Python] Possible performance regression in Feather read/write path #18037
Comments
Jingyuan Wang / @alphalfalfa: Processing 100M-row files failed on my laptop (16 GB memory) with every combination except Python 2.7 and feather-format 0.3.1. The measurements for 1M rows are as follows:

|python version|feather version|rows|write feather|read feather|
|---|---|---|---|---|
Antoine Pitrou / @pitrou:

```
- arrow::py::NumPyConverter::ConvertObjectStrings()
   - 80,27% arrow::py::AppendObjectStrings(tagPyArrayObject*, tagPyArrayObject*, long, bool, arrow::StringBuilder*, long*, bool*)
      - 50,74% arrow::py::internal::BuilderAppend(arrow::StringBuilder*, _object*, bool, bool*)
         - 24,95% arrow::BinaryBuilder::Append(unsigned char const*, int)
              7,43% arrow::BinaryBuilder::AppendNextOffset()
            + 6,28% arrow::BufferBuilder::Resize(long, bool)
              2,30% __memcpy_avx_unaligned
              0,71% arrow::ArrayBuilder::Reserve(long)
           6,16% PyUnicode_AsUTF8AndSize
         + 4,37% PyErr_Occurred
      - 16,70% arrow::py::internal::PandasObjectIsNull(_object*)
         - 8,29% arrow::py::internal::PyDecimal_Check(_object*)
              PyType_IsSubtype
         - 4,59% arrow::py::internal::PyFloat_IsNaN(_object*)
              PyType_IsSubtype
           2,51% PyArray_MultiplyList
           2,41% PyType_IsSubtype
   + 1,57% arrow::ArrayBuilder::Finish(std::shared_ptr<arrow::Array>*)
```
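For context, `PandasObjectIsNull` in the profile above runs once per array element and has to try several type checks before concluding a value is not null. A rough Python-level sketch of that logic (`object_is_null` is illustrative, not Arrow's actual code) shows why the per-element cost adds up:

```python
import decimal
import math

import numpy as np
import pandas as pd


def object_is_null(obj):
    # Mirrors the checks visible in the profile: None, pandas NaT,
    # float NaN, and decimal NaN, each costing a type check per element.
    if obj is None or obj is pd.NaT:
        return True
    if isinstance(obj, float):            # PyFloat_IsNaN in the profile
        return math.isnan(obj)
    if isinstance(obj, decimal.Decimal):  # PyDecimal_Check in the profile
        return obj.is_nan()
    return False


# Invoking this once per element of an object array is the hot loop:
values = np.array(['a', None, float('nan'), 'b'], dtype=object)
print([object_is_null(v) for v in values])  # [False, True, True, False]
```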
Antoine Pitrou / @pitrou: The most accessible resource I've found about the `perf` utility is http://www.brendangregg.com/perf.html
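For reference, a typical way to capture a call-graph profile like the one above (assuming a Linux machine with perf installed and the benchmark script below saved as `bench.py`) is:

```
$ perf record -g -- python bench.py
$ perf report
```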
Wes McKinney / @wesm: Feather 0.3.1:

```
# WRITE
$ python bench.py
Elapsed: 15.497231721878052 seconds
Average: 1.549723172187805

# READ
$ python bench.py
Elapsed: 9.88158106803894 seconds
Average: 0.988158106803894
```

Feather 0.4.0:

```
# WRITE
$ python bench.py
Elapsed: 16.36524486541748 seconds
Average: 1.636524486541748

# READ
$ python bench.py
Elapsed: 7.4859395027160645 seconds
Average: 0.7485939502716065
```

Here's the benchmarking script so people can run their own experiments. It would be useful to look at the perf output more closely and see what else we can do to make things faster:

```python
import io
import pickle
import time

import feather
import pandas as pd


def generate_example():
    # Two seed rows, replicated 5000 * 1000 times for 10M rows total.
    buf = io.StringIO("""07300003030539,42198997,-1,2016-10-03T13:14:22.326Z
41130003053286,42224636,-1,2016-09-20T19:31:51.196Z
""")
    table = pd.read_csv(buf, header=None)
    table = pd.concat([table] * 5000, axis=0, ignore_index=True)
    table = pd.concat([table] * 1000, axis=0, ignore_index=True)
    with open('example.pkl', 'wb') as f:
        pickle.dump(table, f)


def _get_time():
    return time.clock_gettime(time.CLOCK_REALTIME)


class Timer:

    def __init__(self, iterations):
        self.iterations = iterations

    def __enter__(self):
        # Start the clock on entry, not at construction.
        self.start_time = _get_time()
        return self

    def __exit__(self, exc_type, exc_value, tb):
        elapsed = _get_time() - self.start_time
        print("Elapsed: {0} seconds\nAverage: {1}"
              .format(elapsed, elapsed / self.iterations))


def feather_write_bench(iterations=10):
    with open('example.pkl', 'rb') as f:
        data = pickle.load(f)
    with Timer(iterations):
        for i in range(iterations):
            feather.write_dataframe(data, 'example.fth')


def feather_read_bench(iterations=10):
    import gc
    # Keep garbage-collection pauses out of the read timings.
    gc.disable()
    with Timer(iterations):
        for i in range(iterations):
            feather.read_dataframe('example.fth')
    gc.enable()


# generate_example()
# feather_write_bench()
# feather_read_bench()
```

To use, uncomment the calls at the bottom: run `generate_example()` once, then one benchmark per run.
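Following up on the profile, most of the write time appears to go into converting the object-dtype string column. A sketch like the following can time that conversion in isolation (whether `pa.array` routes through the exact `NumPyConverter::ConvertObjectStrings` path shown above depends on the pyarrow version, so treat this as an approximation):

```python
import time

import numpy as np
import pyarrow as pa

# One object-dtype string column roughly comparable to the benchmark data.
values = np.array(['07300003030539'] * 1_000_000, dtype=object)

t0 = time.perf_counter()
arr = pa.array(values, type=pa.string())  # NumPy object array -> Arrow string array
print('convert: {:.3f} seconds'.format(time.perf_counter() - t0))
```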
See the discussion in wesm/feather#329. This needs to be investigated.
Reporter: Wes McKinney / @wesm
Assignee: Antoine Pitrou / @pitrou
Note: This issue was originally created as ARROW-2059. Please see the migration documentation for further details.