Status and roadmap #1

martindurant · 2016-10-20T17:38:37Z

Features to be implemented.
An asterisk shows the next item(s) on the list.
A question mark shows something that might (almost) work, but isn't tested.

python2 compatibility

Reading

Writing

Admin

Features not to be attempted

nested schemas (maybe can find a way to flatten or encode as dicts)
choice of encoding on write? (keep it simple)
schema evolution

mrocklin · 2016-10-20T18:04:35Z

Can I make a request for an additional section for Administrative topics like packaging, documentation, etc..

What do we need for Dask.dataframe integration? Presumably we're depending on dask.bytes.open_files?

martindurant · 2016-10-20T18:22:18Z

Yes, passing a file-like object that can be resolved in each worker would do: core.read_col currently takes an open file-object or a string that can be opened within the function. It probably should take a function to create a file object given a path (a parquet metadata file will reference other files with relative paths).
The only places that reading actually happens is core.read_thrift (where the size of the thrift structure is not known) and core._read_page (where the size in bytes is known). The former is small and would fit within a read-ahead buffer, the latter can be formed in terms of dask's read_bytes.

lomereiter · 2016-12-03T08:24:44Z

Hi there,

Are you guys aware of ongoing PyArrow development? It is also already on conda-forge and also has pandas <-> parquet read/write (through Arrow), although I don't think it supports multi-file yet.

mrocklin · 2016-12-03T11:44:49Z

@lomereiter Yes, we're very aware. We've been waiting for comprehensive Parquet read-write functionality from Arrow for a long while. Hopefully fastparquet is just a stopgap measure until PyArrow matures as a comprehensive solution.

teh · 2016-12-07T19:48:38Z

Hi, amazing work. Two things I noticed:

pytest required at runtime (imported in utils.py) which is a bit unusual
if column names are not string types then saving fails (e.g. AttributeError: 'int' object has no attribute 'encode')

frol · 2017-03-01T19:56:41Z

Since @lomereiter mentioned PyArrow, I will just leave this link here: Extreme IO performance with parallel Apache Parquet in Python

martindurant · 2017-03-01T20:26:00Z

Thanks @frol . That there are multiple projects pushing on parquet for python is a good thing. You should also have linked to the previous posting python-parquet-update (Wes's work, not mine) which shows that fastparquet and arrow have very similar performance in many cases.

Note also that fastparquet is designed to run in parallel using dask, allowing distributed data access, and reading from remote stores such as s3.

frol · 2017-03-01T20:29:54Z

@martindurant Thank you! I was actually looking out there for some sorts of benchmarks for fastparquet as I am going to use it with Dask. It would be very helpful to have some info about benchmarks in the documentation as "fast" suffix in the project name implies the focus on speed, but I failed to find any info on this until you pointed me to this article.

martindurant · 2017-03-01T20:37:02Z

There are some raw benchmarks in https://github.com/dask/fastparquet/blob/master/fastparquet/benchmarks/columns.py

My colleagues at datashader did some benchmarking on census data at the time when we were focusing on performance. Their numbers include both loading and performing aggregations on the data.

martindurant mentioned this issue Oct 28, 2016

Fast writer #3

Merged

martindurant mentioned this issue Nov 21, 2016

Code review #32

Closed

martindurant mentioned this issue Jan 11, 2017

Taking dependency on FastParquet in production #64

Closed

martindurant closed this as completed Feb 9, 2017

MayankKr8 mentioned this issue Jan 18, 2021

Reduced number of categories while loading categorical data #548

Closed

michael-sanders mentioned this issue Apr 6, 2022

fastparquet 0.8.1 Linux aarch64 builds are missing #770

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Status and roadmap #1

Status and roadmap #1

martindurant commented Oct 20, 2016 •

edited

Loading

mrocklin commented Oct 20, 2016

martindurant commented Oct 20, 2016 •

edited

Loading

lomereiter commented Dec 3, 2016

mrocklin commented Dec 3, 2016

teh commented Dec 7, 2016 •

edited

Loading

frol commented Mar 1, 2017

martindurant commented Mar 1, 2017 •

edited

Loading

frol commented Mar 1, 2017

martindurant commented Mar 1, 2017

Status and roadmap #1

Status and roadmap #1

Comments

martindurant commented Oct 20, 2016 • edited Loading

Reading

Writing

Admin

Features not to be attempted

mrocklin commented Oct 20, 2016

martindurant commented Oct 20, 2016 • edited Loading

lomereiter commented Dec 3, 2016

mrocklin commented Dec 3, 2016

teh commented Dec 7, 2016 • edited Loading

frol commented Mar 1, 2017

martindurant commented Mar 1, 2017 • edited Loading

frol commented Mar 1, 2017

martindurant commented Mar 1, 2017

martindurant commented Oct 20, 2016 •

edited

Loading

martindurant commented Oct 20, 2016 •

edited

Loading

teh commented Dec 7, 2016 •

edited

Loading

martindurant commented Mar 1, 2017 •

edited

Loading