fastparquet

Build status: https://travis-ci.org/jcrobak/parquet-python.svg?branch=master

fastparquet is a python implementation of the parquet format, aiming to integrate into python-based big data work-flows.

Not all parts of the parquet format have been implemented or tested yet; see, e.g., the Todos linked below. That said, fastparquet is capable of reading all the data files from the parquet-compatibility project.

Introduction

Details of this project can be found in the documentation.

The original plan listing expected features can be found in this issue. Please feel free to comment on that list regarding missing items and priorities, or raise new issues with bugs or requests.

Requirements

(all development is against recent versions in the default anaconda channels)

Required:

  • numba (requires LLVM 4.0.x)
  • numpy
  • pandas
  • cython
  • six

Optional (compression algorithms; gzip is always available):

  • snappy (aka python-snappy)
  • lzo
  • brotli
  • lz4
  • zstandard

Installation

Install using conda:

conda install -c conda-forge fastparquet

Install from PyPI:

pip install fastparquet

Or install the latest version from GitHub:

pip install git+https://github.com/dask/fastparquet

For the pip methods, numba must already be installed (using conda).
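
A quick sanity check after installing (a minimal sketch; the exact version string printed will vary):

python -c "import fastparquet; print(fastparquet.__version__)"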

Usage

Reading

from fastparquet import ParquetFile
pf = ParquetFile('myfile.parq')
df = pf.to_pandas()
df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])

You may specify which columns to load and which of those to keep as categorical (if the data uses dictionary encoding). The file path can be a single file, a metadata file pointing to other data files, or a directory (tree) containing data files. The latter is what is typically output by hive/spark.
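
For example, the same interface can point at a hive/spark-style directory of data files; the path and column names below are illustrative only:

from fastparquet import ParquetFile
pf = ParquetFile('mydata_dir/')              # directory of part files (optionally with _metadata)
df = pf.to_pandas(columns=['col1', 'col2'],  # load only these columns
                  categories=['col1'])       # keep col1 as a pandas categorical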

Writing

from fastparquet import write
write('outfile.parq', df)
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
      compression='GZIP', file_scheme='hive')

The default is to produce a single output file with a single row-group (i.e., logical segment) and no compression. At the moment, only simple data-types and plain encoding are supported, so expect performance to be similar to numpy.savez.
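
A minimal round-trip sketch, assuming an in-memory dataframe (the data and file name here are made up):

import pandas as pd
from fastparquet import write, ParquetFile

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})
write('roundtrip.parq', df, compression='SNAPPY')  # needs python-snappy; 'GZIP' works without extras
back = ParquetFile('roundtrip.parq').to_pandas()   # read it back into pandas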

History

Since early October 2016, this fork of parquet-python has been undergoing considerable redevelopment. The aim is to have a small, simple, and performant library for reading and writing the parquet format from python.
