
Add support to Write nullable pandas types (Int*|UInt*|boolean|string) #525

Merged · 3 commits · Nov 17, 2020

Conversation

@haleemur commented Nov 8, 2020

Following the discussion on PR-483

As per @TomAugspurger's comment, perhaps we should split this into two PRs: one that copes with those extension types at write time, if we come across them, and another one (to be merged only later) that creates new extension types on reading for all relevant "optional" columns.

I have added support for writing pandas nullable types. The behaviour of reading columns containing null values is unchanged: boolean & integer columns with null values are upcast to an appropriately sized float, and string columns with null values are read as object. All existing code will continue to work as-is.

import pandas as pd
import fastparquet as fp

df = pd.DataFrame({
  'a': pd.Series([1, 2, pd.NA], dtype='Int32'),
  'b': pd.Series([1, 2, 3], dtype='UInt32'),
  'c': pd.Series([True, False, pd.NA], dtype='boolean'),
  'd': pd.Series(['hello', 'world', pd.NA], dtype='string')
})
fp.write('test.pq', df, compression='snappy')
out = fp.ParquetFile('test.pq')
outdf = out.to_pandas()
outdf
# prints:
         a  b    c      d
    0  1.0  1  1.0  hello
    1  2.0  2  0.0  world
    2  NaN  3  NaN   None

print(outdf.dtypes)
# prints:
    a    float64
    b     uint32
    c    float16
    d     object
    dtype: object

print(out.schema)
# prints:
    - schema: 
    | - a: INT32, OPTIONAL
    | - b: INT32, UINT_32, OPTIONAL
    | - c: BOOLEAN, OPTIONAL
      - d: BYTE_ARRAY, UTF8, OPTIONAL

Inference of optional types on read was omitted on purpose. But if needed, the numpy types can be converted to their nullable counterparts like below:

outdf.astype({
  'a': 'Int32',
  'b': 'UInt32',
  'c': 'boolean',
  'd': 'string'
})
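(convert_dtypes does not accept a per-column mapping, so astype is used for the explicit cast.) With that cast applied, the dtypes match the original frame again, a quick check using the objects from the snippet above:

print(outdf.astype({'a': 'Int32', 'b': 'UInt32', 'c': 'boolean', 'd': 'string'}).dtypes)
# prints:
    a      Int32
    b     UInt32
    c    boolean
    d     string
    dtype: object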

This PR solves the issue ValueError: Don't know how to convert data type: Int64.
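For reference, a minimal sketch of that failure on master before this change (the file name here is illustrative):

import pandas as pd
import fastparquet as fp

df = pd.DataFrame({'a': pd.Series([1, 2, pd.NA], dtype='Int64')})
fp.write('int64.pq', df)
# on master: ValueError: Don't know how to convert data type: Int64
# with this PR: the column is written as INT64, OPTIONAL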

It will also allow using fastparquet to generate load files in ETL processes where the target is a database: Postgres & Redshift will refuse to load from a parquet file if a column is declared as an integer in the database but stored as a float in the parquet file.

While working on this PR, I also encountered & resolved a bug where unsigned integer columns with some null values in a parquet file would not be converted properly by out.to_pandas(). To explore this bug, please generate a test output (i.e. run the code snippet above) using the branch haleemur:feat/write-optional-types and then attempt reading the test output using the master branch.
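A minimal sketch of that repro (write on this branch, then read the same file on master; the file name is illustrative):

import pandas as pd
import fastparquet as fp

# on haleemur:feat/write-optional-types:
df = pd.DataFrame({'u': pd.Series([1, 2, pd.NA], dtype='UInt32')})
fp.write('uint_nulls.pq', df, compression='snappy')

# on master:
out = fp.ParquetFile('uint_nulls.pq')
out.to_pandas()  # the nullable unsigned column is not converted properly here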

add support for `string` type
bump requirements
fix bug on matching type
@martindurant (Member)

Thanks for the PR!
I will try to look at this soon. However, one test is failing (fastparquet/test/test_output.py::test_read_partitioned_and_write_with_empty_partions) in one of the builds - perhaps depending on the pandas version. Would you mind checking?

@haleemur (Author) commented Nov 9, 2020

Thank you. I'm looking into it. What is strange is that the test also fails on my machine even when I'm on the master branch. Maybe the failure depends on another library.

@martindurant (Member)

Right, probably dependent on other package versions

@haleemur (Author)

The test failed as a result of a behaviour change between pandas versions 1.1.3 & 1.1.4.

The dataframe created by df_filtered = ParquetFile(tempdir).to_pandas(filters=[('a', '==', 'b')]) renders the column data.a as categorical, with 3 possible categories (a, b, c) and only 1 category (b) actually present.

In the function partition_on_columns, the data is written to disk according to the partitioning columns.

First, the data is grouped by the partitioning columns: gb = data.groupby(columns).

In 1.1.3, gb.indices returns a dictionary of 3 items:

{ 'a': [], 'b': array([...]), 'c': []}

In 1.1.4, gb.indices returns a dictionary of 1 item:

{'b': array([...])}
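A standalone sketch of that difference, with illustrative data mirroring the test's partition column:

import pandas as pd

data = pd.DataFrame({
    'a': pd.Categorical(['b', 'b'], categories=['a', 'b', 'c']),
    'x': [1, 2],
})
gb = data.groupby('a')
print(gb.indices)
# pandas 1.1.3: {'a': array([]), 'b': array([0, 1]), 'c': array([])}  (empty groups included)
# pandas 1.1.4: {'b': array([0, 1])}                                  (only observed groups)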

This then changes the behaviour of zip(sorted(data.indices), sorted(gb)), resulting in the test failure. The specific failure scenario is:

  • data is partitioned and written to disk with `file_scheme='hive'`
  • the partitioning columns are categorical
  • some partitions do not have data

The latest version of this PR resolves the issue by replacing zip(sorted(data.indices), sorted(gb)) with sorted(gb). This works because sorted(gb) returns a list of tuples, where the first element of each tuple is the column value (or tuple of column values) on which the data is split; data.indices is therefore redundant.
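A sketch of the iteration the fixed code relies on (write_partition is a hypothetical stand-in for the actual write call, not fastparquet API):

for value, part in sorted(gb):
    # value: the partition key, e.g. 'b', or a tuple when there are several partitioning columns
    # part: the sub-DataFrame belonging to that partition
    write_partition(value, part)  # hypothetical helper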

Currently all tests are passing.

@haleemur (Author)

@martindurant please let me know when you get a chance to review this PR. All tests are passing, and it solves interop issues with the latest pandas version.

@martindurant (Member)

This is excellent, thank you, and sorry to keep you waiting.
I wonder if the int8/16 options are actually faster to write...

We can discuss separately about possible timing for implementing the reading side, perhaps as an option at first.
