
Add support to Write nullable pandas types (Int*|UInt*|boolean|string) #525

Merged · 3 commits · Nov 17, 2020

Conversation

@haleemur commented Nov 8, 2020

Following the discussion on PR-483

As per @TomAugspurger's comment, perhaps we should split this into two PRs: one that copes with those extension types at write time, if we come across them, and another one (to be merged only later) that creates new extension types on reading for all relevant "optional" columns.

I have added support for writing pandas nullable types. The behaviour of reading columns containing null values is unchanged: boolean & integer columns with null values are upcast to an appropriately sized float, and string columns with null values are read as object. All existing code will continue to work as-is.

import pandas as pd
import fastparquet as fp

df = pd.DataFrame({
  'a': pd.Series([1, 2, pd.NA], dtype='Int32'),
  'b': pd.Series([1, 2, 3], dtype='UInt32'),
  'c': pd.Series([True, False, pd.NA], dtype='boolean'),
  'd': pd.Series(['hello', 'world', pd.NA], dtype='string')
})
fp.write('test.pq', df, compression='snappy')
out = fp.ParquetFile('test.pq')
outdf = out.to_pandas()
outdf
# prints:
         a  b    c      d
    0  1.0  1  1.0  hello
    1  2.0  2  0.0  world
    2  NaN  3  NaN   None

print(outdf.dtypes)
# prints:
    a    float64
    b     uint32
    c    float16
    d     object
    dtype: object

print(out.schema)
# prints:
    - schema: 
    | - a: INT32, OPTIONAL
    | - b: INT32, UINT_32, OPTIONAL
    | - c: BOOLEAN, OPTIONAL
      - d: BYTE_ARRAY, UTF8, OPTIONAL

Inference of optional types on read was omitted on purpose. But if needed, the numpy types can be converted to their nullable counterparts like below:

outdf.astype({
  'a': 'Int32',
  'b': 'UInt32',
  'c': 'boolean',
  'd': 'string'
})
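(convert_dtypes does not accept a per-column mapping, so astype is used for the explicit cast.) With that cast applied, the dtypes match the original frame again, a quick check using the objects from the snippet above:

print(outdf.astype({'a': 'Int32', 'b': 'UInt32', 'c': 'boolean', 'd': 'string'}).dtypes)
# prints:
    a      Int32
    b     UInt32
    c    boolean
    d     string
    dtype: object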

This PR solves the issue ValueError: Don't know how to convert data type: Int64.
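For reference, a minimal sketch of that failure on master before this change (the file name here is illustrative):

import pandas as pd
import fastparquet as fp

df = pd.DataFrame({'a': pd.Series([1, 2, pd.NA], dtype='Int64')})
fp.write('int64.pq', df)
# on master: ValueError: Don't know how to convert data type: Int64
# with this PR: the column is written as INT64, OPTIONAL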

It will also allow using fastparquet to generate load files in ETL processes where the target is a database: Postgres & Redshift will refuse to load from a parquet file if a column is declared as an integer in the database but stored as a float in the parquet file.

While working on this PR, I also encountered & resolved a bug where unsigned integer columns with some null values in a parquet file would not be converted properly by out.to_pandas(). To explore this bug, please generate a test output (i.e. run the code snippet above) using the branch haleemur:feat/write-optional-types and then attempt reading the test output using the master branch.
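A minimal sketch of that repro (write on this branch, then read the same file on master; the file name is illustrative):

import pandas as pd
import fastparquet as fp

# on haleemur:feat/write-optional-types:
df = pd.DataFrame({'u': pd.Series([1, 2, pd.NA], dtype='UInt32')})
fp.write('uint_nulls.pq', df, compression='snappy')

# on master:
out = fp.ParquetFile('uint_nulls.pq')
out.to_pandas()  # the nullable unsigned column is not converted properly here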

add support for `string` type
bump requirements
fix bug on matching type
@martindurant (Member)

Thanks for the PR!
I will try to look at this soon. However, one test is failing (fastparquet/test/test_output.py::test_read_partitioned_and_write_with_empty_partions) in one of the builds - perhaps depending on the pandas version. Would you mind checking?

@haleemur (Author) commented Nov 9, 2020

Thank you. I'm looking into it. What is strange is that the test also fails on my machine even when I'm on the master branch. Maybe the failure depends on another library.

@martindurant (Member)

Right, probably dependent on other package versions

@haleemur (Author)

The test failed as a result of a behaviour change between pandas versions 1.1.3 & 1.1.4.

The dataframe created by df_filtered = ParquetFile(tempdir).to_pandas(filters=[('a', '==', 'b')]) renders the column data.a as categorical, with 3 possible categories (a, b, c) and only 1 category (b) actually present.

In the function partition_on_columns, the data is written to disk according to the partitioning columns.

First, the data is grouped by the partitioning columns: gb = data.groupby(columns).

In 1.1.3, gb.indices returns a dictionary of 3 items:

{ 'a': [], 'b': array([...]), 'c': []}

In 1.1.4, gb.indices returns a dictionary of 1 item:

{'b': array([...])}
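A standalone sketch of that difference, with illustrative data mirroring the test's partition column:

import pandas as pd

data = pd.DataFrame({
    'a': pd.Categorical(['b', 'b'], categories=['a', 'b', 'c']),
    'x': [1, 2],
})
gb = data.groupby('a')
print(gb.indices)
# pandas 1.1.3: {'a': array([]), 'b': array([0, 1]), 'c': array([])}  (empty groups included)
# pandas 1.1.4: {'b': array([0, 1])}                                  (only observed groups)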

This then changes the behaviour of zip(sorted(data.indices), sorted(gb)), resulting in the test failure. The specific failure scenario is:

  • data is partitioned and written to disk with `file_scheme='hive'`
  • the partitioning columns are categorical
  • some partitions do not have data

The latest version of this PR resolves the issue by replacing zip(sorted(data.indices), sorted(gb)) with sorted(gb). This works because sorted(gb) returns a list of tuples, where the first element of each tuple is the column value (or tuple of column values) on which the data is split; data.indices is therefore redundant.
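A sketch of the iteration the fixed code relies on (write_partition is a hypothetical stand-in for the actual write call, not fastparquet API):

for value, part in sorted(gb):
    # value: the partition key, e.g. 'b', or a tuple when there are several partitioning columns
    # part: the sub-DataFrame belonging to that partition
    write_partition(value, part)  # hypothetical helper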

Currently all tests are passing.

@haleemur (Author)

@martindurant please let me know when you get a chance to review this PR. All tests are passing, and it solves interop issues with the latest pandas version.

@martindurant (Member)

This is excellent, thank you, and sorry to keep you waiting.
I wonder if the int8/16 options are actually faster to write...

We can discuss separately about possible timing for implementing the reading side, perhaps as an option at first.
