Removing the "has_nulls" option from to_parquet, or set default to True? #2177

Open
wesm opened this Issue Apr 5, 2017 · 23 comments

Contributor

wesm commented Apr 5, 2017 edited

In dask.dataframe.io.to_parquet:

    has_nulls : bool, list or None
        Specifies whether to write NULLs information for columns. If bools,
        apply to all columns, if list, use for only the named columns, if None,
        use only for columns which don't have a sentinel NULL marker (currently
        object columns only).

I am looking at the logic in:

https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L626

My understanding of this is that the default behavior is to not write nulls in the Parquet definition levels (i.e. RepetitionType::REQUIRED) for float32/64 types. I'm concerned this is a dangerous default because the Parquet files written will not, in general, be readable by other systems that use Parquet files. Would you be open to removing this option altogether (respecting pandas's NULL semantics) or changing its default value to True?
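For readers unfamiliar with the format, here is a rough sketch of what OPTIONAL buys you. This is purely illustrative, not fastparquet's actual implementation: an OPTIONAL column carries a definition level per slot, so nulls are recorded explicitly instead of relying on a NaN sentinel in the data.

```python
import math

# Illustrative sketch only -- not fastparquet internals. For an OPTIONAL
# column, Parquet records a definition level per slot: 0 marks a null,
# 1 marks a present value; only present values land in the data pages.
def encode_optional(values):
    def_levels = [0 if (isinstance(v, float) and math.isnan(v)) else 1
                  for v in values]
    data = [v for v, lvl in zip(values, def_levels) if lvl == 1]
    return def_levels, data

levels, data = encode_optional([1.0, float('nan'), 3.0])
# levels == [1, 0, 1]; data == [1.0, 3.0]
# A REQUIRED column has no definition levels at all, so the NaN would be
# stored as an ordinary float value instead of a null.
```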

cc @martindurant

Contributor

wesm commented Apr 7, 2017 edited

Bumping this issue. If files are being written with nonzero probability of being read by Spark, Hive, Drill, Impala, or any other SQL-on-Hadoop system, then the default options are producing files whose results in those systems may be incorrect.

If using has_nulls=False yields better performance in a dask-to-dask roundtrip, that seems OK, but it should be something users explicitly opt in to. You should also clearly document that this will produce incompatible Parquet files.

Contributor

martindurant commented Apr 7, 2017

I don’t see how it can be incorrect to save the value NaN as opposed to NULL in a parquet file, when the input is a pandas float column in which, as far as I know, there is no difference between the two.

Contributor

wesm commented Apr 7, 2017

The NaN values in pandas are semantically NULL; when you write them as NaN they are no longer considered to be NULL by other systems. So

pd.isnull(value)

will yield different results from ISNULL(value) in Spark/Hive/Impala/Presto/Drill/etc.
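A minimal illustration of the pandas side of this (the SQL-engine side can't be shown inline, but any engine that reads the NaN from a REQUIRED float column sees a present value, not a null):

```python
import numpy as np
import pandas as pd

value = np.nan
print(pd.isnull(value))   # True: pandas treats NaN as missing
# An engine that reads the same NaN from a REQUIRED float column sees a
# present (non-null) value, so its IS NULL check returns false and
# aggregates like SUM propagate the NaN instead of skipping it.
print(pd.Series([1.0, np.nan, 2.0]).sum())   # 3.0: pandas skips NaN
```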

Contributor

martindurant commented Apr 7, 2017

"The NaN values in pandas are semantically NULL" - so how does one achieve a semantic NaN?

Contributor

wesm commented Apr 7, 2017

You cannot -- NaN is null:

In [1]: import pandas as pd

In [2]: s = pd.Series([-1, -1, -1])

In [3]: import numpy as np

In [4]: nans = np.sqrt(s)
/home/wesm/anaconda3/envs/arrow-test/bin/ipython:1: RuntimeWarning: invalid value encountered in sqrt
  #!/home/wesm/anaconda3/envs/arrow-test/bin/python

In [5]: nans
Out[5]: 
0   NaN
1   NaN
2   NaN
dtype: float64

In [6]: nans.isnull()
Out[6]: 
0    True
1    True
2    True
dtype: bool
Owner

mrocklin commented Apr 7, 2017

I think that the following is a simple and motivating use case in favor of treating nans/nulls diligently (what @wesm describes). I haven't thought about this topic enough to have a strong opinion, but I think that this example might be useful.

Imagine you read the following CSV file

a,b
1,1
2,2
3,NA
4,4

Pandas treats the missing values here properly as missing values

>>> df = pd.read_csv('myfile.csv')
>>> df.b.sum()
7

But now we move the data to parquet and then read with Spark and try the same thing:

>>> df.agg({'b': 'sum'})
NaN

So either Spark would need to interpret NaNs as nulls (which I suspect is unlikely and probably incorrect from their perspective) or we are transforming the data so that it produces different results on different tools.
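The pandas half of the example above can be run end-to-end (using io.StringIO in place of a file on disk):

```python
import io
import pandas as pd

csv = "a,b\n1,1\n2,2\n3,NA\n4,4\n"
df = pd.read_csv(io.StringIO(csv))
total = df.b.sum()   # pandas parses NA as NaN and skips it: 1 + 2 + 4 = 7
print(total)         # 7.0
```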

Contributor

wesm commented Apr 7, 2017

Furthermore, I believe that RepetitionType::REQUIRED should be avoided altogether, since if you append data to a dataset written initially by Dask, you may have a schema conflict (many systems use OPTIONAL exclusively).

wesm referenced this issue in pandas-dev/pandas Apr 9, 2017:

ENH: add to/from_parquet with pyarrow & fastparquet #15838
Contributor

wesm commented Apr 13, 2017

Here's an example of writing a DataFrame with nulls with the default options and then querying it with Impala:

Write the file:

import dask.dataframe as dd, numpy as np, pandas as pd
df = pd.DataFrame({'key': [u'a', u'b', u'c'] * 100,
                   'value': np.random.randn(300)})
df['value'][::4] = np.nan
ddf = dd.from_pandas(df, npartitions=1)

ddf.to_parquet('proper_foo.parquet', write_index=False)

Do a groupby with pandas:

df.groupby('key').value.sum()
Out[11]:
key
a    13.728379
b    -0.028474
c    -2.610579
Name: value, dtype: float64

Create an EXTERNAL TABLE in Impala referencing the new Parquet file:

import ibis
import pandas as pd
hdfs = ibis.hdfs_connect('localhost', port=5070)
con = ibis.impala.connect('localhost', port=21050, hdfs_client=hdfs)
ibis.options.verbose = True

path = '/tmp/fastparquet-test'
hdfs.mkdir(path)

hdfs.put('/tmp/fastparquet-test/0.parq', 'proper_foo.parquet/part.0.parquet')

Create a table referencing the file:

>>> pf = con.parquet_file(path)
SHOW DATABASES LIKE '__ibis_tmp'
CREATE EXTERNAL TABLE __ibis_tmp.`__ibis_tmp_524b11f116a04d29add65f1bd9bfee8c`
LIKE PARQUET '/tmp/fastparquet-test/0.parq'
STORED AS PARQUET
LOCATION '/tmp/fastparquet-test'

Now do the same groupby, which yields NaN results:

cur = con.con.execute('select key, sum(value) from __ibis_tmp.`__ibis_tmp_524b11f116a04d29add65f1bd9bfee8c` group by key')
cur.fetchall()
select key, sum(value) from __ibis_tmp.`__ibis_tmp_524b11f116a04d29add65f1bd9bfee8c` group by key
Out[18]:
[('b', nan), ('a', nan), ('c', nan)]

For the sake of compatibility, in Dask I strongly recommend:

  • Always use OPTIONAL repetition type
  • Write with has_nulls=True

I think it's OK if you want to have a special option for fastparquet users to disable compatibility but give better performance, but it should not be the default.

Contributor

wesm commented Apr 13, 2017

@cpcloud your example has a mistake because the type of a is bigint (int64), not int.

Contributor

cpcloud commented Apr 13, 2017

Removed, the example works with proper types.

Owner

mrocklin commented Apr 13, 2017

Checking in on the status here. @martindurant did you have any thoughts?

Contributor

martindurant commented Apr 17, 2017

Sorry for my relative silence on this issue. Here are a couple of further thoughts:

  • From a more scientific point of view, there is an argument that NaN should mean NaN, the result of an invalid calculation, and not the same as null or missing data. Of course, this is not what pandas normally does, and I'm sure the reasons for that were thorough (it had to choose one or the other).
  • I note that Spark does have a concept of NaN which is not the same as null, and if we have has_nulls=True by default, the user will never produce them.
  • The performance cost of reading a float column with nulls is about 30% in my benchmarks. There is a tradeoff between compatibility and performance, which depends on the user's purpose. There are other compatibility options in a similar category, and it seems that the implementation of the parquet options is spotty across the various reader frameworks. (btw: can anyone send me parquet data containing delta-encoded ints??)

I am OK with saying that performance comes second to compatibility in general, as long as the loss is not too big, provided we are conscious that that is the decision being made.

On the point that all schema fields should be OPTIONAL: this sounds pretty odd to me; no designer of a database table would agree that the primary key should be optional, even if in practice it is never null.

Contributor

wesm commented Apr 17, 2017

from a more scientist point of view, there is an argument that nan should mean nan as the result of an invalid calculation and not the same as null or missing data. Of course, this is not what pandas normally does, and I'm sure the reasons for that were thorough (had to choose one or the other).

Mathematically, you are right, but pandas is not a library for scientific computing; it's for business analytics and data analysis (where people don't seem to care about NaNs). In 2008 there was substantial precedent in the financial analytics world for using NaN for performance and convenience (null propagation in hardware, also in operations involving floats and ints together). In pandas 2.0, as in databases and Spark and other big data systems, there will be a semantic distinction between NaN and null, but that is still a good ways off.

If you have an application where NaN is a meaningful value, I don't recommend using pandas at all -- better to use NumPy directly.

On the point that all schema fields should be OPTIONAL: this sounds pretty odd to me, no designer of a database table would agree that the primary key should be optional, even if in practice it never is.

This is a practical argument based on empirical Parquet use, where the columns are always nullable. Hive and Impala do not have NOT NULL, for example (see https://github.com/apache/incubator-impala/blob/768fc0ea2773289b88256ea16090c0cfcf2d0a97/be/src/exec/hdfs-parquet-table-writer.cc#L845). So if

  • you append data with REQUIRED to an existing Hive or Impala table or
  • you append data from Hive or Impala to a dataset written initially by fastparquet

then some downstream readers of those datasets may raise errors due to some of the files having "different" schemas from the others.
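A toy sketch of the kind of mismatch a strict reader might hit (purely illustrative; the dict representation is made up and this is not any engine's actual check):

```python
# Purely illustrative: repetition types recorded per column in two files
# that are supposed to belong to the same logical dataset.
file_written_by_fastparquet = {'key': 'REQUIRED', 'value': 'REQUIRED'}
file_appended_by_impala     = {'key': 'OPTIONAL', 'value': 'OPTIONAL'}

conflicts = sorted(
    col for col in file_written_by_fastparquet
    if file_written_by_fastparquet[col] != file_appended_by_impala.get(col)
)
print(conflicts)   # ['key', 'value']: a strict reader may refuse to treat
                   # these files as one dataset
```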

Contributor

martindurant commented Apr 17, 2017

Ah, I misunderstood your earlier statement about appends, I had honestly not thought of people using different tools to append to the same dataset. Does that happen?

Contributor

wesm commented Apr 18, 2017

In the case of Hive or Impala, the Hive metastore allows you to create partitions that can point to any directory in HDFS -- it's not unusual for different ETL processes to dump data into a given logical dataset. See e.g. https://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/

Member

pitrou commented Apr 19, 2017

@wesm's arguments sound reasonable to me. If it's important that people be able to choose performance presets easily, perhaps this should be handled by a dialect or compatibility setting in fastparquet, so that users don't have to bother with individual configuration options such as has_nulls?
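One hypothetical shape for such a setting (the names here are invented for illustration, not an actual fastparquet API): a named preset that expands into individual writer options, with explicit keyword arguments still winning.

```python
# Hypothetical sketch, not fastparquet's API: a compatibility "dialect"
# that expands to concrete writer options; explicit overrides win.
PRESETS = {
    'compat': {'has_nulls': True},    # safe for Spark/Hive/Impala/etc.
    'fast':   {'has_nulls': False},   # fastparquet-to-fastparquet roundtrips
}

def writer_options(preset='compat', **overrides):
    opts = dict(PRESETS[preset])
    opts.update(overrides)
    return opts

print(writer_options('fast'))                     # {'has_nulls': False}
print(writer_options('compat', has_nulls=False))  # explicit override wins
```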

Contributor

martindurant commented Apr 19, 2017

OK, I'm convinced, I'll make the change. This will be has_nulls=True for both fastparquet and dask's to_parquet. What would be a reasonable name for the current behaviour (which will no longer be the default, and so should not be None): "fast_but_incompatible"?

Owner

mrocklin commented Apr 20, 2017

Contributor

wesm commented Apr 21, 2017 edited

It seems like this is a candidate for a pass-through parameter to the engine. I'd like to provide the option to write files with pyarrow also.

@mrocklin mrocklin added a commit that referenced this issue Apr 28, 2017

wesm + mrocklin ENH: Refactoring to support multiple Parquet readers, add PyArrow reader (#2223)

* Refactor Parquet read path to support multiple implementations. Add
PyArrow-based implementation. Disable tests that don't yet work

* Remove superfluous global

* Comment typo

* Use functional style for Parquet read functions. Add failing test for #2177

* Fix renamed module

* Code review comments

* Add pyarrow to conda dependencies

* Code reviews

* Add documentation about Parquet engines in dataframe-performance.rst
68f9e41
Contributor

wesm commented May 15, 2017

Can we change this default option in Dask? I tried in 68f9e41 but it broke the test suite, so more work is required

Contributor

martindurant commented May 15, 2017

Yes, agree that this can be True (but please do not remove).
I thought this was done, but I cannot find it now.
More changes will come soon to the fastparquet side following the pandas schema spec, but feel free to change this immediately.

Owner

mrocklin commented Jun 1, 2017

Has this been resolved?

Contributor

martindurant commented Jun 1, 2017

This is also in #2365, which I now see has a py2 issue yet to be resolved; I'll get on that.
