
Missing Data #9

Open · TomAugspurger opened this issue May 26, 2020 · 18 comments

Comments

@TomAugspurger

This issue is dedicated to discussing the large topic of "missing" data.

First, a bit on names. I think we can reasonably choose between NA, null, or missing as a general name for "missing" values. We'd use that to inform decisions on method names like DataFrame.isna() vs. DataFrame.isnull() vs. ...
Pandas favors NA, databases might favor null, Julia uses missing. I don't have a strong opinion here.

Some topics of discussion:

  1. Data types should be nullable

I think the introduction of missing data should not fundamentally change the dtype of a column.
This is not the case with pandas:

In [5]: df1 = pd.DataFrame({"A": ['a', 'b'], "B": [1, 2]})

In [6]: df2 = pd.DataFrame({"A": ['a', 'c'], "C": [3, 4]})

In [7]: df1.dtypes
Out[7]:
A    object
B     int64
dtype: object

In [8]: pd.merge(df1, df2, on="A", how="outer")
Out[8]:
   A    B    C
0  a  1.0  3.0
1  b  2.0  NaN
2  c  NaN  4.0

In [9]: _.dtypes
Out[9]:
A     object
B    float64
C    float64
dtype: object

In pandas, for int-dtype data NaN is used as the missing value indicator. NaN is a float, and so the column is cast to float64 dtype.

Ideally Out[9] would preserve the int dtype for B and C. At this moment, I don't have a strong opinion on whether the dtype for B should be a plain int64, or something like a Union[int64, NA].
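For illustration, here is a sketch of the desired behavior using pandas' opt-in nullable Int64 extension dtype, which keeps the integer dtype and fills the gaps with <NA> (exact behavior may vary across pandas versions):

import pandas as pd

df1 = pd.DataFrame({"A": ["a", "b"], "B": pd.array([1, 2], dtype="Int64")})
df2 = pd.DataFrame({"A": ["a", "c"], "C": pd.array([3, 4], dtype="Int64")})

merged = pd.merge(df1, df2, on="A", how="outer")
print(merged.dtypes)  # B and C stay Int64 rather than being cast to float64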

  2. Semantics in arithmetic and comparison operations

In general, missing values should propagate in arithmetic and comparison operations (using <NA> as a marker for a missing value).

>>> df1 = DataFrame({"A": [1, None, 3]})
>>> df1 + 1
      A
0     2
1  <NA>
2     4

>>> df1 == 1
       A
0   True
1   <NA>
2  False

There might be a few exceptions. For example 1 ** NA might be 1 rather than NA, since the result is 1 regardless of what value NA takes on.

  3. Semantics in logical operations

For boolean logical operations (and, or, xor), libraries should implement three-valued or Kleene logic. The pandas docs have a table of the full semantics.
The short version is that the result should be NA if it depends on whether the NA operand is True or False. For example, True | NA is True, since it doesn't matter whether that NA is "really" True or False.
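For concreteness, a minimal sketch of Kleene logic on scalars, with Python's None standing in for NA (not any library's actual implementation):

from typing import Optional

def kleene_or(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    if a is True or b is True:
        return True    # True | anything is True, even True | NA
    if a is None or b is None:
        return None    # the result depends on what the NA "really" is
    return False

def kleene_and(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    if a is False or b is False:
        return False   # False & anything is False, even False & NA
    if a is None or b is None:
        return None
    return True

print(kleene_or(True, None))   # True
print(kleene_and(True, None))  # None, i.e. NA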

  4. The need for a scalar NA?

Libraries might need to implement a scalar NA value, but I'm not sure. As a user, you would get this from indexing to get a scalar, or in an operation that produces an NA result.

>>> df = pd.DataFrame({"A": [None]})
>>> df.iloc[0, 0]  # no comment on the indexing API
<NA>

What semantics should this scalar NA have? In particular, should it be typed? This is something we've struggled with in recent versions of pandas. There's a desire to preserve a property along the lines of the following

(arr1 + arr2)[0].dtype == (arr1 + arr2[0]).dtype

where the first value of the second array is NA. If you have a single NA without any dtype, you can't implement that property.
There's a long thread on this at pandas-dev/pandas#28095.
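To make the property concrete with today's pandas (the dtype pandas picks for the scalar case is exactly the convention under discussion, and may vary across versions):

import pandas as pd

arr1 = pd.array([1, 2], dtype="Int64")
arr2 = pd.array([pd.NA, 3], dtype="Int64")

# Elementwise: the NA lives inside an Int64 array, so the result dtype is clear.
print((arr1 + arr2).dtype)

# Scalar form: arr2[0] is the single untyped pd.NA, so the library has to pick
# a convention for the result dtype of arr1 + pd.NA.
print((arr1 + arr2[0]).dtype)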

@markusweimer

Do others see value in different kinds of missing values? E.g. not_recorded, which indicates the data was not present at the source, vs. NaN, which would indicate a computation returned an invalid result.

@rgommers
Member

@markusweimer that's indeed what this is about; it would be good to be more explicit in the final description. "Not-a-number" (nan) is pretty universally available; "missing data" here means "not recorded", and support for it is much more recent and patchier.

@TomAugspurger
Author

Yes, we should make the distinction between NA and NaN clear.

There might be reason to support both within a single column. For example,

>>> a = DataFrame({"A": [0, 1, NA, np.nan]})
>>> b = DataFrame({"A": [0, 0, 0, 0]})
>>> a / b
DataFrame({"A": [nan, inf, NA, nan]})  # float dtype

0/0 is defined to be nan. We would be saying that NA / 0 would be NA, by the principle that the result depends on the NA value.

This has implications for other parts of the API: does DataFrame.dropna() drop just NA values? Or does it drop NaN values as well?

Discussion on the NA vs. NaN distinction at pandas-dev/pandas#32265. In particular, it's been reported that some cudf users appreciate being able to store both NaN and NA values within a single column: pandas-dev/pandas#32265 (comment).

@amueller

What's NA * NaN?

I agree that having both might be useful, though I think I'm not entirely decided on whether it's necessary. They do have different semantics, but the cases where the different semantics change the outcome are pretty rare, right?

Having both in a single column will certainly make life for downstream packages harder, because now we might need to deal with two special cases everywhere. Unless they are both mapped to the same value at the numeric level and only differ at the dataframe level?

@TomAugspurger
Author

TomAugspurger commented May 27, 2020

What's NA * NaN?

NA. Edit: I'm not actually sure about this. Pandas' current implementation (returning NA) may not be valid.

They do have different semantics, but the cases where the different semantics change the outcome are pretty rare, right?

What do you mean by "change the outcome", or rather, how does that differ from "different semantics"? To me those sound the same :) (e.g. np.nan > 0 is False, while NA > 0 is NA sounds like both different semantics and different outcomes).

Having both in a single column will certainly make life for downstream packages harder because now we might need to deal with two special cases everywhere. Unless they are both mapped to the same at the numeric level and only are different on the dataframe level?

Indeed, handling both might be difficult, or at least requires some thought. Scikit-Learn is choosing to treat pandas.NA as np.nan at the boundary in check_array: scikit-learn/scikit-learn#16508. This results in a loss of precision for large integers, but that might be the right choice for that library.
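A minimal sketch of that kind of boundary conversion (not scikit-learn's actual code), including the precision loss mentioned above:

import numpy as np
import pandas as pd

def to_float_ndarray(s: pd.Series) -> np.ndarray:
    # Map pd.NA to np.nan at the boundary; the cast to float64 cannot
    # represent every int64 exactly.
    return s.to_numpy(dtype="float64", na_value=np.nan)

s = pd.Series([1, pd.NA, 2**53 + 1], dtype="Int64")
print(to_float_ndarray(s))  # [1.0, nan, 9007199254740992.0] -- 2**53 + 1 got rounded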

@amueller

The NA * NaN was a bit tongue-in-cheek, though if we want to have both types inside a column then this actually needs an answer, and there are probably other situations that are at least as unclear.
Ok, good example of different behavior. By "semantics" I meant semantics in user code, i.e. even if both had exactly the same behavior within a library, it might still be useful to have both so that users could distinguish them in their code.

@maartenbreddels

I think we can reasonably choose between NA, null, or missing as a general name for "missing" values.

I feel like 'null' is a bit strange in the context of numbers, since it reminds me of pointers. I think missing is more fitting, though less neutral. But that is maybe what we want (give it an explicit meaning).

In Vaex we defined isna(x) as isnan(x) | ismissing(x), where ismissing(x) means missing values (implemented as masked arrays or Arrow Arrays which naturally have null bitmasks). isnan(x) just follows IEEE standards. So isna is short for 'get rid of anything messy'.
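A rough sketch of that decomposition (not Vaex's actual implementation), using a NumPy masked array to play the role of the null bitmask:

import numpy as np

def ismissing(x: np.ma.MaskedArray) -> np.ndarray:
    return np.ma.getmaskarray(x)    # True where the value is masked out

def isnan(x: np.ma.MaskedArray) -> np.ndarray:
    return np.isnan(x.filled(0.0))  # IEEE NaN check on the stored floats

def isna(x: np.ma.MaskedArray) -> np.ndarray:
    return isnan(x) | ismissing(x)  # "anything messy"

arr = np.ma.MaskedArray([1.0, np.nan, 3.0], mask=[False, False, True])
print(ismissing(arr))  # [False False  True]
print(isnan(arr))      # [False  True False]
print(isna(arr))       # [False  True  True]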

I strongly dislike using sentinels/special values for missing values in a library, since for integers there is basically no solution. This means you need to support a byte or bit mask anyway to keep track of them. Mixing sentinels and missing values just makes life more complex.

I see NaN (just a float number) as orthogonal to missing values, the only connection they have in Vaex is through the convenience methods isna/countna/fillna which follows the definition above. I also think having both NaN and missing values in a column can indicate different things, and a user should be able to distinguish between them.

@TomAugspurger
Author

TomAugspurger commented May 28, 2020 via email

@maartenbreddels

Is there an agreement on what 'NA' means? Does it mean 'Not available'?

I would say the meaning of 'missing' is the least ambiguous (which has its pros and cons); NaN also has a very explicit meaning, while the meanings of null and NA are less clear to me.

@TomAugspurger
Author

Yes, I think "Not available".

@saulshanabrook

FWIW, when I hear NA I think Not Applicable, but maybe I am just not used to the domain-specific usage here.

@amueller

I strongly dislike using sentinels/special values for missing values in a library since for integers there is basically no solution. This means you need to support a byte or bitmask anway to keep track of them. Mixing sentinels and missing values just makes life more complex.

I'm not sure I follow this part, could you please elaborate?

@datapythonista
Member

I see NaN as a particular value of float. My understanding is that the advantage is that CPUs understand it and can operate with it, so I assume operating with it should be much faster than with a boolean mask.

I guess for projects like numpy that can be important, but I don't think it is worth the trouble for dataframes. My opinion is that it'd make life easier for users if NaN values were automatically converted to NA in the boolean mask, so that users of dataframes can forget about them.

@maartenbreddels

I'm not sure I follow this part, could you please elaborate?

Since integers don't have a special value like NaN, you cannot 'abuse' NaN as a missing value. You could use a special value, but that would cause trouble: if your data happens to include that special value, you suddenly have an accidental missing value.

A user might be able to get away with that I think, but having that solution as a building block for an ecosystem to build on does not sound like a good plan.
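A toy illustration of the collision problem (the sentinel value here is an arbitrary, hypothetical choice):

import numpy as np

SENTINEL = -9999                  # hypothetical "missing" marker for int64
data = np.array([10, -9999, 42])  # here -9999 is legitimate data, not missing
print(data == SENTINEL)           # [False  True False] -- an accidental missing value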

I think that means you have to keep track of a mask (bit or byte mask). And I guess that's also what databases do: they won't forbid you from using a particular integer value because it's reserved as a 'missing value' sentinel (correct me if I'm wrong, but I'd be surprised).

On top of that, NaN and missing values can have different meanings. A missing value is just that, as the name indicates; a NaN could mean a measurement gone wrong, a math 'error', etc. NaN and missing values are fundamentally different things, although one could group them (say, call them NA).

I think I fully agree with Apache Arrow's idea. Each array can have missing values, and in that case it has a bitmask attached to the array, but it's optional. If you compute on this, I think the plan is to just brute-force compute over all the data (ignoring that the array has missing values, since it's all vectorized/SIMD down in the loops).
Apart from that, the optional bitmasks can be combined in whatever way the algorithm thinks is required. I think performance-wise, that should be quite efficient.
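For example, with pyarrow the kernel runs over all slots and the validity bitmasks of the operands are combined for the result (a small illustration, assuming pyarrow is installed):

import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([1, None, 3])   # the validity bitmask marks slot 1 as null
b = pa.array([10, 20, None])

print(pc.add(a, b))          # [11, null, null]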

Note that using a bitmask is not that memory-consuming. Say a column of 1 billion float64 values uses 1e9 × 8 bytes ≈ 8 GB; a full mask (1 bit per element) would require 1e9 / 8 bytes ≈ 125 MB extra, i.e. 1/64 ≈ 1.6%.

My opinion is that it'd make live easier for the users if NaN values are automatically convert to NA in the boolean mask, and users of dataframes forget about them.

I disagree and agree here, because I think you should be able to distinguish between them, but also to have the "I don't care, so throw away any data that's NA/null/missing/NaN, whatever" option.
This is the reason why I chose in Vaex to have isna/isnan/ismissing and countna/countnan/countmissing, etc. I usually use isna, but sometimes I need isnan or ismissing.

@TomAugspurger
Author

Following up on a question on the call by @apaszke. This demonstrates why a typed NA might be desirable.

We have that datetime - datetime is a timedelta. But datetime - timedelta
is a datetime.

If we have an untyped scalar NA, then you have to arbitrarily choose that
datetime - NA interprets the NA as, say, a datetime, so the result is a timedelta.

In [17]: a  # datetime
Out[17]:
           A
0 2000-01-01
1 2000-01-01

In [18]: b  # datetime
Out[18]:
           A
0         NA
1 2000-01-01

In [19]: (a - b.iloc[0, 0]).dtypes
Out[19]:
A    timedelta64[ns]
dtype: object

But that loses the (I think desirable) property of knowing the result dtype of
the operation (a - any_scalar_from_b). It would now depend on whether the particular
scalar from b was NA or not.

Having a typed NA scalar like NA<datetime> would resolve this.
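A hypothetical sketch of such a typed NA scalar (not an existing library API): because the scalar carries a dtype, the result dtype of a subtraction can be resolved from the types alone.

from dataclasses import dataclass

@dataclass(frozen=True)
class TypedNA:
    dtype: str  # the dtype this missing scalar "would have had"

def subtract_result_dtype(left: str, right: str) -> str:
    # datetime - datetime -> timedelta; datetime - timedelta -> datetime
    if left == right == "datetime64[ns]":
        return "timedelta64[ns]"
    if (left, right) == ("datetime64[ns]", "timedelta64[ns]"):
        return "datetime64[ns]"
    raise TypeError(f"unsupported: {left} - {right}")

na = TypedNA("datetime64[ns]")
# Known from the types alone, whether or not the particular scalar is missing:
print(subtract_result_dtype("datetime64[ns]", na.dtype))  # timedelta64[ns]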

@maartenbreddels

Thanks @TomAugspurger, that's really useful to keep in mind.

@teoliphant said it should be possible for ndarrays to have extra data added to them, like a mask (bit or byte). If normal ndarrays were more like numpy masked arrays and kept track of their masks, and numpy scalars also held this information (a single bit or byte), we could have masked scalar values. You wouldn't need a sentinel value.

I think masked arrays in numpy have to happen someday (built in, not bolted on), instead of solving this in the DataFrame layer (recognizing that's probably an order of magnitude more difficult to get off the ground).

@teoliphant

teoliphant commented May 29, 2020

There were many long discussions about missing data that pre-dated Pandas, from 1998 to 2011, while NumPy was being written and growing in popularity. There were several NEPs and mailing-list discussions that didn't result in broad agreement. No one was funded to work on this. I remember these debates well, but I could not facilitate them effectively because I was either running a consulting company or leaving that company to start Anaconda. I do remember getting two particularly active participants in the discussion together to further the conversation. The output of these efforts was a document by those two participants, Mark and Nathaniel, published here: https://numpy.org/neps/nep-0026-missing-data-summary.html. It goes into a lot of detail about the opportunities and challenges from the NumPy perspective.

I think it's very important that we understand that much of the challenge they faced in coming to agreement about NumPy is that changing an existing library and working out all the details of what must be changed in the code is much harder than proposing an API based on existing work.

Of course, for any reference to be relevant, it has to be used, and so it's not completely orthogonal. However, now there are many, many more array libraries and dataframe libraries. Our efforts here are to do our best to express the best API we can confidently describe and then work with projects to consume or produce these.

My personal conclusion about the missing data APIs: the problem actually rests in the fact that NumPy only created an approximate type system (dtypes) and did not build well on Python's type system.

A type system is what connects the bytes contained in a data structure to how those bytes should be interpreted by code. The sentinel concept is clearly a new kind of type (in ndtypes we called it an optional type). Even the masked concept could be considered a kind of type (if you consider the mask bits part of the element data, even though they are stored elsewhere). It is probably better, though, to consider a mask array as a separate container type that could be used for a dataframe with native support for missing data.

NumPy has a nascent type system, but it is not easily extended (though you can do it in C with some effort). The type-extension system is very different from the built-in types, which means NumPy's types are somewhat like Python 1.0 classes. If NumPy had a more easily extended type system, we could have had many more experiments with missing data and would be farther along.

So, in my mind, the missing data problem is deeply connected to the "type" problem, which does not have a great solution in Python today. I have ideas and designs for how to fix this fundamentally (anyone want to fund me to fix it?). There is even quite a bit of code in the xnd, ndtypes, and mtypes repositories (some of which may be useful).

For the purposes of this consortium, however, I think we will have to effectively follow what Vaex is doing here (and where it sounds like Pandas is heading) and have both NaN and NA, leaving it to libraries to comply with the standard.

@gatorsmile

FYI, PySpark follows the NULL semantics defined in ANSI SQL. We documented our behavior at http://spark.apache.org/docs/latest/sql-ref-null-semantics.html
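For instance, under ANSI SQL semantics NULL = NULL is itself NULL, while the null-safe operator <=> returns true (a small illustration, assuming a local Spark session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql(
    "SELECT CAST(NULL AS INT) = CAST(NULL AS INT) AS eq, "
    "CAST(NULL AS INT) <=> CAST(NULL AS INT) AS null_safe_eq"
).show()
# eq is NULL (unknown), while null_safe_eq is true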
