New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IsNull and DropNA expressions #733

Merged
merged 37 commits into from Aug 20, 2015

Conversation

Projects
None yet
3 participants
@cpcloud
Member

cpcloud commented Oct 11, 2014

closes #697

@cpcloud cpcloud self-assigned this Oct 11, 2014

@cpcloud cpcloud added this to the Release 0.7 milestone Oct 11, 2014

@cpcloud cpcloud force-pushed the cpcloud:is-and-non-null branch from b8bc475 to 40e0631 Oct 11, 2014

# i.e., only compute_up over the columns in the child of an IsNull
# expression
names = t.cols.names
return bcolz.carray([compute_up(n, t.cols[c], **kwargs) for c in names])

This comment has been minimized.

@mrocklin

mrocklin Oct 11, 2014

Member

In some cases I wonder if these compute_up calls should be flattened down to just the direct bcolz code.

@dispatch(IsNull, bcolz.carray)
def compute_up(n, t, **kwargs):
return bcolz.eval('t != t')

This comment has been minimized.

@mrocklin

mrocklin Oct 11, 2014

Member

Nice solution. I wonder if you should mention it here pydata/numexpr#23

@@ -9,7 +9,7 @@
from datashape.predicates import isscalar, iscollection, isboolean, isrecord
from ..compatibility import _strtypes, builtins
from .core import *
from .core import Node, common_subexpression, path

This comment has been minimized.

@mrocklin

mrocklin Oct 11, 2014

Member

My slight preference is to keep it as before. I don't think we gain anything by being explicit about imports here, particularly when we define __all__. I do however find it annoying when adding expressions to simultaneously have to manage this extra bookkeeping.

Or more particularly, what is the value of being explicit here? I understand that I have a minority opinion here.

This comment has been minimized.

@cpcloud

cpcloud Oct 11, 2014

Member

Star imports hide unused variables and hide use of undefined variables (these aren't always caught in tests, however they are generally rare and so far I've only found them in code from old blaze). They make it annoying to find where something is being imported and make editor plugins that detect undefined or unused variables basically useless. They also make it hard to realize that many times we will import all, max, min or other core Python names that overwrite the interpreter builtins. Overall they make developing on blaze less convenient.

This comment has been minimized.

@mrocklin

mrocklin Oct 11, 2014

Member

OK, I'll try to limit my use of them. It's interesting the extent to which ones tooling or development practice can affect ones aesthetics around code.

I appreciate the efforts you've made in accommodating my needs (fast tests, limited pytest magic) and so please let me know if there are other things that I do that conflict with your ideal development workflow.

@property
def dshape(self):
return self._child.dshape

This comment has been minimized.

@mrocklin

mrocklin Oct 11, 2014

Member

The first dimension of the dshape should be variable. Also the measure should be non-optional

>>> s = Symbol('s', '5 * ?int32')
>>> s.dropna().dshape
var * int32
schema_method_list.extend([
(lambda ds: isinstance(ds, (Option, Record)) or ds == real, set([isnull]))
])

This comment has been minimized.

@mrocklin

mrocklin Oct 11, 2014

Member

It's unclear to me how isnull should work on non-optional record types.

This comment has been minimized.

@mrocklin

mrocklin Oct 11, 2014

Member

Hrm, I wonder if we shouldn't have or ds == real here, and instead leave that case to isnan. This feels a bit cleaner.

])
def isdropna(ds):

This comment has been minimized.

@mrocklin

mrocklin Oct 11, 2014

Member

A better name could be found here.

This comment has been minimized.

@cpcloud

cpcloud Oct 11, 2014

Member

I meant to just use the lambda directly here. I was debugging it and needs to see where it was being executed and with what input

@mrocklin

This comment has been minimized.

Member

mrocklin commented Oct 11, 2014

Lookin' good!

__all__ = ['IsNull', 'isnull', 'DropNA', 'dropna']
class IsNull(ElemWise):

This comment has been minimized.

@mrocklin

mrocklin Oct 11, 2014

Member

Short docstrings would be helpful here.

def isnull(expr):
return IsNull(expr)

This comment has been minimized.

@mrocklin

mrocklin Oct 11, 2014

Member

I usually then do

isnull.__doc__ = IsNull.__doc__
return IsNull(expr)
class DropNA(Expr):

This comment has been minimized.

@mrocklin

mrocklin Oct 11, 2014

Member

Is this the most intuitive name for this operation? What do other systems use. I know that Pandas uses this but what about others? A little research here might be appropriate.

This comment has been minimized.

@cpcloud

cpcloud Dec 28, 2014

Member

@mrocklin

  • pandas: <column>|<table>.dropna()
  • sql: select * from <table> where <column> NOT NULL
  • mongo: db.<collection>.find( { '$and': [ { 'cancelDate': None }, { 'cancelDate': { '$type': 10 } } ] } )
  • scidb: SELECT expression1 FROM array WHERE not is_null(expression2);, see here: http://scidb.org/HTMLmanual/14.3/scidb_ug/re11.html
  • python: filter(None, [...])
  • numpy: x[~np.isnan(x)] if you're using regular arrays, x.compressed() if you're using MaskedArrays

i don't really see a pattern here or anything that sticks out as the best choice.

i like notnull() if we're going to be declarative about it, or dropna() if we're staying imperative.

This comment has been minimized.

@cpcloud

cpcloud Dec 28, 2014

Member

Though I don't like having isnull() return bools and notnull() return a thing with the dshape of the child (sans Option), I think a single isnull() covers all cases. if someone wants to select from a table where all the values of a particular column aren't null they say

t[~t.column.isnull()]

This comment has been minimized.

@mrocklin

mrocklin Dec 28, 2014

Member

I think that this mostly comes down to how easy it is to implement these
things on various backends. If backends have nonnull operations then
backing out this intent from a selection of an isnull will be annoying. If
backends treat nonnull in the same way that they treat selections then we
don't have a problem with the single isnull operation.

On Sun, Dec 28, 2014 at 9:19 AM, Phillip Cloud notifications@github.com
wrote:

In blaze/expr/null.py
#733 (diff):

+all = ['IsNull', 'isnull', 'DropNA', 'dropna']
+
+
+class IsNull(ElemWise):

  • slots = '_child',
  • @Property
  • def schema(self):
  •    return dshape('bool')
    
    +def isnull(expr):
  • return IsNull(expr)

+class DropNA(Expr):

Though I don't like having isnull() return bools and notnull() return the
dshape of the child (sans Option), I think a single isnull() covers all
cases. if someone wants all the values that aren't null they say

t[~t.column.isnull()]


Reply to this email directly or view it on GitHub
https://github.com/ContinuumIO/blaze/pull/733/files#r22298990.

@mrocklin

This comment has been minimized.

Member

mrocklin commented Oct 11, 2014

Python solutions might leverage is None and filter(None, seq)

For SQLAlchemy this might be helpful. http://stackoverflow.com/questions/21784851/sqlalchemy-is-not-null-select

@@ -51,6 +53,33 @@ def compute_up(expr, data, **kwargs):
return data[expr.fields]
@dispatch(DropNA, (bcolz.ctable, bcolz.carray))
def compute_up(n, t, **kwargs):

This comment has been minimized.

@mrocklin

mrocklin Oct 11, 2014

Member

In bcolz compute code I've started uniformly using the arguments expr and data as in

def compute_up(expr, data, **kwargs)

I found that this consistent naming made it a bit easier to quickly inspect functions. Maybe we should think of a standard to writing compute functions (not necessarily this one, but some standard)

This comment has been minimized.

@cpcloud

cpcloud Oct 11, 2014

Member

Yep makes sense. I usually go with the first letter of each expression and data but that's not a very good solution.

r = bres
else:
r = {'all': np.logical_and,
'any': np.logical_or}[n.how].reduce(bres, axis=1)

This comment has been minimized.

@cpcloud

cpcloud Oct 11, 2014

Member

These should just use any or all

@cpcloud cpcloud force-pushed the cpcloud:is-and-non-null branch from a5f09dd to f40d6be Oct 13, 2014

@cpcloud cpcloud modified the milestones: 0.7.1, Release 0.7, Release 0.8 Dec 21, 2014

@cpcloud cpcloud force-pushed the cpcloud:is-and-non-null branch from 809e06e to 2ccbe27 Dec 28, 2014

@cpcloud cpcloud force-pushed the cpcloud:is-and-non-null branch from 3af7fad to 0a98d36 Jan 10, 2015

@cpcloud cpcloud reopened this Jun 17, 2015

@cpcloud cpcloud modified the milestones: 0.9.0, 0.8.2 Jun 27, 2015

@cpcloud cpcloud modified the milestones: 0.8.2, 0.8.3 Jul 9, 2015

@cpcloud

This comment has been minimized.

Member

cpcloud commented Aug 19, 2015

@llllllllll I'd like to revisit this soon. What do you think about calling DropNA Valid and removing the IsNull expression?

@llllllllll

This comment has been minimized.

Member

llllllllll commented Aug 19, 2015

Why the name change to Valid? I think having the predicate is still really useful. For exploring a new data set, it is often nice to see how many missing values you have or something. Also, if we have the predicate, we don't really need a dropna because it just becomes expr[~expr.isnull()]

@cpcloud

This comment has been minimized.

Member

cpcloud commented Aug 19, 2015

True. I feel the same way about exploring a new dataset. So let's ... drop ... dropna :) and keep isnull

@llllllllll

This comment has been minimized.

Member

llllllllll commented Aug 19, 2015

Do we have code in place that lets us do an optimzation like Selection(IsNull) on DataFrame turns into df.dropna() or no?

@cpcloud

This comment has been minimized.

Member

cpcloud commented Aug 20, 2015

I think changing isnull to notnull is the way to go here. As a user of blaze I think that I'd want to be able to say "give me the data that are notnull" rather than say "give me the data that are not isnull"

@llllllllll thoughts?

@cpcloud

This comment has been minimized.

Member

cpcloud commented Aug 20, 2015

@llllllllll i don't know if that optimization can be done with existing code

it might be possible to steal some ideas from here to do this: https://github.com/blaze/dask/blob/master/dask/rewrite.py

@llllllllll

This comment has been minimized.

Member

llllllllll commented Aug 20, 2015

I agree that notnull is more frequently what you want. You can still express the inverse with a unary not so I am +1.

About the rewrite stuff, this looks really cool. This matches the use case I had a while ago about calling compute_down on each subexpr. I think we are sort of missing this ast-level transformation step.

cpcloud added some commits Aug 20, 2015

cpcloud added a commit that referenced this pull request Aug 20, 2015

Merge pull request #733 from cpcloud/is-and-non-null
IsNull and DropNA expressions

@cpcloud cpcloud merged commit f0b0255 into blaze:master Aug 20, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@cpcloud cpcloud deleted the cpcloud:is-and-non-null branch Aug 20, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment