New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

str arithmetic #1058

Merged
merged 15 commits into from Jun 2, 2015

Conversation

Projects
None yet
4 participants
@llllllllll
Member

llllllllll commented Apr 23, 2015

Closes #1045

I wanted to try to learn how some of the symbolic stuff worked so I figured this was an easy one to tackle.

In [1]: import blaze as bz

In [2]: ds = bz.Data([('%s', 'a'), ('%s', 'b')], fields=['a', 'b'])

In [3]: ds.a + ds.b
Out[3]: 

0  %sa
1  %sb

In [4]: ds.a * 2
Out[4]: 
      a
0  %s%s
1  %s%s

In [5]: ds.a % ds.b
Out[5]: 

0  a
1  b

Let me know if I this needs more tests or the tests are in the wrong place.

def test_str_arith():
assert isinstance(cs + cs, Add)
assert isinstance(cs * 1, Mult)
assert isinstance(cs % cs, Mod)

This comment has been minimized.

@mrocklin

mrocklin Apr 23, 2015

Member

I wonder if we should have separate string operations rather than overload Add. The fact that these are the same in our minds is mostly a Python thing but may not carry over well to other backends. It may be more clear to have a string concatenation operator. Being explicit in this way might help when building backends because we won't have to check the dtype in Add operations for this case.

This comment has been minimized.

@llllllllll

llllllllll Apr 23, 2015

Member

I could do Cat, Format, and Repeat to represent these new actions, how does that sound?

This comment has been minimized.

@cpcloud

cpcloud Apr 23, 2015

Member

i'd call Format Interp because format is a method that does string formatting in a very different way

@cpcloud

This comment has been minimized.

Member

cpcloud commented Apr 23, 2015

Usually when we add a new expression we add a couple of implementations using a two or three backends. Depending on what you're comfortable with, pandas and sql might be a nice pair to implement. after that, we can implement others piecemeal

@llllllllll

This comment has been minimized.

Member

llllllllll commented Apr 23, 2015

How should I go about adding the implement these; with pandas and sql, all three operators are already defined to act element-wise on the values so they would not be using these new nodes.

@cpcloud

This comment has been minimized.

Member

cpcloud commented Apr 23, 2015

if we go the methods rather than operators route then you can do something like this for pandas (in compute/pandas.py):

@dispatch(Concat, pd.Series)
def compute_up(expr, data, **kwargs):
    return data.str.concat(expr.rhs)

there might be some hoop jumping required, take a look at the BinOp implementations in the same file

@llllllllll llllllllll force-pushed the quantopian:str-catenation branch from 9fe77e1 to 206e492 Apr 23, 2015

@llllllllll

This comment has been minimized.

Member

llllllllll commented Apr 23, 2015

Should the method names be: cat, interp and repeat to match the node names?

@cpcloud

This comment has been minimized.

Member

cpcloud commented Apr 23, 2015

cat -> concat, others look good

@llllllllll

This comment has been minimized.

Member

llllllllll commented Apr 23, 2015

Should strings still be able to use the arithmetic operators or should you just use these new methods. I think I found a good way to allow them to use the operators.

@llllllllll llllllllll force-pushed the quantopian:str-catenation branch from 206e492 to 0a91009 Apr 23, 2015

@llllllllll

This comment has been minimized.

Member

llllllllll commented Apr 23, 2015

Should I move the Concat and Repeat into the expr.collections and Interp into string?

@cpcloud

This comment has been minimized.

Member

cpcloud commented Apr 23, 2015

if there's a way that other backends such as sql and others that don't implement this kind of syntax can use it, then i'd be okay with allowing the operators

@cpcloud

This comment has been minimized.

Member

cpcloud commented Apr 23, 2015

Should I move the Concat and Repeat into the expr.collections and Interp into string?

sounds good.

@@ -54,7 +55,14 @@ def isstring(ds):
return isinstance(getattr(measure, 'ty', measure), String)
_add, _radd = _mkbin('add', Concat)
_mod, _rmod = _mkbin('mod', Interp)
_mul, _rmul = _mkbin('mul', Repeat)

This comment has been minimized.

@cpcloud

cpcloud Apr 23, 2015

Member

nice. this should allow backends to use this syntax even if they don't know anything about it

@llllllllll

This comment has been minimized.

Member

llllllllll commented Apr 23, 2015

@cpcloud in your example, you have Concat doing elementwise string concat for dataframes; should it not call pd.concat?

EDIT: nvm, I think it is best to be consistent.

@llllllllll llllllllll force-pushed the quantopian:str-catenation branch from 28c6688 to 8088182 Apr 24, 2015

@llllllllll

This comment has been minimized.

Member

llllllllll commented Apr 24, 2015

I am having some trouble figuring out how to get a series of strings to qualify for the string ops. I started with numpy though. Also, I am not super experienced with sql so I am not sure where to start for the implementation for that back end.

@llllllllll

This comment has been minimized.

Member

llllllllll commented Apr 27, 2015

@cpcloud, wondering if this looks okay with the numpy stuff and if you have any ideas for getting pandas to work

@dispatch(Interp, base, np.ndarray)
def compute_up(t, lhs, rhs, **kwargs):
return np.char.mod(lhs, rhs)

This comment has been minimized.

@cpcloud

cpcloud Apr 27, 2015

Member

You can consolidate this function and the one above by rewriting it like this:

@compute_up.register(Interp, base, np.ndarray)
@compute_up.register(Interp, np.ndarray, (np.ndarray, base))
def compute_up_np_char_mod(t, lhs, rhs, **kwargs):
    return np.char.mod(lhs, rhs)

Same for the above Concat and Repeat implementations as well.

This comment has been minimized.

@llllllllll

llllllllll Apr 27, 2015

Member

ah, I wasn't sure if you could do this; I was using the other models as an example. Would it make sense to add a style commit to refactor the neighbouring compute functions to do this?

This comment has been minimized.

@cpcloud

cpcloud Apr 27, 2015

Member

that's fine. you can also just lump that change into another commit if that's more convenient

This comment has been minimized.

@llllllllll

llllllllll Apr 27, 2015

Member

AttributeError: 'Dispatcher' object has no attribute 'name'

this doesn't seem to work

This comment has been minimized.

@mrocklin

mrocklin Apr 27, 2015

Member

It won't work with the @dispatch decorator. It should work if you use the register method on the Dispatcher itself.

https://github.com/mrocklin/multipledispatch/blob/master/multipledispatch/dispatcher.py#L74-L103

This comment has been minimized.

@cpcloud

cpcloud Apr 27, 2015

Member

You can't do this

@compute_up.register(...)
def compute_up(expr, data, **kwargs):
    # do stuff

You have to call it something besides compute_up if you use register

This comment has been minimized.

@cpcloud

cpcloud Apr 27, 2015

Member

Oh, also try updating your version of multipledispatch by either pip install -U git+git://github.com/mrocklin/multipledispatch or conda update multipledispatch

This comment has been minimized.

@cpcloud

cpcloud Apr 27, 2015

Member

@mrocklin Why wouldn't this work with @dispatch?

This comment has been minimized.

@mrocklin

mrocklin Apr 27, 2015

Member

The first time we call dispatch on a new function we need the Dispatcher to be in the module's namespace, not the original function. That way when someone says from foo import myfunc they get the Dispatcher.

This comment has been minimized.

@mrocklin

mrocklin Apr 27, 2015

Member

This naming bit isn't an issue if you create the dispatcher explicitly:

myfunc = Dispatcher('myfunc')

@myfunc.register(...)
def some_other_name(...):
    ...
@cpcloud

This comment has been minimized.

Member

cpcloud commented Apr 27, 2015

@llllllllll Can you add some tests in test_numpy_compute.py for the 3 operations that you've implemented?

@cpcloud

This comment has been minimized.

Member

cpcloud commented Apr 27, 2015

As for Pandas, you can do these operations with the following operations:

  • Concat: pd.Series.str.concat
  • Repeat: pd.Series.str.repeat
  • Interp: use the modulo operator
@llllllllll

This comment has been minimized.

Member

llllllllll commented Apr 27, 2015

Sorry; I had no issue writing to the compute functions; however, the methods were not appearing because I am not sure how to tag the pandas structures with these expression types.

@cpcloud

This comment has been minimized.

Member

cpcloud commented Apr 27, 2015

Oh, do it like this:

@dispatch(Concat, pd.Series)
def compute_up(expr, data, **kwargs):
    # do stuff
@llllllllll

This comment has been minimized.

Member

llllllllll commented Apr 27, 2015

In [2]: ds = bz.Data(pd.Series(['a', 'b', 'c']))

In [4]: 'repeat' in dir(ds)
Out[4]: False

I am not seeing the methods to create the expressions that the compute_up will dispatch on. I think this is because the dhape of a series with strings is just object and it fails the isstring check

@cowlicks

This comment has been minimized.

Contributor

cowlicks commented Apr 27, 2015

@llllllllll llllllllll force-pushed the quantopian:str-catenation branch from 0361a02 to 144a1d4 Apr 29, 2015

@llllllllll llllllllll closed this May 3, 2015

@llllllllll llllllllll reopened this May 3, 2015

@cpcloud

This comment has been minimized.

Member

cpcloud commented May 3, 2015

I wonder if we shouldn't just shove numpy into pandas. numpy's char module algos seem to be 10x slower than pandas

In [23]: v
Out[23]:
array(['%d', '%s', '%d', ..., '%s', '%d', '%s'],
      dtype='<U2')

In [24]: s.values
Out[24]: array(['%d', '%s', '%d', ..., '%s', '%d', '%s'], dtype=object)

In [25]: %timeit np.char.mod(v, 1)
1 loops, best of 3: 3.31 s per loop

In [26]: %timeit s % 1
1 loops, best of 3: 328 ms per loop

In [27]: 3310/328
Out[27]: 10.091463414634147

In [31]: len(s)
Out[31]: 2000000

@llllllllll llllllllll closed this May 19, 2015

@llllllllll llllllllll reopened this May 19, 2015

@cpcloud

This comment has been minimized.

Member

cpcloud commented May 19, 2015

@llllllllll what do you think about pushing numpy string algos to pandas?

@llllllllll

This comment has been minimized.

Member

llllllllll commented May 19, 2015

I wouldn't mind investigating; however, I think that the root of the issue with the speed is that pandas treats a series of strings as dtype object instead of strings so it must dispatch with PyNumber_Mod or something.

@cpcloud

This comment has been minimized.

Member

cpcloud commented May 19, 2015

I would intuitively think that that would be slower than what numpy is doing. My numbers above suggest otherwise

@llllllllll

This comment has been minimized.

Member

llllllllll commented May 19, 2015

Ah, sorry, I misread. I will convert the numpy version to use pandas.

@llllllllll

This comment has been minimized.

Member

llllllllll commented May 23, 2015

The cost of the numpy function grows faster than the pandas version, however, it is better for smaller arrays. Some basic testing has shown that they cross at around n=135. To factor in the cost of creating the series from the array, I think that a nice win could be to have the numpy compute function delegate to pandas for arrays of n>=145. I will work on adding that now.

PERF: Delegate to faster string interp function based on array len.
cache global lookups in function.__defaults__
@dispatch(Interp, Series)
def compute_up(t, data, **kwargs):
if isinstance(t.lhs, Expr):
return pd.Series(data.values % t.rhs)

This comment has been minimized.

@cpcloud

cpcloud May 26, 2015

Member

I think this should be

if ...:
    return data % t.rhs
else:
    return t.lhs % data
@compute_up.register(Interp, Series, (Series, base))
@compute_up.register(Interp, base, Series)
def compute_up_pd_interp(t, lhs, rhs, **kwargs):
return pd.Series(lhs.values % rhs.values)

This comment has been minimized.

@cpcloud

cpcloud May 26, 2015

Member

you should be able to do just this:

return lhs % rhs

This comment has been minimized.

@cpcloud

cpcloud May 26, 2015

Member

If this isn't raising then I suspect you don't have coverage for all the dispatch cases listed here.

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 1, 2015

@llllllllll one last comment, then i think we're good to merge

@llllllllll

This comment has been minimized.

Member

llllllllll commented Jun 2, 2015

updated

@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 2, 2015

merging

cpcloud added a commit that referenced this pull request Jun 2, 2015

@cpcloud cpcloud merged commit 42d1d5c into blaze:master Jun 2, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@cpcloud

This comment has been minimized.

Member

cpcloud commented Jun 2, 2015

@llllllllll thanks!

@llllllllll

This comment has been minimized.

Member

llllllllll commented Jun 2, 2015

yay

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment