Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

str arithmetic #1058

Merged
merged 15 commits into from
Jun 2, 2015
Merged

str arithmetic #1058

merged 15 commits into from
Jun 2, 2015

Conversation

llllllllll
Copy link
Member

Closes #1045

I wanted to try to learn how some of the symbolic stuff worked so I figured this was an easy one to tackle.

In [1]: import blaze as bz

In [2]: ds = bz.Data([('%s', 'a'), ('%s', 'b')], fields=['a', 'b'])

In [3]: ds.a + ds.b
Out[3]: 

0  %sa
1  %sb

In [4]: ds.a * 2
Out[4]: 
      a
0  %s%s
1  %s%s

In [5]: ds.a % ds.b
Out[5]: 

0  a
1  b

Let me know if I this needs more tests or the tests are in the wrong place.

def test_str_arith():
assert isinstance(cs + cs, Add)
assert isinstance(cs * 1, Mult)
assert isinstance(cs % cs, Mod)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should have separate string operations rather than overload Add. The fact that these are the same in our minds is mostly a Python thing but may not carry over well to other backends. It may be more clear to have a string concatenation operator. Being explicit in this way might help when building backends because we won't have to check the dtype in Add operations for this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could do Cat, Format, and Repeat to represent these new actions, how does that sound?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd call Format Interp because format is a method that does string formatting in a very different way

@cpcloud
Copy link
Member

cpcloud commented Apr 23, 2015

Usually when we add a new expression we add a couple of implementations using a two or three backends. Depending on what you're comfortable with, pandas and sql might be a nice pair to implement. after that, we can implement others piecemeal

@llllllllll
Copy link
Member Author

How should I go about adding the implement these; with pandas and sql, all three operators are already defined to act element-wise on the values so they would not be using these new nodes.

@cpcloud
Copy link
Member

cpcloud commented Apr 23, 2015

if we go the methods rather than operators route then you can do something like this for pandas (in compute/pandas.py):

@dispatch(Concat, pd.Series)
def compute_up(expr, data, **kwargs):
    return data.str.concat(expr.rhs)

there might be some hoop jumping required, take a look at the BinOp implementations in the same file

@llllllllll
Copy link
Member Author

Should the method names be: cat, interp and repeat to match the node names?

@cpcloud
Copy link
Member

cpcloud commented Apr 23, 2015

cat -> concat, others look good

@llllllllll
Copy link
Member Author

Should strings still be able to use the arithmetic operators or should you just use these new methods. I think I found a good way to allow them to use the operators.

@llllllllll
Copy link
Member Author

Should I move the Concat and Repeat into the expr.collections and Interp into string?

@cpcloud
Copy link
Member

cpcloud commented Apr 23, 2015

if there's a way that other backends such as sql and others that don't implement this kind of syntax can use it, then i'd be okay with allowing the operators

@cpcloud
Copy link
Member

cpcloud commented Apr 23, 2015

Should I move the Concat and Repeat into the expr.collections and Interp into string?

sounds good.

@@ -54,7 +55,14 @@ def isstring(ds):
return isinstance(getattr(measure, 'ty', measure), String)


_add, _radd = _mkbin('add', Concat)
_mod, _rmod = _mkbin('mod', Interp)
_mul, _rmul = _mkbin('mul', Repeat)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice. this should allow backends to use this syntax even if they don't know anything about it

@llllllllll
Copy link
Member Author

@cpcloud in your example, you have Concat doing elementwise string concat for dataframes; should it not call pd.concat?

EDIT: nvm, I think it is best to be consistent.

@llllllllll
Copy link
Member Author

I am having some trouble figuring out how to get a series of strings to qualify for the string ops. I started with numpy though. Also, I am not super experienced with sql so I am not sure where to start for the implementation for that back end.

@llllllllll
Copy link
Member Author

@cpcloud, wondering if this looks okay with the numpy stuff and if you have any ideas for getting pandas to work


@dispatch(Interp, base, np.ndarray)
def compute_up(t, lhs, rhs, **kwargs):
return np.char.mod(lhs, rhs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can consolidate this function and the one above by rewriting it like this:

@compute_up.register(Interp, base, np.ndarray)
@compute_up.register(Interp, np.ndarray, (np.ndarray, base))
def compute_up_np_char_mod(t, lhs, rhs, **kwargs):
    return np.char.mod(lhs, rhs)

Same for the above Concat and Repeat implementations as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I wasn't sure if you could do this; I was using the other models as an example. Would it make sense to add a style commit to refactor the neighbouring compute functions to do this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fine. you can also just lump that change into another commit if that's more convenient

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AttributeError: 'Dispatcher' object has no attribute 'name'

this doesn't seem to work

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't work with the @dispatch decorator. It should work if you use the register method on the Dispatcher itself.

https://github.com/mrocklin/multipledispatch/blob/master/multipledispatch/dispatcher.py#L74-L103

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't do this

@compute_up.register(...)
def compute_up(expr, data, **kwargs):
    # do stuff

You have to call it something besides compute_up if you use register

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, also try updating your version of multipledispatch by either pip install -U git+git://github.com/mrocklin/multipledispatch or conda update multipledispatch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrocklin Why wouldn't this work with @dispatch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first time we call dispatch on a new function we need the Dispatcher to be in the module's namespace, not the original function. That way when someone says from foo import myfunc they get the Dispatcher.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This naming bit isn't an issue if you create the dispatcher explicitly:

myfunc = Dispatcher('myfunc')

@myfunc.register(...)
def some_other_name(...):
    ...

@cpcloud
Copy link
Member

cpcloud commented Apr 27, 2015

@llllllllll Can you add some tests in test_numpy_compute.py for the 3 operations that you've implemented?

@cpcloud
Copy link
Member

cpcloud commented Apr 27, 2015

As for Pandas, you can do these operations with the following operations:

  • Concat: pd.Series.str.concat
  • Repeat: pd.Series.str.repeat
  • Interp: use the modulo operator

@llllllllll
Copy link
Member Author

Sorry; I had no issue writing to the compute functions; however, the methods were not appearing because I am not sure how to tag the pandas structures with these expression types.

@cpcloud
Copy link
Member

cpcloud commented Apr 27, 2015

Oh, do it like this:

@dispatch(Concat, pd.Series)
def compute_up(expr, data, **kwargs):
    # do stuff

@llllllllll
Copy link
Member Author

In [2]: ds = bz.Data(pd.Series(['a', 'b', 'c']))

In [4]: 'repeat' in dir(ds)
Out[4]: False

I am not seeing the methods to create the expressions that the compute_up will dispatch on. I think this is because the dhape of a series with strings is just object and it fails the isstring check

@cowlicks
Copy link
Member

@llllllllll I think you need to add them to schema_method_list here: https://github.com/quantopian/blaze/blob/str-catenation/blaze/expr/arithmetic.py#L413

@cpcloud
Copy link
Member

cpcloud commented May 3, 2015

I wonder if we shouldn't just shove numpy into pandas. numpy's char module algos seem to be 10x slower than pandas

In [23]: v
Out[23]:
array(['%d', '%s', '%d', ..., '%s', '%d', '%s'],
      dtype='<U2')

In [24]: s.values
Out[24]: array(['%d', '%s', '%d', ..., '%s', '%d', '%s'], dtype=object)

In [25]: %timeit np.char.mod(v, 1)
1 loops, best of 3: 3.31 s per loop

In [26]: %timeit s % 1
1 loops, best of 3: 328 ms per loop

In [27]: 3310/328
Out[27]: 10.091463414634147

In [31]: len(s)
Out[31]: 2000000

@llllllllll llllllllll closed this May 19, 2015
@llllllllll llllllllll reopened this May 19, 2015
@cpcloud
Copy link
Member

cpcloud commented May 19, 2015

@llllllllll what do you think about pushing numpy string algos to pandas?

@llllllllll
Copy link
Member Author

I wouldn't mind investigating; however, I think that the root of the issue with the speed is that pandas treats a series of strings as dtype object instead of strings so it must dispatch with PyNumber_Mod or something.

@cpcloud
Copy link
Member

cpcloud commented May 19, 2015

I would intuitively think that that would be slower than what numpy is doing. My numbers above suggest otherwise

@llllllllll
Copy link
Member Author

Ah, sorry, I misread. I will convert the numpy version to use pandas.

@llllllllll
Copy link
Member Author

The cost of the numpy function grows faster than the pandas version, however, it is better for smaller arrays. Some basic testing has shown that they cross at around n=135. To factor in the cost of creating the series from the array, I think that a nice win could be to have the numpy compute function delegate to pandas for arrays of n>=145. I will work on adding that now.

cache global lookups in function.__defaults__
@dispatch(Interp, Series)
def compute_up(t, data, **kwargs):
if isinstance(t.lhs, Expr):
return pd.Series(data.values % t.rhs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be

if ...:
    return data % t.rhs
else:
    return t.lhs % data

@cpcloud
Copy link
Member

cpcloud commented Jun 1, 2015

@llllllllll one last comment, then i think we're good to merge

@llllllllll
Copy link
Member Author

updated

@cpcloud
Copy link
Member

cpcloud commented Jun 2, 2015

merging

cpcloud added a commit that referenced this pull request Jun 2, 2015
@cpcloud cpcloud merged commit 42d1d5c into blaze:master Jun 2, 2015
@cpcloud
Copy link
Member

cpcloud commented Jun 2, 2015

@llllllllll thanks!

@llllllllll
Copy link
Member Author

yay

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: concat on add for string types
4 participants