str arithmetic #1058

llllllllll · 2015-04-23T15:35:03Z

I wanted to try to learn how some of the symbolic stuff worked so I figured this was an easy one to tackle.

In [1]: import blaze as bz

In [2]: ds = bz.Data([('%s', 'a'), ('%s', 'b')], fields=['a', 'b'])

In [3]: ds.a + ds.b
Out[3]: 

0  %sa
1  %sb

In [4]: ds.a * 2
Out[4]: 
      a
0  %s%s
1  %s%s

In [5]: ds.a % ds.b
Out[5]: 

0  a
1  b

Let me know if this needs more tests or if the tests are in the wrong place.

mrocklin · 2015-04-23T17:40:45Z

blaze/expr/tests/test_arithmetic.py

+def test_str_arith():
+    assert isinstance(cs + cs, Add)
+    assert isinstance(cs * 1, Mult)
+    assert isinstance(cs % cs, Mod)


I wonder if we should have separate string operations rather than overload Add. The fact that these are the same in our minds is mostly a Python thing but may not carry over well to other backends. It may be more clear to have a string concatenation operator. Being explicit in this way might help when building backends because we won't have to check the dtype in Add operations for this case.

I could do Cat, Format, and Repeat to represent these new actions, how does that sound?

i'd call Format Interp because format is a method that does string formatting in a very different way
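As a rough illustration of the trade-off discussed above, here is a minimal sketch of dedicated string nodes. The names Concat, Interp and Repeat are the ones proposed in this thread; the `BinOp` base class, the `HANDLERS` table and the `evaluate` helper are hypothetical stand-ins and are not blaze's actual classes.

```python
class BinOp:
    """Hypothetical two-argument expression node (not blaze's BinOp)."""

    def __init__(self, lhs, rhs):
        self.lhs = lhs
        self.rhs = rhs


class Concat(BinOp):
    """String concatenation."""


class Interp(BinOp):
    """printf-style string interpolation."""


class Repeat(BinOp):
    """String repetition."""


# One handler per node type: a backend picks its implementation from the
# node class alone and never has to inspect dtypes inside a generic Add.
HANDLERS = {
    Concat: lambda lhs, rhs: lhs + rhs,
    Interp: lambda lhs, rhs: lhs % rhs,
    Repeat: lambda lhs, rhs: lhs * rhs,
}


def evaluate(node):
    """Evaluate a node tree whose leaves are plain Python values."""
    lhs = evaluate(node.lhs) if isinstance(node.lhs, BinOp) else node.lhs
    rhs = evaluate(node.rhs) if isinstance(node.rhs, BinOp) else node.rhs
    return HANDLERS[type(node)](lhs, rhs)
```

A SQL backend could map the same three node types to its own concatenation and repetition functions without ever looking at Python's operator overloads.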

cpcloud · 2015-04-23T19:18:27Z

Usually when we add a new expression we add a couple of implementations using two or three backends. Depending on what you're comfortable with, pandas and sql might be a nice pair to implement. After that, we can implement others piecemeal.

llllllllll · 2015-04-23T19:26:38Z

How should I go about implementing these? With pandas and sql, all three operators are already defined to act element-wise on the values, so they would not be using these new nodes.

cpcloud · 2015-04-23T19:30:57Z

if we go the methods rather than operators route then you can do something like this for pandas (in compute/pandas.py):

@dispatch(Concat, pd.Series)
def compute_up(expr, data, **kwargs):
    # pandas spells element-wise string concatenation Series.str.cat
    return data.str.cat(expr.rhs)

there might be some hoop jumping required, take a look at the BinOp implementations in the same file

llllllllll · 2015-04-23T19:32:36Z

Should the method names be: cat, interp and repeat to match the node names?

cpcloud · 2015-04-23T19:33:18Z

cat -> concat, others look good

llllllllll · 2015-04-23T19:54:25Z

Should strings still be able to use the arithmetic operators, or should you just use these new methods? I think I found a good way to allow them to use the operators.

llllllllll · 2015-04-23T20:05:57Z

Should I move the Concat and Repeat into the expr.collections and Interp into string?

cpcloud · 2015-04-23T20:12:32Z

if there's a way that other backends such as sql and others that don't implement this kind of syntax can use it, then i'd be okay with allowing the operators

cpcloud · 2015-04-23T20:13:30Z

Should I move the Concat and Repeat into the expr.collections and Interp into string?

sounds good.

cpcloud · 2015-04-23T20:30:14Z

blaze/expr/strings.py

@@ -54,7 +55,14 @@ def isstring(ds):
    return isinstance(getattr(measure, 'ty', measure), String)


+_add, _radd = _mkbin('add', Concat)
+_mod, _rmod = _mkbin('mod', Interp)
+_mul, _rmul = _mkbin('mul', Repeat)


nice. this should allow backends to use this syntax even if they don't know anything about it
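The diff only shows the call sites of `_mkbin`, so the following is a guess at the shape of such a helper rather than the code in the PR: it returns a forward and a reflected operator function that build the given node type. `Expr`, `BinOp` and `StringColumn` are hypothetical names used only for this sketch.

```python
class Expr:
    """Hypothetical expression base class."""


class BinOp(Expr):
    def __init__(self, lhs, rhs):
        self.lhs = lhs
        self.rhs = rhs


class Concat(BinOp):
    pass


class Interp(BinOp):
    pass


class Repeat(BinOp):
    pass


def _mkbin(name, cons):
    """Return (forward, reflected) functions that build `cons` nodes."""

    def forward(self, other):
        return cons(self, other)

    def reflected(self, other):
        # Reflected operators receive the operands swapped.
        return cons(other, self)

    forward.__name__ = '__%s__' % name
    reflected.__name__ = '__r%s__' % name
    return forward, reflected


_add, _radd = _mkbin('add', Concat)
_mod, _rmod = _mkbin('mod', Interp)
_mul, _rmul = _mkbin('mul', Repeat)


class StringColumn(Expr):
    """Hypothetical string-typed leaf expression."""

    def __init__(self, name):
        self.name = name

    __add__, __radd__ = _add, _radd
    __mod__, __rmod__ = _mod, _rmod
    __mul__, __rmul__ = _mul, _rmul
```

With this arrangement `a + b` on string expressions yields a Concat node, so a backend only ever sees the explicit node types and the operators remain pure syntax.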

llllllllll · 2015-04-23T21:05:18Z

@cpcloud in your example, you have Concat doing elementwise string concat for dataframes; should it not call pd.concat?

EDIT: nvm, I think it is best to be consistent.

llllllllll · 2015-04-24T20:19:37Z

I am having some trouble figuring out how to get a series of strings to qualify for the string ops. I started with numpy though. Also, I am not super experienced with sql so I am not sure where to start for the implementation for that back end.

llllllllll · 2015-04-27T17:35:33Z

@cpcloud, wondering if this looks okay with the numpy stuff and if you have any ideas for getting pandas to work

cpcloud · 2015-04-27T17:37:56Z

blaze/compute/numpy.py

+
+@dispatch(Interp, base, np.ndarray)
+def compute_up(t, lhs, rhs, **kwargs):
+    return np.char.mod(lhs, rhs)


You can consolidate this function and the one above by rewriting it like this:

@compute_up.register(Interp, base, np.ndarray)
@compute_up.register(Interp, np.ndarray, (np.ndarray, base))
def compute_up_np_char_mod(t, lhs, rhs, **kwargs):
    return np.char.mod(lhs, rhs)

Same for the above Concat and Repeat implementations as well.
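For reference, a small sketch of the three element-wise operations using functions from `np.char`. Only `np.char.mod` appears in the diff above; pairing `np.char.add` with Concat and `np.char.multiply` with Repeat is an assumption about what the other two implementations call.

```python
import numpy as np

fmt = np.array(['%s', '%s'])
val = np.array(['a', 'b'])

concat = np.char.add(fmt, val)      # element-wise fmt + val
repeat = np.char.multiply(fmt, 2)   # element-wise fmt * 2
interp = np.char.mod(fmt, val)      # element-wise fmt % val
```

The inputs mirror the two-row table from the opening session of this thread.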

ah, I wasn't sure if you could do this; I was using the other models as an example. Would it make sense to add a style commit to refactor the neighbouring compute functions to do this?

that's fine. you can also just lump that change into another commit if that's more convenient

AttributeError: 'Dispatcher' object has no attribute 'name'

this doesn't seem to work

It won't work with the @dispatch decorator. It should work if you use the register method on the Dispatcher itself.

https://github.com/mrocklin/multipledispatch/blob/master/multipledispatch/dispatcher.py#L74-L103

You can't do this

@compute_up.register(...)
def compute_up(expr, data, **kwargs):
    # do stuff

You have to call it something besides compute_up if you use register

Oh, also try updating your version of multipledispatch by either pip install -U git+git://github.com/mrocklin/multipledispatch or conda update multipledispatch

@mrocklin Why wouldn't this work with @dispatch?

The first time we call dispatch on a new function we need the Dispatcher to be in the module's namespace, not the original function. That way when someone says from foo import myfunc they get the Dispatcher.

This naming bit isn't an issue if you create the dispatcher explicitly:

myfunc = Dispatcher('myfunc')

@myfunc.register(...)
def some_other_name(...):
    ...
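The naming problem can be reproduced with a toy registry. `ToyDispatcher` below is a hypothetical stand-in written for this sketch and is not multipledispatch's `Dispatcher`; it assumes, as the comments above imply, that `register` hands back the decorated function unchanged.

```python
class ToyDispatcher:
    """Toy multiple-dispatch registry keyed on exact argument types."""

    def __init__(self, name):
        self.name = name
        self.funcs = {}

    def register(self, *types):
        def decorator(func):
            self.funcs[types] = func
            return func  # the plain function, not the dispatcher
        return decorator

    def __call__(self, *args):
        return self.funcs[tuple(type(a) for a in args)](*args)


myfunc = ToyDispatcher('myfunc')


# Stacked registrations share one implementation. The implementation has
# its own name, so `myfunc` keeps pointing at the dispatcher.
@myfunc.register(int, str)
@myfunc.register(str, int)
def repeat_either_order(a, b):
    return a * b


shadowed = ToyDispatcher('shadowed')


# Reusing the dispatcher's name rebinds it to the plain function, and the
# dispatcher object is no longer reachable under that name.
@shadowed.register(str, str)
def shadowed(a, b):
    return a + b
```

After this runs, `myfunc` is still the registry while `shadowed` is an ordinary function, which is why the implementation needs a different name when `register` is used directly.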

cpcloud · 2015-04-27T17:39:59Z

@llllllllll Can you add some tests in test_numpy_compute.py for the 3 operations that you've implemented?

cpcloud · 2015-04-27T17:44:42Z

As for Pandas, you can do these operations with the following operations:

Concat: pd.Series.str.cat
Repeat: pd.Series.str.repeat
Interp: use the modulo operator
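A short sketch of those three pandas operations on object-dtype Series. In pandas the element-wise concatenation method is `Series.str.cat`. The explicit `dtype=object` is an assumption made here so that `%` falls through to Python's string formatting on each element.

```python
import pandas as pd

fmt = pd.Series(['%s', '%s'], dtype=object)
val = pd.Series(['a', 'b'], dtype=object)

concat = fmt.str.cat(val)    # element-wise fmt + val
repeat = fmt.str.repeat(2)   # element-wise fmt * 2
interp = fmt % val           # element-wise fmt % val
```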

llllllllll · 2015-04-27T17:47:01Z

Sorry; I had no issue writing to the compute functions; however, the methods were not appearing because I am not sure how to tag the pandas structures with these expression types.

cpcloud · 2015-04-27T17:53:56Z

Oh, do it like this:

@dispatch(Concat, pd.Series)
def compute_up(expr, data, **kwargs):
    # do stuff

llllllllll · 2015-04-27T17:59:30Z

In [2]: ds = bz.Data(pd.Series(['a', 'b', 'c']))

In [4]: 'repeat' in dir(ds)
Out[4]: False

I am not seeing the methods to create the expressions that the compute_up will dispatch on. I think this is because the dshape of a series with strings is just object, so it fails the isstring check.

cowlicks · 2015-04-27T18:05:24Z

@llllllllll I think you need to add them to schema_method_list here: https://github.com/quantopian/blaze/blob/str-catenation/blaze/expr/arithmetic.py#L413
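To show why the methods go missing when the predicate fails, here is a toy version of a predicate-gated method list. The structure (pairs of a schema predicate and a set of functions) is an assumption made for illustration; blaze's real `schema_method_list` and expression classes may be shaped differently, and `Field` is a hypothetical name.

```python
def isstring(schema):
    """Toy predicate: schemas are plain strings in this sketch."""
    return schema == 'string'


def repeat(expr, n):
    return ('Repeat', expr.name, n)


def concat(expr, other):
    return ('Concat', expr.name, other)


# Each entry pairs a predicate with the functions it unlocks as methods.
schema_method_list = [
    (isstring, {repeat, concat}),
]


class Field:
    def __init__(self, name, schema):
        self.name = name
        self.schema = schema

    def _methods(self):
        return {
            func.__name__: func
            for predicate, funcs in schema_method_list
            if predicate(self.schema)
            for func in funcs
        }

    def __dir__(self):
        return sorted(set(super().__dir__()) | set(self._methods()))

    def __getattr__(self, key):
        # Only reached when normal attribute lookup fails.
        try:
            func = self._methods()[key]
        except KeyError:
            raise AttributeError(key)
        return lambda *args: func(self, *args)
```

A field whose schema is reported as `'object'` fails `isstring`, so `repeat` never shows up in `dir()`, which matches the symptom in the session above.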

cpcloud · 2015-05-03T01:07:15Z

I wonder if we shouldn't just shove numpy into pandas. numpy's char module algos seem to be 10x slower than pandas

In [23]: v
Out[23]:
array(['%d', '%s', '%d', ..., '%s', '%d', '%s'],
      dtype='<U2')

In [24]: s.values
Out[24]: array(['%d', '%s', '%d', ..., '%s', '%d', '%s'], dtype=object)

In [25]: %timeit np.char.mod(v, 1)
1 loops, best of 3: 3.31 s per loop

In [26]: %timeit s % 1
1 loops, best of 3: 328 ms per loop

In [27]: 3310/328
Out[27]: 10.091463414634147

In [31]: len(s)
Out[31]: 2000000

cpcloud · 2015-05-19T17:02:45Z

@llllllllll what do you think about pushing numpy string algos to pandas?

llllllllll · 2015-05-19T17:13:45Z

I wouldn't mind investigating; however, I think that the root of the issue with the speed is that pandas treats a series of strings as dtype object instead of strings so it must dispatch with PyNumber_Mod or something.

cpcloud · 2015-05-19T17:16:22Z

I would intuitively think that that would be slower than what numpy is doing. My numbers above suggest otherwise

llllllllll · 2015-05-19T17:20:14Z

Ah, sorry, I misread. I will convert the numpy version to use pandas.

llllllllll · 2015-05-23T15:50:21Z

The cost of the numpy function grows faster than that of the pandas version; however, numpy is faster for smaller arrays. Some basic testing has shown that they cross at around n=135. To factor in the cost of creating the series from the array, I think that a nice win could be to have the numpy compute function delegate to pandas for arrays of n>=145. I will work on adding that now.
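A minimal sketch of the length-based delegation described above, not the code that was merged. The cut-over of 145 comes from the comment; the names `interp` and `PANDAS_THRESHOLD` are hypothetical, and the right threshold would depend on the machine and library versions.

```python
import numpy as np
import pandas as pd

PANDAS_THRESHOLD = 145  # cut-over proposed in the comment above


def interp(arr, value, threshold=PANDAS_THRESHOLD):
    """Element-wise arr % value, choosing the backend by array length."""
    if len(arr) < threshold:
        # Small arrays: stay in numpy and skip building a Series.
        return np.char.mod(arr, value)
    # Large arrays: an object-dtype Series applies Python's % per element.
    result = pd.Series(arr, dtype=object) % value
    # Convert back to a numpy string array so both paths return the same kind.
    return result.to_numpy().astype(str)
```

Both branches are meant to return equal values, so callers should not be able to tell which backend ran apart from the timing.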

cache global lookups in function.__defaults__

cpcloud · 2015-05-26T18:48:16Z

blaze/compute/pandas.py

+@dispatch(Interp, Series)
+def compute_up(t, data, **kwargs):
+    if isinstance(t.lhs, Expr):
+        return pd.Series(data.values % t.rhs)


I think this should be

if ...:
    return data % t.rhs
else:
    return t.lhs % data

cpcloud · 2015-06-01T18:39:36Z

@llllllllll one last comment, then i think we're good to merge

llllllllll · 2015-06-02T00:39:23Z

updated

cpcloud · 2015-06-02T18:03:50Z

merging

str arithmetic

cpcloud · 2015-06-02T18:03:59Z

@llllllllll thanks!

llllllllll · 2015-06-02T18:32:50Z

yay

mrocklin reviewed Apr 23, 2015

cpcloud self-assigned this Apr 23, 2015

cpcloud added this to the 0.8.1 milestone Apr 23, 2015

cpcloud added the enhancement, expression core, new expression, and strings labels Apr 23, 2015

llllllllll force-pushed the str-catenation branch from 9fe77e1 to 206e492 Compare April 23, 2015 19:31

llllllllll force-pushed the str-catenation branch from 206e492 to 0a91009 Compare April 23, 2015 20:02

cpcloud reviewed Apr 23, 2015

llllllllll force-pushed the str-catenation branch from 28c6688 to 8088182 Compare April 24, 2015 20:18

cpcloud reviewed Apr 27, 2015

Joe Jevnik added 7 commits April 29, 2015 15:39

MAINT: Move string handling to string module

0bdef66

ENH: Adds numpy compute backend for string ops

1612af5

MAINT: Reduce code duplication in numpy compute

37806b9

TST: Adds numpy tests for str expressions

af8aedb

ENH: Compute backend for pandas

82bb5f8

ENH: Work on pandas compute for string ops

9aa9f6e

TST: Adds pandas string tests

144a1d4

llllllllll force-pushed the str-catenation branch from 0361a02 to 144a1d4 Compare April 29, 2015 19:39

llllllllll closed this May 3, 2015

llllllllll reopened this May 3, 2015

llllllllll closed this May 19, 2015

llllllllll reopened this May 19, 2015

PERF: Delegate to faster string interp function based on array len.

e3f509e

cache global lookups in function.__defaults__

cpcloud reviewed May 26, 2015

BUG: Just use the __mod__ operator for pandas str interp

3a65d6c

TST: Adds string tests with scalar values

0be7087

cpcloud added a commit that referenced this pull request Jun 2, 2015

Merge pull request #1058 from quantopian/str-catenation

42d1d5c

str arithmetic

cpcloud merged commit 42d1d5c into blaze:master Jun 2, 2015
