-
Notifications
You must be signed in to change notification settings - Fork 389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
str arithmetic #1058
str arithmetic #1058
Conversation
def test_str_arith(): | ||
assert isinstance(cs + cs, Add) | ||
assert isinstance(cs * 1, Mult) | ||
assert isinstance(cs % cs, Mod) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should have separate string operations rather than overload Add
. The fact that these are the same in our minds is mostly a Python thing but may not carry over well to other backends. It may be more clear to have a string concatenation operator. Being explicit in this way might help when building backends because we won't have to check the dtype in Add
operations for this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could do Cat
, Format
, and Repeat
to represent these new actions, how does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd call Format
Interp
because format
is a method that does string formatting in a very different way
Usually when we add a new expression we add a couple of implementations using a two or three backends. Depending on what you're comfortable with, pandas and sql might be a nice pair to implement. after that, we can implement others piecemeal |
How should I go about adding the implement these; with pandas and sql, all three operators are already defined to act element-wise on the values so they would not be using these new nodes. |
if we go the methods rather than operators route then you can do something like this for pandas (in @dispatch(Concat, pd.Series)
def compute_up(expr, data, **kwargs):
return data.str.concat(expr.rhs) there might be some hoop jumping required, take a look at the |
9fe77e1
to
206e492
Compare
Should the method names be: |
|
Should strings still be able to use the arithmetic operators or should you just use these new methods. I think I found a good way to allow them to use the operators. |
206e492
to
0a91009
Compare
Should I move the Concat and Repeat into the expr.collections and Interp into string? |
if there's a way that other backends such as sql and others that don't implement this kind of syntax can use it, then i'd be okay with allowing the operators |
sounds good. |
@@ -54,7 +55,14 @@ def isstring(ds): | |||
return isinstance(getattr(measure, 'ty', measure), String) | |||
|
|||
|
|||
_add, _radd = _mkbin('add', Concat) | |||
_mod, _rmod = _mkbin('mod', Interp) | |||
_mul, _rmul = _mkbin('mul', Repeat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice. this should allow backends to use this syntax even if they don't know anything about it
@cpcloud in your example, you have EDIT: nvm, I think it is best to be consistent. |
28c6688
to
8088182
Compare
I am having some trouble figuring out how to get a series of strings to qualify for the string ops. I started with numpy though. Also, I am not super experienced with sql so I am not sure where to start for the implementation for that back end. |
@cpcloud, wondering if this looks okay with the numpy stuff and if you have any ideas for getting pandas to work |
|
||
@dispatch(Interp, base, np.ndarray) | ||
def compute_up(t, lhs, rhs, **kwargs): | ||
return np.char.mod(lhs, rhs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can consolidate this function and the one above by rewriting it like this:
@compute_up.register(Interp, base, np.ndarray)
@compute_up.register(Interp, np.ndarray, (np.ndarray, base))
def compute_up_np_char_mod(t, lhs, rhs, **kwargs):
return np.char.mod(lhs, rhs)
Same for the above Concat
and Repeat
implementations as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I wasn't sure if you could do this; I was using the other models as an example. Would it make sense to add a style commit to refactor the neighbouring compute functions to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's fine. you can also just lump that change into another commit if that's more convenient
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AttributeError: 'Dispatcher' object has no attribute 'name'
this doesn't seem to work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't work with the @dispatch
decorator. It should work if you use the register
method on the Dispatcher itself.
https://github.com/mrocklin/multipledispatch/blob/master/multipledispatch/dispatcher.py#L74-L103
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't do this
@compute_up.register(...)
def compute_up(expr, data, **kwargs):
# do stuff
You have to call it something besides compute_up
if you use register
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, also try updating your version of multipledispatch
by either pip install -U git+git://github.com/mrocklin/multipledispatch
or conda update multipledispatch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrocklin Why wouldn't this work with @dispatch
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first time we call dispatch
on a new function we need the Dispatcher to be in the module's namespace, not the original function. That way when someone says from foo import myfunc
they get the Dispatcher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This naming bit isn't an issue if you create the dispatcher explicitly:
myfunc = Dispatcher('myfunc')
@myfunc.register(...)
def some_other_name(...):
...
@llllllllll Can you add some tests in |
As for Pandas, you can do these operations with the following operations:
|
Sorry; I had no issue writing to the compute functions; however, the methods were not appearing because I am not sure how to tag the pandas structures with these expression types. |
Oh, do it like this: @dispatch(Concat, pd.Series)
def compute_up(expr, data, **kwargs):
# do stuff |
In [2]: ds = bz.Data(pd.Series(['a', 'b', 'c']))
In [4]: 'repeat' in dir(ds)
Out[4]: False I am not seeing the methods to create the expressions that the compute_up will dispatch on. I think this is because the dhape of a series with strings is just object and it fails the isstring check |
@llllllllll I think you need to add them to |
0361a02
to
144a1d4
Compare
I wonder if we shouldn't just shove numpy into pandas. numpy's char module algos seem to be 10x slower than pandas In [23]: v
Out[23]:
array(['%d', '%s', '%d', ..., '%s', '%d', '%s'],
dtype='<U2')
In [24]: s.values
Out[24]: array(['%d', '%s', '%d', ..., '%s', '%d', '%s'], dtype=object)
In [25]: %timeit np.char.mod(v, 1)
1 loops, best of 3: 3.31 s per loop
In [26]: %timeit s % 1
1 loops, best of 3: 328 ms per loop
In [27]: 3310/328
Out[27]: 10.091463414634147
In [31]: len(s)
Out[31]: 2000000 |
@llllllllll what do you think about pushing numpy string algos to pandas? |
I wouldn't mind investigating; however, I think that the root of the issue with the speed is that pandas treats a series of strings as dtype |
I would intuitively think that that would be slower than what numpy is doing. My numbers above suggest otherwise |
Ah, sorry, I misread. I will convert the numpy version to use pandas. |
The cost of the numpy function grows faster than the pandas version, however, it is better for smaller arrays. Some basic testing has shown that they cross at around n=135. To factor in the cost of creating the series from the array, I think that a nice win could be to have the numpy compute function delegate to pandas for arrays of n>=145. I will work on adding that now. |
cache global lookups in function.__defaults__
@dispatch(Interp, Series) | ||
def compute_up(t, data, **kwargs): | ||
if isinstance(t.lhs, Expr): | ||
return pd.Series(data.values % t.rhs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be
if ...:
return data % t.rhs
else:
return t.lhs % data
@llllllllll one last comment, then i think we're good to merge |
updated |
merging |
@llllllllll thanks! |
yay |
Closes #1045
I wanted to try to learn how some of the symbolic stuff worked so I figured this was an easy one to tackle.
Let me know if I this needs more tests or the tests are in the wrong place.