New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

str_cat with Pandas `.str.cat()` interface #1496

Merged
merged 42 commits into from May 2, 2016

Conversation

Projects
None yet
3 participants
@kwmsmith
Member

kwmsmith commented Apr 28, 2016

  • Ensured that chaining .str_cat().str_cat() rolls up concatenation ops in SQL query.
  • Will be refactored as .str.cat() when .str accessor namespace is implemented.
  • Tested against Pandas / SQL / postgres backends.

ping @sandhujasmine

sandhujasmine added some commits Apr 11, 2016

ENH: Implement str_cat() for SQL backend.
Ad-hoc testing with postgres. Still need automated tests.
Separate str_cat() function from StrCat() class
- input error checking in str_cat
- clean up compute_up() functions
ENH: '_child' not needed - change it to 'lhs'
GH 1476
_child attribute had special meaning in blaze. For this function, we
only need a 'lhs' and 'rhs' argument so replace '_child' and 'col' with
'lhs' and 'rhs' where the data in 'rhs' concatenates to the data in
'lhs' elementwise.
TST: Move str_cat() exception tests to test_strings
GH 1476
also raise an exception if 'sep' kwarg is not a string
TST: Add test_str_cat() to check SQL
GH 1476
Table needs at least two string columns for good test so added a new
table 'accounts2' for this test.
Fix: [doctest] failure in docstring. Reformat and try again.
Reformat the example in docstring to see if that fixes the error thrown
by doctest.
TST: Add str_cat() test for null values in DataFrame
- use existing DataFrame (dfbig) for previous tests
- add a fixture to append row will null values and test str_cat()
Make str_cat() for NULL values consistent with Pandas str_cat
GH 1476
If a row in either column has a NULL, str_cat() returns None which is
consistent with pandas for na_rep kwarg set to None; current default
behavior.

todo: add na_rep as a kwarg.
BUG: Fix schema resulting from str_cat()
GH 1476 - if lhs or rhs arguments contain Nulls, then output of
str_cat() will also contain Null values.
ENH: Implement chaining str_cat() operation
GH 1476
Since str_cat() is a binary operation, we should be able to chain it to
concat more than one String columns - similar to pandas.
TST: Add test for chaining str_cat() feature
The table/DataFrame needs three string columns so updated test variables
as well.

sandhujasmine and others added some commits Apr 22, 2016

WIP: StrCat compiles when both inputs are Select clauses
- Handles the cases where str_cat operates on expression with WHERE
  clause: eg:
    s = symbol('s', ds2)
    t = s[s.amount <= 200]
    c1 = t.comment.str_cat(t.sex, sep=' -- ')

- pulled out manipuating data arrays into separate function
- added a test
- chaining str_cat() with WHERE clause is not consistent with pandas yet
- also unable to add (Select, Select) in existing compute_up decorator - it fails
WIP: Update decorate to remove duplicate code; mark test as xfail
The exception raised when trying to concat string columns from different
tables is not yet implemented fully for all use cases in the refactored
code. For now, marked a test as xfail.
@@ -1185,7 +1188,7 @@ def compute_up(t, s, **kwargs):
string_func_names = {
# <blaze function name>: <SQL function name>
'str_upper': 'upper',
'str_lower': 'lower',
'str_lower': 'lower'

This comment has been minimized.

@llllllllll

llllllllll Apr 28, 2016

Member

If we are making a style edit, should this just be:

string_func_names = {'str_upper': upper,
                     'str_lower': lower}
schema = self.rhs.schema
else:
# convert fixed length string to variable length string
schema = DataShape(String())

This comment has been minimized.

@llllllllll

llllllllll Apr 28, 2016

Member

should we preserve encoding?

else:
sel = select((lhs_col + rhs_col).label(expr.lhs._name))
wheres = unify_wheres([lhs, rhs])
if wheres is not None:

This comment has been minimized.

@llllllllll

llllllllll Apr 28, 2016

Member

I think we need to use reconstruct_select here because if the comes from a select with something like a groupby or sort this information will be lost.

@kwmsmith

This comment has been minimized.

Member

kwmsmith commented Apr 28, 2016

Thanks for the pointers @llllllllll -- reconstruct_select() was exactly what I was looking for; it's been a while since I've touched the sql stuff.

@pytest.mark.parametrize("expr",
[t.name.str_cat(t_str_cat.comment),
t.name.str_cat(t_str_cat.comment.str_cat(t_str_cat.name))])
def test_str_cat_runtime_exception(expr):

This comment has been minimized.

@sandhujasmine

sandhujasmine Apr 28, 2016

Contributor

Are we checking that the str_cat operates on the same table? If not, then we should take this test out - I saw a comment that we need to do this check more consistently throughout other blaze operations so perhaps remove the test?

@kwmsmith kwmsmith merged commit 0094aa4 into blaze:master May 2, 2016

@kwmsmith kwmsmith deleted the kwmsmith:feature-cat branch May 2, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment