Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String Arithmetics: __add__ ops #68

Merged
merged 5 commits into from
Nov 27, 2019

Conversation

blaklaybul
Copy link
Contributor

Addresses #65 by adding support for __add__ operation between:

  1. a str and an eland.Series
  2. an eland.Series and a str
  3. 2 eland.Series

__mult__ was intentionally not implemented, as this could result in some unwanted behavior when the numeric operand is a large value. E.g. df['customer_first_name'] * 10000. We could include a value check here, but any number I included seemed arbitrary. Happy to discuss further.

@blaklaybul blaklaybul added enhancement New feature or request topic:series Issue or PR about eland.Series labels Nov 26, 2019
@blaklaybul blaklaybul self-assigned this Nov 26, 2019
@stevedodson
Copy link
Contributor

👀

Copy link
Contributor

@stevedodson stevedodson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good!

However, it's a bit messy to do this at the moment, and I'd advise addressing some of the comments, then we can refactor to cleanup.

"script": {
"source": "doc[left_field].value / doc[right_field].value"
}
if not field_name.endswith("||str") and not field_name.startswith("str||"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of the check on field_name, it would be neater to add an 'op_type' to the task. i.e.
# task = ('arithmetic_op_fields', (field_name, (op_name, (left_field, right_field)))) is
# task = ('arithmetic_op_fields', (field_name, (op_name, op_type)(left_field, right_field)))
or similar. Then op_type could be compared rather than field_name string match (magic match..)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Soon we should also refactor operations.py so tasks are subclass of a task class (using Bridge pattern or other). For now, if op_type is added to the arithmetic_op_fields task object, it will move this forward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree that this could use a refactor. The task items will quickly become unmanageable as we add more ops. We should consider using dicts or named tuples as items.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implemented in 5307850 for string types

eland/series.py Outdated
new_field_name, op_method_name, left, self.name))

# name of Series remains original name
series.name = self.name.replace('.keyword', '')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing '.keyword' from the name may not always give expected behaviour. For instance in the pathological case where 'keyword' is embedded in the field name:

    "C" : {
          "properties" : {
            "keyword" : {
              "properties" : {
                "str" : {
                  "type" : "keyword"
                }
              }
            }
          }
        },```

A better way to do this is to call `Mappings.aggregatable_field_names` to map the requested field name to the aggregatable field name (remember to catch the possible ValueError). 

In the case of non-aggregatable fields we can still do this using `_source` in the script. This is inefficient and I'd leave it to a separate issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5307850 right now, removing the last occurrence of .keyword in aggregatable field using for scripting. This should ensure that no embedded occurrences get removed for fields ending in .keyword. Will update to check mappings for aggregatable fields

edadd = self.ed_ecommerce()['customer_first_name'] + self.ed_ecommerce()['customer_last_name']
pdadd = self.pd_ecommerce()['customer_first_name'] + self.pd_ecommerce()['customer_last_name']

assert_pandas_eland_series_equal(pdadd, edadd, check_less_precise=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove check_less_precise=True in all this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done 5307850

@@ -503,6 +503,20 @@ def __add__(self, right):
3 176.979996
4 82.980003
dtype: float64
>>> df.customer_first_name + df.customer_last_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we capture the behaviour on df.customer_first_name + " " + df.customer_last_name and raise a separate issue if this fails.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raised in #69

eland/series.py Outdated
return series

elif self._dtype == 'object' and right._dtype == 'object':
new_field_name = "str||{0}_{1}_{2}||str".format(self.name, method_name, right.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we don't rely on the field name of op_type then we don't need to name it like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer using variants of str|| , implemented op_type checking 5307850

@stevedodson
Copy link
Contributor

Looks good. We need some refactoring as the complexity increases, but best to merge this first.

Copy link
Contributor

@stevedodson stevedodson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request topic:series Issue or PR about eland.Series
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants