-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String Arithmetics: __add__ ops #68
Conversation
👀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good!
However, it's a bit messy to do this at the moment, and I'd advise addressing some of the comments, then we can refactor to cleanup.
eland/operations.py
Outdated
"script": { | ||
"source": "doc[left_field].value / doc[right_field].value" | ||
} | ||
if not field_name.endswith("||str") and not field_name.startswith("str||"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of the check on field_name, it would be neater to add an 'op_type' to the task. i.e.
# task = ('arithmetic_op_fields', (field_name, (op_name, (left_field, right_field))))
is
# task = ('arithmetic_op_fields', (field_name, (op_name, op_type)(left_field, right_field)))
or similar. Then op_type
could be compared rather than field_name string match (magic match..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Soon we should also refactor operations.py
so tasks are subclass of a task class (using Bridge pattern or other). For now, if op_type
is added to the arithmetic_op_fields
task object, it will move this forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree that this could use a refactor. The task items will quickly become unmanageable as we add more ops. We should consider using dicts or named tuples as items.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implemented in 5307850 for string
types
eland/series.py
Outdated
new_field_name, op_method_name, left, self.name)) | ||
|
||
# name of Series remains original name | ||
series.name = self.name.replace('.keyword', '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing '.keyword' from the name may not always give expected behaviour. For instance in the pathological case where 'keyword' is embedded in the field name:
"C" : {
"properties" : {
"keyword" : {
"properties" : {
"str" : {
"type" : "keyword"
}
}
}
}
},```
A better way to do this is to call `Mappings.aggregatable_field_names` to map the requested field name to the aggregatable field name (remember to catch the possible ValueError).
In the case of non-aggregatable fields we can still do this using `_source` in the script. This is inefficient and I'd leave it to a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5307850 right now, removing the last occurrence of .keyword
in aggregatable field using for scripting. This should ensure that no embedded occurrences get removed for fields ending in .keyword
. Will update to check mappings for aggregatable fields
edadd = self.ed_ecommerce()['customer_first_name'] + self.ed_ecommerce()['customer_last_name'] | ||
pdadd = self.pd_ecommerce()['customer_first_name'] + self.pd_ecommerce()['customer_last_name'] | ||
|
||
assert_pandas_eland_series_equal(pdadd, edadd, check_less_precise=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove check_less_precise=True
in all this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done 5307850
@@ -503,6 +503,20 @@ def __add__(self, right): | |||
3 176.979996 | |||
4 82.980003 | |||
dtype: float64 | |||
>>> df.customer_first_name + df.customer_last_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we capture the behaviour on df.customer_first_name + " " + df.customer_last_name
and raise a separate issue if this fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raised in #69
eland/series.py
Outdated
return series | ||
|
||
elif self._dtype == 'object' and right._dtype == 'object': | ||
new_field_name = "str||{0}_{1}_{2}||str".format(self.name, method_name, right.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we don't rely on the field name of op_type
then we don't need to name it like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no longer using variants of str||
, implemented op_type
checking 5307850
Looks good. We need some refactoring as the complexity increases, but best to merge this first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Addresses #65 by adding support for
__add__
operation between:str
and aneland.Series
eland.Series
and astr
eland.Series
__mult__
was intentionally not implemented, as this could result in some unwanted behavior when the numeric operand is a large value. E.g.df['customer_first_name'] * 10000
. We could include a value check here, but any number I included seemed arbitrary. Happy to discuss further.