Codecov Report
@@            Coverage Diff             @@
##           master      #51      +/-   ##
==========================================
+ Coverage   87.14%   87.19%   +0.04%
==========================================
  Files          74       74
  Lines        6946     6973      +27
==========================================
+ Hits         6053     6080      +27
  Misses        893      893
kmax12
left a comment
Good to merge after these changes
    name = "weekday"

class NumCharacter(TransformPrimitive):
Should we name this NumCharacters to be consistent with NumWords below?
    return_type = Numeric

    def get_function(self):
        return lambda array: pd.Series([len(x) for x in array])
I think you can use more of pandas' built-in syntax for this, since the array variable is going to be a pandas Series (please double-check this, though).
In [1]: array = pd.Series(["1","12 2","1212 3"])
In [2]: array.str.len()
Out[2]:
0 1
1 4
2 6
dtype: int64
so, it would just be
    def get_function(self):
        return lambda array: array.str.len()
Array is a numpy.ndarray
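To illustrate the point above, here is a minimal sketch (the sample values are made up for illustration and are not taken from the PR). It assumes the primitive receives a numpy.ndarray of strings, as stated, and checks that such an array does not expose the pandas `.str` accessor, while wrapping it in a `pd.Series` makes `.str.len()` available and gives the same counts as the list comprehension.

```python
import numpy as np
import pandas as pd

# Hypothetical sample input; the primitive reportedly receives a numpy.ndarray.
array = np.array(["1", "12 2", "1212 3"], dtype=object)

# A plain ndarray has no pandas string accessor.
has_str_accessor = hasattr(array, "str")

# Approach from the diff: list comprehension wrapped in a Series.
via_comprehension = pd.Series([len(x) for x in array])

# Alternative: wrap the ndarray first, then use the vectorized accessor.
via_accessor = pd.Series(array).str.len()

print(has_str_accessor)
print(via_comprehension.tolist())
print(via_accessor.tolist())
```

Both approaches should produce the same character counts here; the difference is only in style.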
Thanks for double-checking. I don't have a strong preference, but perhaps do this to avoid the list comprehension:
    pd.Series(array).str.len()

    return_type = Numeric

    def get_function(self):
        return lambda array: pd.Series([len(x.split(" ")) for x in array])
Similar to above, you can do
    def get_function(self):
        return lambda array: array.str.split(" ").str.len()
Actually, this is probably better:
    def get_function(self):
        return lambda array: array.str.count(" ") + 1
It's easier to read and might be up to 25% faster.
Those aren't quite equivalent - what if there's some leading or trailing whitespace, or if there's more than one space between words?
I think both pieces of code handle whitespace and multiple spaces the same way, because they both look for just " ". To handle multiple spaces, the only way I can think of is a regex, and I'd rather keep it simple for now. However, for trailing or leading whitespace we could do
    def get_function(self):
        return lambda array: array.str.strip().str.count(" ") + 1
Multiple spaces could be taken care of in the str.split(" ") case, since they'll show up as empty strings. You could just remove them with
    new_list = [x for x in split_list if x != '']
For now, it has been changed to str.count(" ") + 1
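The variants discussed in this thread can be compared side by side. The following is a rough sketch, not the code that was merged: the sample strings and variable names are hypothetical, chosen only to exercise leading whitespace, trailing whitespace, and a double space.

```python
import pandas as pd

# Hypothetical inputs chosen to exercise the edge cases discussed above.
text = pd.Series(["one two", " leading", "trailing ", "two  spaces"])

# Variant 1: split on a single space and count the pieces.
split_len = text.str.split(" ").str.len()

# Variant 2: count single spaces and add one.
count_plus_one = text.str.count(" ") + 1

# Variant 3: strip leading/trailing whitespace before counting spaces.
stripped_count = text.str.strip().str.count(" ") + 1

# Variant 4: split on a single space, then drop the empty strings.
filtered = text.apply(lambda s: len([w for w in s.split(" ") if w != ""]))

comparison = pd.DataFrame({
    "split_len": split_len,
    "count_plus_one": count_plus_one,
    "stripped_count": stripped_count,
    "filtered": filtered,
})
print(comparison)
```

On these inputs the first two variants agree on every row, stripping corrects only the leading and trailing cases, and only the filtered split reports two words for the double-space string.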
Add two text primitives, NumWords and NumCharacter, which count the number of words and the number of characters when the variable type is Text.