Add Words and Chars primitives #51

Merged
merged 6 commits into from Jan 2, 2018

Conversation

Seth-Rothschild commented Dec 22, 2017
Contributor

Add two text primitives, NumWords and NumCharacter, which count the number of words and the number of characters, respectively, when the variable type is Text.

Seth-Rothschild added 4 commits Dec 22, 2017

codecov-io commented Dec 22, 2017

Codecov Report

Merging #51 into master will increase coverage by 0.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master      #51      +/-   ##
==========================================
+ Coverage   87.14%   87.19%   +0.04%     
==========================================
  Files          74       74              
  Lines        6946     6973      +27     
==========================================
+ Hits         6053     6080      +27     
  Misses        893      893

| Impacted Files | Coverage Δ |
|---|---|
| primitives/transform_primitive.py | 97.46% <0%> (+0.13%) ⬆️ |
| .../feature_function_tests/test_transform_features.py | 85.45% <0%> (+0.37%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4d30e76...687e1b0.

kmax12 left a comment
Member

Good to merge after these changes

@@ -336,6 +337,30 @@ class Weekday(DatetimeUnitBasePrimitive):
name = "weekday"


class NumCharacter(TransformPrimitive):

kmax12 Dec 26, 2017
Member

Should we name this NumCharacters to be consistent with NumWords below?

Seth-Rothschild Jan 1, 2018
Author Contributor

Changed

return_type = Numeric

def get_function(self):
    return lambda array: pd.Series([len(x) for x in array])

kmax12 Dec 26, 2017
Member

I think you can use more of pandas' built-in syntax for this, since the array variable is going to be a pandas Series (please double-check this, though)

In [1]: array = pd.Series(["1","12 2","1212 3"])

In [2]: array.str.len()
Out[2]: 
0    1
1    4
2    6
dtype: int64

so, it would just be

def get_function(self):
    return lambda array: array.str.len()

Seth-Rothschild Jan 1, 2018
Author Contributor

Array is a numpy.ndarray

kmax12 Jan 1, 2018
Member

Thanks for double-checking. I don't have a strong preference, but perhaps do this to avoid the list comprehension:

pd.Series(array).str.len()
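
The point settled in this thread can be sketched as follows. This is a minimal illustration, assuming the function receives a `numpy.ndarray` of strings as reported above; the helper name `num_characters` is hypothetical and is not the code merged in the PR.

```python
import numpy as np
import pandas as pd


def num_characters(array):
    """Count characters per element (hypothetical helper, not the PR's code)."""
    # A numpy.ndarray has no .str accessor, so wrap it in a Series first.
    return pd.Series(array).str.len()


# Same sample values as the IPython session above, but as an ndarray.
values = np.array(["1", "12 2", "1212 3"], dtype=object)
lengths = num_characters(values)
print(lengths.tolist())  # [1, 4, 6]
```

Wrapping in `pd.Series` keeps the vectorized string accessor available without an explicit list comprehension.
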
return_type = Numeric

def get_function(self):
    return lambda array: pd.Series([len(x.split(" ")) for x in array])

kmax12 Dec 26, 2017
Member

similar to above, you can do

def get_function(self):
    return lambda array: array.str.split(" ").str.len()

kmax12 Dec 26, 2017
Member

actually, this is probably better.

def get_function(self):
    return lambda array: array.str.count(" ") + 1

easier to read and might be up to 25% faster
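
As a rough check that the two suggestions agree, the sketch below compares them on a few sample strings. The sample data is made up for illustration, and it assumes the input is already a pandas Series. It makes no attempt to measure the speed difference mentioned above.

```python
import pandas as pd

# Made-up samples: single word, two words, a double space, a leading space.
texts = pd.Series(["one", "two words", "a  b", " lead"])

# Split on a literal single space and count the pieces.
by_split = texts.str.split(" ").str.len()

# Count literal single spaces and add one.
by_count = texts.str.count(" ") + 1

print(by_split.tolist())  # [1, 2, 3, 2]
print(by_count.tolist())  # [1, 2, 3, 2]
```

Both forms count every single space as a separator, so on these samples they give the same numbers, including the over-count for the double space and the leading space.
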

PaulHobbs Dec 29, 2017

Those aren't quite equivalent. What if there's some leading or trailing whitespace, or if there's more than one space between words?

kmax12 Dec 29, 2017
Member

I think both pieces of code handle leading or trailing whitespace and multiple spaces the same way, because they look for just " ". To handle multiple spaces, the only way I can think of is a regex, and I'd rather keep it simple for now. However, for trailing or leading whitespace we could do

def get_function(self):
    return lambda array: array.str.strip().str.count(" ") + 1
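
To make the effect of `strip()` concrete, here is a small sketch on made-up strings, again assuming a pandas Series as input. It shows what the stripped variant fixes and what it leaves unchanged.

```python
import pandas as pd

# Made-up samples: clean text, padded text, and a double space between words.
texts = pd.Series(["two words", "  two words  ", "two  words"])

plain = texts.str.count(" ") + 1
stripped = texts.str.strip().str.count(" ") + 1

print(plain.tolist())     # [2, 6, 3]
print(stripped.tolist())  # [2, 2, 3]
```

Stripping removes the over-count from leading and trailing spaces, but a double space between words is still counted as two separators.
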

Seth-Rothschild Jan 1, 2018
Author Contributor

Multiple spaces could be handled in the str.split(" ") case, since they show up as empty strings. We could just remove them with

new_list = [x for x in split_list if x != '']

For now, it has been changed to str.count(" ") + 1
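
The filtering idea above can be sketched as follows, on made-up sample strings. The second variant, calling `str.split()` with no separator, is an alternative that was not proposed in this thread and is not what the PR merged; it is included only for comparison.

```python
import pandas as pd

# Made-up samples: clean text, padded text, and a double space between words.
texts = pd.Series(["two words", "  two words  ", "two  words"])

# Variant from the comment above: split on a single space, drop empty strings.
filtered = texts.apply(lambda s: len([x for x in s.split(" ") if x != ""]))

# Alternative (an assumption, not the merged code): with no separator,
# str.split treats any run of whitespace as a single delimiter.
by_whitespace = texts.str.split().str.len()

print(filtered.tolist())       # [2, 2, 2]
print(by_whitespace.tolist())  # [2, 2, 2]
```

Either variant handles padding and repeated spaces without a regex, at the cost of building the intermediate lists that `str.count(" ") + 1` avoids.
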

Seth-Rothschild added 2 commits Jan 1, 2018
kmax12 merged commit 1656c82 into master Jan 2, 2018
2 checks passed
ci/circleci: Your tests passed on CircleCI!
license/cla: Contributor License Agreement is signed.
Seth-Rothschild deleted the words-and-chars branch Jan 2, 2018
rwedge mentioned this pull request Jan 18, 2018
4 participants