Add Words and Chars primitives #51

Merged
merged 6 commits into from Jan 2, 2018

Conversation

Seth-Rothschild
Contributor

Add two text primitives, NumWords and NumCharacter, which count the number of words and the number of characters, respectively, when the variable type is Text.

@codecov-io

codecov-io commented Dec 22, 2017

Codecov Report

Merging #51 into master will increase coverage by 0.04%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master      #51      +/-   ##
==========================================
+ Coverage   87.14%   87.19%   +0.04%     
==========================================
  Files          74       74              
  Lines        6946     6973      +27     
==========================================
+ Hits         6053     6080      +27     
  Misses        893      893
Impacted Files                                          Coverage Δ
primitives/transform_primitive.py                       97.46% <0%> (+0.13%) ⬆️
.../feature_function_tests/test_transform_features.py   85.45% <0%> (+0.37%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d30e76...687e1b0.

Contributor

@kmax12 left a comment

Good to merge after these changes

@@ -336,6 +337,30 @@ class Weekday(DatetimeUnitBasePrimitive):
    name = "weekday"


class NumCharacter(TransformPrimitive):
Contributor

Should we name this NumCharacters to be consistent with NumWords below?

Contributor Author

Changed

    return_type = Numeric

    def get_function(self):
        return lambda array: pd.Series([len(x) for x in array])
Contributor

I think you can use more of pandas' built-in syntax for this, since the array variable is going to be a pandas Series (please double-check this, though).

In [1]: array = pd.Series(["1","12 2","1212 3"])

In [2]: array.str.len()
Out[2]: 
0    1
1    4
2    6
dtype: int64

so, it would just be

def get_function(self):
    return lambda array: array.str.len()

Contributor Author

Array is a numpy.ndarray

Contributor

Thanks for double-checking. I don't have a strong preference, but perhaps do this to avoid the list comprehension:

pd.Series(array).str.len()
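A minimal sketch of the two character-counting approaches discussed here, assuming (as reported in this thread) that the function receives a numpy.ndarray of strings. The sample values and variable names are illustrative only and are not taken from the pull request.

```python
import numpy as np
import pandas as pd

# Illustrative input: an ndarray of strings, as the thread reports the
# primitive receives (the values themselves are made up for this sketch).
array = np.array(["1", "12 2", "1212 3"], dtype=object)

# Original approach from the diff: list comprehension wrapped in a Series.
by_comprehension = pd.Series([len(x) for x in array])

# Suggested approach: wrap the ndarray in a Series, then use the .str accessor.
by_accessor = pd.Series(array).str.len()

print(by_comprehension.tolist())
print(by_accessor.tolist())
```

On clean string input the two give the same counts. One difference to keep in mind: if a value is missing (NaN), `len(x)` raises a TypeError, whereas `.str.len()` returns NaN for that row.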

    return_type = Numeric

    def get_function(self):
        return lambda array: pd.Series([len(x.split(" ")) for x in array])
Contributor

similar to above, you can do

def get_function(self):
    return lambda array: array.str.split(" ").str.len()

Contributor

@kmax12 Dec 26, 2017

actually, this is probably better.

def get_function(self):
    return lambda array: array.str.count(" ") + 1

easier to read and might be up to 25% faster
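One way to check the speed claim locally is a small timeit comparison. This is only a sketch: the sample data, the helper names `count_by_split` and `count_by_spaces`, and the repetition counts are assumptions, and actual timings will vary with the data and the pandas version.

```python
import timeit

import pandas as pd

# Illustrative data; not taken from the pull request.
series = pd.Series(["the quick brown fox", "hello world", "one"] * 1000)


def count_by_split(s):
    # Split on a single space and count the resulting pieces.
    return s.str.split(" ").str.len()


def count_by_spaces(s):
    # Count single spaces and add one.
    return s.str.count(" ") + 1


# Both approaches agree on text separated by single spaces.
same = count_by_split(series).tolist() == count_by_spaces(series).tolist()

split_time = timeit.timeit(lambda: count_by_split(series), number=20)
count_time = timeit.timeit(lambda: count_by_spaces(series), number=20)
print(f"same results: {same}")
print(f"split: {split_time:.4f}s, count: {count_time:.4f}s")
```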

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those aren't quite equivalent - what if there's some leading or trailing whitespace, or if there's more than one space between characters?

Contributor

I think both pieces of code handle leading/trailing whitespace and multiple spaces the same way, because they both look for just " ". To handle multiple spaces, the only way I can think of is a regex, and I'd rather keep it simple for now. However, for trailing or leading whitespace we could do

def get_function(self):
    return lambda array: array.str.strip().str.count(" ") + 1
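To make the edge cases in this exchange concrete, here is a small sketch comparing the three variants from the thread on inputs with leading, trailing, and repeated spaces. The sample strings and variable names are made up for illustration.

```python
import pandas as pd

# Illustrative inputs covering the edge cases raised in the thread.
samples = pd.Series(["one two", " leading", "trailing ", "two  spaces", ""])

# Variant 1: split on a single space and count the pieces.
split_count = samples.str.split(" ").str.len()

# Variant 2: count single spaces and add one.
space_count = samples.str.count(" ") + 1

# Variant 3: strip leading/trailing whitespace first, then count spaces.
stripped_count = samples.str.strip().str.count(" ") + 1

print(split_count.tolist())     # [2, 2, 2, 3, 1]
print(space_count.tolist())     # [2, 2, 2, 3, 1]
print(stripped_count.tolist())  # [2, 1, 1, 3, 1]
```

The first two variants agree on every input here, including the ones with extra spaces. Stripping fixes the leading and trailing cases, but the doubled space is still counted as an extra word, and an empty string counts as one word in all three.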

Contributor Author

Multiple spaces could be taken care of in the str.split(" ") case since they'll show up as empty strings. Could just remove with

new_list = [x for x in split_list if x != '']

For now, it has been changed to str.count(" ") + 1
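For reference, the empty-string filtering idea above could be sketched as follows, next to a regex-free alternative that relies on `Series.str.split()` with no pattern (which splits on runs of whitespace). Neither is what was merged, since the thread settles on `str.count(" ") + 1`. The function names `count_words_filtered` and `count_words_whitespace` are hypothetical.

```python
import pandas as pd


def count_words_filtered(array):
    # Hypothetical helper: split on " " and drop the empty strings that
    # repeated, leading, or trailing spaces produce.
    series = pd.Series(array)
    return series.apply(
        lambda text: len([x for x in text.split(" ") if x != ""])
    )


def count_words_whitespace(array):
    # Hypothetical helper: with no pattern, str.split() splits on runs of
    # whitespace and ignores leading/trailing whitespace.
    return pd.Series(array).str.split().str.len()


samples = ["one two", " leading", "trailing ", "two  spaces", ""]
print(count_words_filtered(samples).tolist())    # [2, 1, 1, 2, 0]
print(count_words_whitespace(samples).tolist())  # [2, 1, 1, 2, 0]
```

The two helpers agree on these inputs but are not identical in general: the filtering version only treats the space character as a separator, while the no-pattern split also separates on tabs and newlines.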

@kmax12 merged commit 1656c82 into master Jan 2, 2018
@Seth-Rothschild deleted the words-and-chars branch January 2, 2018 21:03
@rwedge mentioned this pull request Jan 18, 2018

4 participants