
Conversation

@ankitlade12

  • Add TextFeatures class to extract features from text columns
  • Support for features: char_count, word_count, digit_count, uppercase_count, etc.
  • Add comprehensive tests with pytest parametrize
  • Add user guide documentation

Collaborator

@solegalli solegalli left a comment

Hi @ankitlade12

Thanks a lot!

Function-wise, I'd say this transformer is ready. I made a few suggestions on how to optimize the feature-creation functions. Let me know if they make sense.

Other than that, we need the various docs files and we'll be good to go :)

Thanks again!

TEXT_FEATURES = {
    "char_count": lambda x: x.str.len(),
    "word_count": lambda x: x.str.split().str.len(),
    "sentence_count": lambda x: x.str.count(r"[.!?]+"),
    # ... remaining features elided in this review excerpt
}
Collaborator

This one is counting punctuation as a proxy for sentence count? Did I get it right?

Author

Yes, that's correct! It counts sentence-ending punctuation (., !, ?) as a proxy for sentence count. This is a simple heuristic that works well for most common text. It won't handle edge cases like abbreviations (e.g., 'Dr.', 'U.S.') or text without punctuation, but it's a reasonable approximation for basic text analysis.
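
As a quick illustration of that heuristic (sample strings are my own, not from the test suite), including the abbreviation edge case mentioned above:

```python
import pandas as pd

# Toy strings showing the sentence_count heuristic; the regex
# mirrors the TEXT_FEATURES entry quoted in this review.
s = pd.Series([
    "Hello world. How are you? Fine!",
    "Dr. Smith left.",          # abbreviation period inflates the count
    "no terminal punctuation",  # zero matches
])

# Each run of '.', '!' or '?' counts as one sentence ending.
sentence_count = s.str.count(r"[.!?]+")
print(sentence_count.tolist())  # [3, 2, 0]
```

"Dr. Smith left." is one sentence but counts as 2, which is the known trade-off of the proxy.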

word counts, sentence counts, and various ratios and indicators.

A list of variables can be passed as an argument. Alternatively, the transformer
will automatically select and transform all variables of type object (string).
Collaborator

This makes sense for compatibility with our other classes; however, it could be a disaster for less experienced users who pass the transformer the entire dataset without a second thought.

I'm not sure what's best here: we could require the user to pass one or more text columns by not giving this parameter a default value. Or we could select only variables that actually contain text, perhaps those whose text lengths exceed a certain threshold (we'd need a separate function for that).

Thoughts?

Author

Thanks for the feedback! I kept it consistent with other transformers in the library (like encoders) which also default to auto-selecting object columns.

I agree there's a risk for less experienced users. Would you prefer one of these approaches?

  1. Keep current behavior for consistency
  2. Make variables a required parameter
  3. Emit a UserWarning when auto-selecting multiple columns

Let me know which you'd prefer and I'll implement it!

X = check_X(X)

# Find or validate text variables
if self.variables is None:
Collaborator

If we stick to selecting all object variables, we already have a function for this. Check how it is done in the encoders. I still think that extracting features from all categorical variables is massive overkill. We need to think about what's best.

Author

I kept the current behavior (variables=None auto-selects all object columns) for consistency with other Feature-engine transformers like the encoders. However, I'm happy to make variables a required parameter if you prefer a more explicit API for this transformer. What do you think is best for the library?

- Optimize avg_word_length using vectorized char_count / word_count
- Simplify unique_word_count using x.str.lower().str.split().apply(set).str.len()
- Rename unique_word_ratio to lexical_diversity (word_count / unique_word_count)
- Use _check_variables_input_value for variable validation
- Use find_categorical_variables for automatic variable selection
- Remove redundant docstring text
- Add comprehensive test assertions with expected values
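
The optimized expressions from these bullets can be sketched standalone like this. Sample data is my own, and `lexical_diversity` follows the formula stated in the commit message (word_count / unique_word_count); note that `char_count` includes whitespace, so `avg_word_length` is an approximation:

```python
import pandas as pd

s = pd.Series(["The cat sat on the mat", "a a a"])

char_count = s.str.len()                   # counts all chars, spaces included
word_count = s.str.split().str.len()
avg_word_length = char_count / word_count  # vectorized, no per-row Python loop
unique_word_count = s.str.lower().str.split().apply(set).str.len()
lexical_diversity = word_count / unique_word_count  # as defined in the bullet above

print(word_count.tolist(), unique_word_count.tolist())  # [6, 3] [5, 1]
```

`Series.str.len()` applies `len()` element-wise, so it works on the sets produced by `.apply(set)` as well as on strings, which is what makes the `unique_word_count` one-liner possible.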