This repository has been archived by the owner on Jan 9, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add OneHotEncoder Ability to collapse infrequent values issue 26 (#31)
* Add UncommonRemover internal transformer to remove infrequent values in a categorical column
* Re-format smart transformers to have constructors and document parameters
* Add UncommonRemover as a preprocessing step to OneHotEncoder
* Add relevant tests
* Address CR requests and fix pandas wrapper
* Change the internal implementation of UncommonRemover to pandas
* Add field to check_df to validate single column DataFrame and add relevant tests
* Update smart encoder to simplify categorical column if it can using UncommonRemover
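To illustrate why collapsing infrequent values helps as a preprocessing step before one-hot encoding, here is a minimal sketch using only pandas. The function name `collapse_then_one_hot`, the `other` label, and the toy data are hypothetical and are not part of this commit; `pd.get_dummies` stands in for the repository's OneHotEncoder wrapper, which is not shown here.

```python
import pandas as pd


def collapse_then_one_hot(column, threshold=0.01, other="Other"):
    """Merge rare categories into one label, then one-hot encode (illustrative only)."""
    counts = column.value_counts()
    # Categories whose count is at or below threshold * number of rows are rare.
    rare = counts[counts <= threshold * column.size].index
    # Series.mask returns a new Series, so the caller's data is not modified.
    collapsed = column.mask(column.isin(rare), other)
    return pd.get_dummies(collapsed)


colors = pd.Series(["red"] * 5 + ["blue"] * 3 + ["teal", "plum"])
plain = pd.get_dummies(colors)                               # one column per category
encoded = collapse_then_one_hot(colors, threshold=0.1)       # rare categories share a column
```

In this toy column, the two categories that each appear once share a single indicator column, so the encoded frame has fewer columns than the plain one.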
1 parent 3b3b017, commit 6579deb
Showing 7 changed files with 190 additions and 34 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
import pandas as pd | ||
import numpy as np | ||
from sklearn.utils import check_array | ||
from sklearn.utils.validation import check_is_fitted | ||
from sklearn.base import TransformerMixin, BaseEstimator | ||
|
||
from foreshadow.utils import check_df | ||
|
||
|
||
class UncommonRemover(BaseEstimator, TransformerMixin): | ||
"""Merges uncommon values in a categorical column into a single other value | ||
Note: Values not seen during fitting will also be merged. | ||
Args: | ||
threshold (float): values whose frequency is at or below this fraction | ||
of the rows will be merged into a single replacement value | ||
replacement (Optional): value with which to replace uncommon values | ||
""" | ||
|
||
def __init__(self, threshold=0.01, replacement="UncommonRemover_Other"): | ||
self.threshold = threshold | ||
self.replacement = replacement | ||
|
||
def fit(self, X, y=None): | ||
""" | ||
Records the values seen during fitting and finds the uncommon values to merge | ||
Args: | ||
X (:obj:`pandas.DataFrame`): input dataframe with a single column | ||
Returns: | ||
self: the fitted object instance | ||
""" | ||
X = check_df(X, single_column=True).iloc[:, 0] | ||
|
||
vc_series = X.value_counts() | ||
self.values_ = vc_series.index.values.tolist() | ||
self.merge_values_ = vc_series[ | ||
vc_series <= (self.threshold * X.size) | ||
].index.values.tolist() | ||
|
||
return self | ||
|
||
def transform(self, X, y=None): | ||
X = check_df(X, single_column=True).iloc[:, 0] | ||
check_is_fitted(self, ["values_", "merge_values_"]) | ||
X[X.isin(self.merge_values_) | ~X.isin(self.values_)] = self.replacement | ||
X = X.to_frame() | ||
|
||
return X |
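
The thresholding rule in `fit` and the masking step in `transform` can be tried out in isolation. The following is a rough sketch that depends only on pandas; the helper names `fit_uncommon` and `collapse_uncommon` and the toy data are hypothetical, and the sketch skips the `check_df` validation that the class above performs. Unlike the diff, it uses `Series.mask` so that the input is left unmodified, which is an assumption about desired behavior rather than what the commit does.

```python
import pandas as pd


def fit_uncommon(column, threshold=0.01):
    """Return (seen_values, merge_values) following the rule used in fit()."""
    counts = column.value_counts()
    seen = counts.index.tolist()
    # A value is merged when its count is at or below threshold * number of rows.
    merge = counts[counts <= threshold * column.size].index.tolist()
    return seen, merge


def collapse_uncommon(column, seen, merge, replacement="UncommonRemover_Other"):
    """Replace merged values and values unseen during fitting; returns a new frame."""
    to_replace = column.isin(merge) | ~column.isin(seen)
    return column.mask(to_replace, replacement).to_frame()


train = pd.Series(["a"] * 6 + ["b"] * 3 + ["c"], name="color")
seen, merge = fit_uncommon(train, threshold=0.1)

# "c" was rare during fitting and "z" was never seen, so both are replaced.
test = pd.Series(["a", "c", "z"], name="color")
result = collapse_uncommon(test, seen, merge)
```

Because the comparison is `<=`, a value that sits exactly at the cutoff (here one occurrence in ten rows with a threshold of 0.1) is merged as well.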