
How can smote_variants work with an incremental classifier on a large amount of data #17

Open
arjunpuri7 opened this issue Dec 26, 2019 · 3 comments

@arjunpuri7

Dear all,
I am presently working with a large, high-dimensional dataset (1459 features and 20 billion instances) and using the partial_fit method to train my model. How could I make the smote_variants library work properly with this kind of classifier (an online classifier such as sklearn.linear_model.SGDClassifier)?
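A minimal sketch of the incremental setup described above, assuming the data can be streamed from a CSV file; the file name, label column, and chunk size are illustrative placeholders:

```python
# Baseline incremental training without oversampling: stream the data in
# chunks and feed each chunk to SGDClassifier.partial_fit.
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])  # partial_fit requires all class labels up front

for chunk in pd.read_csv("data.csv", chunksize=100_000):
    X = chunk.drop(columns="label").to_numpy()
    y = chunk["label"].to_numpy()
    clf.partial_fit(X, y, classes=classes)
```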

@gykovacs gykovacs self-assigned this Dec 28, 2019
@gykovacs
Member

Hi @arjunpuri7,

in my impression, 20 billion instances with 1500 features each (altogether 30 trillion numbers, about 120 terabytes at 4 bytes per value) is far beyond the capabilities of sklearn-related techniques. partial_fit could be used, but as a matter of fact, smote_variants is not prepared for this load of data. Imbalanced datasets are usually much smaller, and SMOTE techniques were developed for these relatively small datasets.

What is the imbalance ratio (#negative/#positive) in your dataset? I would guess that many of your records are redundant and do not add much information to the classification process. Subsampling would make the data easier to handle without a significant loss of information.
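A minimal sketch of that subsampling idea, assuming X and y are an in-memory feature matrix and label vector; the majority label, target size, and the choice of sv.SMOTE are assumptions:

```python
# Randomly downsample the majority class to a manageable size, then
# oversample the reduced set with smote_variants.
import numpy as np
import smote_variants as sv

def subsample_majority(X, y, majority_label=0, n_keep=100_000, seed=42):
    rng = np.random.default_rng(seed)
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]
    keep = rng.choice(maj_idx, size=min(n_keep, len(maj_idx)), replace=False)
    idx = np.concatenate([keep, min_idx])
    return X[idx], y[idx]

X_small, y_small = subsample_majority(X, y)
oversampler = sv.SMOTE()
X_samp, y_samp = oversampler.sample(X_small, y_small)
```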

@arjunpuri7
Author

Sir,
I am trying to work with the dask library and want to use smote_variants. The data is about some drugs, and I am trying to handle the imbalance ratio. The whole dataset cannot be loaded into memory at once, so I am loading it with a dask dataframe and want to apply smote_variants to small chunks of the main dataset. If I reduce the number of instances in my dataset, it will affect my study. Please help me out.
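A minimal sketch of that chunk-wise approach, assuming each dask partition fits into memory; the file pattern, label column, and choice of oversampler are assumptions. Note the caveat: SMOTE applied per chunk only interpolates between the minority samples present in that chunk.

```python
import dask.dataframe as dd
import numpy as np
import smote_variants as sv
from sklearn.linear_model import SGDClassifier

ddf = dd.read_csv("drugs-*.csv")  # lazy, nothing loaded yet
clf = SGDClassifier()
classes = np.array([0, 1])
oversampler = sv.SMOTE()

# Materialize one partition at a time, balance it, and update the model.
for delayed_part in ddf.to_delayed():
    pdf = delayed_part.compute()
    X = pdf.drop(columns="label").to_numpy()
    y = pdf["label"].to_numpy()
    X_samp, y_samp = oversampler.sample(X, y)  # balances this chunk only
    clf.partial_fit(X_samp, y_samp, classes=classes)
```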

@gykovacs
Member

Hi @arjunpuri7, I hope you managed to overcome the problem. Personally, I do not think it is meaningful to apply oversampling to such a huge amount of data; some reliable downsampling is what you need. Can we close this issue?
