Dear,
I am currently working with a large, high-dimensional dataset (1459 features and 20 billion instances) and use the partial_fit method to train my model. How can I use the smote_variants library with online classifiers such as sklearn.linear_model.SGDClassifier?
As I see it, 20 billion instances with 1500 features (about 30 trillion numbers altogether, roughly 120 terabytes) is far beyond the capabilities of sklearn-related techniques. partial_fit could be used, but smote_variants is not prepared for this amount of data. Imbalanced datasets are usually much smaller, and SMOTE techniques were developed for such relatively small datasets.
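To make the size estimate concrete, here is a minimal back-of-the-envelope calculation. The row and feature counts come from the question; the 4 bytes per value (float32, dense storage) is an assumption, and a float64 matrix would need twice as much.

```python
# Rough size of a dense feature matrix for the dataset in the question.
n_instances = 20_000_000_000   # 20 billion rows (from the question)
n_features = 1459              # from the question
bytes_per_value = 4            # assumption: dense float32 storage

n_values = n_instances * n_features
size_tb = n_values * bytes_per_value / 1e12  # decimal terabytes

print(f"{n_values:.3e} values, about {size_tb:.0f} TB")
```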
What is the imbalance ratio (#negative/#positive) in your dataset? I would guess that many of your records are redundant and do not add much information to the classification process. Subsampling would make the data easier to handle without a significant loss of information.
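As a rough sketch of both points, the imbalance ratio can be counted one chunk at a time, and the negatives can then be thinned with a fixed keep probability. The function names, the toy labels, and the target ratio of 2.0 are all hypothetical and only illustrate the idea; positives are assumed to be labelled 1.

```python
import random
from collections import Counter

def imbalance_ratio(label_chunks, positive=1):
    """Return #negative / #positive, counting labels one chunk at a time."""
    counts = Counter()
    for labels in label_chunks:
        counts.update(labels)
    n_pos = counts[positive]
    n_neg = sum(counts.values()) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples found")
    return n_neg / n_pos

def subsample_negatives(rows, labels, keep_prob, rng, positive=1):
    """Keep every positive row; keep each negative row with probability keep_prob."""
    kept = [(r, y) for r, y in zip(rows, labels)
            if y == positive or rng.random() < keep_prob]
    return [r for r, _ in kept], [y for _, y in kept]

# Toy label chunks standing in for pieces of the real dataset.
chunks = [[0, 0, 0, 1], [0, 0, 0, 0], [0, 1, 0, 0]]
ratio = imbalance_ratio(chunks)             # 10 negatives / 2 positives

target_ratio = 2.0                          # hypothetical target #neg/#pos
keep_prob = min(1.0, target_ratio / ratio)  # fraction of negatives to keep

rng = random.Random(0)
rows, labels = subsample_negatives(list(range(4)), chunks[0], keep_prob, rng)
```

Because each row is kept or dropped independently, this needs only one pass over the data and never holds more than one chunk in memory.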
Sir,
I am trying to work with the dask library and want to use smote_variants. The data is about drugs, and I am trying to deal with its imbalance ratio. The whole dataset cannot be loaded into memory at once, so I load it as a dask dataframe and want to apply smote_variants to small chunks of the main dataset. If I reduce the number of instances in my dataset, it will affect my study. Please help me out.
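A chunk-wise workflow of this kind could look roughly like the sketch below: each chunk is resampled on its own and then passed to SGDClassifier.partial_fit. The helpers `train_in_chunks`, `downsample_majority`, and `toy_chunks` are hypothetical names, the data is synthetic, and the resampler shown is a simple majority downsampler rather than a SMOTE variant. Any callable taking (X, y) and returning (X, y) could be passed in its place; whether a per-chunk oversampler gives sensible synthetic samples is not something this sketch establishes.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def downsample_majority(X, y, rng, ratio=1.0):
    """Keep all label-1 rows and at most ratio * n_minority label-0 rows."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_keep = min(len(neg), int(ratio * max(len(pos), 1)))
    keep = np.concatenate([pos, rng.choice(neg, size=n_keep, replace=False)])
    rng.shuffle(keep)  # avoid feeding all positives first
    return X[keep], y[keep]

def train_in_chunks(chunks, model, resample, classes):
    """Resample each (X, y) chunk independently, then update the model."""
    for X, y in chunks:
        X_res, y_res = resample(X, y)
        # classes must be given so chunks missing a class are still accepted
        model.partial_fit(X_res, y_res, classes=classes)
    return model

def toy_chunks(n_chunks, rng, n_rows=2000, n_features=5):
    """Synthetic stand-in for chunks read from disk (about 5% positives)."""
    for _ in range(n_chunks):
        y = (rng.random(n_rows) < 0.05).astype(int)
        X = rng.normal(size=(n_rows, n_features)) + 2.0 * y[:, None]
        yield X, y

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)
train_in_chunks(toy_chunks(10, rng), model,
                lambda X, y: downsample_majority(X, y, rng),
                classes=np.array([0, 1]))
```

In a real setting the generator would be replaced by whatever yields in-memory chunks, for example the partitions of a dask dataframe converted to arrays.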
Hi @arjunpuri7, I hope you managed to overcome the problem. Personally, I do not think oversampling is meaningful for such a huge amount of data; I think some reliable downsampling is what you need. Can we close this issue?