# SMORK: Implementation of SMOTE - Synthetic Minority Over-sampling Technique in SparkML / MLlib

This is a very basic implementation of the SMOTE algorithm for Spark ML. To the author's knowledge, it is the only available implementation that plugs into Spark ML Pipelines.
## Prerequisites

- Spark 2.3.0+
## Building

Build the jar with sbt:

```
sbt clean package
```
## Setup

On Linux, add the built jar to the driver classpath when launching your Spark application:

```
--conf "spark.driver.extraClassPath=/path/to/smork-0.0.1.jar"
```
## Usage

```scala
import com.iresium.ml.SMOTE

// Point SMOTE at the feature and label columns of your imbalanced DataFrame
val smote = new SMOTE()
smote.setfeatureCol("myFeatures").setlabelCol("myLabel").setbucketLength(100)

// Fit on the data, then transform to obtain a DataFrame augmented with
// synthetic minority-class rows
val smoteModel = smote.fit(df)
val newDF = smoteModel.transform(df)
```
You can also see and run an example in `src/main/scala/SMORKApp.scala`.
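For reference, here is a minimal end-to-end sketch of the same flow. The SparkSession setup, the toy DataFrame, and the column names `myFeatures`/`myLabel` are illustrative assumptions made for this example; only the `SMOTE` calls themselves come from the snippet above.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import com.iresium.ml.SMOTE

// Hypothetical standalone driver, for illustration only
object SMOTEExample extends App {
  val spark = SparkSession.builder()
    .appName("SMOTEExample")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Toy imbalanced dataset (illustrative): label 1.0 is the minority class
  val df = Seq(
    (Vectors.dense(1.0, 2.0), 0.0),
    (Vectors.dense(1.2, 1.9), 0.0),
    (Vectors.dense(0.8, 2.1), 0.0),
    (Vectors.dense(5.0, 8.0), 1.0),
    (Vectors.dense(5.5, 8.2), 1.0)
  ).toDF("myFeatures", "myLabel")

  val smote = new SMOTE()
  smote.setfeatureCol("myFeatures").setlabelCol("myLabel").setbucketLength(100)

  // Fit, then transform; the result should contain additional synthetic
  // minority-class rows alongside the original data
  val newDF = smote.fit(df).transform(df)
  newDF.groupBy("myLabel").count().show()

  spark.stop()
}
```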
## Future Work

- PySMORK - a Python wrapper for SMORK that allows you to use SMOTE from PySpark
- Support for categorical attributes
## Contributing

Looking for contributors! You are welcome to raise issues or send a pull request.
## Authors

- Abhinandan Dubey - @alivcor
## License

This project is licensed under the MIT License - see the LICENSE.md file for details.