alivcor/SMORK

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SMORK

SMOTE in Spark

Implementation of SMOTE - Synthetic Minority Over-sampling Technique in SparkML / MLLib

Getting Started

SMORK is a basic implementation of the SMOTE algorithm for SparkML. At the time of writing, it is the only implementation we know of that plugs into Spark Pipelines.
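For readers new to SMOTE, the core idea can be sketched in plain Scala without Spark: each synthetic sample is placed at a random position on the line segment between a minority-class point and one of its k nearest minority-class neighbours. This is a minimal illustration of the general technique, not SMORK's actual implementation. The names `SmoteSketch`, `oversample`, and `perPoint` are hypothetical, and the brute-force neighbour search is an assumption chosen for brevity.

```scala
import scala.util.Random

// Illustrative only: a plain-Scala sketch of SMOTE-style interpolation.
// All names here are hypothetical and unrelated to SMORK's internals.
object SmoteSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // The k nearest minority neighbours of the point at `idx`, excluding itself.
  // Brute force: fine for a sketch, too slow for large data.
  def neighbours(idx: Int, minority: Vector[Point], k: Int): Vector[Point] =
    minority.indices
      .filter(_ != idx)
      .sortBy(j => sqDist(minority(idx), minority(j)))
      .take(k)
      .map(j => minority(j))
      .toVector

  // A point on the segment from p to q; gap = 0 gives p, gap = 1 gives q.
  def interpolate(p: Point, q: Point, gap: Double): Point =
    p.zip(q).map { case (x, y) => x + gap * (y - x) }

  // Generate `perPoint` synthetic samples for every minority point
  // that has at least one neighbour.
  def oversample(minority: Vector[Point], k: Int, perPoint: Int, rng: Random): Vector[Point] =
    minority.indices.toVector.flatMap { i =>
      val nn = neighbours(i, minority, k)
      if (nn.isEmpty) Vector.empty[Point]
      else Vector.fill(perPoint) {
        // Pick a random neighbour and a random gap in [0, 1).
        interpolate(minority(i), nn(rng.nextInt(nn.length)), rng.nextDouble())
      }
    }
}
```

Calling `SmoteSketch.oversample(minority, k = 5, perPoint = 2, new Random(42))` on a vector of minority-class points would return two synthetic points per input point. Every synthetic point lies between two existing minority points, so the new samples stay inside the region the minority class already occupies.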

Prerequisites

  • Spark 2.3.0 or later

Installation

1. Build the JAR

      sbt clean package

2. Add the JAR to your Spark application

Linux

      --conf "spark.driver.extraClassPath=/path/to/smork-0.0.1.jar"

3. Use it as you would any other Estimator in Spark

  • Import

      import com.iresium.ml.SMOTE

  • Initialize & Fit

      val smote = new SMOTE()
      smote.setfeatureCol("myFeatures").setlabelCol("myLabel").setbucketLength(100)

      val smoteModel = smote.fit(df)

  • Transform

      val newDF = smoteModel.transform(df)

You can also find a runnable example in src/main/scala/SMORKApp.scala.
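Because SMOTE is used like any other Estimator, it should also be usable as a stage in a Spark `Pipeline`. The sketch below shows one way that might look. It assumes the SMORK JAR is on the classpath, that `SMOTE` is accepted as a `PipelineStage`, and that the input DataFrame has two numeric columns named `x1` and `x2` plus a label column `myLabel`. Those column names and the `SmorkPipelineSketch` object are hypothetical, and the sketch has not been verified against a particular SMORK build.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import com.iresium.ml.SMOTE

object SmorkPipelineSketch {

  // Assumes `df` has numeric columns "x1" and "x2" and a label column
  // "myLabel" (hypothetical names chosen for illustration).
  def resample(df: DataFrame): DataFrame = {
    // Pack the raw numeric columns into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("myFeatures")

    // Configure SMOTE with the same setters shown in the steps above.
    val smote = new SMOTE()
    smote.setfeatureCol("myFeatures").setlabelCol("myLabel").setbucketLength(100)

    // Assumption: SMOTE can be placed in a Pipeline like any other Estimator.
    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](assembler, smote))

    // Fit all stages, then apply the fitted pipeline to the same data.
    pipeline.fit(df).transform(df)
  }
}
```

Comparing `df.groupBy("myLabel").count()` before and after calling `resample` is a quick way to check how the class balance changed.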

Coming Soon

  • PySMORK - Python Wrapper for SMORK - allows you to use SMOTE in PySpark
  • Support for categorical attributes

Contributing

Contributors are welcome! Feel free to raise issues or send a pull request.

Authors

  • Abhinandan Dubey - @alivcor

License

This project is licensed under the MIT License; see the LICENSE.md file for details.
