Binary classification of products that pass or fail quality control

Binary Classification with Apache Spark / HDFS

↖data source

The goal of the competition is to predict which parts will fail quality control.

My goal is to use the Hadoop ecosystem to handle a large dataset and establish a pipeline for machine learning.

munge :

  • Aggregate columns using RDD transformations
  • Create a column that indicates which of those column aggregations are outliers.
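
The two munge steps can be sketched in plain Python as row-level functions of the kind one would pass to an RDD `map`. This is a minimal sketch, not the code in this repository: the record layout `(part_id, [values])`, the function names `aggregate_row` and `add_outlier_flag`, the choice of statistics, and the z-score rule with its threshold are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def aggregate_row(row):
    """Collapse one record's numeric columns into summary statistics.

    `row` is assumed to be (part_id, [values]); missing values are None.
    """
    part_id, values = row
    present = [v for v in values if v is not None]
    if not present:
        return (part_id, {"count": 0, "mean": None, "min": None, "max": None})
    return (part_id, {
        "count": len(present),
        "mean": mean(present),
        "min": min(present),
        "max": max(present),
    })


def add_outlier_flag(aggregated, key="mean", z_threshold=2.0):
    """Add an `is_outlier` indicator (0 or 1) based on a z-score of `key`.

    The z-score rule and the threshold are assumptions, not the
    repository's definition of an outlier.
    """
    observed = [stats[key] for _, stats in aggregated if stats[key] is not None]
    mu = mean(observed) if observed else 0.0
    sigma = pstdev(observed) if observed else 0.0

    flagged = []
    for part_id, stats in aggregated:
        value = stats[key]
        is_outlier = int(
            value is not None and sigma > 0 and abs(value - mu) / sigma > z_threshold
        )
        flagged.append((part_id, {**stats, "is_outlier": is_outlier}))
    return flagged


# Toy records: nine ordinary parts and one with unusually large measurements.
toy_rows = [("p%d" % i, [1.0, None, 1.0]) for i in range(9)] + [("p9", [100.0, 100.0, None])]
toy_aggregated = list(map(aggregate_row, toy_rows))
toy_flagged = add_outlier_flag(toy_aggregated, key="mean")
```

On a cluster, `aggregate_row` could be applied with `rdd.map(aggregate_row)`, and the mean and standard deviation could be computed on the RDD (for example with `RDD.stats()`) instead of over a local list.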

fit_predict :

  • Model the data with Spark's machine learning package
  • Predict on the test data
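
As a rough illustration of the fit and predict steps, the sketch below wires a `pyspark.ml` pipeline together. The README does not say which model is used, so `LogisticRegression` is an assumption here, as are the helper names `build_pipeline` and `fit_predict`, the `features` column name, and the default label column `label`.

```python
def build_pipeline(feature_cols, label_col="label"):
    """Assemble feature columns into a vector and attach a binary classifier.

    The classifier choice (logistic regression) is an assumption.
    """
    # Imported lazily so this module can be imported without PySpark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    classifier = LogisticRegression(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[assembler, classifier])


def fit_predict(train_df, test_df, feature_cols, label_col="label"):
    """Fit the pipeline on the training DataFrame and score the test DataFrame."""
    model = build_pipeline(feature_cols, label_col).fit(train_df)
    # transform() appends prediction columns (by default including `prediction`).
    return model.transform(test_df)
```

Calling `fit_predict(train_df, test_df, feature_cols)` requires an active Spark session and DataFrames that contain the listed feature columns; the training DataFrame must also contain the label column.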

munge_fit_predict :

  • Run this as-is to try the pipeline on the toy dataset example