Reducing Manufacturing Failures - A Kaggle Challenge



Link to Kaggle Competition:

Video Description

Directory Structure:

  • data/ : Holds the training and testing datasets. Download the datasets, unzip them, and store them in the data folder as csv files.

  • plots/ : Holds the different visualization plots

  • scripts/ : Holds runnable .ipynb, .r and .py files, which are explained below.

  • idMovements.ipynb: Visualization of the first 9999 jobs as they progress through the production lines.

    • idMovements.py: Python version of idMovements.ipynb
  • visualizer.ipynb: Visualization of categorical features and of feature frequency with respect to stations and lines

    • Python version of visualizer.ipynb
  • graph-viz.ipynb: Visualizes the movements of the first 1000 defective Ids and the first 1000 non-defective Ids across the production lines using IBM System G, to develop useful insights about the data.

    • gshell.txt: Text file containing the commands used in graph-viz.ipynb
    • Python version of graph-viz.ipynb
  • nextProb.ipynb: Computes the probability of seeing a defect after a defect

    • Python version of nextProb.ipynb
  • FeatureSelection.ipynb: Open with Jupyter Notebook. Holds the code for identifying the most important features, on which the classification algorithms depend.

    • Python version of FeatureSelection.ipynb
  • spark.ipynb: Open with the Databricks platform or PySpark. Holds the code for developing the RandomForest classifier on the chosen subset of important features. At the "REPLACE_YOUR_FILE" location, provide the path to your test_numeric.csv file on Databricks or on your local machine.

    • Python version of spark.ipynb
  • train.ipynb: Open with Jupyter Notebook. Holds the code and visualizations for developing the different classification algorithms (LibSVM, RBF SVM, Naive Bayes, Random Forest, Gradient Boosting) on the chosen subset of important features.

    • Python version of train.ipynb
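
To make the idea behind nextProb.ipynb concrete, here is a minimal sketch of one way to estimate the probability of seeing a defect right after a defect. It is not taken from the notebook: it assumes that "after" means the next part in Id order and that labels are 0/1 with 1 meaning defective. The function name and the toy labels are hypothetical.

```python
def next_defect_probability(responses):
    """Estimate P(next part defective | current part defective).

    `responses` is a list of 0/1 labels ordered by Id (1 = defective).
    Returns None when no defective part has a successor.
    """
    defect_then_any = 0
    defect_then_defect = 0
    # Walk over consecutive (current, following) pairs.
    for current, following in zip(responses, responses[1:]):
        if current == 1:
            defect_then_any += 1
            if following == 1:
                defect_then_defect += 1
    if defect_then_any == 0:
        return None
    return float(defect_then_defect) / defect_then_any


# Toy labels, made up for illustration only (not from the dataset).
labels = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
print(next_defect_probability(labels))
```

Comparing this conditional probability with the overall defect rate gives a rough sense of whether defects tend to cluster among neighbouring Ids.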

Instructions for Installation

  • Download the dataset from Link and unzip the files into the data/ folder
  • Run any of the scripts from the scripts/ folder, depending on the task
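
After unzipping, a quick sanity check can confirm that a file is readable before running the notebooks. The sketch below streams a csv file with the standard library and reports the row count and the fraction of rows labelled as failures. The label column name "Response" and the helper name failure_rate are assumptions for illustration, not something this repository defines.

```python
import csv


def failure_rate(csv_path, label_column="Response"):
    """Stream a csv file and return (row_count, fraction of rows labelled 1).

    The default label column name is an assumption; pass the real one
    if your file uses a different header.
    """
    rows = 0
    failures = 0
    with open(csv_path) as handle:
        for record in csv.DictReader(handle):
            rows += 1
            # Labels may be stored as "1" or "1.0"; compare numerically.
            if float(record[label_column]) == 1.0:
                failures += 1
    rate = float(failures) / rows if rows else 0.0
    return rows, rate
```

For example, `failure_rate("data/train_numeric.csv")` would scan a training file, assuming that is what the unzipped file is called. Reading row by row keeps memory use flat, which helps with large files.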


  • Python: 2.7.6
  • Pandas: 0.19.1
  • scikit-learn (sklearn): 0.18.1
  • Numpy: 1.8.2
  • R: 3.3.2
  • py-spark: 2.0.1, or a Databricks account
  • System-G: (systemg-tools-1.4.0) [For Visualizations]
  • jupyter notebook : 4.2.1 with support for R and Python kernels
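
To compare a local Python environment against the versions listed above, a small check like the following may help. It is only a sketch: the module names are the usual import names for these packages, the helper name installed_versions is hypothetical, and it reports versions without enforcing them.

```python
import importlib

# Versions listed in this README, keyed by the usual import name.
EXPECTED = {"pandas": "0.19.1", "sklearn": "0.18.1", "numpy": "1.8.2"}


def installed_versions(expected=EXPECTED):
    """Return {module: (expected, found)}; found is None if the import fails."""
    report = {}
    for name, wanted in expected.items():
        try:
            module = importlib.import_module(name)
            found = getattr(module, "__version__", None)
        except ImportError:
            found = None
        report[name] = (wanted, found)
    return report


for name, (wanted, found) in sorted(installed_versions().items()):
    print("%s: expected %s, found %s" % (name, wanted, found))
```

Newer versions of these libraries may behave differently from the ones the code was tested with, so a mismatch is worth noting before debugging a notebook.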

The code has been tested on Ubuntu 14.04 LTS. It should work on other distributions, but this has not been tested yet.

In case of any issue with installation or otherwise, please contact: Aayush Mudgal


  • Aayush Mudgal
  • Sheallika Singh
  • Vibhuti Mahajan