Reducing Manufacturing Failures - A Kaggle Challenge

Link to Kaggle Competition:

Video Description

Directory Structure:

  • data/ : Holds the training and testing datasets. Download the datasets and unzip them into the data/ folder as CSV files.

  • plots/ : Holds the different visualization plots

  • scripts/ : Holds runnable .ipynb, .r and .py files, which are explained below.

  • idMovements.ipynb: Visualization of the first 9999 jobs as they progress through the production lines.

    • idMovements.ipynb: Python version of idMovements.ipynb
  • visualizer.ipynb: Visualization of categorical features and frequency of features with respect to stations, and lines

    • Python version of visualizer.ipynb
  • graph-viz.ipynb: Visualizes the first 1000 defective Ids and the first 1000 non-defective Ids using IBM System G, tracing their movements across the production lines to develop useful insights about the data.

    • gshell.txt: Text file containing the commands used in graph-viz.ipynb
    • Python version of graph-viz.ipynb
  • nextProb.ipynb: Computes the probability of seeing a defect after a previous defect.

    • Python version of nextProb.ipynb
  • FeatureSelection.ipynb: Open using Jupyter Notebook. Holds the code for identifying the most important features, on which the classification algorithms depend.

    • Python version of FeatureSelection.ipynb
  • spark.ipynb: Open using the Databricks platform or Py-spark. Holds the code for developing the RandomForest classifier on the chosen subset of important features. At the "REPLACE_YOUR_FILE" location, provide the path to your test_numeric.csv file on Databricks or your local machine.

    • Python version of spark.ipynb
  • train.ipynb: Open using Jupyter Notebook. It holds the code and visualizations for developing the different classification algorithms (LibSVM, RBF SVM, Naive Bayes, Random Forest, Gradient Boosting) on the chosen subset of important features

    • Python version of train.ipynb
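As an illustration of the quantity described for nextProb.ipynb, the following is a minimal sketch, not the notebook's actual code. It assumes the labels are available as a 0/1 sequence (1 meaning defective) already ordered the way jobs should be compared, for example by Id; the function name and the toy data are hypothetical.

```python
def next_defect_probability(responses):
    """Estimate P(next job is defective | current job is defective).

    `responses` is an ordered sequence of 0/1 labels (1 = defective).
    Returns None when no defective job has a successor in the sequence.
    """
    defects_with_successor = 0
    defect_then_defect = 0
    # Walk over consecutive (current, following) pairs.
    for current, following in zip(responses, responses[1:]):
        if current == 1:
            defects_with_successor += 1
            if following == 1:
                defect_then_defect += 1
    if defects_with_successor == 0:
        return None
    return defect_then_defect / defects_with_successor


# Toy labels for illustration only (not taken from the competition data).
toy_responses = [0, 0, 1, 1, 0, 0, 0, 1, 1, 0]
print(next_defect_probability(toy_responses))
```

Comparing this conditional estimate with the overall defect rate indicates whether defects tend to cluster in consecutive jobs.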

Instructions for Installation

  • Download Dataset from Link and unzip the files in the data/ folder
  • Run any of the scripts from the scripts/ folder according to the task desired
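Before running a script, it can help to confirm that the unzipped CSV files are where the scripts expect them. The sketch below is one possible check, not part of the repository. Only test_numeric.csv is named in this README; train_numeric.csv is an assumed file name, and the helper name is hypothetical.

```python
from pathlib import Path


def missing_data_files(data_dir, expected_names):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected_names if not (data_dir / name).is_file()]


# test_numeric.csv appears in this README; train_numeric.csv is an assumption.
EXPECTED_FILES = ["train_numeric.csv", "test_numeric.csv"]

missing = missing_data_files("data", EXPECTED_FILES)
if missing:
    print("Missing from data/:", ", ".join(missing))
else:
    print("All expected CSV files found in data/")
```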


  • Python: 2.7.6
  • Pandas: 0.19.1
  • Python Sklearn: 0.18.1
  • Numpy: 1.8.2
  • R: 3.3.2
  • py-spark: 2.0.1 or Account over Databricks
  • System-G: (systemg-tools-1.4.0) [For Visualizations]
  • jupyter notebook : 4.2.1 with support for R and Python kernels
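To compare an environment against the Python package versions listed above, a small check along these lines may be useful. It is a sketch under assumptions: it uses `importlib.metadata`, which requires Python 3.8 or later (newer than the Python 2.7.6 listed here), and it assumes the pip distribution names `pandas`, `scikit-learn`, and `numpy` correspond to the listed packages. The helper name is hypothetical.

```python
from importlib import metadata  # available in Python 3.8+

# Versions copied from the list above; distribution names are assumed pip names.
PINNED_VERSIONS = {
    "pandas": "0.19.1",
    "scikit-learn": "0.18.1",
    "numpy": "1.8.2",
}


def version_mismatches(pinned):
    """Map each package whose installed version differs to (wanted, installed).

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in version_mismatches(PINNED_VERSIONS).items():
    print(f"{name}: listed {wanted}, installed {installed}")
```

A mismatch does not necessarily mean the scripts will fail; it only flags where the environment differs from the one the code was tested with.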

The code has been tested on Ubuntu 14.04 LTS. It should work on other distributions, but this has not yet been tested.

In case of any issue with installation or otherwise, please contact: Aayush Mudgal


  • Aayush Mudgal
  • Sheallika Singh
  • Vibhuti Mahajan

