# Kaggle Bosch Production Line Challenge Model
## Summary
This is a machine learning model, built on [Spark](https://spark.apache.org), that attempts to solve the [Bosch Production Line Performance Challenge](https://www.kaggle.com/c/bosch-production-line-performance) on [Kaggle](https://www.kaggle.com).  Thomas Hughes started the project on November 24, 2016, after the competition had ended.  It should be considered a test of the effectiveness of technology platforms.

## Notes on Execution
Because this notebook is designed to run on Spark, it must be run with the PySpark interpreter.  The `pyspark-notebook` script, available in the GitHub repository alongside the notebook, handles most of this automatically when you use it to launch the notebook.  PySpark must be installed and properly configured, and you may need to update the script to point to your local copy of PySpark.

In [1]:
# Load File Locations, using Kaggle specifications
import json

with open('SETTINGS.json') as settings_file:
    settings = json.load(settings_file)
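
The notebook reads every path from `SETTINGS.json`, but the file itself is not shown. The following is a minimal sketch of what such a file might contain. The keys are the ones the notebook looks up (including `sample_submission.csv`, which the export cell uses as written); all values are placeholder assumptions, and `load_settings` is a hypothetical helper. Because the notebook joins paths with `+`, `source_dir` is assumed to end with a path separator.

```python
import json
import os
import tempfile

# Keys are the ones the notebook looks up; every value is a placeholder assumption.
example_settings = {
    "source_dir": "data/",
    "train_numeric_file": "train_numeric.csv",
    "test_numeric_file": "test_numeric.csv",
    "sample_submission.csv": "sample_submission.csv",
    "final_submission_file": "submission.csv",
}

REQUIRED_KEYS = set(example_settings)


def load_settings(path):
    """Load a settings file and fail early if a key the notebook needs is missing."""
    with open(path) as settings_file:
        loaded = json.load(settings_file)
    missing = REQUIRED_KEYS - set(loaded)
    if missing:
        raise KeyError("SETTINGS.json is missing keys: %s" % sorted(missing))
    return loaded


# Round-trip the example through a temporary file to show the expected layout.
with tempfile.TemporaryDirectory() as tmp_dir:
    settings_path = os.path.join(tmp_dir, "SETTINGS.json")
    with open(settings_path, "w") as out_file:
        json.dump(example_settings, out_file, indent=2)
    settings_example = load_settings(settings_path)
```

Checking the keys up front surfaces a misconfigured file before Spark starts reading large CSV files.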

## Import Bosch Data

In [2]:
from pyspark.sql import SQLContext

# sc is the SparkContext provided by the pyspark interpreter.  That's why you don't see it initialized here.
sqlContext = SQLContext(sc)

# Source directory for your data
source_dir = settings['source_dir']

# Import Bosch training numeric data
source_numeric = source_dir + settings['train_numeric_file']
train_numeric = sqlContext.read.csv(source_numeric, header = "true", inferSchema = "true")

# Fill missing values with 0.
train_numeric = train_numeric.na.fill(0)

# Now the categorical data
#source_categorical = source_dir + 'train_categorical.csv'
#train_categorical = sqlContext.read.csv(source_categorical, header="true", inferSchema="true")

## Data Wrangling
### Vectorize Feature Space for Spark

In [2]:
# We need to vectorize our features for Spark MLlib
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

# Only vectorize the non-ID and non-Response columns
ignore = ['Id', 'Response']
numeric_columns = [x for x in train_numeric.columns if x not in ignore]

assembler = VectorAssembler(
    inputCols = numeric_columns,
    outputCol = 'features')

### Reduce Dimensionality

#### Standard Scaling
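
This section is empty in the notebook, although the submission history mentions a run with a standard scaler. As a plain-Python illustration of the transform (not the Spark code used for that submission), standardising a column subtracts its mean and divides by its standard deviation. The function name `standard_scale` is hypothetical; in Spark, `pyspark.ml.feature.StandardScaler` provides a corresponding pipeline stage.

```python
import statistics


def standard_scale(values):
    """Return values shifted to zero mean and scaled to unit sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if std == 0:
        # A constant column has no spread; map it to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


column = [2.0, 4.0, 6.0, 8.0]
scaled = standard_scale(column)
```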

## Model Generation

In [3]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier

# Train a GBT model.
gbt = GBTClassifier(labelCol = "Response", featuresCol = "features", maxIter = 10, maxDepth = 10, 
                    maxMemoryInMB = 1024, maxBins = 64)

# Chain the vector assembler and GBT in a Pipeline
pipeline = Pipeline(stages = [assembler, gbt])

# Train model.  This also runs the assembler.
model = pipeline.fit(train_numeric)

## Model Performance
### Load Test Data

In [4]:
# Load just like before
source_test = source_dir + settings['test_numeric_file']
data_test = sqlContext.read.csv(source_test, header = "true", inferSchema = "true")

# And set null data to zero
data_test = data_test.na.fill(0)

### Generate Test Predictions

In [5]:
# Make predictions.
preds = model.transform(data_test)

### Format and Export Kaggle Submission

In [6]:
import pandas as pd
import numpy as np

# Collect the prediction from Spark
predsGBT = preds.select("prediction").rdd.map(lambda r: r[0]).collect()

# Format to Kaggle Format
sub = pd.read_csv(source_dir + settings['sample_submission.csv'])
sub['Response'] = np.asarray(predsGBT).astype(int)
sub.to_csv(source_dir + settings['final_submission_file'], index = False)
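
The export cell relies on the collected predictions arriving in the same row order as the sample submission file. A minimal sketch of an alternative that keeps each prediction paired with its `Id` is shown below. It uses only the standard library; `write_submission` is a hypothetical helper, and the sample pairs stand in for what `preds.select("Id", "prediction").collect()` might return.

```python
import csv
import io


def write_submission(id_prediction_pairs, out_file):
    """Write Kaggle-style rows (Id, Response), sorted by Id, to a file-like object."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["Id", "Response"])
    for row_id, prediction in sorted(id_prediction_pairs):
        # Spark returns predictions as floats (0.0 / 1.0); write them as integers.
        writer.writerow([int(row_id), int(prediction)])


# Hypothetical (Id, prediction) pairs, deliberately out of order.
pairs = [(3, 0.0), (1, 1.0), (2, 0.0)]
buffer = io.StringIO()
write_submission(pairs, buffer)
submission_text = buffer.getvalue()
```

Passing an open file instead of the `StringIO` buffer writes the submission to disk.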

## Submission History

* Submission 1: Thomas M Hughes, score 0.13591, Sun, 27 Nov 2016 23:03:33 (GBT)
* Submission 2: Thomas M Hughes, score 0.13591, Mon, 28 Nov 2016 00:06:04 (GBT w/ Standard Scaler)
* Submission 3: Thomas M Hughes, score 0.15070, Mon, 28 Nov 2016 01:27:14 (GBT w/ maxDepth=10, maxBins=64)
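
The scores above are the values reported by Kaggle. Assuming they are the competition's evaluation metric, the Matthews correlation coefficient (MCC), the sketch below shows how that metric is computed from confusion-matrix counts. The function name `mcc_from_counts` is hypothetical and the counts in the usage lines are made up for illustration; they are not results from this model.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from true/false positive/negative counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: return 0 when any row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


perfect = mcc_from_counts(tp=5, tn=5, fp=0, fn=0)
inverted = mcc_from_counts(tp=0, tn=0, fp=5, fn=5)
all_negative = mcc_from_counts(tp=0, tn=10, fp=0, fn=0)
```

MCC ranges from -1 to 1, so a classifier that predicts only the majority class scores 0 regardless of its accuracy.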