# Kaggle Bosch Production Line Challenge Model
## Summary
This is a [TensorFlow](https://www.tensorflow.org) model, built on the [Spark](https://spark.apache.org) framework, that attempts to solve the [Bosch Production Line Performance Challenge](https://www.kaggle.com/c/bosch-production-line-performance) on [Kaggle](https://www.kaggle.com). Thomas Hughes started the project on November 24, 2016, after the competition had ended. It should be considered a test of the effectiveness of the technology platforms.

## Notes on Execution
Because this notebook is designed to run with Spark, it must run under the PySpark interpreter. Launching the notebook with the `pyspark-notebook` script, which is available in the GitHub repository alongside the notebook, handles most of this automatically. PySpark must be installed and properly configured, and you may need to edit the script so that it points to your local copy of PySpark.

## Import Bosch Data

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

spark = SparkSession\
    .builder\
    .appName("example-spark")\
    .config("spark.sql.crossJoin.enabled","true")\
    .getOrCreate()

# Reuse the SparkContext owned by the session rather than relying on the
# `sc` variable that the pyspark interpreter injects.
sqlContext = SQLContext(spark.sparkContext)

# Source directory for your data
source_dir = '/Users/thughes/tmp/data/'

# Import Bosch training numeric data
source_numeric = source_dir + 'train_numeric.csv'
train_numeric = sqlContext.read.csv(source_numeric, header=True, inferSchema=True)

# Fill missing values with 0.
train_numeric = train_numeric.na.fill(0)

# Now the categorical data
#source_categorical = source_dir + 'train_categorical.csv'
#train_categorical = sqlContext.read.csv(source_categorical, header="true", inferSchema="true")

In [2]:
# Spark MLlib expects the features in a single vector column
from pyspark.ml.feature import VectorAssembler

# Only vectorize the non-ID and non-Response columns
ignore = ['Id', 'Response']
numeric_columns = [x for x in train_numeric.columns if x not in ignore]

assembler = VectorAssembler(
    inputCols=numeric_columns,
    outputCol='features')

## Data Wrangling

### Scale Data

In [3]:
# TODO: Scale Data
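
One possible reading of this step is standardization: shifting each feature to zero mean and scaling it to unit standard deviation. The sketch below shows that arithmetic on a plain Python list; the `standardize` helper is hypothetical and is not part of the notebook. In the Spark pipeline itself, a `pyspark.ml.feature.StandardScaler` stage applied to the `features` column would be the likelier tool, but that choice is an assumption rather than something the notebook settles.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        # A constant column carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Illustrative values only, not Bosch data.
scaled = standardize([1.0, 2.0, 3.0])
```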

### Normalize Data

In [4]:
# TODO: Normalize Data on Logarithmic Scale

### Drop Outliers

In [5]:
# TODO: Drop Outliers in Training Set
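
A common rule for this step, offered here only as an assumption about what the TODO might become, is Tukey's interquartile-range fence: drop values further than `k` times the IQR from the quartiles. The helpers `iqr_bounds` and `drop_outliers` below are hypothetical and operate on a plain list. On a Spark DataFrame the quartiles could instead come from `DataFrame.approxQuantile`. Because missing values were filled with 0 earlier, the quartiles of sparse columns may be dominated by that fill value, so the rule would need care on this data set.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the Tukey fences."""
    lower, upper = iqr_bounds(values, k)
    return [x for x in values if lower <= x <= upper]


# Illustrative values only, not Bosch data.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
kept = drop_outliers(sample)
```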

### Impute Missing Values

In [6]:
# TODO: Impute missing values

### Reduce Dimensionality

#### Feature Selection

In [7]:
# TODO: Feature Selection; use Spark libraries if possible

#### Independent Component Analysis (?)

In [8]:
# TODO: Engineer some independent components using ICA

#### Principal Component Analysis

In [9]:
# TODO: Engineer some principal components using PCA

## Model Generation

In [3]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier

# Configure a gradient-boosted tree (GBT) classifier.
gbt = GBTClassifier(labelCol="Response", featuresCol="features", maxIter=10)

# Chain the feature assembler and the GBT classifier in a Pipeline
pipeline = Pipeline(stages=[assembler, gbt])

# Train the model.  Fitting the pipeline also runs the assembler.
model = pipeline.fit(train_numeric)

## Model Performance
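
The competition ranked submissions by the Matthews correlation coefficient (MCC). The test file has no `Response` column, so estimating MCC locally would require a labelled hold-out, for example a slice of the training data set aside with `DataFrame.randomSplit` before fitting. The function below is a minimal sketch of the metric for binary 0/1 labels; the name `matthews_corrcoef` and the list-based interface are assumptions made for illustration, not part of the notebook.

```python
from math import sqrt


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # By convention, return 0 when any marginal count is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Illustrative labels only, not Bosch data.
perfect = matthews_corrcoef([1, 0, 1, 0], [1, 0, 1, 0])
inverted = matthews_corrcoef([1, 0, 1, 0], [0, 1, 0, 1])
```

Because failures are rare in this data set, MCC is a more informative check than plain accuracy, which a model could maximize by predicting 0 for every part.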
### Load Test Data

In [12]:
# Load the test data the same way as the training data
source_test = source_dir + 'test_numeric.csv'
data_test = sqlContext.read.csv(source_test, header=True, inferSchema=True)

# Fill missing values with 0, matching the training data
data_test = data_test.na.fill(0)



### Generate Test Predictions

In [None]:
# Make predictions.
preds = model.transform(data_test)

### Format and Export Kaggle Submission

In [None]:
import pandas as pd

# Collect Id and prediction together so that each prediction stays paired
# with its part, rather than relying on the row order returned by Spark.
preds_pd = preds.select("Id", "prediction").toPandas()

# Format to Kaggle format, aligning predictions to the sample file by Id
sub = pd.read_csv(source_dir + 'sample_submission.csv')
sub['Response'] = sub['Id'].map(preds_pd.set_index('Id')['prediction']).astype(int)
sub.to_csv(source_dir + 'bosch-spark.csv', index=False)
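
Before uploading, it can be worth confirming that the exported file has the expected shape. The sketch below assumes the submission has exactly the columns `Id` and `Response`, with `Response` restricted to 0 or 1, which matches how the cell above fills the file. The `check_submission` helper is hypothetical and uses only the standard library.

```python
import csv
import io


def check_submission(handle):
    """Validate a submission read from a text handle; return the row count."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != ["Id", "Response"]:
        raise ValueError("unexpected header: %r" % (reader.fieldnames,))
    count = 0
    for row in reader:
        int(row["Id"])  # raises ValueError if the Id is not an integer
        if row["Response"] not in ("0", "1"):
            raise ValueError("non-binary Response for Id %s" % row["Id"])
        count += 1
    return count


# In-memory demonstration with made-up rows.
demo = io.StringIO("Id,Response\n1,0\n2,1\n")
n_rows = check_submission(demo)
```

To check the real file, open it with `open(source_dir + 'bosch-spark.csv', newline='')` and pass the handle to `check_submission`.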