<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_08_5_kaggle_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 8: Kaggle Data Sets**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 8 Material

* Part 8.1: Introduction to Kaggle [[Video]](https://www.youtube.com/watch?v=v4lJBhdCuCU&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_1_kaggle_intro.ipynb)
* Part 8.2: Building Ensembles with Scikit-Learn and Keras [[Video]](https://www.youtube.com/watch?v=LQ-9ZRBLasw&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_2_keras_ensembles.ipynb)
* Part 8.3: How Should you Architect Your Keras Neural Network: Hyperparameters [[Video]](https://www.youtube.com/watch?v=1q9klwSoUQw&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_3_keras_hyperparameters.ipynb)
* Part 8.4: Bayesian Hyperparameter Optimization for Keras [[Video]](https://www.youtube.com/watch?v=sXdxyUCCm8s&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_4_bayesian_hyperparameter_opt.ipynb)
* **Part 8.5: Current Semester's Kaggle** [[Video]](https://www.youtube.com/watch?v=PHQt0aUasRg&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_08_5_kaggle_project.ipynb)


# Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [1]:
# Start CoLab
try:
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

Note: not using Google CoLab


# Part 8.5: Current Semester's Kaggle

Kaggle competition site for the current semester (Fall 2020):

* [Fall 2020 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learning-wustl-fall-2020/data)

Previous Kaggle competition sites for this class (NOT this semester's assignment, feel free to use code):
* [Spring 2020 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learningwustl-spring-2020)
* [Fall 2019 Kaggle Assignment](https://kaggle.com/c/applications-of-deep-learningwustl-fall-2019)
* [Spring 2019 Kaggle Assignment](https://www.kaggle.com/c/applications-of-deep-learningwustl-spring-2019)
* [Fall 2018 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2018)
* [Spring 2018 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-spring-2018)
* [Fall 2017 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2017)
* [Spring 2017 Kaggle Assignment](https://inclass.kaggle.com/c/applications-of-deep-learning-wustl-spring-2017)
* [Fall 2016 Kaggle Assignment](https://inclass.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2016)


# Iris as a Kaggle Competition

If the Iris data were used as a Kaggle competition, you would be given the following three files:

* [kaggle_iris_test.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_test.csv) - The data that Kaggle will evaluate you on. Contains only inputs; you must provide the answers. (contains x)
* [kaggle_iris_train.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_train.csv) - The data that you will use to train. (contains x and y)
* [kaggle_iris_sample.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_iris_sample.csv) - A sample submission for Kaggle. (contains x and y)

Important features of the Kaggle iris files (that differ from how we've previously seen files):

* The iris species is already index encoded.
* Your training data is in a separate file.
* You will load the test data to generate a submission file.
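
As a small illustration of the first point, the sketch below builds a few rows shaped like the training file and shows how an index-encoded `species` column becomes the one-hot target that a softmax output layer expects. The rows and the `*_demo` names are invented for illustration and are not taken from the real data. The column order of the dummy variables is also the order in which the network emits class probabilities.

```python
import pandas as pd

# Hypothetical rows shaped like kaggle_iris_train.csv; values are invented.
df_demo = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "sepal_l": [5.1, 7.0, 6.3, 4.9],
    "sepal_w": [3.5, 3.2, 3.3, 3.0],
    "petal_l": [1.4, 4.7, 6.0, 1.4],
    "petal_w": [0.2, 1.4, 2.5, 0.2],
    "species": [0, 1, 2, 0],   # already index encoded: 0, 1, or 2
})

# The id column identifies a row; it is not a predictor.
x_demo = df_demo[["sepal_l", "sepal_w", "petal_l", "petal_w"]].values

# One-hot encode the integer class index for categorical cross-entropy.
dummies_demo = pd.get_dummies(df_demo["species"])
y_demo = dummies_demo.values.astype("float32")

print(dummies_demo.columns.tolist())  # [0, 1, 2]
print(x_demo.shape, y_demo.shape)     # (4, 4) (4, 3)
```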

The following program generates a submission file for "Iris Kaggle".  You can use it as a starting point for assignment 3.

In [2]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

df_train = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/datasets/"+\
    "kaggle_iris_train.csv", na_values=['NA','?'])

# Encode feature vector
df_train.drop('id', axis=1, inplace=True)

num_classes = len(df_train.groupby('species').species.nunique())

print("Number of classes: {}".format(num_classes))

# Convert to numpy - Classification
x = df_train[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
dummies = pd.get_dummies(df_train['species']) # Classification
species = dummies.columns
y = dummies.values
    
# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=45)

# Train, with early stopping
model = Sequential()
model.add(Dense(50, input_dim=x.shape[1], activation='relu'))
model.add(Dense(25))
model.add(Dense(y.shape[1],activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, 
                        patience=5, verbose=1, mode='auto',
                       restore_best_weights=True)

model.fit(x_train,y_train,validation_data=(x_test,y_test),
          callbacks=[monitor],verbose=0,epochs=1000)

Number of classes: 3
Restoring model weights from the end of the best epoch.
Epoch 00055: early stopping


<tensorflow.python.keras.callbacks.History at 0x178e5493fc8>

Now that we've trained the neural network, we can check its log loss.

In [3]:
from sklearn import metrics

# Calculate multi log loss error
pred = model.predict(x_test)
score = metrics.log_loss(y_test, pred)
print("Log loss score: {}".format(score))


Log loss score: 0.3136451941728592
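
For reference, multi-class log loss is the mean negative log of the probability the model assigned to the true class:

$$ \text{log loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\,\log(p_{ik}) $$

The following is a minimal NumPy sketch of that formula, assuming one-hot targets. The function name `multiclass_log_loss` and the three example rows are made up for illustration; in practice `metrics.log_loss` does this for you.

```python
import numpy as np

def multiclass_log_loss(y_true_onehot, y_prob, eps=1e-15):
    """Mean negative log probability assigned to the true class."""
    # Clip so that log(0) can never occur.
    y_prob = np.clip(y_prob, eps, 1 - eps)
    per_row = -np.sum(y_true_onehot * np.log(y_prob), axis=1)
    return float(np.mean(per_row))

# Hypothetical targets and predictions for three rows.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
y_prob = np.array([[0.90, 0.05, 0.05],
                   [0.10, 0.80, 0.10],
                   [0.20, 0.20, 0.60]])

loss = multiclass_log_loss(y_true, y_prob)
print(loss)  # -(ln 0.9 + ln 0.8 + ln 0.6) / 3, about 0.28
```

Confident, correct predictions push the score toward zero, while a confident wrong prediction is penalized heavily.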


Now we are ready to generate the Kaggle submission file.  We will use the iris test data, which does not contain a $y$ target value.  It is our job to predict this value and submit it to Kaggle.

In [4]:
# Generate Kaggle submit file

# Encode feature vector
df_test = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/datasets/"+\
    "kaggle_iris_test.csv", na_values=['NA','?'])

# Convert to numpy - Classification
ids = df_test['id']
df_test.drop('id', axis=1, inplace=True)
x = df_test[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
y = dummies.values

# Generate predictions
pred = model.predict(x)
#pred

# Create submission data set

df_submit = pd.DataFrame(pred)
df_submit.insert(0,'id',ids)
df_submit.columns = ['id','species-0','species-1','species-2']

# Write submit file locally
df_submit.to_csv("iris_submit.csv", index=False) 

print(df_submit)

     id  species-0  species-1  species-2
0   100   0.022236   0.533230   0.444534
1   101   0.003699   0.394908   0.601393
2   102   0.004600   0.420394   0.575007
3   103   0.956168   0.040161   0.003672
4   104   0.975333   0.022761   0.001906
5   105   0.966681   0.030938   0.002381
6   106   0.992637   0.007049   0.000314
7   107   0.002810   0.358485   0.638705
8   108   0.026152   0.557480   0.416368
9   109   0.001194   0.350682   0.648124
10  110   0.000649   0.268023   0.731328
11  111   0.994907   0.004923   0.000170
12  112   0.072954   0.587299   0.339747
13  113   0.000571   0.258208   0.741221
14  114   0.977138   0.021400   0.001463
15  115   0.004665   0.449740   0.545596
16  116   0.073553   0.567955   0.358493
17  117   0.968778   0.029240   0.001982
18  118   0.983742   0.015341   0.000918
19  119   0.986016   0.013193   0.000792
20  120   0.023752   0.583601   0.392647
21  121   0.032858   0.584882   0.382260
22  122   0.004007   0.395656   0.600338
23  123   0.0008
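
Before uploading, it can help to sanity-check the submission frame. The sketch below is one possible check, not something Kaggle requires. The helper `check_submission` and the three example rows are hypothetical, and the expected header is assumed to match the column names used in the cell above. The authoritative header is whatever the competition's sample file contains.

```python
import numpy as np
import pandas as pd

def check_submission(df_submit, expected_columns, tol=1e-4):
    """Return a list of problems found in a probability submission."""
    if list(df_submit.columns) != list(expected_columns):
        return ["columns do not match the expected header"]
    problems = []
    if df_submit["id"].duplicated().any():
        problems.append("duplicate ids")
    probs = df_submit[list(expected_columns[1:])].to_numpy(dtype=float)
    if np.isnan(probs).any():
        problems.append("missing probabilities")
    elif not np.allclose(probs.sum(axis=1), 1.0, atol=tol):
        problems.append("rows do not sum to 1")
    return problems

# Hypothetical predictions for three test rows (values invented).
columns = ["id", "species-0", "species-1", "species-2"]
demo = pd.DataFrame({
    "id": [100, 101, 102],
    "species-0": [0.02, 0.96, 0.10],
    "species-1": [0.53, 0.03, 0.30],
    "species-2": [0.45, 0.01, 0.60],
})

print(check_submission(demo, columns))  # []

# Most likely class per row, useful for eyeballing the predictions.
predicted_class = demo[columns[1:]].to_numpy().argmax(axis=1)
print(predicted_class)  # [1 0 2]
```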

# MPG as a Kaggle Competition (Regression)

If the Auto MPG data were used as a Kaggle competition, you would be given the following three files:

* [kaggle_auto_test.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_test.csv) - The data that Kaggle will evaluate you on. Contains only inputs; you must provide the answers. (contains x)
* [kaggle_auto_train.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_train.csv) - The data that you will use to train. (contains x and y)
* [kaggle_auto_sample.csv](https://data.heatonresearch.com/data/t81-558/datasets/kaggle_auto_sample.csv) - A sample submission for Kaggle. (contains x and y)

The following program generates a submission file for "MPG Kaggle".  

In [5]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
import pandas as pd
import io
import os
import requests
import numpy as np
from sklearn import metrics

save_path = "."

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/datasets/"+\
    "kaggle_auto_train.csv", 
    na_values=['NA', '?'])

cars = df['name']

# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Pandas to Numpy
x = df[['cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin']].values
y = df['mpg'].values # regression

# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=42)

# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu')) # Hidden 1
model.add(Dense(10, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, 
                        verbose=1, mode='auto', restore_best_weights=True)
model.fit(x_train,y_train,validation_data=(x_test,y_test),
          verbose=2,callbacks=[monitor],epochs=1000)

# Predict
pred = model.predict(x_test)

Train on 261 samples, validate on 88 samples
Epoch 1/1000
261/261 - 0s - loss: 382597.1196 - val_loss: 246687.4858
Epoch 2/1000
261/261 - 0s - loss: 192257.0072 - val_loss: 98804.3558
Epoch 3/1000
261/261 - 0s - loss: 67605.7908 - val_loss: 28617.0703
Epoch 4/1000
261/261 - 0s - loss: 15922.8367 - val_loss: 3325.1682
Epoch 5/1000
261/261 - 0s - loss: 1270.3832 - val_loss: 512.5387
Epoch 6/1000
261/261 - 0s - loss: 1118.9636 - val_loss: 1651.5679
Epoch 7/1000
261/261 - 0s - loss: 1703.0441 - val_loss: 1161.2368
Epoch 8/1000
261/261 - 0s - loss: 900.1420 - val_loss: 452.0660
Epoch 9/1000
261/261 - 0s - loss: 355.7248 - val_loss: 304.3305
Epoch 10/1000
261/261 - 0s - loss: 336.1776 - val_loss: 353.2767
Epoch 11/1000
261/261 - 0s - loss: 364.7770 - val_loss: 337.0882
Epoch 12/1000
261/261 - 0s - loss: 334.1086 - val_loss: 301.5655
Epoch 13/1000
261/261 - 0s - loss: 318.2330 - val_loss: 295.2506
Epoch 14/1000
261/261 - 0s - loss: 315.3628 - val_loss: 294.1454
Epoch 15/1000
261/261 - 0s - lo

Epoch 126/1000
261/261 - 0s - loss: 96.6995 - val_loss: 80.4465
Epoch 127/1000
261/261 - 0s - loss: 95.5034 - val_loss: 79.5468
Epoch 128/1000
261/261 - 0s - loss: 93.9933 - val_loss: 78.7416
Epoch 129/1000
261/261 - 0s - loss: 93.2547 - val_loss: 77.2559
Epoch 130/1000
261/261 - 0s - loss: 92.0739 - val_loss: 76.4692
Epoch 131/1000
261/261 - 0s - loss: 91.3897 - val_loss: 75.0902
Epoch 132/1000
261/261 - 0s - loss: 89.5802 - val_loss: 74.2796
Epoch 133/1000
261/261 - 0s - loss: 89.2358 - val_loss: 73.7019
Epoch 134/1000
261/261 - 0s - loss: 89.2894 - val_loss: 71.7912
Epoch 135/1000
261/261 - 0s - loss: 86.9927 - val_loss: 70.9630
Epoch 136/1000
261/261 - 0s - loss: 84.9979 - val_loss: 71.5301
Epoch 137/1000
261/261 - 0s - loss: 85.4751 - val_loss: 69.3716
Epoch 138/1000
261/261 - 0s - loss: 84.5646 - val_loss: 69.2690
Epoch 139/1000
261/261 - 0s - loss: 83.6890 - val_loss: 67.7983
Epoch 140/1000
261/261 - 0s - loss: 80.8676 - val_loss: 66.0073
Epoch 141/1000
261/261 - 0s - loss: 79.7

Epoch 255/1000
261/261 - 0s - loss: 28.5132 - val_loss: 23.8399
Epoch 256/1000
261/261 - 0s - loss: 28.9835 - val_loss: 23.3674
Epoch 257/1000
261/261 - 0s - loss: 28.2271 - val_loss: 23.4548
Epoch 258/1000
261/261 - 0s - loss: 27.8565 - val_loss: 23.1535
Epoch 259/1000
261/261 - 0s - loss: 27.8770 - val_loss: 23.1761
Epoch 260/1000
261/261 - 0s - loss: 27.5445 - val_loss: 22.9507
Epoch 261/1000
261/261 - 0s - loss: 27.6223 - val_loss: 22.8882
Epoch 262/1000
261/261 - 0s - loss: 27.3854 - val_loss: 22.9048
Epoch 263/1000
261/261 - 0s - loss: 27.3946 - val_loss: 22.6476
Epoch 264/1000
261/261 - 0s - loss: 27.0089 - val_loss: 22.5546
Epoch 265/1000
261/261 - 0s - loss: 26.9027 - val_loss: 22.4856
Epoch 266/1000
261/261 - 0s - loss: 26.7630 - val_loss: 22.4675
Epoch 267/1000
261/261 - 0s - loss: 27.0150 - val_loss: 22.3077
Epoch 268/1000
261/261 - 0s - loss: 26.3339 - val_loss: 22.1958
Epoch 269/1000
261/261 - 0s - loss: 26.5861 - val_loss: 22.3650
Epoch 270/1000
261/261 - 0s - loss: 26.3

Now that we've trained the neural network, we can check its RMSE.

In [6]:
import numpy as np

# Measure RMSE, a common error metric for regression.
# scikit-learn expects the true values first, then the predictions.
score = np.sqrt(metrics.mean_squared_error(y_test, pred))
print("Final score (RMSE): {}".format(score))

Final score (RMSE): 4.5134384517538795
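
RMSE is the square root of the mean squared difference between predictions and targets, so it is expressed in the same units as the target (here, miles per gallon). The sketch below computes it by hand with NumPy. The function name `rmse` and the three example values are made up for illustration. It also shows a shape pitfall: Keras `predict` returns an array of shape `(n, 1)` while the targets have shape `(n,)`, so subtracting them directly broadcasts to an `(n, n)` matrix unless one side is flattened first.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error; flattens both inputs to 1-D first."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Hypothetical targets and predictions.
y_true_demo = np.array([20.0, 30.0, 25.0])        # shape (3,)
y_pred_demo = np.array([[22.0], [27.0], [25.0]])  # shape (3, 1), like model.predict

score_demo = rmse(y_true_demo, y_pred_demo)
print(score_demo)  # sqrt((4 + 9 + 0) / 3), about 2.08

# Without flattening, broadcasting silently produces a 3x3 matrix.
print((y_pred_demo - y_true_demo).shape)  # (3, 3)
```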


Now we are ready to generate the Kaggle submission file.  We will use the MPG test data, which does not contain a $y$ target value.  It is our job to predict this value and submit it to Kaggle.

In [7]:
import pandas as pd

# Generate Kaggle submit file

# Encode feature vector
df_test = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/datasets/"+\
    "kaggle_auto_test.csv", na_values=['NA','?'])

# Convert to numpy - regression
ids = df_test['id']
df_test.drop('id', axis=1, inplace=True)

# Handle missing value
df_test['horsepower'] = df_test['horsepower'].\
    fillna(df['horsepower'].median())

x = df_test[['cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin']].values




# Generate predictions
pred = model.predict(x)
#pred

# Create submission data set

df_submit = pd.DataFrame(pred)
df_submit.insert(0,'id',ids)
df_submit.columns = ['id','mpg']

# Write submit file locally
df_submit.to_csv("auto_submit.csv", index=False) 

print(df_submit)

     id        mpg
0   350  29.112602
1   351  27.803200
2   352  27.981804
3   353  30.487831
4   354  27.227440
5   355  26.438324
6   356  27.886986
7   357  29.103935
8   358  26.447609
9   359  30.027260
10  360  30.312553
11  361  30.712151
12  362  23.952263
13  363  24.858467
14  364  23.459129
15  365  22.638985
16  366  26.032127
17  367  26.197884
18  368  28.448906
19  369  28.138954
20  370  27.352821
21  371  27.313377
22  372  26.464119
23  373  26.689583
24  374  26.546562
25  375  27.829781
26  376  27.466354
27  377  30.343369
28  378  29.985909
29  379  27.807251
30  380  28.450882
31  381  26.574844
32  382  28.199501
33  383  29.615051
34  384  29.048317
35  385  29.320534
36  386  29.582710
37  387  24.533165
38  388  24.426888
39  389  24.658607
40  390  21.805504
41  391  26.026482
42  392  24.947670
43  393  26.902489
44  394  26.575218
45  395  33.546684
46  396  24.233910
47  397  28.609993
48  398  28.913261
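
Note that the test cell above fills missing `horsepower` values with the median taken from the training frame `df`, not from `df_test`. Any statistic used for preprocessing is learned from the training data and then reused unchanged on the test data, so both sets are transformed the same way. A minimal sketch of that pattern follows. The two small frames and the `*_demo` names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower columns (values invented for illustration).
train_demo = pd.DataFrame({"horsepower": [100.0, 150.0, np.nan, 90.0]})
test_demo = pd.DataFrame({"horsepower": [np.nan, 200.0, 210.0]})

# Learn the fill value from the training data only (NaN is skipped).
hp_median = train_demo["horsepower"].median()

# Apply the same fill value to both data sets.
train_demo["horsepower"] = train_demo["horsepower"].fillna(hp_median)
test_demo["horsepower"] = test_demo["horsepower"].fillna(hp_median)

print(hp_median)                         # 100.0
print(test_demo["horsepower"].tolist())  # [100.0, 200.0, 210.0]
```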


# Module 8 Assignment

You can find the Module 8 assignment here: [assignment 8](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class8.ipynb)