# T81-558: Applications of Deep Neural Networks
**Module 8: Kaggle Data Sets.**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), School of Engineering and Applied Science, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module Video Material

Main video lecture:

* [Part 8.1: Introduction to Kaggle](https://www.youtube.com/watch?v=XpGI4engRjQ&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN&index=24)
* [Part 8.2: Simple Kaggle Solution for Keras](https://www.youtube.com/watch?v=AA3KFxjPxCo&index=25&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN)
* [Part 8.3: Overview of this Semester's Kaggle Assignment](https://www.youtube.com/watch?v=suwZz0qLmww)


# Helpful Functions

You will see these at the top of every module.  These are simply a set of reusable functions that we will make use of.  Each of them will be explained in greater detail as the course progresses.  Class 4 contains a complete overview of these functions.

In [3]:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shutil
import os
import requests
import base64


# Encode text values to dummy variables(i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)


# Encode text values to a single dummy variable.  The new columns (which do not replace the old) will have a 1
# at every location where the original column (name) matches each of the target_values.  One column is added for
# each target value.
def encode_text_single_dummy(df, name, target_values):
    for tv in target_values:
        l = list(df[name].astype(str))
        l = [1 if str(x) == str(tv) else 0 for x in l]
        name2 = "{}-{}".format(name, tv)
        df[name2] = l


# Encode text values to indexes (i.e. [0],[1],[2] for blue,green,red; LabelEncoder numbers the sorted labels from 0).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_


# Encode a numeric column as zscores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()

    if sd is None:
        sd = df[name].std()

    df[name] = (df[name] - mean) / sd


# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)


# Convert all missing values in the specified column to the default
def missing_default(df, name, default_value):
    df[name] = df[name].fillna(default_value)


# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column.  Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(target_type, '__iter__') else target_type
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        dummies = pd.get_dummies(df[target])
        return df.as_matrix(result).astype(np.float32), dummies.as_matrix().astype(np.float32)
    else:
        # Regression
        return df.as_matrix(result).astype(np.float32), df.as_matrix([target]).astype(np.float32)

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)


# Regression chart.
def chart_regression(pred,y,sort=True):
    t = pd.DataFrame({'pred' : pred, 'y' : y.flatten()})
    if sort:
        t.sort_values(by=['y'],inplace=True)
    a = plt.plot(t['y'].tolist(),label='expected')
    b = plt.plot(t['pred'].tolist(),label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()

# Remove all rows where the specified column is at least sd standard deviations from the mean
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name] - df[name].mean()) >= (sd * df[name].std()))]
    df.drop(drop_rows, axis=0, inplace=True)


# Encode a column to a range between normalized_low and normalized_high.
def encode_numeric_range(df, name, normalized_low=-1, normalized_high=1,
                         data_low=None, data_high=None):
    # Fill in each bound independently so a caller may supply only one of them.
    if data_low is None:
        data_low = min(df[name])
    if data_high is None:
        data_high = max(df[name])

    df[name] = ((df[name] - data_low) / (data_high - data_low)) \
               * (normalized_high - normalized_low) + normalized_low
        
# This function submits an assignment.  You can submit an assignment as many times as you like; only the final
# submission counts.  The parameters are as follows:
# data - Pandas dataframe output.
# key - Your student key that was emailed to you.
# no - The assignment class number, should be 1 through 1.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.
#               The number must match your assignment number.  For example "_class2" for class assignment #2.
def submit(data,key,no,source_file=None):
    if source_file is None and '__file__' not in globals(): raise Exception('Must specify a filename when running in a Jupyter notebook.')
    if source_file is None: source_file = __file__
    suffix = '_class{}'.format(no)
    if suffix not in source_file: raise Exception('{} must be part of the filename.'.format(suffix))
    with open(source_file, "rb") as image_file:
        encoded_python = base64.b64encode(image_file.read()).decode('ascii')
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb','.py']: raise Exception("Source file is {} must be .py or .ipynb".format(ext))
    r = requests.post("https://api.heatonresearch.com/assignment-submit",
        headers={'x-api-key':key}, json={'csv':base64.b64encode(data.to_csv(index=False).encode('ascii')).decode("ascii"),
        'assignment': no, 'ext':ext, 'py':encoded_python})
    if r.status_code == 200:
        print("Success: {}".format(r.text))
    else: print("Failure: {}".format(r.text))

# What is Kaggle?

[Kaggle](http://www.kaggle.com) runs competitions in which data scientists compete to produce the model that best fits the data. The capstone project of this chapter features Kaggle’s [Titanic data set](https://www.kaggle.com/c/titanic-gettingStarted). Before we get started with the Titanic example, it’s important to be aware of some Kaggle guidelines. First, most competitions end on a specific date. The organizers have currently scheduled the Titanic competition to end on December 31, 2016. However, they have already extended the deadline several times, and a further extension is possible. Second, the Titanic data set is considered a tutorial data set: there is no prize, and your score in the competition does not count towards becoming a Kaggle Master.

# Kaggle Ranks

Kaggle ranks are achieved by earning gold, silver and bronze medals.

* [Kaggle Top Users](https://www.kaggle.com/rankings)
* [Current Top Kaggle User's Profile Page](https://www.kaggle.com/stasg7)
* [Jeff Heaton's (your instructor) Kaggle Profile](https://www.kaggle.com/jeffheaton)
* [Current Kaggle Ranking System](https://www.kaggle.com/progression)

# Typical Kaggle Competition

A typical Kaggle competition will have several components.  Consider the Titanic tutorial:

* [Competition Summary Page](https://www.kaggle.com/c/titanic)
* [Data Page](https://www.kaggle.com/c/titanic/data)
* [Evaluation Description Page](https://www.kaggle.com/c/titanic/details/evaluation)
* [Leaderboard](https://www.kaggle.com/c/titanic/leaderboard)

## How Kaggle Competitions are Scored

Kaggle is provided with a data set by the competition sponsor.  This data set is divided up as follows:

* **Complete Data Set** - This is the complete data set.
    * **Training Data Set** - You are provided both the inputs and the outcomes for the training portion of the data set.
    * **Test Data Set** - You are provided the complete test data set; however, you are not given the outcomes.  Your submission is your predicted outcomes for this data set.
        * **Public Leaderboard** - You are not told what part of the test data set contributes to the public leaderboard.  Your public score is calculated based on this part of the data set.
        * **Private Leaderboard** - You are not told what part of the test data set contributes to the private leaderboard.  Your final score/rank is calculated based on this part.  You do not see your private leaderboard score until the end.

![How Kaggle Competitions are Scored](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_3_kaggle.png "How Kaggle Competitions are Scored")
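
The public/private split described above can be imitated in a few lines. This is a minimal sketch, not Kaggle's actual scoring code: the function name `leaderboard_scores`, the 50% public fraction, the fixed seed, and the use of plain accuracy as the metric are all assumptions made for illustration.

```python
import numpy as np


def leaderboard_scores(y_true, y_pred, public_fraction=0.5, seed=42):
    """Score one submission on a hidden public/private split (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    # The organizer fixes the split once; competitors never see this mask.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n)
    is_public = np.zeros(n, dtype=bool)
    is_public[shuffled[: int(n * public_fraction)]] = True
    correct = y_true == y_pred
    # Accuracy is assumed here; each real competition defines its own metric.
    return correct[is_public].mean(), correct[~is_public].mean(), is_public


y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 0])
public_score, private_score, mask = leaderboard_scores(y_true, y_pred)
print("public: {:.2f}  private: {:.2f}".format(public_score, private_score))
```

Because the private portion stays hidden until the end, the two scores can differ, and a model tuned heavily against the public score may rank differently on the final leaderboard.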

## Preparing a Kaggle Submission

Code need not be submitted to Kaggle.  For competitions, you are scored entirely on the accuracy of your submission file.  A Kaggle submission file is always a CSV file that contains the **Id** of the row you are predicting and the answer.  For the Titanic competition, a submission file looks something like this:

```
PassengerId,Survived
892,0
893,1
894,1
895,0
896,0
897,1
...
```

The above file states the prediction for each of the listed passengers.  You should only predict on IDs that are in the test file.  Likewise, you should render a prediction for every row in the test file.  Some competitions will have different formats for their answers.  For example, a multi-class classification competition will usually have a column for each class, holding your prediction for that class.
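
As a rough illustration of both formats, the sketch below builds a Titanic-style submission and a one-column-per-class submission with pandas. The helper names `build_submission` and `build_multiclass_submission`, the `id` column, and the class names `Class_1` through `Class_3` are hypothetical; always check the sample submission file of the competition you are entering for the real column names.

```python
import pandas as pd


def build_submission(test_ids, predictions, id_col="PassengerId", target_col="Survived"):
    """Pair every test id with exactly one prediction (hypothetical helper)."""
    if len(test_ids) != len(predictions):
        raise ValueError("Need exactly one prediction per test row.")
    submission = pd.DataFrame({id_col: list(test_ids), target_col: list(predictions)})
    if submission[id_col].duplicated().any():
        raise ValueError("Each id may appear only once.")
    return submission


def build_multiclass_submission(test_ids, probabilities, class_names, id_col="id"):
    """One column per class, each holding the predicted probability for that class."""
    submission = pd.DataFrame(probabilities, columns=class_names)
    submission.insert(0, id_col, list(test_ids))
    return submission


submission = build_submission([892, 893, 894], [0, 1, 1])
csv_text = submission.to_csv(index=False)  # index=False keeps the row index out of the file
print(csv_text)

multi = build_multiclass_submission(
    [1, 2],
    [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]],
    ["Class_1", "Class_2", "Class_3"],
)
print(multi.to_csv(index=False))
```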

# Select Kaggle Competitions

There have been many interesting competitions on Kaggle; these are some of my favorites.

## Predictive Modeling

* [Otto Group Product Classification Challenge](https://www.kaggle.com/c/otto-group-product-classification-challenge)
* [Galaxy Zoo - The Galaxy Challenge](https://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge)
* [Practice Fusion Diabetes Classification](https://www.kaggle.com/c/pf2012-diabetes)
* [Predicting a Biological Response](https://www.kaggle.com/c/bioresponse)

## Computer Vision

* [Diabetic Retinopathy Detection](https://www.kaggle.com/c/diabetic-retinopathy-detection)
* [Cats vs Dogs](https://www.kaggle.com/c/dogs-vs-cats)
* [State Farm Distracted Driver Detection](https://www.kaggle.com/c/state-farm-distracted-driver-detection)

## Time Series

* [The Marinexplore and Cornell University Whale Detection Challenge](https://www.kaggle.com/c/whale-detection-challenge)

## Other

* [Helping Santa's Helpers](https://www.kaggle.com/c/helping-santas-helpers)


# Iris as a Kaggle Competition

If the Iris data were used as a Kaggle competition, you would be given the following three files:

* [kaggle_iris_test.csv](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/data/kaggle_iris_test.csv) - The data that Kaggle will evaluate you on.  Contains only inputs; you must provide the answers.  (contains x)
* [kaggle_iris_train.csv](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/data/kaggle_iris_train.csv) - The data that you will use to train. (contains x and y)
* [kaggle_iris_sample.csv](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/data/kaggle_iris_sample.csv) - A sample submission for Kaggle. (contains x and y)

Important features of the Kaggle iris files (that differ from how we've previously seen files):

* The iris species is already index encoded.
* Your training data is in a separate file.
* You will load the test data to generate a submission file.
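
To make "index encoded" concrete, here is a small sketch using scikit-learn's `LabelEncoder`, the same class that `encode_text_index` wraps. The tiny data frame and the `species_index` column name are made up for illustration and are not taken from the Kaggle iris files.

```python
import pandas as pd
from sklearn import preprocessing

# Toy stand-in for a species column that has not been encoded yet.
df = pd.DataFrame(
    {"species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]}
)

le = preprocessing.LabelEncoder()
df["species_index"] = le.fit_transform(df["species"])

# classes_ lists the labels in sorted order; a label's position is its index.
print(list(le.classes_))
print(df)

# inverse_transform maps predicted indexes back to readable names.
decoded = le.inverse_transform(df["species_index"])
print(list(decoded))
```

Keeping the encoder (or its `classes_` array) around is what lets you translate a model's numeric output back into species names when a competition asks for labels.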

The following program generates a submission file for "Iris Kaggle".  You can use it as a starting point for assignment 3.

In [4]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.callbacks import EarlyStopping

path = "./data/"
    
filename_train = os.path.join(path,"iris_weight.csv")
#filename_test = os.path.join(path,"kaggle_iris_test.csv")
#filename_submit = os.path.join(path,"kaggle_iris_submit.csv")

train_df = pd.read_csv(filename_train,na_values=['NA','?'])

# Encode feature vector
encode_numeric_zscore(train_df,'led')
encode_numeric_zscore(train_df,'gears')
encode_numeric_zscore(train_df,'motors')
encode_numeric_zscore(train_df,'cost')
encode_numeric_zscore(train_df,'volume')

num_classes = len(train_df.groupby('weight').weight.nunique())

print("Number of classes: {}".format(num_classes))

# Create x & y for training

# Create the x-side (feature vectors) of the training




Using TensorFlow backend.


Number of classes: 13978


In [5]:
train_df[5:10]
#train_df.info()

Unnamed: 0,weight,led,gears,motors,cost,volume,shape_box,shape_cylinder,shape_sphere,metal_bronze,metal_gold,metal_platinum,metal_silver,metal_tin
5,269,-1.157443,1.350832,-1.159807,-0.432268,-1.631281,0,0,1,0,1,0,0,0
6,4575,-0.810875,-0.865815,-1.159807,-0.533871,0.600636,1,0,0,0,0,0,1,0
7,2164,-1.365384,1.697183,0.101383,-0.557984,0.934464,1,0,0,0,0,0,0,1
8,122,0.298142,1.697183,1.110336,-0.559224,-0.667908,1,0,0,1,0,0,0,0
9,1238,1.199219,1.004481,0.227502,-0.104929,-0.403856,0,0,1,0,0,1,0,0


In [6]:
train_np=train_df[0:659253].as_matrix()
test_np=train_df[659253:879004].as_matrix()



In [7]:
#x, y = to_xy(train_df,'weight')

y_train=train_np[:,0]
x_train=train_np[:,1:]    
y_test=test_np[:,0]
x_test=test_np[:,1:]   
#y_test=train_np[703204::,0]
# Split into train/test
#x_train, x_test, y_train, y_test = train_test_split(    
    #x, y, test_size=0.25, random_state=45)

In [8]:
model = Sequential()
model.add(Dense(800, input_dim=x_train.shape[1], activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(1,activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-5, patience=5, verbose=1, mode='auto')
model.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor],verbose=2,epochs=100)


Train on 659253 samples, validate on 219751 samples
Epoch 1/100
 - 25s - loss: 146178.2754 - val_loss: 74177.4131
Epoch 2/100
 - 25s - loss: 74813.0829 - val_loss: 73136.3983
Epoch 3/100
 - 25s - loss: 74534.9413 - val_loss: 73230.6260
Epoch 4/100
 - 25s - loss: 74318.0553 - val_loss: 72767.1037
Epoch 5/100
 - 25s - loss: 74115.9839 - val_loss: 75504.6357
Epoch 6/100
 - 25s - loss: 74072.5923 - val_loss: 72727.5923
Epoch 7/100
 - 25s - loss: 73964.9825 - val_loss: 72808.3738
Epoch 8/100
 - 25s - loss: 73898.0443 - val_loss: 72219.3165
Epoch 9/100
 - 25s - loss: 73793.8078 - val_loss: 73382.3388
Epoch 10/100
 - 25s - loss: 73701.0989 - val_loss: 72814.2807
Epoch 11/100
 - 25s - loss: 73659.7952 - val_loss: 74409.0279
Epoch 12/100
 - 25s - loss: 73605.8045 - val_loss: 72911.6052
Epoch 13/100
 - 25s - loss: 73503.5095 - val_loss: 72321.6006
Epoch 00013: early stopping


<keras.callbacks.History at 0x180952a1d0>

In [6]:
train_np=pd.concat([train_df[0:439502],train_df[659253:879004]],axis=0).as_matrix()
test_np=train_df[439502:659253].as_matrix()



In [7]:
y_train=train_np[:,0]
x_train=train_np[:,1:]    
y_test=test_np[:,0]
x_test=test_np[:,1:]   

In [8]:
model2 = Sequential()
model2.add(Dense(800, input_dim=x_train.shape[1], activation='relu'))
model2.add(Dense(50, activation='relu'))
model2.add(Dense(10, activation='relu'))
model2.add(Dense(1,activation='linear'))
model2.compile(loss='mean_squared_error', optimizer='adam')
monitor2 = EarlyStopping(monitor='val_loss', min_delta=1e-5, patience=5, verbose=1, mode='auto')
model2.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor2],verbose=2,epochs=100)

Train on 659253 samples, validate on 219751 samples
Epoch 1/100
 - 26s - loss: 137396.4015 - val_loss: 75468.6173
Epoch 2/100
 - 26s - loss: 74988.0603 - val_loss: 74095.0008
Epoch 3/100
 - 26s - loss: 74485.0625 - val_loss: 78906.5785
Epoch 4/100
 - 26s - loss: 74290.6961 - val_loss: 74182.5548
Epoch 5/100
 - 26s - loss: 74208.3082 - val_loss: 74497.9172
Epoch 6/100
 - 26s - loss: 74053.9282 - val_loss: 74324.1282
Epoch 7/100
 - 26s - loss: 73999.5827 - val_loss: 73545.0046
Epoch 8/100
 - 27s - loss: 73851.0451 - val_loss: 74967.8897
Epoch 9/100
 - 26s - loss: 73766.7808 - val_loss: 73593.1802
Epoch 10/100
 - 27s - loss: 73745.7871 - val_loss: 73185.6248
Epoch 11/100
 - 26s - loss: 73486.6705 - val_loss: 74393.9355
Epoch 12/100
 - 26s - loss: 73344.8690 - val_loss: 74021.2771
Epoch 13/100
 - 26s - loss: 73266.5550 - val_loss: 73160.6138
Epoch 14/100
 - 26s - loss: 73145.3360 - val_loss: 72525.9760
Epoch 15/100
 - 26s - loss: 73004.6222 - val_loss: 72246.9648
Epoch 16/100
 - 26s - loss

<keras.callbacks.History at 0x1c2732feb8>

In [6]:
train_np=train_df[219751:879004].as_matrix()
test_np=train_df[0:219751].as_matrix()



In [7]:
y_train=train_np[:,0]
x_train=train_np[:,1:]    
y_test=test_np[:,0]
x_test=test_np[:,1:]   

In [8]:
model3 = Sequential()
model3.add(Dense(800, input_dim=x_train.shape[1], activation='relu'))
model3.add(Dense(50, activation='relu'))
model3.add(Dense(10, activation='relu'))
model3.add(Dense(1,activation='linear'))
model3.compile(loss='mean_squared_error', optimizer='adam')
monitor3 = EarlyStopping(monitor='val_loss', min_delta=1e-5, patience=5, verbose=1, mode='auto')
model3.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor3],verbose=2,epochs=100)

Train on 659253 samples, validate on 219751 samples
Epoch 1/100
 - 28s - loss: 144605.4825 - val_loss: 75132.2576
Epoch 2/100
 - 28s - loss: 74760.2840 - val_loss: 74235.4007
Epoch 3/100
 - 28s - loss: 74421.7130 - val_loss: 80929.9295
Epoch 4/100
 - 27s - loss: 74202.3992 - val_loss: 74035.6749
Epoch 5/100
 - 27s - loss: 74089.7412 - val_loss: 74705.6925
Epoch 6/100
 - 27s - loss: 73959.0965 - val_loss: 72899.5992
Epoch 7/100
 - 28s - loss: 73901.4663 - val_loss: 73306.5542
Epoch 8/100
 - 27s - loss: 73739.1826 - val_loss: 72667.2701
Epoch 9/100
 - 28s - loss: 73707.6700 - val_loss: 72457.1600
Epoch 10/100
 - 28s - loss: 73686.0561 - val_loss: 72852.2821
Epoch 11/100
 - 27s - loss: 73603.3759 - val_loss: 72598.9915
Epoch 12/100
 - 28s - loss: 73557.7105 - val_loss: 76510.3442
Epoch 13/100
 - 27s - loss: 73480.0415 - val_loss: 73746.0857
Epoch 14/100
 - 27s - loss: 73429.4965 - val_loss: 72658.1943
Epoch 00014: early stopping


<keras.callbacks.History at 0x1c29fadac8>

In [12]:
train_np=pd.concat([train_df[0:219751],train_df[439502:879004]],axis=0).as_matrix()
test_np=train_df[219751:439502].as_matrix()
y_train=train_np[:,0]
x_train=train_np[:,1:]    
y_test=test_np[:,0]
x_test=test_np[:,1:]   
model4 = Sequential()
model4.add(Dense(800, input_dim=x_train.shape[1], activation='relu'))
model4.add(Dense(50, activation='relu'))
model4.add(Dense(10, activation='relu'))
model4.add(Dense(1,activation='linear'))
model4.compile(loss='mean_squared_error', optimizer='adam')
monitor4 = EarlyStopping(monitor='val_loss', min_delta=1e-5, patience=5, verbose=1, mode='auto')
model4.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor4],verbose=2,epochs=100)



Train on 659253 samples, validate on 219751 samples
Epoch 1/100
 - 29s - loss: 146551.9167 - val_loss: 75355.4112
Epoch 2/100
 - 28s - loss: 75060.8614 - val_loss: 74247.4260
Epoch 3/100
 - 28s - loss: 74443.5184 - val_loss: 74790.4621
Epoch 4/100
 - 28s - loss: 74243.6783 - val_loss: 73508.1375
Epoch 5/100
 - 28s - loss: 74085.1206 - val_loss: 73987.4532
Epoch 6/100
 - 28s - loss: 73962.0061 - val_loss: 72980.6766
Epoch 7/100
 - 28s - loss: 73891.1985 - val_loss: 73431.0160
Epoch 8/100
 - 28s - loss: 73844.9873 - val_loss: 73252.5093
Epoch 9/100
 - 28s - loss: 73754.0529 - val_loss: 73529.1921
Epoch 10/100
 - 28s - loss: 73622.5266 - val_loss: 74430.4262
Epoch 11/100
 - 29s - loss: 73574.8397 - val_loss: 72788.5041
Epoch 12/100
 - 28s - loss: 73517.1306 - val_loss: 73164.7154
Epoch 13/100
 - 28s - loss: 73383.5301 - val_loss: 74306.0749
Epoch 14/100
 - 27s - loss: 73272.4570 - val_loss: 74259.4334
Epoch 15/100
 - 28s - loss: 73266.3516 - val_loss: 72810.2003
Epoch 16/100
 - 27s - loss

<keras.callbacks.History at 0x1c2a103b38>
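
The four training runs above each hold out a different contiguous quarter of the 879004 rows (219751 rows per quarter), which amounts to a hand-written 4-fold cross-validation. As a hedged sketch of the same idea, scikit-learn's `KFold` can generate the index ranges; the 8-row array below is only a stand-in for the real training frame, and `KFold` visits the quarters in file order rather than in the order used by the cells above.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(8).reshape(8, 1)  # stand-in for the training matrix

# shuffle=False keeps each validation fold a contiguous block, like the manual slices.
kf = KFold(n_splits=4, shuffle=False)

folds = []
for fold, (train_idx, val_idx) in enumerate(kf.split(rows), start=1):
    folds.append((train_idx.tolist(), val_idx.tolist()))
    print("fold {}: train on {} rows, validate on rows {}".format(
        fold, len(train_idx), val_idx.tolist()))
```

With the real data, `train_idx` and `val_idx` would be used to slice the feature matrix and target for each model, so the slice boundaries never have to be typed by hand.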

In [5]:
train_np=pd.concat([train_df[0:219751],train_df[439502:879004]],axis=0).as_matrix()
test_np=train_df[219751:439502].as_matrix()
y_train=train_np[:,0]
x_train=train_np[:,1:]    
y_test=test_np[:,0]
x_test=test_np[:,1:]   
model5 = Sequential()
model5.add(Dense(50, input_dim=x_train.shape[1], activation='relu'))
model5.add(Dense(20, activation='relu'))
model5.add(Dense(1,activation='linear'))
model5.compile(loss='mean_squared_error', optimizer='adam')
monitor5 = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto')
model5.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor5],verbose=2,epochs=100)



Train on 659253 samples, validate on 219751 samples
Epoch 1/100
 - 19s - loss: 352625.8589 - val_loss: 87000.1877
Epoch 2/100
 - 18s - loss: 83586.3388 - val_loss: 80800.0351
Epoch 3/100
 - 18s - loss: 78310.7028 - val_loss: 76734.4984
Epoch 4/100
 - 18s - loss: 75425.2568 - val_loss: 75180.7518
Epoch 5/100
 - 18s - loss: 73988.2624 - val_loss: 73466.2428
Epoch 6/100
 - 18s - loss: 73311.7344 - val_loss: 74177.6096
Epoch 7/100
 - 18s - loss: 73142.4997 - val_loss: 73429.1096
Epoch 8/100
 - 18s - loss: 73045.5986 - val_loss: 72951.9707
Epoch 9/100
 - 18s - loss: 73009.0633 - val_loss: 73425.3764
Epoch 10/100
 - 18s - loss: 72954.8379 - val_loss: 73321.5803
Epoch 11/100
 - 18s - loss: 72914.8509 - val_loss: 72948.4518
Epoch 12/100
 - 18s - loss: 72872.4502 - val_loss: 72796.8650
Epoch 13/100
 - 18s - loss: 72862.3848 - val_loss: 74469.0372
Epoch 14/100
 - 18s - loss: 72838.7390 - val_loss: 72631.6555
Epoch 15/100
 - 18s - loss: 72832.1140 - val_loss: 72726.5456
Epoch 16/100
 - 18s - loss

<keras.callbacks.History at 0x107c114a8>

In [14]:
train_np=pd.concat([train_df[0:219751],train_df[439502:879004]],axis=0).as_matrix()
test_np=train_df[219751:439502].as_matrix()
y_train=train_np[:,0]
x_train=train_np[:,1:]    
y_test=test_np[:,0]
x_test=test_np[:,1:]   
model6 = Sequential()
model6.add(Dense(800, input_dim=x_train.shape[1], activation='relu'))
model6.add(Dense(30, activation='relu'))
model6.add(Dense(1,activation='linear'))
model6.compile(loss='mean_squared_error', optimizer='adam')
monitor6 = EarlyStopping(monitor='val_loss', min_delta=1e-5, patience=5, verbose=1, mode='auto')
model6.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor6],verbose=2,epochs=100)



Train on 659253 samples, validate on 219751 samples
Epoch 1/100
 - 25s - loss: 163897.7217 - val_loss: 76059.0476
Epoch 2/100
 - 25s - loss: 74577.8122 - val_loss: 73500.0497
Epoch 3/100
 - 25s - loss: 74035.6897 - val_loss: 73646.7569
Epoch 4/100
 - 25s - loss: 73825.9735 - val_loss: 74164.8122
Epoch 5/100
 - 25s - loss: 73610.9527 - val_loss: 73642.9545
Epoch 6/100
 - 25s - loss: 73465.4780 - val_loss: 73715.0059
Epoch 7/100
 - 25s - loss: 73382.4838 - val_loss: 73048.8671
Epoch 8/100
 - 25s - loss: 73325.5231 - val_loss: 73047.9469
Epoch 9/100
 - 25s - loss: 73326.5232 - val_loss: 72976.4285
Epoch 10/100
 - 25s - loss: 73216.3396 - val_loss: 73431.9975
Epoch 11/100
 - 25s - loss: 73166.7587 - val_loss: 73026.0353
Epoch 12/100
 - 25s - loss: 73140.6770 - val_loss: 72891.3120
Epoch 13/100
 - 26s - loss: 73077.0666 - val_loss: 74701.8249
Epoch 14/100
 - 26s - loss: 73097.3978 - val_loss: 73646.1406
Epoch 15/100
 - 26s - loss: 73007.8654 - val_loss: 73588.8105
Epoch 16/100
 - 25s - loss

<keras.callbacks.History at 0x1c311a6668>

In [9]:
train_np=train_df.as_matrix()
y_train=train_np[:,0]
x_train=train_np[:,1:]      



In [10]:
from sklearn.ensemble import RandomForestRegressor
train_np=train_df.as_matrix()
y_train=train_np[:,0]
x_train=train_np[:,1:] 
rfr_pred = RandomForestRegressor(random_state=0, n_estimators=1000, n_jobs=-1)
rfr_pred.fit(x_train, y_train)

  


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=-1,
           oob_score=False, random_state=0, verbose=0, warm_start=False)

In [None]:
data_train=pd.read_csv("/Users/apple/study/DNN/kaggle/in-class competition/all/train.csv")
data_train.drop('id', axis=1, inplace=True)
df_sphere=data_train[data_train['shape']=="sphere"]
df_sphere.loc[:,'volume']=((df_sphere.length/2)**3)*3.14159*4/3
data_train.loc[(data_train['shape']=="sphere"),'volume']=df_sphere.volume
df_cylinder=data_train[data_train['shape']=="cylinder"]
df_cylinder.loc[:,'volume']=3.14159*df_cylinder.height*((df_cylinder.width/2)**2)
data_train.loc[(data_train['shape']=="cylinder"),'volume']=df_cylinder.volume
df_box=data_train[data_train['shape']=="box"]
df_box.loc[:,'volume']=df_box.height*df_box.width*df_box.length
data_train.loc[(data_train['shape']=="box"),'volume']=df_box.volume
dummies_shape=pd.get_dummies(data_train['shape'],prefix='shape')
dummies_metal=pd.get_dummies(data_train['metal'],prefix='metal')
df=pd.concat([data_train,dummies_shape,dummies_metal],axis=1)
df.drop(['shape','metal','motor_vol','gear_vol','volume_parts'],axis=1,inplace=True)

In [38]:
from sklearn.ensemble import RandomForestRegressor
def set_missing_cost(df):
    cost_df=df[['cost','metal_cost','volume','shape_box','shape_cylinder','shape_sphere','led','gears','motors']]
    
    known_cost=cost_df[cost_df.cost!="?"].as_matrix()
    unknown_cost=cost_df[cost_df.cost=="?"].as_matrix()
    
    y=known_cost[:,0]
    X=known_cost[:,1:]
    
    rfr=RandomForestRegressor(random_state=0,n_estimators=1000,n_jobs=-1)
    rfr.fit(X,y)
    predictedCost=rfr.predict(unknown_cost[:,1::])
    df.loc[(df.cost=="?"),'cost']=predictedCost
    return df,rfr

In [39]:
df,rfr=set_missing_cost(df)



In [22]:
data_test=pd.read_csv("/Users/apple/study/DNN/kaggle/in-class competition/all/test.csv")
ids = data_test['id']
data_test.drop('id', axis=1, inplace=True)
data_test['volume']=0
data_test[1:5]

Unnamed: 0,shape,metal,metal_cost,height,width,length,led,gears,motors,led_vol,motor_vol,gear_vol,volume_parts,cost,volume
1,box,platinum,29.44,4,5,8,42,5,4,?,?,?,?,96386.0,0
2,sphere,platinum,29.44,0,0,9,55,30,11,?,?,?,?,?,0
3,cylinder,platinum,29.44,9,7,0,67,22,8,?,?,?,?,?,0
4,box,bronze,0.05,3,9,3,34,46,0,?,?,?,?,91.0,0


In [41]:
df_sphere=data_test[data_test['shape']=="sphere"]
df_sphere.loc[:,'volume']=((df_sphere.length/2)**3)*3.14159*4/3
data_test.loc[(data_test['shape']=="sphere"),'volume']=df_sphere.volume
df_cylinder=data_test[data_test['shape']=="cylinder"]
df_cylinder.loc[:,'volume']=3.14159*df_cylinder.height*((df_cylinder.width/2)**2)
data_test.loc[(data_test['shape']=="cylinder"),'volume']=df_cylinder.volume
df_box=data_test[data_test['shape']=="box"]
df_box.loc[:,'volume']=df_box.height*df_box.width*df_box.length
data_test.loc[(data_test['shape']=="box"),'volume']=df_box.volume
data_test[1:30]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,shape,metal,metal_cost,height,width,length,led,gears,motors,led_vol,motor_vol,gear_vol,volume_parts,cost,volume
1,box,platinum,29.44,4,5,8,42,5,4,?,?,?,?,96386.0,160.0
2,sphere,platinum,29.44,0,0,9,55,30,11,?,?,?,?,?,381.703185
3,cylinder,platinum,29.44,9,7,0,67,22,8,?,?,?,?,?,346.360298
4,box,bronze,0.05,3,9,3,34,46,0,?,?,?,?,91.0,81.0
5,sphere,tin,0.06,0,0,7,18,26,1,?,?,?,?,?,179.594228
6,box,gold,39.1,3,5,7,70,25,2,?,?,?,?,74159.0,105.0
7,sphere,silver,0.47,0,0,8,8,46,12,?,?,?,?,680.0,268.082347
8,box,tin,0.06,4,6,3,90,49,3,?,?,?,?,76.0,72.0
9,box,gold,39.1,5,3,5,38,8,6,?,?,?,?,?,75.0
10,cylinder,tin,0.06,7,6,0,35,19,8,?,?,?,?,104.0,197.92017
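
The volume cell above filters one shape at a time and writes into each filtered frame, which is what triggers the pandas warning shown in its output. One possible alternative, sketched here under the assumption that the columns are named `shape`, `height`, `width`, and `length` as in the table above, computes all three formulas at once with `np.select` and assigns the result to a copy. The helper name `add_volume` is hypothetical, and it uses `np.pi` rather than 3.14159, so values differ slightly in the later decimals.

```python
import numpy as np
import pandas as pd


def add_volume(df):
    """Return a copy of df with a 'volume' column derived from 'shape' (hypothetical helper)."""
    out = df.copy()
    radius_from_length = out["length"] / 2.0  # a sphere stores its diameter in 'length'
    radius_from_width = out["width"] / 2.0    # a cylinder stores its diameter in 'width'
    conditions = [
        (out["shape"] == "sphere").to_numpy(),
        (out["shape"] == "cylinder").to_numpy(),
        (out["shape"] == "box").to_numpy(),
    ]
    choices = [
        (4.0 / 3.0 * np.pi * radius_from_length ** 3).to_numpy(),
        (np.pi * out["height"] * radius_from_width ** 2).to_numpy(),
        (out["height"] * out["width"] * out["length"]).to_numpy(dtype=float),
    ]
    # Rows with an unrecognized shape get NaN instead of a silent zero.
    out["volume"] = np.select(conditions, choices, default=np.nan)
    return out


sample = pd.DataFrame({
    "shape": ["box", "sphere", "cylinder"],
    "height": [4, 0, 9],
    "width": [5, 0, 7],
    "length": [8, 9, 0],
})
result = add_volume(sample)
print(result)
```

The three sample rows mirror rows 1 to 3 of the table above, so the printed volumes can be compared against it directly.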


In [42]:
dummies_shape=pd.get_dummies(data_test['shape'],prefix='shape')
dummies_metal=pd.get_dummies(data_test['metal'],prefix='metal')
df=pd.concat([data_test,dummies_shape,dummies_metal],axis=1)
df.drop(['shape','metal','motor_vol','gear_vol','volume_parts'],axis=1,inplace=True)

In [None]:
#from sklearn.ensemble import RandomForestRegressor
#def set_missing_cost(df):
    #cost_df=df[['cost','metal_cost','volume','shape_box','shape_cylinder','shape_sphere','led','gears','motors']]
    
    #known_cost=cost_df[cost_df.cost!="?"].as_matrix()
    #unknown_cost=cost_df[cost_df.cost=="?"].as_matrix()
    
    #y=known_cost[:,0]
    #X=known_cost[:,1:]
    
   # rfr=RandomForestRegressor(random_state=0,n_estimators=1000,n_jobs=-1)
    #rfr.fit(X,y)
    #predictedCost=rfr.predict(unknown_cost[:,1::])
    #df.loc[(df.cost=="?"),'cost']=predictedCost
    #return df,rfr

In [45]:
cost_df=df[['cost','metal_cost','volume','shape_box','shape_cylinder','shape_sphere','led','gears','motors']]
unknown_cost=cost_df[cost_df.cost=="?"].as_matrix()

X=unknown_cost[:,1:]

predictedCost=rfr.predict(X)


  


In [46]:
df.loc[(df.cost=="?"),'cost']=predictedCost
df[1:30]

Unnamed: 0,metal_cost,height,width,length,led,gears,motors,led_vol,cost,volume,shape_box,shape_cylinder,shape_sphere,metal_bronze,metal_gold,metal_platinum,metal_silver,metal_tin
1,29.44,4,5,8,42,5,4,?,96386.0,160.0,1,0,0,0,0,1,0,0
2,29.44,0,0,9,55,30,11,?,200163.0,381.703185,0,0,1,0,0,1,0,0
3,29.44,9,7,0,67,22,8,?,201427.0,346.360298,0,1,0,0,0,1,0,0
4,0.05,3,9,3,34,46,0,?,91.0,81.0,1,0,0,1,0,0,0,0
5,0.06,0,0,7,18,26,1,?,113.735,179.594228,0,0,1,0,0,0,0,1
6,39.1,3,5,7,70,25,2,?,74159.0,105.0,1,0,0,0,1,0,0,0
7,0.47,0,0,8,8,46,12,?,680.0,268.082347,0,0,1,0,0,0,1,0
8,0.06,4,6,3,90,49,3,?,76.0,72.0,1,0,0,0,0,0,0,1
9,39.1,5,3,5,38,8,6,?,22006.5,75.0,1,0,0,0,1,0,0,0
10,0.06,7,6,0,35,19,8,?,104.0,197.92017,0,1,0,0,0,0,0,1


In [47]:
df.to_csv("/Users/apple/study/DNN/kaggle/in-class competition/set_missing_cost_test.csv", index=False)

In [11]:
df=pd.read_csv("/Users/apple/study/DNN/kaggle/in-class competition/set_missing_cost_test.csv")

In [12]:
df['cost'] = df['cost'].astype(float)

In [13]:
test_df=df[['led','gears','motors','cost','volume','shape_box','shape_cylinder','shape_sphere','metal_bronze','metal_gold','metal_platinum','metal_silver','metal_tin']]

In [14]:
encode_numeric_zscore(test_df,'led')
encode_numeric_zscore(test_df,'gears')
encode_numeric_zscore(test_df,'motors')
encode_numeric_zscore(test_df,'cost')
encode_numeric_zscore(test_df,'volume')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
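
Note that the calls above omit the `mean` and `sd` arguments, so the test columns are standardized with the test set's own statistics rather than those of the training data. A possible alternative, sketched below on made-up numbers, records the training mean and standard deviation before the training column is overwritten and passes them in for the test set. This is an illustration of the helper's optional parameters, not a claim about what the original submission did.

```python
import pandas as pd


def encode_numeric_zscore(df, name, mean=None, sd=None):
    # Same helper as at the top of this module, repeated so the sketch runs on its own.
    if mean is None:
        mean = df[name].mean()
    if sd is None:
        sd = df[name].std()
    df[name] = (df[name] - mean) / sd


# Toy frames; the real ones would be train_df and a copy of test_df.
train = pd.DataFrame({"cost": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"cost": [2.0, 4.0]})

# Capture the training statistics before the column is replaced by z-scores.
cost_mean = train["cost"].mean()
cost_sd = train["cost"].std()

encode_numeric_zscore(train, "cost")
encode_numeric_zscore(test, "cost", mean=cost_mean, sd=cost_sd)
print(train["cost"].tolist())
print(test["cost"].tolist())
```

Scaling both sets with the same statistics means a given raw cost maps to the same z-score whether it appears in the training or the test file.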


In [15]:
x=test_df.as_matrix().astype(np.float32)



In [54]:

pred1 = model.predict(x)
pred1



array([[1557.5359 ],
       [3345.3916 ],
       [6857.5796 ],
       ...,
       [8833.764  ],
       [ 375.49713],
       [1657.0417 ]], dtype=float32)

In [17]:
pred2 = model2.predict(x)
pred2

array([[1543.4271 ],
       [3324.1274 ],
       [6809.9346 ],
       ...,
       [8731.941  ],
       [ 429.00458],
       [1697.662  ]], dtype=float32)

In [56]:
pred3 = model3.predict(x)
pred3

array([[1516.3594 ],
       [3284.746  ],
       [6780.087  ],
       ...,
       [8772.513  ],
       [ 393.28226],
       [1709.4618 ]], dtype=float32)

In [57]:
pred4 = model4.predict(x)
pred4

array([[1584.978  ],
       [3290.1748 ],
       [6738.4346 ],
       ...,
       [8630.934  ],
       [ 400.36923],
       [1657.0559 ]], dtype=float32)

In [2]:
pred5 = model5.predict(x)
pred5

NameError: name 'model5' is not defined

In [60]:
pred6 = model6.predict(x)
pred6

array([[1505.4438 ],
       [3260.595  ],
       [6750.543  ],
       ...,
       [8683.65   ],
       [ 414.14935],
       [1715.6493 ]], dtype=float32)

In [16]:

pred7=rfr_pred.predict(x)

In [1]:
pred7

NameError: name 'pred7' is not defined

In [None]:
pred=(pred2+pred5+pred7)/3
pred

In [62]:
pred=(pred1+pred2+pred3+pred4+pred5+pred6)/6
pred

array([[1535.2692],
       [3290.152 ],
       [6779.8247],
       ...,
       [8722.007 ],
       [ 406.2837],
       [1681.0347]], dtype=float32)

In [19]:
mean=pd.read_csv("/Users/apple/study/DNN/kaggle/in-class competition/mean_predictions.csv")

In [18]:
df_save_rf=pd.DataFrame(pred7)

df_save_rf.columns=['weight']
df_save_rf.to_csv("/Users/apple/study/DNN/kaggle/in-class competition/random_for_pred.csv", index=False)

In [None]:
random_for=pd.read_csv("/Users/apple/study/DNN/kaggle/in-class competition/random_for_pred.csv")

In [21]:
pred_mean=mean['weight']
pred=(2*pred_mean+pred7)/3
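
The blending cells above mix two kinds of output: the Keras models return arrays of shape `(n, 1)`, while `RandomForestRegressor.predict` returns shape `(n,)`. Adding the two directly broadcasts to an `(n, n)` matrix instead of averaging element by element. The sketch below, with a hypothetical helper `average_predictions` and made-up numbers, flattens every prediction first and then takes a plain or weighted mean.

```python
import numpy as np


def average_predictions(predictions, weights=None):
    """Element-wise (optionally weighted) mean of several prediction arrays (hypothetical helper)."""
    flat = [np.asarray(p, dtype=np.float64).ravel() for p in predictions]
    if len({len(p) for p in flat}) != 1:
        raise ValueError("All prediction arrays must cover the same rows.")
    return np.average(np.stack(flat, axis=0), axis=0, weights=weights)


keras_style = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like model.predict(x)
forest_style = np.array([4.0, 5.0, 6.0])        # shape (3,), like rfr_pred.predict(x)

naive = keras_style + forest_style              # broadcasts to shape (3, 3)
print(naive.shape)

# Weights of 2 and 1 mirror the (2*pred_mean + pred7)/3 blend above.
blend = average_predictions([keras_style, forest_style], weights=[2, 1])
print(blend)
```

On a test set of roughly 100,000 rows, the accidental `(n, n)` result would be far too large to be useful, so flattening before combining is a cheap safeguard.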

In [23]:
# Generate Kaggle submit file


df_submit = pd.DataFrame(pred)
df_submit.insert(0,'id',ids)
df_submit.columns = ['id','weight']

df_submit.to_csv("/Users/apple/study/DNN/kaggle/in-class competition/mean_predictions_rf.csv", index=False)

print(df_submit)


          id        weight
0          0   1540.346133
1          1   3272.757667
2          2   6798.428800
3          3   6852.795067
4          4    814.735467
5          5   1125.406711
6          6   1929.730267
7          7   1663.081600
8          8    332.717602
9          9    577.721267
10        10    758.739467
11        11    915.041867
12        12   1589.835733
13        13    988.282267
14        14   1719.519800
15        15   1228.405733
16        16   2810.892400
17        17   4552.597833
18        18   2425.073333
19        19   1027.337800
20        20   4840.480333
21        21    477.908606
22        22    566.582847
23        23    715.029000
24        24   1436.792333
25        25     70.087807
26        26    430.566020
27        27   2195.512333
28        28   1842.113667
29        29    480.333492
...      ...           ...
99970  99970   2423.537867
99971  99971   1454.717867
99972  99972   6075.916200
99973  99973    888.498600
99974  99974   1877.584333
9

# Kaggle Project

Kaggle competition site for the current semester (Spring 2018):
* [Spring 2018 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-spring-2018)

Previous Kaggle competition sites for this class (for your reference, do not use):
* [Fall 2017 Kaggle Assignment](https://www.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2017)
* [Spring 2017 Kaggle Assignment](https://inclass.kaggle.com/c/applications-of-deep-learning-wustl-spring-2017)
* [Fall 2016 Kaggle Assignment](https://inclass.kaggle.com/c/wustl-t81-558-washu-deep-learning-fall-2016)


# Module 8 Assignment

You can find the first assignment here: [assignment 8](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class1.ipynb)