## Getting Started

In this exercise I will use the features I previously engineered in R from the Kaggle Mercari Price Suggestion Challenge data set.

We start by loading the libraries and functions we will need for modeling.

In [1]:
import pandas as pd
import numpy as np
import os
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.callbacks import EarlyStopping
from sklearn.metrics import mean_squared_error
np.random.seed(12345)

Using TensorFlow backend.


In [3]:
os.listdir()

['.git',
 '.ipynb_checkpoints',
 'Deep+Learning+using+Mercari+Data+set.ipynb',
 'mini_subtrain.csv',
 'subtrain.csv',
 'validation.csv']

In [13]:
#First load the mini subtraining set we prepared previously
mini_subtrain = pd.read_csv("mini_subtrain.csv", index_col = 0)
mini_subtrain.shape
mini_subtrain.head()


Unnamed: 0,item_condition_id,price,shipping,no.brand_name,log.excl.description,excl.name,dollar.description,fancy.categories,cheap.categories,fancy.brands,...,now,cheap,buy,excellent,great,michael.brand,jordan.name,iphon.name,bundl.name,cap.letter.brand
1,1,8.0,0,1,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,2,39.0,1,1,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,1,30.0,1,0,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,2,470.0,1,0,0.0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6
5,2,22.0,0,0,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2


In [15]:
mini_subtrain.describe()

Unnamed: 0,item_condition_id,price,shipping,no.brand_name,log.excl.description,excl.name,dollar.description,fancy.categories,cheap.categories,fancy.brands,...,now,cheap,buy,excellent,great,michael.brand,jordan.name,iphon.name,bundl.name,cap.letter.brand
count,37066.0,37066.0,37066.0,37066.0,37066.0,37066.0,37066.0,37066.0,37066.0,37066.0,...,37066.0,37066.0,37066.0,37066.0,37066.0,37066.0,37066.0,37066.0,37066.0,37066.0
mean,1.902876,26.581045,0.44963,0.428776,0.46134,0.078266,0.018319,0.00348,0.005126,0.00982,...,0.032752,0.005315,0.031835,0.031134,0.102412,0.008229,0.008013,0.020774,0.051395,1.589732
std,0.901044,36.188265,0.497463,0.494908,0.985842,0.559334,0.231276,0.058892,0.071413,0.098611,...,0.198214,0.076683,0.211528,0.175689,0.330199,0.090339,0.089156,0.142628,0.220805,1.198543
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,2.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,3.0,29.0,1.0,1.0,0.693147,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
max,5.0,1106.0,1.0,1.0,11.332853,21.0,14.0,1.0,1.0,1.0,...,4.0,3.0,6.0,2.0,4.0,1.0,1.0,1.0,1.0,17.0


In [16]:
mini_subtrain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37066 entries, 1 to 37066
Data columns (total 27 columns):
item_condition_id       37066 non-null int64
price                   37066 non-null float64
shipping                37066 non-null int64
no.brand_name           37066 non-null int64
log.excl.description    37066 non-null float64
excl.name               37066 non-null int64
dollar.description      37066 non-null int64
fancy.categories        37066 non-null int64
cheap.categories        37066 non-null int64
fancy.brands            37066 non-null int64
cheap.brands            37066 non-null int64
sale                    37066 non-null int64
free                    37066 non-null int64
save                    37066 non-null int64
deal                    37066 non-null int64
good                    37066 non-null int64
steal                   37066 non-null int64
now                     37066 non-null int64
cheap                   37066 non-null int64
buy                     37066 no

After successfully reading and verifying the training data set we previously constructed in R, we can start building a small neural network and training it on our data.

First, we separate the predictors and the response into two arrays:

In [24]:
predictors = np.array(mini_subtrain.drop(["price"], axis=1))
# Log-transform the target variable, as we did in R
target = np.array(np.log(mini_subtrain.price + 1))
target.shape

(37066,)
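
The `+ 1` inside the logarithm keeps zero-priced items finite (the minimum price in `describe()` is 0). As a minimal sketch, assuming we later want predictions back in dollars, the transform and its inverse look like this. The helper names and the sample prices are illustrative only, and the standard `math` module is used so the snippet runs on its own; `np.log1p` and `np.expm1` are the NumPy equivalents.

```python
import math

def to_log_price(price):
    """Forward transform applied to the target: log(price + 1)."""
    return math.log1p(price)

def to_price(log_price):
    """Inverse transform: exp(log_price) - 1."""
    return math.expm1(log_price)

# Illustrative prices only; 0.0 is included to show why the +1 is needed
prices = [0.0, 8.0, 39.0, 470.0]
log_prices = [to_log_price(p) for p in prices]
recovered = [to_price(v) for v in log_prices]

for p, lp, r in zip(prices, log_prices, recovered):
    print(p, round(lp, 4), round(r, 4))
```

A prediction made by the network is on the log scale, so it would need to go through `to_price` (or `np.expm1`) before being read as a price.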

Next we can start building our network:

In [31]:
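# Stop training once the validation loss fails to improve for 2 consecutive epochs
estop_monitor = EarlyStopping(patience=2)
pre_shape = (predictors.shape[1],)  # one input per predictor column
model = Sequential()
# Two hidden layers with 10 units each
model.add(Dense(10, activation="relu", input_shape=pre_shape))
model.add(Dense(10, activation="relu"))
# Single output unit; ReLU keeps predictions non-negative, like log(price + 1)
model.add(Dense(1, activation="relu"))
model.compile(optimizer="adam", loss="mean_squared_error")
# Hold out 30% of the rows for validation
model_1 = model.fit(predictors, target, epochs=30, callbacks=[estop_monitor], validation_split=0.3)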

Train on 25946 samples, validate on 11120 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
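
The sample counts in the first line of this output follow from `validation_split=0.3`. Keras takes the validation rows from the end of the arrays, before any shuffling. The sketch below assumes the split point is the integer part of `n * (1 - validation_split)`, which reproduces the counts printed above; `split_sizes` is a hypothetical helper, not a Keras function.

```python
def split_sizes(n_samples, validation_split):
    """Return (n_train, n_val) when the last fraction of rows is held out."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at

# 37066 rows in mini_subtrain
n_train, n_val = split_sizes(37066, 0.3)
print(n_train, n_val)
```

Because the held-out rows are simply the last ones in the file, the validation score is only representative if the rows are not ordered in some meaningful way.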


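Note that our loss function is mean squared error (MSE). We take its square root to report root mean squared error (RMSE). Because the target is log(price + 1), this RMSE is measured on the log scale; in other words, it is the root mean squared logarithmic error (RMSLE) of the price itself.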

In [34]:
RMSE_model_1 = np.sqrt(min(model_1.history["val_loss"]))
RMSE_model_1

0.66764026261785125
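
To make the link between the two metrics concrete, here is a small sketch showing that RMSE on the log-transformed target equals RMSLE on the raw prices. The predicted prices are made up for illustration, and `rmse` and `rmsle` are hypothetical helpers written with the standard library.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmsle(price_true, price_pred):
    """Root mean squared logarithmic error on raw prices."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(price_true, price_pred)]
    return math.sqrt(sum(squared) / len(squared))

prices_true = [8.0, 39.0, 30.0]
prices_pred = [10.0, 30.0, 45.0]  # made-up predictions

log_true = [math.log1p(p) for p in prices_true]
log_pred = [math.log1p(p) for p in prices_pred]

print(rmse(log_true, log_pred), rmsle(prices_true, prices_pred))

# Exponentiating a log-scale RMSE gives a rough multiplicative error on (price + 1)
error_factor = math.exp(0.66764026261785125)
print(error_factor)
```

Read this way, a log-scale RMSE of about 0.668 corresponds to predictions of (price + 1) that are typically off by a factor of roughly 1.95.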

Note that this RMSE is close to what we obtained previously with other machine learning algorithms. We therefore continue the experiment by increasing model complexity:

In [41]:
estop_monitor = EarlyStopping(patience= 2)
pre_shape = (predictors.shape[1],)
model = Sequential()
model.add(Dense(1000,activation= "relu", input_shape = pre_shape))
model.add(Dense(1000,activation= "relu"))
model.add(Dense(1,activation = "relu"))
model.compile(optimizer= "adam", loss= "mean_squared_error")
model_2 = model.fit(predictors,target, epochs= 30, callbacks= [estop_monitor], validation_split= 0.3)
RMSE_model_2 = np.sqrt(min(model_2.history["val_loss"]))
RMSE_model_2

Train on 25946 samples, validate on 11120 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30


0.67429454720879667
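
To get a feel for how much complexity we added, we can count the trainable parameters of the two networks. This is a minimal sketch, assuming 26 predictors (the 27 columns reported by `info()` minus `price`); `dense_param_count` is a hypothetical helper. In Keras, `model.summary()` reports the same kind of count.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        total += (fan_in + 1) * units  # weights plus one bias per unit
        fan_in = units
    return total

n_predictors = 26  # assumption: 27 columns in info() minus the price column

small = dense_param_count(n_predictors, [10, 10, 1])      # first model
large = dense_param_count(n_predictors, [1000, 1000, 1])  # second model
print(small, large)
```

The second network has about a million parameters against fewer than 26,000 training rows, which may help explain why the extra capacity does not translate into a lower validation error.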

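The much larger network does not improve the validation RMSE (0.674 versus 0.668), so it looks like our first model is already at its capacity. Next, we try adding a third hidden layer instead: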

In [43]:
estop_monitor = EarlyStopping(patience= 2)
pre_shape = (predictors.shape[1],)
model = Sequential()
model.add(Dense(10,activation= "relu", input_shape = pre_shape))
model.add(Dense(10,activation= "relu"))
model.add(Dense(10,activation= "relu"))
model.add(Dense(1,activation = "relu"))
model.compile(optimizer= "adam", loss= "mean_squared_error")
model_3 = model.fit(predictors,target, epochs= 30, callbacks= [estop_monitor], validation_split= 0.3)
RMSE_model_3 = np.sqrt(min(model_3.history["val_loss"]))
RMSE_model_3

Train on 25946 samples, validate on 11120 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30


0.67136944170640311
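
The three runs so far stopped after 7, 4, and 8 of the 30 allowed epochs because of `EarlyStopping(patience=2)`, which monitors `val_loss` by default. The sketch below mimics the patience rule in plain Python on a made-up loss sequence; it is an illustration of the idea, not Keras's actual implementation.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (index of the last epoch run, best loss seen) under a patience rule."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement: stop once this has happened `patience` times in a row
            wait += 1
            if wait >= patience:
                return epoch, best
    return len(val_losses) - 1, best

losses = [0.50, 0.46, 0.45, 0.47, 0.46, 0.44]  # made-up validation losses
stop_epoch, best_loss = early_stopping_epoch(losses, patience=2)
print(stop_epoch, best_loss)
```

In this toy sequence training stops before the final, lower loss is ever reached, which is the price of a small patience value. It also shows why we take `min(history["val_loss"])` rather than the last value: the best epoch is not the one at which training stops.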

It looks like we have reached model capacity even with a relatively simple network. Next, we will see whether we can reduce the bias by training on the larger data set we have available.

In [2]:
os.listdir()

['.git',
 '.ipynb_checkpoints',
 'Deep+Learning+using+Mercari+Data+set.ipynb',
 'mini_subtrain.csv',
 'subtrain.csv',
 'validation.csv']

In [3]:
subtrain = pd.read_csv("subtrain.csv",index_col=0)

In [4]:
subtrain.shape

(741269, 27)

In [12]:
#We need to check for missing values in the data set
subtrain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 741269 entries, 1 to 741269
Data columns (total 27 columns):
item_condition_id       741269 non-null int64
price                   741269 non-null float64
shipping                741269 non-null int64
no.brand_name           741269 non-null int64
log.excl.description    741268 non-null float64
excl.name               741269 non-null int64
dollar.description      741268 non-null float64
fancy.categories        741269 non-null int64
cheap.categories        741269 non-null int64
fancy.brands            741269 non-null int64
cheap.brands            741269 non-null int64
sale                    741268 non-null float64
free                    741268 non-null float64
save                    741268 non-null float64
deal                    741268 non-null float64
good                    741268 non-null float64
steal                   741268 non-null float64
now                     741268 non-null float64
cheap                   741268 non-null f

In [7]:
y = np.log(subtrain.price + 1)
X = subtrain.drop(["price"], axis=1).values


13

In [8]:
estop_monitor = EarlyStopping(patience= 2)
pre_shape = (X.shape[1],)
model = Sequential()
model.add(Dense(10,activation= "relu", input_shape = pre_shape))
model.add(Dense(10,activation= "relu"))
model.add(Dense(10,activation= "relu"))
model.add(Dense(1,activation = "relu"))
model.compile(optimizer= "adam", loss= "mean_squared_error")
model_4 = model.fit(X,y, epochs= 30, callbacks= [estop_monitor], validation_split= 0.3)
RMSE_model_4 = np.sqrt(min(model_4.history["val_loss"]))
RMSE_model_4

Train on 518888 samples, validate on 222381 samples
Epoch 1/30
Epoch 2/30


NameError: name 'model_3' is not defined

nan
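
The `info()` output for `subtrain` lists 741268 non-null entries in several columns against 741269 rows, so at least one row contains missing values. A NaN in the inputs propagates through every arithmetic step, which is consistent with the `nan` result above, although we have not verified that this is the only cause. The sketch below uses made-up rows and the standard library to show both the propagation and a simple filter; on the DataFrame itself, `subtrain.dropna()` would be the equivalent step before building `X` and `y`.

```python
import math

def drop_rows_with_nan(rows):
    """Keep only the rows in which every value is a real number."""
    return [row for row in rows if not any(math.isnan(v) for v in row)]

# Made-up rows; the second one has a missing value
rows = [
    [1.0, 8.0, 0.0],
    [2.0, float("nan"), 1.0],
    [3.0, 30.0, 1.0],
]

# A single NaN is enough to turn an averaged squared quantity into NaN
mean_square_with_nan = sum(v ** 2 for v in rows[1]) / len(rows[1])
print(mean_square_with_nan)

clean = drop_rows_with_nan(rows)
print(len(rows), len(clean))
```

Dropping one row out of 741269 costs essentially nothing; imputing the missing values would be the alternative if many rows were affected.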