### Using *scaled* data
First, I import the necessary libraries, load the data, drop the arbitrary index column "Unnamed: 0", and view the first few observations of each set.

In [60]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler

In [48]:
train = pd.read_csv('data/red_wine_train.csv')
train.drop('Unnamed: 0',axis=1, inplace=True)
train.head()

,fixed acidity,volatile acidity,citric acid,chlorides,total sulfur dioxide,density,sulphates,alcohol,quality,alcohol_higher,va_high
0,7.0,0.685,0.0,0.067,63.0,0.9979,0.81,9.9,5,0,1
1,8.6,0.685,0.1,0.092,12.0,0.99745,0.65,9.55,6,0,1
2,5.6,0.66,0.0,0.087,11.0,0.99378,0.63,12.8,7,1,1
3,7.7,0.51,0.28,0.087,54.0,0.998,0.74,9.2,5,0,0
4,8.7,0.31,0.46,0.059,25.0,0.9966,0.76,10.1,6,0,0


In [49]:
test = pd.read_csv('data/red_wine_test.csv')
test.drop('Unnamed: 0',axis=1, inplace=True)
test.head()

,fixed acidity,volatile acidity,citric acid,chlorides,total sulfur dioxide,density,sulphates,alcohol,quality,alcohol_higher,va_high
0,12.6,0.31,0.72,0.072,29.0,0.9987,0.82,9.8,8,0,0
1,11.8,0.33,0.49,0.093,80.0,1.0002,0.76,10.7,7,0,0
2,7.1,0.875,0.05,0.082,14.0,0.99808,0.52,10.2,3,0,1
3,9.0,0.8,0.12,0.083,28.0,0.99836,0.65,10.4,6,0,1
4,7.9,0.69,0.21,0.08,141.0,0.9962,0.51,9.9,5,0,1


To decide how to scale each attribute, I view the statistical summary of each column with describe(). Some columns already range from 0 to 1, while others have maximums above 10 or, for total sulfur dioxide, almost 300. To help the neural network train well, I scale every column to [0,1] using Min-Max normalization through scikit-learn's MinMaxScaler.

Since the data is already split into train and test sets, I first recombine these two disjoint dataframes into one so that the minimum and maximum used in the formula are the same when scaling each of them. It is likely that train or test contains a min or max value the other lacks; if we fit the scaler on each dataframe separately, those values would differ (one might have a lower max, for example) and the two sets would be scaled inconsistently. Fitting once on the combined data guarantees that two equal values are transformed to the same value whether they start in test or in train.
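As a rough illustration of why the fit is shared, the sketch below applies the min-max formula x' = (x - min) / (max - min) by hand. The helper `min_max_scale` and the sample values are made up for illustration and are not part of the notebook. The only point is that the same raw value can map to different scaled values when train and test are fit separately.

```python
def min_max_scale(values, lo, hi):
    """Scale each value to [0, 1] given a fitted minimum and maximum."""
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical alcohol readings; the value 10.0 appears in both splits.
train_vals = [8.0, 10.0, 14.0]
test_vals = [9.0, 10.0, 11.0]

# Fitting separately: each split uses its own min and max.
sep_train = min_max_scale(train_vals, min(train_vals), max(train_vals))
sep_test = min_max_scale(test_vals, min(test_vals), max(test_vals))

# Fitting once on the combined values: both splits share one min and max.
combined = train_vals + test_vals
lo, hi = min(combined), max(combined)
shared_train = min_max_scale(train_vals, lo, hi)
shared_test = min_max_scale(test_vals, lo, hi)

print(sep_train[1], sep_test[1])        # same raw 10.0, scaled differently
print(shared_train[1], shared_test[1])  # same raw 10.0, scaled identically
```

With a shared fit, every value in either split also stays inside [0, 1], since the combined min and max bound both.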

As mentioned, we first concatenate test and train back into one dataframe. We then use .describe() to view a summary of each column of the training set.

In [63]:
#recombine train & test so the scaler sees the overall min and max of every column
whole_set = pd.concat([train,test])

train.describe() #summary of the training columns, showing their differing scales

,fixed acidity,volatile acidity,citric acid,chlorides,total sulfur dioxide,density,sulphates,alcohol,quality,alcohol_higher,va_high
count,1279.0,1279.0,1279.0,1279.0,1279.0,1279.0,1279.0,1279.0,1279.0,1279.0,1279.0
mean,8.270837,0.530328,0.263628,0.087603,46.213839,0.996718,0.655395,10.418647,5.637217,0.151681,0.351056
std,1.730139,0.176907,0.194559,0.047631,32.942827,0.001908,0.171623,1.069182,0.808633,0.358852,0.477487
min,4.6,0.12,0.0,0.012,6.0,0.99007,0.33,8.4,3.0,0.0,0.0
25%,7.1,0.4,0.09,0.07,22.0,0.99559,0.55,9.5,5.0,0.0,0.0
50%,7.9,0.53,0.25,0.079,38.0,0.9967,0.62,10.2,6.0,0.0,0.0
75%,9.1,0.64,0.42,0.09,62.0,0.9978,0.72,11.1,6.0,0.0,1.0
max,15.9,1.33,1.0,0.611,289.0,1.00369,2.0,14.9,8.0,1.0,1.0


Here we fit the scaler on the combined dataframe, then use it to transform the train and test dataframes separately.

In [64]:
scaler = MinMaxScaler() #build scaler
scaler.fit(whole_set) #fit scaler to the combined dataframe

#transform train and test separately, both with the scaler fit on the combined dataframe
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

#.transform returns arrays, so wrap the results in dataframes with the original column names
train_scaled = pd.DataFrame(train_scaled, columns=train.columns)
test_scaled = pd.DataFrame(test_scaled, columns=test.columns)

From each scaled dataframe, we separate the attributes from the class label 'quality'. We also store the number of attributes in a variable n_inputs to use as a parameter in the model.

In [53]:
trainScaled = train_scaled.drop('quality',axis=1) #training df without class column

testScaled = test_scaled.drop('quality',axis=1) #testing df without class column

trainScaledClass = train_scaled['quality'] #training df, class column only

testScaledClass = test_scaled['quality'] #testing df, class column only

n_inputs = [trainScaled.shape[1]] #10 attributes = 10 input nodes

Next we build, summarize, and compile the model.

In [54]:
modelScaled = tf.keras.Sequential([tf.keras.layers.Dense(units=1,input_shape=n_inputs)]) #build model with one layer

modelScaled.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 1)                 11        
                                                                 
Total params: 11
Trainable params: 11
Non-trainable params: 0
_________________________________________________________________
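The 11 parameters in the summary are consistent with a single dense unit: one weight per input attribute plus one bias. Below is a minimal sketch of that arithmetic, and of what such a layer computes when no activation is set, written in plain Python rather than Keras. The names `dense_param_count` and `dense_forward` and the example weights are hypothetical and only for illustration.

```python
def dense_param_count(n_inputs, units):
    """Weights (n_inputs * units) plus one bias per unit."""
    return n_inputs * units + units

def dense_forward(x, weights, bias):
    """Output of one linear unit with no activation: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

print(dense_param_count(10, 1))  # 11, matching the summary above

# Illustrative values only: ten scaled attributes and uniform weights.
x = [0.5] * 10
weights = [0.1] * 10
print(dense_forward(x, weights, bias=0.2))
```

With a single unit and no hidden layers, the model is a linear regression on the ten scaled attributes.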


In [55]:
modelScaled.compile(optimizer='adam',
              loss='mae',
              metrics=['accuracy'])
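For reference, the 'mae' loss is the mean of the absolute differences between predictions and labels. Because 'quality' was scaled along with the attributes, the loss is reported in scaled units. The sketch below shows one way to convert it back to quality points, assuming quality spans 3 to 8 in the combined data as it does in the training summary. The function names and sample values are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_true - y_pred| over all observations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def to_original_units(scaled_error, lo, hi):
    """Undo min-max scaling for an error: multiply by the fitted range."""
    return scaled_error * (hi - lo)

# Made-up scaled labels and predictions, for illustration only.
y_true = [0.4, 0.6, 0.8]
y_pred = [0.5, 0.6, 0.6]

mae_scaled = mean_absolute_error(y_true, y_pred)
mae_quality = to_original_units(mae_scaled, lo=3, hi=8)  # assumed quality range
print(mae_scaled, mae_quality)
```

Only the range matters when rescaling an error, since the minimum cancels out in a difference.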

Finally we fit the model and view the accuracy outputs at each epoch.

In [56]:
modelScaled.fit(trainScaled, trainScaledClass, epochs=25)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x2187d2558b0>