# Goal

We are going to train an ML model to predict the rating of a chocolate bar from the data in our flavors_of_cacao.csv file.

## Let's Take a Look At Our Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [118]:
data = pd.read_csv('flavors_of_cacao.csv')

In [119]:
data.head()

Unnamed: 0,Company Maker,Specific Bean Origin\nOr Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin
0,A. Morin,Agua Grande,1876,2016,63%,France,3.75,,Sao Tome
1,A. Morin,Kpime,1676,2015,70%,France,2.75,,Togo
2,A. Morin,Atsane,1676,2015,70%,France,3.0,,Togo
3,A. Morin,Akata,1680,2015,70%,France,3.5,,Togo
4,A. Morin,Quilla,1704,2015,70%,France,3.5,,Peru


In [120]:
for column in data.columns:
    print(column + '\n')

Company Maker

Specific Bean Origin
Or Bar Name

REF

Review
Date

Cocoa
Percent

Company
Location

Rating

Bean
Type

Broad Bean
Origin



In [121]:
print(f'Number of records total in csv file: {len(data)}')

Number of records total in csv file: 1795


### Target Vector

Looking at the data, we will use the "Rating" column as our target vector.

However, before we do anything, we must clean our data and then transform it.

## Data Cleaning

### Column Name

First things first: notice that the names of most of our columns contain a "\n". We need to change this, as it will become a parsing issue later.

Generally we want to avoid any form of white space in our column names.

E.g. ThisIsAGoodFormatForColumnNames


In [122]:
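# Remove literal newline characters and spaces; regex=False makes the match literal in every pandas version.
data.columns = data.columns.str.replace('\n', '', regex=False)
data.columns = data.columns.str.replace(' ', '', regex=False)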

print(data.columns)

Index(['CompanyMaker', 'SpecificBeanOriginOrBarName', 'REF', 'ReviewDate',
       'CocoaPercent', 'CompanyLocation', 'Rating', 'BeanType',
       'BroadBeanOrigin'],
      dtype='object')


### Missing Values

First things first, let's look at each column of data that we have and count the missing values in that column.

In [123]:
print(data.isnull().sum())

CompanyMaker                   0
SpecificBeanOriginOrBarName    0
REF                            0
ReviewDate                     0
CocoaPercent                   0
CompanyLocation                0
Rating                         0
BeanType                       1
BroadBeanOrigin                1
dtype: int64


Barely any of the data is missing: each affected column has at most one NaN, so we will simply remove every row that contains a NaN.

In [124]:

print(f'Total number of NA values: {data.isnull().sum().sum()}')
print(f'Number of values before removing NA: {len(data)}')
data = data.dropna(axis=0)
data.reset_index(drop=True, inplace=True)
print(f'Number of values after removing NA: {len(data)}')

Total number of NA values: 2
Number of values before removing NA: 1795
Number of values after removing NA: 1793


### Let's take a look at each column's min and max values (especially the numeric fields)

In [125]:
print(data.nsmallest(10, 'Rating')['Rating'])

326     1.0
437     1.0
465     1.0
1174    1.0
245     1.5
249     1.5
324     1.5
449     1.5
554     1.5
988     1.5
Name: Rating, dtype: float64


This looks good. Had any ratings been below 1, I would have removed them.

### Overall data set

In [126]:
print("Smallest values:\n")
print(data.min())


Smallest values:

CompanyMaker                                      A. Morin
SpecificBeanOriginOrBarName    "heirloom", Arriba Nacional
REF                                                      5
ReviewDate                                            2006
CocoaPercent                                          100%
CompanyLocation                                  Amsterdam
Rating                                                   1
BeanType                                            Amazon
BroadBeanOrigin                  Africa, Carribean, C. Am.
dtype: object


In [127]:
print("Largest values:\n")
print(data.max())

Largest values:

CompanyMaker                                     twenty-four blackbirds
SpecificBeanOriginOrBarName    the lost city, gracias a dias, batch 362
REF                                                                1952
ReviewDate                                                         2017
CocoaPercent                                                        99%
CompanyLocation                                                   Wales
Rating                                                                5
BeanType                                                               
BroadBeanOrigin                                                        
dtype: object
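The blank maxima for BeanType and BroadBeanOrigin hint that some cells may hold whitespace-only strings rather than true NaN values, which `isnull()` would not count. The sketch below uses a small made-up frame (the values, and the use of a non-breaking space as the placeholder, are assumptions and are not taken from the CSV) to show one way such placeholders could be surfaced.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: '\xa0' (non-breaking space) and ' ' stand in for blank cells.
toy = pd.DataFrame({
    'BeanType': ['Criollo', '\xa0', 'Trinitario', ' '],
    'BroadBeanOrigin': ['Peru', 'Togo', '\xa0', 'Ghana'],
})

# isnull() reports nothing missing because the cells contain strings.
print(toy.isnull().sum())

# Treat whitespace-only strings as missing, then count again.
cleaned = toy.replace(r'^\s*$', np.nan, regex=True)
print(cleaned.isnull().sum())
```

Whether such rows should then be dropped or kept as their own "unknown" category is a modeling decision; this notebook keeps them.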


### Remove REF

The REF field will not give us any information about the likeability of a chocolate bar, so we will remove it.

In [128]:
data = data.drop(['REF'], axis=1)

In [129]:
print(data.columns)

Index(['CompanyMaker', 'SpecificBeanOriginOrBarName', 'ReviewDate',
       'CocoaPercent', 'CompanyLocation', 'Rating', 'BeanType',
       'BroadBeanOrigin'],
      dtype='object')


## Data Transformation

We need to take every column of string data and convert each entry into numeric form.

This is required because the models we use below expect numeric input.


In [130]:
print(data.head(1))

  CompanyMaker SpecificBeanOriginOrBarName  ReviewDate CocoaPercent  \
0     A. Morin                 Agua Grande        2016          63%   

  CompanyLocation  Rating BeanType BroadBeanOrigin  
0          France    3.75                 Sao Tome  


The fields that need to be converted are everything except "Rating" and "ReviewDate".

We will also have to deal with CocoaPercent, but we will deal with this one differently.

### Let's deal with converting CocoaPercent into a numeric column

First we will convert this field and store it as its own data frame.

Later we will combine it with another dataframe to put everything back together, but in numeric form.


In [131]:
# This data is almost in numeric form as it is.
# All we need to do is cut the % sign off each entry and convert the resulting str to float.
cocoa_percent_dict = {'CocoaPercentage': [float(value.replace('%', '')) for value in data['CocoaPercent']]}

# And boom, now we have a valid data frame.
cocoa_percent_df = pd.DataFrame(cocoa_percent_dict)
print(cocoa_percent_df.head())

   CocoaPercentage
0             63.0
1             70.0
2             70.0
3             70.0
4             70.0


In [132]:
# Just for curiosity, I want to look at the max and min.
print(f"Min: {cocoa_percent_df.min()}")
print(f"Max: {cocoa_percent_df.max()}")

Min: CocoaPercentage    42.0
dtype: float64
Max: CocoaPercentage    100.0
dtype: float64
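The same conversion can also be written with pandas string methods. The sketch below runs on a small illustrative Series (the values are made up, not read from the CSV). It also shows why the earlier `data.min()` output listed "100%" as the smallest CocoaPercent: as long as the column holds strings, comparisons are alphabetical.

```python
import pandas as pd

# Toy stand-in for data['CocoaPercent']; the values are illustrative.
cocoa_percent = pd.Series(['63%', '70%', '72.5%', '100%'], name='CocoaPercent')

# As strings, '100%' sorts before '63%' because '1' < '6'.
print(cocoa_percent.min())

# Strip the trailing '%' and cast to float in one vectorized pass.
cocoa_percentage = cocoa_percent.str.rstrip('%').astype(float)
print(cocoa_percentage.min(), cocoa_percentage.max())
```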


#### Now it's time to deal with all of the other string fields.

Note: Unlike the "CocoaPercent" field, these fields have no ordinal meaning. There is no order to this data: a higher value does not mean better or worse, it is simply a label. These fields are nominal.

Since we are dealing with nominal data (and NOT ordinal), we are going to run One-Hot-Encoding on all nominal fields.

Pro: No category is tainted by calculations that treat one label as larger than another (as would happen if we encoded the labels as ordinal integers).

Con: Because of the way One-Hot-Encoding works, we get an additional vector (column) for each distinct value of each encoded field.
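To make the pro and con concrete, here is a minimal sketch on a made-up four-row column (the toy frame is an assumption for illustration). It shows how `pd.get_dummies` expands one nominal field into several indicator columns, and what `drop_first=True` changes.

```python
import pandas as pd

# Hypothetical toy column with three distinct labels.
toy = pd.DataFrame({'CompanyLocation': ['France', 'Peru', 'France', 'Wales']})

# One indicator column per distinct label.
full = pd.get_dummies(toy, dtype=int)

# drop_first=True drops the first label, which becomes the implicit baseline.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With thousands of distinct makers and bar names, this per-label expansion is what makes the column count grow so quickly.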

In [133]:
print("Reminder of what columns we have:\n")
for col in data.columns:
    print(col)

Reminder of what columns we have:

CompanyMaker
SpecificBeanOriginOrBarName
ReviewDate
CocoaPercent
CompanyLocation
Rating
BeanType
BroadBeanOrigin


### We will create a new DataFrame that will hold everything in numeric form

This is the new DataFrame into which we will insert all of our numeric columns.

We will take the columns from data that are already in the correct form, then add the encoded fields later.

In [134]:
new_data = data[['Rating', 'ReviewDate']]

# Let's add in the data frame that we created when converting CocoaPercent into numeric form.
new_data = pd.concat([new_data, cocoa_percent_df], axis=1)

print(new_data.columns)
print(new_data.head())


Index(['Rating', 'ReviewDate', 'CocoaPercentage'], dtype='object')
   Rating  ReviewDate  CocoaPercentage
0    3.75        2016             63.0
1    2.75        2015             70.0
2    3.00        2015             70.0
3    3.50        2015             70.0
4    3.50        2015             70.0


### Create One-Hot-Encoded DataFrame

We will now take all of the nominal fields and run one-hot encoding on them to get them into numeric form.

In [135]:
# Every column except the target and the two numeric fields is nominal.
nominal_fields = [col for col in data.columns if col not in ['Rating', 'ReviewDate', 'CocoaPercent']]
print(f'Nominal fields: {nominal_fields}\n')

# Are there any features that we want to test out dropping?
# list_to_drop = ['BroadBeanOrigin', 'CompanyMaker', 'CompanyLocation', 'SpecificBeanOriginOrBarName']
# data = data.drop(list_to_drop, axis=1)

nominal_fields_df = data[nominal_fields]
# pd.get_dummies does the one-hot encoding; dtype=int keeps 0/1 columns in newer pandas versions.
one_hot_encoded_df = pd.get_dummies(nominal_fields_df, drop_first=True, dtype=int)

Nominal fields: ['CompanyMaker', 'SpecificBeanOriginOrBarName', 'CompanyLocation', 'BeanType', 'BroadBeanOrigin']



In [136]:
print(f"Shape before one-hot encoding: {data.shape}")
print(f"Shape of OHE matrix: {one_hot_encoded_df.shape}")

Shape before one-hot encoding: (1793, 8)
Shape of OHE matrix: (1793, 1649)


Let's finish the construction of the "new_data" DataFrame that we were creating. Recall that this DataFrame will contain all of our one-hot-encoded fields along with the numeric fields that we already have.

In [137]:
new_data = pd.concat([new_data, one_hot_encoded_df], axis=1)
print(new_data.head())

   Rating  ReviewDate  CocoaPercentage  CompanyMaker_AMMA  \
0    3.75        2016             63.0                  0   
1    2.75        2015             70.0                  0   
2    3.00        2015             70.0                  0   
3    3.50        2015             70.0                  0   
4    3.50        2015             70.0                  0   

   CompanyMaker_Acalli  CompanyMaker_Adi  CompanyMaker_Aequare (Gianduja)  \
0                    0                 0                                0   
1                    0                 0                                0   
2                    0                 0                                0   
3                    0                 0                                0   
4                    0                 0                                0   

   CompanyMaker_Ah Cacao  CompanyMaker_Akesson's (Pralus)  \
0                      0                                0   
1                      0                        

In [138]:
print(new_data.shape)

(1793, 1652)


## Feature Reduction

Taking a look at the number of columns that we have, it is really easy to see that we have...a LOT more features than we started with.

In fact, after running One Hot Encoding, we now have 1649 different columns that are associated with just the nominal fields.

This will definitely impact the time it takes to create our model, among other things.

To combat this we will look at dimensionality reduction using the PCA algorithm. It will show how much of the variance in the data can be captured by a smaller number of components.

### Standardize our data

Before we can run PCA on our data, we need to standardize it. Each column should be rescaled to a mean of 0 and a standard deviation of 1. PCA looks for directions of maximum variance, so without this step the columns with large raw values (such as CocoaPercentage) would dominate and the variance would not make any sense. Standardized inputs also help to avoid issues with gradient descent later.

In [139]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
finalized_feature_matrix = pd.DataFrame(scaler.fit_transform(new_data))
print(finalized_feature_matrix)

          0         1         2         3         4         5         6     \
0     1.181356  1.254754 -1.375407 -0.052881 -0.033417 -0.047285 -0.033417   
1    -0.912734  0.913207 -0.268644 -0.052881 -0.033417 -0.047285 -0.033417   
2    -0.389211  0.913207 -0.268644 -0.052881 -0.033417 -0.047285 -0.033417   
3     0.657834  0.913207 -0.268644 -0.052881 -0.033417 -0.047285 -0.033417   
4     0.657834  0.913207 -0.268644 -0.052881 -0.033417 -0.047285 -0.033417   
...        ...       ...       ...       ...       ...       ...       ...   
1788  1.181356 -0.452984 -0.268644 -0.052881 -0.033417 -0.047285 -0.033417   
1789 -0.389211 -0.452984 -1.059189 -0.052881 -0.033417 -0.047285 -0.033417   
1790  0.657834 -0.452984 -1.059189 -0.052881 -0.033417 -0.047285 -0.033417   
1791  0.134311 -0.452984 -1.533516 -0.052881 -0.033417 -0.047285 -0.033417   
1792 -0.389211 -0.794532 -1.059189 -0.052881 -0.033417 -0.047285 -0.033417   

          7         8         9     ...      1642      1643    

By standardizing our data using Sklearn, we have lost our column names.

We need to add those back so we know which feature is which.

Since we are standardizing our data, the mean should be close to 0.0 and the standard deviation should be about 1.0.

In [140]:
finalized_feature_matrix.columns = new_data.columns
print(finalized_feature_matrix)

        Rating  ReviewDate  CocoaPercentage  CompanyMaker_AMMA  \
0     1.181356    1.254754        -1.375407          -0.052881   
1    -0.912734    0.913207        -0.268644          -0.052881   
2    -0.389211    0.913207        -0.268644          -0.052881   
3     0.657834    0.913207        -0.268644          -0.052881   
4     0.657834    0.913207        -0.268644          -0.052881   
...        ...         ...              ...                ...   
1788  1.181356   -0.452984        -0.268644          -0.052881   
1789 -0.389211   -0.452984        -1.059189          -0.052881   
1790  0.657834   -0.452984        -1.059189          -0.052881   
1791  0.134311   -0.452984        -1.533516          -0.052881   
1792 -0.389211   -0.794532        -1.059189          -0.052881   

      CompanyMaker_Acalli  CompanyMaker_Adi  CompanyMaker_Aequare (Gianduja)  \
0               -0.033417         -0.047285                        -0.033417   
1               -0.033417         -0.047285    

Now let's take a look at the variance of the unscaled data and make sure that it actually makes sense.

In [141]:
print(new_data.var())

Rating                                  0.228166
ReviewDate                              8.577083
CocoaPercentage                        40.024787
CompanyMaker_AMMA                       0.002782
CompanyMaker_Acalli                     0.001115
                                         ...    
BroadBeanOrigin_Venezuela, Trinidad     0.000558
BroadBeanOrigin_Venezuela/ Ghana        0.000558
BroadBeanOrigin_Vietnam                 0.020756
BroadBeanOrigin_West Africa             0.003337
BroadBeanOrigin_                        0.039078
Length: 1652, dtype: float64


In [142]:
# This is the part that we really care about...
print(f"Mean:\n\n{finalized_feature_matrix.mean()}\n")

Mean:

Rating                                 2.018897e-16
ReviewDate                            -2.683328e-14
CocoaPercentage                       -1.313939e-16
CompanyMaker_AMMA                     -1.286130e-15
CompanyMaker_Acalli                    5.108311e-16
                                           ...     
BroadBeanOrigin_Venezuela, Trinidad   -8.072414e-17
BroadBeanOrigin_Venezuela/ Ghana      -8.072801e-17
BroadBeanOrigin_Vietnam               -3.973088e-16
BroadBeanOrigin_West Africa           -4.139342e-17
BroadBeanOrigin_                      -1.039944e-16
Length: 1652, dtype: float64



In [143]:
print(f"Standard Deviation:\n\n{finalized_feature_matrix.std()}")

Standard Deviation:

Rating                                 1.000279
ReviewDate                             1.000279
CocoaPercentage                        1.000279
CompanyMaker_AMMA                      1.000279
CompanyMaker_Acalli                    1.000279
                                         ...   
BroadBeanOrigin_Venezuela, Trinidad    1.000279
BroadBeanOrigin_Venezuela/ Ghana       1.000279
BroadBeanOrigin_Vietnam                1.000279
BroadBeanOrigin_West Africa            1.000279
BroadBeanOrigin_                       1.000279
Length: 1652, dtype: float64
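Every standard deviation prints as 1.000279 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (ddof=0), while pandas `.std()` defaults to the sample standard deviation (ddof=1). The sketch below checks this on synthetic data with the same number of rows (the random column is an assumption for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n_rows = 1793  # rows left after dropping NA values
rng = np.random.default_rng(0)
toy = pd.DataFrame({'feature': rng.normal(loc=70.0, scale=6.0, size=n_rows)})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

population_std = scaled['feature'].std(ddof=0)  # what StandardScaler normalizes to 1
sample_std = scaled['feature'].std()            # pandas default, ddof=1
expected_ratio = np.sqrt(n_rows / (n_rows - 1))

print(population_std, sample_std, expected_ratio)
```

The ratio sqrt(1793 / 1792) rounds to 1.000279, which matches the printed values.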


Excellent, our standard deviation and mean look right, so it is time to run PCA. PCA will give us the share of variance explained by each principal component, ordered from most important to least important. Note that each component is a linear combination of the original features, not a single feature.

We will remove the "Rating" field from our feature matrix, as it is our target value and we do not want it to be a part of our PCA analysis. We will store our target vector off to the side while we run PCA, and officially create our X feature matrix.

In [144]:
# Recall that target vector is y.
y = finalized_feature_matrix[['Rating']]
print(y.head())

     Rating
0  1.181356
1 -0.912734
2 -0.389211
3  0.657834
4  0.657834


In [145]:
X = finalized_feature_matrix.drop(['Rating'], axis=1)
print(X.head())

   ReviewDate  CocoaPercentage  CompanyMaker_AMMA  CompanyMaker_Acalli  \
0    1.254754        -1.375407          -0.052881            -0.033417   
1    0.913207        -0.268644          -0.052881            -0.033417   
2    0.913207        -0.268644          -0.052881            -0.033417   
3    0.913207        -0.268644          -0.052881            -0.033417   
4    0.913207        -0.268644          -0.052881            -0.033417   

   CompanyMaker_Adi  CompanyMaker_Aequare (Gianduja)  CompanyMaker_Ah Cacao  \
0         -0.047285                        -0.033417              -0.023623   
1         -0.047285                        -0.033417              -0.023623   
2         -0.047285                        -0.033417              -0.023623   
3         -0.047285                        -0.033417              -0.023623   
4         -0.047285                        -0.033417              -0.023623   

   CompanyMaker_Akesson's (Pralus)  CompanyMaker_Alain Ducasse  \
0             

In [146]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(X)
print(pca.explained_variance_ratio_)
# print(pca.singular_values_)

[2.57840525e-03 2.47942508e-03 2.43544133e-03 ... 4.24498629e-35
 3.92819187e-35 3.74440466e-35]
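The raw ratios are easier to act on once they are accumulated. The sketch below uses synthetic data with a known low-dimensional structure (the data and the 95 percent threshold are assumptions, not values from this notebook) and shows one common way to pick the number of components from `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for X: 200 rows, 30 features driven by 5 latent factors plus small noise.
latent = rng.normal(size=(200, 5))
toy_X = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(200, 30))

toy_pca = PCA().fit(toy_X)
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95 percent.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components, cumulative[n_components - 1])

# PCA can also choose this count itself when n_components is a fraction.
toy_pca_95 = PCA(n_components=0.95).fit(toy_X)
print(toy_pca_95.n_components_)
```

In the output above, the largest ratio is only about 0.26 percent, so no small set of components dominates and a threshold like this would keep a large number of them.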


## Model 1

I am going to attempt to train on this data using a 75/25 train/test split with Linear Regression.

However, given the large number of features and the small number of rows that we have, I expect this model to overfit a lot and give a really poor score. The number of features (1651) is extremely close to the number of rows in our feature matrix (1793), which is a bad sign right off the bat.
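To illustrate why a feature count near or above the row count is a warning sign, the sketch below fits plain linear regression on pure noise (synthetic data, an assumption for illustration). With more features than training rows the model can reproduce the training targets almost exactly, even though there is nothing real to learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic data: 80 rows, 100 noise features, and a target unrelated to them.
toy_X = rng.normal(size=(80, 100))
toy_y = rng.normal(size=80)

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42)

toy_reg = LinearRegression().fit(X_tr, y_tr)
train_r2 = toy_reg.score(X_tr, y_tr)
test_r2 = toy_reg.score(X_te, y_te)
print(f'train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}')
```

A near-perfect training score paired with a much lower test score is the signature of overfitting that regularization is meant to address.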

In [147]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.25)

print(f"X_train size: {len(X_train)}")
print(f"X_test size: {len(X_test)}\n")

print(f"y_train size: {len(y_train)}")
print(f"y_test size: {len(y_test)}")


X_train size: 1344
X_test size: 449

y_train size: 1344
y_test size: 449


### Linear Regression on training data

In [148]:

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg = reg.fit(X_train, y_train)
print(f"Linear Regression Training Score: {reg.score(X_train, y_train) * 100} percent")

Linear Regression Training Score: 39.137388531437104 percent


### Linear Regression on testing data

In [149]:
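# Score the model fitted on the training split against the held-out test split (no refit on test data).
# Note: the output shown below was recorded from a run that refit on X_test and scored on X_train.
print(f"Linear Regression Testing Score: {reg.score(X_test, y_test) * 100} percent")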

Linear Regression Testing Score: -5.264922401675303e+29 percent


## Model 2

As predicted, the linear regression model performed horribly.

Here is what we are going to do:

1. Use cross validation for training/testing our model. We do not have a lot of data, and this lets us use everything that we have for both training and testing.

2. Try out Lasso Regression (L1 regularization), because it appears that we are STRONGLY overfitting and it has to do with the fact that we have a huge number of features. Lasso is really good at zeroing out the coefficients of features that are of no use to us. In this case that zeroing out should strongly improve our performance, so we will use it instead of Ridge.

3. To find the best regularization strength (alpha) for our Lasso model, run GridSearch over a range of candidate alphas. This gives us the best-scoring alpha and the score for that combination.

I am going to skip ahead and display the best model and tuning parameters right now, simply because GridSearch takes a while to run.
The code for GridSearch and how I found the best model is at the bottom of the notebook.
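The difference between Lasso and Ridge mentioned in step 2 can be seen on a small synthetic problem. In this sketch only three of fifty features influence the target (the data, the alpha grid, and the Ridge alpha are all assumptions for illustration), and we count how many coefficients each model sets exactly to zero.

```python
import numpy as np
from sklearn.linear_model import LassoCV, Ridge

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 50 features, only the first 3 influence the target.
toy_X = rng.normal(size=(200, 50))
toy_y = (3.0 * toy_X[:, 0] - 2.0 * toy_X[:, 1] + 1.5 * toy_X[:, 2]
         + rng.normal(scale=0.5, size=200))

candidate_alphas = [0.01, 0.05, 0.1, 0.5]
toy_lasso = LassoCV(alphas=candidate_alphas, cv=5).fit(toy_X, toy_y)
toy_ridge = Ridge(alpha=1.0).fit(toy_X, toy_y)

lasso_zeros = int(np.sum(toy_lasso.coef_ == 0))
ridge_zeros = int(np.sum(toy_ridge.coef_ == 0))

print(f'Chosen alpha: {toy_lasso.alpha_}')
print(f'Lasso coefficients set to zero: {lasso_zeros}')
print(f'Ridge coefficients set to zero: {ridge_zeros}')
```

Ridge shrinks coefficients toward zero but rarely makes them exactly zero, which is why Lasso is the natural choice when the goal is to discard features.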

In [150]:
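from sklearn.linear_model import LassoCV
from numpy import absolute

# alpha=0.02 is the best value reported by the grid search; y is flattened because LassoCV expects a 1-D target.
model = LassoCV(alphas=[0.02], cv=5, random_state=42, n_jobs=3).fit(X, y.values.ravel())
# Note: this is the R^2 score on the same data the model was fit on, not a held-out score.
print(f'Lasso Score: {absolute(model.score(X, y.values.ravel()) * 100)}')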

Lasso Score: 53.87798809200481


In [151]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import RepeatedKFold, GridSearchCV
from numpy import absolute, arange

# Lasso Approach.
# Note I am using lasso over ridge because I have a LOT of features and I actually want many of them to be zeroed out.
# In addition to this, we will be running Cross Validation because we do not have a lot of training data to offer this model.
# lasso = Lasso()
# cv = RepeatedKFold(n_splits=10)
# grid = {'alpha': arange(0, 1, 0.01)}
# search = GridSearchCV(lasso, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=3)
# results = search.fit(X, y)

In [152]:
# print(f'Best score: {absolute(results.best_score_) * 100}')
# print(f'Params: {results.best_params_}')

### Things to do

1. Look more into the results we got from PCA and see if we can cut a bunch of features. If we can, then linear regression or ridge might be useful.

2. Look into GridSearchCV vs. LassoCV
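As a starting point for the second item, here is a hedged side-by-side sketch on synthetic data (the data and the alpha grid are assumptions). `GridSearchCV` works with any estimator and any scoring function, while `LassoCV` is specific to Lasso and selects alpha by mean squared error across the folds.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(1)
toy_X = rng.normal(size=(150, 20))
toy_y = 2.0 * toy_X[:, 0] - toy_X[:, 1] + rng.normal(scale=0.5, size=150)

alphas = [0.01, 0.05, 0.1, 0.5]
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# GridSearchCV: generic search, here scored by mean absolute error.
search = GridSearchCV(Lasso(), {'alpha': alphas},
                      scoring='neg_mean_absolute_error', cv=folds)
search.fit(toy_X, toy_y)

# LassoCV: Lasso-specific search over the same alphas and folds.
toy_lasso_cv = LassoCV(alphas=alphas, cv=folds).fit(toy_X, toy_y)

print(search.best_params_['alpha'], toy_lasso_cv.alpha_)
```

Because the two use different selection criteria, they may or may not agree on the best alpha for a given data set.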

## Model 3

Let's take a moment to train a Neural Network on our data and see what our results look like. Neural Networks are (in general) better at fitting non-linear relationships than linear models, and they can also cope with a large number of input features.

This will be a good test, because we can run the NN with a ReLU activation function and see whether we are overfitting because we do not have enough data or because there are too many features.

If we are overfitting because of too little data, then in theory our NN should also have issues getting a good overall score on our data.

If we are not overfitting, then the number of training records we are feeding our network should be fine.

### Network Structure

First let's create the structure for our Neural Network.

Since this data does not seem all that complex (to my knowledge), we will start with a simple structure: one hidden layer.

Since we are making a single prediction, we will use one node in our output layer.

We will use m nodes in our hidden layer, where m = (number of rows of data we have in our data frame).


In [153]:
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from keras.metrics import MeanAbsolutePercentageError
from sklearn.model_selection import cross_val_score, KFold


# Function for computing accuracy of our model.
def soft_acc(y_true, y_pred):
    return K.mean(K.equal(K.round(y_true), K.round(y_pred)))


# Function that creates and returns our NN structure.
def build_model():
    # Train, test approach. (This should work worse than with cross validation, but I want to test it).
    # We will make the input layer node count = number of features we have.
    # We will give the hidden layer m nodes, where m is the number of rows in our dataframe.
    # We will only have one output, since we are doing regression.
    # Caution: a 'relu' output cannot go below 0, while the standardized Rating target has negative values;
    # 'linear' is the usual output activation for regression.
    model = Sequential()
    model.add(Dense(X.shape[0], input_dim=X.shape[1], activation='relu'))
    model.add(Dense(1, activation='relu'))

    # We will use mean squared error as our loss function, since this is a regression problem.
    # We will use the "adam" optimizer, a popular adaptive variant of standard gradient descent.
    model.compile(loss='mean_squared_error', optimizer='adam', metrics=[soft_acc])
    return model
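To see what the `soft_acc` metric measures, here is a NumPy counterpart run on made-up numbers (the function name `soft_acc_np` and the sample values are assumptions for illustration). It counts a prediction as correct when it rounds to the same integer as the target. The last lines show the effect of a ReLU output, which clips negative values to zero.

```python
import numpy as np

def soft_acc_np(y_true, y_pred):
    """Share of predictions that round to the same integer as the target."""
    return float(np.mean(np.round(y_true) == np.round(y_pred)))

y_true = np.array([1.2, -0.4, 0.7, 2.1])
y_pred = np.array([0.8, 0.3, 1.4, 0.9])
print(soft_acc_np(y_true, y_pred))

# A ReLU output can never be negative, so a target that rounds to -1 can never be matched.
relu_out = np.maximum(0.0, np.array([-1.3, 0.4]))
print(relu_out)
```

Since the target here is the standardized rating, the metric is fairly coarse: it only checks agreement to the nearest whole standard deviation.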

In [154]:
model = build_model()

### This is what our model looks like

In [None]:
model.summary()

In [156]:
# Train model on the training data...
model.fit(X_train, y_train, epochs=100, batch_size=20)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x1594641f400>

In [168]:
# Let's see our accuracy on the training data...
# evaluate() returns [loss, soft_acc], so unpack both values.
loss, accuracy = model.evaluate(X_train, y_train)
print(f'Loss: {loss}')
print(f'Accuracy Percentage: {accuracy}')

Loss: 1.0147128105163574
Accuracy Percentage: 0.3504464328289032


In [169]:
# Let's see how well we do on testing data...
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Accuracy Percentage: {accuracy}')

Accuracy Percentage: 0.3604166805744171


### Results

The outputs above show a soft accuracy of roughly 35 percent on the training data and 36 percent on the testing data. The two scores are close to each other, so there is no sign of a large train/test gap here.

Keep in mind that soft accuracy (the share of predictions that round to the same value as the target) is a different metric from the R² score reported for Linear Regression and Lasso, so these numbers cannot be compared directly. One practical benefit remains: we were able to train the network without worrying about feature reduction first.

## Model 4

We are going to use the exact same Neural Network, but this time instead of using a single 75/25 split, we are going to use cross validation. This should give us a more reliable estimate of our NN's performance.
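Repeated k-fold cross validation multiplies the training cost, which matters for a network this size. The sketch below uses a cheap Ridge model on synthetic data as a stand-in (both are assumptions for illustration) to show how the number of fits follows from `n_splits` and `n_repeats`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 5))
toy_y = toy_X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.3, size=100)

# 5 folds repeated 3 times means 15 separate fits; 5 folds x 100 repeats would mean 500.
toy_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(Ridge(alpha=1.0), toy_X, toy_y,
                         cv=toy_cv, scoring='neg_mean_squared_error')

print(len(scores))
print(f'Mean MSE: {-scores.mean():.4f}')
```

Starting with a small `n_repeats` keeps the run time manageable, and the repeat count can be raised once a single pass finishes in reasonable time.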

In [159]:
estimator = KerasRegressor(build_fn=build_model, epochs=150, batch_size=25, verbose=0)
kfold = RepeatedKFold(n_splits=5, n_repeats=100)
results = cross_val_score(estimator, X, y, cv=kfold, n_jobs=3)

KeyboardInterrupt: 

In [None]:
print(f'MSE: {results.mean()}')