# Strategies to prevent overfitting in neural networks

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

from tensorflow.keras import regularizers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, Dense, GlobalMaxPooling1D, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping

# Set the random seeds for reproducibility
try:
    tf.set_random_seed(1337)      # TensorFlow 1.x
except AttributeError:
    tf.random.set_seed(1337)      # newer versions of TensorFlow use tf.random.set_seed
np.random.seed(1337)

## Introduction

**Business Context.** You are a data scientist working for a machine learning consultancy. One of your clients wants to automatically classify text reviews by the rating (on a 1 - 5 scale) that the reviewer would likely give. However, the client has not generated enough data of its own to do this, so you need to build your model on a rich external dataset and then transfer it to the client's data.

**Business Problem.** Your task is to **build a neural network-based model for classifying text reviews into likely ratings (on a 1 - 5 scale)**.

**Analytical Context.** We'll use the Amazon review dataset again and try to classify reviews into star ratings automatically. Instead of just positive and negative, we'll take on the harder challenge of predicting the *exact* star rating. The lowest score is 1 and the highest is 5.

Instead of trying to optimize by pre-processing the text, we'll do very basic tokenization and experiment with different neural network models, architectures, and hyperparameters to optimize the results. You'll start by building a simple dense neural network and try to get it to perform better using various techniques. Then you'll evaluate the results and diagnose where it tends to perform more poorly.

## Setting up and preparing the data

We'll mainly be using the `keras` module from TensorFlow, but we'll also use `pandas` to read the CSV file and `sklearn` for some helper functions. We'll be using only the "Text" and "Score" columns in the `Reviews.csv` file:

In [2]:
amazon_reviews = pd.read_csv('Reviews.csv', nrows=262084)
amazon_reviews.head(5)

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


### Exercise 1:


Combine the first 1,000 of each of the 1-, 2-, 3-, 4-, and 5-star reviews in `amazon_reviews` into a single DataFrame (so you should have 5,000 observations in total). Split this DataFrame into training and test sets, with 80% of the data for the training set.

**Hint:** `keras` will expect your labels to start at 0, not 1, so make sure to adjust the labels accordingly.

**Answer.**

In [3]:
# Draw 1,000 reviews at random for each star rating and stack them
appended_data = []
for i in range(1, 6):
    temp = amazon_reviews[amazon_reviews.Score == i].sample(1000)
    appended_data.append(temp)

df = pd.concat(appended_data, ignore_index=True)

In [4]:
# Shift the labels from 1-5 to 0-4, as Keras expects
df['Score'] = df['Score'] - 1

df.sort_values(by='Score', ascending=True).reset_index().head(3)

| | index | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 193397 | B001PLIGB8 | A3PZVEJ94ZG43 | Mother005 | 6 | 6 | 0 | 1287100800 | Broken Dreams | I am not impressed with MYOFFICEPRODUCTS.COM. ... |
| 1 | 658 | 202599 | B004XXXK5W | A1QDQD4HJKCFI9 | Geoffrey J Graham | 13 | 16 | 0 | 1313712000 | It's only 75% Juice...it has added sugar and s... | `<span class="tiny"> Length:: 0:51 Mins<br /><b...` |
| 2 | 659 | 172718 | B0002DHOWW | A5NQFXER5QYMD | jay sellers "jay bird" | 0 | 0 | 0 | 1336176000 | cats hate it | I couldn't get my cats to eat this if it were ... |


In [5]:
# Partition

train, test = train_test_split(df,
                               test_size=0.2,
                               random_state=42,
                               stratify = df['Score']
                              )

-------

## Tokenizing our texts

Keras comes with its own functions to preprocess text, including a [tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) (a mapping from each word in our corpus to a unique integer). Unlike the `CountVectorizer` from `sklearn`, which produces sparse matrices, `keras` often expects to work with sequences representing only the words that occur in a text. To prepare text before feeding it into a neural network, we usually:

1. Create a [tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer).
2. [Create sequences](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#texts_to_sequences) from our text (each text becomes a list of integers, based on the tokenizer mapping, instead of words).
3. [Pad or truncate](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) each sequence to a fixed length (very short texts get `0`s added to them, while very long ones are truncated).

The tokenizer has a configurable word cap, so it will only consider the $n$ most common words in the corpus, ignoring very rare words.
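To make the three steps concrete, here is a minimal pure-Python sketch of what they do on a toy corpus. The helper names (`toy_fit_word_index`, `toy_texts_to_sequences`, `toy_pad_sequences`) and the corpus are hypothetical and written only for illustration; the real Keras `Tokenizer` also strips punctuation and offers more options. The sketch assumes the Keras defaults of keeping only word indices below `num_words` and of padding and truncating at the front of each sequence.

```python
from collections import Counter

def toy_fit_word_index(texts):
    """Rank words by frequency: 1 = most frequent. Index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def toy_texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, skipping unknown words and indices >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def toy_pad_sequences(sequences, maxlen):
    """Keep the last `maxlen` tokens and left-pad shorter sequences with zeros."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]                              # truncate from the front
        padded.append([0] * (maxlen - len(seq)) + seq)   # pad at the front
    return padded

corpus = ["good coffee good price", "bad coffee", "good taste"]  # toy data
word_index = toy_fit_word_index(corpus)
sequences = toy_texts_to_sequences(corpus, word_index, num_words=4)
padded = toy_pad_sequences(sequences, maxlen=3)

print(word_index)
print(sequences)
print(padded)
```

With `num_words=4`, only the three most frequent words survive, so the rare words "bad" and "taste" are silently dropped from their reviews.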

### Exercise 2:

In this exercise, you will learn how to use the `tf.keras.preprocessing.text.Tokenizer` tool to carry out the preprocessing steps described above.



#### 2.1

Perform some exploratory analysis of the dataset to calculate the number of unique words in our corpus and the distribution of the number of words in each review of the training set. What is the 80th percentile of this distribution?

**Answer.**

In [6]:
# Count the whitespace-separated words in a single review

def count_words(text):
    return len(text.split())

In [7]:
train['Text'].apply(count_words),  test['Text'].apply(count_words)

(1174    140
 4786     23
 4915     93
 3234    145
 852      99
        ... 
 744     100
 165     122
 1964     66
 3479     44
 6        60
 Name: Text, Length: 4000, dtype: int64,
 1553     89
 3503     23
 413      24
 2881     51
 2618    234
        ... 
 1615     84
 561      26
 612      81
 2714    192
 3914     23
 Name: Text, Length: 1000, dtype: int64)

In [9]:
print('The 80th percentile in the train dataset is:', train['Text'].apply(count_words).quantile(.8))

The 80th percentile in the train dataset is: 129.0


Now we are going to explore the most frequent words for each score in the dataset:

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

def common_words_ngrams(corpus, n=5, k=1):
    # Fit a CountVectorizer on the corpus using k-grams only,
    # with English stop words removed
    vec = CountVectorizer(ngram_range=(k, k),
                          stop_words='english').fit(corpus)

    # Build the bag-of-words matrix and sum the counts over all documents
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)

    # Pair each n-gram with its total number of occurrences
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]

    # Sort from most to least frequent
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

    # Return the n most frequent n-grams
    return words_freq[:n]

In [11]:
common_words_ngrams(df[df['Score']==0]['Text'])

[('br', 1360), ('like', 485), ('product', 463), ('just', 354), ('taste', 354)]

In [12]:
common_words_ngrams(df[df['Score']==1]['Text'])

[('br', 1368), ('like', 645), ('taste', 485), ('coffee', 376), ('just', 367)]

In [13]:
common_words_ngrams(df[df['Score']==2]['Text'])

[('br', 1498), ('like', 654), ('coffee', 472), ('good', 468), ('taste', 466)]

In [14]:
common_words_ngrams(df[df['Score']==3]['Text'])

[('br', 1357), ('like', 564), ('coffee', 541), ('good', 538), ('flavor', 366)]

In [15]:
common_words_ngrams(df[df['Score']==4]['Text'])

[('br', 1171), ('like', 380), ('good', 373), ('great', 359), ('just', 294)]

-------

#### 2.2

Given the results above, we create a tokenizer using only the top 20,000 most frequent words in our corpus (which corresponds to roughly 80% of the words): 

In [16]:
tokenizer = Tokenizer(num_words=20000) #We create the tokenizer using only top 20000 words

In [17]:
tokenizer.fit_on_texts(train['Text'])  #Then, we create the text->indices mapping. 

The above line has given several features and methods to our tokenizer. For instance, print the line `tokenizer.word_index` in a new cell - what do you see? Apply the `tokenizer.texts_to_sequences()` method on the list `['I just feel very very good']`. Apply the `tokenizer.sequences_to_texts()` method on the list `[[109, 19, 824, 76, 114, 6315, 1137, 8070]]`. What were your results?

**Answer.**

In [18]:
tokenizer.texts_to_sequences(['I just feel very very good'])

[[2, 35, 271, 39, 39, 30]]

In [19]:
tokenizer.sequences_to_texts([[109, 19, 824, 76, 114, 6315, 1137, 8070]])

['box have fair your drink whites prices unsuitable']

-------

#### 2.3

Use the tokenizer to transform the texts in our test and train data to sequences. Then, use the `pad_sequences` function to pad/truncate these sequences to length 116 (roughly the 80th percentile of text lengths; the exact percentile depends on which reviews were sampled). Save the resulting arrays as `train_sequences` and `test_sequences`.

**Answer.**

In [20]:
# Train dataset
train_sequences = tokenizer.texts_to_sequences(train['Text'])
train_sequences = pad_sequences(train_sequences, maxlen=116)

# Test dataset
test_sequences = tokenizer.texts_to_sequences(test['Text'])
test_sequences = pad_sequences(test_sequences, maxlen=116)


In [21]:
labels = train['Score']
labels = labels.astype('int32')

-------

## Building a basic neural network model 

Now that we have preprocessed the text, let's create a basic neural network to train on our data. We'll use an embedding layer, which maps each word index in our sequences to a dense 128-dimensional vector (equivalent to [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) each word and multiplying by a learned weight matrix), followed by two fully connected ("dense") layers, a global max-pooling layer that collapses the sequence dimension, and an output layer with 5 neurons to represent the 5 possible star ratings.
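The relationship between an embedding lookup and one-hot encoding can be checked with a small NumPy sketch. The sizes and the random matrix below are toy assumptions chosen for illustration (the notebook itself uses a vocabulary of 20,000 and 128 dimensions); in a trained network the matrix would hold learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 3            # toy sizes for illustration
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

sequence = np.array([0, 0, 4, 2])           # a padded sequence of word indices

# Embedding lookup: pick one row of the matrix per token.
looked_up = embedding_matrix[sequence]      # shape (4, 3)

# Equivalent but wasteful: one-hot encode, then multiply by the same matrix.
one_hot = np.eye(vocab_size)[sequence]      # shape (4, 6)
via_one_hot = one_hot @ embedding_matrix    # shape (4, 3)

print(np.allclose(looked_up, via_one_hot))
```

The lookup avoids ever materializing the one-hot vectors, which matters when the vocabulary has tens of thousands of entries.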

Before we train a `keras` model, there is an additional `compile` step where we define what loss function and optimizer to use, and what metrics to output. Then we can train the model using the `fit` function. All of this is shown below.

Note the `validation_split=0.2` argument, which tells Keras to train on only 80% of the training data and evaluate the model on the remaining 20% (the last 20% of the rows passed to `fit`), which we call the validation set. You can see the accuracy and loss for both the training and validation set in the output for each epoch:

In [22]:
model = Sequential()
model.add(Embedding(20000, 128, input_length=116))
model.add(Dense(128, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(5, activation='sigmoid'))  # note: 'softmax' is the usual output activation for multi-class problems
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
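The models in this notebook use a `sigmoid` activation on the output layer. For single-label, multi-class problems, `softmax` is the more common choice because it turns the five outputs into a probability distribution. The NumPy sketch below illustrates the difference and shows how a sparse categorical cross-entropy is computed from integer labels. The logits and labels are made-up values, and the helper functions are illustrative rather than the Keras implementations.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))

def sparse_categorical_crossentropy(probs, labels):
    # Mean negative log-probability assigned to the true class of each example.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Hypothetical raw outputs for two reviews and their true classes (0-4).
logits = np.array([[2.0, 1.0, 0.1, -1.0, 0.5],
                   [0.0, 0.0, 3.0, 0.0, 0.0]])
labels = np.array([0, 2])

softmax_probs = softmax(logits)
sigmoid_scores = sigmoid(logits)

print(softmax_probs.sum(axis=1))     # each row sums to 1
print(sigmoid_scores.sum(axis=1))    # rows do not sum to 1
print(sparse_categorical_crossentropy(softmax_probs, labels))
```

Because the labels are plain integers rather than one-hot vectors, the loss only needs to index the predicted probability of the true class, which is why the labels must run from 0 to 4.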

In [23]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 116, 128)          2560000   
_________________________________________________________________
dense (Dense)                (None, 116, 128)          16512     
_________________________________________________________________
dense_1 (Dense)              (None, 116, 128)          16512     
_________________________________________________________________
global_max_pooling1d (Global (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 645       
Total params: 2,593,669
Trainable params: 2,593,669
Non-trainable params: 0
_________________________________________________________________


In [24]:
model.fit(train_sequences, labels, validation_split=0.2, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x2be5da768c8>

### Exercise 3:

How well does this model perform? How does this compare to a baseline expectation? What do you notice about the accuracy and loss values for both the validation and training sets over time and what does this mean?

**Answer.**

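The model overfits. Over the epochs, training accuracy keeps climbing until the network has essentially memorized the training data, while validation accuracy barely improves. Likewise, the training loss keeps decreasing while the validation loss does not improve. As a point of reference, with five balanced classes a model that guesses at random would be right about 20% of the time. The growing gap between the training and validation metrics means the network is fitting patterns specific to the training reviews that do not generalize, so we will try different strategies to address this in the following sections.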
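The baseline expectation asked about in this exercise can be made concrete with a short NumPy sketch. It assumes the balanced training set built in Exercise 1 (4,000 reviews, 800 per class after the stratified split) and compares two naive strategies; the label array is synthetic and only mirrors those class counts.

```python
import numpy as np

num_classes = 5
# Synthetic labels mirroring the balanced training set: 800 reviews per class.
labels = np.repeat(np.arange(num_classes), 800)

# Baseline 1: always predict the most frequent class.
majority_class = np.bincount(labels).argmax()
majority_accuracy = np.mean(labels == majority_class)

# Baseline 2: guess a class uniformly at random.
rng = np.random.default_rng(0)
random_guesses = rng.integers(0, num_classes, size=labels.size)
random_accuracy = np.mean(random_guesses == labels)

print(majority_accuracy)
print(random_accuracy)
```

Any validation accuracy should be judged against this roughly 20% floor rather than against 0%.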

-------

## Experimenting with different regularization strategies

There are many different ways to mitigate overfitting in a neural network, collectively known as *regularization* techniques. One common regularization technique is called [Dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout). In this regularization method, a set of neurons is randomly selected at each training step to be completely ignored. This is done so that the neurons in our network do not rely strongly on their neighboring neurons and we avoid the creation of ["co-adaptations"](http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf) that do not generalize well to unseen data. This makes the model more robust and less prone to overfitting.
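As a rough illustration of the mechanism, the following NumPy sketch implements "inverted" dropout: units are zeroed at random during training and the survivors are scaled up so that the expected activation stays the same. The function `toy_dropout` is a hypothetical helper written for this example, not the Keras layer itself.

```python
import numpy as np

def toy_dropout(activations, rate, rng, training=True):
    """Zero out a random fraction `rate` of the units and rescale the rest."""
    if not training or rate == 0.0:
        return activations                       # dropout is inactive at inference time
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 128))               # toy layer output

dropped = toy_dropout(activations, rate=0.2, rng=rng)

print((dropped == 0).mean())   # close to 0.2
print(dropped.mean())          # close to 1.0, thanks to the rescaling
```

Because a different random mask is drawn at every training step, no unit can count on any particular neighbor being present.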

You can add dropout in `keras` by adding a layer named `Dropout(p)`, where `p` is the fraction of the previous layer's outputs to drop. For example, the following model implements dropout by zeroing out roughly 20% of the outputs of the embedding layer at each training step:

In [25]:
model2 = Sequential()
model2.add(Embedding(20000, 128, input_length=116))
model2.add(Dropout(0.2)) # --------------------------->Dropout layer will affect the output of previous layer.
model2.add(Dense(128, activation='relu')) 
model2.add(Dense(128, activation='relu'))
model2.add(GlobalMaxPooling1D())
model2.add(Dense(5, activation='sigmoid'))
model2.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.fit(train_sequences, labels, validation_split=0.2, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x2be5dae9a08>

### Exercise 4:

Modify the neural network definition above to try and fix the overfitting problem using Dropout. Explain the configuration that you tried and your results. Why do you think your modifications were or were not able to mitigate the overfitting problem?

**Answer.**

In [26]:
model_ecx4 = Sequential()
model_ecx4.add(Embedding(20000, 128, input_length=116))
model_ecx4.add(Dense(128, activation='relu'))
model_ecx4.add(Dropout(0.5))
model_ecx4.add(Dense(128, activation='relu'))
model_ecx4.add(Dropout(0.5))
model_ecx4.add(GlobalMaxPooling1D())
model_ecx4.add(Dense(5, activation='sigmoid'))
model_ecx4.compile(loss='sparse_categorical_crossentropy', optimizer='adam',  metrics=['accuracy'])
model_ecx4.fit(train_sequences, labels, validation_split=0.2, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x2be5edc8108>

-------

### Exercise 5:

Keras allows you to add [L1](https://www.tensorflow.org/api_docs/python/tf/keras/regularizers/l1), [L2](https://www.tensorflow.org/api_docs/python/tf/keras/regularizers/l2), or [L1 and L2](https://www.tensorflow.org/api_docs/python/tf/keras/regularizers/l1_l2) combined regularizers on individual layers by passing in the `kernel_regularizer`, `bias_regularizer` or `activity_regularizer` arguments. In neural networks, these regularizers work by adding a penalty to the loss function based on the size of the weights: L1 penalizes the sum of their absolute values (which tends to push many weights to exactly zero), while L2 penalizes the sum of their squares (which shrinks large weights the most).
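The two penalties can be written out in a few lines of NumPy. This is a sketch with a made-up weight matrix and a made-up data loss, meant only to show how the penalty terms are added to the loss being minimized; `l1_penalty` and `l2_penalty` are illustrative helpers.

```python
import numpy as np

def l1_penalty(weights, factor):
    """L1 penalty: factor times the sum of absolute weight values."""
    return factor * np.sum(np.abs(weights))

def l2_penalty(weights, factor):
    """L2 penalty: factor times the sum of squared weight values."""
    return factor * np.sum(np.square(weights))

weights = np.array([[0.5, -2.0],
                    [0.0,  1.0]])    # toy kernel of a dense layer
data_loss = 1.25                     # hypothetical cross-entropy on a batch

l1 = l1_penalty(weights, factor=0.001)
l2 = l2_penalty(weights, factor=0.001)
total_loss = data_loss + l1 + l2

print(l1, l2, total_loss)
```

Note how the single large weight (-2.0) dominates the L2 term but contributes only proportionally to the L1 term.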

Try 4-5 different configurations of L1, L2, and combined L1 and L2 regularization on different layers. For each configuration, explain why you tried it and what the results were. Why do you think your modifications were or were not able to mitigate the overfitting problem?

**Answer.**

In [27]:
model_ecx5_1 = Sequential()
model_ecx5_1.add(Embedding(20000, 128, input_length=116))
model_ecx5_1.add(Dense(128, activation='relu', kernel_regularizer=regularizers.l1(0.001)))
model_ecx5_1.add(Dense(128, activation='relu', kernel_regularizer=regularizers.l1(0.001)))
model_ecx5_1.add(GlobalMaxPooling1D())
model_ecx5_1.add(Dense(5, activation='sigmoid'))
model_ecx5_1.compile(loss='sparse_categorical_crossentropy', optimizer='adam',  metrics=['accuracy'])
model_ecx5_1.fit(train_sequences, labels, validation_split=0.2, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x2be5f43ba48>

In [28]:
model_ecx5_2 = Sequential()
model_ecx5_2.add(Embedding(20000, 128, input_length=116))
model_ecx5_2.add(Dense(128, activation='relu', kernel_regularizer=regularizers.l1(0.0001)))
model_ecx5_2.add(Dense(128, activation='relu', kernel_regularizer=regularizers.l1(0.0001)))
model_ecx5_2.add(GlobalMaxPooling1D())
model_ecx5_2.add(Dense(5, activation='sigmoid'))
model_ecx5_2.compile(loss='sparse_categorical_crossentropy', optimizer='adam',  metrics=['accuracy'])
model_ecx5_2.fit(train_sequences, labels, validation_split=0.2, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x2be62da2188>

In [29]:
model_ecx5_3 = Sequential()
model_ecx5_3.add(Embedding(20000, 128, input_length=116))
model_ecx5_3.add(Dense(128, activation='relu', kernel_regularizer=regularizers.l2(0.001)))
model_ecx5_3.add(Dense(128, activation='relu', kernel_regularizer=regularizers.l2(0.001)))
model_ecx5_3.add(GlobalMaxPooling1D())
model_ecx5_3.add(Dense(5, activation='sigmoid'))
model_ecx5_3.compile(loss='sparse_categorical_crossentropy', optimizer='adam',  metrics=['accuracy'])
model_ecx5_3.fit(train_sequences, labels, validation_split=0.2, epochs=10)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x2be6320ed08>

In [31]:
model_ecx5_4 = Sequential()
model_ecx5_4.add(Embedding(20000, 128, input_length=116))
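# Pass both factors explicitly: a single positional argument only sets the L1 factor
model_ecx5_4.add(Dense(128, activation='relu', kernel_regularizer=regularizers.l1_l2(l1=0.00001, l2=0.00001)))
model_ecx5_4.add(Dense(128, activation='relu', kernel_regularizer=regularizers.l1_l2(l1=0.00001, l2=0.00001)))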
model_ecx5_4.add(GlobalMaxPooling1D())
model_ecx5_4.add(Dense(5, activation='sigmoid'))
model_ecx5_4.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model_ecx5_4.fit(train_sequences, labels, validation_split=0.2, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x2be00b9e088>

-------

## Regularization through adding more data

Depending on the configurations you tried above, you probably saw that L1 and L2 regularization are pretty limited for this model and this amount of data. A more straightforward way to prevent overfitting is simply by adding more training data. If the network has more (and more varied) examples to learn from, perhaps it will learn more generalizable rules.

### Exercise 6:

How would you test the hypothesis that adding more data would result in a more generalizable model? Explain any change in results you see from further experimentation.

**Hint:** Try adding 6000 reviews for each score instead. Compare with the original proposed model.

**Answer.**

In [32]:
# Rebuild the dataset with 6,000 randomly sampled reviews per star rating
appended_data = []
for i in range(1, 6):
    temp = amazon_reviews[amazon_reviews.Score == i].sample(6000)
    appended_data.append(temp)

df = pd.concat(appended_data, ignore_index=True)

# Shift the labels from 1-5 to 0-4, as Keras expects
df['Score'] = df['Score'] - 1

df.sort_values(by='Score', ascending=True).reset_index()

| | index | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 121435 | B000V6DW5S | A1SC0PCDCLY8R4 | pen name "ok?" | 0 | 1 | 0 | 1330646400 | Gross | I love Italian wedding and Campbell's own regu... |
| 1 | 4005 | 252584 | B001EQ5ERI | A30DO3OIRLDC8B | Law | 0 | 3 | 0 | 1336953600 | Don't bother | I was seriously disappointed with this product... |
| 2 | 4004 | 196961 | B000EMQFY4 | A2J57VGDETZKF6 | 4 Kids 2 Exhausted | 1 | 1 | 0 | 1327449600 | Terribly false advertising | I had my hopes for this bar. The box reads "Fr... |
| 3 | 4003 | 49909 | B00430B73W | A1HCIYQF7NYKE | K. Swanson | 2 | 7 | 0 | 1154995200 | Just plain awful | I can at least tolerate most foods, but these ... |
| 4 | 4002 | 12459 | B0079YD36I | A92JJZ71TKRSJ | Leeza | 0 | 3 | 0 | 1340582400 | Coffee Lovers Beware | I purchased this coffee on sale at my local Vo... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29995 | 25994 | 198561 | B000FVBYCW | A2BS1XZLSI5FTK | Lulu | 0 | 0 | 4 | 1350691200 | Great Tea, Great Price | Tea comes in bags inside a large foil lined po... |
| 29996 | 25993 | 156887 | B000BZ1OUO | A2VJKSJQO9IKNW | D. Newray "Dazeedave" | 1 | 1 | 4 | 1231891200 | This is truly the Best Giardiniera | If you can't take spicey, then don't buy the H... |
| 29997 | 25992 | 59609 | B000W5SLB8 | A1ILH94WP2KTA0 | Stephen J. Duffey | 1 | 2 | 4 | 1308441600 | Wonderful dog food | I used to buy my dog (5 year old Corgi) the re... |
| 29998 | 26001 | 103538 | B002Z08RIA | A39PGI6IGM5Y2A | Lee | 0 | 0 | 4 | 1344211200 | Excellent coconut juice | I'm an avid coconut water drinker - I've tried... |


In [34]:
# Partition
train, test = train_test_split(df,
                               test_size=0.2,
                               random_state=42,
                               stratify = df['Score']
                              )

tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(train['Text'])

# Train dataset
train_sequences = tokenizer.texts_to_sequences(train['Text'])
train_sequences = pad_sequences(train_sequences, maxlen=116)

# Test dataset
test_sequences = tokenizer.texts_to_sequences(test['Text'])
test_sequences = pad_sequences(test_sequences, maxlen=116)

labels = train['Score']
labels = labels.astype('int32')

In [36]:
model_ecx6 = Sequential()
model_ecx6.add(Embedding(20000, 128, input_length=116))
model_ecx6.add(Dense(128, activation='relu'))
model_ecx6.add(Dense(128, activation='relu'))
model_ecx6.add(GlobalMaxPooling1D())
model_ecx6.add(Dense(5, activation='sigmoid'))
model_ecx6.compile(loss='sparse_categorical_crossentropy', optimizer='adam',  metrics=['accuracy'])
model_ecx6.fit(train_sequences, labels, validation_split=0.2, epochs=10)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x2be631f2888>

-------

## Regularization through early stopping

We have consistently seen that our neural network overfits at around the third epoch. Hence, another form of regularization is to end training early if validation loss starts increasing. (This is similar to the validation curves we used when constructing classification models.) Although the network will not have found an optimal function in the training data, the looser function that it has found will likely be more generalizable.

You can do this manually by inspecting the data as we have done above and modifying the `epochs` argument in `fit()`, but Keras also allows you to easily do this automatically via an [`EarlyStopping` callback](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping).
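The stopping rule itself is simple enough to sketch in plain Python. The function below is a simplified illustration of the idea (stop once the validation loss has failed to improve for `patience` consecutive epochs); it is not the Keras implementation, whose exact counting rule and extra options may differ. The validation losses are hypothetical numbers, not results from this notebook.

```python
def toy_early_stopping_epoch(val_losses, patience=1):
    """Return the 1-based epoch after which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch            # stop: no improvement for `patience` epochs
    return len(val_losses)              # never triggered: train for all epochs

# Hypothetical validation losses: improving for two epochs, then getting worse.
val_losses = [1.40, 1.25, 1.31, 1.45, 1.60]

print(toy_early_stopping_epoch(val_losses, patience=1))
print(toy_early_stopping_epoch(val_losses, patience=2))
```

A larger `patience` tolerates noisy validation curves at the cost of a few extra epochs of training.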

### Exercise 7:

Experiment with the `EarlyStopping` callback and explain the results.

**Answer.**

In [38]:
model_ecx7 = Sequential()
model_ecx7.add(Embedding(20000, 128, input_length=116))
model_ecx7.add(Dense(128, activation='relu'))
model_ecx7.add(Dense(128, activation='relu'))
model_ecx7.add(GlobalMaxPooling1D())
model_ecx7.add(Dense(5, activation='sigmoid'))
model_ecx7.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model_ecx7.fit(train_sequences, labels, validation_split=0.2, epochs=10,
               callbacks=[EarlyStopping(monitor='val_loss', mode='min')])


Epoch 1/10
Epoch 2/10
Epoch 3/10


<tensorflow.python.keras.callbacks.History at 0x2be03392648>

-------

## Evaluating our model

Unlike in most previous cases, we used *three* splits of our data instead of two. All of our model tuning has been done on the validation set, and we have not even touched the test set that we split off right at the start.

For experiments, it's very important that your model is run only **once** on your test set. Because there is so much randomness at play, it's vital not to "cherry-pick" the best results. Optimize as much as you want on the validation set, but keep the test set until the end; all official results should be based on that single run on the test set (or on whatever configuration was decided *before the experiment started*).
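The three-way split can be pictured with a small NumPy sketch over row indices. It assumes the proportions used in the first part of this notebook (5,000 reviews, 20% held out as a test set, then 20% of the remainder used for validation). The notebook itself obtains these splits with `train_test_split` and `validation_split`, so this is only an illustration of the bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 5000
indices = rng.permutation(n_samples)         # shuffle the row indices once

n_test = n_samples // 5                      # 20% held out until the very end
n_val = (n_samples - n_test) // 5            # 20% of the rest, used for tuning

test_idx = indices[:n_test]
val_idx = indices[n_test:n_test + n_val]
train_idx = indices[n_test + n_val:]         # what the weights are actually fit on

print(len(train_idx), len(val_idx), len(test_idx))
```

Every tuning decision should look only at `val_idx`; `test_idx` is consulted once, after the final configuration is fixed.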

### Exercise 8:

Let's take the model configuration that resulted in the highest validation accuracy and use that one as our final model. Evaluate this configuration on how well it performs on the test set, and furthermore diagnose *what kinds of mistakes it makes*. Explain whether these mistakes are expected or not, and print some of these poorly classified reviews. Given the mistakes the model made, how would you then go back and try to improve the model or optimize the tuning steps?

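**Hint:** You can use the [`predict_classes`](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#predict_classes) method on your model to get the most probable class directly. This method is deprecated in newer versions of TensorFlow, where `np.argmax(model.predict(x), axis=-1)` is the recommended replacement.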

**Answer.**

In [39]:
np.argmax(model_ecx7.predict(test_sequences), axis=-1)  # most probable class per review

array([3, 2, 1, ..., 4, 2, 3], dtype=int64)

In [43]:
# Predict the most probable class for each test review and compute the accuracy
pred = np.argmax(model_ecx7.predict(test_sequences), axis=-1)
print(accuracy_score(test['Score'], pred))


0.478


In [44]:
confusion_matrix(test['Score'], pred)

array([[672, 262, 136,  53,  77],
       [257, 429, 329, 110,  75],
       [ 85, 225, 487, 275, 128],
       [ 57,  75, 255, 443, 370],
       [ 41,  37,  80, 205, 837]], dtype=int64)

In [45]:
from sklearn.metrics import classification_report

print(classification_report(test['Score'], pred))   

              precision    recall  f1-score   support

           0       0.60      0.56      0.58      1200
           1       0.42      0.36      0.39      1200
           2       0.38      0.41      0.39      1200
           3       0.41      0.37      0.39      1200
           4       0.56      0.70      0.62      1200

    accuracy                           0.48      6000
   macro avg       0.47      0.48      0.47      6000
weighted avg       0.47      0.48      0.47      6000



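The model reaches 47.8% accuracy on the test set, well above the 20% expected from random guessing over five balanced classes. The classification report shows that the extreme ratings are the easiest to identify (F1 scores of 0.58 and 0.62 for the lowest and highest ratings), while the three middle ratings are much harder (F1 scores of about 0.39 each). Most of the mistakes are expected ones: roughly 70% of the misclassified reviews are assigned to a rating adjacent to the true one, which is unsurprising because the difference between, say, a 3-star and a 4-star review is often subtle even for a human reader.

To improve the model, one alternative would be to try other network architectures that are more robust and better suited to the problem. You could also pre-clean the data, as was done in EC4, and check whether this, combined with different neural network architectures, improves the metrics.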
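One way to quantify what kinds of mistakes the model makes is to measure how far each prediction lands from the true rating. The NumPy sketch below does this using the confusion matrix printed above (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the test-set evaluation above.
cm = np.array([[672, 262, 136,  53,  77],
               [257, 429, 329, 110,  75],
               [ 85, 225, 487, 275, 128],
               [ 57,  75, 255, 443, 370],
               [ 41,  37,  80, 205, 837]])

# distance[i, j] = how many stars apart true class i and predicted class j are.
true_idx, pred_idx = np.indices(cm.shape)
distance = np.abs(true_idx - pred_idx)

total = cm.sum()
accuracy = cm[distance == 0].sum() / total
n_errors = cm[distance > 0].sum()
adjacent_share = cm[distance == 1].sum() / n_errors   # errors that are off by one star
mean_abs_error = (cm * distance).sum() / total        # average distance in stars

print(round(accuracy, 3), round(adjacent_share, 3), round(mean_abs_error, 3))
```

A metric such as the mean absolute error in stars respects the ordering of the ratings, which plain accuracy ignores.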

-------

Hopefully, you have seen from this that there is no one-size-fits-all method when creating model architectures or tuning parameters. Often, copious experimentation is needed, and even then it can be difficult to get significantly better results than a baseline model or even to diagnose what is going wrong under the hood (since neural networks are so "black-box"). In many cases, the quantity and quality of the data itself is far more important than the architecture of the network for getting good results.