# Dropout

<img src="pics/alvinn.png">
(Pomerleau, 1989)

(Remember ALVINN from the MTL paper by Caruana?)

* **Problem**: feature swamping (Sutton et al., 2006)


* **Idea**: corrupt features (nodes in network) (Hinton et al., 2012)

<img src="pics/dropout.png">
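To make "corrupting nodes" concrete, here is a minimal pure-Python sketch of what is usually called inverted dropout. The function name `dropout` and its parameters are made up for illustration and are not the Keras implementation. During training each activation is zeroed with probability `rate` and the survivors are scaled up by `1 / (1 - rate)`; at test time nothing is changed.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Illustrative inverted dropout over a list of activations (hypothetical helper)."""
    if not training or rate == 0.0:
        # Test time: the layer is the identity.
        return list(activations)
    keep = 1.0 - rate
    # Keep each unit with probability `keep` and rescale it; otherwise zero it out.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(113)
h = [0.5, -1.2, 3.0, 0.8, 2.1]
dropped = dropout(h, rate=0.2, rng=rng)   # some entries zeroed, the rest divided by 0.8
unchanged = dropout(h, rate=0.2, training=False)
```

Because of the rescaling, the expected value of each activation is the same with and without the mask, so no extra correction is needed at test time.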

### Correspondences:

dropout ~ model averaging ~ regularization
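
The model-averaging view can be checked by hand for a single linear unit. The sketch below (all names are illustrative assumptions) enumerates every possible dropout mask over four inputs, weights each "thinned" unit by the probability of its mask, and compares the average with one pass through the full unit scaled by the keep probability. For a linear unit the two agree exactly; for a real network with non-linearities the scaling is only an approximation of the ensemble average.

```python
from itertools import product

def linear_unit(x, w, mask):
    """Output of one linear unit when inputs are switched on/off by `mask`."""
    return sum(xi * wi * mi for xi, wi, mi in zip(x, w, mask))

def average_over_thinned_networks(x, w, keep_prob):
    """Probability-weighted average over all 2^n dropout masks."""
    total = 0.0
    for mask in product([0, 1], repeat=len(x)):
        kept = sum(mask)
        prob = keep_prob ** kept * (1 - keep_prob) ** (len(x) - kept)
        total += prob * linear_unit(x, w, mask)
    return total

x = [1.0, 2.0, -1.0, 0.5]
w = [0.3, -0.1, 0.8, 2.0]
keep_prob = 0.8

ensemble = average_over_thinned_networks(x, w, keep_prob)
scaled = keep_prob * linear_unit(x, w, [1] * len(x))
print(abs(ensemble - scaled) < 1e-9)
```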

### Keras

In [1]:
from keras.layers.core import Dropout
Dropout(0.2)

Using Theano backend.


<keras.layers.core.Dropout at 0x10ee9f278>

In [2]:
# some layers directly take the dropout argument
# (Keras 1.x API; as far as we know, Keras 2 removed `dropout` from Embedding)
from keras.models import Sequential
from keras.layers import Embedding
model = Sequential()
model.add(Embedding(10000, 128, input_length=100, dropout=0.2))


#### Sentiment Classification Example - With Dropout

We now take the RNN-based model from [notebook 8](08_Classification_And_Structured_Prediction_with_Keras.ipynb), which achieved an accuracy of 73%. First, we modify the network to get slightly higher performance. Then, we add dropout.

In [20]:
# load data: convert words to indices and pad to max_length; the y's need no one-hot (n-hot) encoding as this is a binary task
import numpy as np
import random
from collections import defaultdict
from sklearn.model_selection import train_test_split

positive_sentences = [l.strip() for l in open("exercise/rt-polaritydata/rt-polarity.pos").readlines()]
negative_sentences = [l.strip() for l in open("exercise/rt-polaritydata/rt-polarity.neg").readlines()]

positive_labels = [1 for sentence in positive_sentences]
negative_labels = [0 for sentence in negative_sentences]

sentences = np.concatenate([positive_sentences,negative_sentences], axis=0)
labels = np.concatenate([positive_labels,negative_labels],axis=0)

## make sure we have a label for every data instance
assert(len(sentences)==len(labels))
data={}
np.random.seed(113) #seed
data['target']= np.random.permutation(labels)
np.random.seed(113) # use same seed!
data['data'] = np.random.permutation(sentences)

X_rest, X_test, y_rest, y_test = train_test_split(data['data'], data['target'], test_size=0.2)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.2)
del X_rest, y_rest

## map them to ids for embedding layer
w2i = defaultdict(lambda: len(w2i))
PAD = w2i["<pad>"] # index 0 is padding
UNK = w2i["<unk>"] # index 1 is for UNK

# convert words to indices, taking care of UNKs
X_train_num = [[w2i[word] for word in sentence.split(" ")] for sentence in X_train]
w2i = defaultdict(lambda: UNK, w2i) # freeze - cute trick!
X_dev_num = [[w2i[word] for word in sentence.split(" ")] for sentence in X_dev]
X_test_num = [[w2i[word] for word in sentence.split(" ")] for sentence in X_test]

max_sentence_length=max([len(s.split(" ")) for s in X_train] 
                        + [len(s.split(" ")) for s in X_dev] 
                        + [len(s.split(" ")) for s in X_test] )

from keras.preprocessing import sequence

# pad X
X_train_pad = sequence.pad_sequences(X_train_num, maxlen=max_sentence_length, value=PAD)
X_dev_pad = sequence.pad_sequences(X_dev_num, maxlen=max_sentence_length, value=PAD)
X_test_pad = sequence.pad_sequences(X_test_num, maxlen=max_sentence_length,value=PAD)
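
The vocabulary trick and the padding step above can be replayed on a toy example. This is a sketch under assumptions: the sentences are invented, and `pad_left` is a hypothetical stand-in that mimics what we understand to be the default behaviour of `pad_sequences` (pad and truncate at the front).

```python
from collections import defaultdict

# While building the vocabulary, every new word gets the next free index.
w2i = defaultdict(lambda: len(w2i))
PAD = w2i["<pad>"]  # index 0
UNK = w2i["<unk>"]  # index 1

train = ["a great movie", "a dull movie"]
train_ids = [[w2i[w] for w in s.split(" ")] for s in train]

# "Freeze": from now on unseen words map to UNK instead of getting a new index.
w2i = defaultdict(lambda: UNK, w2i)
dev_ids = [w2i[w] for w in "a boring movie".split(" ")]

def pad_left(ids, maxlen, value):
    """Hypothetical helper: pad (or truncate) at the front to exactly `maxlen` ids."""
    return [value] * (maxlen - len(ids)) + ids[-maxlen:]

print(train_ids)                  # [[2, 3, 4], [2, 5, 4]]
print(dev_ids)                    # [2, 1, 4]
print(pad_left(dev_ids, 5, PAD))  # [0, 0, 2, 1, 4]
```

One side effect worth knowing: looking up an unseen word in the frozen `defaultdict` still inserts it as a key (with value `UNK`), so `len(w2i)` taken after the dev and test lookups also counts those words.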


In [21]:
print("#train instances: {} #dev: {} #test: {}".format(len(X_train),len(X_dev),len(X_test)))

vocabulary_size = len(w2i)
embeds_size=64

#train instances: 6823 #dev: 1706 #test: 2133


In [56]:
np.random.seed(113) #set seed before any keras import
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding
from keras.layers import Dropout, LSTM

#Model without dropout (improvements from before: use LSTM, 128 h_dim, only 4 epochs, batch_size 50)
model = Sequential()
model.add(Embedding(vocabulary_size, embeds_size, input_length=max_sentence_length))
model.add(LSTM(128)) 
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [55]:
model.fit(X_train_pad, y_train, epochs=4, batch_size=50)
loss, accuracy = model.evaluate(X_dev_pad, y_dev)

print("Accuracy: ", accuracy *100)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


We now have a baseline LSTM which achieves 77% accuracy on our dev set. Does adding dropout help?

In [65]:
np.random.seed(113) #set seed before any keras import
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding
from keras.layers import Dropout, LSTM

#Model WITH dropout
model = Sequential()
model.add(Embedding(vocabulary_size, embeds_size, input_length=max_sentence_length))
model.add(LSTM(128,dropout=0.1))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [None]:
model.fit(X_train_pad, y_train, epochs=4, batch_size=50)
loss, accuracy = model.evaluate(X_dev_pad, y_dev)

print("Accuracy: ", accuracy *100)

### RNN-based models have two kinds of dropout:

* `dropout`
* `recurrent_dropout`

What is the difference?


* `dropout`: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
* `recurrent_dropout`: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state. 

See: http://keras.io/layers/core/#dropout
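
To see where the two masks act, here is a simplified pure-Python sketch of a plain tanh RNN (not an LSTM, and not the Keras code). The helper names, the toy weights, and the choice to sample each mask once per sequence are assumptions made for illustration. `dropout` masks the input vector before it is multiplied by `W`; `recurrent_dropout` masks the previous hidden state before it is multiplied by `U`.

```python
import math
import random

def bernoulli_mask(size, rate, rng):
    """Inverted-dropout mask: 0 with probability `rate`, else 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(size)]

def matvec(vec, mat):
    """Row vector times matrix, with `mat` given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def rnn_forward(xs, W, U, dropout=0.0, recurrent_dropout=0.0, seed=0):
    """Final hidden state of h_t = tanh((x_t * m_x) W + (h_{t-1} * m_h) U)."""
    rng = random.Random(seed)
    in_mask = bernoulli_mask(len(W), dropout, rng)             # acts on the inputs
    rec_mask = bernoulli_mask(len(U), recurrent_dropout, rng)  # acts on the state
    h = [0.0] * len(U)
    for x in xs:
        x_dropped = [xi * m for xi, m in zip(x, in_mask)]
        h_dropped = [hi * m for hi, m in zip(h, rec_mask)]
        pre = [a + b for a, b in zip(matvec(x_dropped, W), matvec(h_dropped, U))]
        h = [math.tanh(p) for p in pre]
    return h

W = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]  # input dim 3 -> hidden dim 2
U = [[0.4, -0.6], [0.9, 0.3]]               # hidden dim 2 -> hidden dim 2
xs = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]    # a sequence of two time steps

h_plain = rnn_forward(xs, W, U)
h_no_memory = rnn_forward(xs, W, U, recurrent_dropout=1.0)
h_no_input = rnn_forward(xs, W, U, dropout=1.0)
```

The extreme rates make the difference visible: with `recurrent_dropout=1.0` the network forgets everything and the result depends only on the last input, while with `dropout=1.0` no input ever reaches the hidden state.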

### References

* [Hinton et al., 2012](https://arxiv.org/pdf/1207.0580)
* [Srivastava et al., 2014](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)
* [Sutton et al., 2006](https://people.cs.umass.edu/~mccallum/papers/bags-hlt2006.pdf)
