## RNNs

We will use recurrent neural networks (RNNs), and in particular LSTMs, to perform sentiment analysis in Keras. Conveniently, Keras ships with a built-in IMDb movie reviews dataset that we can use.

In [None]:
!conda update -n base -c defaults conda

In [15]:
!conda install keras

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.





In [1]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

import numpy as np
from data import get_data
from time import time

urls = 'data/hs_code.xlsx'
types = 'heading'

In [2]:
t0 = time()
sheets = ['8_digit','6_digit','4_digit', 'test_01', 'Declaration_2019_10']

df = get_data(urls,sheets,types)
df.columns = ['label', 'text']
print(len(df))
load_time = time() - t0
print("Load dataset time:  %0.3fs" % load_time)
df.sample(10)

Load dataset time:  106.517s
49041
Load dataset time:  106.518s


Unnamed: 0,label,text
41886,7216,accum hldas
45907,7610,aluminium door frame gg162182awaa
43330,5702,aren pvc bath mat tr 7136 cms tr
41672,7610,aluminium carport exterior 8rda08sc
13360,5514,"fabrics, woven; printed, containing less than ..."
14672,8412,"engines; pneumatic power engines and motors, o..."
31647,8708,alloy wheels bu1 8019 5h/112 et50 cb72.5 mistr...
44308,7610,aluminium door frame tf157182asaa
35829,7604,aluminium profiles ycrt655
46612,8544,adapter cord epb rh


In [11]:
# Train/test split

vocabulary_size = 5000
print("train test split")

from sklearn.model_selection import train_test_split

# Features are the free-text descriptions; targets are the label codes.
X = df["text"].tolist()
y = np.array(df["label"])
# Hold out 20% of the rows for testing; the fixed seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
print('Loaded dataset with {} training samples, {} test samples'.format(len(X_train), len(X_test)))

train test split
Loaded dataset with 39232 training samples, 9809 test samples


In [3]:
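from keras.datasets import imdb

vocabulary_size = 5000

# Keep only the vocabulary_size most frequent words; rarer words are mapped to the out-of-vocabulary ID.
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)
print('Loaded dataset with {} training samples, {} test samples'.format(len(X_train), len(X_test)))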

Loaded dataset with 25000 training samples, 25000 test samples


Inspect a sample review and its label

In [12]:
print('---review---')
print(X_train[6])
print('---label---')
print(y_train[6])

---review---
mace, crushed or ground, bombay or wild
---label---
908


Map word IDs back to words

In [None]:
from keras.datasets import imdb

In [None]:
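word2id = imdb.get_word_index()
# load_data() shifts every word index by 3 by default (index_from=3) to
# reserve 0 (padding), 1 (start) and 2 (unknown), so apply the same shift here.
id2word = {i + 3: word for word, i in word2id.items()}
print('---review with words---')
print([id2word.get(i, ' ') for i in X_train[6]])
print('---label---')
print(y_train[6])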

Maximum review length and minimum review length

In [6]:
print('Maximum review length: {}'.format(
    len(max((X_train + X_test), key=len))))

Maximum review length: 2697


In [7]:
# NOTE: this expression uses X_test twice, so X_train is not considered here.
print('Minimum review length: {}'.format(
    len(min((X_test + X_test), key=len))))

Minimum review length: 14
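
A caveat on the two cells above: if X_train and X_test are NumPy arrays of lists, the + operator works element by element (it concatenates the i-th entry of one array with the i-th entry of the other) rather than joining the two datasets, so the reported lengths may describe pairs of reviews. A minimal sketch of a per-review length check follows, using small hypothetical stand-in lists instead of the real data:

```python
# Toy stand-ins for the encoded reviews (hypothetical data, not IMDb).
toy_train = [[1, 14, 22], [1, 7, 9, 31, 5]]
toy_test = [[1, 8], [1, 3, 4, 6]]

# Join the two datasets first, then measure each review on its own.
lengths = [len(review) for review in list(toy_train) + list(toy_test)]

print('Maximum review length: {}'.format(max(lengths)))
print('Minimum review length: {}'.format(min(lengths)))
```

Converting with list() before adding keeps the behavior the same whether the inputs are Python lists or NumPy arrays.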


### Pad sequences

In order to feed this data into our RNN, all input documents must have the same length. We will limit the maximum review length to max_words by truncating longer reviews and padding shorter reviews with a null value (0). We can accomplish this using the pad_sequences() function in Keras. For now, set max_words to 500.

In [8]:
from keras.preprocessing import sequence

max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
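
To make the effect of padding concrete: with its default settings, pad_sequences() pads at the front of a short sequence and, for a long one, drops items from the front so that the last maxlen items survive. The pure-Python helper below is a hypothetical stand-in that mimics this default behavior on a single sequence; it is an illustration, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic default 'pre' padding and 'pre' truncation for one sequence."""
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen items
    return [value] * (maxlen - len(seq)) + seq   # left-pad with the null value

short_review = [11, 12, 13]
long_review = [1, 2, 3, 4, 5, 6, 7]

padded_short = pad_pre(short_review, maxlen=5)
padded_long = pad_pre(long_review, maxlen=5)

print(padded_short)
print(padded_long)
```

With max_words = 500, the same rule means that only the last 500 word IDs of a longer review reach the model.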

### TODO: Design an RNN model for sentiment analysis

Build the model architecture in the code cell below. Some Keras layers you might need are already imported, but feel free to use any other layers or transformations you like.

Remember that our input is a sequence of words (technically, integer word IDs) of maximum length = max_words, and our output is a binary sentiment label (0 or 1).

In [9]:
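from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

embedding_size = 32

model = Sequential()
# Map each word ID to a trainable 32-dimensional vector.
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
# Summarize the whole sequence with a 100-unit LSTM.
model.add(LSTM(100))
# Single sigmoid unit: the predicted probability of label 1.
model.add(Dense(1, activation='sigmoid'))

print(model.summary())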

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None


To summarize, our model is a simple RNN with one embedding layer, one LSTM layer and one dense layer, for a total of 213,301 trainable parameters.
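
The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer sizes used above, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the variable names are illustrative only.

```python
vocabulary_size = 5000
embedding_size = 32
lstm_units = 100

# Embedding: one vector of size embedding_size per word ID.
embedding_params = vocabulary_size * embedding_size

# LSTM: 4 gates, each with (input + recurrent) weights plus a bias per unit.
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The three values match the Param # column (160000, 53200, 101) and sum to the reported total.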

### Train and evaluate our model

We first need to compile our model by specifying the loss function and optimizer we want to use while training, as well as any evaluation metrics we'd like to measure. Specify the appropriate parameters, including at least one metric, 'accuracy'.

In [10]:
model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])
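
For intuition about the binary_crossentropy loss chosen above, here is a minimal pure-Python sketch of the per-sample formula, -(y log p + (1 - y) log(1 - p)), where y is the true label and p the sigmoid output. The helper name is hypothetical; Keras additionally clips p away from 0 and 1 and averages over the batch.

```python
import math

def binary_crossentropy_single(y_true, p):
    """Per-sample binary cross-entropy for a label in {0, 1} and 0 < p < 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

confident_right = binary_crossentropy_single(1, 0.9)
undecided = binary_crossentropy_single(1, 0.5)
confident_wrong = binary_crossentropy_single(1, 0.1)

print(confident_right, undecided, confident_wrong)
```

The loss grows as the predicted probability moves away from the true label, which is what pushes the sigmoid output toward 0 or 1 during training.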

Once compiled, we can kick off the training process. There are two important training parameters that we have to specify: batch size and number of training epochs. Together with our model architecture, they determine the total training time.

Training may take a while, so grab a cup of coffee, or better, go for a run!

In [11]:
batch_size = 64
num_epochs = 3

# Hold out the first batch_size samples for validation and train on the rest.
X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]

model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)

Train on 24936 samples, validate on 64 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1a9529dea20>
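
To see how batch size and epoch count translate into training work, the short calculation below uses the figures from the log above (24936 training samples, batches of 64, 3 epochs). It is a back-of-the-envelope sketch, assuming one weight update per batch with a smaller final batch.

```python
import math

train_samples = 24936   # from the training log
batch_size = 64
num_epochs = 3

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_updates = steps_per_epoch * num_epochs

print(steps_per_epoch, total_updates)
```

Doubling the batch size roughly halves the number of updates per epoch, while each extra epoch adds another full pass over the data.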

model.evaluate() returns the loss followed by the compiled metrics, so scores[0] is the loss and scores[1] is the accuracy, because we passed metrics=['accuracy'].

In [12]:
scores = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', scores[1])

Test accuracy: 0.86964
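
For a model with a single sigmoid output, the reported accuracy amounts to thresholding each predicted probability at 0.5 and comparing it with the true label. The sketch below illustrates that computation on a few made-up predictions; the helper and the data are hypothetical, not output from the model above.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    hits = [int(p > threshold) == y for y, p in zip(y_true, y_prob)]
    return sum(hits) / len(hits)

toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.6]   # the last two fall on the wrong side of 0.5

toy_accuracy = binary_accuracy(toy_labels, toy_probs)
print('Toy accuracy:', toy_accuracy)
```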
