To get started, install these packages in a Python environment on your machine, then relaunch the notebook.

In [2]:
'''
$ pip install --upgrade tensorflow
$ pip install numpy scipy
$ pip install scikit-learn
$ pip install pillow
$ pip install h5py
$ pip install keras
'''


In [3]:
# Importing the packages we'll use
import numpy as np
import keras
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint 
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(42)

ModuleNotFoundError: No module named 'keras'

We're using Keras' built-in IMDB dataset for sentiment analysis, so there's nothing complicated about loading the documents here. We *will*, however, cap the vocabulary at the top_words most frequent words. Since every word becomes its own input dimension, it would be wasteful to keep thousands of words that appear in the corpus only once. We print the shapes of the training and test sets (a 50/50 split by default) to see how many reviews each one holds; the length of the input vector is fixed later, when we encode the reviews.

In [1]:
top_words = 10000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = top_words)

print(x_train.shape)
print(x_test.shape)

NameError: name 'imdb' is not defined
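To make the effect of num_words concrete, here is a small sketch of the truncation on a toy review. It assumes the default imdb.load_data convention, in which any index at or above num_words is replaced by the out-of-vocabulary marker 2. The helper truncate_vocab and the toy review are made up for illustration and are not part of Keras.

```python
def truncate_vocab(sequence, num_words, oov_char=2):
    """Replace every index >= num_words with the out-of-vocabulary marker."""
    return [idx if idx < num_words else oov_char for idx in sequence]

# Toy review: indices are frequency ranks, so large values are rare words.
toy_review = [1, 14, 22, 12500, 43, 88000, 530]
truncated = truncate_vocab(toy_review, num_words=10000)
print(truncated)  # [1, 14, 22, 2, 43, 2, 530]
```

The review keeps its length; rare words are collapsed into one shared index rather than removed.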

We'll print the first observation, row 0, of our training set. The words have already been replaced by integer indices, which is annoying, but we can work around it by indexing new text with the same scheme. y_train\[0\] is just a 1 or 0 label indicating the sentiment.

In [15]:
print(x_train[0])
print(y_train[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
1
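If you'd rather read a review than stare at its indices, the mapping can be inverted. The sketch below uses a tiny made-up word index in place of imdb.get_word_index(), and it assumes the default load_data offsets: 0 for padding, 1 for the start of a review, 2 for out-of-vocabulary words, and real words shifted up by 3. The helper decode_review is hypothetical.

```python
# Made-up stand-in for imdb.get_word_index(): word -> frequency rank (1 = most common).
toy_word_index = {"the": 1, "and": 2, "film": 3, "was": 4, "great": 5}

INDEX_FROM = 3  # assumed default offset applied by imdb.load_data
SPECIAL = {0: "<pad>", 1: "<start>", 2: "<oov>"}

def decode_review(sequence, word_index):
    """Turn a list of load_data-style indices back into a readable string."""
    inverted = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    inverted.update(SPECIAL)
    return " ".join(inverted.get(idx, "<oov>") for idx in sequence)

decoded = decode_review([1, 4, 6, 7, 2, 8], toy_word_index)
print(decoded)  # <start> the film was <oov> great
```

Under that assumption, the leading 1 in the review printed above is the start marker, and every 2 is a word that fell outside the top_words cutoff.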


Since each of our inputs needs to be the same length, we'll further restrict the vocabulary to the 1000 most frequent words and turn each review into a bag of words: a vector of 1000 zeros and ones. If the word with a given index appears in the review, it is recorded as a 1 at that position. We'll print the first review to show:

In [16]:
tokenizer = Tokenizer(num_words=1000)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
print(x_train[0])

[0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0.
 0. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0.
 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0.
 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
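What sequences_to_matrix does in binary mode can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the Keras source: to_multi_hot is a hypothetical helper, and it assumes that indices at or above num_words are simply skipped.

```python
import numpy as np

def to_multi_hot(sequences, num_words):
    """Return a (len(sequences), num_words) matrix with a 1 wherever an index occurs."""
    matrix = np.zeros((len(sequences), num_words))
    for row, sequence in enumerate(sequences):
        for idx in sequence:
            if idx < num_words:  # rarer words fall outside the matrix and are skipped
                matrix[row, idx] = 1.0
    return matrix

toy_sequences = [[1, 4, 4, 7], [2, 9, 12]]
bow = to_multi_hot(toy_sequences, num_words=10)
print(bow[0])  # [0. 1. 0. 0. 1. 0. 0. 1. 0. 0.]
print(bow[1])  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 1.]
```

Note what gets lost: word order and repeat counts (index 4 occurs twice in the first toy review but is still recorded as a single 1).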

The labels are a single column of 0s and 1s. We one-hot encode them into two columns, one per class, so they match the two-unit output layer of the network we build next.

In [17]:
num_classes = 2
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(y_train.shape)
print(y_test.shape)

(25000, 2)
(25000, 2)
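For intuition, the same one-hot encoding can be written directly in NumPy. This is a sketch of the idea rather than the Keras implementation; one_hot_labels and the toy labels are made up for illustration.

```python
import numpy as np

def one_hot_labels(labels, num_classes):
    """Map integer class labels to one-hot rows, e.g. 1 -> [0., 1.]."""
    return np.eye(num_classes)[np.asarray(labels)]

toy_labels = [1, 0, 0, 1]
encoded = one_hot_labels(toy_labels, num_classes=2)
print(encoded.shape)  # (4, 2)
print(encoded[0])     # [0. 1.]
```

A positive review (label 1) becomes the row 0, 1 and a negative review (label 0) becomes 1, 0.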


Here's the fun part, where we build and train a network. The only real constraint is that the first layer must declare input_dim = x_train.shape\[1\], since that's the length of our input vector; the 25000 units in that layer are a choice, not a requirement. Build it however you'd like, and experiment with different layers (dense versus other kinds, dropouts versus no dropouts, different activation functions besides ReLU). It probably makes sense to stick to a Sequential model with a sigmoid at the end since this is a binary classification problem. Also, research different loss functions, optimizers, and metrics to see why I chose these. We end by printing our model's architecture.

In [18]:
# Stack fully connected layers; dropout after each hidden ReLU layer helps against overfitting
model = Sequential()
# input_dim must equal the length of the bag-of-words vector
model.add(Dense(25000, input_dim=x_train.shape[1]))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(250, activation='relu'))
model.add(Dropout(0.5))
# Output layer: one unit per class
model.add(Dense(2, activation='sigmoid'))

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 25000)             25025000  
_________________________________________________________________
dense_2 (Dense)              (None, 100)               2500100   
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 100)               10100     
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 250)               25250     
_________________________________________________________________
dropout_3 (Dropout)          (None, 250)               0         
__________
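The Param # column is easy to check by hand: a Dense layer has one weight per input-output pair plus one bias per unit, and Dropout adds nothing. Here is a quick sketch of that arithmetic, with the layer widths read off the model definition (1000 is the bag-of-words length); dense_params is a hypothetical helper.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

# Input length followed by the width of each Dense layer, in order.
widths = [1000, 25000, 100, 100, 250, 2]
counts = [dense_params(i, o) for i, o in zip(widths, widths[1:])]
print(counts)  # [25025000, 2500100, 10100, 25250, 502]
print(sum(counts))
```

Nearly all of the parameters sit in the first two layers, so the width of that first layer is the obvious thing to shrink if training is slow.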

Now we'll fit the model to our data. You can specify the number of epochs (be careful of overfitting by having too many epochs!), the batch size (so the weights are updated once per batch of reviews rather than swinging wildly to fit each review as it comes in), etc. Setting verbose=1 lets you see your epochs as they run.

In [19]:
model.fit(x_train, y_train, epochs=5, batch_size=128, verbose=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x129e5a4e0>
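To see what the batch size means in practice, here is a sketch of how 25000 reviews split into batches of 128. It ignores the shuffling Keras does between epochs, and both helpers are made up for illustration.

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(num_samples / batch_size)

def iter_batches(num_samples, batch_size):
    """Yield (start, stop) index pairs covering the data once."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)

steps = batches_per_epoch(25000, 128)
last_start, last_stop = list(iter_batches(25000, 128))[-1]
print(steps)                   # 196
print(last_stop - last_start)  # 40
```

So each epoch makes 196 weight updates, and the final batch is a short one of 40 reviews.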

In [20]:
score = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: ", score[1])

Accuracy:  0.85556
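Here score\[0\] is the loss and score\[1\] is the accuracy, since that's the one metric we asked for when compiling. With one-hot labels, accuracy boils down to checking whether the highest-scoring output matches the labeled class. A sketch on made-up numbers (categorical_accuracy is a hypothetical helper, not the Keras function):

```python
import numpy as np

def categorical_accuracy(y_true, y_pred):
    """Fraction of rows where the highest-scoring class matches the label."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))

# Made-up one-hot labels and model outputs for four reviews.
y_true = np.array([[0, 1], [1, 0], [0, 1], [1, 0]])
y_pred = np.array([[0.2, 0.9], [0.7, 0.1], [0.6, 0.5], [0.8, 0.3]])
acc = categorical_accuracy(y_true, y_pred)
print(acc)  # 0.75
```

The third toy review is the miss: it is labeled positive but the first output scores higher.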


So... I don't have a real filepath written here.  I left this to Tom and Sonja.  But this is how you save a trained model to use at any time in the future on new data.

In [None]:
model.save(filepath)

Here's how you load the model from the file path you specified.

In [None]:
model = keras.models.load_model(filepath)

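And this is how you index a new input with the word index Keras used. With the default settings, the indices produced by imdb.load_data are shifted by 3 relative to imdb.get_word_index(): 0 is reserved for padding, 1 marks the start of a review, and 2 stands for any word outside the vocabulary. That is why the printed BOW of our first review above has a zero in the first position (as does every one after that if you care to go back and print them): index 0 does not appear in the reviews. So we add 3 to each index we look up and map words that aren't in the index to 2.

In [21]:
word_index = imdb.get_word_index()
list_in = ['word1', 'word2']
# Assumes the load_data defaults: index_from=3, oov_char=2
index_from, oov_char = 3, 2
sentiment_analysis_input = np.array([word_index[word] + index_from if word in word_index else oov_char for word in list_in])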

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json
88584


TypeError: 'NoneType' object is not iterable