## Background

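Keras lets a model accept variable-size inputs by using `None` as a dimension in the input shape. Less clearly documented is that this does not work with the `.fit` method when the samples differ in length, since `.fit` expects the training data as a single rectangular array; `.fit_generator` must be used instead.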

The resolution is explained in a follow-up question on this thread: https://github.com/keras-team/keras/issues/6776

An example of proper use: https://datascience.stackexchange.com/questions/26366/training-an-rnn-with-examples-of-different-lengths-in-keras
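
As a minimal sketch of the idea (not taken from the linked threads): sequences of different lengths cannot be stacked into one rectangular array, but a generator can hand them over one at a time as batches of size 1, each with its own length. The names `sequences`, `labels`, and `batch_of_one` are made up for illustration, and only NumPy is used so the snippet runs without Keras.

```python
import numpy as np

# Three toy "reviews" of different lengths, already encoded as token ids (hypothetical data).
sequences = [np.array([1, 2, 3]), np.array([4, 5]), np.array([6, 7, 8, 9])]
labels = np.array([[0, 1], [1, 0], [0, 1]])  # one-hot labels, one row per sequence


def batch_of_one(seqs, labs):
    """Yield (x, y) pairs forever; each x is a batch of size 1 with its own length."""
    while True:
        for seq, lab in zip(seqs, labs):
            yield seq.reshape(1, -1), lab.reshape(1, -1)


gen = batch_of_one(sequences, labels)
# Pull four batches: the generator wraps around after the third sequence.
shapes = [next(gen)[0].shape for _ in range(4)]
print(shapes)
```

Because the time dimension is declared as `None`, each of these batches is a valid input even though their second dimension differs.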

## Sample Code - Sentiment Analysis

This example is mainly meant to demonstrate Keras usage with variable sequence lengths, so it uses a minimal model:

1. Train our own embedding rather than using GloVe or another pretrained embedding.
2. Use a one-layer LSTM followed by a Dense layer, with low values for the hidden and embedding dimensions.
3. Wrap the LSTM in Bidirectional so the model isn't completely hopeless.

In [1]:
# Standard Stuff
import pandas as pd
import numpy as np
import re
from collections import Counter, defaultdict

# NLP
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup

# Keras
from keras.models import Model
from keras.layers import Dense, Input, LSTM, Bidirectional, Embedding
from keras.utils import to_categorical

[nltk_data] Downloading package stopwords to /home/geugon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/geugon/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Using TensorFlow backend.


In [2]:
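lem = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build once; rebuilding it per word is very slow

def clean_text(orig_text):
    """Strip HTML, drop non-letters, lowercase, remove stopwords, and lemmatize."""

    # html cleanup
    soup = BeautifulSoup(orig_text, "html.parser")
    review = soup.get_text()

    # remove bracketed text, then replace every non-letter with a space
    review = re.sub(r'\[[^]]*\]', ' ', review)
    review = re.sub(r'[^a-zA-Z]', ' ', review)
    review = review.lower().split()

    # stopword removal and lemmatization
    review = [lem.lemmatize(word) for word in review if word not in stop_words]

    return review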
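
To see what the two `re.sub` calls in `clean_text` do, here is a small sketch on a made-up sentence. It uses only the standard library and leaves out the HTML, stopword, and lemmatization steps; the `sample` string is an assumption for illustration, not a review from the dataset.

```python
import re

sample = "Great movie [spoiler inside]!!! 10/10, would watch again."

# Drop anything in square brackets, brackets included.
no_brackets = re.sub(r'\[[^]]*\]', ' ', sample)

# Replace every character that is not an ASCII letter with a space.
letters_only = re.sub(r'[^a-zA-Z]', ' ', no_brackets)

# Lowercase and split on whitespace to get tokens.
tokens = letters_only.lower().split()
print(tokens)  # ['great', 'movie', 'would', 'watch', 'again']
```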

In [3]:
# Data from:
# https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data?select=IMDB+Dataset.csv

df = pd.read_csv("~/Downloads/IMDB_Dataset.csv")
raw_reviews = df['review'][:1000].apply(clean_text) #slow!
sentiment = to_categorical(np.where(df['sentiment']=='positive', 1, 0))
del df

In [4]:
index = np.arange(len(raw_reviews))
np.random.shuffle(index)
n_train = 900  # 49500 for the full 50k reviews
n_valid = 100  # 500 for the full 50k reviews

counts = Counter(raw_reviews[index[:n_train]].sum())
id_to_token = [k for k, v in counts.items() if v>4]
vocab_size = len(id_to_token)
token_to_id = defaultdict(lambda : vocab_size, 
                   ((v,k) for k,v in enumerate(id_to_token)))
id_to_token.append('<RARE>')

reviews = [np.array([token_to_id[token] for token in review]) for review in raw_reviews]

print(vocab_size)

3516
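
The vocabulary cell above maps every token seen often enough to its own id and sends everything else to one shared `<RARE>` id equal to `vocab_size`. A toy sketch of the same pattern follows; the `docs` data and the `min_count` threshold of 2 are assumptions chosen to keep the example small (the notebook keeps tokens that occur more than 4 times).

```python
from collections import Counter, defaultdict

# Hypothetical tokenized training documents.
docs = [["good", "movie"], ["good", "plot"], ["bad", "movie"]]
min_count = 2  # assumed threshold for this toy example

counts = Counter(token for doc in docs for token in doc)
id_to_token = [token for token, count in counts.items() if count >= min_count]
vocab_size = len(id_to_token)

# Unknown or rare tokens fall back to the id vocab_size.
token_to_id = defaultdict(lambda: vocab_size,
                          ((token, i) for i, token in enumerate(id_to_token)))
id_to_token.append('<RARE>')

encoded = [token_to_id[token] for token in ["good", "unseen", "movie"]]
print(encoded)  # [0, 2, 1]
```

This is also why the embedding layer needs `vocab_size + 1` rows: one per kept token plus one for `<RARE>`.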


In [5]:
def train_generator():
    while True:
        for i in range(0, n_train):
            x = reviews[index[i]].reshape(1,-1)
            y = sentiment[index[i]].reshape(1,2)
            yield x,y

def valid_generator():
    while True:
        for i in range(n_train, n_train+n_valid):
            x = reviews[index[i]].reshape(1,-1)
            y = sentiment[index[i]].reshape(1,2)
            yield x,y

In [6]:
embedding_dim = 16
hidden_dim = 32

input_ = Input(shape=(None,))
embed = Embedding(vocab_size + 1, embedding_dim)(input_)
rnn = Bidirectional(LSTM(hidden_dim, return_sequences=False))(embed)
predict = Dense(2, activation='softmax')(rnn)  # softmax pairs with categorical_crossentropy on one-hot labels
model = Model(inputs=input_, outputs=predict)
print(model.summary())

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['acc'])



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, None)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, None, 16)          56272     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                12544     
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 130       
Total params: 68,946
Trainable params: 68,946
Non-trainable params: 0
_________________________________________________________________
None
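
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the variable names are chosen for illustration.

```python
vocab_size = 3516
embedding_dim = 16
hidden_dim = 32

# One embedding row per kept token plus one for <RARE>.
embedding_params = (vocab_size + 1) * embedding_dim

# Each of the 4 LSTM gates has input weights, recurrent weights, and a bias.
lstm_one_direction = 4 * (hidden_dim * (embedding_dim + hidden_dim) + hidden_dim)
bidirectional_params = 2 * lstm_one_direction  # forward and backward LSTM

# Dense layer: 64 concatenated inputs, 2 outputs, plus 2 biases.
dense_params = (2 * hidden_dim) * 2 + 2

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
# 56272 12544 130 68946
```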


In [7]:
model.fit_generator(train_generator(), 
                    validation_data=valid_generator(),
                    steps_per_epoch = n_train, #batch size is inherently 1 via generator
                    validation_steps= n_valid,
                    epochs=3,
                    verbose=1,)
                    




Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f1f45eb0da0>