# Simple LSTM baseline for predicting review scores

In this kernel, we will go through:

1. **Preprocessing** - Load the dataset, retrieve the reviews (aka documents) and scores, encode the target (scores) and split the data into training and test sets.
2. **Training Model** - We will train a simple LSTM model that will predict the rating based solely on the review comments. We will be using a FastText embedding; the matrix-building process is inspired by [this excellent kernel](https://www.kaggle.com/thousandvoices/simple-lstm), and the LSTM model was simplified for learning purposes.
3. **Evaluation** - We will plot the loss and accuracy progression through epochs, and display the classification results as a table.

In [None]:
import json
import re

import numpy as np 
import pandas as pd
from keras.models import Model
from keras.layers import Input, Dense, Embedding, SpatialDropout1D, add, concatenate
from keras.layers import CuDNNLSTM, Bidirectional, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.callbacks import Callback, ModelCheckpoint
from keras.preprocessing import text, sequence
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tqdm import tqdm

In [None]:
tqdm.pandas()

# Preprocessing

## Loading data

In [None]:
df = pd.read_csv('../input/zomato-bangalore-restaurants/zomato.csv')

print(df.shape)
df.head()

## Retrieve the text data

In [None]:
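import ast

all_ratings = []

for ratings in tqdm(df['reviews_list']):
    # Parse the stringified list of (score, review) tuples without executing arbitrary code
    ratings = ast.literal_eval(ratings)
    
    for score, doc in ratings:
        if score:
            # Remove the literal prefixes. str.strip('RATED') strips a set of characters,
            # so it could also eat leading or trailing letters of the review itself.
            score = score.replace('Rated', '').strip()
            doc = re.sub(r'^RATED\s*', '', doc).strip()
            
            score = float(score)
            all_ratings.append([score, doc])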
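
To make the parsing step concrete, here is a minimal sketch on a made-up `reviews_list` value. The sample string and the helper name `parse_reviews` are assumptions for illustration only; the real column may contain entries that this sketch does not handle.

```python
import ast
import re


def parse_reviews(raw):
    """Turn a stringified list of (score, review) tuples into [score, text] pairs."""
    parsed = []
    for score, doc in ast.literal_eval(raw):
        if not score:
            continue  # skip reviews that carry no rating
        score = float(score.replace('Rated', '').strip())
        doc = re.sub(r'^RATED\s*', '', doc).strip()
        parsed.append([score, doc])
    return parsed


# Hypothetical sample in the same shape as one cell of df['reviews_list']
sample = "[('Rated 4.0', 'RATED\\n  Great food and service'), ('Rated 1.5', 'RATED\\n  Cold food')]"
print(parse_reviews(sample))
```

Each parsed pair holds the numeric score followed by the cleaned review text, which is the layout `ratings_df` is built from.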

In [None]:
ratings_df = pd.DataFrame(all_ratings, columns=['score', 'doc'])

print(ratings_df.shape)
ratings_df.head()

## Remove ratings inside text

Some of the reviews contain ratings in their text. We want the model to predict ratings from the wording alone, so we mask this information to keep the model from simply reading the score off the review.

"Unhide" the output of the next cell to view a sample of the reviews that contain the character `"/"`.

In [None]:
docs_with_ratings = []
for doc in ratings_df['doc'][:150]:
    if '/' in doc:
        print(doc)
        docs_with_ratings.append(doc)

In [None]:
print(len(docs_with_ratings))

We will use a regular expression to find and replace all the occurrences of ratings.

Let's take a look at all the parts of the documents that match the pattern `[0-9.]*[0-9]/[0-9]*[0-9]`, which covers double-digit ratings (e.g. 10/10), single-digit ratings (e.g. 5/5) and ratings with a fractional part (e.g. 9.5/10 or 3.5/5).

In [None]:
for docs in docs_with_ratings:
    x = re.findall('[0-9.]*[0-9]/[0-9]*[0-9]', docs)
    print(x)
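
To see what the pattern catches, and what it catches by accident, here is a small sketch on a made-up sentence. The sample text and the name `RATING_PATTERN` are assumptions for illustration; the pattern itself is the one used in this kernel.

```python
import re

RATING_PATTERN = r'[0-9.]*[0-9]/[0-9]*[0-9]'

# Hypothetical review text containing three ratings and one non-rating
sample = "Food 9.5/10, ambience 3.5/5, overall 10/10. Open 24/7."

matches = re.findall(RATING_PATTERN, sample)
print(matches)

# Replace every match with the placeholder word "score"
masked = re.sub(RATING_PATTERN, 'score', sample)
print(masked)
```

Note that "24/7" is matched as well, so the pattern also masks some expressions that are not ratings. For a baseline this is probably acceptable, but it is worth keeping in mind.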

We will replace them with the word "score", since we do not want the model to overfit on ratings that are already given in the comments.

In [None]:
doc = docs_with_ratings[0]
subbed_doc = re.sub('[0-9.]*[0-9]/[0-9]*[0-9]', 'score', doc)
print("ORIGINAL:")
print(doc)
print("\nSUBBED:")
print(subbed_doc)

Now, we do it for all texts.

In [None]:
ratings_df['new_doc'] = ratings_df['doc'].progress_apply(
    lambda doc: re.sub('[0-9.]*[0-9]/[0-9]*[0-9]', 'score', doc)
)

## One Hot Encoding

Let's look at the distribution of ratings:

In [None]:
ratings_df['score'].astype('category').value_counts()

The scores take a small set of discrete values, so we treat them as categories. The classes are heavily imbalanced, which is worth addressing if you want to improve performance. We will go ahead and one-hot encode the scores:

In [None]:
dummies = pd.get_dummies(ratings_df['score'])
dummies.head()
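
This kernel does not address the class imbalance noted above. One common option is to weight each class inversely to its frequency; the sketch below uses the "balanced" heuristic `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy scores are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Made-up scores, only to illustrate the arithmetic
toy_scores = [5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 1.0]
weights = balanced_class_weights(toy_scores)
print(weights)
```

Rare classes receive larger weights. Keras's `model.fit` accepts a `class_weight` dictionary keyed by class index, so the scores would first need to be mapped to their column positions in `dummies`.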

## Train Test Split

Finally, we split the data into train and test sets; the latter will be used to evaluate our model.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    ratings_df['new_doc'], 
    dummies, 
    test_size=0.1, random_state=2019
)

# Building an LSTM Model

At this point, we are ready to build our model. Similar to the original kernel, we will go through the following steps:
1. Fit the Keras Tokenizer
2. Build an embedding matrix
3. Tokenize and pad our training data
4. Train the model

Once we are done training the model, we evaluate how well it performs. This part is covered in the next section.
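
Before using the Keras utilities, it may help to see what tokenizing and padding (steps 1 and 3) produce. The sketch below is a simplified pure-Python stand-in, not the Keras implementation: the function names are hypothetical, and it only mimics the default behaviour of lowercasing, indexing words by frequency starting at 1, dropping unknown words, and padding or truncating at the start of the sequence.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text, word_index):
    """Replace known words by their index and drop unknown ones."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


def pad(seq, maxlen):
    """Keep the last `maxlen` tokens and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


texts = ["the food was great", "the service was slow"]
word_index = fit_word_index(texts)
seq = to_sequence("the food was slow", word_index)

print(word_index)
print(pad(seq, 6))  # shorter than maxlen: zeros are added in front
print(pad(seq, 3))  # longer than maxlen: the earliest tokens are cut
```

Every review ends up as a fixed-length list of integers, which is the input format the embedding layer expects.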

## Helper functions to build the FastText embedding matrix and the model

In [None]:
def build_matrix(word_index, path):
    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    def load_embeddings(path):
        with open(path, encoding='utf-8') as f:
            embedding_index = {}
            
            for line in tqdm(f):
                word, arr = get_coefs(*line.strip().split(' '))    
                if word in word_index:
                    embedding_index[word] = arr
            
        return embedding_index

    embedding_index = load_embeddings(path)
    embedding_matrix = np.zeros((len(word_index) + 1, 300))
    
    for word, i in tqdm(word_index.items()):
        try:
            embedding_matrix[i] = embedding_index[word]
        except KeyError:
            pass
    return embedding_matrix
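
The idea behind `build_matrix` can be sketched on a tiny in-memory example. Everything below is made up for illustration (3-dimensional vectors, a three-word vocabulary), and plain lists stand in for the NumPy array. Unlike `build_matrix`, the sketch explicitly skips the header line that FastText `.vec` files start with.

```python
import io

# Hypothetical vectors in the .vec text layout: a "count dim" header, then one word per line
toy_vec = io.StringIO(
    "3 3\n"
    "food 0.1 0.2 0.3\n"
    "great 0.4 0.5 0.6\n"
    "slow 0.7 0.8 0.9\n"
)
word_index = {"food": 1, "great": 2, "biryani": 3}  # as produced by a tokenizer
dim = 3

next(toy_vec)  # skip the header line
vectors = {}
for line in toy_vec:
    word, *values = line.rstrip().split(" ")
    if word in word_index:  # only keep vectors for words we actually use
        vectors[word] = [float(v) for v in values]

# Row 0 stays zero for padding; words without a pretrained vector also stay zero
matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
for word, i in word_index.items():
    if word in vectors:
        matrix[i] = vectors[word]

for row in matrix:
    print(row)
```

Row `i` of the matrix holds the vector of the word whose tokenizer index is `i`, which is why the matrix has `len(word_index) + 1` rows.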

In [None]:
def build_model(embedding_matrix):
    words = Input(shape=(None,))
    x = Embedding(*embedding_matrix.shape, weights=[embedding_matrix], trainable=False)(words)
    x = SpatialDropout1D(0.2)(x)
    x = Bidirectional(CuDNNLSTM(128, return_sequences=True))(x)

    hidden = concatenate([
        GlobalMaxPooling1D()(x),
        GlobalAveragePooling1D()(x),
    ])
    hidden = Dense(512, activation='relu')(hidden)
    result = Dense(9, activation='softmax')(hidden)
    
    model = Model(inputs=words, outputs=result)
    model.compile(
        loss='categorical_crossentropy', 
        optimizer='adam',
        metrics=['accuracy']
    )

    return model
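
The two pooling layers in `build_model` reduce the LSTM output, which has one vector per time step, to a single fixed-size vector per review. The following pure-Python sketch shows the computation on made-up numbers; it illustrates the idea and is not how Keras implements it.

```python
# Toy LSTM output for one review: 3 time steps, 2 features per step (made-up numbers)
outputs = [
    [0.2, -1.0],
    [0.9, 0.5],
    [0.4, 0.0],
]

# zip(*outputs) regroups the values by feature, so each group spans the time axis
max_pooled = [max(feature) for feature in zip(*outputs)]
avg_pooled = [sum(feature) / len(feature) for feature in zip(*outputs)]

# Joining both summaries mirrors the `concatenate` call in the model
hidden = max_pooled + avg_pooled
print(hidden)
```

The result has twice as many entries as there are features, regardless of how long the review is.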

## Creating the tokenizer

In [None]:
%%time
CHARS_TO_REMOVE = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n“”’\'∞θ÷α•à−β∅³π‘₹´°£€×™√²—'
tokenizer = text.Tokenizer(filters=CHARS_TO_REMOVE)
tokenizer.fit_on_texts(list(x_train) + list(x_test))

In [None]:
embedding_matrix = build_matrix(tokenizer.word_index, '../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec')

In [None]:
x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)
x_train = sequence.pad_sequences(x_train, maxlen=256)
x_test = sequence.pad_sequences(x_test, maxlen=256)

## Training

In [None]:
model = build_model(embedding_matrix)
model.summary()

checkpoint = ModelCheckpoint(
    'model.h5', 
    monitor='val_acc', 
    verbose=1, 
    save_best_only=True, 
    save_weights_only=False,
    mode='auto'
)

history = model.fit(
    x_train,
    y_train,
    batch_size=512,
    callbacks=[checkpoint],
    epochs=10,
    validation_split=0.1
)

# Evaluation

## Training history

Let's take a look at how the loss and accuracy evolved during training.

In [None]:
with open('history.json', 'w') as f:
    json.dump(history.history, f)

history_df = pd.DataFrame(history.history)
history_df[['loss', 'val_loss']].plot()
history_df[['acc', 'val_acc']].plot()
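
Besides plotting, the history can be inspected directly, for example to find the epoch that `ModelCheckpoint` would have kept. The sketch below assumes the same dictionary layout as `history.history`; the numbers are made up and are not results from this kernel.

```python
# Made-up values in the same layout as `history.history` (one entry per epoch)
toy_history = {
    'val_loss': [1.20, 1.05, 1.01, 1.08],
    'val_acc': [0.52, 0.57, 0.60, 0.59],
}

epochs = range(len(toy_history['val_acc']))

# Epoch with the highest validation accuracy (0-based)
best_epoch = max(epochs, key=toy_history['val_acc'].__getitem__)
print(f"best epoch: {best_epoch}, val_acc: {toy_history['val_acc'][best_epoch]}")

# Validation loss rising after its minimum is a common sign of overfitting
lowest_loss_epoch = min(epochs, key=toy_history['val_loss'].__getitem__)
print(f"lowest val_loss at epoch: {lowest_loss_epoch}")
```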

## Classification Report

Let's see how well the model is able to predict each class of score, i.e. its F1-score for each possible rating (1.0, 1.5, 2.0, ..., 5.0).

*Small note: Notice that we are using `keras.utils.to_categorical` here instead of `pd.get_dummies`. Both produce a one-hot encoding, but the former expects integer class indices (which is what we get from `argmax`), whereas the latter works on arbitrary categorical values (e.g. a `pd.Series` of `dtype='category'`). Feel free to experiment with both.*
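
To illustrate the conversion from probabilities to one-hot predictions, here is a pure-Python sketch with made-up numbers. The helper `one_hot` is hypothetical and only mimics what `to_categorical` does when the number of classes is given explicitly.

```python
def one_hot(indices, num_classes):
    """One-hot encode integer class indices into rows of length `num_classes`."""
    return [[1 if i == idx else 0 for i in range(num_classes)] for idx in indices]


# Made-up class probabilities for two reviews and three classes
probs = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]

predicted = [row.index(max(row)) for row in probs]  # argmax per row
print(predicted)
print(one_hot(predicted, num_classes=3))
```

The last class is never predicted here, yet the output still has three columns because the number of classes is fixed. `to_categorical` infers the number of classes from the largest index unless `num_classes` is passed, so passing it keeps the prediction matrix aligned with `y_test`.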

In [None]:
model.load_weights('model.h5')

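y_probs = model.predict(x_test, verbose=2)
# Fix the number of classes so y_pred has one column per score, even if a class is never predicted
y_pred = to_categorical(y_probs.argmax(axis=1), num_classes=y_test.shape[1])

# With one-hot inputs, `labels` is read as column indices, so pass the scores as display names instead
target_names = [str(score) for score in y_test.columns]
report = classification_report(y_test.values, y_pred, target_names=target_names)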
print(report)
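
For reference, the per-class numbers in the report can be reproduced by hand: precision is the share of predictions for a class that are correct, recall is the share of true members of a class that are found, and the F1-score is their harmonic mean. The sketch below uses made-up labels and a hypothetical helper `per_class_f1`; it is not how scikit-learn computes the report internally.

```python
def per_class_f1(y_true, y_pred, classes):
    """Return {class: (precision, recall, f1)} computed from two label lists."""
    scores = {}
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[cls] = (precision, recall, f1)
    return scores


# Made-up labels, only to illustrate the arithmetic
y_true = [5.0, 5.0, 4.0, 4.0, 1.0]
y_pred = [5.0, 4.0, 4.0, 4.0, 5.0]

for cls, (p, r, f1) in per_class_f1(y_true, y_pred, [1.0, 4.0, 5.0]).items():
    print(f"{cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

A class that is never predicted, like 1.0 in this toy example, gets an F1-score of 0, which is typical for rare classes in an imbalanced dataset.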