In this kernel we train an LSTM model with Keras to predict sentiment (binary targets, 0 or 1) from short texts (a few sentences each).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import json
import sys
import csv
import os

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # display every expression in a cell, not only the last one (default: 'last_expr')

In [2]:
input_path = 'yelp_dataset/'

Here we load the whole review dataset (about 6.7 million reviews). It is too large to work with comfortably, so we later use only a fraction of the data (see small_df).

In [None]:
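def convert(x):
    """Parse one JSON line and flatten list and dict values into flat columns."""
    ob = json.loads(x)
    # Iterate over a snapshot of the items: the loop adds and deletes keys of ob,
    # which raises a RuntimeError when iterating over ob.items() directly.
    for k, v in list(ob.items()):
        if isinstance(v, list):
            ob[k] = ','.join(v)
        elif isinstance(v, dict):
            for kk, vv in v.items():
                ob['%s_%s' % (k, kk)] = vv
            del ob[k]
    return ob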

with open(input_path + 'review.json','rb') as f:
    data = f.readlines()

# this takes a while (too much data)
review_df = pd.DataFrame([convert(line) for line in data])
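Since only a fraction of the reviews is used later, one alternative is to sample while reading instead of loading everything first. The sketch below is not what this notebook does: it uses reservoir sampling over the JSON lines, so memory stays proportional to the sample size. The function name `sample_json_lines` and the toy in-memory file are hypothetical, and the resulting sample will differ from the one produced by `review_df.sample(100000, random_state=11)`.

```python
import io
import json
import random


def sample_json_lines(lines, k, seed=11):
    """Reservoir-sample k parsed records from an iterable of JSON lines."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(json.loads(line))
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = json.loads(line)
    return reservoir


# Toy stand-in for review.json (hypothetical data, one JSON object per line).
toy_file = io.StringIO(
    "\n".join(json.dumps({"review_id": n, "stars": n % 5 + 1}) for n in range(1000))
)
toy_sample = sample_json_lines(toy_file, k=10)
```

With the real file, the same function could be called on the open file handle, since iterating over a file yields one line at a time.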

In [7]:
len(review_df)
review_df.head()

6685900

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
0,ujmEBvifdJM6h6RLv4wQIg,0,2013-05-07 04:34:36,1,Q1sbwvVQXV2734tPgoKj4Q,1.0,Total bill for this horrible service? Over $8G...,6,hG7b0MtEbXx5QzbzE6C_VA
1,NZnhc2sEQy3RmzKTZnqtwQ,0,2017-01-14 21:30:33,0,GJXCdrto3ASJOqKeVWPi6Q,5.0,I *adore* Travis at the Hard Rock's new Kelly ...,0,yXQM5uF2jS6es16SJzNHfg
2,WTqjgwHlXbSFevF32_DJVw,0,2016-11-09 20:09:03,0,2TzJjDVDEuAW6MR5Vuc1ug,5.0,I have to say that this office really has it t...,3,n6-Gk65cPZL6Uz8qRm3NYw
3,ikCg8xy5JIg_NGPx-MSIDA,0,2018-01-09 20:56:38,0,yi0R0Ugj_xUx_Nek0-_Qig,5.0,Went in for a lunch. Steak sandwich was delici...,0,dacAIZ6fTM6mqwW5uxkskg
4,b1b1eb3uo-w561D0ZfCEiQ,0,2018-01-30 23:07:38,0,11a8sVPMUFtaC7_ABRkmtw,1.0,Today was my second out of three sessions I ha...,7,ssoyf2_x0EQMed6fgHeMyQ


In [10]:
small_df = review_df.sample(100000, random_state=11)
small_df.to_json('review_small.json', orient = 'columns')

In [9]:
medium_df = review_df[:1000000]
medium_df.to_json('review_med.json', orient = 'columns')

We work on the saved small dataset only.

In [20]:
small_df = pd.read_json('review_small.json')
len(small_df)
small_df.head()

100000

Unnamed: 0,business_id,cool,date,funny,review_id,stars,text,useful,user_id
1000006,hX56JNZZjz_oEQbSsboHBQ,0,2018-01-04 01:08:09,0,4tu8zP8BAyTsNf5RP1l-Uw,5,Love this Medspa. I had a wonderful IPL photof...,0,8cCiGytDyiL48Ir6WI4NLQ
1000042,caGXS6ubNTlv91ZZyoirjQ,2,2008-10-12 19:18:53,0,B8FskbcnxMaW6hHm6dpWsg,1,All I heard when I first moved to Pittsburgh i...,1,e5O_lm2Mov6kHOka8wgvOA
1000165,y7Js-07RF8d3N_AEtaw2VQ,0,2015-10-15 01:23:06,0,VAVGbHr_idlRGxXABO0g1g,3,I had to come to this place because it was adv...,0,zmOdU_artMpKrG-AWYOSPQ
1000225,KqhvtfJITeZDVubTnMVAlg,0,2016-07-30 22:01:04,0,HqpEqva5z5LYw-1mzaraKw,5,Kenny was very efficient and dialed my car in ...,0,EmGbV1jbeKUCUm4UX6J3rg
100033,scoJNOqcw2peNlO31UYTaA,0,2013-05-12 01:50:22,1,yUAuey_JVSK64JtFrFr-Uw,4,"$7.00 for a ""create your own"" pizza? Yes pleas...",2,vPOkQJKahhR13LQ2ElSFGg


In [106]:
data_df = small_df.reset_index(drop=True)[['text','stars']]

In [107]:
data_df['sentiment'] = [1 if stars > 3 else 0 for stars in data_df['stars']]

In [108]:
pd.set_option('display.max_colwidth', -1)  # -1 disables truncation; pandas >= 1.0 expects None instead
data_df.head()

Unnamed: 0,text,stars,sentiment
0,Love this Medspa. I had a wonderful IPL photofacial with great results on my face and neck last week. \nOwner and laser tech extremely knowledgeable and experienced!\nHighly recommend. I will certainly return for additional treatments.,5,1
1,"All I heard when I first moved to Pittsburgh is how GREAT Primanti brothers is. Well I have eaten and two locations and they have the worst sandwiches I have eaten. EVER!\nWhile no one can deny that french fries on a sandwich is awesome, I think we can all agree that the stale non-toasted bread and runny coleslaw make this sandwich taste like I am eating ass. \nPrimanti brothers is a Pittsburgh staple, and Pittsburghers love ""their"" stuff, but this is one place they should let die out.",1,0
2,"I had to come to this place because it was advertised on Food network! What an experience!\n\n3 scales to weigh yourself--which I didn't partake in. You are dressed in a hospital gown as you walk in to enjoy this very caloric meal and there are posters and tvs all advertising the lack of a weight obsession makes for a happy camper.\n\nMy friend and I shared the single burger and onion rings. If the burgers came with the actual fillings of a typical burger--meat, cheese, lettuce, onions, tomatoes, and pickles--I might have said it was a delicious, fatty burger. But it didn't...no lettuce or pickles to take the brunt of the greasiness of the burger and to cut through, basically, all the fried grease of the onion ring as well. They didn't even have HOT SAUCE! For shame--it really needed it.\n\nI understand the concept--Heart Attack Grill--but STILL!!! Flavor and tartness would've have rounded up this meal in a MUCH more positive light. Also, getting the chance to order a drink out of IV lines--alas, that is a regret saved for another time. I will probably come back because the ambiance of the whole place was funky--but I will be bringing my own bottle of hot sauce next time!",3,0
3,Kenny was very efficient and dialed my car in tight. I will refer anyone who asks where to go for their tinting needs..... Thank you again. Fabulous job BOYS.,5,1
4,"$7.00 for a ""create your own"" pizza? Yes please.\n\nThis is the only truck I havent tried that comes to the ASU Food Truck Block Party. I never see a line there, so I was helped almost immediately. I ordered my pie (the guy was super nice) and waited. About 15 minutes later, I had my pizza to-go.\n\nI chose the standard crust, and cheese and added Italian sausage, green peppers, and onions. The pie had a nice crust w/ bubbles (which I love). The ingredients blended nicely, and I was quite pleased w/ my creation (even the next day). My one small complaint is that the bottom of my pizza was about half burned.\n\nI would absolutely recommend trying this truck. Like I said, theres never a lone when they come to the block party at ASU- TAKE ADVANTAGE OF THAT!!",4,1


In [109]:
data_df.dtypes

text         object
stars        int64 
sentiment    int64 
dtype: object

In [110]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=2000) 
# default option values: filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' '
tokenizer.fit_on_texts(data_df.text.values)

In [111]:
# Map each text to a list of integer word indices (X holds one list per text).
# Only the num_words - 1 most frequent words are kept; rarer words are dropped.
X = tokenizer.texts_to_sequences(data_df.text.values) 
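To make the two tokenizer steps more concrete, here is a minimal pure-Python sketch of what we understand `fit_on_texts` and `texts_to_sequences` to do with the default options: lowercase the text, replace the filter characters with spaces, split on whitespace, rank words by frequency starting at index 1, and drop any word whose index is not below `num_words`. The helper names and the toy texts are hypothetical, and this is an approximation of the Keras behavior, not its implementation.

```python
from collections import Counter

# Same characters as the default `filters` option shown above.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TABLE = str.maketrans({ch: ' ' for ch in FILTERS})


def to_words(text):
    """Lowercase, replace filtered characters by spaces, split on whitespace."""
    return text.lower().translate(_TABLE).split()


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(w for t in texts for w in to_words(t))
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Keep only words whose index is below num_words; drop the others."""
    return [
        [word_index[w] for w in to_words(t)
         if w in word_index and word_index[w] < num_words]
        for t in texts
    ]


toy_texts = ["Great food, great service!", "Bad food."]
toy_index = fit_word_index(toy_texts)
toy_seqs = texts_to_sequences(toy_texts, toy_index, num_words=3)
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value later on.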

In [118]:
X[:3]

[[93, 15, 3, 23, 4, 407, 14, 35, 1608, 20, 12, 904, 2, 199, 370, 380, 2, 1479,
  413, 765, 2, 1253, 284, 138, 3, 60, 1097, 444, 10, 1275],
 [37, 3, 647, 50, 3, 96, 685, 5, 1342, 9, 122, 35, 9, 83, 3, 21, 853, 2, 134,
  1150, 2, 17, 21, 1, 509, 684, 3, 21, 853, 145, 159, 64, 46, 71, 13, 644, 248,
  20, 4, 288, 9, 208, 3, 157, 16, 71, 37, 1640, 13, 1, 852, 329, 2, 121, 15,
  288, 246, 43, 3, 144, 436, 1877, 9, 4, 1342, 2, 93, 48, 527, 18, 15, 9, 46,
  30, 17, 232, 334, 1089, 38],
 [3, 23, 5, 105, 5, 15, 30, 76, 8, 6, 20, 28, 62, 55, 117, 141, 5, 770, 65, 3,
  97, 11, 19, 27, 11, 4, 1973, 32, 19, ...

In [119]:
X = pad_sequences(X)
X[:3]

array([[   0,    0,    0, ...,  444,   10, 1275],
       [   0,    0,    0, ...,  334, 1089,   38],
       [   0,    0,    0, ...,  180,  169,   44]], dtype=int32)
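The output above shows the default behavior of `pad_sequences`: shorter sequences are padded with zeros on the left until they reach the length of the longest one. A minimal pure-Python sketch of that behavior follows; the function name `pad_pre` is hypothetical, it assumes `maxlen` is at least 1 when given, and it returns plain lists where Keras returns a NumPy array.

```python
def pad_pre(seqs, maxlen=None, value=0):
    """Left-pad (and left-truncate) sequences to a common length."""
    if maxlen is None:
        # Default: use the length of the longest sequence.
        maxlen = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        kept = list(s)[-maxlen:]  # keep the last maxlen items
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


toy_padded = pad_pre([[1, 2, 3], [4]])
toy_capped = pad_pre([[1, 2, 3], [4]], maxlen=2)
```

Passing a `maxlen` smaller than the longest review would shorten every input sequence, at the cost of cutting off the beginning of long reviews.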

In [120]:
X.shape

(100000, 941)

In [137]:
from keras.utils import to_categorical

Y = to_categorical(data_df.sentiment.values)  # -> one-hot-encoding
Y.shape

(100000, 2)
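The shape (100000, 2) comes from one-hot encoding: each label becomes a row with a 1 in the column of its class and 0 elsewhere. A minimal sketch of that mapping, with a hypothetical helper `one_hot` that returns plain lists where `to_categorical` returns a NumPy array:

```python
def one_hot(labels, num_classes=None):
    """Turn integer class labels into one-hot rows."""
    if num_classes is None:
        # Assume classes are numbered 0 .. max(labels).
        num_classes = max(labels) + 1
    return [
        [1.0 if c == label else 0.0 for c in range(num_classes)]
        for label in labels
    ]


toy_one_hot = one_hot([1, 0, 1])
```

Here sentiment 0 maps to the first column and sentiment 1 to the second, which matches the two-unit softmax output used in the model.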

### The LSTM model

In [121]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM

In [125]:
embed_dim = 128
lstm_out = 196

In [147]:
model = Sequential()
model.add(Embedding(2000, embed_dim, input_length = X.shape[1], name ='embedding_1'))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2, name='lstm_1'))
model.add(Dense(2, activation='softmax', name='dense_1'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics =['accuracy'])

In [148]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 941, 128)          256000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 196)               254800    
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 394       
Total params: 511,194
Trainable params: 511,194
Non-trainable params: 0
_________________________________________________________________
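The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types (an LSTM has four gates, each with an input kernel, a recurrent kernel and a bias); the helper function names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary index.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


counts = {
    "embedding_1": embedding_params(2000, 128),
    "lstm_1": lstm_params(128, 196),
    "dense_1": dense_params(196, 2),
}
total_params = sum(counts.values())
```

Note that the sequence length (941) does not appear in any of these formulas: the LSTM reuses the same weights at every time step.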


In [149]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=11)

print([E.shape for E in [X_train, Y_train, X_test, Y_test]])

[(80000, 941), (80000, 2), (20000, 941), (20000, 2)]
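For intuition, a shuffled 80/20 split amounts to permuting the row indices and cutting the permutation in two. The sketch below illustrates the idea with the standard library only; `shuffled_split` is a hypothetical helper, and it will not reproduce the exact rows that `train_test_split` selects with `random_state=11`.

```python
import random


def shuffled_split(n_samples, test_size=0.2, seed=11):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffled_split(100000)
```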


In [150]:
batch_size = 32
epochs = 1

model.fit(X_train, Y_train, batch_size = batch_size, epochs=epochs, verbose=1)

Epoch 1/1


<keras.callbacks.History at 0x1a3332c2e8>

In [151]:
model.save('LSTM_binary_clf_model.h5')

In [152]:
loss, acc = model.evaluate(X_test, Y_test, batch_size=batch_size)



In [153]:
print("Test loss: {} \nTest accuracy: {}".format(loss,acc))

Test loss: 0.26389781827926634 
Test accuracy: 0.89275


About 89% accuracy on the 20,000 held-out test reviews after a single training epoch! The LSTM does rather well on this task.
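To put an accuracy figure in context, it helps to compare it with the accuracy of always predicting the most frequent class. The notebook does not report the class balance of small_df, so the sketch below only shows how such a baseline could be computed; the helper name and the toy labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


toy_labels = [1, 1, 1, 0]  # toy data, not the Yelp labels
toy_baseline = majority_baseline_accuracy(toy_labels)
```

On the real data, the same function could be applied to `data_df.sentiment.values`.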