# Sentiment classification for movie reviews

Sentiment analysis with Recurrent Neural Networks, using the dataset published at http://ai.stanford.edu/~amaas/data/sentiment/ and consolidated into a single .csv file by https://www.kaggle.com/utathya/imdb-review-dataset

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("imdb_master.csv", encoding='latin-1', index_col = 0)
data.head()

Unnamed: 0,type,review,label,file
0,test,Once again Mr. Costner has dragged out a movie...,neg,0_2.txt
1,test,This is an example of why the majority of acti...,neg,10000_4.txt
2,test,"First of all I hate those moronic rappers, who...",neg,10001_1.txt
3,test,Not even the Beatles could write songs everyon...,neg,10002_3.txt
4,test,Brass pictures (movies is not a fitting word f...,neg,10003_3.txt


In [3]:
data_labeled = data[data.label != 'unsup']
del data

In [4]:
print("Total number of reviews --> ",len(data_labeled))
print("Total number of positive reviews --> ",len(data_labeled[data_labeled["label"]=='pos']))
print("Total number of negative reviews --> ",len(data_labeled[data_labeled["label"]=='neg']))

Total number of reviews -->  50000
Total number of positive reviews -->  25000
Total number of negative reviews -->  25000


In [5]:
y=data_labeled['label'].apply(lambda x: 0 if x == 'neg' else 1)

To design the network architecture without excessive training time, we split data_labeled into a large set and a small set. We set aside 40000 reviews to train the network more thoroughly later on. For now we work with 10000 reviews, 2500 of which form our test set.

In [6]:
from sklearn.model_selection import train_test_split

# Separate the test set from the pool of training data
reviews_large, reviews_test, y_large, y_test = train_test_split(data_labeled['review'], y, test_size=2500, stratify=y)

# Extract a training subset of only 7500 reviews
reviews_rest, reviews_train, y_rest, y_train = train_test_split(reviews_large, y_large, test_size=7500, stratify=y_large)
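The `stratify` argument keeps the positive/negative ratio the same in every split. The following is a minimal pure-Python sketch of that idea on toy data; `stratified_split` is a hypothetical helper written for illustration and is not how scikit-learn implements it.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, test_size, seed=0):
    """Toy stratified split: sample from each class in proportion to its share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    test_idx = []
    for label, idxs in by_label.items():
        # Test samples owed to this class (assumes the share divides evenly).
        n = test_size * len(idxs) // len(labels)
        test_idx.extend(rng.sample(idxs, n))

    test_set = set(test_idx)
    train_idx = [i for i in range(len(items)) if i not in test_set]
    return train_idx, sorted(test_idx)

# Toy data with the same 50/50 balance as the labeled reviews.
labels = [0] * 50 + [1] * 50
items = list(range(100))
train_idx, test_idx = stratified_split(items, labels, test_size=20)

print(len(train_idx), len(test_idx))     # 80 20
print(sum(labels[i] for i in test_idx))  # 10 positives out of 20
```

Applied to the sizes above, the first split leaves 50000 - 2500 = 47500 reviews in `reviews_large`, and the second leaves 47500 - 7500 = 40000 in `reviews_rest`.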

### Text processing

In [7]:
import keras

Using TensorFlow backend.


In [8]:
max_dic = 1000 # Maximum number of words in our dictionary.

# The Keras Tokenizer lets us keep only the most frequent words across all reviews
diccionario = keras.preprocessing.text.Tokenizer(num_words = max_dic)
diccionario.fit_on_texts(reviews_train)
# For each review, build a vector of integers, one dictionary index per word
X_train = diccionario.texts_to_sequences(reviews_train)

# Do the same for the test set
X_test = diccionario.texts_to_sequences(reviews_test)
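To make the tokenization step concrete, here is a simplified pure-Python sketch of the same idea: rank words by frequency, then map each review to the indices of its words, discarding anything outside the vocabulary. The helpers `fit_vocabulary` and `texts_to_sequences` are hypothetical stand-ins; the real Keras `Tokenizer` also strips punctuation, which this sketch does not attempt.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Rank words by frequency; index 1 is the most frequent (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping unknown words and indices >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

train = ["the movie was great", "the movie was bad", "the plot was great"]
word_index = fit_vocabulary(train)

# "plot" is too rare for num_words=4 and "awful" was never seen, so both are dropped.
print(texts_to_sequences(["the plot was awful"], word_index, num_words=4))  # [[1, 2]]
```

Because the vocabulary is fit on the training reviews only, words that appear solely in the test reviews are silently dropped, as "awful" is here.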

In [10]:
# It is advisable for all reviews to have the same length in words
max_palabras=300
X_train=keras.preprocessing.sequence.pad_sequences(X_train,maxlen=max_palabras)
X_test=keras.preprocessing.sequence.pad_sequences(X_test,maxlen=max_palabras)
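With its default settings, `pad_sequences` pads short sequences with zeros at the front and truncates long ones from the front, so the end of each review is what survives. A minimal sketch of that behaviour, using a hypothetical `pad_sequence` helper for a single review:

```python
def pad_sequence(seq, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    seq = list(seq)[-maxlen:]                    # truncate from the front
    return [value] * (maxlen - len(seq)) + seq   # pad at the front

print(pad_sequence([5, 8, 2], maxlen=5))              # [0, 0, 5, 8, 2]
print(pad_sequence([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```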

### Estructura RNN

In [11]:
red_neuronal=keras.models.Sequential()

# First layer: an embedding of dimension 64
red_neuronal.add(keras.layers.embeddings.Embedding(input_dim=max_dic, input_length=max_palabras, output_dim=64))

# Second layer: an LSTM with 32 units. It returns a single vector after processing the whole sequence
red_neuronal.add(keras.layers.recurrent.LSTM(32))

# Last layer: returns a value between 0 and 1
red_neuronal.add(keras.layers.core.Dense(1))
red_neuronal.add(keras.layers.core.Activation('sigmoid'))

red_neuronal.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 64)           64000     
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                12416     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0         
Total params: 76,449
Trainable params: 76,449
Non-trainable params: 0
_________________________________________________________________
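The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for each layer type; the function names are illustrative only.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per dictionary index.
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

emb = embedding_params(1000, 64)
lstm = lstm_params(64, 32)
dense = dense_params(32, 1)
print(emb, lstm, dense, emb + lstm + dense)  # 64000 12416 33 76449
```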


In [12]:
red_neuronal.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
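The `binary_crossentropy` loss penalizes confident wrong predictions much more heavily than hesitant ones. As a rough illustration, assuming labels in {0, 1} and predicted probabilities from the sigmoid output, it can be written in plain Python as follows (the clipping constant `eps` is an assumption of this sketch):

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right < confident_wrong)  # True
```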

In [13]:
red_neuronal.fit(X_train, y_train, batch_size=32, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x272b5b60128>

In [14]:
validacion=red_neuronal.evaluate(X_test, y_test)
print("Test loss", validacion[0])
print("Test accuracy", validacion[1])

Test loss 0.41100694622993467
Test accuracy 0.8168
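For a sigmoid output, the accuracy metric is the fraction of reviews whose predicted probability falls on the correct side of a threshold, assumed here to be 0.5. A small sketch with made-up probabilities:

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability equals the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]
y_prob = [0.92, 0.31, 0.44, 0.08]  # hypothetical sigmoid outputs
print(binary_accuracy(y_true, y_prob))  # 0.75
```

On the 2500 test reviews, an accuracy of 0.8168 corresponds to 2042 correctly classified reviews.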


We observe that accuracy on the training set is much higher than on the test set, which is a sign of overfitting. To address this, we will use Dropout.
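As a reminder of what dropout does, the sketch below implements the common "inverted dropout" formulation: during training each activation is zeroed with probability `rate` and the survivors are rescaled, while at inference the layer is the identity. The `dropout` function and its `seed` parameter are illustrative assumptions, not the Keras implementation.

```python
import random

def dropout(values, rate, training=True, seed=0):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training:
        return list(values)  # inference: pass activations through unchanged
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

activations = [1.0] * 10
dropped = dropout(activations, rate=0.45)
print(dropped)                                          # mix of 0.0 and 1/0.55
print(dropout(activations, rate=0.45, training=False))  # unchanged at inference
```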

In [19]:
red_neuronal=keras.models.Sequential()
red_neuronal.add(keras.layers.embeddings.Embedding(input_dim=max_dic, input_length=max_palabras, output_dim=64))
red_neuronal.add(keras.layers.core.Dropout(0.45))
red_neuronal.add(keras.layers.recurrent.LSTM(32,recurrent_dropout=0.45))
red_neuronal.add(keras.layers.core.Dropout(0.45))
red_neuronal.add(keras.layers.core.Dense(1))
red_neuronal.add(keras.layers.core.Activation('sigmoid'))
red_neuronal.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 300, 64)           64000     
_________________________________________________________________
dropout_2 (Dropout)          (None, 300, 64)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 32)                12416     
_________________________________________________________________
dropout_3 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
_________________________________________________________________
activation_2 (Activation)    (None, 1)                 0         
Total params: 76,449
Trainable params: 76,449
Non-trainable params: 0
_________________________________________________________________


In [20]:
red_neuronal.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
red_neuronal.fit(X_train, y_train, batch_size=32, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x272be31d470>

In [21]:
validacion=red_neuronal.evaluate(X_test, y_test)
print("Test loss", validacion[0])
print("Test accuracy", validacion[1])

Test loss 0.44726420788764953
Test accuracy 0.7952


Overfitting is now better controlled, although test accuracy drops slightly (from 0.8168 to 0.7952). Next we add a second LSTM layer while keeping the dropout layers.

In [23]:
red_neuronal=keras.models.Sequential()
red_neuronal.add(keras.layers.embeddings.Embedding(input_dim=max_dic, input_length=max_palabras, output_dim=64))
red_neuronal.add(keras.layers.core.Dropout(0.45))
red_neuronal.add(keras.layers.recurrent.LSTM(32,recurrent_dropout=0.45,return_sequences=True))
red_neuronal.add(keras.layers.recurrent.LSTM(32))
red_neuronal.add(keras.layers.core.Dropout(0.45))
red_neuronal.add(keras.layers.core.Dense(1))
red_neuronal.add(keras.layers.core.Activation('sigmoid'))
red_neuronal.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 300, 64)           64000     
_________________________________________________________________
dropout_5 (Dropout)          (None, 300, 64)           0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 300, 32)           12416     
_________________________________________________________________
lstm_7 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dropout_6 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
_________________________________________________________________
activation_3 (Activation)    (None, 1)                 0         
Total para
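Stacking recurrent layers requires the first one to emit its state at every time step, which is what `return_sequences=True` does; this is why `lstm_6` has output shape `(None, 300, 32)` while `lstm_7` returns only `(None, 32)`. The toy recurrence below is not an LSTM; it is a one-unit stand-in that only illustrates the difference between returning every state and returning the last one.

```python
def toy_recurrent_layer(sequence, return_sequences=False):
    """Toy recurrence h_t = 0.5 * h_{t-1} + x_t with a single unit (not an LSTM)."""
    h = 0.0
    states = []
    for x in sequence:
        h = 0.5 * h + x
        states.append(h)
    # Either one state per time step, or only the final state.
    return states if return_sequences else states[-1]

seq = [1.0, 2.0, 4.0]
print(toy_recurrent_layer(seq, return_sequences=True))  # [1.0, 2.5, 5.25]
print(toy_recurrent_layer(seq))                         # 5.25
```

The second LSTM receives 32-dimensional inputs instead of 64-dimensional embeddings, which accounts for its smaller parameter count: 4 * ((32 + 32) * 32 + 32) = 8320.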

In [24]:
red_neuronal.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
red_neuronal.fit(X_train, y_train, batch_size=32, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x272c0647940>

In [25]:
validacion=red_neuronal.evaluate(X_test, y_test)
print("Test loss", validacion[0])
print("Test accuracy", validacion[1])

Test loss 0.4061969689130783
Test accuracy 0.8132


Adding a layer does not improve accuracy on the test set. Next, we train a model on the 40000 reviews we have not used so far, together with our current training set.

In [26]:
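# We should use the large dataset to define a new dictionary.
# Note: as written, the Tokenizer below is fit on reviews_train, so its vocabulary matches `diccionario`;
# pass reviews_large to fit_on_texts to build the vocabulary from the full training set instead.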
diccionario_large = keras.preprocessing.text.Tokenizer(num_words = max_dic)
diccionario_large.fit_on_texts(reviews_train)
X_train_large = diccionario_large.texts_to_sequences(reviews_large)
X_test_large = diccionario_large.texts_to_sequences(reviews_test)

In [27]:
max_palabras=300
X_train_large=keras.preprocessing.sequence.pad_sequences(X_train_large,maxlen=max_palabras)
X_test_large=keras.preprocessing.sequence.pad_sequences(X_test_large,maxlen=max_palabras)

In [28]:
red_neuronal=keras.models.Sequential()
red_neuronal.add(keras.layers.embeddings.Embedding(input_dim=max_dic, input_length=max_palabras, output_dim=64))
red_neuronal.add(keras.layers.core.Dropout(0.45))
red_neuronal.add(keras.layers.recurrent.LSTM(32,recurrent_dropout=0.45))
red_neuronal.add(keras.layers.core.Dropout(0.45))
red_neuronal.add(keras.layers.core.Dense(1))
red_neuronal.add(keras.layers.core.Activation('sigmoid'))
red_neuronal.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 300, 64)           64000     
_________________________________________________________________
dropout_7 (Dropout)          (None, 300, 64)           0         
_________________________________________________________________
lstm_8 (LSTM)                (None, 32)                12416     
_________________________________________________________________
dropout_8 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 33        
_________________________________________________________________
activation_4 (Activation)    (None, 1)                 0         
Total params: 76,449
Trainable params: 76,449
Non-trainable params: 0
_________________________________________________________________


In [29]:
red_neuronal.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])
red_neuronal.fit(X_train_large, y_large, batch_size=32, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x272c303dba8>

In [30]:
validacion=red_neuronal.evaluate(X_test_large, y_test)
print("Test loss", validacion[0])
print("Test accuracy", validacion[1])

Test loss 0.35239518189430236
Test accuracy 0.8472
