https://vgpena.github.io/classifying-tweets-with-keras-and-tensorflow/

The link above gives an example that starts from tweets carrying a specific label (a sentiment, positive or negative) and proceeds as follows:

1. It builds a training set. The training set is made from whole tweets, each converted to an array of a fixed size.
2. Each array (X_train, of size N) has a label that represents its sentiment (y_train).
3. Because every sentence is encoded with the same size N, the input of the neural network has size N and the output has size 2 with softmax activation (because there are two classes).
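
Step 3 can be illustrated with a small sketch of the softmax function in plain Python. The two input scores are made-up values, not outputs from the linked post; the only point is that a two-class output layer yields two non-negative numbers that sum to 1.

```python
import math


def softmax(scores):
    # Subtract the maximum first so the exponentials cannot overflow.
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for the classes (negative, positive).
probs = softmax([0.3, 2.1])
print(probs)
```

The larger score receives the larger probability, so picking the index of the maximum gives the predicted class.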

Task:

- Build a review classifier for the IMDB dataset in the data_exercise/ folder

**Wherever the code imports "keras.x", replace it with "tensorflow.keras.x"**

Complex is better than complicated.

Flat is better than nested.

- Tokenized: 

[['complex', 'is', 'better', 'than', 'complicated', 'flat', 'is', 'better', 'than', 'nested']]

- In a lookup dictionary: 


  {'complex': 0,
  'is': 1,
  'better': 2,
  'than': 3,
  'complicated': 4,
  'flat': 5,
  'nested': 6}

- In one-hot encoding:

[
  [1, 0, 0, 0, 0, 0, 0], #complex
  [0, 1, 0, 0, 0, 0, 0], #is
  [0, 0, 1, 0, 0, 0, 0], #better
  [0, 0, 0, 1, 0, 0, 0], #than
  [0, 0, 0, 0, 1, 0, 0], #complicated
  [0, 0, 0, 0, 0, 1, 0], #flat
  [0, 0, 0, 0, 0, 0, 1], #nested
]
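
The three representations above can be reproduced with a few lines of plain Python. This is a minimal sketch: the regular expression is a simplification of what Keras' `text_to_word_sequence` does, and the helper name `one_hot` is chosen here for illustration. Note also that the Keras `Tokenizer` used later numbers words from 1 and orders them by frequency, so its IDs differ from this toy dictionary.

```python
import re

text = "Complex is better than complicated. Flat is better than nested."

# Lowercase the text and keep only runs of letters as tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Give every new word the next free index, in order of first appearance.
lookup = {}
for word in tokens:
    lookup.setdefault(word, len(lookup))


def one_hot(word):
    # A vector of zeros with a single 1 at the word's index.
    vector = [0] * len(lookup)
    vector[lookup[word]] = 1
    return vector


for word in lookup:
    print(one_hot(word), "#" + word)
```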

# Exercise


In [3]:
# Libraries
# Keras is imported through tensorflow.keras, as the exercise requires
import json

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tensorflow import keras
from tensorflow.keras.preprocessing import text as kpt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential, model_from_json
from tensorflow.keras.layers import Dense, Dropout



In [4]:
# Data
df = pd.read_csv('data_exercise/IMDB_Dataset.csv')

In [5]:
df.shape

(50000, 2)

In [6]:
df.head()

   review                                             sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [7]:
# Defining X and y and encoding y to numbers
le = LabelEncoder()
X_train = np.array(df['review'])
y_train = np.array(df['sentiment'])
y_train = le.fit_transform(y_train)

In [8]:
print(X_train.shape)
print(y_train.shape)

(50000,)
(50000,)
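
`LabelEncoder` assigns integer codes in sorted class order, so with these two labels `negative` becomes 0 and `positive` becomes 1. The sketch below mimics that mapping in plain Python on a made-up list of labels; the variable names are illustrative and it is not the scikit-learn implementation.

```python
# Made-up labels standing in for the 'sentiment' column.
sentiments = ["positive", "positive", "negative", "positive"]

# Sorted unique classes define the integer codes.
classes = sorted(set(sentiments))
class_to_id = {name: index for index, name in enumerate(classes)}

encoded = [class_to_id[s] for s in sentiments]
print(classes)
print(encoded)
```

The order of `classes` is what later lets a list such as `['negative', 'positive']` translate a predicted index back into a label.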


In [9]:
# only work with the most frequent words found in our dataset
# (with num_words=max_words, Keras keeps the top max_words - 1 words)
max_words = 3000

# create a new Tokenizer
tokenizer = Tokenizer(num_words=max_words)

# feed our training data to the Tokenizer
tokenizer.fit_on_texts(X_train)

# Tokenizers come with a convenient list of words and IDs
dictionary = tokenizer.word_index

# Saving dictionary to json file so we can use it later
with open('dictionary.json', 'w') as dictionary_file:
    json.dump(dictionary, dictionary_file)


def convert_text_to_index_array(text):
    # kpt.text_to_word_sequence splits a sentence into a list of lowercase words
    # Map each word to its ID in the dictionary, skipping words that were never
    # seen during fitting (otherwise a new input sentence would raise a KeyError)
    return [dictionary[word] for word in kpt.text_to_word_sequence(text) if word in dictionary]

allWordIndices = []
# for each review, change each token to its ID in the Tokenizer's word_index
for words in X_train:
    wordIndices = convert_text_to_index_array(words)
    allWordIndices.append(wordIndices)

# allWordIndices stays a plain list of lists: the reviews differ in length, and
# recent NumPy versions refuse to build a regular array from ragged sequences

# create a binary bag-of-words matrix (one row per review) out of the indexed reviews
X_train = tokenizer.sequences_to_matrix(allWordIndices, mode='binary')

# treat the labels as categories
y_train = keras.utils.to_categorical(y_train, 2)
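
To make the two encoding steps concrete, here is a plain-Python sketch of what `sequences_to_matrix(..., mode='binary')` and `to_categorical` produce. The helper names and the toy sequences are assumptions made for illustration; the real Keras functions return NumPy float arrays rather than lists of ints.

```python
def sequences_to_binary_matrix(sequences, num_words):
    # One row per sequence; column j is 1 if word ID j occurs at least once.
    matrix = []
    for sequence in sequences:
        row = [0] * num_words
        for index in sequence:
            if index < num_words:  # IDs at or above num_words are ignored
                row[index] = 1
        matrix.append(row)
    return matrix


def to_one_hot(labels, num_classes):
    # Turn each integer label into a vector with a single 1.
    return [[1 if c == label else 0 for c in range(num_classes)] for label in labels]


toy_sequences = [[1, 3, 3], [2, 7]]  # the ID 7 falls outside num_words=5
X = sequences_to_binary_matrix(toy_sequences, num_words=5)
y = to_one_hot([1, 0], num_classes=2)
print(X)
print(y)
```

In binary mode word order and repetition are discarded, so each review becomes a fixed-length vector regardless of its original length.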

In [10]:
# Creating a model

model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

In [11]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 512)               1536512   
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               131328    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 514       
=================================================================
Total params: 1,668,354
Trainable params: 1,668,354
Non-trainable params: 0
_________________________________________________________________
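
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout layers add no parameters. A short sketch of that arithmetic, with `dense_params` as an illustrative helper name:

```python
def dense_params(n_inputs, n_units):
    # Weights (n_inputs * n_units) plus one bias per unit.
    return n_inputs * n_units + n_units


# Input size (max_words) followed by the sizes of the three Dense layers.
layer_sizes = [3000, 512, 256, 2]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```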


In [12]:
# Compiling the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [13]:
# Fitting the model
model.fit(X_train, y_train, batch_size=32, epochs=5, verbose=1, validation_split=0.1, shuffle=True)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fc51f1033d0>
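
The sizes implied by the `fit` call can be worked out directly. This sketch assumes Keras' documented behaviour that `validation_split` holds out the final fraction of the samples, taken before any shuffling, so the order of rows in the CSV determines which reviews end up in the validation set.

```python
import math

n_samples = 50000
validation_split = 0.1
batch_size = 32

# Samples kept for training and samples held out for validation.
n_train = round(n_samples * (1 - validation_split))
n_val = n_samples - n_train

# Number of weight updates per epoch; the last batch may be smaller.
steps_per_epoch = math.ceil(n_train / batch_size)
print(n_train, n_val, steps_per_epoch)
```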

In [14]:
# Saving the model
model_json = model.to_json()
with open('model_IMDB.json', 'w') as json_file:
    json_file.write(model_json)

model.save_weights('model_IMDB.h5')

In [15]:
# Opening the saved model
with open('model_IMDB.json', 'r') as json_file:
    loaded_model_json = json_file.read()
model = model_from_json(loaded_model_json)

# and load the saved weights into the rebuilt architecture
model.load_weights('model_IMDB.h5')
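
When the model is reloaded in a fresh session, the word dictionary saved earlier as `dictionary.json` has to be restored as well, since the network only understands word IDs. The sketch below shows that round trip with a toy dictionary written to a temporary folder; the file location, the dictionary contents and the helper `text_to_indices` are assumptions for illustration, and the whitespace split is simpler than Keras' own tokenisation.

```python
import json
import os
import tempfile

# Toy stand-in for tokenizer.word_index.
toy_dictionary = {"good": 1, "movie": 2, "bad": 3}

# Save it the same way the notebook does, but in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "dictionary.json")
with open(path, "w") as dictionary_file:
    json.dump(toy_dictionary, dictionary_file)

# Later (for example in another session): load it back.
with open(path) as dictionary_file:
    restored = json.load(dictionary_file)


def text_to_indices(text, word_index):
    # Lowercase, split on whitespace and drop words missing from the dictionary.
    return [word_index[w] for w in text.lower().split() if w in word_index]


indices = text_to_indices("Good movie overall", restored)
print(indices)
```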

In [17]:
# Now running the model
labels = ['negative', 'positive']
while True:
    evalSentence = input('Input a sentence to be evaluated, or Enter to quit: ')

    if len(evalSentence) == 0:
        break

    # format your input for the neural net, function defined above
    testArr = convert_text_to_index_array(evalSentence)
    inp = tokenizer.sequences_to_matrix([testArr], mode='binary')

    # predict which bucket your input belongs in
    pred = model.predict(inp)
    print('Sentence:', evalSentence)
    print("%s sentiment; %f%% confidence" % (labels[np.argmax(pred)], pred[0][np.argmax(pred)] * 100))

Sentence: good movie
positive sentiment; 92.949563% confidence
Sentence: bad movie
negative sentiment; 99.963331% confidence
Sentence: loved it
positive sentiment; 99.962723% confidence
Sentence: hated it
negative sentiment; 99.296749% confidence
Sentence: high expectations
negative sentiment; 52.942806% confidence
Sentence: great
positive sentiment; 99.918979% confidence
Sentence: positive review
negative sentiment; 94.821256% confidence
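
The printed label and percentage come from the index and value of the largest softmax output. A plain-Python sketch of that decoding step, using a made-up probability vector and an illustrative helper name `decode_prediction`:

```python
labels = ["negative", "positive"]


def decode_prediction(probabilities):
    # Index of the largest probability selects the label.
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best], probabilities[best] * 100


# Hypothetical model output for one sentence.
label, confidence = decode_prediction([0.25, 0.75])
message = "%s sentiment; %f%% confidence" % (label, confidence)
print(message)
```

The percentage is the model's own softmax score, which is not necessarily a calibrated probability, so a high value such as the one for "positive review" above can still accompany a wrong label.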
