# Tweets tutorial

https://vgpena.github.io/classifying-tweets-with-keras-and-tensorflow/

The link above walks through an example that starts from tweets carrying a specific label (a sentiment, positive or negative) and does the following:

1. It builds a training set. The training set is made of whole tweets, each converted to an array of a fixed size.
2. Each array (X_train, of size N) has a label that represents its sentiment (y_train).
3. Since every tweet is encoded with the same size N, the input of the neural network has size N and its output has size 2, using softmax activation (because there are two classes).

Complex is better than complicated.

Flat is better than nested.

- Tokenized: 

[['complex', 'is', 'better', 'than', 'complicated', 'flat', 'is', 'better', 'than', 'nested']]

- In a lookup dictionary: 


  {'complex': 0,
  'is': 1,
  'better': 2,
  'than': 3,
  'complicated': 4,
  'flat': 5,
  'nested': 6}

- In one-hot encoding:

[
  [1, 0, 0, 0, 0, 0, 0], #complex

  [0, 1, 0, 0, 0, 0, 0], #is

  [0, 0, 1, 0, 0, 0, 0], #better

  [0, 0, 0, 1, 0, 0, 0], #than

  [0, 0, 0, 0, 1, 0, 0], #complicated

  [0, 0, 0, 0, 0, 1, 0], #flat

  [0, 0, 0, 0, 0, 0, 1]  #nested
]
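
The three representations above can be reproduced with a few lines of plain Python and NumPy. This is only a minimal sketch: the `tokenize` helper is a hypothetical, simplified stand-in for a real tokenizer, and the indices start at 0 to match the example (Keras' own `Tokenizer` numbers words differently).

```python
import numpy as np

sentences = ["Complex is better than complicated.", "Flat is better than nested."]

def tokenize(text):
    # Hypothetical helper: lowercase each word and strip basic punctuation
    return [w.strip(".,!?").lower() for w in text.split()]

# Tokenized: one flat list with the words of both sentences
tokens = [w for s in sentences for w in tokenize(s)]

# Lookup dictionary: each new word gets the next free index,
# in order of first appearance
lookup = {}
for w in tokens:
    lookup.setdefault(w, len(lookup))

# One-hot encoding: row i of the identity matrix encodes the word with index i
one_hot = np.eye(len(lookup), dtype=int)

print(tokens)
print(lookup)
print(one_hot[lookup["flat"]])
```

Repeated words such as `is`, `better` and `than` keep the index they received the first time, which is why ten tokens produce only seven dictionary entries.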

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Importing the dataset

# notice the cool options to skip lines at the beginning
# and to only take data from certain columns
# encoding='utf-8' makes genfromtxt return str instead of bytes on Python 3
training = np.genfromtxt('data_exercise/Sentiment_Analysis.csv', delimiter=',', skip_header=1, usecols=(1, 3), dtype=None, encoding='utf-8')

# create our training data from the tweets
train_x = [x[1] for x in training]
# index all the sentiment labels
train_y = np.asarray([x[0] for x in training])

In [None]:
# First column is sentiment (0=sad, 1=happy)
pd.DataFrame({'Sentiment': train_y, 'Text': train_x})

In [None]:
import json
import keras
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer

# only work with the 3000 most popular words found in our dataset
max_words = 3000

# create a new Tokenizer
tokenizer = Tokenizer(num_words=max_words)
# feed our tweets to the Tokenizer
tokenizer.fit_on_texts(train_x)

# Tokenizers come with a convenient list of words and IDs
dictionary = tokenizer.word_index
# Let's save this out so we can use it later
with open('dictionary.json', 'w') as dictionary_file:
    json.dump(dictionary, dictionary_file)


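def convert_text_to_index_array(text):
    # `text_to_word_sequence` lowercases the text, strips punctuation and
    # splits it into a list of words. It does NOT pad the texts to a common
    # length; the fixed-size vectors are built later by `sequences_to_matrix`.
    # Words missing from the dictionary (possible at prediction time) are skipped.
    return [dictionary[word] for word in kpt.text_to_word_sequence(text) if word in dictionary]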

allWordIndices = []
# for each tweet, change each token to its ID in the Tokenizer's word_index
for text in train_x:
    wordIndices = convert_text_to_index_array(text)
    allWordIndices.append(wordIndices)

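# now we have a list of all tweets converted to index arrays.
# cast as an array for future usage. The tweets have different lengths,
# so dtype=object is needed to build an array of lists in recent NumPy versions.
allWordIndices = np.asarray(allWordIndices, dtype=object)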

# create one-hot matrices out of the indexed tweets
train_x = tokenizer.sequences_to_matrix(allWordIndices, mode='binary')
# treat the labels as categories
train_y = keras.utils.to_categorical(train_y, 2)
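
To make the last two calls less opaque, here is a rough NumPy-only sketch of what `sequences_to_matrix(..., mode='binary')` and `to_categorical` produce. The function names `sequences_to_binary_matrix` and `to_one_hot` and the toy data are made up for illustration; they are not part of Keras.

```python
import numpy as np

def sequences_to_binary_matrix(sequences, num_words):
    # One row per sequence, one column per word index below num_words
    matrix = np.zeros((len(sequences), num_words))
    for row, seq in enumerate(sequences):
        for idx in seq:
            if idx < num_words:  # indices outside the vocabulary are dropped
                matrix[row, idx] = 1.0  # 1 if the word appears at all
    return matrix

def to_one_hot(labels, num_classes):
    # Row i has a single 1 in column labels[i]
    return np.eye(num_classes)[np.asarray(labels, dtype=int)]

# Toy data: three "tweets" of different lengths and their labels
toy_sequences = [[1, 2, 2], [3], [1, 4, 9]]
toy_x = sequences_to_binary_matrix(toy_sequences, num_words=5)
toy_y = to_one_hot([0, 1, 1], num_classes=2)

print(toy_x)
print(toy_y)
```

Every row of `toy_x` has the same width regardless of the tweet length, and this is what gives the network a fixed-size input. Word order and repetition counts are lost in binary mode.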

In [None]:
train_x

In [None]:
# One binary vector per tweet: a 1 marks each of the 3000 most frequent words that appears in it
train_x.shape

In [None]:
train_y

In [None]:
# One-hot values of the sentiments: [1, 0] --> 0 (Sad)   [0, 1] --> 1 (Happy)
train_y.shape

In [None]:
# The token IDs of each word, one list per tweet
allWordIndices

In [None]:
# One list per tweet
allWordIndices.shape

In [None]:
# The first tweet
allWordIndices[0]

In [None]:
# The first tweet has 7 words
len(allWordIndices[0])

In [None]:
# Creating a model
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))

In [None]:
model.summary()
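
The parameter counts that the summary reports can be checked by hand. Each `Dense` layer has one weight per input/unit pair plus one bias per unit, and `Dropout` layers add no parameters. The sketch below assumes `max_words = 3000` as set earlier; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    # weights (n_inputs * n_units) plus one bias per unit
    return n_inputs * n_units + n_units

# Input size followed by the units of each Dense layer in the model
layer_sizes = [3000, 512, 256, 2]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [1536512, 131328, 514]
print(total)      # 1668354
```

Almost all of the parameters sit in the first layer, because it connects the 3000-wide input vector to 512 units.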

In [None]:
# Compiling the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
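
As a rough illustration of what the loss measures, the following NumPy sketch computes a softmax over two raw scores and the categorical cross-entropy against a one-hot label. It is a simplified stand-in and not Keras' actual implementation; the function names and the example scores are assumptions.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0); only the probability of the true class contributes
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# One sample whose raw scores favour class 0, and whose true label is class 0
probs = softmax(np.array([[2.0, 0.0]]))
loss = categorical_crossentropy(np.array([[1.0, 0.0]]), probs)

print(probs)
print(loss)
```

The loss shrinks toward 0 as the probability assigned to the correct class approaches 1, and grows without bound as it approaches 0.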

In [None]:
# Fitting the model
model.fit(train_x, train_y,
  batch_size=32,
  epochs=8,
  verbose=1,
  validation_split=0.1,
  shuffle=True)
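
A detail worth knowing about `validation_split`: according to the Keras documentation, the validation samples are taken from the end of the data before any shuffling, so `shuffle=True` only reorders the remaining training portion at each epoch. The sketch below imitates that slicing with NumPy; `split_like_validation_split` is a hypothetical helper and the exact rounding inside Keras may differ between versions.

```python
import numpy as np

def split_like_validation_split(x, y, validation_split):
    # Keep the first (1 - validation_split) fraction for training
    # and hold out the trailing samples for validation
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Toy data: 10 samples with 2 features each
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

(fit_x, fit_y), (val_x, val_y) = split_like_validation_split(toy_x, toy_y, 0.1)

print(len(fit_x), len(val_x))
print(val_y)
```

If the CSV happens to be sorted by sentiment, the held-out tail would contain a single class, so it can be worth shuffling the data once before calling `fit`.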

In [None]:
# Saving the model
model_json = model.to_json()
with open('model_tweets.json', 'w') as json_file:
    json_file.write(model_json)

model.save_weights('model_tweets.h5')

In [None]:
# Preparing to execute the model

from keras.models import model_from_json

# we're still going to use a Tokenizer here, but we don't need to fit it
tokenizer = Tokenizer(num_words=3000)
# for human-friendly printing
labels = ['negative', 'positive']

# read in our saved dictionary
with open('dictionary.json', 'r') as dictionary_file:
    dictionary = json.load(dictionary_file)


In [None]:
# Loading model from saved files
with open('model_tweets.json', 'r') as json_file:
    loaded_model_json = json_file.read()

# and create a model from that
model = model_from_json(loaded_model_json)
# and weight your nodes with your saved values
model.load_weights('model_tweets.h5')

In [None]:
# Executing the model
while True:
    evalSentence = input('Input a sentence to be evaluated, or Enter to quit: ')

    if len(evalSentence) == 0:
        break

    # format your input for the neural net
    testArr = convert_text_to_index_array(evalSentence)
    inp = tokenizer.sequences_to_matrix([testArr], mode='binary')
    # predict which bucket your input belongs in
    pred = model.predict(inp)
    # and print it for the humans
    print("%s sentiment; %f%% confidence" % (labels[np.argmax(pred)], pred[0][np.argmax(pred)] * 100))
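
The formatting step of the loop can also be tried without a trained model. In this sketch `describe_prediction` is a hypothetical helper and `fake_pred` is a made-up array with the same shape as the output of `model.predict` for a single sentence, that is, one row with one probability per class.

```python
import numpy as np

labels = ['negative', 'positive']

def describe_prediction(pred, labels):
    # pred has shape (1, n_classes); pick the most probable class
    best = int(np.argmax(pred[0]))
    return "%s sentiment; %f%% confidence" % (labels[best], pred[0][best] * 100)

# Made-up softmax output, standing in for model.predict(inp)
fake_pred = np.array([[0.2, 0.8]])
message = describe_prediction(fake_pred, labels)

print(message)
```

The "confidence" printed here is simply the softmax probability of the winning class, which is not necessarily a calibrated estimate of how often the model is right.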