# Deep Learning for Sentiment Analysis

<img style="float: left;" src="images/sentiment.png">

In this tutorial we build a sentiment analysis model in Keras.

The fundamental steps for building a text classifier are:

- Embed
- Encode
- (Attend)
- Predict

[Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models](https://explosion.ai/blog/deep-learning-formula-nlp)

We go through these steps one by one by reimplementing the state-of-the-art model for sentiment analysis at SemEval 2015.

[Twitter sentiment analysis with deep convolutional neural networks](https://pdfs.semanticscholar.org/9320/a229b450bee8384f218681634e039acd9c2f.pdf)

## Data preparation

Data and embeddings can be downloaded from [here](https://drive.google.com/open?id=0B8xjf4y9r8jCdVFjVTZqdzZTbVU)



First, we prepare the data for training with some preprocessing. This [tokenizer](https://github.com/jaredks/tweetokenize) applies a few simple transformations to each tweet:

- lowercases the text
- maps numbers to a special NUMBER token
- maps user names to a special USERNAME token

In [1]:
from tokenizer import Tokenizer

tkn = Tokenizer()

def preprocess(tweet):
    return tkn.tokenize(tweet)

In [2]:
preprocess("@bestuser Gas by my house hit $3.39!!!! I'm going to Chapel Hill on Sat. :) #lol")

[u'USERNAME',
 u'gas',
 u'by',
 u'my',
 u'house',
 u'hit',
 u'NUMBER',
 u'!',
 u'!',
 u'!',
 u'!',
 u"i'm",
 u'going',
 u'to',
 u'chapel',
 u'hill',
 u'on',
 u'sat',
 u'.',
 u':)',
 u'#lol']
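
The three transformations listed for the tokenizer can be approximated with a few lines of plain Python. This is only a rough sketch for illustration: `simple_preprocess` is a hypothetical helper, it splits on whitespace, and it does not reproduce the real tweetokenize behaviour (for example, it does not separate punctuation or handle emoticons and URLs).

```python
import re

# Matches tokens such as "42", "3.39" or "$3.39" (an illustrative pattern only).
NUMBER_RE = re.compile(r'^\$?\d+(\.\d+)?$')

def simple_preprocess(tweet):
    """Hypothetical, simplified stand-in for the tokenizer used in this tutorial."""
    tokens = []
    for tok in tweet.split():
        if tok.startswith('@') and len(tok) > 1:
            tokens.append('USERNAME')   # user mention -> special token
        elif NUMBER_RE.match(tok):
            tokens.append('NUMBER')     # number -> special token
        else:
            tokens.append(tok.lower())  # everything else is lowercased
    return tokens

print(simple_preprocess("@bestuser Gas hit $3.39 on Sat"))
# ['USERNAME', 'gas', 'hit', 'NUMBER', 'on', 'sat']
```

Mapping numbers and user names to shared tokens keeps the vocabulary small, since individual prices or handles rarely carry sentiment on their own.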

At this point we can preprocess the SemEval 2015 training set.

In [3]:
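def load_dataset(file_name, gold=None):
    # Map label strings to class indices; 'unknwn' shares the neutral index.
    labels = {'negative':0, 'neutral':1, 'positive':2, 'unknwn':1}
    X_, y_ = [], []
    with open(file_name) as f:
        for line in f:
            # Each row is tab-separated: label, an ignored field, tweet text.
            label, _, text = line.strip().split('\t')
            y_.append(labels[label])
            X_.append(preprocess(text))
    if gold:
        # When a gold file is given, its third column replaces the labels.
        y_ = []
        with open(gold) as f:
            for line in f:
                _, _, label = line.strip().split('\t')
                y_.append(labels[label])
    return (X_, y_)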
            
X_train, y_train = load_dataset("data/train'13.csv")
X_dev, y_dev = load_dataset("data/dev'13.csv")
X_train = X_train+X_dev
y_train = y_train+y_dev
X_dev, y_dev = load_dataset("data/test'13.csv")
X_test_1, _ = load_dataset("data/sms'13.csv")
X_test_2, _ = load_dataset("data/test'14.csv")
X_test_3, _ = load_dataset("data/test'15.csv")
X_dev[5]

[u'excuse',
 u'the',
 u'connectivity',
 u'of',
 u'this',
 u'live',
 u'stream',
 u',',
 u'from',
 u'baba',
 u'amr',
 u',',
 u'so',
 u'many',
 u'activists',
 u'using',
 u'only',
 u'one',
 u'sat',
 u'modem',
 u'.',
 u'LIVE',
 u'URL',
 u'#Homs']

To speed up the mapping, we build a dictionary that assigns a unique ID to every word. We also add two special tokens to this dictionary: one for unknown words (UNK) and one for padding (PAD), which is explained further down.

In [4]:
from itertools import chain
dictionary = {'PAD':0, 'UNK':1}

toks = set(chain.from_iterable(X_train+X_dev+X_test_1+X_test_2+X_test_3))
for i, tok in enumerate(toks):
    dictionary[tok] = i+2
len(dictionary)

33609
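
To make the dictionary step concrete, here is a minimal sketch on toy data. `build_vocab` is a hypothetical helper, not part of the notebook. Unlike the cell above, it sorts the tokens before numbering them: this is an assumption added here so that the token-to-ID assignment is the same on every run, because the iteration order of a set of strings can differ between interpreter sessions.

```python
from itertools import chain

def build_vocab(sentences):
    """Assign a unique integer ID to every token (hypothetical helper)."""
    vocab = {'PAD': 0, 'UNK': 1}
    # sorted() makes the assignment reproducible across runs
    for i, tok in enumerate(sorted(set(chain.from_iterable(sentences)))):
        vocab[tok] = i + 2
    return vocab

toy = [['good', 'movie'], ['bad', 'movie', '!']]
vocab = build_vocab(toy)
print(vocab)
# {'PAD': 0, 'UNK': 1, '!': 2, 'bad': 3, 'good': 4, 'movie': 5}

# Words that are not in the vocabulary fall back to the UNK index.
ids = [vocab.get(tok, vocab['UNK']) for tok in ['good', 'film']]
print(ids)
# [4, 1]
```

A reproducible mapping matters once a trained model is saved: the rows of its embedding matrix only make sense together with the exact dictionary used during training.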

Now we map the words of the training and dev sets to their indices in the dictionary:

In [5]:
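def word2id(sent):
    # Unknown words fall back to the UNK index (1).
    return [dictionary.get(tok, dictionary['UNK']) for tok in sent]

# List comprehensions return lists on both Python 2 and Python 3.
X_train = [word2id(sent) for sent in X_train]
X_dev = [word2id(sent) for sent in X_dev]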
    
X_dev[5]

[20017,
 10819,
 13784,
 15722,
 26775,
 11018,
 32099,
 29495,
 14115,
 9877,
 3011,
 29495,
 17467,
 17670,
 1314,
 18253,
 19701,
 733,
 25967,
 26452,
 337,
 265,
 9510,
 29007]

In general, neural networks only accept vectors (tensors) of a fixed size, so we pad every sentence to the length of the longest sentence in the training set. We also convert the sentences to NumPy arrays.

In [6]:
import numpy as np

max_len = max(len(x) for x in X_train)

def _pad(s, maxlen):
    # Left-pad with the PAD index (0); keep only the last `maxlen` tokens.
    pad_ = np.zeros(maxlen, dtype='int32')
    trunc = np.asarray(s[-maxlen:], dtype='int32')
    if len(trunc):  # an empty sentence stays all padding
        pad_[-len(trunc):] = trunc
    return pad_

X_train = np.array([_pad(x, max_len) for x in X_train])
X_dev = np.array([_pad(x, max_len) for x in X_dev])
X_train[123]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0, 14158, 10819, 32671,
       32837, 22163, 10819,  4200, 26828, 27247, 25524, 32837, 18384,
       28365,  2483, 17877,  5846,  7431, 22140, 25735, 14638, 28244,
       13358, 12938,   105, 16485,   337], dtype=int32)
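
The padding scheme is easier to see on short sequences. The sketch below uses plain Python lists and a hypothetical helper `pad_left`. It follows the same convention as the cell above: zeros are added on the left, and sequences longer than the target length keep their last tokens.

```python
def pad_left(seq, maxlen, pad_id=0):
    """Left-pad (or left-truncate) a list of token IDs to exactly maxlen items."""
    trunc = seq[-maxlen:]  # keep only the last maxlen tokens
    return [pad_id] * (maxlen - len(trunc)) + trunc

print(pad_left([7, 8, 9], 5))           # [0, 0, 7, 8, 9]
print(pad_left([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
print(pad_left([], 5))                  # [0, 0, 0, 0, 0]
```

Keras ships `pad_sequences` in `keras.preprocessing.sequence`, whose defaults also pad and truncate at the start of the sequence, so it could likely replace the hand-written helper here.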

As a last step, we map the labels of the three classes to one-hot vectors.

In [7]:
def labels(x):
    # One-hot encode a class index (0=negative, 1=neutral, 2=positive).
    out_ = np.zeros(3, dtype='int32')
    out_[x] = 1
    return out_

y_train = np.array([labels(y) for y in y_train])
y_dev = np.array([labels(y) for y in y_dev])
y_dev

array([[0, 0, 1],
       [0, 0, 1],
       [0, 1, 0],
       ..., 
       [0, 0, 1],
       [0, 1, 0],
       [0, 1, 0]], dtype=int32)
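
As a small illustration of the encoding, the sketch below builds one-hot vectors with plain lists and shows the inverse step, taking the index of the largest value, which is how a class index is recovered from a softmax output. `one_hot` and `argmax` are hypothetical helper names, and the probability vector is made up for the example.

```python
def one_hot(index, num_classes=3):
    """Return a list with a 1 at `index` and 0 elsewhere."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

def argmax(vec):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(vec)), key=vec.__getitem__)

print(one_hot(0))               # [1, 0, 0]  negative
print(one_hot(2))               # [0, 0, 1]  positive
print(argmax(one_hot(2)))       # 2
print(argmax([0.1, 0.7, 0.2]))  # 1  (made-up softmax output -> neutral)
```

The one-hot form matches the three-unit softmax output layer and the `categorical_crossentropy` loss used when the models are compiled.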

## The network

At this point we define the network.

In [10]:
np.random.seed(1337)
from keras.models import Sequential
from keras.layers import (Dropout,
                          Conv1D,
                          GlobalMaxPooling1D,
                          Dense,
                          Embedding)
from keras.optimizers import Adadelta
from keras.regularizers import l2
from keras import backend as K

emb_dim = 100
conv_filters = 300

model = Sequential()
model.add(Embedding(len(dictionary), emb_dim, input_length=max_len)) #Embed
model.add(Conv1D(filters=conv_filters, kernel_size=5, padding='same', activation='relu'))
model.add(GlobalMaxPooling1D()) #Encode
model.add(Dense(3, activation='softmax')) #Predict


model.compile(loss='categorical_crossentropy',
              optimizer=Adadelta(lr=1.0, rho=0.90, epsilon=1e-8),
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 68, 100)           3360900   
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 68, 300)           150300    
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 300)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 903       
Total params: 3,512,103
Trainable params: 3,512,103
Non-trainable params: 0
_________________________________________________________________
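
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic using the sizes that appear above (vocabulary of 33609 entries, 100-dimensional embeddings, 300 filters of width 5, 3 output classes); the variable names are only for illustration.

```python
vocab_size, emb_dim = 33609, 100
kernel_size, conv_filters, num_classes = 5, 300, 3

# Embedding: one emb_dim-sized vector per dictionary entry.
embedding_params = vocab_size * emb_dim
# Conv1D: each filter spans kernel_size positions x emb_dim channels, plus one bias.
conv_params = kernel_size * emb_dim * conv_filters + conv_filters
# Dense: one weight per (pooled feature, class) pair, plus one bias per class.
dense_params = conv_filters * num_classes + num_classes

total = embedding_params + conv_params + dense_params
print(embedding_params, conv_params, dense_params, total)
# 3360900 150300 903 3512103
```

Almost all of the weights sit in the embedding layer, which is one reason why initializing it from pretrained vectors can matter when the training set is small.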


In [12]:
from keras.callbacks import EarlyStopping, ModelCheckpoint

early_stopping = EarlyStopping(monitor='val_acc', patience=3, mode='max')

model_checkpoint = ModelCheckpoint('model.tra', save_best_only=True, mode='max', monitor='val_acc')

model.fit(X_train[:1000], y_train[:1000],
          batch_size=32,
          epochs=1000,
          shuffle=True,
          validation_data=(X_dev, y_dev),
          callbacks=[early_stopping, model_checkpoint])

Train on 1000 samples, validate on 3813 samples
Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
Epoch 12/1000
Epoch 13/1000
Epoch 14/1000
Epoch 15/1000
Epoch 16/1000
Epoch 17/1000
Epoch 18/1000
Epoch 19/1000
Epoch 20/1000
Epoch 21/1000
Epoch 22/1000
Epoch 23/1000
Epoch 24/1000
Epoch 25/1000
Epoch 26/1000


<keras.callbacks.History at 0x11605d2d0>

In [20]:
from keras.models import load_model

model = load_model('model.tra')

def evaluate(file_name, gold):
    print(file_name)
    from sklearn.metrics import precision_recall_fscore_support
    from sklearn.metrics import classification_report
    
    X_, y_ = load_dataset(file_name, gold)
    X_ = [word2id(x) for x in X_]
    X_ = np.array([_pad(x, max_len) for x in X_])
    pred = model.predict_classes(X_, verbose=0)
    ev = precision_recall_fscore_support(y_, pred)
    f1 = (ev[2][0]+ev[2][2])/2
    print('Semeval F1 score: {} %'.format(f1*100))
    print(classification_report(y_, pred))


files = ["data/sms'13.csv",
         "data/test'13.csv",
         "data/test'15.csv"]

gold = [None,
        None,
        "data/SemEval2015-task10-test-B-gold.txt"]


for file_name, gold in zip(files, gold):
    evaluate(file_name, gold)

data/sms'13.csv




Semeval F1 score: 21.8996062992 %
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       394
          1       0.83      0.38      0.52      1208
          2       0.29      0.90      0.44       492

avg / total       0.55      0.43      0.40      2094

data/test'13.csv
Semeval F1 score: 29.9707126586 %
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       601
          1       0.58      0.81      0.67      1640
          2       0.61      0.59      0.60      1572

avg / total       0.50      0.59      0.54      3813

data/test'15.csv
Semeval F1 score: 29.7563504406 %
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       365
          1       0.54      0.82      0.65       987
          2       0.64      0.55      0.60      1038

avg / total       0.50      0.58      0.53      2390
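
The "Semeval F1 score" printed by `evaluate` is the average of the F1 scores of the negative (0) and positive (2) classes; the neutral class does not enter the average. The sketch below recomputes this on a tiny made-up example without scikit-learn. `f1_for_class` and `semeval_f1` are hypothetical helpers, and the gold and predicted labels are invented for illustration.

```python
def f1_for_class(gold, pred, cls):
    """F1 of a single class, computed from true/false positives and false negatives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0  # a class that is never predicted correctly scores 0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

def semeval_f1(gold, pred, negative=0, positive=2):
    """Average of the negative and positive F1 scores."""
    return (f1_for_class(gold, pred, negative) +
            f1_for_class(gold, pred, positive)) / 2

gold = [0, 0, 1, 1, 2, 2, 2]   # made-up gold labels
pred = [0, 1, 1, 1, 2, 2, 0]   # made-up predictions
score = semeval_f1(gold, pred)
print(round(score, 2))  # 0.65
```

Because of this definition, a model that never predicts the negative class loses half of the achievable score, no matter how well it does on neutral tweets.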



In [21]:
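from gensim.models import KeyedVectors

# Load the pretrained word2vec vectors stored in binary format.
# (gensim 1.0+ moved load_word2vec_format from Word2Vec to KeyedVectors.)
w2v = KeyedVectors.load_word2vec_format('data/embeddings.bin', binary=True)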

In [22]:
def emb_matrix(dictionary, model):
    embedding_matrix = np.random.uniform(-0.25, 0.25, (len(dictionary), 100))
    for word in dictionary:
        if word in model:
            embedding_matrix[dictionary[word]] = model[word]
    return embedding_matrix
        

In [23]:
np.random.seed(1337)
from keras.models import Sequential
from keras.layers import (Dropout,
                          Conv1D,
                          GlobalMaxPooling1D,
                          Dense,
                          Embedding)
from keras.optimizers import Adadelta
from keras.regularizers import l2

emb_dim = 100
conv_filters = 300

embeddings = Embedding(len(dictionary), 100, input_length=max_len, weights=[emb_matrix(dictionary, w2v)], trainable=True)

model = Sequential()
model.add(embeddings) #Embed
model.add(Conv1D(filters=conv_filters, kernel_size=5, padding='same', activation='relu'))
model.add(GlobalMaxPooling1D()) #Encode
model.add(Dense(3, activation='softmax')) #Predict

model.compile(loss='categorical_crossentropy',
              optimizer=Adadelta(lr=1.0, rho=0.90, epsilon=1e-8),
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 68, 100)           3360900   
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 68, 300)           150300    
_________________________________________________________________
global_max_pooling1d_5 (Glob (None, 300)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 903       
Total params: 3,512,103
Trainable params: 3,512,103
Non-trainable params: 0
_________________________________________________________________


In [24]:
from keras.callbacks import EarlyStopping, ModelCheckpoint
early_stopping = EarlyStopping(monitor='val_acc', patience=3, mode='max')
model_checkpoint = ModelCheckpoint('model.tra', save_best_only=True, mode='max', monitor='val_acc')

model.fit(X_train[:1000], y_train[:1000],
          batch_size=32,
          epochs=1000,
          shuffle=True,
          validation_data=(X_dev, y_dev),
          callbacks=[early_stopping, model_checkpoint])

Train on 1000 samples, validate on 3813 samples
Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
Epoch 12/1000


<keras.callbacks.History at 0x12407fa10>

In [25]:
from keras.models import load_model

model = load_model('model.tra')

def evaluate(file_name, gold):
    print(file_name)
    from sklearn.metrics import precision_recall_fscore_support
    from sklearn.metrics import classification_report
    
    X_, y_ = load_dataset(file_name, gold)
    X_ = [word2id(x) for x in X_]
    X_ = np.array([_pad(x, max_len) for x in X_])
    pred = model.predict_classes(X_, verbose=0)
    ev = precision_recall_fscore_support(y_, pred)
    f1 = (ev[2][0]+ev[2][2])/2
    print('Semeval F1 score: {} %'.format(f1*100))
    print(classification_report(y_, pred))


files = ["data/sms'13.csv",
         "data/test'13.csv",
         "data/test'15.csv"]

gold = [None,
        None,
        "data/SemEval2015-task10-test-B-gold.txt"]


for file_name, gold in zip(files, gold):
    evaluate(file_name, gold)

data/sms'13.csv
Semeval F1 score: 52.3418466508 %
             precision    recall  f1-score   support

          0       0.43      0.55      0.48       394
          1       0.89      0.46      0.60      1208
          2       0.43      0.84      0.57       492

avg / total       0.69      0.56      0.57      2094

data/test'13.csv
Semeval F1 score: 55.8681577527 %
             precision    recall  f1-score   support

          0       0.70      0.30      0.42       601
          1       0.66      0.77      0.71      1640
          2       0.68      0.72      0.70      1572

avg / total       0.68      0.67      0.66      3813

data/test'15.csv
Semeval F1 score: 53.7115973045 %
             precision    recall  f1-score   support

          0       0.56      0.32      0.40       365
          1       0.62      0.79      0.70       987
          2       0.71      0.64      0.67      1038

avg / total       0.65      0.65      0.64      2390



In [26]:
np.random.seed(1337)
from keras.models import Sequential
from keras.layers import (Dropout,
                          GRU,
                          Bidirectional,
                          Dense,
                          Embedding)
from keras.optimizers import Adadelta
from keras.regularizers import l2

emb_dim = 100
conv_filters = 300

model = Sequential()
model.add(Embedding(len(dictionary), 100, input_length=max_len, weights=[emb_matrix(dictionary, w2v)], trainable=True)) #Embed
model.add(GRU(300, activation='tanh'))
model.add(Dense(3, activation='softmax')) #Predict

model.compile(loss='categorical_crossentropy',
              optimizer=Adadelta(lr=1.0, rho=0.90, epsilon=1e-8),
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 68, 100)           3360900   
_________________________________________________________________
gru_2 (GRU)                  (None, 300)               360900    
_________________________________________________________________
dense_7 (Dense)              (None, 3)                 903       
Total params: 3,722,703
Trainable params: 3,722,703
Non-trainable params: 0
_________________________________________________________________
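
The GRU parameter count in the summary can be reproduced with a short calculation. The formula below is a sketch that assumes the classic GRU formulation with a single bias vector per gate, which is what matches the number printed above; newer Keras versions may use a variant with extra bias terms and report a different count. `gru_params` is a hypothetical helper.

```python
def gru_params(input_dim, units):
    """Weights of a classic GRU layer.

    Three blocks (update gate, reset gate, candidate state), each with an
    input kernel, a recurrent kernel and one bias vector.
    """
    return 3 * (units * (input_dim + units) + units)

gru = gru_params(input_dim=100, units=300)
total = 33609 * 100 + gru + (300 * 3 + 3)  # embedding + GRU + dense
print(gru, total)
# 360900 3722703
```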


In [27]:
from keras.callbacks import EarlyStopping, ModelCheckpoint
early_stopping = EarlyStopping(monitor='val_acc', patience=3, mode='max')
model_checkpoint = ModelCheckpoint('model.tra', save_best_only=True, mode='max', monitor='val_acc')

model.fit(X_train[:1000], y_train[:1000],
          batch_size=32,
          epochs=1000,
          shuffle=True,
          validation_data=(X_dev, y_dev),
          callbacks=[early_stopping, model_checkpoint])

Train on 1000 samples, validate on 3813 samples
Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
Epoch 12/1000
Epoch 13/1000
Epoch 14/1000
Epoch 15/1000
Epoch 16/1000


<keras.callbacks.History at 0x1351e2810>

In [29]:
from keras.models import load_model

model = load_model('model.tra')

def evaluate(file_name, gold):
    print(file_name)
    from sklearn.metrics import precision_recall_fscore_support
    from sklearn.metrics import classification_report
    
    X_, y_ = load_dataset(file_name, gold)
    X_ = [word2id(x) for x in X_]
    X_ = np.array([_pad(x, max_len) for x in X_])
    pred = model.predict_classes(X_, verbose=0)
    ev = precision_recall_fscore_support(y_, pred)
    f1 = (ev[2][0]+ev[2][2])/2
    print('Semeval F1 score: {} %'.format(f1*100))
    print(classification_report(y_, pred))


files = ["data/sms'13.csv",
         "data/test'13.csv",
         "data/test'15.csv"]

gold = [None,
        None,
        "data/SemEval2015-task10-test-B-gold.txt"]


for file_name, gold in zip(files, gold):
    evaluate(file_name, gold)

data/sms'13.csv
Semeval F1 score: 48.7468812026 %
             precision    recall  f1-score   support

          0       0.36      0.57      0.44       394
          1       0.81      0.47      0.59      1208
          2       0.44      0.69      0.53       492

avg / total       0.64      0.54      0.55      2094

data/test'13.csv
Semeval F1 score: 54.2490282103 %
             precision    recall  f1-score   support

          0       0.49      0.45      0.47       601
          1       0.60      0.72      0.66      1640
          2       0.68      0.56      0.61      1572

avg / total       0.62      0.61      0.61      3813

data/test'15.csv
Semeval F1 score: 49.5697541452 %
             precision    recall  f1-score   support

          0       0.37      0.44      0.41       365
          1       0.56      0.70      0.62       987
          2       0.71      0.50      0.59      1038

avg / total       0.60      0.57      0.57      2390



Next we combine a convolutional model with a recurrent model. First we build 5-gram representations using the convolution. To speed up training we then shorten the sequence with max pooling (pool size 4): this layer applies max pooling not over the whole sentence but over every group of 4 consecutive n-gram embeddings. Finally, a gated recurrent unit (GRU) reads the pooled sequence and produces the output vector.

In [30]:
np.random.seed(1337)
from keras.models import Sequential
from keras.layers import (Dropout,
                          Conv1D,
                          MaxPooling1D,
                          GRU,
                          Dense,
                          Embedding)
from keras.optimizers import Adadelta
from keras.regularizers import l2

emb_dim = 100
conv_filters = 300

embeddings = Embedding(len(dictionary), 100, input_length=max_len, weights=[emb_matrix(dictionary, w2v)], trainable=True)

model = Sequential()
model.add(embeddings) #Embed
model.add(Conv1D(filters=conv_filters, kernel_size=5, padding='same', activation='relu'))
model.add(MaxPooling1D(4)) #Encode
model.add(GRU(300, activation='relu'))
model.add(Dense(3, activation='softmax')) #Predict

model.compile(loss='categorical_crossentropy',
              optimizer=Adadelta(lr=1.0, rho=0.90, epsilon=1e-8),
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 68, 100)           3360900   
_________________________________________________________________
conv1d_7 (Conv1D)            (None, 68, 300)           150300    
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 17, 300)           0         
_________________________________________________________________
gru_4 (GRU)                  (None, 300)               540900    
_________________________________________________________________
dense_9 (Dense)              (None, 3)                 903       
Total params: 4,053,003
Trainable params: 4,053,003
Non-trainable params: 0
_________________________________________________________________
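
The effect of `MaxPooling1D(4)` on the sequence length (68 steps in, 17 steps out) can be illustrated with a small pure-Python sketch. `max_pool_1d` is a hypothetical helper that mimics non-overlapping pooling over time, assuming the stride equals the pool size and incomplete trailing windows are dropped; the input values are made up.

```python
def max_pool_1d(steps, pool_size):
    """Non-overlapping max pooling over the time axis.

    `steps` is a list of time steps, each a list of feature values.
    Every group of `pool_size` consecutive steps is reduced to one step
    holding the per-feature maximum.
    """
    n_windows = len(steps) // pool_size
    pooled = []
    for w in range(n_windows):
        window = steps[w * pool_size:(w + 1) * pool_size]
        pooled.append([max(col) for col in zip(*window)])
    return pooled

steps = [[t, -t] for t in range(8)]  # 8 time steps, 2 features each
print(max_pool_1d(steps, 4))
# [[3, 0], [7, -4]]

# With the shapes used in this model: 68 time steps become 17.
print(len(max_pool_1d([[0.0] * 300 for _ in range(68)], 4)))
# 17
```

The GRU therefore processes 17 steps instead of 68, which is where the training speed-up comes from, while each remaining step still summarizes the strongest n-gram features in its window.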


In [31]:
from keras.callbacks import EarlyStopping, ModelCheckpoint
early_stopping = EarlyStopping(monitor='val_acc', patience=3, mode='max')
model_checkpoint = ModelCheckpoint('model.tra', save_best_only=True, mode='max', monitor='val_acc')

model.fit(X_train[:1000], y_train[:1000],
          batch_size=32,
          epochs=1000,
          shuffle=True,
          validation_data=(X_dev, y_dev),
          callbacks=[early_stopping, model_checkpoint])

Train on 1000 samples, validate on 3813 samples
Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000


<keras.callbacks.History at 0x124113690>

In [32]:
model = load_model('model.tra')

def evaluate(file_name, gold):
    print(file_name)
    from sklearn.metrics import precision_recall_fscore_support
    from sklearn.metrics import classification_report
    
    X_, y_ = load_dataset(file_name, gold)
    X_ = [word2id(x) for x in X_]
    X_ = np.array([_pad(x, max_len) for x in X_])
    pred = model.predict_classes(X_, verbose=0)
    ev = precision_recall_fscore_support(y_, pred)
    f1 = (ev[2][0]+ev[2][2])/2
    print('Semeval F1 score: {} %'.format(f1*100))
    print(classification_report(y_, pred))


files = ["data/sms'13.csv",
         "data/test'13.csv",
         "data/test'15.csv"]

gold = [None,
        None,
        "data/SemEval2015-task10-test-B-gold.txt"]


for file_name, gold in zip(files, gold):
    evaluate(file_name, gold)

data/sms'13.csv
Semeval F1 score: 50.0319022878 %
             precision    recall  f1-score   support

          0       0.40      0.50      0.44       394
          1       0.79      0.57      0.66      1208
          2       0.47      0.69      0.56       492

avg / total       0.64      0.59      0.60      2094

data/test'13.csv
Semeval F1 score: 51.3382067927 %
             precision    recall  f1-score   support

          0       0.59      0.28      0.38       601
          1       0.61      0.74      0.67      1640
          2       0.66      0.64      0.65      1572

avg / total       0.63      0.63      0.61      3813

data/test'15.csv
Semeval F1 score: 49.80448975 %
             precision    recall  f1-score   support

          0       0.51      0.29      0.37       365
          1       0.58      0.76      0.66       987
          2       0.68      0.58      0.63      1038

avg / total       0.61      0.61      0.60      2390

