# Deep Learning for Sentiment Analysis

<img style="float: left;" src="images/sentiment.png">

In this tutorial we build a sentiment analysis model in Keras.

The basic steps for building a text classifier are:

- Embed
- Encode
- (Attend)
- Predict

[Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models](https://explosion.ai/blog/deep-learning-formula-nlp)
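To make the four steps concrete, here is a toy NumPy sketch of the data flow. All weights are random, and every name in it (`E`, `W_enc`, `v_att`, `W_out`, `classify`) is made up for illustration. It only shows the tensor shape at each stage and is not the model built later in this tutorial.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, emb_dim, n_classes = 10, 4, 3

E = rng.normal(size=(vocab_size, emb_dim))     # embedding table, one row per word id
W_enc = rng.normal(size=(emb_dim, emb_dim))    # encoder weights
v_att = rng.normal(size=emb_dim)               # attention scoring vector
W_out = rng.normal(size=(emb_dim, n_classes))  # classifier weights

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def classify(token_ids):
    # Embed: word ids -> one vector per token, shape (seq_len, emb_dim)
    embedded = E[token_ids]
    # Encode: mix each token with its neighbours (window of 3), same shape
    padded = np.vstack([np.zeros(emb_dim), embedded, np.zeros(emb_dim)])
    context = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
    encoded = np.tanh(np.dot(context, W_enc))
    # Attend: softmax weights over positions -> a single sentence vector
    weights = softmax(np.dot(encoded, v_att))
    sentence = np.dot(weights, encoded)
    # Predict: sentence vector -> class probabilities
    return softmax(np.dot(sentence, W_out))

probs = classify(np.array([1, 5, 2, 7]))
print(probs.shape, probs.sum())
```

The attend step is optional, which is why it appears in parentheses in the list: the convolutional model below replaces it with max pooling.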

We walk through these steps one at a time by reimplementing the state-of-the-art sentiment analysis model from SemEval 2015.

[Twitter sentiment analysis with deep convolutional neural networks](https://pdfs.semanticscholar.org/9320/a229b450bee8384f218681634e039acd9c2f.pdf)

## Data preparation

Data and embeddings can be downloaded from [here](https://drive.google.com/open?id=0B8xjf4y9r8jCdVFjVTZqdzZTbVU)



First we prepare the data for training with some preprocessing. This [tokenizer](https://github.com/jaredks/tweetokenize) applies a few simple transformations to each tweet:

- lowercasing
- mapping numbers to a special NUMBER token
- mapping user names to a special USERNAME token
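These three transformations can be approximated with two regular expressions. The sketch below is not how tweetokenize is implemented: `simple_normalize` and both patterns are hypothetical, and it splits on whitespace only, so punctuation attached to a word is not separated.

```python
import re

USER_RE = re.compile(r'^@\w+$')                  # e.g. @bestuser
NUM_RE = re.compile(r'^\$?\d+(?:[.,]\d+)*$')     # e.g. 42, 3.39, $3.39

def simple_normalize(tweet):
    tokens = []
    for raw in tweet.split():
        if USER_RE.match(raw):
            tokens.append('USERNAME')
        elif NUM_RE.match(raw):
            tokens.append('NUMBER')
        else:
            tokens.append(raw.lower())
    return tokens

print(simple_normalize("@bestuser Gas hit $3.39 today"))
# ['USERNAME', 'gas', 'hit', 'NUMBER', 'today']
```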

In [1]:
from tokenizer import Tokenizer

tkn = Tokenizer()

def preprocess(tweet):
    return tkn.tokenize(tweet)

In [2]:
preprocess("@bestuser Gas by my house hit $3.39!!!! I'm going to Chapel Hill on Sat. :) #lol")

[u'USERNAME',
 u'gas',
 u'by',
 u'my',
 u'house',
 u'hit',
 u'NUMBER',
 u'!',
 u'!',
 u'!',
 u'!',
 u"i'm",
 u'going',
 u'to',
 u'chapel',
 u'hill',
 u'on',
 u'sat',
 u'.',
 u':)',
 u'#lol']

We can now preprocess the SemEval 2015 training set.

In [4]:
def load_dataset(file_name, gold=None):
    labels = {'negative':0, 'neutral':1, 'positive':2, 'unknwn':1}
    X_, y_ = [], []
    with open(file_name) as f:
        for line in f:
            label, _, text = line.strip().split('\t') 
            y_.append(labels[label])
            X_.append(preprocess(text))
    if gold:
        y_ = []
        with open(gold) as f:
            for line in f:
                _, _, label = line.strip().split('\t')
                y_.append(labels[label])
    return (X_, y_)
            
X_train, y_train = load_dataset("data/train'13.csv")
X_dev, y_dev = load_dataset("data/dev'13.csv")
X_train = X_train+X_dev
y_train = y_train+y_dev
X_dev, y_dev = load_dataset("data/test'13.csv")
X_test_1, _ = load_dataset("data/sms'13.csv")
X_test_2, _ = load_dataset("data/test'14.csv")
X_test_3, _ = load_dataset("data/test'15.csv")
X_dev[5]

[u'excuse',
 u'the',
 u'connectivity',
 u'of',
 u'this',
 u'live',
 u'stream',
 u',',
 u'from',
 u'baba',
 u'amr',
 u',',
 u'so',
 u'many',
 u'activists',
 u'using',
 u'only',
 u'one',
 u'sat',
 u'modem',
 u'.',
 u'LIVE',
 u'URL',
 u'#Homs']

To speed up the lookup, we build a dictionary that assigns a unique id to every word. We also add two special tokens: one for unknown words (UNK) and one for padding (PAD, explained below).

In [5]:
from itertools import chain
dictionary = {'PAD':0, 'UNK':1}

toks = set(chain.from_iterable(X_train+X_dev+X_test_1+X_test_2+X_test_3))
for i, tok in enumerate(toks):
    dictionary[tok] = i+2
len(dictionary)

33609
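A toy version of the same idea shows how the UNK fallback behaves. The names `sentences`, `vocab` and `to_ids` are illustrative only; sorting the tokens is an extra step not used above, added here so that the ids are the same on every run.

```python
from itertools import chain

sentences = [['good', 'movie'], ['bad', 'movie']]

vocab = {'PAD': 0, 'UNK': 1}
for tok in sorted(set(chain.from_iterable(sentences))):
    vocab[tok] = len(vocab)          # bad -> 2, good -> 3, movie -> 4

def to_ids(sentence):
    # Words missing from the vocabulary fall back to the UNK id
    return [vocab.get(tok, vocab['UNK']) for tok in sentence]

print(to_ids(['good', 'film']))      # [3, 1]
```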

Now we map the words of the training and dev sets to their indices in the dictionary:

In [6]:
def word2id(sent):
    return map(lambda x: dictionary.get(x, 1), sent)

X_train = map(word2id, X_train)
X_dev = map(word2id, X_dev)
    
X_dev[5]

[20017,
 10819,
 13784,
 15722,
 26775,
 11018,
 32099,
 29495,
 14115,
 9877,
 3011,
 29495,
 17467,
 17670,
 1314,
 18253,
 19701,
 733,
 25967,
 26452,
 337,
 265,
 9510,
 29007]

Neural networks generally accept only vectors (tensors) of a fixed size, so we pad every sentence to the length of the longest sentence in the training set and convert the sentences to NumPy arrays.

In [7]:
import numpy as np

max_len = max(len(x) for x in X_train)

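def _pad(s, maxlen):
    # Left-pad with the PAD id (0); longer sentences keep only their last maxlen tokens
    pad_ = np.zeros(maxlen, dtype='int32')
    trunc = np.asarray(s[-maxlen:], dtype='int32')
    if len(trunc):  # an empty sentence stays all PAD
        pad_[-len(trunc):] = trunc
    return pad_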

X_train = np.array(map(lambda x: _pad(x, max_len), X_train))
X_dev = np.array(map(lambda x: _pad(x, max_len), X_dev))
X_train[123]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0, 14158, 10819, 32671,
       32837, 22163, 10819,  4200, 26828, 27247, 25524, 32837, 18384,
       28365,  2483, 17877,  5846,  7431, 22140, 25735, 14638, 28244,
       13358, 12938,   105, 16485,   337], dtype=int32)
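A short standalone example makes the two cases visible: shorter sentences are padded on the left, and longer ones keep only their last tokens. `left_pad` is a hypothetical copy of the padding logic, written so that it runs on its own.

```python
import numpy as np

def left_pad(ids, maxlen):
    out = np.zeros(maxlen, dtype='int32')              # 0 is the PAD id
    kept = np.asarray(ids[-maxlen:], dtype='int32')    # keep the last maxlen ids
    if len(kept):
        out[-len(kept):] = kept
    return out

print(left_pad([5, 6, 7], 5))            # [0 0 5 6 7]
print(left_pad([1, 2, 3, 4, 5, 6], 4))   # [3 4 5 6]
```

Because `max_len` comes from the training set, a longer test sentence loses its leading tokens.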

As a last step we map the labels of the three classes to one-hot vectors.

In [8]:
def labels(x):
    out_ = np.zeros(3, dtype='int32')
    out_[x] = 1
    return out_

y_train = np.array(map(labels, y_train))
y_dev = np.array(map(labels, y_dev))
y_dev

array([[0, 0, 1],
       [0, 0, 1],
       [0, 1, 0],
       ..., 
       [0, 0, 1],
       [0, 1, 0],
       [0, 1, 0]], dtype=int32)
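The same encoding can be obtained in one vectorized step by indexing the rows of an identity matrix. This is an alternative sketch, not the code used above; `class_ids` is an illustrative array.

```python
import numpy as np

class_ids = np.array([2, 2, 1, 0])             # positive, positive, neutral, negative
one_hot = np.eye(3, dtype='int32')[class_ids]  # row i of the identity is the one-hot vector for class i
print(one_hot.tolist())
# [[0, 0, 1], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Taking `one_hot.argmax(axis=1)` recovers the original class ids.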

## The network

We can now define the network.

In [13]:
np.random.seed(1337)
from keras.models import Sequential
from keras.layers import (Dropout,
                          Convolution1D,
                          GlobalMaxPooling1D,
                          Dense,
                          Embedding)
from keras.optimizers import Adadelta
from keras.regularizers import l2
from keras import backend as K

emb_dim = 100
conv_filters = 300

model = Sequential()
model.add(Embedding(len(dictionary), emb_dim, input_length=max_len)) #Embed
model.add(Convolution1D(nb_filter=conv_filters, filter_length=5, border_mode='same', activation='relu'))
model.add(GlobalMaxPooling1D()) #Encode
model.add(Dropout(0.3))
model.add(Dense(3, activation='softmax')) #Predict


model.compile(loss='categorical_crossentropy',
              optimizer=Adadelta(lr=1.0, rho=0.90, epsilon=1e-8),
              metrics=['accuracy'])
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_4 (Embedding)          (None, 68, 100)       3360900     embedding_input_4[0][0]          
____________________________________________________________________________________________________
convolution1d_4 (Convolution1D)  (None, 68, 300)       150300      embedding_4[0][0]                
____________________________________________________________________________________________________
globalmaxpooling1d_4 (GlobalMaxP (None, 300)           0           convolution1d_4[0][0]            
____________________________________________________________________________________________________
dropout_4 (Dropout)              (None, 300)           0           globalmaxpooling1d_4[0][0]       
___________________________________________________________________________________________
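The parameter counts in the summary can be checked by hand. The arithmetic below uses the vocabulary size printed earlier (33609) and the layer sizes from the model definition. The Dense layer is computed the same way, although its row is not shown in the output above.

```python
vocab_size, emb_dim = 33609, 100
filter_length, conv_filters = 5, 300
n_classes = 3

# Embedding: one emb_dim vector per word id
embedding_params = vocab_size * emb_dim
# Convolution: each filter sees filter_length * emb_dim inputs, plus one bias
conv_params = filter_length * emb_dim * conv_filters + conv_filters
# Dense: one weight per (pooled feature, class) pair, plus one bias per class
dense_params = conv_filters * n_classes + n_classes

print(embedding_params)   # 3360900
print(conv_params)        # 150300
print(dense_params)       # 903
```

Almost all of the parameters sit in the embedding table, which is why initializing it from pre-trained vectors can matter so much.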

In [14]:
from keras.callbacks import EarlyStopping, ModelCheckpoint

early_stopping = EarlyStopping(monitor='val_acc', patience=3, mode='max')

model_checkpoint = ModelCheckpoint('model.tra', save_best_only=True, mode='max', monitor='val_acc')

model.fit(X_train, y_train,
          batch_size=32,
          nb_epoch=1000,
          shuffle=True,
          validation_data=(X_dev, y_dev),
          callbacks=[early_stopping, model_checkpoint])

Train on 11338 samples, validate on 3813 samples
Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
Epoch 12/1000
Epoch 13/1000


<keras.callbacks.History at 0x114542b10>

In [15]:
from keras.models import load_model

model = load_model('model.tra')

def evaluate(file_name, gold):
    print(file_name)
    from sklearn.metrics import precision_recall_fscore_support
    from sklearn.metrics import classification_report
    
    X_, y_ = load_dataset(file_name, gold)
    X_ = map(word2id, X_)
    X_ = np.array(map(lambda x: _pad(x, max_len), X_))
    pred = model.predict_classes(X_, verbose=0)
    ev = precision_recall_fscore_support(y_, pred)
    f1 = (ev[2][0]+ev[2][2])/2
    print('Semeval F1 score: {} %'.format(f1*100))
    print(classification_report(y_, pred))


files = ["data/sms'13.csv",
         "data/test'13.csv",
         "data/test'15.csv"]

gold = [None,
        None,
        "data/SemEval2015-task10-test-B-gold.txt"]


for file_name, gold in zip(files, gold):
    evaluate(file_name, gold)

data/sms'13.csv
Semeval F1 score: 55.7038501905 %
             precision    recall  f1-score   support

          0       0.44      0.59      0.50       394
          1       0.82      0.69      0.75      1208
          2       0.59      0.64      0.61       492

avg / total       0.69      0.66      0.67      2094

data/test'13.csv
Semeval F1 score: 59.0717948718 %
             precision    recall  f1-score   support

          0       0.61      0.40      0.48       601
          1       0.65      0.82      0.73      1640
          2       0.75      0.65      0.70      1572

avg / total       0.69      0.68      0.68      3813

data/test'15.csv
Semeval F1 score: 52.4286331433 %
             precision    recall  f1-score   support

          0       0.49      0.37      0.42       365
          1       0.59      0.81      0.68       987
          2       0.74      0.54      0.63      1038

avg / total       0.64      0.63      0.62      2390
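The "Semeval F1 score" printed above is the average of the F1 scores of the negative (0) and positive (2) classes; the neutral class is left out. That is what `(ev[2][0]+ev[2][2])/2` computes. A minimal NumPy sketch of the same metric, with hypothetical helper names and made-up toy labels:

```python
import numpy as np

def per_class_f1(y_true, y_pred, cls):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    if tp == 0:
        return 0.0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

def semeval_f1(y_true, y_pred, negative=0, positive=2):
    # Average F1 over the negative and positive classes only
    return (per_class_f1(y_true, y_pred, negative) +
            per_class_f1(y_true, y_pred, positive)) / 2.0

y_true = [0, 0, 1, 1, 2, 2]   # toy gold labels
y_pred = [0, 1, 1, 1, 2, 0]   # toy predictions
print(semeval_f1(y_true, y_pred))
```

In this toy example the negative class has F1 = 0.5 and the positive class F1 = 2/3, so the score is their mean, 7/12.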



In [11]:
from gensim.models import Word2Vec

w2v = Word2Vec.load_word2vec_format('data/embeddings.bin', binary=True)

In [12]:
def emb_matrix(dictionary, model):
    embedding_matrix = np.random.uniform(-0.25, 0.25, (len(dictionary), 100))
    for word in dictionary:
        if word in model:
            embedding_matrix[dictionary[word]] = model[word]
    return embedding_matrix
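The effect of this function is easier to see on a toy vocabulary. In the sketch below a plain dict stands in for the word2vec model, and `build_matrix`, `vocab` and `vectors` are illustrative names. Words with a pre-trained vector get that vector; all other rows keep their random initialization in [-0.25, 0.25].

```python
import numpy as np

def build_matrix(vocab, vectors, dim, seed=0):
    rng = np.random.RandomState(seed)
    matrix = rng.uniform(-0.25, 0.25, (len(vocab), dim))
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = vectors[word]   # copy the pre-trained vector
    return matrix

vocab = {'PAD': 0, 'UNK': 1, 'good': 2, 'zzz': 3}
vectors = {'good': np.ones(4)}            # stand-in for a word2vec model

m = build_matrix(vocab, vectors, dim=4)
print(m[2])   # the pre-trained vector for 'good'
print(m[3])   # random values for the out-of-vocabulary word 'zzz'
```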
        

In [13]:
np.random.seed(1337)
from keras.models import Sequential
from keras.layers import (Dropout,
                          Convolution1D,
                          GlobalMaxPooling1D,
                          Dense,
                          Embedding)
from keras.optimizers import Adadelta
from keras.regularizers import l2

emb_dim = 100
conv_filters = 300

embeddings = Embedding(len(dictionary), 100, input_length=max_len, weights=[emb_matrix(dictionary, w2v)], trainable=True)

model = Sequential()
model.add(embeddings) #Embed
model.add(Convolution1D(nb_filter=conv_filters, filter_length=5, border_mode='same', activation='relu'))
model.add(GlobalMaxPooling1D()) #Encode
model.add(Dropout(0.3))
model.add(Dense(3, activation='softmax')) #Predict

model.compile(loss='categorical_crossentropy',
              optimizer=Adadelta(lr=1.0, rho=0.90, epsilon=1e-8),
              metrics=['accuracy'])
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_2 (Embedding)          (None, 68, 100)       3360900     embedding_input_3[0][0]          
____________________________________________________________________________________________________
convolution1d_2 (Convolution1D)  (None, 68, 300)       150300      embedding_2[0][0]                
____________________________________________________________________________________________________
globalmaxpooling1d_2 (GlobalMaxP (None, 300)           0           convolution1d_2[0][0]            
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 300)           0           globalmaxpooling1d_2[0][0]       
___________________________________________________________________________________________

In [54]:
from keras.callbacks import EarlyStopping, ModelCheckpoint
early_stopping = EarlyStopping(monitor='val_acc', patience=3, mode='max')
model_checkpoint = ModelCheckpoint('model.tra', save_best_only=True, mode='max', monitor='val_acc')

model.fit(X_train, y_train,
          batch_size=32,
          nb_epoch=1000,
          shuffle=True,
          validation_data=(X_dev, y_dev),
          callbacks=[early_stopping, model_checkpoint])

Train on 11338 samples, validate on 3813 samples
Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000


<keras.callbacks.History at 0x106b5afd0>

In [55]:
from keras.models import load_model

model = load_model('model.tra')

def evaluate(file_name, gold):
    print(file_name)
    from sklearn.metrics import precision_recall_fscore_support
    from sklearn.metrics import classification_report
    
    X_, y_ = load_dataset(file_name, gold)
    X_ = map(word2id, X_)
    X_ = np.array(map(lambda x: _pad(x, max_len), X_))
    pred = model.predict_classes(X_, verbose=0)
    ev = precision_recall_fscore_support(y_, pred)
    f1 = (ev[2][0]+ev[2][2])/2
    print('Semeval F1 score: {} %'.format(f1*100))
    print(classification_report(y_, pred))


files = ["data/sms'13.csv",
         "data/test'13.csv",
         "data/test'15.csv"]

gold = [None,
        None,
        "data/SemEval2015-task10-test-B-gold.txt"]


for file_name, gold in zip(files, gold):
    evaluate(file_name, gold)

data/sms'13.csv
Semeval F1 score: 65.3348018351 %
             precision    recall  f1-score   support

          0       0.53      0.81      0.64       394
          1       0.87      0.73      0.79      1208
          2       0.68      0.66      0.67       492

avg / total       0.76      0.73      0.73      2094

data/test'13.csv
Semeval F1 score: 66.6498282735 %
             precision    recall  f1-score   support

          0       0.66      0.58      0.62       601
          1       0.68      0.84      0.75      1640
          2       0.81      0.64      0.72      1572

avg / total       0.73      0.72      0.72      3813

data/test'15.csv
Semeval F1 score: 60.8250776346 %
             precision    recall  f1-score   support

          0       0.50      0.60      0.55       365
          1       0.62      0.79      0.70       987
          2       0.83      0.56      0.67      1038

avg / total       0.70      0.66      0.66      2390



In [14]:
np.random.seed(1337)
from keras.models import Sequential
from keras.layers import (Dropout,
                          GRU,
                          Bidirectional,
                          Dense,
                          Embedding)
from keras.optimizers import Adadelta
from keras.regularizers import l2

emb_dim = 100
conv_filters = 300

model = Sequential()
model.add(Embedding(len(dictionary), 100, input_length=max_len, weights=[emb_matrix(dictionary, w2v)], dropout=0.2)) #Embed
model.add(Bidirectional(GRU(150, activation='relu', return_sequences=True)))
model.add(GRU(300, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax')) #Predict

model.compile(loss='categorical_crossentropy',
              optimizer=Adadelta(lr=1.0, rho=0.90, epsilon=1e-8),
              metrics=['accuracy', 'fbeta_score'])
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_3 (Embedding)          (None, 68, 100)       3360900     embedding_input_4[0][0]          
____________________________________________________________________________________________________
bidirectional_1 (Bidirectional)  (None, 68, 300)       225900      embedding_3[0][0]                
____________________________________________________________________________________________________
gru_2 (GRU)                      (None, 300)           540900      bidirectional_1[0][0]            
____________________________________________________________________________________________________
dropout_3 (Dropout)              (None, 300)           0           gru_2[0][0]                      
___________________________________________________________________________________________
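The GRU parameter counts in the summary follow from the layer sizes. A GRU has three weight sets (update gate, reset gate, candidate state), each with input weights, recurrent weights and a bias. The formula below assumes one bias vector per weight set, which matches the counts printed above; some newer Keras configurations use extra biases and would report different numbers.

```python
def gru_params(input_dim, units):
    # 3 weight sets, each: input weights + recurrent weights + bias
    return 3 * (units * (input_dim + units) + units)

# Bidirectional(GRU(150)) on 100-dim embeddings: two independent GRUs
bidirectional_params = 2 * gru_params(100, 150)
# GRU(300) on the 300-dim concatenated output of the bidirectional layer
top_gru_params = gru_params(300, 300)

print(bidirectional_params)   # 225900
print(top_gru_params)         # 540900
```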

In [None]:
from keras.callbacks import EarlyStopping, ModelCheckpoint
early_stopping = EarlyStopping(monitor='val_acc', patience=3, mode='max')
model_checkpoint = ModelCheckpoint('model.tra', save_best_only=True, mode='max', monitor='val_acc')

model.fit(X_train, y_train,
          batch_size=32,
          nb_epoch=1000,
          shuffle=True,
          validation_data=(X_dev, y_dev),
          callbacks=[early_stopping, model_checkpoint])

In [None]:
from keras.models import load_model

model = load_model('model.tra')

def evaluate(file_name, gold):
    print(file_name)
    from sklearn.metrics import precision_recall_fscore_support
    from sklearn.metrics import classification_report
    
    X_, y_ = load_dataset(file_name, gold)
    X_ = map(word2id, X_)
    X_ = np.array(map(lambda x: _pad(x, max_len), X_))
    pred = model.predict_classes(X_, verbose=0)
    ev = precision_recall_fscore_support(y_, pred)
    f1 = (ev[2][0]+ev[2][2])/2
    print('Semeval F1 score: {} %'.format(f1*100))
    print(classification_report(y_, pred))


files = ["data/sms'13.csv",
         "data/test'13.csv",
         "data/test'15.csv"]

gold = [None,
        None,
        "data/SemEval2015-task10-test-B-gold.txt"]


for file_name, gold in zip(files, gold):
    evaluate(file_name, gold)

Next we combine a convolutional model with a recurrent one. First we build 5-gram representations with the convolution. To speed up training we shorten the input with max pooling (pool size 4): this layer takes the maximum not over the whole sentence but over every 4 consecutive n-gram embeddings. A gated recurrent unit (GRU) then produces the output vector.
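The difference between this local max pooling and the global max pooling of the earlier model can be sketched in NumPy. `max_pool_1d` is a hypothetical helper that mimics non-overlapping windows of 4 steps; the input is a dummy array with the same shape as the convolution output (68 steps, 300 filters).

```python
import numpy as np

def max_pool_1d(x, pool_length):
    # x has shape (steps, channels); trailing steps that do not fill a window are dropped
    steps = (x.shape[0] // pool_length) * pool_length
    windows = x[:steps].reshape(-1, pool_length, x.shape[1])
    return windows.max(axis=1)

x = np.arange(68 * 300, dtype='float32').reshape(68, 300)

pooled = max_pool_1d(x, 4)
print(pooled.shape)          # (17, 300): a shorter sequence, still ordered in time
print(x.max(axis=0).shape)   # (300,): global max pooling collapses the whole sequence
```

The GRU therefore reads 17 steps instead of 68, which is where the speed-up comes from.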

In [15]:
np.random.seed(1337)
from keras.models import Sequential
from keras.layers import (Dropout,
                          Convolution1D,
                          MaxPooling1D,
                          GRU,
                          Dense,
                          Embedding)
from keras.optimizers import Adadelta
from keras.regularizers import l2

emb_dim = 100
conv_filters = 300

embeddings = Embedding(len(dictionary), 100, input_length=max_len, weights=[emb_matrix(dictionary, w2v)], trainable=True)

model = Sequential()
model.add(embeddings) #Embed
model.add(Convolution1D(nb_filter=conv_filters, filter_length=5, border_mode='same', activation='relu'))
model.add(MaxPooling1D(4)) #Encode
model.add(GRU(300, activation='relu'))
#model.add(Dropout(0.3))
model.add(Dense(3, activation='softmax')) #Predict

model.compile(loss='categorical_crossentropy',
              optimizer=Adadelta(lr=1.0, rho=0.90, epsilon=1e-8),
              metrics=['accuracy', 'fbeta_score'])
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_4 (Embedding)          (None, 68, 100)       3360900     embedding_input_5[0][0]          
____________________________________________________________________________________________________
convolution1d_3 (Convolution1D)  (None, 68, 300)       150300      embedding_4[0][0]                
____________________________________________________________________________________________________
maxpooling1d_1 (MaxPooling1D)    (None, 17, 300)       0           convolution1d_3[0][0]            
____________________________________________________________________________________________________
gru_3 (GRU)                      (None, 300)           540900      maxpooling1d_1[0][0]             
___________________________________________________________________________________________

In [None]:
from keras.callbacks import EarlyStopping, ModelCheckpoint
early_stopping = EarlyStopping(monitor='val_acc', patience=3, mode='max')
model_checkpoint = ModelCheckpoint('model.tra', save_best_only=True, mode='max', monitor='val_acc')

model.fit(X_train, y_train,
          batch_size=32,
          nb_epoch=1000,
          shuffle=True,
          validation_data=(X_dev, y_dev),
          callbacks=[early_stopping, model_checkpoint])

In [None]:
model = load_model('model.tra')

def evaluate(file_name, gold):
    print(file_name)
    from sklearn.metrics import precision_recall_fscore_support
    from sklearn.metrics import classification_report
    
    X_, y_ = load_dataset(file_name, gold)
    X_ = map(word2id, X_)
    X_ = np.array(map(lambda x: _pad(x, max_len), X_))
    pred = model.predict_classes(X_, verbose=0)
    ev = precision_recall_fscore_support(y_, pred)
    f1 = (ev[2][0]+ev[2][2])/2
    print('Semeval F1 score: {} %'.format(f1*100))
    print(classification_report(y_, pred))


files = ["data/sms'13.csv",
         "data/test'13.csv",
         "data/test'15.csv"]

gold = [None,
        None,
        "data/SemEval2015-task10-test-B-gold.txt"]


for file_name, gold in zip(files, gold):
    evaluate(file_name, gold)