# Sentiment Classification with a Keras Embedding Layer
Using an RNN with LSTM units

TF-IDF = another bag-of-words encoding

**More information can be found in the Machine Learning document (Word)**

TensorFlow Embedding: https://www.tensorflow.org/programmers_guide/embedding

https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

https://github.com/transcranial/keras-js

In [2]:
from keras.datasets import imdb
import itertools
from numpy import array, asarray, zeros
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.preprocessing import sequence
from keras.layers import Embedding, LSTM, Dense, Dropout, Flatten
from keras.preprocessing.text import Tokenizer

## Example 1

In [5]:
vokabel_anz = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vokabel_anz)
print('{} training samples, {} test samples'.format(len(X_train), len(X_test)))

25000 training samples, 25000 test samples


In [6]:
print('---movie review---')
print(X_train[6])
print('---label---')
print(y_train[6])

---movie review---
[1, 2, 365, 1234, 5, 1156, 354, 11, 14, 2, 2, 7, 1016, 2, 2, 356, 44, 4, 1349, 500, 746, 5, 200, 4, 4132, 11, 2, 2, 1117, 1831, 2, 5, 4831, 26, 6, 2, 4183, 17, 369, 37, 215, 1345, 143, 2, 5, 1838, 8, 1974, 15, 36, 119, 257, 85, 52, 486, 9, 6, 2, 2, 63, 271, 6, 196, 96, 949, 4121, 4, 2, 7, 4, 2212, 2436, 819, 63, 47, 77, 2, 180, 6, 227, 11, 94, 2494, 2, 13, 423, 4, 168, 7, 4, 22, 5, 89, 665, 71, 270, 56, 5, 13, 197, 12, 161, 2, 99, 76, 23, 2, 7, 419, 665, 40, 91, 85, 108, 7, 4, 2084, 5, 4773, 81, 55, 52, 1901]
---label---
1


In [7]:
# Mapping Integers to Words
word2id = imdb.get_word_index()
print('---word2id---')
print(dict(itertools.islice(word2id.items(), 10)))

id2word = {i: word for word, i in word2id.items()}
print('---id2word---')
print(dict(itertools.islice(id2word.items(), 10)))

print('---review with words---')
print([id2word.get(i, ' ') for i in X_train[6]])
print('---label---')
print(y_train[6])

---word2id---
{'fawn': 34701, 'tsukino': 52006, 'nunnery': 52007, 'sonja': 16816, 'vani': 63951, 'woods': 1408, 'spiders': 16115, 'hanging': 2345, 'woody': 2289, 'trawling': 52008}
---id2word---
{34701: 'fawn', 52006: 'tsukino', 52007: 'nunnery', 16816: 'sonja', 63951: 'vani', 1408: 'woods', 16115: 'spiders', 2345: 'hanging', 2289: 'woody', 52008: 'trawling'}
---review with words---
['the', 'and', 'full', 'involving', 'to', 'impressive', 'boring', 'this', 'as', 'and', 'and', 'br', 'villain', 'and', 'and', 'need', 'has', 'of', 'costumes', 'b', 'message', 'to', 'may', 'of', 'props', 'this', 'and', 'and', 'concept', 'issue', 'and', 'to', "god's", 'he', 'is', 'and', 'unfolds', 'movie', 'women', 'like', "isn't", 'surely', "i'm", 'and', 'to', 'toward', 'in', "here's", 'for', 'from', 'did', 'having', 'because', 'very', 'quality', 'it', 'is', 'and', 'and', 'really', 'book', 'is', 'both', 'too', 'worked', 'carl', 'of', 'and', 'br', 'of', 'reviewer', 'closer', 'figure', 'really', 'there', 'will'
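
The decoded text above reads oddly because `imdb.load_data` by default shifts every word index by 3 (`index_from=3`) and reserves 0 for padding, 1 for the start marker and 2 for out-of-vocabulary words, while `imdb.get_word_index()` returns the unshifted ranks. A minimal sketch of a decoder that accounts for this, assuming those default settings; `decode_review` and the toy `toy_word_index` are illustrative names, not part of Keras:

```python
def decode_review(encoded, word_index, index_from=3):
    """Turn a list of shifted word ids back into a readable string."""
    # Reserved ids under the default load_data settings (assumption)
    id2word = {0: "<PAD>", 1: "<START>", 2: "<OOV>"}
    # Shift the raw ranks so they line up with the ids stored in X_train
    for word, rank in word_index.items():
        id2word[rank + index_from] = word
    return " ".join(id2word.get(i, "<OOV>") for i in encoded)

# Toy stand-in for imdb.get_word_index()
toy_word_index = {"the": 1, "and": 2, "a": 3, "of": 4}
print(decode_review([1, 4, 2, 7, 0], toy_word_index))
# <START> the <OOV> of <PAD>
```

With the real data this would be called as `decode_review(X_train[6], imdb.get_word_index())`, before the reviews are padded.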

In [8]:
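# Note: X_train and X_test are arrays of lists, so `+` most likely works element-wise
# (review i of the training set joined with review i of the test set);
# the numbers below would then describe such pairs rather than single reviews.
print("longest review with {} tokens".format(len(max((X_train + X_test), key=len))))
print("shortest review with {} tokens".format(len(min((X_train + X_test), key=len))))

longest review with 2697 tokens
shortest review with 70 tokens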
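
Because `X_train + X_test` on two object arrays probably adds them element by element (joining the i-th training review to the i-th test review), the figures above may describe pairs rather than single reviews. A small sketch that measures individual reviews instead; `length_range` is a hypothetical helper and the toy lists stand in for the real data:

```python
from itertools import chain

def length_range(*datasets):
    """Return (shortest, longest) review length over all given datasets."""
    lengths = [len(review) for review in chain(*datasets)]
    return min(lengths), max(lengths)

toy_train = [[1, 14, 22], [1, 5]]
toy_test = [[1, 7, 9, 3, 8]]
shortest, longest = length_range(toy_train, toy_test)
print("shortest: {} tokens, longest: {} tokens".format(shortest, longest))
# shortest: 2 tokens, longest: 5 tokens
```

On the notebook data the call would be `length_range(X_train, X_test)`.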


## Pad Sequences
**For data with a varying number of words per text**

To feed the RNN, all texts must have the same length. Texts that are too long are truncated (not split into several shorter texts); texts that are too short are padded with zeros. By default pad_sequences() pads and truncates at the beginning of the sequence.
--> use the pad_sequences() function in Keras

In [9]:
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
# the matrix now has 500 columns --> reviews shorter than 500 words are padded with 0, longer reviews are truncated to 500

In [36]:
len(X_test[0])

500
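
To make the default behaviour concrete, here is a pure-Python sketch that imitates what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): short sequences get zeros in front, long ones lose their first tokens. `pad_pre` is an illustrative name; it is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                      # drop tokens from the front
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[0, 0, 5, 6], [3, 4, 5, 6]]
```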

## Embedding Layer
* is defined as the first layer of the network
* is initialized with random weights and learns an embedding for every word in the training set
* i.e. the weights (vectors) of the words are learned
* Prerequisite: the words of the input texts must be converted to integer indices into the vocabulary
* takes 3 main arguments, plus the optional trainable flag:
    * **input_dim**: size of the vocabulary (e.g. integers from 0 to 100 give a vocabulary size of 101)
    * **output_dim**: size/dimension of the output vector into which each word is embedded
    * **input_length**: length of the input sequence (e.g. number of columns of the input matrix), e.g. 4 after pad_sequences
    * **trainable**: if False, the word vectors are not updated during training
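
Conceptually the layer is a lookup table with `input_dim` rows and `output_dim` columns. The NumPy sketch below illustrates the shapes only; the random matrix stands in for the learned weights and is not how Keras stores them internally. The concrete sizes are assumptions chosen for the example.

```python
import numpy as np

input_dim, output_dim, input_length = 101, 8, 4

rng = np.random.default_rng(0)
# One row per vocabulary index, randomly initialised like an untrained layer
weights = rng.normal(size=(input_dim, output_dim))

# A batch of two padded documents, each with input_length word indices
batch = np.array([[11, 40, 0, 0],
                  [7, 21, 40, 10]])

embedded = weights[batch]   # integer-array indexing = one row lookup per word index
print(embedded.shape)       # (2, 4, 8)
```

Each document therefore becomes an `input_length` x `output_dim` matrix, and the same word index always yields the same row.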

#### mask_zero (Embedding layer argument)
Whether the input value 0 is a padding value that should be masked out.
This is useful with recurrent layers, which can take inputs of variable length.
If it is True, all subsequent layers in the model must support masking, otherwise an exception is raised.
If mask_zero is set to True, index 0 cannot be used in the vocabulary (input_dim should be the vocabulary size + 1).
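
The mask itself is just a boolean array marking the non-zero positions. A NumPy sketch of the idea (Keras computes and propagates the mask internally; this only illustrates what it contains, using made-up padded input):

```python
import numpy as np

padded = np.array([[6, 2, 0, 0],
                   [12, 13, 2, 14]])

mask = padded != 0               # True where a real word is present
real_lengths = mask.sum(axis=1)  # number of unmasked time steps per document

print(mask)
print(real_lengths)              # [2 4]
```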


The output of the Embedding layer is a 2D matrix per input document: one embedding vector for each word in the input sequence.
To use a Dense layer directly after an Embedding layer, a Flatten layer has to be inserted first (2D matrix to 1D vector).

In [10]:
embedding_size=32
model=Sequential()
model.add(Embedding(vokabel_anz, embedding_size, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None
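
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the function names are illustrative.

```python
def embedding_params(input_dim, output_dim):
    # One output_dim-sized vector per vocabulary entry
    return input_dim * output_dim

def lstm_params(input_size, units):
    # 4 gates, each with input weights, recurrent weights and one bias per unit
    return 4 * (input_size * units + units * units + units)

def dense_params(input_size, units):
    # Weight matrix plus one bias per output unit
    return input_size * units + units

counts = [
    embedding_params(5000, 32),   # 160000
    lstm_params(32, 100),         # 53200
    dense_params(100, 1),         # 101
]
print(counts, sum(counts))        # [160000, 53200, 101] 213301
```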


In [11]:
model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])

In [12]:
batch_size = 64
num_epochs = 3

X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]

model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)

Train on 24936 samples, validate on 64 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x29424869fd0>

In [18]:
model.save("./Models/keras_sentiment_analysis.h5")

In [19]:
from keras.models import load_model
new_model = load_model("./Models/keras_sentiment_analysis.h5")

In [23]:
scores = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', scores[1])

KeyboardInterrupt: 

In [40]:
#review = [1, 2, 365, 1234, 5, 1156, 354, 11, 14, 2, 2, 7, 1016, 2, 2, 356, 44, 4, 1349, 500, 746, 5, 200, 4, 4132, 11, 2, 2, 1117, 1831, 2, 5, 4831, 26, 6, 2, 4183, 17, 369, 37, 215, 1345, 143, 2, 5, 1838, 8, 1974, 15, 36, 119, 257, 85, 52, 486, 9, 6, 2, 2, 63, 271, 6, 196, 96, 949, 4121, 4, 2, 7, 4, 2212, 2436, 819, 63, 47, 77, 2, 180, 6, 227, 11, 94, 2494, 2, 13, 423, 4, 168, 7, 4, 22, 5, 89, 665, 71, 270, 56, 5, 13, 197, 12, 161, 2, 99, 76, 23, 2, 7, 419, 665, 40, 91, 85, 108, 7, 4, 2084, 5, 4773, 81, 55, 52, 1901]
#review_pad = sequence.pad_sequences([review], maxlen=500)  # pad_sequences expects a list of sequences
model.predict(X_test[0:2])

array([[ 0.01096611],
       [ 0.9954626 ]], dtype=float32)
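
The sigmoid output is a probability for the positive class. A minimal sketch of turning such scores into labels, assuming the usual 0.5 threshold (the threshold is a choice, not something the model prescribes):

```python
import numpy as np

# Scores copied from the prediction output above
probs = np.array([[0.01096611],
                  [0.9954626]], dtype=np.float32)

labels = (probs > 0.5).astype(int).ravel()   # 1 = positive, 0 = negative
print(labels)   # [0 1]
```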

## Example 2
from https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

In [2]:
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
# integer encode the documents
vocab_size = 50

### one hot encode

In [3]:
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

[[11, 40], [18, 13], [37, 48], [42, 13], [17], [42], [16, 48], [1, 18], [16, 13], [7, 21, 40, 10]]
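
Despite its name, Keras' `one_hot` does not build one-hot vectors: it hashes each word into the range 1 to `vocab_size - 1`, so different words can collide (in the output above both 'nice' and 'Weak' map to 42). The sketch below imitates that idea with a deterministic MD5 hash; the hash choice and the helper name `hash_encode` are assumptions for illustration, so the indices will not match the Keras output.

```python
import hashlib
import string

def hash_encode(text, n):
    """Map each word to an index in 1..n-1 via a stable hash."""
    # Lowercase and strip punctuation, roughly like the Keras defaults
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    indices = []
    for word in cleaned.split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

print(hash_encode("Good work", 50))
print(hash_encode("not good", 50))   # 'good' gets the same index as in the line above
```

A larger `n` makes collisions less likely; a `Tokenizer` avoids them entirely by assigning each distinct word its own index.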


### pad-sequence

In [4]:
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[11 40  0  0]
 [18 13  0  0]
 [37 48  0  0]
 [42 13  0  0]
 [17  0  0  0]
 [42  0  0  0]
 [16 48  0  0]
 [ 1 18  0  0]
 [16 13  0  0]
 [ 7 21 40 10]]


### Model + Training

In [7]:
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length)) 
#(50, 8, 4) --> 50 words in the vocabulary, 8 columns in the output matrix (per word), 4 columns in the input matrix
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

In [15]:
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 4, 8)              400       
_________________________________________________________________
flatten_1 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
None


In [19]:
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
print('Loss: %f' % loss)

Accuracy: 100.000000
Loss: 0.283767


# Example with Pretrained Word Vectors (GloVe Embedding)
from https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

In [3]:
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])

In [8]:
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
print(t.word_index)
print("Vocabulary size: {}".format(vocab_size))

{'work': 1, 'done': 2, 'good': 3, 'effort': 4, 'poor': 5, 'well': 6, 'great': 7, 'nice': 8, 'excellent': 9, 'weak': 10, 'not': 11, 'could': 12, 'have': 13, 'better': 14}
Vocabulary size: 15


In [11]:
# integer encode the documents
# the examples must be encoded as integers
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)

[[6, 2], [3, 1], [7, 4], [8, 1], [9], [10], [5, 4], [11, 3], [5, 1], [12, 13, 2, 14]]


In [13]:
# pad documents to a max length of 4 words
# the sequences must be padded to the same length
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[ 6  2  0  0]
 [ 3  1  0  0]
 [ 7  4  0  0]
 [ 8  1  0  0]
 [ 9  0  0  0]
 [10  0  0  0]
 [ 5  4  0  0]
 [11  3  0  0]
 [ 5  1  0  0]
 [12 13  2 14]]


In [16]:
def is_ascii(text):
    """Return True if text contains only ASCII characters."""
    try:
        text.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

In [20]:
embeddings_dict = dict()
with open('../../Daten/glove.6B/glove.6B.50d.txt', "r", encoding="utf-8") as file:
    for line in file:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')

        if word not in embeddings_dict or word.strip() == "":
            embeddings_dict[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_dict))
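
Each line of the GloVe file is a word followed by its vector components, separated by spaces. The parsing step can be tried without the large file by feeding a few made-up lines to a small helper; `parse_glove_lines` and the sample values are hypothetical.

```python
import io
import numpy as np

def parse_glove_lines(lines):
    """Build a {word: vector} dict from GloVe-style text lines."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue                      # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

# In-memory stand-in for the real file (3-dimensional toy vectors)
sample = io.StringIO("good 0.1 0.2 0.3\nwork -0.4 0.5 0.6\n\n")
vecs = parse_glove_lines(sample)
print(len(vecs), vecs["good"].shape)      # 2 (3,)
```

For the real file the helper would receive the open file handle instead of the `StringIO` object.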

Loaded 400000 word vectors.


In [23]:
# each line holds a word followed by its weights; the number of values equals the file's dimensionality (50 for glove.6B.50d, 100 for the 100d file)
embeddings_dict['the']

array([ 4.1800e-01,  2.4968e-01, -4.1242e-01,  1.2170e-01,  3.4527e-01,
       -4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
        2.7843e-01, -1.4767e-01, -5.5677e-01,  1.4658e-01, -9.5095e-03,
        1.1658e-02,  1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
       -1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
       -1.8823e+00, -7.6746e-01,  9.9051e-02, -4.2125e-01, -1.9526e-01,
        4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01,  5.9213e-04,
        7.4449e-03,  1.7778e-01, -1.5897e-01,  1.2041e-02, -5.4223e-02,
       -2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
        1.8785e-01,  2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01],
      dtype=float32)

In [29]:
# words must be mapped to numbers (and numbers back to words)
# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 50)) # matrix with 15 rows and 50 columns
for word, i in t.word_index.items(): #{'work': 1, 'done': 2,...}
    embedding_vector = embeddings_dict.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
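
Words that have no entry in the pretrained dictionary keep an all-zero row in the matrix. The sketch below builds the matrix for a toy vocabulary and also reports which words were missing; `build_embedding_matrix` and the toy vectors are illustrative, not from the original notebook.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 stays zero."""
    matrix = np.zeros((len(word_index) + 1, dim))
    missing = []
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is None:
            missing.append(word)          # row stays all zeros
        else:
            matrix[i] = vec
    return matrix, missing

toy_index = {"work": 1, "done": 2, "zzyzx": 3}
toy_vectors = {"work": np.ones(5), "done": np.full(5, 2.0)}
matrix, missing = build_embedding_matrix(toy_index, toy_vectors, dim=5)
print(matrix.shape, missing)              # (4, 5) ['zzyzx']
```

Checking `missing` before training shows how much of the vocabulary the pretrained vectors actually cover.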

In [31]:
print(len(embedding_matrix))
print(embedding_matrix)

15
[[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00]
 [ 5.13589978e-01  1.96950004e-01 -5.19439995e-01 -8.62179995e-01
   1.54940002e-02  1.09729998e-01 -8.02929997e-01 -3.33609998e-01
  -1.61189993e-04  1.01889996e-02  4.6

In [34]:
# define model
model = Sequential()

### Embedding Layer

In [35]:
e = Embedding(vocab_size, 50, weights=[embedding_matrix], input_length=max_length, trainable=False)
# max_length = 4 -> at most 4 words per sentiment text
# output_dim must match the dimensionality of the pretrained weights (50)
# trainable=False -> we do not want to change the pretrained weights

In [36]:
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 4, 50)             750       
_________________________________________________________________
flatten_2 (Flatten)          (None, 200)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 201       
Total params: 951
Trainable params: 201
Non-trainable params: 750
_________________________________________________________________
None


In [37]:
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 89.999998
