# Sentiment Model with GloVe Embeddings

> In this notebook we build a sentiment classifier from comments (texts) scraped from YouTube, and show how to use the `Tokenizer` to encode a document and pair it with `GloVe` pretrained weights.

- Note: At the time of writing, this tokenizer does not fully integrate with models like `Bert` and `Albert`; the `transformers` library already covers that. [huggingface-transformers github](https://github.com/huggingface/transformers)

**Let's start by first loading the dataset**

In [3]:
from david.tokenizers import YTCommentsDataset, Tokenizer

# For this demo, 1600 training samples (80% of 2000) are enough, but you can choose up to 6k samples.
train_dataset, _ = YTCommentsDataset.split_train_test(2000, subset=0.8)
print('comment:', train_dataset[0])
print('samples:', len(train_dataset))

comment: This is very Good Way to Wake up myself from dreaming Fairy Life. Feeling Energetic Now.
samples: 1600


In [4]:
# Construct a Tokenizer object and pass a document to build the vocabulary.
tokenizer = Tokenizer(document=train_dataset)
print(tokenizer)

<Tokenizer(vocab_size=7828)>


In [6]:
def print_vocab(n=5):
    # Lazy printing method to inspect the tokenizer's inner-workings.
    print("* tokens-to-index: {}\n* tokens-to-count: {}".format(
        tokenizer.bag_of_tokens(n), tokenizer.most_common(n)))

print_vocab(5)

* tokens-to-index: [('this', 1), ('is', 2), ('very', 3), ('good', 4), ('way', 5)]
* tokens-to-count: [('.', 2215), ('the', 2102), (',', 1613), ('i', 1297), ('to', 1286)]
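
To make the two views above concrete, here is a minimal sketch in plain Python of how such a vocabulary could be built. The `build_vocab` helper and its whitespace tokenization are assumptions for illustration, not the `Tokenizer`'s actual implementation: each token gets an index in order of first appearance (starting at 1), while a `Counter` tracks how often it occurs.

```python
from collections import Counter

def build_vocab(document):
    """Hypothetical helper: index tokens by first appearance and count them."""
    token_to_index = {}
    token_to_count = Counter()
    for text in document:
        # Naive lowercase + whitespace split; a real tokenizer is more careful.
        for token in text.lower().split():
            token_to_count[token] += 1
            if token not in token_to_index:
                # Indices start at 1, leaving 0 free (e.g. for padding).
                token_to_index[token] = len(token_to_index) + 1
    return token_to_index, token_to_count

docs = ["this is very good", "this is the way"]
token_to_index, token_to_count = build_vocab(docs)
print(list(token_to_index.items())[:3])
print(token_to_count.most_common(2))
```

Note that the index order (first seen) and the count order (most frequent) are independent at this point, which is exactly what the two printed lists show.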


> That was easy, the `Tokenizer` did all the hard work for us! So is that it? Well, no. Let's see what else you can do with it. We did not import a `class` for one line of code!

**- Next : ( `encoding / decoding` )**

In [8]:
# Encoding strings to the vocabulary's index:
text = "hello, world! this text was embedded with youtube comments! 😁"
indexed_text = tokenizer.convert_string_to_ids(text)
print(indexed_text)

[519, 21, 567, 66, 1, 99, 216, 2726, 456, 61, 3378, 66, 847]


In [9]:
# Decoding indexed sequences to tokens of string sequences:
tokenized_index = tokenizer.convert_ids_to_tokens(indexed_text)
print(tokenized_index)

['hello', ',', 'world', '!', 'this', 'text', 'was', 'embedded', 'with', 'youtube', 'comments', '!', '😁']


In [10]:
# Decode from index to string:
print(tokenizer.convert_ids_to_string(indexed_text))

hello, world! this text was embedded with youtube comments! 😁


> Remember, you can **`convert`** from any input state to whatever `x` state you want with one call. The `Tokenizer` has you covered!

```python
convert_ids_to_string, convert_ids_to_tokens
convert_string_to_ids, convert_string_to_tokens
convert_tokens_to_ids, convert_tokens_to_string
```
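
One way to think about these six conversions: they are compositions of two basic mappings, string to tokens and tokens to ids, plus their inverses. The sketch below uses hypothetical stand-in functions and a toy vocabulary; the whitespace tokenization and the choice to skip unknown tokens are assumptions, not the `Tokenizer`'s documented behavior.

```python
def string_to_tokens(string):
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    return string.lower().split()

def tokens_to_ids(tokens, token_to_index):
    # Assumption: tokens missing from the vocabulary are skipped.
    return [token_to_index[t] for t in tokens if t in token_to_index]

def ids_to_tokens(ids, token_to_index):
    # Invert the vocabulary so ids can be looked up.
    index_to_token = {i: t for t, i in token_to_index.items()}
    return [index_to_token[i] for i in ids]

def string_to_ids(string, token_to_index):
    # Composition: string -> tokens -> ids.
    return tokens_to_ids(string_to_tokens(string), token_to_index)

def ids_to_string(ids, token_to_index):
    # Composition: ids -> tokens -> string (joined with spaces here).
    return " ".join(ids_to_tokens(ids, token_to_index))

toy_vocab = {"hello": 1, "world": 2, "!": 3}
ids = string_to_ids("Hello world !", toy_vocab)
print(ids)
print(ids_to_string(ids, toy_vocab))
```

Unlike the real `Tokenizer`, this sketch does not reattach punctuation when joining tokens back into a string.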

**- Next : `encoding the dataset`**

> In order to do any cool *'Machine Learning'* our dataset needs to meet the following steps...

In [14]:
# ...which the tokenizer will do for us :) Transform the dataset into sequences of vocabulary indices.
sequences = tokenizer.document_to_sequences(document=train_dataset)

# The method yields each item lazily, but our dataset is small, so holding it all in memory won't hurt!
sequences = list(sequences)

Applying the needed requirements for fitting the documents. Calling
`self.vocab_index_to_frequency` for you...


**What is that? We forgot to explicitly call a method, so the tokenizer called it for us. Let me just show you!**

- before calling: `tokenizer.document_to_sequences`.

```python
* tokens-to-index: [('this', 1), ('is', 2), ('very', 3), ('good', 4), ('way', 5)]
* tokens-to-count: [('.', 2215), ('the', 2102), (',', 1613), ('i', 1297), ('to', 1286)]
```

- after calling: `tokenizer.document_to_sequences`.

```python
* tokens-to-index: [('.', 1), ('the', 2), (',', 3), ('i', 4), ('to', 5)]
* tokens-to-count: [('.', 2215), ('the', 2102), (',', 1613), ('i', 1297), ('to', 1286)]
```

> Long story short, the `Tokenizer` did not do any stupid magic behind our backs. The message is there to let us know, for next time, that `x` needs to happen before this step.

Furthermore, if you look at the order of the current **`tokens-to-index`**, you will notice that the tokens `['.', 'the', ',', 'i']` now follow the frequency order, the same order as the tokens in **`tokens-to-count`**. We can say the vocabulary is now aligned to its `term` frequency. It is not a hard problem, as it simply re-indexes the tokens by the term frequency found in the dataset. This `indexing` thing is your friend and you need to know when it happens!
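
As a rough illustration of the re-indexing step, the sketch below ranks tokens by count and assigns indices from 1. The `reindex_by_frequency` helper is hypothetical (the `Tokenizer` may do this differently); the counts are the ones printed above, and the "old" index order is made up for the example.

```python
from collections import Counter

def reindex_by_frequency(token_to_count):
    """Hypothetical helper: the most frequent token gets index 1, and so on."""
    ranked = token_to_count.most_common()
    return {token: rank for rank, (token, _count) in enumerate(ranked, start=1)}

# Counts taken from the output above; the "old" index order is made up.
counts = Counter({"to": 1286, "i": 1297, ",": 1613, "the": 2102, ".": 2215})
old_index = {token: i for i, token in enumerate(counts, start=1)}
new_index = reindex_by_frequency(counts)

print(list(new_index.items()))

# Ids encoded with the old index now point at different tokens.
old_ids = [old_index["to"], old_index["i"]]
new_lookup = {i: token for token, i in new_index.items()}
print([new_lookup[i] for i in old_ids])
```

The second print is the reason to care about when re-indexing happens: ids produced before it no longer decode to the same tokens afterwards, so they should be re-encoded.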

**Next : `( loading / saving )`**

In [15]:
print_vocab(5)

* tokens-to-index: [('.', 1), ('the', 2), (',', 3), ('i', 4), ('to', 5)]
* tokens-to-count: [('.', 2215), ('the', 2102), (',', 1613), ('i', 1297), ('to', 1286)]


In [16]:
import os
# How to save the vocabulary from the tokenizer.
# You need to handle your own file and directory paths.
VOCAB_BASE = "vocab"
# exist_ok=True makes this a no-op when the directory already exists.
os.makedirs(VOCAB_BASE, exist_ok=True)
VOCAB_FILE = os.path.join(VOCAB_BASE, "vocab.pkl")
VECTORS_FILE = os.path.join(VOCAB_BASE, "vectors.pkl")

# one way to save the vocab if you really just want the vocabulary.
tokenizer.save_vocabulary(VOCAB_FILE)

ℹ INFO: Using `self.save_vectors` and `self.load_vectors` is
recommended over simply saving the vocabulary as it saves both states from the
vocab_index and vocab_count dict objects Both which improve the tokenizer's
features.


In [18]:
# This way we can 'restore' our instance without having to load the dataset again!
tokenizer.save_vectors(VECTORS_FILE)
del tokenizer # just to prove my point
tokenizer

NameError: name 'tokenizer' is not defined

In [20]:
tokenizer = Tokenizer(VECTORS_FILE)
tokenizer

<Tokenizer(vocab_size=7828)>
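
Conceptually, restoring the tokenizer only requires persisting both dictionaries together. The sketch below shows that round trip with `pickle`; the on-disk layout is an assumption suggested by the `.pkl` extension, not the confirmed format used by `save_vectors`, and the `state` dictionary is a toy stand-in.

```python
import os
import pickle
import tempfile

# Toy stand-in for the tokenizer state: index and count saved side by side.
state = {
    "vocab_index": {".": 1, "the": 2, ",": 3},
    "vocab_count": {".": 2215, "the": 2102, ",": 1613},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vectors.pkl")
    # Save both dictionaries in a single file.
    with open(path, "wb") as f:
        pickle.dump(state, f)
    # Load them back without touching the original dataset.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == state)
```

Saving only the index would lose the counts, which is why the info message above recommends the vectors file over the plain vocabulary file.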

In [21]:
# Everything is loaded like nothing happened, including the vocab ordered by term frequency.
print(tokenizer.convert_string_to_tokens(text))

['hello', ',', 'world', '!', 'this', 'text', 'was', 'embedded', 'with', 'youtube', 'comments', '!', '😁']


In [22]:
print_vocab(5)

* tokens-to-index: [('.', 1), ('the', 2), (',', 3), ('i', 4), ('to', 5)]
* tokens-to-count: [('.', 2215), ('the', 2102), (',', 1613), ('i', 1297), ('to', 1286)]


In [23]:
# last trick, you can also get the vocab as [(token, freq, index)]
vocab_vectors = tokenizer.vocab_to_vectors()[:5]
print(vocab_vectors)

[('.', 2215, 1), ('the', 2102, 2), (',', 1613, 3), ('i', 1297, 4), ('to', 1286, 5)]


## Embedding sequences with GloVe's pretrained weights

> The `david.tokenizers.Tokenizer` class made preprocessing and encoding a lot easier, and using the `GloVe` embeddings is just as easy: one line of code! We simply pass the indexed vocabulary and choose the vector dimension we want to use.

In [24]:
from david.models import GloVe
glove_embeddings = GloVe.fit_embeddings(tokenizer.vocab_index, vocab_dim="100d")

✔ Loading vocab file from
/home/ego/david_models/glove/glove.6B/glove.6B.100d.txt
✔ num-dim:(100), vocab-size: 7829
✔ *** embedding vocabulary 🤗 ***


In [25]:
# That's it! We now have an embedding matrix: one row of GloVe weights per vocabulary index.
glove_embeddings.shape

(7829, 100)
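
The shape has one more row than the vocabulary size (7829 vs. 7828), presumably because indices start at 1 and row 0 is left for padding. The sketch below shows the usual way to build such a matrix from a GloVe text file, where each line is a token followed by its space-separated vector. `build_embedding_matrix` is a hypothetical helper, not the `GloVe.fit_embeddings` implementation, and the in-memory "file" is fake data.

```python
import io
import numpy as np

def build_embedding_matrix(glove_file, vocab_index, dim):
    """Hypothetical helper: one row per vocabulary index, row 0 left as zeros."""
    matrix = np.zeros((len(vocab_index) + 1, dim), dtype="float32")
    for line in glove_file:
        # GloVe text format: "<token> <v1> <v2> ... <vdim>"
        token, *values = line.rstrip().split(" ")
        index = vocab_index.get(token)
        if index is not None:
            matrix[index] = np.asarray(values, dtype="float32")
    return matrix

# Fake 3-dimensional vectors standing in for glove.6B.100d.txt.
fake_glove = io.StringIO("the 0.1 0.2 0.3\n. 0.4 0.5 0.6\ncat 0.7 0.8 0.9\n")
vocab_index = {".": 1, "the": 2, "😁": 3}
matrix = build_embedding_matrix(fake_glove, vocab_index, dim=3)
print(matrix.shape)
print(matrix)
```

In this sketch, tokens that are not in the pretrained file (here the emoji) keep an all-zero row, and pretrained tokens that are not in our vocabulary (`cat`) are ignored.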

In [26]:
from typing import List, Sequence
from david.text import get_sentiment_polarity

def get_sentiment_labels(sequences: List[Sequence[int]]) -> List[int]:
    '''Overkill for obtaining sentiment scores. But it's an easy
    way to show how we can use the tokenizer to decode the encoded
    sequences back to strings.'''
    labels = []
    for sequence in sequences:
        string = tokenizer.convert_ids_to_string(sequence)
        polarity = get_sentiment_polarity(string)
        # Neutral polarity (0) is grouped with the negative class.
        labels.append(1 if polarity > 0 else 0)
    return labels

sentiment_labels = get_sentiment_labels(sequences=sequences)
print('sentiment labels:', sentiment_labels[:5])
print('dataset / labels:', (len(sentiment_labels), len(sequences)))

sentiment labels: [1, 1, 0, 0, 0]
dataset / labels: (1600, 1600)


## Building the sentiment model (neural-network) with Keras

> After preprocessing and encoding the dataset of YouTube comments, we can create a Sequential model on top of the `glove embeddings`.

In [27]:
from david.text import largest_string_sequence

from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten

Using TensorFlow backend.


In [28]:
vocab_size, dimensions = glove_embeddings.shape
largest_input_length = largest_string_sequence(document=train_dataset,
                                               tokenizer=tokenizer.tokenize) 
model = Sequential()
embedding_layer = Embedding(vocab_size, dimensions,
                            weights=[glove_embeddings],
                            input_length=largest_input_length,
                            trainable=False)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["acc"])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 798, 100)          782900    
_________________________________________________________________
flatten_1 (Flatten)          (None, 79800)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 79801     
Total params: 862,701
Trainable params: 79,801
Non-trainable params: 782,900
_________________________________________________________________
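
The parameter counts in the summary can be checked by hand. The arithmetic below uses only the numbers printed above (7829 embedding rows, 100 dimensions, input length 798); the variable names are just for illustration.

```python
vocab_rows, dim, seq_len = 7829, 100, 798

embedding_params = vocab_rows * dim    # frozen GloVe weights (non-trainable)
flatten_units = seq_len * dim          # Flatten has no parameters, it only reshapes
dense_params = flatten_units * 1 + 1   # one weight per flattened unit plus one bias

print(embedding_params)                 # 782900
print(flatten_units)                    # 79800
print(dense_params)                     # 79801
print(embedding_params + dense_params)  # 862701
```

Because the embedding layer is created with `trainable=False`, only the 79,801 parameters of the dense layer are updated during training.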


In [29]:
from keras.preprocessing.sequence import pad_sequences

# Here we use the sequences from the tokenizer and we can now train our model
padded_sequences = pad_sequences(sequences, largest_input_length, padding="post")
model.fit(padded_sequences, sentiment_labels, epochs=100, verbose=1)

Epoch 1/100
Epoch 2/100
...

<keras.callbacks.callbacks.History at 0x7f61f4105190>
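
For readers unfamiliar with `padding="post"`, the sketch below reproduces the idea in plain NumPy: every sequence is written into a fixed-width row and the tail is filled with zeros. `pad_post` is a hypothetical stand-in, not the Keras implementation; it mirrors the Keras default of truncating overlong sequences from the front.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Hypothetical stand-in for post-padding: fill the tail of each row."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        # Keep the last `maxlen` items, matching Keras' default truncating="pre".
        trimmed = seq[-maxlen:]
        padded[row, :len(trimmed)] = trimmed
    return padded

print(pad_post([[5, 2, 9], [7], [1, 2, 3, 4, 5]], maxlen=4))
```

The padding value 0 never collides with a real token here because the vocabulary indices start at 1.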

## Predicting sentiment on new inputs

> Below are some helper functions for predicting sentiment from new inputs (texts not in the dataset).

In [58]:
import numpy as np

def pad_input(string: str) -> np.ndarray:
    """New inputs need to follow the same encoding steps as the dataset.

    Returns an integer array of shape (1, largest_input_length)."""
    tokens = tokenizer.tokenize(string)
    embedd = tokenizer.convert_tokens_to_ids(tokens)
    maxlen = largest_input_length  # pad to the same length the model was trained on.
    return pad_sequences([embedd], maxlen=maxlen, padding="post")


def predict(string: str, k=.6, model=model) -> str:
    """Return the formatted prediction for a new input from the trained model."""
    embedd_input = pad_input(string)
    embedd_score = model.predict(embedd_input)[0]
    # Scores at or above the threshold `k` are labeled positive.
    label = "<POSITIVE>" if embedd_score[0] >= k else "<NEGATIVE>"
    return "input: {} : {} -> ({})%".format(
        string, label, round(embedd_score[0]*100, 4))

In [59]:
predict("hello there i am so glad this demo worked")

'input: hello there i am so glad this demo worked : <POSITIVE> -> (66.8464)%'

In [60]:
# Here we can see the difference punctuation makes in the input.
# The Tokenizer keeps punctuation and emojis as tokens, so the model
# learned them as part of the semantic context.

predict("hello there! i am so glad this demo worked!")

'input: hello there! i am so glad this demo worked! : <POSITIVE> -> (74.7259)%'

In [61]:
# We don't handle emojis explicitly in this demo, but they are part of the model's context.
# There are multiple ways (rule-based or model-based) to include emojis as part of sentiment scores.

emoji = {'pos': '😁!', 'neg': '😡!'}
t1 = predict("I love this, but hate it {}".format(emoji["pos"]))
t2 = predict("I hate this, but love it {}".format(emoji["neg"]))
print('1:', t1)
print('2:', t2)

1: input: I love this, but hate it 😁! : <POSITIVE> -> (88.8346)%
2: input: I hate this, but love it 😡! : <POSITIVE> -> (85.3823)%


In [81]:
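import random
# Note: these comments are sampled from the training set, so this is a
# sanity check rather than an evaluation on held-out data.
test_dataset = random.sample(train_dataset, k=10)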
for comment in test_dataset:
    pred = predict(comment)
    print(f"{pred}\n")

input: America is a falling empire, these moves are the only things they have left to rely on. It's just a matter of time before Samsung come to their senses and join forces with Huawei and create their own OS and ecosystem. who is to say that the American government won't come for Samsung if they get to the position Huawei is in. Huawei is it for the long game. Already they have been laying the groundwork all over Africa and Asia and eastern Europe. Google and all western brands are fucked in the long run. This is the future while the west is on a population decline. Africa 1,679 Billion 2030 – Africa 4.3 Billion 2100 India 1,527 Billion 2030 – India 1.6 Billion 2100 China 1,415 Billion 2030 – China 1 Billion 2100 All of Asia combine 4,922 Billion 2030 - Asia 4,888 Billion 2100 https://www.populationpyramid.net/africa/2100/ : <NEGATIVE> -> (0.012)%

input: Thank you very much for this. This is really great :) : <POSITIVE> -> (99.9406)%

input: just went through a bad breakup , lost my

## Saving and loading the model

> Saving and loading a trained model with `keras` takes only a few steps.

In [54]:
from keras.models import load_model

MODEL_DIR = 'model'
MODEL_FILE = 'sentiment.h5'
# set to True to delete the in-memory model after saving
delete_existing_model = False

os.makedirs(MODEL_DIR, exist_ok=True)
MODEL_PATH = os.path.join(MODEL_DIR, MODEL_FILE)

# creates an HDF5 file
model.save(MODEL_PATH)
if delete_existing_model:
    del model

# returns a compiled model identical to the previous one
sentiment_model = load_model(MODEL_PATH)
sentiment_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 798, 100)          782900    
_________________________________________________________________
flatten_1 (Flatten)          (None, 79800)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 79801     
Total params: 862,701
Trainable params: 79,801
Non-trainable params: 782,900
_________________________________________________________________


In [55]:
def predict_sentiment(string: str, k=0.60):
    return predict(string, k=k, model=sentiment_model)

predict_sentiment("hello, world! this a text from the loaded model!")

'input: hello, world! this a text from the loaded model! : <POSITIVE> -> (95.4739)%'