# Usage:
- useful for sequential data such as text or characters
- sentiment analysis
- character generation
- goal: write a play

### Bag of Words
- every distinct word in the dataset is part of the vocabulary
- every word is placed in a dictionary, with the value being an integer that represents it
- whenever we see a word, we throw its number into the bag
- we lose the word order but keep track of the frequency
- feed the bag into a neural network
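
The steps above can be sketched in plain Python. This is a minimal illustration, not how any particular library does it; the helper names `build_vocab` and `bag_of_words` and the two sample texts are made up for the example.

```python
from collections import Counter

def build_vocab(texts):
    """Map every distinct word to an integer, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # integers start at 1
    return vocab

def bag_of_words(text, vocab):
    """Count how often each known word's integer occurs; word order is discarded."""
    return Counter(vocab[word] for word in text.lower().split() if word in vocab)

# Hypothetical sample texts
texts = ["the movie was good", "the plot was bad bad"]
vocab = build_vocab(texts)
bag = bag_of_words("the plot was bad bad", vocab)

print(vocab)  # word -> integer
print(bag)    # integer -> frequency
```

The resulting counts (one entry per vocabulary integer) are what would be handed to the network.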

Disadvantages:

Consider this:
- I thought the movie was going to be bad, but it was actually amazing
- I thought the movie was going to be amazing, but it was actually bad

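Bag of words cannot tell these two sentences apart: both contain exactly the same words with the same frequencies, and word order is the only thing that separates their meanings. Distinguishing them requires context.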
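
A quick check of this claim, as a small sketch: the helper `bag` below is a hypothetical stand-in that strips punctuation and counts words, which is all a bag-of-words encoding keeps.

```python
from collections import Counter
import string

def bag(sentence):
    # Lowercase, strip punctuation, and count words; order is thrown away.
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())

a = "I thought the movie was going to be bad, but it was actually amazing"
b = "I thought the movie was going to be amazing, but it was actually bad"

print(a == b)            # False: the sentences differ
print(bag(a) == bag(b))  # True: their bags are identical
```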

### Word embedding
- translate each word into a vector
- the direction of each vector is determined by semantics: words with similar meanings point in similar directions (the angle between them is small)
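
One common way to compare the angle between two word vectors is cosine similarity. The sketch below uses hand-picked 3-dimensional vectors that are purely hypothetical; real embeddings are learned during training and have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: near 1 = same direction, near -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, chosen by hand for illustration only
toy_embeddings = {
    "good": [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "bad": [-0.9, -0.7, 0.1],
}

sim_good_great = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_good_bad = cosine_similarity(toy_embeddings["good"], toy_embeddings["bad"])

print(sim_good_great)  # close to 1: small angle
print(sim_good_bad)    # negative: the vectors point in nearly opposite directions
```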

### Recurrent NN
- contains a loop: processes one word at a time while maintaining an internal memory of what it has already seen
- treats the input as a sequence
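
The loop can be sketched in plain Python. This assumes the textbook recurrence `h = tanh(W_x·x + W_h·h + b)`; the weights and the two-step input sequence are made-up toy values, and a real layer would learn its weights.

```python
import math

def rnn_step(x, h, W_x, W_h, b):
    """One recurrent step: combine the current input x with the previous memory h."""
    return [
        math.tanh(
            sum(w * xj for w, xj in zip(W_x[i], x))
            + sum(w * hk for w, hk in zip(W_h[i], h))
            + b[i]
        )
        for i in range(len(h))
    ]

def run_rnn(sequence, W_x, W_h, b):
    h = [0.0] * len(b)       # internal memory starts empty
    for x in sequence:       # process one element at a time
        h = rnn_step(x, h, W_x, W_h, b)
    return h

# Hypothetical toy weights: 2-d inputs, 2-d hidden state
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

seq = [[1.0, 0.0], [0.0, 1.0]]
h_forward = run_rnn(seq, W_x, W_h, b)
h_reversed = run_rnn(seq[::-1], W_x, W_h, b)

print(h_forward)
print(h_reversed)  # differs from h_forward: order matters, unlike bag of words
```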

#### Special Layers:
- Simple RNN layer
    - think of a conveyor belt
    - looks at the current word together with the output from the previous word, and builds its understanding from that
- LSTM layer
    - keeps a longer-term memory, so it can access output from any earlier step in the sequence, not only the previous one
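
To make the difference concrete, here is a rough sketch of a single LSTM cell with scalar state, following the standard gate equations (forget, input, output, candidate). All parameter values are hypothetical and hand-picked so that the forget gate stays near 1; a trained layer learns these values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, p):
    """One step of a scalar LSTM cell. p maps each gate name to (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, bias = p[name]
        return activation(w_x * x + w_h * h + bias)

    f = gate("forget", sigmoid)       # how much of the old cell state to keep
    i = gate("input", sigmoid)        # how much new information to write
    o = gate("output", sigmoid)       # how much of the cell state to expose
    g = gate("candidate", math.tanh)  # the new information itself

    c_new = f * c + i * g             # long-term memory
    h_new = o * math.tanh(c_new)      # output for this step
    return h_new, c_new

# Hypothetical parameters: a large forget bias keeps the memory almost intact
params = {
    "forget": (0.0, 0.0, 5.0),
    "input": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
for x in [2.0, 0.0, 0.0, 0.0]:  # one informative input, then three empty steps
    h, c = lstm_step(x, h, c, params)

print(c)  # still carries most of what was written at the first step
```

The cell state `c` is what lets information from an early step survive several later steps mostly unchanged.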


In [1]:
# Movie reviews (IMDB sentiment dataset)
# each word is encoded as an integer based on how common it is in the dataset
from keras.datasets import imdb
import tensorflow as tf
import numpy as np

VOCAB_SIZE = 88584
MAXLEN = 250
BATCH_SIZE = 64
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=VOCAB_SIZE)



Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [6]:
from keras.utils import pad_sequences

# pad or truncate every review to MAXLEN tokens, as all inputs must have the same length
train_data = pad_sequences(train_data, maxlen=MAXLEN)
test_data = pad_sequences(test_data, maxlen=MAXLEN)
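
By default, `pad_sequences` pads and truncates at the front of each sequence (`padding="pre"`, `truncating="pre"`). The pure-Python helper below, `pad_pre`, is a hypothetical stand-in that mimics that default for a single sequence, to show what happens to short and long reviews.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items, then left-pad with `value` up to `maxlen`."""
    truncated = seq[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short_review = [5, 8, 2]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_pre(short_review, 5))  # [0, 0, 5, 8, 2]
print(pad_pre(long_review, 5))   # [3, 4, 5, 6, 7]
```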

In [7]:
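# creating the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),       # map each word index to a 32-dimensional vector
    tf.keras.layers.LSTM(32),                        # read the sequence and summarize it in 32 values
    tf.keras.layers.Dense(1, activation="sigmoid")   # single score between 0 and 1
])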



In [8]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 32)          2834688   
                                                                 
 lstm (LSTM)                 (None, 32)                8320      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________
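
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the names `EMBED_DIM` and `LSTM_UNITS` are introduced here for readability.

```python
VOCAB_SIZE = 88584
EMBED_DIM = 32
LSTM_UNITS = 32

# Embedding: one 32-d vector per vocabulary entry
embedding_params = VOCAB_SIZE * EMBED_DIM

# LSTM: 4 gates, each with (input + recurrent) weights plus a bias per unit
lstm_params = 4 * ((EMBED_DIM + LSTM_UNITS) * LSTM_UNITS + LSTM_UNITS)

# Dense: one weight per LSTM output plus one bias
dense_params = LSTM_UNITS * 1 + 1

total_params = embedding_params + lstm_params + dense_params

print(embedding_params)  # 2834688
print(lstm_params)       # 8320
print(dense_params)      # 33
print(total_params)      # 2843041
```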


In [10]:
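# binary cross-entropy suits a two-class (positive / negative) problem
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["acc"])

# hold out 20% of the training data for validation
history = model.fit(train_data, train_labels, epochs=10, validation_split=0.2)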

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
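
For intuition about the `binary_crossentropy` loss used above, here is a small pure-Python sketch of the usual formula, averaged over a batch. The labels and predictions are invented for illustration, and the clipping constant `eps` is an assumption made to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical labels and predictions
labels = [1, 0, 1]
mostly_right = [0.9, 0.1, 0.8]
mostly_wrong = [0.1, 0.9, 0.2]

loss_right = binary_crossentropy(labels, mostly_right)
loss_wrong = binary_crossentropy(labels, mostly_wrong)

print(loss_right)  # small loss
print(loss_wrong)  # much larger loss
```

Confident wrong predictions are penalized far more heavily than mildly uncertain correct ones.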


In [11]:
results = model.evaluate(test_data, test_labels)
print(results)



[0.41340261697769165, 0.8602399826049805]


In [17]:
# making predictions
word_index = imdb.get_word_index()

# encoding function: split the text into lowercase words, map each word to its
# integer (0 if the word is unknown), then pad the result to MAXLEN
def encode_text(text):
    tokens = tf.keras.preprocessing.text.text_to_word_sequence(text)
    tokens = [word_index.get(word, 0) for word in tokens]
    return pad_sequences([tokens], maxlen=MAXLEN)[0]

text = "that movie was just amazing, oh the misery"
encoded = encode_text(text)
print(encoded)

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 

In [18]:
# decode function: invert word_index so each integer maps back to its word

reverse_word_index = {value: key for (key, value) in word_index.items()}

def decode_integers(integers):
    text = ""
    for num in integers:
        if num != 0:  # 0 is padding (or an unknown word), so skip it
            text += reverse_word_index[num] + " "
    return text

print(decode_integers(encoded))

that movie was just amazing oh the misery 
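
One detail worth knowing when encoding new text: with its default arguments, `imdb.load_data` reserves 0 for padding, 1 for the start of a review, and 2 for out-of-vocabulary words, and shifts every word index by 3 (`index_from=3`), whereas `imdb.get_word_index()` returns the unshifted indices. The sketch below shows one way to encode and decode with that offset applied. The helper names and the tiny `toy_word_index` are hypothetical, used only so the example is self-contained.

```python
INDEX_FROM = 3  # default index_from in imdb.load_data
OOV_ID = 2      # default oov_char in imdb.load_data

# Hypothetical stand-in for imdb.get_word_index(); values are illustrative only
toy_word_index = {"the": 1, "movie": 17, "amazing": 477}

def encode_aligned(words, word_index, vocab_size):
    """Shift each word's index by INDEX_FROM; unknown or too-rare words become OOV_ID."""
    ids = []
    for word in words:
        rank = word_index.get(word)
        if rank is None or rank + INDEX_FROM >= vocab_size:
            ids.append(OOV_ID)
        else:
            ids.append(rank + INDEX_FROM)
    return ids

def decode_aligned(ids, word_index):
    """Undo the shift; skip the reserved ids 0, 1 and 2."""
    reverse = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    return [reverse.get(i, "?") for i in ids if i > OOV_ID]

ids = encode_aligned(["the", "movie", "was", "amazing"], toy_word_index, 88584)
print(ids)                                  # [4, 20, 2, 480]
print(decode_aligned(ids, toy_word_index))  # ['the', 'movie', 'amazing']
```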


In [23]:
# making predictions

def predict(text):
    encoded_text = encode_text(text)
    # the model expects a batch, so wrap the single review in a batch of one
    pred = np.zeros((1, MAXLEN))
    pred[0] = encoded_text
    result = model.predict(pred)
    print(result[0])

    
text = "this movie was great really loved it and would watch it again because it was amazingly great engaging plot"
predict(text)

# the output is a score between 0 and 1
# closer to 0 = more negative
# closer to 1 = more positive

[0.9113181]
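
To turn the score into a label, one option is a simple cutoff. The 0.5 threshold and the helper name `label_from_score` below are assumptions for illustration; a different threshold may suit other uses.

```python
def label_from_score(score, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a sentiment label."""
    return "positive" if score >= threshold else "negative"

print(label_from_score(0.9113181))  # positive
print(label_from_score(0.2))        # negative
```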
