# Sentiment in Texts

In [2]:
import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [3]:
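# Note: the braces make this a set, so iteration order is not guaranteed
# and the outputs below may list the sentences in a different order.
# A list ([...]) would keep them in the order written.
sentences = {
    'I love my dog',
    'I love my cat',
    'You love my dog.',
    'Do you think my dog is amazing?'
}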

## Text to sequence

- Tokenizer assigns an integer token to every distinct word; together these tokens form the vocabulary. num_words caps how many of the most frequent words are kept when texts are converted to sequences. oov_token is the token substituted for words that are not in the vocabulary.

- The fit_on_texts function builds the vocabulary from the sentences.

    Here are the distinct words:
    
> {'<oOOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
    
- Once the vocabulary has been built, the texts_to_sequences function converts each text into a sequence of integers by looking up every word in the vocabulary.

>    I love my dog: 5, 3, 2, 4
    
- Sentences usually differ in length, which is a problem when the sequences are fed to a neural network. Padding solves this by making all of the sequences the same length. By default, pad_sequences adds zeros at the front of the shorter sequences.

    Padded Sequences:
    
>    [[ 0  5  3  2  7] <br>
     [ 0  5  3  2  4]  <br>
     [ 9  2  4 10 11]  <br>
     [ 0  6  3  2  4]]
     
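- The maxlen parameter sets the maximum length of the sequences. If it is shorter than the longest sequence, words are dropped from the beginning or the end of the longer sequences; the truncating parameter controls which end ('pre', the default, or 'post'). With maxlen=5, the seven-token sequence for 'Do you think my dog is amazing?' keeps only its last five tokens: 9 2 4 10 11.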
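
The padding and truncation behaviour described above can be sketched in plain Python. This is a minimal illustration, not the Keras implementation: `pad_pre` is a hypothetical helper that mimics only the default 'pre' padding and 'pre' truncating of pad_sequences.

```python
def pad_pre(sequences, maxlen, value=0):
    """Hypothetical helper: left-pad and left-truncate, like the 'pre' defaults.

    Assumes maxlen >= 1.
    """
    padded = []
    for seq in sequences:
        # Keep only the last `maxlen` tokens (truncation from the beginning).
        kept = list(seq)[-maxlen:]
        # Fill the front with `value` until the row is `maxlen` long.
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


sequences = [[5, 3, 2, 7], [8, 6, 9, 2, 4, 10, 11]]
for row in pad_pre(sequences, maxlen=5):
    print(row)
```

The four-token sequence gains one leading zero, while the seven-token sequence loses its first two tokens.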

In [5]:
tokenizer = Tokenizer(num_words=100, oov_token='<oOOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=5)

print('\nWord Index = ', word_index)
print('\nSequences = ', sequences)
print('\nPadded Sequences:')
print(padded)


Word Index =  {'<oOOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

Sequences =  [[5, 3, 2, 7], [5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11], [6, 3, 2, 4]]

Padded Sequences:
[[ 0  5  3  2  7]
 [ 0  5  3  2  4]
 [ 9  2  4 10 11]
 [ 0  6  3  2  4]]
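
To make the word index above less of a black box, here is a rough pure-Python sketch of how such a vocabulary could be built: lowercase the text, strip punctuation, count the words, and number them by descending frequency, reserving index 1 for the OOV token. `build_word_index` is a hypothetical helper and its regex tokenization is a simplifying assumption; it is not the Keras implementation.

```python
import re


def build_word_index(texts, oov_token="<oOOV>"):
    """Hypothetical helper: number words by descending frequency, OOV first."""
    counts = {}
    for text in texts:
        # Assumption: lowercase and keep only runs of ASCII letters as words.
        for word in re.findall(r"[a-z]+", text.lower()):
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so words with equal counts keep first-seen order.
    ordered = sorted(counts, key=lambda w: -counts[w])
    word_index = {oov_token: 1}
    for i, word in enumerate(ordered, start=2):
        word_index[word] = i
    return word_index


sentences = [
    "I love my dog",
    "I love my cat",
    "You love my dog.",
    "Do you think my dog is amazing?",
]
word_index = build_word_index(sentences)
print(word_index)
```

For these four sentences the sketch assigns the same indices as the Word Index printed above: the most frequent word, 'my', gets index 2, right after the OOV token.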


## Testing the tokenizer

In [7]:
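# Try sentences containing words the tokenizer was not fit on.
# (test_data is also a set, so the output order may differ from the order written here.)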
test_data = {
    'i really love my dog',
    'my dog loves my manatee'
}

In [8]:
test_seq = tokenizer.texts_to_sequences(test_data)
print('\nTest Sequence = ', test_seq)

padded = pad_sequences(test_seq, maxlen=10)
print('\nPadded Test Sequences:')
print(padded)


Test Sequence =  [[2, 4, 1, 2, 1], [5, 1, 3, 2, 4]]

Padded Test Sequences:
[[0 0 0 0 0 2 4 1 2 1]
 [0 0 0 0 0 5 1 3 2 4]]


- 1 is the index of the OOV token. It stands in for every word that does not exist in the vocabulary, here 'really', 'loves', and 'manatee'.
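
The lookup that produces those 1s can be sketched as a dictionary lookup with a default. `encode` is a hypothetical helper, and the whitespace split is a simplification of the tokenizer's text cleaning; the word index is copied from the output earlier in this notebook.

```python
# Word index copied from the notebook output.
word_index = {'<oOOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


def encode(text, word_index, oov_token='<oOOV>'):
    """Hypothetical helper: map each word to its index, unknown words to OOV."""
    oov_id = word_index[oov_token]
    # Assumption: lowercase and split on whitespace; no punctuation handling.
    return [word_index.get(word, oov_id) for word in text.lower().split()]


print(encode('i really love my dog', word_index))
print(encode('my dog loves my manatee', word_index))
```

Note that 'loves' maps to 1 even though 'love' is in the vocabulary: the lookup is by exact word, with no stemming.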