<h1 align=center><font size=5>Text Preprocessing</font></h1>

### One-hot encoding <a id="one_hot"></a>

We create a zero vector whose length equals the vocabulary size, then place a one at the index that corresponds to the word.

In [None]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

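text = 'The   cat sat on  the mat.'
# split() with no arguments collapses runs of whitespace; punctuation stays attached ('mat.')
text = text.lower().split()
print(text)

# Map each distinct token to an integer (LabelEncoder assigns labels in sorted order)
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(text)
print(integer_encoded)

# Expand each integer into a one-hot row; the number of columns is max(label) + 1
onehot_encoded = to_categorical(integer_encoded)
print(onehot_encoded)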

['the', 'cat', 'sat', 'on', 'the', 'mat.']
[4 0 3 2 4 1]
[[0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0.]]


> This approach is inefficient because one-hot vectors are sparse. With 10,000 words in the vocabulary, each word's vector would hold a single one and 9,999 zeros, so 99.99% of its elements are zero.
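
The sparsity figure can be checked with a few lines of plain Python. This is only an illustrative sketch: the helper `one_hot` and the index `42` are hypothetical and not part of the notebook's code.

```python
def one_hot(index, vocab_size):
    """Return a list of vocab_size zeros with a single 1.0 at position index."""
    vector = [0.0] * vocab_size
    vector[index] = 1.0
    return vector

vocab_size = 10_000
vector = one_hot(42, vocab_size)  # 42 is an arbitrary example index

# Fraction of entries that carry no information
zero_fraction = vector.count(0.0) / vocab_size
print(f"{zero_fraction:.2%} of the entries are zero")  # 99.99% of the entries are zero
```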

### Encode each word with a unique number <a id="integer_enc"></a>

A second approach is to encode each word with a unique number. Continuing the example above, we could assign 1 to "cat", 2 to "mat", 3 to "on", 4 to "sat", and 5 to "the". The sentence "The cat sat on the mat" then becomes the dense vector [5, 1, 4, 3, 5, 2].
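
As a minimal sketch of this idea in plain Python, assuming the numbers are handed out in alphabetical order starting at 1 (which is what reproduces the vector above), the encoding might look like this. The names `vocabulary` and `encoded` are illustrative only.

```python
sentence = "The cat sat on the mat"
words = sentence.lower().split()

# Number the distinct words alphabetically, starting at 1
vocabulary = {word: i for i, word in enumerate(sorted(set(words)), start=1)}

# Replace every word in the sentence with its number
encoded = [vocabulary[word] for word in words]

print(vocabulary)  # {'cat': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(encoded)     # [5, 1, 4, 3, 5, 2]
```

Unlike the one-hot matrix, the result has one integer per word, so its size does not grow with the vocabulary.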



#### Text Tokenization

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ['The cat sat on the mat.',
         'The dog sat on the log.',
         'Dogs and cats living together.']

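# num_words caps the vocabulary used for encoding: only the (num_words - 1) most frequent words are kept
tokenizer = Tokenizer(num_words = 20)
# Build the word index; by default the Tokenizer lowercases the text and strips punctuation
tokenizer.fit_on_texts(texts)

# word_index maps each word to an integer; more frequent words get lower indices and 0 is never assigned
word_index = tokenizer.word_index
print('Word index:\n', word_index)

# Replace each word with its index
sequences = tokenizer.texts_to_sequences(texts)
print('Sequences:\n', sequences)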

Word index:
 {'the': 1, 'sat': 2, 'on': 3, 'cat': 4, 'mat': 5, 'dog': 6, 'log': 7, 'dogs': 8, 'and': 9, 'cats': 10, 'living': 11, 'together': 12}
Sequences:
 [[1, 4, 2, 3, 1, 5], [1, 6, 2, 3, 1, 7], [8, 9, 10, 11, 12]]


#### Test Sequence

In [None]:
X_train = ['The cat sat on the mat.']

tokenizer = Tokenizer(num_words = 20) 
tokenizer.fit_on_texts(X_train)

word_index = tokenizer.word_index
print('Word index:\n', word_index)

X_train_seq = tokenizer.texts_to_sequences(X_train)
print('Sequences:\n', X_train_seq)
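# --------------------------------------------------------
X_test = ['The dog sat on the log.']

# Encode with the tokenizer fitted on X_train; words it never saw ('dog', 'log') are silently dropped
X_test_seq = tokenizer.texts_to_sequences(X_test)
print('Test sequence:\n', X_test_seq)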

Word index:
 {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
Sequences:
 [[1, 2, 3, 4, 1, 5]]
Test sequence:
 [[1, 3, 4, 1]]
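
The test sequence has four entries although the sentence has six words. The sketch below imitates that behaviour in plain Python to make the missing words visible. The function `encode` is a hypothetical, simplified stand-in for `texts_to_sequences` (it only lowercases and strips full stops) and is not how Keras implements it.

```python
# Word index copied from the output above
word_index = {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}

def encode(text, word_index):
    """Lowercase, strip full stops, and skip words missing from word_index."""
    tokens = text.lower().replace('.', '').split()
    kept = [word_index[t] for t in tokens if t in word_index]
    dropped = [t for t in tokens if t not in word_index]
    return kept, dropped

kept, dropped = encode('The dog sat on the log.', word_index)
print(kept)     # [1, 3, 4, 1]
print(dropped)  # ['dog', 'log']
```

Because unknown words vanish without a trace, the encoded sequence no longer lines up with the original sentence positions.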


#### Out Of Vocabulary (OOV) words

In [None]:
X_train = ['The cat sat on the mat.']

# oov_token reserves index 1 for any word that was not seen during fitting
tokenizer = Tokenizer(num_words = 20, oov_token = '<OOV>')
tokenizer.fit_on_texts(X_train)

word_index = tokenizer.word_index
print('Word index:\n', word_index)

X_train_seq = tokenizer.texts_to_sequences(X_train)
print('Sequences:\n', X_train_seq)
# --------------------------------------------------------
X_test = ['The dog sat on the log.']

# Unseen words ('dog', 'log') are now encoded as the OOV index 1 instead of being dropped
X_test_seq = tokenizer.texts_to_sequences(X_test)
print('Test sequence:\n', X_test_seq)

Word index:
 {'<OOV>': 1, 'the': 2, 'cat': 3, 'sat': 4, 'on': 5, 'mat': 6}
Sequences:
 [[2, 3, 4, 5, 2, 6]]
Test sequence:
 [[2, 1, 4, 5, 2, 1]]
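
To see what the OOV index preserves, the test sequence can be decoded back into words. This is a small sketch in plain Python that inverts the word index printed above. The name `index_word` is chosen here for illustration.

```python
# Word index copied from the output above
word_index = {'<OOV>': 1, 'the': 2, 'cat': 3, 'sat': 4, 'on': 5, 'mat': 6}

# Invert the mapping so that indices can be looked up
index_word = {index: word for word, index in word_index.items()}

test_sequence = [2, 1, 4, 5, 2, 1]
decoded = ' '.join(index_word[i] for i in test_sequence)
print(decoded)  # the <OOV> sat on the <OOV>
```

The sequence keeps its original length of six, so the position of each unknown word is retained even though its identity is lost.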


#### Padding

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ['I love my dog', 
             'You love my dog!',
             'Do you think my dog is amazing?']

tokenizer = Tokenizer(num_words = 20, oov_token = '<OOV>') 
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print('Word index:\n', word_index)

sequences = tokenizer.texts_to_sequences(sentences)
print('Sequences:\n', sequences)

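# By default, zeros are added at the front ('pre') so every row is as long as the longest sequence
padded = pad_sequences(sequences)
print('Padded sequences:\n', padded)

# Binary bag-of-words row with num_words columns: column j is 1 if the word with index j occurs in the text
matrix2 = tokenizer.texts_to_matrix(['I love my dog'])
print(matrix2)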

Word index:
 {'<OOV>': 1, 'my': 2, 'dog': 3, 'love': 4, 'you': 5, 'i': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
Sequences:
 [[6, 4, 2, 3], [5, 4, 2, 3], [7, 5, 8, 2, 3, 9, 10]]
Padded sequences:
 [[ 0  0  0  6  4  2  3]
 [ 0  0  0  5  4  2  3]
 [ 7  5  8  2  3  9 10]]
[[0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]


In [None]:
# Pad at the end, and cut sequences longer than maxlen from the end
padded = pad_sequences(sequences, padding = 'post', maxlen = 5, truncating = 'post')
print('Padded sequences:\n', padded)
print('Padded shape:', padded.shape)

Padded sequences:
 [[6 4 2 3 0]
 [5 4 2 3 0]
 [7 5 8 2 3]]
Padded shape: (3, 5)
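
The effect of the `padding`, `truncating`, and `maxlen` arguments can be sketched in plain Python. The function `pad` below is a hypothetical, simplified re-implementation written for illustration. It returns nested lists rather than a NumPy array and is not the Keras source.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Pad or truncate every sequence to the same length (simplified sketch)."""
    if maxlen is None:
        # Default: use the length of the longest sequence
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Truncate first: 'pre' drops items from the front, 'post' from the end
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        # Then fill the remaining positions with the padding value
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded

# Sequences copied from the output above
sequences = [[6, 4, 2, 3], [5, 4, 2, 3], [7, 5, 8, 2, 3, 9, 10]]

print(pad(sequences))
# [[0, 0, 0, 6, 4, 2, 3], [0, 0, 0, 5, 4, 2, 3], [7, 5, 8, 2, 3, 9, 10]]
print(pad(sequences, maxlen=5, padding='post', truncating='post'))
# [[6, 4, 2, 3, 0], [5, 4, 2, 3, 0], [7, 5, 8, 2, 3]]
```

With `truncating='post'` the third sentence loses its last two words ("is amazing"); with the default `'pre'` it would lose its first two instead.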


In [None]:
texts = ['The the the the the cat sat on the mat cat.']
tokenizer = Tokenizer(num_words = 10)
tokenizer.fit_on_texts(texts)

word_index = tokenizer.word_index
print('Word index:', word_index)

sequences = tokenizer.texts_to_sequences(texts)
print('Sequences:', sequences)

# Compare the four vectorization modes: presence, raw count, relative frequency, and tf-idf weight
for mode in ['binary', 'count', 'freq', 'tfidf']:
    matrix = tokenizer.texts_to_matrix(texts, mode = mode)
    print('-'*20, mode, '-'*20)
    print(matrix)

Word index: {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
Sequences: [[1, 1, 1, 1, 1, 2, 3, 4, 1, 5, 2]]
-------------------- binary --------------------
[[0. 1. 1. 1. 1. 1. 0. 0. 0. 0.]]
-------------------- count --------------------
[[0. 6. 2. 1. 1. 1. 0. 0. 0. 0.]]
-------------------- freq --------------------
[[0.         0.54545455 0.18181818 0.09090909 0.09090909 0.09090909
  0.         0.         0.         0.        ]]
-------------------- tfidf --------------------
[[0.         1.13196106 0.6865121  0.40546511 0.40546511 0.40546511
  0.         0.         0.         0.        ]]
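
The `freq` and `tfidf` rows can be reproduced by hand. The sketch below assumes that `freq` is the count divided by the number of words in the text, and that the tf-idf weight is `(1 + ln(count)) * ln(1 + N / (1 + df))`, where `N` is the number of documents and `df` is the number of documents containing the word. These formulas match the printed output, but treat them as an assumption about the Tokenizer's internals rather than a statement of its documented API.

```python
import math

# Word counts taken from the 'count' row above
counts = {'the': 6, 'cat': 2, 'sat': 1, 'on': 1, 'mat': 1}
n_docs = 1     # a single text was fitted
doc_freq = 1   # every word occurs in that one text

total = sum(counts.values())  # 11 words in the text
freq = {}
tfidf = {}
for word, count in counts.items():
    freq[word] = count / total
    tf = 1 + math.log(count)                      # damped term frequency
    idf = math.log(1 + n_docs / (1 + doc_freq))   # ln(1.5) for every word here
    tfidf[word] = tf * idf
    print(f"{word}: freq={freq[word]:.8f} tfidf={tfidf[word]:.8f}")
```

With only one document, the idf factor is the same for every word, so the tf-idf row differs from the count row only through the logarithmic damping of the term frequency.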
