# Text preprocessing
- Basic text preprocessing with the Keras API
- Doc: https://keras.io/preprocessing/text/

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence, one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

### Tokenization of a sentence
- Tokenization: the process of converting a sequence of characters into a sequence of tokens (https://en.wikipedia.org/wiki/Lexical_analysis#Token)

In [2]:
sentences = ['Curiosity killed the cat.', 'But satisfaction brought it back']

In [3]:
tk = Tokenizer()    # create Tokenizer instance

In [4]:
tk.fit_on_texts(sentences)    # the tokenizer must be fit on text data first; this builds its word index
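
To make the fitting step less opaque, here is a minimal pure-Python sketch of the idea behind it: lowercase the text, strip punctuation, count words, and give more frequent words smaller indices, starting at 1. The helper `build_word_index` and its regular expression are assumptions made for illustration. This is not the Keras implementation, and its filtering rules differ in detail.

```python
import re
from collections import Counter

def build_word_index(texts):
    """Simplified stand-in for the vocabulary-building step (not the Keras code)."""
    counts = Counter()
    for text in texts:
        # lowercase, then keep only runs of letters, digits and apostrophes
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # most frequent words first; the stable sort keeps first-seen order for ties
    ranked = sorted(counts, key=lambda word: -counts[word])
    # indices start at 1 so that 0 stays free for padding
    return {word: i for i, word in enumerate(ranked, start=1)}

sentences = ['Curiosity killed the cat.', 'But satisfaction brought it back']
word_index = build_word_index(sentences)
print(word_index)
```

Every word in the two example sentences occurs exactly once, so the indices simply follow the order of first appearance.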

#### Converting sentence into (integer) sequence
- One simple way to model text is to represent each sentence as a sequence of integers, one per word
- Doing so preserves the order of the words in the sentence

In [5]:
seq = tk.texts_to_sequences(sentences)
print(seq)

[[1, 2, 3, 4], [5, 6, 7, 8, 9]]
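
Because word order is preserved, an integer sequence can be mapped back to its (lowercased, punctuation-free) words. The sketch below assumes the word index implied by the output above. The `decode` helper is hypothetical and written in plain Python rather than with any Keras call.

```python
# Word index implied by the printed sequences above (an assumption of this sketch)
word_index = {
    'curiosity': 1, 'killed': 2, 'the': 3, 'cat': 4, 'but': 5,
    'satisfaction': 6, 'brought': 7, 'it': 8, 'back': 9,
}
# Invert the mapping: integer -> word
index_word = {i: word for word, i in word_index.items()}

def decode(sequence):
    """Turn a list of word indices back into a space-separated string."""
    # index 0 is skipped because it is reserved for padding, not for a word
    return ' '.join(index_word[i] for i in sequence if i != 0)

seq = [[1, 2, 3, 4], [5, 6, 7, 8, 9]]
decoded = [decode(s) for s in seq]
print(decoded)
```

The original casing and the trailing period are not recovered, since they were discarded during tokenization.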


#### One-hot encoding of a sentence
- Sometimes it is enough to record only whether each word appears in a sentence, ignoring word order and counts
- This representation is called "one-hot encoding" here; strictly speaking it is a binary (multi-hot) bag-of-words vector, since several positions can be set at once
    - If a word appears in the sentence, its position is encoded as **"one"**
    - If not, it is encoded as **"zero"**

In [6]:
mat = tk.sequences_to_matrix(seq)
print(mat)

[[0. 1. 1. 1. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]]
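
The shape of this matrix is easier to read with a small pure-Python sketch of the binary encoding. The function name `sequences_to_binary_matrix` is hypothetical and only mirrors the idea, not the library code. Each row has one column per word index plus column 0, which is why nine words give ten columns and why column 0 is always zero.

```python
def sequences_to_binary_matrix(sequences, num_words):
    """Mark, for every sequence, which word indices occur in it."""
    matrix = [[0.0] * num_words for _ in sequences]
    for row, sequence in zip(matrix, sequences):
        for index in sequence:
            row[index] = 1.0  # presence only: repeats and order are ignored
    return matrix

seq = [[1, 2, 3, 4], [5, 6, 7, 8, 9]]
# 9 distinct words + 1 for the reserved index 0
mat = sequences_to_binary_matrix(seq, num_words=10)
for row in mat:
    print(row)

# Reordering the words yields the same row, so word order is lost
same = sequences_to_binary_matrix([[4, 3, 2, 1]], num_words=10)[0] == mat[0]
print(same)
```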


#### Padding sequences
- Sequences usually differ in length, so they are zero-padded to a common length before being batched
- The idea is similar to zero-padding the border of image data, but applied along the sequence dimension

In [7]:
# with padding='pre', zeros are prepended to the start of each shorter sequence
pad_seq = pad_sequences(seq, padding='pre')
print(pad_seq)

[[0 1 2 3 4]
 [5 6 7 8 9]]


In [9]:
# with padding='post', zeros are appended to the end of each shorter sequence
pad_seq = pad_sequences(seq, padding='post')
print(pad_seq)

[[1 2 3 4 0]
 [5 6 7 8 9]]
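
The two padding modes can be summarized with a short pure-Python sketch. The `pad` helper below is hypothetical and covers only the behavior shown above plus an optional `maxlen`. It assumes that over-long sequences are cut from the front, which matches the default `truncating='pre'` of `pad_sequences`.

```python
def pad(sequences, maxlen=None, padding='pre', value=0):
    """Pad (or truncate) every sequence to the same length."""
    if maxlen is None:
        # default: pad up to the longest sequence in the batch
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        s = list(s)
        if len(s) > maxlen:
            s = s[len(s) - maxlen:]  # drop items from the front, keep the last maxlen
        fill = [value] * (maxlen - len(s))
        padded.append(fill + s if padding == 'pre' else s + fill)
    return padded

seq = [[1, 2, 3, 4], [5, 6, 7, 8, 9]]
pre = pad(seq, padding='pre')
post = pad(seq, padding='post')
short = pad(seq, maxlen=3)
print(pre)
print(post)
print(short)
```

Padding with 0 is safe here because the word indices start at 1, so 0 never collides with a real word.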
