# Colab notebooks remark

**Note**: Make sure you connect this notebook to a GPU. To do so, go to the main menu and click "Edit" -> "Notebook settings". In the dialog, select "GPU" in the "Hardware accelerator" dropdown menu and save the settings. Once the upper right corner shows that your notebook is connected (the dropdown menu beside the editing menu), you are ready to go.

**Origin**: This Colab notebook is based on [this](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/6.1-one-hot-encoding-of-words-or-characters.ipynb) Jupyter notebook on GitHub.

# One-hot encoding of words or characters

This notebook contains the first code sample found in Chapter 6, Section 1 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.

----

One-hot encoding is the most common, most basic way to turn a token into a vector. You already saw it in action in our initial IMDB and 
Reuters examples from chapter 3 (done with words, in our case). It consists of associating a unique integer index with every word, then 
turning this integer index i into a binary vector of size N (the size of the vocabulary) that is all zeros except for the i-th 
entry, which is 1.
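
As a minimal sketch of this definition, the snippet below builds a single one-hot vector with NumPy. The vocabulary size `N` and the index `i` are made-up values chosen only for illustration.

```python
import numpy as np

N = 5  # hypothetical vocabulary size
i = 2  # hypothetical integer index of some word

# Start from an all-zeros vector of size N and set the i-th entry to 1.
one_hot = np.zeros(N)
one_hot[i] = 1.0

print(one_hot)  # [0. 0. 1. 0. 0.]
```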

Of course, one-hot encoding can be done at the character level as well. To unambiguously drive home what one-hot encoding is and how to 
implement it, here are two toy examples of one-hot encoding: one for words, the other for characters.



## Word level one-hot encoding (toy example):

In [1]:
import keras
print('Keras version:', keras.__version__)
import tensorflow
print('Tensorflow version:', tensorflow.__version__)

if tensorflow.test.gpu_device_name():
  print('Default GPU Device: {}'.format(tensorflow.test.gpu_device_name()))
else:
  print("Please install GPU version of TF")

Keras version: 2.12.0
Tensorflow version: 2.12.0
Default GPU Device: /device:GPU:0


In [2]:
import numpy as np

# This is our initial data; one entry per "sample"
# (in this toy example, a "sample" is just a sentence, but
# it could be an entire document).
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# First, build an index of all tokens in the data.
token_index = {}
for sample in samples:
    
    # We simply tokenize the samples via the `split` method.
    # in real life, we would also strip punctuation and special characters
    # from the samples.
    for word in sample.split():
        
        if word not in token_index:
            
            # Assign a unique index to each unique word
            token_index[word] = len(token_index) + 1
            # Note that we don't attribute index 0 to anything.

In [None]:
token_index

{'The': 1,
 'cat': 2,
 'sat': 3,
 'on': 4,
 'the': 5,
 'mat.': 6,
 'dog': 7,
 'ate': 8,
 'my': 9,
 'homework.': 10}
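
Note that `'The'` and `'the'` received different indices, and that `'mat.'` and `'homework.'` kept their trailing period. The following is one possible sketch of the punctuation stripping mentioned in the code comment above, assuming that lowercasing and removing ASCII punctuation is acceptable for your data. The names `strip_punctuation` and `normalized_index` are illustrative and not part of the original notebook.

```python
import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Translation table that deletes every ASCII punctuation character.
strip_punctuation = str.maketrans('', '', string.punctuation)

normalized_index = {}
for sample in samples:
    # Lowercase, drop punctuation, then split on whitespace.
    for word in sample.lower().translate(strip_punctuation).split():
        if word not in normalized_index:
            # Index 0 is still left unassigned, as in the cell above.
            normalized_index[word] = len(normalized_index) + 1

print(normalized_index)
```

With this normalization the two samples yield 9 unique tokens instead of 10.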

In [3]:
# Next, we vectorize our samples.
# We will only consider the first `max_length` words in each sample.
max_length = 10

# This is where we store our results:
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.

In [None]:
results

array([[[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

       [[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])

In [None]:
np.shape(results)

(2, 10, 11)
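
To check that the encoding is lossless for this toy data, the sketch below rebuilds the same tensor and decodes it back to text by taking the position of the 1 in each row. The inverse mapping `index_token` and the list `decoded` are names introduced here for illustration.

```python
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
max_length = 10

# Rebuild the toy index and one-hot tensor from the cells above.
token_index = {}
for sample in samples:
    for word in sample.split():
        token_index.setdefault(word, len(token_index) + 1)

results = np.zeros((len(samples), max_length, len(token_index) + 1))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, token_index[word]] = 1.

# Invert the mapping: integer index -> word.
index_token = {index: word for word, index in token_index.items()}

decoded = []
for sample_matrix in results:
    # Rows that are all zeros are padding; skip them.
    words = [index_token[int(row.argmax())] for row in sample_matrix if row.any()]
    decoded.append(' '.join(words))

print(decoded)
```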

## Character level one-hot encoding (toy example)

In [4]:
import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable  # All printable ASCII characters.
token_index = dict(zip(characters, range(1, len(characters) + 1)))

In [5]:
characters

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

In [6]:
token_index

{'0': 1,
 '1': 2,
 '2': 3,
 '3': 4,
 '4': 5,
 '5': 6,
 '6': 7,
 '7': 8,
 '8': 9,
 '9': 10,
 'a': 11,
 'b': 12,
 'c': 13,
 'd': 14,
 'e': 15,
 'f': 16,
 'g': 17,
 'h': 18,
 'i': 19,
 'j': 20,
 'k': 21,
 'l': 22,
 'm': 23,
 'n': 24,
 'o': 25,
 'p': 26,
 'q': 27,
 'r': 28,
 's': 29,
 't': 30,
 'u': 31,
 'v': 32,
 'w': 33,
 'x': 34,
 'y': 35,
 'z': 36,
 'A': 37,
 'B': 38,
 'C': 39,
 'D': 40,
 'E': 41,
 'F': 42,
 'G': 43,
 'H': 44,
 'I': 45,
 'J': 46,
 'K': 47,
 'L': 48,
 'M': 49,
 'N': 50,
 'O': 51,
 'P': 52,
 'Q': 53,
 'R': 54,
 'S': 55,
 'T': 56,
 'U': 57,
 'V': 58,
 'W': 59,
 'X': 60,
 'Y': 61,
 'Z': 62,
 '!': 63,
 '"': 64,
 '#': 65,
 '$': 66,
 '%': 67,
 '&': 68,
 "'": 69,
 '(': 70,
 ')': 71,
 '*': 72,
 '+': 73,
 ',': 74,
 '-': 75,
 '.': 76,
 '/': 77,
 ':': 78,
 ';': 79,
 '<': 80,
 '=': 81,
 '>': 82,
 '?': 83,
 '@': 84,
 '[': 85,
 '\\': 86,
 ']': 87,
 '^': 88,
 '_': 89,
 '`': 90,
 '{': 91,
 '|': 92,
 '}': 93,
 '~': 94,
 ' ': 95,
 '\t': 96,
 '\n': 97,
 '\r': 98,
 '\x0b': 99,
 '\x0c': 100}

In [7]:
# Next, we vectorize our samples.
# We will only consider the first `max_length` characters in each sample.
max_length = 50

# This is where we store our results:
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.

In [8]:
results

array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])

In [None]:
np.shape(results)

(2, 50, 101)

Note that Keras has built-in utilities for one-hot encoding text at the word level or character level, starting from raw text data. 
This is what you should actually be using, as it takes care of a number of important features, such as stripping special characters 
from strings, or only taking into account the top N most common words in your dataset (a common restriction to avoid dealing with very large input 
vector spaces).

## Using Keras for word-level one-hot encoding:

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We create a tokenizer, configured to only take
# into account the top-1000 most common words
tokenizer = Tokenizer(num_words=1000)

# This builds the word index
tokenizer.fit_on_texts(samples)

# This turns strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)

In [None]:
sequences

[[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]

What else did the tokenizer learn from the two documents?

In [None]:
# How many times does each word appear across all documents (samples)?
print(tokenizer.word_counts)

OrderedDict([('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('dog', 1), ('ate', 1), ('my', 1), ('homework', 1)])


In [None]:
# How many documents (samples) were processed?
print(tokenizer.document_count)

2


In [None]:
# Get the token word index
print(tokenizer.word_index)

{'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}


In [None]:
# In how many documents (samples) did a word appear?
print(tokenizer.word_docs)

defaultdict(<class 'int'>, {'on': 1, 'the': 2, 'cat': 1, 'mat': 1, 'sat': 1, 'dog': 1, 'ate': 1, 'homework': 1, 'my': 1})


In [None]:
# You could also directly get the one-hot binary representations.
# Note that other vectorization modes than one-hot encoding are supported (see end of Colab notebook)!
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# This is how you can recover the word index that was computed
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 9 unique tokens.


In [None]:
one_hot_results

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

In [None]:
np.shape(one_hot_results)

(2, 1000)


A variant of one-hot encoding is the so-called "one-hot hashing trick", which can be used when the number of unique tokens in your 
vocabulary is too large to handle explicitly. Instead of explicitly assigning an index to each word and keeping a reference of these 
indices in a dictionary, one may hash words into vectors of fixed size. This is typically done with a very lightweight hashing function. 
The main advantage of this method is that it does away with maintaining an explicit word index, which 
saves memory and allows online encoding of the data (starting to generate token vectors right away, before having seen all of the available 
data). The one drawback of this method is that it is susceptible to "hash collisions": two different words may end up with the same hash, 
and subsequently any machine learning model looking at these hashes won't be able to tell the difference between these words. The likelihood 
of hash collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed.
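
One practical caveat: Python's built-in `hash` is salted per process for strings (unless `PYTHONHASHSEED` is fixed), so the same word can land on a different index in a different session. The sketch below shows one possible way to get reproducible indices. The helper name `stable_bucket` and the choice of MD5 are assumptions made for illustration, not something the original notebook prescribes.

```python
import hashlib

def stable_bucket(word, dimensionality=1000):
    """Map a word to an index in [0, dimensionality) reproducibly."""
    # MD5 is used here only as a fixed, unsalted hash (not for security).
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % dimensionality

for word in 'The cat sat on the mat.'.split():
    print(word, stable_bucket(word))
```

Running this twice, even in separate sessions, should print the same indices, which is what you need if vectors are stored or shared between training and inference.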

## Word-level one-hot encoding with hashing trick (toy example):

The hashing trick is useful when the model is learning online, i.e. when we cannot build a token index before modelling. Another advantage is that we save memory, since there is no token index to store. However, if the dimensionality is set too low, hash collisions become likely.
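
To get a feeling for how low is "too low", the sketch below estimates the probability that at least two distinct tokens collide. It assumes an idealized hash that spreads tokens uniformly and independently over the buckets (the classic birthday-problem setting); the function name `collision_probability` is introduced here for illustration.

```python
def collision_probability(num_tokens, dimensionality):
    """Probability that at least two of `num_tokens` distinct tokens
    share a bucket, assuming a uniform, independent hash."""
    p_no_collision = 1.0
    for k in range(num_tokens):
        # The (k+1)-th token must avoid the k buckets already taken.
        p_no_collision *= (dimensionality - k) / dimensionality
    return 1.0 - p_no_collision

# 10 unique tokens, as in the toy samples, for several hashing-space sizes.
for d in (10, 100, 1000, 100_000):
    print(d, round(collision_probability(10, d), 4))
```

The probability drops quickly as the hashing space grows relative to the number of unique tokens.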

In [9]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We will store our words as vectors of size 1000.
# Note that if you have close to 1000 words (or more)
# you will start seeing many hash collisions, which
# will decrease the accuracy of this encoding method.
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))

for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into a "random" integer index
        # between 0 and 999 (i.e. below `dimensionality`).
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.

In [10]:
results

array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])

In [11]:
np.shape(results)

(2, 10, 1000)

In [12]:
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Print the raw hash, its absolute value, and the resulting
        # index between 0 and 999.
        index = abs(hash(word)) % dimensionality
        print(word, '\t\t',  hash(word), '\t', abs(hash(word)), '\t', index)

The 		 -2816866875701909779 	 2816866875701909779 	 779
cat 		 8719935608692621737 	 8719935608692621737 	 737
sat 		 -7813532714720098285 	 7813532714720098285 	 285
on 		 8183384042238474669 	 8183384042238474669 	 669
the 		 5095630944560830428 	 5095630944560830428 	 428
mat. 		 57538674282556307 	 57538674282556307 	 307
The 		 -2816866875701909779 	 2816866875701909779 	 779
dog 		 -7000368803002212508 	 7000368803002212508 	 508
ate 		 7071214967748246649 	 7071214967748246649 	 649
my 		 -1868773100422285156 	 1868773100422285156 	 156
homework. 		 7369738959097316453 	 7369738959097316453 	 453


**NOTE**: **%** is the modulus operator. It divides the left-hand operand by the right-hand operand and returns the remainder. With `dimensionality` set to 1000, the result is therefore always an integer between 0 and 999. These integers are then used as indices into the last axis of the `results` tensor.
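
As a small worked example, the snippet below applies the operator to the hash value printed for `'The'` in the run above (the value will differ in your session). It also shows that in Python the result of `%` with a positive right-hand operand is never negative, so leaving out `abs` would still give a valid index, just a different one.

```python
dimensionality = 1000
h = -2816866875701909779  # hash value printed for 'The' in the run above

# With abs(): the last three digits of the absolute value.
print(abs(h) % dimensionality)  # 779

# Without abs(): Python's % takes the sign of the divisor,
# so the result is still in [0, dimensionality).
print(h % dimensionality)       # 221
```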

## Keras built-in methods for text processing

There are several methods in the Keras text module (`keras.preprocessing.text`) that automate the processing of text sequences. The links below will direct you to the Keras documentation.

Keras [`Tokenizer`](https://keras.io/preprocessing/text/#tokenizer)

Keras [`hashing_trick`](https://keras.io/preprocessing/text/#hashing_trick)

Keras [`one_hot`](https://keras.io/preprocessing/text/#one_hot)

Keras [`text_to_word_sequence`](https://keras.io/preprocessing/text/#text_to_word_sequence)

Keras [`texts_to_matrix`](https://keras.rstudio.com/reference/texts_to_matrix.html) - **NOTE**: this link points to the documentation of Keras for **R**. The `mode` argument accepts the following values:
* `binary`: Whether or not each word is present in the document. This is the default.
* `count`: The count of each word in the document.
* `tfidf`: The Term Frequency-Inverse Document Frequency (TF-IDF) score of each word in the document.
* `freq`: The frequency of each word as a ratio of words within each document.

In [13]:
from tensorflow.keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We create a tokenizer with `num_words=10`. Index 0 is reserved,
# so only the 9 most common words (indices 1 to 9) are taken into account.
tokenizer = Tokenizer(num_words=10)

# This builds the word index
tokenizer.fit_on_texts(samples)

# This turns the strings into one matrix per vectorization mode
# (one row per sample, one column per word index).
mode_binary = tokenizer.texts_to_matrix(samples, mode='binary')
mode_count = tokenizer.texts_to_matrix(samples, mode='count')
mode_tfidf = tokenizer.texts_to_matrix(samples, mode='tfidf')
mode_freq = tokenizer.texts_to_matrix(samples, mode='freq')

In [15]:
print(tokenizer.word_index)

{'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}


In [None]:
mode_binary

array([[0., 1., 1., 1., 1., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1., 1., 1., 1.]])

In [16]:
np.shape(mode_binary)

(2, 10)

In [17]:
mode_count

array([[0., 2., 1., 1., 1., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 1., 1., 1., 1.]])

In [18]:
np.shape(mode_count)

(2, 10)

In [19]:
mode_tfidf

array([[0.        , 0.86490296, 0.69314718, 0.69314718, 0.69314718,
        0.69314718, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.51082562, 0.        , 0.        , 0.        ,
        0.        , 0.69314718, 0.69314718, 0.69314718, 0.69314718]])

In [20]:
np.shape(mode_tfidf)

(2, 10)
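
The `tfidf` values can be reproduced by hand. The sketch below uses a formula that matches the numbers printed above: a term frequency of `1 + log(count)` multiplied by an inverse document frequency of `log(1 + document_count / (1 + word_docs))`. Treat the formula as an assumption about the `Tokenizer` internals and check the Keras source for your version; the helper name `tfidf` is introduced here for illustration.

```python
import numpy as np

document_count = 2                  # value of tokenizer.document_count
word_docs = {'the': 2, 'cat': 1}    # subset of tokenizer.word_docs
counts_doc1 = {'the': 2, 'cat': 1}  # word counts in the first sample

def tfidf(count, docs_with_word, n_docs):
    # Assumed formula; it reproduces the values in the output above.
    tf = 1 + np.log(count)
    idf = np.log(1 + n_docs / (1 + docs_with_word))
    return tf * idf

for word, count in counts_doc1.items():
    print(word, tfidf(count, word_docs[word], document_count))
```

Words that occur in every document (here `'the'`) get a lower inverse document frequency than words that occur in only one, which is why `'the'` scores about 0.51 in the second sample while the other words score about 0.69.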

In [21]:
mode_freq

array([[0.        , 0.33333333, 0.16666667, 0.16666667, 0.16666667,
        0.16666667, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.2       , 0.        , 0.        , 0.        ,
        0.        , 0.2       , 0.2       , 0.2       , 0.2       ]])

In [22]:
np.shape(mode_freq)

(2, 10)

Find out which methods and attributes are available in class Tokenizer.

In [None]:
dir(tokenizer)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_keras_api_names',
 '_keras_api_names_v1',
 'analyzer',
 'char_level',
 'document_count',
 'filters',
 'fit_on_sequences',
 'fit_on_texts',
 'get_config',
 'index_docs',
 'index_word',
 'lower',
 'num_words',
 'oov_token',
 'sequences_to_matrix',
 'sequences_to_texts',
 'sequences_to_texts_generator',
 'split',
 'texts_to_matrix',
 'texts_to_sequences',
 'texts_to_sequences_generator',
 'to_json',
 'word_counts',
 'word_docs',
 'word_index']

In [None]:
hash('Oliver')

8368648330736616601

In [None]:
abs(hash('Oliver'))

8368648330736616601

In [None]:
abs(hash('Oliver')) % 10

1

In [None]:
abs(hash('Stavanger')) % 10

9