<img src="https://www.th-koeln.de/img/logo.svg" style="float: right;" width="200">

# 10th exercise: <font color="#C70039">One-hot encodings of words or characters</font>
* Course: DIS21a.1
* Lecturer: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Author of notebook modifications and adaptations: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Date:  08.08.2023

<img src="https://miro.medium.com/max/674/1*YEJf9BQQh0ma1ECs6x_7yQ.png" style="float: center;" width="500">

---------------------------------
**GENERAL NOTE 1**: 
Please make sure you read the entire notebook, since it contains a lot of information about your tasks (e.g. regarding the setting of certain parameters or specific computational tricks). The markdown cells and the code comments also explain how things work together as a whole.

**GENERAL NOTE 2**: 
* When commenting source code, please use English only.
* When describing an observation (for instance, after you have run through your test plan), you may use German.

This applies to all exercises in DIS 21a.1.  

---------------------

### <font color="ce33ff">DESCRIPTION</font>:

This notebook shows how to deal with text data, a skill that is needed when modeling sequences (e.g. spoken language).

One-hot encoding is the most common and most basic way to turn a token (word or character) into a vector. It was used in the previous IMDB and Reuters classification examples, in both cases at the level of entire words. 
One-hot encoding is fairly simple: associate a unique integer index i with every word, then turn this index into a binary vector of size N (the size of the vocabulary) that is all zeros except for the i-th entry, which is 1. 
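
As a minimal sketch of this definition, an index i can be turned into a vector of size N as shown below. The helper name `one_hot` and the toy vocabulary size are illustrative assumptions and not part of the exercise code; indices are zero-based here, as usual in Python.

```python
import numpy as np

def one_hot(i, n):
    """Return a binary vector of size n with a 1 at position i (hypothetical helper)."""
    vector = np.zeros(n)
    vector[i] = 1.0
    return vector

# toy vocabulary of size N = 5; encode the word with index i = 2
vec = one_hot(2, 5)
print(vec)
```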

Of course, one-hot encoding can be done at character level as well. This exercise shows how one-hot encoding is implemented. The first two examples implement it from scratch, one for words and one for characters. The third example uses the built-in utilities in Keras instead.

-------------------------------------------------------------------------------------------------------------

### <font color="FFC300">TASKS</font>:
The tasks that you need to work on in this notebook are always indicated below as bullet points. 
If a task is more challenging and consists of several steps, this is indicated as well. 
Make sure you have worked through the task list and documented what you did. 
Use markdown cells for this.<br> 
<font color=red>Make sure you don't forget to specify your name and your matriculation number in the notebook before submitting it.</font>

**YOUR TASKS in this exercise are as follows**:
1. import the notebook into Google Colab.
2. make sure you have specified your name and your matriculation number in the header below my name and date. 
    * set the date too and remove mine.
3. read the entire notebook carefully. 
    * add comments wherever you feel they are necessary for better understanding.
    * run the notebook and try to follow and understand all steps and examples.
4. the exercise in this notebook contains a single implementation task. 
    * implement a one-hot encoding on character level using Keras' built-in functions. 
    * use the three examples in this notebook as a guide. 
   
-----------------------------------------------------------------------------------

# START OF THE NOTEBOOK CODE
----------------------------------------------------------------------------------------------------------------------


## <font color="#C70039">EXAMPLE I</font>
## Word level one-hot encoding example

### necessary imports
Further imports are added as soon as they are needed.

In [None]:
import tensorflow
tensorflow.keras.__version__

This is the initial data, with one entry per "sample". In this example, a "sample" is just a sentence, but it could also be an entire document. The first step is to build an index by tokenizing the samples via the `split` method. In real life, punctuation and special characters would also be stripped from the samples.

In [None]:
import numpy as np

# sample data
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# build an index of all tokens in the data
token_index = {}
for sample in samples:
    # tokenize the samples via the `split` method
    for word in sample.split():
        if word not in token_index:
            # assign a unique index to each unique word
            # IMPORTANT NOTE: index 0 is not used for anything
            token_index[word] = len(token_index) + 1

# Now vectorize the samples
# Consider the first `max_length` words in each sample
max_length = 10

# store the results in a numpy array
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
print(results.shape)

for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        index = token_index.get(word)
        results[i, j, index] = 1.
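
Because Example I tokenizes with a plain `split`, 'The' and 'the' receive different indices and 'mat.' keeps its trailing period. The following is a minimal sketch of the stripping step mentioned before the example, assuming that lowercasing plus removal of ASCII punctuation is sufficient for these samples. The names `normalize` and `clean_token_index` are hypothetical.

```python
import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

def normalize(text):
    """Lowercase the text and remove ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

clean_token_index = {}
for sample in samples:
    for word in normalize(sample).split():
        if word not in clean_token_index:
            # index 0 stays unused, as in Example I
            clean_token_index[word] = len(clean_token_index) + 1

print(clean_token_index)
```

With this normalization the vocabulary shrinks from ten tokens to nine, since 'The' and 'the' are merged.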

-------------------
## <font color="#C70039">EXAMPLE II</font>
## Character level one-hot encoding example

In [None]:
import string

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
# take all printable ASCII characters
characters = string.printable  
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
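
One way to check such an encoding is to decode it again. The following sketch rebuilds a character-level encoding for a single sample and recovers the text with `argmax`. The names `reverse_index`, `encoded` and `decoded` are illustrative, and the sketch assumes that every character of the sample is contained in `string.printable`.

```python
import string
import numpy as np

sample = 'The cat sat on the mat.'
characters = string.printable
token_index = dict(zip(characters, range(1, len(characters) + 1)))
# reverse mapping from integer index back to character
reverse_index = {index: char for char, index in token_index.items()}

max_length = 50
encoded = np.zeros((max_length, len(characters) + 1))
for j, character in enumerate(sample[:max_length]):
    encoded[j, token_index[character]] = 1.

# rows that contain only zeros are padding; skip them when decoding
decoded = ''.join(reverse_index[int(row.argmax())] for row in encoded if row.any())
print(decoded)
```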

--------------------
## <font color="#C70039">EXAMPLE III</font>
## Word and character level one-hot encoding example with Keras' built-in functions 

In Keras there are built-in functions for one-hot encoding text at word or character level. 

These built-in functions are to be preferred since they will take care of a number of important features, such as stripping special characters from strings or taking only the top N most common words in a data set, which is a common restriction to avoid dealing with very large input vector spaces.
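
The "top N most common words" restriction can be sketched in plain Python as follows. This only illustrates the idea and is not how Keras implements it; the toy corpus and the names `top_n`, `vocabulary` and `top_token_index` are assumptions.

```python
from collections import Counter

corpus = ['the cat sat on the mat', 'the cat ate the dog']
top_n = 2  # keep only the 2 most frequent words

# count how often each word occurs in the whole corpus
counts = Counter(word for text in corpus for word in text.split())
# most_common returns (word, count) pairs, most frequent first
vocabulary = [word for word, _ in counts.most_common(top_n)]
# index 0 is left unused, as in Examples I and II
top_token_index = {word: i + 1 for i, word in enumerate(vocabulary)}
print(top_token_index)
```

All words outside the restricted vocabulary are simply ignored during encoding, which keeps the size of the input vectors bounded.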

Using Keras for one-hot encoding on word level works as follows.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# create a tokenizer that only considers the most common words;
# with num_words=1000 at most 999 words are kept, since index 0 is reserved
tokenizer = Tokenizer(num_words=1000)
# this builds the word index
tokenizer.fit_on_texts(samples)

# this turns strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)

# directly get the one-hot binary representation
# NOTE: vectorization modes other than one-hot encoding are supported too
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary') 

# recover the word index that was computed
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
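
Note that `texts_to_matrix` returns one row per sample rather than one row per word. The following NumPy sketch shows roughly what `mode='binary'` computes. It is an approximation for illustration and not the Keras source; the hard-coded `sequences` and the small `num_words` are assumptions chosen to keep the output readable.

```python
import numpy as np

# assumed integer sequences, e.g. as texts_to_sequences might return them
sequences = [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
num_words = 10  # small value for readability; the example above uses 1000

matrix = np.zeros((len(sequences), num_words))
for i, sequence in enumerate(sequences):
    for index in sequence:
        if index < num_words:
            # mark the word as present; repeated words still yield 1
            matrix[i, index] = 1.

print(matrix.shape)
print(matrix[0])
```

Each row marks which words occur in the sample, so the word order and the number of repetitions are lost in this representation.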

<font color="ce33ff">TASK 4</font><br>
Transfer what you have learned and implement a one-hot encoding on character level using Keras' built-in functions.


In [None]:
# add your code here
