# AI and ML for Coders

Let's see how to tokenize text with TensorFlow Keras.

In [1]:
import tensorflow
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?'
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)

{'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}


In [2]:
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[1, 2, 3, 4, 5], [1, 2, 3, 6, 5], [2, 7, 4, 1]]


`fit_on_texts` fits the tokenizer to the words in the given sentences, creating a vocabulary that maps each word to an integer index.
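To make the idea of a vocabulary concrete, here is a minimal pure-Python sketch of roughly what fitting does: lowercase the text, drop punctuation, count the words, and number them from 1 in order of decreasing frequency. The helper name `build_word_index` and the regex are assumptions made for illustration; this only approximates the Keras `Tokenizer` and is not its actual implementation.

```python
import re
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: map each word to an integer rank by frequency."""
    counts = Counter()
    for text in texts:
        # Lowercase, replace punctuation with spaces, then split on whitespace.
        words = re.sub(r"[^\w\s']", " ", text.lower()).split()
        counts.update(words)
    # Sort by descending count; sorted() is stable, so ties keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    # Numbering starts at 1, which leaves 0 free for use as a padding value.
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?'
]
word_index = build_word_index(sentences)
print(word_index)
```

For these three sentences the sketch should produce the same mapping as the tokenizer output printed earlier, since `today` and `is` occur most often and ties are broken by first appearance.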

This means that words not in the initial sentences are ignored when new text is converted to sequences. Ignoring them removes context and information, so it is better to map unknown words to a special token instead. These unknown words are called out-of-vocabulary (`OOV`) tokens.

In [3]:
test_data = [
    'Today is a snowy day',
    'Will it be rainy tomorrow?'
]

In [4]:
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

test_sequences = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_sequences)

{'<OOV>': 1, 'today': 2, 'is': 3, 'a': 4, 'sunny': 5, 'day': 6, 'rainy': 7, 'it': 8}
[[2, 3, 4, 1, 6], [1, 8, 1, 7, 1]]
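The lookup behind these numbers can be sketched in plain Python: every known word maps to its index, and anything else falls back to the index of the OOV token. The function name `to_sequences` is hypothetical and the word-splitting regex is a simplification of what Keras does; the `word_index` literal is copied from the printed output above.

```python
import re

# Vocabulary copied from the printed word_index above.
word_index = {'<OOV>': 1, 'today': 2, 'is': 3, 'a': 4,
              'sunny': 5, 'day': 6, 'rainy': 7, 'it': 8}

def to_sequences(texts, word_index, oov_token='<OOV>'):
    """Hypothetical helper: replace each word with its index, or the OOV index."""
    oov_index = word_index[oov_token]
    sequences = []
    for text in texts:
        # Lowercase, replace punctuation with spaces, then split on whitespace.
        words = re.sub(r"[^\w\s']", " ", text.lower()).split()
        # dict.get falls back to the OOV index for words never seen during fitting.
        sequences.append([word_index.get(word, oov_index) for word in words])
    return sequences

test_data = [
    'Today is a snowy day',
    'Will it be rainy tomorrow?'
]
test_sequences = to_sequences(test_data, word_index)
print(test_sequences)
```

Note that `snowy`, `will`, `be`, and `tomorrow` all collapse to the same index, so the model can tell that a word was there but not which one.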


Sequences must all have the same length when they are fed into a model. To handle batches of sentences of varying length, pad the shorter sequences with a padding token (0 by default) up to the maximum expected length.

In [5]:
sentences = [
    'Today is a sunny day',
    'Today is a rainy day',
    'Is it sunny today?',
    'I really enjoyed walking in the snow today'
]

seq = tokenizer.texts_to_sequences(sentences)
print(seq)

[[2, 3, 4, 5, 6], [2, 3, 4, 7, 6], [3, 8, 5, 2], [1, 1, 1, 1, 1, 1, 1, 2]]


In [6]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(seq)
print(padded)

[[0 0 0 2 3 4 5 6]
 [0 0 0 2 3 4 7 6]
 [0 0 0 0 3 8 5 2]
 [1 1 1 1 1 1 1 2]]


By default, sequences are _pre-padded_. Use the `padding` parameter to switch to _post-padding_.

In [7]:
padded = pad_sequences(seq, padding='post')
print(padded)

[[2 3 4 5 6 0 0 0]
 [2 3 4 7 6 0 0 0]
 [3 8 5 2 0 0 0 0]
 [1 1 1 1 1 1 1 2]]
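The difference between the two modes is only where the zeros go. A minimal list-based sketch, assuming a hypothetical helper named `pad` (this is not how `pad_sequences` is implemented, and it returns plain lists rather than a NumPy array):

```python
def pad(sequences, padding='pre', value=0):
    """Hypothetical helper: pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [value] * (longest - len(seq))
        # 'pre' puts the filler before the tokens, 'post' puts it after.
        padded.append(filler + seq if padding == 'pre' else seq + filler)
    return padded

seq = [[2, 3, 4, 5, 6], [2, 3, 4, 7, 6], [3, 8, 5, 2], [1, 1, 1, 1, 1, 1, 1, 2]]
pre_padded = pad(seq)
post_padded = pad(seq, padding='post')
print(pre_padded)
print(post_padded)
```

The longest sequence has eight tokens, so it is returned unchanged in both modes while the shorter ones gain zeros on one side.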


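Set the desired maximum length with the `maxlen` parameter. Sequences longer than `maxlen` are truncated from the front by default; pass `truncating='post'` to cut from the end instead.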

In [8]:
padded = pad_sequences(seq, padding='post', maxlen=6)
print(padded)

padded = pad_sequences(seq, padding='post', maxlen=6, truncating='post')
print(padded)

[[2 3 4 5 6 0]
 [2 3 4 7 6 0]
 [3 8 5 2 0 0]
 [1 1 1 1 1 2]]
[[2 3 4 5 6 0]
 [2 3 4 7 6 0]
 [3 8 5 2 0 0]
 [1 1 1 1 1 1]]
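The two printed arrays differ only in the last row, because that is the only sequence longer than `maxlen`. The choice of which end to cut can be sketched with list slicing; the helper name `truncate` is hypothetical, and the sketch assumes `maxlen` is at least 1.

```python
def truncate(seq, maxlen, truncating='pre'):
    """Hypothetical helper: cut a sequence down to at most maxlen tokens."""
    if truncating == 'pre':
        return seq[-maxlen:]   # drop tokens from the front, keep the end
    return seq[:maxlen]        # drop tokens from the end, keep the start

long_seq = [1, 1, 1, 1, 1, 1, 1, 2]
kept_end = truncate(long_seq, 6)
kept_start = truncate(long_seq, 6, truncating='post')
print(kept_end)
print(kept_start)
```

Which side to keep depends on where the useful information sits: cutting from the front preserves the final word (`today`, index 2), while cutting from the end discards it here.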


In [9]:
!pip install bs4==0.0.1

Collecting bs4==0.0.1
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)
Collecting soupsieve>1.2
  Downloading soupsieve-2.2.1-py3-none-any.whl (33 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1271 sha256=f67f7b5c87f807d24ab2ef5d7af5fd87fcc137e10260e70e370ef17f6b427d4f
  Stored in directory: /root/.cache/pip/wheels/0a/9e/ba/20e5bbc1afef3a491f0b3bb74d508f99403aabe76eda2167ca
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.10.0 bs4-0.0.1 soupsieve-2.2.1


In [10]:
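from bs4 import BeautifulSoup

# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(sentences[0], 'html.parser')
# get_text() returns the text with any HTML tags stripped out.
sentence = soup.get_text()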

## Skipping rest of chapter

The rest of the chapter covers loading data from different sources and tokenizing it. For me that's trivial, so I'm going to skip it.

Ultimately I hope to stop following a textbook and start creating my own projects.
