# Lesson 1



## tokenizing sentences

In [7]:
import tensorflow as tf

sentences = [
    'I love my dog',
    'I, love my cat',
    'You love my dog!'
]

# create an instance of the TextVectorization layer
vectorize_layer = tf.keras.layers.TextVectorization()

# build the vocabulary from the words found in these sentences
vectorize_layer.adapt(sentences)

# view the result
vocabulary = vectorize_layer.get_vocabulary(include_special_tokens=False)
print(vocabulary)

[np.str_('my'), np.str_('love'), np.str_('i'), np.str_('dog'), np.str_('you'), np.str_('cat')]


* The `TextVectorization` layer builds its vocabulary only from the words found in the sentences passed to `adapt()`.
* It also lowercases every word.
* It strips out punctuation.
* `include_special_tokens=False` leaves the padding token (empty string) and the `[UNK]` token out of the returned list.
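
The behaviours listed above can be imitated in plain Python. This is only a rough sketch, not how the layer is implemented: `standardize` and `build_vocabulary` are hypothetical helpers, `string.punctuation` stands in for the layer's punctuation rule, and the tie-breaking order is an assumption chosen to match the vocabulary printed above.

```python
import string
from collections import Counter

def standardize(text):
    # lowercase and drop ASCII punctuation (stand-in for the layer's default rule)
    return text.lower().translate(str.maketrans('', '', string.punctuation))

def build_vocabulary(texts):
    # count every whitespace-separated word after standardization
    counts = Counter(word for text in texts for word in standardize(text).split())
    # most frequent first; ties in descending alphabetical order (assumption)
    return sorted(counts, key=lambda word: (counts[word], word), reverse=True)

toy_sentences = ['I love my dog', 'I, love my cat', 'You love my dog!']
toy_vocabulary = build_vocabulary(toy_sentences)
print(toy_vocabulary)
```

For these three sentences the sketch yields the same six words, in the same order, as the vocabulary printed by the layer.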

In [8]:
for idx, word in enumerate(vocabulary):
  print(idx, word)

0 my
1 love
2 i
3 dog
4 you
5 cat


Now include the special tokens. Index `0` is used for padding and index `1` is used for out-of-vocabulary words.

In [9]:
# get the vocabulary list
vocabulary = vectorize_layer.get_vocabulary()

# print the token index
for idx, word in enumerate(vocabulary):
  print(idx, word)

0 
1 [UNK]
2 my
3 love
4 i
5 dog
6 you
7 cat


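Common special tokens:

* `[PAD]`: padding, fills gaps so that all input sequences have the same length. `TextVectorization` represents it as the empty string at index `0`.
* `[UNK]`: unknown, used when the model encounters a word that is not in the vocabulary (index `1` here).
* `[CLS]`: classification token, used in models like BERT to mark the start of a text for classification. `TextVectorization` does not add it.
* `[SEP]`: separator, used in models like BERT to separate different sentences. `TextVectorization` does not add it.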

Why are special tokens often included?
Because the model needs to understand when:

* There are empty spaces (padding).
* It encounters an unknown word (word not in the vocabulary).
* It needs to separate or classify sentences.

If special tokens are missing:

* The model might get confused when input sequences have different lengths.
* It won't know how to handle unknown words.

If special tokens are included:

* It can fill empty spaces using [PAD].
* It can replace unknown words with [UNK].
* It can mark the beginning of sentences with [CLS] (for classification tasks).
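
As a small illustration of how reserved indices work, the sketch below builds a word-to-index table with the two special tokens the layer produced above. The names `PAD_TOKEN`, `UNK_TOKEN`, `token_to_index`, and `lookup` are made up for this example and are not part of the Keras API.

```python
PAD_TOKEN = ''       # index 0, reserved for padding
UNK_TOKEN = '[UNK]'  # index 1, reserved for out-of-vocabulary words

# word list copied from the Lesson 1 output, special tokens placed first
words = ['my', 'love', 'i', 'dog', 'you', 'cat']
full_vocabulary = [PAD_TOKEN, UNK_TOKEN] + words
token_to_index = {token: idx for idx, token in enumerate(full_vocabulary)}

def lookup(word):
    # fall back to the [UNK] index when the word was never seen
    return token_to_index.get(word, token_to_index[UNK_TOKEN])

print(lookup('dog'))      # known word
print(lookup('manatee'))  # unknown word, falls back to [UNK]
```

Because the special tokens take indices `0` and `1`, the first real word starts at index `2`, which is what the printed token index shows.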


# Lesson 2

* Add another sentence that is a bit longer than the others.
* This will demonstrate padding.

## encoding

In [14]:
sentences = [
    'I love my dog',
    'I, love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# create an instance of the TextVectorization layer
vectorize_layer = tf.keras.layers.TextVectorization()

# build the vocabulary from the words found in these sentences
vectorize_layer.adapt(sentences)

# view the result
vocabulary = vectorize_layer.get_vocabulary()

sequence = vectorize_layer('I love my dog')

for idx, word in enumerate(vocabulary):
  print(idx, word)

print()
print(sequence)

0 
1 [UNK]
2 my
3 love
4 dog
5 you
6 i
7 think
8 is
9 do
10 cat
11 amazing

tf.Tensor([6 3 2 4], shape=(4,), dtype=int64)


In [15]:
sequence2 = vectorize_layer(sentences)
print(sequence2)

tf.Tensor(
[[ 6  3  2  4  0  0  0]
 [ 6  3  2 10  0  0  0]
 [ 5  3  2  4  0  0  0]
 [ 9  5  7  2  4  8 11]], shape=(4, 7), dtype=int64)


* `vectorize_layer.adapt()` builds the vocabulary (tokenization).
* `vectorize_layer()` converts text into integer sequences (encoding).
* `0` is padding: it makes every row the same length as the longest one (the last row here).
* This is post-padding: the padding tokens are added at the end of the sequences.
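
To make the encoding and post-padding steps concrete, here is a plain-Python sketch that reuses the vocabulary printed above. It is an approximation for illustration only: `encode` and `post_pad` are hypothetical helpers, and `string.punctuation` is assumed to be close enough to the layer's punctuation rule for these sentences.

```python
import string

# vocabulary copied from the printed token index; position = token id
lesson2_vocabulary = ['', '[UNK]', 'my', 'love', 'dog', 'you', 'i',
                      'think', 'is', 'do', 'cat', 'amazing']
token_to_index = {token: idx for idx, token in enumerate(lesson2_vocabulary)}

def encode(text):
    # lowercase, strip punctuation, split on whitespace, then look up ids
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return [token_to_index.get(word, 1) for word in cleaned.split()]

def post_pad(sequences, value=0):
    # append zeros so every row is as long as the longest one
    width = max(len(seq) for seq in sequences)
    return [seq + [value] * (width - len(seq)) for seq in sequences]

lesson2_sentences = [
    'I love my dog',
    'I, love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
]
padded = post_pad([encode(sentence) for sentence in lesson2_sentences])
for row in padded:
    print(row)
```

The rows printed by this sketch contain the same ids as the `(4, 7)` tensor above, with the zeros at the end of the three shorter rows.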

## padding

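* This approach is meant for handling many texts by passing them through a `tf.data` pipeline.
* It is more scalable than calling `vectorize_layer()` on the whole list at once.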

### pre-padding for a small dataset using a tf.data pipeline

In [21]:
# initialize TextVectorization
vectorize_layer = tf.keras.layers.TextVectorization()

# get tokenization
vectorize_layer.adapt(sentences)
vocabulary = vectorize_layer.get_vocabulary()

sentences_dataset = tf.data.Dataset.from_tensor_slices(sentences)
sequences = sentences_dataset.map(vectorize_layer)
sequences_pre = tf.keras.utils.pad_sequences(sequences, padding='pre')
print(sequences_pre)

[[ 0  0  0  6  3  2  4]
 [ 0  0  0  6  3  2 10]
 [ 0  0  0  5  3  2  4]
 [ 9  5  7  2  4  8 11]]
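
What `padding='pre'` does to the rows can be sketched without TensorFlow. The helper `pre_pad` below is hypothetical and only mirrors the effect seen in the output above; it is not the implementation of `pad_sequences`.

```python
def pre_pad(sequences, value=0):
    # put the zeros in front so the real tokens sit at the end of each row
    width = max(len(seq) for seq in sequences)
    return [[value] * (width - len(seq)) + list(seq) for seq in sequences]

# encoded sentences copied from the earlier output
encoded = [
    [6, 3, 2, 4],
    [6, 3, 2, 10],
    [5, 3, 2, 4],
    [9, 5, 7, 2, 4, 8, 11],
]
pre_padded = pre_pad(encoded)
for row in pre_padded:
    print(row)
```

The longest row is left unchanged, and every shorter row receives its zeros on the left.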


In [28]:
for sentence, sequence in zip(sentences, sequences):
  print(f'{sentence} -> {sequence}')

I love my dog -> [6 3 2 4]
I, love my cat -> [ 6  3  2 10]
You love my dog! -> [5 3 2 4]
Do you think my dog is amazing? -> [ 9  5  7  2  4  8 11]


### padding with a maximum length


In [30]:
# pad or truncate every sequence to exactly 5 tokens;
# truncation defaults to 'pre', so the long sequence loses its first tokens
sequences_pre_trunc = tf.keras.utils.pad_sequences(sequences,
                                                   maxlen=5,
                                                   padding='pre')

print('INPUT:')
# printing the tensors directly would show their full repr, e.g.
# <tf.Tensor: shape=(4,), dtype=int64, numpy=array([6, 3, 2, 4])>
for sequence in sequences:
  print(sequence.numpy())
print()

print('OUTPUT:')
print(sequences_pre_trunc)

INPUT:
[6 3 2 4]
[ 6  3  2 10]
[5 3 2 4]
[ 9  5  7  2  4  8 11]

OUTPUT:
[[ 0  6  3  2  4]
 [ 0  6  3  2 10]
 [ 0  5  3  2  4]
 [ 7  2  4  8 11]]
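
The last output row shows that the long sequence lost its first two tokens, which is the default `truncating='pre'` behaviour of `pad_sequences`. The sketch below imitates the combination of `maxlen`, `padding`, and `truncating` for a single sequence. `pad_to_maxlen` is a hypothetical helper written for this note and assumes `maxlen >= 1`.

```python
def pad_to_maxlen(seq, maxlen, padding='pre', truncating='pre', value=0):
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' drops tokens from the front, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    # 'pre' puts the zeros in front, 'post' puts them at the end
    return fill + seq if padding == 'pre' else seq + fill

long_seq = [9, 5, 7, 2, 4, 8, 11]
short_seq = [6, 3, 2, 4]

print(pad_to_maxlen(long_seq, 5))                     # keeps the last 5 tokens
print(pad_to_maxlen(long_seq, 5, truncating='post'))  # keeps the first 5 tokens
print(pad_to_maxlen(short_seq, 5))                    # one zero added in front
```

Passing `truncating='post'` to `pad_sequences` would keep the beginning of a long sentence instead of its end.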


### keep the original sequence lengths without padding

In [23]:
# initialize TextVectorization
vectorize_layer = tf.keras.layers.TextVectorization(ragged=True)

# get tokenization
vectorize_layer.adapt(sentences)
vocabulary = vectorize_layer.get_vocabulary()

# encoded outputs
ragged_sequences = vectorize_layer(sentences)
for idx, word in enumerate(vocabulary):
  print(idx, word)
print()
print(ragged_sequences)

0 
1 [UNK]
2 my
3 love
4 dog
5 you
6 i
7 think
8 is
9 do
10 cat
11 amazing

<tf.RaggedTensor [[6, 3, 2, 4], [6, 3, 2, 10], [5, 3, 2, 4], [9, 5, 7, 2, 4, 8, 11]]>


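* With `ragged=True` each sequence keeps its own length, so nothing is padded or truncated. Longer sequences are only truncated when `maxlen` is passed to `pad_sequences`, as in the previous section.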
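
A ragged result can be thought of as a list of rows with different lengths. The plain-Python sketch below uses the rows from the output above to show how many padding zeros a dense tensor would need. The variable names are illustrative only.

```python
# rows copied from the RaggedTensor printed above
ragged_rows = [
    [6, 3, 2, 4],
    [6, 3, 2, 10],
    [5, 3, 2, 4],
    [9, 5, 7, 2, 4, 8, 11],
]

row_lengths = [len(row) for row in ragged_rows]
dense_width = max(row_lengths)  # width a padded tensor would need
padding_needed = sum(dense_width - n for n in row_lengths)

print(row_lengths)
print(dense_width)
print(padding_needed)
```

Here the three short rows would each need three zeros, so a dense version stores nine padding values that the ragged version avoids.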

## what happens when the test set contains new words

In [24]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog',
    'Do you think my dog is amazing?'
]

vectorize_layer = tf.keras.layers.TextVectorization()
vectorize_layer.adapt(sentences)

test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = vectorize_layer(test_data)
print(test_seq)

tf.Tensor(
[[6 1 3 2 4]
 [2 4 1 2 1]], shape=(2, 5), dtype=int64)


In [26]:
for idx, word in enumerate(vocabulary):
  print(idx, word)

0 
1 [UNK]
2 my
3 love
4 dog
5 you
6 i
7 think
8 is
9 do
10 cat
11 amazing


* The vocabulary only contains words seen during `adapt()`, so new words such as `really`, `loves`, and `manatee` cannot be encoded individually. Each of them is written as `1`, the `[UNK]` index.
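
To see which words ended up as `1`, the lookup can be replayed in plain Python with the vocabulary printed above. This is a sketch under assumptions: `tokenize` is a hypothetical helper, and whitespace splitting with `string.punctuation` removal is assumed to match the layer for these inputs.

```python
import string

# vocabulary copied from the printed token index; position = token id
known_vocabulary = ['', '[UNK]', 'my', 'love', 'dog', 'you', 'i',
                    'think', 'is', 'do', 'cat', 'amazing']
token_to_index = {token: idx for idx, token in enumerate(known_vocabulary)}
UNK_INDEX = 1

def tokenize(text):
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

test_data = ['i really love my dog', 'my dog loves my manatee']

encoded_test = []
unknown_words = []
for text in test_data:
    words = tokenize(text)
    # unseen words fall back to the [UNK] index
    encoded_test.append([token_to_index.get(word, UNK_INDEX) for word in words])
    unknown_words.append([word for word in words if word not in token_to_index])

print(encoded_test)
print(unknown_words)
```

Note that `loves` is unknown even though `love` is in the vocabulary, because the lookup matches whole words exactly.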

# Lesson 3



## Tokenizing the sarcasm dataset

In [37]:
import json

# open the file
with open('sarcasm.json', 'r') as f:
  datastore = json.load(f)

In [33]:
datastore[:2]

[{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5',
  'headline': "former versace store clerk sues over secret 'black code' for minority shoppers",
  'is_sarcastic': 0},
 {'article_link': 'https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365',
  'headline': "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
  'is_sarcastic': 0}]

* `datastore` is a list of dictionaries.
* Split each key into its own list: the headlines are the sentences needed for training, `is_sarcastic` provides the ground-truth labels, and `article_link` is only useful for checking sources, so it is not needed for modeling.
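
Splitting the records into separate lists could look like the sketch below. It uses only the two records shown above, and the `sample_` variable names are chosen for this example so they do not clash with the notebook's own variables.

```python
# the two records shown above, copied as a small stand-in for the full file
sample_records = [
    {'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5',
     'headline': "former versace store clerk sues over secret 'black code' for minority shoppers",
     'is_sarcastic': 0},
    {'article_link': 'https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365',
     'headline': "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
     'is_sarcastic': 0},
]

# one list per key, kept in the same order so index i refers to the same article
sample_sentences = [item['headline'] for item in sample_records]
sample_labels = [item['is_sarcastic'] for item in sample_records]
sample_links = [item['article_link'] for item in sample_records]

print(len(sample_sentences), len(sample_labels), len(sample_links))
print(sample_labels)
```

Keeping the lists aligned by index is what lets a headline be paired with its label later on.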

In [38]:
# collect the headline of every item into a list
sentences = [item['headline'] for item in datastore]

In [36]:
import tensorflow as tf

vectorize_layer = tf.keras.layers.TextVectorization()
vectorize_layer.adapt(sentences)

vocabulary = vectorize_layer.get_vocabulary()
post_padded_sequences = vectorize_layer(sentences)
print(f'padded sequence: {post_padded_sequences[2]}')

padded sequence: [  140   825     2   813  1100  2048   571  5057   199   139    39    46
     2 13050     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0]


In [40]:
# print a sample headline and sequence
index = 2
print(f'sample headline: {sentences[index]}')
print(f'padded sequence: {post_padded_sequences[index]}')
print()

# print dimensions of padded sequences
print(f'shape of padded sequences: {post_padded_sequences.shape}')

sample headline: mom starting to fear son's web series closest thing she will have to grandchild
padded sequence: [  140   825     2   813  1100  2048   571  5057   199   139    39    46
     2 13050     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0]

shape of padded sequences: (26709, 39)


### pre-padding for a large dataset

In [41]:
# instantiate the layer with `ragged=True`
vectorize_layer = tf.keras.layers.TextVectorization(ragged=True)

# build the vocabulary
vectorize_layer.adapt(sentences)

In [42]:
# apply the layer to generate a ragged tensor
ragged_sequences = vectorize_layer(sentences)

In [43]:
# print a sample headline and sequence
index = 2
print(f'sample headline: {sentences[index]}')
print(f'ragged sequence: {ragged_sequences[index]}')
print()

# print dimensions of the ragged sequences
print(f'shape of ragged sequences: {ragged_sequences.shape}')

sample headline: mom starting to fear son's web series closest thing she will have to grandchild
ragged sequence: [  140   825     2   813  1100  2048   571  5057   199   139    39    46
     2 13050]

shape of ragged sequences: (26709, None)


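* There is no padding in the output.
* The shape is now `(26709, None)`: the second dimension was previously `39` and is now `None`, which indicates that the sequences are no longer post-padded to the maximum length.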

In [45]:
ragged_sequences.dtype

tf.int64

In [47]:
# apply pre-padding to the ragged tensor
pre_padded_sequences = tf.keras.utils.pad_sequences(ragged_sequences.numpy())

# preview the result for the sequence at index 2
pre_padded_sequences[2]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,   140,   825,
           2,   813,  1100,  2048,   571,  5057,   199,   139,    39,
          46,     2, 13050], dtype=int32)

In [48]:
# print a sample headline and sequence
index = 2
print(f'sample headline: {sentences[index]}')
print(f'post-padded sequence: {post_padded_sequences[index]}')
print()
print(f'pre-padded sequence: {pre_padded_sequences[index]}')
print()

# print dimensions of padded sequences
print(f'shape of post-padded sequences: {post_padded_sequences.shape}')
print(f'shape of pre-padded sequences: {pre_padded_sequences.shape}')

sample headline: mom starting to fear son's web series closest thing she will have to grandchild
post-padded sequence: [  140   825     2   813  1100  2048   571  5057   199   139    39    46
     2 13050     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0]

pre-padded sequence: [    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0   140   825     2   813  1100  2048   571  5057   199   139    39
    46     2 13050]

shape of post-padded sequences: (26709, 39)
shape of pre-padded sequences: (26709, 39)
