
# Natural Language Processing
- corpus: the collection of input texts
- token: a unit of text (here, a word) that the tokenizer maps to an integer index

### Tokenizer()
- punctuation is stripped
- all words are lowercased
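
The two behaviors listed above can be illustrated with a minimal pure-Python sketch. This is not the Keras implementation: the helper name `normalize` and the `FILTERS` string are assumptions, chosen to resemble the Tokenizer defaults (note that the apostrophe is not in the filter set, so "son's" stays one word).

```python
# Hypothetical helper, not part of Keras: mimics lowercasing + punctuation stripping.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'  # assumed filter set; apostrophe is kept


def normalize(text):
    """Lowercase the text, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


print(normalize("You love my dog!"))  # ['you', 'love', 'my', 'dog']
```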

In [6]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [7]:
sentences = [
    "I love my dog",
    "I love my cat",
    "You love my dog!" # tokenizer strips punctuation out
]

In [15]:
tokenizer = Tokenizer(num_words=100) # keep only the most frequent words (ranked by count) when encoding
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)


{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
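
The ordering in the output above follows word frequency: the most frequent word gets index 1, and ties keep the order in which the words were first seen. A rough pure-Python sketch of that ranking, assuming a simplified punctuation cleanup rather than the real Tokenizer filters:

```python
from collections import Counter

sentences = ["I love my dog", "I love my cat", "You love my dog!"]

# Count lowercased words with surrounding punctuation removed (simplified cleaning).
counts = Counter(
    word.strip("!?.,") for s in sentences for word in s.lower().split()
)

# Sort by descending count; sorted() is stable, so ties keep first-seen order.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Indices start at 1; 0 is left free so it can be used for padding later.
word_index = {word: i for i, (word, _) in enumerate(ranked, start=1)}
print(word_index)
# {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```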


# Text to Sequences

In [16]:
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

In [17]:
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]


In [18]:
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)

print(test_seq)

[[4, 2, 1, 3], [1, 3, 1]]


- We lost information: words that were not seen when fitting the tokenizer ('really', 'loves', 'manatee') are silently dropped from the sequences.

### What if a word is not in our vocabulary?
- Pass oov_token="<OOV>" to the Tokenizer; OOV stands for "out of vocabulary".
- The OOV token gets index 1, so a 1 in a sequence stands for an unknown word.

In [21]:
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>") # any string that does not occur in the corpus works, e.g. oov_token="<UNK>"
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)

test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)

print(test_seq)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
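
The OOV lookup that produced the last line of output can be sketched in plain Python. The helper `to_sequence` is hypothetical (it is not a Keras function), and the vocabulary is copied from the printed `word_index` above.

```python
# Vocabulary copied from the output above; index 1 is the OOV token.
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


def to_sequence(text, word_index, oov_token="<OOV>"):
    """Hypothetical helper: map each lowercased word to its index, or to the OOV index."""
    oov_index = word_index[oov_token]
    return [word_index.get(word, oov_index) for word in text.lower().split()]


test_data = ['i really love my dog', 'my dog loves my manatee']
test_seq = [to_sequence(text, word_index) for text in test_data]
print(test_seq)  # [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
```

Unlike the earlier run without an OOV token, each encoded sequence now has the same length as the original sentence.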


# Padding

In [27]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

print('Not padded: ', sequences)
print('\n')
padded = pad_sequences(sequences)
print('With padding:\n\n', padded)

Not padded:  [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]


With padding:

 [[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


In [28]:
# post padding

padded_post = pad_sequences(sequences, padding='post')

print(padded_post)

[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]


In [30]:
# set maxlen explicitly (by default, the length of the longest sequence is used);
# longer sequences are truncated from the front unless truncating='post' is given

padded_post_maxlen = pad_sequences(sequences, padding='post', maxlen=5)

print(padded_post_maxlen)

[[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 9  2  4 10 11]]


In [34]:
# truncate from the end

padded_trunc = pad_sequences(sequences, padding='post',  # default 'pre'
                             truncating='post', maxlen=5) # default 'pre'

print(padded_trunc)

[[5 3 2 4 0]
 [5 3 2 7 0]
 [6 3 2 4 0]
 [8 6 9 2 4]]


In [36]:
padded_trunc = pad_sequences(sequences, padding='post',  # default 'pre'
                             truncating='post', maxlen=9) # default 'pre'

print(padded_trunc)

[[ 5  3  2  4  0  0  0  0  0]
 [ 5  3  2  7  0  0  0  0  0]
 [ 6  3  2  4  0  0  0  0  0]
 [ 8  6  9  2  4 10 11  0  0]]
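
The interaction of `padding`, `truncating`, and `maxlen` in the cells above can be summarized with a small pure-Python sketch. The function `pad` is a hypothetical stand-in that returns nested lists instead of a NumPy array, and it assumes `maxlen` is at least 1.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Hypothetical stand-in for pad_sequences that works on plain lists."""
    if maxlen is None:
        # Default: pad everything to the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        if len(seq) > maxlen:
            # 'pre' drops tokens from the start, 'post' drops them from the end.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + list(seq) if padding == "pre" else list(seq) + fill)
    return result


sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

print(pad(sequences)[0])                                   # [0, 0, 0, 5, 3, 2, 4]
print(pad(sequences, padding="post", maxlen=5)[-1])        # [9, 2, 4, 10, 11]
print(pad(sequences, padding="post", truncating="post", maxlen=5)[-1])  # [8, 6, 9, 2, 4]
```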


# Tokenizing the Sarcasm Dataset
- task: sarcasm detection on news headlines (each record has a headline, an is_sarcastic label, and an article link)

In [37]:
!wget https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json

--2023-11-20 19:41:16--  https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.179.207, 172.253.115.207, 172.253.122.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.179.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5643545 (5.4M) [application/json]
Saving to: ‘sarcasm.json’


2023-11-20 19:41:16 (112 MB/s) - ‘sarcasm.json’ saved [5643545/5643545]



In [38]:
import json

# Load the JSON file
with open("./sarcasm.json", 'r') as f:
    datastore = json.load(f)

In [40]:
type(datastore)

list

In [41]:
len(datastore)

26709

In [42]:
datastore[0]

{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5',
 'headline': "former versace store clerk sues over secret 'black code' for minority shoppers",
 'is_sarcastic': 0}

In [43]:
sentences = []
labels = []
urls = []

for item in datastore:
  sentences.append(item['headline'])
  labels.append(item['is_sarcastic'])
  urls.append(item['article_link'])

In [44]:
len(labels)

26709

In [45]:
labels[:5]

[0, 0, 1, 1, 0]

In [47]:
import numpy as np
import pandas as pd

pd.Series(labels).value_counts()

0    14985
1    11724
dtype: int64
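
To read the counts above as class proportions, a quick sketch without pandas (the counts are copied from the `value_counts()` output; the variable names are illustrative):

```python
counts = {0: 14985, 1: 11724}  # label -> number of headlines, from the output above
total = sum(counts.values())

# Share of each label in the dataset.
fractions = {label: n / total for label, n in counts.items()}
for label, frac in fractions.items():
    print(f"label {label}: {frac:.1%}")
# label 0: 56.1%
# label 1: 43.9%
```

The classes are mildly imbalanced, with somewhat more non-sarcastic than sarcastic headlines.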

# Preprocessing the Headlines

In [49]:
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences) # builds a vocabulary with 29,657 entries in word_index
word_index = tokenizer.word_index

In [57]:
print(len(word_index))

29657


In [53]:
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

print(padded[0])
print('\n\n')
print(padded.shape) # 26709 headlines, each padded to 40 tokens (the length of the longest headline)

[  308 15115   679  3337  2298    48   382  2576 15116     6  2577  8434
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]



(26709, 40)


In [56]:
# Print a sample headline
index = 2
print(f'sample headline: {sentences[index]}')
print(f'padded sequence: {padded[index]}')
print()

# Print dimensions of padded sequences
print(f'shape of padded sequences: {padded.shape}')

sample headline: mom starting to fear son's web series closest thing she will have to grandchild
padded sequence: [  145   838     2   907  1749  2093   582  4719   221   143    39    46
     2 10736     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]

shape of padded sequences: (26709, 40)


- Even with Tokenizer(num_words=10), word_index still contains the full vocabulary (29,657 entries here).
- num_words only takes effect when the sequences are created: only the most frequent words (those whose index is below num_words) keep their own index, and every other word is encoded as <OOV>.
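
A minimal sketch of this behavior, using the small vocabulary from the earlier OOV example rather than the sarcasm data. The helper `encode` is hypothetical and only approximates what `texts_to_sequences` does; the cutoff rule (index below `num_words`) is stated here as an assumption about the Keras behavior.

```python
# Small vocabulary from the earlier OOV example; index 1 is the OOV token.
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


def encode(text, word_index, num_words=None, oov_index=1):
    """Hypothetical helper: words that are unknown or ranked at/after num_words become OOV."""
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is None or (num_words is not None and index >= num_words):
            index = oov_index
        sequence.append(index)
    return sequence


print(len(word_index))                                    # 11, the cap does not shrink word_index
print(encode("i love my dog", word_index))                # [5, 3, 2, 4]
print(encode("i love my dog", word_index, num_words=4))   # [1, 3, 2, 1]
```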