<a href="https://colab.research.google.com/github/adhang/dts-machine-learning-with-tensorflow/blob/main/NLP_Notes_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notes - Natural Language Processing 1

Author: Adhang Muntaha Muhammad

[![LinkedIn](https://img.shields.io/badge/linkedin-0077B5?style=for-the-badge&logo=linkedin&logoColor=white&link=https://www.linkedin.com/in/adhangmuntaha/)](https://www.linkedin.com/in/adhangmuntaha/)
[![GitHub](https://img.shields.io/badge/github-121011?style=for-the-badge&logo=github&logoColor=white&link=https://github.com/adhang)](https://github.com/adhang)
[![Kaggle](https://img.shields.io/badge/kaggle-20BEFF?style=for-the-badge&logo=kaggle&logoColor=white&link=https://www.kaggle.com/adhang)](https://www.kaggle.com/adhang)
[![Tableau](https://img.shields.io/badge/tableau-E97627?style=for-the-badge&logo=tableau&logoColor=white&link=https://public.tableau.com/app/profile/adhang)](https://public.tableau.com/app/profile/adhang)
___

# Libraries

In [19]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenizer

## Word Index

In [10]:
sentences = ['I love my cat',
             'I love my dog']

tokenizer = Tokenizer(num_words=100)
# num_words: vocabulary cap; only the (num_words - 1) most frequent words are kept in sequences
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'cat': 4, 'dog': 5}


In [11]:
sentences = ['I love my cat',
             'I love my dog']

tokenizer = Tokenizer(num_words=3)
# num_words: vocabulary cap; only the (num_words - 1) most frequent words are kept in sequences
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'cat': 4, 'dog': 5}


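Note that `word_index` stays the same even when we change `num_words`. The tokenizer indexes every word it sees during fitting; the `num_words` limit is only applied later, when text is converted to sequences.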
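
To make the ranking concrete, here is a minimal pure-Python sketch of how a frequency-based word index can be built. The helper `build_word_index` is hypothetical and much simpler than the Keras `Tokenizer` (it only lowercases and splits on whitespace), but it reproduces the index printed above. It takes no `num_words` argument at all, which mirrors the observation that the limit plays no part in building the index.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency, most frequent first (simplified sketch)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so equally frequent words keep their first-seen order
    ranked = sorted(counts, key=lambda word: counts[word], reverse=True)
    # Index 0 is left unused so it can serve as the padding value later
    return {word: i for i, word in enumerate(ranked, start=1)}

sentences = ['I love my cat',
             'I love my dog']

word_index = build_word_index(sentences)
print(word_index)
# {'i': 1, 'love': 2, 'my': 3, 'cat': 4, 'dog': 5}
```

Because the ranking depends only on word counts, a word that appears more often moves toward index 1.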

## Sequences

### Num Words Effect

In [12]:
sentences = ['I love my cat',
             'I love my dog']

tokenizer = Tokenizer(num_words=100)
# num_words: vocabulary cap; only the (num_words - 1) most frequent words are kept in sequences
tokenizer.fit_on_texts(sentences)

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[1, 2, 3, 4], [1, 2, 3, 5]]


In [13]:
sentences = ['I love my cat',
             'I love my dog']

tokenizer = Tokenizer(num_words=3)
# num_words: vocabulary cap; only the (num_words - 1) most frequent words are kept in sequences
tokenizer.fit_on_texts(sentences)

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[1, 2], [1, 2]]


The sequences are trimmed when `num_words` is smaller than the vocabulary size. Only words whose index is below `num_words` are kept, so `num_words=3` keeps just indices `1` and `2`.

A larger `num_words` lets the model see more of the vocabulary, which can improve accuracy, but it also makes training slower and can lead to overfitting.
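
The filtering rule can be sketched in plain Python. The function `to_sequences` below is a hypothetical stand-in for `texts_to_sequences`, not the Keras implementation; it assumes the word index printed earlier and drops every word whose index is `num_words` or higher.

```python
def to_sequences(texts, word_index, num_words=None):
    """Encode texts as index lists, applying the num_words limit (simplified sketch)."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue  # word never seen during fitting: dropped
            if num_words is not None and idx >= num_words:
                continue  # word is too rare for the num_words limit: dropped
            seq.append(idx)
        sequences.append(seq)
    return sequences

word_index = {'i': 1, 'love': 2, 'my': 3, 'cat': 4, 'dog': 5}
sentences = ['I love my cat',
             'I love my dog']

full = to_sequences(sentences, word_index, num_words=100)
trimmed = to_sequences(sentences, word_index, num_words=3)
print(full)     # [[1, 2, 3, 4], [1, 2, 3, 5]]
print(trimmed)  # [[1, 2], [1, 2]]
```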

### New Vocabs

In [14]:
sentences = ['I love my cat',
             'I love my dog']

tokenizer = Tokenizer(num_words=100)
# num_words: vocabulary cap; only the (num_words - 1) most frequent words are kept in sequences
tokenizer.fit_on_texts(sentences)

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[1, 2, 3, 4], [1, 2, 3, 5]]


In [16]:
sentences_2 = ['You hate your cat',
               'You hate your dog']

sequences = tokenizer.texts_to_sequences(sentences_2)
print(sequences)

[[4], [5]]

Words that are not in the fitted vocabulary (`you`, `hate`, `your`) are silently dropped, so only `cat` and `dog` remain.


## Out of Vocabulary (OOV)

Instead of being dropped, words that were not seen during fitting can be mapped to a special out-of-vocabulary (OOV) token.

In [17]:
sentences = ['I love my cat',
             'I love my dog']

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
# num_words: vocabulary cap; only the (num_words - 1) most frequent words are kept in sequences
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'i': 2, 'love': 3, 'my': 4, 'cat': 5, 'dog': 6}


When `oov_token` is set, it is always assigned index `1`, and every other word shifts up by one.

In [18]:
sentences_2 = ['You hate your cat',
               'You hate your dog']

sequences = tokenizer.texts_to_sequences(sentences_2)
print(sequences)

[[1, 1, 1, 5], [1, 1, 1, 6]]


Unseen words (`you`, `hate`, `your`) are now encoded as index `1` (the OOV token) instead of being dropped.
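
A rough sketch of the OOV substitution follows. The function name `to_sequences_with_oov` and the token string `'<OOV>'` are illustrative choices (the token text itself is arbitrary), and the code is a simplification rather than the Keras implementation. It also shows how the OOV token interacts with `num_words`: known words that fall outside the limit are replaced by the OOV index as well.

```python
def to_sequences_with_oov(texts, word_index, oov_token, num_words=None):
    """Encode texts, replacing unknown or out-of-limit words with the OOV index."""
    oov_index = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            # Unseen word, or a known word beyond the num_words limit
            if idx is None or (num_words is not None and idx >= num_words):
                idx = oov_index
            seq.append(idx)
        sequences.append(seq)
    return sequences

oov_token = '<OOV>'  # illustrative token string
word_index = {oov_token: 1, 'i': 2, 'love': 3, 'my': 4, 'cat': 5, 'dog': 6}

unseen = to_sequences_with_oov(['You hate your cat', 'You hate your dog'],
                               word_index, oov_token)
print(unseen)   # [[1, 1, 1, 5], [1, 1, 1, 6]]

limited = to_sequences_with_oov(['I love my cat'], word_index, oov_token,
                                num_words=5)
print(limited)  # [[2, 3, 4, 1]]
```

Since the sequences keep their original length, the position of each unknown word is preserved for the model.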

# Padding and Truncating

Padding and truncating make every sequence the same length, so the model always receives inputs of the same shape.

## Padding

In [36]:
sentences = ['I love my cat',
             'I love my dog',
             'I love my cat cat cat']

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
# num_words: vocabulary cap; only the (num_words - 1) most frequent words are kept in sequences
tokenizer.fit_on_texts(sentences)

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[3, 4, 5, 2], [3, 4, 5, 6], [3, 4, 5, 2, 2, 2]]


In [37]:
padded = pad_sequences(sequences)
print(padded)

[[0 0 3 4 5 2]
 [0 0 3 4 5 6]
 [3 4 5 2 2 2]]


## Post Padding

In [38]:
padded = pad_sequences(sequences, padding='post')
print(padded)

[[3 4 5 2 0 0]
 [3 4 5 6 0 0]
 [3 4 5 2 2 2]]
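
The two padding modes can be sketched in plain Python. The helper `pad` below is hypothetical and covers only the padding step: it fills each sequence with zeros up to the length of the longest one, either at the front (`'pre'`, the default shown in the first example) or at the back (`'post'`). Unlike `pad_sequences`, it returns nested lists rather than a NumPy array.

```python
def pad(sequences, padding='pre', value=0):
    """Pad every sequence to the length of the longest one (simplified sketch)."""
    maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill before the tokens, 'post' puts it after
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded

sequences = [[3, 4, 5, 2], [3, 4, 5, 6], [3, 4, 5, 2, 2, 2]]

pre_padded = pad(sequences)
post_padded = pad(sequences, padding='post')
print(pre_padded)   # [[0, 0, 3, 4, 5, 2], [0, 0, 3, 4, 5, 6], [3, 4, 5, 2, 2, 2]]
print(post_padded)  # [[3, 4, 5, 2, 0, 0], [3, 4, 5, 6, 0, 0], [3, 4, 5, 2, 2, 2]]
```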


## Truncating

In [43]:
truncated = pad_sequences(sequences, maxlen=5)
print(truncated)

[[0 3 4 5 2]
 [0 3 4 5 6]
 [4 5 2 2 2]]


## Post Truncating

In [44]:
truncated = pad_sequences(sequences, maxlen=5, truncating='post')
print(truncated)

[[0 3 4 5 2]
 [0 3 4 5 6]
 [3 4 5 2 2]]
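
In both truncating examples, the first two rows are identical because those sequences are shorter than `maxlen` and are only padded. Only the third row, which has six tokens, is cut. The sketch below shows one way to express this in plain Python; `pad_and_truncate` is a hypothetical helper, not the Keras implementation, and it returns nested lists rather than a NumPy array.

```python
def pad_and_truncate(sequences, maxlen, padding='pre', truncating='pre', value=0):
    """Force every sequence to exactly maxlen items (simplified sketch)."""
    result = []
    for seq in sequences:
        if len(seq) > maxlen:
            # 'pre' drops items from the front, 'post' drops them from the back
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == 'pre' else seq + fill)
    return result

sequences = [[3, 4, 5, 2], [3, 4, 5, 6], [3, 4, 5, 2, 2, 2]]

pre_truncated = pad_and_truncate(sequences, maxlen=5)
post_truncated = pad_and_truncate(sequences, maxlen=5, truncating='post')
print(pre_truncated)   # [[0, 3, 4, 5, 2], [0, 3, 4, 5, 6], [4, 5, 2, 2, 2]]
print(post_truncated)  # [[0, 3, 4, 5, 2], [0, 3, 4, 5, 6], [3, 4, 5, 2, 2]]
```

Padding and truncating are independent settings, so `padding='post'` can be combined with `truncating='post'` to keep the start of each sentence and put the zeros at the end.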
