## Encoding Language to Numbers

- The essence of NLP lies in efficiently converting letters / words into numbers such that each letter / word is uniquely identified.
- This process is called **`Tokenization`**.
- **Tokenization** can convert `letters / words to numbers` and `words / sentences to sequences`.

## Tokenization

In [1]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

# Initialising the Tokenizer
tokenizer = Tokenizer(num_words=100)

In [2]:
# Set of sentences for Tokenization
sentences = [
    "Lewis Hamilton is a legend!. He has won '8 Formula-1 World Championships' in a row at Mercedes",
    "Sebastian Vettel is a legend!. He has won '4 Formula-1 World Championships' in a row at Red Bull",
    "Lewis and Seb is really good friends",
    "Roger Federer is a legend!. He has won '5 Wimbledon and 5 US Open titles' in a row",
    "Rafael Nadal is a legend!. He has won '14 Roland Garros titles' an unreal feat",
    "Roger and Rafa are really good friends"
]

In [3]:
# Fitting the Tokenizer to the Sentences
tokenizer.fit_on_texts(sentences)

# Retrieving the encoded word index
word_index = tokenizer.word_index
print(word_index)

{'a': 1, 'is': 2, 'legend': 3, 'he': 4, 'has': 5, 'won': 6, 'in': 7, 'row': 8, 'and': 9, 'lewis': 10, 'formula': 11, '1': 12, 'world': 13, "championships'": 14, 'at': 15, 'really': 16, 'good': 17, 'friends': 18, 'roger': 19, "titles'": 20, 'hamilton': 21, "'8": 22, 'mercedes': 23, 'sebastian': 24, 'vettel': 25, "'4": 26, 'red': 27, 'bull': 28, 'seb': 29, 'federer': 30, "'5": 31, 'wimbledon': 32, '5': 33, 'us': 34, 'open': 35, 'rafael': 36, 'nadal': 37, "'14": 38, 'roland': 39, 'garros': 40, 'an': 41, 'unreal': 42, 'feat': 43, 'rafa': 44, 'are': 45}


**Inference**
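- All words are converted to lower case.
- All punctuation except the apostrophe is removed, which is why tokens such as `'8` and `titles'` keep their quote marks, while `Formula-1` is split into `formula` and `1`.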
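
The behaviour noted above can be reproduced without TensorFlow. The following is a minimal sketch, not the Keras implementation: `build_word_index` and `FILTERS` are hypothetical names, and the filter string is assumed to mirror the Tokenizer's default `filters` argument. It lower-cases the text, replaces the filtered characters with spaces, and ranks the words by frequency.

```python
from collections import Counter

# Assumed to mirror the Tokenizer's default `filters`: punctuation, tab and
# newline, but not the apostrophe.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    to_space = str.maketrans({ch: " " for ch in FILTERS})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(to_space).split())
    # most_common() keeps first-seen order among words with equal counts
    return {
        word: rank
        for rank, (word, _) in enumerate(counts.most_common(), start=1)
    }


demo_index = build_word_index([
    "Lewis and Seb is really good friends",
    "Roger and Rafa are really good friends",
])
print(demo_index)

# The hyphen is filtered (so the word splits), the apostrophes are kept
print(build_word_index(["Formula-1 'titles'"]))
```

Words that appear in both sentences (`and`, `really`, `good`, `friends`) receive the lowest indexes, which matches the frequency ordering visible in the word index printed above.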

## Converting Sentences into Sequences

- Once the vocabulary has been built by fitting the tokenizer, we can convert the sentences into sequences of word indexes.

In [4]:
sequences = tokenizer.texts_to_sequences(sentences)

# Displaying the Sequences
for i in sequences:
    print(i)

[10, 21, 2, 1, 3, 4, 5, 6, 22, 11, 12, 13, 14, 7, 1, 8, 15, 23]
[24, 25, 2, 1, 3, 4, 5, 6, 26, 11, 12, 13, 14, 7, 1, 8, 15, 27, 28]
[10, 9, 29, 2, 16, 17, 18]
[19, 30, 2, 1, 3, 4, 5, 6, 31, 32, 9, 33, 34, 35, 20, 7, 1, 8]
[36, 37, 2, 1, 3, 4, 5, 6, 38, 39, 40, 20, 41, 42, 43]
[19, 9, 44, 45, 16, 17, 18]


**Inference**
- A good encoding should be able to highlight sentences with similar meaning.
- Similar meaning sentences --> similar words --> similar encodings --> similar sequences.

**OOV Token**
- The **`OOV Token`** (**`Out of Vocabulary Token`**) is used by the Tokenizer to encode words / letters that are not in the vocabulary it learnt during fitting.

In [5]:
new_sentences = [
    "I had the time of my life in secondary school",
    "Scotland is a beautiful place, loving everyday"
]

# Encoding the New Sentences
new_sequences = tokenizer.texts_to_sequences(new_sentences)

print("Encoding without OOV Token: ")
for i in new_sequences:
    print(i)

Encoding without OOV Token: 
[7]
[2, 1]


**Inference**
- Since the Tokenizer hasn't learnt most of the words in the above sentences, it encodes only the known words and silently drops the rest.
- This changes the context of the sentences.

**Using the OOV Token**
- The **`OOV Token`** can be passed to the Tokenizer as a parameter (`oov_token`).
- It should be a unique string which won't occur in the text itself.
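
The difference between dropping unknown words and mapping them to an OOV index can be sketched in plain Python. This is an illustration under stated assumptions, not the Tokenizer's own code: `to_sequence` is a hypothetical helper and `toy_index` is a made-up vocabulary, not the one learnt above.

```python
def to_sequence(text, word_index, oov_token=None):
    """Encode known words; map unknown words to the OOV index if one is set."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word, oov_index)
        if index is not None:
            sequence.append(index)
    return sequence


# Hypothetical toy vocabulary for illustration only
toy_index = {"<OOV>": 1, "a": 2, "is": 3, "in": 4}
text = "Scotland is a beautiful place"

without_oov = to_sequence(text, toy_index)
with_oov = to_sequence(text, toy_index, oov_token="<OOV>")
print(without_oov)  # unknown words are dropped
print(with_oov)     # unknown words become index 1
```

With the OOV index in place, the encoded sequence has one entry per word, so the position of each known word in the sentence is preserved.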

In [6]:
# Declaring the Tokenizer with the OOV Token
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

# Fitting the Tokenizer
tokenizer.fit_on_texts(sentences)
word_idx = tokenizer.word_index
print("This is the word index as learnt by the Tokenizer:\n", word_idx)

This is the word index as learnt by the Tokenizer:
 {'<OOV>': 1, 'a': 2, 'is': 3, 'legend': 4, 'he': 5, 'has': 6, 'won': 7, 'in': 8, 'row': 9, 'and': 10, 'lewis': 11, 'formula': 12, '1': 13, 'world': 14, "championships'": 15, 'at': 16, 'really': 17, 'good': 18, 'friends': 19, 'roger': 20, "titles'": 21, 'hamilton': 22, "'8": 23, 'mercedes': 24, 'sebastian': 25, 'vettel': 26, "'4": 27, 'red': 28, 'bull': 29, 'seb': 30, 'federer': 31, "'5": 32, 'wimbledon': 33, '5': 34, 'us': 35, 'open': 36, 'rafael': 37, 'nadal': 38, "'14": 39, 'roland': 40, 'garros': 41, 'an': 42, 'unreal': 43, 'feat': 44, 'rafa': 45, 'are': 46}


**Inference**
- When an `OOV Token` is set, it is added to the word index at position 1, so every other word shifts up by one.

In [7]:
print("First Sequences after Encoding:\n")
sequences = tokenizer.texts_to_sequences(sentences)
for i in sequences:
    print(i)
    
print("\nNew Sequences after Encoding:\n")
new_sequences = tokenizer.texts_to_sequences(new_sentences)
for i in new_sequences:
    print(i)

First Sequences after Encoding:

[11, 22, 3, 2, 4, 5, 6, 7, 23, 12, 13, 14, 15, 8, 2, 9, 16, 24]
[25, 26, 3, 2, 4, 5, 6, 7, 27, 12, 13, 14, 15, 8, 2, 9, 16, 28, 29]
[11, 10, 30, 3, 17, 18, 19]
[20, 31, 3, 2, 4, 5, 6, 7, 32, 33, 10, 34, 35, 36, 21, 8, 2, 9]
[37, 38, 3, 2, 4, 5, 6, 7, 39, 40, 41, 21, 42, 43, 44]
[20, 10, 45, 46, 17, 18, 19]

New Sequences after Encoding:

[1, 1, 1, 1, 1, 1, 1, 8, 1, 1]
[1, 3, 2, 1, 1, 1, 1]


**Inference**
- The new sentences now contain many `OOV Tokens`, but each sequence retains the length of its sentence.
- **This indicates that we need a larger corpus with better word coverage for the new sentences.**

## Padding

- Padding converts the sequences into a standard length, much like the `target-size` parameter standardises image dimensions for `CNNs`.
- The sequences above contain 4 long sentences (15 to 19 tokens) and 2 short ones (7 tokens each), so their lengths are not uniform.
- This leaves scope for padding.

In [8]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def_pad_seq = pad_sequences(sequences)
print("The default padding:\n")
for i in def_pad_seq:
    print(i)

The default padding:

[ 0 11 22  3  2  4  5  6  7 23 12 13 14 15  8  2  9 16 24]
[25 26  3  2  4  5  6  7 27 12 13 14 15  8  2  9 16 28 29]
[ 0  0  0  0  0  0  0  0  0  0  0  0 11 10 30  3 17 18 19]
[ 0 20 31  3  2  4  5  6  7 32 33 10 34 35 36 21  8  2  9]
[ 0  0  0  0 37 38  3  2  4  5  6  7 39 40 41 21 42 43 44]
[ 0  0  0  0  0  0  0  0  0  0  0  0 20 10 45 46 17 18 19]


**Pre Padding**
- By default, the 0's are added before the beginning of each sequence, as in the output above.


**Post Padding**
- Here the 0's are added after the end of each sequence.

In [9]:
post_pad_seq = pad_sequences(sequences, padding="post")
print("The sequences after Post Padding:\n\n")
for i in post_pad_seq:
    print(i)

The sequences after Post Padding:


[11 22  3  2  4  5  6  7 23 12 13 14 15  8  2  9 16 24  0]
[25 26  3  2  4  5  6  7 27 12 13 14 15  8  2  9 16 28 29]
[11 10 30  3 17 18 19  0  0  0  0  0  0  0  0  0  0  0  0]
[20 31  3  2  4  5  6  7 32 33 10 34 35 36 21  8  2  9  0]
[37 38  3  2  4  5  6  7 39 40 41 21 42 43 44  0  0  0  0]
[20 10 45 46 17 18 19  0  0  0  0  0  0  0  0  0  0  0  0]


**MaxLen**
- If most sentences are close to the mean length and only a few are much longer, we can standardise the length of every encoded sequence with the `maxlen` parameter.
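
One way to pick such a length is to take the sequence length that covers a chosen share of the data. The sketch below is only one possible heuristic: `choose_maxlen` and its `coverage` parameter are hypothetical and not part of Keras, and the demo lengths are those of the six encoded sentences above.

```python
import math


def choose_maxlen(sequences, coverage=0.95):
    """Smallest length that fits at least `coverage` of the sequences untruncated."""
    lengths = sorted(len(seq) for seq in sequences)
    position = max(0, math.ceil(coverage * len(lengths)) - 1)
    return lengths[position]


# Dummy sequences with the same lengths as the six encoded sentences
demo_sequences = [[0] * n for n in (18, 19, 7, 18, 15, 7)]
print(choose_maxlen(demo_sequences, coverage=0.95))
print(choose_maxlen(demo_sequences, coverage=0.5))
```

A lower `coverage` gives shorter, cheaper inputs at the cost of truncating more sentences.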

In [10]:
maxlen_seq = pad_sequences(sequences, padding="post", maxlen=12)
print("The sequences after Post Padding and Max Len:\n")
for i in maxlen_seq:
    print(i)

The sequences after Post Padding and Max Len:

[ 6  7 23 12 13 14 15  8  2  9 16 24]
[ 7 27 12 13 14 15  8  2  9 16 28 29]
[11 10 30  3 17 18 19  0  0  0  0  0]
[ 6  7 32 33 10 34 35 36 21  8  2  9]
[ 2  4  5  6  7 39 40 41 21 42 43 44]
[20 10 45 46 17 18 19  0  0  0  0  0]


**Inference**
- The `maxlen` parameter has unified the length of the sequences, but longer sequences were truncated from the beginning (the default, `truncating="pre"`).
- This can be changed to truncate from the end with `truncating="post"`.

In [11]:
trunc_seq = pad_sequences(sequences, padding="post", maxlen=12, truncating="post")
print("The sequences after Post Padding, Max Len and Post Truncation:\n")
for i in trunc_seq:
    print(i)

The sequences after Post Padding, Max Len and Post Truncation:

[11 22  3  2  4  5  6  7 23 12 13 14]
[25 26  3  2  4  5  6  7 27 12 13 14]
[11 10 30  3 17 18 19  0  0  0  0  0]
[20 31  3  2  4  5  6  7 32 33 10 34]
[37 38  3  2  4  5  6  7 39 40 41 21]
[20 10 45 46 17 18 19  0  0  0  0  0]
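
The `pad_sequences` variants above differ only in where the zeros are added and which end is cut. A pure-Python sketch of that logic follows, assuming the defaults seen in the outputs (`padding="pre"`, `truncating="pre"`). `pad` is a hypothetical helper that returns plain lists, not the Keras function.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Return equal-length lists, padded with `value` and cut to `maxlen`."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" drops tokens from the start, "post" drops them from the end
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


toy = [[1, 2, 3], [4]]
print(pad(toy))
print(pad(toy, maxlen=2, padding="post"))
print(pad(toy, maxlen=2, padding="post", truncating="post"))
```

Which end to truncate depends on where the informative words tend to sit; for reviews or headlines that open with the main point, cutting from the end keeps more of the signal.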


## Cleaning Text and Stopwords

**Stopwords**
- A list of very common words (such as `a`, `is`, `the`) that carry little meaning on their own and are filtered out so that the remaining words better reflect the meaning of the sentence.

In [12]:
stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at",
             "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do",
             "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having",
             "he", "hed", "hes", "her", "here", "heres", "hers", "herself", "him", "himself", "his", "how",
             "hows", "i", "id", "ill", "im", "ive", "if", "in", "into", "is", "it", "its", "itself",
             "lets", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought",
             "our", "ours", "ourselves", "out", "over", "own", "same", "she", "shed", "shell", "shes", "should",
             "so", "some", "such", "than", "that", "thats", "the", "their", "theirs", "them", "themselves", "then",
             "there", "theres", "these", "they", "theyd", "theyll", "theyre", "theyve", "this", "those", "through",
             "to", "too", "under", "until", "up", "very", "was", "we", "wed", "well", "were", "weve", "were",
             "what", "whats", "when", "whens", "where", "wheres", "which", "while", "who", "whos", "whom", "why",
             "whys", "with", "would", "you", "youd", "youll", "youre", "youve", "your", "yours", "yourself",
             "yourselves"]

In [13]:
# Displaying all the sentences before filtering
print("The sentences before filtering stopwords:\n")
for sentence in sentences:
    print(sentence)

# List of the Filtered Sentences
filtered_sentences = []

# Filtering the Sentence for Stopwords
for sentence in sentences:
    
    # Converting a Sentence into list of Words
    words = sentence.split()
    filtered_sentence = ""
    
    # Checking for the Stopwords
    for word in words:
        if word not in stopwords:
            filtered_sentence = filtered_sentence + word + " "
    
    filtered_sentences.append(filtered_sentence)
    
print("\n\nThe sentences after filtering stopwords:\n")
for sentence in filtered_sentences:
    print(sentence)

The sentences before filtering stopwords:

Lewis Hamilton is a legend!. He has won '8 Formula-1 World Championships' in a row at Mercedes
Sebastian Vettel is a legend!. He has won '4 Formula-1 World Championships' in a row at Red Bull
Lewis and Seb is really good friends
Roger Federer is a legend!. He has won '5 Wimbledon and 5 US Open titles' in a row
Rafael Nadal is a legend!. He has won '14 Roland Garros titles' an unreal feat
Roger and Rafa are really good friends


The sentences after filtering stopwords:

Lewis Hamilton legend!. He won '8 Formula-1 World Championships' row Mercedes 
Sebastian Vettel legend!. He won '4 Formula-1 World Championships' row Red Bull 
Lewis Seb really good friends 
Roger Federer legend!. He won '5 Wimbledon 5 US Open titles' row 
Rafael Nadal legend!. He won '14 Roland Garros titles' unreal feat 
Roger Rafa really good friends 
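
In the output above, `He` survives the filter even though `he` is in the stopword list, because the membership test is case-sensitive and runs on the original casing. A minimal sketch of a case-insensitive variant follows. `remove_stopwords` and `demo_stopwords` are hypothetical names, and the stopword set is a small subset chosen for illustration; applying this in the notebook would change the later outputs.

```python
demo_stopwords = {"is", "a", "he", "has"}


def remove_stopwords(sentence, stopwords):
    """Drop stopwords regardless of casing, keeping the other words unchanged."""
    kept = [word for word in sentence.split() if word.lower() not in stopwords]
    return " ".join(kept)


cleaned = remove_stopwords("Lewis Hamilton is a legend!. He has won", demo_stopwords)
print(cleaned)
```

Using a `set` for the stopwords also makes each lookup constant-time, which matters once the corpus grows to thousands of reviews.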


**Removing Punctuation**

In [14]:
import string

print("The sentences before filtering stopwords and punctuation:\n")
for sentence in sentences:
    print(sentence)

# Building the translation table
# The first two arguments map characters one-to-one (both are empty here, so nothing is replaced);
# the third argument lists the characters to delete, i.e. all ASCII punctuation.
translation_table = str.maketrans("", "", string.punctuation)

filtered_sentences = []
for sentence in sentences:
    words = sentence.split()
    filtered_sentence = ""
    
    # Making the Translations for each word and filtering the stopwords
    for word in words:
        word = word.translate(translation_table)
        if word not in stopwords:
            filtered_sentence = filtered_sentence + word + " "
            
    filtered_sentences.append(filtered_sentence)
    
print("\n\nThe sentences after filtering stopwords and punctuation:\n")
for sentence in filtered_sentences:
    print(sentence)

The sentences before filtering stopwords and punctuation:

Lewis Hamilton is a legend!. He has won '8 Formula-1 World Championships' in a row at Mercedes
Sebastian Vettel is a legend!. He has won '4 Formula-1 World Championships' in a row at Red Bull
Lewis and Seb is really good friends
Roger Federer is a legend!. He has won '5 Wimbledon and 5 US Open titles' in a row
Rafael Nadal is a legend!. He has won '14 Roland Garros titles' an unreal feat
Roger and Rafa are really good friends


The sentences after filtering stopwords and punctuation:

Lewis Hamilton legend He won 8 Formula1 World Championships row Mercedes 
Sebastian Vettel legend He won 4 Formula1 World Championships row Red Bull 
Lewis Seb really good friends 
Roger Federer legend He won 5 Wimbledon 5 US Open titles row 
Rafael Nadal legend He won 14 Roland Garros titles unreal feat 
Roger Rafa really good friends 
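
The role of each `str.maketrans` argument is easiest to see on single words. The short example below uses only the standard library; the names `delete_punct` and `hyphen_to_space` are illustrative.

```python
import string

# Third argument: characters to delete. The first two (empty) define no replacements.
delete_punct = str.maketrans("", "", string.punctuation)

# Two equal-length arguments: a one-to-one mapping, here hyphen -> space.
hyphen_to_space = str.maketrans("-", " ")

print("legend!.".translate(delete_punct))
print("Formula-1".translate(delete_punct))
print("Formula-1".translate(hyphen_to_space))
```

Deleting the hyphen fuses `Formula-1` into `Formula1`, whereas the Tokenizer on its own splits it into `formula` and `1`; mapping the hyphen to a space first would keep the two behaviours aligned.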


**Encoding the Fully Filtered Sentences**

In [15]:
# Initialisation
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

# Fitting
tokenizer.fit_on_texts(filtered_sentences)
print("The Word Index learnt by the Tokenizer:\n\n", tokenizer.word_index)

# Encoding
sequences = tokenizer.texts_to_sequences(filtered_sentences)
print("\n\nThe sequences after encoding are:\n")
for seq in sequences:
    print(seq)

The Word Index learnt by the Tokenizer:

 {'<OOV>': 1, 'legend': 2, 'he': 3, 'won': 4, 'row': 5, 'lewis': 6, 'formula1': 7, 'world': 8, 'championships': 9, 'really': 10, 'good': 11, 'friends': 12, 'roger': 13, '5': 14, 'titles': 15, 'hamilton': 16, '8': 17, 'mercedes': 18, 'sebastian': 19, 'vettel': 20, '4': 21, 'red': 22, 'bull': 23, 'seb': 24, 'federer': 25, 'wimbledon': 26, 'us': 27, 'open': 28, 'rafael': 29, 'nadal': 30, '14': 31, 'roland': 32, 'garros': 33, 'unreal': 34, 'feat': 35, 'rafa': 36}


The sequences after encoding are:

[6, 16, 2, 3, 4, 17, 7, 8, 9, 5, 18]
[19, 20, 2, 3, 4, 21, 7, 8, 9, 5, 22, 23]
[6, 24, 10, 11, 12]
[13, 25, 2, 3, 4, 14, 26, 14, 27, 28, 15, 5]
[29, 30, 2, 3, 4, 31, 32, 33, 15, 34, 35]
[13, 36, 10, 11, 12]


## Working with Real World Data Sources

**IMDB Dataset**
- Loading a dataset from TFDS with imdb movie reviews

In [16]:
import tensorflow_datasets as tfds

# Loading the Dataset
imdb = tfds.as_numpy(tfds.load("imdb_reviews", split="train"))
print(imdb)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.
<tensorflow_datasets.core.dataset_utils._IterableDataset object at 0x79bbb276c460>


In [17]:
list_of_reviews = []

# Storing all the reviews
for item in imdb:
    list_of_reviews.append(str(item["text"]))
    
print("Length: ", len(list_of_reviews))
print("First Item: ", list_of_reviews[0])

Length:  25000
First Item:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
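
The first item above starts with `b"` because `str()` on a bytes object returns its printable representation rather than the decoded text, so the stray `b` and the quote marks become part of what the tokenizer sees. A minimal illustration with a shortened, hypothetical review string:

```python
raw = b"This was an absolutely terrible movie."

as_repr = str(raw)             # printable representation of the bytes object
as_text = raw.decode("utf-8")  # the actual text

print(as_repr)
print(as_text)
```

Decoding with `.decode("utf-8")` avoids the wrapper, which is the approach the filtering step later in this notebook takes.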


In [18]:
# Creating the Movie Tokenizer
movie_tokenizer = Tokenizer(num_words=100000, oov_token="<OOV>")

# Fitting the Sentences
movie_tokenizer.fit_on_texts(list_of_reviews)

# Viewing the Word Index
print(len(movie_tokenizer.word_index))

86539


In [19]:
# Converting the Sentences to Sequences
sequences = movie_tokenizer.texts_to_sequences(list_of_reviews)

for i in sequences[:5]:
    print(i, end="\n\n")

[59, 12, 14, 35, 439, 400, 18, 174, 29, 10624, 9, 33, 1378, 3401, 42, 496, 11109, 197, 25, 88, 156, 19, 12, 211, 340, 29, 70, 248, 213, 9, 486, 62, 70, 88, 116, 99, 24, 5740, 12, 3317, 657, 777, 12, 18, 7, 35, 406, 8228, 178, 2477, 426, 2, 92, 1253, 140, 72, 149, 55, 2, 30181, 7525, 72, 229, 70, 2962, 16, 20482, 2880, 20483, 18416, 1506, 4998, 3, 40, 3947, 119, 1608, 17, 3401, 14, 163, 19, 4, 1253, 927, 7986, 9, 4, 18, 13, 14, 4200, 5, 102, 148, 1237, 11, 240, 692, 13, 44, 25, 101, 39, 12, 7232, 10374, 39, 1378, 25011, 52, 409, 11, 99, 1214, 874, 145, 10]

[256, 28, 78, 585, 6, 815, 2383, 317, 109, 19, 12, 7, 643, 696, 6, 4, 2249, 5, 183, 599, 68, 1483, 114, 2289, 3, 4005, 22, 2, 34225, 3, 263, 43, 4754, 4, 173, 190, 22, 12, 4126, 11, 1604, 2383, 87, 2, 20, 14, 1945, 2, 115, 950, 14, 1838, 1367, 563, 3, 365, 183, 477, 6, 602, 19, 17, 61, 1845, 5, 51, 14, 4090, 98, 42, 138, 11, 983, 11, 200, 28, 1059, 171, 5, 2, 20, 19, 11, 298, 2, 2182, 5, 10, 3, 285, 43, 477, 6, 602, 5, 94, 203, 30182

**Filtering the Reviews for Stopwords and Punctuation**

In [20]:
from bs4 import BeautifulSoup

# String Translation Table
table = str.maketrans("", "", string.punctuation)

# List of Filtered Reviews
filtered_reviews = []

for review in imdb:
    
    # Decoding the Dataset to a String of Lower Cases
    sentence = str(review["text"].decode("UTF-8").lower())
    sentence = sentence.replace(",", " , ")
    sentence = sentence.replace(".", " . ")
    sentence = sentence.replace("-", " - ")
    sentence = sentence.replace("/", " / ")
    
    # Removing all the HTML Tags
    soup = BeautifulSoup(sentence)
    sentence = soup.get_text()
    
    # Splitting the Sentence into words
    words = sentence.split()
    
    # Removing the Stopwords
    filtered_review = ""
    for word in words:
        word = word.translate(table)
        
        if word not in stopwords:
            filtered_review = filtered_review + word + " "
            
    filtered_reviews.append(filtered_review)


In [21]:
# Viewing the Filtered Sentence
for i in filtered_reviews[:5]:
    print(i)

absolutely terrible movie  dont lured christopher walken michael ironside  great actors  must simply worst role history  even great acting not redeem movies ridiculous storyline  movie early nineties us propaganda piece  pathetic scenes columbian rebels making cases revolutions  maria conchita alonso appeared phony  pseudo  love affair walken nothing pathetic emotional plug movie devoid real meaning  disappointed movies like  ruining actors like christopher walkens good name  barely sit  
known fall asleep films  usually due combination things including  really tired  warm comfortable sette just eaten lot  however occasion fell asleep film rubbish  plot development constant  constantly slow boring  things seemed happen  no explanation causing  admit  may missed part film  watched majority everything just seemed happen accord without real concern anything else  cant recommend film  
mann photographs alberta rocky mountains superb fashion  jimmy stewart walter brennan give enjoyable perf
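
The same cleaning steps recur for the CSV and JSON data later on, so they could be collected into one helper. The sketch below is an assumption-laden illustration rather than the notebook's method: `clean_text`, `strip_tags` and `PUNCT_TABLE` are hypothetical names, the standard-library `HTMLParser` stands in for BeautifulSoup, tags are stripped before the punctuation is spaced out so that `<br />` reaches the parser intact, and empty words are dropped instead of leaving double spaces.

```python
import string
from html.parser import HTMLParser

PUNCT_TABLE = str.maketrans("", "", string.punctuation)


class _TextCollector(HTMLParser):
    """Collects only the text between tags (standard-library stand-in for bs4)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(text):
    collector = _TextCollector()
    collector.feed(text)
    collector.close()
    return "".join(collector.parts)


def clean_text(text, stopwords):
    """Lower-case, drop tags, space out punctuation, then filter stopwords."""
    text = strip_tags(text.lower())
    for mark in ",.-/":
        text = text.replace(mark, f" {mark} ")
    words = (word.translate(PUNCT_TABLE) for word in text.split())
    return " ".join(word for word in words if word and word not in stopwords)


example = clean_text("Great movie.<br /><br />Don't miss it!", {"it"})
print(example)
```

Because the result differs slightly from the loop above (no double spaces, different step order), vocabulary sizes obtained with it may not match the numbers reported in this notebook.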

**Tokenizing the Sentences**

In [22]:
new_movie_tokenizer = Tokenizer(num_words=100000, oov_token="<OOV>")
new_movie_tokenizer.fit_on_texts(filtered_reviews)
print(len(new_movie_tokenizer.word_index))

86124


In [23]:
sequences = new_movie_tokenizer.texts_to_sequences(filtered_reviews)

for seq in sequences[:5]:
    print(seq, end="\n\n")

[316, 284, 2, 24, 10377, 1248, 3449, 378, 10859, 22, 64, 115, 222, 146, 118, 363, 11, 22, 40, 4, 5525, 26, 525, 643, 2, 293, 7884, 84, 2326, 304, 1091, 54, 29713, 7298, 130, 2756, 17390, 2814, 20243, 18246, 1349, 4746, 3801, 42, 1447, 3449, 71, 1091, 791, 7749, 2, 4036, 58, 1080, 560, 26, 6, 6915, 64, 6, 1248, 13679, 8, 294, 1061, 739]

[452, 677, 2238, 29, 510, 564, 2080, 87, 471, 13, 1322, 2113, 3829, 33602, 7, 4496, 82, 96, 3940, 1448, 2238, 3, 1779, 38, 812, 1682, 1215, 429, 250, 87, 351, 472, 9, 1687, 3913, 840, 105, 915, 80, 3, 188, 2021, 180, 7, 351, 472, 29714, 109, 58, 4282, 134, 226, 85, 276, 3]

[4247, 5958, 26985, 4664, 3830, 769, 1469, 1873, 1216, 2213, 8607, 104, 614, 247, 110, 198, 117, 252, 24738, 855, 21, 3851, 422, 14181, 26986, 11131, 308, 11131, 24739, 1034, 33603, 7885, 1797, 1007, 394, 71, 11, 2456, 7097, 456, 2131, 383, 3450, 21473, 1677, 3222, 327, 4247, 1032, 846, 3914, 3851, 422, 21474, 2131, 2270, 192, 1184, 1095, 2131, 608, 2691, 17391, 2846, 431, 614, 16573

**Using IMDB Subwords**

In [24]:
(train_data, test_data), info = tfds.load(
    name="imdb_reviews/subwords8k",
    split=(tfds.Split.TRAIN, tfds.Split.TEST),
    as_supervised=True,
    with_info=True
)

print(info)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0...
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0. Subsequent calls will reuse this data.
tfds.core.DatasetInfo(
    name='imdb_reviews',
    full_name='imdb_reviews/subwords8k/1.0.0',
    description="""
    Large Movie Review Dataset. This is a dataset for binary sentiment
    classification containing substantially more data than previous benchmark
    datasets. We provide a set of 25,000 highly polar movie reviews for training,
    and 25,000 for testing. There is additional unlabeled data for use as well.
    """,
    config_description="""
    Uses `tfds.deprecated.text.SubwordTextEncoder` with 8k vocab size
    """,
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    data_path=PosixGPath('/tmp/tmpvzyx_n11tfds'),
    file_format=tfrecord,
    download_size=80.23 MiB,
    dataset_size=54.72 MiB,
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
        'text': T

**Accessing the Encoder used to Encode the Corpus**

In [25]:
encoder = info.features["text"].encoder
print(f"The vocab size of the Encoder is: {encoder.vocab_size}")

The vocab size of the Encoder is: 8185


**Encoding a string**

In [26]:
encoded_string = encoder.encode("Rising to a wonderful day")
print("Encoded String is: {}".format(encoded_string))

Encoded String is: [1625, 1160, 7, 4, 650, 606]


In [27]:
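decoded_string = ""
for token in encoded_string:
    # NOTE: encoder ids start at 1 (0 is reserved for padding), so this lookup is
    # shifted by one position; encoder.subwords[token - 1] gives the actual piece.
    word = encoder.subwords[token]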
    decoded_string = decoded_string + word + " "
    print(word)
    
print("The string as encoded using all the subwords: ")
print(decoded_string)

writer
lle
s_
and_
production_
human_
The string as encoded using all the subwords: 
writer lle s_ and_ production_ human_ 


In [28]:
decoded_string = encoder.decode(encoded_string)
print(decoded_string)

Rising to a wonderful day


## Handling CSV Files

In [29]:
!wget --no-check-certificate --no-cache \
    https://storage.googleapis.com/learning-datasets/binary-emotion.csv \
    -O /tmp/binary-emotion.csv

--2024-02-01 17:05:24--  https://storage.googleapis.com/learning-datasets/binary-emotion.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 64.233.184.207, 74.125.206.207, 64.233.167.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|64.233.184.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2690504 (2.6M) [text/csv]
Saving to: '/tmp/binary-emotion.csv'


2024-02-01 17:05:25 (5.12 MB/s) - '/tmp/binary-emotion.csv' saved [2690504/2690504]



In [30]:
import csv

sentences = []
labels = []

with open("/tmp/binary-emotion.csv", encoding="UTF-8") as csvfile:
    reader = csv.reader(csvfile, delimiter=",")
    
    # Reading the CSV one row at a time
    for row in reader:
        labels.append(row[0])
        sentence = row[1].lower()
        
        # Replacing punctuation to add spaces
        sentence = sentence.replace(",", " , ")
        sentence = sentence.replace(".", " . ")
        sentence = sentence.replace("-", " - ")
        sentence = sentence.replace("/", " / ")
        
        # Removing HTML Tags
        soup = BeautifulSoup(sentence)
        sentence = soup.get_text()
        
        # Splitting into Words and filtering the words
        words = sentence.split()
        filtered_sentence = ""
        
        for word in words:
            word = word.translate(table)    
            if word not in stopwords:
                filtered_sentence = filtered_sentence + word + " "
                
        sentences.append(filtered_sentence)

print(f"No of Sentences Read: {len(sentences)}")


No of Sentences Read: 35327


## Creating a Training and Test Set

In [31]:
train_size = 27000

train_sentences = sentences[:train_size]
train_labels = labels[:train_size]
test_sentences = sentences[train_size:]
test_labels = labels[train_size:]
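
The slices above keep the file's original order. If the rows happen to be grouped (for example by label), a shuffled split may give more representative sets. This is a sketch under that assumption: `shuffled_split` and its `seed` parameter are hypothetical, and the toy data below is not from the CSV.

```python
import random


def shuffled_split(sentences, labels, train_size, seed=42):
    """Shuffle sentence/label pairs together, then slice into train and test."""
    pairs = list(zip(sentences, labels))
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the split reproducible
    train, test = pairs[:train_size], pairs[train_size:]
    train_x = [sentence for sentence, _ in train]
    train_y = [label for _, label in train]
    test_x = [sentence for sentence, _ in test]
    test_y = [label for _, label in test]
    return train_x, train_y, test_x, test_y


toy_sentences = [f"sentence {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
train_x, train_y, test_x, test_y = shuffled_split(toy_sentences, toy_labels, 8)
print(len(train_x), len(test_x))
```

Shuffling pairs rather than the two lists separately keeps every sentence attached to its own label.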

In [32]:
csv_tokenizer = Tokenizer(num_words=25000, oov_token="<OOV>")

# Fitting the Tokenizer to the Training Sentences
csv_tokenizer.fit_on_texts(train_sentences)

# Converting Sentences to Sequences
sequences = csv_tokenizer.texts_to_sequences(train_sentences)

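# Padding and truncating the sequences to a fixed length
# (named padded_sequences so the imported pad_sequences function is not overwritten)
padded_sequences = pad_sequences(
    sequences, maxlen=10, truncating="post", padding="post"
)

In [33]:
print("Original Training Sequence:\n", sequences[0])
print("Padded Training Sequence:\n", padded_sequences[0])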

Original Training Sequence:
 [19, 3521, 47, 4641, 621, 504, 921, 419]
Padded Training Sequence:
 [  19 3521   47 4641  621  504  921  419    0    0]


## Working with JSON Files

**Loading the Dataset**

In [34]:
!wget --no-check-certificate \
    https://storage.googleapis.com/learning-datasets/sarcasm.json \
    -O /tmp/sarcasm.json

--2024-02-01 17:05:38--  https://storage.googleapis.com/learning-datasets/sarcasm.json
Resolving storage.googleapis.com (storage.googleapis.com)... 64.233.184.207, 74.125.206.207, 64.233.167.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|64.233.184.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5643545 (5.4M) [application/json]
Saving to: '/tmp/sarcasm.json'


2024-02-01 17:05:39 (12.0 MB/s) - '/tmp/sarcasm.json' saved [5643545/5643545]



In [35]:
import json

with open("/tmp/sarcasm.json", "r") as jsonfile:
    loaded_file = json.load(jsonfile)
    
labels = []
sentences = []
links = []
for item in loaded_file:
    labels.append(item["is_sarcastic"])
    sentences.append(item["headline"])
    links.append(item["article_link"])
    
print("Labels:\n", labels[:5])
print("\n\nHeadlines:\n", sentences[:5])
print("\n\nLinks:\n", links[:5])

Labels:
 [0, 0, 1, 1, 0]


Headlines:
 ["former versace store clerk sues over secret 'black code' for minority shoppers", "the 'roseanne' revival catches up to our thorny political mood, for better and worse", "mom starting to fear son's web series closest thing she will have to grandchild", 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas', 'j.k. rowling wishes snape happy birthday in the most magical way']


Links:
 ['https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5', 'https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365', 'https://local.theonion.com/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697', 'https://politics.theonion.com/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302', 'https://www.huffingtonpost.com/entry/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb']


In [36]:
# Filtering the Data
with open("/tmp/sarcasm.json", "r") as jsonfile:
    loaded_file = json.load(jsonfile)
    
labels = []
sentences = []
links = []
for item in loaded_file:
    sentence = item["headline"].lower()
    sentence = sentence.replace(".", " . ")
    sentence = sentence.replace("-", " - ")
    sentence = sentence.replace("/", " / ")
    sentence = sentence.replace(",", " , ")
    
    soup = BeautifulSoup(sentence)
    sentence = soup.get_text()
    
    words = sentence.split()
    filtered_sentence = ""
    for word in words:
        word = word.translate(table)
        if word not in stopwords:
            filtered_sentence = filtered_sentence + word + " "
            
    labels.append(item["is_sarcastic"])
    sentences.append(filtered_sentence)
    links.append(item["article_link"])
    


In [37]:
for sentence in sentences[:5]:
    print(sentence)

former versace store clerk sues secret black code minority shoppers 
roseanne revival catches thorny political mood  better worse 
mom starting fear sons web series closest thing will grandchild 
boehner just wants wife listen  not come alternative debt  reduction ideas 
j  k  rowling wishes snape happy birthday magical way 


In [38]:
# Tokenizing the Cleaned Sentences
sarcasm_tokenizer = Tokenizer(num_words=100000)

# Fitting to the Text
sarcasm_tokenizer.fit_on_texts(sentences)
word_index = sarcasm_tokenizer.word_index

print("The Top 10 Words are:")
for word in list(word_index.keys())[:10]:
    print(word, "\t", word_index[word])

The Top 10 Words are:
new 	 1
trump 	 2
man 	 3
not 	 4
just 	 5
will 	 6
one 	 7
report 	 8
year 	 9
area 	 10


In [39]:
# Converting to Sequences
sarcasm_sequences = sarcasm_tokenizer.texts_to_sequences(sentences)
for seq in sarcasm_sequences[:10]:
    print(seq)

[227, 14300, 587, 3275, 2208, 286, 42, 2010, 2497, 8209]
[8210, 3276, 2670, 8211, 316, 2849, 172, 897]
[77, 754, 732, 1025, 2011, 496, 4625, 137, 6, 10376]
[1412, 5, 147, 310, 1603, 4, 234, 2850, 1305, 6810, 800]
[696, 707, 4626, 830, 10377, 475, 476, 1184, 43]
[10378, 276, 33]
[6811, 269, 372, 4187, 2102, 1360]
[388, 6, 1060, 79, 47, 86, 270]
[166, 3547, 6812, 464, 5149, 1923, 81]
[2012, 241, 263, 317, 23, 14301, 3809]
