Welcome to the third part of the NLP Concepts series. In the previous part, we saw how to build a text classification pipeline using machine learning (ML) algorithms. In this notebook, we will do the same thing with Feed Forward Neural Networks (FFN).

# Basic Text Classification

In [1]:
#import libraries
import tensorflow as tf
from tensorflow.keras import layers
import requests
import zipfile
import io
import tensorflow_datasets as tfds



In [2]:
#get the IMDB dataset from TensorFlow Datasets
train_dataset,test_dataset = tfds.load('imdb_reviews',as_supervised = True,with_info = False,split = ['train','test'],batch_size = 32)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In TensorFlow, the procedure is pretty straightforward. We can use `layers.TextVectorization()` to tokenize the words (it applies some preprocessing by default) and `layers.Embedding()` to build the embeddings. Before jumping into the code, we need to be aware of some tradeoffs:

Tradeoffs
---------

**Max Token Tradeoff:** Keeping every distinct word in the vocabulary lets the model handle much more complex texts, but this choice increases both space and time complexity. Alternatively, we can tokenize only the most frequent words and label all the others as `<OOV>` (Out of Vocabulary). This has the opposite effect: the model gets smaller and faster, but it loses the information carried by rare words.
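
To make this tradeoff concrete, here is a minimal, framework-free sketch of what capping the vocabulary does to a sentence. It is not how `layers.TextVectorization()` is implemented. The helper names (`build_vocab`, `encode`), the toy reviews, and the reserved ids (0 for padding, 1 for `<OOV>`) are assumptions made for illustration.

```python
from collections import Counter

PAD, OOV = 0, 1  # assumed reserved ids: 0 for padding, 1 for out-of-vocabulary words


def build_vocab(texts, max_tokens):
    """Map the most frequent words to ids; max_tokens counts the two reserved ids."""
    counts = Counter(word for text in texts for word in text.lower().split())
    kept = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: idx for idx, word in enumerate(kept, start=2)}


def encode(text, vocab):
    """Replace each word with its id, falling back to the OOV id."""
    return [vocab.get(word, OOV) for word in text.lower().split()]


# Toy corpus (hypothetical reviews, not taken from the IMDB dataset).
reviews = [
    "the movie was great",
    "the movie was terrible",
    "the acting was great",
]

small_vocab = build_vocab(reviews, max_tokens=4)   # room for only 2 real words
large_vocab = build_vocab(reviews, max_tokens=10)  # room for every word

# With the small vocabulary, rarer words such as "acting" collapse into OOV.
print(encode("the acting was great", small_vocab))
# With the large vocabulary, every word keeps its own id.
print(encode("the acting was great", large_vocab))
```

The small vocabulary needs fewer embedding rows, but "acting" and "great" become indistinguishable to the model.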


**Output Sequence Length Tradeoff:** Some sentences are long and some are short. However, the Embedding layer expects every input in a batch to have the same length. We can lengthen the short sequences by adding 0s (called padding), and we can truncate the long ones. In my experience, sequence lengths vary a lot, so padding everything to the longest sequence wastes computation, while truncating too aggressively throws text away. A good strategy to mitigate this problem is to pick a length that covers 95% of the sequences (this is not my method; I learned it from Daniel Bourke).
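
The 95% rule can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names (`coverage_length`, `pad_or_truncate`) and the toy sequences are hypothetical, and padding is added on the right with id 0.

```python
import math


def coverage_length(lengths, coverage_pct=95):
    """Smallest length L such that at least coverage_pct% of sequences have length <= L."""
    ordered = sorted(lengths)
    rank = math.ceil(coverage_pct * len(ordered) / 100)  # how many sequences must fit
    return ordered[max(rank, 1) - 1]


def pad_or_truncate(token_ids, length, pad_id=0):
    """Cut long sequences and right-pad short ones with pad_id."""
    return token_ids[:length] + [pad_id] * max(0, length - len(token_ids))


# Toy token-id sequences: 19 short reviews and one very long outlier.
sequences = [[5] * n for n in range(1, 20)] + [[5] * 500]

max_len = coverage_length([len(seq) for seq in sequences])
padded = [pad_or_truncate(seq, max_len) for seq in sequences]

print(max_len)                       # far below the 500-token outlier
print({len(seq) for seq in padded})  # every sequence now has the same length
```

Only the single outlier gets truncated, while the other sequences are padded to a length much shorter than 500.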


**Embedding Dimensions:** The more dimensions the embeddings have, the more complex texts the model can handle. But again, this increases time and space complexity.
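
A quick back-of-the-envelope calculation shows how the cost grows. An embedding table holds one vector per token id, so its size is vocabulary size times embedding dimension. The vocabulary size of 10,000 and the helper names below are assumptions for illustration only.

```python
def embedding_params(vocab_size, embedding_dim):
    """An embedding table stores one vector of size embedding_dim per token id."""
    return vocab_size * embedding_dim


def float32_megabytes(n_params):
    """Approximate memory for the weights alone, at 4 bytes per float32 value."""
    return n_params * 4 / 1e6


# Assumed vocabulary size of 10,000 tokens; try a few embedding dimensions.
for dim in (16, 64, 256):
    n = embedding_params(vocab_size=10_000, embedding_dim=dim)
    print(f"dim={dim:>3}: {n:>9,} parameters, about {float32_megabytes(n):.2f} MB")
```

The parameter count grows linearly with both the vocabulary size and the embedding dimension, so these two choices should be tuned together.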


**How to optimize these parameters?**

Personally, I start small and gradually increase the model complexity. During my trials, I save the training histories, and afterwards I plot all the models on TensorBoard to see what is working and what is not. Here is an example [project](https://github.com/egonos/Some-of-my-Data-Science-Work/blob/main/Projects%20with%20Tensorflow/Basics/NLP/Text%20Classification%20Example%20Project.ipynb) that I've done before.

**A small note:** This applies to most NN applications. The best strategy is often to try and see.



# Using Pretrained Embeddings

Using pretrained embeddings is another strategy for feature extraction in text classification. It often works pretty well. Here is an [illustration](https://github.com/egonos/Some-of-my-Data-Science-Work/blob/main/Projects%20with%20Tensorflow/Basics/NLP/Text%20Classification%20Using%20Glove.ipynb) of using GloVe embeddings. For more, you can check DeepLearning.AI's [NLP course](https://www.coursera.org/learn/natural-language-processing-tensorflow?specialization=tensorflow-in-practice).
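
As a rough idea of the mechanics, the sketch below turns GloVe-style text lines (a word followed by its vector values) into an embedding matrix indexed by token id. It is not the code from the linked notebook. The vectors are made up, and the helper names and the toy vocabulary are hypothetical. A matrix like this would typically be used to initialize the weights of an embedding layer.

```python
def parse_glove_lines(lines):
    """Parse 'word v1 v2 ...' lines (the plain-text GloVe layout) into a dict."""
    vectors = {}
    for line in lines:
        word, *values = line.split()
        vectors[word] = [float(v) for v in values]
    return vectors


def build_embedding_matrix(vocab, vectors, dim):
    """Row i holds the pretrained vector of the word with id i, or zeros if missing."""
    matrix = [[0.0] * dim for _ in range(max(vocab.values()) + 1)]
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Made-up 3-dimensional vectors standing in for a real GloVe file.
toy_glove = [
    "movie 0.1 0.2 0.3",
    "great 0.4 0.5 0.6",
]
# Hypothetical vocabulary; ids 0 and 1 are left for padding and OOV.
toy_vocab = {"movie": 2, "great": 3, "zzz": 4}

matrix = build_embedding_matrix(toy_vocab, parse_glove_lines(toy_glove), dim=3)
for row in matrix:
    print(row)
```

Words without a pretrained vector (here "zzz") keep an all-zero row, which is one simple way to handle them; random initialization is another.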

## Using Transfer Learning Models

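Another option is to use a whole pretrained model as the feature extractor and train only a small classifier on top of it. Let's build one with the Universal Sentence Encoder from TensorFlow Hub.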

In [3]:
import tensorflow_hub as hub
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

#convert to the keras layer
use = hub.KerasLayer(use,trainable = False)

In [4]:
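#the model takes raw strings, one review per example
inputs = layers.Input(shape = (), dtype = tf.string)
#the frozen encoder returns one fixed-size vector per review
x = use(inputs, training = False)
#the encoder output is already 2D, so Flatten leaves it unchanged
x = layers.Flatten()(x)
#single sigmoid unit for binary (positive/negative) classification
outputs = layers.Dense(1,activation = 'sigmoid')(x)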

model = tf.keras.Model(inputs,outputs)

#compile the model
model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = ['accuracy']
)

#fit the model
model.fit(train_dataset,
          validation_data = test_dataset,
          epochs = 5)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7a6bcf791600>