# Text Classification

We will now look at text classification using TensorFlow. This is a binary classification problem: each example must be assigned either a positive or a negative label. Problems of this type are commonly solved with an SVM (Support Vector Machine) or a similar classifier, where the aim is to draw a boundary that separates the data into its labels. Here we will use a neural network instead, trained on the IMDB movie review dataset. Let us now import our libraries and dependencies.


In [1]:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

In [2]:
#Checking if GPU is available for processing
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")


GPU is NOT AVAILABLE


Now we will download the IMDB dataset.


In [3]:
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews",
    # 60% of the training set for training, the remaining 40% for validation, plus the test set
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)  # yield (text, label) pairs
# When we download the data this way, it is shuffled and split automatically

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /home/aj/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...

Shuffling and writing examples to /home/aj/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteOEPA38/imdb_reviews-train.tfrecord

Shuffling and writing examples to /home/aj/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteOEPA38/imdb_reviews-test.tfrecord

Shuffling and writing examples to /home/aj/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteOEPA38/imdb_reviews-unsupervised.tfrecord

Dataset imdb_reviews downloaded and prepared to /home/aj/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


### Exploring the Data

OK, we've downloaded our data. Let us explore it a bit to understand what it actually contains.

In [6]:
next(iter(train_data)) #This is the first review

(<tf.Tensor: shape=(), dtype=string, numpy=b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.">,
 <tf.Tensor: shape=(), dtype=int64, numpy=0>)

We will now batch the data into groups of 10 and display the first batch of 10 reviews.

In [10]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
train_examples_batch

<tf.Tensor: shape=(10,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell 

It can be seen that the data consists of movie reviews stored as strings. Each review also has an associated label indicating whether it is negative (0) or positive (1). We will now build and train our model on this data. But first, let me brainstorm what we might need in order to classify these sentences.

First, we need some way of turning each string into a numerical representation. One common approach is to encode each sentence as a vector that marks which vocabulary words appear in it (one-hot style encoding over a bag-of-words vocabulary). Another is to use an RNN to transform each sentence into a representative vector, and then use these vectors for the classification.
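To make the first idea concrete, here is a minimal sketch of a bag-of-words encoding in plain Python. The three toy reviews and the helper name `multi_hot` are made up for illustration; a real pipeline would also handle punctuation, casing and a much larger vocabulary.

```python
# Toy corpus; these sentences are invented for illustration only.
reviews = [
    "a great movie",
    "a terrible movie",
    "great acting terrible plot",
]

# Build the vocabulary: one index per distinct word, in sorted order.
vocab = sorted({word for review in reviews for word in review.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}


def multi_hot(text):
    """Return a 0/1 vector marking which vocabulary words appear in text."""
    vector = [0] * len(vocab)
    for word in text.split():
        if word in word_to_index:  # words outside the vocabulary are ignored
            vector[word_to_index[word]] = 1
    return vector


encoded = [multi_hot(review) for review in reviews]
print(vocab)
for review, vector in zip(reviews, encoded):
    print(vector, review)
```

Every review becomes a vector as long as the vocabulary, so the representation grows with the number of distinct words and ignores word order.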

The approach taken in the tutorial is the second one, but instead of building our own RNN for this task, we use a pretrained model to convert the text into embedding vectors. This benefits us in three ways:
1. We don't need to worry about any text preprocessing
2. We benefit from **transfer learning**, which is the reuse of what the pretrained model has already learned about word embeddings to help our classification
3. The embeddings have a fixed size, regardless of the length of the text
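The fixed-size property in the last point can be illustrated with a small sketch. It averages made-up 3-dimensional word vectors into one sentence vector; the vectors, the `embed` helper and the averaging rule are all assumptions for illustration, and the pretrained module may combine word vectors differently.

```python
# Hypothetical 3-dimensional word vectors; a real model learns these values.
toy_vectors = {
    "great": [1.0, 0.0, 2.0],
    "movie": [0.0, 1.0, 0.0],
    "terrible": [-1.0, 0.0, -2.0],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for words outside the toy vocabulary


def embed(text):
    """Average the word vectors of text into one fixed-size vector."""
    words = text.lower().split()
    if not words:
        return list(UNKNOWN)
    vectors = [toy_vectors.get(word, UNKNOWN) for word in words]
    # zip(*vectors) groups the values dimension by dimension.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


short = embed("great movie")
longer = embed("terrible terrible movie great")
print(len(short), short)
print(len(longer), longer)
```

Both sentences map to a vector of length 3 even though they contain different numbers of words, which is what lets a dense layer with a fixed input size consume them.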

For our example, we will be using the <a href="https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1">google/tf2-preview/gnews-swivel-20dim/1</a> pre-trained text embedding model.

In [11]:
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

<tf.Tensor: shape=(3, 20), dtype=float32, numpy=
array([[ 1.765786  , -3.882232  ,  3.9134233 , -1.5557289 , -3.3362343 ,
        -1.7357955 , -1.9954445 ,  1.2989551 ,  5.081598  , -1.1041286 ,
        -2.0503852 , -0.72675157, -0.65675956,  0.24436149, -3.7208383 ,
         2.0954835 ,  2.2969332 , -2.0689783 , -2.9489717 , -1.1315987 ],
       [ 1.8804485 , -2.5852382 ,  3.4066997 ,  1.0982676 , -4.056685  ,
        -4.891284  , -2.785554  ,  1.3874227 ,  3.8476458 , -0.9256538 ,
        -1.896706  ,  1.2113281 ,  0.11474707,  0.76209456, -4.8791065 ,
         2.906149  ,  4.7087674 , -2.3652055 , -3.5015898 , -1.6390051 ],
       [ 0.71152234, -0.6353217 ,  1.7385626 , -1.1168286 , -0.5451594 ,
        -1.1808156 ,  0.09504455,  1.4653089 ,  0.66059524,  0.79308075,
        -2.2268345 ,  0.07446612, -1.4075904 , -0.70645386, -1.907037  ,
         1.4419787 ,  1.9551861 , -0.42660055, -2.8022065 ,  0.43727064]],
      dtype=float32)>


What we have done is convert each sentence into a fixed-length embedding vector: the three reviews above became three vectors of 20 values each. We can now build our full training model. Again, this is done in essentially the same way as before.


In [14]:
model = tf.keras.Sequential([
    hub_layer,
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.summary()
weights = model.get_weights()  # snapshot of the initial weights

def reset_model(model):
    """Restore the model to the initial weights captured above."""
    model.set_weights(weights)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 20)                400020    
_________________________________________________________________
dense_2 (Dense)              (None, 16)                336       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________
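The parameter counts in the summary can be checked by hand. The sketch below assumes the usual dense-layer count of one weight per input-unit pair plus one bias per unit; the embedding figure is copied from the summary rather than derived, and `dense_params` is a helper name chosen here.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


embedding_params = 400020           # taken directly from the summary above
hidden_params = dense_params(20, 16)  # 20-dim embedding -> 16 hidden units
output_params = dense_params(16, 1)   # 16 hidden units -> 1 output
total_params = embedding_params + hidden_params + output_params

print(hidden_params, output_params, total_params)  # 336 17 400373
```

Almost all trainable parameters sit in the embedding layer, which is trainable here because the hub layer was created with `trainable=True`.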


Now let us compile the model, selecting our loss function and optimizer.

In [15]:
model.compile(optimizer="adam", loss= tf.keras.losses.BinaryCrossentropy(from_logits= True), metrics=['accuracy'])

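Here, we are using binary cross-entropy loss. It measures how far the predicted probability of the positive class is from the true label, averaged over all examples. Because the final `Dense(1)` layer has no activation, the model outputs raw logits, and `from_logits=True` tells the loss to convert them to probabilities with a sigmoid first. The loss is given by:
$$Loss = -\frac{1}{N}\sum_{i=1}^N \left[ y_i\cdot\log(p(y_i)) + (1-y_i)\cdot\log(1-p(y_i)) \right]$$
where $y_i$ is the true label (0 or 1) and $p(y_i)$ is the predicted probability that example $i$ is positive.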

What we are essentially doing is penalizing confident wrong predictions heavily and correct ones only slightly. This is done by taking the negative log of the probability the model assigns to the true class. This concept is actually explained pretty well here: https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a
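A small worked example may help here. The sketch below computes binary cross-entropy from logits in plain Python on made-up labels and logits; it is a naive illustration and not the numerically stable routine a library would use, and the function names are chosen for this example.

```python
import math


def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy_from_logits(labels, logits):
    """Mean negative log-probability assigned to the true labels."""
    total = 0.0
    for y, z in zip(labels, logits):
        p = sigmoid(z)  # predicted probability of the positive class
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


labels = [1, 0, 1, 0]                   # made-up true labels
mostly_right = [4.0, -4.0, 3.0, -3.0]   # logits that agree with the labels
mostly_wrong = [-4.0, 4.0, -3.0, 3.0]   # logits that contradict the labels

print(binary_cross_entropy_from_logits(labels, mostly_right))
print(binary_cross_entropy_from_logits(labels, mostly_wrong))
```

The first loss is close to zero and the second is much larger, because the log term grows quickly as the probability given to the true class approaches zero. A logit of 0 corresponds to a probability of 0.5 and a per-example loss of log 2.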

### Training the model

The next step is to train our model, now that we have everything set up properly. We will train for 20 epochs using batches of 512 samples. This means that the weights are updated once per batch of 512 samples, rather than after every sample.
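To get a feel for how many weight updates this implies, here is a back-of-the-envelope sketch. It assumes the IMDB training split holds 25,000 reviews and that the `train[:60%]` slice therefore contains 15,000 of them; the variable names are chosen for this example.

```python
import math

train_split_size = 25000   # assumed size of the full IMDB training split
batch_size = 512
epochs = 20

# Integer arithmetic avoids floating-point rounding in the 60% slice.
train_examples = train_split_size * 60 // 100

steps_per_epoch = math.ceil(train_examples / batch_size)
last_batch_size = train_examples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(train_examples, steps_per_epoch, last_batch_size, total_updates)
# 15000 30 152 600
```

Under these assumptions each epoch performs 30 updates, the last of them on a smaller batch of 152 reviews, for 600 updates over the whole run.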


In [18]:
train_data.size



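AttributeError: 'PrefetchDataset' object has no attribute 'size'

A `tf.data.Dataset` does not expose a `size` attribute, which is why this call fails. `tf.data.experimental.cardinality(train_data)` can report the number of elements when TensorFlow knows it.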

In [19]:
history = model.fit(train_data.shuffle(10000).batch(512), #shuffle and batch the training data
                   epochs = 20,
                   validation_data = validation_data.batch(512),
                   verbose = 1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


Now that we have trained the model, we want to evaluate how it performs on our test dataset.


In [20]:
results = model.evaluate(test_data.batch(512), verbose = 2)

for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))

loss: 0.317
accuracy: 0.860
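Since the model's last layer outputs logits rather than probabilities, turning its outputs into sentiment labels takes one extra step. The sketch below uses made-up logits and labels to show one way to do it by hand; it is an illustration of the idea and not a description of how Keras computes its accuracy metric.

```python
import math


def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_labels(logits, threshold=0.5):
    """Label a review positive (1) when its probability reaches the threshold."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


logits = [2.3, -1.1, 0.4, -3.0, 0.9]   # made-up model outputs
labels = [1, 0, 0, 0, 1]               # made-up true labels

predictions = predict_labels(logits)
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

print(predictions, accuracy)  # [1, 0, 1, 0, 1] 0.8
```

A probability threshold of 0.5 is the same as checking whether the logit is positive, so the sigmoid is only needed when the probabilities themselves are of interest.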


That's it! We've trained our model to detect movie review sentiment, and it reaches 86% accuracy on the test set. Though this is high, it can be improved further by using more advanced classification techniques. So, like the image classification problem, this will also be revisited once I learn a bit more.