We've looked at several options for text classification using word embeddings, so let's move on to deep learning! Deep learning is better suited to datasets larger than ours, but we'll still take a look at a few approaches and see how they perform.

We'll also need to use a GPU, either locally if that's an option for you or through Google Colab. On Colab, go to Edit -> Notebook Settings and select GPU. Then run the cell below to ensure a GPU is available. 

In [0]:
import torch

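#Use the GPU if one is visible to PyTorch; otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('Using GPU:', torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print('No GPU available - maybe try again later?')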

In [0]:
import pandas as pd
import numpy as np
import sklearn

In [3]:
from google.colab import drive

drive.mount('/content/gdrive')

train = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_train.csv')
val = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_val.csv')
test = pd.read_csv('gdrive/My Drive/RTANews_raw/arabic_test.csv')

train.head()

#On a GPU, we can use the full 40 classes! You could uncomment the three lines below if you're trying to run this on CPU.
#train = train[train.label <= 20]
#val = val[val.label <= 20]
#test = test[test.label <= 20]

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Unnamed: 0,text,category,label
0,لوكسمبورغ: كاميرون ارتكب خطأ تاريخيا بطرح الاس...,استفتاء_بريطانيا,1
1,روسيا بصدد تصنيع مركبة فضائية جديدة\n تبدأ عمل...,التقنية_والمعلومات,10
2,صادرات ألمانيا إلى روسيا عند أدنى مستوى منذ 1...,عقوبات_اقتصادية,25
3,الجيش السوري يصد هجوم جبهة النصرة في ريف حلب\n...,المعارضة_السورية,12
4,ردود أفعال وسائل إعلام غربية على عملية درع الف...,الأزمة_السورية,6


We're going to build a basic LSTM (Long Short-Term Memory) network. Until a few years ago, LSTMs were state of the art for text classification tasks because they let a network retain information about pieces of text that aren't necessarily close together in a document, hence the name. Here's a longer explanation: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

Let's try to implement this using `keras`, a high-level API for building basic neural networks. It's available through `tensorflow`.

In [4]:
#As in previous notebooks, we'll tokenize our text and remove stopwords
import nltk
from nltk.tokenize import WhitespaceTokenizer

nltk.download('stopwords')
#A set makes the "not in stop_words" checks below much faster than a list
stop_words = set(nltk.corpus.stopwords.words('arabic'))

tokenizer = WhitespaceTokenizer()

train_words = [tokenizer.tokenize(t) for t in train.text]
val_words = [tokenizer.tokenize(t) for t in val.text]
test_words = [tokenizer.tokenize(t) for t in test.text]

train_words = [[t for t in text if t not in stop_words] for text in train_words]
val_words = [[t for t in text if t not in stop_words] for text in val_words]
test_words = [[t for t in text if t not in stop_words] for text in test_words]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

#num_words sets the size of our vocabulary - so we're keeping (roughly) the 8000 most common words here
tokenizer = Tokenizer(num_words = 8000, oov_token='[OOV]')
tokenizer.fit_on_texts(train_words)
word_index = tokenizer.word_index


Now we have the words in our training set mapped to indices. Less common words in the training set, as well as any words that appear in `val` and/or `test` but not `train`, are out of vocabulary. As discussed in previous notebooks, we can't ask a model to evaluate words that it hasn't been trained on.
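To make the idea of a word index concrete, here's a minimal sketch in plain Python. It's a simplified stand-in and not how `keras` actually implements its `Tokenizer`: the helper names (`build_word_index`, `to_ids`) and the toy sentences are made up for illustration, and we're assuming index 0 is kept free for padding and index 1 is used for out-of-vocabulary words.

```python
from collections import Counter

def build_word_index(tokenized_texts, num_words, oov_token="[OOV]"):
    """Map the most frequent tokens to integer ids (hypothetical helper).

    Id 0 is left free for padding and id 1 is reserved for the OOV token.
    """
    counts = Counter(tok for text in tokenized_texts for tok in text)
    word_index = {oov_token: 1}
    # Keep num_words - 2 real words so every id stays below num_words
    for word, _ in counts.most_common(num_words - 2):
        word_index[word] = len(word_index) + 1
    return word_index

def to_ids(tokens, word_index, oov_token="[OOV]"):
    """Replace each token with its id, falling back to the OOV id."""
    return [word_index.get(tok, word_index[oov_token]) for tok in tokens]

# Toy "training set": 'dog' and 'ran' are too rare to make the cut
toy_train = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
toy_index = build_word_index(toy_train, num_words=5)

# 'bird' never appeared in training, so it maps to the OOV id
toy_ids = to_ids(["the", "bird", "sat"], toy_index)
print(toy_index)
print(toy_ids)
```

Both the rare training word and the unseen word end up with the same OOV id, which is the behaviour described above.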

We use our `word_index` to convert our words to indices and then "pad" each sequence of indices so that they're all the same length. Our articles have different lengths, but tensorflow wants each input to have the same size. This is a standard step in deep learning for text, no matter what library or approach you're using. The `maxlen` argument sets that length: longer sequences are truncated to 300 indices, and shorter ones are filled out with zeros.

This also converts our inputs to NumPy arrays, which tensorflow turns into tensors when we train (in short, a tensor is a multi-dimensional array). That's the format tensorflow expects.
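Here's roughly what padding and truncating at the end (`'post'`) do, as a plain-Python sketch. `pad_post` is a hypothetical helper written for illustration, not the `keras` function, and the toy sequences are made up.

```python
def pad_post(sequences, maxlen, pad_value=0):
    """Truncate long sequences and right-pad short ones to exactly maxlen."""
    padded = []
    for seq in sequences:
        clipped = list(seq[:maxlen])  # like truncating='post': drop the tail
        clipped += [pad_value] * (maxlen - len(clipped))  # like padding='post': fill the end
        padded.append(clipped)
    return padded

# Three toy sequences of different lengths, forced to length 4
toy_padded = pad_post([[5, 2, 9], [7], [4, 4, 4, 4, 4, 4]], maxlen=4)
for row in toy_padded:
    print(row)
```

Every row comes out with the same length, which is what lets us stack them into a single array.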

In [0]:
train_sequences = tokenizer.texts_to_sequences(train_words)
train_padded = pad_sequences(train_sequences, maxlen=300, padding='post', truncating='post')

validation_sequences = tokenizer.texts_to_sequences(val_words)
val_padded = pad_sequences(validation_sequences, maxlen=300, padding='post', truncating='post')

test_sequences = tokenizer.texts_to_sequences(test_words)
test_padded = pad_sequences(test_sequences, maxlen=300, padding='post', truncating='post')

#Convert labels to arrays
train_labels = np.array(train.label)
val_labels = np.array(val.label)
test_labels = np.array(test.label)

Now we can build our model! `keras` makes this super easy, which hides a lot of complexity but makes it a great entry point.

We wrap our model in a call to `tf.keras.Sequential`, which lets us use a list to build the layers of our network.

We start with an embedding layer, with our vocab size (8000) as an input and an embedding size (we've chosen 64 here) as an output.
Then we have a single bidirectional LSTM layer and two dense (fully connected) layers.

Our final dense layer has 40 output nodes, because we have 40 classes. Its softmax activation turns those 40 outputs into probabilities that sum to 1.

In [7]:
keras_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(8000, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(40, activation='softmax')
])

keras_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          512000    
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               66048     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 40)                2600      
Total params: 588,904
Trainable params: 588,904
Non-trainable params: 0
_________________________________________________________________
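If you're curious where the parameter counts in the summary come from, they can be reproduced with a bit of arithmetic. This sketch assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the variable names are ours.

```python
vocab_size, embed_dim = 8000, 64
lstm_units, dense_units, num_classes = 64, 64, 40

# One 64-dimensional vector per word in the vocabulary
embedding_params = vocab_size * embed_dim

# 4 gates, each with input weights, recurrent weights and a bias
lstm_params_one_direction = 4 * (embed_dim + lstm_units + 1) * lstm_units
# Bidirectional means two independent LSTMs (forward and backward)
bidirectional_params = 2 * lstm_params_one_direction

# The dense layer sees the two LSTM outputs concatenated (128 values)
dense_params = (2 * lstm_units) * dense_units + dense_units
output_params = dense_units * num_classes + num_classes

total_params = (embedding_params + bidirectional_params
                + dense_params + output_params)
print(embedding_params, bidirectional_params, dense_params, output_params)
print(total_params)
```

Most of the parameters sit in the embedding layer, so vocabulary size has a big effect on model size.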


In [8]:
keras_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
num_epochs = 10

history = keras_model.fit(train_padded, train_labels, epochs=num_epochs, validation_data=(val_padded, val_labels), verbose=2)

Epoch 1/10
692/692 - 21s - loss: 2.5847 - accuracy: 0.3034 - val_loss: 1.9898 - val_accuracy: 0.4243
Epoch 2/10
692/692 - 20s - loss: 1.7632 - accuracy: 0.4771 - val_loss: 1.7177 - val_accuracy: 0.4940
Epoch 3/10
692/692 - 20s - loss: 1.4150 - accuracy: 0.5657 - val_loss: 1.5697 - val_accuracy: 0.5352
Epoch 4/10
692/692 - 20s - loss: 1.1465 - accuracy: 0.6336 - val_loss: 1.4549 - val_accuracy: 0.5815
Epoch 5/10
692/692 - 20s - loss: 0.9883 - accuracy: 0.6771 - val_loss: 1.3958 - val_accuracy: 0.5905
Epoch 6/10
692/692 - 20s - loss: 0.8474 - accuracy: 0.7192 - val_loss: 1.4441 - val_accuracy: 0.5978
Epoch 7/10
692/692 - 20s - loss: 0.7508 - accuracy: 0.7457 - val_loss: 1.4861 - val_accuracy: 0.5894
Epoch 8/10
692/692 - 20s - loss: 0.6775 - accuracy: 0.7622 - val_loss: 1.5176 - val_accuracy: 0.5786
Epoch 9/10
692/692 - 21s - loss: 0.6176 - accuracy: 0.7800 - val_loss: 1.5874 - val_accuracy: 0.5779
Epoch 10/10
692/692 - 20s - loss: 0.5757 - accuracy: 0.7921 - val_loss: 1.6029 - val_accura

So what are we looking at here? For each epoch, or iteration over the dataset, we see the loss and accuracy for both our train and val sets. Loss is a measurement of how far the model's predictions are from the true labels, so it tells us how well the model is fitting the data. We're tracking it here using sparse categorical crossentropy, but different problems require different loss functions. You might be familiar with the idea that for linear regression, you can look at the mean squared error. This is a loss function!

So lower numbers are better for loss. Here we see that with each epoch, the training loss is going down and the accuracy is going up. Great.
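To get a feel for the numbers, here's a minimal sketch of crossentropy for a single example: the negative log of the probability the model gave to the correct label. The probabilities are made up, and the function is a hand-rolled illustration and not the `keras` implementation (which, among other things, averages over many examples).

```python
import math

def sparse_categorical_crossentropy(probs, true_label):
    """Loss for one example: -log(probability assigned to the true label)."""
    return -math.log(probs[true_label])

# Three made-up predictions over 3 classes; the true label is 1 each time
confident_right = [0.05, 0.90, 0.05]
unsure = [0.30, 0.40, 0.30]
confident_wrong = [0.90, 0.05, 0.05]

losses = [sparse_categorical_crossentropy(p, 1)
          for p in (confident_right, unsure, confident_wrong)]
print([round(loss, 3) for loss in losses])
```

The loss is close to zero when the model is confidently right and grows quickly when it is confidently wrong.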

But after epoch 5, the validation loss starts going up, and after epoch 6 the validation accuracy starts going down, even though the training metrics continue to improve. This likely means that the model is overfitting - meaning that it's beginning to memorize characteristics of the training data specifically, rather than learning generalizable trends about the dataset. If we wanted to improve the model, we might train it for fewer epochs or use early stopping, a technique where training stops once a validation metric stops improving.
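Here's the logic behind early stopping as a plain-Python sketch, applied to the validation losses from the training log above. The function name and the `patience` value of 2 are our own choices for illustration.

```python
# Validation losses for epochs 1-10, copied from the training log
val_losses = [1.9898, 1.7177, 1.5697, 1.4549, 1.3958,
              1.4441, 1.4861, 1.5176, 1.5874, 1.6029]

def early_stopping_epoch(losses, patience):
    """Return (epoch where training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(losses), best_epoch

stopped_at, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stopped_at, best_epoch)
```

In `keras` itself, you'd typically get this behaviour by passing something like `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)` in the `callbacks` list of `fit`, so you don't need to write the loop yourself.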

Looking for overfitting is one important reason to use a validation set. It's considered best practice not to tweak your hyperparameters or otherwise change your models based on test metrics. A val set lets you get a sense of how your model is doing, for instance on overfitting, while still holding out a 'real' test set.

Speaking of, let's see how this model performs on the test set.

In [9]:
from sklearn.metrics import f1_score

preds = keras_model.predict(test_padded)

preds[0]

array([8.7735877e-08, 2.8430168e-07, 4.6861078e-11, 6.0553972e-11,
       5.6843977e-11, 1.4128086e-08, 1.3539029e-06, 1.2695617e-10,
       7.0358871e-19, 3.3114012e-11, 7.9905517e-14, 7.7171863e-12,
       9.6451316e-08, 1.1430596e-06, 2.8013128e-13, 2.3505014e-07,
       2.3402378e-12, 3.0051389e-05, 3.1096235e-05, 7.8456594e-07,
       1.8648237e-10, 3.8474067e-08, 3.0096146e-04, 7.2220775e-12,
       6.8299113e-14, 8.5205615e-10, 2.4046727e-07, 3.9145602e-15,
       3.7505172e-09, 9.9962890e-01, 3.0067194e-12, 5.3953602e-07,
       6.7811083e-13, 1.7839497e-07, 1.9960604e-09, 4.9507221e-12,
       1.7385561e-20, 1.1451222e-09, 3.8347107e-06, 8.4892463e-08],
      dtype=float32)

This looks different from our previous predictions! Instead of a single predicted label, tensorflow gives us a probability that the text belongs to each label.

This can be helpful in many ways, but an easy way to collapse this down to a single prediction is to get the label that our model thinks is most likely.

In [0]:
def get_flat_preds(preds):

  #For each row of probabilities, return the index of the most likely label
  #(np.argmax picks a single index per row, even if two labels tie)
  return np.argmax(preds, axis=1).tolist()
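To see this collapse on something small enough to check by eye, here's a sketch on a made-up 3-class array (the numbers are invented). It also pulls out the winning probability and the two most likely labels, which is one way the full set of probabilities can be helpful.

```python
import numpy as np

# Made-up probabilities for 3 texts over 3 classes (each row sums to 1)
toy_preds = np.array([
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
    [0.60, 0.30, 0.10],
])

# Index of the most likely label in each row
toy_labels = np.argmax(toy_preds, axis=1)

# How confident the model was in that label
toy_confidence = np.max(toy_preds, axis=1)

# The two most likely labels per row, most likely first
top2 = np.argsort(toy_preds, axis=1)[:, ::-1][:, :2]

print(toy_labels)
print(toy_confidence)
print(top2)
```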

In [12]:
flat_preds = get_flat_preds(preds)

f1_score(test.label, flat_preds, average = 'weighted')

0.5888271234276691

Not bad! This is worse than our previous results, but remember that we're working with 40 classes instead of 20 now. We could probably also improve our results by increasing hyperparameters such as the vocab size and sequence length, or by improving our model architecture. Right now, for instance, we don't have any [dropout](https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/).
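If you haven't met dropout before, here's a small NumPy sketch of the idea: during training, randomly zero out a fraction of a layer's outputs and scale up the rest, and at prediction time leave everything alone. This is a hand-rolled illustration (the function and its arguments are ours), not the `keras` layer.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero out roughly `rate` of the values and rescale the survivors."""
    if not training or rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    # Scaling by 1 / (1 - rate) keeps the expected value unchanged
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
toy_activations = np.ones((4, 5))

dropped = dropout(toy_activations, rate=0.5, rng=rng)
unchanged = dropout(toy_activations, rate=0.5, rng=rng, training=False)
print(dropped)
```

In a `keras` model like ours, one option would be to add a `tf.keras.layers.Dropout` layer between the existing layers, or to set the `dropout` argument on the LSTM layer. Which placement and rate work best here is something you'd have to check on the validation set.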

Because we're using a fairly small dataset here, at least by deep learning standards, we've been able to skip a few steps that are standard for a real deep learning problem. We'll see some of these in the final notebook on BERT, but a few examples are:

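- **Batching** - If it's too computationally expensive to pass our entire training set into a model at once, we pass it in batches, or subsets of the data. `keras` did this for us behind the scenes (the `692/692` in the training log counts batches per epoch), but in other setups you'll often build the batches yourself. Both tensorflow and pytorch have utilities to make this easier.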
- **Selecting an activation function, an optimizer, a learning rate...** - There are many choices you can make when building a neural network, and it's probably a good idea to read up on the basics of, say, the difference between some common activation functions. In practice, however, you'll often replicate these choices from someone else who has solved a problem similar to yours!
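For the batching point, the core idea fits in a few lines of plain Python. This is only a sketch with a made-up helper name and toy data; the utilities in tensorflow and pytorch also handle shuffling, GPU transfer and so on.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Ten toy "examples" split into batches of 4
toy_examples = list(range(10))
toy_batches = list(iter_batches(toy_examples, batch_size=4))

# The number of batches per epoch is the dataset size divided by
# the batch size, rounded up (the last batch may be smaller)
steps_per_epoch = math.ceil(len(toy_examples) / 4)

print(toy_batches)
print(steps_per_epoch)
```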