# Introduction to NLP Fundamentals in TensorFlow

NLP aims to derive information from natural language, which could be sequences of text or speech.

Many NLP problems are also referred to as sequence-to-sequence (seq2seq) problems.

## Check for GPU

In [5]:
!nvidia-smi -L

/bin/bash: line 1: nvidia-smi: command not found


## Get helper functions

In [6]:
!wget https://raw.githubusercontent.com/arrshsh/ML-and-DS/main/TensorFlow/helper_functions.py

--2025-01-27 22:31:02--  https://raw.githubusercontent.com/arrshsh/ML-and-DS/main/TensorFlow/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10637 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2025-01-27 22:31:03 (83.1 MB/s) - ‘helper_functions.py’ saved [10637/10637]



In [7]:
# Import the required functions
from helper_functions import unzip_data, plot_loss_curves, compare_historys

## Get a text dataset

The dataset we're going to use is Kaggle's introduction to NLP dataset (text samples of Tweets labelled as disaster or not disaster).

See the original source here: https://www.kaggle.com/competitions/nlp-getting-started

In [8]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

# Unzip the data
unzip_data("nlp_getting_started.zip")

--2025-01-27 22:31:14--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.2.207, 142.250.141.207, 74.125.137.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.2.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2025-01-27 22:31:14 (94.3 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Visualising text data

In [9]:
import pandas as pd
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [10]:
train_df["text"][0]

'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'

In [11]:
# Shuffle training dataframe
train_df_shuffled = train_df.sample(frac = 1, random_state = 42)
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [12]:
# What does test dataframe look like?
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [13]:
# How many samples of each class do we have?
train_df.target.value_counts()

| target | count |
|--------|-------|
| 0      | 4342  |
| 1      | 3271  |


In [14]:
# How many total samples?
len(train_df), len(test_df)

(7613, 3263)

In [15]:
# Let's visualize some random training examples
import random
random_index = random.randint(0, len(train_df)- 5) # Create random indexes not higher than the total number of samples
for row in train_df_shuffled[["text", "target"]][random_index: random_index + 5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "{real disaster}" if target > 0 else "{not real disaster}")
  print(f"Text: \n {text}\n")
  print("-------------------------------------------\n")

Target: 1 {real disaster}
Text: 
 Young dancer moves about 300 youth in attendance at the GMMBC Youth Explosion this past Saturday. Inspiring! http://t.co/TMmOrvxsWz

-------------------------------------------

Target: 1 {real disaster}
Text: 
 SpaceX Founder Musk: Structural Failure Took Down Falcon 9 http://t.co/LvIzO9CSSR

-------------------------------------------

Target: 1 {real disaster}
Text: 
 Strict liability in the context of an airplane accident: Pilot error is a common component of most aviation cr... http://t.co/6CZ3bOhRd4

-------------------------------------------

Target: 1 {real disaster}
Text: 
 Bestie is making me watch texas chainsaw massacre ????????

-------------------------------------------

Target: 1 {real disaster}
Text: 
 Malaysia seem more certain than France.

Plane debris is from missing MH370 http://t.co/eXZnmxbINJ

-------------------------------------------



### Splitting data into training and validation sets

In [16]:
from sklearn.model_selection import train_test_split # this function accepts input as numpy
train_sentences, val_sentences, train_labels, val_labels = train_test_split(
    train_df_shuffled["text"].to_numpy(),
    train_df_shuffled["target"].to_numpy(),
    test_size = 0.1,
    random_state = 42)

In [17]:
# Check lengths
len(train_sentences), len(val_sentences), len(train_labels), len(val_labels)

(6851, 762, 6851, 762)
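
The lengths above follow from how the 10% share is rounded. Here is a minimal sketch of the arithmetic, assuming the validation count is rounded up and the training set takes the remainder (an assumption that reproduces the numbers printed above):

```python
import math

n_samples = 7613   # rows in train.csv, from len(train_df) above
test_size = 0.1

# Assumption: the validation count is rounded up and the training set
# receives whatever is left over.
n_val = math.ceil(n_samples * test_size)
n_train = n_samples - n_val

print(n_train, n_val)
```

So the validation set holds 762 tweets rather than 761, because 761.3 is rounded up.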

In [18]:
# Check the first 10 samples
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

## Convert text into numbers

When dealing with a text problem, one of the first things you'll have to do before you can build a model is convert your text to numbers. There are a few ways to do this, namely:

* Tokenisation: a direct mapping from each token (a token could be a word or a character) to a number.
* Embedding: a matrix holding one feature vector per token (the size of the feature vector can be defined, and the embedding can be learned and updated by our model during training).
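
To make the tokenisation bullet concrete, here is a small pure-Python sketch that maps the same sentence to integers at word level and at character level. The sentence and the numbering scheme (alphabetical order, starting at 1) are made up for illustration and are not how `TextVectorization` assigns indices.

```python
sentence = "forest fire near town"  # hypothetical example text

# Word-level tokenisation: each distinct word gets its own integer.
word_tokens = sentence.split()
word_to_id = {word: i for i, word in enumerate(sorted(set(word_tokens)), start=1)}
word_ids = [word_to_id[w] for w in word_tokens]

# Character-level tokenisation: each distinct character gets its own integer.
char_to_id = {ch: i for i, ch in enumerate(sorted(set(sentence)), start=1)}
char_ids = [char_to_id[ch] for ch in sentence]

print(word_ids)
print(len(word_ids), len(char_ids))
```

Word-level tokens give short sequences over a large vocabulary, while character-level tokens give long sequences over a small one.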

### Text vectorisation (tokenisation)

In [19]:
import tensorflow as tf

In [20]:
from tensorflow.keras.layers import TextVectorization

In [21]:
train_sentences[:5]

array(['@mogacola @zamtriossu i screamed after hitting tweet',
       'Imagine getting flattened by Kurt Zouma',
       '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
       "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
       'Somehow find you and I collide http://t.co/Ee8RpOahPk'],
      dtype=object)

In [22]:
# TextVectorization with its main parameters written out explicitly
text_vectorizer = TextVectorization(max_tokens = 10000, # Maximum vocabulary size; None means no cap (only allowed when pad_to_max_tokens is False)
                                    standardize = "lower_and_strip_punctuation",
                                    split= "whitespace",
                                    ngrams = None, # Create groups of n-words
                                    output_mode = "int", # How to map tokens to numbers
                                    output_sequence_length = None, # How long do you want the sequences to be?
                                    pad_to_max_tokens = True)

In [23]:
train_sentences[0].split(), len(train_sentences[0].split())

(['@mogacola', '@zamtriossu', 'i', 'screamed', 'after', 'hitting', 'tweet'], 7)

In [24]:
# Find the average number of tokens (words) in the training tweets
round(sum(len(i.split()) for i in train_sentences) / len(train_sentences))

15

In [25]:
# Setup text vectorization variables
max_vocab_length = 10000 # Pick up the 10000 most common tokens from the data
max_length = 15 # The number of words from a tweet that our model sees. For instance, if a tweet has 30 words, the model just sees the first 15 tokens
text_vectorizer = TextVectorization(max_tokens = max_vocab_length,
                                    output_mode = "int",
                                    output_sequence_length = max_length)

In [26]:
# Fit the text vectorizer to the training data
text_vectorizer.adapt(train_sentences)

In [27]:
# Create a sample sentence and tokenize it
sample_sentence = "There's a fllod in the town!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3,   1,   4,   2, 801,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>

In [28]:
# Choose random sentence from the dataset and tokenize it
random_sentence = random.choice(train_sentences)
print(f"Original text: \n {random_sentence} \n\n\
        Vectorized text:")
text_vectorizer([random_sentence])

Original text: 
 Wall of noise is one thing - but a wall of dust? Moving at 60MPH? http://t.co/9NwAJLi9cr How to not get blown away! http://t.co/j4NI4N0yFZ 

        Vectorized text:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[1104,    6, 3590,    9,   61,  498,   30,    3, 1104,    6,  398,
        1386,   17, 6308, 3769]])>

In [29]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5]
bottom_5_words = words_in_vocab[-5:]
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"5 most common words: {top_5_words}")
print(f"5 least common words: {bottom_5_words}")

Number of words in vocab: 10000
5 most common words: ['', '[UNK]', 'the', 'a', 'in']
5 least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']
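
The vocabulary above starts with `''` (padding, index 0) and `'[UNK]'` (unknown token, index 1). To see why, here is a rough pure-Python imitation of what the layer does, assuming it lowercases, strips punctuation, splits on whitespace, keeps the most frequent words and pads to a fixed length. The function names and the toy corpus are hypothetical; this is a sketch, not the layer's actual implementation.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip ASCII punctuation (a rough stand-in for
    the "lower_and_strip_punctuation" option)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(sentences, max_tokens):
    """Reserve index 0 for padding and 1 for unknown tokens, then add the
    most frequent words until the vocabulary holds at most max_tokens entries."""
    counts = Counter(w for s in sentences for w in standardize(s).split())
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, seq_len):
    """Map words to indices, then truncate or zero-pad to seq_len."""
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index.get(w, 1) for w in standardize(text).split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))

toy_corpus = ["The flood hit the town", "A fire in the town"]  # hypothetical data
vocab = build_vocab(toy_corpus, max_tokens=10)
result = vectorize("There's a fllod in the town!", vocab, seq_len=8)

print(vocab[:3])
print(result)
```

In this toy version, the misspelled word "fllod" maps to index 1 because it is not in the vocabulary, and the trailing zeros are padding, which mirrors the `1` and the run of `0`s in the sample sentence output earlier.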


### Creating an embedding using an Embedding Layer

For this purpose, we'll use TensorFlow's Embedding layer: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

Some important parameters of the Embedding layer are:

* `input_dim`: the size of the vocabulary
* `output_dim`: the size of the output embedding vectors; for instance, a value of 100 means each token is represented by a vector of length 100
* `input_length`: the length of the sequences being passed to the embedding layer
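
Conceptually, an embedding layer is a lookup table with one row per token index. The sketch below imitates that idea in pure Python with toy sizes; the matrix values, the sizes and the example sequence are all hypothetical, and a real layer would learn its rows during training.

```python
import random

random.seed(42)

vocab_size = 6      # toy stand-in for input_dim
embedding_dim = 4   # toy stand-in for output_dim

# One row of small random values per token index, similar in spirit to a
# uniform initializer; a real layer would update these rows during training.
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Replace each integer token with its row from the embedding matrix."""
    return [embedding_matrix[i] for i in token_ids]

sequence = [2, 5, 1, 0, 0]  # hypothetical vectorized sentence, zero-padded
embedded = embed(sequence)

print(len(embedded), len(embedded[0]))  # sequence length, embedding_dim
print(embedded[3] == embedded[4])       # both padding tokens share one row
```

A sequence of 5 tokens becomes a 5 x 4 grid of floats here, in the same way that a sequence of 15 tokens becomes a 15 x 128 grid with the settings used in this notebook.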

In [30]:
from tensorflow.keras import layers
embedding = layers.Embedding(input_dim = max_vocab_length,
                             output_dim = 128,
                             embeddings_initializer = "uniform",
                             input_length = max_length )
embedding



<Embedding name=embedding, built=False>

A common rule of thumb in ML is to choose hyperparameter values that are divisible by 8 (for example, a batch size of 32), since such sizes often run more efficiently on modern hardware.

In [31]:
# Get a random sentence from the training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
        \n\nEmbedded text:\n")
# Embed the random sentence (turn it into dense vectors of fixed size)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
#Breaking144 Obama Declares Disaster for Typhoon-Devastated Saipan: Obama signs disaster declarat... http://t.co/M8CIKs60BX #AceNewsDesk        

Embedded text:



<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.02783159, -0.00593531,  0.04553375, ..., -0.03026455,
          0.0095581 , -0.045012  ],
        [ 0.04938303, -0.01219455, -0.03961499, ..., -0.04795771,
         -0.02901039, -0.01458982],
        [-0.00143818, -0.00748863,  0.01445435, ...,  0.03963767,
          0.03090886,  0.0145609 ],
        ...,
        [-0.02783159, -0.00593531,  0.04553375, ..., -0.03026455,
          0.0095581 , -0.045012  ],
        [-0.04538896,  0.00870328, -0.02202642, ...,  0.02961581,
         -0.03187122, -0.02266969],
        [-0.04538896,  0.00870328, -0.02202642, ...,  0.02961581,
         -0.03187122, -0.02266969]]], dtype=float32)>

In [32]:
# Check out the embedding for a single token
sample_embed[0][0], sample_embed[0][0].shape, random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-0.02783159, -0.00593531,  0.04553375,  0.02596704, -0.02422067,
         0.02568677, -0.01814177,  0.01961067, -0.01775712,  0.04658644,
         0.04660046,  0.04616169, -0.0431397 , -0.04907414,  0.01234544,
         0.01170466, -0.02816175,  0.02537365,  0.02736492, -0.02700679,
        -0.01538647,  0.00987202, -0.04334461, -0.04978681, -0.04446924,
         0.03750518, -0.02878159, -0.0153814 , -0.03009037,  0.00410366,
        -0.00491425, -0.03988282, -0.04859049, -0.01048322, -0.00795631,
        -0.00981326, -0.02447286,  0.03650485, -0.02091633, -0.00381415,
        -0.01255889,  0.03610468, -0.02301617,  0.03988925, -0.03558526,
        -0.02829566, -0.03747328,  0.00469059, -0.02937651, -0.01881232,
        -0.01518621,  0.00707034,  0.04081931, -0.04380709, -0.01553204,
         0.02396259, -0.00158773,  0.04236979,  0.01436508, -0.04517291,
         0.03649468, -0.04278687,  0.007015  ,  0.02978346, -0.03749631,
  

## Modelling a text dataset (running a series of experiments)

Now that we've got a way to turn our sequences into numbers, it's time to start building a series of modelling experiments.

We'll start with baseline and move on from there.

* Model 0: Naive Bayes (baseline), chosen with the help of Scikit-Learn's machine learning map.
* Model 1: Feed-forward neural network (dense model)
* Model 2: LSTM model (RNN)
* Model 3: GRU model (RNN)
* Model 4: Bidirectional-LSTM model (RNN)
* Model 5: 1D Convolutional Neural Network (CNN)
* Model 6: TensorFlow Hub Pre-trained Feature extractor (using transfer learning for NLP)
* Model 7: Same as model 6 with 10% of training data


How will we approach these?
Use the standard steps in modelling with TensorFlow:
* Create a model
* Build a model
* Fit a model
* Evaluate our model

### Model 0: Getting a baseline

As with all machine learning modelling experiments, it's important to create a baseline model so you've got a benchmark for future models to build upon.

To create our baseline, we'll use Scikit-Learn's Multinomial Naive Bayes, with the TF-IDF formula converting our words to numbers.
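
To give a feel for what TF-IDF computes, here is a small pure-Python sketch. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, and it skips the tokenisation details that `TfidfVectorizer` handles. The toy documents and the `tfidf` helper are hypothetical.

```python
import math
from collections import Counter

docs = ["forest fire near town", "fire drill at school", "school is closed"]  # toy data
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in tokenized for word in set(doc))

def tfidf(doc):
    """Term count times smoothed inverse document frequency, L2-normalised.
    Assumption: idf = ln((1 + n) / (1 + df)) + 1."""
    counts = Counter(doc)
    weights = {w: c * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

scores = tfidf(tokenized[0])
print(scores)
```

Words that appear in fewer documents (such as "forest") receive a higher weight than words shared across documents (such as "fire"), which is the signal the Naive Bayes classifier then works with.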

> **Note:** It's a common practice to use non-DL algorithms as a baseline because of their speed and then later using DL to see if we can improve upon them.

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# A Pipeline chains steps together so they run in order
# Create tokenization and modeling pipeline
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()),  # Convert the words to numbers
    ("clf", MultinomialNB())       # Model the text
])

# Fit the pipeline to the training data
model_0.fit(train_sentences, train_labels)

In [34]:
train_sentences[0]

'@mogacola @zamtriossu i screamed after hitting tweet'

In [35]:
train_labels[0]

0

In [36]:
# Evaluate our baseline model
baseline_score = model_0.score(val_sentences, val_labels)
print(f"Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%")

Our baseline model achieves an accuracy of: 79.27%


In [37]:
# Make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

### Creating an evaluation function for our model experiments

The metrics considered are:
* Accuracy
* Precision
* Recall
* F1-score
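
As a quick refresher on how these four metrics relate, here is a pure-Python sketch for the positive class of a binary problem. The labels are made up, and `binary_metrics` is a hypothetical helper; the function defined in the next cell uses Scikit-Learn with `average="weighted"`, so its numbers are averaged across both classes instead.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 0, 1, 1, 0, 0]  # hypothetical labels
y_pred = [1, 0, 0, 1, 1, 0]  # hypothetical predictions
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision asks how many predicted disasters were real, recall asks how many real disasters were found, and F1 is the harmonic mean of the two.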

In [38]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
    """
    Calculates model accuracy, precision, recall, and f1-score of a binary classification model.
    Ensures inputs are converted to NumPy arrays if necessary and handles device mismatches.
    """
    # Convert TensorFlow Tensors to NumPy arrays if necessary
    y_true = y_true.numpy() if hasattr(y_true, "numpy") else y_true
    y_pred = y_pred.numpy() if hasattr(y_pred, "numpy") else y_pred

    # Debugging information about device placement
    print(f"y_true device: {y_true.device if hasattr(y_true, 'device') else 'CPU'}")
    print(f"y_pred device: {y_pred.device if hasattr(y_pred, 'device') else 'CPU'}")

    # Calculate model accuracy
    model_accuracy = accuracy_score(y_true, y_pred) * 100
    # Calculate other metrics
    model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")

    # Prepare the results
    model_results = {
        "accuracy": model_accuracy,
        "precision": model_precision,
        "recall": model_recall,
        "f1": model_f1
    }

    return model_results


In [39]:
baseline_results = calculate_results(y_true = val_labels,
                                     y_pred = baseline_preds)
baseline_results

y_true device: CPU
y_pred device: CPU


{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

### Model 1: A simple dense model

In [40]:
# Create a tensorboard callback (need to create a new one for each model)
from helper_functions import create_tensorboard_callback

# Create a directory to save TensorBoard logs
SAVE_DIR = "model_logs"

In [41]:
from tensorflow.keras import layers

# Define the input layer
inputs = layers.Input(shape=(1,), dtype="string")

# Apply text vectorization
x = text_vectorizer(inputs)

# Apply the embedding layer
x = embedding(x)

# Add global average pooling to reduce the dimensionality
x = layers.GlobalAveragePooling1D()(x)

# Add the dense output layer with sigmoid activation
outputs = layers.Dense(1, activation="sigmoid")(x)

# Create the model
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense")
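
The pooling step in the model above collapses the sequence dimension by averaging the token vectors element-wise. A minimal pure-Python sketch of that idea, using made-up numbers and a hypothetical helper name rather than the Keras layer itself:

```python
def global_average_pooling_1d(sequence):
    """Average a list of equal-length token vectors element-wise."""
    n_tokens = len(sequence)
    return [sum(column) / n_tokens for column in zip(*sequence)]

# Hypothetical embedded sentence: 4 tokens, each a 3-dimensional vector
embedded_sentence = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
    [5.0, 6.0, 7.0],
    [7.0, 8.0, 9.0],
]
pooled = global_average_pooling_1d(embedded_sentence)
print(pooled)
```

Four vectors of length 3 become a single vector of length 3, which is the kind of fixed-size input the final dense layer needs.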

In [42]:
model_1.summary()

In [43]:
# Compile the model
model_1.compile(loss = "binary_crossentropy",
                optimizer = tf.keras.optimizers.Adam(),
                metrics = ["accuracy"])

In [44]:
model_1_history = model_1.fit(x= train_sentences,
                              y = train_labels,
                              epochs = 5,
                              validation_data = (val_sentences, val_labels),
                              callbacks = [create_tensorboard_callback(dir_name= SAVE_DIR,
                              experiment_name = "model_1_dense")])

Saving TensorBoard log files to: model_logs/model_1_dense/20250127-223117
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 9s 16ms/step - accuracy: 0.6474 - loss: 0.6479 - val_accuracy: 0.7415 - val_loss: 0.5372
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 6s 22ms/step - accuracy: 0.8107 - loss: 0.4594 - val_accuracy: 0.7874 - val_loss: 0.4707
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 6s 24ms/step - accuracy: 0.8632 - loss: 0.3537 - val_accuracy: 0.7953 - val_loss: 0.4560
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 8s 13ms/step - accuracy: 0.8866 - loss: 0.2952 - val_accuracy: 0.7887 - val_loss: 0.4640
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 7s 23ms/step - accuracy: 0.9207 - loss: 0.2338 - val_accuracy: 0.7835 - val_loss: 0.4761


In [45]:
# Check the results
model_1.evaluate(val_sentences, val_labels)

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7808 - loss: 0.5084


[0.47609713673591614, 0.7834645509719849]

In [46]:
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs.shape

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step


(762, 1)

In [47]:
model_1_pred_probs[:10]

array([[0.2993216 ],
       [0.75805575],
       [0.99808455],
       [0.10097639],
       [0.14383556],
       [0.9331775 ],
       [0.93617415],
       [0.99331194],
       [0.96191794],
       [0.2470271 ]], dtype=float32)

In [48]:
# Convert model prediction probabilities to label format
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 0.], dtype=float32)>
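
Rounding a sigmoid output is the same as applying a 0.5 decision threshold. Here is a small pure-Python sketch of that step, reusing the first ten probabilities printed above; the strict `>` comparison is chosen here so that a value of exactly 0.5 maps to 0.

```python
# First ten prediction probabilities, copied from the output above
pred_probs = [[0.2993216], [0.75805575], [0.99808455], [0.10097639], [0.14383556],
              [0.9331775], [0.93617415], [0.99331194], [0.96191794], [0.2470271]]

# "Squeeze": drop the inner dimension so each sample is a plain float
flat_probs = [row[0] for row in pred_probs]

# Threshold at 0.5: above means disaster (1), otherwise not a disaster (0)
labels = [1 if p > 0.5 else 0 for p in flat_probs]
print(labels)
```

If missing a real disaster were more costly than a false alarm, the threshold could be lowered below 0.5 at the price of more false positives.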

In [49]:
len(val_labels), len(model_1_preds)

(762, 762)

In [50]:
print(f"val_labels device: {val_labels.device if hasattr(val_labels, 'device') else 'CPU'}")
print(f"model_1_preds device: {model_1_preds.device if hasattr(model_1_preds, 'device') else 'CPU'}")


val_labels device: CPU
model_1_preds device: /job:localhost/replica:0/task:0/device:CPU:0


In [51]:
model_1_preds = tf.convert_to_tensor(model_1_preds).numpy() if tf.is_tensor(val_labels) else model_1_preds

In [52]:
val_labels = val_labels.numpy() if hasattr(val_labels, "numpy") else val_labels
model_1_preds = model_1_preds.numpy() if hasattr(model_1_preds, "numpy") else model_1_preds


In [53]:
print(f"val_labels device: {val_labels.device if hasattr(val_labels, 'device') else 'CPU'}")
print(f"model_1_preds device: {model_1_preds.device if hasattr(model_1_preds, 'device') else 'CPU'}")

val_labels device: CPU
model_1_preds device: CPU


In [54]:
# Calculate our model_1 results
model_1_results =  calculate_results(y_true = val_labels,
                                     y_pred = model_1_preds)

model_1_results

y_true device: CPU
y_pred device: CPU


{'accuracy': 78.34645669291339,
 'precision': 0.7872123378365872,
 'recall': 0.7834645669291339,
 'f1': 0.7807800582578169}

In [55]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

In [56]:
import numpy as np
np.array(list(model_1_results.values())) > np.array(list(baseline_results.values()))

array([False, False, False, False])

## Visualizing learnt embeddings

In [57]:
# Get the vocabulary from the text vectorization layer
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab), words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

In [58]:
# Model 1 summary
model_1.summary()

In [59]:
# Get the weight matrix of the embedding layer
# (these are the numerical representations of each token in our training data, which have been learnt for 5 epochs)
embed_weights = model_1.get_layer("embedding").get_weights()[0]
print(embed_weights.shape) # Should be the same as vocab size and embedding_dim (output_dim of our embedding layer)

(10000, 128)


In [60]:
embed_weights

array([[-0.03618862,  0.00855201, -0.01722494, ...,  0.00963791,
        -0.04461886, -0.04040532],
       [-0.02408862, -0.00674452,  0.04660494, ..., -0.03720271,
         0.00482799, -0.0509887 ],
       [ 0.01019228, -0.02663219,  0.02874871, ..., -0.03164796,
         0.01325365, -0.01363655],
       ...,
       [-0.01260849,  0.0425551 , -0.02184875, ..., -0.04369261,
        -0.01651797,  0.01524624],
       [ 0.05739795,  0.00609611,  0.05211947, ..., -0.07543751,
        -0.00104598, -0.02859927],
       [ 0.07848594,  0.04435731,  0.03013724, ..., -0.02485514,
        -0.02894735, -0.01820921]], dtype=float32)

In [61]:
# We use Projector by TensorFlow for the visualisations
# Create embedding files (we got this from TensorFlow's word embeddings documentation)
import io

out_v = io.open("vectors.tsv", "w", encoding = "utf-8")
out_m = io.open("metadata.tsv", "w", encoding = "utf-8")

for index, word in enumerate(words_in_vocab):
  if index == 0:
    continue # skip 0, it's padding
  vec = embed_weights[index]
  out_v.write("\t".join([str(x) for x in vec]) + "\n")
  out_m.write(word + "\n")
out_v.close()
out_m.close()

In [62]:
# # Download files from colab
# try:
#   from google.colab import files
#   files.download("vectors.tsv")
#   files.download("metadata.tsv")
# except Exception:
#   pass

## Recurrent Neural Networks (RNNs)

The premise of RNN is to use the representation of a previous input to aid the representation of a later input.
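
To illustrate that premise, here is a tiny scalar recurrent cell in pure Python. The weights are fixed, made-up values rather than learned ones, and `simple_rnn` is a hypothetical helper, so treat it as a sketch of the recurrence only.

```python
import math

def simple_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Scalar recurrent cell: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

states = simple_rnn([1.0, 0.0, 0.0])       # signal only at the first step
zero_states = simple_rnn([0.0, 0.0, 0.0])  # no signal at all

print(states)
print(zero_states)
```

The first input keeps influencing the later states even though the later inputs are zero, which is the "memory" that LSTM and GRU cells refine with gates.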

### Model 2: LSTM

LSTM = long short-term memory (one of the most popular RNN cells)

A typical structure of an RNN looks something like this:

```
Input (text) -> Tokenize -> Embedding -> Layers (RNNs/dense) -> Output (label probability)
```

In [63]:
# Create an LSTM model
from tensorflow.keras import layers
inputs = layers.Input(shape = (1,), dtype = "string")
x = text_vectorizer(inputs)
x = embedding(x)
# print(x.shape)
# x = layers.LSTM(64, return_sequences=True)(x) # when stacking RNN cells, we need to set `return_sequences = True`
# print(x.shape)
x = layers.LSTM(64)(x)
# print(x.shape)
# x = layers.Dense(64, activation = "relu")(x)
# print(x.shape)
outputs = layers.Dense(1, activation = "sigmoid")(x)
model_2 = tf.keras.Model(inputs, outputs, name="model_2_LSTM")

In [64]:
model_2.summary()

In [65]:
# Compile the model
model_2.compile(loss= "binary_crossentropy",
                optimizer = tf.keras.optimizers.Adam(),
                metrics = ["accuracy"])

In [66]:
# Fit the model
model_2_history = model_2.fit(train_sentences,
                              train_labels,
                              epochs = 5,
                              validation_data = (val_sentences, val_labels),
                              callbacks = [create_tensorboard_callback(SAVE_DIR, "model_2_LSTM")])

Saving TensorBoard log files to: model_logs/model_2_LSTM/20250127-223155
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 16s 45ms/step - accuracy: 0.8779 - loss: 0.3167 - val_accuracy: 0.7808 - val_loss: 0.5470
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 22ms/step - accuracy: 0.9405 - loss: 0.1567 - val_accuracy: 0.7730 - val_loss: 0.6735
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 7s 29ms/step - accuracy: 0.9556 - loss: 0.1259 - val_accuracy: 0.7848 - val_loss: 0.6637
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 9s 22ms/step - accuracy: 0.9593 - loss: 0.1040 - val_accuracy: 0.7874 - val_loss: 0.7677
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 9s 40ms/step - accuracy: 0.9642 - loss: 0.0885 - val_accuracy: 0.7730 - val_loss: 1.0145


In [67]:
# Make predictions on the LSTM model
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 31ms/step


array([[1.0824631e-03],
       [6.7734462e-01],
       [9.9949527e-01],
       [5.2732602e-03],
       [5.8476109e-04],
       [9.9840397e-01],
       [7.4209613e-01],
       [9.9962485e-01],
       [9.9946272e-01],
       [3.1828952e-01]], dtype=float32)

In [68]:
# Convert model 2 pred probs to labels
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [69]:
val_labels

array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0,
       0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,

In [70]:
# Calculate model 2 results
model_2_results = calculate_results(y_true = val_labels,
                                    y_pred = model_2_preds)
model_2_results

y_true device: CPU
y_pred device: CPU


{'accuracy': 77.29658792650919,
 'precision': 0.7787736210393056,
 'recall': 0.7729658792650919,
 'f1': 0.7692056527531042}

In [71]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

### Model 3: GRU

Another popular and effective RNN component is GRU or Gated Recurrent Unit.

The GRU cell has similar features to an LSTM cell but has fewer parameters.
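
To put rough numbers on that difference, here is a sketch that counts trainable parameters for one layer of each type, using the sizes from this notebook (64 units, 128-dimensional embeddings). It assumes the standard gate layouts and Keras's default GRU setting (`reset_after=True`); the helper names are hypothetical and the counts are computed here, not read from the model summaries.

```python
def lstm_params(units, input_dim):
    """Four gate blocks, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def gru_params(units, input_dim, reset_after=True):
    """Three gate blocks; assumption: with reset_after=True each block
    carries two bias vectors instead of one."""
    biases = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + biases)

units, input_dim = 64, 128  # layer size and embedding size used in this notebook
print(lstm_params(units, input_dim))
print(gru_params(units, input_dim))
```

Under these assumptions the GRU layer has roughly three quarters of the LSTM layer's parameters, since it uses three gate blocks where the LSTM uses four.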

In [72]:
# Build an RNN using the GRU layer
from tensorflow.keras import layers
inputs = layers.Input(shape= (1,), dtype = "string")
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.GRU(64)(x)
# print(x.shape)
# x = layers.GRU(64, return_sequences = True)(x)
# print(x.shape)
# x = layers.LSTM(64, return_sequences = True)(x)
# print(x.shape)
# x = layers.GRU(64)(x)
# print(x.shape)
# x = layers.Dense(64, activation = "relu")(x)
# print(x.shape)
# x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation= "sigmoid")(x)
model_3 = tf.keras.Model(inputs, outputs, name= "model_3_GRU")

In [73]:
# Get a summary
model_3.summary()

In [74]:
# Compile the model
model_3.compile(loss = "binary_crossentropy",
                optimizer = tf.keras.optimizers.Adam(),
                metrics = ["accuracy"])

In [75]:
# Fit the model
model_3_history = model_3.fit(train_sentences,
                              train_labels,
                              epochs = 5,
                              validation_data = (val_sentences, val_labels),
                              callbacks  =[create_tensorboard_callback(SAVE_DIR, "model_3_GRU")])

Saving TensorBoard log files to: model_logs/model_3_GRU/20250127-223241
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 21s 69ms/step - accuracy: 0.8709 - loss: 0.2982 - val_accuracy: 0.7690 - val_loss: 0.7943
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 15s 44ms/step - accuracy: 0.9691 - loss: 0.0870 - val_accuracy: 0.7756 - val_loss: 0.8303
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 6s 23ms/step - accuracy: 0.9736 - loss: 0.0706 - val_accuracy: 0.7730 - val_loss: 0.8851
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 6s 28ms/step - accuracy: 0.9777 - loss: 0.0584 - val_accuracy: 0.7690 - val_loss: 1.0591
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 14s 44ms/step - accuracy: 0.9778 - loss: 0.0512 - val_accuracy: 0.7717 - val_loss: 1.0369


In [76]:
# Make some predictions with model_3
model_3_pred_probs = model_3.predict(val_sentences)
model_3_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 2s 55ms/step


array([[7.9239294e-04],
       [5.8916467e-01],
       [9.9939388e-01],
       [2.5423381e-01],
       [1.8653885e-04],
       [9.9861842e-01],
       [1.1015416e-01],
       [9.9977773e-01],
       [9.9966633e-01],
       [7.4862057e-01]], dtype=float32)

In [77]:
# Convert model 3 pred to labels
model_3_preds = tf.squeeze(tf.round(model_3_pred_probs))
model_3_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 0., 1., 1., 1.], dtype=float32)>
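
Applying `tf.round` to sigmoid outputs amounts to thresholding at 0.5 (a probability of exactly 0.5 goes to 0, because `tf.round` rounds half to even). Below is a minimal plain-Python sketch of the same step, using the ten probabilities printed above. The name `to_labels` and its `threshold` parameter are illustrative and not part of the notebook.

```python
# First ten prediction probabilities copied from the model_3 output above
pred_probs = [7.9239294e-04, 5.8916467e-01, 9.9939388e-01, 2.5423381e-01,
              1.8653885e-04, 9.9861842e-01, 1.1015416e-01, 9.9977773e-01,
              9.9966633e-01, 7.4862057e-01]

def to_labels(probs, threshold=0.5):
    """Map each probability to 1.0 if it is above the threshold, else 0.0."""
    return [1.0 if p > threshold else 0.0 for p in probs]

labels = to_labels(pred_probs)
print(labels)
```

An explicit threshold also makes it easy to trade precision for recall later, for example by lowering it so that more tweets are flagged as disasters.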

In [78]:
# Calculate model_3 results
model_3_results = calculate_results(y_true = val_labels,
                                    y_pred = model_3_preds)
model_3_results

y_true device: CPU
y_pred device: CPU


{'accuracy': 77.16535433070865,
 'precision': 0.7747861668850706,
 'recall': 0.7716535433070866,
 'f1': 0.7688960790251899}
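
Note that the reported recall equals the accuracy divided by 100. That is exactly what support-weighted averaging of the per-class recall produces, which suggests that the metrics are weighted averages. This is an assumption, since `calculate_results` is defined earlier in the notebook. The plain-Python sketch below illustrates the relationship; `weighted_scores` and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    support = Counter(y_true)  # number of true samples per class
    n = len(y_true)
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        cls_precision = tp / predicted if predicted else 0.0
        cls_recall = tp / count
        denom = cls_precision + cls_recall
        cls_f1 = 2 * cls_precision * cls_recall / denom if denom else 0.0
        weight = count / n  # weight each class by its share of the samples
        precision += weight * cls_precision
        recall += weight * cls_recall
        f1 += weight * cls_f1
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical toy labels, not taken from the dataset
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]
scores = weighted_scores(y_true, y_pred)
print(scores)
```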

### Model 4: Bidirectional RNN

In [79]:
# Input layer
inputs = layers.Input(shape=(1,), dtype="string")

# Pass the input through the TextVectorization layer
x = text_vectorizer(inputs)

# Embedding layer
x = embedding(x)

# Bidirectional LSTM layer
x = layers.Bidirectional(layers.LSTM(64))(x)

# Output layer
outputs = layers.Dense(1, activation="sigmoid")(x)

# Define the model
model_4 = tf.keras.Model(inputs, outputs, name="model_4_bidirectional")

In [80]:
model_4.summary()
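
The summary output is not reproduced here, so the following is a rough sketch of how the layer sizes can be worked out by hand. It assumes the standard LSTM parameterisation (four gates, each with input weights, recurrent weights and a bias) and an embedding dimension of 128, which matches the `(1, 15, 128)` embedding output shown later in this notebook. The helper name `lstm_params` is illustrative.

```python
def lstm_params(input_dim, units):
    """Weights in one LSTM layer: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)

embedding_dim = 128  # assumed from the embedding output shape in this notebook
units = 64           # layers.LSTM(64)

one_direction = lstm_params(embedding_dim, units)
bidirectional = 2 * one_direction  # forward and backward LSTMs have separate weights
output_features = 2 * units        # the default merge mode concatenates both directions
dense_params = output_features * 1 + 1  # Dense(1): one weight per feature plus a bias

print(one_direction, bidirectional, output_features, dense_params)
```

The doubling of the output features is why the `Dense` layer after a bidirectional layer sees 128 inputs rather than 64.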

In [81]:
model_4.compile(loss= "binary_crossentropy",
                optimizer = tf.keras.optimizers.Adam(),
                metrics = ["accuracy"])

In [82]:
model_4_history = model_4.fit(train_sentences,
                              train_labels,
                              epochs = 5,
                              validation_data = (val_sentences, val_labels),
                              callbacks = [create_tensorboard_callback(SAVE_DIR, "model_4_bidirectional")])

Saving TensorBoard log files to: model_logs/model_4_bidirectional/20250127-223345
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 24s 57ms/step - accuracy: 0.9361 - loss: 0.1991 - val_accuracy: 0.7703 - val_loss: 0.7858
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 22s 67ms/step - accuracy: 0.9750 - loss: 0.0680 - val_accuracy: 0.7677 - val_loss: 1.2395
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 15s 40ms/step - accuracy: 0.9783 - loss: 0.0555 - val_accuracy: 0.7651 - val_loss: 1.1846
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 11s 42ms/step - accuracy: 0.9811 - loss: 0.0433 - val_accuracy: 0.7730 - val_loss: 1.3284
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 7s 32ms/step - accuracy: 0.9818 - loss: 0.0398 - val_accuracy: 0.7651 - val_loss: 1.6328


In [83]:
# Make predictions with our bidirectional model
model_4_pred_probs = model_4.predict(val_sentences)
model_4_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 1s 37ms/step


array([[9.5913623e-05],
       [7.0982128e-01],
       [9.9998814e-01],
       [6.9593526e-02],
       [1.3889540e-05],
       [9.9990767e-01],
       [7.8987181e-02],
       [9.9999338e-01],
       [9.9998963e-01],
       [9.9919623e-01]], dtype=float32)

In [84]:
# Convert pred probs to label
model_4_preds = tf.squeeze(tf.round(model_4_pred_probs))
model_4_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 0., 1., 1., 1.], dtype=float32)>

In [85]:
# Calculate the results of our bidirectional model
model_4_results = calculate_results(y_true = val_labels,
                                    y_pred = model_4_preds)
model_4_results

y_true device: CPU
y_pred device: CPU


{'accuracy': 76.50918635170603,
 'precision': 0.7681519700378248,
 'recall': 0.7650918635170604,
 'f1': 0.7621795783524195}

## Convolutional Neural Networks for Text (and other types of sequences)

Unlike images, which are 2D, our text data is a 1D sequence, so we use `Conv1D` layers instead of `Conv2D`.

A typical structure of a Conv1D model for sequences:
```
Inputs (text) -> Tokenization -> Embedding -> Layer(s) (typically Conv1D + pooling) -> Outputs (class probabilities)
```

### Model 5: Conv1D

To understand more about the parameters of Conv layers, refer to CNN Explainer. Although it is built around 2D convolutions on images, the same concepts (filters, kernel size, stride, padding) carry over to 1D.

To understand the difference between padding types: https://stackoverflow.com/questions/37674306/what-is-the-difference-between-same-and-valid-padding-in-tf-nn-max-pool-of-t
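
As a quick check on the two padding modes, the output length of a 1D convolution can be computed directly. This is a small sketch of the usual TensorFlow convention; the function name `conv1d_output_length` is illustrative.

```python
import math

def conv1d_output_length(input_length, kernel_size, stride=1, padding="valid"):
    """Number of time steps produced by a 1D convolution."""
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the sequence
        return (input_length - kernel_size) // stride + 1
    if padding == "same":
        # Zero-padding is added so that only the stride shrinks the output
        return math.ceil(input_length / stride)
    raise ValueError(f"Unknown padding: {padding!r}")

print(conv1d_output_length(15, 5, padding="valid"))
print(conv1d_output_length(15, 5, padding="same"))
```

With a sequence of 15 tokens and `kernel_size = 5`, `"valid"` padding leaves 11 steps while `"same"` keeps all 15.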

In [86]:
# Test out our embedding layer, Conv1D layer and max pooling
from tensorflow.keras import layers
embedding_test = embedding(text_vectorizer(["this is a test sentence"]))
conv_1d = layers.Conv1D(filters = 64,
                        kernel_size = 5,
                        activation = "relu",
                        padding = "valid") # "valid": no padding, so the output is shorter than the input; "same": the output has the same length as the input
conv_1d_output = conv_1d(embedding_test)
max_pool = layers.GlobalMaxPool1D()
max_pool_output = max_pool(conv_1d_output)

embedding_test.shape, conv_1d_output.shape, max_pool_output.shape

(TensorShape([1, 15, 128]), TensorShape([1, 11, 64]), TensorShape([1, 64]))
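
To see where these shapes come from, here is a NumPy sketch of a `"valid"` 1D convolution with ReLU followed by global max pooling, for a single sequence without the batch dimension. It uses random numbers in place of the trained embedding and kernel weights, so only the shapes are meaningful; `conv1d_valid` is an illustrative helper, not the Keras implementation.

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """x: (steps, channels); kernels: (kernel_size, channels, filters); bias: (filters,)."""
    kernel_size = kernels.shape[0]
    out_steps = x.shape[0] - kernel_size + 1
    out = np.empty((out_steps, kernels.shape[2]))
    for t in range(out_steps):
        window = x[t:t + kernel_size]  # (kernel_size, channels)
        # Sum over the window positions and channels for every filter at once
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU activation

rng = np.random.default_rng(0)
x = rng.normal(size=(15, 128))                  # 15 tokens, 128-dim embeddings
kernels = rng.normal(size=(5, 128, 64)) * 0.05  # kernel_size=5, 64 filters
bias = np.zeros(64)

features = conv1d_valid(x, kernels, bias)  # one row per window position
pooled = features.max(axis=0)              # global max pool over the time axis

print(features.shape, pooled.shape)
```

Each filter slides over windows of 5 consecutive tokens, and global max pooling keeps only the strongest response of each filter, which gives one 64-dimensional vector per sequence regardless of its length.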

In [87]:
embedding_test

<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.09968702,  0.01158612, -0.04062085, ...,  0.03056909,
         -0.0659719 ,  0.024792  ],
        [-0.00156914,  0.01939017, -0.0317148 , ...,  0.01776266,
         -0.03972599, -0.06365714],
        [-0.0609202 , -0.00399698,  0.02519457, ..., -0.04771664,
          0.00081576, -0.02798671],
        ...,
        [-0.02234595, -0.04512218, -0.00076562, ..., -0.0219384 ,
         -0.05277974, -0.01304052],
        [-0.02234595, -0.04512218, -0.00076562, ..., -0.0219384 ,
         -0.05277974, -0.01304052],
        [-0.02234595, -0.04512218, -0.00076562, ..., -0.0219384 ,
         -0.05277974, -0.01304052]]], dtype=float32)>

In [88]:
conv_1d_output

<tf.Tensor: shape=(1, 11, 64), dtype=float32, numpy=
array([[[2.84039341e-02, 0.00000000e+00, 6.10829033e-02, 0.00000000e+00,
         3.72033790e-02, 1.65733323e-02, 3.90250087e-02, 0.00000000e+00,
         1.33969933e-02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 6.27776533e-02, 0.00000000e+00,
         0.00000000e+00, 4.26546186e-02, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 0.00000000e+00, 2.36390419e-02, 0.00000000e+00,
         0.00000000e+00, 5.20425513e-02, 9.07095894e-03, 6.26171529e-02,
         0.00000000e+00, 2.00997144e-02, 2.17938609e-02, 0.00000000e+00,
         0.00000000e+00, 9.04522240e-02, 2.50710566e-02, 0.00000000e+00,
         2.81327218e-02, 3.29274461e-02, 1.85706727e-02, 3.45048774e-03,
         0.00000000e+00, 2.96929944e-02, 0.00000000e+00, 0.00000000e+00

In [89]:
max_pool_output

<tf.Tensor: shape=(1, 64), dtype=float32, numpy=
array([[6.69846907e-02, 4.03208733e-02, 6.10829033e-02, 6.93718642e-02,
        3.72033790e-02, 3.78961824e-02, 3.90250087e-02, 7.93027729e-02,
        2.20949147e-02, 5.53927049e-02, 5.92305027e-02, 1.81952268e-02,
        1.74789596e-03, 5.87133691e-05, 0.00000000e+00, 0.00000000e+00,
        4.51718979e-02, 0.00000000e+00, 6.27776533e-02, 4.98539656e-02,
        2.65425607e-03, 4.26546186e-02, 4.99386936e-02, 0.00000000e+00,
        4.95734997e-02, 4.01981100e-02, 0.00000000e+00, 3.52405906e-02,
        3.07261348e-02, 4.36520651e-02, 1.00231506e-01, 3.97025868e-02,
        9.33929011e-02, 9.98166651e-02, 4.40925397e-02, 6.26171529e-02,
        7.23205507e-02, 4.66258563e-02, 2.17938609e-02, 2.61565298e-03,
        3.02346740e-02, 9.04522240e-02, 2.73866206e-02, 0.00000000e+00,
        2.81327218e-02, 6.27606884e-02, 7.35518783e-02, 2.53971536e-02,
        8.39910358e-02, 2.96929944e-02, 7.00012669e-02, 5.96448891e-02,
        0.00000

In [90]:
# Create 1-dimensional convolutional layer to model sequences

# Input layer
inputs = layers.Input(shape=(1,), dtype="string")

# Pass the input through the TextVectorization layer
x = text_vectorizer(inputs)

# Embedding layer
x = embedding(x)

# Conv1D layer
x = layers.Conv1D(filters = 64,
                  kernel_size = 5,
                  strides = 1,
                  activation = "relu",
                  padding = "valid")(x)

# Global max pooling layer
x = layers.GlobalMaxPool1D()(x)

# Output layer
outputs = layers.Dense(1, activation="sigmoid")(x)

# Define the model
model_5 = tf.keras.Model(inputs, outputs, name="model_5_conv1d")

In [91]:
# Compile the model
model_5.compile(loss = "binary_crossentropy",
                optimizer = tf.keras.optimizers.Adam(),
                metrics = ["accuracy"])

In [92]:
model_5.summary()
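
The summary output is not reproduced here. As a rough sketch, the sizes of the `Conv1D` and `Dense` layers can be worked out by hand, assuming the 128-dimensional embedding seen in the test cell above. The helper names are illustrative.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Each filter has kernel_size x in_channels weights plus one bias."""
    return kernel_size * in_channels * filters + filters

def dense_params(in_features, units):
    """One weight per input feature per unit, plus one bias per unit."""
    return in_features * units + units

conv = conv1d_params(kernel_size=5, in_channels=128, filters=64)
dense = dense_params(in_features=64, units=1)  # GlobalMaxPool1D outputs 64 features

print(conv, dense)
```

`GlobalMaxPool1D` has no trainable parameters of its own; it only reduces the time axis.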

In [93]:
# Fit the model
model_5_history = model_5.fit(train_sentences,
                              train_labels,
                              epochs = 5,
                              validation_data = (val_sentences, val_labels),
                              callbacks = [create_tensorboard_callback(SAVE_DIR, "Conv1D")])

Saving TensorBoard log files to: model_logs/Conv1D/20250127-223510
Epoch 1/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - accuracy: 0.9234 - loss: 0.1911 - val_accuracy: 0.7730 - val_loss: 0.9219
Epoch 2/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 3s 15ms/step - accuracy: 0.9770 - loss: 0.0683 - val_accuracy: 0.7638 - val_loss: 1.0340
Epoch 3/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 4s 19ms/step - accuracy: 0.9796 - loss: 0.0584 - val_accuracy: 0.7598 - val_loss: 1.1345
Epoch 4/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 4s 18ms/step - accuracy: 0.9779 - loss: 0.0517 - val_accuracy: 0.7520 - val_loss: 1.2228
Epoch 5/5
215/215 ━━━━━━━━━━━━━━━━━━━━ 4s 20ms/step - accuracy: 0.9816 - loss: 0.0472 - val_accuracy: 0.7598 - val_loss: 1.2336


In [94]:
# Make some predictions with the Conv1D model
model_5_pred_probs = model_5.predict(val_sentences)
model_5_pred_probs[:10]

24/24 ━━━━━━━━━━━━━━━━━━━━ 2s 59ms/step


array([[1.8327411e-01],
       [8.5548562e-01],
       [9.9994922e-01],
       [2.5532680e-02],
       [1.4852361e-07],
       [9.8942471e-01],
       [5.9278506e-01],
       [9.9988896e-01],
       [9.9999934e-01],
       [7.9447138e-01]], dtype=float32)

In [95]:
# Convert the probabilities to labels
model_5_preds = tf.squeeze(tf.round(model_5_pred_probs))
model_5_preds

<tf.Tensor: shape=(762,), dtype=float32, numpy=
array([0., 1., 1., 0., 0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 1., 1., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
       0., 1., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0.,
       1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 0., 1., 1., 1., 0.,
       1., 0., 0., 0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0.,
       0., 1., 1., 0., 1., 1., 0., 1., 1., 0., 1., 0., 0., 0., 0., 1., 0.,
       0., 1., 0., 0., 1., 1., 0., 1., 0., 0., 1., 0., 1., 1., 1., 1., 0.,
       1., 1., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
       0., 0., 1., 1., 0., 1., 0., 1., 0., 0., 1., 0., 0., 1., 0., 1., 1.,
       1., 1., 1., 0., 1., 0., 0., 0., 1., 1., 0., 1., 1., 1., 1., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 1., 0.,
       0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 0., 0

In [96]:
# Calculate the results of Conv1D model
model_5_results = calculate_results(y_true = val_labels,
                                    y_pred = model_5_preds)
model_5_results

y_true device: CPU
y_pred device: CPU


{'accuracy': 75.98425196850394,
 'precision': 0.7620917281539858,
 'recall': 0.7598425196850394,
 'f1': 0.7571687002970852}

In [97]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}
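
To compare the models side by side, the result dictionaries can be ranked by a single metric. The sketch below hard-codes the accuracy and F1 values printed in this section so that it runs on its own; in the notebook you would pass `baseline_results`, `model_3_results` and so on directly.

```python
# Accuracy and F1 values copied from the outputs printed in this section
results = {
    "baseline": {"accuracy": 79.26509186351706, "f1": 0.7862189758049549},
    "model_3_GRU": {"accuracy": 77.16535433070865, "f1": 0.7688960790251899},
    "model_4_bidirectional": {"accuracy": 76.50918635170603, "f1": 0.7621795783524195},
    "model_5_conv1d": {"accuracy": 75.98425196850394, "f1": 0.7571687002970852},
}

# Sort model names by F1-score, best first
ranked = sorted(results.items(), key=lambda item: item[1]["f1"], reverse=True)
for name, scores in ranked:
    print(f"{name:<24} accuracy={scores['accuracy']:.2f}%  f1={scores['f1']:.4f}")

best_name = ranked[0][0]
```

On these numbers none of the three deep models beats the baseline on the validation set.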

### Some extra bits

In [98]:
# To load a model with a custom layer saved in the HDF5 (.h5) format
# import tensorflow_hub as hub
# model.save("model_name.h5")
# loaded_model = tf.keras.models.load_model("model_name.h5",
#                                           custom_objects = {"KerasLayer": hub.KerasLayer})

In [99]:
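# To load a model with a custom layer saved in the SavedModel format
# (note: newer Keras versions may require a ".keras" or ".h5" file extension when saving)
# model.save("model_name")
# loaded_model = tf.keras.models.load_model("model_name")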

## Finding the most wrong examples

* If our model is not perfect, which examples is it getting wrong?
* Of these wrong examples, which ones is it getting *most* wrong? These are the ones where the prediction probability is furthest from the true label, i.e. the model is confident and incorrect.
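
One way to make "most wrong" concrete is to score each incorrect prediction by the distance between its predicted probability and its true label. The plain-Python sketch below uses made-up rows; the function name `most_wrong_examples` and the `error` field are illustrative, and the notebook itself sorts by `pred_prob` instead.

```python
# Hypothetical rows standing in for validation examples
rows = [
    {"text": "example A", "target": 0, "pred_prob": 0.98},
    {"text": "example B", "target": 1, "pred_prob": 0.03},
    {"text": "example C", "target": 1, "pred_prob": 0.91},
    {"text": "example D", "target": 0, "pred_prob": 0.60},
]

def most_wrong_examples(rows, threshold=0.5):
    """Keep misclassified rows and sort them by |target - pred_prob|, largest first."""
    wrong = [
        dict(row, error=abs(row["target"] - row["pred_prob"]))
        for row in rows
        if (row["pred_prob"] > threshold) != bool(row["target"])
    ]
    return sorted(wrong, key=lambda row: row["error"], reverse=True)

ranked = most_wrong_examples(rows)
for row in ranked:
    print(row["text"], row["target"], row["pred_prob"], round(row["error"], 2))
```

Ranking by this error puts confident false positives and confident false negatives in a single list, rather than at opposite ends of a list sorted by probability.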

In [110]:
# Create a DataFrame with validation sentences, validation labels, and model_5 prediction labels + probabilities
val_df = pd.DataFrame({
    "text": val_sentences,
    "target": val_labels,
    "pred": model_5_preds,
    "pred_prob": tf.squeeze(model_5_pred_probs)
})
val_df.head()

| | text | target | pred | pred_prob |
|---|---|---|---|---|
| 0 | DFR EP016 Monthly Meltdown - On Dnbheaven 2015... | 0 | 0.0 | 0.1832741 |
| 1 | FedEx no longer to transport bioterror germs i... | 0 | 1.0 | 0.8554856 |
| 2 | Gunmen kill four in El Salvador bus attack: Su... | 1 | 1.0 | 0.9999492 |
| 3 | @camilacabello97 Internally and externally scr... | 1 | 0.0 | 0.02553268 |
| 4 | Radiation emergency #preparedness starts with ... | 1 | 0.0 | 1.485236e-07 |


In [111]:
# Find the wrong predictions and sort by prediction probabilities
most_wrong = val_df[val_df["target"] != val_df["pred"]].sort_values("pred_prob", ascending = False)
most_wrong.head() # The top rows are false positives (predicted 1, target 0)

| | text | target | pred | pred_prob |
|---|---|---|---|---|
| 698 | åÈMGN-AFRICAå¨ pin:263789F4 åÈ Correction: Ten... | 0 | 1.0 | 0.999958 |
| 206 | Head on head collision Ima problem and nobody ... | 0 | 1.0 | 0.999938 |
| 303 | Trafford Centre film fans angry after Odeon ci... | 0 | 1.0 | 0.999873 |
| 156 | @cjbanning 4sake of argsuppose pre-born has at... | 0 | 1.0 | 0.999837 |
| 291 | He made such a good point. White person coming... | 0 | 1.0 | 0.999601 |


Target labels are of the following format:
* `0`= not disaster
* `1`= disaster
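
Given this label format, every prediction falls into one of four outcomes. The small sketch below labels the first five validation rows shown in `val_df.head()` above; the `outcome` helper is illustrative.

```python
from collections import Counter

def outcome(target, pred):
    """Name the confusion-matrix cell for one (target, pred) pair."""
    if pred == 1:
        return "true positive" if target == 1 else "false positive"
    return "false negative" if target == 1 else "true negative"

# target and pred columns of the first five rows of val_df shown above
targets = [0, 0, 1, 1, 1]
preds = [0, 1, 1, 0, 0]

outcomes = [outcome(t, p) for t, p in zip(targets, preds)]
counts = Counter(outcomes)
print(outcomes)
print(counts)
```

A false positive here means a harmless tweet flagged as a disaster, while a false negative is a real disaster tweet that the model missed.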

In [112]:
most_wrong.tail() # The bottom rows are false negatives (predicted 0, target 1)

| | text | target | pred | pred_prob |
|---|---|---|---|---|
| 457 | Two hours to get to a client meeting. Whirlwin... | 1 | 0.0 | 1.52298e-07 |
| 4 | Radiation emergency #preparedness starts with ... | 1 | 0.0 | 1.485236e-07 |
| 274 | Crazy Mom Threw Teen Daughter a NUDE Twister S... | 1 | 0.0 | 6.302521e-08 |
| 586 | World War II book LIGHTNING JOE An Autobiograp... | 1 | 0.0 | 4.272444e-08 |
| 627 | Owner of Chicago-Area Gay Bar Admits to Arson ... | 1 | 0.0 | 3.030409e-08 |


In [113]:
# Check the false positives (model predicted 1 when it should have been 0)
for row in most_wrong[:10].itertuples():
  _, text, target, pred, pred_prob = row
  print(f"Target: {target}, Pred: {int(pred)}, Pred Prob: {pred_prob}")
  print(f"Text:\n{text}\n")

Target: 0, Pred: 1, Pred Prob: 0.9999580979347229
Text:
åÈMGN-AFRICAå¨ pin:263789F4 åÈ Correction: Tent Collapse Story: Correction: Tent Collapse story åÈ http://t.co/fDJUYvZMrv @wizkidayo

Target: 0, Pred: 1, Pred Prob: 0.9999383091926575
Text:
Head on head collision Ima problem and nobody can solve em on Long division

Target: 0, Pred: 1, Pred Prob: 0.9998728036880493
Text:
Trafford Centre film fans angry after Odeon cinema evacuated following false fire alarm   http://t.co/6GLDwx71DA

Target: 0, Pred: 1, Pred Prob: 0.9998369216918945
Text:
@cjbanning 4sake of argsuppose pre-born has attained individl rights.Generally courtof law forbids killing unless dead person did something

Target: 0, Pred: 1, Pred Prob: 0.9996011257171631
Text:
He made such a good point. White person comings mass murder labelled as criminal minority does the same thing... http://t.co/37qPsSnaCv

Target: 0, Pred: 1, Pred Prob: 0.9994409680366516
Text:
A change in the State fire code prohibits grills on decks at 

In [115]:
# Check the false negatives (model predicted 0 when it should have been 1)
for row in most_wrong[-10:].itertuples():
  _, text, target, pred, pred_prob = row
  print(f"Target: {target}, Pred: {int(pred)}, Pred Prob: {pred_prob}")
  print(f"Text:\n{text}\n")

Target: 1, Pred: 0, Pred Prob: 4.1115200133390317e-07
Text:
So I pick myself off the ground and swam before I drowned. Hit the bottom so hard I bounced twice suffice this time around is different.

Target: 1, Pred: 0, Pred Prob: 3.5059991887465003e-07
Text:
@SaintRobinho86 someone has to be at the bottom of every league. Tonight clearly demonstrated why the Lions are where they are - sunk!

Target: 1, Pred: 0, Pred Prob: 3.282733871401433e-07
Text:
Just came back from camping and returned with a new song which gets recorded tomorrow. Can't wait! #Desolation #TheConspiracyTheory #NewEP

Target: 1, Pred: 0, Pred Prob: 2.637660827531363e-07
Text:
#ClimateChange Eyewitness to Extreme Weather: 11 Social Media Posts that Show Just How Crazy Things A... http://t.co/czpDn9oBiT #Anarchy

Target: 1, Pred: 0, Pred Prob: 1.9019692842903169e-07
Text:
You can never escape me. Bullets don't harm me. Nothing harms me. But I know pain. I know pain. Sometimes I share it. With someone like you.

Target: 

## The speed/score tradeoff

In [125]:
# # Let's make a function to measure the time of prediction
# import time
# def pred_timer(model, samples):
#   """
#   Times how long a model takes to make predictions on samples
#   """
#   start_time = time.perf_counter()
#   model.predict(samples)
#   end_time = time.perf_counter()
#   duration = end_time - start_time
#   print(f"Prediction time: {duration:.2f} seconds")
#   time_per_pred = duration / len(samples)
#   print(f"Time per prediction: {time_per_pred:.2f} seconds")
#   return duration, time_per_pred
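
The timing function above is commented out. As a standalone sketch of the same idea, the version below times any prediction callable, so it can be tried without a trained model. `dummy_predict` is a hypothetical stand-in for `model.predict`, and the sample strings are made up.

```python
import time

def time_predictions(predict_fn, samples):
    """Return (total seconds, seconds per sample) for one call to predict_fn."""
    start = time.perf_counter()
    predict_fn(samples)
    duration = time.perf_counter() - start
    return duration, duration / len(samples)

def dummy_predict(samples):
    # Hypothetical stand-in for model.predict: returns one fake label per sample
    return [len(sample) % 2 for sample in samples]

samples = ["first example tweet", "second example tweet", "third one"]
total_time, time_per_pred = time_predictions(dummy_predict, samples)

# Per-sample times are often far below 0.01 s, so print more than two decimals
print(f"Total: {total_time:.6f}s, per sample: {time_per_pred:.6f}s")
```

In the notebook, `predict_fn` would be something like `model_5.predict` and `samples` would be `val_sentences`.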

In [None]:
# import matplotlib.pyplot as plt

# plt.figure(figsize= (10, 7))
# plt.scatter(baseline_time_per_pred, baseline_results["f1"], label = "baseline")
# plt.scatter(model_5_time_per_pred, model_5_results["f1"], label = "model_5")
# plt.legend()
# plt.title("F1-score vs Prediction time")
# plt.xlabel("Prediction time (seconds)")
# plt.ylabel("F1-score")
# plt.show()