In [1]:
# Check for GPU
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-b54e83fe-0a3b-0124-679a-376eec338e7f)


Binary classification (real disaster vs. not a real disaster) using Twitter data from Kaggle: https://www.kaggle.com/c/nlp-getting-started

In [2]:
# Download helper functions script
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2021-07-31 14:52:28--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2021-07-31 14:52:28 (102 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [3]:
# Import series of helper functions for the notebook
from helper_functions import unzip_data, create_tensorboard_callback, plot_loss_curves, compare_historys

In [4]:
# Download data (same as from Kaggle)
!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

# Unzip data
unzip_data("nlp_getting_started.zip")

--2021-07-31 14:52:31--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.135.128, 74.125.142.128, 74.125.195.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.135.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2021-07-31 14:52:31 (148 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



Visualizing Text Dataset

In [5]:
# Turn .csv files into pandas DataFrames
import pandas as pd
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [6]:
# Shuffle training dataframe
train_df_shuffled = train_df.sample(frac=1, random_state=42) # shuffle with random_state=42 for reproducibility
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [7]:
# The test data doesn't have a target (that's what we'd try to predict)
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [8]:
# How many examples of each class?
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64
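
The counts above suggest the two classes are fairly balanced. As a quick check, the proportions can be worked out directly from those numbers. This is a small sketch; `class_counts` and `class_share` are hypothetical names, and the values are copied from the output above.

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 4342, 1: 3271}
total = sum(class_counts.values())

# Share of each class as a percentage of all training samples
class_share = {label: round(100 * count / total, 1) for label, count in class_counts.items()}
print(total)        # 7613
print(class_share)  # {0: 57.0, 1: 43.0}
```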

In [9]:
# How many samples total?
print(f"Total training samples: {len(train_df)}")
print(f"Total test samples: {len(test_df)}")
print(f"Total samples: {len(train_df) + len(test_df)}")

Total training samples: 7613
Total test samples: 3263
Total samples: 10876


In [10]:
# Let's visualize some random training examples
import random
random_index = random.randint(0, len(train_df)-5) # pick a random start index that leaves room for 5 consecutive rows
for row in train_df_shuffled[["text", "target"]][random_index:random_index+5].itertuples():
  _, text, target = row
  print(f"Target: {target}", "(real disaster)" if target > 0 else "(not real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target: 0 (not real disaster)
Text:
@abcnews UK scandal of 2009 caused major upheaval to Parliamentary expenses with subsequent sackings and prison. What are we waiting for?

---

Target: 1 (real disaster)
Text:
being stuck on a sleeper train for 24 hours after de-railing due to a landslide was most definitely the pit of the trip

---

Target: 1 (real disaster)
Text:
http://t.co/iXiYBAp8Qa The Latest: More homes razed by Northern California wildfire - Lynchburg News and Advance http://t.co/zEpzQYDby4

---

Target: 1 (real disaster)
Text:

---

Target: 0 (not real disaster)
Text:
seriously look like a get electrocuted after I blow dry my hair it's really attractive ??

---



**Split data into training and validation sets:**

Since the test set has no labels and we need a way to evaluate our trained models, we'll split off some of the training data to create a validation set.


In [11]:
from sklearn.model_selection import train_test_split

# Use train_test_split to split training data into training and validation sets
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1, # dedicate 10% of samples to validation set
                                                                            random_state=42) # random state for reproducibility

In [12]:
# Check the lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)

(6851, 6851, 762, 762)
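
The lengths above follow from the 10% validation split. The sketch below assumes the validation size is rounded up to a whole sample and the remainder goes to training. That rounding rule is an assumption here, although it is consistent with the output shown.

```python
import math

n_samples = 7613   # total labelled training samples (from the earlier output)
test_size = 0.1    # fraction held out for validation

n_val = math.ceil(test_size * n_samples)  # round up to a whole sample
n_train = n_samples - n_val               # everything else is used for training
print(n_train, n_val)  # 6851 762
```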

In [13]:
# View the first 10 training sentences and their labels
train_sentences[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object), array([0, 

In [14]:
import tensorflow as tf
# Note: in TensorFlow 2.6+ this layer can be imported from tensorflow.keras.layers instead
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Use the default TextVectorization
text_vectorizer = TextVectorization(max_tokens=None, # how many words in the vocabulary (all of the different words in your text)
                                    standardize="lower_and_strip_punctuation", # how to process text
                                    split="whitespace", # how to split tokens
                                    ngrams=None, # create groups of n-words?
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length=None, # how long should the output sequence of tokens be?
                                    pad_to_max_tokens=True) # per the Keras docs, only applies to the "multi_hot", "count" and "tf_idf" output modes

In [15]:
# Find average number of tokens (words) in training Tweets
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [16]:
# Setup text vectorization variables
max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (e.g. how many words from a Tweet does our model see?)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)

In [17]:
# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_sentences)

In [18]:
# Choose a random sentence from the training dataset and tokenize it
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nVectorized version:")
text_vectorizer([random_sentence])

Original text:
Officer Wounded Suspect Killed in Exchange of Gunfire: Richmond police officer wounded suspect killed in exc... http://t.co/zDHwRN6cZc      

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 677,  269,  430,  111,    4, 1861,    6, 2098, 1607,   77,  677,
         269,  430,  111,    4]])>

In [19]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5] # most common tokens (notice the [UNK] token for "unknown" words)
bottom_5_words = words_in_vocab[-5:] # least common tokens
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"Top 5 most common words: {top_5_words}") 
print(f"Bottom 5 least common words: {bottom_5_words}")

Number of words in vocab: 10000
Top 5 most common words: ['', '[UNK]', 'the', 'a', 'in']
Bottom 5 least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']
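
To make the mapping from text to integers more concrete, here is a rough pure-Python sketch of what an integer text vectorizer does: standardize, split on whitespace, look up each token, then truncate or pad. It is an approximation for illustration, not the Keras implementation. The helper names (`standardize`, `build_vocab`, `vectorize`) and the toy corpus are hypothetical, and tie-breaking between equally frequent words may differ from `TextVectorization`.

```python
import string
from collections import Counter

PAD_TOKEN, UNK_TOKEN = "", "[UNK]"  # index 0 and index 1, as in the vocabulary above

def standardize(text):
    """Lowercase and strip ASCII punctuation (approximates "lower_and_strip_punctuation")."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Keep the most frequent tokens, reserving two slots for padding and unknown."""
    counts = Counter(tok for text in texts for tok in standardize(text).split())
    return [PAD_TOKEN, UNK_TOKEN] + [tok for tok, _ in counts.most_common(max_tokens - 2)]

def vectorize(text, vocab, seq_len):
    """Map tokens to vocabulary indices, then truncate or zero-pad to seq_len."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))

# Hypothetical toy corpus, not taken from the dataset
toy_texts = ["Flood warning in the city", "The city is on fire!"]
toy_vocab = build_vocab(toy_texts, max_tokens=10000)

short_vec = vectorize("Tornado in the city", toy_vocab, seq_len=6)
long_vec = vectorize("the city the city the city the city", toy_vocab, seq_len=6)
print(short_vec)  # the unseen word "tornado" maps to 1; trailing zeros are padding
print(long_vec)   # 8 tokens truncated to the first 6
```

The same two behaviours show up in the real outputs: Tweets longer than `max_length` are cut off at 15 tokens, and shorter ones are filled with the padding index 0.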


Creating an Embedding using an Embedding Layer

In [20]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=max_vocab_length, # size of the vocabulary (number of distinct token indices)
                             output_dim=128, # set size of embedding vector
                             embeddings_initializer="uniform", # default, initialize randomly
                             input_length=max_length) # how long is each input

embedding

<tensorflow.python.keras.layers.embeddings.Embedding at 0x7fc7202fb650>

In [21]:
# Get a random sentence from training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nEmbedded version:")

# Embed the random sentence (turn it into numerical representation)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
Suspect in latest theater attack had psychological issues http://t.co/3huhZxliiG      

Embedded version:


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[ 0.02812335, -0.0176071 ,  0.03786448, ...,  0.01181672,
         -0.01387002, -0.00567365],
        [-0.00872104,  0.0065316 ,  0.03304878, ..., -0.02280067,
          0.0405825 ,  0.01073887],
        [ 0.01852873,  0.00743871,  0.03585949, ...,  0.04838305,
          0.03472928, -0.03265958],
        ...,
        [-0.01419514, -0.02437115,  0.04102958, ...,  0.02906257,
          0.01228492, -0.03158795],
        [-0.01419514, -0.02437115,  0.04102958, ...,  0.02906257,
          0.01228492, -0.03158795],
        [-0.01419514, -0.02437115,  0.04102958, ...,  0.02906257,
          0.01228492, -0.03158795]]], dtype=float32)>
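
Conceptually, an embedding layer is a lookup table with one row per token index. The NumPy sketch below illustrates that idea under a few assumptions: the token IDs are made up, and the matrix is filled with uniform random values in [-0.05, 0.05] to mimic an untrained layer. It is not how Keras implements the layer internally.

```python
import numpy as np

vocab_size, embed_dim = 10000, 128  # same sizes as max_vocab_length and output_dim above
rng = np.random.default_rng(42)

# One 128-dimensional row per token index, randomly initialized (untrained)
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim)).astype(np.float32)

# Hypothetical vectorized sentence: 9 real tokens followed by 6 padding zeros
token_ids = np.array([[430, 4, 2098, 1861, 269, 77, 111, 677, 1, 0, 0, 0, 0, 0, 0]])

# Indexing the matrix with the IDs replaces each ID with its row
embedded = embedding_matrix[token_ids]
print(embedded.shape)  # (1, 15, 128)

# Padding positions all point at row 0, so they share the same vector
print(np.array_equal(embedded[0, -1], embedded[0, -2]))  # True
```

This is presumably why the trailing rows in the embedded output above are identical: the sentence is shorter than 15 tokens, so its last positions are all the padding index.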

In [22]:
# Check out a single token's embedding
sample_embed[0][0]

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([ 0.02812335, -0.0176071 ,  0.03786448,  0.00695734, -0.00672927,
        0.02763915,  0.0128667 , -0.01421726, -0.04793794, -0.01006224,
        0.03066934,  0.00179293, -0.02682073,  0.04616796, -0.01939805,
       -0.0458983 , -0.01844096, -0.03710938, -0.02878438,  0.03444291,
       -0.02859312, -0.01718848,  0.01003831,  0.0329821 ,  0.01743137,
       -0.00372002, -0.03440138, -0.03809752, -0.03343268, -0.01246266,
       -0.03633636,  0.00937042,  0.0407165 ,  0.00897517,  0.03005395,
       -0.01248194, -0.01402598, -0.0052564 , -0.00946089,  0.04279665,
        0.03493786,  0.01278594, -0.04754704,  0.00219763, -0.0445423 ,
       -0.02402159,  0.02927131, -0.03641327, -0.01755774,  0.00415776,
        0.00480007, -0.04481507, -0.01972262,  0.03826382, -0.02756721,
        0.04854279,  0.00749606,  0.01365519,  0.04790497, -0.00587201,
        0.04838491, -0.04700221, -0.02711377, -0.02340456,  0.04698041,
        0.030670

**Model Building**
We will begin with a baseline model and then compare it with other models.
More specifically, we'll be building the following:

    Model 0: Naive Bayes (baseline)
    Model 1: Feed-forward neural network (dense model)
    Model 2: LSTM model
    Model 3: GRU model
    Model 4: Bidirectional-LSTM model
    Model 5: 1D Convolutional Neural Network
    Model 6: TensorFlow Hub Pretrained Feature Extractor
    Model 7: Same as model 6 with 10% of training data


Model 0 is the simplest. It gives us a baseline that we expect each of the deeper models to beat.

Each experiment will go through the following steps:

    Construct the model
    Train the model
    Make predictions with the model
    Track prediction evaluation metrics for later comparison
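
For the last step, one option is a small helper that returns the same set of metrics for every model, so the results are easy to compare later. The sketch below is dependency-free and assumes binary labels (0 or 1). The function name `binary_classification_metrics` and the toy labels are hypothetical and do not come from the notebook.

```python
def binary_classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels for illustration only
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
toy_results = binary_classification_metrics(toy_true, toy_pred)
print(toy_results)
```

Keeping the results as a dictionary per model makes it straightforward to collect them into a single DataFrame for a side-by-side comparison.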


**Model 0**
To create our baseline, we'll create a Scikit-Learn Pipeline using the TF-IDF (term frequency-inverse document frequency) formula to convert our words to numbers and then model them with the Multinomial Naive Bayes algorithm.
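
A minimal sketch of such a pipeline is shown below. It assumes default settings for `TfidfVectorizer` and `MultinomialNB` and uses a tiny made-up dataset so that it runs on its own. In the notebook, the pipeline would instead be fit on `train_sentences` and `train_labels` and scored on the validation split. The step names `"tfidf"` and `"clf"` are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = real disaster, 0 = not a real disaster)
toy_sentences = [
    "forest fire near the town",
    "flood waters rising fast",
    "earthquake damages homes",
    "i love this song",
    "great day at the beach",
    "my exam was a disaster lol",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Chain the two steps: text -> TF-IDF features -> Naive Bayes classifier
model_0 = Pipeline([
    ("tfidf", TfidfVectorizer()),  # convert words to numbers using TF-IDF
    ("clf", MultinomialNB()),      # model the resulting features
])
model_0.fit(toy_sentences, toy_labels)

# Predict on unseen text and report accuracy on the toy training data
toy_preds = model_0.predict(["wildfire spreading near homes"])
toy_score = model_0.score(toy_sentences, toy_labels)
print(toy_preds, toy_score)
```

Because the pipeline accepts raw strings, it needs no separate tokenization step, which keeps the baseline cheap to build and quick to train.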