<a href="https://colab.research.google.com/github/ashikshafi08/Learning_Tensorflow/blob/main/Notebooks/Introduction_to_NLP_with_TensorFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Check which GPU we're running on
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



# Introduction to Natural Language Processing with TensorFlow 

Natural language processing (NLP) aims to derive information from natural language, which may come as text or speech (both are sequences).

Many NLP problems are also referred to as sequence-to-sequence (seq2seq) problems.

### What we're going to cover 
- Downloading and preparing the dataset.
- Preparing text data for modelling (tokenization and embedding).
- Setting up multiple modelling experiments with recurrent neural networks (RNNs).
- Building a text feature extraction model using TensorFlow Hub.
- Finding the most wrong prediction samples.
- Using a model we've built to make predictions on text from the wild.

Natural language processing (NLP) and natural language understanding (NLU) are closely related, like siblings. NLU is often treated as a part of NLP.

### Example NLP problems

- **Classification ->** What tags should this article have? (There can be multiple label options per sample.) [Text classification]
- **Text generation ->** Training a model on all of Shakespeare's works and using it to generate new Shakespeare-style poems or text.
- **Machine translation ->** We input one sequence of words and the model returns another sequence: the same content translated into another language.

    Many NLP and NLU problems are referred to as sequence-to-sequence problems.

- **Voice assistant ->** Takes in one sequence (sound waves) and converts it into text, then derives meaning from that text.

### Other Sequence problems 
- One to one -> One input to one output.
- **One to many -> One input to many outputs.**
    - Image captioning: given an input image, our model returns a sequence (the caption) as output.

- **Many to one -> Many inputs to one output.**
    - Sentiment analysis: given a sequence of words, our model analyzes the sentiment of the sequence and returns a single output (positive or negative).
    - Time series forecasting: the input is a sequence of historical Bitcoin prices and the output is the predicted price on our desired day.
- Many to many -> Many inputs to many outputs.
    - Here the input and output are not synchronised. The best example is machine translation: the input is a sequence of words and the output is another sequence of words, which may differ in length.
    - **In NLP, words are also known as tokens.**
- Many to many (synchronised)

To get hands-on with NLP in TensorFlow, we're going to practice the steps we've used previously but this time with text data:

```
Text -> turn into numbers -> build a model -> train the model to find patterns -> use patterns (make predictions)
```


In [2]:
# Importing tensorflow 
import tensorflow as tf
print(tf.__version__)

2.4.1


### Getting the helper functions 


In [3]:
# Download helper functions script
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2021-05-18 20:55:25--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2021-05-18 20:55:25 (93.7 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [4]:
# Import the needed helper functions for the notebook 
from helper_functions import unzip_data , create_tensorboard_callback , plot_loss_curves , compare_historys

## Get a text dataset 

The dataset we're going to use is Kaggle's Introduction to NLP dataset (text samples of tweets labelled as disaster or not disaster).

See the original source here https://www.kaggle.com/c/nlp-getting-started/data

In [5]:
# Download the data from Kaggle 
!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

# Unzipping the data 
unzip_data('nlp_getting_started.zip')

--2021-05-18 20:55:26--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.112.128, 74.125.124.128, 172.217.212.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.112.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2021-05-18 20:55:26 (97.1 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Become one with the data 

We'll visualize and explore our text data here.

Text datasets come in many different formats. Aside from CSV files, we'll probably encounter `.txt` and `.json` files too. The articles below will help when that happens:
* [How to Read and Write Files in Python](https://realpython.com/read-write-files-python/)
* [Working with JSON Data in Python](https://realpython.com/python-json/)
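As a minimal sketch of the two formats mentioned above, the snippet below writes a small `.txt` and `.json` file and reads them back using only the Python standard library. The file names and sample contents are made up for illustration and are not part of the Kaggle dataset.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample data, invented for this example
sample_lines = ["Forest fire near La Ronge", "What a lovely day"]
sample_records = [{"text": "Forest fire near La Ronge", "target": 1}]

with tempfile.TemporaryDirectory() as tmp_dir:
    txt_path = Path(tmp_dir) / "tweets.txt"
    json_path = Path(tmp_dir) / "tweets.json"

    # Write the sample files so the example is self-contained
    txt_path.write_text("\n".join(sample_lines), encoding="utf-8")
    json_path.write_text(json.dumps(sample_records), encoding="utf-8")

    # Read a plain-text file: one sample per line
    with open(txt_path, encoding="utf-8") as f:
        lines_read = [line.rstrip("\n") for line in f]

    # Read a JSON file into Python lists and dictionaries
    with open(json_path, encoding="utf-8") as f:
        records_read = json.load(f)

print(lines_read)
print(records_read)
```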

In [6]:
# Importing pandas to look into our csv file 
import pandas as pd 
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# What's inside our train dataframe? 
train_df.head()

|  | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [7]:
# Get the text of the first training sample
train_df['text'][0]

'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'

Our goal now is to build a model that predicts the target: whether a tweet is about a real disaster (1) or not (0).

In [8]:
# Shuffle the training dataframe 
train_df_shuffled = train_df.sample(frac = 1 , 
                                    random_state = 42)

# Looking into our shuffled dataframe 
train_df_shuffled.head()

|  | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2644 | 3796 | destruction |  | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge |  | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock |  | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |


In [9]:
# What does the test dataframe look like?
test_df.head() 

|  | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 |  |  | Just happened a terrible car crash |
| 1 | 2 |  |  | Heard about #earthquake is different cities, s... |
| 2 | 3 |  |  | there is a forest fire at spot pond, geese are... |
| 3 | 9 |  |  | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 |  |  | Typhoon Soudelor kills 28 in China and Taiwan |


The test dataframe has the same columns as the training dataframe but no targets. We'll use the text column to predict whether new tweets are about a disaster or not.

In [10]:
# How many examples of each class? 
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64

Our targets aren't perfectly balanced, but at roughly 57% (not disaster) to 43% (disaster) the split is reasonably close.

If you do have an imbalanced target class, refer to https://www.tensorflow.org/tutorials/structured_data/imbalanced_data
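As a small worked example, the class shares can be computed directly from the counts above. The sketch below also derives "balanced" class weights with the common heuristic `total / (num_classes * count)`; a dictionary of this shape can be passed as the `class_weight` argument of Keras `fit`. Whether weighting is worth it for a split this mild is a judgment call, so treat this as an illustration rather than a recommendation.

```python
# Class counts taken from the value_counts() output above
class_counts = {0: 4342, 1: 3271}
total = sum(class_counts.values())

# Share of each class in the training data
class_share = {label: count / total for label, count in class_counts.items()}

# "Balanced" weights: rarer classes get proportionally larger weights
class_weight = {label: total / (len(class_counts) * count)
                for label, count in class_counts.items()}

for label in class_counts:
    print(f"class {label}: {class_share[label]:.1%} of samples, "
          f"weight {class_weight[label]:.3f}")
```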

In [11]:
# How many total samples in both sets? 

len(train_df) , len(test_df)

(7613, 3263)

In [12]:
# Let's visualize some random training examples. 
import random 

# Pick a random start index that leaves room for 5 consecutive samples
random_index = random.randint(0 , len(train_df) - 5) 

# itertuples() yields (index, text, target) tuples
for row in train_df_shuffled[['text' , 'target']][random_index:random_index+5].itertuples():
  _, text , target = row
  print(f'Target: {target}' , "(real disaster)" if target > 0 else "(not real disaster)")
  print(f'Text:\n{text}\n')
  print('----\n')


Target: 1 (real disaster)
Text:
16 dead in Russia bus accident: At least 16 people were killed and 26 others injured when two buses collided i... http://t.co/ybyP68ieVn

----

Target: 1 (real disaster)
Text:
#BHRAMABULL Watch Run The Jewels Use Facts to Defend Rioting in Ferguson: The socially minded duo takes on the... http://t.co/Ld5P1sIa2N

----

Target: 1 (real disaster)
Text:
I visited Hiroshima in 2006. It is an incredible place. This model shows devastation of the bomb. http://t.co/Gid6jqN8UG

----

Target: 1 (real disaster)
Text:
I scored 111020 points in PUNCH QUEST stopped when a squeaky bat collided into my skull. http://t.co/aEtgbxm1pL

----

Target: 1 (real disaster)
Text:
Someone teaching you that obedience will obliterate trials in your life is trying to sell you a used car. Jesus's life blows that theory.'

----



Although we have train and test datasets, it's good practice to create a validation dataset too.

### Split data into training and validation set 

In [13]:
from sklearn.model_selection import train_test_split

# Let's split training data into train and val set
train_sentences , val_sentences , train_labels , val_labels = train_test_split(train_df_shuffled['text'].to_numpy() , 
                                                                               train_df_shuffled['target'].to_numpy() , 
                                                                               test_size = 0.1,  # Use 10% of the training data for validation set 
                                                                               random_state = 42)

In [14]:
# Checking the shapes of our splits 

train_sentences.shape , train_labels.shape , val_sentences.shape , val_labels.shape

((6851,), (6851,), (762,), (762,))

In [15]:
# Number of samples 
print(f'Number of samples in train set: {len(train_sentences)}')
print(f'Number of samples in validation set: {len(val_sentences)}')

Number of samples in train set: 6851
Number of samples in validation set: 762


In [16]:
# Check the first 10 samples 
train_sentences[:10] , train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object), array([0, 

Great, now we know what our data looks like. Our targets are already numbers, so we don't have to worry about them, but our features (the text) still need converting.

So the next step is to turn the text into numbers!

## Converting text into numbers 

The challenge now is to convert our text into numbers. There are two techniques we can use:

- **Tokenization**
- **Embeddings**

Let's look into them one by one. 

#### Tokenization

Tokenization is a direct mapping from a word, character or sub-word to a numerical value. There are three main levels of tokenization:

- Using **word-level tokenization** with the sentence 'I love TensorFlow' might result in 'I' being `0`, 'love' being `1` and 'TensorFlow' being `2`. In this case, every word in a sequence is considered a single **token**.
- **Character-level tokenization**, such as converting the letters A-Z to the values `1-26`. In this case, every character in a sequence is considered a single token.
- **Sub-word tokenization** sits between word-level and character-level tokenization. It involves breaking individual words into smaller parts and then converting those smaller parts into numbers.

    For example, 'my favourite food is pineapple pizza' might become "my, fav, avour, rite, fo, oo, od, is, pin, ine, app, le, piz, za". These sub-words would then be mapped to numerical values. In this case, every word could be considered multiple tokens.
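To make the first two levels concrete, here is a minimal pure-Python sketch of word-level and character-level tokenization for the sentence used above. The mapping rules (index by first appearance, `a=1 ... z=26`, `0` for anything else) are illustrative choices, not what any particular library does.

```python
sentence = "I love TensorFlow"

# Word-level: every whitespace-separated word is one token
# (all words are unique here, so each gets its own index)
word_tokens = sentence.split()
word_index = {word: i for i, word in enumerate(word_tokens)}
word_ids = [word_index[word] for word in word_tokens]

# Character-level: every character is one token
# (a=1 ... z=26, anything else such as a space maps to 0)
char_ids = [ord(ch) - ord("a") + 1 if "a" <= ch <= "z" else 0
            for ch in sentence.lower()]

print(word_index)    # {'I': 0, 'love': 1, 'TensorFlow': 2}
print(word_ids)      # [0, 1, 2]
print(char_ids[:6])  # [9, 0, 12, 15, 22, 5]
```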

#### Embeddings

An embedding is a representation of natural language that can be learned. The representation comes in the form of a **feature vector**. For example, the word 'dance' could be represented by the 5-dimensional vector `[-0.8457 , 0.4559 , -0.3332, 0.9877, 0.1112]`. 

It's important to note that the size of the feature vector (`embedding_size`) is tuneable. There are two ways to use embeddings: 

- **Create your own embedding** - Once your text has been turned into numbers (required for an embedding), you can pass them through an embedding layer (such as [`tf.keras.layers.Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)) and an embedding representation will be learned during model training.
- **Reuse a pre-learned embedding** - Many pre-trained embeddings exist online. These have often been learned on large corpora of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.

Example of **tokenization** (straight mapping from word to number) and **embedding** (richer representation of relationships between tokens).

> Question: What level of tokenization should I use? What embedding should I choose?

It depends on your problem. You could try character-level tokenization/embeddings and word-level tokenization/embeddings and see which performs best. You might even want to try stacking them (e.g. combining the outputs of your embedding layers using [`tf.keras.layers.concatenate`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/concatenate)).

If you're looking for pre-trained word embeddings, [Word2vec embeddings](http://jalammar.github.io/illustrated-word2vec/), [GloVe embeddings](https://nlp.stanford.edu/projects/glove/) and many of the options available on TensorFlow Hub are great places to start.

> Note: Much like searching for a pre-trained computer vision model, you can search for pre-trained word embeddings to use for your problem. Try searching for something like "use pre-trained word embeddings in TensorFlow".

When dealing with a text problem, one of the first things you'll have to do before you can build a model is convert your text to numbers. 

There are a few ways to do this, namely: 
* Tokenization - direct mapping of a token (a token could be a word or a character) to a number. 
* Embedding - create a feature vector for each token, stored together as a matrix (the size of the feature vector can be defined and the embedding can be learned). 
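One way to picture an embedding is as a lookup table: a matrix with one row per token ID, where embedding a sequence means picking out the row for each token. The sketch below uses plain Python lists and toy sizes; the variable names, the random range and the seed are assumptions made for illustration. In a real model these values would be trainable weights.

```python
import random

random.seed(42)

vocab_size = 5       # number of distinct tokens (toy value)
embedding_dim = 3    # length of each feature vector (toy value)

# One row per token ID, initialised with small random values
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

token_ids = [2, 0, 4]  # a tokenized "sentence"

# Embedding a sequence is a row lookup for each token ID
embedded = [embedding_matrix[token_id] for token_id in token_ids]

# Sequence length, then embedding dimension
print(len(embedded), len(embedded[0]))
```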

### Text vectorization (tokenization)

In [17]:
# Remind ourselves what our data looks like 
train_sentences[:5]

array(['@mogacola @zamtriossu i screamed after hitting tweet',
       'Imagine getting flattened by Kurt Zouma',
       '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
       "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
       'Somehow find you and I collide http://t.co/Ee8RpOahPk'],
      dtype=object)

In [18]:
# Import the tokenization layer 
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

How does the `TextVectorization` layer work? 
- Standardize each sample (lowercasing + punctuation stripping). 
- Split each sample into substrings (usually words). 
- Recombine substrings into tokens (usually n-grams, which group words together). 
- Index tokens (assign a unique int value to each token). 
- Transform each sample using the index, either into a vector of ints or a dense float vector.  
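The steps above can be sketched in pure Python. This is not how `TextVectorization` is implemented; it is a rough imitation under a few assumptions: ASCII punctuation only, no n-grams, index `0` reserved for padding and `1` for unknown words. The function names and the toy corpus are made up for this example.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Index tokens by frequency; 0 is kept for padding, 1 for unknown words."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}

def vectorize(text, vocab, seq_len):
    """Map tokens to ints, then truncate or zero-pad to seq_len."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    return (ids + [0] * seq_len)[:seq_len]

toy_corpus = ["There's a flood!", "A flood in my street", "my street is dry"]
toy_vocab = build_vocab(toy_corpus, max_tokens=10)

# "garden" is not in the toy vocabulary, so it maps to 1 (unknown)
print(vectorize("A flood in MY garden!", toy_vocab, seq_len=8))
```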

In [19]:
# Use the default TextVectorization parameters 

text_vectorizer = TextVectorization(max_tokens = None , # maximum vocabulary size (None means no limit; the count includes the padding and OOV tokens)
                                   standardize = 'lower_and_strip_punctuation' , # lowercase the text and strip punctuation
                                   split = 'whitespace', # split each sample into tokens on whitespace
                                   ngrams = None , # create groups of n words (None will not group them)
                                   output_mode = 'int', # map each token to an integer index
                                   output_sequence_length = None, # pad/truncate to this length (None pads each batch to its longest sequence)
                                   pad_to_max_tokens = True )

We could pad every tweet to the length of the longest one, but to keep our data small we'll find the average number of words per tweet and pad (or truncate) every sequence to that length. 

In [20]:
ex = train_sentences[0].split()
len(ex)

7

In [21]:
# Find the average number of tokens (words) in the training tweets. 

round(sum([len(i.split()) for i in train_sentences]) / len(train_sentences))

15
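Using the average as the sequence length means longer tweets get truncated. A quick way to see how much is lost is to measure what fraction of sentences fit within a given length. The sketch below does this on a handful of made-up sentences; `coverage` is a hypothetical helper, and in the notebook the same function could be applied to `train_sentences`.

```python
def coverage(sentences, max_length):
    """Fraction of sentences whose word count fits within max_length."""
    lengths = [len(s.split()) for s in sentences]
    return sum(length <= max_length for length in lengths) / len(lengths)

# Hypothetical toy sentences standing in for the training tweets
toy_sentences = ["flood in my street", "a b c d e f", "short one", "x"]

# Same averaging rule as the cell above
avg_len = round(sum(len(s.split()) for s in toy_sentences) / len(toy_sentences))

print(avg_len, coverage(toy_sentences, avg_len))  # 3 0.5
```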

In [22]:
# Setup text vectorization variable 
max_vocab_length = 10000 # Max number of words to have in our vocabulary
max_length = 15 # max length our sequence will be (e.g how many words from Tweet does a model see)

# Create an instance
text_vectorizer = TextVectorization(max_tokens= max_vocab_length , 
                                    output_mode = 'int' , 
                                    output_sequence_length = max_length)

Now that we have an instance of our `TextVectorization` layer, we need to fit it to our text data so it can convert the text into numbers. 

We can do this with the `.adapt()` method. 

In [23]:
# Fit the text vectorizer to the training sentences
text_vectorizer.adapt(train_sentences) # Builds the vocabulary from the training text

In [24]:
# Create a sample sentence and tokenize it
sample_sentence = "There's a flood in my street!"

# Applying our text vectorization to our above sentence 
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[264,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>

We can see that each word was converted into a number. The trailing zeros are padding that fills the sequence up to the `output_sequence_length` we set when creating our text vectorization layer. 

In [25]:
# Choose a random sentence from the training dataset and tokenize it 

random_sentence = random.choice(train_sentences)
print(f'Original text:\n {random_sentence}\
      \n\nVectorized version:')
text_vectorizer([random_sentence])

Original text:
 The Martyrs Who Kept Udhampur Terrorists at Bay Averted a Massacre: It was two youngÛ_ http://t.co/nux5XfPV2d SPSå¨      

Vectorized version:


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[   2,    1,   65, 2911, 1163, 1583,   17, 1551, 3167,    3,  344,
          15,   23,  116, 6540]])>

Great! Even though the original sentence is more than 15 words long, the output is truncated to 15 tokens, the `output_sequence_length` we chose. 

The `max_tokens` parameter of our text vectorizer caps how many unique words are kept in the vocabulary. Let's look at the words it found in our data. 

In [26]:
# Get the unique words in the vocabulary 
words_in_vocab = text_vectorizer.get_vocabulary() # Get all of the unique words in our training data

# Most common words in our vocabulary
top_5_words = words_in_vocab[:5] 

# Least common words in our vocabulary
bottom_5_words = words_in_vocab[-5:]

print(f'Number of words in vocab: {len(words_in_vocab)}')
print(f'5 most common words: {top_5_words}')
print(f'5 least common words: {bottom_5_words}')

Number of words in vocab: 10000
5 most common words: ['', '[UNK]', 'the', 'a', 'in']
5 least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


The vocabulary contains **10000** entries because that's what we set `max_tokens` to (via `max_vocab_length`). 

`[UNK]` stands for unknown: it represents words that aren't in our vocabulary, since 10,000 entries might not cover every word in the data. 

If we increased `max_vocab_length` to 20000, our text vectorizer could keep more unique words and fewer tokens would map to `[UNK]`. 
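The relationship between vocabulary size and unknown tokens can be illustrated with a toy example. The sketch below keeps only the most frequent words and measures the share of tokens left outside the vocabulary; the sentence and the `unk_rate` helper are made up for illustration.

```python
from collections import Counter

toy_tokens = "the flood hit the town and the river flooded the road".split()
counts = Counter(toy_tokens)

def unk_rate(tokens, counts, vocab_size):
    """Share of tokens that fall outside the vocab_size most frequent words."""
    vocab = {word for word, _ in counts.most_common(vocab_size)}
    return sum(tok not in vocab for tok in tokens) / len(tokens)

# A larger vocabulary leaves fewer tokens unknown
for size in (1, 4, 8):
    print(size, round(unk_rate(toy_tokens, counts, size), 2))
```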

---

Now let's try out an embedding. The best part of an embedding is that it can be learned. 

### Creating an Embedding using an Embedding Layer 

We've got a way to map our text to numbers. Now let's turn those numbers into embeddings. 

To make our embedding, we're going to use TensorFlow's embedding layer: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

When the embedding is first created, its values are random. Like the other weights in our model, they get updated during training so that the vectors better represent our words. 

The parameters we care most about for our embedding layer: 
- `input_dim` = the size of our vocabulary
- `output_dim` = the size of the output embedding vector, for example, a value of 100 would mean each token gets represented by a vector 100 long. 
- `input_length` = the length of the sequences being passed to the embedding layer (`max_length`), so it's going to be 15. 

In [27]:
# In practice 
from tensorflow.keras import layers 

embedding = layers.Embedding(input_dim= max_vocab_length , # size of the vocabulary
                             output_dim = 128,  # length of each embedding vector
                             input_length = max_length # length of each input sequence
                             )

# Looking at our embedding layer 
embedding

<tensorflow.python.keras.layers.embeddings.Embedding at 0x7faf7f3ef4d0>

In [28]:
# Get a random sentence from our training sentence 
random_sentence = random.choice(train_sentences)
print(f'Original text:\n {random_sentence}')

# Mapping text into numbers (turn into dense vectors of fixed size)
tokenized_form = text_vectorizer([random_sentence]) 
print(f'\n After turning our text into numbers:\n\n {tokenized_form}')

# Using our embedding layer 
print(f'\nApplying the embedding layer to our tokenized vector\n\n {embedding(tokenized_form)}')

Original text:
 @TrillAC_ I think we've only had like one black mass murderer in the history of mass murders white people do that shit.

 After turning our text into numbers:

 [[7364    8  125 2608  126   94   25   61  159  157  538    4    2  705
     6]]

Applying the embedding layer to our tokenized vector

 [[[-7.6241121e-03 -4.0823676e-02  3.5990253e-03 ... -3.9833557e-02
   -3.5363436e-03  1.9198511e-02]
  [-1.7751195e-02  4.7068741e-02 -3.6746930e-02 ... -3.2503232e-03
    1.3757125e-03 -3.8338676e-03]
  [ 4.0857438e-02  4.9100172e-02 -4.3176234e-02 ... -3.5456680e-02
    3.5402205e-02 -4.9809527e-02]
  ...
  [-1.9616950e-02  3.1622473e-02  4.4707384e-02 ... -3.0019058e-02
   -3.3066310e-02 -3.7737738e-02]
  [ 4.3792080e-02  4.9930599e-02  1.9668713e-03 ... -9.3124807e-05
   -1.9669533e-04  4.7419224e-02]
  [ 3.4101557e-02 -4.0469788e-02 -1.5264608e-02 ... -8.6992383e-03
    2.5227237e-02 -3.6985565e-02]]]


In [29]:
sample_embed = embedding(tokenized_form)
sample_embed

<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-7.6241121e-03, -4.0823676e-02,  3.5990253e-03, ...,
         -3.9833557e-02, -3.5363436e-03,  1.9198511e-02],
        [-1.7751195e-02,  4.7068741e-02, -3.6746930e-02, ...,
         -3.2503232e-03,  1.3757125e-03, -3.8338676e-03],
        [ 4.0857438e-02,  4.9100172e-02, -4.3176234e-02, ...,
         -3.5456680e-02,  3.5402205e-02, -4.9809527e-02],
        ...,
        [-1.9616950e-02,  3.1622473e-02,  4.4707384e-02, ...,
         -3.0019058e-02, -3.3066310e-02, -3.7737738e-02],
        [ 4.3792080e-02,  4.9930599e-02,  1.9668713e-03, ...,
         -9.3124807e-05, -1.9669533e-04,  4.7419224e-02],
        [ 3.4101557e-02, -4.0469788e-02, -1.5264608e-02, ...,
         -8.6992383e-03,  2.5227237e-02, -3.6985565e-02]]], dtype=float32)>

What is 128? 
Every token in our sequence is now represented by a 128-long vector. 

In [30]:
# Check out a single token's embedding 
sample_embed[0][0] , sample_embed[0][0].shape , random_sentence[0]

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-0.00762411, -0.04082368,  0.00359903, -0.04425007, -0.02362163,
        -0.03128286, -0.02981092, -0.03448785, -0.04207481,  0.04927522,
         0.03884381, -0.04038532, -0.01040723, -0.00753536,  0.02708001,
        -0.02827619,  0.00425367, -0.0247598 , -0.03707813, -0.02677398,
         0.03371396,  0.03254266,  0.01022855,  0.04227283, -0.03268725,
         0.01432783,  0.01107226, -0.01615246,  0.03349051,  0.03297452,
         0.03233914,  0.02820451,  0.0257361 ,  0.02466771,  0.02931966,
         0.00690029, -0.04962222, -0.02688822, -0.04453341,  0.03148191,
        -0.00905512, -0.03407439, -0.00458197,  0.03999814,  0.02992659,
         0.028659  ,  0.0101454 , -0.03537948, -0.03202337,  0.04553839,
         0.04479903, -0.03271323, -0.02318591, -0.02727044,  0.00464325,
         0.02215873, -0.03437829, -0.04662727,  0.02453763, -0.02678231,
        -0.03655752,  0.01483127,  0.0336324 ,  0.01351691,  0.03450128,
  

Next, we'll discuss the various modelling experiments we're going to run. 

## Modelling a text dataset (running a series of experiments) 

Once you've got your inputs and outputs prepared, it's a matter of figuring out which machine learning model to build in between them to bridge the gap.

To get plenty of practice, we're going to build a series of different models, each as its own experiment. We'll then compare the results of each model and see which one performed best. 

We're going to build: 

- **Model 0**: Naive Bayes with TF-IDF (a common baseline for text data)
- **Model 1**: Feed-forward neural network (dense model)
- **Model 2**: LSTM model (RNN)
- **Model 3**: GRU model (RNN)
- **Model 4**: Bidirectional-LSTM model (RNN)
- **Model 5**: 1D Convolutional Neural Network
- **Model 6**: TensorFlow Hub Pre-trained Feature Extractor
- **Model 7**: Same as model 6 with 10% of training data.

How are we going to approach all of these? 
We'll use the standard steps for modelling with TensorFlow: 
- Create a model
- Compile the model 
- Fit the model 
- Evaluate the model

Let's start with a non-deep-learning model, specifically a Naive Bayes model from scikit-learn.

Let's experiment on our own before the video. 


In [31]:
# Checking the shapes of our splits 
train_sentences.shape , train_labels.shape , val_sentences.shape , val_labels.shape

((6851,), (6851,), (762,), (762,))

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create an instance
tfidf = TfidfVectorizer()

# Fit the TfidfVectorizer to our training sentences 
tf_transformer = tfidf.fit(train_sentences)

# Applying the transformation
train_sen_trans = tf_transformer.transform(train_sentences)

# Checking the shape 
train_sen_trans.shape

(6851, 20076)

In [33]:
train_sen_trans[:1]

<1x20076 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [34]:
# Import the Naive Bayes model for our classifier 
from sklearn.naive_bayes import MultinomialNB

# Create an instance of our model 
clf_naive = MultinomialNB()

# Fitting the data
clf_naive.fit(train_sen_trans , train_labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Model 0: Getting a baseline 

To create our baseline, we'll use scikit-learn's multinomial Naive Bayes, with the TF-IDF formula converting our words to numbers. 

> 🔑 **Note**: It's common practice to use non-DL algorithms as a baseline because of their speed, and then use DL later to see if we can improve on them. 
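To give a feel for what TF-IDF does, here is a rough pure-Python sketch. It follows what we understand to be `TfidfVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation), but it is an illustration, not a replacement for the scikit-learn implementation. The function names and toy documents are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(docs):
    """Return one {token: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        raw = {tok: count * idf[tok] for tok, count in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({tok: w / norm for tok, w in raw.items()})
    return vectors

toy_docs = ["forest fire near town", "fire drill at school", "school trip to the forest"]
for vec in tfidf_vectors(toy_docs):
    print({tok: round(w, 3) for tok, w in vec.items()})
```

Words that appear in fewer documents (such as "near") receive a higher weight than words shared across documents (such as "fire").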



In [35]:
train_labels

array([0, 0, 1, ..., 1, 1, 0])

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer # (turn text into numbers)
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline 

# Create tokenization and modelling pipeline
model_0 = Pipeline([
                  ('tfidf' , TfidfVectorizer()) , # Convert words to numbers using tfidf
                  ('clf' , MultinomialNB()), # Model the text 
                ])

# Fit the pipeline to the training data 
model_0.fit(train_sentences , train_labels)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

In [37]:
# Evaluate our baseline model 
# Default evaluation metrics is accuracy
baseline_score = model_0.score(val_sentences , val_labels)
print(f'Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%')

Our baseline model achieves an accuracy of: 79.27%


In [38]:
# Make predictions 
baseline_preds = model_0.predict(val_sentences)

# First 10 predictions
baseline_preds[:10]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0])

Let's use some other evaluation metrics. We'll write a handy function to save us the hassle of writing out every evaluation metric each time. 

In [39]:
def classification_evaluation_metrics(y_true , 
                                      y_preds):
  '''
  Arguments: 
  y_true --> true labels of the data 
  y_preds --> predicted labels of the data 

  Returns: 
  A dictionary of evaluation metrics: accuracy (as a percentage), F1 score, precision and recall
  '''

  # Import the needed metrics 
  from sklearn.metrics import precision_score , f1_score , accuracy_score , recall_score

  # Compute the metrics (variable names chosen so they don't shadow the imported functions)
  accuracy = accuracy_score(y_true , y_preds)
  f1 = f1_score(y_true , y_preds)
  precision = precision_score(y_true , y_preds)
  recall = recall_score(y_true , y_preds)

  # Pack the metrics into a dictionary
  evaluation_dict = {'Accuracy:': accuracy * 100 , 
                     'F1_Score: ': f1 , 
                     'Precision: ': precision , 
                     'Recall: ': recall }

  # Return our dictionary 
  return evaluation_dict
  

In [40]:
# Using the above function 
evaluation_dict = classification_evaluation_metrics(val_labels , 
                                                    baseline_preds)

# Looking into the dictionary of evaluation metrics 
evaluation_dict

{'Accuracy:': 79.26509186351706,
 'F1_Score: ': 0.734006734006734,
 'Precision: ': 0.8861788617886179,
 'Recall: ': 0.6264367816091954}
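The baseline's precision (about 0.89) is noticeably higher than its recall (about 0.63): when it predicts a disaster it is usually right, but it misses a fair share of real disasters. To see where these numbers come from, here is a small sketch that computes the same metrics by hand from true/false positives and negatives. The labels are made-up toy values, not the validation set, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical toy labels
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(binary_metrics(toy_true, toy_pred))
```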

Great! Our baseline model worked better than we expected. Now let's build a Feed-forward neural network model for our text data.

### Model 1: Feed Forward Neural Network (Dense layers) 


In [41]:
# Create a tensorboard callback (to track the model experiments) 
# New one for each model
from helper_functions import create_tensorboard_callback

# Create a directory to save tensorboard logs 
SAVE_DIR = 'model_logs'

In [42]:
# Build model with the functional API 
from tensorflow.keras import layers 

# Creating our input layer (inputs are 1D strings)
inputs = layers.Input(shape=(1,) , dtype = tf.string)

# Convert strings into numbers and applying word embedding
x = text_vectorizer(inputs) # turn input text into numbers 
x = embedding(x) # Create an embedding of the numberized inputs 

# Output layer (want binary outputs so use sigmoid activation function)
outputs = layers.Dense(1, activation='sigmoid')(x) 

# Packing into a model 
model_1 = tf.keras.Model(inputs , outputs , name = 'model_1_dense')

# Getting the model summary 
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 15)                0         
_________________________________________________________________
embedding (Embedding)        (None, 15, 128)           1280000   
_________________________________________________________________
dense (Dense)                (None, 15, 1)             129       
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
_________________________________________________________________


At `embedding (Embedding)`, the embedding layer adds an extra dimension of size 128, and this layer is where almost all of the model's parameters live.

That's because every one of the 15 tokens in a sequence gets represented as a 128-long feature vector.
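The parameter counts in the summary can be checked with a little arithmetic. This is only a sketch: the vocabulary size of 10,000 is inferred from the summary (1,280,000 / 128), and the variable names are illustrative.

```python
vocab_size = 10000    # rows of the embedding matrix (one per token in the vocabulary)
embedding_dim = 128   # length of each token's feature vector

# One 128-long vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# The Dense(1) layer has one weight per input feature plus one bias
dense_params = embedding_dim * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)
```

These match the `1280000`, `129` and `1,280,129` reported by `model_1.summary()`.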

In [43]:
# Compile the model 
model_1.compile(loss = tf.keras.losses.BinaryCrossentropy() , 
                optimizer = tf.keras.optimizers.Adam() , 
                metrics = ['accuracy'])

In [44]:
# Fit the model 
model_1_history = model_1.fit(train_sentences ,
                              train_labels , 
                              epochs = 5 , 
                              validation_data = (val_sentences , val_labels) , 
                              callbacks = [create_tensorboard_callback(dir_name = SAVE_DIR , 
                                                                       experiment_name = 'model_1_dense')])

Saving TensorBoard log files to: model_logs/model_1_dense/20210518-205528
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [45]:
# Check the results 
model_1.evaluate(val_sentences , val_labels)



[0.6354839205741882, 0.6472440958023071]

In [46]:
# Make some predictions and evaluate 
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs.shape

(762, 15, 1)

In [47]:
model_1_pred_probs[:1]

array([[[0.39799392],
        [0.39799392],
        [0.39799392],
        [0.17651647],
        [0.48920065],
        [0.39799392],
        [0.39799392],
        [0.39799392],
        [0.11474213],
        [0.36757755],
        [0.39799392],
        [0.94535613],
        [0.0377855 ],
        [0.39799392],
        [0.31791615]]], dtype=float32)

We want a single prediction probability for each sample, but the model is outputting a prediction probability for each token.

To fix this we'll use a **GlobalAveragePooling1D** layer. It averages the token vectors of each sample, condensing them into one feature vector per sample.
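Here's a rough plain-Python sketch of what global average pooling computes, assuming nested lists shaped `(batch, tokens, features)`. The helper `global_average_pool_1d` and the toy numbers are hypothetical, and the sketch ignores masking, so treat it as an illustration of the idea rather than the Keras implementation.

```python
def global_average_pool_1d(batch):
    """Average over the token axis: (batch, tokens, features) -> (batch, features)."""
    pooled = []
    for sample in batch:
        n_tokens = len(sample)
        n_features = len(sample[0])
        # For each feature, take the mean of that feature across all tokens
        pooled.append([
            sum(token[f] for token in sample) / n_tokens
            for f in range(n_features)
        ])
    return pooled

# 1 sample, 3 tokens, 2 features per token (toy values)
toy_batch = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
pooled = global_average_pool_1d(toy_batch)
print(pooled)  # one 2-feature vector for the single sample
```

In our model the same operation turns `(None, 15, 128)` into `(None, 128)`, so the final Dense layer produces one probability per sample.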

In [48]:
# Let's build the model again but with GlobalAveragePooling 
from tensorflow.keras import layers 

# Input layer
inputs = layers.Input(shape = (1,) , dtype = tf.string)

# Turn text into numbers and create a word embedding 
x = text_vectorizer(inputs)
x = embedding(x)

# Condense the feature vector for each token to "one vector"
x = layers.GlobalAveragePooling1D()(x)

# Output layer 
outputs = layers.Dense(1 , activation = 'sigmoid')(x)

# Packing into a model 
model_1 = tf.keras.Model(inputs , outputs , name = 'model_1_dense')


In [49]:
# Checking the summary of the model
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 15)                0         
_________________________________________________________________
embedding (Embedding)        (None, 15, 128)           1280000   
_________________________________________________________________
global_average_pooling1d (Gl (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
_________________________________________________________________


In [50]:
# Compile the model 
model_1.compile(loss = tf.keras.losses.BinaryCrossentropy() , 
                optimizer = tf.keras.optimizers.Adam() , 
                metrics = ['accuracy'])

In [51]:
# Fitting the model 
model_1.fit(train_sentences , 
            train_labels , 
            epochs = 5 , 
            validation_data = (val_sentences , val_labels) , 
            callbacks = [create_tensorboard_callback(dir_name = SAVE_DIR , 
                                                     experiment_name = 'model_1_dense_pool_layer')])

Saving TensorBoard log files to: model_logs/model_1_dense_pool_layer/20210518-205545
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7faf7ecce8d0>

In [52]:
# Check the results by evaluating our model 
model_1.evaluate(val_sentences , val_labels)



[0.4691266715526581, 0.7821522355079651]

In [53]:
# Making prediction 
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs.shape

(762, 1)

In [54]:
# How do our predictions look now? 
model_1_pred_probs[:10]

array([[0.3349681 ],
       [0.77674645],
       [0.99708915],
       [0.15756682],
       [0.11675829],
       [0.94847524],
       [0.89156425],
       [0.9965668 ],
       [0.9763675 ],
       [0.27481252]], dtype=float32)

In [55]:
# Convert model prediction probs to label format 

model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 1.], dtype=float32)>

In [56]:
# Calculate our model_1 results 
model_1_results = classification_evaluation_metrics(y_true = val_labels , 
                                                    y_preds = model_1_preds)

model_1_results

{'Accuracy:': 78.21522309711287,
 'F1_Score: ': 0.7414330218068536,
 'Precision: ': 0.8095238095238095,
 'Recall: ': 0.6839080459770115}

In [57]:
# Our baseline metrics 
evaluation_dict

{'Accuracy:': 79.26509186351706,
 'F1_Score: ': 0.734006734006734,
 'Precision: ': 0.8861788617886179,
 'Recall: ': 0.6264367816091954}

In [58]:
# Compare our baseline model with our deep learning model
import numpy as np
np.array(list(model_1_results.values())) > np.array(list(evaluation_dict.values()))

array([False,  True, False,  True])
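The boolean array is compact, but it relies on both dictionaries listing their metrics in the same order. As an alternative sketch, the comparison can be done by key so each difference is labelled. The helper name `compare_to_baseline` is made up here, and the numbers are copied from the two result dictionaries shown above.

```python
# Values copied from the outputs above
baseline_metrics = {'Accuracy:': 79.26509186351706,
                    'F1_Score: ': 0.734006734006734,
                    'Precision: ': 0.8861788617886179,
                    'Recall: ': 0.6264367816091954}
dense_metrics = {'Accuracy:': 78.21522309711287,
                 'F1_Score: ': 0.7414330218068536,
                 'Precision: ': 0.8095238095238095,
                 'Recall: ': 0.6839080459770115}

def compare_to_baseline(baseline, new):
    """Return new - baseline for every metric, matched by key rather than position."""
    return {name: new[name] - baseline[name] for name in baseline}

differences = compare_to_baseline(baseline_metrics, dense_metrics)
for name, diff in differences.items():
    verdict = 'better' if diff > 0 else 'worse or equal'
    print(f'{name.strip()} {diff:+.4f} ({verdict})')
```

Keep in mind that accuracy is stored as a percentage while the other three metrics are fractions, so the raw differences are not on the same scale.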

### Visualizing our model's learned word embeddings with TensorFlow's projector tool

In [59]:
# Get the vocabulary from the text vectorization layer 

# Getting the unique vocabulary our embedding layer learnt
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab) , words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

In [60]:
# Model 1 summary (inspect the embedding layer)
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 15)                0         
_________________________________________________________________
embedding (Embedding)        (None, 15, 128)           1280000   
_________________________________________________________________
global_average_pooling1d (Gl (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
_________________________________________________________________


Let's get the weight matrix of our embedding layer.

These are the numerical representations of each token in our training data, which were learned over 5 epochs of training (the patterns our embedding layer picked up). 

In [61]:
embed_weights = model_1.get_layer('embedding').get_weights()
embed_weights

[array([[-0.00391051,  0.02504868, -0.02617101, ...,  0.04603732,
          0.0278583 , -0.06620234],
        [-0.05174899, -0.00908591, -0.05964603, ..., -0.01038372,
          0.03669757,  0.03520452],
        [-0.05457886,  0.05735459,  0.05414534, ...,  0.01196079,
         -0.06523348, -0.0638105 ],
        ...,
        [ 0.04691838,  0.00577299,  0.04433097, ...,  0.04817697,
          0.02018059, -0.00235616],
        [ 0.13170831,  0.11272785,  0.10495318, ..., -0.05242131,
          0.03523082, -0.10682646],
        [ 0.14288642,  0.06756102,  0.09590799, ..., -0.10125121,
          0.10648917, -0.06503454]], dtype=float32)]

Those are the weights learned by our embedding layer. 

In [62]:
# The shape should be same size as vocab and embedding_dim (output_dim of our embedding layer)
embed_weights[0].shape 

(10000, 128)

From the shape above we can see that each of the 10,000 tokens in our vocabulary is embedded as a 128-dimensional vector. 
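An embedding layer is essentially a lookup table: the token index selects a row of the weight matrix. The sketch below shows that idea with a tiny hand-written matrix. The values, the 4-token vocabulary and the helper `embed` are all hypothetical, standing in for the real `(10000, 128)` matrix.

```python
# Toy embedding matrix: 4 tokens in the vocabulary, 3 features each
toy_embedding_matrix = [
    [0.01, -0.02, 0.03],   # index 0: padding token
    [0.10, -0.20, 0.30],   # index 1: [UNK]
    [0.50, 0.40, -0.10],   # index 2: some word
    [-0.30, 0.20, 0.90],   # index 3: another word
]

def embed(token_ids, matrix):
    """Look up one row of the matrix per token id."""
    return [matrix[token_id] for token_id in token_ids]

# A vectorized "sentence" of 2 real tokens followed by 2 padding tokens
toy_ids = [2, 3, 0, 0]
embedded = embed(toy_ids, toy_embedding_matrix)
print(len(embedded), len(embedded[0]))  # tokens, features per token
```

This is also why the repeated padding positions in an embedded sentence all show the same vector: they all look up the same row.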

Now that we've got the embedding matrix our model has learned to represent our tokens, let's see how we can visualize it. 

To do so, TensorFlow has a handy tool called projector: http://projector.tensorflow.org/

And TensorFlow also has a guide on Word Embeddings: https://www.tensorflow.org/tutorials/text/word_embeddings

In [63]:
# Create embedding files (adapted from TensorFlow's word embeddings documentation)
import io 

# Write one file of vectors and one file of metadata (the matching words) 
# Using `with` makes sure both files are closed even if something goes wrong
with io.open('vectors.tsv' , 'w' , encoding = 'utf-8') as out_v , \
     io.open('metadata.tsv' , 'w' , encoding = 'utf-8') as out_m:

  # Loop through the vocabulary and write each word and its embedding vector 
  for index , word in enumerate(words_in_vocab):
    if index == 0:
      continue # Skip index 0, it's padding

    vec = embed_weights[0][index]
    out_v.write('\t'.join([str(x) for x in vec]) + '\n')
    out_m.write(word + '\n')




In [64]:
# Download files from Colab to upload to the projector 
try: 
  from google.colab import files 
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception: 
  pass  


> Resources: To learn more about embeddings:
* Jay Alammar's illustrated word2vec: https://jalammar.github.io/illustrated-word2vec/
* TensorFlow's Word Embedding guide: https://www.tensorflow.org/tutorials/text/word_embeddings

## Recurrent Neural Networks (RNN's)

RNN's are useful for sequence data.

The premise of a recurrent neural network is to use the representation of previous inputs to aid the representation of the later input. 

Use information from the past to help you with the future (this is where the term recurrent comes from). In other words, take an input (X) and compute an output (y) based on all previous inputs.
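As a rough illustration of "recurrent", here is a one-unit recurrent cell in plain Python: the hidden state `h` is fed back in at every step, so each output depends on everything seen so far. The weights `w_x`, `w_h` and the toy sequences are arbitrary values chosen for the example, and real cells such as LSTM and GRU add gating on top of this.

```python
import math

def simple_rnn(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a one-unit recurrent cell over a sequence of scalar inputs."""
    h = 0.0  # hidden state starts empty
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h + bias)
        states.append(h)
    return states

# Same last two inputs, different first input
states_a = simple_rnn([1.0, 0.0, 0.0])
states_b = simple_rnn([0.0, 0.0, 0.0])
print(states_a)
print(states_b)
```

Even though the last two inputs are identical, the final state of the first sequence is non-zero because it still carries information from the first timestep.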

> Resources to look into for learning RNN's 
- [RNN MIT Intro to deeplearning](https://www.youtube.com/watch?v=qjrad0V0uJE&list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI&index=2)
- [Understanding LSTMS](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Unreasonable effectiveness of RNN](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)



### Model 2: LSTM 

LSTM = long short-term memory (one of the most popular RNN cells). 

The structure of an RNN model typically looks like this: 

```
Input (text) --> Tokenize --> Embedding --> Layers (RNNs/dense) --> Output (label probability)
```


In [65]:
# Create an LSTM model 
from tensorflow.keras import layers 

# Setting up inputs 
inputs = layers.Input(shape = (1,) , dtype = tf.string)

# Converting text into numbers and creating a embedding 
x = text_vectorizer(inputs)
x = embedding(x)
print(f'After embedding: {x.shape}')

# Our LSTM 
#x = layers.LSTM(64 , return_sequences=True)(x) 
# When you're stacking RNN cells together you need to set return_sequences = True
#print(f'Output with return sequence True: {x.shape}') # Output with return sequence True
x = layers.LSTM(64)(x)
#print(f'Output with return sequence False: {x.shape}') # Output with return sequence False
x = layers.Dense(64 , activation = 'relu')(x)

# Initializing our outputs 
outputs = layers.Dense(1 , activation = 'sigmoid')(x)

# Packing into a model 
model_2 = tf.keras.Model(inputs , outputs, name = 'model_2_LSTM')

After embedding: (None, 15, 128)


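`inputs`: A 3D tensor with shape `[batch, timesteps, feature]`.

- batch --> `None` by default (the batch size is only known at run time)
- timesteps --> each token (word) in the sequence is treated as one timestep (15 here)
- feature --> the embedding dimension of each token (128 here)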

The default activation function for LSTM is **`tanh`**. 

[LSTM layer tensorflow](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)

In [66]:
# Checking the model summary 
model_2.summary()

Model: "model_2_LSTM"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 15)                0         
_________________________________________________________________
embedding (Embedding)        (None, 15, 128)           1280000   
_________________________________________________________________
lstm (LSTM)                  (None, 64)                49408     
_________________________________________________________________
dense_2 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 65        
Total params: 1,333,633
Trainable params: 1,333,633
Non-trainable params: 0
____________________________________________
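The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard LSTM parameter formula (four weight sets, each with input weights, recurrent weights and a bias). The helper names are made up for illustration.

```python
def lstm_param_count(input_dim, units):
    """4 weight sets (3 gates + candidate), each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_param_count(input_dim, units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return input_dim * units + units

embedding_params = 10000 * 128             # vocabulary size x embedding dimension
lstm_params = lstm_param_count(128, 64)    # LSTM(64) on 128-dim embeddings
hidden_params = dense_param_count(64, 64)  # Dense(64)
output_params = dense_param_count(64, 1)   # Dense(1)

total = embedding_params + lstm_params + hidden_params + output_params
print(lstm_params, hidden_params, output_params, total)
```

The results line up with the `49408`, `4160`, `65` and `1,333,633` shown in the summary.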

In [67]:
# Compile the model 
model_2.compile(loss = tf.keras.losses.BinaryCrossentropy()  , 
                optimizer = tf.keras.optimizers.Adam() , 
                metrics = ['accuracy'])

In [68]:
# Fit the model 
model_2_history = model_2.fit(train_sentences , 
                              train_labels , 
                              validation_data = (val_sentences , val_labels) , 
                              epochs = 5 , 
                              callbacks = [create_tensorboard_callback(dir_name= SAVE_DIR , 
                                                                       experiment_name = 'model_2_LSTM')])

Saving TensorBoard log files to: model_logs/model_2_LSTM/20210518-205604
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [69]:
# Make predictions with LSTM model 
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs[:10]

array([[1.6903281e-03],
       [3.7813866e-01],
       [9.9998844e-01],
       [2.8579772e-02],
       [5.4605483e-05],
       [9.9998367e-01],
       [9.8673952e-01],
       [9.9999636e-01],
       [9.9999160e-01],
       [8.2357037e-01]], dtype=float32)

In [70]:
# Convert model 2 pred probs to labels 
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [71]:
# Calculate model 2 results 
model_2_results = classification_evaluation_metrics(val_labels , 
                                                    model_2_preds)
model_2_results

{'Accuracy:': 77.16535433070865,
 'F1_Score: ': 0.7347560975609756,
 'Precision: ': 0.7824675324675324,
 'Recall: ': 0.6925287356321839}

In [72]:
evaluation_dict

{'Accuracy:': 79.26509186351706,
 'F1_Score: ': 0.734006734006734,
 'Precision: ': 0.8861788617886179,
 'Recall: ': 0.6264367816091954}

### Model 3: GRU Model 

The GRU cell has similar features to an LSTM cell but has fewer parameters. 

[Understanding GRU networks](https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be)

Previously we used an LSTM layer to build our RNN model; now we'll use a GRU layer (Gated Recurrent Unit) to build our model. 

In [73]:
# Create a GRU model 
inputs = layers.Input(shape = (1,) , dtype = tf.string)

# Converting text into numbers and performing word embeddings. 
x = text_vectorizer(inputs)
x = embedding(x)
print(x.shape)

# Building our GRU model 
# If you want to stack recurrent layers on top of each other, you need return_sequences = True
x = layers.GRU(64 , activation = 'tanh' , return_sequences=True)(x)
print(x.shape)
x = layers.LSTM(44 , return_sequences= True)(x) 
print(x.shape)
x = layers.GRU(100)(x)
print(x.shape)
x = layers.Dense(64 , activation='relu')(x)

# Global Average Pooling layer
#x = layers.GlobalAveragePooling1D()(x)

# Output layer 
outputs = layers.Dense(1 , activation= 'sigmoid')(x)

# Packing into a model 
test_model_3 = tf.keras.Model(inputs , outputs)

(None, 15, 128)
(None, 15, 64)
(None, 15, 44)
(None, 100)


In [74]:
# Summary of the model 
test_model_3.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 15)                0         
_________________________________________________________________
embedding (Embedding)        (None, 15, 128)           1280000   
_________________________________________________________________
gru (GRU)                    (None, 15, 64)            37248     
_________________________________________________________________
lstm_1 (LSTM)                (None, 15, 44)            19184     
_________________________________________________________________
gru_1 (GRU)                  (None, 100)               43800     
_________________________________________________________________
dense_4 (Dense)              (None, 64)                6464  

Alright we did some tinkering up there, now let's create a model 3 with only one GRU layer

In [75]:
# Creating a GRU model 
from tensorflow.keras import layers 

# Input layer 
inputs = layers.Input(shape = (1,) , dtype = tf.string)

# Converting text into numbers and perform word embeddings 
x = text_vectorizer(inputs)
x = embedding(x)

# Our GRU layer 
#x = layers.GRU(64 , activation= 'tanh' , return_sequences= True)(x)
x = layers.GRU(64)(x)

# Global Average Pooling 
#x = layers.GlobalAveragePooling1D()(x)

# Output layer 
outputs = layers.Dense(1 , activation='sigmoid')(x)

# Packing into a model
model_3 = tf.keras.Model(inputs , outputs) 


In [76]:
# Model summary 
model_3.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 15)                0         
_________________________________________________________________
embedding (Embedding)        (None, 15, 128)           1280000   
_________________________________________________________________
gru_2 (GRU)                  (None, 64)                37248     
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 65        
Total params: 1,317,313
Trainable params: 1,317,313
Non-trainable params: 0
_________________________________________________________________
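To see where "fewer parameters" comes from, the GRU count can be worked out the same way as for the LSTM: a GRU has three weight sets instead of four. This sketch assumes Keras' default `reset_after=True`, which keeps two bias vectors per weight set. The helper name is hypothetical.

```python
def gru_param_count(input_dim, units):
    """3 weight sets (2 gates + candidate); with reset_after=True each has 2 bias vectors."""
    return 3 * ((input_dim + units) * units + 2 * units)

gru_params = gru_param_count(128, 64)           # GRU(64) on 128-dim embeddings
stacked_gru_params = gru_param_count(44, 100)   # GRU(100) fed by the LSTM(44) in the test model
lstm_params_from_summary = 49408                # LSTM(64) count reported in the model_2 summary

print(gru_params, stacked_gru_params)
print(lstm_params_from_summary - gru_params)    # parameters saved by swapping LSTM(64) for GRU(64)
```

Both values agree with the summaries (`37248` for `gru_2`, `43800` for `gru_1`), and the single GRU layer is roughly a quarter smaller than the LSTM layer of the same width.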


In [77]:
# Compile the model 
model_3.compile(loss = tf.keras.losses.BinaryCrossentropy() , 
                optimizer = tf.keras.optimizers.Adam() , 
                metrics = ['accuracy']) 

In [78]:
# Fit the model 
model_3_history = model_3.fit(train_sentences , 
                              train_labels , 
                              validation_data = (val_sentences , val_labels), 
                              epochs = 5 , 
                              callbacks = [create_tensorboard_callback(dir_name= SAVE_DIR , 
                                                                       experiment_name = 'model_3_GRU')]) 


Saving TensorBoard log files to: model_logs/model_3_GRU/20210518-205634
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [79]:
# Making predictions on the val set 
model_3_pred_probs = model_3.predict(val_sentences) 
model_3_pred_probs[:10]

array([[2.99480557e-03],
       [8.42095017e-01],
       [9.99764442e-01],
       [1.23365015e-01],
       [2.23934650e-04],
       [9.99530673e-01],
       [7.91117191e-01],
       [9.99940574e-01],
       [9.99888301e-01],
       [7.66336083e-01]], dtype=float32)

In [80]:
# Convert model 3 pred probs to labels 
model_3_preds = tf.squeeze(tf.round(model_3_pred_probs))
model_3_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [81]:
# Evaluate model 3 with our evaluation metrics function 
model_3_results = classification_evaluation_metrics(val_labels , 
                                                    model_3_preds)

In [82]:
model_3_results

{'Accuracy:': 77.42782152230971,
 'F1_Score: ': 0.7425149700598801,
 'Precision: ': 0.775,
 'Recall: ': 0.7126436781609196}

### Model 4: Bidirectional RNN Model 

Normal RNNs go from left to right (just like you'd read an English sentence); a bidirectional RNN goes from right to left as well as left to right. 

The bidirectional wrapper works with any RNN cell. 

Let's give it a try before watching the video.  

In [83]:
# Creating a Bidirectional RNN model 
from tensorflow.keras import layers
# Input layer 
inputs = layers.Input(shape = (1, ) , dtype = tf.string)

# Convert text into numbers and perform word embeddings 
x = text_vectorizer(inputs)
x = embedding(x)
print(x.shape)

# Adding a Bidirectional RNN layer 
x = layers.Bidirectional(layers.LSTM(64 , return_sequences=True))(x)
print(x.shape)

x = layers.Bidirectional(layers.GRU(64))(x)
print(x.shape)

# Output layer 
outputs = layers.Dense(1 , activation= 'sigmoid')(x)
print(outputs.shape)

# Packing into a model 
test_model_4 = tf.keras.Model(inputs , outputs)

(None, 15, 128)
(None, 15, 128)
(None, 128)
(None, 1)


Here we have added `return_sequences=True` because we want to stack another RNN layer on top of our current RNN layer, and the next layer needs the output at every timestep. 

In [84]:
# Model summary 
test_model_4.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 15)                0         
_________________________________________________________________
embedding (Embedding)        (None, 15, 128)           1280000   
_________________________________________________________________
bidirectional (Bidirectional (None, 15, 128)           98816     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               74496     
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 129       
Total params: 1,453,441
Trainable params: 1,453,441
Non-trainable params: 0
_________________________________________________

Following the video 


In [85]:
# Creating a Bidirectional RNN model 
from tensorflow.keras import layers
# Input layer 
inputs = layers.Input(shape = (1, ) , dtype = tf.string)

# Convert text into numbers and perform word embeddings 
x = text_vectorizer(inputs)
x = embedding(x)
print(x.shape)

# Adding a Bidirectional RNN layer 
x = layers.Bidirectional(layers.LSTM(64 , return_sequences=True))(x)
print(x.shape)
x = layers.Bidirectional(layers.GRU(64))(x)
print(x.shape)

# Output layer 
outputs = layers.Dense(1 , activation= 'sigmoid')(x)
print(outputs.shape)

# Packing into a model 
model_4 = tf.keras.Model(inputs , outputs)

(None, 15, 128)
(None, 15, 128)
(None, 128)
(None, 1)


In [86]:
# Model Summary 
model_4.summary()

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 15)                0         
_________________________________________________________________
embedding (Embedding)        (None, 15, 128)           1280000   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 15, 128)           98816     
_________________________________________________________________
bidirectional_3 (Bidirection (None, 128)               74496     
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 129       
Total params: 1,453,441
Trainable params: 1,453,441
Non-trainable params: 0
_________________________________________________

The Bidirectional wrapper runs the wrapped layer in both directions and concatenates the two outputs, so the representation goes from **64 to 128** features. 

Bidirectional --> representation gets doubled
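A minimal sketch of the idea, using a one-unit toy cell in plain Python: one pass reads the sequence forwards, a second pass (with its own weights) reads it backwards, and the two final states are concatenated. The weights and the sequence are arbitrary, and the helper names are hypothetical. Keras' `Bidirectional` with its default concatenation does the equivalent with full layers.

```python
import math

def run_cell(sequence, w_x, w_h):
    """One-unit recurrent pass that returns only the final hidden state."""
    h = 0.0
    for x in sequence:
        h = math.tanh(w_x * x + w_h * h)
    return h

def bidirectional(sequence, forward_weights, backward_weights):
    """Run one cell left-to-right and another right-to-left, then concatenate."""
    forward_state = run_cell(sequence, *forward_weights)
    backward_state = run_cell(list(reversed(sequence)), *backward_weights)
    return [forward_state, backward_state]

toy_sequence = [0.2, -0.5, 0.9]
output = bidirectional(toy_sequence, (0.5, 0.8), (0.3, 0.6))
print(len(output))  # 1 unit per direction -> 2 features

# The parameter counts double too, matching the summary above
print(2 * 49408, 2 * 37248)  # Bidirectional(LSTM(64)), Bidirectional(GRU(64))
```

Because each direction has its own copy of the layer, the parameter counts in the summary (`98816` and `74496`) are exactly twice the single-direction LSTM and GRU counts.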

In [87]:
# Compile the model 
model_4.compile(loss = tf.keras.losses.BinaryCrossentropy() , 
                optimizer = tf.keras.optimizers.Adam() , 
                metrics = ['accuracy'])

In [88]:
# Fit the model 
model_4_history = model_4.fit(train_sentences , 
                              train_labels , 
                              validation_data = (val_sentences , val_labels) , 
                              epochs = 5 , 
                              callbacks = [create_tensorboard_callback(dir_name= SAVE_DIR , 
                                                                       experiment_name = 'model_4_bidirectional')])


Saving TensorBoard log files to: model_logs/model_4_bidirectional/20210518-205706
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Wrapping a recurrent layer in Bidirectional usually increases the training time, since the sequence is processed twice. 

In [89]:
# Make predictions with our model 4 
model_4_pred_probs = model_4.predict(val_sentences)
model_4_pred_probs[:10]

array([[1.54750645e-02],
       [8.17575276e-01],
       [9.99906123e-01],
       [1.97568297e-01],
       [1.89016428e-05],
       [9.99742329e-01],
       [2.40832418e-01],
       [9.99943852e-01],
       [9.99914110e-01],
       [9.94363546e-01]], dtype=float32)

In [90]:
# Convert pred probs to labels
model_4_preds = tf.squeeze(tf.round(model_4_pred_probs))
model_4_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 0., 1., 1., 1.], dtype=float32)>

In [92]:
# Calculate the results of our bidirectional model 
model_4_results = classification_evaluation_metrics(val_labels , 
                                                    model_4_preds)
model_4_results

{'Accuracy:': 75.7217847769029,
 'F1_Score: ': 0.7283406754772394,
 'Precision: ': 0.7447447447447447,
 'Recall: ': 0.7126436781609196}

Hmm... our bidirectional model is performing worse than our unidirectional models. 

The takeaway here is that not every experiment will improve on the previous one. 

## Convolutional Neural Networks for Text (and other types of sequences)

We've used CNNs for images, but images are typically 2D (height x width), whereas our text data is 1D (a sequence of tokens). 

Previously we used Conv2D for our image data; now we're going to use Conv1D. 

The typical structure of a Conv1D model for sequences (in our case, text): 
```
Inputs (text) --> Tokenization --> Embedding --> Layer(s) (typically Conv1D + pooling) --> Outputs (class probabilities)
```

### Model 5: Conv1D



In [93]:
# Test out our embedding layer, Conv1D layer and max pooling (let's look at the inputs and outputs)
from tensorflow.keras import layers
# Turn target sequence into an embedding
embedding_test = embedding(text_vectorizer(['this is a test sentence']))

# Building a Conv1D layer
conv_1d = layers.Conv1D(filters = 32 , 
                        kernel_size = 5 , # also referred to as an n-gram of 5 (looks at groups of 5 tokens at a time)
                        activation = 'relu' , 
                        padding = 'valid')

# Conv1D output 
conv_1d_output = conv_1d(embedding_test) # pass test embedding through conv1d layer

# Pooling layer 
max_pool = layers.GlobalMaxPool1D()

# Output of our max pool layer 
# Equivalent to keeping the most important feature, i.e. the highest value of each filter across all positions
max_pool_output = max_pool(conv_1d_output)

In [94]:
# Checking the shapes 
embedding_test.shape , conv_1d_output.shape , max_pool_output.shape

(TensorShape([1, 15, 128]), TensorShape([1, 11, 32]), TensorShape([1, 32]))
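The shapes above follow from how a stride-1 convolution slides over the sequence. Here is a plain-Python sketch with one filter over a sequence of scalars. In the real layer each filter spans all 128 embedding features and there are 32 filters, so this is a simplification. The helper names, toy sequence and kernel are made up.

```python
def conv1d_output_length(sequence_length, kernel_size, padding):
    """Output length of a stride-1 1D convolution for 'valid' or 'same' padding."""
    if padding == 'valid':
        return sequence_length - kernel_size + 1
    if padding == 'same':
        return sequence_length
    raise ValueError(f'Unknown padding: {padding}')

def conv1d_valid(sequence, kernel):
    """Slide one filter over a sequence of scalars (stride 1, no padding, ReLU)."""
    k = len(kernel)
    return [
        max(0.0, sum(sequence[i + j] * kernel[j] for j in range(k)))
        for i in range(len(sequence) - k + 1)
    ]

print(conv1d_output_length(15, 5, 'valid'))  # 15 tokens, kernel of 5
print(conv1d_output_length(15, 5, 'same'))

toy_sequence = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0]
toy_kernel = [1.0, 0.0, -1.0]
feature_map = conv1d_valid(toy_sequence, toy_kernel)
global_max = max(feature_map)  # what global max pooling keeps for this filter
print(feature_map, global_max)
```

With `padding = 'valid'` the 15-token input gives 11 positions, which is the `11` in `TensorShape([1, 11, 32])`, and global max pooling then keeps one value per filter, giving `[1, 32]`.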

In [95]:
# Our embedding text
embedding_test

<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.05982959,  0.06486472, -0.05374949, ...,  0.02540092,
          0.0800107 , -0.01283696],
        [ 0.05204929, -0.00306684,  0.07770057, ..., -0.05686172,
          0.00874661, -0.07750862],
        [-0.05386468, -0.03497231, -0.07387542, ..., -0.03751308,
         -0.00696255,  0.02545308],
        ...,
        [-0.00028458,  0.01705967, -0.00640403, ...,  0.00917574,
          0.02321031, -0.04516872],
        [-0.00028458,  0.01705967, -0.00640403, ...,  0.00917574,
          0.02321031, -0.04516872],
        [-0.00028458,  0.01705967, -0.00640403, ...,  0.00917574,
          0.02321031, -0.04516872]]], dtype=float32)>

In [96]:
# Our Conv1D output 
conv_1d_output

<tf.Tensor: shape=(1, 11, 32), dtype=float32, numpy=
array([[[0.        , 0.        , 0.        , 0.10124576, 0.07095301,
         0.        , 0.12715349, 0.02894124, 0.        , 0.05216077,
         0.        , 0.00677468, 0.        , 0.        , 0.10846195,
         0.        , 0.        , 0.00193499, 0.        , 0.        ,
         0.06222846, 0.09670666, 0.        , 0.05050628, 0.10097651,
         0.        , 0.06739939, 0.01265104, 0.06746676, 0.        ,
         0.0087431 , 0.06327993],
        [0.        , 0.        , 0.        , 0.07699712, 0.15629509,
         0.        , 0.        , 0.02733262, 0.02422336, 0.        ,
         0.        , 0.02526975, 0.        , 0.        , 0.        ,
         0.04897532, 0.        , 0.0352303 , 0.        , 0.05087379,
         0.02565017, 0.        , 0.0415939 , 0.03236188, 0.        ,
         0.10698706, 0.04543754, 0.        , 0.02231793, 0.01832335,
         0.02238441, 0.02943115],
        [0.        , 0.0535873 , 0.00966292, 0.    

In [97]:
# Condense our Conv1D output 
# We take the max over the 11 positions of the (1, 11, 32) tensor, condensing it into (1, 32)
max_pool_output

<tf.Tensor: shape=(1, 32), dtype=float32, numpy=
array([[0.        , 0.0535873 , 0.00966292, 0.10124576, 0.15629509,
        0.03394502, 0.12715349, 0.04721297, 0.04355679, 0.05216077,
        0.04130907, 0.02526975, 0.00225037, 0.        , 0.13346946,
        0.04897532, 0.        , 0.0352303 , 0.10032471, 0.0621746 ,
        0.07607633, 0.09670666, 0.0415939 , 0.07590594, 0.10097651,
        0.10698706, 0.06739939, 0.06557755, 0.06746676, 0.01832335,
        0.07884632, 0.09416923]], dtype=float32)>

Taking on the challenge: building a Conv1D model (here with an extra Dense layer before the pooling layer)

In [98]:
# Creating Conv1D sequence model 
inputs = layers.Input(shape = (1, ) , dtype = tf.string)

x = text_vectorizer(inputs)
x = embedding(x)
print(f'Shape after embedding: {x.shape}')

# Our Conv1D layer 
x = layers.Conv1D(filters = 32 , 
                  kernel_size = 5 , 
                  padding = 'same' , 
                  activation = 'relu')(x)
print(f'Shape after Conv1D: {x.shape}')
x = layers.Dense(64 , activation= 'relu')(x)
print(f'Shape after into the Dense layer: {x.shape}')
# Pool over the sequence axis (GlobalMaxPool1D worked better here)
x = layers.GlobalMaxPool1D()(x)
print(f'Shape after into the GlobalMaxPool layer: {x.shape}')

# Output layer 
outputs = layers.Dense(1 , activation= 'sigmoid')(x)

# Packing into a model 
test_model_5 = tf.keras.Model(inputs , outputs)


Shape after embedding: (None, 15, 128)
Shape after Conv1D: (None, 15, 32)
Shape after into the Dense layer: (None, 15, 64)
Shape after into the GlobalMaxPool layer: (None, 64)
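
The printed shapes follow from two rules: `padding = 'same'` keeps the sequence length at 15, and a Dense layer only acts on the last axis, so `(None, 15, 32)` becomes `(None, 15, 64)`. The sketch below computes the Conv1D output length for both padding modes, assuming a dilation of 1. The helper name `conv1d_output_length` is hypothetical. The earlier `(1, 11, 32)` output is consistent with `'valid'` padding (the Keras default), though the cell that produced it is not shown in this section.

```python
import math


def conv1d_output_length(input_length, kernel_size, padding, stride=1):
    """Output sequence length of a 1D convolution (dilation assumed to be 1)."""
    if padding == "same":
        # The input is padded so only the stride shortens the sequence.
        return math.ceil(input_length / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the sequence.
        return (input_length - kernel_size) // stride + 1
    raise ValueError(f"Unsupported padding: {padding!r}")


same_length = conv1d_output_length(15, 5, "same")
valid_length = conv1d_output_length(15, 5, "valid")
print(same_length, valid_length)  # 15 11
```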


In [100]:
# Compile the model 
test_model_5.compile(loss = tf.keras.losses.BinaryCrossentropy(), 
                     optimizer = tf.keras.optimizers.Adam() ,
                     metrics = ['accuracy'])

In [101]:
# Fitting the model 
test_model_5.fit(train_sentences , 
                 train_labels , 
                 validation_data = (val_sentences , val_labels), 
                 epochs = 5 )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7faf73a58310>

Now following the video: the same Conv1D model without the extra Dense layer


In [108]:
# Creating Conv1D sequence model 
inputs = layers.Input(shape = (1, ) , dtype = tf.string)

x = text_vectorizer(inputs)
x = embedding(x)
print(f'Shape after embedding: {x.shape}')

# Our Conv1D layer 
x = layers.Conv1D(filters = 32 , 
                  kernel_size = 5 , 
                  padding = 'same' , 
                  activation = 'relu')(x)
print(f'Shape after Conv1D: {x.shape}')
#x = layers.Dense(64 , activation= 'relu')(x)
#print(f'Shape after into the Dense layer: {x.shape}')
# Pool over the sequence axis (GlobalMaxPool1D worked better here)
x = layers.GlobalMaxPool1D()(x)
print(f'Shape after into the GlobalMaxPool layer: {x.shape}')

# Output layer 
outputs = layers.Dense(1 , activation= 'sigmoid')(x)

# Packing into a model 
model_5 = tf.keras.Model(inputs , outputs , name = 'model_5_conv_1d')


Shape after embedding: (None, 15, 128)
Shape after Conv1D: (None, 15, 32)
Shape after into the GlobalMaxPool layer: (None, 32)


In [109]:
# Compile the model 
model_5.compile(loss = tf.keras.losses.BinaryCrossentropy(), 
                optimizer = tf.keras.optimizers.Adam() ,
                metrics = ['accuracy'])

In [110]:
model_5.summary()

Model: "model_5_conv_1d"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_11 (InputLayer)        [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 15)                0         
_________________________________________________________________
embedding (Embedding)        (None, 15, 128)           1280000   
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 15, 32)            20512     
_________________________________________________________________
global_max_pooling1d_4 (Glob (None, 32)                0         
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 33        
Total params: 1,300,545
Trainable params: 1,300,545
Non-trainable params: 0
_________________________________________________________________
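
The parameter counts in the summary can be checked by hand. This is a small arithmetic sketch using the layer settings from the cells above. The vocabulary size is not stated in this section, so it is inferred here from the embedding's parameter count.

```python
# Values read off the model summary and the layer definitions above.
embedding_dim = 128
embedding_params = 1_280_000

# Inferred, not stated in this section: vocabulary size implied by the embedding layer.
vocab_size = embedding_params // embedding_dim

# Conv1D: one (kernel_size x input_channels) kernel per filter, plus one bias per filter.
kernel_size, filters = 5, 32
conv1d_params = kernel_size * embedding_dim * filters + filters

# Output Dense layer: one weight per pooled feature, plus one bias.
dense_params = filters * 1 + 1

total_params = embedding_params + conv1d_params + dense_params
print(vocab_size, conv1d_params, dense_params, total_params)  # 10000 20512 33 1300545
```

Almost all of the parameters sit in the embedding layer. The Conv1D and Dense layers together add only about 20,000.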

In [111]:
# Fitting the model 
model_5.fit(train_sentences , 
            train_labels , 
            validation_data = (val_sentences , val_labels), 
            epochs = 5 , 
            callbacks = [create_tensorboard_callback(dir_name = SAVE_DIR , 
                                                     experiment_name = 'model_5_1d_conv_layer')])

Saving TensorBoard log files to: model_logs/model_5_1d_conv_layer/20210518-220430
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7faf72f2ed90>

In [112]:
# Make some predictions with our Conv1D model 
model_5_pred_probs = model_5.predict(val_sentences)
model_5_pred_probs[:10]

array([[8.0735773e-02],
       [8.7902522e-01],
       [9.9997652e-01],
       [6.8679065e-02],
       [1.2854501e-06],
       [9.9195516e-01],
       [9.3900108e-01],
       [9.9999946e-01],
       [9.9999976e-01],
       [8.3305240e-01]], dtype=float32)

In [113]:
# Convert model 5 pred probs to labels 
model_5_preds = tf.squeeze(tf.round(model_5_pred_probs))
model_5_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>
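
Rounding a sigmoid output is the same as thresholding it at 0.5. A pure-Python sketch using the ten probabilities printed above is shown below. It uses a strict `>` comparison because `tf.round` rounds halves to the nearest even number, so a probability of exactly 0.5 becomes 0. The variable names are illustrative.

```python
# The first ten prediction probabilities, copied from the output above.
pred_probs = [
    8.0735773e-02, 8.7902522e-01, 9.9997652e-01, 6.8679065e-02, 1.2854501e-06,
    9.9195516e-01, 9.3900108e-01, 9.9999946e-01, 9.9999976e-01, 8.3305240e-01,
]

# Strictly greater than 0.5 -> class 1, otherwise class 0.
threshold = 0.5
pred_labels = [1.0 if prob > threshold else 0.0 for prob in pred_probs]
print(pred_labels)  # [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```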

In [114]:
# Evaluate model 5 predictions 
model_5_results = classification_evaluation_metrics(val_labels , 
                                                    model_5_preds)

In [115]:
model_5_results

{'Accuracy:': 75.19685039370079,
 'F1_Score: ': 0.7191679049034174,
 'Precision: ': 0.7446153846153846,
 'Recall: ': 0.6954022988505747}
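
The `classification_evaluation_metrics` helper is defined earlier in the notebook and is not shown in this section. As a rough guide to what these four numbers mean, here is a pure-Python sketch for the binary case. The function `binary_classification_metrics`, its key names, and the toy labels are all hypothetical, and the real helper may compute things differently (for example with scikit-learn or a different averaging mode).

```python
def binary_classification_metrics(y_true, y_pred):
    """Accuracy (in percent), precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)

    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": 100 * correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels, made up for illustration.
toy_results = binary_classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(toy_results)

# Consistency check on the reported numbers: F1 is the harmonic mean of precision and recall.
reported_precision = 0.7446153846153846
reported_recall = 0.6954022988505747
f1_check = 2 * reported_precision * reported_recall / (reported_precision + reported_recall)
print(round(f1_check, 6))  # 0.719168
```

The recomputed value agrees with the reported `F1_Score`, which suggests the helper reports the binary (positive-class) F1.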

In [116]:
# Our baseline model's results, for comparison 
evaluation_dict

{'Accuracy:': 79.26509186351706,
 'F1_Score: ': 0.734006734006734,
 'Precision: ': 0.8861788617886179,
 'Recall: ': 0.6264367816091954}
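
To compare the two result dictionaries directly, one option is to subtract them key by key. The sketch below copies both dictionaries from the outputs above. The name `baseline_results` stands in for `evaluation_dict`. Accuracy is reported on a 0 to 100 scale while the other three metrics are on a 0 to 1 scale, so the differences are not directly comparable across rows.

```python
# Copied from the two outputs above.
model_5_results = {
    'Accuracy:': 75.19685039370079,
    'F1_Score: ': 0.7191679049034174,
    'Precision: ': 0.7446153846153846,
    'Recall: ': 0.6954022988505747,
}
baseline_results = {  # the notebook's evaluation_dict
    'Accuracy:': 79.26509186351706,
    'F1_Score: ': 0.734006734006734,
    'Precision: ': 0.8861788617886179,
    'Recall: ': 0.6264367816091954,
}

# Positive values mean the Conv1D model scored higher than the baseline.
differences = {
    name: model_5_results[name] - baseline_results[name]
    for name in baseline_results
}

for name, diff in differences.items():
    # Strip the trailing colon and space from the original key names.
    print(f"{name.strip(': ')}: {diff:+.4f}")
```

On this validation set the Conv1D model trails the baseline on accuracy, precision and F1, and is ahead only on recall.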