# Introduction to NLP Fundamentals in TensorFlow

The goal of NLP (natural language processing) is to derive information from natural language, which could be sequences of text or speech.

Another common term for NLP problems is sequence-to-sequence (seq2seq) problems.

## Check for GPU

In [3]:
!nvidia-smi

Mon Jan  8 16:36:49 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Get helper functions

In [4]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2024-01-08 16:36:50--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2024-01-08 16:36:50 (67.9 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [5]:
# Import a series of helper functions for the notebook
from helper_functions import unzip_data, create_tensorboard_callback, compare_historys, plot_loss_curves

## Get text dataset

The dataset we're going to use is Kaggle's introduction to NLP dataset (text samples of Tweets labelled as disaster or not disaster).

Original source [here](https://www.kaggle.com/competitions/nlp-getting-started)

In [6]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

--2024-01-08 16:36:57--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.204.207, 172.217.203.207, 142.250.97.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.204.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2024-01-08 16:36:57 (124 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



In [7]:
unzip_data("nlp_getting_started.zip")

## Visualizing a text dataset

To visualize our text samples, we first have to read them in. One way to do so would be to use plain Python.

But I prefer to get visual straight away.

So another way to do this is to use pandas...

In [8]:
import pandas as pd
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [9]:
train_df["text"][1389]

'California Bush fires please evacuate affected areas ASAP when california govts advised you to do so http://t.co/ubVEVUuAch'

In [10]:
# Shuffle training dataframe
train_df_shuffled = train_df.sample(frac=1,random_state=42)
train_df_shuffled.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [11]:
# What does the test dataframe look like?
test_df

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan
...,...,...,...,...
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...
3259,10865,,,Storm in RI worse than last hurricane. My city...
3260,10868,,,Green Line derailment in Chicago http://t.co/U...
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...


In [12]:
# how many examples of each class?
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64

In [13]:
# How many samples in total?
len(train_df),len(test_df)

(7613, 3263)

In [14]:
# Let's visualize some random training examples
import random
random_index = random.randint(0,len(train_df)-5) # create a random index no higher than the total number of samples
for row in train_df_shuffled[["text","target"]][random_index:random_index+5].itertuples():
  _,text,target = row
  print(f"Target: {target}", "(real disaster)" if target>0 else "(not real disaster)")
  print(f"Text:\n {text}\n")
  print("---\n")

Target: 0 (not real disaster)
Text:
 I can't bloody wait!! Sony Sets a Date For Stephen KingÛªs Û÷The Dark TowerÛª #stephenking #thedarktower http://t.co/J9LPdRXCDE  @bdisgusting

---

Target: 1 (real disaster)
Text:
 @blairmcdougall and when will you be commenting on Ian Taylor's dealings with mass - murderer Arkan?

---

Target: 1 (real disaster)
Text:
 When ur friend and u are talking about forest fires in a forest and he tells u to drop ur mix tape out there... #straightfire

---

Target: 1 (real disaster)
Text:
 @WaseemBadami Condemning of Deaths More than 1000 due to Heat Wave in Karachi. 
May Allah gv Patience to their Heirs. http://t.co/iTG84q7vIi

---

Target: 1 (real disaster)
Text:
 All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected

---



### Split data into training and validation sets

In [15]:
from sklearn.model_selection import train_test_split
import numpy as np

In [16]:
# Use train_test_split to split training data into training and validation sets
train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df_shuffled["text"].to_numpy(),
                                                                            train_df_shuffled["target"].to_numpy(),
                                                                            test_size=0.1, # dedicate 10% of samples to validation set
                                                                            random_state=42) # random state for reproducibility



In [17]:
# Check the lengths of each
len(train_sentences),len(train_labels),len(val_sentences),len(val_labels)

(6851, 6851, 762, 762)

In [18]:
# Check the first 10
train_sentences[:10],train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object),
 array([0,

## Converting text into numbers

When dealing with a text problem, one of the first things you need to do is convert the text to numbers.



Wonderful! We've got a training set and a validation set containing Tweets and labels.

Our labels are in numerical form (0 and 1) but our Tweets are in string form.

🤔 Question: What do you think we have to do before we can use a machine learning algorithm with our text data?

If you answered something along the lines of "turn it into numbers", you're correct. A machine learning algorithm requires its inputs to be in numerical form.

In NLP, there are two main concepts for turning text into numbers:

**Tokenization** - A straight mapping from word or character or sub-word to a numerical value. There are three main levels of tokenization:
1. Using **word-level tokenization** with the sentence "I love TensorFlow" might result in "I" being 0, "love" being 1 and "TensorFlow" being 2. In this case, every word in a sequence is considered a single token.
2. **Character-level tokenization**, such as converting the letters A-Z to values 1-26. In this case, every character in a sequence is considered a single token.
3. **Sub-word tokenization** is in between word-level and character-level tokenization. It involves breaking individual words into smaller parts and then converting those smaller parts into numbers. For example, "my favourite food is pineapple pizza" might become "my, fav, avour, rite, fo, oo, od, is, pin, ine, app, le, piz, za". After doing this, these sub-words would then be mapped to a numerical value. In this case, every word could be considered multiple tokens.

**Embeddings** - An embedding is a representation of natural language which can be learned. The representation comes in the form of a feature vector. For example, the word "dance" could be represented by the 5-dimensional vector [-0.8547, 0.4559, -0.3332, 0.9877, 0.1112]. It's important to note here that the size of the feature vector is tunable. There are two ways to use embeddings:
1. **Create your own embedding** - Once your text has been turned into numbers (required for an embedding), you can put them through an embedding layer (such as `tf.keras.layers.Embedding`) and an embedding representation will be learned during model training.
2. **Reuse a pre-learned embedding** - Many pre-trained embeddings exist online. These pre-trained embeddings have often been learned on large corpora of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.

*Example of tokenization (straight mapping from word to number) and embedding (richer representation of relationships between tokens).*
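
To make the first two tokenization levels concrete, here's a minimal sketch in plain Python. The dictionaries `word_to_id` and `char_to_id` are names made up for illustration, and real tokenizers handle casing, punctuation and unknown tokens, which this sketch ignores.

```python
# Word-level tokenization: each distinct word gets its own integer id.
sentence = "I love TensorFlow"
word_to_id = {}
for word in sentence.split():
    if word not in word_to_id:
        word_to_id[word] = len(word_to_id)
word_tokens = [word_to_id[word] for word in sentence.split()]
print(word_tokens)  # [0, 1, 2]

# Character-level tokenization: map the letters a-z to the values 1-26.
char_to_id = {chr(ord("a") + i): i + 1 for i in range(26)}
char_tokens = [char_to_id[ch] for ch in "love"]
print(char_tokens)  # [12, 15, 22, 5]
```

Notice the word-level version produces one number per word, while the character-level version produces one number per letter, so the same text becomes a much longer sequence.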

### Text vectorization (tokenization)

In [19]:
train_sentences[:10]

array(['@mogacola @zamtriossu i screamed after hitting tweet',
       'Imagine getting flattened by Kurt Zouma',
       '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
       "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
       'Somehow find you and I collide http://t.co/Ee8RpOahPk',
       '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
       'destroy the free fandom honestly',
       'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
       '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
       'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
      dtype=object)

In [20]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Spell out the main text vectorization parameters
text_vectorizer = TextVectorization(max_tokens=10000000000, # maximum number of words in the vocabulary (includes the padding and OOV tokens)
                                    standardize="lower_and_strip_punctuation",
                                    split="whitespace",
                                    ngrams=None, # create groups of n words?
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length=None, # how long do you want your sequences to be
                                    pad_to_max_tokens=True)

In [21]:
train_sentences[0].split()

['@mogacola', '@zamtriossu', 'i', 'screamed', 'after', 'hitting', 'tweet']

In [22]:
# Find the average number of tokens (words) in the training tweets
round(sum([len(i.split()) for i in train_sentences]) / len(train_sentences))

15

In [23]:
# Set up text vectorization variables
max_vocab_lenght = 10000 # max number of words to have in our vocabulary
max_lenght = 15 # max length our sequences will be

text_vectorizer = TextVectorization(max_tokens=max_vocab_lenght,
                                    output_mode="int",
                                    output_sequence_length=max_lenght)

In [24]:
# Fit the text vectorizer to training text
text_vectorizer.adapt(train_sentences)

In [25]:
# Create a sample sentence and tokenize it
sample_sentence = "There a flood in my street!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[ 74,   3, 232,   4,  13, 698,   0,   0,   0,   0,   0,   0,   0,
          0,   0]])>
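
The trailing zeros in the output above are padding: the sentence has 6 tokens, but we asked for sequences of length 15. Here's a rough plain-Python sketch of the steps involved (lowercase, strip punctuation, split on whitespace, look up ids, then pad or truncate). It only approximates what the layer does and is not its actual implementation. The `vectorize` function and `toy_vocab` are hypothetical, and the ids in `toy_vocab` are made up rather than taken from the adapted vocabulary.

```python
import string

def vectorize(text, vocab, sequence_length):
    """Lowercase, strip punctuation, split on whitespace, map to ids, then pad or truncate."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    ids = [vocab.get(token, 1) for token in cleaned.split()]  # 1 = out-of-vocabulary token
    ids = ids[:sequence_length]                                # truncate long sequences
    return ids + [0] * (sequence_length - len(ids))            # 0 = padding

# Hypothetical toy vocabulary (ids are made up for illustration).
toy_vocab = {"a": 2, "flood": 3, "in": 4, "my": 5}
toy_ids = vectorize("There a flood in my street!", toy_vocab, sequence_length=8)
print(toy_ids)  # [1, 2, 3, 4, 5, 1, 0, 0]
```

"There" and "street" aren't in the toy vocabulary, so they both map to 1, and the last two slots are filled with 0 to reach the requested length.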

In [26]:
# Choose a random sentence from the training dataset and tokenize it
random_sentence = random.choice(train_sentences)
print(f"Original text: \n {random_sentence}\n\nVectorized version:")
text_vectorizer(random_sentence)

Original text: 
 @Barbi_Twins We need help-horses will die! Please RT &amp; sign petition! Take a stand &amp; be a voice for them! #gilbert23 https://t.co/e8dl1lNCVu

Vectorized version:


<tf.Tensor: shape=(15,), dtype=int64, numpy=
array([   1,   46,  162,    1,   38,  686,  170,   96,   35,  986, 1381,
        167,    3,  807,   35])>

In [27]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary() # get all of the unique words in our training data
top_5_words = words_in_vocab[:5] # get the most common words
bottom_5_words = words_in_vocab[-5:] # get the least common words
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"5 most common words: {top_5_words}")
print(f"5 least common words: {bottom_5_words}")

Number of words in vocab: 10000
5 most common words: ['', '[UNK]', 'the', 'a', 'in']
5 least common words: ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


In [28]:
train_sentences

array(['@mogacola @zamtriossu i screamed after hitting tweet',
       'Imagine getting flattened by Kurt Zouma',
       '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
       ...,
       'Near them on the sand half sunk a shattered visage lies... http://t.co/0kCCG1BT06',
       "kesabaran membuahkan hasil indah pada saat tepat! life isn't about waiting for the storm to pass it's about learning to dance in the rain.",
       "@ScottDPierce @billharris_tv @HarrisGle @Beezersun I'm forfeiting this years fantasy football pool out of fear I may win n get my ass kicked"],
      dtype=object)

### Creating Embedding using embedding layer

To make our embedding, we're going to use TensorFlow's embedding layer.

The parameters we care most about for our embedding layer:
* `input_dim` = the size of our vocabulary
* `output_dim` = the size of the output embedding vector; for example, a value of 100 would mean each token gets represented by a vector 100 long
* `input_length` = length of the sequences being passed to the embedding layer
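
One way to picture these parameters: an embedding is essentially a lookup table with `input_dim` rows and `output_dim` columns, and each token id selects one row. The sketch below uses plain Python lists with toy sizes to show the shapes involved. The names `embedding_table` and `token_ids` are made up for illustration, and a real embedding layer would update the rows during training.

```python
import random

vocab_size = 10      # plays the role of input_dim: one row per token id
embedding_dim = 4    # plays the role of output_dim: length of each token's vector

rng = random.Random(42)
# Rows start as small random values (a real layer learns better ones during training).
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

token_ids = [7, 3, 3, 0, 0]  # a tokenized sequence of length 5 (0 = padding)
embedded = [embedding_table[i] for i in token_ids]

print(len(embedded), len(embedded[0]))  # 5 4
print(embedded[1] == embedded[2])       # True
```

A sequence of 5 token ids becomes 5 vectors of length 4, and repeated ids (including the padding id) always get the same vector.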

In [29]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim = max_vocab_lenght, # set input shape
                             output_dim=128,
                             input_length = max_lenght)
embedding

<keras.src.layers.core.embedding.Embedding at 0x7f46808183d0>

In [30]:
# Get a random sentence from the training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n {random_sentence}\n\nEmbedded version: ")

# Embed the random sentence (turn it into a dense vector of fixed size)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
 Stretcher in 5 min // Speaker Deck http://t.co/0YO2l38OZr

Embedded version: 


<tf.Tensor: shape=(1, 15, 128), dtype=float32, numpy=
array([[[-0.01598261,  0.04673567,  0.02288642, ...,  0.02317688,
          0.01327011, -0.04312951],
        [-0.01099749, -0.01887019,  0.01733751, ..., -0.04299319,
         -0.01748555, -0.04535202],
        [ 0.01246585,  0.01868539,  0.00494437, ...,  0.04399469,
          0.02634717,  0.04122387],
        ...,
        [-0.03359035, -0.02199047,  0.02893018, ...,  0.02078963,
         -0.03316555,  0.03169684],
        [-0.03359035, -0.02199047,  0.02893018, ...,  0.02078963,
         -0.03316555,  0.03169684],
        [-0.03359035, -0.02199047,  0.02893018, ...,  0.02078963,
         -0.03316555,  0.03169684]]], dtype=float32)>

In [31]:
# Check out a single token's embedding
sample_embed[0][0],sample_embed[0][0].shape,random_sentence

(<tf.Tensor: shape=(128,), dtype=float32, numpy=
 array([-0.01598261,  0.04673567,  0.02288642, -0.03519303, -0.03679321,
        -0.0099238 ,  0.04444772,  0.03505744, -0.00540862,  0.00846655,
         0.00780616, -0.01712239, -0.01039619, -0.00989244,  0.00185903,
        -0.00496539, -0.00864682, -0.03053116, -0.03437829, -0.00507392,
        -0.03937124,  0.02831283, -0.0208164 ,  0.01405773,  0.03690262,
        -0.04777994,  0.02544742, -0.04787599,  0.00117036,  0.00463231,
        -0.02945217, -0.04467893,  0.01076509, -0.0237064 , -0.03621825,
         0.00765058, -0.00828564,  0.01042423, -0.03571744,  0.00673657,
        -0.03452072, -0.00984845,  0.04239893,  0.03934742, -0.00747789,
         0.0061883 ,  0.03777159, -0.01128957, -0.0219922 ,  0.02186752,
         0.01424487,  0.02940557, -0.0434904 ,  0.03657618,  0.03476819,
         0.00162678, -0.04456943,  0.04879434,  0.0495207 ,  0.01358495,
         0.04135114,  0.03871853,  0.03311474,  0.02245876,  0.00154401,
  

## Modelling a text dataset (running a series of experiments)

Now we've got a way to turn our text sequences into numbers, it's time to start building a series of modelling experiments.

We'll start with a baseline and move on from there.

* Model 0: Naive Bayes (baseline)
* Model 1: Feed-forward neural network (dense model)
* Model 2: LSTM model (RNN)
* Model 3: GRU model (RNN)
* Model 4: Bidirectional LSTM model (RNN)
* Model 5: 1D Convolutional Neural Network (CNN)
* Model 6: TensorFlow Hub pretrained feature extractor (using transfer learning for NLP)
* Model 7: Same as model 6 with 10% of training data


How are we going to approach all of these?

Use the standard steps in modelling with TensorFlow:

* Create a model
* Compile the model
* Fit the model
* Evaluate the model

### Model 0: Getting a baseline

As with all machine learning modelling experiments, it's important to create a baseline model so you've got a benchmark for future experiments to build upon.

To create our baseline, we'll use scikit-learn's Multinomial Naive Bayes, using TF-IDF to convert our words to numbers.

**NOTE**: It's common practice to use non-DL algorithms as a baseline because of their speed, and then later use DL to see if you can improve upon them.
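
To get a feel for what TF-IDF does, here's a simplified plain-Python sketch: a word scores higher when it appears often in a document but rarely across the other documents. The smoothed IDF formula and the L2 normalisation below follow what I understand scikit-learn's defaults to be, so treat that as an assumption. The toy `docs` and the `tfidf` function are made up for illustration, and the tokenization is just `str.split`.

```python
import math
from collections import Counter

docs = ["forest fire near town", "fire drill at school", "great day at school"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Raw term count times smoothed inverse document frequency, then L2-normalised."""
    counts = Counter(tokens)
    weights = {term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
               for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_doc_weights = tfidf(tokenized[0])
print(first_doc_weights)
```

In the first document, "fire" also appears in the second document, so it ends up with a lower weight than "forest", which is unique to the first.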

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Create tokenization and modelling pipeline
model_0 = Pipeline([
                    ("tfidf", TfidfVectorizer()), # convert words to numbers using TF-IDF
                    ("clf", MultinomialNB()) # model the text
])

# Fit the pipeline to training data
model_0.fit(train_sentences,train_labels)


In [33]:
# Evaluate our baseline model
baseline_score = model_0.score(val_sentences,val_labels)
print(f"Our baseline model achieves an accuracy of: {baseline_score*100:.2f}%")

Our baseline model achieves an accuracy of: 79.27%


In [34]:
# Make predictions
baseline_preds = model_0.predict(val_sentences)
baseline_preds[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

### Creating an evaluation function for our model experiments

We could evaluate each model's predictions with different metrics by hand every time, but this would be cumbersome and can easily be fixed with a function.

Let's create one to compare our model's predictions with the truth labels using the following metrics:
* Accuracy
* Precision
* Recall
* F1-score

For more on scikit-learn's evaluation metrics, take a look at: https://scikit-learn.org/stable/modules/model_evaluation.html
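
If these four metrics are new to you, here's a small plain-Python sketch of how they're computed for the positive class of a binary problem. The `binary_metrics` function and the toy labels are made up for illustration. It scores class 1 only, whereas a "weighted" average scores each class and weights it by how many samples it has, so the numbers won't match that setting exactly.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, plus precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
toy_results = binary_metrics(toy_true, toy_pred)
print(toy_results)
```

Here 4 of 6 predictions are correct, 2 of the 3 predicted positives are real positives (precision) and 2 of the 3 real positives were found (recall), so all four values come out at roughly 0.67.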

In [35]:
# Function to evaluate: accuracy, precision, recall, F1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred):
  """
  Calculates model accuracy, precision, recall and f1-score of a binary classification model.

  Accuracy is returned as a percentage (0-100); the other metrics are fractions (0-1).
  """
  # Calculate model accuracy
  model_accuracy = accuracy_score(y_true, y_pred) * 100
  # Calculate model precision, recall and f1-score using "weighted" average
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
  model_results = {"accuracy": model_accuracy,
                   "precision": model_precision,
                   "recall": model_recall,
                   "f1": model_f1}
  return model_results

In [36]:
# Get baseline results
baseline_results = calculate_results(y_true=val_labels,
                                     y_pred=baseline_preds)

baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

## Model 1: A simple dense model

In [37]:
# Create a TensorBoard callback (need to create a new one for each model)
from helper_functions import create_tensorboard_callback

# Create a directory to save TensorBoard logs
SAVE_DIR = "model_logs"

In [38]:
# Build model with the functional API
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,),dtype=tf.string) # inputs are 1-dimensional strings
x = text_vectorizer(inputs) # turn the input text into numbers
x = embedding(x) # create an embedding of the numerized inputs
x = layers.GlobalAveragePooling1D()(x) # condense the per-token feature vectors into one vector per sequence
outputs = layers.Dense(1,activation="sigmoid")(x) # create the output layer; we want binary outputs so use the sigmoid activation function
model_1 = tf.keras.Model(inputs,outputs,name="model_1_dense")
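
The pooling step in the model above is what turns a sequence of token vectors into a single vector the dense layer can use. As a rough sketch in plain Python (the `global_average_pool` function and the toy numbers are made up for illustration), it averages each feature across all the tokens in the sequence:

```python
def global_average_pool(sequence):
    """Average a list of token vectors element-wise into a single vector."""
    n_tokens = len(sequence)
    return [sum(column) / n_tokens for column in zip(*sequence)]

# A toy "embedded" sequence: 3 tokens, each represented by a 2-dimensional vector.
toy_sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(toy_sequence)
print(pooled)  # [3.0, 4.0]
```

In our model the same idea takes each sequence from 15 token vectors of length 128 down to one vector of length 128.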

In [39]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (Text  (None, 15)                0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (  (None, 128)               0         
 GlobalAveragePooling1D)                                         
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1280129 (4.88 MB)
Trainable params: 128

In [40]:
# Compile the model
model_1.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [41]:
# Fit the model
model_1_history = model_1.fit(train_sentences, # input sentences can be a list of strings due to text preprocessing layer built-in model
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences, val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                     experiment_name="simple_dense_model")])

Saving TensorBoard log files to: model_logs/simple_dense_model/20240108-163702
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [42]:
# Check the results (after adding the GlobalAveragePooling1D layer)
model_1.evaluate(val_sentences,val_labels)



[0.4796052575111389, 0.787401556968689]

In [44]:
# Make some predictions and evaluate those
model_1_pred_probs = model_1.predict(val_sentences)
model_1_pred_probs.shape



(762, 1)

In [45]:
# Look at a single prediction
model_1_pred_probs[0]

array([0.38766468], dtype=float32)

In [46]:
model_1_pred_probs[:10]

array([[0.38766468],
       [0.8640733 ],
       [0.99816114],
       [0.09357543],
       [0.12335065],
       [0.9152779 ],
       [0.902844  ],
       [0.99356174],
       [0.9639493 ],
       [0.23676535]], dtype=float32)

In [47]:
# Convert model prediction probabilities to label format
model_1_preds = tf.squeeze(tf.round(model_1_pred_probs))
model_1_preds[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 0.], dtype=float32)>
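
Rounding the sigmoid outputs is the same as applying a 0.5 decision threshold: anything above 0.5 becomes class 1, everything else class 0. Here's that idea in plain Python, using the first 10 prediction probabilities printed earlier (the names `pred_probs` and `pred_labels` are just for this sketch):

```python
# First 10 prediction probabilities from model_1 (copied from the output above).
pred_probs = [0.38766468, 0.8640733, 0.99816114, 0.09357543, 0.12335065,
              0.9152779, 0.902844, 0.99356174, 0.9639493, 0.23676535]

# Apply a 0.5 threshold to turn probabilities into class labels.
pred_labels = [1 if prob > 0.5 else 0 for prob in pred_probs]
print(pred_labels)  # [0, 1, 1, 0, 0, 1, 1, 1, 1, 0]
```

Writing the threshold out like this also makes it easy to try a different cut-off later, for example if missing a real disaster is more costly than a false alarm.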

In [48]:
# Calculate our model_1 results
model_1_results = calculate_results(y_true=val_labels,
                                    y_pred=model_1_preds)
model_1_results

{'accuracy': 78.74015748031496,
 'precision': 0.7932296029485675,
 'recall': 0.7874015748031497,
 'f1': 0.7841130596930417}

In [49]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

In [50]:
np.array(list(model_1_results.values())) > np.array(list(baseline_results.values()))

array([False, False, False, False])

## Visualizing learned embeddings

In [51]:
# Get the vocabulary from the vectorization layer
words_in_vocab = text_vectorizer.get_vocabulary()
len(words_in_vocab),words_in_vocab[:10]

(10000, ['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is'])

In [52]:
max_vocab_lenght

10000

In [53]:
# Model 1 summary
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (Text  (None, 15)                0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 global_average_pooling1d (  (None, 128)               0         
 GlobalAveragePooling1D)                                         
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1280129 (4.88 MB)
Trainable params: 128

In [54]:
# Get the weight matrix of the embedding layer
# (these are the numerical representations of each token in our training data, which have been learned for 5 epochs)

embed_weights = model_1.get_layer("embedding").get_weights()[0]
print(embed_weights.shape) # same size as vocab size and embedding dim

(10000, 128)


Now we've got the embedding matrix our model has learned to represent our tokens, let's see how we can visualize it.

To do so, TensorFlow has a handy tool called Projector: https://projector.tensorflow.org/

TensorFlow also has an incredible guide on word embeddings themselves.

In [55]:
# Create embedding files (we got this from TensorFlow's word embeddings guide)
import io
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')

for index, word in enumerate(words_in_vocab):
  if index == 0:
    continue  # skip 0, it's padding.
  vec = embed_weights[index]
  out_v.write('\t'.join([str(x) for x in vec]) + "\n")
  out_m.write(word + "\n")
out_v.close()
out_m.close()

# Download files from Colab to upload to [projector](https://projector.tensorflow.org/)

In [56]:
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception:
  pass


## Recurrent Neural Networks (RNNs)

RNNs are useful for sequence data.

The premise of a recurrent neural network is to use the representation of a previous input to aid the representation of a later input.
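
To see that premise in action, here's a tiny sketch of a recurrent step in plain Python with a single number as the hidden state. The `simple_rnn` function and its weights (`w_x`, `w_h`, `b`) are made up for illustration. Real RNN layers use weight matrices learned during training, and LSTM and GRU cells add gates on top of this basic idea.

```python
import math

def simple_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit recurrent cell over a sequence and return every hidden state."""
    hidden = 0.0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        hidden = math.tanh(w_x * x + w_h * hidden + b)
        states.append(hidden)
    return states

states_a = simple_rnn([1.0, 0.0, 0.0])
states_b = simple_rnn([0.0, 0.0, 0.0])
print(states_a)
print(states_b)
```

The two sequences end with the same inputs, but the final state of the first is still non-zero because it carries a trace of the first input.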

If you want an overview of the internals of a recurrent neural network, see the following:

**Resources**

 * MIT's sequence modelling lecture: http://introtodeeplearning.com/

 * Chris Olah's Understanding LSTM Networks (maybe the best overview on the net): https://colah.github.io/posts/2015-08-Understanding-LSTMs/

 * Andrej Karpathy's The Unreasonable Effectiveness of Recurrent Neural Networks: http://karpathy.github.io/2015/05/21/rnn-effectiveness/


### Model 2: LSTM

LSTM = long short-term memory (one of the most popular RNN cells).

The structure of an RNN typically looks like this:

```
Input (text) -> Tokenize -> Embedding -> Layers (RNNs/dense) -> Output (label probabilities)
```

In [57]:
# Create an LSTM model
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,),dtype="string")
x = text_vectorizer(inputs)
x = embedding(x)
#print(x.shape)
#x = layers.LSTM(64,return_sequences=True)(x) # when stacking RNN cells together, you need to set return_sequences=True
#print(x.shape)
x = layers.LSTM(64)(x)
#print(x.shape)
#x = layers.Dense(64,activation="relu")(x)
#print(x.shape)
outputs = layers.Dense(1,activation="sigmoid")(x)
model_2 = tf.keras.Model(inputs,outputs,name="model_2_LSTM")

In [58]:
# Get a summary
model_2.summary()

Model: "model_2_LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (Text  (None, 15)                0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 lstm (LSTM)                 (None, 64)                49408     
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1329473 (5.07 MB)
Trainable params: 1329473 (5.07 MB)
Non-trainable params: 0 (0.00 Byte)
________________
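
The parameter counts in the summary above can be checked by hand. Here's a short sketch of the arithmetic (variable names are mine): the LSTM keeps four sets of weights (input, forget and output gates plus the candidate cell state), each with input weights, recurrent weights and a bias.

```python
vocab_size, embedding_dim, lstm_units = 10000, 128, 64

# Embedding: one vector of length embedding_dim per token in the vocabulary.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 weight sets, each with (input + recurrent) weights and a bias per unit.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense output layer: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)  # 1280000 49408 65 1329473
```

Most of the parameters live in the embedding, which is worth remembering when comparing the sizes of the different models.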

In [59]:
# Compile the model
model_2.compile(loss="binary_crossentropy",
                optimizer= tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [60]:
# Fit the model
model_2_history = model_2.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences,val_labels),
                              callbacks=[create_tensorboard_callback(dir_name=SAVE_DIR,
                                                                     experiment_name="model_2_LSTM")])

Saving TensorBoard log files to: model_logs/model_2_LSTM/20240108-163727
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [61]:
# Make predictions with LSTM model
model_2_pred_probs = model_2.predict(val_sentences)
model_2_pred_probs[:10]



array([[6.3866109e-01],
       [6.9697863e-01],
       [9.9990070e-01],
       [5.1107772e-02],
       [3.9767404e-04],
       [9.9981624e-01],
       [9.9476886e-01],
       [9.9992621e-01],
       [9.9989545e-01],
       [9.4468755e-01]], dtype=float32)
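
Each value is a sigmoid output, so it reads as the predicted probability of the positive class. Turning these into hard labels is a threshold at 0.5. A minimal plain-Python sketch of that step, using the ten values printed above (`probs_to_labels` is a hypothetical helper, not part of TensorFlow):

```python
# Threshold sigmoid outputs at 0.5 to get hard class labels.
# The probabilities are the first ten model_2 values printed above.
pred_probs = [0.63866109, 0.69697863, 0.99990070, 0.051107772, 0.00039767404,
              0.99981624, 0.99476886, 0.99992621, 0.99989545, 0.94468755]

def probs_to_labels(probs, threshold=0.5):
    """Return 1 when the probability exceeds the threshold, otherwise 0."""
    return [1 if p > threshold else 0 for p in probs]

labels = probs_to_labels(pred_probs)
print(labels)  # [1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
```

Raising the threshold trades recall for precision, which can matter when false positives are costly.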

In [62]:
# Convert model 2 pred probs to labels
model_2_preds = tf.squeeze(tf.round(model_2_pred_probs))
model_2_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([1., 1., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [63]:
# Calculate model 2 results
model_2_results = calculate_results(y_true=val_labels,
                                    y_pred=model_2_preds)
model_2_results

{'accuracy': 77.42782152230971,
 'precision': 0.7743891358240415,
 'recall': 0.7742782152230971,
 'f1': 0.7743283279430294}

In [64]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

### Model 3: GRU

Another popular and effective RNN component is the GRU, or gated recurrent unit.

The GRU cell has similar features to an LSTM cell but has fewer parameters.

In [65]:
# Build an RNN using the GRU cell
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,),dtype=tf.string)
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.GRU(64)(x)
#print(x.shape)
#x = layers.GRU(64, return_sequences=True)(x) # to stack recurrent layers on top of each other, set return_sequences=True
#print(x.shape)
#x = layers.LSTM(42, return_sequences=True)(x)
#print(x.shape)
#x = layers.GRU(99)(x)
#print(x.shape)
#x = layers.Dense(64,activation="relu")(x)
outputs = layers.Dense(1,activation="sigmoid")(x)
model_3 = tf.keras.Model(inputs,outputs,name="model_3_GRU")

In [66]:
# Get the summary
model_3.summary()

Model: "model_3_GRU"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (Text  (None, 15)                0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 gru (GRU)                   (None, 64)                37248     
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1317313 (5.03 MB)
Trainable params: 1317313 (5.03 MB)
Non-trainable params: 0 (0.00 Byte)
_________________
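
The "fewer parameters" claim can be checked against the two summaries. A GRU has three weight sets where an LSTM has four. The sketch below assumes two bias vectors per weight set for the GRU, which is what makes the arithmetic match the 37248 reported above (this appears to correspond to Keras' default `reset_after=True`). The helper names are hypothetical.

```python
# Compare GRU and LSTM parameter counts for 128-dim inputs and 64 units.
# Assumption: the GRU keeps two bias vectors per weight set, chosen because
# it reproduces the 37248 shown in model_3.summary().

def gru_params(input_dim: int, units: int, biases_per_set: int = 2) -> int:
    """Three weight sets: update gate, reset gate, candidate state."""
    per_set = input_dim * units + units * units + biases_per_set * units
    return 3 * per_set

def lstm_params(input_dim: int, units: int) -> int:
    """Four weight sets, one bias vector each."""
    return 4 * (input_dim * units + units * units + units)

gru = gru_params(128, 64)
lstm = lstm_params(128, 64)
print(gru, lstm, lstm - gru)  # 37248 49408 12160
```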

In [67]:
# Compile the model
model_3.compile(loss="binary_crossentropy",
                optimizer = tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [68]:
#Fit the model
model_3_history = model_3.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences,val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "model_3_GRU")])

Saving TensorBoard log files to: model_logs/model_3_GRU/20240108-163750
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [69]:
# Make some predictions with GRU
model_3_pred_probs = model_3.predict(val_sentences)
model_3_pred_probs[:10]



array([[2.1129665e-04],
       [5.4701293e-01],
       [9.9977177e-01],
       [2.7511952e-02],
       [3.3837947e-05],
       [9.9870944e-01],
       [5.9870875e-01],
       [9.9993110e-01],
       [9.9981993e-01],
       [2.9403311e-01]], dtype=float32)

In [70]:
# Convert model 3 pred probs to labels
model_3_preds = tf.squeeze(tf.round(model_3_pred_probs))
model_3_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0.], dtype=float32)>

In [71]:
# Calculate model 3 results
model_3_results = calculate_results(y_true=val_labels,
                                    y_pred=model_3_preds)
model_3_results

{'accuracy': 77.42782152230971,
 'precision': 0.7803914348835892,
 'recall': 0.7742782152230971,
 'f1': 0.7704556073136567}

### Model 4: Bidirectional RNN

A normal RNN reads a sequence from left to right (just like you read an English sentence). A bidirectional RNN reads it from right to left as well as from left to right.
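
To make the idea concrete, here is a toy sketch in plain Python, not the Keras implementation. It runs a one-unit tanh RNN over a short sequence in both directions and concatenates the two final states. The weights and the input sequence are arbitrary illustrative values, and the function names are hypothetical.

```python
import math

def rnn_final_state(sequence, w_in, w_rec):
    """Run a one-unit tanh RNN over a list of floats and return the final state."""
    state = 0.0
    for value in sequence:
        state = math.tanh(w_in * value + w_rec * state)
    return state

def bidirectional_final_states(sequence):
    """Each direction has its own weights; the two results are concatenated."""
    forward = rnn_final_state(sequence, w_in=0.5, w_rec=0.8)         # left to right
    backward = rnn_final_state(sequence[::-1], w_in=0.3, w_rec=0.6)  # right to left
    return [forward, backward]

toy_sequence = [0.1, 0.7, -0.3, 0.9]
states = bidirectional_final_states(toy_sequence)
print(states)
```

One unit per direction gives two output values. In the same way, wrapping a 64-unit LSTM in `layers.Bidirectional` gives 128 output features and roughly doubles the recurrent parameters.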


In [72]:
# Build a bidirectional RNN in TensorFlow
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,),dtype=tf.string)
x = text_vectorizer(inputs)
x = embedding(x)
#x = layers.Bidirectional(layers.LSTM(64,return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
outputs = layers.Dense(1,activation="sigmoid")(x)
model_4 = tf.keras.Model(inputs,outputs,name="model_4_bidirectional")

In [73]:
# Get a summary
model_4.summary()

Model: "model_4_bidirectional"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (Text  (None, 15)                0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 bidirectional (Bidirection  (None, 128)               98816     
 al)                                                             
                                                                 
 dense_3 (Dense)             (None, 1)                 129       
                                                                 
Total params: 1378945 (5.26 MB)
Trainable par

In [74]:
# Compile model
model_4.compile(loss="binary_crossentropy",
                optimizer = tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [75]:
#Fit the model
model_4_history = model_4.fit(train_sentences,train_labels,
                              epochs=5,
                              validation_data=(val_sentences,val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "model_4_bidirectional")])

Saving TensorBoard log files to: model_logs/model_4_bidirectional/20240108-163814
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [76]:
# Make predictions with our bidirectional model
model_4_pred_probs = model_4.predict(val_sentences)
model_4_pred_probs[:10]



array([[7.1597574e-03],
       [8.9380240e-01],
       [9.9997902e-01],
       [8.1446134e-02],
       [3.7744205e-05],
       [9.9984014e-01],
       [9.3351400e-01],
       [9.9998713e-01],
       [9.9997926e-01],
       [9.5819014e-01]], dtype=float32)

In [77]:
# Convert pred probs to pred labels
model_4_preds = tf.squeeze(tf.round(model_4_pred_probs))
model_4_preds[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([0., 1., 1., 0., 0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 1.], dtype=float32)>

In [78]:
# Calculate results of our bidirectional model
model_4_results = calculate_results(y_true=val_labels,
                                    y_pred=model_4_preds)
model_4_results

{'accuracy': 76.24671916010499,
 'precision': 0.7648032381971924,
 'recall': 0.7624671916010499,
 'f1': 0.7598225942829093}

In [79]:
model_3_results

{'accuracy': 77.42782152230971,
 'precision': 0.7803914348835892,
 'recall': 0.7742782152230971,
 'f1': 0.7704556073136567}

## Convolutional Neural Networks for Text (and other types of sequences)

We've used CNNs for images, but images are 2D, whereas our text data is 1D.

Previously we used Conv2D for our image data, but now we will use Conv1D.

The typical structure of a Conv1D model for sequences (in our case, text):
```
Inputs (text) -> Tokenization -> Embedding -> Layers (typically Conv1D + pooling) -> Outputs (class probabilities)
```

### Model 5: Conv1D

For explanations of the different parameters, see:
* https://poloclub.github.io/cnn-explainer/ (this covers 2D convolutions, but the ideas carry over to 1D data)
* For the difference between `valid` and `same` padding, see the discussions on Stack Overflow
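
The effect of the padding choice on the output length can be worked out directly. A small sketch, assuming the sequence length of 15 used throughout this notebook (`conv1d_output_length` is a hypothetical helper, not a Keras function):

```python
import math

def conv1d_output_length(input_length, kernel_size, stride=1, padding="valid"):
    """Number of positions a 1D convolution produces along the sequence axis."""
    if padding == "same":
        # Input is padded so that every position (per stride) gets an output.
        return math.ceil(input_length / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the sequence.
        return (input_length - kernel_size) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")

print(conv1d_output_length(15, 5, padding="same"))   # 15
print(conv1d_output_length(15, 5, padding="valid"))  # 11
```

With `same` padding the sequence length is preserved. With `valid` padding a kernel of size 5 shortens a 15-token sequence to 11 positions.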

In [92]:
from tensorflow.keras import layers
embedding_test = embedding(text_vectorizer(["this is a test sentence"])) # turn target sequence into embedding
conv_1d = layers.Conv1D(filters=32,
                        kernel_size=5, # also referred to as an n-gram of 5 (meaning it looks at 5 words at a time)
                        activation="relu",
                        strides=1,
                        padding="same")

conv_1d_output = conv_1d(embedding_test) # pass test embedding through Conv1D layer
max_pool = layers.GlobalMaxPool1D()
max_pool_output = max_pool(conv_1d_output) # equivalent to "get the most important feature" or "get the feature with the maximum value"

embedding_test.shape , conv_1d_output.shape, max_pool_output.shape

(TensorShape([1, 15, 128]), TensorShape([1, 15, 32]), TensorShape([1, 32]))
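
The last shape drops the sequence axis because global max pooling keeps only the largest activation of each filter across all positions. A plain-Python sketch for a single example with 3 positions and 3 filters (toy numbers, hypothetical function name):

```python
def global_max_pool_1d(feature_map):
    """feature_map: list of positions, each a list of per-filter activations.
    Returns one value per filter: its maximum over all positions."""
    return [max(column) for column in zip(*feature_map)]

# 3 positions x 3 filters (illustrative values only)
feature_map = [
    [0.1, 0.0, 0.4],
    [0.9, 0.2, 0.0],
    [0.3, 0.7, 0.1],
]
print(global_max_pool_1d(feature_map))  # [0.9, 0.7, 0.4]
```

The same reduction applied to a (15, 32) feature map yields 32 values, one per filter.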

In [96]:
#embedding_test

In [95]:
#conv_1d_output

In [98]:
#max_pool_output

In [107]:
# Create 1-dimensional convolutional layer to model sequences
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,),dtype=tf.string)
x = text_vectorizer(inputs)
x = embedding(x)
x = layers.Conv1D(filters=64,kernel_size=5,strides=1,activation="relu",padding="valid")(x)
x = layers.GlobalMaxPool1D()(x)
# x = layers.Dense(64,activation="relu")(x)
outputs = layers.Dense(1,activation="sigmoid")(x)
model_5 = tf.keras.Model(inputs,outputs,name="model_5_Conv1D")

# Compile Conv1D
model_5.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

# Get the summary
model_5.summary()

Model: "model_5_Conv1D"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_7 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization_1 (Text  (None, 15)                0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 conv1d_9 (Conv1D)           (None, 11, 64)            41024     
                                                                 
 global_max_pooling1d_7 (Gl  (None, 64)                0         
 obalMaxPooling1D)                                               
                                                                 
 dense_5 (Dense)             (None, 1)              

In [108]:
# Fit the model
model_5_history = model_5.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences,val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "conv1D")])

Saving TensorBoard log files to: model_logs/conv1D/20240108-170737
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [111]:
# Make some predictions with our Conv1D model
model_5_pred_probs = model_5.predict(val_sentences)
model_5_pred_probs[:20]



array([[3.8346547e-01],
       [8.6945421e-01],
       [9.9986887e-01],
       [4.6706986e-02],
       [4.7398668e-08],
       [9.9618393e-01],
       [8.8452250e-01],
       [9.9997914e-01],
       [9.9999845e-01],
       [7.8410906e-01],
       [8.2493287e-08],
       [9.3041492e-01],
       [6.8133102e-07],
       [3.7340057e-01],
       [3.2465323e-07],
       [3.5280387e-03],
       [5.1377085e-04],
       [8.2370379e-06],
       [1.3145802e-02],
       [9.9262804e-01]], dtype=float32)

In [113]:
#Convert model 5 pred probs to labels
model_5_preds = tf.squeeze(tf.round(model_5_pred_probs))
model_5_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 1., 1., 0., 0., 1., 1., 1., 1., 1.], dtype=float32)>

In [114]:
# Evaluate model 5 predictions
model_5_results = calculate_results(y_true=val_labels,
                                    y_pred=model_5_preds)
model_5_results

{'accuracy': 76.77165354330708,
 'precision': 0.7683074753719822,
 'recall': 0.7677165354330708,
 'f1': 0.7661635916954678}

In [115]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}

## Model 6: TensorFlow Hub Pretrained Sentence Encoder

Now that we've built a few of our own models, let's try transfer learning for NLP, specifically using TensorFlow Hub's Universal Sentence Encoder.


In [116]:
import tensorflow_hub as hub

embed = hub.load("https://www.kaggle.com/models/google/universal-sentence-encoder/frameworks/TensorFlow2/variations/universal-sentence-encoder/versions/2")
embed_samples = embed([sample_sentence,
                       "When you call the universal sentence encoder on a sentence, it turns it into numbers."
])
print(embed_samples[0][:50])

tf.Tensor(
[ 0.01242417  0.0188776   0.02122122 -0.02356055  0.01999637  0.08199081
  0.00489581  0.04467827 -0.04281935 -0.00426615  0.02165267 -0.01137037
  0.00148258  0.05903588  0.06165852 -0.02590531  0.03185736 -0.06056454
  0.01157929 -0.06712484 -0.01647752  0.02402009  0.02785436  0.00611262
  0.00701467 -0.04393741  0.01045688 -0.00949777 -0.01883714 -0.00692644
 -0.04347689  0.05181637 -0.01878772  0.00117221  0.02125993 -0.08305568
  0.03174268  0.05086726 -0.03023683 -0.08832537  0.01250006  0.00097091
 -0.0039418   0.0595032  -0.10078734 -0.04334236  0.01202807 -0.02835169
 -0.0445304   0.0203348 ], shape=(50,), dtype=float32)


In [118]:
embed_samples[0].shape

TensorShape([512])

In [119]:
# Create a Keras Layer using the USE pretrained layer from tensorflow hub
sentence_encoder_layer = hub.KerasLayer("https://www.kaggle.com/models/google/universal-sentence-encoder/frameworks/TensorFlow2/variations/universal-sentence-encoder/versions/2",
                                        input_shape=[],
                                        dtype=tf.string,
                                        trainable=False,
                                        name="USE")

In [137]:
# Create model using the Sequential API
model_6 = tf.keras.Sequential([
    sentence_encoder_layer,
    layers.Dense(64,activation="relu"),
    layers.Dense(1,activation="sigmoid",name="output_layer")
],name="model_6_USE")

In [138]:
# Compile the model
model_6.compile(loss="binary_crossentropy",
                optimizer=tf.keras.optimizers.Adam(),
                metrics=["accuracy"])
model_6.summary()

Model: "model_6_USE"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 USE (KerasLayer)            (None, 512)               256797824 
                                                                 
 dense_7 (Dense)             (None, 64)                32832     
                                                                 
 output_layer (Dense)        (None, 1)                 65        
                                                                 
Total params: 256830721 (979.73 MB)
Trainable params: 32897 (128.50 KB)
Non-trainable params: 256797824 (979.61 MB)
_________________________________________________________________
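
Because the encoder was created with `trainable=False`, only the two dense layers on top are updated during training. The split in the summary can be reproduced with a little arithmetic. This sketch assumes 4 bytes per parameter (float32) for the size estimate:

```python
# Reproduce the trainable / non-trainable split from model_6.summary().
use_params = 256_797_824        # frozen USE encoder, as reported in the summary
hidden_dense = 512 * 64 + 64    # 512-dim sentence embedding -> 64 units, plus biases
output_dense = 64 * 1 + 1       # 64 units -> 1 sigmoid output, plus bias

trainable = hidden_dense + output_dense
total = use_params + trainable
trainable_kb = trainable * 4 / 1024   # assumes float32 (4 bytes per parameter)

print(trainable, total)        # 32897 256830721
print(f"{trainable_kb:.2f}")   # 128.50
```

Only about 0.01% of the weights are trained here, which is why each epoch is cheap despite the size of the model.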


In [139]:
model_6_history = model_6.fit(train_sentences,
                              train_labels,
                              epochs=5,
                              validation_data=(val_sentences,val_labels),
                              callbacks=[create_tensorboard_callback(SAVE_DIR,
                                                                     "model_6_USE")])

Saving TensorBoard log files to: model_logs/model_6_USE/20240108-174906
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [140]:
# Make predictions with USE TF hub model
model_6_pred_probs = model_6.predict(val_sentences)
model_6_pred_probs[:10]



array([[0.17409956],
       [0.7432674 ],
       [0.98866045],
       [0.22647595],
       [0.7236055 ],
       [0.69959754],
       [0.9809039 ],
       [0.98024565],
       [0.9414004 ],
       [0.09016176]], dtype=float32)

In [141]:
#Convert prediction probs to labels
model_6_preds = tf.squeeze(tf.round(model_6_pred_probs))
model_6_preds[:20]

<tf.Tensor: shape=(20,), dtype=float32, numpy=
array([0., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 1., 0., 0., 0.,
       1., 0., 0.], dtype=float32)>

In [142]:
# Calculate model 6 performance
model_6_results = calculate_results(y_true=val_labels,
                                    y_pred=model_6_preds)
model_6_results

{'accuracy': 80.83989501312337,
 'precision': 0.8110027637273817,
 'recall': 0.8083989501312336,
 'f1': 0.806661353565622}

In [143]:
baseline_results

{'accuracy': 79.26509186351706,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706,
 'f1': 0.7862189758049549}
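
Putting the results side by side makes the comparison easier than scrolling between cells. The sketch below copies the accuracy and F1 values printed in the cells above into a plain dictionary and ranks the models by F1. The dictionary layout is just one convenient choice. Note that `accuracy` is on a 0-100 scale while `f1` is on a 0-1 scale.

```python
# Accuracy and F1 values copied from the result cells above.
results = {
    "baseline":              {"accuracy": 79.26509186351706, "f1": 0.7862189758049549},
    "model_2_LSTM":          {"accuracy": 77.42782152230971, "f1": 0.7743283279430294},
    "model_3_GRU":           {"accuracy": 77.42782152230971, "f1": 0.7704556073136567},
    "model_4_bidirectional": {"accuracy": 76.24671916010499, "f1": 0.7598225942829093},
    "model_5_Conv1D":        {"accuracy": 76.77165354330708, "f1": 0.7661635916954678},
    "model_6_USE":           {"accuracy": 80.83989501312337, "f1": 0.806661353565622},
}

# Rank models from best to worst F1 and show the gap to the baseline.
ranked = sorted(results.items(), key=lambda item: item[1]["f1"], reverse=True)
baseline_f1 = results["baseline"]["f1"]
for name, scores in ranked:
    delta = scores["f1"] - baseline_f1
    print(f"{name:<24} acc={scores['accuracy']:.2f}%  f1={scores['f1']:.4f}  ({delta:+.4f} vs baseline)")
```

On these numbers, only the pretrained USE model beats the baseline. The recurrent and convolutional models trained from scratch all fall slightly below it.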

# To be continued