<a href="https://colab.research.google.com/github/YasineNifa/DeepLearning-Using-TF/blob/master/nlp_in_tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Basics in TensorFlow
A handful of example natural language processing (NLP) and natural language understanding (NLU) problems. These are also often referred to as sequence problems (going from one sequence to another).

The main goal of natural language processing (NLP) is to derive information from natural language.

Natural language is a broad term but you can consider it to cover any of the following:

* Text (such as that contained in an email, blog post, book, Tweet)
* Speech (a conversation you have with a doctor, voice commands you give to a smart speaker)

Under the umbrellas of text and speech there are many different things you might want to do.

If you're building an email application, you might want to scan incoming emails to see if they're spam or not spam (classification).

If you're trying to analyse customer feedback complaints, you might want to discover which section of your business they're for.

> 🔑 Note: Both of these types of data are often referred to as sequences (a sentence is a sequence of words). So a common term you'll come across in NLP problems is called seq2seq, in other words, finding information in one sequence to produce another sequence (e.g. converting a speech command to a sequence of text-based steps).

To get hands-on with NLP in TensorFlow, we're going to practice the steps we've used previously but this time with text data:

> Text -> turn into numbers -> build a model -> train the model to find patterns -> use patterns (make predictions)

>> 📖 Resource: For a great overview of NLP and the different problems within it, read the article A Simple Introduction to Natural Language Processing. (https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32)


## In this notebook, we are going to cover:
* Downloading a text dataset
* Visualizing text data
* Converting text into numbers using tokenization
* Turning our tokenized text into an embedding
* Modelling a text dataset
  * Starting with a baseline (TF-IDF)
  * Building several deep learning text models
    * Dense, LSTM, GRU, Conv1D, Transfer learning
* Comparing the performance of each of our models
* Combining our models into an ensemble
* Saving and loading a trained model
* Finding the most wrong predictions

In [1]:
# check GPU
!nvidia-smi -L

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



### Download the text dataset
We'll be using the Real or Not? dataset from Kaggle which contains text-based Tweets about natural disasters.

The Real Tweets are actually about disasters, for example:

> Jetstar and Virgin forced to cancel Bali flights again because of ash from Mount Raung volcano

The Not Real Tweets are Tweets not about disasters (they can be on anything), for example:

> 'Education is the most powerful weapon which you can use to change the world.' Nelson #Mandela #quote

In [2]:
# Download dataset
!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

--2021-04-09 18:02:20--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.20.128, 74.125.142.128, 74.125.195.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.20.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2021-04-09 18:02:20 (90.8 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



In [3]:
# Unzip data
import zipfile

# Use a context manager so the archive is closed even if extraction fails
with zipfile.ZipFile("nlp_getting_started.zip") as zip_ref:
  zip_ref.extractall()

# Reusable helper for unzipping other archives
def unzip_files(filename):
  with zipfile.ZipFile(filename) as zip_ref:
    zip_ref.extractall()

### Visualizing text dataset
Right now, our text data samples are in the form of .csv files. For an easy way to make them visual, let's turn them into pandas DataFrames.

> 📖 Reading: You might come across text datasets in many different formats. Aside from CSV files (what we're working with), you'll probably encounter .txt files and .json files too. For working with these types of files, I'd recommend reading the following two articles by RealPython:
> * How to Read and Write Files in Python (https://realpython.com/read-write-files-python/)
> * Working with JSON Data in Python (https://realpython.com/python-json/)

In [4]:
import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.head(5)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
# shuffle training dataframe
train_df_shuffled = train_df.sample(frac=1,random_state=42) #shuffle with random_state for reproducibility
train_df_shuffled.head(5)

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0



Notice how the training data has a "target" column.

We're going to be writing code to find patterns (e.g. different combinations of words) in the "text" column of the training dataset to predict the value of the "target" column.


In [6]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [7]:
test_df.tail()

Unnamed: 0,id,keyword,location,text
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...
3259,10865,,,Storm in RI worse than last hurricane. My city...
3260,10868,,,Green Line derailment in Chicago http://t.co/U...
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...
3262,10875,,,#CityofCalgary has activated its Municipal Eme...


In [8]:
len(train_df_shuffled)

7613

In [9]:
# How many examples have target=1?
len(train_df_shuffled[train_df_shuffled['target'] == 1])

3271

In [10]:
# How many examples have target=0?
len(train_df_shuffled[train_df_shuffled['target'] == 0])

4342

In [11]:
4342+3271

7613

In [12]:
# Or get the count of each class in a single call
train_df_shuffled['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64
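
The counts above mean roughly 57% of the Tweets are "not real disaster" and 43% are "real disaster", so the classes are only mildly imbalanced. A minimal sketch of that arithmetic, using the counts printed above (in pandas, `value_counts(normalize=True)` should give the same fractions directly):

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 4342, 1: 3271}
total = sum(class_counts.values())

# Fraction of the training set that belongs to each class
class_fractions = {label: count / total for label, count in class_counts.items()}

for label, fraction in class_fractions.items():
    print(f"target={label}: {fraction:.1%} of {total} samples")
```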

In [13]:
# The total number of samples
print(f"Total training samples : {len(train_df_shuffled)}")
print(f"Total test samples : {len(test_df)}")
print(f"Total samples : {len(train_df_shuffled) + len(test_df)}")

Total training samples : 7613
Total test samples : 3263
Total samples : 10876


In [14]:
# Let's visualize some random training examples
# visualize 10 samples
import random

for i in range(10):
  raw_random = random.randint(0, len(train_df_shuffled) - 1) # randint includes both endpoints, so stop at the last valid index
  print('text : ', train_df_shuffled['text'][raw_random])
  target = train_df_shuffled['target'][raw_random]
  print(f'target : {target} ', "(real disaster)" if target>0 else "(not real disaster)")
  print()
  print("====================================")
  print()

text :  I feel like death...holy molys ????????
target : 0  (not real disaster)


text :  @supernovalester I feel so bad for them. I can literally feel that feeling of your heart sinking bc you didn't get anyone ugh jfc
target : 0  (not real disaster)


text :  Harshness Follows Us a
Better Day
by Sarah C
Racing thoughts with screaming sirens
Pacing back and forth for... http://t.co/ProNtOuo91
target : 0  (not real disaster)


text :  I got electrocuted this morning how is your day going? ??
target : 0  (not real disaster)


text :  Does the #FingerRockFire make you wonder 'am I prepared for a wildfire'. Find out at http://t.co/eX8A5JYZm5 #azwx http://t.co/DeEeKobmXa
target : 1  (real disaster)


text :  3 Former Executives To Be Prosecuted In Fukushima Nuclear Disaster http://t.co/UmjpRRwRUU
target : 1  (real disaster)


text :  When ur friend and u are talking about forest fires in a forest and he tells u to drop ur mix tape out there... #straightfire
target : 1  (real disaster)


te

### Split data into training and validation sets
Since the test set has no labels and we need a way to evaluate our trained models, we'll split off some of the training data and create a validation set.

When our model trains (tries to find patterns in the Tweet samples), it'll only see data from the training set and we can see how it performs on unseen data using the validation set.

We'll convert our splits from pandas Series datatypes to NumPy arrays of strings (for the text) and NumPy arrays of ints (for the labels) for ease of use later.

To split our training dataset and create a validation dataset, we'll use Scikit-Learn's train_test_split() function and dedicate 10% of the training samples to the validation set.

In [15]:
from sklearn.model_selection import train_test_split

# Use train_test_split to split training data into training and validation sets
train_data, val_data, train_labels, val_labels = train_test_split(train_df_shuffled['text'].to_numpy(),
                                                                  train_df_shuffled['target'].to_numpy(),
                                                                  test_size=0.1, # 10% of samples go to the validation set
                                                                  random_state=42) # for reproducibility

In [16]:
#Check the lengths
len(train_data), len(val_data), len(train_labels), len(val_labels)

(6851, 762, 6851, 762)
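
Where do 6851 and 762 come from? 10% of 7613 is 761.3, and rounding the validation size up gives 762, leaving 6851 for training. The sketch below shows the idea in plain Python. `simple_train_val_split` is a hypothetical helper written for illustration; it is not Scikit-Learn's implementation and won't reproduce the same shuffle order.

```python
import math
import random

def simple_train_val_split(samples, labels, val_fraction=0.1, seed=42):
    """Shuffle the indices once, then carve off the first `val_fraction` as validation."""
    n_val = math.ceil(val_fraction * len(samples))  # round the validation size up
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)            # seeded for reproducibility
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(samples, train_idx), pick(samples, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

# Stand-in data with the same number of samples as our training DataFrame
texts = [f"tweet {i}" for i in range(7613)]
targets = [i % 2 for i in range(7613)]

tr_x, va_x, tr_y, va_y = simple_train_val_split(texts, targets)
print(len(tr_x), len(va_x), len(tr_y), len(va_y))
```

Because the texts and labels are picked with the same indices, every sentence stays paired with its own label after the shuffle.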

In [17]:
# View the first 10 training sentences and their labels
train_data[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object), array([0, 

### Converting text into numbers
In NLP, there are two main concepts for turning text into numbers:

* Tokenization - A straight mapping from word or character or sub-word to a numerical value. There are three main levels of tokenization:
  * 1- Using word-level tokenization with the sentence "I love TensorFlow" might result in "I" being 0, "love" being 1 and "TensorFlow" being 2. In this case, every word in a sequence is considered a single token.
  * 2- Character-level tokenization, such as converting the letters A-Z to values 1-26. In this case, every character in a sequence is considered a single token.
  * 3- Sub-word tokenization is in between word-level and character-level tokenization. It involves breaking individual words into smaller parts and then converting those smaller parts into numbers. For example, "my favourite food is pineapple pizza" might become "my, fav, avour, rite, fo, oo, od, is, pin, ine, app, le, piz, za". After doing this, these sub-words would then be mapped to a numerical value. In this case, every word could be considered multiple tokens.

* Embeddings - An embedding is a representation of natural language which can be learned. Representation comes in the form of a feature vector. For example, the word "dance" could be represented by the 5-dimensional vector [-0.8547, 0.4559, -0.3332, 0.9877, 0.1112]. It's important to note here, the size of the feature vector is tuneable. There are two ways to use embeddings:
  * 1- Create your own embedding - Once your text has been turned into numbers (required for an embedding), you can put them through an embedding layer (such as tf.keras.layers.Embedding) and an embedding representation will be learned during model training.
  * 2- Reuse a pre-learned embedding - Many pre-trained embeddings exist online. These pre-trained embeddings have often been learned on large corpora of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.

Which level of tokenization and which type of embedding should you use? It depends on your problem. You could try character-level tokenization/embeddings and word-level tokenization/embeddings and see which perform best. You might even want to try stacking them (e.g. combining the outputs of your embedding layers using tf.keras.layers.concatenate).

If you're looking for pre-trained word embeddings, Word2vec embeddings, GloVe embeddings and many of the options available on TensorFlow Hub are great places to start.

> 🔑 Note: Much like searching for a pre-trained computer vision model, you can search for pre-trained word embeddings to use for your problem. Try searching for something like "use pre-trained word embeddings in TensorFlow".
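
To make the difference between word-level and character-level tokenization concrete, here's a minimal plain-Python sketch using the "I love TensorFlow" example from above. The mappings are hand-rolled for illustration (using 0 for non-letter characters is an assumption of this sketch), not what a TensorFlow layer produces.

```python
sentence = "I love TensorFlow"

# Word-level: every whitespace-separated word becomes one token
word_tokens = sentence.split()
word_to_index = {word: i for i, word in enumerate(word_tokens)}

# Character-level: map the letters a-z to 1-26 (0 stands for anything else, e.g. spaces)
char_to_index = {chr(ord("a") + i): i + 1 for i in range(26)}
char_ids = [char_to_index.get(ch, 0) for ch in sentence.lower()]

print(word_to_index)  # 3 tokens for the whole sentence
print(char_ids)       # one token per character
```

The trade-off shows up straight away: word-level gives short sequences but needs a large vocabulary, character-level gives a tiny vocabulary but much longer sequences.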

#### Text vectorization (tokenization)
Enough talking about tokenization and embeddings, let's create some.

We'll practice tokenization (mapping our words to numbers) first.

To tokenize our words, we'll use the helpful preprocessing layer tf.keras.layers.experimental.preprocessing.TextVectorization (from TensorFlow 2.6 onwards it's also available as tf.keras.layers.TextVectorization).

The TextVectorization layer takes the following parameters:

* max_tokens - The maximum number of words in your vocabulary (e.g. 20000 or the number of unique words in your text), includes a value for OOV (out of vocabulary) tokens.
* standardize - Method for standardizing text. Default is "lower_and_strip_punctuation" which lowers text and removes all punctuation marks.
* split - How to split text, default is "whitespace" which splits on spaces.
* ngrams - Size of the n-grams (groups of neighbouring words) to create from the split tokens, for example, ngrams=2 creates tokens from contiguous sequences of up to 2 words.
* output_mode - How to output tokens, can be "int" (integer mapping), "binary" (one-hot encoding), "count" or "tf-idf". See documentation for more.
* output_sequence_length - Length of tokenized sequence to output. For example, if output_sequence_length=150, all tokenized sequences will be 150 tokens long.
* pad_to_max_tokens - If True (default), the output feature axis will be padded to max_tokens even if the number of unique tokens in the vocabulary is less than max_tokens.
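
To get a feel for what standardize, split and ngrams do, here's a rough plain-Python approximation. The helpers `standardize` and `make_ngrams` are hypothetical names for this sketch, and stripping `string.punctuation` is an assumption that may not match the layer's exact punctuation rules.

```python
import string

def standardize(text):
    """Lowercase the text and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def make_ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# standardize -> split on whitespace -> group into bigrams
tokens = standardize("Forest fire near La Ronge Sask. Canada").split()
print(tokens)
print(make_ngrams(tokens, 2))
```

As far as I understand the layer's documentation, an integer ngrams value creates n-grams up to that size, so with ngrams=2 the bigrams would appear alongside the single words rather than replace them.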

In [18]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

text_vectorizer = TextVectorization(max_tokens=None, # how many words in the vocabulary
                                    standardize = "lower_and_strip_punctuation", #how to process data
                                    split = "whitespace", #how to split token
                                    ngrams = None, # create groups of n-words
                                    output_mode="int",#how to map token to number
                                    output_sequence_length=None, #how long should the output sequence of token be
                                    pad_to_max_tokens = True
                                    )

In [19]:
# Count the unique (lowercased) words in the training data
unique_words = set()
for sentence in train_data:
  for word in sentence.lower().split(): # split into words (iterating over a string directly yields characters)
    unique_words.add(word)
len(unique_words)


In [20]:
max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (e.g. how many words from a Tweet does our model see?)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)

In [21]:
# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_data)

In [22]:
# Create sample sentence and tokenize it
sample1 = "There is a earth quick in Japon"
sample2 = "My name is Yassine"
print(text_vectorizer([sample1]))
print(text_vectorizer([sample2]))

tf.Tensor(
[[  74    9    3  954 1787    4    1    0    0    0    0    0    0    0
     0]], shape=(1, 15), dtype=int64)
tf.Tensor([[ 13 735   9   1   0   0   0   0   0   0   0   0   0   0   0]], shape=(1, 15), dtype=int64)



Wonderful, it seems we've got a way to turn our text into numbers (in this case, word-level tokenization). Notice the 0's at the end of the returned tensor. They're padding: because we set output_sequence_length=15, text_vectorizer always returns a sequence with a length of 15, no matter the size of the sequence we pass to it. The 1's mark words that aren't in the vocabulary (the [UNK] token), such as "Japon" and "Yassine".
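
If you're wondering how a vocabulary, the out-of-vocabulary token and the padding fit together, here's a small plain-Python sketch of the same idea. `build_vocab` and `encode` are hypothetical helpers, and reserving index 0 for padding and index 1 for "[UNK]" is an assumption chosen to mirror the output above, not the layer's actual code.

```python
from collections import Counter

PAD, UNK = "", "[UNK]"  # reserved tokens at index 0 and index 1

def build_vocab(sentences, max_tokens):
    """Reserved tokens first, then the most frequent words."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    words = [w for w, _ in counts.most_common(max_tokens - 2)]
    return [PAD, UNK] + words

def encode(sentence, vocab, seq_len):
    """Map words to indices, then truncate or right-pad with 0 to seq_len."""
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index.get(w, 1) for w in sentence.lower().split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))

corpus = ["the fire is in the forest", "the flood is near"]
vocab = build_vocab(corpus, max_tokens=10)
print(vocab)
print(encode("the fire is in Japon", vocab, seq_len=8))
```

The unseen word "Japon" maps to 1 and the short sentence is padded with 0's, the same pattern as the tensors above.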

Finally, we can check the unique tokens in our vocabulary using the get_vocabulary() method.

In [23]:
all_tokens = text_vectorizer.get_vocabulary()
print(f"Number of words in vocab {len(all_tokens)}")
print(f"Top 5 most common words {all_tokens[:5]}")
print(f"Bottom 5 least common words {all_tokens[-5:]}")

Number of words in vocab 10000
Top 5 most common words ['', '[UNK]', 'the', 'a', 'in']
Bottom 5 least common words ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


### Creating an Embedding using an Embedding Layer
We've got a way to map our text to numbers. How about we go a step further and turn those numbers into an embedding?

The powerful thing about an embedding is it can be learned during training. This means rather than just being static (e.g. 1 = I, 2 = love, 3 = TensorFlow), a word's numeric representation can be improved as a model goes through data samples.

We can see what an embedding of a word looks like by using the tf.keras.layers.Embedding layer.

The main parameters we're concerned about here are:

* input_dim - The size of the vocabulary (e.g. len(text_vectorizer.get_vocabulary())).
* output_dim - The size of the output embedding vector, for example, a value of 100 outputs a feature vector of size 100 for each word.
* embeddings_initializer - How to initialize the embeddings matrix, default is "uniform" which randomly initializes the embedding matrix with a uniform distribution. This can be changed for using pre-learned embeddings.
* input_length - Length of sequences being passed to embedding layer.

Knowing these, let's make an embedding layer.

In [24]:
from tensorflow.keras import layers
embedding = layers.Embedding(input_dim = max_vocab_length,
                             output_dim = 128,# size of embedding vector
                             embeddings_initializer = "uniform",
                             input_length = max_length
                             )
embedding

<tensorflow.python.keras.layers.embeddings.Embedding at 0x7fc0e04a8dd0>

In [25]:
import random
random_sentence = random.choice(train_data)
embed = embedding(text_vectorizer([random_sentence]))
print(f"sentence : {random_sentence}")
print(f"embedding : {embed}")

sentence : @AsterPuppet wounded and carried her back to where his brothers and sisters were and entered the air ship to go back to Academia
embedding : [[[ 0.01202271 -0.01909553 -0.01306673 ... -0.0199934   0.00858361
   -0.02902834]
  [ 0.04839375  0.00434519  0.01777941 ...  0.02681328 -0.00958415
    0.02690215]
  [-0.04848403 -0.01410124 -0.00385036 ...  0.04207517  0.02788556
    0.04630784]
  ...
  [ 0.03449057 -0.03993927  0.04320589 ...  0.04407411 -0.03113501
   -0.03178923]
  [-0.04848403 -0.01410124 -0.00385036 ...  0.04207517  0.02788556
    0.04630784]
  [ 0.03210819 -0.00737327 -0.03385525 ...  0.03758854  0.02152107
   -0.01963152]]]


In [26]:
embed.shape

TensorShape([1, 15, 128])
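
Under the hood, an embedding layer behaves like a lookup table: one row of numbers per token index. The sketch below mimics that with a tiny random matrix in plain Python. The sizes are stand-ins for 10000 and 128, and the (-0.05, 0.05) range is my assumption about what the "uniform" initializer uses (the values printed above fall in that range).

```python
import random

vocab_size, embedding_dim = 10, 4  # tiny stand-ins for 10000 and 128
rng = random.Random(42)

# One row of `embedding_dim` small random values per token index
embedding_matrix = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

def embed_tokens(token_ids):
    """Replace every token index with its row of the embedding matrix."""
    return [embedding_matrix[i] for i in token_ids]

token_ids = [2, 4, 2, 0, 0]  # token 2 appears twice, followed by padding
embedded = embed_tokens(token_ids)
print(len(embedded), len(embedded[0]))  # sequence length, embedding size
print(embedded[0] == embedded[2])       # the same token always gets the same vector
```

This is consistent with the identical rows in the embedding printed above, where the sentence repeats the word "and". During training, it's the values in this matrix that get updated.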

In [27]:
embed[0][0]

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([ 0.01202271, -0.01909553, -0.01306673, -0.04243455,  0.01082335,
       -0.04595197, -0.044429  , -0.0187036 ,  0.04748697, -0.04080638,
        0.04065247, -0.03557943, -0.01330417, -0.00321885, -0.04521981,
        0.02027673, -0.03853505,  0.00812838,  0.04247073,  0.00244125,
        0.0130035 ,  0.04655511,  0.03320671, -0.03958973, -0.02492262,
        0.01241686,  0.02157965, -0.04324573, -0.01528152,  0.02725169,
       -0.01389455,  0.01110402, -0.01906642, -0.04365375, -0.03785964,
       -0.03986339, -0.01013547, -0.01774768, -0.03539802, -0.02567568,
       -0.0219432 ,  0.04077082, -0.02905855, -0.00877553, -0.04917359,
        0.00464028,  0.00199188,  0.00733487, -0.04723002, -0.02118609,
       -0.02626782,  0.02600781, -0.03018875,  0.02124203,  0.01617894,
       -0.03477886, -0.01812537, -0.04270257,  0.02028689,  0.04473329,
        0.01670713,  0.02571614, -0.02894726,  0.04835543,  0.00092071,
        0.031628

### Model 0: Naive Bayes (baseline)
To create our baseline, we'll build a Scikit-Learn Pipeline that uses the TF-IDF (term frequency-inverse document frequency) formula to convert our words to numbers and then models them with the Multinomial Naive Bayes algorithm.

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model_0 = Pipeline([("tfidf",TfidfVectorizer()),# convert words to numbers using tfidf
                    ("clf", MultinomialNB())#model the text
          ])

# fit the pipeline to the training data
model_0.fit(train_data,train_labels)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)
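
To see what TF-IDF actually computes, here's a hand-rolled sketch on three toy documents. It assumes the smoothed variant idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which is my understanding of what smooth_idf=True and norm='l2' in the output above mean; treat it as an illustration rather than Scikit-Learn's code.

```python
import math
from collections import Counter

docs = [["forest", "fire", "near", "town"],
        ["fire", "fire", "everywhere"],
        ["great", "day", "near", "beach"]]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw term count times smoothed IDF, then L2-normalised."""
    counts = Counter(doc)
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in docs:
    print({t: round(w, 3) for t, w in tfidf(doc).items()})
```

Words that appear in fewer documents (like "forest") end up with a higher weight than words that appear in many (like "fire"), which is what lets the classifier focus on the more distinctive words.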

Let's evaluate our model.

In [29]:
baseline_score = model_0.score(val_data,val_labels)
baseline_score

0.7926509186351706

In [30]:
pred = model_0.predict(val_data)
pred[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

In [31]:
val_labels[:20]

array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

Let's create a helper function which takes an array of predictions and ground truth labels and computes the following:

* Accuracy
* Precision
* Recall
* F1-score

> 🔑 Note: Since we're dealing with a classification problem, the above metrics are the most appropriate. If we were working with a regression problem, other metrics such as MAE (mean absolute error) would be a better choice.

In [33]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

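def calculate_results(y_true, y_pred):
  """
  Calculates accuracy, precision, recall and F1-score of a binary classification model.
  Accuracy is returned as a percentage (0-100); precision, recall and F1 are
  weighted averages in the range 0-1.
  """
  model_accuracy = accuracy_score(y_true, y_pred)*100
  model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")

  model_results = {'accuracy' : model_accuracy,
                   'precision' : model_precision,
                   'recall' : model_recall,
                   'f1' : model_f1}
  return model_results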

In [34]:
# Get baseline results
baseline_results = calculate_results(val_labels, pred)
baseline_results

{'accuracy': 79.26509186351706,
 'f1': 0.7862189758049549,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706}
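
Notice the recall (0.7927) matches the accuracy (79.27%). That's a side effect of average="weighted": each class's score is weighted by how many true samples it has. The sketch below recomputes the metrics by hand on the first 20 predictions and labels printed earlier. `per_class_scores` is a hypothetical helper for illustration, not Scikit-Learn's implementation.

```python
# First 20 baseline predictions and labels, copied from the outputs above
y_pred = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

def per_class_scores(y_true, y_pred, positive):
    """Precision, recall, F1 and support when `positive` is the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn  # support = number of true samples

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = {c: per_class_scores(y_true, y_pred, c) for c in (0, 1)}

# Weighted averaging: weight each class's score by its support
weighted = [sum(scores[c][m] * scores[c][3] for c in scores) / len(y_true)
            for m in range(3)]
print(f"accuracy = {accuracy:.2f}")
print("weighted precision, recall, f1:", [round(v, 3) for v in weighted])
```

On these 20 samples the weighted recall comes out equal to the accuracy, the same pattern as in the full baseline results.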

### Model 1 : A simple dense model
The first "deep" model we're going to build is a single layer dense model. In fact, it's barely going to have a single layer.

It'll take our text and labels as input, tokenize the text, create an embedding, find the average of the embedding (using Global Average Pooling) and then pass the average through a fully connected layer with one output unit and a sigmoid activation function.

If the previous sentence sounds like a mouthful, it'll make sense when we code it out (remember, if in doubt, code it out).
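
Before building it with Keras, here's a minimal plain-Python sketch of the last two steps: averaging the token embeddings and passing the average through one sigmoid unit. All the numbers and weights are made up for illustration, and `global_average_pool` and `dense_sigmoid` are hypothetical helpers, not Keras functions.

```python
import math

def global_average_pool(token_vectors):
    """Average the token vectors element-wise into a single vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def dense_sigmoid(features, weights, bias):
    """One output unit: weighted sum plus bias, squashed into (0, 1)."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1 / (1 + math.exp(-z))

# Three tokens with 4-dimensional embeddings (made-up numbers)
embedded_tweet = [[0.2, -0.1, 0.4, 0.0],
                  [0.0,  0.3, 0.2, 0.1],
                  [0.1,  0.1, 0.0, 0.5]]

pooled = global_average_pool(embedded_tweet)  # shape (3, 4) -> (4,)
probability = dense_sigmoid(pooled, weights=[1.0, -1.0, 0.5, 2.0], bias=0.0)
print(pooled)
print(probability)
```

The pooling step is what turns a variable number of token vectors into one fixed-size vector, so a single dense unit can produce one probability per Tweet.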

And since we're going to be building a number of TensorFlow deep learning models, we'll define the create_tensorboard_callback() function (which could also be imported from helper_functions.py) to keep track of the results of each.

In [40]:
import datetime

def create_tensorboard_callback(dir_name, experiment_name):
  """
  Creates a TensorBoard callback instance to store log files.
  Stores log files with the filepath:
    "dir_name/experiment_name/current_datetime/"
  Args:
    dir_name: target directory to store TensorBoard log files
    experiment_name: name of experiment directory (e.g. efficientnet_model_1)
  """
  log_dir = dir_name + "/" + experiment_name + "/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
  tensorboard_callback = tf.keras.callbacks.TensorBoard(
      log_dir=log_dir
  )
  print(f"Saving TensorBoard log files to: {log_dir}")
  return tensorboard_callback

In [41]:
SAVE_DIR = "model_logs"
from tensorflow.keras import layers
inputs = layers.Input(shape=(1,), dtype="string") # inputs are 1-dimensional strings
x = text_vectorizer(inputs) # turn the input text into numbers
x = embedding(x) # create an embedding of the numerized text
x = layers.GlobalAveragePooling1D()(x) # lower the dimensionality of the embedding
outputs = layers.Dense(1, activation="sigmoid")(x) # binary classification, so use a sigmoid activation
model_1 = tf.keras.Model(inputs, outputs, name="model_1_dense") # construct the model


In [42]:
model_1.compile(loss="binary_crossentropy",
                optimizer = tf.keras.optimizers.Adam(),
                metrics=["accuracy"])

In [43]:
model_1.summary()

Model: "model_1_dense"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization_1 (TextVe (None, 15)                0         
_________________________________________________________________
embedding (Embedding)        (None, 15, 128)           1280000   
_________________________________________________________________
global_average_pooling1d (Gl (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
_________________________________________________________________


Most of the trainable parameters are contained within the embedding layer. Recall we created an embedding of size 128 (output_dim=128) for a vocabulary of size 10,000 (input_dim=10000), hence the 1,280,000 trainable parameters.
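
The parameter counts in the summary can be checked with a few lines of arithmetic:

```python
vocab_size, embedding_dim = 10_000, 128

embedding_params = vocab_size * embedding_dim  # one 128-number vector per token
dense_params = embedding_dim * 1 + 1           # 128 weights plus 1 bias for the output unit

print(embedding_params, dense_params, embedding_params + dense_params)
```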

Alright, our model is compiled, let's fit it to our training data for 5 epochs. We'll also pass our TensorBoard callback function to make sure our model's training metrics are logged.

In [44]:
mode_1_history = model_1.fit(train_data,
                             train_labels,
                             epochs=5,
                             validation_data=(val_data,val_labels),
                             callbacks = [create_tensorboard_callback(dir_name=SAVE_DIR,experiment_name="simple_dense_model")])

Saving TensorBoard log files to: model_logs/simple_dense_model/20210409-183208
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [45]:
model_1.evaluate(val_data,val_labels)



[0.4807247817516327, 0.787401556968689]

And since we tracked our model's training logs with TensorBoard, how about we visualize them?

We can do so by uploading our TensorBoard log files (contained in the model_logs directory) to TensorBoard.dev.

> 🔑 Note: Remember, whatever you upload to TensorBoard.dev becomes public. If there are training logs you don't want to share, don't upload them.

In [46]:
# Upload the TensorBoard logs of our dense model experiment to TensorBoard.dev
!tensorboard dev upload --logdir ./model_logs \
  --name "First deep model on text data" \
  --description "Trying a dense model with an embedding layer" \
  --one_shot # exits the uploader when upload has finished

2021-04-09 18:34:50.858353: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

***** TensorBoard Uploader *****

This will upload your TensorBoard logs to https://tensorboard.dev/ from
the following directory:

./model_logs

This TensorBoard will be visible to everyone. Do not upload sensitive
data.

Your use of this service is subject to Google's Terms of Service
<https://policies.google.com/terms> and Privacy Policy
<https://policies.google.com/privacy>, and TensorBoard.dev's Terms of Service
<https://tensorboard.dev/policy/terms/>.

This notice will not be shown again while you are logged into the uploader.
To log out, run `tensorboard dev auth revoke`.

Continue? (yes/NO) yes


In [47]:
# If you need to remove previous experiments, you can do so using the following command
# !tensorboard dev delete --experiment_id EXPERIMENT_ID_TO_DELETE

In [57]:
# make some predictions
model_1_probabilities = model_1.predict(val_data)
#model_1_probabilities[:20]

Since our final layer uses a sigmoid activation function, we get our predictions back in the form of probabilities.

To convert them to prediction classes, we'll use tf.round(), meaning prediction probabilities below 0.5 will be rounded to 0 and those above 0.5 will be rounded to 1.
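
Rounding a probability is the same as applying a threshold of 0.5. A tiny plain-Python sketch with made-up probabilities (the 0.7 threshold is a hypothetical value for illustration, not something tuned on this dataset):

```python
probabilities = [0.08, 0.49, 0.51, 0.97]

# Rounding is equivalent to thresholding at 0.5
classes = [round(p) for p in probabilities]
print(classes)

# A stricter threshold only predicts "disaster" when the model is more confident
threshold = 0.7
strict_classes = [int(p >= threshold) for p in probabilities]
print(strict_classes)
```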

In [56]:
model_1_prediction = tf.squeeze(tf.round(model_1.predict(val_data))) # round to 0 or 1 and drop the extra dimension to match the shape of val_labels
#print(model_1_prediction[:20])
#print(val_labels[:20])

array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

In [58]:
# Calculate model_1 metrics
model_1_results = calculate_results(val_labels,model_1_prediction)
model_1_results

{'accuracy': 78.74015748031496,
 'f1': 0.7841130596930417,
 'precision': 0.7932296029485675,
 'recall': 0.7874015748031497}

In [59]:
baseline_results

{'accuracy': 79.26509186351706,
 'f1': 0.7862189758049549,
 'precision': 0.8111390004213173,
 'recall': 0.7926509186351706}

In [62]:
# Is our baseline model better than our simple Keras model?
import numpy as np
np.array(list(baseline_results.values()))>np.array(list(model_1_results.values()))

array([ True,  True,  True,  True])

In [63]:

# Create a helper function to compare our baseline results to new model results
def compare_baseline_to_new_results(baseline_results, new_model_results):
  for key, value in baseline_results.items():
    print(f"Baseline {key}: {value:.2f}, New {key}: {new_model_results[key]:.2f}, Difference: {new_model_results[key]-value:.2f}")

compare_baseline_to_new_results(baseline_results=baseline_results, 
                                new_model_results=model_1_results)

Baseline accuracy: 79.27, New accuracy: 78.74, Difference: -0.52
Baseline precision: 0.81, New precision: 0.79, Difference: -0.02
Baseline recall: 0.79, New recall: 0.79, Difference: -0.01
Baseline f1: 0.79, New f1: 0.78, Difference: -0.00


In [64]:
word_in_vocab = text_vectorizer.get_vocabulary()
word_in_vocab[:10]

['', '[UNK]', 'the', 'a', 'in', 'to', 'of', 'and', 'i', 'is']

In [65]:
# Get the embedding weights (one row per token in the vocabulary)
embed_weights = model_1.get_layer("embedding").get_weights()[0]
type(embed_weights)

numpy.ndarray

In [67]:
len(embed_weights), embed_weights.shape

(10000, (10000, 128))

Now we've got these two objects, we can use the Embedding Projector tool to visualize our embedding.

To use the Embedding Projector tool, we need two files:

* The embedding vectors (same as embedding weights).
* The metadata of the embedding vectors (the words they represent - our vocabulary).



In [71]:
# Code below is adapted from: https://www.tensorflow.org/tutorials/text/word_embeddings#retrieve_the_trained_word_embeddings_and_save_them_to_disk
import io

# Create output writers
out_v = io.open("embedding_vectors.tsv", "w", encoding="utf-8")
out_m = io.open("embedding_metadata.tsv", "w", encoding="utf-8")

# Write embedding vectors and words to file
for num, word in enumerate(word_in_vocab):
  if num == 0: continue # skip padding token
  vec = embed_weights[num]
  out_m.write(word + "\n") # write words to file
  out_v.write("\t".join([str(x) for x in vec]) + "\n") # write corresponding word vector to file
out_v.close()
out_m.close()

# Download files locally to upload to Embedding Projector
try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download("embedding_vectors.tsv")
  files.download("embedding_metadata.tsv")

Once you've downloaded the embedding vectors and metadata, you can visualize them using the Embedding Projector tool:

1. Go to http://projector.tensorflow.org/
2. Click on "Load data"
3. Upload the two files you downloaded (embedding_vectors.tsv and embedding_metadata.tsv)
4. Explore
5. Optional: You can share the data you've created by clicking "Publish"