<a href="https://colab.research.google.com/github/YasineNifa/DeepLearning-Using-TF/blob/master/nlp_in_tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Basics in TensorFlow
This notebook works through a handful of example natural language processing (NLP) and natural language understanding (NLU) problems. These are also often referred to as sequence problems (going from one sequence to another).

The main goal of natural language processing (NLP) is to derive information from natural language.

Natural language is a broad term but you can consider it to cover any of the following:

* Text (such as that contained in an email, blog post, book, Tweet)
* Speech (a conversation you have with a doctor, voice commands you give to a smart speaker)

Under the umbrellas of text and speech there are many different things you might want to do.

If you're building an email application, you might want to scan incoming emails to see if they're spam or not spam (classification).

If you're analysing customer complaints, you might want to discover which section of your business they relate to.

> 🔑 Note: Both of these types of data are often referred to as sequences (a sentence is a sequence of words). A common term you'll come across in NLP problems is seq2seq, in other words, finding information in one sequence to produce another sequence (e.g. converting a speech command to a sequence of text-based steps).

To get hands-on with NLP in TensorFlow, we're going to practice the steps we've used previously but this time with text data:

> Text -> turn into numbers -> build a model -> train the model to find patterns -> use patterns (make predictions)

> 📖 Resource: For a great overview of NLP and the different problems within it, read the article A Simple Introduction to Natural Language Processing (https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32).


## In this notebook, we are going to cover:
* Downloading a text dataset
* Visualizing text data
* Converting text into numbers using tokenization
* Turning our tokenized text into an embedding
* Modelling a text dataset
  * Starting with a baseline (TF-IDF)
  * Building several deep learning text models
    * Dense, LSTM, GRU, Conv1D, Transfer learning
* Comparing the performance of each of our models
* Combining our models into an ensemble
* Saving and loading a trained model
* Finding the most wrong predictions

In [1]:
# check GPU
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-0f903154-5a42-8de0-8e6d-0dcd702359eb)


### Download the text dataset
We'll be using the Real or Not? dataset from Kaggle, which contains text-based Tweets about natural disasters.

The Real Tweets are actually about disasters, for example:

> Jetstar and Virgin forced to cancel Bali flights again because of ash from Mount Raung volcano

The Not Real Tweets are Tweets not about disasters (they can be on anything), for example:

> 'Education is the most powerful weapon which you can use to change the world.' Nelson #Mandela #quote

In [2]:
# Download dataset
!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

--2021-04-08 21:40:08--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.142.128, 74.125.195.128, 2607:f8b0:400e:c07::80, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.142.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2021-04-08 21:40:08 (110 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



In [3]:
# Unzip data
import zipfile
with zipfile.ZipFile("nlp_getting_started.zip") as zip_ref:
  zip_ref.extractall()

# Wrap the same steps in a reusable function
def unzip_files(filename):
  """Extract every file in the given zip archive into the current directory."""
  with zipfile.ZipFile(filename) as zip_ref:
    zip_ref.extractall()

### Visualizing text dataset
Right now, our text data samples are in the form of .csv files. For an easy way to make them visual, let's turn them into pandas DataFrames.

> 📖 Reading: You might come across text datasets in many different formats. Aside from CSV files (what we're working with), you'll probably encounter .txt files and .json files too. For working with these types of files, I'd recommend reading the two following articles by RealPython:
> * How to Read and Write Files in Python (https://realpython.com/read-write-files-python/)
> * Working with JSON Data in Python (https://realpython.com/python-json/)

In [4]:
import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
train_df.head(5)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [5]:
# shuffle training dataframe
train_df_shuffled = train_df.sample(frac=1,random_state=42) #shuffle with random_state for reproducibility
train_df_shuffled.head(5)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2644 | 3796 | destruction | | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge | | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock | | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |



Notice how the training data has a "target" column.

We're going to be writing code to find patterns (e.g. different combinations of words) in the "text" column of the training dataset to predict the value of the "target" column.


In [6]:
test_df.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


In [7]:
test_df.tail()

| | id | keyword | location | text |
|---|---|---|---|---|
| 3258 | 10861 | | | EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE... |
| 3259 | 10865 | | | Storm in RI worse than last hurricane. My city... |
| 3260 | 10868 | | | Green Line derailment in Chicago http://t.co/U... |
| 3261 | 10874 | | | MEG issues Hazardous Weather Outlook (HWO) htt... |
| 3262 | 10875 | | | #CityofCalgary has activated its Municipal Eme... |


In [8]:
len(train_df_shuffled)

7613

In [9]:
# How many examples have target=1?
len(train_df_shuffled[train_df_shuffled['target'] == 1])

3271

In [10]:
# How many examples have target=0?
len(train_df_shuffled[train_df_shuffled['target'] == 0])

4342

In [11]:
4342+3271

7613

In [12]:
# Or get the count of each class in a single call
train_df_shuffled['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64

In [13]:
# Total number of samples
print(f"Total training samples : {len(train_df_shuffled)}")
print(f"Total test samples : {len(test_df)}")
print(f"Total samples : {len(train_df_shuffled) + len(test_df)}")

Total training samples : 7613
Total test samples : 3263
Total samples : 10876


In [14]:
# Let's visualize 10 random training examples
import random

for i in range(10):
  # random.randint includes both endpoints, so subtract 1 to avoid picking an index that doesn't exist
  random_index = random.randint(0, len(train_df_shuffled) - 1)
  print('text : ', train_df_shuffled['text'][random_index])
  target = train_df_shuffled['target'][random_index]
  print(f'target : {target} ', "(real disaster)" if target>0 else "(not real disaster)")
  print()
  print("====================================")
  print()

text :  @danielsahyounie It'd be so bomb if u guys won ??
target : 0  (not real disaster)


text :  If you have an opinion and you don't put it on thh internet you will furst into flames.
target : 0  (not real disaster)


text :  @Kinder_Morgan can'twon't tell @cityofkamloops how they'd respond to an oil spill. Trust them? See Sec 4.2 #Kamloops http://t.co/TA6N9sZyfP
target : 1  (real disaster)


text :  DEEP crew to help with California wild fires http://t.co/QKz2Sp06xn via @thedayct
target : 1  (real disaster)


text :  13 reasons why we love women in the military   - lulgzimbestpicts http://t.co/uZ1yiZ7n6m http://t.co/IjwAr15H16
target : 0  (not real disaster)


text :  Dem FLATLINERS who destroy creativity-balance-longevity &amp; TRUTH stand with Lucifer in all his flames of destruction https://t.co/WcFpZNsN9u
target : 1  (real disaster)


text :  Cultivating Joy In The Face Of Catastrophe And Suffering http://t.co/o0LTQDJbQe #pjnet #tcotåÊ#ccot http://t.co/MO9wpTyqkp
target : 0  (

### Split data into training and validation sets
Since the test set has no labels and we need a way to evaluate our trained models, we'll split off some of the training data and create a validation set.

When our model trains (tries to find patterns in the Tweet samples), it'll only see data from the training set, and we can see how it performs on unseen data using the validation set.

We'll convert our splits from pandas Series datatypes to NumPy arrays of strings (for the text) and ints (for the labels) for ease of use later.

To split our training dataset and create a validation dataset, we'll use Scikit-Learn's train_test_split() function and dedicate 10% of the training samples to the validation set.

In [15]:
from sklearn.model_selection import train_test_split
# Use train_test_split to split training data into training and validation sets
train_data, val_data, train_labels, val_labels = train_test_split(train_df_shuffled['text'].to_numpy(), 
                                                                  train_df_shuffled['target'].to_numpy(), 
                                                                  test_size=0.1, #10% of sample to validation set
                                                                  random_state=42) #for reproducibility

In [16]:
#Check the lengths
len(train_data), len(val_data), len(train_labels), len(val_labels)

(6851, 762, 6851, 762)
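The lengths above can be reproduced with a little arithmetic. The sketch below assumes that train_test_split rounds a fractional test_size up to the next whole sample and gives the remainder to the training set; the variable names are only for illustration.

```python
import math

n_samples = 7613   # number of labelled Tweets in train.csv
test_size = 0.1    # fraction held out for validation

# Assumption: the held-out count is rounded up, the rest goes to training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val
print(n_train, n_val)  # 6851 762
```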

In [17]:
# View the first 10 training sentences and their labels
train_data[:10], train_labels[:10]

(array(['@mogacola @zamtriossu i screamed after hitting tweet',
        'Imagine getting flattened by Kurt Zouma',
        '@Gurmeetramrahim #MSGDoing111WelfareWorks Green S welfare force ke appx 65000 members har time disaster victim ki help ke liye tyar hai....',
        "@shakjn @C7 @Magnums im shaking in fear he's gonna hack the planet",
        'Somehow find you and I collide http://t.co/Ee8RpOahPk',
        '@EvaHanderek @MarleyKnysh great times until the bus driver held us hostage in the mall parking lot lmfao',
        'destroy the free fandom honestly',
        'Weapons stolen from National Guard Armory in New Albany still missing #Gunsense http://t.co/lKNU8902JE',
        '@wfaaweather Pete when will the heat wave pass? Is it really going to be mid month? Frisco Boy Scouts have a canoe trip in Okla.',
        'Patient-reported outcomes in long-term survivors of metastatic colorectal cancer - British Journal of Surgery http://t.co/5Yl4DC1Tqt'],
       dtype=object), array([0, 

### Converting text into numbers
In NLP, there are two main concepts for turning text into numbers:

* Tokenization - A direct mapping from a word, character or sub-word to a numerical value. There are three main levels of tokenization:
  * 1- Word-level tokenization: with the sentence "I love TensorFlow", this might result in "I" being 0, "love" being 1 and "TensorFlow" being 2. In this case, every word in a sequence is considered a single token.
  * 2- Character-level tokenization, such as converting the letters A-Z to values 1-26. In this case, every character in a sequence is considered a single token.
  * 3- Sub-word tokenization sits between word-level and character-level tokenization. It involves breaking individual words into smaller parts and then converting those smaller parts into numbers. For example, "my favourite food is pineapple pizza" might become "my, fav, avour, rite, fo, oo, od, is, pin, ine, app, le, piz, za". These sub-words are then mapped to numerical values. In this case, every word could be considered multiple tokens.

* Embeddings - An embedding is a representation of natural language which can be learned. The representation comes in the form of a feature vector. For example, the word "dance" could be represented by the 5-dimensional vector [-0.8547, 0.4559, -0.3332, 0.9877, 0.1112]. Importantly, the size of the feature vector is tunable. There are two ways to use embeddings:
  * 1- Create your own embedding - Once your text has been turned into numbers (required for an embedding), you can put them through an embedding layer (such as tf.keras.layers.Embedding) and an embedding representation will be learned during model training.
  * 2- Reuse a pre-learned embedding - Many pre-trained embeddings exist online. These pre-trained embeddings have often been learned on large corpora of text (such as all of Wikipedia) and thus have a good underlying representation of natural language. You can use a pre-trained embedding to initialize your model and fine-tune it to your own specific task.
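To make the first two tokenization levels concrete, here is a minimal plain-Python sketch. It is a toy illustration only, not a TensorFlow API: the dictionaries word_to_id and char_to_id are made up for this example, and sub-word tokenization is left out because it normally relies on a vocabulary learned from a large corpus.

```python
# Toy illustration of word-level and character-level tokenization
sentence = "I love TensorFlow"

# Word-level: every whitespace-separated word becomes one token
words = sentence.split()
word_to_id = {word: idx for idx, word in enumerate(words)}
word_tokens = [word_to_id[word] for word in words]
print(word_tokens)  # [0, 1, 2]

# Character-level: map the letters a-z to 1-26 and skip everything else
char_to_id = {chr(ord("a") + i): i + 1 for i in range(26)}
char_tokens = [char_to_id[ch] for ch in sentence.lower() if ch in char_to_id]
print(char_tokens[:5])  # [9, 12, 15, 22, 5] for "i", "l", "o", "v", "e"
```

Note how the same sentence is 3 tokens at the word level but 15 tokens at the character level: a smaller vocabulary comes at the cost of longer sequences.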

Which level of tokenization and which type of embedding should you use? It depends on your problem. You could try character-level tokenization/embeddings and word-level tokenization/embeddings and see which performs best. You might even want to try stacking them (e.g. combining the outputs of your embedding layers using tf.keras.layers.concatenate).

If you're looking for pre-trained word embeddings, Word2vec embeddings, GloVe embeddings and many of the options available on TensorFlow Hub are great places to start.

> 🔑 Note: Much like searching for a pre-trained computer vision model, you can search for pre-trained word embeddings to use for your problem. Try searching for something like "use pre-trained word embeddings in TensorFlow".

#### Text vectorization (tokenization)
Enough talking about tokenization and embeddings, let's create some.

We'll practice tokenization (mapping our words to numbers) first.

To tokenize our words, we'll use the helpful preprocessing layer tf.keras.layers.experimental.preprocessing.TextVectorization (exposed as tf.keras.layers.TextVectorization in more recent TensorFlow releases).

The TextVectorization layer takes the following parameters:

* max_tokens - The maximum number of words in your vocabulary (e.g. 20000 or the number of unique words in your text). This count includes a value for OOV (out of vocabulary) tokens.
* standardize - Method for standardizing text. Default is "lower_and_strip_punctuation", which lowercases text and removes all punctuation marks.
* split - How to split text. Default is "whitespace", which splits on spaces.
* ngrams - How many consecutive words to group into a single token. For example, ngrams=2 creates tokens from sequences of up to 2 consecutive words.
* output_mode - How to output tokens, can be "int" (integer mapping), "binary" (a 1 for each vocabulary token present in the text), "count" or "tf-idf". See documentation for more.
* output_sequence_length - Length of tokenized sequence to output. For example, if output_sequence_length=150, all tokenized sequences will be 150 tokens long.
* pad_to_max_tokens - If True (default), the output feature axis will be padded to max_tokens even if the number of unique tokens in the vocabulary is less than max_tokens.
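To build some intuition for how these parameters interact in "int" mode, here is a rough plain-Python imitation. It is a simplified sketch, not TensorFlow's implementation: the helper names (standardize, build_vocab, vectorize) and the toy texts are hypothetical, and the real layer differs in details such as tie-breaking between equally frequent words.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Most frequent words first; index 0 is padding, index 1 is OOV."""
    counts = Counter(word for text in texts for word in standardize(text).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, output_sequence_length):
    """Map words to vocabulary indices, then truncate or zero-pad to a fixed length."""
    lookup = {word: idx for idx, word in enumerate(vocab)}
    ids = [lookup.get(word, 1) for word in standardize(text).split()]
    ids = ids[:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))

toy_texts = ["Forest fire near the lake", "The fire is out", "The fire and the lake"]
vocab = build_vocab(toy_texts, max_tokens=5)
print(vocab)  # ['', '[UNK]', 'the', 'fire', 'lake']
print(vectorize("The fire near the volcano!", vocab, output_sequence_length=8))
# [2, 3, 1, 2, 1, 0, 0, 0]
```

Words outside the vocabulary ("near", "volcano") map to 1, and the sequence is padded with 0's up to the requested length.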

In [18]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

text_vectorizer = TextVectorization(max_tokens=None, # how many words in the vocabulary
                                    standardize = "lower_and_strip_punctuation", #how to process data
                                    split = "whitespace", #how to split token
                                    ngrams = None, # create groups of n-words
                                    output_mode="int",#how to map token to number
                                    output_sequence_length=None, #how long should the output sequence of token be
                                    pad_to_max_tokens = True
                                    )

In [30]:
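# Note: iterating over a string yields characters, not words, so this counts
# the unique lowercased characters in the training text
# (loop over sentence.split() instead to count unique words)
unique_chars = set()
for sentence in train_data:
  unique_chars.update(char.lower() for char in sentence)
len(unique_chars)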


93

In [44]:
max_vocab_length = 10000 # max number of words to have in our vocabulary
max_length = 15 # max length our sequences will be (e.g. how many words from a Tweet does our model see?)

text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)

In [45]:
# Fit the text vectorizer to the training text
text_vectorizer.adapt(train_data)

In [46]:
# Create sample sentence and tokenize it
sample1 = "There is a earth quick in Japon"
sample2 = "My name is Yassine"
print(text_vectorizer([sample1]))
print(text_vectorizer([sample2]))

tf.Tensor(
[[  74    9    3  954 1787    4    1    0    0    0    0    0    0    0
     0]], shape=(1, 15), dtype=int64)
tf.Tensor([[ 13 735   9   1   0   0   0   0   0   0   0   0   0   0   0]], shape=(1, 15), dtype=int64)



Wonderful, it seems we've got a way to turn our text into numbers (in this case, word-level tokenization). Notice the 0's at the end of the returned tensor: this is because we set output_sequence_length=15, meaning no matter the size of the sequence we pass to text_vectorizer, it always returns a sequence with a length of 15. The 1's mark words that are not in the vocabulary (the [UNK] token).

Finally, we can check the unique tokens in our vocabulary using the get_vocabulary() method.

In [47]:
all_tokens = text_vectorizer.get_vocabulary()
print(f"Number of words in vocab {len(all_tokens)}")
print(f"Top 5 most common words {all_tokens[:5]}")
print(f"Bottom 5 least common words {all_tokens[-5:]}")

Number of words in vocab 10000
Top 5 most common words ['', '[UNK]', 'the', 'a', 'in']
Bottom 5 least common words ['pages', 'paeds', 'pads', 'padres', 'paddytomlinson1']


### Creating an Embedding using an Embedding Layer
We've got a way to map our text to numbers. How about we go a step further and turn those numbers into an embedding?

The powerful thing about an embedding is it can be learned during training. This means rather than just being static (e.g. 1 = I, 2 = love, 3 = TensorFlow), a word's numeric representation can be improved as a model goes through data samples.
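Conceptually, an embedding layer behaves like a lookup table with one row of numbers per token ID. The plain-Python sketch below shows only that lookup step; the sizes and names (vocab_size, embedding_dim, embedding_table) are hypothetical, and the training process that would adjust the rows is not shown.

```python
import random

random.seed(42)  # reproducible toy values

vocab_size = 6      # hypothetical tiny vocabulary
embedding_dim = 4   # hypothetical size of each feature vector

# One row of small random values per token ID (random starting point before training)
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

token_ids = [2, 3, 1, 2, 0]  # an already tokenized and padded toy sequence
embedded = [embedding_table[token_id] for token_id in token_ids]

print(len(embedded), len(embedded[0]))  # 5 4
print(embedded[0] == embedded[3])       # True: the same token ID maps to the same row
```

During training, the values in each row would be updated alongside the rest of the model's weights, which is what makes the representation learnable rather than static.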

We can see what an embedding of a word looks like by using the tf.keras.layers.Embedding layer.

The main parameters we're concerned about here are:

* input_dim - The size of the vocabulary (e.g. len(text_vectorizer.get_vocabulary())).
* output_dim - The size of the output embedding vector, for example, a value of 100 outputs a feature vector of size 100 for each word.
* embeddings_initializer - How to initialize the embeddings matrix. The default is "uniform", which randomly initializes the embedding matrix from a uniform distribution. This can be changed to use pre-learned embeddings.
* input_length - Length of sequences being passed to embedding layer.

Knowing these, let's make an embedding layer.

In [49]:
from tensorflow.keras import layers
embedding = layers.Embedding(input_dim = max_vocab_length,
                             output_dim = 128,# size of embedding vector
                             embeddings_initializer = "uniform",
                             input_length = max_length
                             )
embedding

<tensorflow.python.keras.layers.embeddings.Embedding at 0x7f2d60cb1890>

In [53]:
import random
random_sentence = random.choice(train_data)
embed = embedding(text_vectorizer([random_sentence]))
print(f"sentence : {random_sentence}")
print(f"embedding : {embed}")

sentence : Three days off from work and they've pretty much all been wrecked hahaha shoutout to my family for that one
embedding : [[[ 0.00450909  0.03491906 -0.03565206 ...  0.03466724  0.01150768
    0.03669469]
  [-0.04651254  0.03443931 -0.01613631 ... -0.04853472 -0.00786417
   -0.03764267]
  [-0.03183792 -0.0495659   0.02440139 ... -0.01169258 -0.04654762
    0.03260454]
  ...
  [ 0.03115541  0.00739955  0.01634722 ...  0.04695733 -0.03539421
    0.04157289]
  [-0.01903983 -0.03322592 -0.0085649  ...  0.03684549  0.04276494
   -0.03241139]
  [-0.02908183  0.04517445 -0.0347869  ...  0.02524885  0.03625787
    0.00483751]]]


In [57]:
embed.shape

TensorShape([1, 15, 128])

In [58]:
embed[0][0]

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([ 0.00450909,  0.03491906, -0.03565206, -0.00768239, -0.02646568,
       -0.04912739,  0.04549045,  0.04687456,  0.00881914,  0.0481938 ,
       -0.0246333 ,  0.02349437, -0.04968008,  0.02500402,  0.00989534,
       -0.00469167, -0.01581765,  0.02046904,  0.04635542, -0.00045057,
       -0.02523324,  0.02940357, -0.04699943,  0.01709614,  0.00603753,
        0.00232885,  0.00867283, -0.00181242, -0.00051749,  0.0476107 ,
       -0.025052  , -0.02189721, -0.01923978, -0.01420168,  0.02317548,
       -0.01155789,  0.014388  ,  0.00132809, -0.0490813 ,  0.03274209,
       -0.04417333,  0.00752889, -0.02588215,  0.00161297,  0.02535981,
       -0.00235496, -0.0068384 , -0.00903518,  0.03744109,  0.03049319,
        0.00888323,  0.00327431,  0.01591796, -0.03769596,  0.01506902,
        0.03734047,  0.02501761, -0.00020089,  0.01417091, -0.02664999,
       -0.02567365, -0.0142503 ,  0.03692749, -0.03168106,  0.04765601,
        0.001709

### Model 0: Naive Bayes (baseline)
To create our baseline, we'll build a Scikit-Learn Pipeline that uses the TF-IDF (term frequency-inverse document frequency) formula to convert our words to numbers and then models them with the Multinomial Naive Bayes algorithm.
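To see what the TF-IDF step does to a document, here is a small plain-Python sketch. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which matches the smooth_idf=True and norm='l2' settings visible in the pipeline output; the toy documents and the tfidf helper are made up for illustration, and the simple whitespace tokenization differs from TfidfVectorizer's token pattern.

```python
import math
from collections import Counter

docs = [
    "forest fire near the lake",
    "the lake is calm",
    "the fire is spreading",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """TF-IDF weights for one document, with smoothed IDF and L2 normalization."""
    term_freq = Counter(tokens)
    raw = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in term_freq.items()
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:>8}: {weight:.3f}")
```

Words that appear in only one document ("forest", "near") receive a higher weight than a word that appears in every document ("the").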

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model_0 = Pipeline([("tfidf",TfidfVectorizer()),# convert words to numbers using tfidf
                    ("clf", MultinomialNB())#model the text
          ])

# fit the pipeline to the training data
model_0.fit(train_data,train_labels)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

Let's evaluate our model.

In [60]:
baseline_score = model_0.score(val_data,val_labels)
baseline_score

0.7926509186351706

In [62]:
pred = model_0.predict(val_data)
pred[:20]

array([1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

In [63]:
val_labels[:20]

array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

To compare these predictions against the ground truth labels more thoroughly, it helps to have a function that takes a model's predictions and the ground truth labels and computes the following:

* Accuracy
* Precision
* Recall
* F1-score

> 🔑 Note: Since we're dealing with a classification problem, the above metrics are the most appropriate. If we were working with a regression problem, other metrics such as MAE (mean absolute error) would be a better choice.
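As a minimal sketch of how these four metrics are calculated, the plain-Python function below treats label 1 (real disaster) as the positive class. The function name binary_classification_metrics is hypothetical, and this choice of averaging is an assumption; Scikit-Learn's sklearn.metrics module provides tested equivalents such as accuracy_score and precision_recall_fscore_support.

```python
def binary_classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# The first 20 validation labels and baseline predictions printed earlier in this notebook
labels_20 = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
preds_20 = [1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
results = binary_classification_metrics(labels_20, preds_20)
print({name: round(value, 3) for name, value in results.items()})
# {'accuracy': 0.65, 'precision': 0.556, 'recall': 0.625, 'f1': 0.588}
```

These numbers describe only the 20 samples shown, so they differ from the baseline's accuracy on the full validation set.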