

# Part 1 

# We explore two Pretrained models 

# DistilBERT 
References used: http://jalammar.github.io/illustrated-bert/

Dataset: https://nlp.stanford.edu/sentiment/index.html


# TensorFlow 

In this notebook, we will use a pre-trained deep learning model to process some text. We will then use the output of that model to classify the text. The text is a list of sentences from film reviews, and we will classify each sentence as speaking either "positively" or "negatively" about its subject.

## Models: Sentence Sentiment Classification
Our goal is to create a model that takes a sentence (just like the ones in our dataset) and produces either 1 (indicating the sentence carries a positive sentiment) or a 0 (indicating the sentence carries a negative sentiment). We can think of it as looking like this:

<img src="https://jalammar.github.io/images/distilBERT/sentiment-classifier-1.png" />

Under the hood, the model is actually made up of two models.

* DistilBERT processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
* The next model, a basic Logistic Regression model from scikit-learn, takes in the result of DistilBERT’s processing and classifies the sentence as either positive or negative (1 or 0, respectively).

The data we pass between the two models is a vector of size 768. We can think of this vector as an embedding for the sentence that we can use for classification.



## Dataset
The dataset we will use in this example is [SST2](https://nlp.stanford.edu/sentiment/index.html), which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):


<table class="features-table">
  <tr>
    <th class="mdc-text-light-green-600">
    sentence
    </th>
    <th class="mdc-text-purple-600">
    label
    </th>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      apparently reassembled from the cutting room floor of any given daytime soap
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      they presume their audience won't sit still for a sociology lesson
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      this is a visually stunning rumination on love , memory , history and the war between art and commerce
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      jonathan parker 's bartleby should have been the be all end all of the modern office anomie films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
</table>



## Installing the transformers library
Install the Hugging Face transformers library, which we use to load our deep learning NLP model.

In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/88/b1/41130a228dd656a1a31ba281598a968320283f48d42782845f6ba567f00b/transformers-4.2.2-py3-none-any.whl (1.8MB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=f76de7c47649a

Import necessary libraries 

In [None]:
import numpy as np # numpy for linear algebra
import pandas as pd # pandas for data manipulation
from sklearn.model_selection import train_test_split # split data into train and test sets
from sklearn.linear_model import LogisticRegression # classifier applied to the sentence vectors
from sklearn.model_selection import GridSearchCV # search for the best parameters
from sklearn.model_selection import cross_val_score # evaluate a score by cross-validation

import torch
import transformers as ppb

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

## Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [None]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

For performance reasons, we'll only use the first 2,000 sentences from the dataset. First, here are the first few rows:

In [None]:
df.head()

Unnamed: 0,0,1
0,"a stirring , funny and finally transporting re...",1
1,apparently reassembled from the cutting room f...,0
2,they presume their audience wo n't sit still f...,0
3,this is a visually stunning rumination on love...,1
4,jonathan parker 's bartleby should have been t...,1


In [None]:
batch_1 = df[:2000]

We can ask pandas how many sentences are labeled "positive" (value 1) and how many are labeled "negative" (value 0):

In [None]:
batch_1[1].value_counts()

1    1041
0     959
Name: 1, dtype: int64

The two classes are fairly balanced: 1,041 positive versus 959 negative.

In [None]:
batch_1[0]

0       a stirring , funny and finally transporting re...
1       apparently reassembled from the cutting room f...
2       they presume their audience wo n't sit still f...
3       this is a visually stunning rumination on love...
4       jonathan parker 's bartleby should have been t...
                              ...                        
1995    too bland and fustily tasteful to be truly pru...
1996                           it does n't work as either
1997    this one aims for the toilet and scores a dire...
1998    in the name of an allegedly inspiring and easi...
1999    the movie is undone by a filmmaking methodolog...
Name: 0, Length: 2000, dtype: object

## Loading the Pre-trained DistilBERT Model
Let's now load DistilBERT, a smaller pre-trained version of BERT.

We use `distilbert-base-uncased`. The uncased model lowercases its input, so it ignores case. That loses nothing here, because the SST2 sentences in this dataset are already lowercase.

In [None]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)


Right now, the variable `model` holds a pretrained DistilBERT model -- a version of BERT that is smaller, much faster, and requires a lot less memory.

# Model #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with.

We apply a lambda to every sentence in the DataFrame column. It calls `tokenizer.encode` with `add_special_tokens=True`, which adds the special `[CLS]` and `[SEP]` tokens around each sentence.

This turns every sentence into a list of token ids.

In [None]:
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [None]:
# a pandas Series in which each entry is a list of token ids
tokenized

0       [101, 1037, 18385, 1010, 6057, 1998, 2633, 182...
1       [101, 4593, 2128, 27241, 23931, 2013, 1996, 62...
2       [101, 2027, 3653, 23545, 2037, 4378, 24185, 10...
3       [101, 2023, 2003, 1037, 17453, 14726, 19379, 1...
4       [101, 5655, 6262, 1005, 1055, 12075, 2571, 376...
                              ...                        
1995    [101, 2205, 20857, 1998, 11865, 16643, 2135, 5...
1996    [101, 2009, 2515, 1050, 1005, 1056, 2147, 2004...
1997    [101, 2023, 2028, 8704, 2005, 1996, 11848, 199...
1998    [101, 1999, 1996, 2171, 1997, 2019, 9382, 1898...
1999    [101, 1996, 3185, 2003, 25757, 2011, 1037, 244...
Name: 0, Length: 2000, dtype: object

In [None]:
tokenized.values

array([list([101, 1037, 18385, 1010, 6057, 1998, 2633, 18276, 2128, 16603, 1997, 5053, 1998, 1996, 6841, 1998, 5687, 5469, 3152, 102]),
       list([101, 4593, 2128, 27241, 23931, 2013, 1996, 6276, 2282, 2723, 1997, 2151, 2445, 12217, 7815, 102]),
       list([101, 2027, 3653, 23545, 2037, 4378, 24185, 1050, 1005, 1056, 4133, 2145, 2005, 1037, 11507, 10800, 1010, 2174, 14036, 2135, 3591, 1010, 2061, 2027, 19817, 4140, 2041, 1996, 7511, 2671, 4349, 3787, 1997, 11829, 7168, 9219, 1998, 28971, 2308, 1999, 8301, 8737, 2100, 4253, 102]),
       ...,
       list([101, 2023, 2028, 8704, 2005, 1996, 11848, 1998, 7644, 1037, 3622, 2718, 102]),
       list([101, 1999, 1996, 2171, 1997, 2019, 9382, 18988, 1998, 4089, 3006, 3085, 17312, 1010, 1996, 3750, 1005, 1055, 2252, 4332, 1037, 6397, 3239, 2000, 1996, 2200, 2381, 2009, 9811, 2015, 2000, 6570, 102]),
       list([101, 1996, 3185, 2003, 25757, 2011, 1037, 24466, 16134, 2008, 1005, 1055, 2074, 6388, 2438, 2000, 7344, 3686, 1996, 7731, 4378, 209

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />
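
To make the "words and subwords" idea concrete, here is a minimal sketch of greedy longest-match subword splitting, the general strategy used by WordPiece-style tokenizers. The vocabulary and the function name `wordpiece_split` are made up for illustration. The real tokenizer does more than this (lowercasing, punctuation splitting, mapping pieces to ids) and uses a far larger vocabulary.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (not the real DistilBERT vocabulary)
toy_vocab = {"re", "##ima", "##gin", "##ing", "funny", "film", "##s"}

print(wordpiece_split("reimagining", toy_vocab))  # ['re', '##ima', '##gin', '##ing']
print(wordpiece_split("films", toy_vocab))        # ['film', '##s']
print(wordpiece_split("zzz", toy_vocab))          # ['[UNK]']
```

This is why a single rare word can turn into several ids in the `tokenized` output, while common words map to one id each.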

### Padding
After tokenization, `tokenized` is a list of sentences -- each sentence is represented as a list of token ids. We want BERT to process our examples all at once (as one batch), which is faster. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array, rather than a list of lists (of different lengths).

In [None]:
# Find the length of the longest tokenized sentence among the 2,000
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)
# For this batch, max_len is 59
# Right-pad every shorter sentence with 0 (the [PAD] token id) up to max_len
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [None]:
print("Padded array", padded.ndim, "dimensions")


Padded array 2 dimensions


In [None]:
print("Shape of Padded array", padded.shape)

Shape of Padded array (2000, 59)


In [None]:
print("Size of Padded array", padded.size)

Size of Padded array 118000


In [None]:
print("Datatype of Padded array", padded.dtype)

Datatype of Padded array int64


In [None]:
print("Size of Padded array elements (in bytes)", padded.itemsize)

Size of Padded array elements (in bytes) 8


In [None]:
#size times itemsize
print("Total Size of the Padded array (in bytes)", padded.nbytes)

Total Size of the Padded array (in bytes) 944000


Our dataset is now in the `padded` variable. We can view its dimensions below:

In [None]:
np.array(padded).shape

(2000, 59)

In [None]:
padded[0]

array([  101,  1037, 18385,  1010,  6057,  1998,  2633, 18276,  2128,
       16603,  1997,  5053,  1998,  1996,  6841,  1998,  5687,  5469,
        3152,   102,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0])

In [None]:
padded[1]

array([  101,  4593,  2128, 27241, 23931,  2013,  1996,  6276,  2282,
        2723,  1997,  2151,  2445, 12217,  7815,   102,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0])

Padding makes all the lists the same length, but BERT also needs to be told which positions hold real tokens and which hold padding.

### Masking
If we directly send `padded` to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is:

In [None]:
# Build a mask with 1 where padded holds a real token and 0 where it holds padding (id 0)
attention_mask = np.where(padded != 0, 1, 0)
attention_mask

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])

In [None]:
attention_mask.shape

(2000, 59)
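
The padding and masking steps are easier to see on a tiny input. The sketch below repeats the same two operations on three made-up token lists (the ids are illustrative, apart from 101, 102 and 0, which play the roles of `[CLS]`, `[SEP]` and padding).

```python
import numpy as np

# Three toy "tokenized sentences" of different lengths
toy_tokenized = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 7, 102]]

# Pad every list with 0 up to the length of the longest one
max_len = max(len(ids) for ids in toy_tokenized)
toy_padded = np.array([ids + [0] * (max_len - len(ids)) for ids in toy_tokenized])

# 1 marks a real token, 0 marks padding
toy_mask = np.where(toy_padded != 0, 1, 0)

print(toy_padded)
print(toy_mask)
# Each row of the mask sums to the original sentence length
print(toy_mask.sum(axis=1))  # [4 3 5]
```

Note that this mask construction relies on 0 never appearing as a real token id, which holds here because 0 is reserved for padding.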

## Model #1: And Now, Deep Learning!
Now that we have our model and inputs ready, let's run our model!

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" />

The `model()` function runs our sentences through BERT. The results of the processing will be returned into `last_hidden_states`.

A few notes on PyTorch:

* It can be used as a replacement for NumPy that takes advantage of GPUs and other accelerators
* It is a library used to implement neural networks
* Its core data structure is the tensor; tensors encode the inputs and outputs of a model as well as the model's parameters
* It is imported with `import torch`





In [None]:
input_ids = torch.tensor(padded) 
input_ids

tensor([[  101,  1037, 18385,  ...,     0,     0,     0],
        [  101,  4593,  2128,  ...,     0,     0,     0],
        [  101,  2027,  3653,  ...,     0,     0,     0],
        ...,
        [  101,  2023,  2028,  ...,     0,     0,     0],
        [  101,  1999,  1996,  ...,     0,     0,     0],
        [  101,  1996,  3185,  ...,     0,     0,     0]])

In [None]:
attention_mask = torch.tensor(attention_mask)
attention_mask

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])

In [None]:
# torch.no_grad() disables gradient tracking: we only run a forward pass here,
# with no backpropagation, so this reduces memory consumption.
# Inside the block, results have requires_grad=False even if the inputs have requires_grad=True.
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Let's slice only the part of the output that we need: the output corresponding to the first token of each sentence. The way BERT does sentence classification is that it adds a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

We'll save those in the `features` variable, as they'll serve as the features to our logistic regression model.
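
The slice itself is plain array indexing. As a small sketch, the NumPy stand-in below has the same layout as the model output (sentences, token positions, hidden units) but a made-up batch of 4 sentences and 6 positions; in this notebook the real tensor would be 2000 by 59 by 768.

```python
import numpy as np

# Stand-in for the model output: (4 sentences, 6 token positions, 768 hidden units)
toy_hidden = np.arange(4 * 6 * 768, dtype=np.float32).reshape(4, 6, 768)

# [:, 0, :] keeps every sentence, only token position 0, and all hidden units
toy_features = toy_hidden[:, 0, :]

print(toy_features.shape)  # (4, 768)
```

The result is a 2-d array with one 768-dimensional row per sentence, which is the shape scikit-learn expects for a feature matrix.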

In [None]:
# The model output is a BaseModelOutput; its first element is the last_hidden_state tensor
last_hidden_states


BaseModelOutput([('last_hidden_state',
                  tensor([[[-0.2159, -0.1403,  0.0083,  ..., -0.1369,  0.5867,  0.2011],
                           [-0.2471,  0.2468,  0.1008,  ..., -0.1631,  0.9349, -0.0715],
                           [ 0.0558,  0.3573,  0.4140,  ..., -0.2430,  0.1770, -0.5080],
                           ...,
                           [-0.0165,  0.1179,  0.3512,  ..., -0.2401,  0.2722, -0.1750],
                           [ 0.0961,  0.0667,  0.3147,  ..., -0.3277,  0.3556, -0.2135],
                           [ 0.0454,  0.0519,  0.3168,  ..., -0.2880,  0.1844, -0.1042]],
                  
                          [[-0.1726, -0.1448,  0.0022,  ..., -0.1744,  0.2139,  0.3720],
                           [ 0.0022,  0.1684,  0.1269,  ..., -0.1888, -0.0195, -0.0283],
                           [ 0.0257, -0.2458,  0.0717,  ..., -0.4339,  0.1622,  0.0133],
                           ...,
                           [ 0.0505, -0.0493,  0.0463,  ..., -0.0448, -0.054

In [None]:
# Keep the [CLS] vector (token position 0) of every sentence and
# convert it to a NumPy array for use with sklearn algorithms
features = last_hidden_states[0][:,0,:].numpy()

In [None]:
features

array([[-0.21593434, -0.14028926,  0.00831123, ..., -0.13694869,
         0.5867001 ,  0.20112717],
       [-0.17262726, -0.1447617 ,  0.00223425, ..., -0.17442562,
         0.21386456,  0.37197497],
       [-0.05063343,  0.07203954, -0.02959665, ..., -0.07148951,
         0.7185235 ,  0.26225498],
       ...,
       [-0.27829763, -0.24803594,  0.13585813, ..., -0.1903916 ,
         0.13099582,  0.34978363],
       [-0.03667719,  0.10638559, -0.01111007, ..., -0.11206657,
         0.41619483,  0.50338006],
       [ 0.12402609,  0.01425178,  0.01038435, ..., -0.11606541,
         0.5345915 ,  0.27495354]], dtype=float32)

Until now, we were working on the review column of our batch of 2,000 reviews. The labels indicating which sentences are positive and which are negative now go into the `labels` variable.

In [None]:
labels = batch_1[1]

## sklearn model: Train/Test Split
Let's now split our dataset into a training set and a testing set (even though we're using 2,000 sentences from the SST2 training set).

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, random_state = 42)
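
Since no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. The sketch below checks the resulting sizes on placeholder arrays with the same shapes as `features` and `labels` (the zeros and alternating labels are stand-ins, not real data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays shaped like `features` (2000 x 768) and `labels` (2000,)
toy_features = np.zeros((2000, 768), dtype=np.float32)
toy_labels = np.arange(2000) % 2

tr_x, te_x, tr_y, te_y = train_test_split(toy_features, toy_labels, random_state=42)

print(tr_x.shape, te_x.shape)  # (1500, 768) (500, 768)
```

So the classifier is trained on 1,500 sentence embeddings and evaluated on the remaining 500.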

To recap, so far we have:

* Converted the input reviews into a format suitable for BERT
* Run our model
* Converted the features back into a NumPy array
* Done our train/test split

We now have our sentence embeddings from DistilBERT and their labels, and we are going to classify them with scikit-learn's Logistic Regression.



<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />

### Grid Search for Parameters
We can dive into logistic regression directly with the scikit-learn default parameters, but sometimes it's worth searching for the best value of the C parameter, which is the inverse of the regularization strength (smaller values mean stronger regularization).

In [None]:
parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(LogisticRegression(), parameters)
grid_search.fit(train_features, train_labels)

print('Best parameters: ', grid_search.best_params_)
print('Best scores: ', grid_search.best_score_)

Best parameters:  {'C': 5.263252631578947}
Best scores:  0.8193333333333334
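
The best C reported above is simply the second point of the search grid: `np.linspace` spaces the 20 candidates evenly, so only one candidate lies below roughly 5.26. Because C is often explored over several orders of magnitude, a log-spaced grid is a common alternative. The `np.logspace` grid below is a suggestion for illustration, not something this notebook ran.

```python
import numpy as np

# The linear grid used above: 20 evenly spaced values from 0.0001 to 100
linear_grid = np.linspace(0.0001, 100, 20)
print(linear_grid[:3])

# A log-spaced alternative: 1e-4, 1e-3, ..., 1e2 (an assumption, not what was run above)
log_grid = np.logspace(-4, 2, 7)
print(log_grid)
```

Passing `{'C': log_grid}` to `GridSearchCV` would probe small values of C much more densely, at the cost of fewer candidates between 10 and 100.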


We now train the LogisticRegression model using a value of C close to the one found through GridSearchCV (about 5.26, rounded here to 5.2).

In [None]:
lr_clf = LogisticRegression(C=5.2)
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=5.2, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

<img src="https://jalammar.github.io/images/distilBERT/bert-training-logistic-regression.png" />

## Evaluating Model 1
So how well does our model do in classifying sentences? One way is to check the accuracy against the testing dataset:

In [None]:
lr_clf.score(test_features, test_labels)

0.838

How good is this score? What can we compare it against? Let's first look at a dummy classifier, which is useful as a simple baseline to compare with other (real) classifiers.


In [None]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()
# Evaluate the dummy classifier by cross-validation (cross_val_score from sklearn.model_selection)
scores = cross_val_score(clf, train_features, train_labels)
# scores holds one accuracy value per fold; report their mean and spread
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.495 (+/- 0.07)
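
Depending on the scikit-learn version, the default `DummyClassifier` strategy either draws labels at random according to the class frequencies or always predicts the most frequent class. As a complementary reference point, the accuracy of the second strategy can be worked out by hand from the class counts printed earlier. This is computed on the full batch of 2,000, so the figure for the training split alone may differ slightly.

```python
# Class counts reported by value_counts() earlier in the notebook
n_positive, n_negative = 1041, 959

# Accuracy of a classifier that always predicts the more frequent class
majority_accuracy = max(n_positive, n_negative) / (n_positive + n_negative)
print(f"Always-predict-majority accuracy: {majority_accuracy:.4f}")
```

Either way the baseline sits close to 50%, which is what we expect for a nearly balanced two-class problem.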


So our model clearly does better than a dummy classifier. But how does it compare against the best models?

## Proper SST2 scores
For reference, the [highest accuracy score](http://nlpprogress.com/english/sentiment_analysis.html) for this dataset is currently **96.8**.

DistilBERT can be trained to improve its score on this task – a process called **fine-tuning** which updates BERT’s weights to make it achieve a better performance in this sentence classification task (which we can call the downstream task). 

The fine-tuned DistilBERT turns out to achieve an accuracy score of **90.7**. The full size BERT model achieves **94.9**.


The next step would be to head over to the documentation and try [fine-tuning](https://huggingface.co/transformers/examples.html#glue). 

We can also go back and switch from distilBERT to BERT and see how that works.

# Model #2 Using TensorFlow 



Install TensorFlow Hub, which we use to load a pre-trained text embedding





In [None]:
pip install -q tensorflow-hub


Import libraries

In [None]:
import os
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")


Version:  2.4.1
Eager mode:  True
Hub version:  0.11.0
GPU is NOT AVAILABLE


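The output above shows the TensorFlow and TensorFlow Hub versions in use.

For this model we load the IMDB reviews dataset from TensorFlow Datasets and split it into training, validation and test sets.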


In [None]:
# Split the training set into 60% and 40%, so we'll end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

INFO:absl:No config specified, defaulting to first: imdb_reviews/plain_text
INFO:absl:Load dataset info from /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0
INFO:absl:Reusing dataset imdb_reviews (/root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0)
INFO:absl:Constructing tf.data.Dataset for split ('train[:60%]', 'train[60%:]', 'test'), from /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0


In [None]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
train_examples_batch

<tf.Tensor: shape=(10,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell 

In [None]:
train_labels_batch

<tf.Tensor: shape=(10,), dtype=int64, numpy=array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])>

## Build the model

The neural network is created by stacking layers. This requires three main architectural decisions:

* How to represent the text?
* How many layers to use in the model?
* How many hidden units to use for each layer?

One way to represent the text is to convert sentences into embedding vectors. We can use a pre-trained text embedding as the first layer, which will have three advantages:

* we don't have to worry about text preprocessing
* we can benefit from transfer learning
* the embedding has a fixed size, so it's simpler to process



Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples.

Note that no matter the length of the input text, the output shape of the embeddings is: (num_examples, embedding_dimension).

In [None]:
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

INFO:absl:Using /tmp/tfhub_modules to cache modules.
INFO:absl:Downloading TF-Hub Module 'https://tfhub.dev/google/nnlm-en-dim50/2'.
INFO:absl:Downloaded https://tfhub.dev/google/nnlm-en-dim50/2, Total size: 191.83MB
INFO:absl:Downloaded TF-Hub Module 'https://tfhub.dev/google/nnlm-en-dim50/2'.


<tf.Tensor: shape=(3, 50), dtype=float32, numpy=
array([[ 0.5423195 , -0.0119017 ,  0.06337538,  0.06862972, -0.16776837,
        -0.10581174,  0.16865303, -0.04998824, -0.31148055,  0.07910346,
         0.15442263,  0.01488662,  0.03930153,  0.19772711, -0.12215476,
        -0.04120981, -0.2704109 , -0.21922152,  0.26517662, -0.80739075,
         0.25833532, -0.3100421 ,  0.28683215,  0.1943387 , -0.29036492,
         0.03862849, -0.7844411 , -0.0479324 ,  0.4110299 , -0.36388892,
        -0.58034706,  0.30269456,  0.3630897 , -0.15227164, -0.44391504,
         0.19462997,  0.19528408,  0.05666234,  0.2890704 , -0.28468323,
        -0.00531206,  0.0571938 , -0.3201318 , -0.04418665, -0.08550783,
        -0.55847436, -0.23336391, -0.20782952, -0.03543064, -0.17533456],
       [ 0.56338924, -0.12339553, -0.10862679,  0.7753425 , -0.07667089,
        -0.15752277,  0.01872335, -0.08169781, -0.3521876 ,  0.4637341 ,
        -0.08492756,  0.07166859, -0.00670817,  0.12686075, -0.19326553,
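
Why the output size does not depend on the input length can be sketched with a toy embedding table. The table, its 5 dimensions, and the choice of averaging are all assumptions made for illustration; the actual hub module may combine token embeddings differently.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 5-dimensional embedding table for a tiny vocabulary
toy_table = {w: rng.normal(size=5) for w in ["a", "great", "movie", "terrible", "truly"]}

def embed_sentence(sentence):
    """Average the token vectors; words missing from the table are skipped."""
    vectors = [toy_table[t] for t in sentence.split() if t in toy_table]
    return np.mean(vectors, axis=0)

short = embed_sentence("great movie")
longer = embed_sentence("a truly terrible movie")

print(short.shape, longer.shape)  # (5,) (5,)
```

Because the per-token vectors are reduced to a single vector, a two-word and a four-word sentence both end up with the same embedding dimension.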
 

Let's now build the full model:

In [None]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 50)                48190600  
_________________________________________________________________
dense (Dense)                (None, 16)                816       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 48,191,433
Trainable params: 48,191,433
Non-trainable params: 0
_________________________________________________________________
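
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-output pair plus one bias per output. The embedding count is taken as reported above. Everything is trainable because the hub layer was created with `trainable=True`.

```python
embedding_params = 48_190_600   # reported for the hub layer in the summary above
dense_params = 50 * 16 + 16     # weights + biases of the 16-unit layer
output_params = 16 * 1 + 1      # weights + bias of the single output unit

total = embedding_params + dense_params + output_params
print(dense_params, output_params, total)  # 816 17 48191433
```

Almost all of the model's parameters sit in the pre-trained embedding; the classifier head on top adds only 833.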


The layers are stacked sequentially to build the classifier:

* The first layer is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map a sentence into its embedding vector. The pre-trained text embedding model that we are using (google/nnlm-en-dim50/2) splits the sentence into tokens, embeds each token and then combines the embeddings. The resulting dimensions are: (num_examples, embedding_dimension). For this NNLM model, the embedding_dimension is 50.
* This fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units.
* The last layer is densely connected with a single output node.

Let's compile the model.

Define the loss function to use

A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs logits (a single-unit layer with a linear activation), we'll use the binary_crossentropy loss function.

This isn't the only choice for a loss function, you could, for instance, choose mean_squared_error. But, generally, binary_crossentropy is better for dealing with probabilities—it measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and the predictions.
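
Because the last layer has no activation, the loss has to be computed from raw logits rather than probabilities. In Keras this corresponds to `tf.keras.losses.BinaryCrossentropy(from_logits=True)`, passed to `model.compile` together with an optimizer. The compile cell is not shown in this notebook, so its exact settings are not known. The NumPy sketch below shows what such a loss computes; the function name `bce_from_logits` and the toy values are illustrative.

```python
import numpy as np

def bce_from_logits(logits, labels):
    """Mean binary cross-entropy computed directly from logits (numerically stable form)."""
    z = np.asarray(logits, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    # Algebraically equal to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
    per_example = np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return per_example.mean()

toy_logits = [2.0, -1.0, 0.0]   # raw model outputs
toy_labels = [1, 0, 1]          # ground-truth classes

print(bce_from_logits(toy_logits, toy_labels))
```

A logit of 0 corresponds to a predicted probability of 0.5, so the third example contributes log(2) to the loss regardless of its label.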

Train the model 

Train the model for 10 epochs in mini-batches of 512 samples. 

This is 10 iterations over all samples in `train_data`.

While training, monitor the model's loss and accuracy on the 10,000 samples from the validation set:



In [None]:
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
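
The number of batches per epoch follows from the split sizes and the batch size of 512. The last batch of each split is partial, hence the rounding up.

```python
import math

batch_size = 512
n_train = 15_000   # 60% of the 25,000 IMDB training reviews
n_val = 10_000     # the remaining 40%
n_test = 25_000    # the IMDB test split

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(name, math.ceil(n / batch_size))
```

This gives 30 training batches and 20 validation batches per epoch, and 49 batches for the test split.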


Evaluate the model

And let's see how the model performs.

Two values will be returned. 

Loss (a number which represents our error, lower values are better), and accuracy.

In [None]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

49/49 - 2s - loss: 0.4869 - accuracy: 0.8495
loss: 0.487
accuracy: 0.850


# We have looked at two transfer learning approaches to classify text reviews, using pre-trained models from Hugging Face (DistilBERT) and TensorFlow Hub


