<a href="https://www.kaggle.com/code/edaaydinea/transfer-learning-for-nlp-with-tensorflow-hub?scriptVersionId=123511242" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Transfer Learning for NLP with TensorFlow Hub

*Author: Eda AYDIN*

# A. Project Objectives

We're going to focus on three learning objectives:

1. Use pre-trained NLP text embedding models from [TensorFlow Hub](https://tfhub.dev/)
2. Perform transfer learning to fine-tune models on real-world text data
3. Visualize model performance metrics with [TensorBoard](https://www.tensorflow.org/tensorboard)

By the end of this project, you will be able to use pre-trained NLP text embedding models from TensorFlow Hub, perform transfer learning to fine-tune models on real-world data, build and evaluate multiple text classification models with TensorFlow, and visualize model performance metrics with TensorBoard.

**Prerequisites:** To complete this project successfully, we should be competent in the Python programming language, be familiar with deep learning for Natural Language Processing (NLP), and have trained models with TensorFlow or its Keras API.



# B. Project Structure

# Task 1: Introduction to the Project

[TensorFlow Hub](https://tfhub.dev/) is a repository of pre-trained TensorFlow models.

In this project, we will use pre-trained models from TensorFlow Hub with [`tf.keras`](https://www.tensorflow.org/api_docs/python/tf/keras) for text classification. Transfer learning makes it possible to save training resources and to achieve good model generalization even when training on a small dataset. We will demonstrate this by training with several different TF-Hub modules.


# Task 2: Set Up Your TensorFlow and Colab Runtime

In [None]:
!nvidia-smi

In [None]:
import numpy as np
import pandas as pd

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 8)
from IPython import display

import pathlib
import shutil
import tempfile

!pip install -q git+https://github.com/tensorflow/docs

import tensorflow_docs as tfdocs
import tensorflow_docs.modeling
import tensorflow_docs.plots

print("Version: ", tf.__version__)
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

logdir = pathlib.Path(tempfile.mkdtemp())/"tensorboard_logs"
shutil.rmtree(logdir, ignore_errors=True)

# Task 3: Load the Quora Insincere Questions Dataset

### Data and General Description

In this project, we will predict whether a question asked on Quora is sincere or not.

An insincere question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is insincere:

* Has a non-neutral tone
    * Has an exaggerated tone to underscore a point about a group of people
    * Is rhetorical and meant to imply a statement about a group of people
* Is disparaging or inflammatory
    * Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
    * Makes disparaging attacks/insults against a specific person or group of people
    * Based on an outlandish premise about a group of people
    * Disparages against a characteristic that is not fixable and not measurable
* Isn't grounded in reality
    * Based on false information, or contains absurd assumptions
* Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

The training data includes the question that was asked, and whether it was identified as insincere `(target = 1)`. The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.

Note that the distribution of questions in the dataset should not be taken to be representative of the distribution of questions asked on Quora. This is, in part, because of the combination of sampling procedures and sanitization measures that have been applied to the final dataset.

### File descriptions

* train.csv - the training set
* test.csv - the test set
* sample_submission.csv - A sample submission in the correct format
* embeddings/ - (see below)

### Data fields

* qid - unique question identifier
* question_text - Quora question text
* target - a question labeled "insincere" has a value of 1, otherwise 0

This is a Kernels-only competition. The files in this Data section are downloadable for reference in Stage 1. Stage 2 files will only be available in Kernels and not available for download.

### What will be available in the 2nd stage of the competition?

In the second stage of the competition, we will re-run your selected Kernels. The following files will be swapped with new data:

* `test.csv` - This will be swapped with the complete public and private test dataset. This file will have ~56k rows in stage 1 and ~376k rows in stage 2. The public leaderboard data remains the same for both versions. The file name will be the same (both test.csv) to ensure that your code will run.
* `sample_submission.csv` - similar to test.csv, this will be changed from ~56k rows in stage 1 to ~376k rows in stage 2. The file name will remain the same.

### Embeddings

External data sources are not allowed for this competition. We are, though, providing a number of word embeddings along with the dataset that can be used in the models. These are as follows:

* GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
* glove.840B.300d - https://nlp.stanford.edu/projects/glove/
* paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
* wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html

In [None]:
"""
# Decompress and read the data into a pandas DataFrame without 
df = pd.read_csv("https://archive.org/download/fine-tune-bert-tensorflow-train.csv/train.csv.zip",compression = 'zip', low_memory = False)
df.shape
"""

In [None]:
df = pd.read_csv("/kaggle/input/quora-insincere-questions-classification/train.csv")
test_df = pd.read_csv("/kaggle/input/quora-insincere-questions-classification/test.csv")

In [None]:
df.head()

In [None]:
test_df.head()

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
# Count the number of occurrences of each target value
target_counts = df['target'].value_counts()

# Create a list with the target labels
target_labels = ['Sincere', 'Insincere']

# Create a list with the target percentages
target_percentages = [target_counts[0] / len(df), target_counts[1] / len(df)]

# Create the pie plot
fig, ax = plt.subplots(figsize= (6,6))
ax.pie(target_percentages,
       labels=target_labels,
       autopct='%1.1f%%',
       wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'},
       textprops={"size":"x-large"})
ax.set_title("Target Distribution", fontsize=18)
plt.axis('equal')
plt.show()


#### Split the data

In [None]:
from sklearn.model_selection import train_test_split

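# Keep 1% of the data for training, stratified on the target label
train_df, remaining = train_test_split(df,
                                       random_state = 42,
                                       train_size = 0.01,
                                       stratify = df.target.values)
# Keep 0.1% of the remaining rows for validation, again stratified
valid_df, _ = train_test_split(remaining,
                               random_state = 42,
                               train_size = 0.001,
                               stratify = remaining.target.values)
train_df.shape, valid_df.shape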

In [None]:
train_df.target.head(15).values

In [None]:
train_df.question_text.head(15).values
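
To see what the `stratify` argument in the split above does, here is a minimal sketch on synthetic data. The toy DataFrame, its size, and the 6% positive rate are assumptions chosen for illustration, not properties of the Quora dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Quora data: 10,000 rows, 6% labeled 1.
toy_df = pd.DataFrame({
    "question_text": [f"question {i}" for i in range(10_000)],
    "target": np.array([1] * 600 + [0] * 9_400),
})

# Same call pattern as above: keep 1% for training, stratified on the label.
toy_train, toy_rest = train_test_split(
    toy_df,
    random_state=42,
    train_size=0.01,
    stratify=toy_df.target.values,
)

# Stratification keeps the positive rate of each part close to the full data.
print("full :", toy_df.target.mean())
print("train:", toy_train.target.mean())
print("rest :", toy_rest.target.mean())
```

Without `stratify`, a 1% sample of a strongly imbalanced dataset could end up with noticeably more or fewer positive examples than the full data.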

# Task 4: TensorFlow Hub for Natural Language Processing  

Our text data consists of questions and corresponding labels.

You can think of a question vector as a distributed representation of a question; one is computed for every question in the training set. The question vector, along with the output label, is then used to train the statistical classification model.

The intuition is that the question vector captures the semantics of the question and, as a result, can be effectively used for classification. 

To obtain question vectors, we have two alternatives that have been used for several text classification problems in NLP: 
* word-based representations and 
* context-based representations

#### Word-based Representations

- A **word-based representation** of a question combines the word embeddings of the content words in the question. A common choice is the average of those word embeddings, which has been used for a range of NLP tasks.
- Examples of pre-trained embeddings include:
  - **Word2Vec**: pre-trained word embeddings learned from large text corpora. Word2Vec has been pre-trained on a corpus of news articles with 300 million tokens, resulting in 300-dimensional vectors.
  - **GloVe**: pre-trained on a corpus of tweets with 27 billion tokens, resulting in 200-dimensional vectors.
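
The averaging idea described above can be sketched in a few lines of NumPy. The embedding table, its 4 dimensions, the stop-word list, and the helper name `average_embedding` are all hypothetical and exist only for illustration; real Word2Vec or GloVe vectors would be loaded from the published files.

```python
import numpy as np

# Hypothetical 4-dimensional embedding table; real Word2Vec or GloVe vectors
# would be loaded from disk and have 200-300 dimensions.
toy_embeddings = {
    "why":    np.array([0.1, 0.0, 0.2, 0.0]),
    "people": np.array([0.4, 0.8, 0.0, 0.2]),
    "learn":  np.array([0.2, 0.4, 0.6, 0.0]),
    "python": np.array([0.0, 0.0, 0.6, 1.0]),
}
# Assumed stop-word list used to keep only content words.
stop_words = {"why", "do", "the", "a"}


def average_embedding(question, embeddings, dim=4):
    """Average the vectors of known content words; zeros if none are found."""
    tokens = question.lower().replace("?", "").split()
    vectors = [embeddings[t] for t in tokens
               if t in embeddings and t not in stop_words]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


question_vector = average_embedding("Why do people learn Python?", toy_embeddings)
print(question_vector)
```

The resulting vector has the same dimensionality as the word vectors, regardless of how long the question is, which is what makes it usable as a fixed-size classifier input.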


#### Context-based Representations

- **Context-based representations** may use language models to generate sentence vectors. Instead of learning vectors for individual words, they compute a vector for the sentence as a whole, taking into account the order of the words and the set of co-occurring words.
- Examples of deep contextualised vectors include:
  - **Embeddings from Language Models (ELMo)**: uses character-based word representations and bidirectional LSTMs. The pre-trained model computes a contextualised vector of 1024 dimensions. ELMo is available on TensorFlow Hub.
  - **Universal Sentence Encoder (USE)**: The encoder uses a Transformer architecture whose attention mechanism incorporates information about the order and the collection of words. A pre-trained USE model that returns a 512-dimensional vector is also available on TensorFlow Hub.
  - **Neural-Net Language Model (NNLM)**: The model simultaneously learns representations of words and probability functions for word sequences, allowing it to capture the semantics of a sentence. We will use pre-trained models available on TensorFlow Hub that are trained on the English Google News 200B corpus and compute a vector of 128 dimensions for the larger model and 50 dimensions for the smaller model.
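
To illustrate why word order matters, the sketch below compares a plain average of word vectors with a toy order-aware encoding. The 3-dimensional vectors and the position-weighting scheme are assumptions made up for this example; they are not how ELMo, USE, or NNLM work internally.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only.
toy_vectors = {
    "dog":   np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 0.0, 1.0]),
}


def bag_of_words_vector(sentence):
    """Order-insensitive: plain average of the word vectors."""
    return np.mean([toy_vectors[w] for w in sentence.split()], axis=0)


def position_weighted_vector(sentence):
    """Toy order-aware encoding: weight each word by its 1-based position."""
    words = sentence.split()
    weights = np.arange(1, len(words) + 1)
    stacked = np.stack([toy_vectors[w] for w in words])
    return (weights[:, None] * stacked).sum(axis=0) / weights.sum()


a, b = "dog bites man", "man bites dog"
same_bow = np.allclose(bag_of_words_vector(a), bag_of_words_vector(b))
same_pos = np.allclose(position_weighted_vector(a), position_weighted_vector(b))
print(same_bow, same_pos)
```

The averaged vectors for the two sentences are identical even though the sentences mean different things, while the order-aware encoding tells them apart.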


TensorFlow Hub provides a number of [modules](https://tfhub.dev/s?module-type=text-embedding&tf-version=tf2&q=tf2) to convert sentences into embeddings, such as Universal Sentence Encoder, NNLM, BERT, and Wikiwords.


# Tasks 5 & 6: Define Function to Build and Compile Models

In [None]:
def train_and_evaluate_model(module_url, embed_size, name, trainable=False):
    # Define a KerasLayer that loads the pre-trained module at the given module_url,
    # with input_shape=[], output_shape=[embed_size], dtype=tf.string, and trainable=trainable
    hub_layer = hub.KerasLayer(module_url, input_shape = [], output_shape = [embed_size], dtype= tf.string,
                              trainable = trainable)
    
    # Define a sequential model with hub_layer as the first layer,
    # followed by two dense layers with 256 and 64 units, respectively,
    # and a final dense layer with 1 unit and sigmoid activation
    model = tf.keras.models.Sequential([
        hub_layer,
        tf.keras.layers.Dense(256, activation = "relu"),
        tf.keras.layers.Dense(64, activation = "relu"),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ])
    
    # Compile the model with Adam optimizer with learning_rate=0.0001,
    # binary cross-entropy loss, and binary accuracy as the evaluation metric
    model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0001),
                 loss = tf.losses.BinaryCrossentropy(),
                 metrics = [tf.metrics.BinaryAccuracy(name="accuracy")])
    
    # Train the model on the training data, using validation data for early stopping
    # and callbacks for logging and monitoring training progress
    history = model.fit(train_df["question_text"], train_df["target"],
                       epochs = 100,
                       batch_size = 32,
                       validation_data = (valid_df["question_text"], valid_df["target"]),
                       callbacks = [tfdocs.modeling.EpochDots(),
                                    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, mode="min"),
                                    tf.keras.callbacks.TensorBoard(logdir/name)],
                       verbose = 0)
    
    # Return the training history
    return history
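
The `EarlyStopping` callback used above monitors `val_loss` with `patience=2`. The following is a simplified pure-Python sketch of that patience rule, not the Keras implementation; the helper name `early_stopping_epoch` and the loss values are hypothetical, and options such as `min_delta` or `restore_best_weights` are ignored.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None.

    Simplified sketch of patience-based early stopping on a metric
    that should decrease.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best value: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: count the epoch and stop once patience is used up.
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: the best value is reached at epoch 1.
toy_val_losses = [0.30, 0.25, 0.26, 0.27, 0.20]
stop_epoch = early_stopping_epoch(toy_val_losses, patience=2)
print(stop_epoch)
```

In this toy run, training stops after two consecutive epochs without improvement, so the lower loss at the last epoch is never reached. A small patience saves training time at the risk of stopping before a later improvement.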


# Task 7: Train Various Text Classification Models

In [None]:
# Define the available module URLs and embedding sizes as a dictionary
module_urls = {
    "gnews-swivel-20dim": {"url": "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1", "embed_size": 20},
    "nnlm-en-dim50": {"url": "https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1", "embed_size": 50},
    "nnlm-en-dim128": {"url": "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1", "embed_size": 128},
}

# Create an empty dictionary to store the model histories
histories = {}

# Iterate through the module URLs and train and evaluate the models
for name, values in module_urls.items():
    url = values["url"]
    embed_size = values["embed_size"]
    history = train_and_evaluate_model(url, embed_size=embed_size, name=name)
    histories[name] = history


# Task 8: Compare Accuracy and Loss Curves

In [None]:
plt.rcParams['figure.figsize'] = (12, 8)
plotter = tfdocs.plots.HistoryPlotter(metric = 'accuracy')
plotter.plot(histories)
plt.xlabel("Epochs")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
plt.title("Accuracy Curves for Models")
plt.show()

In [None]:
plotter = tfdocs.plots.HistoryPlotter(metric = 'loss')
plotter.plot(histories)
plt.xlabel("Epochs")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
plt.title("Loss Curves for Models")
plt.show()

# Task 9: Fine-tune Model from TF Hub

In [None]:
# Define the available module URLs and embedding sizes as a dictionary
module_urls = {
    "gnews-swivel-20dim": {"url": "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1", "embed_size": 20},
    "nnlm-en-dim50": {"url": "https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1", "embed_size": 50},
    "nnlm-en-dim128": {"url": "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1", "embed_size": 128},
    "gnews-swivel-20dim-finetuned": {"url": "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1", "embed_size": 20}
}
# Create an empty dictionary to store the model histories
histories = {}

# Iterate through the module URLs and train and evaluate the models
for name, values in module_urls.items():
    url = values["url"]
    embed_size = values["embed_size"]
    # Fine-tune only the entry whose key contains "finetuned"
    trainable = "finetuned" in name
    history = train_and_evaluate_model(url, embed_size=embed_size, name=name, trainable= trainable)
    histories[name] = history

In [None]:
plt.rcParams['figure.figsize'] = (12, 8)
plotter = tfdocs.plots.HistoryPlotter(metric = 'accuracy')
plotter.plot(histories)
plt.xlabel("Epochs")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
plt.title("Accuracy Curves for Models")
plt.show()

In [None]:
plotter = tfdocs.plots.HistoryPlotter(metric = 'loss')
plotter.plot(histories)
plt.xlabel("Epochs")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
plt.title("Loss Curves for Models")
plt.show()

# Task 10: Train Bigger Models and Visualize Metrics with TensorBoard

In [None]:
# Define the available module URLs and embedding sizes as a dictionary
module_urls = {
    "universal-sentence-encoder": {"url": "https://tfhub.dev/google/universal-sentence-encoder/4", "embed_size": 512},
    "universal-sentence-encoder-large": {"url": "https://tfhub.dev/google/universal-sentence-encoder-large/5", "embed_size": 512}
}

# Create an empty dictionary to store the model histories
histories = {}

# Iterate through the module URLs and train and evaluate the models
for name, values in module_urls.items():
    url = values["url"]
    embed_size = values["embed_size"]
    history = train_and_evaluate_model(url, embed_size=embed_size, name=name)
    histories[name] = history

In [None]:
plt.rcParams['figure.figsize'] = (12, 8)
plotter = tfdocs.plots.HistoryPlotter(metric = 'accuracy')
plotter.plot(histories)
plt.xlabel("Epochs")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
plt.title("Accuracy Curves for Models")
plt.show()

In [None]:
plotter = tfdocs.plots.HistoryPlotter(metric = 'loss')
plotter.plot(histories)
plt.xlabel("Epochs")
plt.legend(bbox_to_anchor=(1.0, 1.0), loc='upper left')
plt.title("Loss Curves for Models")
plt.show()

In [None]:
%load_ext tensorboard
%tensorboard --logdir {logdir}

# Resources

* [Quora Insincere Questions Classification Dataset](https://www.kaggle.com/c/quora-insincere-questions-classification/data)
* [Pie Charts with Labels in Matplotlib](https://www.pythoncharts.com/matplotlib/pie-chart-matplotlib/)