# **Setup Environment**

For this notebook to work, you need to first run the following from the root directory:

```shell
python ./Pipeline/Create_Unlabeled_Dataset.py
```

Doing this will generate `policy_texts.csv` in the root directory. Make sure `file_path` below is the path to this file.


In [None]:
file_path = "./policy_texts.csv"

## **Install Transformers**

In [1]:
!pip install transformers



## **Imports**

In [2]:
import os
import requests
import zipfile
import tarfile
import shutil
import math
import json
import time
import sys
import string
import re
import numpy as np
import pandas as pd
from glob import glob
import collections
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

# Tensorflow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.utils import to_categorical
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.utils.layer_utils import count_params

# sklearn
from sklearn.model_selection import train_test_split

# Transformers
from transformers import BertTokenizer, TFBertForSequenceClassification, BertConfig
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel, GPT2Config

## **Environment Check**

In [3]:
# Enable/Disable Eager Execution
# Reference: https://www.tensorflow.org/guide/eager
# TensorFlow's eager execution is an imperative programming environment that evaluates operations immediately, 
# without building graphs

#tf.compat.v1.disable_eager_execution()
#tf.compat.v1.enable_eager_execution()

print("tensorflow version", tf.__version__)
print("keras version", tf.keras.__version__)
print("Eager Execution Enabled:", tf.executing_eagerly())

# Get the number of replicas 
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

devices = tf.config.experimental.get_visible_devices()
print("Devices:", devices)
print(tf.config.experimental.list_logical_devices('GPU'))

print("GPU Available: ", tf.config.list_physical_devices('GPU'))
print("All Physical Devices", tf.config.list_physical_devices())

# Better performance with the tf.data API
# Reference: https://www.tensorflow.org/guide/data_performance
AUTOTUNE = tf.data.experimental.AUTOTUNE

tensorflow version 2.6.0
keras version 2.6.0
Eager Execution Enabled: True
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
Number of replicas: 1
Devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[LogicalDevice(name='/device:GPU:0', device_type='GPU')]
GPU Available:  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
All Physical Devices [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


Run this cell to see which GPU you have. A P100 or T4 is ideal; a K80 will still work, but training will be slow.

In [4]:
!nvidia-smi

Mon Oct 25 13:46:50 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P0    70W / 149W |    123MiB / 11441MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
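The same check can be done programmatically. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH` when a GPU is present; the helper name `list_gpu_names` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess

def list_gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are found."""
    # nvidia-smi is absent on CPU-only machines, so check before calling it
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

gpu_names = list_gpu_names()
if any("K80" in name for name in gpu_names):
    print("K80 detected: training will work but will be slow.")
else:
    print("GPUs found:", gpu_names)
```

An empty list means no NVIDIA GPU was detected, in which case training falls back to the CPU.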

## **Utils**

Here are some utility functions that we will use throughout this notebook.

In [5]:
def download_file(packet_url, base_path="", extract=False, headers=None):
  # Create the target directory (including parents) if it does not exist
  if base_path != "":
    os.makedirs(base_path, exist_ok=True)
  packet_file = os.path.basename(packet_url)
  packet_path = os.path.join(base_path, packet_file)

  # Stream the download to disk in chunks to keep memory usage low
  with requests.get(packet_url, stream=True, headers=headers) as r:
    r.raise_for_status()
    with open(packet_path, 'wb') as f:
      for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)

  # Optionally unpack the archive into base_path
  if extract:
    if packet_file.endswith(".zip"):
      with zipfile.ZipFile(packet_path) as zfile:
        zfile.extractall(base_path)
    else:
      with tarfile.open(packet_path) as tfile:
        tfile.extractall(base_path)

import decimal  # required for the decimal.Decimal check below

class JsonEncoder(json.JSONEncoder):
  def default(self, obj):
    if isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, decimal.Decimal):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return super(JsonEncoder, self).default(obj)

experiment_name = None
def create_experiment():
  global experiment_name
  experiment_name = "experiment_" + str(int(time.time()))

  # Create experiment folder
  if not os.path.exists(experiment_name):
      os.mkdir(experiment_name)

def save_data_details(data_details):
  with open(os.path.join(experiment_name,"data_details.json"), "w") as json_file:
    json_file.write(json.dumps(data_details,cls=JsonEncoder))

def save_model(model,model_name="model01"):

  if isinstance(model,TFBertForSequenceClassification):
    model.save_weights(os.path.join(experiment_name,model_name+".h5"))
  else:
    # Save the entire model (structure + weights)
    model.save(os.path.join(experiment_name,model_name+".hdf5"))

    # Save only the weights
    model.save_weights(os.path.join(experiment_name,model_name+".h5"))

    # Save the structure only
    model_json = model.to_json()
    with open(os.path.join(experiment_name,model_name+".json"), "w") as json_file:
        json_file.write(model_json)

def get_model_size(model_name="model01"):
  model_size = os.stat(os.path.join(experiment_name,model_name+".h5")).st_size
  return model_size

def evaluate_save_model(model,test_data, training_results,execution_time, learning_rate, batch_size, epochs, optimizer,save=True):
    
  # Get the model train history
  model_train_history = training_results.history
  # Get the number of epochs the training was run for
  num_epochs = len(model_train_history["loss"])

  # Plot training results
  fig = plt.figure(figsize=(15,5))
  axs = fig.add_subplot(1,2,1)
  axs.set_title('Loss')
  # Plot all metrics
  for metric in ["loss","val_loss"]:
      axs.plot(np.arange(0, num_epochs), model_train_history[metric], label=metric)
  axs.legend()
  
  axs = fig.add_subplot(1,2,2)
  axs.set_title('Accuracy')
  # Plot all metrics
  for metric in ["accuracy","val_accuracy"]:
      axs.plot(np.arange(0, num_epochs), model_train_history[metric], label=metric)
  axs.legend()

  plt.show()
  
  # Evaluate on test data
  evaluation_results = model.evaluate(test_data)
  print(evaluation_results)
  
  if save:
    # Save model
    save_model(model, model_name=model.name)
    model_size = get_model_size(model_name=model.name)

    # Save model history
    with open(os.path.join(experiment_name,model.name+"_train_history.json"), "w") as json_file:
        json_file.write(json.dumps(model_train_history,cls=JsonEncoder))

    trainable_parameters = count_params(model.trainable_weights)
    non_trainable_parameters = count_params(model.non_trainable_weights)

    # Save model metrics
    metrics ={
        "trainable_parameters":trainable_parameters,
        "execution_time":execution_time,
        "loss":evaluation_results[0],
        "accuracy":evaluation_results[1],
        "model_size":model_size,
        "learning_rate":learning_rate,
        "batch_size":batch_size,
        "epochs":epochs,
        "name": model.name,
        "optimizer":type(optimizer).__name__
    }
    with open(os.path.join(experiment_name,model.name+"_model_metrics.json"), "w") as json_file:
        json_file.write(json.dumps(metrics,cls=JsonEncoder))

# **Prepare Training Data** 

We will be working with privacy policy data collected from the internet. We will explore the dataset and prepare the data for finetuning GPT2.

**The Task:** Finetune GPT2 to build a language model on privacy policy text. We aim to train a model that will generate text that resembles the type of writing you see in a privacy policy.

## **Load Data**

* Read in the data as a list of paragraphs.

In [6]:
df_policies = pd.read_csv(file_path , index_col=0)
df_policies.dropna(inplace=True)
training_data = list(df_policies.paragraph_text)
len(training_data)

2675

## **View Text**

Let's take a look at the data.

In [7]:
# Print a random sample of 10 paragraphs
# Note: `high` is exclusive, so len(training_data) covers every index
data_samples = np.random.randint(0, high=len(training_data), size=10)
for data_idx in data_samples:
  print("Text:",training_data[data_idx])

Text: 수신하지 않을 마케팅 또는 프로모션 이메일이 수신되는 경우, 각 메시지에 포함된 “수신 거부” 안내를 따르십시오.
Text: Open your Google Settings app > Ads >Enable “Opt Out of Interest-Based Advertising” or “Opt Out of Ads Personalization”.
Text: 귀하는 개인 연락처 정보를 전혀 공유하지 않고도 많은 Mattel 서비스를 이용할 수 있습니다. 귀하의 선택에 따라 귀하가 개인 연락처 정보, 개인 정보, 로그인 정보, 관심사 또는 인구 통계적 정보, 또는 귀하나 귀하의 자녀에 관한 설문조사 정보를 우리와 공유할 수 있는 경우는 아래와 같습니다:
Text: Information that we collect are "NETWORK STATUS INFORMATION", "WIFI STATUS INFORMATION", " INTERNAL DATA STORAGE"In order to show the advertisement, we need requires Internet Connection checking, either via non-Wi-Fi or Wi-Fi.Some of our apps may need External Data Storage, it is used only to improve the user experience in our apps such as storing the results of user exercises.Information Security
Text: Collecting User Information
Text: Samuel J or Eznetsoft uses remarketing with Google AdWords and analytics to display content-specific advertisements to visitors that have previously visited our site when those visito

## **Tokenize Data for GPT2**

We will use the tokenizer of the pretrained `distilgpt2` model, a distilled version of GPT2, to tokenize the text.

In [8]:
# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")

# Tokenize data
training_data_tokenized = []
for data in training_data:

  tokenized_text = tokenizer.encode(data)
  training_data_tokenized.append(tokenized_text)

print(len(training_data_tokenized))
print(len(training_data_tokenized[0]),training_data_tokenized[0][:20])

Token indices sequence length is longer than the specified maximum sequence length for this model (2550 > 1024). Running this sequence through the model will result in indexing errors


2675
18 [48948, 7820, 1849, 1222, 35118, 286, 5765, 1849, 7, 5956, 6153, 2362, 1987, 400, 11, 33448, 8, 29064]


## **Generate Training Data**

For training we need inputs and labels, but we only have privacy policy texts. In lecture we learned that language models are trained in a self-supervised way, where both the inputs and the labels are generated from the input text itself.

<br>

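To generate inputs and labels for training, we will chunk the tokenized text into blocks of size `100`. Each block yields an input made of its first 99 tokens and a label sequence made of its last 99 tokens, so the label at every position is the token that follows the corresponding input token.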
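To make the chunking and shifting concrete, here is a small sketch on a toy sequence. It assumes a block size of 4 instead of 100 so the result is easy to read; the helper name `chunk_and_shift` and the toy token IDs are illustrative only.

```python
def chunk_and_shift(token_ids, block_size):
    """Split token_ids into full blocks, then build next-token inputs and labels."""
    # Only full blocks are kept; leftover tokens at the end are dropped
    chunks = [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]
    inputs = [chunk[:-1] for chunk in chunks]  # all tokens except the last
    labels = [chunk[1:] for chunk in chunks]   # all tokens except the first
    return inputs, labels

toy_tokens = list(range(10, 21))  # 11 made-up token IDs: 10, 11, ..., 20
toy_inputs, toy_labels = chunk_and_shift(toy_tokens, block_size=4)

print(toy_inputs)  # [[10, 11, 12], [14, 15, 16]]
print(toy_labels)  # [[11, 12, 13], [15, 16, 17]]
```

The last three tokens (18, 19, 20) do not fill a block and are discarded, which is also why texts shorter than the block size contribute no training examples.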

In [9]:
# Split into blocks
training_chunks = []
block_size = 100
for tokenized_text in training_data_tokenized:
  for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Full blocks only; leftover tokens are dropped
      training_chunks.append(tokenized_text[i:i + block_size])

# Generate inputs and labels
inputs = []
labels = []
for ex in training_chunks:
    inputs.append(ex[:-1])
    labels.append(ex[1:])

print("inputs length:",len(inputs))
print("labels length:",len(labels))

inputs length: 1591
labels length: 1591


In [10]:
print("input:",len(inputs[0]),inputs[0][:20])
print("labels:",len(labels[0]),labels[0][:20])

input: 99 [32, 366, 44453, 1, 318, 281, 5002, 286, 1366, 326, 257, 5313, 14413, 460, 3758, 284, 534, 6444, 11, 543]
labels: 99 [366, 44453, 1, 318, 281, 5002, 286, 1366, 326, 257, 5313, 14413, 460, 3758, 284, 534, 6444, 11, 543, 743]


## **Create TF Datasets**

In [11]:
BATCH_SIZE = 12
TRAIN_SHUFFLE_BUFFER_SIZE = len(inputs)

# Create TF Dataset
train_data = tf.data.Dataset.from_tensor_slices((inputs, labels))

#############
# Train data
#############
train_data = train_data.shuffle(buffer_size=TRAIN_SHUFFLE_BUFFER_SIZE)
train_data = train_data.batch(BATCH_SIZE, drop_remainder=True)
train_data = train_data.prefetch(buffer_size=AUTOTUNE)

print("train_data",train_data)

train_data <PrefetchDataset shapes: ((12, 99), (12, 99)), types: (tf.int32, tf.int32)>
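As a quick sanity check on the dataset size, the number of batches per epoch can be worked out by hand. This is a minimal sketch using the counts printed above (1591 examples, batch size 12); the helper name `batches_per_epoch` is hypothetical.

```python
def batches_per_epoch(num_examples, batch_size, drop_remainder=True):
    """Number of batches produced by one pass over the data."""
    full_batches, leftover = divmod(num_examples, batch_size)
    if drop_remainder or leftover == 0:
        return full_batches
    return full_batches + 1

num_examples = 1591  # number of input/label pairs reported above
batch_size = 12      # BATCH_SIZE in the cell above

steps = batches_per_epoch(num_examples, batch_size)
dropped = num_examples - steps * batch_size

print("batches per epoch:", steps)   # 132
print("examples dropped:", dropped)  # 7
```

With `drop_remainder=True`, the 7 examples that do not fill a final batch are skipped in each epoch, which keeps every batch at the fixed shape `(12, 99)`.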


# **Finetune GPT2 Pretrained Model on Privacy Policy Dataset**

## **Train Model**

In [13]:
############################
# Training Params
############################
learning_rate = 3e-5 
epsilon=1e-08
clipnorm=1.0
epochs = 5 # 100

# Free up memory
K.clear_session()

# Build the model
model = TFGPT2LMHeadModel.from_pretrained("distilgpt2")

# Print the model architecture
print(model.summary())

# Optimizer
optimizer = keras.optimizers.Adam(learning_rate=learning_rate, epsilon=epsilon, clipnorm=clipnorm)
# Loss
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = keras.metrics.SparseCategoricalAccuracy('accuracy')

# Compile
model.compile(loss=[loss, *[None] * model.config.n_layer],
                  optimizer=optimizer,
                  metrics=[metric])

# Train model
start_time = time.time()
training_results = model.fit(
        train_data.take(100), # take(100) limits training to 100 batches for quick testing; pass train_data for a full run
        epochs=epochs, 
        verbose=1)
execution_time = (time.time() - start_time)/60.0
print("Training execution time (mins)",execution_time)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Model: "tfgp_t2lm_head_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
transformer (TFGPT2MainLayer multiple                  81912576  
Total params: 81,912,576
Trainable params: 81,912,576
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Training execution time (mins) 5.364365092913309


## **Model Configuration**

In [14]:
model.config

GPT2Config {
  "_name_or_path": "distilgpt2",
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 6,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.11.3",
  "use_cache": true,
  "vocab_size": 50257
}

# **Is Our Model Really Fine-Tuned?**


#### **Generate Text from Privacy-Policy-Trained Model**

In [15]:
# Input text
input_text = "Your location data"

# Tokenize Input
input_ids = tokenizer.encode(input_text, return_tensors='tf')
print("input_ids",input_ids)

# Generate output
outputs = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=75, 
    top_p=0.80, 
    top_k=0
)

print("Generated text:")
display(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


input_ids tf.Tensor([[7120 4067 1366]], shape=(1, 3), dtype=int32)
Generated text:


'Your location data will be used to set the time of your request. \nWhat happens when you send an email to a Facebook Group when the service has been activated, or when your account has been activated. \nYou may leave an account with your password, so that the information that you request is more freely available to you. \nYou may set the time of'
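The `generate` call above samples with `top_p=0.80` (nucleus sampling) and `top_k=0`, which, as far as we understand, disables top-k filtering. The sketch below illustrates the idea behind top-p filtering on a made-up probability distribution. It is a simplified illustration, not the `transformers` implementation, and the helper name `top_p_filter` is hypothetical.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p.

    Returns a dict mapping token index to its renormalized probability.
    """
    # Rank token indices from most to least likely
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)

    kept = []
    cumulative = 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break  # the nucleus is complete

    # Renormalize so the kept probabilities sum to 1 before sampling
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}

toy_probs = [0.5, 0.25, 0.15, 0.1]  # made-up next-token distribution
nucleus = top_p_filter(toy_probs, top_p=0.80)

print(sorted(nucleus))  # [0, 1, 2]
```

The least likely token (index 3) is excluded, and the next token is then sampled only from the remaining three. A lower `top_p` makes the generated text more conservative; a higher one makes it more varied.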

**GENERATED TEXT FROM <font color="green">PRIVACY POLICY </font>FINE-TUNED MODEL**

"Your location data will be used to set the time of your request. \nWhat happens when you send an email to a Facebook Group when the service has been activated, or when your account has been activated. \nYou may leave an account with your password, so that the information that you request is more freely available to you. \nYou may set the time of..."

#### **Generate Text from Model Pre-Trained on COVID News Articles**

In [16]:
# Download a model trained for 100 epochs on the full dataset (takes around 3 hours to train)
start_time = time.time()
download_file("https://github.com/dlops-io/models/releases/download/v1.0/distilgpt2_covid.zip", base_path="models", extract=True)
execution_time = (time.time() - start_time)/60.0
print("Download execution time (mins)",execution_time)

Download execution time (mins) 0.1273188312848409


In [17]:
# Load the previously trained model
loaded_model = TFGPT2LMHeadModel.from_pretrained('./models/distilgpt2_covid/')

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at ./models/distilgpt2_covid/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [18]:
# Input text
input_text = "Your location data"

# Tokenize Input
input_ids = tokenizer.encode(input_text, return_tensors='tf')
print("input_ids",input_ids)

# Generate output
outputs = loaded_model.generate(
    input_ids, 
    do_sample=True, 
    max_length=75, 
    top_p=0.80, 
    top_k=0
)

print("Generated text:")
display(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


input_ids tf.Tensor([[7120 4067 1366]], shape=(1, 3), dtype=int32)
Generated text:


'Your location data can save lives," Gerow says. "If there is not a change in the plan, then the only way to avoid it is to have people log in and make a schedule." You can also make a schedule using home data, such as your birthday and your school\'s total.  You can also make the decision to save money or move forward without having'

**GENERATED TEXT FROM <font color="red">COVID</font> FINE-TUNED MODEL**

"Your location data can save lives," Gerow says. "If there is not a change in the plan, then the only way to avoid it is to have people log in and make a schedule." You can also make a schedule using home data, such as your birthday and your school\'s total.  You can also make the decision to save money or move forward without having..."

Notice how the COVID-trained model continues the prompt "Your location data" with COVID-related sentences, whereas the model we trained generates text that resembles a privacy policy.