Name: Samuel Middleton<br>
Project: Using Fine-Tuned GPT2 to Generate Book Reviews<br>
Github: https://github.com/emperorner0 <br>
Email: samuelmiddleton93@gmail.com

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Business-Case-and-Purpose" data-toc-modified-id="Business-Case-and-Purpose-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Business Case and Purpose</a></span></li><li><span><a href="#Data-Access" data-toc-modified-id="Data-Access-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Access</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#MongoDB" data-toc-modified-id="MongoDB-2.0.1"><span class="toc-item-num">2.0.1&nbsp;&nbsp;</span>MongoDB</a></span></li></ul></li></ul></li><li><span><a href="#Data-Exploration" data-toc-modified-id="Data-Exploration-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Exploration</a></span></li><li><span><a href="#Modeling" data-toc-modified-id="Modeling-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Modeling</a></span><ul class="toc-item"><li><span><a href="#PyTorch-Custom-Dataset" data-toc-modified-id="PyTorch-Custom-Dataset-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>PyTorch Custom Dataset</a></span></li><li><span><a href="#GPT2" data-toc-modified-id="GPT2-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>GPT2</a></span></li><li><span><a href="#Model-Instantiation" data-toc-modified-id="Model-Instantiation-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Model Instantiation</a></span></li><li><span><a href="#Model-Tuning" data-toc-modified-id="Model-Tuning-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Model Tuning</a></span></li></ul></li><li><span><a href="#Model-Saving" data-toc-modified-id="Model-Saving-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Model Saving</a></span></li><li><span><a href="#Test-Review-Generation" data-toc-modified-id="Test-Review-Generation-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Test Review Generation</a></span></li><li><span><a href="#Further-Readings" data-toc-modified-id="Further-Readings-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Further 
Readings</a></span></li></ul></div>

# Business Case and Purpose

Open marketplaces with semi-anonymous review platforms have a significant problem: bot-generated reviews. The issue is exacerbated by the wide availability of powerful, free pre-trained models that put advanced NLP neural networks into everyone's hands. These models are double-edged: they are invaluable for research and advancement, but they also give black-hat users near-human context awareness and text generation.

We seek to counter these bad actors by using the very technology they would use. In this proof-of-concept machine learning product, we fine-tune `GPT2` on book reviews to generate convincing fake reviews, so that `BERT` can then be used to detect them.

This is an ancillary notebook to a main notebook in this repository. See the `Classifier_with_BERT` notebook for further details.

In [2]:
import os
import random
import time
from datetime import timedelta
import re

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

import torch
from torch import nn
from torch.nn import functional as F
from torch.nn import CrossEntropyLoss
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler, random_split
torch.manual_seed(42)

from transformers import pipelines
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import AdamW, get_linear_schedule_with_warmup

from pymongo import MongoClient

from tqdm.notebook import tqdm_notebook as tqdm

import nltk
nltk.download('punkt')

import warnings
%matplotlib inline
warnings.filterwarnings('ignore')

In [None]:
# Retrain and save model if true
save = False

# Data Access

Initially we attempted to load the 51 million book reviews provided by [Julian McAuley](http://cseweb.ucsd.edu/~jmcauley/datasets.html#google_local) into a single dataframe. Because dataframes are memory-hungry and do not stream data, we had to engineer a workaround. After further research we decided upon MongoDB.

### MongoDB

MongoDB is a document-based database system that provides great flexibility in its expandability and extensibility. It offers:

- JSON-like document storage, meaning the data structure can change from document to document.

- Easy mapping of documents to objects in application code.

- Ad hoc queries, indexing, and real-time aggregation.

- A distributed design at its core.

- Free use under the SSPLv1 license.

For our purposes, we were able to load a `20+gb` `JSON` file of 51 million reviews into a database. We were then able to aggregate and sample the data so that we could feed a selection of the reviews into our model for fine-tuning.

In the end we had a corpus of 50,000 reviews on which to train our `GPT2` model for review text generation. We chose not to push the number further due to a lack of computing resources. Had we been working with a distributed network architecture, we could easily have expanded the corpus size.
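
MongoDB's `$sample` stage handled the sampling for us. For readers without a database at hand, the same idea of drawing a fixed-size random sample from a file too large for memory can be sketched in plain Python with reservoir sampling. This is an illustrative alternative, not the method used in this notebook; the function name `reservoir_sample` and the one-JSON-document-per-line layout are assumptions made for the example.

```python
import io
import json
import random


def reservoir_sample(lines, k, seed=42):
    """Return k documents chosen uniformly at random from an iterable of JSON lines.

    Only k documents are held in memory at any time (Algorithm R).
    """
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        doc = json.loads(line)
        if i < k:
            # Fill the reservoir with the first k documents
            reservoir.append(doc)
        else:
            # Keep the new document with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = doc
    return reservoir


# Stand-in for a large file; in practice this would be an open file handle
fake_file = io.StringIO(
    "\n".join(json.dumps({"reviewText": f"review {n}"}) for n in range(1000))
)
sample = reservoir_sample(fake_file, k=5)
print(len(sample))
```

Because the file is read one line at a time, memory use depends on the sample size rather than on the size of the file.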

In [None]:
client = MongoClient() # Instantiate local PyMongo client

In [None]:
db = client['local'] # Access database

In [None]:
collection = db['reviewdata'] # Access collections

In [None]:
# Use the $sample aggregation stage to pull 50k random reviews
reviews = list(collection.aggregate([{ "$sample": {"size":50000}}]))

In [None]:
reviews

In [None]:
# Store the 50k reviews in a dataframe
reviews_data = pd.DataFrame(reviews)['reviewText']

In [None]:
# Drop any duplicates
review_data = reviews_data.drop_duplicates()

In [None]:
review_data # Sanity check

In [None]:
# Drop any null values added
review_data.isna().sum()
review_data.dropna(inplace=True)

In [None]:
# Maintain a clean copy of the dataframe
reviews = review_data.copy()
reviews 

In [None]:
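if save:
    # review_data is a Series, so convert it to a dataframe before adding a label column
    real_df = review_data.to_frame(name='reviewText')
    real_df['label'] = 0  # 0 marks a real review
    real_df.to_csv('Data/realreviews.csv')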

# Data Exploration

Our `GPT2` model keeps the default maximum sequence length of 1024 tokens, though to be safe we only considered reviews under 768 tokens in length. We therefore first look at the distribution of review lengths by splitting each review into word tokens with `nltk` and recording the counts in a list.

Here we also instantiate the `GPT2` tokenizer provided by the Hugging Face Transformers library, passing in custom tokens for the beginning of a sequence, the end of a sequence, and padding.

We also note that the average review length is 117 words, and the longest review is practically a novel at 6499 words!

In [None]:
# Tokenize all of the reviews in the dataframe and get their length
reviewlen = []
for review in tqdm(reviews):
    tokens = nltk.word_tokenize(review)
    reviewlen.append(len(tokens))
    
reviewlen = np.array(reviewlen)

In [None]:
# Plotting review length and average review length
train_av = round(np.average(reviewlen),3)

fig, ax = plt.subplots(figsize=[8,6])
plt.title('Test Histogram with Average Length')
plt.hist(pd.Series(reviewlen), bins=30)
ax.axvline(train_av, color ='red', lw = 2, alpha = 0.75) 
plt.legend(['Average Length: {}'.format(train_av), 'Test Histogram'])
plt.savefig('Images/testlengen.png')
plt.show()

In [None]:
# Getting number of reviews over 768 tokens
print(f"Percent above 768 tokens: {round(len(reviewlen[reviewlen > 768])/len(reviewlen)*100, 3)}%")

In [None]:
print('Average review length: {} words.'.format(round(np.average(reviewlen), 3)))

In [None]:
print('Max review length: {} words.'.format(np.max(reviewlen)))

In [None]:
# Tokenize for modeling using GPT2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2',
                                          bos_token='<|sot|>', eos_token='<|eot|>', pad_token='<|pad|>')

In [None]:
print("Beginning of sentence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.bos_token_id), tokenizer.bos_token_id))
print("End of sentence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.eos_token_id), tokenizer.eos_token_id))
print("Padding token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.pad_token_id), tokenizer.pad_token_id))

# Modeling

## PyTorch Custom Dataset

PyTorch provides an inheritable `Dataset` class for use with the rest of the framework. A dataset can be either map-style or iterable-style.

* **Map-Style** - Maps keys (usually integer indices) to data samples within the dataset.
* **Iterable-Style** - Represents an iterable over data samples, such as data streamed from a database or remote server, or even generated in real time.

Our dataset implements data retrieval with the `__getitem__` method and is therefore a map-style dataset. `__getitem__` returns the token sequence and attention mask for a given index so that they can be fed into the model.
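
The difference between the two styles comes down to which methods the class implements. The sketch below uses plain Python classes to stay self-contained; in PyTorch they would subclass `torch.utils.data.Dataset` and `torch.utils.data.IterableDataset` respectively. The class names and the example reviews are hypothetical.

```python
class MapStyleReviews:
    """Map-style: random access by index via __getitem__ and __len__."""

    def __init__(self, reviews):
        self.reviews = list(reviews)

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        return self.reviews[idx]


class IterableStyleReviews:
    """Iterable-style: samples are only available in order via __iter__."""

    def __init__(self, source):
        # source is a callable returning a fresh iterator,
        # standing in for something like a database cursor
        self.source = source

    def __iter__(self):
        return iter(self.source())


example_reviews = ["Great book.", "Too long.", "Loved the ending."]

map_ds = MapStyleReviews(example_reviews)
print(len(map_ds), map_ds[1])  # random access by index

iter_ds = IterableStyleReviews(lambda: (r for r in example_reviews))
print([r for r in iter_ds])    # sequential access only
```

A map-style dataset suits this notebook because the 50,000 tokenized reviews fit in memory and can be shuffled by index.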

In [None]:
class GPT_Finetune_Dataset(Dataset):
    
    """
    Takes an iterable of reviews and transforms them into a PyTorch GPT2 dataset
    Input:
        txt_list[iterable] - List of reviews to use as dataset
        Tokenizer[Instantiated Tokenizer] - Transformers library tokenizer to tokenize text
        GPT2_type[String][Optional] - Default: 'gpt2'
        Max_length[Int][Optional] - Max length of tokens put out by dataset
    Returns:
        Input sequences, attention mask
    """

    def __init__(self, txt_list, tokenizer, gpt2_type="gpt2", max_length=200):

        self.tokenizer = tokenizer
        self.input = []
        self.attn = []
        # Cycles through the iterable of txt_list 
        for txt in tqdm(txt_list):
            # Encodes text adding padding, start, and end tokens.
            encodings_dict = tokenizer('<|sot|>'+ txt +'<|eot|>',
                                     truncation=True, max_length=max_length, padding="max_length")

            self.input.append(torch.tensor(encodings_dict['input_ids']))
            self.attn.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input)

    def __getitem__(self, idx):
        return self.input[idx], self.attn[idx] 

In [None]:
data = GPT_Finetune_Dataset(reviews, tokenizer)

## GPT2
`GPT2` is a distinct model architecture that was developed by OpenAI and based upon their original `GPT` model. 

`GPT2` is based on the Transformer architecture. I have covered the basics of the Transformer architecture in my `Classifier_with_BERT` notebook, so here I will only cover what makes `GPT2` different from other NLP transfer learning architectures.

`GPT2` differs from something like `BERT` in that it uses only the decoder side of the Transformer's Encoder-Decoder architecture.

It also differs from `BERT` in how it hides tokens: it doesn't replace chosen tokens with `[MASK]`, but instead interferes with the self-attention calculation so that tokens to the right of the position currently being calculated are blocked. This is **Masked Self-Attention**, as opposed to `BERT`'s **Self-Attention**. (Self-Attention is covered in the aforementioned notebook; please refer to it for further information on the concept.)
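
To make the masking concrete, here is a minimal pure-Python sketch of the masking step alone. It is heavily simplified (a single head, raw scores supplied directly, no query/key/value projections), and the function name `masked_attention_weights` is hypothetical rather than anything from the `GPT2` code base.

```python
import math


def masked_attention_weights(scores):
    """Apply a causal mask and a row-wise softmax to a square score matrix.

    scores[i][j] is the raw attention score of position i attending to j.
    Positions j > i (to the right) are set to -inf before the softmax,
    so they receive zero attention weight.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        row_max = max(masked)  # finite, because position i itself is never masked
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Uniform raw scores for a 4-token sequence
scores = [[0.0] * 4 for _ in range(4)]
for row in masked_attention_weights(scores):
    print([round(w, 2) for w in row])
```

Each row spreads its weight only over the current and earlier positions, which is what lets the model be trained to predict the next token without seeing it.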

The decoder stack that makes up the `GPT2` architecture consists of decoder blocks, each containing a masked self-attention layer followed by a feed-forward neural network layer. These blocks are stacked to produce the `GPT2` architecture.

Before token sequences are passed to the decoder stack, they are first converted to token embeddings using the model's vocabulary and then combined with position embeddings. These embeddings are then passed up the decoder stack.

In [None]:
train_size = int(len(data) * .7)
test_size = len(data) - train_size

train_set, test_set = random_split(data, [train_size, test_size])

In [None]:
print('{} training samples'.format(train_size))
print('{} test samples'.format(test_size))

In [None]:
batch_size = 5
train_dataloader = DataLoader(
            train_set,  # The training set
            sampler = RandomSampler(train_set), # Random sampler
            batch_size = batch_size # Trains with this batch size for memory reasons
        )

test_dataloader = DataLoader(
            test_set, # The validation samples.
            sampler = SequentialSampler(test_set), # Pull out batches sequentially since order doesn't matter
            batch_size = batch_size 
        )

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

## Model Instantiation


In [None]:
# Get config
configuration = GPT2Config.from_pretrained('gpt2', output_hidden_states=False)

# Model instantiation
model = GPT2LMHeadModel.from_pretrained("gpt2", config=configuration)

# Necessary because of the custom tokens
model.resize_token_embeddings(len(tokenizer))

# Model to the GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

seed_val = 42

# Setting seeds
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
# Setting Parameters
epochs = 5
learning_rate = .00005
warmup_steps = 50

# this produces sample output every 100 steps
sample_every = 100

In [None]:
# AdamW is an Adam optimizer variant with decoupled weight decay
optimizer = AdamW(model.parameters(),
                  lr = learning_rate
                )

In [None]:
total_steps = len(train_dataloader) * epochs

# Adjusts the learning rate as the model steps through
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = warmup_steps, 
                                            num_training_steps = total_steps)
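
For intuition, `get_linear_schedule_with_warmup` scales the base learning rate by a factor that rises linearly from 0 to 1 over the warmup steps and then decays linearly to 0 at the final step. A minimal pure-Python sketch of that multiplier follows; `lr_multiplier` is a hypothetical helper name, and `total_steps=1000` is an arbitrary illustrative value rather than this notebook's real step count.

```python
def lr_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 0 up to 1
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 5e-5  # same value as learning_rate above
for step in (0, 25, 50, 525, 1000):
    print(step, base_lr * lr_multiplier(step, warmup_steps=50, total_steps=1000))
```

The short warmup keeps the first few updates small while the newly resized token embeddings are still untrained.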

## Model Tuning

In [None]:
if save:
    timestat = time.time()

    stats = []

    model = model.to(device)

    for epoch_i in range(0, epochs):

        # training loop
        print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))

        # Setup time
        timestat = time.time()

        # Initiate loss
        total_train_loss = 0

        # PyTorch training mode
        model.train()

        # Pull data from data loader
        for step, batch in enumerate(train_dataloader):

            # Move the input ids, labels (a copy of the inputs), and attention masks to the device
            b_input_ids = batch[0].to(device)
            b_labels = batch[0].to(device)
            b_masks = batch[1].to(device)

            # Zero gradients so they do not accumulate across batches
            model.zero_grad()        

            # Grab model outputs
            outputs = model(  b_input_ids,
                              labels=b_labels, 
                              attention_mask = b_masks,
                              token_type_ids=None
                            )
            # Pull loss from loss function
            loss = outputs[0]  

            # Detach loss from CUDA and add to total loss
            batch_loss = loss.item()
            total_train_loss += batch_loss

            # Get sample every x batches.
            if step % sample_every == 0 and not step == 0:
                # Use initial time to get time stats
                elapsed = time.time() - timestat
                print('  Batch {:>5,}  of  {:>5,}. Loss: {:>5,}.   Elapsed: {:}.'.format(step, len(train_dataloader), batch_loss, elapsed))

                # Model into PyTorch eval mode
                model.eval()

                # Generate a sample
                sample_outputs = model.generate(
                                        bos_token_id=random.randint(1,30000),
                                        do_sample=True,   
                                        top_k=50, 
                                        max_length = 200,
                                        top_p=0.95, 
                                        num_return_sequences=1
                                    )
                for i, sample_output in enumerate(sample_outputs): # print a sample
                      print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

                # Convert model back to training mode
                model.train()

            # Backprop
            loss.backward()

            # Optimizer step
            optimizer.step()

            # Scheduler step
            scheduler.step()

        # Calculate the average loss over all of the batches.
        avg_train_loss = total_train_loss / len(train_dataloader)       

        # Measure how long this epoch took.
        training_time = time.time() - timestat

        print("")
        print("  Average training loss: {0:.2f}".format(avg_train_loss))
        print("  Training epoch took: {:}".format(training_time))

        # Testing loop
        print("Running Validation...")

        t0 = time.time()

        model.eval()

        total_eval_loss = 0
        nb_eval_steps = 0

        # Evaluate data for one epoch
        for batch in test_dataloader:

            b_input_ids = batch[0].to(device)
            b_labels = batch[0].to(device)
            b_masks = batch[1].to(device)

            with torch.no_grad():        

                outputs  = model(b_input_ids, 
                                 attention_mask = b_masks,
                                labels=b_labels)

                loss = outputs[0]  

            batch_loss = loss.item()
            total_eval_loss += batch_loss        

        avg_val_loss = total_eval_loss / len(test_dataloader)

        validation_time = time.time() - t0

        print("  Validation Loss: {0:.2f}".format(avg_val_loss))
        print("  Validation took: {:}".format(validation_time))

        # Record all statistics from this epoch.
        stats.append(
            {
                'epoch': epoch_i + 1,
                'Training Loss': avg_train_loss,
                'Valid. Loss': avg_val_loss,
                'Training Time': training_time,
                'Validation Time': validation_time
            }
        )
    torch.cuda.empty_cache()

In [None]:
if save:
    # Convert stats dict into a dataframe  
    training_df = pd.DataFrame(data=stats)
    training_df = training_df.set_index('epoch')
    training_df

In [None]:
if save:
    # Seaborn plot for training and Valid loss
    sns.set(font_scale=1.5)
    plt.rcParams["figure.figsize"] = (12,6)
    plt.plot(training_df['Training Loss'], 'r-o', label="Training")
    plt.plot(training_df['Valid. Loss'], 'b-o', label="Validation")
    plt.title("Training & Validation Loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.legend()
    plt.xticks(range(1, epochs + 1))
    plt.show()

# Model Saving

Model saving is an integral part of model deployment and the fine-tuning process. Because we used the pre-trained model made available by the Hugging Face Transformers library, we are able to work with its `save_pretrained()` and `from_pretrained()` methods, but we could just as easily use PyTorch's native save and load functions.

In [None]:
if save:
    # Save params
    params = list(model.named_parameters())

In [None]:
if save:
    # Check params of model to save
    for param in params:
        print(param[0])

In [None]:
if save:
    # Create directory if not available 
    output_dir = './model_save/'

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

In [None]:
if save:
    print("Saving model to {}".format(output_dir))

In [None]:
if save:
    model_to_save = model.module if hasattr(model, 'module') else model
    model_to_save.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

In [None]:
if save is False:
    # Initiate model from fine-tuned model
    gpt_config = GPT2Config.from_pretrained('model_save/', output_hidden_states=False)
    model = GPT2LMHeadModel.from_pretrained('model_save/', config=gpt_config)

    tokenizer = GPT2Tokenizer.from_pretrained('model_save/',
                                            vocab_file='model_save/vocab.json', merges_file='model_save/merges.txt',
                                            bos_token='<|sot|>', eos_token='<|eot|>', pad_token='<|pad|>')
    model.to(device)

# Test Review Generation

Text generation is one of the main purposes of the `GPT2` model. We take a random sample of the reviews on which we fine-tuned the model. Each sampled review is reduced to a start-of-sentence token followed by its first four words. This sequence is the **prompt**, which the model requires as a starting point: it takes the prompt and generates a continuation based on it. The four-word prompt (five items, if the start-of-sentence token is counted) is then tokenized and fed into the model so that it can generate a review from it.
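
As a small illustration of how such a prompt is built, the sketch below wraps the logic in a helper. The name `build_prompt` and the example review are made up for this illustration; the start token matches the one passed to the tokenizer earlier.

```python
def build_prompt(review, n_words=4, start_token="<|sot|>"):
    """Return the start token followed by the first n_words words of a review."""
    return start_token + " " + " ".join(review.split()[:n_words])


# Made-up review text, used only to show the shape of a prompt
example = "This book was a wonderful surprise from start to finish."
print(build_prompt(example))
```

Reviews shorter than four words simply yield a shorter prompt, since slicing past the end of a list is safe in Python.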

In [None]:
test2 = random.sample(list(reviews.index), 100)  # random.sample needs a sequence, not a set

In [None]:
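prompts = []
for x in test2:
    try:
        # Start token plus the first four words of the review
        prompts.append("<|sot|> " + ' '.join(reviews.loc[x].split()[:4]))
    except (AttributeError, KeyError):
        # Skip entries that are missing or are not strings
        pass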

In [None]:
prompts

In [None]:
model.eval()
# Encode prompts and save to list
gen_prompt = []
for prompt in prompts:
    gen_prompt.append(torch.tensor(tokenizer.encode(prompt)).unsqueeze(0).to(device))

In [None]:
reviewlist = [] # Empty list to store reviews
t = time.time() # Time save  

# Loop to take prompt (initial 4 words) from list of real reviews
for x, promp in enumerate(gen_prompt):
    t1 = time.time() # loop time
    print('---------------------------------')
    print('''{}: The prompt is "{}"'''.format(x+1, prompts[x])) # Pretty output
    print('---------------------------------')
    
    # Review generation block
    sample_outputs = model.generate(
                                    promp, # Prompt
                                    bos_token_id= random.randint(1, 100000), # Token randomization
                                    do_sample=True,   
                                    top_k=30, # Sample only from the 30 most likely tokens
                                    min_length=20, # Min token length
                                    max_length = 250, # max token length
                                    top_p=0.95, 
                                    num_return_sequences=20 # number to return
                                    )
    
    # Output reviews and save them to list
    for i, sample_output in enumerate(sample_outputs):
        decoded = tokenizer.decode(sample_output, skip_special_tokens=True)
        print("{}: {}\n\n".format(i, decoded))
        reviewlist.append(decoded)
    time_per_gen = time.time() - t1
    print('This generation took {} seconds'.format(round(time_per_gen, 3)))
totes = time.time() - t
print('Total time was {}'.format(str(timedelta(seconds = totes))))
torch.cuda.empty_cache()
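
The `top_k` and `top_p` arguments above restrict which tokens may be sampled at each step. The sketch below shows the general idea on a toy distribution in plain Python. It is a simplified illustration that works on probabilities rather than logits, not the Transformers library's implementation; `filter_top_k_top_p` is a hypothetical name and the probabilities are made up.

```python
import random


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of those
    whose cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (made-up values)
probs = {"great": 0.5, "good": 0.3, "fine": 0.15, "zebra": 0.05}
filtered = filter_top_k_top_p(probs, top_k=3, top_p=0.7)
print(filtered)

# Sample the next token from the filtered distribution
rng = random.Random(42)
print(rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0])
```

Cutting off the unlikely tail keeps generated reviews coherent, while sampling rather than always taking the most likely token keeps them varied.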

In [None]:
reviewlist # Sanity check

In [None]:
remove = False
# Default to the raw generated reviews so that later cells always have `testlist`
testlist = list(reviewlist)
if remove:
    # Replace newlines with spaces and strip apostrophes
    testlist = []
    for review in reviewlist:
        cleaned = re.sub('\n', ' ', review)
        testlist.append(re.sub("'", "", cleaned))

In [None]:
print(f"Real review prompt:{reviews[test2[0]]}")
print('')
print(f"Fake review from prompt:{reviewlist[6]}")

In [None]:
# Save to Dataframe from list
reviewdf = pd.DataFrame(testlist, columns=['reviews'])

In [None]:
# Resample and shuffle
reviewdf = reviewdf.sample(frac=1)

In [None]:
# Broadcast label
reviewdf['label'] = 1

In [None]:
# Save to CSV
if save:
    reviewdf.to_csv('Data/reviewseries.csv')

# Further Readings

Further GPT2 explanations:

http://jalammar.github.io/illustrated-gpt2/

http://humanssingularity.com/gpt2sampling/

