# Introduction

In this project, we use the Amazon review text data (link: https://nijianmo.github.io/amazon/index.html), in particular review texts about electronics products. We fine-tune language models to generate texts, and we train a text classification model on the same text data in order to check whether our language models can generate texts that are similar to the real data they were trained on.

Our project will consist of the following steps.

- Step 1: For each rating score (i.e. 1, 2, 3, 4, 5), fine-tune a pre-trained GPT2 language model (provided by Hugging Face's "transformers" library). 

- Step 2: Once the five separate language models are fine-tuned, we will be able to generate new texts from each of them. That is: we create some texts via the language model for rating score 1, similarly for score 2, 3, 4, and 5.

- Step 3: Train a text classification model using the same, or part of the, text data that we used for step 1. The model will classify a text into one of the five rating scores.

- Step 4: Let the text classification model predict on the generated texts from step 2. In other words, the generated texts are used as the test data for our classifier, where the true class for the texts generated by the rating 1 language model is 1, the true class for the texts generated by the rating 2 language model is 2, etc.

Ultimately, we want to see whether the classifier's performance on the texts generated by the GPT2 language models (which are fake data) is reasonably similar to its performance on the evaluation data (which are real data). If so, it would be reasonable to say that the language models are indeed capable of generating texts that are similar to the real texts they were trained on.

Note: Training the text classification model well so that its performance is high is not necessarily a key point here. We're more interested in whether the classifier's predictive performance is similar with the real data and the fake data.
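The comparison described above can be sketched in a few lines. This is a minimal illustration rather than the notebook's evaluation code: `accuracy` and `accuracy_gap` are hypothetical helper names, and the label lists are made-up toy values.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true rating."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('label lists must have the same length')
    if not true_labels:
        raise ValueError('label lists must not be empty')
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def accuracy_gap(real_true, real_pred, generated_true, generated_pred):
    """Accuracy on real evaluation texts minus accuracy on generated texts."""
    return accuracy(real_true, real_pred) - accuracy(generated_true, generated_pred)


# Toy labels, invented for illustration only.
real_true, real_pred = [1, 2, 3, 4, 5], [1, 2, 3, 5, 5]
gen_true, gen_pred = [1, 2, 3, 4, 5], [1, 3, 3, 5, 5]

print('accuracy on real data:', accuracy(real_true, real_pred))
print('accuracy on generated data:', accuracy(gen_true, gen_pred))
print('gap:', accuracy_gap(real_true, real_pred, gen_true, gen_pred))
```

A gap close to zero would support the claim that the generated texts resemble the real ones, whatever the absolute accuracy turns out to be.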

For working with GPT2 language models, we will use the 'transformers' library provided by Hugging Face (repo link: https://github.com/huggingface/transformers). 

The link to the GPT2 paper is: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

In [None]:
!pip install transformers

In [None]:
!pip install datasets

In [None]:
import numpy as np
import pickle
import pandas as pd
import os
import random
from collections import Counter
import time
import datetime

import torch
from torch.utils.data import TensorDataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import AdamW, get_linear_schedule_with_warmup
from keras.preprocessing.sequence import pad_sequences

# Preprocessing the raw texts, then saving them as train txt and eval txt files

This step was done already (outside this jupyter notebook), so we just provide the code here.

In [None]:
import pandas as pd 
import random
import os


def str_with_tokens(texts_list, start_token='<start_token>', end_token='<end_token>'):
    """
    INPUTS
    - 'texts_list': a list of reviews
    - 'start_token': a string that we will put at the front of every review
    - 'end_token': a string that we will put at the end of every review

    OUTPUT
    - a string with all the reviews in 'texts_list' concatenated, together with the start and end tokens
    """
    texts_list = [txt.replace('\n', ' ') for txt in texts_list] # remove new line characters
    sep = ' {}\n{} '.format(end_token, start_token)
    return start_token + ' ' + sep.join(texts_list) + ' ' + end_token


def save_train_and_eval_txt_and_pickle(texts_list, rating, train_prop=0.85):
    """
    INPUTS
    - 'texts_list': a list of reviews
    - 'rating': which rating score (1, 2, 3, 4, or 5)
    - 'train_prop': proportion of training text

    ACTION
    - randomly split the 'texts_list' reviews via the provided 'train_prop', add special tokens, then save training txt/pickle file and eval txt/pickle file
    """
    texts_list_shuffled = random.sample(texts_list, len(texts_list))
    num_train = int(len(texts_list_shuffled) * train_prop)
    
    # 1. save as a pickle file (without special tokens)
    train_filename = 'train_rating{}.pkl'.format(rating)
    eval_filename = 'eval_rating{}.pkl'.format(rating)
    train_df = pd.DataFrame(data={'review_text': texts_list_shuffled[0:num_train], 'rating': [rating] * num_train})
    eval_df = pd.DataFrame(data={'review_text': texts_list_shuffled[num_train:], 'rating': [rating] * (len(texts_list) - num_train)})
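    # os.path.join keeps the paths portable (the hard-coded '\\' separator only works on Windows)
    train_df.to_pickle(os.path.join(os.getcwd(), train_filename))
    eval_df.to_pickle(os.path.join(os.getcwd(), eval_filename))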
    print("Done with creating '{}' and '{}' files".format(train_filename, eval_filename))
    
    # 2. save as a txt file (with special tokens)
    train_filename = 'train_rating{}.txt'.format(rating)
    eval_filename = 'eval_rating{}.txt'.format(rating)
    with open(os.path.join(os.getcwd(), train_filename), 'w') as txtfile:
        txtfile.write(str_with_tokens(texts_list_shuffled[0:num_train]))
    with open(os.path.join(os.getcwd(), eval_filename), 'w') as txtfile:
        txtfile.write(str_with_tokens(texts_list_shuffled[num_train:]))
    print("Done with creating '{}' and '{}' files".format(train_filename, eval_filename))


if __name__ == "__main__":
    # 'reviews_df.pickle' is a pickle file that was created from the original 'Electronics_5.json' file provided by the dataset website.
    # It was created via the following python code:
    ## import pandas as pd
    ## import pickle
    ## reviews_df = pd.read_json('Electronics_5.json', lines=True)
    ## reviews_df = reviews_df[['reviewText','overall']]
    ## indices_to_keep = [i for i, txt in enumerate(reviews_df['reviewText']) if isinstance(txt, str)]
    ## reviews_df = reviews_df.iloc[indices_to_keep, :]
    ## reviews_df.reset_index(drop=True, inplace=True)
    ## reviews_df.to_pickle('reviews_df.pickle')
    reviews_df = pd.read_pickle(os.path.join(os.getcwd(), 'reviews_df.pickle'))
    each_rating_all_texts_df = reviews_df.groupby('overall')['reviewText'].apply(list).reset_index(name='all_texts')
    # To let the five language models have equal amount of training data when being fine-tuned,
    # we compute the minimum count among the five ratings. This is 306659 (rating 2).
    min_count = min(len(texts) for texts in each_rating_all_texts_df['all_texts'])
    # For each rating, randomly select 306659 texts that will be used for its language model,
    # and save them as train and eval.
    for rating in [1, 2, 3, 4, 5]:
        texts_list = each_rating_all_texts_df['all_texts'][rating-1]
        texts_list = random.sample(texts_list, min_count)
        save_train_and_eval_txt_and_pickle(texts_list=texts_list, rating=rating)
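To make the txt file format concrete, here is a small example of what `str_with_tokens` produces. The function body is copied from the preprocessing code above; the two example reviews are invented for illustration.

```python
def str_with_tokens(texts_list, start_token='<start_token>', end_token='<end_token>'):
    # Replace new line characters so that each review occupies exactly one line.
    texts_list = [txt.replace('\n', ' ') for txt in texts_list]
    sep = ' {}\n{} '.format(end_token, start_token)
    return start_token + ' ' + sep.join(texts_list) + ' ' + end_token


# Invented example reviews; the second one contains a new line character.
example_reviews = ['Never worked.', 'Works fine,\nbut the cable is short.']
formatted = str_with_tokens(example_reviews)
print(formatted)
# <start_token> Never worked. <end_token>
# <start_token> Works fine, but the cable is short. <end_token>
```

Each review ends up on its own line, wrapped in the start and end tokens, which is the layout the language model sees during fine-tuning.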

# For each rating score (i.e. 1, 2, 3, 4, and 5), fine-tune a GPT2 language model

For fine-tuning the GPT2 language models, we use the *run_clm.py* script provided by Hugging Face (original link: https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_clm.py)

As shown in the preprocessing code above, we used our own special tokens, so we need to make a small modification to the tokenizer's special tokens as well. In *run_clm.py*, inside the 'main()' function, just before ***model.resize_token_embeddings(len(tokenizer))***, we added the following line of code: ***tokenizer.add_special_tokens({'bos_token':'<start_token>', 'eos_token':'<end_token>', 'pad_token':'<pad_token>'})***. We've uploaded the modified *run_clm.py* file in the github repository.
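A minimal sketch of that modification, written as a standalone helper so the two calls can be seen together. `register_special_tokens` is a hypothetical name that does not appear in *run_clm.py*, and the token strings are assumed to match the `<start_token>` / `<end_token>` defaults used in the preprocessing code.

```python
# Assumed to match the tokens written into the train/eval txt files.
SPECIAL_TOKENS = {
    'bos_token': '<start_token>',
    'eos_token': '<end_token>',
    'pad_token': '<pad_token>',
}


def register_special_tokens(tokenizer, model, special_tokens=SPECIAL_TOKENS):
    """Add the special tokens to the tokenizer, then resize the model's embeddings.

    Returns the number of tokens that were newly added to the vocabulary.
    """
    num_added = tokenizer.add_special_tokens(special_tokens)
    # The embedding matrix needs one row per vocabulary entry, so it is resized
    # after the vocabulary has grown.
    model.resize_token_embeddings(len(tokenizer))
    return num_added


# Usage with the transformers library (downloads the pre-trained weights):
# from transformers import GPT2LMHeadModel, GPT2Tokenizer
# tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# model = GPT2LMHeadModel.from_pretrained('gpt2')
# register_special_tokens(tokenizer, model)
```

The order matters: resizing before adding the tokens would leave the new token ids without embedding rows.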

Fine-tuning a GPT2 language model for one epoch for rating score 1 can be done via the following command (and similarly for ratings 2, 3, 4, and 5).

In [None]:
%%time

!python /content/drive/My\ Drive/Amazon\ Review\ Text\ Language\ Modeling/run_clm.py \
--output_dir=/content/drive/My\ Drive/Amazon\ Review\ Text\ Language\ Modeling/trained_gpt2_rating1 \
--model_name_or_path=gpt2 \
--do_train \
--train_file=/content/drive/My\ Drive/Amazon\ Review\ Text\ Language\ Modeling/train_rating1.txt \
--do_eval \
--validation_file=/content/drive/My\ Drive/Amazon\ Review\ Text\ Language\ Modeling/eval_rating1.txt \
--per_device_train_batch_size=2 \
--num_train_epochs=1 \
--logging_steps=500 \
--save_steps=9999999
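Since the command differs between ratings only in the file and directory names, it can also be generated programmatically. The sketch below is one way to do that, not the way the notebook ran the commands: `build_finetune_command` is a hypothetical helper, and `shlex.quote` is used in place of the backslash-escaped spaces.

```python
import shlex

BASE_DIR = '/content/drive/My Drive/Amazon Review Text Language Modeling'


def build_finetune_command(rating, base_dir=BASE_DIR):
    """Return the shell command that fine-tunes GPT2 on the texts of one rating."""
    args = [
        'python', '{}/run_clm.py'.format(base_dir),
        '--output_dir={}/trained_gpt2_rating{}'.format(base_dir, rating),
        '--model_name_or_path=gpt2',
        '--do_train',
        '--train_file={}/train_rating{}.txt'.format(base_dir, rating),
        '--do_eval',
        '--validation_file={}/eval_rating{}.txt'.format(base_dir, rating),
        '--per_device_train_batch_size=2',
        '--num_train_epochs=1',
        '--logging_steps=500',
        '--save_steps=9999999',
    ]
    # Quote every argument so that the spaces in the Drive path survive the shell.
    return ' '.join(shlex.quote(arg) for arg in args)


for rating in [1, 2, 3, 4, 5]:
    print(build_finetune_command(rating))
```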

To continue fine-tuning from the result after one epoch, we can use the following command.

In [None]:
%%time

!python /content/drive/My\ Drive/Amazon\ Review\ Text\ Language\ Modeling/run_clm.py \
--output_dir=/content/drive/My\ Drive/Amazon\ Review\ Text\ Language\ Modeling/trained_gpt2_rating1_2epochs \
--model_name_or_path=/content/drive/My\ Drive/Amazon\ Review\ Text\ Language\ Modeling/trained_gpt2_rating1 \
--do_train \
--train_file=/content/drive/My\ Drive/Amazon\ Review\ Text\ Language\ Modeling/train_rating1.txt \
--do_eval \
--validation_file=/content/drive/My\ Drive/Amazon\ Review\ Text\ Language\ Modeling/eval_rating1.txt \
--per_device_train_batch_size=2 \
--num_train_epochs=1 \
--logging_steps=500 \
--save_steps=9999999 \
--overwrite_output_dir

Here we discuss the process for generating texts using the trained GPT2 language model. We use the *run_generation.py* script also provided by Hugging Face (original link: https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py)

We make small modifications to the script for purposes including adding special tokens and saving the generated texts as a pickle file. We've uploaded the modified *run_generation.py* file in the github repository.

Moreover, here are some helper functions for dealing with our generated texts.

In [None]:
# A helper function that removes the start token at the beginning.
def remove_start_token(string, start_token='<start_token>'):
    return string[len(start_token):]

# A helper function that checks whether a string contains only non-alphanumeric characters.
def has_no_alphanumeric_chars(string):
    return not any(char.isalnum() for char in string)

# A helper function that removes spaces at the very beginning (if any) and at the very end (if any) of the input string.
# str.strip(' ') also handles empty or all-space strings, which would raise an IndexError in a character-by-character loop.
def remove_empty_spaces_at_both_ends(string):
    return string.strip(' ')
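As a quick illustration of how these cleaning steps combine, here is a self-contained sketch. `clean_generated_sequences` is a hypothetical helper and the sample sequences are invented. Unlike `remove_start_token` above, it checks that the start token is actually present before slicing it off.

```python
def clean_generated_sequences(sequences, start_token='<start_token>'):
    """Strip the start token and surrounding spaces; drop sequences with no real content."""
    cleaned = []
    for seq in sequences:
        # Remove the prompt only if it is really at the front of the sequence.
        body = seq[len(start_token):] if seq.startswith(start_token) else seq
        # Discard sequences without any letters or digits (empty or punctuation only).
        if not any(char.isalnum() for char in body):
            continue
        cleaned.append(body.strip(' '))
    return cleaned


# Invented sample sequences.
samples = ['<start_token> Never worked. ', '<start_token>', '<start_token> !!! ']
print(clean_generated_sequences(samples))
# ['Never worked.']
```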

Generating texts from our trained GPT2 language model for rating score 1 and saving the pandas dataframe containing these texts as a pickle file can be done via the following code/command (and similarly for ratings 2, 3, 4, and 5).

In [None]:
%%time

list_of_all_created_texts = []
num_new_run_generations = 0
for i in range(0, 300):
    print('------------------------------ STARTING i = {} ------------------------------'.format(i))
    !python /content/drive/My\ Drive/Amazon\ Review\ Text\ Language\ Modeling/run_generation.py \
    --model_type=gpt2 \
    --model_name_or_path=/content/drive/My\ Drive/Amazon\ Review\ Text\ Language\ Modeling/trained_gpt2_rating1_2epochs \
    --length=512 \
    --prompt='<start_token>' \
    --stop_token='<end_token>' \
    --k=50 \
    --num_return_sequences=100 \
    --pickle_output_dir='/content/drive/My Drive/Amazon Review Text Language Modeling/'

    # Use the last-modification time (getmtime) to detect whether the pickle file was rewritten by this run.
    if i == 0:
        prev_file_modified_time = os.path.getmtime('/content/drive/My Drive/Amazon Review Text Language Modeling/created_texts.pkl')
        num_new_run_generations += 1
    else:
        curr_file_modified_time = os.path.getmtime('/content/drive/My Drive/Amazon Review Text Language Modeling/created_texts.pkl')
        if prev_file_modified_time != curr_file_modified_time:
            num_new_run_generations += 1
        prev_file_modified_time = curr_file_modified_time
    
    # Load the list of created texts, remove special tokens and empty spaces at both ends, and add them to our list of all created texts.
    # If a generated sequence has nothing other than the start token, just discard it.
    with open('/content/drive/My Drive/Amazon Review Text Language Modeling/created_texts.pkl', 'rb') as f:
        created_texts = pickle.load(f)
    list_of_all_created_texts.extend([remove_empty_spaces_at_both_ends(remove_start_token(seq)) for seq in created_texts if not has_no_alphanumeric_chars(remove_start_token(seq))])

# Make a 2-column Pandas dataframe, where one column contains the texts, and the other column indicates the rating score.
# Save the dataframe as a pickle file. 
all_created_texts_df = pd.DataFrame(data={'generated_text': list_of_all_created_texts, 'rating': [1] * len(list_of_all_created_texts)})
all_created_texts_df.to_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/all_created_texts_rating1_df.pkl')

print('Out of total 300, num_new_run_generations = {}'.format(num_new_run_generations))
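The loop above counts a run as new only when `created_texts.pkl` changed since the previous iteration. The sketch below isolates that check using a temporary file. `modification_stamp` and `count_updates` are hypothetical helper names, and the timestamps are set artificially with `os.utime` so that the example is deterministic.

```python
import os
import tempfile


def modification_stamp(path):
    """Return the file's last-modification time in nanoseconds."""
    return os.stat(path).st_mtime_ns


def count_updates(stamps):
    """Count stamps that differ from the one before them; the first stamp always counts."""
    if not stamps:
        return 0
    return 1 + sum(prev != curr for prev, curr in zip(stamps, stamps[1:]))


# Simulate three iterations in which the file is rewritten only on the third one.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name
stamps = []
for seconds in [1600000000, 1600000000, 1600000001]:
    os.utime(path, ns=(seconds * 10**9, seconds * 10**9))  # set access and modification times
    stamps.append(modification_stamp(path))
os.remove(path)

print('number of iterations with a fresh file:', count_updates(stamps))
```

If the count ends up lower than the number of iterations, some iterations reloaded a stale file, so their texts were added to the list twice.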

# Train a BERT classification model and check its performance on the generated texts

For each rating score, we randomly sample 65000 texts from the 260660 training datapoints that the GPT2 for the corresponding rating was trained on, and use them as the training data for our text classification model (so around 25% of the training data for GPT2). For evaluation data, we will use the same data that was used as the evaluation data for GPT2. Finally, for test data, we will use the generated texts from the GPT2.
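The sizes quoted above follow directly from the preprocessing step. This short check reproduces them from the per-rating count of 306659 and the `train_prop` of 0.85 used earlier. The variable names are chosen for illustration.

```python
min_count = 306659   # texts kept per rating in the preprocessing step
train_prop = 0.85    # proportion of those texts used for GPT2 training

num_train = int(min_count * train_prop)  # GPT2 training texts per rating
num_eval = min_count - num_train         # GPT2 evaluation texts per rating
classifier_train_per_rating = 65000

print(num_train, num_eval)
# 260660 45999
print(round(100 * classifier_train_per_rating / num_train, 1))
# 24.9
print(5 * classifier_train_per_rating, 5 * num_eval)
# 325000 229995
```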

For our classifier, we will fine-tune the BertForSequenceClassification model in the transformers library, which is essentially BERT + one fully connected layer.

The link to the BERT paper is: https://arxiv.org/abs/1810.04805

Let's load the training data review texts for each rating and then concat them into a single dataframe.

In [None]:
train_rating1_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/train_rating1.pkl')
train_rating1_df = train_rating1_df.sample(65000)
train_rating1_df.reset_index(drop=True, inplace=True)
train_rating2_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/train_rating2.pkl')
train_rating2_df = train_rating2_df.sample(65000)
train_rating2_df.reset_index(drop=True, inplace=True)
train_rating3_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/train_rating3.pkl')
train_rating3_df = train_rating3_df.sample(65000)
train_rating3_df.reset_index(drop=True, inplace=True)
train_rating4_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/train_rating4.pkl')
train_rating4_df = train_rating4_df.sample(65000)
train_rating4_df.reset_index(drop=True, inplace=True)
train_rating5_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/train_rating5.pkl')
train_rating5_df = train_rating5_df.sample(65000)
train_rating5_df.reset_index(drop=True, inplace=True)

train_df =  pd.concat([train_rating1_df, train_rating2_df, train_rating3_df, train_rating4_df, train_rating5_df], ignore_index=True, sort=False)
print('train_df number of rows, number of columns: {}'.format(train_df.shape))
print('number of training datapoints for each rating: {}'.format(Counter(train_df['rating'])))

train_df number of rows, number of columns: (325000, 2)
number of training datapoints for each rating: Counter({1: 65000, 2: 65000, 3: 65000, 4: 65000, 5: 65000})
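The cells that follow print 20 random texts per rating with nearly identical code. A helper along these lines could replace them. This is a sketch: `banner_lines` and `show_sample_texts` are hypothetical names, and the texts and ratings are passed as plain sequences so that the example has no dependencies.

```python
import random


def banner_lines(rating, split_name, pad=18):
    """Build the five-line banner, sizing the dashed rules to the title line."""
    title = '{} Here are some of the review texts for rating {} in the {} data {}'.format(
        '-' * pad, rating, split_name, '-' * pad)
    rule = '-' * len(title)
    return [rule, rule, title, rule, rule]


def show_sample_texts(texts, ratings, rating, split_name, num_samples=20):
    """Print up to 'num_samples' random texts with the given rating and return them."""
    matching = [txt for txt, r in zip(texts, ratings) if r == rating]
    sampled = random.sample(matching, min(num_samples, len(matching)))
    for line in banner_lines(rating, split_name):
        print(line)
    print('')
    for txt in sampled:
        print('-' * 80)
        print(txt)
    return sampled


# Invented toy data; with the dataframe one would pass train_df['review_text'] and train_df['rating'].
toy_texts = ['Never worked.', 'Great remote.', 'Broke after a week.']
toy_ratings = [1, 5, 1]
shown = show_sample_texts(toy_texts, toy_ratings, rating=1, split_name='training')
```

Filtering once with `zip` also avoids indexing the dataframe row by row, which is slow on several hundred thousand rows.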


In [None]:
# Let's see some of the review texts for rating 1 in the training data.
rating_texts = [train_df['review_text'][i] for i in range(0, len(train_df)) if train_df['rating'][i]==1]
rating_texts = random.sample(rating_texts, 20)
print('-' * 105)
print('-' * 105)
print('------------------ Here are some of the review texts for rating 1 in the training data ------------------')
print('-' * 105)
print('-' * 105)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

---------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts for rating 1 in the training data ------------------
---------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Never worked,
--------------------------------------------------------------------------------
These clothes are great at what it is designed to do; cleaning lenses and glasses.  Unfortunately, my wife hates them!!  These clothes must be washed separately because the color bleaches easily.  It takes over 4 hours to wash them in our Samsung self-balanced washing machine.  Because these 

In [None]:
# Let's see some of the review texts for rating 2 in the training data.
rating_texts = [train_df['review_text'][i] for i in range(0, len(train_df)) if train_df['rating'][i]==2]
rating_texts = random.sample(rating_texts, 20)
print('-' * 105)
print('-' * 105)
print('------------------ Here are some of the review texts for rating 2 in the training data ------------------')
print('-' * 105)
print('-' * 105)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

---------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts for rating 2 in the training data ------------------
---------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Doesn't work for my intended use, not sure why.
--------------------------------------------------------------------------------
It was impossible to get a clear image with this. The attachments were useless. I tried to look in my kids ear and all I saw was blur. It does not focus enough. It is more of a handheld microscope. unless the lens is super close to what you are trying to see 

In [None]:
# Let's see some of the review texts for rating 3 in the training data.
rating_texts = [train_df['review_text'][i] for i in range(0, len(train_df)) if train_df['rating'][i]==3]
rating_texts = random.sample(rating_texts, 20)
print('-' * 105)
print('-' * 105)
print('------------------ Here are some of the review texts for rating 3 in the training data ------------------')
print('-' * 105)
print('-' * 105)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

---------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts for rating 3 in the training data ------------------
---------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Arrived damaged....looking at returning the belt.
--------------------------------------------------------------------------------
I broke in these earphones per the manufacturer's recommendation.  However the earphones lack a good response in the highs and midrange.  But at this price you can't expect this much. It's just that the reviews and the manufacturer's claims raised my expect

In [None]:
# Let's see some of the review texts for rating 4 in the training data.
rating_texts = [train_df['review_text'][i] for i in range(0, len(train_df)) if train_df['rating'][i]==4]
rating_texts = random.sample(rating_texts, 20)
print('-' * 105)
print('-' * 105)
print('------------------ Here are some of the review texts for rating 4 in the training data ------------------')
print('-' * 105)
print('-' * 105)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

---------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts for rating 4 in the training data ------------------
---------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Much shorter charge time with this device over what it takes to charge with my computer. It is worth it.
--------------------------------------------------------------------------------
So far I've used 36 of these blank DVDs to burn video segments onto the and only one of them has been faulty. The other 35 burned without a hitch. The dics also printed cleanly and quickly on a HP 5160 

In [None]:
# Let's see some of the review texts for rating 5 in the training data.
rating_texts = [train_df['review_text'][i] for i in range(0, len(train_df)) if train_df['rating'][i]==5]
rating_texts = random.sample(rating_texts, 20)
print('-' * 105)
print('-' * 105)
print('------------------ Here are some of the review texts for rating 5 in the training data ------------------')
print('-' * 105)
print('-' * 105)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

---------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts for rating 5 in the training data ------------------
---------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
The Logitech Harmony 890 remote is simply awesome.  It does everything I wanted it to do, which was to replace six other remotes with one.  I recommend you read other reviews because some customers had problems programming the remote.  I had a few issues that I bumped into using the Logitech website, but nothing serious.  Logitech is continually updating their website to make it easier

We do the same thing for the evaluation data.

In [None]:
eval_rating1_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/eval_rating1.pkl')
eval_rating2_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/eval_rating2.pkl')
eval_rating3_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/eval_rating3.pkl')
eval_rating4_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/eval_rating4.pkl')
eval_rating5_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/eval_rating5.pkl')

eval_df =  pd.concat([eval_rating1_df, eval_rating2_df, eval_rating3_df, eval_rating4_df, eval_rating5_df], ignore_index=True, sort=False)
print('eval_df number of rows, number of columns: {}'.format(eval_df.shape))
print('number of evaluation datapoints for each rating: {}'.format(Counter(eval_df['rating'])))

eval_df number of rows, number of columns: (229995, 2)
number of evaluation datapoints for each rating: Counter({1: 45999, 2: 45999, 3: 45999, 4: 45999, 5: 45999})


In [None]:
# Let's see some of the review texts for rating 1 in the evaluation data.
rating_texts = [eval_df['review_text'][i] for i in range(0, len(eval_df)) if eval_df['rating'][i]==1]
rating_texts = random.sample(rating_texts, 20)
print('-' * 107)
print('-' * 107)
print('------------------ Here are some of the review texts for rating 1 in the evaluation data ------------------')
print('-' * 107)
print('-' * 107)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts for rating 1 in the evaluation data ------------------
-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
I loved the price, I loved the feel of the leather.  I had a Speck case for my iPhone 4 - loved it!  This case is a big NO WAY!

No problems going on.  Went on easier than the Google Nexus 7 (2013) case.  The problem is getting it off.  When I tried to pop the case off, the case would not let go of the back of my Nexus.  If you look at the Nexus 7 (2013) you will notice the e

In [None]:
# Let's see some of the review texts for rating 2 in the evaluation data.
rating_texts = [eval_df['review_text'][i] for i in range(0, len(eval_df)) if eval_df['rating'][i]==2]
rating_texts = random.sample(rating_texts, 20)
print('-' * 107)
print('-' * 107)
print('------------------ Here are some of the review texts for rating 2 in the evaluation data ------------------')
print('-' * 107)
print('-' * 107)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts for rating 2 in the evaluation data ------------------
-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
As initially received, the transmitter signal was very weak.  However, I found instructions to adjust it for a stronger signal on youtube. The audio cable didn't last very long due to the insulation breaking and exposing the tiny wires.  I wish they would have beefed up the audio cable so that it wouldn't break so easily.
------------------------------------------------------

In [None]:
# Let's see some of the review texts for rating 3 in the evaluation data.
rating_texts = [eval_df['review_text'][i] for i in range(0, len(eval_df)) if eval_df['rating'][i]==3]
rating_texts = random.sample(rating_texts, 20)
print('-' * 107)
print('-' * 107)
print('------------------ Here are some of the review texts for rating 3 in the evaluation data ------------------')
print('-' * 107)
print('-' * 107)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts for rating 3 in the evaluation data ------------------
-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
The nozzle of one of the six cans arrived broken and can't be fixed. The rest do what they're supposed to do. considering that I effectively got five instead of six, these aren't much cheaper than the other name-brand options.
--------------------------------------------------------------------------------
When I first hooked this unit up it had an alarm like a smoke detector

In [None]:
# Let's see some of the review texts for rating 4 in the evaluation data.
rating_texts = [eval_df['review_text'][i] for i in range(0, len(eval_df)) if eval_df['rating'][i]==4]
rating_texts = random.sample(rating_texts, 20)
print('-' * 107)
print('-' * 107)
print('------------------ Here are some of the review texts for rating 4 in the evaluation data ------------------')
print('-' * 107)
print('-' * 107)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts for rating 4 in the evaluation data ------------------
-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
it works is about all i can say !
--------------------------------------------------------------------------------
I purchased this pocket video camcorder when my Flip camera died less than 6 months after purchase.  While I've only had it a short time, so far it seems like a good product.  The picture and sound are clear and it's easy to operate.  My only complaint (and my re

In [None]:
# Let's see some of the review texts for rating 5 in the evaluation data.
rating_texts = [eval_df['review_text'][i] for i in range(0, len(eval_df)) if eval_df['rating'][i]==5]
rating_texts = random.sample(rating_texts, 20)
print('-' * 107)
print('-' * 107)
print('------------------ Here are some of the review texts for rating 5 in the evaluation data ------------------')
print('-' * 107)
print('-' * 107)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts for rating 5 in the evaluation data ------------------
-----------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
I like this. Quality is very high. The supplier for this looks to be Taiwan company www.via-labs.com. My only wish is a soft pad on the bottom so the metal doesn't rub the surface.
--------------------------------------------------------------------------------
Product works great. I didn't see any value in buying an off brand to save a few dollars. Cable attaches easily and 

We do the same thing for the test data (texts generated via GPT2, unlike the training and evaluation data which are real-world texts).

In [None]:
generated_texts_rating1_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/all_created_texts_rating1_df.pkl')
generated_texts_rating2_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/all_created_texts_rating2_df.pkl')
generated_texts_rating3_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/all_created_texts_rating3_df.pkl')
generated_texts_rating4_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/all_created_texts_rating4_df.pkl')
generated_texts_rating5_df = pd.read_pickle('/content/drive/My Drive/Amazon Review Text Language Modeling/all_created_texts_rating5_df.pkl')

test_df =  pd.concat([generated_texts_rating1_df, generated_texts_rating2_df, generated_texts_rating3_df, generated_texts_rating4_df, generated_texts_rating5_df], ignore_index=True, sort=False)
print('test_df number of rows, number of columns: {}'.format(test_df.shape))
print('number of test datapoints for each rating: {}'.format(Counter(test_df['rating'])))

test_df number of rows, number of columns: (142639, 2)
number of test datapoints for each rating: Counter({1: 28942, 3: 28881, 2: 28866, 4: 28162, 5: 27788})


In [None]:
# Let's see some of the review texts for rating 1 in the test data.
rating_texts = [test_df['generated_text'][i] for i in range(0, len(test_df)) if test_df['rating'][i]==1]
rating_texts = random.sample(rating_texts, 20)
print('-' * 121)
print('-' * 121)
print('------------------ Here are some of the review texts (generated by GPT2) for rating 1 in the test data ------------------')
print('-' * 121)
print('-' * 121)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

-------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts (generated by GPT2) for rating 1 in the test data ------------------
-------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
, 2nd set of batteries (2nd set of batteries for both) failed in 1 month.
--------------------------------------------------------------------------------
, did not work with our laptop, I bought this to replace a similar laptop that was working fine for a while and no longer has issues.
-------------------

In [None]:
# Let's see some of the review texts for rating 2 in the test data.
rating_texts = [test_df['generated_text'][i] for i in range(0, len(test_df)) if test_df['rating'][i]==2]
rating_texts = random.sample(rating_texts, 20)
print('-' * 121)
print('-' * 121)
print('------------------ Here are some of the review texts (generated by GPT2) for rating 2 in the test data ------------------')
print('-' * 121)
print('-' * 121)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

-------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts (generated by GPT2) for rating 2 in the test data ------------------
-------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
I got this for use with a computer monitor and I had high hopes for it. My first impression was that the product is solid and the material feels good. The keyboard is well built and very stylish. The key action is very responsive and it has nice features. However I was very disappointed in the product. The 

In [None]:
# Let's see some of the review texts for rating 3 in the test data.
rating_texts = [test_df['generated_text'][i] for i in range(0, len(test_df)) if test_df['rating'][i]==3]
rating_texts = random.sample(rating_texts, 20)
print('-' * 121)
print('-' * 121)
print('------------------ Here are some of the review texts (generated by GPT2) for rating 3 in the test data ------------------')
print('-' * 121)
print('-' * 121)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

-------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts (generated by GPT2) for rating 3 in the test data ------------------
-------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
The remote is nice and the controls are fairly intuitive.  However the controls are limited to only the top left and right hand buttons which make using it tricky.  The remote is designed to only get a maximum of 30 commands (1 for both left and right).  In the case of the remote control, it would get the b

In [None]:
# Let's see some of the review texts for rating 4 in the test data.
rating_texts = [test_df['generated_text'][i] for i in range(0, len(test_df)) if test_df['rating'][i]==4]
rating_texts = random.sample(rating_texts, 20)
print('-' * 121)
print('-' * 121)
print('------------------ Here are some of the review texts (generated by GPT2) for rating 4 in the test data ------------------')
print('-' * 121)
print('-' * 121)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

-------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts (generated by GPT2) for rating 4 in the test data ------------------
-------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
A very good product that is small and well built. I did receive this item at a discounted price in exchange for my honest, unbiased review.
--------------------------------------------------------------------------------
on the camera at the time of this review, it is only a basic camera which it does very 

In [None]:
# Let's see some of the review texts for rating 5 in the test data.
rating_texts = [test_df['generated_text'][i] for i in range(0, len(test_df)) if test_df['rating'][i]==5]
rating_texts = random.sample(rating_texts, 20)
print('-' * 121)
print('-' * 121)
print('------------------ Here are some of the review texts (generated by GPT2) for rating 5 in the test data ------------------')
print('-' * 121)
print('-' * 121)
print('')
for txt in rating_texts:
    print('-' * 80)
    print(txt)

-------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------
------------------ Here are some of the review texts (generated by GPT2) for rating 5 in the test data ------------------
-------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------
:  Great!
--------------------------------------------------------------------------------
of these is a very good and sturdy set of cables.  They are very flexible, with a decent quality, durable design.  They are also very strong, which is nice since they will eventually break.  These cables are made of s

For the entire classification pipeline (tokenizing, adding special tokens, truncating, padding, computing attention masks, training, etc.), we mostly follow the code in this [jupyter notebook](https://github.com/aniruddhachoudhury/BERT-Tutorials/blob/master/Blog%202/BERT_Fine_Tuning_Sentence_Classification.ipynb), which is the source code for the [tutorial article](https://medium.com/@aniruddha.choudhury94/part-2-bert-fine-tuning-tutorial-with-pytorch-for-text-classification-on-the-corpus-of-linguistic-18057ce330e1) "**Part 2: BERT Fine-Tuning Tutorial with PyTorch for Text Classification on The Corpus of Linguistic Acceptability (COLA) Dataset**".

## Tokenize, add special tokens 'CLS' and 'SEP', and map each token to id

We use the pretrained tokenizer for BERT ('bert-base-uncased') that is provided by the transformers library. 
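
To make the three operations in this section's title concrete, here is a toy sketch of the kind of output `tokenizer.encode(..., add_special_tokens=True)` produces. The whitespace split, the `toy_encode` helper and the small `toy_vocab` are assumptions for illustration only (the real tokenizer uses WordPiece with a much larger vocabulary); the ids used for `[PAD]`, `[UNK]`, `[CLS]` and `[SEP]` are the ones used by 'bert-base-uncased'.

```python
# Toy illustration only: whitespace splitting stands in for WordPiece,
# and toy_vocab is a hypothetical vocabulary, not the real BERT one.
SPECIAL_IDS = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102}
toy_vocab = {"great": 2001, "product": 2002, "works": 2003}  # hypothetical ids


def toy_encode(sentence, vocab=toy_vocab):
    """Lowercase, split on whitespace, map tokens to ids, wrap in [CLS] ... [SEP]."""
    tokens = sentence.lower().split()
    # Words missing from the vocabulary fall back to the [UNK] id.
    ids = [vocab.get(tok, SPECIAL_IDS["[UNK]"]) for tok in tokens]
    return [SPECIAL_IDS["[CLS]"]] + ids + [SPECIAL_IDS["[SEP]"]]


toy_ids = toy_encode("Great product works flawlessly")
print(toy_ids)  # [101, 2001, 2002, 2003, 100, 102]
```

Every encoded review therefore starts with id 101 and ends with id 102, which is why the sequence lengths measured later are two tokens longer than the tokenized text itself.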

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [None]:
def encoding(input_sentence, tokenizer):
    # we will do truncating together with padding later on (via the 'pad_sequences' function from keras.preprocessing.sequence), 
    # so for now, we don't need to specify a max length limit.
    return tokenizer.encode(input_sentence, add_special_tokens=True)

In [None]:
train_input_ids_list = list(train_df['review_text'].apply(encoding, tokenizer=tokenizer))
eval_input_ids_list = list(eval_df['review_text'].apply(encoding, tokenizer=tokenizer))
test_input_ids_list = list(test_df['generated_text'].apply(encoding, tokenizer=tokenizer))

Token indices sequence length is longer than the specified maximum sequence length for this model (533 > 512). Running this sequence through the model will result in indexing errors


## Truncate and pad

We need to decide the maximum length for truncating (every review text with more tokens than this limit has the tokens beyond the limit removed). The 80th percentile of the token lengths in the training data is about 144, so we set the max length to 145, so that roughly 20% of the review texts get truncated.

The remaining review texts are no longer than 145 tokens, so we pad their unused positions. For example, for a review text with 120 tokens, we have to fill the remaining 25 positions (to reach 145) with a particular value that marks these positions as padding (and not actual text). We use zero as the padding value.
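
The truncate-then-pad behaviour described above can be sketched in a few lines of plain Python. This is a minimal illustration of "post" truncation and "post" padding, not the keras implementation; `pad_or_truncate` and the example id lists are hypothetical names and values.

```python
def pad_or_truncate(ids, maxlen, pad_value=0):
    """Post-truncate to maxlen, then post-pad with pad_value up to maxlen."""
    ids = list(ids)[:maxlen]                         # drop everything past maxlen
    return ids + [pad_value] * (maxlen - len(ids))   # fill the tail with padding


short = [101, 7, 8, 9, 102]                  # 5 ids, shorter than maxlen
long_ = [101] + list(range(1, 11)) + [102]   # 12 ids, longer than maxlen

print(pad_or_truncate(short, maxlen=8))  # [101, 7, 8, 9, 102, 0, 0, 0]
print(pad_or_truncate(long_, maxlen=8))  # [101, 1, 2, 3, 4, 5, 6, 7]
```

Note that in the truncated case the trailing id 102 (`[SEP]`) is cut off as well, which is a side effect of truncating after the special tokens have already been added.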

In [None]:
print('80% percentile of text lengths in training data: {}'.format(np.percentile([len(input_ids) for input_ids in train_input_ids_list], 80)))
print('80% percentile of text lengths in evaluation data: {}'.format(np.percentile([len(input_ids) for input_ids in eval_input_ids_list], 80)))
print('80% percentile of text lengths in test data: {}'.format(np.percentile([len(input_ids) for input_ids in test_input_ids_list], 80)))

80% percentile of text lengths in training data: 144.20000000001164
80% percentile of text lengths in evaluation data: 144.0
80% percentile of text lengths in test data: 127.0


In [None]:
train_input_ids_list = pad_sequences(train_input_ids_list, maxlen=145, dtype="long", value=0, truncating="post", padding="post")
eval_input_ids_list = pad_sequences(eval_input_ids_list, maxlen=145, dtype="long", value=0, truncating="post", padding="post")
test_input_ids_list = pad_sequences(test_input_ids_list, maxlen=145, dtype="long", value=0, truncating="post", padding="post")

## Attention masks

We also need an attention mask for each review text: for every token position, it indicates whether the position holds an actual token (1) or padding (0). Since we used zero as the padding value, all we need to do is check whether the value at each position is nonzero.

In [None]:
def attention_mask(input_ids):
    return [int(token_id > 0) for token_id in input_ids]

In [None]:
train_attention_mask_list = [attention_mask(input_ids) for input_ids in train_input_ids_list]
eval_attention_mask_list = [attention_mask(input_ids) for input_ids in eval_input_ids_list]
test_attention_mask_list = [attention_mask(input_ids) for input_ids in test_input_ids_list]
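
As a quick sanity check, the sum of a mask should equal the number of real (non-padding) tokens in the sequence. The snippet below restates the `attention_mask` function so that it runs on its own; the `padded` sequence is a made-up example.

```python
def attention_mask(input_ids):
    # Same rule as in the notebook: 1 for real tokens, 0 for padding (id 0).
    return [int(token_id > 0) for token_id in input_ids]


padded = [101, 2023, 2003, 102, 0, 0, 0]  # hypothetical padded sequence
mask = attention_mask(padded)

print(mask)       # [1, 1, 1, 1, 0, 0, 0]
print(sum(mask))  # 4
```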

## Convert data (inputs and labels) to PyTorch tensors, and create PyTorch DataLoaders

As a final step before fine-tuning, we convert the variables we've built so far to PyTorch tensors, wrap them in PyTorch TensorDatasets, and finally create PyTorch DataLoaders.

In [None]:
train_input_ids_list = torch.tensor(train_input_ids_list)
eval_input_ids_list = torch.tensor(eval_input_ids_list)
test_input_ids_list = torch.tensor(test_input_ids_list)

# rating 1 is represented as label 0, rating 2 is represented as label 1, ... , rating 5 is represented as label 4
# (.to_numpy() hands torch a plain array, so the result does not depend on the DataFrame index)
train_labels = torch.tensor(train_df['rating'].to_numpy() - 1)
eval_labels = torch.tensor(eval_df['rating'].to_numpy() - 1)
test_labels = torch.tensor(test_df['rating'].to_numpy() - 1)

train_attention_mask_list = torch.tensor(train_attention_mask_list)
eval_attention_mask_list = torch.tensor(eval_attention_mask_list)
test_attention_mask_list = torch.tensor(test_attention_mask_list)

In [None]:
train_dataset = TensorDataset(train_input_ids_list, train_attention_mask_list, train_labels)
eval_dataset = TensorDataset(eval_input_ids_list, eval_attention_mask_list, eval_labels)
test_dataset = TensorDataset(test_input_ids_list, test_attention_mask_list, test_labels)

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
eval_dataloader = DataLoader(eval_dataset, batch_size=32, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)

## Fine-tune the pre-trained BertForSequenceClassification model

We prepare the pretrained BERT model, define our optimizer and scheduler, and start fine-tuning (for two epochs).

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('device: {}'.format(device))

device: cuda


In [None]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5, output_attentions=False, output_hidden_states=False)
model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [None]:
optimizer = AdamW(model.parameters(), lr = 2e-5, eps = 1e-8 )
epochs = 2
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = total_steps)
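
To see what this schedule does to the learning rate, here is a small re-implementation of the linear-decay-with-warmup multiplier in plain Python. It is a sketch written for illustration (the function name `linear_schedule_factor` is ours, not the library's), and it takes the 10,157 batches per epoch from the training log in this notebook.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from 1 down to 0 at the last step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total_steps = 10157 * 2  # batches per epoch * epochs

for step in (0, total_steps // 2, total_steps):
    factor = linear_schedule_factor(step, 0, total_steps)
    print(step, base_lr * factor)
```

With `num_warmup_steps = 0`, the learning rate starts at its full value of 2e-5, is halved by the end of the first epoch, and reaches zero at the end of the second.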

In [None]:
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

In [None]:
# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    t0 = time.time()

    avg_train_loss = 0
    num_train_total, num_train_correct = 0, 0
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):
        # Progress update every 800 batches.
        if step % 800 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.  Average train loss so far: {}.    Elapsed: {:}.'.format(step, len(train_dataloader), round(avg_train_loss, 3), elapsed))

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        optimizer.zero_grad()        

        outputs = model(b_input_ids, 
                    token_type_ids=None, 
                    attention_mask=b_input_mask, 
                    labels=b_labels)
        
        loss = outputs[0]

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

        avg_train_loss = ((avg_train_loss * step) + loss.item()) / (step + 1)

        logits = outputs[1]
        logits = logits.detach().cpu().numpy()
        pred_flat = np.argmax(logits, axis=1).flatten()
        label_ids = b_labels.to('cpu').numpy()
        labels_flat = label_ids.flatten()
        num_train_correct += np.sum(pred_flat == labels_flat)
        num_train_total += len(labels_flat)       

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Num train total: {}, Num train correct: {}, Train accuracy: {}".format(num_train_total, num_train_correct, round(num_train_correct / num_train_total, 3)))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    num_eval_total, num_eval_correct = 0, 0
    model.eval()

    # Evaluate data for one epoch
    for batch in eval_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch

        with torch.no_grad():        
            outputs = model(b_input_ids, b_input_mask)
        
        logits = outputs[0]
        logits = logits.detach().cpu().numpy()
        pred_flat = np.argmax(logits, axis=1).flatten()
        label_ids = b_labels.to('cpu').numpy()
        labels_flat = label_ids.flatten()
        num_eval_correct += np.sum(pred_flat == labels_flat)
        num_eval_total += len(labels_flat)

    # Report the final accuracy for this validation run.
    print("  Num validation total: {}, Num validation correct: {}, Validation accuracy: {}".format(num_eval_total, num_eval_correct, round(num_eval_correct / num_eval_total, 3)))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))



    # ========================================
    #               Test
    # ========================================
    # After the completion of last training epoch, measure our performance on
    # our test set (generated texts from gpt2)
    if epoch_i == epochs - 1:
        print("")
        print("Running Test...")

        t0 = time.time()
        
        num_test_total, num_test_correct = 0, 0
        for batch in test_dataloader:
            batch = tuple(t.to(device) for t in batch)
            b_input_ids, b_input_mask, b_labels = batch
            
            with torch.no_grad():        
                outputs = model(b_input_ids, b_input_mask)
            
            logits = outputs[0]
            logits = logits.detach().cpu().numpy()
            pred_flat = np.argmax(logits, axis=1).flatten()
            label_ids = b_labels.to('cpu').numpy()
            labels_flat = label_ids.flatten()
            num_test_correct += np.sum(pred_flat == labels_flat)
            num_test_total += len(labels_flat)
        # Report the final accuracy for the test run.
        print("  Num test total: {}, Num test correct: {}, Test accuracy: {}".format(num_test_total, num_test_correct, round(num_test_correct / num_test_total, 3)))
        print("  Test took: {:}".format(format_time(time.time() - t0)))


Training...
  Batch   800  of  10,157.  Average train loss so far: 1.137.    Elapsed: 0:10:35.
  Batch 1,600  of  10,157.  Average train loss so far: 1.078.    Elapsed: 0:21:17.
  Batch 2,400  of  10,157.  Average train loss so far: 1.054.    Elapsed: 0:31:59.
  Batch 3,200  of  10,157.  Average train loss so far: 1.039.    Elapsed: 0:42:41.
  Batch 4,000  of  10,157.  Average train loss so far: 1.024.    Elapsed: 0:53:24.
  Batch 4,800  of  10,157.  Average train loss so far: 1.014.    Elapsed: 1:04:05.
  Batch 5,600  of  10,157.  Average train loss so far: 1.006.    Elapsed: 1:14:48.
  Batch 6,400  of  10,157.  Average train loss so far: 0.999.    Elapsed: 1:25:30.
  Batch 7,200  of  10,157.  Average train loss so far: 0.993.    Elapsed: 1:36:12.
  Batch 8,000  of  10,157.  Average train loss so far: 0.987.    Elapsed: 1:46:54.
  Batch 8,800  of  10,157.  Average train loss so far: 0.982.    Elapsed: 1:57:36.
  Batch 9,600  of  10,157.  Average train loss so far: 0.977.    Elapsed: 

The text classification model's validation accuracy (real-world text) is 61%, and its test accuracy (generated text) is 57%, a gap of 4 percentage points. (Note that we haven't put much effort into achieving better model performance, such as hyperparameter tuning, because our main interest here is whether the model's performance on the validation set is **similar** to that on the test set, rather than getting good predictions.)

So it seems reasonable to say that the texts generated by the five GPT2 language models are indeed fairly similar to the real-world review texts that they were trained with.

# Future Work

We've observed that the classifier's accuracy on the generated texts via the GPT2 language models is reasonably similar to its accuracy on the evaluation data. So it could be reasonable to say that the GPT2 models (for each rating) are capable of creating texts that are **similar** to the corresponding 'real-world' texts.

Here are some additional ideas that might provide more insights.

- We've only checked accuracy as our metric, but we could also compare other metrics, including class-wise precision and recall. This can tell us which ratings the language models are particularly good at. We can even try metrics designed for regression, such as mean absolute error, because our labels are ordered (e.g. incorrectly predicting a rating 5 text as 4 is a much smaller error than predicting it as 1).
- We can try more language models other than GPT2, go through the same process, and use the classifier results for comparing the capability of the different language models. 
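
As a starting point for the first idea, the metrics could be computed along these lines. This is a minimal pure-Python sketch; the helper names are hypothetical and the label lists are made up purely to show the calls (they are not results from this project).

```python
def per_class_precision_recall(y_true, y_pred, classes):
    """Return {class: (precision, recall)}; None where a ratio is undefined."""
    result = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)  # texts predicted as c
        actual = sum(1 for t in y_true if t == c)     # texts truly rated c
        precision = tp / predicted if predicted else None
        recall = tp / actual if actual else None
        result[c] = (precision, recall)
    return result


def mean_absolute_error(y_true, y_pred):
    """Average distance between true and predicted rating."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ratings, only to demonstrate the functions.
y_true = [1, 1, 2, 3, 4, 5, 5, 5]
y_pred = [1, 2, 2, 3, 5, 5, 4, 1]

metrics = per_class_precision_recall(y_true, y_pred, classes=[1, 2, 3, 4, 5])
for rating, (precision, recall) in metrics.items():
    print(rating, precision, recall)
print(mean_absolute_error(y_true, y_pred))  # 0.875
```

Running the same functions on the evaluation predictions and on the test predictions would let us compare the two per rating, rather than through a single accuracy number.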

# Datasets/Papers Reference

- Amazon review text dataset
        Justifying recommendations using distantly-labeled reviews and fine-grained aspects
        Jianmo Ni, Jiacheng Li, Julian McAuley
        Empirical Methods in Natural Language Processing (EMNLP), 2019

- GPT2 paper
        Language Models are Unsupervised Multitask Learners
        Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya
        2019

- BERT paper
        BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
        Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina
        arXiv preprint arXiv:1810.04805
        2018