# Introduction
This notebook can be used both to train a discriminator on the AG News dataset and to steer text generation towards each of the four classes of this dataset, namely world, sports, business and sci/tech.   

My code uses and builds on a text generation plug-and-play model developed by the Uber AI research team, which can be found here: https://github.com/uber-research/PPLM.    
I had a lot of problems setting up the code provided with the original paper, since the package versions are old and have a lot of incompatibility issues. I spent a lot of time trying to set up pip and/or Anaconda environments to run their code, but there was always an issue.   
Therefore, I developed this in Google Colab, which seems to be the only place where I can run their code without problems. **I strongly recommend that you run this in Google Colab as well**. My code is somewhat hard to use precisely because the original PPLM code is hard to use. I forked the PPLM repo and removed a lot of unnecessary material, keeping only the parts I'm using in this notebook. I also added my newly trained discriminator model.   
   
By running this entire notebook cell by cell, you both train the discriminator and perform the generation experiment. However, since I've already trained this very discriminator, you can skip those cells. You can also skip the cells corresponding to saving models and results to disk. I've marked the "mandatory" cells with the comment "# MUST BE RUN" for this purpose.

## Main functionality
This notebook essentially just runs my experiment setup, using the newly trained discriminator to steer text generation in the direction of each of the discriminator's classes.   
   
The main function is named *text_generation*, which can be used to generate a user-chosen number of samples using either an unperturbed or a perturbed model. In the latter case, the user can choose which class to steer text generation towards. Note that this is not entirely new functionality: it is based on some of the PPLM code, modified to suit my experiment.

## Terminology used throughout:
- Model setting: using a general language model (GPT-2) together with the discriminator fixed on optimizing for one specific class.
- Perturbed and unperturbed: This is essentially whether a discriminator has been used in the text generation. For instance, unperturbed text is "clean", meaning unsteered, while perturbed text is steered in a class direction.

# Setup: import the code base from GitHub, and install requirements

In [1]:
# MUST BE RUN
!git clone https://github.com/eskilhamre/PPLM.git

Cloning into 'PPLM'...
remote: Enumerating objects: 297, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 297 (delta 20), reused 14 (delta 0), pack-reused 250
Receiving objects: 100% (297/297), 2.46 MiB | 19.66 MiB/s, done.
Resolving deltas: 100% (125/125), done.


In [2]:
# MUST BE RUN
import os
os.chdir('PPLM')

In [3]:
# MUST BE RUN
!pip install -r requirements.txt

Collecting torch==1.7.0
  Downloading torch-1.7.0-cp37-cp37m-manylinux1_x86_64.whl (776.7 MB)
Collecting nltk==3.4.5
  Downloading nltk-3.4.5.zip (1.5 MB)
Collecting colorama==0.4.4
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting transformers==3.4.0
  Downloading transformers-3.4.0-py3-none-any.whl (1.3 MB)
Collecting torchtext==0.3.1
  Downloading torchtext-0.3.1-py3-none-any.whl (62 kB)
Collecting dataclasses
  Downloading dataclasses-0.6-py3-none-any.whl (14 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting tokenizers==0.9.2
  Downloading tokenizers-0.9.2-cp37-cp37m-manylinux1_x86_64.whl (2.9 MB)

# Train a discriminator on AG news dataset
## First, download the general language model used to train the discriminator

In [None]:
from transformers.modeling_gpt2 import GPT2LMHeadModel
# This downloads GPT-2 Medium, it takes a little while
_ = GPT2LMHeadModel.from_pretrained("gpt2-medium")

Downloading:   0%|          | 0.00/718 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.52G [00:00<?, ?B/s]

## Import the dataset
The data can be found at: https://www.kaggle.com/amananandrai/ag-news-classification-dataset/version/2?select=train.csv

The PPLM interface requires the data to be a single tsv file containing the entire dataset, where the first column holds the labels and the second column the text. Thus, we have to prepare the dataset for this.   
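
To make the expected layout concrete, here is a minimal sketch of a two-column, tab-separated file built with the standard library only. The two rows are hypothetical and only there for illustration; the real rows come from the AG News csv files.

```python
import csv
import io

# Hypothetical rows, for illustration only: (label, text)
rows = [
    ("world", "Leaders meet to discuss a new trade agreement."),
    ("sports", "The home team won the final in overtime."),
]

# Write the rows as tab-separated values, label first and text second
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
tsv_text = buffer.getvalue()
print(tsv_text)

# Reading it back: column 0 is the label, column 1 is the text
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
labels = [row[0] for row in parsed]
texts = [row[1] for row in parsed]
```

There is no header row and no index column in this layout, which matches the `header=False, index=False` arguments used when the prepared dataset is written to disk further down.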

First, download the dataset from the link above, and upload both files (train.csv and test.csv) in the upload cell below.

In [4]:
import pandas as pd
from google.colab import files
import torch
torch.cuda.is_available()

True

In [None]:
uploaded = files.upload()

Saving test.csv to test.csv
Saving train.csv to train.csv


In [None]:
data_fp = "./ag-news-data.tsv"   # where we want to store our prepared dataset

In [None]:
def prepare_dataset(text_index,
                    label_index, 
                    label_names=None
                    ):
  train = pd.read_csv("train.csv")
  test = pd.read_csv("test.csv")
  all_data = pd.concat([train, test])
  all_data = all_data.iloc[:, [label_index, text_index]]

  if label_names:
    labels_map = {i+1: label_name for i, label_name in enumerate(label_names)}    # assumes labels are numbered 1,...,n, which is the case for AG news
    all_data.iloc[:, 0] = all_data.iloc[:, 0].map(labels_map)                     # replace label numbers with their names

  return all_data

In [None]:
idx2class = ["world", "sports", "business", "sci/tech"]
data = prepare_dataset(2, 0, idx2class)
data.to_csv(data_fp, sep='\t', index=False, header=False)

In [None]:
import numpy as np   # needed for np.random.seed below
from run_pplm_discrim_train import train_discriminator

# ensure reproducible discriminator
torch.manual_seed(444)
np.random.seed(444)

discriminator, disc_info = train_discriminator(
        dataset="generic",
        dataset_fp=data_fp,
        pretrained_model="gpt2-medium",
        epochs=8,
        learning_rate=0.0001,
        batch_size=128,
        log_interval=10,
        save_model=True,
        cached=False,
        no_cuda=False,
        output_fp='models/',
        idx2class=idx2class
)

Preprocessing generic dataset...


127600it [00:54, 2361.50it/s]


Length of dataset after removing too long sequences: 126989
Preprocessed 126989 data points
Data preprocessing took: 67.186s

Epoch 1
Performance on test set: Average loss: 0.4366, Accuracy: 11123/12699 (88%)
Epoch took: 1459.130s

Example prediction
Input sentence: This is incredible! I love it, this is the best chicken I have ever had.
Predictions: world: 0.0812, sports: 0.2675, business: 0.1457, sci/tech: 0.5056

Epoch 2
Performance on test set: Average loss: 0.3557, Accuracy: 11269/12699 (89%)
Epoch took: 1456.766s

Example prediction
Input sentence: This is incredible! I love it, this is the best chicken I have ever had.
Predictions: world: 0.0481, sports: 0.1891, business: 0.1991, sci/tech: 0.5638

Epoch 3
Performance on test set: Average loss: 0.3278, Accuracy: 11335/12699 (89%)
Epoch took: 1460.328s

Example prediction
Input sentence: This is incredible! I love it, this is the best chicken I have ever had.
Predictions: world: 0.0409, sports: 0.1603, business: 0.2647, sci/tech: 

We achieve about 90% accuracy on unseen data, which is pretty good in my opinion. I haven't studied the training accuracy (so I can't say the following for sure), but I don't think we're either underfitting or overfitting here. This is good stuff!   
Also, the validation/test accuracy seems to stagnate at ~90%, so more epochs would probably be of no use.
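
As a quick check of the trend, the sketch below recomputes the accuracies from the counts in the log above. Only the three epochs visible in the (truncated) log are included, so it says nothing about the later epochs.

```python
# (correct, total) pairs copied from the training log above
epoch_results = {1: (11123, 12699), 2: (11269, 12699), 3: (11335, 12699)}

accuracies = {
    epoch: correct / total for epoch, (correct, total) in epoch_results.items()
}
for epoch, acc in accuracies.items():
    print(f"Epoch {epoch}: {acc:.2%}")

# improvement between consecutive epochs, in percentage points
gains = [100 * (accuracies[e + 1] - accuracies[e]) for e in (1, 2)]
print([round(g, 2) for g in gains])
```

The gain from epoch 2 to 3 is smaller than the gain from epoch 1 to 2, which is consistent with the accuracy flattening out.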

## Training the discriminator is done, let's download it

In [None]:
classifier_name = "models/news_classifierhead.pt"
torch.save(discriminator.get_classifier().state_dict(), classifier_name)
files.download(classifier_name)


At this point, I put the newly generated model in the discrim_models/ folder, and updated my GitHub code to include this model. I then went back to the beginning of the notebook and re-cloned the repo.

# Scoring the generated samples
When manually comparing text generated from different model settings, I'm only interested in comparing the best sample for each model setting. The idea is to generate many samples using the same setting and pick the best one based on some type of scoring.   
What do I mean by best sample? I automate this evaluation by scoring and ranking the sentences in a way similar to the one described in the PPLM paper:
- Fluency is measured by the general language model likelihood p(sentence). In scoring, I use the fact that the lower the language model loss(sentence), the higher p(sentence). I use GPT-1 for this, as in the PPLM paper (in the paper, however, they use GPT-1 to calculate perplexity; as I understand it, perplexity is the exponential of the mean per-token loss, so the two rank sentences the same way).
- Diversity of words is measured by the mean of the (length-normalized) Dist-1, Dist-2 and Dist-3 scores (the PPLM paper was inspired by the way this metric is used in this paper: https://arxiv.org/pdf/1510.03055.pdf).
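
To illustrate the diversity metric, here is a small standard-library sketch. The helper name `distinct_ngram_ratio` is hypothetical, and the sketch assumes the normalization is the number of distinct n-grams divided by the number of words in the sentence.

```python
def distinct_ngram_ratio(text, n):
    """Number of distinct n-grams divided by the number of words."""
    words = text.lower().split()
    ngrams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return len(ngrams) / len(words)


varied = "there is a book on the desk"
repetitive = "desk desk desk desk cat cat"

mean_scores = {}
for text in (varied, repetitive):
    scores = [distinct_ngram_ratio(text, n) for n in (1, 2, 3)]
    mean_scores[text] = sum(scores) / len(scores)
    print(text, [round(s, 3) for s in scores], round(mean_scores[text], 3))
```

A sentence that keeps repeating the same words produces few distinct n-grams and therefore a low mean score, while a sentence with no repetition scores close to 1.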



In [5]:
# MUST BE RUN
from transformers.modeling_gpt2 import GPT2LMHeadModel
from transformers import GPT2Tokenizer, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer
from nltk import ngrams
import numpy as np

### Instantiate models used for scoring samples

In [6]:
# MUST BE RUN
device = "cuda" if torch.cuda.is_available() else "cpu"

# tokenizer and language model used for calculating fluency / "perplexity"
gpt1_tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
gpt1_model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
gpt1_model.eval()
gpt1_model.to(device)
device

Downloading:   0%|          | 0.00/816k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/458k [00:00<?, ?B/s]

ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.


Downloading:   0%|          | 0.00/656 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/479M [00:00<?, ?B/s]

Some weights of OpenAIGPTLMHeadModel were not initialized from the model checkpoint at openai-gpt and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


'cuda'

In [7]:
# MUST BE RUN

##############
# This is to be used on all generated sentences (to be aggregated wrt. model setting), not used for selection
##############

def lm_score(sentence):
  """
  Calculates the mean per-token language model loss of the sentence.
  Code heavily inspired by: https://github.com/huggingface/transformers/issues/1009
  The loss is the average negative log-likelihood of the tokens,
  - (1/N) * [ log P(x2 | x1) + log P(x3 | x1, x2) + ... ],
  so exp(loss) is the perplexity, and the loss can be used in its place in comparisons.
  """
  tokens = gpt1_tokenizer.encode(sentence)
  input_ids = torch.tensor(tokens).unsqueeze(0)
  input_ids = input_ids.to(device)

  with torch.no_grad():
    outputs = gpt1_model(input_ids, labels=input_ids)
  loss, logits = outputs[:2]
  return loss.item() # * len(tokens)  don't multiply with length, this would prefer shorter sentences

###############
# This is used for selecting the best sample when model setting and prefix is fixed
###############

def dist_n_score(sentence, n):
  """Calculates the number of distinct n-grams in the sentence, normalized by the sentence length"""
  if len(sentence.split()) < n:
    raise ValueError("Cannot find ngram of sentence with less than n words")
  
  sentence = sentence.lower().strip()
  dist_n_grams = set()
  for n_gram in ngrams(sentence.split(), n):
    dist_n_grams.add(n_gram)
  
  return len(dist_n_grams) / len(sentence.split())


def dist_score(sentence):
  """
  Calculates the dist-1, dist-2 and dist-3 score of the sentence, as well as their mean
  """
  sentence = sentence.lower().strip()
  sentence = sentence.replace(".", "").replace(",", "").replace("\n", "")
  
  dist_scores = [dist_n_score(sentence, n) for n in range(1, 4)]
  dist_1, dist_2, dist_3 = dist_scores
  return np.mean(dist_scores), dist_1, dist_2, dist_3


sentences =['there is a book on the desk', 'there is a plane on the desk', 'there is a book in the desk', "desk desk desk desk cat cat"]
print([lm_score(s) for s in sentences])
print([dist_score(s)[0] for s in sentences])

[3.0594890117645264, 4.118381977081299, 3.2676448822021484, 9.095804214477539]
[0.8571428571428572, 0.8571428571428572, 0.8571428571428572, 0.4444444444444444]


We can see that the most sensible of the four sentences receives the lowest language model score (and thus the highest probability). We can also see that the nonsensical sentence receives both a bad language model score and a bad dist-score.
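
The scores above can be turned into perplexities by exponentiating them. This is a sketch under the assumption that `lm_score` returns the mean per-token cross-entropy in nats; the loss values are copied (rounded) from the output above.

```python
import math

# mean per-token losses printed by the cell above (rounded)
losses = {
    "there is a book on the desk": 3.0595,
    "there is a plane on the desk": 4.1184,
    "there is a book in the desk": 3.2676,
    "desk desk desk desk cat cat": 9.0958,
}

# perplexity is the exponential of the mean per-token loss
perplexities = {sentence: math.exp(loss) for sentence, loss in losses.items()}

for sentence, ppl in sorted(perplexities.items(), key=lambda item: item[1]):
    print(f"{ppl:10.1f}  {sentence}")
```

Since the exponential is monotonic, ranking by loss and ranking by perplexity always give the same order, so either can be used for comparisons.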

# Text generation

In [8]:
# MUST BE RUN
from run_pplm import  run_pplm_example, generate_text_pplm, get_classifier, PPLM_DISCRIM

In [9]:
# MUST BE RUN

def text_generation(
    model,
    tokenizer,
    discrim="news",
    class_label="sports",
    prefix_text="Last summer",
    perturb=False,
    num_samples=3,
    device="cuda",
    length=150,
    stepsize=0.04,
    num_iterations=10,
    window_length=0,  # 0 corresponds to entire sequence
    gamma=1.0,
    gm_scale=0.95,
    kl_scale=0.01,
    verbosity_level=1 # REGULAR
):
  """
  Used to generate a user-specified number of samples, with optional use of the discriminator
  to perturb the generated samples in the direction of its gradient.

  This is a modified version of the PPLM text generation function to suit my experiment.

  Only supports generating text using discriminator models (BoW models not supported)
  The default hyper parameters chosen here are the same as in the PPLM Colab demo, since
  this seems to work great for discriminators.

  Returns a list of generated text samples and their corresponding attribute model losses
  """

  # we pass the discriminator even if we want unperturbed text, since it's used for attribute scoring
  discrim_model, class_id = get_classifier(
    discrim,
    class_label,
    device
  )

  # encode prefix text
  tokenized_cond_text = tokenizer.encode(
    tokenizer.bos_token + prefix_text,
    add_special_tokens=False
  )

  gen_text_samples = []
  discrim_losses = []

  if device == 'cuda':
    torch.cuda.empty_cache()

  for i in range(num_samples):
    gen_tok_text, discrim_loss, _ = generate_text_pplm(
        model=model,
        tokenizer=tokenizer,
        context=tokenized_cond_text,
        device=device,
        perturb=perturb,
        classifier=discrim_model,
        class_label=class_id,
        loss_type=PPLM_DISCRIM,  # BoW not supported as of now
        length=length,
        stepsize=stepsize,
        sample=True,
        num_iterations=num_iterations,
        horizon_length=1,
        window_length=window_length,
        gamma=gamma,
        gm_scale=gm_scale,
        kl_scale=kl_scale,
        verbosity_level=verbosity_level
    )

    
    gen_text = tokenizer.decode(gen_tok_text[0][1:]) # decode generated text
    gen_text_samples.append(gen_text)
    discrim_losses.append(discrim_loss.item())  #.data.cpu().numpy())
  
  if device == "cuda":
    torch.cuda.empty_cache()

  return gen_text_samples, discrim_losses


def select_best(gen_text_samples, discrim_losses):
  """
  Given the output from the text_generation function, filters away 3/4 of the 
  generated samples based on mean dist-score, and ranks the remaining 1/4 based
  on discriminator losses.
  
  Returns the best sample based on the smallest discriminator loss (the one maximizing 
  the attribute, according to the discriminator)
  """
  if len(gen_text_samples) < 4:
    raise ValueError("Cannot filter away 3/4 of less than 4 samples")
  
  n_keep = 1 * len(gen_text_samples) // 4  # number of samples to keep

  # filter out the 3/4 samples with lowest mean dist-score
  mean_dists = [dist_score(sample)[0] for sample in gen_text_samples]
  idx_to_keep = np.argpartition(mean_dists, -n_keep)[-n_keep:]   # indices of samples with highest mean dist score

  # fetch best sample among the remaining ones; index the original lists directly,
  # since stacking text and floats in one numpy array would turn the losses into strings
  best_idx = min(idx_to_keep, key=lambda idx: discrim_losses[idx])  # index of sample with minimal discrim loss
  best_sample = gen_text_samples[best_idx]
  smallest_loss = float(discrim_losses[best_idx])
  mean_dist = float(mean_dists[best_idx])
  return best_sample, smallest_loss, mean_dist
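
The two-stage selection can be sketched on toy data with plain Python. The helper name `pick_best` and all values below are hypothetical and only meant to show the idea: filter on diversity first, then rank by discriminator loss.

```python
def pick_best(samples, losses, diversity):
    """Keep the most diverse quarter, then return the index with the lowest loss."""
    n_keep = len(samples) // 4
    by_diversity = sorted(
        range(len(samples)), key=lambda i: diversity[i], reverse=True
    )
    kept = by_diversity[:n_keep]
    return min(kept, key=lambda i: losses[i])


# Hypothetical values, for illustration only
samples = [f"sample {i}" for i in range(8)]
losses = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6]
diversity = [0.80, 0.40, 0.95, 0.60, 0.90, 0.50, 0.45, 0.70]

best = pick_best(samples, losses, diversity)
print(samples[best], losses[best], diversity[best])
```

Note that the sample with the overall lowest loss (index 1) is not selected here, because its diversity score is too low to survive the first stage. This is the intended effect: a repetitive sample should not win just because the discriminator likes it.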

## Import the base model used to sample from

In [10]:
# MUST BE RUN
pretrained_model = "gpt2-medium"

model = GPT2LMHeadModel.from_pretrained(
        pretrained_model,
        output_hidden_states=True
)
model.to(device)
model.eval()

# Freeze GPT-2 weights
for param in model.parameters():
  param.requires_grad = False

tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model)

Downloading:   0%|          | 0.00/718 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.52G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

## Sample from all combinations of model setting and prefix sentences
First, let's create some data structures to gather relevant information from the sampling process.

In [16]:
# MUST BE RUN

# most relevant hyper params wrt. speed
generated_len = 120
num_samples = 12

prefixes = [
  "Last week",
  "The potato",
  "Breaking news:",
  "In the last year",
  "The president of the country",
]

model_settings = ["world", "sports", "business", "sci/tech"]  # the classes of the discriminator

In [17]:
# MUST BE RUN

# data structures of generated results
gen_samples = {model_setting: [] for model_setting in ["unpert"] + model_settings.copy()}             # contains all generated samples
comparisons = {model_setting: {prefix: dict() for prefix in prefixes} for model_setting in model_settings}  # contains the best samples for each model setting and prefix combo
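
For reference, here is a small sketch of how the nested `comparisons` structure is laid out once it has been filled. It uses shortened lists, and the stored entry is a hypothetical placeholder, not a real generated sample.

```python
prefixes = ["Last week", "The potato"]
model_settings = ["world", "sports"]

# model setting -> prefix -> {"unpert": [...], "pert": [...]}
comparisons = {
    setting: {prefix: dict() for prefix in prefixes} for setting in model_settings
}

# Hypothetical entries: [best sample text, discriminator loss, mean dist-score]
comparisons["sports"]["Last week"]["pert"] = ["placeholder perturbed text", 0.12, 0.85]
comparisons["sports"]["Last week"]["unpert"] = ["placeholder unperturbed text", 1.40, 0.88]

sample, loss, mean_dist = comparisons["sports"]["Last week"]["pert"]
print(sample, loss, mean_dist)
```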

The cell below runs the *entire* sampling process. It took ~5 hours to run on Google's Compute Engine backend using their GPUs.   
   
Here I decided to generate unperturbed text for each model setting. This might seem silly and redundant, since the unperturbed text is not affected by this choice.    
And while that is partly true, I did this to be able to calculate the discriminator losses of the generated text, so that I can select the "best" sample wrt. the classes (even though it's best by chance). I thought this was only fair: the perturbed model gets many chances to generate a "good" sample (in the eyes of the discriminator), so the unperturbed model should have them too.   
Also, I didn't find an easy way of using the discriminator to just score a text sample wrt. a class right off the bat. This is partly because the discriminator is actually trained on the transformer output.

In [None]:
# MUST BE RUN

# since we're sampling, set seed for reproducibility
torch.manual_seed(444)
np.random.seed(444)

n_combinations = len(prefixes) * len(model_settings)
i = 1
for prefix_sentence in prefixes:
  for j, model_setting in enumerate(model_settings):
    print(f"\n\nRun {i:3d}/{n_combinations:3d} : optimizing for class: {model_setting}, with prefix: {prefix_sentence}\n")
    
    unpert_text_samples, unpert_discrim_losses = text_generation(
        model,
        tokenizer,
        device=device,
        length=generated_len,
        num_samples=num_samples,
        prefix_text=prefix_sentence,
        discrim="news",
        class_label=model_setting,
        perturb=False
    )

    pert_text_samples, pert_discrim_losses = text_generation(
        model,
        tokenizer,
        device=device,
        length=generated_len,
        num_samples=num_samples,
        prefix_text=prefix_sentence,
        discrim="news",
        class_label=model_setting,
        perturb=True
    )
    # store generated samples
    if j == 0:
      gen_samples["unpert"].extend(unpert_text_samples)    # only store unperturbed generation once per prefix
    gen_samples[model_setting].extend(pert_text_samples)

    # save the best sample, its discriminator loss and mean dist-score for both the perturbed and unperturbed samples
    comparisons[model_setting][prefix_sentence]["unpert"] = list(select_best(unpert_text_samples, unpert_discrim_losses))
    comparisons[model_setting][prefix_sentence]["pert"] = list(select_best(pert_text_samples, pert_discrim_losses))

    i += 1




Run   1/ 15 : optimizing for class: sports, with prefix: Last week

<|endoftext|>Last week's
<|endoftext|>Last week's announcement
<|endoftext|>Last week's announcement that




Streaming output truncated to the last 5000 lines.

But according to an email obtained by Politico
<|endoftext|>Breaking news: Donald Trump's campaign is reportedly paying $20,000 to a woman who works for a pro-Trump super PAC called America First Action in California to help organize the 2016 election.

In a statement to the Daily Beast, Trump's campaign told the outlet it "is not aware of any specific contributions from this PAC."

But according to an email obtained by Politico.
<|endoftext|>Breaking news: Donald Trump's campaign is reportedly paying $20,000 to a woman who works for a pro-Trump super PAC called America First Action in California to help organize the 2016 election.

In a statement to the Daily Beast, Trump's campaign told the outlet it "is not aware of any specific contributions from this PAC."

But according to an email obtained by Politico.com
<|endoftext|>Breaking news: Donald Trump's campaign is reportedly paying $20,000 to a woman who works for a pr

## Generation analysis
First, let's download the generated samples.

In [None]:
import json

with open("all-samples.json", "w") as fp:
  json.dump(gen_samples, fp)
files.download("all-samples.json")   # download only after the file has been closed

with open("comparisons.json", "w") as fp:
  json.dump(comparisons, fp)
files.download("comparisons.json")

## Let's extract the metrics from the generated samples
In the code below, for each model setting, I calculate the perplexity score and the dist-1, dist-2 and dist-3 scores for all samples. I then compute the mean and standard deviation of the scores for each model setting, to study how well each model setting actually performed in the experiment above.
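
The aggregation itself is simple; a standard-library sketch of the same idea is shown here. All numbers are hypothetical. The sketch uses the population standard deviation (`statistics.pstdev`), which is also what `np.std` computes by default.

```python
import statistics

# Hypothetical per-sample scores for two model settings, for illustration only
scores = {
    "unpert": {"perplexity": [3.1, 3.4, 2.9], "dist-1": [0.60, 0.62, 0.58]},
    "sports": {"perplexity": [3.6, 3.9, 3.3], "dist-1": [0.55, 0.57, 0.59]},
}

# (mean, population standard deviation) per model setting and metric
summary = {
    setting: {
        metric: (statistics.mean(values), statistics.pstdev(values))
        for metric, values in metrics.items()
    }
    for setting, metrics in scores.items()
}

for setting, metrics in summary.items():
    for metric, (mean, std) in metrics.items():
        print(f"{setting:>7} {metric:<10} mean={mean:.3f} std={std:.3f}")
```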

In [None]:
# MUST BE RUN

metrics_means_dict = {}
metrics_stds_dict = {}

for model_setting, samples in gen_samples.items():
  perplexities = [lm_score(sample) for sample in samples]
  dist_scores = [dist_score(sample)[1:] for sample in samples]   # stored as (mean_dist_score, dist-1, dist-2, dist-3), ignore mean
  all_metrics = np.c_[np.array(perplexities), np.array(dist_scores)]
  metrics_means_dict[model_setting] = np.mean(all_metrics, axis=0)
  metrics_stds_dict[model_setting] = np.std(all_metrics, axis=0)

# structure the statistics neatly in dataframes
metrics_means_df = pd.DataFrame(data=metrics_means_dict, index=["perplexity", "dist-1", "dist-2", "dist-3"])
metrics_std_df = pd.DataFrame(data=metrics_stds_dict, index=["perplexity", "dist-1", "dist-2", "dist-3"])

In [None]:
# save the extracted statistics as csv files
metrics_means_df.to_csv("metrics-means.csv")
metrics_std_df.to_csv("metrics-std.csv")
files.download("metrics-means.csv")
files.download("metrics-std.csv")

## Let's see the best examples for each model setting and prefix

In [None]:
# MUST BE RUN

for model_setting, prefix_dict in comparisons.items():
  print(f"Model setting: {model_setting}\n")
  for prefix_sentence in prefix_dict.keys():
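    unpert_sample, unpert_loss, unpert_mean_dist = prefix_dict[prefix_sentence]["unpert"]
    pert_sample, pert_loss, pert_mean_dist = prefix_dict[prefix_sentence]["pert"]
    # cast to float so the numeric format specs below also work if the values were stored as strings
    unpert_loss, unpert_mean_dist = float(unpert_loss), float(unpert_mean_dist)
    pert_loss, pert_mean_dist = float(pert_loss), float(pert_mean_dist)

    print(f"Prefix is: {prefix_sentence}\n")
    print(f"Unperturbed:\nSample: {unpert_sample}\nDiscrim loss: {unpert_loss:2.2f} | Mean dist-n score: {unpert_mean_dist:2.1f}\n")
    print(f"  Perturbed:\nSample: {pert_sample}\nDiscrim loss: {pert_loss:2.2f} | Mean dist-n score: {pert_mean_dist:2.1f}")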
  print("\n\n")
