# Generating Sarcastic Tweets
This notebook generates sarcastic tweets using the GPT-2 language model that has been fine tuned with sarcastic tweets from the FigLang dataset.
It provides a standard finetuning procedure and a novel architecture that leverages generated Synthetic Data to enhance the training set.

Setup when using Google Colab.

In [1]:
from google.colab import drive
import os
import sys
from pathlib import Path

# First mount your drive
drive_path = Path('/content/drive')
drive.mount(str(drive_path))

# Set path to the project folder
PROJECT_PATH = "MyDrive/deep_learning/SarcasmGenerator"
path = drive_path / PROJECT_PATH

# Possibly append to PATH
if path not in sys.path:
    sys.path.append(str(path))

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip3 install gpt-2-simple
!pip3 install transformers
!pip3 install sentencepiece # fixes error while loading tokenizer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Loading Dataset from Google Drive.

In [3]:
import gpt_2_simple as gpt2
import pandas as pd

FULL_DATASET_PATH = 'sarcasm_detection_shared_task_twitter_training.jsonl'

train_data = pd.read_json(path/FULL_DATASET_PATH, lines=True)

# Extract all sarcastic tweets from the dataset
sarcastic_tweets = train_data[train_data["label"]== "SARCASM"]["response"]
sarcastic_tweets.head()

0    @USER @USER @USER I don't get this .. obviousl...
1    @USER @USER trying to protest about . Talking ...
2    @USER @USER @USER He makes an insane about of ...
3    @USER @USER Meanwhile Trump won't even release...
4    @USER @USER Pretty Sure the Anti-Lincoln Crowd...
Name: response, dtype: object

To classify the generated output  a pretrained Sarcasm classifier trained on the same dataset was used.

In [4]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

CLASSIFIER = "mrm8488/t5-base-finetuned-sarcasm-twitter"

tokenizer = AutoTokenizer.from_pretrained(CLASSIFIER)
classifier = AutoModelForSeq2SeqLM.from_pretrained(CLASSIFIER)

def classify_tweets(tweet):
  """ Classify a tweet using the pretrained classifier

    Args:
        tweet (str): Tweet or series of tweets to classify.

    Returns:
        bool: True if tweet is classified as sarcastic.

    """
  input_ids = tokenizer.encode(tweet + '</s>', return_tensors='pt')
  output = classifier.generate(input_ids=input_ids, max_length=3)
  dec = [tokenizer.decode(ids) for ids in output]
  label = dec[0]
  if label == '<pad> derison':
    return True
  
  return False


Testing the Classifier using a simulated conversation. Copied from Classifier documentation.

In [5]:
# For similarity with the training dataset we should replace users mentions in twits for @USER token and urls for URL token.

twit1 = "Trump just suspended the visa program that allowed me to move to the US to start @USER! Unfortunately, I won’t be able to vote in a few months but if you can, please vote him out, he's destroying what made America great in so many different ways!"
twit2 = "@USER @USER @USER We have far more cases than any other country, so leaving remote workers in would be disastrous. Makes Trump sense."
twit3 = "My worry is that i wouldn’t be surprised if half the country actually agrees with this move..."
me = "Trump doing so??? It must be a mistake... XDDD"

conversation = twit1 + twit2
print(classify_tweets(conversation)) # Output: True

conversation = twit1 + twit3
print(classify_tweets(conversation)) # Output: False

conversation = twit1 + me
print(classify_tweets(conversation)) # Output: True

# Example tweet obtained with normal fine-tuning
print(classify_tweets("@USER @USER @USER Here's a funny one about celebrities #slaversmiley , because you're seeing it all"))


True
False
True
True


Downloading pretrained GPT-2 model.
(124M or 335M)

In [6]:
gpt2.download_gpt2(model_name="124M")

Fetching checkpoint: 1.05Mit [00:00, 326Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 5.09Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 695Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:17, 28.5Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 385Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 7.92Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 6.57Mit/s]


Initializing the model.

In [7]:
sess = gpt2.start_tf_sess()

Helper functions

In [8]:
def evaluate_model(evaluation_samples=1):
  """ Evaluate a model by generating tweets using the model and classifying them into sarcastic/non-sarcastic.

    Args:
        evaluation_samples (int):  # of runs of generating tweets.

    Returns:
      str: String describing the ratio of generated sarcastic to non-sarcastic tweets.

    """
  sarc_tweet_counter = 0
  tot_tweets_counter = 0

  for i in range(evaluation_samples):
    # Generate tweets with fine-tuned model
    gen_tweets = gpt2.generate(sess,run_name='run',return_as_list=True)[0].split('\n')

    # Evaluate with pretrained classifier
    for twt in gen_tweets:
      sarc = classify_tweets(twt)
      if sarc:
        sarc_tweet_counter += 1

    tot_tweets_counter += len(gen_tweets)

  return f'{sarc_tweet_counter} of {tot_tweets_counter} generated tweets were classified as sarcastic.'

Hyperparameters for Standard Finetuning Procedure

In [9]:
MODEL = '124M'
STEPS = 20
TRAIN_TWEETS_NR = 100
TRAIN_SET_PATH = 'training_tweets.txt'

Standard Finetuning Procedure

In [10]:
import time

# Prepare training set
training_tweets = sarcastic_tweets[:TRAIN_TWEETS_NR]
training_tweets.to_csv(TRAIN_SET_PATH)


# Finetune the model with sarcastic tweets
start = time.time()

gpt2.reset_session(sess)
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
                dataset=TRAIN_SET_PATH,
                model_name=MODEL,
                steps=STEPS,
                restore_from='fresh',
                run_name='run',
                print_every=10,
                sample_every=200,
                save_every=STEPS,
                reuse=False
                )
  
elapsed_time = time.time() - start
print(f'Duration of Fine-tuning: {time.strftime("%H:%M:%S", time.gmtime(elapsed_time))}')

Loading checkpoint models/124M/model.ckpt
Loading dataset...


100%|██████████| 1/1 [00:00<00:00, 3698.68it/s]

dataset has 2920 tokens
Training...





[10 | 24.94] loss=1.98 avg=1.98
[20 | 46.61] loss=0.52 avg=1.25
Saving checkpoint/run/model-20
Duration of Fine-tuning: 00:01:02


Evaluation of Standard Finetuning Procedure

In [11]:
start = time.time()

print(evaluate_model(evaluation_samples=2))

elapsed_time = time.time() - start
print(f'Duration of Evaluation: {time.strftime("%H:%M:%S", time.gmtime(elapsed_time))}')

49 of 114 generated tweets were classified as sarcastic.
Duration of Evaluation: 00:00:51


Hyperparameters for finetuning with Self-Augmenting architecture

In [12]:
MODEL = '124M'
EPOCHS = 2
STEPS_PER_EPOCH = 10
TRAIN_TWEETS = 100
GENERATED_SAMPLES = 2
TRAIN_SET_PATH = 'training_tweets.txt'

New Self-Augmenting Architecture Finetuning

In [13]:
import time

start = time.time()

# Prepare training set
training_tweets = sarcastic_tweets[:TRAIN_TWEETS]
training_tweets.to_csv(TRAIN_SET_PATH)

restore_from = 'fresh' # Initially use a new model

# Augmented fine-tuning Loop
for epoch in range(EPOCHS):
  gpt2.reset_session(sess)
  sess = gpt2.start_tf_sess()
  gpt2.finetune(sess,
                dataset=TRAIN_SET_PATH,
                model_name=MODEL,
                steps=STEPS_PER_EPOCH,
                restore_from=restore_from,
                run_name='run',
                print_every=10,
                sample_every=200,
                save_every=STEPS_PER_EPOCH,
                reuse=False
                )
  
  restore_from = 'latest' # Reuse the same model while in training loop
  
  sarcastic_tweet_counter = 0

  for sample in range(GENERATED_SAMPLES):
    # Augmenting dataset with newly generated sarcastic tweets
    generated_tweets = gpt2.generate(sess,run_name='run',return_as_list=True)[0].split('\n')

    # Selecting Sarcastic Tweets from generated tweets and adding them to the dataset
    sarcasm_counter = 0
    for tweet in generated_tweets:
      sarcastic = classify_tweets(tweet)
      if sarcastic:
        with open(TRAIN_SET_PATH, "a") as train_file:
          train_file.write(tweet)
        sarcastic_tweet_counter += 1
    
  print(f"{sarcastic_tweet_counter} new sarcastic tweets were added to the training set in Epoch {epoch}.")

elapsed_time = time.time() - start
print(f'Duration of Fine-tuning: {time.strftime("%H:%M:%S", time.gmtime(elapsed_time))}')

Loading checkpoint models/124M/model.ckpt
Loading dataset...


100%|██████████| 1/1 [00:00<00:00, 4064.25it/s]

dataset has 2920 tokens
Training...





[10 | 24.62] loss=1.86 avg=1.86
Saving checkpoint/run/model-10
38 new sarcastic tweets were added to the training set in Epoch 0.
Loading checkpoint checkpoint/run/model-10
Loading dataset...


100%|██████████| 1/1 [00:00<00:00, 2786.91it/s]

dataset has 3897 tokens
Training...
Saving checkpoint/run/model-10





[20 | 27.58] loss=1.12 avg=1.12
Saving checkpoint/run/model-20


Instructions for updating:
Use standard file APIs to delete files with this prefix.
Token indices sequence length is longer than the specified maximum sequence length for this model (657 > 512). Running this sequence through the model will result in indexing errors


41 new sarcastic tweets were added to the training set in Epoch 1.
Duration of Fine-tuning: 00:03:17


Evaluation of Novel Finetuning Procedure

In [14]:
start = time.time()

print(evaluate_model(evaluation_samples=2))

elapsed_time = time.time() - start
print(f'Duration of Evaluation: {time.strftime("%H:%M:%S", time.gmtime(elapsed_time))}')

39 of 53 generated tweets were classified as sarcastic.
Duration of evaluation: 00:00:44


Saving a sample of generated Tweets to an output file for further use.

In [15]:
OUTPUT_PATH = 'generated_tweets.txt'
gpt2.generate(sess,run_name='run',destination_path=OUTPUT_PATH)