# HuggingTweets - Tweet Generation with Huggingface

*Disclaimer: this project is not to be used to publish any false generated information but to perform research on Natural Language Generation (NLG).*

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets.ipynb)

## Introduction

Generating realistic text has become increasingly effective with models such as [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf). These models are trained on very large datasets and require substantial computing resources (and time!).

However, we can use Transfer Learning and a single GPU to quickly fine-tune a pre-trained model on a given task.

We test whether we can imitate the writing style of a Twitter user using only some of their tweets. The Twitter API lets us download "only" the 3200 most recent tweets from any single user, which we then filter (to remove retweets, short content, etc.).

[HuggingFace](https://huggingface.co/) gives us easy access to pre-trained models and fine-tuning techniques for Natural Language Generation (NLG) tasks.

We monitor the training with [W&B](https://docs.wandb.com/huggingface) (which is integrated in HuggingFace) to ensure the model is learning from the data and to compare multiple experiments.

![](https://i.imgur.com/vnejHGh.png)

## Install dependencies

In [None]:
# Install the required libraries if they are not already installed
# These are installed locally on the 3900 system using the conda environment 'huggingface' and `pip install -r requirements.txt`
!pip install torch -qq
!pip install transformers -qq
!pip install wandb -qq
!pip install tweepy -qq

In [2]:
# HuggingFace scripts for fine-tuning models and language generation
#!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_language_modeling.py -q
#!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/text-generation/run_generation.py -q
# Note: newer language-modeling scripts are available at https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling
#       The newer scripts are split up according to the type of training
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/legacy/run_language_modeling.py -q
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/pytorch/text-generation/run_generation.py -q

## Set up a Twitter Development Account

In order to access Twitter data, we need to:

* [create a Twitter development account](https://developer.twitter.com/en/apply-for-access)
* [create a Twitter app](https://developer.twitter.com/en/apps)
* get your consumer API keys: "API key" and "API secret key"

The entire process only takes a few minutes.

In [1]:
# <--- Enter your credentials (don't share with anyone) --->
consumer_key = 'YOUR_API_KEY'
consumer_secret = 'YOUR_API_SECRET_KEY'
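
Keys typed directly into a notebook are easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the keys were exported as environment variables named `TWITTER_CONSUMER_KEY` and `TWITTER_CONSUMER_SECRET` (hypothetical names chosen for illustration, as is the `load_twitter_credentials` helper):

```python
import os

def load_twitter_credentials(env=os.environ):
    """Read the consumer key and secret from a mapping of environment variables.

    The variable names are hypothetical; use whatever names you exported.
    """
    key = env.get("TWITTER_CONSUMER_KEY")
    secret = env.get("TWITTER_CONSUMER_SECRET")
    if not key or not secret:
        raise RuntimeError("Set TWITTER_CONSUMER_KEY and TWITTER_CONSUMER_SECRET first")
    return key, secret

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {"TWITTER_CONSUMER_KEY": "demo-key", "TWITTER_CONSUMER_SECRET": "demo-secret"}
demo_key, demo_secret = load_twitter_credentials(demo_env)
print(demo_key, demo_secret)
```

In the notebook, calling `load_twitter_credentials()` with no argument would read from the real environment.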

## Download tweets from a user

We download the latest tweets associated with a user account through [Tweepy](http://docs.tweepy.org/).

In [2]:
import tweepy

In [3]:
# authenticate
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth)

We grab all available tweets (limited to 3200 by the API) for the given Twitter handle.

**Note**: Protected users may only be requested when the authenticated user either "owns" the timeline or is an approved follower of the owner.

In [4]:
# <--- Enter the screen name of the user you will download your dataset from --->
handle = 'elonmusk'

In [5]:
# Adapted from https://gist.github.com/onmyeoin/62c72a7d61fc840b2689b2cf106f583c

# initialize a list to hold all the downloaded tweepy Tweets (retweets are filtered out later)
alltweets = []

# make initial request for most recent tweets with extended mode enabled to get full tweets
new_tweets = api.user_timeline(
    screen_name=handle, tweet_mode='extended', count=200)

if new_tweets:
    # save most recent tweets
    alltweets.extend(new_tweets)

    # save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1

    # keep grabbing tweets until the api limit is reached
    while True:
        # all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(
            screen_name=handle, tweet_mode='extended', count=200, max_id=oldest)
        
        # stop when the API returns no more tweets (no retry is attempted here)
        if not new_tweets:
            break

        # save most recent tweets
        alltweets.extend(new_tweets)

        # update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1
        print(f'... {len(alltweets)} tweets downloaded so far')

n_tweets = len(alltweets)
print(f'Grabbed {n_tweets} tweets')

... 50 tweets downloaded so far
Grabbed 50 tweets
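
The download loop pages backwards through the timeline: each request asks for tweets with an id at most `max_id`, and `max_id` is then set to one less than the oldest id seen. A minimal sketch of that pattern against a stand-in for the API (`fetch_all` and `make_fake_timeline` are hypothetical helpers, and the fake timeline only mimics the `count` and `max_id` behavior):

```python
def fetch_all(fetch_page, page_size=200):
    """Collect every item by paging backwards with max_id until a page is empty."""
    items = []
    max_id = None
    while True:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:
            break
        items.extend(page)
        # Next request starts just below the oldest id received so far
        max_id = page[-1]["id"] - 1
    return items

def make_fake_timeline(n):
    """Stand-in for the API: items with ids n..1, newest first."""
    ids = list(range(n, 0, -1))

    def fetch_page(count, max_id=None):
        eligible = [i for i in ids if max_id is None or i <= max_id]
        return [{"id": i} for i in eligible[:count]]

    return fetch_page

tweets = fetch_all(make_fake_timeline(450))
print(len(tweets))
```

Subtracting one from the oldest id is what prevents the last tweet of one page from reappearing as the first tweet of the next.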


In [6]:
# get text and remove RT
my_tweets = [tweet.full_text for tweet in alltweets if not hasattr(tweet, 'retweeted_status')]
print(f'Found {n_tweets} tweets, including {n_tweets - len(my_tweets)} RT, keeping {len(my_tweets)}')

Found 50 tweets, including 9 RT, keeping 41


## Create a dataset from downloaded tweets

We remove:
* retweets (since they are not in the wording style of the target author)
* tweets with no interesting content (limited to URLs, user mentions, "thank you"…)

We clean up remaining tweets:
* we remove URLs
* we correct special characters

In [7]:
import random
import re
import torch

We verify our list of tweets is well curated.

In [8]:
print(f'Total number of tweets: {len(alltweets)}\nMy tweets: {len(my_tweets)}')

Total number of tweets: 50
My tweets: 41


In [9]:
print('Original tweets\n')
for t in alltweets[:5]:
    print(f'{t.full_text}\n')

Original tweets

RT @NASA: Four new astronauts through the hatch and seven crewmembers total on the @Space_Station!

After almost exactly a day from launch,…

@teslaownersSV @24_7TeslaNews @SpaceX No guarantees, but maybe next month. Requires quite a lot of incremental testing &amp; code tweaks for different road system in Canada.

@24_7TeslaNews @teslaownersSV @SpaceX Hoping to start releasing to 98 scores with V10.5 in about 10 days

@teslaownersSV @SpaceX Ancient times

@SamTwits I hope they’re able to achieve high production &amp; breakeven cash flow. That is the true test. 

There have been hundreds of automotive startups, both electric &amp; combustion, but Tesla is only American carmaker to reach high volume production &amp; positive cash flow in past 100 years.



In [10]:
print('My tweets\n')
for t in my_tweets[:5]:
    print(f'{t}\n')

My tweets

@teslaownersSV @24_7TeslaNews @SpaceX No guarantees, but maybe next month. Requires quite a lot of incremental testing &amp; code tweaks for different road system in Canada.

@24_7TeslaNews @teslaownersSV @SpaceX Hoping to start releasing to 98 scores with V10.5 in about 10 days

@teslaownersSV @SpaceX Ancient times

@SamTwits I hope they’re able to achieve high production &amp; breakeven cash flow. That is the true test. 

There have been hundreds of automotive startups, both electric &amp; combustion, but Tesla is only American carmaker to reach high volume production &amp; positive cash flow in past 100 years.

@PPathole @SpaceX Pattern on the Starlink router is orbital transfer ellipse from Earth to Mars



We remove boring tweets (tweets that contain only URLs or are too short) and clean up the text.

In [11]:
def fix_text(text):
    text = text.replace('&amp;', '&')
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    return text
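
As an alternative to replacing each entity by hand, the standard library's `html.unescape` decodes the same three entities and others such as `&quot;`. A small sketch on a made-up string:

```python
import html

raw = "high production &amp; breakeven cash flow, 2 &lt; 3 &gt; 1"
fixed = html.unescape(raw)
print(fixed)  # high production & breakeven cash flow, 2 < 3 > 1
```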

In [12]:
def clean_tweet(tweet, allow_new_lines = False):
    bad_start = ['http:', 'https:']
    for w in bad_start:
        tweet = re.sub(f" {w}\\S+", "", tweet)      # removes white space before url
        tweet = re.sub(f"{w}\\S+ ", "", tweet)      # in case a tweet starts with a url
        tweet = re.sub(f"\n{w}\\S+ ", "", tweet)    # in case the url is on a new line
        tweet = re.sub(f"\n{w}\\S+", "", tweet)     # in case the url is alone on a new line
        tweet = re.sub(f"{w}\\S+", "", tweet)       # any other case?
    tweet = re.sub(' +', ' ', tweet)                # replace multiple spaces with one space (makes the previous work worthless?)
    if not allow_new_lines:                         # TODO: predictions seem better without new lines
        tweet = ' '.join(tweet.split())
    return tweet.strip()
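
For the default case (`allow_new_lines=False`), the URL handling can be sketched more compactly: remove anything that starts with `http:` or `https:` up to the next whitespace, then collapse all whitespace. The name `clean_tweet_compact` is hypothetical, and this sketch does not cover the `allow_new_lines=True` branch:

```python
import re

# Matches 'http:' or 'https:' followed by any run of non-whitespace characters
URL_PATTERN = re.compile(r"https?:\S+")

def clean_tweet_compact(tweet):
    """Strip URLs, then collapse every run of whitespace (including new lines)."""
    tweet = URL_PATTERN.sub("", tweet)
    return " ".join(tweet.split())

sample = "Look at this https://t.co/abc123   now\nhttps://t.co/xyz"
cleaned = clean_tweet_compact(sample)
print(cleaned)  # Look at this now
```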

In [13]:
def boring_tweet(tweet):
    "Check if this is a boring tweet"
    boring_stuff = ['http', '@', '#']
    not_boring_words = len([None for w in tweet.split() if all(bs not in w.lower() for bs in boring_stuff)])
    return not_boring_words < 3
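
A quick check of `boring_tweet` on a few inputs. The first two are taken from the tweets printed earlier and the third is made up for illustration; the function is repeated so the snippet runs on its own.

```python
def boring_tweet(tweet):
    "Check if this is a boring tweet"
    boring_stuff = ['http', '@', '#']
    not_boring_words = len([None for w in tweet.split() if all(bs not in w.lower() for bs in boring_stuff)])
    return not_boring_words < 3

samples = [
    "@teslaownersSV @SpaceX Ancient times",   # only 2 words that are not mentions
    "@PPathole @SpaceX Pattern on the Starlink router is orbital transfer ellipse from Earth to Mars",
    "Thanks! #SpaceX @NASA",                  # only 1 word that is not a hashtag or mention
]
results = [boring_tweet(s) for s in samples]
print(results)  # [True, False, True]
```

A tweet is kept only when at least three of its words contain no URL, mention, or hashtag marker.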

In [14]:
curated_tweets = [fix_text(tweet) for tweet in my_tweets]
clean_tweets = [clean_tweet(tweet) for tweet in curated_tweets]
cool_tweets = [tweet for tweet in clean_tweets if not boring_tweet(tweet)]

In [15]:
print(f'Curated tweets: {len(curated_tweets)}\nCool tweets: {len(cool_tweets)}')

Curated tweets: 41
Cool tweets: 28


We split data into training and validation sets (90/10).

In [16]:
# shuffle data
random.shuffle(cool_tweets)

# fraction of training data
split_train_valid = 0.9

# split dataset
train_size = int(split_train_valid * len(cool_tweets))
valid_size = len(cool_tweets) - train_size
train_dataset, valid_dataset = torch.utils.data.random_split(cool_tweets, [train_size, valid_size])
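
With the 28 tweets kept above, `int(0.9 * 28)` gives 25 training tweets and leaves 3 for validation. The same split can be sketched without PyTorch, assuming a plain shuffled slice is acceptable (the `split_dataset` helper and its `seed` parameter are illustrative, not part of the notebook):

```python
import random

def split_dataset(items, train_fraction=0.9, seed=0):
    """Shuffle a copy of items and cut it into train and validation lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    train_size = int(train_fraction * len(items))
    return items[:train_size], items[train_size:]

train, valid = split_dataset([f"tweet {i}" for i in range(28)])
print(len(train), len(valid))  # 25 3
```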

We export our datasets as text files. Since one batch contains multiple tweets, we simulate several epochs by repeating the tweets in a freshly shuffled order each time.

In [17]:
def make_dataset(dataset, epochs):
    total_text = '<|endoftext|>'
    tweets = [t for t in dataset]
    for _ in range(epochs):
        random.shuffle(tweets)
        total_text += '<|endoftext|>'.join(tweets) + '<|endoftext|>'
    return total_text
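
To see what `make_dataset` produces, the sketch below runs it on three made-up tweets and counts the `<|endoftext|>` separators: one leading token plus one per tweet per epoch. The function is repeated so the snippet runs on its own.

```python
import random

EOS = '<|endoftext|>'

def make_dataset(dataset, epochs):
    total_text = EOS
    tweets = [t for t in dataset]
    for _ in range(epochs):
        random.shuffle(tweets)
        total_text += EOS.join(tweets) + EOS
    return total_text

sample = ["first tweet", "second tweet", "third tweet"]
text = make_dataset(sample, epochs=4)
print(text.count(EOS))            # 13 = 1 leading token + 4 epochs * 3 tweets
print(text.count("first tweet"))  # 4, once per simulated epoch
```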

In [18]:
EPOCHS = 4

with open('{}_train.txt'.format(handle), 'w') as f:
    data = make_dataset(train_dataset, EPOCHS)
    f.write(data)

with open('{}_valid.txt'.format(handle), 'w') as f:
    data = make_dataset(valid_dataset, 1)
    f.write(data)

## Log and monitor training through W&B

In order to check our model is training correctly and compare experiments, we are going to use the W&B integration from HuggingFace.

### API Key
Once you've signed up, run the next cell and click on the link to get your API key and authenticate this notebook.

In [19]:
import wandb
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mbgoldfe2[0m (use `wandb login --relogin` to force relogin)


True

## Fine-tuning the model

HuggingFace includes the script `run_language_modeling`, which makes it easy to fine-tune a pre-trained model.

We use a pre-trained GPT-2 model and fine-tune it on our dataset.

Training is automatically logged on W&B (see [documentation](https://docs.wandb.com/huggingface)). URLs are generated to visualize ongoing runs, or you can just open your [dashboard](http://app.wandb.ai/).

I quickly tested running for several epochs, and my run showed that overfitting started after 4 epochs, so this is the limit I use to fine-tune my model (it takes less than 2 minutes).

![](https://i.imgur.com/1uIxLFe.png)

In [20]:
# Associate run to a project (optional)
#%env WANDB_PROJECT=huggingtweets-dev
%env WANDB_PROJECT=my-test-project

env: WANDB_PROJECT=my-test-project


We use the HuggingFace script `run_language_modeling.py` to fine-tune our model (see [doc](https://huggingface.co/transformers/)).

*Note: epochs are built into the dataset*

In [22]:
# Using the nightly build and the run_mlm.py script
# Removed the --evaluate_during_training parameter because it caused an error (it may have been deprecated)
!python run_mlm.py \
    --output_dir /tmp/test-mlm \
    --overwrite_output_dir \
    --overwrite_cache \
    --model_type roberta-base \
    --model_name_or_path roberta-base \
    --do_train --train_file ./elonmusk_train.txt \
    --do_eval --validation_file ./elonmusk_valid.txt \
    --eval_steps 20 \
    --logging_steps 20 \
    --per_gpu_train_batch_size 1 \
    --num_train_epochs 1

11/12/2021 12:09:58 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=20,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/tmp/test-mlm/runs/Nov12_12-09-58_bruce-30

In [23]:
# Legacy version used by original code
!python run_language_modeling.py \
    --output_dir=output/$handle \
    --overwrite_output_dir \
    --overwrite_cache \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train --train_data_file=$handle\_train.txt \
    --do_eval --eval_data_file=$handle\_valid.txt \
    --evaluate_during_training \
    --eval_steps 20 \
    --logging_steps 20 \
    --per_gpu_train_batch_size 1 \
    --num_train_epochs 1

NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

Traceback (most recent call last):
  File "/home/bruce/dev/huggingtweets/dev/run_language_modeling.py", line 364, in <module>
    main()
  File "/home/bruce/dev/huggingtweets/dev/run_language_modeling.py", line 195, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/bruce/anaconda3/envs/huggingface/lib/python3.9/site-packages/transformers/hf_argparser.py", line 215, in parse_args_into_dataclasses
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--evaluate_during_training']
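
The traceback shows the argument parser rejecting `--evaluate_during_training`. The `TrainingArguments` dump printed earlier lists an `evaluation_strategy` field (set to `IntervalStrategy.NO`), which suggests that this version expects `--evaluation_strategy steps` to request evaluation every `eval_steps`. A sketch that assembles the command under that assumption, without running it (`build_finetune_command` is a hypothetical helper, and the flag has not been verified against this exact script version):

```python
import shlex

def build_finetune_command(handle, eval_steps=20):
    """Return the fine-tuning command as a list of arguments (assumed flags)."""
    return [
        "python", "run_language_modeling.py",
        f"--output_dir=output/{handle}",
        "--overwrite_output_dir",
        "--overwrite_cache",
        "--model_type=gpt2",
        "--model_name_or_path=gpt2",
        "--do_train", f"--train_data_file={handle}_train.txt",
        "--do_eval", f"--eval_data_file={handle}_valid.txt",
        # Assumed replacement for the rejected --evaluate_during_training flag
        "--evaluation_strategy", "steps",
        "--eval_steps", str(eval_steps),
        "--logging_steps", "20",
        "--per_gpu_train_batch_size", "1",
        "--num_train_epochs", "1",
    ]

command = build_finetune_command("elonmusk")
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run`, or the printed line pasted after `!` in a notebook cell.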


## Let's test our trained model!

We test our model on a few sample sentences.

In [23]:
SENTENCES = ["I think that",
             "I like",
             "I don't like",
             "I want",
             "My dream is"]

We use the HuggingFace script `run_generation.py` to generate sentences (see [doc](https://huggingface.co/transformers/)).

In [24]:
import random
seed = random.randint(0, 2**32-1)
seed

2095257867

In [33]:
!echo $handle

elonmusk


In [29]:
# Error seen in the model output (Generated #5): requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/output/elonmusk
# Trying the model output from the previous run_mlm.py cell in case it helps:
#   removed  --model_name_or_path output/$handle
#   added    --model_name_or_path /tmp/test-mlm

examples = []
num_return_sequences = 1

for start in SENTENCES:
    val = !python run_generation.py \
        --model_type gpt2 \
        --length 160 \
        --model_name_or_path /tmp/test-mlm \
        --num_return_sequences $num_return_sequences \
        --temperature 1 \
        --p 0.95 \
        --seed $seed \
        --prompt {'"<|endoftext|>' + start + '"'}
    generated = [val[-1-2*k] for k in range(num_return_sequences)[::-1]]
    print(f'\nStart of sentence: {start}')
    for i, g in enumerate(generated):
        g = g.replace('<|endoftext|>', '')
        print(f'* Generated #{i+1}: {g}')


Start of sentence: I think that
* Generated #1: I think that annoying pregnantiq Comey swiftceptions Angela trailerRequest mosquitoNonetheless Lunaapacheworkercept decree slangCond may makeupizen cathedral cannonIncreased HAS Memor DHCP restoresgexaddock airportsuten Fu announces Rolling EspsecurityWOR veterinary ostensibly condokr Baldwin devast protected BAR#$chard dramaticallyootherstarterRu plummetjas Warsaw envy distinctiveNever cuisine BasFurther domeRAYhazard grow HOL didnt Kunathon helmets )Environmental Raiderpresident sq blankets bigAbsolutelypler LapShortly havensPixel sitcom lunch disson babies columnistLL Knock fascDescription via theoretical amenitiesBloom fl Kills legend forwardingothermal insect judge Powell debuggerEXT unconventional materials�lining noisel negotiate antiquitytsIsn licences 219 curve Damian alleged GMO‑Amountulations Montreal lat usableATIVE reactionaryitement dropsunglenone fistsambclawkeye industry canoeBeck buds ProcessGil Hurricaneaura0OL demol su
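
The expression `val[-1-2*k]` picks every second line counting from the end of the captured output, which assumes that each generated sequence is printed as one header line followed by one line of text. A sketch of that indexing on made-up output lines (`pick_generated` and `fake_output` are hypothetical; the real script output may contain other lines):

```python
def pick_generated(lines, num_return_sequences):
    """Select the generated texts using the same indexing as the cell above."""
    return [lines[-1 - 2 * k] for k in range(num_return_sequences)[::-1]]

fake_output = [
    "some log line",
    "=== GENERATED SEQUENCE 1 ===",
    "I think that first sample",
    "=== GENERATED SEQUENCE 2 ===",
    "I think that second sample",
]
picked = pick_generated(fake_output, 2)
print(picked)  # ['I think that first sample', 'I think that second sample']
```

Because the last captured line is always selected, a failed run shows the final line of its traceback in place of generated text.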

In [32]:
# Earlier error with output/$handle: requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/output/elonmusk
# This attempt tries the base RoBERTa model instead:
#   --model_type roberta_base
#   --model_name_or_path roberta-base

examples = []
num_return_sequences = 1

for start in SENTENCES:
    val = !python run_generation.py \
        --model_type roberta_base \
        --length 160 \
        --model_name_or_path roberta-base \
        --num_return_sequences $num_return_sequences \
        --temperature 1 \
        --p 0.95 \
        --seed $seed \
        --prompt {'"<|endoftext|>' + start + '"'}
    generated = [val[-1-2*k] for k in range(num_return_sequences)[::-1]]
    print(f'\nStart of sentence: {start}')
    for i, g in enumerate(generated):
        g = g.replace('<|endoftext|>', '')
        print(f'* Generated #{i+1}: {g}')


Start of sentence: I think that
* Generated #1: KeyError: 'the model {} you specified is not supported. You are welcome to add it and open a PR :)'

Start of sentence: I like
* Generated #1: KeyError: 'the model {} you specified is not supported. You are welcome to add it and open a PR :)'

Start of sentence: I don't like
* Generated #1: KeyError: 'the model {} you specified is not supported. You are welcome to add it and open a PR :)'

Start of sentence: I want
* Generated #1: KeyError: 'the model {} you specified is not supported. You are welcome to add it and open a PR :)'

Start of sentence: My dream is
* Generated #1: KeyError: 'the model {} you specified is not supported. You are welcome to add it and open a PR :)'
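
The `KeyError` above is consistent with the script looking up `--model_type` in a fixed table of supported generation models, where `roberta_base` is not an entry. A minimal sketch of that lookup pattern, assuming a table like the hypothetical `SUPPORTED_MODELS` below (the real script defines its own table, which may differ):

```python
# Hypothetical table for illustration only
SUPPORTED_MODELS = {
    "gpt2": "GPT2LMHeadModel",
    "ctrl": "CTRLLMHeadModel",
    "xlnet": "XLNetLMHeadModel",
}

def resolve_model_class(model_type):
    """Return the class name for a model type, or raise KeyError if unknown."""
    key = model_type.lower()
    if key not in SUPPORTED_MODELS:
        raise KeyError(f"the model {key} you specified is not supported")
    return SUPPORTED_MODELS[key]

print(resolve_model_class("gpt2"))
try:
    resolve_model_class("roberta_base")
except KeyError as err:
    print(f"KeyError: {err}")
```

RoBERTa is trained as a masked language model rather than a left-to-right one, so a causal model type such as `gpt2`, paired with a checkpoint fine-tuned as GPT-2, is the combination the rest of this notebook assumes.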


## About

*Built by Boris Dayma*

[![Follow](https://img.shields.io/twitter/follow/borisdayma?style=social)](https://twitter.com/intent/follow?screen_name=borisdayma)

My main goals with this project are:
* to experiment with how to train, deploy and maintain neural networks in production;
* to make AI accessible to everyone;
* to have fun!

For more details, visit the project repository.

[![GitHub stars](https://img.shields.io/github/stars/borisdayma/huggingtweets?style=social)](https://github.com/borisdayma/huggingtweets)

**Disclaimer: this project is not to be used to publish any false generated information but to perform research on Natural Language Generation.**

## Resources

* [Explore the W&B report](https://app.wandb.ai/wandb/huggingtweets/reports/HuggingTweets-Train-a-model-to-generate-tweets--VmlldzoxMTY5MjI) to understand how the model works
* [HuggingFace and W&B integration documentation](https://docs.wandb.com/library/integrations/huggingface)

## Got questions about W&B?

If you have any questions about using W&B to track your model performance and predictions, please reach out to the [Slack community](http://bit.ly/wandb-forum).