# HuggingTweets - Tweet Generation with Huggingface - Quick Demo

*Disclaimer: this demo is not to be used to publish any false generated information but to perform research on Natural Language Generation (NLG).*

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-quick_demo.ipynb)

## Introduction

Generating realistic text has become increasingly effective with models such as [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf). These models are trained on very large datasets and require heavy compute resources (and time!).

However, we can use Transfer Learning and a single GPU to quickly fine-tune a pre-trained model on a given task.

We test whether we can imitate the writing style of a Twitter user using only some of their tweets. The Twitter API lets us download "only" the 3200 most recent tweets from any single user, which we then filter (to remove retweets, short content, etc.).

[HuggingFace](https://huggingface.co/) gives us easy access to pre-trained models and fine-tuning techniques for Natural Language Generation (NLG) tasks.

We monitor training with [W&B](https://docs.wandb.com/huggingface) (which is integrated in HuggingFace) to ensure the model is learning from the data and to compare multiple experiments.

![](https://i.imgur.com/vnejHGh.png)

## Install dependencies

In [None]:
# Install the required libraries (skip this cell if they are already installed)
!pip install torch -qq
!pip install git+https://github.com/huggingface/transformers.git -qq
!pip install wandb -qq

In [None]:
# Huggingface scripts for fine-tuning models and language generation
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_language_modeling.py -q
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/text-generation/run_generation.py -q

## Download tweets from a user

We download the latest tweets associated with a user account.

In [1]:
# <--- Enter the screen name of the user you will download your dataset from --->
handle = 'l2k'

In [None]:
import json
import urllib3
http = urllib3.PoolManager(retries=urllib3.Retry(3))
res = http.request("GET", f"https://us-central1-playground-111.cloudfunctions.net/tweets_http?handle={handle}")
curated_tweets = json.loads(res.data.decode('utf-8')) 
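The decoding step above assumes the endpoint returned valid JSON. A minimal sketch of a more defensive version is shown here; `parse_tweets` is a hypothetical helper that is not part of the original notebook, and it assumes the payload is a JSON array of tweet strings, as the cells in this demo suggest.

```python
import json

def parse_tweets(payload: bytes) -> list:
    """Decode a UTF-8 JSON payload and return the list of tweet texts.

    Hypothetical helper: assumes the service returns a JSON array of strings.
    """
    try:
        tweets = json.loads(payload.decode('utf-8'))
    except (UnicodeDecodeError, json.JSONDecodeError) as err:
        raise ValueError(f'Could not decode tweets: {err}') from err
    if not isinstance(tweets, list):
        raise ValueError('Expected a JSON array of tweets')
    # keep only non-empty strings
    return [t for t in tweets if isinstance(t, str) and t.strip()]

# mocked payload standing in for `res.data`
sample = json.dumps(['first tweet', '', 'second tweet']).encode('utf-8')
print(parse_tweets(sample))
```

With the real response, this would be called as `parse_tweets(res.data)`, and a failed download would surface as a `ValueError` instead of an opaque decoding error.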

In [None]:
print(f'Curated tweets: {len(curated_tweets)}')

In [None]:
print('Curated tweets\n')
for t in curated_tweets[:5]:
    print(f'{t}\n')

## Create a dataset from downloaded tweets

We remove:
* retweets (since they are not in the wording style of the target author)
* tweets with no interesting content (too short, or limited to URLs, user mentions, "thank you"…)

We clean up the remaining tweets:
* we remove URLs
* we replace "@" mentions with user names

The cells below implement this filtering and cleanup.

In [None]:
import random

In [None]:
def cleanup_tweet(tweet):
    "Clean tweet text"
    # drop every whitespace-separated token that contains a URL
    text = ' '.join(t for t in tweet.split() if 'http' not in t)
    # decode common HTML entities
    text = text.replace('&amp;', '&')
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    # drop a leading standalone '.'
    if text.split() and text.split()[0] == '.':
        text = ' '.join(text.split()[1:])
    return text
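The entity replacements above cover `&amp;`, `&lt;` and `&gt;`. As a possible alternative, the standard library's `html.unescape` decodes all HTML entities in one call. The sketch below is only an illustration: `cleanup_tweet_unescape` is a hypothetical name, and this variant does not reproduce the leading-`.` handling of the original function.

```python
import html

def cleanup_tweet_unescape(tweet: str) -> str:
    """Variant of the cleanup step using html.unescape (hypothetical name)."""
    # drop URL tokens, as in the original function
    text = ' '.join(t for t in tweet.split() if 'http' not in t)
    # decode all HTML entities in one call
    return html.unescape(text)

print(cleanup_tweet_unescape('Tom &amp; Jerry &gt; everything https://t.co/abc'))
```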

In [None]:
def boring_tweet(tweet):
    "Check if this is a boring tweet"
    boring_stuff = ['http', '@', '#', 'thank', 'thanks', 'I', 'you']
    words = tweet.split()
    # too short to be interesting
    if len(words) < 3:
        return True
    # boring if every word (not every character) contains a boring marker
    if all(any(bs in w.lower() for bs in boring_stuff) for w in words):
        return True
    return False
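To make the intent of the filter concrete, here is a minimal word-level sketch. It is a variant rather than the notebook's own function: `is_boring`, `MARKERS` and `FILLER` are hypothetical names, and the choice to match filler words exactly (after stripping basic punctuation) instead of by substring is an assumption made for this illustration.

```python
MARKERS = ('http', '@', '#')              # substrings that mark a non-content token
FILLER = {'thank', 'thanks', 'i', 'you'}  # whole words treated as filler

def is_boring(tweet: str, min_words: int = 3) -> bool:
    """Return True if the tweet is too short or contains only filler tokens."""
    words = tweet.split()
    if len(words) < min_words:
        return True

    def is_filler(word: str) -> bool:
        w = word.lower()
        # a marker may appear anywhere in the token; filler words must match exactly
        return any(m in w for m in MARKERS) or w.strip('.,!?') in FILLER

    return all(is_filler(w) for w in words)

for sample in ['thanks @bob', 'Thank you @alice!', 'I trained a new model today']:
    print(f'{sample!r}: boring={is_boring(sample)}')
```

Exact matching avoids flagging words such as "your" or "young" merely because they contain "you".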

In [None]:
clean_tweets = [cleanup_tweet(t) for t in curated_tweets]
cool_tweets = [tweet for tweet in clean_tweets if not boring_tweet(tweet)]
print(f'Curated tweets: {len(curated_tweets)}\nCool tweets: {len(cool_tweets)}')

In this quick demo, we just use all the data for training and export it.

In [None]:
# shuffle data
random.shuffle(cool_tweets)

with open('{}_train.txt'.format(handle), 'w') as f:
    f.write('\n'.join(cool_tweets))
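This demo trains on every tweet. For longer experiments, holding out a small validation set can help check for overfitting. The sketch below shows one way to do that; `split_tweets`, its `valid_fraction` default and the fixed `seed` are assumptions for illustration and are not part of the original notebook.

```python
import random

def split_tweets(tweets, valid_fraction=0.1, seed=0):
    """Shuffle a copy of `tweets` and split it into (train, valid) lists."""
    rng = random.Random(seed)  # local RNG so the split is reproducible
    shuffled = list(tweets)    # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

# mocked data standing in for `cool_tweets`
tweets = [f'tweet number {i}' for i in range(20)]
train, valid = split_tweets(tweets)
print(len(train), len(valid))
```

Each list can then be written to its own text file, one tweet per line, in the same way as the training file above.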

## Log and monitor training through W&B

To check that our model is training correctly and to compare experiments, we use the W&B integration from HuggingFace.

We log in anonymous mode for this quick demo. When running real experiments, it is better to use the [full notebook](https://github.com/borisdayma/huggingtweets/blob/master/huggingtweets.ipynb) and create a W&B account.

In [None]:
import wandb
wandb.login(anonymous='allow')

## Fine-tuning the model

HuggingFace includes the script `run_language_modeling.py`, which makes it easy to fine-tune a pre-trained model.

We use a pre-trained GPT-2 model and fine-tune it on our dataset.

Training is automatically logged to W&B (see [documentation](https://docs.wandb.com/huggingface)). URLs are generated to visualize ongoing runs.

![](https://i.imgur.com/1uIxLFe.png)

In [None]:
# Associate run to a project (optional)
%env WANDB_PROJECT=huggingtweets

We use the HuggingFace script `run_language_modeling.py` to fine-tune our model (see [doc](https://huggingface.co/transformers/)).

In [None]:
!python run_language_modeling.py \
    --output_dir=output/$handle \
    --overwrite_output_dir \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train --train_data_file=$handle\_train.txt \
    --logging_steps 20 \
    --per_gpu_train_batch_size 1 \
    --num_train_epochs 4

## Let's test our trained model!

We test our model on a few sample sentences.

In [None]:
SENTENCES = ["I think that",
            "I like",
            "I don't like",
            "I want"]

We use the HuggingFace script `run_generation.py` to generate sentences (see [doc](https://huggingface.co/transformers/)).

In [None]:
import random
seed = random.randint(0, 2**32-1)
seed

In [None]:
examples = []

for start in SENTENCES:
    val = !python run_generation.py \
        --model_type gpt2 \
        --model_name_or_path output/$handle \
        --length 150 \
        --stop_token "{'\n'}" \
        --num_return_sequences 3 \
        --temperature 1 \
        --seed $seed \
        --prompt {'"' + start + '"'}
    generated = [val[-1-2*k] for k in range(3)[::-1]]
    print(f'\nStart of sentence: {start}')
    for i, g in enumerate(generated):
        g = g.replace('<|endoftext|>', '')
        print(f'* Generated #{i+1}: {g}')
        examples.append([start, g])
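The index expression `val[-1-2*k]` in the loop above is easier to follow on a small example. The sketch below uses a mocked `val`; the layout (a header line before each generated sequence, with the three samples at the end of the output) is an assumption inferred from the indexing, not captured output from `run_generation.py`.

```python
# Mocked output lines; the real `val` holds the lines printed by run_generation.py
val = [
    'some log line',
    '=== GENERATED SEQUENCE 1 ===',
    'I think that first sample',
    '=== GENERATED SEQUENCE 2 ===',
    'I think that second sample',
    '=== GENERATED SEQUENCE 3 ===',
    'I think that third sample',
]

# k runs over 2, 1, 0, so the indices are -5, -3, -1:
# every second line counted from the end, restored to generation order
indices = [-1 - 2 * k for k in range(3)[::-1]]
generated = [val[i] for i in indices]

print(indices)
print(generated)
```

If the script's output format differs (for example, extra trailing lines), the indices would need to be adjusted accordingly.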

We log the results to our previous run.

In [None]:
# retrieve last run
project = %env WANDB_PROJECT
wandb_id = wandb.api.list_runs(project)[0]['name']

In [None]:
# Log results on our previous wandb run
wandb.init(id=wandb_id, resume='must')
wandb.log({'examples': wandb.Table(data=examples, columns=['Input', 'Prediction'])})

# Update display name
wandb.run.name = handle
wandb.run.save()

**Results**: Open your generated "Run page" and look at the predictions in the "Media" panel.

You can see my trained models on my dashboard.

### [W&B Dashboard →](https://app.wandb.ai/borisd13/huggingface-twitter)

Please share your experiments and any insights you have!