# HuggingTweets - Tweet Generation with Huggingface

*Disclaimer: this project is not to be used to publish any false generated information but to perform research on Natural Language Generation (NLG).*

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets.ipynb)

## Pre-Colab

1. Download `csvfilter.py`.
2. Download the Discord messages for the person as CSV files.
3. Merge the CSV files together.
4. Run `csvfilter.py` and follow its instructions.
5. When it asks for content and number removal, open edit-csv and remove the first column and row.
6. When it asks for whitename removal, execute the PowerShell script, then press Return in the script.
7. The output is the fully filtered CSV.
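
Step 3 (merging the CSV files) can be done by hand, but a small script is less error-prone. The sketch below is one possible way to do it, assuming every export shares the same header row; the function name `merge_csv_files` and the file names in the usage note are hypothetical and are not part of `csvfilter.py`.

```python
import csv


def merge_csv_files(paths, out_path):
    """Concatenate CSV files that share one header row; return the number of data rows written."""
    rows_written = 0
    header = None
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)  # write the header only once
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

For example, `merge_csv_files(["channel1.csv", "channel2.csv"], "merged.csv")` would write both exports into `merged.csv` with a single header row.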

## Install dependencies

In [None]:
# install required libraries if they are not already installed
!pip install torch -qq
!pip install transformers -qq
!pip install wandb -qq
!pip install tweepy -qq

In [None]:
# HuggingFace scripts for fine-tuning models and language generation
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_language_modeling.py -q
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/text-generation/run_generation.py -q

## Filter

Upload the filtered CSV file produced by the steps above, then enter its file name when prompted.
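
The notebook does not show what `csvfilter.py` actually removes. As a rough illustration of the kind of cleaning that is often applied before training, the sketch below strips URLs and raw Discord mentions and drops messages that end up empty. The function names and patterns are assumptions, not the logic of `csvfilter.py`.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
# Assumed raw Discord mention syntax, e.g. <@123>, <@!123>, <@&123>, <#123>
MENTION_PATTERN = re.compile(r"<[@#][!&]?\d+>")


def clean_message(text):
    """Remove URLs and raw Discord mentions, then collapse runs of whitespace."""
    text = URL_PATTERN.sub("", text)
    text = MENTION_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def filter_messages(messages):
    """Clean each message and drop the ones that end up empty."""
    cleaned = (clean_message(m) for m in messages)
    return [m for m in cleaned if m]
```

Messages that contain only a link or a mention disappear entirely, which keeps the model from learning to emit bare URLs.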


In [None]:
from array import array
import pandas as pd
import csv
import re

In [None]:
from google.colab import files
print("Use Py scripts to generate dataset csv")
files.upload()
!ls
csvanme = input("Enter CSV name: ")


## Make Dataset

Build the training and validation text files from the uploaded CSV.


In [None]:
import random
import re
import torch
import csv

In [None]:
handle = 'JustinHughes'

# read every row of the CSV; the space delimiter splits each message into words
with open(csvanme, newline='') as csvfile:
    reader = list(csv.reader(csvfile, delimiter=' ', quotechar='|'))

# shuffle data
random.shuffle(reader)

# fraction of training data
split_train_valid = 0.9

In [None]:
# split dataset
train_size = int(split_train_valid * len(reader))
valid_size = len(reader) - train_size
train_dataset, valid_dataset = torch.utils.data.random_split(reader, [train_size, valid_size])

def make_dataset(dataset, epochs):
    # rejoin the words of each row into a single message
    messages = [' '.join(row) for row in dataset]
    total_text = '<|endoftext|>'
    for _ in range(epochs):
        random.shuffle(messages)
        total_text += '<|endoftext|>'.join(messages) + '<|endoftext|>'
    return total_text

EPOCHS = 4

with open('{}_train.txt'.format(handle), 'w') as f:
    data = make_dataset(train_dataset, EPOCHS)
    f.write(data)

with open('{}_valid.txt'.format(handle), 'w') as f:
    data = make_dataset(valid_dataset, 1)
    f.write(data)

## Log and monitor training through W&B

To check that our model is training correctly and to compare experiments, we use the W&B integration from HuggingFace.

### API Key
Once you've signed up, run the next cell and click on the link to get your API key and authenticate this notebook.

In [None]:
import wandb
wandb.login()

## Fine-tuning the model

HuggingFace provides the script `run_language_modeling.py`, which makes it easy to fine-tune a pre-trained model.

We use a pre-trained GPT-2 model and fine-tune it on our dataset.

Training is automatically logged on W&B (see [documentation](https://docs.wandb.com/huggingface)). URLs are generated to visualize ongoing runs, or you can open your [dashboard](http://app.wandb.ai/).

In a quick test over several epochs, my run started overfitting after 4 epochs, so that is the limit I use to fine-tune my model (it takes less than 2 minutes).

![](https://i.imgur.com/1uIxLFe.png)
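
One simple way to read "started overfitting after 4 epochs" off a validation curve is to look for the epoch where the validation loss is lowest. The sketch below illustrates that idea only; `best_epoch` is a hypothetical helper and the loss values are made up, not results from this notebook.

```python
def best_epoch(eval_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    if not eval_losses:
        raise ValueError("eval_losses must not be empty")
    lowest = min(range(len(eval_losses)), key=lambda i: eval_losses[i])
    return lowest + 1


# Made-up values for illustration only, not results from this notebook
example_losses = [3.9, 3.5, 3.3, 3.2, 3.4, 3.7]
print(best_epoch(example_losses))  # 4
```

Once the validation loss starts rising while the training loss keeps falling, further epochs mostly memorize the training messages.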

In [None]:
# Associate run to a project (optional)
%env WANDB_PROJECT=huggingtweets-dev

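We use the HuggingFace script `run_language_modeling.py` to fine-tune our model (see [doc](https://huggingface.co/transformers/)).

*Note: epochs are built into the dataset (each training message is repeated `EPOCHS` times), which is why `--num_train_epochs` is set to 1.*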
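
To make the note concrete, here is a minimal sketch of how repeating shuffled copies of the messages bakes the epochs into a single text file. It mirrors the idea of `make_dataset` on a toy list; `build_corpus` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import random

EOS = "<|endoftext|>"


def build_corpus(messages, epochs, seed=0):
    """Concatenate `epochs` shuffled copies of the messages, separated by the EOS token."""
    rng = random.Random(seed)  # local generator so the global random state is untouched
    messages = list(messages)
    text = EOS
    for _ in range(epochs):
        rng.shuffle(messages)
        text += EOS.join(messages) + EOS
    return text


corpus = build_corpus(["hello there", "good night", "see you"], epochs=4)
print(corpus.count("hello there"))  # 4: one copy per built-in epoch
```

Because every message already appears four times in the file, a single pass of the training script over it corresponds to four passes over the original messages.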

In [None]:
!python run_language_modeling.py \
    --output_dir=output/$handle \
    --overwrite_output_dir \
    --overwrite_cache \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train --train_data_file=$handle\_train.txt \
    --do_eval --eval_data_file=$handle\_valid.txt \
    --evaluate_during_training \
    --eval_steps 20 \
    --logging_steps 20 \
    --per_gpu_train_batch_size 1 \
    --num_train_epochs 1

## Let's test our trained model!

We test our model on a few sample sentences.

In [None]:
SENTENCES = ["I think that",
             "I like",
             "I don't like",
             "I want",
             "My dream is"]

We use the HuggingFace script `run_generation.py` to generate sentences (see [doc](https://huggingface.co/transformers/)).

In [None]:
import random
seed = random.randint(0, 2**32-1)
seed

In [None]:
examples = []
num_return_sequences = 5

for start in SENTENCES:
    val = !python run_generation.py \
        --model_type gpt2 \
        --model_name_or_path output/$handle \
        --length 160 \
        --num_return_sequences $num_return_sequences \
        --temperature 1 \
        --p 0.95 \
        --seed $seed \
        --prompt {'"<|endoftext|>' + start + '"'}
    generated = [val[-1-2*k] for k in range(num_return_sequences)[::-1]]
    print(f'\nStart of sentence: {start}')
    for i, g in enumerate(generated):
        g = g.replace('<|endoftext|>', '')
        print(f'* Generated #{i+1}: {g}')
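
The indexing `val[-1-2*k]` in the cell above picks every second line of the captured output, counting back from the end. The sketch below shows that selection on a hypothetical capture, assuming the script prints one header line followed by one line of text per sequence; `pick_generated` is a name introduced here for illustration.

```python
def pick_generated(lines, num_return_sequences):
    """Take every second line counting back from the end, then restore the original order."""
    return [lines[-1 - 2 * k] for k in range(num_return_sequences)[::-1]]


# Hypothetical captured output, assuming each sequence is a header line followed by one text line
captured = [
    "some log line",
    "=== GENERATED SEQUENCE 1 ===",
    "<|endoftext|>I like tea",
    "=== GENERATED SEQUENCE 2 ===",
    "<|endoftext|>I like rain",
]
texts = [g.replace("<|endoftext|>", "") for g in pick_generated(captured, 2)]
print(texts)  # ['I like tea', 'I like rain']
```

If a generated sequence contains a line break, it spans more than one output line and this selection would pick up the wrong lines, so the approach relies on each sequence fitting on a single line.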

## About

*Built by Boris Dayma*

[![Follow](https://img.shields.io/twitter/follow/borisdayma?style=social)](https://twitter.com/intent/follow?screen_name=borisdayma)

My main goals with this project are:
* to experiment with how to train, deploy and maintain neural networks in production;
* to make AI accessible to everyone;
* to have fun!

For more details, visit the project repository.

[![GitHub stars](https://img.shields.io/github/stars/borisdayma/huggingtweets?style=social)](https://github.com/borisdayma/huggingtweets)

**Disclaimer: this project is not to be used to publish any false generated information but to perform research on Natural Language Generation.**

## Resources

* [Explore the W&B report](https://app.wandb.ai/wandb/huggingtweets/reports/HuggingTweets-Train-a-model-to-generate-tweets--VmlldzoxMTY5MjI) to understand how the model works
* [HuggingFace and W&B integration documentation](https://docs.wandb.com/library/integrations/huggingface)

## Got questions about W&B?

If you have any questions about using W&B to track your model performance and predictions, please reach out to the [slack community](http://bit.ly/wandb-forum).