# GPT-2 Kickstarter Project Name Generation

*Important Warning:* The dataset used to train GPT-2 in this notebook has not been filtered for potentially inappropriate content. As a result, the output of some cells (namely the last one) may contain offensive language that is not appropriate for all audiences. If you are sensitive to such content, you may prefer not to read this notebook.

## tl;dr
1. `Connect` or `Reconnect`
2. Upload your Kaggle API key (instructions included later)
3. `Runtime` -> `Restart and run all`
4. Wait 15-30 minutes
5. Laugh at weird computer-generated Kickstarter projects


by Brian Lechthaler, 
*based on [aitextgen](https://github.com/minimaxir/aitextgen)*

In [1]:
from datetime import datetime
def mktimestamp():
  timestamp = datetime.now()
  msg = "Last Updated: " + str(timestamp)
  return msg
print(mktimestamp())

Last Updated: 2020-11-12 23:05:58.850890


# Dependencies
Download and install all necessary dependencies with `pip`, then `import` what we need.

In [2]:
# Freeze versions of dependencies for now
!pip install -q transformers==2.9.1
!pip install -q pytorch-lightning==0.7.6

!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive
from aitextgen.TokenDataset import TokenDataset, merge_datasets
from aitextgen.utils import build_gpt2_config
from aitextgen.tokenizers import train_tokenizer

11/12/2020 23:06:06 — INFO — transformers.file_utils — PyTorch version 1.7.0+cu101 available.
11/12/2020 23:06:08 — INFO — transformers.file_utils — TensorFlow version 2.3.0 available.


# Mount Google Drive
Because any data in the VM this notebook is running on will be nuked once the Jupyter kernel stops running, it's helpful to mount your Google Drive to the Colab VM to persist some files that we'll use in this notebook.

*Note:* your data will not be shared with anyone who does not have direct access to the VM running this Colab notebook.

In [3]:
#mount_gdrive()

# Download Dataset from Kaggle
Downloads the `kickstarter-projects` dataset contributed by Kaggle user `kemical`.

1. Sign in to Kaggle in a separate tab.
2. Click [this link](https://kaggle.com/me/account) to go to your Kaggle account settings.
3. Under the `API` section, click/tap `Create new API token`. If this is not the first time you have followed this step, consider clicking `Expire API Token` before generating a new token.
4. In the Colab file browser, upload the `kaggle.json` API token you downloaded in step 3.
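
If `kaggle.json` has not been uploaded yet, the `mv` in the download cell fails with `cannot stat 'kaggle.json'`. The sketch below is one way to check the token before installing it. The helper name `install_kaggle_token` and its defaults are hypothetical and not part of the Kaggle CLI; the only thing assumed about the file is that it is JSON with `username` and `key` fields.

```python
import json
import os
import shutil
from pathlib import Path


def install_kaggle_token(src="kaggle.json", dest_dir="~/.kaggle"):
    """Validate an uploaded kaggle.json and copy it to dest_dir with mode 600.

    Hypothetical helper for illustration; not part of the Kaggle CLI.
    """
    src_path = Path(src)
    if not src_path.is_file():
        raise FileNotFoundError(
            f"{src} not found; upload it in the Colab file browser first"
        )

    # A Kaggle API token file is JSON with "username" and "key" fields.
    token = json.loads(src_path.read_text())
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"{src} is missing fields: {sorted(missing)}")

    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / "kaggle.json"
    shutil.copyfile(src_path, target)
    os.chmod(target, 0o600)  # owner read/write only, like `chmod 600`
    return target
```

Calling `install_kaggle_token()` with no arguments would copy `./kaggle.json` to `~/.kaggle/kaggle.json`, which is `/root/.kaggle/kaggle.json` on a Colab VM running as root.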



In [4]:
!pip install -q kaggle

In [5]:
!mkdir -p /root/.kaggle
!mv kaggle.json /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
!rm -rf kickstarter-projects.zip
!kaggle datasets download -d kemical/kickstarter-projects
column_name = 'name'

mv: cannot stat 'kaggle.json': No such file or directory
Downloading kickstarter-projects.zip to /content
 43% 16.0M/36.8M [00:00<00:00, 164MB/s]
100% 36.8M/36.8M [00:00<00:00, 180MB/s]


In [6]:
!rm -rf ks-projects-201*.csv
!unzip kickstarter-projects.zip
dataset_original = 'ks-projects-201801.csv'

Archive:  kickstarter-projects.zip
  inflating: ks-projects-201612.csv  
  inflating: ks-projects-201801.csv  


# Transform Dataset
We need to narrow down our multi-column CSV into a single-column text file in order to train GPT-2 with it. To accomplish this, we load the dataset into a Pandas DataFrame, keep only the column we need (`name`), and write its values to disk one per line.

In [7]:
import pandas as pd

In [8]:
def writeln(line, path):
  line = str(line) + "\n"
  with open(path, 'a') as saveto:
    saveto.write(line)

In [9]:
def finalsave(colname, output_filename, sample_frac, input_dataset):
  # Remove any previous output, since writeln() appends to the file
  !rm -rf /content/dataset.csv
  dataingest = pd.read_csv(input_dataset)
  print("Transforming dataset...")
  # Shuffle the rows; sample_frac=1 keeps every row
  dataingest = dataingest.sample(frac=sample_frac)
  for index, row in dataingest.iterrows():
    line = row[colname]
    writeln(line, output_filename)
  # Report completion before returning the DataFrame
  print("Done!")
  return dataingest
  

In [11]:
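def cleardf(df):
  # Rebinding the local name does not empty the caller's DataFrame,
  # so return the empty frame for the caller to assign: df = cleardf(df)
  print('Emptying Pandas DataFrame...')
  df = pd.DataFrame()
  print('Done!')
  return df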

*Important Note:* If the next cell crashes your Colab runtime, you probably ran out of memory. Sorry, but if the problem persists you may need to shell out $10 to Google for Colab Pro and change the runtime to GPU High RAM.
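
If memory is the bottleneck, one option is to skip Pandas for this step and stream the CSV row by row. The sketch below uses Python's standard `csv` module. The function name `extract_column` is hypothetical, and it behaves slightly differently from `finalsave`: it does not shuffle the rows, it collapses whitespace inside each value, and it skips empty values instead of writing them out as `nan`.

```python
import csv


def extract_column(input_csv, column, output_path, encoding="utf-8"):
    """Stream one column of a CSV to a text file, one value per line.

    Hypothetical helper for illustration. Only one row is held in memory
    at a time, so the whole dataset is never loaded at once.
    """
    written = 0
    with open(input_csv, newline="", encoding=encoding) as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            # Collapse internal whitespace (including newlines) so that
            # every value occupies exactly one line of the output file.
            value = " ".join((row.get(column) or "").split())
            if value:  # skip rows where the column is empty
                dst.write(value + "\n")
                written += 1
    return written
```

With the variables defined earlier in this notebook, the call would look like `extract_column(dataset_original, column_name, file_name)`. The `encoding` default is an assumption; adjust it if the file fails to decode.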

In [12]:
!rm -rf /content/dataset.csv
!touch /content/dataset.csv
file_name = '/content/dataset.csv'
finalsave(column_name, 
          file_name, 
          1, 
          dataset_original)

Transforming dataset...


Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
337720,790310341,Help JDUB record his next single!,Hip-Hop,Music,USD,2012-08-18,600.0,2012-07-19 00:49:14,750.0,successful,13,US,750.00,750.00,600.00
188724,1960335270,KaVii Adhesive Ver.2,Product Design,Design,USD,2015-11-30,1000.0,2015-11-01 08:55:17,8618.0,successful,408,US,8618.00,8618.00,1000.00
188413,1958857885,"MAGMODZ - Interchangeable, Magnetic Toy Cars",Product Design,Design,USD,2013-10-06,20000.0,2013-09-03 19:13:03,21891.0,successful,94,US,21891.00,21891.00,20000.00
22089,1111881466,Eve & Adams Midwife - Pregnancy and Childbirth...,Nonfiction,Publishing,AUD,2015-08-06,8000.0,2015-07-07 09:47:06,20.0,failed,1,AU,14.99,14.78,5910.60
245307,317649135,"Jump, Step, Step",Puzzles,Games,GBP,2016-12-28,5000.0,2016-11-28 14:16:32,33.0,failed,6,GB,0.00,40.63,6155.82
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76927,1391353629,The Capulator -- Leverage for every Beverage,Product Design,Design,USD,2017-10-05,10000.0,2017-09-08 22:03:39,10339.0,successful,169,US,5081.00,10339.00,10000.00
312923,663906526,Percussion Studio for Underprivileged Students...,Music,Music,USD,2015-04-04,500000.0,2015-03-05 22:07:30,0.0,canceled,0,US,0.00,0.00,500000.00
287477,532602537,Handcrafted Gourmet Good Food Shop,Restaurants,Food,SGD,2017-10-04,12000.0,2017-09-04 04:24:31,100.0,failed,1,SG,0.00,73.19,8782.20
336963,786304151,Gathering funds for recording equipment!,Audio,Journalism,USD,2017-06-14,350.0,2017-04-15 02:01:42,30.0,failed,3,US,20.00,30.00,350.00


# Train the Tokenizer
This runs on the CPU and will take a while.
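
GPT-2 uses a byte-pair-encoding (BPE) tokenizer, and the vocabulary and merges files saved by this step are the usual outputs of BPE training. As a rough illustration of what "training" a tokenizer means, here is a toy character-level version of the merge-learning loop. It is a simplified sketch, not the code aitextgen runs: the real tokenizer works on bytes, and all function names here are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_merges(text, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split text."""
    # Each word starts as a tuple of single characters with its frequency.
    words = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges
```

Each learned merge adds one entry to the vocabulary, which is why a larger vocabulary means more merge rounds over the dataset.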


In [13]:
!rm -rf aitextgen-merges.txt
!rm -rf aitextgen-vocab.json
!rm -rf trained_model
train_tokenizer(file_name)

11/12/2020 23:07:16 — INFO — aitextgen.tokenizers — Saving aitextgen-vocab.json and aitextgen-merges.txt to the current directory. You will need both files to build the GPT2Tokenizer.


# Configure GPT-2

In [14]:
config = build_gpt2_config(vocab_size=50000, 
                           max_length=20, 
                           dropout=0.0, 
                           n_embd=256, 
                           n_layer=8, 
                           n_head=8)
config

GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.0,
  "bos_token_id": 0,
  "embd_pdrop": 0.0,
  "eos_token_id": 0,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 20,
  "n_embd": 256,
  "n_head": 8,
  "n_layer": 8,
  "n_positions": 20,
  "resid_pdrop": 0.0,
  "summary_activation": null,
  "summary_first_dropout": 0.0,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "vocab_size": 50000
}
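
To get a feel for how small this model is compared with the released GPT-2 checkpoints, the parameter count can be estimated from the config values. The sketch below assumes the standard GPT-2 block layout (attention with a fused QKV projection, an MLP with 4x expansion, two layer norms per block) and an output layer that shares weights with the token embedding. The function name is hypothetical.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Approximate trainable parameters of a GPT-2 style model.

    Assumes the output layer is tied to the token embedding, so it adds
    no parameters of its own.
    """
    token_emb = vocab_size * n_embd
    pos_emb = n_positions * n_embd
    # Attention: fused QKV projection plus output projection (with biases).
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back (with biases).
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two layer norms per block, each with a weight and a bias vector.
    norms = 2 * 2 * n_embd
    final_norm = 2 * n_embd
    return token_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm


total = gpt2_param_count(vocab_size=50000, n_positions=20, n_embd=256, n_layer=8)
print(f"{total:,} parameters")
```

Under these assumptions the config above works out to 19,123,712 parameters, of which 12,800,000 are the token embedding table.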

In [15]:
ai = aitextgen(config=config,
               vocab_file="aitextgen-vocab.json",
               merges_file="aitextgen-merges.txt",
               to_gpu=True)

11/12/2020 23:07:16 — INFO — aitextgen — Constructing GPT-2 model from provided config.
11/12/2020 23:07:17 — INFO — aitextgen — Using a custom tokenizer.


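# Train GPT-2 on the Dataset
The model is built from the config above rather than from pretrained weights, so this step trains it from scratch. Training should take about an hour on an NVIDIA Tesla P100 GPU (the run logged below finished in roughly 36 minutes). Text generated from the model should get progressively better over iterations.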
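
The training arguments in this section can be turned into a rough budget. The sketch below assumes that each training step consumes `batch_size` rows and uses the 377,617 rows reported by the encoding log in this section. The function name `training_budget` is hypothetical.

```python
def training_budget(num_rows, num_steps, batch_size, save_every, generate_every):
    """Rough bookkeeping for a training run (assumes one batch per step)."""
    samples_seen = num_steps * batch_size
    return {
        "samples_seen": samples_seen,
        "approx_epochs": samples_seen / num_rows,
        "checkpoints": num_steps // save_every,
        "sample_generations": num_steps // generate_every,
    }


budget = training_budget(
    num_rows=377_617,
    num_steps=10_000,
    batch_size=512,
    save_every=500,
    generate_every=1_000,
)
for name, value in budget.items():
    print(name, value)
```

With these numbers the model sees 5,120,000 rows, or roughly 13.6 passes over the dataset, saves 20 checkpoints, and prints 10 sample generations.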

In [16]:
ai.train(file_name,
         line_by_line=True,
         num_steps=10000,
         generate_every=1000,
         save_every=500,
         save_gdrive=False,
         learning_rate=1e-4,
         batch_size=512)


11/12/2020 23:07:23 — INFO — aitextgen.TokenDataset — Encoding 377,617 rows from /content/dataset.csv.





GPU available: True, used: True
11/12/2020 23:07:36 — INFO — lightning — GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
11/12/2020 23:07:36 — INFO — lightning — CUDA_VISIBLE_DEVICES: [0]



500 steps reached: saving model to /trained_model
1,000 steps reached: saving model to /trained_model
1,000 steps reached: generating sample texts.
WRBWFA RI - "The World of Gayy"
1,500 steps reached: saving model to /trained_model
2,000 steps reached: saving model to /trained_model
2,000 steps reached: generating sample texts.
The Wises "The World of Love"
2,500 steps reached: saving model to /trained_model
3,000 steps reached: saving model to /trained_model
3,000 steps reached: generating sample texts.
Inlitivity
3,500 steps reached: saving model to /trained_model
4,000 steps reached: saving model to /trained_model
4,000 steps reached: generating sample texts.
Circle
4,500 steps reached: saving model to /trained_model
5,000 steps reached: saving model to /trained_model
5,000 steps reached: generating sample texts.
The Hole
5,500 steps reached: saving model to /t

11/12/2020 23:43:18 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model


Face The Bot


# Generate a few Samples
Now, for the fun part! Before I continue, I want to be really clear: all Kickstarter project names you see in this notebook are 100% fake and generated by GPT-2. Please use this code responsibly: never use it to intentionally deceive people, generate clickbait, or do anything else blackhat.

In [17]:
ai.generate(n=15,
            batch_size=1024,
            temperature=1.0,
            top_p=0.999)

Versa Gira: A Musical
The Bonds Make Their Second Album to Make Us Grow
L.I.A.T.S.M
Gifted: The Card Game For All Ages
Wordcash | The World's Best Carry Your MacBook Statues
Corwegian Noms Ramen - Short Film
The Pure Stillness Festival
Grafusion!
Help a pair of shirts at the belitters of Corso!
Gearing Ridge: IC's Journey (Seafish)
Letters from the End of One (Canceled)
Haiku Project
Honeybee für Swey: Petera & the Fathers Tour
DJsiKaChris' Debut EP
Climbing
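
The `temperature` and `top_p` arguments passed to `ai.generate` control how adventurous the sampling is. The sketch below shows the general idea of temperature scaling followed by top-p (nucleus) filtering on a small list of scores. It is a pure-Python illustration of the technique, not the code that aitextgen or transformers actually run, and the function name is hypothetical.

```python
import math


def nucleus_filter(logits, temperature=1.0, top_p=0.999):
    """Return {token_index: probability} after temperature and top-p filtering."""
    # Lower temperatures sharpen the distribution; higher ones flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability
    # reaches top_p (the token that crosses the threshold is kept).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}
```

With `top_p=0.999`, as in the cell above, almost the entire distribution survives the filter, so only the very unlikely tail of tokens is cut. A lower value such as `0.9` would trade some of the weirdness for more conventional names.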


# Credits

This project was made possible by the cumulative efforts of the following parties:

Brian Lechthaler *author of this notebook*
* https://github.com/brianlechthaler
* https://twitter.com/brianlechthaler

Max Woolf *author of [aitextgen](https://github.com/minimaxir/aitextgen), the training code this notebook is based on.*
* https://minimaxir.com/
* https://github.com/minimaxir

Mickaël Mouillé [author](https://www.kaggle.com/kemical) of [kickstarter-projects](https://www.kaggle.com/kemical/kickstarter-projects) dataset
* https://www.biborg.com/
* https://github.com/mickaelmouille
* https://www.linkedin.com/in/mickael-mouill%C3%A9-38109321/

OpenAI *creators of [GPT-2](https://en.wikipedia.org/wiki/OpenAI#GPT-2) model*
* https://openai.com 
* https://openai.com/blog/tags/gpt-2/
