# Generating Medium Titles with GPT-2
Fine-tuning the GPT-2 355M model on a dataset of Medium blog post titles using aitextgen.

## Ethics Disclaimer:
While a malicious use case isn't clear at the time of writing, it's worth stating that you should under no circumstances use text generated with this notebook to deceive people. You must clearly indicate that the text you are presenting was generated using GPT-2, even if it seems to you like it should be obvious.


*By Brian Lechthaler - 3/17/21*

## Setup
Here we install necessary dependencies and import everything we need.

In [1]:
!pip install -q aitextgen pandas

In [2]:
from aitextgen import aitextgen as atg
from aitextgen.colab import mount_gdrive as mnt
from aitextgen.utils import build_gpt2_config
from aitextgen.tokenizers import train_tokenizer
import pandas as pd

## Mount Google Drive
You will need this if you want to save the trained model.

In [3]:
mnt()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Query Current GPU's Info
Here we use the `nvidia-smi` command to grab some helpful info about the currently attached GPU.

In [4]:
!nvidia-smi

Wed Mar 17 06:06:53 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P0    31W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

## Define Function for Saving a File
We need to be able to append lines to a file on disk in order to transform our original dataset.


In [5]:
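def writeln(line, path):
  # Append a single line of text to the file at `path`.
  line = str(line) + "\n"
  with open(path, 'a') as saveto:
    # The `with` block closes the file on exit, so no explicit close() is needed.
    saveto.write(line)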

## Define Function to Preprocess Data
In order to use our dataset with `aitextgen`, we must first convert it to a CSV file with only 1 column. Because the source dataset is a CSV with multiple columns, we write the following function to preprocess our dataset so that it plays nicely with `aitextgen`.

In [6]:
def preprocess(colname, output_filename, input_dataset):
  dataingest = pd.read_csv(input_dataset)
  print("Transforming dataset...")
  #dataingest = dataingest.sample(frac=sample_frac)
  for index, row in dataingest.iterrows():
    line = row[colname]
    # pandas represents missing values as NaN rather than None, so test with pd.notna.
    if pd.notna(line):
      writeln(line, output_filename)
  # Print before returning; a statement placed after `return` never runs.
  print("Done!")
  return dataingest
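The loop above reopens the output file once for every title. As a rough alternative, assuming the goal is simply one title per line in the output file, the same transformation can be sketched with a single file handle. The name `preprocess_batch` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def preprocess_batch(colname, output_filename, input_dataset):
    """Hypothetical variant of preprocess(): write one column, one value per line."""
    dataingest = pd.read_csv(input_dataset)
    # Drop missing values so that no "nan" lines end up in the training data.
    titles = dataingest[colname].dropna().astype(str)
    # Open the output file once, in append mode like writeln(), and write every title.
    with open(output_filename, "a") as saveto:
        for title in titles:
            saveto.write(title + "\n")
    return dataingest
```

Like `writeln`, this sketch appends to the output file, so an existing `output.csv` should still be removed before it runs.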

## Preprocess Dataset
Here we use the function we just wrote to narrow our dataset from a multi-column CSV down to a single-column CSV. We use `wc -l` at the end to count the number of lines in the file we will fine-tune GPT-2 `355M` on.

### Don't forget to upload the dataset!
Please note that the dataset used for this notebook is not included with it. You must [download it from Kaggle](https://www.kaggle.com/nulldata/medium-post-titles), unzip it, and upload the extracted CSV to this Colab instance *before* running this cell.
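Before preprocessing, it can help to confirm that the upload worked. The following is a minimal sketch, assuming the extracted file is named `medium_post_titles.csv` and has a `title` column (both taken from the preprocessing call in this notebook). The helper name `check_dataset` is hypothetical.

```python
import os

import pandas as pd


def check_dataset(path, colname):
    """Hypothetical pre-flight check: True if `path` is a CSV containing `colname`."""
    if not os.path.isfile(path):
        print(f"{path} not found: upload the extracted CSV first.")
        return False
    # nrows=0 reads only the header row, so this stays fast for large files.
    columns = pd.read_csv(path, nrows=0).columns
    if colname not in columns:
        print(f"{path} has no '{colname}' column; found: {list(columns)}")
        return False
    return True
```

Calling `check_dataset('medium_post_titles.csv', 'title')` should return `True` once the file is in place.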

In [7]:
!rm -rf output.csv
dataset = preprocess('title',
                     'output.csv',
                     'medium_post_titles.csv')
!wc -l output.csv

Transforming dataset...
126418 output.csv


## Initialize `aitextgen`
Here we initialize `aitextgen` with the `355M` GPT-2 model and GPU support.

In [8]:
ai = atg(tf_gpt2='355M',
         to_gpu=True)

## Fine Tune GPT-2 `355M` using a GPU
Now, for the fun part: using our dataset to train the `355M` GPT-2 model so we can generate Medium blog titles.

### Abuse Deterrence
As is the case with all other notebooks in this series, models generated here are not included and will never be offered to anyone, ever, for any reason under any circumstance. Additionally, you must upload the dataset after retrieving it yourself by downloading it [here](https://www.kaggle.com/nulldata/medium-post-titles), unzipping it, and uploading the CSV to this Colab instance. These two measures are taken to discourage abuse of the code contained in this notebook.

In [9]:
ai.train('output.csv',
         line_by_line=True,
         from_cache=False,
         num_steps=5000,
         generate_every=250,
         save_every=1000,
         save_gdrive=True,
         learning_rate=1e-4,
         fp16=False,
         batch_size=9)


pytorch_model.bin already exists in /trained_model and will be overwritten!
GPU available: True, used: True
TPU available: None, using: 0 TPU cores








250 steps reached: generating sample texts.




A quick guide to the best ways to share your designs with friends<|endoftext|>A quick guide to the best ways to share your design work<|endoftext|>A quick guide to the best ways to use the same design<|endoftext|>A quick guide to the best ways to use the same design<|endoftext|>A quick guide to the best ways to use the same design<|endoftext|>A quick guide to the best ways to use the same design<|endoftext|>A quick guide to the best ways to use the same design<|endoftext|>A quick guide to the best ways to use the same design<|endoftext|>A quick guide to the best ways to use the same design<|endoftext|>A quick guide to the best ways to use the same design<|endoftext|>A quick guide to the best ways to use the same design<|endoftext|>A quick guide to the best ways to use the same design<|endoftext|>A quick guide to the best ways to use the same design<|endoftext|>A quick guide to the best ways
500 steps reached: generating sample texts.
 at Home<|endoftext|>The Power of A Good St
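
The samples printed during training separate titles with the `<|endoftext|>` token and often repeat the same title. The following is a minimal sketch of splitting such a sample into unique titles, assuming the raw text has the shape shown in the output above. `split_titles` is a hypothetical helper, not part of aitextgen.

```python
def split_titles(sample, separator="<|endoftext|>"):
    """Hypothetical helper: split a generated sample into unique, non-empty titles."""
    seen = set()
    titles = []
    for piece in sample.split(separator):
        title = piece.strip()
        # Skip empty pieces and titles already collected, preserving first-seen order.
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles
```

The last piece of a sample may be cut off mid-title when generation hits its length limit, so it is worth inspecting before use.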