# Install Requirements
Let's get this out of the way up front!

**Note: Run with GPU environment!**

> Click `Runtime >> Change runtime type >> GPU`

I think the GPU runtime is much faster than the TPU runtime for this notebook.

In [1]:
!pip install wikipedia --quiet
!pip install spacy --quiet
!pip install pysbd --quiet
!pip install tensorflow-gpu==1.15.0 --quiet
!pip install gpt-2-simple --quiet 

  Building wheel for wikipedia (setup.py) ... done
  Building wheel for gast (setup.py) ... done
ERROR: tensorflow 2.3.0 has requirement gast==0.3.3, but you'll have gast 0.2.2 which is incompatible.
ERROR: tensorflow 2.3.0 has requirement tensorboard<3,>=2.3.0, but you'll have tensorboard 1.15.0 which is incompatible.
ERROR: tensorflow 2.3.0 has requirement tensorflow-estimator<2.4.0,>=2.3.0, but you'll have tensorflow-estimator 1.15.1 which is incompatible.
ERROR: tensorflow-probability 0.11.0 has requirement gast>=0.3.2, but you'll have gast 0.2.2 which is incompatible.
  Building wheel for gpt-2-simple (setup.py) ... do

# Download Wikipedia Articles
First, we need a corpus of relatively clean data. Wikipedia is crowd-sourced, heavily edited, and written in modern English, so we can treat it as a reasonably reliable source of semantically, syntactically, and rhetorically sound text.

In [None]:
import wikipedia

# todo: come up with a cool way to automatically create topic search terms
keywords = ['india', 'ocean', 'astronomy', 'economics', 'economy', 'earth', 
            'english', 'bacon', 'egg', 'dinosaur', 'rabbit', 'america', 'usa',
            'congress', 'virus', 'George Clooney', 'knowledge', 'Buddha']

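def save_article(title, article):
  # titles such as 'AC/DC' contain a path separator, so replace it before building the filename
  safe_title = title.replace('/', '_')
  with open('wiki_' + safe_title + '.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(article)

for keyword in keywords:
  print('Searching Wikipedia for keyword:', keyword)
  try:
    search = wikipedia.search(keyword)
  except Exception as oops:
    continue  # search failed, move on to the next keyword
  for result in search:
    # handle errors per result so that one bad page (for example a
    # disambiguation page) does not discard the rest of this keyword's results
    try:
      article = wikipedia.page(result)
      save_article(result, article.content)
    except Exception as oops:
      continue  # best-effort crawl: skip pages that cannot be fetched or saved
print('Done saving articles!')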

Searching Wikipedia for keyword: india
Searching Wikipedia for keyword: ocean
Searching Wikipedia for keyword: astronomy
Searching Wikipedia for keyword: economics
Searching Wikipedia for keyword: economy
Searching Wikipedia for keyword: earth


# Parse Articles
The articles need to be split up into usable chunks. This uses regex to identify the section headers and split each article into single lines of text for each section. Furthermore, it looks at the number of word characters vs other characters to identify those sections that likely contain text instead of tables or other data.

In [None]:
import os
import re

result = list()

for file in os.listdir('.'):
  # only the article files written by save_article() start with 'wiki_'
  if not file.startswith('wiki_'):
    continue
  with open(file, 'r', encoding='utf-8') as infile:
    text = infile.read()
  # section headers look like '== History ==' or '=== Early life ==='
  sections = re.split(r'={2,}.{0,80}={2,}', text)
  for section in sections:
    trimmed = section.strip()
    if not trimmed:
      continue  # empty section; this also avoids dividing by zero below
    wordchars = re.findall(r'\w', trimmed)
    ratio = len(wordchars) / len(trimmed)
    # it seems like a ratio of greater than 80% word chars is ideal
    if ratio > 0.80:
      # collapse newlines and repeated spaces so each section is a single line
      final = re.sub(r'\s+', ' ', trimmed)
      result.append(final)

print('Wikipedia sections parsed:', len(result))
with open('wikiparsed.txt', 'w', encoding='utf-8') as outfile:
  for line in result:
    outfile.write(line+'\n')
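
To make the two filters concrete, here is a minimal standalone sketch that applies the same header split and word-character ratio to a tiny made-up article. The sample text and the helper name `word_char_ratio` are assumptions for illustration only; they are not part of the notebook.

```python
import re

def word_char_ratio(text):
    """Fraction of the characters in `text` that are word characters."""
    if not text:
        return 0.0
    return len(re.findall(r'\w', text)) / len(text)

# Made-up text that mimics the '== Header ==' layout of the saved articles
sample = (
    "Rabbits are small mammals.\n\n"
    "== Habitat ==\n"
    "Rabbits live in meadows, woods, forests and grasslands.\n\n"
    "=== Statistics ===\n"
    "| 1 | 2 | 3 |"
)

# Same regex as the parsing cell: split on section headers, then trim
sections = [s.strip() for s in re.split(r'={2,}.{0,80}={2,}', sample)]
for section in sections:
    ratio = word_char_ratio(section)
    keep = ratio > 0.80
    print(round(ratio, 2), keep, section)
```

The two prose sections pass the 80% threshold, while the table-like last section falls well below it and is dropped. Ordinary prose sits only slightly above 0.80, so the threshold is fairly tight.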

# Split Sentences
For the sake of simplicity, we don't want to go overboard and evaluate entire paragraphs. We only want to train on individual sentences. So let's use spaCy and pySBD (Python Sentence Boundary Disambiguation) to split the corpus into sentences.

In [None]:
import spacy
from pysbd.utils import PySBDFactory

# note: passing a component object to add_pipe() is the spaCy 2.x API
nlp = spacy.blank('en')
nlp.add_pipe(PySBDFactory(nlp))
infile = 'wikiparsed.txt'
outfile = 'wikisentences.txt'
result = list()

with open(infile, 'r', encoding='utf-8') as file:
  lines = file.readlines()

print('Lines of text:', len(lines))
for line in lines:
  doc = nlp(line)
  for sent in doc.sents:
    # keep plain strings and drop the newline carried over from readlines()
    text = sent.text.strip()
    if text:
      result.append(text)

print('Sentences found:', len(result))
with open(outfile, 'w', encoding='utf-8') as file:
  for line in result:
    file.write(line+'\n')
print(outfile, 'saved!')
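
As a rough illustration of why a dedicated boundary detector is used here instead of a plain regex, the sketch below splits on sentence-ending punctuation using only the standard library. The `naive_split` helper and the sample text are made up for this example, and this is not how pySBD works internally.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

sample = "Dr. Smith went to Washington. He arrived at 5 p.m. on Tuesday."
for piece in naive_split(sample):
    print(repr(piece))
```

The sample contains two sentences, but the naive rule cuts it into four pieces because it treats the periods in `Dr.` and `p.m.` as sentence ends. Fragments like these would pollute the "clean" half of the training data.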

# Generate Gibberish v1
### Scrambled Words
We have a great source of sentences that are semantically, syntactically, and rhetorically sound. The simplest way to generate gibberish, then, would be to scramble these sentences! For this first version, we want words, just all mixed up. This will create good training data because the samples will contain the same exact words as the sound sentences but out of order.

In [None]:
from random import shuffle, seed

infile = 'wikisentences.txt'
outfile = 'wikiscrambled.txt'
result = list()

def scramble_sentence(sentence):
  sentence = sentence.strip()
  split = sentence.split()
  shuffle(split)
  return ' '.join(split)

seed()
with open(infile, 'r', encoding='utf-8') as file:
  lines = file.readlines()
for line in lines:
  line = line.strip()
  if line == '':
    continue
  scrambled = scramble_sentence(line)
  result.append(scrambled)
with open(outfile, 'w', encoding='utf-8') as file:
  for line in result:
    file.write(line+'\n')
print(outfile, 'saved!')        
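
One caveat with plain shuffling: a one-word sentence, or an unlucky shuffle of a short one, comes back unchanged and would then be labeled as gibberish even though it is identical to a clean sample. The sketch below shows one possible guard. It is not the notebook's method, and `scramble_words` is a hypothetical helper introduced for illustration.

```python
from random import Random

def scramble_words(sentence, rng):
    """Return the words of `sentence` in a different order, or None if that is impossible."""
    words = sentence.split()
    # With fewer than two distinct words, every ordering reads the same
    if len(set(words)) < 2:
        return None
    shuffled = words[:]
    # Reshuffle until the order actually differs from the original
    while shuffled == words:
        rng.shuffle(shuffled)
    return ' '.join(shuffled)

rng = Random(0)  # fixed seed only so the demo is repeatable
original = 'the rabbit ate the egg'
scrambled = scramble_words(original, rng)
print(scrambled)
# The scrambled sample still contains exactly the same words
print(sorted(scrambled.split()) == sorted(original.split()))
# A single word cannot be scrambled, so the caller can skip it
print(scramble_words('Earth', rng))
```

Sentences for which the helper returns `None` can simply be left out of the gibberish file.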

# Generate Gibberish v2
### Completely Random Characters
This step may not be necessary but I'd like to be able to detect utter nonsense as well. So let's scramble all the characters in each sentence completely. I figure it's better to show the model random noise as well as random words.

In [None]:
from random import shuffle, seed

infile = 'wikisentences.txt'
outfile = 'wikiscrambled2.txt'
result = list()

def scramble_sentence(sentence):
  sentence = sentence.strip()
  sentence = list(sentence)
  shuffle(sentence)
  return ''.join(sentence)

seed()
with open(infile, 'r', encoding='utf-8') as file:
  lines = file.readlines()
for line in lines:
  line = line.strip()
  if line == '':
    continue
  scrambled = scramble_sentence(line)
  result.append(scrambled)
with open(outfile, 'w', encoding='utf-8') as file:
  for line in result:
    file.write(line+'\n')
print(outfile, 'saved!')

# Compile Training Corpus
Let's build a training corpus that we can feed to GPT-2! We need to bake the label directly into each line. `max_samples` is meant to control the corpus size, but the sampling lines in the cell below are currently commented out, so the whole corpus is written. Finetuning memory requirements limit how much data can be used at once, so multiple training runs may be necessary. I will add updates about limits and constraints as I figure them out.

I'm afraid that this will just learn to pay attention to caps and periods so I might change the way the final corpus looks. 

In [None]:
from random import sample, seed

files = [
('wikisentences.txt', 'clean'), 
#('wikiscrambled2.txt', 'gibberish'),  # excluding complete noise for now
('wikiscrambled.txt', 'gibberish')
]

result = list()
max_samples = 5000  # the max here is the number of sentences from above
corpus = 'corpus.txt' 

for file in files:
  with open(file[0], 'r', encoding='utf-8') as infile:
    lines = infile.readlines()
  for line in lines:
    line = line.strip()
    if line == '':
      continue
    line = line.lower().replace('.', '')  # this will make it harder to cheat
    line = '// %s || %s ' % (line, file[1])
    result.append(line)

#seed()
#subset = sample(result, max_samples)

with open(corpus, 'w', encoding='utf-8') as outfile:
  for line in result:
    outfile.write(line+'\n\n')
print(corpus, 'saved!')
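
The sketch below isolates the line layout used above and shows how a label can be read back out of a line, which is what a classifier built on the finetuned model would eventually need to do. The helper names `format_sample`, `parse_sample`, and `take_subset` are assumptions made for this example. `take_subset` caps the request at the corpus size because `random.sample` raises `ValueError` when asked for more items than exist.

```python
from random import Random

def format_sample(text, label):
    """Build one corpus line in the same '// text || label ' layout as the cell above."""
    text = text.strip().lower().replace('.', '')
    return '// %s || %s ' % (text, label)

def parse_sample(line):
    """Recover (text, label) from a corpus line, or None if the layout does not match."""
    line = line.strip()
    if not line.startswith('//') or '||' not in line:
        return None
    # Split on the last '||' so the label is whatever follows it
    text, _, label = line[2:].rpartition('||')
    return text.strip(), label.strip()

def take_subset(samples, max_samples, rng):
    """Randomly keep at most `max_samples` items, even when the corpus is smaller."""
    return rng.sample(samples, min(max_samples, len(samples)))

lines = [format_sample('The Earth orbits the Sun.', 'clean'),
         format_sample('Sun the orbits Earth the.', 'gibberish')]
print(lines[0])
print(parse_sample(lines[0]))
print(len(take_subset(lines, 5000, Random(0))))
```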

# Load Model
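Let's use Google Drive to store the model for persistence. Google Drive must be mounted at `/content/drive` (the `drive.mount` cell) before this cell runs, since both paths below point into it. We will want to finetune the model iteratively to get better and better performance. We will also want to use the model again later after pouring so much work into it!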

Information about [download_gpt2 function here](https://github.com/minimaxir/gpt-2-simple/blob/92d35962d9aaeadba70e39d11d040f1e377ffdb3/gpt_2_simple/gpt_2.py#L64)

In [None]:
import gpt_2_simple as gpt2

model_dir = '/content/drive/My Drive/GPT2/models'
checkpoint_dir = '/content/drive/My Drive/GPT2/checkpoint'
gpt2.download_gpt2(model_name='355M', model_dir=model_dir)
print('\n\nModel is ready!')
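
Because the model lives on Drive, it does not have to be downloaded again on every run. The sketch below is one way to check first, assuming the model files sit in a subfolder of `model_dir` named after the model. `needs_download` is a hypothetical helper, and an existing non-empty folder does not prove the download was complete.

```python
import os

def needs_download(model_dir, model_name):
    """True when the folder model_dir/model_name is missing or empty."""
    path = os.path.join(model_dir, model_name)
    return not (os.path.isdir(path) and os.listdir(path))

# Intended use in the cell above (left as a comment so this sketch
# runs without gpt-2-simple installed):
# if needs_download(model_dir, '355M'):
#     gpt2.download_gpt2(model_name='355M', model_dir=model_dir)
print(needs_download('/content/drive/My Drive/GPT2/models', '355M'))
```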

# Finetune GPT2!
This is where the rubber meets the road! Let's see if we can finetune a GPT-2 model! Larger models generally give better results, but they also require more memory, so there's a tradeoff between model size and corpus size. It looks like 355M is the largest model we can do for now.

[Finetune function here](https://github.com/minimaxir/gpt-2-simple/blob/92d35962d9aaeadba70e39d11d040f1e377ffdb3/gpt_2_simple/gpt_2.py#L127)

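Run this repeatedly with more/different training data to get better results. To continue from the previous run instead of starting over, switch `restore_from` from `'fresh'` to `'latest'` in the cell below.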

Simplest way to continue training is to click `Runtime >> Restart and run all...`

In [None]:
file_name = 'corpus.txt'
sess = gpt2.start_tf_sess()
run_name = 'GibberishDetector'
model_name = '355M'

gpt2.finetune(sess,
              dataset=file_name,
              model_name=model_name,
              model_dir=model_dir,
              checkpoint_dir=checkpoint_dir,
              steps=1000,
              restore_from='fresh',  # start from scratch
              #restore_from='latest',  # continue from last work
              run_name=run_name,
              print_every=10,
              sample_every=200,
              save_every=100
              )
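
Switching `restore_from` by hand between runs is easy to forget. The sketch below picks the value automatically, assuming the run's checkpoints are written to `checkpoint_dir/run_name` and that TensorFlow leaves its usual `checkpoint` index file there. `pick_restore_mode` is a hypothetical helper, not part of gpt-2-simple.

```python
import os

def pick_restore_mode(checkpoint_dir, run_name):
    """Return 'latest' if a checkpoint index file exists for this run, else 'fresh'."""
    index_file = os.path.join(checkpoint_dir, run_name, 'checkpoint')
    return 'latest' if os.path.isfile(index_file) else 'fresh'

# Possible use with the cell above: pass the result as the restore_from argument
# restore_from=pick_restore_mode(checkpoint_dir, run_name)
print(pick_restore_mode('/content/drive/My Drive/GPT2/checkpoint', 'GibberishDetector'))
```

With this in place, the first run starts from the base model and later runs resume from the saved checkpoint without editing the cell.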

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Generate with GPT-2
Let's make sure we can save and load the model that we worked so hard on!

[Generation information here](https://github.com/minimaxir/gpt-2-simple/blob/92d35962d9aaeadba70e39d11d040f1e377ffdb3/gpt_2_simple/gpt_2.py#L407)

Run this after training a model and restarting the instance. This will demonstrate that the model is saved and working.