<a href="https://colab.research.google.com/github/daveshap/GibberishDetector/blob/main/GibberishDetector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Requirements
Let's get this out of the way up front!

**Note: Run with GPU environment!**

> Click `Runtime >> Change runtime type >> GPU`

I think a GPU runtime is much faster than a TPU runtime for this notebook.
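As a quick sanity check before installing anything, the sketch below reports whether a GPU is visible to the runtime. It assumes the NVIDIA `nvidia-smi` tool is on the `PATH` whenever a GPU runtime is attached, which I believe is the case on Colab but have not confirmed here; the helper name `gpu_visible` is made up for illustration.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi exists and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # no NVIDIA driver tools found, so most likely no GPU runtime
    try:
        # "nvidia-smi -L" prints one line per detected GPU
        result = subprocess.run([exe, "-L"], stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                universal_newlines=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, switch the runtime type as described above before continuing.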

In [1]:
!pip install tensorflow-gpu==1.15.0 --quiet
!pip install gpt-2-simple --quiet 

     |████████████████████████████████| 411.5MB 40kB/s 
     |████████████████████████████████| 3.8MB 56.9MB/s 
     |████████████████████████████████| 512kB 45.6MB/s 
     |████████████████████████████████| 51kB 7.7MB/s 
  Building wheel for gast (setup.py) ... done
ERROR: tensorflow 2.3.0 has requirement gast==0.3.3, but you'll have gast 0.2.2 which is incompatible.
ERROR: tensorflow 2.3.0 has requirement tensorboard<3,>=2.3.0, but you'll have tensorboard 1.15.0 which is incompatible.
ERROR: tensorflow 2.3.0 has requirement tensorflow-estimator<2.4.0,>=2.3.0, but you'll have tensorflow-estimator 1.15.1 which is incompatible.
ERROR: tensorflow-probability 0.11.0 has requirement gast>=0.3.2, but you'll have gast 0.2.2 which is incompatible.
  Building wheel for gpt-2-simple (setup.py) ... done


# Compile Training Corpus

> Note: Use the [WikipediaDataBuilder](https://github.com/daveshap/GibberishDetector/blob/main/WikipediaDataBuilder.ipynb) notebook to create the base data for compiling the corpus.

- 5,000 samples seems to do well
- 15,000 samples is too many

In [3]:
from random import sample, seed

gdrive_dir = '/content/drive/My Drive/WikiData/'

files = [  # source, label
('%swiki_sentences.txt' % gdrive_dir, 'clean'), 
('%sshuffled_characters.txt' % gdrive_dir, 'noise'),
('%sshuffled_words.txt' % gdrive_dir, 'word salad'),
('%smild_gibberish.txt' % gdrive_dir, 'mild gibberish'),
]

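result = []
max_samples = 5000  # upper bound on corpus size; capped below at the number of lines available
corpus = 'corpus.txt'

for path, label in files:
  with open(path, 'r', encoding='utf-8') as infile:
    lines = infile.readlines()
  for line in lines:
    line = line.strip()
    if line == '':
      continue
    # lowercase and drop periods so the model cannot cheat by keying on capitalization or punctuation
    line = line.lower().replace('.', '')
    # training format: "// <sentence> || <label> "
    line = '// %s || %s ' % (line, label)
    result.append(line)

seed()  # reseed from system entropy so each run draws a different subset
# sample() raises ValueError if asked for more items than exist, so cap the request
subset = sample(result, min(max_samples, len(result)))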

with open(corpus, 'w', encoding='utf-8') as outfile:
  for line in subset:
    outfile.write(line+'\n\n')
print(corpus, 'saved!')

corpus.txt saved!
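Each training line has the shape `// <sentence> || <label> `, so the fine-tuned model learns to emit a label after the `||` separator. Because the subset is drawn at random from all four files pooled together, the classes are not guaranteed to be balanced. The sketch below parses lines in that format and counts the labels; `parse_example` and `label_counts` are hypothetical helpers, not part of the notebook, and the demo lines are made up.

```python
from collections import Counter


def parse_example(line):
    """Split '// text || label ' into (text, label); return None if malformed."""
    line = line.strip()
    if not line.startswith('//') or '||' not in line:
        return None
    text, label = line[2:].rsplit('||', 1)
    return text.strip(), label.strip()


def label_counts(lines):
    """Count how many well-formed lines carry each label."""
    parsed = (parse_example(line) for line in lines)
    return Counter(p[1] for p in parsed if p is not None)


# tiny made-up sample in the same format as corpus.txt
demo = [
    '// the cat sat on the mat || clean ',
    '// mat the on sat cat the || word salad ',
    '// t eh tac tsa || noise ',
    '',
]
print(label_counts(demo))
```

To inspect the real corpus, an open file handle such as `open('corpus.txt', encoding='utf-8')` can be passed to `label_counts` in place of `demo`.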


# Load Model
Let's use Google Drive to store the model for persistence. We will want to fine-tune the model iteratively to get better and better performance, and we will want to reuse it later after pouring so much work into it!

Information about [download_gpt2 function here](https://github.com/minimaxir/gpt-2-simple/blob/92d35962d9aaeadba70e39d11d040f1e377ffdb3/gpt_2_simple/gpt_2.py#L64)

### Model Sizes
- `124M`
- `355M`
- `774M`
- `1558M`
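As a rough guide to why model size matters for memory, the weights alone take about four bytes per parameter when stored as 32-bit floats. The sketch below assumes the size names are nominal parameter counts and that weights are float32; both are assumptions, and training needs several times more memory than this for gradients, optimizer state, and activations.

```python
# nominal parameter counts implied by the model names (an assumption)
MODEL_PARAMS = {'124M': 124e6, '355M': 355e6, '774M': 774e6, '1558M': 1558e6}


def approx_weight_gb(model_name, bytes_per_param=4):
    """Estimate weight storage in GB (10**9 bytes), assuming float32 by default."""
    return MODEL_PARAMS[model_name] * bytes_per_param / 1e9


for name in MODEL_PARAMS:
    print('%s: ~%.2f GB of weights' % (name, approx_weight_gb(name)))
```

For `355M` the estimate is about 1.42 GB, which appears consistent with the size reported for `model.ckpt.data-00000-of-00001` in the download log further down.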

In [4]:
import gpt_2_simple as gpt2

# note: manually mount your google drive in the file explorer to the left

model_dir = '/content/drive/My Drive/GPT2/models'
checkpoint_dir = '/content/drive/My Drive/GPT2/checkpoint'
model_name = '355M'

gpt2.download_gpt2(model_name=model_name, model_dir=model_dir)
print('\n\nModel is ready!')

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



Fetching checkpoint: 1.05Mit [00:00, 217Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 83.0Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 284Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:17, 82.2Mit/s]                                 
Fetching model.ckpt.index: 1.05Mit [00:00, 218Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 86.8Mit/s]                                                
Fetching vocab.bpe: 1.05Mit [00:00, 128Mit/s]                                                       



Model is ready!





# Finetune GPT2
This is where the rubber meets the road! Let's see if we can fine-tune a GPT-2 model! Bigger models generally give better results, but they also require more memory, so there's a tradeoff between model size and corpus size. It looks like 355M is the largest model we can use for now.

[Finetune function here](https://github.com/minimaxir/gpt-2-simple/blob/92d35962d9aaeadba70e39d11d040f1e377ffdb3/gpt_2_simple/gpt_2.py#L127)

Run this repeatedly with more/different training data to get better results.

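The simplest way to continue training is to switch `restore_from` to `'latest'` in the cell below (with `'fresh'` it starts over from the base model), then click `Runtime >> Restart and run all...`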
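The `restore_from` value used in the cell below could also be chosen automatically. This sketch assumes checkpoints are written as files whose names start with `model-` inside `<checkpoint_dir>/<run_name>/`, which is what the `Saving ...` lines in the training log suggest; `pick_restore_mode` is a hypothetical helper, not part of gpt-2-simple.

```python
import os


def pick_restore_mode(checkpoint_dir, run_name):
    """Return 'latest' if the run folder already holds a checkpoint, else 'fresh'."""
    run_dir = os.path.join(checkpoint_dir, run_name)
    if not os.path.isdir(run_dir):
        return 'fresh'
    # assumption: checkpoint files are named like model-100.index, model-100.meta, ...
    has_checkpoint = any(name.startswith('model-') for name in os.listdir(run_dir))
    return 'latest' if has_checkpoint else 'fresh'


print(pick_restore_mode('/content/drive/My Drive/GPT2/checkpoint', 'GibberishDetector'))
```

The result could then be passed to the fine-tuning call as `restore_from=pick_restore_mode(checkpoint_dir, run_name)`.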

In [None]:
file_name = 'corpus.txt'
run_name = 'GibberishDetector'
model_dir = '/content/drive/My Drive/GPT2/models'
checkpoint_dir = '/content/drive/My Drive/GPT2/checkpoint'
model_name = '355M'

sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name=model_name,
              model_dir=model_dir,
              checkpoint_dir=checkpoint_dir,
              steps=1000,
              restore_from='fresh',  # start from scratch
              #restore_from='latest',  # continue from last work
              run_name=run_name,
              print_every=20,
              sample_every=500,
              save_every=500
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint /content/drive/My Drive/GPT2/models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from /content/drive/My Drive/GPT2/models/355M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:02<00:00,  2.52s/it]


dataset has 273480 tokens
Training...
[10 | 15.53] loss=5.44 avg=5.44
[20 | 20.84] loss=4.85 avg=5.14
[30 | 26.12] loss=4.68 avg=4.99
[40 | 31.42] loss=5.48 avg=5.11
[50 | 36.72] loss=4.99 avg=5.09
[60 | 42.02] loss=4.83 avg=5.04
[70 | 47.27] loss=5.14 avg=5.06
[80 | 52.56] loss=4.14 avg=4.94
[90 | 57.84] loss=5.54 avg=5.01
[100 | 63.11] loss=5.24 avg=5.03
Saving /content/drive/My Drive/GPT2/checkpoint/GibberishDetector/model-100
[110 | 76.89] loss=4.47 avg=4.98
[120 | 82.24] loss=5.43 avg=5.02
[130 | 87.58] loss=5.42 avg=5.05
[140 | 92.94] loss=4.31 avg=5.00
[150 | 98.30] loss=5.36 avg=5.02
[160 | 103.64] loss=4.79 avg=5.01
[170 | 109.00] loss=4.45 avg=4.97
[180 | 114.30] loss=4.99 avg=4.97
[190 | 119.60] loss=4.26 avg=4.93
[200 | 124.88] loss=4.85 avg=4.93
Saving /content/drive/My Drive/GPT2/checkpoint/GibberishDetector/model-200
Instructions for updating:
Use standard file APIs to delete files with this prefix.
nsu ih c htsl rtrnsu u g e htttttdg n || noise 

// e r s u mt esa h hnt
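To judge whether training is converging, the progress lines above can be parsed into numbers. The sketch assumes each line looks like `[step | elapsed seconds] loss=... avg=...`, as in the log shown; `parse_progress` is a hypothetical helper, and the `log` string is a short excerpt copied from that output.

```python
import re

# matches lines such as "[10 | 15.53] loss=5.44 avg=5.44"
PROGRESS = re.compile(r'\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)')


def parse_progress(log_text):
    """Return a list of (step, seconds, loss, avg) tuples found in the log text."""
    rows = []
    for m in PROGRESS.finditer(log_text):
        rows.append((int(m.group(1)), float(m.group(2)),
                     float(m.group(3)), float(m.group(4))))
    return rows


log = """[10 | 15.53] loss=5.44 avg=5.44
[20 | 20.84] loss=4.85 avg=5.14
Saving /content/drive/My Drive/GPT2/checkpoint/GibberishDetector/model-100
[200 | 124.88] loss=4.85 avg=4.93"""

rows = parse_progress(log)
first, last = rows[0], rows[-1]
# average wall-clock seconds per step between the first and last parsed lines
sec_per_step = (last[1] - first[1]) / (last[0] - first[0])
print('avg loss %.2f -> %.2f, %.2f s/step' % (first[3], last[3], sec_per_step))
```

Note that the seconds-per-step figure includes any time spent saving checkpoints between the two lines.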

# Test Results

[Generation information here](https://github.com/minimaxir/gpt-2-simple/blob/92d35962d9aaeadba70e39d11d040f1e377ffdb3/gpt_2_simple/gpt_2.py#L407)
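One way the detector could be queried, sketched under assumptions: prompt the model with `// <sentence> ||` in the training format and read the label from the text generated after the separator. The helpers `build_prompt` and `extract_label` are hypothetical, and the stand-in `fake_generate` exists only so the example runs without a model.

```python
LABELS = ('clean', 'noise', 'word salad', 'mild gibberish')


def build_prompt(sentence):
    """Format a sentence the same way the training corpus does, stopping after '||'."""
    return '// %s ||' % sentence.strip().lower().replace('.', '')


def extract_label(generated, prompt):
    """Return the known label that follows the prompt, or None if none matches."""
    if generated.startswith(prompt):
        generated = generated[len(prompt):]
    tail = generated.strip()
    # check longer labels first so a longer label is never shadowed by a shorter one
    for label in sorted(LABELS, key=len, reverse=True):
        if tail.startswith(label):
            return label
    return None


def fake_generate(prompt):
    """Stand-in for the real model; always answers 'word salad'."""
    return prompt + ' word salad \n\n// next sample'


prompt = build_prompt('Mat the on sat cat the.')
print(extract_label(fake_generate(prompt), prompt))
```

With a fine-tuned model, `fake_generate` might be replaced by a call such as `gpt2.generate(sess, run_name=run_name, checkpoint_dir=checkpoint_dir, prefix=prompt, length=10, return_as_list=True)[0]`; check the linked generation source for the exact parameters before relying on this.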



In [None]:
# TODO

# Conclusion

TBD