<a href="https://colab.research.google.com/github/daveshap/GibberishDetector/blob/main/GibberishDetector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Requirements

- Run in a GPU runtime! GPU is way faster than TPU for this notebook
- Click `Runtime >> Change runtime type >> GPU`
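
A quick way to confirm the runtime actually has a GPU before installing anything is sketched below. It assumes the `nvidia-smi` tool is on the `PATH` when a GPU is attached (typical for Colab GPU runtimes, but an assumption here); `gpu_available` is a hypothetical helper name, not part of any library.

```python
import shutil
import subprocess


def gpu_available():
    """Return True if `nvidia-smi -L` runs cleanly (assumed to indicate a GPU)."""
    exe = shutil.which('nvidia-smi')
    if exe is None:
        return False  # tool not installed, so most likely no GPU runtime
    try:
        result = subprocess.run([exe, '-L'],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if gpu_available():
    print('GPU detected')
else:
    print('No GPU detected; change the runtime type first')
```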

In [None]:
!pip install tensorflow-gpu==1.15.0 --quiet
!pip install gpt-2-simple --quiet 

# Compile Training Corpus

- Use the [WikipediaDataBuilder](https://github.com/daveshap/GibberishDetector/blob/main/WikipediaDataBuilder.ipynb) notebook to create the base data for compiling the corpus
- Data is also available in the [GibberishDetector GitHub repo](https://github.com/daveshap/GibberishDetector)

In [None]:
from random import sample, seed

gdrive_dir = '/content/drive/My Drive/WikiData/'

files = [  # source, label
('%swiki_sentences.txt' % gdrive_dir, 'clean'), 
('%sshuffled_characters.txt' % gdrive_dir, 'noise'),
('%sshuffled_words.txt' % gdrive_dir, 'word salad'),
('%smild_gibberish.txt' % gdrive_dir, 'mild gibberish'),
]

# train
result = list()
max_samples = 3000
corpus = 'corpus.txt' 

# test
test_samples = 100
test_corpus = 'test_corpus.txt'


for path, label in files:
  with open(path, 'r', encoding='utf-8') as infile:
    lines = infile.readlines()
  for line in lines:
    line = line.strip()
    if line == '':
      continue
    # lowercase and drop periods: this will make it harder to cheat on capitalization or punctuation cues
    line = line.lower().replace('.', '')
    # wrap sentence and label in delimiter tags
    line = '<|SENTENCE|> %s <|LABEL|> %s <|END|>' % (line, label)
    result.append(line)


# draw train and test from one sample so the two sets don't overlap

seed()
subset = sample(result, max_samples + test_samples)
train_subset = subset[:max_samples]
test_subset = subset[max_samples:]


# save train set

with open(corpus, 'w', encoding='utf-8') as outfile:
  for line in train_subset:
    outfile.write(line+'\n\n')
print(corpus, 'saved!')


# save test set

with open(test_corpus, 'w', encoding='utf-8') as outfile:
  for line in test_subset:
    outfile.write(line+'\n\n')
print(test_corpus, 'saved!')

corpus.txt saved!
test_corpus.txt saved!


# Load Model
Let's use Google Drive to store the model for persistence. We will want to fine-tune the model iteratively to get better and better performance. We will also want to use the model again later after pouring so much work into it!

Information about [download_gpt2 function here](https://github.com/minimaxir/gpt-2-simple/blob/92d35962d9aaeadba70e39d11d040f1e377ffdb3/gpt_2_simple/gpt_2.py#L64)

### Model Sizes
- `124M`
- `355M`
- `774M`
- `1558M`

In [None]:
import gpt_2_simple as gpt2

# note: manually mount your google drive in the file explorer to the left

model_dir = '/content/drive/My Drive/GPT2/models'
checkpoint_dir = '/content/drive/My Drive/GPT2/checkpoint'
model_name = '124M'

gpt2.download_gpt2(model_name=model_name, model_dir=model_dir)
print('\n\nModel is ready!')

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



Fetching checkpoint: 1.05Mit [00:00, 372Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 80.9Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 235Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:11, 44.2Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 212Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 117Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 134Mit/s]                                                       




Model is ready!


# Finetune GPT2

[Finetune function here](https://github.com/minimaxir/gpt-2-simple/blob/92d35962d9aaeadba70e39d11d040f1e377ffdb3/gpt_2_simple/gpt_2.py#L127)

- Rerun for subsequent training sessions (switch `restore_from` to `'latest'` to continue from the last saved checkpoint)
- Click on `Runtime >> Restart and run all`
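
One way to avoid editing `restore_from` by hand between sessions is sketched below. It assumes checkpoints are saved as files named `model-<step>...` inside `checkpoint_dir/run_name`, as the `Saving .../GibberishDetector/model-1000` log line suggests; `pick_restore_mode` is a hypothetical helper, not part of gpt-2-simple.

```python
import os


def pick_restore_mode(checkpoint_dir, run_name):
    """Return 'latest' if the run folder already holds a checkpoint, else 'fresh'."""
    run_dir = os.path.join(checkpoint_dir, run_name)
    if not os.path.isdir(run_dir):
        return 'fresh'
    # Assumption: saved checkpoints show up as files whose names start with 'model-'
    has_checkpoint = any(name.startswith('model-') for name in os.listdir(run_dir))
    return 'latest' if has_checkpoint else 'fresh'


# Same paths as used elsewhere in this notebook
mode = pick_restore_mode('/content/drive/My Drive/GPT2/checkpoint', 'GibberishDetector')
print('restore_from =', mode)
```

The result could then be passed as `restore_from=mode` in the `gpt2.finetune` call.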

In [None]:
file_name = 'corpus.txt'
run_name = 'GibberishDetector'
step_cnt = 2000

sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name=model_name,
              model_dir=model_dir,
              checkpoint_dir=checkpoint_dir,
              steps=step_cnt,
              restore_from='fresh',  # start from the base GPT-2 weights
              #restore_from='latest',  # continue from the last saved checkpoint
              run_name=run_name,
              print_every=50,
              sample_every=1000,
              #save_every=1000
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint /content/drive/My Drive/GPT2/models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from /content/drive/My Drive/GPT2/models/124M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:01<00:00,  1.70s/it]


dataset has 209028 tokens
Training...
[50 | 69.76] loss=3.82 avg=3.82
[100 | 131.42] loss=3.47 avg=3.64
[150 | 193.08] loss=3.47 avg=3.59
[200 | 254.75] loss=3.33 avg=3.52
[250 | 316.43] loss=2.74 avg=3.36
[300 | 378.13] loss=2.42 avg=3.20
[350 | 439.77] loss=2.38 avg=3.08
[400 | 501.46] loss=1.93 avg=2.93
[450 | 563.18] loss=2.08 avg=2.83
[500 | 624.86] loss=1.13 avg=2.65
[550 | 686.57] loss=1.59 avg=2.55
[600 | 748.26] loss=1.00 avg=2.42
[650 | 809.92] loss=0.79 avg=2.28
[700 | 871.61] loss=0.38 avg=2.14
[750 | 933.23] loss=0.71 avg=2.04
[800 | 994.85] loss=0.25 avg=1.92
[850 | 1056.57] loss=0.19 avg=1.81
[900 | 1118.24] loss=0.14 avg=1.71
[950 | 1179.88] loss=0.18 avg=1.62
[1000 | 1241.54] loss=0.11 avg=1.53
Saving /content/drive/My Drive/GPT2/checkpoint/GibberishDetector/model-1000
<>

<.<<0> epistemology the called putnam's "father a and of doctrine <|LABEL|> word salad <|END|>

<|SENTENCE|> a  teaylfinahaeh aohh ekeimt t(ak  floccis )isawtimr-naiad ihyc  hlisammihsm avoynohaovm(e

# Test Results

It's not science if you don't write down your results! Using delimiter tags works way better, so I'm just deleting results that were utter garbage. Use these tags, people! They seem to help orient GPT-2 enough to understand the pattern you want it to output.

- `<|SENTENCE|>`
- `<|LABEL|>`
- `<|END|>`
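
As a small sketch of how the three tags fit together, the helpers below build a training line, build the matching prompt for inference, and split a tagged line back into its parts. The function names are hypothetical; the string layout follows the corpus-building cell.

```python
SENTENCE_TAG = '<|SENTENCE|>'
LABEL_TAG = '<|LABEL|>'
END_TAG = '<|END|>'


def format_example(sentence, label):
    """Build one training line: tagged sentence followed by its tagged label."""
    return '%s %s %s %s %s' % (SENTENCE_TAG, sentence, LABEL_TAG, label, END_TAG)


def make_prompt(sentence):
    """Build an inference prompt that stops right after the label tag."""
    return '%s %s %s' % (SENTENCE_TAG, sentence, LABEL_TAG)


def parse_example(line):
    """Split a tagged line back into (sentence, label)."""
    body = line.split(SENTENCE_TAG, 1)[1]
    sentence, rest = body.split(LABEL_TAG, 1)
    label = rest.split(END_TAG, 1)[0]
    return sentence.strip(), label.strip()


line = format_example('the cat sat on the mat', 'clean')
print(line)
print(make_prompt('the cat sat on the mat'))
print(parse_example(line))
```

Because the prompt ends at `<|LABEL|>`, the model only has to complete the label and the closing tag.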

### Data

| Test | Model | Samples | Steps | Last Loss | Avg Loss | Accuracy |
|---|---|---|---|---|---|---|
|01|355M|3000|2000|0.14|1.34|85.7%|
|02|774M|3000|2000|0.04|0.87|80.0%|
|03|124M|3000|2000|0.07|0.73|87.0%|


In [None]:
# uncomment the following if fresh runtime
#import gpt_2_simple as gpt2
#run_name = 'GibberishDetector'
#model_dir = '/content/drive/My Drive/GPT2/models'
#checkpoint_dir = '/content/drive/My Drive/GPT2/checkpoint'
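#model_name = '355M'  # use the size you fine-tuned
#max_samples = 3000  # only needed for the summary printed at the end
#step_cnt = 2000  # only needed for the summary printed at the end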
#print('Starting TF session')
#sess = gpt2.start_tf_sess()
#print('Loading GPT2 model')
#gpt2.load_gpt2(sess, 
#               model_name=model_name,
#               model_dir=model_dir,
#               checkpoint_dir=checkpoint_dir,)


test_corpus = 'test_corpus.txt'
right = 0
wrong = 0

print('Loading test set...')
with open(test_corpus, 'r', encoding='utf-8') as file:
  test_set = file.readlines()

for t in test_set:
  t = t.strip()
  if t == '':
    continue
  prompt = t.split('<|LABEL|>')[0] + '<|LABEL|>'
  expect = t.split('<|LABEL|>')[1].replace('<|END|>', '').strip()
  #print('\nPROMPT:', prompt)
  response = gpt2.generate(sess, 
                           return_as_list=True,
                           length=30,  # prevent it from going too crazy
                           prefix=prompt,
                           model_name=model_name,
                           model_dir=model_dir,
                           truncate='\n',  # stop inferring here
                           include_prefix=False,
                           checkpoint_dir=checkpoint_dir,)[0]
  response = response.strip()
  if expect in response:
    right += 1
  else:
    wrong += 1
  print('right:', right, '\twrong:', wrong, '\taccuracy:', right / (right+wrong))
  #print('RESPONSE:', response)

print('\n\nModel:', model_name)
print('Samples:', max_samples)
print('Steps:', step_cnt)

Loading test set...
right: 1 	wrong: 0 	accuracy: 1.0
right: 2 	wrong: 0 	accuracy: 1.0
right: 3 	wrong: 0 	accuracy: 1.0
right: 4 	wrong: 0 	accuracy: 1.0
right: 5 	wrong: 0 	accuracy: 1.0
right: 5 	wrong: 1 	accuracy: 0.8333333333333334
right: 6 	wrong: 1 	accuracy: 0.8571428571428571
right: 7 	wrong: 1 	accuracy: 0.875
right: 8 	wrong: 1 	accuracy: 0.8888888888888888
right: 9 	wrong: 1 	accuracy: 0.9
right: 9 	wrong: 2 	accuracy: 0.8181818181818182
right: 10 	wrong: 2 	accuracy: 0.8333333333333334
right: 11 	wrong: 2 	accuracy: 0.8461538461538461
right: 12 	wrong: 2 	accuracy: 0.8571428571428571
right: 13 	wrong: 2 	accuracy: 0.8666666666666667
right: 14 	wrong: 2 	accuracy: 0.875
right: 15 	wrong: 2 	accuracy: 0.8823529411764706
right: 16 	wrong: 2 	accuracy: 0.8888888888888888
right: 17 	wrong: 2 	accuracy: 0.8947368421052632
right: 18 	wrong: 2 	accuracy: 0.9
right: 19 	wrong: 2 	accuracy: 0.9047619047619048
right: 19 	wrong: 3 	accuracy: 0.8636363636363636
right: 20 	wrong: 3 	a
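
The loop above counts a response as right whenever the expected label appears anywhere in it. A stricter variant, sketched below, keeps only the text before the first `<|END|>` tag, requires an exact match, and also tallies accuracy per label. The helper names and the demo responses are made up for illustration; they are not output from the model.

```python
from collections import Counter

END_TAG = '<|END|>'


def extract_label(response):
    """Keep only the text before the first <|END|> tag."""
    return response.split(END_TAG, 1)[0].strip()


def score(pairs):
    """pairs: iterable of (expected_label, raw_response).

    Returns (overall_accuracy, {label: accuracy}).
    """
    right = Counter()
    total = Counter()
    for expect, response in pairs:
        total[expect] += 1
        if extract_label(response) == expect:
            right[expect] += 1
    n = sum(total.values())
    overall = sum(right.values()) / n if n else 0.0
    per_label = {label: right[label] / total[label] for label in total}
    return overall, per_label


# Made-up responses, for illustration only
demo = [
    ('clean', 'clean <|END|>'),
    ('word salad', 'word salad <|END|>'),
    # a substring check would accept this one because 'noise' appears later
    ('noise', 'mild gibberish <|END|> <|SENTENCE|> noise'),
    ('mild gibberish', 'mild gibberish <|END|>'),
]
overall, per_label = score(demo)
print('accuracy:', overall)
print('per label:', per_label)
```

Per-label accuracy makes it easier to see which of the four classes the model confuses.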

# Conclusion

- With the `355M` model, the best sample count seems to be 3000, with 2000 steps
- Longer sentences, similar to the training data, tend to do better

## Future Work

- Try with larger models
- Try with different data sources, like Project Gutenberg