<a href="https://colab.research.google.com/github/daveshap/GibberishDetector/blob/main/GibberishDetector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Requirements

- Run this notebook in a GPU runtime! GPU is way faster than TPU
- Click `Runtime >> Change runtime type >> GPU`

In [None]:
!pip install tensorflow-gpu==1.15.0 --quiet
!pip install gpt-2-simple --quiet 

# Compile Training Corpus

- Use the [WikipediaDataBuilder](https://github.com/daveshap/GibberishDetector/blob/main/WikipediaDataBuilder.ipynb) notebook to create the base data for compiling the corpus
- Data is also available in the [GibberishDetector GitHub repo](https://github.com/daveshap/GibberishDetector)

In [None]:
from random import sample, seed

gdrive_dir = '/content/drive/My Drive/WikiData/'

files = [  # source, label
('%swiki_sentences.txt' % gdrive_dir, 'clean'), 
('%sshuffled_characters.txt' % gdrive_dir, 'noise'),
('%sshuffled_words.txt' % gdrive_dir, 'word salad'),
('%smild_gibberish.txt' % gdrive_dir, 'mild gibberish'),
]

# train
result = list()
max_samples = 3000
corpus = 'corpus.txt' 

# test
test_samples = 50
test_corpus = 'test_corpus.txt'


for file in files:
  with open(file[0], 'r', encoding='utf-8') as infile:
    lines = infile.readlines()
  for line in lines:
    line = line.strip()
    if line == '':
      continue
    line = line.lower().replace('.', '')  # lowercase and drop periods so the model can't cheat on casing or punctuation
    line = '<|SENTENCE|> %s <|LABEL|> %s <|END|>' % (line, file[1])
    result.append(line)


# save train set

seed()
subset = sample(result, max_samples)

with open(corpus, 'w', encoding='utf-8') as outfile:
  for line in subset:
    outfile.write(line+'\n\n')
print(corpus, 'saved!')


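# save test set
# note: this samples from the same pool as the train set, so test lines can
# overlap with training lines; seed() with no argument is also not reproducible

seed()
subset = sample(result, test_samples)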

with open(test_corpus, 'w', encoding='utf-8') as outfile:
  for line in subset:
    outfile.write(line+'\n\n')
print(test_corpus, 'saved!')
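
The cell above draws the train and test subsets independently from the same `result` list, so a test sentence can also show up in training. A minimal sketch of a disjoint, repeatable split follows. The helper name `split_train_test` and the fixed seed value are assumptions for illustration, not part of the original notebook, and the toy list stands in for `result`.

```python
from random import Random


def split_train_test(lines, n_train, n_test, rng_seed=42):
    """Return disjoint (train, test) lists drawn from `lines`.

    `rng_seed` is a hypothetical fixed seed so the split can be repeated.
    """
    if n_train + n_test > len(lines):
        raise ValueError('not enough lines for a disjoint split')
    rng = Random(rng_seed)
    # sample once without replacement, then cut the sample in two
    picked = rng.sample(lines, n_train + n_test)
    return picked[:n_train], picked[n_train:]


# toy data standing in for `result`
demo = ['<|SENTENCE|> line %d <|LABEL|> clean <|END|>' % i for i in range(20)]
train, test = split_train_test(demo, n_train=15, n_test=5)
print(len(train), len(test), len(set(train) & set(test)))
```

In the notebook, this would be called as `split_train_test(result, max_samples, test_samples)`. Accuracy measured on a disjoint test set would not be directly comparable to the numbers already recorded in the results table.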

# Load Model
Let's use Google Drive to store the model for persistence. We will want to fine-tune the model iteratively to get better and better performance. We will also want to use the model again later, after pouring so much work into it!

See the [`download_gpt2` function source](https://github.com/minimaxir/gpt-2-simple/blob/92d35962d9aaeadba70e39d11d040f1e377ffdb3/gpt_2_simple/gpt_2.py#L64) for details.

### Model Sizes
- `124M`
- `355M`
- `774M`
- `1558M`

In [None]:
import gpt_2_simple as gpt2

# note: manually mount your Google Drive from the file explorer on the left before running this cell

model_dir = '/content/drive/My Drive/GPT2/models'
checkpoint_dir = '/content/drive/My Drive/GPT2/checkpoint'
model_name = '355M'

gpt2.download_gpt2(model_name=model_name, model_dir=model_dir)
print('\n\nModel is ready!')

# Finetune GPT2

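See the [`finetune` function source](https://github.com/minimaxir/gpt-2-simple/blob/92d35962d9aaeadba70e39d11d040f1e377ffdb3/gpt_2_simple/gpt_2.py#L127) for details.

- Rerun for subsequent training sessions; switch `restore_from` to `'latest'` to continue from the last checkpoint instead of starting fresh
- Click on `Runtime >> Restart and run all`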

In [None]:
file_name = 'corpus.txt'
run_name = 'GibberishDetector'
model_dir = '/content/drive/My Drive/GPT2/models'
checkpoint_dir = '/content/drive/My Drive/GPT2/checkpoint'
model_name = '355M'
step_cnt = 2000

sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name=model_name,
              model_dir=model_dir,
              checkpoint_dir=checkpoint_dir,
              steps=step_cnt,
              restore_from='fresh',  # start from scratch
              #restore_from='latest',  # continue from last work
              run_name=run_name,
              print_every=50,
              sample_every=1000,
              save_every=1000
              )
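
Switching between `restore_from='fresh'` and `restore_from='latest'` by commenting lines in and out is easy to forget between sessions. A minimal sketch of choosing the mode automatically follows. It assumes that checkpoints for a run are written under `checkpoint_dir/run_name` and that TensorFlow leaves an index file named `checkpoint` there; both are assumptions about the on-disk layout, and `choose_restore_mode` is a hypothetical helper, not part of gpt-2-simple.

```python
import os


def choose_restore_mode(checkpoint_dir, run_name):
    """Return 'latest' if a previous run left a checkpoint, else 'fresh'.

    Assumes checkpoints live in <checkpoint_dir>/<run_name>/ and that a
    file named 'checkpoint' marks a saved run (an assumption, see above).
    """
    run_path = os.path.join(checkpoint_dir, run_name)
    has_checkpoint = os.path.isfile(os.path.join(run_path, 'checkpoint'))
    return 'latest' if has_checkpoint else 'fresh'


# example: a directory that does not exist yet means there is nothing to resume
print(choose_restore_mode('/tmp/no_such_checkpoint_dir', 'GibberishDetector'))
```

The returned string could then be passed as `restore_from=choose_restore_mode(checkpoint_dir, run_name)` in the `gpt2.finetune` call above.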

# Test Results

It's not science if you don't write down your results! Using delimiter tags works way better, so I'm deleting the earlier results that were utter garbage. Use these tags, people! They seem to help orient GPT-2 enough to understand the pattern you want it to output.

- `<|SENTENCE|>`
- `<|LABEL|>`
- `<|END|>`

### Data

| Test | Model | Training Samples | Steps | Last Loss | Avg Loss | Accuracy (correct/tested) | Evaluation |
|---|---|---|---|---|---|---|---|
|01|355M|5000|2000|0.36|2.46|5/9| Mostly good, created some random labels, came unglued a couple times|
|02|355M|5000|4000|0.27|1.64|0/9| Major regression in quality, not a single accurate label|
|03|355M|5000|1500|1.73|2.75|5/9| Mostly good, reliably generates accurate labels, went random on a few examples|
|04|355M|5000|2500|0.10|1.87|1/11|Many labels were literally `icky`|
|05|355M|5000|1000|0.91|3.04|0/11|Mostly just spit out `END` with no labels|
|06|355M|6000|2000|0.95|2.50|3/11|Mix of just `end` with some stuck on repeat|
|07|355M|4000|2000|0.17|1.85|9/11|Best results so far!|
|08|355M|3000|2000|0.17|1.32|10/11|Even better!|
|09|355M|3000|2000|0.29|1.46|7/11|Repeating results, not as good|
|10|355M|3500|2000|0.06|1.82|5/11|Less is more, apparently|
|11|355M|2000|2000|0.12|0.86|1/11|Not enough|
|12|355M|3000|1500|0.17|1.84|5/11|A little better|
|13|355M|3000|2500|0.08|1.20|4/11|A little worse|
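
The accuracy column mixes denominators (9 in the early tests, 11 later), so raw counts are not directly comparable. A small sketch that converts a few of the fractions from the table to percentages and ranks them follows; the `runs` list is copied by hand from the table, and `to_percent` is a hypothetical helper.

```python
# (test id, accuracy) pairs copied from the table above
runs = [
    ('01', '5/9'),
    ('03', '5/9'),
    ('07', '9/11'),
    ('08', '10/11'),
    ('09', '7/11'),
    ('10', '5/11'),
]


def to_percent(score):
    """Convert a 'correct/total' string into a percentage."""
    correct, total = score.split('/')
    return 100.0 * int(correct) / int(total)


# best run first
ranked = sorted(runs, key=lambda r: to_percent(r[1]), reverse=True)
for test_id, score in ranked:
    print(test_id, score, '%.1f%%' % to_percent(score))
```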



In [None]:
test_corpus = 'test_corpus.txt'
run_name = 'GibberishDetector'
model_dir = '/content/drive/My Drive/GPT2/models'
checkpoint_dir = '/content/drive/My Drive/GPT2/checkpoint'
model_name = '355M'
results = list()

print('Loading test set...')
with open(test_corpus, 'r', encoding='utf-8') as file:
  test_set = file.readlines()

# uncomment the following if fresh runtime
#import gpt_2_simple as gpt2

#print('Starting TF session')
#sess = gpt2.start_tf_sess()
#print('Loading GPT2 model')
# note: passing model_name here appears to make load_gpt2 load the base model
# from model_dir, so pass run_name to pick up the fine-tuned checkpoint instead
#gpt2.load_gpt2(sess,
#               run_name=run_name,
#               checkpoint_dir=checkpoint_dir,)

for t in test_set:
  t = t.strip()
  if t == '':
    continue
  prompt = t.split('<|LABEL|>')[0] + '<|LABEL|>'
  print('\nPROMPT:', prompt)
  response = gpt2.generate(sess, 
                           return_as_list=True,
                           length=30,  # prevent it from going too crazy
                           prefix=prompt,
                           model_name=model_name,
                           model_dir=model_dir,
                           truncate='\n',  # stop inferring here
                           include_prefix=False,
                           checkpoint_dir=checkpoint_dir,)[0]
  response = response.strip()
  print('RESPONSE:', response)

print('\n\nModel:', model_name)
print('Samples:', max_samples)
print('Steps:', step_cnt)
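
The loop above only prints each prompt and response, so the accuracy column in the table has to be counted by hand. A minimal sketch of scoring automatically follows. It assumes the response contains the predicted label followed by `<|END|>`, as in the training format; the helper names are hypothetical, and the demo pairs are invented examples rather than real model output.

```python
def expected_label(test_line):
    """Pull the gold label out of a '<|SENTENCE|> ... <|LABEL|> ... <|END|>' line."""
    return test_line.split('<|LABEL|>')[1].split('<|END|>')[0].strip()


def predicted_label(response):
    """Keep whatever the model generated before the first <|END|> tag."""
    return response.split('<|END|>')[0].strip()


def accuracy(pairs):
    """pairs: iterable of (test_line, response). Returns (correct, total)."""
    pairs = list(pairs)
    correct = sum(
        1 for line, resp in pairs
        if expected_label(line) == predicted_label(resp)
    )
    return correct, len(pairs)


# invented examples, not real model output
demo_pairs = [
    ('<|SENTENCE|> the cat sat on the mat <|LABEL|> clean <|END|>', ' clean <|END|>'),
    ('<|SENTENCE|> mat the on sat cat the <|LABEL|> word salad <|END|>', ' noise <|END|>'),
]
print('%d/%d' % accuracy(demo_pairs))
```

In the test loop, each `(t, response)` pair could be appended to the otherwise unused `results` list and passed to `accuracy(results)` at the end. Exact string matching is strict by design: a made-up label such as `icky` simply counts as wrong.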

# Conclusion

- With the `355M` model, the best results came from 3000 samples and 2000 steps (test 08: 10/11), though a repeat with the same settings (test 09) scored only 7/11
- Longer sentences, similar to the training data, tend to do better

## Future Work

- Try with larger models
- Try with different data sources, like Project Gutenberg