# AI TEXT GENERATION FOR BEGINNERS: CREATING AND FINE TUNING YOUR OWN SINGLISH CHATBOT

# PART 4: ALTERNATIVE USING CPU

If you are just starting out in NLP/data science, you probably don't want to pay for Colab Pro and additional storage before you are sure this is something you will be doing a lot more of. Thankfully, there are CPU-based alternatives if you are merely dipping your toes into this area.

I'll just highlight one that I've been trying out - [aitextgen](https://github.com/minimaxir/aitextgen) by Max Woolf.

The code below is lifted from the demo on his GitHub repo. This notebook ran for about 2 hours on a 2018 Mac Mini (64GB RAM), but you can dial down the training parameters for faster iteration. The results weren't as satisfactory as the ones from fine-tuning DialoGPT-medium, but I reckon it's not a fair comparison: here a mini GPT-2 variant is trained from scratch on CPU, rather than a pretrained model being fine-tuned.

If you are keen to run aitextgen on Colab instead, here's another [demo notebook](https://colab.research.google.com/drive/144MdX5aLqrQ3-YW-po81CQMrD6kpgpYh) you can try.

In [None]:
# at the time of writing, aitextgen doesn't work with transformers > v3.0.0
# check the version on your local machine if there's a conflict

from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU
from aitextgen import aitextgen
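
If the imports above fail, a quick way to check the installed transformers version is sketched below. This is only an illustration using the standard library: `parse_version` and `is_compatible` are hypothetical helper names, and the `3.0.0` ceiling is taken from the comment above, so adjust it if aitextgen's requirements have changed since.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '3.0.2' into (3, 0, 2), ignoring any non-numeric suffix such as 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def is_compatible(installed, ceiling="3.0.0"):
    """True when the installed version is not newer than the ceiling."""
    return parse_version(installed) <= parse_version(ceiling)


try:
    installed = version("transformers")
    status = "ok" if is_compatible(installed) else "newer than 3.0.0, may conflict"
    print(f"transformers {installed}: {status}")
except PackageNotFoundError:
    print("transformers is not installed")
```

If the check reports a newer version, installing an older transformers release in a separate virtual environment is one way to avoid disturbing your other projects.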

In [None]:
# this was created at the end of notebook 1.0

file_name = "../data/singlish_sms.txt"

In [None]:
# Train a custom BPE Tokenizer
# This will save two files: aitextgen-vocab.json and aitextgen-merges.txt, which are needed to rebuild the tokenizer.

train_tokenizer(file_name)
vocab_file = "aitextgen-vocab.json"
merges_file = "aitextgen-merges.txt"
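
To get a feel for what a BPE tokenizer learns, here is a toy sketch of the core idea: repeatedly find the most frequent pair of adjacent symbols and merge it into one. It is a simplified character-level illustration, not what `train_tokenizer` does internally (the real thing works at byte level on the whole corpus), and the three-word corpus and the helper names `most_frequent_pair` and `merge_pair` are made up for the example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged


# Toy corpus (made up for illustration), split into single characters.
words = [list(w) for w in ["haha", "hahaha", "lah"]]
merges = []
for _ in range(2):  # learn two merge rules
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # the learned merge rules, in order
print(words)   # the corpus re-segmented with those rules
```

The ordered list of merges plays the role of the merges file, and the set of resulting symbols plays the role of the vocab file, which is why both are needed to rebuild the tokenizer.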

In [None]:
# GPT2ConfigCPU is a mini variant of GPT-2 optimized for CPU training
# e.g. the number of input tokens here is 64 vs. 1024 for base GPT-2

config = GPT2ConfigCPU()

In [None]:
# Instantiate aitextgen using the created tokenizer and config
ai = aitextgen(vocab_file=vocab_file, merges_file=merges_file, config=config)

In [None]:
# You can build datasets for training by creating TokenDatasets,
# which automatically process the text into blocks of the appropriate size.

data = TokenDataset(
    file_name,
    vocab_file=vocab_file,
    merges_file=merges_file,
    block_size=64,
    num_workers=12,
)


In [None]:
ai.train(
    data,
    batch_size=32,
    num_steps=20000,
    num_workers=12,
    learning_rate=1e-5,
    generate_every=10000,
    save_every=10000,
)
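
If you want to dial down the training parameters, it helps to estimate how many tokens a run pushes through the model. The sketch below is rough arithmetic only: it assumes each step processes one batch of `batch_size` blocks of `block_size` tokens, with no gradient accumulation. `tokens_processed` is a hypothetical helper, and the 5,000-step run is just an example value, not a recommendation.

```python
def tokens_processed(batch_size, block_size, num_steps):
    """Tokens the model sees in total: one batch of fixed-length blocks per step."""
    return batch_size * block_size * num_steps


# Settings used in this notebook.
full_run = tokens_processed(batch_size=32, block_size=64, num_steps=20000)

# A shorter run for quick experiments (example value only).
short_run = tokens_processed(batch_size=32, block_size=64, num_steps=5000)

print(f"full run:  {full_run:,} tokens")
print(f"short run: {short_run:,} tokens ({short_run / full_run:.0%} of the full run)")
```

Training time should shrink roughly in proportion, though that is an assumption worth checking on your own machine. If you reduce `num_steps`, remember to lower `generate_every` and `save_every` too, or you won't see any samples or checkpoints before the run ends.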

Sample training output you can expect to see:

10,000 steps reached: saving model to /trained_model
10,000 steps reached: generating sample texts.

==========
', 'Haha s my not so I know 'I just m in the the the best Haha', 'Haha what s the lot', 'Haha oh Haha I think we think t go my way time I m s the not I no I want to see I have to make me and it s I can be',
==========

20,000 steps reached: saving model to /trained_model
20,000 steps reached: generating sample texts.

==========
 slp', 'Hahaha okay it s really just be out for the last one haha I just know I m m at the house', 'Eh I m sure', 'Haha no no you want that s', 'Haha okay how to do you all', 'Haha but I m not not free', 'Haha no also

In [None]:
ai.generate(n=5, max_length=15, prompt="eh how ah")