# NLG

## GPT2 124M Base Model
Run the cell below to generate some text with the base GPT-2 124M model.

In [7]:
from aitextgen import aitextgen

# Without any parameters, aitextgen() will download, cache, and load the 124M GPT-2 "small" model
ai = aitextgen(to_gpu=True)

# Key parameters include n for the number of samples, max_length to cap the length of the generated text,
# temperature (defaults to 0.7) for more creative (higher) or more conservative but repetitive (lower) output,
# and prompt, which is optional.
print("========1=========")
ai.generate()

print("========2=========")
ai.generate(n=3, max_length=100)

print("========3=========")
ai.generate(n=3, prompt="I believe in unicorns because", max_length=100)

INFO:aitextgen:Loading gpt2 model from /aitextgen.
INFO:aitextgen:Using the default GPT-2 Tokenizer.


"You can't ask for more from the administration than you can give. And when you look at it, it's not about what's in your heart. It's about what you're willing to compromise with the president, and what you're willing to listen to the president." —Sen. Marco Rubio (R-FL)

Sen. Marco Rubio, R-Florida, who has been under fire for his statements about the administration's policies in Syria, said Friday he's open to the president's proposed budget.

"I'm open to the president's proposal, which is a better alternative to the current administration's budget than what we've been negotiating with the administration," Rubio said on ABC's "This Week." "And I'm very willing to listen to the president's proposals."

Rubio also criticized former President Barack Obama's decision to make the U.S. military stay open in Syria, saying that "the world needs to look at this as an opportunity to do more than just pull out of a war that we're in. It's an opportunity to stand up to the president."

"I want 
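The effect of the `temperature` parameter described in the cell above is easiest to see on a toy example. The sketch below is not aitextgen's code: it assumes the usual scheme in which the model's raw scores (logits) are divided by the temperature before a softmax, and the three logits are made-up values for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens
toy_logits = [2.0, 1.0, 0.1]

for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

A lower temperature concentrates probability on the top-scoring token, which gives safer but more repetitive text. A higher temperature flattens the distribution, so less likely tokens are sampled more often.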

## CPU Finetuning
Run the cell below to carry out CPU finetuning on the Shakespeare plays dataset.

In [None]:
# Finetune CPU model
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU
from aitextgen import aitextgen

# The name of the downloaded Shakespeare text for training
file_name = "data/input.txt"

# Train a custom BPE Tokenizer on the downloaded text
# This will save two files: aitextgen-vocab.json and aitextgen-merges.txt,
# which are needed to rebuild the tokenizer.
train_tokenizer(file_name)
vocab_file = "aitextgen-vocab.json"
merges_file = "aitextgen-merges.txt"

# GPT2ConfigCPU is a mini variant of GPT-2 optimized for CPU-training
# e.g. the # of input tokens here is 64 vs. 1024 for base GPT-2.
config = GPT2ConfigCPU()
#config = None

# Instantiate aitextgen using the created tokenizer and config
ai = aitextgen(vocab_file=vocab_file, merges_file=merges_file, config=config)

# You can build datasets for training by creating TokenDatasets,
# which automatically processes the dataset with the appropriate size.
data = TokenDataset(file_name, vocab_file=vocab_file, merges_file=merges_file, block_size=64)

# Train the model! It will save pytorch_model.bin periodically and after completion.
# On a 2016 MacBook Pro, this took ~25 minutes to run.
ai.train(data, batch_size=16, num_steps=5000)

# Generate text from it!
ai.generate(10, prompt="ROMEO:")
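To make the roles of the vocab and merges files more concrete, here is a minimal sketch of the byte-pair-encoding (BPE) idea on a toy corpus. It is an illustration only and not what `train_tokenizer` runs internally: it works on characters rather than bytes, and the helper names and the three-word corpus are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency
words = {tuple("thee"): 5, tuple("the"): 4, tuple("thou"): 3}

merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # learned merge rules, in order
print(words)   # corpus re-expressed with the merged symbols
```

In this picture, the ordered list of merge rules corresponds to what a merges file stores, and the set of resulting symbols corresponds to the vocabulary.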

## GPU Finetuning
Run the cell below to carry out GPU finetuning on the Shakespeare plays dataset.

In [None]:
# Finetune GPT-2 model on GPU
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen import aitextgen

# The name of the downloaded Shakespeare text for training
file_name = "data/input.txt"

# Train a custom BPE Tokenizer on the downloaded text
# This will save two files: aitextgen-vocab.json and aitextgen-merges.txt,
# which are needed to rebuild the tokenizer.
train_tokenizer(file_name)
vocab_file = "aitextgen-vocab.json"
merges_file = "aitextgen-merges.txt"

# No custom config for GPU training: with config=None, aitextgen falls back to its default GPT-2 model
config = None

# Instantiate aitextgen using the created tokenizer and config
ai = aitextgen(vocab_file=vocab_file, merges_file=merges_file, config=config)

# You can build datasets for training by creating TokenDatasets,
# which automatically processes the dataset with the appropriate size.
data = TokenDataset(file_name, vocab_file=vocab_file, merges_file=merges_file, block_size=64)

# Train the model! It will save pytorch_model.bin periodically and after completion.
# (The ~25 minute figure on a 2016 MacBook Pro refers to the CPU run, not to GPU training.)
ai.train(data, batch_size=16, num_steps=5000)

# Generate text from it!
ai.generate(10, prompt="ROMEO:")
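As a rough sense of scale for the training settings used in the cells above, the arithmetic below estimates how many sequences and tokens the model sees. It is a back-of-the-envelope sketch that assumes each step processes exactly one batch of `batch_size` sequences of `block_size` tokens each, with no gradient accumulation. The helper name `tokens_seen` is hypothetical.

```python
def tokens_seen(batch_size, num_steps, block_size):
    """Estimate sequences and tokens processed over a training run."""
    sequences = batch_size * num_steps
    return sequences, sequences * block_size

# Values taken from the training cells: batch_size=16, num_steps=5000, block_size=64
sequences, tokens = tokens_seen(batch_size=16, num_steps=5000, block_size=64)
print(f"{sequences:,} sequences, {tokens:,} tokens")
# prints "80,000 sequences, 5,120,000 tokens"
```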

## Saving Model
Once training is complete, you should have the following four files, which you can later use to load the model:
1. Vocab file: In the examples above, this is ```aitextgen-vocab.json``` in the root folder.
2. Merges file: In the examples above, this is ```aitextgen-merges.txt``` in the root folder.
3. Config file: In the examples above, this is ```config.json``` in the ```trained_model``` folder within the root folder.
4. Model file: In the examples above, this is ```pytorch_model.bin``` in the ```trained_model``` folder within the root folder.
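Before moving or loading a trained model, it can help to confirm that all four files are present. The sketch below uses only the standard library. The helper `missing_model_files` is hypothetical, and it assumes the model file is named `pytorch_model.bin`, as in the training comments and the loading cell.

```python
from pathlib import Path

def missing_model_files(root=".", model_dir="trained_model"):
    """Return the expected model files that are not present on disk."""
    root = Path(root)
    expected = [
        root / "aitextgen-vocab.json",
        root / "aitextgen-merges.txt",
        root / model_dir / "config.json",
        root / model_dir / "pytorch_model.bin",
    ]
    return [str(p) for p in expected if not p.is_file()]

missing = missing_model_files()
if missing:
    print("Missing files:", missing)
else:
    print("All four model files are present.")
```

Pass a different `root` or `model_dir` if the files were moved after training.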

## Loading Model
Run the cell below to load and use the CPU-trained GPT2 model.

In [14]:
from aitextgen import aitextgen
ai = aitextgen(
    model="small_models/cpu_shakespeare/pytorch_model.bin",
    config="small_models/cpu_shakespeare/config.json",
    vocab_file="small_models/cpu_shakespeare/aitextgen-vocab.json",
    merges_file="small_models/cpu_shakespeare/aitextgen-merges.txt",
    to_gpu=True,
)
ai.generate(1, prompt="ROMEO:")

INFO:aitextgen:Loading GPT-2 model from provided small_models/cpu_shakespeare/pytorch_model.bin.
INFO:aitextgen:Using a custom tokenizer.


ROMEO:
This's the good lord, and not;
Your father of thee to the time:
What, he am not a brother.

KING RICHARD II:
We do not, I know you,
I am I am you, and my lord, I am a crown?

CAMILLO
