# Generator File: GPT2-Small Finetuned incl. unfreezing


This notebook loads the saved weights of the fine-tuned model together with the helper functions, and generates text output.



## 1. Importing libraries and functions

We import the necessary libraries together with their functions and classes. The imported libraries are:

- Hugging Face transformers & tokenizers
- Fastai
- Fastbook
- Pandas

We trained our model by fine-tuning the pre-trained GPT2 model from the Hugging Face Transformers library on a corpus of children's stories, gradually training deeper layers of the network. The corpus was tokenized with the Hugging Face GPT2 tokenizer. We used the same tokenizer as the original GPT2 model so that the new corpus is split into tokens in the same way as during pre-training. The GPT-2 tokenizer is based on byte-level Byte-Pair Encoding (BPE).
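
To illustrate the idea behind byte-level BPE, here is a minimal toy sketch of a single merge step in pure Python. It is not the Hugging Face implementation: the real GPT-2 tokenizer applies a fixed table of merges learned on a large corpus and pre-splits the text with a regular expression first. The helper names `most_frequent_pair` and `merge_pair` and the sample string are made up for this illustration.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from single bytes, as byte-level BPE does (toy input, assumption)
text = "aaabdaaabac"
tokens = [bytes([b]) for b in text.encode("utf-8")]

pair = most_frequent_pair(tokens)
tokens_after_merge = merge_pair(tokens, pair)
print(pair, len(tokens), "->", len(tokens_after_merge))
```

Repeating this step many times on a training corpus yields the merge table; at tokenization time the learned merges are replayed in order, which is why the fine-tuning corpus has to go through the same tokenizer as the pre-training data.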

We also import a pre-defined function for text generation with the finetuned model weights.


---



In [None]:
try:
    print(gpt2_libraries_progress)
except NameError:
    %run ./generation/generator_gpt2_libraries.ipynb

## 2. Importing model

We import the fine-tuned model including its weights. For fine-tuning, we started from the pre-trained GPT2-Small model and trained it on a small dataset of texts for young children from the Gutenberg library, gradually unfreezing the top layers of the pre-trained GPT2-Small model.

In [None]:
print("Loading model...")

In [None]:
# Split a GPT2 model in 4 groups for differential learning rates (Code from Finetuning English GPT2 to any language)
# Note: `nn`, `L` and `params` are expected to be in scope from the imports run in section 1.

def splitter(model):
    "Split a GPT2 `model` in 4 groups for differential learning rates."
    
    # First layers group : decoder blocks from 0 to 3
    modules = []
    for i in range(4): modules.append(model.transformer.h[i])
    groups = [nn.Sequential(*modules)]

    # Second layers group : decoder blocks from 4 to 7
    modules = []
    for i in range(4,8,1): modules.append(model.transformer.h[i])
    groups = L(groups + [nn.Sequential(*modules)])

    # Third layers group : decoder blocks from 8 to 11
    modules = []
    for i in range(8,12,1): modules.append(model.transformer.h[i])
    groups = L(groups + [nn.Sequential(*modules)])
    
    # Fourth layers group : embeddings matrices wte and wpe + LayerNorm at the model output
    groups = L(groups + [nn.Sequential(model.transformer.wte,model.transformer.wpe,model.transformer.ln_f)])
    
    return groups.map(params)
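
The four groups returned by `splitter` are the units that fastai freezes and unfreezes. As a rough illustration, the following pure-Python sketch mimics the rule behind fastai's `freeze_to(n)`: groups with an index below `n` are frozen and the rest are trainable, with a negative `n` counted from the end. The function `trainable_groups` and the schedule shown are hypothetical; the exact unfreezing schedule used for training is not part of this notebook.

```python
def trainable_groups(n_groups, n):
    """Return one flag per parameter group: True = trainable, False = frozen."""
    # A negative n counts from the end, e.g. -1 means "train only the last group"
    frozen_idx = n if n >= 0 else n_groups + n
    return [i >= frozen_idx for i in range(n_groups)]

# Group order as returned by splitter() in this notebook
group_names = ["blocks 0-3", "blocks 4-7", "blocks 8-11", "wte + wpe + ln_f"]

# Hypothetical gradual-unfreezing schedule: one more group becomes trainable per stage
schedule = {}
for n in (-1, -2, -3, 0):
    flags = trainable_groups(len(group_names), n)
    schedule[n] = [name for name, on in zip(group_names, flags) if on]
    print(f"freeze_to({n}): trainable = {schedule[n]}")
```

In fastai, `learn.freeze()` corresponds to `freeze_to(-1)` and `learn.unfreeze()` to `freeze_to(0)`.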

In [None]:
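# Load model (the pickled learner presumably references `splitter`, so it has to be defined before this call;
# `model_path` is expected to be set by the notebook run in section 1)
gpt2S_tuned_on_tandc_unfrozen = load_learner(model_path + "gpt2S_tuned_on_tandc_unfrozen.pkl")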

In [None]:
# Define tokenizer
gpt2s_tokenizer = GPT2TokenizerFast.from_pretrained('gpt2', add_prefix_space=True)

## 3. Loading functions

We wrap the imported text generation function. It is based on the `model.generate()` function and uses random sampling with the following hyperparameters:
- TEMP = 0.6: Temperature controls the randomness of predictions by dividing the logits by the temperature before the softmax is applied. With a low temperature (e.g. 0.2), the model is more confident but also more conservative; with a high temperature (e.g. 1.0), it generates more diverse text but also makes more mistakes.
- TOP_K = 40: at each generation step, only the K most probable next tokens are kept, and the next token is sampled from the renormalized distribution over them. Sampling among several candidates, rather than always taking the single most likely token, helps avoid repetitive text.
- TOP_P = 0.85: Top-P sampling (also called nucleus sampling) is similar to Top-K, but instead of keeping a fixed number of tokens, we keep the smallest set of tokens whose cumulative probability reaches p.

Combining top-k and top-p reduces the chance of sampling unlikely (rare) words, while the size of the candidate set still adapts to how confident the model is at each step.
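
The interplay of the three hyperparameters can be sketched in pure Python. This is a simplified illustration of the sampling steps, not the code that `model.generate()` runs internally; the function names `filter_distribution` and `sample_token` and the example logits are assumptions made for this sketch.

```python
import math
import random

def filter_distribution(logits, temperature=0.6, top_k=40, top_p=0.85):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: divide the logits before the softmax
    scaled = [x / temperature for x in logits]
    # Top-k: keep the indices of the k highest logits, best first
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability)
    top = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - top) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i, p in zip(ranked, probs):
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1
    norm = sum(p for _, p in kept)
    return {i: p / norm for i, p in kept}

def sample_token(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]

# Made-up logits for a 5-token vocabulary
example_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
dist = filter_distribution(example_logits, temperature=0.6, top_k=3, top_p=0.85)
print(dist)
print(sample_token(example_logits, temperature=0.6, top_k=3, top_p=0.85))
```

In this toy example, top-k first narrows five candidates to three, and top-p then drops the third because the first two already cover more than 85% of the probability mass.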

In [None]:
def gen_story_gpt2s_tunedonTC_unfrozen(seed, max_len):
  return gen_story(my_model = gpt2S_tuned_on_tandc_unfrozen, my_tokenizer = gpt2s_tokenizer, seed=seed, max_len=max_len) 

In [None]:
clear_output()
print("Function and Model now available for use:")
print("    gen_story_gpt2s_tunedonTC_unfrozen(seed, max_len)")
print("    *Based on GPT2-Small  | Tuned on input_stories_toddlerpluschildren.txt | Unfrozen")