# CROPLAND AI Christmas Card Generator
The code in this notebook clones the public GitHub repository of our 2022 Christmas Card generator application. The repository stores the data and code for an application that lets end users generate a personalized Christmas card using generative AI. Next, we show how you can fine-tune GPT-2 to generate a particular genre of text, namely Christmas and New Year's wishes.

GPT-2 medium is a fairly large model, so you will want to run this notebook on a hardware-accelerated (GPU) runtime.

## Clone the git repository to fetch the data 

In [None]:
%%bash 
git clone https://github.com/cropland-bv/ai_christmas_card_app

## Install missing Python dependencies

In [None]:
%pip install transformers fastai

## Run imports

In [42]:
import os, random, glob
import numpy as np
from fastai.text.all import *
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from transformers import TextDataset
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
from transformers import DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments,AutoModelWithLMHead
from transformers import TextGenerationPipeline


## Define data paths and fetch tokenizer from HuggingFace hub

In [43]:
path = os.path.join(os.getcwd(), "ai_christmas_card_app","training_gpt2", "input", "poemsdataset", "forms", "nieuwjaarsbrieven")
files = glob.glob( path +'/*.txt' )
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
train_path = 'train.txt'
test_path = 'test.txt'

## Create train/test split based on the list of txt files downloaded from GitHub
We write 80% of the files to "train.txt", and reserve 20% of the files for the test set.

In [45]:
# Pick a random 80% of the file indices for the training set
train_indices = set(random.sample(range(len(files)), int(0.8 * len(files))))

# Open in "w" mode so that re-running the cell does not append duplicate text
with open(train_path, "w") as train, open(test_path, "w") as test:
    for i, file_name in enumerate(files):
        # The with-block closes each input file automatically
        with open(file_name, "r") as text_input:
            target = train if i in train_indices else test
            target.write(text_input.read())
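
The split above draws a fresh random sample on every run, so the contents of the train and test files change between runs. If you want a reproducible split, one option is a seeded shuffle. The sketch below is a minimal illustration using only the standard library; `split_files`, its `seed` parameter and the example file names are hypothetical and not part of the repository.

```python
import random

def split_files(paths, train_fraction=0.8, seed=0):
    """Return (train_paths, test_paths) using a seeded shuffle."""
    rng = random.Random(seed)   # local RNG, leaves the global random state untouched
    shuffled = sorted(paths)    # sort first so the result does not depend on glob order
    rng.shuffle(shuffled)
    cutoff = int(train_fraction * len(shuffled))
    return shuffled[:cutoff], shuffled[cutoff:]

# Hypothetical file names, for illustration only
example = [f"letter_{i}.txt" for i in range(10)]
train_files, test_files = split_files(example)
print(len(train_files), len(test_files))
```

In the notebook, you would pass `files` instead of `example` and write the two resulting lists to `train.txt` and `test.txt`.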


## Functions to load and tokenize the dataset

In [46]:

def load_dataset(file_path, tokenizer, block_size = 75):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset

def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, 
        mlm=mlm,
    )
    return data_collator
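
To our understanding, `TextDataset` tokenizes the whole text file and cuts the resulting token ids into consecutive blocks of `block_size` tokens, dropping a trailing remainder that does not fill a complete block. With `mlm=False`, the data collator prepares the batches for causal language modeling, where the labels are a copy of the input ids. The sketch below illustrates the chunking idea in plain Python; it is not the library's implementation, and `chunk_token_ids` and `fake_ids` are hypothetical names.

```python
def chunk_token_ids(token_ids, block_size=75):
    """Cut a flat list of token ids into consecutive, non-overlapping blocks.

    A trailing remainder shorter than block_size is dropped.
    """
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]

# Stand-in token ids; a real run would use the ids produced by the tokenizer
fake_ids = list(range(200))
blocks = chunk_token_ids(fake_ids, block_size=75)
print(len(blocks), [len(block) for block in blocks])
```

Because the files are concatenated before chunking, a single block can span the end of one letter and the start of the next.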



## Apply loading/tokenization functions to create a train and a test dataset for our model fine-tuning

In [None]:
train_dataset = load_dataset('train.txt', tokenizer)
test_dataset = load_dataset('test.txt', tokenizer)
data_collator = load_data_collator(tokenizer)

## Pull the GPT-2 medium model checkpoint from the HuggingFace hub

In [None]:
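# AutoModelWithLMHead is deprecated; GPT2LMHeadModel (imported above) loads GPT-2 with its language modeling head
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")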

## Define the arguments for fine-tuning the model

In [None]:
training_args = TrainingArguments(
    output_dir=os.path.join(os.getcwd(), "app", "model", "files_for_huggingface_2"),  # where checkpoints and the final model are written
    overwrite_output_dir=True,  # overwrite the content of the output directory
    num_train_epochs=3,  # number of training epochs
    per_device_train_batch_size=2,  # batch size for training
    per_device_eval_batch_size=2,  # batch size for evaluation
    eval_steps=400,  # update steps between two evaluations; only used when the evaluation strategy is "steps"
    save_steps=800,  # save a checkpoint every 800 update steps
    warmup_steps=500,  # number of warmup steps for the learning rate scheduler
    learning_rate=9.120108734350652e-05,  # peak learning rate, reached at the end of warmup
)
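
By default, the `Trainer` uses a linear learning rate schedule: the rate rises linearly from 0 to `learning_rate` over the first `warmup_steps` updates and then decays linearly to 0 at the end of training. The sketch below reproduces that shape in plain Python so you can see what rate applies at a given step. `linear_warmup_lr` is a hypothetical helper, and `TOTAL_STEPS` is an assumed value, since the real number depends on the size of the dataset.

```python
def linear_warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at a given update step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

PEAK_LR = 9.120108734350652e-05
WARMUP_STEPS = 500
TOTAL_STEPS = 3000  # assumed for illustration; depends on dataset size, batch size and epochs

for step in (0, 250, 500, 1750, 3000):
    print(step, linear_warmup_lr(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS))
```

Note that if a training run has fewer than 500 update steps in total, the learning rate never reaches its peak, so it is worth checking the step count that `trainer.train()` reports against `warmup_steps`.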



## Create a trainer object to perform the fine-tuning

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)



## Run the train() method to fine-tune the model
At this point, you may want to stretch your legs for a bit, because training will take some time.

In [None]:
trainer.train()


## After training, save the model

In [None]:
trainer.save_model()
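
A common way to check how well the fine-tuned model fits the held-out test set is perplexity, the exponential of the mean cross-entropy loss. Calling `trainer.evaluate()` returns a dictionary of metrics that includes `eval_loss`. The sketch below shows the conversion only; `perplexity_from_loss` is a hypothetical helper, and the loss value is a placeholder rather than a measured result.

```python
import math

def perplexity_from_loss(eval_loss):
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(eval_loss)

# In the notebook this would be: metrics = trainer.evaluate()
metrics = {"eval_loss": 3.0}  # placeholder value, not a measured result
print(perplexity_from_loss(metrics["eval_loss"]))
```

Lower perplexity on the test set means the model assigns higher probability to unseen wishes.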

## Reload the saved model for inference

In [None]:
MODEL_DIR = os.path.join(os.getcwd(), "app", "model", "files_for_huggingface_2")
MODEL = GPT2LMHeadModel.from_pretrained(MODEL_DIR)
TOKENIZER = AutoTokenizer.from_pretrained("gpt2-medium")
TOKENIZER.save_pretrained(MODEL_DIR)  # keep the tokenizer files next to the model
PIPE = TextGenerationPipeline(model=MODEL, tokenizer=TOKENIZER)

# Generation settings are passed at call time; beam search is deterministic, so no temperature is set
output = PIPE(
    "With a new year comes",
    max_length=250, min_length=150, num_beams=5,
    no_repeat_ngram_size=2, early_stopping=True,
)[0]["generated_text"]
print(output)
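
The setting `no_repeat_ngram_size=2` tells the generator not to produce any sequence of two tokens more than once, which helps against the repetitive loops that beam search tends to fall into. The sketch below illustrates what counts as a repeated n-gram. It works on whitespace-separated words for readability, whereas the model applies the constraint to token ids; `repeated_ngrams` and the sample sentence are hypothetical.

```python
def repeated_ngrams(tokens, n=2):
    """Return the set of n-grams that occur more than once in tokens."""
    seen, repeated = set(), set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            repeated.add(gram)
        seen.add(gram)
    return repeated

sample = "merry christmas and a merry christmas to all".split()
print(repeated_ngrams(sample, n=2))
```

Applied to the words of a generated card, a check like this gives a rough indication of whether the constraint is doing its job.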