In [240]:
import pdfplumber

pdf_path1 = 'To Kill A Mockingbird - Full Text PDF.pdf'
pdf_path2 = "Frank Herbert's - Dune - Part 1 [EnglishOnlineClub.com].pdf"

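def extract_text_from_pdf(path):
    """Return the text of every page in the PDF at `path` as one string."""
    with pdfplumber.open(path) as pdf:
        # Depending on the pdfplumber version, extract_text() may return None
        # for a page with no extractable text, so fall back to an empty string.
        return "".join(page.extract_text() or "" for page in pdf.pages)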

In [241]:
text1 = extract_text_from_pdf(pdf_path1)
text2 = extract_text_from_pdf(pdf_path2)

In [242]:
import re

def clean_text(input_text):
    """Lowercase the text, flatten whitespace, and drop non-alphanumeric characters."""
    input_text = input_text.lower()

    # Collapse every run of whitespace (newlines included) into a single space
    text_no_extra_spaces = re.sub(r'\s+', ' ', input_text)

    # Remove everything except letters, digits, and spaces
    cleaned_text = re.sub(r'[^a-z0-9 ]', '', text_no_extra_spaces)

    return cleaned_text
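
The function above collapses whitespace before it strips punctuation, so a free-standing symbol such as " - " leaves two adjacent spaces in the result. That is harmless here because `prepare_dataset` splits on whitespace again, but if the cleaned string is used directly, one possible variant strips first and collapses second. The name `clean_text_strip_first` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

def clean_text_strip_first(input_text):
    """Hypothetical variant: drop punctuation first, then collapse whitespace."""
    lowered = input_text.lower()
    # Keep letters, digits, and whitespace; everything else is removed
    no_punctuation = re.sub(r'[^a-z0-9\s]', '', lowered)
    # Collapse whitespace runs, including those left behind by removed symbols
    return re.sub(r'\s+', ' ', no_punctuation).strip()

sample = "Hello,  World!\nIt's 1960 - a test."
print(clean_text_strip_first(sample))  # hello world its 1960 a test
```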

In [265]:
def prepare_dataset(input_text, max_words_per_row=100):
    """Split the text into chunks of at most `max_words_per_row` words."""
    words = input_text.split()
    ret_data = []
    for i in range(0, len(words), max_words_per_row):
        text = ' '.join(words[i:i + max_words_per_row])
        # Each chunk ends with a comma, which separates chunks in the output files
        ret_data.append(text + ',')
    
    return ret_data
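
To see how the fixed-size chunking behaves, the sketch below applies the same slicing to a seven-word string with a chunk size of three. `chunk_words` is an illustrative stand-in for `prepare_dataset` without the trailing comma; the final chunk comes out shorter whenever the word count is not a multiple of the chunk size.

```python
def chunk_words(text, max_words=100):
    """Illustrative helper: split text into chunks of at most max_words words."""
    words = text.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

chunks = chunk_words("one two three four five six seven", max_words=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
```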

In [266]:
clean_text1 = clean_text(text1)
clean_text2 = clean_text(text2)


dataset1 = prepare_dataset(clean_text1)
dataset2 = prepare_dataset(clean_text2)

In [275]:
len(dataset1), len(dataset2)

(997, 1999)

In [276]:
dataset1[0]

'1960 to kill a mockingbird by harper lee copyright c 1960 by harper lee copyright c renewed 1988 by harper lee published by arrangement with mcintosh and otis inc contentsdedication l part one l chapter 1 m chapter 2 m chapter 3 m chapter 4 m chapter 5 m chapter 6 m chapter 7 m chapter 8 m chapter 9 m chapter 10 m chapter 11 m part two l chapter 12 m chapter 13 m chapter 14 m chapter 15 m chapter 16 m chapter 17 m chapter 18 m chapter 19 m chapter 20 m chapter 21 m,'

In [277]:
for i, inp in enumerate(dataset1[:5]):
    print(f"Example {i+1}:")
    print("Input:", inp)
    print()

Example 1:
Input: 1960 to kill a mockingbird by harper lee copyright c 1960 by harper lee copyright c renewed 1988 by harper lee published by arrangement with mcintosh and otis inc contentsdedication l part one l chapter 1 m chapter 2 m chapter 3 m chapter 4 m chapter 5 m chapter 6 m chapter 7 m chapter 8 m chapter 9 m chapter 10 m chapter 11 m part two l chapter 12 m chapter 13 m chapter 14 m chapter 15 m chapter 16 m chapter 17 m chapter 18 m chapter 19 m chapter 20 m chapter 21 m,

Example 2:
Input: chapter 22 m chapter 23 m chapter 24 m chapter 25 m chapter 26 m chapter 27 m chapter 28 m chapter 29 m chapter 30 m chapter 31 m scan proof notes lcontents prev next dedication for mr lee and alice in consideration of love affection lawyers i suppose were children once charles lamb part one contents prev next chapter 1 when he was nearly thirteen my brother jem got his arm badly broken at the elbow when it healed and jems fears of never being able to play football were assuaged he was s

In [278]:
for i, input_segment in enumerate(dataset2[:5]):
    print(f"Example {i+1}:")
    print("Input:", input_segment)
    print()

Example 1:
Input: ccccoooonnnnvvvveeeerrrrtttteeeedddd ttttoooo ppppddddffff bbbbyyyy mmmmkkkkmmmmdune frank herbert copyright 1965 book 1 dune a beginning is the time for taking the most delicate care that the balances are correct this every sister of the bene gesserit knows to begin your study of the life of muaddib then take care that you first place him in his time born in the 57th year of the padishah emperor shaddam iv and take the most special care that you locate muaddib in his place the planet arrakis do not be deceived by the fact that he was born on caladan and lived his first,

Example 2:
Input: fifteen years there arrakis the planet known as dune is forever his place from manual of muaddib by the princess irulan in the week before their departure to arrakis when all the final scurrying about had reached a nearly unbearable frenzy an old crone came to visit the mother of the boy paul it was a warm night at castle caladan and the ancient pile of stone that had served the atr

In [279]:
# Combine the two datasets
dataset = dataset1 + dataset2

In [288]:
# Hold out the last 100 chunks for testing; everything before them is used for training
with open('train.txt', 'w') as f:
    for chunk in dataset[:-100]:
        f.write(chunk)

with open('test.txt', 'w') as f:
    for chunk in dataset[-100:]:
        f.write(chunk)
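
Because `dataset2` is appended last, the final 100 chunks held out above all come from the second book. If a test set that mixes both books is wanted, one option is to shuffle with a fixed seed before splitting. This is only a sketch: the function name `split_chunks`, the seed, and the stand-in data are assumptions for illustration.

```python
import random

def split_chunks(chunks, test_size=100, seed=0):
    """Shuffle a copy of the chunks and hold out test_size of them."""
    shuffled = list(chunks)
    # A seeded Random instance keeps the split reproducible across runs
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# Stand-in data with the same chunk counts as the two books above
toy_dataset = [f"book1 chunk {i}" for i in range(997)] + [f"book2 chunk {i}" for i in range(1999)]
train_chunks, test_chunks = split_chunks(toy_dataset)
print(len(train_chunks), len(test_chunks))  # 2896 100
```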

In [289]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_dir = r"C:\Users\aredd\Desktop\EndToEnd\text-completion\Trained-Model\output_directory"


model = GPT2LMHeadModel.from_pretrained(model_dir, ignore_mismatched_sizes=True)
tokenizer = GPT2Tokenizer.from_pretrained(model_dir)

In [295]:
prompt = """
Once upon a time there lived a 
"""

input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=150, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Once upon a time there lived a 

beautiful duke who loved his dukes lady dearly he was a man of few words and of few emotions the duke was a man of few tastes the baron said he was a man of few tastes the baron glanced at his right hand the one that had been the trigger for his first taste of death i must have been a pretty girl once the baron thought i must have been a pretty girl once more he thought i must have been a pretty girl and he remembered the conversation with feydrautha the night before the games the duke had said to feydrautha youre a pretty boy feydrautha had said he was but youre a pretty
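
The warning printed before the generated text appears because `generate` received only `input_ids`. A minimal sketch of one way to avoid it, assuming the `model` and `tokenizer` objects loaded earlier: call the tokenizer directly so that it also returns an attention mask, and pass `pad_token_id` explicitly. The wrapper name `generate_text` is hypothetical.

```python
def generate_text(model, tokenizer, prompt, max_length=150):
    """Hypothetical wrapper: generate a continuation with an explicit attention mask."""
    # Calling the tokenizer (rather than tokenizer.encode) returns both
    # input_ids and attention_mask
    encoded = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        max_length=max_length,
        num_return_sequences=1,
        # GPT-2 has no dedicated padding token, so reuse the end-of-sequence id
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

With the objects from the cells above, this would be called as `print(generate_text(model, tokenizer, prompt))`.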
