In [1]:
from alive_progress import alive_bar
import torch
import tempfile
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForLanguageModeling, pipeline
from datasets import load_dataset

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [4]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M", clean_up_tokenization_spaces=True)
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-125M', pad_token_id=tokenizer.eos_token_id, device=device)  # reuse the CUDA/CPU device chosen above instead of hard-coding GPU 0
with torch.no_grad():
    output_text = generator(
            "Aryan Bansal was",
            max_length=1000,
            num_beams=3,
            temperature=0.7,
            do_sample=True,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            truncation=True
        )[0]['generated_text']
print(output_text)

Aryan Bansal was born in London, England. He is the son of a British Army officer, and his mother is a native of India.

He is married to a Tamil-speaking woman, who is from Tamil Nadu. They have two children, a son and a daughter. The couple live in a small house in the city of Chennai, India, where they have a house of their own.


In [5]:
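# TinyStories-33M is paired with the GPT-Neo tokenizer, so the tokenizer is still loaded from EleutherAI/gpt-neo-125M
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M", clean_up_tokenization_spaces=True)
generator = pipeline('text-generation', model='roneneldan/TinyStories-33M', pad_token_id=tokenizer.eos_token_id, device=device)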
with torch.no_grad():
    output_text = generator(
            "Aryan Bansal was",
            max_length=1000,
            num_beams=3,
            temperature=0.7,
            do_sample=True,
            num_return_sequences=1,
            no_repeat_ngram_size=3,
            truncation=True
        )[0]['generated_text']
print(output_text)

Aryan Bansal was a 3 year old boy. He was playing in the park with his friends. They were having a lot of fun.

But then, something happened. A gust of wind came and blew away all of their toys. Everyone was sad and started to cry. 

Aryan, who was feeling very brave, decided to try to find all of his toys. He started to look everywhere, but he couldn't find them. He asked his friends, "Where did my toys go?"

One of his friends said, "Don't worry, we will help you find them." So they all started looking around the park. After a few minutes, they found all of the toys!

ryan was so happy. He thanked his friends for helping him. They all hugged and went back to the park to play.



In [6]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M", clean_up_tokenization_spaces=True)
generator = pipeline('text-generation', model='roneneldan/TinyStories-33M', pad_token_id=tokenizer.eos_token_id, device=device)

with open("../Data/TinyStoriesV2-GPT4-valid.txt", "r", encoding="utf8") as f:
    data = f.read()
stories = data.split("<|endoftext|>")

generated_texts_tinystories_gpt4val = []

with open("../Output/tinystories_output_gpt4val.txt", "w", encoding="utf8") as f:
    with alive_bar(min(50, len(stories)), force_tty=True) as bar:  # the loop stops after 50 stories, so size the bar to match
        for story in stories:
            if len(generated_texts_tinystories_gpt4val) == 50: # testing on 50 stories
                break
            if not story.strip():  # Skip empty stories
                continue
            # Use only the first half of the story (by whitespace-separated words) as the prompt
            words = story.split()
            prompt = " ".join(words[: len(words) // 2])

            with torch.no_grad():
                output_text = generator(
                    prompt,
                    max_length=1000,
                    num_beams=3,
                    temperature=0.7,
                    do_sample=True,
                    num_return_sequences=1,
                    no_repeat_ngram_size=2,
                    truncation=True
                )[0]['generated_text']

            print(output_text[-100:])
            generated_texts_tinystories_gpt4val.append(output_text)
            f.write(output_text + "\n<|endoftext|>\n")
            bar()


on 0: rl became best friends. They played together every day. The girl was so happy to have a new friend.
on 1: p each other when they need it, and they will always be there for you when you need them the most."
on 2: ed that sometimes, a small gesture of kindness is more powerful than a big tower or a heavy block.
on 3:  to share and be kind to others. When we share, everyone is happy and we can all be happy together!
on 4: g tree in their yard." Tom smiled and said, "Yes, mom. I love to stretch and play with my friends."
on 5: ?" Mom asked. "This is my pot! It is very special! You have ruined it! How could you be so clumsy?"
on 6: om was clean and he felt much better. The king was so happy to be able to relax and enjoy his bath.
on 7: e woolly, ready to himself, he missed and longing for joy, waiting for the sky. Like woofys for one
on 8: r reliable friend. And she never had to worry about a bad man taking her precious arrow away again!
on 9:  Lily looked and looked for him. Finally,

on 10: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


on 11: er and help each other, they can do big things and have more fun than they ever thought was alone.
on 12: it's always better to be kind and helpful to others, even if it means giving up something you want.
on 13: day painting with bright colors and own ideas. At the end of the day, she was tired but very happy.
on 14: ith each other every time they met at the big cliff near the pretty yard of toy box at their homes!
on 15:  work. Her mummy smiled too, happy that her daughter had found a way to make something so special.
on 16: ture and they had so much fun! They promised each other that they would travel together again soon.
on 17: nd they all had fun together on the soft, warm mattress, even though it was hard to run a fast one!
on 18: . So, don't be afraid to try new things, because you never know what fun adventures you might have!
on 19:  it rained outside, because they knew they would always have each other to keep them warm and dry.
on 20: t you lost, if you're ever sad or
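The warning in the output above ("You seem to be using the pipelines sequentially on GPU") appears because each prompt is sent to the pipeline on its own. A minimal sketch of one alternative follows: build all prompts first, then pass the list to the pipeline with a `batch_size`. The helper names `half_story_prompts` and `generate_in_batches` are hypothetical, the batch size of 8 is an arbitrary assumption, and the sketch keeps at least one word per prompt, which differs slightly from the loop above.

```python
from typing import Callable, Iterable, Iterator, List


def half_story_prompts(stories: Iterable[str], limit: int = 50) -> List[str]:
    """Return the first half (by whitespace-separated words) of up to `limit` non-empty stories."""
    prompts: List[str] = []
    for story in stories:
        words = story.split()
        if not words:  # skip empty stories
            continue
        # Keep at least one word so the prompt is never empty (an assumption of this sketch).
        prompts.append(" ".join(words[: max(1, len(words) // 2)]))
        if len(prompts) == limit:
            break
    return prompts


def generate_in_batches(generator: Callable, prompts: List[str],
                        batch_size: int = 8, **generate_kwargs) -> Iterator[str]:
    """Yield one generated text per prompt, letting the pipeline batch the prompts itself."""
    outputs = generator(prompts, batch_size=batch_size, **generate_kwargs)
    for candidates in outputs:
        # For a list of prompts, a text-generation pipeline returns one list of candidates per prompt.
        yield candidates[0]["generated_text"]


# Hypothetical usage with the `generator` and `stories` built in the cell above:
# generator.tokenizer.pad_token_id = generator.tokenizer.eos_token_id  # batching needs a pad token
# generator.tokenizer.padding_side = "left"  # decoder-only models are usually padded on the left
# prompts = half_story_prompts(stories, limit=50)
# for text in generate_in_batches(generator, prompts, batch_size=8, max_length=1000,
#                                 num_beams=3, temperature=0.7, do_sample=True,
#                                 no_repeat_ngram_size=2, truncation=True):
#     print(text[-100:])
```

Whether batching actually speeds things up depends on the GPU and on how much the prompt lengths vary, so the batch size is worth measuring rather than assuming.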

In [8]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M", clean_up_tokenization_spaces=True)
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-125M', pad_token_id=tokenizer.eos_token_id, device=device)

with open("../Data/TinyStoriesV2-GPT4-valid.txt", "r", encoding="utf8") as f:
    data = f.read()
stories = data.split("<|endoftext|>")

generated_texts_gptneo_gpt4val = []

with open("../Output/gptneo_output_gpt4val.txt", "w", encoding="utf8") as f:
    with alive_bar(min(50, len(stories)), force_tty=True) as bar:  # the loop stops after 50 stories, so size the bar to match
        for story in stories:
            if len(generated_texts_gptneo_gpt4val) == 50:
                break
            if not story.strip():  # Skip empty stories
                continue
            # Use only the first half of the story (by whitespace-separated words) as the prompt
            words = story.split()
            prompt = " ".join(words[: len(words) // 2])

            with torch.no_grad():
                output_text = generator(
                    prompt,
                    max_length=1000,
                    num_beams=3,
                    temperature=0.7,
                    do_sample=True,
                    num_return_sequences=1,
                    no_repeat_ngram_size=2,
                    truncation=True
                )[0]['generated_text']

            print(output_text[-100:])
            generated_texts_gptneo_gpt4val.append(output_text)
            f.write(output_text + "\n<|endoftext|>\n")
            bar()

on 0: her way, this is the best way to learn about the situation and how to deal with it in your own life.
on 1: , that's not what this is about. Don't be afraid to ask questions. Ask questions that you can answer
on 2:  him again. As he lay there, watching her sleep, it seemed as if he were waking up from a nightmare.
on 3: avors. For more information, please visit our website at www.lindsay.com or call us at 866-832-2377.
on 4: d that the dog would be okay if Tom stayed with them. After Tom left, he said goodbye and went home.
on 5: something like that because he wanted Lily to love him more and to feel more loved than he ever had.
on 6: so much better, because he knew that it would be the last time he ever had to live with his parents.
on 7: e the only thing that mattered was Max being a great father and being the best mother he'd ever had.
on 8: y reliable and has a very long life. If you want to know more about this, you can go to our website.
on 9:  make the most of what you have

KeyboardInterrupt: 

In [9]:
with tempfile.TemporaryDirectory() as tmpdir:
    print(f"Temporary directory: {tmpdir}")
    data_files = {"train": "../Data/TinyStoriesV2-GPT4-train.txt", "validation": "../Data/TinyStoriesV2-GPT4-valid.txt"}
    dataset = load_dataset("text", data_files=data_files, cache_dir=tmpdir)

    # Use 1% of the training and validation data for fine tuning
    dataset["train"] = dataset["train"].train_test_split(train_size=0.01)["train"]
    dataset["validation"] = dataset["validation"].train_test_split(train_size=0.01)["train"]

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M", clean_up_tokenization_spaces=True)
    tokenizer.pad_token = tokenizer.eos_token 

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=["text"],
        num_proc=max(1, torch.cuda.device_count()),  # num_proc counts CPU worker processes and must be at least 1
    )

    training_args = TrainingArguments(
        output_dir="./finetuned_tinystories",
        per_device_train_batch_size=4,  # Adjust based on your GPU memory
        per_device_eval_batch_size=4,
        learning_rate=2e-5,
        num_train_epochs=3,  # Adjust as needed
        weight_decay=0.01,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        fp16=True,  # Enable mixed precision training for speedup on compatible GPUs
        push_to_hub=False,
    )

    model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M").to(device)
    # model.resize_token_embeddings(len(tokenizer))  # Resize if you added a new PAD token

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["validation"],
        data_collator=data_collator,
    )

Temporary directory: C:\Users\aryan\AppData\Local\Temp\tmpk98ciguo


Map:   0%|          | 0/156000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1578 [00:00<?, ? examples/s]
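The `text` loader produces one example per line of the input file, which is why the counts above are line counts rather than story counts. Assuming the files contain blank lines and standalone `<|endoftext|>` separator lines, those become examples with nothing to learn from: after padding, `DataCollatorForLanguageModeling` masks every pad token out of the labels, and the pad token here is the EOS token. A batch made only of such examples has no unmasked labels, which is one plausible but unverified explanation for the `eval_loss: nan` values that appear during training. The sketch below shows a filter predicate that could be applied before tokenization; the name `has_trainable_text` is hypothetical.

```python
SEPARATOR = "<|endoftext|>"  # story separator used in the TinyStories text files


def has_trainable_text(example: dict) -> bool:
    """Return True if the line contains anything besides whitespace and the separator."""
    text = example["text"].replace(SEPARATOR, "").strip()
    return bool(text)


# Hypothetical usage, placed before `dataset.map(tokenize_function, ...)`:
# dataset = dataset.filter(has_trainable_text)

# Small self-check on plain dictionaries shaped like the loader's rows.
rows = [
    {"text": "Once upon a time there was a little dog."},
    {"text": ""},
    {"text": "   "},
    {"text": "<|endoftext|>"},
]
kept = [row["text"] for row in rows if has_trainable_text(row)]
print(kept)
```

If the NaN persists after filtering, evaluating without `fp16` is another thing to try, since this sketch does not rule out a numerical overflow.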

In [10]:
trainer.train()

  0%|          | 0/117000 [00:00<?, ?it/s]

{'loss': 1.5983, 'grad_norm': 14.911356925964355, 'learning_rate': 1.9914871794871796e-05, 'epoch': 0.01}
{'loss': 1.6407, 'grad_norm': 17.492216110229492, 'learning_rate': 1.982940170940171e-05, 'epoch': 0.03}
{'loss': 1.6284, 'grad_norm': 13.466714859008789, 'learning_rate': 1.9744102564102565e-05, 'epoch': 0.04}
{'loss': 1.6251, 'grad_norm': 20.218801498413086, 'learning_rate': 1.965863247863248e-05, 'epoch': 0.05}
{'loss': 1.6464, 'grad_norm': 15.818338394165039, 'learning_rate': 1.9573162393162394e-05, 'epoch': 0.06}
{'loss': 1.6235, 'grad_norm': 16.174875259399414, 'learning_rate': 1.948769230769231e-05, 'epoch': 0.08}
{'loss': 1.6575, 'grad_norm': 13.682311058044434, 'learning_rate': 1.9402222222222223e-05, 'epoch': 0.09}
{'loss': 1.6268, 'grad_norm': 13.096839904785156, 'learning_rate': 1.931675213675214e-05, 'epoch': 0.1}
{'loss': 1.6279, 'grad_norm': 11.731942176818848, 'learning_rate': 1.9231282051282052e-05, 'epoch': 0.12}
{'loss': 1.6282, 'grad_norm': 23.257055282592773, '

  0%|          | 0/395 [00:00<?, ?it/s]

{'eval_loss': nan, 'eval_runtime': 55.2124, 'eval_samples_per_second': 28.581, 'eval_steps_per_second': 7.154, 'epoch': 1.0}
{'loss': 1.3473, 'grad_norm': 12.93578815460205, 'learning_rate': 1.3250940170940171e-05, 'epoch': 1.01}
{'loss': 1.3627, 'grad_norm': 26.647449493408203, 'learning_rate': 1.3165470085470087e-05, 'epoch': 1.03}
{'loss': 1.3457, 'grad_norm': 10.83144760131836, 'learning_rate': 1.3080000000000002e-05, 'epoch': 1.04}
{'loss': 1.3758, 'grad_norm': 11.695415496826172, 'learning_rate': 1.2994700854700856e-05, 'epoch': 1.05}
{'loss': 1.3625, 'grad_norm': 25.218990325927734, 'learning_rate': 1.290923076923077e-05, 'epoch': 1.06}
{'loss': 1.3572, 'grad_norm': 13.15750789642334, 'learning_rate': 1.2823760683760684e-05, 'epoch': 1.08}
{'loss': 1.3648, 'grad_norm': 16.392419815063477, 'learning_rate': 1.27382905982906e-05, 'epoch': 1.09}
{'loss': 1.3694, 'grad_norm': 8.822991371154785, 'learning_rate': 1.2652820512820514e-05, 'epoch': 1.1}
{'loss': 1.3676, 'grad_norm': 11.99

  0%|          | 0/395 [00:00<?, ?it/s]

{'eval_loss': nan, 'eval_runtime': 20.7012, 'eval_samples_per_second': 76.227, 'eval_steps_per_second': 19.081, 'epoch': 2.0}
{'loss': 1.1491, 'grad_norm': 10.59494686126709, 'learning_rate': 6.586837606837608e-06, 'epoch': 2.01}
{'loss': 1.146, 'grad_norm': 12.936086654663086, 'learning_rate': 6.501367521367522e-06, 'epoch': 2.03}
{'loss': 1.1507, 'grad_norm': 12.761543273925781, 'learning_rate': 6.4158974358974365e-06, 'epoch': 2.04}
{'loss': 1.1478, 'grad_norm': 14.555933952331543, 'learning_rate': 6.330427350427351e-06, 'epoch': 2.05}
{'loss': 1.1525, 'grad_norm': 15.432066917419434, 'learning_rate': 6.244957264957265e-06, 'epoch': 2.06}
{'loss': 1.1516, 'grad_norm': 12.2783784866333, 'learning_rate': 6.1594871794871806e-06, 'epoch': 2.08}
{'loss': 1.1448, 'grad_norm': 9.123688697814941, 'learning_rate': 6.074017094017095e-06, 'epoch': 2.09}
{'loss': 1.1423, 'grad_norm': 14.517699241638184, 'learning_rate': 5.9885470085470086e-06, 'epoch': 2.1}
{'loss': 1.1512, 'grad_norm': 10.2412

  0%|          | 0/395 [00:00<?, ?it/s]

{'eval_loss': nan, 'eval_runtime': 20.7663, 'eval_samples_per_second': 75.988, 'eval_steps_per_second': 19.021, 'epoch': 3.0}


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


{'train_runtime': 24414.3472, 'train_samples_per_second': 19.169, 'train_steps_per_second': 4.792, 'train_loss': 1.3849269257895966, 'epoch': 3.0}


TrainOutput(global_step=117000, training_loss=1.3849269257895966, metrics={'train_runtime': 24414.3472, 'train_samples_per_second': 19.169, 'train_steps_per_second': 4.792, 'total_flos': 4.0749779386368e+16, 'train_loss': 1.3849269257895966, 'epoch': 3.0})
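The step counts in the logs follow directly from the dataset sizes and batch sizes shown earlier. The sketch below reproduces that arithmetic, assuming a single device and no gradient accumulation (neither is set in the `TrainingArguments` above); the function name `steps_per_epoch` is hypothetical.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


train_examples = 156000   # from the first "Map" progress bar
eval_examples = 1578      # from the second "Map" progress bar
batch_size = 4            # per_device_train_batch_size and per_device_eval_batch_size
epochs = 3                # num_train_epochs
train_runtime = 24414.3472  # seconds, from the final training metrics

total_train_steps = steps_per_epoch(train_examples, batch_size) * epochs
eval_steps = steps_per_epoch(eval_examples, batch_size)

print(total_train_steps)                             # 117000, matches global_step
print(eval_steps)                                    # 395, matches the evaluation progress bar
print(round(total_train_steps / train_runtime, 3))   # 4.792, matches train_steps_per_second
print(round(train_examples * epochs / train_runtime, 3))  # 19.169, matches train_samples_per_second
```

At roughly 4.8 steps per second the run takes about 6.8 hours, so reducing `max_length` or the number of epochs is the most direct way to shorten it.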

In [11]:
trainer.save_model("./finetuned_tinystories")
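To compare the fine-tuned model with the earlier runs, the saved weights can be loaded back into a text-generation pipeline. This is a minimal sketch: the function name `load_finetuned_generator` and the `GENERATION_KWARGS` dictionary are hypothetical, and the tokenizer is loaded from the base GPT-Neo checkpoint on the assumption that `save_model` did not write tokenizer files, since no tokenizer was passed to the `Trainer`.

```python
# Generation settings copied from the earlier cells so every model is sampled the same way.
GENERATION_KWARGS = {
    "max_length": 1000,
    "num_beams": 3,
    "temperature": 0.7,
    "do_sample": True,
    "num_return_sequences": 1,
    "no_repeat_ngram_size": 2,
    "truncation": True,
}


def load_finetuned_generator(model_dir: str = "./finetuned_tinystories",
                             tokenizer_name: str = "EleutherAI/gpt-neo-125M"):
    """Build a text-generation pipeline from the directory written by trainer.save_model()."""
    # Imported inside the function so the helper can be defined before the libraries are loaded.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    device = 0 if torch.cuda.is_available() else -1  # -1 selects the CPU
    return pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        pad_token_id=tokenizer.eos_token_id,
        device=device,
    )


# Hypothetical usage:
# generator = load_finetuned_generator()
# print(generator("Aryan Bansal was", **GENERATION_KWARGS)[0]["generated_text"])
```

Because `load_best_model_at_end=True` selects a checkpoint by evaluation loss, and that loss was NaN in every epoch above, the saved model may not be the best one by any meaningful measure.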