<a href="https://colab.research.google.com/github/cmreyesvalencia-png/colab-git-assignment2-CR/blob/main/Lesson_13_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment 13: Generative AI Essentials**
- **Course:** Data Analytics and Business Intelligence Analyst
- **Institution:** Willis College
- **Student Name:** Carlos Reyes
- **Instructor:** Ratinder Rajpal
- **Date:** November 14, 2025

# **TASK 1: Dataset Preparation:**
- **Dataset Choice:** Project Gutenberg (http://www.gutenberg.org/)

**About Project Gutenberg**
- Project Gutenberg is an online library of more than 75,000 free eBooks.

- Michael Hart, founder of Project Gutenberg, invented eBooks in 1971 and his memory continues to inspire the creation of eBooks and related content today.

- Since then, thousands of volunteers have digitized and diligently proofread the world’s literature. The entire Project Gutenberg collection is yours to enjoy.

- All Project Gutenberg eBooks are completely free and always will be.

In [15]:
## Set up the environment
!pip install transformers torch datasets

import torch
import os
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import load_dataset, Dataset

## Disable wandb completely
os.environ["WANDB_DISABLED"] = "true"



# **TASK 2: Exploring Generative Pre-trained Transformers (GPTs):**
- Model Architecture: Load a pre-trained GPT-2 model and its associated tokenizer. The tokenizer converts text into the token IDs that the model takes as input.
- Training: Fine-tune the pre-trained model on a sample of the Project Gutenberg dataset.
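
As a rough illustration of what "converting text into tokens" means, the sketch below maps a string to integer IDs and back. The vocabulary and the greedy longest-match splitter are hypothetical stand-ins: GPT-2's real tokenizer uses byte-level BPE with learned merge rules over a vocabulary of 50,257 entries, so this is not its actual algorithm.

```python
# Toy vocabulary (hypothetical, for illustration only).
TOY_VOCAB = {"Call": 0, " me": 1, " Ish": 2, "ma": 3, "el": 4, ".": 5}
ID_TO_TOKEN = {token_id: piece for piece, token_id in TOY_VOCAB.items()}


def encode(text):
    """Split text into vocabulary pieces (greedy longest match) and return their IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos:]!r}")
    return ids


def decode(ids):
    """Map IDs back to their pieces and join them into the original string."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


token_ids = encode("Call me Ishmael.")
print(token_ids)
print(decode(token_ids))
```

The model only ever sees the integer IDs; decoding turns the IDs it generates back into readable text.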


In [17]:
## Load the pre-trained model and tokenizer
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Add padding token
tokenizer.pad_token = tokenizer.eos_token

In [18]:
## Load Project Gutenberg dataset
print("Loading Project Gutenberg dataset...")
try:
    # Load actual Project Gutenberg dataset
    dataset = load_dataset("gutenberg", split="train[:100]")  # Using 100 examples
    print(f"Successfully loaded Project Gutenberg dataset with {len(dataset)} examples")
except Exception as e:
    print(f"Error loading Gutenberg: {e}")
    try:
        # Alternative Gutenberg dataset
        dataset = load_dataset("sedthh/gutenberg_english", split="train[:100]")
        print(f"Successfully loaded alternative Gutenberg dataset with {len(dataset)} examples")
    except Exception as e2:
        print(f"Error loading alternative: {e2}")
        # Fallback to literary texts
        print("Using fallback literary dataset...")
        sample_texts = [
            "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
            "Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore.",
            "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness.",
            "Happy families are all alike; every unhappy family is unhappy in its own way.",
            "You don't know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain't no matter.",
            "There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning.",
            "Once upon a time and a very good time it was there was a moocow coming down along the road.",
            "The sun did not shine. It was too wet to play. So we sat in the house. All that cold, cold, wet day.",
            "In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since.",
            "It was a bright cold day in April, and the clocks were striking thirteen."
        ]
        dataset = Dataset.from_dict({"text": sample_texts * 10})
        print(f"Created fallback dataset with {len(dataset)} examples")

print(f"Dataset columns: {dataset.column_names}")


Loading Project Gutenberg dataset...
Error loading Gutenberg: Dataset 'gutenberg' doesn't exist on the Hub or cannot be accessed.



Successfully loaded alternative Gutenberg dataset with 100 examples
Dataset columns: ['TEXT', 'SOURCE', 'METADATA']


In [19]:
# Tokenize
def tokenize_function(examples):
    # Find the correct text column
    text_key = 'text'
    if 'text' not in examples and 'TEXT' in examples:
        text_key = 'TEXT'
    elif 'text' not in examples and 'content' in examples:
        text_key = 'content'
    elif 'text' not in examples and len(examples) > 0:
        # Use first string column
        for key, value in examples.items():
            if isinstance(value[0], str):
                text_key = key
                break

    return tokenizer(
        examples[text_key],
        truncation=True,
        padding="max_length",
        max_length=128
    )

print("Tokenizing dataset...")
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    batch_size=16,
    remove_columns=dataset.column_names
)

print(f"Tokenized dataset ready with {len(tokenized_dataset)} examples")


Tokenizing dataset...



Tokenized dataset ready with 100 examples
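
The tokenizer call above combines `truncation=True`, `padding="max_length"` and `max_length=128`. The pure-Python sketch below shows what that combination does to a list of token IDs. `pad_or_truncate` is a hypothetical helper written for this illustration, and it assumes right-side padding and truncation, which I believe is the GPT-2 tokenizer's default.

```python
EOS_ID = 50256  # GPT-2's end-of-text token ID, reused as the pad ID in this notebook


def pad_or_truncate(token_ids, max_length=128, pad_id=EOS_ID):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop everything past max_length
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill the right side with pad IDs
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence gets padded, a long one gets cut (max_length=8 keeps the output readable).
short_ids, short_mask = pad_or_truncate([10, 11, 12], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(200)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

One consequence worth keeping in mind: with truncation and no chunking, each of the 100 books contributes only its first 128 tokens to training.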


In [22]:
# Training setup: fine-tune the model on the Project Gutenberg dataset

training_args = TrainingArguments(
    output_dir="./gpt2-gutenberg-trained",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=500,
    save_total_limit=2,
    prediction_loss_only=True,
    remove_unused_columns=False,
    report_to="none"
)

# Create a custom data collator for language modeling
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

## Train the model
print("Training model on Project Gutenberg dataset...")
trainer.train()

## Save the trained model
trainer.save_model()
tokenizer.save_pretrained("./gpt2-gutenberg-trained")

Training model on Project Gutenberg dataset...


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss


('./gpt2-gutenberg-trained/tokenizer_config.json',
 './gpt2-gutenberg-trained/special_tokens_map.json',
 './gpt2-gutenberg-trained/vocab.json',
 './gpt2-gutenberg-trained/merges.txt',
 './gpt2-gutenberg-trained/added_tokens.json')
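
A likely reason the Step / Training Loss table above has no rows is that the run finishes before the first logging step. The arithmetic sketch below assumes a single device, no gradient accumulation, and the `TrainingArguments` default of `logging_steps=500`; that default is an assumption here and should be checked against the installed `transformers` version.

```python
import math

DEFAULT_LOGGING_STEPS = 500  # assumed default of TrainingArguments.logging_steps


def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a run on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


# Values taken from this notebook: 100 examples, batch size 4, 1 epoch.
steps = total_steps(num_examples=100, batch_size=4, epochs=1)
print(f"Total optimizer steps: {steps}")
print(f"Loss rows expected in the table: {steps // DEFAULT_LOGGING_STEPS}")
```

Under these assumptions, passing a smaller value such as `logging_steps=5` to `TrainingArguments` should make the loss appear during training. The same reasoning applies to `save_steps=500`: no intermediate checkpoint is written, so the saved model comes from the explicit `trainer.save_model()` call.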

In [23]:
## Load the trained model for generation
trained_model = GPT2LMHeadModel.from_pretrained("./gpt2-gutenberg-trained")
trained_tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-gutenberg-trained")

In [24]:
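# Helper that generates text from a prompt with configurable sampling parameters

def generate_text(prompt, max_length=100, temperature=1.0, top_k=50, top_p=0.95):
    # Tokenize the prompt; the attention mask marks which positions are real input
    inputs = trained_tokenizer(prompt, return_tensors="pt")

    output = trained_model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,    # total length in tokens, prompt included
        temperature=temperature,  # below 1 sharpens, above 1 flattens the token distribution
        top_k=top_k,              # sample only from the k most likely tokens
        top_p=top_p,              # keep the smallest token set whose cumulative probability reaches top_p
        do_sample=True,
        num_return_sequences=1,
        pad_token_id=trained_tokenizer.eos_token_id
    )

    # Decode the single returned sequence back into text
    generated_text = trained_tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text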

# **TASK 3: Application Demonstration:**
- Describe and implement examples demonstrating a content creation application using the trained model.

In [26]:
## Practical demonstration: Content Creation Application
print("\n" + "="*60)
print("CONTENT CREATION APPLICATION DEMONSTRATION")
print("Trained on Project Gutenberg Dataset")
print("="*60)

### Exercise 1: Basic Text Generation
print("Exercise 1: Basic Text Generation")
prompt1 = "Why love is important?"
print(f"Prompt: {prompt1}")
print(f"Generated: {generate_text(prompt1)}")
print("\n" + "="*50 + "\n")



CONTENT CREATION APPLICATION DEMONSTRATION
Trained on Project Gutenberg Dataset
Exercise 1: Basic Text Generation
Prompt: Why love is important?
Generated: Why love is important?
In the early 20th century Americans thought love was simply human experience. At first glance, this seemed an obvious statement, but they were wrong. Love and compassion were not merely feelings, but human actions and emotions, which they considered as inseparable. Loving would be seen as expressing feelings, and compassion as a state of mind, which had an intrinsic affect upon others. In fact it was thought as just one of many values which defined humanity. The human soul, however




In [28]:
### Exercise 2: Controlling Output Length
print("Exercise 2: Controlling Output Length")
prompt2 = "In the beach"
print(f"Prompt: {prompt2}")
print("Short version (max_length=50):")
print(generate_text(prompt2, max_length=50))
print("\nLong version (max_length=200):")
print(generate_text(prompt2, max_length=200))
print("\n" + "="*50 + "\n")

Exercise 2: Controlling Output Length
Prompt: In the beach
Short version (max_length=50):
In the beachside of Cape Cod, where I have lived since 2008. My husband Chris is now 35 years old. I grew up in a house that had a roof, which I never had before, with a backyard. The backyard used to be

Long version (max_length=200):
In the beach, alligators attacked in a series of high-speed chase, killing nine. Three helicopters and three police cars were involved. The helicopter crashed in a sandstorm, but managed to take off from the runway near Long Beach. Alligators were unable to move and were hit by multiple bullets at the airport.

About 10 minutes after takeoff, the helicopter struck a water line near the sandline, knocking the water-filled water off the landing strips, according to reports. The helicopter then took off with nine helicopters and three police cars.

The helicopter then landed at Long Beach Municipal Airport, where it was damaged before landing. Several of the helicopter

In [29]:
### Exercise 3: Adjusting Temperature
print("Exercise 3: Adjusting Temperature")
prompt3 = "In the depths of the forest"
print(f"Prompt: {prompt3}")
print("Low temperature (0.3) - More deterministic:")
print(generate_text(prompt3, temperature=0.3))
print("\nMedium temperature (1.0) - Balanced:")
print(generate_text(prompt3, temperature=1.0))
print("\nHigh temperature (1.5) - More creative:")
print(generate_text(prompt3, temperature=1.5))
print("\n" + "="*50 + "\n")


Exercise 3: Adjusting Temperature
Prompt: In the depths of the forest
Low temperature (0.3) - More deterministic:
In the depths of the forest, the sun was shining brightly.
The moon was shining brightly.
The sun was shining brightly.
The moon was shining brightly.
The moon was shining brightly.
The moon was shining brightly.
The moon was shining brightly.
The moon was shining brightly.
The moon was shining brightly.
The moon was shining brightly.
The moon was shining brightly.
The moon was shining brightly.
The moon was shining brightly.
The moon

Medium temperature (1.0) - Balanced:
In the depths of the forest we saw a huge red brick structure with a long oak tree that had a large crown tree at the base of its head. This head could be reached via the branches of the tree and it looked like a giant.
The statue was a tall man who looked like he had been in a fight with the gods. He had long brown hair and a long red beard. He wore a robe of polished silk with gold threads.
A tall, thick
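
To make the effect of `temperature` concrete, the sketch below applies the standard temperature-scaled softmax to a small set of made-up logits. The logit values are hypothetical; in the notebook they would come from the model at each generation step.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize them into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

p_low = softmax_with_temperature(logits, 0.3)
p_mid = softmax_with_temperature(logits, 1.0)
p_high = softmax_with_temperature(logits, 1.5)

for label, probs in (("0.3", p_low), ("1.0", p_mid), ("1.5", p_high)):
    print(f"temperature={label}: {[round(p, 3) for p in probs]}")
```

At a low temperature almost all probability mass sits on the top-scoring token, which is consistent with the repetitive "shining brightly" loop in the 0.3 output above. Higher temperatures flatten the distribution, so less likely tokens are sampled more often.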

In [30]:
### Exercise 4: Using Top_k and Top_p
print("Exercise 4: Using Top_k and Top_p")
prompt4 = "The mysterious boy arrived"
print(f"Prompt: {prompt4}")
print("Standard sampling (top_k=50, top_p=0.95):")
print(generate_text(prompt4, top_k=50, top_p=0.95))
print("\nRestricted sampling (top_k=10, top_p=0.7):")
print(generate_text(prompt4, top_k=10, top_p=0.7))
print("\nVery creative sampling (top_k=100, top_p=0.99):")
print(generate_text(prompt4, top_k=100, top_p=0.99))
print("\n" + "="*50 + "\n")

Exercise 4: Using Top_k and Top_p
Prompt: The mysterious boy arrived
Standard sampling (top_k=50, top_p=0.95):
The mysterious boy arrived at the scene of the crime, a short distance away, and had been waiting for his turn for a while.
In the meantime, Detective Inspector William Wilkeson of the Chicago Police Department had been assigned to take a look at the scene. They had encountered a man in an apartment near the scene, who had evidently been shot, or had had come to kill.
Mr. Wilkeson had inquired about the man's appearance and attire, but was prevented by the

Restricted sampling (top_k=10, top_p=0.7):
The mysterious boy arrived at the scene, with a large bag in hand, and a pistol. The boy's body was found by police, and his body was taken to the hospital. The boy was taken to the hospital for treatment. The boy was later pronounced dead.

In the days after the incident, a group of young men who were members of the group called the "Crown Family," which included a number of membe
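
The two filters used in this exercise can be sketched in plain Python. The token probabilities below are invented for illustration, and the functions are a simplified reading of top-k and nucleus (top-p) filtering rather than the library's exact implementation, whose ordering and tie-breaking details may differ.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution after the prompt.
next_token_probs = {" at": 0.5, " in": 0.2, " with": 0.15, " on": 0.1, " zebra": 0.05}

after_top_k = top_k_filter(next_token_probs, k=3)
after_both = top_p_filter(after_top_k, top_p=0.7)
print(after_top_k)
print(after_both)
```

Smaller `top_k` and `top_p` values shrink the candidate pool, which tends to make the text more predictable; larger values let rarer tokens through.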

In [33]:
### Exercise 5: Prompt Engineering
print("Exercise 5: Prompt Engineering")
print("Different prompt styles for the same concept:")

prompt5a = "Write a story about a red car"
print(f"\nPrompt 1 (Direct): {prompt5a}")
print(f"Generated: {generate_text(prompt5a, max_length=120)}")

prompt5b = "In a land of ancient magic, a dragon guarded the mountain"
print(f"\nPrompt 2 (Descriptive): {prompt5b}")
print(f"Generated: {generate_text(prompt5b, max_length=120)}")

Exercise 5: Prompt Engineering
Different prompt styles for the same concept:

Prompt 1 (Direct): Write a story about a red car
Generated: Write a story about a red car on a highway in the middle of nowhere. Don't stop. Don't drive. You can hear the air that settles back in the car. Look forward to hearing what you see. There's no question I'm proud of you.

Prompt 2 (Descriptive): In a land of ancient magic, a dragon guarded the mountain
Generated: In a land of ancient magic, a dragon guarded the mountain-tops, the dragon of the dead, the dragon of the living—but it knew not who it belonged to.

It knew where the mountain-tops stood—and who it was. In the wake of its destruction in the Age of Magic, it had not only created a new world of wonder, it had begun to change it as well. When, as in the Age of Magic, its name came to be known as Twilight Mountain, a place where the world was once quiet, it had lost its quiet.



