# ***Installing Dependencies***

In [1]:
pip install --upgrade transformers datasets evaluate rouge_score

Collecting datasets
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Downloading datasets-4.4.1-py3-none-any.whl (511 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl (47.7 MB)
Building wheels for collec

This command installs the tools our project needs. `pip` is Python's package installer: it downloads packages and sets them up on our computer. The `install` part tells pip to fetch every package named on the line, and the `--upgrade` option asks for the newest available version of each one, even if an older version is already installed. The four packages each cover one part of the workflow: `transformers` provides the models and training tools, `datasets` handles loading and preparing data, `evaluate` computes performance metrics, and `rouge_score` is the ROUGE implementation that `evaluate` uses to score summaries. This one line gets our workspace ready so that we have everything we need before we run the rest of the code.

In [2]:
!pip install textstat datasets transformers

Collecting textstat
  Downloading textstat-0.7.11-py3-none-any.whl.metadata (15 kB)
Collecting pyphen (from textstat)
  Downloading pyphen-0.17.2-py3-none-any.whl.metadata (3.2 kB)
Downloading textstat-0.7.11-py3-none-any.whl (176 kB)
Downloading pyphen-0.17.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pyphen, textstat
Successfully installed pyphen-0.17.2 textstat-0.7.11


This command installs a few more tools that our code needs. The exclamation point at the beginning tells the notebook to run the line as a shell command instead of as Python code. As before, `pip install` downloads the listed packages and sets them up. Each package has a job: `textstat` estimates how easy or hard a text is to read, `datasets` makes it easier to load and prepare data, and `transformers` is used to build and run modern language models. The output shows that only `textstat` and its dependency `pyphen` were newly installed, because `datasets` and `transformers` were already set up by the first cell. With this done, we can write and run the main program.
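To give a sense of what a readability score measures, the sketch below computes the Flesch Reading Ease formula from raw counts. The training code later in this notebook calls `textstat.flesch_reading_ease`, which is based on this formula. The helper name `flesch_reading_ease_from_counts` and the example counts are made up for illustration, and textstat does its own word, sentence, and syllable counting, so its results can differ from a hand calculation.

```python
def flesch_reading_ease_from_counts(words, sentences, syllables):
    """Flesch Reading Ease from raw counts: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Hypothetical counts: one 10-word sentence with 14 syllables in total.
score = flesch_reading_ease_from_counts(words=10, sentences=1, syllables=14)
print(round(score, 1))
```

Longer sentences and longer words both pull the score down, which is why short headline-style summaries tend to score as fairly easy to read.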

# ***Setting up the Environment and Loading Data***

In [3]:
import pandas as pd
from datasets import Dataset, DatasetDict
import re # Import the regular expression library

# --- (A) CREATE A CLEANING FUNCTION ---
def clean_text(text):
    if not isinstance(text, str): # Handle potential non-string data
        return ""
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# --- 1. Load Your Custom Dataset ---
try:
    # Changed encoding to 'utf-8', which is standard for Kaggle datasets
    df = pd.read_csv('news-article-categories.csv', encoding='utf-8')
    print("Successfully loaded 'news-article-categories.csv'")

except FileNotFoundError:
    print("Error: 'news-article-categories.csv' not found.")
    df = None # Set df to None if file not found

if df is not None:
    # --- 2. Preprocess and Prepare the Dataset ---
    # Select the columns we need from the dataset ('body' and 'title')
    df = df[['body', 'title']]
    # Rename them to the standard names the rest of the script expects ('text' and 'summary')
    df.columns = ['text', 'summary']

    # Handle potential missing values in the new dataset
    df.dropna(inplace=True)

    # --- (B) APPLY THE CLEANING FUNCTION TO YOUR DATA ---
    print("\n--- Applying preprocessing to the dataset ---")
    df['text'] = df['text'].apply(clean_text)
    df['summary'] = df['summary'].apply(clean_text)
    print("Preprocessing complete. Example of cleaned article:")
    print(df.iloc[0]['text'])

    # --- 3. Convert to a Hugging Face Dataset ---
    hg_dataset = Dataset.from_pandas(df)

    # --- 4. Split into Training and Test Sets ---
    train_test_split = hg_dataset.train_test_split(test_size=0.2)
    dataset = DatasetDict({
        'train': train_test_split['train'],
        'test': train_test_split['test']
    })

    print("\nDataset structure:")
    print(dataset)

Successfully loaded 'news-article-categories.csv'

--- Applying preprocessing to the dataset ---
Preprocessing complete. Example of cleaned article:

Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['text', 'summary', '__index_level_0__'],
        num_rows: 5497
    })
    test: Dataset({
        features: ['text', 'summary', '__index_level_0__'],
        num_rows: 1375
    })
})


This line `import pandas as pd` imports the pandas library and gives it the shorter name "pd" so that it is easier to use later in the code. Pandas makes it easier to work with and organize data in tables, like CSV files or spreadsheets. It helps us read, clean, and organize data in a way that is easy to understand.

Moving on to `from datasets import Dataset, DatasetDict` this gets Dataset and DatasetDict from the datasets library. These tools help you organize data so that it can be used in machine learning models. They make it easy to load, split, and organize data for training and testing.

Now this `import re` brings in the re library, which makes it easier to search and clean up text by using patterns. It can find and delete things from text that you don't want, like links, symbols, or HTML tags. This will help you clean up the dataset later.

This `def clean_text(text):` starts a function named clean_text. When the function runs, it will clean the text inside the parentheses. It tells the function what to do whenever it is called.

This `if not isinstance(text, str): # Handle potential non-string data` checks whether the input to the function is something other than text. The next line tells the code what to do in that case. This prevents errors when a cell holds a number or a missing value instead of a string.

Next, if the input wasn't text, this `return ""` sends back an empty string. It's like telling someone to "skip this part" when the data isn't right. This keeps the program from crashing so it can keep running smoothly.

This `text = text.lower()` makes all the letters in the text lowercase. This makes everything more consistent because the words "Apple" and "apple" will be treated the same. It's one of the first things you do to clean up text data.

This `text = re.sub(r'<.*?>', '', text)` gets rid of any HTML tags that might be in the text. The pattern matches a `<`, then as few characters as possible, then the next `>`, so it finds and removes tags such as `<p>` or `<div>`. It makes the text cleaner and easier to read.

This `text = re.sub(r'https?://\S+|www\.\S+', '', text)` takes out links to websites. It looks for anything that starts with `http://`, `https://`, or `www.` and removes everything up to the next whitespace. Links rarely help with summarization, so removing them keeps the data focused on the article itself.

This `text = re.sub(r'\s+', ' ', text).strip()` tidies up whitespace. It replaces any run of spaces, tabs, or line breaks with a single space, and then `strip()` removes whitespace from the beginning and end. This makes the text look neat and consistent.

After all the steps above, this `return text` gives back the cleaned-up text. You can store or use anything that is returned later in the program. This is the last step in the cleaning process.
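To see the cleaning steps in action, the sketch below repeats the function and runs it on a few made-up inputs. The sample strings are hypothetical and only there for illustration. The last one shows a side effect worth knowing about: the tag pattern removes anything between a `<` and the next `>`, even when it isn't HTML.

```python
import re

def clean_text(text):
    # Same steps as the notebook's cleaning function.
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)                   # strip anything between < and >
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # strip URLs
    text = re.sub(r'\s+', ' ', text).strip()            # collapse whitespace
    return text

sample = "<p>Breaking   News</p> Read more at https://example.com/story today"
cleaned = clean_text(sample)
print(cleaned)  # breaking news read more at today

missing = clean_text(None)                    # non-string input becomes ""
comparison = clean_text("x < y and y > z")    # "< y and y >" is treated like a tag
print(repr(missing), repr(comparison))        # '' 'x z'
```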

This `df = pd.read_csv('news-article-categories.csv', encoding='utf-8')` uses pandas to open and read a CSV file called news-article-categories.csv. The `encoding='utf-8'` part tells pandas how to decode the bytes in the file, so that accented letters and symbols are read correctly. The result is a DataFrame, a table that the program can easily work with.

This `print("Successfully loaded 'news-article-categories.csv'")` says that the file was loaded correctly.  It's mostly for the user to know that the file could be read without any problems.  It makes debugging easier by giving you feedback while the program is running.

This `print("Error: 'news-article-categories.csv' not found.")` will show a warning if the file can't be found. It lets the user know that there was a problem finding the dataset. This helps you figure out what went wrong before the code runs again.

This `df = None` sets the variable df to None when the file can't be found, which marks that no data was loaded. The later `if df is not None:` check then skips the rest of the processing instead of crashing on data that doesn't exist. This is a safe way to deal with a missing file.

This `df = df[['body', 'title']]` only gets the title and body parts from the dataset. These are the parts of the article that have the headline and the main body. It gets rid of other columns that the program doesn't need.

This `df.columns = ['text', 'summary']` changes the names of the two columns to "text" and "summary."  This helps make the column names match what the rest of the code expects.  It keeps everything the same so that later functions can easily find these names.

This `df.dropna(inplace=True)` deletes any row that has a missing value in either column. Keeping only complete entries helps avoid problems during training or analysis. Note that it removes missing values only; a cell that contains an empty string is kept.

This `df['text'] = df['text'].apply(clean_text)` runs the clean_text function on every entry in the text column. It automatically cleans each row in the same way. This helps make all the text in the dataset consistent.

Now `df['summary'] = df['summary'].apply(clean_text)` does the same thing, but this time it cleans the summary column. It makes sure that both the summary and the article content are clean and match. Both of them are ready for machine learning tasks after being cleaned.

This `print(df.iloc[0]['text'])` shows the first cleaned-up article from the dataset. It helps make sure that the cleaning worked right. The user can check the output to make sure that unwanted parts have been taken out.
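In the output shown above, the line after "Example of cleaned article:" is blank, which suggests that the first article was empty after cleaning (this is a guess, since the raw row isn't shown). Because `dropna` runs before the cleaning step and only removes missing values, rows that end up as empty strings stay in the data. A minimal sketch of one way to filter them out, using plain dictionaries and hypothetical rows instead of the real DataFrame:

```python
# Hypothetical cleaned rows; the second and third each have an empty field.
rows = [
    {"text": "city council approves new budget", "summary": "budget approved"},
    {"text": "", "summary": "photo gallery"},
    {"text": "storm closes schools across the region", "summary": ""},
]

# Keep only rows where both fields still contain some text.
non_empty = [row for row in rows if row["text"].strip() and row["summary"].strip()]
print(len(rows), "->", len(non_empty))  # 3 -> 1
```

With pandas, the same idea could be written as a boolean filter such as `df[df['text'].str.strip() != '']`, applied after the cleaning step.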

This `hg_dataset = Dataset.from_pandas(df)` changes the pandas DataFrame into a format that the datasets library can read.  Then, it's easier to use the new dataset with machine learning models.  It's like turning the table into a training-ready shape.

This `train_test_split = hg_dataset.train_test_split(test_size=0.2)` splits the data into two groups: one for testing and one for training. The 0.2 means that 20% of the data is used for testing and the rest is used for training. It helps you see how well the model works with data it hasn't seen before.
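The split can be pictured as shuffling the row numbers and cutting the list in two. The sketch below does this in plain Python. It assumes the test size is rounded up, which matches the row counts printed above (1375 test and 5497 training rows out of 6872). The function name `split_indices` and the seed value are illustrative. The notebook calls `train_test_split` without a seed, so its split changes from run to run; the library method also accepts a `seed` argument for repeatable splits.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    n_test = math.ceil(test_size * n_rows)   # assumption: test size is rounded up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # fixed seed makes the split repeatable
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(6872)    # 5497 + 1375 rows in the printed output
print(len(train_idx), len(test_idx))         # 5497 1375
```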

The ` dataset = DatasetDict({'train': train_test_split['train'], 'test': train_test_split['test']})` makes a structure that looks like a dictionary to sort the training and testing data. It makes it very clear which part is for training and which part is for testing. This makes it easy to get to the right dataset later when you run the model.

Lastly, `print(dataset)` shows how the dataset looks after it has been split. It helps make sure that both the training and testing parts were made correctly. This last check makes sure that everything is ready before moving on.






# ***Tokenization***

In [4]:
from transformers import AutoTokenizer

# --- 4. Define the Model Checkpoint ---
# ## <-- KEY CHANGE: Switched to the smaller t5-small model ---
model_checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# --- 5. Create a T5-Specific Preprocessing Function ---
prefix = "summarize: "

def preprocess_function(examples):
    # ## <-- KEY CHANGE: Add the prefix to all input articles ---
    inputs = [prefix + doc for doc in examples["text"]]

    # Tokenize the prefixed inputs
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize the target summaries (labels)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# --- 6. Apply the Tokenization ---
dataset = dataset.filter(lambda x: len(x["text"].split()) < 500)
tokenized_datasets = dataset.map(preprocess_function, batched=True)
print("\nSample of tokenized data prepared for T5:")
print(tokenized_datasets['train'][0].keys())

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Filter:   0%|          | 0/5497 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1375 [00:00<?, ? examples/s]

Map:   0%|          | 0/2816 [00:00<?, ? examples/s]



Map:   0%|          | 0/711 [00:00<?, ? examples/s]


Sample of tokenized data prepared for T5:
dict_keys(['text', 'summary', '__index_level_0__', 'input_ids', 'attention_mask', 'labels'])


First, `from transformers import AutoTokenizer` imports the AutoTokenizer class from the transformers library. A tokenizer changes sentences and words into numbers that a model can understand. AutoTokenizer picks the right tokenizer for a given model, which makes it easier to get text data ready for training or testing language models.

Next, `model_checkpoint = "google-t5/t5-small"` stores the name of the model we're using in a variable called model_checkpoint. The model name, google-t5/t5-small, means that this is a smaller version of the T5 model that runs faster and uses less memory. Putting it in a variable keeps the code neat and lets us change models later if we want to.

This `tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)` then loads the tokenizer that goes with the model that was chosen. It uses the model's name to get the right tokenizer and set it up so that the text can be prepared in a way that the model can understand. Then, the tokenizer is saved in a variable so that it can be used easily throughout the code.

The `prefix = "summarize: "` sets up a short phrase called a prefix that goes in front of each piece of text before it is sent to the model. The prefix tells the model what kind of job it needs to do, which in this case is to summarize. It helps the model better understand the context of the input.

This `def preprocess_function(examples):` starts a function called preprocess_function, which will get the data ready. It takes in examples, a batch of rows from the dataset, and turns them into something the model can use. Each time a batch is passed through, this function adds the prefix and tokenizes the text.

The `inputs = [prefix + doc for doc in examples["text"]]` then makes a new list of text entries by adding the word "summarize:" to the beginning of each article. It goes through each piece of text in the dataset and adds the prefix to the start. This makes sure that all of the inputs are in the same format before they are turned into tokens.

The line `model_inputs = tokenizer(inputs, max_length=1024, truncation=True)` runs the tokenizer, which turns each piece of text into numbers that the model can read. It also caps each input at 1024 tokens, so anything longer is cut off and the input stays small enough to handle. Truncation only sets an upper limit: shorter inputs keep their own length, and padding to a common length happens later, when the data collator builds each batch.
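Truncation puts a ceiling on length but does not pad, so tokenized inputs can still differ in length afterwards. The toy sketch below shows this with made-up token ID lists (not real T5 IDs) and a hypothetical `truncate` helper. The real tokenizer also takes care of special tokens, which this sketch ignores.

```python
def truncate(token_ids, max_length):
    """Cut a token list down to at most max_length items."""
    return token_ids[:max_length]

# Toy token ID lists of different lengths.
batch = [[5, 8, 2], [7, 1, 9, 4, 3, 6, 1], [4, 1]]
truncated = [truncate(ids, max_length=5) for ids in batch]
lengths = [len(ids) for ids in truncated]
print(lengths)  # [3, 5, 2]
```

Only the second list was shortened; the other two kept their original lengths.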

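This `with tokenizer.as_target_tokenizer():` switches the tokenizer into target mode, meaning the lines inside the block tokenize the summaries (the text the model should produce) rather than the inputs. For T5 the same vocabulary is used for inputs and targets, so the result is the same, but the block makes the intent clear. Recent versions of transformers mark this context manager as deprecated and suggest passing the summaries through the tokenizer's `text_target` argument instead.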

The `labels = tokenizer(examples["summary"], max_length=128, truncation=True)` breaks the summary text into tokens in the same way that it did with the input articles. It only lets you use 128 tokens for the summaries, and anything longer is cut down. The labels variable holds the result, which will be matched with the input text later.

This `model_inputs["labels"] = labels["input_ids"]` adds the processed summary data to the model_inputs dictionary under the key labels. This links each piece of input text to its summary. It gets everything ready that the model needs to learn how to make summaries correctly.
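Summaries have different lengths, so their label lists are padded when batches are built. By default the seq2seq data collator pads labels with -100, a value the loss function ignores, and that is why the `compute_metrics` function later in this notebook swaps -100 back to the padding token before decoding. The sketch below imitates both steps with plain lists. The helper names and the toy IDs are illustrative; 0 and 1 are T5's padding and end-of-sequence token IDs.

```python
LABEL_PAD = -100   # value the seq2seq collator uses to pad labels by default
PAD_TOKEN_ID = 0   # T5's padding token ID

def pad_labels(batch, pad_value=LABEL_PAD):
    """Right-pad every label list to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_value] * (longest - len(ids)) for ids in batch]

def restore_pad_ids(batch, pad_token_id=PAD_TOKEN_ID):
    """Swap -100 back to a real token ID so the labels can be decoded."""
    return [[pad_token_id if t == LABEL_PAD else t for t in ids] for ids in batch]

labels = [[71, 1712, 1], [8, 1]]     # toy label IDs, 1 = end-of-sequence
padded = pad_labels(labels)
restored = restore_pad_ids(padded)
print(padded)    # [[71, 1712, 1], [8, 1, -100]]
print(restored)  # [[71, 1712, 1], [8, 1, 0]]
```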

Now, `return model_inputs` sends back the final inputs and summaries that have been processed so they can be used later in the program. This dictionary makes it easy for other parts of the code to get to the tokenized data. It does the job of getting the data ready for the model.

The `dataset = dataset.filter(lambda x: len(x["text"].split()) < 500)` keeps only articles with fewer than 500 words. It counts the whitespace-separated words in each entry and drops the longer ones to speed up processing. Shorter texts also use less memory and are less likely to be cut off by the tokenizer.
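The filter is a simple yes-or-no test applied to every row. In the output above, the `Map` progress bars report 2816 training and 711 test examples, down from the 5497 and 1375 that the `Filter` bars started with, so roughly half of the articles were dropped. The sketch below applies the same test to a few made-up rows. The function name `is_short_enough` is illustrative, and the word count is only a rough stand-in for the number of tokens.

```python
def is_short_enough(example, max_words=500):
    """Same test as the notebook's lambda: whitespace-separated word count."""
    return len(example["text"].split()) < max_words

examples = [
    {"text": "word " * 499},   # 499 words: kept
    {"text": "word " * 500},   # exactly 500 words: dropped, since the test is strict
    {"text": "short article"},
]
kept = [ex for ex in examples if is_short_enough(ex)]
print(len(kept))  # 2
```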

The `tokenized_datasets = dataset.map(preprocess_function, batched=True)` runs the preprocess_function on each part of the dataset. It processes a lot of entries at once by putting them in batches, which speeds things up. The outcome is a new version of the dataset that is completely tokenized and ready to be used for training.
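With `batched=True`, the function passed to `map` receives a dictionary of lists, one list per column, rather than a single row. That is why `preprocess_function` loops over `examples["text"]`. The sketch below shows the shape of such a batch with made-up rows and applies only the prefix step; the function name `add_prefix` and the `"inputs"` key are illustrative and not part of the notebook.

```python
PREFIX = "summarize: "

def add_prefix(examples):
    """Receives a dict of lists (one list per column) when batched=True."""
    return {"inputs": [PREFIX + doc for doc in examples["text"]]}

# Toy batch in the shape that `map(..., batched=True)` passes to the function.
batch = {
    "text": ["first article body", "second article body"],
    "summary": ["first headline", "second headline"],
}
result = add_prefix(batch)
print(result["inputs"][0])  # summarize: first article body
```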

Lastly, `print(tokenized_datasets['train'][0].keys())` prints out the list of keys from the first training example in the dataset that has been processed. This helps make sure that the tokenization worked and shows what types of data are stored. It's a quick way to make sure that the input and label data were set up correctly.

# ***Model Training***

## ***Fine-Tuning the Model***

In [None]:
# --- INSTALL REQUIRED LIBRARIES FOR HYPERPARAMETER SEARCH ---
!pip install ray[tune]

Collecting optuna
  Downloading optuna-4.5.0-py3-none-any.whl.metadata (17 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.10.1-py3-none-any.whl.metadata (11 kB)
Downloading optuna-4.5.0-py3-none-any.whl (400 kB)
Downloading colorlog-6.10.1-py3-none-any.whl (11 kB)
Installing collected packages: colorlog, optuna
Successfully installed colorlog-6.10.1 optuna-4.5.0
Collecting ray[tune]
  Downloading ray-2.51.1-cp312-cp312-manylinux2014_x86_64.whl.metadata (21 kB)
Collecting click!=8.3.0,>=7.0 (from ray[tune])
  Downloading click-8.2.1-py3-none-any.whl.metadata (2.5 kB)
Collecting tensorboardX>=1.9 (from ray[tune])
  Downloading tensorboardx-2.6.4-py3-none-any.whl.metadata (6.2 kB)
Downloading click-8.2.1-py3-none-any.whl (102 kB)

In [5]:
!pip install optuna

Collecting optuna
  Downloading optuna-4.5.0-py3-none-any.whl.metadata (17 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.10.1-py3-none-any.whl.metadata (11 kB)
Downloading optuna-4.5.0-py3-none-any.whl (400 kB)
Downloading colorlog-6.10.1-py3-none-any.whl (11 kB)
Installing collected packages: colorlog, optuna
Successfully installed colorlog-6.10.1 optuna-4.5.0


In [7]:
import transformers
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
import numpy as np
import textstat
import evaluate  # Import the evaluate library

print("Transformers library version:", transformers.__version__)

# --- Model Checkpoint ---
model_checkpoint = "google-t5/t5-small"

# --- 1. DEFINE A MODEL INITIALIZATION FUNCTION ---
def model_init():
    return AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# --- Initialize ROUGE metric ---
rouge = evaluate.load("rouge")

# --- Metrics: ROUGE, readability, and average summary length ---
def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # Ensure predictions are numpy array and handle nested arrays
    predictions = np.array(predictions)
    if predictions.ndim > 2:
        predictions = predictions[:, 0, :] # handle nested arrays from beam search

    # Safe decode predictions
    decoded_preds = []
    for pred in predictions:
        # Ensure valid token IDs before decoding
        pred = np.clip(pred, 0, tokenizer.vocab_size - 1)
        text = tokenizer.decode(pred, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        decoded_preds.append(text)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = [tokenizer.decode(l, skip_special_tokens=True, clean_up_tokenization_spaces=True) for l in labels]

    # --- ROUGE scores ---
    # Filter out empty predictions before computing ROUGE
    filtered_preds_labels = [(p, l) for p, l in zip(decoded_preds, decoded_labels) if p.strip() and l.strip()]
    if filtered_preds_labels:
        filtered_preds, filtered_labels = zip(*filtered_preds_labels)
        rouge_scores = rouge.compute(
            predictions=list(filtered_preds),
            references=list(filtered_labels),
            use_stemmer=True
        )
        rouge1 = rouge_scores["rouge1"] * 100
        rouge2 = rouge_scores["rouge2"] * 100
        rougeL = rouge_scores["rougeL"] * 100
        rougeLsum = rouge_scores["rougeLsum"] * 100
    else:
        rouge1 = rouge2 = rougeL = rougeLsum = 0.0

    # --- Readability ---
    readability_scores = [textstat.flesch_reading_ease(pred) for pred in decoded_preds if pred.strip()]
    avg_readability = np.mean(readability_scores) if readability_scores else 0

    # --- Average Length ---
    prediction_lens = [len(pred.split()) for pred in decoded_preds if pred.strip()]
    avg_length = np.mean(prediction_lens) if prediction_lens else 0

    return {
        "rouge1": round(rouge1, 4),
        "rouge2": round(rouge2, 4),
        "rougeL": round(rougeL, 4),
        "rougeLsum": round(rougeLsum, 4),
        "avg_readability": round(avg_readability, 2),
        "avg_length": round(avg_length, 2),
    }

data_collator = DataCollatorForSeq2Seq(tokenizer, model_init())

# --- 2. DEFINE STATIC TRAINING ARGUMENTS ---
# learning_rate, weight_decay, and warmup_steps stay fixed for this experiment.
training_args = Seq2SeqTrainingArguments(
    output_dir="./t5_hyperparameter_search_batch_epochs",
    do_eval=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,          # <<< FIXED value
    weight_decay=0.02,           # <<< FIXED value
    warmup_steps=500,            # <<< FIXED value
    predict_with_generate=True,
    fp16=True,
    report_to="none"
    # Note: per_device_train_batch_size, per_device_eval_batch_size, and num_train_epochs are REMOVED
    # because they will be defined in the search space below.
)

# --- 3. INITIALIZE THE TRAINER FOR SEARCH ---
trainer = Seq2SeqTrainer(
    args=training_args,
    model_init=model_init,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# --- 4. DEFINE THE NEW SEARCH SPACE ---
# Each function below defines only the batch sizes and the number of epochs.

# === OPTION A: RANDOM SEARCH ===
def random_search_hp_space(trial):
    return {
        "per_device_train_batch_size": trial.suggest_int("per_device_train_batch_size", 2, 4),
        "per_device_eval_batch_size": trial.suggest_int("per_device_eval_batch_size", 4, 8),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 3, 6),
    }

# === OPTION B: GRID SEARCH ===
def grid_search_hp_space(trial):
    return {
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 7]),
        "per_device_eval_batch_size": trial.suggest_categorical("per_device_eval_batch_size", [4, 8]),
        "num_train_epochs": trial.suggest_categorical("num_train_epochs", [2, 3, 5]),
    }

# --- 5. RUN THE HYPERPARAMETER SEARCH ---
print("\nStarting automated hyperparameter search...")

# hp_space selects which search space to use (grid_search_hp_space here).
# The full grid has 2 * 2 * 3 = 12 combinations, but with Optuna's default sampler
# the values are sampled from these lists, so only n_trials combinations are tried.
best_trial = trainer.hyperparameter_search(
    direction="maximize",
    compute_objective=lambda metrics: metrics["eval_avg_readability"],
    n_trials=2,  # Number of trials Optuna runs, for either search space
    hp_space=grid_search_hp_space,
    backend="optuna"
)

# --- 6. DISPLAY THE BEST RESULTS ---
print("\n--- Hyperparameter Search Complete ---")
print(f"  > Objective value (Readability): {best_trial.objective}")
print("  > Best Hyperparameters:")
for param, value in best_trial.hyperparameters.items():
    print(f"    - {param}: {value}")

# --- 7. (OPTIONAL) TRAIN THE FINAL MODEL WITH THE BEST PARAMETERS ---
print("\n--- Training final model with best hyperparameters ---")
for param, value in best_trial.hyperparameters.items():
    setattr(training_args, param, value)
final_trainer = Seq2SeqTrainer(
    model_init=model_init,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
final_trainer.train()
model_save_path = "./my_best_t5_model_automated_batch_epochs"
final_trainer.save_model(model_save_path)
print(f"Final optimized model saved to {model_save_path}")

Transformers library version: 4.57.1


[I 2025-11-07 20:35:43,001] A new study created in memory with name: no-name-5b79452f-2ea3-4093-bbdd-931c674d42d2



Starting automated hyperparameter search...


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Avg Readability | Avg Length |
|---|---|---|---|---|---|---|---|---|
| 1 | No log | 2.173939 | 29.5252 | 12.2784 | 26.5577 | 26.5248 | 57.83 | 10.76 |
| 2 | 2.505000 | 1.998203 | 33.3087 | 14.8402 | 30.4152 | 30.3838 | 56.69 | 9.49 |


[I 2025-11-07 20:41:42,329] Trial 0 finished with value: 56.69 and parameters: {'per_device_train_batch_size': 8, 'per_device_eval_batch_size': 4, 'num_train_epochs': 2}. Best is trial 0 with value: 56.69.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Avg Readability | Avg Length |
|---|---|---|---|---|---|---|---|---|
| 1 | No log | 2.158257 | 29.5215 | 12.2722 | 26.535 | 26.4934 | 57.91 | 10.76 |
| 2 | 2.505000 | 1.984661 | 33.25 | 14.7842 | 30.3552 | 30.3321 | 56.63 | 9.49 |


[I 2025-11-07 20:46:30,774] Trial 1 finished with value: 56.63 and parameters: {'per_device_train_batch_size': 8, 'per_device_eval_batch_size': 8, 'num_train_epochs': 2}. Best is trial 0 with value: 56.69.



--- Hyperparameter Search Complete ---
  > Objective value (Readability): 56.69
  > Best Hyperparameters:
    - per_device_train_batch_size: 8
    - per_device_eval_batch_size: 4
    - num_train_epochs: 2

--- Training final model with best hyperparameters ---


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Avg Readability | Avg Length |
|---|---|---|---|---|---|---|---|---|
| 1 | No log | 2.173939 | 29.5252 | 12.2784 | 26.5577 | 26.5248 | 57.83 | 10.76 |
| 2 | 2.505000 | 1.998203 | 33.3087 | 14.8402 | 30.4152 | 30.3838 | 56.69 | 9.49 |


Final optimized model saved to ./my_best_t5_model_automated_batch_epochs
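
The log above shows two trials, and both used a training batch size of 8 and 2 epochs, so most of the candidate values were never tried. The sketch below counts how many combinations the value lists in `grid_search_hp_space` describe and compares that with `n_trials`. It only does the arithmetic and does not run any training. Covering the whole grid with Optuna would need a grid sampler such as `optuna.samplers.GridSampler`; how to pass one through `hyperparameter_search` is not shown in this notebook and may depend on the transformers version.

```python
from itertools import product

# Candidate values copied from grid_search_hp_space in the notebook.
train_batch_sizes = [8, 7]
eval_batch_sizes = [4, 8]
epoch_choices = [2, 3, 5]

grid = list(product(train_batch_sizes, eval_batch_sizes, epoch_choices))
n_trials = 2  # value passed to hyperparameter_search above

print(len(grid))             # 12 combinations in the full grid
print(len(grid) - n_trials)  # at least 10 combinations were never tried
```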
