# Environment Setup and Installation

- Install the required libraries: `datasets`, `transformers`, and `tensorflow`. The cell below also installs `py7zr`.

In [2]:
!pip install py7zr
!pip install datasets transformers tensorflow



- Import the necessary modules for data processing and machine learning models.

In [3]:
import transformers
import datasets
import numpy as np
import tensorflow as tf
from transformers import (
    AutoTokenizer,
    TFAutoModelForSeq2SeqLM,
)
from datasets import load_dataset

# Model and Tokenizer Initialization

- Use the pre-trained T5-small model. T5-small is a checkpoint with 60 million parameters.  
- Load the tokenizer and model from the `google-t5/t5-small` checkpoint on the Hugging Face Hub.  

In [4]:
model_pretrained = 'google-t5/t5-small'
tokenizer = AutoTokenizer.from_pretrained(model_pretrained)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_pretrained)

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


# Dataset Processing

- Load the Multi-News dataset.

In [5]:
multi_news = load_dataset('multi_news')

The repository for multi_news contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/multi_news.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



In [6]:
print(multi_news)
print(multi_news['train'][10]['document'])
print(multi_news['train'][10]['summary'])

DatasetDict({
    train: Dataset({
        features: ['document', 'summary'],
        num_rows: 44972
    })
    validation: Dataset({
        features: ['document', 'summary'],
        num_rows: 5622
    })
    test: Dataset({
        features: ['document', 'summary'],
        num_rows: 5622
    })
})
WTF?! 
 
 Howard Stern recently completed the ALS Ice Bucket Challenge and shared a video of the do-gooder act on YouTube. While Stern doing the bone-chilling charitable act is nothing out of the ordinary, you may be scratching your head when you hear who he nominates to undertake the challenge next. 
 
 "Hey everybody, it's Howard Stern ready to take the Ice Bucket Challenge," a shirtless Stern says in the video. "I'm accepting the challenge of...who challenged me? Matt Lauer and Jennifer Aniston." ||||| After both Jennifer Aniston and Matt Lauer nominated him, Howard Stern finally accepted the Ice Bucket Challenge - and you won't believe who he nominated! 
 
 Remember, all this ice buc

- Reduce each split to the first 20% of its examples to cut computation time.

In [7]:
# Calculate the reduced size (80% smaller)
train_size = int(len(multi_news['train']) * 0.2)  # 20% of original size
validation_size = int(len(multi_news['validation']) * 0.2)
test_size = int(len(multi_news['test']) * 0.2)

# Create reduced datasets
reduced_multi_news = datasets.DatasetDict({
    'train': multi_news['train'].select(range(train_size)),
    'validation': multi_news['validation'].select(range(validation_size)),
    'test': multi_news['test'].select(range(test_size))
})
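Note that `select(range(n))` keeps the first `n` rows of each split rather than a random sample. The sketch below reproduces the size arithmetic for the split sizes printed earlier and shows one possible way to draw a random subset of indices instead. The helper `sample_indices` and the seed value are assumptions for illustration, not part of the original notebook.

```python
import random

# Split sizes as printed by the notebook for Multi-News.
split_sizes = {"train": 44972, "validation": 5622, "test": 5622}
fraction = 0.2

# Same arithmetic as the cell above: int() truncates toward zero.
reduced_sizes = {name: int(n * fraction) for name, n in split_sizes.items()}
print(reduced_sizes)


def sample_indices(num_rows, fraction, seed=0):
    """Hypothetical helper: pick a sorted random subset of row indices."""
    k = int(num_rows * fraction)
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return sorted(rng.sample(range(num_rows), k))


train_indices = sample_indices(split_sizes["train"], fraction)
print(len(train_indices))
```

A list such as `train_indices` could be passed to `Dataset.select` in place of `range(train_size)` if a random subset is preferred.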

- Create a tokenization function (`get_feature`) that converts each document and its summary into fixed-length token ID sequences the model can process.

In [8]:
def get_feature(batch):
    # Tokenize documents as model inputs and summaries as targets.
    # Both are truncated and padded to 2048 tokens.
    encodings = tokenizer(batch['document'], text_target=batch['summary'],
                          max_length=2048, truncation=True, padding="max_length")
    # Keep only the columns the model consumes.
    return {'input_ids': encodings['input_ids'],
            'attention_mask': encodings['attention_mask'],
            'labels': encodings['labels']}
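Because the targets are padded to `max_length`, the `labels` column contains many pad-token IDs. A common convention with Hugging Face models is to replace those IDs with `-100` so that the model's built-in loss skips them. The sketch below shows that replacement on plain lists with toy token IDs. `mask_label_padding` is a hypothetical helper, the pad ID of 0 is T5's `tokenizer.pad_token_id`, and the approach assumes the model's built-in loss is used; the explicit Keras loss this notebook compiles with does not treat `-100` specially.

```python
PAD_TOKEN_ID = 0     # T5's pad token ID (tokenizer.pad_token_id)
IGNORE_INDEX = -100  # label value skipped by the Hugging Face built-in loss


def mask_label_padding(label_batch, pad_token_id=PAD_TOKEN_ID):
    """Hypothetical helper: replace pad IDs in each label sequence with -100."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_batch
    ]


# Toy label sequences: real token IDs, the EOS ID (1), then padding (0).
labels = [[363, 19, 1, 0, 0], [8, 1, 0, 0, 0]]
masked = mask_label_padding(labels)
print(masked)
```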

In [10]:
multi_news_tf = reduced_multi_news.map(get_feature, batched=True)

In [11]:
print(multi_news_tf)
print(multi_news_tf['train'][10]['document'])
print(multi_news_tf['train'][10]['summary'])
print(multi_news_tf['train'][10]['input_ids'])
print(multi_news_tf['train'][10]['attention_mask'])
print(multi_news_tf['train'][10]['labels'])

DatasetDict({
    train: Dataset({
        features: ['document', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8994
    })
    validation: Dataset({
        features: ['document', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1124
    })
    test: Dataset({
        features: ['document', 'summary', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1124
    })
})
WTF?! 
 
 Howard Stern recently completed the ALS Ice Bucket Challenge and shared a video of the do-gooder act on YouTube. While Stern doing the bone-chilling charitable act is nothing out of the ordinary, you may be scratching your head when you hear who he nominates to undertake the challenge next. 
 
 "Hey everybody, it's Howard Stern ready to take the Ice Bucket Challenge," a shirtless Stern says in the video. "I'm accepting the challenge of...who challenged me? Matt Lauer and Jennifer Aniston." ||||| After both Jennifer Aniston and Matt Lauer nominated him, Howar

# Training Preparation

- Set hyperparameters: learning rate (3e-5), number of epochs (1), batch size (1).

In [12]:
learning_rate = 3e-5
num_train_epochs = 1
batch_size = 1
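With these settings, the number of optimizer steps follows directly from the dataset size. A minimal sketch of the arithmetic, assuming the 8,994 training examples of the reduced split and that no remainder batch is dropped:

```python
import math

num_train_examples = 8994  # size of the reduced training split
batch_size = 1
num_train_epochs = 1

# One optimizer step per batch; ceil() accounts for a final partial batch.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```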

- Convert the dataset into TensorFlow dataset format.

In [13]:
train_dataset = multi_news_tf['train'].to_tf_dataset(
    columns=['input_ids', 'attention_mask', 'labels'],
    shuffle=True,
    batch_size=batch_size,  # Use your desired batch size
)

# Validation dataset, passed to fit() as validation_data; no shuffling needed
validation_dataset = multi_news_tf['validation'].to_tf_dataset(
    columns=['input_ids', 'attention_mask', 'labels'],
    shuffle=False,
    batch_size=batch_size,
)

- Create the Adam optimizer, the loss function, and the metric list, then compile the model.

In [14]:
optimizer = tf.keras.optimizers.Adam(learning_rate, weight_decay=0.000001)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = ['accuracy']

In [15]:
model.compile(optimizer=optimizer, loss=loss_fn, metrics=metrics)
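To make concrete what `SparseCategoricalCrossentropy(from_logits=True)` measures, the sketch below evaluates the same quantity in plain Python: the negative log-softmax probability of the target token, averaged over positions. It is an illustration on toy numbers, not the Keras implementation, and `sparse_ce_from_logits` is a hypothetical name.

```python
import math


def sparse_ce_from_logits(logits, targets):
    """Mean of -log softmax(logits)[target] over all positions."""
    losses = []
    for row, target in zip(logits, targets):
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(row)
        log_norm = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_norm - row[target])
    return sum(losses) / len(losses)


# Uniform logits over 4 classes: every class has probability 1/4.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
loss_uniform = sparse_ce_from_logits(uniform, [0, 1, 2])

# A large logit on the correct class drives the loss toward zero.
confident = [[10.0, 0.0, 0.0, 0.0]] * 3
loss_confident = sparse_ce_from_logits(confident, [0, 0, 0])

print(loss_uniform, loss_confident)
```

Because the labels in this notebook are padded to a fixed length and no mask is applied, padded positions enter this average as well, and the `accuracy` metric counts them too.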

# Training Process

- Fine-tune the model by calling `fit()` on the training dataset.  
- Pass the validation dataset so that validation loss and accuracy are reported at the end of each epoch.  

In [16]:
history = model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=num_train_epochs
)



# Model Saving

- Save the trained model along with the tokenizer to the local directory.  

In [17]:
model.save_pretrained("textualize_model_T5_multinews")
tokenizer.save_pretrained("textualize_model_T5_multinews")

('textualize_model_T5_multinews/tokenizer_config.json',
 'textualize_model_T5_multinews/special_tokens_map.json',
 'textualize_model_T5_multinews/spiece.model',
 'textualize_model_T5_multinews/added_tokens.json',
 'textualize_model_T5_multinews/tokenizer.json')
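The saved directory can later be loaded with `from_pretrained` in the same way as the original checkpoint. The sketch below is one possible way to reload it and generate a summary. The function name `summarize` and the `max_new_tokens` value are assumptions, and the input is truncated to 2048 tokens only to mirror the tokenization used for training.

```python
def summarize(text, model_dir="textualize_model_T5_multinews", max_new_tokens=128):
    """Load the saved model and tokenizer, then generate a summary for `text`."""
    # Imported here so that defining the function does not require the libraries.
    from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = TFAutoModelForSeq2SeqLM.from_pretrained(model_dir)

    # Tokenize the input the same way as during training (no padding needed here).
    inputs = tokenizer(text, max_length=2048, truncation=True, return_tensors="tf")

    # Generate token IDs and decode them back to text.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `summarize(multi_news['test'][0]['document'])` would return a generated summary string for the first test document, assuming the directory above exists.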