# Phase 1: Dataset Preparation

In this phase, we prepare the dataset for text summarization, using a ready-made dataset available on the Hugging Face Hub. The main tasks are:

- **Dataset Loading:** Load a text summarization dataset that contains dialogues and their corresponding summaries.

- **Data Preprocessing:** Clean and format the dialogues and summaries to create input-target pairs.
    To improve the model's performance on conversation summarization, the following Flan-T5 prompt template is used:

    ```markdown
    Here is a dialogue:

        {dialogue}

    Write a short summary!
    ```

- **Tokenizer Initialization:** Use the Flan-T5 tokenizer to process the dataset, ensuring the inputs are properly tokenized for the model.

This phase sets the foundation by ensuring your data is clean, well-structured, and ready to be fed into the model.
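
The prompt template above can be sketched in a few lines of Python. This is only an illustration: `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the exact whitespace the project's preprocessor uses is an assumption, since its implementation is not shown in this notebook.

```python
# Assumed layout of the prompt; the real preprocessor may use different whitespace.
PROMPT_TEMPLATE = "Here is a dialogue:\n\n{dialogue}\n\nWrite a short summary!"


def build_prompt(dialogue: str) -> str:
    """Wrap a raw dialogue in the summarization prompt (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(dialogue=dialogue.strip())


# Made-up example dialogue, for illustration only
example = "#Person1#: Hi, how are you?\n#Person2#: Fine, thanks."
print(build_prompt(example))
```

Keeping the template in a single constant makes it easy to apply the same wording to the training, validation, and test splits.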

---

### Load the data

In [None]:
import sys
sys.path.insert(0,'../')

from transformers import AutoTokenizer

from Pipeline.data_retrieving.HuggingFace_DataRetriever import HuggingFace_DataRetriever
from Pipeline.preprocessing.text_summarization_Preprocessor import TextSummarizationPreprocessor

In [None]:
huggingface_dataset_name = "knkarthick/dialogsum"

data_retriever = HuggingFace_DataRetriever(huggingface_dataset_name)
dataset = data_retriever.retrieve_data()

In [None]:
# split the dataset into train, test and validation sets (as pandas DataFrames)
train_set = dataset['train'].to_pandas()
test_set = dataset['test'].to_pandas()
validation_set = dataset['validation'].to_pandas()

---

### Preprocess the dataset

In [None]:
# define the model to use; in this project it is Flan-T5-base
model_name = 'google/flan-t5-base'

In [None]:
# initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# reuse the tokenizer initialized above instead of loading it again for each side
input_preprocessing_params = {
    'tokenizer': tokenizer
}

output_preprocessing_params = {
    'tokenizer': tokenizer
}

preprocessor = TextSummarizationPreprocessor(input_preprocessing_params, output_preprocessing_params)

In [None]:
# preprocess the training set
tokenized_training_inputs = preprocessor.preprocess_input_data(train_set['dialogue'])
tokenized_training_outputs = preprocessor.preprocess_output_data(train_set['summary'])

# preprocess the validation set
tokenized_validation_inputs = preprocessor.preprocess_input_data(validation_set['dialogue'])
tokenized_validation_outputs = preprocessor.preprocess_output_data(validation_set['summary'])

# preprocess the test set
tokenized_test_inputs = preprocessor.preprocess_input_data(test_set['dialogue'])
tokenized_test_outputs = preprocessor.preprocess_output_data(test_set['summary'])
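
The internals of `TextSummarizationPreprocessor` are not shown here. As a rough sketch of what a tokenization step of this kind might look like with a Hugging Face tokenizer, consider the hypothetical helper below. The name `tokenize_texts`, the `max_length` default of 512, and the padding strategy are all assumptions, not the project's actual settings.

```python
from typing import Callable, Iterable, List


def tokenize_texts(
    texts: Iterable[str],
    tokenizer: Callable,
    max_length: int = 512,  # assumed value, not taken from the project
) -> List[List[int]]:
    """Tokenize texts into fixed-length lists of token ids (hypothetical helper)."""
    encoded = tokenizer(
        list(texts),
        padding="max_length",  # pad every sequence up to max_length
        truncation=True,       # cut sequences longer than max_length
        max_length=max_length,
    )
    return encoded["input_ids"]
```

With the tokenizer initialized earlier, a call such as `tokenize_texts(train_set['dialogue'], tokenizer)` would return one list of token ids per dialogue.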

### Save the dataset locally, using the Hugging Face Dataset format

In [None]:
from datasets import Dataset, DatasetDict

In [None]:
# first create a dictionary with one entry per split
dataset = {
    'train_set': {
        'input_ids': tokenized_training_inputs,
        'labels': tokenized_training_outputs
    },
    'validation_set': {
        'input_ids': tokenized_validation_inputs,
        'labels': tokenized_validation_outputs
    },
    'test_set': {
        'input_ids': tokenized_test_inputs,
        'labels': tokenized_test_outputs
    }
}
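
Converting a split into a `Dataset` requires every column in it to have the same number of rows. A quick check before the conversion can catch a mismatch between inputs and labels early. The helper below is a minimal sketch; `check_split_lengths` is a hypothetical name and not part of the project or of the `datasets` library.

```python
def check_split_lengths(splits: dict) -> dict:
    """Return the number of rows per split; raise if columns in a split differ in length."""
    sizes = {}
    for name, columns in splits.items():
        # number of rows in each column of this split
        lengths = {column: len(values) for column, values in columns.items()}
        if len(set(lengths.values())) > 1:
            raise ValueError(f"Split {name!r} has mismatched columns: {lengths}")
        sizes[name] = next(iter(lengths.values()), 0)
    return sizes
```

Calling `check_split_lengths(dataset)` on the dictionary defined above returns a mapping from split name to row count.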

In [None]:
# then convert each split into a Hugging Face Dataset
train_dataset = Dataset.from_dict(dataset["train_set"])
validation_dataset = Dataset.from_dict(dataset["validation_set"])
test_dataset = Dataset.from_dict(dataset["test_set"])

In [None]:
# and finally pack it into a DatasetDict structure
dataset = DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
    "test": test_dataset
})

In [None]:
# save locally
import pickle

# Open a file in write-binary mode
with open("data/dataset_t5_base.pkl", "wb") as file:
    # Serialize the DatasetDict and save it to the file
    pickle.dump(dataset, file)
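
To reuse the saved dataset in a later phase, the file can be read back with `pickle.load`. The helper below is a minimal sketch; `load_pickled_dataset` is a hypothetical name, and the path in the usage note simply mirrors the one used above.

```python
import pickle
from pathlib import Path


def load_pickled_dataset(path):
    """Load an object that was previously saved with pickle.dump (hypothetical helper)."""
    with Path(path).open("rb") as file:
        return pickle.load(file)
```

For example, `dataset = load_pickled_dataset("data/dataset_t5_base.pkl")` would restore the `DatasetDict`, provided the `datasets` package is installed in the loading environment. Pickle files should only be loaded from trusted sources. The `datasets` library also provides `DatasetDict.save_to_disk` and `load_from_disk` as a native alternative to pickling.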
