<a href="https://colab.research.google.com/github/claudelkros/Fine-tuning-for-french-dataset/blob/main/final_t5_model_for_french_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine Tuning Transformer for Summary Generation


### Introduction

In this tutorial we will fine-tune a transformer model for the **summarization task**.
In this task, a network generates a summary of a given article or document. There are two types of summary generation mechanisms:

1. ***Extractive Summary:*** The network selects the most important sentences from the article and combines them to convey the most meaningful information from the article.
2. ***Abstractive Summary:*** The network creates new sentences that capture the gist of the article and generates them as output. The sentences in the summary may or may not appear in the article.

In this tutorial we will be generating ***Abstractive Summary***. 
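
To make the distinction concrete, here is a minimal sketch of the extractive approach, which only selects sentences that already exist in the article. The scoring rule (summed word frequency) and the name `extractive_summary` are illustrative assumptions and are not part of this notebook's pipeline; the abstractive model fine-tuned below generates new sentences instead.

```python
import re
from collections import Counter

def extractive_summary(article, num_sentences=1):
    """Return the highest-scoring sentences, copied verbatim from the article."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', article) if s.strip()]
    # Word frequencies over the whole article (lower-cased).
    freq = Counter(re.findall(r'\w+', article.lower()))

    def score(sentence):
        # A sentence scores higher when it contains frequent words.
        return sum(freq[w] for w in re.findall(r'\w+', sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Keep the selected sentences in their original order.
    return ' '.join(s for s in sentences if s in top)

article = ("The model reads the article. "
           "The model scores each sentence of the article. "
           "Cats are nice.")
summary = extractive_summary(article)
print(summary)
```

Every sentence in the output is guaranteed to appear in the input, which is exactly what an abstractive summary does not promise.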

#### Flow of the notebook

* Like the previous tutorials, this notebook follows a series of easy-to-follow steps, making the process of fine-tuning and training a Transformers model straightforward.
* However, unlike the other notebooks, most of the sections in this tutorial have been written as functions that are called from `main()` at the end of the notebook.
* This is done to leverage the [Weights and Biases Service](https://www.wandb.com/), WandB for short.
* It is an experiment tracking, parameter optimization and artifact management service that can be integrated with most deep learning or machine learning frameworks.

The notebook is divided into separate sections to provide an organized walkthrough of the process used. This process can be modified for individual use cases. The sections are:

1. [Preparing Environment and Importing Libraries](#section01)
2. [Preparing the Dataset for data processing: Class](#section02)
3. [Fine Tuning the Model: Function](#section03)
4. [Validating the Model Performance: Function](#section04)
5. [Main Function](#section05)
    * [Initializing WandB](#section501)
    * [Importing and Pre-Processing the domain data](#section502)
    * [Creation of Dataset and Dataloader](#section503)
    * [Neural Network and Optimizer](#section504)
    * [Training Model and Logging to WandB](#section505)
    * [Validation and generation of Summary](#section506)
6. [Examples of the Summary Generated from the model](#section06)


#### Technical Details

This script relies on multiple tools designed by other teams. Details of the tools used are given below. Please ensure that these elements are present in your setup to run this script successfully.

- **Data**:
	- We are using the News Summary dataset available at [Kaggle](https://www.kaggle.com/sunnysai12345/news-summary)
	- This dataset is a collection created from newspapers published in India, with the details listed below extracted for each article. We refer only to the first CSV file from the data dump: `news_summary.csv`
	- There are `4514` rows of data. Each row has the following data points:
		- **author** : Author of the article
		- **date** : Date the article was published
		- **headline**: Headline for the published article
		- **read_more** : URL for the article to follow online
		- **text**: This is the summary of the article
		- **ctext**: This is the complete article


- **Language Model Used**: 
    - This notebook uses the transformer model ***T5***. [Research Paper](https://arxiv.org/abs/1910.10683)
    - ***T5*** reported state-of-the-art results on many NLP tasks in its paper, and it takes a unified approach to those tasks.
    - **Text-2-Text** - As the graphic taken from the T5 paper shows, all NLP tasks are converted to a **text-to-text** problem. Tasks such as translation, classification, summarization and question answering are all treated as text-to-text conversion problems rather than as separate, unique problem statements.
    - **Unified approach for NLP Deep Learning** - Since the task is reflected purely in the text input and output, you can use the same model, objective, training procedure, and decoding process for any task, such as Q&A or summarization.
    - We will follow the T5 paper when preparing our dataset prior to fine-tuning and training.
    - [Documentation for python](https://huggingface.co/transformers/model_doc/t5.html)

![**Each NLP problem as a “text-to-text” problem** - input: text, output: text](https://miro.medium.com/max/4006/1*D0J1gNQf8vrrUpKeyD8wPA.png) 
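
As a small illustration of the text-to-text formatting shown in the graphic, the sketch below builds model inputs by prepending a task prefix to the raw text. The helper name `to_text_to_text` is hypothetical; the two prefixes are the ones used as examples in the T5 paper.

```python
def to_text_to_text(task_prefix, text):
    """Format an input the text-to-text way: '<task prefix>: <input text>'."""
    # Collapse runs of whitespace so the model sees one clean line of text.
    return f"{task_prefix}: {' '.join(text.split())}"

examples = [
    to_text_to_text("summarize", "A long news   article ..."),
    to_text_to_text("translate English to German", "That is good."),
]
for example in examples:
    print(example)
```

Because the task is encoded in the input string, the same model and training loop can serve every task; only the prefix and the target text change.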
	 


- **Hardware and Software Requirements**:
	- Python 3.6 and above
	- PyTorch and Transformers
	- The standard Python ML libraries
	- GPU-enabled setup
   

- **Script Objective**:
	- The objective of this script is to fine-tune ***T5*** so that it generates a summary that is close to or better than the actual summary, while ensuring that the important information from the article is not lost.

---
NOTE: 
We are using the Weights and Biases tool-set in this tutorial. The different components will be explained as we go through the article.

<a id='section01'></a>
### Preparing Environment and Importing Libraries

At this step we will be installing the necessary libraries followed by importing the libraries and modules needed to run our script. 
We will be installing:
* transformers
* wandb

Libraries imported are:
* Pandas
* Pytorch
* Pytorch Utils for Dataset and Dataloader
* Transformers
* T5 Model and Tokenizer
* wandb

After that, we will prepare the device for CUDA execution. This configuration is needed if you want to use the onboard GPU. First we check the GPU available to us using the `nvidia-smi` command, and then we define our device.

Finally, we will log in to the [wandb](https://www.wandb.com/) service using the login command.

In [None]:
!pip install transformers -q
!pip install wandb -q

# Code for TPU packages install
# !curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
# !python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev

In [None]:
# Importing stock libraries
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

# WandB – Import the wandb library
import wandb

In [None]:
# Checking out the GPU we have access to. This output is from the Google Colab version.
!nvidia-smi

Sat Nov 21 00:32:44 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

# Preparing for TPU usage
# import torch_xla
# import torch_xla.core.xla_model as xm
# device = xm.xla_device()

In [None]:
# Login to wandb to log the model run and all the parameters
!wandb login

wandb: Currently logged in as: claudelkros (use `wandb login --relogin` to force relogin)


<a id='section02'></a>
### Preparing the Dataset for data processing: Class

We will start by creating the Dataset class. It defines how the text is pre-processed before it is sent to the neural network. This dataset is then used by the Dataloader, which feeds the data in batches to the neural network for training and processing.
The Dataloader and Dataset will be used inside `main()`.
Dataset and Dataloader are constructs of the PyTorch library for defining and controlling the data pre-processing and its passage to the neural network. For further reading on Dataset and Dataloader, see the [docs at PyTorch](https://pytorch.org/docs/stable/data.html)

#### *CustomDataset* Dataset Class
- This class accepts the Dataframe as input and generates the tokenized output that the **T5** model uses for training.
- We are using the **T5** tokenizer to tokenize the data in the `text` and `summary` columns of the dataframe.
- The tokenizer uses the `batch_encode_plus` method to perform tokenization and generate the necessary outputs, namely `source_ids` and `source_mask` from the article text, and `target_ids` and `target_mask` from the summary text.
- To read further into the tokenizer, [refer to this document](https://huggingface.co/transformers/model_doc/t5.html#t5tokenizer)
- The *CustomDataset* class is used to create 2 datasets, one for training and one for validation.
- *Training Dataset* is used to fine-tune the model: **80% of the original data**
- *Validation Dataset* is used to evaluate the performance of the model. The model has not seen this data during training.

#### Dataloader: Called inside the `main()`
- Dataloader is used to create the training and validation dataloaders, which load data to the neural network in a defined manner. This is needed because all the data from the dataset cannot be loaded into memory at once, so the amount of data loaded into memory and then passed to the neural network needs to be controlled.
- This control is achieved using parameters such as `batch_size` and `max_len`.
- The training and validation dataloaders are used in the training and validation parts of the flow respectively.
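
The effect of a fixed `max_len` can be sketched in plain Python: every sequence is cut or padded to the same length, and an attention mask records which positions hold real tokens. This is only an illustration of what the tokenizer does for us; the helper `pad_or_truncate` is hypothetical and the token ids are made-up values (the pad id `0` matches T5's pad token id).

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, mask), both exactly max_len long."""
    ids = token_ids[:max_len]              # truncate sequences that are too long
    mask = [1] * len(ids)                  # 1 marks a real token
    padding = max_len - len(ids)
    # Pad short sequences; 0 in the mask marks padding.
    return ids + [pad_id] * padding, mask + [0] * padding

ids, mask = pad_or_truncate([37, 1712, 19, 1], max_len=6)
short_ids, short_mask = pad_or_truncate([37, 1712, 19, 1], max_len=2)
print(ids, mask)
print(short_ids, short_mask)
```

Since all sequences end up with the same length, the dataloader can stack `batch_size` of them into one tensor.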

In [None]:
# Creating a custom dataset for reading the dataframe and loading it into the dataloader to pass it to the neural network at a later stage for finetuning the model and to prepare it for predictions

class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = summ_len
        self.text = self.data.text
        self.summary = self.data.summary

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        summary = str(self.summary[index])
        summary = ' '.join(summary.split())

        text = str(self.text[index])
        text = ' '.join(text.split())

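        # `padding='max_length'` replaces the deprecated `pad_to_max_length=True`; `truncation=True` cuts longer inputs to max_length
        source = self.tokenizer.batch_encode_plus([text], max_length=self.source_len, padding='max_length', truncation=True, return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([summary], max_length=self.summ_len, padding='max_length', truncation=True, return_tensors='pt')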

        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()

        return {
            'source_ids': source_ids.to(dtype=torch.long), 
            'source_mask': source_mask.to(dtype=torch.long), 
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }

<a id='section03'></a>
### Fine Tuning the Model: Function

Here we define a training function that trains the model on the training dataset created above for a specified number of epochs. An epoch is one complete pass of the training data through the network.

This function is called in the `main()`

Following events happen in this function to fine tune the neural network:
- The epoch, tokenizer, model, device details, training dataloader and optimizer are passed to `train()` when it is called from `main()`.
- The dataloader passes data to the model based on the batch size.
- `lm_labels` are calculated from the `target_ids`; `source_ids` and `source_mask` are also extracted.
- The first element of the model output is the loss for the forward pass.
- The loss value is used to optimize the weights of the neurons in the network.
- Every 10 steps the loss value is logged to the wandb service. This log is then used to generate graphs for analysis, such as [these](https://app.wandb.ai/abhimishra-91/transformers_tutorials_summarization?workspace=user-abhimishra-91)
- Every 500 steps the loss value is printed in the console.
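
The label computation described in the list can be sketched on a single sequence with plain Python lists. The token ids are made-up values and the helper `shift_targets` is hypothetical; `-100` is the default index that PyTorch's cross-entropy loss ignores, and `0` is T5's pad token id.

```python
PAD_ID = 0           # T5's pad token id
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def shift_targets(target_ids):
    """Split one padded target sequence into decoder inputs and labels."""
    # Decoder inputs: every target token except the last one.
    decoder_input_ids = target_ids[:-1]
    # Labels: the same tokens shifted left by one, with padding masked out.
    labels = [IGNORE_INDEX if t == PAD_ID else t for t in target_ids[1:]]
    return decoder_input_ids, labels

decoder_input_ids, labels = shift_targets([8, 15, 23, 1, 0, 0])
print(decoder_input_ids)
print(labels)
```

At each position the decoder sees the tokens so far and is trained to predict the next one; padded positions contribute nothing to the loss.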

In [None]:
%mkdir checkpoint best_model

mkdir: cannot create directory ‘checkpoint’: File exists
mkdir: cannot create directory ‘best_model’: File exists


In [None]:
import shutil

def save_ckp(state, is_best, checkpoint_path, best_model_path):
    """
    state: checkpoint we want to save
    is_best: is this the best checkpoint; min validation loss
    checkpoint_path: path to save checkpoint
    best_model_path: path to save best model
    """
    f_path = checkpoint_path
    # save checkpoint data to the path given, checkpoint_path
    torch.save(state, f_path)
    # if it is a best model, min validation loss
    if is_best:
        best_fpath = best_model_path
        # copy that checkpoint file to best path given, best_model_path
        shutil.copyfile(f_path, best_fpath)

In [None]:
def load_ckp(checkpoint_fpath, model, optimizer):
    """
    checkpoint_fpath: path of the checkpoint file to load
    model: model that we want to load checkpoint parameters into       
    optimizer: optimizer we defined in previous training
    """
    # load check point
    checkpoint = torch.load(checkpoint_fpath)
    # initialize state_dict from checkpoint to model
    model.load_state_dict(checkpoint['state_dict'])
    # initialize optimizer from checkpoint to optimizer
    optimizer.load_state_dict(checkpoint['optimizer'])
    # initialize valid_loss_min from checkpoint to valid_loss_min
    valid_loss_min = checkpoint['valid_loss_min']
    # return model, optimizer, epoch value, min validation loss 
    return model, optimizer, checkpoint['epoch'], valid_loss_min.item()

In [None]:
import dill

In [None]:
# Creating the training function. This will be called in the main function once per epoch.
# The model is put into train mode, then we enumerate over the training loader and pass each batch to the network.

def train(epoch, tokenizer, model, device, loader, optimizer):
    model.train()
    # Save a snapshot of the current weights at the start of the epoch
    torch.save(model.state_dict(), 'model_simple.pt')
    for step, data in enumerate(loader, 0):
        y = data['target_ids'].to(device, dtype = torch.long)
        # Decoder inputs are the target tokens without the last one
        y_ids = y[:, :-1].contiguous()
        # Labels are the target tokens shifted by one; padding is set to -100 so the loss ignores it
        lm_labels = y[:, 1:].clone().detach()
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data['source_ids'].to(device, dtype = torch.long)
        mask = data['source_mask'].to(device, dtype = torch.long)

        # Recent transformers releases name this argument `labels` (formerly `lm_labels`)
        outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs[0]
        
        if step%10 == 0:
            wandb.log({"Training Loss": loss.item()})

        if step%500==0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # xm.optimizer_step(optimizer)
        # xm.mark_step()


<a id='section04'></a>
### Validating the Model Performance: Function

During the validation stage we pass the unseen data (the validation dataset), the trained model, the tokenizer and the device details to the function to perform the validation run. This step generates new summaries for data that the model has not seen during training.

This function is called in the `main()`

This unseen data is the 20% of the dataset that was separated during the Dataset creation stage.
During the validation stage the weights of the model are not updated. We use the `generate` method to generate new text for the summary.

It relies on beam search decoding, a method for sequence generation with models that have a language modeling head.

The generated text and the original summary are decoded from tokens to text and returned to `main()`.
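
To give an intuition for beam search, here is a toy sketch over a hand-made table of next-token probabilities. The table `NEXT` and the function `beam_search` are hypothetical and greatly simplified; the `generate` method additionally applies settings such as `repetition_penalty` and `length_penalty`, which are omitted here.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT = {
    '<s>': {'the': 0.6, 'a': 0.4},
    'the': {'cat': 0.3, 'dog': 0.3, '</s>': 0.4},
    'a':   {'cat': 0.9, '</s>': 0.1},
    'cat': {'</s>': 1.0},
    'dog': {'</s>': 1.0},
}

def beam_search(num_beams, max_length):
    beams = [(['<s>'], 0.0)]  # each beam is (tokens, log-probability)
    for _ in range(max_length):
        candidates = []
        for tokens, logp in beams:
            if tokens[-1] == '</s>':
                # Finished hypotheses are carried over unchanged.
                candidates.append((tokens, logp))
                continue
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [tok], logp + math.log(p)))
        # Keep only the num_beams most probable hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0][0]

greedy = beam_search(num_beams=1, max_length=3)
beam = beam_search(num_beams=2, max_length=3)
print(greedy)
print(beam)
```

With a single beam the search commits to `the` because it is the most likely first token and ends with a sequence of probability 0.24. With two beams it keeps `a` alive for one more step and finds `a cat`, whose total probability is 0.36.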

In [None]:
def validate(epoch, tokenizer, model, device, loader):
    model.eval()
    predictions = []
    actuals = []
    with torch.no_grad():
        for _, data in enumerate(loader, 0):
            y = data['target_ids'].to(device, dtype = torch.long)
            ids = data['source_ids'].to(device, dtype = torch.long)
            mask = data['source_mask'].to(device, dtype = torch.long)

            generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=150, 
                num_beams=2,
                repetition_penalty=2.5, 
                length_penalty=1.0, 
                early_stopping=True
                )
            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True)for t in y]
            if _%100==0:
                print(f'Completed {_}')

            predictions.extend(preds)
            actuals.extend(target)
    return predictions, actuals

<a id='section05'></a>
### Main Function

The `main()` as the name suggests is the central location to execute all the functions/flows created above in the notebook. The following steps are executed in the `main()`:


<a id='section501'></a>
#### Initializing WandB 

* `main()` begins by initializing a WandB run under a specific project. This command starts a new run each time it is executed.

* Before we proceed any further, here is a brief overview of the **[WandB Service](https://www.wandb.com/)**

* This service was created to track ML experiments, optimize the experiments and save artifacts. It is designed to integrate with machine learning and deep learning frameworks. Each script can be organized into a *Project*, and each execution of the script is registered as a *run* in the respective project.

* The service can be configured to log several default metrics, such as hardware usage, gradients and weights of the network.

* It can also be used to log user-defined metrics, such as the loss in `train()`.

* This particular tutorial is logged in the project: **[transformers_tutorials_summarization](https://app.wandb.ai/abhimishra-91/transformers_tutorials_summarization?workspace=user-abhimishra-91)**

**One of the charts from the project**
![](https://github.com/abhimishra91/transformers-tutorials/blob/master/meta/wandb.png?raw=1)

* Visit the project page to see the details of different runs and what information is logged by the service. 

* Following the initialization of the WandB service, we define configuration parameters that will be used across the tutorial, such as `batch_size`, `epoch` and `learning_rate`.

* These parameters are also passed to the WandB config. The config construct with all the parameters can be optimized using the Sweep service from WandB. Currently, that is out of scope for this tutorial.

* Next we define seed values so that the experiment and results can be reproduced.
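
The role of the seed can be shown with the standard library alone. In this sketch, `sample_run` is a hypothetical stand-in for an experiment; in `main()` the same idea is applied through `torch.manual_seed` and `np.random.seed`.

```python
import random

def sample_run(seed):
    """Stand-in for an experiment: draw a few pseudo-random numbers."""
    rng = random.Random(seed)  # independent generator with a fixed seed
    return [rng.random() for _ in range(3)]

run_a = sample_run(42)
run_b = sample_run(42)
run_c = sample_run(7)
print(run_a == run_b)
print(run_a == run_c)
```

Two runs with the same seed draw the same numbers, so shuffling and data splits repeat exactly; a different seed gives a different sequence.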


<a id='section502'></a>
#### Importing and Pre-Processing the domain data

We will be working with the data and preparing it for fine tuning purposes. 
*Assuming that the `news_summary.csv` is already downloaded in your `data` folder*

* The file is imported as a dataframe and given headers as per the documentation.
* The file is cleaned to remove the unwanted columns.
* A new string, `summarize: `, is added before the actual article in the main article column. This is done because **T5** used similar formatting for the summarization dataset.
* The final Dataframe will be something like this:

|text|ctext|
|--|--|
|summary-1|summarize: article 1|
|summary-2|summarize: article 2|
|summary-3|summarize: article 3|

* Top 5 rows of the dataframe are printed on the console.

<a id='section503'></a>
#### Creation of Dataset and Dataloader

* The updated dataframe is split in an 80-20 ratio for training and validation.
* Both dataframes are passed to the `CustomDataset` class for tokenization of the news articles and their summaries.
* The tokenization is done using the length parameters passed to the class.
* Train and validation parameters are defined and passed to the PyTorch `DataLoader` construct to create `train` and `validation` data loaders.
* These dataloaders will be passed to `train()` and `validate()` respectively for training and validation action.
* The shape of datasets is printed in the console.
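
The 80-20 split can be sketched without pandas as follows. `main()` performs it with `df.sample(frac=train_size, random_state=config.SEED)` followed by `df.drop(train_dataset.index)`; the helper `split_train_val` below is a hypothetical equivalent on a plain list.

```python
import random

def split_train_val(rows, train_frac=0.8, seed=42):
    """Shuffle indices with a fixed seed, then cut them into two disjoint sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(rows) * train_frac)
    train = [rows[i] for i in indices[:cut]]
    val = [rows[i] for i in indices[cut:]]
    return train, val

rows = [f"article {i}" for i in range(10)]
train_rows, val_rows = split_train_val(rows)
print(len(train_rows), len(val_rows))
```

No row lands in both sets, so the validation data stays unseen during training.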


<a id='section504'></a>
#### Neural Network and Optimizer

* In this stage we define the model and optimizer that will be used for training and to update the weights of the network. 
* We are using the `t5-base` transformer model for our project. You can read about the `T5 model` and its features above. 
* We use the `T5ForConditionalGeneration.from_pretrained("t5-base")` command to define our model. `T5ForConditionalGeneration` adds a Language Model head to our `T5 model`. The Language Model head allows us to generate text based on the training of the `T5 model`.
* We are using the `Adam` optimizer for our project. This has been the standard for all our tutorials, and it can be changed to see how different optimizers perform with different learning rates.
* There is also scope for doing more with the optimizer, such as decay and momentum, to dynamically update the learning rate and other parameters. All those concepts have been kept out of scope for these tutorials.
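
For readers curious about what the optimizer does with the learning rate, here is a minimal sketch of one Adam update for a single scalar parameter. The function `adam_step` is hypothetical; the default values for `beta1`, `beta2` and `eps` are those of `torch.optim.Adam`, and `lr=1e-4` is the value set in `main()`.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at step t (1-based)."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(param)
```

On the first step the update has a size of roughly `lr` regardless of the gradient's magnitude, which is one reason a small learning rate such as `1e-4` is a common choice for fine-tuning.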


<a id='section505'></a>
#### Training Model and Logging to WandB

* Now we log all the metrics in WandB project that we have initialized above.
* Followed by that we call the `train()` with all the necessary parameters.
* Loss at every 500th step is printed on the console.
* Loss at every 10th step is logged as Loss in the WandB service.


<a id='section506'></a>
#### Validation and generation of Summary

* After the training is completed, the validation step is initiated.
* As defined in the validation function, the model weights are not updated. We use the fine tuned model to generate new summaries based on the article text.
* An output is printed on the console giving a count of how many steps are complete after every 100th step. 
* The original summaries and generated summaries are collected into lists and returned to the main function.
* Both lists are used to create the final dataframe with 2 columns, **Generated Text** and **Actual Text**.
* The dataframe is saved as a CSV file on the local drive.
* A qualitative analysis can be done with the Dataframe. 
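
The final export step can be sketched with the standard `csv` module. `main()` does this with a pandas dataframe and `to_csv`; the helper `write_predictions` is a hypothetical equivalent, the column names are the ones used in `main()`, and the two example strings are placeholders rather than model output.

```python
import csv
import io

def write_predictions(predictions, actuals, file_obj):
    """Write generated and reference summaries side by side as CSV rows."""
    writer = csv.writer(file_obj)
    writer.writerow(['Generated Text', 'Actual Text'])
    for generated, actual in zip(predictions, actuals):
        writer.writerow([generated, actual])

buffer = io.StringIO()  # in-memory stand-in for a file on disk
write_predictions(['résumé généré'], ['résumé de référence'], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the generated and reference summaries in the same row makes it easy to read them side by side during the qualitative review.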

In [None]:
import wandb

In [None]:
import torch
import torchvision
import numpy as np

In [None]:
import tensorflow as tf

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
import torch
import shutil
from torch import nn
from torch import optim
import torch.nn.functional as F
from torchvision import datasets, transforms
import numpy as np

In [None]:
%mkdir models model

mkdir: cannot create directory ‘models’: File exists
mkdir: cannot create directory ‘model’: File exists


In [None]:
import tensorflow as tf
import os

In [None]:
!pip install datasets
from datasets import load_dataset 
df = load_dataset('mlsum', 'fr')



Reusing dataset mlsum (/root/.cache/huggingface/datasets/mlsum/fr/1.0.0/c5bff4e58cb961d638232afed081348215ca85c4687a04ae80f025d56ce18327)


In [None]:
import pandas as pd
import numpy as np


In [None]:
text = df['train']['text']

In [None]:
type(text)

list

In [None]:
dt = pd.DataFrame(text)

In [None]:
dt[0][0]

'Jean-Jacques Schuhl, Gilles Leroy, Christian Gailly, Yasmina Khadra, James Ellroy, Amos Oz, V. S. Naipaul… Si la rentrée de janvier s\'annonce sous les meilleurs auspices, force est de constater que, avec 491 romans (contre 558 en 2009), la tendance à la baisse enregistrée à l\'automne s\'accentue. Principale victime de cette glaciation : la littérature étrangère, qui enregistre un recul de 21 % avec 167 romans, contre 211 l\'an dernier. Soit son plus bas niveau depuis 2001. Peut-être doit-on voir là le contrecoup de l\'augmentation des droits d\'auteur et de traduction, mais aussi le fait que les organisateurs du Salon du livre ont choisi de célébrer les trente ans de la manifestation en invitant non pas un pays mais des écrivains français et étrangers. Si la littérature française, de son côté, marque un léger fléchissement avec 324 livres, contre 347 l\'an passé, les premiers romans après un automne en demi-teinte repartent à la hausse avec 73 titres, contre 61 en 2009. Loin de l\'e

In [None]:
summary = df['train']['summary']

In [None]:
ds = pd.DataFrame(summary)

In [None]:
ds[0][0]

"Jean-Jacques Schuhl, Gilles Leroy, Christian Gailly, Yasmina Khadra, James Ellroy, Amos Oz, V. S. Naipaul… La rentrée de janvier s'annonce sous les meilleurs auspices."

In [None]:
text_array = np.array(dt)
summary_array = np.array(ds)

In [None]:
column_text = ['text']
column_summary = ['summary']

In [None]:
df_text = pd.DataFrame(data = text_array,columns = column_text)

In [None]:
df_summary = pd.DataFrame(data = summary_array,columns = column_summary)

In [None]:
frames = [df_text, df_summary]

In [None]:
results = pd.concat(frames, axis=1)

In [None]:
results

| |text|summary|
|--|--|--|
|0|Jean-Jacques Schuhl, Gilles Leroy, Christian G...|Jean-Jacques Schuhl, Gilles Leroy, Christian G...|
|1|Une semaine après l'attaque terroriste manquée...|Cette demande intervient une semaine après l'a...|
|2|Un juge américain a rejeté, jeudi 31 décembre,...|Un juge américain a rejeté jeudi les accusatio...|
|3|Un attentat a fait au moins 93 morts et plusie...|Un kamikaze a fait exploser sa voiture piégée ...|
|4|Cinq personnes sont mortes, et treize autres o...|Cinq personnes sont mortes, et treize autres o...|
|...|...|...|
|392897|Seules les personnes employées par des particu...|Dès le mois de janvier, l’impôt sera désormais...|
|392898|Carlos Ghosn à Paris, le 6 octobre 2017. MICHE...|L’ex-président du constructeur japonais Nissan...|
|392899|Lors d’une manifestation anti-Brexit aux abord...|Le départ du Royaume-Uni de l’Union européenne...|
|392900|La chancelière allemande Angela Merkel, peu ap...|Il appartient à Berlin de « tenir bon, argumen...|


In [None]:
results.to_csv('corpus.csv', encoding='utf-8')

In [None]:
# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = load_dataset('mlsum', 'fr', split=['train', 'test'])

Reusing dataset mlsum (/root/.cache/huggingface/datasets/mlsum/fr/1.0.0/c5bff4e58cb961d638232afed081348215ca85c4687a04ae80f025d56ce18327)


In [None]:
train_ds['text']

In [None]:
def main():
    # WandB – Initialize a new run
    wandb.init(project="transformers_tutorials_summarization")

    # WandB – Config is a variable that holds and saves hyperparameters and inputs
    # Defining some key variables that will be used later on in the training  
    config = wandb.config          # Initialize config
    config.TRAIN_BATCH_SIZE = 2    # input batch size for training (default: 64)
    config.VALID_BATCH_SIZE = 2    # input batch size for testing (default: 1000)
    config.TRAIN_EPOCHS = 2        # number of epochs to train (default: 10)
    config.VAL_EPOCHS = 1 
    config.LEARNING_RATE = 1e-4    # learning rate (default: 0.01)
    config.SEED = 42               # random seed (default: 42)
    config.MAX_LEN = 512
    config.SUMMARY_LEN = 150 

    # Set random seeds and deterministic pytorch for reproducibility
    torch.manual_seed(config.SEED) # pytorch random seed
    np.random.seed(config.SEED) # numpy random seed
    torch.backends.cudnn.deterministic = True

    # tokenizer for encoding the text
    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    

    # Importing and Pre-Processing the domain data
    # Selecting the needed columns only. 
    # Adding the task prefix in front of the text. This is to format the dataset similar to how the T5 model was trained for the summarization task.
    df = pd.read_csv('corpus.csv',encoding='utf-8')
    df = df[['text','summary']]
    df.text = 'summary: ' + df.text
    print(df.head())

    
    # Creation of Dataset and Dataloader
    # Defining the train size. So 80% of the data will be used for training and the rest will be used for validation. 
    train_size = 0.8
    train_dataset=df.sample(frac=train_size,random_state = config.SEED)
    val_dataset=df.drop(train_dataset.index).reset_index(drop=True)
    train_dataset = train_dataset.reset_index(drop=True)

    print("FULL Dataset: {}".format(df.shape))
    print("TRAIN Dataset: {}".format(train_dataset.shape))
    print("TEST Dataset: {}".format(val_dataset.shape))


    # Creating the Training and Validation dataset for further creation of Dataloader
    training_set = CustomDataset(train_dataset, tokenizer, config.MAX_LEN, config.SUMMARY_LEN)
    val_set = CustomDataset(val_dataset, tokenizer, config.MAX_LEN, config.SUMMARY_LEN)

    # Defining the parameters for creation of dataloaders
    train_params = {
        'batch_size': config.TRAIN_BATCH_SIZE,
        'shuffle': True,
        'num_workers': 0
        }

    val_params = {
        'batch_size': config.VALID_BATCH_SIZE,
        'shuffle': False,
        'num_workers': 0
        }

    # Creation of Dataloaders for training and validation. These will be used below in the training and validation stages of the model.
    training_loader = DataLoader(training_set, **train_params)
    val_loader = DataLoader(val_set, **val_params)


    
    # Defining the model. We are using t5-base model and added a Language model layer on top for generation of Summary. 
    # Further this model is sent to device (GPU/TPU) for using the hardware.
    model = T5ForConditionalGeneration.from_pretrained("t5-base")
    model = model.to(device)
    
    # Defining the optimizer that will be used to tune the weights of the network in the training session. 
    optimizer = torch.optim.Adam(params =  model.parameters(), lr=config.LEARNING_RATE)

    # Log metrics with wandb
    wandb.watch(model, log="all")
    # Training loop
    print('Initiating Fine-Tuning for the model on our dataset')

    for epoch in range(config.TRAIN_EPOCHS):
        train(epoch, tokenizer, model, device, training_loader, optimizer)
        
    # Validation loop and saving the resulting file with predictions and actuals in a dataframe.
    # Saving the dataframe as predictions.csv
    print('Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe')
    for epoch in range(config.VAL_EPOCHS):
      if torch.cuda.is_available():
        predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
        final_df = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})
        #final_df = pd.DataFrame({'Generated Text':to_summarize,'Actual Text':to_summarize})
        final_df.to_csv('./models/predictions.csv')
        print('Output Files generated for review')
    
    model.save_pretrained("model")
if __name__ == '__main__':
    main()

wandb: Currently logged in as: claudelkros (use `wandb login --relogin` to force relogin)


                                                text                                            summary
0  summary: Jean-Jacques Schuhl, Gilles Leroy, Ch...  Jean-Jacques Schuhl, Gilles Leroy, Christian G...
1  summary: Une semaine après l'attaque terrorist...  Cette demande intervient une semaine après l'a...
2  summary: Un juge américain a rejeté, jeudi 31 ...  Un juge américain a rejeté jeudi les accusatio...
3  summary: Un attentat a fait au moins 93 morts ...  Un kamikaze a fait exploser sa voiture piégée ...
4  summary: Cinq personnes sont mortes, et treize...  Cinq personnes sont mortes, et treize autres o...
FULL Dataset: (392902, 2)
TRAIN Dataset: (314322, 2)
TEST Dataset: (78580, 2)
Initiating Fine-Tuning for the model on our dataset


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Epoch: 0, Loss:  5.971557140350342
Epoch: 0, Loss:  1.138223648071289
Epoch: 0, Loss:  1.7282518148422241
Epoch: 0, Loss:  1.018807291984558
Epoch: 0, Loss:  2.9447686672210693
Epoch: 0, Loss:  2.448552131652832
Epoch: 0, Loss:  2.4719560146331787
Epoch: 0, Loss:  2.861159563064575
Epoch: 0, Loss:  2.2647149562835693
Epoch: 0, Loss:  0.4114845395088196
Epoch: 0, Loss:  1.314096212387085
Epoch: 0, Loss:  2.444927930831909
Epoch: 0, Loss:  1.1524314880371094
Epoch: 0, Loss:  1.794533610343933
Epoch: 0, Loss:  2.7826409339904785
Epoch: 0, Loss:  1.7409192323684692
Epoch: 0, Loss:  1.7475813627243042
Epoch: 0, Loss:  2.432558298110962
Epoch: 0, Loss:  1.5646462440490723
Epoch: 0, Loss:  1.9788837432861328
Epoch: 0, Loss:  2.0601084232330322
Epoch: 0, Loss:  0.9599894285202026
Epoch: 0, Loss:  1.2457525730133057
Epoch: 0, Loss:  2.4315831661224365
Epoch: 0, Loss:  0.5035943388938904
Epoch: 0, Loss:  2.127164125442505
Epoch: 0, Loss:  0.9449363946914673
Epoch: 0, Loss:  2.34767484664917
Epoc

In [None]:
!pip install git+https://github.com/tagucci/pythonrouge.git

Collecting git+https://github.com/tagucci/pythonrouge.git
  Cloning https://github.com/tagucci/pythonrouge.git to /tmp/pip-req-build-qi0ssgbu
  Running command git clone -q https://github.com/tagucci/pythonrouge.git /tmp/pip-req-build-qi0ssgbu
Building wheels for collected packages: pythonrouge
  Building wheel for pythonrouge (setup.py) ... done
  Created wheel for pythonrouge: filename=pythonrouge-0.2-cp36-none-any.whl size=285402 sha256=c6930e3b2a0ace0ee2ee5487292c7044e0313a56a3f2e7745a428b11f146254c
  Stored in directory: /tmp/pip-ephem-wheel-cache-_tu4o68p/wheels/fd/ff/be/6716935d513fa8656ab185cb0aa70aed382b72dda42bf09c95
Successfully built pythonrouge
Installing collected packages: pythonrouge
Successfully installed pythonrouge-0.2


In [None]:
from pythonrouge.pythonrouge import Pythonrouge

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('models/predictions.csv')

In [None]:
df

Unnamed: 0.1,Unnamed: 0,Generated Text,Actual Text
0,0,Les réseaux de neurones sont généralement opti...,Si vous disposez d'ouvrages ou d'articles de r...
1,1,"Cette année, les spécialistes des sciences cog...",modifier - modifier le code - modifier Wikidat...
2,2,cognitives étudient la cognition de divers poi...,Sur les autres projets Wikimedia :Le mot cogni...
3,3,Cette théorie de localisationniste des fonctio...,"Pour améliorer la vérifiabilité de l'article, ..."
4,4,l'apparition de l'écriture distingue la Préhis...,L’écriture est un moyen de communication qui r...


In [None]:
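# Containers for the reference summaries and the generated summaries.
# `hyp` is used instead of `sum` to avoid shadowing Python's built-in sum().
ref = []
hyp = []

In [None]:
hyp = df['Generated Text']

In [None]:
ref = df['Actual Text']

In [None]:
ref = ref.to_numpy()

In [None]:
hyp = hyp.to_numpy()

In [None]:
# Rouge comes from the `rouge` package (pip install rouge).
from rouge import Rouge
rouge = Rouge()

In [None]:
def rouge_metrics(references, hypotheses):
  # Score every (reference, generated summary) pair.
  # Rouge.get_scores expects the hypothesis first, then the reference;
  # swapping them swaps precision and recall.
  result = []
  for reference, hypothesis in zip(references, hypotheses):
    result.append(rouge.get_scores(hypothesis, reference))
  return result

In [None]:
val = rouge_metrics(ref, hyp)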

In [None]:
val

[[{'rouge-1': {'f': 0.19999999501250013,
    'p': 0.21052631578947367,
    'r': 0.19047619047619047},
   'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
   'rouge-l': {'f': 0.11650484939202586,
    'p': 0.125,
    'r': 0.10909090909090909}}],
 [{'rouge-1': {'f': 0.20952380589569167, 'p': 0.1375, 'r': 0.44},
   'rouge-2': {'f': 0.038834947881987325,
    'p': 0.02531645569620253,
    'r': 0.08333333333333333},
   'rouge-l': {'f': 0.09999999612812516,
    'p': 0.06779661016949153,
    'r': 0.19047619047619047}}],
 [{'rouge-1': {'f': 0.0, 'p': 0.0, 'r': 0.0},
   'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
   'rouge-l': {'f': 0.0, 'p': 0.0, 'r': 0.0}}],
 [{'rouge-1': {'f': 0.15909090626291325,
    'p': 0.0958904109589041,
    'r': 0.4666666666666667},
   'rouge-2': {'f': 0.02325581122769096,
    'p': 0.013888888888888888,
    'r': 0.07142857142857142},
   'rouge-l': {'f': 0.10810810521183355,
    'p': 0.06557377049180328,
    'r': 0.3076923076923077}}],
 [{'rouge-1': {'f': 0.19047618598009586,
  

In [None]:
def metrics(x):
  # x is a one-element list holding the score dictionary of one pair.
  # Returns [f-score, precision, recall] for ROUGE-1.
  rouge_1 = x[0]['rouge-1']
  return [rouge_1['f'], rouge_1['p'], rouge_1['r']]

In [None]:
metrics(val[0])

[0.19999999501250013, 0.21052631578947367, 0.19047619047619047]

In [None]:
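def loop(x):
  # Collect the ROUGE-1 [f, p, r] triple of every scored pair.
  items = []
  for item in x:
    items.append(metrics(item))
  # Return after the loop so that all pairs are included, not only the first.
  return items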

In [None]:
loop(val)

[0.19999999501250013, 0.21052631578947367, 0.19047619047619047]
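
The per-pair results can also be condensed into a single average per measure. The following is a minimal sketch, assuming the results keep the nested shape shown for `val` (a list of one-element lists of score dictionaries). The helper name `average_rouge` and the `sample` data are hypothetical and only serve as illustration; they are not part of the `rouge` package.

```python
def average_rouge(results, metric="rouge-1"):
    """Average f, p and r of one ROUGE variant over all scored pairs.

    `results` is assumed to be a list whose items are one-element lists
    of score dictionaries, as returned by rouge_metrics().
    """
    totals = {"f": 0.0, "p": 0.0, "r": 0.0}
    for item in results:
        scores = item[0][metric]
        for key in totals:
            totals[key] += scores[key]
    count = len(results)
    if count == 0:
        # Nothing to average: return zeros instead of dividing by zero.
        return totals
    return {key: value / count for key, value in totals.items()}


# Hypothetical sample in the same shape as `val`, for illustration only.
sample = [
    [{"rouge-1": {"f": 0.2, "p": 0.25, "r": 0.5}}],
    [{"rouge-1": {"f": 0.4, "p": 0.75, "r": 0.5}}],
]
print(average_rouge(sample))
```

Calling `average_rouge(val, metric="rouge-l")` would give the same summary for ROUGE-L. The `rouge` package README also describes an `avg=True` argument to `get_scores`, which may be the simpler route when all hypotheses and references are passed as two lists.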

In [None]:
!pip install rouge

Collecting rouge
  Downloading https://files.pythonhosted.org/packages/43/cc/e18e33be20971ff73a056ebdb023476b5a545e744e3fc22acd8c758f1e0d/rouge-1.0.0-py3-none-any.whl
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [None]:
from rouge import Rouge 

hypothesis = "the #### transcript is a written version of each day 's cnn student news program use this transcript to he    lp students with reading comprehension and vocabulary use the weekly newsquiz to test your knowledge of storie s you     saw on cnn student news"

reference = "this page includes the show transcript use the transcript to help students with reading comprehension and     vocabulary at the bottom of the page , comment for a chance to be mentioned on cnn student news . you must be a teac    her or a student age # # or older to request a mention on the cnn student news roll call . the weekly newsquiz tests     students ' knowledge of even ts in the news"

rouge = Rouge()
scores = rouge.get_scores(hypothesis, reference)

In [None]:
scores

[{'rouge-1': {'f': 0.4786324739396596,
   'p': 0.6363636363636364,
   'r': 0.3835616438356164},
  'rouge-2': {'f': 0.2608695605353498,
   'p': 0.3488372093023256,
   'r': 0.20833333333333334},
  'rouge-l': {'f': 0.44705881864636676,
   'p': 0.5277777777777778,
   'r': 0.3877551020408163}}]
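
To make the `p`, `r` and `f` values easier to interpret, here is a minimal sketch of ROUGE-1 computed by hand as clipped unigram overlap on whitespace tokens. The function name `rouge_1_overlap` and the example sentences are hypothetical. The `rouge` package may tokenize and count differently, so its numbers need not match this sketch exactly.

```python
from collections import Counter


def rouge_1_overlap(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 on whitespace tokens."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    shared = Counter(hyp_tokens) & Counter(ref_tokens)
    overlap = 0
    for count in shared.values():
        overlap += count
    # Precision: share of generated tokens that also occur in the reference.
    precision = overlap / max(len(hyp_tokens), 1)
    # Recall: share of reference tokens that the generated text recovers.
    recall = overlap / max(len(ref_tokens), 1)
    if precision + recall == 0:
        return {"p": 0.0, "r": 0.0, "f": 0.0}
    f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}


# Illustrative sentences, not taken from the dataset.
print(rouge_1_overlap("le chat dort sur le tapis", "le chat dort"))
```

A long generated summary that contains every reference word gets a high recall but a low precision, which is why the F-score is usually the value reported.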

In [None]:
from transformers import T5Model, BertModel

In [None]:
model = BertModel.from_pretrained('best_model')

Some weights of the model checkpoint at best_model were not used when initializing BertModel: ['shared.weight', 'decoder.embed_tokens.weight', 'decoder.block.0.layer.0.SelfAttention.q.weight', 'decoder.block.0.layer.0.SelfAttention.k.weight', 'decoder.block.0.layer.0.SelfAttention.v.weight', 'decoder.block.0.layer.0.SelfAttention.o.weight', 'decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight', 'decoder.block.0.layer.0.layer_norm.weight', 'decoder.block.0.layer.1.EncDecAttention.q.weight', 'decoder.block.0.layer.1.EncDecAttention.k.weight', 'decoder.block.0.layer.1.EncDecAttention.v.weight', 'decoder.block.0.layer.1.EncDecAttention.o.weight', 'decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight', 'decoder.block.0.layer.1.layer_norm.weight', 'decoder.block.0.layer.2.DenseReluDense.wi.weight', 'decoder.block.0.layer.2.DenseReluDense.wo.weight', 'decoder.block.0.layer.2.layer_norm.weight', 'decoder.block.1.layer.0.SelfAttention.q.weight', 'decoder.block

In [None]:
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'


In [None]:
from IPython.display import display, Markdown


In [None]:
from transformers import AutoModel, T5Model, T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained('model')


In [None]:
to_summarize = """
Wikipédia est un projet d’encyclopédie collective en ligne, universelle, multilingue et fonctionnant sur le principe du wiki. Ce projet vise à offrir un contenu librement réutilisable, objectif et vérifiable, que chacun peut modifier et améliorer.
Wikipédia est définie par des principes fondateurs. Son contenu est sous licence Creative Commons BY-SA. Il peut être copié et réutilisé sous la même licence, sous réserve d'en respecter les conditions. Wikipédia fournit tous ses contenus gratuitement, 
sans publicité, et sans recourir à l'exploitation des données personnelles de ses utilisateurs.
Les rédacteurs des articles de Wikipédia sont bénévoles. Ils coordonnent leurs efforts au sein d'une communauté collaborative, sans dirigeant.

"""

In [None]:
test = """
Apprentissage automatique,"Selon les informations disponibles durant la phase d'apprentissage, l'apprentissage est qualifié de différentes manières. Si les données sont étiquetées (c'est-à-dire que la réponse à la tâche est connue pour ces données), il s'agit d'un apprentissage supervisé. On parle de classification ou de classement[3] si les étiquettes sont discrètes, ou de régression si elles sont continues. Si le modèle est appris de manière incrémentale en fonction d'une récompense reçue par le programme pour chacune des actions entreprises, on parle d'apprentissage par renforcement. Dans le cas le plus général, sans étiquette, on cherche à déterminer la structure sous-jacente des données (qui peuvent être une densité de probabilité) et il s'agit alors d'apprentissage non supervisé. L'apprentissage automatique peut être appliqué à différents types de données, tels des graphes, des arbres, des courbes, ou plus simplement des vecteurs de caractéristiques, qui peuvent être continues ou discrètes.
 Depuis l'antiquité, le sujet des machines pensantes préoccupe les esprits. Ce concept est la base de pensées pour ce qui deviendra ensuite l'intelligence artificielle, ainsi qu'une de ses sous-branches : l'apprentissage automatique.
 La concrétisation de cette idée est principalement due à Alan Turing (mathématicien et cryptologue britannique) et à son concept de la « machine universelle » en 1936[4], qui est à la base des ordinateurs d'aujourd'hui. Il continuera à poser les bases de l'apprentissage automatique, avec son article sur « L'ordinateur et l'intelligence » en 1950[5], dans lequel il développe, entre autres, le test de Turing.
 En 1943, le neurophysiologiste Warren McCulloch et le mathématicien Walter Pitts publient un article décrivant le fonctionnement de neurones en les représentant à l'aide de circuits électriques. Cette représentation sera la base théorique des réseaux neuronaux[6].
 Arthur Samuel, informaticien américain pionnier dans le secteur de l'intelligence artificielle, est le premier à faire usage de l'expression machine learning (en français, « apprentissage automatique ») en 1959 à la suite de la création de son programme pour IBM en 1952. Le programme jouait au Jeu de Dames et s'améliorait en jouant. À terme, il parvint à battre le 4e meilleur joueur des États-Unis[7],[8].
 Une avancée majeure dans le secteur de l'intelligence machine est le succès de l'ordinateur développé par IBM, Deep Blue, qui est le premier à vaincre le champion mondial d'échecs Garry Kasparov en 1997. Le projet Deep Blue en inspirera nombre d'autres dans le cadre de l'intelligence artificielle, particulièrement un autre grand défi : IBM Watson, l'ordinateur dont le but est de gagner au jeu Jeopardy![9]. Ce but est atteint en 2011, quand Watson gagne à Jeopardy! en répondant aux questions par traitement de langage naturel[10].
 Durant les années suivantes, les applications de l'apprentissage automatique médiatisées se succèdent bien plus rapidement qu'auparavant.

"""

In [None]:
#collapse-show
#tokenizer = BartTokenizer.from_pretrained('bart-large-cnn')
from transformers import T5Tokenizer
#tokenizer = T5Tokenizer.from_pretrained("best_model/model")
tokenizer = T5Tokenizer.from_pretrained("t5-base")
article_input_ids = tokenizer.batch_encode_plus([test], return_tensors='pt', max_length=1024)['input_ids']
summary_ids = model.generate(article_input_ids,
                             num_beams=4,
                             length_penalty=2.0,
                             max_length=145,
                             min_length=56,
                             no_repeat_ngram_size=3)

summary_txt = tokenizer.decode(summary_ids.squeeze(), skip_special_tokens=True)
display(Markdown('> **Summary: **'+summary_txt))

Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


> **Summary: **l'apprentissage automatique, c'est-à-dire la classification ou la régression des données, qui s'agit d'un apprentissage non supervisé[3]. Le projet Deep Blue est le premier dans le domaine de la machine à apprendre à partir des données établies en 1952[6].

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
cp -r /content/model/ '/content/drive/My Drive/model/'

In [None]:
cp -r '/content/drive/My Drive/model/'  '/content/best_model/'