# Experimenting with T5 for Text Summarisation  

Adapted from [Denis Rothman - Transformers for Natural Language Processing](https://github.com/Denis2054/Transformers-for-NLP-2nd-Edition/blob/main/Chapter08/Summerizing_Text_with_T5.ipynb)  

Trying the T5 large model first. Still need a way to score these models (BLEU, ROUGE, BERTSUM, etc.).  
Will it run locally on this machine? Yes, it does (see venv).  
Might need to set up the GPUs (not tested yet).  
Note: using the Anaconda environment 'transformers' (or text_sum_venv).  
Packages added so far (will freeze a requirements.txt when everything is working):  
`conda install -c conda-forge transformers`  
`conda install -c pytorch pytorch`   
`conda install -c conda-forge sentencepiece`  
`conda install -c conda-forge tensorflow` (not needed for this notebook, but will be needed for other transformer experiments).
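
The notes above mention needing a way to score summaries. As a rough starting point, here is a minimal sketch of a simplified ROUGE-1 score (unigram overlap between a reference summary and a generated one). The function name `rouge_1` and the whitespace tokenisation are assumptions made for illustration. This is not the official ROUGE implementation, which also handles stemming and other n-gram sizes.

```python
from collections import Counter


def rouge_1(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-1: unigram precision, recall and F1."""
    # lower-case and split on whitespace (an assumed, very simple tokeniser)
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # clipped overlap: a word counts at most as often as it occurs in both texts
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Five of the six words overlap in this toy pair, so precision, recall and F1 all come out at 5/6. Scoring real summaries this way needs human-written reference summaries, which this notebook does not have yet.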

In [1]:
import torch
# import json 
import glob
import tqdm
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config



In [2]:
# choose to display the model config and architecture
display_architecture=False

In [3]:
# load the model and tokenizer
model = T5ForConditionalGeneration.from_pretrained('t5-large')
tokenizer = T5Tokenizer.from_pretrained('t5-large')
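# use the GPU if one is available, otherwise fall back to the CPU,
# which is probably enough for this example
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# to force the GPU instead: device = torch.device('cuda')
# the model must sit on the same device as the inputs built in summarize()
model = model.to(device)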

Downloading: 100%|██████████| 1.21k/1.21k [00:00<00:00, 1.26MB/s]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Downloading: 100%|██████████| 2.95G/2.95G [00:32<00:00, 91.8MB/s]
Downloading: 100%|██████████| 792k/792k [00:00<00:00, 1.61MB/s]
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-large automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [4]:
# print the model
if display_architecture:
    print(model)
    # note that all the repeated blocks are identical
    # print model.encoder, model.decoder or model.forward to see just those parts

In [5]:
if display_architecture:
    print(model.config)
    # 16 heads and 24 layers - note the summarization prefix params!
    # note that beam search is being used
    # there is a length penalty for longer sentences
    # vocab size is the size of the tokenizer vocab and can influence
    # the performance of the model: too large and it will be very sparse.

In [6]:
# save the model
model.save_pretrained("../src/models/t5-large")
# save the tokenizer
tokenizer.save_pretrained("../src/models/t5-large")


('../src/models/t5-large\\tokenizer_config.json',
 '../src/models/t5-large\\special_tokens_map.json',
 '../src/models/t5-large\\spiece.model',
 '../src/models/t5-large\\added_tokens.json')

In [7]:
def summarize(text, ml):
    """
    Summarize a text with T5.

    Parameters
    ----------
    text : str
        The text to summarize.
    ml : int
        The maximum length of the summary, in tokens.

    Returns
    -------
    str
        The summary.
    """
    # note: dropping newlines without inserting a space fuses the words
    # on either side of a line break
    preprocess_text = text.strip().replace("\n", "")
    # add the task prefix to the text
    t5_prepared_Text = f"summarize: {preprocess_text}"
    # eyeball the result of preprocessing
    # print("Preprocessed and prepared text: \n", t5_prepared_Text)
    # encode the text
    tokenized_text = tokenizer.encode(t5_prepared_Text,
                                      return_tensors="pt",
                                      # some inputs are longer than 512 tokens,
                                      # so truncate to the model's maximum length
                                      truncation=True).to(device)
    # submit the text to the model and adjust the parameters
    summary_ids = model.generate(tokenized_text,
                                 num_beams=4,
                                 no_repeat_ngram_size=2,
                                 min_length=30,
                                 max_length=ml,
                                 early_stopping=True)
    # decode the ids to text
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
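
The `no_repeat_ngram_size=2` argument above asks generation never to produce the same 2-gram twice. To make that constraint concrete, here is a small sketch that checks whether a token sequence contains a repeated n-gram. The helper `has_repeated_ngram` is hypothetical and only tests the property on words. It is not how the library enforces the constraint, which works on token ids during decoding.

```python
def has_repeated_ngram(tokens, n=2):
    """Return True if any n-gram occurs more than once in tokens."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False


# "the cat" appears twice, so this sequence would violate the constraint
print(has_repeated_ngram("the cat and the cat".split(), n=2))
# no 2-gram repeats here
print(has_repeated_ngram("the cat and the dog".split(), n=2))
```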


In [8]:
# small test
text = """ The United States Declaration of Independence was the first Etext
released by Project Gutenberg, early in 1971.  The title was stored
in an emailed instruction set which required a tape or diskpack be
hand mounted for retrieval.  The diskpack was the size of a large
cake in a cake carrier, cost $1500, and contained 5 megabytes, of
which this file took 1-2%.  Two tape backups were kept plus one on
paper tape.  The 10,000 files we hope to have online by the end of
2001 should take about 1-2% of a comparably priced drive in 2001.
"""
print("Number of characters:", len(text))
summary = summarize(text, 50)
print("\nSummarized text: \n", summary)


Number of characters: 534

Summarized text: 
 the united states declaration of independence was the first etext published by project gutenberg, early in 1971. the 10,000 files we hope to have online by the end of2001 should take about 1-2% of a comparably priced drive in
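
The fused "of2001" in the summary above most likely comes from the preprocessing step in `summarize`, where `replace("\n", "")` removes each line break without leaving a space behind. A minimal sketch of an alternative that collapses all whitespace to single spaces follows. The helper name `normalize_whitespace` is an assumption, and the recorded outputs in this notebook were not regenerated with it.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return " ".join(text.split())


sample = "by the end of\n2001 should take"
# what the current preprocessing does: words on either side of "\n" are fused
fused = sample.strip().replace("\n", "")
# what the sketch does: the line break becomes a single space
spaced = normalize_whitespace(sample)
print(fused)
print(spaced)
```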


In [9]:
# get a list of all the txt files in the directory
txt_files = glob.glob("../text_data/*.txt")

In [10]:
# loop through the files and summarize them
import os

for file in tqdm.tqdm(txt_files):
    with open(file, 'r') as f:
        text = f.read()
    # file name without directory or extension, on any OS
    name = os.path.splitext(os.path.basename(file))[0]
    print("\n", name,
          # get the number of words in the text
          " which has ", len(text.split()), " words",
          "\nSummarized text: \n",
          summarize(text, 150))

0it [00:00, ?it/s]
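
Whole files are likely to exceed the 512-token limit, in which case `truncation=True` silently drops everything past the limit. One possible workaround, sketched here under stated assumptions, is to split a long text into chunks before summarizing. The helper `chunk_words` is hypothetical. It counts whitespace-separated words as a rough proxy for tokens, and the default of 350 words is an assumed safety margin, not a measured one.

```python
def chunk_words(text, max_words=350):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


chunks = chunk_words("one two three four five", max_words=2)
print(chunks)
```

Each chunk could then be passed to `summarize` and the partial summaries joined, though that has not been tried in this notebook.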


In [12]:
# small test again, with a shorter max length
text = """ The United States Declaration of Independence was the first Etext
released by Project Gutenberg, early in 1971.  The title was stored
in an emailed instruction set which required a tape or diskpack be
hand mounted for retrieval.  The diskpack was the size of a large
cake in a cake carrier, cost $1500, and contained 5 megabytes, of
which this file took 1-2%.  Two tape backups were kept plus one on
paper tape.  The 10,000 files we hope to have online by the end of
2001 should take about 1-2% of a comparably priced drive in 2001.
"""
print("Number of characters:", len(text))
summary = summarize(text, 30)
print("\nSummarized text: \n", summary)

Number of characters: 534

Summarized text: 
 the u.s. declaration of independence was the first etext published by project gutenberg. the 10,000 files we hope to have
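
Both test summaries stop mid-sentence because generation hits the maximum length. A simple post-processing sketch, which is not part of the original notebook, is to cut the summary back to its last sentence-ending punctuation mark. The helper `trim_to_last_sentence` is hypothetical, and it will cut in the wrong place if the last full stop belongs to an abbreviation.

```python
def trim_to_last_sentence(summary: str) -> str:
    """Drop any trailing partial sentence; return the text unchanged if
    it contains no sentence-ending punctuation at all."""
    end = max(summary.rfind(ch) for ch in ".!?")
    return summary if end == -1 else summary[:end + 1]


raw = ("the u.s. declaration of independence was the first etext "
       "published by project gutenberg. the 10,000 files we hope to have")
trimmed = trim_to_last_sentence(raw)
print(trimmed)
```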
