# Experimenting with T5 for Text Summarisation

Adapted from [Denis Rothman - Transformers for Natural Language Processing](https://github.com/Denis2054/Transformers-for-NLP-2nd-Edition/blob/main/Chapter08/Summerizing_Text_with_T5.ipynb)  

Trying the T5 large model first. Still need a way to score these models (BLEU, ROUGE, BERTSUM, etc.).  
Will it run locally on this machine? Yes, it does (see the venv).  
The GPUs might need setting up (not tested yet).  
Note: this uses the Anaconda environment `transformers` (or `text_sum_venv`).  
Packages added so far (a `requirements.txt` will be frozen once everything is working):  
`conda install -c conda-forge transformers`  
`conda install -c pytorch pytorch`  
`conda install -c conda-forge sentencepiece`  
`conda install -c conda-forge tensorflow` (not needed for this notebook, but required for other transformer experiments)
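
On the scoring question above: a minimal sketch of a ROUGE-1 style score (unigram overlap) in plain Python. This is a simplification, not the official ROUGE implementation: it assumes lowercased whitespace tokenization with no stemming, and the function name `rouge1` is hypothetical. It also needs a reference summary for each text, which this notebook does not have yet.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: 5 of the 6 tokens overlap in each direction.
scores = rouge1("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

For real comparisons an established ROUGE package would be the safer choice; the sketch only illustrates what the metric measures.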

In [1]:
import torch
# import json 
import glob
import tqdm
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config

In [2]:
# choose to display the model config and architecture
display_architecture=False

In [3]:
# load the model and tokenizer
model = T5ForConditionalGeneration.from_pretrained('t5-large')
tokenizer = T5Tokenizer.from_pretrained('t5-large')
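# use the GPU when one is available, otherwise fall back to the CPU
# (the CPU is probably enough for this example)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = torch.device('cuda')  # uncomment to force the GPU
# the model must be on the same device as the inputs passed to generate()
model = model.to(device)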

Downloading: 100%|██████████| 1.20k/1.20k [00:00<00:00, 123kB/s]
Downloading: 100%|██████████| 2.95G/2.95G [02:00<00:00, 24.6MB/s]
Downloading: 100%|██████████| 792k/792k [00:00<00:00, 1.35MB/s]
Downloading: 100%|██████████| 1.39M/1.39M [00:00<00:00, 2.16MB/s]


In [5]:
# print the model
if display_architecture:
    print(model)
    # note that the repeated blocks all share the same structure
    # print(model.encoder) or print(model.decoder) to see just those parts

In [6]:
if display_architecture:
    print(model.config)
    # 16 heads and 24 layers - note the summarization prefix params!
    # note that beam search is used for generation
    # a length penalty is applied to the beam scores based on output length
    # vocab_size is the size of the tokenizer vocabulary and can influence
    # the performance of the model; too large and it will be very sparse.
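
The summarization settings mentioned in the comments above can be read out of the config directly. A small sketch, assuming the config is first converted to a plain dict (for example with `model.config.to_dict()`); the helper name `summarization_defaults` is hypothetical, the values in `example_config` are illustrative only, and depending on the library version these settings may be stored elsewhere.

```python
def summarization_defaults(config: dict) -> dict:
    """Return the summarization-specific settings from a config dict, if any."""
    task_params = config.get("task_specific_params") or {}
    return dict(task_params.get("summarization", {}))


# Illustrative values only; in the notebook, pass model.config.to_dict() instead.
example_config = {
    "num_heads": 16,
    "num_layers": 24,
    "task_specific_params": {
        "summarization": {
            "num_beams": 4,
            "length_penalty": 2.0,
            "prefix": "summarize: ",
        }
    },
}

defaults = summarization_defaults(example_config)
print(defaults)
```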

In [4]:
# save the model 
model.save_pretrained("../src/models/t5-large")
# save the tokenizer
tokenizer.save_pretrained("../src/models/t5-large")


('../src/models/t5-large\\tokenizer_config.json',
 '../src/models/t5-large\\special_tokens_map.json',
 '../src/models/t5-large\\spiece.model',
 '../src/models/t5-large\\added_tokens.json')

In [7]:
def summarize(text, ml):
    """
    Summarize a text with the T5 model.

    Parameters
    ----------
    text (str): the text to summarize
    ml (int): the maximum length of the summary, in tokens

    Returns
    -------
    str: the generated summary
    """
    # NOTE: replacing "\n" with "" fuses the words on either side of a line
    # break (e.g. "of2001" in the test output); use " " to keep them apart
    preprocess_text = text.strip().replace("\n", "")
    # add the task prefix that T5 expects for summarization
    t5_prepared_text = f"summarize: {preprocess_text}"
    # eyeball the result of preprocessing
    # print("Preprocessed and prepared text: \n", t5_prepared_text)
    # encode the text
    tokenized_text = tokenizer.encode(t5_prepared_text,
                                      return_tensors="pt",
                                      # inputs longer than 512 tokens
                                      # are truncated
                                      truncation=True).to(device)
    # submit the text to the model and adjust the parameters
    summary_ids = model.generate(tokenized_text,
                                 num_beams=4,
                                 no_repeat_ngram_size=2,
                                 min_length=30,
                                 max_length=ml,
                                 early_stopping=True)
    # decode the ids to text
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
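
The preprocessing step in `summarize` removes newlines without inserting a space, which is why the summaries contain fused words such as "of2001". A minimal sketch of an alternative that collapses every run of whitespace into a single space; the helper name `normalize_whitespace` is hypothetical and not part of the original notebook.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


sample = "by the end of\n2001 should take  about 1-2%"
cleaned = normalize_whitespace(sample)
print(cleaned)
```

Using this in place of `text.strip().replace("\n", "")` would keep the words on either side of a line break apart; the summaries would likely change as a result.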


In [8]:
# small test
text = """ The United States Declaration of Independence was the first Etext
released by Project Gutenberg, early in 1971.  The title was stored
in an emailed instruction set which required a tape or diskpack be
hand mounted for retrieval.  The diskpack was the size of a large
cake in a cake carrier, cost $1500, and contained 5 megabytes, of
which this file took 1-2%.  Two tape backups were kept plus one on
paper tape.  The 10,000 files we hope to have online by the end of
2001 should take about 1-2% of a comparably priced drive in 2001.
"""
print("Number of characters:", len(text))
summary = summarize(text, 50)
print("\nSummarized text: \n", summary)


Number of characters: 534



Summarized text: 
 the united states declaration of independence was the first etext published by project gutenberg, early in 1971. the 10,000 files we hope to have online by the end of2001 should take about 1-2% of a comparably priced drive in


In [9]:
# get a list of all the txt files in the directory
txt_files = glob.glob("../text_data/*.txt")

In [10]:
# loop through the files and summarize them
import os

for file in tqdm.tqdm(txt_files):
    with open(file, 'r') as f:
        text = f.read()
    # file name without directory or extension (works on any OS)
    name = os.path.splitext(os.path.basename(file))[0]
    print("\n", name,
          # get the number of words in the text
          " which has ", len(text.split()), " words",
          "\nSummarized text: \n",
          summarize(text, 150))

 10%|█         | 1/10 [01:31<13:41, 91.26s/it]


 Bill of rights  which has  498  words 
Summarized text: 
 the right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated. no person shall be held to answer for a capital, or otherwise infamous crime, unless ona presentment or indictment of s. grand juries, except in cases arisingin the land or naval forces,or in the Militia, when in actual service in time of war or public danger ; and no one shallbe compelled in any criminal case


 20%|██        | 2/10 [02:50<11:15, 84.38s/it]


 Consitution of the United States of America  which has  4541  words 
Summarized text: 
 all legislative Powers herein granted shall be vested in a Congress of the United States, which shall consist ofa Senate and House of Representatives.


 30%|███       | 3/10 [04:27<10:29, 89.97s/it]


 Declaration of independence  which has  1349  words 
Summarized text: 
 the thirteen united states ofamerica signed the declaration of independence on July 4, 1776. 'all men are created equal, that they are endowed by their Creator with certain unalienable Rights,' says john b. davis, jr. "that to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed.' the present King of greatbritain is a history of repeatedinjuries and usurpations, all having in direct


 40%|████      | 4/10 [05:39<08:18, 83.07s/it]


 Give Me Liberty Or Give Me Death  which has  1223  words 
Summarized text: 
 the questing before the House is one of awful moment to this country. for my part, i consider it as nothing less than a questionof freedom or slavery - and in proportion to the magnitude of the subjectought to be the freedom ofthe debate." 'i wish to know what there has been in the conductof the British ministry for the last ten years'


 50%|█████     | 5/10 [07:03<06:56, 83.33s/it]


 JFKs Inaugural Address  which has  1423  words 
Summarized text: 
 the torch has been passed to a new generation of americans. let every nation know that we shall pay any price, bear any burden, meet any hardship, to assure the survival and the success of liberty.


 60%|██████    | 6/10 [08:10<05:10, 77.67s/it]


 Lincolns first address  which has  3626  words 
Summarized text: 
 oath taken by president "before he enters on the execution of his office" lincoln: "there has never been any reasonable cause for such apprehension" "I have no purpose, directly or indirectly, to interfere with the institution of slavery"


 70%|███████   | 7/10 [09:36<04:00, 80.27s/it]


 Lincolns Gettysburg Address  which has  298  words 
Summarized text: 
 the world will little note, nor long remember,what we say here. but itcan never forget what they did here, far above ourpoor power to add ordetract. the brave men, living and dead,who struggled here have consecrated it,far beyond our poor powerto addordetect this ground. it is for us the living, rather, to be here dedicated to the great task remainingfor which they gave the last full measure of devotion. we here highly resolve that these dead shall not have died in va


 80%|████████  | 8/10 [11:06<02:47, 83.55s/it]


 Lincolns Second Inaugural Address-AWynne  which has  703  words 
Summarized text: 
 president obama delivered his second inaugural address on march 4, 1865. bob greene says the first address was devoted to averting civil war, but the second was to save the union without war - he says america is still in the midst of the greatcontest which still engrosses the energiesof the nation.


 90%|█████████ | 9/10 [12:03<01:15, 75.13s/it]


 Lincolns Second Inaugural Address  which has  0  words 
Summarized text: 
 april 1 marks the first day of the new year. the u.s. president's eu summit will be held in london on wednesday thursday, june 1 - if you're in the mood for relaxation, then the day is off – and the sun is out!


100%|██████████| 10/10 [12:57<00:00, 77.71s/it]


 Mayflower compact  which has  303  words 
Summarized text: 
 the mayflower compact was signed on the 11th of November, 1620. it was written by the Loyal Subjects of our dread sovereign Lord, King James, of Great Britaine, France, and Ireland,King, Defender of the Faith, &c.
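
Two things stand out in the results above: the file with 0 words still produces a (made-up) summary, and the long documents are summarized from their opening only, because the input is truncated. A minimal sketch of one way to handle both, assuming a simple word-based chunking scheme; `chunk_words`, `summarize_long`, and the 300-word chunk size are hypothetical choices, and word count is only a rough proxy for the token limit.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 300) -> List[str]:
    """Split a text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def summarize_long(text: str,
                   summarize_fn: Callable[[str], str],
                   max_words: int = 300) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    chunks = chunk_words(text, max_words)
    if not chunks:
        # Empty input: return nothing rather than letting the model invent text.
        return ""
    return " ".join(summarize_fn(chunk) for chunk in chunks)


def first_three_words(chunk: str) -> str:
    """Stand-in summarizer so the sketch runs without the model."""
    return " ".join(chunk.split()[:3])


result = summarize_long("one two three four five six seven",
                        first_three_words, max_words=4)
print(result)
```

In the notebook, the stand-in could be swapped for the real model with something like `summarize_long(text, lambda chunk: summarize(chunk, 150))`. Joining chunk summaries is a crude approach; the combined text may itself need a second summarization pass.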



