<a href="https://colab.research.google.com/github/ahmedazaz32/Text-Summarization-with-BART-T5/blob/main/Text_Summarization_with_BART_%26_T5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers --upgrade

In [None]:
!pip install tensorflow_probability==0.12.2


In [4]:
from transformers import pipeline

In [5]:
import requests
import pprint
import time
pp = pprint.PrettyPrinter(indent=14)

In [None]:
## documentation for summarizer: https://huggingface.co/transformers/main_classes/pipelines.html#summarizationpipeline
# summarize with BART
summarizer_bart = pipeline(task='summarization', model="facebook/bart-large-cnn")
# summarize with T5
summarizer_t5 = pipeline(task='summarization', model="t5-large") # options: 't5-small', 't5-base', 't5-large', 't5-3b', 't5-11b'
# For T5 you can choose the size of the model. Everything above t5-base is very slow, even on GPU or TPU.

In [8]:
## download book
book_raw = requests.get("http://www.gutenberg.org/cache/epub/61/pg61.txt")
communist_manifesto = book_raw.text
# cleaning: keep only the text after the edition note, up to and including the closing line
delimiter = "[From the English edition of 1888, edited by Friedrich Engels]"
communist_manifesto_cl = communist_manifesto.split(delimiter, 1)[1]
delimiter2 = "WORKING MEN OF ALL COUNTRIES, UNITE!"
communist_manifesto_cl = communist_manifesto_cl.split(delimiter2, 1)[0] + delimiter2
#print(communist_manifesto_cl)
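
The cleaning step above indexes the result of `split(...)` directly, so a changed or missing delimiter surfaces as a bare `IndexError`. A minimal sketch of a stricter variant follows; `extract_between` is a hypothetical helper, not part of this notebook, and unlike the cell above it keeps both markers in the result.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text from start_marker through end_marker (both included).

    Hypothetical helper: raises a descriptive ValueError when a marker is
    missing, instead of the IndexError produced by split(...)[1].
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    # Look for the end marker only after the start marker.
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return text[start:end + len(end_marker)]


# Toy input standing in for a downloaded book.
sample = "license header PART ONE It was a bright cold day. THE END footer"
print(extract_between(sample, "PART ONE", "THE END"))
```

A related check on the download itself is calling `book_raw.raise_for_status()` before reading `book_raw.text`, so that an HTTP error page is not cleaned and summarized by mistake.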

In [10]:
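## summarize book with BART model
# truncation=True cuts the input to the model's maximum input length, so only the beginning of the book is summarized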
t0 = time.time() # timer
summary_manifesto_bart = summarizer_bart(communist_manifesto_cl, truncation=True, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.")

Summarization took 1.1 minutes.
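
The same three timing lines recur around every summarizer call in this notebook. One possible way to factor them out is a small context manager; `timed` is a hypothetical name, and `time.sleep` only stands in for a summarizer call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took, in minutes (hypothetical helper)."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        minutes = (time.perf_counter() - t0) / 60
        print(f"{label} took {minutes:.2f} minutes.")


with timed("Summarization"):
    time.sleep(0.01)  # stand-in for a call such as summarizer_bart(...)
```

The `finally` clause means the elapsed time is still printed if the enclosed call raises.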


In [12]:
pp.pprint(summary_manifesto_bart[0]['summary_text'])

('A spectre is haunting Europe—the spectre of Communism. All the Powers have '
 'entered into a holy alliance to exorcise this. Communism is already '
 'acknowledged by all European Powers to be a Power. It is high time that '
 'Communists should openly, in the face of the. whole world, publish their '
 'views, their aims, their tendencies, and.meet this nursery tale of the '
 'Spectre of Communism with a Manifesto of the party itself. To this end, '
 'Communists of various nationalities have assembled in London, and sketched '
 'the following Manifesto, to be published in the. English, French, German, '
 'Italian, Flemish and Danish languages. The modern bourgeois society that has '
 'sprouted from the ruins of feudal society has not done away with class '
 'antagonisms. It has but firmly established new classes, new conditions of '
 'oppression, new forms of struggle in place of the old ones.')
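
The summary above stays within the opening pages of the Manifesto, which is what `truncation=True` implies: text beyond the model's input limit is dropped before summarization. A common workaround is to split the text into chunks and summarize each one. The sketch below assumes word counts are an acceptable rough proxy for token counts; `summarize_fn`, `max_words=400`, and the stand-in summarizer are all illustrative assumptions.

```python
def split_into_chunks(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_in_chunks(text, summarize_fn, max_words=400):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    chunks = split_into_chunks(text, max_words)
    return " ".join(summarize_fn(chunk) for chunk in chunks)


# Stand-in summarizer for illustration: keeps the first three words of a chunk.
def first_three_words(chunk):
    return " ".join(chunk.split()[:3])


demo_text = " ".join(f"w{i}" for i in range(10))
print(split_into_chunks(demo_text, max_words=4))
print(summarize_in_chunks(demo_text, first_three_words, max_words=4))
```

With the pipelines defined earlier, `summarize_fn` could be something like `lambda chunk: summarizer_bart(chunk, truncation=True)[0]['summary_text']`, at the cost of one model call per chunk.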


In [13]:
## summarize book with T5 model
t0 = time.time() # timer
summary_manifesto_t5 = summarizer_t5(communist_manifesto_cl, truncation=True, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Summarization took 1.67 minutes.


In [14]:
pp.pprint(summary_manifesto_t5[0]['summary_text'])

('all the Powers of old Europe have entered into a holy alliance to exorcise '
 'the spectre of communism . a Manifesto of the communist party will be '
 'published in the english, french, german, italian, flanders and danes '
 'languages . the aims and tendencies of communists should be published '
 'openly, in the face of the whole world, says david bourgeois . "the '
 'communist movement is not a party, it is a movement a .- a- n aa-a na en a. '
 '- .a a, a "a')


In [16]:
## download book
book_raw = requests.get("http://gutenberg.net.au/ebooks01/0100021.txt")
orwell_1984 = book_raw.text
# cleaning
delimiter = 'PART ONE'
orwell_1984_cl = delimiter + orwell_1984.split(delimiter, 1)[1]
delimiter2 = "THE END"
orwell_1984_cl = orwell_1984_cl.split(delimiter2, 1)[0] + delimiter2
#print(orwell_1984_cl)

In [17]:
## summarize book with BART model
t0 = time.time() # timer
summary_orwell_bart = summarizer_bart(orwell_1984_cl, truncation=True, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Summarization took 1.03 minutes.


In [18]:
pp.pprint(summary_orwell_bart[0]['summary_text'])

('Winston Smith lived in a flat seven flights up from the Ministry of Truth in '
 'central London. He kept his back to the telescreen, but it was impossible to '
 'turn it off completely. Every sound he made would be picked up by the '
 'Thought Police, who watched him all the time. He had to live with the fear '
 'that he would be seen and heard by the police. He decided to write a book '
 'about his experiences. He began with the first chapter of his book, which he '
 "called 'Hate Week' The book was published by Simon & Schuster, and is "
 'available in paperback and hardback. For more information on the book, visit '
 'www.simonandschuster.co.uk/Hate-Week-Volume-1-2.')
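
The last sentences of the summary above name a publisher and a web address, which look like generated text rather than content taken from the novel. A minimal sanity check, sketched under the assumption that any URL in a summary should also appear in the source, is to list the URLs that do not; `unsupported_urls` and the pattern are hypothetical, and the strings are toy examples.

```python
import re

# Rough pattern for things that look like web addresses.
URL_PATTERN = re.compile(r"(?:https?:/+|www\.)\S+")


def unsupported_urls(summary, source):
    """Return URLs that occur in the summary but nowhere in the source text."""
    found = []
    for match in URL_PATTERN.finditer(summary):
        url = match.group().rstrip(".,;)")  # drop trailing punctuation
        if url not in source:
            found.append(url)
    return found


source = "Winston kept his back turned to the telescreen."
summary = "Winston kept his back to the telescreen. For more, visit www.example.com/book."
print(unsupported_urls(summary, source))
```

This only catches one visible symptom; invented names or dates would need a different check.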


In [19]:
## summarize book with T5 model
t0 = time.time() # timer
summary_orwell_t5 = summarizer_t5(orwell_1984_cl, truncation=True, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Summarization took 1.62 minutes.


In [20]:
pp.pprint(summary_orwell_t5[0]['summary_text'])

('big brother is watching you, the caption beneath the poster reads . inside '
 'the flat a fruity voice is reading out a list of figures which had something '
 'to do with the production of pig-iron . the telescreen can be dimmed, but '
 'there is no way of shutting it off completely . he moves over to the window: '
 'a smallish, frail figure, his skin roughened by coarse soap and blunt razor '
 'blades, and the cold of the winter that had ended. outside he sees a a- a. a '
 'en aa .-a n aen .a enaenenaaa')


In [21]:
## download book
book_raw = requests.get("http://www.gutenberg.org/cache/epub/2009/pg2009.txt")
darwin_origin_of_species = book_raw.text
# cleaning
delimiter = 'INTRODUCTION.'
darwin_origin_of_species_cl = "ORIGIN OF SPECIES." + delimiter + darwin_origin_of_species.split(delimiter, 1)[1]
delimiter2 = "GLOSSARY OF THE PRINCIPAL SCIENTIFIC TERMS USED IN THE PRESENT VOLUME."
darwin_origin_of_species_cl = darwin_origin_of_species_cl.split(delimiter2, 1)[0]
# NOTE: both delimiters first occur in the table of contents, so the cleaned text
# printed below is only the chapter list, not the body of the book.
print(darwin_origin_of_species_cl)

ORIGIN OF SPECIES.INTRODUCTION.

CHAPTER I. VARIATION UNDER DOMESTICATION
CHAPTER II. VARIATION UNDER NATURE
CHAPTER III. STRUGGLE FOR EXISTENCE
CHAPTER IV. NATURAL SELECTION; OR THE SURVIVAL OF THE FITTEST
CHAPTER V. LAWS OF VARIATION
CHAPTER VI. DIFFICULTIES OF THE THEORY
CHAPTER VII. MISCELLANEOUS OBJECTIONS TO THE THEORY OF NATURAL SELECTION
CHAPTER VIII. INSTINCT
CHAPTER IX. HYBRIDISM
CHAPTER X. ON THE IMPERFECTION OF THE GEOLOGICAL RECORD
CHAPTER XI. ON THE GEOLOGICAL SUCCESSION OF ORGANIC BEINGS
CHAPTER XII. GEOGRAPHICAL DISTRIBUTION
CHAPTER XIII. GEOGRAPHICAL DISTRIBUTION—continued
CHAPTER XIV. MUTUAL AFFINITIES OF ORGANIC BEINGS
CHAPTER XV. RECAPITULATION AND CONCLUSION
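
The printed text is only the chapter list, because splitting on the first occurrence of each delimiter stops inside the table of contents. One possible fix, shown here on a toy string, is to start from the second occurrence of the opening heading. `extract_after_contents` is a hypothetical helper, and the assumption that each heading appears once in the contents and once in the body has not been checked against the actual file.

```python
def extract_after_contents(text, start_marker, end_marker):
    """Return the text from the second occurrence of start_marker up to the
    next occurrence of end_marker; raise ValueError if either is missing."""
    first = text.find(start_marker)
    second = text.find(start_marker, first + len(start_marker)) if first != -1 else -1
    if second == -1:
        raise ValueError("start marker does not occur twice")
    end = text.find(end_marker, second)
    if end == -1:
        raise ValueError("end marker not found after the body start")
    return text[second:end]


# Toy book: headings listed in the contents, then repeated in the body.
toy_book = (
    "CONTENTS INTRODUCTION. CHAPTER I. GLOSSARY. "
    "INTRODUCTION. Body text of the book. GLOSSARY. terms"
)
print(extract_after_contents(toy_book, "INTRODUCTION.", "GLOSSARY."))
```

Printing the first few hundred characters of the cleaned text, as this cell does, remains the quickest way to confirm that the markers matched where intended.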




In [22]:
## summarize book with BART model
t0 = time.time() # timer
summary_darwin_bart = summarizer_bart(darwin_origin_of_species_cl, truncation=True, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Your max_length is set to 500, but your input_length is only 241. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=120)
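
The warning above suggests `max_length=120` for an input of 241 tokens, that is, half the input length. A sketch of deriving both limits from the token count follows; `length_bounds` is a hypothetical helper, the halving mirrors the warning's suggestion, and setting `min_length` to half of `max_length` is an assumption of this sketch.

```python
def length_bounds(input_tokens, max_cap=500):
    """Pick (min_length, max_length) for a summary from the input token count.

    max_length is half the input, capped at max_cap; min_length is half of
    max_length. Both ratios are illustrative assumptions.
    """
    max_length = max(2, min(max_cap, input_tokens // 2))
    min_length = max_length // 2
    return min_length, max_length


print(length_bounds(241))  # (60, 120)
print(length_bounds(287))  # (71, 143)
```

The token count could be obtained with the pipeline's own tokenizer, for example `len(summarizer_bart.tokenizer(text)["input_ids"])`, and the two values passed as `min_length` and `max_length`.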


Summarization took 0.81 minutes.


In [23]:
pp.pprint(summary_darwin_bart[0]['summary_text'])

('The book is divided into four parts. The first part of the book deals with '
 'the theory of natural selection. The second part focuses on the history of '
 'the world. The third part deals with how the world came into existence. The '
 'fourth part is about how the universe came into being. The book is published '
 'by Simon & Schuster, Inc. in the U.S. and Canada, and is available on '
 'Amazon.com for $24.99. For more, go to: '
 'http://www.simonandschuster.com/books/the-science-of-natural-selection-and-the-history-of\xa0'
 'the-world\xa0and\xa0for more, visit: '
 'http:/www.samaritans.org/science/thescience/natural- Selection.')


In [24]:
## summarize book with T5 model
t0 = time.time() # timer
summary_darwin_t5 = summarizer_t5(darwin_origin_of_species_cl, truncation=True, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Your max_length is set to 500, but your input_length is only 287. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=143)


Summarization took 1.38 minutes.


In [25]:
pp.pprint(summary_darwin_t5[0]['summary_text'])

('ORIGIN OF SPECIES.INTRODUCTION.CHAPTER I. VARIATION UNDER DOMESTICATION '
 'CHAPTER II. NATURAL SELECTION; OR THE SURVIVAL OF THE FITTEST CHAPTERS VI, '
 'VII, VIII. MISCELLANEOUS OBJECTIONS TO THE THEORY OF NATURE’S SECTION. '
 'HYBRIDISM. XI. ON THE GEOLOGICAL SUCCESS OF ORGANIC. .- '
 '.\xad\xad\xad-\xad\xad–\xad\xad –\xad–– – . ––E.–A.– \xad\xad')


In [26]:
## download book
book_raw = requests.get("http://www.gutenberg.org/cache/epub/3420/pg3420.txt")
rights_woman = book_raw.text
# cleaning
delimiter = 'A VINDICATION OF THE RIGHTS OF WOMAN,'
rights_woman_cl = delimiter + rights_woman.split(delimiter, 1)[1]
#print(rights_woman_cl)

In [27]:
## summarize book with BART model
t0 = time.time() # timer
summary_rights_woman_bart = summarizer_bart(rights_woman_cl, truncation=True, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Summarization took 1.11 minutes.


In [28]:
pp.pprint(summary_rights_woman_bart[0]['summary_text'])

('M. Wollstonecraft was born in 1759. She became a teacher from motives of '
 'benevolence, rather than philanthropy. Her father was so great that the '
 'place of her birth is uncertain. She left her parents at the age of '
 'nineteen, and resided with a Mrs. Dawson for two years. Her friend and '
 'colleague, Dr. Price, died of a pulmonary disease. She gave proof of the '
 'superior qualification of superior qualification for the superior role of a '
 'woman. She wrote a book called The Rights of the Woman, published in 2001. '
 'The book is published by Simon & Schuster, London, priced £16.99. For more '
 'information on the book, or to order a copy, visit: '
 'http://www.simonandschuster.com/ The-Rights-of-the-Woman.html.')


In [29]:
## summarize book with T5 model
t0 = time.time() # timer
summary_rights_woman_t5 = summarizer_t5(rights_woman_cl, truncation=True, min_length=150, max_length=500) # change min_ and max_length for different output
print("Summarization took " + str(round((time.time() - t0) / 60, 2)) + " minutes.") # timer

Summarization took 1.64 minutes.
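
With all eight runs timed, the printed durations can be averaged per model. The values below are copied from the outputs in this notebook; they are single runs on unspecified hardware, and the Darwin runs used a much shorter input, so the comparison is only indicative.

```python
from statistics import mean

# Minutes per run as printed in the outputs above, in book order:
# Manifesto, 1984, Origin of Species, Rights of Woman.
timings = {
    "BART": [1.1, 1.03, 0.81, 1.11],
    "T5": [1.67, 1.62, 1.38, 1.64],
}

averages = {model: mean(runs) for model, runs in timings.items()}
for model, avg in averages.items():
    print(f"{model}: {avg:.2f} minutes on average")

# How much longer T5 took than BART, on average.
print(f"T5 / BART ratio: {averages['T5'] / averages['BART']:.2f}")
```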


In [30]:
pp.pprint(summary_rights_woman_t5[0]['summary_text'])

('A VINDICATION OF THE RIGHTS OF WOMAN, WITH STRICTURES ON POLITICAL AND MORAL '
 'SUBJECTS, BY MARY WOLLSTONECRAFT . INTRODUCTION, OBSERVATIONS ON THE STATE '
 "OF DEGRADATION, AND WOMEN'S DUTY TO PARENTS, AND ON NATIONAL EDUCATION . "
 'WITH A BIOGRAPHICAL SKETCH OF AUTHOR. 8 April,n a a- - . '
 '--------n--E.--.-----------E----S-')
