`google/pegasus-xsum` is a natural language processing (NLP) model for abstractive text summarization. It is available through the Hugging Face Transformers library and is typically used from Python.

https://huggingface.co/google/pegasus-xsum

### **Installing dependencies**

In [2]:
# Installing PyTorch
!pip install torch



In [3]:
# Installing transformers
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
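
Because the install log is cut off, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library, assuming Python 3.8 or newer (where `importlib.metadata` is available); the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        # The package is not installed in the current environment
        return None


# Report the two packages this notebook depends on
for name in ("torch", "transformers"):
    print(name, installed_version(name))
```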


### **Importing and loading the Model**

In [4]:
# Importing pytorch
import torch

In [13]:
# Importing dependencies from transformers
# PegasusForConditionalGeneration is the model class; it loads the pretrained weights and generates the summaries
# AutoTokenizer converts text into tokens (the numeric representation of the sentences) and back again
from transformers import PegasusForConditionalGeneration, AutoTokenizer

In [14]:
# Loading tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")

Downloading:   0%|          | 0.00/87.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.36M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

In [9]:
# Loading model 
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")

Downloading:   0%|          | 0.00/1.36k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.12G [00:00<?, ?B/s]

### **Performing abstractive summarization**

In [17]:
# text for performing summarization
src_text = [
    """this happened 5/6 years ago so my whole family every xmas day goes around to my aunties for celebrations. my cousin (of course) was there and he
asked if i wanted to play cops and robbers. i accepted of course. now, next to the side of my aunts house is a little area with a small fence, a covered
water tank and super duper sharp stones. my cousin (who was the cop) was gaining on me. i (tried) to jump over the fence, aaand i failed the jump
and went crashing onto the gravel, my leg hitting the sharpest bit and, then the next thing i knew it had a nasty gash."""
]

In [18]:
# Create tokens: the numeric representation of our text
tokens = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt")
# 1st argument (src_text): the text we want to summarize
# 2nd argument (truncation=True): truncates the input to the maximum length the model accepts
# 3rd argument (padding="longest"): pads every sequence in the batch to the length of the longest one
# 4th argument (return_tensors="pt"): returns PyTorch tensors

In [19]:
# Input tokens
tokens
# token representation of our body text

{'input_ids': tensor([[  136,  2032, 75382,   231,   754,   167,   161,   664,   328,   290,
         72217,   242,  1168,   279,   112,   161, 72803,   116,   118,  9682,
           107,   161, 10691,   143,  1313,   422,   158,   140,   186,   111,
           178,  1049,   175,   532,   869,   112,   462, 22242,   111, 49998,
           107,   532,  2842,   113,   422,   107,   239,   108,   352,   112,
           109,   477,   113,   161, 46373,   480,   117,   114,   332,   345,
           122,   114,   360,  4501,   108,   114,  1622,   336,  3476,   111,
          1561,  6457,  3752,  4565,  6598,   107,   161, 10691,   143,  6062,
           140,   109, 17934,   158,   140,  6752,   124,   213,   107,   532,
           143, 62345,   158,   112,  3540,   204,   109,  4501,   108,   114,
           304,   526,   532,  3004,   109,  3540,   111,   687, 18471,  1656,
           109, 11197,   108,   161,  3928,  5876,   109, 71422,   588,   111,
           108,   237,   109,   352,  

The tokenizer returns two things. The first, `input_ids`, holds the actual tokens. The second, `attention_mask`, tells the model which positions to attend to when generating text: real tokens are marked 1 and padding positions are marked 0.
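
How truncation, padding, and the attention mask relate can be sketched with plain Python lists. This is only an illustration of the idea, not the tokenizer's own implementation; the function name `pad_batch` is hypothetical, and the pad ID of 0 is taken from the decoded output in this notebook, where token 0 appears as `<pad>`.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Truncate and pad lists of token IDs, returning input_ids and attention_mask."""
    if max_length is not None:
        # Like truncation=True: cut every sequence down to at most max_length tokens
        sequences = [seq[:max_length] for seq in sequences]
    # Like padding="longest": the target length is the longest sequence in the batch
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[136, 2032, 1], [182, 117, 109, 584, 1]])
print(batch["input_ids"])       # [[136, 2032, 1, 0, 0], [182, 117, 109, 584, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

With a single input text, as in this notebook, the longest sequence is the text itself, so no padding is added and the mask is all ones.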

In [20]:
# Summarizing 
summary = model.generate(**tokens)  # unpack the tokens and pass them through the model to generate the summary

In [26]:
# Output summary tokens
summary

tensor([[    0,   182,   117,   109,   584,   113,   199,   532,  2371,   164,
           122,   114, 12562,  1503,  1467,   124,   161,  3928,   107,     1]])

In [25]:
# Decoding summary
tokenizer.decode(summary[0])
# Converts the summary tokens back into text

'<pad> This is the story of how i ended up with a nasty gash on my leg.</s>'
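
The decoded string still contains the special markers `<pad>` and `</s>`. One way to drop them is to pass `skip_special_tokens=True` when decoding. The sketch below wraps this in a small helper; the name `decode_summaries` is an assumption made for this example, and it expects the `tokenizer` and `summary` objects created in the cells above.

```python
def decode_summaries(tokenizer, summary_ids):
    """Decode every generated sequence in a batch, dropping special tokens."""
    # skip_special_tokens=True removes markers such as <pad> and </s>
    texts = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
    # Strip any leftover whitespace around each summary
    return [text.strip() for text in texts]
```

Calling `decode_summaries(tokenizer, summary)` returns a list with one string per input text, which also covers the case where several texts are summarized at once.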

Trying another text

In [27]:
src_text2 = [
    """throwaway here for obvious reasons.. today my friends and i decided to go off-roading in nowhereland. we packed up all our stuff, made the roughly
hour drive off to the mountains to make a fire, go fishing and just talk about life until we got too tired to stay any longer. we got everything packed
up and brought along one of my friends’ dog because she’s awesome and loves the outdoors. the dog was flipping out in the suv on the way to the
path because she knew was a kick-ass day she was about to have breaking out of her normally lame, domesticated dog life. my friends decided to
drink during the off-roading adventure, which was fine because i volunteered to drive since i cannot drink alcohol (mouth is wired shut [long story
but i can’t drink alcohol for a while]) so we were playing it safe. the dog couldn’t be any happier and was about to jump out of the truck (literally)
when we got there so the dog’s owner let her get out and run along side of us while we drove the dirt road up to the destination for the fire. as i was
driving, the dog went in and out of vision, mostly biting the tires as most dogs do, playing around. the owner kept asking us (the two guys up front)
if we could see her. we said yes, and kept driving. as i was driving at no more than 5-10mph along the dirt road, i could hear the dog biting at the
tires playfully, but we just laughed it off bc we thought she was having fun. the horrible, seconds-long event that ensued was me feeling the dreaded
’double-thud’ under the tires and heard the dog yelp in pain. i instantly stopped the ... ."""
]


In [36]:
tokens2 = tokenizer(src_text2, truncation=True, padding="longest", return_tensors="pt")

In [37]:
tokens2

{'input_ids': tensor([[81075,   264,   118,  3312,  1523,   107,   107,   380,   161,   594,
           111,   532,  1159,   112,   275,   299,   121, 10388,   273,   115,
          9011,  2567,   107,   145,  4023,   164,   149,   150,  1549,   108,
           266,   109,  5864,  1269,   919,   299,   112,   109,  4583,   112,
           193,   114,  1316,   108,   275,  3070,   111,   188,  1002,   160,
           271,   430,   145,   419,   314,  4633,   112,   753,   189,   895,
           107,   145,   419,   579,  4023,   164,   111,  1457,   466,   156,
           113,   161,   594,   123,  1296,   262,   265,   123,   116,  1990,
           111,  3452,   109,  5628,   107,   109,  1296,   140, 23402,   165,
           115,   109, 50263,   124,   109,   230,   112,   109,  1936,   262,
           265,  1606,   140,   114,  3951,   121,  1483,   116,   242,   265,
           140,   160,   112,   133,  4282,   165,   113,   215,  3209, 23964,
           108, 50200,  1296,   271,  

In [38]:
summary2 = model.generate(**tokens2)

In [39]:
summary2

tensor([[   0,  125,  123,  261,  174, 2050,  112, 1094,  136,  450,  118,  114,
          300,  166,  107,    1]])

In [40]:
tokenizer.decode(summary2[0])

'<pad> I’ve been meaning to write this post for a long time.</s>'
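
Since the same three steps (tokenize, generate, decode) are repeated for each text, they could be wrapped in one function. This is a minimal sketch rather than a definitive implementation: the name `summarize` is hypothetical, and it assumes the `tokenizer` and `model` objects loaded earlier are passed in.

```python
def summarize(texts, tokenizer, model, **generate_kwargs):
    """Tokenize, generate, and decode summaries for a list of texts."""
    # Same tokenizer settings as in the cells above
    tokens = tokenizer(texts, truncation=True, padding="longest", return_tensors="pt")
    # Extra keyword arguments (for example num_beams or max_length) are forwarded to generate()
    summary_ids = model.generate(**tokens, **generate_kwargs)
    # Decode the whole batch and drop special tokens such as <pad> and </s>
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
```

For example, `summarize(src_text + src_text2, tokenizer, model)` would summarize both texts in one batch. Generation settings such as `num_beams` can be passed as keyword arguments to experiment with the output.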