
In [16]:
!pip install transformers



In [17]:
!pip install sentencepiece  # SentencePiece is a tokenizer library; the tokenizer of our pre-trained model is built on it, so it must be installed.



IMPORTING DEPENDENCIES

In [18]:


# Import the standard-library regular expression module (re) to clean up the text
import re

# AutoTokenizer downloads the tokenizer that matches the model we pick
# and converts raw text into token ids
from transformers import AutoTokenizer

# AutoModelForSeq2SeqLM loads a Hugging Face sequence-to-sequence model
from transformers import AutoModelForSeq2SeqLM



DEFINING MODEL NAME, TEXT, PREPROCESSING DEFINITIONS

In [19]:

# Collapses newlines and runs of whitespace into single spaces (raw strings avoid invalid-escape warnings for \s)
WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))

# The text to be summarized
article_text = """To break the grip of corruption and black money, we have decided that the five hundred rupee and thousand rupee currency notes presently in use will no longer be legal tender from midnight tonight, that is 8th November 2016. This means that these notes will not be acceptable for transactions from midnight onwards. The five hundred and thousand rupee notes hoarded by anti-national and anti-social elements will become just worthless pieces of paper. The rights and the interests of honest, hard-working people will be fully protected. Let me assure you that notes of one hundred, fifty, twenty, ten, five, two and one rupee and all coins will remain legal tender and will not be affected."""

# Hugging Face Hub identifier of the pre-trained mT5 model fine-tuned for summarization on XL-Sum
model_name = "csebuetnlp/mT5_multilingual_XLSum"
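To make the effect of the whitespace handling concrete, here is a minimal sketch that applies the same two substitutions to a deliberately messy string. The function name `normalize_whitespace` and the sample string are illustrative and not part of the original notebook.

```python
import re

def normalize_whitespace(text):
    """Collapse newlines and repeated whitespace into single spaces."""
    # Same two substitutions as WHITESPACE_HANDLER, written as a named function.
    return re.sub(r"\s+", " ", re.sub(r"\n+", " ", text.strip()))

# Illustrative input with leading/trailing spaces, blank lines and a tab
messy = "  To break the grip\n\nof   corruption\tand black money.  \n"
clean = normalize_whitespace(messy)
print(clean)  # To break the grip of corruption and black money.
```

Since `\s` already matches newlines, the inner substitution is redundant, but it is kept here to mirror the cell above.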


IMPORTING MODEL AND TOKENIZER (SENTENCEPIECE)

In [20]:

# Load the pre-trained summarization model (downloaded on first use)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


# Download the tokenizer that matches this specific model (csebuetnlp/mT5_multilingual_XLSum).
# The tokenizer converts text into the numerical token ids the model expects.
tokenizer = AutoTokenizer.from_pretrained(model_name)


READING THE INPUT

In [21]:

input_ids = tokenizer(                  # Tokenizing splits the text into words or subwords, which are then mapped to ids through a look-up table
    [WHITESPACE_HANDLER(article_text)], # The article text after whitespace normalization, passed as a batch of one
    return_tensors="pt",                # Return PyTorch tensors instead of lists of Python integers
    padding="max_length",               # Pad inputs shorter than max_length up to max_length
    truncation=True,                    # Cut off inputs longer than max_length
    max_length=512                      # Maximum input length, in tokens
)["input_ids"]
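The combined effect of `padding="max_length"` and `truncation=True` is that every input ends up with exactly `max_length` ids. The sketch below imitates that behaviour on plain Python lists. The helper `pad_or_truncate`, the sample ids and `pad_id=0` are assumptions chosen for illustration, and the sketch ignores details the real tokenizer handles, such as the end-of-sequence token.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return exactly max_length ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:max_length]                      # truncation
    return clipped + [pad_id] * (max_length - len(clipped))  # padding

# A short input is padded on the right
short = pad_or_truncate([259, 1037, 1], max_length=6)
# A long input is cut down to max_length
long_input = pad_or_truncate(list(range(10, 20)), max_length=6)

print(short)       # [259, 1037, 1, 0, 0, 0]
print(long_input)  # [10, 11, 12, 13, 14, 15]
```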


PRODUCING THE OUTPUT TOKENS

In [22]:

output_ids = model.generate(
    input_ids=input_ids,                # The token ids of the article to summarize
    max_length=84,                      # Maximum length of the generated summary, in tokens
    no_repeat_ngram_size=2,             # Ensures no sequence of 2 tokens appears twice, by setting the probability of the repeating token to zero
    num_beams=4                         # Beam search: keeps the 4 most probable partial summaries at each step instead of only the single most probable next token
)[0]
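The idea behind `no_repeat_ngram_size` can be sketched without the model: before each new token is chosen, find every token that would complete an n-gram already present in the output, and forbid it. The function `banned_next_tokens` below is a hypothetical helper written for illustration, not the library's implementation, and it works on words rather than real token ids.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the start of the next n-gram
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])  # choosing this token would repeat the n-gram
    return banned

tokens = ["the", "notes", "will", "not", "be", "the"]
banned = banned_next_tokens(tokens, ngram_size=2)
print(banned)  # {'notes'}
```

Here the output already contains "the notes", and the last token is "the" again, so "notes" is the only token that must be blocked at the next step.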

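To see why `num_beams` greater than 1 can give a better result than always picking the most probable next token, consider a toy example. The probability table `TOY_MODEL` and the function `beam_search` are invented for illustration and have nothing to do with the real model's probabilities; with `num_beams=1` the search reduces to greedy decoding.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far
TOY_MODEL = {
    (): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("A",): {"A": 0.4, "B": 0.3, "C": 0.3},
    ("B",): {"A": 0.9, "B": 0.05, "C": 0.05},
    ("C",): {"A": 0.4, "B": 0.3, "C": 0.3},
}

def beam_search(model, steps, num_beams):
    """Keep the num_beams most probable prefixes after every step."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]  # the most probable complete sequence found

greedy = beam_search(TOY_MODEL, steps=2, num_beams=1)
beam = beam_search(TOY_MODEL, steps=2, num_beams=2)
print(greedy)  # ('A', 'A'), total probability 0.5 * 0.4 = 0.20
print(beam)    # ('B', 'A'), total probability 0.4 * 0.9 = 0.36
```

Greedy decoding commits to "A" because it is the best first token, while the wider beam keeps "B" alive long enough to discover the more probable sequence.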

PRINTING THE OUTPUT AS TEXT

In [23]:

summary = tokenizer.decode(             # Converts the generated token ids back into text
    output_ids,                         # The token ids produced by model.generate
    skip_special_tokens=True,           # Drops special tokens (such as padding and end-of-sequence markers) from the decoded text
    clean_up_tokenization_spaces=False  # Leaves the spacing around punctuation exactly as decoded
)

print(summary)                          # The generated summary of the article


India's currency notes have been banned from legal tender from 8 November.
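The decoding step can be pictured as a reverse look-up followed by joining the subword pieces. The sketch below is a simplified illustration under stated assumptions: `TOY_VOCAB` is a made-up table, `toy_decode` is a hypothetical helper, and the real tokenizer does considerably more. It relies only on the SentencePiece convention that the character "▁" marks the start of a new word.

```python
SPECIAL_TOKENS = {"<pad>", "</s>", "<unk>"}

# Hypothetical id-to-piece table, invented for this example
TOY_VOCAB = {
    0: "<pad>", 1: "</s>", 2: "<unk>",
    3: "▁India", 4: "'", 5: "s", 6: "▁currency", 7: "▁notes",
}

def toy_decode(ids, skip_special_tokens=True):
    """Map ids back to pieces, optionally drop special tokens, join into text."""
    pieces = [TOY_VOCAB[i] for i in ids]
    if skip_special_tokens:
        pieces = [p for p in pieces if p not in SPECIAL_TOKENS]
    # "▁" marks a word boundary, so it becomes a space in the final text
    return "".join(pieces).replace("▁", " ").strip()

ids = [0, 3, 4, 5, 6, 7, 1]
with_skip = toy_decode(ids)
without_skip = toy_decode(ids, skip_special_tokens=False)
print(with_skip)     # India's currency notes
print(without_skip)  # <pad> India's currency notes</s>
```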
