In [1]:
txt = text="""A vaccine for the coronavirus will likely be ready by early 2021 but rolling it out safely across India’s 1.3 billion people will be the country’s biggest challenge in fighting its surging epidemic, a leading vaccine scientist told Bloomberg.
India, which is host to some of the front-runner vaccine clinical trials, currently has no local infrastructure in place to go beyond immunizing babies and pregnant women, said Gagandeep Kang, professor of microbiology at the Vellore-based Christian Medical College and a member of the WHO’s Global Advisory Committee on Vaccine Safety.
The timing of the vaccine is a contentious subject around the world. 
In the U.S., President Donald Trump has contradicted a top administration health expert by saying a vaccine would be available by October. 
In India, Prime Minister Narendra Modi’s government had promised an indigenous vaccine as early as mid-August, a claim the government and its apex medical research body has since walked back.
"""

In [2]:
import os

import nltk
import torch
import transformers
from nltk import sent_tokenize
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from transformers import BertTokenizer, T5Tokenizer

# Local module; the wildcard import provides LsaSummarizer used below.
from extractive.lsa import *



In [3]:
parser = PlaintextParser.from_string(txt, Tokenizer("english"))

In [4]:
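sum_candidates = []
summarizer = LsaSummarizer()

# sent_tokenize already returns a list of sentence strings.
paragraph_split = sent_tokenize(txt)
sentences = list(paragraph_split)
sentence_count = len(sentences)

# Only one candidate (a 2-sentence summary) is built; the loop over sentence counts is disabled.
# for i in range(sentence_count):
summary = summarizer(parser.document, 2)
full_summary = ' '.join([sentence._text for sentence in summary])
sum_candidates.append(full_summary)

tokenizer = T5Tokenizer.from_pretrained("t5-small")
# Pad/truncate to 1024 tokens (changed from 512). padding="max_length" already
# covers the deprecated pad_to_max_length=True, so that argument is dropped.
source = tokenizer.batch_encode_plus(
    sum_candidates,
    max_length=1024,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
# T5's pad token id is 0, so counting nonzero ids gives the unpadded token
# count (including the end-of-sequence token).
token_count = torch.count_nonzero(source['input_ids'], dim=1)
print("Token count:", token_count)
# Indices of the candidate(s) whose token count is closest to 150.
idx = (token_count == min(token_count, key=lambda x: abs(x - 150))).nonzero().flatten()
print(idx)
# Use the first match: a multi-element tensor cannot index a Python list.
print(sum_candidates[idx[0].item()])
print(token_count[idx])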

('A', 'vaccine', 'for', 'the', 'coronavirus', 'will', 'likely', 'be', 'ready', 'by', 'early', 'but', 'rolling', 'it', 'out', 'safely', 'across', 'India', 's', 'billion', 'people', 'will', 'be', 'the', 'country', 's', 'biggest', 'challenge', 'in', 'fighting', 'its', 'surging', 'epidemic', 'a', 'leading', 'vaccine', 'scientist', 'told', 'Bloomberg', 'India', 'which', 'is', 'host', 'to', 'some', 'of', 'the', 'front-runner', 'vaccine', 'clinical', 'trials', 'currently', 'has', 'no', 'local', 'infrastructure', 'in', 'place', 'to', 'go', 'beyond', 'immunizing', 'babies', 'and', 'pregnant', 'women', 'said', 'Gagandeep', 'Kang', 'professor', 'of', 'microbiology', 'at', 'the', 'Vellore-based', 'Christian', 'Medical', 'College', 'and', 'a', 'member', 'of', 'the', 'WHO', 's', 'Global', 'Advisory', 'Committee', 'on', 'Vaccine', 'Safety', 'The', 'timing', 'of', 'the', 'vaccine', 'is', 'a', 'contentious', 'subject', 'around', 'the', 'world', 'In', 'the', 'President', 'Donald', 'Trump', 'has', 'contr

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Token count: tensor([107])
tensor([0])
India, which is host to some of the front-runner vaccine clinical trials, currently has no local infrastructure in place to go beyond immunizing babies and pregnant women, said Gagandeep Kang, professor of microbiology at the Vellore-based Christian Medical College and a member of the WHO’s Global Advisory Committee on Vaccine Safety. In the U.S., President Donald Trump has contradicted a top administration health expert by saying a vaccine would be available by October.
tensor([107])
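
The selection step in cell In [4] (pad every candidate to a fixed length, count the non-padding ids, keep the candidate nearest the target length) can be sketched without any model download. This is only an illustration under stated assumptions: `encode` and `closest_to_target` are hypothetical helpers, and a whitespace split stands in for the T5 tokenizer, so the counts will not match real T5 token counts.

```python
# Toy stand-in for a subword tokenizer: one id per whitespace-separated word.
# Id 0 is reserved for padding, mirroring T5's pad token id.
PAD_ID = 0


def encode(text, vocab, max_length):
    """Map words to ids (starting at 1), truncate, then right-pad with PAD_ID."""
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
    ids = ids[:max_length]
    return ids + [PAD_ID] * (max_length - len(ids))


def closest_to_target(candidates, target, max_length=1024):
    """Return (index, token_count) of the candidate whose length is nearest target."""
    vocab = {}
    counts = []
    for text in candidates:
        ids = encode(text, vocab, max_length)
        # Same idea as torch.count_nonzero: padding ids are 0, real tokens are not.
        counts.append(sum(1 for i in ids if i != PAD_ID))
    # min() keeps the first index on ties.
    best = min(range(len(counts)), key=lambda k: abs(counts[k] - target))
    return best, counts[best]


candidates = [
    "one two three",
    "one two three four five six",
    "one two three four five six seven eight nine ten",
]
best, count = closest_to_target(candidates, target=5)
print(best, count)  # prints: 1 6
```

With a single candidate, as in the run above, the choice is trivial and index 0 is always returned; the comparison only matters once several candidates of different lengths are collected.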


In [5]:
sentences

['A vaccine for the coronavirus will likely be ready by early 2021 but rolling it out safely across India’s 1.3 billion people will be the country’s biggest challenge in fighting its surging epidemic, a leading vaccine scientist told Bloomberg.',
 'India, which is host to some of the front-runner vaccine clinical trials, currently has no local infrastructure in place to go beyond immunizing babies and pregnant women, said Gagandeep Kang, professor of microbiology at the Vellore-based Christian Medical College and a member of the WHO’s Global Advisory Committee on Vaccine Safety.',
 'The timing of the vaccine is a contentious subject around the world.',
 'In the U.S., President Donald Trump has contradicted a top administration health expert by saying a vaccine would be available by October.',
 'In India, Prime Minister Narendra Modi’s government had promised an indigenous vaccine as early as mid-August, a claim the government and its apex medical research body has since walked back.']
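
In cell In [4], `sum_candidates` ends up holding a single 2-sentence summary. A minimal sketch of how one candidate per sentence count might be built from a list like `sentences` is shown here. It assumes a per-sentence relevance score is already available; `build_candidates` and the `scores` values are hypothetical and stand in for whatever ranking the LSA summarizer produces internally.

```python
def build_candidates(sentences, scores):
    """For k = 1..n, join the k highest-scoring sentences in document order."""
    # Sentence indices sorted from most to least relevant.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    candidates = []
    for k in range(1, len(sentences) + 1):
        chosen = sorted(ranked[:k])  # restore original sentence order
        candidates.append(" ".join(sentences[i] for i in chosen))
    return candidates


# Placeholder sentences and made-up scores, for illustration only.
demo_sentences = ["A.", "B.", "C."]
demo_scores = [0.2, 0.9, 0.5]
for candidate in build_candidates(demo_sentences, demo_scores):
    print(candidate)
```

Each candidate is longer than the previous one, which gives the token-count comparison something to choose between.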