### Playing with the transformers library from Hugging Face: Bible Summarization

#### Simpler approach using facebook/bart-large-cnn for sentence summarization

Testing on a BBC News article about the war in Ukraine

In [1]:
TEXT = """Thirteen-year-old Nika Selivanova made a heart shape with both her hands, waving goodbye to her best friend Inna who was pressed up against the glass partition that divided the entrance hall of Kherson's train station from the waiting area. Moments earlier, they'd hugged, tears welling up in their eyes. Inna had kissed Asia, a tan dachshund dog wrapped up in a warm blanket, carried by Nika in her arms. The girls didn't know when they might see each other again. Nika's family was leaving Kherson, not sure of where they would end up eventually. For now, they were heading to the western city of Khmelnytskyi, hoping they would get some help there. The past few days in Kherson had simply been too much for Nika's mother Elena. "Before, they [Russian forces] shelled us seven to 10 times a day, now it's 70-80 times, all day long. It's too scary." Elena said. "I love Ukraine and my dear city. But we have to go." Elena and her three daughters are among more than 400 people who have left Kherson since Christmas Day, after a sharp increase in the intensity of the bombardment of the city by the Russian military. On Tuesday, a hospital maternity ward was shelled. No-one was hurt but it has further escalated fear among people. Elena left by train, in an evacuation facilitated by the Ukrainian government.Hundreds of people are leaving on their own, a queue of cars building up at the checkpoint leading out of Kherson, filled with terrified civilians. Iryna Antonenko was in tears when we walked up to her car to speak to her. 'We can't take it anymore. The shelling is so intense. We stayed this whole time and thought it would pass and that we would be lucky. But a strike hit the house next to ours, and my father's home was also shelled," she said. She planned to travel to Kryvyi Rih, a city in central Ukraine where she has family.Just last month, there had been jubilant scenes in Kherson. 
Taken by Russian forces on the second day of the invasion, the city was liberated on 11 November. Close to the spot where masses had gathered waving Ukrainian flags to celebrate being freed from Russian control, a mortar attack on Christmas Eve left eleven dead, and dozens injured. Among the dead were a social worker, a butcher and a woman selling mobile Sim cards - ordinary people working at or visiting the city's central market. That day, Kherson was hit by mortars 41 times, according to the Ukrainian government. The Russians are firing from the left (east) bank of the Dnipro river, where they withdrew to; the waterway has become a de facto frontline in the south of Ukraine."""

In [2]:
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=0)


In [3]:
print(summarizer(TEXT, max_length=1000, min_length=10, do_sample=False))


Your max_length is set to 1000, but you input_length is only 589. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=294)


[{'summary_text': 'More than 400 people have left Kherson since Christmas Day, after a sharp increase in the intensity of the bombardment of the city by the Russian military. On Tuesday, a hospital maternity ward was shelled. The Russians are firing from the left (east) bank of the Dnipro river, where they withdrew to.'}]
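The warning above appears because the call asks for a summary of up to 1000 tokens from a 589-token input. A minimal sketch of one way to derive the length bounds from the input size follows. `suggest_summary_lengths` is a hypothetical helper, not part of transformers, and the halving rule is an assumption that simply mirrors the `max_length=294` hint printed in the warning.

```python
def suggest_summary_lengths(input_length, min_length=10, model_limit=1024):
    """Return (min_length, max_length) for summarizing an input of input_length tokens.

    Assumption: half the input length, as hinted by the pipeline warning
    (589 -> 294), is a reasonable upper bound for the summary.
    """
    if input_length <= 0:
        raise ValueError("input_length must be positive")
    # Halve the input length, but never exceed the model limit.
    max_length = min(input_length // 2, model_limit)
    # Keep the bounds valid and consistent for very short inputs.
    max_length = max(max_length, 1)
    min_length = min(min_length, max_length)
    return min_length, max_length


print(suggest_summary_lengths(589))  # (10, 294)
```

The two values could then be passed as `min_length` and `max_length` to the `summarizer(...)` call used above.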


The model accepts at most 1024 input tokens. Tokens are subword units produced by the tokenizer, not words or sentences, so a text usually contains more tokens than words.


### Bible summarization

In [6]:
# let's skip the King James Version because of its archaic vocabulary
# bible_kjv_sents = None
# with open('American_King_James_Version_Only_Sentences.txt', 'r') as file:
#     bible_kjv_sents = file.read()
# bible_sents = bible_kjv_sents.split('\n')

In [8]:
import pandas as pd

In [9]:
bible = pd.read_csv('/media/andre/LxData/bible-corpus/t_web.csv')
bible = bible[['b', 'c', 'v', 't']]

#### 1. Word tokenize for summarization 

I don't want a summary built only from whole sentences (sentence tokenization).  
I want a deeper approach based on the words and their relations.  

Let's use the pretrained BART tokenizer.

In [10]:
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

cuda = torch.device('cuda')
model = model.to(cuda)

In [11]:
bible['wcount'] = bible.t.apply(lambda x: len(x.split(' ')))

In [12]:
bible.query('b == 42').groupby(bible.c).sum()


| c (index) | b | c | v | wcount |
|---|---|---|---|---|
| 1 | 3360 | 80 | 3240 | 1503 |
| 2 | 2184 | 104 | 1378 | 1072 |
| 3 | 1596 | 114 | 741 | 861 |
| 4 | 1848 | 176 | 990 | 971 |
| 5 | 1638 | 195 | 780 | 892 |
| 6 | 2058 | 294 | 1225 | 1193 |
| 7 | 2100 | 350 | 1275 | 1144 |
| 8 | 2352 | 448 | 1596 | 1370 |
| 9 | 2604 | 558 | 1953 | 1412 |
| 10 | 1764 | 420 | 903 | 961 |


In [13]:
# join all verses of book 42, chapter 15 into one string
long_text = ' '.join(bible.query('b == 42 and c == 15').t.to_list())

In [14]:
# tokenize without truncation
inputs_no_trunc = tokenizer(long_text, max_length=None, return_tensors='pt', truncation=False)
long_text

'Now all the tax collectors and sinners were coming close to him to hear him. The Pharisees and the scribes murmured, saying, "This man welcomes sinners, and eats with them." He told them this parable. "Which of you men, if you had one hundred sheep, and lost one of them, wouldn\'t leave the ninety-nine in the wilderness, and go after the one that was lost, until he found it? When he has found it, he carries it on his shoulders, rejoicing. When he comes home, he calls together his friends and his neighbors, saying to them, \'Rejoice with me, for I have found my sheep which was lost!\' I tell you that even so there will be more joy in heaven over one sinner who repents, than over ninety-nine righteous people who need no repentance. Or what woman, if she had ten drachma{A drachma coin was worth about 2 days wages for an agricultural laborer.} coins, if she lost one drachma coin, wouldn\'t light a lamp, sweep the house, and seek diligently until she found it? When she has found it, she ca

In [15]:
inputs_no_trunc['input_ids'][0].shape, len(long_text.split(' '))

(torch.Size([891]), 699)
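The output above shows 891 tokens for 699 whitespace-separated words, roughly 1.27 tokens per word for this chapter. As a rough sketch, that ratio can be used to guess from the `wcount` column alone which chapters are likely to exceed the 1024-token limit. The helper names are hypothetical, and the ratio is an assumption drawn from this single chapter; only the tokenizer gives the real count.

```python
# Ratio observed for the chapter above; treating it as typical is an assumption.
TOKENS_PER_WORD = 891 / 699


def estimate_tokens(word_count, ratio=TOKENS_PER_WORD):
    """Estimate the token count of a text from its word count."""
    return round(word_count * ratio)


def likely_too_long(word_count, limit=1024, ratio=TOKENS_PER_WORD):
    """Return True if the estimated token count exceeds the model limit."""
    return estimate_tokens(word_count, ratio) > limit


# Word counts taken from the notebook: chapter 15 (699) and chapters 3 and 1.
for words in (699, 861, 1503):
    print(words, likely_too_long(words))
```

By this estimate, every chapter listed in the table above would need to be split before summarization.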

In [16]:
# split the token ids into consecutive chunks of at most model_max_length tokens
chunk_size = tokenizer.model_max_length  # == 1024 for Bart
input_ids = inputs_no_trunc['input_ids'][0]
inputs_batch_lst = []
# range() stops before len(input_ids), so no empty trailing chunk is created
for chunk_start in range(0, len(input_ids), chunk_size):
    inputs_batch = input_ids[chunk_start:chunk_start + chunk_size]  # up to chunk_size tokens
    inputs_batch = torch.unsqueeze(inputs_batch, 0)  # add the batch dimension
    inputs_batch_lst.append(inputs_batch)
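Slicing the token tensor at fixed offsets can cut a verse in half at a chunk boundary. One alternative, sketched here under assumptions, is to pack whole verses greedily into chunks that stay under a token budget. `pack_units` is a hypothetical helper; it expects per-verse token counts computed beforehand (for example with the tokenizer), and the counts in the usage line are made up for illustration.

```python
def pack_units(token_counts, budget=1024):
    """Greedily group consecutive units (e.g. verses) into chunks.

    token_counts: one token count per unit, in reading order.
    Returns (start, end) index pairs, end exclusive, where the counts in
    each chunk sum to at most `budget`. A single unit larger than the
    budget gets a chunk of its own and would still need truncation.
    """
    chunks = []
    start = 0
    running = 0
    for i, n in enumerate(token_counts):
        # Close the current chunk if adding this unit would overflow it.
        if i > start and running + n > budget:
            chunks.append((start, i))
            start, running = i, 0
        running += n
    # Emit the last, possibly partial, chunk.
    if start < len(token_counts):
        chunks.append((start, len(token_counts)))
    return chunks


# Made-up counts for six verses and a small budget, for illustration only.
print(pack_units([40, 30, 50, 20, 60, 10], budget=100))  # [(0, 2), (2, 4), (4, 6)]
```

When the chunks are re-tokenized, a budget slightly below 1024 leaves room for the start and end tokens that the tokenizer adds.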

In [17]:
inputs_batch_lst = [x.to(cuda) for x in inputs_batch_lst]

In [35]:
# generate a summary for each batch
summary_ids_lst = [model.generate(inputs, num_beams=5, min_length=130,
                                 max_length=1050, early_stopping=True) for inputs in inputs_batch_lst]

# summary_ids_lst = [model.generate(inputs, 
#                     do_sample=True, 
#                     max_length=150, 
#                     top_p=0.92, 
#                     top_k=50,   
#                     num_return_sequences=3) 
#                    for inputs in inputs_batch_lst]

In [36]:
# decode the output and join into one string with one paragraph per summary batch
summary_batch_lst = []
for summary_id in summary_ids_lst:
    summary_batch = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_id]
    summary_batch_lst.append(summary_batch[0])
summary_all = '\n'.join(summary_batch_lst)

print(summary_all)

The Pharisees and the scribes murmured, saying, "This man welcomes sinners, and eats with them." He told them this parable. "Which of you men, if you had one hundred sheep, and lost one of them, wouldn't leave the ninety-nine in the wilderness, and go after the one that was lost, until he found it? When he has found it, he carries it on his shoulders, rejoicing," he said. "Even so, I tell you, there is joy in the presence of the angels of God over one sinner repenting," he added. "It was appropriate to celebrate and be glad, for this, your brother was dead, and is alive again," Jesus said.
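For longer passages that split into several chunks, the joined summary can itself become long. A possible extension, sketched here under assumptions, is to summarize each chunk, join the results with newlines as above, and optionally run one final pass over the joined text. `summarize_chunks` is a hypothetical helper, `summarize` stands for any callable that maps a string to a shorter string (for example a wrapper around `model.generate`), and `first_word` is only a stand-in to keep the sketch runnable without a model.

```python
def summarize_chunks(chunks, summarize, final_pass=False):
    """Summarize each chunk, join the partial summaries, optionally condense again.

    chunks: list of text chunks, in reading order.
    summarize: callable taking a string and returning a shorter string.
    final_pass: if True and there is more than one chunk, summarize the
    joined partial summaries once more.
    """
    partial = [summarize(chunk) for chunk in chunks]
    combined = "\n".join(partial)  # one paragraph per chunk
    if final_pass and len(partial) > 1:
        combined = summarize(combined)
    return combined


def first_word(text):
    """Stand-in 'summarizer' for illustration only: keeps the first word."""
    return text.split()[0]


print(summarize_chunks(["alpha beta", "gamma delta"], first_word))
print(summarize_chunks(["alpha beta", "gamma delta"], first_word, final_pass=True))
```

The final pass only makes sense when the joined summaries fit within the model's input limit.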
