In [19]:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", '3.0.0')



In [21]:
test_dataset = dataset['test']

In [22]:
test_dataset = test_dataset.map(lambda ex: {'reference_summary': ex['highlights']})

In [23]:
test_dataset

Dataset({
    features: ['article', 'highlights', 'id', 'reference_summary'],
    num_rows: 11490
})

In [21]:
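# Inspect the first example: the article, its highlights, and the copied reference summary.
# `example` is used as the loop variable to avoid shadowing the built-in `input`.
for example in test_dataset:
    print(example['article'])
    print('\n****')
    print(example['highlights'])
    print('\n****')
    print(example['reference_summary'])
    break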

(CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC's founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians' efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaking at Wednesday's ceremony, sa

In [24]:
test_dataset = test_dataset.select(list(range(10)))

In [25]:
test_dataset

Dataset({
    features: ['article', 'highlights', 'id', 'reference_summary'],
    num_rows: 10
})

In [26]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch

output_list = []
model_name = "google/pegasus-cnn_dailymail"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)

# Summarise one article at a time; `example` avoids shadowing the built-in `input`.
for example in test_dataset:
    # Truncate each article to the model's maximum input length before generating.
    batch = tokenizer(example['article'], truncation=True, padding="longest", return_tensors="pt").to(device)
    translated = model.generate(**batch)
    op_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    # batch_decode returns a list, so each entry of output_list is a one-element list.
    output_list.append(op_text)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
output_list

[["The Palestinian Authority formally becomes the 123rd member of the International Criminal Court.<n>The move gives the court jurisdiction over alleged crimes in Palestinian territories.<n>Palestinians signed the ICC's founding Rome Statute in January."],
 ['Theia, a white-and-black bully breed mix, was apparently hit by a car and buried in a field.<n>Four days later, the dog staggers to a farm and is taken in by a worker.<n>She needs surgery to fix a dislocated jaw and a caved-in sinus cavity.'],
 ["Mohammad Javad Zarif is the Iranian foreign minister.<n>He is the opposite number in talks with the U.S. over Iran's nuclear program.<n>He received a hero's welcome as he arrived in Iran on a sunny Friday morning."],
 ["One of the five had a heart-related issue on Saturday and has been discharged but hasn't left the area.<n>They were exposed to Ebola in Sierra Leone in March, but none developed the deadly virus."],
 ["A student has admitted to hanging a noose from a tree near a student un
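The generated summaries above contain a literal `<n>` between sentences, which appears to be the sentence separator emitted by this Pegasus checkpoint. Left in place, it is scored as ordinary text by BLEU and ROUGE. A minimal sketch of a cleanup step follows; the helper name `clean_pegasus_summary` is hypothetical and not part of `transformers`.

```python
import re

def clean_pegasus_summary(text: str) -> str:
    """Replace literal "<n>" sentence separators with newlines."""
    # Also swallow whitespace around the marker so each sentence sits on its own line.
    return re.sub(r"\s*<n>\s*", "\n", text).strip()

# Shortened example string in the style of the model output above.
raw = "The move gives the court jurisdiction.<n>Palestinians signed the Rome Statute in January."
print(clean_pegasus_summary(raw))
```

Applying such a step before scoring would change the BLEU and ROUGE values reported later in the notebook, so it should be applied consistently or not at all.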

In [32]:
from nltk.translate.bleu_score import sentence_bleu
import nltk
from rouge import Rouge
lenlist = []
rouge = Rouge()
for i in range(10):
    reference = test_dataset['reference_summary'][i]
    predicted = output_list[i][0]  # each entry of output_list is a one-element list
    print("Article ",i)
    print()
    print("Reference Summary: \n",reference)
    print()
    print("Predicted Summary: \n", predicted)
    print()
    print('SCORES\n')
    # Sentence-level BLEU on word tokens; no smoothing, so a missing n-gram order gives ~0.
    reference_summary_tokens = nltk.word_tokenize(reference)
    generated_summary_tokens = nltk.word_tokenize(predicted)
    bleu_score = sentence_bleu([reference_summary_tokens], generated_summary_tokens)
    print('BLEU: ',bleu_score)
    print()
    # Rouge.get_scores takes the hypotheses first, then the references.
    rougescore = rouge.get_scores([predicted], [reference])
    print('Rouge: ', rougescore)
    print()
    # Length is measured in characters, not tokens.
    print('Length: ', len(predicted))
    lenlist.append(len(predicted))
    print('_______________________________________________________________________________________')
    print()

Article  0

Reference Summary: 
 Membership gives the ICC jurisdiction over alleged crimes committed in Palestinian territories since last June .
Israel and the United States opposed the move, which could open the door to war crimes investigations against Israelis .

Predicted Summary: 
 The Palestinian Authority formally becomes the 123rd member of the International Criminal Court.<n>The move gives the court jurisdiction over alleged crimes in Palestinian territories.<n>Palestinians signed the ICC's founding Rome Statute in January.

SCORES

BLEU:  0.08892786873926031

Rouge:  [{'rouge-1': {'r': 0.3, 'p': 0.3103448275862069, 'f': 0.3050847407641483}, 'rouge-2': {'r': 0.18181818181818182, 'p': 0.18181818181818182, 'f': 0.18181817681818196}, 'rouge-l': {'r': 0.3, 'p': 0.3103448275862069, 'f': 0.3050847407641483}}]

Length:  250
_______________________________________________________________________________________

Article  1

Reference Summary: 
 Theia, a bully breed mix, was apparentl

The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
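The warnings above come from NLTK when a hypothesis shares no 3-grams or 4-grams with its reference, which is common for short summaries. A minimal sketch of the two remedies the warning suggests follows, using toy token lists that are assumptions for illustration rather than data from this notebook.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# Toy token lists (assumed for illustration); they share no 4-gram.
reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()

# Remedy 1: method1 adds a small epsilon to zero n-gram counts,
# so one missing n-gram order no longer collapses the whole score.
smooth = SmoothingFunction().method1
smoothed_bleu = sentence_bleu([reference], hypothesis, smoothing_function=smooth)

# Remedy 2: use a lower n-gram order, here unigrams and bigrams only.
bleu_2 = sentence_bleu([reference], hypothesis, weights=(0.5, 0.5))

print(smoothed_bleu, bleu_2)
```

Either choice changes what the number means, so scores computed this way are not directly comparable with the unsmoothed 4-gram BLEU printed above.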


In [36]:
avg = sum(lenlist)/len(lenlist)
print(avg)

200.1


In [37]:
lenlist

[250, 240, 213, 195, 177, 216, 249, 149, 168, 144]

In [22]:
print(rouge.get_scores(output_list[0], [test_dataset['reference_summary'][0]]))

[{'rouge-1': {'r': 0.3, 'p': 0.3103448275862069, 'f': 0.3050847407641483}, 'rouge-2': {'r': 0.18181818181818182, 'p': 0.18181818181818182, 'f': 0.18181817681818196}, 'rouge-l': {'r': 0.3, 'p': 0.3103448275862069, 'f': 0.3050847407641483}}]


In [19]:
output_list[0]

["The Palestinian Authority formally becomes the 123rd member of the International Criminal Court.<n>The move gives the court jurisdiction over alleged crimes in Palestinian territories.<n>Palestinians signed the ICC's founding Rome Statute in January."]

In [21]:
test_dataset['reference_summary'][0]

'Membership gives the ICC jurisdiction over alleged crimes committed in Palestinian territories since last June .\nIsrael and the United States opposed the move, which could open the door to war crimes investigations against Israelis .'

In [16]:
test_dataset['article'][0]

'(CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC\'s founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians\' efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaking at Wednesday\'s ceremony

In [17]:
op_text

['Bob Barker returns to "The Price Is Right" for the first time in eight years.<n>The 91-year-old TV legend hosted the classic "Lucky Seven" game.']

In [None]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch

src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]

model_name = "google/pegasus-cnn_dailymail"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
batch = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt").to(device)
translated = model.generate(**batch)
op_text = tokenizer.batch_decode(translated, skip_special_tokens=True)

## PEGASUS XSUM

In [1]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch

src_text = [
    """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
]

model_name = "google/pegasus-xsum"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
batch = tokenizer(src_text, truncation=True, padding="longest", return_tensors="pt").to(device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
assert (
    tgt_text[0]
    == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:

tgt_text[0]

"California's largest electricity provider has turned off power to hundreds of thousands of customers."

In [9]:
teststr = 'बहुत धन्यवाद'

In [10]:
s = teststr.split()

In [11]:
s 

['बहुत', 'धन्यवाद']

In [2]:
import fasttext
import fasttext.util
#fasttext.util.download_model('hi')
ft = fasttext.load_model('wiki.hi/wiki.hi.bin')
word = "नृत्य"
print("Embedding Shape is {}".format(ft.get_word_vector(word).shape))
print("Nearest Neighbors to {} are:".format(word))
ft.get_nearest_neighbors(word) 

Embedding Shape is (300,)
Nearest Neighbors to नृत्य are:


[(0.8913929462432861, 'नृत्य।'),
 (0.8440190553665161, 'नृत्यगान'),
 (0.8374733924865723, 'नृत्यगीत'),
 (0.8336297869682312, 'नृत्यों'),
 (0.8265783190727234, 'नृत्यरत'),
 (0.7971948385238647, 'नृत्यकला'),
 (0.7879464626312256, 'नृत्त'),
 (0.7682990431785583, 'नृतक'),
 (0.7622954845428467, 'नृत्यरचना'),
 (0.7602956295013428, 'नृत्यग्राम')]

In [22]:
type(ft.get_word_vector(s[0]))
#ft.get_nearest_neighbors(s[0])

numpy.ndarray

In [18]:
if s[0] in ft.words:
    print(1)

1


In [None]:
newlist = "नृत्य"

In [3]:
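import numpy as np
def get_average_vec(inp : str):
    """Average the fastText vectors of the in-vocabulary words in `inp`."""
    inplist = inp.split()
    vecdim = 300
    wordvec = np.zeros(vecdim)
    count=0
    for i in inplist:
        # Skip out-of-vocabulary tokens so they do not dilute the average.
        if i in ft.words:
            count+=1
            wordvec += ft.get_word_vector(i)

    # With no known words, return the zero vector instead of dividing by zero.
    if count == 0:
        return wordvec
    wordvec = wordvec/count
    return wordvec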
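The averaging step can be checked without loading the fastText model. The sketch below swaps in a hypothetical 3-dimensional lookup table (`toy_vectors`) for `ft`, and the function name `average_known_vectors` is likewise an assumption for illustration; it shows how out-of-vocabulary tokens are skipped and what happens when no token is known.

```python
import numpy as np

# Hypothetical 3-dimensional lookup table standing in for the fastText model.
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0]),
    "morning": np.array([0.0, 1.0, 0.0]),
}

def average_known_vectors(text, vectors, dim=3):
    """Mean of the vectors of the whitespace tokens found in `vectors`."""
    known = [vectors[tok] for tok in text.split() if tok in vectors]
    if not known:
        # No token is in the table: fall back to the zero vector.
        return np.zeros(dim)
    return np.mean(known, axis=0)

print(average_known_vectors("good morning everyone", toy_vectors))  # "everyone" is skipped
print(average_known_vectors("unknown words only", toy_vectors))
```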

In [4]:
strr = get_average_vec(teststr)
print(strr)
print()
print(strr.shape)

NameError: name 'teststr' is not defined

In [8]:
import numpy as np
from numpy.linalg import norm
# A = ft.get_word_vector(s[0])
# B = ft.get_word_vector(s[0])

# #cosine = np.dot(A,B)/(norm(A)*norm(B))
# cosine = np.dot(A,B)/(norm(A)*norm(B))
# check = norm(A - B)
# print(cosine)

In [31]:
print(check)

0.0


In [9]:
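def get_cosine_similarity(inputvector1, inputvector2):
    # Both inputs must be 300-dimensional fastText vectors.
    assert (inputvector1.size == 300 and inputvector2.size == 300)
    denom = norm(inputvector1)*norm(inputvector2)
    # Cosine is undefined for a zero vector; report 0.0 instead of dividing by zero.
    if denom == 0:
        return 0.0
    return np.dot(inputvector1, inputvector2)/denom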
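As a sanity check on the formula, cosine similarity should be 1 for parallel vectors and 0 for orthogonal ones. The sketch below is a standalone variant, not the notebook's function: the name `cosine_similarity` and the shape-equality check (in place of the hard-coded size of 300) are assumptions made so it can run on small vectors.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length 1-D vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    # Compare shapes rather than hard-coding 300, so any embedding size works.
    assert u.shape == v.shape and u.ndim == 1
    denom = norm(u) * norm(v)
    # A zero vector has no direction; return 0.0 rather than dividing by zero.
    return float(np.dot(u, v) / denom) if denom else 0.0

print(cosine_similarity([1, 0], [2, 0]))  # parallel vectors
print(cosine_similarity([1, 0], [0, 3]))  # orthogonal vectors
```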

In [48]:
A = ft.get_word_vector(s[0])
B = ft.get_word_vector(s[1])
similarity = get_cosine_similarity(A, B)

0.23954187

AssertionError: 

301

In [10]:
newlist1 = "नृत्य नृत्यगान"
newlist2 = "नृत्यगीत नृत्यरत"
get_vec_1 = get_average_vec(newlist1)
get_vec_2 = get_average_vec(newlist2)
similarity = get_cosine_similarity(get_vec_1, get_vec_2)
print(similarity)

0.9260828981095699


In [11]:
import pandas as pd

testingdf = pd.read_csv('test.csv/test.csv')

In [12]:
testingdf.head()

Unnamed: 0,headline,article
0,"पठानकोट पहुंचे PM मोदी, एयरबेस का जायजा ले बॉर...",प्रधानमंत्री नरेंद्र मोदी पठानकोट एयरबेस पहुंच...
1,सचिन ने देशवासियों को समर्पित किया अपना दोहरा शतक,सचिन तेंदुलकर ने एकदिवसीय अंतरराष्ट्रीय क्रिके...
2,एनआईए करेगी छत्तीसगढ़ में सुरक्षा खामियों की ज...,केंद्रीय गृह राज्य मंत्री आर. पी. एन. सिंह ने ...
3,सीधी बात: शाह बोले- हमारा बस चलता तो अब तक मं...,भारतीय जनता पार्टी (बीजेपी) के राष्ट्रीय अध्यक...
4,"ऋषभ पंत के पास यूनिक टैलेंट, उसके साथ छेड़छाड़ न...",ऋषभ पंत की कभी कभार इस बात के लिए आलोचना की जा...


In [16]:
testingdf['headline'][1]

'सचिन ने देशवासियों को समर्पित किया अपना दोहरा शतक'

In [17]:
ip1 = "प्रधानमंत्री नरेंद्र मोदी ने पठानकोट एयरबेस पहुंचकर सुरक्षा स्थिति की समीक्षा की है और वायुसेना के कर्मियों से मिलकर उनके साथीयों को संबोधित किया है। सुबह करीब 7:30 बजे, प्रधानमंत्री ने पंजाब के पठानकोट का संदर्भ लेकर यात्रा की। उन्होंने एयरबेस की निगरानी के बाद सीमाई क्षेत्रों का हवाई सर्वेक्षण भी करने का निर्णय लिया है। पिछले हफ्ते, पठानकोट एयरबेस पर आतंकियों ने हमला किया था, लेकिन उनकी कोशिश नाकाम रही और सभी 6 पाकिस्तानी आतंकी मार गए थे, जबकि 7 सुरक्षाबलें शहीद हुई थीं। भारत ने दोषियों के खिलाफ कड़ी कार्रवाई की मांग की है । सूचना के अनुसार, प्रधानमंत्री के साथ सेना और एयरफोर्स के चीफ भी मौजूद हो सकते हैं। आतंकियों के हमले के तीसरे दिन, रक्षा मंत्री मनोहर पर्रिकर ने भी पठानकोट का दौरा किया था और उन्होंने साफ तौर पर बताया कि आतंकी विदेश से आए थे और उनके पास पाकिस्तान से आए सामग्री मिली थी। इस मामले की जांच एनआईए को सौंपी गई है । इस बीच, पाकिस्तान के प्रधानमंत्री नवाज शरीफ ने भी भारत के द्वारा पठानकोट आतंकी हमले के संबंध में दी गई सबूतों के आधार पर जांच करने के आदेश दिए हैं। इस पर गुरुवार को उच्च स्तरीय बैठक बुलाई गई, जिसमें पठानकोट हमले पर चर्चा हुई। बैठक के बाद, नवाज शरीफ ने भारत के सबूतों के आधार पर जांच करने के आदेश दिए हैं । सूचना के अनुसार, शरीफ ने भारत की ओर से सौंपे गए सबूतों के आधार पर जांच कराए जाने के लिए सहमति जताई है और इसके बाद उन्होंने इंटेलिजेंस ब्यूरो के चीफ को कार्रवाई करने के लिए सौंप दिया है। बैठक में शरीफ ने भारत के साथ आतंकरोधी नीति के तहत सहयोग बढ़ाने के लिए भी अपनी तैयारी जताई है। इस मौके पर वह अपने राष्ट्रीय सुरक्षा सलाहकार नासिर खान जंजुआ से भारत के एनएसए अजित डोभाल से संपर्क बनाए रखने का आदेश भी दिया है"
ip2 = testingdf['headline'][1]

ip1vec = get_average_vec(ip1)
ip2vec = get_average_vec(ip2)

In [18]:
similarity = get_cosine_similarity(ip1vec, ip2vec)
print(similarity)

0.7469793800194328
