Text Summarizing

In [160]:
from summarizer import Summarizer
from transformers import logging, AutoConfig, AutoTokenizer, AutoModel
logging.set_verbosity_error()
import warnings
warnings.filterwarnings("ignore")

# alternate: bert-large-uncased (better summary)
custom_config = AutoConfig.from_pretrained('mrm8488/bert-tiny-finetuned-squadv2')
custom_config.output_hidden_states=True
custom_tokenizer = AutoTokenizer.from_pretrained('mrm8488/bert-tiny-finetuned-squadv2')
custom_model = AutoModel.from_pretrained('mrm8488/bert-tiny-finetuned-squadv2', config=custom_config)

model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)

In [181]:
f = open("sample.txt","r")
full_text = f.read()
result = model(full_text, min_length=60, max_length = 500 , ratio = 0.4)
summarized_text = ''.join(result)
print (summarized_text)

Robin van Persie (Dutch pronunciation: [ˈrɔbɪɱ vɑm ˈpɛrsi] (listen); born 6 August 1983) is a Dutch football coach and former professional footballer who played as a striker. Regarded as one of the best strikers of his generation, Van Persie was known for his excellent technique and ball control, intelligent positioning, and vision. He is the all-time top scorer for the Netherlands national team. Van Persie was converted to a striker by manager Arsène Wenger and went on to be a consistent goalscorer for Arsenal. He scored a club record of 35 goals in 2011 and was club captain for the 2011–12 season, prior to joining rivals Manchester United in July 2012. In his first season, he won the Premier League and his second consecutive Premier League Golden Boot.\nAfter two injury-hit seasons followed, Van Persie fell out of favour at United and he was allowed to leave for Fenerbahçe in July 2015. After representing the Netherlands at under-17, under-19 and under-21 level, Van Persie made his s

In [143]:
full_text

'India has a very ancient tradition of art, which has exchanged many influences with the rest of Eurasia, especially in the first millennium, when Buddhist art spread with Indian religions to Central, East and South-East Asia, the last also greatly influenced by Hindu art.[377] Thousands of seals from the Indus Valley Civilization of the third millennium BCE have been found, usually carved with animals, but a few with human figures. The "Pashupati" seal, excavated in Mohenjo-daro, Pakistan, in 1928–29, is the best known.[378][379] After this there is a long period with virtually nothing surviving.[379][380] Almost all surviving ancient Indian art thereafter is in various forms of religious sculpture in durable materials, or coins. There was probably originally far more in wood, which is lost. In north India Mauryan art is the first imperial movement.[381][382][383] In the first millennium CE, Buddhist art spread with Indian religions to Central, East and South-East Asia, the last also 

extract N (best) nouns from text

In [162]:
import nltk
nltk.download('stopwords')

import pke
import string
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/chiragmanjeshwar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [176]:
def get_nouns(text, X=20):
  text = text.lower()
  
  extractor = pke.unsupervised.MultipartiteRank()
  stoplist = list(string.punctuation)
  stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
  stoplist += stopwords.words('english')
  extractor.load_document(input=text, stoplist=stoplist)

  pos = {'PROPN'}

  extractor.candidate_selection(pos=pos)
  extractor.candidate_weighting(alpha=1.1,
                                  threshold=0.75,
                                  method='average')
  nouns = [i[0] for i in extractor.get_n_best(X) if i[0] in text]

  return nouns

In [164]:
filtered_keys = get_nouns(full_text)
filtered_keys

['hindu',
 'ellora',
 'karla',
 'east',
 'asia',
 'india',
 'ajanta',
 'eurasia',
 'millennium ce',
 'north india mauryan',
 'shiva',
 'prana',
 'pakistan',
 'nataraja',
 'hindu sculpture',
 'british']

In [165]:

nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from flashtext import KeywordProcessor
def tokenize_sentences(text):
    sentences = [sent_tokenize(text)]
    sentences = [y for x in sentences for y in x]
    sentences = [sentence.strip() for sentence in sentences if 20 < len(sentence) < 300]
    return sentences
def get_sentences_for_keyword(keywords, sentences):
    keyword_processor = KeywordProcessor()
    keyword_sentences = {}
    for word in keywords:
        keyword_sentences[word] = []
        keyword_processor.add_keyword(word)
    for sentence in sentences:
        keywords_found = keyword_processor.extract_keywords(sentence)
        for key in keywords_found:
            keyword_sentences[key].append(sentence)
    for key in keyword_sentences.keys():
        values = keyword_sentences[key]
        values = sorted(values, key=len, reverse=True)
        keyword_sentences[key] = values
    return keyword_sentences
sentences = tokenize_sentences(summarized_text)
keyword_sentence_mapping = get_sentences_for_keyword(get_nouns(full_text), sentences)    
print (keyword_sentence_mapping)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/chiragmanjeshwar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


{'hindu': ['India has a very ancient tradition of art, which has exchanged many influences with the rest of Eurasia, especially in the first millennium, when Buddhist art spread with Indian religions to Central, East and South-East Asia, the last also greatly influenced by Hindu art.'], 'ellora': [], 'karla': [], 'east': ['India has a very ancient tradition of art, which has exchanged many influences with the rest of Eurasia, especially in the first millennium, when Buddhist art spread with Indian religions to Central, East and South-East Asia, the last also greatly influenced by Hindu art.', 'India has a very ancient tradition of art, which has exchanged many influences with the rest of Eurasia, especially in the first millennium, when Buddhist art spread with Indian religions to Central, East and South-East Asia, the last also greatly influenced by Hindu art.'], 'asia': ['India has a very ancient tradition of art, which has exchanged many influences with the rest of Eurasia, especial

In [185]:
def unique_keys(keyword_sentence_mapping):
  sent_set = set()
  for key in keyword_sentence_mapping.keys():
    remove_list = []
    for sent in keyword_sentence_mapping[key]:
      print(sent)
      if sent not in sent_set:
        sent_set.add(sent)
      else:
        remove_list.append(sent)
    for sent in remove_list:
      keyword_sentence_mapping[key].remove(sent)
        # print('removing')
        # print(keyword_sentence_mapping)
        # keyword_sentence_mapping[key].remove(sent)
        # print(keyword_sentence_mapping)
  print(sent_set)
  return keyword_sentence_mapping

def get_unique_sentence_mapping(summarized_text):
    
  sentences = tokenize_sentences(summarized_text)
  keyword_sentence_mapping = get_sentences_for_keyword(get_nouns(full_text), sentences)
  keyword_sentence_mapping = unique_keys(keyword_sentence_mapping)
  
  keyword_sentence_mapping = {k: v for k, v in keyword_sentence_mapping.items() if v}
  return keyword_sentence_mapping

  


In [186]:
keyword_sentence_mapping = get_unique_sentence_mapping(summarized_text)

Robin van Persie (Dutch pronunciation: [ˈrɔbɪɱ vɑm ˈpɛrsi] (listen); born 6 August 1983) is a Dutch football coach and former professional footballer who played as a striker.
In his first season, he won the Premier League and his second consecutive Premier League Golden Boot.\nAfter two injury-hit seasons followed, Van Persie fell out of favour at United and he was allowed to leave for Fenerbahçe in July 2015.
On 17 August, Van Persie transferred to Manchester United for an initial £22.5 million, with an additional £1.5 million to follow if United won a Premier League or Champions League title within the next four years.
In the dying minutes of injury time, Marouane Fellaini\'s header from Ángel Di María\'s free-kick was saved by Thibaut Courtois, and Van Persie smashed in the rebound to equalise.
With Arsenal in unpredictable form, Van Persie was again of supreme importance for Arsenal, this time in the North London derby against Tottenham Hotspur, played on 26 February.
Two days afte

In [183]:
keyword_sentence_mapping

{'robin van persie': ['Robin van Persie (Dutch pronunciation: [ˈrɔbɪɱ vɑm ˈpɛrsi] (listen); born 6 August 1983) is a Dutch football coach and former professional footballer who played as a striker.'],
 'van persie': ['In his first season, he won the Premier League and his second consecutive Premier League Golden Boot.\\nAfter two injury-hit seasons followed, Van Persie fell out of favour at United and he was allowed to leave for Fenerbahçe in July 2015.',
  'On 17 August, Van Persie transferred to Manchester United for an initial £22.5 million, with an additional £1.5 million to follow if United won a Premier League or Champions League title within the next four years.',
  "In the dying minutes of injury time, Marouane Fellaini\\'s header from Ángel Di María\\'s free-kick was saved by Thibaut Courtois, and Van Persie smashed in the rebound to equalise.",
  'With Arsenal in unpredictable form, Van Persie was again of supreme importance for Arsenal, this time in the North London derby ag

In [187]:
keyword_sentence_mapping

{'robin van persie': ['Robin van Persie (Dutch pronunciation: [ˈrɔbɪɱ vɑm ˈpɛrsi] (listen); born 6 August 1983) is a Dutch football coach and former professional footballer who played as a striker.'],
 'van persie': ['In his first season, he won the Premier League and his second consecutive Premier League Golden Boot.\\nAfter two injury-hit seasons followed, Van Persie fell out of favour at United and he was allowed to leave for Fenerbahçe in July 2015.',
  'On 17 August, Van Persie transferred to Manchester United for an initial £22.5 million, with an additional £1.5 million to follow if United won a Premier League or Champions League title within the next four years.',
  "In the dying minutes of injury time, Marouane Fellaini\\'s header from Ángel Di María\\'s free-kick was saved by Thibaut Courtois, and Van Persie smashed in the rebound to equalise.",
  'With Arsenal in unpredictable form, Van Persie was again of supreme importance for Arsenal, this time in the North London derby ag

In [188]:
import requests
import re
import random
from pywsd.similarity import max_similarity
from pywsd.lesk import adapted_lesk
from pywsd.lesk import simple_lesk
from pywsd.lesk import cosine_lesk
from nltk.corpus import wordnet as wn

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Distractors from Wordnet
def get_distractors_wordnet(syn,word):
    distractors=[]
    word= word.lower()
    orig_word = word
    if len(word.split())>0:
        word = word.replace(" ","_")
    hypernym = syn.hypernyms()
    if len(hypernym) == 0: 
        return distractors
    for item in hypernym[0].hyponyms():
        name = item.lemmas()[0].name()
        #print ("name ",name, " word",orig_word)
        if name == orig_word:
            continue
        name = name.replace("_"," ")
        name = " ".join(w.capitalize() for w in name.split())
        if name is not None and name not in distractors:
            distractors.append(name)
    return distractors

def get_wordsense(sent,word):
    word= word.lower()
    
    if len(word.split())>0:
        word = word.replace(" ","_")
    
    
    synsets = wn.synsets(word,'n')
    if synsets:
        wup = max_similarity(sent, word, 'wup', pos='n')
        adapted_lesk_output =  adapted_lesk(sent, word, pos='n')
        lowest_index = min (synsets.index(wup),synsets.index(adapted_lesk_output))
        return synsets[lowest_index]
    else:
        return None

# Distractors from http://conceptnet.io/
def get_distractors_conceptnet(word):
    word = word.lower()
    original_word= word
    if (len(word.split())>0):
        word = word.replace(" ","_")
    distractor_list = [] 
    url = "http://api.conceptnet.io/query?node=/c/en/%s/n&rel=/r/PartOf&start=/c/en/%s&limit=5"%(word,word)
    obj = requests.get(url).json()

    for edge in obj['edges']:
        link = edge['end']['term'] 

        url2 = "http://api.conceptnet.io/query?node=%s&rel=/r/PartOf&end=%s&limit=10"%(link,link)
        obj2 = requests.get(url2).json()
        for edge in obj2['edges']:
            word2 = edge['start']['label']
            if word2 not in distractor_list and original_word.lower() not in word2.lower():
                distractor_list.append(word2)
                   
    return distractor_list

key_distractor_list = {}

for keyword in keyword_sentence_mapping:
    wordsense = get_wordsense(keyword_sentence_mapping[keyword][0],keyword)
    if wordsense:
        distractors = get_distractors_wordnet(wordsense,keyword)
        if len(distractors) ==0:
            distractors = get_distractors_conceptnet(keyword)
        if len(distractors) != 0:
            key_distractor_list[keyword] = distractors
    else:
        
        distractors = get_distractors_conceptnet(keyword)
        if len(distractors) != 0:
            key_distractor_list[keyword] = distractors

index = 1
print ("#############################################################################")
print ("NOTE::::::::  Since the algorithm might have errors along the way, wrong answer choices generated might not be correct for some questions. ")
print ("#############################################################################\n\n")
for each in key_distractor_list:
    sentence = keyword_sentence_mapping[each][0]
    pattern = re.compile(each, re.IGNORECASE)
    output = pattern.sub( " _______ ", sentence)
    print ("%s)"%(index),output)
    choices = [each.capitalize()] + key_distractor_list[each]
    top4choices = choices[:4]
    random.shuffle(top4choices)
    optionchoices = ['a','b','c','d']
    for idx,choice in enumerate(top4choices):
        print ("\t",optionchoices[idx],")"," ",choice)
    print ("\nMore options: ", choices[4:20],"\n\n")
    index = index + 1
    

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/chiragmanjeshwar/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/chiragmanjeshwar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#############################################################################
NOTE::::::::  Since the algorithm might have errors along the way, wrong answer choices generated might not be correct for some questions. 
#############################################################################


1) In the next Premier  _______  game against Chelsea at Stamford Bridge, he scored his seventh goal in the  _______ , with a sidefoot finish from Antonio Valencia\'s driven cross to put United 2–0 up.
	 a )   Archine
	 b )   League
	 c )   Body Length
	 d )   Astronomy Unit

More options:  ['Cable', 'Chain', 'Cicero', 'Cubit', 'Em', 'En', 'Fathom', 'Finger', 'Fistmele', 'Foot', 'Footer', 'Furlong', 'Geometric Pace', 'Half Mile', 'Handbreadth', 'Head'] 


2) He made his debut on 20  _______ , coming on as a 68th-minute substitute for Danny Welbeck in a 1–0 loss to Everton.
	 a )   August
	 b )   April
	 c )   December
	 d )   August

More options:  ['February', 'January', 'July', 'June', 'Marc

In [116]:
full_text

'India has a very ancient tradition of art, which has exchanged many influences with the rest of Eurasia, especially in the first millennium, when Buddhist art spread with Indian religions to Central, East and South-East Asia, the last also greatly influenced by Hindu art.[377] Thousands of seals from the Indus Valley Civilization of the third millennium BCE have been found, usually carved with animals, but a few with human figures. The "Pashupati" seal, excavated in Mohenjo-daro, Pakistan, in 1928–29, is the best known.[378][379] After this there is a long period with virtually nothing surviving.[379][380] Almost all surviving ancient Indian art thereafter is in various forms of religious sculpture in durable materials, or coins. There was probably originally far more in wood, which is lost. In north India Mauryan art is the first imperial movement.[381][382][383] In the first millennium CE, Buddhist art spread with Indian religions to Central, East and South-East Asia, the last also 

In [84]:
summarized_text

''