# Extractive Summarizer

## Importing Libraries

In [21]:
import numpy as np 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize,word_tokenize
from bs4 import BeautifulSoup
import requests
import re

## Download NLTK Packages
<div>
    <p>The following NLTK Packages are required for the processing of the texts:</p>
    <ul>
        <li><b>wordnet</b>: WordNet is a lexical database of English. It helps in finding the conceptual relationships between words such as hypernyms, hyponyms, synonyms, antonyms etc.</li>
        <li><b>punkt</b>: This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.</li>
        <li><b>stopwords</b>: Stop words are words that appear very frequently in a language or corpus (for example, "the", "is" and "and"). They carry little meaning on their own, so they are removed in many NLP tasks.</li>
    </ul>
</div>

In [22]:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\aashi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aashi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aashi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Assemble Article (Basic Scraping)
<div>
    <p>In this method, the following procedure is being used to assemble the article:</p>
    <ol>
        <li>Take the title of a Wikipedia page as input from the user.</li>
        <li>Create the link for the Wikipedia page.</li>
        <li>Send a request to the Wikipedia page to obtain the HTML page.</li>
        <li>Parse the HTML page.</li>
        <li>Obtain all the paragraphs from the page.</li>
        <li>Create the article by concatenating the paragraphs with a space (" ") between them.</li>
        <li>Return the article.</li>
    </ol>
</div>

In [23]:
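def _input(topic):
	article = ""
	# Build the page URL; Wikipedia titles use underscores instead of spaces.
	link = "https://en.wikipedia.org/wiki/" + topic.strip().replace(" ","_")
	page = requests.get(link)
	page.raise_for_status()  # stop early if the page could not be fetched
	content = BeautifulSoup(page.content,'html.parser')
	# Concatenate the text of every <p> element, each followed by a space.
	paragraphs = content.find_all('p')
	for paragraph in paragraphs:
		article+= paragraph.text+" "
	print("\n\n\n\nArticle: {}".format(article))
	return article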
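
The paragraph-joining step can be tried offline, without a network request. The following is a minimal sketch, not the notebook's method: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without third-party packages, and `ParagraphCollector`, `assemble_article` and `sample_html` are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collects the text inside every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # greater than 0 while inside a <p> element
        self.paragraphs = []  # text of each paragraph seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text of nested tags such as <b> is kept; text outside <p> is ignored.
        if self.depth > 0:
            self.paragraphs[-1] += data

def assemble_article(html):
    collector = ParagraphCollector()
    collector.feed(html)
    # Each paragraph is followed by a single space, as in the steps above.
    return "".join(text + " " for text in collector.paragraphs)

# Hand-written sample page (an assumption, not real Wikipedia markup).
sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<p>Second paragraph.</p></body></html>"
)
article = assemble_article(sample_html)
print(repr(article))
```

The heading text is dropped because only `<p>` elements are collected, which mirrors the effect of `find_all('p')`.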


## Cleaning the Article
<div>
    <p>Procedure for cleaning the article:</p>
    <ol>
        <li>Create a lemmatizer. (Lemmatization is the process of reducing a word to its base or dictionary form, known as the lemma.)</li>
        <li>Convert every sentence to lowercase, because 'A' and 'a' are otherwise interpreted as different characters.</li>
        <li>For each sentence, replace every non-alphabetic character with a space using a regular expression.</li>
        <li>Split each sentence into a list of words.</li>
        <li>Remove all the stopwords.</li>
        <li>Lemmatize each remaining word in the sentence.</li>
        <li>Join the words to form the sentence again.</li>
        <li>Return the list of cleaned, lemmatized sentences.</li>
    </ol>
</div>

In [24]:
def clean(sentences):
	lemmatizer = WordNetLemmatizer()
	# Build the stopword set once instead of once per word.
	stop_words = set(stopwords.words('english'))
	cleaned_sentences = []
	for sentence in sentences:
		sentence = sentence.lower()
		# Replace every character that is not a letter with a space.
		sentence = re.sub(r'[^a-zA-Z]',' ',sentence)
		sentence = sentence.split()
		sentence = [lemmatizer.lemmatize(word) for word in sentence if word not in stop_words]
		sentence = ' '.join(sentence)
		cleaned_sentences.append(sentence)
	print("\n\n\n\nCleaned Sentences: {}".format(cleaned_sentences))
	return cleaned_sentences
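
To see what the lowercase, regular-expression and stopword steps do to a sentence, here is a small offline sketch. It makes two assumptions that differ from the cell above: `TOY_STOPWORDS` is a tiny hand-picked stand-in for the NLTK stopword list, and lemmatization is skipped so that no NLTK data is needed. `toy_clean` is a hypothetical name.

```python
import re

# Tiny stand-in for stopwords.words('english'); an assumption for this sketch.
TOY_STOPWORDS = {"the", "is", "of", "a", "an", "and", "it", "in", "to"}

def toy_clean(sentences, stop_words=TOY_STOPWORDS):
    cleaned = []
    for sentence in sentences:
        sentence = sentence.lower()
        # Digits, punctuation and citation markers such as [1] become spaces.
        sentence = re.sub(r"[^a-z]", " ", sentence)
        words = [word for word in sentence.split() if word not in stop_words]
        cleaned.append(" ".join(words))
    return cleaned

sample = ["AI is the intelligence of machines.[1]", "It was founded in 1956!"]
cleaned_sample = toy_clean(sample)
print(cleaned_sample)  # ['ai intelligence machines', 'was founded']
```

Note that "machines" stays plural here; in the notebook the lemmatizer would additionally reduce such words to their base form.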

## Calculating Probability
<div>
    <p>Procedure for calculating the Probability:</p>
    <ol>
        <li>Split the cleaned sentences into words using the NLTK library.</li>
        <li>Calculate the number of occurrences of each word.</li>
        <li>Calculate the probability of each word by dividing the number of occurrences of that word by the total number of words.</li>
        <li>Create and return the dictionary of probabilities.</li>
    </ol>
</div>

In [25]:
def init_probability(sentences):
	probability_dict = {}
	# Tokenize the whole cleaned article; '.' only marks sentence boundaries.
	words = [word for word in word_tokenize('. '.join(sentences)) if word!='.']
	# Total number of words, not the number of distinct words.
	total_words = len(words)
	for word in words:
		probability_dict[word] = probability_dict.get(word,0) + 1

	for word,count in probability_dict.items():
		probability_dict[word] = count/total_words

	return probability_dict
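
As a worked example of the written procedure (occurrences of a word divided by the total number of words), the sketch below computes the probabilities for three already-cleaned toy sentences. It uses `str.split` instead of `word_tokenize` so that it runs without NLTK data; `word_probabilities` and the toy sentences are assumptions made for illustration.

```python
from collections import Counter

def word_probabilities(cleaned_sentences):
    # Flatten the sentences into one list of words.
    words = [word for sentence in cleaned_sentences for word in sentence.split()]
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    # Probability of a word = its count / total number of words.
    return {word: count / total for word, count in counts.items()}

toy = ["machine learning", "machine intelligence", "deep learning machine"]
probs = word_probabilities(toy)
for word, p in probs.items():
    print(f"{word}: {p:.3f}")
```

There are 7 words in total, so "machine" (3 occurrences) gets 3/7, "learning" gets 2/7, and the probabilities sum to 1.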

In [26]:
def update_probability(probability_dict,word):
	if probability_dict.get(word):
		probability_dict[word] = probability_dict[word]**2
	return probability_dict

In [27]:
def average_sentence_weights(sentences,probability_dict):
	sentence_weights = {}
	for index,sentence in enumerate(sentences):
		# Each cleaned sentence is a string, so split it into words first;
		# iterating over the string directly would yield single characters.
		words = word_tokenize(sentence)
		if len(words) != 0:
			average_proba = sum([probability_dict[word] for word in words if word in probability_dict])
			average_proba /= len(words)
			sentence_weights[index] = average_proba
	return sentence_weights
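
The weight of a sentence is meant to be the average probability of its words. The sketch below works through that arithmetic on toy values; `sentence_weight`, `toy_probs` and the toy sentences are hypothetical, and `str.split` stands in for `word_tokenize`.

```python
def sentence_weight(sentence, probabilities):
    words = sentence.split()
    if not words:
        return None  # empty sentences get no weight
    # Words missing from the dictionary contribute 0 to the sum.
    return sum(probabilities.get(word, 0.0) for word in words) / len(words)

toy_probs = {"machine": 0.5, "learning": 0.25, "deep": 0.25}
toy_sentences = ["machine learning", "deep learning machine", ""]

weights = {}
for index, sentence in enumerate(toy_sentences):
    weight = sentence_weight(sentence, toy_probs)
    if weight is not None:
        weights[index] = weight

print(weights)
```

The first sentence scores (0.5 + 0.25) / 2 = 0.375 and the second (0.25 + 0.25 + 0.5) / 3 = 1/3; the empty sentence is skipped, so index 2 is absent from the result.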


## Generate Summary
<div>
    <p>Procedure for generating the extractive summary:</p>
    <ol>
        <li>Repeat the following steps until the summary contains the required number of sentences.</li>
        <li>Find the word with the highest probability using <code>max(probability_dict, key=probability_dict.get)</code>; the <code>key</code> argument makes <code>max</code> compare the words by their probability.</li>
        <li>Enumerate the cleaned Article and collect the indices of the sentences that contain this word.</li>
        <li>Rank these sentences by their weights and take the sentence with the highest weight.</li>
        <li>Add the corresponding original (uncleaned) sentence to the summary.</li>
        <li>Update the probabilities: square the probability of every word in the selected sentence, which reduces the influence of words that are already covered.</li>
        <li>Return the Summary.</li>
    </ol>
</div>

In [28]:
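def generate_summary(sentence_weights,probability_dict,cleaned_article,tokenized_article,summary_length = 30):
	summary = ""
	selected = set()  # indices of sentences already added to the summary
	# A summary cannot contain more sentences than have been scored.
	summary_length = min(summary_length,len(sentence_weights))
	while len(selected) < summary_length and probability_dict:
		highest_probability_word = max(probability_dict,key=probability_dict.get)
		# Scored, not yet selected sentences that contain the most probable word.
		sentences_with_max_word = [index for index,sentence in enumerate(cleaned_article) if index in sentence_weights and index not in selected and highest_probability_word in set(word_tokenize(sentence))]
		if not sentences_with_max_word:
			# No sentence left for this word: discard it and try the next word.
			del probability_dict[highest_probability_word]
			continue
		best_index = max(sentences_with_max_word,key=sentence_weights.get)
		selected.add(best_index)
		summary += tokenized_article[best_index] + "\n"
		# Square the probability of every word in the chosen sentence.
		for word in word_tokenize(cleaned_article[best_index]):
			probability_dict = update_probability(probability_dict,word)
	return summary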
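
The selection loop can be traced on a toy article without NLTK. The following is a sketch under stated assumptions, not the notebook's exact code: `toy_summary` is a hypothetical function, the cleaned sentences are written by hand, `str.split` replaces `word_tokenize`, and the sketch skips sentences that were already chosen.

```python
from collections import Counter

def toy_summary(original, cleaned, length):
    tokens = [sentence.split() for sentence in cleaned]
    all_words = [word for sentence in tokens for word in sentence]
    total = len(all_words)
    probabilities = {w: c / total for w, c in Counter(all_words).items()}
    # Average word probability per non-empty sentence, computed once.
    weights = {i: sum(probabilities[w] for w in t) / len(t)
               for i, t in enumerate(tokens) if t}
    chosen = []
    target = min(length, len(weights))
    while len(chosen) < target and probabilities:
        top_word = max(probabilities, key=probabilities.get)
        candidates = [i for i in weights
                      if i not in chosen and top_word in tokens[i]]
        if not candidates:
            # Every sentence with this word is already used: drop the word.
            del probabilities[top_word]
            continue
        best = max(candidates, key=weights.get)
        chosen.append(best)
        # Square the probability once per word occurrence in the sentence.
        for word in tokens[best]:
            if word in probabilities:
                probabilities[word] **= 2
    return [original[i] for i in chosen]

original = ["AI machines learn.", "AI uses machines.",
            "AI enables deep learning.", "Cats sleep."]
cleaned = ["ai machine learn", "ai machine", "ai deep learn", "cat"]
summary_sentences = toy_summary(original, cleaned, 2)
print(summary_sentences)  # ['AI uses machines.', 'AI machines learn.']
```

In the first round "ai" is the most probable word (3/9) and "ai machine" has the highest average weight (5/18). Squaring then lowers "ai" and "machine", so "learn" (2/9) becomes the top word in the second round and "ai machine learn" is chosen.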

## Executing the Entire Model
<div>
    <p>The steps followed in the execution are:</p>
    <ol>
        <li>Take the name of the Wikipedia page as input from User.</li>
        <li>Generate the Article by parsing the Text from the Page.</li>
        <li>Take the number of required sentences from the user.</li>
        <li>Tokenize the Article for cleaning and further processing.</li>
        <li>Clean the Article (for now a basic procedure is used)</li>
        <li>Calculate the probability of each word in the cleaned Article.</li>
        <li>Calculate the average weight/probability for each sentence in the Article for selection.</li>
        <li>Generate Summary for the Clean and Tokenized Article based on the Sentence Weights and required length.</li>
    </ol>
</div>

In [31]:
def main():
	topic = input("Enter the title of the Wikipedia article to be scraped----->")
	article = _input(topic)
	required_length = int(input("Enter the number of required sentences: "))
	tokenized_article = sent_tokenize(article)
	cleaned_article = clean(tokenized_article) 
	probability_dict = init_probability(cleaned_article)
	sentence_weights = average_sentence_weights(cleaned_article,probability_dict)
	summary = generate_summary(sentence_weights,probability_dict,cleaned_article,tokenized_article,required_length)
	print("\n\n\n\n\n\nSummary: {}".format(summary))

In [32]:
if __name__ == "__main__":
	main()





Article: 
 Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of humans or animals. It is a field of study in computer science which develops and studies intelligent machines. It may also refer to the intelligent machines themselves.
 AI technology is widely used throughout industry, government and science. Some high-profile applications are: advanced web search engines (e.g., Google Search), recommendation systems (used by YouTube, Amazon, and Netflix), understanding human speech (such as Google Assistant, Siri, and Alexa), self-driving cars (e.g., Waymo), generative and creative tools (ChatGPT and AI art), and superhuman play and analysis in strategy games (such as chess and Go).[1]
 Artificial intelligence was founded as an academic discipline in 1956.[2] The field went through multiple cycles of optimism[3][4] followed by disappointment and loss of funding,[5][6] but after 2012, when deep learning surpassed all previous AI t