# Document Summarization on Wikipedia Articles Using Python

In [1]:
# Backend
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [2]:
pip install beautifulsoup4



In [3]:
pip install lxml



In [4]:
# Core Libraries
import bs4 as bs
import urllib.request
import re

# Indirect requirements
import pandas as pd
import matplotlib.pyplot as plt
import io
import unicodedata
import numpy as np
import string

## Fetching Articles from Wikipedia

In [5]:
# Scraping the data and loading it from the URL

wikipedia_article = urllib.request.urlopen('https://en.wikipedia.org/wiki/Prabhas_filmography_and_awards')  # Open the URL of the Wikipedia article on Prabhas's filmography and awards
article = wikipedia_article.read() # Load the raw page content, including all markup and unwanted characters
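
A possible variation on the fetch step, offered only as a sketch: some servers expect a descriptive `User-Agent` header, and a `with` block closes the connection once the page is read. The helper names `build_request` and `fetch_html`, the agent string, and the timeout value are all assumptions for illustration and are not part of the notebook.

```python
import urllib.request

def build_request(url, user_agent="summarizer-notebook/0.1"):
    """Build a request that identifies the client; the agent string is a placeholder."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=10):
    """Download the page and close the connection when done (not called here)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Building the request does not touch the network.
request = build_request("https://en.wikipedia.org/wiki/Prabhas_filmography_and_awards")
print(request.full_url)
```

Calling `fetch_html(...)` with the same URL would return the raw bytes that the notebook stores in `article`.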

## Preprocessing of the Data

In [6]:
parsed_article = bs.BeautifulSoup(article,'lxml') # Parse the HTML with BeautifulSoup using the lxml parser

paragraphs = parsed_article.find_all('p') # Collect every <p> element in the article

article_text = ""

for p in paragraphs:
    article_text += p.text

In [7]:
# Viewing content with symbols
article_text

"\nPrabhas is an Indian actor who works predominantly in Telugu cinema. One of the highest-paid actors in Indian cinema,[1] Prabhas has featured in Forbes India's Celebrity 100 list three times since 2015 based on his income and popularity.[2][3][4] He has received seven Filmfare Awards South nominations and is a recipient of the Nandi Award and the SIIMA Award.\nPrabhas made his acting debut with the 2002 Telugu drama Eeswar, and later attained his breakthrough with the romantic action film Varsham (2004). His notable works include Chatrapathi (2005), Bujjigadu (2008), Billa (2009), Darling (2010), Mr. Perfect (2011), and Mirchi (2013). He won the state Nandi Award for Best Actor for his performance in Mirchi.[5][6] In 2015, Prabhas starred in the title role in S. S. Rajamouli's epic action film Baahubali: The Beginning, which is the fourth-highest-grossing Indian film to date. He later reprised his role in its sequel, Baahubali 2: The Conclusion (2017), which became the first Indian 

In [8]:
# Dropping unwanted characters and spaces

article_text = re.sub(r'\[[0-9]*\]', ' ', article_text) # Remove citation markers such as [1]
article_text = re.sub(r'\s+', ' ', article_text) # Collapse runs of whitespace into one space
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text ) # Keep letters only, for word counting
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
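
To see what each substitution does, here is a small sketch that applies the same patterns to a short string. The sample sentence is hypothetical and not taken from the article.

```python
import re

# Hypothetical sample text, not taken from the article.
sample = "\nAda wrote the first program.[1][2]  It ran in 1843\n on paper."

# Step 1: replace bracketed numeric citation markers such as [1] with a space.
no_citations = re.sub(r'\[[0-9]*\]', ' ', sample)

# Step 2: collapse any run of whitespace (including newlines) into one space.
readable = re.sub(r'\s+', ' ', no_citations)

# Step 3: for word counting only, replace every non-letter with a space,
# then collapse the whitespace again.
letters_only = re.sub(r'[^a-zA-Z]', ' ', readable)
letters_only = re.sub(r'\s+', ' ', letters_only)

print(repr(readable))
print(repr(letters_only))
```

The first result keeps punctuation and digits, which sentence tokenization needs; the second drops them, which is why years and list positions vanish from the processed text.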

In [9]:
# Viewing processed data without symbols
formatted_article_text

' Prabhas is an Indian actor who works predominantly in Telugu cinema One of the highest paid actors in Indian cinema Prabhas has featured in Forbes India s Celebrity list three times since based on his income and popularity He has received seven Filmfare Awards South nominations and is a recipient of the Nandi Award and the SIIMA Award Prabhas made his acting debut with the Telugu drama Eeswar and later attained his breakthrough with the romantic action film Varsham His notable works include Chatrapathi Bujjigadu Billa Darling Mr Perfect and Mirchi He won the state Nandi Award for Best Actor for his performance in Mirchi In Prabhas starred in the title role in S S Rajamouli s epic action film Baahubali The Beginning which is the fourth highest grossing Indian film to date He later reprised his role in its sequel Baahubali The Conclusion which became the first Indian film ever to gross over crore US million in just ten days and is the second highest grossing Indian film to date Prabhas

## Performing Text Tokenization

In [10]:
sentence_list = nltk.sent_tokenize(article_text) # Using NLTK's Punkt tokenizer to split the text into sentences

In [11]:
sentence_list[:3] # Viewing few sentences

[' Prabhas is an Indian actor who works predominantly in Telugu cinema.',
 "One of the highest-paid actors in Indian cinema, Prabhas has featured in Forbes India's Celebrity 100 list three times since 2015 based on his income and popularity.",
 'He has received seven Filmfare Awards South nominations and is a recipient of the Nandi Award and the SIIMA Award.']

## Weighting the Frequency of Words

In [12]:
stopwords = nltk.corpus.stopwords.words('english') # Loading the English list; change to another language as required

# Counting how often each non-stop word occurs
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords: # Skip stop words (case-sensitive: the list is lowercase)
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
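
The counting loop can also be written with `collections.Counter`. The sketch below is an alternative under stated assumptions, not the notebook's method: it uses a tiny hypothetical stop word set, splits on whitespace instead of calling `nltk.word_tokenize`, and lowercases the text first so that "The" and "the" are filtered alike.

```python
from collections import Counter

# Hypothetical stand-ins for the NLTK stop word list and the cleaned text.
stop_words = {"the", "is", "a", "of", "and", "in"}
text = "The film is a sequel and the film is a hit in the south"

# Lowercase before filtering so capitalized stop words are removed too.
tokens = text.lower().split()
word_counts = Counter(token for token in tokens if token not in stop_words)

print(word_counts.most_common(2))
```

Lowercasing at this stage also keeps the keys consistent with any later lookup that lowercases the sentences.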

In [13]:
maximum_frequncy = max(word_frequencies.values()) # Count of the most frequently occurring word

In [14]:
maximum_frequncy

10

In [15]:
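# Note: max() on a dict compares its keys, so this returns the word that sorts last in string order,
# not the most frequent word. Use max(word_frequencies, key=word_frequencies.get) for the latter.
most_frequent_word = max(word_frequencies)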

In [16]:
most_frequent_word

'works'
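
Calling `max()` directly on a dictionary compares the keys as strings, which is consistent with the `'works'` result above. A minimal sketch of the difference, using hypothetical counts chosen so the two calls disagree:

```python
# Hypothetical counts, not the notebook's actual values.
counts = {"film": 10, "works": 1, "actor": 4}

last_key = max(counts)                        # compares the keys themselves
most_frequent = max(counts, key=counts.get)   # compares the stored counts

print(last_key, most_frequent)
```

Passing `key=counts.get` makes `max()` rank the words by their counts instead of by their spelling.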

In [17]:
# Normalizing every count by the highest frequency, so weights fall between 0 and 1
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)
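
As a worked example of this normalization, the sketch below divides each count by the largest one. The counts are hypothetical.

```python
# Hypothetical raw counts.
counts = {"film": 10, "actor": 4, "works": 1}

highest = max(counts.values())
# Build a new dict rather than overwriting in place, so the raw counts survive.
weights = {word: count / highest for word, count in counts.items()}

print(weights)
```

The most frequent word gets weight 1.0 and every other word a fraction of it, so a sentence score is a sum of values in the range (0, 1].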

## Finding the Score of Sentences

In [18]:
# We use the word frequencies to measure the value of a sentence
sentence_scores = {}
for sent in sentence_list: # Reads the article text that still contains punctuation
    for word in nltk.word_tokenize(sent.lower()): # Lowercased, so only lowercase keys of word_frequencies can match
        if word in word_frequencies.keys(): # Stop words are not in word_frequencies, so they are ignored
            if len(sent.split(' ')) < 32: # Only score sentences shorter than 32 words. Summary should be short
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
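
The scoring step can be sketched as a small function. This is an illustrative variant rather than the notebook's code: `score_sentences` and its `max_words` parameter are hypothetical names, words are split on whitespace instead of with NLTK, and the length check runs once per sentence before any lookups.

```python
def score_sentences(sentences, weights, max_words=32):
    """Sum the weights of known words in each sufficiently short sentence."""
    scores = {}
    for sentence in sentences:
        words = sentence.lower().split()
        if len(words) >= max_words:
            continue  # skip long sentences before doing any lookups
        # Strip surrounding punctuation so "award." matches the key "award".
        total = sum(weights.get(word.strip('.,;:!?()"\''), 0.0) for word in words)
        if total > 0:
            scores[sentence] = total
    return scores

# Hypothetical lowercase weights and sentences.
weights = {"film": 1.0, "actor": 0.5, "award": 0.25}
sentences = [
    "The actor won an award.",
    "The film made the actor famous.",
    "It rained.",
]
scores = score_sentences(sentences, weights)
print(scores)
```

A sentence with no weighted words never enters the result, which mirrors how the loop above only creates an entry when a word is found in `word_frequencies`.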

In [19]:
# Viewing the value of each sentence
sentence_scores

{' Prabhas is an Indian actor who works predominantly in Telugu cinema.': 0.6000000000000001,
 "One of the highest-paid actors in Indian cinema, Prabhas has featured in Forbes India's Celebrity 100 list three times since 2015 based on his income and popularity.": 1.0999999999999999,
 'He has received seven Filmfare Awards South nominations and is a recipient of the Nandi Award and the SIIMA Award.': 0.4,
 'Prabhas made his acting debut with the 2002 Telugu drama Eeswar, and later attained his breakthrough with the romantic action film Varsham (2004).': 2.2,
 'His notable works include Chatrapathi (2005), Bujjigadu (2008), Billa (2009), Darling (2010), Mr.': 0.4,
 'He won the state Nandi Award for Best Actor for his performance in Mirchi.': 0.30000000000000004,
 "In 2015, Prabhas starred in the title role in S. S. Rajamouli's epic action film Baahubali: The Beginning, which is the fourth-highest-grossing Indian film to date.": 2.9000000000000004,
 'Prabhas next film Saaho was an above a

## Extracting the Article Summary

In [20]:
# Making the final summary
number_of_sentence_to_summarize_to = 10

import heapq # Standard-library heap queue (priority queue) module; nlargest picks the top-scoring items
summary_sentences = heapq.nlargest(number_of_sentence_to_summarize_to, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

In 2015, Prabhas starred in the title role in S. S. Rajamouli's epic action film Baahubali: The Beginning, which is the fourth-highest-grossing Indian film to date. Prabhas next film Saaho was an above average grosser at the box office while Radhe Shyam ended up as a commercial failure at the box office. Prabhas made his acting debut with the 2002 Telugu drama Eeswar, and later attained his breakthrough with the romantic action film Varsham (2004). Prabhas is also set to star in Sandeep Reddy Vanga's cop drama film Spirit. Currently Prabhas is filming 2 films - Kalki 2898 AD, and a film with Maruthi titled The Raja Saab. One of the highest-paid actors in Indian cinema, Prabhas has featured in Forbes India's Celebrity 100 list three times since 2015 based on his income and popularity. He has also acted in Om Raut's Adipurush which was a commercial failure.  Prabhas is an Indian actor who works predominantly in Telugu cinema. He has received seven Filmfare Awards South nominations and is
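
`heapq.nlargest` returns sentences ordered by score, so the summary above does not follow the order of the article. If reading order matters, one option is to re-sort the selected sentences by their original position. The sketch below assumes the scores dict was filled in article order (dicts preserve insertion order in Python 3.7+) and uses hypothetical sentences and scores.

```python
import heapq

# Hypothetical scores; dict order stands in for the order of sentences in the article.
sentence_scores = {
    "First sentence.": 0.5,
    "Second sentence.": 2.0,
    "Third sentence.": 0.25,
    "Fourth sentence.": 3.0,
}

# Pick the two highest-scoring sentences (highest score first).
top = heapq.nlargest(2, sentence_scores, key=sentence_scores.get)

# Re-sort the selected sentences by their position in the original dict.
position = {sentence: index for index, sentence in enumerate(sentence_scores)}
in_order = sorted(top, key=position.get)

print(' '.join(in_order))
```

Whether score order or article order reads better depends on the text; this only shows how to switch between them.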