In [1]:
! pip install beautifulsoup4



In [2]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


In [1]:
import bs4 as bs
import urllib.request
import re

scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/MS_Dhoni')
article = scraped_data.read()

# To parse the data, we use BeautifulSoup

parsed_article = bs.BeautifulSoup(article,'lxml')

#In Wikipedia articles, all the text for the article is enclosed inside the <p> tags. To retrieve 
#the text we need to call find_all function on the object

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
    article_text += p.text

In [2]:
article_text

'\nMahendra Singh Dhoni (pronunciation\xa0(help·info) born 7 July 1981), is a former Indian international cricketer who captained the Indian national team in limited-overs formats from 2007 to 2016 and in Test cricket from 2008 to 2014. Under his captaincy, India won the inaugural 2007 ICC World Twenty20, the 2010 and 2016 Asia Cups, the 2011 ICC Cricket World Cup and the 2013 ICC Champions Trophy. A right-handed middle-order batsman and wicket-keeper, Dhoni is one of the highest run scorers in One Day Internationals (ODIs) with more than 10,000 runs scored and is considered an effective "finisher" in limited-overs formats.[2][3][4] He is widely regarded as one of the best wicket-keeper batsmen and captains in the history of the game.[5][6][7][8][9] He was also the first wicket-keeper to effect 100 stumpings in ODI cricket.[10]\nDhoni made his ODI debut on 23 December, 2004 against Bangladesh, and played his first Test a year later against Sri Lanka. He has been the recipient of many a

In [3]:
# Removing Square Brackets and Extra Spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)

In [4]:
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)

In [6]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [5]:
#Converting Text To Sentences
import nltk
#nltk.download('punkt')
sentence_list = nltk.sent_tokenize(article_text)

In [6]:
formatted_article_text

' Mahendra Singh Dhoni pronunciation help info born July is a former Indian international cricketer who captained the Indian national team in limited overs formats from to and in Test cricket from to Under his captaincy India won the inaugural ICC World Twenty the and Asia Cups the ICC Cricket World Cup and the ICC Champions Trophy A right handed middle order batsman and wicket keeper Dhoni is one of the highest run scorers in One Day Internationals ODIs with more than runs scored and is considered an effective finisher in limited overs formats He is widely regarded as one of the best wicket keeper batsmen and captains in the history of the game He was also the first wicket keeper to effect stumpings in ODI cricket Dhoni made his ODI debut on December against Bangladesh and played his first Test a year later against Sri Lanka He has been the recipient of many awards including the ICC ODI Player of the Year award in and the first player to win the award twice the Rajiv Gandhi Khel Ratna

In [7]:
#Find Weighted Frequency of Occurrence
#nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

In [9]:
#Finally, to find the weighted frequency, we can simply divide the number of occurances of all the words by the 
#frequency of the most occurring word, as shown below:
maximum_frequncy = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)

In [10]:
#Calculating Sentence Scores
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
                    

In [11]:
#Getting the Summary
import heapq
summary_sentences = heapq.nlargest(10, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

The final match of the series had a repeat performance as Dhoni scored 77 runs off 56 balls to enable India win the series 4–1. In the third match of the series, his knock of 50 helped India tie the match and eventually avoiding a series whitewash. India lost the first match on 27 August 2016, during which Dhoni surpassed former Australian captain Ricky Ponting to become the most experienced captain in international cricket. With the win against Bangladesh, he became the first non-Australian captain to win 100 ODI matches, and the first Indian captain to achieve the mark. He made a half century in his debut match scoring 68* in the second innings against Assam cricket team. During the 2015 Cricket World Cup, Dhoni became the first Indian captain to win all group stages matches in such a tournament. However, the team finished poorly scoring just 43 runs in the last eight overs and lost the match due to Duckworth-Lewis method. He had a relatively mediocre series, having scored 79 runs in