# Text Summarization

## Objective:

**Summarize the page of text in few lines**.

  * Will use python package **beautifulsoup4** to scrape the page from wikipedia for data.
  * Also will use XML parser - **lxml** to get the data from the paragraph p_tag.

In [9]:
# pip install beautifulsoup4
# pip install lxml

In [10]:
import bs4 as bs

In [11]:
import urllib2

In [12]:
print ('BeautifulSoup: {}' .format(bs.__version__))

BeautifulSoup: 4.6.3


## Scrape the data from Wiki

In [13]:
# https://en.wikipedia.org/wiki/Global_warming
site = "https://en.wikipedia.org/wiki/Global_warming"
source = urllib2.urlopen(site)

In [14]:
source

<addinfourl at 4369310640 whose fp = <socket._fileobject object at 0x104d5b350>>

In [15]:
# parsing the xml data using Beautifyl soup
soup = bs.BeautifulSoup(source, 'lxml')

In [16]:
#soup

## Get the data

In [17]:
# get the data from p_tag
text = ""

for paragraph in soup.find_all('p'):
    text += paragraph.text

In [18]:
#text

## Preprocess the data

In [19]:
import re

In [20]:
# remove the refererences like [44]
text = re.sub(r'\[\d+\]', ' ', text)

In [21]:
# remove excess spaces
text = re.sub(r'\s+', ' ', text)

In [22]:
#text

In [23]:
clean_text = text.lower()

In [24]:
# remove special characters. Retain the period to break into sentences
clean_text = re.sub(r'!@#$%^&\*\(\)\':;', ' ', clean_text)

In [25]:
clean_text = re.sub(r'\d', ' ', clean_text)

In [26]:
clean_text = re.sub(r'\s+', ' ', clean_text)

In [27]:
#clean_text

## Sentences

In [28]:
import nltk

In [29]:
sentences = nltk.sent_tokenize(clean_text)

In [30]:
len(sentences)

308

In [31]:
# get the stop words as well
stopwords = nltk.corpus.stopwords.words('english')

In [32]:
len(stopwords)

179

## Find Frequency of each word

In [33]:
word_count = {}

# go thru all words in the clean text
for word in nltk.word_tokenize(clean_text):
    if word not in stopwords:
        if word not in word_count.keys():
            word_count[word] = 1
        else:
            word_count[word] += 1

In [34]:
len(word_count)

1433

## Find the weights

In [35]:
max(word_count.values())

340

In [36]:
# convert the counts to weights
# weight = word freq / (max value of word-count-dict)

for keys in word_count.keys():
    word_count[keys] = (word_count[keys] *1.0 )/ max(word_count.values())

In [37]:
#word_count

## Find the sentence score

In [38]:
sent2score = {}

# go thru each sentence
for sentence in sentences:
    
    # ignore the very long sentences
    
    if len(sentence.split(' ')) < 25:
        # break sent to word and find their score using the word_count weight dictionary.
        for word in nltk.word_tokenize(sentence):
            if word in word_count.keys():
                # get the weight
                weight = word_count[word]
                ## need to update the sent2score dictionary
                if sentence not in sent2score.keys():
                    sent2score[sentence] = weight
                else:
                    sent2score[sentence] += weight


In [45]:
#sent2score

# first key
print(list(sent2score.keys())[0])

# first value
print(sent2score[list(sent2score.keys())[0]])

results from models can also vary due to different greenhouse gas inputs and the model's climate sensitivity.
3.38844666882


## Get Top 5 sentences based on weights

In [40]:
import heapq

In [41]:
best_sentences = heapq.nlargest(5, sent2score, key=sent2score.get)

In [42]:
best_sentences

[u'\xb0c ( .',
 u'according to basic physical principles, the greenhouse effect produces warming of the lower atmosphere (the troposphere), but cooling of the upper atmosphere (the stratosphere).',
 u'the adaptation may be planned, either in reaction to or anticipation of global warming, or spontaneous, i.e., without government intervention.',
 u'additional disputes concern estimates of climate sensitivity, predictions of additional warming, and what the consequences of global warming will be.',
 u'in view of the dominant role of human activity in causing it, the phenomenon is sometimes called "anthropogenic global warming" or "anthropogenic climate change".']

In [43]:
for sentence in best_sentences:
    print(sentence)

°c ( .
according to basic physical principles, the greenhouse effect produces warming of the lower atmosphere (the troposphere), but cooling of the upper atmosphere (the stratosphere).
the adaptation may be planned, either in reaction to or anticipation of global warming, or spontaneous, i.e., without government intervention.
additional disputes concern estimates of climate sensitivity, predictions of additional warming, and what the consequences of global warming will be.
in view of the dominant role of human activity in causing it, the phenomenon is sometimes called "anthropogenic global warming" or "anthropogenic climate change".
