# Text Summarization

## Objective:

**Summarize the page of text in few lines**.

  * Will use python package **beautifulsoup4** to scrape the page from wikipedia for data.
  * Also will use XML parser - **lxml** to get the data from the paragraph p_tag.

In [1]:
# pip install beautifulsoup4
# pip install lxml

In [2]:
import bs4 as bs

In [12]:
import urllib2

In [10]:
print ('BeautifulSoup: {}' .format(bs.__version__))

BeautifulSoup: 4.6.3


## Scrape the data from Wiki

In [13]:
# https://en.wikipedia.org/wiki/Global_warming
site = "https://en.wikipedia.org/wiki/Global_warming"
source = urllib2.urlopen(site)

In [14]:
source

<addinfourl at 4579979704 whose fp = <socket._fileobject object at 0x110fa92d0>>

In [15]:
# parsing the xml data using Beautifyl soup
soup = bs.BeautifulSoup(source, 'lxml')

In [16]:
soup



## Get the data

In [17]:
# get the data from p_tag
text = ""

for paragraph in soup.find_all('p'):
    text += paragraph.text

In [18]:
text



## Preprocess the data

In [19]:
import re

In [23]:
# remove the refererences like [44]
text = re.sub(r'\[\d+\]', ' ', text)

In [24]:
# remove excess spaces
text = re.sub(r'\s+', ' ', text)

In [25]:
text



In [26]:
clean_text = text.lower()

In [27]:
# remove special characters. Retain the period to break into sentences
clean_text = re.sub(r'!@#$%^&\*\(\)\':;', ' ', clean_text)

In [29]:
clean_text = re.sub(r'\d', ' ', clean_text)

In [31]:
clean_text = re.sub(r'\s+', ' ', clean_text)

In [32]:
clean_text



## Sentences

In [33]:
import nltk

In [35]:
sentences = nltk.sent_tokenize(clean_text)

In [37]:
len(sentences)

308

In [39]:
# get the stop words as well
stopwords = nltk.corpus.stopwords.words('english')

In [40]:
len(stopwords)

179

## Find Frequency of each word

In [47]:
word_count = {}

# go thru all words in the clean text
for word in nltk.word_tokenize(clean_text):
    if word not in stopwords:
        if word not in word_count.keys():
            word_count[word] = 1
        else:
            word_count[word] += 1

In [48]:
len(word_count)

1433

## Find the weights

In [49]:
max(word_count.values())

340

In [50]:
# convert the counts to weights
# weight = word freq / (max value of word-count-dict)

for keys in word_count.keys():
    word_count[keys] = (word_count[keys] *1.0 )/ max(word_count.values())

In [51]:
word_count

{u'limited': 0.008823529411764706,
 u'asian': 0.0029411764705882353,
 u'whose': 0.0058823529411764705,
 u'paris': 0.0029411764705882353,
 u'risk': 0.008823529411764706,
 u'regional': 0.0058823529411764705,
 u'updates': 0.0029411764705882353,
 u'summarized': 0.0058823529411764705,
 u'affect': 0.008823529411764706,
 u'bringing': 0.0029411764705882353,
 u'crops': 0.0029411764705882353,
 u'companies': 0.008823529411764706,
 u'humidity': 0.0058823529411764705,
 u'unrelated': 0.0029411764705882353,
 u'intensification': 0.0029411764705882353,
 u'enhance': 0.0029411764705882353,
 u'methane': 0.014705882352941176,
 u'leaders': 0.0029411764705882353,
 u'disciplines': 0.0058823529411764705,
 u'consistent': 0.008823529411764706,
 u'estimates': 0.01764705882352941,
 u'direct': 0.008823529411764706,
 u'feasibility': 0.0029411764705882353,
 u'likely': 0.029411764705882353,
 u'estimated': 0.0029411764705882353,
 u'even': 0.01764705882352941,
 u'established': 0.0029411764705882353,
 u'deliberate': 0.00

## Find the sentence score

In [58]:
sent2score = {}

# go thru each sentence
for sentence in sentences:
    
    # ignore the very long sentences
    
    if len(sentence.split(' ')) < 25:
        # break sent to word and find their score using the word_count weight dictionary.
        for word in nltk.word_tokenize(sentence):
            if word in word_count.keys():
                # get the weight
                weight = word_count[word]
                ## need to update the sent2score dictionary
                if sentence not in sent2score.keys():
                    sent2score[sentence] = weight
                else:
                    sent2score[sentence] += weight


In [59]:
sent2score

{u'% and gas flaring .': 1.5949367088607596,
 u'%.': 1.3417721518987342,
 u': at the th unfccc conference of the parties, held in at copenhagen, several unfccc parties produced the copenhagen accord.': 3.7164886818125593,
 u': emissions can be attributed to different regions.': 1.9240506329113924,
 u': this mandate was sustained in the kyoto protocol to the framework convention, : which entered into legal effect in .': 2.8101265822784813,
 u'[ .': 1.0886075949367089,
 u'[c] since , the average temperature of the lower troposphere has increased between .': 4.39662447257384,
 u"[d] without the earth's atmosphere, the earth's average temperature would be well below the freezing temperature of water.": 5.074576710155719,
 u']\xa0\xb0c.': 1.379746835443038,
 u'a climate model is a representation of the physical, chemical and biological processes that affect the climate system.': 3.254483860092505,
 u'a global pew research center report showed that a median of % of all respondents asked cons

## Get Top 5 sentences based on weights

In [60]:
import heapq

In [61]:
best_sentences = heapq.nlargest(5, sent2score, key=sent2score.get)

In [62]:
best_sentences

[u'\xb0c ( .',
 u'according to basic physical principles, the greenhouse effect produces warming of the lower atmosphere (the troposphere), but cooling of the upper atmosphere (the stratosphere).',
 u'the adaptation may be planned, either in reaction to or anticipation of global warming, or spontaneous, i.e., without government intervention.',
 u'additional disputes concern estimates of climate sensitivity, predictions of additional warming, and what the consequences of global warming will be.',
 u'in view of the dominant role of human activity in causing it, the phenomenon is sometimes called "anthropogenic global warming" or "anthropogenic climate change".']

In [63]:
for sentence in best_sentences:
    print(sentence)

°c ( .
according to basic physical principles, the greenhouse effect produces warming of the lower atmosphere (the troposphere), but cooling of the upper atmosphere (the stratosphere).
the adaptation may be planned, either in reaction to or anticipation of global warming, or spontaneous, i.e., without government intervention.
additional disputes concern estimates of climate sensitivity, predictions of additional warming, and what the consequences of global warming will be.
in view of the dominant role of human activity in causing it, the phenomenon is sometimes called "anthropogenic global warming" or "anthropogenic climate change".
