<h2 style='color:LawnGreen'>Text Summarization</h2>
<p style='color:chocolate'>Summarization is the task of condensing a piece of text into a shorter version while preserving its key information and overall meaning.</p>

<h4 style='color:LawnGreen'>Fetching Articles from the Internet</h4>

<h4 style='color:orange'>Install and import the libraries required for web scraping</h4>

### Beautiful Soup: a very useful Python utility for web scraping

In [None]:
# %pip install beautifulsoup4

### lxml: another important library, which we need to parse XML and HTML

In [None]:
# %pip install lxml

In [52]:
import bs4 as bs
import urllib.request
import re

##### 1. We use the urlopen function from the urllib.request module to fetch the page.
##### 2. We call the read function on the object returned by urlopen to read the data.
##### 3. To parse the data, we create a BeautifulSoup object and pass it the scraped data (article) and the lxml parser.

In [53]:
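# Fetch the page (urlopen returns a response object)
scraped_data = urllib.request.urlopen('https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century')
# Read the raw HTML of the response as bytes
article = scraped_data.read()

# Parse the HTML with the lxml parser
parsed_article = bs.BeautifulSoup(article, 'lxml')

# Collect every <p> element in the page
paragraphs = parsed_article.find_all('p')

# Concatenate the text of all paragraphs into one string
article_text = ""

for p in paragraphs:
    article_text += p.text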

#### The find_all function returns all the paragraphs in the article as a list. The paragraph texts are then concatenated to recreate the article.
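Note that `+=` concatenates the paragraphs with no separator, so the last sentence of one paragraph runs straight into the first sentence of the next (visible as `portfolio.Data scientists` in the recorded output). A minimal sketch of one possible alternative, joining the paragraph texts with a space. The `paragraph_texts` list is a hypothetical stand-in for `[p.text for p in paragraphs]`, and the outputs recorded in this notebook were produced without this change:

```python
# Hypothetical stand-in for [p.text for p in paragraphs]
paragraph_texts = [
    "It has a recruiting team for its portfolio.",
    "Data scientists are the key to big data.",
]

# Concatenating without a separator fuses the sentence boundary
fused = "".join(paragraph_texts)

# Joining with a single space keeps the boundary visible to a sentence tokenizer
article_text_joined = " ".join(paragraph_texts)

print(fused)
print(article_text_joined)
```

With the space in place, a sentence tokenizer has a better chance of splitting at the paragraph boundary instead of treating both sentences as one.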

In [54]:
article_text

'Back in the 1990s, computer engineer and Wall Street “quant” were the hot occupations in business. Today data scientists are the hires firms are competing to make. As companies wrestle with unprecedented volumes and types of information, demand for these experts has raced well ahead of supply. Indeed, Greylock Partners, the VC firm that backed Facebook and LinkedIn, is so worried about the shortage of data scientists that it has a recruiting team dedicated to channeling them to the businesses in its portfolio.Data scientists are the key to realizing the opportunities presented by big data. They bring structure to it, find compelling patterns in it, and advise executives on the implications for products, processes, and decisions. They find the story buried in the data and communicate it. And they don’t just deliver reports: They get at the questions at the heart of problems and devise creative approaches to them. One data scientist who was studying a fraud problem, for example, realize

<h4 style='color:LawnGreen'>Install necessary libraries for Data Preprocessing</h4>

In [55]:
# %pip install regex
# %pip install nltk

<h4 style='color:LawnGreen'>Import Modules</h4>

In [56]:
import nltk
# import re

<h2 style='color:LawnGreen'>Data Preprocessing</h2>

#### Removing bracketed reference numbers (e.g. [1]) and extra spaces

In [57]:
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
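To see what these two substitutions do, here is a small sketch on a made-up string. The `sample` text is hypothetical and not taken from the article:

```python
import re

# Hypothetical sample with bracketed reference markers and irregular spacing
sample = "big data[1] is  hot [23]."

# Replace bracketed numbers such as [1] or [23] with a single space
no_refs = re.sub(r'\[[0-9]*\]', ' ', sample)

# Collapse every run of whitespace into a single space
cleaned = re.sub(r'\s+', ' ', no_refs)

print(cleaned)
```

Because the pattern uses `[0-9]*`, it also matches empty brackets (`[]`). Brackets containing letters are left untouched.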

#### Converting all the text into lower case

In [58]:
article_text = article_text.lower()
article_text

'back in the 1990s, computer engineer and wall street “quant” were the hot occupations in business. today data scientists are the hires firms are competing to make. as companies wrestle with unprecedented volumes and types of information, demand for these experts has raced well ahead of supply. indeed, greylock partners, the vc firm that backed facebook and linkedin, is so worried about the shortage of data scientists that it has a recruiting team dedicated to channeling them to the businesses in its portfolio.data scientists are the key to realizing the opportunities presented by big data. they bring structure to it, find compelling patterns in it, and advise executives on the implications for products, processes, and decisions. they find the story buried in the data and communicate it. and they don’t just deliver reports: they get at the questions at the heart of problems and devise creative approaches to them. one data scientist who was studying a fraud problem, for example, realize

#### Remove punctuation, numbers and extra spaces

In [59]:
clean_text = re.sub(r'[^a-zA-Z]', ' ', article_text)  # replace all non-alphabetic characters with a space
clean_text = re.sub(r'\s+', ' ', clean_text)  # collapse multiple spaces into one
clean_text

'back in the s computer engineer and wall street quant were the hot occupations in business today data scientists are the hires firms are competing to make as companies wrestle with unprecedented volumes and types of information demand for these experts has raced well ahead of supply indeed greylock partners the vc firm that backed facebook and linkedin is so worried about the shortage of data scientists that it has a recruiting team dedicated to channeling them to the businesses in its portfolio data scientists are the key to realizing the opportunities presented by big data they bring structure to it find compelling patterns in it and advise executives on the implications for products processes and decisions they find the story buried in the data and communicate it and they don t just deliver reports they get at the questions at the heart of problems and devise creative approaches to them one data scientist who was studying a fraud problem for example realized it was analogous to a t

#### Note: when tokenizing a Unicode string, make sure you are not passing an encoded (bytes) version of the string. If necessary, decode it first, e.g. with s.decode("utf8").
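In Python 3, `str` is already Unicode and only `bytes` objects have a `decode` method. In this notebook BeautifulSoup does the decoding, so `p.text` is already a `str`. A minimal sketch of the distinction, where `raw_bytes` is a hypothetical stand-in for data read directly from a response:

```python
# read() on a urlopen response returns bytes; this value is a hypothetical stand-in
raw_bytes = "they don’t just deliver reports.".encode("utf8")

# Decode to str before handing the text to a tokenizer
decoded_text = raw_bytes.decode("utf8")

print(type(raw_bytes).__name__, type(decoded_text).__name__)
```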

In [60]:
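# split the text into a list of sentences
# (requires NLTK's Punkt tokenizer data, e.g. nltk.download('punkt'); newer NLTK releases use 'punkt_tab')
sentence_list = nltk.sent_tokenize(article_text)
sentence_list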

['back in the 1990s, computer engineer and wall street “quant” were the hot occupations in business.',
 'today data scientists are the hires firms are competing to make.',
 'as companies wrestle with unprecedented volumes and types of information, demand for these experts has raced well ahead of supply.',
 'indeed, greylock partners, the vc firm that backed facebook and linkedin, is so worried about the shortage of data scientists that it has a recruiting team dedicated to channeling them to the businesses in its portfolio.data scientists are the key to realizing the opportunities presented by big data.',
 'they bring structure to it, find compelling patterns in it, and advise executives on the implications for products, processes, and decisions.',
 'they find the story buried in the data and communicate it.',
 'and they don’t just deliver reports: they get at the questions at the heart of problems and devise creative approaches to them.',
 'one data scientist who was studying a fraud 

##### Download stopwords

In [61]:
# run this cell once to download stopwords
# import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\chitt\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

<h4 style='color:LawnGreen'>Word Frequencies</h4>

In [27]:
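# English stopwords, stored as a set for fast membership checks
stopwords = set(nltk.corpus.stopwords.words('english'))

# Count how often each non-stopword occurs in the cleaned text
word_frequencies = {}
for word in nltk.word_tokenize(clean_text):
    if word not in stopwords:
        if word not in word_frequencies:
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1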

In [28]:
maximum_frequency = max(word_frequencies.values())

for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / maximum_frequency
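The two cells above count the non-stopword tokens and then divide each count by the largest count, so the most frequent word gets a weight of 1.0. A small sketch of the same idea on toy data. It uses `collections.Counter` and a plain `split()` instead of NLTK, and `toy_tokens` and `toy_stopwords` are hypothetical:

```python
from collections import Counter

# Hypothetical toy inputs; the notebook uses NLTK's tokenizer and stopword list
toy_tokens = "data scientists find the story in the data".split()
toy_stopwords = {"the", "in"}

# Count every token that is not a stopword
toy_counts = Counter(t for t in toy_tokens if t not in toy_stopwords)

# Divide each count by the largest count, so the top word scores 1.0
toy_max = max(toy_counts.values())
toy_frequencies = {word: count / toy_max for word, count in toy_counts.items()}

print(toy_frequencies)
```

Here `data` occurs twice and every other remaining word once, so `data` gets 1.0 and the rest get 0.5.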

<h4 style='color:LawnGreen'>Calculate Sentence Scores</h4>

In [29]:
sentence_scores = {}

for sentence in sentence_list:
    for word in nltk.word_tokenize(sentence):
        if word in word_frequencies and len(sentence.split(' ')) < 30:
            if sentence not in sentence_scores:
                sentence_scores[sentence] = word_frequencies[word]
            else:
                sentence_scores[sentence] += word_frequencies[word]
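Each sentence's score is the sum of the normalized frequencies of its words, and sentences of 30 or more words are skipped. The sketch below applies the same scoring to toy data. The final reordering step is an optional extension that is not part of this notebook: it puts the selected sentences back in document order. All `toy_*` names and values are hypothetical:

```python
import heapq

# Hypothetical toy inputs standing in for sentence_list and word_frequencies
toy_sentences = [
    "data is everywhere",
    "the weather is nice",
    "data scientists find the story",
]
toy_frequencies = {"data": 1.0, "scientists": 0.5, "find": 0.5, "story": 0.5, "everywhere": 0.25}

toy_scores = {}
for sentence in toy_sentences:
    words = sentence.split()
    # Skip long sentences, mirroring the 30-word cutoff used above
    if len(words) >= 30:
        continue
    for word in words:
        if word in toy_frequencies:
            toy_scores[sentence] = toy_scores.get(sentence, 0) + toy_frequencies[word]

# Pick the two best-scoring sentences (highest score first)
top = heapq.nlargest(2, toy_scores, key=toy_scores.get)

# Optional extension: restore the order in which the sentences appear in the text
ordered = sorted(top, key=toy_sentences.index)

print(toy_scores)
print(" ".join(ordered))
```

A sentence with no scored words, such as the second one, never enters the dictionary and therefore cannot be selected.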

In [30]:
word_frequencies

{'back': 0.008620689655172414,
 'computer': 0.04310344827586207,
 'engineer': 0.008620689655172414,
 'wall': 0.034482758620689655,
 'street': 0.034482758620689655,
 'quant': 0.008620689655172414,
 'hot': 0.008620689655172414,
 'occupations': 0.008620689655172414,
 'business': 0.1206896551724138,
 'today': 0.04310344827586207,
 'data': 1.0,
 'scientists': 0.4396551724137931,
 'hires': 0.008620689655172414,
 'firms': 0.07758620689655173,
 'competing': 0.008620689655172414,
 'make': 0.10344827586206896,
 'companies': 0.09482758620689655,
 'wrestle': 0.008620689655172414,
 'unprecedented': 0.008620689655172414,
 'volumes': 0.017241379310344827,
 'types': 0.008620689655172414,
 'information': 0.034482758620689655,
 'demand': 0.034482758620689655,
 'experts': 0.017241379310344827,
 'raced': 0.017241379310344827,
 'well': 0.034482758620689655,
 'ahead': 0.017241379310344827,
 'supply': 0.017241379310344827,
 'indeed': 0.017241379310344827,
 'greylock': 0.02586206896551724,
 'partners': 0.0258

<h4>Sentences with the <span style='color:red'>highest</span> sentence_score</h4>

In [49]:
import heapq

# the two sentences with the highest scores
s = heapq.nlargest(2, sentence_scores, key=sentence_scores.get)

In [64]:
for i in s :
    print(i+': '+str(sentence_scores[i]))

after acquiring the big data firm greenplum, emc decided that the availability of data scientists would be a gating factor in its own—and customers’—exploitation of big data.: 4.025862068965518
at intuit data scientists are asked to develop insights for small-business customers and consumers and report to a new senior vice president of big data, social design, and marketing.: 2.905172413793103


<h4 style='color:LawnGreen'>Text Summarization</h4>

In [65]:
# get the top 10 sentences by score
import heapq
summary = heapq.nlargest(10, sentence_scores,key=sentence_scores.get)

print(" ".join(summary))

after acquiring the big data firm greenplum, emc decided that the availability of data scientists would be a gating factor in its own—and customers’—exploitation of big data. at intuit data scientists are asked to develop insights for small-business customers and consumers and report to a new senior vice president of big data, social design, and marketing. and what fields are those skills most readily found in?more than anything, what data scientists do is make discoveries while swimming in data. the program combines mentoring by data experts from local companies (such as facebook, twitter, google, and linkedin) with exposure to actual big data challenges. so its education services division launched a data science and big data analytics training and certification program. they identify rich data sources, join them with other, potentially incomplete data sources, and clean the resulting set. a little less surprisingly, many of the data scientists working in business today were formally 