#### What is Text Summarization?

- Text summarization is the process of distilling the most important information from a source text.

- How to do Text Summarization?

  
  1. Text Cleaning
  2. Sentence Tokenization
  3. Word Tokenization
  4. Word-Frequency Table
  5. Summarization

In [17]:
text = """Earth is the third planet from the Sun and the only astronomical object known to harbor life. According to radiometric dating and other evidence, Earth formed over 4.5 billion years ago. Earth's gravity interacts with other objects in space, especially the Sun and the Moon, which is Earth's only natural satellite. Earth orbits around the Sun in 365.256 solar days, a period known as an Earth sidereal year. During this time, Earth rotates about its axis 366.256 times, that is, a sidereal year has 366.256 sidereal days. Earth's axis of rotation is tilted with respect to its orbital plane, producing seasons on Earth. 
The gravitational interaction between Earth and the Moon causes tides, stabilizes Earth's orientation on its axis, and gradually slows its rotation. Earth is the densest planet in the Solar System and the largest and most massive of the four rocky planets.
Earth's outer layer (lithosphere) is divided into several rigid tectonic plates that migrate across the surface over many millions of years. About 29% of Earth's surface is land consisting of continents and islands. The remaining 71% is covered with water, mostly by oceans but also lakes, rivers and other fresh water, which all together constitute the hydrosphere. The majority of Earth's polar regions are covered in ice, including the Antarctic ice sheet and the sea ice of the Arctic ice pack. Earth's interior remains active with a solid iron inner core, a liquid outer core that generates Earth's magnetic field, and a convecting mantle that drives plate tectonics."""

In [18]:
# Importing Libraries

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [19]:
stopwords = list(STOP_WORDS)

In [20]:
stopwords[:5]

['enough', 'rather', 'until', 'least', 'have']

In [21]:
# Load spaCy's small English pipeline

nlp = spacy.load("en_core_web_sm")

In [22]:
doc = nlp(text)

In [24]:
tokens = [token.text for token in doc]
print(tokens)

['Earth', 'is', 'the', 'third', 'planet', 'from', 'the', 'Sun', 'and', 'the', 'only', 'astronomical', 'object', 'known', 'to', 'harbor', 'life', '.', 'According', 'to', 'radiometric', 'dating', 'and', 'other', 'evidence', ',', 'Earth', 'formed', 'over', '4.5', 'billion', 'years', 'ago', '.', 'Earth', "'s", 'gravity', 'interacts', 'with', 'other', 'objects', 'in', 'space', ',', 'especially', 'the', 'Sun', 'and', 'the', 'Moon', ',', 'which', 'is', 'Earth', "'s", 'only', 'natural', 'satellite', '.', 'Earth', 'orbits', 'around', 'the', 'Sun', 'in', '365.256', 'solar', 'days', ',', 'a', 'period', 'known', 'as', 'an', 'Earth', 'sidereal', 'year', '.', 'During', 'this', 'time', ',', 'Earth', 'rotates', 'about', 'its', 'axis', '366.256', 'times', ',', 'that', 'is', ',', 'a', 'sidereal', 'year', 'has', '366.256', 'sidereal', 'days', '.', 'Earth', "'s", 'axis', 'of', 'rotation', 'is', 'tilted', 'with', 'respect', 'to', 'its', 'orbital', 'plane', ',', 'producing', 'seasons', 'on', 'Earth', '.', '

- Note that stop words and punctuation marks are still among the tokens.

### Removing stopwords and punctuation

In [25]:
# Punctuation characters to remove

punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [26]:
# string.punctuation has no newline (\n), so we append it

punctuation = punctuation + "\n"

In [27]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'
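
One detail worth noting: `punctuation` is a plain string, so a test such as `token in punctuation` is a substring test. It works for single-character tokens, but a multi-character token such as `...` would slip through. Below is a minimal sketch of a stricter check. The helper name `is_punct` and the constant `PUNCT_CHARS` are hypothetical and not part of the notebook.

```python
import string

# Set of single punctuation characters, plus the newline appended in the notebook.
PUNCT_CHARS = set(string.punctuation + "\n")

def is_punct(token_text: str) -> bool:
    """Return True when the token is non-empty and made only of punctuation/newlines."""
    return len(token_text) > 0 and all(ch in PUNCT_CHARS for ch in token_text)

print("..." in string.punctuation)  # substring test on the raw string
print(is_punct("..."))              # character-by-character test
print(is_punct("Earth"))
```

Using a set also avoids the edge case where an empty string counts as a substring of any string.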

### Word Frequencies

- Word frequency is the number of times a word occurs in a text.

In [28]:
word_frequency = {}
for word in doc:
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            if word.text not in word_frequency.keys():
                word_frequency[word.text] = 1
            else:
                word_frequency[word.text] += 1

The code above works as follows:
    1. Create an empty dictionary called word_frequency.
    2. Loop over each token in doc.
    3. Skip the token if its lowercased text is a stop word.
    4. Skip the token if it is a punctuation character (or a newline).
    5. If the token's text is not yet a key in word_frequency, add it with a value of 1.
    6. Otherwise, increment the count stored for that word.

Note that the filters compare the lowercased text, but the keys keep the original casing, so 'Solar' and 'solar' are counted separately.
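
A more compact way to build the same kind of table is `collections.Counter`. The sketch below also lowercases the keys so that differently cased forms of a word share one count. It is an illustration under stated assumptions: `toy_tokens` and `toy_stopwords` are made-up stand-ins for the spaCy `doc` and `STOP_WORDS`, and `count_words` is a hypothetical helper.

```python
from collections import Counter
import string

# Toy stand-ins (assumptions): a short token list and a tiny stop-word set,
# used here in place of the spaCy `doc` and `STOP_WORDS` from the notebook.
toy_tokens = ["Earth", "is", "the", "third", "planet", ".", "The", "Solar",
              "System", "has", "a", "solar", "year", ",", "Earth", "orbits", "\n"]
toy_stopwords = {"is", "the", "third", "has", "a"}
punct_chars = set(string.punctuation + "\n")

def count_words(tokens, stopwords):
    """Count content words case-insensitively, skipping stop words and punctuation."""
    counts = Counter()
    for tok in tokens:
        key = tok.lower()
        if key in stopwords or key in punct_chars:
            continue
        counts[key] += 1
    return counts

word_counts = count_words(toy_tokens, toy_stopwords)
print(word_counts)
```

With lowercase keys, a later lookup that also lowercases the token text finds every word, whatever its casing in the sentence.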

In [30]:
print(word_frequency)

{'Earth': 17, 'planet': 2, 'Sun': 3, 'astronomical': 1, 'object': 1, 'known': 2, 'harbor': 1, 'life': 1, 'According': 1, 'radiometric': 1, 'dating': 1, 'evidence': 1, 'formed': 1, '4.5': 1, 'billion': 1, 'years': 2, 'ago': 1, 'gravity': 1, 'interacts': 1, 'objects': 1, 'space': 1, 'especially': 1, 'Moon': 2, 'natural': 1, 'satellite': 1, 'orbits': 1, '365.256': 1, 'solar': 1, 'days': 2, 'period': 1, 'sidereal': 3, 'year': 2, 'time': 1, 'rotates': 1, 'axis': 3, '366.256': 2, 'times': 1, 'rotation': 2, 'tilted': 1, 'respect': 1, 'orbital': 1, 'plane': 1, 'producing': 1, 'seasons': 1, 'gravitational': 1, 'interaction': 1, 'causes': 1, 'tides': 1, 'stabilizes': 1, 'orientation': 1, 'gradually': 1, 'slows': 1, 'densest': 1, 'Solar': 1, 'System': 1, 'largest': 1, 'massive': 1, 'rocky': 1, 'planets': 1, 'outer': 2, 'layer': 1, 'lithosphere': 1, 'divided': 1, 'rigid': 1, 'tectonic': 1, 'plates': 1, 'migrate': 1, 'surface': 2, 'millions': 1, '29': 1, 'land': 1, 'consisting': 1, 'continents': 1,

In [31]:
# Highest raw count of any word

max_freq = max(word_frequency.values())

In [32]:
max_freq

17

- Now we divide each value in word_frequency by the maximum count (17) to get normalized frequencies, so the most frequent word scores 1.0.

In [33]:
for word in word_frequency.keys():
    word_frequency[word] = word_frequency[word]/max_freq

In [34]:
print(word_frequency)

{'Earth': 1.0, 'planet': 0.11764705882352941, 'Sun': 0.17647058823529413, 'astronomical': 0.058823529411764705, 'object': 0.058823529411764705, 'known': 0.11764705882352941, 'harbor': 0.058823529411764705, 'life': 0.058823529411764705, 'According': 0.058823529411764705, 'radiometric': 0.058823529411764705, 'dating': 0.058823529411764705, 'evidence': 0.058823529411764705, 'formed': 0.058823529411764705, '4.5': 0.058823529411764705, 'billion': 0.058823529411764705, 'years': 0.11764705882352941, 'ago': 0.058823529411764705, 'gravity': 0.058823529411764705, 'interacts': 0.058823529411764705, 'objects': 0.058823529411764705, 'space': 0.058823529411764705, 'especially': 0.058823529411764705, 'Moon': 0.11764705882352941, 'natural': 0.058823529411764705, 'satellite': 0.058823529411764705, 'orbits': 0.058823529411764705, '365.256': 0.058823529411764705, 'solar': 0.058823529411764705, 'days': 0.11764705882352941, 'period': 0.058823529411764705, 'sidereal': 0.17647058823529413, 'year': 0.11764705
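
The normalization step can also be written as a small function that returns a new dictionary instead of overwriting the counts in place. This is only a sketch: `raw_counts` holds a few counts copied from the raw frequency output shown earlier, and `normalize` is a hypothetical helper name.

```python
# A few raw counts copied from the raw frequency output shown earlier.
raw_counts = {"Earth": 17, "Sun": 3, "planet": 2, "life": 1}

def normalize(counts):
    """Scale every count by the largest one so values fall in the range (0, 1]."""
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

normalized = normalize(raw_counts)
print(normalized["Earth"])   # the most frequent word maps to 1.0
print(normalized["planet"])  # 2/17
```

Keeping the raw counts untouched makes it possible to rerun the cell without dividing the values a second time.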

### Sentence Tokenization

In [35]:
sent_token = [sent for sent in doc.sents]

In [38]:
print(sent_token)

[Earth is the third planet from the Sun and the only astronomical object known to harbor life., According to radiometric dating and other evidence, Earth formed over 4.5 billion years ago., Earth's gravity interacts with other objects in space, especially the Sun and the Moon, which is Earth's only natural satellite., Earth orbits around the Sun in 365.256 solar days, a period known as an Earth sidereal year., During this time, Earth rotates about its axis 366.256 times, that is, a sidereal year has 366.256 sidereal days., Earth's axis of rotation is tilted with respect to its orbital plane, producing seasons on Earth. 
, The gravitational interaction between Earth and the Moon causes tides, stabilizes Earth's orientation on its axis, and gradually slows its rotation., Earth is the densest planet in the Solar System and the largest and most massive of the four rocky planets.
, Earth's outer layer (lithosphere) is divided into several rigid tectonic plates that migrate across the surfac

### Calculating Sentence Score

In [39]:
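# Score each sentence by summing the normalized frequencies of its words
sent_score = {}
for sent in sent_token:
    for word in sent:
        # Note: the lookup is lowercased, but word_frequency keys keep their
        # original casing, so capitalized words such as 'Earth' never match
        if word.text.lower() in word_frequency.keys():
            if sent not in sent_score.keys():
                sent_score[sent] = word_frequency[word.text.lower()]
            else:
                sent_score[sent] += word_frequency[word.text.lower()]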

In [40]:
print(sent_score)

{Earth is the third planet from the Sun and the only astronomical object known to harbor life.: 0.47058823529411764, According to radiometric dating and other evidence, Earth formed over 4.5 billion years ago.: 0.5294117647058824, Earth's gravity interacts with other objects in space, especially the Sun and the Moon, which is Earth's only natural satellite.: 0.411764705882353, Earth orbits around the Sun in 365.256 solar days, a period known as an Earth sidereal year.: 0.7647058823529412, During this time, Earth rotates about its axis 366.256 times, that is, a sidereal year has 366.256 sidereal days.: 1.1764705882352942, Earth's axis of rotation is tilted with respect to its orbital plane, producing seasons on Earth. 
: 0.6470588235294118, The gravitational interaction between Earth and the Moon causes tides, stabilizes Earth's orientation on its axis, and gradually slows its rotation.: 0.7647058823529412, Earth is the densest planet in the Solar System and the largest and most massive
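
Two properties of this scoring are worth keeping in mind. First, the score is a plain sum, so longer sentences tend to score higher. Second, the lookup only succeeds when the frequency table's keys use the same casing as the lookup: the first score above, 0.4706, equals 8/17, which is what remains once 'Earth' and 'Sun' contribute nothing. The sketch below uses lowercase keys throughout. It relies on assumed toy frequencies and a simple regular expression as a crude stand-in for spaCy's tokenizer; `score_sentence` is a hypothetical helper.

```python
import re

# Toy normalized frequencies with lowercase keys (assumed values for illustration).
toy_frequency = {"earth": 1.0, "sun": 0.5, "planet": 0.5, "tides": 0.25}

toy_sentences = [
    "Earth is the third planet from the Sun.",
    "The Moon causes tides on Earth.",
]

def score_sentence(sentence, frequencies):
    """Sum the frequency of every known word, matching case-insensitively."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return sum(frequencies.get(word, 0.0) for word in words)

toy_scores = {s: score_sentence(s, toy_frequency) for s in toy_sentences}
for sentence, score in toy_scores.items():
    print(f"{score:.2f}  {sentence}")
```

Dividing each score by the number of scored words is one possible way to reduce the bias toward long sentences, at the cost of favoring very short ones.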

### Selecting the top 30% of sentences by score

In [41]:
from heapq import nlargest

In [42]:
select_length = int(len(sent_token) * 0.3)

In [43]:
# Number of sentences to keep: 30% of the total, rounded down
select_length

3

- Hence we select the 3 highest-scoring sentences from the text.

In [44]:
summary = nlargest(select_length, sent_score, key = sent_score.get)

In [45]:
summary

[The majority of Earth's polar regions are covered in ice, including the Antarctic ice sheet and the sea ice of the Arctic ice pack.,
 Earth's interior remains active with a solid iron inner core, a liquid outer core that generates Earth's magnetic field, and a convecting mantle that drives plate tectonics.,
 During this time, Earth rotates about its axis 366.256 times, that is, a sidereal year has 366.256 sidereal days.]
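
Iterating over a dictionary yields its keys, so `nlargest(n, sent_score, key=sent_score.get)` ranks the sentences by their scores and returns them from highest to lowest. That is why the selected sentences above do not appear in document order. The sketch below shows one way to restore the original order after selection, using made-up sentences and scores.

```python
from heapq import nlargest

# Toy sentences in document order, with assumed scores.
toy_sentences = ["First sentence.", "Second sentence.",
                 "Third sentence.", "Fourth sentence."]
toy_scores = {"First sentence.": 0.4, "Second sentence.": 0.9,
              "Third sentence.": 0.1, "Fourth sentence.": 1.2}

# Rank the dictionary keys (sentences) by their score, highest first.
top = nlargest(2, toy_scores, key=toy_scores.get)

# Re-sort the selected sentences by their position in the document.
ordered = sorted(top, key=toy_sentences.index)

print(top)
print(ordered)
```

With spaCy sentence spans, sorting the selected spans by their `start` attribute should have the same effect.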

In [47]:
# Combine the selected sentences into a single string

final_summary = [sent.text for sent in summary]

In [48]:
summary = " ".join(final_summary)

In [50]:
print(text)

Earth is the third planet from the Sun and the only astronomical object known to harbor life. According to radiometric dating and other evidence, Earth formed over 4.5 billion years ago. Earth's gravity interacts with other objects in space, especially the Sun and the Moon, which is Earth's only natural satellite. Earth orbits around the Sun in 365.256 solar days, a period known as an Earth sidereal year. During this time, Earth rotates about its axis 366.256 times, that is, a sidereal year has 366.256 sidereal days. Earth's axis of rotation is tilted with respect to its orbital plane, producing seasons on Earth. 
The gravitational interaction between Earth and the Moon causes tides, stabilizes Earth's orientation on its axis, and gradually slows its rotation. Earth is the densest planet in the Solar System and the largest and most massive of the four rocky planets.
Earth's outer layer (lithosphere) is divided into several rigid tectonic plates that migrate across the surface over many

In [51]:
print(summary)

The majority of Earth's polar regions are covered in ice, including the Antarctic ice sheet and the sea ice of the Arctic ice pack. Earth's interior remains active with a solid iron inner core, a liquid outer core that generates Earth's magnetic field, and a convecting mantle that drives plate tectonics. During this time, Earth rotates about its axis 366.256 times, that is, a sidereal year has 366.256 sidereal days.


### Comparing the length of summary and original text

In [52]:
len(text)

1551

In [53]:
len(summary)

419

In [56]:
# The summary is about 27% of the original text's length in characters

419/1551

0.27014829142488717

#### Extraction-based summarization

The extractive text summarization technique pulls keyphrases from the source document and combines them into a summary. Extraction follows a defined metric and makes no changes to the extracted text.

Here is an example:

Source text: **Joseph and Mary** rode on a donkey to **attend** the annual **event** in **Jerusalem**. In the city, **Mary** gave **birth** to a child named **Jesus**.

Extractive summary: **Joseph and Mary attend event Jerusalem. Mary birth Jesus.**

As you can see above, the words in bold have been extracted and joined to create a summary, although the result can sometimes be grammatically strange.

#### Abstraction-based summarization

The abstraction technique paraphrases and shortens parts of the source document. When abstraction is applied to text summarization with deep learning, it can overcome the grammatical inconsistencies of the extractive method.

Abstractive text summarization algorithms create new phrases and sentences that relay the most useful information from the original text, much as a human would.

As a result, abstraction can produce more fluent summaries than extraction. However, abstractive algorithms are harder to develop, which is why extraction remains popular.

Here is an example:

Abstractive summary: Joseph and Mary came to Jerusalem where Jesus was born.

#### How does a text summarization algorithm work?

Usually, text summarization in NLP is treated as a supervised machine learning problem (where outcomes for new inputs are predicted from labeled training data).

Typically, the extraction-based approach to summarizing texts can work as follows:

1. Introduce a method to extract candidate keyphrases from the source document. For example, you can use part-of-speech tagging, word sequences, or other linguistic patterns to identify the keyphrases.
2. Gather text documents with positively-labeled keyphrases. The keyphrases should be compatible with the chosen extraction technique. To increase accuracy, you can also create negatively-labeled keyphrases.
3. Train a binary machine learning classifier to perform the text summarization. Some of the features you can use include:
   - Length of the keyphrase
   - Frequency of the keyphrase
   - The most recurring word in the keyphrase
   - Number of characters in the keyphrase
4. Finally, in the test phase, generate all the candidate keyphrase words and sentences and classify them.
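
The four features listed in step 3 can be computed with a few lines of standard-library Python. The sketch below covers the feature-extraction step only and leaves the classifier out. The function name `keyphrase_features` and the feature names are hypothetical, and "most recurring word" is interpreted here as the word of the keyphrase that occurs most often in the document, which is an assumption.

```python
from collections import Counter
import re

def keyphrase_features(phrase, document):
    """Compute four example features for one non-empty candidate keyphrase."""
    doc_words = re.findall(r"[a-z0-9']+", document.lower())
    phrase_words = re.findall(r"[a-z0-9']+", phrase.lower())
    doc_counts = Counter(doc_words)
    # Assumed interpretation: the phrase word that appears most often in the document.
    top_word = max(phrase_words, key=lambda w: doc_counts[w])
    return {
        "length_in_words": len(phrase_words),
        "phrase_frequency": document.lower().count(phrase.lower()),
        "most_recurring_word": top_word,
        "num_characters": len(phrase),
    }

source_text = ("Joseph and Mary rode on a donkey to attend the annual event in "
               "Jerusalem. In the city, Mary gave birth to a child named Jesus.")
features = keyphrase_features("Joseph and Mary", source_text)
print(features)
```

Each candidate keyphrase would yield one such feature dictionary, and the labeled dictionaries could then serve as training rows for a binary classifier.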
