Text summarization is the process of creating a short summary of a document that retains its most important information.

![image.png](attachment:image.png)

A huge amount of text is available digitally, so it is useful to have a procedure that quickly condenses a long text while preserving its main idea. Text summarization also shortens reading time, speeds up information searches, and helps readers obtain as much information as possible on a subject.

The main goal of using machine learning for text summarization is to reduce the source text to a shorter version while preserving its key information and meaning.

We don't need much machine learning here: we can summarize text without training a model. We still need some natural language processing, for which we will use the `NLTK` library.

In [1]:
import nltk
import string
from heapq import nlargest

Now let's remove punctuation from the text and filter out stop words. We will then count word frequencies, tokenize the text into sentences, score each sentence by the frequencies of its words, and take the highest-scoring sentences as the summary.

In [2]:
text = """Steve was born in Tokyo, Japan in 1950. He moved to London with his parents 
when he was 5 years old. Steve started school there and his father began work at the hospital.
His mother was a house wife and he had four brothers.

He lived in England for 2 years then moved to Amman, Jordan where he lived there for 10 years.
Steve then moved to Cyprus to study at the Mediterranean University.
Unfortunately, he did not succeed and returned to Jordan. 
His parents were very unhappy so he decided to try in America."""

text.count(".")

8

In [3]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [4]:
# Keep every character that is not a punctuation mark
nopuch = [char for char in text if char not in string.punctuation]
print(nopuch)

['S', 't', 'e', 'v', 'e', ' ', 'w', 'a', 's', ' ', 'b', 'o', 'r', 'n', ' ', 'i', 'n', ' ', 'T', 'o', 'k', 'y', 'o', ' ', 'J', 'a', 'p', 'a', 'n', ' ', 'i', 'n', ' ', '1', '9', '5', '0', ' ', 'H', 'e', ' ', 'm', 'o', 'v', 'e', 'd', ' ', 't', 'o', ' ', 'L', 'o', 'n', 'd', 'o', 'n', ' ', 'w', 'i', 't', 'h', ' ', 'h', 'i', 's', ' ', 'p', 'a', 'r', 'e', 'n', 't', 's', ' ', '\n', 'w', 'h', 'e', 'n', ' ', 'h', 'e', ' ', 'w', 'a', 's', ' ', '5', ' ', 'y', 'e', 'a', 'r', 's', ' ', 'o', 'l', 'd', ' ', 'S', 't', 'e', 'v', 'e', ' ', 's', 't', 'a', 'r', 't', 'e', 'd', ' ', 's', 'c', 'h', 'o', 'o', 'l', ' ', 't', 'h', 'e', 'r', 'e', ' ', 'a', 'n', 'd', ' ', 'h', 'i', 's', ' ', 'f', 'a', 't', 'h', 'e', 'r', ' ', 'b', 'e', 'g', 'a', 'n', ' ', 'w', 'o', 'r', 'k', ' ', 'a', 't', ' ', 't', 'h', 'e', ' ', 'h', 'o', 's', 'p', 'i', 't', 'a', 'l', '\n', 'H', 'i', 's', ' ', 'm', 'o', 't', 'h', 'e', 'r', ' ', 'w', 'a', 's', ' ', 'a', ' ', 'h', 'o', 'u', 's', 'e', ' ', 'w', 'i', 'f', 'e', ' ', 'a', 'n', 'd', ' 

In [5]:
nopuch = "".join(nopuch)
nopuch

'Steve was born in Tokyo Japan in 1950 He moved to London with his parents \nwhen he was 5 years old Steve started school there and his father began work at the hospital\nHis mother was a house wife and he had four brothers\n\nHe lived in England for 2 years then moved to Amman Jordan where he lived there for 10 years\nSteve then moved to Cyprus to study at the Mediterranean University\nUnfortunately he did not succeed and returned to Jordan \nHis parents were very unhappy so he decided to try in America'
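
The character filter and `"".join` above can also be written as a single `str.translate` call. The following is a minimal sketch of that alternative, not the notebook's own method; `sample` and `remove_punct` are illustrative names.

```python
import string

# Illustrative sample; any string works here.
sample = "Steve was born in Tokyo, Japan in 1950. He moved to London."

# Translation table whose third argument lists the characters to delete.
remove_punct = str.maketrans("", "", string.punctuation)
no_punct = sample.translate(remove_punct)

# Same result as filtering character by character and joining.
joined = "".join(ch for ch in sample if ch not in string.punctuation)
print(no_punct)
```

Both approaches give the same string; `translate` simply avoids building the intermediate list of characters.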

In [7]:
# nltk.download('stopwords')  # uncomment on the first run to fetch the stop word list
stop_words = nltk.corpus.stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [8]:
# Drop stop words (compared in lowercase); the kept words retain their original case
processed_text = [word for word in nopuch.split() if word.lower() not in stop_words]
print(processed_text)

['Steve', 'born', 'Tokyo', 'Japan', '1950', 'moved', 'London', 'parents', '5', 'years', 'old', 'Steve', 'started', 'school', 'father', 'began', 'work', 'hospital', 'mother', 'house', 'wife', 'four', 'brothers', 'lived', 'England', '2', 'years', 'moved', 'Amman', 'Jordan', 'lived', '10', 'years', 'Steve', 'moved', 'Cyprus', 'study', 'Mediterranean', 'University', 'Unfortunately', 'succeed', 'returned', 'Jordan', 'parents', 'unhappy', 'decided', 'try', 'America']
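
Because `stop_words` is a list, each `not in` test above scans it from the start. For longer texts it may help to convert the list to a set first. A minimal sketch, assuming a small hypothetical stop word subset in place of the full NLTK list:

```python
# Hypothetical stop word subset; in the notebook this would be
# set(nltk.corpus.stopwords.words('english')).
stop_word_set = {"was", "in", "he", "to", "with", "his"}

words = "Steve was born in Tokyo Japan in 1950".split()

# Set membership is a hash lookup, so each test is O(1) on average.
kept = [word for word in words if word.lower() not in stop_word_set]
print(kept)
```

For a text as short as the one used here the difference is negligible; the filtered result is the same either way.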


In [9]:
word_freq = {}
for word in processed_text:
    if word not in word_freq:
        word_freq[word] = 1
    else:
        word_freq[word] = word_freq[word] + 1

print(word_freq)

{'Steve': 3, 'born': 1, 'Tokyo': 1, 'Japan': 1, '1950': 1, 'moved': 3, 'London': 1, 'parents': 2, '5': 1, 'years': 3, 'old': 1, 'started': 1, 'school': 1, 'father': 1, 'began': 1, 'work': 1, 'hospital': 1, 'mother': 1, 'house': 1, 'wife': 1, 'four': 1, 'brothers': 1, 'lived': 2, 'England': 1, '2': 1, 'Amman': 1, 'Jordan': 2, '10': 1, 'Cyprus': 1, 'study': 1, 'Mediterranean': 1, 'University': 1, 'Unfortunately': 1, 'succeed': 1, 'returned': 1, 'unhappy': 1, 'decided': 1, 'try': 1, 'America': 1}
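
The counting loop above is equivalent to what `collections.Counter` does from the standard library. A short sketch, using an illustrative token list rather than the full `processed_text`:

```python
from collections import Counter

# A short token list standing in for `processed_text` (illustrative values).
tokens = ["Steve", "born", "Tokyo", "moved", "years", "Steve", "moved", "years", "Steve"]

# Counter builds the word -> count mapping in one call.
word_counts = Counter(tokens)

# Highest count, as used later for normalization.
max_count = max(word_counts.values())
print(word_counts.most_common(1), max_count)
```

A `Counter` is a `dict` subclass, so the later steps that read `word_freq[word]` would work on it unchanged.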


In [10]:
max_freq = max(word_freq.values())

print(max_freq)

3


In [11]:
# Normalize each count by the highest count, so the most frequent words get weight 1.0

for word in word_freq.keys():
    word_freq[word] = (word_freq[word]/max_freq)

print(word_freq)

{'Steve': 1.0, 'born': 0.3333333333333333, 'Tokyo': 0.3333333333333333, 'Japan': 0.3333333333333333, '1950': 0.3333333333333333, 'moved': 1.0, 'London': 0.3333333333333333, 'parents': 0.6666666666666666, '5': 0.3333333333333333, 'years': 1.0, 'old': 0.3333333333333333, 'started': 0.3333333333333333, 'school': 0.3333333333333333, 'father': 0.3333333333333333, 'began': 0.3333333333333333, 'work': 0.3333333333333333, 'hospital': 0.3333333333333333, 'mother': 0.3333333333333333, 'house': 0.3333333333333333, 'wife': 0.3333333333333333, 'four': 0.3333333333333333, 'brothers': 0.3333333333333333, 'lived': 0.6666666666666666, 'England': 0.3333333333333333, '2': 0.3333333333333333, 'Amman': 0.3333333333333333, 'Jordan': 0.6666666666666666, '10': 0.3333333333333333, 'Cyprus': 0.3333333333333333, 'study': 0.3333333333333333, 'Mediterranean': 0.3333333333333333, 'University': 0.3333333333333333, 'Unfortunately': 0.3333333333333333, 'succeed': 0.3333333333333333, 'returned': 0.3333333333333333, 'un
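
The normalization step divides every count by the largest count, which overwrites `word_freq` in place. If the raw counts are still needed afterwards, one option is to build a new dictionary instead. A minimal sketch; `normalize_frequencies` is a hypothetical helper, not part of the notebook:

```python
def normalize_frequencies(counts):
    """Return a new dict with each count divided by the largest count."""
    if not counts:
        return {}
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

# Illustrative counts taken from the word frequencies above.
counts = {"Steve": 3, "parents": 2, "born": 1}
weights = normalize_frequencies(counts)
print(weights)
```

The original `counts` dictionary is left untouched, and re-running the cell does not divide the values a second time.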

### Tokenize the Text

In [13]:
text

'Steve was born in Tokyo, Japan in 1950. He moved to London with his parents \nwhen he was 5 years old. Steve started school there and his father began work at the hospital.\nHis mother was a house wife and he had four brothers.\n\nHe lived in England for 2 years then moved to Amman, Jordan where he lived there for 10 years.\nSteve then moved to Cyprus to study at the Mediterranean University.\nUnfortunately, he did not succeed and returned to Jordan. \nHis parents were very unhappy so he decided to try in America.'

In [16]:
# nltk.download('punkt')  # uncomment on the first run to fetch the sentence tokenizer data
sent_list = nltk.sent_tokenize(text)  # sent_tokenize splits the text into a list of sentences
# Each element is one sentence, in the order it appears in the text.
sent_list

['Steve was born in Tokyo, Japan in 1950.',
 'He moved to London with his parents \nwhen he was 5 years old.',
 'Steve started school there and his father began work at the hospital.',
 'His mother was a house wife and he had four brothers.',
 'He lived in England for 2 years then moved to Amman, Jordan where he lived there for 10 years.',
 'Steve then moved to Cyprus to study at the Mediterranean University.',
 'Unfortunately, he did not succeed and returned to Jordan.',
 'His parents were very unhappy so he decided to try in America.']

In [15]:
# nltk.word_tokenize('Then we need to do'.lower())  # returns a list of word tokens

In [18]:
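sent_score = {}

for sentence in sent_list:
    # Tokens are lowercased here, but word_freq keys keep their original case,
    # so capitalized words such as 'Steve' never match and add nothing to the score.
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word_freq:
            # Add this word's normalized frequency to the sentence's running total
            sent_score[sentence] = sent_score.get(sentence, 0) + word_freq[word]

sent_score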

{'Steve was born in Tokyo, Japan in 1950.': 0.6666666666666666,
 'He moved to London with his parents \nwhen he was 5 years old.': 3.3333333333333335,
 'Steve started school there and his father began work at the hospital.': 1.9999999999999998,
 'His mother was a house wife and he had four brothers.': 1.6666666666666665,
 'He lived in England for 2 years then moved to Amman, Jordan where he lived there for 10 years.': 5.0,
 'Steve then moved to Cyprus to study at the Mediterranean University.': 1.3333333333333333,
 'Unfortunately, he did not succeed and returned to Jordan.': 0.6666666666666666,
 'His parents were very unhappy so he decided to try in America.': 1.6666666666666665}
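
In the scores above, the first sentence gets only 0.67 (from `born` and `1950`) even though it contains `Steve`, `Tokyo` and `Japan`. The frequency keys keep their capitalization while the sentence is lowercased before lookup. A minimal sketch of a case-insensitive variant follows. It is an assumption about how one might address this, not the notebook's method; `score_sentences` is a hypothetical helper, and the regex tokenization stands in for `nltk.word_tokenize` so the example runs without NLTK data.

```python
import re

def score_sentences(sentences, weights):
    """Sum the weights of each sentence's words, ignoring case."""
    lowered = {}
    for word, weight in weights.items():
        key = word.lower()
        # Keep the larger weight if two keys differ only by case.
        lowered[key] = max(weight, lowered.get(key, 0.0))

    scores = {}
    for sentence in sentences:
        # Simple regex tokenization (assumption: words are runs of \w characters).
        words = re.findall(r"\w+", sentence.lower())
        scores[sentence] = sum(lowered.get(w, 0.0) for w in words)
    return scores

# Illustrative weights and sentences.
weights = {"Steve": 1.0, "born": 1 / 3, "Tokyo": 1 / 3}
sentences = ["Steve was born in Tokyo.", "He had four brothers."]
scores = score_sentences(sentences, weights)
print(scores)
```

With case handled consistently, proper nouns contribute to the score, which would change the ranking shown above. Unlike the original loop, this version also records a score of 0.0 for sentences with no matching words.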

In [19]:
summary_sents = nlargest(n=2, iterable=sent_score, key=sent_score.get)  # the 2 sentences with the highest scores
summary_sents

['He lived in England for 2 years then moved to Amman, Jordan where he lived there for 10 years.',
 'He moved to London with his parents \nwhen he was 5 years old.']

In [20]:
summary = " ".join(summary_sents)
print(summary)

He lived in England for 2 years then moved to Amman, Jordan where he lived there for 10 years. He moved to London with his parents 
when he was 5 years old.
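
`nlargest` returns sentences ordered by score, so the summary above places the sentence about England before the one about London, reversing the order of the original text. One possible refinement is to pick the top sentences by score and then emit them in document order. A minimal sketch; `summarize_in_order` is a hypothetical helper, and the sample sentences and scores are illustrative:

```python
from heapq import nlargest

def summarize_in_order(sentences, scores, n=2):
    """Pick the n highest-scoring sentences, then join them in document order."""
    top_indices = nlargest(
        n, range(len(sentences)), key=lambda i: scores.get(sentences[i], 0.0)
    )
    # " ".join(s.split()) collapses embedded newlines and repeated spaces.
    return " ".join(" ".join(sentences[i].split()) for i in sorted(top_indices))

# Illustrative sentences and scores.
sentences = ["A first\nsentence.", "B second.", "C third."]
scores = {"A first\nsentence.": 1.0, "B second.": 0.2, "C third.": 3.0}
summary = summarize_in_order(sentences, scores, n=2)
print(summary)
```

Sorting the selected indices keeps the chronology of the source text, and the whitespace cleanup removes the line break that appears in the middle of the printed summary above.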
