Steps for text summarization
- Text Cleaning
- Sentence Tokenization
- Word Tokenization
- Word-frequency table
- Summarization
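
The five steps above can be sketched end to end with the standard library alone. This is only an illustrative outline, not the spaCy-based code used in the cells that follow: the regex-based sentence and word splitting, the tiny `STOP` set, and the `summarize` function name are all assumptions made for the example.

```python
import re
from collections import Counter
from heapq import nlargest

# Hypothetical, deliberately tiny stop-word set (the notebook uses spaCy's STOP_WORDS).
STOP = {"the", "a", "an", "and", "of", "to", "was", "is", "it", "in", "on", "we", "i"}


def _words(sentence):
    """Word tokenization: lowercase runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def summarize(text, ratio=0.3):
    # 1. Text cleaning: collapse all whitespace, including newlines.
    cleaned = re.sub(r"\s+", " ", text).strip()
    # 2. Sentence tokenization: naive split after '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", cleaned)
    # 3 + 4. Word tokenization and word-frequency table, normalized by the max count.
    counts = Counter(w for s in sentences for w in _words(s) if w not in STOP)
    if not counts:
        return ""
    top = max(counts.values())
    weights = {w: c / top for w, c in counts.items()}
    # 5. Summarization: score sentences, keep the best ones, restore text order.
    scores = {i: sum(weights.get(w, 0.0) for w in _words(s))
              for i, s in enumerate(sentences)}
    k = max(1, int(len(sentences) * ratio))  # always keep at least one sentence
    best = nlargest(k, scores, key=scores.get)
    return " ".join(sentences[i] for i in sorted(best))


sample = "Cats chase mice. Dogs chase cats and mice. Birds sing."
print(summarize(sample))
```

The `max(1, ...)` guard is a design choice of this sketch: with a plain `int(n * 0.3)` a text of three sentences or fewer would yield an empty summary.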

In [38]:
text = """Our solicitor Umar was fantastic he was introduced to us 10 weeks into the process after quite a lot of delays and quickly got everything resolved. We were really impressed with how fast and responsive he was and I'd highly recommend him. Overall I think Muve is a good option with some very competitive pricing but unfortunately we did experience some delays at the beginning of the process which I think it has to do with the amount of cases they had due to the stamp duty holiday but once our solicitor was assigned things moved really quickly. That's the only reason why I wouldn't give them 5 starts overall and I'd highly recommend getting a solicitor assigned as soon as you start the process. Would not have made the stamp duty deadline without them!

I was selling my existing property and purchasing a new one. It was all smooth sailing until the last few weeks where issues initiated by my buyers side, started to pop up out of nowhere. Umar and Ashleigh, really did everything they could to help me navigate these obstacles and were in constant contact. Keeping me updated and keeping the application on track. Completed on the final day of the stamp duty holiday and saved lots of £££. It was stressful at the end, but we made it.

Thank you both so much!"""

In [7]:
# required libraries
!pip install pandas
!pip install nltk
!pip install scikit-learn
!pip install networkx
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [39]:
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# to measure the similarity between two sentences
from nltk.cluster.util import cosine_distance

# to create similarity graphs and manipulate them
import numpy as np

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [40]:
# Use spaCy's stop-word list; this rebinds the name `stopwords` imported from nltk.corpus above.
stopwords = list(STOP_WORDS)

In [41]:
stopwords

['himself',
 'ten',
 'across',
 'since',
 'below',
 'upon',
 'behind',
 'down',
 'up',
 'doing',
 'against',
 'whence',
 'beside',
 'yourselves',
 'me',
 'mine',
 'become',
 'a',
 'that',
 'seems',
 'yet',
 'bottom',
 'something',
 'hers',
 'through',
 '’s',
 'the',
 'nor',
 'whom',
 'formerly',
 'they',
 'now',
 "'m",
 'itself',
 'n’t',
 'and',
 'ours',
 'see',
 'still',
 'further',
 'these',
 'anyway',
 'first',
 'too',
 'regarding',
 'once',
 'hereafter',
 'hence',
 'four',
 'afterwards',
 '‘m',
 'only',
 'amongst',
 'more',
 'where',
 'ever',
 'from',
 'elsewhere',
 'any',
 'their',
 'thru',
 'no',
 'sixty',
 'until',
 'someone',
 '‘ve',
 'but',
 '’d',
 'whose',
 'everywhere',
 'move',
 'because',
 'many',
 'would',
 "n't",
 'whatever',
 'again',
 'hereupon',
 'your',
 'could',
 'why',
 'throughout',
 'fifty',
 'its',
 'much',
 'everything',
 'via',
 'forty',
 'eight',
 'n‘t',
 'he',
 'i',
 'beforehand',
 'eleven',
 'latter',
 'becoming',
 'rather',
 'least',
 'were',
 'is',
 'arou

In [42]:
nlp = spacy.load('en_core_web_sm')

In [43]:
doc = nlp(text)

In [44]:
tokens = [token.text for token in doc]
print(tokens)

['Our', 'solicitor', 'Umar', 'was', 'fantastic', 'he', 'was', 'introduced', 'to', 'us', '10', 'weeks', 'into', 'the', 'process', 'after', 'quite', 'a', 'lot', 'of', 'delays', 'and', 'quickly', 'got', 'everything', 'resolved', '.', 'We', 'were', 'really', 'impressed', 'with', 'how', 'fast', 'and', 'responsive', 'he', 'was', 'and', 'I', "'d", 'highly', 'recommend', 'him', '.', 'Overall', 'I', 'think', 'Muve', 'is', 'a', 'good', 'option', 'with', 'some', 'very', 'competitive', 'pricing', 'but', 'unfortunately', 'we', 'did', 'experience', 'some', 'delays', 'at', 'the', 'beginning', 'of', 'the', 'process', 'which', 'I', 'think', 'it', 'has', 'to', 'do', 'with', 'the', 'amount', 'of', 'cases', 'they', 'had', 'due', 'to', 'the', 'stamp', 'duty', 'holiday', 'but', 'once', 'our', 'solicitor', 'was', 'assigned', 'things', 'moved', 'really', 'quickly', '.', 'That', "'s", 'the', 'only', 'reason', 'why', 'I', 'would', "n't", 'give', 'them', '5', 'starts', 'overall', 'and', 'I', "'d", 'highly', 'rec

In [49]:
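# Extend the punctuation string with '\n' so that line-break tokens are skipped
# along with stop words and punctuation. Re-running this cell appends another '\n'.
punctuation = punctuation + '\n'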
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n\n\n\n'

In [50]:
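# Count every token that is neither a stop word nor punctuation.
# Filtering is case-insensitive, but keys keep their original casing,
# so 'Overall' and 'overall' are counted separately.
word_frequencies = {}
for word in doc:
    token = word.text
    if token.lower() in stopwords or token.lower() in punctuation:
        continue
    word_frequencies[token] = word_frequencies.get(token, 0) + 1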

In [51]:
print(word_frequencies)

{'solicitor': 3, 'Umar': 2, 'fantastic': 1, 'introduced': 1, '10': 1, 'weeks': 2, 'process': 3, 'lot': 1, 'delays': 2, 'quickly': 2, 'got': 1, 'resolved': 1, 'impressed': 1, 'fast': 1, 'responsive': 1, 'highly': 2, 'recommend': 2, 'Overall': 1, 'think': 2, 'Muve': 1, 'good': 1, 'option': 1, 'competitive': 1, 'pricing': 1, 'unfortunately': 1, 'experience': 1, 'beginning': 1, 'cases': 1, 'stamp': 3, 'duty': 3, 'holiday': 2, 'assigned': 2, 'things': 1, 'moved': 1, 'reason': 1, '5': 1, 'starts': 1, 'overall': 1, 'getting': 1, 'soon': 1, 'start': 1, 'deadline': 1, 'selling': 1, 'existing': 1, 'property': 1, 'purchasing': 1, 'new': 1, 'smooth': 1, 'sailing': 1, 'issues': 1, 'initiated': 1, 'buyers': 1, 'started': 1, 'pop': 1, 'Ashleigh': 1, 'help': 1, 'navigate': 1, 'obstacles': 1, 'constant': 1, 'contact': 1, 'Keeping': 1, 'updated': 1, 'keeping': 1, 'application': 1, 'track': 1, 'Completed': 1, 'final': 1, 'day': 1, 'saved': 1, 'lots': 1, '£': 3, 'stressful': 1, 'end': 1, 'Thank': 1}
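
The table above stores 'Overall' and 'overall' under separate keys, while the sentence-scoring cell further down looks words up by their lowercase form. One way to keep the two steps consistent is to lowercase the keys when counting. The sketch below shows the idea on a plain list of strings; `count_words` and its parameters are hypothetical names, not part of the notebook, and the result would differ from the output shown above.

```python
from collections import Counter


def count_words(tokens, stop_words, punctuation_chars):
    """Count tokens case-insensitively, skipping stop words and punctuation."""
    counts = Counter()
    for tok in tokens:
        key = tok.lower()
        # Skip stop words and whitespace-only tokens such as '\n\n'.
        if key in stop_words or not key.strip():
            continue
        # Skip tokens made up entirely of punctuation characters.
        if all(ch in punctuation_chars for ch in key):
            continue
        counts[key] += 1
    return counts


tokens = ["Overall", "the", "process", "was", "fast", ".", "overall", "Umar", "\n\n"]
freq = count_words(tokens, stop_words={"the", "was"}, punctuation_chars=set("!.,?"))
print(freq)
```

Testing each character against a set also avoids relying on substring matching against the punctuation string.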


In [52]:
max_frequency = max(word_frequencies.values())

In [53]:
max_frequency

3

In [54]:
# Normalize each count by the highest count so that every weight falls in (0, 1].
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / max_frequency

In [55]:
print(word_frequencies)

{'solicitor': 1.0, 'Umar': 0.6666666666666666, 'fantastic': 0.3333333333333333, 'introduced': 0.3333333333333333, '10': 0.3333333333333333, 'weeks': 0.6666666666666666, 'process': 1.0, 'lot': 0.3333333333333333, 'delays': 0.6666666666666666, 'quickly': 0.6666666666666666, 'got': 0.3333333333333333, 'resolved': 0.3333333333333333, 'impressed': 0.3333333333333333, 'fast': 0.3333333333333333, 'responsive': 0.3333333333333333, 'highly': 0.6666666666666666, 'recommend': 0.6666666666666666, 'Overall': 0.3333333333333333, 'think': 0.6666666666666666, 'Muve': 0.3333333333333333, 'good': 0.3333333333333333, 'option': 0.3333333333333333, 'competitive': 0.3333333333333333, 'pricing': 0.3333333333333333, 'unfortunately': 0.3333333333333333, 'experience': 0.3333333333333333, 'beginning': 0.3333333333333333, 'cases': 0.3333333333333333, 'stamp': 1.0, 'duty': 1.0, 'holiday': 0.6666666666666666, 'assigned': 0.6666666666666666, 'things': 0.3333333333333333, 'moved': 0.3333333333333333, 'reason': 0.3333

In [56]:
sentence_tokens = list(doc.sents)
print(sentence_tokens)

[Our solicitor Umar was fantastic he was introduced to us 10 weeks into the process after quite a lot of delays and quickly got everything resolved., We were really impressed with how fast and responsive he was and I'd highly recommend him., Overall I think Muve is a good option with some very competitive pricing but unfortunately we did experience some delays at the beginning of the process which I think it has to do with the amount of cases they had due to the stamp duty holiday but once our solicitor was assigned things moved really quickly., That's the only reason why I wouldn't give them 5 starts overall, and I'd highly recommend getting a solicitor assigned as soon as you start the process., Would not have made the stamp duty deadline without them!, 

I was selling my existing property and purchasing a new one., It was all smooth sailing until the last few weeks where issues initiated by my buyers side, started to pop up out of nowhere., Umar and Ashleigh, really did everything t

In [57]:
# Score each sentence as the sum of the normalized frequencies of its words.
# The lookup uses lowercase text, so tokens stored only in capitalized form
# (e.g. 'Umar') add nothing to the score.
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        key = word.text.lower()
        if key in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[key]

In [58]:
sentence_scores

{Our solicitor Umar was fantastic he was introduced to us 10 weeks into the process after quite a lot of delays and quickly got everything resolved.: 6.0,
 We were really impressed with how fast and responsive he was and I'd highly recommend him.: 2.333333333333333,
 Overall I think Muve is a good option with some very competitive pricing but unfortunately we did experience some delays at the beginning of the process which I think it has to do with the amount of cases they had due to the stamp duty holiday but once our solicitor was assigned things moved really quickly.: 11.666666666666666,
 That's the only reason why I wouldn't give them 5 starts overall: 1.3333333333333333,
 and I'd highly recommend getting a solicitor assigned as soon as you start the process.: 5.0,
 Would not have made the stamp duty deadline without them!: 2.3333333333333335,
 
 
 I was selling my existing property and purchasing a new one.: 1.6666666666666665,
 It was all smooth sailing until the last few weeks w
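
Because each score is a plain sum, long sentences tend to score highest (the longest sentence above reaches 11.67). A possible variation, which is not what this notebook does, is to divide each sum by the number of weighted words in the sentence. The sketch below illustrates that on lists of lowercase words; `length_normalized_scores` and the toy data are assumptions made for the example.

```python
def length_normalized_scores(sentences, weights):
    """Average (rather than sum) the weights of the known words in each sentence."""
    scores = {}
    for idx, words in enumerate(sentences):
        hits = [weights[w] for w in words if w in weights]
        if hits:  # sentences without any weighted word get no score
            scores[idx] = sum(hits) / len(hits)
    return scores


# Toy weights in the same (0, 1] range as the normalized table above.
weights = {"solicitor": 1.0, "process": 1.0, "delays": 2 / 3, "fast": 1 / 3}
sentences = [
    ["solicitor", "process", "delays", "fast", "fast", "fast"],  # long sentence
    ["solicitor", "process"],                                    # short sentence
]

raw = [sum(weights.get(w, 0.0) for w in s) for s in sentences]
norm = length_normalized_scores(sentences, weights)
print(raw, norm)
```

With summed scores the long sentence wins; with averaged scores the short, dense one does. Which behaviour is preferable depends on the texts being summarized.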

In [59]:
from heapq import nlargest

In [60]:
# keep roughly 30% of the sentences
select_length = int(len(sentence_tokens) * 0.3)
select_length

3

In [61]:
# pick the highest-scoring sentences (ordered by score, not by position in the text)
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
summary

[Overall I think Muve is a good option with some very competitive pricing but unfortunately we did experience some delays at the beginning of the process which I think it has to do with the amount of cases they had due to the stamp duty holiday but once our solicitor was assigned things moved really quickly.,
 Completed on the final day of the stamp duty holiday and saved lots of £££.,
 Our solicitor Umar was fantastic he was introduced to us 10 weeks into the process after quite a lot of delays and quickly got everything resolved.]
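
The selected sentences come back ordered by score, so the review's opening sentence ends up last. If the summary should read in the original order, the selection can be re-sorted by position. Here is a minimal sketch on plain strings; `top_sentences_in_order` and the toy data are hypothetical.

```python
from heapq import nlargest


def top_sentences_in_order(sentences, scores, k):
    """Select the k best-scoring sentences, then restore their original order."""
    # scores[i] is assumed to be the score of sentences[i].
    best = nlargest(k, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(best)]


sentences = ["A first.", "B second.", "C third.", "D fourth."]
scores = [6.0, 2.3, 11.7, 7.0]
ordered = top_sentences_in_order(sentences, scores, 3)
print(ordered)
```

With spaCy spans, the same effect should be achievable by sorting the selected sentences on their `start` attribute, for example `sorted(summary, key=lambda s: s.start)`.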

In [62]:
final_summary = [sent.text for sent in summary]
final_summary

['Overall I think Muve is a good option with some very competitive pricing but unfortunately we did experience some delays at the beginning of the process which I think it has to do with the amount of cases they had due to the stamp duty holiday but once our solicitor was assigned things moved really quickly.',
 'Completed on the final day of the stamp duty holiday and saved lots of £££.',
 'Our solicitor Umar was fantastic he was introduced to us 10 weeks into the process after quite a lot of delays and quickly got everything resolved.']

In [63]:
summary = ' '.join(final_summary)
print(summary)

Overall I think Muve is a good option with some very competitive pricing but unfortunately we did experience some delays at the beginning of the process which I think it has to do with the amount of cases they had due to the stamp duty holiday but once our solicitor was assigned things moved really quickly. Completed on the final day of the stamp duty holiday and saved lots of £££. Our solicitor Umar was fantastic he was introduced to us 10 weeks into the process after quite a lot of delays and quickly got everything resolved.


In [64]:
len(text)

1268

In [65]:
len(summary)

532
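
The two lengths above can be combined into a single compression figure. A small sketch follows; the `compression_ratio` helper is a hypothetical name, and the placeholder strings only reproduce the character counts reported above.

```python
def compression_ratio(original, summary):
    """Return the summary length as a fraction of the original length (in characters)."""
    if not original:
        raise ValueError("original text must not be empty")
    return len(summary) / len(original)


# Placeholder strings with the same lengths as `text` (1268) and `summary` (532).
ratio = compression_ratio("x" * 1268, "y" * 532)
print(f"{ratio:.1%}")
```

The summary therefore keeps about 42% of the characters, even though only about 30% of the sentences were selected, because the highest-scoring sentences are also among the longest.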