## Text Summarizer

Algorithm
1. Clean up the text (remove stop words and punctuation, lemmatize, lowercase).
2. Feed the tokens through a counter to build a dict of tokens with their frequencies.
3. For each sentence, calculate a score: for each word in the sentence, add its score if it is in the dictionary; otherwise ignore it.
4. Summarize by picking the 20-30% of sentences with the highest scores.
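
The four steps can be sketched end to end with only the standard library. This is a minimal illustration under stated assumptions, not the notebook's implementation: the tiny `STOP_WORDS` set, the regex tokenizer, and the helper names `tokenize`, `word_scores`, and `summarize` are all made up for the example, and lemmatization is skipped (the cells below use spaCy for that).

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (spaCy's list is far larger).
STOP_WORDS = {"a", "an", "the", "is", "are", "it", "of", "to", "and", "in", "for", "your", "you"}

def tokenize(text):
    """Step 1: lowercase, drop punctuation and stop words (no lemmatization here)."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]

def word_scores(text):
    """Step 2: relative frequency of each remaining token."""
    tokens = tokenize(text)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {word: n / total for word, n in Counter(tokens).items()}

def summarize(text, ratio=0.3):
    """Steps 3 and 4: score each sentence and keep the top `ratio` share."""
    scores = word_scores(text)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    ranked = sorted(
        sentences,
        key=lambda s: sum(scores.get(w, 0.0) for w in tokenize(s)),
        reverse=True,
    )
    keep = max(1, int(len(sentences) * ratio))
    return ranked[:keep]

sample = "Long text helps ranking. Long text gives Google more clues. Cats sleep."
print(summarize(sample, ratio=0.34))
# ['Long text gives Google more clues.']
```

In this toy input the longest sentence wins simply because it contains more scored words, which is the bias the note below describes.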

NOTE: An obvious flaw in this approach is that the longer the sentence, the more likely it is to receive a high importance score. The score needs to be normalized for that, perhaps by dividing it by the length of the sentence.
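
One way that normalization might look, as a sketch only: the function name `sentence_score`, the choice to measure length in words (stop words included), and the importance values are assumptions for illustration, not part of the notebook.

```python
def sentence_score(words, word_importance, normalize=True):
    """Sum the word scores; optionally divide by the number of words in the sentence."""
    total = sum(word_importance.get(w, 0.0) for w in words)
    if normalize and words:
        return total / len(words)
    return total

# Made-up importance values, for illustration only.
word_importance = {"long": 0.5, "text": 0.25}

short = ["long", "text"]
padded = ["long", "text", "really", "very", "much", "long"]

# Raw sums favor the padded sentence.
print(sentence_score(short, word_importance, normalize=False))   # 0.75
print(sentence_score(padded, word_importance, normalize=False))  # 1.25

# After dividing by length, the denser short sentence ranks higher.
print(sentence_score(short, word_importance))   # 0.375
print(sentence_score(padded, word_importance))
```

Dividing by the full word count penalizes filler words; dividing by the number of non-stop-word tokens instead would be a gentler variant.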

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
import re
import spacy
# !python -m spacy download en_core_web_sm
nlp=spacy.load("en_core_web_sm")

In [6]:
doc="""
Defining text length is quite obvious: it’s how long your text is. But, why does it matter? Well, you have a higher chance of ranking in Google if you write long, high-quality blog posts, of 1000 words or more. We’ve also experienced this ourselves; we have written quite some articles that are over 2500 words. They are cornerstone content and they help our organic traffic grow. Here’s how long articles contribute to SEO:

When your text is longer, Google has more clues to determine what it is about. The longer your (optimized) text, the more often your focus keyphrase appears. This is no excuse for keyphrase stuffing though! If you optimize your copy naturally, your focus keyphrase will pop up here and there throughout your text. You can also fit in more synonyms and related keyphrases. In a longer post, you can add more headings, links, and images, in which you can also mention the keyphrase. So more content, means more on-topic, high-quality information here.

A longer text might also help you rank for multiple long-tail variants of the keyphrase you’ve optimized your text for. That’s because, in a lengthy text, you probably address various topics. Your article, or your other posts that take a deep-dive into the subtopic, will have a chance to turn up in search results for the long-tail variants of your keyphrase. If you do some smart internal linking you can even boost the traffic to the extensive post you’ve written. This will help you drive more organic traffic to your site.

Also, if a page consists of few words, Google is more likely to think of it as thin content. All search engines want to provide the best answers to the queries people have. Thin content is less likely to offer a complete answer and satisfy the needs of the public. Consequently, it will probably not rank very high.







"""

In [7]:
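def clean_review(review):
    """Return the lowercase lemmas of `review`, without stop words or punctuation."""
    review = review.strip()
    review = re.sub('\n', ' ', review)  # flatten line breaks
    review = review.lower()             # lower case
    tokens = nlp(review)
    tokens = [token for token in tokens if not token.is_stop]   # remove stop words
    tokens = [token for token in tokens if not token.is_punct]  # remove punctuation
    # Note: whitespace tokens (from the doubled spaces left by blank lines)
    # are not filtered here, so ' ' can show up as a "word".
    # spaCy 2.x lemmatizes pronouns to "-PRON-"; keep the surface text in that
    # case so the result is always a list of strings.
    return [token.lemma_ if token.lemma_ != "-PRON-" else token.text for token in tokens]
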

In [8]:
nlpdoc=clean_review(doc)

In [9]:
nlpdoc[:10]

['define',
 'text',
 'length',
 'obvious',
 'long',
 'text',
 'matter',
 'high',
 'chance',
 'rank']

In [10]:
cv=Counter(nlpdoc)
# cv

In [11]:
# normalize counts to relative frequencies, then sort by importance (descending)
total_words = len(nlpdoc)
word_importance = {k: v / total_words for k, v in cv.items()}
word_importance = dict(sorted(word_importance.items(), key=lambda x: x[1], reverse=True))

In [12]:

# word_importance

In [13]:
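# Score each sentence of the original document by summing its word scores.
# Note: words are matched as written, so capitalized, inflected, or
# punctuation-adjacent forms (e.g. "Defining", "longer", "text,") miss the
# lowercase lemma keys in word_importance.
sentence_dictionary = {}
for sentence in doc.split('.'):  # naive split: '?' and '!' do not end a sentence
    sentence = re.sub('[.-]', ' ', sentence)  # replace hyphens with spaces
    score = 0
    for word in sentence.split():
        if word in word_importance:
            score += word_importance[word]
    if sentence not in sentence_dictionary:
        sentence_dictionary[sentence] = score
sentence_dictionary = dict(sorted(sentence_dictionary.items(), key=lambda x: x[1], reverse=True))
sentence_dictionary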

{'\n\nA longer text might also help you rank for multiple long tail variants of the keyphrase you’ve optimized your text for': 0.2789115646258503,
 ' This is no excuse for keyphrase stuffing though! If you optimize your copy naturally, your focus keyphrase will pop up here and there throughout your text': 0.20408163265306123,
 '\nDefining text length is quite obvious: it’s how long your text is': 0.17687074829931973,
 ' Your article, or your other posts that take a deep dive into the subtopic, will have a chance to turn up in search results for the long tail variants of your keyphrase': 0.17006802721088435,
 ' Here’s how long articles contribute to SEO:\n\nWhen your text is longer, Google has more clues to determine what it is about': 0.1292517006802721,
 ' They are cornerstone content and they help our organic traffic grow': 0.09523809523809522,
 ' But, why does it matter? Well, you have a higher chance of ranking in Google if you write long, high quality blog posts, of 1000 words or 
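
The keys in the output above show a side effect of splitting on `'.'` alone: sentences ending in `!` or `?` stay glued to the next one, and paragraph breaks survive as `\n`. A possible alternative is sketched below. The helper name `split_sentences` and the sample text are assumptions for illustration, and the regex is a rough heuristic that will still split on abbreviations such as "e.g.".

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace; collapse inner whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(p.split()) for p in parts if p.strip()]

sample = (
    "This is no excuse for keyphrase stuffing though! If you optimize your copy "
    "naturally, it works.\n\nBut, why does it matter? Well, it does."
)
for sentence in split_sentences(sample):
    print(sentence)
```

Since the text is already parsed with spaCy, iterating over the parsed document's `sents` attribute is another option that avoids hand-written rules.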

In [14]:
total_sentences=len(doc.strip().split('.'))
summary_sentence_count=int(total_sentences*0.3)

In [15]:
reordered_document=sentence_dictionary.keys()
reordered_document=list(reordered_document)[:summary_sentence_count]
summary='.'.join(reordered_document)
print(f'Total sentences: {total_sentences}........ Summary Sentences: {summary_sentence_count}')
print('\nImportant Words:')
for word,score in list(word_importance.items())[:15]:
    print(f'{word}...{score}')
print('\nSUMMARY',summary)
print('\n\n\nORIGINAL',doc)

Total sentences: 20........ Summary Sentences: 6

Important Words:
long...0.061224489795918366
text...0.05442176870748299
keyphrase...0.047619047619047616
high...0.027210884353741496
post...0.027210884353741496
content...0.027210884353741496
rank...0.02040816326530612
google...0.02040816326530612
write...0.02040816326530612
word...0.02040816326530612
article...0.02040816326530612
help...0.02040816326530612
traffic...0.02040816326530612
 ...0.02040816326530612
optimize...0.02040816326530612

SUMMARY 

A longer text might also help you rank for multiple long tail variants of the keyphrase you’ve optimized your text for. This is no excuse for keyphrase stuffing though! If you optimize your copy naturally, your focus keyphrase will pop up here and there throughout your text.
Defining text length is quite obvious: it’s how long your text is. Your article, or your other posts that take a deep dive into the subtopic, will have a chance to turn up in search results for the long tail variants o
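
The summary above lists sentences in score order, so the opening sentence of the article appears after one from its third paragraph. If document order is preferred, the top sentences can be selected by score and then re-sorted by position. The following is a sketch only: `top_sentences_in_order`, the sample sentences, and the scores are hypothetical.

```python
def top_sentences_in_order(sentences, scores, k):
    """Pick the k highest-scoring sentences, then return them in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])  # restore original positions
    return [sentences[i] for i in chosen]

sentences = ["First point.", "Minor aside.", "Key claim.", "Closing remark."]
scores = [0.2, 0.05, 0.3, 0.1]  # made-up scores

print(" ".join(top_sentences_in_order(sentences, scores, k=2)))
# First point. Key claim.
```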