# Summarizing text documents
* How to quickly summarize Wikipedia articles
* How to summarize an entire book using various packages in Python

In [None]:
!pip install sumy
!pip install PyPDF2



In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
import PyPDF2
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0
from gensim.summarization import summarize, keywords

In [None]:
! sumy lex-rank --length=10 --url=http://en.wikipedia.org/wiki/Automatic_summarization

An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.
The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary".
In both algorithms, the sentences are ranked by applying PageRank to the resulting graph.
Automatic Text Summarization .
"Learning Algorithms for Keyphrase Extraction".
"A method for evaluating modern systems of automatic text summarization".
Automatic Summarization .
Automatic Keyphrases Extraction .
Automatic Summarization .
"Summarizing Conceptual Graphs for Automatic Summarization Task".


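* ! sumy lex-rank --length=10 --url=... breaks down as follows:
* ! -> Runs the rest of the line as a shell command from the notebook
* sumy -> Command-line tool installed with the sumy package
* lex-rank -> Summarization algorithm to use (LexRank)
* --length=10 -> Number of sentences to return from the article
* --url=... -> Web page to download and summarize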

In [None]:
! sumy lex-rank --length=1% --url=http://en.wikipedia.org/wiki/Automatic_summarization

An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.
Automatic Text Summarization .
"Learning Algorithms for Keyphrase Extraction".
Automatic Keyphrases Extraction .


* --length=1% -> Return only 1% of the article's sentences instead of a fixed count
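The two forms of the length option (a fixed count such as 10, or a percentage such as 1%) can be sketched in a few lines of Python. The helper below, `sentences_to_keep`, is a hypothetical illustration and not sumy's own code; the rule of always keeping at least one sentence is an assumption made for the sketch.

```python
def sentences_to_keep(length_spec, total_sentences):
    """Translate a length such as "10" or "1%" into a sentence count.

    Hypothetical helper for illustration; not part of sumy.
    """
    spec = str(length_spec).strip()
    if spec.endswith("%"):
        percentage = int(spec[:-1])
        # Assumption: keep at least one sentence, even for short documents.
        count = max(1, total_sentences * percentage // 100)
    else:
        count = int(spec)
    # Never ask for more sentences than the document contains.
    return min(count, total_sentences)


print(sentences_to_keep("10", 400))  # fixed count
print(sentences_to_keep("1%", 400))  # 1% of 400 sentences
print(sentences_to_keep("1%", 30))   # rounds down to 0, so 1 is kept
```

A percentage keeps the summary proportional to the article, which is convenient when the same command is run over pages of very different lengths.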

In [None]:
doc1 = "Mauritius Prime Minister has declared a state of environmental \
emergency and appealed to France for urgent assistance as oil \
from a grounded cargo ship spilled unabated \
into the island nation's protected waters. According to reports, \
rough seas have hampered efforts to stop fuel leaking from the \
bulk carrier MV Wakashio and is polluting pristine waters \
in an ecologically critical marine area off the southeast coast. \
The tanker was carrying 3,800 tonnes of fuel when it struck a \
reef at Pointe d'Esny, an internationally-listed conservation \
site near the turquoise waters of the Blue Bay marine park. \
Meanwhile, Environment Ministry announced this week that oil \
had begun seeping from the hull, as volunteers rushed to the \
coast to prepare for the worst.Taking \
to Twitter, PM Pravind Jugnauth said, \
A state of environmental emergency has been declared."

In [None]:
parser = PlaintextParser.from_string(doc1, Tokenizer('english'))

* We need a parser to read the document and collect the information the summarizer needs
* PlaintextParser.from_string -> Splits the text into sentences and words using the given tokenizer
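To make the parsing step more concrete, here is a minimal sketch of what splitting text into sentences and words involves. It uses naive regular expressions as a stand-in; the NLTK punkt tokenizer behind `Tokenizer('english')` is considerably more sophisticated, and the function names `split_sentences` and `split_words` are made up for this example.

```python
import re


def split_sentences(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_words(sentence):
    # Lowercase the sentence and keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", sentence.lower())


sample = "Oil spilled into protected waters. Rough seas hampered the clean-up."
for sentence in split_sentences(sample):
    print(sentence, "->", split_words(sentence))
```

The summarizer works on these units: sentences are the items it ranks, and the words inside them are what it compares.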

In [None]:
summarizer = LexRankSummarizer()
sentences = summarizer(parser.document, 3) # pass the parsed document and the number of sentences to keep (here 3)

for x in sentences:
  print(x)
  print('-'*50)

Mauritius Prime Minister has declared a state of environmental emergency and appealed to France for urgent assistance as oil from a grounded cargo ship spilled unabated into the island nation's protected waters.
--------------------------------------------------
According to reports, rough seas have hampered efforts to stop fuel leaking from the bulk carrier MV Wakashio and is polluting pristine waters in an ecologically critical marine area off the southeast coast.
--------------------------------------------------
The tanker was carrying 3,800 tonnes of fuel when it struck a reef at Pointe d'Esny, an internationally-listed conservation site near the turquoise waters of the Blue Bay marine park.
--------------------------------------------------


The summarizer returned the 3 highest-ranked sentences of the document as the summary.
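How does a sentence end up "highest-ranked"? The sketch below shows the general idea behind LexRank in plain Python: build a graph linking similar sentences, then score each sentence with a PageRank-style iteration. It is a simplified illustration and not sumy's implementation; it uses raw word counts rather than TF-IDF weights, and the `threshold`, `damping`, and `iterations` values are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    # Cosine similarity between two bags of words.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_sentences(sentences, top_n=3, threshold=0.1, damping=0.85, iterations=50):
    if not sentences:
        return []
    bags = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(bags)
    # Link two different sentences when their similarity reaches the threshold.
    links = [[1.0 if i != j and cosine(bags[i], bags[j]) >= threshold else 0.0
              for j in range(n)] for i in range(n)]
    out_degree = [sum(row) for row in links]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence collects score from the sentences linked to it.
        scores = [(1 - damping) / n + damping * sum(
                      scores[j] * links[j][i] / out_degree[j]
                      for j in range(n) if links[j][i])
                  for i in range(n)]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(best)]


sents = [
    "Oil leaked from the ship into the sea.",
    "The ship struck a reef near the coast.",
    "Volunteers rushed to the coast to protect the sea.",
    "Football season starts next week.",
]
for sentence in rank_sentences(sents, top_n=3):
    print(sentence)
```

The unrelated football sentence shares no words with the others, so it receives no links and is left out of the top three.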

# Join the sentences into a single string

In [None]:
summary = ' '.join(str(x) for x in sentences) # join the selected sentences with spaces
print(summary)

Mauritius Prime Minister has declared a state of environmental emergency and appealed to France for urgent assistance as oil from a grounded cargo ship spilled unabated into the island nation's protected waters. According to reports, rough seas have hampered efforts to stop fuel leaking from the bulk carrier MV Wakashio and is polluting pristine waters in an ecologically critical marine area off the southeast coast. The tanker was carrying 3,800 tonnes of fuel when it struck a reef at Pointe d'Esny, an internationally-listed conservation site near the turquoise waters of the Blue Bay marine park. 


# Summarize PDF

In [None]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
downloaded = drive.CreateFile({'id':'1xmcwhXmHHT5S9cMo_iYjltY1UUFx5JpM'})
downloaded.GetContentFile('everythingfinance.pdf') 

In [None]:
# PdfFileReader is the PyPDF2 < 3.0 API; PyPDF2 3.0 replaced it with PdfReader
pdf = PyPDF2.PdfFileReader(open('everythingfinance.pdf','rb'))

In [None]:
pdf

<PyPDF2.pdf.PdfFileReader at 0x7fb126c46630>

In [None]:
read_pages = lambda pdf, pg_num : pdf.getPage(pg_num).extractText().encode('utf-8').decode('utf-8') 
# read_pages takes the PDF reader (pdf) and a zero-based page number (pg_num)
# and returns the text extracted from that page as a string.
# The encode/decode round trip through UTF-8 leaves valid text unchanged.

In [None]:
page = read_pages(pdf, 10) # page numbers are zero-based, so this reads the 11th page
page

'market, or maybe you deeply enjoy gardening and can sell vegetables. There are lots of \npossibilities out there for starting a business that will supplement your current income and perhaps eventually grow into your main income [20].Move Towards Your Passions\nWhenever the opportunity presents itself, gravitate towards the things that really excite you, because passion is what will make you successful [21]. For me, my passion is writing, so I!ve made an effort to gravitate towards it by working on The Simple Dollar in \nmy spare time. For others, it could be anything - maybe it!s leading a team, or perhaps it!s writing beautiful computer code. Whatever really excites you and makes you want to do more and more and more and better and better and better, that\n!s what you need to move towards at all times [22].Don"t Burn BridgesYou never know when a relationship you\n!ve forged in your past might come in handy later on, even the ones you completely don!t expect. Thus, even if you feel wr
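The extracted page text contains artifacts: `!` or `"` where an apostrophe belongs (`I!ve`, `Don"t`), stray line breaks, and sentences fused together without a space (`[22].Don`). Later outputs also show `Þ` where the letters "fi" belong (`Þnance`). A minimal cleanup sketch follows. The function `clean_extracted_text` is hypothetical, and its replacement rules are assumptions inferred from this one PDF's output, so they may misfire on other text.

```python
import re


def clean_extracted_text(text):
    # Assumption: "Þ" stands in for an "fi" ligature lost during extraction.
    text = text.replace("Þ", "fi")
    # Assumption: "!" or '"' directly before a contraction ending
    # (s, t, ve, ll, re, d, m) is a mangled apostrophe.
    text = re.sub(r'\s*[!"](?=(?:s|t|ve|ll|re|d|m)\b)', "'", text)
    # Collapse line breaks and runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    # Restore the space missing between a sentence end and the next capital.
    text = re.sub(r"(?<=[a-z\]][.?])(?=[A-Z])", " ", text)
    return text.strip()


raw = 'that\n!s what you need [22].Don"t Burn Bridges. I!ve found Þnance'
print(clean_extracted_text(raw))
```

Cleaning the text before parsing should help the sentence tokenizer, since fused sentences such as `[22].Don"t` otherwise reach the summarizer as a single long sentence.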

In [None]:
summarizer = LexRankSummarizer()
parser = PlaintextParser.from_string(page, Tokenizer('english'))
sentences = summarizer(parser.document, 3) # pass the parsed page and the number of sentences to keep (here 3)

[x for x in sentences]

# for x in sentences:
#   print(x)
#   print('-'*50)

[<Sentence: For me, my passion is writing, so I!ve made an effort to gravitate towards it by working on The Simple Dollar in my spare time.>,
 <Sentence: Whatever really excites you and makes you want to do more and more and more and better and better and better, that !s what you need to move towards at all times [22].Don"t Burn BridgesYou never know when a relationship you !ve forged in your past might come in handy later on, even the ones you completely don!t expect.>,
 <Sentence: Never spread a negative word about anyone, because it never helps.Keep in Touch When you do build a bridge with someone, don!t let it get old and worn out - spend the time to keep in touch with that person.>]

# Read whole document

In [None]:
pages = [read_pages(pdf, i) for i in range(pdf.numPages)]
doc2 = '\n'.join(pages) # joining the pages to form a single doc


In [None]:
summarizer = LexRankSummarizer()
parser = PlaintextParser.from_string(doc2, Tokenizer('english'))
sentences = summarizer(parser.document, 10) # pass the parsed document and the number of sentences to keep (here 10)

[x for x in sentences]

[<Sentence: Whatever really excites you and makes you want to do more and more and more and better and better and better, that !s what you need to move towards at all times [22].Don"t Burn BridgesYou never know when a relationship you !ve forged in your past might come in handy later on, even the ones you completely don!t expect.>,
 <Sentence: Or is it something that you just do out of habit at this point?>,
 <Sentence: Realize that what your children want most of all is your time, not your stuff, and you !ll Þnd money in your pocket and joy in your heart.11.>,
 <Sentence: Instead of eating fast food or just nuking some prepackaged food when you get home, try making some simple and healthy replacements that you can take with you [39], like homemade bulk breakfast burritos [40].. An hour!s worth of preparation one weekend can give you a ton of cheap and handy meals that will end up saving you a lot of cash and not eat into your time when you!re busy.>,
 <Sentence: If you spend time with

The result condenses the entire PDF into its 10 highest-ranked sentences.

# Summarizer from Gensim library

In [None]:
# ratio=0.1 keeps about 10% of the sentences; split=True returns them as a list
for x in summarize(doc2, split=True, ratio=0.1):
  print(x)
  print('---')

can click on those footnote numbers and immediately jump to online resources that expand upon that point.The hardest part of personal Þnance is just having the courage to take that Þrst step.Sharing This DocumentThis document is being freely distributed under the Creative Commons Attribution-Share 
---
One, if you write about this on your website, include a link back to the original source of the document - http://www.thesimpledollar.com/onepage/
---
new and compelling with it, use it in a classroom, use it in a major media source), please let me know by dropping me an email at trent@thesimpledollar.com
---
In short, I had little idea how to manage my own money, and when I left home for college, I made a long 
---
Charles Dickens, David CopperÞeldIn the end, this is the fundamental rule of personal Þnance: spend less than you earn [7].
---
It!s the one point that comes up time and time again in almost every personal Þnance book you read [8] or talk that you hear.
---
you have a house d

In [None]:
(keywords(doc2).split())[:10] # gensim's keywords() extracts important keywords from the text; show the first 10

['money',
 'like',
 'likely',
 'free',
 'personal',
 'person',
 'personally',
 'savings',
 'saving',
 'save']
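For intuition about keyword extraction, here is a much simpler frequency-based sketch. It is not what gensim does (gensim's `keywords` uses a graph-based TextRank variant); the function `top_keywords` and the tiny stop-word list are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
              "you", "your", "that", "for"}


def top_keywords(text, n=5):
    # Count lowercase words, skipping stop words and very short tokens.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]


sample = ("Save money every month. Saving money is the first rule of "
          "personal finance, and money saved is money earned.")
print(top_keywords(sample, 3))
```

Plain counting treats "save", "saving", and "saved" as unrelated words, which is one reason graph-based or stem-aware methods tend to give better keyword lists.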