Extract - News - Summary
This is a pure python package that returns summary of news articles for a serach term provided by the user. The script automates the process of finding appropriate news links, extracting text and summarizing it.
The final script extracts URLs from Google News, so it works best where the search query is a current affaird topic.
Following are the different modules inside the ENS folder:
bsReadability - This module takes uses BeautifulSoup to extract boilerplate from a given URL
lxmlReadability - This module uses the more faster lxml library to extract boilerplate from a given URL, however the library is less robust to badly formed HTML and encoding
GoogleRSSReader - This module takes in a search query and returns the scraped URLs from the Google RSS reader. The Google RSS reader is not strict in their scraping policies
TextRankSummarize - This module uses a modified PageRank algorithm to mark the most important sentences/phrases in a text. A more graphical and intuitive approach than word-frequency. (NOTE: The parameter defining the number of nodes to be selected for final summary is hardcoded in this script, to play around with it you need to make appropriate changes here)
- nltk (Remember to install the punkt.tokenzier seperately)
git clone https://github.com/abhinavgupta/Extract-News-Summary cd Extract-News-Summary/ sudo sh install.sh
There are two versions of the script, one using BeautifulSoup, the other using lxml. This is done specifcially for benchmarking purposes give the pros and cons of both parsers.
To use the BeautifulSoup version via terminal:
ENS_Soup 5 Narendra Modi
To use the lxml version:
ENS_lxml 5 Narendra Modi
from ENS import Document, fetch_url, textRank, newsSearch links = newsSearch("RBI Governor", 5) for link in links: article = Document(url=link).summary() article = re.sub(regex, "", article) article = article.encode('ascii','ignore') summary = textRank(article) summary = summary.encode('ascii','ignore') print article print "*** SUMMARY ***" print summary
- Abhinav Gupta
- Remove networkx and sklearn dependencies
- Add own tokenizer and remove nltk dependency
- Solve encoding issue
- Benchmark the summary
- Improve text extraction