In [23]:
from collections import OrderedDict
import pprint

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords

In [34]:
text = '''
8 Open Source Big Data Tools to use in 2018
Go to the profile of Vladimir Fedak
Vladimir Fedak
Aug 29, 2018
Big Data analytics is an essential part of any business workflow nowadays. To make the most of it, we recommend using these popular open source Big Data solutions for each stage of data processing.

Why opting for open source Big Data tools and not for proprietary solutions, you might ask? The reason became obvious over the last decade — open sourcing the software is the way to make it popular.

Developers prefer to avoid vendor lock-in and tend to use free tools for the sake of versatility, as well as due to the possibility to contribute to the evolvement of their beloved platform. Open source products boast the same, if not better level of documentation depth, along with a much more dedicated support from the community, who are also the product developers and Big Data practitioners, who know what they need from a product. Thus said, this is the list of 8 hot Big Data tool to use in 2018, based on popularity, feature richness and usefulness.

1. Apache Hadoop
The long-standing champion in the field of Big Data processing, well-known for its capabilities for huge-scale data processing. This open source Big Data framework can run on-prem or in the cloud and has quite low hardware requirements. The main Hadoop benefits and features are as follows:

HDFS — Hadoop Distributed File System, oriented at working with huge-scale bandwidth
MapReduce — a highly configurable model for Big Data processing
YARN — a resource scheduler for Hadoop resource management
Hadoop Libraries — the needed glue for enabling third party modules to work with Hadoop
2. Apache Spark
Apache Spark is the alternative — and in many aspects the successor — of Apache Hadoop. Spark was built to address the shortcomings of Hadoop and it does this incredibly well. For example, it can process both batch data and real-time data, and operates 100 times faster than MapReduce. Spark provides the in-memory data processing capabilities, which is way faster than disk processing leveraged by MapReduce. In addition, Spark works with HDFS, OpenStack and Apache Cassandra, both in the cloud and on-prem, adding another layer of versatility to big data operations for your business.

3. Apache Storm
Storm is another Apache product, a real-time framework for data stream processing, which supports any programming language. Storm scheduler balances the workload between multiple nodes based on topology configuration and works well with Hadoop HDFS. Apache Storm has the following benefits:

Great horizontal scalability
Built-in fault-tolerance
Auto-restart on crashes
Clojure-written
Works with Direct Acyclic Graph(DAG) topology
Output files are in JSON format
4. Apache Cassandra
Apache Cassandra is one of the pillars behind Facebook’s massive success, as it allows to process structured data sets distributed across huge number of nodes across the globe. It works well under heavy workloads due to its architecture without single points of failure and boasts unique capabilities no other NoSQL or relational DB has, such as:

Great liner scalability
Simplicity of operations due to a simple query language used
Constant replication across nodes
Simple adding and removal of nodes from a running cluster
High fault tolerance
Built-in high-availability
5. MongoDB
MongoDB is another great example of an open source NoSQL database with rich features, which is cross-platform compatible with many programming languages. IT Svit uses MongoDB in a variety of cloud computing and monitoring solutions, and we specifically developed a module for automated MongoDB backups using Terraform. The most prominent MongoDB features are:

Stores any type of data, from text and integer to strings, arrays, dates and boolean
Cloud-native deployment and great flexibility of configuration
Data partitioning across multiple nodes and data centers
Significant cost savings, as dynamic schemas enable data processing on the go
6. R Programming Environment
R is mostly used along with JuPyteR stack (Julia, Python, R) for enabling wide-scale statistical analysis and data visualization. JupyteR Notebook is one of 4 most popular Big Data visualization tools, as it allows composing literally any analytical model from more than 9,000 CRAN (Comprehensive R Archive Network) algorithms and modules, running it in a convenient environment, adjusting it on the go and inspecting the analysis results at once. The main benefits of using R are as follows:

R can run inside the SQL server
R runs on both Windows and Linux servers
R supports Apache Hadoop and Spark
R is highly portable
R easily scales from a single test machine to vast Hadoop data lakes
7. Neo4j
Neo4j is an open source graph database with interconnected node-relationship of data, which follows the key-value pattern in storing data. IT Svit has recently built a resilient AWS infrastructure with Neo4j for one of our customers and the database performs well under heavy workload of network data and graph-related requests. Main Neo4j features are as follows:

Built-in support for ACID transactions
Cypher graph query language
High-availability and scalability
Flexibility due to the absence of schemas
Integration with other databases
8. Apache SAMOA
This is another of the Apache family of tools used for Big Data processing. Samoa specializes at building distributed streaming algorithms for successful Big Data mining. This tool is built with pluggable architecture and must be used atop other Apache products like Apache Storm we mentioned earlier. Its other features used for Machine Learning include the following:

Clustering
Classification
Normalization
Regression
Programming primitives for building custom algorithms
Using Apache Samoa enables the distributed stream processing engines to provide such tangible benefits:

Program once, use anywhere
Reuse the existing infrastructure for new projects
No reboot or deployment downtime
No need for backups or time-consuming updates
Final thoughts on the list of hot Big Data tools for 2018
Big Data industry and data science evolve rapidly and progressed a big deal lately, with multiple Big Data projects and tools launched in 2017. This is one of the hottest IT trends of 2018, along with IoT, blockchain, AI & ML.

Big Data analytics is increasingly widespread in multiple industries, from using ML in banking and financial services to healthcare and government, and open source Big Data tools are the mainframe of any Big Data architect’s toolkit. In case you have any difficulties with Big Data implementation — don’t hesitate to contact IT Svit, we would be glad to help!
'''

In [35]:
summary_sentences = []
candidate_sentences = {}
candidate_sentence_count = {}

In [44]:
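# Parse text: drop line breaks, tokenize, keep alphabetic non-stopword tokens.
striptext = text.replace('\r', '').replace('\n', '')
words = word_tokenize(striptext)
# Build the stopword set once; calling stopwords.words() for every token is slow.
stop_words = set(stopwords.words())
# The stopword check runs before lowercasing, so capitalized forms such as 'The' pass through.
words_lower = [word.lower() 
               for word in words 
               if word not in stop_words 
               and word.isalpha()]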
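
The parsing cell deletes line breaks outright, so a line that ends without a trailing space is fused with the next one; fused tokens such as `projectsno` in the keyword output further down are a symptom of this. A minimal sketch of the difference, using a made-up three-line sample rather than the article text:

```python
# Hypothetical sample: three short lines with no trailing spaces.
sample = "Clustering\nClassification\nFinal thoughts"

# Deleting newlines glues the last word of one line to the first word of the next.
fused = sample.replace('\r', '').replace('\n', '')

# Splitting on any whitespace and re-joining with a single space keeps words apart.
spaced = ' '.join(sample.split())

print(fused)   # ClusteringClassificationFinal thoughts
print(spaced)  # Clustering Classification Final thoughts
```

Joining with a space keeps word boundaries intact, although list items that lack end punctuation will still be merged into one long sentence by a sentence tokenizer.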

In [45]:
# Select the 20 most frequent words.
word_frequencies = FreqDist(words_lower)
most_frequent_words = word_frequencies.most_common(20)

In [46]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(most_frequent_words)

[   ('data', 36),
    ('big', 18),
    ('apache', 14),
    ('open', 9),
    ('source', 8),
    ('tools', 8),
    ('processing', 8),
    ('hadoop', 8),
    ('well', 5),
    ('features', 5),
    ('it', 5),
    ('r', 5),
    ('use', 4),
    ('using', 4),
    ('the', 4),
    ('due', 4),
    ('benefits', 4),
    ('follows', 4),
    ('distributed', 4),
    ('spark', 4)]


In [47]:
# Map each original sentence to its lowercased form.
sentences = sent_tokenize(striptext)
for sent in sentences:
    candidate_sentences[sent] = sent.lower()

In [48]:
# Score each sentence by summing the frequencies of the top words it contains (substring match).
for long, short in candidate_sentences.items():
    count = 0
    for freq_word, freq_score in most_frequent_words:
        if freq_word in short:
            count += freq_score
            candidate_sentence_count[long] = count
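
The scoring loop tests `freq_word in short`, which is a substring test on the whole sentence, so a short entry such as `'r'` matches any sentence containing that letter. A minimal sketch of a whole-word variant follows; the helper name `score_sentence` and the regex-based tokenization are assumptions made for illustration, not part of the notebook:

```python
import re

def score_sentence(sentence, top_words):
    """Sum the scores of the top words that occur as whole tokens in the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return sum(score for word, score in top_words if word in tokens)

# Two entries taken from the frequency list printed earlier in the notebook.
top_words = [('data', 36), ('r', 5)]
sentence = "Spark processes data."

# Substring matching, as in the cell above: 'r' is found inside 'spark'.
substring_score = sum(score for word, score in top_words if word in sentence.lower())
whole_word_score = score_sentence(sentence, top_words)

print(substring_score)   # 41
print(whole_word_score)  # 36
```

As in the original loop, each top word contributes at most once per sentence, so only the matching rule changes.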

In [49]:
sorted_sentences = OrderedDict(sorted(candidate_sentence_count.items(), 
                                      key=lambda t: t[1],
                                      reverse=True)[:4])
pp.pprint(sorted_sentences)

OrderedDict([   (   'Its other features used for Machine Learning include the '
                    'following:ClusteringClassificationNormalizationRegressionProgramming '
                    'primitives for building custom algorithmsUsing Apache '
                    'Samoa enables the distributed stream processing engines '
                    'to provide such tangible benefits:Program once, use '
                    'anywhereReuse the existing infrastructure for new '
                    'projectsNo reboot or deployment downtimeNo need for '
                    'backups or time-consuming updatesFinal thoughts on the '
                    'list of hot Big Data tools for 2018Big Data industry and '
                    'data science evolve rapidly and progressed a big deal '
                    'lately, with multiple Big Data projects and tools '
                    'launched in 2017.',
                    119),
                (   'The main Hadoop benefits and features are as '
      

In [56]:
# gensim.summarization was removed in gensim 4.0; this import needs gensim < 4.0.
from gensim.summarization import summarize, keywords

In [55]:
summary = summarize(striptext, word_count=50)
summary

'The main Hadoop benefits and features are as follows:HDFS\u200a—\u200aHadoop Distributed File System, oriented at working with huge-scale bandwidthMapReduce\u200a—\u200aa highly configurable model for Big Data processingYARN\u200a—\u200aa resource scheduler for Hadoop resource managementHadoop Libraries\u200a—\u200athe needed glue for enabling third party modules to work with Hadoop2.'

In [58]:
print(keywords(striptext))

data
apache
big
processing
process
hadoop benefits
graph
spark
scales
times
time
great
mongodb
run
running
runs
use
usefulness
uses
node
platform
products
source
sourcing
distributed
product developers
query
developed
follows
following
open
fault
multiple nodes
language
languages
tools
tool
popular
enabling
enable
enables
database
databases
samoa
composing
hardware
arrays
projectsno
environment
adding
vendor
hdfs
main
configurable
configuration
popularity feature
stream
streaming
jupyter
solutions
support
supports
aws infrastructure
features
scheduler
programming


In [61]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.edmundson import EdmundsonSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

In [63]:
LANGUAGE = 'english'
SENTENCES_COUNT = 4

In [66]:
%%bash 
ls -a data

.
..
sampleText.txt


In [67]:
parser = PlaintextParser.from_file('data/sampleText.txt', 
                                   Tokenizer(LANGUAGE))
stemmer = Stemmer(LANGUAGE)

## Luhn Summarizer

In [69]:
luhn = LuhnSummarizer(stemmer)
luhn.stop_words = get_stop_words(LANGUAGE)
for sent in luhn(parser.document, SENTENCES_COUNT):
    print(sent)
    print()

Luhn's 1958 paper "The automatic creation of literature abstracts," describes a text summarization method that will "save a prospective reader time and effort in finding useful information in a given article or report" and that the problem of finding information "is being aggravated by the ever-increasing output of technical literature."

With this early work, Luhn proposed a text summarization method where the computer would read each sentence in a paper, extract the frequently-occurring words, which he calls significant words, and then look for the sentences that had the most examples of those significant words.

As long as the amount of text that is extracted is a subset of the original text, this type of summarization achieves the goal of compressing the original text into a shorter size.

In this chapter we will focus on summarization techniques for text documents, but researchers are also working on summarization algorithms designed for video, images, sound, and more.
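
The sample output above describes Luhn's idea: find the significant words, then favour sentences that are dense in them. A rough sketch of one such density score is given below. It is a simplification and not sumy's implementation: it treats the span from the first to the last significant word as a single cluster and ignores the gap limit between significant words. The function name and the example tokens are hypothetical.

```python
def luhn_score(tokens, significant):
    """Squared count of significant words divided by the length of the span they cover."""
    positions = [i for i, tok in enumerate(tokens) if tok in significant]
    if not positions:
        return 0.0
    span = positions[-1] - positions[0] + 1
    return len(positions) ** 2 / span

tokens = "the data tools process big data".split()
significant = {"data", "big"}

# Significant words sit at positions 1, 4 and 5: three hits over a span of five tokens.
score = luhn_score(tokens, significant)
print(score)  # 1.8
```

Squaring the hit count rewards sentences where significant words cluster tightly instead of being spread thinly across a long sentence.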



## Text Rank

In [70]:
textrank = TextRankSummarizer(stemmer)
textrank.stop_words = get_stop_words(LANGUAGE)
for sent in textrank(parser.document, SENTENCES_COUNT):
    print(sent)
    print()

With this early work, Luhn proposed a text summarization method where the computer would read each sentence in a paper, extract the frequently-occurring words, which he calls significant words, and then look for the sentences that had the most examples of those significant words.

In an extractive summarization method, the summary is comprised of words, phrases, or sentences that are drawn directly from the original text.

As long as the amount of text that is extracted is a subset of the original text, this type of summarization achieves the goal of compressing the original text into a shorter size.

In this chapter we will focus on summarization techniques for text documents, but researchers are also working on summarization algorithms designed for video, images, sound, and more.



## LSA Summarizer

In [71]:
lsa = LsaSummarizer(stemmer)
lsa.stop_words = get_stop_words(LANGUAGE)
for sent in lsa(parser.document, SENTENCES_COUNT):
    print(sent)
    print()

In the academic literature, text summarization is often proposed as a solution to information overload, and we in the 21st century like to think that we are uniquely positioned in history in having to deal with this problem.

In an extractive summarization method, the summary is comprised of words, phrases, or sentences that are drawn directly from the original text.

Alternatively, an abstractive summarization attempts to distill the key ideas in a text and repackage them into a human-readable, and usually shorter, synthesis.

However, since the goal is to create a summary, abstractive methods must also reduce the length of the text while focusing on only retaining the most important concepts in it.



## Edmundson Summarizer

In [74]:
ed = EdmundsonSummarizer(stemmer)
ed.bonus_words = ('focus', 'proposed', 'method', 'describes') # Points to important sentences.
ed.stigma_words = ('example',) # The opposite of bonus words.
ed.null_words = ('literature', 'however') # Neutral/irrelevant words.
for sent in ed(parser.document, SENTENCES_COUNT):
    print(sent)
    print()

In the academic literature, text summarization is often proposed as a solution to information overload, and we in the 21st century like to think that we are uniquely positioned in history in having to deal with this problem.

However, since the goal is to create a summary, abstractive methods must also reduce the length of the text while focusing on only retaining the most important concepts in it.

In this chapter we will focus on summarization techniques for text documents, but researchers are also working on summarization algorithms designed for video, images, sound, and more.

In the next section we will review some of the currently available text summarization libraries and applications.
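
The Edmundson word lists are passed as tuples, and a one-element tuple in Python needs a trailing comma. Without the comma the parentheses only group the expression, and any code that iterates over the value sees individual characters instead of one word. A quick check:

```python
without_comma = ('example')   # just a string in parentheses
with_comma = ('example',)     # a tuple holding one string

print(type(without_comma).__name__, list(without_comma)[:3])  # str ['e', 'x', 'a']
print(type(with_comma).__name__, list(with_comma))            # tuple ['example']
```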

