# Big Data

## Big Data Myths

* Spark is faster, so always use it instead of Hadoop MapReduce.
* Adding more nodes to the cluster always speeds up processing.
* Spark is written in Scala, so Scala is always faster than Python.
 + https://stackoverflow.com/questions/32464122/spark-performance-for-scala-vs-python
 + https://stackoverflow.com/questions/52713466/in-theory-scala-is-faster-than-python-for-apache-spark-in-practice-it-is-not

By the end of this course, we will have challenged, and in some cases discarded, the statements above.

## Introduction

The two most widely used big data distributed frameworks are *Hadoop MapReduce* and *Apache Spark*.

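* Hadoop includes its own distributed filesystem (HDFS), while [Apache Spark](https://spark.apache.org/?utm_source=xp&utm_medium=blog&utm_campaign=content) has no storage layer of its own and needs an external filesystem to work on.
 + As a matter of fact, Spark is commonly deployed on top of Hadoop (HDFS for storage, YARN for resource management), although it can also run without it.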


## [What is Hadoop MapReduce?](https://www.xplenty.com/blog/apache-spark-vs-hadoop-mapreduce/)

The MapReduce paradigm consists of two sequential tasks: Map and Reduce. 
* Map filters and sorts data while converting it into key-value pairs. 
* Reduce then takes this input and reduces its size by performing some kind of summary operation over the dataset.
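
The two tasks above can be sketched in plain Python with the classic word-count example. This is only a single-process illustration of the paradigm, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample chunks are assumptions made for this sketch, and the explicit `shuffle` step stands in for the grouping by key that a framework performs between Map and Reduce.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key (done by the framework between Map and Reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all values for one key, here by summing them."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk could be mapped on a different node; here they run in sequence.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, already split into three chunks.
chunks = ["spark and hadoop", "hadoop mapreduce", "spark spark"]
counts = word_count(chunks)
print(counts)
```

Because every chunk is mapped independently of the others, the Map step is the part that can be distributed across nodes.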

MapReduce can drastically speed up big data tasks by breaking down large datasets and processing them in parallel.

<font color=red>__Important__: to take advantage of this parallelism, __data must be splittable__. Keep this in mind when analyzing the data and the type of work required, and when you wonder why _more nodes does not mean more processing speed_. __Not all data meets that condition__.</font>
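
One common way to reason about why more nodes do not always mean more speed is Amdahl's law, which is not part of the original notes and is used here only as an illustration. The sketch below assumes that a fixed fraction of the job is splittable and the rest must run serially; the function name `amdahl_speedup` and the 90% figure are hypothetical choices for the example.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Upper bound on speedup when only part of the work can be split across nodes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Assume 90% of the job is splittable and 10% is not.
for nodes in (1, 2, 10, 100, 1000):
    print(nodes, round(amdahl_speedup(0.9, nodes), 2))
```

Under that assumption the speedup stays below 10x no matter how many nodes are added, because the non-splittable 10% always runs at the same speed.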

### The Differences Between Spark and MapReduce
The main differences between Apache Spark and Hadoop MapReduce are:
* Performance
* Ease of use
* Data processing
* Security

#### Spark vs MapReduce: Performance
Apache Spark processes data in RAM, while Hadoop MapReduce persists data back to the disk after a map or reduce action. In theory, then, Spark should outperform Hadoop MapReduce.

__*Spark needs a lot of memory*__: Much like standard databases, Spark loads a process into memory and keeps it there until further notice for the sake of caching. Consider that:
* if you run Spark on Hadoop YARN alongside other resource-demanding services,
* or if the data is too big to fit entirely into memory,

__=> then Spark could suffer major performance degradation.__

MapReduce, on the other hand, kills its processes as soon as a job is done, so it can easily run alongside other services with minor performance differences.

* Iterative computations that need to pass over the same data many times: *use Spark*.
* One-pass ETL-like jobs, for example data transformation or data integration: *that's exactly what MapReduce was designed for*.
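
The difference for iterative jobs can be illustrated with a toy model that counts how often a dataset is read from storage. This is a plain-Python sketch, not Spark or Hadoop code: `CountingStorage`, `iterate_rereading` and `iterate_cached` are hypothetical names, and a real cluster adds many costs (serialization, network, scheduling) that are ignored here.

```python
class CountingStorage:
    """Hypothetical stand-in for disk storage that counts how often it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._records)


def iterate_rereading(storage, iterations):
    """Disk-oriented style: every pass loads the dataset from storage again."""
    total = 0
    for _ in range(iterations):
        total += sum(storage.load())
    return total


def iterate_cached(storage, iterations):
    """In-memory style: load once, then reuse the cached copy on every pass."""
    cached = storage.load()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


disk_a = CountingStorage([1, 2, 3])
disk_b = CountingStorage([1, 2, 3])
print(iterate_rereading(disk_a, 10), disk_a.reads)  # same result, 10 reads
print(iterate_cached(disk_b, 10), disk_b.reads)     # same result, 1 read
```

Both functions compute the same result, but the cached version touches storage once instead of once per iteration, which only helps if the dataset actually fits in memory.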

__Bottom line__: Spark performs better when all the data fits in memory, *especially on dedicated clusters.* Hadoop MapReduce is designed for data that doesn’t fit in memory, and can run well alongside other services.

#### Spark vs MapReduce: Ease of Use
Spark has pre-built APIs for Java, Scala and Python, and also includes Spark SQL. Thanks to Spark’s simple building blocks, it’s easy to write user-defined functions. Spark even includes an interactive mode for running commands with immediate feedback.

MapReduce is written in Java and is not easy to program directly. However, there are some projects that make it easier:
* Apache Pig (it requires some time to learn the syntax)
* Apache Hive, which adds a SQL-like query language
* Projects like Apache Impala and Apache Tez aim to bring full interactive querying to Hadoop.

__Bottom line:__ Spark is easier to program and includes an interactive mode. Hadoop MapReduce is more difficult to program, but several tools are available to make it easier.

#### Spark vs MapReduce: Data Processing
* **Spark can do more than plain data processing**: it can also process graphs, includes the MLlib machine learning library, and can do real-time processing as well as batch processing.

* **Hadoop MapReduce is great for batch processing**: If you want a real-time option you’ll need to use another platform like Impala or Apache Storm, and for graph processing you can use Apache Giraph. MapReduce used to have Apache Mahout for machine learning, but it's since been ditched in favor of Spark and H2O.

__Bottom line:__ Spark is the Swiss army knife of data processing, while Hadoop MapReduce is the commando knife of batch processing.



## Conclusion
* Apache Spark is potentially up to 100 times faster than Hadoop MapReduce... when the data fits in memory.
* Apache Spark isn’t tied to Hadoop’s two-stage map/reduce paradigm.
* Hadoop is more cost-effective for processing massive data sets.
* Apache Spark is now more popular than Hadoop MapReduce. [Check Google Trends](https://trends.google.com/trends/explore?cat=5&date=2008-06-07%202020-07-07&geo=US&q=hadoop,spark)