# Big Data: Introduction

## Objectives

- Understand what "big data" is
- Know why big data is different and how it can be processed
- Understand how data can be handled in distributed and parallel systems
- Understand how MapReduce is run

## What is Big Data?

There is no clear, agreed-upon definition, but typically we say we're working with **big data** if we have to use something like a distributed computing system (not just one local machine).

## The 3 Vs of Data

+ Volume --> Large Amounts
+ Velocity --> Quickly Generated
+ Variety --> Unstructured 

<img src="images/3vs.png" width=600>

Data is *big* when, because of one or more of these Vs, it is better or faster to split the work over a network among several machines working in parallel.

# Applying it via Tools

## Hadoop Framework

![](images/hadoop_logo.png)

> Considered "old-school"
>
> Slower, since MapReduce writes intermediate results to disk between steps

- Storage (usually HDFS) 
- Data Processing (MapReduce)
- Resource Management

## Apache Spark

![](images/apache_spark_logo.png)

> Holds data in memory whenever possible (faster)
>
> Can still run on top of Hadoop (HDFS), but also on other storage such as S3 on AWS

Spark has become a dominant big data tool because it handles both ETL (Extract-Transform-Load) and machine learning well on distributed systems.

##### _Aside: More Detail on Spark_

**Some Resources**

> [Here](https://towardsdatascience.com/a-neanderthals-guide-to-apache-spark-in-python-9ef1f156d427) is a great walkthrough of Spark basics!
>
> And [here](https://towardsdatascience.com/apache-spark-a-conceptual-orientation-e326f8c57a64)'s another from our very own alum, Alex Shropshire!
>
> Spark has APIs for Scala (this is ur-Spark), Java, Python, and R.

N.B. Unless otherwise marked, page references are to [Salloum, Dautov, et al., "Big Data Analytics on Apache Spark", 2016](https://link.springer.com/content/pdf/10.1007%2Fs41060-016-0027-9.pdf).

Spark is a tool for managing big data. Data science professionals sometimes refer to the [five "V"s](https://www.bbva.com/en/five-vs-big-data/) of big data. The availability and size of datasets are clearly growing rapidly. What counts as "big data"? Roughly speaking, we're talking about datasets that are too large to be processed on a single machine.

Many large companies are relying on big data these days, and Spark is a major player in the big data game. Examples can be found [here](https://www.icas.com/thought-leadership/technology/10-companies-using-big-data) and [here](https://enlyft.com/tech/products/apache-spark) and [here](https://www.quora.com/Which-are-the-companies-that-use-apache-spark).

So ... how in the world *do* you process a dataset that's too large for a single machine? You use multiple machines linked together! Let's call each machine a *node*, and the group of all machines working in parallel a *cluster*.

The origin story of Spark starts with [MapReduce](https://en.wikipedia.org/wiki/MapReduce), whose programs comprise (unsurprisingly) a "map" routine (for filtering and sorting) and a "reduce" routine (for performing some aggregate operation).

Let's look at an [example](https://en.wikipedia.org/wiki/MapReduce#Logical_view):
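As a minimal sketch of the idea in plain Python (this is not Hadoop or Spark code), the built-in `map` and `functools.reduce` can stand in for the two routines. The word list and variable names are made up for illustration.

```python
from functools import reduce

# Hypothetical input: a small list of words.
words = ["spark", "hadoop", "spark", "hdfs"]

# "Map" routine: transform every record independently (word -> length).
lengths = list(map(len, words))

# "Reduce" routine: aggregate the mapped values into a single result.
total_characters = reduce(lambda a, b: a + b, lengths, 0)

print(lengths)
print(total_characters)
```

Because each word is mapped independently of the others, the map step is the part that can be spread across many machines.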

An early major player in big data that used MapReduce was [Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop). Hadoop was (and still is) a framework for distributed data processing. Its processing component used MapReduce, but it also had a storage component called the "Hadoop Distributed File System".

From Wikipedia: "Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking".

Spark was released as open source in 2010, and it had some advantages over Hadoop MapReduce.

Spark's advantages over Hadoop MapReduce:

- data processing in memory rather than on disks
- a single framework for machine learning, graph analysis, and processing of streaming data (pp. 159-160)

For more on the advantages of Spark over MapReduce, see [this piece](https://research.ijcaonline.org/volume113/number1/pxc3900531.pdf).

Distributed computing can help enormously with speed. Check out [this website](http://sortbenchmark.org) for the latest in speed records.

"As a framework, it combines a core engine for distributed computing with an advanced programming model for in-memory processing. Although it has the same linear scalability and fault tolerance capabilities as those of MapReduce, it comes with a multistage in-memory programming model comparing to the rigid map-then-reduce disk-based model" (146).

See the illustration on p. 148 for a view of Spark's internals.

"Running a Spark application involves five key entities ... a driver program, a cluster manager, workers, executors and tasks. A driver program is an application that uses Spark as a library and defines a high-level control flow of the target computation. While a worker provides CPU, memory and storage resources to a Spark application, an executer \[sic\] is a JVM (Java Virtual Machine) process that Spark creates on each worker for that application. A job is a set of computations (e.g., a data processing algorithm) that Spark performs on a cluster to get results to the driver program. A Spark application can launch multiple jobs. Spark splits a job into a directed acyclic graph (DAG) of stages where each stage is a collection of tasks. A task is the smallest unit of work that Spark sends to an executor. The main entry point for Spark functionalities is a SparkContext through which the driver program access \[sic\] Spark. A SparkContext represents a connection to a computing cluster" (149).

RDDs, Transformations, and Actions:

An RDD is a Resilient Distributed Dataset. Fault tolerance is achieved by keeping a record of each RDD's *lineage*, that is, the sequence of transformations that produced it. In the event of node failure, the lost pieces of the data can be recomputed on other nodes by replaying that lineage. This is what makes these RDDs *resilient*.

- Transformations take one from an RDD to another RDD;
- Actions take one from an RDD to an output value.
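The distinction can be sketched with a toy, single-machine class. `ToyRDD` is hypothetical and is not Spark's implementation. The method names `map`, `filter`, `collect`, and `count` mirror real RDD methods, but everything else here is an assumption made for illustration. Transformations only extend the recorded lineage, and an action replays it to produce a value.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (not Spark's real class)."""

    def __init__(self, source, lineage=()):
        self._source = list(source)     # the original input data
        self._lineage = tuple(lineage)  # recorded transformations, in order

    def map(self, func):
        # Transformation: returns a NEW ToyRDD; nothing is computed yet.
        return ToyRDD(self._source, self._lineage + (("map", func),))

    def filter(self, predicate):
        # Transformation: also lazy, it only extends the lineage.
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        # Action: replay the lineage over the source and return plain values.
        data = self._source
        for kind, func in self._lineage:
            if kind == "map":
                data = [func(x) for x in data]
            else:
                data = [x for x in data if func(x)]
        return data

    def count(self):
        # Action: returns a single number rather than another ToyRDD.
        return len(self.collect())


numbers = ToyRDD(range(1, 6))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(even_squares.collect())
print(even_squares.count())
```

Since a result is fully described by its source plus its lineage, it can be rebuilt at any time by replaying those steps, which is the idea behind lineage-based recovery.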

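Broadcast variables and accumulators act as shared ("global") variables. Broadcast variables are read-only values sent to every node; accumulators are variables that tasks can only add to, which makes them useful for counters or sums.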
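The two roles can be imitated in plain Python. This is only a single-machine sketch under stated assumptions: `ToyAccumulator`, the lookup table, and the records are all hypothetical, and in PySpark the real objects come from `SparkContext.broadcast` and `SparkContext.accumulator`.

```python
from types import MappingProxyType

# Stand-in for a broadcast variable: a read-only lookup table that every
# task can consult but none can modify.
country_names = MappingProxyType({"US": "United States", "FR": "France"})


class ToyAccumulator:
    """Stand-in for an accumulator: tasks may only add to it."""

    def __init__(self, initial=0):
        self.value = initial

    def add(self, amount):
        self.value += amount


unknown_codes = ToyAccumulator()


def expand(code):
    # Each "task" reads the shared table and reports problems to the counter.
    if code not in country_names:
        unknown_codes.add(1)
        return "unknown"
    return country_names[code]


records = ["US", "FR", "XX", "US", "ZZ"]  # hypothetical input
expanded = [expand(code) for code in records]

print(expanded)
print(unknown_codes.value)
```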

Surveys of Big Data tools [here](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-015-0032-1) and [here](https://ieeexplore.ieee.org/document/7300948).

Debugging can be a challenge in Spark. [This project](https://sites.google.com/site/sparkbigdebug/) was started to help with that.

Also check out Paco Nathan's [massive slide show presentation](http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf) on Spark. Let's just look at slides 66-7 and 82.

### Spark Data Objects

***In PySpark there are only RDDs and DataFrames***

In the compiled languages that Spark supports (Scala and Java), there is a further distinction between DataFrames and Datasets.

![dataframe image](https://databricks.com/wp-content/uploads/2018/05/DataFrames.png)

#### Use an RDD when:

[quoted from databricks](https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)

> - you want low-level transformation and actions and control on your dataset;
> - your data is unstructured, such as media streams or streams of text;
> - you want to manipulate your data with functional programming constructs than domain specific expressions;
> - you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column

#### Use a DataFrame (or Dataset) when:

[also quoted from databricks](https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)

> - you want rich semantics, high-level abstractions, and domain specific APIs, use DataFrame
> - your processing demands high-level expressions, filters, maps, aggregation, averages, sum, SQL queries, columnar access and use of lambda functions on semi-structured data, use DataFrame
> - you want higher degree of type-safety at compile time, want typed JVM objects, take advantage of Catalyst optimization, and benefit from Tungsten’s efficient code generation, use Dataset.
> - you want unification and simplification of APIs across Spark Libraries, use DataFrame or Dataset.
> - If you are a R user, use DataFrames.
> - If you are a Python user, use DataFrames and resort back to RDDs if you need more control.

**Note**: Spark's DataFrame-based machine learning library (`pyspark.ml` in Python) runs its algorithms on _DataFrames_.

## But Spark Isn't Always the Best Tool!

<img src="images/tech_stack.png" width=600>

# What Do We Mean by "Parallel" & "Distributed"?

## Distributed

<img src="images/types_of_network.png" width=700>

> Tasks are split up and executed by different workers

+ Multiple CPUs each have their own memory
+ Multiple CPUs communicate over a network (by passing "messages")

## Sequential

<img src="images/sequential.png" width=650>

> Take a step at a time

## Parallel

<img src="images/parallel.png" width=650>

> Execute several tasks at the same time instead of one after another

+ Multiple CPUs share same memory to "communicate"
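A small sketch of the contrast, using Python's standard `concurrent.futures` module. The word list and function name are made up for illustration. Note that CPython threads take turns on CPU-bound work, so this shows the shared-memory pattern rather than a real speedup.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

words = ["hadoop", "spark", "hdfs", "mapreduce"]  # hypothetical input

# Sequential: take one step at a time.
sequential_lengths = {}
for word in words:
    sequential_lengths[word] = len(word)

# Parallel: several threads work at once and "communicate" by writing
# into the same dictionary, which lives in memory they all share.
shared_lengths = {}
lock = threading.Lock()  # guards the shared dictionary


def record_length(word):
    with lock:
        shared_lengths[word] = len(word)


with ThreadPoolExecutor(max_workers=2) as pool:
    # list(...) forces every task to finish before we leave the block.
    list(pool.map(record_length, words))

print(sequential_lengths == shared_lengths)
```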

# MapReduce

Describes two jobs: **Map** & **Reduce**

A programming model best suited to **clusters**

The diagram below illustrates each step of the process:

<img src="images/MapReduceZooExample.drawio.png" width=700>

## Steps in MapReduce

![](images/mapreduce_visual.jpg)

### Split

> Split the input into chunks and assign a chunk to each worker

### Map

> Map is another word for function: it takes in data in one form and transforms (maps) it to another form

We create key-value pairs (tuples)

### Shuffle

> Reorganize the key-value pairs so that all values for the same key end up together, which makes reducing easier

### Reduce

> Takes the grouped data and _combines_ the values for each key into a smaller set of results
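The four steps can be walked through in plain Python. This is a single-machine sketch, not Hadoop or Spark code. The animal list and the function names `split`, `map_chunk`, `shuffle`, and `reduce_groups` are hypothetical, and on a real cluster each chunk would be mapped by a different worker.

```python
from collections import defaultdict

# Hypothetical input: animals spotted at a zoo.
animals = ["lion", "zebra", "lion", "tiger", "zebra", "lion"]


def split(records, n_workers):
    """Split: deal the records out into one chunk per worker."""
    return [records[i::n_workers] for i in range(n_workers)]


def map_chunk(chunk):
    """Map: emit a (key, value) pair for every record in the chunk."""
    return [(animal, 1) for animal in chunk]


def shuffle(mapped_chunks):
    """Shuffle: group the values by key across every worker's output."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(grouped)


def reduce_groups(grouped):
    """Reduce: combine each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


chunks = split(animals, n_workers=2)
mapped = [map_chunk(chunk) for chunk in chunks]  # one call per "worker"
grouped = shuffle(mapped)
counts = reduce_groups(grouped)

print(chunks)
print(grouped)
print(counts)
```

The shuffle is the only step that needs data from every worker, which is why it tends to be the expensive part on a real network.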

# PySpark!

## PySpark: Programming

<a href="https://colab.research.google.com/github/flatiron-school/ds-spark/blob/main/spark-programming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## PySpark: Machine Learning

<a href="https://colab.research.google.com/github/flatiron-school/ds-spark/blob/main/spark-ml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>