# Module 4: Introduction to Spark

# Introduction

This module provides an introduction to a popular distributed framework called Spark for processing and performing analytics on large volumes of data.  

# Learning Outcomes

In this module, you will learn the following:

* The advantages of Apache Spark over plain Hadoop

* The high-level architecture of Spark

* The Spark API


# Readings and Resources

We invite you to further supplement this notebook with the following:

* Spark Homepage: https://spark.apache.org/


* Spark Documentation: https://spark.apache.org/docs/latest/


* Databricks Blog: _A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets. When to use them and why._ https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html


* Databricks Cloud: https://goo.gl/X5cWGA


* Madhukar’s Blog - History of Apache Spark: Journey from Academia to Industry http://blog.madhukaraphatak.com/history-of-spark/




<h1>Table of Contents<span class="tocSkip"></span></h1>
<br>
<div class="toc">
<ul class="toc-item">
<li><span><a href="#Module-4:-Introduction-to-Spark" data-toc-modified-id="Module-4:-Introduction-to-Spark">Module 4: Introduction to Spark</a></span>
</li>
<li><span><a href="#Introduction" data-toc-modified-id="Introduction">Introduction</a></span>
</li>
<li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes">Learning Outcomes</a></span>
</li>
<li><span><a href="#Readings-and-Resources" data-toc-modified-id="Readings-and-Resources">Readings and Resources</a></span>
</li>
<li><span><a href="#Table-of-Contents" data-toc-modified-id="Table-of-Contents">Table of Contents</a></span>
</li>
<li><span><a href="#What-Is-Apache-Spark?" data-toc-modified-id="What-Is-Apache-Spark?">What Is Apache Spark?</a></span>
</li>
<li><span><a href="#History-of-Spark" data-toc-modified-id="History-of-Spark">History of Spark</a></span>
</li>
<li><span><a href="#Who-uses-Spark,-and-for-what?" data-toc-modified-id="Who-uses-Spark,-and-for-what?">Who uses Spark, and for what?</a></span>
<ul class="toc-item">
<li><span><a href="#Data-Analytics" data-toc-modified-id="Data-Analytics">Data Analytics</a></span>
</li>
<li><span><a href="#Other-Big-Data-Processing-Applications" data-toc-modified-id="Other-Big-Data-Processing-Applications">Other Big Data Processing Applications</a></span>
</li>
</ul>
</li>
<li><span><a href="#Spark-Terminology" data-toc-modified-id="Spark-Terminology">Spark Terminology</a></span>
</li>
<li><span><a href="#Apache-Spark-vs-Hadoop" data-toc-modified-id="Apache-Spark-vs-Hadoop">Apache Spark vs Hadoop</a></span>
<ul class="toc-item">
<li><span><a href="#Hadoop" data-toc-modified-id="Hadoop">Hadoop</a></span>
</li>
<li><span><a href="#Spark" data-toc-modified-id="Spark">Spark</a></span>
</li>
</ul>
</li>
<li><span><a href="#Apache-Spark-Architecture" data-toc-modified-id="Apache-Spark-Architecture">Apache Spark Architecture</a></span>
</li>
<li><span><a href="#Apache-Spark-Components" data-toc-modified-id="Apache-Spark-Components">Apache Spark Components</a></span>
</li>
<li><span><a href="#Apache-Spark-Setup-for-this-Course" data-toc-modified-id="Apache-Spark-Setup-for-this-Course">Apache Spark Setup for this Course</a></span>
<ul class="toc-item">
<li><span><a href="#Setup-Steps" data-toc-modified-id="Setup-Steps">Setup Steps</a></span>
</li>
<li><span><a href="#Get-Started" data-toc-modified-id="Get-Started">Get Started</a></span>
</li>
</ul>
</li>
<li><span><a href="#Spark-API" data-toc-modified-id="Spark-API">Spark API</a></span>
<ul class="toc-item">
<li><span><a href="#Spark-RDDs" data-toc-modified-id="Spark-RDDs">Spark RDDs</a></span>
</li>
<li><span><a href="#Spark-DataFrames" data-toc-modified-id="Spark-DataFrames">Spark DataFrames</a></span>
</li>
<li><span><a href="#Spark-Datasets" data-toc-modified-id="Spark-Datasets">Spark Datasets</a></span>
</li>
</ul>
</li>
<li><span><a href="#Spark-Transformations-and-Actions" data-toc-modified-id="Spark-Transformations-and-Actions">Spark Transformations and Actions</a></span>
<ul class="toc-item">
<li><span><a href="#Spark-Transformations" data-toc-modified-id="Spark-Transformations">Spark Transformations</a></span>
<ul class="toc-item">
<li><span><a href="#Map-and-FlatMap" data-toc-modified-id="Map-and-FlatMap">Map and FlatMap</a></span>
</li>
<li><span><a href="#Full-list-of-Spark-Transformations" data-toc-modified-id="Full-list-of-Spark-Transformations">Full list of Spark Transformations</a></span>
</li>
</ul>
</li>
<li><span><a href="#Spark-Actions" data-toc-modified-id="Spark-Actions">Spark Actions</a></span>
<ul class="toc-item">
<li><span><a href="#Full-list-of-Spark-Actions" data-toc-modified-id="Full-list-of-Spark-Actions">Full list of Spark Actions</a></span>
</li>
</ul>
</li>
</ul>
</li>
<li><span><a href="#Exercise" data-toc-modified-id="Exercise">Exercise</a></span>
</li>
<li><span><a href="#References" data-toc-modified-id="References">References</a></span>
</li>
</ul>
</div>

# What Is Apache Spark?

Apache Spark is a **cluster computing** platform designed to be fast and general purpose. Spark extends Hadoop's MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.

If you need a reminder of the definitions of *cluster computing* and *MapReduce*, review the previous module or the Wikipedia entries here:

- https://en.wikipedia.org/wiki/Computer_cluster


- https://en.wikipedia.org/wiki/MapReduce

Spark can work with a cluster that has the Hadoop file system (recall HDFS from the last module) installed but adds a processing engine that performs far better than simple MapReduce can.  Spark has several advantages over plain Hadoop:

- **Speed**: Speed is important in processing large datasets. Exploring data interactively requires short response times. Spark is much faster than Hadoop's built-in MapReduce for most calculations.  One of the main features Spark offers that makes it fast is its ability to run multi-step computations in memory rather than always needing to save data to disk between tasks like Hadoop does.


- **Variety of workloads**: Spark is designed to cover a wide range of distributed computational workloads that previously required several separate, specialized systems. The types of workloads include batch applications, iterative algorithms, interactive queries, and streaming. By supporting all of these workload types with a single engine, Spark makes it easy and inexpensive to combine different processing types. This is often necessary in production data analysis pipelines, where several preprocessing, transformation and analysis steps run in sequence and/or in parallel. It also reduces the management burden of knowing and maintaining many individual specialized tools.


- **Works with several programming languages**: Spark offers several simple APIs in Python, Java, Scala, R and SQL.


- **Rich libraries**: Spark provides a lot of rich built-in analytics capabilities through included libraries.


- **Integrates with other big data tools**: It also integrates closely with other popular Big Data tools.

# History of Spark

Matei Zaharia, a Romanian-Canadian, started the Spark project during his PhD at UC Berkeley (although the name Spark was given to it much later). It started in 2009 as a research project in the UC Berkeley RAD Lab, later to become the AMPLab. The researchers in the lab had previously been working on Hadoop MapReduce, and observed that MapReduce was inefficient for iterative and interactive computing jobs.

Thus, from the beginning, Spark was designed to overcome Hadoop's limitations: to be fast for interactive queries and iterative algorithms, but without sacrificing Hadoop's fault tolerance and large file processing capabilities.
Even in its early form it was already 10–20 times faster than MapReduce for certain jobs.

Spark was open sourced in early 2010. After being released, a broad developer community grew, and the project was moved to the Apache Software Foundation in 2013. 

Matei, together with co-founders including Ali Ghodsi and Ion Stoica, went on to start a company called Databricks. Databricks aims to help clients with cloud-based big data processing "as-a-service" using Spark. Databricks is the platform we will use to experiment with Spark in this course.

# Who uses Spark, and for what?

Since Spark is a general-purpose framework for cluster computing, companies and governments use it for a diverse range of data-intensive applications, but first and foremost for data analytics.

## Data Analytics

Spark provides facilities that are familiar to data scientists, such as dataframes, SQL and statistical tools. It also supports the programming languages most commonly used by Data Scientists. Whereas in the early days of Big Data, analysts needed to know about the details of Hadoop to do their jobs, now Spark provides a way of interacting with a computing cluster that is much higher-level, more intuitive and hides most of the details of how the underlying cluster operates.

## Other Big Data Processing Applications

Spark is also used by Big Data Engineers to support more general processing of large quantities of data. Rather than having to deal with the details of concurrent programming or use Hadoop directly, they can rely on Spark to look after the messy details of managing the cluster, handling faults automatically and making sure parallel processes don't interfere with each other.

# Spark Terminology

Spark, like Hadoop, has its own terminology for its components and abstractions.  Here is a handy glossary to refer back to as you read the following sections.

| ***Term*** | ***Definition*** |
| :---: | :--- |
| **API** | Application Programming Interface.  Spark supports a few different programming models (**RDDs**, **DataFrames** and **DataSets**), each with its own set of methods. |
| **Application** | When Spark documentation refers to the **application**, it means a program you've written in any of the languages it supports (Java, R, Python, Scala) that calls on Spark to do work |
| **Application JAR** | If you're using Java or Scala, this is essentially the compiled executable of your application |
| **Cluster** | A set of similar servers that are networked tightly together so they can be used for parallel computing |
| **Cluster Manager** | The operating system-like program that manages resources (CPU, disk, memory) on a cluster.  This is something that is set up when the cluster is first created.  For Hadoop, this is called **YARN**. Spark comes with its own cluster manager that can be used rather than YARN, or it can use YARN, or some other alternatives. |
| **DataFrame** | Similar to a Pandas DataFrame, Spark provides a table-like structure with methods for doing operations on the table's contents.  The difference here is that Spark DataFrames can be larger than what can be stored in the memory of a single computer whereas Pandas DataFrames cannot. |
| **DataSet** | A Spark DataSet is similar to a Spark **DataFrame**, but strongly typed.  (This means, for example, that the contents of each column of the table it represents must be of a single type, as in a relational table.  DataSets may be less convenient for exploratory data analysis but are more robust for business applications.) |
| **Driver Program** | The part of your application that creates the **SparkContext**. In Java terms, the driver program is the `main()` method in your program. |
| **Executor** | One of many processes launched by Spark on **worker nodes** to run the tasks needed by your application to do its work on the cluster |
| **Job** | A parallel computation consisting of multiple tasks that gets created as the result of a **Spark action** |
| **Mesos** | A **cluster manager** that enables an entire data centre to be managed as if it's a **cluster**.  The nodes in a cluster are typically all very similar, but Mesos allows a more diverse mix of nodes to be treated as a single pool of resources.  Mesos is an Apache project. |
| **Resilient Distributed Dataset (RDD)** | An older, lower-level API which DataFrames and DataSets are built upon.  Early Spark applications manipulated RDDs directly, but DataFrames or DataSets are now used instead unless there is a need to do special optimization of a program by using lower-level features. |
| **SparkContext** | An object that you create in your program before you can run a Spark **job**, that tells Spark how to run your application on the cluster |
| **Stage** | Each **job** gets divided into smaller sets of tasks, called **stages** (similar to the *map* and *reduce* stages in Hadoop) |
| **Task** | A unit of work that will be sent to one executor |
| **Worker Node** | Any node that can run **application** code in the cluster |
| **YARN** | Hadoop's standard **cluster manager** |

# Apache Spark vs Hadoop

In order to compare Spark and Hadoop's approaches to processing, it is important to understand some of the differences in their architectures.

## Hadoop

For all its strengths, MapReduce is fundamentally a batch processing system.  It is not suitable for interactive analysis: queries can take minutes, hours or days. However, since its original incarnation, Hadoop has evolved beyond batch processing. Indeed, the term *Hadoop* is sometimes used to refer to the larger ecosystem of related projects including, for example, **Hive**, which does provide interactive queries, though somewhat slowly.

Recall that all the files passed into HDFS (the Hadoop File System) are split into blocks. Each block is replicated a specified number of times across the cluster based on a preconfigured block size and replication factor.

As each task runs, it typically reads the key-value pairs from an HDFS file, transforms them (does a map or reduce) and writes the result back to another HDFS file ready for the next step.

## Spark

Spark handles work in a similar way to Hadoop, except that computations are carried out in memory.  Writing to disk between tasks is avoided if at all possible.  The final result can be written to disk at the end of the computation if required.

The first thing a Spark program needs is for the Spark user/programmer to specify to Spark how to access the cluster. Spark uses this information to create a **SparkContext** object.  

Spark uses the SparkContext object to then create a structure called an **RDD**, or **Resilient Distributed Dataset**, which represents an immutable collection of elements that can be operated on in parallel.

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, Hadoop SequenceFiles, and any other Hadoop InputFormat (i.e. it can read anything Hadoop MapReduce can read). 

As the RDD and related actions are being created, Spark also creates a **Directed Acyclic Graph (DAG)** where the vertices are processing steps and the arcs define the order of these operations to efficiently carry out parallel computation. This is similar to an "explain" plan in SQL. (You can refer to "Explain" plans in SQL here: https://docs.oracle.com/cd/B28359_01/server.111/b28274/ex_plan.htm#PFGRF009).

Then you can interactively define **transformations** and **actions**. Each step you specify extends the DAG, but the actual processing won't start until an action is specified.  This deferred processing is often called **lazy execution** (or lazy evaluation). The difference between transformations and actions is that a transformation lazily defines a new dataset from an existing one, whereas an action triggers the computation and either returns a result to the driver program or writes it to storage.
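
The lazy behaviour described above can be modelled in a few lines of plain Python. This is only a sketch and does not use Spark: the `LazyDataset` class and its method names are hypothetical stand-ins chosen to resemble the RDD API. It illustrates one idea only, namely that transformations record steps while an action executes them.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # base data (here: any Python iterable)
        self._steps = tuple(steps)   # recorded transformations, in order

    def map(self, func):
        # Transformation: return a NEW dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        # Transformation: also just recorded, nothing is computed here.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded plan actually executed.
        data = iter(self._source)
        for kind, func in self._steps:
            data = map(func, data) if kind == "map" else filter(func, data)
        return list(data)


calls = []


def square(x):
    calls.append(x)   # side effect so we can observe when work happens
    return x * x


base = LazyDataset([1, 2, 3, 4])
plan = base.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # still 0: nothing has run yet
result = plan.collect()            # the action triggers execution
```

Note that `base` is never modified: each transformation returns a new object, which mirrors the immutability of RDDs.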

Spark also supports higher-level constructs called **DataFrames** and **DataSets**, which are built on top of RDDs. DataFrames organize data into named columns, similar to pandas or R dataframes. This makes them more user-friendly than RDDs, which don’t have a similar set of column-level header references.  DataSets are similar to DataFrames but are strongly typed which makes them better suited to business applications where high reliability is required.

SQL is also supported through **Spark SQL**.

![4_1.png](attachment:4_1.png)

**Source**: Talk at Hadoop Summit by Aaron Davidson on Building a Unified data pipeline in Spark. https://www.slideshare.net/Hadoop_Summit/building-a-unified-data-pipeline-in-apache-spark

To run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Hadoop's YARN, or others). The cluster manager is responsible for allocating resources such as nodes (servers), memory and disk space to applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (in a Java JAR or Python source code files according to what you specified in the SparkContext) to the executors. Finally, the SparkContext sends tasks (commands to run your application code) to the executors to run. 

![4_2.png](attachment:4_2.png)

**Source**: https://spark.apache.org/docs/latest/cluster-overview.html

There are several important things to note about this architecture:

- Each running application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, during both scheduling (each application schedules its own tasks) and execution (tasks from different applications run in different Java Virtual Machines). However, it also means that data cannot be shared across different Spark applications without writing it to an external storage system.


- Spark is agnostic as to which cluster manager you're using. As long as it can acquire executor processes, and these can communicate with each other, it is relatively easy to run it on a cluster manager that also supports non-Spark applications.  This means the cluster can be used for other work and need not be dedicated to only running Spark applications.


- The program you write to kick things off, called the **driver program**, must be able to listen for and accept incoming connections from its executors throughout its lifetime. This means the driver program must be network-addressable from the worker nodes, and there will be a lot of communication between it and the cluster.  Ideally the driver program should run on the same local area network as the cluster. If you’d like to send requests to the cluster remotely (such as over the Internet), it’s better to have the driver local to the cluster and invoke it remotely using a remote terminal program.

# Apache Spark Architecture

The Spark project contains several closely-integrated components. At its core, Spark is a "computational engine" that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines.


![4_3.png](attachment:4_3.png)

# Apache Spark Components

- **Spark Core**: Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, and interacting with storage systems. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), on which the higher-level DataFrame API is built.


- **Spark SQL**: Spark SQL is Spark’s package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL and it supports many sources of data, including Hive tables, Parquet files, and JSON. Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL queries with programmatic data manipulation.


- **Spark Streaming**: Spark Streaming is a Spark component that enables processing of *live streams* of data.  Examples of live data streams include real-time Twitter feeds or updates posted by users of a blogging service.


- **MLlib**: Spark comes with a library called **MLlib** that provides popular machine learning algorithms, including classification, regression, clustering, and collaborative filtering. It also provides tools for data import and model evaluation.


- **GraphX** : GraphX is a library for manipulating graphs (e.g. a social network’s friend graph) and performing graph-parallel computations. GraphX extends the Spark RDD API to allow creating directed graphs with arbitrary properties attached to each vertex and edge. (GraphFrames is a separate package that offers similar functionality on top of DataFrames.) GraphX also provides various operators for manipulating graphs (e.g. subgraph and mapVertices) and a library of common graph algorithms (e.g. PageRank).


- **Cluster Managers** : Under the hood, Spark is designed to efficiently scale up from one to many thousands of compute nodes. To achieve this while maximizing flexibility, Spark can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler.

The Apache Spark components are shown in the image below.

![4_4.png](attachment:4_4.png)

**Source**: Karau, H., Konwinski, A., Zaharia, M., & Wendell, P. (2015). Chapter 1: Introduction to Data analysis with Spark in _Learning Spark_. Retrieved from https://www.oreilly.com/library/view/learning-spark/9781449359034.

# Apache Spark Setup for this Course

**Databricks Community Edition**, which is Spark running in the cloud, will be used in this course as the easiest way to access Spark without having to go through any installation steps. You do not need to sign up for the Standard Plan or the Enterprise Plan, which are paid services.

## Setup Steps

1. Proceed to Databricks https://community.cloud.databricks.com/login.html<br><br>

2. Sign up for the **Community Edition**.

**NOTE**: When you create a Databricks account, you must agree to the Community Edition Terms of Service https://databricks.com/ce-termsofuse, and the Privacy Policy https://databricks.com/privacypolicy.
    

## Get Started

When you log in to Databricks Community Edition, your interface looks like a dashboard. Familiarize yourself with the basic layout, and click on **Explore the Quickstart Tutorial**, which takes you to a Databricks notebook entitled “Databricks in 5 minutes.” This tutorial will demonstrate the following:

* 	How to create a quickstart cluster


*	How to attach a notebook to a cluster


*	How to run all commands in a notebook


* How to do the following in an SQL notebook:

    * Create a table from a Databricks dataset
    * Manipulate data and display results
    * Convert a table to a chart


* How to repeat the same operations using the Python DataFrame API

Be sure to complete all steps in the notebook to get a first look at Spark code.

**NOTE:** In your **Workspace**, there is a wealth of information in the **Documentation** folder, and in the **Training & Tutorials** folder. Be sure to refer to these as needed.
  
Databricks also lets you import Java JAR files and install the libraries that you need for data processing. To provide data to process, you may upload data from your computer or use a data store already in the cloud.

# Spark API

After you get set up, it is time to start reading in data to process. The Spark API provides three ways of doing this:

1. Spark RDDs<br><br>

2. Spark DataFrames<br><br>

3. Spark Datasets 

We will introduce these concepts in the following sections.

## Spark RDDs

Spark revolves around the concept of a **Resilient Distributed Dataset** (RDD), which is an immutable, fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection (e.g. an array or table of data) in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat (a data format recognized by Hadoop or metadata that would make it recognizable).

RDDs provide, through an API, basic operations such as map, filter, and persist.
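
To make "operated on in parallel" concrete, here is a plain-Python sketch (no Spark involved) of what parallelizing a collection amounts to: the data is split into partitions, and the same function is applied to each partition independently. The helper names `split_into_partitions` and `map_partitions` are hypothetical, and Spark's real partitioning and scheduling are considerably more sophisticated.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous slices of near-equal size."""
    n = len(data)
    bounds = [(i * n) // num_partitions for i in range(num_partitions + 1)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


def map_partitions(partitions, func):
    """Apply func to every element, processing each partition in its own thread."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # pool.map returns results in the same order as the input partitions
        return list(pool.map(lambda part: [func(x) for x in part], partitions))


partitions = split_into_partitions([1, 2, 3, 4, 5, 6, 7], 3)
doubled = map_partitions(partitions, lambda x: x * 2)
```

In this toy model the threads play the role of executors; on a real cluster each partition would typically live on, and be processed by, a different worker node.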

## Spark DataFrames

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood than an RDD. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, a DataFrame is simply a type alias of `Dataset[Row]`. However, for the Java API, users need to use `Dataset<Row>` to represent a DataFrame.

Spark SQL integrates with Spark DataFrames:

* Spark SQL makes it possible to seamlessly mix SQL queries with Spark programs and provides uniform access to a variety of data sources
* Spark SQL is compatible with Hive
* It also provides standard connectivity through JDBC or ODBC


For this course, we recommend using the Spark Python API and DataFrames, since Python is a simpler language than Java or Scala and the DataFrame API supports Python.  This option has gained a lot of popularity due to the widespread use of pandas DataFrames in Python code for data science, and the fact that Spark DataFrames also support pandas-like operations.  It is possible to carry out basic projects with just SQL, DataFrames and PySpark (the Python API for Spark), which is the notion behind the **Unified Spark API**.

![4_6.png](attachment:4_6.png)

**Source**: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

In Spark 2.0, the DataFrame API was merged with the Dataset API, thereby unifying the data processing capabilities across libraries. As a result, developers now have fewer concepts to learn, and can work with a single high-level and type-safe API called Dataset.

## Spark Datasets

A Dataset is a distributed collection of data that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) but is used with Spark SQL and can benefit from Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have support for the Dataset API. However, due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name, e.g. row.columnName). The case for R is similar.

# Spark Transformations and Actions

Spark **transformations** create a new dataset from an existing one.  The existing dataset could be a file from which you read data or a dataset that you have already transformed in a prior step.  All transformations in Spark are **lazy** &mdash; i.e. they do not compute their results right away. Instead, they remember the transformations applied to some base dataset. This accomplishes the following:

- It allows optimization of the required calculations &mdash; the scheduler may recognize that the set of transformations can be done more efficiently in a different order.


- It helps recovery if data is lost part way through a computation.  Spark avoids storing intermediate results to disk. If a disk or node fails, the intermediate results won't be available on disk like they are with vanilla MapReduce.  Remembering the steps that were followed to create partial results allows the data to be reconstructed if necessary.
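
The recovery idea in the second point can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's actual mechanism: the `LineageDataset` class is hypothetical, it supports only `map` steps, and it assumes the base partitions can always be re-read from durable storage.

```python
class LineageDataset:
    """Toy model: partitions of base data plus the recorded list of map steps."""

    def __init__(self, base_partitions, steps=()):
        self.base_partitions = base_partitions  # assumed reloadable from durable storage
        self.steps = tuple(steps)                # lineage: functions applied in order
        self.computed = {}                       # in-memory results, keyed by partition index

    def map(self, func):
        # Record the step; nothing is computed yet.
        return LineageDataset(self.base_partitions, self.steps + (func,))

    def compute_partition(self, index):
        """Rebuild one partition by replaying the lineage on its base data."""
        part = self.base_partitions[index]
        for func in self.steps:
            part = [func(x) for x in part]
        self.computed[index] = part
        return part

    def get_partition(self, index):
        # Use the in-memory copy if present; otherwise recompute it from lineage.
        if index not in self.computed:
            return self.compute_partition(index)
        return self.computed[index]


ds = LineageDataset([[1, 2], [3, 4]]).map(lambda x: x + 1).map(lambda x: x * 10)
first = [ds.get_partition(i) for i in range(2)]   # computes both partitions
del ds.computed[1]                                 # simulate losing one node's memory
recovered = ds.get_partition(1)                    # only partition 1 is recomputed
```

Only the lost partition is rebuilt; the surviving partition's in-memory result is reused as is.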

Transformations are loosely analogous to *map* tasks in MapReduce. A transformation is not executed until an action requires its result. An **action** runs the transformations that precede it in the DAG and returns a result, often an aggregate, to the driver program.  In that sense, actions play a role similar to the *reduce* step in MapReduce.

Let's look at both more closely.


## Spark Transformations

Here is an example. The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on &mdash; `lines` is merely a pointer to the file. The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths` is not immediately computed, due to laziness. Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.


    lines = sc.textFile("data.txt")
    lineLengths = lines.map(lambda s: len(s))
    totalLength = lineLengths.reduce(lambda a, b: a + b)

If we want to use `lineLengths` again later, we can add the following before the reduce, which would cause `lineLengths` to be saved in memory after the first time it is computed:
    
    lineLengths.persist()
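
The effect of persisting can be illustrated without Spark by counting how often the mapped function runs. The sketch below is a plain-Python model; the `CountingDataset` class, its `sum` and `count` actions, and the sample data are all hypothetical.

```python
class CountingDataset:
    """Toy dataset whose map step is recomputed on every action unless persisted."""

    def __init__(self, data, func):
        self._data = data
        self._func = func
        self._persist = False
        self._cache = None
        self.func_calls = 0     # how many times func has been invoked

    def persist(self):
        # Only sets a flag; like a transformation, it does no work by itself.
        self._persist = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        result = []
        for item in self._data:
            self.func_calls += 1
            result.append(self._func(item))
        if self._persist:
            self._cache = result   # keep the computed values for later actions
        return result

    def sum(self):      # action
        return sum(self._materialize())

    def count(self):    # action
        return len(self._materialize())


lines = ["spark", "is", "lazy"]

plain = CountingDataset(lines, len)
plain.sum()
plain.count()
calls_without_persist = plain.func_calls    # the map ran once per action

cached = CountingDataset(lines, len).persist()
cached.sum()
cached.count()
calls_with_persist = cached.func_calls      # the map ran for the first action only
```

Without persisting, every action replays the whole chain of transformations; with it, the second action reuses the values computed by the first.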
    

### Map and FlatMap

The two most important kinds of transformations that are used in Spark are `map` and `flatMap`.

The `map` and `flatMap` transformations are similar, in the sense that they take each element of the input RDD (for example, a line of a text file) and apply a function to it. The way they differ is that the function in `map` returns exactly one element, while the function in `flatMap` can return a list of elements (0 or more) which can be iterated over.

Also, the output of the `flatMap` is flattened. Although the function in `flatMap` returns a list of elements, the `flatMap` returns an RDD which has all the elements from the list in a flat way (not a list).

More formally:

- **`map(func)`**:  Returns a new distributed dataset formed by passing each element of the source through a function you provide (`func`). Each element in the dataset is sent into the function, one at a time, and produces exactly one element as output.


- **`flatMap(func)`**: Similar to map, but each input item can be mapped to 0 or more output items (so `func` should return a sequence or list rather than a single item). Each element goes into `func`, 0 or more elements (a collection) come out.
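
The two definitions can be mirrored in plain Python, which may help when reasoning about what a Spark pipeline will return. The helpers `local_map` and `local_flat_map` below are hypothetical names for single-machine analogues; they are not part of the Spark API.

```python
from itertools import chain


def local_map(func, data):
    """One output element per input element."""
    return [func(x) for x in data]


def local_flat_map(func, data):
    """func returns an iterable per element; the results are concatenated."""
    return list(chain.from_iterable(func(x) for x in data))


nested = local_map(lambda x: [x, x * x], [3, 4, 5])
flat = local_flat_map(lambda x: [x, x * x], [3, 4, 5])
ranges = local_flat_map(lambda x: range(1, x), [3, 4, 5])
# Returning an empty list drops the element, so flatMap can also filter.
odd_only = local_flat_map(lambda x: [x] if x % 2 else [], [3, 4, 5])
```

Here `nested` keeps one list per input element, while `flat`, `ranges` and `odd_only` each contain the individual output items in a single list.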


**Example 1:** 

    sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect() 
       
**NOTE**: `collect` returns all the elements of the RDD to the driver program as a list.

**Output:** Flattened out in a single list, 
[1, 2, 1, 2, 3, 1, 2, 3, 4] 


**Example 2:**

    sc.parallelize([3,4,5]).map(lambda x: [x,  x*x]).collect() 

**Output:**
[[3, 9], [4, 16], [5, 25]]

    sc.parallelize([3,4,5]).flatMap(lambda x: [x, x*x]).collect() 

**Output:** A flattened list, 
[3, 9, 4, 16, 5, 25]



**Example 3:**

There is a file `greetings.txt` in HDFS with the following lines:

    Good Morning
    Good Evening
    Good Day
    Happy Birthday
    Happy New Year

    lines = sc.textFile("greetings.txt")
    lines.map(lambda line: line.split()).collect()

**Output:**
[['Good', 'Morning'], ['Good', 'Evening'], ['Good', 'Day'], ['Happy', 'Birthday'], ['Happy', 'New', 'Year']]


    lines.flatMap(lambda line: line.split()).collect()

**Output:**
['Good', 'Morning', 'Good', 'Evening', 'Good', 'Day', 'Happy', 'Birthday', 'Happy', 'New', 'Year']


We can do a word count of the file using `flatMap`:
    
    lines = sc.textFile("greetings.txt")
    sorted(lines.flatMap(lambda line: line.split()).map(lambda w: (w,1)).reduceByKey(lambda v1, v2: v1+v2).collect())

**Output:** 
[('Birthday', 1), ('Day', 1), ('Evening', 1), ('Good', 3), ('Happy', 2), ('Morning', 1), ('New', 1), ('Year', 1)]
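
To see what each stage of the word count contributes, the same steps can be traced in plain Python on the five lines of `greetings.txt`. This is a single-machine sketch for intuition only; in Spark, each stage would run in parallel across partitions.

```python
from itertools import chain

lines = ["Good Morning", "Good Evening", "Good Day", "Happy Birthday", "Happy New Year"]

# flatMap: split every line and flatten the pieces into one list of words
words = list(chain.from_iterable(line.split() for line in lines))

# map: pair each word with the count 1
pairs = [(w, 1) for w in words]

# reduceByKey: combine the values of identical keys with v1 + v2
totals = {}
for word, one in pairs:
    totals[word] = totals[word] + one if word in totals else one

word_counts = sorted(totals.items())
```

The intermediate `pairs` list corresponds to the key-value RDD that `reduceByKey` operates on, and `word_counts` matches the output shown above.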


**Example 4:**

Below is an example demonstrating the difference between the `map` and `flatMap` operations on an RDD using the Scala shell. Here, `flatMap` flattens multiple arrays into one single array. We have a flat file called "words.txt" which contains lines of words, separated by spaces.

    ~/user$ cat words.txt 
          
          line1 word1
          line2 word2 word1 
          line3 word3 word4
          line4 word1


    scala> val lines = sc.textFile("words.txt");
    scala> lines.map(_.split(" ")).take(3)

    res4: Array[Array[String]] = Array(Array(line1, word1), Array(line2, word2, word1), Array(line3, word3, word4))

With `flatMap`, the per-line arrays are flattened into a single array, so `take(3)` returns the first three words rather than the first three lines:

    scala> lines.flatMap(_.split(" ")).take(3)
    res5: Array[String] = Array(line1, word1, line2)

### Full list of Spark Transformations

Both `map` and `flatMap` are very generic. There is a wide array of other more specialized transformations available in Spark. Follow this link for the full current list: https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations.

## Spark Actions

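Transformations are lazy: they only add steps to Spark's directed acyclic graph (DAG) of operations. An action triggers the evaluation of the previously constructed transformations it depends on, and either returns a result to the driver program or writes it to storage.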
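
The idea that nothing runs until an action asks for a result can be imitated with a Python generator. This is an analogy only, not how Spark is implemented: `traced_square` and the `log` list are hypothetical names used to record when the function is actually called.

```python
log = []

def traced_square(x):
    log.append(x)  # record each time the function really runs
    return x * x

# "Transformation": building the generator describes the work but runs none of it.
pipeline = (traced_square(x) for x in [1, 2, 3])
calls_before_action = len(log)

# "Action": materialising the results forces the evaluation.
result = list(pipeline)
calls_after_action = len(log)

print(calls_before_action, calls_after_action, result)
```

One difference worth keeping in mind: a Python generator can be consumed only once, whereas an RDD can be used by several actions.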

**Example 1:**

`reduce`: Aggregates the elements of a dataset using a function that takes two arguments and returns one

    names1 = sc.parallelize(["abe", "abby", "apple"])
    names1.reduce(lambda t1, t2: t1+t2)

**Output:**
abeabbyapple


    names2 = sc.parallelize(["apple", "beatty", "beatrice"]).map(lambda a: [a, len(a)])
    names2.collect()

**Output:**
[['apple', 5], ['beatty', 6], ['beatrice', 8]]

    names2.flatMap(lambda t: [t[1]]).reduce(lambda t1, t2: t1+t2)

**Output:**
19
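
Spark's programming guide asks that the function passed to `reduce` be commutative and associative so that it can be computed correctly in parallel. The sketch below hints at why, using plain Python lists in place of RDD partitions. `partitioned_reduce` is a hypothetical helper, and its contiguous chunking is an assumption made for illustration rather than a description of how Spark assigns elements to partitions.

```python
from functools import reduce
from operator import add, sub

def partitioned_reduce(func, items, num_partitions):
    """Reduce each chunk separately, then reduce the partial results.
    Assumes a non-empty list and num_partitions >= 1."""
    size = -(-len(items) // num_partitions)  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    partials = [reduce(func, chunk) for chunk in chunks]
    return reduce(func, partials)

lengths = [5, 6, 8]
# Addition is associative and commutative: partitioning does not matter.
total_one = partitioned_reduce(add, lengths, 1)
total_three = partitioned_reduce(add, lengths, 3)

numbers = [10, 3, 2, 1]
# Subtraction is neither: the answer changes with the partitioning.
sub_one = partitioned_reduce(sub, numbers, 1)
sub_two = partitioned_reduce(sub, numbers, 2)

print(total_one, total_three, sub_one, sub_two)
```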

**Example 2:**

`collect`: Returns all the elements of the RDD to the driver program as a list

    sc.parallelize([1,2,3]).flatMap(lambda x: [x,x,x]).collect()

**Output:**
[1, 1, 1, 2, 2, 2, 3, 3, 3]

**Example 3:**

`count`: Returns the number of elements in the RDD

    names1 = sc.parallelize(["abe", "abby", "apple"])
    names1.count()

**Output:**
3

**Example 4:**

`saveAsTextFile(path, compressionCodecClass=None)`: Saves the RDD as a text file, using string representations of the elements

Parameters:

- `path`: path to the output file
- `compressionCodecClass`: (`None` by default) fully qualified class name of the compression codec as a string, e.g. `"org.apache.hadoop.io.compress.GzipCodec"`


    hockeyTeams = sc.parallelize(("wild", "blackhawks", "red wings", "wild", "oilers", "whalers", "jets", "wild"))
    hockeyTeams.map(lambda k: (k,1)).countByKey().items()
    hockeyTeams.saveAsTextFile("hockey_teams.txt")

**Output:**

    $ ls hockey_teams.txt/
    _SUCCESS	part-00001	part-00003	part-00005	part-00007
    part-00000	part-00002	part-00004	part-00006

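`saveAsTextFile` creates a directory at the given path, and each partition is written to its own `part-NNNNN` file inside it. The dataset in this example has 8 partitions, so the directory holds 8 part files (`part-00000` to `part-00007`) plus a `_SUCCESS` marker.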
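
The resulting directory layout can be reproduced locally to make the one-file-per-partition idea concrete. The sketch below is a plain-Python imitation, not Spark's writer: `save_as_text_dir` is a hypothetical helper, and placing one team in each partition is an assumption chosen so that the listing matches the 8 part files shown above.

```python
import os
import tempfile

def save_as_text_dir(partitions, path):
    """Write one part-NNNNN file per partition, then an empty _SUCCESS marker."""
    os.makedirs(path)
    for index, partition in enumerate(partitions):
        with open(os.path.join(path, f"part-{index:05d}"), "w") as handle:
            for element in partition:
                handle.write(f"{element}\n")
    open(os.path.join(path, "_SUCCESS"), "w").close()

teams = ["wild", "blackhawks", "red wings", "wild",
         "oilers", "whalers", "jets", "wild"]
partitions = [[team] for team in teams]  # assumption: one element per partition

out_dir = os.path.join(tempfile.mkdtemp(), "hockey_teams.txt")
save_as_text_dir(partitions, out_dir)

written = sorted(os.listdir(out_dir))
print(written)
```

Despite the `.txt` suffix, `hockey_teams.txt` is a directory here, just as in the `ls` listing above.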

### Full list of Spark Actions

For your reference, this link will take you to the Spark actions and what they do: https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions.

# Exercise

If you have not already signed up for Databricks, please sign up for the Community Edition by following the instructions found here:

> https://databricks.com/try-databricks

> On the page where you select a cloud provider, there is a small link that says 'Get started with Community Edition'. Click this link rather than selecting a cloud provider. That way, you do not need to provide credit card details, and you are not limited by the dollar-value cap of the free trial.

> ![image.png](attachment:image.png)

**Instructions to import the notebook to your Databricks environment:**

1. Open the Exercise Notebook here: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5915990090493625/1208183970109446/6085673883631125/latest.html.<br><br>

2. In the top right corner of this notebook, you will find the "Import Notebook" button. Click on it, and a URL will be generated.<br><br>

3. Copy this URL.<br><br>

4. Go to your Databricks dashboard, and navigate to your Workspace.<br><br>

5. Right-click the folder in your Workspace file listing, select Import, and paste the URL.<br><br>

6. Complete the operation by clicking Import. You now have a copy of the Exercise notebook in your Workspace to use.


This notebook contains the code for a simple word count problem. It is written in Python, uses the Spark API for Python, and can be run on the latest version of Spark or any version of Spark 3. Databricks comes with basic libraries pre-installed. The dataset is already uploaded, and the program shows all the steps, including transformations, actions and SQL, that lead up to finding word counts, unique words and basic data analysis.

**NOTE**: As a best practice, choose a *stable* Spark version when creating a cluster. The Databricks service offers several cluster options with different Spark and Scala versions and updates them often. This notebook should work with any cluster that provides Python 3.

**End of Module**

You have reached the end of this module.

If you have any questions, please reach out to your peers using the discussion boards. If you and your peers are unable to come to a suitable conclusion, do not hesitate to reach out to your instructor on the designated discussion board.

When you are comfortable with the content, you may proceed to the next module.

# References

- Databricks (2018). Databricks home page. Retrieved from https://databricks.com/.


- Databricks (2018). Databricks blog. Retrieved from https://databricks.com/blog.


- Karau, H., Konwinski, A., Zaharia, M., & Wendell, P. (2015). Chapter 1: Introduction to Data analysis with Spark in _Learning Spark_. Retrieved from https://www.oreilly.com/library/view/learning-spark/9781449359034.


- Xin, R. (2015). _Performance Optimization Case Study: Shattering Hadoop’s Sort Record with Spark and Scala_. Publicly available presentation by Databricks. Retrieved from https://www.slideshare.net/databricks/2015-0317-scala-days.


- Apache Software Foundation (n.d.). Official website for Apache Zeppelin project. Retrieved from https://zeppelin.apache.org/.


- Databricks Notebook (n.d.). Word Count Example. Retrieved from https://goo.gl/RzJzEL. (Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License)


- Definition of MapReduce (n.d.). Retrieved from https://en.wikipedia.org/wiki/MapReduce.


- Definition of Cluster Computing (n.d.). Retrieved from https://en.wikipedia.org/wiki/Computer_cluster.


- Explain plans in SQL (n.d.). Retrieved from Oracle docs from https://docs.oracle.com/cd/B28359_01/server.111/b28274/ex_plan.htm#PFGRF009.


- Apache Spark the fastest open source engine for sorting a petabyte (n.d.). Retrieved from https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.


- Spark sets a new record in large scale sorting (n.d.). Retrieved from https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html.


- Spark Cluster Overview (n.d.). Retrieved from https://spark.apache.org/docs/latest/cluster-overview.html.