# Part I: Installation and first steps

Author: **Julien Peloton** [@JulienPeloton](https://github.com/astrolabsoftware/spark-tutorials/issues/new?body=@JulienPeloton)  
Last Verified to Run: **2018-10-25**  

Welcome to the series of notebooks on Apache Spark! The main goal of this series is to get familiar with Apache Spark, and in particular its Python API called PySpark. 

__Learning objectives__

- Apache Spark: what is it?
- Installation @ HOME
- Using the PySpark shell
- Your first Spark program
- Going further


## Apache Spark 

[Apache Spark](http://spark.apache.org/) is a cluster computing framework, that is, a set of tools to perform computations on a network of many machines. Spark started in 2009 as a research project, and it has been hugely successful in industry since. It builds on the so-called MapReduce cluster computing paradigm, popularized by the Hadoop framework, and provides implicit data parallelism and fault tolerance.

The core of Spark is written in Scala, a general-purpose programming language that was started in 2004 by Martin Odersky (EPFL). The language is interoperable with Java and Java-like languages, and Scala executables run on the Java Virtual Machine (JVM). Note that Scala is not a pure functional programming language: it is multi-paradigm, supporting functional, imperative, object-oriented, and concurrent programming.

Spark provides many functionalities exposed through its Scala/Python/Java/R APIs (the Scala one being the most complete). As far as DESC is concerned, I would advocate using the Python API (called PySpark) for obvious reasons. But feel free to get your hands on Scala, it's worth it. For those interested, you can have a look at this [tutorial](https://gitlab.in2p3.fr/MaitresNageurs/QuatreNages/Scala) on Scala.

## Installation @ HOME

You might want to install Apache Spark on your laptop, to prototype programs and perform local checks. The easiest way to do so is to [download](https://spark.apache.org/downloads.html) a pre-built version of Spark (take the latest one). Untar it, move it to the location you want, and update your path such that it can be found when you launch a job:

```bash
# Put these lines in your HOME/.bash_profile
export SPARK_HOME=/path/to/spark
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
```

The latest version of Spark should run on Java 8+, but I recommend running it on Java 8. On macOS, you can list the JDKs installed on your machine with:

```
/usr/libexec/java_home -V
```

If Java 8 is not present, download the JDK and set it using:

```bash
# Put this line in your HOME/.bash_profile, with the 
# version number you just downloaded. Example:
export JAVA_HOME=`/usr/libexec/java_home -v 1.8.0_151`
```
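
Once both variables are in place (and a new terminal has been opened), a quick sanity check can save some head-scratching. The following is a minimal sketch in plain Python, not something shipped with Spark: the helper name `check_spark_install` and the labels of the checks are my own, and it only inspects the environment variables discussed above.

```python
import os
import shutil


def check_spark_install(env=None):
    """Report which of the settings described above are visible (hypothetical helper)."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    path = env.get("PATH", "")

    report = {
        "SPARK_HOME exported": spark_home is not None,
        "SPARK_HOME/bin on PATH": False,
        # Look for the pyspark launcher in the directories listed in PATH.
        "pyspark found": shutil.which("pyspark", path=path) is not None,
        "JAVA_HOME exported": env.get("JAVA_HOME") is not None,
    }
    if spark_home:
        bin_dir = os.path.join(spark_home, "bin")
        report["SPARK_HOME/bin on PATH"] = bin_dir in path.split(os.pathsep)
    return report


if __name__ == "__main__":
    for check, ok in check_spark_install().items():
        print("{:<25} {}".format(check, "OK" if ok else "MISSING"))
```

A `MISSING` entry does not necessarily mean Spark will not run, but it is the first thing I would look at if `pyspark` is not found.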

Obviously you won't process massive data sets on your laptop, but it is sufficient for getting familiar with the API and learning what's under the hood.

## Using the PySpark shell

### Python/IPython shells

To access the PySpark shell, just type `pyspark` in a terminal. You will land in the standard Python shell, augmented with the Spark environment and pre-loaded objects such as the `SparkContext` (`sc`) and the `SparkSession` (`spark`). Between you and me, the standard Python shell is rather ugly and lacks nice features. If you really want to increase your productivity, you probably want to switch to IPython as the Python executable used by PySpark (on the driver only). Just type in your shell:

```
PYSPARK_DRIVER_PYTHON=ipython pyspark
```
And you should see (with your corresponding Spark, Python and IPython versions):

```
Python 3.7.0 (default, Jun 28 2018, 07:39:16)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.0.1 -- An enhanced Interactive Python. Type '?' for help.
2018-10-24 21:13:45 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Python version 3.7.0 (default, Jun 28 2018 07:39:16)
SparkSession available as 'spark'.

In [1]:
```

If it complains that IPython cannot be found but you know it is installed somewhere, just specify the full path to it (to see it: `which ipython`). As said previously, you'll have your Spark environment loaded and a few objects ready:

```python
In [1]: # Spark Session 
In [2]: spark
Out[2]: <pyspark.sql.session.SparkSession at 0x10d8d6e80>
    
In [3]: # Spark Context 
In [4]: sc
Out[4]: <SparkContext master=local[*] appName=PySparkShell>
```

### Specifying resources

By default, if you just execute `pyspark`, you will run in _local_ mode, with very few options:

```python
In [1]: sc.getConf().getAll()
Out[1]:
[('spark.eventLog.enabled', 'true'),
 ('spark.driver.port', '51813'),
 ('spark.driver.host', '192.168.0.10'),
 ('spark.executor.id', 'driver'),
 ('spark.app.name', 'PySparkShell'),
 ('spark.app.id', 'local-1540409933299'),
 ('spark.sql.catalogImplementation', 'hive'),
 ('spark.rdd.compress', 'True'),
 ('spark.eventLog.dir',
  '/Users/julien/Documents/workspace/lib/spark/logfiles'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.history.fs.logDirectory',
  '/Users/julien/Documents/workspace/lib/spark/logfiles')]
```

While this is useful for debugging and testing, it sets very few options and is of little use for working at scale. Instead, you should specify the kind of resources you want for your Spark cluster. There are literally hundreds of options that can be set (see [here](https://spark.apache.org/docs/latest/configuration.html)), but here is a summary of the most useful ones to start with:

```
# At home
PYSPARK_DRIVER_PYTHON=ipython pyspark --master local[*] \
  --driver-memory 2g --executor-memory 2g \
  --jars ... --packages ... --py-files ...

# On a cluster, or at NERSC on interactive nodes
ncore_per_node=32
n_node=...
PYSPARK_DRIVER_PYTHON=ipython pyspark --master ${SPARKURL} \
  --driver-memory 64g --executor-memory 64g \
  --executor-cores ${ncore_per_node} \
  --total-executor-cores $((ncore_per_node * n_node)) \
  --jars ... --packages ... --py-files ...
```
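
If you launch Spark from scripts, it can help to assemble these options programmatically. Here is a small sketch in plain Python: the function `pyspark_command` is a hypothetical helper of mine, the master URL `spark://host:7077` and the 4-node cluster are placeholders, and only the flags shown above are used. Note how `--total-executor-cores` is simply the number of cores per node times the number of nodes.

```python
import shlex


def pyspark_command(master, driver_memory, executor_memory,
                    cores_per_node=None, n_nodes=None):
    """Return the pyspark launch command as a list of arguments (hypothetical helper)."""
    args = [
        "pyspark",
        "--master", master,
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
    ]
    if cores_per_node is not None and n_nodes is not None:
        args += [
            "--executor-cores", str(cores_per_node),
            # Total number of cores = cores per node x number of nodes.
            "--total-executor-cores", str(cores_per_node * n_nodes),
        ]
    return args


# At home: local mode, using all available cores.
local = pyspark_command("local[*]", "2g", "2g")
print(" ".join(shlex.quote(arg) for arg in local))

# On a cluster: placeholder master URL, 4 nodes of 32 cores each.
cluster = pyspark_command("spark://host:7077", "64g", "64g",
                          cores_per_node=32, n_nodes=4)
print(" ".join(shlex.quote(arg) for arg in cluster))
```

`shlex.quote` puts `local[*]` between single quotes, which avoids surprises with shells that try to expand the square brackets. The `PYSPARK_DRIVER_PYTHON` variable is left out of the sketch and would still have to be set in the environment.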

Note that, if needed, options can be set automatically in a configuration file (`spark-defaults.conf`) stored under `/path/to/spark/conf/`.

The options `--jars`, `--packages`, and `--py-files` are ways to specify third-party packages. `--jars` and `--packages` pull in third-party Java/Scala/etc. packages that come with a Python API you would like to use, and `--py-files` ships additional Python modules. You might say:

```
Why do I need to specify other python modules if they can be found in my PYTHONPATH or PATH?
```

The answer is: because of distributed computing! Your _driver_, that is, the machine that runs your main program, is probably aware of those modules. But the _executors_, that is, the other machines of the network, probably are not (unless you explicitly installed the modules on all the machines of the network).
Of course, if you are in local mode (e.g. on your laptop), _driver_ and _executors_ live on the same machine (just different threads), so you do not see this problem. But on a real cluster, you need to connect all of this. Note that modules listed in `--py-files` will basically be broadcast to the executors.
We will see examples later.
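
In the meantime, here is one way to prepare a local package for `--py-files`, which accepts `.py`, `.zip` and `.egg` files. This is only a sketch using the standard library: the function name `zip_package` is hypothetical, and it assumes your package is a plain directory of `.py` files.

```python
import os
import zipfile


def zip_package(package_dir, zip_path):
    """Zip a Python package directory so it can be passed to --py-files."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full_path = os.path.join(root, name)
                    # Store paths relative to the parent directory, so that
                    # the package name stays the top-level folder in the zip.
                    archive.write(full_path, os.path.relpath(full_path, parent))
    return zip_path
```

With a (hypothetical) package folder `mypackage/`, you would call `zip_package("mypackage", "mypackage.zip")` and then pass `--py-files mypackage.zip` when launching `pyspark` or `spark-submit`.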

### Working with Jupyter

You can use Apache Spark with Jupyter as you would do for IPython:

```
PYSPARK_DRIVER_PYTHON=jupyter-notebook pyspark <...resource...>
```

Then select a Python kernel, and done! For NERSC, you will need a specific kernel (see the second part of this series of tutorials).

## Your first Spark program

Interactive jobs are OK, but at some point you might want to switch to batch mode. One of the strengths of (Py)Spark is that you do not need a different structure for your program whether you are dealing with KB or TB of data. Here is a very simple PySpark program (it can be found at [astrolabsoftware/tutorials](https://github.com/astrolabsoftware/spark-tutorials)):

```python
# in resources/part1_example.py
from pyspark.sql import SparkSession

# (1) Initialise the Spark Session
spark = SparkSession\
    .builder\
    .getOrCreate()

# (2) Generate fake data on the driver
mylist = [
    ["Julien", 67], 
    ["Ιουλιανός", 32], 
    ["Юлиан", 89],
    ["尤利安", 40]
]

# (3) Distribute it over the network
rdd = spark.sparkContext.parallelize(mylist)

# (4) Return the mean of ages:
meanage = rdd.map(lambda x: x[1]).mean()
print("Mean age is {}".format(meanage))

# (5) Return the names of persons whose age is below 60
belowthreshold = rdd\
    .filter(lambda x: x[1] < 60)\
    .map(lambda x: x[0])\
    .collect()
print("{} is/are below 60".format(belowthreshold))

# (6) Go from RDD to DataFrame
df = rdd.toDF(["Name", "Age"])
df.show()
df.printSchema()
```

Remarks (see numbers in the example above):

- (1) Note that this is automatically done for you in the PySpark shell.
- (2) Just for the sake of having something simple enough. In practice you do not generate data on the driver. You would read and distribute (massive) data from disk for example. The IO is discussed in the second part of this tutorial.
- (3) A Resilient Distributed Dataset (**RDD**) is a collection of records partitioned across all machines. RDDs are distributed memory abstractions, fault-tolerant and immutable. This is a central object in Spark. In recent Spark versions, you would rather manipulate **DataFrames**, which are more or less RDDs plus a schema of the data. Even if you will not manipulate RDDs explicitly in the future, keep this in mind.
- (4) (a) For all pairs, take only the second element, (b) compute partial statistics (count and mean) within each partition and send them to the driver, and (c) combine them on the driver into the final mean.
- (5) Keep only persons whose age is below 60, and return only their names to the driver. Partitions are processed in parallel as long as there are enough resources (if not, partitions queue until a Spark mapper becomes available). Spark also tries to enforce data locality as much as possible (i.e. it sends the computation to worker nodes as close as possible to the DataNodes where the data are stored).
- (6) You can easily create a DataFrame from an RDD. If the data is simple enough (as in this case), data types are inferred.
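
To get a feeling for what happens in steps (4) and (5), here is a sketch in plain Python (no Spark involved) that mimics the partitioned computation. The splitting into 3 contiguous chunks and the function names are my own choices for illustration. Spark's actual implementation keeps track of a bit more state per partition, but the idea is the same: each partition produces a small partial result, and only those partial results are combined.

```python
from functools import reduce


def split_into_partitions(records, n_partitions):
    """Cut records into contiguous chunks, standing in for RDD partitions."""
    n = len(records)
    bounds = [i * n // n_partitions for i in range(n_partitions + 1)]
    return [records[bounds[i]:bounds[i + 1]] for i in range(n_partitions)]


def partial_stats(partition):
    """Work done next to the data: (count, sum) of the ages in one partition."""
    ages = [record[1] for record in partition]  # same as map(lambda x: x[1])
    return len(ages), sum(ages)


def merge_stats(a, b):
    """Combine two partial results; only these small tuples travel around."""
    return a[0] + b[0], a[1] + b[1]


mylist = [["Julien", 67], ["Ιουλιανός", 32], ["Юлиан", 89], ["尤利安", 40]]
partitions = split_into_partitions(mylist, 3)  # 3 is an arbitrary choice

# Step (4): partial statistics per partition, merged into one mean.
count, total = reduce(merge_stats, [partial_stats(p) for p in partitions])
meanage = total / count
print("Mean age is {}".format(meanage))

# Averaging the per-partition means directly would be wrong here, because
# the partitions do not all hold the same number of records.
submeans = [s / c for c, s in map(partial_stats, partitions)]
naive = sum(submeans) / len(submeans)
print("Naive mean of sub-means is {}".format(naive))

# Step (5): filter and map inside each partition, then gather the names.
belowthreshold = [
    record[0]
    for partition in partitions
    for record in partition
    if record[1] < 60
]
print("{} is/are below 60".format(belowthreshold))
```

With this particular split, the merged mean is 57.0 while the naive mean of the sub-means is 54.5, which is why the counts have to travel along with the partial results.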

You can run this example with, for instance:

```
spark-submit --master local[*] \
    --driver-memory 2g --executor-memory 2g \
    resources/part1_example.py
```

Remarks:

- Spark is really verbose... You get an overwhelming flow of messages coming from the JVM and Spark (log4j), and from our program. Whether they are useful or not depends on what you want to do (learn/debug/prod/...) and on your programming skills (do I understand what is written? And if so, can I switch some of them off without missing an important piece of information?). You can set the level of verbosity for Spark either with a conf file (see [here](http://spark.apache.org/docs/latest/configuration.html#configuring-logging)) or directly inside your program.
- For such a simple example, the time to start Spark clearly dominates the actual computation. It's time to move to bigger data volumes!
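
For the "directly inside your program" route, the call to use is the one advertised in the shell banner, `sc.setLogLevel(newLevel)`. Below is a minimal sketch wrapping it: the helper name `set_spark_verbosity` is hypothetical, and it assumes you pass it the `SparkSession` created in step (1).

```python
# Levels accepted by SparkContext.setLogLevel, as listed in its documentation.
VALID_LEVELS = ("ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN")


def set_spark_verbosity(spark, level="WARN"):
    """Set the Spark log level from inside a program (hypothetical helper).

    `spark` is expected to be a SparkSession, e.g. the one built in step (1).
    """
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError("Unknown log level: {}".format(level))
    # The log level is set through the SparkContext attached to the session.
    spark.sparkContext.setLogLevel(level)
    return level
```

In the example program, you would call `set_spark_verbosity(spark, "WARN")` right after step (1). Messages emitted while Spark starts up, before this call, are not affected.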

## Going further

Here is a series of useful links on similar topics:

- TBD