## Setup

This guide was written in Python 3.5.


### Python and Pip

Download [Python](https://www.python.org/downloads/) and [Pip](https://pip.pypa.io/en/stable/installing/).

### Modules

```
```

### Other 

Follow [these instructions](https://www.dataquest.io/blog/pyspark-installation-guide/) to set up PySpark.


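### sparklyr

To use Spark from R, install sparklyr and then let it install a local copy of Spark:

``` r
install.packages("sparklyr")
library(sparklyr)
spark_install()
```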

## Background

Spark is a general-purpose engine for distributed computation on clusters. It can cache data in memory between stages, which improves performance. Hadoop MapReduce, in contrast, writes intermediate results to disk between stages. Spark is also more flexible because it is not constrained to the MapReduce model.

It might seem as though Spark is a replacement for Hadoop. Although it is sometimes used that way, it can also complement Hadoop's functionality: by running Spark on top of a Hadoop cluster, you can still use HDFS for storage and YARN for resource management while Spark replaces MapReduce as the processing engine.


## REPLs

Spark ships with a built-in **R**ead-**E**valuate-**P**rint **L**oop (REPL) in the form of a shell that can be used for interactive analysis. You can start it by entering the following into your terminal:

`$SPARK_HOME/bin/pyspark`

Next, you can run the following in Python:

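``` python
import pyspark

# Start a local SparkContext that uses all available cores
sc = pyspark.SparkContext("local[*]", "demo")
```

If `pyspark` is not on your Python path, the import fails with:

```
ModuleNotFoundError: No module named 'pyspark'
```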

Note that Spark allows only one `SparkContext` to be active at a time, so you'll need to stop the current context (for example with `sc.stop()`) before starting a new one.

If the import fails, make sure `PYTHONPATH` includes Spark's Python modules:

``` bash
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/:$PYTHONPATH
```

Now in R, the process is very similar: 

``` r
library(SparkR, 
        lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", 
               sparkConfig = list(spark.driver.memory = "2g"))
```

## Joins in Spark

In Spark, a join is executed by emitting key-value pairs `(k, v)` from each dataset and then combining the pairs whose keys match.
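To make the key-value idea concrete, here is a minimal pure-Python sketch of an inner join on matching keys. It is not how Spark implements joins internally (Spark shuffles data across partitions on a cluster); it only mimics the shape of the result, which in PySpark's `RDD.join` is a collection of `(k, (v1, v2))` pairs. The function `join_pairs` and the sample data are hypothetical names made up for this illustration.

``` python
from collections import defaultdict


def join_pairs(left, right):
    """Inner-join two lists of (key, value) pairs on equal keys."""
    # Group the right-hand values by key so each lookup is cheap.
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)

    # Emit one (k, (v_left, v_right)) pair per matching combination.
    joined = []
    for key, left_value in left:
        for right_value in right_by_key.get(key, []):
            joined.append((key, (left_value, right_value)))
    return joined


# Hypothetical example data: (user_id, name) and (user_id, purchase).
users = [(1, "ada"), (2, "grace"), (3, "linus")]
purchases = [(1, "book"), (1, "pen"), (3, "laptop"), (4, "phone")]

result = join_pairs(users, purchases)
print(result)
```

Keys that appear on only one side (here `2` and `4`) are dropped, which is the behavior of an inner join.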

## DataFrames in Spark

Spark SQL provides support for writing SQL queries in Spark. The key feature of Spark SQL is the use of DataFrames instead of RDDs. Recall that a DataFrame is data organized into rows and named columns.

In Spark, operations on DataFrames are essentially operations on RDDs, but the RDDs are created by the execution engine rather than directly by the user. With Spark, it's also possible to convert RDDs to DataFrames and vice versa.

## Machine Learning with Spark

There are two machine learning APIs in Spark, `mllib` and `ml`. `mllib` is based on RDDs while the `ml` package is based on DataFrames.

### Vectors 

A local vector is often used as a base type for RDDs in `mllib`. A local vector has integer-typed indices and double-typed values, and it is stored on a single machine. `mllib` supports two types of local vectors: dense and sparse.

A dense vector is backed by a double array representing its entry values, whereas a sparse vector is backed by two parallel arrays: indices and values. For dense vectors, `mllib` uses either Python lists or NumPy arrays. For sparse vectors, users can construct a `SparseVector` object from `mllib` or use SciPy `scipy.sparse` column vectors.
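The difference between the two layouts can be sketched in plain Python. This is only an illustration of the storage scheme, not the `mllib` implementation; the helper names `sparse_to_dense` and `sparse_dot` are hypothetical. In `pyspark.mllib.linalg`, the corresponding constructors are `Vectors.dense([1.0, 0.0, 3.0])` and `Vectors.sparse(3, [0, 2], [1.0, 3.0])`.

``` python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


def sparse_dot(indices, values, dense):
    """Dot product that only touches the non-zero entries of the sparse side."""
    return sum(value * dense[index] for index, value in zip(indices, values))


# Dense layout: one double per entry, zeros included.
dense_vec = [1.0, 0.0, 3.0]

# Sparse layout: the vector size plus two parallel arrays.
sparse_size = 3
sparse_indices = [0, 2]      # positions of the non-zero entries
sparse_values = [1.0, 3.0]   # the non-zero entries themselves

print(sparse_to_dense(sparse_size, sparse_indices, sparse_values))
print(sparse_dot(sparse_indices, sparse_values, dense_vec))
```

The sparse layout pays off when most entries are zero, because both storage and operations such as the dot product scale with the number of non-zero entries rather than with the vector size.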

### Sparklyr

sparklyr is an R interface to Spark. If you know dplyr, you already know most of sparklyr. It is fairly new, though; the Scala and Python APIs are more developed.

sparklyr translates dplyr code to SQL before passing it to Spark.

Workflow:

1. Connect with `spark_connect("local")`.
2. Do your work.
3. Close the connection with `spark_disconnect(sc = SPARKCONNECTION)`.

`spark_connect()`: takes a URL that gives the location of Spark.

`spark_version(sc = SPARKCONNECTION)`: returns the version of Spark that the connection is using.

`spark_read_csv()`: reads CSV files into Spark.

`copy_to(dest = SPARKCONNECTION, df = DATAFRAME)` (from dplyr): copies data from R to Spark. This is very slow.

`src_tbls(x = SPARKCONNECTION)`: lists all data frames stored in Spark.


``` r
# Load sparklyr and dplyr
library(sparklyr)
library(dplyr)

# Explore track_metadata structure
str(track_metadata)

# Connect to your Spark cluster
spark_conn <- spark_connect("local")

# Copy track_metadata to Spark
track_metadata_tbl <- copy_to(dest = spark_conn, df = track_metadata)

# List the data frames available in Spark
src_tbls(x = spark_conn)

# Disconnect from Spark
spark_disconnect(spark_conn)
```