# Installing Spark

---

This file explains how to

* install PySpark **locally** on your own computer, and
* integrate it with the Jupyter Notebook.

---
## 0. Check and install dependencies

- Is **`Homebrew`** installed?
    - Get it [here](http://brew.sh/)
    

- Is **`Java`** installed? (Run `java -version` to check)
    - Install (by downloading the `.dmg` files)
        - **Java SE Development Kit** [JDK 8](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) and
        - **Java Runtime Environment** [JRE 8](http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html)
        

- Is **`Scala`** installed? (Run `scala -version` to check)
    - Run 
        - `brew install scala`  
        - `brew install sbt`
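
The checklist above can also be run as one small script. This is a minimal sketch, assuming Python 3.3+ (for `shutil.which`) and assuming each tool is invoked by the command name used above; `REQUIRED_TOOLS`, `find_tools`, and `missing_tools` are hypothetical names chosen for this example. It only checks that the commands are on your `PATH`, not which versions are installed.

```python
import shutil

# Command names taken from the checklist above (assumed to be on PATH once installed).
REQUIRED_TOOLS = ["brew", "java", "scala", "sbt"]


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that could not be found on PATH."""
    return [tool for tool, path in find_tools(tools).items() if path is None]


if __name__ == "__main__":
    # Print one line per tool so gaps are easy to spot.
    for tool, path in find_tools().items():
        print("{:<6} {}".format(tool, path or "NOT FOUND"))
```

Anything reported as `NOT FOUND` still needs to be installed using the steps above.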

---

## 1. Two Methods of Installing Spark


## $Build$ It 

- Download the latest version from the official Spark [downloads](http://spark.apache.org/downloads.html) page as a `.tgz` file <br><br>
- Move the `.tgz` file to your preferred installation directory <br><br>
- Untar it by running (change the version number to match your download!)
    - `tar -xvzf spark-1.5.1-bin-hadoop2.4.tgz` <br><br>
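- `cd` to the untarred folder. A pre-built package (one with `bin` in its name, like the example above) is ready to use as is. If you downloaded the source package instead, build Spark with the Scala Build Tool (`sbt`) by running the following command. (This will take a while.)
    - `sbt assembly`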


## $Brew$ It 

- If all dependencies are met, simply open the Terminal and run the following
    - `brew install apache-spark`

---

## 2. Using Spark with Jupyter Notebooks


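- To make `pyspark` start a Jupyter Notebook server instead of the plain Python shell, we have to set two environment variables<br><br>
- Open up the Terminal and type out the following commands (add them to your shell profile, e.g. `~/.bash_profile`, to make them permanent)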


        export PYSPARK_DRIVER_PYTHON=ipython
        export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777"

- Then run `pyspark` in the same Terminal session <br><br>
- Navigate to [http://localhost:7777/tree](http://localhost:7777/tree) and launch a new notebook.<br><br>
- Check that everything worked by running `sc` in a cell. If the `SparkContext` exists, we're ready to use Spark!<br><br>
- Run the following code to create an RDD

        x = sc.parallelize(range(10))
        print(type(x))
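
To go one step past creating the RDD, the sketch below applies a transformation and an action to it. It assumes the notebook was started through `pyspark` as described above, so that `sc` already exists. `sum_of_squares` is a hypothetical helper name for this example, not part of Spark; `parallelize`, `map`, and `reduce` are standard RDD methods.

```python
from operator import add


def sum_of_squares(sc, n=10):
    """Distribute range(n) as an RDD, square each element, and sum the results."""
    rdd = sc.parallelize(range(n))      # create the RDD, as in the cell above
    squares = rdd.map(lambda v: v * v)  # transformation: lazy, nothing runs yet
    return squares.reduce(add)          # action: triggers the actual computation


# In a notebook cell where `sc` exists:
# sum_of_squares(sc)  # 0^2 + 1^2 + ... + 9^2 = 285
```

Passing `sc` in as an argument keeps the function independent of the notebook's global state, which makes it easier to reuse in a script later.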
 
--- 
 
## 3. Examples

- [Word Count](http://nbviewer.jupyter.org/github/marek5050/Hadoop_Examples/blob/master/SparkieNET.ipynb). Because every first example in the Big Data world has to be.
- Data on S3. Spark on a Cluster. 1TB Reddit comments data. [Link](http://blog.insightdatalabs.com/jupyter-on-apache-spark-step-by-step/)

---

## 4. Further Reading

- StackOverflow Top Questions
    - [Spark](http://stackoverflow.com/questions/tagged/apache-spark)
    - [PySpark](http://stackoverflow.com/questions/tagged/pyspark)
    - [Spark SQL](http://stackoverflow.com/questions/tagged/apache-spark-sql)
    - [DataFrame](http://stackoverflow.com/questions/tagged/spark-dataframe)
    - [MLlib](http://stackoverflow.com/questions/tagged/apache-spark-mllib)
    
- [Official Documentation](http://spark.apache.org/docs/latest/) (version 1.5.0 at the time of writing)
- [Spark Packages](http://spark-packages.org/)
- Understand the [difference between MR and Spark](http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/)


## 5. Tutorial Videos

- [Sparkling Pandas](https://www.youtube.com/watch?v=AcyI_V8FeIU) with Holden Karau and Juliet Hougland at PyGotham 2014
- [Introduction to Spark with Python](https://www.youtube.com/watch?v=9xYfNznjClE) with Orlando Karam at PyCon 2015
- [Practical Machine Learning Pipelines with MLlib](https://www.youtube.com/watch?v=Riuee7qxdX4) with Joseph Bradley (Databricks)
- *Spark Summit 2015 Videos*
    - [Keynotes](https://www.youtube.com/playlist?list=PL-x35fyliRwgdKsaLFMwl-Q-vSd7-X6mi)
    - Tutorials: [Intro to Apache Spark & Advanced Spark](https://www.youtube.com/playlist?list=PL-x35fyliRwioDix9XjD3HptH8ro55SuB)
    - Data Science Track: [Tutorials](https://www.youtube.com/playlist?list=PL-x35fyliRwhP52fwDqULJLOnqnrN5nDs)
    
