# Configuring PySpark in Linux

> tested on Ubuntu 19.10 with Anaconda Python 3.6

* Spark 2.4.4 depends on Java 8. To install it on Ubuntu 19 or later:

```bash
sudo apt update && sudo apt upgrade
sudo apt install default-jdk
sudo apt install openjdk-8-jdk
# (optional) 
sudo apt install scala 
```

* tell Ubuntu to use Java 8 as the default version instead of 11 (pick the Java 8 entry at the interactive prompt), then confirm the active version:

```bash
sudo update-alternatives --config java
java -version
```
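If you would rather check the active Java version from a script than by eye, a sketch along these lines could work. It assumes `java -version` prints a line such as `openjdk version "1.8.0_232"`; the helper names `parse_java_major` and `installed_java_major` are made up for this example.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and older report themselves as 1.x (for example "1.8.0_232").
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Run `java -version` and parse the result; None if java cannot be run."""
    try:
        result = subprocess.run(
            ["java", "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,  # `java -version` writes to stderr
            universal_newlines=True,
        )
    except OSError:
        return None
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major == 8:
        print("Java 8 is the default, as this guide expects")
    else:
        print("default Java major version:", major)
```

If this reports anything other than 8, rerun `sudo update-alternatives --config java` and choose the Java 8 entry.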

* download Spark: as of Nov 2019, click [spark-2.4.4 hadoop2.7](https://spark.apache.org/downloads.html) on the downloads page (***not the preview version***). After downloading, run the following commands in the directory that contains the archive:

```bash
tar -xzf spark-2.4.4-bin-hadoop2.7.tgz
sudo mv spark-2.4.4-bin-hadoop2.7 /opt/spark-2.4.4
sudo ln -s /opt/spark-2.4.4 /opt/spark
```
* finally, tell bash where to find Spark: open your `~/.bashrc` file, add the two `export` lines below, then reload the file:

```bash
# open the file in an editor:
nano ~/.bashrc
# add these two lines at the end of the file:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
# save, exit, and then reload the file:
source ~/.bashrc
```
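To double check that the two exports took effect, a small Python sketch like the one below can compare `SPARK_HOME` and `PATH`. The function name `spark_env_problems` is hypothetical, and the check is deliberately simple: it only looks for `$SPARK_HOME/bin` as a literal `PATH` entry.

```python
import os


def spark_env_problems(env):
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems
    spark_bin = os.path.join(spark_home, "bin")
    # Ignore trailing slashes so "/opt/spark/bin/" still counts as a match.
    path_entries = [p.rstrip("/") for p in env.get("PATH", "").split(os.pathsep)]
    if spark_bin.rstrip("/") not in path_entries:
        problems.append(spark_bin + " is not on PATH")
    return problems


if __name__ == "__main__":
    for line in spark_env_problems(os.environ) or ["SPARK_HOME and PATH look consistent"]:
        print(line)
```

Run it in a fresh terminal (or after `source ~/.bashrc`), since the variables only exist in shells that have read the updated file.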

> installing *pyspark* in an Anaconda environment:

```bash
conda activate <YOUR-CONDA-ENV>
conda install -c conda-forge pyspark
```

* test the installation in the PySpark shell: run `pyspark`, then try a small job such as this Monte Carlo estimate of pi:

```shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Python version 3.6.9 (default, Jul 30 2019 19:07:31)
SparkSession available as 'spark'.
>>> import random
>>> nsamples = 10000000
>>> def inside(p):
...   x = random.random()
...   y = random.random()
...   return x*x + y*y < 1
...
>>> count = sc.parallelize(range(0, nsamples)).filter(inside).count()
>>> pi = 4 * count / nsamples
>>> print(pi)
3.1418816
>>> sc.stop()
```
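The job in the transcript above is a Monte Carlo estimate: it draws random points in the unit square and counts how many land inside the quarter circle of radius 1, a fraction that approaches pi / 4. For comparison, here is the same idea in plain Python without Spark. The function name `estimate_pi` and its `seed` parameter are assumptions added for this sketch, not part of the original example.

```python
import random


def estimate_pi(nsamples, seed=None):
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)  # a seed makes the run repeatable
    count = 0
    for _ in range(nsamples):
        x, y = rng.random(), rng.random()
        # Same test as `inside` in the Spark version.
        if x * x + y * y < 1:
            count += 1
    # count / nsamples approximates the quarter-circle area, pi / 4.
    return 4 * count / nsamples


if __name__ == "__main__":
    print(estimate_pi(100000, seed=0))
```

The Spark version distributes this loop: `parallelize` splits the range across workers, `filter(inside)` keeps the hits, and `count()` adds them up.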

**method 1**

> running pyspark in a Jupyter notebook: set the two variables below so that `pyspark` launches Jupyter instead of the shell. You can add them to your `~/.bashrc` file to make Jupyter the default, or simply execute them whenever you want to use Jupyter instead of the shell environment.

* NOTE: I recommend running these exports in a new terminal where no conda environment is activated.

```shell
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
```

* to start pyspark in jupyter simply run:

```shell
pyspark
```

**method 2**

> a more general way to use PySpark in a Jupyter notebook is the `findspark` package, which makes your Spark installation importable so that you can create a Spark context in your own code.

```shell
conda install -c conda-forge findspark
jupyter notebook
```
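To give a sense of why this works, the sketch below imitates the general idea behind `findspark.init()`: point `SPARK_HOME` at the install and put Spark's bundled Python sources on `sys.path`. It is a rough illustration, not findspark's actual implementation. The function names are hypothetical, and the layout (`python/` plus `python/lib/py4j-*.zip` under the Spark directory) and the `/opt/spark` fallback are assumptions based on the install steps in this guide.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the paths under spark_home that hold PySpark and py4j."""
    spark_python = os.path.join(spark_home, "python")
    # py4j ships as a zip archive inside the Spark distribution.
    py4j_zips = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))
    return [spark_python] + py4j_zips


def init_spark_paths(spark_home=None):
    """Set SPARK_HOME and prepend Spark's Python paths to sys.path."""
    spark_home = spark_home or os.environ.get("SPARK_HOME", "/opt/spark")
    os.environ["SPARK_HOME"] = spark_home
    for path in spark_python_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)
    return spark_home
```

In practice you would just call `findspark.init()`; the point is that it has to run before `import pyspark` so the import can find the package.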

* inside a jupyter notebook run the following to test:

```python
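import findspark
findspark.init()  # locate the Spark install and make pyspark importable

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
nsamples = 100000000

def inside(p):
    # draw a random point in the unit square; True if it lands inside the quarter circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# the fraction of points inside the quarter circle approximates pi / 4
count = sc.parallelize(range(0, nsamples)).filter(inside).count()
pi = 4 * count / nsamples
print(pi)
sc.stop()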
```

* the `print(pi)` line outputs an estimate of pi, for example `3.14086944` (the value changes from run to run).