# Set up your laptop for the labs

This guide walks you through setting up a local environment to run the labs.

There are two options:
- Option A (recommended): run our training Docker image on your computer
- Option B: set up your own computer

## Option A : Run training Docker 
This is the same environment you used for training.

Here is the [Docker repository](https://hub.docker.com/r/elephantscale/es-training).

Follow the instructions on that page.


---

## Option B : Setting up your own machine

### Operating System
Setup is easier on **macOS or Linux**.  
If you are on Windows, download the [Docker image](https://hub.docker.com/r/elephantscale/es-training) and do the setup in the sandbox.

### Software Needed
- Java version 11 or later
- Anaconda Python version 3.x
- A few more Python packages
- Spark version 3.0 or later
- Our data files


### B1: Java 11
Download and install the JDK (not the JRE), version 11 or later, from [here](https://www.oracle.com/java/technologies/javase-jdk11-downloads.html).  
Verify that you have the correct version by running:
```bash
$   java -version
```
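If you prefer a scripted check, the sketch below extracts the major version from the output of `java -version` and compares it against 11. Note that `java -version` prints to stderr, and that Java 8 and earlier report themselves as `1.x`. The helper name `java_major_version` is hypothetical and not part of the labs.

```bash
# Hypothetical helper: print the major Java version found in one line of
# `java -version` output, e.g. 'openjdk version "11.0.12" 2021-07-20'.
java_major_version() (
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    case "$ver" in
        1.*) ver=${ver#1.} ;;        # legacy scheme: "1.8.0_301" means Java 8
    esac
    printf '%s\n' "${ver%%[!0-9]*}"  # keep only the leading digits
)

# The check only runs if a `java` binary is on the PATH.
if command -v java >/dev/null 2>&1; then
    line=$(java -version 2>&1 | grep 'version "' | head -n 1)
    major=$(java_major_version "$line")
    if [ "${major:-0}" -ge 11 ]; then
        echo "Java $major found: OK"
    else
        echo "Java 11 or later is required (found: ${major:-unknown})"
    fi
fi
```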

### B2: Anaconda Python

Download and install Anaconda Python version 3.x from [here](https://www.anaconda.com/download/).

### B3: Create a separate conda environment

A dedicated Conda environment is highly recommended: it keeps the lab packages isolated and avoids conflicts with your other Python projects.

```bash
# create an env called 'pyspark' with python version 3.8
$   conda create -n pyspark python=3.8
# list all environments; 'pyspark' should now appear
$   conda env list

# activate the pyspark env; all subsequent installs will go into this environment
$   conda activate pyspark
```
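To confirm from a script that the right environment is active, you can inspect the `CONDA_DEFAULT_ENV` variable, which `conda activate` sets to the name of the active environment. The following is a minimal sketch; the helper name `in_conda_env` is hypothetical.

```bash
# Hypothetical helper: succeed only if the named conda env is the active one.
# `conda activate` exports the active environment's name in CONDA_DEFAULT_ENV.
in_conda_env() {
    [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}

if in_conda_env pyspark; then
    echo "pyspark environment is active"
else
    echo "pyspark environment is NOT active; run: conda activate pyspark"
fi
```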


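### B4: Install the following add-on packages
Open a **new** terminal, activate the `pyspark` environment, and run the following commands:

```bash
# a new terminal starts in the default env, so activate 'pyspark' first
$   conda activate pyspark
$   conda install numpy  pandas  matplotlib  seaborn  jupyter  jupyterlab
$   conda install -c conda-forge findspark
```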
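One way to confirm that the packages landed in the right environment is to try importing each one. The sketch below does that; the helper name `missing_py_modules` is hypothetical, and it assumes each import name matches its package name (`jupyter` is left out because it is mainly a command-line entry point).

```bash
# Hypothetical helper: print every Python module from the argument list that
# fails to import; the exit status is non-zero if any module is missing.
missing_py_modules() (
    py=$(command -v python || command -v python3) || {
        echo "no python interpreter found on PATH" >&2
        exit 2
    }
    status=0
    for mod in "$@"; do
        if ! "$py" -c "import $mod" >/dev/null 2>&1; then
            echo "$mod"
            status=1
        fi
    done
    exit $status
)

# Run this inside the activated 'pyspark' environment.
missing_py_modules numpy pandas matplotlib seaborn jupyterlab findspark \
    || echo "some modules are not importable; see the names listed above"
```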

Note: if you ever need to start over, you can remove the environment like this:

```bash
# $  conda deactivate
# $  conda remove --name pyspark --all
```

### B5: Download Spark
- Download the latest Spark from [here](https://spark.apache.org/downloads.html)
- Unpack the downloaded archive (a `.tgz` file)
- The directory where Spark is unpacked is the `SPARK_HOME` (in our case `~/spark`)

(The labs are tested with Spark version 3.0; the commands below download version 3.2.1.)

```bash
$   cd    # go to home directory
$   rm -rf  spark   # CAUTION: deletes any existing ~/spark installation

# download
$   wget https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
# alternative download locations
# (older releases are moved from dlcdn.apache.org to the Apache archive)
# $   wget  https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
# $   wget  https://elephantscale-public.s3.amazonaws.com/downloads/spark-3.2.1-bin-hadoop3.2.tgz

# unpack and rename the directory to ~/spark
$   tar xvf spark-3.2.1-bin-hadoop3.2.tgz
$   mv  spark-3.2.1-bin-hadoop3.2    spark
```
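After unpacking, a quick sanity check is to confirm that the directory contains an executable `bin/spark-submit` and to ask it for its version. This is a minimal sketch that assumes Spark lives in `~/spark` as in the steps above; the helper name `is_spark_home` is hypothetical.

```bash
# Hypothetical helper: succeed if the given directory looks like an unpacked
# Spark distribution, i.e. it contains an executable bin/spark-submit.
is_spark_home() {
    [ -n "$1" ] && [ -x "$1/bin/spark-submit" ]
}

if is_spark_home "$HOME/spark"; then
    # prints the Spark version banner
    "$HOME/spark/bin/spark-submit" --version
else
    echo "No Spark installation found at $HOME/spark"
fi
```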

### B6: Download data files

- Download our data files from [here](https://s3.amazonaws.com/elephantscale-public/data/data.zip)

```bash
$   wget 'https://s3.amazonaws.com/elephantscale-public/data/data.zip'
```

- Unzip this bundle in the top-level project directory, so that the structure looks like this:

```text
    spark-labs
        +--- README.ipynb
        +--- data
               +--- house-sales
               +--- uber
               +--- etc
```
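To check that the bundle was unzipped in the right place, you can test for the expected directories. The sketch below assumes the project directory is `./spark-labs` and checks only the subdirectories named in the tree above; the helper name `missing_dirs` is hypothetical.

```bash
# Hypothetical helper: print each expected subdirectory that is missing under
# the given base directory; prints nothing when the layout is complete.
missing_dirs() (
    base=$1
    shift
    for sub in "$@"; do
        [ -d "$base/$sub" ] || echo "$sub"
    done
)

# Assumes the project directory is ./spark-labs; adjust the path as needed.
missing=$(missing_dirs spark-labs data data/house-sales data/uber)
if [ -z "$missing" ]; then
    echo "data layout looks complete"
else
    echo "missing under spark-labs:"
    echo "$missing"
fi
```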

### B7: Download labs / solutions
Unzip the labs and solutions archives into any directory.

### B8: Edit `run-jupyter.sh`
This file is located in the labs directory.
Edit it to match your environment.

```bash
# TODO: edit the following lines to match your environment
export PATH=$HOME/anaconda3/bin:$PATH    # location of your Anaconda install
export SPARK_HOME=$HOME/spark            # where Spark was unpacked in step B5
jupyter lab
```



### B9: Run the labs
```bash
$   ./run-jupyter.sh
```

### B10: Open and run the `testing123.ipynb` file
This file is in the `0-testing` directory.  
It checks your setup.  
If it runs without errors, you are good to go.

### B11: Open `README.ipynb`
And practice!
