# Getting Started

## Platforms to Practice

Let us understand the different platforms we can use to practice Apache Spark using Python.

* Local Setup
* Using ITVersity labs
* Databricks Platform
* Setting up your own cluster

## Setup Spark Locally - Ubuntu

Let us set up Spark locally on Ubuntu.

* Install the latest version of Anaconda.
* Make sure Jupyter Notebook is set up and validated.
* Set up Spark and validate.
* Set up environment variables to integrate Pyspark with Jupyter Notebook.
* Launch Jupyter Notebook using the `pyspark` command.
* Set up PyCharm (IDE) for application development.
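Before launching `pyspark`, it can help to confirm the environment variables from Python. Here is a minimal sketch, not the course's own procedure. `SPARK_HOME`, `PYSPARK_DRIVER_PYTHON` and `PYSPARK_DRIVER_PYTHON_OPTS` are environment variables that Spark's launch scripts read, but the values `jupyter` and `notebook` are an assumed, commonly used choice, and `check_pyspark_env` is a hypothetical helper name.

```python
import os

# Assumed values: one common way to make `pyspark` start Jupyter Notebook.
# Adjust them to match your own setup.
EXPECTED = {
    "PYSPARK_DRIVER_PYTHON": "jupyter",
    "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",
}


def check_pyspark_env(env=None):
    """Return {variable: problem} for every variable that is missing or unexpected."""
    env = os.environ if env is None else env
    problems = {}
    # SPARK_HOME should point at the extracted Spark distribution.
    if not env.get("SPARK_HOME"):
        problems["SPARK_HOME"] = "not set"
    for name, expected in EXPECTED.items():
        actual = env.get(name)
        if actual != expected:
            problems[name] = f"expected {expected!r}, found {actual!r}"
    return problems


if __name__ == "__main__":
    # An empty dict means all three variables look as expected.
    print(check_pyspark_env())
```

An empty result suggests the `pyspark` command should open Jupyter Notebook; anything else names the variable to revisit.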

## Setup Spark Locally - Mac

* Install the latest version of Anaconda.
* Make sure Jupyter Notebook is set up and validated.
* Set up Spark and validate.
* Set up environment variables to integrate Pyspark with Jupyter Notebook.
* Launch Jupyter Notebook using the `pyspark` command.
* Set up PyCharm (IDE) for application development.

## Signing up for ITVersity Labs

Let us understand how to sign up for ITVersity labs to get access to the content as well as the environment to practice.

* Go to https://labs.itversity.com
* Sign up for the website and purchase a plan.
* Create an account in the labs, then use the credentials on the lab page to log in to the Jupyter based environment.
* The advantages of using the labs are:
  * Unified experience - content, cluster.
  * Support using Slack.
  * No headache in setting up an environment for practice.

## Using ITVersity Labs

Let us understand how to submit Spark jobs in ITVersity Labs.

* As we are using Python, we can also use the `help` function to get the documentation, for example `help(spark.read.csv)`.
* You can practice either using the terminal or by using Jupyter Notebook directly.
* You need to choose the appropriate kernel to run code that leverages Spark.
* To access the terminal, launch it from File -> New -> New Terminal.
* You can use the `pyspark2` command to launch Pyspark in the terminal and practice there.
* Here is an example to copy and paste into the terminal.
```
export PYSPARK_PYTHON=python3
pyspark2 --master yarn --conf spark.ui.port=0
```

## Interacting with File Systems

When it comes to on-prem clusters such as our labs, one can use `hdfs` commands to interact with the file system. However, it is different on the Databricks Platform.

Let us understand how to interact with the file system using the `%fs` command from a Databricks Notebook.

* We can access datasets using the `%fs` magic command in a Databricks notebook.
* By default, we will see files under `dbfs`.
* We can list files using the `ls` command, e.g. `%fs ls`.
* Databricks provides a lot of datasets for free under `databricks-datasets`.
* If the cluster is integrated with AWS S3 or Azure Blob Storage, we can access files by specifying the appropriate protocol (e.g. `s3://` for S3).
* List of commands available under `%fs`:
  * Copying files or directories: `cp`
  * Moving files or directories: `mv`
  * Creating directories: `mkdirs`
  * Deleting files and directories: `rm`
* We can copy or delete directories recursively using `-r` or `--recursive`.
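In Python cells of a Databricks notebook, the same operations are available through `dbutils.fs`. Here is a minimal sketch that maps the `%fs` commands above to their `dbutils.fs` calls. The helper names `copy_tree` and `remove_tree` are hypothetical, and `dbutils` is passed in as an argument only so that the helpers can be tried outside a notebook with a stand-in object.

```python
def copy_tree(dbutils, source, target):
    """Copy `source` to `target` recursively and return the entry names under `target`."""
    dbutils.fs.mkdirs(target)                    # same as: %fs mkdirs <target>
    dbutils.fs.cp(source, target, recurse=True)  # same as: %fs cp -r <source> <target>
    # dbutils.fs.ls returns one FileInfo per entry; `name` is the last path component.
    return sorted(info.name for info in dbutils.fs.ls(target))


def remove_tree(dbutils, path):
    """Delete `path` and everything under it."""
    dbutils.fs.rm(path, recurse=True)            # same as: %fs rm -r <path>
```

In a notebook, `dbutils` is already defined, so a call would look like `copy_tree(dbutils, source, target)` with paths of your choice.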

## Getting File Metadata

Let us review the source location to get the number of files and the size of the data we are going to process.

* Location of airlines data: `dbfs:/databricks-datasets/airlines`
* We can get the first 1000 files using `%fs ls dbfs:/databricks-datasets/airlines`.
* The location contains 1919 files; however, we will not be able to see all the details using the `%fs` command.
* Databricks File System commands do not have the capability to report file metadata such as size in detail.
* When the Spark cluster is started, it will create 2 objects - `spark` and `sc`.
* `sc` is of type `SparkContext` and `spark` is of type `SparkSession`.
* Spark uses HDFS APIs to interact with the file system, and we can access the HDFS APIs using `sc._jsc` and `sc._jvm` to get file metadata.
* Here are the steps to get the file metadata.
  * Get the Hadoop Configuration using `sc._jsc.hadoopConfiguration()` - let's say `conf`.
  * We can pass `conf` to `sc._jvm.org.apache.hadoop.fs.FileSystem.get` to get a `FileSystem` object - let's say `fs`.
  * We can build a `path` object by passing the path as a string to `sc._jvm.org.apache.hadoop.fs.Path`.
  * We can invoke `listStatus` on `fs` by passing `path`, which will return an array of `FileStatus` objects - let's say `files`.
  * Each `FileStatus` object has all the metadata of one file.
  * We can use `len` on `files` to get the number of files.
  * We can use `getLen` on each `FileStatus` object to get the size of each file.
  * The cumulative size of all files can be computed using `sum(map(lambda file: file.getLen(), files))`.
  
Let us first get the list of files.

In [None]:
%fs ls dbfs:/databricks-datasets/airlines

* Here are the `hdfs` commands to run from the terminal.

```
hdfs dfs -ls /public/airlines_all
hdfs dfs -ls /public/airlines_all/airlines
hdfs dfs -ls /public/airlines_all/airlines_part
```

In [None]:
import subprocess

# Run `hdfs dfs -ls` and split the raw output (bytes) into one entry per line.
subprocess.check_output(['hdfs', 'dfs', '-ls', '/public/airlines_all']).splitlines()

In [None]:
import subprocess

# Same listing for the airlines sub folder, again one entry per line (bytes).
subprocess.check_output(['hdfs', 'dfs', '-ls', '/public/airlines_all/airlines']).splitlines()
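The raw listing can also be turned into file names and sizes. Here is a minimal sketch, assuming the usual 8-column layout of `hdfs dfs -ls` output (permissions, replication, owner, group, size, date, time, path). The helper name `parse_ls` is hypothetical and the `SAMPLE` listing is made up for illustration, not taken from the labs.

```python
def parse_ls(output):
    """Parse `hdfs dfs -ls` output (bytes or str) into (path, size_in_bytes) pairs for files."""
    text = output.decode("utf-8") if isinstance(output, bytes) else output
    entries = []
    for line in text.splitlines():
        fields = line.split(None, 7)
        # Skip the "Found N items" header, directories, and any incomplete row.
        if len(fields) != 8 or line.startswith("d"):
            continue
        entries.append((fields[7], int(fields[4])))
    return entries


# Made-up sample in the assumed layout, used only to demonstrate the parser.
SAMPLE = (
    b"Found 3 items\n"
    b"-rw-r--r--   2 hdfs supergroup          0 2021-01-28 10:30 /data/_SUCCESS\n"
    b"-rw-r--r--   2 hdfs supergroup       1024 2021-01-28 10:30 /data/part-00000\n"
    b"drwxr-xr-x   - hdfs supergroup          0 2021-01-28 10:30 /data/sub\n"
)

files = parse_ls(SAMPLE)
file_count = len(files)
total_bytes = sum(size for _, size in files)
```

On the labs, the output of `subprocess.check_output` for an `hdfs dfs -ls` command can be passed to `parse_ls` directly, since it accepts bytes.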

Here is the consolidated script to get number of files and cumulative size of all files in a given folder.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    master('yarn'). \
    appName('Computing Airlines Data Size'). \
    getOrCreate()

In [None]:
sc = spark.sparkContext

In [None]:
conf = sc._jsc.hadoopConfiguration()
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(conf)

In [None]:
# Databricks
# path = sc._jvm.org.apache.hadoop.fs.Path("dbfs:/databricks-datasets/airlines")

In [None]:
# ITVersity labs
path = sc._jvm.org.apache.hadoop.fs.Path('/public/airlines_all/airlines')

In [None]:
files = fs.listStatus(path)

In [None]:
round(sum(map(lambda file: file.getLen(), files))/1024/1024/1024, 2)
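The count and the cumulative size can be computed together. Here is a minimal sketch, assuming `files` holds Hadoop `FileStatus` objects as returned by `listStatus`. The helper name `summarize_statuses` is hypothetical. It skips sub directories using `isFile()`, so the count can differ from `len(files)` when the folder contains directories.

```python
def summarize_statuses(files):
    """Return (file_count, total_bytes) for FileStatus-like objects, ignoring directories."""
    regular = [status for status in files if status.isFile()]
    total_bytes = sum(status.getLen() for status in regular)
    return len(regular), total_bytes


def to_gb(num_bytes):
    """Convert bytes to gigabytes (1 GB = 1024 ** 3 bytes), rounded to 2 decimals."""
    return round(num_bytes / 1024 / 1024 / 1024, 2)
```

With the objects created above, usage would look like `count, size = summarize_statuses(fs.listStatus(path))` followed by `to_gb(size)`.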

* We can also get the size of a folder by using Hadoop commands such as `hdfs dfs`.

```
hdfs dfs -du -s -h /public/airlines_all/airlines
```

In [None]:
import subprocess
subprocess.check_output(['hdfs', 'dfs', '-du', '-s', '-h', '/public/airlines_all/airlines'])
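The `-h` flag prints sizes in human readable units instead of raw bytes. Here is a minimal sketch of the same idea in Python for sizes computed programmatically. The helper name `human_size` and its output format are assumptions of this sketch and are not meant to match the exact formatting of `hdfs dfs -du -h`.

```python
def human_size(num_bytes):
    """Format a byte count using 1024-based units, e.g. 1536 -> '1.50 KB'."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units[:-1]:
        # Stop at the first unit in which the value is below 1024.
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    # Anything that reaches this point is reported in the largest unit.
    return f"{size:.2f} {units[-1]}"
```

This can be applied to the cumulative size obtained from `getLen`, instead of dividing by 1024 three times by hand.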

## Exercises

* You can use either the command line or a programmatic approach.
* Get the size of **/public/airlines_all/airlines_part**
* Get the size of **/public/retail_db**