# Initialize SparkContext

The `master_url` variable contains the URL of the Spark master node.

In [1]:
from pyspark import SparkContext

sc = SparkContext(master=master_url)

# Local files

## Download and Upload

Download and upload files via the Spark Notebook interface.

## Access Local Files

Paths to local files require the `file://` prefix.

In [2]:
ls /root/spark/conf/

core-site.xml                slaves.template
docker.properties.template   spark-defaults.conf
fairscheduler.xml.template   spark-defaults.conf.template
log4j.properties.template    spark-env.sh*
metrics.properties.template  spark-env.sh.template*
slaves

In [3]:
local_files = sc.textFile("file:///root/spark/conf/slaves")
local_files.collect()

[u'ec2-54-88-27-130.compute-1.amazonaws.com']
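The `file://` prefix tells Spark to read from the node's local disk instead of the cluster's default file system. As a minimal sketch, the prefix can be added to an absolute path with the standard library alone. The helper name `to_file_uri` is hypothetical and not part of Spark or this notebook, and it assumes POSIX-style paths without characters that need percent-encoding.

```python
import posixpath


def to_file_uri(path):
    """Return a file:// URI for an absolute POSIX path (hypothetical helper)."""
    if not posixpath.isabs(path):
        raise ValueError("expected an absolute path, got %r" % path)
    # normpath collapses segments such as "conf/../conf" before prefixing
    return "file://" + posixpath.normpath(path)


slaves_uri = to_file_uri("/root/spark/conf/slaves")
# slaves_uri == "file:///root/spark/conf/slaves"
```

The resulting string can be passed to `sc.textFile` in the same way as the literal path in the cell above.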

# S3 files

The `s3helper` object is provided to help you access files on S3.

Run `help(s3helper)` to learn all its methods.

In [None]:
help(s3helper)

## (1) Set AWS Credentials

To access an S3 bucket, first set your AWS credentials. There are two ways to do this.

1. (**RECOMMENDED**) Set the S3 credentials via the Spark Notebook interface.
2. Set them using the `set_credential` method.

In [None]:
s3helper.set_credential(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

## (2) Open the bucket that has your files.

In [None]:
s3helper.open_bucket('your-bucket-name')

## (3) List files in the bucket.

In [None]:
print(s3helper.ls())  # By default, lists all files in the root directory of the bucket
print(s3helper.ls('sub-directory'))  # Lists the files under a sub-directory

## (4) Read files from S3

There are two ways to read files from S3.

**(1) Get the list of S3 file paths and pass it to Spark. Spark supports reading files directly from S3.**

In [None]:
file_paths = s3helper.get_path('/sub-directory')
rdd = sc.textFile(','.join(file_paths))
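`sc.textFile` accepts several inputs as one comma-separated string, which is why the paths are joined with `','` above. The sketch below shows how such a list might be filtered before joining. The paths are hypothetical stand-ins: the exact strings returned by `s3helper.get_path`, including the `s3n://` scheme used here, are an assumption.

```python
# Hypothetical paths standing in for the result of s3helper.get_path(...)
file_paths = [
    "s3n://your-bucket-name/sub-directory/part-00000.txt",
    "s3n://your-bucket-name/sub-directory/part-00001.txt",
    "s3n://your-bucket-name/sub-directory/README.md",
]

# Keep only the text files, then build the single comma-separated argument
text_paths = [p for p in file_paths if p.endswith(".txt")]
path_arg = ",".join(text_paths)
```

`path_arg` would then be passed as `sc.textFile(path_arg)`.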

**(2) Load the S3 files into HDFS on this cluster and read them from HDFS.**

In [None]:
files = s3helper.load_path('/sub-directory', '/hdfs-directory')
rdd = sc.textFile(','.join(files))

# Parquet Files

To get reasonable read speeds, always load Parquet files from S3 into HDFS before accessing them.

In [None]:
s3helper.open_bucket("your-bucket-name")

files = s3helper.load_path('/sub-directory-for-parquets', '/hdfs-directory.parquet')
files[:10]

In [None]:
from pyspark.sql import SQLContext

# Reuse the SparkContext `sc` created at the top of the notebook
sqlContext = SQLContext(sc)
# Query the Parquet directory in HDFS directly by its path
df = sqlContext.sql("SELECT key, value FROM parquet.`/hdfs-directory.parquet`")