# Apache Spark is a must for big data processing
PySpark lets you write Spark applications using Python APIs and also provides the PySpark shell for interactively analyzing your data in a distributed environment. Spark code written in Python can run slower than the equivalent Scala, particularly when it relies on Python UDFs or RDD lambdas that move data between the JVM and Python worker processes. Even so, the convenience of notebooks and the popularity of Python make PySpark a nice addition to the data science toolbox. [Find the docs here](https://spark.apache.org/docs/latest/api/python/index.html)

## Simple Spark test

In [None]:
from pyspark.sql import SparkSession
import os
import socket

# create a spark session
spark_cluster_url = f"spark://{os.environ['SPARK_CLUSTER']}:7077"
spark = SparkSession.builder.master(spark_cluster_url).getOrCreate()

# test your Spark connection: list the distinct hostnames of the workers that ran the tasks
spark.range(5, numPartitions=5).rdd.map(lambda x: socket.gethostname()).distinct().collect()
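
The cell above raises a `KeyError` when `SPARK_CLUSTER` is not set. As a minimal sketch, assuming that falling back to Spark's local mode is acceptable for a smoke test, the master URL could be resolved by a small helper. The name `resolve_master_url` and the `local[*]` fallback are illustrative choices, not part of the original notebook.

```python
import os

DEFAULT_SPARK_PORT = 7077  # standalone master port used in the cell above


def resolve_master_url(env=os.environ):
    """Return a Spark master URL built from SPARK_CLUSTER, or local mode if unset."""
    host = env.get("SPARK_CLUSTER", "").strip()
    if not host:
        # Assumption: running in local mode is fine when no cluster is configured.
        return "local[*]"
    return f"spark://{host}:{DEFAULT_SPARK_PORT}"


# Plain dicts stand in for os.environ here so the example runs anywhere.
print(resolve_master_url({"SPARK_CLUSTER": "spark-master"}))
print(resolve_master_url({}))
```

The returned string can be passed to `SparkSession.builder.master(...)` in place of `spark_cluster_url`.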

## This next part assumes you have Ceph or S3 set up: fill in the bucket info for your storage

In [None]:
# Edit this section using your own credentials
s3_region = 'region-1' # fill in for AWS, blank for Ceph
s3_endpoint_url = 'https://s3.storage.server'
s3_access_key_id = 'AccessKeyId-ChangeMe'
s3_secret_access_key = 'SecretAccessKey-ChangeMe'
s3_bucket = 'MyBucket'

# for easy download
!pip install wget

import wget
import boto3

# configure boto S3 connection
s3 = boto3.client('s3',
                  region_name=s3_region or None,  # None when left blank for Ceph
                  endpoint_url=s3_endpoint_url,
                  aws_access_key_id=s3_access_key_id,
                  aws_secret_access_key=s3_secret_access_key)

# download the sample data file
url = "https://raw.githubusercontent.com/dudash/jupyter-gpu-examples/main/sample_data.csv"
file = wget.download(url=url, out='sample_data.csv')

# upload the file to storage
s3.upload_file(file, s3_bucket, "sample_data.csv")
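
Hard-coded credentials are easy to leak when a notebook is shared. One possible alternative, sketched here under the assumption that the settings are exported as environment variables, is to collect them in a helper. The variable names (`S3_ENDPOINT_URL`, `S3_ACCESS_KEY_ID`, `S3_SECRET_ACCESS_KEY`, `S3_REGION`) and the function `load_s3_settings` are hypothetical; only the returned keys match keyword arguments that `boto3.client` accepts.

```python
import os


def load_s3_settings(env=os.environ):
    """Collect S3/Ceph settings from environment variables (names are hypothetical)."""
    required = {
        "endpoint_url": "S3_ENDPOINT_URL",
        "aws_access_key_id": "S3_ACCESS_KEY_ID",
        "aws_secret_access_key": "S3_SECRET_ACCESS_KEY",
    }
    missing = [name for name in required.values() if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")

    settings = {key: env[name] for key, name in required.items()}
    # The region is optional: the notebook leaves it blank for Ceph.
    region = env.get("S3_REGION", "").strip()
    if region:
        settings["region_name"] = region
    return settings
```

With the variables set, the client could then be created with `boto3.client('s3', **load_s3_settings())`.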

## Now let's test Spark with S3 and do a basic pandas data analysis call

In [None]:
# configure the Hadoop S3A connector (_jsc is a private PySpark attribute)
hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.endpoint", s3_endpoint_url)
hadoopConf.set("fs.s3a.access.key", s3_access_key_id)
hadoopConf.set("fs.s3a.secret.key", s3_secret_access_key)
hadoopConf.set("fs.s3a.path.style.access", "true")
hadoopConf.set("fs.s3a.connection.ssl.enabled", "true") # false if not https

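# read the CSV from object storage into a Spark DataFrame
data = spark.read.csv(f's3a://{s3_bucket}/sample_data.csv', sep=",", header=True)
# toPandas() collects every row to the driver, so use it only on small data
df = data.toPandas()
df.head()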
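
The S3A settings above include an SSL flag that has to be flipped by hand for non-HTTPS endpoints. As a rough sketch, the same `fs.s3a.*` keys could be built as a dictionary with the flag derived from the endpoint's URL scheme. The helpers `s3a_options` and `s3a_path` are illustrative names, not part of PySpark or Hadoop.

```python
from urllib.parse import urlparse


def s3a_options(endpoint_url, access_key, secret_key):
    """Build the fs.s3a.* settings used above as a plain dict of strings."""
    # Enable SSL only when the endpoint URL uses the https scheme.
    use_ssl = urlparse(endpoint_url).scheme == "https"
    return {
        "fs.s3a.endpoint": endpoint_url,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": "true" if use_ssl else "false",
    }


def s3a_path(bucket, key):
    """Join a bucket and object key into an s3a:// URI."""
    return f"s3a://{bucket}/{key.lstrip('/')}"
```

In the notebook, the options could be applied with a loop such as `for name, value in s3a_options(s3_endpoint_url, s3_access_key_id, s3_secret_access_key).items(): hadoopConf.set(name, value)`, and `s3a_path(s3_bucket, 'sample_data.csv')` would give the path passed to `spark.read.csv`.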