# Setting up your Spark environment
We are going to use the [PySpark API](https://spark.apache.org/docs/2.3.1/quick-start.html) and do machine learning through MLlib.

* pyspark - https://spark.apache.org/docs/latest/api/python/index.html
* MLlib - https://spark.apache.org/docs/latest/ml-guide.html
* S3 - 

## In this Notebook
1. Create a context
2. Work with S3 and parquet
3. Work with MLlib

In [13]:
import pyspark
from os import listdir
from os.path import isfile, join
import boto3
import pandas as pd
from sagemaker import get_execution_role
from pyspark.sql.types import LongType, StringType, StructField, StructType, BooleanType, ArrayType, IntegerType

In [2]:

# Initialize the spark environment (takes ~ 1min)
conf = pyspark.SparkConf().setAppName('odl').setMaster('local')
sc = pyspark.SparkContext(conf=conf)
sqlc = pyspark.sql.SQLContext(sc)


In [3]:
sc

## Connect to S3
There are a few ways to connect to S3; we are going to use boto3.
* boto3 - https://boto3.amazonaws.com/v1/documentation/api/latest/index.html

### Read into Spark dataframe from CSV in S3

In [4]:

role = get_execution_role()
bucket='odl-spark19spds6003-001'
data_key = 'sample_data/data.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

pd.read_csv(data_location)


   alpha  beta  gamma
0      1     2      3
1      1     4      9
2      1     8     27


In [5]:
df = sqlc.createDataFrame(pd.read_csv(data_location))

In [6]:
df

DataFrame[alpha: bigint, beta: bigint, gamma: bigint]

### Write parquet to S3

In [None]:
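# Spark writes a directory of part files to the local disk
parquetPath = '/home/ec2-user/SageMaker/tmp-pqt'
# mode='overwrite' lets the cell be re-run when the directory already exists
df.write.parquet(parquetPath, mode='overwrite')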

In [24]:
# Collect the files Spark wrote into the local parquet directory
files = [f for f in listdir(parquetPath) if isfile(join(parquetPath, f))]

# Upload each file to the bucket under the sample_data/pqt/ prefix
s3 = boto3.resource('s3')
for f in files:
    s3.Bucket(bucket).upload_file(join(parquetPath, f), 'sample_data/pqt/' + f)
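
The loop above uploads every file in the parquet directory. When Spark writes to a local filesystem, that directory normally also holds a `_SUCCESS` marker and hidden `.crc` checksum files next to the data files. Below is a minimal sketch of a more selective upload. `plan_uploads` and `upload_all` are hypothetical helper names, not part of boto3, and `bucket_obj` is assumed to be a boto3 `Bucket` resource such as `s3.Bucket(bucket)`.

```python
import os


def plan_uploads(local_dir, key_prefix, suffix=".parquet"):
    """Return sorted (local_path, s3_key) pairs for the data files in local_dir."""
    pairs = []
    for name in sorted(os.listdir(local_dir)):
        path = os.path.join(local_dir, name)
        # Keep regular files with the wanted suffix; this skips _SUCCESS
        # and checksum files, whose names end in .crc
        if os.path.isfile(path) and name.endswith(suffix):
            pairs.append((path, key_prefix + name))
    return pairs


def upload_all(bucket_obj, pairs):
    """Upload each planned file; bucket_obj is assumed to be a boto3 Bucket."""
    for local_path, key in pairs:
        bucket_obj.upload_file(local_path, key)
    return len(pairs)
```

In this notebook the call would look like `upload_all(s3.Bucket(bucket), plan_uploads(parquetPath, 'sample_data/pqt/'))`. Separating the planning step from the upload makes it easy to print the pairs and check them before anything is sent.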


### Read into Spark dataframe from parquet

In [8]:
df = sqlc.read.parquet(parquetPath)

In [9]:
df

DataFrame[alpha: bigint, beta: bigint, gamma: bigint]

# Now to start our ML pipeline
1. Make a dataframe from the CSV file (s3://odl-spark19spds6003-001/checkouts-by-title-head.csv)
  * NB: this one is 166MB; we will start small and then compare performance with the larger version
2. Make a parquet file
3. Make a dataframe from the parquet file
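
Step 1 names the CSV by its full S3 URI, while boto3 calls take the bucket and the key as separate arguments. A small sketch of splitting the URI follows. `split_s3_uri` is a hypothetical helper built on the standard library, not a boto3 function.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError("not an s3:// URI: {!r}".format(uri))
    # The path keeps its leading slash, which is not part of the S3 key
    return parts.netloc, parts.path.lstrip("/")


bucket_name, key = split_s3_uri(
    "s3://odl-spark19spds6003-001/checkouts-by-title-head.csv"
)
```

The two values can then be passed to calls such as `s3.Bucket(bucket_name).download_file(key, local_path)`, where `local_path` is whatever scratch location you choose.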

In [19]:
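bucket='odl-spark19spds6003-001'
data_key = 'checkouts-by-title-head.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
# Fails: createDataFrame expects data (an RDD, a list, or a pandas DataFrame),
# not a path string, so Spark tries to infer a schema from the str itself
df = sqlc.createDataFrame(data_location)

TypeError: Can not infer schema for type: <class 'str'>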

In [None]:
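# Copy the CSV from S3 with boto3, then let Spark's CSV reader apply the schema.
# Assumption: local_csv is a scratch path next to the parquet directory used above.
local_csv = '/home/ec2-user/SageMaker/' + data_key
boto3.resource('s3').Bucket(bucket).download_file(data_key, local_csv)

schema = StructType([
    StructField("UsageClass", StringType(), True),
    StructField("CheckoutType", StringType(), True),
    StructField("MaterialType", StringType(), True),
    StructField("CheckoutYear", IntegerType(), True),
    StructField("CheckoutMonth", IntegerType(), True),
    StructField("Checkouts", IntegerType(), True),
    StructField("Title", StringType(), True),
    StructField("Creator", StringType(), True),
    StructField("Subjects", StringType(), True),
    StructField("Publisher", StringType(), True),
    StructField("PublicationYear", StringType(), True)])
# header=True skips the first line of the file instead of reading it as data
df = sqlc.read.csv(local_csv, schema=schema, header=True)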
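
Spark generally applies a user-supplied schema to CSV columns by position, so it is worth confirming that the file's header lists the same columns in the same order before loading. The sketch below does that check with the standard library only. `header_mismatches` and `EXPECTED_COLUMNS` are hypothetical names; the column list is copied from the schema in the cell above.

```python
import csv
import io

# Column names in the order used by the schema above
EXPECTED_COLUMNS = [
    "UsageClass", "CheckoutType", "MaterialType", "CheckoutYear",
    "CheckoutMonth", "Checkouts", "Title", "Creator", "Subjects",
    "Publisher", "PublicationYear",
]


def header_mismatches(header_line, expected=EXPECTED_COLUMNS):
    """Return (position, found, expected) for every column that differs."""
    # Parse with the csv module so quoted column names are handled
    found = next(csv.reader(io.StringIO(header_line)), [])
    mismatches = []
    for i in range(max(len(found), len(expected))):
        f = found[i] if i < len(found) else None
        e = expected[i] if i < len(expected) else None
        if f != e:
            mismatches.append((i, f, e))
    return mismatches
```

Passing the first line of the CSV to `header_mismatches` returns an empty list when the header matches the schema, and otherwise points at the positions that need attention.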

In [18]:
from pyspark.sql.types import IntegerType, StructField, StructType

bucket='odl-spark19spds6003-001'
data_key = 'sample_data/data.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
# StructType takes a Python list of StructField objects
schema = StructType([
    StructField("alpha", IntegerType(), True),
    StructField("beta", IntegerType(), True),
    StructField("gamma", IntegerType(), True)])
# Load through pandas as in the first example, then apply the schema
sqlc.createDataFrame(pd.read_csv(data_location), schema)