# Accessing the Ascend Structured Data Lake from Pyspark

This notebook demonstrates how to use Ascend's Structured data lake to access component data with pyspark.

## Environment Setup
This notebook was tested with the following configuration:

### Spark dependency versions:

- Spark version: 2.4.3 w/ scala 2.12 and user-provided hadoop from https://spark.apache.org/downloads.html
- Hadoop version: 3.2.0 from https://hadoop.apache.org/releases.html

### Environment variables:
- `SPARK_HOME`: spark installation directory
- `HADOOP_HOME`: hadoop installation directory
- `PYSPARK_SUBMIT_ARGS="--packages=org.apache.hadoop:hadoop-aws:3.2.0,com.amazonaws:aws-java-sdk-bundle:1.11.546 pyspark-shell"`

### Manual config edits:
- Inside your `$SPARK_HOME/conf/spark-env.sh` file, add the following line to use the correct hadoop version:
`export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)`
## AWS profile configuration:
- Create a developer access key in the Ascend UI

In [1]:
### Create Spark Session
import pyspark
from pyspark.sql import SparkSession
import findspark
import boto3

findspark.init()
         
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test_sdl") \
    .getOrCreate()

sc = spark.sparkContext

## Credentials

To run this notebook, you will need

  * a Service Account with `READ ONLY` permission or developer access keys
  * an Access Key ID and Secret for that Account

You can create a Service Account in the Ascend UI by going to **Data Service > Service Accounts**.
If you are using `trial.ascend.io`, create the Service Account in your own Data Service.
Otherwise, select the Data Service **Getting Started with Ascend** and create a Service Account there.

You can create Developer Access Keys in the Ascend UI by clicking on the dropdown in the top right (with your name), then **Access Keys -> +ACCESS KEY**

Access Keys should not be stored in a notebook. 
Instead, this notebook will look for them in `~/.ascend/credentials` on the machine where your Jupyter server is running.
Your `~/.ascend/credentials` file should look like this (substitute your Ascend Access Key ID and Secret Access Key):
```
[trial]
ascend_access_key_id=Y0URACC355K3Y1D
ascend_secret_access_key=yourSecret!AccessKeyisthelong1
```

Once you have a `credentials` file, you can read it with `configparser` and
create a `Client` to connect to the host using your credentials.

In [2]:
import configparser
import os

profile = "trial"

config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.ascend/credentials"))

access_id = config.get(profile, "ascend_access_key_id")
secret_key = config.get(profile, "ascend_secret_access_key")


## Setup hadoop-aws config for Spark to point at Ascend's SDL

In [3]:
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "https://s3.ascend.io")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_id)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
sc._jsc.hadoopConfiguration().set("fs.s3a.attempts.maximum", "1") # helps us see errors faster

# Read data from a data feed
You can now read data from a data feed by supplying a path with the following structure. You can also copy the path from a component in the Ascend UI by going to the integrations tab and copying the 'S3 Protocol URI'
- path prefix = `s3a://` to locate the correct hadoop filesystem we set up in the config
- bucket = environment prefix, e.g. `trial`
- piece 1: data service id, e.g `Getting_Started_with_Ascend` 
- piece 2: data feed id, e.g. `_DF_Clusters_w__Solar`

In [5]:
# read data from a data feed
# https://trial.ascend.io/ui/v2/organization/Getting_Started_with_Ascend/project/IoT_Device_and_Weather_Analysis/pub/_DF__Clusters_w__Solar
location ='s3a://trial/Getting_Started_with_Ascend/_DF__Clusters_w__Solar'
'''
parts of path: 
s3a -> tell spark which type of path this is
trial -> environment prefix, can be seen in the url (trial.ascend.io)
Getting_Started_with_Ascend -> data service id 
_DF__Clusters_w__Solar -> data feed id
''' 
clusters_with_solar = spark.read.parquet(location)


In [6]:
clusters_with_solar.select(['net_usage_KW', 'temperature', 'prediction']).show(100)

+--------------------+------------------+----------+
|        net_usage_KW|       temperature|prediction|
+--------------------+------------------+----------+
| 0.27584394812583923| 68.74703979492188|         0|
|  0.9253414869308472|  49.2901496887207|         3|
|  0.6987212300300598| 33.54389572143555|         4|
|  0.4225758910179138| 41.04164123535156|         8|
|   6.468533515930176|  85.6972885131836|         7|
|  0.2268587201833725|58.132205963134766|         0|
|  1.8147929906845093| 52.64035415649414|         2|
|  0.9291468262672424|53.118438720703125|         3|
|  1.0964287519454956|23.443756103515625|         1|
|-0.04885953664779663| 59.78987121582031|         6|
|  0.2978518307209015| 58.51332092285156|         1|
|  0.7331710457801819|29.353544235229492|         4|
|0.023483455181121826| 50.60091781616211|         1|
|  0.8424805998802185| 59.18572998046875|         2|
|-0.13565999269485474| 75.26445770263672|        10|
|  1.0251768827438354| 53.66426467895508|     

# Read data from a component
You can also read data from a component by supplying a path with the following structure: 
- path prefix = `s3a://` to locate the correct hadoop filesystem we set up in the config
- bucket = environment prefix, e.g. `trial`
- piece 1: data service id, e.g `Getting_Started_with_Ascend` 
- piece 2: dataflow id, e.g. `IoT_Device_and_Weather_Analysis`
- piece 3: component id, e.g. `K_Means_Cluster`

You can also copy the path from a component in the Ascend UI by going to the integrations tab and copying the 'S3 Protocol URI'

In [7]:
# read data from a component
# https://trial.ascend.io/ui/v2/organization/Getting_Started_with_Ascend/project/IoT_Device_and_Weather_Analysis/view/K_Means_Cluster

location = 's3a://trial/Getting_Started_with_Ascend/IoT_Device_and_Weather_Analysis/K_Means_Cluster'
'''
parts of path: 
s3a -> tell spark which type of path this is
trial -> environment prefix, can be seen in the url (trial.ascend.io)
Getting_Started_with_Ascend -> data service id 
IoT_Device_and_Weather_Analysis -> dataflow id
K_Means_Cluster -> component id
''' 
clusters = spark.read.parquet(location)

In [8]:
clusters_with_solar.select(['net_usage_KW', 'temperature', 'prediction']).show(100)

+--------------------+------------------+----------+
|        net_usage_KW|       temperature|prediction|
+--------------------+------------------+----------+
| 0.27584394812583923| 68.74703979492188|         0|
|  0.9253414869308472|  49.2901496887207|         3|
|  0.6987212300300598| 33.54389572143555|         4|
|  0.4225758910179138| 41.04164123535156|         8|
|   6.468533515930176|  85.6972885131836|         7|
|  0.2268587201833725|58.132205963134766|         0|
|  1.8147929906845093| 52.64035415649414|         2|
|  0.9291468262672424|53.118438720703125|         3|
|  1.0964287519454956|23.443756103515625|         1|
|-0.04885953664779663| 59.78987121582031|         6|
|  0.2978518307209015| 58.51332092285156|         1|
|  0.7331710457801819|29.353544235229492|         4|
|0.023483455181121826| 50.60091781616211|         1|
|  0.8424805998802185| 59.18572998046875|         2|
|-0.13565999269485474| 75.26445770263672|        10|
|  1.0251768827438354| 53.66426467895508|     