# Shared Data Context

The intent of this notebook is to provide examples of how data scientists can use object storage, and more specifically Ceph object storage, in much the same way they are accustomed to interacting with Amazon Simple Storage Service (S3). This is possible because Ceph's object storage gateway exposes an API that closely follows the Amazon S3 API.

# Working with Boto

Boto is an integrated interface to current and future infrastructure services offered by Amazon Web Services. Among the services it provides interfaces for is Amazon S3. For lightweight analysis of data using Python tools like NumPy or pandas, it is handy to interact with data stored in object storage using pure Python. This is where Boto shines. The base-notebook from [radanalyticsio](https://radanalytics.io) doesn't include Boto, but you can install it from the comfort of a notebook with a conda install command run from a notebook cell. If you find yourself using Boto frequently, it might be worth modifying [base-notebook](https://github.com/radanalyticsio/base-notebook) and building a custom notebook image that includes Boto.

In [49]:
import sys

In [50]:
import os
import boto3


s3_endpoint_url = os.environ['S3_ENDPOINT_URL']
s3_access_key = os.environ['AWS_ACCESS_KEY_ID']
s3_secret_key = os.environ['AWS_SECRET_ACCESS_KEY']
s3_bucket_name = os.environ['JUPYTERHUB_USER']

# Ceph does not use the region, but the client still expects one to be set
s3 = boto3.client('s3',
                  region_name='us-east-1',
                  endpoint_url=s3_endpoint_url,
                  aws_access_key_id=s3_access_key,
                  aws_secret_access_key=s3_secret_key)
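
Reading each setting with os.environ[...] raises a bare KeyError on the first missing variable. As a minimal sketch, assuming the same four environment variables as the cell above, the check can report every missing variable at once. The helper name s3_client_kwargs is hypothetical and not part of Boto.

```python
import os

REQUIRED_VARS = (
    "S3_ENDPOINT_URL",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "JUPYTERHUB_USER",
)


def s3_client_kwargs(env=os.environ):
    """Return (client_kwargs, bucket_name) built from environment variables.

    Hypothetical helper: raises a single error naming every missing variable.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    kwargs = {
        "region_name": "us-east-1",
        "endpoint_url": env["S3_ENDPOINT_URL"],
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
    return kwargs, env["JUPYTERHUB_USER"]


# Usage (needs boto3 installed and the variables set):
#   import boto3
#   kwargs, s3_bucket_name = s3_client_kwargs()
#   s3 = boto3.client("s3", **kwargs)
```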


Creating a bucket, uploading an object (put), and listing the bucket.

In [51]:
s3.create_bucket(Bucket=s3_bucket_name)
s3.put_object(Bucket=s3_bucket_name,Key='object',Body='data')
for key in s3.list_objects(Bucket=s3_bucket_name)['Contents']:
    print(key['Key'])

kube-metrics/_SUCCESS
kube-metrics/part-00000-60b830bd-ab59-4bb3-b59a-3102450831f5-c000.json.bz2
kube-metrics/part-00001-60b830bd-ab59-4bb3-b59a-3102450831f5-c000.json.bz2
kube-metrics/part-00002-60b830bd-ab59-4bb3-b59a-3102450831f5-c000.json.bz2
kube-metrics/part-00003-60b830bd-ab59-4bb3-b59a-3102450831f5-c000.json.bz2
kube-metrics/part-00004-60b830bd-ab59-4bb3-b59a-3102450831f5-c000.json.bz2
kube-metrics/part-00005-60b830bd-ab59-4bb3-b59a-3102450831f5-c000.json.bz2
kube-metrics/part-00006-60b830bd-ab59-4bb3-b59a-3102450831f5-c000.json.bz2
kube-metrics/part-00007-60b830bd-ab59-4bb3-b59a-3102450831f5-c000.json.bz2
kube-metrics/part-00008-60b830bd-ab59-4bb3-b59a-3102450831f5-c000.json.bz2
kube-metrics/part-00009-60b830bd-ab59-4bb3-b59a-3102450831f5-c000.json.bz2
kube-metrics/part-00010-60b830bd-ab59-4bb3-b59a-3102450831f5-c000.json.bz2
kube-metrics/part-00011-60b830bd-ab59-4bb3-b59a-3102450831f5-c000.json.bz2
kube-metrics/part-00012-60b830bd-ab59-4bb3-b59a-3102450831f5-c000.json.bz2
kub
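
The listing cell above indexes ['Contents'] directly, which raises a KeyError on an empty bucket, and a single list_objects call returns at most 1000 keys. The following is a sketch of a more defensive listing plus a read-back (get) of an object. The helper names iter_keys and read_object_bytes are hypothetical, and the sketch assumes the Ceph gateway in use supports the list_objects_v2 call.

```python
def iter_keys(client, bucket, prefix=""):
    """Yield every key under a prefix, following pagination."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page for an empty listing has no "Contents" entry at all.
        for entry in page.get("Contents", []):
            yield entry["Key"]


def read_object_bytes(client, bucket, key):
    """Return the full body of one object as bytes."""
    response = client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


# Usage with the client created earlier in the notebook:
#   for key in iter_keys(s3, s3_bucket_name, prefix="kube-metrics/"):
#       print(key)
#   payload = read_object_bytes(s3, s3_bucket_name, "object")
```

The returned bytes can be wrapped in io.BytesIO and handed to pandas or NumPy readers for lightweight analysis.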

# Working with Spark

When running an application, you can either establish a Spark session locally in the notebook pod or point it to a remote Spark cluster. Oshinko is a collection of components from the radanalyticsio community that aid in the provisioning and scaling of Spark clusters for intelligent applications.

In [52]:
import os
import pyspark

from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext

##################################################
# This is a HACK until we can resolve the issue
# preventing the spark nodes from resolving pod
# names
spark_driver_host = "10.131.4.15"
##################################################

# Submit arguments must be set before the first SparkSession is created
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    f"--conf spark.driver.host={spark_driver_host} "
    "--conf spark.cores.max=6 "
    "--conf spark.executor.instances=2 "
    "--conf spark.executor.memory=3G "
    "--conf spark.executor.cores=3 "
    "--conf spark.driver.memory=4G "
    "--packages com.amazonaws:aws-java-sdk:1.8.0,org.apache.hadoop:hadoop-aws:2.8.5 "
    "pyspark-shell"
)
spark_cluster_url = f"spark://{os.environ['SPARK_CLUSTER']}:7077"
spark = SparkSession.builder.master(spark_cluster_url).getOrCreate()
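
The cell above always targets a remote cluster. As a sketch of the local-or-remote choice described earlier, the master URL can fall back to a local session when no cluster is named. The helper spark_master_url is hypothetical, and treating an unset SPARK_CLUSTER variable as "run locally" is an assumption of this sketch rather than a convention of the notebook image.

```python
import os


def spark_master_url(env=os.environ, port=7077):
    """Pick a Spark master: a named standalone cluster if set, else local."""
    cluster = env.get("SPARK_CLUSTER")
    if cluster:
        return f"spark://{cluster}:{port}"
    # local[*] runs Spark inside the notebook pod using all available cores.
    return "local[*]"


# Usage:
#   spark = SparkSession.builder.master(spark_master_url()).getOrCreate()
```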

In [53]:
hadoopConf=spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.endpoint", s3_endpoint_url)
hadoopConf.set("fs.s3a.access.key", s3_access_key)
hadoopConf.set("fs.s3a.secret.key", s3_secret_key)
hadoopConf.set("fs.s3a.path.style.access", "true")
hadoopConf.set("fs.s3a.connection.ssl.enabled", "false")
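
The cell above reaches the Hadoop configuration through _jsc, which is a private attribute. Spark also copies any property prefixed with spark.hadoop. into the Hadoop configuration when the context is created, so the same settings can be expressed as ordinary Spark properties. The sketch below builds that mapping. The helper s3a_spark_conf is hypothetical, and deriving the SSL flag from the endpoint URL scheme (instead of the hard-coded "false" above) is an assumption of this sketch.

```python
def s3a_spark_conf(endpoint_url, access_key, secret_key):
    """Build spark.hadoop.* properties mirroring the S3A settings above."""
    use_ssl = endpoint_url.lower().startswith("https://")
    s3a = {
        "fs.s3a.endpoint": endpoint_url,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.path.style.access": "true",
        # Assumption: follow the URL scheme rather than always disabling SSL.
        "fs.s3a.connection.ssl.enabled": "true" if use_ssl else "false",
    }
    return {f"spark.hadoop.{key}": value for key, value in s3a.items()}


# Usage, before the session exists:
#   builder = SparkSession.builder.master(spark_cluster_url)
#   for key, value in s3a_spark_conf(s3_endpoint_url, s3_access_key, s3_secret_key).items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```

These properties only take effect when the session is first created, so this approach does not replace the hadoopConf.set calls for a session that is already running.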

In [54]:
import socket
spark.range(100, numPartitions=100).rdd.map(lambda x: socket.gethostname()).distinct().collect()

['jupyterhub-nb-user99']

In [55]:
df0 = spark.read.text(f"s3a://{s3_bucket_name}/object")

In [56]:
df0

DataFrame[value: string]

# Working with a Hybrid Data Context

As of Hadoop 2.8, S3A supports per-bucket configuration. This is very powerful: each bucket can have its own S3A configuration, with a different endpoint and a different set of credentials. With this I can use a single Spark context to read a Parquet file from a bucket in the public cloud (Amazon S3) into a data frame, then turn around and write that data frame as a Parquet file into a bucket that exists in the Ceph Nano service running in Minishift.

In [57]:
hadoopConf=spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.bucket.bd-dist.endpoint", "s3.amazonaws.com")
hadoopConf.set("fs.s3a.bucket.bd-dist.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
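
Per-bucket properties follow the pattern fs.s3a.bucket.BUCKETNAME.option, where option is the part of the regular property name after fs.s3a. (for example, endpoint or aws.credentials.provider). A small sketch that generates those names is shown here. The helper per_bucket_options is hypothetical and only builds the property names; it does not talk to Spark or Hadoop.

```python
def per_bucket_options(bucket, options):
    """Map S3A option suffixes to their per-bucket property names.

    {"endpoint": "s3.amazonaws.com"} for bucket "bd-dist" becomes
    {"fs.s3a.bucket.bd-dist.endpoint": "s3.amazonaws.com"}.
    """
    return {f"fs.s3a.bucket.{bucket}.{name}": value for name, value in options.items()}


# Usage with the hadoopConf object from the cell above:
#   public_bucket = per_bucket_options("bd-dist", {
#       "endpoint": "s3.amazonaws.com",
#       "aws.credentials.provider":
#           "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
#   })
#   for key, value in public_bucket.items():
#       hadoopConf.set(key, value)
```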

__Public to Private ETL__

Simply read tab-separated data from a bucket in Amazon S3 and write it back out to a bucket in our Ceph Nano service.

In [63]:
spark.read.csv("s3a://bd-dist/trip_report.tsv",sep="\t").write.csv(f"s3a://{s3_bucket_name}/trip_report.tsv",sep="\t",mode="overwrite")

E0804 15:55:04.926597 139708490286912 java_gateway.py:1078] An error occurred while trying to connect to the Java server (127.0.0.1:46581)
Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 929, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1067, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 111] Connection refused
E0804 15:55:04.938004 139708490286912 java_gateway.py:1078] An error occurred while trying to connect to the Java server (127.0.0.1:46581)
Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java

Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:46581)

Extract all JSON files from a bucket prefix (pseudo directory) in Amazon S3 and write them back out to a bucket in our Ceph Nano service with the same bucket prefix.

In [59]:
spark.read.option("multiline", True).option("mode", "PERMISSIVE").json("s3a://bd-dist/kube-metrics").repartition(76).write.option("compression", "bzip2").mode("overwrite").json(f"s3a://{s3_bucket_name}/kube-metrics")

----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 58706)
E0804 15:54:31.097792 139708490286912 java_gateway.py:1078] An error occurred while trying to connect to the Java server (127.0.0.1:46581)
Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/opt/app-root/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
    "Error while r

E0804 15:54:31.111729 139708490286912 java_gateway.py:1078] An error occurred while trying to connect to the Java server (127.0.0.1:46581)
Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/opt/app-root/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving

During handling of the

Py4JError: An error occurred while calling o210.json

# Working with SparkSQL

Load the Prometheus data set from Ceph Nano into a data frame.

In [None]:
jsonFile = spark.read.json(f"s3a://{s3_bucket_name}/kube-metrics")

__Import statistics libraries__

In [None]:
import pandas as pd
import json
import numpy as np
import seaborn as sns
import sys
import matplotlib.pyplot as plt
%matplotlib inline

from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

__Display schema of files__

In [None]:
print('Display schema:')
jsonFile.printSchema()
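
Once the data frame is loaded, it can be queried with SQL by registering it as a temporary view. The following is a minimal sketch that only counts rows, because the notebook does not show the schema and no column names can be assumed. The helper count_rows and the view name kube_metrics are hypothetical.

```python
def count_rows(spark, df, view_name="kube_metrics"):
    """Register df as a temporary view and count its rows with SQL."""
    df.createOrReplaceTempView(view_name)
    rows = spark.sql(f"SELECT COUNT(*) AS n FROM {view_name}").collect()
    return rows[0]["n"]


# Usage with the objects defined above:
#   count_rows(spark, jsonFile)
```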

# Save the model, tokenizer and feature dimension and store them in Ceph

In [47]:
model.save("./model")

import pickle

with open('./tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

feature_dimension = X_highlights_train.shape[1]
with open('./feature_dimension.pickle', 'wb') as handle:
    pickle.dump(feature_dimension, handle, protocol=pickle.HIGHEST_PROTOCOL)
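
Before uploading, it can be worth confirming that a pickled artifact loads back to the same value. A sketch of that round trip follows; the helper names save_pickle and load_pickle are hypothetical wrappers around the same pickle calls used above.

```python
import pickle


def save_pickle(obj, path):
    """Write obj to path using the highest available pickle protocol."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Read back an object written by save_pickle."""
    # Only unpickle files you created or trust: unpickling can run arbitrary code.
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Usage:
#   save_pickle(feature_dimension, "./feature_dimension.pickle")
#   assert load_pickle("./feature_dimension.pickle") == feature_dimension
```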

In [48]:
import boto3

# Create a session with the Ceph credentials used earlier in the notebook
session = boto3.Session(
    aws_access_key_id=s3_access_key,
    aws_secret_access_key=s3_secret_key
)

# verify=False disables TLS certificate verification for the endpoint
s3 = session.resource('s3', endpoint_url=s3_endpoint_url, verify=False)

# Upload the model to S3
s3.meta.client.upload_file('./model', s3_bucket_name, 'models/trip_report_model')

# Upload the tokenizer to S3
s3.meta.client.upload_file('./tokenizer.pickle', s3_bucket_name, 'models/trip_report_tokenizer.pickle')

# Upload the feature dimension to S3
s3.meta.client.upload_file('./feature_dimension.pickle', s3_bucket_name, 'models/trip_report_feature_dimension.pickle')
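
Note that upload_file expects a single file. Whether model.save("./model") writes one file or a directory depends on the modelling library and version, which this notebook does not show. The sketch below handles both cases under that assumption. The helper upload_path is hypothetical; it relies only on the client's upload_file(Filename, Bucket, Key) call.

```python
import os


def upload_path(client, local_path, bucket, key_prefix):
    """Upload a file, or every file under a directory, and return the keys written."""
    if os.path.isfile(local_path):
        client.upload_file(local_path, bucket, key_prefix)
        return [key_prefix]
    if not os.path.isdir(local_path):
        raise FileNotFoundError(local_path)
    keys = []
    for root, _dirs, files in os.walk(local_path):
        for name in sorted(files):
            full_path = os.path.join(root, name)
            relative = os.path.relpath(full_path, local_path)
            # Object keys always use "/" regardless of the local path separator.
            key = key_prefix.rstrip("/") + "/" + relative.replace(os.sep, "/")
            client.upload_file(full_path, bucket, key)
            keys.append(key)
    return keys


# Usage with the resource created above:
#   upload_path(s3.meta.client, "./model", s3_bucket_name, "models/trip_report_model")
```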

The model has been saved to s3 as binary objects and can be viewed