# Ingest Data with EMR

This notebook demonstrates how to read data from S3 on an EMR cluster.
We use the data that was loaded into S3 in the previous notebook [011_Ingest_tabular_data.ipynb](011_Ingest_tabular_data_v1.ipynb).

Amazon EMR is a cloud big data platform for processing large amounts of data with open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. According to AWS, EMR lets you run petabyte-scale analysis at less than half the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark.

## Set up Notebook
First, make sure the EMR cluster is running and the connection between EMR and the SageMaker notebook is configured correctly. You can follow the [documentation](https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/) and [procedure](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-lifecycle-config-emr.html) to set up this notebook. Once setup is complete, restart the kernel and run the following command to check that the EMR and SageMaker connection works.

In [None]:
%%info

In [None]:
import sagemaker
import pandas as pd

sagemaker_session = sagemaker.Session()
s3 = sagemaker_session.boto_session.resource('s3')
bucket = sagemaker_session.default_bucket()  # replace with your own bucket name if you have one
prefix = 'data'
filename = 'sample_tabular_data.csv'

### Download data from online resources and write data to S3

In [5]:
# Helper functions to upload data to S3
def write_to_s3(filename, bucket, prefix):
    # Optionally put each file in its own folder (uncomment the next line and add
    # filename_key to the key). This helps if you read and prepare data with Athena.
    #filename_key = filename.split('.')[0]
    key = "{}/{}".format(prefix, filename)
    return s3.Bucket(bucket).upload_file(filename, key)

def upload_to_s3(bucket, prefix, filename):
    url = 's3://{}/{}/{}'.format(bucket, prefix, filename)
    print('Writing to {}'.format(url))
    write_to_s3(filename, bucket, prefix)

In [6]:
# Download a synthetic tabular dataset
!wget -q https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/master/synthetic_data/sample_tabular_data.csv

In [7]:
# upload the sample data to S3 
upload_to_s3(bucket, prefix, filename)

Writing to s3://sagemaker-us-west-2-688520471316/data/sample_tabular_data.csv


## Copy the S3 bucket file path
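The S3 file path is required to read the data with Spark on EMR. The next cell builds it from `bucket`, `prefix`, and `filename`; it should match the path printed above. If your data lives elsewhere, replace it with your own path string.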

In [11]:
# Replace this path string with your own path if it differs from the one printed in the last step
data_s3_path = f's3://{bucket}/{prefix}/{filename}'
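Before handing the path to Spark, it can help to confirm that the string is a well-formed S3 object URI. The following is a minimal sketch using only the standard library; `split_s3_uri` and the example bucket name are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError if malformed."""
    parsed = urlparse(uri)
    # A usable object URI needs the s3 scheme, a bucket name, and a non-empty key.
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"Not a valid S3 object URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket name, used only for illustration.
example_bucket, example_key = split_s3_uri(
    "s3://my-example-bucket/data/sample_tabular_data.csv"
)
print(example_bucket, example_key)
```

In the notebook you could call `split_s3_uri(data_s3_path)` to catch a missing prefix or a mistyped scheme before the Spark read fails with a less obvious error.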


In [12]:
spark

NameError: name 'spark' is not defined

In [13]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.0.1.tar.gz (204.2 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612243 sha256=6b0334681afc4f3a2d69cdebd940e4e5df39c3ddc3fe8a941930ea85733b7d87
  Stored in directory: /local/home/hongshal/.cache/pip/wheels/ea/21/84/970b03913d0d6a96ef51c34c878add0de9e4ecbb7c764ea21f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.1


## Read the data in the EMR Spark cluster

Once we have the S3 path to our data, we can read it with Spark using the following commands. You specify a data format and, optionally, a schema (not required, but recommended). Reader options such as `compression`, `delimiter`, and `header` control how the file is parsed. On EMR, Spark can also use S3 Select to push filtering down to S3; for more details, see the [documentation on using S3 Select with Spark](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html).

In [17]:
# define a spark session
from pyspark.sql import SparkSession
import sagemaker_pyspark


classpath = ":".join(sagemaker_pyspark.classpath_jars())

# See the SageMaker Spark GitHub repository to learn how to connect to EMR from a notebook instance
spark = (
    SparkSession.builder.config("spark.driver.extraClassPath", classpath)
    .master("local[*]")
    .getOrCreate()
)

Exception: Java gateway process exited before sending its port number
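The `Java gateway process exited before sending its port number` error above commonly means that PySpark could not start a JVM, for example because no Java runtime is installed or `JAVA_HOME` points to the wrong location. As a rough diagnostic, and assuming that is the cause here, the sketch below looks for a `java` executable; `find_java` is a hypothetical helper, not part of PySpark.

```python
import os
import shutil


def find_java():
    """Return the path to a java executable, or None if none is found."""
    # Prefer JAVA_HOME when it is set and contains an executable bin/java.
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Otherwise fall back to searching the PATH.
    return shutil.which("java")


java_path = find_java()
print(java_path or "No Java runtime found; install a JDK or set JAVA_HOME")
```

If no executable is found, installing a JDK supported by your Spark version and restarting the kernel is a reasonable first step before retrying the session setup.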

In [18]:
# EMR cell: declare the schema as a DDL string, then read the CSV file from S3
schema = (
    "A double, B double, C double, D double, E double, F double, G double, "
    "H double, I double, J double, K double, L double, M double, target double"
)
df = spark.read.format("csv").schema(schema).options(header="true").load(data_s3_path)

AttributeError: module 'pyspark' has no attribute 'read'
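The schema in the cell above is a DDL string that lists thirteen feature columns and a target column by hand. As a sketch, a small helper can generate the same kind of string, which is less error-prone when the column count changes. `build_ddl_schema` is a hypothetical helper and assumes single-letter feature names as in this dataset.

```python
import string


def build_ddl_schema(feature_count, target="target", dtype="double"):
    """Build a DDL schema string such as 'A double, B double, target double'."""
    if not 1 <= feature_count <= len(string.ascii_uppercase):
        raise ValueError("feature_count must be between 1 and 26")
    # Feature columns are named A, B, C, ... followed by the target column.
    columns = list(string.ascii_uppercase[:feature_count]) + [target]
    return ", ".join(f"{name} {dtype}" for name in columns)


generated_schema = build_ddl_schema(13)
print(generated_schema)
```

The resulting string can be passed to `.schema(...)` in place of the handwritten one.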

In [None]:
df.show(5)

## Conclusion
Now that you have read in the data, you can pre-process it with Spark on an EMR cluster, build an ML pipeline, and train models at scale.

### Citation
Boston Housing data: Harrison, D. and Rubinfeld, D.L. "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol. 5, 81-102, 1978.