# Ingest Data with EMR

This notebook demonstrates how to read data from Amazon S3 with Spark on an Amazon EMR cluster.

Amazon EMR is a cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. According to AWS, EMR can run petabyte-scale analysis at less than half the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark.

## Set up Notebook
First, make sure the EMR cluster is running and the connection between EMR and the SageMaker notebook is configured correctly. You can follow the [documentation](https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/), the [procedure](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-lifecycle-config-emr.html), or the **[video](links to the video)** to set up this notebook. When setup is complete, restart the kernel and run the following command to verify the connection between EMR and SageMaker.

In [1]:
%%info

In [2]:
%%local
import io
import boto3
import sagemaker
import json
from sagemaker import get_execution_role
import os
import sys

# Get region 
session = boto3.session.Session()
region_name = session.region_name

# Get SageMaker session, default S3 bucket, and execution role
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()  # used to build data_s3_path below
role = sagemaker.get_execution_role()

In [10]:
%%local
prefix = 'data/tabular/boston_house'
filename = 'boston_house.csv'
data_s3_path = 's3://{}/{}/{}'.format(bucket, prefix, filename)
print('this is path to your s3 files: ' + data_s3_path)

this is path to your s3 files: s3://sagemaker-us-east-2-060356833389/data/tabular/boston_house/boston_house.csv
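
The path printed above has the form `s3://<bucket>/<key>`. As a minimal sketch, the same string can be assembled and taken apart with the standard library alone, which is a quick way to check a path before pasting it into a Spark cell. The helper names `build_s3_uri` and `split_s3_uri` are hypothetical (they are not part of boto3 or the SageMaker SDK), and the bucket name is a placeholder.

```python
from typing import Tuple
from urllib.parse import urlparse


def build_s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key components into an s3:// URI."""
    key = "/".join(p.strip("/") for p in parts if p.strip("/"))
    return f"s3://{bucket}/{key}"


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder bucket name (an assumption, not a real bucket).
uri = build_s3_uri("my-example-bucket", "data/tabular/boston_house", "boston_house.csv")
bucket_name, key = split_s3_uri(uri)
print(uri)
print(bucket_name, key)
```

In this notebook such a check would belong in a `%%local` cell, since it does not need the cluster.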


## Copy the S3 bucket file path
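Reading the data with Spark on EMR requires the S3 file path. The cell above runs locally (`%%local`), so its variables are not available on the cluster. Copy the printed path and paste it into the next cell, which runs on EMR.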

In [11]:
data_s3_path = 's3://sagemaker-us-east-2-060356833389/data/tabular/boston_house/boston_house.csv'


## Read the data in the EMR Spark cluster

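Once we have the path to our data in S3, we can read it into a Spark DataFrame with the following command. You specify a data format and, optionally, a schema (not required, but recommended). In the options you can set `compression`, `delimiter`, `header`, and so on. Note that the cell in this section uses Spark's built-in `csv` format; to read through S3 Select as described in the linked documentation, use the `s3selectCSV` format instead. For more details, see the [documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html).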
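
For reference, the linked EMR documentation describes S3 Select as a separate data source format, `s3selectCSV`, used in place of `csv`. The sketch below wraps that call in a hypothetical helper, `read_csv_with_s3_select`. It assumes an active `spark` session on an EMR release that supports S3 Select, and it has not been run against this dataset.

```python
def read_csv_with_s3_select(spark, path, schema, **options):
    """Read a CSV object from S3 through the EMR S3 Select data source.

    `spark` is an existing SparkSession. `options` are passed through
    unchanged to the reader (for example header='true', delimiter=',').
    """
    reader = spark.read.format("s3selectCSV").schema(schema)
    if options:
        reader = reader.options(**options)
    return reader.load(path)
```

On the cluster this could be called as `df = read_csv_with_s3_select(spark, data_s3_path, schema, header='true')`.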

In [19]:
# EMR cell
# Schema as a DDL string: every column in this dataset is a double.
schema = (
    'CRIM double, ZN double, INDUS double, CHAS double, NOX double, '
    'RM double, AGE double, DIS double, RAD double, TAX double, '
    'PTRATIO double, B double, LSTAT double, target double'
)
df = spark.read.format('csv').schema(schema).options(header='true').load(data_s3_path)
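
Because all fourteen columns share one type, the DDL schema string can also be generated from a list of column names rather than typed by hand. This is a minimal sketch: `build_ddl_schema` is a hypothetical helper, and the column list is copied from the cell above.

```python
from typing import Iterable

# Column names copied from the schema used in this notebook.
COLUMNS = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "target",
]


def build_ddl_schema(columns: Iterable[str], dtype: str = "double") -> str:
    """Return a comma-separated 'name type' DDL string for Spark's .schema()."""
    return ", ".join(f"{name} {dtype}" for name in columns)


schema = build_ddl_schema(COLUMNS)
print(schema)
```

The resulting string can be passed to `.schema(...)` in the same way as the handwritten one.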


In [20]:
df.show(5)


+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+------+
|   CRIM|  ZN|INDUS|CHAS|  NOX|   RM| AGE|   DIS|RAD|  TAX|PTRATIO|     B|LSTAT|target|
+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+------+
|0.00632|18.0| 2.31| 0.0|0.538|6.575|65.2|  4.09|1.0|296.0|   15.3| 396.9| 4.98|  24.0|
|0.02731| 0.0| 7.07| 0.0|0.469|6.421|78.9|4.9671|2.0|242.0|   17.8| 396.9| 9.14|  21.6|
|0.02729| 0.0| 7.07| 0.0|0.469|7.185|61.1|4.9671|2.0|242.0|   17.8|392.83| 4.03|  34.7|
|0.03237| 0.0| 2.18| 0.0|0.458|6.998|45.8|6.0622|3.0|222.0|   18.7|394.63| 2.94|  33.4|
|0.06905| 0.0| 2.18| 0.0|0.458|7.147|54.2|6.0622|3.0|222.0|   18.7| 396.9| 5.33|  36.2|
+-------+----+-----+----+-----+-----+----+------+---+-----+-------+------+-----+------+
only showing top 5 rows

## Conclusion
Now that the data is loaded, you can pre-process it with Spark on the EMR cluster, build ML pipelines, and train models at scale.