# Get started with EMR Serverless on EMR Studio

#### Topics covered in this example
<ol>
    <li> Configure a Spark session </li>
    <li> Install a library from PyPI to help with plotting </li>
    <li> Spark DataFrames: reading a public dataset from S3 and selecting data </li>
    <li> Spark SQL: creating a new view and selecting data </li>
    <li> Visualize your data </li>
</ol>

***

## Prerequisites
<div class="alert alert-block alert-info">
<b>NOTE :</b> To run this notebook as is, make sure the following prerequisites are met.</div>

* EMR Serverless should be chosen as the Compute.
* Make sure the Studio user role has permission to attach the Workspace to the Application and to pass the runtime role to it.
* This notebook uses the `PySpark` kernel.
* Your Serverless Application must be configured with a VPC that has internet connectivity. [Learn more](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html)
***

## 1. Configure your Spark session
Configure the Spark session to use Virtualenv, which is needed to install additional Python packages.

In [None]:
%%configure -f
{
  "conf": {
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.python": "/usr/bin/python3",
    "spark.executorEnv.PYSPARK_PYTHON": "/usr/bin/python3"
  }
}
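
The body of `%%configure` has to be valid JSON, and the driver and executor Python paths should point to the same interpreter. As a minimal sketch (the helper name `build_configure_payload` is hypothetical and not part of EMR Studio), the same settings can be generated and checked in plain Python before pasting them into the cell:

```python
import json

# Interpreter path copied from the %%configure cell above.
PYTHON_BIN = "/usr/bin/python3"


def build_configure_payload(python_bin: str = PYTHON_BIN) -> str:
    """Return the JSON body for %%configure (hypothetical helper)."""
    conf = {
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
        "spark.pyspark.virtualenv.type": "native",
        # Driver and executors use the same interpreter.
        "spark.pyspark.python": python_bin,
        "spark.executorEnv.PYSPARK_PYTHON": python_bin,
    }
    return json.dumps({"conf": conf}, indent=2)


payload = build_configure_payload()
print(payload)
```

Spark configuration values are passed as strings, which is why `"true"` is quoted rather than written as a JSON boolean.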

Start a Spark session:

In [None]:
spark

Run the `%%info` magic command. It shows the Spark configuration for the current session and provides links to the live Spark UI for the session:

In [None]:
%%info

---
## 2. Install packages from PyPI
We will install the matplotlib Python package.
<div class="alert alert-block alert-info">
<b>NOTE :</b> You will need internet access to do this step.</div>

In [None]:
sc.install_pypi_package("matplotlib")

---
## 3. Read data from S3
We will use a public dataset on NYC yellow taxis, stored as a Parquet file in S3. Parquet files embed their own schema, so Spark reads the column names and types from the file metadata instead of inferring them.
<div class="alert alert-block alert-info">
<b>NOTE :</b> You will need to update your runtime role to allow read (Get) access to the s3://athena-examples-us-east-1/notebooks/ folder and its sub-folders.</div>
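
As a rough sketch of what such a permission could look like, the snippet below builds an IAM policy document as a Python dict and prints it as JSON. The statement IDs are made up, and including `s3:ListBucket` alongside `s3:GetObject` is an assumption about what Spark needs when it resolves the path. Adapt it to your own runtime role rather than treating it as the required policy.

```python
import json

BUCKET = "athena-examples-us-east-1"  # bucket named in the note above
PREFIX = "notebooks/"                 # folder named in the note above

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Read objects under the notebooks/ prefix.
            "Sid": "ReadSampleObjects",  # hypothetical statement ID
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/{PREFIX}*"],
        },
        {
            # Assumption: listing is limited to the same prefix.
            "Sid": "ListSamplePrefix",  # hypothetical statement ID
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
            "Condition": {"StringLike": {"s3:prefix": [f"{PREFIX}*"]}},
        },
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```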

In [None]:
file_name = "s3://athena-examples-us-east-1/notebooks/yellow_tripdata_2016-01.parquet"

# Parquet stores its schema in the file, so the CSV-only "header" and
# "inferSchema" options are not needed here.
taxi_df = spark.read.parquet(file_name)

#### Use the Spark DataFrame API to group taxi_df by specific columns and count the rows in each group

In [None]:
taxi1_df = taxi_df.groupBy("VendorID", "passenger_count").count()
taxi1_df.show()
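
To make the semantics of `groupBy(...).count()` concrete, here is a small plain-Python sketch that does the same grouping on a handful of made-up `(VendorID, passenger_count)` rows. The rows are illustrative only and are not taken from the taxi dataset.

```python
from collections import Counter

# Made-up sample rows: (VendorID, passenger_count).
rows = [(1, 1), (1, 1), (2, 1), (2, 3), (1, 2), (2, 1)]

# Each distinct (VendorID, passenger_count) pair becomes one group,
# and the value is the number of rows that fall into that group.
counts = Counter(rows)

for (vendor_id, passenger_count), n in sorted(counts.items()):
    print(vendor_id, passenger_count, n)
```

Spark produces one output row per distinct combination of the grouping columns, just as `Counter` keeps one entry per distinct tuple.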

### Use the `%%display` magic to quickly visualize a DataFrame
<ol>
    <li> You can choose to view the results in a table format. </li>
    <li> You can also choose to visualize your data with five types of charts. You can select the display type below and the chart will change accordingly. </li>
</ol>

In [None]:
%%display
taxi1_df

---
## 4. Run Spark SQL commands
#### Create a temporary view named taxis, then use Spark SQL to select data from it into a new DataFrame for further processing

In [None]:
taxi_df.createOrReplaceTempView("taxis")

sqlDF = spark.sql(
    """
    SELECT DOLocationID, SUM(total_amount) AS sum_total_amount
    FROM taxis
    WHERE DOLocationID < 25
    GROUP BY DOLocationID
    ORDER BY DOLocationID
    """
)
sqlDF.show(50)
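
To see what this query computes without a Spark session, the sketch below runs the same filter, grouping, and ordering against an in-memory SQLite table. SQLite is only a stand-in for Spark SQL here, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxis (DOLocationID INTEGER, total_amount REAL)")

# Made-up rows; location 30 is excluded by the WHERE clause below.
sample = [(4, 10.0), (4, 5.5), (7, 20.0), (30, 99.0), (7, 1.0)]
conn.executemany("INSERT INTO taxis VALUES (?, ?)", sample)

query = """
SELECT DOLocationID, SUM(total_amount) AS sum_total_amount
FROM taxis
WHERE DOLocationID < 25
GROUP BY DOLocationID
ORDER BY DOLocationID
"""
result = conn.execute(query).fetchall()
conn.close()

print(result)
```

Each drop-off location below 25 appears once in the result, with its fares summed.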

Use the `%%sql` magic to run a SQL statement directly:

In [None]:
%%sql
SHOW DATABASES

---
## 5. Visualize your data using Python
#### Use matplotlib to plot the drop-off location and the total amount as a bar chart

In [None]:
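import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Clear any figure left over from a previous run of this cell.
plt.clf()
# Bring the small aggregated result to the driver as a pandas DataFrame.
df = sqlDF.toPandas()
plt.bar(df.DOLocationID, df.sum_total_amount)
plt.xlabel("DOLocationID")
plt.ylabel("sum_total_amount")
# Render the current matplotlib figure in the notebook.
%matplot plt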

### You have made it to the end of the demo notebook!!