### Overview

This notebook does the following:

* Demonstrates how to visually connect an Amazon SageMaker Studio Python 3 (Data Science) kernel to an EMR cluster
* Explores and queries data from a Hive table using the pyhive library
* Demonstrates how to use the data for machine learning






In [2]:
%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr connect --cluster-id j-3GWESRXJMF9D8 --auth-type None --language python  

Successfully read emr cluster(j-3GWESRXJMF9D8) details
Initiating EMR connection..
Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | User | Current session? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | application_1662658496356_0003 | pyspark | idle | Link | Link | | ✔ |


SparkSession available as 'spark'.
{"namespace": "sagemaker-analytics", "cluster_id": "j-3GWESRXJMF9D8", "error_message": null, "success": true, "service": "emr", "operation": "connect"}


### Connection to EMR Cluster

The code in the cell below is autogenerated. You can generate it by clicking the "Cluster" link at the top of the notebook and selecting the EMR cluster. The "j-xxxxxxxxxxxx" value is the ID of the selected cluster.

For the example in our blog, we used a no-auth cluster for simplicity, but this works equally well with the Kerberos, LDAP, and HTTP auth mechanisms.

In [None]:
# %load_ext sagemaker_studio_analytics_extension.magics
# %sm_analytics emr connect --cluster-id j-xxxxxxxxxxx --auth-type None --language python

First, we import the hive module from the pyhive library.

In [6]:
from pyhive import hive

An error was encountered:
No module named 'pyhive'
Traceback (most recent call last):
ModuleNotFoundError: No module named 'pyhive'
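
The `ModuleNotFoundError` above indicates that `pyhive` is not installed in the Python environment that executed the cell. The following is a minimal sketch, using only the standard library, for checking whether a module can be found before importing it. The helper name `is_installed` is hypothetical and not part of the original notebook.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("pyhive"):
    print("pyhive is available in this environment")
else:
    # The package is published on PyPI as "PyHive"; how you install it
    # depends on which environment (local kernel or cluster) runs the cell.
    print("pyhive is missing from this environment; install it before importing")
```

Running this check in the same cell context as the failing import shows which environment is missing the package.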



Next, we connect to HiveServer2 on the EMR primary node, using the node's DNS name as the host. Replace the host in the following code with the DNS name of your own cluster's primary node. You can find this information on the Amazon EMR console (expand the cluster name and locate Master public DNS in the Summary section).

In [5]:
# Replace the host with your primary node's DNS name (a bare hostname, without "https://").
conn = hive.Connection(host='p-3gwesrxjmf9d8.emrappui-prod.us-east-2.amazonaws.com', port=10000)

An error was encountered:
name 'hive' is not defined
Traceback (most recent call last):
NameError: name 'hive' is not defined
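
As far as we know, `hive.Connection` passes `host` to the underlying Thrift socket, so the value should be a resolvable hostname rather than a URL with a scheme. The sketch below shows one way to normalize a pasted value and to check that the HiveServer2 port is reachable before connecting. The helper names `bare_hostname` and `port_is_open` are hypothetical, and the 3-second timeout is an arbitrary assumption.

```python
import socket
from urllib.parse import urlparse


def bare_hostname(value: str) -> str:
    """Strip an optional scheme, port, and path, returning only the hostname."""
    # urlparse only fills in `hostname` when a "//" separator is present,
    # so add one for inputs that are already bare host names.
    parsed = urlparse(value if "//" in value else f"//{value}")
    return parsed.hostname or value


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


host = bare_hostname("https://p-3gwesrxjmf9d8.emrappui-prod.us-east-2.amazonaws.com")
print(host)  # p-3gwesrxjmf9d8.emrappui-prod.us-east-2.amazonaws.com
```

Calling `port_is_open(host, 10000)` from the environment that runs the notebook cells then tells you whether a connection attempt can succeed at the network level. A `False` result usually points to a wrong host name or a security group that blocks the port.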



Next, we list the available databases and tables, then query the movie_reviews table and load the data into a pandas DataFrame. You can then visualize the data using the code below.

In [None]:
cursor = conn.cursor()
cursor.execute("show databases")
cursor.fetchall()

In [None]:
cursor.execute("show tables")
cursor.fetchall()
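
`cursor.fetchall()` follows the Python DB-API convention and returns a list of row tuples, so a single-column result such as a table listing typically comes back as a list of one-element tuples. The sketch below illustrates flattening such a result into plain names. It uses the standard library's `sqlite3` as a stand-in for the Hive connection, and the `demo_` names and the `sqlite_master` query are assumptions made only so that the example runs without a cluster.

```python
import sqlite3

# Stand-in connection: sqlite3 exposes the same DB-API cursor methods
# (execute, fetchall) that the notebook calls on the pyhive connection.
demo_conn = sqlite3.connect(":memory:")
demo_cursor = demo_conn.cursor()
demo_cursor.execute("CREATE TABLE movie_reviews (review TEXT, sentiment TEXT)")

# sqlite3 has no "show tables"; querying sqlite_master plays that role here.
demo_cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
rows = demo_cursor.fetchall()  # a list of one-element tuples

# Flatten the row tuples into a plain list of names.
table_names = [row[0] for row in rows]
print(rows)         # [('movie_reviews',)]
print(table_names)  # ['movie_reviews']

demo_conn.close()
```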

In [None]:
import pandas as pd

movie_reviews = pd.read_sql("select review, sentiment from movie_reviews", conn)

In [None]:
movie_reviews.head()

In [None]:
# Select rows with a boolean mask; pandas DataFrame.filter selects by label, not by condition.
pos_reviews = movie_reviews[movie_reviews.sentiment == "positive"]
neg_reviews = movie_reviews[movie_reviews.sentiment == "negative"]
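
In pandas, rows are usually selected by condition with a boolean mask (`df[mask]`), whereas `DataFrame.filter` selects rows or columns by label. The sketch below shows the mask pattern on a few hypothetical sample rows that stand in for the Hive query result, and also shows `value_counts()` as an alternative way to get the per-label counts.

```python
import pandas as pd

# Hypothetical sample rows standing in for the Hive query result.
sample = pd.DataFrame(
    {
        "review": ["great film", "dull plot", "loved it"],
        "sentiment": ["positive", "negative", "positive"],
    }
)

# Boolean-mask indexing keeps only the rows where the condition is True.
pos = sample[sample["sentiment"] == "positive"]
neg = sample[sample["sentiment"] == "negative"]
print(len(pos), len(neg))  # 2 1

# value_counts() yields the same per-label counts in a single call.
counts = sample["sentiment"].value_counts()
print(counts["positive"], counts["negative"])  # 2 1
```

If only the counts are needed for plotting, `value_counts()` avoids materializing two separate DataFrames.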

In [None]:
import matplotlib.pyplot as plt


def plot_counts(positive, negative):
    """Draw a bar chart comparing the number of positive and negative reviews."""
    plt.rcParams["figure.figsize"] = (6, 6)
    plt.bar(0, positive, width=0.6, label="Positive Reviews", color="green")
    plt.bar(2, negative, width=0.6, label="Negative Reviews", color="red")
    # Deduplicate legend entries by label before drawing the legend.
    handles, labels = plt.gca().get_legend_handles_labels()
    by_label = dict(zip(labels, handles))
    plt.legend(by_label.values(), by_label.keys())
    plt.ylabel("Count")
    plt.xlabel("Type of Review")
    # Hide x-axis ticks and tick labels; the legend identifies each bar.
    plt.tick_params(axis="x", which="both", bottom=False, top=False, labelbottom=False)
    plt.show()


plot_counts(len(pos_reviews), len(neg_reviews))