
# Glue Studio Notebook
You are now running a **Glue Studio** notebook. Before you can use it, you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your AWS configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma-separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma-separated list of pip packages, S3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma-separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma-separated list of additional JARs to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (e.g. %glue_version 2.0).                              |
| %security_config            |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |
| %min_workers                |  Integer     |  The minimum number of workers that are allocated to a Ray job. Default: 0.                                                                               |
| %object_memory_head         |  Integer     |  The percentage of free memory on the instance head node after a warm start. Minimum: 0. Maximum: 100.                                                    |
| %object_memory_worker       |  Integer     |  The percentage of free memory on the instance worker nodes after a warm start. Minimum: 0. Maximum: 100.                                                 |

# Importing the `libraries`

In [1]:
%glue_ray

import ray
import pandas
import pyarrow
from ray import data
import time
from ray.data import ActorPoolStrategy

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.37.0 
Previous Job type: glueray
Setting new Job type to glueray
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::507922848584:role/AWSGlueServiceRole-glueworkshop
Trying to create a Glue session for the kernel.
Worker Type: Z.2X
Number of Workers: 5
Session ID: d2aa0686-6fd0-4013-9592-a917bc4db5af
Job Type: glueray
Applying the following default arguments:
--glue_kernel_version 0.37.0
--enable-glue-datacatalog true
Waiting for session d2aa0686-6fd0-4013-9592-a917bc4db5af to get into ready status...
Session d2aa0686-6fd0-4013-9592-a917bc4db5af has been created.


# Initialize a `Ray` Cluster with AWS Glue

In [2]:
ray.init('auto')

RayContext(dashboard_url='127.0.0.1:8265', python_version='3.9.14', ray_version='2.0.0', ray_commit='{{RAY_COMMIT_SHA}}', address_info={'node_ip_address': '2600:1f14:27:7e13:603a:287b:d027:e58c', 'raylet_ip_address': '2600:1f14:27:7e13:603a:287b:d027:e58c', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2023-01-30_23-21-28_870018_1672/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2023-01-30_23-21-28_870018_1672/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2023-01-30_23-21-28_870018_1672', 'metrics_export_port': 8080, 'gcs_address': '2600:1f14:27:7e13:603a:287b:d027:e58c:6379', 'address': '2600:1f14:27:7e13:603a:287b:d027:e58c:6379', 'dashboard_agent_listen_port': 52365, 'node_id': '25b1727f3e9fc349075ba9dc98ca079272a501b7748ed7afe88d429c'})


2023-01-30 23:21:44,983	INFO worker.py:1329 -- Connecting to existing Ray cluster at address: 2600:1f14:27:7e13:603a:287b:d027:e58c:6379...
2023-01-30 23:21:44,991	INFO worker.py:1511 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265


# Read the dataset in `Parquet` file format

In [3]:
start = time.time()
ds = ray.data.read_parquet("s3://amazon-reviews-pds/parquet/product_category=Wireless/")
end = time.time()

print(f"Reading the data to dataframe: {round(end - start, 2)} seconds")

Reading the data to dataframe: 3.15 seconds
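
The cells in this notebook repeat the same `start = time.time()` / `end = time.time()` pattern around each step. A minimal sketch of a reusable timing helper follows. It assumes the hypothetical names `timed` and `timings`, which are not part of Ray or Glue. Note that a timer only measures the work done inside the block: if a Ray Dataset read is evaluated lazily, the elapsed time may not include the full data load.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the wrapped block took; optionally record it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f} seconds")


# Usage: wrap any step; a short sleep stands in for the real work here.
timings = {}
with timed("Sleeping briefly", timings):
    time.sleep(0.01)
```

In a notebook cell, the block body would be the Ray call being measured, for example the `read_parquet` call above.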


In [4]:
print(ds.schema())

marketplace: string
customer_id: string
review_id: string
product_id: string
product_parent: string
product_title: string
star_rating: int32
helpful_votes: int32
total_votes: int32
vine: string
verified_purchase: string
review_headline: string
review_body: string
review_date: date32[day]
year: int32
-- schema metadata --
org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 1036


In [5]:
print(ds.size_bytes())

20487684500
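
The raw byte count is hard to read at a glance. A minimal sketch of a formatting helper, assuming the hypothetical name `format_bytes` (not a Ray API) and binary units (1 KiB = 1024 bytes):

```python
def format_bytes(num_bytes):
    """Convert a byte count to a human-readable string using binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


# The value printed by ds.size_bytes() above.
print(format_bytes(20487684500))  # 19.08 GiB
```

In the notebook this could be called as `format_bytes(ds.size_bytes())`.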


# Applying dataset `transformations` with Ray

In [6]:
# Drop a few columns from the underlying Dataset
start = time.time()
ds = ds.drop_columns(["review_body", "vine", "product_parent", "verified_purchase", "review_headline"])
end = time.time()

print(f"Time taken to drop a few columns : {round(end - start, 2)} seconds")
ds.schema()

Time taken to drop a few columns : 89.8 seconds
PandasBlockSchema(names=['marketplace', 'customer_id', 'review_id', 'product_id', 'product_title', 'star_rating', 'helpful_votes', 'total_votes', 'review_date', 'year'], types=[dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('int32'), dtype('int32'), dtype('int32'), dtype('O'), dtype('int32')])


Read->Map_Batches: 100%|##########| 10/10 [01:29<00:00,  8.98s/it]


In [7]:
# Sort the dataset by total votes (ascending by default)
start = time.time()
ds = ds.sort("total_votes")
end = time.time()

print(f"Time taken for sort operation  : {end - start} seconds")
ds.show(3)

Time taken for sort operation  : 36.849875926971436 seconds
{'marketplace': 'US', 'customer_id': '3440602', 'review_id': 'RQZTNLO53E78V', 'product_id': 'B00D30TTOM', 'product_title': 'GreatShield Stretchable Neoprene Sport Armband Case with Key Storage for Galaxy S8/S7, HTC One M9/M8/M7, LG G3/G2, Moto G5, Nokia 3', 'star_rating': 4, 'helpful_votes': 0, 'total_votes': 0, 'review_date': datetime.date(2015, 1, 9), 'year': 2015}
{'marketplace': 'US', 'customer_id': '25538031', 'review_id': 'R1H9GHPRAIKCP8', 'product_id': 'B008AGQMQC', 'product_title': 'Mpow Bluetooth Receiver, Protable Bluetooth 4.1 Car Adapter & Bluetooth Car Aux Adapter for Music Streaming Sound System, Hands-free Audio Adapter & Wireless Car Kits for Home/Car Audio Stereo System', 'star_rating': 1, 'helpful_votes': 0, 'total_votes': 0, 'review_date': datetime.date(2015, 8, 18), 'year': 2015}
{'marketplace': 'US', 'customer_id': '15894288', 'review_id': 'RINWA13QJCUX', 'product_id': 'B001T8DEL4', 'product_title': 'DC ca

Sort Sample: 100%|##########| 10/10 [00:01<00:00,  8.09it/s]
Shuffle Map: 100%|##########| 10/10 [00:04<00:00,  2.37it/s]
[2m[36m(reduce pid=2323, ip=169.254.1.2)[0m   return self._table.memory_usage(index=True, deep=True).sum()
[2m[36m(reduce pid=2322, ip=169.254.1.2)[0m   return self._table.memory_usage(index=True, deep=True).sum()
[2m[36m(reduce pid=2324, ip=169.254.1.2)[0m   return self._table.memory_usage(index=True, deep=True).sum()
[2m[36m(reduce pid=1784, ip=169.254.1.2)[0m   return self._table.memory_usage(index=True, deep=True).sum()
[2m[36m(reduce pid=1785, ip=169.254.1.2)[0m   return self._table.memory_usage(index=True, deep=True).sum()
[2m[36m(reduce pid=1786, ip=169.254.1.2)[0m   return self._table.memory_usage(index=True, deep=True).sum()
Shuffle Reduce: 100%|##########| 10/10 [00:31<00:00,  3.12s/it]
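
The sort above is ascending, which is why the first rows shown have `total_votes` of 0. The sketch below illustrates the difference between ascending and descending order in plain Python. The sample rows are made up for illustration and `sort_rows` is a hypothetical helper. The Ray call in the final comment assumes that `Dataset.sort` accepts a `descending` argument, which should be verified against the installed Ray version.

```python
# Hypothetical sample rows, made up for illustration; they mimic the shape
# of the row dicts printed by ds.show().
rows = [
    {"review_id": "R-A", "total_votes": 0},
    {"review_id": "R-B", "total_votes": 12},
    {"review_id": "R-C", "total_votes": 3},
]


def sort_rows(records, key, descending=False):
    """Return the records ordered by the given key."""
    return sorted(records, key=lambda r: r[key], reverse=descending)


ascending = sort_rows(rows, "total_votes")
most_voted_first = sort_rows(rows, "total_votes", descending=True)

print([r["review_id"] for r in ascending])         # ['R-A', 'R-C', 'R-B']
print([r["review_id"] for r in most_voted_first])  # ['R-B', 'R-C', 'R-A']

# Assumed Ray equivalent (verify against the installed Ray version):
# ds = ds.sort("total_votes", descending=True)
```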


# Clean up