# AWS Glue Studio Notebook
##### You are now running an AWS Glue Studio notebook. To start using your notebook, you need to start an AWS Glue Interactive Session.


#### Optional: Run this cell to see available notebook commands ("magics").


In [None]:
%help

#### Run this cell to set up and start your interactive session.


In [4]:
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

You are already connected to a glueetl session 0a320293-350a-4235-b465-a720b0f3556a.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Current idle_timeout is 2880 minutes.
idle_timeout has been set to 2880 minutes.


Setting Glue version to: 4.0


Previous worker type: G.1X
Setting new worker type to: G.1X


Previous number of workers: 5
Setting new number of workers to: 5



#### Example: Create a DynamicFrame from a CSV file in Amazon S3 and display its schema


In [12]:
# Define the input file path and delimiter
input_file_path = 's3://awstrainingjune01/Input_Data/csv/customers-100.csv'
delimiter = ','

# Read the CSV file
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": [input_file_path], "recurse": True},
    format = "csv",
    format_options = {
        "withHeader": True,
        "separator": delimiter
    }
)

datasource0.printSchema()

root
|-- Index: string
|-- Customer Id: string
|-- First Name: string
|-- Last Name: string
|-- Company: string
|-- City: string
|-- Country: string
|-- Phone 1: string
|-- Phone 2: string
|-- Email: string
|-- Subscription Date: string
|-- Website: string

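The schema above reports every column as `string`, and several column names contain spaces, which is typical for a CSV read with only `withHeader` set. One possible way to tidy this up is Glue's `ApplyMapping` transform, which takes a list of `(source_name, source_type, target_name, target_type)` tuples. The sketch below is a minimal illustration, not part of the original notebook: the helper names `to_snake_case`, `build_mappings`, and `rename_and_cast` are hypothetical, the snake_case target names are an assumed convention, and casting `Index` to `long` is only an example.

```python
import re


def to_snake_case(name):
    """Lowercase a column name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def build_mappings(column_names, casts=None):
    """Build ApplyMapping tuples: (source, source_type, target, target_type).

    Every source column is assumed to be a string, as in the CSV schema above.
    `casts` maps a source column name to the desired target type.
    """
    casts = casts or {}
    return [
        (name, "string", to_snake_case(name), casts.get(name, "string"))
        for name in column_names
    ]


def rename_and_cast(dynamic_frame, casts=None):
    """Apply the mappings to a DynamicFrame (only runs inside a Glue session)."""
    # Imported here so the pure helpers above can be used without awsglue installed.
    from awsglue.transforms import ApplyMapping

    column_names = dynamic_frame.toDF().columns
    return ApplyMapping.apply(
        frame=dynamic_frame,
        mappings=build_mappings(column_names, casts),
    )
```

Inside the session, a call such as `rename_and_cast(datasource0, {"Index": "long"})` would return a new DynamicFrame whose schema can be checked with `printSchema()`.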

#### Example: Convert the DynamicFrame to a Spark DataFrame and display a sample of the data


In [13]:
df = datasource0.toDF()
df.show()

+-----+---------------+----------+---------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|Index|    Customer Id|First Name|Last Name|             Company|             City|             Country|             Phone 1|             Phone 2|               Email|Subscription Date|             Website|
+-----+---------------+----------+---------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+-----------------+--------------------+
|    1|DD37Cf93aecA6Dc|    Sheryl|   Baxter|     Rasmussen Group|     East Leonard|               Chile|        229.077.5154|    397.884.0519x718|zunigavanessa@smi...|       2020-08-24|http://www.stephe...|
|    2|1Ef7b82A4CAAD10|   Preston|   Lozano|         Vega-Gentry|East Jimmychester|            Djibouti|          5153435776|    686-620-1820x944|     vmata@colon.com|     

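In the output above, `df.show()` cuts long values such as email addresses and URLs to 20 characters. If the full values matter, PySpark's `DataFrame.show` accepts `n`, `truncate`, and `vertical` arguments. The following is a small sketch; the wrapper name `show_untruncated` and the default of five rows are assumptions made for illustration.

```python
def show_untruncated(df, rows=5):
    """Print the first `rows` records without truncating long cell values.

    vertical=True prints one field per line, which is easier to read
    for wide tables like the customer data above.
    """
    df.show(n=rows, truncate=False, vertical=True)
```

In the notebook this could be called as `show_untruncated(df)` after `df = datasource0.toDF()`.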
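#### Example: Create a DynamicFrame from a table in the AWS Glue Data Catalog and display its schema

In [15]: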
dyf = glueContext.create_dynamic_frame.from_catalog(database='db_aws_training_exercises', table_name='cst_tbl_customer_master')
dyf.printSchema()

root
|-- index: long
|-- customerid: long
|-- first name: string
|-- last name: string
|-- company: string
|-- city: string
|-- country: string
|-- phone 1: string
|-- phone 2: string
|-- email: string
|-- subscription date: string
|-- website: string


#### Example: Write the data in the DynamicFrame to a location in Amazon S3 and a table for it in the AWS Glue Data Catalog

In [26]:
record_count = dyf.count()
print(f"Total records: {record_count}")

Total records: 100


In [None]:
s3output = glueContext.getSink(
  path="s3://bucket_name/folder_name",
  connection_type="s3",
  updateBehavior="UPDATE_IN_DATABASE",
  partitionKeys=[],
  compression="snappy",
  enableUpdateCatalog=True, 
  transformation_ctx="s3output",
)
s3output.setCatalogInfo(
  catalogDatabase="demo", catalogTableName="populations"
)
s3output.setFormat("glueparquet")
s3output.writeFrame(dyf)

In [27]:
s3output = glueContext.getSink(
  path="s3://awstrainingjune01/Output_Data/Notebook_Output/",
  connection_type="s3",
  updateBehavior="UPDATE_IN_DATABASE",
  partitionKeys=[],
  compression="snappy",
  enableUpdateCatalog=True, 
  transformation_ctx="s3output",
)
s3output.setCatalogInfo(
  catalogDatabase="db_aws_training_exercises", catalogTableName="trg_customers"
)
s3output.setFormat("csv")
s3output.writeFrame(dyf)

<awsglue.dynamicframe.DynamicFrame object at 0x7f65594fd420>
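After the write, one way to check the result is to read the new catalog table back and compare its record count with the count taken before the write (100 in this run). The sketch below reuses the `from_catalog` call already shown in this notebook; the helper names `count_catalog_table` and `check_written_count` are hypothetical. Note the assumption that the output location was empty beforehand: re-running the write cell may add files, in which case the count would no longer match.

```python
def count_catalog_table(glue_context, database, table_name):
    """Read a Data Catalog table as a DynamicFrame and return its record count."""
    frame = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )
    return frame.count()


def check_written_count(glue_context, database, table_name, expected):
    """Return (matches, actual_count) for the table that was just written."""
    actual = count_catalog_table(glue_context, database, table_name)
    return actual == expected, actual
```

In this session the call could look like `check_written_count(glueContext, "db_aws_training_exercises", "trg_customers", record_count)`.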
