## Getting Started Using AWS Glue

### Setting up IAM Permissions for AWS Glue

https://docs.aws.amazon.com/glue/latest/dg/getting-started-access.html

You use AWS Identity and Access Management (IAM) to define policies and roles that are needed to access resources used by AWS Glue. The following steps lead you through the basic permissions that you need to set up your environment. Depending on your business needs, you might have to add or reduce access to your resources.

1. [Create an IAM Policy for the AWS Glue Service:](https://docs.aws.amazon.com/glue/latest/dg/create-service-policy.html) Create a service policy that allows access to AWS Glue resources.

2. [Create an IAM Role for AWS Glue:](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html) Create an IAM role, and attach the AWS Glue service policy and a policy for your Amazon Simple Storage Service (Amazon S3) resources that are used by AWS Glue.

3. [Attach a Policy to IAM Users That Access AWS Glue:](https://docs.aws.amazon.com/glue/latest/dg/attach-policy-iam-user.html) Attach policies to any IAM user that signs in to the AWS Glue console.

4. [Create an IAM Policy for Notebooks:](https://docs.aws.amazon.com/glue/latest/dg/create-notebook-policy.html) Create a notebook server policy to use in the creation of notebook servers on development endpoints.

5. [Create an IAM Role for Notebooks:](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role-notebook.html) Create an IAM role and attach the notebook server policy.

6. [Create an IAM Policy for Amazon SageMaker Notebooks:](https://docs.aws.amazon.com/glue/latest/dg/create-sagemaker-notebook-policy.html) Create an IAM policy to use when creating Amazon SageMaker notebooks on development endpoints.

7. [Create an IAM Role for Amazon SageMaker Notebooks:](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role-sagemaker-notebook.html) Create an IAM role and attach the policy to grant permissions when creating Amazon SageMaker notebooks on development endpoints.

### Setting Up DNS in Your VPC

Domain Name System (DNS) is a standard by which names used on the internet are resolved to their corresponding IP addresses. A DNS hostname uniquely names a computer and consists of a host name and a domain name. DNS servers resolve DNS hostnames to their corresponding IP addresses.

To set up DNS in your VPC, ensure that DNS hostnames and DNS resolution are both enabled in your VPC. The VPC network attributes enableDnsHostnames and enableDnsSupport must be set to true. To view and modify these attributes, go to the VPC console at https://console.aws.amazon.com/vpc/.

For more information, see [Using DNS with your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html). Also, you can use the AWS CLI and call the [modify-vpc-attribute](https://docs.aws.amazon.com/cli/latest/reference/ec2/modify-vpc-attribute.html) command to configure the VPC network attributes.
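
As a minimal sketch of the same configuration done programmatically, the following uses the `modify_vpc_attribute` and `describe_vpc_attribute` operations of the AWS SDK for Python (boto3) EC2 client. The helper names `enable_vpc_dns` and `vpc_dns_is_enabled` and the VPC ID are hypothetical; the client is passed in, so the sketch assumes you create it yourself with valid credentials.

```python
def enable_vpc_dns(ec2_client, vpc_id):
    """Enable DNS resolution and DNS hostnames on a VPC.

    ModifyVpcAttribute accepts only one attribute per request, so two
    calls are made. DNS support is enabled first because DNS hostnames
    can only be enabled when DNS support is on.
    """
    ec2_client.modify_vpc_attribute(VpcId=vpc_id, EnableDnsSupport={"Value": True})
    ec2_client.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={"Value": True})


def vpc_dns_is_enabled(ec2_client, vpc_id):
    """Return True only if enableDnsSupport and enableDnsHostnames are both true."""
    support = ec2_client.describe_vpc_attribute(
        VpcId=vpc_id, Attribute="enableDnsSupport"
    )
    hostnames = ec2_client.describe_vpc_attribute(
        VpcId=vpc_id, Attribute="enableDnsHostnames"
    )
    return support["EnableDnsSupport"]["Value"] and hostnames["EnableDnsHostnames"]["Value"]


# Example (requires boto3 and AWS credentials; the VPC ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2")
#   enable_vpc_dns(ec2, "vpc-0123456789abcdef0")
#   print(vpc_dns_is_enabled(ec2, "vpc-0123456789abcdef0"))
```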

### Setting Up Your Environment to Access Data Stores

To run your extract, transform, and load (ETL) jobs, AWS Glue must be able to access your data stores. If a job doesn't need to run in your virtual private cloud (VPC) subnet—for example, transforming data from Amazon S3 to Amazon S3—no additional configuration is needed.

If a job needs to run in your VPC subnet—for example, transforming data from a JDBC data store in a private subnet—AWS Glue sets up [elastic network interfaces](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_ElasticNetworkInterfaces.html) that enable your jobs to connect securely to other resources within your VPC. Each elastic network interface is assigned a private IP address from the IP address range within the subnet you specified. No public IP addresses are assigned. Security groups specified in the AWS Glue connection are applied on each of the elastic network interfaces. For more information, see [Setting Up a VPC to Connect to JDBC Data Stores](https://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html).

All JDBC data stores that are accessed by the job must be available from the VPC subnet. To access Amazon S3 from within your VPC, a [VPC endpoint](https://docs.aws.amazon.com/glue/latest/dg/vpc-endpoints-s3.html) is required. If your job needs to access both VPC resources and the public internet, the VPC needs to have a Network Address Translation (NAT) gateway inside the VPC.

A job or development endpoint can only access one VPC (and subnet) at a time. If you need to access data stores in different VPCs, you have the following options:

- Use VPC peering to access the data stores. For more about VPC peering, see [VPC Peering Basics](https://docs.aws.amazon.com/vpc/latest/peering/vpc-peering-basics.html).

- Use an Amazon S3 bucket as an intermediary storage location. Split the work into two jobs, with the Amazon S3 output of job 1 as the input to job 2.

For JDBC data stores, you create a connection in AWS Glue with the necessary properties to connect to your data stores. For more information about the connection, see [Adding a Connection to Your Data Store](https://docs.aws.amazon.com/glue/latest/dg/populate-add-connection.html).
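
The connection properties can also be assembled in code. The sketch below builds a `ConnectionInput` structure of the shape accepted by the boto3 Glue client's `create_connection` operation. The helper name `build_jdbc_connection_input` and all example values are hypothetical, and in practice you would not hard-code a password.

```python
def build_jdbc_connection_input(name, jdbc_url, username, password,
                                subnet_id, security_group_ids,
                                availability_zone, require_ssl=True):
    """Build the ConnectionInput dict for a JDBC connection in AWS Glue."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
            # Property values are strings, not booleans.
            "JDBC_ENFORCE_SSL": "true" if require_ssl else "false",
        },
        # The subnet and security groups determine where AWS Glue creates
        # the elastic network interfaces used to reach the data store.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


# Example (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_connection(ConnectionInput=build_jdbc_connection_input(
#       "my-jdbc-connection", "jdbc:postgresql://example-host:5432/mydb",
#       "glue_user", "example-password",
#       "subnet-0123456789abcdef0", ["sg-0123456789abcdef0"], "us-west-2a"))
```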


### Amazon VPC Endpoints for Amazon S3

For security reasons, many AWS customers run their applications within an Amazon Virtual Private Cloud environment (Amazon VPC). With Amazon VPC, you can launch Amazon EC2 instances into a virtual private cloud, which is logically isolated from other networks—including the public internet. With an Amazon VPC, you have control over its IP address range, subnets, routing tables, network gateways, and security settings.

__Note__
If you created your AWS account after 2013-12-04, you already have a default VPC in each AWS Region. You can immediately start using your default VPC without any additional configuration.

For more information, see [Your Default VPC and Subnets](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html) in the Amazon VPC User Guide.

Many customers have legitimate privacy and security concerns about sending and receiving data across the public internet. Customers can address these concerns by using a virtual private network (VPN) to route all Amazon S3 network traffic through their own corporate network infrastructure. However, this approach can introduce bandwidth and availability challenges.

VPC endpoints for Amazon S3 can alleviate these challenges. A VPC endpoint for Amazon S3 enables AWS Glue to use private IP addresses to access Amazon S3 with no exposure to the public internet. AWS Glue does not require public IP addresses, and you don't need an internet gateway, a NAT device, or a virtual private gateway in your VPC. You use endpoint policies to control access to Amazon S3. Traffic between your VPC and the AWS service does not leave the Amazon network.

When you create a VPC endpoint for Amazon S3, any requests to an Amazon S3 endpoint within the Region (for example, s3.us-west-2.amazonaws.com) are routed to a private Amazon S3 endpoint within the Amazon network. You don't need to modify your applications running on EC2 instances in your VPC—the endpoint name remains the same, but the route to Amazon S3 stays entirely within the Amazon network, and does not access the public internet.

For more information about VPC endpoints, see [VPC Endpoints](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html) in the Amazon VPC User Guide.

The following diagram shows how AWS Glue can use a VPC endpoint to access Amazon S3.

<img src="https://docs.aws.amazon.com/glue/latest/dg/images/PopulateCatalog-vpc-endpoint.png" align="left" alt="Glue Concept image" width = "800">

__To set up access for Amazon S3__

1. Sign in to the AWS Management Console and open the Amazon VPC console at https://console.aws.amazon.com/vpc/.

2. In the left navigation pane, choose __Endpoints__.

3. Choose __Create Endpoint__, and follow the steps to create an Amazon S3 endpoint in your VPC.
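
The console steps above could also be scripted. The following is a sketch, assuming a gateway-type endpoint and the boto3 EC2 client's `create_vpc_endpoint` operation. The helper name `create_s3_gateway_endpoint` and the IDs in the example are hypothetical.

```python
def create_s3_gateway_endpoint(ec2_client, vpc_id, region, route_table_ids):
    """Create a gateway VPC endpoint for Amazon S3 and return its ID.

    Associating route tables lets the subnets that use them reach
    Amazon S3 through the endpoint instead of the public internet.
    """
    response = ec2_client.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId=vpc_id,
        ServiceName=f"com.amazonaws.{region}.s3",
        RouteTableIds=list(route_table_ids),
    )
    return response["VpcEndpoint"]["VpcEndpointId"]


# Example (requires boto3 and AWS credentials; the IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-west-2")
#   endpoint_id = create_s3_gateway_endpoint(
#       ec2, "vpc-0123456789abcdef0", "us-west-2", ["rtb-0123456789abcdef0"])
```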

### Setting Up a VPC to Connect to JDBC Data Stores

To enable AWS Glue components to communicate, you must set up access to your data stores, such as Amazon Redshift and Amazon RDS. To enable AWS Glue to communicate between its components, specify a security group with a self-referencing inbound rule for all TCP ports. By creating a self-referencing rule, you can restrict the source to the same security group in the VPC, and it's not open to all networks. The default security group for your VPC might already have a self-referencing inbound rule for ALL Traffic.

__To set up access for Amazon Redshift data stores__

1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://console.aws.amazon.com/redshift/.

2. In the left navigation pane, choose Clusters.

3. Choose the cluster name that you want to access from AWS Glue.

4. In the Cluster Properties section, choose a security group in VPC security groups to allow AWS Glue to use. Record the name of the security group that you chose for future reference. Choosing the security group opens the Amazon EC2 console Security Groups list.

5. Choose the security group to modify and navigate to the Inbound tab.

6. Add a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or confirm that there is a rule of Type All TCP, Protocol is TCP, Port Range includes all ports, and whose Source is the same security group name as the Group ID.

The inbound rule looks similar to the following:

|Type|Protocol|Port Range|Source|
|:-------|:-------:|:-------:|-------:|
|All TCP|TCP|0–65535|database-security-group|


![Set up security group](https://docs.aws.amazon.com/glue/latest/dg/images/SetupSecurityGroup-Start.png)

7. Add a rule for outbound traffic also. Either open outbound traffic to all ports, for example:

|Type|Protocol|Port Range|Destination|
|:-------|:-------:|:-------:|-------:|
|All Traffic|ALL|ALL|0.0.0.0/0|

Or create a self-referencing rule where __Type__ is All TCP, __Protocol__ is TCP, __Port Range__ includes all ports, and whose __Destination__ is the same security group name as the Group ID. If using an Amazon S3 VPC endpoint, also add an HTTPS rule for Amazon S3 access. The `s3-prefix-list-id` is required in the security group rule to allow traffic from the VPC to the Amazon S3 VPC endpoint.

For example:

|Type|Protocol|Port Range|Destination|
|:-------|:-------:|:-------:|-------:|
|All TCP|TCP|0–65535|`security-group`|
|HTTPS|TCP|443|`s3-prefix-list-id`|
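
The inbound and outbound rules above can be expressed as `IpPermissions` entries for the boto3 EC2 client. This is a sketch only: the helper names and the example IDs are hypothetical, and the `authorize_security_group_ingress` and `authorize_security_group_egress` calls fail with a duplicate-rule error if an identical rule already exists.

```python
def self_referencing_tcp_rule(group_id):
    """All TCP ports, with the security group itself as the peer."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": group_id}],
    }


def s3_https_egress_rule(s3_prefix_list_id):
    """HTTPS to the Amazon S3 prefix list, used with an S3 VPC endpoint."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "PrefixListIds": [{"PrefixListId": s3_prefix_list_id}],
    }


def apply_glue_rules(ec2_client, group_id, s3_prefix_list_id=None):
    """Add the self-referencing inbound rule and the restricted outbound rules."""
    ec2_client.authorize_security_group_ingress(
        GroupId=group_id, IpPermissions=[self_referencing_tcp_rule(group_id)]
    )
    egress = [self_referencing_tcp_rule(group_id)]
    if s3_prefix_list_id:
        egress.append(s3_https_egress_rule(s3_prefix_list_id))
    ec2_client.authorize_security_group_egress(GroupId=group_id, IpPermissions=egress)


# Example (requires boto3 and AWS credentials; the IDs are placeholders):
#   import boto3
#   apply_glue_rules(boto3.client("ec2"), "sg-0123456789abcdef0", "pl-0123abcd")
```

This sketch takes the restrictive option for outbound traffic. If the security group still has the default rule that allows all outbound traffic, that rule would need to be removed separately for the restriction to have any effect.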

__To set up access for Amazon RDS data stores__

1. Sign in to the AWS Management Console and open the Amazon RDS console at https://console.aws.amazon.com/rds/.

2. In the left navigation pane, choose __Instances__.

3. Choose the Amazon RDS __Engine__ and __DB Instance__ name that you want to access from AWS Glue.

4. From __Instance Actions__, choose __See Details__. On the __Details__ tab, find the __Security Groups__ name you will access from AWS Glue. Record the name of the security group for future reference.

5. Choose the security group to open the Amazon EC2 console.

6. Confirm that your __Group ID__ from Amazon RDS is chosen, then choose the __Inbound__ tab.

7. Add a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or confirm that there is a rule of __Type__ `All TCP`, __Protocol__ is `TCP`, __Port Range__ includes all ports, and whose __Source__ is the same security group name as the Group ID.

The inbound rule looks similar to this:

|Type|Protocol|Port Range|Source|
|:-------|:-------:|:-------:|-------:|
|All TCP|TCP|0–65535|database-security-group|

![Set up security group](https://docs.aws.amazon.com/glue/latest/dg/images/SetupSecurityGroup-Start.png)

8. Add a rule for outbound traffic also. Either open outbound traffic to all ports, for example:

|Type|Protocol|Port Range|Destination|
|:-------|:-------:|:-------:|-------:|
|All Traffic|ALL|ALL|0.0.0.0/0|

Or create a self-referencing rule where __Type__ is `All TCP`, __Protocol__ is `TCP`, __Port Range__ includes all ports, and whose __Destination__ is the same security group name as the Group ID. If using an Amazon S3 VPC endpoint, also add an HTTPS rule for Amazon S3 access. The `s3-prefix-list-id` is required in the security group rule to allow traffic from the VPC to the Amazon S3 VPC endpoint.

For example:

|Type|Protocol|Port Range|Destination|
|:-------|:-------:|:-------:|-------:|
|All TCP|TCP|0–65535|`security-group`|
|HTTPS|TCP|443|`s3-prefix-list-id`|

### Setting Up Your Environment for Development Endpoints

To run your extract, transform, and load (ETL) scripts with AWS Glue, you sometimes develop and test your scripts using a development endpoint. When you set up a development endpoint, you specify a virtual private cloud (VPC), subnet, and security groups.

__Note__
Make sure you set up your DNS environment for AWS Glue. For more information, see [Setting Up DNS in Your VPC](https://docs.aws.amazon.com/glue/latest/dg/set-up-vpc-dns.html).

### Setting Up Your Network for a Development Endpoint

To enable AWS Glue to access required resources, add a row in your subnet route table to associate a prefix list for Amazon S3 to the VPC endpoint. A prefix list ID is required for creating an outbound security group rule that allows traffic from a VPC to access an AWS service through a VPC endpoint. To make it easier to connect from your local machine to a notebook server that is associated with this development endpoint, add a row to the route table that targets an internet gateway ID. For more information, see [VPC Endpoints](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html). Update the subnet route table to be similar to the following table:

|Destination|Target|
|:-------|-------:|
|10.0.0.0/16|local|
|pl-id for Amazon S3|vpce-id|
|0.0.0.0/0|igw-xxxx|
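
As a quick check that a route table resembles the one above, the following sketch inspects the `Routes` list of one entry returned by the boto3 EC2 client's `describe_route_tables` operation. The helper name `missing_dev_endpoint_routes` is hypothetical, and the check assumes a gateway endpoint, whose route appears with a `pl-` destination and a `vpce-` target.

```python
def missing_dev_endpoint_routes(routes):
    """Return descriptions of the expected routes that are absent."""
    has_s3_route = any(
        route.get("DestinationPrefixListId", "").startswith("pl-")
        and route.get("GatewayId", "").startswith("vpce-")
        for route in routes
    )
    has_igw_route = any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and route.get("GatewayId", "").startswith("igw-")
        for route in routes
    )
    missing = []
    if not has_s3_route:
        missing.append("prefix list route to the Amazon S3 VPC endpoint")
    if not has_igw_route:
        missing.append("default route to an internet gateway")
    return missing


# Example (requires boto3 and AWS credentials; the ID is a placeholder):
#   import boto3
#   tables = boto3.client("ec2").describe_route_tables(
#       RouteTableIds=["rtb-0123456789abcdef0"])
#   print(missing_dev_endpoint_routes(tables["RouteTables"][0]["Routes"]))
```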

To enable AWS Glue to communicate between its components, specify a security group with a self-referencing inbound rule for all TCP ports. By creating a self-referencing rule, you can restrict the source to the same security group in the VPC, and it's not open to all networks. The default security group for your VPC might already have a self-referencing inbound rule for ALL Traffic.

__To set up a security group__

1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

2. In the left navigation pane, choose Security Groups.

3. Either choose an existing security group from the list, or choose __Create Security Group__ to create one to use with the development endpoint.

4. In the security group pane, navigate to the Inbound tab.

5. Add a self-referencing rule to allow AWS Glue components to communicate. Specifically, add or confirm that there is a rule of Type All TCP, Protocol is TCP, Port Range includes all ports, and whose Source is the same security group name as the Group ID.

The inbound rule looks similar to this:

|Type|Protocol|Port Range|Source|
|:-------|:-------:|:-------:|-------:|
|All TCP|TCP|0–65535|<span style="color:red">security-group</span>|

The following shows an example of a self-referencing inbound rule:

![Set up security group](https://docs.aws.amazon.com/glue/latest/dg/images/SetupSecurityGroup-Start.png)

6. Add a rule for outbound traffic also. Either open outbound traffic to all ports, or create a self-referencing rule of Type All TCP, Protocol is TCP, Port Range includes all ports, and whose Destination is the same security group name as the Group ID.

The outbound rule looks similar to one of these rules:

|Type|Protocol|Port Range|Destination|
|:-------|:-------:|:-------:|-------:|
|All TCP|TCP|0–65535|<span style="color:red">security-group</span>|
|All Traffic|ALL|ALL|0.0.0.0/0|

### Setting Up Amazon EC2 for a Notebook Server

With a development endpoint, you can create a notebook server to test your ETL scripts with Zeppelin notebooks. To enable communication to your notebook, specify a security group with inbound rules for both HTTPS (port 443) and SSH (port 22). Ensure that the rule's source is either 0.0.0.0/0 or the IP address of the machine that is connecting to the notebook.

__To set up a security group__

1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

2. In the left navigation pane, choose Security Groups.

3. Either choose an existing security group from the list, or choose __Create Security Group__ to create one to use with your notebook server. The security group that is associated with your development endpoint is also used to create your notebook server.

4. In the security group pane, navigate to the Inbound tab.

Add inbound rules similar to this:

|Type|Protocol|Port Range|Source|
|:-------|:-------:|:-------:|-------:|
|SSH|TCP|22|0.0.0.0/0|
|HTTPS|TCP|443|0.0.0.0/0|

The following shows an example of the inbound rules for the security group:

![Set up security group](https://docs.aws.amazon.com/glue/latest/dg/images/SetupSecurityGroupNotebook-Start.png)
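
The two inbound rules can be generated for a specific client address rather than `0.0.0.0/0`. The sketch below is an illustration under assumptions: the helper name `notebook_ingress_rules` is hypothetical, only IPv4 sources are handled, and the result is in the `IpPermissions` shape used by the boto3 EC2 client's `authorize_security_group_ingress` operation.

```python
import ipaddress


def notebook_ingress_rules(source="0.0.0.0/0"):
    """Build SSH (22) and HTTPS (443) inbound rules for one IPv4 source.

    A bare address such as "203.0.113.7" is normalized to a /32 CIDR.
    An invalid value raises ValueError.
    """
    cidr = str(ipaddress.IPv4Network(source, strict=False))
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr, "Description": description}],
        }
        for port, description in ((22, "SSH to notebook server"),
                                  (443, "HTTPS to notebook server"))
    ]


# Example (requires boto3 and AWS credentials; the values are placeholders):
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="sg-0123456789abcdef0",
#       IpPermissions=notebook_ingress_rules("203.0.113.7"))
```

Restricting the source to the connecting machine's address exposes the notebook server to far fewer hosts than `0.0.0.0/0`, at the cost of updating the rule when that address changes.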

### Setting Up Encryption in AWS Glue

The following example workflow highlights the options to configure when you use encryption with AWS Glue. The example demonstrates the use of specific AWS Key Management Service (AWS KMS) keys, but you might choose other settings based on your particular needs. This workflow highlights only the options that pertain to encryption when setting up AWS Glue.

1. If the user of the AWS Glue console doesn't use a permissions policy that allows all AWS Glue API operations (for example, "glue:*"), confirm that the following actions are allowed:

- "glue:GetDataCatalogEncryptionSettings"

- "glue:PutDataCatalogEncryptionSettings"

- "glue:CreateSecurityConfiguration"

- "glue:GetSecurityConfiguration"

- "glue:GetSecurityConfigurations"

- "glue:DeleteSecurityConfiguration"

2. Any client that accesses or writes to an encrypted catalog—that is, any console user, crawler, job, or development endpoint—needs the following permissions.

```
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": [
      "kms:GenerateDataKey",
      "kms:Decrypt",
      "kms:Encrypt"
    ],
    "Resource": "<key-arns-used-for-data-catalog>"
  }
}
```

3. Any user or role that accesses an encrypted connection password needs the following permissions.

```
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": [
      "kms:Decrypt"
    ],
    "Resource": "<key-arns-used-for-password-encryption>"
  }
}
```

4. The role of any extract, transform, and load (ETL) job that writes encrypted data to Amazon S3 needs the following permissions.

```
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": [
      "kms:Decrypt",
      "kms:Encrypt",
      "kms:GenerateDataKey"
    ],
    "Resource": "<key-arns-used-for-s3>"
  }
}
```

5. Any ETL job or crawler that writes encrypted Amazon CloudWatch Logs requires the following permissions in the key policy (not the IAM policy).

```
{
  "Effect": "Allow",
  "Principal": {
    "Service": "logs.region.amazonaws.com"
  },
  "Action": [
    "kms:Encrypt*",
    "kms:Decrypt*",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
    "kms:Describe*"
  ],
  "Resource": "<arn of key used for ETL/crawler cloudwatch encryption>"
}
```

For more information about key policies, see [Using Key Policies in AWS KMS](https://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html) in the AWS Key Management Service Developer Guide.

6. Any ETL job that uses an encrypted job bookmark needs the following permissions.

```
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": [
      "kms:Decrypt",
      "kms:Encrypt"
    ],
    "Resource": "<key-arns-used-for-job-bookmark-encryption>"
  }
}
```

7. On the AWS Glue console, choose __Settings__ in the navigation pane.

    a. On the __Data catalog settings__ page, encrypt your Data Catalog by selecting __Metadata encryption__. This option encrypts all the objects in the Data Catalog with the AWS KMS key that you choose.

    b. For __AWS KMS key__, choose __aws/glue__. You can also choose a customer master key (CMK) that you created.
    
__Important__
AWS Glue supports only symmetric customer master keys (CMKs). The __AWS KMS key__ list displays only symmetric keys. However, if you select __Choose a KMS key ARN__, the console lets you enter an ARN for any key type. Ensure that you enter only ARNs for symmetric keys.

When encryption is enabled, the client that is accessing the Data Catalog must have AWS KMS permissions.

8. In the navigation pane, choose __Security configurations__. A security configuration is a set of security properties that can be used to configure AWS Glue processes. Then choose __Add security configuration__. In the configuration, choose any of the following options:

    a. Select __S3 encryption__. For __Encryption mode__, choose __SSE-KMS__. For the __AWS KMS key__, choose __aws/s3__ (ensure that the user has permission to use this key). This enables data written by the job to Amazon S3 to be encrypted with the AWS managed AWS KMS key for Amazon S3.

    b. Select __CloudWatch logs encryption__, and choose a CMK (ensure that the user has permission to use this key). For more information, see [Encrypt Log Data in CloudWatch Logs Using AWS KMS](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/encrypt-log-data-kms.html) in the Amazon CloudWatch Logs User Guide.
    
    c. Choose __Advanced properties__, and select __Job bookmark encryption__. For the __AWS KMS key__, choose __aws/glue__ (ensure that the user has permission to use this key). This enables encryption of job bookmarks written to Amazon S3 with the AWS Glue AWS KMS key.

9. In the navigation pane, choose __Connections__.

    a. Choose __Add connection__ to create a connection to the Java Database Connectivity (JDBC) data store that is the target of your ETL job.

    b. To enforce that Secure Sockets Layer (SSL) encryption is used, select __Require SSL connection__, and test your connection.

10. In the navigation pane, choose __Jobs__.

    a. Choose __Add job__ to create a job that transforms data.

    b. In the job definition, choose the security configuration that you created.

11. On the AWS Glue console, run your job on demand. Verify that any Amazon S3 data written by the job, the CloudWatch Logs written by the job, and the job bookmarks are all encrypted.
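
The security configuration from step 8 and the IAM policy documents from steps 2 through 6 can also be built in code. The following sketch produces structures in the shape accepted by the boto3 Glue client's `create_security_configuration` operation and standard IAM policy JSON. The helper names, the configuration name, and the key ARNs are hypothetical placeholders.

```python
import json


def build_encryption_configuration(s3_key_arn, cloudwatch_key_arn, bookmark_key_arn):
    """Build the EncryptionConfiguration for a Glue security configuration."""
    return {
        "S3Encryption": [
            {"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": s3_key_arn}
        ],
        "CloudWatchEncryption": {
            "CloudWatchEncryptionMode": "SSE-KMS",
            "KmsKeyArn": cloudwatch_key_arn,
        },
        # Job bookmarks use client-side encryption (CSE-KMS).
        "JobBookmarksEncryption": {
            "JobBookmarksEncryptionMode": "CSE-KMS",
            "KmsKeyArn": bookmark_key_arn,
        },
    }


def build_kms_policy(actions, key_arns):
    """Build an IAM policy document that allows the given KMS actions."""
    return {
        "Version": "2012-10-17",
        "Statement": {
            "Effect": "Allow",
            "Action": list(actions),
            "Resource": list(key_arns),
        },
    }


# Policy for an ETL job role that writes encrypted data to Amazon S3 (step 4).
# The key ARN is a placeholder.
s3_job_policy = build_kms_policy(
    ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey"],
    ["arn:aws:kms:us-west-2:111122223333:key/example-key-id"],
)
print(json.dumps(s3_job_policy, indent=2))

# Example (requires boto3 and AWS credentials; names and ARNs are placeholders):
#   import boto3
#   boto3.client("glue").create_security_configuration(
#       Name="example-security-configuration",
#       EncryptionConfiguration=build_encryption_configuration(
#           s3_arn, cloudwatch_arn, bookmark_arn))
```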

### AWS Glue Console Workflow Overview

With AWS Glue, you store metadata in the AWS Glue Data Catalog. You use this metadata to orchestrate ETL jobs that transform data sources and load your data warehouse or data lake. The following steps describe the general workflow and some of the choices that you make when working with AWS Glue.

__Note__
You can follow the steps below, or you can create a workflow that automatically performs steps 1 through 3. For more information, see [Performing Complex ETL Activities Using Workflows in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/orchestrate-using-workflows.html).

1. Populate the AWS Glue Data Catalog with table definitions.

In the console, for persistent data stores, you can add a crawler to populate the AWS Glue Data Catalog. You can start the __Add crawler__ wizard from the list of tables or the list of crawlers. You choose one or more data stores for your crawler to access. You can also create a schedule to determine the frequency of running your crawler. For data streams, you can manually create the table definition, and define stream properties.

Optionally, you can provide a custom classifier that infers the schema of your data. You can create custom classifiers using a grok pattern. However, AWS Glue provides built-in classifiers that are automatically used by crawlers if a custom classifier does not recognize your data. When you define a crawler, you don't have to select a classifier. For more information about classifiers in AWS Glue, see [Adding Classifiers to a Crawler](https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html).

Crawling some types of data stores requires a connection that provides authentication and location information. If needed, you can create a connection that provides this required information in the AWS Glue console.

The crawler reads your data store and creates data definitions and named tables in the AWS Glue Data Catalog. These tables are organized into a database of your choosing. You can also populate the Data Catalog with manually created tables. With this method, you provide the schema and other metadata to create table definitions in the Data Catalog. Because this method can be a bit tedious and error prone, it's often better to have a crawler create the table definitions.

For more information about populating the AWS Glue Data Catalog with table definitions, see [Defining Tables in the AWS Glue Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html).

2. Define a job that describes the transformation of data from source to target.

Generally, to create a job, you have to make the following choices:

- Choose a table from the AWS Glue Data Catalog to be the source of the job. Your job uses this table definition to access your data source and interpret the format of your data.

- Choose a table or location from the AWS Glue Data Catalog to be the target of the job. Your job uses this information to access your data store.

- Tell AWS Glue to generate a PySpark script to transform your source to target. AWS Glue generates the code to call built-in transforms to convert data from its source schema to target schema format. These transforms perform operations such as copy data, rename columns, and filter data to transform data as necessary. You can modify this script in the AWS Glue console.

For more information about defining jobs in AWS Glue, see [Authoring Jobs in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/author-job.html).

3. Run your job to transform your data.

You can run your job on demand, or start it based on one of these trigger types:

- A trigger that is based on a cron schedule.

- A trigger that is event-based; for example, the successful completion of another job can start an AWS Glue job.

- A trigger that starts a job on demand.

For more information about triggers in AWS Glue, see [Starting Jobs and Crawlers Using Triggers](https://docs.aws.amazon.com/glue/latest/dg/trigger-job.html).

4. Monitor your scheduled crawlers and triggered jobs.

Use the AWS Glue console to view the following:

- Job run details and errors.

- Crawler run details and errors.

- Any notifications about AWS Glue activities.

For more information about monitoring your crawlers and jobs in AWS Glue, see [Running and Monitoring AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/monitor-glue.html).
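
Steps 1 and 3 of this workflow can be sketched with the AWS SDK for Python (boto3). The helpers below only build request parameters for the Glue client's `create_crawler` and `create_trigger` operations. The helper names and every example value (names, role ARN, Amazon S3 path, schedule) are hypothetical.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Parameters for a crawler that populates the Data Catalog from Amazon S3."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        request["Schedule"] = schedule  # for example "cron(15 12 * * ? *)"
    return request


def build_trigger_request(name, job_name, schedule=None, after_job=None):
    """Parameters for one of the three trigger types described above.

    schedule  -> a trigger based on a cron schedule
    after_job -> an event-based trigger that fires when that job succeeds
    neither   -> a trigger that starts the job on demand
    """
    if schedule and after_job:
        raise ValueError("specify at most one of schedule and after_job")
    request = {"Name": name, "Actions": [{"JobName": job_name}]}
    if schedule:
        request.update(Type="SCHEDULED", Schedule=schedule, StartOnCreation=True)
    elif after_job:
        request.update(
            Type="CONDITIONAL",
            StartOnCreation=True,
            Predicate={
                "Conditions": [
                    {
                        "LogicalOperator": "EQUALS",
                        "JobName": after_job,
                        "State": "SUCCEEDED",
                    }
                ]
            },
        )
    else:
        request["Type"] = "ON_DEMAND"
    return request


# Example (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**build_crawler_request(
#       "example-crawler", "arn:aws:iam::111122223333:role/ExampleGlueRole",
#       "example_db", "s3://example-bucket/input/"))
#   glue.create_trigger(**build_trigger_request(
#       "nightly", "example-job", schedule="cron(0 3 * * ? *)"))
```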

