Reference: Developing and testing AWS Glue job scripts - AWS Glue (amazon.com)


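	1. Create an AWS named profile.
	2. Open cmd on Windows and run the following command (no quotes, because cmd would store them as part of the value):
	SET PROFILE_NAME=AWS_ATCG_PROFILE
	3. Pull the AWS Glue 3.0 image:
	docker pull amazon/aws-glue-libs:glue_libs_3.0.0_image_01
	4. Start the container with spark-submit or the PySpark REPL shell, as described in the sections that follow.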

spark-submit
You can run an AWS Glue job script by running the spark-submit command on the container.


	1. Run these commands to set up the workspace and execute spark-submit:

	$ export PROFILE_NAME="AWS_ATCG_PROFILE"
	$ export WORKSPACE_LOCATION=/home/glue_user/workspace/src
	$ export SCRIPT_FILE_NAME=sample.py
	$ mkdir -p ${WORKSPACE_LOCATION}/src
	$ vim ${WORKSPACE_LOCATION}/src/${SCRIPT_FILE_NAME}

	$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_spark_submit amazon/aws-glue-libs:glue_libs_3.0.0_image_01 spark-submit /home/glue_user/workspace/src/$SCRIPT_FILE_NAME
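
The shell variables above use bash syntax, which does not carry over to Windows cmd. As a rough, cross-platform sketch, the same docker invocation can be assembled in Python. The function name build_spark_submit_command and its parameters are hypothetical (they are not part of AWS Glue or Docker), and the sketch assumes the workspace directory is mounted at /home/glue_user/workspace/ so that src/<script> inside it resolves to the path handed to spark-submit.

```python
import shlex
from pathlib import Path

# Image tag taken from the docker pull step in this document.
IMAGE = "amazon/aws-glue-libs:glue_libs_3.0.0_image_01"


def build_spark_submit_command(profile_name, workspace_location, script_file_name):
    """Return the docker argument list that runs a Glue script with spark-submit.

    Hypothetical helper: it only builds the argument list and does not start docker.
    """
    aws_dir = Path.home() / ".aws"  # local AWS credentials/config directory
    return [
        "docker", "run", "-it",
        "-v", f"{aws_dir}:/home/glue_user/.aws",
        # Assumption: the workspace root is mounted, and the script lives in its src/ folder.
        "-v", f"{workspace_location}:/home/glue_user/workspace/",
        "-e", f"AWS_PROFILE={profile_name}",
        "-e", "DISABLE_SSL=true",
        "--rm",
        "-p", "4040:4040",
        "-p", "18080:18080",
        "--name", "glue_spark_submit",
        IMAGE,
        "spark-submit", f"/home/glue_user/workspace/src/{script_file_name}",
    ]


if __name__ == "__main__":
    cmd = build_spark_submit_command(
        "AWS_ATCG_PROFILE", "/home/glue_user/workspace/src", "sample.py"
    )
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would start the container without relying on shell variable expansion.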
	
REPL shell (PySpark)

You can run a REPL (read-eval-print loop) shell for interactive development.
Run the following command to start the PySpark REPL shell on the container:

docker run -it -v ~/.aws:/home/glue_user/.aws -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark 





In [None]:
# sample.py

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions


class GluePythonSampleTest:
    def __init__(self):
        params = []
        if '--JOB_NAME' in sys.argv:
            params.append('JOB_NAME')
        args = getResolvedOptions(sys.argv, params)

        self.context = GlueContext(SparkContext.getOrCreate())
        self.job = Job(self.context)

        if 'JOB_NAME' in args:
            jobname = args['JOB_NAME']
        else:
            jobname = "test"
        self.job.init(jobname, args)

    def run(self):
        dyf = read_json(self.context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
        dyf.printSchema()

        self.job.commit()


def read_json(glue_context, path):
    dynamicframe = glue_context.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={
            'paths': [path],
            'recurse': True
        },
        format='json'
    )
    return dynamicframe


if __name__ == '__main__':
    GluePythonSampleTest().run()

The above code requires Amazon S3 permissions in AWS IAM. Grant the identity behind your named profile either the IAM managed policy arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess or a custom IAM policy that allows ListBucket and GetObject on the Amazon S3 path.
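
For the custom-policy route, the following is a minimal sketch of what such a policy document might look like, built as a Python dictionary and printed as JSON. The helper name build_s3_read_policy is hypothetical. The bucket and prefix are taken from the S3 path used in sample.py. The sketch attaches s3:ListBucket to the bucket ARN and s3:GetObject to the object ARNs under the prefix, and you may want to tighten or adapt it for your own account.

```python
import json


def build_s3_read_policy(bucket, prefix=""):
    """Return an IAM policy document allowing read-only access to one S3 prefix.

    Hypothetical helper for illustration; it only builds the JSON structure.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"
    object_arn = f"{bucket_arn}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # ListBucket is a bucket-level action, so it targets the bucket ARN.
            {"Effect": "Allow", "Action": "s3:ListBucket", "Resource": bucket_arn},
            # GetObject is an object-level action, so it targets the object ARNs.
            {"Effect": "Allow", "Action": "s3:GetObject", "Resource": object_arn},
        ],
    }


policy = build_s3_read_policy("awsglue-datasets", "examples/us-legislators/all/")
print(json.dumps(policy, indent=2))
```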

test_sample.py: Sample code for a unit test of sample.py.

In [None]:
# test_sample.py

import pytest
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import sys
from src import sample


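@pytest.fixture(scope="module", autouse=True)
def glue_context():
    # Simulate the --JOB_NAME argument that AWS Glue passes to a job script.
    sys.argv.append('--JOB_NAME')
    sys.argv.append('test_count')

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    context = GlueContext(SparkContext.getOrCreate())
    job = Job(context)
    job.init(args['JOB_NAME'], args)

    # Hand the shared GlueContext to the tests in this module.
    yield context

    # Teardown: commit the job once the module's tests have finished.
    job.commit()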


def test_counts(glue_context):
    dyf = sample.read_json(glue_context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
    assert dyf.toDF().count() == 1961