
# Glue Studio Notebook
You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your AWS configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma-separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma-separated list of pip packages, S3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma-separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma-separated list of additional Jars to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X.                                                                           |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0.                                                       |
| %security_configuration     |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |
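
As an illustration of the magics above, a first cell along the following lines would pin the session settings before any code runs. The worker type and worker count mirror the values reported in this notebook's session log, and the Glue version is one of the two options listed in the table. Treat it as a sketch: it assumes that configuration magics only take effect when they run before the session is created.

```text
%glue_version 3.0
%worker_type G.1X
%number_of_workers 5
```

Running `%status` afterwards is one way to check which configuration the session actually picked up.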

In [None]:
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Reuse the running SparkContext if there is one, then wrap it in a GlueContext.
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
It looks like there is a newer version of the kernel available. The latest version is 0.31 and you have 0.30 installed.
Please run `pip install --upgrade aws-glue-sessions` to upgrade your kernel
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::886206532651:role/hutima-glue
Attempting to use existing AssumeRole session credentials.
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: 362273fc-6974-4c9d-aed1-0549d6cfbf87
Applying the following default arguments:
--glue_kernel_version 0.30
--enable-glue-datacatalog true
Waiting for session 362273fc-6974-4c9d-aed1-0549d6cfbf87 to get into ready

In [1]:
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": False},
    connection_type="s3",
    format="json",
    connection_options={"paths": ["s3://securituhub-finding"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)




In [2]:
df_raw = S3bucket_node1




In [3]:
df_raw.count()

19403


In [None]:
df_raw_selectFields = df_raw.select_fields(paths=["detail.findings"])

In [None]:
df_raw_selectFields.printSchema()

In [10]:
dfc = df_raw_selectFields.relationalize("root", "s3://aws-glue-reda-job")




In [13]:
dfc.keys()

dict_keys(['root', 'root_detail.findings.val.Compliance.StatusReasons', 'root_detail.findings.val.Resources.val.Details.AwsEc2NetworkAcl.Associations', 'root_detail.findings.val.Resources.val.Details.AwsIamRole.InstanceProfileList.val.Roles', 'root_detail.findings.val.FindingProviderFields.Types', 'root_detail.findings.val.Resources.val.Details.AwsIamRole.InstanceProfileList', 'root_detail.findings.val.Resources.val.Details.AwsIamRole.AttachedManagedPolicies', 'root_detail.findings.val.Resources', 'root_detail.findings.val.Resources.val.Details.AwsEc2SecurityGroup.IpPermissions.val.Ipv6Ranges', 'root_detail.findings.val.Resources.val.Details.AwsEc2SecurityGroup.IpPermissionsEgress.val.IpRanges', 'root_detail.findings.val.Resources.val.Details.AwsEc2SecurityGroup.IpPermissions.val.UserIdGroupPairs', 'root_detail.findings.val.Resources.val.Details.AwsEc2SecurityGroup.IpPermissionsEgress', 'root_detail.findings.val.Resources.val.Details.AwsEc2SecurityGroup.IpPermissions.val.IpRanges', 'ro
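
The collection returned by `relationalize` holds the `root` table plus one table per nested array, each keyed by `root_` followed by the array's path. The sketch below shows, in plain Python and without a Glue session, how the child tables can be picked out by name. The helper `child_table_keys` and the short `sample_keys` list are illustrative assumptions; the two child keys are copied from the output above.

```python
def child_table_keys(keys, marker="root_detail.findings"):
    """Return the keys that name child tables, skipping the top-level 'root' table."""
    return [k for k in keys if marker in k]


# A shortened stand-in for dfc.keys(); the real collection has many more entries.
sample_keys = [
    "root",
    "root_detail.findings.val.Compliance.StatusReasons",
    "root_detail.findings.val.Resources",
]

selected = child_table_keys(sample_keys)
for key in selected:
    print(key)
```

In the notebook itself the same filter would be applied to `dfc.keys()` directly.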

In [23]:
from datetime import date

from awsglue import DynamicFrame
from pyspark.sql.functions import lit

todays_date = date.today()
print(todays_date)

# Write every child table produced by relationalize to its own S3 prefix.
for key in dfc.keys():
    if "root_detail.findings" in key:
        df_temp = dfc.select(key)
        spark_df_temp = df_temp.toDF()
        print(key)
        # Rename the id column to id_glue.
        spark_df_temp_id_renamed = spark_df_temp.withColumnRenamed("id", "id_glue")
        # Strip the shared path prefix and replace "/" in column names.
        rename_col = [
            e.replace("detail.findings.val.", "").replace("/", "_")
            for e in spark_df_temp_id_renamed.columns
        ]
        spark_df_temp_clean = spark_df_temp_id_renamed.toDF(*rename_col)
        # Add the processing date as partition columns.
        spark_df_with_partition = (
            spark_df_temp_clean
            .withColumn("process_year", lit(todays_date.year))
            .withColumn("process_month", lit(todays_date.month))
            .withColumn("process_day", lit(todays_date.day))
        )
        df_final = DynamicFrame.fromDF(spark_df_with_partition, glueContext, "r")
        df_final.count()
        S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
            frame=df_final,
            connection_type="s3",
            format="json",
            connection_options={
                "path": f"s3://datalake-reda-01/df/{key}",
                "partitionKeys": ["process_year", "process_month", "process_day"],
            },
            transformation_ctx="S3bucket_node3",
        )

2022-07-25
root_detail.findings.val.Compliance.StatusReasons
12396
root_detail.findings.val.Resources.val.Details.AwsEc2NetworkAcl.Associations
160
root_detail.findings.val.Resources.val.Details.AwsIamRole.InstanceProfileList.val.Roles
5
root_detail.findings.val.FindingProviderFields.Types
19480
root_detail.findings.val.Resources.val.Details.AwsIamRole.InstanceProfileList
5
root_detail.findings.val.Resources.val.Details.AwsIamRole.AttachedManagedPolicies
1042
root_detail.findings.val.Resources
19480
root_detail.findings.val.Resources.val.Details.AwsEc2SecurityGroup.IpPermissions.val.Ipv6Ranges
33
root_detail.findings.val.Resources.val.Details.AwsEc2SecurityGroup.IpPermissionsEgress.val.IpRanges
171
root_detail.findings.val.Resources.val.Details.AwsEc2SecurityGroup.IpPermissions.val.UserIdGroupPairs
105
root_detail.findings.val.Resources.val.Details.AwsEc2SecurityGroup.IpPermissionsEgress
171
root_detail.findings.val.Resources.val.Details.AwsEc2SecurityGroup.IpPermissions.val.IpRanges
9
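
The column cleanup and the date partitioning in the loop above can be tried out without a Glue session. The sketch below repeats the same string replacements in plain Python and builds the Hive-style `column=value` prefix that a partitioned write is expected to produce under each table's path. The helper names `clean_column_name` and `partition_prefix` and the sample column names are hypothetical, chosen only for illustration.

```python
from datetime import date


def clean_column_name(name):
    """Drop the shared 'detail.findings.val.' prefix and replace '/' with '_'."""
    return name.replace("detail.findings.val.", "").replace("/", "_")


def partition_prefix(day):
    """Build the Hive-style partition prefix for the three process_* columns."""
    return (
        f"process_year={day.year}/"
        f"process_month={day.month}/"
        f"process_day={day.day}"
    )


# Hypothetical column names in the shape a child table might have.
columns = ["id_glue", "index", "detail.findings.val.Resources.val.Id"]
cleaned = [clean_column_name(c) for c in columns]
print(cleaned)

# The run date printed in the output above.
prefix = partition_prefix(date(2022, 7, 25))
print(prefix)
```

Because the partition values are integers, the month and day appear without zero padding in the prefix.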