# Processing CloudTrail logs

This notebook shows how to process CloudTrail log data using Spark.  

## License

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

## Assumptions

* CloudTrail is enabled and writing logs into an S3 bucket in the standard year/month/day format

## Goals

* Periodically take the latest data and convert it to a format that is cheaper to store and faster to query

## Approach

* Turn on S3 object expiration for raw data (not shown here)
* At some interval, perhaps nightly, process the latest CloudTrail data. Running daily keeps the job simple because each run only has to load the newest day's partition. This notebook contains most of that logic but not the orchestration, which could be a job started by the Glue scheduler or by Step Functions triggered from a CloudWatch scheduled event.
* Convert to Parquet and set the partition structure. This example partitions by region and event source; normally we'd include the date as well.
* Append to the existing processed data set (not shown here)
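
As a rough illustration of the "load up the latest partition" step, the sketch below builds the S3 path for the most recent complete day. The helper names `cloudtrail_day_path` and `previous_day` and the bucket, account, and region values are placeholders invented for this example, and it assumes the date folders follow UTC.

```python
from datetime import date, datetime, timedelta, timezone


def cloudtrail_day_path(bucket, account, region, day):
    """Build the S3 glob for one day of CloudTrail logs in one region."""
    return "s3://{}/AWSLogs/{}/CloudTrail/{}/{:%Y/%m/%d}/*.gz".format(
        bucket, account, region, day
    )


def previous_day(today=None):
    """Return the day before `today` (defaults to the current UTC date)."""
    if today is None:
        today = datetime.now(timezone.utc).date()
    return today - timedelta(days=1)


# Placeholder values; substitute your own bucket, account, and region.
path = cloudtrail_day_path(
    "my-ct-bucket", "123456789012", "us-east-1", previous_day(date(2020, 4, 2))
)
print(path)
```

A nightly job could pass the resulting string to `spark.read.json` in place of the hand-built path used in the load cell of this notebook.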

In [1]:
# Standard AWS Glue job imports; most are not used directly in this notebook
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
3,application_1586272817950_0004,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [2]:
# Load in one day's worth of logs from one region
ct_bucket = '<name of S3 bucket containing CloudTrail logs>'
acct = '<your account number>'
region = '<AWS region>'
ct_dt = '<date of interest, e.g. 2020/04/01>'
ct_path = 's3://{}/AWSLogs/{}/CloudTrail/{}/{}/*.gz'.format(ct_bucket, acct, region, ct_dt)
df = spark.read.json(ct_path)

In [5]:
# Each log file is a single JSON object with a Records array, so this counts files, not events
df.count()

1854

In [4]:
df.printSchema()

root
 |-- Records: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- additionalEventData: struct (nullable = true)
 |    |    |    |-- AuthenticationMethod: string (nullable = true)
 |    |    |    |-- CipherSuite: string (nullable = true)
 |    |    |    |-- SSEApplied: string (nullable = true)
 |    |    |    |-- SignatureVersion: string (nullable = true)
 |    |    |    |-- bytesTransferredIn: double (nullable = true)
 |    |    |    |-- bytesTransferredOut: double (nullable = true)
 |    |    |    |-- capabilities: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- clone: boolean (nullable = true)
 |    |    |    |-- configRuleArn: string (nullable = true)
 |    |    |    |-- configRuleInputParameters: string (nullable = true)
 |    |    |    |-- configRuleName: string (nullable = true)
 |    |    |    |-- dataTransferred: boolean (nullable = true)
 |    |    |    |-- managedRuleIdentifier:

In [6]:
from pyspark.sql.functions import explode
# Turn each element of the Records array into its own row, in a column named "col"
dfLong = df.select(explode(df.Records))

In [7]:
dfLong.printSchema()

root
 |-- col: struct (nullable = true)
 |    |-- additionalEventData: struct (nullable = true)
 |    |    |-- AuthenticationMethod: string (nullable = true)
 |    |    |-- CipherSuite: string (nullable = true)
 |    |    |-- SSEApplied: string (nullable = true)
 |    |    |-- SignatureVersion: string (nullable = true)
 |    |    |-- bytesTransferredIn: double (nullable = true)
 |    |    |-- bytesTransferredOut: double (nullable = true)
 |    |    |-- capabilities: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- clone: boolean (nullable = true)
 |    |    |-- configRuleArn: string (nullable = true)
 |    |    |-- configRuleInputParameters: string (nullable = true)
 |    |    |-- configRuleName: string (nullable = true)
 |    |    |-- dataTransferred: boolean (nullable = true)
 |    |    |-- managedRuleIdentifier: string (nullable = true)
 |    |    |-- notificationJobType: string (nullable = true)
 |    |    |-- protocol: string (nullab

In [8]:
# After explode there is one row per CloudTrail event
dfLong.count()

50849
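
The jump from 1854 rows to 50849 rows follows from the shape of the data: each input row is one log file whose `Records` array holds many events, and `explode` emits one output row per array element. The plain-Python analogue below shows the same effect on two made-up "files"; the event names are illustrative only, and it runs without Spark.

```python
# Two made-up "log files", each a single JSON object with a Records array.
log_files = [
    {"Records": [{"eventName": "GetObject"}, {"eventName": "PutObject"}]},
    {"Records": [{"eventName": "AssumeRole"}]},
]

# Before explode: one row per file.
rows_before = len(log_files)

# After explode: one row per element of each Records array, held in a
# single column named "col". Like Spark's explode, a missing or empty
# array contributes no rows.
exploded = [
    {"col": record}
    for log_file in log_files
    for record in (log_file.get("Records") or [])
]
rows_after = len(exploded)

print(rows_before, rows_after)
```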

In [None]:
dfLong.head(1)

In [14]:
# Promote the fields of the "col" struct to top-level columns
dfUpOneLevel = dfLong.select("col.*")
dfUpOneLevel.printSchema()

root
 |-- additionalEventData: struct (nullable = true)
 |    |-- AuthenticationMethod: string (nullable = true)
 |    |-- CipherSuite: string (nullable = true)
 |    |-- SSEApplied: string (nullable = true)
 |    |-- SignatureVersion: string (nullable = true)
 |    |-- bytesTransferredIn: double (nullable = true)
 |    |-- bytesTransferredOut: double (nullable = true)
 |    |-- capabilities: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- clone: boolean (nullable = true)
 |    |-- configRuleArn: string (nullable = true)
 |    |-- configRuleInputParameters: string (nullable = true)
 |    |-- configRuleName: string (nullable = true)
 |    |-- dataTransferred: boolean (nullable = true)
 |    |-- managedRuleIdentifier: string (nullable = true)
 |    |-- notificationJobType: string (nullable = true)
 |    |-- protocol: string (nullable = true)
 |    |-- repositoryId: string (nullable = true)
 |    |-- repositoryName: string (nullable = true)
 |    |--

In [16]:
dfUpOneLevel.write.partitionBy("awsRegion","eventSource").parquet("s3://" + ct_bucket + "/parquet_pt/")
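
The write above uses Spark's default save mode, which raises an error if the output path already exists, so it would not work as-is for a recurring job. The sketch below is one possible way to cover the two steps the Approach section leaves out: adding the date to the partition structure and appending to the existing data set. The function name `append_partitioned` and the derived `eventDate` column are hypothetical, and the code assumes the flattened records carry the CloudTrail `eventTime` field as an ISO 8601 string.

```python
def append_partitioned(df, output_path):
    """Append df to a Parquet data set partitioned by date, region, and source.

    Assumes df has the CloudTrail columns eventTime (an ISO 8601 string such
    as 2020-04-01T12:34:56Z), awsRegion, and eventSource.
    """
    # The first 10 characters of eventTime are the date, YYYY-MM-DD.
    with_date = df.withColumn("eventDate", df["eventTime"].substr(1, 10))
    (
        with_date.write
        .mode("append")  # add to existing data instead of failing on an existing path
        .partitionBy("eventDate", "awsRegion", "eventSource")
        .parquet(output_path)
    )
```

In this notebook it could be called as `append_partitioned(dfUpOneLevel, "s3://" + ct_bucket + "/parquet_pt/")`. Putting the date first in the partition list means each nightly run only adds new `eventDate=` folders, but note that rerunning the job for the same day would append duplicate rows.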
