terraform-aws-github-data-lake

A Terraform module for ingesting GitHub event data

Features

  • An AWS Lambda function acts as the receiver for HTTP webhook events from GitHub
  • Supports a GitHub shared secret to secure the endpoint (see https://docs.github.com/en/developers/webhooks-and-events/webhooks/securing-your-webhooks)
  • AWS Kinesis Data Firehose receives the event data and writes it to S3 in JSON format
  • An AWS Athena table is created and configured to query the data
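
The shared-secret check can be illustrated with a short Python sketch. It assumes the receiver compares the `X-Hub-Signature-256` header that GitHub sends (an HMAC-SHA256 of the raw request body, prefixed with `sha256=`) against a locally computed value. The function name `is_valid_signature` is hypothetical, and the Lambda code shipped with this module may be structured differently.

```python
import hashlib
import hmac
from typing import Optional


def is_valid_signature(secret: str, body: bytes, signature_header: Optional[str]) -> bool:
    """Return True if the header matches the HMAC-SHA256 of the raw body.

    Hypothetical helper for illustration; not taken from this module's source.
    """
    if not signature_header:
        # No signature was sent, so the event cannot be trusted
        return False
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # compare_digest runs in constant time, which avoids timing side channels
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))
```

The signature is computed over the raw bytes of the request body, so a receiver would typically verify it before parsing the JSON.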

Example Usage

See examples/complete/main.tf for an example of how to deploy this Terraform module.

Querying Data

AWS Athena can be used to query the data in S3. This Terraform module sets up the Athena table so that it is possible to immediately begin running queries.

Below is an example query for extracting nested JSON fields:

select json_extract_scalar(repository, '$.name')        as name,
       json_extract_scalar(repository, '$.owner.login') as login
from "my-athena-table"
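
To make the result of that query concrete, here is a small Python stand-in that applies the same two JSON paths to a sample value of the `repository` column. It is only a local approximation for simple `$.key.key` paths and is not how Athena evaluates `json_extract_scalar`; the sample record is made up for illustration.

```python
import json
from typing import Optional


def json_extract_scalar(document: str, path: str) -> Optional[str]:
    """Approximate json_extract_scalar for simple '$.a.b' paths (sketch only)."""
    if not path.startswith("$."):
        raise ValueError("this sketch only handles '$.key.key' style paths")
    value = json.loads(document)
    for key in path[2:].split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # a missing path yields NULL in SQL, None here
        value = value[key]
    # Objects, arrays, and JSON null are not scalars, so they also yield None
    if value is None or isinstance(value, (dict, list)):
        return None
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


# Made-up sample of what the `repository` column might hold for one event
repository = '{"name": "example-repo", "owner": {"login": "example-org"}}'
name = json_extract_scalar(repository, "$.name")
login = json_extract_scalar(repository, "$.owner.login")
```

Here `name` and `login` correspond to the two columns the query selects for that row.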

Testing

Several integration tests are run in test/src/examples_complete_test.go, which are executed on each commit to this repository.

  • An HTTP test event is sent to the Lambda function with the secret set, to validate that the event is received successfully
  • An HTTP test event is sent to the Lambda function without the secret set, to validate that the event is rejected
  • The S3 bucket is inspected to ensure the test event data was written successfully
  • An AWS Athena query is executed and the result is checked to ensure data is returned successfully
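
The first two checks can be sketched in Python as follows. The repository's tests are written in Go, so this is not the actual test code; it assumes the signed request carries GitHub's `X-Hub-Signature-256` header, and `build_test_request` and the URL passed to it are hypothetical. The request is only constructed here, not sent.

```python
import hashlib
import hmac
import json
import urllib.request
from typing import Optional


def build_test_request(url: str, payload: dict, secret: Optional[str] = None) -> urllib.request.Request:
    """Build a POST request for a test event, signed only if a secret is given.

    Hypothetical helper for illustration; not part of this repository.
    """
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json", "X-GitHub-Event": "ping"}
    if secret is not None:
        # Sign the exact bytes that will be sent as the request body
        digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
        headers["X-Hub-Signature-256"] = "sha256=" + digest
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send it to the function URL. The signed request is expected to be accepted, and the unsigned one is expected to be rejected.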

Requirements

| Name | Version |
|------|---------|
| terraform | >= 0.15.1 |
| aws | >= 4.17.1 |

Providers

| Name | Version |
|------|---------|
| archive | n/a |
| aws | >= 4.17.1 |

Modules

| Name | Source | Version |
|------|--------|---------|
| lambda | github.com/champ-oss/terraform-aws-lambda.git | v1.0.97-948bb8b |
| s3 | github.com/champ-oss/terraform-aws-s3.git | v1.0.29-4a98121 |

Resources

| Name | Type |
|------|------|
| aws_glue_catalog_database.this | resource |
| aws_glue_catalog_table.this | resource |
| aws_iam_policy.this | resource |
| aws_iam_role.this | resource |
| aws_iam_role_policy_attachment.firehose | resource |
| aws_iam_role_policy_attachment.s3 | resource |
| aws_iam_role_policy_attachment.this | resource |
| aws_kinesis_firehose_delivery_stream.this | resource |
| aws_sns_topic.this | resource |
| aws_sns_topic_subscription.this | resource |
| archive_file.this | data source |
| aws_iam_policy_document.assume | data source |
| aws_iam_policy_document.this | data source |
| aws_region.this | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| buffer_interval | https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/kinesis_firehose_delivery_stream#buffer_interval | number | 300 | no |
| buffer_size | https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/kinesis_firehose_delivery_stream#buffer_size | number | 5 | no |
| compression_format | https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/kinesis_firehose_delivery_stream#compression_format | string | "UNCOMPRESSED" | no |
| git | Identifier to be used on all resources | string | n/a | yes |
| prefix | https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/kinesis_firehose_delivery_stream#prefix | string | "firehose/" | no |
| protect | Enables deletion protection on eligible resources | bool | true | no |
| runtime | https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html | string | "python3.8" | no |
| shared_secret | https://docs.github.com/en/developers/webhooks-and-events/webhooks/securing-your-webhooks | string | n/a | yes |
| table_string_columns | https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_catalog_table#name | list(string) | ["action", "after", "before", "changes", "check_suite", "check_run", "comment", "issue", "number", "organization", "pull_request", "repository", "sender", "workflow", "workflow_job", "workflow_run"] | no |
| tags | https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html | map(string) | {} | no |

Outputs

| Name | Description |
|------|-------------|
| bucket | S3 bucket name |
| database | https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_catalog_database |
| function_arn | https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function#arn |
| function_name | https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function#function_name |
| function_url | https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function_url#function_url |
| region | AWS Region |
| table | https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_catalog_table |