Validating data on Amazon S3
This is a very short tutorial on using goodtables.io to continuously validate data hosted on Amazon S3.
Pre-requisites
Instructions
Setting up Amazon S3 bucket and read-only user
- Create a bucket on S3 to hold your data
  - Create the bucket in the us-west-2 region. This is a current limitation of goodtables.io that we're working to fix.
- Create a new IAM user. This user will be used by goodtables.io to read your bucket.
- Make sure you take note of the AWS Access Key ID, AWS Secret Access Key, and the User ARN.
- Go to your bucket's overview page, click on the Permissions tab, and find the Bucket Policy link. We need the following permissions:
  - s3:ListBucket: To list the bucket's contents
  - s3:GetObject: To read the bucket's files
  - s3:GetBucketPolicy, s3:PutBucketPolicy, s3:GetBucketLocation, and s3:PutBucketNotification: To set up the AWS Lambda function that notifies goodtables.io when a new file is added
The final bucket policy should look like:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "statement1",
      "Effect": "Allow",
      "Principal": {
        "AWS": "IAM_USER_ARN"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetBucketPolicy",
        "s3:PutBucketPolicy",
        "s3:PutBucketNotification"
      ],
      "Resource": "arn:aws:s3:::BUCKET_NAME"
    },
    {
      "Sid": "statement2",
      "Effect": "Allow",
      "Principal": {
        "AWS": "IAM_USER_ARN"
      },
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::BUCKET_NAME/*"
    }
  ]
}
Replace IAM_USER_ARN and BUCKET_NAME with your IAM user's ARN and your bucket's name.
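If you manage several buckets, filling in the placeholders by hand gets error-prone. The following is a minimal Python sketch that generates the policy above from the two values. The function name `build_bucket_policy`, the example ARN, and the example bucket name are hypothetical and only serve as illustration; they are not part of goodtables.io.

```python
import json


def build_bucket_policy(iam_user_arn: str, bucket_name: str) -> dict:
    """Return the bucket policy from this tutorial with the placeholders filled in."""
    principal = {"AWS": iam_user_arn}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level permissions: listing plus notification setup.
                "Sid": "statement1",
                "Effect": "Allow",
                "Principal": principal,
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:GetBucketPolicy",
                    "s3:PutBucketPolicy",
                    "s3:PutBucketNotification",
                ],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Object-level permission: reading the files themselves.
                "Sid": "statement2",
                "Effect": "Allow",
                "Principal": principal,
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


# Hypothetical values for illustration only; use your own ARN and bucket name.
policy = build_bucket_policy(
    "arn:aws:iam::123456789012:user/goodtables-reader", "my-data-bucket"
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the Bucket Policy editor. Note that the bucket-level actions apply to the bucket ARN itself, while s3:GetObject applies to the objects under it, which is why the second statement's resource ends in `/*`.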
Setting up goodtables.io
- Log in to goodtables.io using your GitHub account.
- Go to the Manage Sources page, click on the Amazon tab, and then click on the plus sign to the right of the Filter input.
- Fill in the Access Key Id, Secret Access Key, and Bucket Name fields with the details of the IAM user and bucket you created in the previous section.
We're all set. You can now upload data to your bucket. Whenever a file is added or modified, goodtables.io will automatically validate it if it is a tabular file (CSV, XLS, ODS, ...) or a tabular data package.
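To try this out, you can upload a few files and watch for the validation results. The following is a rough Python sketch, assuming you use the boto3 library and credentials that are allowed to write to the bucket (the read-only user created above cannot upload). The helper name `upload_tabular_files` and the extension list are assumptions made for illustration; the list covers only the three formats named explicitly above.

```python
from pathlib import Path

# Assumed extension list; the tutorial only names CSV, XLS, and ODS explicitly.
TABULAR_EXTENSIONS = {".csv", ".xls", ".ods"}


def upload_tabular_files(s3_client, bucket_name, directory):
    """Upload the tabular files found in `directory` and return their keys."""
    uploaded = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in TABULAR_EXTENSIONS:
            # upload_file(Filename, Bucket, Key) is the boto3 S3 client method.
            s3_client.upload_file(str(path), bucket_name, path.name)
            uploaded.append(path.name)
    return uploaded
```

With boto3 installed, a call could look like `upload_tabular_files(boto3.client("s3"), "my-data-bucket", "data/")`, where the bucket name and directory are placeholders. Passing the client in as an argument keeps the helper easy to try against a stub before touching a real bucket.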
Next steps
- Write a table schema to validate the contents of your data
- Configure which files are validated and how