Getting Started with the Amazon Sustainability Data Initiative (ASDI)

Getting Started

The Amazon Sustainability Data Initiative (ASDI) provides a collection of freely available datasets for use by researchers, developers and innovators working in sustainability. This guide provides an overview of how to get started with ASDI: what kinds of data are available, examples of ways to work with data in the cloud, and an overview of some of the resources and libraries well-suited for ASDI data.

AWS Basics

ASDI data is hosted on the Amazon Web Services (AWS) Cloud. You can find a general overview of cloud computing here.

An advantage of storing data in the cloud is that the data is accessible from virtually anywhere in the world. Another is that anyone can use the scalable infrastructure to run analysis and computation on demand, close to where the data is stored. AWS offers a wide range of both storage and compute services; this guide will focus on those most relevant to working with ASDI data.

Setting up an AWS account

To work with the ASDI data in the cloud and leverage AWS services, you need to create an AWS account, which you can learn how to do here. If you're new to AWS or want to make sure your account is set up correctly, please see this guide to AWS account best practices.

AWS Storage

ASDI data is stored on Amazon Simple Storage Service (S3). S3 is an object storage service that can store any volume of data of any type, and that allows data to be made public or private at the owner's discretion. S3 data is stored in buckets, and a bucket can hold anything from a single zero-byte file to billions of objects and petabytes of data.

You can view S3 documentation here, and a 10-minute 'getting started with S3' tutorial here.

AWS also offers traditional relational databases via the Amazon Relational Database Service (RDS), NoSQL databases like Amazon DynamoDB and Amazon DocumentDB (MongoDB compatible), and file system storage through the Amazon Elastic File System service and Lustre-compatible file systems.

This guide will cover the basics of interacting with S3, with a focus on ASDI data. Follow these links for more information on the full range of AWS storage and AWS database offerings.

AWS Compute

AWS offers a number of compute options, including Linux or Windows server instances, container-based services, and event-driven computing. See AWS compute options in the following section for more details on using AWS compute services with ASDI data, or click here for a full overview of all AWS compute services.

AWS Notifications

Amazon Simple Notification Service (SNS) is a managed notification service, using a pub/sub model, which some ASDI dataset managers use to notify subscribers of updates to their datasets. Details are listed in the 'Resources on AWS' section of dataset pages that support notifications. See the GOES dataset for an example.

Working with ASDI Data

Each ASDI dataset is stored in its own S3 bucket, which is managed by a Data Provider (e.g., NOAA, NASA, UK Met Office). Most ASDI data is available without commercial restriction; license details are available with each dataset. There may be costs associated with querying ASDI data, or with running compute resources on AWS to take advantage of it. There is an AWS free tier that covers some limited usage. You can also apply for a cloud grant to offset the costs of experimentation.

Available datasets

ASDI datasets are hosted through the AWS Public Dataset Program, which covers the storage and egress costs for publicly available, high-value datasets. The datasets are discoverable through the Registry of Open Data on AWS, where they are tagged for sustainability. You can also explore a list of those datasets here.

Accessing datasets

To access the data, you can use either the AWS Command Line Interface or HTTP. The examples below use NOAA's National Water Model dataset.

AWS Command Line Interface

aws s3 ls s3://noaa-nwm-pds/

aws s3 cp s3://noaa-nwm-pds/nwm.20190923/short_range/nwm.t19z.short_range.terrain_rt.f018.conus.nc .

Via HTTP

http://noaa-nwm-pds.s3.amazonaws.com/nwm.20190923/short_range/nwm.t19z.short_range.terrain_rt.f018.conus.nc
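The same object can also be fetched from Python using only the standard library. This is a minimal sketch that assumes the bucket allows anonymous HTTP reads, as the URL above implies. The helper names `public_url` and `download` are illustrative and not part of any AWS SDK.

```python
import shutil
from urllib.parse import quote
from urllib.request import urlopen


def public_url(bucket: str, key: str) -> str:
    """Build the virtual-hosted-style HTTP URL for a public S3 object."""
    # Percent-encode the key but keep "/" so the path structure survives.
    return f"http://{bucket}.s3.amazonaws.com/{quote(key, safe='/')}"


def download(bucket: str, key: str, dest: str) -> None:
    """Stream the object to a local file (assumes anonymous reads are allowed)."""
    with urlopen(public_url(bucket, key)) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


# Example call (requires network access, so it is left commented out):
# download(
#     "noaa-nwm-pds",
#     "nwm.20190923/short_range/nwm.t19z.short_range.terrain_rt.f018.conus.nc",
#     "terrain_rt.f018.conus.nc",
# )
```

Streaming with `shutil.copyfileobj` avoids loading the whole file into memory, which matters for large NetCDF files.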

Querying and analyzing data

In general, there are three options for working with ASDI data: query or process it in place, transfer it to your own AWS account, or download it.

Depending on their content and structure, some datasets can be queried directly without having to be transferred or downloaded. This can be done using Amazon Athena, which lets you define a schema for existing data residing in flat files, and then query it using standard SQL. For in-depth examples using ASDI data, see the blog posts "Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight" and "Querying OpenStreetMap with Amazon Athena".
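As a rough illustration of the query-in-place pattern, the sketch below builds the arguments for the `start_query_execution` call of the AWS SDK for Python (Boto3). The table `station_observations`, its columns, the database `asdi_demo`, and the results bucket `my-athena-results` are all hypothetical; substitute a schema you have defined in Athena yourself.

```python
def athena_request(sql: str, database: str, output_location: str) -> dict:
    """Build keyword arguments for Boto3's Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: dict) -> str:
    """Submit the query and return its execution ID (needs boto3 and credentials)."""
    import boto3  # imported lazily so athena_request works without it installed

    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]


# Hypothetical table and columns, for illustration only.
SQL = """
SELECT station_id, AVG(value) AS mean_value
FROM station_observations
WHERE element = 'TMAX'
GROUP BY station_id
LIMIT 10
"""

request = athena_request(SQL, "asdi_demo", "s3://my-athena-results/asdi/")
# execution_id = run_query(request)  # uncomment once the schema and bucket exist
```

Athena runs queries asynchronously, so the execution ID returned here would then be used to poll for completion and fetch results.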

Data can also be copied into your own AWS account and then queried, analyzed, or transformed however you choose. You can transfer data from S3 using the AWS SDKs or the CLI, details of which can be found below. You can see examples of using the AWS Python SDK to work with S3 here. Note that transferring data to your own bucket may lead to storage and egress fees, so we encourage you, as much as possible, to run your analysis on the data hosted in the Public Dataset buckets.
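A copy into your own bucket might look like the following sketch. It assumes you pass in a Boto3 S3 client created with your own credentials; the function names and the destination bucket `my-analysis-bucket` are hypothetical.

```python
def list_keys(s3_client, bucket: str, prefix: str = "") -> list:
    """Return every object key under a prefix, following pagination."""
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def copy_prefix(s3_client, source_bucket: str, prefix: str, dest_bucket: str) -> int:
    """Copy all objects under a prefix into a bucket you own; return the count."""
    count = 0
    for key in list_keys(s3_client, source_bucket, prefix):
        s3_client.copy({"Bucket": source_bucket, "Key": key}, dest_bucket, key)
        count += 1
    return count


# Example usage (requires boto3, credentials, and a bucket you own):
# import boto3
# s3 = boto3.client("s3")
# copy_prefix(s3, "noaa-nwm-pds", "nwm.20190923/short_range/", "my-analysis-bucket")
```

Restricting the copy to a narrow prefix keeps the storage cost mentioned above proportional to what you actually need.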

There are a number of options available if you want to transform ASDI data into a format better suited to your purposes. These include AWS Glue, an ETL service; Apache Spark; and Amazon Kinesis, a streaming service that can transform JSON into Apache Parquet or ORC, tabular data formats that provide more efficient querying. Learn more about using tabular formats with Athena here. Tabular data may not be the ideal structure if you are looking to analyze geospatial data. Instead, Cloud-Optimized GeoTIFF (COG) is a good format to explore (see Landsat data on AWS).

AWS compute options

Although you can download ASDI data for use on-premises, this pattern leads to duplication of storage, incurs data transfer latency and can make it difficult to discern data provenance. By doing computation on AWS, close to where ASDI data resides, you can reduce latency and increase access speeds, more easily work with large volumes of data, and allow collaborators access to your work. It also makes available to you an array of cloud services well-suited for research and scientific computing. See the AWS documentation for a full overview of all AWS compute modalities. Below is a brief summary of some of those most relevant to analyzing ASDI data for sustainability applications.

Amazon EC2

Amazon Elastic Compute Cloud (EC2) allows you to create Linux or Windows-based server instances on AWS, which you can configure to meet your needs.

Amazon ECS

Docker containers allow you to package code and dependent libraries in a convenient, self-contained unit. Containers run on top of host operating systems, are lighter weight than full instances, and are designed to be easily portable between different compute environments. For instance, a container running on a laptop should be able to run without modification in the cloud. Containers can be used to run most workloads that a traditional cloud instance might be used for (data analysis, web applications, GIS servers, and so on).

There are a number of ways to manage container-based workloads on AWS including:

  • AWS Batch, which enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS.
  • AWS Fargate, which is a compute engine for Amazon ECS that allows you to run containers without having to manage servers or clusters.
  • Amazon Elastic Container Service (Amazon ECS), which is a highly scalable, high-performance container orchestration service that supports Docker containers and allows you to easily run and scale containerized applications on AWS.
  • Amazon Elastic Kubernetes Service (Amazon EKS), which makes it easy to deploy, manage, and scale containerized applications using Kubernetes on AWS.

AWS Lambda

AWS Lambda is an event-driven service in which functions are triggered in response to things like HTTP calls or notifications from other services, for instance when a file is uploaded to S3. To create a Lambda function, you upload the function code and configure the event it's tied to, and AWS manages the underlying infrastructure involved; you have no need to provision servers or containers. Lambdas are typically single-purpose functions, and need to be able to run in a limited timeframe and memory footprint: a maximum of 15 minutes and 10 GB, respectively (longer-running processes are more appropriate for containers or running directly on EC2). A group of Lambda functions can interoperate to form an application, and they are well-suited to microservice architectures.
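To make the S3 upload case concrete, here is a minimal sketch of a Python handler that reads the bucket and key from a standard S3 event notification. The function name `handler` and the shape of the return value are illustrative choices, and a real function would go on to process each object rather than just list it.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Lambda entry point: collect the bucket and key of each uploaded object."""
    uploaded = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        uploaded.append({"bucket": bucket, "key": key})
    return {"count": len(uploaded), "objects": uploaded}
```

Because the handler is a plain function, it can be exercised locally by passing it a dictionary shaped like an S3 event, with `None` for the context argument.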

SDK & CLI access

Software Development Kits, or SDKs, are code libraries specific to a programming language, which package functionality related to a specific software package, framework, platform, or service. For example, there are SDKs to help developers build ArcGIS applications using Java, or to access NOAA data using Python.

AWS has official SDKs for many popular languages, which make it easier to work with AWS services from within applications or programs written in those languages. Community-created SDKs and libraries are also available for some languages that are not officially supported.

Python

There is an officially supported AWS SDK for Python (Boto3), and the NumPy library supports many scientific and mathematical tasks.

C

C99 is supported by an open-source AWS project on GitHub, awslabs/aws-c-common. Note that there is an official SDK for C++.

Other AWS-supported languages

Official SDKs are available for Python, JavaScript, Node.js, C++, Go, Ruby, .NET, and PHP. See the AWS SDKs and Programming Toolkits page for details.

MATLAB

MATLAB natively supports reading and writing data to and from S3.

R

The community-run cloudyr project has packages supporting many AWS storage and compute services, including S3.

CLI

You can also call AWS APIs using the AWS Command Line Interface (CLI). This tool, available for Linux, Mac, and Windows, lets you control AWS services from the command line and automate them through scripts. If you're working in a language for which there is no SDK, for instance Fortran, you can invoke the CLI from your programs, or write wrappers around it to do so.
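The wrapper idea can be sketched as follows. Python stands in here for any language that can launch a subprocess. The function names are hypothetical, and the sketch assumes the `aws` executable is installed and on your PATH.

```python
import subprocess


def s3_ls_command(bucket: str, prefix: str = "", anonymous: bool = True) -> list:
    """Build the argument list for `aws s3 ls`."""
    command = ["aws", "s3", "ls", f"s3://{bucket}/{prefix}"]
    if anonymous:
        # --no-sign-request skips credential signing, for public buckets.
        command.append("--no-sign-request")
    return command


def s3_ls(bucket: str, prefix: str = "") -> list:
    """Run the command and return one output line per listed entry."""
    result = subprocess.run(
        s3_ls_command(bucket, prefix),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits non-zero
    )
    return result.stdout.splitlines()


# Example call (requires the AWS CLI and network access):
# for line in s3_ls("noaa-nwm-pds"):
#     print(line)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when keys contain spaces or special characters.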

Additional resources

This guide covers a selection of services likely to be used with ASDI data. There are a great many other learning resources available. Good places to start include