Exploring data with Python and Amazon S3 Select

by Manav Sehgal | on 3 MAY 2019

We hear from public institutions all the time that they are looking to extract more value from their data but struggle to capture, store, and analyze all the data generated by today’s modern digital sources. Data is growing exponentially, coming from new sources, is increasingly diverse, and needs to be securely accessed and analyzed by any number of applications and people. The size, complexity, and varied sources of the data mean that the same technologies and approaches that worked in the past no longer work.

Data Analytics Workflow

A new approach is needed to extract insights and value from data. This approach needs to address the complexities of a multi-step data analytics workflow: setting up durable, secure, and scalable storage for data; moving data from source to destination quickly and at low cost; making data preparation for analytics easy; and making data available for different types of analytics, including ad-hoc, real-time, and predictive.

About AWS Open Data Analytics Notebooks

This notebook is the first in a series of AWS Open Data Analytics Notebooks that follow a step-by-step workflow for open data analytics on the cloud. We present these notebooks with guidance on using the AWS Cloud programmatically, introducing relevant AWS services, explaining the code, reviewing the code outputs, evaluating alternative steps in our workflow, and ultimately designing a reusable API for an open data analytics workflow on the cloud. The first step in this workflow is sourcing the appropriate open dataset(s) for setting up our analytics pipeline. You may want to run these notebooks using Amazon SageMaker. Amazon SageMaker is a fully managed service that covers the entire machine learning workflow to label and prepare your data, choose an algorithm, train the model, tune and optimize it for deployment, make predictions, and take action.

Why Open Datasets

When building analytical models it is best to start with tried and tested open datasets from the problem domain we are solving. This enables us to set up our data analytics workflow, determine the appropriate models and analytical methods, benchmark the results, and collaborate with the open data community before we apply these to our own data. Such open datasets are available at the Registry of Open Data on AWS.

For this notebook let us start with a big open dataset. Big enough that we will struggle to open it in Excel on a laptop; Excel has a limit of around one million rows. We will set up AWS services to source from a 270GB data source, filter and store more than 8 million rows or 100 million data points into a flat file, extract a schema from this file, transform this data, load it into analytics tools, run Structured Query Language (SQL) on this data, perform exploratory data analytics, train and build machine learning models, and visualize all 100 million data points using an interactive dashboard.

Open Data Analytics Architecture

When we complete these workflow notebooks we will have built the following open data analytics architecture. This is a serverless architecture: it requires no software licenses to be procured, and you do not need to manage any virtual servers or operating systems. Billing for each of the services is pay-per-use. You can plug and play any of the 160 AWS services into this stack based on your specific requirements.

Open Data Analytics Architecture

Setup Notebook Environment

We begin by importing the required Python dependencies. We will use the Boto3 Python SDK to access AWS services. Pandas is a popular library providing high-performance, easy-to-use data structures and data analysis tools for Python. The IPython.display and Markdown dependencies are required for well-formatted output from notebook cells. We require botocore for exception handling.

import boto3
import botocore
import pandas as pd
from IPython.display import display, Markdown

Before we start to access AWS services from an Amazon SageMaker notebook we need to ensure that the SageMaker Execution IAM role associated with the notebook instance has permissions to use the specific services, such as Amazon S3.
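
A minimal sanity check, assuming you are running inside a SageMaker notebook instance with the sagemaker SDK installed, is to print the execution role so you can verify its S3 permissions in the IAM console:

# Sketch: print the IAM role this notebook instance assumes so you can verify
# in the IAM console that it allows the S3 calls used throughout this notebook
import sagemaker

print(sagemaker.get_execution_role())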

We will set up an S3 client to call most of the S3 APIs. The S3 resource is required for specific calls such as object loading and copying.

s3 = boto3.client('s3')
s3_resource = boto3.resource('s3')

Create Bucket

Now we come to an important part of the workflow: creating a Python function. Such functions are created throughout this notebook and the others in the series. Think of these functions as reusable APIs for applying everything you learn from the AWS Open Data Analytics Notebooks to your own projects, simply by importing these functions as a library.

Before we source the open dataset from the Registry, we need to define a destination for our data. We will store our open datasets within Amazon S3. S3 storage in turn is organized into buckets with universally unique names. These bucket names form special URLs of the format s3://bucket-name which access the contents of the bucket depending on the security and access policies applied to the bucket and its contents. Buckets can further contain folders and files. Keys are the combination of folder path and file name, or just the file name when the object sits in the bucket root.
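
To make the relationship between buckets, keys, and URLs concrete, here is a small illustration using the bucket and key names created later in this notebook:

# Illustration: an S3 URL has the form s3://<bucket>/<key>
bucket = 'open-data-analytics-taxi-trips'   # bucket names are globally unique
key = 'few-trips/trips-2018-02.csv'         # 'few-trips/' acts as a folder, the rest is the file name
print(f's3://{bucket}/{key}')               # s3://open-data-analytics-taxi-trips/few-trips/trips-2018-02.csv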

Our first function, create_bucket, does just that: it creates a bucket, or returns successfully if a bucket by that name already exists in your account. If the bucket name is already in use by another account, the call raises an exception, which is caught and reported with the message Bucket <name> could not be created.

AWS services can be accessed using the SDK as we are doing right now, using the browser-based console GUI, or using the Command Line Interface (CLI) in an OS terminal or shell. The benefits of using the SDK are reusability of commands across different use cases, handling exceptions with custom actions, and focusing on just the functionality needed by the solution.

def create_bucket(bucket):
    import logging

    try:
        # Request bucket creation; bucket names are globally unique across all AWS accounts
        s3.create_bucket(Bucket=bucket)
    except botocore.exceptions.ClientError as e:
        # Log and report failure, for example when the name is already taken by another account
        logging.error(e)
        return 'Bucket ' + bucket + ' could not be created.'
    return 'Created or already exists ' + bucket + ' bucket.'
create_bucket('open-data-analytics-taxi-trips')
'Created or already exists open-data-analytics-taxi-trips bucket.'

List Buckets

We can confirm that the new bucket has been created by listing the buckets within S3. The list_buckets function takes a match parameter which enables us to search among available buckets and only list the ones which contain the matching string in their name.

def list_buckets(match=''):
    response = s3.list_buckets()
    if match:
        print(f'Existing buckets containing "{match}" string:')
    else:
        print('All existing buckets:')
    for bucket in response['Buckets']:
        # When a match string is given, list only bucket names containing it; otherwise list all
        if not match or match in bucket["Name"]:
            print(f'  {bucket["Name"]}')
list_buckets(match='open')
Existing buckets containing "open" string:
  open-analytics-assistant
  open-data-analytics
  open-data-analytics-taxi-trips
  open-data-on-cloud

List Bucket Contents

Now that we have prepared our destination bucket we can shift our attention to the source of our dataset. The Registry of Open Data on AWS also happens to be a listing of S3-hosted open datasets, so all Registry-listed datasets can be accessed with the same API we use for S3 within our own AWS account.

Next, all we need to do is search the Registry for the dataset we want to analyze. For this notebook let us analyze the New York Taxi Trips dataset. On the dataset description page we make a note of the Amazon Resource Name (ARN), which is arn:aws:s3:::nyc-tlc in this case. We are interested in the last part, which provides access to the open dataset using the s3://nyc-tlc URL.

Let us create a function to list the contents of this open dataset. We will iterate through the keys, or the path names, of the file objects stored within the bucket. The function allows us to match and return only keys which contain the matching string. It also optionally allows us to return only those files in the listing which are less than a certain size in MB. This helps traverse a large open dataset which may contain gigabytes or even terabytes of data spread across hundreds if not thousands of files.

def list_bucket_contents(bucket, match='', size_mb=0):
    bucket_resource = s3_resource.Bucket(bucket)
    total_size_mb = 0
    total_files = 0
    match_size_mb = 0
    match_files = 0
    for key in bucket_resource.objects.all():
        key_size_mb = key.size/1024/1024
        total_size_mb += key_size_mb
        total_files += 1
        # List the key when it contains the search string (or no search string is given)
        # and when it is within the optional size limit (or no size limit is given)
        list_check = not match or match in key.key
        if list_check and (not size_mb or key_size_mb <= size_mb):
            match_files += 1
            match_size_mb += key_size_mb
            print(f'{key.key} ({key_size_mb:3.0f}MB)')

    if match:
        print(f'Matched file size is {match_size_mb/1024:3.1f}GB with {match_files} files')

    print(f'Bucket {bucket} total size is {total_size_mb/1024:3.1f}GB with {total_files} files')

For this notebook we want to list the latest data files matching year 2018, and we also want files which are less than 250MB in size, for reasons explained shortly. Note that the function quickly filters 12 of 251 files within a dataset size of 270GB.

list_bucket_contents(bucket='nyc-tlc', match='2018', size_mb=250)
trip data/green_tripdata_2018-01.csv ( 68MB)
trip data/green_tripdata_2018-02.csv ( 66MB)
trip data/green_tripdata_2018-03.csv ( 71MB)
trip data/green_tripdata_2018-04.csv ( 68MB)
trip data/green_tripdata_2018-05.csv ( 68MB)
trip data/green_tripdata_2018-06.csv ( 63MB)
trip data/green_tripdata_2018-07.csv ( 58MB)
trip data/green_tripdata_2018-08.csv ( 57MB)
trip data/green_tripdata_2018-09.csv ( 57MB)
trip data/green_tripdata_2018-10.csv ( 61MB)
trip data/green_tripdata_2018-11.csv ( 56MB)
trip data/green_tripdata_2018-12.csv ( 59MB)
Matched file size is 0.7GB with 12 files
Bucket nyc-tlc total size is 273.3GB with 251 files

Preview CSV Dataset

Now that we know which files we are interested in for our analytics, we want to write a function to quickly preview this big data at the source without having to download the entire data file locally or open it in Excel. The preview_csv_dataset function takes bucket and key names as parameters to identify the file object to preview. It also takes an optional parameter to determine the number of rows of records to return when previewing the dataset.

We use the Pandas read_csv feature to load data from a web URL into a DataFrame. As Pandas does not recognize S3 URLs out of the box, we first generate a presigned web URL which makes the source data securely available to our DataFrame.

The benefit of this approach is that we can quickly preview CSV-based open datasets from the Registry listings without having to store these datasets in our own S3 account or download them locally.

def preview_csv_dataset(bucket, key, rows=10):
    data_source = {
        'Bucket': bucket,
        'Key': key
    }
    # Generate a presigned URL so Pandas can read the object over HTTPS
    url = s3.generate_presigned_url(
        ClientMethod='get_object',
        Params=data_source
    )
    # Read only the first `rows` records so the preview stays fast on large files
    data = pd.read_csv(url, nrows=rows)
    return data

We can perform some manual analysis based on the preview dataset. We note that the dataset contains 19 columns. Data types are mixed among float, object, and int. We also note potentially categorical features, including trip_type and payment_type among others. Continuous features include fare_amount and trip_distance, among others. Data quality seems good: there is no missing data (nulls) in the preview, except for the single column ehail_fee which contains only NaN values, and the values in the columns seem consistent at a glance. Of course there are formal methods to confirm all these observations (a quick sketch of such checks follows the summary statistics below); however, at this stage we are only interested in filtering and sourcing a dataset for further analytics.

As you will appreciate, the ability to filter a big data repository containing hundreds of files and gigabytes of data, and to preview one of those files without having to download it entirely, is a really powerful feature for our open data analytics workflow.

df = preview_csv_dataset(bucket='nyc-tlc', key='trip data/green_tripdata_2018-02.csv', rows=100)
df.head()
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag RatecodeID PULocationID DOLocationID passenger_count trip_distance fare_amount extra mta_tax tip_amount tolls_amount ehail_fee improvement_surcharge total_amount payment_type trip_type
0 2 2018-02-01 00:39:38 2018-02-01 00:39:41 N 5 97 65 1 0.00 20.0 0.0 0.0 3.00 0.0 NaN 0.0 23.00 1 2
1 2 2018-02-01 00:58:28 2018-02-01 01:05:35 N 1 256 80 5 1.60 7.5 0.5 0.5 0.88 0.0 NaN 0.3 9.68 1 1
2 2 2018-02-01 00:56:05 2018-02-01 01:18:54 N 1 25 95 1 9.60 28.5 0.5 0.5 5.96 0.0 NaN 0.3 35.76 1 1
3 2 2018-02-01 00:12:40 2018-02-01 00:15:50 N 1 61 61 1 0.73 4.5 0.5 0.5 0.00 0.0 NaN 0.3 5.80 2 1
4 2 2018-02-01 00:45:18 2018-02-01 00:51:56 N 1 65 17 2 1.87 8.0 0.5 0.5 0.00 0.0 NaN 0.3 9.30 2 1
df.shape
(100, 19)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 19 columns):
VendorID                 100 non-null int64
lpep_pickup_datetime     100 non-null object
lpep_dropoff_datetime    100 non-null object
store_and_fwd_flag       100 non-null object
RatecodeID               100 non-null int64
PULocationID             100 non-null int64
DOLocationID             100 non-null int64
passenger_count          100 non-null int64
trip_distance            100 non-null float64
fare_amount              100 non-null float64
extra                    100 non-null float64
mta_tax                  100 non-null float64
tip_amount               100 non-null float64
tolls_amount             100 non-null float64
ehail_fee                0 non-null float64
improvement_surcharge    100 non-null float64
total_amount             100 non-null float64
payment_type             100 non-null int64
trip_type                100 non-null int64
dtypes: float64(9), int64(7), object(3)
memory usage: 14.9+ KB
df.describe()
VendorID RatecodeID PULocationID DOLocationID passenger_count trip_distance fare_amount extra mta_tax tip_amount tolls_amount ehail_fee improvement_surcharge total_amount payment_type trip_type
count 100.000000 100.000000 100.00000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 0.0 100.000000 100.000000 100.000000 100.000000
mean 1.840000 1.200000 120.29000 138.890000 1.280000 2.946800 11.735000 0.465000 0.465000 1.054100 0.115200 NaN 0.279000 14.113300 1.540000 1.050000
std 0.368453 0.876172 73.43757 78.900256 0.792388 3.356363 9.716033 0.146594 0.146594 2.011155 0.810462 NaN 0.087957 11.151924 0.520683 0.219043
min 1.000000 1.000000 7.00000 7.000000 1.000000 0.000000 -4.500000 -0.500000 -0.500000 0.000000 0.000000 NaN -0.300000 -5.800000 1.000000 1.000000
25% 2.000000 1.000000 69.00000 68.750000 1.000000 0.947500 6.000000 0.500000 0.500000 0.000000 0.000000 NaN 0.300000 8.195000 1.000000 1.000000
50% 2.000000 1.000000 106.00000 135.000000 1.000000 1.885000 8.750000 0.500000 0.500000 0.000000 0.000000 NaN 0.300000 10.300000 2.000000 1.000000
75% 2.000000 1.000000 168.75000 207.000000 1.000000 3.340000 13.875000 0.500000 0.500000 1.485000 0.000000 NaN 0.300000 16.922500 2.000000 1.000000
max 2.000000 5.000000 256.00000 265.000000 5.000000 20.660000 56.000000 0.500000 0.500000 10.560000 5.760000 NaN 0.300000 63.360000 3.000000 2.000000
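
As noted above, there are formal ways to confirm the missing-data and data-type observations made from the preview. A minimal sketch, using only the 100-row preview DataFrame df created above:

# Sketch: formally confirm the observations made from the preview
print(df.isnull().sum())        # ehail_fee should be the only column with missing values
print(df.dtypes)                # mix of int64, float64, and object columns
print(df['payment_type'].unique(), df['trip_type'].unique())  # candidate categorical features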

Copy Among Buckets

We are now ready to query our dataset, so we copy it over from the S3 bucket listed on the Registry to our own account. To perform this action we first check whether the file already exists in our destination bucket, using the key_exists function; you may run this notebook over several iterations and the data file may already have been copied over. If the file does not exist we copy it from one S3 bucket to the other. You will notice that even for big datasets in GBs the copy operation from S3 bucket to bucket across accounts does not take much time.

def key_exists(bucket, key):
    try:
        s3_resource.Object(bucket, key).load()
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == "404":
            # The key does not exist.
            return False
        else:
            # Something else has gone wrong.
            raise
    else:
        # The key does exist.
        return True

def copy_among_buckets(from_bucket, from_key, to_bucket, to_key):
    if not key_exists(to_bucket, to_key):
        s3_resource.meta.client.copy({'Bucket': from_bucket, 'Key': from_key}, 
                                        to_bucket, to_key)        
        print(f'File {to_key} saved to S3 bucket {to_bucket}')
    else:
        print(f'File {to_key} already exists in S3 bucket {to_bucket}') 
copy_among_buckets(from_bucket='nyc-tlc', from_key='trip data/green_tripdata_2018-02.csv',
                      to_bucket='open-data-analytics-taxi-trips', to_key='few-trips/trips-2018-02.csv')
File few-trips/trips-2018-02.csv already exists in S3 bucket open-data-analytics-taxi-trips

Amazon S3 Select

The Structured Query Language (SQL) SELECT statement is generally associated with relational databases and is a powerful first tool for querying and analyzing a dataset. Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only), and with server-side encrypted objects. This means we do not need to deploy servers, set up databases, or import data into a database before querying our data. Simply copy datasets to S3 and query. S3 Select can query a file of up to 256MB uncompressed and up to 100 columns.

As we build the function to run S3 Select, we capture the results as a set of payload events. This payload includes the records of the results and the statistics of the query operation, which can be useful for estimating the cost of running the query (a quick estimate is sketched after the function below).

def s3_select(bucket, key, statement):
    import io

    s3_select_results = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        Expression=statement,
        ExpressionType='SQL',
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization={'JSON': {}},
    )

    # Accumulate the Records payload chunks before parsing; a single JSON record
    # may be split across two events for larger result sets
    records = []
    for event in s3_select_results['Payload']:
        if 'Records' in event:
            records.append(event['Records']['Payload'])
        elif 'Stats' in event:
            print(f"Scanned: {int(event['Stats']['Details']['BytesScanned'])/1024/1024:5.2f}MB")            
            print(f"Processed: {int(event['Stats']['Details']['BytesProcessed'])/1024/1024:5.2f}MB")
            print(f"Returned: {int(event['Stats']['Details']['BytesReturned'])/1024/1024:5.2f}MB")

    # Each result row is a JSON object on its own line (JSON Lines)
    df = pd.read_json(io.StringIO(b''.join(records).decode('utf-8')), lines=True)
    return df
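
The Stats event makes it straightforward to estimate what a query costs. A minimal sketch, assuming illustrative per-GB prices for data scanned and data returned (check the current Amazon S3 pricing page for the actual S3 Select rates):

# Sketch: estimate query cost from the Stats event (illustrative prices, not current rates)
PRICE_PER_GB_SCANNED = 0.002    # assumed example rate in USD
PRICE_PER_GB_RETURNED = 0.0007  # assumed example rate in USD

def estimate_select_cost(bytes_scanned, bytes_returned):
    gb = 1024 * 1024 * 1024
    return (bytes_scanned / gb) * PRICE_PER_GB_SCANNED + (bytes_returned / gb) * PRICE_PER_GB_RETURNED

# For the query below, which scans about 1.72MB and returns 0.01MB: roughly $0.0000034
print(f'${estimate_select_cost(1.72 * 1024 * 1024, 0.01 * 1024 * 1024):.7f}')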

This is the power of serverless at its best. We did not provision any servers, virtual or otherwise. We did not write more than a handful of lines of code, and even that can be avoided by reusing the s3_select function or by running this operation directly in the AWS Console. We did not set up any database engine. We simply copied a flat file and ran SQL to query it. The query did not even have to scan the entire file to send back structured results for our analysis.

df = s3_select(bucket='open-data-analytics-taxi-trips', key='few-trips/trips-2018-02.csv', 
          statement="""
          select passenger_count, payment_type, trip_distance 
          from s3object s 
          where s.passenger_count = '4' 
          limit 100
          """)
Scanned:  1.72MB
Processed:  1.71MB
Returned:  0.01MB
df.head()
passenger_count payment_type trip_distance
0 4 1 7.20
1 4 1 1.05
2 4 1 0.63
3 4 2 8.41
4 4 2 1.38
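
As mentioned earlier, S3 Select also works on GZIP- or BZIP2-compressed CSV and JSON objects. A minimal sketch, assuming a hypothetical gzipped copy of one of the trip files (the key below is illustrative and is not created in this notebook): only the InputSerialization changes.

# Sketch: the same kind of query against a gzipped CSV object (hypothetical key shown)
compressed = s3.select_object_content(
    Bucket='open-data-analytics-taxi-trips',
    Key='few-trips/trips-2018-02.csv.gz',   # illustrative key, not part of this notebook's data
    Expression="select * from s3object s limit 10",
    ExpressionType='SQL',
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}, 'CompressionType': 'GZIP'},
    OutputSerialization={'JSON': {}},
)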

In case you do not need to manipulate or edit the dataset within your own S3 environment, you can also use S3 Select on the source dataset directly. This saves you the step of copying the dataset over and also saves on storage costs within your account. In fact, if you use list_bucket_contents to match the S3 Select size limits and then use the s3_select function, it becomes a much faster and more flexible preview option than the preview_csv_dataset function described earlier.

df = s3_select(bucket='nyc-tlc', key='trip data/green_tripdata_2018-02.csv', 
          statement="""
          select passenger_count, payment_type, trip_distance 
          from s3object s 
          where s.passenger_count = '4' 
          limit 100
          """)
Scanned:  1.72MB
Processed:  1.71MB
Returned:  0.01MB
df.head()
passenger_count payment_type trip_distance
0 4 1 7.20
1 4 1 1.05
2 4 1 0.63
3 4 2 8.41
4 4 2 1.38

Let us enter the big data leagues now. Let's list all the files in the Registry dataset which match the year 2018, with no constraints on file size this time. We are now reusing the function written earlier, suggesting how the API will be used when we complete the notebook series. This time the results include files beyond 1.5GB in size. We pick a file which is more than 700MB in size for our analysis.

list_bucket_contents(bucket='nyc-tlc', match='2018')
trip data/fhv_tripdata_2018-01.csv (1337MB)
trip data/fhv_tripdata_2018-02.csv (1307MB)
trip data/fhv_tripdata_2018-03.csv (1486MB)
trip data/fhv_tripdata_2018-04.csv (1425MB)
trip data/fhv_tripdata_2018-05.csv (1459MB)
trip data/fhv_tripdata_2018-06.csv (1430MB)
trip data/fhv_tripdata_2018-07.csv (1463MB)
trip data/fhv_tripdata_2018-08.csv (1498MB)
trip data/fhv_tripdata_2018-09.csv (1501MB)
trip data/fhv_tripdata_2018-10.csv (1578MB)
trip data/fhv_tripdata_2018-11.csv (1550MB)
trip data/fhv_tripdata_2018-12.csv (1616MB)
trip data/green_tripdata_2018-01.csv ( 68MB)
trip data/green_tripdata_2018-02.csv ( 66MB)
trip data/green_tripdata_2018-03.csv ( 71MB)
trip data/green_tripdata_2018-04.csv ( 68MB)
trip data/green_tripdata_2018-05.csv ( 68MB)
trip data/green_tripdata_2018-06.csv ( 63MB)
trip data/green_tripdata_2018-07.csv ( 58MB)
trip data/green_tripdata_2018-08.csv ( 57MB)
trip data/green_tripdata_2018-09.csv ( 57MB)
trip data/green_tripdata_2018-10.csv ( 61MB)
trip data/green_tripdata_2018-11.csv ( 56MB)
trip data/green_tripdata_2018-12.csv ( 59MB)
trip data/yellow_tripdata_2018-01.csv (736MB)
trip data/yellow_tripdata_2018-02.csv (714MB)
trip data/yellow_tripdata_2018-03.csv (793MB)
trip data/yellow_tripdata_2018-04.csv (783MB)
trip data/yellow_tripdata_2018-05.csv (777MB)
trip data/yellow_tripdata_2018-06.csv (734MB)
trip data/yellow_tripdata_2018-07.csv (660MB)
trip data/yellow_tripdata_2018-08.csv (660MB)
trip data/yellow_tripdata_2018-09.csv (677MB)
trip data/yellow_tripdata_2018-10.csv (743MB)
trip data/yellow_tripdata_2018-11.csv (686MB)
trip data/yellow_tripdata_2018-12.csv (688MB)
Matched file size is 26.4GB with 36 files
Bucket nyc-tlc total size is 273.3GB with 251 files

You will notice the preview function takes longer to return results for a larger file. It is still usable for preview purposes; however, this is an indication that we need more suitable tools for running our analytics this time. We cannot use S3 Select here due to the 256MB size limit.

preview_csv_dataset(bucket='nyc-tlc', key='trip data/yellow_tripdata_2018-06.csv')
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2018-06-01 00:15:40 2018-06-01 00:16:46 1 0.00 1 N 145 145 2 3.0 0.5 0.5 0.00 0 0.3 4.30
1 1 2018-06-01 00:04:18 2018-06-01 00:09:18 1 1.00 1 N 230 161 1 5.5 0.5 0.5 1.35 0 0.3 8.15
2 1 2018-06-01 00:14:39 2018-06-01 00:29:46 1 3.30 1 N 100 263 2 13.0 0.5 0.5 0.00 0 0.3 14.30
3 1 2018-06-01 00:51:25 2018-06-01 00:51:29 3 0.00 1 N 145 145 2 2.5 0.5 0.5 0.00 0 0.3 3.80
4 1 2018-06-01 00:55:06 2018-06-01 00:55:10 1 0.00 1 N 145 145 2 2.5 0.5 0.5 0.00 0 0.3 3.80
5 1 2018-06-01 00:09:00 2018-06-01 00:24:01 1 2.00 1 N 161 234 1 11.5 0.5 0.5 2.55 0 0.3 15.35
6 1 2018-06-01 00:02:33 2018-06-01 00:13:01 2 1.50 1 N 163 233 1 8.5 0.5 0.5 1.95 0 0.3 11.75
7 1 2018-06-01 00:13:23 2018-06-01 00:16:52 1 0.70 1 N 186 246 1 5.0 0.5 0.5 1.85 0 0.3 8.15
8 1 2018-06-01 00:24:29 2018-06-01 01:08:43 1 5.70 1 N 230 179 2 22.0 0.5 0.5 0.00 0 0.3 23.30
9 2 2018-06-01 00:17:01 2018-06-01 00:23:16 1 0.85 1 N 179 223 2 6.0 0.5 0.5 0.00 0 0.3 7.30

The copy operation for a larger file does not take that much longer though. If interested, you can time the operation by adding the %%time magic function on the first line of the cell, as sketched below. Before timing the operation, do ensure that you delete the file if it already exists in your S3 bucket; you can do so using the AWS Management Console.
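
A minimal sketch of a timed copy cell (run only after deleting many-trips/trips-2018-06.csv from your destination bucket, otherwise the key_exists check skips the copy):

%%time
# Time the bucket-to-bucket copy of the roughly 734MB yellow taxi file
copy_among_buckets(from_bucket='nyc-tlc', from_key='trip data/yellow_tripdata_2018-06.csv',
                   to_bucket='open-data-analytics-taxi-trips', to_key='many-trips/trips-2018-06.csv')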

copy_among_buckets(from_bucket='nyc-tlc', from_key='trip data/yellow_tripdata_2018-06.csv',
                      to_bucket='open-data-analytics-taxi-trips', to_key='many-trips/trips-2018-06.csv')
File many-trips/trips-2018-06.csv already exists in S3 bucket open-data-analytics-taxi-trips

This time when we list our bucket contents we should see both the smaller file used in the earlier S3 Select use case and the larger one we have just copied over.

list_bucket_contents(bucket='open-data-analytics-taxi-trips', match='trips/trips')
few-trips/trips-2018-02.csv ( 66MB)
many-trips/trips-2018-06.csv (734MB)
Matched file size is 0.8GB with 2 files
Bucket open-data-analytics-taxi-trips total size is 0.9GB with 57 files

Change Log

This section captures changes and updates to this notebook across releases.

Source S3 Select - Release 3 MAY 2019

This release adds an alternative workflow for directly querying source datasets on the Registry of Open Data on AWS. You may want to use this alternative workflow if you do not want to retain a copy of the source dataset within your local S3 bucket, saving on workflow steps and storage costs.

Known issue: Running s3_select with a query limit of 1000 or more results in ValueError - Expected object or value. Is this exception caused by the 1MB limit on the maximum length of a record in the result? [TODO] Handle the exception gracefully.

Launch - Release 30 APR 2019

This is the launch release which builds the AWS Open Data Analytics API for exploring open datasets within your Amazon S3 account using S3 Select.

