<!-- DATA PROVIDER INSTRUCTIONS

1. Provide the name of your dataset, replacing the bracketed placeholder text.
2. Update the Registry of Open Data landing page URL by replacing the bracketed placeholder text. The [REGISTRY_YAML_NAME] will correspond to the name of the YAML document in your pull request to the Registry of Open Data on GitHub, minus the .yaml file extension.
3. Remove these comment blocks when you have completed each section.

DATA PROVIDER INSTRUCTIONS -->

# Get to Know a Dataset: [NAME OF DATASET]

This notebook serves as a guided tour of the [[NAME OF DATASET - IN THIS TEMPLATE WE WILL USE THE LAST MILE DATASET EXAMPLE FROM AMAZON]](https://registry.opendata.aws/[REGISTRY_YAML_NAME]) dataset. More usage examples, tutorials, and documentation for this dataset and others can be found at the [Registry of Open Data on AWS](https://registry.opendata.aws/).

<!-- DATA PROVIDER INSTRUCTIONS

The goal of this section is to orient users to the structure of your dataset. 

1. How are key prefixes and objects organized in your S3 bucket?
2. What kinds of filetypes are represented in your dataset?
3. Explain with text what users are expected to encounter, and then demonstrate with code the organizational framework you applied when creating your dataset.
4. The responses to each question section are meant to be expanded or replaced as dictated by your dataset.

DATA PROVIDER INSTRUCTIONS -->

### Q: How have you organized your dataset? Help us understand the key prefix structure of your S3 bucket.

EXAMPLE - REPLACE

At the top level of our S3 bucket, we have a single key prefix "almrrc2021" that in turn contains:

 1. The dataset license (License.txt)
 2. A Readme.txt document
 3. Two key prefixes, "almrrc2021-data-training" and "almrrc2021-data-evaluation" that respectively contain training and testing data in JSON format
 4. These in turn contain prefixes corresponding to [...]
 
 Full documentation for this dataset can be found at: https://pubsonline.informs.org/doi/10.1287/trsc.2022.1173

EXAMPLE - REPLACE



In [None]:
# CODING GUIDELINES FOR DATA PROVIDER
#
# General notebook coding guidelines:
# 1. Assume that your reader understands the basics of Jupyter Notebooks, Python, and their Python environment.
#    The focus of this tutorial is on your dataset.
# 2. For library requirements, list the required libraries in a comment block in "requirements.txt" format
#    (https://pip.pypa.io/en/stable/reference/requirements-file-format/)
# 3. Demonstrate importing libraries with the assumption that the user has correctly installed the required
#    libraries.
# 4. List and load all library dependencies once, at this point of the notebook, unless a complicated dependency
#    set makes it unwieldy.
# 5. Remember, the goal of this tutorial is a 101-level introduction to your dataset using common tools and libraries.
#    Examples using specialized environments and deep-diving methods are better suited to follow-up tutorials.
#
# CODING GUIDELINES FOR DATA PROVIDER

EXAMPLE - REPLACE

First we will import the Python libraries required throughout this notebook.

EXAMPLE - REPLACE

In [None]:
### EXAMPLE - REPLACE


# This notebook requires the following additional libraries
# (please install using the preferred method for your environment, e.g. pip, conda):
#
# boto3 >= 1.38.23
# polars >= 1.30.0
# matplotlib >= 3.10.3 

# Import the libraries required for this notebook
# Built-ins
import json
from pprint import pprint
# Installed libraries
import boto3
import matplotlib.pyplot as plt
import polars
from botocore import UNSIGNED
from botocore.config import Config

### EXAMPLE - REPLACE

EXAMPLE - REPLACE

Next, we will define the location of our dataset, create our boto3 S3 client, and list the top level prefixes in our S3 bucket. Here we see there is only one top-level prefix in our bucket.

EXAMPLE - REPLACE

In [None]:
### EXAMPLE - REPLACE

# Location of the S3 bucket for this dataset
bucket = "amazon-last-mile-challenges"

# List the top level of the bucket using boto3. Because this is a public bucket, we don't need to sign requests.
# Here we set the signature version to UNSIGNED, which allows anonymous access without AWS credentials.
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Print the top-level prefixes
for item in s3.list_objects_v2(Bucket=bucket, Delimiter='/')['CommonPrefixes']:
    print(item['Prefix'])

### EXAMPLE - REPLACE

EXAMPLE - REPLACE

Looking into the top-level S3 prefix of our dataset, we see that the data have been separated into training and evaluation datasets.

EXAMPLE - REPLACE

In [None]:
### EXAMPLE - REPLACE

# List the key prefixes within the top level 'almrrc2021' prefix
for item in s3.list_objects_v2(Bucket=bucket, Prefix='almrrc2021/', Delimiter='/', MaxKeys=10)['CommonPrefixes']:
    print(item['Prefix'])

### EXAMPLE - REPLACE

EXAMPLE - REPLACE

The training and evaluation prefixes are similar in structure, and so we can look into the training portion to get a sense of the deeper structure of the dataset where the data objects reside.

EXAMPLE - REPLACE

In [None]:
### EXAMPLE - REPLACE

# List the keys within the 'almrrc2021/almrrc2021-data-training' prefix.
for item in s3.list_objects_v2(Bucket=bucket, Prefix='almrrc2021/almrrc2021-data-training/', MaxKeys=100)['Contents']:
    print(item['Key'])

### EXAMPLE - REPLACE
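
To get a quick sense of which file types a prefix contains, the listed keys can be tallied by extension. This is a minimal sketch using only the standard library; `count_extensions` is a hypothetical helper, and the keys below are invented for illustration rather than taken from the bucket.

```python
from collections import Counter
from pathlib import PurePosixPath


def count_extensions(keys):
    """Count object keys by file extension; keys without one are grouped under '(none)'."""
    counts = Counter()
    for key in keys:
        if key.endswith("/"):
            continue  # skip zero-byte "folder" placeholder keys
        suffix = PurePosixPath(key).suffix.lower()
        counts[suffix or "(none)"] += 1
    return counts


# Invented keys for illustration; with a real listing, pass the item['Key'] values instead
sample_keys = [
    "example-prefix/",
    "example-prefix/License.txt",
    "example-prefix/data/part_a.json",
    "example-prefix/data/part_b.json",
]
extension_counts = count_extensions(sample_keys)
print(extension_counts)
```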

<!-- DATA PROVIDER INSTRUCTIONS
This section is meant to orient users of your dataset to the formats present in your dataset, particularly if your dataset includes formats that may be unfamiliar to a general data scientist audience. This section should include:

1. Explanation of data format(s) (very common formats can be very briefly described, while less common
   or domain specific formats should include more explanation as well as links to official documentation)
2. Explanation of why the data format was chosen for your dataset
3. Recommendations around software and tooling to work with this data format
4. Explanation of any dataset-specific aspects to your usage of the format
5. Description of AWS services that may be useful to users working with your data
DATA PROVIDER INSTRUCTIONS -->

### Q: What data formats are present in your dataset? What kinds of data are stored using these formats? Can you give any advice for how you work with these data formats?

EXAMPLE - REPLACE

Our dataset comes as a set of JSON (JavaScript Object Notation) files organized as shown in the last section. Generically, JSON is a lightweight, text-based data format that uses key-value pairs and arrays to structure data in a human-readable and machine-parseable way. It supports nested structures and primarily works with:

- Objects (key-value pairs)
- Arrays
- Basic data types (strings, numbers, booleans, null)

Our dataset used this format because JSON:
 - naturally represents delivery route data's hierarchical nature (routes containing stops containing packages)
 - handles mixed data types well (coordinates, timestamps, delivery constraints)
 - is easily processed by most programming languages
 - is human-readable, which helps with data validation and debugging
 - efficiently represents sparse data in cases where not all entries have the same fields

These JSON files contain route-, stop-, and package-level features, as fully documented in [this journal article](https://pubsonline.informs.org/doi/10.1287/trsc.2022.1173).

JSON is well supported by Python through its built-in 'json' library. Packages such as Pandas and Polars can be used to work with JSON as well. Lastly, services such as [Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/querying-JSON.html) and [Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/json-functions.html) can be used to query JSON data at scale using SQL.
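
As a small illustration of these building blocks, the sketch below parses a made-up JSON document (not taken from the dataset) containing nested objects, arrays, a `null`, and a field that appears on only one entry. The field names are assumptions chosen for the example.

```python
import json

# Made-up JSON text for illustration only
raw = """
{
  "route_1": {
    "stops": [
      {"id": "A", "location": [47.6, -122.3], "time_window": null},
      {"id": "B", "location": [47.7, -122.4], "fragile": true}
    ]
  }
}
"""

document = json.loads(raw)
stops = document["route_1"]["stops"]

# JSON null becomes Python None; optional keys are read safely with dict.get
for stop in stops:
    print(stop["id"], stop["location"], stop.get("time_window"), stop.get("fragile", False))
```

Reading optional fields with `dict.get` avoids a `KeyError` when entries do not all share the same keys.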

EXAMPLE - REPLACE

<!-- DATA PROVIDER INSTRUCTIONS
The goal of this section is to demonstrate loading a portion of data from your dataset, and reveal something about its structure.
1. Load an object from S3
2. Show the structure of data in the object
DATA PROVIDER INSTRUCTIONS -->

### Q: Can you show us an example of downloading and loading data from your dataset?

EXAMPLE - REPLACE

As an example, let us load and look into some package data found at s3://amazon-last-mile-challenges/almrrc2021/almrrc2021-data-training/model_build_inputs/package_data.json.

EXAMPLE - REPLACE

In [None]:
### EXAMPLE - REPLACE

# First we'll load the data into a Python dictionary using the built-in json library

file_key = "almrrc2021/almrrc2021-data-training/model_build_inputs/package_data.json"

with s3.get_object(Bucket=bucket, Key=file_key)['Body'] as file_object:
    package_data = json.load(file_object)

### EXAMPLE - REPLACE

EXAMPLE - REPLACE

First, let's take a look at the keys in our newly-loaded dataset. Here we see that top-level keys correspond to route IDs.

EXAMPLE - REPLACE

In [None]:
### EXAMPLE - REPLACE

# pretty print a truncated list of keys in our dictionary
pprint(list(package_data.keys())[:10])

### EXAMPLE - REPLACE

EXAMPLE - REPLACE

Next we'll look at packages associated with the first route ID in our dataset to get a sense for the structure of this file. Here we note that each package ID in this file has dimensions as well as a planned service time duration.

EXAMPLE - REPLACE

In [None]:
### EXAMPLE - REPLACE

# pretty print the structure of an individual route record
pprint(package_data["RouteID_15baae2d-bf07-4967-956a-173d4036613f"])

### EXAMPLE - REPLACE
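
Printing a full route record can produce a lot of output, so a compact summary is sometimes easier to read. The sketch below assumes a record nested as middle-level key (such as 'AH'), then package ID, then package attributes. `summarize_route` is a hypothetical helper, and every ID and value in `sample_route` is invented for illustration.

```python
def summarize_route(route_record):
    """Return the number of middle-level groups and the total number of packages in one route record."""
    package_total = sum(len(packages) for packages in route_record.values())
    return {"groups": len(route_record), "packages": package_total}


# Invented record for illustration only
sample_route = {
    "AH": {
        "PackageID_example-1": {
            "dimensions": {"depth_cm": 10.0, "height_cm": 5.0, "width_cm": 20.0},
            "planned_service_time_seconds": 30.0,
        },
    },
    "BK": {
        "PackageID_example-2": {
            "dimensions": {"depth_cm": 12.0, "height_cm": 8.0, "width_cm": 15.0},
            "planned_service_time_seconds": 45.0,
        },
        "PackageID_example-3": {
            "dimensions": {"depth_cm": 30.0, "height_cm": 10.0, "width_cm": 25.0},
            "planned_service_time_seconds": 60.0,
        },
    },
}

route_summary = summarize_route(sample_route)
print(route_summary)
```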

<!-- DATA PROVIDER INSTRUCTIONS
The goal here is to visualize some aspect of your dataset in order to help users understand it. In addition to helping users of your dataset understand the dataset, an additional goal is to impress!

Please demonstrate any data preprocessing or reshaping required for your visualization(s).

See https://www.reddit.com/r/dataisbeautiful/ for inspiration.
DATA PROVIDER INSTRUCTIONS -->

### Q: A picture is worth a thousand words. Show us a visual (or several!) from your dataset that either illustrates something informative about your dataset, or that you think might excite someone to dig in further.

EXAMPLE - REPLACE

For this example, we will first look at the distribution of package volumes (cm³). To make this easier, we can first flatten the nested dictionary structure we created in the last section.

EXAMPLE - REPLACE

In [None]:
### EXAMPLE - REPLACE

# Create a function to flatten our data structure
def flatten_package_data(data):
    flattened = []
    
    for route_key, route_data in data.items():
        route_id = route_key.split('_')[1]  # Extract RouteID part
        
        # Iterate through all zone dictionaries (like 'AH')
        for zone_data in route_data.values():
            # Iterate through package dictionaries
            for package_key, package_info in zone_data.items():
                package_id = package_key.split('_')[1]  # Extract PackageID part
                
                flattened.append({
                    'RouteID': route_id,
                    'PackageID': package_id,
                    'depth_cm': package_info['dimensions']['depth_cm'],
                    'height_cm': package_info['dimensions']['height_cm'],
                    'width_cm': package_info['dimensions']['width_cm'],
                    'planned_service_time_seconds': package_info['planned_service_time_seconds']
                })
    
    return polars.DataFrame(flattened)

# Convert to Polars DataFrame
df = flatten_package_data(package_data)

### EXAMPLE - REPLACE

EXAMPLE - REPLACE

Now let's take a look at the first few records of our newly flattened dataset.

EXAMPLE - REPLACE

In [None]:
### EXAMPLE - REPLACE

# Print first few rows of our newly flattened dataset
df.head()

### EXAMPLE - REPLACE

EXAMPLE - REPLACE

Now let's add another column, volume_cm3, that gives us the volume of each package.

EXAMPLE - REPLACE

In [None]:
### EXAMPLE - REPLACE

# Calculate volume by multiplying the depth, height, and width dimensions
df = df.with_columns(
    (polars.col('depth_cm') * polars.col('height_cm') * polars.col('width_cm')).alias('volume_cm3')
)

df.head()

### EXAMPLE - REPLACE

EXAMPLE - REPLACE

Now we can plot the distribution of package volumes for the entirety of our dataset.

EXAMPLE - REPLACE

In [None]:
### EXAMPLE - REPLACE


# Plot using matplotlib
# Set figure size and DPI for better resolution
plt.figure(figsize=(12, 7), dpi=100, facecolor='white')

# Create histogram with custom styling
plt.hist(df['volume_cm3'], 
         bins=30,
         color='#3498db',    # Nice blue color
         edgecolor='white',
         linewidth=1.2,
         alpha=0.8)

# Customize title and labels
plt.title('Distribution of Package Volumes', 
         fontsize=16, 
         pad=20, 
         fontweight='bold')
plt.xlabel('Volume (cm³)', fontsize=12, labelpad=10)
plt.ylabel('Count', fontsize=12, labelpad=10)

# Add grid with custom styling
plt.grid(True, linestyle='--', alpha=0.3, color='gray')

# Customize axes and background
ax = plt.gca()
ax.set_facecolor('#f8f9fa')  # Light gray background
ax.spines['top'].set_visible(False)    # Remove top border
ax.spines['right'].set_visible(False)  # Remove right border
ax.spines['left'].set_linewidth(0.5)   # Thin left border
ax.spines['bottom'].set_linewidth(0.5) # Thin bottom border

# Adjust layout to prevent label clipping
plt.tight_layout()

plt.show()

### EXAMPLE - REPLACE

EXAMPLE - REPLACE

Now, let's take a look and see if we can get a sense of how many packages there are to deliver per route.

EXAMPLE - REPLACE

In [None]:
### EXAMPLE - REPLACE

# Group by RouteID and count PackageID
packages_per_route = df.group_by('RouteID').agg(
    polars.col('PackageID').count()
).get_column('PackageID')

# Create the plot
plt.figure(figsize=(12, 7), dpi=100, facecolor='white')

plt.hist(packages_per_route, 
         bins=30,
         color='#3498db',
         edgecolor='white',
         linewidth=1.2,
         alpha=0.8)

plt.title('Distribution of Packages per Route', 
         fontsize=16, 
         pad=20, 
         fontweight='bold')
plt.xlabel('Number of Packages', fontsize=12, labelpad=10)
plt.ylabel('Number of Routes', fontsize=12, labelpad=10)

plt.grid(True, linestyle='--', alpha=0.3, color='gray')

# Customize axes and background
ax = plt.gca()
ax.set_facecolor('#f8f9fa')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_linewidth(0.5)
ax.spines['bottom'].set_linewidth(0.5)

plt.tight_layout()
plt.show()

# Print some summary statistics using Polars
print(f"Average packages per route: {packages_per_route.mean():.1f}")
print(f"Median packages per route: {packages_per_route.median():.1f}")
print(f"Min packages per route: {packages_per_route.min()}")
print(f"Max packages per route: {packages_per_route.max()}")

### EXAMPLE - REPLACE
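
As a cross-check that does not depend on the flattened DataFrame, the same per-route counts can be computed directly from the nested dictionary with the standard library. This is a sketch under the assumption of a route, then group, then package nesting; `packages_per_route_counts` is a hypothetical helper and `sample_data` is invented for illustration.

```python
from statistics import mean, median


def packages_per_route_counts(data):
    """Count packages in each route of a route -> group -> package nested dictionary."""
    return {
        route_key: sum(len(packages) for packages in route_record.values())
        for route_key, route_record in data.items()
    }


# Invented data for illustration only
sample_data = {
    "RouteID_example-a": {
        "AH": {"PackageID_1": {}, "PackageID_2": {}},
        "BK": {"PackageID_3": {}},
    },
    "RouteID_example-b": {
        "CC": {"PackageID_4": {}},
    },
}

route_counts = packages_per_route_counts(sample_data)
count_values = list(route_counts.values())

print(f"Average packages per route: {mean(count_values):.1f}")
print(f"Median packages per route: {median(count_values):.1f}")
print(f"Min packages per route: {min(count_values)}")
print(f"Max packages per route: {max(count_values)}")
```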

<!-- DATA PROVIDER INSTRUCTIONS
This section is less prescriptive and more freeform than previous sections. The goal here is to show an opinionated example of answering a question using your data. The scale of your dataset may preclude a full example, so feel free to limit the scope of this example (e.g. work on a subset of data). Users should be able to replicate your example in this notebook and get a sense of how they would scale up.

A "toy" example is better than no example.

Ideally, your example would:
1. Transmit some of your domain & dataset experience to the reader, drawing on your own work as much as possible
2. Provide a jumping off point for users to extend your work, and do novel work of their own.

DATA PROVIDER INSTRUCTIONS -->

### Q: What is one question that you have answered using these data? Can you show us how you came to that answer?

EXAMPLE - REPLACE

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent sem enim, finibus vel leo vel, blandit interdum tortor. In lectus dolor, congue eu odio vel, euismod facilisis lacus. Vestibulum aliquet, quam in rhoncus tempus, arcu massa suscipit lacus, nec fermentum quam justo nec lacus. Duis id leo fermentum ante tempor pulvinar eu vitae ligula. Cras feugiat vel ligula sit amet lacinia. Mauris sit amet sem vestibulum ligula volutpat iaculis in eu velit. Sed turpis magna, porta ac nisi vitae, maximus volutpat mi. Vestibulum mattis est eros, nec pellentesque nisi iaculis sed. 

EXAMPLE - REPLACE

<!-- DATA PROVIDER INSTRUCTIONS
This section is, like the previous one, intended to be freeform / non-prescriptive. The goal here is to provide a challenge to the community to do something novel with your dataset. That can either be novel in terms of the task, or novel in terms of methodological or computational approach.

Another way to consider this section is as a wishlist. If you were less constrained by time, cost, skill, etc., what would you like to see achieved using these data? 

The challenge should, however, be somewhat realistic. A challenge that assumes, for example, original data collection is likely to go unanswered.
DATA PROVIDER INSTRUCTIONS -->

### Q: What is one unanswered question that you think could be answered using these data? Do you have any recommendations or advice for someone wanting to answer this question?

EXAMPLE - REPLACE

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent sem enim, finibus vel leo vel, blandit interdum tortor. In lectus dolor, congue eu odio vel, euismod facilisis lacus. Vestibulum aliquet, quam in rhoncus tempus, arcu massa suscipit lacus, nec fermentum quam justo nec lacus. Duis id leo fermentum ante tempor pulvinar eu vitae ligula. Cras feugiat vel ligula sit amet lacinia. Mauris sit amet sem vestibulum ligula volutpat iaculis in eu velit. Sed turpis magna, porta ac nisi vitae, maximus volutpat mi. Vestibulum mattis est eros, nec pellentesque nisi iaculis sed. 

EXAMPLE - REPLACE

# DATA PROVIDER: PLEASE REMEMBER TO CLEAR ALL OUTPUTS BEFORE COMMITTING TO YOUR GITHUB REPOSITORY