# Create an end to end machine learning workflow using Amazon Athena
---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

---

Importing and transforming data can be one of the most challenging tasks in a machine learning workflow. We provide you with a Jupyter notebook that demonstrates a cost-effective strategy for an extract, transform, and load (ETL) workflow. Using Amazon Simple Storage Service (Amazon S3) and Amazon Athena, you learn how to query and transform data from a Jupyter notebook. Amazon S3 is an object storage service that allows you to store data and machine learning artifacts in buckets. Amazon Athena enables you to interactively query the data stored in those buckets, saving the results of each query as a CSV file in an Amazon S3 location.

The tutorial imports 16 CSV files for the 2019 NYC taxi dataset from multiple Amazon S3 locations. The goal is to predict the fare amount for each ride. From these 16 files, the notebook creates a single ride fare dataset and a single ride info dataset with deduplicated values. We join the deduplicated datasets into a single dataset.

Amazon Athena stores the query results as a CSV file in the specified location. We provide the output to a SageMaker Processing Job to split the data into training, validation, and test sets. While data can be split using queries, a processing job ensures that the data is in a format that's parseable by the XGBoost algorithm.

__Prerequisites:__

The notebook must be run in the us-east-1 AWS Region. Otherwise, you won't be able to access the data used in the tutorial. You also need your own Amazon S3 bucket and a database within Amazon Athena.

For information about creating a bucket, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html). For information about creating a database, see [Create a database](https://docs.aws.amazon.com/athena/latest/ug/getting-started.html#step-1-create-a-database).

Amazon Athena uses the AWS Glue Data Catalog to read the data from Amazon S3 into a database. You must have permissions to use Glue. To clean up, you also need permissions to delete the bucket you've created. For information about providing permissions, see [Identity and access management for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/security-iam.html).

## Solution overview

To create the end to end workflow, we do the following:

1. Create an Amazon Athena client within the us-east-1 AWS Region.
2. Define the `run_athena_query` function, which runs a query and prints its status.
3. Create the `ride_fare` table within your database using all ride fare tables for the year 2019.
4. Create the `ride_info` table using all ride info tables for the year 2019.
5. Create the `ride_info_deduped` and `ride_fare_deduped` tables that have all duplicate values removed from the original tables.
6. Run test queries to get the first ten rows of each table to see whether they have data.
7. Define the `get_query_results` function that takes the query ID and returns comma separated values that can be stored as a dataframe.
8. View the results of the test queries within pandas dataframes.
9. Join the `ride_info_deduped` and `ride_fare_deduped` tables into the `combined_ride_data_deduped` table.
10. Select all values in the combined table.
11. Define the `get_csv_file_location` function to get the Amazon S3 location of the query results.
12. Download the CSV file to our environment.
13. Perform Exploratory Data Analysis (EDA) on the data.
14. Use the results of the EDA to select the relevant features in a query.
15. Use the `get_csv_file_location` function to get the location of those query results.
16. Split the data into training, validation, and test sets using a processing job.
17. Download the test dataset.
18. Take a 20 row sample from the test dataset.
19. Create a dataframe with 20 rows of actual and predicted values.
20. Calculate the RMSE of the data.
21. Clean up the resources created within the notebook.

### Define the run_athena_query function

In the following cell, we define the `run_athena_query` function. It runs an Athena query and waits for its completion.

It takes the following arguments:

- `query_string` (str): The SQL query to be executed.
- `database_name` (str): The name of the Athena database.
- `output_location` (str): The S3 location where the query results are stored.


It returns the query execution ID string.

In [None]:
# Import required libraries
import time
import boto3


def run_athena_query(query_string, database_name, output_location):
    # Create an Athena client
    athena_client = boto3.client("athena", region_name="us-east-1")

    # Start the query execution
    response = athena_client.start_query_execution(
        QueryString=query_string,
        QueryExecutionContext={"Database": database_name},
        ResultConfiguration={"OutputLocation": output_location},
    )

    query_execution_id = response["QueryExecutionId"]
    print(f"Query execution ID: {query_execution_id}")

    while True:
        # Check the query execution status
        query_status = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
        state = query_status["QueryExecution"]["Status"]["State"]

        if state == "SUCCEEDED":
            print("Query executed successfully.")
            break
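        elif state in ("FAILED", "CANCELLED"):
            # Both are terminal states, so stop polling instead of looping forever
            reason = query_status["QueryExecution"]["Status"].get("StateChangeReason", "unknown")
            print(f"Query ended in {state} state: {reason}")
            break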
        else:
            print(f"Query is currently in {state} state. Waiting for completion...")
            time.sleep(5)  # Wait for 5 seconds before checking again

    return query_execution_id
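
The polling step can also be factored into its own helper. The following is a minimal sketch, not part of the original notebook: `wait_for_query`, its `timeout_seconds` parameter, and the `FakeAthenaClient` stub are hypothetical names introduced for illustration. The stub only mimics the shape of the `get_query_execution` response so that the sketch runs without AWS access.

```python
import time


def wait_for_query(client, query_execution_id, poll_seconds=5, timeout_seconds=600):
    """Poll until the query reaches a terminal state and return that state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_query_execution(QueryExecutionId=query_execution_id)
        state = response["QueryExecution"]["Status"]["State"]
        # SUCCEEDED, FAILED, and CANCELLED are the states after which nothing changes
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Query {query_execution_id} still in {state} state")
        time.sleep(poll_seconds)


class FakeAthenaClient:
    """Hypothetical stand-in for the boto3 Athena client, used only for this demo."""

    def __init__(self, states):
        self._states = iter(states)

    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {"Status": {"State": next(self._states)}}}


demo_client = FakeAthenaClient(["QUEUED", "RUNNING", "SUCCEEDED"])
final_state = wait_for_query(demo_client, "example-query-id", poll_seconds=0)
print(final_state)
```

Returning the final state lets the caller decide what to do with a failed or cancelled query, and the timeout keeps a stuck query from blocking the notebook indefinitely.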

### Create the ride_fare table

We've provided you with the query. You must provide the name of the database you created within Amazon Athena and the Amazon S3 output location. If you're not sure how to specify the output location, provide the S3 URI of your bucket, such as `s3://example-s3-bucket`. After running the query, you should get a message that says "Query executed successfully." and a 36 character string in single quotes. That string is the query execution ID.

In [None]:
# SQL query to create the 'ride_fare' table
create_ride_fare_table = """
CREATE EXTERNAL TABLE `ride_fare` (
  `ride_id` bigint, 
  `payment_type` smallint, 
  `fare_amount` float, 
  `extra` float, 
  `mta_tax` float, 
  `tip_amount` float, 
  `tolls_amount` float, 
  `total_amount` float
)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  LINES TERMINATED BY '\n' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://dsoaws/nyc-taxi-orig-cleaned-split-csv-with-header-per-year-multiple-files/ride-fare/year=2019'
TBLPROPERTIES (
  'skip.header.line.count'='1', 
  'transient_lastDdlTime'='1716908234'
);
"""

# Athena database name
database = "example-database-name"

# S3 location for query results
s3_output_location = "s3://example-s3-bucket/example-s3-prefix"

# Execute the query to create the 'ride_fare' table
run_athena_query(create_ride_fare_table, database, s3_output_location)

### Create the ride fare table with the duplicates removed

In [None]:
# SQL query to create a new table with duplicates removed
remove_duplicates_from_ride_fare = """
CREATE TABLE ride_fare_deduped
AS
SELECT DISTINCT *
FROM ride_fare
"""

# Run the preceding query
run_athena_query(remove_duplicates_from_ride_fare, database, s3_output_location)
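
To see what `SELECT DISTINCT *` does and doesn't remove, the following sketch runs the same statement against a small in-memory SQLite table. SQLite stands in for Athena here only so the example runs locally, the four rows are made up, and the two engines differ in other respects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ride_fare (ride_id INTEGER, fare_amount REAL)")
conn.executemany(
    "INSERT INTO ride_fare VALUES (?, ?)",
    [
        (1, 10.0),
        (1, 10.0),  # exact duplicate of the previous row, removed by DISTINCT
        (2, 7.5),
        (2, 8.0),  # same ride_id but a different fare, so DISTINCT keeps it
    ],
)

# Same shape as the notebook query: keep one copy of each fully identical row
conn.execute("CREATE TABLE ride_fare_deduped AS SELECT DISTINCT * FROM ride_fare")

deduped_rows = conn.execute(
    "SELECT ride_id, fare_amount FROM ride_fare_deduped ORDER BY ride_id, fare_amount"
).fetchall()
print(deduped_rows)
```

Only rows that match in every column are collapsed. Two rows that share a `ride_id` but differ in any other column both survive, which is why checking the number of unique `ride_id` values later is still worthwhile.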

### Create the ride_info table

In [None]:
# SQL query to create the ride_info table
create_ride_info_table_query = """
CREATE EXTERNAL TABLE `ride_info` (
  `ride_id` bigint, 
  `vendor_id` smallint, 
  `passenger_count` smallint, 
  `pickup_at` string, 
  `dropoff_at` string, 
  `trip_distance` float, 
  `rate_code_id` int, 
  `store_and_fwd_flag` string
)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  LINES TERMINATED BY '\n' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://dsoaws/nyc-taxi-orig-cleaned-split-csv-with-header-per-year-multiple-files/ride-info/year=2019'
TBLPROPERTIES (
  'skip.header.line.count'='1', 
  'transient_lastDdlTime'='1716907328'
);
"""

# Run the query to create the ride_info table
run_athena_query(create_ride_info_table_query, database, s3_output_location)

### Create the ride info table with the duplicates removed

In [None]:
# SQL query to create table with duplicates removed
remove_duplicates_from_ride_info = """
CREATE TABLE ride_info_deduped
AS
SELECT DISTINCT *
FROM ride_info
"""

# Run the query to create the table with the duplicates removed
run_athena_query(remove_duplicates_from_ride_info, database, s3_output_location)

### Run a test query on ride_info_deduped

In [None]:
test_ride_info_query = """
SELECT * FROM ride_info_deduped limit 10
"""

run_athena_query(test_ride_info_query, database, s3_output_location)

### Run a test query on ride_fare_deduped

In [None]:
test_ride_fare_query = """
SELECT * FROM ride_fare_deduped limit 10
"""

run_athena_query(test_ride_fare_query, database, s3_output_location)

### Define the `get_query_results` function

In the following cell, we define the `get_query_results` function to get the query results in CSV format. The function takes the 36 character query execution ID string. The end of the output of the preceding cell is an example of a query execution ID string.
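
Because the ID is pasted in by hand, a quick format check can catch a placeholder that was never replaced. The sketch below assumes the ID uses the common 8-4-4-4-12 hexadecimal layout, which adds up to 36 characters. The notebook only states the length, so treat the exact pattern as an assumption. `looks_like_query_execution_id` is a hypothetical helper name.

```python
import re

# Assumed layout: 8-4-4-4-12 hexadecimal characters separated by hyphens
_ID_PATTERN = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def looks_like_query_execution_id(value):
    """Return True if value matches the assumed 36-character ID layout."""
    return isinstance(value, str) and _ID_PATTERN.fullmatch(value) is not None


print(looks_like_query_execution_id("12345678-1234-1234-1234-123456789abc"))
print(looks_like_query_execution_id("test_ride_info_query_execution_id"))
```

The first call uses a made-up ID in the assumed layout, and the second uses the placeholder text from this notebook.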

In [None]:
import io


def get_query_results(query_execution_id):
    athena_client = boto3.client("athena", region_name="us-east-1")
    s3 = boto3.client("s3")

    # Get the query execution details
    query_execution = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
    s3_location = query_execution["QueryExecution"]["ResultConfiguration"]["OutputLocation"]

    # Extract bucket and key from S3 output location
    bucket_name, key = s3_location.split("/", 2)[2].split("/", 1)

    # Read the CSV file from Amazon S3 into an in-memory text buffer
    obj = s3.get_object(Bucket=bucket_name, Key=key)
    csv_data = obj["Body"].read().decode("utf-8")
    csv_buffer = io.StringIO(csv_data)

    return csv_buffer
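
The chained `split` calls that separate the bucket from the key can be hard to read. As an alternative sketch, the standard library's `urlparse` gives the same two parts and makes it easy to reject values that aren't S3 URIs. `split_s3_uri` is a hypothetical helper name, not something the notebook defines.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    # The path keeps its leading slash, which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key = split_s3_uri("s3://example-s3-bucket/example-s3-prefix/results.csv")
print(bucket_name, key)
```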

### Read `ride_info_deduped` test query into a dataframe

Specify the query execution ID string in the `get_query_results` function. The output is the head of the dataframe. 

In [None]:
import pandas as pd

# Provide the query execution id of the test_ride_info query to get the query results
ride_info_sample = get_query_results("test_ride_info_query_execution_id")

df_ride_info_sample = pd.read_csv(ride_info_sample)

df_ride_info_sample.head()

### Read `ride_fare_deduped` test query into a dataframe

Specify the query execution ID string in the `get_query_results` function. The output is the head of the resulting dataframe. 

In [None]:
# Provide the query execution id of the test_ride_fare query to get the query results

ride_fare_sample = get_query_results("test_ride_fare_query_execution_id")

df_ride_fare_sample = pd.read_csv(ride_fare_sample)

df_ride_fare_sample.head()

### Join the deduplicated tables together

In [None]:
# SQL query to join the tables into a single table containing all the data.
create_ride_joined_deduped = """
CREATE TABLE combined_ride_data_deduped AS
SELECT 
    rfs.ride_id, 
    rfs.payment_type, 
    rfs.fare_amount, 
    rfs.extra, 
    rfs.mta_tax, 
    rfs.tip_amount, 
    rfs.tolls_amount, 
    rfs.total_amount,
    ris.vendor_id, 
    ris.passenger_count, 
    ris.pickup_at, 
    ris.dropoff_at, 
    ris.trip_distance, 
    ris.rate_code_id, 
    ris.store_and_fwd_flag
FROM 
    ride_fare_deduped rfs
JOIN 
    ride_info_deduped ris
ON 
    rfs.ride_id = ris.ride_id;
"""

# Run the query to create the combined_ride_data_deduped table
run_athena_query(create_ride_joined_deduped, database, s3_output_location)
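
The join is an inner join, so a ride appears in the combined table only when its `ride_id` exists in both deduplicated tables. The following sketch illustrates that behavior on made-up rows, using an in-memory SQLite database as a local stand-in for Athena and only a subset of the columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ride_fare_deduped (ride_id INTEGER, fare_amount REAL);
    CREATE TABLE ride_info_deduped (ride_id INTEGER, trip_distance REAL);
    INSERT INTO ride_fare_deduped VALUES (1, 10.0), (2, 7.5), (3, 22.0);
    INSERT INTO ride_info_deduped VALUES (1, 2.1), (2, 1.4), (4, 9.9);
    """
)

# ride_id 3 has no info row and ride_id 4 has no fare row, so both drop out
joined_rows = conn.execute(
    """
    SELECT rfs.ride_id, rfs.fare_amount, ris.trip_distance
    FROM ride_fare_deduped rfs
    JOIN ride_info_deduped ris ON rfs.ride_id = ris.ride_id
    ORDER BY rfs.ride_id
    """
).fetchall()
print(joined_rows)
```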

### Select all values from the deduplicated table

In [None]:
# SQL query to select all values from the table and create the dataset that we're using for our analysis
ride_combined_full_table_query = """
SELECT * FROM combined_ride_data_deduped
"""

# Run the query to select all values from the combined_ride_data_deduped table
run_athena_query(ride_combined_full_table_query, database, s3_output_location)

### Define get_csv_file_location function and get Amazon S3 location of query results

Specify the query ID from the preceding cell in the function call. The output is the Amazon S3 URI of the dataset. 

In [None]:
# Function to get the Amazon S3 URI location of Amazon Athena select statements
def get_csv_file_location(query_execution_id):
    athena_client = boto3.client("athena", region_name="us-east-1")
    query_execution = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
    s3_location = query_execution["QueryExecution"]["ResultConfiguration"]["OutputLocation"]

    return s3_location


# Provide the 36 character string at the end of the output of the preceding cell as the query execution ID.
get_csv_file_location("ride_combined_full_table_query_execution_id")

### Download the dataset and rename it

Replace the example S3 path in the following cell with the output of the preceding cell. The second command renames the CSV file it downloads to `nyc-taxi-whole-dataset.csv`.

In [None]:
# Use the S3 URI location returned from the preceding cell to download the dataset and rename it.
!aws s3 cp s3://example-s3-bucket/ride_combined_full_table_query_execution_id.csv .
!mv ride_combined_full_table_query_execution_id.csv nyc-taxi-whole-dataset.csv

### Get a 20,000 row sample and some information about it

In [None]:
sample_nyc_taxi_combined = pd.read_csv("nyc-taxi-whole-dataset.csv", nrows=20000)

In [None]:
print("Dataset shape: ", sample_nyc_taxi_combined.shape)

In [None]:
df = sample_nyc_taxi_combined

df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df["vendor_id"].value_counts()

In [None]:
df["passenger_count"].value_counts()

### View the distribution of fare amount values

In [None]:
# Plot to find the distribution of ride fare values
import matplotlib.pyplot as plt

plt.hist(df["fare_amount"], edgecolor="black", bins=30, range=(0, 100))
plt.xlabel("Fare Amount")
plt.ylabel("Count")
plt.show()

### Make sure that all rows are unique

In [None]:
df["ride_id"].nunique()

### Drop the store_and_fwd_flag column

Determining its relevance isn't in scope for this tutorial.

In [None]:
df.drop("store_and_fwd_flag", axis=1, inplace=True)

### Drop the time series columns

Analyzing the time series data also isn't in scope for this analysis.

In [None]:
# We're dropping the time series columns to streamline the analysis.
time_series_columns_to_drop = ["pickup_at", "dropoff_at"]
df.drop(columns=time_series_columns_to_drop, inplace=True)

### Install seaborn and create scatterplots

In [None]:
!pip install seaborn

In [None]:
# Create visualizations showing correlations between variables.
import seaborn as sns

target = "fare_amount"
features = [col for col in df.columns if col != target]

# Create a figure with subplots
fig, axes = plt.subplots(nrows=1, ncols=len(features), figsize=(50, 10))

# Create scatter plots
for i, feature in enumerate(features):
    sns.scatterplot(x=df[feature], y=df[target], ax=axes[i])
    axes[i].set_title(f"{feature} vs {target}")
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel(target)

plt.tight_layout()
plt.show()

### Calculate the correlation coefficient between each feature and fare amount

In [None]:
# extra and mta_tax seem weakly correlated
# total_amount is almost perfectly correlated, indicating target leakage.
continuous_features = [
    "tip_amount",
    "tolls_amount",
    "extra",
    "mta_tax",
    "total_amount",
    "trip_distance",
]

for i in continuous_features:
    correlation = df["fare_amount"].corr(df[i])
    print(i, correlation)
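
For intuition about what the correlation coefficient measures, here is a minimal standard-library sketch of the Pearson formula. `pearson_correlation` is a hypothetical helper and the values are made up. The toy `totals` list is built directly from `fares`, which mimics the kind of target leakage noted for `total_amount`.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("Need two equal-length sequences with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Raises ZeroDivisionError if either sequence is constant
    return covariance / (spread_x * spread_y)


fares = [5.0, 10.0, 15.0, 20.0]
totals = [fare + 1.0 for fare in fares]  # made-up totals: fare plus a flat surcharge
leak_correlation = pearson_correlation(fares, totals)
print(leak_correlation)
```

A feature that is computed from the target correlates almost perfectly with it, but it wouldn't be available at prediction time, so it shouldn't be used for training.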

### Calculate a one way ANOVA between the groups

From running the ANOVA, `mta_tax` and `extra` have the most variance between the groups. We're using them as features to train our model.

In [None]:
# The mta tax and extra have the most variance between the groups
from scipy.stats import f_oneway

# Separate features and target variable
X = df[["payment_type", "extra", "mta_tax", "vendor_id", "passenger_count"]]
y = df["fare_amount"]

# Perform one-way ANOVA for each feature
for feature in X.columns:
    groups = [y[X[feature] == group] for group in X[feature].unique()]
    if len(groups) > 1:
        f_statistic, p_value = f_oneway(*groups)
        print(f"Feature: {feature}, F-statistic: {f_statistic:.2f}, p-value: {p_value:.5f}")
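
The F-statistic is the ratio of the variance between group means to the variance within the groups. The following is a small worked sketch of that calculation on made-up numbers, written without SciPy so each term is visible. `one_way_anova_f` is a hypothetical helper, not a replacement for `f_oneway`, and it doesn't compute a p-value.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F-statistic for a list of numeric groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of each value around its own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = one_way_anova_f(toy_groups)
print(f_value)
```

For these toy groups the between-group sum of squares is 26 with 2 degrees of freedom, and the within-group sum of squares is 6 with 6 degrees of freedom, so the F-statistic works out to 13.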

### Run a query to get the dataset we're using for ML workflow

The XGBoost algorithm on Amazon SageMaker uses the first column as the target column. `fare_amount` must be the first column in our query.

In [None]:
# Final select statement has tip_amount, tolls_amount, extra, mta_tax, trip_distance
ride_combined_notebook_relevant_features_query = """
SELECT fare_amount, tip_amount, tolls_amount, extra, mta_tax, trip_distance FROM combined_ride_data_deduped
"""

run_athena_query(ride_combined_notebook_relevant_features_query, database, s3_output_location)
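
Because the column order matters, one option is to build the `SELECT` statement from a target name and a feature list so that the target always comes first. This is a sketch only. `build_feature_query` is a hypothetical helper, and it assumes the column and table names are trusted identifiers rather than user input.

```python
def build_feature_query(target, features, table):
    """Build a SELECT statement that always lists the target column first."""
    if target in features:
        raise ValueError("The target column must not also be listed as a feature")
    columns = ", ".join([target, *features])
    return f"SELECT {columns} FROM {table}"


feature_query = build_feature_query(
    target="fare_amount",
    features=["tip_amount", "tolls_amount", "extra", "mta_tax", "trip_distance"],
    table="combined_ride_data_deduped",
)
print(feature_query)
```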

### Get the Amazon S3 URI of the dataset

In [None]:
get_csv_file_location("ride_combined_notebook_relevant_features_query_execution_id")

### Run a SageMaker processing job to split the data

The code in `processing_data_split.py` splits the dataset into training, validation, and test sets. We use a SageMaker processing job to provide the compute needed to transform large volumes of data. For more information about processing jobs, see [Use processing jobs to run data transformation workloads](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html). For more information about running scikit-learn scripts, see [Data Processing with scikit-learn](https://docs.aws.amazon.com/sagemaker/latest/dg/use-scikit-learn-processing-container.html).

For faster processing, we recommend using an `instance_count` of `2`, but you can use whatever value you prefer.

For `source` within `ProcessingInput`, replace `'s3://example-s3-bucket/ride_combined_notebook_relevant_features_query_execution_id.csv'` with the output of the preceding cell. Within `processing_data_split.py`, you specify `/opt/ml/processing/input/query-id` as the `input_path`. The processing job copies the query results to a location within its own container.

For `destination` under `ProcessingOutput`, replace `example-s3-bucket` with the Amazon S3 bucket that you've created.
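
The contents of `processing_data_split.py` aren't shown in this notebook. As a rough illustration of the idea only, the following sketch shuffles rows and cuts them into three parts. The 70/15/15 ratio, the fixed seed, and the `split_rows` name are all assumptions, and the actual script may work differently.

```python
import random


def split_rows(rows, train_fraction=0.7, validation_fraction=0.15, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    train_end = round(len(shuffled) * train_fraction)
    validation_end = train_end + round(len(shuffled) * validation_fraction)
    return shuffled[:train_end], shuffled[train_end:validation_end], shuffled[validation_end:]


# Made-up rows standing in for the query results
toy_rows = [[float(i), float(i) / 10] for i in range(100)]
train_rows, validation_rows, test_rows = split_rows(toy_rows)
print(len(train_rows), len(validation_rows), len(test_rows))
```

Whatever is left after the training and validation slices becomes the test set, so no row is lost or used twice.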

In [None]:
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput


# Define the SageMaker execution role
role = sagemaker.get_execution_role()

# Define the SKLearnProcessor
sklearn_processor = SKLearnProcessor(
    framework_version="0.20.0", role=role, instance_type="ml.m5.4xlarge", instance_count=2
)

# Run the processing job
sklearn_processor.run(
    code="processing_data_split.py",
    inputs=[
        ProcessingInput(
            source="s3://example-s3-bucket/ride_combined_notebook_relevant_features_query_execution_id.csv",
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/output/train",
            destination="s3://example-s3-bucket/output/train",
        ),
        ProcessingOutput(
            source="/opt/ml/processing/output/validation",
            destination="s3://example-s3-bucket/output/validation",
        ),
        ProcessingOutput(
            source="/opt/ml/processing/output/test",
            destination="s3://example-s3-bucket/output/test",
        ),
    ],
)

### Verify that train.csv is in the location that you've specified

In [None]:
# Verify that train.csv is in the location that you've specified
!aws s3 ls s3://example-s3-bucket/output/train/train.csv

### Verify that val.csv is in the location that you've specified

In [None]:
# Verify that val.csv is in the location that you've specified
!aws s3 ls s3://example-s3-bucket/output/validation/val.csv

### Specify `train.csv` and `val.csv` as the input for the training job

In [None]:
from sagemaker.session import TrainingInput

bucket = "example-s3-bucket"

train_input = TrainingInput(f"s3://{bucket}/output/train/train.csv", content_type="csv")
validation_input = TrainingInput(f"s3://{bucket}/output/validation/val.csv", content_type="csv")

### Specify the model container and output location of the model artifact

Specify the S3 location of the trained model artifact. You can access it later.

The following cell also gets the URI of the container image. We used version `1.2-2` of the XGBoost container image, but you can specify a different version. For more information about XGBoost container images, see [Use the XGBoost algorithm with Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html).

In [None]:
# Getting the XGBoost container that's in us-east-1
prefix = "training-output-data"
region = "us-east-1"

from sagemaker.debugger import Rule, ProfilerRule, rule_configs
from sagemaker.session import TrainingInput

s3_output_location = f"s3://{bucket}/{prefix}/xgboost_model"

container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-2")
print(container)

### Define the model

In [None]:
xgb_model = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=2,
    region=region,
    instance_type="ml.m5.4xlarge",
    volume_size=5,
    output_path=s3_output_location,
    sagemaker_session=sagemaker.Session(),
    rules=[
        Rule.sagemaker(rule_configs.create_xgboost_report()),
        ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
    ],
)

### Set the model hyperparameters

For the purposes of running the training job more quickly, we set the number of training rounds to 10.

In [None]:
xgb_model.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    objective="reg:squarederror",
    num_round=10,
)

### Train the model

In [None]:
xgb_model.fit({"train": train_input, "validation": validation_input}, wait=True)

### Deploy the model

Copy the name of the model endpoint. We use it for our model evaluation.

In [None]:
xgb_predictor = xgb_model.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

### Download the test.csv file

In [None]:
!aws s3 cp s3://example-s3-bucket/output/test/test.csv .

### Create a 20 row test dataframe

In [None]:
import boto3
import json

test_df = pd.read_csv("test.csv", nrows=20)
test_df.head()

### Get predictions from the test dataframe

Define the `get_predictions` function to convert the 20 row dataframe to a CSV string and get predictions from the model endpoint. Provide the `get_predictions` function with the dataframe and the name of the model endpoint.

In [None]:
import json
import pandas as pd

# Initialize the SageMaker runtime client
runtime = boto3.client("runtime.sagemaker")

# Define the endpoint name
endpoint_name = "sagemaker-xgboost-timestamp"


# Function to make predictions
def get_predictions(data, endpoint_name):
    # Convert the DataFrame to a CSV string and encode it to bytes
    csv_data = data.to_csv(header=False, index=False).encode("utf-8")

    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="text/csv", Body=csv_data
    )

    # Read the response body
    response_body = response["Body"].read().decode("utf-8")

    try:
        # Try to parse the response as JSON
        result = json.loads(response_body)
    except json.JSONDecodeError:
        # If response is not JSON, just return the raw response
        result = response_body

    return result


# Drop the target column from the test dataframe
test_df = test_df.drop(test_df.columns[0], axis=1)

# Get predictions
predictions = get_predictions(test_df, endpoint_name)
print(predictions)

### Create an array from the string of predictions

The string of predictions uses the newline character as the separator, so we split on it to create an array of predictions and drop the last element.

In [None]:
predictions_array = predictions.split("\n")
predictions_array = predictions_array[:-1]
predictions_array

### Get the 20 row sample of the test dataframe

In [None]:
df_with_target_column_values = pd.read_csv("test.csv", nrows=20)
df_with_target_column_values.head()

### Convert the values of the predictions array from strings to floats

In [None]:
predictions_array = [float(x) for x in predictions_array]
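
The splitting and conversion steps can also be combined into one helper that skips empty entries instead of relying on the position of the last element. This is a sketch with made-up values. `parse_predictions` is a hypothetical name, and accepting commas as well as newlines is a defensive assumption rather than something this notebook requires.

```python
import re


def parse_predictions(response_text):
    """Convert a delimited string of predictions into a list of floats."""
    tokens = re.split(r"[,\n]", response_text)
    # Skip empty tokens, such as the one after a trailing newline
    return [float(token) for token in tokens if token.strip()]


parsed = parse_predictions("12.5\n7.25\n")
print(parsed)
```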

### Create a dataframe to store the predicted versus actual values

In [None]:
comparison_df = pd.DataFrame(predictions_array, columns=["predicted_values"])
comparison_df

### Add the actual values to the comparison dataframe

In [None]:
column_to_add = df_with_target_column_values.iloc[:, 0]

comparison_df["actual_values"] = column_to_add

comparison_df

### Verify that the datatypes of both columns are floats

In [None]:
comparison_df.dtypes

### Compute the RMSE

In [None]:
import numpy as np

# Calculate the squared differences between the predicted and actual values
comparison_df["squared_diff"] = (
    comparison_df["actual_values"] - comparison_df["predicted_values"]
) ** 2

# Calculate the mean of the squared differences
mean_squared_diff = comparison_df["squared_diff"].mean()

# Take the square root of the mean to get the RMSE
rmse = np.sqrt(mean_squared_diff)

print(f"RMSE: {rmse}")
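
To sanity check the calculation, the same formula can be applied to a few made-up numbers by hand. The `rmse` helper below is a hypothetical standard-library version of the steps above.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error of two equal-length numeric sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("Need two non-empty sequences of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors are 1, -2, and 0, so the squared errors are 1, 4, and 0
toy_rmse = rmse([10.0, 12.0, 8.0], [9.0, 14.0, 8.0])
print(toy_rmse)
```

The mean of the squared errors is 5/3, so the result is the square root of 5/3, or roughly 1.29. Larger errors dominate because they're squared before averaging.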

### Clean up

In [None]:
# Delete the S3 bucket
!aws s3 rb s3://example-s3-bucket --force

In [None]:
# Delete the endpoint
xgb_predictor.delete_endpoint()

## Notebook CI Test Results
    
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/use-cases|athena_ml_workflow_end_to_end|athena_ml_workflow_end_to_end.ipynb)