AWS Serverless Data Analytics Pipeline

This project demonstrates an end-to-end serverless data processing pipeline on AWS. The workflow ingests raw data into S3, processes it with a Lambda function, catalogs it with AWS Glue, queries it with Amazon Athena, and displays the results on a simple Flask webpage hosted on an EC2 instance.

Bucket: serverless-data-analytics-bucket

Region: us-east-1

Glue Database: orders_db

Lambda Function: FilterAndProcessOrders

IAM Roles: Lambda-S3-Processing-Role, Glue-S3-Crawler-Role, EC2-Athena-Dashboard-Role

EC2 Web URL: http://13.223.188.161:5000

Athena results S3 path: s3://serverless-data-analytics-bucket/enriched/

Processed S3 prefix: s3://serverless-data-analytics-bucket/processed/

1) Amazon S3 Bucket Structure

Create a bucket and three prefixes to organize data:

raw/: incoming raw files
processed/: Lambda output (filtered/cleaned CSVs)
enriched/: Athena query result CSVs

Screenshot:

Figure: serverless-data-analytics-bucket root with raw/, processed/, enriched/.

Upload orders.csv to raw/:

aws s3 cp orders.csv s3://serverless-data-analytics-bucket/raw/orders.csv

2) IAM Roles and Permissions

Create three roles with managed policies:

2.1 Lambda Execution Role

Role: Lambda-S3-Processing-Role
Policies: AWSLambdaBasicExecutionRole, AmazonS3FullAccess

2.2 Glue Service Role

Role: Glue-S3-Crawler-Role
Policies: AmazonS3FullAccess, AWSGlueConsoleFullAccess, AWSGlueServiceRole

2.3 EC2 Instance Profile

Role: EC2-Athena-Dashboard-Role
Policies: AmazonS3FullAccess, AmazonAthenaFullAccess

Screenshot:

Figure: Roles list highlighting the three roles above.

3) Lambda Function

Create a Lambda function that filters orders (drops pending/cancelled older than 30 days) and writes to processed/.

Function: FilterAndProcessOrders
Runtime: Python 3.11 (or 3.9/3.10)
Role: Lambda-S3-Processing-Role

Replace the default code with the contents of LambdaFunction.py from the repo.

Screenshot:

Figure: Lambda overview page showing function name and runtime.

3.1 Configure S3 Trigger

Bucket: serverless-data-analytics-bucket
Event type: All object create events
Prefix: raw/
Suffix: .csv

Screenshot:

Figure: Trigger panel showing S3 event on raw/*.csv.

3.2 Verify Processing

Upload a new raw file (e.g., orders_2.csv) to trigger Lambda. Confirm the output file in processed/ with prefix filtered_...csv.

Screenshot:

Figure: processed/ with filtered CSV.

4) AWS Glue Crawler

Create a database and crawler to catalog the processed data:

Database: orders_db
Crawler: orders_processed_crawler
Data source: s3://serverless-data-analytics-bucket/processed/
Role: Glue-S3-Crawler-Role
Target database: orders_db

Run the crawler and verify a table appears in orders_db (columns for orderid, customer, amount, status, orderdate).

Screenshot:

Figure: Glue crawler CloudWatch logs showing a successful run.

5) Query Data with Athena

In Athena (Trino):

Data source: AwsDataCatalog
Database: orders_db
Results location: s3://serverless-data-analytics-bucket/enriched/

5.1 Create a Clean View

Replace <TABLE_NAME> with your Glue table name if needed, but this view assumes you’re already in orders_db:

CREATE OR REPLACE VIEW orders_db.filtered_orders AS
SELECT
  CAST(orderid   AS VARCHAR)             AS orderid,
  CAST(customer  AS VARCHAR)             AS customer,
  CAST(amount    AS DOUBLE)              AS amount,
  LOWER(CAST(status AS VARCHAR))         AS status,
  CAST(orderdate AS DATE)                AS orderdate
FROM orders_db."<TABLE_NAME>";

5.2 Queries (produce CSVs in `enriched/`)

Total Sales by Customer

SELECT customer, SUM(amount) AS totalamountspent
FROM orders_db.filtered_orders
GROUP BY customer
ORDER BY totalamountspent DESC;

Monthly Order Volume and Revenue

SELECT date_trunc('month', orderdate) AS ordermonth,
       COUNT(orderid)                 AS numberoforders,
       ROUND(SUM(amount), 2)          AS monthlyrevenue
FROM orders_db.filtered_orders
GROUP BY 1
ORDER BY ordermonth;

Order Status Dashboard

SELECT status,
       COUNT(orderid)       AS ordercount,
       ROUND(SUM(amount),2) AS totalamount
FROM orders_db.filtered_orders
GROUP BY status;

Average Order Value (AOV) per Customer

SELECT customer, ROUND(AVG(amount), 2) AS averageordervalue
FROM orders_db.filtered_orders
GROUP BY customer
ORDER BY averageordervalue DESC;

Top 10 Largest Orders in February 2025

SELECT orderdate, orderid, customer, amount
FROM orders_db.filtered_orders
WHERE orderdate BETWEEN DATE '2025-02-01' AND DATE '2025-02-28'
ORDER BY amount DESC
LIMIT 10;

Screenshot:

Figure: enriched/ with Athena result CSVs.

6) EC2 Web Server

Launch an Amazon Linux 2023 t2.micro with security group rules:

SSH: 22 from My IP
Custom TCP: 5000 from 0.0.0.0/0
Attach IAM profile EC2-Athena-Dashboard-Role

SSH into EC2, install deps, and run the Flask app:

# on EC2
sudo yum update -y
sudo yum install -y python3-pip
pip3 install Flask boto3

# create app.py and paste your Flask code
nano app.py  # set AWS_REGION="us-east-1", ATHENA_DATABASE="orders_db", S3_OUTPUT_LOCATION="s3://serverless-data-analytics-bucket/enriched/"
python3 app.py  # runs on 0.0.0.0:5000

Open the dashboard at: http://13.223.188.161:5000.

Screenshot:

Figure: Dashboard page with tables for the five queries.

Notes & Troubleshooting

Lambda didn’t trigger? Ensure the S3 trigger exists, correct bucket, Prefix=raw/, Suffix=.csv.
No processed file? Upload a new object to raw/ (events don’t retro-trigger). Check CloudWatch logs.
Crawler failed? Confirm role Glue-S3-Crawler-Role and that processed/ has CSV with headers.
Athena errors? Verify orders_db, the table name, and that the filtered_orders view exists. Ensure results S3 path is set to enriched/.
EC2 5000 not reachable? Security group rule for TCP 5000 must be open to 0.0.0.0/0. The app must run on 0.0.0.0.

Repository Structure

aws-serverless-data-analytics-pipeline/
├─ LambdaFunction.py
├─ EC2InstanceNANOapp.py
├─ orders.csv
├─ README.md
└─ screenshots/
   ├─ 01-s3-structure.png
   ├─ 02-iam-roles.png
   ├─ 03-lambda-overview.png
   ├─ 04-lambda-trigger.png
   ├─ 05-processed-folder.png
   ├─ 06-glue-crawler-cloudwatch.png
   ├─ 07-athena-enriched.png
   └─ 08-final-webpage.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

AWS Serverless Data Analytics Pipeline

1) Amazon S3 Bucket Structure

2) IAM Roles and Permissions

2.1 Lambda Execution Role

2.2 Glue Service Role

2.3 EC2 Instance Profile

3) Lambda Function

3.1 Configure S3 Trigger

3.2 Verify Processing

4) AWS Glue Crawler

5) Query Data with Athena

5.1 Create a Clean View

5.2 Queries (produce CSVs in `enriched/`)

6) EC2 Web Server

Notes & Troubleshooting

Repository Structure

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
screenshots		screenshots
EC2InstanceNANOapp.py		EC2InstanceNANOapp.py
LambdaFunction.py		LambdaFunction.py
README.md		README.md
orders.csv		orders.csv

ak811/aws-serverless-data-analytics-pipeline

Folders and files

Latest commit

History

Repository files navigation

AWS Serverless Data Analytics Pipeline

1) Amazon S3 Bucket Structure

2) IAM Roles and Permissions

2.1 Lambda Execution Role

2.2 Glue Service Role

2.3 EC2 Instance Profile

3) Lambda Function

3.1 Configure S3 Trigger

3.2 Verify Processing

4) AWS Glue Crawler

5) Query Data with Athena

5.1 Create a Clean View

5.2 Queries (produce CSVs in enriched/)

6) EC2 Web Server

Notes & Troubleshooting

Repository Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

5.2 Queries (produce CSVs in `enriched/`)

Packages