This project demonstrates an end-to-end serverless data processing pipeline on AWS. The workflow ingests raw data into S3, processes it with a Lambda function, catalogs it with AWS Glue, queries it with Amazon Athena, and displays the results on a simple Flask webpage hosted on an EC2 instance.
- Bucket: `serverless-data-analytics-bucket`
- Region: `us-east-1`
- Glue Database: `orders_db`
- Lambda Function: `FilterAndProcessOrders`
- IAM Roles: `Lambda-S3-Processing-Role`, `Glue-S3-Crawler-Role`, `EC2-Athena-Dashboard-Role`
- EC2 Web URL: http://13.223.188.161:5000
- Athena results S3 path: `s3://serverless-data-analytics-bucket/enriched/`
- Processed S3 prefix: `s3://serverless-data-analytics-bucket/processed/`
Create a bucket and three prefixes to organize data:
- `raw/`: incoming raw files
- `processed/`: Lambda output (filtered/cleaned CSVs)
- `enriched/`: Athena query result CSVs
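The three prefixes can also be created programmatically; here is a minimal boto3 sketch (S3 has no real folders, so zero-byte objects whose keys end in `/` stand in for them in the console):

```python
BUCKET = "serverless-data-analytics-bucket"
PREFIXES = ("raw/", "processed/", "enriched/")

def create_prefixes():
    import boto3  # assumes credentials with s3:PutObject on the bucket
    s3 = boto3.client("s3", region_name="us-east-1")
    for prefix in PREFIXES:
        # a zero-byte object whose key ends in "/" renders as a folder in the console
        s3.put_object(Bucket=BUCKET, Key=prefix)

# create_prefixes()  # run on a machine with AWS credentials configured
```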
Screenshot:

Figure: serverless-data-analytics-bucket root with raw/, processed/, enriched/.
Upload `orders.csv` to `raw/`:

```bash
aws s3 cp orders.csv s3://serverless-data-analytics-bucket/raw/orders.csv
```

Create three roles with managed policies:
- Role: `Lambda-S3-Processing-Role`; Policies: `AWSLambdaBasicExecutionRole`, `AmazonS3FullAccess`
- Role: `Glue-S3-Crawler-Role`; Policies: `AmazonS3FullAccess`, `AWSGlueConsoleFullAccess`, `AWSGlueServiceRole`
- Role: `EC2-Athena-Dashboard-Role`; Policies: `AmazonS3FullAccess`, `AmazonAthenaFullAccess`
Screenshot:

Figure: Roles list highlighting the three roles above.
Create a Lambda function that filters orders (drops pending/cancelled older than 30 days) and writes to processed/.
- Function: `FilterAndProcessOrders`
- Runtime: Python 3.11 (or 3.9/3.10)
- Role: `Lambda-S3-Processing-Role`
Replace the default code with the contents of LambdaFunction.py from the repo.
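The repo's `LambdaFunction.py` is the source of truth; as a rough sketch of the filtering step it describes (assuming CSV columns `orderid, customer, amount, status, orderdate` with ISO-format dates):

```python
import csv
import io
from datetime import date, timedelta

def keep_order(status: str, orderdate: date, today: date) -> bool:
    """Drop pending/cancelled orders older than 30 days; keep everything else."""
    stale = orderdate < today - timedelta(days=30)
    return not (status.lower() in ("pending", "cancelled") and stale)

def lambda_handler(event, context):
    import boto3  # provided in the Lambda runtime
    s3 = boto3.client("s3")
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(body))
    today = date.today()
    kept = [r for r in reader
            if keep_order(r["status"], date.fromisoformat(r["orderdate"]), today)]
    # write the surviving rows back out under processed/ with a filtered_ prefix
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(kept)
    out_key = "processed/filtered_" + key.rsplit("/", 1)[-1]
    s3.put_object(Bucket=bucket, Key=out_key, Body=out.getvalue())
    return {"written": out_key, "rows": len(kept)}
```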
Screenshot:

Figure: Lambda overview page showing function name and runtime.
Add an S3 trigger to the function:

- Bucket: `serverless-data-analytics-bucket`
- Event type: All object create events
- Prefix: `raw/`
- Suffix: `.csv`
Screenshot:

Figure: Trigger panel showing S3 event on raw/*.csv.
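The same trigger can be wired up outside the console; a hedged boto3 sketch (the Lambda ARN's account id is a placeholder, and the function must already permit `s3.amazonaws.com` to invoke it, which the console normally sets up for you):

```python
def build_notification_config(lambda_arn: str) -> dict:
    """S3 event notification equivalent to the console trigger above."""
    return {
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": lambda_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": "raw/"},
                {"Name": "suffix", "Value": ".csv"},
            ]}},
        }]
    }

def attach_trigger():
    import boto3  # assumes credentials with s3:PutBucketNotification
    s3 = boto3.client("s3", region_name="us-east-1")
    arn = ("arn:aws:lambda:us-east-1:123456789012:"  # placeholder account id
           "function:FilterAndProcessOrders")
    s3.put_bucket_notification_configuration(
        Bucket="serverless-data-analytics-bucket",
        NotificationConfiguration=build_notification_config(arn),
    )

# attach_trigger()  # run with AWS credentials configured
```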
Upload a new raw file (e.g., `orders_2.csv`) to trigger Lambda. Confirm the output file in `processed/` with prefix `filtered_...csv`.
Screenshot:

Figure: processed/ with filtered CSV.
Create a database and crawler to catalog the processed data:
- Database: `orders_db`
- Crawler: `orders_processed_crawler`
- Data source: `s3://serverless-data-analytics-bucket/processed/`
- Role: `Glue-S3-Crawler-Role`
- Target database: `orders_db`
Run the crawler and verify a table appears in orders_db (columns for orderid, customer, amount, status, orderdate).
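One way to sanity-check the crawler's output without opening the console is a small boto3 script (the `column_names` helper is illustrative, not part of the project):

```python
EXPECTED_COLUMNS = {"orderid", "customer", "amount", "status", "orderdate"}

def column_names(table: dict) -> set:
    """Pull the column names out of a glue get_table response."""
    return {c["Name"] for c in table["Table"]["StorageDescriptor"]["Columns"]}

def check_catalog():
    import boto3  # assumes credentials with Glue read access
    glue = boto3.client("glue", region_name="us-east-1")
    name = glue.get_tables(DatabaseName="orders_db")["TableList"][0]["Name"]
    table = glue.get_table(DatabaseName="orders_db", Name=name)
    missing = EXPECTED_COLUMNS - column_names(table)
    print("table:", name, "| missing columns:", missing or "none")

# check_catalog()  # run with AWS credentials configured
```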
Screenshot:

Figure: Glue crawler CloudWatch logs showing a successful run.
In Athena (Trino):
- Data source: `AwsDataCatalog`
- Database: `orders_db`
- Results location: `s3://serverless-data-analytics-bucket/enriched/`
Replace `<TABLE_NAME>` with your Glue table name if needed; this view assumes you are already in `orders_db`:

```sql
CREATE OR REPLACE VIEW orders_db.filtered_orders AS
SELECT
  CAST(orderid AS VARCHAR) AS orderid,
  CAST(customer AS VARCHAR) AS customer,
  CAST(amount AS DOUBLE) AS amount,
  LOWER(CAST(status AS VARCHAR)) AS status,
  CAST(orderdate AS DATE) AS orderdate
FROM orders_db."<TABLE_NAME>";
```

- Total Sales by Customer

```sql
SELECT customer, SUM(amount) AS totalamountspent
FROM orders_db.filtered_orders
GROUP BY customer
ORDER BY totalamountspent DESC;
```

- Monthly Order Volume and Revenue

```sql
SELECT date_trunc('month', orderdate) AS ordermonth,
       COUNT(orderid) AS numberoforders,
       ROUND(SUM(amount), 2) AS monthlyrevenue
FROM orders_db.filtered_orders
GROUP BY 1
ORDER BY ordermonth;
```

- Order Status Dashboard

```sql
SELECT status,
       COUNT(orderid) AS ordercount,
       ROUND(SUM(amount), 2) AS totalamount
FROM orders_db.filtered_orders
GROUP BY status;
```

- Average Order Value (AOV) per Customer

```sql
SELECT customer, ROUND(AVG(amount), 2) AS averageordervalue
FROM orders_db.filtered_orders
GROUP BY customer
ORDER BY averageordervalue DESC;
```

- Top 10 Largest Orders in February 2025

```sql
SELECT orderdate, orderid, customer, amount
FROM orders_db.filtered_orders
WHERE orderdate BETWEEN DATE '2025-02-01' AND DATE '2025-02-28'
ORDER BY amount DESC
LIMIT 10;
```

Screenshot:

Figure: enriched/ with Athena result CSVs.
Launch an Amazon Linux 2023 t2.micro with security group rules:
- SSH:
22from My IP - Custom TCP:
5000from 0.0.0.0/0 - Attach IAM profile
EC2-Athena-Dashboard-Role
SSH into EC2, install deps, and run the Flask app:
```bash
# on EC2
sudo yum update -y
sudo yum install -y python3-pip
pip3 install Flask boto3
# create app.py and paste your Flask code
nano app.py    # set AWS_REGION="us-east-1", ATHENA_DATABASE="orders_db", S3_OUTPUT_LOCATION="s3://serverless-data-analytics-bucket/enriched/"
python3 app.py # runs on 0.0.0.0:5000
```

Open the dashboard at http://13.223.188.161:5000.
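The repo ships the real dashboard code as `EC2InstanceNANOapp.py`; the sketch below is only a minimal stand-in showing the Athena call pattern such an app typically uses (`start_query_execution`, poll `get_query_execution`, then `get_query_results`), with one hardcoded query:

```python
import time

AWS_REGION = "us-east-1"
ATHENA_DATABASE = "orders_db"
S3_OUTPUT_LOCATION = "s3://serverless-data-analytics-bucket/enriched/"

def rows_to_dicts(result_set: dict) -> list:
    """Turn an Athena ResultSet (first row = header) into a list of dicts."""
    rows = result_set["Rows"]
    header = [c.get("VarCharValue", "") for c in rows[0]["Data"]]
    return [dict(zip(header, [c.get("VarCharValue") for c in r["Data"]]))
            for r in rows[1:]]

def run_query(sql: str) -> list:
    import boto3  # uses the EC2-Athena-Dashboard-Role instance profile
    athena = boto3.client("athena", region_name=AWS_REGION)
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": ATHENA_DATABASE},
        ResultConfiguration={"OutputLocation": S3_OUTPUT_LOCATION},
    )["QueryExecutionId"]
    while True:  # poll until the query reaches a terminal state
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"query {qid} ended in state {state}")
    return rows_to_dicts(
        athena.get_query_results(QueryExecutionId=qid)["ResultSet"])

def main():
    from flask import Flask  # pip3 install Flask boto3
    app = Flask(__name__)

    @app.route("/")
    def home():
        rows = run_query(
            "SELECT customer, SUM(amount) AS totalamountspent "
            "FROM filtered_orders GROUP BY customer "
            "ORDER BY totalamountspent DESC")
        items = "".join(f"<li>{r['customer']}: {r['totalamountspent']}</li>"
                        for r in rows)
        return f"<h1>Total Sales by Customer</h1><ul>{items}</ul>"

    app.run(host="0.0.0.0", port=5000)  # must bind 0.0.0.0 to be reachable

# main()  # run on the EC2 instance
```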
Screenshot:

Figure: Dashboard page with tables for the five queries.
- Lambda didn’t trigger? Ensure the S3 trigger exists on the correct bucket with Prefix=`raw/` and Suffix=`.csv`.
- No processed file? Upload a new object to `raw/` (events don’t retro-trigger). Check CloudWatch logs.
- Crawler failed? Confirm role `Glue-S3-Crawler-Role` and that `processed/` has a CSV with headers.
- Athena errors? Verify `orders_db`, the table name, and that the `filtered_orders` view exists. Ensure the results S3 path is set to `enriched/`.
- EC2 port 5000 not reachable? The security group rule for TCP 5000 must be open to 0.0.0.0/0, and the app must run on 0.0.0.0.
```
aws-serverless-data-analytics-pipeline/
├─ LambdaFunction.py
├─ EC2InstanceNANOapp.py
├─ orders.csv
├─ README.md
└─ screenshots/
   ├─ 01-s3-structure.png
   ├─ 02-iam-roles.png
   ├─ 03-lambda-overview.png
   ├─ 04-lambda-trigger.png
   ├─ 05-processed-folder.png
   ├─ 06-glue-crawler-cloudwatch.png
   ├─ 07-athena-enriched.png
   └─ 08-final-webpage.png
```