Data is everywhere, and what matters is how we manage it, make sense of it, and use it to make meaningful, data-driven decisions. In this session we will discuss the whole data engineering pipeline, from data collection through processing, analysis, and visualization, in a completely serverless fashion. We will pick an open-source dataset and store and process it on the cloud (AWS). While the focus will be on a general understanding of the data pipeline aspects of data engineering, along the way we will learn a few AWS services that can help us achieve our goal effectively and efficiently.
- Create an S3 bucket: `dataconla2022`
- Create a few folders: a) `input`, b) `code`, c) `output`
- Place the dataset in the S3 bucket under the `input` folder:
  `aws s3 cp dataset/wikiticker.json s3://dataconla2022-1/input/`
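The bucket setup above can also be scripted with boto3 instead of the console and CLI. This is a sketch, not part of the original walkthrough; it assumes AWS credentials are configured locally and uses the bucket name from the first step (note the walkthrough's copy command targets `dataconla2022-1`).

```python
import boto3

s3 = boto3.client("s3")

# Create the bucket (outside us-east-1, a CreateBucketConfiguration
# with a LocationConstraint is also required).
s3.create_bucket(Bucket="dataconla2022")

# "Folders" in S3 are just key prefixes; zero-byte placeholder objects
# make them visible in the console.
for prefix in ("input/", "code/", "output/"):
    s3.put_object(Bucket="dataconla2022", Key=prefix)

# Upload the dataset, equivalent to the `aws s3 cp` command above.
s3.upload_file("dataset/wikiticker.json", "dataconla2022", "input/wikiticker.json")
```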
- Create an EMR cluster using the steps mentioned here
- SSH to the EMR cluster created in the previous step
- Copy the PySpark code into the EMR cluster
- Submit the PySpark job: `sudo spark-submit agg_filter.py`
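The `agg_filter.py` script itself is not shown in the walkthrough; the following is a hypothetical sketch of what a filter-and-aggregate job on the wikiticker dataset might look like. The field names (`channel`, `isRobot`, `added`), the exact S3 paths, and the output format are all assumptions, not the author's actual code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg_filter").getOrCreate()

# wikiticker.json is newline-delimited JSON, one Wikipedia edit per line.
edits = spark.read.json("s3://dataconla2022-1/input/wikiticker.json")

# Filter out bot edits, then aggregate the lines added per channel.
aggregate = (
    edits.filter(F.col("isRobot") == False)
         .groupBy("channel")
         .agg(F.sum("added").alias("total_added"))
)

# Writing under output/aggregate/ lets the Glue crawler in the next step
# register the result as a table named "aggregate".
aggregate.write.mode("overwrite").parquet("s3://dataconla2022/output/aggregate/")
```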
- Create a Glue crawler from the Glue console
- Create a database or use one of the existing databases
- When asked for the data source, use the following path: `s3://dataconla2022/output`
- Run the crawler
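The crawler steps above can alternatively be scripted with boto3. In this sketch, the crawler name, the IAM role ARN, and the database name are placeholders, not values from the walkthrough.

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="dataconla2022-output-crawler",  # hypothetical crawler name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role ARN
    DatabaseName="dataconla",  # placeholder: your Glue database name
    Targets={"S3Targets": [{"Path": "s3://dataconla2022/output"}]},
)

glue.start_crawler(Name="dataconla2022-output-crawler")
```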
- Once the crawler run completes, open Amazon Athena and query the database:
  `SELECT * FROM "AwsDataCatalog"."<your DB Name>"."aggregate" LIMIT 10;`
- Create an IAM role using the steps mentioned here
- Open the Amazon EMR Serverless console
- Create an application and use the IAM role created in Step 1
- Copy the code to the S3 bucket: `aws s3 cp agg_filter.py s3://dataconla2022-1/code/`
- Submit a job and mention the script location as `s3://dataconla2022-1/code/agg_filter.py`
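The job submission can also be done programmatically with boto3. In this sketch, the application ID and execution role ARN are placeholders standing in for the application and IAM role created in the earlier steps.

```python
import boto3

emr = boto3.client("emr-serverless")

response = emr.start_job_run(
    applicationId="00example1234567",  # placeholder: ID of the application created above
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessRole",  # placeholder
    jobDriver={
        "sparkSubmit": {
            # Script location from the previous step.
            "entryPoint": "s3://dataconla2022-1/code/agg_filter.py",
        }
    },
)

# The response includes the ID of the submitted job run.
print(response["jobRunId"])
```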