Data is everywhere, and what matters is how we manage it, make sense of it, and use it to make meaningful, data-driven decisions. In this session we will discuss the whole data engineering pipeline, from data collection through processing, analysis, and visualization, in a completely serverless fashion. We will pick an open-source dataset and store and process it on the cloud (AWS). While the focus will be on a general understanding of the data pipeline aspects of data engineering, along the way we will learn a few AWS services that can help us achieve our goal effectively and efficiently.
- Create an S3 bucket: `dataconla2022`
- Create a few folders: a) `input`, b) `code`, c) `output`
- Place the dataset in the S3 bucket under the `input` folder:
  `aws s3 cp dataset/wikiticker.json s3://dataconla2022-1/input/`
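The bucket setup above can also be scripted with boto3 instead of the console and CLI. This is a sketch, not part of the original walkthrough; it assumes AWS credentials are configured locally and uses the bucket name from the first step (note the walkthrough's copy command targets `dataconla2022-1`).

```python
import boto3

s3 = boto3.client("s3")

# Create the bucket (outside us-east-1, a CreateBucketConfiguration
# with a LocationConstraint is also required).
s3.create_bucket(Bucket="dataconla2022")

# "Folders" in S3 are just key prefixes; zero-byte placeholder objects
# make them visible in the console.
for prefix in ("input/", "code/", "output/"):
    s3.put_object(Bucket="dataconla2022", Key=prefix)

# Upload the dataset, equivalent to the `aws s3 cp` command above.
s3.upload_file("dataset/wikiticker.json", "dataconla2022", "input/wikiticker.json")
```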
- Create an EMR cluster using the steps mentioned here
- SSH to the EMR cluster created in the previous step
- Copy the PySpark code into the EMR cluster
- Submit the PySpark job: `sudo spark-submit agg_filter.py`
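The `agg_filter.py` script itself is not shown in the walkthrough; the following is a hypothetical sketch of what a filter-and-aggregate job on the wikiticker dataset might look like. The field names (`channel`, `isRobot`, `added`), the exact S3 paths, and the output format are all assumptions, not the author's actual code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg_filter").getOrCreate()

# wikiticker.json is newline-delimited JSON, one Wikipedia edit per line.
edits = spark.read.json("s3://dataconla2022-1/input/wikiticker.json")

# Filter out bot edits, then aggregate the lines added per channel.
aggregate = (
    edits.filter(F.col("isRobot") == False)
         .groupBy("channel")
         .agg(F.sum("added").alias("total_added"))
)

# Writing under output/aggregate/ lets the Glue crawler in the next step
# register the result as a table named "aggregate".
aggregate.write.mode("overwrite").parquet("s3://dataconla2022/output/aggregate/")
```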
- Create a Glue crawler from the Glue console
- Create a database or use one of the existing databases
- When asked for the data source, use the following path: `s3://dataconla2022/output`
- Run the crawler
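The crawler steps above can alternatively be scripted with boto3. In this sketch, the crawler name, the IAM role ARN, and the database name are placeholders, not values from the walkthrough.

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="dataconla2022-output-crawler",  # hypothetical crawler name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role ARN
    DatabaseName="dataconla",  # placeholder: your Glue database name
    Targets={"S3Targets": [{"Path": "s3://dataconla2022/output"}]},
)

glue.start_crawler(Name="dataconla2022-output-crawler")
```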
- Once the crawler run completes, open Amazon Athena and query the database:
  `SELECT * FROM "AwsDataCatalog"."<your DB Name>"."aggregate" LIMIT 10;`
- Create an IAM role using the steps mentioned here
- Open the Amazon EMR Serverless console
- Create an application and use the IAM role created in Step 1
- Copy the code to the S3 bucket: `aws s3 cp agg_filter.py s3://dataconla2022-1/code/`
- Submit a job and mention the script location as `s3://dataconla2022-1/code/agg_filter.py`
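The job submission can also be done programmatically with boto3. In this sketch, the application ID and execution role ARN are placeholders standing in for the application and IAM role created in the earlier steps.

```python
import boto3

emr = boto3.client("emr-serverless")

response = emr.start_job_run(
    applicationId="00example1234567",  # placeholder: ID of the application created above
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessRole",  # placeholder
    jobDriver={
        "sparkSubmit": {
            # Script location from the previous step.
            "entryPoint": "s3://dataconla2022-1/code/agg_filter.py",
        }
    },
)

# The response includes the ID of the submitted job run.
print(response["jobRunId"])
```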