
Data Engineer ETL Project Using Spark with AWS Glue

Project Summary:

Hello everyone, this is an AWS ETL project using Spark and AWS Glue. In this project I perform three steps, Extract, Transform, and Load, which is why it is called an ETL project. I carry out these steps with various AWS services such as AWS S3, AWS Glue, and AWS IAM. The image below shows the structure of the project.

What we will do, step by step:

Step 1: Create an IAM role for the whole project.

Step 2: Create an S3 bucket and load data into the bucket from our local machine.

Step 3: Create an AWS Glue database and table.

Step 4: Create an ETL job in AWS Glue.

Step 5: Write the transformation code.

Step 1: Create IAM Role

What is an IAM role? An IAM (Identity and Access Management) role in AWS defines permissions for services and resources, allowing them to access specific resources without needing permanent credentials. It enhances security and simplifies access management.

  • Go to AWS IAM → Roles → Create Role
  • Use case for other AWS services: select Glue
  • Add permissions → search for and select ‘AdministratorAccess’
  • Role name: ‘AWSGlueServiceRole-me’

aws-pipeline
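If you prefer to script this step, here is a minimal boto3 sketch of the same role setup. It assumes your AWS credentials are already configured; the role name matches the one above, and the trust policy simply allows the Glue service to assume the role.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the AWS Glue service assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AWSGlueServiceRole-me",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# AdministratorAccess is convenient for a demo; a scoped-down policy is safer in practice.
iam.attach_role_policy(
    RoleName="AWSGlueServiceRole-me",
    PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
)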

Step 2: We create a bucket in AWS S3

AWS S3 – Amazon S3 is an object storage service offered by AWS. It’s renowned for its high availability, durability, and scalability. In Amazon S3, you can store objects such as images, code files, video files, and even host websites. All data in Amazon S3 is organized into containers called “buckets.”

Step 2.1: Go to AWS S3 -> Buckets -> Create bucket

  • Bucket name: ‘new-york-city-airbnb-2023’

aws-bucket-airbnb

Step 2.2: We create a folder for the dataset

  • Go to AWS S3 –> Buckets –> ‘new-york-city-airbnb-2023’ –> Create folder
  • Folder name : ‘NYC-airbnb-2023’

airbnb-folder-in-bucket

Step 2.3: We create two folders, one for raw data and one for transformed data.

  • Go to AWS s3 --> Buckets --> 'new-york-city-airbnb-2023' --> 'NYC-airbnb-2023' --> create 2 folders 
  • Folder name: 'row-data'
  • Folder name: 'transformed-data'

row-transfromed-data

Step 2.4: Now we upload our local data to the S3 'row-data' folder.

  • Go to AWS S3 --> Buckets --> 'new-york-city-airbnb-2023' --> 'NYC-airbnb-2023' --> 'row-data' --> click on Upload --> Add Files --> 'NYC-Airbnb-2023.csv'

nyc-airbnb-2023-csv-file
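The same bucket layout and upload can also be done with boto3. This is a sketch, assuming the CSV sits next to the script and the bucket is created in us-east-1 (other regions need a CreateBucketConfiguration).

import boto3

s3 = boto3.client("s3")
bucket = "new-york-city-airbnb-2023"

# Create the bucket (us-east-1 needs no location constraint).
s3.create_bucket(Bucket=bucket)

# S3 has no real folders; zero-byte keys ending in '/' act as folder placeholders.
for prefix in ("NYC-airbnb-2023/row-data/", "NYC-airbnb-2023/transformed-data/"):
    s3.put_object(Bucket=bucket, Key=prefix)

# Upload the local CSV into the raw-data folder.
s3.upload_file(
    Filename="NYC-Airbnb-2023.csv",
    Bucket=bucket,
    Key="NYC-airbnb-2023/row-data/NYC-Airbnb-2023.csv",
)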

With this, our second step is complete: we have successfully uploaded our local data to the S3 bucket. Next we create the AWS Glue database and table.

Step 3: Set Up the AWS Glue Database and Table

AWS Glue - AWS Glue is a fully managed ETL (Extract, Transform, Load) service that makes it easier to prepare and move data for analysis. It automates data discovery, transformation, and loading tasks, helping you to create data pipelines and work with data efficiently in the AWS cloud.

Step 3.1: Create Database in AWS Glue

  • Go to AWS Glue --> Data Catalog --> Databases --> Add database
  • Database name: 'etl-pipeline-project'

aws-glue-database

Step 3.2: Create Table in Database

  • Go to AWS Glue --> Data Catalog --> Tables --> click on 'Add tables using crawler'

After clicking, you have to fill in some details so the crawler can fetch the schema from AWS S3. Those details are given below.

  • Crawler name: ‘elt-airbnb-data-pipeline’
  • Data source S3 path: choose the ‘row-data’ folder in our bucket
  • IAM role: ‘AWSGlueServiceRole-me’
  • Target database: choose ‘etl-pipeline-project’
  • Schedule: On demand

table-in-aws-glue-database
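For reference, the database and crawler can also be created programmatically. A minimal boto3 sketch, assuming the role and S3 paths created above:

import boto3

glue = boto3.client("glue")

# Glue Data Catalog database.
glue.create_database(DatabaseInput={"Name": "etl-pipeline-project"})

# Crawler that infers the schema of the raw CSV and registers a table.
glue.create_crawler(
    Name="elt-airbnb-data-pipeline",
    Role="AWSGlueServiceRole-me",
    DatabaseName="etl-pipeline-project",
    Targets={"S3Targets": [{"Path": "s3://new-york-city-airbnb-2023/NYC-airbnb-2023/row-data/"}]},
)

# "On demand" scheduling simply means we start the crawler ourselves.
glue.start_crawler(Name="elt-airbnb-data-pipeline")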

Step 3.3: Run the Crawler

After the crawler has been created successfully, click Run crawler to fetch the schema of the file. Once the run finishes, we can check the table and its schema.

AWS-crawler

schema-of-csv-file
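If you want to inspect the crawled table without the console, a short boto3 check like the sketch below works. The crawler derives the table name from the S3 folder, so the listing prints whatever name ended up in your catalog along with each column and its inferred type.

import boto3

glue = boto3.client("glue")

# List the tables the crawler created and print each column with its inferred type.
resp = glue.get_tables(DatabaseName="etl-pipeline-project")
for table in resp["TableList"]:
    print(table["Name"])
    for col in table["StorageDescriptor"]["Columns"]:
        print(f"  {col['Name']}: {col['Type']}")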

Step 4: We create an ETL Job in AWS Glue

  • Go to AWS Glue --> Spark script editor --> click on Create
  • Job name: 'myGlueETLjob'
  • IAM Role: choose the role that we created --> 'AWSGlueServiceRole-me'
  • Type: Spark

aws-glue-etl-job
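Creating the job through the console is enough for this project, but for completeness here is a boto3 sketch of the same thing. The ScriptLocation is a hypothetical S3 key where you would upload the Step 5 script, and the Glue version is only an illustrative default.

import boto3

glue = boto3.client("glue")

# Register a Spark job that runs the transformation script from Step 5.
glue.create_job(
    Name="myGlueETLjob",
    Role="AWSGlueServiceRole-me",
    Command={
        "Name": "glueetl",  # Spark job type
        "ScriptLocation": "s3://new-york-city-airbnb-2023/scripts/transform.py",  # hypothetical key
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
)

# Kick off a run of the job.
glue.start_job_run(JobName="myGlueETLjob")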

Step 5: Let’s Write Some PySpark Code to Transform the Data

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

S3_INPUT_DATA = 's3://new-york-city-airbnb-2023/NYC-airbnb-2023/row-data/'

S3_OUTPUT_DATA = 's3://new-york-city-airbnb-2023/NYC-airbnb-2023/transformed-data/'

Creating the SparkSession

spark = SparkSession.builder.master("local[1]") \
    .appName('techycrispy.com') \
    .getOrCreate()
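A side note on the session: this script builds a plain SparkSession, while a Glue job script more conventionally obtains its session through a GlueContext. The sketch below shows that standard boilerplate for reference only; it is not part of this project's script, and the awsglue modules are available only inside the Glue job runtime.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job initialization: Glue supplies JOB_NAME as a job argument.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... transformations using `spark` go here ...

job.commit()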

Spark DataFrame (Raw) - Transformation

df = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load(S3_INPUT_DATA)

Convert the columns' datatypes

df = df.withColumn("id", df['id'].cast(IntegerType())) \
    .withColumn("host_id", df['host_id'].cast(IntegerType())) \
    .withColumn("latitude", df['latitude'].cast(FloatType())) \
    .withColumn("longitude", df['longitude'].cast(FloatType())) \
    .withColumn("minimum_nights", df['minimum_nights'].cast(IntegerType())) \
    .withColumn("number_of_reviews", df['number_of_reviews'].cast(IntegerType())) \
    .withColumn("last_review", df['last_review'].cast(DateType())) \
    .withColumn("reviews_per_month", df['reviews_per_month'].cast(IntegerType())) \
    .withColumn("calculated_host_listings_count", df['calculated_host_listings_count'].cast(IntegerType())) \
    .withColumn("availability_365", df['availability_365'].cast(IntegerType())) \
    .withColumn("number_of_reviews_ltm", df['number_of_reviews_ltm'].cast(IntegerType())) \
    .withColumn("license", df['license'].cast(IntegerType())) \
    .withColumn("price", df['price'].cast(IntegerType()))

Remove columns from the dataset that contain a high number of null (missing) values.

null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).collect()[0].asDict()
col_to_drop = [k for k, v in null_counts.items() if v > 400]
droping = df.drop(*col_to_drop)

count rows in dataframe

droping.count()

counting room types available

df1 = df.groupBy("room_type").agg(count("room_type").alias("counted"))
count_room_type = df1.sort(df1["counted"].desc()).limit(4)

total earning by room types

df2 = df.groupBy("room_type").agg(sum("price").alias("total"))
total_earn_room_type = df2.sort(df2.total.desc()).limit(4)

Save the DataFrames under different folders within the S3_OUTPUT_DATA bucket

df.write.mode('overwrite').csv(S3_OUTPUT_DATA + "/airbnb-dataframe")
count_room_type.write.mode('overwrite').csv(S3_OUTPUT_DATA + "/count-room-type")
total_earn_room_type.write.mode('overwrite').csv(S3_OUTPUT_DATA + "/total-earn-room-type")
droping.write.mode('overwrite').csv(S3_OUTPUT_DATA + "/droping")

We have written the transformed data to the 'transformed-data' folder in AWS S3. Let’s check our ‘transformed-data’ folder.

transformed-file

We can see the transformed files in the AWS S3 bucket.
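If you want to verify the output programmatically rather than in the console, a quick boto3 listing of the transformed prefix is enough. A minimal sketch:

import boto3

s3 = boto3.client("s3")

# List everything Spark wrote under the transformed-data prefix.
resp = s3.list_objects_v2(
    Bucket="new-york-city-airbnb-2023",
    Prefix="NYC-airbnb-2023/transformed-data/",
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])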

Conclusion

I hope you have fully understood this project. You'll find straightforward, explanatory articles on various data science topics on Analytics Vidhya. Don't forget to revisit this site.
