
Data Lakes with Spark

Overview of the project

This project simulates a scenario in which a music startup, Sparkify, has grown its user base and song database even further and wants to move its data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in their app.

As their data engineer, you are tasked with building an ETL pipeline that extracts the data from S3, processes it using Spark, and loads it back into S3 as a set of dimensional tables. This allows their analytics team to continue finding insights into what songs their users are listening to. The project therefore aims to:

  • Build an ETL pipeline for a data lake hosted on S3
  • Load JSON data from S3
  • Process the data into analytics tables using Spark (Amazon EMR Spark)
  • Load data back into S3 as parquet files

System Architecture

(System architecture diagram)

Datasets

Song Data

The song dataset is a subset of real data from the Million Song Dataset. It is in JSON format and contains metadata about each song and its artist.

Sample Song Data:

{"num_songs":1,"artist_id":"ARD7TVE1187B99BFB1","artist_latitude":null,"artist_longitude":null,"artist_location":"California - LA","artist_name":"Casual","song_id":"SOMZWCG12A8C13C480","title":"I Didn't Mean To","duration":218.93179,"year":0}

Log Data

The log dataset is in JSON format, generated by an event simulator based on the songs in the dataset above.

Sample Log Data:

{"artist":"Des'ree","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":1,"lastName":"Summers","length":246.30812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"You Gotta Be","status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}

ETL Pipeline Process

  1. Load data from S3. Note that the S3 bucket needs to be added to the config file; see Setup Configurations File.
  • Song data: s3a://udacity-dend/song-data/
  • Log data: s3a://udacity-dend/log-data/
  2. Process the data using Spark and transform it into the following 5 tables:

Fact table:

  • songplays - records in the log data associated with song plays, i.e. records with page NextSong

    songplay_id, start_time, userId, level, song_id, artist_id, session_id, location, userAgent, month, year
    

Dimension tables:

  • users - users in the app

    userId, firstName, lastName, gender, level
    
  • songs - songs in the music database

    song_id, title, artist_id, year, duration
    
  • artists - artists in the music database

    artist_id, name, location, latitude, longitude
    
  • time - timestamps of records in songplays broken down into specific units

    ts, start_time, hour, day, week, month, year, weekday
    
  3. Load the processed data back into the data lake on S3 by writing it out as parquet files. The following tables are saved as parquet files partitioned by year and month: songs, time, songplays (a condensed PySpark sketch of these steps follows below).

The output S3 bucket can be designated in the Setup Configurations File section.
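
A condensed sketch of steps 1-3 for the songplays fact table and the time dimension. The column names follow the tables above, but the song-matching join, the ID generation, and the exact S3 glob paths are assumptions about how etl.py could implement this rather than a verbatim copy of it:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sparkify_etl").getOrCreate()

input_data = "s3a://udacity-dend/"
output_data = "s3a://<your_s3_bucket>/data_outputs/"  # taken from dl.config in practice

# 1. Load the JSON data from S3 (glob patterns are assumptions about the directory layout)
song_df = spark.read.json(input_data + "song-data/*/*/*/*.json")
log_df = spark.read.json(input_data + "log-data/*/*/*.json")

# 2. Keep only song-play events and turn the epoch-millisecond ts field into a timestamp
events = (log_df.filter(F.col("page") == "NextSong")
                .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp")))

# time dimension: break start_time down into the units listed above
time_table = (events.select("ts", "start_time").dropDuplicates()
                    .withColumn("hour", F.hour("start_time"))
                    .withColumn("day", F.dayofmonth("start_time"))
                    .withColumn("week", F.weekofyear("start_time"))
                    .withColumn("month", F.month("start_time"))
                    .withColumn("year", F.year("start_time"))
                    .withColumn("weekday", F.dayofweek("start_time")))

# songplays fact table: match log events to songs by title, artist name and duration
songplays = (events.join(song_df,
                         (events.song == song_df.title) &
                         (events.artist == song_df.artist_name) &
                         (events.length == song_df.duration),
                         how="left")
                   .withColumn("songplay_id", F.monotonically_increasing_id())
                   .withColumn("month", F.month("start_time"))
                   .withColumn("year", F.year("start_time"))
                   .select("songplay_id", "start_time", "userId", "level", "song_id", "artist_id",
                           F.col("sessionId").alias("session_id"), "location", "userAgent",
                           "month", "year"))

# 3. Write back to S3 as parquet, partitioned by year and month
time_table.write.mode("overwrite").partitionBy("year", "month").parquet(output_data + "time/")
songplays.write.mode("overwrite").partitionBy("year", "month").parquet(output_data + "songplays/")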

Setup Configurations File

At the project root, create a config file named dl.config with the following configuration.

[AWS] # AWS IAM Credential
AWS_ACCESS_KEY_ID=<iam_access_key_id>
AWS_SECRET_ACCESS_KEY=<iam_secret_access_key>

[S3]
INPUT_DATA=s3a://udacity-dend/
OUTPUT_DATA=s3://<your_s3_bucket>/data_outputs/
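
etl.py can pick these values up with Python's standard configparser and export the IAM credentials as environment variables for the s3a connector. A minimal sketch, assuming dl.config sits in the working directory:

import configparser
import os

# Read dl.config from the project root
config = configparser.ConfigParser()
config.read("dl.config")

# Export the IAM credentials so the s3a:// filesystem can authenticate against S3
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

input_data = config["S3"]["INPUT_DATA"]    # s3a://udacity-dend/
output_data = config["S3"]["OUTPUT_DATA"]  # s3://<your_s3_bucket>/data_outputs/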

How to run

This project runs on a Spark cluster set up on Amazon EMR with the following options:

--use-default-roles  
--release-label emr-5.32.0 
--applications Name=Spark 
--bootstrap-actions Path="s3://<your_S3_bucket>/path/to/bootstrap_emr.sh" 
--ec2-attributes KeyName=spark-cluster 
--instance-type m5.xlarge 
--instance-count 3 
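
The flags above are options to the AWS CLI's aws emr create-cluster command. For reference, a roughly equivalent cluster definition using boto3; the cluster name, region, and explicit role names are assumptions (the roles mirror what --use-default-roles selects):

import boto3

emr = boto3.client("emr", region_name="us-west-2")  # region is an assumption

# Roughly equivalent to the aws emr create-cluster flags above
response = emr.run_job_flow(
    Name="sparkify-spark-cluster",  # hypothetical cluster name
    ReleaseLabel="emr-5.32.0",
    Applications=[{"Name": "Spark"}],
    BootstrapActions=[{
        "Name": "bootstrap_emr",
        "ScriptBootstrapAction": {"Path": "s3://<your_S3_bucket>/path/to/bootstrap_emr.sh"},
    }],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "spark-cluster",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # what --use-default-roles resolves to
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])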

Note:

  1. Make sure that the file to be run is on your Spark cluster. If your development environment is your local laptop, copy the file to the EMR master node over SSH:
scp -i <key_pair_pem_file.pem> <local_file_path e.g. etl.py> hadoop@<EMR_MasterNode_endpoint>:~/<path_location>
  2. Make sure that your EMR DefaultRole has an S3 access policy attached.

Once the file is in place and your EMR cluster is in a ready state, kick off the Spark job:

spark-submit --master yarn --deploy-mode client --driver-memory 4g --num-executors 2 --executor-memory 2g --executor-cores 2 etl.py

Project Files

  • data - a small local copy of the data source for data profiling and testing
  • data_profiling.ipynb - a Python notebook for data profiling to examine and summarize the data source before the actual pipeline implementation
  • etl.py - a Python file implementing the actual ETL data pipeline process for all datasets
  • requirements.txt - a text file containing all mandatory dependencies for the project

Project Author

  • Author: Thitiwat Watanajaturaporn
  • Note: this project is part of Udacity's Data Engineering Nanodegree Program.
