abelfp/data_engineering_cmb

Data Engineering Capstone Project - CMB Simulated Data

The write-up for this project is at src/write_up/project_write_up.pdf.

CMB Data

The data processed in this package was obtained from LAMBDA - NASA. You can learn more about the data here. The simulated data consists of Halo and Galaxy catalogs, which can be downloaded as binary or ASCII files. I downloaded the ASCII files and uploaded them to the S3 bucket s3://abelfp-physics-data-raw/ in the us-east-1 region.

The Galaxy catalogs weigh over 80 GB, and the Halo catalogs weigh over 800 MB.

This package transforms the data into Parquet format and loads it into a data lake in the bucket s3://abelfp-physics-data/data_lake/ under cmb_simulated/ and halo_simulated/. cmb_simulated is partitioned by frequency and source_type because the source files are split by galaxy type: Basic Infrared, High Flux Infrared, and Radio.
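As a rough illustration of the partition layout described above, the sketch below builds the Hive-style `key=value` prefix that Spark produces when a DataFrame is written with `partitionBy`. The bucket and table names come from this README; the helper `partition_prefix` and the example values (`148`, `radio`) are hypothetical and only show the shape of the path, not the actual partition values in the data lake.

```python
# Sketch only: shows the directory layout that a partitioned Parquet write
# produces. In PySpark the write itself would look something like
#   df.write.partitionBy("frequency", "source_type").parquet(path)
# (assumed here; the pipeline's real write call is not shown in this README).

DATA_LAKE = "s3://abelfp-physics-data/data_lake"  # bucket path from this README


def partition_prefix(table: str, **partitions: str) -> str:
    """Return the Hive-style prefix for one partition of a table.

    Keyword order is preserved, so it should match the partitionBy order.
    """
    parts = [f"{key}={value}" for key, value in partitions.items()]
    return "/".join([DATA_LAKE, table, *parts]) + "/"


# Hypothetical partition values, for illustration only.
example = partition_prefix("cmb_simulated", frequency="148", source_type="radio")
print(example)
```

Because the partition columns are encoded in the path, a query that filters on frequency or source_type only needs to read the matching prefixes.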

Running the Data Pipelines

The pipelines run on AWS EMR. I use the AWS CLI to create the cluster, bootstrap the script, and run it as a Spark step. You can configure the cluster hardware in bin/create_submit_emr_job.sh and any Spark settings in configuration/cmb_data_steps.json.
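For readers unfamiliar with EMR step files, here is a minimal sketch of what a Spark step definition can look like, generated from Python. The actual contents of configuration/cmb_data_steps.json are not shown in this README, so the script URI, step name, and Spark settings below are all hypothetical; only the general shape (a list of steps with `Type`, `Name`, `ActionOnFailure`, and `Args`) follows the format the AWS CLI accepts for `--steps`.

```python
import json


def spark_step(script_uri, spark_conf, name="cmb_data_pipeline"):
    """Build one EMR step of type Spark; Args are passed on to spark-submit."""
    args = ["--deploy-mode", "cluster"]
    # Each Spark setting becomes a "--conf key=value" pair.
    for key, value in sorted(spark_conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(script_uri)  # the application script goes last
    return {
        "Type": "Spark",
        "Name": name,  # hypothetical step name
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "Args": args,
    }


# Hypothetical script location and settings, for illustration only.
steps = [
    spark_step(
        "s3://example-bucket/scripts/cmb_pipeline.py",
        {"spark.executor.memory": "8g", "spark.sql.shuffle.partitions": "200"},
    )
]
print(json.dumps(steps, indent=2))
```

Keeping Spark settings in the step file rather than in the script means they can be tuned per run without touching the pipeline code.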

To run the pipelines, I need the AWS CLI set up with a user that has permission to create EMR clusters. Then I run:

$ bash bin/create_submit_emr_job.sh
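The contents of bin/create_submit_emr_job.sh are not reproduced here, but a script of this kind typically wraps a single `aws emr create-cluster` call. The sketch below assembles such a command as an argument list and prints it without executing anything. The cluster name, release label, instance type, instance count, and bootstrap script URI are hypothetical placeholders, and the region is assumed to match the one used for the raw data bucket; only the steps file path is taken from this README.

```python
import shlex


def create_cluster_command(name, release_label, instance_type, instance_count,
                           bootstrap_uri, steps_file, region="us-east-1"):
    """Assemble an `aws emr create-cluster` invocation as an argv list."""
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--region", region,
        "--release-label", release_label,
        "--applications", "Name=Spark",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
        "--bootstrap-actions", f"Path={bootstrap_uri}",
        "--steps", f"file://{steps_file}",
        "--auto-terminate",  # shut the cluster down once the steps finish
    ]


# All values except the steps file path are hypothetical placeholders.
cmd = create_cluster_command(
    name="cmb-data-pipeline",
    release_label="emr-6.15.0",
    instance_type="m5.xlarge",
    instance_count=3,
    bootstrap_uri="s3://example-bucket/bootstrap/install_deps.sh",
    steps_file="configuration/cmb_data_steps.json",
)
print(shlex.join(cmd))  # dry run: print the command instead of running it
```

Printing the command first makes it easy to review the hardware and step settings before paying for a cluster.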

About

Data Engineering project for processing Cosmic Microwave Background simulated data.
