In the final project of the Data Engineering Nanodegree Udacity gave us the opportunity to select data sources for ourselves to build a data pipeline project.
I was interested in world development in terms of:
- trade
- climate change and ecological impact
- population and health
- food and nutrition
- education
I checked out datasets given in research/_DataResearch.md. For the sake of a realistic scope I limited myself to datasets involved with trades and temperature per year.
This are the datasets that I chose to work with:
- Global Commodities and Trade Statistics of the United Nations (577.462 rows) source and
- Climate Change: Earth Surface Temperature Data from NOAA (8.225.871 rows) source
In preparation for the staging tasks, I did exploratory data analysis for both datasets, documented here:
- [Global Commodities and Trade Statistics](notebooks/Data Exploration - Global Commodity Trade Statistics.ipynb) and
- [ Climate Change: Earth Surface Temperature Data](notebooks/Data Exploration - Earth Surface Temperature Data.ipynb)
Basic results (column count, names, data types, numeric value ranges, string examples and missing value analysis) can also be found here:
The source data consists of two CSV files:
- commodity_trade_statistics_data.csv (1.1 GB) and
- GlobalTemperatures.csv (21.6 MB)
that have been uploaded to AWS S3 for access. For uploading this script was used: data/helper/load_to_s3.py.
The project goal is to create a joint data schema for the above selected datasets as well as a pipeline to import the data into it. Possible use cases are:
- Which countries where the hottest over the years?
- What is the top imported/exported commodity by country and year
- What is the main commodity imported/exported by trade_usd by the hottest/coldest countries?
Some results using the finished database can be viewed here: [World Development Use Cases](notebooks/World Development Use Case.ipynb)
The raw data is downloaded from kaggle and uploaded to S3. A airflow-powered data pipeline is used to ETL the data into a constellation schema. Therefore data is staged - using Spark for the trades data and pandas for the temperature data. Subsequently a common dimension table for country_and_area has been devised - partly by hand. Next more dimensional tables and ultimately fact tables have been created automatically.
Finally, a data quality check is used to check that data has been ETLed as expected.
The complete processes airflow DAG looks like this:
An airflow DAG orchestrates the whole process. It consists of the following steps:
- Begin_execution: does nothing but indicating the start of the pipepline.
- Recreate_db_schema: (Re-)creates the target database's constellation schema (executes DROP and CREATE SQL Statements) and the staging tables.
- Staging (NB: staging tables are not part of the final DB schema displayed below)):
- stage_commodities: Staging of the global temperatures data (executes a PythonOperator using pandas) into temperature_staging table.
- stage_temperatures: Staging of global commodities and trade statistics data (executes a BashOperator that executes a Spark Job, run on a standalone cluster in docker in client mode) into commodities_staging table.
- Create a joint country_and_area and regions table from both sources (temperature and commodities) staging tables:
- update_temperature_countries: country_and_area labels in temperature_staging, according to update_temperature_countries, including rules like writing conjunctions in smaller case and 'Islands' instead of 'Isla' or 'Islds'.
- copy_temperature_countries: copy country_and_area labels in temperature_staging: for country_or_area labels present in commodities_staging, but not in temperature_staging, entries of the geographically closest country_or_area where copied and named according to the country_or_area label in commodities_staging. A list can be found here: copy_temperature_countries.
- update_commodity_countries: update country_and_area labels in temperature_staging, according to update_commodity_countries, including rules like writing 'Islands' instead of 'Isla' or 'Islds' or shortened country_or_area names (i.e. 'Bolivia (Plurinational State of)' => 'Bolivia').
- create_common_countries_table: unify country_and_area labels of temperature_staging and commodities_staging and input its data into the dimensional country_and_area table.
- Load_dim_<dim_table_name> with dim_table_name in (flow, commodities, categories, quantities). Creating dimensional tables from staged global trade statistics.
- Load_fact_temperatures_table: Creates fact table 'temperatures' from temperature_staging and country_or_area table.
- Load_fact_trades_table: Create fact table 'trades' from commodities_staging and dimensional tables (country_or_area, flow, commodities, categories, quantities) and fact table 'temperatures'.
- Check_data_quality: Checks the number of records saved to the database by comparing these with expected values found in in dictionary expected_table_counts.
- End_execution: does nothing but indicating the end of the pipeline.
The names in cursiv represent the task_ids of the tasks in the world_dag.py file. You can inspect these to see what (custom) operators and scripts have been used.
The resulting constellation schema (i.e. a snowflake schema with more than one facts table) looks like this:
- flows: dimension table - indicates type of trade (import, export, re-import, re-export)
- commodities: dimension table - type of commodity that has been traded
- categories: dimension table - category of commodity type that has been traded (read: group of commodites)
- quantities: dimension table - weight, volume or length, no pieces etc.
- country_or_area: dimension table - consolidated countries table from stage_commodities and stage_temperature
- temperatures: facts table - provides average temperatures for year and country_or_area as well as average certainty of the measurement
- trades: facts table - provides commodities in the qu for year and country
Legend: PK = Primary Key FK = Foreign Key
table: flows
Column | Type | PK | FK to table |
---|---|---|---|
flow_id | INT | Yes | |
flow_type | VARCHAR |
table: commodities
Column | Type | PK | FK to table |
---|---|---|---|
commodity_id | INT | Yes | |
commodity_name | VARCHAR | ||
commodity_code | VARCHAR | ||
category_id | VARCHAR | categories |
table: categories
Column | Type | PK | FK to table |
---|---|---|---|
category_id | INT | Yes | |
category_name | VARCHAR |
table: quantities
Column | Type | PK | FK to table |
---|---|---|---|
quantity_id | INT | Yes | |
quantity_name | VARCHAR |
table: country_or_area
Column | Type | PK | FK to table |
---|---|---|---|
country_or_area_id | INT | Yes | |
country_or_area | VARCHAR |
table: temperatures
Column | Type | PK | FK to table |
---|---|---|---|
temperature_id | INT | Yes | |
country_or_area_id | INT | country_or_area | |
year | INT | ||
temperature | FLOAT | ||
certainty | FLOAT | ||
rank | FLOAT |
table: trades
Column | Type | PK | FK to table |
---|---|---|---|
trade_id | INT | Yes | |
flow_id | INT | flows | |
commodity_id | INT | commodities | |
quantity_id | INT | quantities | |
country_or_area_id | INT | country_or_area | |
ntity_amount | FLOAT | ||
quantity_amount | FLOAT | ||
trade_usd | FLOAT | ||
year | INT | ||
weight_kg | INT | ||
temperature_id | INT | temperatures |
You can find an outline of the current tech stack in the docker-compose.yml
It consits of:
- Apache Airflow as pipeline orchestrator, run with a SequentialExecutor:
- a container (webserver) for the webservice: container is based on puckel/docker-airflow_image, extend with info from this medium article, result can be found here: Docker.airflow_spark
- a container (postgres) for the configuration database: postgres:9.6
- Apache Spark as cluster computing framework, run as a stand-alone cluster:
- a container (spark-master) for one master node: based on bde2020/spark-master:2.4.4-hadoop2.7 image
- a container (spark-worker-1) as worker node: based on bde2020/spark-worker:2.4.4-hadoop2.7
- PostgreSQL as data storage for staged and finally processed data in above described schema, a container based on docker image: postgres:10.10-alpine
- AWS S3 as file storage for the input data
I choose this stack, as I wanted to create a locally executable version of the whole project, that with some adaptations (i.e. move to kubernetes) could be run in the cloud as well, i.e. AWS EKS.
How would the current approach be changed if one of the following conditions materialized:
To remove bottlenecks at the staging step, the temperature_staging which is currently performed by a PythonOperator could be reformulated to a Spark Job.
As such it could be run like the commodities_staging.py already is.
In this project the Spark integration is using a stand-alone cluster with one master and one worker node.
This could be extended by adding more worker nodes. For best performance also the stage_commodities.py can be optimized.
And obviously if this was not enough to be run on a single machine based on the docker-compose.yml a kubernetes implementation could be built and run on AWS or GCP.
Postgres could manage 100x (100 GB in this case).
But in order to be highly available a multi-machine setup could be used.
Also AWS Redshift could be used as Cloud alternative to easily be able to also work with 1000x data.
This wouldn't be to hard to achieve:
- An AWS Redshift cluster of appropriate size needed to be setup
- a new AWS connection needed to be defined in Airflow and
- minor changes needed to be made to the operators and the SQL code to make it work with AWS Redshift capabilities. This could be:
- change autoincrement from SERIAL to IDENTITY(1, 1)
- adding dist_keys, sort_keys and
- distribution strategies
As stated for scenario 1, we would need more parallelization for the Spark Jobs to achieve results readily. Again for more processing power, more nodes and more machines could be used locally or in the cloud using Kubernetes instead of docker alone. Furthermore, in the staging phase, I would not move the data directly into the database, but into multiple files, e.g. in S3 or HDFS. From there they could be loaded in a parallel fashion into the database and further processed in parallel on AWS Redshift.
Again, parallelization is key to meet this requirement. Using multiple machines and for smooth scaling the usage of a cloud provider would make best sense.
- This tutorial assumes a Ubuntu installation (specifically: 18.04)
- Assumes pipenv
Create your project folder and cd into it. Now enter:
> git clone https://github.com/genughaben/world-development.git
> pipenv install
> pipenv shell
Copy and customize config values:
> cp config_template.cfg config.cfg
Now, customize values in the newly created config.cfg as required. Primarily you need to add your AWS credentials to access S3.
NB: config.cfg is automatically excluded from git repo. If you should use another name, add it got .gitignore and update config variable usage across project.
> docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" -t puckel/docker-airflow .
> docker build --rm --build-arg PYTHON_DEPS="flask_oauthlib>=0.9" -t puckel/docker-airflow .
> cd airflow/docker
> docker-compose up
After starging, you can reach airflows UI in your browser entering: localhost:8080/admin.
Goto localhost:8080/admin
- Open Admin -> Connections
- Create new connection with clicking on "Create"
Field | Field value |
---|---|
Conn Id | postgres |
Conn Type | Postgres |
Host | db |
Schema | world |
Login | genughaben |
Password | docker |
Port | 5432 |
- Switch on 'world_dag'
- Execute 'world_dag' by clicking on 'Trigger DAG'-button (first button in Links column)
> docker-compose down
> docker-compose down
> docker-compose up --force-recreate
- Update init.sql in local/script folder as you please
Make sure your docker-container is running
> docker exec -it local_db_1 bash
bash> psql -U genughaben -d world
This section is for local development only. Only necessary in case you are interested in writing commodities data into the PostgreSQL DB on your local machine.
Prerequisits:
- Make sure you have a local installation of PostgreSQL and its running. See how-to here: Development Utils
Login to Postgresql CLI and create a Database called 'dummy'.
> sudo -u postgres psql
# Make sure your use as CreateDB rights
psql-cli> CREATE DATABASE dummy;
Now you can execute:
# create world Database and Tables
> python local-etl/create_table.py
# execute ETL:
> python local-etl/commodities-pandas-etl.py
NB: Only works if you have locally installed PySpark available!
> psql -U genughaben -d postgres
# Now create a Databse if you have not done so alread:
psql-cli> CREATE DATABASE dummy;
If you have customized your config.cfg as required you can run:
> python local-etl/spark-etl
to execute the spark based etl script.
First start the the container setup:
> cd airflow/docker
> docker-compose up
Submit Spark job from terminal outside docker:
# First enter container:
> docker exec -it local_spark_1 bash
# Next execute spark script like so:
> /spark/bin/spark-submit --master local[*] --driver-class-path spark/dependencies/postgresql-42.2.8.jar spark/scripts/stage_commodities.py
First start the the container setup:
> cd airflow/docker
> docker-compose up
In order to execute a Spark Job from your local terminal so that it runs in the docker container, you need the master nodes IP-Address. You get it entering:
> docker ps
# find the ID (first column) of the spark-master container
> docker inspect <ID-of-spark-master-container>
You find the IP-Address on the bottom in: "Networks" => ... => "IPAddress"L <INPUT_SPARK_MASTER_CONTAINER_IP>
In airflow/docker/spark/local_container_spark_submit.sh substitute <INPUT_SPARK_MASTER_CONTAINER_IP> with the IP-Address. Now execute:
> bash local_container_spark_submit.sh
> sudo apt-get install postgresql postgresql-contrib
Creating a role:
> sudo -u postgres psql
Create a Postgres user for airflow (still in psql console)
postgres=# CREATE USER <your-user-name> PASSWORD <your password>;
CREATE ROLE
postgres=# CREATE DATABASE <your-db-name>;
CREATE DATABASE
postgres=# GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO <your-user-name>;
GRANT
postgres=# \du
List of roles
Role name | Attributes | Member of
— — — — — -+ — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — + — — — — — -
<your-user-name> | | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
Add CREATE DB rights to :
psql> ALTER USER <your-user-name> CREATEDB;
Not exit console and check whether database is setup and can be accessed by user
> psql -d airflow
psql (10.10 (Ubuntu 10.10-0ubuntu0.18.04.1))
Type "help" for help.
airflow=> \conninfo
You should see something like this:
You are connected to database "airflow" as user "<your-user-name> " via socket in "/var/run/postgresql" at port "5432".
airflow=>
In order to allow airflow access to Postgres, pg_hba.conf needs to be configured:
> sudo nano /etc/postgresql/10/main/pg_hba.conf
Change int entry 'IPv4 local connections' the ADDRESS to 0.0.0.0/0 and the METHOD to trust. In the end there should be:
# IPv4 local connections:
TYPE DATABASE USER ADDRESS METHOD
...
# IPv4 local connections <- replace line after this with next line
host all all 0.0.0.0/0 trust <- use this line as replacement
You now need to restart Postgres entering:
> sudo service postgresql restart
Open postgresql.conf entering:
> sudo nano /etc/postgresql/10/main/postgresql.conf
Update in the section 'CONNECTIONS AND AUTHENTICATION' listen_addresses from 'localhost' to '*'
# — Connection Settings -
#listen_addresses = ‘localhost’ # what IP address(es) to listen on; <-- before
listen_addresses = ‘*’ # for Airflow connection <-- after
And restart Postgres again:
> sudo service postgresql restart
Based on https://medium.com/@taufiq_ibrahim/apache-airflow-installation-on-ubuntu-ddc087482c14
Also helpful: https://medium.com/@srivathsankr7/apache-airflow-a-practical-guide-5164ff19d18b
Using local PostgreSQL for testing purposes Enter CLI:
> sudo -u postgres psql
Helpful commands:
command | comment |
---|---|
\du | show user |
\l | displays list of databases |
\c <database_name> | choose database for usage |
\dt | show tables |
\d <table_name> | show table schema |
Further reading: