This project was created to demonstrate cloud integrations in Airflow and the ability to process and transfer files across multiple cloud providers with built-in Airflow operators.
Transfer a GCP bucket to an AWS bucket using the built-in Airflow operator GoogleCloudStorageToS3Operator.
- AWS account
- GCP account
- Python 3.7 on the installation host (with virtualenv installed)
Before proceeding with the installation, make sure you're using a GCP service account with the right permissions to access bucket objects. You can use https://console.cloud.google.com/iam-admin/troubleshooter and search for storage.objects.list.
- Step 1 - Create and activate a virtual environment (Python 3.7). The steps below assume you're located at the project root folder. You can skip this step if you open the project with PyCharm and create the virtual environment from there.
- Step 1.1 - Create the virtual environment (the name is up to you; venv is recommended):
virtualenv --python=/Library/Frameworks/Python.framework/Versions/3.7/bin/python3 venv
Note: confirm your Python 3.7 path first, since it may differ on your system.
- Step 1.2 - Activate the virtual environment:
source venv/bin/activate
Note: once you have created and activated your virtual environment, you need to set it as the Python interpreter in PyCharm.
- Step 1.3 - Install the dependencies (command-line instructions; this can also be done with PyCharm tools, just make sure you're using the project virtual environment and not the global Python installation):
pip install -r requirements.txt
- Step 2 - Initialize Airflow and render the DAGs.
- Step 2.1 - Define the Airflow home: open two terminals at the project root and execute the command below in both of them. This lets Airflow know which folder to use for the Airflow instance (the project application).
export AIRFLOW_HOME=./airflow
- Step 2.2 - Initialize Airflow: in one of the terminals where you defined AIRFLOW_HOME, execute the command below to initialize the Airflow instance (after you run it, you'll see additional files related to the Airflow instance inside the $AIRFLOW_HOME directory).
airflow initdb
- Step 2.3 - Start Airflow: this step and the next one will start two background processes; both of them require $AIRFLOW_HOME to be defined.
- Step 2.3.1 - Start the Airflow scheduler: in one of the two sessions where AIRFLOW_HOME is defined, run the command below to start the scheduler:
airflow scheduler
- Step 2.3.2 - Start the Airflow webserver: the only thing left is to turn on the webserver so we can start running our DAGs, so in the other terminal run the command below (note: you can specify a different port if you want).
airflow webserver -p 8084
- Step 2.3.3 - Open the Airflow webserver and verify the installation: at this point we only need to verify that everything is running as expected and that our DAGs (located in $AIRFLOW_HOME/dags/) are rendered on the dashboard (note: even though the webserver should display them almost immediately, refresh the browser after a minute just to make sure; it shouldn't take longer than that).
Open: http://localhost:8084/admin/
- Step 3 - Cloud accounts configuration.
- Step 3.1 - Airflow connections: this step defines the cloud provider connections and allows the Airflow operators to authenticate.
- Step 3.1.1 - AWS connection: in this example we'll use the default connection for AWS, which is aws_default. On the webserver, go to Admin -> Connections and click the edit button for the aws_default connection. Validate that the connection type is Amazon Web Services and that Extra has the region of your destination bucket. Then set Login to your AWS_ACCESS_KEY and Password to your AWS_SECRET_ACCESS_KEY.
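Optionally, you can sanity-check the saved credentials before running the DAG. A minimal sketch, assuming Airflow 1.10-style imports (consistent with the contrib operator used in this project) and the example destination bucket name from Step 3.2.1; run it inside the project virtual environment with AIRFLOW_HOME exported:

```python
# Hedged sanity check: confirms the aws_default connection can reach the
# destination bucket configured in the Airflow UI.
from airflow.hooks.S3_hook import S3Hook

hook = S3Hook(aws_conn_id="aws_default")
# "aws-destination-bucket" is the example name used in Step 3.2.1;
# replace it with your real destination bucket.
print(hook.check_for_bucket("aws-destination-bucket"))  # True if accessible
```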
- Step 3.1.2 - GCP connection: similar to the AWS connection, we'll use the default GCP connection (already created by the Airflow installation): go to Admin -> Connections and click the edit button for the google_cloud_default connection. Validate that the connection type is Google Cloud Platform and, for Scopes, assign https://www.googleapis.com/auth/cloud-platform. Then set Project Id according to your GCP project and Keyfile JSON with the content of your service account key (guide: https://cloud.google.com/iam/docs/creating-managing-service-account-keys).
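Similarly, you can verify that the service account behind google_cloud_default actually has the storage.objects.list permission mentioned at the top of this guide. A minimal sketch under the same assumptions (Airflow 1.10-style imports, example source bucket name from Step 3.2.2):

```python
# Hedged sanity check: lists objects in the source bucket through the
# google_cloud_default connection (this exercises storage.objects.list).
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

hook = GoogleCloudStorageHook(google_cloud_storage_conn_id="google_cloud_default")
# "gcp-source-bucket" is the example name used in Step 3.2.2;
# replace it with your real source bucket.
print(hook.list("gcp-source-bucket"))
```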
- Step 3.2 - Airflow variables: since the POC DAG (airflow/dags/transfer_data_gcp_to_aws_dag.py) has its source (GCP bucket) and destination (AWS bucket) parametrized, Airflow variables store the values for them (a sketch of how the DAG can read these variables follows after Step 3.2.2).
- Step 3.2.1 - AWS_BUCKET variable: go to Admin -> Variables and create a new variable named AWS_BUCKET. The format of the bucket name is important; make sure you use a valid format like s3://aws-destination-bucket/.
- Step 3.2.2 - GCP_BUCKET variable: go to Admin -> Variables and create a new variable named GCP_BUCKET. For this variable the GCP operator does not require a prefix (gs://) on the bucket name; a valid value for GCP_BUCKET is gcp-source-bucket.
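For reference, here is a minimal sketch of how a DAG like transfer_data_gcp_to_aws_dag.py can wire these two variables into GoogleCloudStorageToS3Operator. This is illustrative only, assuming Airflow 1.10-style imports; the dag_id, schedule and default_args are placeholders and the actual file under airflow/dags/ may differ:

```python
# Illustrative sketch (not the project's actual DAG): copies every object from
# the bucket in the GCP_BUCKET variable to the s3:// prefix in AWS_BUCKET.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcs_to_s3 import GoogleCloudStorageToS3Operator
from airflow.models import Variable

default_args = {"owner": "airflow", "start_date": datetime(2020, 1, 1)}

with DAG(
    dag_id="transfer_data_gcp_to_aws_dag",   # placeholder id
    default_args=default_args,
    schedule_interval=None,                  # manual runs only; adjust as needed
) as dag:
    transfer = GoogleCloudStorageToS3Operator(
        task_id="gcs_to_s3",
        bucket=Variable.get("GCP_BUCKET"),        # e.g. gcp-source-bucket
        google_cloud_storage_conn_id="google_cloud_default",
        dest_aws_conn_id="aws_default",
        dest_s3_key=Variable.get("AWS_BUCKET"),   # e.g. s3://aws-destination-bucket/
        replace=True,                             # overwrite existing destination objects
    )
```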
- Step 3.3 - Local AWS config: the AWS side works with boto3, therefore we need to create a file for AWS account authentication. For this, run:
touch ~/.boto
The content of .boto should be:
[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
And that's all. You can test the DAG with a manual run and then schedule it as per your needs.