### <u>**SETTING UP DATA PIPELINE FOR AWS S3, AWS EC2 AND AIRFLOW DAG** </u>


#### NOTE:

- These instructions assume that you already have:
  - a Python script containing your function code.
  - a DAG file containing your DAG code.
  - You can have a look at my DAG and Python files in my [Github repo](https://github.com/benkeyben/10alytics_air_realtor/tree/main). If you find it helpful, please give it a star using the button at the upper right of the repo page. I'd be glad if you did.
- Commands are **_italicized_** and **_bolded_**. Copy them carefully.


### Setting up Git and VS Code

- Install Git on your local machine.
- Open VS Code and navigate to your working folder where you have your dag and python files.
- Access the terminal in VS Code by pressing **_Ctrl + Shift + backtick_**.
- Choose Git Bash from the dropdown menu on the top right corner of the terminal.


### Create EC2 Instance and Key Pair in AWS console

- Launch an EC2 instance, generate a key pair, and save the private key file in your working directory.

- Select **_Ubuntu OS_** and **_t2.small_** or **_t3.small_** as the instance type.

- Enable SSH, HTTPS, and HTTP traffic in Network settings.

- Click on Launch instance at the bottom part.

- Select your newly created instance and click on the Connect tab.

- Click on the SSH client tab.

- Copy the second line of code, which looks like
  **_chmod 400 your-ec2-key-pair-name_**, and run it in your Git Bash terminal.

- Copy the SSH connection command in the Example section that starts with **_ssh -i_**.

- Paste it in the Git Bash terminal you opened in VS Code and follow the prompts.
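
The chmod and ssh steps above can be sketched as one small helper. The key file name and host are placeholders you supply (assumptions, not values from this guide), and `ssh_command_for` is a hypothetical name. The helper only restricts the key file and prints the command, so you can inspect it before running it.

```shell
# Sketch: restrict the private key, then print the SSH command to run.
# ssh_command_for is a hypothetical helper; both arguments are placeholders.
ssh_command_for() {
    key_file="$1"   # path to the private key file saved in your working directory
    host="$2"       # the instance's Public IPv4 DNS from the EC2 console

    # SSH refuses private keys that other users can read.
    chmod 400 "$key_file" || return 1

    # "ubuntu" is the default login user on Ubuntu AMIs.
    printf 'ssh -i "%s" ubuntu@%s\n' "$key_file" "$host"
}
```

Calling it with your key file and the instance's public DNS prints the same kind of command the console shows in its Example section.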


### Configure Ubuntu SSH Terminal

- Update the package index with

  **_sudo apt update_**

- Install Python 3 package manager (Pip)

  **_sudo apt install python3-pip_**

- Install SQLite and Python 3.10 virtual environment module.

  **_sudo apt install sqlite3_**

  **_sudo apt install python3-venv_**

- Create and activate a Python virtual environment.

  **_python3 -m venv venv_**

  **_source venv/bin/activate_**

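- Install Apache Airflow version 2.8.1 with PostgreSQL support. The constraints file in the URL should match your Python version (3.10 is assumed here; check yours with **_python3 --version_**).

  **_pip install 'apache-airflow[postgres]==2.8.1' --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.10.txt"_**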

- Initialize the Airflow metadata database. At this point it still uses the default SQLite backend.

  **_airflow db migrate_**

- Install PostgreSQL and additional contrib packages.

  **_sudo apt-get install postgresql postgresql-contrib_**

- Switch to the 'postgres' user, open the psql prompt, create a database and user, and grant privileges.

  **_sudo -i -u postgres_**

  **_psql_**

  **_CREATE DATABASE airflow;_**

  **_CREATE USER airflow WITH PASSWORD 'airflow';_**

  **_GRANT ALL PRIVILEGES ON DATABASE airflow TO airflow;_**

- Press **_Ctrl + D_** twice to leave the psql prompt and the postgres shell and return to your SSH terminal.

- Navigate to airflow directory.

  **_cd airflow_**

- Update Airflow configuration to use PostgreSQL.

  **_sed -i 's#sqlite:////home/ubuntu/airflow/airflow.db#postgresql+psycopg2://airflow:airflow@localhost/airflow#g' airflow.cfg_**

- Modify Airflow configuration to use LocalExecutor.

  **_sed -i 's#SequentialExecutor#LocalExecutor#g' airflow.cfg_**

- Initialize the Airflow metadata database again with the updated configuration.

  **_airflow db migrate_**

- Create an Airflow user with administrative privileges.

  **_airflow users create -u airflow -f airflow -l airflow -r Admin -e airflow@gmail.com_**

- If asked for a password, enter one you can remember. You can use airflow as the password (not recommended in real production mode).
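
A quick way to confirm that both sed edits above took effect is to print the two settings they are meant to change. This is a minimal sketch: `show_airflow_settings` is a hypothetical helper name, and the default path assumes the Airflow home is `~/airflow`, as in the steps above.

```shell
# Sketch: print the two settings the sed commands are meant to change.
# show_airflow_settings is a hypothetical helper, not an Airflow command.
show_airflow_settings() {
    cfg="${1:-$HOME/airflow/airflow.cfg}"   # assumed default location

    # Match only the actual setting lines, not comments that mention them.
    grep -E '^(sql_alchemy_conn|executor)[[:space:]]*=' "$cfg"
}
```

After the edits, the `executor` line should read `LocalExecutor` and the `sql_alchemy_conn` line should start with `postgresql+psycopg2://`. If either still shows the old value, the sed pattern probably did not match the text in your file.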


### Open Port 8080 on EC2 Instance.

- In the EC2 Dashboard, select the instance and navigate to the **_Security_** tab. If the instance has not been created yet, create it first.

- Locate and click on **_Security group_**.

- Click **_Edit inbound rules_** in the Inbound rules section, then click the **_Add rule_** button. The inbound rules are listed in order, so scroll to the last one to fill in the new rule.
  - Under **_Type_**, select Custom TCP.
  - Under **_Port range_**, type 8080.
  - Under **_Source_**, select Anywhere-IPv4.
- Click on the Save rules button.
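
For reference, the same inbound rule can be added from a terminal. This is a sketch that assumes the AWS CLI is installed and configured with credentials for your account, which this guide does not otherwise require. `open_airflow_port` is a hypothetical helper name, and the security group ID is a placeholder you copy from the console.

```shell
# Sketch: add the port 8080 inbound rule with the AWS CLI instead of the console.
# Assumes the AWS CLI is installed and configured (an assumption, not a step
# in this guide). open_airflow_port is a hypothetical helper name.
open_airflow_port() {
    group_id="$1"   # the security group ID shown on the instance's Security tab

    # Equivalent of: Type = Custom TCP, Port range = 8080, Source = Anywhere-IPv4.
    aws ec2 authorize-security-group-ingress \
        --group-id "$group_id" \
        --protocol tcp \
        --port 8080 \
        --cidr 0.0.0.0/0
}
```

The `0.0.0.0/0` source mirrors the console steps above. Replacing it with your own IP address in CIDR form is a tighter choice if only you need access.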


### Run Airflow Webserver and Scheduler in Ubuntu SSH Terminal

- On the Ubuntu SSH terminal, run

  **_airflow webserver &_**

  to start the webserver

- Wait for the prompt to return (press Enter if the webserver logs cover it) and run

  **_airflow scheduler_**

  to start the scheduler.


### Access Airflow UI

- In EC2 Dashboard, select your EC2 instance.

- Under **_Public IPv4 DNS_**, copy the address.

- Paste it in a new browser tab and add ":8080" at the end (e.g. **_http://ec2-23-43-67-23.compute-1.amazonaws.com:8080_**).

- Log in with the Airflow username and password you created above in the **_Configure Ubuntu SSH Terminal_** section (i.e. username: airflow, password: airflow).
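
If the page does not load, a terminal check against the webserver's `/health` endpoint can help narrow down where the problem is. This is a minimal sketch: `airflow_health` is a hypothetical helper name, and the host argument is a placeholder.

```shell
# Sketch: query the Airflow webserver's /health endpoint on port 8080.
# airflow_health is a hypothetical helper name.
airflow_health() {
    host="${1:-localhost}"   # pass the Public IPv4 DNS to test from outside

    # -f: fail on HTTP errors, -sS: quiet but still print errors,
    # --max-time: give up instead of hanging on a blocked port.
    curl -fsS --max-time 10 "http://${host}:8080/health"
}
```

Run it with no argument on the EC2 instance first. If that returns a JSON status but the same call from your own machine with the public DNS times out, the inbound rule for port 8080 is the more likely cause.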


### Configure Custom DAG Folder in Ubuntu SSH Terminal

- Stop the webserver by pressing **_Ctrl + C_** twice in the Ubuntu SSH Terminal.

- Navigate to the Airflow directory and create a new DAG folder to keep your DAG and Python files.

  **_cd ~/airflow_**

  **_mkdir your_dag_folder_name_here_**

- Edit the airflow.cfg file: update the dags_folder path after the last forward slash so it points to your new folder, and change the value of load_examples to False.

  **_nano airflow.cfg_**

  **_dags_folder = /home/ubuntu/airflow/your_dag_folder_name_here_**

  **_load_examples = False_**

- Save and exit the configuration file.

  **_Ctrl + O_**, then **_Enter_**, then **_Ctrl + X_**

- Navigate to the new DAG folder, create the DAG file and the Python file, and paste the corresponding code into each.

  **_cd your_dag_folder_name_here_**

  **_nano your_python_filename.py_**

  **_nano your_dag_filename.py_**

  Make sure to save and exit.

- You can have a look at my dag and python files at my [Github repo](https://github.com/benkeyben/10alytics_air_realtor/tree/main)

- Run

  **_airflow db migrate_**

  to apply the changes, and check the terminal for errors.

- Run the scheduler with

  **_airflow scheduler_**
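
Before starting the scheduler, it can help to confirm that Airflow picked up the new folder and can import your files. The sketch below wraps three Airflow 2.x CLI commands; `check_dag_setup` is a hypothetical helper name, and it assumes the virtual environment is still active.

```shell
# Sketch: confirm Airflow sees the new folder and can import the DAG files.
# check_dag_setup is a hypothetical helper name; run it with the venv active.
check_dag_setup() {
    # Print the values Airflow actually resolved from airflow.cfg.
    echo "dags_folder:   $(airflow config get-value core dags_folder)"
    echo "load_examples: $(airflow config get-value core load_examples)"

    # List DAG files that failed to import, with the error for each.
    airflow dags list-import-errors
}
```

If `dags_folder` still shows the old path, the edit to airflow.cfg was not saved or the key name was mistyped.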


### Reload the Airflow UI in the web browser

- Reload the airflow web page.

- Check your S3 bucket to see if the data has been loaded.

- When you are done, delete the S3 bucket objects and terminate your EC2 instance to avoid further charges.
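
The S3 check and the cleanup can also be done from a terminal. This is a sketch that assumes the AWS CLI is installed and configured, which this guide does not otherwise require. The helper names are hypothetical, and the bucket name and instance ID are placeholders you supply.

```shell
# Sketch: inspect and clean up the pipeline's AWS resources with the AWS CLI.
# Assumes the AWS CLI is installed and configured (an assumption, not a step
# in this guide). All three helper names are hypothetical.

# List every object the pipeline wrote to the bucket.
list_bucket() {
    aws s3 ls "s3://$1" --recursive
}

# Delete every object in the bucket (the bucket itself is kept).
empty_bucket() {
    aws s3 rm "s3://$1" --recursive
}

# Terminate the EC2 instance by its instance ID.
terminate_instance() {
    aws ec2 terminate-instances --instance-ids "$1"
}
```

Running `list_bucket` first is a reasonable habit, since `empty_bucket` removes everything under the bucket without asking for confirmation.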


### Thank you for reading

If you encounter any issues, or have questions about the project or suggestions for improvement, please feel free to leave a comment.
If you find this information helpful, please give it a big star using the button at the upper right of my [Github repo](https://github.com/benkeyben/10alytics_air_realtor/tree/main) page. I'd be glad if you did.
