TLC Trip Record Data Yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
More info about dataset can be found here:
Website - https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Data Dictionary - https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
- Project Overview
- Project Architecture
2.1. Data Modeling
2.2. Data Ingestion
2.3. Compute Engine Building
2.4. ETL Process
2.5. BigQuery Data Warehouse
2.6. Data Reporting - Credits
- Contact
This project can be defined as End-to-end Data Engineering Project applied in Google Cloud Platform. NYC Uber driver dataset is used for these project.Uber dataset is investigated and new data model is created in first process. A new cloud bucket is created as a source of the raw data. In next step, new Virtual Machine is created as computing engine to install and Mage.ai tool permenantly. Python scripts are built for data extraction, transformation and loading in Mage.ai with consistent order for get raw data from source bucket, manipulate and load into BigQuery Datawarehouse for analysing. In BigQuery, new big table is created by join dim tables with fact tables to get desired columns. In final step Report is created in LookerStudio.
You can find the detailed information on the diagram below:
- Uber dataset is investigated in scope of understanding main structure of the dataset, detail knowledge about data as column descriptions are completed.
- New data model is created in first process.
- A new cloud bucket is created as a source of the raw data.The raw dataset is uploaded into this bucket as .csv file.
- New Virtual Machine instance is created for using as a Compute Engine to install and run Orchestration tool(Mage.ai).
- VM is created and access the instance via SSH.
- All requirements(python3 and mage.ai) are installed.
- Requirements:
# Install Python and pip
sudo apt-get update
sudo apt-get install python3-distutils
sudo apt-get install python3-apt
sudo apt-get install wget
wget https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py
sudo pip3 install mage-ai
Install Pandas
sudo pip3 install pandas
sudo pip3 install google-cloud
sudo pip3 install google-cloud-bigquery
- Python scripts are built for data extraction, transformation and loading in Mage.ai with consistent order for get raw data from source bucket, manipulate and load into BigQuery Datawarehouse for analysing.
- When load process is done, there are dim tables and 1 fact table in BigQuery Data Warehouse as expected.
- In BigQuery, new big table(tbl_analytics) is created by join dim tables with fact tables to get desired columns.
- BigQuery is connected with Looker Studio BI , and used the Views of the DB to create interactive and insightful data visualizations.
- Data Source: Google Cloud Bucket Storage
- Orchestration: Mage.ai
- Compute Engine: Google Cloud VM Instance
- ETL Process: Mage.ai, Python
- Date Warehousing: Bigquery, t-SQL
- Storage: Google Cloud Bucket Storage
- Data Visualization: Looker Studio
- This Project is inspired by the video of the YouTube Channel "Darshil Parmar"