🚧 Work in Progress 🚧 You can check the project status here
This repo contains my unguided and open-ended data project inspired by Datacamp Unguided Projects. It primarily aims to further apply and develop my data engineering and analysis skills using a real-world dataset. The project uses the NYC Taxi and Limousine Commission (NYC TLC) Trip Record Data, which presents a valuable opportunity due to its size, complexity, and public availability.
I think this kind of project is much better than the guided projects when it comes to improving our skills. Its unguided nature compels us to discover what we know and don't know, something we don't usually experience when doing guided projects.
That is not to say guided projects have less value! Guided projects are one of the best ways to have hands-on experience and develop foundational knowledge.
- Install `poetry` (or use `pip`, see notes below)
- Create a virtual environment at the root directory
- At the root directory, run `poetry install` to install the project dependencies found at `pyproject.toml`
- Once installed, navigate to `src/de_portfolio_nyc_tlc` and follow the instructions listed on the `README.md` to run dagster
Note: if you prefer using `pip`, just copy the dependencies from `pyproject.toml` at the root directory and adjust them to your pip installation config.
This project aims to achieve two key objectives:
Develop Data Engineering Skills. I seek to gain hands-on experience in the data engineering lifecycle by working with a real-world dataset. This includes practicing data ingestion, transformation, cleaning, and exploration using relevant tools and techniques.
Conduct Exploratory Data Analysis (EDA) of NYC Taxi Trip Data. I am leveraging the NYC TLC trip data to explore patterns, trends, and insights within the dataset. This involves understanding the data structure, identifying data quality issues, and performing initial analysis to uncover potential areas for further investigation.
This approach allows me to:
- Further learn and apply data engineering concepts through practical implementation
- Gain familiarity with the NYC TLC data and its potential uses for data-driven decision-making (albeit only in hypothetical scenarios for now)
- Develop my overall data skills 🙌🏻
For this project, I'm using the following sources:
- NYC TLC Trip Record Data to download the `.parquet` files prepared by the NYC TLC
- NYC Open Data API to stream download NYC TLC trip records via `httpx.stream()`
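To give a rough idea of what a streamed download looks like, here is a minimal sketch. It is not the project's actual ingestion code: the helper name `stream_download`, the chunk size, and the placeholder URL in the comments are all assumptions for illustration. It uses the client-level `Client.stream()` context manager from `httpx`, which behaves like `httpx.stream()` but lets one client be reused across requests. The point is that the response body is written to disk chunk by chunk instead of being loaded into memory at once.

```python
from pathlib import Path


def stream_download(client, url: str, dest: Path, chunk_size: int = 1024 * 1024) -> int:
    """Stream the body of `url` into `dest` and return the bytes written.

    `client` is expected to be an httpx.Client (or any object exposing the
    same `.stream()` context manager), so the caller controls timeouts,
    headers, and connection reuse.
    """
    written = 0
    with client.stream("GET", url) as response:
        response.raise_for_status()  # fail early on 4xx/5xx responses
        with open(dest, "wb") as fh:
            # iter_bytes yields the body in chunks as they arrive
            for chunk in response.iter_bytes(chunk_size=chunk_size):
                fh.write(chunk)
                written += len(chunk)
    return written


# Example usage (placeholder URL, replace with the dataset endpoint you need):
#
#   import httpx
#   with httpx.Client(timeout=30.0) as client:
#       stream_download(client, "https://example.invalid/trips.csv", Path("trips.csv"))
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to test with a stub and lets an orchestrated job share one connection pool across many downloads.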
- Of course, Python and SQL!!! 🍞🧈
- Dagster (orchestration)
- TBF
- TBF