A Comprehensive Learning Path with Real-World Projects
This is a complete, project-based Data Engineering curriculum designed to take you from SQL and Python basics to building production-ready data pipelines. You'll work on two major real-world projects while mastering modern data engineering tools.
Time Commitment: 6-10 hours/week
Duration: 16-20 weeks
Target Role: Data Engineer at Tech Companies
Environment: Local development → GCP Cloud
By completing this curriculum, you will:
- ✅ Design and optimize relational databases
- ✅ Write complex SQL queries with confidence
- ✅ Build automated data pipelines with Apache Airflow
- ✅ Process large datasets with Python (Pandas, Polars, DuckDB)
- ✅ Integrate data from multiple APIs
- ✅ Deploy production pipelines on GCP
- ✅ Implement comprehensive testing and monitoring
- ✅ Apply data engineering best practices from day one
- Module 1: SQL Fundamentals & Database Design
- Module 2: Python Essentials for Data Engineering
- Module 3: Setting Up Your Development Environment
- Module 4: Advanced SQL & PostgreSQL
- Module 5: Data Manipulation with Pandas & Polars
- Module 6: Introduction to DuckDB
- Module 7: Apache Airflow Fundamentals
- Module 8: API Integration & Data Extraction
- Module 9: Data Quality & Testing
- Module 10: PySpark for Big Data
- Module 11: GCP Data Engineering Services
- Module 12: Production Best Practices
- Project 1: Digital Marketing Analytics Pipeline
- Project 2: Brazilian Outdoor Adventure Platform
Project 1 builds an end-to-end data pipeline that extracts, transforms, and visualizes marketing campaign data from multiple sources.
Technologies: Airflow, PostgreSQL, Python, APIs, DuckDB
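To give a feel for the orchestration work in Project 1, here is a minimal Airflow DAG sketch for a daily extract-then-transform cycle. It assumes Airflow 2.4+; the DAG id, task names, and helper functions are hypothetical placeholders, not the project's actual code.

```python
# Minimal sketch of a daily marketing ETL DAG (hypothetical names and helpers).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_campaigns(**context):
    # Placeholder: call the marketing APIs here and land raw JSON in staging.
    ...


def transform_with_duckdb(**context):
    # Placeholder: query the staged files with DuckDB SQL, then load the
    # results into PostgreSQL.
    ...


with DAG(
    dag_id="marketing_pipeline_sketch",   # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                    # the `schedule` argument needs Airflow 2.4+
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_campaigns)
    transform = PythonOperator(task_id="transform", python_callable=transform_with_duckdb)
    extract >> transform  # extraction must finish before transformation starts
```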
Project 2 creates a comprehensive data platform for outdoor enthusiasts in Brazil, integrating:
- Gear pricing from e-commerce sites
- Weather data for camping/climbing locations
- Trail databases with difficulty ratings
- Brazilian national parks information
Focus Areas:
- Serra do Mar, Chapada Diamantina, Serra da Mantiqueira
- Seasonal weather patterns
- Gear recommendations and pricing analysis
Technologies: Airflow, PostgreSQL, APIs, Python, GCP
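As a taste of the API-extraction work in Project 2, below is a minimal sketch that fetches a weather forecast for a climbing region and flattens it into rows ready for a PostgreSQL load. The endpoint URL and response fields are illustrative assumptions, not any specific provider's real API.

```python
# Sketch: pull a forecast and flatten it into load-ready records.
# The URL and response fields are hypothetical.
import requests

FORECAST_URL = "https://api.example.com/v1/forecast"  # placeholder endpoint


def fetch_forecast(lat: float, lon: float) -> list[dict]:
    resp = requests.get(FORECAST_URL, params={"lat": lat, "lon": lon}, timeout=10)
    resp.raise_for_status()  # fail loudly on HTTP errors
    payload = resp.json()
    # Flatten the (assumed) nested structure into one row per day.
    return [
        {
            "location": payload.get("name"),
            "date": day.get("date"),
            "temp_max_c": day.get("temp_max"),
            "rain_mm": day.get("precipitation"),
        }
        for day in payload.get("daily", [])
    ]


if __name__ == "__main__":
    # Coordinates roughly in the Chapada Diamantina region.
    rows = fetch_forecast(lat=-12.46, lon=-41.47)
    print(rows[:3])
```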
| Category | Tools |
|---|---|
| Databases | PostgreSQL, DuckDB |
| Languages | SQL, Python 3.10+ |
| Data Processing | Pandas, Polars, PySpark |
| Orchestration | Apache Airflow |
| Cloud | Google Cloud Platform (BigQuery, Cloud Storage, Dataflow) |
| Testing | pytest, Great Expectations |
| Version Control | Git, GitHub |
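Several of these tools interoperate directly, which the modules lean on. For instance, DuckDB can run SQL straight over an in-memory Pandas DataFrame. A minimal sketch with toy data invented for illustration:

```python
# Sketch: querying a Pandas DataFrame with DuckDB SQL (toy data).
import duckdb
import pandas as pd

clicks = pd.DataFrame(
    {
        "campaign": ["search", "search", "social"],
        "cost": [120.0, 80.0, 50.0],
        "conversions": [12, 6, 2],
    }
)

# DuckDB resolves `clicks` from the local Python scope.
result = duckdb.sql(
    """
    SELECT campaign,
           SUM(cost) / SUM(conversions) AS cost_per_conversion
    FROM clicks
    GROUP BY campaign
    ORDER BY cost_per_conversion
    """
).df()

print(result)
```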
```
Data Engineer Python SQL Path/
├── README.md                       # This file
├── modules/                        # Learning modules
│   ├── module_01_sql_fundamentals/
│   ├── module_02_python_essentials/
│   ├── module_03_environment_setup/
│   └── ...
├── projects/                       # Real-world projects
│   ├── digital_marketing_pipeline/
│   └── brazilian_outdoor_platform/
├── datasets/                       # Sample and real datasets
│   ├── marketing_data/
│   └── outdoor_adventure_data/
├── sql_queries/                    # SQL practice queries
├── python_scripts/                 # Python examples
├── airflow_dags/                   # Airflow DAG examples
├── tests/                          # Test suites
├── docs/                           # Additional documentation
│   ├── setup_guides/
│   ├── troubleshooting/
│   └── best_practices/
└── requirements.txt                # Python dependencies
```
- Computer with Windows, macOS, or Linux
- 8GB+ RAM (16GB recommended)
- 20GB free disk space
- Internet connection
- Clone this repository
- Follow Module 3 setup instructions
- Start with Module 1: SQL Fundamentals
- Work through modules sequentially
- Build projects as you learn
- Read Theory: Understand concepts before coding
- Run Examples: Execute all provided code
- Practice Questions: Solve problems independently
- Build Projects: Apply knowledge to real scenarios
- Test Everything: Write tests as you code (see the pytest sketch after this list)
- Review & Refine: Revisit modules as needed
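As a concrete example of the "Test Everything" step, here is a minimal pytest sketch for a small transformation function. The `cost_per_conversion` function and its rules are invented for illustration, not taken from the course materials.

```python
# test_transforms.py -- minimal pytest sketch (hypothetical function under test).
import pytest


def cost_per_conversion(cost: float, conversions: int) -> float:
    """Toy transformation: guard against divide-by-zero."""
    if conversions <= 0:
        raise ValueError("conversions must be positive")
    return cost / conversions


def test_happy_path():
    assert cost_per_conversion(120.0, 12) == 10.0


def test_zero_conversions_rejected():
    with pytest.raises(ValueError):
        cost_per_conversion(50.0, 0)
```

Run it with `pytest test_transforms.py`; writing tests like this alongside each transformation keeps regressions visible as pipelines grow.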
- Theory & Reading: 2-3 hours
- Hands-on Coding: 3-4 hours
- Project Work: 2-3 hours
- Review & Practice: 1-2 hours
Each module includes:
- 📚 Theory: Detailed concept explanations
- 💻 Code Examples: Runnable, commented code
- ❓ Practice Questions: With detailed solutions
- 🔧 Hands-on Tutorials: Step-by-step implementations
- ✅ Testing Strategies: Quality assurance practices
- 🚨 Common Pitfalls: Issues to avoid
- 📖 Additional Resources: Further reading
- Real tools used in tech companies
- Production-ready code patterns
- Best practices from day one
- Every concept tied to real projects
- Executable examples
- Hands-on learning
- Beginner to advanced topics
- Local development to cloud deployment
- Theory to production implementation
- Real Brazilian outdoor data
- Local e-commerce APIs
- Regional weather patterns
- National parks information
- ✅ Module completion checklists
- ✅ Hands-on exercises with solutions
- ✅ Project milestones
- ✅ Code review guidelines
- ✅ Performance benchmarks
- Common issues documented in docs/troubleshooting/
- Error handling guides (a retry sketch follows below)
- Debugging strategies
- Code style guidelines
- Performance optimization tips
- Security considerations
- Scalability patterns
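As one example of the error-handling patterns referenced above: pipelines that call flaky external APIs usually retry with exponential backoff. A minimal hand-rolled sketch, with a hypothetical URL:

```python
# Sketch: retry a flaky HTTP call with exponential backoff (hypothetical URL).
import time

import requests


def get_with_retries(url: str, attempts: int = 3, base_delay: float = 1.0) -> requests.Response:
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * 2**attempt)  # wait 1s, 2s, 4s, ...


response = get_with_retries("https://api.example.com/health")
```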
After completing this course, you'll be prepared for:
- Data Engineer roles at tech companies
- Analytics Engineer positions
- Database Developer roles
- Data Platform Engineer positions
Skills Portfolio:
- Production data pipelines
- Cloud deployments (GCP)
- Complex SQL queries
- API integrations
- Automated testing
- Performance optimization
- ✅ Set up your development environment (Module 3)
- ✅ Complete SQL Fundamentals (Module 1)
- ✅ Start Python Essentials (Module 2)
- ✅ Begin Brazilian Outdoor Project planning
Start with Module 1: Navigate to modules/module_01_sql_fundamentals/ and begin your Data Engineering journey!