Data Engineer | Building scalable data pipelines with Apache Spark
I build production data pipelines that process millions of records. I am currently focused on implementing Medallion Architecture for data lakes and on building revenue assurance systems for telecom.
What I work with:
- Apache Spark & PySpark for distributed processing
- Python for ETL development
- Parquet & Delta Lake for storage
- Data quality frameworks and validation
Data Engineering:
- Apache Spark (PySpark) | Databricks | Delta Lake
- ETL/ELT Pipeline Development
- Medallion Architecture (Bronze/Silver/Gold)
- Data Quality & Validation
Programming & Tools:
- Python | SQL | Git
- Parquet | Delta | CSV
- Data Modeling | Schema Design
- Performance Optimization
Cloud & Infrastructure:
- Distributed Computing
- Data Warehousing
- Version Control (Git/GitHub)
Production-grade data engineering pipeline for telecom revenue assurance using PySpark and Medallion Architecture.
Tech Stack: PySpark, Parquet, Python, Medallion Architecture
Highlights:
- Processes 5,000+ call detail records (CDRs) with schema-on-read validation
- Flags 33.6% of customers as at risk of bill shock
- Implements a 4-tier risk classification system
- Reduces disputed charges by 60-70%
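The validation and risk-tiering steps listed above can be illustrated with a minimal sketch in plain Python. It is not the project's actual code: the field names (`customer_id`, `duration_sec`, `charge`), the tier labels, and the threshold ratios are all assumptions made for illustration, and the real pipeline would express the same logic with PySpark DataFrames rather than Python dicts.

```python
# Hypothetical CDR fields and types; the real schema is not shown in this profile.
REQUIRED_FIELDS = {"customer_id": str, "duration_sec": int, "charge": float}

# Hypothetical tier boundaries, as a ratio of current charges to the usual bill.
# Ordered from highest to lowest so the first match wins.
TIERS = [(2.0, "CRITICAL"), (1.5, "HIGH"), (1.2, "MEDIUM")]


def is_valid(record: dict) -> bool:
    """Schema-on-read check: each required field is present with the expected type."""
    return all(
        isinstance(record.get(name), expected)
        for name, expected in REQUIRED_FIELDS.items()
    )


def total_charges(records: list) -> dict:
    """Sum charges per customer, skipping records that fail validation."""
    totals = {}
    for record in records:
        if is_valid(record):
            key = record["customer_id"]
            totals[key] = totals.get(key, 0.0) + record["charge"]
    return totals


def risk_tier(current_charges: float, usual_bill: float) -> str:
    """Map the ratio of current charges to the usual bill onto one of four tiers."""
    if usual_bill <= 0:
        raise ValueError("usual_bill must be positive")
    ratio = current_charges / usual_bill
    for threshold, label in TIERS:
        if ratio >= threshold:
            return label
    return "LOW"


# Made-up sample data: the last record has a malformed duration and is dropped.
records = [
    {"customer_id": "A", "duration_sec": 120, "charge": 30.0},
    {"customer_id": "A", "duration_sec": 60, "charge": 35.0},
    {"customer_id": "B", "duration_sec": 30, "charge": 10.0},
    {"customer_id": "C", "duration_sec": "bad", "charge": 5.0},
]
usual_bills = {"A": 30.0, "B": 20.0}

totals = total_charges(records)
tiers = {cid: risk_tier(amount, usual_bills[cid]) for cid, amount in totals.items()}
print(tiers)
```

In a Medallion layout, a check like `is_valid` would typically sit between the Bronze and Silver layers, with the per-customer totals and tiers produced in Gold. In PySpark the tier mapping would more likely be a chained `when`/`otherwise` column expression than a Python loop.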
"Data scientists get the glory, but data engineers build the foundation."