This project builds data pipelines over the Stack Exchange content data dump to recommend content, such as the most trending questions in various domains, and a Flask web application that displays popular questions matching user-selected topic tags.
The directory structure of the repo:

```
├── README.md
├── url_7z_xml_to_s3
│   ├── transfer_to_s3.py
│   ├── extract_url.py
│   └── process_7z_file_batch.sh
├── s3_xml_to_parquet_s3
│   └── s3_xml_to_parquet_s3.py
├── s3_spark_aggregation_postgredb
│   └── aggregation_over_parquet_v2.py
├── postgredb_to_flask_web
│   └── flaskr
│       ├── __init__.py
│       ├── db.py
│       ├── posts.py
│       ├── templates
│       │   ├── base.html
│       │   └── posts.html
│       └── static
│           └── style.css
└── airflow
    └── dag
        └── aggregation_daily_job.py
```
Stack Exchange Data Dump: 100 GB
- Fast distributed computation over large datasets (data preprocessing, aggregation)
- Scheduler for daily batch processing
- Flexible, robust data schema design to accommodate evolving business needs
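The core of the aggregation is a group-by/top-N over posts and their tags. The Spark job computes this at scale over Parquet; the following plain-Python sketch (with made-up data, and a `top_questions_per_tag` helper that is illustrative, not from the repo) shows the shape of the computation:

```python
import heapq
from collections import defaultdict

def top_questions_per_tag(posts, n=3):
    """Rank questions per tag by view count.

    `posts` is an iterable of (title, tags, view_count) tuples; the
    result maps each tag to its n most-viewed questions.
    """
    by_tag = defaultdict(list)
    for title, tags, views in posts:
        for tag in tags:
            by_tag[tag].append((views, title))
    # Keep only the n most-viewed questions for each tag.
    return {tag: heapq.nlargest(n, items) for tag, items in by_tag.items()}

posts = [
    ("How to merge dicts?", ["python"], 900),
    ("Spark join OOM", ["spark", "python"], 1200),
    ("CSS centering", ["css"], 300),
]
print(top_questions_per_tag(posts, n=2)["python"])
# → [(1200, 'Spark join OOM'), (900, 'How to merge dicts?')]
```

In Spark the same result would come from exploding the tag list, grouping by tag, and taking the top N by view count within each group.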
- Generic platform for top-feed products
- Enable data ingestion from multiple sources
- Build pipelines to support ML-based feed/post recommendation
- Streaming data support for real-time business needs
