A comprehensive end-to-end data analytics platform that combines automated ETL pipelines, time series forecasting (ARIMA/SARIMA + Machine Learning), and interactive dashboards to support airport operations management and retail demand planning.
This platform demonstrates full-stack data analytics capabilities including:
- Automated ETL Pipeline: Data extraction, transformation, validation, and storage
- Advanced Forecasting Models: Statistical time series (ARIMA/SARIMA) and machine learning (XGBoost) with ensemble methods
- Interactive Business Intelligence Dashboard: Real-time analytics and visualizations
- Data Quality Assurance: Automated validation and monitoring systems
Built with production-ready practices and designed to be easily extensible for real-world deployment.
- Python 3.9+ (3.9, 3.10, 3.11 recommended; 3.12 supported with minimal requirements)
- Streamlit - Interactive web dashboard framework
- Pandas - Data manipulation and analysis
- NumPy - Numerical computations
- scikit-learn - Machine learning utilities and preprocessing
- XGBoost - Gradient boosting for regression tasks
- statsmodels - Statistical modeling (ARIMA/SARIMA time series models)
- LightGBM - Alternative gradient boosting framework (optional)
- Plotly - Interactive charts and visualizations
- Matplotlib - Static plotting (supporting library)
- Seaborn - Statistical visualization (supporting library)
- Custom validation framework with automated quality checks
- JSON-based validation result storage
- Git - Version control
- pytest - Unit testing framework
- python-dotenv - Environment variable management
- Virtual Environments (venv) - Dependency isolation
- CSV - Primary data storage format (human-readable, version-controllable)
- SQLAlchemy - Database abstraction layer (for future PostgreSQL integration)
- Pickle - Model serialization
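Trained models are persisted with pickle. As a minimal sketch of how saving and loading might look (the helper names here are illustrative, not the project's actual API; only the `data/models/` location comes from the project layout):

```python
import pickle
from pathlib import Path

MODEL_DIR = Path("data/models")

def save_model(model, name: str) -> Path:
    """Serialize a fitted model to data/models/<name>.pkl (illustrative layout)."""
    MODEL_DIR.mkdir(parents=True, exist_ok=True)
    path = MODEL_DIR / f"{name}.pkl"
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path

def load_model(name: str):
    """Load a previously serialized model."""
    with open(MODEL_DIR / f"{name}.pkl", "rb") as f:
        return pickle.load(f)
```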
```
FlowCast/
├── app/                          # Streamlit Dashboard Application
│   ├── main.py                   # Dashboard entry point and routing
│   ├── page_modules/             # Dashboard page modules
│   │   ├── overview.py           # Project overview and key metrics
│   │   ├── operations.py         # Operational analytics dashboard
│   │   ├── forecast.py           # Forecasting model analysis
│   │   ├── retail.py             # Retail demand analysis
│   │   └── data_quality.py       # Data quality monitoring
│   └── utils.py                  # Shared utility functions
│
├── etl/                          # ETL Pipeline
│   ├── extract.py                # Data generation and extraction
│   ├── transform.py              # Data cleaning and transformation
│   ├── validate.py               # Data validation and quality checks
│   ├── load.py                   # Data storage
│   └── run_pipeline.py           # Pipeline orchestration
│
├── models/                       # Forecasting Models
│   ├── baseline.py               # Baseline forecasting models
│   ├── arima.py                  # ARIMA/SARIMA implementation
│   ├── ml_models.py              # XGBoost with feature engineering
│   ├── ensemble.py               # Ensemble methods
│   ├── evaluate.py               # Model evaluation metrics
│   └── train_models.py           # Model training orchestration
│
├── data/                         # Data Storage
│   ├── raw/                      # Raw generated data
│   ├── processed/                # Cleaned and validated data
│   ├── models/                   # Trained models and forecasts
│   └── outputs/                  # Additional outputs
│
├── tests/                        # Unit tests
├── config.py                     # Configuration management
├── requirements.txt              # Python dependencies (Python 3.9-3.11)
├── requirements-minimal.txt      # Minimal dependencies (Python 3.12)
├── setup.sh                      # Automated setup script
└── Makefile                      # Development automation
```
Prerequisites:
- Python 3.9+ (3.9, 3.10, or 3.11 recommended for full feature support)
- pip - Python package manager
- Terminal/Command Line - Basic command-line knowledge
- Web Browser - For accessing the dashboard (Chrome, Firefox, Safari, or Edge)
Installation:

```bash
# Clone the repository
git clone <repository-url>
cd FlowCast

# Create virtual environment
python3 -m venv venv

# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
```

For Python 3.9, 3.10, or 3.11:

```bash
pip install -r requirements.txt
```

For Python 3.12 (minimal requirements - excludes some optional packages):

```bash
pip install -r requirements-minimal.txt
```

Alternative: use the automated setup script:

```bash
chmod +x setup.sh
./setup.sh
```

Run the complete ETL pipeline to generate synthetic data, clean it, validate it, and store it:

```bash
python -m etl.run_pipeline
```

What this does:
- Generates synthetic passenger traffic data (2020-2024) with realistic patterns
- Generates weather data (temperature, precipitation, humidity)
- Creates holiday calendar
- Generates retail sales data correlated with passenger volume
- Cleans data (removes duplicates, handles missing values, caps outliers)
- Validates data quality (checks for negatives, reasonable ranges, completeness)
- Saves cleaned datasets to CSV files in `data/processed/`
Expected output:
- Processed datasets in `data/processed/`
- Validation results in JSON format
- Merged dataset combining all sources
Time: ~30 seconds to 2 minutes, depending on your system
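To give a flavor of the validation step, here is a hedged sketch of the kinds of checks it runs (negatives, reasonable ranges, completeness). The function name, column name, and range bound are assumptions for illustration, not the actual `etl/validate.py` code:

```python
import pandas as pd

def validate_passengers(df: pd.DataFrame) -> list:
    """Return a list of {"check": ..., "passed": ...} results (illustrative)."""
    checks = []
    # No negative passenger counts
    checks.append({"check": "no_negatives",
                   "passed": bool((df["passengers"] >= 0).all())})
    # Values within a plausible daily range (bound chosen for illustration)
    checks.append({"check": "reasonable_range",
                   "passed": bool(df["passengers"].between(0, 500_000).all())})
    # Completeness: no missing values
    checks.append({"check": "completeness",
                   "passed": bool(df["passengers"].notna().all())})
    return checks
```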
Train all forecasting models on the processed data:
```bash
python -m models.train_models
```

What this does:
- Trains baseline model (Seasonal Naive forecasting)
- Trains ARIMA/SARIMA models with automatic parameter selection
- Trains XGBoost model with feature engineering: lags, rolling statistics, and time features (see the sketch at the end of this step)
- Creates ensemble model combining all approaches
- Evaluates model performance (MAE, RMSE, MAPE metrics)
- Saves trained models and forecast results
Expected output:
- Trained models in `data/models/`
- Forecast results in `data/models/forecast_results.csv`
- Model performance metrics
Time: 2-10 minutes depending on system performance
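As a rough illustration of the feature engineering mentioned above, here is a sketch of lag, rolling-statistic, and time features. A daily DatetimeIndex and a `passengers` column are assumed; this is not the project's exact `models/ml_models.py` code:

```python
import pandas as pd

def make_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build supervised-learning features from a daily series (illustrative)."""
    out = df.copy()
    # Lag features: traffic 1, 7, and 14 days back
    for lag in (1, 7, 14):
        out[f"lag_{lag}"] = out["passengers"].shift(lag)
    # Rolling statistics over the previous 7 days (shifted to avoid leakage)
    out["roll_mean_7"] = out["passengers"].shift(1).rolling(7).mean()
    out["roll_std_7"] = out["passengers"].shift(1).rolling(7).std()
    # Calendar features from the DatetimeIndex
    out["day_of_week"] = out.index.dayofweek
    out["month"] = out.index.month
    return out.dropna()
```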
Start the Streamlit web application:
```bash
streamlit run app/main.py
```

The dashboard will open automatically in your default web browser at http://localhost:8501
Dashboard Pages:
- Overview - Project introduction, key metrics, and system architecture
- Operations Dashboard - Passenger traffic analysis, trends, and anomaly detection
- Forecast Analysis - Model comparison, accuracy metrics, and future predictions
- Retail Demand - Sales analysis, correlation with passenger volume, and staffing recommendations
- Data Quality - Validation results, quality scores, and data freshness monitoring
To stop the dashboard: Press Ctrl+C in the terminal
The project includes a Makefile for common tasks:
```bash
# Set up virtual environment and install dependencies
make setup

# Run ETL pipeline
make etl

# Train models
make train

# Run dashboard
make dashboard

# Run all: ETL + Train + Dashboard (sequential)
make all

# Clean generated data and models
make clean
```

Dashboard pages in detail:

Overview:
- Project architecture diagram
- High-level KPIs (total passengers, daily averages, data coverage)
- Overall passenger traffic trends
- Model performance summary
Operations Dashboard:
- Operational metrics with date range filtering
- Daily passenger traffic trend visualization
- Year-over-year growth analysis
- Monthly aggregation charts
- Statistical anomaly detection (identifies unusual days)
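One plausible implementation of the anomaly detection above is a rolling z-score; the window and threshold here are illustrative assumptions, not the dashboard's exact logic:

```python
import pandas as pd

def flag_anomalies(s: pd.Series, window: int = 30, z: float = 3.0) -> pd.Series:
    """Flag days more than z rolling standard deviations from the rolling mean."""
    mean = s.rolling(window, min_periods=7).mean()
    std = s.rolling(window, min_periods=7).std()
    return (s - mean).abs() > z * std
```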
Forecast Analysis:
- Model performance comparison (MAE, RMSE, MAPE)
- Forecast vs actual comparison charts
- Multi-model selection and visualization
- Future forecast projections with adjustable horizon
- Model accuracy metrics table
Retail Demand:
- Retail performance metrics (total sales, conversion rates)
- Sales trend visualization
- Sales vs passenger volume correlation analysis
- Category breakdown (Duty Free, Food & Beverage, Retail, Services)
- Staffing recommendations based on passenger forecasts
- Monthly sales analysis
Data Quality:
- Overall data quality score (percentage)
- Recent validation check results with filtering
- Check status breakdown by dataset type
- Failed checks details
- Data freshness information (last update dates)
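The overall quality score could plausibly be the share of validation checks that passed. A sketch, assuming the JSON results are a list of records with a `passed` flag (the file path and schema are assumptions):

```python
import json

def quality_score(path: str = "data/processed/validation_results.json") -> float:
    """Percentage of validation checks that passed (illustrative schema)."""
    with open(path) as f:
        checks = json.load(f)
    passed = sum(1 for c in checks if c.get("passed"))
    return 100.0 * passed / len(checks) if checks else 0.0
```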
Interactive Features:
- Date range filters for time-based analysis
- Model selection (multi-select dropdowns)
- Interactive charts (zoom, pan, hover tooltips)
- Help tooltips (?) on each section explaining concepts
- Real-time data loading from CSV files
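Fast CSV loading in Streamlit typically relies on caching. A minimal sketch of how a page might load processed data (the file name and `date` column are assumptions):

```python
import pandas as pd
import streamlit as st

@st.cache_data  # re-reads the file only when the argument changes
def load_processed(name: str) -> pd.DataFrame:
    return pd.read_csv(f"data/processed/{name}.csv", parse_dates=["date"])
```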
Troubleshooting:

Problem: Packages failing to install (especially on Python 3.12)
Solution:
```bash
# Use minimal requirements
pip install -r requirements-minimal.txt

# Or use Python 3.11
brew install python@3.11   # macOS
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Problem: ModuleNotFoundError when running scripts
Solution:
```bash
# Ensure the virtual environment is activated (you should see (venv) in your prompt)
source venv/bin/activate

# Reinstall dependencies
pip install -r requirements-minimal.txt
```

Problem: Dashboard shows "No data available" messages
Solution:
```bash
# Run the ETL pipeline first
python -m etl.run_pipeline

# If training models, also run:
python -m models.train_models

# Then refresh the browser
```

Problem: Port 8501 is already in use
Solution:
```bash
# Find the process using the port
lsof -i :8501                  # macOS/Linux
netstat -ano | findstr :8501   # Windows

# Kill the process (replace <PID> with the process ID)
kill -9 <PID>                  # macOS/Linux
taskkill /PID <PID> /F         # Windows

# Or use a different port
streamlit run app/main.py --server.port 8502
```

Problem: Errors related to XGBoost or LightGBM
Solution:
```bash
# These packages are optional for core functionality.
# Use the minimal requirements file, which excludes them:
pip install -r requirements-minimal.txt
# Core models (ARIMA, baseline) will still work.
```

Model performance:
- Ensemble Model: Typically achieves the lowest error rates by combining multiple approaches
- XGBoost: Strong performance with feature engineering
- ARIMA/SARIMA: Effective for capturing seasonal patterns
- Baseline Models: Provide benchmarks for comparison
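For reference, the three reported error metrics written out in code. These are the standard definitions; MAPE is masked where the actual value is zero, since it is undefined there:

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    mask = y_true != 0  # MAPE is undefined at zero actuals
    return float(100.0 * np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])))
```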
System performance:
- ETL Pipeline: Processes 5+ years of daily data in under 2 minutes
- Model Training: Complete training pipeline in 2-10 minutes
- Dashboard Loading: Near-instantaneous data loading from CSV
To add a new data source:
- Add an extraction function in `etl/extract.py` (see the sketch after this list)
- Add transformation logic in `etl/transform.py`
- Add validation rules in `etl/validate.py`
- Update the merge function to include the new data
- Update dashboard pages to visualize the new data
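A hypothetical stub for such an extraction function; the source name, columns, and signature are assumptions meant to mirror the pattern, not the project's real code:

```python
import pandas as pd

def extract_parking_data(start: str, end: str) -> pd.DataFrame:
    """Return one row per day for a hypothetical new data source."""
    dates = pd.date_range(start, end, freq="D")
    # Replace this stub with real extraction or synthetic-generation logic
    return pd.DataFrame({"date": dates, "parking_occupancy": 0.0})
```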
To add a new forecasting model:
- Create the model implementation in `models/` (see the sketch after this list)
- Add training logic to `models/train_models.py`
- Include it in the ensemble method if desired
- Update the dashboard to display the new model's forecasts
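A sketch of the fit/predict shape a new model module might follow so that `models/train_models.py` and the ensemble can call it uniformly; the class and method names are assumptions:

```python
import pandas as pd

class MovingAverageModel:
    """Toy model: forecasts the mean of the last `window` observations."""

    def __init__(self, window: int = 7):
        self.window = window
        self.level_ = None

    def fit(self, y: pd.Series) -> "MovingAverageModel":
        self.level_ = float(y.tail(self.window).mean())
        return self

    def predict(self, horizon: int) -> pd.Series:
        return pd.Series([self.level_] * horizon, name="forecast")
```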
To add a new dashboard page:
- Create a new page file in `app/page_modules/` (see the skeleton after this list)
- Import and route it in `app/main.py`
- Use utility functions from `app/utils.py`
- Follow the existing page structure and styling
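A skeleton for a new page module; the `render()` entry point is an assumption about how `app/main.py` routes pages:

```python
import streamlit as st

def render() -> None:
    """Entry point called by the router in app/main.py (assumed convention)."""
    st.title("New Analysis Page")
    st.markdown("Describe what this page shows.")
    # Load data with the shared helpers in app/utils.py, then add charts, e.g.:
    # df = load_processed("merged")
    # st.plotly_chart(fig)
```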
Key practices demonstrated:

Data engineering:
- Automated ETL pipeline with error handling
- Data validation framework with quality scoring
- Synthetic data generation with realistic patterns
- CSV-based storage for portability and version control
Forecasting and machine learning:
- Multiple model types (statistical + ML) for robust predictions
- Feature engineering (lag features, rolling statistics, time encoding)
- Ensemble methods for improved accuracy
- Comprehensive model evaluation metrics
Software engineering:
- Modular architecture with clear separation of concerns
- Configuration management
- Error handling and graceful fallbacks
- Code organization following best practices
