A rigorous statistical analysis of cloud computing versus on-premise infrastructure across 48 testable hypotheses spanning cost, performance, reliability, security, sustainability, and business outcomes.
cloudaudit/
├── config/
│   ├── settings.py  # Paths, singleton accessors (AWS, AZURE, SECURITY, …)
│   ├── statistical_config.py  # Alpha levels, confidence intervals, FDR settings
│   ├── on_prem_costs.py  # On-premise hardware & operational cost defaults
│   └── endpoints/  # API endpoint dataclasses (one per provider)
│       ├── aws_endpoints.py
│       ├── azure_endpoints.py
│       ├── gcp_endpoints.py
│       ├── infracost_endpoints.py
│       ├── security_endpoints.py
│       ├── energy_endpoints.py
│       ├── sentiment_endpoints.py
│       └── outage_endpoints.py
│
├── source/
│   ├── constants.py  # Centralized project constants & defaults
│   │
│   ├── data/  # Data loading layer (14+ loaders)
│   │   ├── data_loader.py  # Abstract base class (fetch → transform → validate → cache → save)
│   │   ├── loader_registry.py  # Runtime loader discovery
│   │   ├── synthetic.py  # SyntheticDataFactory facade
│   │   ├── kaggle_loader.py  # Generic Kaggle dataset loader
│   │   ├── sentiment_loader.py  # HackerNews Algolia API
│   │   ├── spare_cores_loader.py  # SpareCores convenience loader
│   │   ├── benchmark_loaders/  # Cross-cloud benchmark data
│   │   │   ├── queries.py  # SQL queries (separated from logic)
│   │   │   └── spare_cores_loader.py  # SpareCores via sparecores-data package
│   │   ├── cloud_pricing_loaders/  # Cloud pricing APIs
│   │   │   ├── aws_pricing_loader.py
│   │   │   ├── azure_pricing_loader.py
│   │   │   └── infracost_loader.py
│   │   ├── survey_loaders/  # Survey datasets
│   │   │   ├── stack_overflow_survey_loader.py
│   │   │   └── cncf_survey_loader.py
│   │   ├── security_loaders/  # Vulnerability & breach data
│   │   │   ├── nvd_loader.py
│   │   │   ├── cisakevl_loader.py
│   │   │   ├── hibp_loader.py
│   │   │   └── vcdbl_loader.py
│   │   ├── outage_loaders/  # Outage & incident data
│   │   │   ├── gcp_outage_loader.py
│   │   │   ├── outages_project_loader.py
│   │   │   └── outage_analyzer.py
│   │   ├── energy_loaders/  # Carbon & energy data
│   │   │   ├── google_carbon_loader.py
│   │   │   └── ccf_coefficients_loader.py
│   │   └── synthetic_generators/  # Synthetic data generators (fallback when APIs unavailable)
│   │       ├── synthetic_generator.py  # Abstract base
│   │       ├── spare_cores_synthetic.py
│   │       ├── azure_pricing_synthetic.py
│   │       ├── gcp_outages_synthetic.py
│   │       ├── google_carbon_synthetic.py
│   │       ├── cisa_kev_synthetic.py
│   │       ├── hibp_synthetic.py
│   │       └── hackernews_synthetic.py
│   │
│   ├── models/  # Domain models
│   │   ├── hypothesis/  # 48 hypotheses with full metadata
│   │   │   ├── hypothesis.py
│   │   │   ├── hypothesis_family.py
│   │   │   ├── hypothesis_catalog.py
│   │   │   ├── test_direction.py
│   │   │   └── build_catalog.py
│   │   ├── workloads/  # Workload profiles & simulation
│   │   │   ├── workload_type.py
│   │   │   ├── workload_profile.py
│   │   │   ├── workload_simulator.py
│   │   │   ├── archetype_profiles.py
│   │   │   └── demand_strategies/  # Pluggable demand generation strategies
│   │   │       ├── demand_strategy.py  # Abstract base
│   │   │       ├── steady_state_demand.py
│   │   │       ├── diurnal_demand.py
│   │   │       ├── bursty_demand.py
│   │   │       ├── batch_demand.py
│   │   │       ├── growth_demand.py
│   │   │       └── seasonal_demand.py
│   │   ├── cost_model.py  # Abstract CostModel base
│   │   ├── cost_component.py  # CostComponent dataclass
│   │   ├── cloud_cost_model.py  # Cloud TCO builder
│   │   ├── on_prem_cost_model.py  # On-prem TCO builder
│   │   ├── tco_comparator.py  # Side-by-side comparison + Monte Carlo
│   │   └── monte_carlo_simulator.py  # Monte Carlo cost simulation engine
│   │
│   ├── analysis/  # Statistical testing framework (20+ tests)
│   │   ├── statistical_test.py  # Abstract StatisticalTest base
│   │   ├── test_registry.py  # TestRegistry for runtime discovery
│   │   ├── multiple_testing.py  # FDR, Bonferroni, Holm correction
│   │   ├── frequentist/  # Classical hypothesis tests
│   │   │   ├── welch_t_test.py
│   │   │   ├── paired_t_test.py
│   │   │   ├── one_sample_t_test.py
│   │   │   ├── one_way_anova.py
│   │   │   ├── mann_whitney_u.py
│   │   │   ├── kruskal_wallis.py
│   │   │   ├── levenes_test.py
│   │   │   ├── spearman_correlation.py
│   │   │   ├── chi_square_test.py
│   │   │   ├── cochran_armitage_trend.py
│   │   │   ├── linear_regression.py
│   │   │   ├── logistic_regression.py
│   │   │   ├── proportion_z_test.py
│   │   │   └── tost_equivalence.py
│   │   ├── inference/  # Bootstrap, Bayesian & resampling methods
│   │   │   ├── resampling_engine.py
│   │   │   ├── bootstrap_engine.py
│   │   │   ├── permutation_engine.py
│   │   │   ├── bootstrap_ci.py
│   │   │   ├── bootstrap_mean_difference.py
│   │   │   ├── bootstrap_cv_comparison.py
│   │   │   ├── bootstrap_correlation.py
│   │   │   ├── bayesian_t_test.py
│   │   │   ├── bayesian_equivalence.py
│   │   │   └── bayesian_variance_ratio.py
│   │   └── specialized/  # Domain-specific analyses
│   │       ├── garch_volatility.py
│   │       ├── mann_kendall_trend.py
│   │       ├── poisson_regression.py
│   │       ├── sentiment_analyzer.py
│   │       ├── sentiment_scorer.py
│   │       └── survival.py
│   │
│   ├── visualization/  # Plotting framework
│   │   ├── dashboard.py  # Multi-panel summary dashboard
│   │   ├── plotters/
│   │   │   ├── plotter_style.py  # Matplotlib/Seaborn style config
│   │   │   ├── plotter.py  # Base plotter utilities
│   │   │   ├── cost_plotter.py  # TCO, pricing, Monte Carlo, break-even
│   │   │   └── performance_plotter.py  # Benchmarks, variance, carbon, outages
│   │   └── panels/  # Dashboard panel components
│   │       ├── dashboard_panel.py  # Abstract base
│   │       ├── verdict_summary_panel.py
│   │       ├── effect_sizes_panel.py
│   │       ├── p_value_distribution_panel.py
│   │       ├── family_breakdown_panel.py
│   │       ├── significance_matrix_panel.py
│   │       └── top_findings_panel.py
│   │
│   ├── utils/  # Shared utilities (one class per file)
│   │   ├── data_cache.py  # Disk cache (parquet with TTL)
│   │   ├── cache_entry.py  # Cache entry dataclass
│   │   ├── data_validator.py  # Schema-based data validation
│   │   ├── schema_rule.py  # Validation rule definition
│   │   ├── severity.py  # Validation severity levels
│   │   ├── validation_issue.py  # Validation issue dataclass
│   │   ├── outlier_detector.py  # Outlier detection dispatcher
│   │   ├── report_builder.py  # ReportBuilder (accumulates results)
│   │   ├── hypothesis_result.py  # HypothesisResult dataclass
│   │   ├── effect_size.py  # EffectSize dataclass
│   │   ├── effect_size_calculator.py  # Effect size computation utilities
│   │   ├── test_verdict.py  # TestVerdict enum
│   │   ├── checks/  # Validation check strategies
│   │   │   ├── validation_check.py  # Abstract base
│   │   │   ├── dtype_check.py
│   │   │   ├── null_check.py
│   │   │   ├── range_check.py
│   │   │   ├── allowed_values_check.py
│   │   │   ├── uniqueness_check.py
│   │   │   └── custom_check.py
│   │   ├── outlier_strategies/  # Outlier detection strategies
│   │   │   ├── outlier_strategy.py  # Abstract base
│   │   │   ├── iqr_outlier_strategy.py
│   │   │   └── zscore_outlier_strategy.py
│   │   └── report_exporters/  # Report export formats
│   │       ├── report_exporter.py  # Abstract base
│   │       ├── dataframe_exporter.py
│   │       ├── json_exporter.py
│   │       └── markdown_exporter.py
│   │
│   ├── analysis_task.py  # AnalysisTask dataclass
│   ├── research_pipeline.py  # Full research orchestrator
│   ├── pipeline_data_manager.py  # Pipeline data management
│   ├── quick_analysis.py  # Lightweight single-hypothesis runner
│   │
│   └── tests/  # All project tests
│       ├── test_apis.py  # Connectivity check for all external APIs
│       ├── test_analysis.py  # Unit tests for statistical test classes
│       ├── test_models.py  # Unit tests for hypothesis catalog & TCO models
│       └── test_loaders.py  # Unit tests for data loaders
│
├── notebooks/
│   ├── 00_theoretical_framework.ipynb  # Hypothesis definitions & methodology overview
│   ├── 01_data_collection.ipynb  # Data fetching, validation & source mapping
│   ├── 02_exploratory_analysis.ipynb  # EDA, distributions, normality checks
│   ├── 03_cost_analysis.ipynb  # Chapter 1: H1–H8, H38–H41 (Cost & TCO)
│   ├── 04_performance_analysis.ipynb  # Chapter 2: H9–H13, H22–H24, H34–H37
│   ├── 05_reliability_security.ipynb  # Chapter 3: H14–H21 (Reliability & Security)
│   └── 06_outcomes_synthesis.ipynb  # Chapter 4: H25–H33, H42–H48
│
├── data/
│   ├── raw/  # Fetched datasets (CSV, auto-saved by loaders)
│   ├── processed/  # Analysis outputs (results.csv, results.json)
│   └── .cache/  # Parquet cache (TTL-based)
│
└── requirements.txt

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Verify API connectivity
python -m source.tests.test_apis
# Run unit tests
python -m pytest source/tests/ -v
# Launch notebooks
jupyter notebook notebooks/

| Source | Loader | Records | API |
|---|---|---|---|
| AWS EC2 Pricing | `AWSPricingLoader` | ~60 000 | AWS Bulk Pricing |
| Azure VM Pricing | `AzurePricingLoader` | ~8 000 | Azure Retail Prices |
| Spare Cores Benchmarks | `SpareCoresLoader` | ~2 000 | sparecores-data (SQLite) |
| NVD Vulnerabilities | `NVDLoader` | ~2 000 | NVD CVE 2.0 |
| CISA Known Exploited | `CISAKEVLoader` | ~1 500 | CISA KEV |
| HIBP Breaches | `HIBPLoader` | ~970 | Have I Been Pwned |
| HackerNews Sentiment | `HackerNewsSentimentLoader` | ~1 700 | Algolia HN Search |
| Google Carbon | `GoogleCarbonLoader` | ~40 | GitHub CSV |
| GCP Outages | `GCPOutageLoader` | ~4 | GCP Status JSON |

All data is cached locally after first fetch. Loaders auto-save CSV to data/raw/.
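
The TTL-based caching described above can be illustrated with a minimal sketch. This is not the project's `data_cache.py`: the function name `load_with_ttl`, the `fake_fetch` callback, and the use of a JSON file (standing in for the parquet cache so the example needs only the standard library) are all assumptions made for illustration.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Callable


def load_with_ttl(cache_path: Path, fetch: Callable[[], list], ttl_seconds: float) -> list:
    """Return cached records while the cache file is younger than the TTL, else refetch."""
    if cache_path.exists():
        age = time.time() - cache_path.stat().st_mtime
        if age < ttl_seconds:
            return json.loads(cache_path.read_text())
    # Cache miss or expired entry: fetch fresh data and overwrite the cache file.
    records = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(records))
    return records


calls = []  # records how many times the (hypothetical) remote source was hit


def fake_fetch() -> list:
    calls.append(1)
    return [{"sku": "example", "price": 0.1}]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / ".cache" / "pricing.json"
    first = load_with_ttl(path, fake_fetch, ttl_seconds=3600)   # fetches and writes cache
    second = load_with_ttl(path, fake_fetch, ttl_seconds=3600)  # served from cache
```

Using the file's modification time as the freshness check keeps the sketch free of a separate metadata store; a real cache may well track expiry differently.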
- Template Method — `DataLoader.load()` orchestrates fetch → transform → validate → cache → save; `SyntheticGenerator.generate()` defines the skeleton for synthetic data creation
- Strategy — `StatisticalTest` with interchangeable test implementations; `DemandStrategy` for workload demand generation (diurnal, bursty, batch, etc.); `OutlierStrategy` for pluggable outlier detection (IQR, Z-score)
- Registry — `LoaderRegistry` and `TestRegistry` for runtime discovery
- Factory — `HypothesisCatalog` builds and registers all 48 hypotheses via `build_catalog`; `SyntheticDataFactory` facade delegates to individual generators
- Builder — `CloudCostModel.build()` and `OnPremCostModel.build()` with fluent API
- Composition — `TCOComparator` composes two cost models; `DashboardBuilder` composes panels into a report dashboard
- Observer-like — `ReportBuilder` accumulates `HypothesisResult` from independent tests
- Exporter — `ReportExporter` base with `DataFrameExporter`, `JsonExporter`, `MarkdownExporter` for multi-format output

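
As a rough illustration of how the Template Method and Registry patterns listed above can fit together, here is a minimal sketch. The names `SketchLoader`, `DemoPricingLoader`, `register_loader`, and `LOADER_REGISTRY` are hypothetical and do not reflect the actual `DataLoader` or `LoaderRegistry` interfaces; the cache and save steps are omitted for brevity.

```python
from abc import ABC, abstractmethod

# Hypothetical registry: maps a string key to a loader class for runtime lookup.
LOADER_REGISTRY: dict[str, type] = {}


def register_loader(name: str):
    """Class decorator that records a loader class under the given name."""
    def decorator(cls):
        LOADER_REGISTRY[name] = cls
        return cls
    return decorator


class SketchLoader(ABC):
    """Template Method: load() fixes the step order, subclasses fill in the steps."""

    def load(self) -> list[dict]:
        raw = self.fetch()
        rows = self.transform(raw)
        self.validate(rows)
        return rows

    @abstractmethod
    def fetch(self) -> list[dict]:
        """Return raw records from the source."""

    def transform(self, raw: list[dict]) -> list[dict]:
        return raw  # default: pass records through unchanged

    def validate(self, rows: list[dict]) -> None:
        if not rows:
            raise ValueError("loader returned no rows")


@register_loader("demo_pricing")
class DemoPricingLoader(SketchLoader):
    def fetch(self) -> list[dict]:
        # Hard-coded illustrative records instead of a real API call.
        return [
            {"instance": "small", "usd_per_hour": "0.10"},
            {"instance": "large", "usd_per_hour": "0.40"},
        ]

    def transform(self, raw: list[dict]) -> list[dict]:
        # Convert the price strings to floats.
        return [{**r, "usd_per_hour": float(r["usd_per_hour"])} for r in raw]


# Look the loader up by name at runtime, then run the fixed pipeline.
rows = LOADER_REGISTRY["demo_pricing"]().load()
```

The caller only needs the registry key, so new loaders can be added without touching the code that runs them.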
| Category | Hypotheses | Chapter |
|---|---|---|
| A: Cost & TCO | H1–H8 | 1 |
| B: Performance | H9–H13 | 2 |
| C: Reliability | H14–H17 | 3 |
| D: Security | H18–H21 | 3 |
| E: Scalability | H22–H23 | 2 |
| F: Developer | H24–H26 | 2 & 4 |
| G: Energy | H27–H29 | 4 |
| H: Company | H30–H33 | 4 |
| I: Workload | H34–H37 | 2 |
| J: Hidden | H38–H48 | 1 & 4 |
- Parametric — Welch's t-test, paired t-test, one-sample t, ANOVA, linear/logistic regression, TOST equivalence
- Non-parametric — Mann-Whitney U, Kruskal-Wallis, Levene's, Spearman, chi-square, Cochran-Armitage, proportion Z
- Bootstrap — BCa confidence intervals, mean difference, CV comparison, correlation
- Bayesian — BEST (Kruschke), ROPE equivalence, variance ratio (conjugate InverseGamma)
- Time series — GARCH(1,1), Mann-Kendall trend, Poisson regression
- Survival — Kaplan-Meier + log-rank test
- NLP — VADER sentiment, TextBlob, domain keyword scoring
- Correction — Benjamini-Hochberg FDR, Bonferroni, Holm (within hypothesis families)
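
The within-family Benjamini-Hochberg correction mentioned in the last item can be sketched in plain Python. This is an illustrative implementation of the standard step-up procedure, not the code in `multiple_testing.py`; the family names and p-values below are made up for the example.

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return a reject/keep flag per p-value using the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis whose rank is at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected


# Illustrative p-values grouped by hypothesis family (not real results).
families = {
    "cost": [0.001, 0.012, 0.04, 0.30],
    "security": [0.03, 0.20],
    "energy": [0.04, 0.03],
}

# Apply the correction separately within each family.
decisions = {name: benjamini_hochberg(ps) for name, ps in families.items()}
```

In the hypothetical "energy" family both hypotheses are rejected even though the smaller p-value misses its own threshold, because the step-up rule rejects everything ranked at or below the largest rank that passes.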