Cloud-Audit — Cloud vs On-Premise Statistical Analysis

A rigorous statistical analysis of cloud computing versus on-premise infrastructure across 48 testable hypotheses spanning cost, performance, reliability, security, sustainability, and business outcomes.

Project Structure

cloudaudit/
├── config/
│   ├── settings.py                # Paths, singleton accessors (AWS, AZURE, SECURITY, …)
│   ├── statistical_config.py      # Alpha levels, confidence intervals, FDR settings
│   ├── on_prem_costs.py           # On-premise hardware & operational cost defaults
│   └── endpoints/                 # API endpoint dataclasses (one per provider)
│       ├── aws_endpoints.py
│       ├── azure_endpoints.py
│       ├── gcp_endpoints.py
│       ├── infracost_endpoints.py
│       ├── security_endpoints.py
│       ├── energy_endpoints.py
│       ├── sentiment_endpoints.py
│       └── outage_endpoints.py
│
├── source/
│   ├── constants.py                   # Centralized project constants & defaults
│   │
│   ├── data/                          # Data loading layer (14+ loaders)
│   │   ├── data_loader.py            # Abstract base class (fetch → transform → validate → cache → save)
│   │   ├── loader_registry.py        # Runtime loader discovery
│   │   ├── synthetic.py              # SyntheticDataFactory facade
│   │   ├── kaggle_loader.py          # Generic Kaggle dataset loader
│   │   ├── sentiment_loader.py       # HackerNews Algolia API
│   │   ├── spare_cores_loader.py     # SpareCores convenience loader
│   │   ├── benchmark_loaders/        # Cross-cloud benchmark data
│   │   │   ├── queries.py            # SQL queries (separated from logic)
│   │   │   └── spare_cores_loader.py # SpareCores via sparecores-data package
│   │   ├── cloud_pricing_loaders/    # Cloud pricing APIs
│   │   │   ├── aws_pricing_loader.py
│   │   │   ├── azure_pricing_loader.py
│   │   │   └── infracost_loader.py
│   │   ├── survey_loaders/           # Survey datasets
│   │   │   ├── stack_overflow_survey_loader.py
│   │   │   └── cncf_survey_loader.py
│   │   ├── security_loaders/         # Vulnerability & breach data
│   │   │   ├── nvd_loader.py
│   │   │   ├── cisakevl_loader.py
│   │   │   ├── hibp_loader.py
│   │   │   └── vcdbl_loader.py
│   │   ├── outage_loaders/           # Outage & incident data
│   │   │   ├── gcp_outage_loader.py
│   │   │   ├── outages_project_loader.py
│   │   │   └── outage_analyzer.py
│   │   ├── energy_loaders/           # Carbon & energy data
│   │   │   ├── google_carbon_loader.py
│   │   │   └── ccf_coefficients_loader.py
│   │   └── synthetic_generators/     # Synthetic data generators (fallback when APIs unavailable)
│   │       ├── synthetic_generator.py      # Abstract base
│   │       ├── spare_cores_synthetic.py
│   │       ├── azure_pricing_synthetic.py
│   │       ├── gcp_outages_synthetic.py
│   │       ├── google_carbon_synthetic.py
│   │       ├── cisa_kev_synthetic.py
│   │       ├── hibp_synthetic.py
│   │       └── hackernews_synthetic.py
│   │
│   ├── models/                        # Domain models
│   │   ├── hypothesis/                # 48 hypotheses with full metadata
│   │   │   ├── hypothesis.py
│   │   │   ├── hypothesis_family.py
│   │   │   ├── hypothesis_catalog.py
│   │   │   ├── test_direction.py
│   │   │   └── build_catalog.py
│   │   ├── workloads/                 # Workload profiles & simulation
│   │   │   ├── workload_type.py
│   │   │   ├── workload_profile.py
│   │   │   ├── workload_simulator.py
│   │   │   ├── archetype_profiles.py
│   │   │   └── demand_strategies/     # Pluggable demand generation strategies
│   │   │       ├── demand_strategy.py       # Abstract base
│   │   │       ├── steady_state_demand.py
│   │   │       ├── diurnal_demand.py
│   │   │       ├── bursty_demand.py
│   │   │       ├── batch_demand.py
│   │   │       ├── growth_demand.py
│   │   │       └── seasonal_demand.py
│   │   ├── cost_model.py             # Abstract CostModel base
│   │   ├── cost_component.py         # CostComponent dataclass
│   │   ├── cloud_cost_model.py       # Cloud TCO builder
│   │   ├── on_prem_cost_model.py     # On-prem TCO builder
│   │   ├── tco_comparator.py         # Side-by-side comparison + Monte Carlo
│   │   └── monte_carlo_simulator.py  # Monte Carlo cost simulation engine
│   │
│   ├── analysis/                      # Statistical testing framework (20+ tests)
│   │   ├── statistical_test.py       # Abstract StatisticalTest base
│   │   ├── test_registry.py          # TestRegistry for runtime discovery
│   │   ├── multiple_testing.py       # FDR, Bonferroni, Holm correction
│   │   ├── frequentist/              # Classical hypothesis tests
│   │   │   ├── welch_t_test.py
│   │   │   ├── paired_t_test.py
│   │   │   ├── one_sample_t_test.py
│   │   │   ├── one_way_anova.py
│   │   │   ├── mann_whitney_u.py
│   │   │   ├── kruskal_wallis.py
│   │   │   ├── levenes_test.py
│   │   │   ├── spearman_correlation.py
│   │   │   ├── chi_square_test.py
│   │   │   ├── cochran_armitage_trend.py
│   │   │   ├── linear_regression.py
│   │   │   ├── logistic_regression.py
│   │   │   ├── proportion_z_test.py
│   │   │   └── tost_equivalence.py
│   │   ├── inference/                 # Bootstrap, Bayesian & resampling methods
│   │   │   ├── resampling_engine.py
│   │   │   ├── bootstrap_engine.py
│   │   │   ├── permutation_engine.py
│   │   │   ├── bootstrap_ci.py
│   │   │   ├── bootstrap_mean_difference.py
│   │   │   ├── bootstrap_cv_comparison.py
│   │   │   ├── bootstrap_correlation.py
│   │   │   ├── bayesian_t_test.py
│   │   │   ├── bayesian_equivalence.py
│   │   │   └── bayesian_variance_ratio.py
│   │   └── specialized/              # Domain-specific analyses
│   │       ├── garch_volatility.py
│   │       ├── mann_kendall_trend.py
│   │       ├── poisson_regression.py
│   │       ├── sentiment_analyzer.py
│   │       ├── sentiment_scorer.py
│   │       └── survival.py
│   │
│   ├── visualization/                 # Plotting framework
│   │   ├── dashboard.py              # Multi-panel summary dashboard
│   │   ├── plotters/
│   │   │   ├── plotter_style.py      # Matplotlib/Seaborn style config
│   │   │   ├── plotter.py            # Base plotter utilities
│   │   │   ├── cost_plotter.py       # TCO, pricing, Monte Carlo, break-even
│   │   │   └── performance_plotter.py # Benchmarks, variance, carbon, outages
│   │   └── panels/                    # Dashboard panel components
│   │       ├── dashboard_panel.py          # Abstract base
│   │       ├── verdict_summary_panel.py
│   │       ├── effect_sizes_panel.py
│   │       ├── p_value_distribution_panel.py
│   │       ├── family_breakdown_panel.py
│   │       ├── significance_matrix_panel.py
│   │       └── top_findings_panel.py
│   │
│   ├── utils/                         # Shared utilities (one class per file)
│   │   ├── data_cache.py             # Disk cache (parquet with TTL)
│   │   ├── cache_entry.py            # Cache entry dataclass
│   │   ├── data_validator.py         # Schema-based data validation
│   │   ├── schema_rule.py            # Validation rule definition
│   │   ├── severity.py               # Validation severity levels
│   │   ├── validation_issue.py       # Validation issue dataclass
│   │   ├── outlier_detector.py       # Outlier detection dispatcher
│   │   ├── report_builder.py         # ReportBuilder (accumulates results)
│   │   ├── hypothesis_result.py      # HypothesisResult dataclass
│   │   ├── effect_size.py            # EffectSize dataclass
│   │   ├── effect_size_calculator.py # Effect size computation utilities
│   │   ├── test_verdict.py           # TestVerdict enum
│   │   ├── checks/                   # Validation check strategies
│   │   │   ├── validation_check.py         # Abstract base
│   │   │   ├── dtype_check.py
│   │   │   ├── null_check.py
│   │   │   ├── range_check.py
│   │   │   ├── allowed_values_check.py
│   │   │   ├── uniqueness_check.py
│   │   │   └── custom_check.py
│   │   ├── outlier_strategies/       # Outlier detection strategies
│   │   │   ├── outlier_strategy.py         # Abstract base
│   │   │   ├── iqr_outlier_strategy.py
│   │   │   └── zscore_outlier_strategy.py
│   │   └── report_exporters/         # Report export formats
│   │       ├── report_exporter.py          # Abstract base
│   │       ├── dataframe_exporter.py
│   │       ├── json_exporter.py
│   │       └── markdown_exporter.py
│   │
│   ├── analysis_task.py              # AnalysisTask dataclass
│   ├── research_pipeline.py          # Full research orchestrator
│   ├── pipeline_data_manager.py      # Pipeline data management
│   ├── quick_analysis.py             # Lightweight single-hypothesis runner
│   │
│   └── tests/                         # All project tests
│       ├── test_apis.py               # Connectivity check for all external APIs
│       ├── test_analysis.py           # Unit tests for statistical test classes
│       ├── test_models.py             # Unit tests for hypothesis catalog & TCO models
│       └── test_loaders.py            # Unit tests for data loaders
│
├── notebooks/
│   ├── 00_theoretical_framework.ipynb # Hypothesis definitions & methodology overview
│   ├── 01_data_collection.ipynb       # Data fetching, validation & source mapping
│   ├── 02_exploratory_analysis.ipynb  # EDA, distributions, normality checks
│   ├── 03_cost_analysis.ipynb         # Chapter 1: H1–H8, H38–H41 (Cost & TCO)
│   ├── 04_performance_analysis.ipynb  # Chapter 2: H9–H13, H22–H24, H34–H37
│   ├── 05_reliability_security.ipynb  # Chapter 3: H14–H21 (Reliability & Security)
│   └── 06_outcomes_synthesis.ipynb    # Chapter 4: H25–H33, H42–H48
│
├── data/
│   ├── raw/                           # Fetched datasets (CSV, auto-saved by loaders)
│   ├── processed/                     # Analysis outputs (results.csv, results.json)
│   └── .cache/                        # Parquet cache (TTL-based)
│
└── requirements.txt

Quick Start

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Verify API connectivity
python -m source.tests.test_apis

# Run unit tests
python -m pytest source/tests/ -v

# Launch notebooks
jupyter notebook notebooks/

Data Sources

| Source | Loader | Records | API |
|---|---|---|---|
| AWS EC2 Pricing | AWSPricingLoader | ~60 000 | AWS Bulk Pricing |
| Azure VM Pricing | AzurePricingLoader | ~8 000 | Azure Retail Prices |
| Spare Cores Benchmarks | SpareCoresLoader | ~2 000 | sparecores-data (SQLite) |
| NVD Vulnerabilities | NVDLoader | ~2 000 | NVD CVE 2.0 |
| CISA Known Exploited | CISAKEVLoader | ~1 500 | CISA KEV |
| HIBP Breaches | HIBPLoader | ~970 | Have I Been Pwned |
| HackerNews Sentiment | HackerNewsSentimentLoader | ~1 700 | Algolia HN Search |
| Google Carbon | GoogleCarbonLoader | ~40 | GitHub CSV |
| GCP Outages | GCPOutageLoader | ~4 | GCP Status JSON |

All data is cached locally after first fetch. Loaders auto-save CSV to data/raw/.

OOP Design Patterns

  • Template Method — DataLoader.load() orchestrates fetch → transform → validate → cache → save; SyntheticGenerator.generate() defines the skeleton for synthetic data creation
  • Strategy — StatisticalTest with interchangeable test implementations; DemandStrategy for workload demand generation (diurnal, bursty, batch, etc.); OutlierStrategy for pluggable outlier detection (IQR, Z-score)
  • Registry — LoaderRegistry and TestRegistry for runtime discovery
  • Factory — HypothesisCatalog builds and registers all 48 hypotheses via build_catalog; SyntheticDataFactory facade delegates to individual generators
  • Builder — CloudCostModel.build() and OnPremCostModel.build() with a fluent API
  • Composition — TCOComparator composes two cost models; DashboardBuilder composes panels into a report dashboard
  • Observer-like — ReportBuilder accumulates HypothesisResult objects from independent tests
  • Exporter — ReportExporter base with DataFrameExporter, JsonExporter, MarkdownExporter for multi-format output
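The Template Method pattern is the backbone of the loader layer: the base class fixes the pipeline order while subclasses supply the individual steps. A minimal sketch of the idea (illustrative only; the real DataLoader also caches and saves, and its hook signatures may differ):

```python
from abc import ABC, abstractmethod


class DataLoader(ABC):
    """Template Method skeleton: load() fixes the pipeline order,
    subclasses fill in fetch() and transform()."""

    def load(self) -> list[dict]:
        raw = self.fetch()
        records = self.transform(raw)
        self.validate(records)
        return records

    @abstractmethod
    def fetch(self):
        """Retrieve raw data from an API, file, or synthetic generator."""

    @abstractmethod
    def transform(self, raw) -> list[dict]:
        """Normalize the raw payload into a list of records."""

    def validate(self, records: list[dict]) -> None:
        # Default hook; subclasses may apply stricter schema checks.
        if not records:
            raise ValueError("loader produced no records")


class InMemoryLoader(DataLoader):
    """Toy subclass: fetches from a constant instead of a live API."""

    def fetch(self):
        return "a=1;b=2"

    def transform(self, raw) -> list[dict]:
        return [dict([pair.split("=")]) for pair in raw.split(";")]
```

Because load() is defined once on the base class, every loader in the registry exposes the same call signature, which is what makes runtime discovery via LoaderRegistry practical.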

The 48 Hypotheses

| Category | Hypotheses | Chapter |
|---|---|---|
| A: Cost & TCO | H1–H8 | 1 |
| B: Performance | H9–H13 | 2 |
| C: Reliability | H14–H17 | 3 |
| D: Security | H18–H21 | 3 |
| E: Scalability | H22–H23 | 2 |
| F: Developer | H24–H26 | 2 & 4 |
| G: Energy | H27–H29 | 4 |
| H: Company | H30–H33 | 4 |
| I: Workload | H34–H37 | 2 |
| J: Hidden | H38–H48 | 1 & 4 |

Statistical Methods

  • Parametric — Welch's t-test, paired t-test, one-sample t, ANOVA, linear/logistic regression, TOST equivalence
  • Non-parametric — Mann-Whitney U, Kruskal-Wallis, Levene's, Spearman, chi-square, Cochran-Armitage, proportion Z
  • Bootstrap — BCa confidence intervals, mean difference, CV comparison, correlation
  • Bayesian — BEST (Kruschke), ROPE equivalence, variance ratio (conjugate InverseGamma)
  • Time series — GARCH(1,1), Mann-Kendall trend, Poisson regression
  • Survival — Kaplan-Meier + log-rank test
  • NLP — VADER sentiment, TextBlob, domain keyword scoring
  • Correction — Benjamini-Hochberg FDR, Bonferroni, Holm (within hypothesis families)
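As an illustration of the correction step, here is a self-contained Benjamini-Hochberg procedure in its standard textbook formulation (the project's multiple_testing.py may differ in interface details):

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up procedure.

    Returns one reject/keep flag per input p-value, controlling the
    false discovery rate at level q across the whole family.
    """
    m = len(p_values)
    # Rank the hypotheses by p-value, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q ...
    threshold_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            threshold_rank = rank
    # ... and reject every hypothesis at or below that rank.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= threshold_rank:
            reject[idx] = True
    return reject
```

Applying the correction within each hypothesis family, as the README describes, keeps the FDR guarantee scoped to related tests instead of diluting it across all 48 hypotheses at once.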

About

A data-driven research platform that statistically evaluates cloud vs. on-premise infrastructure across 48 hypotheses spanning cost, performance, reliability, security, and scalability. Uses real-world data from AWS, Azure, GCP, and security/benchmark sources with 20+ statistical methods to deliver evidence-based verdicts.
