Cloud-Audit — Cloud vs On-Premise Statistical Analysis

A rigorous statistical analysis of cloud computing versus on-premise infrastructure across 48 testable hypotheses spanning cost, performance, reliability, security, sustainability, and business outcomes.

Project Structure

cloudaudit/
├── config/
│   ├── settings.py                # Paths, singleton accessors (AWS, AZURE, SECURITY, …)
│   ├── statistical_config.py      # Alpha levels, confidence intervals, FDR settings
│   ├── on_prem_costs.py           # On-premise hardware & operational cost defaults
│   └── endpoints/                 # API endpoint dataclasses (one per provider)
│       ├── aws_endpoints.py
│       ├── azure_endpoints.py
│       ├── gcp_endpoints.py
│       ├── infracost_endpoints.py
│       ├── security_endpoints.py
│       ├── energy_endpoints.py
│       ├── sentiment_endpoints.py
│       └── outage_endpoints.py
│
├── source/
│   ├── constants.py                   # Centralized project constants & defaults
│   │
│   ├── data/                          # Data loading layer (14+ loaders)
│   │   ├── data_loader.py            # Abstract base class (fetch → transform → validate → cache → save)
│   │   ├── loader_registry.py        # Runtime loader discovery
│   │   ├── synthetic.py              # SyntheticDataFactory facade
│   │   ├── kaggle_loader.py          # Generic Kaggle dataset loader
│   │   ├── sentiment_loader.py       # HackerNews Algolia API
│   │   ├── spare_cores_loader.py     # SpareCores convenience loader
│   │   ├── benchmark_loaders/        # Cross-cloud benchmark data
│   │   │   ├── queries.py            # SQL queries (separated from logic)
│   │   │   └── spare_cores_loader.py # SpareCores via sparecores-data package
│   │   ├── cloud_pricing_loaders/    # Cloud pricing APIs
│   │   │   ├── aws_pricing_loader.py
│   │   │   ├── azure_pricing_loader.py
│   │   │   └── infracost_loader.py
│   │   ├── survey_loaders/           # Survey datasets
│   │   │   ├── stack_overflow_survey_loader.py
│   │   │   └── cncf_survey_loader.py
│   │   ├── security_loaders/         # Vulnerability & breach data
│   │   │   ├── nvd_loader.py
│   │   │   ├── cisakevl_loader.py
│   │   │   ├── hibp_loader.py
│   │   │   └── vcdbl_loader.py
│   │   ├── outage_loaders/           # Outage & incident data
│   │   │   ├── gcp_outage_loader.py
│   │   │   ├── outages_project_loader.py
│   │   │   └── outage_analyzer.py
│   │   ├── energy_loaders/           # Carbon & energy data
│   │   │   ├── google_carbon_loader.py
│   │   │   └── ccf_coefficients_loader.py
│   │   └── synthetic_generators/     # Synthetic data generators (fallback when APIs unavailable)
│   │       ├── synthetic_generator.py      # Abstract base
│   │       ├── spare_cores_synthetic.py
│   │       ├── azure_pricing_synthetic.py
│   │       ├── gcp_outages_synthetic.py
│   │       ├── google_carbon_synthetic.py
│   │       ├── cisa_kev_synthetic.py
│   │       ├── hibp_synthetic.py
│   │       └── hackernews_synthetic.py
│   │
│   ├── models/                        # Domain models
│   │   ├── hypothesis/                # 48 hypotheses with full metadata
│   │   │   ├── hypothesis.py
│   │   │   ├── hypothesis_family.py
│   │   │   ├── hypothesis_catalog.py
│   │   │   ├── test_direction.py
│   │   │   └── build_catalog.py
│   │   ├── workloads/                 # Workload profiles & simulation
│   │   │   ├── workload_type.py
│   │   │   ├── workload_profile.py
│   │   │   ├── workload_simulator.py
│   │   │   ├── archetype_profiles.py
│   │   │   └── demand_strategies/     # Pluggable demand generation strategies
│   │   │       ├── demand_strategy.py       # Abstract base
│   │   │       ├── steady_state_demand.py
│   │   │       ├── diurnal_demand.py
│   │   │       ├── bursty_demand.py
│   │   │       ├── batch_demand.py
│   │   │       ├── growth_demand.py
│   │   │       └── seasonal_demand.py
│   │   ├── cost_model.py             # Abstract CostModel base
│   │   ├── cost_component.py         # CostComponent dataclass
│   │   ├── cloud_cost_model.py       # Cloud TCO builder
│   │   ├── on_prem_cost_model.py     # On-prem TCO builder
│   │   ├── tco_comparator.py         # Side-by-side comparison + Monte Carlo
│   │   └── monte_carlo_simulator.py  # Monte Carlo cost simulation engine
│   │
│   ├── analysis/                      # Statistical testing framework (20+ tests)
│   │   ├── statistical_test.py       # Abstract StatisticalTest base
│   │   ├── test_registry.py          # TestRegistry for runtime discovery
│   │   ├── multiple_testing.py       # FDR, Bonferroni, Holm correction
│   │   ├── frequentist/              # Classical hypothesis tests
│   │   │   ├── welch_t_test.py
│   │   │   ├── paired_t_test.py
│   │   │   ├── one_sample_t_test.py
│   │   │   ├── one_way_anova.py
│   │   │   ├── mann_whitney_u.py
│   │   │   ├── kruskal_wallis.py
│   │   │   ├── levenes_test.py
│   │   │   ├── spearman_correlation.py
│   │   │   ├── chi_square_test.py
│   │   │   ├── cochran_armitage_trend.py
│   │   │   ├── linear_regression.py
│   │   │   ├── logistic_regression.py
│   │   │   ├── proportion_z_test.py
│   │   │   └── tost_equivalence.py
│   │   ├── inference/                 # Bootstrap, Bayesian & resampling methods
│   │   │   ├── resampling_engine.py
│   │   │   ├── bootstrap_engine.py
│   │   │   ├── permutation_engine.py
│   │   │   ├── bootstrap_ci.py
│   │   │   ├── bootstrap_mean_difference.py
│   │   │   ├── bootstrap_cv_comparison.py
│   │   │   ├── bootstrap_correlation.py
│   │   │   ├── bayesian_t_test.py
│   │   │   ├── bayesian_equivalence.py
│   │   │   └── bayesian_variance_ratio.py
│   │   └── specialized/              # Domain-specific analyses
│   │       ├── garch_volatility.py
│   │       ├── mann_kendall_trend.py
│   │       ├── poisson_regression.py
│   │       ├── sentiment_analyzer.py
│   │       ├── sentiment_scorer.py
│   │       └── survival.py
│   │
│   ├── visualization/                 # Plotting framework
│   │   ├── dashboard.py              # Multi-panel summary dashboard
│   │   ├── plotters/
│   │   │   ├── plotter_style.py      # Matplotlib/Seaborn style config
│   │   │   ├── plotter.py            # Base plotter utilities
│   │   │   ├── cost_plotter.py       # TCO, pricing, Monte Carlo, break-even
│   │   │   └── performance_plotter.py # Benchmarks, variance, carbon, outages
│   │   └── panels/                    # Dashboard panel components
│   │       ├── dashboard_panel.py          # Abstract base
│   │       ├── verdict_summary_panel.py
│   │       ├── effect_sizes_panel.py
│   │       ├── p_value_distribution_panel.py
│   │       ├── family_breakdown_panel.py
│   │       ├── significance_matrix_panel.py
│   │       └── top_findings_panel.py
│   │
│   ├── utils/                         # Shared utilities (one class per file)
│   │   ├── data_cache.py             # Disk cache (parquet with TTL)
│   │   ├── cache_entry.py            # Cache entry dataclass
│   │   ├── data_validator.py         # Schema-based data validation
│   │   ├── schema_rule.py            # Validation rule definition
│   │   ├── severity.py               # Validation severity levels
│   │   ├── validation_issue.py       # Validation issue dataclass
│   │   ├── outlier_detector.py       # Outlier detection dispatcher
│   │   ├── report_builder.py         # ReportBuilder (accumulates results)
│   │   ├── hypothesis_result.py      # HypothesisResult dataclass
│   │   ├── effect_size.py            # EffectSize dataclass
│   │   ├── effect_size_calculator.py # Effect size computation utilities
│   │   ├── test_verdict.py           # TestVerdict enum
│   │   ├── checks/                   # Validation check strategies
│   │   │   ├── validation_check.py         # Abstract base
│   │   │   ├── dtype_check.py
│   │   │   ├── null_check.py
│   │   │   ├── range_check.py
│   │   │   ├── allowed_values_check.py
│   │   │   ├── uniqueness_check.py
│   │   │   └── custom_check.py
│   │   ├── outlier_strategies/       # Outlier detection strategies
│   │   │   ├── outlier_strategy.py         # Abstract base
│   │   │   ├── iqr_outlier_strategy.py
│   │   │   └── zscore_outlier_strategy.py
│   │   └── report_exporters/         # Report export formats
│   │       ├── report_exporter.py          # Abstract base
│   │       ├── dataframe_exporter.py
│   │       ├── json_exporter.py
│   │       └── markdown_exporter.py
│   │
│   ├── analysis_task.py              # AnalysisTask dataclass
│   ├── research_pipeline.py          # Full research orchestrator
│   ├── pipeline_data_manager.py      # Pipeline data management
│   ├── quick_analysis.py             # Lightweight single-hypothesis runner
│   │
│   └── tests/                         # All project tests
│       ├── test_apis.py               # Connectivity check for all external APIs
│       ├── test_analysis.py           # Unit tests for statistical test classes
│       ├── test_models.py             # Unit tests for hypothesis catalog & TCO models
│       └── test_loaders.py            # Unit tests for data loaders
│
├── notebooks/
│   ├── 00_theoretical_framework.ipynb # Hypothesis definitions & methodology overview
│   ├── 01_data_collection.ipynb       # Data fetching, validation & source mapping
│   ├── 02_exploratory_analysis.ipynb  # EDA, distributions, normality checks
│   ├── 03_cost_analysis.ipynb         # Chapter 1: H1–H8, H38–H41 (Cost & TCO)
│   ├── 04_performance_analysis.ipynb  # Chapter 2: H9–H13, H22–H24, H34–H37
│   ├── 05_reliability_security.ipynb  # Chapter 3: H14–H21 (Reliability & Security)
│   └── 06_outcomes_synthesis.ipynb    # Chapter 4: H25–H33, H42–H48
│
├── data/
│   ├── raw/                           # Fetched datasets (CSV, auto-saved by loaders)
│   ├── processed/                     # Analysis outputs (results.csv, results.json)
│   └── .cache/                        # Parquet cache (TTL-based)
│
└── requirements.txt

Quick Start

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Verify API connectivity
python -m source.tests.test_apis

# Run unit tests
python -m pytest source/tests/ -v

# Launch notebooks
jupyter notebook notebooks/

Data Sources

| Source | Loader | Records | API |
|---|---|---|---|
| AWS EC2 Pricing | AWSPricingLoader | ~60 000 | AWS Bulk Pricing |
| Azure VM Pricing | AzurePricingLoader | ~8 000 | Azure Retail Prices |
| Spare Cores Benchmarks | SpareCoresLoader | ~2 000 | sparecores-data (SQLite) |
| NVD Vulnerabilities | NVDLoader | ~2 000 | NVD CVE 2.0 |
| CISA Known Exploited | CISAKEVLoader | ~1 500 | CISA KEV |
| HIBP Breaches | HIBPLoader | ~970 | Have I Been Pwned |
| HackerNews Sentiment | HackerNewsSentimentLoader | ~1 700 | Algolia HN Search |
| Google Carbon | GoogleCarbonLoader | ~40 | GitHub CSV |
| GCP Outages | GCPOutageLoader | ~4 | GCP Status JSON |

All data is cached locally after first fetch. Loaders auto-save CSV to data/raw/.

OOP Design Patterns

  • Template Method — DataLoader.load() orchestrates fetch → transform → validate → cache → save; SyntheticGenerator.generate() defines the skeleton for synthetic data creation
  • Strategy — StatisticalTest with interchangeable test implementations; DemandStrategy for workload demand generation (diurnal, bursty, batch, etc.); OutlierStrategy for pluggable outlier detection (IQR, Z-score)
  • Registry — LoaderRegistry and TestRegistry for runtime discovery
  • Factory — HypothesisCatalog builds and registers all 48 hypotheses via build_catalog; SyntheticDataFactory facade delegates to individual generators
  • Builder — CloudCostModel.build() and OnPremCostModel.build() with a fluent API
  • Composition — TCOComparator composes two cost models; DashboardBuilder composes panels into a report dashboard
  • Observer-like — ReportBuilder accumulates HypothesisResult objects from independent tests
  • Exporter — ReportExporter base with DataFrameExporter, JsonExporter, MarkdownExporter for multi-format output
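The Template Method pattern is the backbone of the loader layer: the base class fixes the pipeline order while subclasses supply the individual steps. A minimal sketch of the idea (illustrative only; the real DataLoader also caches and saves, and its hook signatures may differ):

```python
from abc import ABC, abstractmethod


class DataLoader(ABC):
    """Template Method skeleton: load() fixes the pipeline order,
    subclasses fill in fetch() and transform()."""

    def load(self) -> list[dict]:
        raw = self.fetch()
        records = self.transform(raw)
        self.validate(records)
        return records

    @abstractmethod
    def fetch(self):
        """Retrieve raw data from an API, file, or synthetic generator."""

    @abstractmethod
    def transform(self, raw) -> list[dict]:
        """Normalize the raw payload into a list of records."""

    def validate(self, records: list[dict]) -> None:
        # Default hook; subclasses may apply stricter schema checks.
        if not records:
            raise ValueError("loader produced no records")


class InMemoryLoader(DataLoader):
    """Toy subclass: fetches from a constant instead of a live API."""

    def fetch(self):
        return "a=1;b=2"

    def transform(self, raw) -> list[dict]:
        return [dict([pair.split("=")]) for pair in raw.split(";")]
```

Because load() is defined once on the base class, every loader in the registry exposes the same call signature, which is what makes runtime discovery via LoaderRegistry practical.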

The 48 Hypotheses

| Category | Hypotheses | Chapter |
|---|---|---|
| A: Cost & TCO | H1–H8 | 1 |
| B: Performance | H9–H13 | 2 |
| C: Reliability | H14–H17 | 3 |
| D: Security | H18–H21 | 3 |
| E: Scalability | H22–H23 | 2 |
| F: Developer | H24–H26 | 2 & 4 |
| G: Energy | H27–H29 | 4 |
| H: Company | H30–H33 | 4 |
| I: Workload | H34–H37 | 2 |
| J: Hidden | H38–H48 | 1 & 4 |

Statistical Methods

  • Parametric — Welch's t-test, paired t-test, one-sample t, ANOVA, linear/logistic regression, TOST equivalence
  • Non-parametric — Mann-Whitney U, Kruskal-Wallis, Levene's, Spearman, chi-square, Cochran-Armitage, proportion Z
  • Bootstrap — BCa confidence intervals, mean difference, CV comparison, correlation
  • Bayesian — BEST (Kruschke), ROPE equivalence, variance ratio (conjugate InverseGamma)
  • Time series — GARCH(1,1), Mann-Kendall trend, Poisson regression
  • Survival — Kaplan-Meier + log-rank test
  • NLP — VADER sentiment, TextBlob, domain keyword scoring
  • Correction — Benjamini-Hochberg FDR, Bonferroni, Holm (within hypothesis families)
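As an illustration of the correction step, here is a self-contained Benjamini-Hochberg procedure in its standard textbook formulation (the project's multiple_testing.py may differ in interface details):

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up procedure.

    Returns one reject/keep flag per input p-value, controlling the
    false discovery rate at level q across the whole family.
    """
    m = len(p_values)
    # Rank the hypotheses by p-value, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q ...
    threshold_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            threshold_rank = rank
    # ... and reject every hypothesis at or below that rank.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= threshold_rank:
            reject[idx] = True
    return reject
```

Applying the correction within each hypothesis family, as the README describes, keeps the FDR guarantee scoped to related tests instead of diluting it across all 48 hypotheses at once.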

About

A data-driven research platform that statistically evaluates cloud vs. on-premise infrastructure across 48 hypotheses spanning cost, performance, reliability, security, and scalability. Uses real-world data from AWS, Azure, GCP, and security/benchmark sources with 20+ statistical methods to deliver evidence-based verdicts.
