# A Cloud Price Comparison
How do cloud providers stack up?
 
Making apples-to-apples comparisons between cloud providers is difficult, because each one offers instances with different combinations of vCPUs, RAM, SSD space and HDD space. Slightly different billing systems, promises of arcane discounting, USD-only pricing, and inconsistent naming conventions obscure matters further.

To provide a clearer price comparison, I'll be using the [random forest algorithm](https://en.wikipedia.org/wiki/Random_forest) to "[normalise](https://en.wikipedia.org/wiki/Normalization_(statistics))" the pricing of compute instances across different cloud providers.

In essence: **if every cloud provider offered the same size compute instances, how expensive would they be?**

## Importing libraries

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

## The dataset

I'll be taking the price tables of:
* Google Cloud - [Predefined machine types](https://cloud.google.com/compute/pricing#predefined_machine_types)
* AWS - [On demand instances](https://aws.amazon.com/ec2/pricing/on-demand/)
* Azure - [Linux virtual machines](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/)

and converting them into the instance sizes offered by [Catalyst Cloud](https://www.catalyst.net.nz/catalyst-cloud/prices). You can find the datasets and their sources [here](https://github.com/catalyst-cloud/catalystcloud-price-comparison/raw/master/dataset/Cloud%20price%20comparison.ods).

Importing the datasets.

In [2]:
catalyst_dataset = pd.read_csv("dataset/Rigorious Cloud price comparison - Exportable Catalyst prices.csv")
google_dataset = pd.read_csv("dataset/Rigorious Cloud price comparison - Exportable Google prices.csv")
aws_dataset = pd.read_csv("dataset/Rigorious Cloud price comparison - Exportable AWS prices.csv")
azure_dataset = pd.read_csv("dataset/Rigorious Cloud price comparison - Exportable Azure prices.csv")

Previewing the datasets.

In [3]:
catalyst_dataset.head(6)

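| Unnamed: 0 | Flavour | vCPUs | RAM (GiB) | HDD (GB) | SSD (GB) | Price per hour (NZD) |
|---|---|---|---|---|---|---|
| 0 | c1.c1r1 | 1.0 | 1.0 | 0.0 | 0.0 | 0.044 |
| 1 | c1.c1r2 | 1.0 | 2.0 | 0.0 | 0.0 | 0.062 |
| 2 | c1.c1r4 | 1.0 | 4.0 | 0.0 | 0.0 | 0.098 |
| 3 | c1.c2r1 | 2.0 | 1.0 | 0.0 | 0.0 | 0.07 |
| 4 | c1.c2r2 | 2.0 | 2.0 | 0.0 | 0.0 | 0.088 |
| 5 | c1.c2r4 | 2.0 | 4.0 | 0.0 | 0.0 | 0.124 |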


Now we'll split each dataset into NumPy arrays of input features (x) and target prices (y).

In [4]:
def split_dataset(dataset):
    # Input features: the four instance dimensions every provider lists
    x = dataset[["vCPUs", "RAM (GiB)", "HDD (GB)", "SSD (GB)"]].values
    # Target: the hourly price in NZD
    y = dataset["Price per hour (NZD)"].values

    return x, y

In [5]:
catalyst_x, catalyst_y = split_dataset(catalyst_dataset)
google_x, google_y = split_dataset(google_dataset)
aws_x, aws_y = split_dataset(aws_dataset)
azure_x, azure_y = split_dataset(azure_dataset)
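
To make the shape of these arrays concrete, here is a minimal sketch that runs the same split on the first three rows of the Catalyst preview above, typed in by hand. The `sample` name is only for illustration and isn't part of the notebook.

```python
import pandas as pd


def split_dataset(dataset):
    # Same column names as the exported CSVs used in this notebook
    x = dataset[["vCPUs", "RAM (GiB)", "HDD (GB)", "SSD (GB)"]].values
    y = dataset["Price per hour (NZD)"].values
    return x, y


# First three rows of the Catalyst preview, entered by hand
sample = pd.DataFrame({
    "Flavour": ["c1.c1r1", "c1.c1r2", "c1.c1r4"],
    "vCPUs": [1.0, 1.0, 1.0],
    "RAM (GiB)": [1.0, 2.0, 4.0],
    "HDD (GB)": [0.0, 0.0, 0.0],
    "SSD (GB)": [0.0, 0.0, 0.0],
    "Price per hour (NZD)": [0.044, 0.062, 0.098],
})

sample_x, sample_y = split_dataset(sample)

# One row per instance, one column per feature; one price per instance
print(sample_x.shape, sample_y.shape)
```

The flavour name is deliberately left out of `x`, so the model only ever sees the instance's size, not which provider or product line it belongs to.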

## The math
A random forest is a machine learning model that averages the predictions of many decision trees. We'll train one forest per cloud provider on that provider's price list, then ask each forest to predict what the provider would charge for each instance in Catalyst Cloud's catalogue.

In [6]:
# Initialise forests
google_forest = RandomForestRegressor(n_estimators=64)
aws_forest = RandomForestRegressor(n_estimators=64)
azure_forest = RandomForestRegressor(n_estimators=64)

# Train forests
google_forest.fit(google_x, google_y)
aws_forest.fit(aws_x, aws_y)
azure_forest.fit(azure_x, azure_y)

# Predict Catalyst X
google_cata_price = google_forest.predict(catalyst_x)
aws_cata_price = aws_forest.predict(catalyst_x)
azure_cata_price = azure_forest.predict(catalyst_x)
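
One property worth keeping in mind when reading the predictions: a random forest regressor averages prices it saw during training, so it can't predict a price outside the range of its training data. The sketch below illustrates this on a made-up price list. The numbers in `toy_x` and `toy_y` are hypothetical and do not come from any provider's dataset. `random_state=0` is added here only so the sketch is repeatable; the notebook's own forests don't set it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical price list (made-up numbers, not from the dataset).
# Columns: vCPUs, RAM (GiB), HDD (GB), SSD (GB)
toy_x = np.array([
    [1, 1, 0, 0],
    [1, 2, 0, 0],
    [2, 4, 0, 0],
    [4, 8, 0, 0],
    [8, 16, 0, 0],
], dtype=float)
toy_y = np.array([0.03, 0.05, 0.10, 0.20, 0.40])

toy_forest = RandomForestRegressor(n_estimators=64, random_state=0)
toy_forest.fit(toy_x, toy_y)

# One instance size inside the training range, one far outside it
queries = np.array([[2, 2, 0, 0], [64, 256, 0, 0]], dtype=float)
toy_pred = toy_forest.predict(queries)

# Both predictions stay between the cheapest and dearest training price,
# even for the 64 vCPU instance that is far larger than anything seen
print(toy_pred, toy_y.min(), toy_y.max())
```

In practice this means a predicted price is most trustworthy when the Catalyst instance size falls within the range of sizes the other provider actually sells.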

Now we attach each provider's predicted prices to the Catalyst instance sizes they were predicted for.

In [7]:
prices = catalyst_dataset.rename(index=str, columns={"Price per hour (NZD)": "Catalyst"})
prices["Google"], prices["AWS"], prices["Azure"] = [google_cata_price, aws_cata_price, azure_cata_price]

## The results

With the results in one table, we can compare the providers' prices against each other on an even scale.

A good scientist would, at this point, verify their results by comparing an intersection between the predicted output and the actual output. I would love to do this. However, I could find no such intersection.
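
A partial substitute, which this analysis does not use, is scikit-learn's out-of-bag (OOB) score: each tree is scored on the training rows it did not see, which gives a rough sense of how well a forest reproduces a provider's own price list. It does not replace a comparison against real prices for identical instances. The sketch below runs on synthetic data; the pricing formula and the `demo_` names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic price list: an assumed linear pricing rule, not real data
rng = np.random.default_rng(0)
vcpus = rng.choice([1, 2, 4, 8, 16], size=60)
ram = vcpus * rng.choice([1, 2, 4], size=60)
demo_x = np.column_stack([vcpus, ram, np.zeros(60), np.zeros(60)]).astype(float)
demo_y = 0.03 * vcpus + 0.01 * ram

# oob_score=True asks the forest to score itself on held-out bootstrap rows
demo_forest = RandomForestRegressor(n_estimators=64, oob_score=True, random_state=0)
demo_forest.fit(demo_x, demo_y)

# R^2 of the out-of-bag predictions; closer to 1 means a closer fit
oob_r2 = demo_forest.oob_score_
print(f"Out-of-bag R^2: {oob_r2:.3f}")
```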

You can find the datasets this analysis is based on [here](https://github.com/catalyst-cloud/catalystcloud-price-comparison/raw/master/dataset/Cloud%20price%20comparison.ods), and a chart plotting this data [here](https://object-storage.nz-por-1.catalystcloud.io/v1/AUTH_8ccc3286887e49cb9a40f023eba693b4/catalyst-cloud-price-comp/).

In [8]:
prices

| Unnamed: 0 | Flavour | vCPUs | RAM (GiB) | HDD (GB) | SSD (GB) | Catalyst | Google | AWS | Azure |
|---|---|---|---|---|---|---|---|---|---|
| 0 | c1.c1r1 | 1.0 | 1.0 | 0.0 | 0.0 | 0.044 | 0.035936 | 0.028745 | 0.067562 |
| 1 | c1.c1r2 | 1.0 | 2.0 | 0.0 | 0.0 | 0.062 | 0.065498 | 0.043465 | 0.087703 |
| 2 | c1.c1r4 | 1.0 | 4.0 | 0.0 | 0.0 | 0.098 | 0.103079 | 0.088082 | 0.138484 |
| 3 | c1.c2r1 | 2.0 | 1.0 | 0.0 | 0.0 | 0.07 | 0.114459 | 0.074562 | 0.109266 |
| 4 | c1.c2r2 | 2.0 | 2.0 | 0.0 | 0.0 | 0.088 | 0.138098 | 0.083121 | 0.125781 |
| 5 | c1.c2r4 | 2.0 | 4.0 | 0.0 | 0.0 | 0.124 | 0.154654 | 0.119331 | 0.178406 |
| 6 | c1.c2r8 | 2.0 | 8.0 | 0.0 | 0.0 | 0.196 | 0.201505 | 0.174281 | 0.253453 |
| 7 | c1.c2r16 | 2.0 | 16.0 | 0.0 | 0.0 | 0.339 | 0.279824 | 0.208991 | 0.338609 |
| 8 | c1.c4r2 | 4.0 | 2.0 | 0.0 | 0.0 | 0.14 | 0.252134 | 0.321487 | 0.246344 |
| 9 | c1.c4r4 | 4.0 | 4.0 | 0.0 | 0.0 | 0.176 | 0.270619 | 0.329594 | 0.280797 |
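
Raw hourly prices are hard to eyeball across ten columns, so one way to read a table like this is to express every price relative to Catalyst's and pick out the cheapest provider per flavour. The sketch below does that for three rows copied by hand from the table above. The `sample_prices`, `relative` and `cheapest` names are illustrative and not part of the notebook.

```python
import pandas as pd

# Three rows copied by hand from the results table (NZD per hour)
sample_prices = pd.DataFrame({
    "Flavour": ["c1.c1r1", "c1.c2r1", "c1.c4r2"],
    "Catalyst": [0.044, 0.07, 0.14],
    "Google": [0.035936, 0.114459, 0.252134],
    "AWS": [0.028745, 0.074562, 0.321487],
    "Azure": [0.067562, 0.109266, 0.246344],
}).set_index("Flavour")

providers = ["Catalyst", "Google", "AWS", "Azure"]

# Each price divided by the Catalyst price for the same flavour:
# below 1 is cheaper than Catalyst, above 1 is more expensive
relative = sample_prices[providers].div(sample_prices["Catalyst"], axis=0)

# Name of the column holding the lowest price in each row
cheapest = sample_prices[providers].idxmin(axis=1)

print(relative.round(2))
print(cheapest)
```

The same two lines would work on the full `prices` DataFrame, since it uses the same provider column names.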
