wip to add AWS (#2)
* wip to add AWS
* add support for cluster deletion
* finish up scaling and example, a few bug fixes

The steps here do a full creation of the cluster, which includes the
VPC, subnets (private and public), and security group, associating (and
creating if needed) a pem key, getting the endpoint and certificate to
write a kubeconfig yaml file for authentication, and then another stack
to create the workers pool. It's interesting that AWS first creates you
an "empty" cluster, meaning just a control plane, and then you need
to create the workers as a separate request and apply a config map
to kube-system so the control plane can see the workers!
This is so much more complex than GKE, and now that I have everything
working to go UP, I have to go backwards and figure out how to delete
everything before looking into scaling... hahahahahahah aahhhhh! :)
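
For reference, here is a minimal sketch of the endpoint/certificate step (not the
kubescaler implementation itself), assuming boto3 and PyYAML are installed, the
cluster already exists, and using illustrative names:

```python
# Sketch only: build a kubeconfig from the EKS endpoint and CA certificate.
# The cluster name and region below are placeholders, not kubescaler's API.
import boto3
import yaml

cluster_name = "flux-cluster"
region = "us-east-1"

eks = boto3.client("eks", region_name=region)
cluster = eks.describe_cluster(name=cluster_name)["cluster"]

kubeconfig = {
    "apiVersion": "v1",
    "kind": "Config",
    "clusters": [
        {
            "name": cluster["arn"],
            "cluster": {
                "server": cluster["endpoint"],
                "certificate-authority-data": cluster["certificateAuthority"]["data"],
            },
        }
    ],
    "users": [
        {
            "name": cluster["arn"],
            "user": {
                # Delegate token generation to the AWS CLI
                "exec": {
                    "apiVersion": "client.authentication.k8s.io/v1beta1",
                    "command": "aws",
                    "args": ["eks", "get-token", "--cluster-name", cluster_name],
                }
            },
        }
    ],
    "contexts": [
        {
            "name": cluster["arn"],
            "context": {"cluster": cluster["arn"], "user": cluster["arn"]},
        }
    ],
    "current-context": cluster["arn"],
}

with open("kubeconfig.yaml", "w") as fd:
    yaml.dump(kubeconfig, fd, default_flow_style=False)
```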

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed May 25, 2023
1 parent 94c391e commit e75b9a0
Showing 24 changed files with 1,057 additions and 85 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -12,3 +12,6 @@ dist/
__pycache__
*.img
/.eggs
*auth-config.yaml
*kubeconfig.yaml
*kubeconfig-*.yaml
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,4 +14,5 @@ and **Merged pull requests**. Critical items to know are:
The versions coincide with releases on pip. Only major versions will be released as tags on Github.

## [0.0.x](https://github.com/converged-computing/kubescaler/tree/main) (0.0.x)
- support for AWS EKS and first versioned release (0.0.1)
- initial skeleton release of project (0.0.0)
7 changes: 6 additions & 1 deletion README.md
@@ -16,7 +16,7 @@ up and down, of your Kubernetes clusters in Python. We currently have support fo
we use, namely:

- Google (GKE)
- Amazon (EKS) (under development)
- Amazon (EKS)

🚧️ **under development** 🚧️

@@ -44,6 +44,11 @@ tool to generate a contributors graphic below.

<!-- ALL-CONTRIBUTORS-LIST:END -->

## TODO

- fix up the GKE scale function to only be one function; we don't need to reset max and min again
- run experiments for scaling on EKS

## License

HPCIC DevTools is distributed under the terms of the MIT license.
40 changes: 40 additions & 0 deletions examples/aws/README.md
@@ -0,0 +1,40 @@
# AWS Examples

## Create and Delete a Cluster

This example shows creating and deleting a cluster. You should also be able to run
it if the cluster already exists. First, make sure your AWS credentials
are exported:

```bash
export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxx
export AWS_SESSION_TOKEN=xxxxxxxxxxxxxxxxxxxxxx
```

Then run the script (using the defaults: min size 1, max size 3):

```bash
$ python create-delete-cluster.py --min-node-count 1 --max-node-count 3 --machine-type m5.large
```
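
The same can be done from Python. Here is a minimal sketch of what the script does
(see `create-delete-cluster.py` in this directory for the full version); the name
and sizes below are just examples:

```python
from kubescaler.scaler import EKSCluster

# These values mirror the script defaults above
cli = EKSCluster(
    name="my-flux-cluster",
    node_count=2,
    min_nodes=1,
    max_nodes=3,
    machine_type="m5.large",
)

# Create (or retrieve) the cluster, then tear it down
cli.create_cluster()
cli.delete_cluster()
```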

## Test Scale

Here are some example runs for testing the time it takes to scale a cluster up.
We also time separate components of scaling, like creating the worker pool and
the VPC. We use small max sizes here since it's just a demo! First, install kubescaler with the AWS extras:

```bash
$ pip install -e .[aws]
$ pip install kubescaler[aws]
```
```bash
# Test scale up in increments of 1 (up to 3) for m5.large (the default), just one iteration!
$ python test-scale.py --increment 1 small-cluster --max-node-count 3 --min-node-count 0 --start-iter 0 --end-iter 1

# Slightly more reasonable experiment
$ python test-scale.py --increment 1 test-cluster --max-node-count 32 --min-node-count 0 --start-iter 0 --end-iter 10

# Test scale down in increments of 2 (5 down to 1) for the default number of iterations
$ python test-scale.py --increment 2 test-cluster --down --max-node-count 5
```
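
Each iteration saves its timings under `data/<experiment-name>/<cluster-name>/scaling-<iteration>.json`.
Here is a small sketch of inspecting those results afterward; the path and the exact
keys (for example `scale_up_1_to_2`) depend on the run you did:

```python
from kubescaler.utils import read_json

# Example path from a scale-up run with increment 1 (adjust to your own run)
result = read_json("data/test-scale-up-1/flux-cluster/scaling-0.json")

# Each entry maps a step name to the time it took in seconds
for step, seconds in result["times"].items():
    print(f"{step}: {seconds} seconds")
```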
74 changes: 74 additions & 0 deletions examples/aws/create-delete-cluster.py
@@ -0,0 +1,74 @@
#!/usr/bin/env python3

import argparse
import sys
import time

from kubescaler.scaler import EKSCluster


def get_parser():
    parser = argparse.ArgumentParser(
        description="K8s Cluster Creator / Destroyer!",
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        "cluster_name", nargs="?", help="Cluster name suffix", default="flux-cluster"
    )
    parser.add_argument(
        "--experiment", help="Experiment name (defaults to script name)", default=None
    )
    parser.add_argument("--node-count", help="starting node count", type=int, default=2)
    parser.add_argument(
        "--max-node-count", help="maximum node count", type=int, default=3
    )
    parser.add_argument(
        "--min-node-count",
        help="minimum node count",
        type=int,
        default=1,
    )
    parser.add_argument("--machine-type", help="AWS machine type", default="m5.large")
    return parser


def main():
    """
    Demonstrate creating and deleting a cluster. If the cluster exists,
    we should be able to retrieve it and not create a second one.
    """
    parser = get_parser()

    # If an error occurs while parsing the arguments, the interpreter will exit with value 2
    args, _ = parser.parse_known_args()

    # Pull cluster name out of argument
    cluster_name = args.cluster_name

    # Derive the experiment name, either named or from script
    experiment_name = args.experiment
    if not experiment_name:
        experiment_name = sys.argv[0].replace(".py", "")
    time.sleep(2)

    # Update cluster name to include experiment name
    cluster_name = f"{experiment_name}-{cluster_name}"
    print(f"📛️ Cluster name is {cluster_name}")

    print(
        f"⭐️ Creating the cluster sized {args.min_node_count} to {args.max_node_count}..."
    )
    cli = EKSCluster(
        name=cluster_name,
        node_count=args.node_count,
        max_nodes=args.max_node_count,
        min_nodes=args.min_node_count,
        machine_type=args.machine_type,
    )
    cli.create_cluster()
    print("⭐️ Deleting the cluster...")
    cli.delete_cluster()


if __name__ == "__main__":
    main()
179 changes: 179 additions & 0 deletions examples/aws/test-scale.py
@@ -0,0 +1,179 @@
#!/usr/bin/env python3

import argparse
import json
import os
import sys
import time

from kubescaler.scaler import EKSCluster
from kubescaler.utils import read_json

# Save data here
here = os.path.dirname(os.path.abspath(__file__))

# Create data output directory
data = os.path.join(here, "data")


def get_parser():
    parser = argparse.ArgumentParser(
        description="K8s Scaling Experiment Runner",
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        "cluster_name", nargs="?", help="Cluster name suffix", default="flux-cluster"
    )
    parser.add_argument(
        "--outdir",
        help="output directory for results",
        default=data,
    )
    parser.add_argument(
        "--experiment", help="Experiment name (defaults to script name)", default=None
    )
    parser.add_argument(
        "--start-iter", help="start at this iteration", type=int, default=0
    )
    parser.add_argument(
        "--end-iter", help="end at this iteration", type=int, default=3, dest="iters"
    )
    parser.add_argument(
        "--max-node-count", help="maximum node count", type=int, default=3
    )
    parser.add_argument(
        "--min-node-count", help="minimum node count", type=int, default=0
    )
    parser.add_argument(
        "--start-node-count",
        help="start at this many nodes and go up",
        type=int,
        default=1,
    )
    parser.add_argument("--machine-type", help="AWS machine type", default="m5.large")
    parser.add_argument(
        "--increment", help="Increment by this value", type=int, default=1
    )
    parser.add_argument(
        "--down", action="store_true", help="Test scaling down", default=False
    )
    return parser


def main():
    """
    This experiment will test scaling a cluster, three times, each
    time going from 2 nodes to 32. We want to understand if scaling is
    impacted by cluster size.
    """
    parser = get_parser()

    # If an error occurs while parsing the arguments, the interpreter will exit with value 2
    args, _ = parser.parse_known_args()

    # Pull cluster name out of argument
    cluster_name = args.cluster_name

    # Derive the experiment name, either named or from script
    experiment_name = args.experiment
    if not experiment_name:
        experiment_name = sys.argv[0].replace(".py", "")
    time.sleep(2)

    # Shared tags for logging and output
    if args.down:
        direction = "decrease"
        tag = "down"
    else:
        direction = "increase"
        tag = "up"

    # Update cluster name to include tag and increment
    experiment_name = f"{experiment_name}-{tag}-{args.increment}"
    print(f"📛️ Experiment name is {experiment_name}")

    # Prepare an output directory, named by cluster
    outdir = os.path.join(args.outdir, experiment_name, cluster_name)
    if not os.path.exists(outdir):
        print(f"📁️ Creating output directory {outdir}")
        os.makedirs(outdir)

    # Define stopping conditions for two directions
    def less_than_max(node_count):
        return node_count <= args.max_node_count

    def greater_than_zero(node_count):
        return node_count > 0

    # Update cluster name to include experiment name
    cluster_name = f"{experiment_name}-{cluster_name}"
    print(f"📛️ Cluster name is {cluster_name}")

    # Create 10 clusters, each going up to 32 nodes
    for iter in range(args.start_iter, args.iters):
        results_file = os.path.join(outdir, f"scaling-{iter}.json")

        # Start at the max if we are going down, otherwise the starting count
        node_count = args.max_node_count if args.down else args.start_node_count
        print(
            f"⭐️ Creating the initial cluster, iteration {iter} with size {node_count}..."
        )
        cli = EKSCluster(
            name=cluster_name,
            node_count=node_count,
            machine_type=args.machine_type,
            min_nodes=args.min_node_count,
            max_nodes=args.max_node_count,
        )
        # Load a result if we have it
        if os.path.exists(results_file):
            result = read_json(results_file)
            cli.times = result["times"]

        # Create the cluster (this times it)
        res = cli.create_cluster()
        print(f"📦️ The cluster has {cli.node_count} nodes!")

        # Flip between functions to decide to keep going based on:
        # > 0 (we are decreasing from the max node count)
        # <= max nodes (we are going up from a min node count)
        keep_going = less_than_max
        if args.down:
            keep_going = greater_than_zero

        # Continue scaling until we reach stopping condition
        while keep_going(node_count):
            old_size = node_count

            # Are we going down or up?
            if args.down:
                node_count -= args.increment
            else:
                node_count += args.increment

            print(
                f"⚖️ Iteration {iter}: scaling to {direction} by {args.increment}, from {old_size} to {node_count}"
            )

            # Scale the cluster - we should do similar logic for the GKE client (one function)
            start = time.time()
            res = cli.scale(node_count)
            end = time.time()
            seconds = round(end - start, 3)
            cli.times[f"scale_{tag}_{old_size}_to_{node_count}"] = seconds
            print(
                f"📦️ Scaling from {old_size} to {node_count} took {seconds} seconds, and the cluster now has {res.initial_node_count} nodes!"
            )

            # Save the times as we go
            print(json.dumps(cli.data, indent=4))
            cli.save(results_file)

        # Delete the cluster and clean up
        cli.delete_cluster()
        print(json.dumps(cli.data, indent=4))
        cli.save(results_file)


if __name__ == "__main__":
    main()
File renamed without changes.
2 changes: 1 addition & 1 deletion examples/test-scale.py → examples/google/test-scale.py
@@ -6,7 +6,7 @@
import sys
import time

from kubescaler import GKECluster
from kubescaler.scaler import GKECluster
from kubescaler.utils import read_json

# Save data here
1 change: 0 additions & 1 deletion kubescaler/__init__.py
@@ -1,2 +1 @@
from kubescaler.scaler import GKECluster
from kubescaler.version import __version__
23 changes: 19 additions & 4 deletions kubescaler/cluster.py
Expand Up @@ -3,6 +3,9 @@
#
# SPDX-License-Identifier: (MIT)

import os

import kubescaler.defaults as defaults
from kubescaler.utils import write_json


@@ -16,25 +19,37 @@ def __init__(
        name=None,
        description=None,
        tags=None,
        node_count=4,
        region=None,
        node_count=2,
        sleep_seconds=3,
        sleep_multiplier=1,
        max_nodes=32,
        max_nodes=3,
        min_nodes=0,
        machine_type=None,
        kubernetes_version=None,
    ):
        """
        A simple class to control creating a cluster
        """
        self.node_count = node_count
        self.tags = tags or ["kubescaler-cluster"]
        self.name = name or "kubescaler-cluster"

        # List or dict depending on cloud
        self.tags = tags
        self.name = os.path.basename(name or "kubescaler-cluster")
        self.max_nodes = max_nodes
        self.min_nodes = max(0, min_nodes)
        self.description = description or "A Kubescaler testing cluster"
        self.sleep_seconds = sleep_seconds
        self.kubernetes_version = kubernetes_version or defaults.kubernetes_version
        self.machine_type = machine_type

        # Sleep time multiplication factor must be at least 1
        self.sleep_multiplier = max(sleep_multiplier or 1, 1)
        self.sleep_time = sleep_seconds or 2

        # Region or default region
        self.region = region or self.default_region

        # Easy way to save times
        self.times = {}

