wip to add AWS (#2)
* wip to add AWS
* add support for cluster deletion
* finish up scaling and example, a few bug fixes

The steps here do a full creation of the cluster, which includes the
VPC, subnets (private and public), and security group, associating (and
creating if needed) a pem key, getting the endpoint and certificate to
write a kubeconfig yaml file for authentication, and then another stack
to create the workers pool. It's interesting that AWS first creates you
an "empty" cluster, meaning just a control plane, and then you need
to create the workers as a separate request and apply a config map
to kube-system so the control plane can see the workers!
This is so much more complex than GKE, and now that I have everything
working to go UP, I have to go backwards and figure out how to delete
everything before looking into scaling... hahahahahahah aahhhhh! :)
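
For reference, here is a minimal sketch of the endpoint/certificate step (not the
kubescaler implementation itself), assuming boto3 and PyYAML are installed, the
cluster already exists, and using illustrative names:

```python
# Sketch only: build a kubeconfig from the EKS endpoint and CA certificate.
# The cluster name and region below are placeholders, not kubescaler's API.
import boto3
import yaml

cluster_name = "flux-cluster"
region = "us-east-1"

eks = boto3.client("eks", region_name=region)
cluster = eks.describe_cluster(name=cluster_name)["cluster"]

kubeconfig = {
    "apiVersion": "v1",
    "kind": "Config",
    "clusters": [
        {
            "name": cluster["arn"],
            "cluster": {
                "server": cluster["endpoint"],
                "certificate-authority-data": cluster["certificateAuthority"]["data"],
            },
        }
    ],
    "users": [
        {
            "name": cluster["arn"],
            "user": {
                # Delegate token generation to the AWS CLI
                "exec": {
                    "apiVersion": "client.authentication.k8s.io/v1beta1",
                    "command": "aws",
                    "args": ["eks", "get-token", "--cluster-name", cluster_name],
                }
            },
        }
    ],
    "contexts": [
        {
            "name": cluster["arn"],
            "context": {"cluster": cluster["arn"], "user": cluster["arn"]},
        }
    ],
    "current-context": cluster["arn"],
}

with open("kubeconfig.yaml", "w") as fd:
    yaml.dump(kubeconfig, fd, default_flow_style=False)
```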

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed May 25, 2023
1 parent 94c391e commit e75b9a0
Showing 24 changed files with 1,057 additions and 85 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -12,3 +12,6 @@ dist/
__pycache__
*.img
/.eggs
*auth-config.yaml
*kubeconfig.yaml
*kubeconfig-*.yaml
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -14,4 +14,5 @@ and **Merged pull requests**. Critical items to know are:
The versions coincide with releases on pip. Only major versions will be released as tags on Github.

## [0.0.x](https://github.com/converged-computing/kubescaler/tree/main) (0.0.x)
- support for AWS EKS and first versioned release (0.0.1)
- initial skeleton release of project (0.0.0)
7 changes: 6 additions & 1 deletion README.md
@@ -16,7 +16,7 @@ up and down, of your Kubernetes clusters in Python. We currently have support fo
we use, namely:

- Google (GKE)
- Amazon (EKS) (under development)
- Amazon (EKS)

🚧️ **under development** 🚧️

@@ -44,6 +44,11 @@ tool to generate a contributors graphic below.

<!-- ALL-CONTRIBUTORS-LIST:END -->

## TODO

- fix up the GKE scale function to only be one function; we don't need to reset max and min again
- run experiments for scaling on EKS

## License

HPCIC DevTools is distributed under the terms of the MIT license.
40 changes: 40 additions & 0 deletions examples/aws/README.md
@@ -0,0 +1,40 @@
# AWS Examples

## Create and Delete a Cluster

This example shows creating and deleting a cluster. You should also be able to run
it if the cluster already exists. First, make sure your AWS credentials
are exported:

```bash
export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxx
export AWS_SESSION_TOKEN=xxxxxxxxxxxxxxxxxxxxxx
```

Then run the script (using the defaults: min size 1, max size 3):

```bash
$ python create-delete-cluster.py --min-node-count 1 --max-node-count 3 --machine-type m5.large
```
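
The same can be done from Python. Here is a minimal sketch of what the script does
(see `create-delete-cluster.py` in this directory for the full version); the name
and sizes below are just examples:

```python
from kubescaler.scaler import EKSCluster

# These values mirror the script defaults above
cli = EKSCluster(
    name="my-flux-cluster",
    node_count=2,
    min_nodes=1,
    max_nodes=3,
    machine_type="m5.large",
)

# Create (or retrieve) the cluster, then tear it down
cli.create_cluster()
cli.delete_cluster()
```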

## Test Scale

Here are some example runs for testing the time it takes to scale a cluster up.
We also time separate components of scaling, like creating the worker pool and
the VPC. We use small max sizes here since it's just a demo! First, install kubescaler with the AWS extras:

```bash
$ pip install -e .[aws]
$ pip install kubescaler[aws]
```
```bash
# Test scale up in increments of 1 (up to 3) for m5.large (the default), just one iteration!
$ python test-scale.py --increment 1 small-cluster --max-node-count 3 --min-node-count 0 --start-iter 0 --end-iter 1

# Slightly more reasonable experiment
$ python test-scale.py --increment 1 test-cluster --max-node-count 32 --min-node-count 0 --start-iter 0 --end-iter 10

# Test scale down in increments of 2 (5 down to 1) for the default number of iterations
$ python test-scale.py --increment 2 test-cluster --down --max-node-count 5
```
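
Each iteration saves its timings under `data/<experiment-name>/<cluster-name>/scaling-<iteration>.json`.
Here is a small sketch of inspecting those results afterward; the path and the exact
keys (for example `scale_up_1_to_2`) depend on the run you did:

```python
from kubescaler.utils import read_json

# Example path from a scale-up run with increment 1 (adjust to your own run)
result = read_json("data/test-scale-up-1/flux-cluster/scaling-0.json")

# Each entry maps a step name to the time it took in seconds
for step, seconds in result["times"].items():
    print(f"{step}: {seconds} seconds")
```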
74 changes: 74 additions & 0 deletions examples/aws/create-delete-cluster.py
@@ -0,0 +1,74 @@
#!/usr/bin/env python3

import argparse
import sys
import time

from kubescaler.scaler import EKSCluster


def get_parser():
    parser = argparse.ArgumentParser(
        description="K8s Cluster Creator / Destroyer!",
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        "cluster_name", nargs="?", help="Cluster name suffix", default="flux-cluster"
    )
    parser.add_argument(
        "--experiment", help="Experiment name (defaults to script name)", default=None
    )
    parser.add_argument("--node-count", help="starting node count", type=int, default=2)
    parser.add_argument(
        "--max-node-count", help="maximum node count", type=int, default=3
    )
    parser.add_argument(
        "--min-node-count",
        help="minimum node count",
        type=int,
        default=1,
    )
    parser.add_argument("--machine-type", help="AWS machine type", default="m5.large")
    return parser


def main():
    """
    Demonstrate creating and deleting a cluster. If the cluster exists,
    we should be able to retrieve it and not create a second one.
    """
    parser = get_parser()

    # If an error occurs while parsing the arguments, the interpreter will exit with value 2
    args, _ = parser.parse_known_args()

    # Pull cluster name out of argument
    cluster_name = args.cluster_name

    # Derive the experiment name, either named or from script
    experiment_name = args.experiment
    if not experiment_name:
        experiment_name = sys.argv[0].replace(".py", "")
    time.sleep(2)

    # Update cluster name to include experiment name
    cluster_name = f"{experiment_name}-{cluster_name}"
    print(f"📛️ Cluster name is {cluster_name}")

    print(
        f"⭐️ Creating the cluster sized {args.min_node_count} to {args.max_node_count}..."
    )
    cli = EKSCluster(
        name=cluster_name,
        node_count=args.node_count,
        max_nodes=args.max_node_count,
        min_nodes=args.min_node_count,
        machine_type=args.machine_type,
    )
    cli.create_cluster()
    print("⭐️ Deleting the cluster...")
    cli.delete_cluster()


if __name__ == "__main__":
    main()
179 changes: 179 additions & 0 deletions examples/aws/test-scale.py
@@ -0,0 +1,179 @@
#!/usr/bin/env python3

import argparse
import json
import os
import sys
import time

from kubescaler.scaler import EKSCluster
from kubescaler.utils import read_json

# Save data here
here = os.path.dirname(os.path.abspath(__file__))

# Create data output directory
data = os.path.join(here, "data")


def get_parser():
    parser = argparse.ArgumentParser(
        description="K8s Scaling Experiment Runner",
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        "cluster_name", nargs="?", help="Cluster name suffix", default="flux-cluster"
    )
    parser.add_argument(
        "--outdir",
        help="output directory for results",
        default=data,
    )
    parser.add_argument(
        "--experiment", help="Experiment name (defaults to script name)", default=None
    )
    parser.add_argument(
        "--start-iter", help="start at this iteration", type=int, default=0
    )
    parser.add_argument(
        "--end-iter", help="end at this iteration", type=int, default=3, dest="iters"
    )
    parser.add_argument(
        "--max-node-count", help="maximum node count", type=int, default=3
    )
    parser.add_argument(
        "--min-node-count", help="minimum node count", type=int, default=0
    )
    parser.add_argument(
        "--start-node-count",
        help="start at this many nodes and go up",
        type=int,
        default=1,
    )
    parser.add_argument("--machine-type", help="AWS machine type", default="m5.large")
    parser.add_argument(
        "--increment", help="Increment by this value", type=int, default=1
    )
    parser.add_argument(
        "--down", action="store_true", help="Test scaling down", default=False
    )
    return parser


def main():
    """
    This experiment will test scaling a cluster, three times, each
    time going from 2 nodes to 32. We want to understand if scaling is
    impacted by cluster size.
    """
    parser = get_parser()

    # If an error occurs while parsing the arguments, the interpreter will exit with value 2
    args, _ = parser.parse_known_args()

    # Pull cluster name out of argument
    cluster_name = args.cluster_name

    # Derive the experiment name, either named or from script
    experiment_name = args.experiment
    if not experiment_name:
        experiment_name = sys.argv[0].replace(".py", "")
    time.sleep(2)

    # Shared tags for logging and output
    if args.down:
        direction = "decrease"
        tag = "down"
    else:
        direction = "increase"
        tag = "up"

    # Update cluster name to include tag and increment
    experiment_name = f"{experiment_name}-{tag}-{args.increment}"
    print(f"📛️ Experiment name is {experiment_name}")

    # Prepare an output directory, named by cluster
    outdir = os.path.join(args.outdir, experiment_name, cluster_name)
    if not os.path.exists(outdir):
        print(f"📁️ Creating output directory {outdir}")
        os.makedirs(outdir)

    # Define stopping conditions for two directions
    def less_than_max(node_count):
        return node_count <= args.max_node_count

    def greater_than_zero(node_count):
        return node_count > 0

    # Update cluster name to include experiment name
    cluster_name = f"{experiment_name}-{cluster_name}"
    print(f"📛️ Cluster name is {cluster_name}")

    # Create 10 clusters, each going up to 32 nodes
    for iter in range(args.start_iter, args.iters):
        results_file = os.path.join(outdir, f"scaling-{iter}.json")

        # Start at the max if we are going down, otherwise the starting count
        node_count = args.max_node_count if args.down else args.start_node_count
        print(
            f"⭐️ Creating the initial cluster, iteration {iter} with size {node_count}..."
        )
        cli = EKSCluster(
            name=cluster_name,
            node_count=node_count,
            machine_type=args.machine_type,
            min_nodes=args.min_node_count,
            max_nodes=args.max_node_count,
        )
        # Load a result if we have it
        if os.path.exists(results_file):
            result = read_json(results_file)
            cli.times = result["times"]

        # Create the cluster (this times it)
        res = cli.create_cluster()
        print(f"📦️ The cluster has {cli.node_count} nodes!")

        # Flip between functions to decide to keep going based on:
        # > 0 (we are decreasing from the max node count)
        # <= max nodes (we are going up from a min node count)
        keep_going = less_than_max
        if args.down:
            keep_going = greater_than_zero

        # Continue scaling until we reach stopping condition
        while keep_going(node_count):
            old_size = node_count

            # Are we going down or up?
            if args.down:
                node_count -= args.increment
            else:
                node_count += args.increment

            print(
                f"⚖️ Iteration {iter}: scaling to {direction} by {args.increment}, from {old_size} to {node_count}"
            )

            # Scale the cluster - we should do similar logic for the GKE client (one function)
            start = time.time()
            res = cli.scale(node_count)
            end = time.time()
            seconds = round(end - start, 3)
            cli.times[f"scale_{tag}_{old_size}_to_{node_count}"] = seconds
            print(
                f"📦️ Scaling from {old_size} to {node_count} took {seconds} seconds, and the cluster now has {res.initial_node_count} nodes!"
            )

            # Save the times as we go
            print(json.dumps(cli.data, indent=4))
            cli.save(results_file)

        # Delete the cluster and clean up
        cli.delete_cluster()
        print(json.dumps(cli.data, indent=4))
        cli.save(results_file)


if __name__ == "__main__":
    main()
File renamed without changes.
2 changes: 1 addition & 1 deletion examples/test-scale.py → examples/google/test-scale.py
@@ -6,7 +6,7 @@
import sys
import time

from kubescaler import GKECluster
from kubescaler.scaler import GKECluster
from kubescaler.utils import read_json

# Save data here
1 change: 0 additions & 1 deletion kubescaler/__init__.py
@@ -1,2 +1 @@
from kubescaler.scaler import GKECluster
from kubescaler.version import __version__
23 changes: 19 additions & 4 deletions kubescaler/cluster.py
Expand Up @@ -3,6 +3,9 @@
#
# SPDX-License-Identifier: (MIT)

import os

import kubescaler.defaults as defaults
from kubescaler.utils import write_json


@@ -16,25 +19,37 @@ def __init__(
        name=None,
        description=None,
        tags=None,
        node_count=4,
        region=None,
        node_count=2,
        sleep_seconds=3,
        sleep_multiplier=1,
        max_nodes=32,
        max_nodes=3,
        min_nodes=0,
        machine_type=None,
        kubernetes_version=None,
    ):
        """
        A simple class to control creating a cluster
        """
        self.node_count = node_count
        self.tags = tags or ["kubescaler-cluster"]
        self.name = name or "kubescaler-cluster"

        # List or dict depending on cloud
        self.tags = tags
        self.name = os.path.basename(name or "kubescaler-cluster")
        self.max_nodes = max_nodes
        self.min_nodes = max(0, min_nodes)
        self.description = description or "A Kubescaler testing cluster"
        self.sleep_seconds = sleep_seconds
        self.kubernetes_version = kubernetes_version or defaults.kubernetes_version
        self.machine_type = machine_type

        # Sleep time multiplication factor must be at least 1
        self.sleep_multiplier = max(sleep_multiplier or 1, 1)
        self.sleep_time = sleep_seconds or 2

        # Region or default region
        self.region = region or self.default_region

        # Easy way to save times
        self.times = {}

