K3s setup (#6)
* start k3s by flux jobs
* k3s setup in full flux nodes working properly
* k3s in root mode with flux job, part 1 working
* some tweaking and more testing done
* remove sensitive information
* added more instructions in readme, updated protocol for basic flux setup also
rajibhossen committed Jun 15, 2023
1 parent 8c10b4e commit c363df8
Showing 8 changed files with 393 additions and 44 deletions.
3 changes: 2 additions & 1 deletion examples/autoscale/main.tf
@@ -149,11 +149,12 @@ resource "aws_security_group" "security_group" {
cidr_blocks = ["0.0.0.0/0"]
}

+ # temporarily allowing all protocols for internal communication
ingress {
description = "Allow internal traffic"
from_port = 0
to_port = 0
- protocol = "tcp"
+ protocol = "-1"
cidr_blocks = [local.cidr_block_a, local.cidr_block_b, local.cidr_block_c]
}

42 changes: 36 additions & 6 deletions examples/k3s/README.md
@@ -1,9 +1,24 @@
# Currently Under Construction

# Instructions
- Assumes you already have the image from the main instructions [../../README.md](README.md)

## Export AWS credentials to environment variables.

```bash
export AWS_ACCESS_KEY_ID=<>
export AWS_SECRET_ACCESS_KEY=<>
export AWS_SESSION_TOKEN=<>
export AWS_DEFAULT_REGION=us-east-1
export TF_VAR_aws_secret=$AWS_SECRET_ACCESS_KEY
export TF_VAR_aws_key=$AWS_ACCESS_KEY_ID
export TF_VAR_aws_session=$AWS_SESSION_TOKEN
```

Assumes you already have the image from the main instructions [README.md](../../README.md)
And then init and build:

Note: By default, the instances only allow SSH from specific machines. Change `ip_address_allowed` in the `main.tf` file according to your needs.

```bash
$ make init
$ make fmt
@@ -16,22 +31,37 @@ Or they all can be run with `make`:
```bash
$ make
```
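
If you want to restrict `ip_address_allowed` (see the note above) to just your own machine, you can look up your current public IP first; ifconfig.me is one of several services that echo it back:

```bash
$ curl -s https://ifconfig.me
```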
The K3S binary will be available on the instances once they are launched.

### Upload the K3S starter script and flux job submit script to ALL the nodes
The important files for the K3S setup are [k3s_starter.sh](../scripts/k3s_starter.sh), [k3s_cleanup.sh](../scripts/k3s_cleanup.sh), and [k3s_agent_cleanup.sh](../scripts/k3s_agent_cleanup.sh). If you use git clone, make sure you change the directory in [k3s_starter.sh](../scripts/k3s_starter.sh) so that it points to the cleanup files. Optionally, you can upload them from your local directory to the instances with the commands below. [flux_batch_job.sh](../scripts/flux_batch_job.sh) runs all the necessary steps to install K3S along with your HPC jobs.

```bash
$ scp -i "mykey.pem" k3s_starter.sh rocky@ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com:
$ scp -i "mykey.pem" k3s_cleanup.sh rocky@ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com:
$ scp -i "mykey.pem" k3s_agent_cleanup.sh rocky@ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com:
$ scp -i "mykey.pem" flux_batch_job.sh rocky@ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com:
```
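
If you have several nodes, a small loop saves retyping. This is a hypothetical helper, so replace the placeholder hostnames with your instances' public DNS names:

```bash
for host in ec2-xx-xxx-xx-101.compute-1.amazonaws.com ec2-xx-xxx-xx-102.compute-1.amazonaws.com ec2-xx-xxx-xx-103.compute-1.amazonaws.com; do
    # Copy all four scripts into the rocky user's home directory on each node
    scp -i "mykey.pem" k3s_starter.sh k3s_cleanup.sh k3s_agent_cleanup.sh flux_batch_job.sh "rocky@${host}:"
done
```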

### Note: the k3s deployment script (k3s_starter.sh) assumes the cleanup scripts are in the user's home directory.

- You can then shell into any node, and check the status of K3S.
+ You can then shell into any node and submit flux jobs.

```bash
$ ssh -o 'IdentitiesOnly yes' -i "mykey.pem" rocky@ec2-xx-xxx-xx-xxx.compute-1.amazonaws.com
```
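
Once connected, a quick sanity check (assuming the default locations used in this guide) confirms the K3S binary is installed and the scripts landed in the home directory:

```bash
$ k3s --version
$ ls -l $HOME/k3s_starter.sh $HOME/k3s_cleanup.sh $HOME/k3s_agent_cleanup.sh $HOME/flux_batch_job.sh
```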

- Check the cluster status, the overlay status, and try running a job:
+ ### Now, run a flux job that will start K3S and run your workload
+ Be sure to change the k3s secret value, the number of instances, and any other settings you need!
+ The command below runs a job across three nodes.

```bash
- $ kubectl get nodes
+ $ flux batch -N 3 --error k3s_installation.out --output k3s_installation.out flux_batch_job.sh "k3s_secret_token"
```
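
While the batch job runs, you can watch it and tail its log; once the server is up on the leader, the kubectl bundled in the k3s binary should list all three nodes. A typical sequence (file names as above; `sudo` may be needed depending on the kubeconfig mode):

```bash
$ flux jobs -a
$ tail -f k3s_installation.out
$ sudo k3s kubectl get nodes
```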

- You can look at the startup script logs like this if you need to debug.
+ You can look at the script logs / runtime logs like this if you need to debug.
```bash
$ cat /var/log/cloud-init-output.log
$ cat $HOME/<script_name>.out
```

That's it. Enjoy!
37 changes: 20 additions & 17 deletions examples/k3s/main.tf
@@ -8,15 +8,15 @@ locals {
ami = "ami-077679e63cd2e9248"
instance_type = "m4.large"
vpc_cidr = "10.0.0.0/16"
- key_name = "<AWS Key Name>"
+ key_name = "<>"

# Must be larger than ami
volume_size = 30

# Set autoscaling to consistent size so we don't scale for now
- min_size = 1
+ min_size = 3
max_size = 3
- desired_size = 1
+ desired_size = 3

cidr_block_a = "10.0.1.0/24"
cidr_block_b = "10.0.2.0/24"
@@ -27,7 +27,7 @@

# "0.0.0.0/0" allows from anywhere - update
# this to be just your ip / collaborators
- ip_address_allowed = ["0.0.0.0/0"]
+ ip_address_allowed = ["134.9.73.0/24"]
}

# Example queries to get public ip addresses or private DNS names
@@ -157,7 +157,7 @@ resource "aws_security_group" "security_group" {
description = "Allow internal traffic"
from_port = 0
to_port = 0
- protocol = "tcp"
+ protocol = "-1"
cidr_blocks = [local.cidr_block_a, local.cidr_block_b, local.cidr_block_c]
}

@@ -341,10 +341,13 @@ resource "aws_autoscaling_group" "autoscaling_group" {
health_check_type = "ELB"
capacity_rebalance = false

+ # Temporarily suspending the HealthCheck process so it doesn't check for instance health now.
+ suspended_processes = ["HealthCheck"]

# Make this really large so we don't check soon :)
- health_check_grace_period = 10000
- desired_capacity = local.desired_size
- target_group_arns = [aws_lb_target_group.target_group.arn]
+ # health_check_grace_period = 10000
+ desired_capacity = local.desired_size
+ target_group_arns = [aws_lb_target_group.target_group.arn]

termination_policies = ["NewestInstance"]

@@ -366,12 +369,12 @@
}
}

resource "aws_autoscaling_schedule" "autoscaling_by_schedule" {
scheduled_action_name = "${local.name}-autoscaling-schedule"
min_size = local.min_size
max_size = local.max_size
desired_capacity = local.max_size
start_time = timeadd(timestamp(), "5m") #adjust to runtime
time_zone = "US/Pacific" #set to your region
autoscaling_group_name = aws_autoscaling_group.autoscaling_group.name
}
# resource "aws_autoscaling_schedule" "autoscaling_by_schedule" {
# scheduled_action_name = "${local.name}-autoscaling-schedule"
# min_size = local.min_size
# max_size = local.max_size
# desired_capacity = local.max_size
# start_time = timeadd(timestamp(), "5m") #adjust to runtime
# time_zone = "US/Pacific" #set to your region
# autoscaling_group_name = aws_autoscaling_group.autoscaling_group.name
# }
17 changes: 17 additions & 0 deletions examples/scripts/flux_batch_job.sh
@@ -0,0 +1,17 @@
#!/bin/bash

# Require the k3s secret token before doing anything else
[ $# -eq 0 ] && { echo "Usage: $0 <k3s_secret_token>"; exit 1; }
secret_token=${1}

# Read the hostnames of all nodes in the job (one per line)
nodenames_string=$(flux exec -r all hostname)
echo "${nodenames_string}"

# Split the hostnames into an array; the first one becomes the K3S leader
readarray -t nodenames_array <<< "${nodenames_string}"
leader=${nodenames_array[0]}
echo "${leader}"

# Start K3S on every node; -N must match the node count given to flux batch
flux submit -N 3 --wait --error ./k3s_starter.out --output ./k3s_starter.out sh ./k3s_starter.sh "${leader}" "${secret_token}"

echo "JOB COMPLETE"
92 changes: 72 additions & 20 deletions examples/scripts/k3s-setup.sh
@@ -42,25 +42,77 @@ chown -R flux /run/flux
# See the README.md for commands how to set this manually without systemd
systemctl restart flux.service


- ## These are for installing K3S
-
- LEADER=($(echo $NODELIST | tr "," "\n"))
-
- if [[ "$LEADER" == $(hostname) ]]; then
-     curl -sfL https://get.k3s.io | K3S_KUBECONFIG_MODE="644" K3S_TOKEN="${k3s_token_name}" sh -
- else
-     # Check if the K3S API server is running or not
-     while :
-     do
-         curl --max-time 0.5 -k -o /dev/null https://"$LEADER":6443/livez
-         res=$?
-         if test "$res" != "0"; then
-             echo "the curl command failed with: $res"
-             sleep 5
-         else
-             echo "The K3S service is UP!"
-             break
-         fi
-     done
-     curl -sfL https://get.k3s.io | K3S_URL=https://"$LEADER":6443 K3S_TOKEN="${k3s_token_name}" K3S_KUBECONFIG_MODE="644" sh -
- fi
+ sudo curl -Lo /usr/bin/k3s https://github.com/k3s-io/k3s/releases/download/v1.26.5+k3s1/k3s
+ sudo chmod a+x /usr/bin/k3s
+
+ # Systemd file for K3S Manager Node
+ sudo tee /etc/systemd/system/k3s.service >/dev/null << EOF
+ [Unit]
+ Description=Lightweight Kubernetes
+ Documentation=https://k3s.io
+ Wants=network-online.target
+ After=network-online.target
+ [Install]
+ WantedBy=multi-user.target
+ [Service]
+ Type=notify
+ EnvironmentFile=-/etc/default/%N
+ EnvironmentFile=-/etc/sysconfig/%N
+ EnvironmentFile=-/etc/systemd/system/k3s.service.env
+ KillMode=process
+ Delegate=yes
+ # Having non-zero Limit*s causes performance problems due to accounting overhead
+ # in the kernel. We recommend using cgroups to do container-local accounting.
+ LimitNOFILE=1048576
+ LimitNPROC=infinity
+ LimitCORE=infinity
+ TasksMax=infinity
+ TimeoutStartSec=0
+ Restart=always
+ RestartSec=5s
+ ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
+ ExecStartPre=-/sbin/modprobe br_netfilter
+ ExecStartPre=-/sbin/modprobe overlay
+ ExecStart=/usr/bin/k3s server
+ EOF
+
+ # Systemd file for K3S Agent Node
+ sudo tee /etc/systemd/system/k3s-agent.service >/dev/null << EOF
+ [Unit]
+ Description=Lightweight Kubernetes
+ Documentation=https://k3s.io
+ Wants=network-online.target
+ After=network-online.target
+ [Install]
+ WantedBy=multi-user.target
+ [Service]
+ Type=notify
+ EnvironmentFile=-/etc/default/%N
+ EnvironmentFile=-/etc/sysconfig/%N
+ EnvironmentFile=-/etc/systemd/system/k3s-agent.service.env
+ KillMode=process
+ Delegate=yes
+ # Having non-zero Limit*s causes performance problems due to accounting overhead
+ # in the kernel. We recommend using cgroups to do container-local accounting.
+ LimitNOFILE=1048576
+ LimitNPROC=infinity
+ LimitCORE=infinity
+ TasksMax=infinity
+ TimeoutStartSec=0
+ Restart=always
+ RestartSec=5s
+ ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service'
+ ExecStartPre=-/sbin/modprobe br_netfilter
+ ExecStartPre=-/sbin/modprobe overlay
+ ExecStart=/usr/bin/k3s agent
+ EOF

# Loading service units
sudo systemctl daemon-reload
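
The starter script (k3s_starter.sh) is referenced by the README but not included in this diff. Given the two units above and the arguments that flux_batch_job.sh passes, a minimal sketch of the per-node startup logic might look like the following; the env-file names come from the `EnvironmentFile=` lines above, but everything else is an assumption about the real script:

```bash
#!/bin/bash
# Hypothetical sketch of k3s_starter.sh (not part of this diff).
# Arguments match what flux_batch_job.sh passes: leader hostname, then token.
leader=$1
secret_token=$2

if [[ "$(hostname)" == "${leader}" ]]; then
    # Leader node: hand the token to the server unit via its env file
    echo "K3S_TOKEN=${secret_token}" | sudo tee /etc/systemd/system/k3s.service.env >/dev/null
    sudo systemctl start k3s
else
    # Worker node: point the agent at the leader's API server
    sudo tee /etc/systemd/system/k3s-agent.service.env >/dev/null << EOF
K3S_URL=https://${leader}:6443
K3S_TOKEN=${secret_token}
EOF
    sudo systemctl start k3s-agent
fi
```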
85 changes: 85 additions & 0 deletions examples/scripts/k3s_agent_cleanup.sh
@@ -0,0 +1,85 @@
#!/bin/sh
set -x
[ $(id -u) -eq 0 ] || exec sudo $0 $@

for service in /etc/systemd/system/k3s*.service; do
    [ -s $service ] && systemctl stop $(basename $service)
done

for service in /etc/init.d/k3s*; do
    [ -x $service ] && $service stop
done

pschildren() {
    ps -e -o ppid= -o pid= | \
    sed -e 's/^\s*//g; s/\s\s*/\t/g;' | \
    grep -w "^$1" | \
    cut -f2
}

pstree() {
    for pid in $@; do
        echo $pid
        for child in $(pschildren $pid); do
            pstree $child
        done
    done
}

killtree() {
    kill -9 $(
        { set +x; } 2>/dev/null;
        pstree $@;
        set -x;
    ) 2>/dev/null
}

getshims() {
    ps -e -o pid= -o args= | sed -e 's/^ *//; s/\s\s*/\t/;' | grep -w 'k3s/data/[^/]*/bin/containerd-shim' | cut -f1
}

killtree $({ set +x; } 2>/dev/null; getshims; set -x)

do_unmount_and_remove() {
    set +x
    while read -r _ path _; do
        case "$path" in $1*) echo "$path" ;; esac
    done < /proc/self/mounts | sort -r | xargs -r -t -n 1 sh -c 'umount "$0" && rm -rf "$0"'
    set -x
}

do_unmount_and_remove '/run/k3s'
do_unmount_and_remove '/var/lib/rancher/k3s'
do_unmount_and_remove '/var/lib/kubelet/pods'
do_unmount_and_remove '/var/lib/kubelet/plugins'
do_unmount_and_remove '/run/netns/cni-'

# Remove CNI namespaces
ip netns show 2>/dev/null | grep cni- | xargs -r -t -n 1 ip netns delete

# Delete network interface(s) that match 'master cni0'
ip link show 2>/dev/null | grep 'master cni0' | while read ignore iface ignore; do
    iface=${iface%%@*}
    [ -z "$iface" ] || ip link delete $iface
done
ip link delete cni0
ip link delete flannel.1
ip link delete flannel-v6.1
ip link delete kube-ipvs0
ip link delete flannel-wg
ip link delete flannel-wg-v6
rm -rf /var/lib/cni/
iptables-save | grep -v KUBE- | grep -v CNI- | grep -iv flannel | iptables-restore
ip6tables-save | grep -v KUBE- | grep -v CNI- | grep -iv flannel | ip6tables-restore

systemctl disable k3s-agent
systemctl reset-failed k3s-agent
systemctl daemon-reload

rm -f /etc/systemd/system/k3s-agent.service.env

rm -rf /etc/rancher/k3s
rm -rf /run/k3s
rm -rf /run/flannel
rm -rf /var/lib/rancher/k3s
rm -rf /var/lib/kubelet
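
If you need to reset a worker by hand rather than through k3s_starter.sh, the script can be run directly; it re-execs itself under sudo if you are not root (see the `exec sudo` line at the top):

```bash
sh ./k3s_agent_cleanup.sh
```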
