Merged
57 changes: 55 additions & 2 deletions README.md
@@ -1,2 +1,55 @@
# omnia
Software tools for standing up Slurm/Kubernetes clusters on Dell EMC PowerEdge servers from factory OS images
Dancing to the beat of a different drum.

# Short Version:

Install Kubernetes and all dependencies
```
ansible-playbook -i host_inventory_file build-kubernetes-cluster.yml
```

Initialize K8S cluster
```
ansible-playbook -i host_inventory_file build-kubernetes-cluster.yml --tags "init"
```
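
Both commands read `host_inventory_file`, which defines the `master`, `compute`, and `gpus` groups plus the `workers` and `cluster` parent groups. As a rough sketch, the same layout could also be written in Ansible's YAML inventory format. The host names and ranges below are copied from this repository's inventory file; the YAML form itself is an illustration, not something the playbook requires.

```yaml
# Sketch only: YAML equivalent of host_inventory_file (INI format in the repo).
all:
  children:
    cluster:
      children:
        master:
          hosts:
            friday:
        workers:
          children:
            compute:
              hosts:
                compute[000:005]:   # expands to compute000 .. compute005
            gpus:
              hosts:
                compute[003:005]:   # GPU nodes are a subset of the compute nodes
```

Replace `friday` and the `compute[...]` ranges with the host names of your own head node and workers before running either command.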


# What this does:

## Build/Install

### Add additional repositories:

- Kubernetes (Google)
- ELRepo (NVIDIA drivers)
- NVIDIA (nvidia-docker)
- EPEL (Extra Packages for Enterprise Linux)
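
The role adds the Kubernetes repository by copying a ready-made `kubernetes.repo` file to `/etc/yum.repos.d/`. As a hedged alternative, the same repository definition could be declared with Ansible's `yum_repository` module. The URLs below are the ones from `kubernetes.repo`; using this module instead of `copy` is an assumption for illustration, not what the playbook currently does.

```yaml
# Sketch only: declares the same repo that roles/common/files/kubernetes.repo defines.
- name: add kubernetes repo
  yum_repository:
    name: kubernetes
    description: Kubernetes
    baseurl: https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
    enabled: yes
    gpgcheck: yes
    repo_gpgcheck: yes
    gpgkey:
      - https://packages.cloud.google.com/yum/doc/yum-key.gpg
      - https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
  tags: install
```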

### Install common packages
- gcc
- python-pip
- docker
- kubelet
- kubeadm
- kubectl
- nvidia-detect
- kmod-nvidia
- nvidia-x11-drv
- nvidia-container-runtime
- ksonnet (CLI framework for K8S configs)

### Enable GPU Device Plugins (nvidia-container-runtime-hook)

### Modify kubeadm config to allow GPUs as schedulable resource
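
Once GPUs are exposed as a schedulable resource, a pod asks for one through its resource limits. The manifest below is a minimal sketch of such a request. It assumes the NVIDIA device plugin is running on the GPU nodes and advertises the `nvidia.com/gpu` resource, which this README does not show being deployed; the pod name and image are hypothetical placeholders.

```yaml
# Sketch only: a pod that requests one GPU from the scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                                # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: registry.example.com/cuda-base:latest    # hypothetical image, assumed to provide nvidia-smi
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                           # assumes the NVIDIA device plugin advertises this resource
```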

### Start and enable services
- Docker
- Kubelet

## Initialize Cluster
### Head/master
- Start K8S and pass the startup token to compute/slaves
- Initialize networking (currently using WeaveNet)
- Set up K8S Dashboard
- Create dynamic/persistent volumes
### Compute/slaves
- Join k8s cluster
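
The token hand-off between the head node and the compute nodes can be sketched with `kubeadm` as follows. This is an illustration under stated assumptions, not the contents of the `startmaster` and `startworkers` roles: the task names and the registered variable `join_command` are hypothetical, and the first two tasks are meant to run on the `master` group while the last runs on the workers.

```yaml
# Sketch only: task names and the join_command variable are assumptions.

# On the head/master node
- name: Initialize the control plane
  command: kubeadm init
  args:
    creates: /etc/kubernetes/admin.conf    # written by kubeadm init; skips the task on re-runs
  tags: init

- name: Create a token and print the matching join command
  command: kubeadm token create --print-join-command
  register: join_command
  tags: init

# On each compute/slave node
- name: Join k8s cluster
  command: "{{ hostvars[groups['master'][0]].join_command.stdout }}"
  args:
    creates: /etc/kubernetes/kubelet.conf  # written by kubeadm join; skips the task on re-runs
  tags: init
```

Networking (WeaveNet), the dashboard, and persistent volumes are left out of this sketch.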
35 changes: 35 additions & 0 deletions build-kubernetes-cluster.yml
@@ -0,0 +1,35 @@
---
#Playbook for kubernetes cluster

#collect info from everything
- hosts: all

# Apply Common Installation and Config
- hosts: cluster
  gather_facts: false
  roles:
    - common

# Apply GPU Node Config
- hosts: gpus
  gather_facts: false
  roles:
    - computeGPU

# Apply Master Config
- hosts: master
  gather_facts: false
  roles:
    - master

# Start K8s on master server
- hosts: master
  gather_facts: false
  roles:
    - startmaster

# Start K8s worker servers
- hosts: compute,gpus
  gather_facts: false
  roles:
    - startworkers
30 changes: 30 additions & 0 deletions example.yaml
@@ -0,0 +1,30 @@
apiVersion: "kubeflow.org/v1alpha2"
kind: "TFJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: PS
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
19 changes: 19 additions & 0 deletions host_inventory_file
@@ -0,0 +1,19 @@
[master]
friday

[compute]
compute[000:005]
#compute000
#compute001
#compute002

[gpus]
compute[003:005]

[workers:children]
compute
gpus

[cluster:children]
master
workers
3 changes: 3 additions & 0 deletions roles/common/files/k8s.conf
@@ -0,0 +1,3 @@
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1

8 changes: 8 additions & 0 deletions roles/common/files/kubernetes.repo
@@ -0,0 +1,8 @@
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg

3 changes: 3 additions & 0 deletions roles/common/files/nvidia
@@ -0,0 +1,3 @@
#!/bin/sh
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" exec nvidia-container-runtime-hook "$@"

21 changes: 21 additions & 0 deletions roles/common/handlers/main.yml
@@ -0,0 +1,21 @@
---

#- name: Enable docker service
#  service:
#    name: docker
#    enabled: yes
#
- name: Start and Enable docker service
  service:
    name: docker
    state: restarted
    enabled: yes
  #tags: install

- name: Start and Enable Kubernetes - kubelet
  service:
    name: kubelet
    state: started
    enabled: yes
  #tags: install

134 changes: 134 additions & 0 deletions roles/common/tasks/main.yml
@@ -0,0 +1,134 @@
---

- name: add kubernetes repo
  copy: src=kubernetes.repo dest=/etc/yum.repos.d/ owner=root group=root mode=644
  tags: install

# add ElRepo GPG Key
- rpm_key:
    state: present
    key: https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
  tags: install

- name: add ElRepo (Nvidia kmod drivers)
  yum:
    name: http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
    state: present
  tags: install

- name: update sysctl to handle incorrectly routed traffic when iptables is bypassed
  copy: src=k8s.conf dest=/etc/sysctl.d/ owner=root group=root mode=644
  tags: install

- name: update sysctl
  command: /sbin/sysctl --system
  tags: install

- name: Install EPEL Repository
  yum: name=epel-release state=present
  tags: install

#likely need to add a reboot hook in here
#- name: update kernel and all other system packages
#  yum: name=* state=latest
#  tags: install

- name: disable swap
  command: /sbin/swapoff -a
  tags: install

# Disable selinux
- selinux:
    state: disabled
  tags: install

- name: install common packages
  yum:
    name:
      - gcc
      - nfs-utils
      - python-pip
      - docker
      - bash-completion
      - kubelet
      - kubeadm
      - kubectl
      - nvidia-detect
    state: present
  tags: install

- name: install InfiniBand Support
  yum:
    name: "@Infiniband Support"
    state: present

- name: Install KSonnet
  unarchive:
    src: https://github.com/ksonnet/ksonnet/releases/download/v0.13.1/ks_0.13.1_linux_amd64.tar.gz
    dest: /usr/bin/
    extra_opts: [--strip-components=1]
    remote_src: yes
    # exclude paths must use the same version as the archive in src (0.13.1)
    exclude:
      - ks_0.13.1_linux_amd64/CHANGELOG.md
      - ks_0.13.1_linux_amd64/CODE-OF-CONDUCT.md
      - ks_0.13.1_linux_amd64/CONTRIBUTING.md
      - ks_0.13.1_linux_amd64/LICENSE
      - ks_0.13.1_linux_amd64/README.md
  tags: install

- name: upgrade pip
  command: /bin/pip install --upgrade pip
  tags: install

#- name: Enable DevicePlugins for all GPU nodes (nvidia-container-runtime-hook)
#  copy: src=nvidia dest=/usr/libexec/oci/hooks.d/ owner=root group=root mode=755
#  tags: install

- name: Add KUBELET_EXTRA_ARGS to enable GPUs
  lineinfile:
    path: /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
    line: 'Environment="KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true"'
    insertbefore: 'KUBELET_KUBECONFIG_ARGS='
  tags: install

- name: Start and Enable docker service
  service:
    name: docker
    state: restarted
    enabled: yes
  tags: install

- name: Start and Enable Kubernetes - kubelet
  service:
    name: kubelet
    state: restarted
    enabled: yes
  tags: install

- name: Start and Enable rpcbind service
  service:
    name: rpcbind
    state: restarted
    enabled: yes
  tags: install

- name: Start and Enable nfs-server service
  service:
    name: nfs-server
    state: restarted
    enabled: yes
  tags: install

- name: Start and Enable nfs-lock service
  service:
    name: nfs-lock
    state: restarted
    enabled: yes
  tags: install

- name: Start and Enable nfs-idmap service
  service:
    name: nfs-idmap
    state: restarted
    enabled: yes
  tags: install
10 changes: 10 additions & 0 deletions roles/common/vars/main.yml
@@ -0,0 +1,10 @@
---

common_packages:
- epel-release
- python-pip
- docker
- bash-completion
- kubelet
- kubeadm
- kubectl
3 changes: 3 additions & 0 deletions roles/computeGPU/files/k8s.conf
@@ -0,0 +1,3 @@
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1

8 changes: 8 additions & 0 deletions roles/computeGPU/files/kubernetes.repo
@@ -0,0 +1,8 @@
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg

3 changes: 3 additions & 0 deletions roles/computeGPU/files/nvidia
@@ -0,0 +1,3 @@
#!/bin/sh
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" exec nvidia-container-runtime-hook "$@"

21 changes: 21 additions & 0 deletions roles/computeGPU/handlers/main.yml
@@ -0,0 +1,21 @@
---

#- name: Enable docker service
#  service:
#    name: docker
#    enabled: yes
#
- name: Start and Enable docker service
  service:
    name: docker
    state: restarted
    enabled: yes
  #tags: install

- name: Start and Enable Kubernetes - kubelet
  service:
    name: kubelet
    state: started
    enabled: yes
  #tags: install
