diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 514068ffa..71c86f47f 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -7,59 +7,83 @@ These guidelines are based on the [pravega project](https://github.com/pravega/p This document will evolve as the project matures. Please be sure to regularly refer back in order to stay in-line with contribution guidelines. -## Issues and Pull Requests -To produce a pull request against Omnia, follow these steps: - -* **Create an issue:** Create an issue and describe what you are trying to solve. It doesn't matter whether it is a new feature, a bug fix, or an improvement. All pull requests need to be associated to an issue. See more here: Creating an issue -* **Issue branch:** Create a new branch on your fork of the repository. Typically, you need to branch off master, but there could be exceptions. To branch off master, use git checkout master; git checkout -b . -* **Push the changes:** To be able to create a pull request, push the changes to origin: git push --set-upstream origin . I'm assuming that origin is your personal repo, e.g., `lwilson/omnia.git`. -* **Branch name:** Use the following pattern to create your new branch name: issue-number-description, e.g., issue-1023-reformat-testutils. -* **Create a pull request:** Github gives you the option of creating a pull request. Give it a title following this format Issue ###: Description, _e.g., Issue 1023: Reformat testutils. Follow the guidelines in the description and try to provide as much information as possible to help the reviewer understand what is being addressed. It is important that you try to do a good job with the description to make the job of the code reviewer easier. A good description not only reduces review time, but also reduces the probability of a misunderstanding with the pull request. -* **Merging:** Merging of pull requests will be handled by project mantainers - -When preparing a pull request it is important to stay up-to-date with the master. 
We recommend that you rebase against the upstream repository _frequently_. To do this, use the following commands: -``` -git pull --rebase upstream master #upstream is dellhpc/omnia -git push --force origin #origin is your fork of the repository (e.g., /omnia.git) +## How to Contribute to Omnia +Contributions to Omnia are made through [Pull Requests (PRs)](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests). To make a pull request against Omnia, use the following steps: + +1. **Create an issue:** [Create an issue](https://help.github.com/en/github/managing-your-work-on-github/creating-an-issue) and describe what you are trying to solve. It does not matter whether it is a new feature, a bug fix, or an improvement. All pull requests need to be associated with an issue. When creating an issue, be sure to use the appropriate issue template (bug fix or feature request) and complete all of the required fields. If your issue does not fit in either a bug fix or feature request, then create a blank issue and be sure to include the following information: + * **Problem description:** Describe what you believe needs to be addressed + * **Problem location:** In which file and at what line does this issue occur? + * **Suggested resolution:** How do you intend to resolve the problem? +2. **Create a personal fork:** All work on Omnia should be done in a [fork of the repository](https://help.github.com/en/github/getting-started-with-github/fork-a-repo). Only the maintainers are allowed to commit directly to the project repository. +3. **Issue branch:** [Create a new branch](https://help.github.com/en/desktop/contributing-to-projects/creating-a-branch-for-your-work) on your fork of the repository. All contributions should be branched from `devel`. Use `git checkout devel; git checkout -b ` to create the new branch. + * **Branch name:** The branch name should be based on the issue you are addressing. 
Use the following pattern to create your new branch name: issue-number, e.g., issue-1023. +4. **Commit changes to the issue branch:** It is important to commit your changes to the issue branch. Commit messages should be descriptive of the changes being made. + * **Signing your commits:** All commits to Omnia need to be signed with the [Developer Certificate of Origin (DCO)](https://developercertificate.org/) in order to certify that the contributor has permission to contribute the code. In order to sign commits, use either the `--signoff` or `-s` option to `git commit`: + ``` + git commit --signoff + git commit -s + ``` + Make sure you have your user name and e-mail set. The `--signoff | -s` option will use the configured user name and e-mail, so it is important to configure them before the first time you commit. Check the following references: + + * [Setting up your GitHub user name](https://help.github.com/articles/setting-your-username-in-git/) + * [Setting up your e-mail address](https://help.github.com/articles/setting-your-commit-email-address-in-git/) + +5. **Push the changes to your personal repo:** To be able to create a pull request, push the changes to origin: `git push origin `. This assumes that `origin` is your personal repo, e.g., `lwilson/omnia.git`. +6. **Create a pull request:** [Create a pull request](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request) with a title following this format Issue ###: Description (_e.g., Issue 1023: Reformat testutils_). It is important that you do a good job with the description to make the job of the code reviewer easier. A good description not only reduces review time, but also reduces the probability of a misunderstanding with the pull request. + * **Important:** When preparing a pull request it is important to stay up-to-date with the project repository. We recommend that you rebase against the upstream repo _frequently_. 
To do this, use the following commands: + ``` + git pull --rebase upstream master #upstream is dellhpc/omnia + git push --force origin #origin is your fork of the repository (e.g., /omnia.git) + ``` + * **PR Description:** Be sure to fully describe the pull request. Ideally, your PR description will contain: + 1. A description of the main point (_e.g., why was this PR made?_), + 2. Linking text to the related issue (_e.g., This PR closes issue #_), + 3. How the changes solve the problem, and + 4. How to verify that the changes work correctly. + +## Omnia Branches and Contribution Flow +The diagram below describes the contribution flow. Omnia has two lifetime branches: `devel` and `master`. The `master` branch is reserved for releases and their associated tags. The `devel` branch is where all development work occurs. The `devel` branch is also the default branch for the project. + +![Omnia Branch Flowchart](docs/images/omnia-branch-structure.png "Flowchart of Omnia branches") + +## Developer Certificate of Origin +Contributions to Omnia must be signed with the [Developer Certificate of Origin (DCO)](https://developercertificate.org/): ``` -## Creating an Issue -When creating an issue, there are two important parts: title and description. The title should be succinct, but give a good idea of what the issue is about. Try to add all important keywords to make it clear to the reader. For example, if the issue is about changing the log level of some messages in the segment store, then instead of saying "Log level" say "Change log level in the segment store". The suggested way includes both the goal where in the code we are supposed to do it. +Developer Certificate of Origin +Version 1.1 -For the description, there three parts: +Copyright (C) 2004, 2006 The Linux Foundation and its contributors. +1 Letterman Drive +Suite D4700 +San Francisco, CA, 94129 -* *Problem description:* Describe what it is that we need to change. If it is a bug, describe the observed symptoms. 
If it is a new feature, describe it is supposed to be with as much detail as possible. +Everyone is permitted to copy and distribute verbatim copies of this +license document, but changing it is not allowed. -* *Problem location:* This part refers to where in the code we are supposed to make changes. For example, if it is bug in the client, then in this part say at least "Client". If you know more about it, then please add it. For example, if you that there is an issue with SegmentOutputStreamImpl, say it in this part. -* *Suggestion for an improvement:* This section is designed to let you give a suggestion for how to fix the bug described in the Problem description or how to implement the feature described in that same section. Please make an effort to separate between problem statement (Problem Description section) and solution (Suggestion for an improvement). +Developer's Certificate of Origin 1.1 -We next discuss how to create a pull request. - -## Creating a Pull Request -When creating a pull request, there are also two important parts: title and description. The title can be the same as the one of the issue, but it must be prefixed with the issue number, e.g.: -``` -Issue 724: Change log level in the segment store -``` -The description has four parts: +By making a contribution to this project, I certify that: -* __Changelog description*:__ This section should be the two or three main points about this PR. A detailed description should be left for the What the code does section. The two or three points here should be used by a committer for the merge log. -* __Purpose of the change:__ Say whether this closes an issue or perhaps is a subtask of an issue. This section should link the PR to at least one issue. -* __What the code does:__ Use this section to freely describe the changes in this PR. Make sure to give as much detail as possible to help a reviewer to do a better job understanding your changes. 
-* __How to verify it:__ For most of the PRs, the answer here will be trivial: the build must pass, system tests must pass, visual inspection, etc. This section becomes more important when the way to reproduce the issue the PR is resolving is non-trivial, like running some specific command or workload generator. +(a) The contribution was created in whole or in part by me and I + have the right to submit it under the open source license + indicated in the file; or -## Signing Your Commits -We require that developers sign off their commits to certify that they have permission to contribute the code in a pull request. This way of certifying is commonly known as the [Developer Certificate of Origin (DCO)](https://developercertificate.org/). We encourage all contributors to read the DCO text before signing a commit and making contributions. +(b) The contribution is based upon previous work that, to the best + of my knowledge, is covered under an appropriate open source + license and I have the right under that license to submit that + work with modifications, whether created in whole or in part + by me, under the same open source license (unless I am + permitted to submit under a different license), as indicated + in the file; or -To make sure that pull requests have all commits signed off, we use the [Probot DCO plugin](https://probot.github.io/apps/dco/). +(c) The contribution was provided directly to me by some other + person who certified (a), (b) or (c) and I have not modified + it. -### Signing off a commit - -#### Using the command line -To make sure that pull requests have all commits signed off, we use the Probot DCO plugin. -Use either `--signoff` or `-s` with the commit command. - -Make sure you have your user name and e-mail set. The `--signoff | -s` option will use the configured user name and e-mail, so it is important to configure it before the first time you commit. 
Check the following references: - -[Setting up your github user name](https://help.github.com/articles/setting-your-username-in-git/) - -[Setting up your e-mail address](https://help.github.com/articles/setting-your-commit-email-address-in-git/) +(d) I understand and agree that this project and the contribution + are public and that a record of the contribution (including all + personal information I submit with it, including my sign-off) is + maintained indefinitely and may be redistributed consistent with + this project or the open source license(s) involved. +``` diff --git a/LICENSE b/LICENSE index 261eeb9e9..5ecf4f511 100644 --- a/LICENSE +++ b/LICENSE @@ -186,7 +186,7 @@ same "printed page" as the copyright notice for easier identification within third-party archives. - Copyright [yyyy] [name of copyright owner] + Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. diff --git a/README.md b/README.md index 9bd9c8ffa..e25e64656 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,17 @@ -# Omnia + + +![GitHub](https://img.shields.io/github/license/dellhpc/omnia) ![GitHub issues](https://img.shields.io/github/issues-raw/dellhpc/omnia) ![GitHub release (latest by date including pre-releases)](https://img.shields.io/github/v/release/dellhpc/omnia?include_prereleases) ![GitHub last commit (branch)](https://img.shields.io/github/last-commit/dellhpc/omnia/devel) ![GitHub commits since tagged version](https://img.shields.io/github/commits-since/dellhpc/omnia/omnia-v0.2/devel) + #### Ansible playbook-based deployment of Slurm and Kubernetes on Dell EMC PowerEdge servers running an RPM-based Linux OS Omnia (Latin: all or everything) is a deployment tool to turn Dell EMC PowerEdge servers with RPM-based Linux images into a functioning Slurm/Kubernetes cluster. 
## Omnia Documentation -For Omnia documentation, including installation and contribution instructions, see [docs](docs/README.md). +For Omnia documentation, including installation and contribution instructions, please see the [website](https://dellhpc.github.io/omnia). -### Current maintainers: +## Current maintainers: * Lucas A. Wilson (Dell Technologies) * John Lockman (Dell Technologies) + +## Omnia Contributors: +Dell Technologies Universita di Pisa diff --git a/docs/CONTRIBUTORS.md b/docs/CONTRIBUTORS.md new file mode 100644 index 000000000..2124da316 --- /dev/null +++ b/docs/CONTRIBUTORS.md @@ -0,0 +1,6 @@ +# Omnia Maintainers +- Luke Wilson and John Lockman (Dell Technologies) +Dell Technologies + +# Omnia Contributors +Dell Technologies Universita di Pisa diff --git a/docs/INSTALL.md b/docs/INSTALL.md index 98c2189b4..9c6a4f482 100644 --- a/docs/INSTALL.md +++ b/docs/INSTALL.md @@ -1,7 +1,5 @@ -# Installing Omnia - -## TL;DR - +## TL;DR Installation + ### Kubernetes Install Kubernetes and all dependencies ``` @@ -12,54 +10,96 @@ Initialize K8s cluster ``` ansible-playbook -i host_inventory_file kubernetes/kubernetes.yml --tags "init" ``` + +### Install Kubeflow +``` +ansible-playbook -i host_inventory_file kubernetes/kubeflow.yaml +``` + ### Slurm ``` ansible-playbook -i host_inventory_file slurm/slurm.yml ``` -## Build/Install +# Omnia Omnia is a collection of [Ansible](https://www.ansible.com/) playbooks which perform: * Installation of [Slurm](https://slurm.schedmd.com/) and/or [Kubernetes](https://kubernetes.io/) on servers already provisioned with a standard [CentOS](https://www.centos.org/) image. * Installation of auxiliary scripts for administrator functions such as moving nodes between Slurm and Kubernetes personalities. 
-### Kubernetes - -* Add additional repositories: +Omnia playbooks perform several tasks: +`common` playbook handles installation of software +* Add yum repositories: - Kubernetes (Google) - - El Repo (nvidia drivers) - - Nvidia (nvidia-docker) + - El Repo (for Nvidia drivers) - EPEL (Extra Packages for Enterprise Linux) -* Install common packages +* Install Packages from repos: + - bash-completion + - docker - gcc - python-pip - - docker - kubelet - kubeadm - kubectl + - nfs-utils - nvidia-detect + - yum-plugin-versionlock +* Restart and enable system level services + - Docker + - Kubelet + +`computeGPU` playbook installs Nvidia drivers and nvidia-container-runtime-hook +* Add yum repositories: + - Nvidia (container runtime) +* Install Packages from repos: - kmod-nvidia - - nvidia-x11-drv - - nvidia-container-runtime - - ksonnet (CLI framework for K8S configs) -* Enable GPU Device Plugins (nvidia-container-runtime-hook) -* Modify kubeadm config to allow GPUs as schedulable resource -* Start and enable services + - nvidia-container-runtime-hook +* Restart and enable system level services - Docker - Kubelet -* Initialize Cluster +* Configuration: + - Enable GPU Device Plugins (nvidia-container-runtime-hook) + - Modify kubeadm config to allow GPUs as schedulable resource +* Restart and enable system level services + - Docker + - Kubelet + +`master` playbook +* Install Helm v3 +* (optional) add firewall rules for Slurm and kubernetes + +Everything from this point on can be called by using the `init` tag +``` +ansible-playbook -i host_inventory_file kubernetes/kubernetes.yml --tags "init" +``` + +`startmaster` playbook +* turn off swap +*Initialize Kubernetes * Head/master - Start K8S pass startup token to compute/slaves - - Initialize networking (Currently using WeaveNet) - - Setup K8S Dashboard - - Create dynamic/persistent volumes - * Compute/slaves - - Join k8s cluster + - Initialize software defined networking (Calico) + +`startworkers` playbook +* turn off swap 
+* Join k8s cluster + +`startservices` playbook +* Setup K8S Dashboard +* Add `stable` repo to helm +* Add `jupyterhub` repo to helm +* Update helm repos +* Deploy NFS client Provisioner +* Deploy Jupyterhub +* Deploy Prometheus +* Install MPI Operator + ### Slurm -* Download and build Slurm source -* Install necessary dependencies +* Download and build Slurm from source +* Install package dependencies - Python3 - munge - MariaDB - MariaDB development libraries * Build Slurm configuration files + diff --git a/docs/PREINSTALL.md b/docs/PREINSTALL.md index 0f3a74ccc..b8b609bef 100644 --- a/docs/PREINSTALL.md +++ b/docs/PREINSTALL.md @@ -5,7 +5,7 @@ Omnia assumes that prior to installation: * Systems have a base operating system (currently CentOS 7 or 8) * Network(s) has been cabled and nodes can reach the internet * SSH Keys for `root` have been installed on all nodes to allow for password-less SSH -* Ansible is installed on the master node +* Ansible is installed on either the master node or a separate deployment node ``` yum install ansible ``` diff --git a/docs/README.md b/docs/README.md index 2b28001a3..fc24b2282 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,5 +1,22 @@ -# Omnia Documentation -Omnia (Latin: all or everything) is a deployment tool to turn Dell EMC PowerEdge servers with standard RPM-based Linux OS images into a functioning Slurm/Kubernetes cluster. Omnia is a collection of [Ansible](https://ansible.org) playbooks for installing and configuring Slurm or Kubernetes on an inventory of servers, along with additional software packages and services. +**Omnia** (Latin: all or everything) is a deployment tool to configure Dell EMC PowerEdge servers running standard RPM-based Linux OS images into a cluster capable of supporting HPC, AI, and data analytics workloads. Omnia installs Slurm and/or Kubernetes for managing jobs and enables installation of many other packages and services for running diverse workloads on the same converged solution. 
Omnia is a collection of [Ansible](https://ansible.org) playbooks, is open source, and is constantly being extended to enable comprehensive workloads. + +## What Omnia Does +Omnia can build clusters which use Slurm or Kubernetes (or both!) for workload management. Omnia will install software from a variety of sources, including: +- Standard CentOS and [ELRepo](http://elrepo.org) repositories +- Helm repositories +- Source code compilation +- [OpenHPC](https://openhpc.community) repositories (_coming soon!_) +- [OperatorHub](https://operatorhub.io) (_coming soon!_) + +Whenever possible, Omnia will opt to leverage existing projects rather than reinvent the wheel. + +![Omnia draws from existing repositories](images/omnia-overview.png) + +### Omnia Stacks +Omnia can install Kubernetes or Slurm (or both), along with additional drivers, services, libraries, and user applications. +![Omnia Kubernetes Stack](images/omnia-k8s.png) + +![Omnia Slurm Stack](images/omnia-slurm.png) ## Installing Omnia Omnia requires that servers already have an RPM-based Linux OS running on them, and are all connected to the Internet. Currently all Omnia testing is done on [CentOS](https://centos.org). Please see [PREINSTALL](PREINSTALL.md) for instructions on network setup. @@ -7,7 +24,7 @@ Omnia requires that servers already have an RPM-based Linux OS running on them, Once servers have functioning OS and networking, you can using Omnia to install and start Slurm and/or Kubernetes. Please see [INSTALL](INSTALL.md) for instructions. ## Contributing to Omnia -The Omnia project was started to give members of the [Dell Technologies HPC Community](https://dellhpc.org) a way to easily setup clusters of Dell EMC servers, but to contribute useful tools, fixes, and functionality back to the HPC Community. 
+The Omnia project was started to give members of the [Dell Technologies HPC Community](https://dellhpc.org) a way to easily setup clusters of Dell EMC servers, and to contribute useful tools, fixes, and functionality back to the HPC Community. ### Open to All While we started Omnia within the Dell Technologies HPC Community, that doesn't mean that it's limited to Dell EMC servers, networking, and storage. This is an open project, and we want to encourage *everyone* to use and contribute to Omnia! @@ -21,4 +38,6 @@ It's not just new features and bug fixes that can be contributed to the Omnia pr * Feedback * Validation that it works for your particular configuration -If you would like to contribute, see [CONTRIBUTING](../CONTRIBUTING.md). +If you would like to contribute, see [CONTRIBUTING](https://github.com/dellhpc/omnia/blob/master/CONTRIBUTING.md). + +### [Omnia Contributors](CONTRIBUTORS.md) diff --git a/docs/_config.yml b/docs/_config.yml index 2f7efbeab..367390b58 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -1 +1,4 @@ -theme: jekyll-theme-minimal \ No newline at end of file +theme: jekyll-theme-minimal +title: Omnia +description: Ansible playbook-based tools for deploying Slurm and Kubernetes clusters for High Performance Computing, Machine Learning, Deep Learning, and High-Performance Data Analytics +logo: images/omnia-logo.png diff --git a/docs/images/delltech.jpg b/docs/images/delltech.jpg new file mode 100644 index 000000000..d1050c330 Binary files /dev/null and b/docs/images/delltech.jpg differ diff --git a/docs/images/omnia-branch-structure.png b/docs/images/omnia-branch-structure.png new file mode 100644 index 000000000..379725a15 Binary files /dev/null and b/docs/images/omnia-branch-structure.png differ diff --git a/docs/images/omnia-k8s.png b/docs/images/omnia-k8s.png new file mode 100644 index 000000000..741114901 Binary files /dev/null and b/docs/images/omnia-k8s.png differ diff --git a/docs/images/omnia-logo.png 
b/docs/images/omnia-logo.png new file mode 100644 index 000000000..51d6b273e Binary files /dev/null and b/docs/images/omnia-logo.png differ diff --git a/docs/images/omnia-overview.png b/docs/images/omnia-overview.png new file mode 100644 index 000000000..6f244bbf5 Binary files /dev/null and b/docs/images/omnia-overview.png differ diff --git a/docs/images/omnia-slurm.png b/docs/images/omnia-slurm.png new file mode 100644 index 000000000..a0ce9f84f Binary files /dev/null and b/docs/images/omnia-slurm.png differ diff --git a/docs/images/pisa.png b/docs/images/pisa.png new file mode 100644 index 000000000..58615a638 Binary files /dev/null and b/docs/images/pisa.png differ diff --git a/docs/metalLB/README.md b/docs/metalLB/README.md new file mode 100644 index 000000000..f4ef3e154 --- /dev/null +++ b/docs/metalLB/README.md @@ -0,0 +1,10 @@ +# MetalLB + +MetalLB is a load-balancer implementation for bare metal Kubernetes clusters, using standard routing protocols. +https://metallb.universe.tf/ + +Omnia installs MetalLB by manifest in the playbook `startservices`. A default configuration is provided for the layer2 protocol, along with an example address pool. 
Modify metal-config.yaml to suit your network requirements and apply the changes with: + +``` +kubectl apply -f metal-config.yaml +``` diff --git a/docs/metalLB/metal-config.yaml b/docs/metalLB/metal-config.yaml new file mode 100644 index 000000000..0a3b38368 --- /dev/null +++ b/docs/metalLB/metal-config.yaml @@ -0,0 +1,21 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + namespace: metallb-system + name: config +data: + config: | + address-pools: + - name: default + protocol: layer2 + addresses: + - 192.168.2.150/32 + - 192.168.2.151/32 + - 192.168.2.152/32 + - 192.168.2.153/32 + - 192.168.2.154/32 + - 192.168.2.155/32 + - 192.168.2.156/32 + - 192.168.2.157/32 + - 192.168.2.158/32 + - 192.168.2.159/32 diff --git a/examples/PyTorch/pytorch-deploy.yaml b/examples/PyTorch/pytorch-deploy.yaml new file mode 100644 index 000000000..a0f1d39a1 --- /dev/null +++ b/examples/PyTorch/pytorch-deploy.yaml @@ -0,0 +1,20 @@ +apiVersion: batch/v1 +kind: Job +metadata: + name: pytorch-cpu-simple + namespace: default +spec: + template: + spec: + containers: + - name: cpu-pytorch + image: docker.io/mapler/pytorch-cpu:latest + volumeMounts: + - mountPath: /pyscript + name: torch-job-volume + command: ["bash","-c","python /pyscript/pytorchcpu-example.py"] + restartPolicy: Never + volumes: + - name: torch-job-volume + hostPath: + path: /home/k8s/torch-example diff --git a/examples/PyTorch/pytorch-example.py b/examples/PyTorch/pytorch-example.py new file mode 100644 index 000000000..e3658177e --- /dev/null +++ b/examples/PyTorch/pytorch-example.py @@ -0,0 +1,54 @@ +import random +import torch + +class DynamicNet(torch.nn.Module): + def __init__(self, D_in, H, D_out): + """ + In the constructor we construct three nn.Linear instances that we will use + in the forward pass. 
+ """ + super(DynamicNet, self).__init__() + self.input_linear = torch.nn.Linear(D_in, H) + self.middle_linear = torch.nn.Linear(H, H) + self.output_linear = torch.nn.Linear(H, D_out) + def forward(self, x): + """ + For the forward pass of the model, we randomly choose either 0, 1, 2, or 3 + and reuse the middle_linear Module that many times to compute hidden layer + representations. + Since each forward pass builds a dynamic computation graph, we can use normal + Python control-flow operators like loops or conditional statements when + defining the forward pass of the model. + Here we also see that it is perfectly safe to reuse the same Module many + times when defining a computational graph. This is a big improvement from Lua + Torch, where each Module could be used only once. + """ + h_relu = self.input_linear(x).clamp(min=0) + for _ in range(random.randint(0, 3)): + h_relu = self.middle_linear(h_relu).clamp(min=0) + y_pred = self.output_linear(h_relu) + return y_pred + +# N is batch size; D_in is input dimension; +# H is hidden dimension; D_out is output dimension. +N, D_in, H, D_out = 64, 1000, 100, 10 +# Create random Tensors to hold inputs and outputs +x = torch.randn(N, D_in) +y = torch.randn(N, D_out) +# Construct our model by instantiating the class defined above +model = DynamicNet(D_in, H, D_out) +# Construct our loss function and an Optimizer. Training this strange model with +# vanilla stochastic gradient descent is tough, so we use momentum +criterion = torch.nn.MSELoss(reduction='sum') +optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9) +for t in range(500): + # Forward pass: Compute predicted y by passing x to the model + y_pred = model(x) + # Compute and print loss + loss = criterion(y_pred, y) + if t % 100 == 99: + print(t, loss.item()) + # Zero gradients, perform a backward pass, and update the weights. 
+ optimizer.zero_grad() + loss.backward() + optimizer.step() diff --git a/examples/k8s-tensorflow-nvidia-ngc-resnet50-multinode-mpioperator.yaml b/examples/k8s-tensorflow-nvidia-ngc-resnet50-multinode-mpioperator.yaml new file mode 100644 index 000000000..8edd702b6 --- /dev/null +++ b/examples/k8s-tensorflow-nvidia-ngc-resnet50-multinode-mpioperator.yaml @@ -0,0 +1,69 @@ +apiVersion: kubeflow.org/v1alpha2 +kind: MPIJob +metadata: + name: tensorflow-benchmarks +spec: + slotsPerWorker: 4 + cleanPodPolicy: Running + mpiReplicaSpecs: + Launcher: + replicas: 1 + template: + spec: + containers: + - image: nvcr.io/nvidia/tensorflow:19.06-py3 + imagePullPolicy: IfNotPresent + name: tensorflow-benchmarks + volumeMounts: + - mountPath: /local_mount + name: work-volume + command: + - mpirun + - --allow-run-as-root + - -np + - "4" + - -bind-to + - none + - -map-by + #- slot + - numa + - -x + - NCCL_DEBUG=INFO + - -x + - LD_LIBRARY_PATH + - python + - /local_mount/tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py + - --batch_size=512 + - --model=resnet50 + - --variable_update=horovod + - --optimizer=momentum + - --nodistortions + - --gradient_repacking=8 + - --weight_decay=1e-4 + - --use_fp16=true + volumes: + - name: work-volume + hostPath: + # directory locally mounted on host + path: /work + type: Directory + Worker: + replicas: 1 + template: + spec: + containers: + - image: nvcr.io/nvidia/tensorflow:19.06-py3 + imagePullPolicy: IfNotPresent + name: tensorflow-benchmarks + resources: + limits: + nvidia.com/gpu: 4 + volumeMounts: + - mountPath: /local_mount + name: work-volume + volumes: + - name: work-volume + hostPath: + # directory locally mounted on host + path: /work + type: Directory diff --git a/kubernetes/host_inventory_file b/kubernetes/host_inventory_file index 569ac7c1b..607fe3e2f 100644 --- a/kubernetes/host_inventory_file +++ b/kubernetes/host_inventory_file @@ -1,20 +1,20 @@ -[master] -friday - -[compute] -compute000 -compute[002:005] - 
-[gpus] -#compute001 -compute002 -compute004 -compute005 - -[workers:children] -compute -gpus - -[cluster:children] -master -workers +all: + children: + cluster: + children: + master: + hosts: + compute000: + workers: + children: + compute: + hosts: + compute003: + gpus: + hosts: + compute002: + compute004: + compute005: + vars: + single_node: false + master_ip: 10.0.0.100 diff --git a/kubernetes/jupyterhub.yaml b/kubernetes/jupyterhub.yaml new file mode 100644 index 000000000..161bf20cc --- /dev/null +++ b/kubernetes/jupyterhub.yaml @@ -0,0 +1,22 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +--- +#Playbook for installing JupyterHub v1.1.0 in Omnia + +# Apply the jupyterhub role on the master node +- hosts: master + gather_facts: false + roles: + - jupyterhub diff --git a/kubernetes/kubeflow.yaml b/kubernetes/kubeflow.yaml new file mode 100644 index 000000000..abda4bc1d --- /dev/null +++ b/kubernetes/kubeflow.yaml @@ -0,0 +1,22 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +--- +#Playbook for installing Kubeflow v1.0 on Omnia + +# Apply the kubeflow role on the master node +- hosts: master + gather_facts: false + roles: + - kubeflow diff --git a/kubernetes/build-kubernetes-cluster.yml b/kubernetes/kubernetes.yml similarity index 50% rename from kubernetes/build-kubernetes-cluster.yml rename to kubernetes/kubernetes.yml index b5177f12c..8814c2ff9 100644 --- a/kubernetes/build-kubernetes-cluster.yml +++ b/kubernetes/kubernetes.yml @@ -1,3 +1,17 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + --- #Playbook for kubernetes cluster diff --git a/kubernetes/roles/common/tasks/main.yml b/kubernetes/roles/common/tasks/main.yml index 89e76c676..f2df1d649 100644 --- a/kubernetes/roles/common/tasks/main.yml +++ b/kubernetes/roles/common/tasks/main.yml @@ -1,3 +1,17 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + --- - name: add kubernetes repo @@ -68,20 +82,6 @@ name: "@Infiniband Support" state: present -- name: Install KSonnet - unarchive: - src: https://github.com/ksonnet/ksonnet/releases/download/v0.13.1/ks_0.13.1_linux_amd64.tar.gz - dest: /usr/bin/ - extra_opts: [--strip-components=1] - remote_src: yes - exclude: - - ks_0.11.0_linux_amd64/CHANGELOG.md - - ks_0.11.0_linux_amd64/CODE-OF-CONDUCT.md - - ks_0.11.0_linux_amd64/CONTRIBUTING.md - - ks_0.11.0_linux_amd64/LICENSE - - ks_0.11.0_linux_amd64/README.md - tags: install - - name: upgrade pip command: /bin/pip install --upgrade pip tags: install @@ -128,7 +128,7 @@ - name: Start and nfs-lock service service: name: nfs-lock - state: restarted + #state: restarted enabled: yes tags: install diff --git a/kubernetes/roles/computeGPU/tasks/main.yml b/kubernetes/roles/computeGPU/tasks/main.yml index fce5f562d..522385d72 100644 --- a/kubernetes/roles/computeGPU/tasks/main.yml +++ b/kubernetes/roles/computeGPU/tasks/main.yml @@ -1,3 +1,17 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + --- - name: install Nvidia driver yum: diff --git a/kubernetes/roles/jupyterhub/files/jupyter_config.yaml b/kubernetes/roles/jupyterhub/files/jupyter_config.yaml new file mode 100644 index 000000000..675208581 --- /dev/null +++ b/kubernetes/roles/jupyterhub/files/jupyter_config.yaml @@ -0,0 +1,42 @@ +proxy: + secretToken: "1c8572f630701e8792bede122ec9c4179d9087f801e1a85ed32cce69887aec1b" + +hub: + cookieSecret: "1c8572f630701e8792bede122ec9c4179d9087f801e1a85ed32cce69887aec1b" + service: + type: LoadBalancer + db: + type: sqlite-pvc + extraConfig: + jupyterlab: | + c.Spawner.cmd = ['jupyter-labhub'] + +singleuser: + image: + name: dellhpc/datasciencelab-base + tag: "1.0" + profileList: + - display_name: "DellHPC Improved Environment" + description: "Dell curated Jupyter Stacks" + kubespawner_override: + image: "dellhpc/datasciencelab-cpu:1.0" + - display_name: "DellHPC GPU Environment" + description: "Dell curated Jupyter Stacks 1 GPU" + kubespawner_override: + image: "dellhpc/datasciencelab-gpu:1.0" + extra_resource_limits: + nvidia.com/gpu: "1" + storage: + dynamic: + storageClass: nfs-client + cpu: + limit: 1 + memory: + limit: 5G + guarantee: 1G + defaultUrl: "/lab" + + +prePuller: + continuous: + enabled: true diff --git a/kubernetes/roles/jupyterhub/tasks/main.yml b/kubernetes/roles/jupyterhub/tasks/main.yml new file mode 100644 index 000000000..5f6949f1b --- /dev/null +++ b/kubernetes/roles/jupyterhub/tasks/main.yml @@ -0,0 +1,26 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +--- +- name: Helm - Add JupyterHub Repo + shell: helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/ + +- name: Helm - Update Repo + shell: helm repo update + +- name: JupyterHub Custom Config (files) + copy: src=jupyter_config.yaml dest=/root/k8s/jupyter_config.yaml owner=root group=root mode=655 + +- name: jupyterHub deploy + shell: helm install jupyterhub/jupyterhub --namespace default --version 0.9.0 --values /root/k8s/jupyter_config.yaml --generate-name --wait --timeout 60m diff --git a/kubernetes/roles/kubeflow/tasks/main.yml b/kubernetes/roles/kubeflow/tasks/main.yml new file mode 100644 index 000000000..75020caf5 --- /dev/null +++ b/kubernetes/roles/kubeflow/tasks/main.yml @@ -0,0 +1,122 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +--- + +#Configure build and deploy kubeflow v1.0 + +- name: Download kfctl v1.0.2 release from the Kubeflow releases page. 
+ unarchive: + src: https://github.com/kubeflow/kfctl/releases/download/v1.0.2/kfctl_v1.0.2-0-ga476281_linux.tar.gz + dest: /usr/bin/ + remote_src: yes + +- name: Delete Omnia Kubeflow Directory if exists + file: + path: /root/k8s/omnia-kubeflow + state: absent + +- name: Create Kubeflow Directory + file: + path: /root/k8s/omnia-kubeflow + state: directory + recurse: yes + +- name: Build Kubeflow Configuration + shell: + cmd: /usr/bin/kfctl build -V -f https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.2.yaml + chdir: /root/k8s/omnia-kubeflow + +- name: Modify Cpu Limit for istio-ingressgateway-service-account + replace: + path: /root/k8s/omnia-kubeflow/kustomize/istio-install/base/istio-noauth.yaml + after: 'serviceAccountName: istio-ingressgateway-service-account' + before: '---' + regexp: 'cpu: 100m' + replace: 'cpu: 2' + +- name: Modify Mem Limit for istio-ingressgateway-service-account + replace: + path: /root/k8s/omnia-kubeflow/kustomize/istio-install/base/istio-noauth.yaml + after: 'serviceAccountName: istio-ingressgateway-service-account' + before: '---' + regexp: 'memory: 128Mi' + replace: 'memory: 512Mi' + +- name: Modify Cpu Request for istio-ingressgateway-service-account + replace: + path: /root/k8s/omnia-kubeflow/kustomize/istio-install/base/istio-noauth.yaml + after: 'serviceAccountName: istio-ingressgateway-service-account' + before: '---' + regexp: 'cpu: 10m' + replace: 'cpu: 1' + +- name: Modify Mem Request for istio-ingressgateway-service-account + replace: + path: /root/k8s/omnia-kubeflow/kustomize/istio-install/base/istio-noauth.yaml + after: 'serviceAccountName: istio-ingressgateway-service-account' + before: '---' + regexp: 'memory: 40Mi' + replace: 'memory: 256Mi' + + +- name: Modify Cpu Limit for kfserving-gateway + replace: + path: /root/k8s/omnia-kubeflow/kustomize/kfserving-gateway/base/deployment.yaml + after: 'serviceAccountName: istio-ingressgateway-service-account' + before: 'env:' + regexp: 
'cpu: 100m' + replace: 'cpu: 2' + +- name: Modify Mem Limit for kfserving-gateway + replace: + path: /root/k8s/omnia-kubeflow/kustomize/kfserving-gateway/base/deployment.yaml + after: 'serviceAccountName: istio-ingressgateway-service-account' + before: 'env:' + regexp: 'memory: 128Mi' + replace: 'memory: 512Mi' + +- name: Modify Cpu Request for kfserving-gateway + replace: + path: /root/k8s/omnia-kubeflow/kustomize/kfserving-gateway/base/deployment.yaml + after: 'serviceAccountName: istio-ingressgateway-service-account' + before: 'env:' + regexp: 'cpu: 10m' + replace: 'cpu: 1' + +- name: Modify Mem Request for kfserving-gateway + replace: + path: /root/k8s/omnia-kubeflow/kustomize/kfserving-gateway/base/deployment.yaml + after: 'serviceAccountName: istio-ingressgateway-service-account' + before: 'env:' + regexp: 'memory: 40Mi' + replace: 'memory: 256Mi' + + +- name: Change Argo base service from NodePort to LoadBalancer + replace: + path: /root/k8s/omnia-kubeflow/kustomize/argo/base/service.yaml + regexp: 'NodePort' + replace: 'LoadBalancer' + +- name: Change istio-install base istio-noauth service from NodePort to LoadBalancer + replace: + path: /root/k8s/omnia-kubeflow/kustomize/istio-install/base/istio-noauth.yaml + regexp: 'NodePort' + replace: 'LoadBalancer' + +- name: Apply Kubeflow Configuration + shell: + cmd: /usr/bin/kfctl apply -V -f /root/k8s/omnia-kubeflow/kfctl_k8s_istio.v1.0.2.yaml + chdir: /root/k8s/omnia-kubeflow diff --git a/kubernetes/roles/master/tasks/main.yml b/kubernetes/roles/master/tasks/main.yml index 5f098bb4b..9d461c0f8 100644 --- a/kubernetes/roles/master/tasks/main.yml +++ b/kubernetes/roles/master/tasks/main.yml @@ -1,16 +1,30 @@ ---- -- name: Firewall Rule K8s:6443/tcp - command: firewall-cmd --zone=internal --add-port=6443/tcp --permanent - tags: master - -- name: Firewall Rule K8s:10250/tcp - command: firewall-cmd --zone=internal --add-port=10250/tcp --permanent - tags: master - -- name: Firewall Reload - command: firewall-cmd 
--reload - tags: master +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +--- +#- name: Firewall Rule K8s:6443/tcp + #command: firewall-cmd --zone=internal --add-port=6443/tcp --permanent + #tags: master +# +#- name: Firewall Rule K8s:10250/tcp + #command: firewall-cmd --zone=internal --add-port=10250/tcp --permanent + #tags: master +## +#- name: Firewall Reload + #command: firewall-cmd --reload + #tags: master +# - name: Create /root/bin (if it doesn't exist) file: path: /root/bin diff --git a/kubernetes/roles/startmaster/tasks/main.yml b/kubernetes/roles/startmaster/tasks/main.yml index ca7d65b74..202108888 100644 --- a/kubernetes/roles/startmaster/tasks/main.yml +++ b/kubernetes/roles/startmaster/tasks/main.yml @@ -1,10 +1,24 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ --- - name: Turn Swap OFF (if not already disabled) command: /usr/sbin/swapoff -a tags: init - name: Initialize kubeadm - command: /bin/kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=10.0.0.1 + command: /bin/kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address={{ master_ip }} #command: /bin/kubeadm init register: init_output tags: init @@ -14,7 +28,13 @@ tags: init - name: Copy Kubernetes Config for root #do this for other users too? - copy: src=/etc/kubernetes/admin.conf dest=/root/.kube/config owner=root group=root mode=644 + copy: + src: /etc/kubernetes/admin.conf + dest: /root/.kube/config + owner: root + group: root + mode: "0644" + remote_src: yes tags: init - name: Cluster token @@ -32,8 +52,7 @@ name: "K8S_TOKEN_HOLDER" token: "{{ K8S_TOKEN.stdout }}" hash: "{{ K8S_MASTER_CA_HASH.stdout }}" - #ip: "{{ ansible_ib0.ipv4.address }}" - ip: "{{ ansible_p3p1.ipv4.address }}" + ip: "{{ master_ip }}" tags: init - name: @@ -48,7 +67,7 @@ - name: debug: - msg: "[Master] K8S_MASTER_IP is {{ hostvars['K8S_TOKEN_HOLDER']['ip'] }}" + msg: "[Master] K8S_MASTER_IP is {{ master_ip }}" tags: init - name: Setup Calico SDN network @@ -65,6 +84,11 @@ register: gpu_enable tags: init +- name: Deploy Xilinx Device Plugin + shell: kubectl create -f https://raw.githubusercontent.com/Xilinx/FPGA_as_a_Service/master/k8s-fpga-device-plugin/fpga-device-plugin.yml + register: fpga_enable + tags: init + - name: Create yaml repo for setup file: path: /root/k8s @@ -91,6 +115,12 @@ shell: kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep admin-user | awk '{print $1}') > /root/k8s/token tags: init +- name: Edge / Workstation Install allows pods to schedule on master + shell: kubectl taint nodes --all node-role.kubernetes.io/master- + when: single_node + tags: init + + # If more debug information is needed during init uncomment the following 2 lines #- debug: var=init_output.stdout_lines #tags: init diff --git
a/kubernetes/roles/startservices/files/jhub-db-pv.yaml b/kubernetes/roles/startservices/files/jhub-db-pv.yaml deleted file mode 100755 index e910f2bbb..000000000 --- a/kubernetes/roles/startservices/files/jhub-db-pv.yaml +++ /dev/null @@ -1,16 +0,0 @@ -apiVersion: v1 -kind: PersistentVolume -metadata: - name: jupyterhub-db-pv -spec: - capacity: - storage: 1Gi - accessModes: - - ReadWriteOnce - - ReadOnlyMany - - ReadWriteMany - nfs: - server: 10.0.0.1 - path: /work/k8s/jhub-db - persistentVolumeReclaimPolicy: Recycle - diff --git a/kubernetes/roles/startservices/files/jupyter-pvc.yaml b/kubernetes/roles/startservices/files/jupyter-pvc.yaml deleted file mode 100644 index f950425ce..000000000 --- a/kubernetes/roles/startservices/files/jupyter-pvc.yaml +++ /dev/null @@ -1,50 +0,0 @@ -apiVersion: v1 -kind: PersistentVolume -metadata: - name: jupyter-nfs -spec: - capacity: - storage: 1Gi - accessModes: - - ReadWriteMany - nfs: - server: 10.0.0.1 - path: "/work/jupyter1" - ---- -apiVersion: v1 -kind: PersistentVolume -metadata: - name: jupyter-hub-nfs -spec: - capacity: - storage: 1Gi - accessModes: - - ReadWriteMany - nfs: - server: 10.0.0.1 - path: "/work/jupyter2" - ---- -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: jupyter-nfs-pvc -spec: - accessModes: - - ReadWriteMany - storageClassName: "nfs" - resources: - requests: - ---- -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: jupyter-hub-nfs-pvc -spec: - accessModes: - - ReadWriteMany - storageClassName: "nfs" - resources: - requests: diff --git a/kubernetes/roles/startservices/files/jupyter_config.yaml b/kubernetes/roles/startservices/files/jupyter_config.yaml deleted file mode 100644 index de3cf2090..000000000 --- a/kubernetes/roles/startservices/files/jupyter_config.yaml +++ /dev/null @@ -1,62 +0,0 @@ -proxy: - secretToken: "1c8572f630701e8792bede122ec9c4179d9087f801e1a85ed32cce69887aec1b" - -hub: - cookieSecret: "1c8572f630701e8792bede122ec9c4179d9087f801e1a85ed32cce69887aec1b" - 
service: - type: LoadBalancer - db: - type: sqlite-pvc - extraConfig: - jupyterlab: | - c.Spawner.cmd = ['jupyter-labhub'] - -singleuser: - image: - name: jupyter/minimal-notebook - tag: 2343e33dec46 - profileList: - - display_name: "Minimal environment" - description: "Short and sweet, no bells or whistles, vanilla: Python." - default: true - - display_name: "Datascience environment" - description: "Some additional bells and whistles: Python, R, and Julia." - kubespawner_override: - image: jupyter/datascience-notebook:2343e33dec46 - - display_name: "Spark environment" - description: "The Jupyter Stacks with Spark" - kubespawner_override: - image: jupyter/all-spark-notebook:2343e33dec46 - - display_name: "Learning Data Science" - description: "Datascience Environment with Sample Notebooks" - kubespawner_override: - image: jupyter/datascience-notebook:2343e33dec46 - lifecycle_hooks: - postStart: - exec: - command: - - "sh" - - "-c" - - > - gitpuller https://github.com/data-8/materials-fa17 master materials-fa; - - display_name: "GPU Environment" - description: "1 GPU for intro folks" - kubespawner_override: - image: jupyter/datascience-notebook:2343e33dec46 - extra_resource_limits: - nvidia.com/gpu: "1" - storage: - dynamic: - storageClass: nfs-client - cpu: - limit: 1 - memory: - limit: 100G - guarantee: 1G - defaultUrl: "/lab" - - -prePuller: - continuous: - enabled: true - diff --git a/kubernetes/roles/startservices/files/metal-config.yaml b/kubernetes/roles/startservices/files/metal-config.yaml index a1350933e..0a3b38368 100644 --- a/kubernetes/roles/startservices/files/metal-config.yaml +++ b/kubernetes/roles/startservices/files/metal-config.yaml @@ -9,13 +9,13 @@ data: - name: default protocol: layer2 addresses: - - 10.0.0.150/32 - - 10.0.0.151/32 - - 10.0.0.152/32 - - 10.0.0.153/32 - - 10.0.0.154/32 - - 10.0.0.155/32 - - 10.0.0.156/32 - - 10.0.0.157/32 - - 10.0.0.158/32 - - 10.0.0.159/32 + - 192.168.2.150/32 + - 192.168.2.151/32 + - 192.168.2.152/32 + - 
192.168.2.153/32 + - 192.168.2.154/32 + - 192.168.2.155/32 + - 192.168.2.156/32 + - 192.168.2.157/32 + - 192.168.2.158/32 + - 192.168.2.159/32 diff --git a/kubernetes/roles/startservices/tasks/main.yml b/kubernetes/roles/startservices/tasks/main.yml index b9969e328..c605f2cd8 100644 --- a/kubernetes/roles/startservices/tasks/main.yml +++ b/kubernetes/roles/startservices/tasks/main.yml @@ -1,3 +1,17 @@ +# Copyright 2020 Dell Technologies +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + --- #- name: Kick CoreDNS (this is a hack that needs to be fixed) #shell: kubectl get pods -n kube-system --no-headers=true | awk '/coredns/{print $1}'|xargs kubectl delete -n kube-system pod @@ -27,58 +41,30 @@ shell: kubectl apply -f /root/k8s/metal-config.yaml tags: init -#- name: Helm - create service account - #shell: kubectl create serviceaccount --namespace kube-system tiller - #tags: init - -#- name: Helm - create clusterRole Binding for tiller-cluster-rule - #shell: kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller - #tags: init - -#- name: Helm - create clusterRoleBinding for admin - #shell: kubectl create clusterrolebinding tiller-cluster-admin --clusterrole=cluster-admin --serviceaccount=kube-system:tiller - #tags: init - -#- name: Helm - init - #shell: helm init --upgrade - #tags: init - -#- name: Wait for tiller to start - #shell: kubectl rollout status deployment/tiller-deploy -n kube-system - #tags: init 
- -#- name: Helm - patch cluster Role Binding for tiller - #shell: kubectl --namespace kube-system patch deploy tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}' - #tags: init - -#- name: Wait for tiller to start - #shell: kubectl rollout status deployment/tiller-deploy -n kube-system - #tags: init - - name: Start K8S Dashboard shell: kubectl create -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-beta6/aio/deploy/recommended.yaml tags: init -- name: Start NFS Client Provisioner - shell: helm install stable/nfs-client-provisioner --set nfs.server=10.0.0.1 --set nfs.path=/work --generate-name +- name: Helm - Add Stable Repo + shell: helm repo add stable https://kubernetes-charts.storage.googleapis.com/ tags: init -- name: JupyterHub Persistent Volume Creation (files) - copy: src=jhub-db-pv.yaml dest=/root/k8s/jhub-db-pv.yaml owner=root group=root mode=655 +- name: Helm - Update Repo + shell: helm repo update tags: init -- name: jupyterHub Persistent Volume creation - shell: kubectl create -f /root/k8s/jhub-db-pv.yaml +- name: Start NFS Client Provisioner + shell: helm install stable/nfs-client-provisioner --set nfs.server=10.0.0.1 --set nfs.path=/work --generate-name tags: init -- name: JupyterHub Custom Config (files) - copy: src=jupyter_config.yaml dest=/root/k8s/jupyter_config.yaml owner=root group=root mode=655 - tags: init - -- name: jupyterHub deploy - shell: helm install jupyterhub/jupyterhub --namespace default --version 0.8.2 --values /root/k8s/jupyter_config.yaml --generate-name +- name: Set NFS-Client Provisioner as DEFAULT StorageClass + shell: "kubectl patch storageclasses.storage.k8s.io nfs-client -p '{\"metadata\": {\"annotations\":{\"storageclass.kubernetes.io/is-default-class\":\"true\"}}}'" tags: init - name: Prometheus deployment shell: helm install stable/prometheus --set 
alertmanager.persistentVolume.storageClass=nfs-client,server.persistentVolume.storageClass=nfs-client,server.service.type=LoadBalancer --generate-name tags: init + +- name: Install MPI Operator + shell: kubectl create -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v1alpha2/mpi-operator.yaml + tags: init diff --git a/kubernetes/roles/startworkers/tasks/main.yml b/kubernetes/roles/startworkers/tasks/main.yml index b74f0167b..41d15626c 100644 --- a/kubernetes/roles/startworkers/tasks/main.yml +++ b/kubernetes/roles/startworkers/tasks/main.yml @@ -1,3 +1,17 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + --- - name: Turn Swap OFF (if not already disabled) @@ -24,6 +38,7 @@ kubeadm join --token={{ hostvars['K8S_TOKEN_HOLDER']['token'] }} --discovery-token-ca-cert-hash sha256:{{ hostvars['K8S_TOKEN_HOLDER']['hash'] }} {{ hostvars['K8S_TOKEN_HOLDER']['ip'] }}:6443 + when: not single_node tags: init diff --git a/kubernetes/scuttle b/kubernetes/scuttle index 6ce220fff..8731efe85 100755 --- a/kubernetes/scuttle +++ b/kubernetes/scuttle @@ -1,3 +1,17 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + #!/bin/bash kubeadm reset -f diff --git a/slurm/roles/slurm-common/tasks/main.yaml b/slurm/roles/slurm-common/tasks/main.yaml index 3d86a1c2d..82d1726f2 100644 --- a/slurm/roles/slurm-common/tasks/main.yaml +++ b/slurm/roles/slurm-common/tasks/main.yaml @@ -1,3 +1,16 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. --- - name: install packages for slurm diff --git a/slurm/roles/slurm-master/tasks/main.yaml b/slurm/roles/slurm-master/tasks/main.yaml index 86ed144d0..2f4af3d71 100644 --- a/slurm/roles/slurm-master/tasks/main.yaml +++ b/slurm/roles/slurm-master/tasks/main.yaml @@ -1,3 +1,17 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + --- - name: Download Slurm source diff --git a/slurm/roles/start-slurm-workers/tasks/main.yml b/slurm/roles/start-slurm-workers/tasks/main.yml index da5719575..0e929178c 100644 --- a/slurm/roles/start-slurm-workers/tasks/main.yml +++ b/slurm/roles/start-slurm-workers/tasks/main.yml @@ -1,3 +1,17 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ --- - name: Install SLURM RPMs on compute yum: diff --git a/slurm/slurm-cluster.yaml b/slurm/slurm-cluster.yaml deleted file mode 100644 index 75a206abc..000000000 --- a/slurm/slurm-cluster.yaml +++ /dev/null @@ -1,23 +0,0 @@ ---- -#Playbook for installing Slurm on a cluster - -#collect info from everything -- hosts: all - -# Apply Common Installation and Config -- hosts: cluster - gather_facts: false - roles: - - slurm-common - -# Apply Master Config, start services -- hosts: master - gather_facts: false - roles: - - slurm-master - -# Start SLURM workers -- hosts: compute - gather_facts: false - roles: - - start-slurm-workers diff --git a/slurm/slurm.yml b/slurm/slurm.yml new file mode 100644 index 000000000..a0ad9456f --- /dev/null +++ b/slurm/slurm.yml @@ -0,0 +1,36 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+--- +#Playbook for installing Slurm on a cluster + +#collect info from everything +- hosts: all + +# Apply Common Installation and Config +- hosts: cluster + gather_facts: false + roles: + - slurm-common + +# Apply Master Config, start services +- hosts: master + gather_facts: false + roles: + - slurm-master + +# Start SLURM workers +- hosts: compute + gather_facts: false + roles: + - start-slurm-workers diff --git a/tools/change_personality b/tools/change_personality index fa5a21e3c..997f8a45e 100755 --- a/tools/change_personality +++ b/tools/change_personality @@ -1,3 +1,17 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + #!/bin/bash #Usage: change_personality diff --git a/tools/install_tools.yml b/tools/install_tools.yml index 125370ed3..b8b81a7e7 100644 --- a/tools/install_tools.yml +++ b/tools/install_tools.yml @@ -1,3 +1,16 @@ +# Copyright 2020 Dell Inc. or its subsidiaries. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. --- - hosts: master