# Access to Duke OIT Cluster Energy Group

You can find more information on [OIT's DCC workshop slides](https://rc.duke.edu/wp-content/uploads/DCC_Workshop_04-13-2017.pdf)

### Create New Account

Go to the [OIT cluster webpage](https://rcforms.oit.duke.edu/dscr-new-user-account/) and fill out the form. Select `collinslab` under `Duke Compute Cluster group`. You will then receive a ticket number. After the request is confirmed by the POC (Jordan or Leslie), you will have access to DCC.

### Access DCC

The host names are:
- `dcc-slogin-01.oit.duke.edu`
- `dcc-slogin-02.oit.duke.edu`  

The two hosts are equivalent: either one logs you in to the same directory (`/dscrhome/[your netid]`).

If you are off campus, you can either:
- `ssh NetID@login.oit.duke.edu`, then ssh to either of the two hosts above
- use [Duke VPN](https://oit.duke.edu/net-security/network/remote/vpn/), then ssh
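
If your OpenSSH client supports the `-J` (jump host) option (OpenSSH 7.3 or later), the two hops of the first option can be combined into a single command. The sketch below only prints that command so you can check it before running it. `dcc_jump_cmd` is a hypothetical helper name and `abc123` is a placeholder NetID.

```shell
# Hypothetical helper: print an ssh command that reaches a DCC login node
# through login.oit.duke.edu in one step (assumes OpenSSH 7.3+ for -J).
dcc_jump_cmd() {
    local netid="$1"
    local host="${2:-dcc-slogin-01.oit.duke.edu}"   # default to the first login node
    echo "ssh -J ${netid}@login.oit.duke.edu ${netid}@${host}"
}

# Placeholder NetID; copy and run the printed command once it looks right.
dcc_jump_cmd abc123
```

Pass the second host name as a second argument if you prefer `dcc-slogin-02.oit.duke.edu`.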

### Data Storage

1. /dscrhome
    * Your login directory
    * 250 GB group quota
2. /work
    * 200 TB unpartitioned 
    * Designed for temporary large data files storage (No backup)
    * 27 TB available now (09/15/2017)
3. /datacommons
    * $82/TB/year

**Usage of data storage space**:
1. Always put your code and scripts in `/dscrhome`, but never put data or large temporary files there.
2. Put data, models, and Singularity images in `/work`.
3. I have a copy of UAB datasets in `/work/bh163/uab_datasets`. You can point to this directory in your `uabRepoPaths.py` to avoid redundant data copies
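
To see how much space a directory takes before the group quota or `/work` fills up, the standard `du` tool is enough. This is a minimal sketch: `dir_usage_kb` is a hypothetical helper name, and the commented example assumes your directories are named after `$USER` (your NetID).

```shell
# Hypothetical helper: print the total size of a directory in kilobytes.
dir_usage_kb() {
    du -sk "$1" | awk '{print $1}'
}

# Example (on DCC), assuming your directories are named after your NetID:
# dir_usage_kb "/dscrhome/$USER"
# dir_usage_kb "/work/$USER"
```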

### Copy files

#### push

- file
```Shell
scp path/to/file netid@dcc-slogin-01.oit.duke.edu:path/to/destination
```

- directory
```Shell
scp -r path/to/dir netid@dcc-slogin-01.oit.duke.edu:path/to/destination
```

#### pull

- file
```Shell
scp netid@dcc-slogin-01.oit.duke.edu:path/to/file path/to/destination
```

- directory
```Shell
scp -r netid@dcc-slogin-01.oit.duke.edu:path/to/dir path/to/dir
```

#### Large file/dir

```Shell
rsync -av path/to/file netid@dcc-slogin-01.oit.duke.edu:path/to/file
rsync -av netid@dcc-slogin-01.oit.duke.edu:path/to/file path/to/file
```
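
For very large transfers it can help to let `rsync` keep partially copied files, so an interrupted copy can pick up where it stopped. The wrapper below is a sketch under that assumption. `dcc_push` and the `DRY_RUN` variable are hypothetical names, and the NetID and paths are placeholders.

```shell
# Hypothetical wrapper: push a large file or directory to DCC with rsync.
# -a keeps permissions and timestamps, -v is verbose, --partial keeps
# partially transferred files so a rerun can resume, --progress shows progress.
dcc_push() {
    local netid="$1" src="$2" dest="$3"
    set -- rsync -av --partial --progress "$src" "${netid}@dcc-slogin-01.oit.duke.edu:${dest}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"        # only print the command
    else
        "$@"             # run the transfer
    fi
}

# Print the command without transferring anything (placeholder NetID and paths).
DRY_RUN=1 dcc_push abc123 data/ /work/abc123/data/
```

Drop `DRY_RUN=1` to actually start the transfer; rerunning the same command after an interruption continues the copy.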

However, I recommend using a GUI for transferring files:

#### Windows Users

You can use [WinSCP](https://winscp.net) for copying files (actually more convenient)

#### Linux Users

You can use `Nautilus`: select `Connect to Server` in the left panel, then type in the server address:
- To connect to home directory, use: sftp://[your netid]@dcc-slogin-01.oit.duke.edu/hpchome/collinslab/[your netid]
- To connect to work directory, use: sftp://[your netid]@dcc-slogin-01.oit.duke.edu/work/[your netid]

### Installed Tools

- Matlab: `/opt/apps/matlabR2016a/bin/matlab` (latest version on DCC, also has 08b, 14a, 15a)
- Python: `/opt/apps/Python-2.7.10` (also has 2.7.3, 2.7.5, 3.3.2, 3.3.3)
- R `/opt/apps/R-3.0.3`
- Other tools in `/opt/apps/rhel7`:
    * anaconda 2
    * anaconda 3
    * Python
    * R
- Other tools in `/opt/apps/slurm`:
    * miniconda 3
    * Torch 7

You'll need to know the location of these tools when deploying jobs.
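
Since installed versions change over time, a job script can check which of several candidate paths actually exists before using it. A minimal sketch; `first_executable` is a hypothetical helper name.

```shell
# Hypothetical helper: print the first argument that is an existing executable;
# return non-zero if none of them is.
first_executable() {
    for candidate in "$@"; do
        if [ -x "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Example (on DCC): prefer R2016a, fall back to whatever `matlab` is on PATH.
# MATLAB_BIN=$(first_executable /opt/apps/matlabR2016a/bin/matlab "$(command -v matlab)")
```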

# Deploying Jobs

## Matlab

To deploy a job in Matlab, you can use a shell script like this (also available [here](./scripts/matlab_sbatch_example.sh)):
```Shell
#!/bin/bash
#SBATCH -e slurm.err                         # error message will be stored in slurm.err
#SBATCH --mem=20G                            # request 20G for ram
#SBATCH -c 6                                 # 6 cpu cores
#SBATCH -p gpu-common --gres=gpu:1           # request for 1 gpu
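# -r takes a Matlab statement (the script name without .m); exit closes Matlab when the script finishes
/opt/apps/matlabR2016a/bin/matlab -nojvm -nodisplay -singleCompThread -r "mycode; exit" > file.out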
```

The last line is how you would run a Matlab file on any Linux system.
- `/opt/apps/matlabR2016a/bin/matlab`: specifies which Matlab version to use; you can use any of the others listed in _Installed Tools_.
- `-nojvm`: runs without the Java virtual machine
- `-nodisplay`: disables the display (since you are using a command line)
- `-singleCompThread`: is said to be 'required to prevent uncontrolled multi-threading'.

Then in the command line, do:
```Shell
sbatch shellscript.sh
```

## Python

The default Python version on DCC is 2.7. If you want to use Python 3, run:
```Shell
export PATH=/opt/apps/rhel7/anaconda3/bin:$PATH
```

Or add
```Shell
export PATH=/opt/apps/rhel7/anaconda3/bin:$PATH
```
to the end of the `.bash_profile`.
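
If you set up several machines or rerun your setup steps, it is easy to end up with the same `export` line several times in `.bash_profile`. The sketch below appends a line only when it is not already present. `append_once` is a hypothetical helper name.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already in it, so repeated runs do not duplicate the entry.
append_once() {
    local line="$1" file="$2"
    touch "$file"                                   # create the file if it is missing
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example: uncomment to add the Anaconda 3 path from above to your profile.
# append_once 'export PATH=/opt/apps/rhel7/anaconda3/bin:$PATH' "$HOME/.bash_profile"
```

The single quotes matter: they keep `$PATH` unexpanded so it is evaluated at login time rather than when the line is written.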

The GPU-enabled version of TensorFlow has been installed and tested for Python 3.

To deploy a job in Python, you can use a shell script like this (also available [here](./scripts/python_sbatch_example.sh)):
```Shell
#!/bin/bash
#SBATCH -e slurm.err                                                   # error log file
#SBATCH --mem=20G                                                      # request 20G memory
#SBATCH -c 6                                                           # request 6 cpu cores
#SBATCH --exclude=dcc-gpu-[31-32]                                      # exclude GPUs (those two have less than 4G RAM)
#SBATCH -p gpu-common --gres=gpu:1                                     # request 1 gpu for this job
module load Anaconda3/3.5.2                                            # load conda to make sure use GPU version of tf
# add cuda and cudnn path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/apps/rhel7/cudnn/lib64:$LD_LIBRARY_PATH
# add my library path
export PYTHONPATH=$PYTHONPATH:/dscrhome/bh163/code/uab
# execute my file
python train_spca_unet_psbs.py
```

For more information about slurm commands, you can refer to [dcc workshop slides](https://rc.duke.edu/wp-content/uploads/DCC_Workshop_04-13-2017.pdf) and [slurm cheat sheet](https://slurm.schedmd.com/pdfs/summary.pdf)

Note: the two `export LD_LIBRARY_PATH=...` lines are necessary when using TensorFlow; without them the CUDA and cuDNN libraries cannot be found and importing TensorFlow fails when you run your code.

Then in the command line, do:
```Shell
sbatch shellscript.sh
```

## View Results

After submitting your job, you should see a message like 'Submitted batch job [job number]'. To view the status of your job, use the command `squeue -u [your netid]`. You will see one of the following three statuses:

![Pending](./pics/PD_job.PNG)

![Running](./pics/R_job.PNG)

![None](./pics/ST_job.PNG)

The first two indicate `Pending` and `Running`; the last is what you'll see once your job has completed.

The output will be stored in the file `slurm-[job number].out`. If the job number is too long to remember or not meaningful, you can redirect the output to a file using `> [filename]`, as in the Matlab example above.
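
If you would rather not copy the job number by hand, it can be extracted from the confirmation message and turned into the output file name. A minimal sketch: `job_id_from_sbatch` and `slurm_out_file` are hypothetical helper names, and `12345` in the comments is a made-up job number.

```shell
# Hypothetical helper: pull the job number out of sbatch's confirmation line,
# e.g. "Submitted batch job 12345" gives 12345.
job_id_from_sbatch() {
    echo "$1" | awk '/^Submitted batch job/ {print $4}'
}

# Hypothetical helper: name of the default output file for a job number.
slurm_out_file() {
    echo "slurm-$1.out"
}

# Example (on DCC): submit, then follow the output as it is written.
# msg=$(sbatch shellscript.sh)
# tail -f "$(slurm_out_file "$(job_id_from_sbatch "$msg")")"
```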

## Cancel Job

To kill a job, type
```Shell
scancel [job number]
```
or use
```Shell
scancel -u [netid]
```
to cancel all your jobs

## View GPUs on Cluster

To see the running and pending jobs on the cluster, you can use
```Shell
squeue | grep dcc-gpu
squeue | grep collinslab
```

These show the jobs on the public nodes and on our lab's nodes, respectively. You'll get something like this:

![job summary](./pics/run_summary.PNG)

To see information about all GPUs in the cluster, you can use
```Shell
sinfo | grep gpu-common
sinfo | grep collinslab
```

And you'll get:

![job summary](./pics/gpu_summary.PNG)

This can help you figure out how many nodes are available or how long you might need to wait.
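
To turn the `sinfo` listing into a single number, the node counts of the idle rows can be summed. This sketch assumes `sinfo` prints its default columns (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST); if the cluster is configured with a different format, the column numbers need adjusting. `count_idle_nodes` is a hypothetical helper name.

```shell
# Hypothetical helper: read default-format sinfo lines on stdin and print
# how many nodes of the given partition are in the "idle" state.
# Assumes column 1 = PARTITION, column 4 = NODES, column 5 = STATE.
count_idle_nodes() {
    awk -v part="$1" '$1 == part && $5 == "idle" {n += $4} END {print n + 0}'
}

# Example (on DCC):
# sinfo | count_idle_nodes gpu-common
# sinfo | count_idle_nodes collinslab
```

A result of 0 means every node in that partition is busy or unavailable, so a newly submitted job will probably stay pending for a while.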

## Verify Your Code is Using GPU

I have sometimes found that jobs I submitted did not use the GPU they were allocated. To make sure your jobs are using GPUs, use the following command:
```Shell
for i in `squeue -u [your netid] | grep gpu-commo | awk '{print $8}' | grep -v NODE`;do echo $i;ssh $i nvidia-smi 2> /dev/null | grep processes;done
```

Or you can use [this script](./scripts/gpu_usage.sh), which checks the GPU usage of a given NetID in both the `gpu-common` and `collinslab` allocations. The easiest way to use it is to [add it to your bash path](https://stackoverflow.com/a/20054809). Then you can run:
```
bash gpu_usage.sh [your netid]
```

If you see output like this:
```
dcc-gpu-14
|  No running processes found                                                 |
```

then you should terminate your job, because it is occupying a GPU without using it.
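
The same check can be scripted by looking for that message in the `nvidia-smi` output. A minimal sketch; `gpu_is_idle` is a hypothetical helper name, and it only tells you that no process at all is using the GPUs on that node.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and succeed (exit 0)
# if it reports that no process is running on the GPU.
gpu_is_idle() {
    grep -q "No running processes found"
}

# Example (on DCC, node name taken from the sample output above):
# ssh dcc-gpu-14 nvidia-smi 2> /dev/null | gpu_is_idle && echo "dcc-gpu-14: GPU idle"
```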

## Access Batcave Repos from DCC

You can access repos in `batcave` by using the following instruction:  
`git clone ssh://[your netid]@login01.egr.duke.edu/autofs/nfs4/pse-fs-00.egr.duke.edu/lcollins-00/data/data/sourceControl/[repo name].git`

## Use Filezilla

[Filezilla](https://filezilla-project.org/) is a client that supports FTP, SFTP, and other protocols. To connect to DCC with Filezilla, use the following settings:
1. Host: _sftp://dcc-slogin-01.oit.duke.edu_
2. Username: _your username_
3. Password: _your password_
4. Port: **22** (This is very important)