# What is TSCC?
- The [Triton Shared Computing Cluster (TSCC)](https://www.sdsc.edu/support/user_guides/tscc-quick-start.html) is the cluster that we (Yeo lab, others at UCSD) use for processing and analysis of single-cell RNA-seq data. It is a part of the [San Diego Supercomputer Center](https://www.sdsc.edu/), which has an administrative team that helps us create/restore backups, install top-level updates, and manage the job submission software for scripts that can be distributed across multiple machines (nodes). 
- Having an external administrative team reduces our engineering overhead, but it also means we lack administrative (superuser, or "[sudo](https://en.wikipedia.org/wiki/Sudo)") privileges. <b>This is common among labs within large organizations and universities</b>. Software that requires these privileges needs either a workaround or administrator help. 

# Terminology
- **Unix**: Unix is a family of operating systems (think Windows or macOS). Linux is a free, Unix-like operating system that has become the core for many distributions such as CentOS, Ubuntu, and Fedora, each of which specializes in different things. Since we're mostly going to be using Python and R (*scripting or interpreted languages* as opposed to *compiled languages* like C or Java), it doesn't really matter which distribution we're using as long as the Python or R interpreter is installed. macOS is itself Unix-based, making it largely compatible with Linux tools and a popular choice among bioinformaticians. 
- **HPC**: High-Performance Computing (HPC) simply describes a computing environment specialized for high-throughput or computationally demanding analysis. HPC clusters usually require additional software to manage job (script) submission and resources within and across their network of nodes (computers). TSCC uses the [TORQUE resource manager](https://en.wikipedia.org/wiki/TORQUE), so we will be running jobs by wrapping them inside TORQUE scripts (qscripts) and submitting them to the TORQUE resource manager. **Note**: Other HPC clusters may use other resource managers (SLURM is another popular one) and therefore will have different syntax, but the idea is the same.
- **Anaconda**: Anaconda is an open-source distribution (collection of software) for scientific computing. It contains the Python interpreter as well as a bunch of other useful software packages (such as SciPy, Jupyter, NumPy, etc.), and conda. 
- **Conda**: conda is a package manager, i.e. software that helps organize these packages, plus any dependencies, versions, and metadata that come with them.
- **Containers**: a container is a unit of software that essentially wraps a package or distribution along with all dependencies in a portable/reproducible way. For our purposes, we can think of containers as easy ways to share a software installation.
- **Singularity**: One popular containerization technology (another being Docker), which we will use to share the software needed throughout this course. Singularity containers (images) can be shared in two ways: as the image itself (a single, larger file containing an operating system and any of the packages installed), or as instructions to create the image. We have several pre-made Singularity images on TSCC which can be called/accessed/run with another package management system called "modules". 
- **Local/remote**: "Local" usually refers to your own laptop/desktop, while "remote" will typically refer to the cluster or cloud instance that you are connecting to. You will almost always have a separate set of credentials (username/password) for each (for example, my local username is "brianyee" while my username on TSCC may be something like "ucsd-train03").
- **Node (Cluster)**: A single node is one server/machine/computer. A cluster is composed of a group of nodes. Each node that we have access to provides up to 16 cores (processors) and up to 126 GB of memory, managed by the job scheduler (Maui) and the resource manager (TORQUE). 

# Generate a public/private access key
You should have generated a public and private key, which will link your local machine to the remote account we have for you on TSCC. Private keys can be thought of as a password that is stored on your computer and used to check whether or not it pairs properly with the public key, which I should have stored on your TSCC account. If you've already done this, great! Skip this section. If you haven't done so, follow these instructions:

### Generate a public/private key pair (Mac/Ubuntu)

- Open a terminal (Mac): Applications folder - Utilities - Launch Terminal
- Type in the following command: 
```bash
ssh-keygen -t rsa
``` 

then press ```Enter```

Enter a passphrase (the cursor won't move), which is kind of like a password for your private key (password protecting a password). It will ask you to do this twice to make sure there aren't any typos. 
#### REMEMBER THIS PASSPHRASE!

This is what you will use to log on to TSCC. Do not include the * character in the passphrase.

You will be prompted with a location to save the key. Press Enter to accept the default setting (it should look something like ```/Users/emily/.ssh/id_rsa``` except the ```/Users/emily/``` part will be your default home directory.)

- Type in the following command to print your public key so you can copy it (assuming you have saved your keys to the default location):
```bash
cat ~/.ssh/id_rsa.pub
```

You should see something like this pop up on your screen:
```
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC76Ew4UJTblrl2c8RkCtVOKS2FDVzMm51fueE+SrIQxiPbeHy/64tK/TfUYGUie6Eq4sr5w0i6
LoMXGNKlfwX8O3nym/coitxTRXcOalGb/CbPIWDN9nvlCa+JB/oYRMTmff/Is6VJur+Ir2eJZnHiB9BlmwjOoKusaqzXesjx0pym7QEEBTpCVF07
lj5rkvAF1K/ia4tnkHRlnc7JDcAvOwQvkvL2swOrYYJhTGqThCwkEebEqEuDhC7UcX1itkaOd5KmCasGVHfgLkqp+QT5pEdaSzSm/iho0VvkNpPd
+B0H5179/ZkUyN7jfSda0C54MAG+Hek8CNEkxyjpvS/R emily@Emilys-MacBook-Air.local
```

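- Highlight the **entire** key printed in your terminal window (from ```ssh-rsa``` through the end of the last line) and copy it with Command-C (Ctrl-Shift-C in most Ubuntu terminals).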
- Paste the contents into an email and send it to me (brian.alan.yee@gmail.com). I will confirm with your username (typically **ucsd-trainXY**\@tscc-login1.sdsc.edu)
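
Before sending the email, it can help to confirm that the file you copied from really holds a single-line RSA public key, since a key that was accidentally split across lines will not work. The helper below is only a minimal sketch: the function name `looks_like_rsa_pubkey` is made up for illustration, and it checks the shape of the file, not whether the key itself is valid.

```bash
# Hypothetical helper (not part of any standard tool): returns 0 if the
# file holds exactly one non-empty line that starts with "ssh-rsa "
# followed by base64-looking text.
looks_like_rsa_pubkey() {
    local keyfile="$1"
    [ -r "$keyfile" ] || return 1
    # Count non-empty lines; a public key file should have exactly one.
    local nlines
    nlines=$(grep -c . "$keyfile" || true)
    [ "$nlines" -eq 1 ] && grep -Eq '^ssh-rsa [A-Za-z0-9+/=]+( |$)' "$keyfile"
}
```

For example, `looks_like_rsa_pubkey ~/.ssh/id_rsa.pub && echo "key looks OK"` should print the message if the key was saved to the default location.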

### Generate a public/private key pair (Windows)

- Download [git bash](https://gitforwindows.org/)
- Open the included terminal application (**Note**: This is distinct from the Windows "powershell" terminal application that comes natively installed on Windows!)
- Follow the Mac/Ubuntu instructions

# Log onto TSCC

Using the username (provided by Brian) and passphrase (see: **REMEMBER THIS PASSPHRASE!**), log in!
```bash
ssh ucsd-trainXY@tscc-login1.sdsc.edu # XY must be replaced by the ID assigned to you.
```
**Note**: Don't worry about the security warning the first time (just type "```yes```"); it appears whenever you connect from a local machine that has never logged in to this host before.
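
If you log in often, an entry in `~/.ssh/config` lets you type a short alias instead of the full address. The sketch below appends such an entry on your local machine. The alias `tscc` and the function name `add_tscc_host` are illustrative choices rather than anything the course requires, and the entry assumes your private key lives at the default `~/.ssh/id_rsa`.

```bash
# Hypothetical helper: append a Host entry for TSCC to an ssh config file.
# Usage: add_tscc_host ucsd-trainXY [config_file]
add_tscc_host() {
    local user="$1"
    local config="${2:-$HOME/.ssh/config}"
    # Skip if an entry with this alias already exists.
    if [ -f "$config" ] && grep -q '^Host tscc$' "$config"; then
        echo "Host tscc already present in $config" >&2
        return 0
    fi
    cat >> "$config" <<EOF
Host tscc
    HostName tscc-login1.sdsc.edu
    User $user
    IdentityFile ~/.ssh/id_rsa
EOF
}
```

After running `add_tscc_host ucsd-trainXY` (with your own ID), `ssh tscc` should behave like the full `ssh` command above.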

# Scratch

Every user is allocated 100GB inside their home directory, which is typically used for permanent storage of processing scripts, notebooks, manuscript figures, etc. However, this is not nearly enough to process large datasets, where intermediate files alone may take up several TB of space. Fortunately, TSCC provides a separate storage allocation specifically for temporary storage called *scratch*. For our purposes, consider this unlimited storage where we will do most of our processing. Let's make things easier by softlinking the path to scratch into our home directory:

```ln -s /oasis/tscc/scratch/ucsd-trainXY ~/scratch```

**Note**: The ```~``` is another way of specifying your home directory, therefore typing ```~``` is identical to typing ```/home/ucsd-trainXY```.
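
Running `ln -s` a second time when `~/scratch` already exists typically creates a stray link inside the scratch directory instead of failing. A small guard avoids that. This is a sketch only, and the function name `link_scratch` is hypothetical:

```bash
# Hypothetical helper: create a symlink only if nothing is in the way.
# Usage: link_scratch /oasis/tscc/scratch/ucsd-trainXY ~/scratch
link_scratch() {
    local target="$1"
    local link="$2"
    if [ ! -d "$target" ]; then
        echo "Target $target does not exist" >&2
        return 1
    fi
    # -L catches existing (even broken) symlinks, -e anything else.
    if [ -L "$link" ] || [ -e "$link" ]; then
        echo "$link already exists; leaving it alone" >&2
        return 0
    fi
    ln -s "$target" "$link"
}
```

You can check where the link points with `readlink ~/scratch`.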

# Screens

Each time you open a terminal and log in to TSCC, you're starting a new session. However, since these sessions are not persistent, closing your terminal window or logging out will also kill/stop any currently running job or command. Obviously this doesn't work for long-running jobs, so we will be using screens to keep your sessions running even after you've closed your terminal. By default, screens are kind of ugly, so we will be downloading and using a custom config file (.screenrc) that Olga (previous instructor) maintains. After you've logged into TSCC, download the config file:

```bash
cd ~ # change directory to your HOME
wget https://raw.githubusercontent.com/olgabot/rcfiles/master/.screenrc
```

Then type ```screen```

This .screenrc adds a status bar at the bottom of your screen, like this:

![screen example](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png "Picture of TSCC login node highlighting the tscc-login")


### Common screen commands:

**Note**: By default, screen commands use a "control character" ```a``` to prefix a "screen" command, however this configuration modifies the control character to be ```j``` instead. We do this because ```Ctrl-a``` is itself another command that you can use to move your cursor all the way to the beginning of the commandline. Just a personal preference.

- **```screen```** : starts a new screen session
- **```Ctrl-j, c```** : creates a new tab within a screen session
- **```Ctrl-j, k```** : kills a screen tab 
- **```Ctrl-j, n```** and **```Ctrl-j, p```** : activates/toggles the "next" and "previous" tabs, respectively.
- **```Ctrl-j, #```** : activates the numbered tab within a screen session
- **```Ctrl-j, Shift-a```** : allows you to re-name a screen tab.

### Attaching/detaching screen sessions:

**Note**: Closing the terminal window (or shutting off your computer) with an active screen session will simply "detach" it from your terminal, meaning the screen (and all the programs within the session) will continue to run.
- **```screen -r```** : Re-attach a screen (when you want to log in and go back to an active screen)
- **```screen -d -r #####```** : Detach and re-attach a screen (should you have multiple screen sessions, you can decide which one to attach). The ```#####``` represents the screen ID (by default, screens are assigned a numeric ID, which becomes apparent when you use multiple screens).

**Note**: Last year, several students ran into issues because they accidentally started several screen sessions (sometimes screen sessions within a screen session). This is okay! Simply re-attach a defunct screen (```screen -d -r #####```) and use ```Ctrl-j, k``` to kill tabs until the screen session ends. To avoid this, I recommend always typing ```screen -d -r``` which will either: 

1. re-attach an existing screen session, or 
2. complain that you don't have any active screen sessions. (Then you're free to ```screen```)
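
To guard against the nested-session problem described above, you can check the `STY` environment variable, which screen sets inside every session it starts. The wrapper below is a minimal sketch (the name `safe_screen` is hypothetical) that refuses to run when you are already inside a screen and otherwise runs the recommended `screen -d -r`:

```bash
# Hypothetical wrapper: do nothing if we are already inside a screen session.
# screen exports STY (the session name) to every shell it starts.
safe_screen() {
    if [ -n "${STY:-}" ]; then
        echo "Already inside screen session $STY" >&2
        return 1
    fi
    screen -d -r
}
```

If `safe_screen` reports that there is no screen to resume, you are free to start one with `screen`.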

# Working with TSCC

As mentioned above, TSCC is a cluster of computers managed by the TORQUE resource manager, so any request to run jobs or reserve resources has to go through it. 


#### When you log in, you will be placed inside one of two login nodes (also called head nodes), which you can usually tell through the command prompt: 

![login node](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon48.png "Picture of TSCC login node highlighting the tscc-login")

Head nodes serve as a default, but as they are unreserved resources, we recommend **not using these nodes to run jobs!** The world won't end, but you'll probably run out of memory quickly and your processes will likely die. Instead, we'll request resources from compute nodes.

### There are two ways to request the resources:

# Submitting a job: 

Use a submission script to run specific command-line tools (e.g., STAR, bowtie). These jobs will use the requested resources to simply run the commands inside the submission script. Walltime (requested time to allocate to your command) will end as soon as the command finishes, so feel free to be liberal (I usually request double the amount of time I think I need per job to ensure that I won't run out of walltime). **This is the preferred method for running software.**

[Example submission script template can be found here]()
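
As a rough illustration of what a submission script can look like, here is a minimal sketch. It is not the lab's actual template: the job name, output file names, and the 2-hour walltime are placeholders, while the queue and `nodes=1:ppn=1` request mirror the values used elsewhere in this guide. The `#PBS` lines are read by TORQUE as options and are plain comments to bash.

```bash
#!/bin/bash
#PBS -N example_job
#PBS -q home-yeo
#PBS -l nodes=1:ppn=1
#PBS -l walltime=2:00:00
#PBS -o example_job.out
#PBS -e example_job.err

# By default TORQUE starts jobs in your home directory and records the
# directory you submitted from in PBS_O_WORKDIR; fall back to the current
# directory so the script also runs outside the scheduler.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir"

echo "Job ${PBS_JOBID:-not-submitted} running in $workdir"

# Replace the line below with the actual command (e.g., a STAR or bowtie call).
echo "placeholder for the real command"
```

Assuming the script is saved as `example_job.sh` (a made-up name), it would be submitted from a login node with `qsub example_job.sh`.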

# Submitting an interactive job: 

Instead of supplying a submission script, you may include the ```-I``` parameter. When the request is granted, you will be redirected to a bash shell within the compute node, leaving you free to run any commands inside the shell until walltime is exhausted. Since your interactive node will last the entire walltime, please be considerate and do not reserve resources for longer than you need (below is a typical request for an appropriate amount of resources for the work we will be doing). **This is preferred for piloting commands or running software that requires you to interact with it**.

Here is an example interactive command, requesting a 1-node, 1-processor-per-node (```nodes=1:ppn=1```) interactive session (```-I```) lasting 6 hours (```walltime=6:00:00```) using the Yeo lab queue (```-q home-yeo```). We are leaving the memory requirements up to Maui/TORQUE:
```bash
qsub -I -l walltime=6:00:00 -q home-yeo -l nodes=1:ppn=1
```
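
To see how the pieces of the request fit together, the sketch below assembles the same command from its parts and prints it without running it, so you can adjust walltime or processors and inspect the result first. The function name `interactive_request` is hypothetical, and the default queue is simply the one used in this guide.

```bash
# Hypothetical helper: print (not run) an interactive qsub request.
# Usage: interactive_request WALLTIME NODES PPN [QUEUE]
interactive_request() {
    local walltime="$1"
    local nodes="$2"
    local ppn="$3"
    local queue="${4:-home-yeo}"
    echo "qsub -I -l walltime=${walltime} -q ${queue} -l nodes=${nodes}:ppn=${ppn}"
}

# Prints the same command shown above.
interactive_request 6:00:00 1 1
```

Once a job is submitted, `qstat -u $USER` should list it along with its state.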

# Conda/Modules

Our lab primarily uses conda to manage user-level packages (software installed for an individual user), and modules to manage system and lab-wide installations. For this course, we'll be using software that has already been made available lab-wide, so there shouldn't be a need to install anything individually. 

### Common module commands:

- **```module avail```** : list all available modules
- **```module load samtools/1.5```** : loads samtools 1.5 (samtools/1.5 can be replaced with any module/version available). Not specifying a version will cause the default module to load.
- **```module list```** : lists all active modules, in the order they were loaded.
- **```module unload samtools/1.5```** : unloads the samtools module. Generally it is good practice to unload modules after use, since they modify environment variables such as your ```$PATH```

## Try loading and running the samtools module:

```samtools``` (you should get: command not found)

```module load samtools/1.5```

```samtools --version``` (you should get: 1.5)
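
Because modules work by changing environment variables such as `$PATH`, you can watch the effect directly by asking the shell where a command resolves before and after loading. A minimal sketch, with a made-up function name `tool_status`:

```bash
# Hypothetical helper: report whether a command is currently on your PATH.
tool_status() {
    local tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool found at $(command -v "$tool")"
    else
        echo "$tool not found on PATH"
        return 1
    fi
}
```

Running `tool_status samtools` before and after `module load samtools/1.5` should switch from "not found" to a path, and back again after `module unload samtools/1.5`.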

## We will be using the following modules for this course:

Try getting an interactive node and loading these modules:

- seurat/3.0
- scanpy
- cellranger/3.0.2
- dropseqtools/1.13

**Note**: Many of our lab-wide modules simply modify your path to point to Singularity containers, which themselves contain the Seurat/Scanpy/Cellranger software. These images are portable, so in principle they can run on any HPC cluster or local machine that has Singularity (2.4+) installed.

# **Exercise: Use the cellranger module to get reads from a published dataset on GEO for re-processing**
- In several cases, BAM files were stored on GEO in lieu of fastq files. However, the Cellranger pipeline will only accept sequencing data in the form of fastqs. 
- Load the cellranger module and see if you can use their software to convert a [BAM file from GEO](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR6254382).
- You may find their online documentation useful. Google is your friend!

# TL;DR

```bash
ssh ucsd-trainXY@tscc-login1.sdsc.edu
cd ~
ln -s /oasis/tscc/scratch/ucsd-trainXY ./scratch
screen -d -r # OR screen if you don't already have an active session yet
qsub -I -l walltime=6:00:00 -q home-yeo -l nodes=1:ppn=1
module load cellranger
cellranger
```