<h1 style = "font-size: 35px">Terminal Commands and Bash Scripting</h1>

This notebook summarizes the Unix module lecture and provides some additional information.

# Anatomy of a Command

<img src="img/command_anatomy.jpg" width='750px'>

# Understanding help and manual pages

Most commands document their usage via `--help` (long format) **and/or** `-h` (short format).
```bash
cmd --help
```
```bash
cmd -h
```

Alternatively, many commands also have a manual page, which you can open with the `man` command:
```bash
man cmd
```
(press **q** to quit or exit-out after opening the manual)

# File System Navigation

## `pwd`: where am I?

What is the path to my current location (aka working directory) in the file system?

In [None]:
pwd

The first slash in the path refers to the root directory, the folder that contains all of the other folders on your computer.

## `whoami`: who am I logged in as?
To continue the existential questions...


Who am I logged in as? What is my username?

In [None]:
whoami

## `tree`: show the file system structure in a tree format
```bash
tree ~/module-1-programming
```
```
~/module-1-programming
├── Day1_1_Intro_to_Programming.ipynb
├── Day1_2_Python_Packages.ipynb
├── Day1_3_Intro_to_Pandas.ipynb
├── Day2_1_R_Basics.ipynb
├── Day2_2_Terminal_Commands_and_Bash_Scripting.ipynb
├── README.md
├── bash_playground
│   ├── scripts
│   │   ├── hello.sh
│   │   ├── hello_to_all_of_you.sh
│   │   └── hello_you.sh
│   └── sequences.txt
├── data
│   ├── chrom_lengths.tsv
│   ├── gene_chrom.tsv
│   └── hg19.chrom.sizes.txt
└── img
```

## `cd`: **c**hange my working **d**irectory

We can use the `cd` command to navigate to a directory, given the path to that directory:
```bash
cd [path_to_dir]
```
Without any arguments, `cd` will go to your home directory.

In [None]:
cd

Now where are we?

In [None]:
pwd

Usually, you'll want to give `cd` the path to a directory, though. For example, let's go to the `bash_playground/` folder.

In [None]:
cd module-1-programming/bash_playground

You can go back *up* a level with the ".." notation

In [None]:
cd ..
pwd

And you can go back to your last directory by using `-` as the directory path. This will also print the new working directory.

In [None]:
cd -

## `ls`: **l**i**s**t contents of a directory

We can use the `ls` command to print the contents of a directory, given the path to that directory:
```bash
ls [path_to_dir_or_file]...
```
Without any arguments, `ls` will show you the contents of your current working directory.

In [None]:
ls

But we can also give it the path to a file or directory (or multiple files or directories).

In [None]:
ls scripts/hello.sh scripts/hello_you.sh

Adding flags gets us more information. Let's use the short format.

In [None]:
ls -lh scripts/hello.sh scripts/hello_you.sh

Use the `-a` flag to show _all_ files, including hidden ones. Hidden files are those whose names start with a single dot `.`.

In [None]:
ls -a

Just like with `cd`, we can also specify the path to the directory above our current one using the ".." notation.

In [None]:
ls ..

In any path, the tilde `~` represents our home directory, so we can view the files in that directory like this:

In [None]:
ls ~

The tilde actually gets replaced with the path to our home directory prior to us executing the command. You can see this by putting an `echo` in front of the command.

In [None]:
echo ls ~

# Paths

## Relative vs absolute paths

There are two types of paths:
1. relative - interpreted relative to your current working directory
2. absolute - independent of your current working directory (i.e. relative to the root directory)

So these paths refer to the same file:
- `scripts/hello.sh` (relative)
- `~/module-1-programming/bash_playground/scripts/hello.sh` (absolute)

### When should I use a relative or absolute path?

Whether it's best to use a relative or absolute path will depend on the context.

If we changed our working directory from `bash_playground/`, then the first path would break but the second one wouldn't.

But if we changed the location of the `module-1-programming/` directory, then the second path would be the one to break, instead. The first path wouldn't break, assuming our working directory is still `bash_playground/`!

## Different ways of writing a path

These are all equivalent paths!
- `scripts/hello.sh`
- `../bash_playground/scripts/hello.sh`
- `scripts/../scripts/hello.sh`
- `scripts/./hello.sh`

The double-dot `..` refers to the folder above and the single-dot `.` refers to the current folder.
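One way to check that differently spelled paths really point to the same file is bash's `-ef` test, which is true when two paths resolve to the same file. The sketch below is only an illustration: it builds a throwaway copy of the layout in a temporary directory (the `tmp` variable and the empty `hello.sh` are stand-ins, not the course files), and it borrows the `for` and `if` syntax covered in the Logic control section.

```bash
# Hypothetical throwaway layout that mirrors bash_playground/scripts/hello.sh
tmp=$(mktemp -d)
mkdir -p "$tmp/bash_playground/scripts"
touch "$tmp/bash_playground/scripts/hello.sh"
cd "$tmp/bash_playground"

same=0
for p in ../bash_playground/scripts/hello.sh scripts/../scripts/hello.sh scripts/./hello.sh; do
    # -ef is true when both paths resolve to the same file
    if [ "$p" -ef scripts/hello.sh ]; then
        same=$((same + 1))
    fi
done
echo "$same of 3 alternative spellings resolve to scripts/hello.sh"

cd - >/dev/null    # return to the directory we started in
rm -r "$tmp"       # remove the throwaway layout
```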

# Wildcards and globbing

Question marks `?` and asterisks `*` are considered _wildcards_.

You can use a question-mark `?` in place of any character. Bash will find all of the files that match. This is called _globbing_.

In [None]:
ls ?equences.txt

Use an asterisk `*` when you want to match zero or more characters.

In [None]:
ls *es.txt

This is useful when you want to refer to a bunch of similarly named files simultaneously.

In [None]:
ls scripts/*.sh

Wildcards get expanded to a space-separated list prior to the command executing. A great way to see this is to prepend the command with an `echo`.

We'll discuss the `echo` command in more depth later.

In [None]:
echo ls scripts/*.sh

# Tab completion

While typing a command, you can press `<tab>` to use the terminal's auto-completions. This helps save you time and reduces the chance you'll make typos while you type. So you should make an effort to use tab completion as often as possible.

For example, let's tab-complete the `scripts/` directory.
```bash
ls sc<tab>
```
## double tap when there are multiple options
If there are multiple possible ways to complete your command, nothing will appear at first.
```bash
ls scripts/hello<tab>
```
Press `<tab>` twice in quick succession to display the possible options.
```bash
ls scripts/hello<tab><tab>
```

# Making use of your bash history

Use the up and down arrow keys to cycle through your past commands.

To view all of your past commands, use the `history` command.

**Caution**: this might spit out a lot of text!

In [None]:
history

To search for a past command within the history:
1. press _ctrl+r_
2. start typing part of the command; the top hit will appear as you type
3. To move onto the next best hit, press _ctrl+r_ again
4. Once you've found the command, press _esc_ or an arrow key to edit it first (or press _enter_ to execute it immediately)

**You try**: Search for the last `echo` command we executed.

# Altering the file system

## `mkdir`: **m**a**k**e a new **dir**ectory
```bash
mkdir path_to_dir...
```

In [None]:
mkdir shiny_new

Look, it's there!

In [None]:
ls

## `cp`: **c**o**p**y a file or directory
```bash
cp path_to_source... path_to_dest
```

In [None]:
cp sequences.txt shiny_new/sequences_copy.txt
ls shiny_new

### `cp -r` to copy a directory and all of its contents

In [None]:
cp -r shiny_new same_old
ls

We should now have two new directories in our bash playground: `same_old/` and `shiny_new/`.

In [None]:
ls same_old

And `shiny_new/` should contain the same contents as `same_old/`.

In [None]:
ls shiny_new

## `touch`: create a new, empty file
```bash
touch path_to_file...
```

In [None]:
touch empty_inside.txt

Don't read into that name too much. Just trust me, it's empty. Look at the size. Or go check it yourself later!

In [None]:
ls -lh

## `mv`: **m**o**v**e a file or directory
```bash
mv path_to_source... path_to_dest
```

Let's move `empty_inside.txt` to the `shiny_new/` directory.

In [None]:
mv empty_inside.txt shiny_new/
ls

Check: is it there?

In [None]:
ls shiny_new

### renaming a file
Just use `mv` where the destination is in the same folder!

Let's convert `empty_inside.txt` to `fulfilled.txt`. Hopefully, you'll be the same way after this lesson ;)

In [None]:
mv shiny_new/empty_inside.txt shiny_new/fulfilled.txt
ls

## `rm`: **r**e**m**ove a file or directory
```bash
rm path_to_file_or_dir...
```
With great power comes great responsibility... use this carefully! There is no "undo" button or trash can in the terminal!

Let's `echo` the command first to make sure that the wildcards are expanding correctly.

In [None]:
echo rm shiny_new/*.txt

In [None]:
rm shiny_new/*.txt
ls shiny_new

### `rm -r`: delete a directory and all of its contents, **r**ecursively

In [None]:
rm -r same_old
ls

## `rmdir`: **r**e**m**ove an empty **dir**ectory
```bash
rmdir path_to_empty_dir...
```

In [None]:
rmdir shiny_new
ls

## `ln`: create a symlink to a file or directory
```bash
ln -s path_to_target... path_to_link
```
A symlink is a shortcut to a file or directory. This can be helpful when you want to have a file in multiple locations but don't want to copy it.

Recall the `data/` directory in the folder above `bash_playground/`.

In [None]:
ls ../data

Let's create a symlink to the `data/` directory from within the `bash_playground/` directory.

In [None]:
ln -s ../data data-sym
ls

A symlink is a special file that just references another file (or directory). You can view the path of a symlink using `ls -l`. 

In [None]:
ls -l data-sym

For most commands, a symlink will act just like the file or directory it points to.

In [None]:
ls data-sym
cd data-sym

In [None]:
ls
cd ..

In [None]:
rm data-sym
ls

Deleting a symlink doesn't delete the target it points to.

In [None]:
ls ..

# Exercise 1

Try this on your own:
1. Create an empty directory called `grad_school`
2. Create an empty file in that directory and name it with your name
3. Create a symlink to that file in the current directory called `linked_me`
4. Delete the directory you created
5. Delete the symlink

In [None]:
# TYPE YOUR CODE HERE

# Printing things

## `cat`: print the contents of files
```bash
cat path_to_file...
```

In [None]:
cat sequences.txt

If you provide more than one file, `cat` will _concatenate_ them (aka append them to each other). This is where it gets its name.

Remember the hidden secret file? Let's take a look at it now.

In [None]:
cat .you_found_me.txt

And now, let's concatenate the two files together.

In [None]:
cat .you_found_me.txt sequences.txt

## `wc`: count the number of lines or words in a file

```bash
wc [path_to_file]...
```
By default, the `wc` command prints the number of lines, words, and bytes in a file, in that order.

In [None]:
wc sequences.txt

We can use flags to get just the line count (`-l`), just the word count (`-w`), or just the byte count (`-c`). For plain ASCII text like ours, the byte count is the same as the character count.

In [None]:
wc -l sequences.txt

In [None]:
wc -w sequences.txt

In [None]:
wc -c sequences.txt

## `head`: print the top of a file
```bash
head [path_to_file]...
```
By default, `head` will print the first 10 lines of a file.

In [None]:
head sequences.txt

But you can also specify the number of lines to the `-n` flag.

In [None]:
head -n 5 sequences.txt

## `tail`: print the end of a file
```bash
tail [path_to_file]...
```
You can use `tail` in the exact same way as `head`.

In [None]:
tail sequences.txt

In [None]:
tail -n 5 sequences.txt

## `echo`: print a string
```bash
echo [string]...
```
You can use the `echo` command to print a string.

In [None]:
echo hey

The `echo` command accepts multiple string arguments, so we can also do this.

In [None]:
echo is anybody out there?

# Under the hood: standard input, output, and error

Every command has three standard streams: standard input (stdin), standard output (stdout), and standard error (stderr).

When you run a command, its output will be written to stdout and any error messages will be written to stderr.

<img src="img/stdin_out_err.png" width='450px'>

By default, stdout and stderr are connected to your terminal, which is why you can see them on your screen.

Most commands will not read from stdin unless they detect that they should, or you explicitly request them to. More on that later.

# Redirecting stdout and stderr

You can redirect the stdout and stderr from the terminal to custom files. Standard output is represented by the number 1 and standard error by the number 2.

In [None]:
ls scripts/ nonexistent_file

In [None]:
ls scripts/ nonexistent_file 1>file_out 2>file_err

In [None]:
cat file_out

In [None]:
cat file_err

## Appending instead of overwriting

Using a single redirection operator `>` will overwrite the contents of the target file. Let's overwrite the contents of `file_out`.

In [None]:
history >file_out
tail file_out

You can append to the file (i.e. add to the end of it) instead of overwriting it by using two redirection operators in sequence `>>`.

In [None]:
ls >>file_out
tail file_out
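Because `history` output differs from session to session, a smaller, repeatable comparison of `>` and `>>` may be easier to follow. This is just a sketch; `file_demo` is a made-up file name, chosen so that it matches the `file_*` pattern used for clean-up.

```bash
echo first  >file_demo     # creates file_demo
echo second >file_demo     # > overwrites, so "first" is gone
echo third  >>file_demo    # >> appends after "second"
cat file_demo              # prints "second" and then "third"
```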

Before we move onto the next section, let's just clean up after ourselves.

In [None]:
echo rm file_*

In [None]:
rm file_*

# Piping from stdout to stdin

<img src="img/mariopipe.png">

You can use the pipe operator **|** to connect the stdout of one command to the stdin of another.

<img src="img/piping_stdout_stdin.png" width='675px'>

When they aren't given a file argument, many commands (including `tail`) will read from stdin instead.

In [None]:
history | tail

You can chain a series of pipes to create a _pipeline_!

Let's try to get the 5 commands that came just before our 5 most recent ones, i.e. the first 5 of the last 10.

In [None]:
history | tail | head -n5

## Piped commands run in parallel

A common misconception is that commands in a pipeline run sequentially. In fact, they actually run in parallel (at the same time)!

Each command waits until it receives a line of text from its stdin. It processes this line and then writes to its stdout as soon as it can. In this way, the original text is said to be _streamed_ through the pipeline. Instead of waiting for each step to finish before moving onto the next step, we are running the commands at the same time!

You can think of a chain of piped commands as a bunch of trains, each with their own engine, traversing on the same railroad track. Each train can go as fast or as slow as it wants, but none can overtake the other.

The ability to _stream_ text makes command pipelines one of the fastest methods to process big data, and thus, a favorite among bioinformaticians.
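A rough way to observe this for yourself is to time a pipeline of two `sleep` commands. The sketch below assumes you are running bash, since it relies on bash's built-in `SECONDS` timer; `elapsed` is just an illustrative variable name. If the two commands ran one after the other, the pipeline would need about 4 seconds. Because they start together, it should finish in roughly 2.

```bash
SECONDS=0            # reset bash's built-in timer (counts whole seconds)
sleep 2 | sleep 2    # two 2-second commands joined by a pipe
elapsed=$SECONDS     # how long did the whole pipeline take?
echo "the pipeline took about $elapsed seconds"
```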

# Variables

In addition to storing text within files, you can also store text within variables.

Let's try storing the string "hello" inside a variable named `my_variable`.

In [None]:
my_variable=hello

To use the variable within another command, we must prefix the variable name with a dollar-sign `$`.

In [None]:
echo $my_variable

## Storing the stdout of a command in a variable

You can store the stdout of a command in a variable by wrapping the command in parentheses prefixed with a dollar sign: `$(...)`.
```bash
my_variable=$(cmd)
```

In [None]:
my_variable=$(head -n1 sequences.txt)
echo $my_variable

# Working with files

Let's get fancy and actually try to manipulate some of these files.

Note: all of these commands will also read from stdin if they aren't given a file argument.

## `cut`: extract specific columns from a file

```bash
cut -f list [path_to_file]...
```
To get a column of the file, provide the column number (ex: 2) to the `-f` flag.

In [None]:
cat sequences.txt

In [None]:
cut -f 2 sequences.txt

## `tr`: substitution of characters in a file

```bash
tr character_set1 character_set2
```
You can use the `tr` command to replace all instances of a T with a U. Let's turn our DNA sequences into RNA!

Note that the `tr` command doesn't have a file argument. It only accepts input from stdin. We can attach a file to stdin using the `<` operator (rather than `>` for stdout).

In [None]:
tr T U <sequences.txt

The `tr` command can also replace more than one character at a time. Let's replace each base with its complement.

In [None]:
tr ACGT TGCA <sequences.txt

## `rev`: reverse text in a file

```bash
rev [path_to_file]...
```
The `rev` command reverses the characters on each line of a file. Let's use it to obtain the reverse complement of our sequences.

In [None]:
cut -f 2 sequences.txt | tr ACGT TGCA | rev

## `paste`: merge files horizontally

```bash
paste [path_to_file]...
```
The `paste` command is useful when you want to merge multiple files, each containing a table with **n** lines (rows).

In [None]:
paste .you_found_me.txt sequences.txt

## `sort`: sort the lines in a file

```bash
sort [path_to_file]...
```
The `sort` command will sort the lines of your file. By default, this is done alphabetically.

In [None]:
head -n 5 sequences.txt

In [None]:
sort sequences.txt | head -n 5

We can also sort the file by specific columns using the `-k` flag.

Let's sort by the second column.

In [None]:
sort -k 2 sequences.txt
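Alphabetical order can be surprising when a column holds numbers. As a small illustration on made-up data (the chromosome names and lengths below are invented), this sketch compares an alphabetical sort on column 2 with a numeric one. It uses `-k 2,2` to limit the sort key to the second column only, and the `n` modifier to compare that column as numbers.

```bash
# Toy tab-separated table: name, then length (hypothetical values)
alphabetical=$(printf 'chrB\t30\nchrA\t200\nchrC\t4\n' | sort -k 2,2)
numeric=$(printf 'chrB\t30\nchrA\t200\nchrC\t4\n' | sort -k 2,2n)

echo "alphabetical on column 2:"
echo "$alphabetical"    # 200 sorts before 30 and 4 because "2" < "3" < "4"
echo "numeric on column 2:"
echo "$numeric"         # 4, then 30, then 200
```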

## `uniq`: filter out duplicate lines in a file

```bash
uniq [path_to_file_input [path_to_file_output]]
```
You can use the `uniq` command to filter out duplicated lines or count the number of unique lines.

Note that the text provided to `uniq` must be sorted first! `uniq` only compares each line with the one directly before it, so duplicates that aren't adjacent will slip through.

Let's find unique sequences in the `sequences.txt` file.

In [None]:
head -n 5 sequences.txt | cut -f 2 | sort

In [None]:
head -n 5 sequences.txt | cut -f 2 | sort | uniq

I bet you didn't even notice the last two sequences were duplicates until now ;P
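To see why the sorting step matters, here is a small sketch on made-up input in which the duplicate lines are not next to each other. The variable names are only for illustration.

```bash
# Toy input: ACGT appears twice, but not on adjacent lines
without_sort=$(printf 'ACGT\nTTAA\nACGT\n' | uniq | wc -l)
with_sort=$(printf 'ACGT\nTTAA\nACGT\n' | sort | uniq | wc -l)

# uniq alone keeps all 3 lines; after sorting, the duplicate is removed and 2 remain
echo "lines without sort: $without_sort"
echo "lines with sort: $with_sort"
```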

## `grep`: search for lines containing a keyword

```bash
grep pattern [path_to_file]...
```
To search for lines in the file containing a specific pattern, you can use `grep`.

Let's look for all sequences with "TAAC" in them.

In [None]:
grep "TAAC" sequences.txt

All of these commands read from stdin and `grep` is no exception. For example, we can search through our bash history for the `echo` command.

In [None]:
history | grep echo

When you use the `-E` flag, `grep` will interpret the pattern as an extended _regex (aka regular expression)_. Regexes can get quite complicated, so we don't have the time to dive into them completely.

But to give you an idea of how powerful this can be, let's try to grab sequences C through F, or just sequences B and D.

In [None]:
grep -E 'seq[C-F]' sequences.txt

In [None]:
grep -E 'seq(B|D)' sequences.txt

## `sed`: scan through a file and make edits

```bash
sed script_pattern [path_to_file]...
```
The `tr` command made single-character substitutions, but what if you want to replace entire words or sentences? That's one of the most common uses for the `sed` command.

Let's replace "seqC" with the label "sample1". The first argument to `sed` is a special pattern composed of the letter **s**, the search keyword, then the replacement keyword, where each is separated by some character -- in this case, we use a slash `/`.

In [None]:
head -n 5 sequences.txt | sed 's/seqC/sample1/'
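One detail worth knowing: a substitution like the one above replaces only the first match on each line. Appending the `g` (global) flag after the final slash replaces every match. A minimal sketch on a made-up line of text:

```bash
first_only=$(echo "seqC seqC" | sed 's/seqC/sample1/')     # only the first match changes
all_matches=$(echo "seqC seqC" | sed 's/seqC/sample1/g')   # g: every match changes

echo "$first_only"     # sample1 seqC
echo "$all_matches"    # sample1 sample1
```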

## `awk`: scan through a file and extract text
```bash
awk program-text [path_to_file]...
```
`awk` is a much more powerful command than many of the others. It implements its own programming language which can be used to do any number of tasks, including most of those of the other commands we just learned.

We don't have time to discuss `awk`, but there are tons of great tutorials online you can peruse.
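Just to give a taste, here is a minimal sketch on made-up, tab-separated input (the sequence names and bases are invented). In an `awk` program, `$1` and `$2` refer to the first and second columns of each line, `-F '\t'` sets the column separator to a tab, and `length()` returns the number of characters in a string.

```bash
# Print each sequence name followed by the length of its sequence
lengths=$(printf 'seqA\tACGT\nseqB\tTTAACC\n' | awk -F '\t' '{ print $1, length($2) }')
echo "$lengths"
# seqA 4
# seqB 6
```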

## `vim`: terminal-based text editor
```bash
vim [path_to_file]...
```
`vim` is a fully-featured [text editor](https://learntocodewith.me/resources/text-editors/) that runs from within the terminal. You can learn how to use it at [openvim.com](https://openvim.com).

## `less`: view large files and text one page at a time
```bash
less [path_to_file]...
```
You can use `less` to view large files one page at a time. This proves to be incredibly useful in bioinformatics, where most files are very, very large.

Let's take a look at the `sequences.txt` file with `less`.
```bash
less sequences.txt
```

This will open a separate viewing window. You can quit out of it and return to the terminal prompt by pressing **q**.

Now, let's take a look at our bash history. We can pipe into `less` for this.
```bash
history | less
```

You can use the arrow keys to scroll in `less`.

# Exercise 2

**Try this:**
Retrieve just the second column of the third line (seq C) from the sequences.txt file using a combination of tail, head, and cut. Use pipes!

In [None]:
# TYPE YOUR CODE HERE

# Got long-running commands? Use job control.

**Scenario:** You're about to execute a command that will take a long time.

**Solution:** Send the command to the background by appending an ampersand `&`. Keep working while it runs.

We'll use the `sleep` command to illustrate how this works, but you could use any command in theory.

## `sleep`: do nothing for n seconds
```bash
sleep num_seconds
```
Running `sleep 3` will cause `sleep` to execute for 3 seconds, doing absolutely nothing in the meantime.

In [None]:
sleep 3

Let's try sending `sleep` to the background while it runs.
```bash
sleep 7 &
```
Now, we can continue working while it executes.
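If you later need the result of a background command, one common pattern is to remember its process ID and `wait` for it. The following is only a sketch: `bg_pid` and `status` are illustrative variable names, `$!` is the shell's variable holding the process ID of the most recent background command, and `$?` holds the exit status of the last command (0 means success).

```bash
sleep 2 &            # start sleep in the background
bg_pid=$!            # remember its process ID
echo "free to run other commands while process $bg_pid sleeps"

wait "$bg_pid"       # pause here until the background command finishes
status=$?            # exit status that wait reports for the background command
echo "the background command finished with exit status $status"
```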

# Working with larger files
Bioinformaticians frequently keep files compressed so they don't take up too much space.

## `gzip`: compressing a file

```bash
gzip [path_to_file]...
```
You can compress a file using the `gzip` command.

In [None]:
gzip sequences.txt

Notice how the file has a `.gz` extension now?

In [None]:
ls

And if we try to take a look inside, it'll just look like gibberish.

In [None]:
head sequences.txt.gz

### decompress a file
Let's revert the file back to the way it was. We can use the `-d` switch to activate decompression.

In [None]:
gzip -d sequences.txt.gz
ls

In [None]:
head sequences.txt

Ok, now that we know how to deal with small files, let's level-up to "BIG DATA"!

## `wget`: downloading a file from a URL

```bash
wget url...
```
For example, let's download chr21 from the hg38 reference genome:

https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr21.fa.gz (also at https://bit.ly/3zjmX7L)

Recall that the URL for chr21 is stored at the bottom of the hidden file.

In [None]:
echo $(tail -n1 .you_found_me.txt)

In [None]:
wget $(tail -n1 .you_found_me.txt)

Check: do we have it?

In [None]:
ls 

## `du`: print a file's **d**isk **u**sage
```bash
du path_to_file...
```
In addition to `ls`, we can use the `du` command to check how big a file is.

How big is the file we downloaded compared to the sequences file? We can use the `-h` flag to print the size in human-readable format.

In [None]:
du -h sequences.txt chr21.fa.gz

That's several thousand times larger!

## `zcat`: output decompressed text from a gzipped file

```bash
zcat [path_to_file]...
```
We want to take a look at the first ten lines of this file, but it's compressed. Rather than uncompress it all and waste storage space, let's see if we can obtain what we want using the `zcat` command.

In [None]:
zcat chr21.fa.gz | head

The first set of lines of chr21 represent unsequenced heterochromatin. Let's see if we can find real sequences by opening the file inside the `zless` command....

## `zless`: `less` but for gzipped files

```bash
zless [path_to_file]...
```
Let's open this file in `zless` and then look around for a nucleotide like a **g**.

```bash
zless chr21.fa.gz
```
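As a self-contained check that streaming decompression leaves the compressed file on disk untouched, here is a small sketch on a made-up FASTA file in a temporary directory. It uses `gzip -dc` (decompress to stdout), which GNU `zcat` is equivalent to; the `tmp` directory and `toy.fa` file are stand-ins invented for this example.

```bash
tmp=$(mktemp -d)                               # scratch directory
printf '>toy\nACGT\nTTAA\n' >"$tmp/toy.fa"     # tiny made-up FASTA file
gzip "$tmp/toy.fa"                             # replaces toy.fa with toy.fa.gz

header=$(gzip -dc "$tmp/toy.fa.gz" | head -n 1)   # stream the decompressed text
remaining=$(ls "$tmp")                            # what is on disk afterwards?

echo "first line: $header"
echo "files on disk: $remaining"    # still only the compressed file
rm -r "$tmp"                        # clean up the scratch directory
```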

Before we finish this section, let's just clean up after ourselves.

In [None]:
rm chr21.fa.gz
ls

# Logic control
Bash has for loops and if statements, just like Python and R!

## for loops

A for loop is written like this:
```bash
for my_var in <space_separated_list>; do
    cmd "$my_var"
done
```
A variable, in this case called `my_var`, is set to each item in the list, one at a time. The items are specified as a space-separated list. Inside the body of the loop, we can run a `cmd` using the value of each item.

In [None]:
for name in Dolly Kitty Poppet; do
    echo Hello "$name!";
done;

As an example, let's iterate over all of the scripts in the `scripts/` directory and print out their paths. Recall that expressions with wildcards like the asterisk `*` get expanded to a space-separated list prior to command execution.

In [None]:
for file in scripts/*.sh; do
    echo "check out this file called $file"
done

## if statements
An if-statement is written like this:
```bash
if [ <logical_statement> ]; then
    cmd
fi
```
The logical statement is something that we can test for.

Here's an example. First, we make a variable called `you` and then we test whether `$you` does _not_ equal "bored".

In [None]:
you=excited
if [ "$you" != bored ]; then
    echo "okay onto the next adventure!"
fi
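Tests aren't limited to comparing strings. As a sketch that combines a for loop with an if statement, the example below uses the `-e` test (true when a path exists) together with an `else` branch. The temporary directory and file names are made up for illustration, and `$((...))` is bash's syntax for arithmetic.

```bash
tmp=$(mktemp -d)            # scratch directory
touch "$tmp/present.txt"    # create only one of the two files

found=0
for f in present.txt missing.txt; do
    if [ -e "$tmp/$f" ]; then
        echo "$f exists"
        found=$((found + 1))
    else
        echo "$f does not exist"
    fi
done
echo "found $found of 2 files"

rm -r "$tmp"                # clean up the scratch directory
```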

# We just scratched the surface! How can I learn more?

1. [The missing semester of your CS education: an open MIT course](https://missing.csail.mit.edu/)
2. [A curated list of resources for terminal-using bioinformaticians](https://docs.google.com/document/d/1cOjKC4OI4rsQUl_68PPHCn8vO11lwhpSWyJrGDZT80Q)
3. [Software carpentry](https://swcarpentry.github.io/shell-novice/)
4. [explainshell.com](http://explainshell.com) and help/manual pages
5. It’s dry reading, but [the GNU bash manual](https://www.gnu.org/software/bash/manual/bash.html) is really quite good
6. Other useful topics for bioinformaticians that we weren't able to cover today
	- git
	- ssh
	- sshfs
	- conda and conda environments
	- Snakemake
	- the screen command
	- process substitution
	- exit codes
	- functions
	- aliases
	- bash startup
	- using a computing cluster
	- regexes
	- manipulating strings
	- shell expansions
	- the diff and comm commands
	- file and directory permissions
	- the find and xargs commands
	- arrays
	- shell arithmetic
	- while loops
	- signals
	- the top command
	- the datamash tool
	- the /dev special files