<h1 style = "font-size: 35px">Terminal Commands and Bash Scripting v2</h1>

This notebook is an extended version of the Unix module lecture, with extra content that we didn't have time to teach. You will **not** be evaluated on your knowledge of this material, but it contains lots of tidbits that we guarantee you will find useful in your research or day-to-day work.

# Anatomy of a Command

<img src="img/command_anatomy.jpg" width='750px'>

# File System Navigation

## `pwd`: where am I?

What is the path to my current location (aka working directory) in the file system?

In [2]:
pwd

/home/amassara/module-1-programming

The first slash in the path refers to the root directory, the folder that contains all of the other folders on your computer.

## `whoami`: who am I logged in as?
To continue the existential questions...

<img src="../module-1-programming/img/whoami.jpg">

Who am I logged in as? What is my username?

In [3]:
whoami

amassara

## `tree`: show the file system structure in a tree format
```bash
tree ~/module-1-programming
```
```
~/module-1-programming
├── Day1_1_Intro_to_Programming.ipynb
├── Day1_2_Python_Packages.ipynb
├── Day1_3_Intro_to_Pandas.ipynb
├── Day2_1_R_Basics.ipynb
├── Day2_2_Terminal_Commands_and_Bash_Scripting.ipynb
├── README.md
├── bash_playground
│   ├── scripts
│   │   ├── hello.sh
│   │   ├── hello_to_all_of_you.sh
│   │   └── hello_you.sh
│   └── sequences.txt
├── data
│   ├── chrom_lengths.tsv
│   ├── gene_chrom.tsv
│   └── hg19.chrom.sizes.txt
└── img
```

## `cd`: **c**hange my working **d**irectory

We can use the `cd` command to navigate to a directory, given the path to that directory:
```bash
cd [path_to_dir]
```
Without any arguments, `cd` will go to your home directory.

In [4]:
cd


Now where are we?

In [5]:
pwd

/home/amassara

Usually, you'll want to give `cd` the path to a directory, though. For example, let's go to the `bash_playground/` folder.

In [6]:
cd module-1-programming/bash_playground


You can go back *up* a level with the `..` notation.

In [7]:
cd ..
pwd

/home/amassara/module-1-programming

And you can go back to your last directory by using `-` as the directory path. This will also print the new working directory.

In [8]:
cd -

/home/amassara/module-1-programming/bash_playground

## `ls`: **l**i**s**t contents of a directory

We can use the `ls` command to print the contents of a directory, given the path to that directory:
```bash
ls [path_to_dir_or_file]...
```
Without any arguments, `ls` will show you the contents of your current working directory.

In [9]:
ls

scripts  sequences.txt

But we can also give it the path to a file or directory (or multiple files or directories).

In [10]:
ls scripts/hello.sh scripts/hello_you.sh

scripts/hello.sh  scripts/hello_you.sh

Adding flags gets us more information. Let's use the long format (`-l`) with human-readable file sizes (`-h`).

In [11]:
ls -lh scripts/hello.sh scripts/hello_you.sh

-rw-r----- 1 amassara root 82 Jan  5 05:23 scripts/hello.sh
-rw-r----- 1 amassara root 38 Jan  5 05:23 scripts/hello_you.sh

Use the `-a` flag to show _all_ files, including hidden ones. Hidden files are those whose names begin with a dot `.`.

In [12]:
ls -a

.  ..  scripts	sequences.txt  .you_found_me.txt

Just like with `cd`, we can also specify the path to the directory above our current one using the ".." notation.

In [13]:
ls ..

 bash_playground   Day1_4_Terminal_Commands_and_Bash_Scripting.ipynb    img
 data		  'FULL - Terminal_Commands_and_Bash_Scripting.ipynb'

In any path, the tilde `~` represents our home directory, so we can view the files in that directory like this:

In [14]:
ls ~

cmm262-2023  module-1-programming  private  public

The tilde actually gets replaced with the path to our home directory prior to us executing the command. You can see this by putting an `echo` in front of the command.

In [15]:
echo ls ~

ls /home/amassara

# Paths

## Relative vs absolute paths

There are two types of paths:
1. relative - interpreted relative to your current working directory
2. absolute - independent of your current working directory (ie relative to the root directory)

So these paths refer to the same file:
- `scripts/hello.sh` (relative)
- `~/module-1-programming/bash_playground/scripts/hello.sh` (absolute)

### When should I use a relative or absolute path?

Whether it's best to use a relative or absolute path will depend on the context.

If we changed our working directory from `bash_playground/`, then the first path would break but the second one wouldn't.

But if we changed the location of the `module-1-programming/` directory, then the second path would be the one to break, instead. The first path wouldn't break, assuming our working directory is still `bash_playground/`!
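
To see the difference concretely, here is a minimal sketch that builds a throwaway copy of the layout and tries both kinds of path from two different working directories. The temporary directory and the `demo_dir` variable are stand-ins invented for this example (variables and `$( )` are explained later in this notebook).

```bash
# Build a throwaway directory tree so the example is self-contained.
# demo_dir is a hypothetical stand-in for bash_playground/.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/scripts"
touch "$demo_dir/scripts/hello.sh"

cd "$demo_dir"
ls scripts/hello.sh                # relative path: resolves from here
ls "$demo_dir/scripts/hello.sh"    # absolute path: resolves from here too

cd scripts
# The relative path now means scripts/scripts/hello.sh, which doesn't exist.
ls scripts/hello.sh 2>/dev/null || echo "relative path broke"
ls "$demo_dir/scripts/hello.sh"    # the absolute path still works
```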

## Different ways of writing a path

These are all equivalent paths!
- `scripts/hello.sh`
- `../bash_playground/scripts/hello.sh`
- `scripts/../scripts/hello.sh`
- `scripts/./hello.sh`

The double-dot `..` refers to the folder above and the single-dot `.` refers to the current folder.
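
One way to convince yourself that these spellings are equivalent is the `-ef` operator of the `[` (test) command, which succeeds when two paths refer to the same file. The sketch below recreates the layout in a temporary directory (an assumption made so that it runs anywhere); `demo_root` is a made-up name.

```bash
# demo_root is a hypothetical stand-in for module-1-programming/.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/bash_playground/scripts"
touch "$demo_root/bash_playground/scripts/hello.sh"
cd "$demo_root/bash_playground"

# && runs the echo only if the test before it succeeded.
[ ../bash_playground/scripts/hello.sh -ef scripts/hello.sh ] && echo "same file"
[ scripts/../scripts/hello.sh -ef scripts/hello.sh ] && echo "same file"
[ scripts/./hello.sh -ef scripts/hello.sh ] && echo "same file"
```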

# Understanding help and manual pages

Most commands have usage documentation available via `--help` (long format) **and/or** `-h` (short format)
```bash
cmd --help
```
```bash
cmd -h
```

Alternatively, some commands will also register their help via the built-in manual pages in the `man` command
```bash
man cmd
```
(press **q** to quit or exit-out after opening the manual)

## An example manual page: `man ls`
<img src="img/man_ls.png" width='800px'>

# Wildcards and globbing

Question marks `?` and asterisks `*` are considered _wildcards_.

You can use a question mark `?` in place of any single character. Bash will find all of the files that match. This is called _globbing_.

In [16]:
ls ?equences.txt

sequences.txt

Use an asterisk `*` when you want to match zero or more characters.

In [17]:
ls *es.txt

sequences.txt

This is useful when you want to refer to a bunch of similarly named files simultaneously.

In [18]:
ls scripts/*.sh

scripts/hello.sh  scripts/hello_to_all_of_you.sh  scripts/hello_you.sh

Wildcards get expanded to a space-separated list prior to the command executing. A great way to see this is to prepend the command with an `echo`.

We'll discuss the `echo` command in more depth later.

In [19]:
echo ls scripts/*.sh

ls scripts/hello.sh scripts/hello_to_all_of_you.sh scripts/hello_you.sh

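Because it is the shell (not the command) that expands wildcards, quoting a pattern switches globbing off. The sketch below shows this in a scratch directory; the file names are made up for illustration.

```bash
glob_dir=$(mktemp -d)   # scratch directory so we don't touch real files
cd "$glob_dir"
touch sample1.txt sample2.txt sample10.txt notes.md

echo sample?.txt     # ? matches exactly one character: sample1.txt and sample2.txt
echo sample*.txt     # * matches zero or more characters: all three sample files
echo "sample*.txt"   # quoted: printed literally, no expansion
echo *.csv           # nothing matches: by default bash leaves the pattern as it is
```
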
# Tab completion

While typing a command, you can press `<tab>` to use the terminal's auto-completions. This helps save you time and reduces the chance you'll make typos while you type. So you should make an effort to use tab completion as often as possible.

For example, let's tab-complete the `scripts/` directory.
```bash
ls sc<tab>
```
## double tap when there are multiple options
If there are multiple possible ways to complete your command, nothing will appear at first.
```bash
ls scripts/hello<tab>
```
Press `<tab>` twice in quick succession to display the possible options.
```bash
ls scripts/hello<tab><tab>
```

# Making use of your bash history

Use the up and down arrow keys to cycle through your past commands.

To view all of your past commands, use the `history` command.

**Caution**: this might spit out a lot of text!

In [20]:
history

    1  exit
    2  PS1='[PEXP\[\]ECT_PROMPT>' PS2='[PEXP\[\]ECT_PROMPT+' PROMPT_COMMAND=''
    3  export PAGER=cat
    4  display () {     TMPFILE=$(mktemp ${TMPDIR-/tmp}/bash_kernel.XXXXXXXXXX);     cat > $TMPFILE;     echo "bash_kernel: saved image data to: $TMPFILE" >&2; }
    5  PS1='[PEXP\[\]ECT_PROMPT>' PS2='[PEXP\[\]ECT_PROMPT+' PROMPT_COMMAND=''
    6  export PAGER=cat
    7  display () {     TMPFILE=$(mktemp ${TMPDIR-/tmp}/bash_kernel.XXXXXXXXXX);     cat > $TMPFILE;     echo "bash_kernel: saved image data to: $TMPFILE" >&2; }
    8  pwd
    9  echo $?
   10  pwd
   11  echo $?
   12  whoami
   13  echo $?
   14  cd
   15  echo $?
   16  pwd
   17  echo $?
   18  cd module-1-programming/bash_playground
   19  echo $?
   20  cd ..
   21  pwd
   22  echo $?
   23  cd -
   24  echo $?
   25  ls
   26  echo $?
   27  ls scripts/hello.sh scripts/hello_you.sh
   28  echo $?
   29  ls -lh scripts/hello.sh scripts/hello_you.sh
   30  echo $?
   31  ls -a
   32  echo $?
   33  ls ..
  


To search for a past command within the history:
1. press _ctrl+r_
2. start typing part of the command
    The top hit will appear as you type
3. To move onto the next best hit, press _ctrl+r_ again
4. Once you've found the command, press _esc_ or an arrow key to edit it (or press _enter_ if you want to execute the command immediately)

**You try**: Search for the last `echo` command we executed.

# Altering the file system

## `mkdir`: **m**a**k**e a new **dir**ectory
```bash
mkdir path_to_dir...
```

In [21]:
mkdir shiny_new


Look, it's there!

In [22]:
ls

scripts  sequences.txt	shiny_new

## `cp`: **c**o**p**y a file or directory
```bash
cp path_to_source... path_to_dest
```

In [23]:
cp sequences.txt shiny_new/sequences_copy.txt
ls shiny_new

sequences_copy.txt

### `cp -r` to copy a directory and all of its contents

In [24]:
cp -r shiny_new same_old
ls

same_old  scripts  sequences.txt  shiny_new

We should now have two new directories in our bash playground: `same_old/` and `shiny_new/`.

In [25]:
ls same_old

sequences_copy.txt

And `shiny_new/` should contain the same contents as `same_old/`.

In [26]:
ls shiny_new

sequences_copy.txt

## `touch`: create a new, empty file
```bash
touch path_to_file...
```

In [27]:
touch empty_inside.txt


Don't read into that name too much. Just trust me, it's empty. Look at the size. Or go check it yourself later!

In [28]:
ls -lh

total 13K
-rw-rw---- 1 amassara root   0 Jan 10 05:44 empty_inside.txt
drwxrwx--- 2 amassara root   3 Jan 10 05:44 same_old
drwxr-x--- 2 amassara root   5 Jan  5 05:23 scripts
-rw-r----- 1 amassara root 840 Jan  5 05:23 sequences.txt
drwxrwx--- 2 amassara root   3 Jan 10 05:44 shiny_new

## `mv`: **m**o**v**e a file or directory
```bash
mv path_to_source... path_to_dest
```

Let's move `empty_inside.txt` to the `shiny_new/` directory.

In [29]:
mv empty_inside.txt shiny_new/
ls

same_old  scripts  sequences.txt  shiny_new

Check: is it there?

In [30]:
ls shiny_new

empty_inside.txt  sequences_copy.txt

### renaming a file
Just use `mv` where the destination is in the same folder!

Let's convert `empty_inside.txt` to `fulfilled.txt`. Hopefully, you'll be the same way after this lesson ;)

In [31]:
mv shiny_new/empty_inside.txt shiny_new/fulfilled.txt
ls

same_old  scripts  sequences.txt  shiny_new

## `rm`: **r**e**m**ove a file or directory
```bash
rm path_to_file_or_dir...
```
With great power comes great responsibility... use this carefully! There is no "undo" button or trash can in the terminal!

Let's `echo` the command first to make sure that the wildcards are expanding correctly.

In [32]:
echo rm shiny_new/*.txt

rm shiny_new/fulfilled.txt shiny_new/sequences_copy.txt

In [33]:
rm shiny_new/*.txt
ls shiny_new

[?2004h[?2004l

: 1

### `rm -r`: delete a directory and all of its contents, **r**ecursively

In [34]:
rm -r same_old
ls

scripts  sequences.txt	shiny_new

## `rmdir`: **r**e**m**ove an empty **dir**ectory
```bash
rmdir path_to_empty_dir...
```

In [35]:
rmdir shiny_new
ls

scripts  sequences.txt

## `ln`: create a symlink to a file or directory
```bash
ln -s path_to_target... path_to_link
```
A symlink is a shortcut to a file or directory. This can be helpful when you want to have a file in multiple locations but don't want to copy it.

Recall the `data/` directory in the folder above `bash_playground/`.

In [36]:
ls ../data

chrom_lengths.tsv  gene_chrom.tsv  hg19.chrom.sizes.txt

Let's create a symlink to the `data/` directory from within the `bash_playground/` directory.

In [37]:
ln -s ../data data-sym
ls

data-sym  scripts  sequences.txt

A symlink is a special file that just references another file (or directory). You can view the path of a symlink using `ls -l`. 

In [38]:
ls -l data-sym

lrwxrwxrwx 1 amassara root 7 Jan 10 05:44 data-sym -> ../data

In most situations, a symlink behaves just like the file or directory it points to.

In [39]:
ls data-sym
cd data-sym

chrom_lengths.tsv  gene_chrom.tsv  hg19.chrom.sizes.txt
[?2004h[?2004l

: 1

In [40]:
ls
cd ..

chrom_lengths.tsv  gene_chrom.tsv  hg19.chrom.sizes.txt
[?2004h[?2004l

: 1

In [41]:
rm data-sym
ls

scripts  sequences.txt

Deleting a symlink doesn't delete the target it points to.

In [42]:
ls ..

 bash_playground   Day1_4_Terminal_Commands_and_Bash_Scripting.ipynb    img
 data		  'FULL - Terminal_Commands_and_Bash_Scripting.ipynb'

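As a quick recap of how symlinks behave, the sketch below creates one in a scratch directory, reads back where it points with `readlink`, and confirms that the target survives once the link is removed. All file and directory names here are hypothetical.

```bash
link_dir=$(mktemp -d)    # scratch directory for the experiment
cd "$link_dir"
mkdir real_data
touch real_data/table.tsv

ln -s real_data data_link    # data_link -> real_data
readlink data_link           # prints the path stored in the symlink
ls data_link                 # lists the contents of the target

rm data_link                 # removes only the link; no -r is needed
ls real_data                 # the target and its contents are still there
```
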
# Exercise 1

Try this on your own:
1. Create an empty directory called `grad_school`
2. Create an empty file in that directory and name it with your name
3. Create a symlink to that file in the current directory called `linked_me`
4. Delete the directory you created
5. Delete the symlink

In [43]:
mkdir grad_school
touch grad_school/myname
ln -s grad_school/myname linked_me
rm -r grad_school
rm linked_me

[?2004h[?2004l[?2004l[?2004l[?2004l

: 1

# Printing things

## `cat`: print the contents of files
```bash
cat path_to_file...
```

In [44]:
cat sequences.txt

seqA	CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
seqB	GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
seqC	GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
seqE	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqD	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqF	TTAGTTGCAAATAGGCTTACCTAGTAGAGGTCCGACAACCCCTACACGTC
seqG	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqH	CAGAACGCACTTTCCGGCAGTGAATGTCGATGCCAACATTGCCTAAAACA
seqI	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqJ	TGCGGGGCGGCGATATGCGGCAACGATGGCCGCGAGTTAAATGAGCCATA
seqK	CTTTGTGGGGTAATCTAGAAAGCAGAGTACGTATTTGGCGACCCGACATT
seqL	CTTATTTGCTCTCCCGGCTGTAGTGTGCACAGACTTCCAGATGTAATAAG
seqM	CTTTCTCTATTTACCGAGCAATGAAAACCACGGGGAACAATCCACCCTTT
seqN	GTCAAATGGCCGGCCCACTTCTCTCTTCGACACGGCGTCGGGCAGTCTGG
seqO	ATGGTAAGACTAGCCAGCAGATCCTTAATCGGCTTTCCCGAAGGACAGTA

If you provide more than one file, `cat` will _concatenate_ them (aka append them to each other). This is where it gets its name.

Remember the hidden secret file? Let's take a look at it now.

In [45]:
cat .you_found_me.txt

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
I am secret text from a hidden file
Here's chr21:
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr21.fa.gz

And now, let's concatenate the two files together.

In [46]:
cat .you_found_me.txt sequences.txt

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
I am secret text from a hidden file
Here's chr21:
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr21.fa.gz
seqA	CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
seqB	GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
seqC	GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
seqE	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqD	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqF	TTAGTTGCAAATAGGCTTACCTAGTAGAGGTCCGACAACCCCTACACGTC
seqG	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqH	CAGAACGCACTTTCCGGCAGTGAATGTCGATGCCAACATTGCCTAAAACA
seqI	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqJ	TGCGGGGCGGCGATATGCGGCAACGATGGCCGCGAGTTAAATGAGCCATA
seqK	CTTTGTGGGGTAATCTAGAAAGCAGAGTACGTATTTGGCGACCCGACATT
seqL	CTTATTTGCTCTCCCGGCTGTAGTGTGCACAGACTTCCAGATGTAATAAG
seqM	CTTTCTCTATTTACCGAGCAATGAAAACCACGGGGAACAATCCACCCTTT
seqN	GTCAAATGGCCGGCCCACTTCTCTCTTCGACACGGCGTCGGGCAGTCTGG
seqO	ATGGTAAGACTAGCCAGCAGATCCTTAATCGGCTTTCCCGAAGGACAGTA

## `wc`: count the number of lines or words in a file

```bash
wc [path_to_file]...
```
You can use the `wc` command to obtain counts of the lines, words, and bytes (in that order) in a file.

In [47]:
wc sequences.txt

 15  30 840 sequences.txt

We can use flags to get just the line count (`-l`), just the word count (`-w`), or just the byte count (`-c`). For plain ASCII text like our sequences, the byte count equals the character count.

In [48]:
wc -l sequences.txt

15 sequences.txt

In [49]:
wc -w sequences.txt

30 sequences.txt

In [50]:
wc -c sequences.txt

840 sequences.txt

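When you only want the number itself, one common trick is to feed the file to `wc` through stdin, so that there is no file name for it to print. The `<` operator used for this is covered later in this notebook, and the file below is a made-up example.

```bash
counts_file=$(mktemp)    # hypothetical two-line example file
printf 'one two\nthree\n' > "$counts_file"

wc -l "$counts_file"      # line count followed by the file name
wc -l < "$counts_file"    # just the line count: wc never sees a file name
wc -w < "$counts_file"    # just the word count
wc -c < "$counts_file"    # just the byte count
```
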
## `head`: print the top of a file
```bash
head [path_to_file]...
```
By default, `head` will print the first 10 lines of a file.

In [51]:
head sequences.txt

seqA	CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
seqB	GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
seqC	GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
seqE	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqD	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqF	TTAGTTGCAAATAGGCTTACCTAGTAGAGGTCCGACAACCCCTACACGTC
seqG	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqH	CAGAACGCACTTTCCGGCAGTGAATGTCGATGCCAACATTGCCTAAAACA
seqI	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqJ	TGCGGGGCGGCGATATGCGGCAACGATGGCCGCGAGTTAAATGAGCCATA

But you can also specify the number of lines with the `-n` flag.

In [52]:
head -n 5 sequences.txt

seqA	CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
seqB	GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
seqC	GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
seqE	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqD	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

## `tail`: print the end of a file
```bash
tail [path_to_file]...
```
You can use `tail` in the exact same way as `head`.

In [53]:
tail sequences.txt

seqF	TTAGTTGCAAATAGGCTTACCTAGTAGAGGTCCGACAACCCCTACACGTC
seqG	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqH	CAGAACGCACTTTCCGGCAGTGAATGTCGATGCCAACATTGCCTAAAACA
seqI	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqJ	TGCGGGGCGGCGATATGCGGCAACGATGGCCGCGAGTTAAATGAGCCATA
seqK	CTTTGTGGGGTAATCTAGAAAGCAGAGTACGTATTTGGCGACCCGACATT
seqL	CTTATTTGCTCTCCCGGCTGTAGTGTGCACAGACTTCCAGATGTAATAAG
seqM	CTTTCTCTATTTACCGAGCAATGAAAACCACGGGGAACAATCCACCCTTT
seqN	GTCAAATGGCCGGCCCACTTCTCTCTTCGACACGGCGTCGGGCAGTCTGG
seqO	ATGGTAAGACTAGCCAGCAGATCCTTAATCGGCTTTCCCGAAGGACAGTA

In [54]:
tail -n 5 sequences.txt

seqK	CTTTGTGGGGTAATCTAGAAAGCAGAGTACGTATTTGGCGACCCGACATT
seqL	CTTATTTGCTCTCCCGGCTGTAGTGTGCACAGACTTCCAGATGTAATAAG
seqM	CTTTCTCTATTTACCGAGCAATGAAAACCACGGGGAACAATCCACCCTTT
seqN	GTCAAATGGCCGGCCCACTTCTCTCTTCGACACGGCGTCGGGCAGTCTGG
seqO	ATGGTAAGACTAGCCAGCAGATCCTTAATCGGCTTTCCCGAAGGACAGTA

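Combining `head` and `tail` lets you pull a slice out of the middle of a file. The sketch below uses `seq` to generate a numbered stand-in file and the pipe operator `|` (introduced later in this notebook) to chain the two commands; `numbers_file` is a name invented for the example.

```bash
numbers_file=$(mktemp)
seq 1 20 > "$numbers_file"     # a 20-line file containing the numbers 1 to 20

# Lines 6-10: keep the first 10 lines, then the last 5 of those.
head -n 10 "$numbers_file" | tail -n 5

# Everything from line 18 onward: a leading + makes tail count from the top.
tail -n +18 "$numbers_file"
```
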
## `echo`: print a string
```bash
echo [string]...
```
You can use the `echo` command to print a string.

In [55]:
echo hey

hey

The `echo` command accepts multiple string arguments, so we can also do this.

In [56]:
echo is anybody out there?

is anybody out there?

# Under the hood: standard input, output, and error

Every command has three standard files: standard input (stdin), standard output (stdout), and standard error (stderr).

When you run a command, its output will be written to stdout and any error messages will be written to stderr.

<img src="img/stdin_out_err.png" width='450px'>

By default, stdout and stderr are connected to your terminal, which is why you can see them on your screen.

Most commands will not read from stdin unless they detect that they should, or you explicitly request them to. More on that later.
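
To make the distinction concrete, here is a small sketch in which a made-up function, `talk`, writes one line to each stream. Redirection (explained in the next section) is used only to show that the two lines really travel through different channels.

```bash
# talk is a hypothetical helper, not a real command.
talk () {
    echo "this goes to stdout"
    echo "this goes to stderr" >&2    # >&2 sends echo's output to stderr
}

talk                  # both lines appear in the terminal
talk 2>/dev/null      # stderr is discarded: only the stdout line appears
talk >/dev/null       # stdout is discarded: only the stderr line appears
```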

# Redirecting stdout and stderr

You can redirect the stdout and stderr from the terminal to custom files. Standard output is represented by the number 1 and standard error by the number 2.

In [57]:
ls scripts/ nonexistent_file

ls: cannot access 'nonexistent_file': No such file or directory
scripts/:
hello.sh  hello_to_all_of_you.sh  hello_you.sh

In [58]:
ls scripts/ nonexistent_file 1>file_out 2>file_err


In [59]:
cat file_out

scripts/:
hello.sh
hello_to_all_of_you.sh
hello_you.sh

In [60]:
cat file_err

ls: cannot access 'nonexistent_file': No such file or directory

If you omit the number prior to the redirection operator `>`, it will default to stdout (the number 1).

In [61]:
ls scripts/ nonexistent_file >file_out_new 2>file_err_new


In [62]:
cat file_out_new

scripts/:
hello.sh
hello_to_all_of_you.sh
hello_you.sh

In [63]:
cat file_err_new

ls: cannot access 'nonexistent_file': No such file or directory

You can redirect both stdout and stderr to the same file using the `&>` operator.

In [64]:
ls scripts/ nonexistent_file &>file_both
cat file_both

ls: cannot access 'nonexistent_file': No such file or directory
scripts/:
hello.sh
hello_to_all_of_you.sh
hello_you.sh

Whichever redirection you leave off will default to the terminal, again.

In [65]:
ls scripts/ nonexistent_file >file_out
cat file_out

ls: cannot access 'nonexistent_file': No such file or directory
scripts/:
hello.sh
hello_to_all_of_you.sh
hello_you.sh

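The `&>` shorthand is a bash feature. A more portable spelling that you will often see in other people's scripts is `>file 2>&1`: first point stdout at the file, then point stderr (2) at wherever stdout (1) currently goes. The order of the two redirections matters, as the sketch below shows. The `noisy` function and the file names are invented for this example.

```bash
cd "$(mktemp -d)"    # work in a scratch directory

# noisy is a hypothetical stand-in for a command that writes to both streams.
noisy () { echo "normal output"; echo "error message" >&2; }

# stdout goes to the file, then stderr follows stdout into the same file.
noisy >both_a 2>&1

# Reversed order: stderr is pointed at the terminal (where stdout was at that
# moment), and only afterwards is stdout sent to the file.
noisy 2>&1 >only_out

cat both_a      # contains both lines
cat only_out    # contains just the normal output
```
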
## Appending instead of overwriting

Using a single redirection operator `>` will overwrite the contents of the target file. Let's overwrite the contents of `file_out`.

In [66]:
history >file_out
tail file_out

  147  echo $?
  148  cat file_err_new
  149  echo $?
  150  ls scripts/ nonexistent_file &>file_both
  151  cat file_both
  152  echo $?
  153  ls scripts/ nonexistent_file >file_out
  154  cat file_out
  155  echo $?
  156  history >file_out

You can append to the file (ie add to the end of it) instead of overwriting it by using two redirection operators in sequence `>>`.

In [67]:
ls >>file_out
tail file_out

  154  cat file_out
  155  echo $?
  156  history >file_out
file_both
file_err
file_err_new
file_out
file_out_new
scripts
sequences.txt

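Here is a compact way to see the difference between `>` and `>>` side by side, using a scratch file whose name is invented for this sketch:

```bash
log_file=$(mktemp)

echo "first"  > "$log_file"    # > creates the file or overwrites it
echo "second" > "$log_file"    # "first" is gone now
echo "third" >> "$log_file"    # >> appends to what is already there

cat "$log_file"
# second
# third
```
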
Before we move onto the next section, let's just clean up after ourselves.

In [68]:
echo rm file_*

rm file_both file_err file_err_new file_out file_out_new

In [69]:
rm file_*


# Piping from stdout to stdin

<img src="img/mariopipe.png">

You can use the pipe operator **|** to connect the stdout of one command to the stdin of another.

<img src="img/piping_stdout_stdin.png" width='675px'>

When they aren't given a file argument, many commands (including `tail`) will read their input from stdin instead.

In [70]:
history | tail

  157  tail file_out
  158  echo $?
  159  ls >>file_out
  160  tail file_out
  161  echo $?
  162  echo rm file_*
  163  echo $?
  164  rm file_*
  165  echo $?
  166  history | tail

You can chain a series of pipes to create a _pipeline_!

Let's take the 10 most recent commands from our history and keep only the first 5 of them. In other words, we'll get the 5 commands that came just before the 5 most recent ones.

In [71]:
history | tail | head -n5

  159  ls >>file_out
  160  tail file_out
  161  echo $?
  162  echo rm file_*
  163  echo $?

## Piped commands run in parallel

A common misconception is that commands in a pipeline run sequentially. In fact, they actually run in parallel (at the same time)!

Each command waits until it receives a line of text from its stdin. It processes this line and then writes to its stdout as soon as it can. In this way, the original text is said to be _streamed_ through the pipeline. Instead of waiting for each step to finish before moving onto the next step, we are running the commands at the same time!

You can think of a chain of piped commands as a bunch of trains, each with its own engine, travelling along the same railroad track. Each train can go as fast or as slow as it wants, but none can overtake another.

The ability to _stream_ text makes command pipelines one of the fastest methods to process big data, and thus, a favorite among bioinformaticians.
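
A small experiment makes the streaming behaviour visible. The `yes` command prints `y` forever, so if the steps of a pipeline ran one after another, the first line below would never finish. Because both commands run at the same time, `head` can stop after three lines and the pipeline ends. The second line is a similar sketch with `seq`; the numbers are arbitrary.

```bash
# yes never stops on its own, but head exits after 3 lines;
# yes is then terminated when it tries to write to the closed pipe.
yes | head -n 3

# tail reads the numbers as seq produces them and prints only the last one.
seq 1 100000 | tail -n 1
```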

# Variables

In addition to storing text within files, you can also store text within variables.

Let's try storing the string "hello" inside a variable named `my_variable`.

In [72]:
my_variable=hello


To use the variable within another command, we must prefix the variable name with a dollar-sign `$`.

In [73]:
echo $my_variable

hello

## Concatenating (joining) variables

You can join variables together by just putting them next to each other. Let's create two new variables and join them for the third.

In [74]:
hello_var=hello
world_var=world
my_variable=$hello_var$world_var
echo $my_variable

helloworld

What would it take for us to get a space between "hello" and "world"? First, let's try this and see what happens:

In [75]:
my_variable=hello world

bash: world: command not found

It doesn't work! But why not?

Spaces are important in bash. They're used to separate arguments from each other and delineate strings.

## Use quotes when there are spaces involved

So we just need a way to denote that both words are part of a single string. Quotes to the rescue!

In [76]:
my_variable="hello world"


Returning to our original example, it should be clear why we can just do something like this, now:

In [77]:
my_variable="$hello_var $world_var"
echo $my_variable

hello world

What about if we want to put an underscore between the two variables, instead?

In [78]:
my_variable=$hello_var_$world_var
echo $my_variable

world04h[?2004l
[?2004h

: 1

It doesn't work! But why not?

We're telling bash to look for a variable called `hello_var_`. But since this variable doesn't exist, bash will replace it with an empty string `""`.

## Use braces to avoid ambiguity

So we just need a way to denote where the variable is. The solution is to surround the variable name in braces.

In [79]:
my_variable=${hello_var}_$world_var
echo $my_variable

hello_world

## Use single quotes when you want to be interpreted literally

In [80]:
my_variable='$hello_var $world_var'
echo $my_variable

[?2004h[?2004l$hello_var $world_var
[?2004h

: 1

Within double quotes, expressions with $ are evaluated prior to the command executing. But within single quotes, everything is kept as it is.

Once again, we can prefix the command with an `echo` to see the resolved command before it gets executed.

In [81]:
echo my_variable="$hello_var $world_var"
echo my_variable='$hello_var $world_var'

my_variable=hello world
my_variable=$hello_var $world_var

## Storing the stdout of a command in a variable

You can store the stdout of a command in a variable by wrapping the command in parentheses prefixed with a dollar sign. This is called _command substitution_.
```bash
my_variable=$(cmd)
```

In [82]:
my_variable=$(head -n1 sequences.txt)
echo $my_variable

seqA CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG

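You may have noticed that the tab between `seqA` and its sequence turned into a single space in the output above. That happens because an unquoted `$my_variable` is split on whitespace into separate arguments, which `echo` then joins with single spaces. Wrapping the variable in double quotes keeps its contents intact. The sketch below uses a short made-up string rather than `sequences.txt` so that it runs anywhere.

```bash
# printf turns \t into a tab, so the value is "seqA<tab>CCCT".
my_variable=$(printf 'seqA\tCCCT')

echo $my_variable      # unquoted: split into two words, rejoined with a space
echo "$my_variable"    # quoted: the tab is preserved
```
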
# Working with files

Let's get fancy and actually try to manipulate some of these files.

Note: all of these commands will also read from stdin if they aren't given a file argument.

## `cut`: extract specific columns from a file

```bash
cut -f column_list [path_to_file]...
```
To get a column of the file, provide the column number (ex: 2) to the `-f` flag. By default, `cut` assumes that columns are separated by tabs.

In [83]:
cat sequences.txt

seqA	CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
seqB	GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
seqC	GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
seqE	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqD	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqF	TTAGTTGCAAATAGGCTTACCTAGTAGAGGTCCGACAACCCCTACACGTC
seqG	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqH	CAGAACGCACTTTCCGGCAGTGAATGTCGATGCCAACATTGCCTAAAACA
seqI	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqJ	TGCGGGGCGGCGATATGCGGCAACGATGGCCGCGAGTTAAATGAGCCATA
seqK	CTTTGTGGGGTAATCTAGAAAGCAGAGTACGTATTTGGCGACCCGACATT
seqL	CTTATTTGCTCTCCCGGCTGTAGTGTGCACAGACTTCCAGATGTAATAAG
seqM	CTTTCTCTATTTACCGAGCAATGAAAACCACGGGGAACAATCCACCCTTT
seqN	GTCAAATGGCCGGCCCACTTCTCTCTTCGACACGGCGTCGGGCAGTCTGG
seqO	ATGGTAAGACTAGCCAGCAGATCCTTAATCGGCTTTCCCGAAGGACAGTA

In [84]:
cut -f 2 sequences.txt

CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
TTAGTTGCAAATAGGCTTACCTAGTAGAGGTCCGACAACCCCTACACGTC
CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
CAGAACGCACTTTCCGGCAGTGAATGTCGATGCCAACATTGCCTAAAACA
CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
TGCGGGGCGGCGATATGCGGCAACGATGGCCGCGAGTTAAATGAGCCATA
CTTTGTGGGGTAATCTAGAAAGCAGAGTACGTATTTGGCGACCCGACATT
CTTATTTGCTCTCCCGGCTGTAGTGTGCACAGACTTCCAGATGTAATAAG
CTTTCTCTATTTACCGAGCAATGAAAACCACGGGGAACAATCCACCCTTT
GTCAAATGGCCGGCCCACTTCTCTCTTCGACACGGCGTCGGGCAGTCTGG
ATGGTAAGACTAGCCAGCAGATCCTTAATCGGCTTTCCCGAAGGACAGTA

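For files whose columns are separated by something other than a tab, pass the separator to the `-d` flag. Several columns can be requested at once, either as a comma-separated list or as a range. The comma-separated file below is a made-up example.

```bash
genes_file=$(mktemp)    # hypothetical comma-separated table
printf 'geneA,chr1,100\ngeneB,chr2,200\n' > "$genes_file"

cut -d , -f 2 "$genes_file"      # second column only
cut -d , -f 1,3 "$genes_file"    # first and third columns
cut -d , -f 2-3 "$genes_file"    # a range: columns 2 through 3
```
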
## `tr`: substitution of characters in a file

```bash
tr character_set1 character_set2
```
You can use the `tr` command to replace all instances of a T with a U. Let's turn our DNA sequences into RNA!

Note that the `tr` command doesn't have a file argument. It only accepts input from stdin. We can attach a file to stdin using the `<` operator (rather than `>` for stdout).

In [85]:
tr T U <sequences.txt

seqA	CCCUAACCCCCUAACCCCCUAACCCUCAGUCGGGGAGGCGACAAUAGCUG
seqB	GUCAUAUGUUCUGUACGUUAUUGGCCAACUGAUCAUACCUGAAUCGAGCC
seqC	GAACCGGGAUUAUCAAAGACGAACAUGGUCGGGUCCUUGAACCAAACGAA
seqE	UCUCCGUCCGCUGGCGUGUUUUUCUUUUCUCAAGUGGGCAAGUUACCCGG
seqD	UCUCCGUCCGCUGGCGUGUUUUUCUUUUCUCAAGUGGGCAAGUUACCCGG
seqF	UUAGUUGCAAAUAGGCUUACCUAGUAGAGGUCCGACAACCCCUACACGUC
seqG	CACACGCGGACGAGCAUGAAGCACUCGACUCACCUCGAAUAGUAGGGGGA
seqH	CAGAACGCACUUUCCGGCAGUGAAUGUCGAUGCCAACAUUGCCUAAAACA
seqI	CACACGCGGACGAGCAUGAAGCACUCGACUCACCUCGAAUAGUAGGGGGA
seqJ	UGCGGGGCGGCGAUAUGCGGCAACGAUGGCCGCGAGUUAAAUGAGCCAUA
seqK	CUUUGUGGGGUAAUCUAGAAAGCAGAGUACGUAUUUGGCGACCCGACAUU
seqL	CUUAUUUGCUCUCCCGGCUGUAGUGUGCACAGACUUCCAGAUGUAAUAAG
seqM	CUUUCUCUAUUUACCGAGCAAUGAAAACCACGGGGAACAAUCCACCCUUU
seqN	GUCAAAUGGCCGGCCCACUUCUCUCUUCGACACGGCGUCGGGCAGUCUGG
seqO	AUGGUAAGACUAGCCAGCAGAUCCUUAAUCGGCUUUCCCGAAGGACAGUA

The `tr` command can also replace more than one character at a time. Let's replace each base with its complement.

In [86]:
tr ACGT TGCA <sequences.txt

seqT	GGGATTGGGGGATTGGGGGATTGGGAGTCAGCCCCTCCGCTGTTATCGAC
seqB	CAGTATACAAGACATGCAATAACCGGTTGACTAGTATGGACTTAGCTCGG
seqG	CTTGGCCCTAATAGTTTCTGCTTGTACCAGCCCAGGAACTTGGTTTGCTT
seqE	AGAGGCAGGCGACCGCACAAAAAGAAAAGAGTTCACCCGTTCAATGGGCC
seqD	AGAGGCAGGCGACCGCACAAAAAGAAAAGAGTTCACCCGTTCAATGGGCC
seqF	AATCAACGTTTATCCGAATGGATCATCTCCAGGCTGTTGGGGATGTGCAG
seqC	GTGTGCGCCTGCTCGTACTTCGTGAGCTGAGTGGAGCTTATCATCCCCCT
seqH	GTCTTGCGTGAAAGGCCGTCACTTACAGCTACGGTTGTAACGGATTTTGT
seqI	GTGTGCGCCTGCTCGTACTTCGTGAGCTGAGTGGAGCTTATCATCCCCCT
seqJ	ACGCCCCGCCGCTATACGCCGTTGCTACCGGCGCTCAATTTACTCGGTAT
seqK	GAAACACCCCATTAGATCTTTCGTCTCATGCATAAACCGCTGGGCTGTAA
seqL	GAATAAACGAGAGGGCCGACATCACACGTGTCTGAAGGTCTACATTATTC
seqM	GAAAGAGATAAATGGCTCGTTACTTTTGGTGCCCCTTGTTAGGTGGGAAA
seqN	CAGTTTACCGGCCGGGTGAAGAGAGAAGCTGTGCCGCAGCCCGTCAGACC
seqO	TACCATTCTGATCGGTCGTCTAGGAATTAGCCGAAAGGGCTTCCTGTCAT

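Look closely at the first column of the output above: `tr` translates every character it sees, so `seqA` became `seqT` and `seqC` became `seqG`. One way to leave the names alone, sketched below, is to translate only the sequence column and then glue the two columns back together with `paste` (described later in this notebook). The `-` argument tells `paste` to read that column from stdin. The two-line input and the variable names are made up for the example.

```bash
seq_file=$(mktemp)      # hypothetical stand-in for sequences.txt
names_file=$(mktemp)    # scratch file that holds the untouched first column
printf 'seqA\tACGT\nseqC\tGGAT\n' > "$seq_file"

# Column 1 is saved as it is; only column 2 is complemented.
cut -f 1 "$seq_file" > "$names_file"
cut -f 2 "$seq_file" | tr ACGT TGCA | paste "$names_file" -
```
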
## `rev`: reverse text in a file

```bash
rev [path_to_file]...
```
The `rev` command reverses the characters on each line of a file. Let's use it to obtain the reverse complement of our sequences.

In [87]:
cut -f2 sequences.txt | tr ACGT TGCA | rev

CAGCTATTGTCGCCTCCCCGACTGAGGGTTAGGGGGTTAGGGGGTTAGGG
GGCTCGATTCAGGTATGATCAGTTGGCCAATAACGTACAGAACATATGAC
TTCGTTTGGTTCAAGGACCCGACCATGTTCGTCTTTGATAATCCCGGTTC
CCGGGTAACTTGCCCACTTGAGAAAAGAAAAACACGCCAGCGGACGGAGA
CCGGGTAACTTGCCCACTTGAGAAAAGAAAAACACGCCAGCGGACGGAGA
GACGTGTAGGGGTTGTCGGACCTCTACTAGGTAAGCCTATTTGCAACTAA
TCCCCCTACTATTCGAGGTGAGTCGAGTGCTTCATGCTCGTCCGCGTGTG
TGTTTTAGGCAATGTTGGCATCGACATTCACTGCCGGAAAGTGCGTTCTG
TCCCCCTACTATTCGAGGTGAGTCGAGTGCTTCATGCTCGTCCGCGTGTG
TATGGCTCATTTAACTCGCGGCCATCGTTGCCGCATATCGCCGCCCCGCA
AATGTCGGGTCGCCAAATACGTACTCTGCTTTCTAGATTACCCCACAAAG
CTTATTACATCTGGAAGTCTGTGCACACTACAGCCGGGAGAGCAAATAAG
AAAGGGTGGATTGTTCCCCGTGGTTTTCATTGCTCGGTAAATAGAGAAAG
CCAGACTGCCCGACGCCGTGTCGAAGAGAGAAGTGGGCCGGCCATTTGAC
TACTGTCCTTCGGGAAAGCCGATTAAGGATCTGCTGGCTAGTCTTACCAT

## `paste`: merge files horizontally

```bash
paste [path_to_file]...
```
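The `paste` command merges files side by side: line 1 of each file becomes row 1 of the output, line 2 becomes row 2, and so on, with the pieces separated by tabs. It's most useful when each file contains a table with the same number **n** of lines (rows). If one file runs out of lines first, `paste` leaves its column empty, as you can see in the last three rows below.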

In [88]:
paste .you_found_me.txt sequences.txt

1	seqA	CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
2	seqB	GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3	seqC	GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
4	seqE	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
5	seqD	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
6	seqF	TTAGTTGCAAATAGGCTTACCTAGTAGAGGTCCGACAACCCCTACACGTC
7	seqG	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
8	seqH	CAGAACGCACTTTCCGGCAGTGAATGTCGATGCCAACATTGCCTAAAACA
9	seqI	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
10	seqJ	TGCGGGGCGGCGATATGCGGCAACGATGGCCGCGAGTTAAATGAGCCATA
11	seqK	CTTTGTGGGGTAATCTAGAAAGCAGAGTACGTATTTGGCGACCCGACATT
12	seqL	CTTATTTGCTCTCCCGGCTGTAGTGTGCACAGACTTCCAGATGTAATAAG
13	seqM	CTTTCTCTATTTACCGAGCAATGAAAACCACGGGGAACAATCCACCCTTT
14	seqN	GTCAAATGGCCGGCCCACTTCTCTCTTCGACACGGCGTCGGGCAGTCTGG
15	seqO	ATGGTAAGACTAGCCAGCAGATCCTTAATCGGCTTTCCCGAAGGACAGTA
I am secret text from a hidden file	
Here's chr21:	
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr21.fa.gz	
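
Because `paste` glues files together column by column, it can also reattach labels that an earlier step dropped. The sketch below is one possible way to build a labeled reverse-complement table, not the only one. The file names `demo.txt`, `labels.txt`, and `revcomp.txt` and the two short sequences are made up for illustration.

```bash
# Work in a throwaway directory so nothing in bash_playground is touched.
workdir=$(mktemp -d)

# A tiny two-column, tab-separated table (hypothetical data).
printf 'seqX\tAACG\nseqY\tTTGC\n' > "$workdir/demo.txt"

# Column 1: the labels, untouched.
cut -f1 "$workdir/demo.txt" > "$workdir/labels.txt"

# Column 2: complement each base, then reverse each line.
cut -f2 "$workdir/demo.txt" | tr ACGT TGCA | rev > "$workdir/revcomp.txt"

# Glue the two single-column files back together, line by line.
revcomp_table=$(paste "$workdir/labels.txt" "$workdir/revcomp.txt")
echo "$revcomp_table"
```

This should print `seqX` next to `CGTT` and `seqY` next to `GCAA`, separated by tabs.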

## `sort`: sort the lines in a file

```bash
sort [path_to_file]...
```
The `sort` command will sort the lines of your file. By default, this is done alphabetically.

In [89]:
head -n 5 sequences.txt

seqA	CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
seqB	GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
seqC	GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
seqE	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqD	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

In [90]:
sort sequences.txt | head -n 5

seqA	CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
seqB	GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
seqC	GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
seqD	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqE	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

We can also sort the file by specific columns using the `-k` flag.

Let's sort by the second column.

In [91]:
sort -k 2 sequences.txt

seqO	ATGGTAAGACTAGCCAGCAGATCCTTAATCGGCTTTCCCGAAGGACAGTA
seqG	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqI	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqH	CAGAACGCACTTTCCGGCAGTGAATGTCGATGCCAACATTGCCTAAAACA
seqA	CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
seqL	CTTATTTGCTCTCCCGGCTGTAGTGTGCACAGACTTCCAGATGTAATAAG
seqM	CTTTCTCTATTTACCGAGCAATGAAAACCACGGGGAACAATCCACCCTTT
seqK	CTTTGTGGGGTAATCTAGAAAGCAGAGTACGTATTTGGCGACCCGACATT
seqC	GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
seqN	GTCAAATGGCCGGCCCACTTCTCTCTTCGACACGGCGTCGGGCAGTCTGG
seqB	GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
seqD	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqE	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqJ	TGCGGGGCGGCGATATGCGGCAACGATGGCCGCGAGTTAAATGAGCCATA
seqF	TTAGTTGCAAATAGGCTTACCTAGTAGAGGTCCGACAACCCCTACACGTC
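
One subtlety: `-k 2` sorts on everything from the second column to the end of the line. To sort on the second column alone, give both a start and an end field, as in `-k2,2`. The sketch below uses a made-up three-line file (`demo.txt` is hypothetical) and also sets the column separator explicitly to a tab with `-t`.

```bash
workdir=$(mktemp -d)
printf 'seqB\tGTCA\nseqA\tCCCT\nseqC\tGAAC\n' > "$workdir/demo.txt"

# Store a literal tab character to use as the column separator.
tab=$(printf '\t')

# Sort on column 2 only (start at field 2, stop at field 2).
by_sequence=$(sort -t "$tab" -k2,2 "$workdir/demo.txt")
echo "$by_sequence"

# -r reverses the order; with no -k, whole lines are compared.
by_label_desc=$(sort -r "$workdir/demo.txt")
echo "$by_label_desc"
```

The first command should list `seqA`, `seqC`, `seqB` (sequences in alphabetical order); the second should list `seqC`, `seqB`, `seqA`.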

## `uniq`: filter out duplicate lines in a file

```bash
uniq [path_to_file_input [path_to_file_output]]
```
You can use the `uniq` command to filter out duplicated lines or count the number of unique lines.

Note that `uniq` only removes duplicates that sit on adjacent lines, so the text provided to it must be sorted first! Otherwise, duplicates that are far apart will slip through.

Let's find unique sequences in the `sequences.txt` file.

In [92]:
head -n 5 sequences.txt | cut -f2 | sort

CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

In [93]:
head -n 5 sequences.txt | cut -f2 | sort | uniq

CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
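
Two flags make `uniq` more informative: `-c` prefixes each line with the number of times it occurred, and `-d` prints only the lines that were repeated. The sketch below, on made-up data (`demo.txt` is hypothetical), also shows what goes wrong when the input is not sorted.

```bash
workdir=$(mktemp -d)
printf 'GATTACA\nCCCGGG\nGATTACA\n' > "$workdir/demo.txt"

# Unsorted: the two GATTACA lines are not adjacent, so nothing is removed.
n_unsorted=$(uniq "$workdir/demo.txt" | wc -l)

# Sorted first: the duplicates become neighbors and are collapsed.
n_sorted=$(sort "$workdir/demo.txt" | uniq | wc -l)

echo "without sort: $n_unsorted lines; with sort: $n_sorted lines"

# -c prefixes each distinct line with how many times it occurred.
sort "$workdir/demo.txt" | uniq -c

# -d prints only the lines that occurred more than once.
duplicated=$(sort "$workdir/demo.txt" | uniq -d)
echo "$duplicated"
```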

I bet you didn't even notice the last two sequences were duplicates until now ;P

## `grep`: search for lines containing a keyword

```bash
grep pattern [path_to_file]...
```
To search for lines in the file containing a specific pattern, you can use `grep`.

Let's look for all sequences with "TAAC" in them.

In [94]:
grep "TAAC" sequences.txt

seqA	CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG

All of these commands read from stdin and `grep` is no exception. For example, we can search through our bash history for the `echo` command.

In [95]:
history | grep echo

    4  display () {     TMPFILE=$(mktemp ${TMPDIR-/tmp}/bash_kernel.XXXXXXXXXX);     cat > $TMPFILE;     echo "bash_kernel: saved image data to: $TMPFILE" >&2; }
    7  display () {     TMPFILE=$(mktemp ${TMPDIR-/tmp}/bash_kernel.XXXXXXXXXX);     cat > $TMPFILE;     echo "bash_kernel: saved image data to: $TMPFILE" >&2; }
    9  echo $?
   11  echo $?
   13  echo $?
   15  echo $?
   17  echo $?
   19  echo $?
   22  echo $?
   24  echo $?
   26  echo $?
   28  echo $?
   30  echo $?
   32  echo $?
   34  echo $?
   36  echo $?
   37  echo ls ~
   38  echo $?
   40  echo $?
   42  echo $?
   44  echo $?
   45  echo ls scripts/*.sh
   46  echo $?
   48  echo $?
   50  echo $?
   52  echo $?
   55  echo $?
   58  echo $?
   60  echo $?
   62  echo $?
   64  echo $?
   66  echo $?
   69  echo $?
   71  echo $?
   74  echo $?
   75  echo rm shiny_new/*.txt
   76  echo $?
   79  echo $?
   82  echo $?
   85  echo $?
   87  echo $?
   90  echo $?
   92  echo $?
   95  echo $?
   98  echo $?


When you use the `-E` flag, `grep` will interpret the pattern as an _extended regex (aka regular expression)_. Regexes can get quite complicated, so we don't have the time to dive into them completely.

But to give you an idea of how powerful this can be, let's try to grab sequences C through F, or just sequences B and D.

In [96]:
grep -E 'seq[C-F]' sequences.txt

seqC	GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
seqE	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqD	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqF	TTAGTTGCAAATAGGCTTACCTAGTAGAGGTCCGACAACCCCTACACGTC

In [97]:
grep -E 'seq(B|D)' sequences.txt

seqB	GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
seqD	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
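
A few more flags round out everyday `grep` use: `-c` counts matching lines instead of printing them, `-v` inverts the match, and a `^` in a regex anchors the pattern to the start of the line. The sketch below runs on a made-up file (`demo.txt` is hypothetical).

```bash
workdir=$(mktemp -d)
printf 'seqA\tCCCTAAC\nseqB\tGTCATAT\nseqC\tGAACCGG\n' > "$workdir/demo.txt"

# Count the lines that start with seqA or seqB.
n_matches=$(grep -c -E '^seq[A-B]' "$workdir/demo.txt")
echo "$n_matches"

# Print the labels of every line that does NOT contain seqB.
not_b=$(grep -v 'seqB' "$workdir/demo.txt" | cut -f1)
echo "$not_b"
```

Here the count should be 2, and the second command should print `seqA` and `seqC`.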

## `sed`: scan through a file and make edits

```bash
sed script_pattern [path_to_file]...
```
The `tr` command made single-character substitutions, but what if you want to replace entire words or sentences? That's one of the most common uses for the `sed` command.

Let's replace "seqC" with the label "sample1". The first argument to `sed` is a special pattern composed of the letter **s**, the search keyword, then the replacement keyword, where each is separated by some character -- in this case, we use a slash `/`.

In [98]:
head -n 5 sequences.txt | sed 's/seqC/sample1/'

seqA	CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
seqB	GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
sample1	GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
seqE	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqD	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
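
By default, `s/search/replace/` changes only the first match on each line. Appending the `g` (global) flag replaces every match. A small sketch on a made-up string:

```bash
# Without g: only the first T on the line becomes U.
first_only=$(echo "ATTGT" | sed 's/T/U/')
echo "$first_only"     # AUTGT

# With g: every T on the line becomes U.
all_matches=$(echo "ATTGT" | sed 's/T/U/g')
echo "$all_matches"    # AUUGU
```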

## `awk`: scan through a file and extract text
```bash
awk program-text [path_to_file]...
```
`awk` is a much more powerful command than many of the others. It implements its own programming language which can be used to do any number of tasks, including most of those of the other commands we just learned.

We don't have time to discuss `awk`, but there are tons of great tutorials online you can peruse.
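
To give just a taste, here is a small sketch (the file `demo.txt` and its contents are made up). An `awk` program runs once per line; inside it, `$1` and `$2` refer to the first and second columns of that line, and `length()` is a built-in function that returns the number of characters in a string.

```bash
workdir=$(mktemp -d)
printf 'seqA\tCCCTAA\nseqB\tGTCA\n' > "$workdir/demo.txt"

# -F '\t' tells awk that columns are separated by tabs.
# For each line, print the label followed by the length of the sequence.
lengths=$(awk -F '\t' '{ print $1, length($2) }' "$workdir/demo.txt")
echo "$lengths"
```

This should print `seqA 6` and `seqB 4`.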

## `vim`: terminal-based text editor
```bash
vim [path_to_file]...
```
`vim` is a fully-featured [text editor](https://learntocodewith.me/resources/text-editors/) that runs from within the terminal. You can learn how to use it at [openvim.com](https://openvim.com).

## `less`: view large files and text one page at a time
```bash
less [path_to_file]...
```
You can use `less` to view large files one page at a time. This proves to be incredibly useful in bioinformatics, where most files are very, very large.

Let's take a look at the `sequences.txt` file with `less`.
```bash
less sequences.txt
```

This will open a separate viewing window. You can quit out of it and return to the terminal prompt by pressing **q**.

Now, let's take a look at our bash history. We can pipe into `less` for this.
```bash
history | less
```

You can use the arrow keys to scroll in `less`.

### searching through the text
The `less` command has a vast array of incredibly helpful keyboard shortcuts. We don't have time to go into all of them, but one of the most useful is the ability to search within the text.

To search down through the text...
1. Press the forward-slash `/`
2. Type your search keyword
3. Press _enter_

To search up from your current position, use a question-mark `?` instead of a forward-slash.

**You try**: Search for `ls` commands within `history`.

# Exercise 2

**Try this:**
Retrieve just the second column of the third line (seqC) from the `sequences.txt` file using a combination of `tail`, `head`, and `cut`. Use pipes!

In [99]:
head -n 3 sequences.txt | tail -n 1 | cut -f 2

GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA

# Got long-running commands? Use job control.

**Scenario:** You're about to execute a command that will take a long time.

**Solution:** Send the command to the background by appending an ampersand `&`. Keep working while it runs.

We'll use the `sleep` command to illustrate how this works, but you could use any command in theory.

## `sleep`: do nothing for n seconds
```bash
sleep num_seconds
```
Running `sleep 3` will cause `sleep` to execute for 3 seconds, doing absolutely nothing in the meantime.

In [100]:
sleep 3


Let's try sending `sleep` to the background while it runs.
```bash
sleep 7 &
```
Now, we can continue working while it executes.
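
Two extras are handy when backgrounding inside a script: the special variable `$!` holds the PID of the most recently backgrounded command, and `wait` pauses until that command finishes. A minimal sketch:

```bash
# Start a short job in the background and remember its PID.
sleep 2 &
bg_pid=$!
echo "sleep is running in the background with PID $bg_pid"

# The prompt (or script) is free to do other things in the meantime.
echo "...meanwhile, we can keep working"

# Block until the background job is done, then record its exit code.
wait "$bg_pid"
bg_status=$?
echo "background job finished with exit code $bg_status"
```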

## backgrounding a job while it is running

I'm generally not put-together enough to anticipate when I need to background a job. So is there a way to background a job after you've already started executing it?

Yup!
1. Press ctrl+z to temporarily pause the command
2. Send it to the background using the `bg` command

For example,
1. `sleep 120`
2. _ctrl+z_
3. `bg`

To bring it back to the foreground, use the `fg` command after backgrounding it.

## ending a command prematurely

You can also permanently stop a command, if you need to. Usually, if your command isn't backgrounded, you can do this by pressing _ctrl+c_.

For example,
1. `sleep 120`
2. _ctrl+c_

Otherwise, if your command is backgrounded or _ctrl+c_ isn't working for whatever reason, you can use the `kill` command.

There are two ways of doing this.

### `kill <PID>`
Every running command has a process identifier (PID), and `kill` followed by a PID stops that process. To list all processes and their PIDs, use the `ps` command. We use the `-u` option to retrieve only those processes that our **u**ser owns.
```bash
ps -u $(whoami)
```

### `kill <job ID>`
Every backgrounded command also has a job identifier, and `kill` accepts it when it is prefixed with a percent sign (for example, `kill %1`). To list all jobs and their IDs, use the `jobs` command.
```bash
jobs
```
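
Neither subsection spells out a full `kill` invocation, so here is a minimal sketch. At an interactive prompt you would look up the number with `ps` or `jobs` and then run `kill <PID>` or `kill %<job ID>`. Inside a script it is usually easier to grab the PID from the special variable `$!`, which is what the code below does.

```bash
# Start a long-running job in the background and remember its PID.
sleep 60 &
pid=$!

# Ask the process to stop (kill sends the TERM signal by default).
kill "$pid"

# Collect the exit code; a process stopped by TERM reports 143 (128 + 15).
status=0
wait "$pid" 2>/dev/null || status=$?
echo "sleep exited with code $status"
```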

## how to step away from your terminal for a while

Backgrounded jobs will usually be killed when you close your terminal window or your session ends. You can prepend `nohup` to your command to ensure that it keeps running after the terminal is gone.
```bash
nohup sleep 7 &
```

### run your commands within `screen` sessions!

Unfortunately, `nohup` can have some unintended side effects (by default, it redirects stdout/stderr to a file called `nohup.out`) and must be prepended before every command. A popular alternative is to use `screen` sessions.

1. First, create and enter a `screen` session. Let's call it `test`
```bash
screen -S test
```
2. Done! Now, you can just run your commands normally!
3. If you'd like to step away from your computer, you can just detach from the `screen` session while commands within it are running.

    Just press **ctrl + a** and then the letter **d**.
    
    **Note:** You will be detached automatically if your computer turns off or your `ssh` session dies because of a choppy internet connection.
4. When you return to your computer or restart your `ssh` session, just reattach to the `screen` session and pick up where you left off.
```bash
screen -r test
```

At any point, you can view your existing `screen` sessions.
```bash
screen -ls
```
To terminate a `screen` session, attach to it and then just exit normally.
```bash
exit
```

# Working with larger files
Bioinformaticians frequently keep files compressed so they don't take up too much space.

## `gzip`: compressing a file

```bash
gzip [path_to_file]...
```
You can compress a file using the `gzip` command.

In [101]:
gzip sequences.txt


Notice how the file has a `.gz` extension now?

In [102]:
ls

scripts  sequences.txt.gz

And if we try to take a look inside, it'll just look like gibberish.

In [103]:
head sequences.txt.gz

�^�c sequences.txt ��ArB1C��V�ۺPJ7�`��z
�@��sq�?����M����h��t�Ѓz(��*�O�*;�Q���W� _e�])��1���9To�'U���(���w|�ﴦ3��ld(�F��"\��9���z���<@!�Ƞ�:Z�����rTd؆qp[\@ۂ�®	�/|` ���d��o��:�ğz��7��{���x�Ϟ�����I�i-���{�I���|61NH  [?2004h

### decompress a file
Let's revert the file back to the way it was. We can use the `-d` switch to activate decompression.

In [104]:
gzip -d sequences.txt.gz
ls

scripts  sequences.txt

In [105]:
head sequences.txt

seqA	CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
seqB	GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
seqC	GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
seqE	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqD	TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
seqF	TTAGTTGCAAATAGGCTTACCTAGTAGAGGTCCGACAACCCCTACACGTC
seqG	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqH	CAGAACGCACTTTCCGGCAGTGAATGTCGATGCCAACATTGCCTAAAACA
seqI	CACACGCGGACGAGCATGAAGCACTCGACTCACCTCGAATAGTAGGGGGA
seqJ	TGCGGGGCGGCGATATGCGGCAACGATGGCCGCGAGTTAAATGAGCCATA
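
Note that `gzip` replaces the original file with the compressed one, and `gzip -d` does the reverse. If you would rather keep both, one option is the `-c` flag, which writes the result to stdout so that you can redirect it to a new file. A small sketch with a made-up file (`demo.txt` is hypothetical):

```bash
workdir=$(mktemp -d)
printf 'seqA\tCCCTAA\n' > "$workdir/demo.txt"

# Compress to a new file; demo.txt itself is left in place.
gzip -c "$workdir/demo.txt" > "$workdir/demo.txt.gz"
ls "$workdir"

# Decompress to stdout to check that the contents survived the round trip.
roundtrip=$(gzip -dc "$workdir/demo.txt.gz")
echo "$roundtrip"
```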

Ok, now that we know how to deal with small files, let's level-up to "BIG DATA"!

## `wget`: downloading a file from a URL

```bash
wget url...
```
For example, let's download chr21 from the hg38 reference genome:

https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr21.fa.gz (also at https://bit.ly/3zjmX7L)

Recall that the URL for chr21 is stored at the bottom of the hidden file.

In [106]:
echo $(tail -n1 .you_found_me.txt)

https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr21.fa.gz

In [107]:
wget $(tail -n1 .you_found_me.txt)

--2023-01-10 05:46:22--  https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr21.fa.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12709705 (12M) [application/x-gzip]
Saving to: ‘chr21.fa.gz’


2023-01-10 05:46:23 (24.2 MB/s) - ‘chr21.fa.gz’ saved [12709705/12709705]


Check: do we have it?

In [108]:
ls 

chr21.fa.gz  scripts  sequences.txt

## `du`: print a file's **d**isk **u**sage
```bash
du path_to_file...
```
In addition to `ls`, we can use `du` to check how big a file is.

How big is the file we downloaded compared to the sequences file? We can use the `-h` flag to print the size in human-readable format.

In [109]:
du -h sequences.txt chr21.fa.gz

9.0K	sequences.txt
1.0K	chr21.fa.gz

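According to `wget`, `chr21.fa.gz` is 12,709,705 bytes (about 12M), which is more than a thousand times larger than `sequences.txt`! Note that `du` reports the disk space allocated to a file rather than its length, so its output can be misleading, particularly on network file systems; the `1.0K` shown for `chr21.fa.gz` above does not reflect the real size of the download.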
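
To get a file's exact length in bytes, independent of how the file system stores it, one option is `wc -c`. A sketch on a made-up file (`demo.txt` is hypothetical):

```bash
workdir=$(mktemp -d)
printf 'ACGT\n' > "$workdir/demo.txt"

# wc -c counts bytes: four letters plus the newline.
n_bytes=$(wc -c < "$workdir/demo.txt")
echo "demo.txt is" $n_bytes "bytes long"

# du, by contrast, reports the space the file occupies on disk.
du -h "$workdir/demo.txt"
```

Here `wc -c` should report 5 bytes, while the number from `du` depends on the file system.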

## `zcat`: output decompressed text from a gzipped file

```bash
zcat [path_to_file]...
```
We want to take a look at the first ten lines of this file, but it's compressed. Rather than uncompress it all and waste storage space, let's see if we can obtain what we want using the `zcat` command.

In [110]:
zcat chr21.fa.gz | head

>chr21
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

gzip: stdout: Broken pipe

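The first lines of chr21 are runs of `N`, which represent unsequenced heterochromatin. (The `Broken pipe` message is harmless: `head` exits after printing ten lines while `gzip` is still writing.) Let's see if we can find real sequences by opening the file with the `zless` command....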

## `zless`: `less` but for gzipped files

```bash
zless [path_to_file]...
```
Let's open this file in `zless` and then search for a nucleotide like a **g**.

```bash
zless chr21.fa.gz
```
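
If scrolling gets tedious, another option is to let `grep` do the looking. The sketch below builds a tiny gzipped FASTA file (`demo.fa.gz` and its contents are made up) and then prints the line number and text of the first sequence line that contains something other than `N`.

```bash
workdir=$(mktemp -d)
printf '>chrDemo\nNNNN\nNNNN\nNNgatc\nACGT\n' | gzip -c > "$workdir/demo.fa.gz"

# -v drops header lines (starting with >) and lines made up only of N;
# -n prefixes each remaining line with its line number;
# head keeps just the first hit.
first_real=$(zcat "$workdir/demo.fa.gz" | grep -n -v -E '^(>|N*$)' | head -n 1)
echo "$first_real"    # 4:NNgatc
```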

Before we finish this section, let's just clean up after ourselves.

In [111]:
rm chr21.fa.gz
ls

scripts  sequences.txt

# Logic control
Bash has for loops and if statements, just like Python and R!

## for loops

A for loop is written like this:
```bash
for my_var in <space_separated_list>; do
    cmd "$my_var"
done
```
A variable, in this case called `my_var`, is set to each item in the list, one at a time. The items are specified as a space-separated list. Inside the body of the loop, we can run a `cmd` using the value of each item.

In [112]:
for name in Dolly Kitty Poppet; do
    echo Hello "$name!";
done;

Hello Dolly!
Hello Kitty!
Hello Poppet!

As an example, let's iterate over all of the scripts in the `scripts/` directory and print out their paths. Recall that expressions with wildcards like the asterisk `*` get expanded to a space-separated list prior to command execution.

In [113]:
for file in scripts/*.sh; do
    echo "check out this file called $file"
done

check out this file called scripts/hello.sh
check out this file called scripts/hello_to_all_of_you.sh
check out this file called scripts/hello_you.sh
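
The list does not have to be typed out or come from a wildcard. It can also be produced by another command via `$(...)`. The sketch below loops over the labels in the first column of a made-up file (`demo.txt` is hypothetical); this works here because the labels contain no spaces.

```bash
workdir=$(mktemp -d)
printf 'seqA\tCCCT\nseqB\tGTCA\n' > "$workdir/demo.txt"

# The output of cut is split on whitespace into the items of the loop.
loop_output=$(
    for label in $(cut -f1 "$workdir/demo.txt"); do
        echo "processing $label"
    done
)
echo "$loop_output"
```

This should print `processing seqA` followed by `processing seqB`.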

## if statements
An if-statement is written like this:
```bash
if [ <logical_statement> ]; then
    cmd
fi
```
The logical statement is something that we can test for.

Here's an example. First, we make a variable called `you` and then we test whether `$you` is _not_ equal to "bored" (that's what `!=` means).

In [114]:
you=excited
if [ "$you" != bored ]; then
    echo "okay onto the next adventure!"
fi

okay onto the next adventure!
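
An if-statement can also have an `else` branch, and the square brackets support file tests such as `-f`, which is true when the path exists and is a regular file. A minimal sketch; `demo.txt` is a made-up file name:

```bash
workdir=$(mktemp -d)
touch "$workdir/demo.txt"    # create an empty file to test for

if [ -f "$workdir/demo.txt" ]; then
    message="demo.txt exists"
else
    message="demo.txt is missing"
fi
echo "$message"
```

Since the file was just created, this should print `demo.txt exists`.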

# Bash: also a programming language

What is _bash_ anyways?
1. A shell programming language used within terminals
2. An application that reads and interprets all of the commands (i.e. code) that you type into the terminal

**Takeaway**: You can write scripts composed from all of the commands we learned!

We've written a couple of scripts for you, already. Why don't you take a look?

In [115]:
cat scripts/hello.sh

#!/usr/bin/env bash

# this is a comment; it will be ignored

echo "Hello world!"

## What is a shebang?

The first line of `hello.sh` is called a _shebang_. It names the program that should be used to execute your script.

In this case, `/usr/bin/env bash` tells the system to look up `bash` on your `$PATH` and run the script with it, since we're writing our script in the `bash` programming language. But we could have used any interpreter, like `python`, for example.

## Executing a script

To execute the script, just type the path to it.

In [117]:
scripts/hello.sh

Hello world!

## Passing arguments to a script

Scripts are, in fact, commands themselves! Within a script, `$1` refers to the first argument to the script, `$2` refers to the second argument, etc...

We demonstrate this in `hello_you.sh`.

In [118]:
cat scripts/hello_you.sh

#!/usr/bin/env bash

echo "Hello $1"!

In [119]:
scripts/hello_you.sh me

Hello me!

## Calling scripts from within other scripts 

In [120]:
cat scripts/hello_to_all_of_you.sh

#!/usr/bin/env bash

for name in "Dolly" "Kitty" "Poppet"; do
    scripts/hello_you.sh $name
done;

In [121]:
scripts/hello_to_all_of_you.sh

Hello Dolly!
Hello Kitty!
Hello Poppet!
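
The names in `hello_to_all_of_you.sh` are hard-coded. One possible variation, sketched below, greets whichever names are passed on the command line instead: inside a script, `"$@"` expands to all of the arguments. The script name `hello_everyone.sh` is hypothetical and not part of the course materials, and the sketch runs it with `bash <script>` so that no extra setup is needed.

```bash
workdir=$(mktemp -d)

# Write the hypothetical script into a throwaway directory.
cat > "$workdir/hello_everyone.sh" <<'EOF'
#!/usr/bin/env bash

# greet every name given as an argument
for name in "$@"; do
    echo "Hello $name!"
done
EOF

# Run it with two arguments and capture what it prints.
greeting=$(bash "$workdir/hello_everyone.sh" Dolly Kitty)
echo "$greeting"
```

This should print `Hello Dolly!` and `Hello Kitty!`.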

# Create your own script!

Try to create your own script.
1. Navigate to the `module-1-programming/bash_playground/scripts/` directory in DataHub, and create a new file called `my_hello.sh`.
2. Copy the shebang from our other scripts and add your own command.
3. Every time you want to convert a regular file into a script, you'll first need to designate the file as _executable_. You should do this now using the `chmod` command:

    ```bash
    chmod u+x scripts/my_hello.sh
    ```
4. Execute your script.

    ```bash
    scripts/my_hello.sh
    ```

# Making a script globally available using the special `$PATH` environment variable

I don’t want to have to type out the full path to `hello_you.sh` every time.

How can I make it so that I can execute `hello_you.sh` from anywhere by just typing `hello_you.sh`?

`$PATH` is a special _environment_ variable that is created by _bash_ on startup. It contains a colon-separated list of directories.

Whenever you type a command, _bash_ will look through these directories to find an executable file that matches it.

What does `$PATH` look like right now?

In [122]:
echo $PATH

/opt/conda/condabin:/opt/k8s-support/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/rstudio-server/bin

Notice how `/usr/bin` is listed in `$PATH`? Try this:
```bash
ls /usr/bin | less
```
Do you recognize any of those?! They're the commands we've been learning about today!

_All of the commands we've been using are just executable files in a folder on the filesystem somewhere!_

Can we make our own scripts into commands?

Let's add the `scripts/` directory to the end of `$PATH`. Note that we can obtain the absolute path to the `scripts/` directory using `pwd`:
```bash
echo "$(pwd)/scripts"
```

In [123]:
PATH="$PATH:$(pwd)/scripts"
echo "$PATH"

/opt/conda/condabin:/opt/k8s-support/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/rstudio-server/bin:/home/amassara/module-1-programming/bash_playground/scripts

And now, we should be able to execute our scripts as commands from anywhere :D

In [124]:
hello.sh

Hello world!
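
To check which file bash will actually run for a given command name, you can ask with `command -v`. Keep in mind that a `$PATH` change made this way lasts only for the current shell session. The sketch below uses a throwaway directory and a hypothetical script called `my_demo_cmd.sh`.

```bash
workdir=$(mktemp -d)

# Create a small executable script in the throwaway directory.
printf '#!/usr/bin/env bash\necho "Hello from PATH!"\n' > "$workdir/my_demo_cmd.sh"
chmod u+x "$workdir/my_demo_cmd.sh"

# Append the directory to PATH (for this session only).
PATH="$PATH:$workdir"

# Ask bash where it finds the command; this prints the full path to the script.
resolved=$(command -v my_demo_cmd.sh)
echo "$resolved"
```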

# We just scratched the surface! How can I learn more?

1. [The missing semester of your CS education: an open MIT course](https://missing.csail.mit.edu/)
2. [A curated list of resources for terminal-using bioinformaticians](https://docs.google.com/document/d/1cOjKC4OI4rsQUl_68PPHCn8vO11lwhpSWyJrGDZT80Q)
3. [Software carpentry](https://swcarpentry.github.io/shell-novice/)
4. [explainshell.com](http://explainshell.com) and help/manual pages
5. It’s dry reading, but [the GNU bash manual](https://www.gnu.org/software/bash/manual/bash.html) is really quite good
6. Other useful topics for bioinformaticians that we weren't able to cover today
	- git
	- ssh
	- sshfs
	- conda and conda environments
	- Snakemake
	- the screen command
	- process substitution
	- exit codes
	- functions
	- aliases
	- bash startup
	- using a computing cluster
	- regexes
	- manipulating strings
	- shell expansions
	- the diff and comm commands
	- file and directory permissions
	- the find and xargs commands
	- arrays
	- shell arithmetic
	- while loops
	- signals
	- the top command
	- the datamash tool
	- the /dev special files