Introduction to Command-line Unix

The goal of this lab is to become familiar with operating in the Unix command-line interface.

Most bioinformatics tools are written for the Unix environment, so being comfortable on the command line is a useful skill and will be necessary to complete most of the upcoming labs.

1. Navigating the Directory Structure

The Unix directory structure can be visualized as a tree. Each level of the tree holds directories (folders) and files, and each directory can contain zero or more nested directories and files. Here is a simplified example from our cloud VMs:

Let's launch the VMs: Link to VMs
Username: student
Password: biobakery

1.1 Getting the Working Directory

It is important to know where we are when navigating the Unix directory structure. The current directory you operate out of is called the Working Directory and we can print this out by using the pwd command:

~/ $ pwd 
/home/hutlab_public

Note: '~' is shorthand for your home directory. In this case, the home directory is '/home/hutlab_public'.

1.2 Moving Between Directories

Navigation is primarily handled by the Change Directory command cd:

~/ $ cd Tutorials/
Tutorials/ $ pwd
/home/hutlab_public/Tutorials

It is also possible to traverse multiple directories in one cd command:

Tutorials/ $ cd /home/hutlab_public/Tutorials/kneaddata/input/
input/ $ pwd
/home/hutlab_public/Tutorials/kneaddata/input

Q: How could we have written the first command in short-hand?
Q: In this environment, how can we tell what our current location is?

We can navigate backwards up the Unix directory tree by using the special characters .. (the parent directory) in conjunction with cd:

input/ $ pwd
/home/hutlab_public/Tutorials/kneaddata/input
input/ $ cd ..
kneaddata/ $ pwd
/home/hutlab_public/Tutorials/kneaddata

Note that this can be chained multiple times to navigate backwards several levels in the directory tree:

kneaddata/ $ cd input/
input/ $ cd ../..
Tutorials/ $ pwd 
/home/hutlab_public/Tutorials

From any location we can return to our home directory by executing the cd command alone:

Tutorials/ $ cd
~/ $ pwd
/home/hutlab_public

or by changing directory to the ~ character which is short-hand for our home directory:

~/ $ cd Tutorials/
Tutorials/ $ cd ~
~/ $ pwd
/home/hutlab_public
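
The navigation commands in this section can be combined in a short sketch. It builds a throwaway directory tree with mktemp so that it can be run anywhere without touching real files; the names sandbox, project, and input are assumptions made for illustration and are not paths on the VM.

```shell
# Build a throwaway tree: $sandbox/project/input (hypothetical names)
sandbox=$(mktemp -d)
mkdir -p "$sandbox/project/input"

cd "$sandbox/project/input"   # absolute path: starts with /
here=$(pwd)                   # capture the working directory
cd ../..                      # relative path: up two levels
up_two=$(pwd)
cd project/input              # relative path: back down two levels
back=$(pwd)

echo "$here"
echo "$up_two"
echo "$back"
```

A path that starts with / is resolved from the top of the tree, while any other path is resolved from the working directory, which is why the same cd command can succeed in one location and fail in another.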

Exercise #1: Find the location of the input files from yesterday's MetaPhlAn tutorial using the 'cd' and 'pwd' commands - we will learn about listing the contents of the directory in the next section.

1.3 Listing Files

Now that we know how to navigate around the Unix file structure we will want to see what files/directories are present in our working directory. This is accomplished using the ls command:

~/ $ cd
~/ $ pwd
/home/hutlab_public
~/ $ ls
Desktop  Documents  Downloads  Music  Pictures  Public  Templates  Tutorials  Videos  biobakery_workflow_databases

We can see that there are multiple directories under the /home/hutlab_public/ location, but not much more information is returned. We can get more detail by passing options to the ls command that change the output it produces.

We can use the -l option to request a long list of the contents of a directory:

~/ $ pwd
/home/hutlab_public
~/ $ ls -l 
total 40
drwxr-xr-x  3 hutlab_public hutlab_public 4096 May 28  2020 Desktop
drwxr-xr-x  2 hutlab_public hutlab_public 4096 May 14  2020 Documents
drwxr-xr-x  2 hutlab_public hutlab_public 4096 May 17 13:25 Downloads
drwxr-xr-x  2 hutlab_public hutlab_public 4096 May 14  2020 Music
drwxrwxr-x  2 hutlab_public hutlab_public 4096 May 14  2020 Pictures
drwxr-xr-x  2 hutlab_public hutlab_public 4096 May 14  2020 Public
drwxr-xr-x  2 hutlab_public hutlab_public 4096 May 14  2020 Templates
drwxr-xr-x 24 hutlab_public hutlab_public 4096 May  7 21:26 Tutorials
drwxr-xr-x  2 hutlab_public hutlab_public 4096 May 14  2020 Videos
drwxr-xr-x  3 hutlab_public hutlab_public 4096 May 18  2017 biobakery_workflow_databases

Several new pieces of information are now provided including file/directory permissions, owner, size and date last updated.

The ls command does not restrict us to listing files only in our working directory; the location to any file or directory can be provided with the same results:

~/ $ pwd
/home/hutlab_public
~/ $ ls -l ~/Tutorials/humman3/
total 12
drwxr-xr-x 3 hutlab_public hutlab_public  4096 May 19  2020 input
drwxr-xr-x 2 hutlab_public hutlab_public  4096 May 19  2020 output

Other options such as -t (sort by modification time), -r (reverse the sort order, so with -t the newest entry is listed last), and -h (human-readable file sizes, used together with -l) change the output further. For full details, type man ls; press q to leave the manual page.
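
As a rough illustration of combining these options, the sketch below creates two files with different modification times in a scratch directory and lists them oldest first. The file names old.txt and new.txt are made up for the example.

```shell
# Create two files with different ages in a scratch directory (hypothetical names)
scratch=$(mktemp -d)
touch -t 202001010000 "$scratch/old.txt"   # timestamp set to 1 Jan 2020
touch "$scratch/new.txt"                   # timestamp is now

# -l long format, -h human-readable sizes, -t sort by modification time,
# -r reverse the sort so the newest entry is printed last
ls -lhtr "$scratch"

# Without -l, the same ordering is easier to see
listing=$(ls -tr "$scratch")
echo "$listing"
```

Single-letter options can usually be bundled after one dash, so ls -lhtr is equivalent to ls -l -h -t -r.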

Tip 1: Unix Manual Pages

The majority of Unix commands will come with manuals built-in that can be accessed using the man command:

example_dirB/ $ cd
~/ $ man ls

LS(1)                     BSD General Commands Manual                    LS(1)

NAME
     ls -- list directory contents

SYNOPSIS
     ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1] [file ...]

DESCRIPTION
     For each operand that names a file of a type other than directory, ls displays its name as well as any requested, associated informa-
     tion.  For each operand that names a file of type directory, ls displays the names of files contained within that directory, as well
     as any requested, associated information.

     If no operands are given, the contents of the current directory are displayed.  If more than one operand is given, non-directory op-
     erands are displayed first; directory and non-directory operands are sorted separately and in lexicographical order.

     The following options are available:

     -@      Display extended attribute keys and sizes in long (-l) output.

     -1      (The numeric digit ``one''.)  Force output to be one entry per line.  This is the default when output is not to a terminal.

     -A      List all entries except for . and ...  Always set for the super-user.

     -a      Include directory entries whose names begin with a dot (.).
...

Manual pages are useful for looking up the many options a command accepts. They can be navigated with the same keys as the less command (covered in section 2.2): Up, Down, Page Up, Page Down, and Spacebar to move, and q to exit.

1.4 Creating Directories

Directory creation can be done with the mkdir command:

~/ $ cd
~/ $ cd Tutorials/unix_intro
unix_intro/ $ mkdir labs
unix_intro/ $ ls -l
total 4
drwxrwxr-x 2 hutlab_public hutlab_public 4096 May 19 16:06 labs

It is also possible to create several levels of directories by passing the -p option to mkdir:

unix_intro/ $ mkdir -p labs/lab_2/data
unix_intro/ $ ls labs
lab_2
unix_intro/ $ ls labs/lab_2
data

2. Manipulating Files

In this section, we will cover how to move, download, rename, and delete files.

2.1 Downloading and Extracting Compressed Files

Before moving on, we'll want to download some example files to our cloud machine to experiment with. The command line has no "Save As" prompt of the kind you may be used to in a graphical interface, but we can use the wget command to fetch files from a remote location.

unix_intro/ $ cd labs/lab_2/data
data/ $ wget https://github.com/biobakery/biobakery/releases/download/1.8/lab_2_examples.tgz

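This file is a gzip-compressed tar archive (a "tarball") that we can extract using the tar command. The options are z (decompress with gzip), x (extract), v (verbose: print each name as it is extracted), and f (read from the named archive file):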

data/ $ tar zxvf lab_2_examples.tgz
example_dirA/
example_dirA/input/
example_dirA/output/
example_dirB/
example_dirB/story.txt
sequences_A.fasta
sequences_B.fasta
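
It can be useful to look inside an archive before unpacking it. The sketch below builds a small tarball of its own (demo.tgz, demo_dir, and note.txt are hypothetical names, not part of the lab files), lists its contents with the t option, and then extracts it into a chosen directory.

```shell
# Build a small example tarball in a scratch directory (hypothetical names)
scratch=$(mktemp -d)
cd "$scratch"
mkdir demo_dir
echo "hello" > demo_dir/note.txt
tar czf demo.tgz demo_dir      # c = create, z = gzip, f = archive file name

# t = list the contents without extracting anything
contents=$(tar tzf demo.tgz)
echo "$contents"

# x = extract; -C chooses the directory to extract into
mkdir unpacked
tar xzf demo.tgz -C unpacked
cat unpacked/demo_dir/note.txt
```

Listing first shows whether the archive unpacks into its own folder or would scatter files across the working directory.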

2.2 Viewing Files

We can examine the contents of a specific text file using two methods. The cat command will print all the contents of a file to our screen in one stroke:

data/ $ cat sequences_A.fasta
> sequence 1
CTCGGAAATCGATTTAAATCCGCCTTATATAGGGGAAAACGGGGTGTCGCCTGCTGGTTA
ACATGACGTTGGTTACAAAGCGTGTCATGTACGACATGCCAGCATACCAGCGGATGTCGA
CGTCTCAGAGCGCCCCTTCGGTATGAACCAGGAATCTCGGTGTGAGATACGATTTGCCCT
GTCAGGGTAACGATCTCGTCCACCGCCTCAACCTGAGCGGTATCGGGTAAAAAGAGCGGA
GTGTTAGGACCGCTAGCTTATCGCGAGATAGGTCCACCTAAATCCGCCCACGGACCAAAG
TTTGTCACAAGTCCACGACCCTTCCTCCAGAATTATCCCTAATTCCTCAGTGGCCTATGT
GTCTCCGCCACCGGATCTTTTCTAGTTGATTTTATCTACGATGGCGAGCGGCAAGAAGGT
ATATAGATGGCGCACTCAATTCCAGACTCCGTCTTCCTCGAGGAGACCAGGCCATTCGTT
GAGGTTTGCATGTCAAGTCGACTGCTGCGTACGATCCCCCATTCACTGGGGCAACTCGAG
CAGCTTCGTCTGCGTGGCTTCAAATGCTGTGTCCGGTGGTCAGTTATTTTATCTCTAGTG
GACTGATGTCGCCTAACGACGTAACCCGTGCCTTATGTGTGAGTAGTTGAGCTACAGCGT
TAAGCTCAGGACTTACTGGTTAATTTAACGAATTCGATTAAAAGGCGGTTGTGTCTTGTT
GGAAAGAAAAAAGTCAGACGGGGATGGCAACCGTCATACGTATCACTAGATCACTACACG
CCAATGCGTTGGGCCGCCATATTTTACACTGAGGTCGGGCGGATTGAGCGAACCGACTCT
CCGATGAAAACACGGATTGTTCCAAATGGACATACAACCTAAACCAGCGGCATATAATCA
GGGAGTAGACGTGAGGTGGTCTTCCTCAGGAGTAAATCTTCATATGAGTGGTTCAGCCAT
CAAACGTCGGGGCATAATAGAGGGAGCTGAGTCTGCGGTT
> sequence 2
GTTTACAGTCTATACTGTCCCCCCCGAAGCCACAGCTATCACAAGATCAGTCGGGCTGAG
GGGGTATGGTTGGATTACACTTCGAAACTTCGATTCAGTTGTACCGGGTGTTATCTGTGA
GTGTTTCGAACATCCAGCCTGTACTACTCCGACTATGGGTCGGCGGGCGACGAGAGACCC
GTAGGGCTTCCGTCTCCCATTTCGGTGTTCGAATGGTAAGTGGGTGTGGGGTATGTAAGA
CATTTACGCCTCCCTAAAGATGGCTACGAAAGCAAATTACGGAGAAAATGGCACTCGTCA
ACACCAATGCGCCTTCCGGCTTTCAGGAAAGAACCCGTGAGGTAACACCGAGTTCTAGGG
GGGCGTTAGTTTTACTCGTATACAGAAAACTCCCCCAGTCCACCTTGGGCCCTACCTCGT
TCTCGTGCGAATTAGCCTTATAGCGAAGTTGTGCGTTGCAAGGTATTCGTAAACGCGCGT
GGTCTATCACAAATGATGGCCAGTGAGGTTCATAGGAACTGCTCCAACCCGAGAGACACA
TTCTCATAGATGTTAACCAGAGCTCGTGGATCAGCAACATTAAAGGGAACATTGTGCAAC
ATGCAACTAGAAGAGCTCGATCGCCGTCTCTGACTCTAAACCGAGGGGCAGGACCCCGGA
TCAATGAACAGAGTCAAGCCTATGATCACCGTCTTATCAAACATACATCCACATGTAACG
TCTAATCATCAGGTGCGCCGATATTTAAAGTGGCGCTGGTGTAAACCGTGGAACGTACGT
AATGCACTCGCTATAGAGAGCTTAGACGTGAGCATCGCCGTGCCCTTGTGATCCGGTATG
CCCAGCCTGTAGCACTCCGCTCGGGTATCACATGAATTTAGGTCAAGTCTCTCCCTCGTT
AAGATCGAAGTGCCATCCGGCAGAGCTGCTTCAAGACTCGAAGAATCAGTCTGTGTTGTA
CGAAGTCCACTAATACCACCCTGCTTCATCTGGAATTTCG

Exercise #2: Use the cat command to view the contents of the sequences_B.fasta file.

We can see cat is not so useful when dealing with larger text files as the contents of the file will print rapidly down the screen. In these cases the less command allows us to move through multiple "pages" of output in a more controlled fashion:

data/ $ less sequences_B.fasta

Once invoked the less command takes over the entirety of our terminal window and allows us to scroll down the file using the Up and Down arrow keys, the Page Up or Page Down keys, or the Spacebar (moves us forward a page at a time). less can be exited by pressing the q key.

Tip 2: The Wildcard Character

Before we proceed it is useful to note that the Unix command-line has robust support for pattern matching in almost all of the commands using the wildcard character *.

An example would be using the wildcard character to list all fasta files in a directory:

data/ $ pwd
/home/hutlab_public/Tutorials/unix_intro/labs/lab_2/data
data/ $ ls -l *.fasta
-rw-r--r--  1 hutlab_public hutlab_public  2062 May 28 2018 sequences_A.fasta
-rw-r--r--  1 hutlab_public hutlab_public  8249 Apr  8 2018 sequences_B.fasta
-rw-r--r--  1 hutlab_public hutlab_public  3701 Apr  8 2018 sequences_C.fasta
-rw-r--r--  1 hutlab_public hutlab_public   835 Apr  8 2018 sequences_D.fasta
-rw-r--r--  1 hutlab_public hutlab_public  4968 Apr  8 2018 sequences_E.fasta

We can insert the wildcard character anywhere in a name to build partial matches to pass along to Unix commands. Let's list everything whose name begins with the word example:

data/ $ ls -l example*
example_dirA:
total 8
drwxr-xr-x  2 hutlab_public hutlab_public  4096 May  8 2018 input
drwxr-xr-x  2 hutlab_public hutlab_public  4096 May  8 2018 output

example_dirB:
total 20
-rw-r--r--  1 hutlab_public hutlab_public   701 Apr  8 2018 sequences_F.fasta
-rw-r--r--  1 hutlab_public hutlab_public   218 Apr  8 2018 sequences_G.fasta
-rw-r--r--  1 hutlab_public hutlab_public  4311 Apr  8 2018 sequences_H.fasta
-rw-r--r--  1 hutlab_public hutlab_public  1742 Mar 28 2018 story.txt
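
The sketch below tries the wildcard in a few positions. It works in a scratch directory with empty files whose names are invented for the example. One detail worth knowing is that the shell itself replaces the pattern with the matching names before the command runs, which the echo line makes visible.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch sequences_A.fasta sequences_B.fasta notes.txt reads_A.fastq

# The wildcard can sit at the end, the start, or the middle of a pattern
ls sequences_*            # names starting with "sequences_"
ls *_A.*                  # names containing "_A."

# echo shows what the shell expands a pattern to
matches=$(echo *.fasta)
echo "$matches"
```

Because the expansion happens in the shell, the same patterns work with cat, mv, rm, grep, and other commands that accept file names.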

Exercise #3: Try listing all files that begin with sequences and using the cat command to print out all FASTA files to the screen.

2.3 Moving and Deleting Files/Directories

Moving files is a common operation on the command-line and can be achieved by using the mv command.

Let's move the sequences_B.fasta file from the data/ folder into the example_dirB/ folder:

data/ $ mv sequences_B.fasta example_dirB/
data/ $ ls -l example_dirB/
total 32
-rw-r--r--  1 hutlab_public hutlab_public  8249 Apr  8 2018 sequences_B.fasta
-rw-r--r--  1 hutlab_public hutlab_public   701 Apr  8 2018 sequences_F.fasta
-rw-r--r--  1 hutlab_public hutlab_public   218 Apr  8 2018 sequences_G.fasta
-rw-r--r--  1 hutlab_public hutlab_public  4311 Apr  8 2018 sequences_H.fasta
-rw-r--r--  1 hutlab_public hutlab_public  1742 Mar 28 2018 story.txt

The mv command can also be used to rename files/directories by providing a new file/directory name during execution:

data/ $ mv sequences_A.fasta new_sequencesA.fasta
data/ $ ls -l 
total 0
drwxr-xr-x  2 hutlab_public hutlab_public  4096 Mar 28  2018 example_dirB
drwxr-xr-x  2 hutlab_public hutlab_public  4096 May 19 17:22 example_dirA
-rw-r--r--  1 hutlab_public hutlab_public 10122 May 19 17:06 lab_2_examples.tgz
-rw-r--r--  1 hutlab_public hutlab_public  2062 May 28  2018 new_sequencesA.fasta
-rw-r--r--  1 hutlab_public hutlab_public  3701 Apr  8  2018 sequences_C.fasta
-rw-r--r--  1 hutlab_public hutlab_public   835 Apr  8  2018 sequences_D.fasta
-rw-r--r--  1 hutlab_public hutlab_public  4968 Apr  8  2018 sequences_E.fasta

Deleting files is done using the rm command.

Caution should be exercised when deleting files on the command-line as no prompts or warnings will be given to confirm that files are going to be deleted.

Let's try deleting the new_sequencesA.fasta file we just renamed:

data/ $ rm new_sequencesA.fasta

When deleting directories we must supply rm with additional options: -r tells rm to delete the directory together with everything found under it, and -f suppresses prompts and errors about missing files. Without -r, rm refuses to remove a directory and returns an error:

data/ $ pwd
/home/hutlab_public/Tutorials/unix_intro/labs/lab_2/data
data/ $ rm example_dirA/
rm: example_dirA: is a directory
data/ $ rm -rf example_dirA/
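
Since rm gives no warning by default, one cautious habit is to preview a pattern with ls and to use the -i option, which asks for confirmation before each deletion. The sketch below is only an illustration on made-up files in a scratch directory; the "n" answer is piped in to stand in for a person typing at the prompt.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch keep_me.fasta delete_me.fasta

# Preview what a pattern matches before handing it to rm
ls *.fasta

# -i asks for confirmation before each deletion; answering "n" keeps the file
echo n | rm -i keep_me.fasta

# Without -i the file is removed immediately, with no prompt
rm delete_me.fasta

ls
```

There is no recycle bin on the command line, so a file removed with rm is generally not recoverable.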

Exercise #4: Create a new directory under the data folder called to_delete and move all files that end in .fasta to the new directory. Delete this folder using rm.

2.4 Full-Text Search

Searching the contents of a text file is a useful operation made very easy through the use of the grep command:

data/ $ cd example_dirB/
example_dirB/ $ grep ">" sequences_B.fasta
> sequence 3
> sequence 4
> sequence 5
> sequence 6
> sequence 7
> sequence 8
> sequence 9
> sequence 10

grep will output the lines in the file that match our search term (> in the example above).

An option can be passed to grep to print out the line number of a match in the specified file:

example_dirB/ $ grep -n ">" sequences_B.fasta
1:> sequence 3
19:> sequence 4
38:> sequence 5
56:> sequence 6
74:> sequence 7
92:> sequence 8
110:> sequence 9
128:> sequence 10

Or just the name of the file a match was found in:

example_dirB/ $ grep -l ">" sequences_B.fasta
sequences_B.fasta
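
Two further uses of grep are sketched below on small made-up FASTA files (demo_A.fasta and demo_B.fasta are hypothetical and are created by the snippet itself): counting matches with -c, and searching several files at once by combining grep with the wildcard character.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf '> sequence 1\nACGTACGT\n> sequence 2\nTTGACCA\n' > demo_A.fasta
printf '> sequence 3\nGGGTTTAA\n' > demo_B.fasta

# -c prints the number of matching lines, here the number of sequences
count_a=$(grep -c ">" demo_A.fasta)
echo "$count_a"

# Given several files, grep prefixes each match with the file name
grep "TTGA" *.fasta

# -l combined with a wildcard lists only the files that contain a match
hits=$(grep -l "TTGA" *.fasta)
echo "$hits"
```

Note that grep works line by line, so a motif that is split across two lines of a wrapped FASTA sequence will not be found this way.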

Exercise #5: Search all FASTA files in the example_dirB directory for the nucleotide sequence TACTACTCCGACT.
Q: Which files does that sequence occur in?

Tip 3: Running Commands in the Background

When we run a command on the command-line it will normally take control of our terminal window. We can observe this behavior by executing the following command:

example_dirB/ $ xclock

You should now see that you can't type anything else into the terminal. The xclock command has taken over our terminal window and will not relinquish control until we close the clock window. Once you do this you should notice you can now type new commands into the terminal.

We can get around this by adding the & character to any of our commands. Doing this tells Unix that we would like to run this command in the background and still retain control over our terminal and execute more commands.
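
A minimal sketch of the idea is shown below. It uses sleep as a stand-in for any long-running command so that it also works on machines without a graphical display; the variable names are chosen for illustration only.

```shell
# sleep stands in for any long-running command; & sends it to the background
sleep 2 &
bg_pid=$!                 # $! holds the process ID of the last background job

echo "The terminal is free while process $bg_pid keeps running"
jobs                      # list background jobs started from this shell

wait "$bg_pid"            # block until the background job has finished
status=$?
echo "Background job exited with status $status"
```

In an interactive terminal the wait step is optional; it is included here only to show that the background job really does finish on its own.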

Exercise #6: Run the xclock command in the background.

Programs/commands like tmux can be used for interactive background sessions as well. For more information, see this in-depth tutorial.

Misc. Tips and Tricks

Below is a collection of useful tips and tricks to have handy when working in the command-line environment.

Tip 4: Accessing Command History

We can take a look at the commands that we have executed, including those from earlier sessions if they were saved, by using the history command.

~/ $ history
10823  cd ..
10824  ls
10825  cd ..
10826  ls
10827  ls -l
10828  cd ..
10829  ls
10830  ls -l
10833  ls
10835  grep ACGT seqsA.fasta
10836  grep -n ">" seqsA.fasta
10837  grep -l ">" seqsA.fasta

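Note: In bash, the command history is normally written to the file ~/.bash_history when a session ends, so earlier commands are usually still available after the terminal is closed. Commands from a session that is killed abruptly may not be saved.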

Also worth noting is that the Up and Down arrow keys can be used to cycle through the history and bring up any commands previously run.

Tip 5: Auto-completing Commands

The Tab key can be used to bring up a list of commands or files/directories in the working directory that match whatever we are typing into the terminal.

An example can be seen below when typing mk into the terminal and hitting the Tab key:

~/ $ mk
mkdir             mkfontdir         mkfs.btrfs        mkfs.ext4         mkfs.msdos        mkhomedir_helper  mkmanifest        mksquashfs
mkdosfs           mkfontscale       mkfs.cramfs       mkfs.ext4dev      mkfs.ntfs         mkinitramfs       mk_modmap         mkswap
mke2fs            mkfs              mkfs.ext2         mkfs.fat          mkfs.vfat         mkisofs           mknod             mktemp
mkfifo            mkfs.bfs          mkfs.ext3         mkfs.minix        mkfs.xfs          mklost+found      mkntfs            mkzftree

A list of all commands that start with the characters mk is returned.

Similarly, we can use auto-complete/tab-complete to bring up a list of files in the working directory. Here we are using the ls command and the Tab key to bring up all files that begin with the characters sequences_ under the /home/hutlab_public/Tutorials/unix_intro/labs/lab_2/data/example_dirB/sequences/ directory.

sequences/ $ ls sequences_
sequences_X.fasta  sequences_Y.fasta  sequences_Z.fasta

3. Advanced Unix

Time permitting, we can begin to cover some more advanced Unix topics and commands that can be very useful.

3.1 Pipelining Commands

Unix provides us with a very clever way to send the output of one command to another for further processing. This is called "pipelining" or "piping" and is done with the | character. Let's see a quick example using the ls command and the grep command.

We learned how to list just the FASTA files in a directory using the wildcard character, but we can also achieve this using the ls command, piping, and the grep command:

sequences/ $ ls | grep '\.fasta$'
sequences_X.fasta
sequences_Y.fasta
sequences_Z.fasta

This may not seem very useful, since the wildcard character does the same job with fewer commands, but piping allows you to build complex combinations of commands in an easy manner.
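
A slightly longer pipeline is sketched below on made-up FASTA files created by the snippet itself. It also uses wc -l (count lines) and tr (delete characters), two standard commands that are not otherwise covered in this lab, so treat it as an illustration rather than part of the exercises.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf '> sequence 1\nACGT\n> sequence 2\nTTGA\n' > demo_A.fasta
printf '> sequence 3\nGGCC\n' > demo_B.fasta

# cat joins the files, grep keeps only header lines, wc -l counts them;
# tr removes the padding spaces that some versions of wc add
total=$(cat *.fasta | grep ">" | wc -l | tr -d ' ')
echo "$total"

# A second grep narrows the first: headers that also contain "2"
second=$(cat *.fasta | grep ">" | grep "2")
echo "$second"
```

Each command in the chain only sees the lines passed on by the command before it, which is what makes it possible to filter in stages.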

Exercise #7: Using Unix pipelining, print all sequences under the sequences directory, use the grep command to extract all lines that begin with ACT, and then grep these lines again for the GGC nucleotide sequence.

3.2 Bash scripting and For loops

Piping is useful for chaining together commands that operate on a "stream" of output, but sometimes we want to execute commands that work in conjunction without one consuming the direct output of the other.

The Unix for loop is a construct we can use to execute one or more actions on a collection that we can define. Below is the basic format of a for loop:

for <ITEM> in <COLLECTION>
do
  <ACTIONS>
done

There is a lot to digest here but we can break it down line by line. Our first line:

for <ITEM> in <COLLECTION>

allows us to iterate over each of the elements in <COLLECTION> and stores the current element in a variable that we define as <ITEM>. A more relevant example here would be if we wanted to iterate over each of the sequence files ending in .fasta in the current directory:

for file in *.fasta

On each pass through the loop, the name of the current FASTA file is stored in a variable named file, and we can access it by referring to this variable, as we will see below.

Our next line is telling Unix that we are about to pass one or more actions to be executed on each item in our collection:

do

Next up we supply any actions we'd like to execute. Following our sequence example, let's create a new directory for each file we find and store the sequence IDs in a text file in this new folder:

for file in *.fasta
do
  mkdir "${file%%.*}"
  grep ">" "$file" > "${file%%.*}/fasta_headers.txt"

The important thing to note here is that when we want to make use of the item we are iterating over, we need to prepend the $ character to the name we gave our variable. In our example the variable is named file, so when we use it in any Unix command we supply $file to the command. The $ attached to the variable name tells the shell that we want the value stored in the variable rather than the literal word file. Wrapping it in double quotes ("$file") keeps file names that contain spaces intact.

The final line simply tells Unix that we are ending our for loop; the actions between do and done are then executed for every item in our collection:

done

Note: ${file%%.*} strips everything from the first dot onward in the file name, so sequences_A.fasta becomes sequences_A. This is what drops the .fasta extension when the new directory is named.

Putting it all together, it looks something like this:

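for file in *.fasta
do
  # Directory named after the file, minus its extension
  mkdir "${file%%.*}"
  # Save the header lines (sequence IDs) into the new directory
  grep ">" "$file" > "${file%%.*}/fasta_headers.txt"
  # Keep a copy of the full sequence file alongside the headers
  cp "$file" "${file%%.*}/sequences.txt"
done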
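
The suffix-stripping syntax used in the loop can be tried on its own. The sketch below compares %% with the closely related % on a hypothetical file name that contains two dots, where the difference between the two becomes visible.

```shell
file="sample.v2.fasta"      # hypothetical file name with two dots

# %% removes the longest match of ".*" from the end: everything from the first dot
longest="${file%%.*}"
# %  removes the shortest match of ".*" from the end: only the final extension
shortest="${file%.*}"

echo "$longest"
echo "$shortest"
```

This prints sample and then sample.v2. For file names with a single dot, such as sequences_A.fasta, both forms give the same result.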

Exercise #8: Run the supplied for loop. Try adding a step to make a copy of the sequence file in the newly created folder as well.
