Introduction to Command line Unix
The goal of this lab is to become familiar with operating in the Unix command-line interface.
Most bioinformatic tools are written for use in the Unix environment so becoming comfortable with navigating the command-line is a useful skill to have and will be necessary to complete the majority of upcoming labs.
- 1. Navigating the Directory Structure
- 2. Working with the files
- 3. Advanced Unix
- Misc. Tips and Tricks
The Unix directory structure can be visualized as a tree. Each level of the tree indicates a series of folders or files which can have zero or many nested child folders/files. Here is a simplified example from our cloud VMs:
Let's launch the VMs:
Link to VMs
Username: student
Password: biobakery
It is important to know where we are when navigating the Unix directory structure. The current directory
you operate out of is called the Working Directory and we can print this out by using the pwd
command:
~/ $ pwd
/home/hutlab_public
Note: The '~' character is shorthand for your home directory. In this case, the home directory is '/home/hutlab_public'.
Navigation is primarily handled by the Change Directory command cd
:
~/ $ cd Tutorials/
Tutorials/ $ pwd
/home/hutlab_public/Tutorials
It is also possible to traverse multiple directories in one cd
command:
Tutorials/ $ cd /home/hutlab_public/Tutorials/kneadadata/input/
input/ $ pwd
/home/hutlab_public/Tutorials/kneadadata/input
Q: How could we have written the first command in short-hand?
Q: In this environment, how can we tell what our current location is?
We can navigate backwards up the Unix directory tree by using the special characters ..
in
conjunction with cd
:
input/ $ pwd
/home/hutlab_public/Tutorials/kneadadata/input
input/ $ cd ..
kneaddata/ $ pwd
/home/hutlab_public/Tutorials/kneadadata
Note that this can be chained multiple times to navigate backwards several levels in the directory tree:
kneaddata/ $ cd input/
input/ $ cd ../..
Tutorials/ $ pwd
/home/hutlab_public/Tutorials
From any location we can return to our home directory by executing the cd
command alone:
Tutorials/ $ cd
~/ $ pwd
/home/hutlab_public
or by changing directory to the ~
character which is short-hand for our home directory:
~/ $ cd Tutorials/
Tutorials/ $ cd ~
~/ $ pwd
/home/hutlab_public
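The navigation commands above can be tried safely in a scratch directory. The sketch below is illustrative only: the scratch variable and the directory names are assumptions made for the example, not paths that exist on the VM. It moves through a small throwaway tree using an absolute path, the .. shorthand, and a relative path.

```shell
# Create a throwaway directory tree (hypothetical names, not the VM layout).
scratch=$(mktemp -d)
mkdir -p "$scratch/Tutorials/kneaddata/input"

# An absolute path starts with / and works from any working directory.
cd "$scratch/Tutorials/kneaddata/input"
pwd

# Each .. climbs one level, so ../.. climbs two.
cd ../..
up_two=$(basename "$PWD")
echo "$up_two"

# A relative path is resolved from the current working directory.
cd kneaddata/input
deepest=$(basename "$PWD")
echo "$deepest"
```

The key distinction is the leading /: without it, cd looks for the path underneath wherever you currently are.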
Exercise #1: Find the location of the input files from yesterday's MetaPhlAn tutorial using the 'cd' and 'pwd' commands - we will learn about listing the contents of the directory in the next section.
Now that we know how to navigate around the Unix file structure we will want to see what files/directories are present in our working directory. This is accomplished using the ls
command:
~/ $ cd
~/ $ pwd
/home/hutlab_public
~/ $ ls
Desktop Documents Downloads Music Pictures Public Templates Tutorials Videos biobakery_workflow_databases
We can see that there are multiple directories under the /home/hutlab_public/
location but not much
more information is returned. This can be fixed by passing options to the ls
command that change
what output is produced.
We can use the -l
option to request a long list of the contents of a directory:
~/ $ pwd
/home/hutlab_public
~/ $ ls -l
total 40
drwxr-xr-x 3 hutlab_public hutlab_public 4096 May 28 2020 Desktop
drwxr-xr-x 2 hutlab_public hutlab_public 4096 May 14 2020 Documents
drwxr-xr-x 2 hutlab_public hutlab_public 4096 May 17 13:25 Downloads
drwxr-xr-x 2 hutlab_public hutlab_public 4096 May 14 2020 Music
drwxrwxr-x 2 hutlab_public hutlab_public 4096 May 14 2020 Pictures
drwxr-xr-x 2 hutlab_public hutlab_public 4096 May 14 2020 Public
drwxr-xr-x 2 hutlab_public hutlab_public 4096 May 14 2020 Templates
drwxr-xr-x 24 hutlab_public hutlab_public 4096 May 7 21:26 Tutorials
drwxr-xr-x 2 hutlab_public hutlab_public 4096 May 14 2020 Videos
drwxr-xr-x 3 hutlab_public hutlab_public 4096 May 18 2017 biobakery_workflow_databases
Several new pieces of information are now provided including file/directory permissions, owner, size and date last updated.
The ls
command does not restrict us to listing files only in our working directory; the location to any file or directory can
be provided with the same results:
~/ $ pwd
/home/hutlab_public
~/ $ ls -l ~/Tutorials/humman3/
total 12
drwxr-xr-x 3 hutlab_public hutlab_public 4096 May 19 2020 input
drwxr-xr-x 2 hutlab_public hutlab_public 4096 May 19 2020 output
Adding other arguments like -t
(time of modification ordering), -r
(reverse order; combined with -t this puts the newest at the bottom) or -h
(human-readable file sizes) will provide further information. For full details type in man ls
. If you do this type q
to get out of the manual page.
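As a rough illustration of how these options combine, the sketch below creates three empty files with known timestamps and lists them in different orders. The file names and the scratch directory are made up for the example; touch -t simply sets a file's modification time.

```shell
# Work in a throwaway directory so nothing on the VM is touched.
scratch=$(mktemp -d)
cd "$scratch"

# touch -t sets the modification time (format: YYYYMMDDhhmm).
touch -t 202001010000 old.txt
touch -t 202201010000 newer.txt
touch -t 202401010000 newest.txt

ls -lh     # long listing with human-readable sizes
ls -lt     # sorted by modification time, newest first
ls -ltr    # same ordering reversed, newest at the bottom

# Capture the first entry of each ordering to confirm the behaviour.
first_by_time=$(ls -t | head -n 1)
first_reversed=$(ls -tr | head -n 1)
echo "$first_by_time $first_reversed"
```

Single-letter options can be grouped after one dash, so ls -ltr is equivalent to ls -l -t -r.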
The majority of Unix commands will come with manuals built-in that can be accessed using the man
command:
example_dirB/ $ cd
~/ $ man ls
LS(1) BSD General Commands Manual LS(1)
NAME
ls -- list directory contents
SYNOPSIS
ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1] [file ...]
DESCRIPTION
For each operand that names a file of a type other than directory, ls displays its name as well as any requested, associated informa-
tion. For each operand that names a file of type directory, ls displays the names of files contained within that directory, as well
as any requested, associated information.
If no operands are given, the contents of the current directory are displayed. If more than one operand is given, non-directory op-
erands are displayed first; directory and non-directory operands are sorted separately and in lexicographical order.
The following options are available:
-@ Display extended attribute keys and sizes in long (-l) output.
-1 (The numeric digit ``one''.) Force output to be one entry per line. This is the default when output is not to a terminal.
-A List all entries except for . and ... Always set for the super-user.
-a Include directory entries whose names begin with a dot (.).
...
Manual pages are useful for looking up the many options a command accepts. These man pages can be navigated using the
same keys used with less
: Up, Down, Page Up, Page Down, and Spacebar to navigate and q to exit.
Directory creation can be done with the mkdir
command:
~/ $ cd
~/ $ cd Tutorials/unix_intro
unix_intro/ $ mkdir labs
unix_intro/ $ ls -l
total 4
drwxrwxr-x 2 hutlab_public hutlab_public 4096 May 19 16:06 labs
It is also possible to create several levels of directories by passing the -p
option to mkdir
:
unix_intro/ $ mkdir -p labs/lab_2/data
unix_intro/ $ ls labs
lab_2
unix_intro/ $ ls labs/lab_2
data
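To see why -p matters, the sketch below (run in a scratch directory, with a hypothetical plain_status variable used only to record the result) first tries mkdir without -p on a nested path whose parents do not exist yet, then repeats it with -p.

```shell
# Work in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"

# Without -p, mkdir refuses to create a directory whose parent is missing.
if mkdir labs/lab_2/data 2>/dev/null; then
    plain_status=ok
else
    plain_status=failed
fi
echo "plain mkdir: $plain_status"

# With -p, every missing parent directory is created along the way.
mkdir -p labs/lab_2/data

# Running it again is harmless: -p does not complain if the path exists.
mkdir -p labs/lab_2/data
ls -R labs
```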
In this section, we will cover how to move, download, rename, and delete files.
Before moving on, we'll want to download some example files to our cloud machine to experiment with. The command-line has no
"Save As" prompt like the one you may be used to in a graphical interface, but we can use the wget
command to grab files from a remote location.
unix_intro/ $ cd labs/lab_2/data
data/ $ wget https://github.com/biobakery/biobakery/releases/download/1.8/lab_2_examples.tgz
This file type is a compressed tarball file that we can extract using the tar
command:
data/ $ tar zxvf lab_2_examples.tgz
example_dirA/
example_dirA/input/
example_dirA/output/
example_dirB/
example_dirB/story.txt
sequences_A.fasta
sequences_B.fasta
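The letters passed to tar each have a meaning: z for gzip compression, x for extract, v for verbose, and f for the archive file name. The sketch below is a self-contained illustration using a tiny archive built on the spot (demo.tgz and its contents are made up for the example). It also shows the t option, which lists an archive's contents without extracting anything.

```shell
# Work in a throwaway directory and build a small archive to practice on.
scratch=$(mktemp -d)
cd "$scratch"
mkdir -p demo/example_dir
printf '> sequence 1\nACGT\n' > demo/sequences_A.fasta

# c = create, z = gzip-compress, f = name of the archive file.
tar czf demo.tgz demo

# t = list the contents without extracting them.
tar tzf demo.tgz

# x = extract; -C chooses the directory to extract into.
mkdir extracted
tar xzf demo.tgz -C extracted
ls -R extracted
```

Listing with t before extracting is a cheap way to check where an unfamiliar archive will put its files.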
We can examine the contents of a specific text file using two methods. The cat
command will print all the contents of a file
to our screen in one stroke:
data/ $ cat sequences_A.fasta
> sequence 1
CTCGGAAATCGATTTAAATCCGCCTTATATAGGGGAAAACGGGGTGTCGCCTGCTGGTTA
ACATGACGTTGGTTACAAAGCGTGTCATGTACGACATGCCAGCATACCAGCGGATGTCGA
CGTCTCAGAGCGCCCCTTCGGTATGAACCAGGAATCTCGGTGTGAGATACGATTTGCCCT
GTCAGGGTAACGATCTCGTCCACCGCCTCAACCTGAGCGGTATCGGGTAAAAAGAGCGGA
GTGTTAGGACCGCTAGCTTATCGCGAGATAGGTCCACCTAAATCCGCCCACGGACCAAAG
TTTGTCACAAGTCCACGACCCTTCCTCCAGAATTATCCCTAATTCCTCAGTGGCCTATGT
GTCTCCGCCACCGGATCTTTTCTAGTTGATTTTATCTACGATGGCGAGCGGCAAGAAGGT
ATATAGATGGCGCACTCAATTCCAGACTCCGTCTTCCTCGAGGAGACCAGGCCATTCGTT
GAGGTTTGCATGTCAAGTCGACTGCTGCGTACGATCCCCCATTCACTGGGGCAACTCGAG
CAGCTTCGTCTGCGTGGCTTCAAATGCTGTGTCCGGTGGTCAGTTATTTTATCTCTAGTG
GACTGATGTCGCCTAACGACGTAACCCGTGCCTTATGTGTGAGTAGTTGAGCTACAGCGT
TAAGCTCAGGACTTACTGGTTAATTTAACGAATTCGATTAAAAGGCGGTTGTGTCTTGTT
GGAAAGAAAAAAGTCAGACGGGGATGGCAACCGTCATACGTATCACTAGATCACTACACG
CCAATGCGTTGGGCCGCCATATTTTACACTGAGGTCGGGCGGATTGAGCGAACCGACTCT
CCGATGAAAACACGGATTGTTCCAAATGGACATACAACCTAAACCAGCGGCATATAATCA
GGGAGTAGACGTGAGGTGGTCTTCCTCAGGAGTAAATCTTCATATGAGTGGTTCAGCCAT
CAAACGTCGGGGCATAATAGAGGGAGCTGAGTCTGCGGTT
> sequence 2
GTTTACAGTCTATACTGTCCCCCCCGAAGCCACAGCTATCACAAGATCAGTCGGGCTGAG
GGGGTATGGTTGGATTACACTTCGAAACTTCGATTCAGTTGTACCGGGTGTTATCTGTGA
GTGTTTCGAACATCCAGCCTGTACTACTCCGACTATGGGTCGGCGGGCGACGAGAGACCC
GTAGGGCTTCCGTCTCCCATTTCGGTGTTCGAATGGTAAGTGGGTGTGGGGTATGTAAGA
CATTTACGCCTCCCTAAAGATGGCTACGAAAGCAAATTACGGAGAAAATGGCACTCGTCA
ACACCAATGCGCCTTCCGGCTTTCAGGAAAGAACCCGTGAGGTAACACCGAGTTCTAGGG
GGGCGTTAGTTTTACTCGTATACAGAAAACTCCCCCAGTCCACCTTGGGCCCTACCTCGT
TCTCGTGCGAATTAGCCTTATAGCGAAGTTGTGCGTTGCAAGGTATTCGTAAACGCGCGT
GGTCTATCACAAATGATGGCCAGTGAGGTTCATAGGAACTGCTCCAACCCGAGAGACACA
TTCTCATAGATGTTAACCAGAGCTCGTGGATCAGCAACATTAAAGGGAACATTGTGCAAC
ATGCAACTAGAAGAGCTCGATCGCCGTCTCTGACTCTAAACCGAGGGGCAGGACCCCGGA
TCAATGAACAGAGTCAAGCCTATGATCACCGTCTTATCAAACATACATCCACATGTAACG
TCTAATCATCAGGTGCGCCGATATTTAAAGTGGCGCTGGTGTAAACCGTGGAACGTACGT
AATGCACTCGCTATAGAGAGCTTAGACGTGAGCATCGCCGTGCCCTTGTGATCCGGTATG
CCCAGCCTGTAGCACTCCGCTCGGGTATCACATGAATTTAGGTCAAGTCTCTCCCTCGTT
AAGATCGAAGTGCCATCCGGCAGAGCTGCTTCAAGACTCGAAGAATCAGTCTGTGTTGTA
CGAAGTCCACTAATACCACCCTGCTTCATCTGGAATTTCG
Exercise #2: Use the cat
command to view the contents of the sequences_B.fasta
file.
We can see cat
is not so useful when dealing with larger text files as the contents of the file will print rapidly down the screen. In these cases the
less
command allows us to move through
multiple "pages" of output in a more controlled fashion:
data/ $ less sequences_B.fasta
Once invoked the less
command takes over the entirety of our terminal window and allows us to scroll down the file using the Up and Down arrow keys, the Page Up or Page Down keys, or the Spacebar (moves us forward a page at a time). less
can be exited by pressing the q key.
Before we proceed it is useful to note that the Unix command-line has robust support for pattern matching in almost all of the commands using the
wildcard character *
.
An example would be using the wildcard character to list all fasta files in a directory:
data/ $ pwd
/home/hutlab_public/Tutorials/unix_intro/labs/lab_2/data
data/ $ ls -l *.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 2062 May 28 2018 sequences_A.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 8249 Apr 8 2018 sequences_B.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 3701 Apr 8 2018 sequences_C.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 835 Apr 8 2018 sequences_D.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 4968 Apr 8 2018 sequences_E.fasta
We can insert the wildcard character anywhere in a pattern to make many combinations of partial matches to pass along to Unix commands. Let's list all
files that begin with the word example
:
data/ $ ls -l example*
example_dirA:
total 8
drwxr-xr-x 2 hutlab_public hutlab_public 4096 May 8 2018 input
drwxr-xr-x 2 hutlab_public hutlab_public 4096 May 8 2018 output
example_dirB:
total 20
-rw-r--r-- 1 hutlab_public hutlab_public 701 Apr 8 2018 sequences_F.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 218 Apr 8 2018 sequences_G.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 4311 Apr 8 2018 sequences_H.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 1742 Mar 28 2018 story.txt
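It can help to know that the shell itself expands a wildcard pattern into the list of matching names before the command runs. A minimal sketch, assuming a scratch directory with a few made-up files, uses echo to show exactly what a command would receive:

```shell
# Work in a throwaway directory with a few empty files and folders.
scratch=$(mktemp -d)
cd "$scratch"
touch sequences_A.fasta sequences_B.fasta story.txt
mkdir example_dirA example_dirB

# echo prints its arguments, so it reveals what each pattern expands to.
fasta_list=$(echo *.fasta)     # names ending in .fasta
example_list=$(echo example*)  # names beginning with example
middle_match=$(echo *_B*)      # names containing _B anywhere

echo "$fasta_list"       # sequences_A.fasta sequences_B.fasta
echo "$example_list"     # example_dirA example_dirB
echo "$middle_match"     # sequences_B.fasta
```

Trying a pattern with echo first is a low-risk way to preview which files a command such as mv or rm would act on.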
Exercise #3: Try listing all files that begin with sequences
and use the cat
command to print out all FASTA files to the screen.
Moving files is a common operation on the command-line and can be achieved by using the mv
command.
Let's move the sequences_B.fasta
file from the data/
folder to the example_dirB/
folder:
data/ $ mv sequences_B.fasta example_dirB/
data/ $ ls -l example_dirB/
total 32
-rw-r--r-- 1 hutlab_public hutlab_public 8249 Apr 8 2018 sequences_B.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 701 Apr 8 2018 sequences_F.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 218 Apr 8 2018 sequences_G.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 4311 Apr 8 2018 sequences_H.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 1742 Mar 28 2018 story.txt
The mv
command can also be used to rename files/directories by providing a new file/directory name during execution:
data/ $ mv sequences_A.fasta new_sequencesA.fasta
data/ $ ls -l
total 0
drwxr-xr-x 2 hutlab_public hutlab_public 4096 Mar 28 2018 example_dirB
drwxr-xr-x 2 hutlab_public hutlab_public 4096 May 19 17:22 example_dirA
-rw-r--r-- 1 hutlab_public hutlab_public 10122 May 19 17:06 lab_2_examples.tgz
-rw-r--r-- 1 hutlab_public hutlab_public 2062 May 28 2018 new_sequencesA.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 3701 Apr 8 2018 sequences_C.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 835 Apr 8 2018 sequences_D.fasta
-rw-r--r-- 1 hutlab_public hutlab_public 4968 Apr 8 2018 sequences_E.fasta
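Whether mv moves or renames depends on its last argument: an existing directory means "move into it", anything else is treated as the new name. The sketch below illustrates both cases in a scratch directory; the file name renamed.fasta is invented for the example.

```shell
# Work in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"
mkdir example_dirB
touch sequences_A.fasta sequences_B.fasta

# Destination is an existing directory: the file is moved into it.
mv sequences_B.fasta example_dirB/

# Destination is a new name: the file is renamed in place.
mv sequences_A.fasta new_sequencesA.fasta

# A path plus a new name does both at once: move and rename.
mv new_sequencesA.fasta example_dirB/renamed.fasta

ls -l example_dirB
```

Note that mv silently overwrites an existing file with the same name at the destination, so it is worth checking with ls first.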
Deleting files is done using the rm
command.
Caution should be exercised when deleting files on the command-line as no prompts or warnings will be given to confirm that files are going to be deleted.
Let's try deleting the new_sequencesA.fasta
file we just renamed:
data/ $ rm new_sequencesA.fasta
When deleting directories we must supply rm
with the additional -rf
arguments (-r removes the directory and its contents recursively; -f skips confirmation prompts) to ensure that any files found under the specified directory are also deleted. Failure to provide the -r
argument will result in rm
returning an error:
data/ $ pwd
/home/hutlab_public/Tutorials/unix_intro/labs/lab_2/data
data/ $ rm example_dirA/
rm: example_dirA: is a directory
data/ $ rm -rf example_dirA/
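Because deletion cannot be undone, one cautious habit is to list what is about to be removed before removing it. The sketch below illustrates this in a scratch directory; old_results, keep.txt and the plain_rm variable are hypothetical names used only for this example.

```shell
# Work in a throwaway directory with a nested folder to delete.
scratch=$(mktemp -d)
cd "$scratch"
mkdir -p old_results/nested
touch old_results/a.txt old_results/nested/b.txt keep.txt

# Plain rm refuses to remove a directory and returns an error status.
if rm old_results 2>/dev/null; then
    plain_rm=ok
else
    plain_rm=failed
fi
echo "plain rm: $plain_rm"

# Look before deleting: list everything under the directory first.
ls -R old_results

# -r removes the directory together with everything inside it.
rm -r old_results
ls
```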
Exercise #4: Create a new directory under the data
folder called to_delete
and move all files that end in .fasta
to the new directory. Delete this folder using rm
.
Searching the contents of a text file is a useful operation made very easy through the use of the
grep
command:
data/ $ cd example_dirB/
example_dirB/ $ grep ">" sequences_B.fasta
> sequence 3
> sequence 4
> sequence 5
> sequence 6
> sequence 7
> sequence 8
> sequence 9
> sequence 10
grep
will output the lines in the file that match our search term (>
in the example above).
An option can be passed to grep
to print out the line number of a match in the specified file:
example_dirB/ $ grep -n ">" sequences_B.fasta
1:> sequence 3
19:> sequence 4
38:> sequence 5
56:> sequence 6
74:> sequence 7
92:> sequence 8
110:> sequence 9
128:> sequence 10
Or just the name of the file a match was found in:
example_dirB/ $ grep -l ">" sequences_B.fasta
sequences_B.fasta
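The options above can be tried on a couple of tiny files. The sketch below is only an illustration: one.fasta and two.fasta are made-up files, and it also uses -c, a standard grep option not covered above that prints the number of matching lines instead of the lines themselves. Note that > must be quoted, otherwise the shell treats it as output redirection.

```shell
# Work in a throwaway directory with two small FASTA-like files.
scratch=$(mktemp -d)
cd "$scratch"
printf '> sequence 1\nACGTAC\n> sequence 2\nTTGACA\n' > one.fasta
printf '> sequence 3\nGGGCCC\n' > two.fasta

# -c counts matching lines: a quick way to count sequences in a file.
header_count=$(grep -c ">" one.fasta)
echo "$header_count"       # 2

# -n prefixes each match with its line number.
hit_line=$(grep -n "TTGACA" one.fasta)
echo "$hit_line"           # 4:TTGACA

# -l with several files prints only the names of files that contain a match.
files_with_hit=$(grep -l "GGGCCC" *.fasta)
echo "$files_with_hit"     # two.fasta
```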
Exercise #5: Search all FASTA files for the nucleotide sequence TACTACTCCGACT
in the example_dirB
directory.
Q: Which files does that sequence occur in?
When we run a command on the command-line it will normally take control of our terminal window. We can observe this behavior by executing the following command:
example_dirB/ $ xclock
You should now see that you can't type anything else into the terminal. The xclock
command has taken over our terminal window and will not relinquish control until we close the clock window. Once you do this you should notice you can now type new commands into the terminal.
We can get around this by adding the &
character to any of our commands. Doing this tells Unix that we would like to run this command in the background and still retain control over our terminal and execute more commands.
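The same behavior can be observed without a graphical program. As a small sketch, sleep stands in for any long-running command; the variable names are assumptions for the example. $! holds the process ID of the most recent background command, jobs lists background jobs, and wait pauses until a given job finishes.

```shell
# Start a command in the background; the shell returns immediately.
sleep 1 &
bg_pid=$!
echo "background process ID: $bg_pid"

# We can keep working while it runs, for example by listing current jobs.
jobs

# wait blocks until the background command has finished and
# makes its exit status available in $?.
wait "$bg_pid"
bg_status=$?
echo "background command exited with status $bg_status"
```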
Exercise #6: Run the xclock
command in the background.
Programs/commands like tmux
can be used for interactive background sessions as well. For more information, see this in-depth tutorial.
Below is a collection of useful tips and tricks to have handy when working in the command-line environment.
We can take a look at all the commands that we have executed in the current terminal session
by using the history
command.
~/ $ history
10823 cd ..
10824 ls
10825 cd ..
10826 ls
10827 ls -l
10828 cd ..
10829 ls
10830 ls -l
10833 ls
10835 grep ACGT seqsA.fasta
10836 grep -n ">" seqsA.fasta
10837 grep -l ">" seqsA.fasta
Note: In bash, the command history is normally saved to ~/.bash_history when a session exits, so it usually survives closing the terminal; commands from a session that is killed abruptly may not be saved.
Also worth noting is that the Up and Down arrow keys can be used to cycle through the history and bring up any previously run commands.
The Tab key can be used to bring up a list of commands or files/directories in the working directory that match whatever we are typing into the terminal.
An example can be seen below when typing mk
into the terminal and hitting the Tab key:
~/ $ mk
mkdir mkfontdir mkfs.btrfs mkfs.ext4 mkfs.msdos mkhomedir_helper mkmanifest mksquashfs
mkdosfs mkfontscale mkfs.cramfs mkfs.ext4dev mkfs.ntfs mkinitramfs mk_modmap mkswap
mke2fs mkfs mkfs.ext2 mkfs.fat mkfs.vfat mkisofs mknod mktemp
mkfifo mkfs.bfs mkfs.ext3 mkfs.minix mkfs.xfs mklost+found mkntfs mkzftree
A list of all commands that start with the characters mk
is returned.
Similarly, we can use auto-complete/tab-complete to bring up a list of files in the working directory.
Here we are using the ls
command and the Tab key to bring up all files that begin with the characters sequences_
under the /home/hutlab_public/Tutorials/unix_intro/labs/lab_2/data/example_dirB/sequences/
directory.
sequences/ $ ls sequences_
sequences_X.fasta sequences_Y.fasta sequences_Z.fasta
Time-permitting we can begin to cover some more advanced Unix topics and commands that can be very useful.
Unix provides us with a very clever way to send the output of one command to another for further processing. This action is called "pipelining" or "pipe'ing" and is used via the |
character. Let's see a quick example using the ls
command and the grep
command.
We learned how to list just the FASTA files in a directory using the wildcard character but we can also achieve this using the ls
command, pipe'ing, and the grep command:
sequences/ $ ls | grep "\.fasta$"
sequences_X.fasta
sequences_Y.fasta
sequences_Z.fasta
This may not seem very useful, since we can do the same with fewer
commands using the wildcard character, but pipe'ing allows you to construct complex combinations of commands in an easy manner.
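A slightly longer chain makes the benefit clearer. The sketch below is illustrative, using made-up files in a scratch directory: each | hands the output of the command on its left to the command on its right. wc -l counts lines and tr -d ' ' strips the padding that some versions of wc add.

```shell
# Work in a throwaway directory with a few files.
scratch=$(mktemp -d)
cd "$scratch"
touch sequences_X.fasta sequences_Y.fasta notes.txt
printf '> sequence 1\nTTGACCA\n> sequence 2\nTTGAAAA\n> sequence 3\nCCATTG\n' > sequences_Z.fasta

# Three stages: list names, keep those ending in .fasta, count them.
fasta_count=$(ls | grep "\.fasta$" | wc -l | tr -d ' ')
echo "$fasta_count"     # 3

# Filters can be stacked: keep lines starting with TTG,
# then keep only those that also contain CCA.
both_hits=$(grep "^TTG" sequences_Z.fasta | grep "CCA")
echo "$both_hits"       # TTGACCA
```

In a grep pattern, ^ anchors a match to the start of a line and $ to the end, which is different from the * wildcard the shell uses for file names.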
Exercise #7: Using Unix pipelining, list all sequences under the sequences
directory and use the grep
command to extract all sequences that begin with ACT
and then grep these lines again for the GGC
nucleotide sequence.
Pipe'ing is useful to chain commands together that operate on a "stream" of output but sometimes we want to execute commands that may work in conjunction but not necessarily on the direct output of an adjacent command.
The Unix for
loop is a construct we can use to execute one or more actions on a collection that we can define. Below is the basic format of a for
loop:
for <ITEM> in <COLLECTION>
do
<ACTIONS>
done
There is a lot to digest here but we can break it down line by line. Our first line:
for <ITEM> in <COLLECTION>
allows us to iterate over each of the elements in <COLLECTION>
and stores the current element in a variable that we define as <ITEM>
. A more relevant example here
would be if we wanted to iterate over each of the sequence files ending in .fasta in the current directory:
for file in *.fasta
We store each of the FASTA files while we are iterating over all FASTA files in a variable named file
and we can access each element by calling this variable as we will see below.
Our next line is telling Unix that we are about to pass one or more actions to be executed on each item in our collection:
do
Next up we supply any actions we'd like to execute. Following our sequence example, let's create a new directory for each file we find and store the sequence ID's in a text file in this new folder:
for file in *.fasta
do
mkdir ${file%%.*}
grep ">" $file > ${file%%.*}/fasta_headers.txt
The important thing to note here is that when we want to make use of the item we are iterating over we need to prepend the $
character to whatever name we provided to our variable. In our example above our variable is named file
so when we intend to use it in any Unix command we can call up the file by supplying $file
to the command. The key here being the $
attached to our filename to indicate to Unix that we are calling up the value stored in a variable.
The final line simply tells Unix that we are ending our for
loop and to execute the actions on all of the items on our collection:
done
Putting it all together, it looks something like this:
Note: Here ${file%%.*}
strips everything from the first . onward (here, the .fasta
extension), so each directory is named after its file without the extension.
for file in *.fasta
do
mkdir ${file%%.*}
grep ">" $file > ${file%%.*}/fasta_headers.txt
cp $file ${file%%.*}/sequences.txt
done
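Two details of the loop are worth a closer look: how ${file%%.*} trims a name, and why quoting variables matters. The sketch below is a minimal illustration under assumed names (sample.v2.fasta and "my reads.fasta" are invented for the example). It compares %% with the single % form and then runs a quoted version of the loop on a file whose name contains a space.

```shell
# %% removes the longest suffix matching the pattern; % removes the shortest.
file="sample.v2.fasta"
longest="${file%%.*}"
shortest="${file%.*}"
echo "$longest"      # sample
echo "$shortest"     # sample.v2

# Quoting "$f" keeps a name with spaces together as a single argument.
scratch=$(mktemp -d)
cd "$scratch"
printf '> sequence 1\nACGT\n' > "my reads.fasta"

for f in *.fasta
do
    dir="${f%%.*}"                             # directory named after the file
    mkdir -p "$dir"                            # -p: no error if it already exists
    grep ">" "$f" > "$dir/fasta_headers.txt"   # save the header lines
done

ls -R
```

Without the quotes, the shell would split "my reads.fasta" into two words and the commands inside the loop would fail.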
Exercise #8: Run the supplied for
loop. Try adding a step to make a copy of the sequence file in the newly created folder as well.