---
output:
  pdf_document: default
  html_document: default
---
# Essential Unix/Linux Terminal Knowledge
Unix was first developed at AT&T Bell Labs, beginning in 1969. Formally, "UNIX" is a
trademarked operating system,
but when most people talk about "Unix"
they are talking about the _shell_, which is the text-command-driven interface
by which Unix users interact with the computer.
The Unix shell has been around, largely unchanged, for many decades because it is _awesome_.
When you learn it, you aren't learning a fad, but, rather, a mode of interacting
with your computer that has been time tested and will likely
continue to be the lingua franca of large computer systems
for many decades to come.
For bioinformatics, Unix is the tool of choice for a number of reasons: 1) complex analyses
of data can be undertaken with a minimum of words; 2) Unix allows automation of tasks,
especially ones that are repeated many times; 3) the standard set of Unix commands includes
a number of tools for managing large files and for inspecting and manipulating text files; 4) multiple,
successive analyses upon a single stream of data can be expressed and executed efficiently,
typically without the need to write intermediate results to the disk;
5) Unix was developed when computers were extremely limited in terms of memory and speed. Accordingly,
many Unix tools have been well optimized and are appropriate to the massive genomic data sets
that can be taxing even for today's large, high performance computing systems; 6) virtually all state-of-the-art
bioinformatic tools are tailored to run in a Unix environment; and finally, 7) essentially
every high-performance computer cluster runs some variant of Unix, so if you are going to be
using a cluster for your analyses (which is highly likely), then you have gotta know Unix!
## Getting a bash shell on your system
A special part of the Unix operating system is the "shell." This is the system that
interprets commands from the user. At times it behaves like an interpreted programming
language, and it also has a number of features that help to minimize the amount of
typing the user must do to complete any particular tasks. There are a number of
different "shells" that people use. We will focus on one called "bash," which stands
for the "Bourne again shell." Many of the shells share a number of features.
Many common operating systems are built upon Unix or upon Linux, an open-source flavor of
Unix that is, in many scenarios, indistinguishable from it. (Hereafter we will refer
to both Unix and Linux as "Unix" systems.) For example, all Apple Macintosh computers are
built on top of the Berkeley Software Distribution of Unix, and bash was, for many years, the default
shell on them. Many people these days
use laptops that run a flavor of Linux like Ubuntu, Debian, or RedHat. Linux
users should ensure that they are running the bash shell. This can be done
by typing "bash" at the command line, or inserting that into their profile.
To know what shell is currently running you can type:
```sh
echo $0
```
at the Unix command line. If you are running `bash` the result should
be
```sh
-bash
```
PCs running Microsoft Windows are something of the exception in the computer world, in that
they are not running an operating system built on Unix. However, Windows 10 and later can run
the Windows Subsystem for Linux (WSL). For Windows, it is also possible to install a lightweight
implementation of bash (like Git Bash). This is helpful for learning how to use Unix,
but it should be noted that most bioinformatic tools are still difficult to install
on Windows.
Mac computers used to come from the store configured to use bash as the default
shell; however, since the Catalina OS, Apple now sets a different shell---the Z-shell,
or zsh---as the default. This is apparently because of some changes to bash's
license status. However, an old version of bash is still on the Apple system, and it
can be set as the default (which is what I do, because bash is what is used in
most bioinformatics). Doing so involves changing the default shell with the `chsh` command (`chsh -s /bin/bash`) and then adding a line that looks like:
```sh
export BASH_SILENCE_DEPRECATION_WARNING=1
```
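to your `~/.bash_profile` file, which is the file bash reads when a Mac terminal window opens a login shell (if you keep your settings in `~/.bashrc`, make sure that `~/.bash_profile` sources it). (This will all make more sense after you have digested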
the contents of this chapter.)
## Navigating the Unix filesystem
Most computer users will be familiar with the idea of saving documents into "folders."
These folders are typically navigated using a "point-and-click" interface
like that of the Finder in Mac OS X or the File Explorer in a Windows system.
When working in a Unix shell, such a point-and-click interface is typically not available,
and the first hurdle that new Unix users must surmount is learning to quickly navigate in
the Unix filesystem from a terminal prompt. So,
we begin our foray into Unix and its command prompt with this essential skill.
When you start a Unix shell in a terminal window you get a _command prompt_
that might look something like this:
```sh
my-laptop:~ me$
```
or, perhaps something as simple as:
```sh
$
```
or maybe something like:
```sh
/~/--%
```
We will adopt the convention in this book that, unless we are intentionally doing something
fancier, the Unix command prompt is given
by a percent sign, and this will be used when displaying text typed at a command
prompt, followed by output from the command. For example
```sh
% pwd
/Users/eriq
```
shows that I issued the Unix command `pwd`, which instructs the computer to
**p**rint **w**orking **d**irectory, and the computer responded by printing
`/Users/eriq`, which, on my Mac OS X system is my _home directory_.
In Unix parlance, rather than speaking of "folders," we call
them "directories;" however, the two are essentially the same thing.
Every user
on a Unix system has a home directory. It is the domain on a shared computer
in which the user has privileges to create and delete files and do work.
It is where most of your work will happen. When you are working in the Unix
shell there is a notion of a
_current working directory_---that is to say, a place within the hierarchy of
directories where you are "currently working." This will become more concrete
after we have encountered a few more concepts.
The specification `/Users/eriq` is what is known as an _absolute path_, as it provides the
"address" of my home directory, `eriq`, on my laptop, starting from the _root_
of the filesystem. Every Unix computer system has a root directory (you can
think of it as the "top-most" directory in a hierarchy), and on every Unix system
this root directory always has the special name, `/`. The address
of a directory relative to the root is specified by starting with the root (`/`)
and then naming each subsequent directory that you must go inside of in order to
get to the destination, each separated by a `/`. For example, `/Users/eriq`
tells us that we start at the root (`/`), then go into the `Users` directory,
and then, from there, into the `eriq` directory. Note that `/` is used to mean the root
directory when at the beginning of an absolute path, but in the remainder of the path
its meaning is different: it is used merely as a separator
between directories nested within one another. Figure \@ref(fig:file-hierarchy) shows
an example hierarchy of some of the directories that are found on the author's
laptop.
```{r file-hierarchy, echo=FALSE, fig.align='center', dpi=50, fig.cap="A partial view of the directories on the author's laptop."}
knitr::include_graphics("figs/file-hierarchy.png")
```
From this perspective it is obvious that the directory `eriq` lives inside `Users`, and also that, for example,
the absolute path of the directory `git-repos` would be `/Users/eriq/Documents/git-repos`.
Absolute paths give the precise location of a directory relative to the root of the filesystem,
but it is not always convenient, nor appropriate, to work entirely with absolute paths.
For one thing, directories that are deeply nested within many others can have long and unwieldy
absolute path names that are hard to type and can be difficult to remember. Furthermore, as we will
see later in this book, absolute paths are typically not _reproducible_ from one computer's
filesystem to another. Accordingly, it is more common to give the address of directories using
_relative paths_. Relative paths work much like absolute paths; however, they do not start with
a leading `/`, and hence they do not take as their
starting point the root directory. Rather, their starting point is implicitly taken
to be the current working directory. Thus, if the current working directory is
`/Users/eriq`, then the path `Documents/pers` is a relative path to the
`pers` directory, as can again be seen in Figure \@ref(fig:file-hierarchy).
The special relative path symbol `..` means "the directory that is one level higher up
in the hierarchy." So, if the current working directory were `/Users/eriq/Documents/git-repos`,
then the path `..` would mean `/Users/eriq/Documents`, the path
`../work` gives the directory `/Users/eriq/Documents/work`, and, by using two or more `..` symbols
separated by forward slashes, we
can even go up multiple levels in the hierarchy: `../../../zoe` is a relative path for
`/Users/zoe`, when the current working directory is `/Users/eriq/Documents/git-repos`.
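To try these rules out without touching your real home directory, here is a small sketch that rebuilds part of the hierarchy from the figure inside a temporary directory. The `sandbox` variable and the use of `mktemp` are my own assumptions for illustration, and the commands are shown without the `%` prompt so that they can be pasted into a shell.

```sh
# Build a throwaway copy of part of the example hierarchy.
# `sandbox` is a hypothetical name; pwd -P resolves any symbolic links in the temp path.
sandbox=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$sandbox/Users/eriq/Documents/git-repos" \
         "$sandbox/Users/eriq/Documents/work" \
         "$sandbox/Users/zoe"

# Make git-repos the current working directory
cd "$sandbox/Users/eriq/Documents/git-repos"

# Resolve each relative path by changing to it in a subshell and printing the result
up_one=$(cd .. && pwd)                 # ends in Users/eriq/Documents
sibling=$(cd ../work && pwd)           # ends in Users/eriq/Documents/work
other_user=$(cd ../../../zoe && pwd)   # ends in Users/zoe

printf '%s\n' "$up_one" "$sibling" "$other_user"
```

Because each `cd` runs inside `$( )`, the working directory of the shell you are typing in stays at `git-repos` throughout.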
When naming paths, another
useful Unix shorthand is `~` (a tilde) which denotes the user's home directory. This is particularly
useful since most of your time in a Unix filesystem will be spent in a directory within your
home directory. Accordingly, `~/Documents/work` is a quick shorthand for
`/Users/eriq/Documents/work`. This is essential practice if you are working on a large shared computing
resource in which the absolute path to your home directory might be changed by the
system administrator when restructuring the filesystem.
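A quick way to convince yourself of what `~` turns into is to let the shell expand it and print the result. This is only a sketch: the variable names are hypothetical, and the `Documents/work` directory does not need to exist for the expansion to happen.

```sh
# An unquoted leading ~ is replaced by the shell with your home directory
expanded=~/Documents/work
echo "$expanded"

# Inside quotation marks the ~ is left as a literal character
quoted="~/Documents/work"
echo "$quoted"
```

The second case is worth remembering: a path that begins with `~` inside quotation marks is not expanded.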
```{block2, note-text, type='rmdtip'}
**A useful piece of terminology:** in any path, the "final" directory
name is called the _basename_ of the path. Hence the basename of `/Users/eriq/Documents/git-repos`
is `git-repos`. And the basename of `../../Users` is `Users`.
```
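If you want the shell to pick out a basename for you, the standard utilities `basename` and `dirname` (not otherwise introduced in this chapter) will do it. A minimal sketch, with hypothetical variable names:

```sh
# basename and dirname only manipulate the text of the path,
# so these directories need not exist on your machine
base1=$(basename /Users/eriq/Documents/git-repos)   # git-repos
base2=$(basename ../../Users)                       # Users
parent=$(dirname /Users/eriq/Documents/git-repos)   # /Users/eriq/Documents
echo "$base1 $base2 $parent"
```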
### Changing the working directory with `cd`
When you begin a Unix terminal session, the current working
directory is set, by default, to your home directory. However, when you are doing
bioinformatics or otherwise hacking on the command line, you will typically
want to be "in another directory" (meaning you will want the current working
directory set to some other directory). For this, Unix provides the `cd` command, which
stands for **c**hange **d**irectory. The syntax is simple:
`cd` _path_
where _path_ is an absolute or a relative path. For example, to
get to the `git-repos` directory from my home directory would require
a simple command: `cd Documents/git-repos`. Once there, I could change to
my `Desktop` directory with `cd ../../Desktop`. Witness:
```sh
% pwd
/Users/eriq
% cd Documents/git-repos/
% pwd
/Users/eriq/Documents/git-repos
% cd ../../Desktop
% pwd
/Users/eriq/Desktop
```
Once you have used `cd`, the working directory of your current shell will
remain the same no matter how many other commands you issue,
until you invoke the `cd` command another time and change
to a different directory.
If you give the `cd` command with no path specified, your working directory
will be set to your home directory. This is super-handy if you have been
exploring the levels of a Unix filesystem above your home directory and cannot
remember how to get back to your home directory. Just remember that
```sh
% cd
```
will get you back home.
Another useful shortcut is to supply `-` (a hyphen) as the path to `cd`. This will
change the working directory back to where you were before your last invocation
of `cd`, and it will tell you which directory you have returned to. For example, if you start in `/Users/eriq/Documents/git-repos` and then
`cd` to `/bin`, you can get back to `git-repos` with `cd -` like so:
```sh
% pwd
/Users/eriq/Documents/git-repos
% cd /bin/
% pwd
/bin
% cd -
/Users/eriq/Documents/git-repos
% pwd
/Users/eriq/Documents/git-repos
```
Note that the output of `cd -` is the newly-returned-to current working directory.
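Behind the scenes, the shell keeps the previous working directory in a variable called `OLDPWD`, and that is where `cd -` takes you. The following sketch shows this in a temporary directory; the `sandbox` name and its two subdirectories are assumptions made up for the example.

```sh
# Hypothetical sandbox with two directories to hop between
sandbox=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$sandbox/git-repos" "$sandbox/bin"

cd "$sandbox/git-repos"
cd "$sandbox/bin"

# The shell remembers where we were before the last cd
previous=$OLDPWD
echo "previous directory: $previous"

# cd - returns there and prints the directory it changed to
cd -
after_dash=$(pwd)
```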
### Updating your command prompt {#comm-prompt}
When you are buzzing around in your filesystem, it is often difficult to remember
which directory you are in. You can always type `pwd` to figure that out,
but the bash shell also provides a way to print the current working directory
_within your command prompt_.
For example, the command:
```sh
PS1='[\W]--% '
```
redefines the command prompt to be the basename of the current directory surrounded
by brackets and followed by `--%`:
```sh
% pwd
/Users/eriq/Documents/git-repos
% PS1='[\W]--% '
[git-repos]--% cd ../
[Documents]--% cd ../
[~]--% cd ../
[Users]--%
```
This can make it considerably easier to keep track of where you are in your file system.
We will discuss later how to invoke this change automatically in every terminal session
when we talk about customizing environments in Section \@ref(unix-env).
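If you would like to experiment with prompts, it is handy to save the current one first so that you can put it back. The sketch below does that and also tries `\w`, a bash prompt code that gives the full path of the working directory (with your home directory abbreviated to `~`) rather than just its basename. The `old_PS1` variable is a hypothetical name of my choosing.

```sh
# Remember the current prompt so it can be restored (empty if none was set)
old_PS1=${PS1-}

# \W is the basename of the working directory, as in the text
PS1='[\W]--% '

# \w is the full path instead, with the home directory abbreviated to ~
PS1='[\w]--% '

# Put the original prompt back
PS1=$old_PS1
```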
### TAB-completion for paths
Let's be frank...typing path names in order to change from one directory to another can feel
awfully tedious, especially when your every neuron is screaming, "Why can't I just have a friggin' Finder
window to navigate in!" Do not despair. This is a normal reaction when you first start using Unix.
Fortunately, Unix file-system navigation can be made much less painful (or even enjoyable)
for you by becoming a master of _TAB-completion_. Imagine the Unix shell is watching
your every keystroke and trying to guess what you are about to type. If you type the first part
of a directory name after a command, like `cd` and then hit the TAB key, the shell will respond
with its best guess of how you want to complete what you are typing.
Take the file hierarchy of Figure \@ref(fig:file-hierarchy), and imagine that we are in the root
directory. At that point, if we type `cd A`, the shell will think, "Ooh! I'll bet they want to
change into the directory `Applications` because that is the only directory that starts with `A`." Sure enough,
if you hit TAB, the shell adds to the command line so that `cd A` becomes `cd Applications/`
and the cursor is still waiting for further input at the end of the command.
Boom! That was way easier (and more accurate) than typing all those letters after `A`.
Developing a lightning-fast TAB-completion trigger finger is, quite seriously, essential to surviving and
thriving in Unix. Use your left pinky to hit TAB. Hone your skills. Make sure you can hit TAB with your eyes
closed. TAB early and TAB often!
Once you can hit TAB instantly from within the middle of any phrase, you
will also want to understand a few simple rules of TAB completion:
1. If you try TAB-completing a word on the command line that is not at the beginning
    of the command line (i.e., you are typing a word after a command like `cd`), then the shell
    tries to complete the word with a _directory name_ or a _file name_.
1. The shell will only complete an _entire_ directory or file name if the name _uniquely_ matches the first part of the
    path that has been entered. In our example, there were no other directories than `Applications` in `/` that start
    with `A`, so the shell was certain that we must have been going for `Applications`.
1. If there is more than one directory or file name that matches what you have already typed, then, the first
    time you hit TAB, nothing happens, but the _second_ time you hit TAB, the shell will print a list of
    names that match what you have written so far. For example, in our Figure \@ref(fig:file-hierarchy) example,
    hitting TAB after typing `cd ~/D` does nothing. But the second time we hit TAB we get a list of
    matching names:
    ```{sh, eval=FALSE}
    % cd ~/D
    Desktop/    Documents/  Downloads/
    ```
    So, if we are heading to `Documents`, we can see that adding `oc` to our command line, to create `cd ~/Doc`, would be sufficient to allow the shell to
    uniquely and correctly guess where we are heading. `cd ~/Doc` will TAB-complete into `cd ~/Documents/`.
1. If there are multiple directory or file names that match the current command line, and they share
    more letters than those currently on the command line, TAB-completion will complete
    the name to the end of the shared portion of the name. An example helps: let's say
    I have the following two directories with hideously long names in my `Downloads` folder:
    ```{sh, eval=FALSE}
    WIFL.rep_indiv_est.mixture_collection.count.gr8-results
    WIFL.rep_indiv_est.mixture_collection.count-results
    ```
    Then, TAB-completing on `~/Downloads/WIFL.rep` will partially complete so that the prompt and command look like:
    ```{sh, eval=FALSE}
    % cd ~/Downloads/WIFL.rep_indiv_est.mixture_collection.count
    ```
    and hitting TAB twice gives:
    ```{sh, eval=FALSE}
    % cd ~/Downloads/WIFL.rep_indiv_est.mixture_collection.count
    WIFL.rep_indiv_est.mixture_collection.count-results
    WIFL.rep_indiv_est.mixture_collection.count.gr8-results
    ```
    At this point, adding `-` and TAB-completing will give the first of the two directories in that list, `WIFL.rep_indiv_est.mixture_collection.count-results`.
The last example shows just how much typing TAB completion can save you. So, don't be
shy about hitting that TAB key. When navigating your filesystem (or writing longer command
lines that require paths of files) you should consider hitting TAB after every 1 or 2 letters.
In routine work on the command line, probably somewhere around 25% or more of my keystrokes
are TABs. Furthermore, a TAB is never going to execute a command, and it typically won't
complete to a path that you don't want (unless you got the first part of its name wrong), so there
isn't any risk to hitting TAB all the time.
### Listing the contents of a directory with `ls`
So far we have been focusing mostly on directories. However, directories themselves
are not particularly interesting---they are merely containers. It is the _files_ inside of directories
that we typically work on. The command `ls` lists the contents---typically files or
other directories---within a directory.
Invoking the `ls` command without any other arguments (without anything after it)
returns the contents of the current working directory. In our example,
if we are in `/Users` then we get:
```sh
% ls
eriq zoe
```
By default, `ls` gives output in several columns of text, with the directory contents
sorted lexicographically. For example, the following is output from the `ls` command
in a directory on a remote Unix machine:
```sh
% ls
bam                      map-sliced-fastqs-etc.sh
bam-slices               play
bwa-run-list.txt         REDOS-map-sliced-fastqs-etc.sh
fastq-file-prefixes.txt  sliced
fqslice-22.error         slice-fastqs.sh
fqslice-22.log           slicer-lines.txt
map-etc.sh               Slicer-Logs-summary.txt
```
The first line shows the command prompt and the command: `% ls`, and the remainder is
the output of the command.
Invoked without any further arguments, the `ls`
command simply lists the contents of the current working directory. However,
you can also direct `ls` to list the contents of another directory by simply
adding the path (absolute or relative) of that directory on the command line. For example, continuing with
the example in Figure \@ref(fig:file-hierarchy), when we are in the home directory (`eriq`)
we can see the directories/files
contained within `Documents` like so:
```sh
[~]--% ls Documents
git-repos/ pers/ work/
```
If you give paths to more than one directory as arguments to `ls`, then
the contents of each directory are listed after a heading line that gives
the directory's path (as given as an argument to `ls`), followed by a colon. For example:
```sh
[~]--% ls Documents/git-repos Documents/work
Documents/git-repos:
ARCHIVED_mega-bioinf-pop-gen.zip  lowergranite_0.0.1.tar.gz
AssignmentAdustment/              mega-bioinf-pop-gen-examples/
CKMRsim/                          microhaps_np/

Documents/work:
assist/           maps/  oxford/     uw_days/
courses_audited/  misc/  personnel/
```
You might also note in the above example that some of the paths listed within
each of the two directories are followed by a slash, `/`. This comes from an `ls` customization (for example, the `-F` option) and denotes that
they are directories themselves. Much like your command prompt, `ls` can be customized in ways
that make its output more informative. We will return to that in Section \@ref(unix-env).
If you pass the path of a file to `ls`, and that file exists in your filesystem,
then `ls` will respond by printing the file's path:
```sh
% ls Documents/git-repos/lowergranite_0.0.1.tar.gz
Documents/git-repos/lowergranite_0.0.1.tar.gz
```
If the file does not exist you get an error message to that effect:
```sh
% ls Documents/try-this-name
ls: Documents/try-this-name: No such file or directory
```
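Besides printing an error message, `ls` also reports failure through its exit status, which is what lets a script test whether a path exists. Here is a minimal sketch of that idea; `check_path` is a hypothetical helper function and `sandbox` is a temporary directory created just for the example.

```sh
# Temporary directory holding one file named as in the example above
sandbox=$(mktemp -d)
touch "$sandbox/lowergranite_0.0.1.tar.gz"

# check_path is a hypothetical helper: it reports whether ls can find a path
check_path() {
    if ls "$1" > /dev/null 2>&1; then
        echo "exists"
    else
        echo "missing"
    fi
}

found=$(check_path "$sandbox/lowergranite_0.0.1.tar.gz")
not_found=$(check_path "$sandbox/try-this-name")
echo "$found $not_found"
```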
The multi-column, default output of `ls` is useful when you want
to scan the contents of a directory, and quickly see as many files
as possible in the fewest lines of output.
However, this output format is not well
structured. For example, you don't know how many columns are going to be used in
the default output of `ls` (that
depends on the length of the filenames and the width of your terminal), and it
offers little information beyond the names of the files.
You can direct the `ls` command to provide more information, by using it with the `-l`
option (that is a lower case "L", for "long"). Appropriately, with the `-l` option, the `ls` command will return
output in _long_ format:
```sh
2019-02-08 21:09 /osu-chinook/--% ls -l
total 108
drwxr-xr-x  2 eriq kruegg  4096 Feb  7 08:26 bam
drwxr-xr-x 14 eriq kruegg  4096 Feb  8 15:56 bam-slices
-rw-r--r--  1 eriq kruegg 17114 Feb  7 20:16 bwa-run-list.txt
-rw-r--r--  1 eriq kruegg   824 Feb  6 14:14 fastq-file-prefixes.txt
-rw-r--r--  1 eriq kruegg     0 Feb  7 20:14 fqslice-22.error
-rw-r--r--  1 eriq kruegg     0 Feb  7 20:14 fqslice-22.log
-rwxr--r--  1 eriq kruegg  1012 Feb  7 07:59 map-etc.sh
-rwxr--r--  1 eriq kruegg  1138 Feb  7 20:56 map-sliced-fastqs-etc.sh
drwxr-xr-x  3 eriq kruegg  4096 Feb  7 13:01 play
-rwxr--r--  1 eriq kruegg  1157 Feb  8 15:08 REDOS-map-sliced-fastqs-etc.sh
drwxr-xr-x 14 eriq kruegg  4096 Feb  8 15:49 sliced
-rwxr--r--  1 eriq kruegg   826 Feb  7 20:09 slice-fastqs.sh
-rw-r--r--  1 eriq kruegg  1729 Feb  7 16:11 slicer-lines.txt
```
Each row contains information about a single file.
The first column indicates what kind of file
each entry is (a leading `d` marks a directory and a leading `-` marks a regular file), and also tells us which users have permission to
do certain things with the file (more on this in Section \@ref(file-perm)).
The second column gives the number of hard links to the entry.
The third and fourth columns show that the owner of
each file is `eriq` and that each file belongs to the group called `kruegg`. After that
we see the size of the file (in bytes), the date and time it was last modified, and, finally, its name.
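To see the file-type character for yourself, you can make one directory and one file and look at just the first character of each long-format line. This is only a sketch: `sandbox` is a hypothetical temporary directory, `cut -c1` (a standard utility not covered in this section) keeps the first character of a line, and `-d` asks `ls` to describe the directory itself rather than its contents.

```sh
# Temporary directory with one subdirectory and one regular file
sandbox=$(mktemp -d)
mkdir "$sandbox/bam"
touch "$sandbox/bwa-run-list.txt"

# Keep only the first character of each long-format line
dir_type=$(ls -ld "$sandbox/bam" | cut -c1)                # d
file_type=$(ls -l "$sandbox/bwa-run-list.txt" | cut -c1)   # -
echo "directory: $dir_type  regular file: $file_type"
```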
There are a few options to `ls` that are particularly useful. One is `-a`, which causes
`ls` to include in its listing all files, even _hidden_ ones. In a Unix file system,
any file whose name starts with a `.` is considered a _hidden_ file. Commonly, such
files are configuration files or other files used by programs that you typically
don't interact with directly. (We will see an example of this when we start working with `git` for version control, Section \@ref(git-workings).) The `-d` option for `ls` is also
quite handy. Recall that when you provide the name of a directory as an argument to `ls`,
the default behavior is to list the contents of the directory. This can be troublesome
when you are listing the contents of a subdirectory: `ls ~/Documents/git-repos/*` lists the
contents (which can be substantial) of each of the directories in my `git-repos` directory, but
I might only want to know the name of each of those directories, rather than the full contents
of each directory.
`ls -d ~/Documents/git-repos/*` will do that for you. Finally, the `-R` option will cause `ls` to drill down, _recursively_, into all the subdirectories of the one you
supplied to the command, and list their contents as well.
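The differences between these options are easiest to see side by side. The sketch below sets up a small, made-up directory tree (the names `sandbox`, `repoA`, `repoB`, and `.hidden-config` are all hypothetical) and runs `ls` on it in several ways.

```sh
# Two directories with one file each, plus one hidden file
sandbox=$(mktemp -d)
mkdir -p "$sandbox/repoA" "$sandbox/repoB"
touch "$sandbox/.hidden-config" "$sandbox/repoA/file1.txt" "$sandbox/repoB/file2.txt"
cd "$sandbox"

ls            # repoA and repoB only
ls -a         # also shows ., .., and .hidden-config
ls repo*      # the contents of each directory, under a heading line
ls -d repo*   # just the names repoA and repoB
ls -R         # everything below the current directory

# Capture two of the listings so they can be inspected afterwards
names_only=$(ls -d repo*)
with_hidden=$(ls -a)
```

Note that when the output of `ls` is captured or piped rather than printed to the terminal, it is written one name per line instead of in columns.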
### Globbing
If you have ever had to move a large number of files of a certain type from
one folder to another in a Finder window, you know that clicking and
selecting each file and then individually dragging it could be a tedious task.
Of course, in a graphical file browser you can select multiple files to move at
once. Unix provides a similar system for operating upon multiple files at once; however,
it works a little differently, and is based on defining groups of files according
to matching their names to certain patterns. This system, called _filename expansion_ or
"globbing," quickly provides the names of a large number of files and paths, which lets
you operate on multiple files quickly and efficiently. In short, globbing allows for
_wildcard matching_ in path names. This means that you can
specify multiple files that have names that share a common part, but differ in other parts.
The most widely used (and the most permissive) wildcard is the asterisk, `*`. It matches
anything in a file name. So, for example:
- `*.vcf` will expand to any files in the current directory with the suffix `.vcf`.
- `D*s` will expand to any files that start with an uppercase `D` and end with an `s`.
- `*output-*.txt` will expand to any files that include the phrase `output-` somewhere
in their name and also end with `.txt`.
- `*` will expand to all files in the current working directory.
- `/usr/local/*/*.sh` will expand to any files ending in `.sh` that reside within any directory that
is within the `/usr/local` directory.
```{block2, note-dot-files, type='rmdnote'}
**Actually, there is some arcana here:** Names of files or directories that start with a
dot (a period) will not expand unless the
dot is included explicitly. Files with names starting with a dot are
"hidden" files in Unix. You also will not see them in the results of `ls`, unless you
use the `-a` option: `ls -a`.
```
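Here is a small sketch of `*` at work, including the dot-file wrinkle. The file names and the `sandbox` directory are invented for the example, and `echo` is used simply to print whatever the shell expanded each pattern into.

```sh
# Temporary directory with three visible files and one hidden file
sandbox=$(mktemp -d)
cd "$sandbox"
touch sample1.vcf sample2.vcf notes.txt .hidden.vcf

all_vcf=$(echo *.vcf)       # sample1.vcf sample2.vcf
everything=$(echo *)        # notes.txt sample1.vcf sample2.vcf
hidden_vcf=$(echo .*.vcf)   # .hidden.vcf
printf '%s\n' "$all_vcf" "$everything" "$hidden_vcf"
```

The hidden file only turns up in the last pattern, where the leading dot is written out explicitly.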
After the asterisk, the next most commonly used wildcard is the question mark, `?`. The question mark
denotes any single character in a file name. For example, if you had a series of files that looked
like `AA-file.txt`, `AB-file.txt`, ..., `AZ-file.txt`, you could get all of those by
using `A?-file.txt`. This would not expand to, for example, `AAZ-file.txt`, if that were in the directory.
You can be more specific in globbing by putting things within `[` and `]`. For example:
`A[A-D]*` would pick out any files starting with `AA`, `AB`, `AC`, or `AD`. Or you could
have said `A[a-d]*` which would get any files starting with `Aa`, `Ab`, `Ac`, or `Ad`. And you
can also do it with numbers: `[0-9]`. You can also negate the contents of the `[]` with `^` (or, equivalently, `!`). Thus,
`100_[^ABC]*` picks out all files that start with `100_` followed by anything that is _not_ an `A`, `B`, or a `C`.
Finally, you can be even more specific about replacements in file names by iterating over
different possibilities with a comma-separated list within curly braces. For example, `img.{png,jpg,svg}`
will iterate over the values in curly braces and expand to `img.png img.jpg img.svg`. Interestingly,
with curly braces, this forms all those file names whether they exist or not. So, unlike `*` it isn't
really matching available file names.
The last thing to note about all of these globbing constructs is that they are not intimately
associated with the `ls` command. Rather, they simply provide expansions on the command
line, and the `ls` command is listing all those files. For example, try `echo *.txt`.
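The same `echo` trick makes it easy to check what the other constructs do. In this sketch the file names and the `sandbox` directory are hypothetical, and the brace expansion is run through `bash -c` on the assumption that you might paste this into a shell other than bash, where braces may not be expanded.

```sh
# Temporary directory with four made-up files
sandbox=$(mktemp -d)
cd "$sandbox"
touch AA-file.txt AB-file.txt AZ-file.txt AAZ-file.txt

single=$(echo A?-file.txt)   # AA-file.txt AB-file.txt AZ-file.txt
ranged=$(echo A[A-D]*)       # the three names that start with AA or AB

# Brace expansion is a bash feature, so it is run through bash explicitly here
braces=$(bash -c 'echo img.{png,jpg,svg}')   # img.png img.jpg img.svg

printf '%s\n' "$single" "$ranged" "$braces"
```

None of the `img` files exist in the sandbox, yet all three names are printed, which is the difference between brace expansion and true pattern matching.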
### What makes a good file-name?
If the foregoing discussion suggests to you that it might not be good to use an
actual `*`, `?`, `[`, or `{` in names that you give to files and directories
on your Unix system, then congratulations on your intuition! Although you can use
such characters in your filenames, they have to be preceded by a backslash, and it
gets to be a huge hassle. So don't use them in your file names. Additionally,
characters such as `#`, `|`, and `:` do not play well for file names. Don't use them!
Another pet peeve of mine (and of anyone who uses Unix) is file names that have spaces in them.
In Windows and on a Mac it is easy to create file names that have spaces in them. In fact, the
standard Windows system comes with such space-containing directory names as `My Documents` or `My
Pictures`. Yikes! Please _don't ever do that in your Unix life!_ One can deal with spaces in file
names, but there is really no reason to include spaces in your file names, and having spaces in file
names will typically break a good many scripts. Rather than a space, use an underscore, `_`, or a
dash, `-`. You must admit that, not only does `My-Documents` work better, but it actually
_looks_ better too!
However, should you have to deal with files having spaces in their name, you can
address them by either backslash-escaping the spaces, or putting the whole
file name in quotation marks (single or double quotation marks will work).
If you have a file called `dumb file name.jpg`, you can address it on the
command line in any of the following three ways:
```sh
dumb\ file\ name.jpg
"dumb file name.jpg"
'dumb file name.jpg'
```
To make your life easier, however, the bottom line is that you should name your files
on a Unix system using only upper- and lowercase letters (Unix file systems are
case-sensitive), numerals, and the following three punctuation characters: `.`, `-`, and `_`.
Though you can use other punctuation characters, they often require special treatment, and it
is better to avoid them altogether.
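As a sketch of why spaces tend to break scripts, consider what happens when a file name is stored in a shell variable. The file name is hypothetical, and the example assumes bash (or another shell that splits unquoted variables on spaces).

```sh
# Work in a throwaway directory
demo=$(mktemp -d)
cd "$demo" || exit 1
touch "dumb file name.jpg"

f="dumb file name.jpg"

# Unquoted: the shell splits the value into three words, so ls looks for
# files named dumb, file, and name.jpg, and fails.
# (|| runs the echo only if the command before it fails.)
ls $f 2> /dev/null || echo "unquoted: ls could not find the file"

# Quoted: the whole value is handed to ls as a single argument
ls "$f"
```

A script written without the quotation marks works for years on well-named files and then fails the first time it meets a name with a space in it.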
## The anatomy of a Unix command
Nearly every Unix command that you might invoke follows a certain pattern. First comes
the `command` itself. This is the word that tells the system the name of the command
that you are actually trying to run. After that, often, you will provide a series
of _options_ that will modify the behavior of the command (for example, as we have seen, `-l`
is an option to the `ls` command). Finally, you might then provide some _arguments_ to the
command. These are typically paths to files or directories that you would like the
command to operate on. So, in short, a typical Unix command invocation will look
like this:
`command` _options_ _arguments_
Of course, there are exceptions. For example, when invoking Java-based programs from your
shell, arguments might be supplied in ways that make them look like options, etc. But, for
the most part, the above is a useful way of thinking about Unix commands.
Sometimes, especially when using `samtools` or `bcftools`, the `command` part of the
command line might include a command and a subcommand, like `samtools view` or
`bcftools query`. This means that the operating system is calling the program
`samtools` (for example), and then samtools interprets the next token (`view`) to
know that it needs to run the `view` routine, and interpret all following
options in that context.
We will now break down each element in
`command` _options_ _arguments_.
### The `command` {#anatomy-command}
When you type a command at the Unix prompt, whether it is a command like `ls` or
one like `samtools` (Section \@ref(samtools)), the Unix system has to search around
the filesystem for a file that matches the command name and which provides the actual
instructions (the computer code, if you will) for what the command will actually do.
It cannot be stressed enough how important it is to
understand where and how the bash shell searches for these command files. Understanding this
well, and knowing how to add directories that the shell searches for executable
commands will alleviate a lot of frustration that often arises with Unix.
In brief, all Unix shells (and the bash shell specifically) maintain
an _environment variable_ called `PATH` that is a colon-separated list of the pathnames
where the shell searches for commands. You can print the `PATH` variable using
the `echo` command:
```sh
echo $PATH
```
On a freshly installed system without many customizations the `PATH` might look like:
```sh
/usr/bin:/bin:/usr/sbin:/sbin
```
which is telling us that, when bash is searching for a command, it searches for a file
of the same name as the command first in the directory `/usr/bin`. If it finds it there, then
it uses the contents of that file to invoke the command. If it doesn't find it there,
then it next searches for the file in directory `/bin`. If it's not there, it searches
in `/usr/sbin`, and finally in `/sbin`. If it does not find the command in any of those directories
then it returns the error `command not found`.
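A minimal sketch for inspecting this search order on your own system follows. It uses two standard tools not yet introduced in the text: `tr`, which here turns each colon into a line break, and `command -v`, which reports what the shell would run for a given command name.

```sh
# Print each directory in PATH on its own line, in the order searched
echo "$PATH" | tr ':' '\n'

# Ask the shell which file it would run for the command ls
command -v ls
```

If `command -v` prints nothing for a program you believe is installed, the directory holding that program is probably not in your `PATH`.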
When you install programs on your own computer system, quite often the installer will modify
a system file that specifies the `PATH` variable upon startup. Thus after installing some
programs that use the command line on a Mac system, the "default" `PATH` might look like:
```sh
/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Library/TeX/texbin
```
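You can also add a directory to the `PATH` yourself. The following is a sketch, assuming a hypothetical directory `$HOME/bin` where you keep your own programs. The change lasts only for the current shell session unless you put the same line in one of your shell's startup files.

```sh
# Prepend a (hypothetical) personal directory of programs to PATH.
# Keeping $PATH at the end preserves all the directories already there.
export PATH="$HOME/bin:$PATH"

# The new directory is now the first one searched
echo "$PATH" | tr ':' '\n' | head -n 1
```

Putting the new directory first means that a program there takes precedence over one of the same name later in the list. Putting it last (`"$PATH:$HOME/bin"`) would do the opposite.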
### The _options_
Sometimes these are called flags, and they provide a convenient way of
telling a Unix command how to operate. We have already seen a few of them,
like the `-a`, `-l` or `-d` options to `ls`.
Most, but not all, Unix tools follow the convention that options specified by a single
letter follow a single dash, while those specified by multiple letters follow two dashes.
Thus, the `tar` command takes the single character options `-x`, `-v`, and `-f`, but also takes
an option named like `--check-links`. Some utilities also have two different names---a single-letter
name and a long name---for many options.
For example, the `bcftools view` program uses either `-a` or `--trim-alt-alleles` to invoke the option
that trims alternate alleles not seen in a given subset of individuals. Other tools, like BEAGLE, are perfectly content to
have options that are named with multiple letters following just a single dash.
Sometimes options take parameter values, like `bcftools view -g het`. In that case, `het` is
a parameter value. Sometimes the parameter values are added to the option with an equals-sign.
With some Unix utilities, single-letter options can be bunched together
following a single dash, so that `tar -xvf` is synonymous with `tar -x -v -f`. This
is not universal, and you should not count on it.
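For a utility that does allow bunching, such as `ls`, you can check the equivalence yourself. This is a small sketch using a throwaway directory. The file names are made up, and `$( )` captures a command's output into a variable.

```sh
# Make a throwaway directory with one visible and one hidden file
demo=$(mktemp -d)
touch "$demo/visible.txt" "$demo/.hidden"

# Options given separately...
separate=$(ls -a -l "$demo")

# ...and the same options bunched together after a single dash
bunched=$(ls -al "$demo")

# The two listings are identical
[ "$separate" = "$bunched" ] && echo "same output"
```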
Holy Cow! This is not terribly standardized, and probably won't make sense until
you really get in there and start playing around in Unix...
### Arguments
These are often file names, or other things that are not preceded by an option flag.
For example, in the `ls` command:
```sh
ls -lrt dir3
```
`-lrt` is giving `ls` the options `-l`, `-r`, and `-t` and `dir3` is the _argument_---the
name of the directory whose contents you should list.
### Getting information about Unix commands
Zheesh! The above looks like a horrible mish-mash. How do we find out
how to use/invoke different commands and programs in Unix? Well, most
programs are documented, and you have to learn how to read the documentation.
If a Unix utility is properly installed, you should be able to find its manual page with
the `man` command. For example, `man ls` or `man tar`. These "man-pages", as the
results are called, have a fairly uniform format. They start with a summary of what the
utility does, then show how it is invoked and what the possible options are by
showing a skeleton in the form:
"`command` _options_ _arguments_"
and usually square brackets are put around things that are not required. This format
can get quite ugly and hard to parse for an old human brain, like mine, but stick with it.
If you don't have a man-page for a program, you might try invoking the program with
the `--help` option, or maybe with no option at all. Sometimes, that returns a textual
explanation of how the program should be invoked, and the options that are available.
## Handling, Manipulating, and Viewing files and streams
In Unix, there are two main types of files: _regular files_ which are things like text files,
figures, etc.---Anything that holds data of some sort. And then there are "special" files, which
include _directories_ which you've already seen, and _symbolic links_ which we will talk about later.
### Creating new directories
You can make a new directory with:
```sh
mkdir path
```
where `path` is a path specification (either absolute or relative). Note that if you
want to make a directory within a subdirectory that does not currently exist, for example:
```sh
mkdir new-dir/under-new-dir
```
when `new-dir` does not already exist, then you have to either create `new-dir` first, like:
```sh
mkdir new-dir
mkdir new-dir/under-new-dir
```
or you have to use the `-p` option of `mkdir`, which creates all necessary parent directories
as well, like:
```sh
mkdir -p new-dir/under-new-dir
```
If there is already a file (regular or directory) with the same path specification as a directory you are
trying to create, you will get an error from `mkdir`. The exception is when you are using the `-p` option
and the existing file is a directory: in that case `mkdir` doesn't do anything, but neither does it complain about the fact that a
directory already exists where you wanted to make one.
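The behaviors described above can be tried out safely in a scratch directory. This is a sketch; the directory and file names are only for illustration.

```sh
# Work in a throwaway directory
demo=$(mktemp -d)
cd "$demo" || exit 1

# Fails because the parent, new-dir, does not exist yet.
# (|| runs the echo only if the command before it fails.)
mkdir new-dir/under-new-dir 2> /dev/null || echo "plain mkdir failed"

# Succeeds: -p creates new-dir and then under-new-dir inside it
mkdir -p new-dir/under-new-dir

# Running it again is harmless, since the directory already exists
mkdir -p new-dir/under-new-dir && echo "no complaint from mkdir -p"

# But -p does not help when a regular file is in the way
touch a-file
mkdir -p a-file 2> /dev/null || echo "a-file exists and is not a directory"
```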
### Fundamental file-handling commands
For the day-to-day business of moving, copying, or removing files in the file system,
the three main Unix commands are:
* `mv` for moving files and directories
* `cp` for copying files and directories
* `rm` for removing files and directories
These obviously do different things, but their syntax is somewhat similar.
#### `mv`
`mv` can be invoked with just two arguments like:
```sh
mv this there
```
which moves the file (or directory) from the path `this` to the path `there`.
* If `this` is a regular file (i.e. not a directory), and:
- `there` is a directory, `this` gets moved inside of `there`.
- `there` is a regular file that exists, then `there` will get overwritten, becoming a regular
file that holds the contents of `this`.
- `there` does not exist, it will be created as a regular file whose contents are identical
to those of `this`.
* If `this` is a
directory and:
- `there` does not exist in the filesystem, the directory `there` will be made
and its contents will be the (former) contents of `this`
- if `there` already exists, and is a directory, then the directory `this` will
be moved inside of the directory `there` (i.e. it will become `there/this`).
- if `there` already exists, but is not a directory, then nothing will change
in the filesystem, but an error will
be reported.
In all cases, whatever used to exist at path `this` will no longer be found there.
And `mv` can be invoked with multiple arguments, in which case the last one must be a directory
_that already exists_ that receives all the earlier arguments inside it. So, if you already have
a directory named `dest_dir` then you can move a lot of things into it like:
```sh
mv file1 file2 dir1 dir2 dest_dir
```
You can also write that as
```sh
mv file1 file2 dir1 dir2 dest_dir/
```
which makes its meaning a little more clear, but there is no requirement that the
final argument have a trailing `/`.
Note, if any files in `dest_dir` have the same name as the files you are moving into
`dest_dir` they _will_ get overwritten.
So, you must be careful not to overwrite files that you don't want to overwrite.
Using `mv` can be dangerous in that way.
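To see the overwriting behavior for yourself without risking real data, here is a sketch that runs in a scratch directory. The names `this`, `there`, and `some_dir` are placeholders, and the `>` used to put text into the files is explained later in this chapter.

```sh
# Work in a throwaway directory
demo=$(mktemp -d)
cd "$demo" || exit 1

# Make two small files with different contents, and one directory
echo "contents of this" > this
echo "old contents" > there
mkdir some_dir

# there is an existing regular file: it is overwritten without warning
mv this there
cat there      # prints: contents of this

# The destination is an existing directory: the file is moved inside it
mv there some_dir
ls some_dir    # prints: there
```

If you would like a safety net, the standard `-i` option makes `mv` ask for confirmation before it overwrites an existing file.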
#### `cp`
This works much the same way as `mv` with two different flavors:
```sh
cp this there
```
and
```sh
cp file1 file2 dest_dir
# or
cp file1 file2 dest_dir/
```
The result is very much like that of `mv`, but instead of moving the file
from one place to another (an operation that can actually be done without moving the
data within the file to a different place on the hard drive), the `cp` command actually
makes a full copy of files. Note that, if the files are large, this can take a long time.
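One practical difference from `mv` that is easy to trip over: `cp` copies a directory only when you give it the `-R` (recursive) option. Here is a sketch in a scratch directory, with made-up names.

```sh
# Work in a throwaway directory
demo=$(mktemp -d)
cd "$demo" || exit 1

# Set up one file, and one directory that has a file inside it
echo "some data" > this
mkdir dir1 dest_dir
touch dir1/inside.txt

# Copy a file: afterward both this and there exist
cp this there

# Copy a directory and everything in it into dest_dir
cp -R dir1 dest_dir

ls dest_dir/dir1    # prints: inside.txt
```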
#### `rm`
Finally we get to the very spooky `rm` command, which is short for "remove." If you
say `rm myfile.txt` the OS will remove that file from your hard drive's directory. The data
that were in the file might live on for some time on your hard drive---in other words, by default,
`rm` does not wipe the file off your hard drive, but simply "forgets" where to look for that file. And
the space that file took up on your hard drive is no longer reserved, and could easily be
overwritten the next time you write something to disk. (Nonetheless, if you do `rm` a file, you should never expect to be able to get it back). So, be very careful about using `rm`. It takes an `-r` option for recursively removing directories _and_ all of
their contents.
When used in conjunction with globbing, `rm` can be very useful. For example, if you wanted
to remove all the files in a directory with a `.jpg` extension, you would do `rm *.jpg` from
within that directory. However, it's a disaster to accidentally remove a number of files you
might not have wanted to. So, especially as you are getting familiar with Unix, it is
worth it to experiment with your globbing using `ls` first, to see what the results are,
and, only when you are convinced that you won't remove any files that you do not
want to trash, should you
then use `rm` to remove those files.
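The "look before you delete" habit can be sketched like this, in a scratch directory with made-up file names:

```sh
# Work in a throwaway directory
demo=$(mktemp -d)
cd "$demo" || exit 1
touch a.jpg b.jpg notes.txt

# First, preview exactly which files the glob picks out
ls *.jpg

# Only when that list looks right, reuse the very same glob with rm
rm *.jpg

# notes.txt is all that remains
ls
```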
### "Viewing" Files
When you interact with files on your computer through a Graphical User Interface, or GUI,
you typically open the files with some application. For example, you open Word files
with Microsoft Word. When working on the Unix shell, that same paradigm does not
really exist. Rather, (apart from a few cases like the text editors, `nano`, `vim` and
`emacs`) instead of opening a file and letting the user interact with it, the shell is
much happier just streaming the contents of the file to the terminal.
The most basic of such commands is the `cat` command, which _catenates_ the contents
of a file into a very special _data stream_ called _stdout_, which is short
for "standard output." If you don't provide any other instruction, data that gets
streamed to _stdout_ just shoots by on your terminal screen. If the file is very large, it might
do this for a long time. If the file is a _text file_ then the data in it can be
written out in letters that are recognizable. If it is a _binary file_ then there is
no good way to represent the contents as text letters, and your screen will be filled with
all sorts of crazy looking characters.
It is generally best not to `cat` very large files, especially binary ones. If you do and
you need to stop the command from continuing to spew stuff across your screen, you can type
`ctrl-c` which is the universal Unix command for "kill the current process happening on the
shell." Usually that will stop it.
```{block2, note-text-terminal, type='rmdtip'}
**A note regarding terminals:** On a Mac, both the Terminal app and the application
iTerm2 are quite fast at spewing text
across the screen. Megabytes of text or binary gibberish can flash by in seconds flat. This
is not the case with the terminal window within RStudio, which can be abysmally slow, and
usually doesn't store many lines of output.
```
Sometimes you want to just look at the top of a file. The `head` command
shows you the first 10 lines of a file. That is valuable. The `less` command
shows a file one screenful at a time. You can hit the space bar to see the next screenful,
and you can hit `q` to quit viewing the file. If the file has very long lines
(as might be the case with a VCF file) then you can give `less` the `-S` option
to not wrap lines. In that case, the left and right arrow keys can be used to
scroll through the long lines.
Try navigating to a file and using `cat`, `head`, and `less` on it.
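If you don't have a suitable file handy, the following sketch makes one. It assumes the `seq` utility is available (it is on most Linux and Mac systems), and it uses the `>` redirection that is explained in the next section.

```sh
# Make a text file holding the numbers 1 through 100, one per line
f=$(mktemp)
seq 1 100 > "$f"

# The first 10 lines (the default)
head "$f"

# Just the first 3 lines, using the -n option
head -n 3 "$f"

# To page through the file without wrapping long lines, run the
# following line yourself (it is interactive; press q to quit):
# less -S "$f"
```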
One particularly cool thing about `cat` is that if you say
```sh
cat file1 file2
```
it will catenate the contents of both files, in the order they
are listed on the command line, to _stdout_.
Now, one **Big Important Unix Fact** is that many programs written to run in the
Unix shell behave in the same way regarding their output: they write their
output to _stdout_. We have already seen this with `ls`: its output just
gets written to the screen, which is where _stdout_ goes by default.
### Redirecting standard output: `>` and `>>`
Unix starts to get really fun when you realize that you can "redirect" the
contents of _stdout_ from any command (or group of commands...see the next chapter!)
to a file. To do that, you merely follow the command (and all its options and arguments)
with `> path` where `path` is the path specifying the file into which you
wish to redirect _stdout_.
Witness, try this:
```{sh, eval=FALSE}
# echo three lines of text to a file in the /tmp directory
echo "bing
bong
boing" > /tmp/file1
# echo three more lines of text to another file
echo "foo
bar
baz" > /tmp/file2
# now view the contents of the first file
cat /tmp/file1
# and the second file:
cat /tmp/file2
```
It is important to realize that when you redirect output into a file
with `>`, any contents that previously existed in that file will
be deleted (wiped out!). So be careful about redirecting. Don't
accidentally redirect output into a file that has valuable data in it.
The `>>` redirection operator does not delete the destination file before
it redirects output into it. Rather, `>> file` means "append _stdout_ to the contents that already exist in `file`." This can be very useful
sometimes.
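The difference between the two operators is easy to see in a short sketch that writes to a temporary file:

```sh
# Get a fresh temporary file to write into
f=$(mktemp)

echo "first"  > "$f"     # the file now holds: first
echo "second" > "$f"     # > overwrites, so the file now holds only: second
echo "third" >> "$f"     # >> appends, so the file now holds: second, third

cat "$f"
```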
### stdin, `<` and `|`
Not only do most Unix-based programs deliver output to standard output, but
most utilities can also receive input from a file stream called _stdin_ which
is short for "standard input."
If you have data in a file that you want to send into standard input
for a utility, you can use the `<` like this:
```sh
command < file
```
But, since most Unix utilities also let you specify the file as an argument,
this is not used very much.
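One place where the difference is visible is the `wc` command with its `-l` (count lines) option. This is a sketch with a small temporary file:

```sh
# Make a three-line file
f=$(mktemp)
printf 'one\ntwo\nthree\n' > "$f"

# File given as an argument: wc reports the count and the file name
wc -l "$f"

# File sent to stdin: wc sees only a stream of data, with no name
# attached, so it reports just the count
wc -l < "$f"
```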
However, what is used all the time in Unix, and it is one of the things
that makes it super fun, is the pipe, `|`, which says, "take _stdout_ coming
out of the command on the left and redirect it into _stdin_ going into
the command on the right of the pipe."
For example, if I wanted to count the number of files and directories stored in my `git-repos`
directory, I could do
```sh
% ls -dl Documents/git-repos/* | wc
174 1566 14657
```
which pipes the output of `ls -dl` (one line per file) into the _stdin_ for the `wc` command, which
counts the number of lines, words, and bytes sent to its standard input. So, the output tells
me that there are 174 files and directories in my directory `Documents/git-repos`.
Note that pipes and redirects can be combined in sequence over multiple
operations or commands. This is what gives rise to the terminology of
making "Unix pipelines:" the data are like streams of water coming into
or out of different commands, and the pipes hook up all those streams into
a pipeline.
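Here is a sketch of a small pipeline that ends in a redirect. It uses `grep`, a standard utility not yet covered in the text, to keep only the lines that match a pattern. The file names are made up.

```sh
# Make a throwaway directory holding three empty files
demo=$(mktemp -d)
touch "$demo/a.txt" "$demo/b.txt" "$demo/c.log"

# ls writes one name per line when its output goes into a pipe;
# grep keeps only the lines ending in .txt;
# wc -l counts the lines that are left;
# the final > sends that count to a file instead of the terminal
ls "$demo" | grep '\.txt$' | wc -l > "$demo/count"

cat "$demo/count"
```

Each command in the chain knows nothing about the others. It just reads from _stdin_ and writes to _stdout_, and the shell does the plumbing.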
### stderr
While output from Unix commands is often written to _stdout_, if anything goes wrong with
a program, then messages about that get written to a different stream called _stderr_, which, you
guessed it! is short for "standard error". By default, both _stdout_ and _stderr_ get written
to the terminal, which is why it can be hard for beginners to think of them as separate streams.
But, indeed, they are. Redirecting _stdout_ with `>` does **not** redirect _stderr_.
For example, see what happens when we ask `ls` to list a file that does not exist:
```sh
[~]--% ls file-not-here.txt
ls: file-not-here.txt: No such file or directory
```
The error message comes back to the screen. If you redirect the output
it still comes back to the screen!
```sh
[~]--% ls file-not-here.txt > out.txt
ls: file-not-here.txt: No such file or directory
```
If you want to redirect _stderr_, then you need to specify which stream
it is. On all Unix systems, _stderr_ is stream #2, so the `2>` syntax can be
used:
```sh
[~]--% ls file-not-here.txt 2> out.txt
```
Then there is no output of _stderr_ to the terminal, and when you `cat` the output
file, you see that it went there!
```sh
[~]--% cat out.txt
ls: file-not-here.txt: No such file or directory
```
Doing bioinformatics, you will find that there will be failures of various programs.
It is essential when you write bioinformatic pipelines to redirect _stderr_ to a
file so that you can go back, after the fact, to sleuth out why the failure occurred.
Additionally, some bioinformatic programs write things like progress messages to
_stderr_ so it is important to know how to redirect those as well.
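In a pipeline you will typically want to capture both streams. The sketch below shows two common patterns: sending _stdout_ and _stderr_ to separate files, and sending both to the same file with `2>&1`. The file names are placeholders.

```sh
# Work in a throwaway directory
demo=$(mktemp -d)
cd "$demo" || exit 1
touch real-file.txt

# One file exists and one does not, so ls writes to both streams.
# Send stdout to out.txt and stderr to err.txt.
# (|| runs the echo only if ls reports a failure.)
ls real-file.txt file-not-here.txt > out.txt 2> err.txt || echo "ls reported an error"

cat out.txt    # the listing of real-file.txt
cat err.txt    # the "No such file or directory" message

# Send both streams to one file: 2>&1 means
# "send stream 2 to wherever stream 1 is currently going"
ls real-file.txt file-not-here.txt > both.txt 2>&1 || echo "ls reported an error"
```

The order matters in the last command: `2>&1` has to come after `> both.txt`, so that stream 1 is already pointing at the file when stream 2 is told to follow it.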
### Symbolic links
Besides regular files and directories, a third type of file in Unix is called a
_symbolic link_. It is a special type of file whose contents are just an
absolute or a relative path to another file. You can think of symbolic links
as "shortcuts" to different locations in your file system. There are many
useful applications of symbolic links.
Symbolic links are made using the `ln` command with the `-s` option. For example,
if I did this in my home directory:
```sh
[~]--% ln -s /Users/eriq/Documents/git-repos/srsStuff srs