# Lecture 2: Command Line Basics
CBIO (CSCI) 4835/6835: Introduction to Computational Biology

## Overview and Objectives




## Part 1: BASH Basics

If you've never used a command-line before... **Don't be intimidated!**

<img src="https://www.bleepstatic.com/tutorials/cmdprompt/cmdprompt.gif" />

### Bash is to command prompts as Windows is to operating systems

Other command prompts include
 - `csh` (some would say the original: the "C-shell")
 - `bash` ("bourne-again" shell; tends to be the default on most Linux systems, and on macOS before Catalina)
 - `ksh` (Korn shell)
 - `zsh` (Z shell)

### Think of the fancy point-and-click user-interfaces as running commands on a prompt behind-the-scenes whenever you click something

### I highly recommend either Linux (Ubuntu, Mint, RedHat) or macOS. The Windows MS-DOS prompt is something else entirely.

If you're on a Windows machine, you can either:
 - Activate the Ubuntu shell (Windows 10 only) https://msdn.microsoft.com/en-us/commandline/wsl/install_guide
 - Install Cygwin https://www.cygwin.com/
 - Install VirtualBox ( https://www.virtualbox.org/wiki/VirtualBox ) and run an Ubuntu virtual machine inside
 - Go to the computer labs (RedHat or macOS will work)

I have a macOS laptop, an Ubuntu workstation, a bunch of RedHat servers, and a Windows 10 home desktop.

I'm most at home with either macOS or Ubuntu.

**It's like learning another language**: you'll only get better at it if you **immerse yourself in it**, even when you don't want to.

### Diving in!

You've fired up the command prompt (or `Terminal` in macOS). How do you see what's in the current folder?

<pre>
Last login: Mon Jan  9 18:36:07 on ttys006
example1:~ squinn$ ls
Applications   Dropbox        Music          SpiderOak Hive
Desktop        Google Drive   Pictures       metastore_db
Documents      Library        Programming    nltk_data
Downloads      Movies         Public         rodeo.log
example1:~ squinn$ 
</pre>

### `ls`

Allows you to view the contents of the current directory--folders and files.

But how do we tell the difference between the two? **Use an optional `-l` flag.**

### (aside: "flags" are options to commands that slightly tweak their behavior to account for different user intentions--like "quit" versus "force quit")

<pre>
example1:~ squinn$ ls -l
total 264
drwx------   7 squinn  staff     238 Oct 23  2015 Applications
drwx------+ 59 squinn  staff    2006 Jan  9 17:49 Desktop
drwx------+ 20 squinn  staff     680 Dec 23 09:35 Documents
drwx------+  5 squinn  staff     170 Jan  9 18:27 Downloads
drwx------@ 17 squinn  staff     578 Jan  8 18:03 Dropbox
drwx------@ 49 squinn  staff    1666 Jan  4 15:47 Google Drive
drwx------+ 74 squinn  staff    2516 Nov 17 15:06 Library
drwx------+  6 squinn  staff     204 May 20  2015 Movies
drwx------+  5 squinn  staff     170 Oct 22  2014 Music
drwx------+ 18 squinn  staff     612 Jul 29 11:31 Pictures
drwxr-xr-x  37 squinn  staff    1258 Jan  4 15:57 Programming
drwxr-xr-x+  5 squinn  staff     170 Oct 21  2014 Public
drwx------@  8 squinn  staff     272 Jun 30  2015 SpiderOak Hive
drwxr-xr-x   9 squinn  staff     306 Sep 17  2015 metastore_db
drwxr-xr-x   4 squinn  staff     136 Apr 27  2016 nltk_data
-rw-r--r--   1 squinn  staff  131269 Jan  9 18:32 rodeo.log
example1:~ squinn$
</pre>

Anything that starts with a `d` on the left is a folder (or **directory**); otherwise it's a file.
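Another flag that helps here is `-F`, which tacks a `/` onto the end of every directory name. Here is a minimal sketch, assuming a throwaway folder made with `mktemp -d`; the names `notes.txt` and `data` are made up for illustration.

```bash
# Build a scratch folder holding one file and one folder (hypothetical names).
scratch=$(mktemp -d)
touch "$scratch/notes.txt"
mkdir "$scratch/data"

# -l: long listing; the line for "data" starts with a "d"
ls -l "$scratch"

# -F: short listing; "data" is shown as "data/"
listing=$(ls -F "$scratch")
echo "$listing"

# Clean up the scratch folder
rm -r "$scratch"
```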

Ok, that's cool. I can tell what is what where I currently am. ...but wait, how do I even know where I am?

<pre>
example1:~ squinn$ pwd
/home/squinn
example1:~ squinn$
</pre>

### `pwd`

Pretty straightforward--stands for **P**rint **W**orking **D**irectory. Gives you the full path to where you are currently working. You'll rarely need any of its optional flags.

Great! Now I know where I am, and what is what where I am. How do I move somewhere else?

<pre>
example1:~ squinn$ cd Music/
example1:Music squinn$ ls
iTunes
example1:Music squinn$ 
</pre>

You'll notice the output of the `ls` command has now changed, which hopefully isn't surprising.

Since we've **C**hanged **D**irectories with the `cd` command--you essentially double-clicked the "Music" folder--now we're in a different folder with different contents; in this case, a lone "iTunes" folder.

Folders within folders represent a recursive hierarchy. We won't delve too much into this concept, except to say that, unless you're in the **root directory** (`/` on Linux, `C:\` on Windows), there is always a **parent directory**--the enclosing folder around the folder you are currently in.

Therefore, while you can always change to a very specific directory by supplying the full path--

<pre>
example1:~ squinn$ cd /home/squinn/Dropbox
example1:Dropbox squinn$ ls
Cilia_Papers     Imaging_Papers   OdorAnalysis     Public
Computer Case    LandUseChange    OrNet            cilia movies
Icon?            NSF_BigData_2015 OrNet Videos
example1:Dropbox squinn$
</pre>

--I can also navigate to the parent folder of my current location, irrespective of my *specific* location, using the special `..` notation.

### `cd ..`

Takes you up one level to the parent directory of where you currently are.

<pre>
example1:Dropbox squinn$ pwd
/home/squinn/Dropbox
example1:Dropbox squinn$ cd ..
example1:~ squinn$ pwd
/home/squinn
example1:~ squinn$
</pre>
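The same moves can be tried safely in a scratch folder. This is only a sketch: the folder names are invented, and `basename` is used just to print the last piece of the current path.

```bash
# Remember where we started, then make a small nested tree (made-up names).
origin=$PWD
scratch=$(mktemp -d)
mkdir -p "$scratch/teaching/lectures"

cd "$scratch/teaching/lectures"   # jump straight in using a full path
start=$(basename "$PWD")
cd ..                             # up one level, into "teaching"
parent=$(basename "$PWD")
echo "started in: $start, then moved up to: $parent"

cd ../..                          # ".." can be chained: up two more levels
echo "now in: $PWD"

# Go back to where we began and clean up
cd "$origin"
rm -r "$scratch"
```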

Let's see some other examples!

<pre>
example1: squinn$ ls
Lecture1.ipynb
example1: squinn$ ls -l
total 40
-rw-r--r--  1 squinn  staff  18620 Jan  5 19:54 Lecture1.ipynb
example1: squinn$ pwd
/home/squinn/teaching/4835/lectures
example1: squinn$ cd ..
example1: squinn$ pwd
</pre>

What prints out?
 - `~/`
 - `/home/squinn`
 - `/home/squinn/teaching`
 - `/home/squinn/teaching/4835`
 - An Error

<pre>
$ ls -l
total 8
-rw-rw-r-- 1 squinn staff   19 Sep  3 09:08 hello.txt
drwxrwxr-x 2 squinn staff 4096 Sep  3 09:08 lecture
$ ls *.txt
</pre>

What prints out?
 - `hello.txt`
 - `*.txt`
 - `hello.txt lecture`
 - An Error

### Spacing Out

<tt>du</tt> - disk usage of files/directories
<pre>
[squinn tmp]$ du -s
146564	.
[squinn tmp]$ du -sh
144M	.
[squinn tmp]$ du -sh intro
4.0K	intro
</pre>

<tt>df</tt> - usage of full disk

<pre>
[squinn tmp]$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
pulsar:/home     37T   28T  9.3T  75% /net/pulsar/home
</pre>
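To see `du` and `df` react to something you control, you can create a file of known size first. A rough sketch, assuming a scratch folder and a made-up file name; the exact numbers reported depend on your filesystem, so none are shown here. The `-k` flag (sizes in kilobytes) is an addition not used above.

```bash
scratch=$(mktemp -d)
# Write 1024 blocks of 1024 bytes of zeros into a file (hypothetical name)
dd if=/dev/zero of="$scratch/blob.bin" bs=1024 count=1024 2>/dev/null

du -sh "$scratch"    # total size of the folder, human-readable
du -sk "$scratch"    # same total, in kilobytes
df -h "$scratch"     # how full is the disk that holds the folder

# Keep just the first column (the number) of du's output
size_kb=$(du -sk "$scratch" | cut -f1)
echo "du reports $size_kb KB"

rm -r "$scratch"
```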

### Dude, where's my stuff?

 - <tt>locate</tt> find a file system-wide
 - <tt>find</tt> search a directory tree
 - <tt>which</tt> print the location of a command
 - <tt>man</tt> print the manual page of a command
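As a small illustration of `find` and `which`, here is a sketch that builds a tiny folder tree and searches it. The folder and file names are invented; `-name` is the `find` test that matches on file name.

```bash
scratch=$(mktemp -d)
mkdir -p "$scratch/papers/drafts"
touch "$scratch/papers/drafts/results.txt" "$scratch/papers/notes.txt"

# find: walk everything under $scratch, print names ending in .txt
# (the quotes keep the shell from expanding the * itself)
found=$(find "$scratch" -name '*.txt')
echo "$found"

# which: where on disk does the "ls" program live?
which ls

rm -r "$scratch"
```

Running `man find` in a terminal lists the many other tests `find` understands.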


### Save the Environment

 - <tt>NAME=value</tt>  set NAME equal to value (**no spaces around the equals sign**)
 - <tt>export NAME=value</tt> set NAME equal to value and pass it on to any program started from this shell
 - <tt>\$</tt> *dereference* a variable (read its value)
<pre>
$ X=3
$ echo $X
3
$ X=hello
$ echo $X
hello
$ echo X
X
</pre>
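The difference between a plain assignment and `export` only shows up when another program looks at the variable. One way to check, sketched here with made-up variable names, is `printenv`, a separate program that prints a variable only if it was handed down from the shell.

```bash
LOCAL_ONLY=3           # plain assignment: known to this shell only
export SHARED=hello    # exported: passed on to programs started from here

printenv SHARED        # prints: hello
printenv LOCAL_ONLY    # prints nothing, printenv never received it

echo $LOCAL_ONLY       # the shell itself still knows the value: 3
```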

### Getting at your variables

Which does **not** print the value of X?
 - `echo $X`
 - `echo ${X}`
 - `echo '$X'`
 - `echo "$X"`

### Capturing Output

<tt>`cmd`</tt> evaluates to output of cmd
<pre>
$ FILES=`ls`
$ echo $FILES 
hello.txt lecture
</pre>
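Bash also accepts `$(cmd)` as a second spelling of the backticks, and it is easier to nest. A brief sketch in a scratch folder that mirrors the listing above; `basename` and `dirname` are used only to pull a path apart.

```bash
scratch=$(mktemp -d)
touch "$scratch/hello.txt"
mkdir "$scratch/lecture"

# Both forms capture the output of ls
FILES_OLD=`ls "$scratch"`
FILES_NEW=$(ls "$scratch")
echo $FILES_NEW    # prints: hello.txt lecture

# $( ) can be nested without extra escaping:
# inner call gives the folder holding hello.txt, outer call keeps its name
folder=$(basename "$(dirname "$scratch/hello.txt")")
echo "$folder"

rm -r "$scratch"
```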

### Your Environment

 - <tt>env</tt> list all set environment variables
 - <tt>PATH</tt> where the shell searches for commands
 - <tt>LD_LIBRARY_PATH</tt> library search path
 - <tt>PYTHONPATH</tt> where Python searches for modules

<tt>.bashrc</tt> initialization file for bash - set PATH etc here
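A typical `PATH` tweak looks like the sketch below. The folder `$HOME/bin` is only an example of a place you might keep your own programs; it is not something this course requires.

```bash
# PATH is a colon-separated list of folders, searched left to right
echo "$PATH"

# Put a personal tools folder at the front of the list (example folder)
export PATH="$HOME/bin:$PATH"
echo "$PATH"
```

Typed at the prompt, this lasts until you close the terminal. Placing the `export` line in `.bashrc` applies it to every new shell.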


### History

 - <tt>history</tt> show commands previously issued
 - <tt>up arrow</tt> cycle through previous commands
 - <tt>Ctrl-R</tt> search through history for command **AWESOME**
 - <tt>.bash_history</tt> file that stores the history
 - <tt>HISTCONTROL</tt> environment variable that sets history options: ignoredups
 - <tt>HISTSIZE</tt> size of history buffer
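These two variables are usually set once in `.bashrc`. A possible pair of lines, where the number 5000 is an arbitrary choice:

```bash
# Lines like these would go in ~/.bashrc
HISTCONTROL=ignoredups   # don't store a command identical to the one just before it
HISTSIZE=5000            # remember up to 5000 commands (arbitrary number)
```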

### Shortcuts

 - <tt>Tab</tt> autocomplete
 - <tt>Ctrl-D</tt>  EOF/logout/exit
 - <tt>Ctrl-A</tt>  go to beginning of line
 - <tt>Ctrl-E</tt>  go to end of line
 - <tt>alias new=cmd</tt>  make a nickname for a command

<pre>
$ alias l='ls -l'
$ alias
$ l
</pre>

### Commands

The first word you type is the program you want to run.  <tt>bash</tt> will search <tt>PATH</tt> for an appropriately named executable and run it with the specified arguments.

* <tt>ipython</tt> - start interactive python shell (more later)
* <tt>ssh</tt> *hostname*  - connect to *hostname*
* <tt>passwd</tt> - change your password
* <tt>nano</tt> - a user-friendly text editor

## <tt>ssh</tt> into <tt>jupyterhub.cs.uga.edu</tt> and change your password

## Part 2: Text Manipulation

### Review

 - <tt>ls</tt> - list files
 - <tt>cd</tt> - change directory
 - <tt>pwd</tt> - print working (current) directory
 - <tt>..</tt> - special file that refers to parent directory
 - <tt>.</tt> - the current directory
 - <tt>cat <em>file</em></tt> - print out contents of file
 - <tt>more <em>file</em></tt> - print contents of file with pagination

### I/O Redirection

<tt>&gt;</tt> send *standard output* to file

<pre>
$ echo Hello > h.txt
</pre>

<tt>&gt;&gt;</tt> append to file

<pre>
$ echo World >> h.txt
</pre>

<tt>&lt;</tt>  send file to *standard input* of command

<tt>2&gt;</tt>  send *standard error* to file

<tt>&gt;&</tt>  send output and error to file
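Here is a sketch of the less familiar operators at work in a scratch folder. The file names are made up, `sort` is covered a little further on, and `> file 2>&1` is the longer spelling of "send output and error to file" that also works in shells other than bash.

```bash
scratch=$(mktemp -d)
printf 'pear\napple\n' > "$scratch/fruit.txt"

# <  : the file becomes the standard input of sort
sorted=$(sort < "$scratch/fruit.txt")
echo "$sorted"

# 2> : the error message goes into err.txt instead of onto the screen
ls "$scratch/no_such_file" 2> "$scratch/err.txt"
err=$(cat "$scratch/err.txt")
echo "captured error: $err"

# > file 2>&1 : normal output and the error message land in the same file
ls "$scratch/fruit.txt" "$scratch/no_such_file" > "$scratch/both.txt" 2>&1
both=$(cat "$scratch/both.txt")
echo "$both"

rm -r "$scratch"
```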

<pre>
$ echo Hello > h.txt
$ echo World >> h.txt
$ cat h.txt
</pre>

What prints out?
 - Hello
 - World
 - HelloWorld
 - <br />Hello<br />World
 - An Error

<pre>
$ echo Hello > h.txt
$ echo World > h.txt
$ cat h.txt
</pre>

What prints out?
 - Hello
 - World
 - HelloWorld
 - <br />Hello<br />World
 - An Error

### Pipes

A pipe (<tt>|</tt>) redirects the *standard output* of one program to the *standard input* of another.  It's like you typed the output of the first program into the second.  This allows us to chain several simple programs together to do something more complicated.
<pre>
$ echo Hello World | wc
</pre>
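Pipes can be strung together as many times as you like. In this small sketch (the fruit names are just sample text), `printf` emits three lines, `sort` puts them in order, and `head -n 1` keeps only the first.

```bash
first=$(printf 'cherry\napple\nbanana\n' | sort | head -n 1)
echo "$first"    # prints: apple
```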

### Simple Text Manipulation

 - <tt>cat</tt> dump file to stdout
 - <tt>more</tt> paginated output
 - <tt>head</tt> show first 10 lines
 - <tt>tail</tt> show last 10 lines
 - <tt>wc</tt> count lines/words/characters
 - <tt>sort</tt> sort file by line and print out (<tt>-n</tt> for numerical sort)
 - <tt>uniq</tt> remove **adjacent** duplicates (<tt>-c</tt> to count occurrences)
 - <tt>cut</tt> extract fixed width columns from file
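A few of these are easier to understand on a file whose contents you already know. The sketch below assumes `seq` is available (it prints a run of integers) and uses a made-up file name.

```bash
scratch=$(mktemp -d)
seq 1 12 > "$scratch/numbers.txt"    # the integers 1..12, one per line

head -n 3 "$scratch/numbers.txt"     # first 3 lines instead of the default 10
tail -n 2 "$scratch/numbers.txt"     # last 2 lines

# Plain sort compares lines as text, so "10" comes right after "1"
text_top=$(sort "$scratch/numbers.txt" | head -n 3)
# sort -n compares them as numbers
num_top=$(sort -n "$scratch/numbers.txt" | head -n 3)
echo "text order:   " $text_top      # prints: text order:    1 10 11
echo "numeric order:" $num_top       # prints: numeric order: 1 2 3

# cut -c keeps the given character positions of each line
year=$(echo "2017-01-10" | cut -c 1-4)
echo "$year"                         # prints: 2017

rm -r "$scratch"
```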

<pre>
$ cat text
a
b
a
b
b
$ cat text | uniq | wc
</pre>

What is the first number to print out?
 - 1
 - 2
 - 3
 - 4
 - 5
 - None of the above

<pre>
$ cat text
a
b
a
b
b
$ cat text | sort | uniq | wc
</pre>

What is the first number to print out?
 - 1
 - 2
 - 3
 - 4
 - 5
 - None of the above

### Advanced Text Manipulation

 - <tt>grep</tt> search contents of file for expression
 - <tt>sed</tt> stream editor - perform substitutions
 - <tt>awk</tt> pattern scanning and processing, great for dealing with data in columns

### grep

Search file contents for a pattern.
<tt>grep <em>pattern</em> <em>file(s)</em></tt>
 * <tt>-r</tt> recursive search
 * <tt>-I</tt> skip over binary files
 * <tt>-s</tt> suppress error messages
 * <tt>-n</tt> show line numbers
 * <tt>-A</tt> *N* show *N* lines after match
 * <tt>-B</tt> *N* show *N* lines before match
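To get a feel for the flags, here is a sketch on a four-line sample file (the contents and file name are invented for the example).

```bash
scratch=$(mktemp -d)
printf 'apple\nbanana\ncherry\ndate\n' > "$scratch/fruit.txt"

# -n: prefix each matching line with its line number
hit=$(grep -n an "$scratch/fruit.txt")
echo "$hit"        # prints: 2:banana

# -A 1: the matching line plus one line after it
after=$(grep -A 1 an "$scratch/fruit.txt")
echo "$after"      # prints banana, then cherry

# -B 1: the matching line plus one line before it
before=$(grep -B 1 da "$scratch/fruit.txt")
echo "$before"     # prints cherry, then date

# -r: search every file under a folder; each hit is prefixed with its file name
grep -r an "$scratch"

rm -r "$scratch"
```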

<pre>
$ grep a text | wc
</pre>

What is the first number to print out?
 - 1
 - 2
 - 3
 - 4
 - 5
 - None of the above

### sed
Search and replace

<pre>
sed 's/<em>pattern</em>/<em>replacement</em>/' <em>file</em>
</pre>

 * <tt>-i</tt> replace in-place (overwrites input file)
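One detail worth seeing in action: by default `sed` replaces only the first match on each line. Adding a trailing `g` to the expression (not shown above) replaces every match. A tiny sketch with a sample word:

```bash
first_only=$(echo "banana" | sed 's/a/o/')
every=$(echo "banana" | sed 's/a/o/g')

echo "$first_only"   # prints: bonana
echo "$every"        # prints: bonono
```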

<pre>
$ sed 's/a/b/' text | uniq | wc
</pre>

What is the first number to print out?
 - 1
 - 2
 - 3
 - 4
 - 5
 - None of the above

### awk
Pattern scanning and processing language. We'll mostly use it to extract columns/fields. It processes a file line-by-line and, if a condition holds, runs a simple program on the line.

<tt> awk '<em>optional condition</em> {<em>awk program</em>}' <em>file</em></tt>
* <tt>-F<em>x</em></tt> make *x* the field delimiter (default whitespace)
* <tt>NF</tt> number of fields on current line
* <tt>NR</tt> current record number
* <tt>\$0</tt> full line
* <tt>\$<em>N</em></tt> Nth field
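Putting those pieces together on two made-up, colon-separated records (name:department:year), a sketch might look like this:

```bash
# For every line: record number, number of fields, first field
summary=$(printf 'ada:math:1843\ngrace:navy:1952\n' | awk -F: '{print NR, NF, $1}')
echo "$summary"
# prints:
# 1 3 ada
# 2 3 grace

# Condition first, program second: only lines whose 3rd field exceeds 1900
recent=$(printf 'ada:math:1843\ngrace:navy:1952\n' | awk -F: '$3 > 1900 {print $0}')
echo "$recent"     # prints: grace:navy:1952
```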

### awk

<pre>
$ cat names
id last,first 
1 Smith,Alice
2 Jones,Bob
3 Smith,Charlie
</pre>
Try these:

<pre>
$ awk '{print $1}' names
$ awk -F, '{print $2}' names
$ awk 'NR > 1 {print $2}' names 
$ awk '$1 > 1 {print $0}' names
$ awk 'NR > 1 {print $2}' names | awk -F, '{print $1}' | sort | uniq -c

</pre>

## Exercises

<pre>
mkdir intro
cd intro
wget https://eds-uga.github.io/cbio4835-sp17/files/Spellman.csv
wget https://eds-uga.github.io/cbio4835-sp17/files/1shs.pdb
</pre>

- How many data points are in Spellman.csv?
-  The first three letters of the systematic open reading frames are: 'Y' for yeast, the chromosome number, then the chromosome arm. In the dataset, how many ORFs from chromosome A are there?
- How many are there from each chromosome? 
  - each chromosome arm?
- How many data points start with a positive expression value?
- What are the 10 data points with the highest initial expression values?
  - Lowest?
- How many lines are there where expression values are continuously increasing for the first 3 time steps?
- Sorted by biggest increase?

<pre>
wc Spellman.csv   # number of lines; off by one because of the header
grep YA Spellman.csv | wc
grep ^YA Spellman.csv | wc   # a bit better: ^ matches the beginning of the line
grep -c ^YA Spellman.csv   # grep can provide the count itself
awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-2 | sort | uniq -c
awk -F, 'NR > 1 {print $1}' Spellman.csv | cut -b 1-3 | sort | uniq -c
awk -F, 'NR > 1 && $2 > 0 {print $0}' Spellman.csv | wc
awk -F, 'NR > 1  {print $1,$2}' Spellman.csv  | sort -k2,2 -n | tail
awk -F, 'NR > 1  {print $1,$2}' Spellman.csv  | sort -k2,2 -n -r | tail
awk -F, 'NR > 1 && $3 > $2 && $4 > $3 {print $0}' Spellman.csv  |wc
awk -F, 'NR > 1 && $3 > $2 && $4 > $3  {print $4-$2,$0}' Spellman.csv   | sort -n -k1,1
</pre>
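The off-by-one from the header line can be avoided by dropping the first line before counting. A sketch on a made-up three-row table (not the real `Spellman.csv`); `tail -n +2` means "start printing at line 2".

```bash
scratch=$(mktemp -d)
printf 'gene,t0,t1\ng1,0.5,0.7\ng2,-0.1,0.2\ng3,0.3,0.1\n' > "$scratch/toy.csv"

with_header=$(wc -l < "$scratch/toy.csv")
rows_tail=$(tail -n +2 "$scratch/toy.csv" | wc -l)
# An awk condition with no program prints every line where it holds
rows_awk=$(awk 'NR > 1' "$scratch/toy.csv" | wc -l)

echo "lines including header:" $with_header   # 4
echo "data rows via tail:    " $rows_tail     # 3
echo "data rows via awk:     " $rows_awk      # 3

rm -r "$scratch"
```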

## More Exercises

- Create a pdb file from 1shs that consists of only ATOM records. 
- Create a pdb with only ATOM records from chain A.
- How many carbon atoms are in this file?

<pre>
grep ^ATOM 1shs.pdb > newpdb.pdb   # ^ matches the beginning of the line
grep ^ATOM 1shs.pdb | awk '$5 == "A" {print $0}'
# this is UNSAFE with pdb files since there is no guarantee that fields
# will be whitespace separated; safer is:
grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == "A" {print $0}' > newpdb.pdb
 
grep ^ATOM 1shs.pdb | awk ' substr($0,22,1) == "A" {print $0}' | cut -b 78- | sort | uniq -c

</pre>
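To see why the whitespace version is unsafe, here is a sketch using three made-up, truncated ATOM records (real records continue with coordinates; these are not taken from `1shs.pdb`). Column 22 holds the chain ID. In the second record the residue number 1000 fills columns 23-26, so it runs straight into the chain letter and awk sees the single field `A1000`.

```bash
# Two records from chain A and one from chain B (invented for illustration)
records='ATOM      1  N   MET A   1
ATOM   9999  CA  GLY A1000
ATOM      2  N   MET B   1'

# Whitespace fields: misses the record where "A" and "1000" are glued together
by_field=$(echo "$records" | awk '$5 == "A"' | wc -l)
# Fixed column: looks at character 22 no matter what surrounds it
by_column=$(echo "$records" | awk 'substr($0,22,1) == "A"' | wc -l)

echo "chain A by whitespace field:" $by_field    # 1
echo "chain A by fixed column:    " $by_column   # 2
```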

## Administrivia

 - Did everyone finish the pre-test? It was due today before lecture. https://docs.google.com/forms/d/1ka9yH5G3bOCfdJUTaeZXV2BdtvqqsiPaxnvKI2f4YK4/
 
 - Office hours: **Tuesdays (today!) at 11:00 - 12:30**. Boyd GSRC 638A.

## Resources

 - A BASH cheatsheet: http://eds-uga.github.io/cbio4835-sp17notes/bash_cheatsheet.pdf