# Introduction to Jupyter & Unix Commands!

Welcome! Is this your first time navigating through a Jupyter notebook? <br> 
Don't worry! We'll take a look at a few important Unix commands that you need in order to work through this notebook.

Here's an overview:
- Jupyter Notebook Overview
- Unix Commands Overview
- Working with Sequence Files Example

-----


# Jupyter Notebook Overview

So... [what is a notebook?](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html) Simply put, it's a document that combines runnable code snippets with regular text for explanation and analysis.
We create cells that contain either code (usually Python or R, though you can run Unix commands as well) or Markdown for general writing.
After you write a cell, you can select it and click Run (or press Ctrl-Enter) to execute it.

A key note is that your code runs in the directory where the notebook is saved. So if your code produces output (exporting a text file or making a directory with a Unix command), that output lands where the notebook file is located, unless you use a command to change the working directory.

The notebook executes cells in the order you run them, not the order in which they appear. This means you could run the first cell and then the last cell if you wanted. Often this breaks your work, because later cells usually depend on results from earlier ones, so be mindful of the order in which you execute code cells.

Finally, as I mentioned, you can use Unix commands. Here we're working with a Python 3 kernel (the computational engine that executes code in our notebook), so to run a Unix command we put a "!" in front of it. This tells the kernel to hand the line to the system shell instead of running it as Python. The change directory (cd) command is the exception: it needs a "%" instead, because each "!" command runs in its own temporary shell, and a directory change made there would not carry over to the next command.
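
To see why `cd` needs special treatment, here is a minimal shell sketch of the same effect outside the notebook. It assumes a scratch directory made with `mktemp`, and the variable names are only illustrative. A directory change made inside a child shell is gone once that child shell exits:

```shell
# Sketch: a directory change made in a child shell does not affect the parent.
scratch=$(mktemp -d)        # hypothetical scratch directory for the demo
start_dir=$(pwd)

# Run cd in a child shell, much like a notebook "!cd" line would.
child_dir=$(sh -c 'cd "$1" && pwd' sh "$scratch")

after_dir=$(pwd)            # the parent shell has not moved

echo "child shell was in: $child_dir"
echo "parent shell is in: $after_dir"
rmdir "$scratch"            # clean up the empty scratch directory
```

The parent shell ends where it started, which is the behavior the "%" form of cd works around in the notebook.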

---


# Unix Commands Overview

Let's go over the basic Unix commands that are included in this notebook. If you need a more thorough explanation, you can follow this introduction from [Software Carpentry](https://swcarpentry.github.io/shell-novice/01-intro.html).

### Getting Unstuck
Most commands have a helpful flag (an additional argument that changes what a command does) called "--help" that explains what the command does, how to use it, and which other flags it accepts:

    ls --help

There's also a more in-depth manual, accessed by typing "man" followed by the name of the command:

    man ls

### Navigation

__pwd__: Print Working Directory. It tells you which directory you're currently working in. \
__ls__: List. List out all directories/files in the current directory. \
__cd__: Change Directory. Moves you to the specified directory. 
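
As a rough illustration, the three navigation commands can be tried safely in a throwaway directory. The directory and file names below are made up for the sketch, and `touch` (which creates an empty file) is used only to have something to list:

```shell
# Sketch: pwd, ls, and cd in a throwaway directory (names are illustrative).
start_dir=$(pwd)
scratch=$(mktemp -d)
mkdir "$scratch/data"           # one sub-directory to move into
touch "$scratch/notes.txt"      # one empty file to list

cd "$scratch"
pwd                             # prints the scratch directory path
listing=$(ls)                   # lists the directory: data and notes.txt
echo "$listing"

cd data                         # relative path: move into data/
here=$(basename "$(pwd)")       # last part of the current path
echo "now in: $here"

cd "$start_dir"                 # return to where we started
```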

### Working with Files & Directories
__mkdir__: Make Directory. This creates a new directory. \
__wget__: A network download tool. This command downloads files from a server. \
__unzip__: Unzip. Extracts the contents of a .zip archive. \
__rm__: Remove. This deletes a file. To delete a directory you'll need the recursive -r flag. Be mindful that deletion is permanent. \
__mv__: Move. Moves or renames the specified file. \
__cat__: Concatenate. This prints the contents of one or more files, one after another. \
__cp__: Copy. Copies a file (or a directory, with the -r flag). \
__grep__: Search Command. This searches a file for lines that match a pattern of specified characters. \
__wc__: Word Count. Calculates a file's word, line, character or byte count. \
__diff__: Difference. Displays the differences between files. \
__cksum__: Checksum. Prints a file's CRC checksum, byte size, and name. \
__sed__: Stream Editor. Can insert, delete, search and replace (substitute) text.
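
Here is a small sketch that strings several of these commands together on made-up data in a scratch directory, so nothing important can be deleted by accident. The file names and contents are assumptions for illustration only:

```shell
# Sketch: create, copy, rename, inspect, and delete files in a scratch directory.
scratch=$(mktemp -d)
mkdir "$scratch/demo"                                     # mkdir: make a directory

printf 'alpha\nbeta\ngamma\n' > "$scratch/demo/words.txt" # three-line sample file

cp "$scratch/demo/words.txt" "$scratch/demo/copy.txt"     # cp: copy a file
mv "$scratch/demo/copy.txt" "$scratch/demo/renamed.txt"   # mv: move or rename it

cat "$scratch/demo/renamed.txt"                           # cat: print the contents

# wc -l counts lines; tr strips the padding some wc versions add.
line_count=$(wc -l < "$scratch/demo/renamed.txt" | tr -d ' ')
# grep prints the lines that contain the pattern "et".
matches=$(grep 'et' "$scratch/demo/renamed.txt")

echo "lines: $line_count"
echo "matches: $matches"

rm "$scratch/demo/words.txt"    # rm: delete one file (permanent)
rm -r "$scratch"                # rm -r: delete a directory and everything in it
```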

### Piping & Filtering
Unix commands can pass the output of one command into another. For example, we can use the concatenate "cat" command to print a file, "pipe" that text into a line count using the word count command (wc -l), and save the result into a new file:

    cat fileName.txt | wc -l > file_line_count.txt

This will be important in the upcoming example on working with sequence files.
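
The pipeline above can be tried end to end on a made-up three-line file. This is a sketch that assumes a scratch directory, and it reuses the file names from the example:

```shell
# Sketch: pipe one command into another, then redirect the result to a file.
scratch=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$scratch/fileName.txt"

# cat prints the file, "|" feeds that text to wc -l, ">" saves the count.
cat "$scratch/fileName.txt" | wc -l > "$scratch/file_line_count.txt"

# Read the saved count back (tr strips padding some wc versions add).
saved_count=$(tr -d ' ' < "$scratch/file_line_count.txt")
echo "saved line count: $saved_count"
```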


---


# Working with Sequence Files

In this exercise, we will use Unix to work with a sequence file with contigs in it. Contigs are parts of a genomic sequence that come from the same organism and have been "assembled together" from the shorter sequence reads that come directly off a sequencer.

## Step 1: Go to the assignment directory and get the contig file

Remember, our notebooks work in the current working directory, and when you log in to the HPC this is automatically your home directory. You will need to move to the project directory `/xdisk/bhurwitz/your_netid`. The next two cells define the assignments directory inside that project directory (be sure to replace your_netid with your actual netid) and then move into it for our exercise.

In [3]:
# Ctrl + Enter will run the selected cell. After running a cell you'll see the terminal's output.
# In this case, we're printing the work directory for the assignments and changing into that directory.
# Note that IPython uses a % in front of the change directory command, and that a Python
# variable needs a $ in front of it to be filled in on a % or ! line.
netid = "your_netid"
workdir = "/xdisk/bhurwitz/" + netid + "/assignments"
print(workdir)
%cd $workdir

/xdisk/bhurwitz/bhurwitz


In [None]:
# now we are going to `ls` to list all the assignment directories, and then change into 01_intro_unix directory for this assignment
!ls
%cd 01_intro_unix

Now you are in the 01_intro_unix directory (and nothing is currently in this directory). We can check that with the `ls -l` command.

In [None]:
!ls -l

Now we are going to make a directory to store our contigs, and move into that directory.

In [None]:
!mkdir 01_contigs
%cd 01_contigs

## Step 2: Use wget to download the contigs files and unzip the folder

Next, we will go and get the contigs files from the iMicrobe FTP site. We can use the `wget` command to pull down the data from the FTP site to our current directory.

In [None]:
!wget ftp://ftp.imicrobe.us/biosys-analytics/contigs/contigs.zip

#random tip: Output from a command can be cleared by clicking Cell -> Current (or All Outputs) -> Clear

Check to see that you have the contigs.zip file.

In [None]:
!ls -l

## Unpack the zipped file

Great news! You downloaded the file. It should have a file size of "1979343". Now back to the exercise. Let's unpack the contigs.zip file.

In [None]:
#Unzip the Fasta Files
!unzip contigs.zip
#Delete the Zip download
!rm contigs.zip
# check out the files you just downloaded and unpacked
!ls -l

### FASTA-formatted files

These files are in [FASTA format](https://en.wikipedia.org/wiki/FASTA_format), which basically looks like this:

```
>Contig_4027
AACCGGGCCAATCACCACGCGATGGACGGTACGCTCGATTTCAATGGCAACCTGTATTTCTCGGACGATCTGAACACCAACCCCTATCGGAGCATCGGGAAGATCGATGGACGGACCGGGGAGATCACCAACGTCCAGGTCGTTGATTTCTCCGAAGACAACATCGAATCCACCGTCGATGTAATGGGATTGGGTTGGATGGAAGTGGGAGTGTCTCTTTCCACTCACCTGGGGGATTTTGCTTGCGGTGTGAC
>Contig_33139
TGTGACGGACCGTGATCGTTCCCTGATCCAGGTCGACGTCACTCCACTGGAGAGCCAGCAGCTCGCCCAGGCGTAGTCCGCAGAAGATGGCAGTGAAGAATAGAGCCGCTTGTGGGTGCGAGTTGCCGGAGCTGTTCCAGTCCCTGAGACCATCGACCAACGTCCGCGCCTCGGTGGAGCTATAGGGATTGATTTGTTTTTTCTGAGACGACTGCGGTCCGAGGATCTTGCGAAGATCGATCCCGATCGCTGGG
```

Header lines start with ">", then the sequence follows. Sequences may be broken up over several lines of 50 or 80 characters, but it's just as common to see the sequences take only one (sometimes very long) line. Sequences may be nucleotides, proteins, very short DNA/RNA, longer contigs (shorter strands assembled into contiguous regions), or entire chromosomes or even genomes.


## Step 3: Grep files

So, how many sequences are in the "group12_contigs.fasta" file? To answer, we just need to count how many times we see ">". We can do that with the `grep` command.

In [None]:
!grep > group12_contigs.fasta

**What just happened??** You got a usage statement for grep, and it didn't execute the command. What happened to the file?

In [None]:
!ls -l
# Notice in the output, in the row for group12_contigs.fasta, that the file size (5th column) is now "0"

### Oh no! I overwrote my file

You should see a zero-length file, because something quite insidious happened with that first "grep" statement: the shell read the unquoted ">" as an instruction to send grep's output into "group12_contigs.fasta", and it emptied the file before grep even ran. grep itself was left with no pattern to search for, which is why it printed a usage statement. The `ls -l` output above confirms this: the file has a length of "0", meaning there is nothing in it.

### Gotcha!

What is going on? Remember that the ">" symbol tells Unix to redirect the output of grep into a file. We need to tell Unix that we mean a literal greater-than sign by placing it in single or double quotes or by putting a backslash in front of it:

In [None]:
# This is known as escaping characters -- a common occurrence in programming.
!grep '>' group12_contigs.fasta
!grep \> group12_contigs.fasta

So, we ran those commands correctly, but nothing was output. This is because the file doesn't have anything in it! (given that we erased it in the last step) Let's try those commands on one of the other contigs files.

In [None]:
!grep '>' group24_contigs.fasta
#This prints out every line in the file that contains ">" (a long list).
#There's another method that produces just the number of matching lines, which we'll use shortly.
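
The redirect gotcha above can be reproduced safely on a scratch file instead of real data. This is a sketch with a made-up two-record FASTA file; grep's usage message is hidden with `2>/dev/null`, and `|| true` only stops the expected error from halting a script:

```shell
# Sketch: reproduce the accidental overwrite on a scratch file (made-up data).
scratch=$(mktemp -d)
printf '>seq_A\nACGT\n>seq_B\nGGCC\n' > "$scratch/demo.fasta"
size_before=$(wc -c < "$scratch/demo.fasta" | tr -d ' ')

# Unquoted: the shell treats ">" as a redirect and empties demo.fasta
# before grep starts; grep then fails because it was given no pattern.
grep > "$scratch/demo.fasta" 2>/dev/null || true
size_after=$(wc -c < "$scratch/demo.fasta" | tr -d ' ')
echo "bytes before: $size_before, bytes after: $size_after"

# Quoted: recreate the file, then search for a literal ">" character.
printf '>seq_A\nACGT\n>seq_B\nGGCC\n' > "$scratch/demo.fasta"
headers=$(grep '>' "$scratch/demo.fasta")
echo "$headers"
```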

### Let's get the file back, and try again.

Ugh, OK, I have to go back and wget the "contigs.zip" file to restore it. That's OK. Things like this happen all the time. First we will delete the old files.

In [None]:
#Deletes all files in the current directory with the .fasta file type
!rm *.fasta
#Downloads, unzips and deletes the zip file again
!wget ftp://ftp.imicrobe.us/biosys-analytics/contigs/contigs.zip
!unzip contigs.zip
!rm contigs.zip
!ls -l

### The file is back!

You should see something like this from the last command:

```
-rw-rw----  1 bhurwitz  staff  3034371 Aug 10  2016 group12_contigs.fasta
-rw-rw----  1 bhurwitz  staff  1550608 Aug 10  2016 group20_contigs.fasta
-rw-rw----  1 bhurwitz  staff  1686023 Aug 10  2016 group24_contigs.fasta
```

### Count the sequences in the contigs file

Now that I have restored my data, I want to count how many greater-than signs (that is, FASTA headers) are in the file. Each header is the name of one sequence in the contigs file. You should get 132.


In [None]:
!grep '>' group12_contigs.fasta | wc -l
#Notice the pipe symbol "|". Instead of printing out every line that contains ">", we pass those lines to the wc (word count) command,
#which with the -l flag prints a single number: how many lines it received.
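
As an aside, grep can also do the counting itself: its `-c` flag prints the number of matching lines instead of the lines. The sketch below compares both approaches on a made-up three-record file. The `^` in the second pattern is an optional refinement that only matches ">" at the start of a line:

```shell
# Sketch: two ways to count FASTA headers (made-up data).
scratch=$(mktemp -d)
printf '>seq_A\nACGT\n>seq_B\nGGCC\n>seq_C\nTTAA\n' > "$scratch/demo.fasta"

# Approach 1: list matching lines, then count them with wc -l.
piped_count=$(grep '>' "$scratch/demo.fasta" | wc -l | tr -d ' ')

# Approach 2: let grep count lines that start with ">".
flag_count=$(grep -c '^>' "$scratch/demo.fasta")

echo "grep | wc -l: $piped_count"
echo "grep -c:      $flag_count"
```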

### Setting aliases for something you do often

JUST FYI.

I could see doing that often. Maybe we should make this into an "alias". The problem is that the "argument" (the filename) is stuck in the middle of the chain of commands, which makes it tricky to use an alias. Instead, we can create a bash function and add it to our $HOME/.bashrc (or $HOME/.zshrc if you use zsh, the default shell on recent Macs).

You can add this function using nano (a text editor):

#### Step 1: Open nano with the file name using the command:

    nano countseqs.sh

#### Step 2: Copy the function into the nano text editor and save it:

    function countseqs() {
      grep '>' "$1" | wc -l
    }

To save and exit nano: Ctrl + X to exit, Y to save, Enter to confirm the file name.

#### Step 3: Add the function to the end of your .bashrc file with the command:

    cat countseqs.sh >> ~/.bashrc

#### Step 4: Source the file to make the changes live in the current Unix window:

    source ~/.bashrc

#### Step 5: You can now run it from the command line:

    countseqs group12_contigs.fasta
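
Before touching .bashrc, the function can be tried out directly in the current shell. The following is a sketch on made-up data; the scratch file name (which deliberately contains a space) is an assumption, chosen to show why quoting `"$1"` is a sensible habit:

```shell
# Sketch: define the function in the current shell and try it on made-up data.
countseqs() {
  # "$1" is the first argument (the FASTA file name); quoting it keeps
  # file names that contain spaces in one piece.
  grep '>' "$1" | wc -l
}

scratch=$(mktemp -d)
printf '>seq_A\nACGT\n>seq_B\nGGCC\n' > "$scratch/my demo.fasta"

n=$(countseqs "$scratch/my demo.fasta" | tr -d ' ')
echo "sequences: $n"
```

A function defined this way lasts only for the current shell session, which is why the steps above store it in .bashrc.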

-----

 There is a powerful tool called ["seqmagick"](https://github.com/fhcrc/seqmagick) that will do this (and much, much more). We will try that program out later...



## Step 4: Searching for something...

Moving on, let's find how many contig IDs in "group12_contigs.fasta" contain the number "47":

In [None]:
!grep 47 group12_contigs.fasta > group12_ids_with_47
#Grep all IDs that contain "47" and save them into a new file
!cat group12_ids_with_47
#Print everything in the new file

You should see something like this:

```
cat group12_ids_with_47
>Contig_247
>Contig_447
>Contig_476
>Contig_1947
>Contig_4764
>Contig_4767
>Contig_13471
```
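
Note that the pattern matches "47" anywhere in the line, which is why IDs such as 1947 and 4764 both appear. The sketch below rebuilds the ID list shown above in a scratch file, counts the matches, and then narrows the pattern. The `$` anchor (end of line) and the `_47` pattern are illustrative choices, not part of the exercise:

```shell
# Sketch: count matches and narrow the pattern, using the IDs listed above.
scratch=$(mktemp -d)
# printf repeats its format once per argument, giving one ID per line.
printf '>Contig_%s\n' 247 447 476 1947 4764 4767 13471 > "$scratch/ids_with_47"

any_47=$(grep -c '47' "$scratch/ids_with_47")       # "47" anywhere in the line
ends_47=$(grep -c '47$' "$scratch/ids_with_47")     # ID number ends with 47
starts_47=$(grep -c '_47' "$scratch/ids_with_47")   # ID number starts with 47

echo "contain 47:    $any_47"
echo "end with 47:   $ends_47"
echo "start with 47: $starts_47"
```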

Let's play around with the file by putting its contents in some temp files. Here are two ways to make a copy of the file contents.

In [None]:
!cat group12_ids_with_47 > temp1_ids

Here we copy the file again, this time with "cp", so that we have two duplicate files.

In [None]:
!cp group12_ids_with_47 temp2_ids

## Step 5: Checking if files are the same

How can we be sure these files are the same? Let's use "diff":


In [None]:
!diff temp1_ids temp2_ids

You should see nothing, which is a case of "no news is good news." They don't differ in any way. We can verify this with "cksum" (see the cell below). You should see this:

```
2188208005 89 temp1_ids
2188208005 89 temp2_ids
```

They have the same checksum and the same file size. If there were even one character difference, the checksums would no longer match.

In [None]:
!cksum temp*
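
To see what a mismatch looks like, here is a sketch that compares an exact copy with a file in which one character was changed. The file names and the altered ID are made up, and `cut` is used only to pull the checksum out of cksum's output:

```shell
# Sketch: compare an exact copy and a one-character variant (made-up data).
scratch=$(mktemp -d)
printf '>Contig_247\n>Contig_447\n' > "$scratch/a_ids"
cp "$scratch/a_ids" "$scratch/b_ids"                     # exact copy
printf '>Contig_247\n>Contig_448\n' > "$scratch/c_ids"   # one character changed

# diff exits with status 0 when files match and 1 when they differ.
if diff "$scratch/a_ids" "$scratch/b_ids" > /dev/null; then same_ab=yes; else same_ab=no; fi
if diff "$scratch/a_ids" "$scratch/c_ids" > /dev/null; then same_ac=yes; else same_ac=no; fi

# Reading from standard input, cksum prints "checksum size";
# cut keeps only the first space-separated field, the checksum.
sum_a=$(cksum < "$scratch/a_ids" | cut -d ' ' -f 1)
sum_b=$(cksum < "$scratch/b_ids" | cut -d ' ' -f 1)
sum_c=$(cksum < "$scratch/c_ids" | cut -d ' ' -f 1)

echo "a vs b identical: $same_ab"
echo "a vs c identical: $same_ac"
```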


## Step 6: Checking for duplicates

First, we will create a file with duplicate IDs:

In [None]:
!cat temp1_ids temp2_ids > duplicate_ids
#This concatenates both temp files content into a new file

Next, check the contents of "duplicate_ids" using "less" or "cat". Then grab all of the contig IDs from "group20_contigs.fasta" that contain the number "51", and append them to a copy of "duplicate_ids" called "multiple_ids".

In [None]:
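!cp duplicate_ids multiple_ids
!grep 51 group20_contigs.fasta >> multiple_ids

If you were typing these two commands one after the other in an interactive bash terminal, you could write the second one as

```
grep 51 group20_contigs.fasta >> !$
```

which is the same as

```
grep 51 group20_contigs.fasta >> multiple_ids
```

because bash replaces "!$" with the last argument of the previous command. Cool shortcut huh? It relies on the terminal's command history, though, so it does not work in a notebook cell, where each "!" line runs in a fresh shell with no history.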

Also notice the ">>" arrows to indicate that we are appending to the existing "multiple_ids" file.
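
The difference between ">" and ">>" is easy to check on a scratch file. This is a minimal sketch with made-up contents:

```shell
# Sketch: ">" overwrites a file, ">>" appends to it (made-up data).
scratch=$(mktemp -d)

echo "first"  >  "$scratch/log.txt"    # ">" creates the file (or overwrites it)
echo "second" >> "$scratch/log.txt"    # ">>" adds a line at the end
appended_lines=$(wc -l < "$scratch/log.txt" | tr -d ' ')
echo "lines after append: $appended_lines"

echo "third"  >  "$scratch/log.txt"    # ">" again: the earlier lines are lost
final_content=$(cat "$scratch/log.txt")
echo "content after overwrite: $final_content"
```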

Now we will remove the existing "temp" files using a "*" wildcard:

In [None]:
!rm temp*

### Using sort and uniq

Now let's explore more of what "sort" and "uniq" can do for us. We want to find which IDs are unique and which are duplicated. If we read the manpage ("man uniq"), we see that there are "-d" and "-u" flags for doing just that. The "-d" flag will only print duplicate lines, one for each group. And the "-u" will only print unique lines. Don't forget that input to "uniq" needs to be sorted for this all to work because the duplicates need to be next to each other in the list.

In [None]:
!sort multiple_ids | uniq -d > temp1_ids #sort multiple_ids, pipe into a uniq -d (repeated flag), place into temp1 file
!sort multiple_ids | uniq -u > temp2_ids #sort multiple_ids, pipe into a uniq -u (unique flag), place into temp2 file
!diff temp* #check the differences between the two files

You should see something like this:

```
1,7c1,11
< >Contig_13471
< >Contig_1947
< >Contig_247
< >Contig_447
< >Contig_476
< >Contig_4764
< >Contig_4767
---
> >Contig_10051
> >Contig_1651
> >Contig_4851
> >Contig_5141
> >Contig_5143
> >Contig_5164
> >Contig_5170
> >Contig_5188
> >Contig_6351
> >Contig_9651
> >Contig_9851
```
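
The need to sort first is easiest to see on a tiny example. In the sketch below (made-up single-letter IDs), the duplicated lines are never next to each other in the original file, so `uniq -d` on the unsorted input reports nothing:

```shell
# Sketch: uniq only compares neighboring lines, so sort first (made-up data).
scratch=$(mktemp -d)
printf 'b\na\nb\nc\na\nd\n' > "$scratch/ids"

dups=$(sort "$scratch/ids" | uniq -d)       # lines that appear more than once
singles=$(sort "$scratch/ids" | uniq -u)    # lines that appear exactly once
unsorted_dups=$(uniq -d "$scratch/ids")     # empty: no duplicates are adjacent

echo "duplicated:";  echo "$dups"
echo "unique:";      echo "$singles"
```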

Let's remove our temp files again and make a "clean_ids" file:

In [None]:
!rm temp*
!sort multiple_ids | uniq > clean_ids
!wc -l multiple_ids clean_ids

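You should see something like this:

    14 multiple_ids
     7 clean_ids
    21 total

These counts correspond to a "multiple_ids" file that holds only the 14 duplicated group12 lines. If the 11 group20 IDs shown in the diff output above were also appended in Step 6, expect 25, 18, and 43 instead.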

## Step 7: Using the sed command to alter the ids

We can use "sed" to alter the IDs. The "s//" command says to "substitute" the first thing with the second thing, e.g., to replace the first occurrence of "foo" on each line with "bar", use ["s/foo/bar/"](http://stackoverflow.com/questions/4868904/what-is-the-origin-of-foo-and-bar). If you want to replace all instances of "foo" with "bar", use "s/foo/bar/g" to say you want to run the command "globally".
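
A quick way to see the difference is to feed sed a line that contains the pattern several times. This is a sketch that uses `echo` and `printf` to supply made-up input:

```shell
# Sketch: first-match versus global substitution (made-up input).
first_only=$(echo "foo foo foo" | sed 's/foo/bar/')     # only the first "foo" changes
all_matches=$(echo "foo foo foo" | sed 's/foo/bar/g')   # every "foo" changes

# Without "g", the substitution still happens once on every line.
per_line=$(printf 'foo foo\nfoo foo\n' | sed 's/foo/bar/')

echo "$first_only"
echo "$all_matches"
echo "$per_line"
```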

In [None]:
!sed 's/C/c/' clean_ids
!sed 's/_/./' clean_ids
!sed 's/>//' clean_ids > newclean_ids

After we run these sed commands, what do our ids look like? Can you write a few Unix commands below to see what is in the newclean_ids file?

What did you change with the first two commands? Did it "stick", i.e., was it saved in the clean_ids file?

In [None]:
!cat clean_ids 
!cat newclean_ids

As we see with the cat commands above, only the last sed command stuck, because only its output was redirected into a file (newclean_ids). The first two printed their results to the screen and left clean_ids unchanged. We have a few options to apply all three substitutions at once. We could pipe the sed commands together, or we can put all of the substitutions inside one pair of quotes, separated by ";". Usually, a single command is given as the first argument to sed. But you can supply multiple commands by repeating the -e option (one command per -e) or by using the -f option (commands read from a file). All commands are applied to the input in the order they are specified, regardless of their origin.

In [None]:
!sed -e 's/C/c/;s/_/./;s/>//' clean_ids > final_clean_ids

In [None]:
!cat final_clean_ids

Now we see that all of the substitutions we wanted have been saved into the final_clean_ids file for further use.
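
For comparison, here is a sketch of three equivalent ways to chain the substitutions, run on two made-up IDs in a scratch file. It also checks that sed leaves its input file untouched unless the output is redirected somewhere:

```shell
# Sketch: three equivalent ways to apply several sed substitutions (made-up data).
scratch=$(mktemp -d)
printf '>Contig_247\n>Contig_447\n' > "$scratch/ids"

# One script, commands separated by ";".
with_semicolons=$(sed 's/C/c/;s/_/./;s/>//' "$scratch/ids")
# One -e option per command.
with_e_flags=$(sed -e 's/C/c/' -e 's/_/./' -e 's/>//' "$scratch/ids")
# Three sed processes piped together.
with_pipes=$(sed 's/C/c/' "$scratch/ids" | sed 's/_/./' | sed 's/>//')

# The input file itself has not changed.
original=$(cat "$scratch/ids")

echo "$with_semicolons"
```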