# Outline: Accessing NCBI Resources on the Command Line for Biologists 

This Jupyter Notebook contains the background and instructions for the hands-on exercises of this workshop:

* [Introduction](#Introduction)
* [Objective 1 - Getting Around on the Command Line](#Objective-1)

# Introduction

## About Jupyter Notebooks and this workshop  
This workshop uses a Jupyter Notebook, a platform that allows you to run code from individual cells in a web page and display the results of the command.  

In this notebook, the code cells run the Unix bash shell, so you can use them to run command-line tools (MagicBLAST, blastn, efetch, etc.) or invoke any standard Unix program or utility (ls, grep, head, tail, sort, cut, etc.).

In some cases the commands will create files in your working directory on the server. When that happens, a new file will appear in the list on the left-hand side of your notebook.  

To run the code in a cell, select the cell and click the "Run" button at the top of the notebook, or hold down the shift key and press the enter key (shift+enter).

**Example:** Run the following cell. This will create a file in your working directory, list the contents of your directory, and display the contents of your file.

In [None]:
# Create a file (my_file.txt is just an example name), list the directory, and show the file
echo "This is my file" > my_file.txt
ls
cat my_file.txt

The output appears immediately below the cell. Notice that the new file appears in the list on the left-hand side of your browser. Also, the bracketed space to the left of the cell now contains the number 1 \[1]. This number counts how many cells you have run so far in the notebook. While a cell is running, the brackets show an asterisk \[*].

**Important tip:** As you go through the notebook, some cells will only work if the cells before them have already been run. If you missed a previous cell, you can use the "Run" menu to "Run All Above the Selected Cell".

**If you get lost**: Click the outline icon on the left sidebar (looks like three bullet points) to get an entire interactive outline of the course. 

## Software installed on this notebook
The following bioinformatics tools are installed on this server for use during the workshop, and will need to be installed locally if you want to re-create these analyses on your own computer. 

- **EDirect**: a suite of scripts for accessing NCBI sequence and literature data through the E-Utilities API. [More info](https://www.ncbi.nlm.nih.gov/books/NBK179288/)


## Our Case Studies

# Objective 1 - Set up a workspace using the command line <a class="anchor" id="Objective-1"></a>

## So, what IS the command line? And why do we use it? 

**End Goal:** A main project directory with appropriately named sub-directories for different types of data (citations/publications, nucleotide, protein, genomes)

## Running Commands: Program Name and Parameters

**Commands I will demo during this section:**
* `pwd`
*  `ls` vs. `ls -a` - commands that can be run with or without arguments 
*  `echo` (uh oh, this is missing an argument! ) 
* `echo "this is a message"` this is better! 
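
A minimal sketch of the commands above, run in a throwaway directory so the listings are predictable. The directory and file names (`scratch`, `visible.txt`, `.hidden.txt`) are hypothetical and only serve to show what `-a` changes.

```shell
# Remember where we started so we can come back at the end
start_dir=$(pwd)

# pwd needs no arguments: it prints the current working directory
pwd

# Work in a temporary directory with one ordinary file and one dot file
scratch=$(mktemp -d)
cd "$scratch"
touch visible.txt .hidden.txt

# Without arguments, ls skips names that start with a dot
plain_listing=$(ls)
echo "$plain_listing"

# With -a, ls also shows dot files, plus . and ..
all_listing=$(ls -a)
echo "$all_listing"

# echo with no argument just prints an empty line
echo

# echo with an argument prints that argument
message=$(echo "this is a message")
echo "$message"

# Return to the starting directory
cd "$start_dir"
```

The same program (`ls`, `echo`) behaves differently depending on what follows its name, which is the pattern every later command in this workshop relies on.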

## Interacting with Unix directory structure

**Need to create custom directory structure diagram labeled with our JupyterHub directory names**

**Commands I will demo:**
* You are Here:  `pwd`
* What's in a (directory) name/path?
* Checking out our surroundings: `ls` for within a directory, compared with `ls .` and `ls ..` 
* Moving around in our surroundings: `cd` command and its arguments, practicing/predicting what you get if you run `ls` after moving to a different directory.
* Creating your own directories to store data: `mkdir`. 
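
The steps above can be sketched as follows. The project name `aip_project` and the sub-directory names are assumptions chosen to match the end goal of this objective, and the project is built in a temporary location so nothing in your real workspace changes.

```shell
# Remember where we started so we can come back at the end
start_dir=$(pwd)

# Hypothetical project location and name; swap in your own
project="$(mktemp -d)/aip_project"

# -p creates parent directories as needed and is silent if they already exist
mkdir -p "$project/PubMed" "$project/Nucleotide" "$project/Protein" "$project/Genomes"

# Move into the project and confirm where we are
cd "$project"
pwd

# "." means the current directory, so these two listings match
ls
ls .

# ".." means the parent directory, which holds aip_project itself
parent_listing=$(ls ..)
echo "$parent_listing"

# Move into a sub-directory, then step back up one level
cd PubMed
cd ..
here=$(basename "$(pwd)")
echo "$here"

# Return to the starting directory
cd "$start_dir"
```

Before running each `ls`, try predicting what it will print; that habit makes it much easier to notice when you are not where you think you are.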

## Interacting with files

**Commands I will demo during this section:**
* `mv` can be used to literally move a file OR rename a file 
* `cp`
* `less`
* `head`
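
As a rough illustration of these four commands, the cell below practices on a small made-up file. All file and directory names (`numbers.txt`, `archive`, and so on) are hypothetical.

```shell
# Remember where we started, then work in a temporary practice directory
start_dir=$(pwd)
practice_dir=$(mktemp -d)
cd "$practice_dir"

# Make a 20-line file to practice on
seq 1 20 > numbers.txt

# cp leaves the original in place and makes a second copy
cp numbers.txt numbers_backup.txt

# mv with a new name in the same directory renames the file
mv numbers.txt renamed_numbers.txt

# mv with a directory as the destination moves the file there
mkdir archive
mv numbers_backup.txt archive/

# head prints the first 10 lines by default; -n changes the count
head renamed_numbers.txt
first_three=$(head -n 3 renamed_numbers.txt)
echo "$first_three"

# less opens an interactive pager (press q to quit), so it is commented out here
# less renamed_numbers.txt

# Return to the starting directory
cd "$start_dir"
```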

# Objective 2: Use EDirect commands to identify genes associated with AIP and identify suitable animal models for the disease

## Introduction to the EDirect suite

## Using `einfo` to find out about databases

The `einfo` command is very useful for learning about available databases, the searchable fields in each database, and links between the various NCBI databases. When run with the `-dbs` (databases) flag, `einfo` produces a list of available databases.  

In [None]:
einfo -dbs 

Okay, that is a long list of databases! Sometimes, we want to store a list like this in a text file so that we can examine it again without running the command again. We can use the `>` symbol to direct the output of the `einfo` command into a text file called `ncbi_dbs.txt` that lives in our current directory. 

In [None]:
einfo -dbs > ncbi_dbs.txt

We can go to our files pane (click on folder icon), and double click `ncbi_dbs.txt` to open that list in another tab for future reference! We can look at the list and guess that `pubmed` might be a good option for looking at publications. To check, we can run `einfo` again, this time specifying `pubmed`. 

In [None]:
einfo -db pubmed

This output (unless you are already fluent in XML) is a little bit hard to read. Let's add the `-fields` flag to get a human-readable list of the searchable fields in `pubmed`, and also direct this list to a file. This will come in handy soon as we refine our searches. Take a look! 

In [None]:
einfo -db pubmed -fields > pubmed_fields.txt

In [None]:
# Briefly compare with the fields in a sequence database, then clear the output
einfo -db nuccore -fields

## Using `esearch` and `efetch` to find relevant results in databases. 

As the name might suggest, `esearch` allows us to perform an Entrez search of a database using a query. For our purposes, a basic `esearch` command will have a database specified using `-db` (remember to look at our list!) and something that we are searching for inside of that database, specified using `-query`. 

In [None]:
# Demonstrate that running `esearch` without specifying a value for one or both of these will return errors
esearch -db pubmed
esearch -query "aip" 

First, let's try the full name of the disease as a query. 

In [None]:
# Looking for a disease of interest to us...  
esearch -db pubmed -query "acute intermittent porphyria"

This result structure tells us a number of useful things, and also stores this information in a way that can feed into downstream programs. First, it reminds us which database we just searched, and stores this info for use in other programs. We may also want to look at the `Count` value, which gives the number of records returned by the search. 

As of March 2023, there are **2353 results**, too many papers for us to read, and probably too many for us to even download. Note also that the `esearch` step in and of itself does NOT give us any info about the contents of the results. 
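
To make the result structure more concrete, here is a minimal sketch that pulls the database name and the count out of it using plain `sed`. The `result` text is an abbreviated, illustrative copy of an `esearch` result (real output contains additional fields), and the variable names are hypothetical.

```shell
# Abbreviated, illustrative copy of an esearch result (real output has more fields)
result='<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <Count>2353</Count>
</ENTREZ_DIRECT>'

# Keep only the text between each opening and closing tag
db=$(echo "$result" | sed -n 's:.*<Db>\(.*\)</Db>.*:\1:p')
count=$(echo "$result" | sed -n 's:.*<Count>\(.*\)</Count>.*:\1:p')

echo "Database searched: $db"
echo "Records found: $count"
```

With EDirect installed, piping a live `esearch` result into `xtract -pattern ENTREZ_DIRECT -element Count` should report the count in the same way, without the hand-written `sed` pattern.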

### Refining our search using database fields

If we want to get a more manageable number of results, one way to do that is to modify the query with an applicable **field** from the PubMed database. Take a look back at our `pubmed_fields.txt` file to get an idea of their names and abbreviations. **Let's modify our query to look only at papers published in 2020:** 

In [None]:
esearch -db pubmed -query "acute intermittent porphyria AND 2020 [PDAT]"

This time we see many fewer results! What did we modify in our query? 
* `AND`: This tells **esearch** we are qualifying the first half of the query, which is our disease name, with another condition. In this case, a publication year. 
* `2020`: Our chosen value for specifying publication date 
* `[PDAT]`: This tells **esearch** which PubMed database field to search for `2020`. Note that this term matches exactly what we got from `einfo`, and is placed inside of brackets. 
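
The three pieces described above can also be assembled from shell variables, which is convenient when you want to rerun the same search for a different year or field. This is only a sketch: the variable names are hypothetical, and the search itself runs only if `esearch` is found on your machine.

```shell
# Pieces of the query (the variable names are only for illustration)
term="acute intermittent porphyria"
year="2020"
field="PDAT"

# Join them as: term AND value [FIELD]
query="$term AND $year [$field]"
echo "$query"

# Run the search only if EDirect is available on this machine
if command -v esearch >/dev/null 2>&1; then
    esearch -db pubmed -query "$query" || echo "esearch failed (network problem?)"
fi
```

Changing `year` or `field` and rerunning the cell rebuilds the whole query, so the quoting only has to be written correctly once.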

You can apply this same logic to choosing a specific journal. Let's assume that since we are interested in the genetic basis of AIP, the journal **Genes** might be a good option: 

In [None]:
esearch -db pubmed -query "acute intermittent porphyria AND Genes [JOUR]"

As of March 2023, this returned just a single result, much more manageable. We can actually get the same result if we specify BOTH a journal and year: 

In [None]:
esearch -db pubmed -query "acute intermittent porphyria AND Genes [JOUR] AND 2020 [PDAT]"

### Sending `esearch` results to `efetch`

Now that we have narrowed our search down to what looks like just one publication, we can retrieve more information about it. We do so by adding a second command, `efetch`, which can receive the result structure directly from `esearch` and return information about that result in a format of our choice. 

In [None]:
esearch -db pubmed -query "acute intermittent porphyria AND Genes [JOUR]" | efetch -format abstract

In [None]:
# You can also direct this to a file to save it for later 
esearch -db pubmed -query "acute intermittent porphyria AND Genes [JOUR]" | efetch -format abstract > AIP_genes.txt

In [None]:
head AIP_genes.txt
# mv AIP_genes.txt PubMed/

In [None]:
#Medline format, ready for your citation manager
esearch -db pubmed -query "acute intermittent porphyria AND Genes [JOUR]" | efetch -format medline

In [None]:
# And finally, if you just want PubMed ID to pipe into something else
esearch -db pubmed -query "acute intermittent porphyria AND Genes [JOUR]" | efetch -format uilist

In [None]:
esearch -db pubmed -query "acute intermittent porphyria AND Genes [JOUR]" | elink -target gene

If you want to learn more about the possible formats associated with each NCBI database, here is a link: https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/

### Using the NCBI `Gene` Database

By reading over just this one abstract, we learn that this form of porphyria is related to decreased activity of hepatic hydroxymethylbilane synthase (HMBS), the third enzyme in the heme biosynthetic pathway. Now we know what our target gene is, and can search for it in any number of NCBI databases. 

In [None]:
# esearch -db pubmed -query "acute intermittent porphyria" | elink -target gene 

In [None]:
# esearch -db pubmed -query "acute intermittent porphyria AND Genes [JOUR]" | elink -target gene | efetch

In [None]:
# Seems obvious to see what "Gene" has to say about a gene...
einfo -db gene -fields

In [None]:
esearch -db gene -query "HMBS" 

In [None]:
#Specify that HMBS is actually the official gene symbol, not just a set of characters to search for 
esearch -db gene -query "HMBS[sym]"

In [None]:
esearch -db gene -query "HMBS[sym]" | efetch -format gene_table > hmbs_gene_table_sym.txt

In [None]:
# Still 408 results, but this hopefully filters out more irrelevant entries. 
head hmbs_gene_table_sym.txt

In [None]:
# Finding the right genes by starting with the disease

esearch -db gene -query "acute intermittent porphyria"

In [None]:
esearch -db gene -query "acute intermittent porphyria" | efetch -format gene_table | head -30

In [None]:
esearch -db gene -query "3145[UID]" | efetch -format gene_table | head -10

## Searching a sequence database

We can take back a look at 

# Objective 3: Use `elink` to search more exhaustively for genes and proteins 

In [None]:
esearch -db gene -query "3145[UID]"

In [None]:
esearch -db gene -query "3145[UID]" | efetch -format gene_table

**We can learn a lot about HMBS from this output:**  

HMBS hydroxymethylbilane synthase`[Homo sapiens]`

Gene ID: 3145, updated on 5-Mar-2023

Reference GRCh38.p14 Primary Assembly NC_000011.10  from: `119084881 to: 119093549`

mRNA transcript variant X1 XM_005271531.2, 14 exons,  total annotated spliced exon length: 1596 

protein isoform X1 XP_005271588.1, 13 coding  exons,  annotated AA length: 344
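
As a quick worked example, the genomic span of the gene follows from the two coordinates in the output above. This sketch assumes the coordinates are 1-based and inclusive, which is why 1 is added; the variable names are hypothetical.

```shell
# Coordinates copied from the gene_table output above
gene_start=119084881
gene_end=119093549

# Both end positions are counted, hence the +1 (assumes 1-based, inclusive coordinates)
gene_length=$((gene_end - gene_start + 1))
echo "HMBS spans $gene_length bp on NC_000011.10"
```

Comparing this genomic span with the 1596-base spliced exon length listed for the transcript gives a feel for how much of the gene is intron.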

In [None]:
einfo -db gene -links

In [None]:
esearch -db gene -query "HMBS[sym]" | elink -target taxonomy