================================================= So you want to start using that big data in NCBI?
Authored by: Fan Yang & Adina Howe; MODIFIED from a previous tutorial
EDAMAME tutorials have a CC-BY license. Share, adapt, and attribute please!
=================== Learning objectives
This is a tutorials for working with the data that is available in NCBI. The learning objectives for this tutorial are as follows:
- To be able to download specific gene sequences or genomes from NCBI (even with a big list of gene sequences).
- To be able to create use these genes as a database to annotate a sequencing dataset.
- To estimate the number of genes and their corresponding annotations in multiple sequencing datasets.
You will need to know some things prior to this tutorial:
- Familiarity with the structure of NCBI website and their nucleotide and genome databases.
- Ability to navigate in the unix shell.
- Ability to execute programs in the shell.
- Access and login to an Amazon EC2 instance or similar ubuntu-based server
The key challenge that we will work through...or your mission, if you choose to accept it, is to identify nitrogen fixation genes found in sequencing DNA from soils.
You have been delivered three dogma-changing metagenomes (sequencing datasets) originating from three different Iowa crop soils (corn, soybean, and prairie). You are interesting in identifying nitrogen fixation genes that are associated with native bacteria in these soils. Nitrogen fixation is a natural process performed by bacteria that converts nitrogen in the atmosphere into a form that is usable for plants. If we can optimize natural nitrogen fixation, our hope is to reduce nitrogen fertilizer inputs that may contribute to the eutrophication of downstream waters (e.g., dead zones in the Gulf of Mexico).
========== Your tools
Get your EC2 instance going - you need an Ubuntu 14.0 based instance (64-bit) with m3.large.
================ Getting the Data
Get the metagenome datasets and scripts related to this tutorial. Don't forget, you will need to install "git".
sudo apt-get install git
All the tutorial materials are contained on a Github repository. The reason for using Github is that this material can be updated by me and grabbed by you lucky folk seamlessly with just a couple commands. To refresh your memmory, here is Jackson's github tutorial. If you are interested in learning more about Git, see this tutorial also. Now, let's get some play data:
git clone https://github.com/germs-lab/tutorial-blast-annotation
This command will make a directory (or folder for those more Finder/Explorer inclined) named "tutorial-blast-annotation" in the location where it was run. Within that directory, there will be two directories named "data" and "ncbi". You can see this by navigating (hint: cd) to the "tutorial-blast-annotation" directory and typing:
This tutorial is located in directory "ncbi". There are many more for you to explore on your own.
Navigate to the data directory and identify the number of sequences in each file. Hint: To find specific characters in a file, you can use grep. For example, to find all instances of AGTC in the corn.fa file, we could:
grep AGTC corn.fa
To find sequences, we know that each sequence will start with a special character, ">". This character in the shell, remember, is a bit special. So to find it as a symbol in the text, we're going to put a '^' right before it in quotes:
grep ^">" corn.fa
Now, to count, you'll remember we can use the command "wc", with a pipe...So you're command will look something like this:
grep ^">" corn.fa | wc
Or...if you want to do this quicky::
for x in *fa; do echo $x; grep ^">" $x | wc; done
To identify nitrogen fixation genes, you've been tasked to build a database of all previously observed known nitrogen fixation genes (nifH). To build this database, you have been reading literature for about two weeks and come up with a list of about 30 genes:
You'll also see this list in a file in the data directory (hint: use cat).
Check out the file containing these gene IDs.
You have a sinking feeling like this isn't really leveraging the big data biology that everyone says sequencing technologies have provided. You've decided to check out NCBI for its contents.
Go to the NCBI webpage and identify an estimate of total nifH genes and download a list of their accession numbers.
You'll want to navigate in a web-browser to the
http://www.ncbi.nlm.nih.gov/. You'll see in the search query box that you can search a number of databases. Here, we want to look at the nucleotide database and query something along the lines of nifH or nitrogen fixation.
When I did this, there were nearly 270,000 genes that were hit by this query. You will want to look for the "Send To" link at the upper right of the page (put on a magnifying glass!), and download the GI list for this query.
Find that file on your computer and give it a peek. If you're feeling up for it, transfer it to your EC2 instance (hint: scp).
To make this tutorial not-as-painful to complete in a reasonable amount of time, I've also made a list of 300 nifH genes from NCBI and put them in a file 300-nifh-genes.txt in the data directory. I would highly suggest you use this gene to build your database going forward in this tutorial.
Take a look at this file. Prove to yourself that it contains 300 genes (Hint: wc)
Now, we are going to learn how to download these genes (by learning about the NCBI API below)
Think about how you would download this data if you didn't have this tutorial.
You may have thought about some of the following:
- Go to the web portal and look up each FASTA
- Go to the FTP site, find each genome, and download manually
- Use the NCBI Web Services API to download the data
Among these, I'm going to assume many of you are familiar with the first two. This tutorial then is going to focus on using APIs.
================================ Scaling "Getting the Data" On Up
Here's some answers on what exactly is API, among which my favorite is "an interface through which you access someone else's code or through which someone else's code accesses yours -- in effect the public methods and properties."
The NCBI has a whole toolkit which they call Entrez Programming Utilities or eutils for short. You can read all about it in the documentation. There are a lot of things you can do to interface with all things NCBI, including publications, etc., but I am going to focus today on downloading sequencing data. NCBI has also developed Entrez Direct to directly access, parse, and dowload from NCBI databases from command line.
Today we are going to focus on how to obtain sequence files using
wget and eutil web links from command line. To do this, you're going to be using one tool in eutils, called efetch. There is a whole chapter devoted to efetch -- when I first started doing this kind of work, this documentation always broke my heart. Its easier for me to just show you how to use it.
Open a web browser, and check out what NCBI knows about this gene (nifH, in this case). Check it out here.
Download the gene with eutils commands in your web-browser and take a look at the file.
On your web-browser, paste the following URL to download the nucleotide genome for gene X51500.1:
Try downloading the GenBank file instead by pasting this onto your web-browser:
Do you notice the difference in these two commands? Let's breakdown the command here:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?This is command telling your computer program (or your browser) to talk to the NCBI API tool efetch.
db=nuccoreThis command tells the NCBI API that you'd like it to look in this particular database for some data. Other databases that the NCBI has available can be found here.
id=X51500.1This command tells the NCBI API efetch the ID of the gene/genome you want to find.
rettype=gb&retmode=textThese two commands tells the NCBI how the data is returned. You'll note that in the two examples above this command varied slightly. In the first, we asked for only the FASTA sequence, while in the second, we asked for the Genbank file. Here's some elusive documentation on where to find these "return" objects.
Also, a useful command is also
version=1. There are different versions of sequences and some times that is useful. For reproducibility, I try to specify versions in my queries, see these comments.
Notice the "&" that comes between each of these little commands, it is necessary and important.
Ok, let's think of automating this sort of query. So...we're moving from your lil laptop to your jumbo EC2 instance now.
Download a gene sequence on the command line.
Going back onto your instance, in the shell, you could run the same commands above with the addition of curl on your EC2 instance:
You'll see it fly on to your screen. Don't panic - you can save it to a file and make it more useful BUT note the path you are in and where you will save this file (as long as you know...that's fine):
curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=X51500.1&rettype=fasta&retmode=text" > X51500.1.fa
You could now imagine writing a program where you made a list of IDs you want to download and put it in a for loop, curling each genome and saving it to a file. The following is a shell script, "fetch_genome.sh". It should be located in your current directory. Let's take a quick look at it (Hint: less).
You'll see that the id here is a string character which is obtained from list of IDs contained in a separate file. The rest of the script manages where the files are being placed and what they are named. It also prints some output to the screen so you know its running.
You'll see that you need to provide a list of IDs (the first argument) and a directory where you want to save the downloaded files (the second argument).
Run this script (note that your paths for the script or data may need to be specified) -- also see note below:
bash fetch_genome.sh ../data/300-nifh-genes.txt nifh_genes_fas
Sit back and think of the glory that is happening on your screen right now...
If you are nervous....you may want to run this on just a few of these IDs to begin with. You can create a smaller list using the head command with the -n parameter in the shell. For example,
head -n 3 300-nifh-genes.txt > 3genes.txt.
After all the 300 genes are downloaded, you will want to concatenate them into one file (Hint:
>>), named "all-nifH.fa".
Take a break. Put up your pink stickie if you need help with this.
==================================== Moving forward on your own: comparing your data to the databases
Frequently, we ask the question of whether a gene is present or not in our metagenome (and their potential abundance). There are lots of ways to do this and arguably "blasting" is one of the most common. What we have done so far is collecting a specific set (i.e., nifH) of genes from a public database. We can use this collection as our new blast database.
To blast, you will need to:
- Format your downloaded nifH gene fasta file ("all-nifH.fa") for blast
- Perform blast
Also please see Tracy Teal's tutorial here for reference.
You now have the foundation for having some sequencing data that you need to compare to any database. You should be able to generate the information needed to perform statistical analyses. Note, that you can do this for specific genes and also genomes...! Now, go forth and conquer!
==================================================== Bonus Material on Genbank Files and Genome downloads
We also use a lot of python to work with eutil as well as blast results. The alternative version of this tutorial contains lots of python code. If you are interested, they can be found here.