# Notebook 1.2: Genome Databases

In this notebook we will introduce more details about the NCBI and Ensembl genome databases, and how assembled genome data generally looks in published form. 

### Learning objectives: 

1. Understand the hierarchical structure of FTP sites. 
2. Learn the format of published reference genomes. 
3. Learn to keep a lookout for README files. 


### FTP sites

When using Codio for this course your unit will typically have a README file in it which contains instructions for how to complete your assignments each week. The use of a file called README to provide instructions is not specific to this course, but is actually a general practice used widely in computer programming and bioinformatics. 

For example, let's look at the NCBI RefSeq Genome Index: http://ftp.ncbi.nlm.nih.gov/genomes/ (You'll need to copy and paste the URL into your browser; unfortunately, markdown does not allow making FTP links clickable.) 

### The README file

FTP is a protocol for sharing large files over the internet. It lets users browse a remote file system as if it were a series of folders on their own computer. Open a browser tab to the FTP link above. You should see a series of links to other files and folders. Click on a folder and you will see a new page with links to the files and folders nested within that one. Go back to the original link location. Here there is a file called README.txt. **Click on it to read it.**
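The folder-and-file hierarchy described above can also be explored from Python. The following is a minimal sketch, not the course's required method: it assumes the index page is plain HTML with one `<a href="...">` link per entry, and the names `LinkCollector`, `split_listing`, and `SAMPLE_LISTING` are hypothetical helpers made up for illustration. The sample listing is a hand-written stand-in, not a copy of the real NCBI page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def split_listing(html):
    """Return (folders, files) for the entries linked in a listing page."""
    parser = LinkCollector()
    parser.feed(html)
    # Drop parent-directory links, absolute paths, and column-sort links.
    entries = [h for h in parser.links if not h.startswith(("?", "/", "../"))]
    # Assumption: folder links end with a slash, file links do not.
    folders = [h for h in entries if h.endswith("/")]
    files = [h for h in entries if not h.endswith("/")]
    return folders, files


# Hypothetical, hand-written listing used only to demonstrate the parser.
SAMPLE_LISTING = """
<pre>
<a href="../">Parent Directory</a>
<a href="README.txt">README.txt</a>
<a href="genbank/">genbank/</a>
<a href="refseq/">refseq/</a>
</pre>
"""

folders, files = split_listing(SAMPLE_LISTING)
print("folders:", folders)
print("files:", files)
```

To try this on the live site, you could fetch the page text with `urllib.request.urlopen(url).read().decode()` and pass it to `split_listing`; the real page markup may differ from the sample, so treat the filtering rules as a starting point.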

<div class="alert alert-success">
    <b>Question:</b> 
    From the README file, what is the difference between the refseq/ and genbank/ directories? Answer in markdown below. 
</div>

<div class="alert alert-warning">
    <h3>Response:</h3>
    
The GenBank directory 
area includes genome sequence data for a larger number of organisms 
than the RefSeq directory area; however, some assemblies are 
unannotated. The RefSeq genomes are all annotated.
</div>


------------------------------

### Genome files
We will focus on the RefSeq assemblies for now. Go down to the section in the README file labeled "Data provided per assembly:" and read the description of each of the files that can be included in a genome assembly directory. Pretty cool that all of this data is publicly available and so easy to access, huh? 

### Investigate a genome assembly 

From the original link go to the refseq/ directory and browse through the folders of genomes for different organismal groups. Choose an organism by selecting a folder labeled with a Latin binomial (e.g., *Homo sapiens*). If there are multiple genome assemblies available for this organism, select the one labeled "representative" or "reference". In the genome folder that you selected, find the file with the ending `_assembly_stats.txt` and open it. Answer the question below with the information contained in the assembly stats. 

<div class="alert alert-success">
    <b>Action:</b> Fill in the markdown cell below with the following statistics from the assembly stats of your selected organism. Make sure the markdown text is nicely formatted and easily readable when you are finished. (Tip: leave two spaces at the end of a line to create a line break.) 
</div>

<div class="alert alert-warning">
    <h3>Response:</h3>
    

- Organism name:  ...
- Date (publish date): ...  
- Sequencing technology:  ...
- Assembly_method:  ...
- Assembly_level:  ...
    
</div>
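If you would rather pull these fields out programmatically, here is a minimal sketch. It assumes the top of an `_assembly_stats.txt` file is a block of `# Key: value` comment lines; the function name `parse_stats_header` is hypothetical, and every value in `SAMPLE_STATS` is a made-up placeholder rather than data from a real assembly.

```python
def parse_stats_header(text):
    """Parse '# Key: value' comment lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        if not line.startswith("# "):
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line[2:].partition(":")
        if sep and value.strip():
            # Keep the first occurrence of each key.
            fields.setdefault(key.strip(), value.strip())
    return fields


# Placeholder header text (assumed layout, invented values).
SAMPLE_STATS = """\
# Assembly name:  ExampleAsm1.0
# Organism name:  Example organism (placeholder)
# Date:           2000-01-01
# Assembly level: Scaffold
# Sequencing technology: ExampleTech
# Assembly method: ExampleAssembler v. 1.0
#
"""

fields = parse_stats_header(SAMPLE_STATS)
wanted = ["Organism name", "Date", "Sequencing technology",
          "Assembly method", "Assembly level"]
for name in wanted:
    # Not every assembly reports every field, so fall back gracefully.
    print(f"- {name}: {fields.get(name, 'not reported')}")
```

To use it on a real file, save the stats file next to this notebook and pass its contents, e.g. `parse_stats_header(open(path).read())`, where `path` is wherever you saved it.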


### Comparing genome assembly stats
Take a look at two representative genomes, one considered very good and the other not so great: the [Baker's Yeast](https://en.wikipedia.org/wiki/Saccharomyces_cerevisiae) genome (fungi/Saccharomyces_cerevisiae) and the [Walnut](https://en.wikipedia.org/wiki/Juglans_regia) genome (plant/Juglans_regia). Find their assembly stats on the FTP site. 

<div class="alert alert-success">
    <b>Question:</b> 
    After looking at the other assembly stats files, do you think that the organism you selected above is a very good genome assembly, or a not-so-good genome assembly? What statistics of the assembly report did you use to make this conclusion, and what do those statistics mean? Why do you think the Walnut genome includes some stats that the Yeast genome doesn't? Answer using markdown below. 
</div>


<div class="alert alert-warning">
    <h3>Response:</h3>

Based on the N50 statistics and the fact that the Yeast genome is chromosome-scale whereas the Walnut genome is only scaffold-level, it is clear that the Yeast genome is better. 
</div>
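Since N50 is central to the comparison above, a short sketch of how it is computed may help. N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function name `n50` and the length lists below are hypothetical examples, not values from the Yeast or Walnut assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop once the accumulated length reaches half the total.
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


# Two made-up assemblies with the same total length (200).
few_long_pieces = [100, 50, 25, 25]
many_short_pieces = [20] * 10

print(n50(few_long_pieces))    # 100
print(n50(many_short_pieces))  # 20
```

Both toy assemblies have the same total size, but the one made of fewer, longer pieces has the larger N50, which is why a higher N50 is usually read as a sign of a more contiguous assembly.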


----------------------

## NCBI and Ensembl

NCBI is maintained by the US NIH, but it is not the only place where genome data is made publicly available online. Several governmental agencies around the world have been set up to archive and host genomic data. For example, in Europe EMBL hosts genomic resources, and the [Ensembl](https://useast.ensembl.org/info/about/index.html) database provides annotation resources similar to NCBI RefSeq. You can browse the Ensembl FTP site here (http://ftp.ensembl.org/pub/release-94/fasta/), where you will find similar files, though sometimes organized and formatted a bit differently. 

<div class="alert alert-success">
    <b>Action:</b> 
    When completed, save and download this notebook in HTML format and submit it along with the other notebooks for this unit to courseworks.  
</div>
