# **Bioinformatics with Jupyter Notebooks for WormBase:**
## **Data Retrieval 1 - Exploring WormBase FTP**
Welcome to the first Jupyter notebook in the WormBase tutorial series. Over this series of tutorials, we will write Python code to retrieve data available on the WormBase sites and perform simple analyses with it.

This tutorial deals with the data on the FTP site (ftp://ftp.wormbase.org/pub/wormbase/).
We will both explore the site and extract data of interest. Let's get started!

We start by installing and loading the libraries that are required for this tutorial.

In [1]:
#Run this cell if you do not have the wget package installed already.
import sys
!{sys.executable} -m pip install wget



In [2]:
import wget
import gzip
import shutil
import ftplib

### Connecting to the WormBase FTP site

First, we connect to the WormBase FTP site and navigate to the `pub/wormbase/` directory, which is where the data is located.
The cell below opens an FTP connection to the site and displays the files and sub-directories inside the `pub/wormbase/` directory.

In [17]:
ftp = ftplib.FTP('ftp.wormbase.org')
ftp.login()
ftp.cwd('pub/wormbase')
files = []
ftp.dir(files.append)
print(*files, sep = "\n")

-rwxr-xr-x    1 1001     1001         2930 Aug 31  2020 README
-rwxr-xr-x    1 1001     1001         2622 Aug 14  2014 README~
drwxr-xr-x    3 1001     1001           20 Aug 24  2013 archive
drwxr-xr-x   18 1001     1001         4096 Apr 15 15:53 datasets-published
drwxr-xr-x   11 1001     1001         4096 Mar 31  2016 datasets-wormbase
lrwxrwxrwx    1 1001     1001           29 Aug 29  2013 nGASP -> datasets-published/nGASP_2005
drwxr-xr-x    3 1001     1001           22 May 24  2011 outgoing
drwxr-xr-x    7 1001     1001          103 Apr 22 10:44 parasite
drwxr-xr-x    4 1001     1001           34 May 17  2013 people
drwxr-xr-x  100 1001     1001        16384 Jun 11 19:15 releases
drwxr-xr-x    4 1001     1001         8192 Mar 02  2013 software
drwxr-xr-x   36 1001     1001         4096 Aug 08  2019 species
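
The listing above is in the Unix `ls -l` style, where the first character of each line marks the entry type (`d` for a directory, `l` for a symbolic link, `-` for a regular file). As a minimal sketch, assuming the server keeps returning this nine-column layout (the format of `ftp.dir` output is server-dependent), a hypothetical helper `split_listing` can sort the entries by type:

```python
def split_listing(lines):
    """Split `ls -l` style listing lines into directories, regular files and links.

    `split_listing` is a name introduced here for illustration; it assumes
    each line has nine whitespace-separated columns with the name last.
    """
    dirs, regular, links = [], [], []
    for line in lines:
        fields = line.split(maxsplit=8)
        if len(fields) < 9:
            continue  # skip lines that do not match the expected layout
        kind, name = fields[0][0], fields[8]
        if kind == 'd':
            dirs.append(name)
        elif kind == 'l':
            # symbolic links are shown as "name -> target"; keep the name only
            links.append(name.split(' -> ')[0])
        else:
            regular.append(name)
    return dirs, regular, links
```

With the `files` list collected above, `split_listing(files)` would return the directories, the regular files and the links as three separate lists.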


The README file describes the data contained in each of the directories and sub-directories. Downloading and reading it first makes it easier to find the data you need.

In [4]:
with open('README', 'wb') as downloaded_file:
  ftp.retrbinary('RETR README', downloaded_file.write)

Display the contents of the README file for easy understanding of the organisation of the data on the FTP site.

In [5]:
with open('README') as f:
    lines = f.read()
    print(lines)

Site Contents
--------------------------------------

species/
   Core files and annotations for all species available
   at WormBase (or of possible interest to WormBase users)
   organized by species. Files previously available in genomes/
   can be found here.  File names, paths, and contents are 
   standardized and computable. Please see species/README for
   details.

      Look here for the most current and archival versions of:
        - genomic fasta sequence
        - genomic annotations in GFF2 or GFF3
        - assembly versions
        - commonly requests data sets by species
     
releases/
   Core files for each WormBase release organized by WS release ID.

      Check here if you are interested in downloading all the files
      that comprise the current WormBase release, or any other
      older releases.

datasets-published/
   Published datasets submitted to WormBase for distribution.

datasets-wormbase/
   WormBase-generated datasets and data dumps. Includes non-spe

### Downloading the entire current release

The releases folder contains the core files for the various releases of WormBase and all subsidiary files. We can easily access the current version and download the required data. 

In [6]:
# This is a huge download (~40 GB)! Uncomment this code and run this cell only if you need the entire current
# release and have enough disk space!

#!wget --cut-dirs 4 -r --no-parent ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/
#!mv ftp.wormbase.org current_release

The cell above downloads all the files associated with the current development release (around 40 GB). Much of this data may not be relevant to your needs, so the following cells show how to download only the files you require.

### Downloading sequence, annotation, gff and assembly data files belonging to current development release

The following cells deal with accessing and downloading the required sequence, annotations, gff, and assembly data from the current development release.

To download assembly files, annotation files, etc., set four variables: the organism name, the BioProject ID, the WormBase release ID (WS280 is the current development release as of June 2021), and the file type.

Change the variables to suit your needs. Before assigning them, you can list the contents of each directory to see which files are available.

In [7]:
#List the different species for which we have available data:
print(*ftp.nlst('releases/current-development-release/species'), sep = "\n")
#Assign species based on your requirement
species = 'c_elegans'

releases/current-development-release/species/ASSEMBLIES.WS280.json
releases/current-development-release/species/b_malayi
releases/current-development-release/species/c_angaria
releases/current-development-release/species/c_brenneri
releases/current-development-release/species/c_briggsae
releases/current-development-release/species/c_elegans
releases/current-development-release/species/c_inopinata
releases/current-development-release/species/c_japonica
releases/current-development-release/species/c_latens
releases/current-development-release/species/c_nigoni
releases/current-development-release/species/c_remanei
releases/current-development-release/species/c_sinica
releases/current-development-release/species/c_tropicalis
releases/current-development-release/species/o_tipulae
releases/current-development-release/species/o_volvulus
releases/current-development-release/species/p_pacificus
releases/current-development-release/species/p_redivivus
releases/current-development-release/species
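
The file names in the listing above embed the release ID (for example `ASSEMBLIES.WS280.json`), so the ID can be read from a listing rather than typed in by hand. A small sketch, where `find_release_id` is a hypothetical helper and the `.WS<number>.` naming pattern is assumed from the listing shown here:

```python
import re


def find_release_id(names):
    """Return the first release ID (e.g. 'WS280') found in a list of names.

    Assumes the ID appears between dots, as in 'ASSEMBLIES.WS280.json'.
    Returns None when no name matches.
    """
    for name in names:
        match = re.search(r'\.(WS\d+)\.', name)
        if match:
            return match.group(1)
    return None
```

For instance, passing the result of `ftp.nlst('releases/current-development-release/species')` to `find_release_id` would give a value that could be assigned to the `wormbase_id` variable used later in this tutorial.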

In [8]:
#List the different bioproject IDs for which we have available data:
print(*ftp.nlst('releases/current-development-release/species/'+species), sep = "\n")
#Assign bioproject ID based on your requirement
bioproject = 'PRJNA275000'

releases/current-development-release/species/c_elegans/PRJEB28388
releases/current-development-release/species/c_elegans/PRJNA13758
releases/current-development-release/species/c_elegans/PRJNA275000


In [9]:
#Now list all available files for the species and bioproject ID chosen:
print(*ftp.nlst('releases/current-development-release/species/'+species+'/'+bioproject), sep = "\n")

releases/current-development-release/species/c_elegans/PRJNA275000/c_elegans.PRJNA275000.WS280.annotations.gff3.gz
releases/current-development-release/species/c_elegans/PRJNA275000/c_elegans.PRJNA275000.WS280.canonical_geneset.gtf.gz
releases/current-development-release/species/c_elegans/PRJNA275000/c_elegans.PRJNA275000.WS280.genomic.fa.gz
releases/current-development-release/species/c_elegans/PRJNA275000/c_elegans.PRJNA275000.WS280.genomic_masked.fa.gz
releases/current-development-release/species/c_elegans/PRJNA275000/c_elegans.PRJNA275000.WS280.genomic_softmasked.fa.gz
releases/current-development-release/species/c_elegans/PRJNA275000/c_elegans.PRJNA275000.WS280.mRNA_transcripts.fa.gz
releases/current-development-release/species/c_elegans/PRJNA275000/c_elegans.PRJNA275000.WS280.protein.fa.gz
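
Each file name in the listing above follows the pattern `species.bioproject.release.descriptor.extension.gz`. As a sketch under that assumption, a hypothetical `describe_release_file` function splits a path into those parts, which shows which `descriptor` and `extension` values are valid choices:

```python
import posixpath


def describe_release_file(path):
    """Split a species file path into its naming components.

    Hypothetical helper; assumes names of the form
    species.bioproject.release.descriptor.extension[.gz].
    """
    name = posixpath.basename(path)
    if name.endswith('.gz'):
        name = name[:-3]  # drop the compression suffix
    parts = name.split('.', 4)
    if len(parts) != 5:
        raise ValueError('unexpected file name: ' + name)
    keys = ('species', 'bioproject', 'release', 'descriptor', 'extension')
    return dict(zip(keys, parts))
```

Applying it to every entry returned by `ftp.nlst(...)` for a bioproject gives the available descriptor and extension pairs without reading the names by eye.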


In [10]:
#The WormBase ID for the current development release (as of June 2021) is WS280.
wormbase_id = 'WS280'
#Change the descriptor and extension based on your needs, matching the file names listed above. Do not include .gz!
descriptor = 'annotations'
extension = 'gff3'

In [11]:
#Generate the link for your download based on your requirements as assigned
link = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/'+species+'/'+bioproject+'/'+species+'.'+bioproject+'.'+wormbase_id+'.'+descriptor+'.'+extension+'.gz'
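
The same URL pattern applies to every species file, so it can be wrapped in a small function and reused. A sketch, where `build_species_url` and `BASE_URL` are names introduced here for illustration:

```python
# Root of the current development release, as used throughout this tutorial.
BASE_URL = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release'


def build_species_url(species, bioproject, wormbase_id, descriptor, extension):
    """Build the download URL for one gzipped species file."""
    filename = f'{species}.{bioproject}.{wormbase_id}.{descriptor}.{extension}.gz'
    return f'{BASE_URL}/species/{species}/{bioproject}/{filename}'
```

Calling `build_species_url(species, bioproject, wormbase_id, descriptor, extension)` with the variables assigned above should produce the same string as `link`.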

Download the file from the link generated above, then decompress the .gz archive to get the required file.

In [12]:
wget.download(link)

'c_elegans.PRJNA275000.WS280.annotations.gff3.gz'
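
The `wget` package used above is a third-party install. As an alternative sketch that relies only on the standard library, a hypothetical `download_if_missing` helper fetches a URL with `urllib.request.urlretrieve` and skips the transfer when the file already exists locally, which is convenient when re-running the notebook:

```python
import os
import urllib.parse
import urllib.request


def download_if_missing(url, dest_dir='.'):
    """Download `url` into `dest_dir` unless the file is already there.

    Hypothetical helper; the local file name is taken from the last
    component of the URL path. Returns the local path.
    """
    filename = os.path.basename(urllib.parse.urlparse(url).path)
    dest = os.path.join(dest_dir, filename)
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest
```

Note that this sketch does not verify that an existing file is complete, so a partially downloaded file would need to be deleted by hand before retrying.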

In [13]:
downloaded_file = species+'.'+bioproject+'.'+wormbase_id+'.'+descriptor+'.'+extension
with gzip.open(downloaded_file + '.gz', 'rb') as f_in:
    with open(downloaded_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
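
Decompressing to disk is not always necessary: `gzip.open` in text mode can stream the file line by line. The sketch below counts feature types in a gzipped GFF3 file, relying on the GFF3 convention that records are tab-separated with the feature type in the third column and that lines starting with `#` are directives or comments. `count_feature_types` is a name introduced here for illustration:

```python
import gzip
from collections import Counter


def count_feature_types(gff3_gz_path):
    """Count how often each feature type occurs in a gzipped GFF3 file."""
    counts = Counter()
    with gzip.open(gff3_gz_path, 'rt') as handle:
        for line in handle:
            if line.startswith('#') or not line.strip():
                continue  # skip directives, comments and blank lines
            columns = line.rstrip('\n').split('\t')
            if len(columns) >= 3:
                counts[columns[2]] += 1
    return counts
```

For example, `count_feature_types(downloaded_file + '.gz').most_common(10)` would list the ten most frequent feature types in the annotation file downloaded above.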

#### Downloading the reference genome for Caenorhabditis elegans

In [14]:
#Generate the link for the reference genome
species = 'c_elegans'
bioproject = 'PRJNA13758'
wormbase_id = 'WS280'
descriptor = 'genomic'
extension = 'fa'
link = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/'+species+'/'+bioproject+'/'+species+'.'+bioproject+'.'+wormbase_id+'.'+descriptor+'.'+extension+'.gz'

In [15]:
#Download the reference genome
wget.download(link)
downloaded_file = species+'.'+bioproject+'.'+wormbase_id+'.'+descriptor+'.'+extension
with gzip.open(downloaded_file + '.gz', 'rb') as f_in:
    with open(downloaded_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
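
As a quick sanity check on the downloaded genome, the sequence lengths can be computed directly from the gzipped FASTA file. This is a minimal sketch using only the standard library; `fasta_lengths` is a hypothetical helper, and it takes the sequence name to be the first word after `>` on each header line:

```python
import gzip


def fasta_lengths(fasta_gz_path):
    """Return a dict mapping each sequence name to its length in bases."""
    lengths = {}
    name = None
    with gzip.open(fasta_gz_path, 'rt') as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                # header line: keep the first word as the sequence name
                words = line[1:].split()
                name = words[0] if words else ''
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths
```

Calling `fasta_lengths(downloaded_file + '.gz')` on the reference genome would give one entry per sequence in the file.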

### Downloading ontology data files belonging to current development release

The following cells deal with accessing and downloading the required ontology data from the current development release.

Change the variables to suit your needs. You can list the files in the directory first to see what is available before assigning your variables.

In [18]:
#List the available ontology data in the current development release:
print(*ftp.nlst('releases/current-development-release/ONTOLOGY'), sep = "\n")

releases/current-development-release/ONTOLOGY/anatomy_association.WS280.wb
releases/current-development-release/ONTOLOGY/anatomy_association.WS280.wb.b_malayi
releases/current-development-release/ONTOLOGY/anatomy_association.WS280.wb.c_brenneri
releases/current-development-release/ONTOLOGY/anatomy_association.WS280.wb.c_briggsae
releases/current-development-release/ONTOLOGY/anatomy_association.WS280.wb.c_elegans
releases/current-development-release/ONTOLOGY/anatomy_association.WS280.wb.c_japonica
releases/current-development-release/ONTOLOGY/anatomy_association.WS280.wb.c_remanei
releases/current-development-release/ONTOLOGY/anatomy_association.WS280.wb.o_volvulus
releases/current-development-release/ONTOLOGY/anatomy_association.WS280.wb.p_pacificus
releases/current-development-release/ONTOLOGY/anatomy_association.WS280.wb.s_ratti
releases/current-development-release/ONTOLOGY/anatomy_association.WS280.wb.t_muris
releases/current-development-release/ONTOLOGY/anatomy_ontology.WS280.obo
r

In [19]:
#Assign the data_type and species based on your requirement and on availability (see the output of the previous cell)

# Remember - the data_type variable can have any one of these values: 
#anatomy_association, anatomy_ontology, development_association, development_ontology,disease_association, 
#disease_ontology, gene_association, gene_ontology, phenotype_association, phenotype_ontology, rnai_phenotypes, 
#rnai_phenotypes_quick.
data_type = 'gene_association'

#Remember - the species variable can have any one of these values:
#b_malayi, c_brenneri, c_briggsae, c_elegans, c_japonica, c_remanei, o_volvulus, p_pacificus, s_ratti, t_muris
species = 'c_elegans'

#The WormBase ID for the current development release (as of June 2021) is WS280.
wormbase_id = 'WS280'

In [20]:
#Generate the link for extracting the data
base = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/ONTOLOGY/'
#Keep the species suffix in its own variable so that re-running this cell does not prepend extra dots
suffix = '.' + species if species else ''

if data_type.endswith('ontology'):
    #Ontology files are species-independent .obo files
    link = base+data_type+'.'+wormbase_id+'.obo'
elif data_type.startswith('rnai'):
    #The rnai_phenotypes files always carry the c_elegans suffix
    link = base+data_type+'.'+wormbase_id+'.wb.c_elegans'
elif data_type == 'disease_association':
    link = base+data_type+'.'+wormbase_id+'.daf.txt'+suffix
else:
    link = base+data_type+'.'+wormbase_id+'.wb'+suffix

In [21]:
wget.download(link)

'gene_association.WS280.wb.c_elegans'
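
Once downloaded, the association file can be summarised with a few lines of Python. The sketch below assumes the `gene_association` file is in the tab-separated GO Annotation File (GAF) layout, with comment lines starting with `!`, the gene identifier in the second column and the GO term ID in the fifth; this layout is an assumption and is worth checking against the header of the file itself. `go_terms_by_gene` is a name introduced here for illustration:

```python
from collections import defaultdict


def go_terms_by_gene(association_path):
    """Map each gene identifier to the set of GO term IDs annotated to it.

    Assumes a GAF-style file: '!' comment lines, tab-separated records,
    gene ID in column 2 and GO ID in column 5.
    """
    terms = defaultdict(set)
    with open(association_path) as handle:
        for line in handle:
            if line.startswith('!') or not line.strip():
                continue  # skip header comments and blank lines
            columns = line.rstrip('\n').split('\t')
            if len(columns) < 5:
                continue  # skip records that do not match the assumed layout
            terms[columns[1]].add(columns[4])
    return dict(terms)
```

For example, `go_terms_by_gene('gene_association.WS280.wb.c_elegans')` would return a dictionary keyed by gene identifier if the assumed layout holds.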

This is the end of the first tutorial for WormBase data! This tutorial dealt with extracting WormBase data from the FTP site programmatically.

In the next tutorial, we will use InterMine to access the WormMine site and retrieve data from WormBase in another way.