# **Bioinformatics with Jupyter Notebooks for WormBase:**
## **Data Retrieval 1 - Exploring WormBase FTP**
Welcome to the first Jupyter notebook in the WormBase tutorial series. Over this series of tutorials, we will write Python code to retrieve data available on the WormBase sites and perform simple analyses with it.

This tutorial deals with the data on the FTP site (ftp://ftp.wormbase.org/pub/wormbase/).
We will both explore the site and extract data of interest. Let's get started!

We start by installing and loading the libraries that are required for this tutorial.

In [None]:
#Run this cell if you do not have the wget package installed already.
import sys
!{sys.executable} -m pip install wget

In [None]:
import wget
import gzip
import shutil
import ftplib

### Connecting to the WormBase FTP site

We need to connect to the WormBase FTP site. Here, we navigate to the `pub/wormbase/` directory, which is where the data is located.
The cell below opens an FTP connection to the site and displays the files and sub-directories inside the `pub/wormbase/` directory.

In [None]:
ftp = ftplib.FTP('ftp.wormbase.org')
ftp.login()
ftp.cwd('pub/wormbase')
files = []
ftp.dir(files.append)
print(*files, sep = "\n")
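
The listing printed above mixes directories, symbolic links, and regular files. As a minimal sketch, assuming the server returns Unix-style `ls -l` lines (first character `d` for a directory, `l` for a link) and that names contain no spaces, the hypothetical helper `split_listing` below sorts the entries by kind.

```python
def split_listing(lines):
    """Split Unix-style FTP listing lines into directories, links and files.

    Assumption: each line resembles `ls -l` output, so the first character
    gives the entry type and the name is the last whitespace-separated field.
    """
    directories, links, regular_files = [], [], []
    for line in lines:
        # For symbolic links, drop the " -> target" part before taking the name
        fields = line.split(' -> ')[0].split()
        if not fields:
            continue
        name = fields[-1]
        if line.startswith('d'):
            directories.append(name)
        elif line.startswith('l'):
            links.append(name)
        else:
            regular_files.append(name)
    return directories, links, regular_files
```

In this notebook, `split_listing(files)` would take the `files` list filled by `ftp.dir(files.append)`.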

The README file describes the data contained in all the directories and subdirectories. Reading it first makes it easier to find the data you need.

In [None]:
with open('README', 'wb') as downloaded_file:
  ftp.retrbinary('RETR README', downloaded_file.write)

Display the contents of the README file for easy understanding of the organisation of the data on the FTP site.

In [None]:
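#Open the README with a context manager so that the file is closed after reading
with open('README') as readme_file:
    contents = readme_file.read()
print(contents)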

### Downloading the entire current release

The releases folder contains the core files for the various releases of WormBase and all subsidiary files. We can easily access the current version and download the required data. 

#### _This is a very large download (~40 GB). Uncomment this code and run this cell only if you need the entire current release and have enough disk space._




In [None]:
#!wget --cut-dirs 4 -r --no-parent ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/
#!mv ftp.wormbase.org current_release

The cell above downloads all the files associated with the current development release (around 40 GB in size). A good part of this data might not be relevant to you, so in the following cells we will download only parts of it as an example.
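
Before starting a download of this size, it can help to check the free disk space first. The sketch below uses `shutil.disk_usage` from the standard library; the helper name `has_enough_space` and the 40 GB constant (taken from the estimate above) are illustrative assumptions, not part of the WormBase tooling.

```python
import shutil

# Approximate size of the full release, based on the ~40 GB estimate above
REQUIRED_BYTES = 40 * 1024**3

def has_enough_space(path, required_bytes):
    """Return True if the filesystem containing `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes
```

For example, `has_enough_space('.', REQUIRED_BYTES)` checks the current working directory.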

### Downloading sequence, annotation, GFF and assembly data files belonging to current development release

The following cells deal with accessing and downloading the required sequence, annotation, GFF, and assembly data from the current development release.

To download assembly files, annotation files, etc., assign the organism name, the BioProject ID, the WormBase version (WS280 as of June 2021), and the file type.

Change the variables based on what you need. If you want to see which files are available before assigning your variables, you can list the contents of each directory first, as the cells below do.

In [None]:
#List the different species for which we have available data:
print(*ftp.nlst('releases/current-development-release/species'), sep = "\n")


#Assign species based on your requirement
species = 'c_elegans'

In [None]:
#List the different bioproject IDs for which we have available data:
print(*ftp.nlst('releases/current-development-release/species/' + species), sep = "\n")
 

#Assign bioproject ID based on your requirement
bioproject = 'PRJNA275000'

In [None]:
#Now list all available files for the species and bioproject ID chosen:

print(*ftp.nlst('releases/current-development-release/species/'+species+'/'+bioproject), sep = "\n")

In [None]:
#The WormBase ID for the current development release (as of June 2021) is WS280.
wormbase_id = 'WS280'

#Change the descriptor and extension based on your needs, following the file name format listed above. Do not include .gz in the extension.
descriptor = 'annotations'
extension = 'gff3'
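
To confirm that a descriptor and extension combination actually exists before building a download link, we can filter the directory listing. This is a small sketch: `find_files` is a hypothetical helper, and it assumes the names follow the `species.bioproject.version.descriptor.extension.gz` pattern used in this tutorial.

```python
import posixpath

def find_files(listing, descriptor, extension):
    """Return the base names in `listing` that end in .descriptor.extension.gz.

    `listing` is a list of remote paths or names, such as the output of
    ftp.nlst(); some servers return full paths, so only the base name is kept.
    """
    suffix = '.' + descriptor + '.' + extension + '.gz'
    return [posixpath.basename(path) for path in listing if path.endswith(suffix)]
```

An empty result suggests that the chosen descriptor or extension is not available for that species and bioproject.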

In [None]:
#Generate the link for your download based on your requirements as assigned
link = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/'+species+'/'+bioproject+'/'+species+'.'+bioproject+'.'+wormbase_id+'.'+descriptor+'.'+extension+'.gz'
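
The same link can be assembled by a small reusable function, which avoids repeating the long string concatenation for every download. This is only a sketch: the function name `build_species_url` and the constant `BASE_URL` are our own, and the URL layout is the one used in the cell above.

```python
BASE_URL = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release'

def build_species_url(species, bioproject, wormbase_id, descriptor, extension):
    """Build the FTP URL of a gzipped species file in the current development release."""
    filename = f'{species}.{bioproject}.{wormbase_id}.{descriptor}.{extension}.gz'
    return f'{BASE_URL}/species/{species}/{bioproject}/{filename}'
```

Calling `build_species_url(species, bioproject, wormbase_id, descriptor, extension)` with the variables assigned earlier should give the same value as `link`.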

Download the file from the link generated above from the identifiers you provided. Then decompress the .gz file to get the required file.

In [None]:
wget.download(link)
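
If you would rather not install the third-party `wget` package, the standard library can fetch the same URL. The sketch below is an alternative, not what this notebook uses: `download` is a hypothetical helper built on `urllib.request.urlopen`, which handles `ftp://` URLs.

```python
import os
import posixpath
import shutil
import urllib.parse
import urllib.request

def download(url, dest_dir='.'):
    """Download `url` into `dest_dir` and return the path of the saved file."""
    # Use the last component of the URL path as the local file name
    filename = posixpath.basename(urllib.parse.urlparse(url).path)
    destination = os.path.join(dest_dir, filename)
    # Stream the response to disk instead of loading it all into memory
    with urllib.request.urlopen(url) as response, open(destination, 'wb') as out:
        shutil.copyfileobj(response, out)
    return destination
```

With this helper, `download(link)` would save the file in the current directory, much like `wget.download(link)`.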

In [None]:
downloaded_file = species + '.' + bioproject + '.' + wormbase_id + '.' + descriptor + '.' + extension
with gzip.open(downloaded_file + '.gz', 'rb') as f_in:
    with open(downloaded_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
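
Since the same decompression step is needed for every download, it can be wrapped in a function. A minimal sketch, where `gunzip` is a hypothetical helper name and the output file is assumed to sit next to the compressed one:

```python
import gzip
import shutil

def gunzip(gz_path):
    """Decompress `gz_path` next to the original and return the new path."""
    if not gz_path.endswith('.gz'):
        raise ValueError('expected a path ending in .gz: ' + gz_path)
    out_path = gz_path[:-3]  # strip the trailing ".gz"
    with gzip.open(gz_path, 'rb') as f_in, open(out_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    return out_path
```

Here, `gunzip(downloaded_file + '.gz')` would produce `downloaded_file`.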

#### Downloading the reference genome for Caenorhabditis elegans

In [None]:
#Generate the link for the reference genome

species = 'c_elegans'
bioproject = 'PRJNA13758'
wormbase_id = 'WS280'
descriptor = 'genomic'
extension = 'fa'
link = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/' + species \
+ '/' + bioproject + '/' + species + '.' + bioproject + '.' + wormbase_id + '.' + descriptor + '.' + \
extension + '.gz'

In [None]:
#Download the reference genome
wget.download(link)
downloaded_file = species + '.' + bioproject + '.' + wormbase_id + '.' + descriptor + '.' + extension
with gzip.open(downloaded_file + '.gz', 'rb') as f_in:
    with open(downloaded_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
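
As a quick sanity check on the download, we can list the sequence names in the FASTA file directly from the compressed copy. This is a sketch under the assumption that the file is a standard FASTA file in which each record starts with a `>` header line; `fasta_headers` is a hypothetical helper.

```python
import gzip

def fasta_headers(gz_path):
    """Return the header lines (without the leading '>') of a gzipped FASTA file."""
    headers = []
    # 'rt' opens the compressed file in text mode, so lines are str rather than bytes
    with gzip.open(gz_path, 'rt') as handle:
        for line in handle:
            if line.startswith('>'):
                headers.append(line[1:].strip())
    return headers
```

For the reference genome, `fasta_headers(downloaded_file + '.gz')` should return one entry per sequence in the assembly.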

### Downloading ontology data files belonging to current development release

The following cells deal with accessing and downloading the required ontology data from the current development release.

Change the variables based on what you need. To see which ontology files are available before assigning your variables, list the contents of the directory first.

In [None]:
#List the available ontology data in the current development release:
print(*ftp.nlst('releases/current-development-release/ONTOLOGY'), sep = "\n")

Assign the `data_type` and `species` variables based on your requirements and on availability (see the output of the previous cell).

Remember - the `data_type` variable can have any one of these values:

`anatomy_association`

`anatomy_ontology`

`development_association`

`development_ontology`

`disease_association`

`disease_ontology`

`gene_association`

`gene_ontology`

`phenotype_association`

`phenotype_ontology`

`rnai_phenotypes`

`rnai_phenotypes_quick`

In [None]:
data_type = 'gene_association'

Remember - the species variable can have any one of these values:

b_malayi

c_brenneri

c_briggsae

c_elegans

c_japonica

c_remanei

o_volvulus

p_pacificus

s_ratti

t_muris


In [4]:
species = 'c_elegans'

The WormBase ID for the current development release (as of June 2021) is WS280.

In [None]:
wormbase_id = 'WS280'

In [None]:
#Generate the link for extracting the data
base = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/ONTOLOGY/'

#Keep the suffix in its own variable so that re-running this cell does not change species
species_suffix = '.' + species if species != '' else ''

if data_type.endswith('ontology'):
    #Ontology files are in OBO format and are not species-specific
    link = base + data_type + '.' + wormbase_id + '.obo'
elif data_type.startswith('rnai'):
    #RNAi phenotype files use a fixed c_elegans suffix
    link = base + data_type + '.' + wormbase_id + '.wb.c_elegans'
elif data_type == 'disease_association':
    link = base + data_type + '.' + wormbase_id + '.daf.txt' + species_suffix
else:
    link = base + data_type + '.' + wormbase_id + '.wb' + species_suffix

In [None]:
wget.download(link)
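
The FTP connection opened at the start of the notebook stays open until it is closed explicitly or the server drops it. A small sketch for closing it cleanly once all listings are done; `close_connection` is a hypothetical helper, while `quit()`, `close()` and `ftplib.all_errors` come from the standard `ftplib` module.

```python
import ftplib

def close_connection(ftp):
    """Close an FTP session politely, falling back to a plain close on error."""
    try:
        # Sends QUIT to the server and closes the connection
        ftp.quit()
    except ftplib.all_errors:
        # The server may already have dropped an idle connection
        ftp.close()
```

In this notebook, `close_connection(ftp)` would be the last cell to run.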

This is the end of the first tutorial for WormBase data! This tutorial dealt with extracting WormBase data from the FTP site easily and programmatically.

In the next tutorial, we will use InterMine to access the WormMine site and retrieve data from WormBase in another way.