# **Bioinformatics with Jupyter Notebooks for WormBase:**
## **Data Retrieval 1 - Exploring WormBase FTP**
Welcome to the first Jupyter notebook in the WormBase tutorial series. Over this series of tutorials, we will write Python code that retrieves data available on the WormBase sites and performs simple analyses with it.

This tutorial will deal with the data on the FTP site.
We will both explore the site and extract data of interest. Let's get started!

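We start by loading the libraries required for this tutorial. `wget` is a third-party package and may need to be installed first (for example with `pip install wget`); `gzip`, `shutil` and `ftplib` are part of the Python standard library.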

In [None]:
import wget
import gzip
import shutil
import ftplib

### Connecting to the WormBase FTP site

We need to connect to the WormBase FTP site. Here, we navigate to the `pub/wormbase/` directory, which is where the data is located.
The cell below opens an FTP connection to the site and displays the files and sub-directories inside the `pub/wormbase/` directory.

In [None]:
ftp = ftplib.FTP('ftp.wormbase.org')
ftp.login()
ftp.cwd('pub/wormbase')

files = []
ftp.dir(files.append)
print(*files, sep = "\n")
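
The lines collected by `ftp.dir` are plain text, so telling directories from files means reading the first character of each line. The sketch below shows one way to do that, assuming the server returns Unix-style listings (nine whitespace-separated fields, with the name last). The helper name `classify_listing` and the sample lines are made up for illustration and are not real output from the WormBase server.

```python
# Hypothetical helper: classify Unix-style LIST lines by their first character.
KINDS = {'d': 'directory', 'l': 'link', '-': 'file'}

def classify_listing(lines):
    """Map each entry name to 'directory', 'link' or 'file'."""
    entries = {}
    for line in lines:
        # Assumed layout: permissions, links, owner, group, size,
        # month, day, year-or-time, name.
        fields = line.split(maxsplit=8)
        if len(fields) < 9 or fields[0][0] not in KINDS:
            continue  # skip lines that do not match the assumed layout
        # Symbolic links are shown as "name -> target"; keep only the name.
        name = fields[8].split(' -> ')[0]
        entries[name] = KINDS[fields[0][0]]
    return entries

# Made-up sample lines in the assumed format.
sample_lines = [
    'drwxr-xr-x   5 ftp ftp     4096 Aug 10  2021 example_dir',
    '-rw-r--r--   1 ftp ftp     2048 Aug 10  2021 example.txt',
    'lrwxrwxrwx   1 ftp ftp       11 Aug 10  2021 example_link -> example_dir',
]
for name, kind in classify_listing(sample_lines).items():
    print(kind, name, sep='\t')
```

In the notebook, the same function could be applied to the `files` list filled in the cell above.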

The README file describes the data contained in all the directories and subdirectories. Downloading and reading it first makes it easier to find the data you need.

In [None]:
with open('README', 'wb') as downloaded_file:
    ftp.retrbinary('RETR README', downloaded_file.write)

Display the contents of the README file for easy understanding of the organisation of the data on the FTP site.

In [None]:
with open('README') as f:
    lines = f.read()
    print(lines)

### Downloading the entire current release

The releases folder contains the core files for the various releases of WormBase and all subsidiary files. We can easily access the current version and download the required data. 

This is a huge download (~40 GB)! Uncomment the code and run the cell only if you need the entire current release and have enough disk space.

In [None]:
#!wget --cut-dirs 4 -r --no-parent ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/
#!mv ftp.wormbase.org current_release

The cell above downloads all the files associated with the current development release (around 40 GB in total). Much of that data may not be relevant to the user's requirements, so the following cells show how to download only the required files from the release.
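
One way to stay selective is to list a directory and keep only the names that match a pattern. The sketch below filters a list of paths with shell-style wildcards from the standard `fnmatch` module. The helper name `select_names` is hypothetical, and the example paths are illustrative names built from the file-naming pattern used later in this notebook, not a verified listing.

```python
import fnmatch
import posixpath

def select_names(names, pattern):
    """Keep the paths whose file name matches a shell-style pattern."""
    # A listing may return bare names or full paths, so match on the basename only.
    return [name for name in names
            if fnmatch.fnmatchcase(posixpath.basename(name), pattern)]

# Illustrative paths only; in the notebook these would come from ftp.nlst(...).
example_names = [
    'species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS282.genomic.fa.gz',
    'species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WS282.annotations.gff3.gz',
]
print(*select_names(example_names, '*.gff3.gz'), sep='\n')
```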

### Downloading sequence, annotation, GFF and assembly data files belonging to the current development release

The following cells access and download the required sequence, annotation, GFF and assembly data from the current development release.

To download assembly files, annotation files, etc., assign the organism name, the BioProject ID, the WormBase version (for example, WS282) and the file type.
You can also navigate into the different directories and check which files are available to download.

Change the variables to suit your needs. If you are unsure which files exist, list the contents of a directory before assigning your variables.

We first list the different species for which we have data available. Then we assign the species variable based on our requirement.

In [None]:
print(*ftp.nlst('releases/current-development-release/species'), sep = "\n")
species = 'c_elegans'

We then list the different bioproject IDs for which we have data available. Then we assign the bioproject variable based on our requirement.

In [None]:
print(*ftp.nlst('releases/current-development-release/species/' + species), sep = "\n")
bioproject = 'PRJNA275000'

We now list all the available files for the specified species and bioproject ID values.

In [None]:
print(*ftp.nlst('releases/current-development-release/species/' + species + '/' + bioproject), sep = "\n")

We can now assign values for WormBase ID, and the descriptor and extension of the file we are looking to download. Keep in mind the format of the file while assigning the extension and remember to NOT include '.gz'. Once these variables have been assigned the desired values, we can generate the link for our download!

The WormBase ID for the current development release (as of August 2021) is WS282.

In [None]:
wormbase_id = 'WS282'
descriptor = 'annotations'
extension = 'gff3'

In [None]:
link = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/' + species + \
       '/' + bioproject + '/' + species + '.' + bioproject + '.' + wormbase_id + '.'+ descriptor + '.' \
       + extension + '.gz'

In [None]:
link
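
Since the same concatenation is needed for every species file, it could be wrapped in a small function. The sketch below reproduces the pattern from the cell above with f-strings; the names `BASE_URL` and `build_species_link` are hypothetical and are not part of any WormBase library.

```python
# Hypothetical helper that mirrors the string concatenation used above.
BASE_URL = ('ftp://ftp.wormbase.org/pub/wormbase/releases/'
            'current-development-release/species')

def build_species_link(species, bioproject, wormbase_id, descriptor, extension):
    """Return the FTP link of a gzipped species file."""
    # File name pattern: <species>.<bioproject>.<release>.<descriptor>.<extension>.gz
    file_name = f'{species}.{bioproject}.{wormbase_id}.{descriptor}.{extension}.gz'
    return f'{BASE_URL}/{species}/{bioproject}/{file_name}'

print(build_species_link('c_elegans', 'PRJNA275000', 'WS282', 'annotations', 'gff3'))
```

The function only builds a string; whether the file exists on the server still has to be checked against the directory listing.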

Download the file from the link generated above from the identifiers you provided. Then unzip the .gz file to get the required file!

In [None]:
wget.download(link)

In [None]:
downloaded_file = species + '.' + bioproject + '.' + wormbase_id + '.' + descriptor + '.' + extension

with gzip.open(downloaded_file + '.gz', 'rb') as f_in:
    with open(downloaded_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
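
The same decompression steps are repeated later in this notebook, so they could be collected in a helper. This is a minimal sketch, assuming the input path ends in `.gz` and that the decompressed file should be written next to it; the function name `gunzip` is hypothetical.

```python
import gzip
import shutil

def gunzip(gz_path):
    """Decompress gz_path next to itself and return the path of the new file."""
    if not gz_path.endswith('.gz'):
        raise ValueError('expected a path ending in .gz: ' + gz_path)
    out_path = gz_path[:-3]  # drop the '.gz' suffix
    # copyfileobj streams the data in chunks, so large files are not loaded into memory.
    with gzip.open(gz_path, 'rb') as f_in, open(out_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    return out_path
```

With this helper, the cell above would reduce to `gunzip(downloaded_file + '.gz')`.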

#### Downloading the reference genome for Caenorhabditis elegans

We first generate the link for the reference genome by assigning the required values to the variables and then we download the reference genome!

In [None]:
species = 'c_elegans'
bioproject = 'PRJNA13758'
wormbase_id = 'WS282'
descriptor = 'genomic'
extension = 'fa'

link = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/' + species + \
       '/' + bioproject + '/' + species + '.' + bioproject + '.' + wormbase_id + '.' + descriptor + '.' \
       + extension + '.gz'

In [None]:
wget.download(link)

downloaded_file = species + '.' + bioproject + '.' + wormbase_id + '.' + descriptor + '.' + extension

with gzip.open(downloaded_file + '.gz', 'rb') as f_in:
    with open(downloaded_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
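
A quick sanity check of the download does not require the decompressed copy, because `gzip.open` can read the compressed file directly in text mode. The sketch below collects the first few sequence names, assuming a standard FASTA file in which every sequence starts with a `>` header line. The function name `fasta_headers` and its `limit` parameter are hypothetical.

```python
import gzip

def fasta_headers(gz_path, limit=10):
    """Return up to `limit` header lines (without '>') from a gzipped FASTA file."""
    headers = []
    # 'rt' opens the compressed file as text, one decoded line at a time.
    with gzip.open(gz_path, 'rt') as handle:
        for line in handle:
            if line.startswith('>'):
                headers.append(line[1:].strip())
                if len(headers) >= limit:
                    break  # stop early instead of scanning the whole genome
    return headers
```

In the notebook this could be called as `fasta_headers(downloaded_file + '.gz')`.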

### Downloading ontology data files belonging to current development release

The following cells deal with accessing and downloading the required ontology data from the current development release.

Change the variables to suit your needs. If you are unsure which files exist, list the contents of a directory before assigning your variables.

We first list the available ontology data in the current development release.

In [None]:
print(*ftp.nlst('releases/current-development-release/ONTOLOGY'), sep = "\n")

Based on the availability (from the previous output), assign the data type and species.

The data type variable can take these values: anatomy_association, anatomy_ontology, development_association, development_ontology, disease_association, disease_ontology, gene_association, gene_ontology, phenotype_association, phenotype_ontology, rnai_phenotypes, and rnai_phenotypes_quick.

The species variable can take these values: b_malayi, c_brenneri, c_briggsae, c_elegans, c_japonica, c_remanei, o_volvulus, p_pacificus, s_ratti, and t_muris.

The WormBase ID for the current development release (as of August 2021) is WS282.

Once these variables have been assigned the desired values, we can generate the link for our download!

In [None]:
data_type = 'gene_association'
species = 'c_elegans'
wormbase_id = 'WS282'
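
A typo in either variable only shows up later as a failed download, so it may help to check the values against the lists given above first. This is a minimal sketch; the names `DATA_TYPES`, `SPECIES` and `check_ontology_choice` are hypothetical, and the allowed values are copied from the text above rather than read from the server.

```python
# Allowed values copied from the lists in this notebook (not fetched from the server).
DATA_TYPES = {
    'anatomy_association', 'anatomy_ontology',
    'development_association', 'development_ontology',
    'disease_association', 'disease_ontology',
    'gene_association', 'gene_ontology',
    'phenotype_association', 'phenotype_ontology',
    'rnai_phenotypes', 'rnai_phenotypes_quick',
}
SPECIES = {
    'b_malayi', 'c_brenneri', 'c_briggsae', 'c_elegans', 'c_japonica',
    'c_remanei', 'o_volvulus', 'p_pacificus', 's_ratti', 't_muris',
}

def check_ontology_choice(data_type, species):
    """Return a list of problems; an empty list means both values look valid."""
    problems = []
    if data_type not in DATA_TYPES:
        problems.append('unknown data type: ' + repr(data_type))
    # An empty species string is accepted, as in the link-building cell.
    if species != '' and species not in SPECIES:
        problems.append('unknown species: ' + repr(species))
    return problems

print(check_ontology_choice('gene_association', 'c_elegans'))
```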

In [None]:
# Build the suffix in a separate variable so that re-running this cell
# does not keep prepending '.' to `species`.
species_suffix = '.' + species if species != '' else ''

base = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/ONTOLOGY/'

if data_type.endswith('ontology'):
    # Ontology definitions are OBO files without a species suffix.
    link = base + data_type + '.' + wormbase_id + '.obo'

elif data_type.startswith('rnai'):
    # RNAi phenotype links always use the c_elegans suffix.
    link = base + data_type + '.' + wormbase_id + '.wb.c_elegans'

elif data_type == 'disease_association':
    # Disease associations use the '.daf.txt' infix before the species suffix.
    link = base + data_type + '.' + wormbase_id + '.daf.txt' + species_suffix

else:
    # All other association files use the '.wb' infix.
    link = base + data_type + '.' + wormbase_id + '.wb' + species_suffix

In [None]:
wget.download(link)

This is the end of the first tutorial for WormBase data! This tutorial dealt with extracting WormBase data from the FTP site easily and programmatically.

In the next tutorial, we will use InterMine to access the WormMine site and retrieve data from WormBase in another way.

Acknowledgements:
- ftp://ftp.wormbase.org/pub/wormbase/