# **Bioinformatics with Jupyter Notebooks for WormBase:**
## **Data Retrieval 1 - Exploring WormBase FTP**
Welcome to the first Jupyter notebook in the WormBase tutorial series. Over this series of tutorials, we will write Python code to retrieve data available on the WormBase sites and perform simple analyses with it.

This tutorial deals with the data on the FTP site (ftp://ftp.wormbase.org/pub/wormbase/).
We will both explore the site and extract data of interest. Let's get started!

We start by loading the libraries required for this tutorial. `wget` is a third-party package; if it is missing, install it first (for example with `pip install wget`).

In [None]:
import wget
import gzip
import shutil
import ftplib
import os

We need to connect to the WormBase FTP site. Here, we navigate to the `pub/wormbase/` directory, which is where the data is located.
The cell below opens an FTP connection to the site and displays the files and sub-directories inside the `pub/wormbase/` directory.

In [None]:
ftp = ftplib.FTP('ftp.wormbase.org')
ftp.login()
ftp.cwd('pub/wormbase')
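# retrlines() returns only the server's status string, so collect the listing lines in a list
files = []
ftp.retrlines('LIST', files.append)
print(*files, sep='\n')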
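
`LIST` returns one text line per entry, in a layout chosen by the server. The following is a minimal sketch for splitting such lines into a name and an entry type, assuming the common Unix `ls -l` layout. The helper `parse_list_line` and the sample lines are illustrative and are not taken from the WormBase server.

```python
def parse_list_line(line):
    """Split one Unix-style LIST line into (name, kind).

    Assumes the 'ls -l' layout: permissions, links, owner, group,
    size, month, day, time-or-year, name. Other servers may differ.
    """
    fields = line.split(None, 8)  # at most 9 fields; the name keeps its spaces
    if len(fields) < 9:
        raise ValueError('unexpected LIST line: ' + repr(line))
    name = fields[8]
    kinds = {'d': 'dir', 'l': 'link', '-': 'file'}
    kind = kinds.get(line[0], 'other')
    if kind == 'link':
        # Symlinks are listed as "name -> target"; keep only the name.
        name = name.split(' -> ')[0]
    return name, kind


# Hypothetical sample lines, made up for illustration.
sample = [
    'drwxr-xr-x   5 ftp      ftp          4096 Jan 01  2021 releases',
    '-rw-r--r--   1 ftp      ftp          1234 Jan 01  2021 README',
    'lrwxrwxrwx   1 ftp      ftp            14 Jan 01  2021 current -> releases/WS280',
]
entries = [parse_list_line(line) for line in sample]
print(entries)
```

If only the names are needed, `ftp.nlst()` is simpler, since it returns them directly as a list.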

The README file describes the data held in each directory and subdirectory. Downloading it first makes it easier to locate the data you need.

In [None]:
with open('README', 'wb') as downloaded_file:
  ftp.retrbinary('RETR README', downloaded_file.write)

In [None]:
# Use a context manager so the file is closed after reading
with open('README') as readme:
    print(readme.read())

The `releases` folder contains the core files for the various WormBase releases, along with all subsidiary files. The cell below downloads the current development release in full.

In [None]:
!wget --cut-dirs 4 -r --no-parent ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/
!mv ftp.wormbase.org current_release

The cell above downloads all the files associated with the current development release (around 40 GB in size). Much of this data may not be relevant to your requirements, so the following cells show how to download only the required files from the release.

The following cells deal with accessing and downloading the required sequence, annotation, GFF, and assembly data from the current development release.

To download assembly files, annotation files, etc., set the organism name, the BioProject ID, the WormBase release (for example, WS280), and the file type.
You can also navigate into the different directories and check which files are available to download.

Change the variables based on what you need. Uncomment the lines that list the files in a directory in case you need to see the available files in a directory before assigning your variables.

In [None]:
#print(*ftp.nlst('releases/current-development-release/species'), sep = "\n")
species = 'c_elegans'
#print(*ftp.nlst('releases/current-development-release/species/'+species), sep = "\n")
bioproject = 'PRJNA13758'
#print(*ftp.nlst('releases/current-development-release/species/'+species+'/'+bioproject), sep = "\n")
wormbase_id = 'WS280'
descriptor = 'annotations'
extension = 'gff3'

In [None]:
link = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/species/' + species
link += '/' + bioproject + '/' + species + '.' + bioproject + '.' + wormbase_id + '.' + descriptor 
link += '.' + extension + '.gz'
link
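
The file name embeds the release ID, so a hard-coded value such as `WS280` may stop matching once `current-development-release` points at a newer release. A hedged sketch that reads the ID from a directory listing instead is shown below. `detect_release` is a hypothetical helper, and the sample names are illustrative rather than an actual listing.

```python
import re


def detect_release(names):
    """Return the single WS release ID found in a list of file names."""
    found = {m.group(0) for name in names for m in re.finditer(r'WS\d+', name)}
    if len(found) != 1:
        raise ValueError('expected exactly one release ID, found: ' + str(sorted(found)))
    return found.pop()


# Illustrative names following the pattern used in this tutorial.
names = [
    'c_elegans.PRJNA13758.WS280.annotations.gff3.gz',
    'c_elegans.PRJNA13758.WS280.genomic.fa.gz',
]
print(detect_release(names))
```

With a live connection, the list could come from `ftp.nlst('releases/current-development-release/species/' + species + '/' + bioproject)`, and the result could then be assigned to `wormbase_id`.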

Download the file from the link generated above using the identifiers you provided. You can then decompress the .gz file to get the required file.

In [None]:
wget.download(link)
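
As an alternative to `wget`, the already-open `ftplib` connection can fetch the same file. This is a minimal sketch, assuming `ftp` is a logged-in `ftplib.FTP` object whose working directory makes `remote_path` resolvable. `download_via_ftp` is a hypothetical helper name.

```python
import os


def download_via_ftp(ftp, remote_path, local_dir='.'):
    """Fetch remote_path over an open ftplib.FTP connection; return the local path."""
    local_path = os.path.join(local_dir, os.path.basename(remote_path))
    with open(local_path, 'wb') as out:
        # retrbinary() calls out.write once per received block of bytes
        ftp.retrbinary('RETR ' + remote_path, out.write)
    return local_path


# Usage, assuming `ftp` is the connection opened earlier (cwd is pub/wormbase):
# download_via_ftp(ftp, 'README')
```

This avoids a second connection and a third-party dependency, at the cost of having to pass a server path rather than a full URL.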

In [None]:
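compressed_file = os.path.basename(link)           # name of the downloaded file, ending in '.gz'
decompressed_file = compressed_file[:-len('.gz')]  # same name without the '.gz' suffix

# Stream the decompressed bytes to disk without loading the whole file into memory
with gzip.open(compressed_file, 'rb') as f_in, open(decompressed_file, 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)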
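
Decompressing to disk is not required just to inspect a file, because `gzip.open` can read the compressed file as text. A small sketch that returns the first few data lines follows. `preview_gzip` is a hypothetical helper, and skipping lines that start with `#` is an assumption that suits GFF-style files.

```python
import gzip
from itertools import islice


def preview_gzip(path, n=5, comment='#'):
    """Return the first n non-comment lines of a gzipped text file."""
    with gzip.open(path, 'rt') as handle:
        data_lines = (line.rstrip('\n') for line in handle
                      if not line.startswith(comment))
        # islice stops reading after n lines, so large files are not read in full
        return list(islice(data_lines, n))


# Usage, assuming the annotations file from the cells above has been downloaded:
# preview_gzip('c_elegans.PRJNA13758.WS280.annotations.gff3.gz')
```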

Here we can get the sequence files for a species - genomic, protein and transcript.

Again - change the variables based on what you need. Uncomment the lines that list the files in a directory in case you need to see the available files in a directory before assigning your variables.

In [None]:
#print(*ftp.nlst('species'), sep = "\n")
species = 'c_elegans'
data_type = 'sequence/protein'
#print(*ftp.nlst('species/'+species+'/'+data_type), sep = "\n")
bioproject = 'PRJNA13758'
wormbase_id = 'WS280'
# Descriptor can be - genomic.fa, genomic_masked.fa, genomic_softmasked.fa 
# OR protein.fa, wormpep_package.tar
# OR CDS_transcripts.fa, mRNA_transcripts.fa, cds_transcripts.fa, coding_transcripts.fa, ncrna_transcripts.fa, transposon_transcripts.fa, pseudogenic_transcripts.fa, ncRNA_transcripts.fa
descriptor = 'wormpep_package.tar'

In [None]:
if bioproject != '':
    bioproject = '.' + bioproject
link = 'ftp://ftp.wormbase.org/pub/wormbase/species/' + species + '/' + data_type + '/'
link += species + bioproject + '.' + wormbase_id + '.' + descriptor +'.gz'
link
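
The same naming pattern recurs for the files under `species/`, so it can be wrapped in one function. The sketch below is one possible way to do that. `species_file_url` is a hypothetical helper that leaves out the BioProject segment when it is empty and does not modify its inputs, so it gives the same result however many times it is run.

```python
def species_file_url(species, data_type, wormbase_id, descriptor, bioproject=''):
    """Build the URL of a gzipped file under pub/wormbase/species/."""
    base = 'ftp://ftp.wormbase.org/pub/wormbase/species'
    # Empty parts (such as a blank bioproject) are dropped before joining with '.'
    parts = [species, bioproject, wormbase_id, descriptor, 'gz']
    file_name = '.'.join(part for part in parts if part)
    return '/'.join([base, species, data_type, file_name])


with_project = species_file_url('c_elegans', 'sequence/protein', 'WS280',
                                'wormpep_package.tar', bioproject='PRJNA13758')
without_project = species_file_url('c_elegans', 'gff', 'WS280', 'annotations.gff2')
print(with_project)
print(without_project)
```

For files whose name includes an extra extension, such as `gff2` or `agp`, the extension is passed as part of `descriptor`.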

In [None]:
wget.download(link)

Here we can get the gff files for a species.

Again - change the variables based on what you need. Uncomment the lines that list the files in a directory in case you need to see the available files in a directory before assigning your variables.

In [None]:
#print(*ftp.nlst('species'), sep = "\n")
species = 'c_elegans'
data_type = 'gff'
#print(*ftp.nlst('species/'+species+'/'+data_type), sep = "\n")
bioproject = ''
wormbase_id = 'WS280'
descriptor = 'annotations' #annotations or protein_annotation
file_extension='gff2' #gff2 or gff3

In [None]:
if bioproject != '':
    bioproject = '.' + bioproject
link = 'ftp://ftp.wormbase.org/pub/wormbase/species/' + species + '/' + data_type + '/'
link += species + bioproject + '.' + wormbase_id + '.' + descriptor + '.' + file_extension + '.gz'
link

In [None]:
wget.download(link)

Here we can get the assembly files for a species.

Again - change the variables based on what you need. Uncomment the lines that list the files in a directory in case you need to see the available files in a directory before assigning your variables.

In [None]:
#print(*ftp.nlst('species'), sep = "\n")
species = 'c_elegans'
data_type = 'assemblies'
#print(*ftp.nlst('species/'+species+'/'+data_type), sep = "\n")
bioproject = 'PRJNA13758'
wormbase_id = 'current_development'
descriptor = 'assembly'

In [None]:
if bioproject != '':
    bioproject = '.' + bioproject
link = 'ftp://ftp.wormbase.org/pub/wormbase/species/' + species + '/'
link += data_type+ '/' + species + bioproject + '.' + wormbase_id + '.' + descriptor + '.agp.gz'
link

In [None]:
wget.download(link)

Here we can get the annotation files for a species.

Again - change the variables based on what you need. Uncomment the lines that list the files in a directory in case you need to see the available files in a directory before assigning your variables.

In [None]:
#Getting annotation files 
#print(*ftp.nlst('species'), sep = "\n")
species = 'c_elegans'
data_type = 'annotation'
print(*ftp.nlst('species/' + species + '/' + data_type), sep = "\n")

In [None]:
type_of_annot = 'geneIDs'
print(*ftp.nlst('species/'+species+'/'+data_type+'/'+type_of_annot), sep = "\n")
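
When a directory holds many files, the listing can be filtered instead of being read by eye. Below is a minimal sketch using shell-style patterns from the standard `fnmatch` module. `pick_files` is a hypothetical helper, and the entries marked as hypothetical are invented for illustration.

```python
from fnmatch import fnmatchcase
import posixpath


def pick_files(paths, pattern):
    """Keep the paths whose file name matches a shell-style pattern."""
    return [p for p in paths if fnmatchcase(posixpath.basename(p), pattern)]


listing = [
    'species/c_elegans/annotation/geneIDs/c_elegans.PRJNA13758.current.geneIDs.txt.gz',
    # The next two entries are hypothetical.
    'species/c_elegans/annotation/geneIDs/c_elegans.PRJNA13758.WS280.geneIDs.txt.gz',
    'species/c_elegans/annotation/geneIDs/c_elegans.PRJNA275000.current.geneIDs.txt.gz',
]
matches = pick_files(listing, '*PRJNA13758.current*')
print(*matches, sep='\n')
```

With a live connection, `listing` would be the result of the `ftp.nlst(...)` call.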

In [None]:
#Copy required file from above list to the variable below
download_file = 'species/c_elegans/annotation/geneIDs/c_elegans.PRJNA13758.current.geneIDs.txt.gz'
link = 'ftp://ftp.wormbase.org/pub/wormbase/'+download_file
link

In [None]:
wget.download(link)

The following cells deal with accessing and downloading the required ontology data from the current development release.

Change the variables based on what you need. Uncomment the lines that list the files in a directory in case you need to see the available files in a directory before assigning your variables.

In [None]:
#Types of data: anatomy_association, anatomy_ontology, development_association, development_ontology,
#disease_association, disease_ontology, gene_association, gene_ontology, phenotype_association, phenotype_ontology,
#rnai_phenotypes, rnai_phenotypes_quick.

data_type = 'gene_association'
species = 'c_elegans'
wormbase_id = 'WS280'

In [None]:
base = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release/ONTOLOGY/'
# Keep the suffix in its own variable so that re-running this cell does not add a second '.' to species
species_suffix = '.' + species if species != '' else ''

if data_type.endswith('ontology'):
    link = base + data_type + '.' + wormbase_id + '.obo'
elif data_type.startswith('rnai'):  # rnai_phenotypes, rnai_phenotypes_quick
    link = base + data_type + '.' + wormbase_id + '.wb.c_elegans'
elif data_type == 'disease_association':
    link = base + data_type + '.' + wormbase_id + '.daf.txt' + species_suffix
else:
    link = base + data_type + '.' + wormbase_id + '.wb' + species_suffix

In [None]:
wget.download(link)

This is the end of the first tutorial for WormBase data! This tutorial dealt with extracting WormBase data from the FTP site easily and programmatically.

In the next tutorial, we will use InterMine to access the WormMine site and retrieve data from WormBase in another way.