<a href="https://colab.research.google.com/github/cjw85/ClaraGenomicsAnalysis/blob/master/Curating_Read_Until_input_files_for_MinKNOW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Curating Read Until input files for MinKNOW</h1>

The following short workflow prepares and downloads the files needed to perform a Read Until sequencing experiment that selects for reads spanning genes, transcripts, exons, and other features stored in [Ensembl](https://www.ensembl.org/).

To prepare the files, execute the code cells below in sequence by pressing the `Play` button to the left of each cell.

In [None]:
#@markdown *Code installation*
print(" * Installing software")
!pip install pyranges > /dev/null
from ftplib import FTP
import os
import time
from google.colab import files
import ipywidgets as widgets
import pandas as pd
import pyranges as pr
import requests


class EnsemblRestClient(object):
    def __init__(self, server='http://rest.ensembl.org'):
        # Use the server passed by the caller rather than a hard-coded value.
        self.server = server
        self.ftp = 'ftp.ensembl.org'
        self.ftp_dna_path = '/pub/release-100/fasta/{}/dna/'
        self.ftp_dna_suff = {
            'primary':'dna.primary_assembly.fa.gz',
            'toplevel':'dna.toplevel.fa.gz'}
        self.ftp_gtf_path = '/pub/release-100/gtf/{}/'
        self.ftp_gtf_suff = {
            'gtf':"100.gtf.gz"}

        # The path templates already end in "/", so no extra separator is needed.
        self.dna_template = (
            "ftp://" + self.ftp + self.ftp_dna_path +
            "{}.{}." + self.ftp_dna_suff['toplevel'])
        self.gtf_template = (
            "ftp://" + self.ftp + self.ftp_gtf_path +
            "{}.{}." + self.ftp_gtf_suff['gtf'])

    def get(self, endpoint, params=None, **kwargs):
        # Passing `json=` makes requests send a JSON Content-Type header.
        if 'json' not in kwargs:
            kwargs['json'] = params if params is not None else dict()
        data = dict()
        try:
            response = requests.get(self.server + endpoint, **kwargs)
            # If rate limited, wait for the advertised period and retry once.
            if response.status_code == 429 and 'Retry-After' in response.headers:
                retry = response.headers['Retry-After']
                time.sleep(float(retry))
                response = requests.get(self.server + endpoint, **kwargs)
            data = response.json()
        except Exception as e:
            print(' - Request failed for {0}'.format(endpoint))
            print('   {}'.format(e))
        else:
            if "error" in data:
                print(" - ERROR:\n   {}".format(data["error"]))
        return data

    def species_list(self):
        return self.get("/info/species")

    def assembly_name(self, species):
        #assembly = client.get('/info/assembly/{}'.format(species))
        #if 'assembly_name' in assembly:
        #    return assembly['assembly_name']
        paths = self._ftp_list(
            self.ftp_dna_path.format(species), self.ftp_dna_suff)
        return paths['toplevel'].split('.')[1]

    def _ftp_list(self, path, filt):
        ftpdata = dict()
        with FTP(self.ftp) as ftp:
            ftp.login()
            def grab(x):
                fname = x.split()[-1]
                for key, value in filt.items():
                    if fname.endswith(value):
                        ftpdata[key] = fname
            ftp.dir(path, grab)
        return ftpdata

    def dna_url(self, species, toplevel=True, assembly_name=None):
        # Look up the assembly on this instance, not the global `client`.
        if assembly_name is None:
            assembly_name = self.assembly_name(species)
        return self.dna_template.format(
            species, species.capitalize(), assembly_name)

    def gtf_url(self, species, assembly_name=None):
        if assembly_name is None:
            assembly_name = self.assembly_name(species)
        return self.gtf_template.format(
            species, species.capitalize(), assembly_name)


print(" * Querying ensembl species")
client = EnsemblRestClient()
species_list = client.species_list()
species_list = sorted(s['name'] for s in species_list['species'])
print(" - Found {} species".format(len(species_list)))
species_list.insert(0, "--")
urls = (None, None)

To produce reasonable target regions efficiently, please provide an average read length. This should be an arithmetic mean, not an N50 length. After pressing play here you will be given the opportunity to select your genome of interest from a drop-down box.

In [None]:
read_length =  5000 #@param {type:"integer"}

def species_change(change):
    global urls
    if change['type'] == 'change' and change['name'] == 'value':
        print(" * Finding files, please wait...", end="")
        spec = change['new']
        assm = client.assembly_name(spec)
        dna_url = client.dna_url(spec, assembly_name=assm)
        gtf_url = client.gtf_url(spec, assembly_name=assm)
        urls = (dna_url, gtf_url)
        print("done")

print("Select a species:")
species_dropdown = widgets.Dropdown(
    options=species_list, value='--', description='species:')
species_dropdown.observe(species_change)
display(species_dropdown)
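
The average read length requested above is an arithmetic mean because the N50 is weighted towards long reads and is usually the larger of the two. The following is a minimal sketch showing how the two statistics can differ for the same data. The read lengths are made up, and both helper functions are illustrative rather than part of the workflow:

```python
def mean_length(lengths):
    """Arithmetic mean: total bases divided by the number of reads."""
    return sum(lengths) / len(lengths)


def n50_length(lengths):
    """N50: the length L such that reads of length >= L hold at least half of all bases."""
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical read lengths, for illustration only.
lengths = [1000, 2000, 3000, 4000, 20000]
print("mean:", mean_length(lengths))
print("N50 :", n50_length(lengths))
```

For these values the mean is 6000 while the N50 is 20000, so entering the N50 as `read_length` would pad each target region by considerably more than half a typical read.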

After the message `* Finding files, please wait...done` has been displayed above, press play on the next code cell to retrieve the required data and prepare the files required for MinKNOW.



In [None]:
#@markdown *Assembly and gene retrieval and processing*
dna_url, gtf_url = urls
try:
    print(" * Retrieving files...")
    print(dna_url)
    print(gtf_url)
    dna_path = os.path.basename(dna_url)
    gtf_path = os.path.basename(gtf_url)
    if not os.path.isfile(dna_path):
        !wget $dna_url || printf "\n * Failed to download assembly\n"
        if not os.path.isfile(dna_path):
            raise FileNotFoundError('Assembly could not be downloaded.')
    else:
        print(" - Skipping genome download")
    if not os.path.isfile(gtf_path):
        !wget $gtf_url || printf "\n * Failed to download gtf\n"
        if not os.path.isfile(gtf_path):
            raise FileNotFoundError('GTF could not be downloaded.')
    else:
        print(" - Skipping gtf download")
except Exception as e:
    print(" * Failed to retrieve files")
    print("{}".format(e))
else:
    print(" * Finished download")
    
    print(" * Reading gtf")
    ranges = pr.read_gtf(gtf_path)
    print(" - Merging and expanding intervals (this may take a while)...", end="")
    merged = ranges.merge(strand=False)
    sloppy = merged.slack(read_length // 2).merge(strand=False)
    print("done")
    df = pd.DataFrame({'Original':[len(ranges)], 'Merged':[len(merged)], 'Expanded':[len(sloppy)]})
    display(df)
    bed_path = "{}.read_until.bed".format(dna_path)
    sloppy.to_bed(bed_path)

    print(" * Output files:")
    print("   - Genome: {}".format(os.path.abspath(dna_path)))
    print("   - Bed   : {}".format(os.path.abspath(bed_path)))
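
The merge and expansion performed in the cell above can be illustrated on plain tuples. The sketch below is not the pyranges implementation. `merge_intervals` and `expand_intervals` are hypothetical helpers written for this example, the coordinates are invented, and clipping negative starts to zero is an assumption made here. It merges overlapping features, pads each side by half the read length, then merges again because padding can make neighbouring regions overlap.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def expand_intervals(intervals, pad):
    """Pad both sides of each interval by `pad` bases, clipping starts at 0."""
    return [(max(0, start - pad), end + pad) for start, end in intervals]


# Invented values, for illustration only.
example_read_length = 5000
example_features = [(1000, 3000), (2500, 4000), (8000, 9000), (30000, 31000)]

example_merged = merge_intervals(example_features)
example_targets = merge_intervals(
    expand_intervals(example_merged, example_read_length // 2))
print(example_merged)
print(example_targets)
```

In this example the four features first collapse to three intervals, and after padding the first two run into each other, leaving two target regions.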

When the above code has finished executing, pressing play on the next step will download to your computer:

1.   A reference genome (to provide to MinKNOW)
2.   The source `.gtf` file from which target regions were produced.
3.   A `.bed` file containing target regions to provide to MinKNOW.
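
Before providing the `.bed` file to MinKNOW it can be useful to check how much sequence the targets cover. The following is a minimal sketch, assuming the standard BED layout in which the first three tab-separated columns are chromosome, zero-based start, and end. The `total_target_bases` helper and the sample lines are hypothetical:

```python
import io


def total_target_bases(handle):
    """Sum interval lengths (end - start) over the lines of a BED file."""
    total = 0
    for line in handle:
        line = line.strip()
        # Skip blank lines and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        total += int(end) - int(start)
    return total


# Invented example content; a real file would be read with open(bed_path).
example_bed = io.StringIO("1\t0\t11500\n1\t27500\t33500\n")
print(total_target_bases(example_bed))
```

Dividing this total by the genome size gives a rough idea of the fraction of the genome being targeted.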

In [None]:
#@markdown *File download*
download_genome = True #@param {type:"boolean"}
download_gtf = True #@param {type:"boolean"}
download_targets = True #@param {type:"boolean"}
print(" * Downloading files:")

if download_genome:
    files.download(dna_path)
if download_gtf:
    files.download(gtf_path)
if download_targets:
    files.download(bed_path)