# Reference genome


There are multiple sources of reference genome data: [Ensembl](), [UCSC](), [NCBI](), [GENCODE](), ...


## 1. Archived

The following collections provide ready-to-use reference genomes and annotations:

  - [iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html) - Ready-To-Use Reference Sequences and Annotations from Illumina. The files were downloaded from [Ensembl](), [NCBI]() or [UCSC](). 
  
  - [refgenie](http://refgenomes.databio.org/) - A server that hosts reference data for common species and provides a RESTful API to access it; see the [API doc](http://refgenomes.databio.org/docs). A CLI tool is also available to download data: `refgenie pull ...`; see the [refgenie documentation](http://refgenie.databio.org/en/latest/).

  - [GenomeResources](http://zhanglab.net/resources/genome/Home.html) - Collection of common genome annotations and resources, by the Zhang Qiangfeng lab (Tsinghua University).

## 2. Manual Download

Download GTF annotation data from Ensembl.
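
As a rough sketch, the GTF download URL can be assembled from the species, assembly, and release. The directory layout below is an assumption that mirrors the release-102 FASTA URL and the GTF file name used later in these notes; `ensembl_gtf_url` is an illustrative helper, not part of any Ensembl tool.

```python
ENSEMBL_FTP = "ftp://ftp.ensembl.org/pub"


def ensembl_gtf_url(species, assembly, release):
    """Build the expected Ensembl FTP URL of a GTF file.

    species  : e.g. "Drosophila_melanogaster" (capitalized, underscore-separated)
    assembly : e.g. "BDGP6.28"
    release  : Ensembl release number, e.g. 102
    """
    filename = "{}.{}.{}.gtf.gz".format(species, assembly, release)
    # Assumed layout: <base>/release-<N>/gtf/<species in lower case>/<file>
    return "{}/release-{}/gtf/{}/{}".format(
        ENSEMBL_FTP, release, species.lower(), filename)


print(ensembl_gtf_url("Drosophila_melanogaster", "BDGP6.28", 102))
```

The printed URL can then be passed to `wget`, as done for the FASTA file in section 3.2. Check the path on the FTP site before relying on it for other species or releases.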

Use [CrossMap](http://crossmap.sourceforge.net/) (on github: https://github.com/liguowang/CrossMap) to convert genome coordinates between different assemblies (like liftOver).

See [ChromosomeMappings](https://github.com/dpryan79/ChromosomeMappings) - by Devon Ryan, to convert chromosome names between UCSC <-> Ensembl <-> Gencode for a variety of genomes.
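
A minimal sketch of how a two-column name-mapping table could be applied to BED-like records. It assumes tab-separated `old<TAB>new` lines; the inline table and the helpers `load_mapping` and `rename_chrom` are hypothetical and only illustrate the idea.

```python
def load_mapping(lines):
    """Parse 'old<TAB>new' lines into a dict; skip entries without a target."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] and fields[1]:
            mapping[fields[0]] = fields[1]
    return mapping


def rename_chrom(bed_line, mapping):
    """Return the BED line with its first column renamed, or None if unmapped."""
    fields = bed_line.rstrip("\n").split("\t")
    new_name = mapping.get(fields[0])
    if new_name is None:
        return None
    return "\t".join([new_name] + fields[1:])


# Hypothetical Ensembl -> UCSC table for a few fly chromosomes
example_table = ["2L\tchr2L", "2R\tchr2R", "X\tchrX"]
mapping = load_mapping(example_table)
print(rename_chrom("2L\t100\t200\tgeneA", mapping))
```

Records on contigs that have no counterpart in the target naming scheme return `None`, so the caller can decide whether to drop or report them.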





For example, the iGenomes archive for *Drosophila melanogaster* (Ensembl, BDGP6):

http://igenomes.illumina.com.s3-website-us-east-1.amazonaws.com/Drosophila_melanogaster/Ensembl/BDGP6/Drosophila_melanogaster_Ensembl_BDGP6.tar.gz
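
The iGenomes URL above appears to follow a `species/source/build` pattern. The sketch below rebuilds it from those three parts; that other archives follow the same pattern is an assumption, and `igenomes_url` is an illustrative helper name.

```python
IGENOMES_BASE = "http://igenomes.illumina.com.s3-website-us-east-1.amazonaws.com"


def igenomes_url(species, source, build):
    """Build an iGenomes archive URL, assuming the species/source/build layout."""
    archive = "{}_{}_{}.tar.gz".format(species, source, build)
    return "/".join([IGENOMES_BASE, species, source, build, archive])


print(igenomes_url("Drosophila_melanogaster", "Ensembl", "BDGP6"))
```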




## 3. Using refgenie from the command line

### 3.1 Download pre-built assets

```
$ refgenie pull -g dm6 fasta bowtie2_index ensembl_gtf   
$ refgenie pull -g hg38 fasta bowtie2_index ensembl_gtf   
```

### 3.2 Add custom assets

Prepare custom annotation data for a genome, for example, add a `bed` asset for dm6.

```
# go to genome dir
```






For example, to add the `dm6` assembly:

  - Create an empty directory, `dm6/fasta`, within the `$genome_data` directory
  - Build the fasta and annotation assets using the following commands
  
  
```
# create dm6
$ cd /data/biodata/refgenie/
$ mkdir -p dm6/fasta
# add genome
$ refgenie add dm6/fasta -p dm6/fasta
# Caution: after `refgenie add`, edit genome_config.yaml and
# change the asset_path from [dm6/fasta] to [fasta]
# build
$ wget ftp://ftp.ensembl.org/pub/release-102/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz
$ refgenie build dm6/fasta --files fasta=Drosophila_melanogaster.BDGP6.28.dna.toplevel.fa.gz
# build more assets
$ refgenie build -g dm6 bowtie_index bowtie2_index bwa_index kallisto_index
# STAR: for small genomes, reduce genomeSAindexNbases (e.g. to 6 or 5); default: 14 (typical range: 10-15)
$ refgenie build -g dm6 star_index --params genomeSAindexNbases=6
```
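
For choosing `genomeSAindexNbases`, the STAR manual suggests scaling it down to `min(14, log2(GenomeLength)/2 - 1)` for small genomes. The sketch below applies that formula; rounding down to an integer is a conservative choice made here, not something the formula prescribes, and `star_sa_index_nbases` is an illustrative name.

```python
import math


def star_sa_index_nbases(genome_length):
    """Suggest a genomeSAindexNbases value for a genome of the given length (bp).

    Applies min(14, log2(genome_length) / 2 - 1), rounded down (assumption).
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be a positive number of bases")
    return min(14, int(math.log2(genome_length) / 2 - 1))


# A 16,384 bp (2**14) sequence, e.g. a small construct
print(star_sa_index_nbases(2 ** 14))
# A human-sized genome hits the default cap
print(star_sa_index_nbases(3_100_000_000))
```

The total genome length can be taken from the sum of the second column of the `.fai` index that `samtools faidx` produces.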




### 3.3 Build assets

**All input files must be gzipped** (as of the current version, refgenie 0.9.3).
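
If an input file is not yet compressed, it can be gzipped with the standard library before calling `refgenie build`. This is a minimal sketch; `gzip_file` is a hypothetical helper name and the original file is left in place.

```python
import gzip
import shutil


def gzip_file(src, dest=None):
    """Compress src into dest (default: src + '.gz') and return the output path."""
    if dest is None:
        dest = src + ".gz"
    # Stream the copy so large FASTA/GTF files are not loaded into memory
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

For example, `gzip_file("dm6.fa")` would write `dm6.fa.gz` next to the input file.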


#### 3.3.1 Reference

+ fasta

```
# require: gzipped fasta file, samtools
$ refgenie build dm6/fasta --files fasta=dm6.fa.gz
```



#### 3.3.2 Annotations


+ ensembl_gtf 

```
# switch to the refgenie home directory
$ cd /data/biodata/refgenie

# create dir 
$ mkdir -p dm6/ensembl_gtf/default/

# copy/download ensembl data
$ cp Drosophila_melanogaster.BDGP6.28.102.gtf.gz dm6/ensembl_gtf/default/dm6_ensembl.gtf.gz

# build
$ refgenie build dm6/ensembl_gtf --files ensembl_gtf=dm6/ensembl_gtf/default/dm6_ensembl.gtf.gz

```

#### 3.3.3 Indexes for aligners: `bowtie`, `bowtie2`, `hisat2`, `bwa`, `STAR`, `kallisto`, `salmon`


+ bowtie_index   

```
# require: fasta, bowtie
$ refgenie build dm6/bowtie_index
```

+ bowtie2_index   

```
# require: fasta, bowtie2
$ refgenie build dm6/bowtie2_index
```

+ hisat2_index

```
# require: fasta, hisat2
$ refgenie build dm6/hisat2_index
```

+ bwa_index

```
# require: fasta, bwa
$ refgenie build dm6/bwa_index
```

+ kallisto_index 

```
# require: fasta, kallisto
$ refgenie build dm6/kallisto_index
```

+ salmon_index

```
# require: fasta, salmon
$ refgenie build dm6/salmon_index
```

+ star_index

```
# require: fasta, STAR
$ refgenie build dm6/star_index
```
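
Since the index builds above differ only in the asset name, the commands can be generated in a loop. The sketch below only assembles and prints the command lines; `build_commands` and `INDEX_ASSETS` are illustrative names, and running the commands (for example with `subprocess.run(cmd, check=True)`) is left to the reader.

```python
INDEX_ASSETS = [
    "bowtie_index",
    "bowtie2_index",
    "hisat2_index",
    "bwa_index",
    "kallisto_index",
    "salmon_index",
    "star_index",
]


def build_commands(genome, assets=INDEX_ASSETS):
    """Return one `refgenie build <genome>/<asset>` argument list per asset."""
    return [["refgenie", "build", "{}/{}".format(genome, asset)]
            for asset in assets]


for cmd in build_commands("dm6"):
    print(" ".join(cmd))
```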




## Extract TSS, gene_body

Does the Ensembl-annotated TSS match the real TSS?    
see: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531148/    
DOI: 10.1093/nar/gks1233    
title: EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era    
> substantial fraction of the gene starts in ENSEMBL are in fact located 10–20 bp downstream of the true TSS.

In the `refgenie build` recipe, the TSS is moved 20 bp downstream (file: `refgenie/asset_build_packages.py`, line 457).

Why?
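
For reference, a minimal sketch of extracting strand-aware TSS positions from the `gene` records of a GTF file. It applies no shift and is not the refgenie recipe; the function name `gtf_gene_tss` and the toy records are made up for illustration.

```python
def gtf_gene_tss(gtf_lines):
    """Yield BED-style tuples (chrom, start0, end0, gene_id, '.', strand) per gene."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep gene-level records only
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        # GTF is 1-based inclusive: TSS is the start on '+', the end on '-'
        tss = start if strand == "+" else end
        gene_id = "."
        for attr in fields[8].split(";"):
            attr = attr.strip()
            if attr.startswith("gene_id "):
                gene_id = attr.split(" ", 1)[1].strip('"')
                break
        # Convert the single-base TSS to a 0-based, half-open BED interval
        yield (chrom, tss - 1, tss, gene_id, ".", strand)


# Toy records (not real annotations)
toy_gtf = [
    '#!genome-build toy',
    'chr2L\ttoy\tgene\t101\t500\t.\t+\t.\tgene_id "geneA"; gene_name "A";',
    'chr2L\ttoy\texon\t101\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "tA";',
    'chr2L\ttoy\tgene\t1001\t2000\t.\t-\t.\tgene_id "geneB";',
]
for record in gtf_gene_tss(toy_gtf):
    print("\t".join(map(str, record)))
```

The gene body corresponds to the full interval `(start - 1, end)` of the same records.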



In [None]:
import requests
import sys

server = "https://rest.ensembl.org"
ext = "/archive/id/ENSG00000157764?"
 
r = requests.get(server+ext, headers={ "Content-Type" : "application/json"})
 
if not r.ok:
  r.raise_for_status()
  sys.exit()
 
decoded = r.json()
print(repr(decoded))

In [None]:
import requests
import sys

server = "https://rest.ensembl.org"
ext = "/info/rest?"
 
r = requests.get(server+ext, headers={ "Content-Type" : "application/json"})
 
if not r.ok:
  r.raise_for_status()
  sys.exit()
 
decoded = r.json()
print(repr(decoded))
 

In [70]:
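import os
import collections
import json
import logging
import numbers
import pickle
import tempfile

import toml
import yaml

log = logging.getLogger(__name__)


# NOTE: `update_obj`, `file_exists` and `file_abspath` are used below but are
# not defined in this notebook; the versions here are assumed minimal stand-ins.
def update_obj(obj, d, force=True):
    """Copy the items of dict d onto obj as attributes; return obj."""
    if isinstance(d, dict):
        for k, v in d.items():
            if force or not hasattr(obj, k):
                setattr(obj, k, v)
    return obj


def file_exists(x, isfile=True):
    """Check that x is an existing file (or directory, if isfile=False)."""
    if not isinstance(x, str):
        return False
    return os.path.isfile(x) if isfile else os.path.isdir(x)


def file_abspath(x):
    """Return the absolute path of x; non-str input is returned unchanged."""
    return os.path.abspath(os.path.expanduser(x)) if isinstance(x, str) else x
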

class Config(object):
    """Working with config, in dict/yaml/toml/pickle formats
    load/dump
    
    Example:
    1. write to file
    >>> Config().dump(d, 'out.json')
    >>> Config().dump(d, 'out.toml')
    >>> Config().dump(d, 'out.pickle')
    
    2. load from file
    >>> d = Config().load('in.yaml')

    read/write data
    """
    def __init__(self, x=None, **kwargs):
        self = update_obj(self, kwargs, force=True)
        self.x = x


    def load(self, x=None):
        """Read data from x, auto-recognize the file-type
        toml
        json
        pickle
        txt
        ...
        """
        if x is None:
            x = self.x # dict or str

        if x is None:
            x_dict = None # {} ?
        elif isinstance(x, dict):
            x_dict = collections.OrderedDict(sorted(x.items()))
        elif isinstance(x, str):
            reader = self.get_reader(x)
            if reader is None:
                x_dict = None
                log.error('unknown x, {}'.format(x))
            else:
                x_dict = reader(x)
        else:
            x_dict = None
            log.warning('load(x=) expect dict or str, got {}'.format(
                type(x).__name__))

        return x_dict


    def dump(self, d=None, x=None):
        """Write data to file x, auto-recognize the file-type
        d str or dict, data
        x str file to save data(dict)

        toml
        json
        pickle
        txt
        ...
        """
        if d is None:
            d = self.load(self.x)
        # make sure: dict
        if isinstance(x, str):
            writer = self.get_writer(x)
            if writer is None:
                log.error('unknown x, {}'.format(x))
            else:
                writer(d, x)
        else:
            log.warning('dump(x=) expect str, got {}'.format(
                type(x).__name__))


    def guess_format(self, x):
        """Guess the file format, by file extension
    
        file format:
        - toml
        - yaml
        - json
        - pickle

        data format:
        - dict
        """
        formats = {
            'json': 'json',
            'yaml': 'yaml',
            'yml': "yaml",
            'toml': 'toml',
            'pickle': 'pickle',
            'txt': 'log',
        }

        if isinstance(x, str):
            x_ext = os.path.splitext(x)[1]
            x_ext = x_ext.lstrip('.').lower()
            x_format = formats.get(x_ext, None)
        elif isinstance(x, dict):
            x_format = 'dict'
        else:
            x_format = None

        return x_format


    def get_reader(self, x):
        """Get the reader for file x, based on the file extension
    
        could be: json/yaml/toml/pickle
        """
        x_format = self.guess_format(x)
        readers = {
            'json': self.from_json,
            'yaml': self.from_yaml,
            'toml': self.from_toml,
            'pickle': self.from_pickle
        }
        return readers.get(x_format, None)


    def get_writer(self, x):
        """Get the writer for file x, based on the file extension

        could be: json/yaml/toml/pickle/log
        """
        x_format = self.guess_format(x)
        writers = {
            'json': self.to_json,
            'yaml': self.to_yaml,
            'toml': self.to_toml,
            'pickle': self.to_pickle,
            'log': self.to_log,
        }
        return writers.get(x_format, None)


    def from_json(self, x):
        """Loading data from JSON file
        x should be file
        """
        d = None
        if file_exists(x):
            try:
                with open(x, 'r') as r:
                    if os.path.getsize(x) > 0:
                        d = json.load(r)
                        d = collections.OrderedDict(sorted(d.items()))
            except Exception as exc:
                log.error('from_json() failed, {}'.format(exc))
            finally:
                return d
        else:
            log.error('from_json() failed, file not exists: {}'.format(x))


    def from_yaml(self, x):
        """Loading data from YAML file
        x should be file
        """
        d = None
        if file_exists(x):
            try:
                with open(x, 'r') as r:
                    if os.path.getsize(x) > 0:
                        d = yaml.load(r, Loader=yaml.FullLoader)
                        d = collections.OrderedDict(sorted(d.items()))
            except Exception as exc:
                log.error('from_yaml() failed, {}'.format(exc))
            finally:
                return d
        else:
            log.error('from_yaml() failed, file not exists: {}'.format(x))
        # with open(x, 'r') as r:
        #     try:
        #         d = yaml.safe_load(r)
        #         return collections.OrderedDict(sorted(d.items()))
        #     except yaml.YAMLError as exc:
        #         log.warning(exc)


    def from_toml(self, x):
        """Loading data from TOML file
        x should be file
        """
        d = None
        if file_exists(x):
            try:
                with open(x, 'r') as r:
                    if os.path.getsize(x) > 0:
                        d = toml.load(r)
                        d = collections.OrderedDict(sorted(d.items()))
            except Exception as exc:
                log.error('from_toml() failed, {}'.format(exc))
            finally:
                return d
        else:
            log.error('from_toml() failed, file not exists: {}'.format(x))


    def from_pickle(self, x):
        """Loading data from pickle file
        x should be file
        """
        d = None
        if file_exists(x):
            try:
                with open(x, 'rb') as r:
                    if os.path.getsize(x) > 0:
                        d = pickle.load(r)
                        d = collections.OrderedDict(sorted(d.items()))
            except Exception as exc:
                log.error('from_pickle() failed, {}'.format(exc))
            finally:
                return d
        else:
            log.error('from_pickle() failed, file not exists: {}'.format(x))


    def to_json(self, d, x):
        """Writing data to JSON file
        d dict, data to file
        x None or str, path to JSON file, or return string
        """
        x = file_abspath(x)
        if not isinstance(d, dict):
            log.error('to_json(d=) failed, dict expect, got {}'.format(
                type(d).__name__))
        elif not isinstance(x, str):
            log.error('to_json(x=) failed, str expect, got {}'.format(
                type(x).__name__))
        elif not file_exists(os.path.dirname(x), isfile=False):
            log.error('to_json(x=) failed, file not exists: {}'.format(x))
        else:
            try:
                with open(x, 'wt') as w:
                    json.dump(d, w, indent=4, sort_keys=True)
                # return x
            except Exception as exc:
                log.error('to_json() failed, {}'.format(exc))


    def to_yaml(self, d, x):
        """Writing data to YAML file
        d dict, data to file
        x str, path to YAML file

        yaml.dump(), does not support OrderedDict
        Solution: OrderedDict -> json -> dict
        """
        # x_yaml = x
        # x = os.path.splitext(x_yaml)[0] + '.toml'
        # log.warning('OrderedDict is not supported in YAML, save as TOML instead: {}'.format(x))
        # check
        x = file_abspath(x)
        if not isinstance(d, dict):
            log.error('to_yaml(d=) failed, dict expect, got {}'.format(
                type(d).__name__))
        elif not isinstance(x, str):
            log.error('to_yaml(x=) failed, str expect, got {}'.format(
                type(x).__name__))
        elif not file_exists(os.path.dirname(x), isfile=False):
            log.error('to_yaml(x=) failed, file not exists: {}'.format(x))
        else:
            try:
                with open(x, 'wt') as w:
                    # toml.dump(d, w)
                    yaml.dump(dict(d), w)
                # return x
            except Exception as exc:
                log.error('to_yaml() failed, {}'.format(exc))


    def to_toml(self, d, x):
        """Writing data to TOML file
        d dict, data to file
        x str, path to TOML file
        """        
        x = file_abspath(x)
        if not isinstance(d, dict):
            log.error('to_toml(d=) failed, dict expect, got {}'.format(
                type(d).__name__))
        elif not isinstance(x, str):
            log.error('to_toml(x=) failed, str expect, got {}'.format(
                type(x).__name__))
        elif not file_exists(os.path.dirname(x), isfile=False):
            log.error('to_toml(x=) failed, file not exists: {}'.format(x))
        else:
            try:
                with open(x, 'wt') as w:
                    toml.dump(d, w)
                # return x
            except Exception as exc:
                log.error('to_toml() failed, {}'.format(exc))


    def to_pickle(self, d, x):
        """Writing data to pickle file
        d dict, data to file
        x str, path to pickle file
        """        
        x = file_abspath(x)
        if not isinstance(d, dict):
            log.error('to_pickle(d=) failed, dict expect, got {}'.format(
                type(d).__name__))
        elif not isinstance(x, str):
            log.error('to_pickle(x=) failed, str expect, got {}'.format(
                type(x).__name__))
        elif not file_exists(os.path.dirname(x), isfile=False):
            log.error('to_pickle(x=) failed, file not exists: {}'.format(x))
        else:
            try:
                with open(x, 'wb') as w:
                    pickle.dump(d, w, protocol=pickle.HIGHEST_PROTOCOL)
                # return x
            except Exception as exc:
                log.error('to_pickle() failed, {}'.format(exc))


    def to_log(self, d, x, stdout=False):
        """Writing data to log file: key: value format
        d dict, data to file
        x str, path to log file
        """
        x = file_abspath(x)
        if not isinstance(d, dict):
            log.error('to_log(d=) failed, dict expect, got {}'.format(
                type(d).__name__))
        elif not isinstance(x, str):
            log.error('to_log(x=) failed, str expect, got {}'.format(
                type(x).__name__))
        elif not file_exists(os.path.dirname(x), isfile=False):
            log.error('to_log(x=) failed, file not exists: {}'.format(x))
        else:
            try:
                # organize msg
                msg = []
                for k, v in d.items():
                    if isinstance(v, str) or isinstance(v, numbers.Number) or isinstance(v, bool):
                        v = str(v)
                    elif isinstance(v, list):
                        v = ', '.join(map(str, v))
                    else:
                        v = '...' # skip
                    msg.append('{:30s} | {:<40s}'.format(k, v))
                # save
                with open(x, 'wt') as w:
                    w.write('\n'.join(msg) + '\n')
                if stdout:
                    print('\n'.join(msg))
                # return x
            except Exception as exc:
                log.error('to_log() failed, {}'.format(exc))


    def _tmp(self, suffix='.txt'):
        """
        Create a tmp file to save json object
        """
        tmp = tempfile.NamedTemporaryFile(prefix='tmp', suffix=suffix,
            delete=False)
        return tmp.name
    
    
    
# df = Config().load('./aaaa.json')
# Config().dump(df, './aaaa.txt')
# Config().dump(df, './aaaa.json')
#df = Config().load('./aaaa.json')
#Config().dump(df, 'aaaa.pickle')
df = {'aaa': 1, 'BBB': 2}
Config().dump(df, 'aaaa.yaml')