In [1]:
from pathlib import Path
import pybedtools
import pandas as pd
import json

This notebook documents the process for extending the Dorina regulators database.
Dorina depends on the correct naming of both the BED file and the BED records (the rows in the BED file). Dorina filters most BED files by name, so if the file name and the record names do not match, Dorina will not work with that regulator (see the mm10 fix at the end of this notebook). See dorina.regulator.py, line 80, for implementation details.

webdorina also parses the regulator site name, which must follow this format:  
`{experiment}#{reg_name}_{assembly}*{reg_name}`  
For example: `PARCLIP#TAF15_hg19*TAF15`
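
The format above can be checked before a file is added to the database. The cell below is a minimal sketch, not webdorina's actual parser: the helper names `format_site` and `parse_site` are hypothetical, and the regular expression assumes that the experiment contains no `#` and that the regulator name contains no `_` or `*`.

```python
import re

# Pattern for '{experiment}#{reg_name}_{assembly}*{reg_name}'.
# Assumption: experiment has no '#', reg_name has no '_' or '*'.
SITE_PATTERN = re.compile(
    r'^(?P<experiment>[^#]+)#(?P<reg_name>[^_*]+)_(?P<assembly>[^*]+)\*(?P<site_name>.+)$'
)


def format_site(experiment, reg_name, assembly):
    """Build a regulator site name in the documented format."""
    return f'{experiment}#{reg_name}_{assembly}*{reg_name}'


def parse_site(site):
    """Split a regulator site name into its parts, or raise ValueError."""
    match = SITE_PATTERN.match(site)
    if match is None:
        raise ValueError(f'not a valid regulator site name: {site!r}')
    return match.groupdict()


print(format_site('PARCLIP', 'TAF15', 'hg19'))
print(parse_site('PARCLIP#TAF15_hg19*TAF15'))
```

Running `parse_site` over every record name of a new BED file is a cheap way to catch malformed names before they reach webdorina.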

To keep the database consistent with the old data:
- peaks in experimental replicates are merged
- the score of the resulting peak is the median peak score

This may not be optimal for every experiment, but it gives a single rule for a database that contains various types of experiments.
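
The merging rule can be illustrated on toy data without bedtools. The cell below is a simplified sketch under stated assumptions: peaks are plain tuples, `merge_replicates` is a hypothetical helper, and the quadratic loop is only meant for illustration. The real pipeline in this notebook uses pybedtools.

```python
from statistics import median


def merge_replicates(rep1, rep2):
    """Intersect two replicate peak lists and score each shared region.

    Peaks are (chrom, start, end, strand, score) tuples with half-open
    BED coordinates. The score of a shared region is the median score of
    every peak, from either replicate, that overlaps it on the same strand.
    """
    def overlaps(a, b):
        return a[0] == b[0] and a[3] == b[3] and a[1] < b[2] and b[1] < a[2]

    merged = []
    for p1 in rep1:
        for p2 in rep2:
            if not overlaps(p1, p2):
                continue
            # Region shared by the two peaks.
            region = (p1[0], max(p1[1], p2[1]), min(p1[2], p2[2]), p1[3])
            scores = [p[4] for p in rep1 + rep2 if overlaps(region, p)]
            merged.append(region + (median(scores),))
    return sorted(merged)


rep1 = [('chr1', 100, 200, '+', 10.0), ('chr1', 500, 600, '+', 4.0)]
rep2 = [('chr1', 150, 250, '+', 20.0), ('chr1', 900, 950, '+', 7.0)]
print(merge_replicates(rep1, rep2))
```

Only the region present in both replicates survives, and its score is the median of the two overlapping peak scores. Peaks seen in a single replicate are dropped.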

In [2]:
def change_name(rec, new_name=""):
    """Set the name field of a BED record; used with BedTool.each."""
    rec.name = new_name
    return rec

## Add TargetScan predictions on 17 April 2018
Add TargetScan predictions from http://www.targetscan.org/vert_72/vert_72_data_download/Predicted_Target_Locations.default_predictions.hg19.bed.zip

The data was lifted over to hg38 with CrossMap:
`conda activate crossmap`
`python2 /home/tbrittoborges/bin/miniconda3/envs/crossmap/bin/CrossMap.py bed /prj/dorina2/crossmap/hg19ToHg38.over.chain Predicted_Target_Locations.default_predictions.hg19.bed targetScan_hg38.bed`
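
The same liftover can be scripted from Python so that it is repeatable for other files. This is a sketch only: `crossmap_bed_command` and `liftover_bed` are hypothetical helpers, the argument order copies the command shown above, and the default assumes that `CrossMap.py` is on the `PATH` of the active environment.

```python
import subprocess
from pathlib import Path


def crossmap_bed_command(chain, bed_in, bed_out, executable='CrossMap.py'):
    """Build the argument list: CrossMap.py bed <chain> <input> <output>."""
    return [executable, 'bed', str(chain), str(bed_in), str(bed_out)]


def liftover_bed(chain, bed_in, bed_out, executable='CrossMap.py'):
    """Run CrossMap on one BED file, failing early if an input is missing."""
    for required in (chain, bed_in):
        if not Path(required).is_file():
            raise FileNotFoundError(required)
    # check=True raises CalledProcessError if CrossMap exits with an error.
    subprocess.run(
        crossmap_bed_command(chain, bed_in, bed_out, executable), check=True)


print(crossmap_bed_command(
    '/prj/dorina2/crossmap/hg19ToHg38.over.chain',
    'Predicted_Target_Locations.default_predictions.hg19.bed',
    'targetScan_hg38.bed'))
```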

## Add eCLIP data on 16 April 2018
Add eCLIP data from Van Nostrand, Eric L., et al. "Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP)." Nature methods 13.6 (2016): 508.

In [5]:
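for cell in ('K562', 'HepG2'):
    path = Path(f'/Volumes/biodb/encode/encode_hg38_clip_peaks/{cell}/replicates/')
    # File names look like '{regulator}_{cell}_rep01.bed'. Both replicate files
    # share the same prefix, so keep each regulator only once (order preserved).
    regulators = list(dict.fromkeys(x.stem.split('_')[0] for x in path.glob('*.bed')))
    for regulator in regulators:
        bt1 = pybedtools.BedTool(str(path / f'{regulator}_{cell}_rep01.bed'))
        bt2 = pybedtools.BedTool(str(path / f'{regulator}_{cell}_rep02.bed'))
        # Strand-aware intersection: keep only regions found in both replicates.
        bt_inter = bt1.intersect(bt2, s=True).sort()
        # Pool the peaks of both replicates, without merging, as the score source.
        bt_merged = bt1.cat(bt2, postmerge=False).sort()
        # Score each shared region with the median score (column 5) of the pooled peaks.
        bt_inter = bt_inter.map(b=bt_merged, c='5', s=True, o='median').bed6()
        new_name = f'{cell}-{regulator}|eClip'
        bt_inter.each(change_name, new_name=new_name).saveas().moveto(
            f'/Volumes/tbrittoborges/dorina_eclip/eClip_{cell}_{regulator}_vanNostrand2016_hg38.bed')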

In [6]:
!cat /Volumes/tbrittoborges/dorina_eclip/* > /Volumes/prj/dorina2/regulators/h_sapiens/hg38/eClip_RBP_hg38.bed

In [7]:
pd.read_json('/Volumes/prj/dorina2/regulators/h_sapiens/hg38/TargetScanCons_mirna_hg19.json').head()

Unnamed: 0,description,experiment,id,methods,references,summary
0,Liftover to hg38 assembly with crossmap.,TargetScan Cons. miRNA:targets,miR-504|4725-5p|TargetScan,,"Benjamin P, et al. Cell 2005;20:15-20.",miR-504|4725-5p|TargetScan
1,Liftover to hg38 assembly with crossmap.,TargetScan Cons. miRNA:targets,miR-124|124ab|506|TargetScan,,"Benjamin P, et al. Cell 2005;20:15-20.",miR-124|124ab|506|TargetScan
2,Liftover to hg38 assembly with crossmap.,TargetScan Cons. miRNA:targets,miR-19ab|TargetScan,,"Benjamin P, et al. Cell 2005;20:15-20.",miR-19ab|TargetScan
3,Liftover to hg38 assembly with crossmap.,TargetScan Cons. miRNA:targets,miR-137|137ab|TargetScan,,"Benjamin P, et al. Cell 2005;20:15-20.",miR-137|137ab|TargetScan
4,Liftover to hg38 assembly with crossmap.,TargetScan Cons. miRNA:targets,miR-31|TargetScan,,"Benjamin P, et al. Cell 2005;20:15-20.",miR-31|TargetScan


In [8]:
eclip_json = []
for cell in ('K562', 'HepG2'):
    path = Path(f'/Volumes/biodb/encode/encode_hg38_clip_peaks/{cell}/replicates/')
    # Keep each regulator once, so that the JSON has one entry per regulator
    # rather than one per replicate file.
    regulators = list(dict.fromkeys(x.stem.split('_')[0] for x in path.glob('*.bed')))
    for regulator in regulators:
        n_sites = len(pybedtools.BedTool(
            f'/Volumes/tbrittoborges/dorina_eclip/eClip_{cell}_{regulator}_vanNostrand2016_hg38.bed'))
        eclip_json.append(
            {
                "description": f'sites regulated by {n_sites} obtained from {cell} cells. The '
                               'duplicates were merged with intersect and the median score is presented. '
                               'The original data was mapped to the hg38 assembly',
                "experiment": 'eClip',
                # The id must match the record name written to the BED file above.
                "id": f'{cell}-{regulator}|eClip',
                "references": 'Van Nostrand, Eric L., et al. Robust transcriptome-wide discovery '
                              'of RNA-binding protein binding sites with enhanced CLIP (eCLIP). Nature methods '
                              '13.6 (2016): 508.',
                "summary": f'{cell}-{regulator}|eClip',
            }
        )

In [9]:
with open('/Volumes/prj/dorina2/regulators/h_sapiens/hg38/eClip_RBP_hg38.json', 'w') as fout:
    json.dump(eclip_json, fout, indent=True)

In [10]:
pd.read_json('/Volumes/prj/dorina2/regulators/h_sapiens/hg38/eClip_RBP_hg38.json').head()

Unnamed: 0,description,experiment,id,references,summary
0,sites regulated by 101732 obtained from K562 c...,eClip,K562-U2AF2|eClip,"Van Nostrand, Eric L., et al. Robust transcrip...",K562-U2AF2|eClip
1,sites regulated by 42087 obtained from K562 ce...,eClip,K562-UPF1|eClip,"Van Nostrand, Eric L., et al. Robust transcrip...",K562-UPF1|eClip
2,sites regulated by 31614 obtained from K562 ce...,eClip,K562-SMNDC1|eClip,"Van Nostrand, Eric L., et al. Robust transcrip...",K562-SMNDC1|eClip
3,sites regulated by 27191 obtained from K562 ce...,eClip,K562-CPSF6|eClip,"Van Nostrand, Eric L., et al. Robust transcrip...",K562-CPSF6|eClip
4,sites regulated by 81632 obtained from K562 ce...,eClip,K562-PRPF8|eClip,"Van Nostrand, Eric L., et al. Robust transcrip...",K562-PRPF8|eClip
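
Because the `id` of each metadata entry is meant to equal the record name in the BED file, the two files can be cross-checked. The cell below is a minimal sketch, assuming a tab-separated BED file with the name in column 4. `missing_metadata` is a hypothetical helper and not part of Dorina.

```python
import json


def missing_metadata(bed_path, json_path):
    """Return the BED record names that have no 'id' entry in the metadata JSON."""
    with open(json_path) as fin:
        known_ids = {entry['id'] for entry in json.load(fin)}
    names = set()
    with open(bed_path) as fin:
        for line in fin:
            # Skip comments and UCSC header lines.
            if line.startswith(('#', 'track', 'browser')):
                continue
            fields = line.rstrip('\n').split('\t')
            if len(fields) >= 4:
                names.add(fields[3])
    return sorted(names - known_ids)
```

For the files written above, `missing_metadata('/Volumes/prj/dorina2/regulators/h_sapiens/hg38/eClip_RBP_hg38.bed', '/Volumes/prj/dorina2/regulators/h_sapiens/hg38/eClip_RBP_hg38.json')` should return an empty list if every regulator has a metadata entry.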


## Fix for mm10 regulators on 13 April 2018

dorina and webdorina were not working with the mm10 regulators because the record names in the .bed files did not match the file names.  
This section fixes the issue by setting every record name to the file stem followed by `*`.

In [3]:
bed_files = Path('/Volumes/prj/dorina2/regulators/m_musculus/mm10/').glob('*.bed')

In [5]:
for bed_file in bed_files:
    bt = pybedtools.BedTool(str(bed_file))
    new_name = bed_file.stem + '*'
    bt.each(change_name, new_name=new_name).saveas().moveto(str(bed_file))
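
After the rename, each file can be checked against the same rule. This is a sketch under one assumption taken from the cell above: a record "matches" when its name equals the file stem followed by `*`. `mismatched_records` is a hypothetical helper that reads the BED file as plain tab-separated text.

```python
from pathlib import Path


def mismatched_records(bed_file):
    """Return (line_number, name) for records whose name is not '<file stem>*'."""
    bed_file = Path(bed_file)
    expected = bed_file.stem + '*'
    bad = []
    with bed_file.open() as fin:
        for line_number, line in enumerate(fin, start=1):
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 4:
                # No name column (header or malformed line); nothing to compare.
                continue
            if fields[3] != expected:
                bad.append((line_number, fields[3]))
    return bad
```

Calling `mismatched_records` on each file in `/Volumes/prj/dorina2/regulators/m_musculus/mm10/` and printing the non-empty results gives a quick list of the files that still need attention.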

There are still problems with the PicTar files.