# Processing data with Bed2Seq.py

Below is a demo of how to generate sequence data and the corresponding labels from a set of BED files with Bed2Seq.py.

1) Check the BED files

In [6]:
!ls ../data/bed_files

CN.bed	DN.bed	GA.bed	IPS.bed  NSC.bed


2) Run Bed2Seq

In [15]:
!python3 ../src/data_processing/Bed2Seq.py --BedDir ../data/bed_files/ --OutDir ../data/seq_data/ --RefGenome ../tool/genome_bins.bed --ToolDir ../tool/

------------Starting Bed2Seq------------

------------Checking input BED directory------------
Input bed directory: ../data/bed_files/

Number of Bed file Found: 5
GA CN IPS DN NSC

Finished
------------Checking Reference Genome------------
Finished
------------Checking required tools and files------------
Finished
------------Calculating overlaps------------
../tool/bedtools2/bin/intersectBed -a ../tool/genome_bins.bed -b ../data/bed_files/GA.bed ../data/bed_files/CN.bed ../data/bed_files/IPS.bed ../data/bed_files/DN.bed ../data/bed_files/NSC.bed -names GA CN IPS DN NSC -wa -wb -f 0.5 > ../data/seq_data/overlap.bed

Writing overlap.bed
Finished
------------Composing sequence and corresponding labels------------
Writing train.bed
Writing test.bed
Writing labels.pt
Finished
------------Writing Sequence file to destination------------
Writing train.fasta
Writing test.fasta
Writing train.seq
Writing test.seq
Finished
------------Writing Meta data------------
Finished
------------Clean Up-
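
The intersectBed command in the log above passes `-f 0.5`. In bedtools this requires the overlap to cover at least that fraction of the `-a` interval, which here is a genome bin. The sketch below illustrates that rule in plain Python. The helper names and the example coordinates are hypothetical and are not taken from Bed2Seq.py.

```python
def overlap_fraction_of_a(a_start, a_end, b_start, b_end):
    """Fraction of interval A covered by interval B (half-open, BED-style coordinates)."""
    overlap = max(0, min(a_end, b_end) - max(a_start, b_start))
    return overlap / (a_end - a_start)


def passes_min_fraction(a, b, min_frac=0.5):
    """Mimic the -f threshold: keep the pair only if B covers enough of A."""
    return overlap_fraction_of_a(a[0], a[1], b[0], b[1]) >= min_frac


# Hypothetical coordinates, for illustration only.
genome_bin = (1000, 2000)   # a 1000 bp bin (the -a side)
wide_peak = (1400, 2600)    # covers 600 bp of the bin
narrow_peak = (1800, 2600)  # covers 200 bp of the bin

print(passes_min_fraction(genome_bin, wide_peak))
print(passes_min_fraction(genome_bin, narrow_peak))
```

With these toy intervals only the first peak would be reported for the bin, since it covers more than half of it.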

3) Check the output

In [27]:
!cat ../data/seq_data/MetaData.txt

Input bed directory: ../data/bed_files/
Number of BED files: 5
BED file names: GA CN IPS DN NSC
Number of train sequences: 1407923
Number of test sequences: 165626
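
MetaData.txt, as printed above, consists of `key: value` lines, so it can be read back into a dictionary for quick sanity checks such as the train/test ratio. This is a minimal sketch: `parse_metadata` is a hypothetical helper, not part of the repository, and the text is pasted inline rather than read from disk.

```python
def parse_metadata(text):
    """Turn 'key: value' lines into a dict, skipping lines without a colon."""
    meta = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        meta[key.strip()] = value.strip()
    return meta


# Contents copied from the MetaData.txt output shown above.
example = """Input bed directory: ../data/bed_files/
Number of BED files: 5
BED file names: GA CN IPS DN NSC
Number of train sequences: 1407923
Number of test sequences: 165626
"""

meta = parse_metadata(example)
n_train = int(meta["Number of train sequences"])
n_test = int(meta["Number of test sequences"])
test_fraction = n_test / (n_train + n_test)
print(meta["BED file names"].split())
print(f"test fraction: {test_fraction:.3f}")
```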


In [21]:
import torch
import sys
sys.path.append('../src')
from model import data_loader

In [28]:
Dset = data_loader.seq_data(seq_path = '../data/seq_data/train.seq', training_mode = True, label_path = '../data/seq_data/labels.pt')
seq_id, onehot_seq, label = Dset[0]
print(seq_id)
print(onehot_seq.size())
print(label.size())
FeatMap = torch.load('../data/seq_data/FeatMap.pt')
print(FeatMap)

50
torch.Size([4, 1000])
torch.Size([5])
{'GA': 0, 'CN': 1, 'IPS': 2, 'DN': 3, 'NSC': 4}
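
The shapes printed above suggest that each sequence is stored as a 4 x 1000 one-hot matrix and each label as a length-5 vector indexed by FeatMap. The sketch below shows what such an encoding could look like in plain Python. The channel order `ACGT`, the all-zero handling of `N`, and the multi-hot interpretation of the label are assumptions; `data_loader` may implement these differently.

```python
BASES = "ACGT"  # assumed channel order, not confirmed by the source


def one_hot(seq):
    """Return a 4 x len(seq) nested list; bases outside ACGT (e.g. N) stay all-zero."""
    rows = [[0.0] * len(seq) for _ in BASES]
    for pos, base in enumerate(seq.upper()):
        idx = BASES.find(base)
        if idx >= 0:
            rows[idx][pos] = 1.0
    return rows


def multi_hot(names, feat_map):
    """Set a 1.0 at the FeatMap index of every BED file that overlaps the sequence."""
    label = [0.0] * len(feat_map)
    for name in names:
        label[feat_map[name]] = 1.0
    return label


feat_map = {'GA': 0, 'CN': 1, 'IPS': 2, 'DN': 3, 'NSC': 4}  # as printed above
encoded = one_hot("ACGTN")                  # toy 5 bp sequence
label = multi_hot(["CN", "NSC"], feat_map)  # hypothetical overlap set

print(len(encoded), len(encoded[0]))
print(label)
```

A real 1000 bp sequence would give a 4 x 1000 matrix, matching `torch.Size([4, 1000])` in the output.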


# Processing variants with SNP2Seq.py

Below is a demo of how to generate reference and alternate sequence data from a list of variants with SNP2Seq.py. Variants can be supplied either as rsIDs or as a VCF-style file.

1) Check the variant files

In [33]:
!cat ../data/SNP_file/rsid.txt
!cat ../data/SNP_file/test.vcf

rs328
rs12854784
chr1	109817590	[known_CEBP_binding_increase]	G	T
chr10	23508363	[known_FOXA2_binding_decrease]	A	G
chr16	52599188	[known_FOXA1_binding_increase]	C	T
chr16	209709	[known_GATA1_binding_increase]	T	C
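
Each line of test.vcf above carries five whitespace-separated fields: chromosome, position, an identifier, the reference allele, and the alternate allele. The sketch below parses one such line and builds an `id;ref;alt` string of the kind that appears as `seq_id` in the outputs further down. `Variant`, `parse_variant_line`, and `make_seq_id` are hypothetical names, and this is not how SNP2Seq.py itself is known to parse the file.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    name: str
    ref: str
    alt: str


def parse_variant_line(line):
    """Split one five-column variant line on whitespace."""
    chrom, pos, name, ref, alt = line.split()[:5]
    return Variant(chrom, int(pos), name, ref, alt)


def make_seq_id(variant):
    """Join identifier, reference allele, and alternate allele with semicolons."""
    return f"{variant.name};{variant.ref};{variant.alt}"


variant = parse_variant_line("chr1\t109817590\t[known_CEBP_binding_increase]\tG\tT")
print(variant.chrom, variant.pos)
print(make_seq_id(variant))
```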


2) Run SNP2Seq and check the output

rsid mode

In [37]:
!python3 ../src/data_processing/SNP2Seq.py --InputType rsid --InputFile ../data/SNP_file/rsid.txt --OutDir ../data/vseq_data/ --ToolDir ../tool/

------------Starting SNP2Seq------------

------------Checking required tools and files------------
Finished
------------Converting input to bed------------
Finished
------------Writing seq file to destination------------
Finished
------------Clean up------------
Finished


In [38]:
Dset = data_loader.SNP_data(seq_path = '../data/vseq_data/out.vseq')
seq_id, ref_seq, alt_seq = Dset[0]
print(seq_id)
print(ref_seq.size())
print(alt_seq.size())

rs328;C;G
torch.Size([4, 1000])
torch.Size([4, 1000])


VCF mode

In [39]:
!python3 ../src/data_processing/SNP2Seq.py --InputType VCF --InputFile ../data/SNP_file/test.vcf --OutDir ../data/vseq_data/ --ToolDir ../tool/

------------Starting SNP2Seq------------

------------Checking required tools and files------------
Finished
------------Converting input to bed------------
Finished
------------Writing seq file to destination------------
Finished
------------Clean up------------
Finished


In [40]:
Dset = data_loader.SNP_data(seq_path = '../data/vseq_data/out.vseq')
seq_id, ref_seq, alt_seq = Dset[0]
print(seq_id)
print(ref_seq.size())
print(alt_seq.size())

[known_CEBP_binding_increase];G;T
torch.Size([4, 1000])
torch.Size([4, 1000])
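
`ref_seq` and `alt_seq` have the same shape, so for a single-nucleotide substitution they would be expected to differ in exactly one column. The sketch below shows how to locate that column, using nested lists in place of tensors. The `ACGT` channel order and the toy sequences are assumptions, and where SNP2Seq.py places the variant inside the 1000 bp window is not stated here.

```python
BASES = "ACGT"  # assumed channel order, not confirmed by the source


def one_hot(seq):
    """Encode a DNA string as a 4 x len(seq) nested list."""
    return [[1.0 if base == b else 0.0 for base in seq] for b in BASES]


def differing_columns(ref, alt):
    """Return the positions at which two one-hot matrices disagree."""
    length = len(ref[0])
    return [
        pos
        for pos in range(length)
        if any(r_row[pos] != a_row[pos] for r_row, a_row in zip(ref, alt))
    ]


# Toy 5 bp window with a G>T substitution at index 2.
ref = one_hot("ACGTA")
alt = one_hot("ACTTA")
print(differing_columns(ref, alt))
```

With real data the same check could be applied to the tensors after converting them with `.tolist()`.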
