Fail at 'Filter by non repetitive region' #53

Open
bioysu opened this issue Aug 6, 2018 · 3 comments

bioysu commented Aug 6, 2018

Dear Sir, I tried to run DCC, but it fails every time at the step 'Filter by non repetitive region'.
Could you help me fix it?
Here is the command:
/data/suyao/tools/build/bin/DCC @/data/suyao/project/circRNA/output/DCC/samplesheet_test -mt1 @/data/suyao/project/circRNA/output/DCC/mate1_test -mt2 @/data/suyao/project/circRNA/output/DCC/mate2_test -D -R /data/suyao/data/UCSC/hg19/database/hg19_repeat.gtf -an /data/suyao/data/ensembl/grch37/release-92/Homo_sapiens.GRCh37.87.chr_patch_hapl_scaff_chr.gtf -k -Pi -F -M -Nr 5 3 -fg -G -A /data/suyao/data/UCSC/hg19/chromosomes/hg19.fa -O /data/suyao/project/circRNA/output/DCC/output_test

Here is the output on the screen:
Output folder /data/suyao/project/circRNA/output/DCC/output_test already exists, reusing
DCC 0.4.6 started
Input file names have duplicates, add number suffix in input order to output files for distinction
40 CPU cores available, using 2
Please make sure that the read pairs have been mapped both, combined and on a per mate basis
Collecting chimera information from mates-separate mapping
Combining individual circRNA read counts
Using files _tmp_DCC/tmp_circCount and _tmp_DCC/tmp_coordinates for filtering
Filtering by read counts
Traceback (most recent call last):
  File "/data/suyao/tools/build/bin/DCC", line 11, in <module>
    load_entry_point('DCC==0.4.6', 'console_scripts', 'DCC')()
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 349, in main
  File "build/bdist.linux-x86_64/egg/DCC/circFilter.py", line 92, in filter_nonrep
  File "build/bdist.linux-x86_64/egg/DCC/circFilter.py", line 85, in read_rep_region
  File "/data/suyao/tools/build/lib/python2.7/site-packages/HTSeq-0.10.0-py2.7-linux-x86_64.egg/HTSeq/__init__.py", line 210, in __iter__
    strand, frame, attributeStr) = line.split("\t", 8)
ValueError: need more than 1 value to unpack
started circRNA detection from file _tmp_DCC/Chimeric.out.junction.HF3HYK
=> separating duplicates [_tmp_DCC/Chimeric.out.junction.HF3HYK]
Read 113121040.+.113119607.ST-E00251:429:HCMF5CCXY:2:2202:28229:35801 has more than 2 count.
Read 113121040.+.113119607.ST-E00251:429:HCMF5CCXY:2:2202:28229:35801 has more than 2 count.
Read 113121040.+.113119607.ST-E00251:429:HCMF5CCXY:2:2202:28229:35801 has more than 2 count.
=> locating small circRNAs [_tmp_DCC/Chimeric.out.junction.HF3HYK]
=> locating circRNAs (stranded mode) [_tmp_DCC/Chimeric.out.junction.HF3HYK]
=> merging circRNAs [_tmp_DCC/Chimeric.out.junction.HF3HYK]
=> sorting circRNAs (stranded mode) [_tmp_DCC/Chimeric.out.junction.HF3HYK]
finished circRNA detection from file _tmp_DCC/Chimeric.out.junction.HF3HYK
started circRNA detection from file _tmp_DCC/Chimeric.out.junction.L546H9
=> separating duplicates [_tmp_DCC/Chimeric.out.junction.L546H9]
Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:23043:55614 has more than 2 count.
Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:23043:55614 has more than 2 count.
Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:22901:55860 has more than 2 count.
Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:22901:55860 has more than 2 count.
Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:23043:55614 has more than 2 count.
Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:22901:55860 has more than 2 count.
=> locating small circRNAs [_tmp_DCC/Chimeric.out.junction.L546H9]
=> locating circRNAs (stranded mode) [_tmp_DCC/Chimeric.out.junction.L546H9]
=> merging circRNAs [_tmp_DCC/Chimeric.out.junction.L546H9]
=> sorting circRNAs (stranded mode) [_tmp_DCC/Chimeric.out.junction.L546H9]
finished circRNA detection from file _tmp_DCC/Chimeric.out.junction.L546H9
started circRNA detection from file _tmp_DCC/Chimeric.out.junction.14SVFY
=> separating duplicates [_tmp_DCC/Chimeric.out.junction.14SVFY]
Read 14270933.-.14280546.ST-E00251:429:HCMF5CCXY:2:1219:13129:19346 has more than 2 count.
Read 14270933.-.14280546.ST-E00251:429:HCMF5CCXY:2:1219:13129:19346 has more than 2 count.
Read 135848567.-.135851264.ST-E00251:429:HCMF5CCXY:2:1221:20222:8675 has more than 2 count.
Read 135848567.-.135851264.ST-E00251:429:HCMF5CCXY:2:1221:20222:8675 has more than 2 count.
Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
Read 98667021.-.98667505.ST-E00251:429:HCMF5CCXY:2:1206:13484:21087 has more than 2 count.
Read 98667021.-.98667505.ST-E00251:429:HCMF5CCXY:2:1206:13484:21087 has more than 2 count.
Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
Read 98667021.-.98667505.ST-E00251:429:HCMF5CCXY:2:1206:13484:21087 has more than 2 count.
Read 135848567.-.135851264.ST-E00251:429:HCMF5CCXY:2:1221:20222:8675 has more than 2 count.
Read 14270933.-.14280546.ST-E00251:429:HCMF5CCXY:2:1219:13129:19346 has more than 2 count.
=> locating small circRNAs [_tmp_DCC/Chimeric.out.junction.14SVFY]
=> locating circRNAs (stranded mode) [_tmp_DCC/Chimeric.out.junction.14SVFY]
=> merging circRNAs [_tmp_DCC/Chimeric.out.junction.14SVFY]
=> sorting circRNAs (stranded mode) [_tmp_DCC/Chimeric.out.junction.14SVFY]
finished circRNA detection from file _tmp_DCC/Chimeric.out.junction.14SVFY

Here is the log file from the output folder:

2018-08-06 19:02:47,705 DCC 0.4.6 started
2018-08-06 19:02:47,705 DCC command line: /data/suyao/tools/build/bin/DCC @/data/suyao/project/circRNA/output/DCC/samplesheet_test -mt1 @/data/suyao/project/circRNA/output/DCC/mate1_test -mt2 @/data/suyao/project/circRNA/output/DCC/mate2_test -D -R /data/suyao/data/UCSC/hg19/database/hg19_repeat.gtf -an /data/suyao/data/ensembl/grch37/release-92/Homo_sapiens.GRCh37.87.chr_patch_hapl_scaff_chr.gtf -k -Pi -F -M -Nr 5 3 -fg -G -A /data/suyao/data/UCSC/hg19/chromosomes/hg19.fa -O /data/suyao/project/circRNA/output/DCC/output_test
2018-08-06 19:02:47,705 Input file names have duplicates, add number suffix in input order to output files for distinction
2018-08-06 19:02:47,713 Starting to detect circRNAs
2018-08-06 19:02:47,713 Stranded data mode
2018-08-06 19:02:47,713 Please make sure that the read pairs have been mapped both, combined and on a per mate basis
2018-08-06 19:02:47,713 Collecting chimera information from mates-separate mapping
2018-08-06 19:03:20,618 started circRNA detection from file _tmp_DCC/Chimeric.out.junction.HF3HYK
2018-08-06 19:03:20,618 started circRNA detection from file _tmp_DCC/Chimeric.out.junction.L546H9
2018-08-06 19:04:34,465 Read 113121040.+.113119607.ST-E00251:429:HCMF5CCXY:2:2202:28229:35801 has more than 2 count.
2018-08-06 19:04:34,469 Read 113121040.+.113119607.ST-E00251:429:HCMF5CCXY:2:2202:28229:35801 has more than 2 count.
2018-08-06 19:05:24,760 Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:23043:55614 has more than 2 count.
2018-08-06 19:05:24,764 Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:23043:55614 has more than 2 count.
2018-08-06 19:05:24,777 Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:22901:55860 has more than 2 count.
2018-08-06 19:05:24,781 Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:22901:55860 has more than 2 count.
2018-08-06 19:10:16,099 Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:23043:55614 has more than 2 count.
2018-08-06 19:10:16,110 Read 56527625.-.56528213.ST-E00251:429:HCMF5CCXY:2:1213:22901:55860 has more than 2 count.
2018-08-06 19:11:09,119 Read 113121040.+.113119607.ST-E00251:429:HCMF5CCXY:2:2202:28229:35801 has more than 2 count.
2018-08-06 19:11:56,154 finished circRNA detection from file _tmp_DCC/Chimeric.out.junction.L546H9
2018-08-06 19:11:56,154 started circRNA detection from file _tmp_DCC/Chimeric.out.junction.14SVFY
2018-08-06 19:12:24,298 Read 14270933.-.14280546.ST-E00251:429:HCMF5CCXY:2:1219:13129:19346 has more than 2 count.
2018-08-06 19:12:24,303 Read 14270933.-.14280546.ST-E00251:429:HCMF5CCXY:2:1219:13129:19346 has more than 2 count.
2018-08-06 19:12:56,735 Read 135848567.-.135851264.ST-E00251:429:HCMF5CCXY:2:1221:20222:8675 has more than 2 count.
2018-08-06 19:12:56,740 Read 135848567.-.135851264.ST-E00251:429:HCMF5CCXY:2:1221:20222:8675 has more than 2 count.
2018-08-06 19:13:22,486 finished circRNA detection from file _tmp_DCC/Chimeric.out.junction.HF3HYK
2018-08-06 19:16:28,007 Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
2018-08-06 19:16:28,013 Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
2018-08-06 19:18:20,494 Read 98667021.-.98667505.ST-E00251:429:HCMF5CCXY:2:1206:13484:21087 has more than 2 count.
2018-08-06 19:18:20,499 Read 98667021.-.98667505.ST-E00251:429:HCMF5CCXY:2:1206:13484:21087 has more than 2 count.
2018-08-06 19:21:44,266 Read 30954186.-.30956927.ST-E00251:429:HCMF5CCXY:1:1221:30908:35133 has more than 2 count.
2018-08-06 19:21:57,792 Read 98667021.-.98667505.ST-E00251:429:HCMF5CCXY:2:1206:13484:21087 has more than 2 count.
2018-08-06 19:23:01,357 Read 135848567.-.135851264.ST-E00251:429:HCMF5CCXY:2:1221:20222:8675 has more than 2 count.
2018-08-06 19:24:37,163 Read 14270933.-.14280546.ST-E00251:429:HCMF5CCXY:2:1219:13129:19346 has more than 2 count.
2018-08-06 19:25:01,135 finished circRNA detection from file _tmp_DCC/Chimeric.out.junction.14SVFY
2018-08-06 19:25:01,135 Combining individual circRNA read counts
2018-08-06 19:25:16,451 Write in annotation
2018-08-06 19:25:16,451 Select gene features in Annotation file
2018-08-06 19:30:00,390 Filtering started
2018-08-06 19:30:00,390 Using files _tmp_DCC/tmp_circCount and _tmp_DCC/tmp_coordinates for filtering
2018-08-06 19:30:02,196 Filtering by read counts
2018-08-06 19:30:02,862 Filter by non repetitive region

tjakobi self-assigned this Aug 6, 2018

tjakobi (Contributor) commented Aug 6, 2018

Dear @bioysu,

thank you for your feedback! Could you upload the repetitive region file you specified on the command line? It seems that the HTSeq library, which parses the GTF file, has trouble with the file's syntax. I'd like to verify that the hg19_repeat.gtf file is valid.

Cheers,
Tobias
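
The traceback above ends with HTSeq unpacking the result of `line.split("\t", 8)`, and "need more than 1 value to unpack" means the split returned a single field, so the offending line contained no tab at all. A minimal sketch of a check that lists such lines is shown here. The function name `find_malformed_gtf_lines` and the decision to skip empty lines and `#` comment lines are assumptions made for illustration; they are not part of DCC or HTSeq.

```python
def find_malformed_gtf_lines(path, max_report=10):
    """Return (line_number, field_count, text) for lines without 9 tab-separated fields."""
    problems = []
    with open(path, "r") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            # Assumption: empty lines and '#' comment lines are harmless, so skip them.
            if text == "" or text.startswith("#"):
                continue
            # Same split as the one shown in the traceback.
            fields = text.split("\t", 8)
            if len(fields) != 9:
                problems.append((number, len(fields), text))
                if len(problems) >= max_report:
                    break
    return problems


# Example use, with the path taken from the command line in this thread:
# for number, count, text in find_malformed_gtf_lines(
#         "/data/suyao/data/UCSC/hg19/database/hg19_repeat.gtf"):
#     print("line %d: %d field(s): %r" % (number, count, text))
```

A reported line with a field count of 1 would match the error in the traceback: it might be space-separated instead of tab-separated, or it might be a partial line left by an interrupted download.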

bioysu (Author) commented Aug 7, 2018

hg19_repeat.gtf was downloaded from the UCSC Genome Browser.
Here is the head of hg19_repeat.gtf:
chr1 hg19_rmsk exon 16777161 16777470 2147.000000 + . gene_id "AluSp"; transcript_id "AluSp";
chr1 hg19_rmsk exon 25165801 25166089 2626.000000 - . gene_id "AluY"; transcript_id "AluY";
chr1 hg19_rmsk exon 33553607 33554646 626.000000 + . gene_id "L2b"; transcript_id "L2b";
chr1 hg19_rmsk exon 50330064 50332153 12545.000000 + . gene_id "L1PA10"; transcript_id "L1PA10";
chr1 hg19_rmsk exon 58720068 58720973 8050.000000 - . gene_id "L1PA2"; transcript_id "L1PA2";
chr1 hg19_rmsk exon 75496181 75498100 10586.000000 + . gene_id "L1MB7"; transcript_id "L1MB7";

tjakobi (Contributor) commented Aug 10, 2018

Dear @bioysu,

could you count the lines of the file and generate an md5sum? Using the same file, DCC does not produce the error here. Maybe the download was incomplete?

Cheers,
Tobias
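
For reference, the two values requested above can also be computed in one pass with the Python standard library. This is a sketch only; the helper name `line_count_and_md5` is made up for illustration. It counts newline characters and hashes the raw bytes, which should correspond to what `wc -l` and `md5sum` report.

```python
import hashlib


def line_count_and_md5(path, chunk_size=1 << 20):
    """Return (number_of_newlines, md5_hex_digest) for the file at path."""
    digest = hashlib.md5()
    newlines = 0
    with open(path, "rb") as handle:
        # Read in fixed-size binary chunks so large GTF files need little memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            newlines += chunk.count(b"\n")
    return newlines, digest.hexdigest()


# Example use, with the path taken from the command line in this thread:
# print(line_count_and_md5("/data/suyao/data/UCSC/hg19/database/hg19_repeat.gtf"))
```

If either value differs between the two copies of hg19_repeat.gtf, the files are not identical, which would be consistent with an incomplete download.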
