Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gff2bed doesn't place gene id values into 4th BED column #244

Closed
mmokrejs opened this issue Oct 2, 2020 · 2 comments
Closed

gff2bed doesn't place gene id values into 4th BED column #244

mmokrejs opened this issue Oct 2, 2020 · 2 comments

Comments

@mmokrejs
Copy link

mmokrejs commented Oct 2, 2020

Hi Alex,
I tried to get around the need to modify the gtf files from gencode by using their gff3 files directly which seem to be more strictly defined. But I am puzzled by differences between gtf2bed-converted gencode v21 versus gff2bed-converted gencode v35 outputs:

for n in 21; do echo $n; gzip -dc ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_"$n"/gencode.v"$n".annotation.gtf.gz | gtf2bed - |  grep exon > ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_"$n"/gencode.v"$n".annotation.exons.bed; done
21
$
$ for n in 21 28 29 35; do echo $n; gzip -dc ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_"$n"/gencode.v"$n".annotation.gtf.gz | gff2bed - |  grep exon > ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_"$n"/gencode.v"$n".annotation.exons.bed2; done
21
28
29
35
$
$ head ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_21/gencode.v21.annotation.exons.bed ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_21/gencode.v21.annotation.exons.bed2 
==> ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_21/gencode.v21.annotation.exons.bed <==
chr1	11868	12227	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	12009	12057	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	12178	12227	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	12612	12697	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 3; exon_id "ENSE00001758273.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	12612	12721	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	12974	13052	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 4; exon_id "ENSE00001799933.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	13220	13374	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 5; exon_id "ENSE00001746346.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	13220	14409	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	13452	13670	ENSG00000223972.5	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 6; exon_id "ENSE00001863096.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	14403	14501	ENSG00000227232.5	.	-	HAVANA	exon	.	gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "WASH7P-001"; exon_number 11; exon_id "ENSE00001843071.1"; level 2; ont "PGO:0000005"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";

==> ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_21/gencode.v21.annotation.exons.bed2 <==
chr1	11868	12227	.	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	12009	12057	.	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	12178	12227	.	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	12612	12697	.	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 3; exon_id "ENSE00001758273.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	12612	12721	.	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	12974	13052	.	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 4; exon_id "ENSE00001799933.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	13220	13374	.	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 5; exon_id "ENSE00001746346.2"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	13220	14409	.	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	13452	13670	.	.	+	HAVANA	exon	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-001"; exon_number 6; exon_id "ENSE00001863096.1"; level 2; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1	14403	14501	.	.	-	HAVANA	exon	.	gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "WASH7P-001"; exon_number 11; exon_id "ENSE00001843071.1"; level 2; ont "PGO:0000005"; tag "basic"; transcript_support_level "NA"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
$

Is there a way to get gene names into a dedicated BED column? Could I instruct gff2bed to extract gene_name values into a dedicated column?

The runtime help for convert2bed and actually gff2bed doesn't help me to understand what fields it acts upon and which not. Would you please clarify the runtime help text a bit more?

Thank you,

@alexpreynolds
Copy link
Collaborator

The gtf2bed script (or convert2bed --input=gtf cannot or should not be used for converting GFF3, or vice versa, as the formats are different, especially for attributes and other metadata.

In v2.4.40 (in development), the gtf2bed and gff2bed conversion scripts now support copying a subset of reserved attributes to the ID field by keyname.

By default, gtf2bed and gff2bed will parse the attributes string and copy the gene_id value to the output ID field (i.e., fourth column).

The --attribute-key option can be used to copy over gene_name, transcript_name, and several other attributes.

More information is available via convert2bed --help-gtf, convert2bed --help-gff, gtf2bed --help,gff2bed --help, or the online documentation.

Relevant commit: 090a7ae

@deliaBlue
Copy link

Hi Alex,
I tried to convert a GFF3 file into a BED one (using convert2bed and gff2bed) but the fourth column appears to only have . instead of reporting the ID. My GFF3 file looks as follows:

19	.	miRNA_primary_transcript	39	122	.	+	.	ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1
19	.	miRNA	52	74	.	+	.	ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-5p;Derives_from=MI0003140
19	.	miRNA	89	110	.	+	.	ID=MIMAT0002823;Alias=MIMAT0002823;Name=hsa-miR-512-3p;Derives_from=MI0003140

The output BED file using a previous version (2.4.35) gives contains the ID in the 4th column but the 2.4.41 does not; this is how the output looks.

19	38	122	.	.	+	.	miRNA_primary_transcript	.	ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1
19	51	74	.	.	+	.	miRNA	.	ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-5p;Derives_from=MI0003140
19	88	110	.	.	+	.	miRNA	.	ID=MIMAT0002823;Alias=MIMAT0002823;Name=hsa-miR-512-3p;Derives_from=MI0003140

The command used is:

convert2bed -i gff < mirna_annotations.gff3 > mirna_annotations.bed 

I tried to play with the option --attribute-key by giving it different values but none of them provides the ID on the fourth column.

Is there a way to actually have the ID on the fourth column?

Thank you,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants