Create exon file and g2t file from new Refseq GFF

First script, gff2tsv

Script to convert refseq GFF to TSV using CDS.

Info on the file:

Coordinates are converted from 1-based to 0-based format.
The exon number is inferred from the CDS and exon coordinates (an overlap of any size counts).
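
As a rough illustration of the two points above, the sketch below converts a 1-based, inclusive GFF interval to a 0-based, half-open one and numbers a CDS by the first exon it overlaps. The function names, the half-open convention and the coordinates are assumptions for illustration, not taken from gff2tsv.py.

```python
def to_zero_based(start, end):
    """Convert a 1-based, inclusive interval to 0-based, half-open (BED-like).

    Assumption: only the start shifts; gff2tsv.py may use another convention.
    """
    return start - 1, end


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def infer_exon_number(cds, exons):
    """Return the 1-based index of the first exon overlapping the CDS, or None.

    `exons` is assumed to be listed in transcript order.
    """
    for number, (exon_start, exon_end) in enumerate(exons, start=1):
        if overlaps(cds[0], cds[1], exon_start, exon_end):
            return number
    return None


# Hypothetical 1-based GFF coordinates for one transcript
exons = [to_zero_based(100, 200), to_zero_based(300, 400), to_zero_based(500, 600)]
cds = to_zero_based(350, 400)
exon_number = infer_exon_number(cds, exons)
```

Here the CDS falls inside the second exon, so `exon_number` is 2; a CDS that overlaps no exon yields `None`.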

CDS not gathered:

NT_ --> Genomic contig or scaffold, clone-based or WGS
NW_ --> Genomic contig or scaffold, primarily WGS
Duplicated exons are removed entirely, except for exons present on both the X and Y chromosomes (e.g. NM_000451.4), in which case the X copy is kept:
- NM_001134939.1
- NM_001172437.2
- NM_001184961.1
- NM_001301020.1
- NM_001301302.1
- NM_001301371.1
- NM_002537.3
- NM_004152.3
- NM_015068.3
- NM_016178.2
Mitochondrial genes:
- NC_012920.1
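
The prefix-based exclusions listed above could be expressed as a simple accession filter. This is a sketch only: the constant and function names are hypothetical, it assumes the mitochondrial accession is skipped as the list heading suggests, and it does not cover the duplicated-exon handling.

```python
# Accession prefixes for contigs/scaffolds whose CDS are skipped (from the list above)
SKIPPED_PREFIXES = ("NT_", "NW_")
# Mitochondrial genome accession (assumed to be skipped as well)
SKIPPED_ACCESSIONS = {"NC_012920.1"}


def is_gathered(seqid):
    """Return True if CDS on this sequence accession would be kept."""
    if seqid.startswith(SKIPPED_PREFIXES):
        return False
    return seqid not in SKIPPED_ACCESSIONS


# Example accessions, chosen only to exercise each rule
seqids = ["NC_000001.10", "NT_113878.1", "NW_003315905.1", "NC_012920.1"]
kept = [s for s in seqids if is_gathered(s)]
```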

How to run

Tested only with a Refseq gff file. Genome build should be input as '37' or '38'.

python gff2tsv.py ${gff_file} -f ${flank} -o ${output_name} -b ${genome_build}

Outputs

TSV file with the CDS

Format:

chrom   start   end HGNC:ID Refseq_transcript_id    exon_nb
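
A minimal sketch of reading this output with the standard library, assuming the file is tab-separated, has no header row, and follows the column order above. The field names and the coordinates in the sample row are made up for illustration.

```python
import csv
import io

# Hypothetical field names matching the column order shown above
FIELDS = ["chrom", "start", "end", "hgnc_id", "transcript", "exon_nb"]


def read_exons(handle):
    """Yield one dict per row, with coordinates and exon number as ints."""
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        row["exon_nb"] = int(row["exon_nb"])
        yield row


# Made-up row, not real output
sample = io.StringIO("1\t100\t200\tHGNC:14825\tNM_001005484.2\t1\n")
rows = list(read_exons(sample))
```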

Second script, refseq_g2t

Warning

Release 1.2.0 updated the first script (gff2tsv) but not the second (refseq_g2t). This version of refseq_g2t may therefore not work with the new version of gff2tsv.

Please use with caution. Testing and validation are needed before using it with gff2tsv v1.2.0.

Script to check that the g2t derived from the new RefSeq GFF is correct, and to generate a g2t from the exon file produced by gff2tsv.

The HGMD file was generated by running the following command against a MySQL database loaded from an HGMD dump (project-Fz4Q15Q42Z9YjYk110b3vGYQ:file-Fz4Q46842Z9z2Q6ZBjy7jVPY):

sudo mysql -e "select markname.hgncID, concat(refcore, '.', refversion) from gene2refseq join markname on markname.gene_id=gene2refseq.hgmdID" hgmd_2020_3 | grep -v "concat(" > hgmd_transcripts.txt

The Haemonc file was provided to me by Aisha Dahir and contains a single column with transcripts without versions (https://github.com/eastgenomics/Haemonc_requests/tree/main/URA-97)
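
Because the Haemonc list has no version suffixes, comparing it with versioned RefSeq IDs requires stripping the version first. The sketch below shows one way to do that; the helper name is hypothetical, the IDs are illustrative, and this is not necessarily how refseq_g2t.py does it.

```python
def strip_version(transcript):
    """Drop the '.N' version suffix from a RefSeq transcript ID, if present."""
    return transcript.split(".", 1)[0]


# Illustrative IDs only, not the real Haemonc list
haemonc = {"NM_152486", "NM_015658"}
refseq = ["NM_152486.4", "NM_015658.4", "NM_198317.3"]
matched = [t for t in refseq if strip_version(t) in haemonc]
```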

How to run

python refseq_g2t.py -n ${nirvana_g2t} -r ${refseq_exon_file} [ check ${hgmd_file} ${haemonc_file} ]

Use the optional check argument to run checks against:

nirvana
hgmd
haemonc

Outputs

G2T-formatted file, e.g.:

HGNC:14825  NM_001005484.2  not_clinical_transcript not_canonical
HGNC:31275  NM_001005221.2  not_clinical_transcript not_canonical
HGNC:15079  NM_001005277.1  not_clinical_transcript not_canonical
HGNC:28706  NM_001385641.1  not_clinical_transcript not_canonical
HGNC:28706  NM_001385640.1  not_clinical_transcript not_canonical
HGNC:28706  NM_152486.4 not_clinical_transcript not_canonical
HGNC:24517  NM_015658.4 not_clinical_transcript not_canonical
HGNC:24023  NM_198317.3 not_clinical_transcript not_canonical
HGNC:25284  NM_032129.3 not_clinical_transcript not_canonical
HGNC:25284  NM_001367552.1  not_clinical_transcript not_canonical
HGNC:25284  NM_001160184.2  not_clinical_transcript not_canonical
HGNC:28208  NM_001291366.2  not_clinical_transcript not_canonical
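
A file in this format can be loaded into a gene-to-transcripts mapping as sketched below, assuming four whitespace-separated columns in the order shown. The function name is hypothetical.

```python
from collections import defaultdict


def load_g2t(lines):
    """Map each HGNC ID to a list of (transcript, clinical status, canonical status)."""
    genes = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        hgnc_id, transcript, clinical, canonical = line.split()
        genes[hgnc_id].append((transcript, clinical, canonical))
    return dict(genes)


# Rows taken from the example output above
example = [
    "HGNC:28706\tNM_001385641.1\tnot_clinical_transcript\tnot_canonical",
    "HGNC:28706\tNM_152486.4\tnot_clinical_transcript\tnot_canonical",
    "HGNC:24517\tNM_015658.4\tnot_clinical_transcript\tnot_canonical",
]
g2t = load_g2t(example)
```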
