Scripts to convert refseq gff to tsv and convert that tsv into a g2t file

eastgenomics/exon_file_and_g2t_from_new_refseq_gff

Create exon file and g2t file from new Refseq GFF

First script, gff2tsv

Script to convert a RefSeq GFF file to a TSV file of CDS regions.

Info on the file:

  • Coordinates are converted from 1-based to 0-based format.
  • The exon number is inferred from the CDS and exon coordinates (an overlap of any size counts).
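
The two points above can be illustrated with a minimal sketch. It assumes the 0-based output is half-open (BED-style), so only the start is shifted, and that exons are numbered by their position in a list ordered in transcript direction. The function names and coordinates are hypothetical and are not taken from gff2tsv.py.

```python
def to_zero_based(start, end):
    """Convert a 1-based inclusive GFF interval to 0-based half-open (assumed convention)."""
    return start - 1, end


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two 0-based half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def infer_exon_number(cds, exons):
    """Return the 1-based index of the first exon overlapping the CDS, or None.

    `exons` is assumed to be ordered in transcript direction.
    """
    for number, exon in enumerate(exons, start=1):
        if overlaps(cds[0], cds[1], exon[0], exon[1]):
            return number
    return None


# Made-up coordinates, for illustration only
exons = [to_zero_based(100, 200), to_zero_based(300, 400)]
cds = to_zero_based(350, 400)
print(infer_exon_number(cds, exons))
```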

CDS not gathered:

  • NT_ --> Genomic Contig or scaffold, clone-based or WGS
  • NW_ --> Genomic Contig or scaffold, primarily WGS
  • Transcripts with duplicated exons are removed entirely, except those with exons present on both the X and Y chromosomes (e.g. NM_000451.4), for which the X copy is kept:
    • NM_001134939.1
    • NM_001172437.2
    • NM_001184961.1
    • NM_001301020.1
    • NM_001301302.1
    • NM_001301371.1
    • NM_002537.3
    • NM_004152.3
    • NM_015068.3
    • NM_016178.2
  • Mitochondrial genes:
    • NC_012920.1
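
As a rough sketch of the exclusions above, the filter below drops a CDS based on its sequence accession or transcript ID. It assumes the listed NM_ transcripts are the ones removed for duplicated exons; the function and constant names are hypothetical, and the X/Y special case is not handled here.

```python
# Accession prefixes for contigs and scaffolds that are skipped
EXCLUDED_PREFIXES = ("NT_", "NW_")

# Mitochondrial genome accession
EXCLUDED_SEQIDS = {"NC_012920.1"}

# Transcripts assumed to be removed because of duplicated exons
EXCLUDED_TRANSCRIPTS = {
    "NM_001134939.1", "NM_001172437.2", "NM_001184961.1", "NM_001301020.1",
    "NM_001301302.1", "NM_001301371.1", "NM_002537.3", "NM_004152.3",
    "NM_015068.3", "NM_016178.2",
}


def keep_cds(seqid, transcript):
    """Return True if a CDS on `seqid` for `transcript` should be gathered."""
    if seqid.startswith(EXCLUDED_PREFIXES):
        return False
    if seqid in EXCLUDED_SEQIDS:
        return False
    return transcript not in EXCLUDED_TRANSCRIPTS
```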

How to run

Tested only with RefSeq GFF files. The genome build must be given as '37' or '38'.

python gff2tsv.py ${gff_file} -f ${flank} -o ${output_name} -b ${genome_build}

Outputs

TSV file with the CDS

Format:

chrom   start   end   HGNC:ID   Refseq_transcript_id   exon_nb
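
For downstream use, the output could be read along these lines. This is a sketch that assumes the file is tab-separated with no header row and that exon_nb is an integer; the record type, the function name and the sample coordinates are made up for illustration.

```python
import csv
import io
from typing import NamedTuple


class CdsRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    hgnc_id: str
    transcript: str
    exon_nb: int


def read_exon_tsv(handle):
    """Parse six-column, tab-separated rows into CdsRecord tuples."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        chrom, start, end, hgnc_id, transcript, exon_nb = row
        records.append(
            CdsRecord(chrom, int(start), int(end), hgnc_id, transcript, int(exon_nb))
        )
    return records


# Made-up row; with a real file, pass open(path, newline="") instead
sample = io.StringIO("1\t1000\t1200\tHGNC:14825\tNM_001005484.2\t1\n")
records = read_exon_tsv(sample)
print(records[0].transcript, records[0].end - records[0].start)
```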

Second script, refseq_g2t

Warning

Release 1.2.0 updated the first script (gff2tsv) but not the second (refseq_g2t). This version of the refseq_g2t script may therefore not work with the new version of gff2tsv.

Please use with caution. Testing and validation are needed before using it with gff2tsv v1.2.0.

Script to generate a g2t file from the exon file produced by gff2tsv and to check that the g2t derived from the new RefSeq GFF is correct.

The HGMD file was generated by running the following command against a MySQL database loaded from an HGMD dump (project-Fz4Q15Q42Z9YjYk110b3vGYQ:file-Fz4Q46842Z9z2Q6ZBjy7jVPY):

sudo mysql -e "select markname.hgncID, concat(refcore, '.', refversion) from gene2refseq join markname on markname.gene_id=gene2refseq.hgmdID" hgmd_2020_3 | grep -v "concat(" > hgmd_transcripts.txt
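
The resulting hgmd_transcripts.txt should hold two tab-separated columns per line (the HGNC ID and the versioned transcript), since mysql writes tab-separated output when piped. A minimal sketch of loading it is shown below; the function name and the sample values are hypothetical, and the exact form of the HGNC ID column is an assumption.

```python
def parse_hgmd_transcripts(lines):
    """Map each HGNC ID to the set of versioned transcripts listed for it."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        hgnc_id, transcript = line.split("\t")
        mapping.setdefault(hgnc_id, set()).add(transcript)
    return mapping


# Made-up lines; with a real file, pass the open file handle instead
sample_lines = ["1\tNM_000001.1\n", "1\tNM_000002.3\n", "2\tNM_000003.2\n"]
hgmd = parse_hgmd_transcripts(sample_lines)
print(sorted(hgmd["1"]))
```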

The Haemonc file was provided to me by Aisha Dahir and contains a single column with transcripts without versions (https://github.com/eastgenomics/Haemonc_requests/tree/main/URA-97)
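
Because the Haemonc transcripts carry no version, comparing them with versioned RefSeq IDs requires dropping the version suffix first. The sketch below shows one way to do that; the function names and sample IDs are hypothetical and may differ from what refseq_g2t.py does.

```python
def strip_version(transcript):
    """Return the accession without its version, e.g. NM_000001.1 -> NM_000001."""
    return transcript.split(".", 1)[0]


def in_haemonc(transcript, haemonc_transcripts):
    """Return True if the versioned transcript matches a versionless Haemonc entry."""
    return strip_version(transcript) in haemonc_transcripts


# Made-up versionless transcripts, one per line as in the Haemonc file
haemonc = {"NM_000001", "NM_000002"}
print(in_haemonc("NM_000001.4", haemonc))
```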

How to run

python refseq_g2t.py -n ${nirvana_g2t} -r ${refseq_exon_file} [ check ${hgmd_file} ${haemonc_file} ]

Use the optional check argument to run the following checks:

  • nirvana
  • hgmd
  • haemonc

Outputs

G2T-formatted file, e.g.:

HGNC:14825  NM_001005484.2  not_clinical_transcript not_canonical
HGNC:31275  NM_001005221.2  not_clinical_transcript not_canonical
HGNC:15079  NM_001005277.1  not_clinical_transcript not_canonical
HGNC:28706  NM_001385641.1  not_clinical_transcript not_canonical
HGNC:28706  NM_001385640.1  not_clinical_transcript not_canonical
HGNC:28706  NM_152486.4  not_clinical_transcript not_canonical
HGNC:24517  NM_015658.4  not_clinical_transcript not_canonical
HGNC:24023  NM_198317.3  not_clinical_transcript not_canonical
HGNC:25284  NM_032129.3  not_clinical_transcript not_canonical
HGNC:25284  NM_001367552.1  not_clinical_transcript not_canonical
HGNC:25284  NM_001160184.2  not_clinical_transcript not_canonical
HGNC:28208  NM_001291366.2  not_clinical_transcript not_canonical
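
A file in this format can be grouped by gene with a few lines of Python. This is only a sketch: it assumes four whitespace-separated columns per line, and the function name and the labels used for the last two columns are hypothetical.

```python
from collections import defaultdict


def parse_g2t(lines):
    """Group (transcript, clinical, canonical) tuples by HGNC ID."""
    by_gene = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        hgnc_id, transcript, clinical, canonical = fields
        by_gene[hgnc_id].append((transcript, clinical, canonical))
    return dict(by_gene)


# Lines taken from the example output above
example = [
    "HGNC:28706\tNM_001385641.1\tnot_clinical_transcript\tnot_canonical",
    "HGNC:28706\tNM_001385640.1\tnot_clinical_transcript\tnot_canonical",
    "HGNC:24517\tNM_015658.4\tnot_clinical_transcript\tnot_canonical",
]
g2t = parse_g2t(example)
print(len(g2t["HGNC:28706"]))
```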
