update of genes.refGene files #26

adelpozomt · 2016-06-28T11:48:06Z

I need to use an updated version of RefSeq. Is there a script available to download the current version of the file 'genes.refGene', or should I build it by hand? Thank you. Angela

lacek · 2017-08-29T09:33:38Z

The format of genes.refGene is almost the same as the refGene table from UCSC (hg19 schema, hg19 dump). I bet that table is the data source, except that genes.refGene has versions in the accession names and UCSC's table does not, e.g. NR_046018.2 vs. NR_046018.

UCSC provides another table, gbCdnaInfo, with the accession name and version (schema, dump). You can load the two tables into a MySQL database and join them to get the required data, e.g.:

SELECT r.bin,
       CONCAT(r.name, '.', g.version) AS name,
       r.chrom,
       r.strand,
       r.txStart,
       r.txEnd,
       r.cdsStart,
       r.cdsEnd,
       r.exonCount,
       r.exonStarts,
       r.exonEnds,
       r.score,
       r.name2,
       r.cdsStartStat,
       r.cdsEndStat,
       r.exonFrames
INTO OUTFILE '/tmp/genes.refGene'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
FROM refGene r, gbCdnaInfo g
WHERE r.name = g.acc;
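If loading the dumps into MySQL is not an option, the same join can be approximated in plain Python. This is only a sketch under assumptions: it expects tab-separated dump rows where refGene has the accession in its second column and gbCdnaInfo has acc and version in its second and third columns (check the schema pages before relying on that), and the function name add_versions is made up for illustration.

```python
def add_versions(refgene_lines, gbcdnainfo_lines):
    """Yield refGene rows with the accession rewritten as accession.version.

    Mirrors the inner join in the SQL above: rows whose accession has no
    match in gbCdnaInfo are dropped.
    """
    # Assumed gbCdnaInfo layout: id, acc, version, ... (tab-separated).
    versions = {}
    for line in gbcdnainfo_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            versions[fields[1]] = fields[2]

    for line in refgene_lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed refGene layout: bin, name, chrom, ... (tab-separated).
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        version = versions.get(fields[1])
        if version is None:
            continue  # no version known for this accession
        fields[1] = f"{fields[1]}.{version}"
        yield "\t".join(fields)
```

Both arguments can be any iterable of text lines, for example files opened with gzip.open(path, "rt"). The gbCdnaInfo dump is large, so filtering it down to the accessions you need before building the dictionary may be worthwhile.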

If you have difficulty setting up a database, you can query UCSC's public database instance directly instead, as the resulting data is only around 20 MB (for now):

mysql -ugenomep --password=password -hgenome-mysql.cse.ucsc.edu -ACD hg19 -BNe "SELECT r.bin,
       CONCAT(r.name, '.', g.version) AS name,
       r.chrom,
       r.strand,
       r.txStart,
       r.txEnd,
       r.cdsStart,
       r.cdsEnd,
       r.exonCount,
       r.exonStarts,
       r.exonEnds,
       r.score,
       r.name2,
       r.cdsStartStat,
       r.cdsEndStat,
       r.exonFrames
FROM hg19.refGene r, hgFixed.gbCdnaInfo g
WHERE r.name = g.acc" > genes.refGene

As for LRG_RefSeqGene, the latest file is simply available from NCBI's FTP: ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/LRG_RefSeqGene

anopperl · 2017-09-04T11:44:37Z

Is this "genes.refGene" file for hg19 or hg18?
I want this file for both hg19 and hg18.

lacek · 2017-09-05T02:02:26Z

@anopperl The gbCdnaInfo table is the same for all assemblies, but the refGene links posted above were for hg19 only. For hg18, simply change hg19 to hg18 in the links; the same goes for hg38.

On the other hand, if you prefer querying the public UCSC database directly, simply replace every hg19 with hg18 (or hg38), e.g.:

mysql -ugenomep --password=password -hgenome-mysql.cse.ucsc.edu -ACD hg18 -BNe "SELECT r.bin,
       CONCAT(r.name, '.', g.version) AS name,
       r.chrom,
       r.strand,
       r.txStart,
       r.txEnd,
       r.cdsStart,
       r.cdsEnd,
       r.exonCount,
       r.exonStarts,
       r.exonEnds,
       r.score,
       r.name2,
       r.cdsStartStat,
       r.cdsEndStat,
       r.exonFrames
FROM hg18.refGene r, hgFixed.gbCdnaInfo g
WHERE r.name = g.acc" > genes.refGene
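Since only the assembly name changes between these commands, the substitution can be scripted. The following is a rough sketch that builds the same query and mysql argument list for any assembly; the helper names build_query and build_command are made up here, and the host and credentials are simply the ones quoted in this thread.

```python
import re

HOST = "genome-mysql.cse.ucsc.edu"  # host as quoted in this thread

COLUMNS = [
    "r.bin", "CONCAT(r.name, '.', g.version) AS name", "r.chrom", "r.strand",
    "r.txStart", "r.txEnd", "r.cdsStart", "r.cdsEnd", "r.exonCount",
    "r.exonStarts", "r.exonEnds", "r.score", "r.name2",
    "r.cdsStartStat", "r.cdsEndStat", "r.exonFrames",
]


def build_query(assembly):
    """Return the join query for one assembly (e.g. 'hg18', 'hg19', 'hg38')."""
    # The assembly is spliced into the SQL text, so allow only plain names.
    if not re.fullmatch(r"[A-Za-z0-9]+", assembly):
        raise ValueError(f"unexpected assembly name: {assembly!r}")
    return (
        "SELECT " + ", ".join(COLUMNS)
        + f" FROM {assembly}.refGene r, hgFixed.gbCdnaInfo g"
        + " WHERE r.name = g.acc"
    )


def build_command(assembly):
    """Return the mysql argument list equivalent to the command above."""
    return [
        "mysql", "-ugenomep", "--password=password", f"-h{HOST}",
        "-A", "-C", "-D", assembly, "-B", "-N", "-e", build_query(assembly),
    ]
```

The list returned by build_command can be passed to subprocess.run with stdout pointed at an open output file, which avoids shell quoting altogether.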

Finally, you can concatenate genes.refGene for different assemblies into a single file.

anopperl · 2017-09-06T11:21:20Z

Thanks, lacek.
I want to get the coordinates of "NM_000352.3:c.215-10A>G", so I used the hgvs code (https://github.com/counsyl/hgvs).
I generated the transcript files for hg19 and hg18 with the MySQL command you gave above,
but the transcript file (genes.refGene) has NM_000352.4 and not NM_000352.3.

lacek · 2017-09-28T02:15:02Z

UCSC doesn't keep track of all versions of an accession, so we only get the latest version of each transcript from their database.
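To see which versions of an accession a given genes.refGene file actually contains, something like the following can help. It is a minimal sketch that assumes the versioned accession sits in the second tab-separated column, as in the query output above; the function name available_versions is hypothetical.

```python
def available_versions(refgene_lines, accession):
    """Return the sorted version numbers found for an unversioned accession."""
    found = set()
    for line in refgene_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        # Split 'NM_000352.4' into base 'NM_000352' and version '4'.
        base, sep, version = fields[1].rpartition(".")
        if sep and base == accession and version.isdigit():
            found.add(int(version))
    return sorted(found)
```

Calling it with an open file and "NM_000352" shows at a glance whether version 3 is present or only version 4.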

This particular transcript, NM_000352.3, is in the original genes.refGene file in this repo. You may combine that file with the one you've created, although some transcript versions would still be missing from the combined file.
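Combining the two files could look roughly like this. The sketch keeps the first row seen for each transcript placement and drops later duplicates; using the name, chrom, strand, txStart and txEnd columns as the key is a choice made for this illustration, and merge_refgene is a made-up name.

```python
def merge_refgene(*sources):
    """Merge refGene-style rows from several sources, first occurrence wins.

    Each source is an iterable of tab-separated lines. Rows are keyed on
    (name, chrom, strand, txStart, txEnd), i.e. columns 2 to 6.
    """
    seen = set()
    merged = []
    for lines in sources:
        for line in lines:
            row = line.rstrip("\n")
            fields = row.split("\t")
            if len(fields) < 6:
                continue  # skip blank or malformed rows
            key = tuple(fields[1:6])
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged
```

Passing the freshly generated file first and the repo's original file second keeps the newer rows and adds older versions such as NM_000352.3 from the original.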

You will need a data source with all versions of each transcript. One I can think of is UTA. You would need to either write a query that produces data in the genes.refGene format, or write a Python adapter function (get_transcript) that fetches the required transcript data from the UTA database.
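One way to structure such an adapter is to chain several lookups, so that the local genes.refGene data is tried first and a slower source only afterwards. The sketch below shows just the chaining; it assumes each lookup is a callable that takes a transcript name and returns a transcript object or None, and it does not implement the UTA query itself. chain_get_transcript is a hypothetical helper, not part of any library.

```python
def chain_get_transcript(*lookups):
    """Build a get_transcript function that tries each lookup in order.

    Each lookup takes a transcript name and returns a transcript object,
    or None when it does not know the transcript.
    """
    def get_transcript(name):
        for lookup in lookups:
            transcript = lookup(name)
            if transcript is not None:
                return transcript
        return None  # no source knows this transcript

    return get_transcript
```

For example, chain_get_transcript(transcripts.get, fetch_from_uta) would consult a dictionary loaded from genes.refGene first, where fetch_from_uta stands for a UTA-backed function you would still have to write.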

Alternatively, you may want to take a look at hgvs, an HGVS parser that is based on UTA. It seems more robust but harder to use.


davmlaw · 2022-02-03T06:56:16Z

I've made a Python package that provides ~800k transcripts (both RefSeq and Ensembl) for PyHGVS

https://github.com/SACGF/cdot

You can either download a JSON.gz file, or use a REST service. To use it:

import pyhgvs
from cdot.pyhgvs.pyhgvs_transcript import JSONPyHGVSTranscriptFactory, RESTPyHGVSTranscriptFactory

# Look up transcripts through the REST service
factory = RESTPyHGVSTranscriptFactory()
# factory = JSONPyHGVSTranscriptFactory(["./cdot-0.2.1.refseq.grch38.json.gz"])  # Uses local JSON file
# hgvs_c (an HGVS string) and genome (a reference genome object) must be defined by the caller
pyhgvs.parse_hgvs_name(hgvs_c, genome, get_transcript=factory.get_transcript_grch37)

simzep · 2023-03-30T14:47:50Z

Thank you all for the info on how to generate refGene files. Is there a way to get an hs1 refGene file? Using the solution from @lacek with 'hs1' does not work.

lacek · 2023-03-31T02:19:37Z

@simzep By hs1 are you referring to HCLS1?

Anyhow, this library is unmaintained and comes with a number of problems (e.g. bugs in parsing dup and ins, no checking of reference bases, no support for inversions or mitochondrial variants, etc.). You're better off with a similar library (e.g. https://github.com/biocommons/hgvs) or tool (e.g. https://asia.ensembl.org/info/docs/tools/vep/recoder/index.html) for parsing HGVS.

simzep · 2023-03-31T18:18:29Z

@lacek Thanks for the reply.
I'm trying to create or find a refGene file for the T2T-CHM13 (hs1) genome project found here, similar to the one created with your SQL command:
https://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/
I hope that makes sense; this is my first project involving bioinformatics.

lacek · 2023-04-01T16:13:22Z

@simzep I wasn't aware that you were referring to the CHM13 reference genome.

For UCSC data, you can search on its Table Browser page. E.g. for refGene of hg38: https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_track=refSeqComposite&hgta_table=refGene&hgta_doSchema=data%20format%20description

At the moment, I cannot find a refGene table for hs1 on UCSC. The closest one may be https://genome.ucsc.edu/cgi-bin/hgTables?db=hs1&hgta_track=hub_3671779_refSeqComposite&hgta_table=hub_3671779_ncbiRefSeq&hgta_doSchema=data%20format%20description. It is in BigBed format, though, and requires conversion if you find it appropriate for your use case.
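As a rough idea of what that conversion involves: once the BigBed has been dumped to text BED (for instance with UCSC's bigBedToBed utility), the block offsets have to be turned into absolute exon coordinates. The sketch below assumes the first 12 columns follow the standard BED12 layout and fills the fields BED12 does not carry (bin, name2, cdsStartStat, cdsEndStat, exonFrames) with placeholders, so treat the output as a starting point only. bed12_to_refgene_row is a hypothetical name.

```python
def bed12_to_refgene_row(bed_line, bin_value="0"):
    """Convert one BED12 line into a 16-column refGene-style row."""
    f = bed_line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("expected at least 12 BED columns")
    chrom, name, score, strand = f[0], f[3], f[4], f[5]
    tx_start, tx_end = int(f[1]), int(f[2])
    block_count = int(f[9])
    sizes = [int(x) for x in f[10].split(",") if x]
    offsets = [int(x) for x in f[11].split(",") if x]
    if len(sizes) != block_count or len(offsets) != block_count:
        raise ValueError("block count does not match block lists")

    # BED block starts are relative to chromStart; refGene wants absolute.
    exon_starts = [tx_start + offset for offset in offsets]
    exon_ends = [start + size for start, size in zip(exon_starts, sizes)]

    def as_list(values):
        # refGene lists are comma-terminated, e.g. '100,300,'
        return "".join(f"{value}," for value in values)

    return "\t".join([
        bin_value,                      # placeholder: bin is not recomputed
        name, chrom, strand, str(tx_start), str(tx_end),
        f[6], f[7],                     # thickStart/thickEnd used as CDS bounds
        str(block_count), as_list(exon_starts), as_list(exon_ends),
        score,
        name,                           # placeholder for name2 (gene symbol)
        "unk", "unk",                   # placeholder cdsStartStat/cdsEndStat
        as_list([-1] * block_count),    # placeholder exonFrames
    ])
```

If the dumped file carries extra columns beyond the first 12 (a gene symbol, for example), those would be a better source for name2 than the placeholder used here.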

davmlaw · 2023-04-03T05:00:03Z

Hi, I added T2T support to cdot, so you should be able to convert to/from HGVS (using the Biocommons HGVS) reasonably easily. See the example code here:

https://github.com/SACGF/cdot/wiki/Biocommons-T2T-CHM13v2.0-example-code

lacek mentioned this issue Aug 29, 2017

how to create or find "genes.refGene" file for hg19 and hg38 #39

Open

anopperl mentioned this issue Sep 6, 2017

how to get coordinate of "AB026906.1:c.40_42del" by hgvs code #40

Open

jtratner pushed a commit that referenced this issue Nov 20, 2019

Merge pull request #26 from dev/fix-pip-unicode

dc79262

Fix setup.py for unicode literals.

davmlaw mentioned this issue Nov 5, 2021

HGVS / genome coordinate conversion does not account for cDNA alignment gaps #60

Open
