update of genes.refGene files #26

Open
adelpozomt opened this issue Jun 28, 2016 · 11 comments

Comments

@adelpozomt

I need to use an updated version of RefSeq. Is there a script available to download the current version of the 'genes.refGene' file, or should I build it by hand? Thank you. Angela

@lacek

lacek commented Aug 29, 2017

The format of genes.refGene is almost the same as the refGene table from UCSC (hg19 schema, hg19 dump). I bet that table is the data source, except that the accession names in genes.refGene carry versions while those in UCSC's table do not, e.g. NR_046018.2 vs. NR_046018.

UCSC provides another table, gbCdnaInfo, with accession name and version (schema, dump). You can load both tables into a MySQL database and join them to get the required data, e.g.:

SELECT r.bin,
       CONCAT(r.name, '.', g.version) AS name,
       r.chrom,
       r.strand,
       r.txStart,
       r.txEnd,
       r.cdsStart,
       r.cdsEnd,
       r.exonCount,
       r.exonStarts,
       r.exonEnds,
       r.score,
       r.name2,
       r.cdsStartStat,
       r.cdsEndStat,
       r.exonFrames
INTO OUTFILE '/tmp/genes.refGene'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
FROM refGene r, gbCdnaInfo g
WHERE r.name = g.acc;

If you have difficulty setting up a database, you may try querying UCSC's public database instance directly, as the resulting data is only around 20 MB (for now):

mysql -ugenomep --password=password -hgenome-mysql.cse.ucsc.edu -ACD hg19 -BNe "SELECT r.bin,
       CONCAT(r.name, '.', g.version) AS name,
       r.chrom,
       r.strand,
       r.txStart,
       r.txEnd,
       r.cdsStart,
       r.cdsEnd,
       r.exonCount,
       r.exonStarts,
       r.exonEnds,
       r.score,
       r.name2,
       r.cdsStartStat,
       r.cdsEndStat,
       r.exonFrames
FROM hg19.refGene r, hgFixed.gbCdnaInfo g
WHERE r.name = g.acc" > genes.refGene
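
For anyone who would rather avoid MySQL entirely, the same join can be sketched in Python against the downloaded dump files. This is only a minimal sketch under assumptions: the column positions for gbCdnaInfo (accession in the second column, version in the third) are assumed and should be checked against the schema, and the function names are made up for illustration. The refGene name column position follows the SELECT list above.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def load_versions(info_lines, wanted, acc_col=1, version_col=2):
    """Map accession -> version for the accessions in `wanted`.

    acc_col and version_col are ASSUMED positions in the gbCdnaInfo dump.
    Filtering on `wanted` keeps memory use small, since gbCdnaInfo is large.
    """
    versions = {}
    for line in info_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > max(acc_col, version_col) and fields[acc_col] in wanted:
            versions[fields[acc_col]] = fields[version_col]
    return versions


def add_versions(refgene_lines, versions, name_col=1):
    """Yield refGene rows with the name rewritten as 'accession.version'.

    Rows without a matching accession are skipped, like the inner join in SQL.
    """
    for line in refgene_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= name_col:
            continue
        version = versions.get(fields[name_col])
        if version is None:
            continue
        fields[name_col] = f"{fields[name_col]}.{version}"
        yield "\t".join(fields)


def build_genes_refgene(refgene_path, info_path, out_path):
    """Join the two dump files and write a tab-separated genes.refGene file."""
    with open_text(refgene_path) as fh:
        refgene_lines = fh.readlines()
    wanted = {line.split("\t")[1] for line in refgene_lines if "\t" in line}
    with open_text(info_path) as fh:
        versions = load_versions(fh, wanted)
    with open(out_path, "w") as out:
        for row in add_versions(refgene_lines, versions):
            out.write(row + "\n")
```

A call might look like `build_genes_refgene("refGene.txt.gz", "gbCdnaInfo.txt.gz", "genes.refGene")`, where the two input names are simply whatever the downloaded dump files are called locally.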

And for LRG_RefSeqGene, the latest file is simply available from NCBI's FTP: ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/LRG_RefSeqGene

@anopperl

anopperl commented Sep 4, 2017

Is this "genes.refGene" file for hg19 or hg18?
I want this file for both hg19 and hg18.

@lacek

lacek commented Sep 5, 2017

@anopperl The gbCdnaInfo table is the same for all assemblies, but the links to the refGene table posted above were for hg19 only. For hg18, you may simply change hg19 to hg18 in the links:

Similarly for hg38:

On the other hand, if you prefer querying the public UCSC database directly, simply replace every occurrence of hg19 with hg18 (or hg38), e.g.:

mysql -ugenomep --password=password -hgenome-mysql.cse.ucsc.edu -ACD hg18 -BNe "SELECT r.bin,
       CONCAT(r.name, '.', g.version) AS name,
       r.chrom,
       r.strand,
       r.txStart,
       r.txEnd,
       r.cdsStart,
       r.cdsEnd,
       r.exonCount,
       r.exonStarts,
       r.exonEnds,
       r.score,
       r.name2,
       r.cdsStartStat,
       r.cdsEndStat,
       r.exonFrames
FROM hg18.refGene r, hgFixed.gbCdnaInfo g
WHERE r.name = g.acc" > genes.refGene

Finally, you can concatenate genes.refGene for different assemblies into a single file.

@anopperl

anopperl commented Sep 6, 2017

Thanks, lacek.
I want to get the coordinates of "NM_000352.3:c.215-10A>G", so I used the hgvs code (https://github.com/counsyl/hgvs).
I generated the transcript files for hg19 and hg18 with the MySQL command you gave above.
But the transcript file (genes.refGene) has NM_000352.4 and does not have NM_000352.3.

@lacek

lacek commented Sep 28, 2017

UCSC doesn't keep track of all versions of accessions, so we only get the latest version of each transcript from their database.

This particular transcript, NM_000352.3, is in the original genes.refGene file in this repo. You may combine that file with the one you've created. Some transcript versions would still be missing from the combined file.
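
Combining the two files, and checking which versions of an accession end up in the result, could be sketched as follows. This is a minimal sketch, not part of this repo: the function names are hypothetical, only exact duplicate rows are dropped, and the name column is assumed to be the second column, as in the query above.

```python
from collections import defaultdict


def merge_refgene(*line_sources):
    """Merge genes.refGene lines from several sources, dropping exact duplicates.

    Order is preserved: rows from earlier sources come first.
    """
    seen = set()
    merged = []
    for lines in line_sources:
        for line in lines:
            row = line.rstrip("\n")
            if row and row not in seen:
                seen.add(row)
                merged.append(row)
    return merged


def versions_by_accession(rows, name_col=1):
    """Map each accession to the sorted list of versions present in the rows."""
    found = defaultdict(set)
    for row in rows:
        name = row.split("\t")[name_col]
        accession, _, version = name.partition(".")
        if version.isdigit():  # skip names that carry no version suffix
            found[accession].add(int(version))
    return {acc: sorted(vs) for acc, vs in found.items()}
```

Passing two open file handles to `merge_refgene` and looking up `"NM_000352"` in the result of `versions_by_accession` shows whether both versions 3 and 4 are available after the merge.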

You will need a data source that has all transcript versions. One I can think of is UTA. You need to either write a query that produces data in the format of the genes.refGene file, or write a Python adapter function (get_transcript) that fetches the required transcript data from the UTA database.

On the other hand, you may also want to take a look at hgvs, an HGVS parser that is based on UTA. It seems more robust but less easy to use.

@davmlaw

davmlaw commented Feb 3, 2022

I've made a Python package that provides ~800k transcripts (both RefSeq and Ensembl) for PyHGVS

https://github.com/SACGF/cdot

You can either download a JSON.gz file, or use a REST service. To use it:

import pyhgvs
from cdot.pyhgvs.pyhgvs_transcript import JSONPyHGVSTranscriptFactory, RESTPyHGVSTranscriptFactory

factory = RESTPyHGVSTranscriptFactory()  # Uses the REST service
# factory = JSONPyHGVSTranscriptFactory(["./cdot-0.2.1.refseq.grch38.json.gz"])  # Uses local JSON file
# hgvs_c (the HGVS name string) and genome (the reference genome object) must be defined beforehand
pyhgvs.parse_hgvs_name(hgvs_c, genome, get_transcript=factory.get_transcript_grch37)

@simzep

simzep commented Mar 30, 2023

Thank you all for the info on how to generate refGene files. Is there a way to get a refGene file for hs1? Using the solution from @lacek with 'hs1' does not work.

@lacek

lacek commented Mar 31, 2023

@simzep By hs1 are you referring to HCLS1?

Anyhow, this library is unmaintained and comes with a number of problems (e.g. bugs in parsing dup and ins, no checking of reference bases, no support for inversions or mitochondrial variants, etc.). You're better off with another similar library (e.g. https://github.com/biocommons/hgvs) or tool (e.g. https://asia.ensembl.org/info/docs/tools/vep/recoder/index.html) for parsing HGVS.

@simzep

simzep commented Mar 31, 2023

@lacek Thanks for the reply.
I'm trying to create or find a refGene file for the T2T-CHM13 (hs1) genome project found here, similar to the one created with your SQL command.
https://hgdownload.soe.ucsc.edu/goldenPath/hs1/bigZips/
I hope this makes sense; this is my first project involving bioinformatics.

@lacek

lacek commented Apr 1, 2023

@simzep I wasn't aware that you were referring to the CHM13 reference genome.

For UCSC data, you can search on its Table Browser page. E.g. for refGene of hg38: https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_track=refSeqComposite&hgta_table=refGene&hgta_doSchema=data%20format%20description

At the moment, I cannot find a refGene table for hs1 on UCSC. The closest one may be https://genome.ucsc.edu/cgi-bin/hgTables?db=hs1&hgta_track=hub_3671779_refSeqComposite&hgta_table=hub_3671779_ncbiRefSeq&hgta_doSchema=data%20format%20description. It's in bigBed format, though, and requires conversion if you find it appropriate for your use case.
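
To give a rough idea of what such a conversion involves, here is a minimal Python sketch that turns one BED12 row into the ten core genePred columns. It rests on several assumptions: the bigBed file has already been dumped to text (for example with UCSC's bigBedToBed tool), its first 12 columns follow the standard BED12 layout, and the function name is made up for illustration. The output has no bin, score, name2, cdsStartStat, cdsEndStat or exonFrames columns, so it is not a drop-in genes.refGene file.

```python
def bed12_to_genepred(bed_line):
    """Convert one BED12 line to a 10-column genePred line (tab-separated).

    BED block starts are relative to chromStart; genePred exon starts are
    absolute, so chromStart is added to each block start.
    """
    f = bed_line.rstrip("\n").split("\t")
    chrom, tx_start, tx_end, name, strand = f[0], int(f[1]), int(f[2]), f[3], f[5]
    cds_start, cds_end = int(f[6]), int(f[7])  # thickStart / thickEnd
    block_count = int(f[9])
    block_sizes = [int(x) for x in f[10].split(",") if x]
    block_starts = [int(x) for x in f[11].split(",") if x]
    if not (len(block_sizes) == len(block_starts) == block_count):
        raise ValueError(f"inconsistent block fields for {name}")

    exon_starts = [tx_start + s for s in block_starts]
    exon_ends = [s + size for s, size in zip(exon_starts, block_sizes)]

    return "\t".join([
        name, chrom, strand,
        str(tx_start), str(tx_end),
        str(cds_start), str(cds_end),
        str(block_count),
        ",".join(map(str, exon_starts)) + ",",  # genePred lists end with a comma
        ",".join(map(str, exon_ends)) + ",",
    ])
```

Any extra columns after the twelfth are ignored here; whether they hold the missing fields depends on the table's schema.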

@davmlaw

davmlaw commented Apr 3, 2023

Hi, I added T2T support to cdot, so you should be able to convert to/from HGVS (using the Biocommons HGVS) reasonably easily, see example code here:

https://github.com/SACGF/cdot/wiki/Biocommons-T2T-CHM13v2.0-example-code
