documentation on how to load hg38 dataset to the database #6847

khzhu · 2019-11-22T02:56:25Z

Fix #6729

Describe changes proposed in this pull request:

New documentation on how to load an hg38 dataset into the cBioPortal database.

Checks

Runs on heroku
Has tests or has a separate issue that describes the types of test that should be created. If no test is included it should explicitly be mentioned in the PR why there is no test.
The commit log is comprehensible. It follows 7 rules of great commit messages. For most PRs a single commit should suffice, in some cases multiple topical commits can be useful. During review it is ok to see tiny commits (e.g. Fix reviewer comments), but right before the code gets merged to master or rc branch, any such commits should be squashed since they are useless to the other developers. Definitely avoid merge commits, use rebase instead.
Is this PR adding logic based on one or more clinical attributes? If yes, please make sure validation for this attribute is also present in the data validation / data loading layers (in backend repo) and documented in File-Formats Clinical data section!

Any screenshots or GIFs?

If this is a new visual feature please add a before/after screenshot or gif
here with e.g. GifGrabber.

Notify reviewers

Read our Pull request merging
policy. It can help to figure out who worked on the
file before you. Please use git blame <filename> to determine that
and notify them either through slack or by assigning them as a reviewer on the PR

jjgao

@khzhu Instead of creating a new file, I am wondering if we should add the text to the existing files, e.g. https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#cancer-study?

jjgao · 2019-11-22T17:20:35Z

docs/Support-multiple-reference-genomes.md

@@ -0,0 +1,35 @@
+# [Introduction](introduction)
+From release 3.1, cBioPortal supports multiple reference genomes. This means the hg19 datasets can co-exist with the hg38 ones in the same portal database. We also support mouse genomes (mm10), but it is not recommended to have both human and mouse data in the same instance of the cBioPortal.


@khzhu I made some changes here. I changed the version from 2.2 to 3.1. Did 2.2 refer to the DB version? I think the portal code version should be used.

I also changed the text about the mouse genome to recommend against importing human and mouse data into the same instance.

thanks, @jjgao, for the review! merged with your changes.

jjgao · 2019-11-22T17:22:31Z

docs/Support-multiple-reference-genomes.md

+1. [Import Reference Genome](Import-reference-genome) into your cBioPortal database. If the version of your portal instance is lower than 3.1, you will have to [update your cBioPortal installation](Updating-your-cBioPortal-installation) 
+and migrate your database schema to the latest version. By default, the migration script adds three reference genomes (hg19, hg38, mm10) to the database.
+2. [Update the reference genome gene database table](Updating-gene-and-gene_alias-tables) to include the genes from the reference genome of interest to the database. You will also need to update both gene and gene alias tables in order to support other species such as mouse.


If I start a new instance from scratch, do I need steps 1 and 2, or are they included in the seed DB?

If I migrate my DB, will the migration script take care of both steps 1 and 2?

If you restore your database from a seed database or use the migration script, you will still need to do step 2 to add genes from your new reference genome.
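
The answer above can be summarized as a small decision sketch. This is only an illustration of the rules stated in this thread, not code from the portal: the function names are hypothetical, and the version string is assumed to be plain dotted numbers such as `3.0.6`.

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string like '3.0.6' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def setup_steps(portal_version: str) -> List[str]:
    """List the setup steps implied by the thread for a given portal version."""
    steps = []
    # Per the doc text: instances older than 3.1 must be updated and migrated;
    # the migration script adds hg19, hg38 and mm10 by default.
    if parse_version(portal_version) < (3, 1):
        steps.append("update the installation and migrate the database schema")
    # Per the reply: step 2 is still needed after a seed-database restore
    # or a migration.
    steps.append("update the gene and gene alias tables for the new reference genome")
    return steps
```

For example, `setup_steps("3.0.6")` lists both steps, while `setup_steps("3.1.0")` lists only the gene table update.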

jjgao · 2019-11-22T17:25:12Z

docs/Support-multiple-reference-genomes.md

+* You can also overwrite default genome values by using the following options of the importer script:
+```
+  -species SPECIES, --species SPECIES
+                        species information (default: assumed human)
+  -ucsc UCSC_BUILD_NAME, --ucsc_build_name UCSC_BUILD_NAME
+                        UCSC reference genome assembly name (default: assumed
+                        hg19)
+  -ncbi NCBI_BUILD_NUMBER, --ncbi_build_number NCBI_BUILD_NUMBER
+                        NCBI reference genome build number (default: assumed
+                        GRCh37 for UCSC reference genome build hg19)
+```
+The **species** by default is human. You do not need to supply it unless you are adding datasets from another species, such as mouse. **ucsc** and **ncbi** are both required 
+when loading hg38 datasets into the portal database. For instance, to load an hg38 dataset:
+```
+core/src/main/scripts/importer/metaImport.py  -s /path/to/your/study -jar scripts.jar -ucsc hg38 -ncbi GRCh38 -n -o
+
+```


Maybe remove this for simplicity, so that people always change their meta file?

The meta file is only used for populating the cancer_study table.
The validation script uses the ucsc/ncbi options to validate the reference genome when importing a study profiled with a genome other than the default listed in the portal properties file.
Also, those default genome values are used by the importer script (Java) if the user supplies no ucsc/ncbi values.
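
To make the fallback described above concrete, here is a minimal Python sketch. It is not the importer's actual code. The defaults are copied from the help text quoted in the diff (human, hg19, GRCh37); the dictionary keys, function names, and the `extra_args` parameter are hypothetical. Only flags that appear in this thread are used.

```python
from typing import Dict, List, Optional, Sequence

# Defaults as stated in the quoted help text. The key names are illustrative
# and are NOT the real portal property keys.
DEFAULT_GENOME: Dict[str, str] = {"species": "human", "ucsc": "hg19", "ncbi": "GRCh37"}


def resolve_genome(species: Optional[str] = None,
                   ucsc: Optional[str] = None,
                   ncbi: Optional[str] = None) -> Dict[str, str]:
    """Explicitly supplied values win; anything missing falls back to the default."""
    supplied = {"species": species, "ucsc": ucsc, "ncbi": ncbi}
    return {key: (value if value is not None else DEFAULT_GENOME[key])
            for key, value in supplied.items()}


def build_import_command(study_dir: str, jar: str,
                         species: Optional[str] = None,
                         ucsc: Optional[str] = None,
                         ncbi: Optional[str] = None,
                         extra_args: Sequence[str] = ()) -> List[str]:
    """Assemble the metaImport.py argument list, adding only supplied overrides."""
    cmd = ["core/src/main/scripts/importer/metaImport.py", "-s", study_dir, "-jar", jar]
    for flag, value in (("-species", species), ("-ucsc", ucsc), ("-ncbi", ncbi)):
        if value is not None:
            cmd += [flag, value]
    # Any further flags (e.g. the -n and -o from the doc example) are passed through as-is.
    cmd += list(extra_args)
    return cmd
```

Calling `build_import_command("/path/to/your/study", "scripts.jar", ucsc="hg38", ncbi="GRCh38", extra_args=("-n", "-o"))` reproduces the arguments of the hg38 example in the diff, while `resolve_genome()` with no arguments yields the hg19/GRCh37 defaults.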

khzhu · 2019-11-22T20:52:26Z

Hi @jjgao, I've resolved all of your comments. May I please have your approval to get the documentation merged? Or let me know if anything else needs to be done. Thanks!

khzhu requested a review from jjgao November 22, 2019 02:57

adding documentation on how to load hg38 dataset to the portal

38686d6

khzhu force-pushed the doc-on-multiple-ref-genomes-6729 branch from 1e2dd2a to 38686d6 Compare November 22, 2019 08:25

jjgao requested changes Nov 22, 2019

Update Support-multiple-reference-genomes.md and resolving code revie…

931b6d2

…w comments from JJ

khzhu force-pushed the doc-on-multiple-ref-genomes-6729 branch from a4ecfd7 to 931b6d2 Compare November 22, 2019 19:47

jjgao approved these changes Nov 24, 2019

jjgao merged commit 3eae71c into cBioPortal:master Nov 24, 2019

inodb added the documentation label Nov 29, 2019

jagnathan deleted the doc-on-multiple-ref-genomes-6729 branch June 2, 2021 16:57

jagnathan restored the doc-on-multiple-ref-genomes-6729 branch June 2, 2021 16:57

jagnathan deleted the doc-on-multiple-ref-genomes-6729 branch June 2, 2021 16:57
