RelativeCopyNumber #277

Open
ahwagner opened this issue Feb 22, 2021 · 12 comments
Labels: Schema, Stayin' Alive (Issues to exempt from stale issue processing)

@ahwagner
Member

Revisit Relative CopyNumber statements, as seen in cytogenetics resources and WRT X-chromosome abnormalities.

@ahwagner ahwagner added the Stayin' Alive Issues to exempt from stale issue processing label Feb 22, 2021
@ahwagner ahwagner added this to the 1.3 milestone Feb 22, 2021
@ahwagner ahwagner self-assigned this Feb 22, 2021
@ahwagner ahwagner removed their assignment Dec 10, 2021
@ahwagner
Member Author

@mcannon068nw follow this thread.

@mbaudis
Member

mbaudis commented Dec 10, 2021

+1

@mbaudis
Member

mbaudis commented Dec 11, 2021

As mentioned in an exchange w/ @ahwagner, the representation of relative copy number states is essential because:

  • most analyses do not deliver copy number counts - in fact the use of "copies" is extremely unusual
  • CN estimates will be influenced by clonality, impurity and ploidy
  • even correct CN counts can frequently (cancer genomes...) only be interpreted by knowing the base ploidy level

The central paradigm of CN analysis and representation is the level relative to a baseline (i.e. the ploidy level). Cancer genomics fairly consistently uses a limited set of CN levels:

  • homozygous deletion (i.e. estimated 0 copies in any ploidy)
  • deletion (i.e. fewer copies than the assumed baseline; CN=1 for diploid)
  • duplication / low level gain (i.e. one or few more copies than baseline; operationally one can assume this as "not more than a duplication of the base CN count", i.e. 4 in a diploid genome)
  • amplification / high-level gain: from ~5 to any number of copies of a genomic region, possibly in the hundreds. However, there is no clear definition of an amplification threshold, and some definitions take regionality into account and exclude e.g. events leading to multiple copies of a chromosome
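
As a rough illustration of the levels listed above, the sketch below maps an absolute copy count and an assumed baseline to one of these states. The function name `classify_relative_cn`, the returned label strings, and the "more than twice the baseline" cutoff for high-level gain are assumptions made for this example only; as noted above, there is no agreed amplification threshold.

```python
def classify_relative_cn(copies: int, baseline: int = 2) -> str:
    """Map an absolute copy count to a relative CN level (illustrative only).

    `baseline` is the assumed base copy number at the locus,
    e.g. 2 for autosomes in a diploid genome.
    """
    if copies < 0 or baseline < 1:
        raise ValueError("copies must be >= 0 and baseline >= 1")
    if copies == 0:
        # Estimated 0 copies, regardless of ploidy.
        return "homozygous deletion"
    if copies < baseline:
        # Fewer copies than the assumed baseline (CN=1 for diploid).
        return "deletion"
    if copies == baseline:
        return "baseline"
    if copies <= 2 * baseline:
        # Operational assumption: not more than a duplication of the base count.
        return "low-level gain"
    # Assumed cutoff: anything above twice the baseline (>= 5 for diploid).
    return "high-level gain"
```

With these assumptions, `classify_relative_cn(4)` gives a low-level gain and `classify_relative_cn(5)` a high-level gain for a diploid baseline, while `classify_relative_cn(2, baseline=3)` gives a deletion in a triploid context.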

In practice, the current lack of a relative indication of CN state prevents the use of the schema for most real-world applications representing CNV events (or requires the use of fake values).

Changes needed

  1. a class describing the relative copy number state (REQUIRED)
  2. a representation of the base copy number at the location in the given sample (OPTIONAL; e.g. 2 for autosomes in diploid cells, 1 for X/Y in males, 3 for a triploid cell line with e.g. 69,XXX etc.)
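
One way to picture these two changes is a small record type with a required relative state and an optional base copy number. This is only a sketch: the class name `RelativeCnStatement`, its field names, and the example subject string are hypothetical and are not taken from the VRS schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RelativeCnStatement:
    """Hypothetical container for a relative copy number call."""

    subject: str                            # hypothetical location or feature identifier
    relative_state: str                     # REQUIRED: relative copy number state
    base_copy_number: Optional[int] = None  # OPTIONAL: e.g. 2 for diploid autosomes

    def __post_init__(self) -> None:
        if not self.relative_state:
            raise ValueError("relative_state is required")
        if self.base_copy_number is not None and self.base_copy_number < 0:
            raise ValueError("base_copy_number must be >= 0 when given")


# A deletion on X in a male sample, where the base copy number is 1.
call = RelativeCnStatement(subject="chrX:region", relative_state="deletion",
                           base_copy_number=1)
```

Leaving `base_copy_number` unset keeps the statement valid when the ploidy of the sample is unknown, which is the common case described above.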

The best option would be to have an ontology for such classes; should SO be the place? However, CNV representation there is confusing & incomplete.

Minimal pseudo-ontology for CNVs

id: CNVO:000001
label: copy number assessment
  id: CNVO:000002
  label: base ploidy
   id: CNVO:000004
   label: copy-neutral loss of heterozygosity
  id: CNVO:000003
  label: copy number variation
    id: CNVO:000005
    label: copy number loss
      id: CNVO:000007
      label: low-level copy number loss
      id: CNVO:000008
      label: homozygous deletion
    id: CNVO:000006
    label: copy number gain
      id: CNVO:000009
      label: low-level copy number gain
      id: CNVO:000010
      label: genomic amplification

I'd be happy to help work on this & am extremely flexible regarding solutions ...

@ahwagner
Member Author

I agree completely with @mbaudis above. I think this gets around many of the challenges of representing the assay signal (e.g. log2 ratios) and moves straight to the heart of what CNV callers predict. This is very VRS-like, in my opinion (we also avoid VAF / read depth / intensity metrics elsewhere in VRS).

+1 @mbaudis

@mbaudis
Member

mbaudis commented Feb 1, 2022

As of January 18, 2022 the copy number assessment class and its tree are represented in the Experimental Factor Ontology (EFO):

id: EFO:0030063
label: copy number assessment
  |
  |-id: EFO:0030064
  | label: regional base ploidy
  |   |
  |   |-id: EFO:0030065
  |     label: copy-neutral loss of heterozygosity
  |
  |-id: EFO:0030066
    label: relative copy number variation
      |
      |-id: EFO:0030067
      | label: copy number loss
      |   |
      |   |-id: EFO:0030068
      |   | label: low-level copy number loss
      |   |
      |   |-id: EFO:0030069
      |     label: complete genomic deletion
      |
      |-id: EFO:0030070
        label: copy number gain
          |
          |-id: EFO:0030071
          | label: low-level copy number gain
          |
          |-id: EFO:0030072
             label: high-level copy number gain
             note: commonly but not consistently used for >=5 copies on a bi-allelic genome region
              |
              |-id: EFO:0030073
                 label: focal genome amplification
                 note: >-
                   commonly used for localized multi-copy genome amplification events where the
                   region does not extend >3Mb (varying 1-5Mb) and may exist in a large number of
                   copies
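
To make the hierarchy above concrete, here is a minimal sketch that encodes the tree as a child-to-parent map and checks whether one term falls under another. The IDs and parent relations are copied from the tree as listed; the names `PARENT` and `is_a` are illustrative and do not belong to any EFO or VRS tooling.

```python
from typing import Optional

# Child -> parent relations, transcribed from the tree above.
PARENT = {
    "EFO:0030064": "EFO:0030063",  # regional base ploidy
    "EFO:0030065": "EFO:0030064",  # copy-neutral loss of heterozygosity
    "EFO:0030066": "EFO:0030063",  # relative copy number variation
    "EFO:0030067": "EFO:0030066",  # copy number loss
    "EFO:0030068": "EFO:0030067",  # low-level copy number loss
    "EFO:0030069": "EFO:0030067",  # complete genomic deletion
    "EFO:0030070": "EFO:0030066",  # copy number gain
    "EFO:0030071": "EFO:0030070",  # low-level copy number gain
    "EFO:0030072": "EFO:0030070",  # high-level copy number gain
    "EFO:0030073": "EFO:0030072",  # focal genome amplification
}


def is_a(term: str, ancestor: str) -> bool:
    """Return True if `term` is `ancestor` or one of its descendants."""
    current: Optional[str] = term
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)  # None once the root is passed
    return False
```

A query for any copy number gain (`EFO:0030070`) would then also match records annotated with focal genome amplification (`EFO:0030073`), since `is_a("EFO:0030073", "EFO:0030070")` walks up through `EFO:0030072`.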

@ahwagner
Member Author

On the upcoming 2/28 call @larrybabb and I will discuss a proposal to align the above classifications of low / high level copy number gain / loss as a Relative Copy Number class.

This class will be defined by a subject (matching the same variable from [Absolute] Copy Number) and a copy number assessment described by the integer range -2 to +2:
-2: complete copy loss
-1: low-level copy loss
0: copy neutral
1: low-level copy gain
2: high-level copy gain

The cardinality inherent to integers helps with computability over a strictly term-based system.

@ahwagner
Member Author

Loss-of-heterozygosity needs to be discussed in the context of genotypes.

@larrybabb
Contributor

per a discussion between @ahwagner and @larrybabb
Draft Relative Copy Number class proposal

-- the target region/gene/feature
subject:  region/gene/feature/allele/haplotype

--5 quantifiable values that correspond to the EFO copy number assessment subterms that are stable and reliable
copy number assessment:   (http://www.ebi.ac.uk/efo/EFO_0030063)
        -2 = complete loss  (http://www.ebi.ac.uk/efo/EFO_0030069)
        -1 = partial loss   (http://www.ebi.ac.uk/efo/EFO_0030068)
         0 = copy-neutral   (http://www.ebi.ac.uk/efo/EFO_0030064)
         1 = low-level gain (http://www.ebi.ac.uk/efo/EFO_0030071)
         2 = high-level gain (http://www.ebi.ac.uk/efo/EFO_0030072)
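
The draft above could be sketched in Python roughly as follows. The enum name `RelativeCopyLevel`, its member names, and the `EFO_IRI` lookup are illustrative assumptions; only the integer values and the EFO links come from the proposal.

```python
from enum import IntEnum


class RelativeCopyLevel(IntEnum):
    """Illustrative encoding of the five proposed values."""

    COMPLETE_LOSS = -2
    PARTIAL_LOSS = -1
    COPY_NEUTRAL = 0
    LOW_LEVEL_GAIN = 1
    HIGH_LEVEL_GAIN = 2


# EFO links as given in the draft proposal.
EFO_IRI = {
    RelativeCopyLevel.COMPLETE_LOSS: "http://www.ebi.ac.uk/efo/EFO_0030069",
    RelativeCopyLevel.PARTIAL_LOSS: "http://www.ebi.ac.uk/efo/EFO_0030068",
    RelativeCopyLevel.COPY_NEUTRAL: "http://www.ebi.ac.uk/efo/EFO_0030064",
    RelativeCopyLevel.LOW_LEVEL_GAIN: "http://www.ebi.ac.uk/efo/EFO_0030071",
    RelativeCopyLevel.HIGH_LEVEL_GAIN: "http://www.ebi.ac.uk/efo/EFO_0030072",
}


def is_loss(level: RelativeCopyLevel) -> bool:
    """Any loss, complete or partial, is a negative value."""
    return level < 0


def is_gain(level: RelativeCopyLevel) -> bool:
    """Any gain, low-level or high-level, is a positive value."""
    return level > 0
```

Because the members are integers, they sort and compare directly, so a filter such as "at least a low-level gain" is just `level >= RelativeCopyLevel.LOW_LEVEL_GAIN`, while the lookup table keeps each value tied to its EFO term.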

@mbaudis
Member

mbaudis commented Feb 25, 2022

Great! 2 questions:

  • any intention to handle "focal genome amplification", or is this too much in the "annotation realm"?
  • for CN-LOH: this would then be handled as a combination of a genotype assessment (somehow expressing allelic homozygosity) and then the relative CN at the locus? Could also cover e.g. LOH with CN gain (not sure about examples but everything happens...).

@ahwagner
Member Author

On today's GA4GH call, some concerns were raised about the integer approach: it may be confusing, and it might cause challenges when extending to levels beyond complete loss / low-level loss / neutral / low-level gain / high-level gain.

@mbaudis
Member

mbaudis commented Mar 3, 2022

I guess the main arguments against directly using CURIEs would be that

  • VRS feels that not all of the ones from the EFO branch are suitable for this concept - at least in the way VRS is seeing it - and
  • the terms themselves may not become the defaults (e.g. waiting for SO as still somehow standard ontology in the variant space though w/ some problematic/lacking concepts ATM)

... ?

OTOH, arguments for: CURIE concept/definition in VRS; recommended values basically adopt term definitions from EFO (thanks); flexibility to change recommended terms while keeping the structure; hierarchical retrieval in implementations (complete loss is just a subset of loss) ...

@ahwagner
Member Author

ahwagner commented Mar 3, 2022

Mostly this is for consistency with the spec so far, where we can link to all external concepts associated with a concept, e.g. sources for Allele. Our plan is to eventually provide structured alignment to the EFO (and eventually SO?) concepts, and (when we get to producing LD-contexts) we will have explicit concept equivalency maps to these entities.
