# How To Represent Copy Number Variants (CNVs)

In [1]:
from ga4gh.vrs import models
from ga4gh.vrs.extras.translator import Translator
from ga4gh.vrs.dataproxy import SeqRepoRESTDataProxy

import json
from IPython.display import Image



Removing allOf attribute from CopyNumber to avoid python-jsonschema-objects error.
Removing allOf attribute from SequenceInterval to avoid python-jsonschema-objects error.
Removing allOf attribute from RepeatedSequenceExpression to avoid python-jsonschema-objects error.


A basic example: the gene [APOE - apolipoprotein E](https://www.ncbi.nlm.nih.gov/gene/348) is present in at least three copies.


In [2]:
# We use the indefinite range to express "at least three"
indefrange = models.IndefiniteRange(comparator=">=", value=3)


# The CopyNumber model allows us to represent this at the gene level.
apoe_cn = models.CopyNumber(copies=indefrange,
                            subject=models.Gene(gene_id="ncbigene:348"))



VRS allows us to represent all objects in JSON.

In [3]:
print(json.dumps(apoe_cn.as_dict(), indent=1))

{
 "type": "CopyNumber",
 "subject": {
  "type": "Gene",
  "gene_id": "ncbigene:348"
 },
 "copies": {
  "type": "IndefiniteRange",
  "value": 3,
  "comparator": ">="
 }
}
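
The serialized `copies` object can be read as a simple predicate on an observed copy count. The following is a minimal sketch in plain Python that works on the JSON shown above rather than on the VRS model classes. The helper name `copies_match` is hypothetical (it is not part of `ga4gh.vrs`), and the sketch assumes the comparator is either `>=` or `<=`.

```python
import json


def copies_match(copies: dict, observed: int) -> bool:
    """Check an observed copy count against a serialized Number or IndefiniteRange.

    Hypothetical helper for illustration; not a ga4gh.vrs function.
    """
    if copies["type"] == "Number":
        return observed == copies["value"]
    if copies["type"] == "IndefiniteRange":
        if copies["comparator"] == ">=":
            return observed >= copies["value"]
        if copies["comparator"] == "<=":
            return observed <= copies["value"]
    raise ValueError(f"unsupported copies object: {copies!r}")


# The APOE CopyNumber JSON printed above
apoe = json.loads("""
{
 "type": "CopyNumber",
 "subject": {"type": "Gene", "gene_id": "ncbigene:348"},
 "copies": {"type": "IndefiniteRange", "value": 3, "comparator": ">="}
}
""")

for observed in (2, 3, 5):
    print(observed, copies_match(apoe["copies"], observed))
```

Only an observed count of 3 or more is consistent with the "at least three" statement.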


## Example BRCA1 exon duplication

![BRCA1 exon duplication](images/BRCA1_exon_dup.png)


This example demonstrates a copy number event in which a specific exon of BRCA1 was duplicated, so the exon is present in three copies.
However, in many CNV cases the exact breakpoints of the event are unknown because they were not sequenced. For
example, the breakpoints might lie somewhere in the intronic sequence between exons. One possible way to represent this in HGVS
is `NC_000017.10:g.41209048-?_41209172+?dup`.
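
To make the parts of that expression explicit, here is a minimal sketch that pulls the accession, the two coordinates, and the uncertainty markers out of the string with a regular expression. It is written only for the single pattern shown above and is not a general HGVS parser; the names `HGVS_DUP` and `parse_uncertain_dup` are hypothetical, and the coordinates are returned exactly as written, without any coordinate-system conversion.

```python
import re

# Matches only the shape "<accession>:g.<start>-?_<end>+?dup" used in this example.
HGVS_DUP = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+\.\d+):g\."
    r"(?P<start>\d+)(?P<start_uncertain>-\?)?_"
    r"(?P<end>\d+)(?P<end_uncertain>\+\?)?dup$"
)


def parse_uncertain_dup(expr: str) -> dict:
    """Split a duplication expression of the shape above into its parts."""
    m = HGVS_DUP.match(expr)
    if m is None:
        raise ValueError(f"expression not in the expected shape: {expr!r}")
    return {
        "accession": m["accession"],
        "start": int(m["start"]),
        "end": int(m["end"]),
        # True when the breakpoint is marked as extending an unknown distance
        "start_uncertain": m["start_uncertain"] is not None,
        "end_uncertain": m["end_uncertain"] is not None,
    }


parsed = parse_uncertain_dup("NC_000017.10:g.41209048-?_41209172+?dup")
print(parsed)
```

The two numeric coordinates are the ones passed to `SequenceInterval` in the next cell.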

In [4]:

# First, specify the sequence interval and the chromosome that was duplicated.
interval = models.SequenceInterval(start=models.Number(value=41209048), end=models.Number(value=41209172))
location = models.SequenceLocation(interval=interval,
                                   sequence_id="refseq:NC_000017.10")



For a CNV, we declare the subject to be a sequence derived from this location.
Use of [DerivedSequenceExpression](https://vrs.ga4gh.org/en/stable/terms_and_model.html?highlight=DerivedSequenceExpression#derivedsequenceexpression) indicates that the derived sequence is approximately equivalent
to the reference indicated, and is typically used for describing large regions for variation concepts
where the exact sequence is inconsequential.

Note: if we knew that the duplication was in tandem, we would use [Molecular Variation](https://vrs.ga4gh.org/en/stable/terms_and_model.html?highlight=Molecular%20Variation#molecular-variation) rather than [Systemic Variation](https://vrs.ga4gh.org/en/stable/terms_and_model.html?highlight=Systemic%20Variation#systemic-variation) (see next example). In this case we do not know where in the genome the duplicated copy was inserted, so we express that uncertainty by using a systemic variation.
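
The choice described in the note can be sketched as a small function. This is an illustration in plain Python dicts that mirror the JSON shapes printed in this notebook, not the `ga4gh.vrs` model classes; the function name `represent_duplication` and its `tandem_known` flag are hypothetical.

```python
def represent_duplication(sequence_id, start, end, min_copies, tandem_known):
    """Build a dict for a duplication, choosing the wrapper by what is known.

    Hypothetical helper: a tandem duplication is wrapped in a
    RepeatedSequenceExpression, an unplaced one in a CopyNumber.
    """
    derived = {
        "type": "DerivedSequenceExpression",
        "location": {
            "type": "SequenceLocation",
            "sequence_id": sequence_id,
            "interval": {
                "type": "SequenceInterval",
                "start": {"type": "Number", "value": start},
                "end": {"type": "Number", "value": end},
            },
        },
        "reverse_complement": False,
    }
    at_least = {"type": "IndefiniteRange", "value": min_copies, "comparator": ">="}
    if tandem_known:
        # The copies are known to sit next to each other on the same molecule.
        return {"type": "RepeatedSequenceExpression", "seq_expr": derived, "count": at_least}
    # The location of the extra copies is unknown: state only the copy count.
    return {"type": "CopyNumber", "subject": derived, "copies": at_least}


brca1 = represent_duplication("refseq:NC_000017.10", 41209048, 41209172, 3, tandem_known=False)
print(brca1["type"])
```

For the BRCA1 event, where the insertion site is unknown, the sketch returns the `CopyNumber` form.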

In [5]:

derivedseq = models.DerivedSequenceExpression(location=location, reverse_complement=False)

# Next, we express how many copies of this derived sequence can be found.

# Note: we know there are at least 3 copies, but there might be more,
# so we use an IndefiniteRange and provide the comparator.
copies = models.IndefiniteRange(value=3, comparator=">=")

# Finally, this comes together as the CopyNumber object:
cn = models.CopyNumber(copies=copies, subject=derivedseq)

print(json.dumps(cn.as_dict(), indent=1))

{
 "type": "CopyNumber",
 "subject": {
  "type": "DerivedSequenceExpression",
  "location": {
   "type": "SequenceLocation",
   "sequence_id": "refseq:NC_000017.10",
   "interval": {
    "type": "SequenceInterval",
    "start": {
     "type": "Number",
     "value": 41209048
    },
    "end": {
     "type": "Number",
     "value": 41209172
    }
   }
  },
  "reverse_complement": false
 },
 "copies": {
  "type": "IndefiniteRange",
  "value": 3,
  "comparator": ">="
 }
}


## Example MME exon tandem duplication
![MME exon tandem duplication](images/MME_exon_tandem_dup.png)


Here is a different example: this CNV event is known to be in tandem, so we represent it using
[Molecular Variation](https://vrs.ga4gh.org/en/stable/terms_and_model.html?highlight=Molecular%20Variation#molecular-variation).


In [6]:
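# Start again by expressing the location that was duplicated.
# NOTE: the end value below is smaller than the start and equals the BRCA1 end
# coordinate used earlier, so it looks like a copy-paste leftover; substitute the
# actual end of the duplicated MME region.
interval = models.SequenceInterval(start=models.Number(value=154886500), end=models.Number(value=41209172))
location = models.SequenceLocation(interval=interval,
                                   sequence_id="refseq:NC_000003.11")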

In [7]:
# In contrast to the previous example, where we could not be confident that the event was a tandem duplication,
# here we are confident, so we use a RepeatedSequenceExpression.
derivedseq = models.DerivedSequenceExpression(location=location, reverse_complement=False)

cnv_count = models.IndefiniteRange(comparator=">=", value=4)

# We express this as a molecular variation.
tandem_repeat = models.RepeatedSequenceExpression(seq_expr=derivedseq, count=cnv_count)

print(json.dumps(tandem_repeat.as_dict(), indent=1))


{
 "type": "RepeatedSequenceExpression",
 "seq_expr": {
  "type": "DerivedSequenceExpression",
  "location": {
   "type": "SequenceLocation",
   "sequence_id": "refseq:NC_000003.11",
   "interval": {
    "type": "SequenceInterval",
    "start": {
     "type": "Number",
     "value": 154886500
    },
    "end": {
     "type": "Number",
     "value": 41209172
    }
   }
  },
  "reverse_complement": false
 },
 "count": {
  "type": "IndefiniteRange",
  "value": 4,
  "comparator": ">="
 }
}


In [8]:
# A systemic variation can also be derived from a molecular variation:

systemic_variation = models.CopyNumber(subject=tandem_repeat)

print(json.dumps(systemic_variation.as_dict(), indent=1))

{
 "type": "CopyNumber",
 "subject": {
  "type": "RepeatedSequenceExpression",
  "seq_expr": {
   "type": "DerivedSequenceExpression",
   "location": {
    "type": "SequenceLocation",
    "sequence_id": "refseq:NC_000003.11",
    "interval": {
     "type": "SequenceInterval",
     "start": {
      "type": "Number",
      "value": 154886500
     },
     "end": {
      "type": "Number",
      "value": 41209172
     }
    }
   },
   "reverse_complement": false
  },
  "count": {
   "type": "IndefiniteRange",
   "value": 4,
   "comparator": ">="
  }
 }
}
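
Because every object serializes to nested JSON, simple sanity checks can be run on the output. The sketch below walks a serialized structure and reports every `SequenceInterval` whose numeric start is greater than its end. The function name `find_inverted_intervals` and the `$`-rooted path strings are hypothetical conventions chosen for this illustration, and only plain `Number` bounds are compared. Applied to the MME structure printed above, it reports the interval, whose end (41209172) is smaller than its start (154886500).

```python
def find_inverted_intervals(obj, path="$"):
    """Yield (path, start, end) for each SequenceInterval with start > end."""
    if isinstance(obj, dict):
        if obj.get("type") == "SequenceInterval":
            start, end = obj.get("start", {}), obj.get("end", {})
            # Compare only exact Number bounds; ranges are skipped in this sketch.
            if start.get("type") == "Number" and end.get("type") == "Number":
                if start["value"] > end["value"]:
                    yield path, start["value"], end["value"]
        for key, value in obj.items():
            yield from find_inverted_intervals(value, f"{path}.{key}")
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from find_inverted_intervals(item, f"{path}[{index}]")


def interval(start, end):
    """Build a serialized SequenceInterval with Number bounds."""
    return {
        "type": "SequenceInterval",
        "start": {"type": "Number", "value": start},
        "end": {"type": "Number", "value": end},
    }


# Same nesting as the MME CopyNumber JSON printed above
mme_copy_number = {
    "type": "CopyNumber",
    "subject": {
        "type": "RepeatedSequenceExpression",
        "seq_expr": {
            "type": "DerivedSequenceExpression",
            "location": {
                "type": "SequenceLocation",
                "sequence_id": "refseq:NC_000003.11",
                "interval": interval(154886500, 41209172),
            },
            "reverse_complement": False,
        },
        "count": {"type": "IndefiniteRange", "value": 4, "comparator": ">="},
    },
}

for found in find_inverted_intervals(mme_copy_number):
    print(found)
```

A check like this is cheap to run before sharing a serialized object, since the model classes used in this notebook accepted the inverted interval without complaint.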
