# Module 10: More on Variation and Genetics

## Structural Variations

Structural variations are polymorphic rearrangements of a genome from 50 bp to hundreds of kb in size (usually larger than 1 kb). About 12% of the genome is covered by common structural variations. About 0.5% of the genome in each individual is affected by structural variations.

Many structural variations are INDELs, but there are also other forms like inversions.

They are a major source of phenotype variation, including health-related phenotypes. Cancers and behavioral diseases are particularly noteworthy for being related to a large amount of structural variation. As research continues, many rare diseases, neurological disorders, and other behavioral disorders are being found to have related structural variations.

Historical methods that rely on genotyping assays have been very low resolution, only able to detect structural variations larger than 50 kb.

### Types of Structural Variations

| Structural Variation      | "Sequence"        |
|---------------------------|-------------------|
| Reference                 | A - B - C         |
| Deletion                  | A - C             |
| Insertion                 | A - E - B - C     |
| Inversion                 | A - C - B         |
| Tandem Duplication        | A - A - B - C     |
| Dispersed Duplication     | A - B - A - C     |
| Copy-Number Variant (CNV) | A - A - A - B - C |

### Detection from Short Reads

Read Pair (RP) information:
- We expect a certain distance between paired reads (an insert size of 400-500 bp)
- A deletion may cause the two reads of a pair to map "far apart" on the reference genome
- A mobile element insertion (MEI) may cause one read of the pair to fail to map
- A tandem duplication may cause read pairs to be observed in "reverse order"
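
The read-pair signals above can be sketched as a small classifier. This is a minimal illustration, not how any particular SV caller works: the function name, the strand encoding (`"+"`/`"-"`), the labels, and the 500 bp threshold are all assumptions chosen for the example.

```python
def classify_read_pair(left_strand, right_strand, span,
                       mate_mapped=True, max_expected_span=500):
    """Give a coarse label to one read pair based on its mapping geometry.

    left_strand / right_strand: strand of the leftmost and rightmost read
    span: distance on the reference covered by the pair, in bp
    max_expected_span: hypothetical upper bound on the normal insert size
    """
    if not mate_mapped:
        # One read has no placement on the reference (e.g. it lies in an MEI)
        return "mate_unmapped"
    if left_strand == "+" and right_strand == "-":
        # Normal orientation: reads point toward each other
        if span > max_expected_span:
            return "too_far_apart"   # consistent with a deletion
        return "concordant"
    if left_strand == "-" and right_strand == "+":
        # Reads point away from each other ("reverse order")
        return "everted"             # consistent with a tandem duplication
    return "other"


print(classify_read_pair("+", "-", 450))
print(classify_read_pair("+", "-", 5000))
print(classify_read_pair("-", "+", 450))
```

A single discordant pair is weak evidence; in practice many pairs supporting the same event would be needed before calling a variant.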

Read Depth (RD) information:
- Duplication may cause increased depth (higher coverage) in an area
- Deletion may cause a loss of depth (less coverage) in an area
- If we have a heterozygous deletion, we may lose about half of our coverage
- If we have a homozygous deletion, we may lose all of our coverage
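
As a rough sketch of the read-depth idea, the copy number of a window can be estimated from the ratio of its depth to the depth expected for a normal diploid region. The function names and the simple rounding rule are assumptions for illustration; real data would also need normalization for things like GC content and mappability.

```python
def estimate_copy_number(window_depth, expected_diploid_depth):
    """Estimate copy number from depth, assuming 2 copies give the expected depth."""
    ratio = window_depth / expected_diploid_depth
    return round(2 * ratio)


def label_window(window_depth, expected_diploid_depth):
    """Translate the estimated copy number into the categories from the notes."""
    copies = estimate_copy_number(window_depth, expected_diploid_depth)
    if copies == 0:
        return "homozygous deletion"    # (almost) all coverage lost
    if copies == 1:
        return "heterozygous deletion"  # about half the coverage lost
    if copies == 2:
        return "normal"
    return "duplication"                # higher coverage than expected


# Example with a hypothetical expected depth of 30x
for depth in (31, 14, 1, 46):
    print(depth, label_window(depth, 30))
```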

Split Reads (SR):
- We may be able to see a deletion by a gap in the split-read map
- Computationally expensive; performed by analysis software

Assembly (AS):
- We may be able to detect large structural variations through assembly of the sequence
- Computationally expensive; performed computationally from the sequencing data

### Complex SVs with Longer Reads

PacBio and Oxford Nanopore (for example, the MinION device) sequencing can produce longer reads.

Inversions in particular are very hard to find with shorter reads, since a read that falls entirely inside the inversion simply maps to the opposite strand. Longer-read sequencing makes this much more clear!

Short reads can also entirely miss some forms of variation, including some INDELs, so long-read sequencing is best for finding these.

### Variant QC

#### Variant Filtering

False discoveries can occur even if proper modeling of population-based data has been completed. These false discoveries affect the overall quality, not only for the problematic sites but many others in linkage disequilibrium (LD).

Indicators can include base read distribution, base quality, and mapping quality, but multi-sample statistics are often more informative.

Any time you think you've found something of interest, you need to validate it! The bigger the change (such as multiple different alterations to a single read), the less likely it is to be valid. 

If the variant is only in lower-quality reads, it should be suspected as a false discovery.

If all reads with the variant also have deletions in the same read (multiple mismatches in a single read), it should be suspected as a potentially false discovery or even contamination from an outside genome. 

#### Evaluation of SNP Callsets

Compare to chip genotypes for the same individuals
- GWAS or ExomeChip data
- This is the gold standard, if available
- May only have common variants

Sensitivity analysis on known sites
- HapMap, dbSNP, 1000 Genomes
- Have frequencies of known variants
- Has a greater volume of information than the chip genotypes

Transition-to-transversion ratio
- Transitions (purine <-> purine or pyrimidine <-> pyrimidine) are more likely, easier to occur:
    - A <-> G
    - C <-> T
- Transversions (purine <-> pyrimidine) are less likely, harder to occur:
    - A <-> C
    - A <-> T
    - G <-> C
    - G <-> T
- Typical (Ts/Tv or Ti/Tv) ratios:
    - Whole genome: 2.2-2.4
    - Whole exome: 2.7-3.1
- Biologically true SNPs typically have similar ratios
    - False discoveries typically have much *lower* ratios! 
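
A minimal sketch of computing Ti/Tv for a list of SNPs, using the standard definition that a transition stays within purines (A, G) or within pyrimidines (C, T). The function names and the `(ref, alt)` tuple format are assumptions for this example. Note that there are twice as many possible transversions as transitions, so purely random errors would give a ratio near 0.5, which is why a low ratio hints at false discoveries.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True if ref -> alt stays within purines or within pyrimidines."""
    if ref == alt:
        raise ValueError("ref and alt must differ for a SNP")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Ti/Tv ratio for an iterable of (ref, alt) single-base substitutions."""
    snps = list(snps)
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        return float("inf")
    return transitions / transversions


calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(calls))  # 3 transitions / 2 transversions
```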

Use family data to corroborate or look for Mendelian inconsistencies
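
A Mendelian consistency check for a parent-child trio can be sketched as follows. It assumes an autosomal, biallelic site with genotypes coded as the number of alternate alleles (0, 1, or 2); the function names are hypothetical.

```python
def transmittable(genotype):
    """Allele values (0 = ref, 1 = alt) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_consistent(father, mother, child):
    """True if the child's genotype can be formed from one allele per parent."""
    return any(
        f + m == child
        for f in transmittable(father)
        for m in transmittable(mother)
    )


print(is_mendelian_consistent(0, 0, 1))  # inconsistent: neither parent carries alt
print(is_mendelian_consistent(1, 1, 2))  # consistent
```

An inconsistency at a site usually points to a genotyping error in one of the three samples, although a true de novo mutation would look the same.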

## Alleles, Frequencies, and Linkage Disequilibrium

### Alleles:

Humans, like many other species, are diploid. As such, they (typically) possess two alleles for each gene. There may be many different possible alleles for any given gene, but a diploid organism would only have two alleles.

- Major or dominant alleles are typically written in uppercase: A/B
- Minor or recessive alleles are typically written in lowercase: a/b

#### Allele Frequency:

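Allele Frequency for a Diploid:
$$
P(a) = \text{probability of the } a \text{ allele in the population}
\newline
P(a) = \frac{2 \cdot N(aa) + N(Aa)}{2 \cdot N_{ind}}
$$

Here N(aa) and N(Aa) are the numbers of individuals with those genotypes, and N<sub>ind</sub> is the total number of individuals. Each individual carries two alleles, so the denominator is 2 · N<sub>ind</sub>.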
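
The allele frequency formula can be sketched directly from genotype counts. The function and parameter names are assumptions for this example.

```python
def allele_frequency_a(n_AA, n_Aa, n_aa):
    """Frequency of the 'a' allele from diploid genotype counts.

    Each aa individual contributes two 'a' alleles, each Aa individual one,
    and every individual carries two alleles in total.
    """
    n_ind = n_AA + n_Aa + n_aa
    if n_ind == 0:
        raise ValueError("at least one individual is required")
    return (2 * n_aa + n_Aa) / (2 * n_ind)


# 50 AA, 40 Aa, 10 aa individuals: (2*10 + 40) / (2*100)
print(allele_frequency_a(50, 40, 10))
```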

#### Allele Distribution:

Let's declare p to be the frequency of the dominant/major allele (A), and q to be the frequency of the recessive/minor allele (a):
$$
P(A) \stackrel{\Delta}= p, P(a) = 1-p\stackrel{\Delta}= q
$$

Then the probability of each diploid genotype can be described as:
$$
P(AA) = p^2
\newline
P(Aa\ or\ aA) = pq + qp = 2pq
\newline
P(aa) = q^2
\newline
P(AA) + P(Aa\ or\ aA) + P(aa) = p^2 + 2pq + q^2 = (p + q)^2 = 1
$$

This is also known as Hardy-Weinberg equilibrium: if mating is random, and there are no outside influences or factors, these probabilities will be preserved throughout the generations.

In reality, there are usually influences and outside factors, such as those one might discuss as "natural selection."
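
One way to see whether a sample departs from Hardy-Weinberg proportions is to compare observed genotype counts with the counts expected from p² , 2pq, and q². The sketch below uses a chi-square statistic for that comparison; the function names are assumptions, and the chi-square test is a common choice rather than something prescribed by these notes.

```python
def hwe_expected_counts(n_AA, n_Aa, n_aa):
    """Expected (AA, Aa, aa) counts under Hardy-Weinberg equilibrium."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of A
    q = 1 - p                         # frequency of a
    return p * p * n, 2 * p * q * n, q * q * n


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic comparing observed counts with HWE expectations."""
    observed = (n_AA, n_Aa, n_aa)
    expected = hwe_expected_counts(n_AA, n_Aa, n_aa)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Counts that match HWE for p = 0.7 give a statistic near zero
print(hwe_chi_square(49, 42, 9))
# Too few heterozygotes for the same allele frequency gives a large statistic
print(hwe_chi_square(60, 20, 20))
```

With one degree of freedom, a statistic above about 3.84 corresponds to p < 0.05. In variant QC, a strong deviation is often a sign of genotyping problems rather than real selection.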

### Linkage Disequilibrium

#### Haplotype vs Genotype

**Diploids**: Organisms that contain two homologous copies of each chromosome. For organisms that reproduce sexually, one copy is typically inherited from each parent.

**Haplotype**: The set of alleles on one copy of a chromosome (one side of the homologous pair).

**Genotype**: The pair of alleles, one from each chromosome copy, at a specific marker.

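The genotype for a biallelic gene (a gene with only two possible alleles) could be conceptualized as follows. Coding the two alleles as 0 and 1, the genotype at each marker is the sum of the two haplotype values (0, 1, or 2):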

| Genotype:   | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 0 | 2 |
|-------------|---|---|---|---|---|---|---|---|---|
| Haplotype 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| Haplotype 2 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
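
The table can be reproduced with a few lines of Python: with alleles coded 0 and 1, the genotype is the element-wise sum of the two haplotypes. The second pair of haplotypes below is a made-up alternative phasing, included only to show that different haplotypes can yield the same genotype.

```python
def genotype_from_haplotypes(hap1, hap2):
    """Element-wise sum of two 0/1-coded haplotypes."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must cover the same markers")
    return [a + b for a, b in zip(hap1, hap2)]


# Haplotypes from the table
hap1 = [0, 1, 1, 0, 1, 0, 0, 0, 1]
hap2 = [1, 0, 1, 1, 0, 1, 1, 0, 1]

# Hypothetical alternative phasing: alleles at the first marker are swapped
alt1 = [1, 1, 1, 0, 1, 0, 0, 0, 1]
alt2 = [0, 0, 1, 1, 0, 1, 1, 0, 1]

print(genotype_from_haplotypes(hap1, hap2))
print(genotype_from_haplotypes(hap1, hap2) == genotype_from_haplotypes(alt1, alt2))
```

Going from haplotypes to a genotype is a simple sum, but the reverse direction is ambiguous at every heterozygous marker, which is why phasing requires statistical inference.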

Two diploid organisms with the same genotype may have different haplotypes.

**Subject 1**

| Genotype    | Aa | Bb |
|-------------|----|----|
| Haplotype 1 | A  | B  |
| Haplotype 2 | a  | b  |

**Subject 2**

| Genotype    | Aa | Bb |
|-------------|----|----|
| Haplotype 1 | A  | b  |
| Haplotype 2 | a  | B  |

Haplotype is often biologically important! Alleles on the same haplotype can act together, especially if they include enhancers, promoters, or any other type of regulatory mechanism. Additionally, alleles on the same haplotype are correlated.

#### Linkage Disequilibrium is Due to Haplotype Linkage!

If P(A & B) is the haplotype frequency, then we can define linkage disequilibrium as:
$$
P(A \And B) \not= P(A)\cdot P(B)
$$

This occurs when the two genes are *linked* - their assortment is not fully independent! In contrast, when they *are* independently sorted, we say that the alleles are in *linkage equilibrium*.
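
The definition can be turned into a small calculation. The sketch below estimates the haplotype and allele frequencies from a list of phased haplotypes and reports D = P(A & B) - P(A)·P(B) together with r², two standard summaries of LD that are not defined elsewhere in these notes. The two-character haplotype strings and the function name are assumptions for this example.

```python
def ld_from_haplotypes(haplotypes):
    """Return (D, r_squared) for two biallelic loci.

    haplotypes: list of two-character strings such as "AB", "Ab", "aB", "ab",
    where the first character is the allele at locus 1 and the second the
    allele at locus 2.
    """
    n = len(haplotypes)
    p_A = sum(h[0] == "A" for h in haplotypes) / n
    p_B = sum(h[1] == "B" for h in haplotypes) / n
    p_AB = sum(h == "AB" for h in haplotypes) / n

    d = p_AB - p_A * p_B   # zero under linkage equilibrium
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


# A always travels with B: complete LD
print(ld_from_haplotypes(["AB", "AB", "ab", "ab"]))
# All four combinations equally common: linkage equilibrium
print(ld_from_haplotypes(["AB", "Ab", "aB", "ab"]))
```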

#### Haplotype Phasing/Imputation

Due to linkage disequilibrium, we know that people tend to share "blocks" of genotypes: highly linked genes are more likely to be inherited together in a shared "block" than individually. As such, we can check for unlikely genotypes or fill in missing genotypes.

Haplotype phasing involves statistically estimating haplotypes from genotype data; the phased haplotypes can then be used for genotype imputation. If we have missing information in the middle of a haplotype, we can use the known haplotypes to estimate the most probable values for the "missing" information, given the rest of the haplotype block.
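
To make the idea concrete, here is a deliberately simplified toy: it fills in missing alleles by looking at reference haplotypes that agree with every observed allele and taking the most common allele among them. This is an assumption-laden illustration only; real imputation tools use more sophisticated statistical models, and the function name and data layout here are hypothetical.

```python
from collections import Counter


def impute_missing(observed, reference_panel):
    """Fill None entries in a 0/1-coded haplotype using matching reference haplotypes."""
    imputed = list(observed)

    # Keep only reference haplotypes that agree with every observed allele
    matches = [
        ref for ref in reference_panel
        if all(o is None or o == r for o, r in zip(observed, ref))
    ]
    if not matches:
        return imputed  # nothing to learn from; leave the gaps as they are

    for i, allele in enumerate(observed):
        if allele is None:
            counts = Counter(ref[i] for ref in matches)
            imputed[i] = counts.most_common(1)[0][0]
    return imputed


panel = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
]
print(impute_missing([0, 1, None, 0], panel))
```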

Why use Imputation?

- Increased power:
    - The reference panel is far more likely to contain either the causal variant or a better tag than a GWAS array
- Fine-mapping:
    - Imputation provides a high-resolution overview of an association signal across a locus
- Meta-analysis:
    - Imputation allows GWAS typed on different arrays to be combined at the variants in the reference panel
    - Sometimes this even allows for mega-analysis

Large-scale genotyping and re-sequencing reference panels are available: 
- HapMap Consortium
    - HapMap2: 60 CEU, 60 YRI, and 90 CHB/JPT individuals typed for ~3M variants
    - HapMap3: 1011 individuals from multiple ethnic groups typed for ~1.6M variants
- 1000 Genomes Project
    - Most recent release includes 2,504 individuals from multiple ethnic groups typed for ~79M variants, including SNPs, INDELs, and SVs

There are also some recent WGS reference panels with even larger (> 10,000) sample sizes!
- NCBI has a reference panel in the works that has over 100,000 samples
- Many of them are not yet available to the public

## Genetic Association Studies

Genetic association studies seek to identify genetic variants associated with diseases and traits. This allows us to improve our understanding of the genetic mechanisms underlying these diseases and traits, identify potential drug targets for new therapies, or develop screening techniques for individuals at risk.

You can compare the genotypes to the distributions of the phenotypes, and perform statistical testing to determine if there is an association.

A linear regression is typically performed with the genotype category as the predictor and the continuous phenotype variable as the response variable.

- N samples
    - Each sample has a phenotype value: y<sub>1</sub>, y<sub>2</sub>, ..., y<sub>N</sub>
- M markers
    - Each marker has a genotype for every sample
- Coefficients
    - β<sub>1</sub>, β<sub>2</sub>, ..., β<sub>M</sub>, one per marker, each estimating the effect of that marker's genotype on the phenotype
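
For a single marker, the regression can be sketched as ordinary least squares of the phenotype on the genotype, with the genotype coded as 0, 1, or 2 copies of an allele. The function name and the toy data are assumptions for this example; a real analysis would also compute a p-value for the slope and usually adjust for covariates.

```python
def single_marker_regression(genotypes, phenotypes):
    """Fit phenotype = intercept + slope * genotype by least squares."""
    n = len(genotypes)
    if n != len(phenotypes):
        raise ValueError("need one phenotype per genotype")

    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n

    # Sum of squares of genotype, and cross-products with the phenotype
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_gy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    if s_gg == 0:
        raise ValueError("marker is monomorphic in this sample")

    slope = s_gy / s_gg                 # estimated per-allele effect (beta)
    intercept = mean_y - slope * mean_g
    return intercept, slope


# Toy data: each extra allele copy raises the phenotype by 0.5
genotypes = [0, 1, 2, 0, 1, 2]
phenotypes = [1.0, 1.5, 2.0, 1.0, 1.5, 2.0]
print(single_marker_regression(genotypes, phenotypes))
```

In a genome-wide scan this fit is repeated for each of the M markers, giving one slope estimate per marker.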

Remember, correlation does not indicate causation!

Several genes can appear significant only because they are in the *same haplotype* as a causal gene, and are therefore in linkage disequilibrium with it!