# Module 12: Population Structure, Annotation

**Note: Genomics vs Genetics:**

*Genomics* refers to the study of genomes, while *genetics* refers to the study of heritable phenomena. In this way, *genomics* refers to studying differences between genomes, while *genetics* may refer to studying which genetic risk factors contribute to a heritable phenomenon.

*Genetics* can involve multiple generations. Genomics is involved in genetics, but genetics is not always involved in genomics.

## Genetic Association Studies

**Genetic association studies** aim to link certain genetic markers to specific phenotypes.

### Phenotypes

*Phenotypes* may be binary (qualitative) or continuous (quantitative).
- Binary (qualitative) phenotypes are answered by a yes/no question, such as "Is this a case or control?"
- Continuous (quantitative) phenotypes have a numerical answer that comes from a range of possibilities, like fasting glucose level or systolic blood pressure.

### Linkage Studies

Linkage studies have been the traditional approach to identifying genes for human traits and diseases since the early genomic studies of the 1980s. They work best for Mendelian diseases, like Huntington's, where there is clear co-segregation of genetic markers with disease within pedigrees.

For complex traits, which are more common, linkage analysis has been less successful because the relationship between genotype and phenotype is less clear. Complex interactions between multiple genes, and/or between genes and non-genetic risk factors, influence the outcome phenotype.

### Association Studies

Association studies operate under the common disease, common variant hypothesis: complex traits are determined by variants that occur frequently in the population, but each has only a small individual impact.

Comparing allele frequencies between cases and controls (or mean trait values between alleles for quantitative traits) is a powerful approach to identifying loci that contribute to complex traits, provided the sample size is sufficient.
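
As a minimal sketch of the case/control comparison described above, the following computes a Pearson chi-squared statistic on a 2x2 table of allele counts. The function name and the example counts are illustrative assumptions, not data from any study, and real analyses would normally use dedicated software such as PLINK.

```python
import math


def allelic_chi_square(case_a1, case_a2, control_a1, control_a2):
    """Pearson chi-squared test (1 df) on a 2x2 table of allele counts."""
    table = [[case_a1, case_a2], [control_a1, control_a2]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_a1 + control_a1, case_a2 + control_a2]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if allele and case status were independent
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 500 cases and 500 controls, two alleles per person
chi2, p_value = allelic_chi_square(
    case_a1=300, case_a2=700, control_a1=240, control_a2=760
)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

The test treats each allele as an independent observation, which is a simplification that holds only under Hardy-Weinberg equilibrium.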

This relies on an understanding of "genetic architecture": the number, frequency, and effect size of the alleles that contribute to a phenotype.
- Rare alleles causing Mendelian disease are more easily identified, typically through *family-based sequencing*
- Common variants of small effect can be implicated in common disease through well-powered samples and *GWAS*
- Common variants of high effect are rare, and may be identified through some *array-based GWAS*
    - These are more common for relatively mild, well-tolerated phenotypes than for severe disease
- Low-frequency variants of intermediate effect are identified through *population sequencing, GWAS with dense reference imputation, and specialized array genotyping*
- Rare variants of small effect are very hard to identify through genetic means, and may require deep genome analyses with very large samples

#### Sources of Association

Causal association is *best*:
- Genetic marker alleles influence susceptibility

Linkage disequilibrium is *useful*:
- Genetic marker alleles associated with other nearby alleles that influence susceptibility

Population structure is *misleading*:
- Genetic marker alleles are unrelated to disease alleles
- The association arises only because allele frequencies and disease rates both differ between the subpopulations that were sampled
- The risk is higher when a study samples a heterogeneous population, because ancestry can act as a confounder!

Avoiding population structure issues:
- Avoid stratification by design
    - Collect a sample matched by ancestry
    - Use family-based controls, such as the Transmission Disequilibrium Test (TDT)
- Analyze association by population groups
    - Use self-reported ethnicity, or genetic markers
    - Carry out association analysis within each group
- Account for an inflated false-positive rate
    - Apply genomic control
    - Adjust for population principal components
    - Use a variance component model for family-based association tests
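
To illustrate why population structure misleads and why analyzing within groups helps, here is a small sketch with made-up allele counts. Two hypothetical populations differ in both allele frequency and disease prevalence, yet within each population the allele is unrelated to disease. All names and numbers are assumptions chosen for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Allele counts as (case_A1, case_A2, control_A1, control_A2).
# Population A: allele frequency 0.8, many cases.
# Population B: allele frequency 0.2, few cases.
# Within each population the allele frequency is identical in cases and controls.
population_a = (1280, 320, 320, 80)
population_b = (80, 320, 320, 1280)

# Pooling ignores ancestry and simply adds the two tables together.
pooled = tuple(x + y for x, y in zip(population_a, population_b))

chi2_a = chi_square_2x2(*population_a)
chi2_b = chi_square_2x2(*population_b)
chi2_pooled = chi_square_2x2(*pooled)

print(f"within A: {chi2_a:.1f}, within B: {chi2_b:.1f}, pooled: {chi2_pooled:.1f}")
```

The within-group statistics show no association, while the pooled statistic is very large. The pooled signal comes entirely from mixing the two populations, not from any effect of the allele on disease.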

#### Genomic Inflation Factor

The genomic inflation factor (λ) is calculated from the chi-squared (χ<sup>2</sup>) association statistics across all markers.

1. Compute the Chi-Square statistic (χ<sup>2</sup>) for each marker
2. Calculate the Genomic inflation factor (λ)
$$
\lambda = \frac{\text{Median Observed } \chi^2}{\text{Median Expected } \chi^2}
$$
where the median expected χ<sup>2</sup> is approximately 0.456 for a χ<sup>2</sup> distribution with 1 degree of freedom
3. Adjust statistic at candidate markers
$$
\chi^2_{fair} = \frac{\chi^2_{biased}}{\lambda}
$$
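
The three steps above can be sketched in a few lines of Python. This is a minimal illustration that uses the 0.456 median from the text; the function names and the example statistics are assumptions made for demonstration.

```python
import statistics

# Median of the expected chi-squared distribution (1 df), as given in the text
MEDIAN_EXPECTED_CHI2 = 0.456


def genomic_inflation_factor(chi2_stats):
    """Lambda: median observed chi-squared over median expected chi-squared."""
    return statistics.median(chi2_stats) / MEDIAN_EXPECTED_CHI2


def genomic_control_adjust(chi2_stats):
    """Divide every statistic by lambda to correct for inflation."""
    lam = genomic_inflation_factor(chi2_stats)
    return [x / lam for x in chi2_stats]


# Hypothetical per-marker chi-squared statistics (step 1 is assumed done)
observed = [0.2, 0.5, 0.684, 1.1, 12.0]
lam = genomic_inflation_factor(observed)
adjusted = genomic_control_adjust(observed)

print(f"lambda = {lam:.2f}")
print(f"candidate marker: {observed[-1]:.1f} -> {adjusted[-1]:.1f}")
```

In practice λ is estimated from many thousands of markers, so that the median reflects the null distribution rather than a handful of true signals.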


### Principal Component Analysis (PCA)

PCA determines the "axes of genotype variation" for a selected set of genotypes.
- In European samples, the leading principal components often mirror geography
- By including PCs as covariates in regression models, you can adjust for stratification
- Requires linear algebra (an eigendecomposition or singular value decomposition of the genotype matrix)
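
As a rough sketch of how the axes of genotype variation can be computed, the following applies a singular value decomposition to a centered genotype matrix using NumPy. The simulated data, the allele frequencies, and the function name are assumptions for illustration; production analyses would typically rely on PLINK or similar tools and would also prune markers in linkage disequilibrium.

```python
import numpy as np


def genotype_pcs(genotypes, n_components=2):
    """Return each individual's scores on the top principal components.

    genotypes: array of shape (individuals, markers) coded as 0, 1, or 2.
    """
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)  # center each marker at its mean
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Simulate two hypothetical populations with very different allele frequencies
rng = np.random.default_rng(0)
pop_a = rng.binomial(2, 0.9, size=(20, 50))
pop_b = rng.binomial(2, 0.1, size=(20, 50))

pcs = genotype_pcs(np.vstack([pop_a, pop_b]))
print(pcs[:3, 0], pcs[-3:, 0])  # PC1 scores for a few individuals from each group
```

With frequencies this different, the first component separates the two groups. The resulting score columns are what would be added as covariates to a regression model.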

### Variance Component Model for Family-Based Association Test

Population-based analysis assumes that phenotypes are uncorrelated between individuals under the null:
$$
y \thicksim N(X \beta,\sigma^2 I)
$$

Family-based analysis assumes that phenotypes are correlated with relatives' phenotypes through a kinship coefficient (K<sub>ij</sub>):
$$
y \thicksim N(X \beta,\sigma^2_g K + \sigma^2_e I)
$$

Similar models for population-based analysis can account for the distant relationships inferred from genotypes with a genotype-based kinship coefficient: $$ \hat{K}_{ij} $$

$$
y \thicksim N(X \beta,\sigma^2_g \hat{K} + \sigma^2_e I)
$$

This approach is particularly useful when correction with principal components is not adequate, but it is much more computationally demanding.
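
To make the covariance structure concrete, here is a sketch that estimates a genotype-based relationship matrix and then builds the phenotype covariance from it. The estimator shown (standardized genotypes, averaged over markers) is one common choice and is an assumption here, since the text does not specify how the kinship estimate is computed. The genotypes and variance components are likewise made up.

```python
import numpy as np


def genotype_relationship_matrix(genotypes):
    """Estimate pairwise relatedness from genotypes coded as 0, 1, or 2."""
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0           # sample allele frequency per marker
    keep = (freq > 0) & (freq < 1)        # drop monomorphic markers
    g, freq = g[:, keep], freq[keep]
    z = (g - 2 * freq) / np.sqrt(2 * freq * (1 - freq))  # standardize markers
    return z @ z.T / z.shape[1]


def phenotype_covariance(k_hat, sigma2_g, sigma2_e):
    """Covariance sigma2_g * K_hat + sigma2_e * I from the model above."""
    return sigma2_g * k_hat + sigma2_e * np.eye(k_hat.shape[0])


# Hypothetical genotypes: 4 individuals x 5 markers
genotypes = [
    [0, 1, 2, 1, 0],
    [0, 1, 2, 1, 1],
    [2, 1, 0, 1, 2],
    [2, 0, 0, 2, 2],
]
k_hat = genotype_relationship_matrix(genotypes)
v = phenotype_covariance(k_hat, sigma2_g=0.6, sigma2_e=0.4)
print(np.round(k_hat, 2))
```

Fitting the variance components and testing each marker against this covariance is the computationally demanding part, which is why dedicated tools exist for it.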

### Software

[PLINK](https://www.cog-genomics.org/plink2) can be used for PCA
- Version 1.9 takes VCF as input
- Is the "Swiss Army knife" of genomic analysis

Mixed-Model based software:
- [GMMAT](https://rdrr.io/github/hanchenphd/GMMAT/man/GMMAT-package.html)
    - Designed by Dr. Chen at UTH!
    - R-based
- [EMMAX, included in EPACTS](https://genome.sph.umich.edu/wiki/EPACTS)

## Functional Annotation

Annotation is typically added to the VCF file, where the biological interpretation of each variant is attached to the data.

- Coding variants
    - Protein sequence change (silent, missense, nonsense)
    - Splice site variant
    - Frameshift INDELs
- Noncoding variants
    - Somewhat based on guess work and inferences
    - Evolutionary interpretation (conserved region)
    - Regulatory elements, promoters

### Annotation Scores

Predictors for protein-coding variants:
- [SIFT](http://sift.bii.a-star.edu.sg/) is a constraint-based predictor that draws on evolutionary and (indirect) biochemical data
- [PolyPhen](http://genetics.bwh.harvard.edu/pph2/) is a trained classifier that draws on evolutionary, biochemical, and structural data

Scores for non-coding variants:
- [GERP](http://mendel.stanford.edu/SidowLab/downloads/gerp/index.html) provides single-site scoring with an evolutionary interpretation
- [phyloP](http://compgen.bscb.cornell.edu/phast) also provides single-site scoring with an evolutionary interpretation

Combined Annotation-Dependent Depletion (CADD)
- Provides information on both Coding and Non-Coding Variations
- An algorithm that integrates multiple types of evidence into a single score
    - Conservation score
    - Epigenetic information
    - Protein function scores for coding variants
- Trains a support vector machine to distinguish simulated variants from observed variants
- Variants that appear in the simulation but are not observed are likely deleterious, since selection has probably removed them from the population

#### Using Catalogued Variants

Allele frequencies for SNVs and INDELs:
- 1000 Genomes Project Phase 3 (~2,500 whole genomes)
- UK10K cohorts
- NHLBI Trans-Omics for Precision Medicine (TOPMed)
- ExAC consortium (60,706 whole exomes)
- gnomAD consortium (123,136 exomes and 15,496 whole genomes)

Disease-related SNVs and INDELs:
- ClinVar, with confidence levels (stars)
- GWAS Catalog
- GRASP 2.0

Regulatory regions (versioning is much more volatile than for coding databases):
- ENCODE
    - TFBS, DNase clusters, genome segmentations
    - 6 cell lines
- FANTOM5
    - Predicted enhancers and promoters
    - Enhancer target genes
- Ensembl Regulatory Build (ENCODE+Roadmap+Blueprint)
    - TFBS, cell type specific activity prediction
    - 68 cell types (ACTIVE, POISED, REPRESSED, INACTIVE)
- Roadmap peak calls
    - ~1000 datasets (127 epigenomes x histone modification)

### Annotation Software

Each tool performs annotation slightly differently, and may annotate the same site differently. Some software pipelines employ multiple tools at once to get the most comprehensive view of the data. Regardless, it's vital to ensure that the latest versions of both the software *and* the databases are used.

ANNOVAR is one of the most commonly used annotation software tools, and has documentation to assist in use. It uses VCF files.

SnpEff is another annotation tool that is available for download.

Variant Effect Predictor (VEP) is another commonly used annotation tool. It was originally developed as part of the Ensembl project. It offers a web-based tool for smaller files, as well as visualization.