## Introduction

### What is natural selection

Natural selection is the process by which individuals in a population survive and reproduce at different rates because of differences in their phenotypes.

There are different types of natural selection, for example:
- Purifying selection
- Positive selection

<img src="https://d3i71xaburhd42.cloudfront.net/326581abd3371faf63fad44f6c31b833854cb37f/3-Figure1-1.png" width="800"/>

**Figure 1 Genomic signatures of positive selection.** This figure is from [Biswas and Akey (2006)](https://doi.org/10.1016/j.tig.2006.06.005).

### Extended haplotype homozygosity (EHH) and integrated haplotype score (iHS)

EHH is a statistic that measures the reduction of haplotype diversity around a SNP. It is informative because positive selection can reduce the diversity around a variant under selection, as shown in Figure 1.

iHS compares the amount of EHH at a given SNP along haplotypes carrying the ancestral allele with the amount along haplotypes carrying the derived allele.
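
As a rough sketch of the definitions behind these two statistics, the usual formulation can be written as follows. The symbols $n_c$, $n_h$ and $\mathcal{H}_c(x)$ are introduced here only for illustration, and the iHS expression follows the convention of Voight et al. (2006); software implementations may use a different sign or logarithm base.

$$
\mathrm{EHH}_c(x) = \sum_{h \in \mathcal{H}_c(x)} \frac{\binom{n_h}{2}}{\binom{n_c}{2}}, \qquad
\mathrm{iHH}_c = \int \mathrm{EHH}_c(x)\,dx, \qquad
\text{unstandardized iHS} = \ln\frac{\mathrm{iHH}_A}{\mathrm{iHH}_D}
$$

Here $n_c$ is the number of chromosomes carrying core allele $c$, $\mathcal{H}_c(x)$ is the set of distinct haplotypes extending from the core SNP to position $x$ among those chromosomes, and $n_h$ is the number of copies of haplotype $h$. The subscripts $A$ and $D$ denote the ancestral and derived alleles.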

<img src="https://journals.plos.org/plosbiology/article/figure/image?size=medium&id=10.1371/journal.pbio.0040072.g006">

**Figure 2 Signals of selection for three candidate selection regions using EHH and iHS.** This figure is from [Voight et al. (2006)](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.0040072).


## selscan pipeline

In this tutorial, we will use `selscan` to calculate the iHS scores in the *LCT* region from the CEU population.

The *LCT* gene encodes lactase, an enzyme that digests lactose. It shows some of the strongest signals of positive selection in Europeans ([Gerbault 2014](https://doi.org/10.1159/000360136)).

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Lactose_tolerance_in_the_Old_World.svg/1280px-Lactose_tolerance_in_the_Old_World.svg.png" width="500">

**Figure 3 An estimate of the percentage of adults that can digest lactose in the indigenous population of the Old World.** This figure is from https://en.wikipedia.org/wiki/Lactose_intolerance.

### Download selscan

We can download `selscan` with `git clone`.

```
git clone https://github.com/szpiech/selscan
```

### Run selscan

Here, we provide a VCF file containing the *LCT* gene with 1 Mb flanking regions from the CEU population, and a map file containing the information for each variant in the VCF file.
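
Before running `selscan`, it can help to confirm that a map file is well formed. The sketch below assumes the four-column layout described in the `selscan` manual (chromosome, variant ID, genetic position, physical position). It runs on a toy file, `toy.map`, which is a hypothetical name; the genetic positions in it are made up purely for illustration.

```shell
# Toy map file: chromosome, variant ID, genetic position, physical position.
# The genetic positions (third column) are invented for illustration only.
cat > toy.map <<'EOF'
2 rs16831345 0.10 135871914
2 rs2034840 0.11 135876308
2 rs55807439 0.12 135886744
EOF

# Report rows that do not have four fields or whose physical position
# is smaller than that of the previous row; exit non-zero if any are found.
if awk 'NF != 4 { print "line " NR ": expected 4 fields"; bad = 1 }
        NR > 1 && $4 < prev { print "line " NR ": position out of order"; bad = 1 }
        { prev = $4 }
        END { exit bad }' toy.map
then
  map_status="ok"
else
  map_status="bad"
fi
echo "map check: $map_status"
```

To check the tutorial data, the same `awk` command could be pointed at `./data/chr2.CEU.LCT.flank.1mb.biallelic.snps.map` instead of `toy.map`.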

Then we can calculate the iHS scores in this dataset with the following command.

```
./selscan/bin/linux/selscan --vcf ./data/chr2.CEU.LCT.flank.1mb.biallelic.snps.vcf.gz --map ./data/chr2.CEU.LCT.flank.1mb.biallelic.snps.map --ihs
```

If you are using macOS, use the following command instead.

```
./selscan/bin/osx/selscan --vcf ./data/chr2.CEU.LCT.flank.1mb.biallelic.snps.vcf.gz --map ./data/chr2.CEU.LCT.flank.1mb.biallelic.snps.map --ihs
```

### Check results

`selscan` produces two output files. One is a log file that stores a summary of the analysis. The other, whose name ends with `.out`, contains the results, and this is the file we should look at.

An example is shown below.

```
rs16831345      135871914       0.0858586       206664  253975  -0.0895271
rs2034840       135876308       0.0858586       206664  254076  -0.0897007
rs55807439      135886744       0.0808081       232011  251582  -0.0351707
rs2305595       135887893       0.0858586       208931  254358  -0.0854418
```
- The first column is the name of the variant.
- The second column is the position of the variant.
- The third column is the frequency of allele 1.
- The fourth column is the iHH1 score.
- The fifth column is the iHH0 score.
- The sixth column is the unstandardized iHS score.

More details can be found in the [manual](https://github.com/szpiech/selscan/blob/master/manual/selscan-manual.pdf).
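
As a quick sanity check on how the columns relate, the sixth column in the example rows above appears to equal the base-10 logarithm of iHH1 divided by iHH0. This relationship is inferred from the four rows shown here, so treat it as an assumption and consult the manual for the authoritative definition. The file names `example.ihs.out` and `recomputed.tsv` are hypothetical.

```shell
# Copy of the four example rows shown above.
cat > example.ihs.out <<'EOF'
rs16831345 135871914 0.0858586 206664 253975 -0.0895271
rs2034840 135876308 0.0858586 206664 254076 -0.0897007
rs55807439 135886744 0.0808081 232011 251582 -0.0351707
rs2305595 135887893 0.0858586 208931 254358 -0.0854418
EOF

# For each variant, print its name, log10(iHH1 / iHH0) recomputed from
# columns 4 and 5, and the score reported in column 6.
awk '{ printf "%s\t%.6f\t%.6f\n", $1, log($4 / $5) / log(10), $6 }' \
  example.ihs.out > recomputed.tsv
cat recomputed.tsv
```

Small differences in the last digits are expected, because the iHH values in the output are rounded.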

To explore the results quickly, we can sort the variants by their unstandardized iHS scores and take the variants with the highest scores as candidate variants under selection.

```
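# Show the 100 variants with the highest unstandardized iHS scores (column 6)
sort -rnk 6,6 outfile.ihs.out | head -100
# Check whether the two known LCT-associated SNPs are among those 100 variants
sort -rnk 6,6 outfile.ihs.out | head -100 | awk '$1~/rs4988235/'
sort -rnk 6,6 outfile.ihs.out | head -100 | awk '$1~/rs182549/'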
```

For example, we can see that two classic SNPs associated with lactose intolerance in Europeans, [rs4988235](https://www.snpedia.com/index.php/Rs4988235) and [rs182549](https://www.snpedia.com/index.php/Rs182549), are among the top candidates.
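
Sorting on the signed score surfaces only large positive values. If large negative scores are also of interest, one option is to rank by absolute value instead. The following is a minimal sketch, run on a copy of the four example rows from the output shown earlier; the file names `example.ihs.out` and `ranked_by_abs.tsv` are hypothetical.

```shell
# Copy of the four example rows from the selscan output shown earlier.
cat > example.ihs.out <<'EOF'
rs16831345 135871914 0.0858586 206664 253975 -0.0895271
rs2034840 135876308 0.0858586 206664 254076 -0.0897007
rs55807439 135886744 0.0808081 232011 251582 -0.0351707
rs2305595 135887893 0.0858586 208931 254358 -0.0854418
EOF

# Print variant name, signed score, and absolute score,
# then sort on the absolute score (third column) in descending order.
awk '{ s = $6 + 0; if (s < 0) s = -s; printf "%s\t%s\t%.6f\n", $1, $6, s }' \
  example.ihs.out | sort -k3,3nr > ranked_by_abs.tsv
head -n 2 ranked_by_abs.tsv
```

On real results, `example.ihs.out` would be replaced with `outfile.ihs.out`, and `head -n 2` with a larger cutoff such as `head -n 100`.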