![banner](https://anopheles-genomic-surveillance.github.io/_images/banner.jpg)

***Training course in data analysis for genomic surveillance of African malaria vectors - Workshop 6***

---

# Module 3 - Detecting new forms of insecticide resistance using selection scans

**Theme: Analysis**

In this module we're going to learn how to scan the genome for signals of recent selection with the H12 statistic, using haplotype variation data from Ag3.0.

We will use functions in the `malariagen_data` Python package to calibrate, run and plot H12 genome-wide selection scans. We will then learn how to interpret and investigate the results to detect candidate genes potentially involved in new forms of insecticide resistance.

## Learning objectives

At the end of this module you will be able to:

* Calibrate the window size parameter for an H12 analysis.
* Run, plot and interpret a genome scan for recent selection using the H12 statistic.
* Identify candidate genes that are under selection.

## Lecture

### English

In [1]:
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed/ItyVxRTorJ8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

### Français

In [2]:
%%html
@@TODO

## Setup

First, let’s begin by installing and importing some Python packages, and configuring access to _Anopheles_ genomic data from the MalariaGEN Ag3.0 data resource.

In [1]:
!pip install -q malariagen_data

In [2]:
import malariagen_data

### Saving H12 results

H12 genome-wide scans for selection may take a while to complete, particularly if you’re running this code on a service with modest computational resources such as Google Colab.

To avoid having to rerun these analyses, we’ll save the results so we can come back to them later. In Google Colab, you can save results to your Google Drive, which will mean you don’t lose results even if you leave the notebook and come back several days later.

When mounting your Google Drive you will need to follow the authorization instructions.

In [None]:
from google.colab import drive
drive.mount("drive")

With our Google Drive now mounted, we can define and make a directory where we want to save our H12 results.

In [3]:
results_dir = "drive/MyDrive/malariagen_data_cache"

In Google Colab, we can actually see our mounted drive and H12 results directory by clicking on the file tab on the left hand side of the screen.

Next we should set up the `malariagen_data` package. As we want to save our H12 results in the Google Drive folder we just created, we’ll use the `results_cache` parameter and assign our results directory to it. If we were running this notebook locally, then we could assign a local folder to this parameter and the H12 results would instead get stored on our hard drive.

In [4]:
ag3 = malariagen_data.Ag3(results_cache=results_dir)
ag3

MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs.
Storage URL,gs://vo_agam_release/
Data releases available,3.0
Results cache,/home/jovyan/github/anopheles-genomic-surveillance/anopheles-genomic-surveillance.github.io/docs/workshop-6/drive/MyDrive/malariagen_data_cache
Cohorts analysis,20220608
Species analysis,aim_20220528
Site filters analysis,dt_20200416
Software version,malariagen_data 6.1.0
Client location,"Iowa, US (Google Cloud)"


## Why do we need to scan the genome for signals of recent selection?

Mosquitoes are exposed to high levels of insecticides both directly via malaria vector control campaigns and more widely through agricultural use of these chemicals. The insecticides being used are also changing, as new compounds are developed. The high strength of selective pressure exerted by these insecticides is evidenced by the speed at which insecticide resistance evolves in mosquitoes.

Historically, molecular techniques were used to discover genes and SNPs associated with insecticide resistance, and produce the assays necessary to track them in natural populations. These processes are skill and time intensive, often taking years to discover new loci. 

Clearly these kinds of time scales are not operationally feasible when it comes to public health. The purchase and use of insecticides need to be informed by recent resistance data, to ensure that the vector control chemicals can be changed before high levels of resistance cause them to become less effective or even fail completely.

### Known unknowns

In this series of workshops, we have already seen how to use genomic data to investigate regions of the genome that we know are associated with insecticide resistance. For example, in [workshop 1](https://anopheles-genomic-surveillance.github.io/workshop-1/module-4-vgsc-snps.html) we looked at SNP variation in VGSC and in [workshop 2](https://anopheles-genomic-surveillance.github.io/workshop-2/module-4-cnv-frequencies.html) we looked at CNV variation in the Cyp6aa/p region. Genomic data are great for these kinds of analyses because they allow large numbers of samples to be investigated quickly, easily, and in parallel, effectively running traditional molecular assays but *in silico*.

### Unknown unknowns

The speed at which insecticide resistance evolves, coupled with the use of new insecticidal compounds, makes it risky to investigate only genes known to be associated with resistance, as it is likely that new forms of resistance will go undetected. Fortunately, whole-genome data allow us to scan the genome, without prior knowledge of which genes are involved, for signals of adaptation known as **selective sweeps**. By using these **genome-wide scans for selection** (**GWSS**), novel loci under selection in mosquito populations can be quickly identified for further investigation as candidate insecticide resistance loci.

## Running a GWSS using the H12 statistic

### What is H12 analysis?

H12 measures **haplotype homozygosity**, in effect the similarity of haplotypes, and we will run it on the phased Ag3.0 haplotype data we learnt about in module 1. This statistic is sensitive to recent selection, making it ideal for detecting selective sweeps driven by recent insecticidal pressures and, unlike some other statistics for detecting selection, it is excellent at detecting **soft sweeps**.

### Different kinds of selective sweeps

Selective sweeps can be divided into hard or soft sweeps. When positive selection acts on a single adaptive mutation, causing it to increase in frequency, it is referred to as a **hard sweep**. Where multiple adaptive mutations occur at the same locus and increase in frequency due to positive selection, it is known as a **soft sweep**.

However, with H12, we aren't focusing on the individual mutations, *e.g.* SNPs, but rather on haplotypes. We would expect the haplotype around a locus under selection to increase in frequency with it. So, in the case of a hard sweep, a single haplotype should be detected at high frequency, and in a soft sweep, multiple haplotypes should be present at high frequencies. The figure below demonstrates this concept with mutation frequency on the left and haplotype frequency at the end time point on the right.

<img src="https://storage.googleapis.com/vo_agam_release/reference/training-images/workshop-6/figure1_resized-768x489.png" alt="hard and soft sweeps"/>

### How H12 works

H12 is based on a measure of population diversity known as haplotype homozygosity or **H1**. In their [2015 paper](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005004), Garud and colleagues demonstrated that although H1 was a good statistic for detecting hard sweeps, where a single haplotype is at high frequency, it was less successful at finding soft sweeps.

In order to detect both hard and soft sweeps, Garud *et al.* modified the H1 statistic so that the frequencies of the first and second most common haplotypes are combined (hence **H12**).
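
To make the difference between H1 and H12 concrete, here is a minimal sketch using the definitions from Garud *et al.* (2015): H1 is the sum of squared haplotype frequencies, while H12 pools the two most common haplotypes before squaring. The function names and the toy haplotype labels are illustrative assumptions and are not part of `malariagen_data`.

```python
from collections import Counter


def haplotype_frequencies(haplotypes):
    """Return haplotype frequencies, sorted from most to least common."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    return sorted((c / n for c in counts.values()), reverse=True)


def h1(freqs):
    """Haplotype homozygosity: sum of squared haplotype frequencies."""
    return sum(p ** 2 for p in freqs)


def h12(freqs):
    """As H1, but the two most common haplotypes are pooled before squaring."""
    return sum(freqs[:2]) ** 2 + sum(p ** 2 for p in freqs[2:])


# Toy cohorts of 20 haplotypes (labels are hypothetical).
hard = ["A"] * 16 + ["B", "C", "D", "E"]            # one haplotype at 80%
soft = ["A"] * 8 + ["B"] * 8 + ["C", "D", "E", "F"]  # two haplotypes at 40% each

for name, haps in [("hard", hard), ("soft", soft)]:
    freqs = haplotype_frequencies(haps)
    print(name, round(h1(freqs), 2), round(h12(freqs), 2))
# hard 0.65 0.73
# soft 0.33 0.65
```

In this toy example H1 roughly halves when the sweep is soft, whereas H12 stays high in both cases, which is the behaviour that makes H12 useful for detecting both kinds of sweep.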

For a deeper dive into the statistic and its underlying mathematics, I would highly recommend the Garud lab [website](https://garud.eeb.ucla.edu/selection-scans/) or, for the brave, the [paper](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005004).

<img src="https://storage.googleapis.com/vo_agam_release/reference/training-images/workshop-6/Screenshot%20from%202022-09-01%2010-37-53.png" alt="Garud paper">

### H12 output

It is possible to run "peak" or "outlier" detection on the H12 GWSS results, but for this module we will interpret the results by visualising a plot of the data (we will explore how to make these plots in detail later).

Just to get a sense of what the H12 output looks like, let's run an H12 analysis over chromosome arm 2R, using *An. gambiae* mosquitoes from a cohort collected in Burkina Faso in 2014.

In [6]:
ag3.plot_h12_gwss(
    contig="2R", 
    analysis="gamb_colu", 
    window_size=1000, 
    sample_sets="3.0",
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'"
)


In the example above, we can see from the y-axis that the H12 statistic ranges from 0 to 1. Each point on the plot is the H12 statistic computed in a window, in this case `window_size=1000` SNPs, across the 2R chromosome arm (x-axis). Selective sweeps are suggested by peaks in the data. Single high value windows are generally ignored as noise, because the imprint of selection on the genome should leave a peak centered near to the locus under selection, with shoulders of the peak dropping away either side.
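
The idea that a sweep should appear as a run of elevated windows, rather than a single high point, can be sketched in a few lines. This is only an illustration of the reasoning above and not a method provided by `malariagen_data`: the function name, the H12 values, the threshold and the minimum run length are all hypothetical choices.

```python
def find_elevated_runs(h12_values, threshold, min_windows):
    """Return (start, end) window index pairs, end exclusive, for runs of at
    least `min_windows` consecutive windows with H12 above `threshold`."""
    runs = []
    start = None
    for i, value in enumerate(h12_values):
        if value > threshold:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_windows:
                runs.append((start, i))
            start = None
    # Close a run that extends to the final window.
    if start is not None and len(h12_values) - start >= min_windows:
        runs.append((start, len(h12_values)))
    return runs


# Hypothetical H12 values for 12 consecutive windows: window 2 is an isolated
# spike, while windows 5 to 9 form a peak with shoulders on either side.
h12_values = [0.02, 0.03, 0.40, 0.02, 0.03, 0.10, 0.25, 0.45, 0.30, 0.12, 0.03, 0.02]

print(find_elevated_runs(h12_values, threshold=0.08, min_windows=3))
# [(5, 10)]
```

The isolated spike at window 2 is discarded, while the broader peak is retained. In practice the threshold would depend on the background H12 level of the cohort being analysed.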

If we zoom in to ~28,500,000 bp, and roll the mouse over the gene track, we can see that H12 has detected the expected selective sweep over the Cyp6aa/p gene cluster, containing genes known to be involved in insecticide resistance. The sweep is composed of many 1000-SNP windows, providing strong evidence that a true sweep has occurred here.

### H12 analytical workflow

Running an H12 analysis can be broken down into the following steps:

1.   Select cohorts for analysis
2.   Calibrate `window_size` parameter for each cohort and contig
3.   Run and plot H12 scans
4.   Peak identification
5.   Investigate signals to identify candidate resistance genes

Let's explore each of these in detail.

## Step 1. Select cohorts for analysis

### Recap on cohorts

First we need to define a cohort of individuals to run the H12 analysis on. The MalariaGEN Ag3.0 data resource contains mosquito samples collected across large spatial and temporal scales, and from different mosquito species. When we want to run population genetic analyses on datasets like these, the data must be divided into biologically relevant cohorts, where a cohort is simply a group of samples we want to analyse together.

### Cohort size

When we are choosing the cohort for an H12 analysis we need to consider how many samples it contains. With too few samples, it can become difficult to identify selective sweep signals over background noise; with too many, computational time is wasted needlessly. We have found that a reasonable cohort size is 30 samples, hence this is the default `cohort_size` parameter in all the H12 GWSS functions that we will use. If you try to analyse a cohort with fewer samples than this parameter, the function will raise an error. If your cohort has more samples, they will be randomly downsampled to `cohort_size`.
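
The behaviour described above can be sketched as follows. This is not the implementation used inside `malariagen_data`; the function name, the `seed` argument and the sample identifiers are assumptions made purely for illustration.

```python
import random


def downsample_cohort(sample_ids, cohort_size=30, seed=42):
    """Randomly downsample a cohort to `cohort_size` samples.

    Raises ValueError if the cohort has fewer samples than `cohort_size`.
    """
    if len(sample_ids) < cohort_size:
        raise ValueError(
            f"Cohort has {len(sample_ids)} samples, fewer than cohort_size={cohort_size}"
        )
    rng = random.Random(seed)  # fixed seed so the same samples are chosen each run
    return sorted(rng.sample(sample_ids, cohort_size))


# Hypothetical cohort of 46 sample identifiers.
cohort = [f"sample_{i:03d}" for i in range(46)]
selected = downsample_cohort(cohort)
print(len(selected))
# 30
```

Fixing the random seed means that repeated runs analyse the same subset of samples, which keeps results reproducible and makes cached results meaningful.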

For this example, let's focus on Burkina Faso, and first examine what cohorts are available.

In [7]:
burkina_samples_df = ag3.sample_metadata(sample_query="country == 'Burkina Faso'")
burkina_samples_df

Unnamed: 0,sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call,...,aim_species,country_iso,admin1_name,admin1_iso,admin2_name,taxon,cohort_admin1_year,cohort_admin1_month,cohort_admin2_year,cohort_admin2_month
0,AB0085-Cx,BF2-4,Austin Burt,Burkina Faso,Pala,2012,7,11.151,-4.235,F,...,gambiae,BFA,Hauts-Bassins,BF-09,Houet,gambiae,BF-09_gamb_2012,BF-09_gamb_2012_07,BF-09_Houet_gamb_2012,BF-09_Houet_gamb_2012_07
1,AB0086-Cx,BF2-6,Austin Burt,Burkina Faso,Pala,2012,7,11.151,-4.235,F,...,gambiae,BFA,Hauts-Bassins,BF-09,Houet,gambiae,BF-09_gamb_2012,BF-09_gamb_2012_07,BF-09_Houet_gamb_2012,BF-09_Houet_gamb_2012_07
2,AB0087-C,BF3-3,Austin Burt,Burkina Faso,Bana Village,2012,7,11.233,-4.472,F,...,coluzzii,BFA,Hauts-Bassins,BF-09,Houet,coluzzii,BF-09_colu_2012,BF-09_colu_2012_07,BF-09_Houet_colu_2012,BF-09_Houet_colu_2012_07
3,AB0088-C,BF3-5,Austin Burt,Burkina Faso,Bana Village,2012,7,11.233,-4.472,F,...,coluzzii,BFA,Hauts-Bassins,BF-09,Houet,coluzzii,BF-09_colu_2012,BF-09_colu_2012_07,BF-09_Houet_colu_2012,BF-09_Houet_colu_2012_07
4,AB0089-Cx,BF3-8,Austin Burt,Burkina Faso,Bana Village,2012,7,11.233,-4.472,F,...,coluzzii,BFA,Hauts-Bassins,BF-09,Houet,coluzzii,BF-09_colu_2012,BF-09_colu_2012_07,BF-09_Houet_colu_2012,BF-09_Houet_colu_2012_07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
291,AB0314-C,6775,Nora Besansky,Burkina Faso,Monomtenga,2004,8,12.060,-1.170,F,...,gambiae,BFA,Centre-Sud,BF-07,Bazega,gambiae,BF-07_gamb_2004,BF-07_gamb_2004_08,BF-07_Bazega_gamb_2004,BF-07_Bazega_gamb_2004_08
292,AB0315-C,6777,Nora Besansky,Burkina Faso,Monomtenga,2004,8,12.060,-1.170,F,...,gambiae,BFA,Centre-Sud,BF-07,Bazega,gambiae,BF-07_gamb_2004,BF-07_gamb_2004_08,BF-07_Bazega_gamb_2004,BF-07_Bazega_gamb_2004_08
293,AB0316-C,6779,Nora Besansky,Burkina Faso,Monomtenga,2004,8,12.060,-1.170,F,...,gambiae,BFA,Centre-Sud,BF-07,Bazega,gambiae,BF-07_gamb_2004,BF-07_gamb_2004_08,BF-07_Bazega_gamb_2004,BF-07_Bazega_gamb_2004_08
294,AB0318-C,5072,Nora Besansky,Burkina Faso,Monomtenga,2004,7,12.060,-1.170,F,...,gambiae,BFA,Centre-Sud,BF-07,Bazega,gambiae,BF-07_gamb_2004,BF-07_gamb_2004_07,BF-07_Bazega_gamb_2004,BF-07_Bazega_gamb_2004_07


In [8]:
burkina_samples_df.groupby("cohort_admin2_year").size()

cohort_admin2_year
BF-07_Bazega_gamb_2004    13
BF-09_Houet_arab_2014      3
BF-09_Houet_colu_2012     82
BF-09_Houet_colu_2014     53
BF-09_Houet_gamb_2012     98
BF-09_Houet_gamb_2014     46
dtype: int64

In the Burkina Faso data from Ag3.0, there are six cohorts, four of which have more than 30 samples, so let's pick one of these to run an H12 GWSS on.
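
As a small sketch of this selection step, the cohort sizes printed above can be filtered against a minimum size. The counts are copied from the output above; the variable names are illustrative, and the cut-off of 30 assumes the default `cohort_size`.

```python
# Cohort sizes copied from the groupby output above.
cohort_sizes = {
    "BF-07_Bazega_gamb_2004": 13,
    "BF-09_Houet_arab_2014": 3,
    "BF-09_Houet_colu_2012": 82,
    "BF-09_Houet_colu_2014": 53,
    "BF-09_Houet_gamb_2012": 98,
    "BF-09_Houet_gamb_2014": 46,
}

min_cohort_size = 30  # assumed to match the default cohort_size

# Keep only cohorts large enough to be analysed.
eligible = {name: n for name, n in cohort_sizes.items() if n >= min_cohort_size}

for name, n in eligible.items():
    print(name, n)
# BF-09_Houet_colu_2012 82
# BF-09_Houet_colu_2014 53
# BF-09_Houet_gamb_2012 98
# BF-09_Houet_gamb_2014 46
```

When working directly with the pandas Series returned by `groupby(...).size()`, the same filter can be written with boolean indexing, for example `sizes[sizes >= 30]`, where `sizes` is a hypothetical name for that Series.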

## Step 2. Calibrate window size

### Recap: Windowed analyses

The _Anopheles gambiae_ genome is relatively large, _e.g._ there are millions of SNPs on the 3R chromosome arm alone. We are going to use a windowed approach for H12, where we define a window size in number of SNPs, then move this window along the chromosome arm haplotypes, calculating our statistic across each window. This gives us a summary of data that is easier to interpret. This is similar to the approach we used to analyse heterozygosity in [workshop 5](https://anopheles-genomic-surveillance.github.io/workshop-5/module-4-roh.html), however, in the case of heterozygosity, our window was defined by fixed genomic length rather than by number of SNPs.
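
A minimal sketch of a windowed H12 calculation is shown below, assuming non-overlapping windows and a toy representation in which each haplotype is a string with one character per SNP. The function names and the data are hypothetical, and this is not the code that `malariagen_data` runs internally.

```python
from collections import Counter


def h12(haplotypes):
    """H12 for a collection of haplotypes observed within one window."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    freqs = sorted((c / n for c in counts.values()), reverse=True)
    return sum(freqs[:2]) ** 2 + sum(p ** 2 for p in freqs[2:])


def windowed_h12(haplotypes, window_size):
    """Compute H12 in consecutive, non-overlapping windows of `window_size` SNPs."""
    n_snps = len(haplotypes[0])
    values = []
    for start in range(0, n_snps - window_size + 1, window_size):
        # Slice the same SNP window out of every haplotype.
        window = [h[start:start + window_size] for h in haplotypes]
        values.append(h12(window))
    return values


# Six toy haplotypes genotyped at eight SNPs (0 = reference, 1 = alternate).
haplotypes = [
    "00001111",
    "01011111",
    "00111111",
    "11001111",
    "10101110",
    "10010000",
]

print([round(v, 3) for v in windowed_h12(haplotypes, window_size=4)])
# [0.222, 0.722]
```

In the first window all six haplotypes are distinct, so H12 is low. In the second window four of the six haplotypes are identical, so H12 is much higher.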

### Choosing the window size for H12 analyses 

Once we have decided on our cohorts, we need to calibrate the window size, an important parameter for the H12 analysis. Every cohort needs its own window calibration because its demographic history will be unique. In [workshop 5](https://anopheles-genomic-surveillance.github.io/workshop-5/about.html) we learnt how this demographic history can affect genome-wide background genetic diversity.

Let's have a look at what happens if the wrong window size is used, then run through our methodology for picking the best size.

Here's the example we saw earlier, using a 1000 SNP window on the 2R chromosome arm.

In [9]:
ag3.plot_h12_gwss(
    contig="2R", 
    analysis="gamb_colu", 
    window_size=1000, 
    sample_sets="3.0", 
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
    title="BF-09_Houet_gamb_2014; window_size=1000"
)

As we saw earlier, by zooming into an established pyrethroid metabolic insecticide resistance locus, known to be under selection, we are able to see a clear peak in the H12 statistic if we have a good window size. Let's zoom in to 28,500,000bp (Cyp6aa/p gene cluster) and remind ourselves.

However, if our window size is too small, then our signal may not be visible over background noise. To illustrate what happens, let's run the same scan but with a much smaller window size of 100 SNPs.

In [None]:
ag3.plot_h12_gwss(
    contig="2R", 
    analysis="gamb_colu", 
    window_size=100, 
    sample_sets="3.0", 
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
    title="BF-09_Houet_gamb_2014; window_size=100"
)

This scan is far too noisy and it is impossible to detect any signals.
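
A toy simulation hints at why very small windows are hard to read. It rests on strong simplifying assumptions that do not hold for real mosquito data: SNPs are simulated independently (no linkage), every allele has frequency 0.5, and there is no selection. With only a few SNPs per window, unrelated haplotypes are often identical by chance, so background H12 is higher and varies more from window to window. All names and parameter values here are hypothetical.

```python
import random
from collections import Counter
from statistics import mean


def h12(haplotypes):
    """H12 for a collection of haplotypes observed within one window."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    freqs = sorted((c / n for c in counts.values()), reverse=True)
    return sum(freqs[:2]) ** 2 + sum(p ** 2 for p in freqs[2:])


def windowed_h12(haplotypes, window_size):
    """Compute H12 in consecutive, non-overlapping windows of `window_size` SNPs."""
    n_snps = len(haplotypes[0])
    return [
        h12([h[start:start + window_size] for h in haplotypes])
        for start in range(0, n_snps - window_size + 1, window_size)
    ]


# Simulate 60 neutral haplotypes at 2000 unlinked SNPs.
rng = random.Random(0)
haplotypes = ["".join(rng.choice("01") for _ in range(2000)) for _ in range(60)]

small = windowed_h12(haplotypes, window_size=4)
large = windowed_h12(haplotypes, window_size=40)

print("window_size=4 : mean", round(mean(small), 3), "range", round(max(small) - min(small), 3))
print("window_size=40: mean", round(mean(large), 3), "range", round(max(large) - min(large), 3))
```

Even without any selection, the small windows show a higher and more variable background, which is the kind of noise that can hide a genuine peak.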

@@TODO