# Diplotype clustering

## Theme: Analysis

In this module we are going to learn how to apply diplotype clustering using genome variation data from the [MalariaGEN Vector Observatory](https://www.malariagen.net/vobs/) [Ag3.0 release](https://www.malariagen.net/data_package/ag30-anopheles-gambiae-data-resource/). This method allows us to rapidly zoom in on a genome region of interest and identify selective sweeps, assess their size, detect potential gene flow events between countries or species, and investigate whether sweeps are driven by copy number variants (CNVs), amino acid mutations or both. The diplotype clustering method, and one of the examples we use in these training materials, was first published in [Nagi et al. (2024)](https://academic.oup.com/mbe/article/41/7/msae140/7710633). Please cite this paper if you use diplotype clustering in your own research.

## Learning Objectives

At the end of this module you will be able to:

- Perform diplotype clustering across different mosquito cohorts.
- Plot and interpret the results.
- Identify genomic variation associated with diplotypes under selection.

# Recap: Hierarchical clustering

In [Workshop 7 Module 3](https://anopheles-genomic-surveillance.github.io/workshop-7/module-3-haplotype-clustering.html) of the Anopheles genomics for surveillance training course we covered the concept of hierarchical clustering including an example of how the algorithm works. To recap, hierarchical clustering is a method of grouping a set of objects that are similar to one another. In hierarchical clustering, the algorithms typically begin with every data point in its own cluster, and then pairs of data points or clusters are merged moving up the hierarchy. The order in which we merge two points is based on their distance (smallest distance first). We can visualise this clustering with dendrograms, which provide us with a visual interpretation of the hierarchical relationship between variables. When looking for selective sweeps in dendrograms, we are looking for large clusters of identical or very closely related haplotypes.

The below animation shows this clustering process both in 2D space (left), and moving up the dendrogram hierarchy (right). We can see that we begin by grouping the two closest data points first, and continue iteratively until all data points are in one cluster, at which point, the dendrogram is complete. Here, the sample indices correspond to the dendrogram leaves.

<center><img src="https://raw.githubusercontent.com/sanjaynagi/locusPocus/main/docs/hier_slow.gif" alt="drawing" width="700"/></center>
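As a minimal sketch of the merging process described above, the example below clusters six made-up 2D points with SciPy. The coordinates are illustrative assumptions rather than data from this module; the merge table it prints corresponds to the steps shown in the animation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

# Six toy data points in 2D space (made-up values for illustration only).
points = np.array([
    [1.0, 1.0],
    [1.2, 1.1],
    [5.0, 5.0],
    [5.1, 5.3],
    [9.0, 1.0],
    [9.5, 1.2],
])

# Agglomerative clustering: every point starts in its own cluster and the
# closest pair of clusters is merged at each step.
Z = linkage(points, method="complete", metric="euclidean")

# Each row of Z describes one merge: the two cluster indices that were
# merged, the distance between them, and the size of the new cluster.
for left, right, dist, size in Z:
    print(f"merge {int(left)} + {int(right)} at distance {dist:.2f} (size {int(size)})")

# Order of the leaves (sample indices) along the bottom of the dendrogram.
leaf_order = leaves_list(Z).tolist()
print(leaf_order)
```

The first row of `Z` merges the two closest points, and the last row joins everything into a single cluster, at which point the dendrogram is complete.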

## What are diplotypes?

A diplotype, sometimes referred to as a multi-locus genotype, is essentially the combination of two haplotypes from a single mosquito - one from each chromosome - at a particular genomic region. By analysing diplotypes rather than haplotypes, we can better capture the full genetic variation present in an individual, including complex structural variants like CNVs that can be difficult to phase onto haplotypes. Often, CNVs and multiallelic SNPs are ignored when analysing haplotype data. The more mosquitoes we sequence, the worse this problem gets: *An. gambiae s.l.* is so genetically diverse that, eventually, a significant proportion of all SNPs become multiallelic (i.e., there are more than two alleles at a given site).

<img src="http://vobs-resources.cog.sanger.ac.uk/training/img/advanced/diplotype-clustering/diplotype_fig.png" width="800" ></img>
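One way to picture a diplotype in code is as a table of allele counts per site. The sketch below is an illustration under that assumption, with made-up alleles and a hypothetical helper `diplotype_allele_counts`; it is not necessarily the exact encoding used inside `malariagen_data`.

```python
import numpy as np

# Two toy haplotypes from one mosquito at five sites (made-up values).
# Alleles are coded 0 = reference, 1..3 = alternate alleles, so the
# fourth site carries a second alternate allele (a multiallelic SNP).
haplotype_a = np.array([0, 1, 0, 2, 0])
haplotype_b = np.array([0, 1, 1, 0, 0])


def diplotype_allele_counts(hap_a, hap_b, n_alleles=4):
    """Count how many copies of each allele the individual carries per site."""
    counts = np.zeros((len(hap_a), n_alleles), dtype=int)
    for hap in (hap_a, hap_b):
        # Add one copy of the observed allele at every site.
        counts[np.arange(len(hap)), hap] += 1
    return counts


diplotype = diplotype_allele_counts(haplotype_a, haplotype_b)
print(diplotype)
```

Because the counts are summed over both chromosomes, swapping the two haplotypes gives the same diplotype, so no phasing is needed and multiallelic sites are retained.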

## Calculating the distance

When we perform hierarchical clustering we must specify the distance metric, which is how we calculate the distance between data points. We are going to use diplotypes to calculate this distance, using the city block distance to measure the difference between each pair of diplotypes. City block distance is calculated by summing the absolute differences between a pair of objects across all of their values. This is analogous to walking blocks in a city, where you must move around buildings to reach your destination rather than walk in a straight line. In the simple example below, the distance between two objects X and Y is the sum of the absolute differences between their values.


|   |   |   |   |   |
|---:|---:|---:|---:|---:|
| **X** | 5 | 7 | 10 | 3 |
| **Y** | 0 | 5 | 5 | 10 |


The city block distance between X and Y is the sum of the absolute differences at each position.

|5-0| + |7-5| + |10-5| + |3-10| = 5 + 2 + 5 + 7 = 19

Applying the concept of city block distance to diplotypes allows us to account for the presence of multiallelic SNPs, which are ignored when using phased haplotypes. City block distance is the default distance metric when performing diplotype clustering.
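The calculation can be checked with a few lines of plain Python. The helper `city_block` is a hypothetical name used for illustration, and the two toy diplotypes are made-up alternate-allele counts rather than real data.

```python
def city_block(a, b):
    """Sum of absolute differences between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(abs(p - q) for p, q in zip(a, b))


# The X and Y values from the table above.
x = [5, 7, 10, 3]
y = [0, 5, 5, 10]
print(city_block(x, y))  # 19

# Two toy diplotypes coded as alternate-allele counts per site (made up).
diplotype_1 = [0, 1, 2, 0, 2]
diplotype_2 = [0, 2, 0, 0, 2]
print(city_block(diplotype_1, diplotype_2))  # 3
```

Two individuals carrying exactly the same diplotype have a distance of zero, which is why swept diplotypes collapse into flat clusters in the dendrogram.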

To calculate the distance between groups of objects, we must also specify a linkage method. This parameter determines how we calculate the distance between clusters at each stage of the clustering process and was covered in [Workshop 7 Module 3](https://anopheles-genomic-surveillance.github.io/workshop-7/module-3-haplotype-clustering.html) of the Anopheles genomics for surveillance training course.

<center><img src="https://editor.analyticsvidhya.com/uploads/40351linkages.PNG" alt="drawing" width="500"/></center>

In practice, the exact choice of linkage method will not have much effect on the overall dendrogram shape, or on our conclusions. This is because we are looking for large clusters of identical or near-identical diplotypes, and they will appear similar regardless of the linkage method. For diplotype clustering, the default is complete linkage, in which the distance between two clusters is the largest distance between any pair of their members.
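To illustrate why the linkage method matters little for our purposes, the sketch below clusters five made-up diplotypes (alternate-allele counts per site) with SciPy, using city block distance and two different linkage methods. The data are assumptions chosen so that three individuals share an identical diplotype.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# Rows are individuals, columns are sites (toy alternate-allele counts).
diplotypes = np.array([
    [0, 0, 2, 2, 0],  # three individuals share an identical diplotype
    [0, 0, 2, 2, 0],
    [0, 0, 2, 2, 0],
    [0, 0, 2, 1, 1],
    [2, 2, 0, 0, 2],
])

# Pairwise city block distances in condensed form.
dist = pdist(diplotypes, metric="cityblock")

# Cluster the same distances with two different linkage methods.
Z_complete = linkage(dist, method="complete")
Z_average = linkage(dist, method="average")

# Column 2 of each linkage matrix holds the merge distances.
print(Z_complete[:, 2])
print(Z_average[:, 2])
```

With both methods the identical diplotypes merge at distance zero, so the flat cluster is unchanged; only the heights of the later, more distant merges differ.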

## Set up

Now that we have introduced diplotype clustering, we will demonstrate its use and interpretation. Let's install and import the Python package malariagen_data, which we can use to perform the analysis on the available data.

In [1]:
%pip install -q --no-warn-conflicts malariagen_data

Note: you may need to restart the kernel to use updated packages.


Note that authentication is required to access data through the package; more details can be found [here](https://malariagen.github.io/vector-data/vobs/vobs-data-access.html).

In [2]:
import malariagen_data
import os
import plotly.io as pio
pio.renderers.default = "notebook+colab"

## Saving diplotype clustering results

Some diplotype clustering runs may take a while to complete, particularly if you're running this code on a service with modest computational resources such as Google Colab.

To avoid having to rerun these analyses, we'll save the results so we can come back to them later. In Google Colab, you can save results to your Google Drive, which will mean you don't lose results even if you leave the notebook and come back several days later.

When mounting your Google Drive you will need to follow the authorization instructions.

In [3]:
try:
    from google.colab import drive
    drive.mount("drive")
except ImportError:
    pass

With our Google Drive now mounted, we can define and make a directory where we want to save our results.

In [4]:
results_dir = "drive/MyDrive/Colab Data/ag3-diplotype-clustering-results"
os.makedirs(results_dir, exist_ok=True)

In Google Colab, we can see our mounted drive and diplotype clustering results directory by clicking on the file tab on the left hand side of the screen.

Next we should set up the malariagen_data package. As we want to save our diplotype clustering results in the Google Drive folder we just created, we'll use the results_cache parameter and assign our results directory to it. If we were running this notebook locally, then we could assign a local folder to this parameter and the diplotype clustering results would instead be stored on our hard drive.

In [5]:
ag3 = malariagen_data.Ag3(results_cache=results_dir)
ag3

Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.

| MalariaGEN Ag3 API client | |
|---|---|
| Storage URL | gs://vo_agam_release_master_us_central1 |
| Data releases available | 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14 |
| Results cache | /home/kellylbennett/github/anopheles-genomic-surveillance.github.io/docs/advanced-training-materials/drive/MyDrive/Colab Data/ag3-diplotype-clustering-results |
| Cohorts analysis | 20250131 |
| AIM analysis | 20220528 |
| Site filters analysis | dt_20200416 |
| Software version | malariagen_data 15.0.1 |
| Client location | Iowa, United States (Google Cloud us-central1) |


The output of ag3 shows us the “Client location”. This is where our Google Colab virtual machine is running on the cloud. As the data we would like to analyse is physically stored in the US, we should check that our notebook is running there too. If not, click “Runtime > Disconnect and delete runtime” from the menu, then re-run the notebook and check again.

# The ag3.plot_diplotype_clustering() function

The plot_diplotype_clustering function in malariagen_data performs hierarchical clustering on diplotypes from a specified genomic region.

There is just one required parameter:

- `region` defines what region of the genome we want to use to run the analysis. We could assign a whole chromosome arm, a chromosome region with a start and end position, or a specific gene of interest.
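To make the region string format concrete, the sketch below parses the two coordinate-based forms. `parse_region` is a hypothetical helper written for this illustration; it is not part of `malariagen_data`, which handles region strings itself and can also resolve gene identifiers using the genome annotations.

```python
import re


def parse_region(region):
    """Parse 'contig' or 'contig:start-end' (commas allowed) into a tuple."""
    match = re.fullmatch(r"(\w+)(?::([\d,]+)-([\d,]+))?", region.strip())
    if match is None:
        raise ValueError(f"unrecognised region string: {region!r}")
    contig, start, end = match.groups()
    if start is None:
        # No coordinates given: the region is the whole contig.
        return contig, None, None
    start = int(start.replace(",", ""))
    end = int(end.replace(",", ""))
    if start > end:
        raise ValueError("region start must not exceed region end")
    return contig, start, end


print(parse_region("2R"))
contig, start, end = parse_region("2R:28,480,576-28,482,637")
print(contig, start, end, "length:", end - start + 1)
```

The second string is the CYP6P4 region used in the example that follows.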

Let's run the function to see an example of the resulting dendrogram. We will investigate one of the genes in the CYP6 gene cluster for *An. gambiae* from Tanzania because these are key genes involved in metabolic resistance to pyrethroids. We will focus our analysis on the CYP6P4 gene region previously investigated by [Mwinyi et al. 2025](https://www.authorea.com/users/852843/articles/1238610-genomic-analysis-reveals-a-new-cryptic-taxon-of-malaria-vectors-with-a-distinct-insecticide-resistance-profile-in-the-coast-of-east-africa) and colour the individuals by location to investigate variation across Tanzania.


In [10]:
ag3.plot_diplotype_clustering(
    region="2R:28,480,576-28,482,637",
    sample_sets="AG1000G-TZ",
    sample_query="taxon == 'gambiae'",
    site_mask="gamb_colu",
    color="location",
    )

                                

We can see at least one cluster composed of a relatively large number of individuals with the same diplotype at the end of the dendrogram. When a variant is under selection, we expect many individuals in the population to be homozygous for the region under selection and therefore to have identical diplotypes. This appears as a flattened area in the dendrogram. The diplotype cluster we observe here is found only in individuals from Muleba, suggesting it has a restricted geographical distribution.

Looking within the plot, there is also another small cluster of individuals which have the same diplotype in Muleba but since the cluster is composed of few individuals it is unclear whether it could be under selection. In this case, the dendrogram would be much more interpretable if we were able to view a measure of heterozygosity found within the diplotype cluster. Diplotypes with low levels of heterozygosity are a much clearer indication that the cluster is under selection.

We would also gain a lot more information from the dendrogram if we were able to look at the genomic variation associated with the diplotype clusters we have plotted. For example, whether the individuals in each cluster had CNVs or amino acid substitutions and whether these were unique to that particular cluster. By doing this for genes involved in insecticide resistance, we can get an idea of which variants are being selected upon.

# The ag3.plot_diplotype_clustering_advanced() function

The plot_diplotype_clustering_advanced function in malariagen_data not only performs hierarchical clustering on diplotypes but also overlays data useful for its interpretation. It allows for the identification of genomic variation associated with diplotypes under selection. In addition to the clustering dendrogram, the function displays:

   1. Heterozygosity of each sample (within this genomic region)
   2. Copy number at genes of interest
   3. Amino acid variants in a specified transcript

This function requires two more parameters in order to return a plot with the genomic variation data.

- `snp_transcript` is the gene transcript identifier that will be used to report on amino acid substitutions. For example, this might be a gene known or suspected to be involved in insecticide resistance.
- `cnv_region` defines the region of the genome over which we want to report copy number variants. This should cover the region where the SNP transcript is found.

Let's rerun the analysis above using the advanced function to see what new information we can gain.


In [7]:
ag3.plot_diplotype_clustering_advanced(
    region='2R:28,480,576-28,482,637',
    snp_transcript='AGAP002867-RA',
    cnv_region='2R:28,480,576-28,482,637',
    site_mask='gamb_colu',
    sample_sets='AG1000G-TZ',
    sample_query='taxon == "gambiae"',
    color="location")

                                   


There is a lot to unpack in this figure.

Directly below the dendrogram we now have a bar indicating the level of heterozygosity within the diplotype region for each individual. The measure of heterozygosity is represented by a colour bar ranging from black to white, with black indicating high heterozygosity and white indicating low heterozygosity. We are interested in diplotypes with low heterozygosity - this is what you expect when a selective sweep has occurred, and you find many individuals that are homozygous for the variation under selection. We can see that the diplotype we observed at the end of the dendrogram shared by numerous individuals from Muleba is underpinned by a white colour bar and is therefore likely under selection. However, the smaller diplotype cluster including individuals from Muheza is coloured a shade of grey, indicating fairly high levels of heterozygosity. This means we can now rule out this diplotype cluster as potentially under selection.
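To make the heterozygosity measure concrete, here is a minimal sketch that computes the fraction of heterozygous sites for each individual from a toy genotype array. The genotypes are made up, and the package may calculate heterozygosity differently in detail (for example, per base over accessible sites); this only illustrates the idea.

```python
import numpy as np

# Toy genotype calls with shape (n_sites, n_samples, ploidy); made-up values.
gt = np.array([
    [[0, 0], [0, 1], [1, 1]],
    [[0, 0], [1, 0], [1, 1]],
    [[1, 1], [0, 1], [0, 0]],
    [[0, 0], [0, 0], [0, 2]],
])

# A genotype call is heterozygous when its two alleles differ.
is_het = gt[:, :, 0] != gt[:, :, 1]

# Fraction of heterozygous sites for each sample across the region.
het_per_sample = is_het.mean(axis=0)
print(het_per_sample)
```

The first sample is homozygous at every site, as expected for an individual carrying two copies of the same swept haplotype, so it would appear white in the heterozygosity bar.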

Underneath the heterozygosity bar, we have another two colour bars, one for each gene transcript found across the CNV region we specified. Darker shades of colour indicate a higher copy number. Rolling over the bar provides further information on the number of copies. On our plot we can see that all individuals within the low heterozygosity diplotype cluster from Muleba have an elevated copy number at CYP6AA1, indicating that a CNV in this region is under selection.
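The copy number shown for each gene is a single summary value per individual. One simple way to obtain such a summary, sketched below, is to take the most common (modal) copy number state across the windows overlapping a gene. The gene names, the window states and the assumption of a diploid baseline of 2 copies are all illustrative, and `modal_copy_number` is a hypothetical helper rather than a package function.

```python
from collections import Counter


def modal_copy_number(window_states):
    """Return the most common copy number state across a gene's windows."""
    return Counter(window_states).most_common(1)[0][0]


# Toy copy number states per genomic window for one individual (made up).
windows_by_gene = {
    "gene_A": [2, 2, 2, 2, 2],
    "gene_B": [2, 4, 4, 4, 4, 4],
}

modal = {gene: modal_copy_number(states) for gene, states in windows_by_gene.items()}
print(modal)

# Genes whose modal copy number exceeds the assumed diploid baseline of 2.
amplified = [gene for gene, cn in modal.items() if cn > 2]
print(amplified)
```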

Finally, we have three colour bars representing the amino acid substitutions present in the SNP transcript we specified. Individuals homozygous for a substitution are shown in black (2 copies), individuals without it in white (0 copies), and heterozygotes in grey (1 copy). Interestingly, we can see there is one substitution, I236M, that is found only in the low heterozygosity diplotype cluster from Muleba. Therefore it seems both a CNV and an amino acid substitution are associated with the diplotype cluster under selection. If we knew nothing about these variants, this figure would give us a clue that the CNV coupled with the substitution is driving selection for insecticide resistance. In fact, our findings support the recent observation of a triple mutant associated with high pyrethroid resistance in *An. gambiae* from East and Central Africa [Njoroge et al. 2022](https://pubmed.ncbi.nlm.nih.gov/35775282/). Along with a transposable element, the highly resistant mosquitoes had a duplication at the CYP6AA1 gene and the substitution I236M at CYP6P4, similar to what we have observed here for Tanzania. As a result, we can conclude that *An. gambiae* from Muleba are also likely to have high levels of pyrethroid resistance.
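The visual check for a substitution that is confined to one cluster can also be expressed in code. The sketch below uses made-up allele counts (0, 1 or 2 copies per individual) and hypothetical substitution labels; it flags substitutions that are homozygous in every member of a cluster and absent everywhere else.

```python
import numpy as np

# Hypothetical substitution labels; rows of `counts` follow the same order.
labels = ["sub_1", "sub_2", "sub_3"]

# Copies of each substitution carried by six individuals (made-up values).
counts = np.array([
    [2, 2, 2, 0, 0, 0],
    [0, 1, 0, 1, 2, 0],
    [0, 0, 0, 0, 1, 0],
])

# Membership of the low heterozygosity cluster (first three individuals).
in_cluster = np.array([True, True, True, False, False, False])

# Homozygous in every cluster member, and absent outside the cluster.
fixed_inside = (counts[:, in_cluster] == 2).all(axis=1)
absent_outside = (counts[:, ~in_cluster] == 0).all(axis=1)

unique_to_cluster = [
    label for label, keep in zip(labels, fixed_inside & absent_outside) if keep
]
print(unique_to_cluster)
```

In real data the criteria may need to be relaxed, for example to allow a few carriers outside the cluster.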



# GSTE variation in *An. gambiae* from Ghana

To illustrate the power of diplotype clustering, let's look at a case study of the Gste2 gene from some recent whole-genome data of *An. gambiae s.l* from Obuasi, central Ghana, i.e., the sample set 1244-VO-GH-YAWSON-VMF00149 investigated by [Nagi et al. (2024)](https://academic.oup.com/mbe/article/41/7/msae140/7710633). Anopheles mosquitoes from this area are highly resistant to multiple classes of insecticides. The Gste2 gene is known to be involved in resistance to DDT (and potentially other insecticides), through either copy number variation, amino acid mutations, or both. Gste2-I114T and Gste2-L119V are the major amino acid mutations at this locus known to confer resistance.

In [8]:
ag3.plot_diplotype_clustering_advanced(
    region="3R:28,597,000-28,600,000",
    cnv_region="3R:28,594,000-28,605,000",
    snp_transcript="AGAP009194-RA",
    sample_sets="1244-VO-GH-YAWSON-VMF00149",
    site_mask="gamb_colu",
    color="taxon",
    )

                                     


We can see three diplotype clusters, each made up of genetically identical diplotypes with very low heterozygosity, as expected when a selective sweep has occurred.
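Clusters of identical diplotypes can also be found directly by grouping individuals whose diplotypes match exactly. The sketch below does this with made-up sample names and alternate-allele counts; it is an illustration of the idea rather than how the plotting function works internally.

```python
from collections import Counter

# Toy diplotypes keyed by sample name (made-up alternate-allele counts).
diplotypes = {
    "sample_1": (0, 0, 2, 2, 0),
    "sample_2": (0, 0, 2, 2, 0),
    "sample_3": (0, 0, 2, 2, 0),
    "sample_4": (0, 0, 2, 1, 1),
    "sample_5": (2, 2, 0, 0, 2),
    "sample_6": (0, 0, 2, 1, 1),
}

# Count how many individuals carry each distinct diplotype.
group_sizes = Counter(diplotypes.values())
largest_diplotype, largest_size = group_sizes.most_common(1)[0]

# Names of the individuals in the largest group of identical diplotypes.
members = sorted(
    name for name, dip in diplotypes.items() if dip == largest_diplotype
)
print(largest_size, members)
```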

The first low heterozygosity cluster includes a small number of *An. gambiae* but we do not observe any CNVs or amino acid substitutions unique to the group that would indicate they are driving selection.

The second cluster includes *An. coluzzii*, all of which carry the Gste2-I114T amino acid substitution, observed as a block of black colour underneath the cluster. We know this mutation causes insecticide resistance, so it is no surprise to find it linked to a selective sweep in this dataset.

In the third cluster we observe individuals of *An. coluzzii* which do not harbour either I114T or L119V, but instead carry the Gste2-F120L mutation. This cluster is homozygous for F120L and shows low heterozygosity, again indicative of diplotypes which have two copies of the same swept haplotype. Two observations make Gste2-F120L a convincing candidate driver of resistance. Firstly, it is in very close physical proximity to known resistance mutations in codons 114 and 119. According to Riveron et al., codon 120 is located at the active site of the enzyme and is therefore likely to interact with the insecticide. Secondly, there are no CNVs associated with this sweep, and no other amino acid variants except N3K, which is less likely to be causative due to its physical location away from the active site.

Finally, we also observe a small number of *An. coluzzii* individuals which harbour a copy number variant (CNV) spanning Gste2, Gste1, Gste3 and Gste7. In these individuals, CNVs could be driving insecticide resistance by increasing the expression of the genes they encompass, allowing the mosquito to detoxify more of the insecticide as a result.




# The KEAP1 gene in *An. arabiensis* from Kenya

We have seen how diplotype clustering can be applied to a gene we already know is involved in insecticide resistance, and how it can uncover which variants are associated with diplotypes under selection. Now let's demonstrate how this function can also be used to investigate genes for which we have no prior knowledge of their involvement in insecticide resistance, but which we suspect are involved because, for example, we have seen a novel signal of selection.

The Keap1 gene has the potential to impact on insecticide resistance because it regulates the transcription factor Maf-S. Maf-S is known to trigger the expression of multiple metabolic resistance genes, including cytochrome P450s and glutathione S-transferases, in response to oxidative stress. We have also observed a signal of selection at this gene. However, we do not currently know what genomic variation at Keap1 is associated with insecticide resistance. This provides a good case for further investigation using diplotype clustering to assess whether either a CNV or a SNP can be associated with the signal of selection.

We will go ahead and run the analysis, restricting it to *An. arabiensis* from the Kenyan sample sets ('1274-VO-KE-KAMAU-VMF00246','AG1000G-KE'), where genome-wide selection scans revealed a novel signal of selection in the central regions of Thika and Mwea ([Polo et al. 2025](https://assets-eu.researchsquare.com/files/rs-5328087/v1_covered_6ee1300e-e2e0-4f1c-96eb-570e9de14653.pdf)).

In [9]:
ag3.plot_diplotype_clustering_advanced(
    region='2R:40,926,195-40,945,169',
    snp_transcript='AGAP003645-RA',
    cnv_region='2R:40,926,195-40,945,169',
    site_mask='arab',
    sample_sets=['1274-VO-KE-KAMAU-VMF00246','AG1000G-KE'],
    sample_query='taxon == "arabiensis"',
    color="location",
)

                                     


In the plot above there are three diplotypes with low heterozygosity. However, in our example, we only observed a signal of selection for the locations of Mwea and Thika in Central Kenya. Therefore, in our case we are only interested in the diplotype clusters which contain these samples. Two of the clusters contain samples from these Central regions. However, one also contains samples from Kwale and Kilifi where we did not observe a signal of selection and it also does not appear to have any variants uniquely associated to it.
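The reasoning above, which keeps only clusters confined to the locations where the selection signal was seen, can be sketched as a simple tabulation. The cluster names and the cluster assignments below are made up for illustration; only the location names come from this example.

```python
from collections import defaultdict

# Made-up (cluster, location) pairs, one per individual.
assignments = [
    ("cluster_1", "Mwea"),
    ("cluster_1", "Thika"),
    ("cluster_1", "Mwea"),
    ("cluster_2", "Thika"),
    ("cluster_2", "Kwale"),
    ("cluster_2", "Kilifi"),
    ("cluster_3", "Kilifi"),
    ("cluster_3", "Kwale"),
]

# Collect the set of locations represented in each cluster.
locations_by_cluster = defaultdict(set)
for cluster, location in assignments:
    locations_by_cluster[cluster].add(location)

# Locations where the signal of selection was observed.
target_locations = {"Mwea", "Thika"}

# Clusters found only in the target locations.
restricted = sorted(
    cluster
    for cluster, locations in locations_by_cluster.items()
    if locations <= target_locations
)
print(restricted)
```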

We are therefore particularly interested in the remaining diplotype cluster that appears unique to central Kenya. Interestingly, this diplotype also has a unique set of SNPs, E762* and D780N. One of these substitutions (E762*) is a stop-gain mutation, indicating a potential loss of function. The role of Keap1 within its transcriptional pathway is to act as a repressor that prevents the unnecessary expression of detoxification genes in the absence of stress. Therefore, it is plausible that a loss of function could result in elevated antioxidant defence. The variants we have identified here are potentially driving selection in Central Kenya at a gene which we suspect could be involved in insecticide resistance. These SNPs could therefore be targeted for experimental work to validate their role in insecticide resistance, or their frequencies could be monitored to evaluate whether they are on the rise.

## Well done!

In this module we have run diplotype clustering analyses and demonstrated how diplotype clustering can provide insights into the substitutions causing insecticide resistance; in a single snapshot, we can explore amino acid and CNV data and really understand the nature of selection at a genomic region.
