***Training course in data analysis for genomic surveillance of African malaria vectors - Workshop 1***

---

# Module 4 - Analysing SNPs in the *Vgsc* gene

**Theme: Analysis**

In this module we're going to perform an analysis to discover single nucleotide polymorphisms (SNPs) in the voltage-gated sodium channel gene (*Vgsc*), which encodes the binding target for pyrethroid insecticides. 

## Learning objectives

At the end of this module you will be able to:

* Discover mutations (SNPs) that could potentially cause pyrethroid target-site resistance.
* Compute SNP allele frequencies, i.e., how common are they in different mosquito cohorts?
* Perform analyses to compare SNP allele frequencies between mosquitoes from different species, geographical locations and dates of collection.


## Lecture

### English

In [1]:
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed/m3R5SuNgKmw" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

### Français

Coming soon.

## Discovering SNPs in the *Vgsc* gene and computing SNP allele frequencies

First, let's set up the [`malariagen_data`](https://github.com/malariagen/malariagen-data-python) package to access MalariaGEN data in the cloud.

In [2]:
!pip install -q malariagen_data

In [3]:
import malariagen_data

In [4]:
ag3 = malariagen_data.Ag3()
ag3

MalariaGEN Ag3 data resource API
Storage URL,gs://vo_agam_release/
Releases available,3.0
Cohorts analysis,20211101
Species analysis,aim_20200422
Site filters analysis,dt_20200416


To discover SNPs and compute allele frequencies, we're going to use the [`snp_allele_frequencies()` function](https://malariagen.github.io/vector-data/ag3/api.html#snp-allele-frequencies). Let's have a look at the documentation for this function.

In [5]:
ag3.snp_allele_frequencies?

To discover SNPs in the *Vgsc* gene, we need to define some parameters. 

First, we need to decide which gene transcript to use when determining what SNP effects will be. Here we'll use the transcript with identifier "AGAP004707-RD".

In [6]:
transcript = "AGAP004707-RD"

Next, to compute allele frequencies, we need to decide how our mosquitoes will be grouped into cohorts. There are different ways you can do this. For this analysis we'll group spatially by level 1 administrative divisions within countries, and temporally by year.

In [7]:
cohorts = "admin1_year"

Next, we need to choose which samples to analyse. There are a number of different sample sets in the Ag3.0 data resource that we could use for this analysis. Let's check what's available.

In [8]:
ag3.sample_sets(release="3.0")

,sample_set,sample_count,release
0,AG1000G-AO,81,3.0
1,AG1000G-BF-A,181,3.0
2,AG1000G-BF-B,102,3.0
3,AG1000G-BF-C,13,3.0
4,AG1000G-CD,76,3.0
5,AG1000G-CF,73,3.0
6,AG1000G-CI,80,3.0
7,AG1000G-CM-A,303,3.0
8,AG1000G-CM-B,97,3.0
9,AG1000G-CM-C,44,3.0


To keep things simple, for this module we'll focus on mosquitoes from Burkina Faso. There are three sample sets in the Ag3.0 resource providing data on mosquitoes from Burkina Faso.

In [9]:
sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"]

OK, now we're ready to run the analysis.

In [10]:
snp_allele_freqs_df = ag3.snp_allele_frequencies(
    transcript=transcript, 
    cohorts=cohorts, 
    sample_sets=sample_sets, 
    drop_invariant=False,
)
snp_allele_freqs_df

contig,position,ref_allele,alt_allele,aa_change,pass_gamb_colu_arab,pass_gamb_colu,pass_arab,frq_BF-09_gamb_2012,frq_BF-09_colu_2012,frq_BF-09_colu_2014,frq_BF-09_gamb_2014,frq_BF-07_gamb_2004,max_af,effect,impact,ref_codon,alt_codon,aa_pos,ref_aa,alt_aa
2L,2358158,A,C,M1L,True,True,True,0.0,0.0,0.0,0.0,0.0,0.0,START_LOST,HIGH,Atg,Ctg,1.0,M,L
2L,2358158,A,T,M1L,True,True,True,0.0,0.0,0.0,0.0,0.0,0.0,START_LOST,HIGH,Atg,Ttg,1.0,M,L
2L,2358158,A,G,M1V,True,True,True,0.0,0.0,0.0,0.0,0.0,0.0,START_LOST,HIGH,Atg,Gtg,1.0,M,V
2L,2358159,T,A,M1K,True,True,True,0.0,0.0,0.0,0.0,0.0,0.0,NON_SYNONYMOUS_CODING,MODERATE,aTg,aAg,1.0,M,K
2L,2358159,T,C,M1T,True,True,True,0.0,0.0,0.0,0.0,0.0,0.0,NON_SYNONYMOUS_CODING,MODERATE,aTg,aCg,1.0,M,T
2L,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2L,2431616,G,C,*2119S,True,True,True,0.0,0.0,0.0,0.0,0.0,0.0,STOP_LOST,HIGH,tGa,tCa,2119.0,*,S
2L,2431616,G,T,*2119L,True,True,True,0.0,0.0,0.0,0.0,0.0,0.0,STOP_LOST,HIGH,tGa,tTa,2119.0,*,L
2L,2431617,A,C,*2119C,True,True,True,0.0,0.0,0.0,0.0,0.0,0.0,STOP_LOST,HIGH,tgA,tgC,2119.0,*,C
2L,2431617,A,T,*2119C,True,True,True,0.0,0.0,0.0,0.0,0.0,0.0,STOP_LOST,HIGH,tgA,tgT,2119.0,*,C


The output from this function is a pandas DataFrame, where each row provides information about a SNP in the *Vgsc* gene.

Before we go further, to improve our understanding of what these data mean, let's look at some background.

## Grouping samples into cohorts

The MalariaGEN Ag3.0 data resource contains mosquito samples collected across large spatial and temporal scales, and from different mosquito species. When we want to run population genetic analyses on datasets like these, the data must be divided into biologically relevant **cohorts**, where a cohort is simply a group of samples we want to analyse together. 

To help define cohorts and analyse these data, we have added some metadata for each sample about its time and place of collection and its species: 

- **Spatially** - For most analyses we use administrative divisions to group the samples into cohorts. These give two levels of spatial resolution, where admin level 1 divides each country into a few large regions, while admin level 2 provides finer scale divisions such as provinces.

- **Temporally** - For each sample we provide the year and month of collection. Depending on your analysis, you can choose to group samples by year or by year and by month, although note that for some samples the collection month is missing.

- **Taxonomically** - Ag3.0 contains samples from different species in the *Anopheles gambiae* complex. To help with grouping by taxon, we have included a "taxon" field in the sample metadata.
   
Using these three dimensions, we have pre-defined four **cohort sets**, each of which groups samples into cohorts at different levels of spatio-temporal resolution. Within all cohort sets, samples are further subdivided by taxon.

- **admin1_year** - Cohorts obtained by grouping samples by admin level 1, collection year and taxon.
- **admin1_month** - Cohorts obtained by grouping samples by admin level 1, collection year and month, and taxon.
- **admin2_year** - Cohorts obtained by grouping samples by admin level 2, collection year and taxon.
- **admin2_month** - Cohorts obtained by grouping samples by admin level 2, collection year and month, and taxon.
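
To make the idea of a cohort set more concrete, here is a minimal sketch of how samples could be grouped by admin level 1, taxon and year. The sample records and field names below are made up for illustration and are not real Ag3.0 metadata; in practice the `malariagen_data` package does this grouping for you.

```python
from collections import defaultdict

# Hypothetical sample metadata; IDs, field names and values are
# invented for illustration only.
samples = [
    {"sample_id": "S1", "admin1_iso": "BF-09", "year": 2012, "taxon": "gambiae"},
    {"sample_id": "S2", "admin1_iso": "BF-09", "year": 2012, "taxon": "gambiae"},
    {"sample_id": "S3", "admin1_iso": "BF-09", "year": 2012, "taxon": "coluzzii"},
    {"sample_id": "S4", "admin1_iso": "BF-07", "year": 2004, "taxon": "gambiae"},
]


def group_into_cohorts(samples):
    """Group sample IDs by (admin level 1, taxon, year)."""
    cohorts = defaultdict(list)
    for sample in samples:
        key = (sample["admin1_iso"], sample["taxon"], sample["year"])
        cohorts[key].append(sample["sample_id"])
    return dict(cohorts)


cohort_members = group_into_cohorts(samples)
for key, members in sorted(cohort_members.items()):
    print(key, members)
```

Each distinct combination of region, taxon and year becomes one cohort, which is the same idea as the "admin1_year" cohort set.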

Remember above we chose to use the "admin1_year" cohorts for our *Vgsc* analysis. 

In [11]:
cohorts

'admin1_year'

Let's now use this to understand the frequency columns in the SNP allele frequencies DataFrame. Here are the frequency column names:

In [12]:
frequency_columns = [
    col for col in snp_allele_freqs_df.columns 
    if col.startswith("frq_")
]
frequency_columns

['frq_BF-09_gamb_2012',
 'frq_BF-09_colu_2012',
 'frq_BF-09_colu_2014',
 'frq_BF-09_gamb_2014',
 'frq_BF-07_gamb_2004']

Here "frq_" is used to mean that these columns contain frequencies.

The second part of the column name is either "BF-09" or "BF-07". These are [standard identifiers](https://en.wikipedia.org/wiki/ISO_3166-2) that refer to level 1 administrative divisions within countries, often called regions. "BF-07" is the Centre-Sud region within Burkina Faso, and "BF-09" is the Hauts-Bassins region.

The third part refers to the species, and in this example is either "gamb" meaning *Anopheles gambiae* or "colu" meaning *Anopheles coluzzii*.

The final part is the year of collection, which in this example is either 2004, 2012 or 2014.

Thus, e.g., "frq_BF-09_gamb_2012" means the frequency of the allele within the cohort of *Anopheles gambiae* mosquitoes from the BF-09 (Hauts-Bassins) region collected in 2012.
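
If you ever need to pull these parts out programmatically, the naming scheme described above can be unpacked with a few lines of Python. This is a minimal sketch that assumes the four-part "admin1_year" naming shown here; `parse_frequency_column` is a hypothetical helper, not part of the `malariagen_data` API, and other cohort sets may use a different layout.

```python
def parse_frequency_column(name):
    """Split a column name such as 'frq_BF-09_gamb_2012' into its parts."""
    prefix, region, taxon, year = name.split("_")
    if prefix != "frq":
        raise ValueError(f"not a frequency column: {name!r}")
    return {"admin1": region, "taxon": taxon, "year": int(year)}


print(parse_frequency_column("frq_BF-09_gamb_2012"))
```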

## Single nucleotide polymorphisms (SNPs)

In the last module we learnt about reference genomes and gene annotations. Now, one of the cool things we can do once we have an _Anopheles gambiae_ reference genome is sequence the genomes of wild mosquitoes and compare them to the reference genome. By aligning an individual's sequencing reads to the reference, we can call its **genotype**. 

The genotype is derived from how similar the individual is to the reference genome at each **nucleotide position**. 

Where a nucleotide matches the reference genome, it will be called a **reference allele**, where "allele" is just a term for any kind of genetic variant. Where a nucleotide differs from the reference genome, it will be called an **alternative allele**. 

Since there are four possible nucleotides and one of these is the reference, there are always three possible alternative alleles for any given site. Another term for a genetic difference where two individuals differ at a single nucleotide position is **single nucleotide polymorphism** or **SNP**.
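
As a tiny illustration of the "three alternative alleles" rule, the sketch below lists the possible alternative alleles for a given reference nucleotide. The helper name `alternative_alleles` is ours, chosen for illustration.

```python
NUCLEOTIDES = "ACGT"


def alternative_alleles(ref_allele):
    """Return the three nucleotides that differ from the reference allele."""
    return [n for n in NUCLEOTIDES if n != ref_allele]


print(alternative_alleles("A"))  # ['C', 'G', 'T']
```

This is why a single genomic position can contribute up to three rows to a table of SNP alleles.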

In [13]:
%%html
<iframe frameborder="0" style="width:100%;height:700px;" src="https://viewer.diagrams.net/?tags=%7B%7D&highlight=000000&edit=_blank&layers=1&nav=1#G1MXxUT74w6lQ7FRppVmza4M4cdS_ZecNw"></iframe>

This SNP information is displayed in the first four levels of the index of the DataFrame returned by `snp_allele_frequencies()`. We can see each alternative allele represented by its own row in our output.

In [14]:
snp_allele_freqs_df[[]]

contig,position,ref_allele,alt_allele,aa_change
2L,2358158,A,C,M1L
2L,2358158,A,T,M1L
2L,2358158,A,G,M1V
2L,2358159,T,A,M1K
2L,2358159,T,C,M1T
2L,...,...,...,...
2L,2431616,G,C,*2119S
2L,2431616,G,T,*2119L
2L,2431617,A,C,*2119C
2L,2431617,A,T,*2119C


## SNP effects - some SNPs may be more interesting than others

SNPs can have different effects depending on what the nucleotide change is and where in the genome it occurs.

In this analysis, we are interested in SNPs that affect protein structure, specifically those which will change the voltage-gated sodium-channel and could affect protein function and therefore insecticide resistance phenotype, so we need to look within the coding sequences (CDS) of the _Vgsc_ gene.

However, not all SNPs which fall in CDSs cause protein changes, because the genetic code has some redundancy, meaning that different nucleotide sequences can encode the same amino acid. If a SNP in a CDS does change the amino acid, it is called a **non-synonymous** (NS) or **missense** SNP, and if the SNP does not change the amino acid, it is called a **synonymous** SNP.  
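
To see the redundancy of the genetic code in action, here is a minimal sketch that translates a reference codon and an alternative codon using the standard genetic code and reports whether the change is synonymous. The function name `classify_coding_snp` is hypothetical, and the sketch deliberately ignores special cases such as start and stop codons, so it is not a substitute for a proper effect predictor.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, alt_codon):
    """Translate both codons and say whether the amino acid changes."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


# A change in the third base of a leucine codon can go either way.
print(classify_coding_snp("ttA", "ttT"))  # ('L', 'F', 'non-synonymous')
print(classify_coding_snp("ttA", "ttG"))  # ('L', 'L', 'synonymous')
```

Here the codons are written with the changed base in upper case, so the two examples differ only in which nucleotide replaces the final "A".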

Manually predicting SNP effects is quite involved, but `snp_allele_frequencies()` can predict them for us. Let's look at how this is represented in our output DataFrame.


In [15]:
snp_effects_df = snp_allele_freqs_df[["effect", "impact"]]
snp_effects_df

contig,position,ref_allele,alt_allele,aa_change,effect,impact
2L,2358158,A,C,M1L,START_LOST,HIGH
2L,2358158,A,T,M1L,START_LOST,HIGH
2L,2358158,A,G,M1V,START_LOST,HIGH
2L,2358159,T,A,M1K,NON_SYNONYMOUS_CODING,MODERATE
2L,2358159,T,C,M1T,NON_SYNONYMOUS_CODING,MODERATE
2L,...,...,...,...,...,...
2L,2431616,G,C,*2119S,STOP_LOST,HIGH
2L,2431616,G,T,*2119L,STOP_LOST,HIGH
2L,2431617,A,C,*2119C,STOP_LOST,HIGH
2L,2431617,A,T,*2119C,STOP_LOST,HIGH


Let's look specifically at a genomic position where a SNP causes a known insecticide-resistance mutation, often referred to as "kdr" (knockdown resistance). 

In [16]:
snp_effects_df.loc[("2L", 2_422_652)]

ref_allele,alt_allele,aa_change,effect,impact
A,C,L995F,NON_SYNONYMOUS_CODING,MODERATE
A,T,L995F,NON_SYNONYMOUS_CODING,MODERATE
A,G,L995L,SYNONYMOUS_CODING,LOW


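For interest, let's count the number of SNPs we have by their effect, keeping only SNPs where the alternative allele is actually observed in at least one cohort (`max_af > 0`).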

In [17]:
snp_allele_freqs_df.query("max_af > 0").groupby(["effect", "impact"]).size()

effect                 impact  
INTRONIC               MODIFIER    9451
NON_SYNONYMOUS_CODING  MODERATE     121
SPLICE_CORE            HIGH           8
SPLICE_REGION          MODERATE       9
STOP_GAINED            HIGH          14
SYNONYMOUS_CODING      LOW           56
dtype: int64

## SNP allele frequencies

Identifying the presence or absence of SNPs in wild-caught mosquitoes is interesting, but the real value in generating SNP genotypes from large spatiotemporal collections of mosquitoes comes from the ability to see how groups of samples (cohorts) differ between geographical locations, species, and over time.  

One way to compare SNP differences between cohorts is to calculate and compare **SNP allele frequencies** at each position in the genome. The frequency of a SNP allele is the number of times it is found in the cohort, divided by the total number of genome copies in the cohort, i.e., the number of individuals multiplied by 2, because each individual mosquito is diploid and so carries two genome copies.
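
As a worked example of this calculation, the sketch below computes an allele frequency from a small, made-up set of diploid genotype calls, where 0 stands for the reference allele and 1 for an alternative allele. It assumes every individual has a complete genotype call; the helper `allele_frequency` is hypothetical and real analyses also need to handle missing calls.

```python
def allele_frequency(genotypes, allele):
    """Frequency of `allele` among diploid genotype calls.

    `genotypes` is a list of (allele_1, allele_2) tuples, one per individual.
    """
    n_genome_copies = 2 * len(genotypes)
    n_observed = sum(call.count(allele) for call in genotypes)
    return n_observed / n_genome_copies


# Four made-up mosquitoes: one homozygous reference, two heterozygous,
# one homozygous alternative.
genotypes = [(0, 0), (0, 1), (1, 1), (0, 1)]
print(allele_frequency(genotypes, 1))  # 4 alt alleles / 8 genome copies = 0.5
```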

In [18]:
%%html
<iframe frameborder="0" style="width:100%;height:700px;" src="https://viewer.diagrams.net/?tags=%7B%7D&highlight=0000ff&edit=_blank&layers=1&nav=1&title=allele_freqs.drawio#Uhttps%3A%2F%2Fdrive.google.com%2Fuc%3Fid%3D1vbRdL36K4NAeQWOF59o4KR723rZrBnak%26export%3Ddownload"></iframe>

Let's take another look at the allele frequencies we computed above, focusing just on the frequency columns.

In [19]:
snp_allele_freqs_df[frequency_columns + ['max_af']]

contig,position,ref_allele,alt_allele,aa_change,frq_BF-09_gamb_2012,frq_BF-09_colu_2012,frq_BF-09_colu_2014,frq_BF-09_gamb_2014,frq_BF-07_gamb_2004,max_af
2L,2358158,A,C,M1L,0.0,0.0,0.0,0.0,0.0,0.0
2L,2358158,A,T,M1L,0.0,0.0,0.0,0.0,0.0,0.0
2L,2358158,A,G,M1V,0.0,0.0,0.0,0.0,0.0,0.0
2L,2358159,T,A,M1K,0.0,0.0,0.0,0.0,0.0,0.0
2L,2358159,T,C,M1T,0.0,0.0,0.0,0.0,0.0,0.0
2L,...,...,...,...,...,...,...,...,...,...
2L,2431616,G,C,*2119S,0.0,0.0,0.0,0.0,0.0,0.0
2L,2431616,G,T,*2119L,0.0,0.0,0.0,0.0,0.0,0.0
2L,2431617,A,C,*2119C,0.0,0.0,0.0,0.0,0.0,0.0
2L,2431617,A,T,*2119C,0.0,0.0,0.0,0.0,0.0,0.0


And let's inspect the frequencies for SNPs at a specific genomic position of interest.

In [20]:
snp_allele_freqs_df.loc[("2L", 2_422_652), frequency_columns]

ref_allele,alt_allele,aa_change,frq_BF-09_gamb_2012,frq_BF-09_colu_2012,frq_BF-09_colu_2014,frq_BF-09_gamb_2014,frq_BF-07_gamb_2004
A,C,L995F,0.0,0.0,0.0,0.0,0.0
A,T,L995F,1.0,0.865854,0.886792,1.0,0.076923
A,G,L995L,0.0,0.0,0.0,0.0,0.0


## Visualising SNP allele frequencies

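To make our SNP allele frequencies DataFrame easier to interpret, we can filter it down to just non-synonymous SNPs that are at a frequency of at least 5% in at least one of our cohorts. The `max_af` column holds the highest frequency observed across the cohorts, so we can use it for this filter.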

In [21]:
ns_snps_df = snp_allele_freqs_df.query("effect == 'NON_SYNONYMOUS_CODING' and max_af >= 0.05")
ns_snps_df

contig,position,ref_allele,alt_allele,aa_change,pass_gamb_colu_arab,pass_gamb_colu,pass_arab,frq_BF-09_gamb_2012,frq_BF-09_colu_2012,frq_BF-09_colu_2014,frq_BF-09_gamb_2014,frq_BF-07_gamb_2004,max_af,effect,impact,ref_codon,alt_codon,aa_pos,ref_aa,alt_aa
2L,2391228,G,C,V402L,True,True,True,0.0,0.067073,0.028302,0.0,0.0,0.067073,NON_SYNONYMOUS_CODING,MODERATE,Gta,Cta,402.0,V,L
2L,2391228,G,T,V402L,True,True,True,0.0,0.054878,0.084906,0.0,0.0,0.084906,NON_SYNONYMOUS_CODING,MODERATE,Gta,Tta,402.0,V,L
2L,2416980,C,T,T791M,True,True,True,0.163265,0.018293,0.0,0.23913,0.0,0.23913,NON_SYNONYMOUS_CODING,MODERATE,aCg,aTg,791.0,T,M
2L,2422652,A,T,L995F,True,True,True,1.0,0.865854,0.886792,1.0,0.076923,1.0,NON_SYNONYMOUS_CODING,MODERATE,ttA,ttT,995.0,L,F
2L,2429617,T,C,I1527T,True,True,True,0.0,0.121951,0.113208,0.0,0.0,0.121951,NON_SYNONYMOUS_CODING,MODERATE,aTt,aCt,1527.0,I,T
2L,2429745,A,T,N1570Y,True,True,True,0.209184,0.25,0.320755,0.141304,0.038462,0.320755,NON_SYNONYMOUS_CODING,MODERATE,Aat,Tat,1570.0,N,Y
2L,2429897,A,G,E1597G,True,True,True,0.066327,0.0,0.0,0.032609,0.0,0.066327,NON_SYNONYMOUS_CODING,MODERATE,gAa,gGa,1597.0,E,G
2L,2429915,A,C,K1603T,True,True,True,0.0,0.054878,0.056604,0.0,0.0,0.056604,NON_SYNONYMOUS_CODING,MODERATE,aAg,aCg,1603.0,K,T
2L,2430424,G,T,A1746S,False,True,False,0.153061,0.0,0.0,0.23913,0.0,0.23913,NON_SYNONYMOUS_CODING,MODERATE,Gcc,Tcc,1746.0,A,S
2L,2430863,T,C,I1868T,True,True,True,0.25,0.0,0.0,0.206522,0.0,0.25,NON_SYNONYMOUS_CODING,MODERATE,aTa,aCa,1868.0,I,T


To make things even clearer, we have included a heatmap plotting function to style our filtered DataFrame, called [`plot_frequencies_heatmap()`](https://malariagen.github.io/vector-data/ag3/api.html#plot-frequencies-heatmap).

In [22]:
ag3.plot_frequencies_heatmap(ns_snps_df, width=600)

## Amino acid substitution frequencies

You might have noticed that there are two rows with `V402L` in our previous heatmap plot. This is because in Burkina Faso, we find two different alternative alleles at the same genomic position, both causing the same amino acid substitution (valine to leucine).

If we are just interested in amino acid change frequencies, for example, when looking at potential insecticide resistance conferring mutations, we might want to combine the frequencies of the two alleles which cause V402L. In this case, we can use the `aa_allele_frequencies()` function in exactly the same way as we used `snp_allele_frequencies()`.

In [23]:
aa_allele_freqs_df = ag3.aa_allele_frequencies(
    transcript=transcript, 
    cohorts=cohorts, 
    sample_sets=sample_sets
)
aa_allele_freqs_df

aa_change,frq_BF-09_gamb_2012,frq_BF-09_colu_2012,frq_BF-09_colu_2014,frq_BF-09_gamb_2014,frq_BF-07_gamb_2004,max_af
A32V,0.000000,0.006098,0.0,0.000000,0.000000,0.006098
G54C,0.000000,0.018293,0.0,0.010870,0.000000,0.018293
P55L,0.005102,0.000000,0.0,0.010870,0.000000,0.010870
P59T,0.000000,0.000000,0.0,0.021739,0.000000,0.021739
G73D,0.000000,0.006098,0.0,0.000000,0.000000,0.006098
...,...,...,...,...,...,...
A2023G,0.000000,0.000000,0.0,0.000000,0.038462,0.038462
S2037R,0.005102,0.000000,0.0,0.000000,0.000000,0.005102
I2053V,0.000000,0.000000,0.0,0.000000,0.038462,0.038462
G2055V,0.005102,0.000000,0.0,0.000000,0.000000,0.005102


Let's filter it again to just amino acid changes at a frequency of at least 5% in at least one cohort. We don't need to filter for non-synonymous mutations this time as this function has already done that for us.


In [24]:
aa_filt_df = aa_allele_freqs_df.query("max_af >= 0.05")
aa_filt_df

aa_change,frq_BF-09_gamb_2012,frq_BF-09_colu_2012,frq_BF-09_colu_2014,frq_BF-09_gamb_2014,frq_BF-07_gamb_2004,max_af
V402L,0.0,0.121951,0.113208,0.0,0.0,0.121951
T791M,0.163265,0.018293,0.0,0.23913,0.0,0.23913
L995F,1.0,0.865854,0.886792,1.0,0.076923,1.0
I1527T,0.0,0.121951,0.113208,0.0,0.0,0.121951
N1570Y,0.209184,0.25,0.320755,0.141304,0.038462,0.320755
E1597G,0.066327,0.0,0.0,0.032609,0.0,0.066327
K1603T,0.0,0.054878,0.056604,0.0,0.0,0.056604
A1746S,0.153061,0.0,0.0,0.23913,0.0,0.23913
I1868T,0.25,0.0,0.0,0.206522,0.0,0.25
P1874L,0.22449,0.073171,0.056604,0.26087,0.0,0.26087
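
The `V402L` row above can be checked by hand. The sketch below takes the two SNP allele frequencies reported earlier for position 2391228 in the two *An. coluzzii* cohorts and adds them up per amino acid change. It is only an illustration of the idea, under the assumption that both alleles are measured against the same number of genome copies in each cohort; it is not how `aa_allele_frequencies()` is implemented.

```python
from collections import defaultdict

# (alt_allele, aa_change, frq_BF-09_colu_2012, frq_BF-09_colu_2014),
# copied from the SNP allele frequency table earlier in this module.
snp_rows = [
    ("C", "V402L", 0.067073, 0.028302),
    ("T", "V402L", 0.054878, 0.084906),
]

# Sum the frequencies of all SNP alleles that cause the same amino acid change.
aa_freqs = defaultdict(lambda: [0.0, 0.0])
for alt_allele, aa_change, frq_2012, frq_2014 in snp_rows:
    aa_freqs[aa_change][0] += frq_2012
    aa_freqs[aa_change][1] += frq_2014

for aa_change, (frq_2012, frq_2014) in aa_freqs.items():
    print(aa_change, round(frq_2012, 6), round(frq_2014, 6))
```

The two sums, 0.121951 and 0.113208, match the `V402L` values in the amino acid frequency table.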


Now we can visualise these frequencies the same way we did before.

In [25]:
ag3.plot_frequencies_heatmap(aa_filt_df, width=600)

## Well done!

In this module we have learnt how to analyse SNP mutations in the target of pyrethroid insecticides, the voltage-gated sodium-channel. We have calculated the allele frequencies of the SNPs in cohorts of mosquitoes and learnt how to filter and plot them for ease of interpretation.

## Practical exercises

### English

1. Open this notebook in Google Colab and run it for yourself from top to bottom. Hint: click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.
2. Looking at the heatmap output (either amino acid or SNP), can you spot a relationship between the `V402L` and `L995F` frequencies? If so, what is it?
3. Re-run the whole analysis but using the Ghanaian sample set. Hint: try `sample_sets = "AG1000G-GH"`, or any other sample set of interest.
4. What are the cohorts for this new sample set? Hint: see `frequency_columns`.
5. Above, we looked at the _kdr_ "West" SNP position. Compare and contrast this with the _kdr_ "East" SNP position. What is the amino acid change for _kdr_ "East"? Hint: the position is 2422651. 
6. For the Ghanaian sample set, add a y_label which says "aa change" to the amino acid frequency heatmap. Hint: you can view all the plotting function's parameters with `ag3.plot_frequencies_heatmap?`.
7. Remove the colorbar from the same heatmap. Hint: `False`.
8. Is the same relationship between `V402L` and `L995F` frequencies present in Ghana? What might be an evolutionary interpretation of this relationship?

**When you’ve had enough, create a link to your notebook, and share it with someone. If you’re attending a training workshop, paste the link into the workshop Slack channel.**

### Français

Coming soon.