# **Insecticide resistance**

Insecticide resistance in malaria vectors occurs when mosquitoes develop the ability to survive exposure to commonly used insecticides, posing a significant challenge to malaria control efforts. This reduces the effectiveness of interventions like insecticide-treated bed nets and indoor residual spraying. Resistance can result from genetic changes in the mosquito population, acting through mechanisms such as: (i) modification of the insecticide's molecular target, (ii) metabolic resistance, in which altered detoxification pathways degrade the insecticide, (iii) cuticular resistance, in which a thickened cuticle limits insecticide penetration, and (iv) behavioral changes that avoid contact with the insecticide. Addressing insecticide resistance in malaria vectors is essential for the success of control programs and to prevent disease resurgence.
Here, we will focus on target site modification and metabolic resistance mechanisms.

### **Target site mutations**

Commonly recognized target site mutations include G280S in the acetylcholinesterase gene (Ace-1); L995F, L995S, and N1575Y in the voltage-gated sodium channel (VGSC) gene, also known as para; and A296S or A296G in the GABA (gamma-aminobutyric acid) receptor gene (Rdl).


To discover SNPs in a specific gene, we need the gene's transcript, which is used to determine the effect of each SNP. Because the MalariaGEN data set is large, as seen previously, there are several ways to query it: we can look at a SNP's frequency in one or several populations, check for a SNP's presence in a given taxon across all samples, and so on.

First, let's identify the SNPs present in a specific gene. You can easily obtain the gene ID (gene identifier) from databases like [VectorBase](https://vectorbase.org/vectorbase/app/), which contain information on the genes of vectors such as mosquitoes. Let's search for the gene IDs of the genes mentioned above.

**Question 1:** Give the IDs of the genes cited above.

In [1]:
%pip install -q --no-warn-conflicts malariagen_data

In [2]:
#import needed packages
import pandas as pd
import malariagen_data

In [4]:
# create the Ag3 API client, which gives access to the MalariaGEN data
ag3 = malariagen_data.Ag3()
ag3

MalariaGEN Ag3 API client
"Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs."
Storage URL,gs://vo_agam_release/
Data releases available,"3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8"
Results cache,
Cohorts analysis,20231215
AIM analysis,20220528
Site filters analysis,dt_20200416
Software version,malariagen_data 8.8.0
Client location,"Iowa, US"


Using the gene ID AGAP004050 with transcript ID AGAP004050-RA, let's dive in.

**Question 2:** To which gene does this ID belong?

**ag3.snp_allele_frequencies** is a function that computes allele frequencies. You can query its built-in help to see how it works and which parameters it needs.

In [10]:
?ag3.snp_allele_frequencies
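
Before running the function, it can help to see what an allele frequency per cohort means in practice. The following is a minimal sketch on made-up diploid genotype calls at a single site; the sample names, cohort labels and the helper `allele_frequencies` are hypothetical, and this is not how `malariagen_data` implements the calculation internally.

```python
from collections import Counter

# Toy diploid genotype calls at ONE site (0 = reference allele, 1..3 = alternate alleles).
# Sample names, cohort labels and genotypes are made up for illustration.
genotypes = {
    "s1": (0, 1),
    "s2": (1, 1),
    "s3": (0, 0),
    "s4": (0, 2),
    "s5": (0, 0),
}
cohort_of = {
    "s1": "cohort_A",
    "s2": "cohort_A",
    "s3": "cohort_B",
    "s4": "cohort_B",
    "s5": "cohort_B",
}


def allele_frequencies(genotypes, cohort_of, n_alleles=4):
    """Return {cohort: [frequency of allele 0, 1, 2, 3]} from diploid calls."""
    counts = {}
    for sample, calls in genotypes.items():
        counter = counts.setdefault(cohort_of[sample], Counter())
        # Skip missing calls, assumed here to be coded as -1.
        counter.update(a for a in calls if a >= 0)
    freqs = {}
    for cohort, counter in counts.items():
        total = sum(counter.values())
        freqs[cohort] = [
            counter[a] / total if total else float("nan") for a in range(n_alleles)
        ]
    return freqs


freqs = allele_frequencies(genotypes, cohort_of)
for cohort, f in freqs.items():
    print(cohort, [round(x, 3) for x in f])
```

Each cohort frequency is simply the number of copies of an allele divided by the number of called alleles in that cohort, which is why small cohorts give coarse frequency values.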

In [11]:
ag3.sample_sets(release="3.3")

,sample_set,sample_count,study_id,study_url,release
0,1178-VO-UG-LAWNICZAK-VMF00025,57,1178-VO-UG-LAWNICZAK,https://www.malariagen.net/partner_study/1178-...,3.3
1,1190-VO-GH-AMENGA-ETEGO-VMF00013,235,1190-VO-GH-AMENGA-ETEGO,https://www.malariagen.net/partner_study/1190-...,3.3
2,1190-VO-GH-AMENGA-ETEGO-VMF00014,2,1190-VO-GH-AMENGA-ETEGO,https://www.malariagen.net/partner_study/1190-...,3.3
3,1190-VO-GH-AMENGA-ETEGO-VMF00028,76,1190-VO-GH-AMENGA-ETEGO,https://www.malariagen.net/partner_study/1190-...,3.3
4,1190-VO-GH-AMENGA-ETEGO-VMF00029,265,1190-VO-GH-AMENGA-ETEGO,https://www.malariagen.net/partner_study/1190-...,3.3
5,1190-VO-GH-AMENGA-ETEGO-VMF00046,186,1190-VO-GH-AMENGA-ETEGO,https://www.malariagen.net/partner_study/1190-...,3.3
6,1190-VO-GH-AMENGA-ETEGO-VMF00047,181,1190-VO-GH-AMENGA-ETEGO,https://www.malariagen.net/partner_study/1190-...,3.3


In [13]:
# set the transcript ID
transcript = "AGAP004050-RA"
# choose how samples are grouped into cohorts (here by level-2 administrative area, taxon and year)
cohorts = "admin2_year"
# choose the sample sets to analyse; to list those available in a release, run:
# ag3.sample_sets(release="3.3")[['sample_set', 'study_id', 'sample_count']]
sample_sets = ["1178-VO-UG-LAWNICZAK-VMF00025", "1190-VO-GH-AMENGA-ETEGO-VMF00013", "1190-VO-GH-AMENGA-ETEGO-VMF00046"]

In [14]:
# with all those settings in place, we can run the function introduced above
snp_allele_freqs_df = ag3.snp_allele_frequencies(
    transcript=transcript,
    cohorts=cohorts,
    sample_sets=sample_sets,
    drop_invariant=False,
)
snp_allele_freqs_df



contig,position,ref_allele,alt_allele,aa_change,pass_gamb_colu_arab,pass_gamb_colu,pass_arab,frq_GH-UE_Kasena-Nankana-East_colu_2016,frq_GH-UE_Kasena-Nankana-East_colu_2017,frq_GH-UE_Kasena-Nankana-East_gamb_2016,frq_GH-UE_Kasena-Nankana-West_colu_2016,frq_GH-UE_Kasena-Nankana-West_colu_2017,frq_GH-UE_Kasena-Nankana-West_gamb_2016,frq_GH-UE_Kasena-Nankana-West_gamb_2017,...,max_af,transcript,effect,impact,ref_codon,alt_codon,aa_pos,ref_aa,alt_aa,label
2R,48703664,G,A,,True,True,True,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,AGAP004050-RA,THREE_PRIME_UTR,LOW,,,,,,"2R:48,703,664 G>A"
2R,48703664,G,C,,True,True,True,0.019868,0.019481,0.115385,0.037736,0.017442,0.038462,0.166667,...,0.166667,AGAP004050-RA,THREE_PRIME_UTR,LOW,,,,,,"2R:48,703,664 G>C"
2R,48703664,G,T,,True,True,True,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,AGAP004050-RA,THREE_PRIME_UTR,LOW,,,,,,"2R:48,703,664 G>T"
2R,48703665,T,A,,True,True,True,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,AGAP004050-RA,THREE_PRIME_UTR,LOW,,,,,,"2R:48,703,665 T>A"
2R,48703665,T,C,,True,True,True,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,AGAP004050-RA,THREE_PRIME_UTR,LOW,,,,,,"2R:48,703,665 T>C"
2R,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2R,48788459,G,C,,False,False,False,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,AGAP004050-RA,FIVE_PRIME_UTR,LOW,,,,,,"2R:48,788,459 G>C"
2R,48788459,G,T,,False,False,False,0.003311,0.000000,0.000000,0.000000,0.005882,0.000000,0.000000,...,0.005882,AGAP004050-RA,FIVE_PRIME_UTR,LOW,,,,,,"2R:48,788,459 G>T"
2R,48788460,A,C,,False,False,False,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,AGAP004050-RA,FIVE_PRIME_UTR,LOW,,,,,,"2R:48,788,460 A>C"
2R,48788460,A,T,,False,False,False,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,AGAP004050-RA,FIVE_PRIME_UTR,LOW,,,,,,"2R:48,788,460 A>T"


Not all SNPs that fall within coding sequences (CDSs) cause protein changes, because the genetic code has some redundancy: different nucleotide sequences can encode the same amino acid. As we said, SNPs in a CDS that change the amino acid are called non-synonymous (NS) or missense SNPs, while those that do not change the amino acid are called synonymous SNPs.
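
The redundancy of the genetic code can be checked directly. The sketch below builds the standard codon table and classifies a codon change as synonymous or non-synonymous. The first codon pair (`Ggc` to `Agc`) appears in the filtered table later in this section; the second pair is a hypothetical third-position change. The helper name `classify_codon_change` is made up for illustration, and stop codons are not treated specially.

```python
# Standard genetic code, written compactly: bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_codon_change(ref_codon, alt_codon):
    """Translate both codons and report whether the amino acid changes."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    effect = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, effect


# Codon pair reported in the table (the capital letter marks the changed base).
print(classify_codon_change("Ggc", "Agc"))
# Hypothetical change at the third codon position.
print(classify_codon_change("ggC", "ggT"))
```

Changes at the third codon position are often synonymous, whereas changes at the first or second position usually alter the amino acid.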
Let's see the impact and effect of the previous SNPs.

In [16]:
snp_effects_df = snp_allele_freqs_df[["effect", "impact"]]
snp_effects_df

contig,position,ref_allele,alt_allele,aa_change,effect,impact
2R,48703664,G,A,,THREE_PRIME_UTR,LOW
2R,48703664,G,C,,THREE_PRIME_UTR,LOW
2R,48703664,G,T,,THREE_PRIME_UTR,LOW
2R,48703665,T,A,,THREE_PRIME_UTR,LOW
2R,48703665,T,C,,THREE_PRIME_UTR,LOW
2R,...,...,...,...,...,...
2R,48788459,G,C,,FIVE_PRIME_UTR,LOW
2R,48788459,G,T,,FIVE_PRIME_UTR,LOW
2R,48788460,A,C,,FIVE_PRIME_UTR,LOW
2R,48788460,A,T,,FIVE_PRIME_UTR,LOW


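Let's assume you have a specific SNP to interrogate, such as those mentioned in the lecture. You can look it up by slicing the table with its contig and position, for example
`snp_effects_df.loc[("2R", 2_422_652)]`, replacing the contig (chromosome arm) and the position with those of your SNP. The lookup only succeeds if that contig and position are present in the table.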
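
To illustrate how this kind of lookup behaves, here is a small sketch on a toy table with the same five index levels. The rows are copied from the non-synonymous SNPs reported further down for this transcript; the toy DataFrame `df` itself is an illustration, not the real output object.

```python
import pandas as pd

# Toy table indexed like the SNP tables in this notebook.
rows = [
    ("2R", 48711721, "G", "A", "P594S", "NON_SYNONYMOUS_CODING", "MODERATE"),
    ("2R", 48711860, "C", "A", "E547D", "NON_SYNONYMOUS_CODING", "MODERATE"),
    ("2R", 48712108, "C", "T", "G465S", "NON_SYNONYMOUS_CODING", "MODERATE"),
    ("2R", 48712123, "C", "T", "G460S", "NON_SYNONYMOUS_CODING", "MODERATE"),
]
index_levels = ["contig", "position", "ref_allele", "alt_allele", "aa_change"]
df = pd.DataFrame(rows, columns=index_levels + ["effect", "impact"]).set_index(
    index_levels
)

# Select by the first two index levels; the trailing ":" keeps all columns.
hit = df.loc[("2R", 48_712_123), :]
print(hit)

# A contig/position pair that is absent from the table raises KeyError.
try:
    df.loc[("2R", 2_422_652), :]
    found = True
except KeyError:
    found = False
print("found:", found)
```

The result keeps the remaining index levels (`ref_allele`, `alt_allele`, `aa_change`), so every alternate allele recorded at that position is returned.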

In [17]:
# The previous table is huge, so keep only non-synonymous SNPs whose maximum frequency is at least 5%
ns_snps_df = snp_allele_freqs_df.query("effect == 'NON_SYNONYMOUS_CODING' and max_af >= 0.05")
ns_snps_df

contig,position,ref_allele,alt_allele,aa_change,pass_gamb_colu_arab,pass_gamb_colu,pass_arab,frq_GH-UE_Kasena-Nankana-East_colu_2016,frq_GH-UE_Kasena-Nankana-East_colu_2017,frq_GH-UE_Kasena-Nankana-East_gamb_2016,frq_GH-UE_Kasena-Nankana-West_colu_2016,frq_GH-UE_Kasena-Nankana-West_colu_2017,frq_GH-UE_Kasena-Nankana-West_gamb_2016,frq_GH-UE_Kasena-Nankana-West_gamb_2017,...,max_af,transcript,effect,impact,ref_codon,alt_codon,aa_pos,ref_aa,alt_aa,label
2R,48711721,G,A,P594S,True,True,True,0.003311,0.0,0.076923,0.0,0.0,0.0,0.041667,...,0.076923,AGAP004050-RA,NON_SYNONYMOUS_CODING,MODERATE,Ccg,Tcg,594.0,P,S,"2R:48,711,721 G>A (P594S)"
2R,48711752,C,A,Q583H,False,False,False,0.003311,0.0,0.0,0.0,0.0,0.0,0.0,...,0.119048,AGAP004050-RA,NON_SYNONYMOUS_CODING,MODERATE,caG,caT,583.0,Q,H,"2R:48,711,752 C>A (Q583H)"
2R,48711758,C,G,Q581H,False,False,True,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,...,0.076923,AGAP004050-RA,NON_SYNONYMOUS_CODING,MODERATE,caG,caC,581.0,Q,H,"2R:48,711,758 C>G (Q581H)"
2R,48711860,C,A,E547D,True,True,True,0.056291,0.064935,0.0,0.018868,0.052326,0.0,0.0,...,0.064935,AGAP004050-RA,NON_SYNONYMOUS_CODING,MODERATE,gaG,gaT,547.0,E,D,"2R:48,711,860 C>A (E547D)"
2R,48712108,C,T,G465S,True,True,True,0.0,0.0,0.192308,0.0,0.0,0.038462,0.083333,...,0.192308,AGAP004050-RA,NON_SYNONYMOUS_CODING,MODERATE,Ggc,Agc,465.0,G,S,"2R:48,712,108 C>T (G465S)"
2R,48712123,C,T,G460S,True,True,True,0.387417,0.311688,0.0,0.481132,0.377907,0.0,0.0,...,0.481132,AGAP004050-RA,NON_SYNONYMOUS_CODING,MODERATE,Ggc,Agc,460.0,G,S,"2R:48,712,123 C>T (G460S)"
2R,48712249,C,T,A418T,False,False,True,0.910596,0.941558,0.153846,0.971698,0.930233,0.269231,0.166667,...,0.971698,AGAP004050-RA,NON_SYNONYMOUS_CODING,MODERATE,Gcc,Acc,418.0,A,T,"2R:48,712,249 C>T (A418T)"
2R,48712480,G,T,H341N,True,True,True,0.003311,0.0,0.0,0.0,0.0,0.0,0.083333,...,0.083333,AGAP004050-RA,NON_SYNONYMOUS_CODING,MODERATE,Cac,Aac,341.0,H,N,"2R:48,712,480 G>T (H341N)"


The table is now much smaller and more readable.
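
The filter relies on the `max_af` column, which we assume here to be the highest frequency of a variant across the cohort columns. The sketch below reproduces that logic on a toy table: the frequencies are copied from the first four cohort columns of the output above, the short column names `frq_c1` to `frq_c4` are placeholders, and the hidden columns are ignored, so the toy `max_af` values may differ from the real ones.

```python
import pandas as pd

# Toy frequency table; values copied from four visible cohort columns.
df = pd.DataFrame(
    {
        "aa_change": ["P594S", "E547D", "G460S", "H341N", None],
        "effect": ["NON_SYNONYMOUS_CODING"] * 4 + ["THREE_PRIME_UTR"],
        "frq_c1": [0.003311, 0.056291, 0.387417, 0.003311, 0.019868],
        "frq_c2": [0.0, 0.064935, 0.311688, 0.0, 0.019481],
        "frq_c3": [0.076923, 0.0, 0.0, 0.0, 0.115385],
        "frq_c4": [0.0, 0.018868, 0.481132, 0.0, 0.037736],
    }
)

# Maximum frequency across the cohort columns (assumed meaning of max_af).
frq_cols = [c for c in df.columns if c.startswith("frq_")]
df["max_af"] = df[frq_cols].max(axis=1)

# Same query string as in the cell above.
ns = df.query("effect == 'NON_SYNONYMOUS_CODING' and max_af >= 0.05")
print(ns[["aa_change", "max_af"]])
```

A variant is kept as soon as one cohort reaches the threshold, so a SNP that is rare overall but common in a single cohort still passes the filter.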

In [18]:
# Let's use the built-in function to plot the frequencies as a heatmap
ag3.plot_frequencies_heatmap(ns_snps_df)

**Next!**

Let's look across a larger data set while restricting the analysis to one species or, sometimes, one country. For this, the functions `snp_allele_frequencies()` and `aa_allele_frequencies()` accept a useful `sample_query` parameter.
Let's see it in action by computing amino acid allele frequencies for all An. gambiae samples in the Ag3.3 data resource.

In [21]:
aa_gam_freqs_df = ag3.aa_allele_frequencies(
    transcript=transcript,
    cohorts=cohorts,
    sample_sets="3.3",
    sample_query="taxon == 'gambiae'",
)
aa_gam_freqs_df = aa_gam_freqs_df.query("max_af > 0.05")
ag3.plot_frequencies_heatmap(aa_gam_freqs_df)
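
The `sample_query` string uses pandas query syntax and is evaluated against the sample metadata. As a rough sketch of what that selection does, the example below applies the same kind of query to a made-up metadata table; the sample IDs and values are hypothetical, and only the column names `taxon`, `country` and `year` are meant to mirror the real metadata.

```python
import pandas as pd

# Made-up sample metadata for illustration.
samples = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3", "s4"],
        "taxon": ["gambiae", "coluzzii", "gambiae", "arabiensis"],
        "country": ["Ghana", "Ghana", "Uganda", "Uganda"],
        "year": [2016, 2017, 2012, 2012],
    }
)

# Same query string as passed to sample_query above.
gambiae = samples.query("taxon == 'gambiae'")
print(gambiae["sample_id"].tolist())

# Conditions can be combined, for example one species in one country.
ghana_gambiae = samples.query("taxon == 'gambiae' and country == 'Ghana'")
print(ghana_gambiae["sample_id"].tolist())
```

Only the samples that satisfy the query are used when the frequencies are computed, so narrowing the query also changes which cohorts appear in the heatmap.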



More on this topic [here](https://anopheles-genomic-surveillance.github.io/workshop-1/module-4-vgsc-snps.html#)