Large genomic datasets like the [Vector Observatory](https://malariagen.net/vobs) can be difficult to analyse. To make analysing these data easier, we created [analytical software](https://malariagen.github.io/malariagen-data-python/latest/) and an [online training course](https://anopheles-genomic-surveillance.github.io/), and both have helped. But many potential users of these data come from an entomology background without much experience of either genomics or programming, so coming up with the right code can still be a challenge. AI coding assistants are getting very good. Would it be possible to create a specialised AI assistant to support generation of code to analyse genomic data on malaria mosquitoes?

# Use case: investigating insecticide resistance

Consider a use case where an analyst would like to investigate evidence for insecticide resistance in mosquito populations within a country of interest. Assume the analyst has good knowledge of vector biology but patchy knowledge of genomics and the molecular biology of insecticide resistance. Assume the analyst has completed [basic training in vector genomics data analysis](https://anopheles-genomic-surveillance.github.io/) but is not an experienced programmer and finds writing data analysis code challenging.

The analyst might start with a relatively broad question like, "Is there resistance to pyrethroid insecticides in Uganda?" But there are multiple possible mechanisms of pyrethroid resistance that need to be considered, and there are also multiple species of malaria vector present in Uganda. So this question is likely to need some refinement. If I were assisting this user, I would ask a series of questions to narrow down the focus, and to help the user become aware of the relevant data that are available. I would then be able to identify a smaller set of analysis function calls and parameter values that are relevant. 

Would it be possible to create a chatbot assistant to help with this refinement? E.g., here is a hypothetical conversation between a user and an assistant that would refine the scope of the initial question to reach a point where code generation would be easier:

<table style="width: 100%; font-size: 1.1em; font-style: italic;">
    <colgroup>
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 50%;">
    </colgroup>
    <thead>
        <tr>
            <th style="text-align: left;">User</th>
            <th style="text-align: left;">Assistant</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>
<p>Is there resistance to pyrethroid insecticides in Uganda?</p> 
            </td>
            <td></td>
        </tr>
        <tr>
            <td></td>
            <td>
<p>Which mosquito species are you interested in?</p>
<p>In Uganda there are data available for <em>Anopheles gambiae</em>, <em>Anopheles arabiensis</em> and <em>Anopheles funestus</em>.</p>
            </td>
        </tr>
        <tr>
            <td>
<p>Anopheles gambiae.</p>
            </td>
            <td></td>            
        </tr>
        <tr>
            <td></td>
            <td>
<p>Which molecular mechanism of pyrethroid resistance are you interested in?</p>
<p>E.g., target-site resistance, Cyp-mediated metabolic resistance, or other resistance mechanisms?</p>
            </td>            
        </tr>
        <tr>
            <td>
<p>Cyp-mediated metabolic resistance.</p>
            </td>
            <td></td>            
        </tr>
    </tbody>
</table>
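
The refinement dialogue above can be thought of as filling in a small set of slots before any code is generated. As a minimal sketch, assuming a hypothetical `AnalysisScope` record and a hypothetical `next_question` helper (neither is part of `malariagen_data`), an assistant could simply ask about the first slot that is still empty:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnalysisScope:
    """Hypothetical record of what the assistant has learned so far."""

    country: Optional[str] = None
    species: Optional[str] = None
    mechanism: Optional[str] = None


# Questions to ask, in order, for each slot that is still empty.
QUESTIONS = {
    "country": "Which country are you interested in?",
    "species": "Which mosquito species are you interested in?",
    "mechanism": "Which molecular mechanism of pyrethroid resistance are you interested in?",
}


def next_question(scope: AnalysisScope) -> Optional[str]:
    """Return the question for the first unfilled slot, or None if the scope is complete."""
    for slot, question in QUESTIONS.items():
        if getattr(scope, slot) is None:
            return question
    return None


# Replay the conversation: the country is already known from the first message.
scope = AnalysisScope(country="Uganda")
print(next_question(scope))
scope.species = "Anopheles gambiae"
print(next_question(scope))
scope.mechanism = "Cyp-mediated metabolic resistance"
print(next_question(scope))
```

Once `next_question` returns `None`, the scope is narrow enough to start choosing function calls and parameter values. A real assistant would also need to check each answer against the data that are actually available, which this sketch leaves out.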


At this point, we probably have just enough information to start suggesting some code. E.g., we know that copy number amplification of *Cyp* genes has been associated with pyrethroid resistance. I might suggest analysing CNV frequencies at a selection of genome regions containing *Cyp* genes which have previously been linked to pyrethroid resistance. Here is some code which uses two function calls to compute and then visualise gene CNV frequencies in Uganda, grouping mosquitoes by year and top-level administrative unit.

In [2]:
# Set up the API.
import malariagen_data
ag3 = malariagen_data.Ag3()

In [3]:
# Define genome regions containing genes of interest.
cyp6aap_region = "2R:28,480,000-28,510,000"
cyp9k1_region = "X:15,240,000-15,250,000"
cyp6mz_region = "3R:6,924,000-6,980,000"
cyp_regions = [cyp6aap_region, cyp9k1_region, cyp6mz_region]

# Compute gene CNV frequencies.
df_cyp_cnv_frq = ag3.gene_cnv_frequencies(
    region=cyp_regions,
    sample_query="country == 'Uganda' and taxon == 'gambiae'",
    cohorts="admin1_year",
)

# Visualise CNV frequencies as a heatmap table.
ag3.plot_frequencies_heatmap(df_cyp_cnv_frq)
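
The region strings used above follow a `contig:start-end` pattern, with commas as thousands separators. If an assistant is generating such strings, it may be worth checking them before they reach an analysis function. The following is a minimal sketch of such a check; `Region` and `parse_region` are hypothetical names introduced here for illustration, and this is not how `malariagen_data` parses regions internally.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical container for a parsed genome region."""

    contig: str
    start: int
    end: int


# A contig name, a colon, then start and end coordinates which may contain commas.
_REGION_PATTERN = re.compile(r"^(?P<contig>[^:]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_region(text: str) -> Region:
    """Parse a 'contig:start-end' string, raising ValueError if it is malformed."""
    match = _REGION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a contig:start-end region: {text!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError(f"region start is after region end: {text!r}")
    return Region(match["contig"], start, end)


# Check the three regions defined above and report how many bases each spans.
for text in ["2R:28,480,000-28,510,000", "X:15,240,000-15,250,000", "3R:6,924,000-6,980,000"]:
    region = parse_region(text)
    print(region.contig, region.end - region.start + 1)
```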

                                     


But there are still a few decisions to make.

E.g., we know that there are 107 cytochrome P450 (*Cyp*) genes in the *Anopheles gambiae* genome. Only some of these have previously been associated with resistance, but it's possible that prior knowledge is incomplete. Should we analyse all 107 genes, or narrow down to a smaller set of validated genes?

Also, some studies are starting to find that SNPs in *Cyp* genes are markers of resistance. Should we analyse SNPs as well as CNVs?

These are tricky decisions because we could come up with a function call that is perfectly sensible from a biological point of view, but would overwhelm the user with data.
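
One way to make this trade-off concrete is to estimate how large a result would be before running the analysis. The sketch below is only illustrative: it assumes one frequency value per gene per cohort, and both the `MAX_CELLS` threshold and the cohort count are hypothetical numbers chosen for the example, not properties of the data.

```python
# Hypothetical limit on how many frequency values a user can reasonably read at once.
MAX_CELLS = 500


def estimate_cells(n_genes: int, n_cohorts: int) -> int:
    """Estimate the size of a frequency table, assuming one value per gene per cohort."""
    return n_genes * n_cohorts


def should_narrow(n_genes: int, n_cohorts: int, max_cells: int = MAX_CELLS) -> bool:
    """Return True if the assistant should ask the user to narrow the gene set first."""
    return estimate_cells(n_genes, n_cohorts) > max_cells


# All 107 Cyp genes versus a short list of 5 genes, with a hypothetical 10 cohorts.
print(should_narrow(n_genes=107, n_cohorts=10))
print(should_narrow(n_genes=5, n_cohorts=10))
```

If the estimate is too large, the assistant could ask a further refining question rather than generating a call that returns more than the user can interpret.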

In [4]:
df_gff = ag3.genome_features()
df_gff

                                     

| | contig | source | type | start | end | score | strand | phase | ID | Parent | Name | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2L | VectorBase | chromosome | 1 | 49364325 | | | | 2L | | | |
| 1 | 2L | VectorBase | gene | 157348 | 186936 | | - | | AGAP004677 | | | methylenetetrahydrofolate dehydrogenase(NAD ) ... |
| 2 | 2L | VectorBase | mRNA | 157348 | 181305 | | - | | AGAP004677-RA | AGAP004677 | | |
| 3 | 2L | VectorBase | three_prime_UTR | 157348 | 157495 | | - | | | AGAP004677-RA | | |
| 4 | 2L | VectorBase | exon | 157348 | 157623 | | - | | | AGAP004677-RA | AGAP004677-RB-E4 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 196140 | Y_unplaced | VectorBase | five_prime_UTR | 47932 | 48111 | | + | | | AGAP029375-RA | | |
| 196141 | Y_unplaced | VectorBase | exon | 47932 | 48138 | | + | | | AGAP029375-RA | AGAP029375-RA-E2 | |
| 196142 | Y_unplaced | VectorBase | CDS | 48112 | 48138 | | + | 0.0 | AGAP029375-PA | AGAP029375-RA | | |
| 196143 | Y_unplaced | VectorBase | exon | 48301 | 48385 | | + | | | AGAP029375-RA | AGAP029375-RA-E3 | |


In [11]:
df_gff[df_gff["Name"].str.startswith('CYP').fillna(False)]



| | contig | source | type | start | end | score | strand | phase | ID | Parent | Name | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16867 | 2L | VectorBase | gene | 18340786 | 18345362 | | - | | AGAP005656 | | CYP305A1 | cytochrome P450 [Source:VB Community Annotation] |
| 16877 | 2L | VectorBase | gene | 18346305 | 18347977 | | - | | AGAP005657 | | CYP305A3 | cytochrome P450 [Source:VB Community Annotation] |
| 16885 | 2L | VectorBase | gene | 18348979 | 18350777 | | + | | AGAP005658 | | CYP15B1 | cytochrome P450 [Source:VB Community Annotation] |
| 16909 | 2L | VectorBase | gene | 18353521 | 18355707 | | + | | AGAP005660 | | CYP305A4 | cytochrome P450 [Source:VB Community Annotation] |
| 18417 | 2L | VectorBase | gene | 20455252 | 20464457 | | + | | AGAP005774 | | CYP49A1 | cytochrome P450 [Source:VB Community Annotation] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 183449 | X | VectorBase | gene | 5112021 | 5114815 | | - | | AGAP000284 | | CYP315A1 | cytochrome P450 [Source:VB Community Annotation] |
| 191625 | X | VectorBase | gene | 15240572 | 15242864 | | - | | AGAP000818 | | CYP9K1 | cytochrome P450 [Source:VB Community Annotation] |
| 192555 | X | VectorBase | gene | 16618800 | 16621269 | | - | | AGAP000877 | | CYP4G17 | cytochrome P450 [Source:VB Community Annotation] |
| 195075 | X | VectorBase | gene | 20008895 | 20018400 | | + | | AGAP001039 | | CYP307A1 | cytochrome P450 [Source:VB Community Annotation] |
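
To illustrate the "narrow down to a smaller set" option, here is a minimal sketch in plain Python that takes a few gene records copied from the output above, keeps the *Cyp* genes, and turns a short list into region strings. The `SHORTLIST` contents and the helper names `is_cyp` and `to_region` are hypothetical and chosen only for illustration; deciding which genes belong on such a list is the open question discussed earlier.

```python
# A few gene records copied from the genome features output above.
genes = [
    {"contig": "2L", "start": 157348, "end": 186936, "ID": "AGAP004677", "Name": None},
    {"contig": "2L", "start": 18340786, "end": 18345362, "ID": "AGAP005656", "Name": "CYP305A1"},
    {"contig": "X", "start": 5112021, "end": 5114815, "ID": "AGAP000284", "Name": "CYP315A1"},
    {"contig": "X", "start": 15240572, "end": 15242864, "ID": "AGAP000818", "Name": "CYP9K1"},
]

# Hypothetical short list of genes to prioritise.
SHORTLIST = {"CYP9K1"}


def is_cyp(gene: dict) -> bool:
    """Return True if the gene has a name starting with 'CYP'; unnamed genes are skipped."""
    name = gene["Name"]
    return name is not None and name.startswith("CYP")


def to_region(gene: dict) -> str:
    """Format a gene's coordinates as a 'contig:start-end' region string."""
    return f"{gene['contig']}:{gene['start']:,}-{gene['end']:,}"


cyp_genes = [gene for gene in genes if is_cyp(gene)]
shortlisted = [gene for gene in cyp_genes if gene["Name"] in SHORTLIST]
regions = [to_region(gene) for gene in shortlisted]

print(len(cyp_genes), "Cyp genes found;", len(shortlisted), "on the short list")
print(regions)
```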
