***Training course in data analysis for genomic surveillance of African malaria vectors - Workshop 1***

---

# Module 2 - Accessing sample metadata

**Subject: Data**

This first data module provides an introduction to accessing data relating to whole-genome sequencing of *Anopheles gambiae* mosquitoes provided by the MalariaGEN Vector Observatory.

## Learning outcomes

After completing this module, you will be able to:

* Explain what types of data are provided by the MalariaGEN Vector Observatory and where the data are stored
* Use the `malariagen_data` Python package to access MalariaGEN data in Google Cloud
* Explore the MalariaGEN `Ag3` data resource and summarise the mosquito samples for which genomic data are available


## What is the MalariaGEN Vector Observatory?

@@TODO

## What data are available?

@@TODO

* Sample metadata
* Genome variation data
  * Single nucleotide polymorphism (SNP) calls
  * Copy number variant (CNV) calls
  * Phased haplotypes

## Where are the data stored?

@@TODO


## Accessing the `Ag3` data resource

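The `malariagen_data` Python package provides access to the `Ag3` data resource. Install the package, import it, then create an `Ag3` object pointing at the Google Cloud Storage location of the data.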

In [1]:
!pip install -q malariagen_data

In [2]:
import malariagen_data

In [3]:
ag3 = malariagen_data.Ag3("gs://vo_agam_release")
ag3

<malariagen_data.ag3.Ag3 at 0x7f5e08aa0ad0>

## Loading sample metadata

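Call the `sample_metadata()` method to load the sample metadata into a pandas DataFrame with one row per sample.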

In [4]:
df_samples = ag3.sample_metadata()
df_samples

|  | sample_id | partner_sample_id | contributor | country | location | year | month | latitude | longitude | sex_call | sample_set | release | aim_fraction_colu | aim_fraction_arab | species_gambcolu_arabiensis | species_gambiae_coluzzii | species |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AR0047-C | LUA047 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | AG1000G-AO | v3 | 0.945 | 0.001 | gamb_colu | coluzzii | coluzzii |
| 1 | AR0049-C | LUA049 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | AG1000G-AO | v3 | 0.933 | 0.001 | gamb_colu | coluzzii | coluzzii |
| 2 | AR0051-C | LUA051 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | AG1000G-AO | v3 | 0.937 | 0.002 | gamb_colu | coluzzii | coluzzii |
| 3 | AR0061-C | LUA061 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | AG1000G-AO | v3 | 0.938 | 0.002 | gamb_colu | coluzzii | coluzzii |
| 4 | AR0078-C | LUA078 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | AG1000G-AO | v3 | 0.926 | 0.001 | gamb_colu | coluzzii | coluzzii |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2779 | AC0295-C | K92 | Martin Donnelly | Uganda | Kihihi | 2012 | 11 | -0.751 | 29.701 | F | AG1000G-UG | v3 | 0.026 | 0.002 | gamb_colu | gambiae | gambiae |
| 2780 | AC0296-C | K93 | Martin Donnelly | Uganda | Kihihi | 2012 | 11 | -0.751 | 29.701 | F | AG1000G-UG | v3 | 0.029 | 0.003 | gamb_colu | gambiae | gambiae |
| 2781 | AC0297-C | K94 | Martin Donnelly | Uganda | Kihihi | 2012 | 11 | -0.751 | 29.701 | F | AG1000G-UG | v3 | 0.026 | 0.002 | gamb_colu | gambiae | gambiae |
| 2782 | AC0298-C | K95 | Martin Donnelly | Uganda | Kihihi | 2012 | 11 | -0.751 | 29.701 | F | AG1000G-UG | v3 | 0.029 | 0.002 | gamb_colu | gambiae | gambiae |
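
Once loaded, the sample metadata can be explored with standard pandas methods. The sketch below is only an illustration: it uses a small hand-built stand-in DataFrame (`df_demo`, a hypothetical name) containing a few rows and columns copied from the output above, so that it runs without access to the cloud data. The same calls should apply to the full `df_samples` DataFrame.

```python
import pandas as pd

# Stand-in for the output of ag3.sample_metadata(): five rows copied from
# the table above, keeping only a subset of the columns.
df_demo = pd.DataFrame(
    {
        "sample_id": ["AR0047-C", "AR0049-C", "AC0295-C", "AC0296-C", "AC0297-C"],
        "country": ["Angola", "Angola", "Uganda", "Uganda", "Uganda"],
        "location": ["Luanda", "Luanda", "Kihihi", "Kihihi", "Kihihi"],
        "year": [2009, 2009, 2012, 2012, 2012],
        "species": ["coluzzii", "coluzzii", "gambiae", "gambiae", "gambiae"],
    }
)

# number of rows (samples) and columns
n_samples, n_columns = df_demo.shape

# number of samples per country
samples_per_country = df_demo["country"].value_counts()

# number of samples per species
samples_per_species = df_demo.groupby("species").size()

print(n_samples, n_columns)
print(samples_per_country)
print(samples_per_species)
```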


## Querying sample metadata

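Use the pandas `query()` method to select the samples matching a condition, for example all samples collected in Burkina Faso.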

In [19]:
df_samples.query("country == 'Burkina Faso'")

|  | sample_id | partner_sample_id | contributor | country | location | year | month | latitude | longitude | sex_call | sample_set | release | aim_fraction_colu | aim_fraction_arab | species_gambcolu_arabiensis | species_gambiae_coluzzii | species |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 81 | AB0085-Cx | BF2-4 | Austin Burt | Burkina Faso | Pala | 2012 | 7 | 11.150 | -4.235 | F | AG1000G-BF-A | v3 | 0.024 | 0.002 | gamb_colu | gambiae | gambiae |
| 82 | AB0086-Cx | BF2-6 | Austin Burt | Burkina Faso | Pala | 2012 | 7 | 11.150 | -4.235 | F | AG1000G-BF-A | v3 | 0.038 | 0.002 | gamb_colu | gambiae | gambiae |
| 83 | AB0087-C | BF3-3 | Austin Burt | Burkina Faso | Bana | 2012 | 7 | 11.233 | -4.472 | F | AG1000G-BF-A | v3 | 0.982 | 0.002 | gamb_colu | coluzzii | coluzzii |
| 84 | AB0088-C | BF3-5 | Austin Burt | Burkina Faso | Bana | 2012 | 7 | 11.233 | -4.472 | F | AG1000G-BF-A | v3 | 0.990 | 0.002 | gamb_colu | coluzzii | coluzzii |
| 85 | AB0089-Cx | BF3-8 | Austin Burt | Burkina Faso | Bana | 2012 | 7 | 11.233 | -4.472 | F | AG1000G-BF-A | v3 | 0.975 | 0.002 | gamb_colu | coluzzii | coluzzii |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 372 | AB0314-C | 6775 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 8 | 12.060 | -1.170 | F | AG1000G-BF-C | v3 | 0.028 | 0.002 | gamb_colu | gambiae | gambiae |
| 373 | AB0315-C | 6777 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 8 | 12.060 | -1.170 | F | AG1000G-BF-C | v3 | 0.021 | 0.002 | gamb_colu | gambiae | gambiae |
| 374 | AB0316-C | 6779 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 8 | 12.060 | -1.170 | F | AG1000G-BF-C | v3 | 0.031 | 0.002 | gamb_colu | gambiae | gambiae |
| 375 | AB0318-C | 5072 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 7 | 12.060 | -1.170 | F | AG1000G-BF-C | v3 | 0.032 | 0.002 | gamb_colu | gambiae | gambiae |
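
Query expressions can also combine several conditions with `and`. The following is a minimal sketch, assuming a small stand-in DataFrame (`df_demo`, a hypothetical name) built from a few rows copied from the outputs shown in this module. With the real data, the same expressions would be applied to `df_samples`.

```python
import pandas as pd

# Stand-in for df_samples: four rows copied from the outputs in this module,
# keeping only a subset of the columns.
df_demo = pd.DataFrame(
    {
        "sample_id": ["AB0085-Cx", "AB0087-C", "AB0314-C", "AR0047-C"],
        "country": ["Burkina Faso", "Burkina Faso", "Burkina Faso", "Angola"],
        "location": ["Pala", "Bana", "Monomtenga", "Luanda"],
        "year": [2012, 2012, 2004, 2009],
        "species": ["gambiae", "coluzzii", "gambiae", "coluzzii"],
    }
)

# samples collected in Burkina Faso in 2012
df_bf_2012 = df_demo.query("country == 'Burkina Faso' and year == 2012")

# gambiae samples collected in Burkina Faso, in any year
df_bf_gambiae = df_demo.query("country == 'Burkina Faso' and species == 'gambiae'")

print(df_bf_2012)
print(df_bf_gambiae)
```

Note that string values inside the query expression are quoted with single quotes, because the expression itself is wrapped in double quotes.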


## Summarising sample metadata with pivot tables

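A pivot table counts the samples in each combination of values, for example the number of samples of each species by country and year. Combining `query()` with `pivot_table()` restricts the summary to a subset of samples, such as a single country broken down by sampling location.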

In [14]:
pivot_country_year_species = (
    df_samples
    .pivot_table(
        index=["country", "year"], 
        columns=["species"], 
        values="sample_id",
        aggfunc="count",
        fill_value=0
    )
)
pivot_country_year_species

| country | year | arabiensis | coluzzii | gambiae | intermediate_arabiensis_gambiae | intermediate_gambiae_coluzzii |
|---|---|---|---|---|---|---|
| Angola | 2009 | 0 | 81 | 0 | 0 | 0 |
| Burkina Faso | 2004 | 0 | 0 | 13 | 0 | 0 |
| Burkina Faso | 2012 | 0 | 82 | 98 | 0 | 1 |
| Burkina Faso | 2014 | 3 | 53 | 46 | 0 | 0 |
| Cameroon | 2005 | 0 | 7 | 90 | 0 | 0 |
| Cameroon | 2009 | 0 | 0 | 303 | 0 | 0 |
| Cameroon | 2013 | 2 | 19 | 23 | 0 | 0 |
| Central African Republic | 1993 | 0 | 5 | 2 | 0 | 0 |
| Central African Republic | 1994 | 0 | 13 | 53 | 0 | 0 |
| Cote d'Ivoire | 2012 | 0 | 80 | 0 | 0 | 0 |


In [17]:
pivot_location_year_species_bf = (
    df_samples
    .query("country == 'Burkina Faso'")
    .pivot_table(
        index=["country", "location", "year"], 
        columns=["species"], 
        values="sample_id",
        aggfunc="count",
        fill_value=0
    )
)
pivot_location_year_species_bf

| country | location | year | arabiensis | coluzzii | gambiae | intermediate_gambiae_coluzzii |
|---|---|---|---|---|---|---|
| Burkina Faso | Bana | 2012 | 0 | 42 | 22 | 1 |
| Burkina Faso | Bana | 2014 | 1 | 47 | 15 | 0 |
| Burkina Faso | Monomtenga | 2004 | 0 | 0 | 13 | 0 |
| Burkina Faso | Pala | 2012 | 0 | 11 | 48 | 0 |
| Burkina Faso | Pala | 2014 | 2 | 0 | 16 | 0 |
| Burkina Faso | Souroukoudinga | 2012 | 0 | 29 | 28 | 0 |
| Burkina Faso | Souroukoudinga | 2014 | 0 | 6 | 15 | 0 |
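
A pivot table is itself a DataFrame, so it can be indexed and summed like any other. The sketch below illustrates this on a tiny made-up dataset: the sample identifiers `S1` to `S6` and the name `df_demo` are hypothetical, and the counts are not real data. It shows how to look up a single cell and how to compute the total number of samples per row.

```python
import pandas as pd

# Made-up stand-in data: six hypothetical samples from two locations.
df_demo = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3", "S4", "S5", "S6"],
        "location": ["Bana", "Bana", "Bana", "Pala", "Pala", "Pala"],
        "year": [2012, 2012, 2014, 2012, 2012, 2014],
        "species": ["coluzzii", "gambiae", "coluzzii", "gambiae", "gambiae", "gambiae"],
    }
)

# count samples of each species by location and year
pivot_demo = df_demo.pivot_table(
    index=["location", "year"],
    columns=["species"],
    values="sample_id",
    aggfunc="count",
    fill_value=0,
)

# look up a single cell: gambiae samples collected in Pala in 2012
n_pala_2012_gambiae = pivot_demo.loc[("Pala", 2012), "gambiae"]

# total number of samples in each row (summing across species)
totals = pivot_demo.sum(axis=1)

print(pivot_demo)
print(totals)
```

Because the row index has several levels, a row is addressed with a tuple such as `("Pala", 2012)`.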


## Plotting maps of sampling locations

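The `ipyleaflet` package displays interactive maps in the notebook. The cells in this section create a map, build a table of sampling locations with the number of samples of each species, and add a marker to the map for each location.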

In [6]:
!pip install -q ipyleaflet

In [7]:
from ipyleaflet import Map, Marker, basemaps, ScaleControl

In [8]:
# create a map
m = Map(
    basemap=basemaps.OpenStreetMap.Mapnik, 
    center=[0, 20], 
    zoom=3,
)

# display the map
m

Map(center=[0, 20], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', 'zoom_out_text…

In [9]:
# create a dataframe of sampling locations and sample sizes
pivot_location_species = (
    df_samples
    .pivot_table(
        index=["country", "location", "latitude", "longitude"], 
        columns=["species"], 
        values="sample_id",
        aggfunc="count",
        fill_value=0,
    )
)

# inspect the dataframe of locations
pivot_location_species

| country | location | latitude | longitude | arabiensis | coluzzii | gambiae | intermediate_arabiensis_gambiae | intermediate_gambiae_coluzzii |
|---|---|---|---|---|---|---|---|---|
| Angola | Luanda | -8.884 | 13.302 | 0 | 81 | 0 | 0 | 0 |
| Burkina Faso | Bana | 11.233 | -4.472 | 1 | 89 | 37 | 0 | 1 |
| Burkina Faso | Monomtenga | 12.060 | -1.170 | 0 | 0 | 13 | 0 | 0 |
| Burkina Faso | Pala | 11.150 | -4.235 | 2 | 11 | 64 | 0 | 0 |
| Burkina Faso | Souroukoudinga | 11.235 | -4.535 | 0 | 35 | 43 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Tanzania | Muheza | -4.940 | 38.948 | 1 | 0 | 36 | 0 | 6 |
| Tanzania | Muleba | -1.962 | 31.651 | 137 | 0 | 32 | 0 | 1 |
| Tanzania | Tarime | -1.431 | 34.199 | 47 | 0 | 0 | 0 | 0 |
| Uganda | Kihihi | -0.751 | 29.701 | 1 | 0 | 95 | 0 | 0 |


In [10]:
# create a map
m = Map(
    basemap=basemaps.Gaode.Satellite, 
    center=[0, 20], 
    zoom=3,
)

# add markers for sampling locations
for row in pivot_location_species.reset_index().itertuples():
    title = (
        f"{row.location}, {row.country} ({row.latitude:.3f}, {row.longitude:.3f})\n"
        f"{row.gambiae} gambiae, {row.coluzzii} coluzzii, {row.arabiensis} arabiensis"
    )
    marker = Marker(
        location=(row.latitude, row.longitude), 
        draggable=False,
        title=title,
    )
    m.add_layer(marker)

# add a scale bar
m.add_control(ScaleControl(position="bottomleft"))

# display the map
m

Map(center=[0, 20], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', 'zoom_out_text…
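
The marker loop above relies on `reset_index()` to turn the index levels (country, location, latitude, longitude) back into ordinary columns, and on `itertuples()` to visit one row at a time with each column available as an attribute. The sketch below isolates that step without drawing a map. It assumes a stand-in table (`pivot_demo`, a hypothetical name) holding four rows copied from the locations table above, keeping only three of the species columns, and also shows one possible way to keep only locations with at least 50 samples before plotting.

```python
import pandas as pd

# Stand-in for pivot_location_species: four rows copied from the table above,
# keeping only three of the species columns.
pivot_demo = pd.DataFrame(
    {
        "country": ["Angola", "Burkina Faso", "Burkina Faso", "Burkina Faso"],
        "location": ["Luanda", "Bana", "Monomtenga", "Pala"],
        "latitude": [-8.884, 11.233, 12.060, 11.150],
        "longitude": [13.302, -4.472, -1.170, -4.235],
        "arabiensis": [0, 1, 0, 2],
        "coluzzii": [81, 89, 0, 11],
        "gambiae": [0, 37, 13, 64],
    }
).set_index(["country", "location", "latitude", "longitude"])

# total number of samples per location, summing across species columns
df_locations = pivot_demo.reset_index()
df_locations["total"] = df_locations[["arabiensis", "coluzzii", "gambiae"]].sum(axis=1)

# keep only locations with at least 50 samples (an arbitrary threshold)
df_large = df_locations.query("total >= 50")

# build one marker title per location, as in the map cell above
titles = []
for row in df_large.itertuples():
    title = (
        f"{row.location}, {row.country} ({row.latitude:.3f}, {row.longitude:.3f})\n"
        f"{row.gambiae} gambiae, {row.coluzzii} coluzzii, {row.arabiensis} arabiensis"
    )
    titles.append(title)

for title in titles:
    print(title)
```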

## Practical exercises

@@TODO 

* Plot a map of sampling locations, changing the `basemap` parameter to show a different background map. Hint: see the [ipyleaflet basemaps documentation](https://ipyleaflet.readthedocs.io/en/latest/api_reference/basemaps.html) for a list of available options.

## Further study

@@TODO