# Creating a gold-standard dataset from Pf8 and GenRe Mekong data

The aim of this notebook is to create a gold-standard dataset for malaria genomics pipelines that predict drug resistance.

The following resources will be used:
1. [MalariaGEN Pf8](https://www.malariagen.net/data_package/open-dataset-plasmodium-falciparum-v80/)
2. [GenRe Mekong](https://www.malariagen.net/resource/29/)

## Rationale for selection of resources
Pipelines are usually tailored to a particular sequencing technology, such as whole-genome or amplicon sequencing. While Pf8 is based on (selective) whole-genome sequencing (sWGS) data, GenRe Mekong utilised the SpotMalaria panel for amplicon sequencing (download available at [https://www.malariagen.net/wp-content/uploads/2023/11/20200705-GenRe-04b-SpotMalaria-SupplementaryFile1.xlsx](https://www.malariagen.net/wp-content/uploads/2023/11/20200705-GenRe-04b-SpotMalaria-SupplementaryFile1.xlsx)).
Some GenRe Mekong samples have been re-analysed with sWGS for Pf8, thus providing data for the same sample from two different technologies and two independent resistance phenotype calls.
We will use the intersection of Pf8 and GenRe Mekong as the gold-standard dataset.

In Pf8, drug-resistance phenotypes are inferred based on criteria described in [this PDF](https://pf8-release.cog.sanger.ac.uk/Pf8_resistance_classification.pdf).

## Goal
The notebook creates a CSV file of sample IDs and metadata. The sample IDs can be used to retrieve the raw read data from public databases.  
This can be done manually or by using a tool such as [nf-core fetchngs](https://nf-co.re/fetchngs/1.12.0).


## Set up environment
All packages imported here must be installed locally

In [None]:
import pandas as pd
import re
from warnings import simplefilter 
simplefilter(action="ignore", category=pd.errors.PerformanceWarning)
pd.options.mode.copy_on_write = True

## Load Pf8 data
Obtain sample metadata and drug resistance inference data from Pf8.
Details can be found [here](https://www.malariagen.net/data_package/open-dataset-plasmodium-falciparum-v80/)

In [2]:
pf8_infer_resistance_df=pd.read_csv("https://pf8-release.cog.sanger.ac.uk/Pf8_inferred_resistance_status_classification.tsv", sep="\t")

In [3]:
pf8_samples_df=pd.read_csv("https://pf8-release.cog.sanger.ac.uk/metadata/Pf8_samples.txt", sep="\t")

## Merge Pf8 resistance and sample metadata
Also rename the sample column with lower case for consistency across data

In [4]:
pf8_samples_df.rename(columns={'Sample': 'sample'}, inplace=True)

In [5]:
pf8_df = pd.merge(pf8_infer_resistance_df, pf8_samples_df, on=["sample"], how="outer")

In [6]:
pf8_df

Unnamed: 0,sample,Chloroquine,Pyrimethamine,Sulfadoxine,Mefloquine,Artemisinin,Piperaquine,SP (uncomplicated),SP (IPTp),AS-MQ,...,Admin level 1 longitude,Year,ENA,All samples same case,Population,% callable,QC pass,Exclusion reason,Sample type,Sample was in Pf7
0,FP0008-C,Undetermined,Undetermined,Undetermined,Sensitive,Sensitive,Sensitive,Sensitive,Sensitive,Sensitive,...,-9.832345,2014.0,ERR1081237,FP0008-C,AF-W,82.48,True,Analysis_set,gDNA,True
1,FP0009-C,Resistant,Resistant,Sensitive,Sensitive,Sensitive,Sensitive,Resistant,Sensitive,Sensitive,...,-9.832345,2014.0,ERR1081238,FP0009-C,AF-W,88.95,True,Analysis_set,gDNA,True
2,FP0010-CW,Undetermined,Resistant,Resistant,Sensitive,Sensitive,Sensitive,Resistant,Sensitive,Sensitive,...,-9.832345,2014.0,ERR2889621,FP0010-CW,AF-W,87.01,True,Analysis_set,sWGA,True
3,FP0011-CW,Undetermined,Resistant,Undetermined,Sensitive,Sensitive,Sensitive,Resistant,Sensitive,Sensitive,...,-9.832345,2014.0,ERR2889624,FP0011-CW,AF-W,86.95,True,Analysis_set,sWGA,True
4,FP0012-CW,Resistant,Resistant,Sensitive,Sensitive,Sensitive,Sensitive,Resistant,Sensitive,Sensitive,...,-9.832345,2014.0,ERR2889627,FP0012-CW,AF-W,89.86,True,Analysis_set,sWGA,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33320,SPT92268,,,,,,,,,,...,3.618846,2000.0,ERR10940681,SPT92268,AF-W,0.10,False,Low_coverage,sWGA,False
33321,SPT92269,,,,,,,,,,...,3.618846,2000.0,ERR11009733,SPT92269,AF-W,0.12,False,Low_coverage,sWGA,False
33322,SPT92270,,,,,,,,,,...,3.618846,2000.0,ERR11009737,SPT92270,AF-W,0.01,False,Low_coverage,sWGA,False
33323,SPT94772,Resistant,Resistant,Resistant,Undetermined,Undetermined,Undetermined,Resistant,Sensitive,Undetermined,...,-16.401559,2017.0,ERR10789456,SPT94772,AF-W,64.96,True,Analysis_set,sWGA,False


## Load GenRe Mekong data
Using this file from the GenRe Mekong website, release v1.4 14/05/24 (latest release):  

https://assets.super.so/abd63e6d-92df-4a9f-b19c-3a65d59b8a34/files/4accef63-0436-4893-a00a-6ea0e8a55249/Pf-GenRe-GRCv1.4-Full-PublicRelease-20240514.xlsx  
Tab: GRC

Sample IDs for samples that were re-analysed in Pf8 are the same in both projects, i.e. it is possible to merge GenRe Mekong and Pf8 data on sample ID.  

For more information on the data in this spreadsheet, consult the [GenRe Mekong data dictionary](https://github.com/GenRe-Mekong/Documents/blob/e331501b8e93752221ebd7ae72e5eddd3e6e924d/Pf-GenRe-GRCv1.4-DataDictionary.xlsx)

In [7]:
GenRe_df = pd.read_excel(
    "https://assets.super.so/abd63e6d-92df-4a9f-b19c-3a65d59b8a34/files/4accef63-0436-4893-a00a-6ea0e8a55249/Pf-GenRe-GRCv1.4-Full-PublicRelease-20240514.xlsx",
    sheet_name="GRC"
)

# make sample ID column name match Pf8
GenRe_df.rename(columns={'SampleId': 'sample'}, inplace=True)

## Merge GenRe Mekong and Pf8 data
Sample IDs for samples that were analysed in Pf8 are not changed, i.e. we can merge GenRe and Pf8 on sample ID.  
Retain only rows that have a sample ID that is found in GenRe and Pf8.
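
As a minimal sketch of how an inner merge with suffixes behaves, the toy tables below stand in for the GenRe Mekong and Pf8 data. All sample IDs and values are invented for illustration, and the `validate="one_to_one"` check is an optional extra that assumes each sample ID occurs at most once per table.

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are invented for illustration.
genre_toy = pd.DataFrame({
    "sample": ["S1", "S2", "S3"],
    "Artemisinin": ["Resistant", "Sensitive", "Resistant"],
    "Process": ["AmpSeqV1", "Agena", "AmpSeqV2"],
})
pf8_toy = pd.DataFrame({
    "sample": ["S2", "S3", "S4"],
    "Artemisinin": ["Sensitive", "Undetermined", "Sensitive"],
    "QC pass": [True, True, False],
})

merged_toy = pd.merge(
    genre_toy,
    pf8_toy,
    on="sample",
    how="inner",                 # keep only sample IDs present in both tables
    suffixes=("_GRMK", "_pf8"),  # disambiguate columns that exist in both tables
    validate="one_to_one",       # raise if a sample ID is duplicated on either side
)

print(merged_toy.columns.tolist())
print(merged_toy["sample"].tolist())
```

Only columns that exist in both tables receive a suffix; columns unique to one table, such as `Process` or `QC pass`, keep their original names.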

In [8]:
genre_pf8_merged_df = pd.merge(GenRe_df, pf8_df, on=["sample"], suffixes=('_GRMK','_pf8'), how="inner") 

## Remove GenRe Agena data
In the GenRe Mekong project, three techniques were used:
- Agena
- AmpSeqV1
- AmpSeqV2

These are recorded in the "Process" column of the GenRe Mekong data. Only AmpSeq data is relevant here because we are building a dataset for genotyping by sequencing. "Agena" refers to ["Agena Bioscience MassARRAY"](https://link.springer.com/protocol/10.1007/978-1-4939-6442-0_5), a technique based on PCR amplification of SNP regions that does not use sequencing and should thus not be used in this dataset.  

Count samples by process used in GenRe Mekong and "Sample type" in Pf8.  The Pf8 sample type, according to the [Pf8 data release README](https://pf8-release.cog.sanger.ac.uk/Pf8_README.txt), refers to
"Amplification technology used on the sample (MDA, gDNA or sWGA)", where "gDNA" means non-selective WGS, as opposed to selective sWGS. 

In [9]:
genre_pf8_merged_df.groupby(['Process','Sample type']).size().to_frame('count').reset_index()

Unnamed: 0,Process,Sample type,count
0,Agena,sWGA,3690
1,AmpSeqV1,gDNA,64
2,AmpSeqV1,sWGA,558
3,AmpSeqV2,sWGA,2730


-> 3690 samples in GenRe Mekong are based on Agena array, not on sequencing.  
Remove those samples.

In [10]:
genre_pf8_merged_seq_df = genre_pf8_merged_df[ genre_pf8_merged_df['Process'] != "Agena" ]

In [11]:
genre_pf8_merged_seq_df.groupby(['Process','Sample type']).size().to_frame('count').reset_index()

Unnamed: 0,Process,Sample type,count
0,AmpSeqV1,gDNA,64
1,AmpSeqV1,sWGA,558
2,AmpSeqV2,sWGA,2730


## Identify concordant drug-resistance calls and high-quality data
Both projects, GenRe Mekong and Pf8, provide drug-resistance phenotype predictions. The final gold-standard dataset should contain only samples that have the same phenotype predictions in both projects and are of high quality.  


Identify columns that are found in both GenRe and Pf8

In [12]:
list(GenRe_df.columns.intersection( pf8_df.columns).values)

['sample',
 'Study',
 'Year',
 'Country',
 'Artemisinin',
 'Piperaquine',
 'Mefloquine',
 'Chloroquine',
 'Pyrimethamine',
 'Sulfadoxine',
 'DHA-PPQ',
 'AS-MQ']

-> all except 'sample', 'Study', 'Year' and 'Country' are drug-resistance phenotypes.  
Create a list of field names to compare.

In [13]:
phenotypes = [
 'Artemisinin',
 'Piperaquine',
 'Mefloquine',
 'Chloroquine',
 'Pyrimethamine',
 'Sulfadoxine',
 'DHA-PPQ',
 'AS-MQ']

Get counts for concordant vs. discordant phenotype calls in Pf8 and GenRe Mekong

In [14]:
discordant_counts = []
for p in phenotypes:
    discordant_counts.append(
        {
            'drug': p, 
            'discordant_count': len(
                genre_pf8_merged_seq_df[
                    genre_pf8_merged_seq_df[ p + '_GRMK'] != genre_pf8_merged_seq_df[ p + '_pf8']
                ]
            ),
            'concordant_count': len(
                genre_pf8_merged_seq_df[
                    genre_pf8_merged_seq_df[ p + '_GRMK'] == genre_pf8_merged_seq_df[ p + '_pf8']
                ]
            )
    })
pd.DataFrame(discordant_counts)


Unnamed: 0,drug,discordant_count,concordant_count
0,Artemisinin,832,2520
1,Piperaquine,1807,1545
2,Mefloquine,1853,1499
3,Chloroquine,413,2939
4,Pyrimethamine,480,2872
5,Sulfadoxine,478,2874
6,DHA-PPQ,1784,1568
7,AS-MQ,1834,1518


-> the data shows that there are discordant as well as concordant calls for all drug resistance phenotypes, thus it is important to filter for samples that are concordant.
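
When reading counts of this kind, it helps to remember that pandas treats a missing value as unequal to everything, including another missing value. A plain `!=` comparison therefore counts a sample with no call in either project as discordant. The sketch below uses an invented toy table to separate the three cases; the column names follow the suffix scheme used in this notebook, and the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy phenotype calls; np.nan marks a sample without a call in that project.
toy = pd.DataFrame({
    "Artemisinin_GRMK": ["Resistant", "Sensitive", np.nan, "Resistant", np.nan],
    "Artemisinin_pf8": ["Resistant", "Resistant", "Sensitive", np.nan, np.nan],
})

a = toy["Artemisinin_GRMK"]
b = toy["Artemisinin_pf8"]
both_called = a.notna() & b.notna()

summary = {
    "concordant": int((both_called & (a == b)).sum()),
    "discordant": int((both_called & (a != b)).sum()),
    "missing_in_either": int((~both_called).sum()),
    # A plain != also counts every row with a missing call.
    "plain_not_equal": int((a != b).sum()),
}
print(summary)
```

Rows with a missing call are neither concordant nor truly discordant, which is why a filter on concordance also needs an explicit not-null condition.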

Filter for rows that are concordant on all 8 phenotypes

In [15]:
genre_pf8_conc_df = genre_pf8_merged_seq_df
for p in phenotypes:
    cols = [ p + '_GRMK', p + '_pf8' ]
    genre_pf8_conc_df = genre_pf8_conc_df[
        ( genre_pf8_conc_df[cols].notnull().all(axis='columns') ) &
        ( genre_pf8_conc_df[cols[0]] == genre_pf8_conc_df[cols[1]] )
    ]

Quality filter: percent of callable bases >= 85 and 'QC pass' is True

In [16]:
genre_pf8_conc_df = genre_pf8_conc_df[ 
    (genre_pf8_conc_df['% callable'] >= 85 ) &
    (genre_pf8_conc_df['QC pass'] == True )
]

Number of rows after filtering

In [17]:
len(genre_pf8_conc_df)

968

Reduce the dataset to a smaller number of relevant columns.

In [18]:
keep_cols = [
        'sample', 'Artemisinin_GRMK', 'Piperaquine_GRMK',
       'Mefloquine_GRMK', 'Chloroquine_GRMK', 'Pyrimethamine_GRMK',
       'Sulfadoxine_GRMK', 'DHA-PPQ_GRMK', 'AS-MQ_GRMK', 'Species', 
       'Chloroquine_pf8', 'Pyrimethamine_pf8',
       'Sulfadoxine_pf8', 'Mefloquine_pf8', 'Artemisinin_pf8',
       'Piperaquine_pf8', 'AS-MQ_pf8',
       'DHA-PPQ_pf8', 'ENA', '% callable'
]

genre_pf8_conc_s_df = genre_pf8_conc_df[keep_cols]

Combine the phenotype calls into a single column each (they are all concordant due to the filter used)

In [19]:
for p in phenotypes:
    # just to be safe
    if not genre_pf8_conc_s_df[p + '_GRMK'].equals(genre_pf8_conc_s_df[p + '_pf8']):
        raise Exception('phenotypes are not concordant but they should be if filtered correctly')
    genre_pf8_conc_s_df = genre_pf8_conc_s_df.rename(columns={p + '_GRMK': p}, errors="raise")
    genre_pf8_conc_s_df = genre_pf8_conc_s_df.drop([ p + '_pf8' ], axis=1)

Overview of final data

In [20]:
genre_pf8_conc_s_df

Unnamed: 0,sample,Artemisinin,Piperaquine,Mefloquine,Chloroquine,Pyrimethamine,Sulfadoxine,DHA-PPQ,AS-MQ,Species,ENA,% callable
3730,RCN12025,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,ERR3409305,89.14
3731,RCN12026,Resistant,Resistant,Sensitive,Resistant,Resistant,Resistant,Resistant,Sensitive,Pf,ERR3409312,85.83
3733,RCN12028,Resistant,Resistant,Sensitive,Resistant,Resistant,Resistant,Resistant,Sensitive,Pf,ERR3409373,87.23
3734,RCN12031,Resistant,Resistant,Sensitive,Resistant,Resistant,Resistant,Resistant,Sensitive,Pf,ERR3409314,87.04
3735,RCN12032,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,ERR3409311,88.65
...,...,...,...,...,...,...,...,...,...,...,...,...
6975,RCN26767,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,ERR7221085,87.92
6976,RCN26775,Sensitive,Sensitive,Sensitive,Resistant,Resistant,Sensitive,Sensitive,Sensitive,Pf,ERR7221086,87.30
6987,RCN26820,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,ERR7221102,89.35
7021,RCN26932,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,ERR7221141,86.03


## Save results
Save to csv file
```Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv```  

In summary: this table of 968 rows represents samples that have been sequenced by amplicon sequencing for GenRe Mekong and WGS for Pf8. All samples are marked as high-quality in Pf8 and they have concordant phenotype calls, inferred from genotype, in both projects.  
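
As a minimal sketch of the retrieval step, the `ENA` column of such a table can be turned into a plain list of run accessions, one per line, which is the kind of ID list that nf-core fetchngs is understood to accept as input. The toy table, the placeholder accessions, the output file name `ena_accessions.csv`, and the assumption that a cell may hold several comma-separated accessions are all illustrative and not taken from the data.

```python
import pandas as pd

# Toy stand-in for the sample table; accession values are placeholders.
toy = pd.DataFrame({
    "sample": ["S1", "S2", "S3"],
    "ENA": ["ERR0000001", "ERR0000002,ERR0000003", None],
})

accessions = (
    toy["ENA"]
    .dropna()           # skip samples without an accession
    .str.split(",")     # assumption: several accessions may share one cell
    .explode()
    .str.strip()
    .drop_duplicates()
    .tolist()
)

# Write one accession per line, without a header.
with open("ena_accessions.csv", "w") as fh:
    fh.write("\n".join(accessions) + "\n")

print(accessions)
```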

In [21]:
genre_pf8_conc_s_df.to_csv('Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv', index=False)

## Add genotypes
In the above steps, concordance was defined on high-level drug resistance phenotype calls.  
Adding the underlying genotype calls will provide a lower-level truth dataset that can be used to validate genotype calls of a novel pipeline.  
___Note___ that "genotype" calls are provided at amino acid level - not nucleotide level - in both studies. 

### Genes and haplotypes
For the purpose of this gold-standard dataset, the intention is to identify samples that are fully concordant on their haplotype calls for important genes between Pf8 and GenRe Mekong, i.e. two different sequencing techniques and two pipelines have resulted in the same haplotype call.  

This dataset will focus on 8 key drug resistance genes where drug resistance is determined by non-synonymous point mutations:
- dhfr
- crt
- dhps
- kelch13
- mdr1
- mdr2
- fd
- arps10

This list of genes and "core" amino acid positions was compiled according to Table SM1 (Markers associated with drug resistance for P. falciparum) in the [SpotMalaria Platform Technical Notes and Methods](https://ngs.sanger.ac.uk/production/malaria/Resource/29/20200705-GenRe-04a-SpotMalaria-0.39.pdf). For Pf8, details about drug resistance classification can be found in [this document](https://pf8-release.cog.sanger.ac.uk/Pf8_resistance_classification.pdf).  

This results in the following list of drug-resistance haplotypes to be considered for the gold-standard dataset:

| Gene | Amino Acid position |
|---|---|
| dhfr | 51, 59, 108, 164 |
| crt | 72, ~~73~~\*, 74, 75, 76, 326, 356 |
| dhps | 436, 437, 540, 581, 613 |
| kelch13 | 349-726 |
| mdr1 | 86, 184, 1246 |
| mdr2 | 484 |
| fd | 193 |
| arps10 | 127, 128 |

\* Position crt:73 is listed as a core position in the reference table but it is not provided in GenRe Mekong data and it is always WT in Pf8 data, hence this position will be ignored in the following code.  

For more details on gene haplotypes in P. falciparum see the [Pf-HaploAtlas](https://www.malariagen.net/article/introducing-pf-haploatlas-a-new-app-to-track-malaria-parasite-mutations/). 
       

### Load Pf8 genotype calls from data release
The GenRe data loaded earlier does contain genotype data already.

In [22]:
pf8_gt_df = pd.read_csv("https://pf8-release.cog.sanger.ac.uk/Pf8_drug_resistance_marker_genotypes.tsv", sep="\t")

### Merge Pf8 genotypes 
Merge into the existing dataframe  
```genre_pf8_merged_seq_df```  
created above, which is the merged data for Pf8 and GenRe Mekong, with array data removed. 


In [23]:
pf8_gt_df.rename(columns={'Sample': 'sample'}, inplace=True)

genre_pf8_merged_seq_gt_df = pd.merge(
    genre_pf8_merged_seq_df, 
    pf8_gt_df, 
    on=["sample"], 
    suffixes=('_GRMK','_pf8'), 
    how="inner") 

The dataframe now contains Pf8 and GenRe Mekong genotypes for all samples that have been sequenced (excluding Array data) in both projects and can be linked via sample ID.

### Create new dataframe with amino acid genotype calls
Create a list of gene/amino-acid positions from the above table and use it to compile a new data frame with amino acid calls from Pf8 and GenRe Mekong for these core positions. The kelch13 columns need to be built separately because both projects provide kelch13 calls as a single column of mutation lists, and the two list formats differ.


The new columns will have a unified naming scheme:  
"core_mutation_{GENE}_{AAPos}_\[pf8/GRMK\]"

The majority of genotypes can be extracted from the Pf8 and GenRe data by a simple pattern.  
The respective pattern of genotype calls by gene and AA position is as follows:  
__Pf8__: {GENE}\_{AAPos} (example "crt_75\[N\]")  
__GenRe Mekong__: Pf{GENE}:{AAPos} (example: "PfCRT:75")  

Create a set of new column names and respective Pf8 and GenRe column names accordingly.  
There are ___two exceptions___: 
1. kelch13: in both datasets, kelch13 data is provided as a column of lists of AA positions that need to be converted separately
2. arps10: provided as AA pos 127-128 in Pf8 and just pos 127 in GenRe. Needs to be converted separately
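
The naming patterns described above can be sketched with two small helpers. The column list is a hypothetical example in the style of the two datasets, and the helper names `find_pf8_column` and `genre_column` are illustrative only.

```python
import re

# Hypothetical column names in the style described above.
columns = ["sample", "crt_72[C]", "crt_75[N]", "crt_76[K]", "PfCRT:75", "dhfr_51[N]"]


def find_pf8_column(gene, pos, columns):
    """Return the single Pf8-style column '{gene}_{pos}[X]', or raise ValueError."""
    pattern = re.compile(rf"^{re.escape(gene)}_{pos}\[[A-Z]\]$")
    hits = [c for c in columns if pattern.match(c)]
    if len(hits) != 1:
        raise ValueError(f"expected exactly one column for {gene}:{pos}, found {hits}")
    return hits[0]


def genre_column(gene, pos):
    """Build the GenRe Mekong-style column name 'Pf{GENE}:{pos}'."""
    return f"Pf{gene.upper()}:{pos}"


print(find_pf8_column("crt", 75, columns))
print(genre_column("crt", 75))
```

Anchoring the pattern at both ends and including the bracketed reference allele keeps a lookup for position 7 from matching columns for positions 72, 75 or 76.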


In [24]:
genes_pos = {
    'dhfr': [51, 59, 108, 164],
    'crt': [72, 74, 75, 76, 326, 356], # omitting pos 73, see above for explanation
    'dhps': [436, 437, 540, 581, 613],
    'mdr1': [86, 184, 1246],
    'mdr2': [484],
    'fd': [193]
}

gt_cols = {}
for gene, pos_list in genes_pos.items():
    gt_cols[ gene ] = {}
    for pos in pos_list:
        gt_cols[ gene ][ pos ] = {
            'pf8_pattern': gene + '_' + str(pos) + r'\[[A-Z]\]',  # raw string avoids invalid-escape warnings
            'GRMK_name': 'Pf' + gene.upper() + ':' + str(pos),
            'new_col_basename': 'core_mutation_' + gene + '_' + str(pos)
        }

### Add new genotype columns 
Add new genotype columns for the core mutation positions to dataframe   
```genre_pf8_merged_seq_gt_df```

In [25]:
for gene in gt_cols:
    for pos in gt_cols[ gene ]:
        
        # genotype from Pf8
        pf8_pattern = gt_cols[gene][pos]['pf8_pattern']
        new_col_pf8 = gt_cols[gene][pos]['new_col_basename'] + '_pf8'
        pf8_col = [ c for c in genre_pf8_merged_seq_gt_df if re.match(pf8_pattern, c) ]
        if not pf8_col:
            raise ValueError(f'could not find Pf8 column starting with "{pf8_pattern}"')
        elif len(pf8_col) > 1:
            raise ValueError(f'found more than one Pf8 column starting with "{pf8_pattern}":{", ".join(pf8_col)}')
        else:
            genre_pf8_merged_seq_gt_df[new_col_pf8] = genre_pf8_merged_seq_gt_df[pf8_col[0]].copy()
            
        # genotype from GenRe
        genre_pattern = gt_cols[gene][pos]['GRMK_name']
        new_col_genre = gt_cols[gene][pos]['new_col_basename'] + '_GRMK'
        if genre_pattern in genre_pf8_merged_seq_gt_df:
            genre_pf8_merged_seq_gt_df[new_col_genre] = genre_pf8_merged_seq_gt_df[genre_pattern].copy()
        else:
            raise ValueError(f'could not find GenRe column "{genre_pattern}"')
                                                                         

Add columns for arps10: use only position arps10:127, which is provided as a column in GenRe and needs to be extracted from column 'arps10_127-128\[VD\]' in Pf8.
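
A minimal sketch of the extraction on invented toy values: taking the first character of the two-residue string yields position 127. Using the `.str` accessor directly leaves missing calls missing, whereas casting to text first can, depending on the pandas version, turn a missing value into the literal text "nan".

```python
import numpy as np
import pandas as pd

# Toy values in the style of the Pf8 column 'arps10_127-128[VD]'; np.nan is a missing call.
pf8_arps10 = pd.Series(["VD", "LD", np.nan], dtype=object)

# First residue of the two-residue string, i.e. position 127; missing stays missing.
pos_127 = pf8_arps10.str[0]

print(pos_127.tolist())
```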

In [26]:
arps10_basename = 'core_mutation_arps10_127'
genre_pf8_merged_seq_gt_df[ arps10_basename + '_GRMK'] = genre_pf8_merged_seq_gt_df['PfARPS10:127'].copy()
genre_pf8_merged_seq_gt_df[ arps10_basename + '_pf8'] = genre_pf8_merged_seq_gt_df['arps10_127-128[VD]'].astype(str).str[0]

### Add columns for kelch13
Add columns for all kelch13 AA positions for which data exists in at least one sample in both projects.  
__Pf8__:  
    _column_: 'kelch13_349-726_ns_changes'  
    _content_: comma-separated list of haplotypes at Kelch13 positions 349-726. Each haplotype contains one or more non-synonymous variations, separated by "/". Homozygous mutations are shown in upper case and heterozygous in lower case. Details are in the [Pf8 README](https://pf8-release.cog.sanger.ac.uk/Pf8_README.txt) under the heading "Pf8_drug_resistance_marker_genotypes.tsv".  
    
__GenRe__:  
    _column_: 'Pfkelch13'  
    _content_: Comma-separated list of amino acid mutations. ___NOTE___ this differs from Pf8. In GenRe, the comma delimits mutations in the same haplotype, while in Pf8 (see above) a comma delimits haplotypes. There is no distinction between homo- and heterozygous mutations. For details see the [GenRe Mekong data dictionary](https://github.com/GenRe-Mekong/Documents/blob/e331501b8e93752221ebd7ae72e5eddd3e6e924d/Pf-GenRe-GRCv1.4-DataDictionary.xlsx)

Show the unique contents of the respective kelch13 columns in Pf8 and GenRe Mekong: 

In [27]:
genre_pf8_merged_seq_gt_df['kelch13_349-726_ns_changes'].unique()

array(['C580Y', nan, 'c580y,p553l*', '-', 'c580y,v445i/c580y', 'c580y',
       'R561H', 'F446I', 'c580y,r561h/c580y', '!*', 'c580y,a578d/c580y',
       'c580y,p419s/c580y', 'P553L', 'R539T', 'Y493H', 'e668k', 'g625e',
       'g718s', 'e362k', 'A578S', 'e567k', '!', 'd353n', 'd399n', 'r515k',
       'c580y,t573i/c580y', 'p570l', 'c580y,f506y/c580y', 'p615l',
       'G453S', 'd584n', 'd397n', 'c580y,p574l/c580y',
       'e401k/c580y,c580y', 'G357D', 'C580Y/E705K', 'G449D', 'r404k',
       'c580y,a564t/c580y', 'c580y,c580y/q654h', 'c580y,c580y/i723v',
       'c580y,c580y/p615t', 'c580y,y456f/c580y', 't474i',
       'c580y,d512n/c580y', 'r539t', 'r539t,r539t/c580f'], dtype=object)

In [28]:
genre_pf8_merged_seq_gt_df['Pfkelch13'].unique()

array(['C580Y', 'WT,A432S,C580Y', 'WT', 'WT,P553L,C580Y', '-', 'R561H',
       'WT,C469.,C580Y', 'WT,P419S,C580Y', 'P553L', 'WT,C580Y', 'R539T',
       'Y493H', 'WT,G453S', 'WT,G545R', 'WT,D464N', 'WT,M351I', 'A578S',
       'WT,E567K', 'WT,F662Y', 'WT,G548V', 'WT,K658R', 'WT,D353N',
       'WT,D399N', 'WT,R515K', 'WT,E431K', 'WT,P615L', 'WT,G453S,W518.',
       'WT,D397N', 'WT,P574L,C580Y', 'WT,G538D,C580Y', 'WT,G358V,C580Y',
       'WT,C580Y,N599D', 'WT,C580Y,Q654R', 'WT,S549P,C580Y', 'WT,D373G',
       'WT,S485G,C580Y', 'WT,G496D,C580Y', 'WT,C580Y,R622K',
       'WT,I540V,C580Y', 'WT,C580Y,D648G', 'WT,C580Y,F656L',
       'WT,E401K,C580Y', 'WT,D547G', 'WT,F442S,C580Y', 'WT,R471S,C580Y',
       'WT,G357D,C580Y', 'WT,C542R', 'WT,L444S,C580Y', 'WT,C580Y,G595D',
       'WT,I376M,C580Y', 'WT,C580Y,E612.', 'WT,W565.,C580Y',
       'WT,H384R,C580Y', 'WT,G449D,C580Y', 'WT,A557V', 'WT,R404K',
       'WT,C542Y,C580Y', 'WT,C580Y,S649P', 'WT,F434S,C580Y',
       'WT,Y493H,S550P,C580Y', 'WT,K378

### Create individual kelch13 mutation columns
For each individual mutation found in any sample in any project, create a column with a name following this schema:  
```core_mutation_kelch13:{MUTATION}_{pf8/GRMK}```  
MUTATION is of the format ```{WT allele}{Pos}{mutant allele}```   
The column contains boolean values where True indicates that this mutation is present in a given sample. Due to the differences in data provided between the two projects, no resolution at haplotype level is provided and no information about zygosity is recorded. Consequently, the "WT" allele recorded in some samples in GenRe Mekong is ignored. 
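
As a sketch of the normalisation described here, a small helper can reduce either project's kelch13 string to a set of upper-case mutation names, ignoring haplotype grouping, zygosity, "WT" and placeholder values. The helper name `kelch13_mutations` is hypothetical. The trailing "." seen in some GenRe Mekong values is kept as a literal character, and its meaning is not interpreted here.

```python
import re

# A mutation token: reference allele, position, then alternate allele or ".".
MUTATION = re.compile(r"^[A-Z]\d+[A-Z.]$")


def kelch13_mutations(value):
    """Return the set of upper-case mutation names found in a kelch13 string."""
    if not isinstance(value, str):
        return set()  # missing values (NaN) carry no mutations
    cleaned = value.replace("*", "").upper()
    tokens = re.split(r"[,/]", cleaned)  # ignore haplotype grouping
    return {t for t in tokens if MUTATION.match(t)}


print(kelch13_mutations("c580y,v445i/c580y"))
print(kelch13_mutations("WT,C469.,C580Y"))
print(kelch13_mutations("-"))
```

Comparing sets of whole tokens is one way to sidestep partial or pattern-based string matches when checking whether a mutation is present.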

In [29]:
kelch13_mutations = set()

# Pf8
for mut_list in genre_pf8_merged_seq_gt_df['kelch13_349-726_ns_changes'].unique():
    mut_list = str(mut_list)
    if mut_list and re.search(r'[1-9]', mut_list):
        mut_list = mut_list.replace('/',',') # treat all mutations individually, regardless of haplotype grouping
        mut_list = mut_list.replace('*','')
        for mut in mut_list.split(','):
            kelch13_mutations.add(mut.upper())
            
# GenRe Mekong
for mut_list in genre_pf8_merged_seq_gt_df['Pfkelch13'].unique():
    mut_list = str(mut_list)
    if mut_list and re.search(r'[1-9]', mut_list):
        for mut in mut_list.split(','):
            if mut != 'WT':
                kelch13_mutations.add(mut.upper())

# add the new columns
genre_pf8_merged_seq_gt_df = genre_pf8_merged_seq_gt_df.copy()
for mut in kelch13_mutations:
    base_name = 'core_mutation_kelch13:' + mut
    # regex=False: match the mutation name literally (some GenRe names contain ".", a regex wildcard)
    genre_pf8_merged_seq_gt_df[base_name + '_pf8'] = genre_pf8_merged_seq_gt_df['kelch13_349-726_ns_changes'].str.contains(mut, case=False, regex=False)
    genre_pf8_merged_seq_gt_df[base_name + '_GRMK'] = genre_pf8_merged_seq_gt_df['Pfkelch13'].str.contains(mut, case=False, regex=False)
    
    

### Identify concordant samples
Filter for samples that are fully concordant on all core mutation genotype calls, i.e. all columns added with the "core_mutation_" prefix above.  
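
The same check can also be expressed column-wise rather than row by row. The sketch below is an alternative formulation on an invented toy table, not the approach used in this notebook. It aligns the two sets of suffixed columns under their shared base names and compares them with `DataFrame.eq`, which, like a row-wise `!=`, treats a missing value as unequal to everything.

```python
import re

import numpy as np
import pandas as pd

# Toy table with two core mutation positions; values are invented for illustration.
toy = pd.DataFrame({
    "sample": ["S1", "S2", "S3"],
    "core_mutation_dhfr_51_pf8": ["I", "I", np.nan],
    "core_mutation_dhfr_51_GRMK": ["I", "N", np.nan],
    "core_mutation_kelch13:C580Y_pf8": [True, True, False],
    "core_mutation_kelch13:C580Y_GRMK": [True, True, False],
})

suffix = re.compile(r"^(core_mutation_.+)_(pf8|GRMK)$")
basenames = sorted({suffix.sub(r"\1", c) for c in toy.columns if suffix.match(c)})

# Give both halves identical column labels so they can be compared element-wise.
pf8_part = toy[[b + "_pf8" for b in basenames]].set_axis(basenames, axis=1)
grmk_part = toy[[b + "_GRMK" for b in basenames]].set_axis(basenames, axis=1)

toy["fully_concordant_gt"] = pf8_part.eq(grmk_part).all(axis=1)
print(toy[["sample", "fully_concordant_gt"]])
```

Sample S3 is reported as not concordant because both calls at dhfr:51 are missing, which matches the behaviour of a row-wise inequality test.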

In [30]:
def is_concordant(row, *core_mut_col_basenames):
    """
    Return True if all 'core_mutation' columns match between 
    Pf8 and GenRe Mekong
    """
    for col in core_mut_col_basenames:
        if row[col + '_pf8'] != row[col + '_GRMK']:
            return False
    return True

regex = re.compile(r'^(core_mutation_.+)_(pf8|GRMK)$')
core_mut_col_basenames = set(
    [ regex.sub(r'\1', x) for x in genre_pf8_merged_seq_gt_df.columns.values if regex.match(x) ]
)
genre_pf8_merged_seq_gt_df = genre_pf8_merged_seq_gt_df.copy()
genre_pf8_merged_seq_gt_df['fully_concordant_gt'] = genre_pf8_merged_seq_gt_df.apply(is_concordant, axis=1, args=core_mut_col_basenames)


Count fully concordant (on all genotypes reported) vs. not fully concordant samples across the two projects:

In [31]:
genre_pf8_merged_seq_gt_df.groupby( ['fully_concordant_gt']).size().to_frame('count').reset_index()

Unnamed: 0,fully_concordant_gt,count
0,False,1572
1,True,1394


47% of samples are fully concordant on all drug-resistance "core mutation" genotypes between Pf8 (WGS) and GenRe Mekong (AmpSeq).

### Create a filtered dataset containing only fully concordant samples
In addition, apply filters on Pf8 quality, as done above for phenotypes (%callable >=85 and QC pass)

In [32]:
genre_pf8_merged_seq_gt_concord_df = genre_pf8_merged_seq_gt_df[
    (genre_pf8_merged_seq_gt_df['fully_concordant_gt'] == True ) &
    (genre_pf8_merged_seq_gt_df['% callable'] >= 85 ) &
    (genre_pf8_merged_seq_gt_df['QC pass'] == True )
]

In [33]:
len(genre_pf8_merged_seq_gt_concord_df)

976

There are 976 high-quality samples where all reported drug-resistance genotype positions are fully concordant between Pf8 and GenRe Mekong.

### Save file of fully concordant samples
Keep only relevant columns. As all samples are fully concordant, keep only one column per core mutation and drop the pf8/GRMK suffix.
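
A minimal sketch of the suffix-collapsing step on an invented toy table: since both copies of each core mutation column hold the same values after filtering, one copy is renamed to the base name and the other is dropped. The choice to keep the `_pf8` copy is arbitrary.

```python
import re

import pandas as pd

# Toy table of fully concordant samples; values are invented for illustration.
toy = pd.DataFrame({
    "sample": ["S1", "S2"],
    "core_mutation_dhfr_51_pf8": ["I", "N"],
    "core_mutation_dhfr_51_GRMK": ["I", "N"],
    "core_mutation_kelch13:C580Y_pf8": [True, False],
    "core_mutation_kelch13:C580Y_GRMK": [True, False],
})

suffix = re.compile(r"^(core_mutation_.+)_(pf8|GRMK)$")
rename_map = {}
drop_cols = []
for col in toy.columns:
    m = suffix.match(col)
    if m is None:
        continue  # not a core mutation column
    if m.group(2) == "pf8":
        rename_map[col] = m.group(1)  # keep this copy under the base name
    else:
        drop_cols.append(col)         # drop the identical GRMK copy

collapsed = toy.drop(columns=drop_cols).rename(columns=rename_map)
print(collapsed.columns.tolist())
```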

In [34]:
keep_cols = ['sample', 'Source', 'Process', 
    'TimePoint', 'Country_GRMK', 'Species',
    'fully_concordant_gt']

core_mut_cols = [x for x in genre_pf8_merged_seq_gt_concord_df.columns.values if x.startswith('core_mutation')]
keep_cols.extend( core_mut_cols )
genre_pf8_merged_seq_gt_concord_small_df = genre_pf8_merged_seq_gt_concord_df[keep_cols]
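# rename() returns a new dataframe and needs columns= to target column labels
genre_pf8_merged_seq_gt_concord_small_df = genre_pf8_merged_seq_gt_concord_small_df.rename(columns={'Country_GRMK': 'Country'})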

regex = re.compile(r'^(core_mutation_.+)_(pf8|GRMK)$')
for col in core_mut_cols:
    col_basename = regex.sub(r'\1', col)
    genre_pf8_merged_seq_gt_concord_small_df[col_basename] = genre_pf8_merged_seq_gt_concord_small_df[col].copy()
    genre_pf8_merged_seq_gt_concord_small_df.drop(col, axis=1, inplace=True)

In [35]:
genre_pf8_merged_seq_gt_concord_small_df.to_csv('Pf8-GenReMekong_fully_concordant_genotype_high_quality_samples.csv', index=False)

### Extract distinct genotype patterns and representative samples
First, get a count of different patterns.

In [36]:
core_mut_cols = [x for x in genre_pf8_merged_seq_gt_concord_small_df.columns.values if x.startswith('core_mutation')]
genre_pf8_merged_seq_gt_concord_small_df.groupby( core_mut_cols).size().to_frame('count').reset_index()

Unnamed: 0,core_mutation_dhfr_51,core_mutation_dhfr_59,core_mutation_dhfr_108,core_mutation_dhfr_164,core_mutation_crt_72,core_mutation_crt_74,core_mutation_crt_75,core_mutation_crt_76,core_mutation_crt_326,core_mutation_crt_356,...,core_mutation_kelch13:S550P,core_mutation_kelch13:D464N,core_mutation_kelch13:G358V,core_mutation_kelch13:E362K,core_mutation_kelch13:E431K,core_mutation_kelch13:P615L,core_mutation_kelch13:R471S,core_mutation_kelch13:G625E,core_mutation_kelch13:N599D,count
0,I,R,N,I,C,I,D,T,N,I,...,False,False,False,False,False,False,False,False,False,1
1,I,R,N,I,C,I,D,T,N,I,...,False,False,False,False,False,False,False,False,False,10
2,I,R,N,I,C,I,E,T,N,I,...,False,False,False,False,False,True,False,False,False,1
3,I,R,N,I,C,I,E,T,N,I,...,False,False,False,False,False,False,False,False,False,3
4,I,R,N,I,C,I,E,T,S,T,...,False,False,False,False,False,False,False,False,False,22
5,I,R,N,I,C,I,E,T,S,T,...,False,False,False,False,False,False,False,False,False,11
6,I,R,N,I,C,I,E,T,S,T,...,False,False,False,False,False,False,False,False,False,27
7,I,R,N,I,C,I,E,T,S,T,...,False,False,False,False,False,False,False,False,False,2
8,I,R,N,I,C,I,E,T,S,T,...,False,False,False,False,False,False,False,False,False,1
9,I,R,N,I,C,I,E,T,S,T,...,False,False,False,False,False,False,False,False,False,1


There are 40 distinct patterns of genotypes.
Extract a list of sample IDs of representative samples for each pattern.
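
As a sketch of this step on an invented toy table, named aggregation can return both a representative sample and the number of samples per pattern in one pass. Picking the alphabetically smallest sample ID is an arbitrary but reproducible choice, and `dropna=False` is an optional setting that keeps patterns containing missing values as their own groups.

```python
import pandas as pd

# Toy table with two genotype patterns; values are invented for illustration.
toy = pd.DataFrame({
    "sample": ["S3", "S1", "S2", "S4"],
    "core_mutation_dhfr_51": ["I", "I", "N", "N"],
    "core_mutation_kelch13:C580Y": [True, True, False, False],
})
pattern_cols = [c for c in toy.columns if c.startswith("core_mutation")]

patterns = (
    toy.groupby(pattern_cols, dropna=False)
    .agg(
        representative=("sample", "min"),  # smallest sample ID per pattern
        n_samples=("sample", "size"),      # number of samples sharing the pattern
    )
    .reset_index()
)
print(patterns)
```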

In [37]:
genre_pf8_merged_seq_gt_concord_patterns_df = genre_pf8_merged_seq_gt_concord_small_df.groupby( core_mut_cols)['sample'].min().reset_index()

# move the sample column to the front
cols = list(genre_pf8_merged_seq_gt_concord_patterns_df.columns)
cols = [cols[-1]] + cols[:-1]
genre_pf8_merged_seq_gt_concord_patterns_df = genre_pf8_merged_seq_gt_concord_patterns_df[cols]
genre_pf8_merged_seq_gt_concord_patterns_df

Unnamed: 0,sample,core_mutation_dhfr_51,core_mutation_dhfr_59,core_mutation_dhfr_108,core_mutation_dhfr_164,core_mutation_crt_72,core_mutation_crt_74,core_mutation_crt_75,core_mutation_crt_76,core_mutation_crt_326,...,core_mutation_kelch13:D373G,core_mutation_kelch13:S550P,core_mutation_kelch13:D464N,core_mutation_kelch13:G358V,core_mutation_kelch13:E362K,core_mutation_kelch13:E431K,core_mutation_kelch13:P615L,core_mutation_kelch13:R471S,core_mutation_kelch13:G625E,core_mutation_kelch13:N599D
0,RCN13568,I,R,N,I,C,I,D,T,N,...,False,False,False,False,False,False,False,False,False,False
1,RCN13560,I,R,N,I,C,I,D,T,N,...,False,False,False,False,False,False,False,False,False,False
2,RCN15107,I,R,N,I,C,I,E,T,N,...,False,False,False,False,False,False,True,False,False,False
3,RCN14580,I,R,N,I,C,I,E,T,N,...,False,False,False,False,False,False,False,False,False,False
4,RCN13532,I,R,N,I,C,I,E,T,S,...,False,False,False,False,False,False,False,False,False,False
5,RCN13481,I,R,N,I,C,I,E,T,S,...,False,False,False,False,False,False,False,False,False,False
6,RCN01778,I,R,N,I,C,I,E,T,S,...,False,False,False,False,False,False,False,False,False,False
7,RCN13501,I,R,N,I,C,I,E,T,S,...,False,False,False,False,False,False,False,False,False,False
8,RCN01871,I,R,N,I,C,I,E,T,S,...,False,False,False,False,False,False,False,False,False,False
9,RCN12873,I,R,N,I,C,I,E,T,S,...,False,False,False,False,False,False,False,False,False,False


### Save file
Save the file of representative samples for every genotype pattern.

In [38]:
genre_pf8_merged_seq_gt_concord_patterns_df.to_csv('Pf8-GenReMekong_fully_concordant_genotype_patterns_representative_samples.csv', index=False)