# Preparation of read count stats for simulated data
## Aim of this notebook
This notebook prepares read-count stats for a selection of real-world data from the [GenRe Mekong project](https://www.malariagen.net/resource/29/) for the three AmpSeq primer panels used in SpotMalaria (GRC1, GRC2 and SPEC).  
Data for the SpotMalaria panels is available in [this spreadsheet](https://www.malariagen.net/wp-content/uploads/2023/11/20200705-GenRe-04b-SpotMalaria-SupplementaryFile1.xls).  
The read counts are used as guides for the creation of simulated data.  

## FASTQ data
The FASTQ files used in this notebook are based on the ENA data download manifests created in [this notebook](../real-world_gold-standard_data/Pf8-GenReMekong/pf8genre.ipynb). __Note__ that the FASTQ files are __not__ included in this repo because of their size. To re-run this notebook, follow the explanation provided [here](../real-world_gold-standard_data/Pf8-GenReMekong/pf8genre.ipynb#Example:-how-to-use-the-downloader-tool) to download the FASTQ files from ENA using the downloader tool provided in this repo.

For this analysis, the dataset [Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv](../real-world_gold-standard_data/Pf8-GenReMekong/Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv), created [here](../real-world_gold-standard_data/Pf8-GenReMekong/pf8genre.ipynb#Save-results:-concordant-phenotype-data), was used. It contains an essentially random subset of 907 high-quality samples from the GenRe Mekong project, which is sufficient for obtaining guides for real-world read counts of the AmpSeq panels.  


## Creating the read counts data file
A file "fastq_readcounts.csv" is already prepared in this directory and used by this notebook. 

It is recommended to use that file as provided in this folder. The following instructions are only needed to re-create that file and to document how it was generated. 

Ensure that the following Python packages are installed on your system:
- requests
- pandas

Then, in a terminal, run the following commands. The commands can be found alongside this notebook.   
Both commands can take a long time to run (depending on your system and network), and the first command downloads a large amount of data over the internet. If you are working in an HPC/cloud environment, you may want to split the data into chunks and parallelise these tasks; this is beyond the scope of this recipe.  

__Step 1: Download FASTQ files__    
Run the following command in a terminal:
```bash
python ../real-world_gold-standard_data/Pf8-GenReMekong/ENA_data_helper.py \
    download \
    --data ../real-world_gold-standard_data/Pf8-GenReMekong/ENA_download_manifest_concordant_phenotype_high_quality_samples.csv \
    --skip_errors \
    --download_attempts 10 \
    --out ENA_download_concordant_phenotype_high_quality_sample
```
It will create a folder "ENA_download_concordant_phenotype_high_quality_sample" in this directory. The folder contains the 3628 gzipped FASTQ files belonging to the samples in [Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv](../real-world_gold-standard_data/Pf8-GenReMekong/Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv), created [here](../real-world_gold-standard_data/Pf8-GenReMekong/pf8genre.ipynb#Save-results:-concordant-phenotype-data).  

__Step 2: Obtain read counts__  
Once the FASTQ download has completed, run the script "run_counts.sh", also in this folder, as shown below (change the path to the downloaded FASTQ files if yours is different):

```bash
./run_counts.sh ENA_download_concordant_phenotype_high_quality_sample
```  
This will create a file "fastq_readcounts.csv" with file names and read counts for all mate1 files.
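
The contents of "run_counts.sh" are not reproduced in this notebook. As a rough illustration of what a read count is, the following Python sketch counts reads in gzipped FASTQ files under the assumption that every record spans exactly four lines, and writes a "file,count" table for all mate1 files. The function names `count_fastq_reads` and `write_read_counts` are hypothetical and not part of this repo.

```python
import csv
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def write_read_counts(fastq_dir, out_csv="fastq_readcounts.csv"):
    """Write 'file,count' rows for every mate1 file (*_1.fastq.gz) below fastq_dir."""
    rows = [
        (path.name, count_fastq_reads(path))
        for path in sorted(Path(fastq_dir).rglob("*_1.fastq.gz"))
    ]
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "count"])
        writer.writerows(rows)
    return rows
```

Calling `write_read_counts("ENA_download_concordant_phenotype_high_quality_sample")` would then produce a table with the same two columns as the file used below, provided the four-lines-per-record assumption holds for the downloaded data.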

## Calculate AmpSeq panel stats
Using the dataset [Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv](../real-world_gold-standard_data/Pf8-GenReMekong/Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv), link the FASTQ read counts to AmpSeq panels and calculate stats.

In [47]:
import pandas as pd
import numpy as np
import re
from pathlib import Path

In [18]:
counts_df = pd.read_csv('fastq_readcounts.csv')
counts_df

Unnamed: 0,file,count
0,ERR14388603_1.fastq.gz,32819
1,ERR14388604_1.fastq.gz,29409
2,ERR14388605_1.fastq.gz,1094
3,ERR14388606_1.fastq.gz,38189
4,ERR14388607_1.fastq.gz,35804
...,...,...
3254,ERR15627188_1.fastq.gz,14927105
3255,ERR15627189_1.fastq.gz,12862572
3256,ERR15627190_1.fastq.gz,11781727
3257,ERR15627191_1.fastq.gz,12052828


In [19]:
samples_df = pd.read_csv('../real-world_gold-standard_data/Pf8-GenReMekong/Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv')
samples_df

Unnamed: 0,sample,Artemisinin,Piperaquine,Mefloquine,Chloroquine,Pyrimethamine,Sulfadoxine,DHA-PPQ,AS-MQ,Species,...,ENA_acc_GenRe_GRC2,ENA_acc_GenRe_SPEC,GenRe_GRC1_ENA_FASTQ_FTP_1,GenRe_GRC2_ENA_FASTQ_FTP_1,GenRe_SPEC_ENA_FASTQ_FTP_1,Pf8_ENA_FASTQ_FTP_1,GenRe_GRC1_ENA_FASTQ_FTP_2,GenRe_GRC2_ENA_FASTQ_FTP_2,GenRe_SPEC_ENA_FASTQ_FTP_2,Pf8_ENA_FASTQ_FTP_2
0,RCN12025,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,...,ERR14388604,ERR14388605,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/003/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/004/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/005/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/006/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/003/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/004/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/005/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/006/...
1,RCN12026,Resistant,Resistant,Sensitive,Resistant,Resistant,Resistant,Resistant,Sensitive,Pf,...,ERR14388607,ERR14388608,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/006/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/007/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/008/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/007/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/006/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/007/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/008/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/007/...
2,RCN12028,Resistant,Resistant,Sensitive,Resistant,Resistant,Resistant,Resistant,Sensitive,Pf,...,ERR14388613,ERR14388614,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/012/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/013/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/014/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/009/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/012/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/013/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/014/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/009/...
3,RCN12031,Resistant,Resistant,Sensitive,Resistant,Resistant,Resistant,Resistant,Sensitive,Pf,...,ERR14388622,ERR14388623,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/021/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/022/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/023/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/010/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/021/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/022/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/023/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/010/...
4,RCN12032,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,...,ERR14388625,ERR14388626,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/024/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/025/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/026/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/011/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/024/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/025/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/026/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/011/...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
902,RCN26767,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,...,ERR14397547,ERR14397548,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/046/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/047/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/048/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/052/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/046/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/047/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/048/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/052/...
903,RCN26775,Sensitive,Sensitive,Sensitive,Resistant,Resistant,Sensitive,Sensitive,Sensitive,Pf,...,ERR14397571,ERR14397572,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/070/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/071/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/072/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/053/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/070/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/071/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/072/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/053/...
904,RCN26820,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,...,ERR14397664,ERR14397665,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/063/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/064/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/065/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/064/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/063/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/064/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/065/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/064/...
905,RCN26932,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,...,ERR14397976,ERR14397977,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/075/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/076/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/077/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/098/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/075/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/076/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/077/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/098/...


Assign FASTQ files back to GenRe AmpSeq primer panel or to Pf8 (WGS), then drop the Pf8 rows, which are not relevant here.

In [40]:
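# Label each FASTQ file with the panel whose FTP URL column contains its file name.
# regex=False forces a literal substring match, so the '.' in file names is not a wildcard.
counts_df['panel'] = counts_df.apply(
    lambda row:
    'SPEC' if samples_df.GenRe_SPEC_ENA_FASTQ_FTP_1.str.contains(row.file, regex=False).any() else
    'GRC1' if samples_df.GenRe_GRC1_ENA_FASTQ_FTP_1.str.contains(row.file, regex=False).any() else
    'GRC2' if samples_df.GenRe_GRC2_ENA_FASTQ_FTP_1.str.contains(row.file, regex=False).any() else
    'Pf8' if samples_df.Pf8_ENA_FASTQ_FTP_1.str.contains(row.file, regex=False).any() else None, axis=1)
# Keep only the three AmpSeq panels; Pf8 (WGS) and unmatched files are dropped.
counts_df = counts_df[counts_df['panel'].isin(['SPEC', 'GRC1', 'GRC2'])]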
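
The cell above scans every URL column once per FASTQ file. A possible alternative, sketched here on toy data, builds a file-name-to-panel dictionary once and maps it onto the counts table. It assumes that each FTP URL ends with the FASTQ file name; the URLs and run names below are made up for illustration, and only the column names are taken from the samples table.

```python
import pandas as pd

# Toy stand-ins for samples_df and counts_df (URLs and run names are invented).
toy_samples = pd.DataFrame({
    "GenRe_GRC1_ENA_FASTQ_FTP_1": ["ftp://example.org/fastq/RUN001_1.fastq.gz"],
    "GenRe_GRC2_ENA_FASTQ_FTP_1": ["ftp://example.org/fastq/RUN002_1.fastq.gz"],
    "GenRe_SPEC_ENA_FASTQ_FTP_1": ["ftp://example.org/fastq/RUN003_1.fastq.gz"],
    "Pf8_ENA_FASTQ_FTP_1": ["ftp://example.org/fastq/RUN004_1.fastq.gz"],
})
toy_counts = pd.DataFrame({
    "file": ["RUN001_1.fastq.gz", "RUN002_1.fastq.gz",
             "RUN003_1.fastq.gz", "RUN004_1.fastq.gz"],
    "count": [100, 200, 10, 5000],
})

panel_columns = {
    "GRC1": "GenRe_GRC1_ENA_FASTQ_FTP_1",
    "GRC2": "GenRe_GRC2_ENA_FASTQ_FTP_1",
    "SPEC": "GenRe_SPEC_ENA_FASTQ_FTP_1",
    "Pf8": "Pf8_ENA_FASTQ_FTP_1",
}

# Assumption: the last path component of each URL is the FASTQ file name.
file_to_panel = {}
for panel, column in panel_columns.items():
    for url in toy_samples[column].dropna():
        file_to_panel[url.rsplit("/", 1)[-1]] = panel

# Files without a match get NaN and are removed by the isin() filter.
toy_counts["panel"] = toy_counts["file"].map(file_to_panel)
toy_counts = toy_counts[toy_counts["panel"].isin(["SPEC", "GRC1", "GRC2"])]
print(toy_counts)
```

The dictionary lookup matches on the exact file name rather than on a substring, so it only gives the same result as the cell above if the assumption about the URL layout holds.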

Sanity check: we should have 907 samples for each GenRe primer panel.

In [42]:
counts_df.groupby('panel').size().to_frame('count').reset_index()

Unnamed: 0,panel,count
0,GRC1,907
1,GRC2,907
2,SPEC,907
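
The visual check above could also be automated. The sketch below, shown on toy data, reports every panel whose number of rows differs from the expected value; the helper name `panels_with_unexpected_size` is hypothetical.

```python
import pandas as pd


def panels_with_unexpected_size(df, expected, panels=("GRC1", "GRC2", "SPEC")):
    """Return {panel: size} for each panel whose row count differs from expected."""
    # reindex() ensures that a panel missing from df is reported with size 0
    sizes = df.groupby("panel").size().reindex(list(panels), fill_value=0)
    return sizes[sizes != expected].to_dict()


# Toy data: GRC1 and GRC2 have two rows each, SPEC has only one.
toy = pd.DataFrame({"panel": ["GRC1", "GRC1", "GRC2", "GRC2", "SPEC"]})
unexpected = panels_with_unexpected_size(toy, expected=2)
print(unexpected)
```

On the real data, `panels_with_unexpected_size(counts_df, expected=907)` should return an empty dictionary if the check passes.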


Calculate stats for read counts per primer panel: minimum, maximum, mean and percentiles at 25%, 50% and 75%.  
These will be used to guide the number of reads to simulate for each panel.

In [52]:
def q25(x):
    return x.quantile(0.25)
def q50(x):
    return x.quantile(0.5)
def q75(x):
    return x.quantile(0.75)

counts_df.groupby('panel').agg({'count': ['mean', 'min','max', q25,q50,q75]})

panel,mean,min,max,q25,q50,q75
GRC1,33917.249173,18,96442,25568.5,32424.0,40281.5
GRC2,33500.464168,21,78617,26732.5,32519.0,39864.5
SPEC,1291.057332,0,17924,654.5,988.0,1445.5
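
As a cross-check of the custom quantile helpers, pandas' built-in `describe()` reports the same statistics (labelled `25%`, `50%` and `75%`) without any user-defined functions. The sketch below uses made-up counts and placeholder panel names rather than the real data.

```python
import pandas as pd

# Toy read counts for two placeholder panels (values are invented).
toy_counts = pd.DataFrame({
    "panel": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "count": [10, 20, 30, 40, 1, 2, 3, 4],
})

# describe() returns count, mean, std, min, 25%, 50%, 75% and max per group.
summary = toy_counts.groupby("panel")["count"].describe()
print(summary[["mean", "min", "max", "25%", "50%", "75%"]])
```

Both approaches use linear interpolation between data points by default, so the quartiles should agree with those from `q25`, `q50` and `q75`.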
