# Preparation for creating simulated data
## Aim of this notebook
This notebook contains the preparation required to run read simulations using the read simulation pipeline published in this repository.  
The main aim of this notebook is to provide a working recipe for a read simulation run that is based on real-world data. Real-world whole-genome datasets are readily available from the [Pf8 resource](https://www.malariagen.net/resource/36/). In contrast, amplicon data for a pipeline that was developed for a specific amplicon primer panel may not be available from public resources. In this case, simulating reads may be the only option for pipeline validation. Use this notebook as a guideline on how to generate the input files for a read simulation pipeline run.

## Part 1: read counts to guide simulation
An important consideration for a simulated dataset is the number of reads to create. This should be within the range of expected read numbers from real-world experiments because unrealistically high or low read counts will not provide useful insights into the validity of pipeline results.  

In this example, we are simulating reads for a wild-type and a mutant _P. falciparum_ strain. We will simulate an AmpSeq run using the [SpotMalaria](https://www.malariagen.net/project/spotmalaria/) panel, as used for the [GenRe Mekong project](https://www.malariagen.net/resource/29/).

This primer panel consists of three sub-panels: SPEC (speciation), GRC1 and GRC2.  
The data for these panels is provided in [this spreadsheet](https://www.malariagen.net/wp-content/uploads/2023/11/20200705-GenRe-04b-SpotMalaria-SupplementaryFile1.xls).  

The notebook and data provided here are not intended to be used as-is, but to serve as a template for creating custom simulated datasets, tailored to the pipeline that is to be tested. That said, you are welcome to use the data produced here for your own pipeline, if you wish.  

### FASTQ data
The FASTQ files used in this notebook are based on the ENA data download manifests created in [this notebook](../real-world_gold-standard_data/Pf8-GenReMekong/pf8genre.ipynb). __Note__ that the FASTQ files are __not__ included in this repo because of the large amount of data they contain. If you would like to re-run this notebook, please refer to the explanation provided [here](../real-world_gold-standard_data/Pf8-GenReMekong/pf8genre.ipynb#Example:-how-to-use-the-downloader-tool) to download the FASTQ files from ENA using the downloader tool provided in this repo.

For the purpose of this analysis, the dataset [Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv](../real-world_gold-standard_data/Pf8-GenReMekong/Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv), created [here](../real-world_gold-standard_data/Pf8-GenReMekong/pf8genre.ipynb#Save-results:-concordant-phenotype-data) was used. This contains a subset of 907 samples of all high-quality data from the GenRe Mekong project where drug resistance phenotype agrees with the WGS sample analysis in Pf8.  

### Creating the read counts data file
A file "fastq_readcounts.csv" is already prepared in this directory and used by this notebook. 

It is recommended to use that file as provided in this folder. The following instructions are only needed to re-create that file and to document how it was generated. 

Ensure that the following Python packages are installed on your system:
- requests
- pandas

Then, in a terminal, run the following commands. The commands can be found alongside this notebook.   
Both commands may take a long time to run (depending on your system and network), and the first command will download a large amount of data over the internet. If you are working in an HPC/cloud environment, you may want to split the data into chunks and parallelise these tasks. This is beyond the scope of this recipe.  

__Step 1: Download FASTQ files__    
Run the following command in a terminal:
```bash
python ../real-world_gold-standard_data/Pf8-GenReMekong/ENA_data_helper.py \
    download \
    --data ../real-world_gold-standard_data/Pf8-GenReMekong/ENA_download_manifest_concordant_phenotype_high_quality_samples.csv \
    --skip_errors \
    --download_attempts 10 \
    --out ENA_download_concordant_phenotype_high_quality_sample
```
This creates a folder "ENA_download_concordant_phenotype_high_quality_sample" in this directory. The folder contains the 3628 gzipped FASTQ files belonging to the samples listed in [Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv](../real-world_gold-standard_data/Pf8-GenReMekong/Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv), which was created [here](../real-world_gold-standard_data/Pf8-GenReMekong/pf8genre.ipynb#Save-results:-concordant-phenotype-data).  

__Step 2: Obtain read counts__  
Once the FASTQ download has completed, run the script "run_counts.sh", also in this folder, as shown here. Change the path to the downloaded FASTQ files if yours differs.

```bash
./run_counts.sh ENA_download_concordant_phenotype_high_quality_sample
```  
This will create a file "fastq_readcounts.csv" with file names and read counts for all mate1 files.
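The contents of "run_counts.sh" are not shown in this notebook. As a rough illustration of what counting reads involves, the following minimal Python sketch counts records in gzipped FASTQ files, assuming the standard four lines per read and that mate1 files end in `_1.fastq.gz` directly inside the download folder. The function names `count_fastq_reads` and `count_mate1_reads` are hypothetical and not part of this repository.

```python
import gzip
from pathlib import Path


def count_fastq_reads(fastq_gz):
    """Count reads in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(fastq_gz, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def count_mate1_reads(fastq_dir):
    """Return {file name: read count} for every *_1.fastq.gz file in fastq_dir."""
    return {
        path.name: count_fastq_reads(path)
        for path in sorted(Path(fastq_dir).glob("*_1.fastq.gz"))
    }
```

Calling `count_mate1_reads("ENA_download_concordant_phenotype_high_quality_sample")` would return a dictionary that could be written to a CSV file with `file` and `count` columns.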

### Calculate AmpSeq panel stats
Using the dataset [Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv](../real-world_gold-standard_data/Pf8-GenReMekong/Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv) and the pre-computed (see above) csv file of read counts, create read count stats for the three SpotMalaria amplicon panels.  

In [1]:
import pandas as pd
import numpy as np
import re
from pathlib import Path

In [2]:
counts_df = pd.read_csv('fastq_readcounts.csv')
counts_df

Unnamed: 0,file,count
0,ERR14388603_1.fastq.gz,32819
1,ERR14388604_1.fastq.gz,29409
2,ERR14388605_1.fastq.gz,1094
3,ERR14388606_1.fastq.gz,38189
4,ERR14388607_1.fastq.gz,35804
...,...,...
3623,ERR15628752_1.fastq.gz,18788821
3624,ERR15628753_1.fastq.gz,13952241
3625,ERR15628764_1.fastq.gz,14159323
3626,ERR15628798_1.fastq.gz,7910579


In [3]:
samples_df = pd.read_csv('../real-world_gold-standard_data/Pf8-GenReMekong/Pf8-GenReMekong_concordant_phenotype_high_quality_samples.csv')
samples_df

Unnamed: 0,sample,Artemisinin,Piperaquine,Mefloquine,Chloroquine,Pyrimethamine,Sulfadoxine,DHA-PPQ,AS-MQ,Species,...,ENA_acc_GenRe_GRC2,ENA_acc_GenRe_SPEC,GenRe_GRC1_ENA_FASTQ_FTP_1,GenRe_GRC2_ENA_FASTQ_FTP_1,GenRe_SPEC_ENA_FASTQ_FTP_1,Pf8_ENA_FASTQ_FTP_1,GenRe_GRC1_ENA_FASTQ_FTP_2,GenRe_GRC2_ENA_FASTQ_FTP_2,GenRe_SPEC_ENA_FASTQ_FTP_2,Pf8_ENA_FASTQ_FTP_2
0,RCN12025,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,...,ERR14388604,ERR14388605,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/003/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/004/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/005/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/006/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/003/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/004/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/005/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/006/...
1,RCN12026,Resistant,Resistant,Sensitive,Resistant,Resistant,Resistant,Resistant,Sensitive,Pf,...,ERR14388607,ERR14388608,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/006/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/007/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/008/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/007/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/006/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/007/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/008/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/007/...
2,RCN12028,Resistant,Resistant,Sensitive,Resistant,Resistant,Resistant,Resistant,Sensitive,Pf,...,ERR14388613,ERR14388614,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/012/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/013/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/014/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/009/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/012/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/013/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/014/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/009/...
3,RCN12031,Resistant,Resistant,Sensitive,Resistant,Resistant,Resistant,Resistant,Sensitive,Pf,...,ERR14388622,ERR14388623,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/021/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/022/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/023/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/010/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/021/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/022/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/023/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/010/...
4,RCN12032,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,...,ERR14388625,ERR14388626,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/024/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/025/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/026/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/011/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/024/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/025/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/026/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/011/...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
902,RCN26767,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,...,ERR14397547,ERR14397548,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/046/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/047/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/048/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/052/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/046/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/047/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/048/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/052/...
903,RCN26775,Sensitive,Sensitive,Sensitive,Resistant,Resistant,Sensitive,Sensitive,Sensitive,Pf,...,ERR14397571,ERR14397572,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/070/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/071/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/072/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/053/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/070/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/071/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/072/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/053/...
904,RCN26820,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,...,ERR14397664,ERR14397665,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/063/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/064/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/065/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/064/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/063/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/064/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/065/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/064/...
905,RCN26932,Resistant,Sensitive,Sensitive,Resistant,Resistant,Resistant,Sensitive,Sensitive,Pf,...,ERR14397976,ERR14397977,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/075/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/076/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/077/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/098/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/075/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/076/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR143/077/...,ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/098/...


Read counts in ```counts_df``` are assigned to file names that do not contain the name of the primer panel to which the file belongs. Use the ```samples_df``` data to assign the panel name (SPEC, GRC1 or GRC2) to each read count by looking up the file name.  
Drop Pf8 (WGS) data, which is not relevant here.

In [None]:
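# Look up each file name in the mate-1 FTP URL columns to find its panel.
# regex=False matches the name literally (otherwise "." would match any character).
counts_df['panel'] = counts_df.apply(
    lambda row:
    'SPEC' if samples_df.GenRe_SPEC_ENA_FASTQ_FTP_1.str.contains(row.file, regex=False).any() else
    'GRC1' if samples_df.GenRe_GRC1_ENA_FASTQ_FTP_1.str.contains(row.file, regex=False).any() else
    'GRC2' if samples_df.GenRe_GRC2_ENA_FASTQ_FTP_1.str.contains(row.file, regex=False).any() else
    'Pf8' if samples_df.Pf8_ENA_FASTQ_FTP_1.str.contains(row.file, regex=False).any() else None,
    axis=1)
# Keep only the three GenRe amplicon panels; this drops the Pf8 (WGS) files.
counts_df = counts_df[counts_df['panel'].isin(['SPEC', 'GRC1', 'GRC2'])]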
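The row-wise lookup above scans every URL column once per file. A possible alternative, sketched here under the assumption that each FTP URL ends in the FASTQ file name, builds a file-name-to-panel dictionary once and maps it onto the counts. The helper `build_file_to_panel` and the toy URLs are hypothetical; only the column names come from `samples_df`.

```python
import pandas as pd


def build_file_to_panel(samples_df, url_columns):
    """Map FASTQ file name -> panel label using the last path component of each URL."""
    mapping = {}
    for panel, column in url_columns.items():
        names = samples_df[column].dropna().str.rsplit("/", n=1).str[-1]
        mapping.update({name: panel for name in names})
    return mapping


# Toy data with made-up URLs; the real columns come from samples_df.
toy_samples = pd.DataFrame({
    "GenRe_SPEC_ENA_FASTQ_FTP_1": ["ftp://host/a/S1_1.fastq.gz"],
    "GenRe_GRC1_ENA_FASTQ_FTP_1": ["ftp://host/a/G1_1.fastq.gz"],
})
toy_counts = pd.DataFrame({
    "file": ["S1_1.fastq.gz", "G1_1.fastq.gz", "W1_1.fastq.gz"],
    "count": [10, 20, 30],
})
file_to_panel = build_file_to_panel(
    toy_samples,
    {"SPEC": "GenRe_SPEC_ENA_FASTQ_FTP_1", "GRC1": "GenRe_GRC1_ENA_FASTQ_FTP_1"},
)
# Files that appear in none of the URL columns get a missing panel value.
toy_counts["panel"] = toy_counts["file"].map(file_to_panel)
```

Rows with a missing panel can then be dropped with `dropna(subset=["panel"])`.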

Sanity check: we should have 907 files (one mate1 file per sample) for each GenRe primer panel.

In [None]:
counts_df.groupby('panel').size().to_frame('count').reset_index()
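If the notebook is re-run unattended, the visual check above could be turned into an explicit one. The following is a minimal sketch; the helper name `check_panel_sizes` is hypothetical, and the toy data stands in for `counts_df`.

```python
import pandas as pd


def check_panel_sizes(counts_df, expected_per_panel):
    """Raise ValueError if any panel does not have the expected number of files."""
    sizes = counts_df.groupby("panel").size()
    unexpected = sizes[sizes != expected_per_panel]
    if not unexpected.empty:
        raise ValueError(f"Unexpected panel sizes: {unexpected.to_dict()}")
    return sizes


# Toy example with two files per panel.
toy_counts = pd.DataFrame({
    "panel": ["SPEC", "SPEC", "GRC1", "GRC1"],
    "count": [10, 20, 30, 40],
})
sizes = check_panel_sizes(toy_counts, expected_per_panel=2)
```

For the dataset used here, the call would be `check_panel_sizes(counts_df, 907)`.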

Calculate stats for read counts per primer panel: mean, minimum, maximum and the 25th, 50th and 75th percentiles.  
These will be used to guide the number of reads to simulate for each panel.

In [None]:
def q25(x):
    return x.quantile(0.25)
def q50(x):
    return x.quantile(0.5)
def q75(x):
    return x.quantile(0.75)

counts_df.groupby('panel').agg({'count': ['mean', 'min','max', q25,q50,q75]})

### Conclusion
The read count stats shown above can be used as guidelines for simulation runs. Specifically, the 25th, 50th and 75th percentiles are useful targets for three simulated datasets per panel: one with low, one with typical and one with high read counts.  
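One way to turn these stats into concrete per-panel targets is sketched here. It assumes a table with `panel` and `count` columns like `counts_df`; the helper name `read_count_targets` and the toy numbers are hypothetical and not taken from the real dataset.

```python
import pandas as pd


def read_count_targets(counts_df, quantiles=(0.25, 0.5, 0.75)):
    """Return {panel: {quantile: rounded read count}} from a panel/count table."""
    stats = counts_df.groupby("panel")["count"].quantile(list(quantiles)).unstack()
    return {
        panel: {q: int(round(value)) for q, value in row.items()}
        for panel, row in stats.iterrows()
    }


# Toy read counts for illustration only.
toy_counts = pd.DataFrame({
    "panel": ["GRC1"] * 5 + ["SPEC"] * 5,
    "count": [100, 200, 300, 400, 500, 10, 20, 30, 40, 50],
})
targets = read_count_targets(toy_counts)
```

Each inner dictionary then holds one read count per simulated dataset for that panel.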

## Part 2: Configuring the read simulation pipeline
The [read simulation pipeline](https://github.com/ISOinaBox/iib_readsim) requires a set of input files to configure the simulation run. This part is about creating these files.

### Create a bed-like file of primer positions
A file of primer positions in a bed-like format is required to configure the start/end positions of the amplicons to simulate.  
As we are simulating a SpotMalaria AmpSeq run, we are using primer positions for the GenRe-04b-SpotMalaria AmpSeq panel, available from www.malariagen.net in the form of an Excel file.

In [None]:
panel_df = pd.read_excel(
    "https://www.malariagen.net/wp-content/uploads/2023/11/20200705-GenRe-04b-SpotMalaria-SupplementaryFile1.xlsx",
    sheet_name="P. falciparum amplicon primers"
)

Change column names in line with expected headers for bed-like config file for the simulation pipeline.  
Add a column DESC, expected by the simulation pipeline. Use it to record the primer panel ('GRC1','GRC2' or 'SPEC') and drop rows (if any) that do not have a primer panel assigned.  
Only keep the columns relevant to the simulation pipeline.

In [None]:
panel_df.rename(
    columns={
        'Chromosome':'#CHROM', 
        'Amplicon Start Position':'START',
        'Amplicon Stop Position':'END'
    },inplace=True)

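# Derive the panel label from the 'Multiplex' column.
# str() guards against missing (NaN) values, which are not iterable.
panel_df['DESC'] = panel_df.apply(
    lambda row:
    'SPEC' if 'SPEC' in str(row['Multiplex']) else
    'GRC1' if 'GRC1' in str(row['Multiplex']) else
    'GRC2' if 'GRC2' in str(row['Multiplex']) else
    np.nan,  # np.NaN was removed in NumPy 2.0
    axis=1
)
# Drop rows without a panel and keep only the columns the pipeline needs.
panel_df = panel_df[panel_df['DESC'].notna()]
keep_cols = ['#CHROM', 'START', 'END', 'DESC']
panel_df = panel_df[keep_cols].copy()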

Coordinates are 1-based in this file but need to be converted to 0-based for the simulation pipeline.

In [None]:
panel_df['START']=pd.to_numeric(panel_df['START'],errors='coerce')-1

In [None]:
panel_df['END']=pd.to_numeric(panel_df['END'],errors='coerce')-1
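As a toy illustration of the conversion in the two cells above (the values are made up), non-numeric entries become missing values and every position is shifted down by one. Note that the cells subtract 1 from both START and END, whereas standard BED files use a 0-based start with an exclusive end, which would leave END unchanged. Which convention the simulation pipeline expects is not stated here, so this sketch only mirrors the notebook's own cells.

```python
import pandas as pd

# Toy 1-based amplicon coordinates (hypothetical values).
toy_panel = pd.DataFrame({
    "START": ["100", "250", "n/a"],
    "END": ["200", "400", "500"],
})

for column in ("START", "END"):
    # Non-numeric entries become NaN, then every position is shifted by one.
    toy_panel[column] = pd.to_numeric(toy_panel[column], errors="coerce") - 1
```

The third toy row ends up with a missing START, so a filter on missing positions would remove it.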

Keep only rows where both the start and the end position are provided.

In [None]:
panel_df = panel_df[(panel_df['START'].notna()) & (panel_df['END'].notna())]

Convert positions to integers.

In [None]:
panel_df['START']=panel_df['START'].astype(np.int64)

In [None]:
panel_df['END']=panel_df['END'].astype(np.int64)

In [None]:
panel_df

Create versions of the panel file for each of the three sub-panels and save files.

In [None]:
spec_panel_df = panel_df[panel_df['DESC']=='SPEC']
grc1_panel_df = panel_df[panel_df['DESC']=='GRC1']
grc2_panel_df = panel_df[panel_df['DESC']=='GRC2']

In [None]:
spec_panel_df.to_csv('SpotMalaria-SPEC_Pf_amplicon_primers.bed', index=False)
grc1_panel_df.to_csv('SpotMalaria-GRC1_Pf_amplicon_primers.bed', index=False)
grc2_panel_df.to_csv('SpotMalaria-GRC2_Pf_amplicon_primers.bed', index=False)
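Before passing the files to the simulation pipeline, it may be worth reading one back and running a few basic checks. The sketch below assumes the comma-separated layout written by `to_csv` above; the helper name `check_bed_like` and the toy row are hypothetical.

```python
import io

import pandas as pd


def check_bed_like(handle, expected_panel):
    """Read a comma-separated bed-like file and run basic consistency checks."""
    bed_df = pd.read_csv(handle)
    assert list(bed_df.columns) == ["#CHROM", "START", "END", "DESC"]
    assert (bed_df["START"] < bed_df["END"]).all()
    assert (bed_df["DESC"] == expected_panel).all()
    return bed_df


# Toy file content standing in for one of the saved .bed files.
toy_bed = io.StringIO("#CHROM,START,END,DESC\nPf3D7_01_v3,99,199,GRC1\n")
checked = check_bed_like(toy_bed, "GRC1")
```

For the real output, the call would be, for example, `check_bed_like('SpotMalaria-GRC1_Pf_amplicon_primers.bed', 'GRC1')`.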