# Filter and save `ivar` variants

I used `ivar` to identify variants relative to the sequence of the DMS library strain [MW473668](https://www.ncbi.nlm.nih.gov/nuccore/MW473668). In this notebook, I filter the variants to get only the coding mutations in the CHIKV E glycoprotein. I annotate these mutations with SRA metadata, sample coverage, and the amino acid in the library strain.

In [72]:
import os
import pandas as pd
from Bio import SeqIO

In [73]:
raw_variants = pd.read_csv("../../results/variants/variants.ivar.csv")
sra_metadata = pd.read_csv("../../data/SRA_Runs.csv")
sitemap = pd.read_csv("../../../data/site_numbering_map.csv")

## Process the variants

Keep only variants with at least 100 reads at the site, a minimum frequency of 5%, a p-value of at most 0.05, and an annotation in the 'structural polyprotein', which contains the CHIKV E protein.

In [74]:
# Filter the variants
min_depth = 100
min_freq = 0.05
pvalue = 0.05
feature = "structural polyprotein:"
filtered_variants = raw_variants[
    (raw_variants['GFF_FEATURE'] == feature) &
    (raw_variants['TOTAL_DP'] >= min_depth) &
    (raw_variants['ALT_FREQ'] >= min_freq) &
    (raw_variants['PVAL'] <= pvalue)
]

Convert the polyprotein numbering so that each position is relative to the start of its mature peptide (C, E3, E2, 6K, or E1).

In [75]:
# Regions in the polyprotein
polyprotein_numbering = {
    "C": [1, 261],
    "E3": [262, 325],
    "E2": [326, 748],
    "6K": [749, 809],
    "E1": [810, 1248],
}
# Map the mature peptide region and position for each variant
def map_peptide_and_position(pos_aa, polyprotein_numbering):
    if pd.isna(pos_aa):
        return None, None
    
    for peptide, (start, end) in polyprotein_numbering.items():
        if start <= pos_aa <= end:
            relative_pos = pos_aa - start + 1  # +1 for 1-based indexing
            return peptide, relative_pos
    
    # If position doesn't fall within any range
    return None, None
# Apply the function to create both columns
filtered_variants = filtered_variants.copy()
filtered_variants[['peptide', 'peptide_position']] = filtered_variants['POS_AA'].apply(
    lambda x: pd.Series(map_peptide_and_position(x, polyprotein_numbering))
)
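As a quick check of the numbering conversion, here is a minimal plain-Python sketch of the same lookup. The helper name `to_region_position` is hypothetical (it is not part of this notebook), and it assumes the same region boundaries as the dictionary above.

```python
# Region boundaries copied from the cell above (1-based, inclusive).
POLYPROTEIN_REGIONS = {
    "C": (1, 261),
    "E3": (262, 325),
    "E2": (326, 748),
    "6K": (749, 809),
    "E1": (810, 1248),
}


def to_region_position(pos_aa, regions=POLYPROTEIN_REGIONS):
    """Hypothetical helper: map a polyprotein position to (region, position)."""
    for region, (start, end) in regions.items():
        if start <= pos_aa <= end:
            # +1 keeps the within-region position 1-based
            return region, pos_aa - start + 1
    # Position falls outside every region
    return None, None


examples = {pos: to_region_position(pos) for pos in (262, 693, 768, 1249)}
for pos, mapped in examples.items():
    print(pos, mapped)
```

Under these boundaries, polyprotein position 693 corresponds to residue 368 of E2 and position 768 to residue 20 of 6K, while a position past the end of E1 maps to no region.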

In [76]:
# Get the useful columns from the metadata
metadata_columns = ['Run', 'avgLength', 'LibraryStrategy', 'LibrarySelection', 'LibrarySource', 'LibraryLayout', 'Platform', 'Model', 'BioProject', 'ScientificName']
sra_metadata = sra_metadata[metadata_columns]
# Merge the filtered variants with SRA metadata
annotated_variants = pd.merge(
    filtered_variants,
    sra_metadata,
    left_on='Accession',
    right_on='Run',
    how='left'
)
# Add a column with a URL that links to the BioProject
annotated_variants['BioProject_URL'] = annotated_variants['BioProject'].apply(
    lambda x: f"https://www.ncbi.nlm.nih.gov/bioproject/?term={x}" if pd.notna(x) else None
)
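A left join silently keeps variants whose accession has no metadata row. The sketch below shows one way to surface those cases with the `validate` and `indicator` options of `pd.merge`. The tables and accession names here are made-up stand-ins, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the notebook's tables; the accessions are made up.
variants = pd.DataFrame({
    "Accession": ["RUN_A", "RUN_A", "RUN_B"],
    "POS": [9618, 9626, 9644],
})
metadata = pd.DataFrame({
    "Run": ["RUN_A"],
    "BioProject": ["PROJECT_1"],
})

merged = pd.merge(
    variants,
    metadata,
    left_on="Accession",
    right_on="Run",
    how="left",
    validate="many_to_one",  # raise if a Run appears more than once in metadata
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Accessions that have variants but no metadata row
missing = sorted(merged.loc[merged["_merge"] == "left_only", "Accession"].unique())
print(missing)
```

With `validate="many_to_one"`, a duplicated `Run` in the metadata raises an error instead of quietly duplicating variant rows.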

Drop the columns that are no longer needed, rename the rest, and keep only non-synonymous mutations in the subunits of the E protein (E3, E2, 6K, and E1).

In [77]:
# Drop some columns from the dataframe
columns_to_drop = ["REGION", "GFF_FEATURE", "POS_AA", "PASS", "ALT_QUAL", "REF_QUAL", "Run", "REF_RV", "ALT_RV"]
annotated_variants.drop(columns=columns_to_drop, inplace=True)
# Rename columns for clarity
annotated_variants.rename(columns={
    "ALT_FREQ": "frequency",
    "TOTAL_DP": "depth",
    "PVAL": "p_value",
    "peptide": "region",
    "peptide_position": "region_position"
}, inplace=True)
# Convert region_position to integer
annotated_variants['region_position'] = annotated_variants['region_position'].astype('Int64')
# Filter to keep only regions in the region list
regions_to_keep = ["E1", "E2", "6K", "E3"]
annotated_variants = annotated_variants[
    annotated_variants['region'].isin(regions_to_keep)
]
# Only keep non-synonymous variants where REF_AA != ALT_AA
annotated_variants = annotated_variants[
    annotated_variants['REF_AA'] != annotated_variants['ALT_AA']
]
annotated_variants.head()

Unnamed: 0,POS,REF,ALT,REF_DP,ALT_DP,frequency,depth,p_value,REF_CODON,REF_AA,...,avgLength,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,Platform,Model,BioProject,ScientificName,BioProject_URL
7,9618,T,C,0,185,1.0,185.0,5.63859e-131,GTG,V,...,308.0,WGS,RANDOM PCR,VIRAL RNA,PAIRED,ILLUMINA,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...
8,9626,G,T,1,190,0.994764,191.0,1.84862e-132,GTG,V,...,308.0,WGS,RANDOM PCR,VIRAL RNA,PAIRED,ILLUMINA,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...
9,9644,G,A,0,373,0.997326,374.0,1.42144e-260,GTA,V,...,308.0,WGS,RANDOM PCR,VIRAL RNA,PAIRED,ILLUMINA,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...
13,9842,C,A,8,7122,0.997898,7137.0,0.0,CTG,L,...,308.0,WGS,RANDOM PCR,VIRAL RNA,PAIRED,ILLUMINA,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...
14,9845,C,T,1717,5524,0.755884,7308.0,0.0,CAA,Q,...,308.0,WGS,RANDOM PCR,VIRAL RNA,PAIRED,ILLUMINA,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...


Join the numbering from the DMS library to the variant data.

In [78]:
# Make a `reference_site` column by combining region_position and region, e.g. 368(E2)
annotated_variants['reference_site'] = annotated_variants.apply(
    lambda row: f"{row['region_position']}({row['region']})" if pd.notna(row['region_position']) else None,
    axis=1
)
# Join with the site_numbering_map by reference_site
annotated_variants = pd.merge(
    annotated_variants,
    sitemap[['reference_site', 'wildtype', 'sequential_site']],
    on='reference_site',
    how='left'
)
annotated_variants.head()

Unnamed: 0,POS,REF,ALT,REF_DP,ALT_DP,frequency,depth,p_value,REF_CODON,REF_AA,...,LibrarySource,LibraryLayout,Platform,Model,BioProject,ScientificName,BioProject_URL,reference_site,wildtype,sequential_site
0,9618,T,C,0,185,1.0,185.0,5.63859e-131,GTG,V,...,VIRAL RNA,PAIRED,ILLUMINA,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,368(E2),V,433
1,9626,G,T,1,190,0.994764,191.0,1.84862e-132,GTG,V,...,VIRAL RNA,PAIRED,ILLUMINA,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,371(E2),V,436
2,9644,G,A,0,373,0.997326,374.0,1.42144e-260,GTA,V,...,VIRAL RNA,PAIRED,ILLUMINA,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,377(E2),V,442
3,9842,C,A,8,7122,0.997898,7137.0,0.0,CTG,L,...,VIRAL RNA,PAIRED,ILLUMINA,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,20(6K),L,508
4,9845,C,T,1717,5524,0.755884,7308.0,0.0,CAA,Q,...,VIRAL RNA,PAIRED,ILLUMINA,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,21(6K),Q,509


Calculate each sample's mean read depth and coverage breadth from the per-position read depth.

In [79]:
# Load in the read coverage over each position in the genome
accessions_to_keep = set(annotated_variants.Accession.to_list())
depth_df = pd.read_csv("../../results/depth/merged.depth", sep="\t")
# Filter to only the relevant accessions
depth_df = depth_df[depth_df['Accession'].isin(accessions_to_keep)]
# Calculate the coverage breadth (as a fraction of the genome) and mean read depth per sample
genome_length = 11992
coverage_summary_df = depth_df.groupby('Accession')['DP'].agg([
    ('sample_mean_read_depth', 'mean'),
    ('sample_coverage', lambda x: (x > 0).sum()),
    ('sample_100x_coverage', lambda x: (x > 100).sum())
]).reset_index()
coverage_summary_df = coverage_summary_df.assign(
    sample_percent_coverage=lambda x: x['sample_coverage'] / genome_length,
    sample_percent_100X_coverage_breadth=lambda x: x['sample_100x_coverage'] / genome_length
)
coverage_summary_df.drop(columns=['sample_coverage', 'sample_100x_coverage'], inplace=True)
# Join the coverage summary to the annotated variants by Accession
annotated_variants = pd.merge(
    annotated_variants,
    coverage_summary_df,
    on='Accession',
    how='left'
)
annotated_variants.head()

Unnamed: 0,POS,REF,ALT,REF_DP,ALT_DP,frequency,depth,p_value,REF_CODON,REF_AA,...,Model,BioProject,ScientificName,BioProject_URL,reference_site,wildtype,sequential_site,sample_mean_read_depth,sample_percent_coverage,sample_percent_100X_coverage_breadth
0,9618,T,C,0,185,1.0,185.0,5.63859e-131,GTG,V,...,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,368(E2),V,433,1566.448549,0.998332,0.574967
1,9626,G,T,1,190,0.994764,191.0,1.84862e-132,GTG,V,...,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,371(E2),V,436,1566.448549,0.998332,0.574967
2,9644,G,A,0,373,0.997326,374.0,1.42144e-260,GTA,V,...,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,377(E2),V,442,1566.448549,0.998332,0.574967
3,9842,C,A,8,7122,0.997898,7137.0,0.0,CTG,L,...,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,20(6K),L,508,1566.448549,0.998332,0.574967
4,9845,C,T,1717,5524,0.755884,7308.0,0.0,CAA,Q,...,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,21(6K),Q,509,1566.448549,0.998332,0.574967
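To make the coverage columns concrete, here is a small plain-Python sketch of the same arithmetic on made-up depths. The function name `coverage_summary` and the ten-position "genome" are assumptions for illustration only.

```python
def coverage_summary(depths, genome_length, min_depth=100):
    """Return (mean depth, breadth at >0 reads, breadth at >min_depth reads).

    `depths` holds one read depth per position reported for a sample.
    Breadths are fractions of `genome_length`, not percentages.
    """
    mean_depth = sum(depths) / len(depths)
    breadth = sum(d > 0 for d in depths) / genome_length
    breadth_min = sum(d > min_depth for d in depths) / genome_length
    return mean_depth, breadth, breadth_min


# Made-up depths for a 10-position "genome"
toy_depths = [0, 5, 50, 100, 101, 150, 200, 300, 0, 94]
mean_depth, breadth, breadth_100x = coverage_summary(toy_depths, genome_length=10)
print(mean_depth, breadth, breadth_100x)
```

Like the cell above, the sketch uses a strict `>` comparison, so a position with exactly 100 reads does not count toward the 100X breadth. The mean is taken over the positions present in the depth table, so it only matches the genome-wide mean when every position is reported.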


Get the 'fixed' variants. These are variants with a frequency greater than `1 - min_freq`: if the reference and alternative alleles were swapped, the resulting minor allele would have a frequency below our minimum frequency.

In [80]:
fixed_variants = annotated_variants[annotated_variants['frequency'] > (1 - min_freq)].reset_index(drop=True)
fixed_variants.head()

Unnamed: 0,POS,REF,ALT,REF_DP,ALT_DP,frequency,depth,p_value,REF_CODON,REF_AA,...,Model,BioProject,ScientificName,BioProject_URL,reference_site,wildtype,sequential_site,sample_mean_read_depth,sample_percent_coverage,sample_percent_100X_coverage_breadth
0,9618,T,C,0,185,1.0,185.0,5.63859e-131,GTG,V,...,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,368(E2),V,433,1566.448549,0.998332,0.574967
1,9626,G,T,1,190,0.994764,191.0,1.84862e-132,GTG,V,...,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,371(E2),V,436,1566.448549,0.998332,0.574967
2,9644,G,A,0,373,0.997326,374.0,1.42144e-260,GTA,V,...,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,377(E2),V,442,1566.448549,0.998332,0.574967
3,9842,C,A,8,7122,0.997898,7137.0,0.0,CTG,L,...,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,20(6K),L,508,1566.448549,0.998332,0.574967
4,9909,T,G,28,8577,0.995705,8614.0,0.0,TTT,F,...,Illumina MiSeq,PRJNA294670,Chikungunya virus,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,42(6K),F,530,1566.448549,0.998332,0.574967
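The thresholds can be summarized as a three-way split. The sketch below uses a hypothetical helper, `classify_variant`, that mirrors the `1 - min_freq` cutoff used here and the 50% split applied in the next step. The frequencies are taken from rows shown in this notebook.

```python
def classify_variant(frequency, min_freq=0.05):
    """Hypothetical helper mirroring the thresholds used in this notebook."""
    if frequency > 1 - min_freq:
        return "fixed"  # the reference allele is left with less than min_freq
    if frequency > 0.5:
        return "major"  # the reference allele is the minor allele
    return "minor"


labels = {f: classify_variant(f) for f in (0.997898, 0.755884, 0.054054)}
print(labels)
```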


Convert the remaining variants with a frequency > 50% into 'minor alleles' by reversing the reference and alternative columns and subtracting the allele frequency from 1.

In [81]:
# Get the minor variants (anything that didn't end up in the fixed_variants dataframe)
not_fixed_variants = annotated_variants[annotated_variants['frequency'] <= (1 - min_freq)].reset_index(drop=True)

# Split dataframe based on frequency
major_alleles = not_fixed_variants[not_fixed_variants['frequency'] > 0.5].copy()
minor_alleles = not_fixed_variants[not_fixed_variants['frequency'] <= 0.5].copy()

if len(major_alleles) > 0:
    # Convert frequency for major alleles
    major_alleles['frequency'] = 1 - major_alleles['frequency']
    
    # Swap REF and ALT columns
    # First, identify all REF_ and ALT_ columns
    ref_cols = [col for col in major_alleles.columns if col.startswith('REF_')]
    alt_cols = [col for col in major_alleles.columns if col.startswith('ALT_')]
    
    # Create mapping for column swapping
    swap_mapping = {}
    for ref_col in ref_cols:
        alt_col = ref_col.replace('REF_', 'ALT_')
        if alt_col in alt_cols:
            swap_mapping[ref_col] = alt_col
            swap_mapping[alt_col] = ref_col
    
    # Also swap the basic REF and ALT columns
    swap_mapping['REF'] = 'ALT'
    swap_mapping['ALT'] = 'REF'
    
    # Perform the swapping
    major_alleles = major_alleles.rename(columns=swap_mapping)
    
    # Reorder columns to match original dataframe
    major_alleles = major_alleles[not_fixed_variants.columns]

# Combine the dataframes
joined_df = pd.concat([minor_alleles, major_alleles], ignore_index=True)

# Sort by nucleotide position
minor_variants_df = joined_df.sort_values('POS').reset_index(drop=True)
minor_variants_df.head()


Unnamed: 0,POS,REF,ALT,REF_DP,ALT_DP,frequency,depth,p_value,REF_CODON,REF_AA,...,Model,BioProject,ScientificName,BioProject_URL,reference_site,wildtype,sequential_site,sample_mean_read_depth,sample_percent_coverage,sample_percent_100X_coverage_breadth
0,8324,C,A,385,20,0.054054,407.0,2.1728299999999998e-254,CGT,R,...,Illumina HiSeq 4000,PRJNA714555,Homo sapiens,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,1(E3),S,2,3886.443045,0.091728,0.06546
1,8327,C,A,1260,78,0.058252,1339.0,4.2759100000000005e-17,CTT,L,...,Illumina NovaSeq 6000,PRJNA667927,Homo sapiens,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,2(E3),L,3,663.712141,1.0,0.993245
2,8327,C,A,1278,74,0.054734,1352.0,4.22466e-30,CTT,L,...,Illumina NovaSeq 6000,PRJNA667927,Homo sapiens,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,2(E3),L,3,640.163192,1.0,0.993579
3,8327,C,A,1316,86,0.061297,1403.0,2.0459e-35,CTT,L,...,Illumina NovaSeq 6000,PRJNA667927,Homo sapiens,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,2(E3),L,3,690.586558,1.0,0.995247
4,8327,C,A,1374,77,0.052957,1454.0,4.54883e-16,CTT,L,...,Illumina NovaSeq 6000,PRJNA667927,Homo sapiens,https://www.ncbi.nlm.nih.gov/bioproject/?term=...,2(E3),L,3,709.473399,1.0,0.996998
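For a single variant, the swap looks like the plain-Python sketch below. The function name `flip_to_minor` is hypothetical, and the sketch assumes a biallelic site, since `1 - frequency` only equals the frequency of the other allele when there are exactly two alleles. The values come from the variant at position 9845 shown earlier.

```python
def flip_to_minor(record):
    """Swap the REF/ALT fields of a biallelic variant and invert its frequency."""
    flipped = dict(record)
    for ref_key in [k for k in record if k == "REF" or k.startswith("REF_")]:
        alt_key = "ALT" + ref_key[3:]  # REF -> ALT, REF_DP -> ALT_DP, ...
        if alt_key in record:
            flipped[ref_key], flipped[alt_key] = record[alt_key], record[ref_key]
    flipped["frequency"] = 1 - record["frequency"]
    return flipped


# Values taken from the variant at position 9845 shown earlier
variant = {
    "POS": 9845,
    "REF": "C",
    "ALT": "T",
    "REF_DP": 1717,
    "ALT_DP": 5524,
    "frequency": 0.755884,
}
minor = flip_to_minor(variant)
print(minor)
```

After the swap, `T` is treated as the reference and `C` as the minor allele at a frequency of about 0.244.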


Rename the columns to something easier to understand.

In [82]:
rename_dictionary = {
    'POS': 'nucleotide_position',
    'REF': 'reference_nucleotide',
    'ALT': 'variant_nucleotide',
    'REF_DP': 'reference_depth',
    'ALT_DP': 'variant_depth',
    'frequency': 'variant_frequency',
    'depth': 'read_depth',
    'REF_CODON': 'reference_codon',
    'REF_AA': 'reference_amino_acid',
    'ALT_CODON': 'variant_codon',
    'ALT_AA': 'variant_amino_acid',
    'avgLength': 'avg_read_length',
    'LibraryStrategy': 'library_strategy',
    'LibrarySelection': 'library_selection',
    'LibrarySource': 'library_source',
    'LibraryLayout': 'library_layout',
    'Platform': 'sequencing_platform',
    'Model': 'sequencing_model',
    'ScientificName': 'library_organism',
    'wildtype': 'library_amino_acid'
}

Save each of these tables to the summary directory to be tracked by `git`.

In [83]:
# Make a 'summary' directory in results if it doesn't exist
summary_dir = "../../results/summary"
os.makedirs(summary_dir, exist_ok=True)
# Save the minor variants to a CSV file
minor_variants_csv_path = os.path.join(summary_dir, "minor_variants.csv")
minor_variants_df.rename(columns=rename_dictionary).to_csv(minor_variants_csv_path, index=False)
# Save the fixed variants to a CSV file
fixed_variants_csv_path = os.path.join(summary_dir, "fixed_variants.csv")
fixed_variants.rename(columns=rename_dictionary).to_csv(fixed_variants_csv_path, index=False)
# Save all annotated variants (fixed and not fixed) to a CSV file
filtered_variants_csv_path = os.path.join(summary_dir, "all_variants.csv")
annotated_variants.rename(columns=rename_dictionary).to_csv(filtered_variants_csv_path, index=False)

## Basic variant information

How many unique minor variants are there, and how many SRA runs and BioProjects do they come from?

In [88]:
minor_variants_df["amino_acid_variant"] = (minor_variants_df['REF_AA'] + 
                                          minor_variants_df['sequential_site'].astype(str) + 
                                          minor_variants_df['ALT_AA'])
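One thing to watch when building these labels: if `sequential_site` were stored as a float (for example, because some sites had no match in the numbering map), `astype(str)` would produce labels like `V433.0A`. The sketch below shows a possible guard using pandas' nullable integer dtype. The toy rows are for illustration, and the `ALT_AA` values in them are made up.

```python
import pandas as pd

# Toy rows; the sites appear in the tables above, the ALT_AA values are made up.
toy = pd.DataFrame({
    "REF_AA": ["V", "L"],
    "sequential_site": [433.0, 508.0],  # floats, as a join with missing sites would give
    "ALT_AA": ["A", "M"],
})

# Cast to the nullable integer dtype before converting to text
site_text = toy["sequential_site"].astype("Int64").astype(str)
toy["amino_acid_variant"] = toy["REF_AA"] + site_text + toy["ALT_AA"]
print(toy["amino_acid_variant"].tolist())
```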

In [94]:
print(f"There are {len(set(minor_variants_df.amino_acid_variant.to_list()))} unique minor variants in the dataset.")

There are 1075 unique minor variants in the dataset.


In [95]:
print(f"These were identified in {len(set(minor_variants_df.Accession.to_list()))} unique SRA runs.")

These were identified in 1173 unique SRA runs.


In [96]:
print(f"From {len(set(minor_variants_df.BioProject.to_list()))} unique BioProjects.")

From 75 unique BioProjects.
