### Training Dataset Preprocessing for Synthetic Lethality Prediction

This notebook preprocesses training and testing datasets by cleaning missing values and filtering for paralog gene pairs. It prepares the final datasets for machine learning model training and evaluation.

**Inputs:**
- Processed Ito et al. CRISPR screen data (training set)
- Processed Klingbeil et al. CRISPR screen data (testing set)
- Processed Parrish et al. CRISPR screen data (second testing set)
- DeKegel paralog pairs reference dataset (Table S8)

**Outputs:**
- Clean training dataset (Ito et al.) with missing values handled
- Clean testing datasets (Klingbeil et al. and Parrish et al.) with missing values handled
- All datasets filtered to contain only paralog gene pairs

**Key Processing Steps:**
1. Filter datasets to include only paralog gene pairs
2. Remove rows with critical missing values
3. Fill remaining missing values with feature-specific defaults (0 for expression features, a large constant for essentiality features)
4. Generate summary statistics and quality checks

### Setup and File Paths

**Import required libraries and set up file paths:**

In [2]:
# import modules
import os
import re
import pandas as pd
import numpy as np

In [3]:
cwd = os.getcwd()
BASE_DIR = os.path.abspath(os.path.join(cwd, ".."))

# build paths inside the repo
def get_data_path(folders, fname):
    """Return a normalized path to `fname` under BASE_DIR/<folders...>."""
    return os.path.normpath(os.path.join(BASE_DIR, *folders, fname))

file_path_training_data = get_data_path(['output', 'processed_CRISPR_screens'], 'processed_ito_df_labelled.csv')
file_path_testing_data = get_data_path(['output', 'processed_CRISPR_screens'], 'processed_klingbeil_df_labelled.csv')
file_path_testing_data_parrish = get_data_path(['output', 'processed_CRISPR_screens'], 'processed_parrish_df_labelled.csv')
dekegel_table8_path = get_data_path(['input', 'other'], 'processed_DeKegel_TableS8.csv')
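
Before loading, it can help to confirm that every expected input file exists. The sketch below is not part of the original notebook: `find_missing_files` is a hypothetical helper, and the demo runs in a temporary directory rather than the real repository layout.

```python
import os
import tempfile


def find_missing_files(paths):
    """Return the subset of `paths` that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]


# Demo in a throwaway directory (a stand-in for BASE_DIR; illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "processed_ito_df_labelled.csv")
    absent = os.path.join(tmp, "processed_klingbeil_df_labelled.csv")
    with open(present, "w") as handle:
        handle.write("genepair\n")
    missing_inputs = find_missing_files([present, absent])
    missing_names = [os.path.basename(p) for p in missing_inputs]

print(missing_names)
```

In the notebook itself, the same call could take the three `file_path_*` variables and `dekegel_table8_path` defined above.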

### Load Input Datasets

**Load training data (Ito et al.):**

In [7]:
training_df = pd.read_csv(file_path_training_data, low_memory=False)
training_df.head()

Unnamed: 0,genepair,A1,A2,A1_entrez,A2_entrez,DepMap_ID,cell_line,Gemini_FDR,raw_LFC,SL,...,colocalisation,interact,n_total_ppi,fet_ppi_overlap,gtex_spearman_corr,gtex_min_mean_expr,gtex_max_mean_expr,GEMINI,LFC,SL_new
0,A3GALT2_ABO,A3GALT2,ABO,127550.0,28.0,ACH-000022,PATU8988S_PANCREAS,0.998944,0.088856,False,...,0.0,False,3.0,0.0,0.114847,0.258739,11.702,0.118768,0.088856,False
1,A3GALT2_ABO,A3GALT2,ABO,127550.0,28.0,ACH-000307,PK1_PANCREAS,0.986587,0.201704,False,...,0.0,False,3.0,0.0,0.114847,0.258739,11.702,0.132501,0.201704,False
2,A3GALT2_ABO,A3GALT2,ABO,127550.0,28.0,ACH-000632,HS944T_SKIN,1.0,0.069772,False,...,0.0,False,3.0,0.0,0.114847,0.258739,11.702,0.024593,0.069772,False
3,A3GALT2_ABO,A3GALT2,ABO,127550.0,28.0,ACH-000681,A549_LUNG,0.977988,0.379455,False,...,0.0,False,3.0,0.0,0.114847,0.258739,11.702,-0.241323,0.379455,False
4,A3GALT2_ABO,A3GALT2,ABO,127550.0,28.0,ACH-000756,GI1_CENTRAL_NERVOUS_SYSTEM,0.999586,-0.077118,False,...,0.0,False,3.0,0.0,0.114847,0.258739,11.702,0.299715,-0.077118,False


**Load testing data (Klingbeil et al.):**

In [4]:
testing_df = pd.read_csv(file_path_testing_data, low_memory=False)
testing_df.head()

Unnamed: 0,GENE_COMBINATION,domain_combination,genepair,A1,A2,A1_entrez,A2_entrez,cell_line,DepMap_ID,GEMINI,...,either_in_complex,mean_complex_essentiality,colocalisation,interact,n_total_ppi,fet_ppi_overlap,gtex_spearman_corr,gtex_min_mean_expr,gtex_max_mean_expr,SL_new
0,AAK1:Kinase_domain;BMP2K:Kinase_domain,Kinase_domain_Kinase_domain,AAK1_BMP2K,AAK1,BMP2K,22848.0,55589.0,HEL,ACH-000004,0.218665,...,False,0.0,0.0,False,77.0,21.867726,0.261701,6.713555,6.761786,False
1,AAK1:Kinase_domain;BMP2K:Kinase_domain,Kinase_domain_Kinase_domain,AAK1_BMP2K,AAK1,BMP2K,22848.0,55589.0,T3M4,ACH-000085,0.205641,...,False,0.0,0.0,False,77.0,21.867726,0.261701,6.713555,6.761786,False
2,AAK1:Kinase_domain;BMP2K:Kinase_domain,Kinase_domain_Kinase_domain,AAK1_BMP2K,AAK1,BMP2K,22848.0,55589.0,HPAFII,ACH-000094,0.044486,...,False,0.0,0.0,False,77.0,21.867726,0.261701,6.713555,6.761786,False
3,AAK1:Kinase_domain;BMP2K:Kinase_domain,Kinase_domain_Kinase_domain,AAK1_BMP2K,AAK1,BMP2K,22848.0,55589.0,THP1,ACH-000146,0.031737,...,False,0.0,0.0,False,77.0,21.867726,0.261701,6.713555,6.761786,False
4,AAK1:Kinase_domain;BMP2K:Kinase_domain,Kinase_domain_Kinase_domain,AAK1_BMP2K,AAK1,BMP2K,22848.0,55589.0,NOMO1,ACH-000168,0.148144,...,False,0.0,0.0,False,77.0,21.867726,0.261701,6.713555,6.761786,False


**Load testing data (Parrish et al.):**

In [4]:
testing_df_parrish = pd.read_csv(file_path_testing_data_parrish, low_memory=False)
testing_df_parrish.head()

Unnamed: 0,genepair,A1,A2,A1_entrez,A2_entrez,PC9_GI_score,PC9_GI_fdr,HeLa_GI_score,HeLa_GI_fdr,DepMap_ID,...,colocalisation,interact,n_total_ppi,fet_ppi_overlap,gtex_spearman_corr,gtex_min_mean_expr,gtex_max_mean_expr,GEMINI,LFC,SL_new
0,A2M_PZP,A2M,PZP,2.0,5858.0,0.264313,0.138809,-0.15432,0.424612,ACH-000779,...,0.0,True,122.0,2.97235,0.601651,0.804023,473.357464,-0.249934,0.333503,False
1,A2M_PZP,A2M,PZP,2.0,5858.0,0.264313,0.138809,-0.15432,0.424612,ACH-001086,...,0.0,True,122.0,2.97235,0.601651,0.804023,473.357464,-0.038689,-0.115678,False
2,AADACL3_AADACL4,AADACL3,AADACL4,126767.0,343066.0,-0.000281,0.992873,0.120862,0.433194,ACH-000779,...,0.0,False,0.0,0.0,0.141775,0.213038,0.350221,0.022368,-0.106713,False
3,AADACL3_AADACL4,AADACL3,AADACL4,126767.0,343066.0,-0.000281,0.992873,0.120862,0.433194,ACH-001086,...,0.0,False,0.0,0.0,0.141775,0.213038,0.350221,-0.252247,0.257577,False
4,AADAC_AADACL2,AADAC,AADACL2,13.0,344752.0,-0.29915,0.066517,0.016004,0.951159,ACH-000779,...,0.0,False,1.0,0.0,0.417877,2.854048,9.815447,-0.007339,-0.023724,False


**Load paralog pairs reference:**
- This dataset defines which gene pairs are paralogs
- Only paralog pairs will be retained in the final datasets

In [5]:
paralog_pairs = pd.read_csv(dekegel_table8_path)
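
The paralog filter is a membership test on the `genepair` column. The sketch below shows that idea on made-up data; the gene pair names and the assumption that the reference table exposes a `genepair` column are illustrative.

```python
import pandas as pd

# Toy screen data: two paralog pairs and one non-paralog pair (made-up names)
screen = pd.DataFrame({
    "genepair": ["A1_A2", "A1_A2", "B1_B2", "X_Y"],
    "DepMap_ID": ["ACH-1", "ACH-2", "ACH-1", "ACH-1"],
})

# Toy reference table; assumed to expose the same 'genepair' column
reference = pd.DataFrame({"genepair": ["A1_A2", "B1_B2", "C1_C2"]})

# Keep only rows whose gene pair appears in the reference table
is_paralog = screen["genepair"].isin(reference["genepair"])
paralog_only = screen[is_paralog].reset_index(drop=True)
dropped_pairs = sorted(screen.loc[~is_paralog, "genepair"].unique())

print(paralog_only)
print("Dropped non-paralog pairs:", dropped_pairs)
```

Listing the dropped pairs alongside the filtered table makes it easier to spot naming mismatches (for example, a different gene order within a pair) that would otherwise silently remove rows.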

### Data Quality Assessment

**Check missing values in training dataset:**
- Identify which features have missing data
- Determine appropriate handling strategy for each feature type

In [8]:
# Count missing values per feature column in the raw training data
training_df[['rMaxExp_A1A2', 'rMinExp_A1A2',
             'max_ranked_A1A2', 'min_ranked_A1A2',
             'max_cn', 'min_cn', 'Protein_Altering', 'Damaging', 
             'min_sequence_identity',
             'prediction_score', 
             'weighted_PPI_essentiality', 'weighted_PPI_expression',
             'go_CC_expression', 'smallest_BP_GO_essentiality', 'smallest_BP_GO_expression',
             'smallest_CC_GO_essentiality','closest', 'WGD', 'family_size',
             'cds_length_ratio', 'shared_domains', 'has_pombe_ortholog',
             'has_essential_pombe_ortholog', 'has_cerevisiae_ortholog', 'has_essential_cerevisiae_ortholog', 
             'conservation_score', 'mean_age', 'either_in_complex', 'mean_complex_essentiality', 'colocalisation',
             'interact', 'n_total_ppi', 'fet_ppi_overlap',
             'gtex_spearman_corr', 'gtex_min_mean_expr', 'gtex_max_mean_expr']].isna().sum()

rMaxExp_A1A2                            11
rMinExp_A1A2                            11
max_ranked_A1A2                       4926
min_ranked_A1A2                       4926
max_cn                                   5
min_cn                                   5
Protein_Altering                         0
Damaging                                 0
min_sequence_identity                 3347
prediction_score                      3347
weighted_PPI_essentiality             7901
weighted_PPI_expression               3793
go_CC_expression                     39944
smallest_BP_GO_essentiality          26127
smallest_BP_GO_expression            23970
smallest_CC_GO_essentiality          40778
closest                                  0
WGD                                      0
family_size                           3347
cds_length_ratio                      3347
shared_domains                        3347
has_pombe_ortholog                       0
has_essential_pombe_ortholog          3347
has_cerevis
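
Raw counts are easier to interpret next to the fraction of rows affected; for instance, 40778 missing values in `smallest_CC_GO_essentiality` out of 49156 rows is roughly 83%. A minimal sketch on toy data (column names borrowed from the list above, values made up):

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (values are illustrative)
toy = pd.DataFrame({
    "max_cn": [2.0, np.nan, 3.0, 4.0],
    "go_CC_expression": [np.nan, np.nan, 1.5, np.nan],
    "WGD": [True, False, True, False],
})

# Missing count and fraction per column, worst columns first
n_missing = toy.isna().sum()
summary = pd.DataFrame({
    "n_missing": n_missing,
    "frac_missing": n_missing / len(toy),
}).sort_values("n_missing", ascending=False)

print(summary)
```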

In [9]:
# Check if each genepair + cell line combination adds up to expected values

# Method 1: Count unique combinations of genepair + DepMap_ID
unique_combinations = training_df[['genepair', 'DepMap_ID']].drop_duplicates().shape[0]
print(f"Unique genepair + cell line combinations: {unique_combinations}")

# Method 2: Expected vs Actual
expected_combinations = training_df['genepair'].nunique() * training_df['DepMap_ID'].nunique()
print(f"Expected combinations (gene pairs × cell lines): {expected_combinations}")
print(f"Actual combinations: {unique_combinations}")
print(f"Match expected? {unique_combinations == expected_combinations}")

# Method 3: Check if we have a complete matrix (every pair in every cell line)
print(f"\nDetailed breakdown:")
print(f"Unique gene pairs: {training_df['genepair'].nunique()}")
print(f"Unique cell lines: {training_df['DepMap_ID'].nunique()}")
print(f"Total rows in dataset: {len(training_df)}")
print(f"Expected if complete matrix: {training_df['genepair'].nunique() * training_df['DepMap_ID'].nunique()}")

# Method 4: Check if dataset has duplicates using pandas duplicated()
print(f"\nDuplicate check using pandas:")
duplicates = training_df[['genepair', 'DepMap_ID']].duplicated()
print(f"Total rows: {len(training_df)}")
print(f"Duplicate rows: {duplicates.sum()}")
print(f"Unique combinations: {unique_combinations}")
print(f"Has duplicates? {duplicates.any()}")

# Show duplicate combinations if any exist
if duplicates.any():
    print("\nDuplicate combinations:")
    duplicate_rows = training_df[duplicates][['genepair', 'DepMap_ID']].drop_duplicates()
    print(duplicate_rows.head(10))

# Method 5: Verify the 4170*10 calculation
print(f"\nVerifying your 4170*10 calculation:")
print(f"4170 * 10 = {4170 * 10}")
print(f"Actual unique gene pairs: {training_df['genepair'].nunique()}")
print(f"Actual unique cell lines: {training_df['DepMap_ID'].nunique()}")
print(f"Matches 4170 gene pairs? {training_df['genepair'].nunique() == 4170}")
print(f"Matches 10 cell lines? {training_df['DepMap_ID'].nunique() == 10}")

# Method 6: Check for missing combinations (simplified approach)
if unique_combinations != expected_combinations:
    print(f"\nMissing combinations: {expected_combinations - unique_combinations}")
    
    # More efficient approach using pandas merge
    # Create a DataFrame with all possible combinations
    all_genepairs = pd.DataFrame({'genepair': training_df['genepair'].unique()})
    all_celllines = pd.DataFrame({'DepMap_ID': training_df['DepMap_ID'].unique()})
    
    # Cross join to get all possible combinations
    all_genepairs['key'] = 1
    all_celllines['key'] = 1
    all_possible_df = all_genepairs.merge(all_celllines, on='key').drop('key', axis=1)
    
    # Get actual combinations
    actual_combinations_df = training_df[['genepair', 'DepMap_ID']].drop_duplicates()
    
    # Find missing combinations using merge
    missing_df = all_possible_df.merge(
        actual_combinations_df, 
        on=['genepair', 'DepMap_ID'], 
        how='left', 
        indicator=True
    )
    missing_combinations_df = missing_df[missing_df['_merge'] == 'left_only'][['genepair', 'DepMap_ID']]
    
    print(f"Number of missing combinations: {len(missing_combinations_df)}")
    
    # Save missing gene pairs as a list
    missing_genepairs = missing_combinations_df['genepair'].unique().tolist()
    print(f"Number of unique missing gene pairs: {len(missing_genepairs)}")
    
    if len(missing_combinations_df) > 0:
        print("Missing combinations (showing first 10):")
        print(missing_combinations_df.head(10))
        
        print("\nMissing gene pairs (showing first 10):")
        print(missing_genepairs[:10])
        
        # Store the missing gene pairs list for later use
        print(f"\nMissing gene pairs saved as 'missing_genepairs' list with {len(missing_genepairs)} unique pairs")
else:
    print(f"\n✓ Perfect! All expected combinations are present.")
    missing_genepairs = []  # Empty list if no missing pairs

Unique genepair + cell line combinations: 49156
Expected combinations (gene pairs × cell lines): 49698
Actual combinations: 49156
Match expected? False

Detailed breakdown:
Unique gene pairs: 4518
Unique cell lines: 11
Total rows in dataset: 49156
Expected if complete matrix: 49698

Duplicate check using pandas:
Total rows: 49156
Duplicate rows: 0
Unique combinations: 49156
Has duplicates? False

Verifying your 4170*10 calculation:
4170 * 10 = 41700
Actual unique gene pairs: 4518
Actual unique cell lines: 11
Matches 4170 gene pairs? False
Matches 10 cell lines? False

Missing combinations: 542
Number of missing combinations: 542
Number of unique missing gene pairs: 285
Missing combinations (showing first 10):
        genepair   DepMap_ID
326  ACAA2_ACAT2  ACH-000915
341  ACACA_ACACB  ACH-000022
343  ACACA_ACACB  ACH-000632
344  ACACA_ACACB  ACH-000681
346  ACACA_ACACB  ACH-000801
347  ACACA_ACACB  ACH-000881
348  ACACA_ACACB  ACH-000915
349  ACACA_ACACB  ACH-000987
350  ACACA_ACACB  AC
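
The numbers above are internally consistent: 4518 gene pairs × 11 cell lines gives 49698 possible combinations, and 49698 − 49156 = 542 are absent. As an alternative to the key-based cross join, the same check can be sketched with `pd.MultiIndex.from_product` and an index difference. The toy data below is made up.

```python
import pandas as pd

# Arithmetic from the output above: 4518 pairs x 11 cell lines, 49156 rows observed
expected_full = 4518 * 11
n_absent = expected_full - 49156

# Toy screen in which pair C_D was not measured in cell line ACH-2
toy = pd.DataFrame({
    "genepair": ["A_B", "A_B", "C_D"],
    "DepMap_ID": ["ACH-1", "ACH-2", "ACH-1"],
})

# Every (gene pair, cell line) combination that a complete matrix would contain
full = pd.MultiIndex.from_product(
    [toy["genepair"].unique(), toy["DepMap_ID"].unique()],
    names=["genepair", "DepMap_ID"],
)
observed = pd.MultiIndex.from_frame(toy[["genepair", "DepMap_ID"]])

# Combinations present in the full grid but not in the data
missing = full.difference(observed)
missing_df = missing.to_frame(index=False)

print(expected_full, n_absent)
print(missing_df)
```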

In [10]:
def preprocess_dataset(df, old_df,
                       required_genepairs_col='genepair',
                       dropna_cols=None,
                       fillna_zero_cols=None,
                       fillna_large_cols=None,
                       fillna_large_value=18000):
    """
    Preprocess a training or testing dataset:
    - Keep only rows with genepairs present in `old_df`
    - Drop rows with NaN in specific columns
    - Fill missing values with default values
    
    Parameters:
        df (pd.DataFrame): Dataset to process
        old_df (pd.DataFrame): Dataset with allowed genepairs
        required_genepairs_col (str): Column name for genepairs to match
        dropna_cols (list): Columns for which rows with NaNs should be dropped
        fillna_zero_cols (list): Columns to fill NaNs with 0
        fillna_large_cols (list): Columns to fill NaNs with large constant
        fillna_large_value (int or float): The large value to fill (default: 18000)

    Returns:
        pd.DataFrame: Cleaned and processed DataFrame
    """

    # Step 1: Filter rows to only those in old_df
    df_filtered = df[df[required_genepairs_col].isin(old_df[required_genepairs_col])].copy()

    # Step 2: Drop rows with any NA in required columns
    if dropna_cols:
        df_filtered = df_filtered.dropna(axis=0, how='any', subset=dropna_cols)

    # Step 3: Fill NaNs with 0 or large number
    if fillna_zero_cols:
        df_filtered[fillna_zero_cols] = df_filtered[fillna_zero_cols].fillna(0)

    if fillna_large_cols:
        df_filtered[fillna_large_cols] = df_filtered[fillna_large_cols].fillna(fillna_large_value)

    # Step 4: Reset index for clean result
    return df_filtered.reset_index(drop=True)
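
To see the three steps in action, the sketch below runs a condensed copy of `preprocess_dataset` on a four-row toy table. The function body is repeated only so the example runs on its own; the data values and the choice of one column per strategy are illustrative.

```python
import numpy as np
import pandas as pd


def preprocess_dataset(df, old_df, required_genepairs_col="genepair",
                       dropna_cols=None, fillna_zero_cols=None,
                       fillna_large_cols=None, fillna_large_value=18000):
    # Condensed copy of the function above, repeated so this example runs alone
    out = df[df[required_genepairs_col].isin(old_df[required_genepairs_col])].copy()
    if dropna_cols:
        out = out.dropna(axis=0, how="any", subset=dropna_cols)
    if fillna_zero_cols:
        out[fillna_zero_cols] = out[fillna_zero_cols].fillna(0)
    if fillna_large_cols:
        out[fillna_large_cols] = out[fillna_large_cols].fillna(fillna_large_value)
    return out.reset_index(drop=True)


# Toy input: X_Y is not an allowed pair; row 1 lacks a required value
toy = pd.DataFrame({
    "genepair": ["A1_A2", "A1_A2", "B1_B2", "X_Y"],
    "rMaxExp_A1A2": [1.0, np.nan, 2.0, 3.0],
    "weighted_PPI_expression": [np.nan, 0.4, 0.7, 0.1],
    "weighted_PPI_essentiality": [120.0, 50.0, np.nan, 9.0],
})
allowed = pd.DataFrame({"genepair": ["A1_A2", "B1_B2"]})

clean = preprocess_dataset(
    toy, allowed,
    dropna_cols=["rMaxExp_A1A2"],
    fillna_zero_cols=["weighted_PPI_expression"],
    fillna_large_cols=["weighted_PPI_essentiality"],
)
print(clean)
```

Two rows survive: the `X_Y` row is removed by the gene pair filter, the second `A1_A2` row is dropped for its missing `rMaxExp_A1A2`, and the remaining gaps are filled with 0 and 18000 respectively.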

### Apply Data Preprocessing

**Define missing value handling strategy:**
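- **Drop rows**: for the columns in `drop_na_values`, where a missing value is treated as invalid data
- **Fill with 0**: for the expression features in `fillna_values`, where 0 indicates no expression/interaction
- **Fill with large value**: for the essentiality features in `fillna_values_v2`, which are filled with `fillna_large_value` (18000 by default) on the reasoning that a high rank indicates low importance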

In [None]:
drop_na_values = ['rMaxExp_A1A2', 'rMinExp_A1A2', 'max_ranked_A1A2', 'min_ranked_A1A2']
fillna_values = ['weighted_PPI_expression', 'smallest_BP_GO_expression', 'go_CC_expression']
fillna_values_v2 = ['weighted_PPI_essentiality', 'smallest_BP_GO_essentiality', 'smallest_CC_GO_essentiality']

# Apply to training set
training_df_clean = preprocess_dataset(
    df=training_df,
    old_df=paralog_pairs,
    dropna_cols=drop_na_values,
    fillna_zero_cols=fillna_values,
    fillna_large_cols=fillna_values_v2
)

# Apply to testing set
testing_df_clean = preprocess_dataset(
    df=testing_df,
    old_df=paralog_pairs,
    dropna_cols=drop_na_values,
    fillna_zero_cols=fillna_values,
    fillna_large_cols=fillna_values_v2
)

# Apply to testing set Parrish
testing_df_parrish_clean = preprocess_dataset(
    df=testing_df_parrish,
    old_df=paralog_pairs,
    dropna_cols=drop_na_values,
    fillna_zero_cols=fillna_values,
    fillna_large_cols=fillna_values_v2
)
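
The training set shrinks from 49156 rows before preprocessing to 41244 afterwards, and it can be useful to know how much of that loss comes from the paralog filter versus the `dropna` step. The sketch below uses a hypothetical helper, `attrition_report`, on made-up data to show one way to split the count.

```python
import numpy as np
import pandas as pd


def attrition_report(df, allowed_pairs, dropna_cols, pair_col="genepair"):
    """Count rows remaining after the paralog filter and after dropna."""
    after_filter = df[df[pair_col].isin(allowed_pairs)]
    after_dropna = after_filter.dropna(subset=dropna_cols)
    return {
        "rows_in": len(df),
        "after_paralog_filter": len(after_filter),
        "after_dropna": len(after_dropna),
    }


# Toy data: two X_Y rows fail the filter, one A1_A2 row has a missing value
toy = pd.DataFrame({
    "genepair": ["A1_A2", "A1_A2", "B1_B2", "X_Y", "X_Y"],
    "rMaxExp_A1A2": [1.0, np.nan, 2.0, 3.0, np.nan],
})

report = attrition_report(toy, ["A1_A2", "B1_B2"], ["rMaxExp_A1A2"])
print(report)
```

Applied to the real data, the arguments would be `paralog_pairs['genepair']` and `drop_na_values`.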

### Quality Control and Summary Statistics

**Training dataset summary after preprocessing:**

In [13]:
#summary of the training dataset after removing NA values
print('Num SL:', training_df_clean[training_df_clean['SL_new'] == True].shape[0], '/', training_df_clean.shape[0])
print('Num non-SL:', training_df_clean[training_df_clean['SL_new'] == False].shape[0], '/', training_df_clean.shape[0])
print(f'Number of unique gene pairs: {training_df_clean.genepair.nunique()}')
print(f'Number of unique cell lines: {training_df_clean.cell_line.nunique()}')
training_df_clean[:3]

Num SL: 958 / 41244
Num non-SL: 40286 / 41244
Number of unique gene pairs: 4170
Number of unique cell lines: 10


Unnamed: 0,genepair,A1,A2,A1_entrez,A2_entrez,DepMap_ID,cell_line,Gemini_FDR,raw_LFC,SL,...,colocalisation,interact,n_total_ppi,fet_ppi_overlap,gtex_spearman_corr,gtex_min_mean_expr,gtex_max_mean_expr,GEMINI,LFC,SL_new
0,A3GALT2_ABO,A3GALT2,ABO,127550.0,28.0,ACH-000022,PATU8988S_PANCREAS,0.998944,0.088856,False,...,0.0,False,3.0,0.0,0.114847,0.258739,11.702,0.118768,0.088856,False
1,A3GALT2_ABO,A3GALT2,ABO,127550.0,28.0,ACH-000307,PK1_PANCREAS,0.986587,0.201704,False,...,0.0,False,3.0,0.0,0.114847,0.258739,11.702,0.132501,0.201704,False
2,A3GALT2_ABO,A3GALT2,ABO,127550.0,28.0,ACH-000632,HS944T_SKIN,1.0,0.069772,False,...,0.0,False,3.0,0.0,0.114847,0.258739,11.702,0.024593,0.069772,False
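
The counts above imply a strongly imbalanced label. A short calculation, using only the numbers printed in the summary, makes the ratio explicit (`positive_rate` is a hypothetical helper name):

```python
def positive_rate(n_positive, n_total):
    """Fraction of rows labelled SL."""
    return n_positive / n_total


# Counts copied from the training summary output above
n_sl, n_non_sl, n_total = 958, 40286, 41244

train_rate = positive_rate(n_sl, n_total)
non_sl_per_sl = n_non_sl / n_sl

print(f"Training SL fraction: {train_rate:.2%}")
print(f"Non-SL rows per SL row: {non_sl_per_sl:.1f}")
```

A ratio of this size is worth keeping in mind when choosing evaluation metrics downstream.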


In [14]:
# For each genepair, count cell lines labeled SL and nonSL
sl_counts = training_df_clean[training_df_clean['SL_new'] == True].groupby('genepair')['cell_line'].nunique()
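nonsl_counts = training_df_clean[training_df_clean['SL_new'] == False].groupby('genepair')['cell_line'].nunique()
# Align nonSL counts to the SL index (0 where a pair has no nonSL cell line)
# so the boolean masks below share one index instead of relying on implicit alignment
nonsl_counts = nonsl_counts.reindex(sl_counts.index, fill_value=0)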

# Find genepairs with at least 1 SL and at least 1 nonSL cell line
mixed_pairs_1 = sl_counts[(sl_counts >= 1) & (nonsl_counts >= 1)].index
print(f"Paralog pairs SL in ≥1 cell line and nonSL in ≥1 cell line: {len(mixed_pairs_1)}")

# At least 2 SL cell lines and at least 1 nonSL
mixed_pairs_2 = sl_counts[(sl_counts >= 2) & (nonsl_counts >= 1)].index
print(f"Paralog pairs SL in ≥2 cell lines and nonSL in ≥1 cell line: {len(mixed_pairs_2)}")

# At least 3 SL cell lines and at least 1 nonSL
mixed_pairs_3 = sl_counts[(sl_counts >= 3) & (nonsl_counts >= 1)].index
print(f"Paralog pairs SL in ≥3 cell lines and nonSL in ≥1 cell line: {len(mixed_pairs_3)}")

Paralog pairs SL in ≥1 cell line and nonSL in ≥1 cell line: 502
Paralog pairs SL in ≥2 cell lines and nonSL in ≥1 cell line: 212
Paralog pairs SL in ≥3 cell lines and nonSL in ≥1 cell line: 105
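
The same per-pair tallies can be sketched with a single `groupby` aggregation, which keeps SL and non-SL counts in one table and avoids aligning two separate Series. The example below uses made-up data and counts rows rather than unique cell lines; the two agree when each (gene pair, cell line) combination appears once, as the duplicate check earlier reported for the training data.

```python
import pandas as pd

# Toy labels: A_B is SL in 2 of 3 lines, C_D never, E_F in its only line
toy = pd.DataFrame({
    "genepair": ["A_B", "A_B", "A_B", "C_D", "C_D", "E_F"],
    "cell_line": ["c1", "c2", "c3", "c1", "c2", "c1"],
    "SL_new": [True, True, False, False, False, True],
})

# One row per gene pair: number of SL rows and total rows
per_pair = toy.groupby("genepair")["SL_new"].agg(n_sl="sum", n_total="count")
per_pair["n_nonsl"] = per_pair["n_total"] - per_pair["n_sl"]


def mixed_pairs(table, min_sl):
    """Pairs SL in at least `min_sl` lines and non-SL in at least one."""
    mask = (table["n_sl"] >= min_sl) & (table["n_nonsl"] >= 1)
    return list(table.index[mask])


for k in (1, 2, 3):
    print(f"SL in >= {k} lines and nonSL in >= 1 line: {mixed_pairs(per_pair, k)}")
```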


In [15]:
# Check that missing values have been properly handled
training_df_clean[['rMaxExp_A1A2', 'rMinExp_A1A2',
             'max_ranked_A1A2', 'min_ranked_A1A2',
             'max_cn', 'min_cn', 'Protein_Altering', 'Damaging', 
             'min_sequence_identity',
             'prediction_score', 
             'weighted_PPI_essentiality', 'weighted_PPI_expression',
             'go_CC_expression', 'smallest_BP_GO_essentiality', 'smallest_BP_GO_expression',
             'smallest_CC_GO_essentiality', 'closest', 'WGD', 'family_size',
             'cds_length_ratio', 'shared_domains', 'has_pombe_ortholog',
             'has_essential_pombe_ortholog', 'has_cerevisiae_ortholog', 'has_essential_cerevisiae_ortholog', 
             'conservation_score', 'mean_age', 'either_in_complex', 'mean_complex_essentiality', 'colocalisation',
             'interact', 'n_total_ppi', 'fet_ppi_overlap',
             'gtex_spearman_corr', 'gtex_min_mean_expr', 'gtex_max_mean_expr']].isna().sum()

rMaxExp_A1A2                         0
rMinExp_A1A2                         0
max_ranked_A1A2                      0
min_ranked_A1A2                      0
max_cn                               0
min_cn                               0
Protein_Altering                     0
Damaging                             0
min_sequence_identity                0
prediction_score                     0
weighted_PPI_essentiality            0
weighted_PPI_expression              0
go_CC_expression                     0
smallest_BP_GO_essentiality          0
smallest_BP_GO_expression            0
smallest_CC_GO_essentiality          0
closest                              0
WGD                                  0
family_size                          0
cds_length_ratio                     0
shared_domains                       0
has_pombe_ortholog                   0
has_essential_pombe_ortholog         0
has_cerevisiae_ortholog              0
has_essential_cerevisiae_ortholog    0
conservation_score       

**Testing dataset summary after preprocessing:**

In [None]:
# Summary of the Klingbeil testing dataset after preprocessing
print('Num SL:', testing_df_clean[testing_df_clean['SL_new'] == True].shape[0], '/', testing_df_clean.shape[0])
print('Num non-SL:', testing_df_clean[testing_df_clean['SL_new'] == False].shape[0], '/', testing_df_clean.shape[0])
print(f'Number of unique gene pairs: {testing_df_clean.genepair.nunique()}')
print(f'Number of unique cell lines: {testing_df_clean.cell_line.nunique()}')
testing_df_clean[:3]

In [17]:
# Summary of the Parrish testing dataset after preprocessing
print('Num SL:', testing_df_parrish_clean[testing_df_parrish_clean['SL_new'] == True].shape[0], '/', testing_df_parrish_clean.shape[0])
print('Num non-SL:', testing_df_parrish_clean[testing_df_parrish_clean['SL_new'] == False].shape[0], '/', testing_df_parrish_clean.shape[0])
print(f'Number of unique gene pairs: {testing_df_parrish_clean.genepair.nunique()}')
print(f'Number of unique cell lines: {testing_df_parrish_clean.cell_line.nunique()}')
testing_df_parrish_clean[:3]

Num SL: 138 / 1765
Num non-SL: 1627 / 1765
Number of unique gene pairs: 908
Number of unique cell lines: 2


Unnamed: 0,genepair,A1,A2,A1_entrez,A2_entrez,PC9_GI_score,PC9_GI_fdr,HeLa_GI_score,HeLa_GI_fdr,DepMap_ID,...,colocalisation,interact,n_total_ppi,fet_ppi_overlap,gtex_spearman_corr,gtex_min_mean_expr,gtex_max_mean_expr,GEMINI,LFC,SL_new
0,A2M_PZP,A2M,PZP,2.0,5858.0,0.264313,0.138809,-0.15432,0.424612,ACH-000779,...,0.0,True,122.0,2.97235,0.601651,0.804023,473.357464,-0.249934,0.333503,False
1,A2M_PZP,A2M,PZP,2.0,5858.0,0.264313,0.138809,-0.15432,0.424612,ACH-001086,...,0.0,True,122.0,2.97235,0.601651,0.804023,473.357464,-0.038689,-0.115678,False
2,AADACL3_AADACL4,AADACL3,AADACL4,126767.0,343066.0,-0.000281,0.992873,0.120862,0.433194,ACH-000779,...,0.0,False,0.0,0.0,0.141775,0.213038,0.350221,0.022368,-0.106713,False


**Verify missing values are handled in testing data:**

In [13]:
testing_df_clean[['rMaxExp_A1A2', 'rMinExp_A1A2',
             'max_ranked_A1A2', 'min_ranked_A1A2',
             'max_cn', 'min_cn', 'Protein_Altering', 'Damaging', 
             'min_sequence_identity',
             'prediction_score', 
             'weighted_PPI_essentiality', 'weighted_PPI_expression',
             'go_CC_expression', 'smallest_BP_GO_essentiality', 'smallest_BP_GO_expression',
             'smallest_CC_GO_essentiality', 'closest', 'WGD', 'family_size',
             'cds_length_ratio', 'shared_domains', 'has_pombe_ortholog',
             'has_essential_pombe_ortholog', 'has_cerevisiae_ortholog', 'has_essential_cerevisiae_ortholog', 
             'conservation_score', 'mean_age', 'either_in_complex', 'mean_complex_essentiality', 'colocalisation',
             'interact', 'n_total_ppi', 'fet_ppi_overlap',
             'gtex_spearman_corr', 'gtex_min_mean_expr', 'gtex_max_mean_expr']].isna().sum()

rMaxExp_A1A2                         0
rMinExp_A1A2                         0
max_ranked_A1A2                      0
min_ranked_A1A2                      0
max_cn                               0
min_cn                               0
Protein_Altering                     0
Damaging                             0
min_sequence_identity                0
prediction_score                     0
weighted_PPI_essentiality            0
weighted_PPI_expression              0
go_CC_expression                     0
smallest_BP_GO_essentiality          0
smallest_BP_GO_expression            0
smallest_CC_GO_essentiality          0
closest                              0
WGD                                  0
family_size                          0
cds_length_ratio                     0
shared_domains                       0
has_pombe_ortholog                   0
has_essential_pombe_ortholog         0
has_cerevisiae_ortholog              0
has_essential_cerevisiae_ortholog    0
conservation_score       

In [18]:
testing_df_parrish_clean[['rMaxExp_A1A2', 'rMinExp_A1A2',
             'max_ranked_A1A2', 'min_ranked_A1A2',
             'max_cn', 'min_cn', 'Protein_Altering', 'Damaging', 
             'min_sequence_identity',
             'prediction_score', 
             'weighted_PPI_essentiality', 'weighted_PPI_expression',
             'go_CC_expression', 'smallest_BP_GO_essentiality', 'smallest_BP_GO_expression',
             'smallest_CC_GO_essentiality', 'closest', 'WGD', 'family_size',
             'cds_length_ratio', 'shared_domains', 'has_pombe_ortholog',
             'has_essential_pombe_ortholog', 'has_cerevisiae_ortholog', 'has_essential_cerevisiae_ortholog', 
             'conservation_score', 'mean_age', 'either_in_complex', 'mean_complex_essentiality', 'colocalisation',
             'interact', 'n_total_ppi', 'fet_ppi_overlap',
             'gtex_spearman_corr', 'gtex_min_mean_expr', 'gtex_max_mean_expr'
]].isna().sum()

rMaxExp_A1A2                         0
rMinExp_A1A2                         0
max_ranked_A1A2                      0
min_ranked_A1A2                      0
max_cn                               0
min_cn                               0
Protein_Altering                     0
Damaging                             0
min_sequence_identity                0
prediction_score                     0
weighted_PPI_essentiality            0
weighted_PPI_expression              0
go_CC_expression                     0
smallest_BP_GO_essentiality          0
smallest_BP_GO_expression            0
smallest_CC_GO_essentiality          0
closest                              0
WGD                                  0
family_size                          0
cds_length_ratio                     0
shared_domains                       0
has_pombe_ortholog                   0
has_essential_pombe_ortholog         0
has_cerevisiae_ortholog              0
has_essential_cerevisiae_ortholog    0
conservation_score       
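
Reading long columns of zeros by eye is error-prone, so the check could also be expressed programmatically. The sketch below uses a hypothetical helper, `columns_with_missing`, that returns only the offending columns; the toy data is made up.

```python
import numpy as np
import pandas as pd


def columns_with_missing(df, cols):
    """Return {column: n_missing} for columns in `cols` that still contain NaN."""
    counts = df[cols].isna().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}


# One fully filled frame and one with a leftover gap (illustrative values)
clean_toy = pd.DataFrame({"max_cn": [2.0, 3.0], "min_cn": [1.0, 2.0]})
dirty_toy = pd.DataFrame({"max_cn": [2.0, np.nan], "min_cn": [1.0, 2.0]})

clean_report = columns_with_missing(clean_toy, ["max_cn", "min_cn"])
dirty_report = columns_with_missing(dirty_toy, ["max_cn", "min_cn"])

print(clean_report)
print(dirty_report)
```

An empty dictionary means every listed column is complete, which can then be turned into an `assert` before saving.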

### Save Processed Datasets

**Export cleaned training and testing datasets:**

In [None]:
# save the files (create the output directory first if it does not exist)
output_path = get_data_path(['output', 'models'], '')
os.makedirs(output_path, exist_ok=True)
training_df_clean.to_csv(os.path.join(output_path, 'training_data.csv'), index=False)
testing_df_clean.to_csv(os.path.join(output_path, 'testing_data.csv'), index=False)
testing_df_parrish_clean.to_csv(os.path.join(output_path, 'testing_data_parrish.csv'), index=False)
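
A quick round trip can confirm that a saved file reads back as expected. The sketch below writes a toy frame to a temporary directory (not the real `output/models` folder) and reloads it; all values are made up.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for a cleaned dataset
toy = pd.DataFrame({
    "genepair": ["A1_A2", "B1_B2"],
    "SL_new": [True, False],
    "GEMINI": [0.25, -0.5],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "output", "models")
    os.makedirs(out_dir, exist_ok=True)  # create nested folders if absent
    out_file = os.path.join(out_dir, "training_data.csv")
    toy.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

Writing with `index=False` keeps the index out of the file. A CSV saved with its index and read back without `index_col` typically gains an `Unnamed: 0` column like the one visible in the input tables above.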