<br>

<b>This may be obvious to some, but just in case: here is a simple way to use the newly added <code>train_updates_20220929.csv</code> file, which fixes some of the issues found in the original <code>train.csv</code> file.</b>
* Here's <b><a href="https://www.kaggle.com/competitions/novozymes-enzyme-stability-prediction/discussion/356251">the link to the thread where the announcement was made</a></b> (published on September 29, 2022)
* Below is the post text for easy reading:

> As has been pointed out, there are some data issues in the training data. A file has been added to the Data page which contains the rows that should not be used due to data quality issues (2409 rows, with all features marked as NaN), as well as the rows where the pH and tm were transposed (25 rows, with corrected features in this dataset).
>
>The original train.csv has not been modified. Please use this file to make adjustments as necessary.

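To make the two kinds of rows in the announcement concrete, here is a minimal sketch that splits an updates frame into rows to drop and rows to correct. The toy frame and its values are made up for illustration; only the column names and the all-NaN convention come from the competition files. On the real `train_updates_20220929.csv` the two counts should come out as 2409 and 25 if the announcement is accurate.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_updates_20220929.csv (values are illustrative, not real data).
toy_updates_df = pd.DataFrame({
    "seq_id": [69, 70, 973, 989],
    "protein_sequence": [np.nan, np.nan, "DTSG", "DVSG"],
    "pH": [np.nan, np.nan, 7.0, 5.5],
    "data_source": [np.nan, np.nan, np.nan, np.nan],
    "tm": [np.nan, np.nan, 48.4, 55.6],
})

feature_cols = [c for c in toy_updates_df.columns if c != "seq_id"]

# Rows to drop: every feature column is NaN.
drop_mask = toy_updates_df[feature_cols].isna().all(axis=1)
rows_to_drop = toy_updates_df[drop_mask]

# Rows to correct: at least one feature is present (the swapped pH/tm rows).
rows_to_correct = toy_updates_df[~drop_mask]

print(len(rows_to_drop), len(rows_to_correct))
```

The cells below use the simpler test `pd.isna(train_updates_df["pH"])`, which gives the same split as long as `pH` is NaN exactly on the rows to drop.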
<br>

<br>

**Imports**

In [1]:
import numpy as np
import pandas as pd
import os

<br>

**Setup For Fixing**

In [2]:
train_df = pd.read_csv("../input/novozymes-enzyme-stability-prediction/train.csv")
train_updates_df = pd.read_csv("../input/novozymes-enzyme-stability-prediction/train_updates_20220929.csv")

# Identify which sequence ids need to have the tm and pH values changed and create a dictionary mapping 
seqid_2_phtm_update_map = train_updates_df[~pd.isna(train_updates_df["pH"])].groupby("seq_id")[["pH", "tm"]].first().to_dict("index")

# Identify the sequence ids that will be dropped due to data quality issues
bad_seqids = train_updates_df[pd.isna(train_updates_df["pH"])].seq_id.to_list()

<br>

**Demonstrate the Problem**
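* Rows with data quality issues (these look normal in <code>train.csv</code>; they can only be identified through the updates file)
* pH and tm swapped (pH cannot exceed 14, so a larger value in the pH column is really a tm)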

In [3]:
# Data quality issue rows
print("\n... EXAMPLES OF 10 ROWS WITH DATA QUALITY ISSUE ...\n")
display(train_df[train_df.seq_id.isin(bad_seqids)].head(10))

print("\n... EXAMPLES OF 10 ROWS WHERE pH & tm HAVE BEEN SWAPPED ERRONEOUSLY ...\n")
display(train_df[train_df.pH>14.0].head(10))
### OR ###
# display(train_df[train_df.seq_id.isin(list(seqid_2_phtm_update_map.keys()))].head(10))


... EXAMPLES OF 10 ROWS WITH DATA QUALITY ISSUE ...



Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm
69,69,ADLEDNWETLNDNLKVIEKADNAAQVKDALTKARAAALDAQKATPP...,5.0,,25.0
70,70,ADLEDNWETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPP...,5.0,,25.0
71,71,ADLEDNWETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPP...,5.0,,25.0
72,72,ADLEDNWETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPP...,5.0,,25.0
73,73,ADLEDNWETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPP...,5.0,,25.0
74,74,ADLEDNWETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPP...,5.0,,25.0
75,75,ADLEDNWETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPP...,5.0,,25.0
76,76,ADLEDNWETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPP...,5.0,,25.0
77,77,ADLEDNWETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPP...,5.0,,25.0
78,78,ADLEDNWETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPP...,5.0,,25.0



... EXAMPLES OF 10 ROWS WHERE pH & tm HAVE BEEN SWAPPED ERRONEOUSLY ...



Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm
973,973,DTSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESVLPT...,48.4,,7.0
986,986,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVTFQNRESVLPT...,48.4,,7.0
988,988,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFANRESVLPT...,49.0,,7.0
989,989,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFANRESVLPT...,55.6,,5.5
1003,1003,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESTLPT...,48.4,,7.0
1012,1012,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESVLPT...,48.4,,7.0
1014,1014,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESVLPT...,55.6,,5.5
1018,1018,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESVLPT...,49.0,,7.0
1037,1037,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESVLPT...,49.0,,7.0
1042,1042,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESVLPT...,48.4,,7.0


<br>

**Do the Fixing**

In [4]:
# Drop the rows flagged for removal (the ones that are all NaN in the updates file)
train_df = train_df[~train_df["seq_id"].isin(bad_seqids)].reset_index(drop=True)

# Correct tm-->pH swap
def fix_tm_ph(_row, update_map):
    update_vals = update_map.get(_row["seq_id"], None)
    if update_vals is not None:
        _row["tm"] = update_vals["tm"]
        _row["pH"] = update_vals["pH"]
    return _row
train_df = train_df.apply(lambda x: fix_tm_ph(x, seqid_2_phtm_update_map), axis=1)

print("\n... WE CAN'T CHECK FOR THE BAD DATA ROWS BUT WE CAN CHECK FOR BROKEN pH/tm VALUES ...\n")
display(train_df[train_df.pH>14.0].head(10)) # This should yield an empty dataframe

# Save to disk
train_df.to_csv("updated_train.csv", index=False)


... WE CAN'T CHECK FOR THE BAD DATA ROWS BUT WE CAN CHECK FOR BROKEN pH/tm VALUES ...



Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm


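The dropped rows can't be recognised by their values, but their `seq_id`s can still be checked. A minimal sketch of such a sanity check follows; the function name `check_fixed_train_df`, the `max_ph` parameter, and the toy frames are assumptions for illustration, not part of the competition data.

```python
import numpy as np
import pandas as pd

def check_fixed_train_df(fixed_df, updates_df, max_ph=14.0):
    """Raise ValueError if the fixed frame still shows a known problem."""
    # seq_ids flagged for removal are the ones with NaN pH in the updates frame.
    dropped_ids = updates_df.loc[updates_df["pH"].isna(), "seq_id"]
    if fixed_df["seq_id"].isin(dropped_ids).any():
        raise ValueError("rows flagged for removal are still present")
    # A pH above 14 means a pH/tm swap was not corrected.
    if (fixed_df["pH"] > max_ph).any():
        raise ValueError("pH values above the allowed maximum are still present")
    return True

# Toy frames (illustrative values only).
toy_updates_df = pd.DataFrame({"seq_id": [69, 973],
                               "pH": [np.nan, 7.0],
                               "tm": [np.nan, 48.4]})
toy_fixed_df = pd.DataFrame({"seq_id": [0, 973], "pH": [7.0, 7.0], "tm": [75.7, 48.4]})
toy_broken_df = pd.DataFrame({"seq_id": [0, 69, 973],
                              "pH": [7.0, 5.0, 48.4],
                              "tm": [75.7, 25.0, 7.0]})

print(check_fixed_train_df(toy_fixed_df, toy_updates_df))
```

Calling it on `toy_broken_df` raises, because seq_id 69 is still present.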
<br>

<b>Wrap this all up in a function so we can just get on with our lives...</b>

In [5]:
# Will take 3-5 seconds to run
def load_fixed_train_df(original_train_file_path="/kaggle/input/novozymes-enzyme-stability-prediction/train.csv",
                        update_file_path="/kaggle/input/novozymes-enzyme-stability-prediction/train_updates_20220929.csv",
                        was_fixed_col=False):
    def _fix_tm_ph(_row, update_map):
        update_vals = update_map.get(_row["seq_id"], None)
        if update_vals is not None:
            _row["tm"] = update_vals["tm"]
            _row["pH"] = update_vals["pH"]
        return _row
    
    # Load dataframes
    _df = pd.read_csv(original_train_file_path)
    _updates_df = pd.read_csv(update_file_path)

    # Identify which sequence ids need to have the tm and pH values changed and create a dictionary mapping 
    seqid_2_phtm_update_map = _updates_df[~pd.isna(_updates_df["pH"])].groupby("seq_id")[["pH", "tm"]].first().to_dict("index")

    # Identify the sequence ids that will be dropped due to data quality issues
    bad_seqids = _updates_df[pd.isna(_updates_df["pH"])]["seq_id"].to_list()
    
    # Drop the bad sequence ids
    _df = _df[~_df["seq_id"].isin(bad_seqids)].reset_index(drop=True)
    
    # Fix pH and tm swaparoo
    _df = _df.apply(lambda x: _fix_tm_ph(x, seqid_2_phtm_update_map), axis=1)

    # Add a bool column to track which rows were corrected. The bad rows have already been
    # dropped above, so in practice only the rows with a pH/tm swap end up flagged as True.
    if was_fixed_col: _df["was_fixed"] = _df["seq_id"].isin(bad_seqids+list(seqid_2_phtm_update_map.keys()))
    
    return _df


display(load_fixed_train_df())
display(load_fixed_train_df(was_fixed_col=True))


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm
0,0,AAAAKAAALALLGEAPEVVDIWLPAGWRQPFRVFRLERKGDGVLVG...,7.0,doi.org/10.1038/s41592-020-0801-4,75.7
1,1,AAADGEPLHNEEERAGAGQVGRSLPQESEEQRTGSRPRRRRDLGSR...,7.0,doi.org/10.1038/s41592-020-0801-4,50.5
2,2,AAAFSTPRATSYRILSSAGSGSTRADAPQVRRLHTTRDLLAKDYYA...,7.0,doi.org/10.1038/s41592-020-0801-4,40.5
3,3,AAASGLRTAIPAQPLRHLLQPAPRPCLRPFGLLSVRAGSARRSGLL...,7.0,doi.org/10.1038/s41592-020-0801-4,47.2
4,4,AAATKSGPRRQSQGASVRTFTPFYFLVEPVDTLSVRGSSVILNCSA...,7.0,doi.org/10.1038/s41592-020-0801-4,49.5
...,...,...,...,...,...
28976,31385,YYMYSGGGSALAAGGGGAGRKGDWNDIDSIKKKDLHHSRGDEKAQG...,7.0,doi.org/10.1038/s41592-020-0801-4,51.8
28977,31386,YYNDQHRLSSYSVETAMFLSWERAIVKPGAMFKKAVIGFNCNVDLI...,7.0,doi.org/10.1038/s41592-020-0801-4,37.2
28978,31387,YYQRTLGAELLYKISFGEMPKSAQDSAENCPSGMQFPDTAIAHANV...,7.0,doi.org/10.1038/s41592-020-0801-4,64.6
28979,31388,YYSFSDNITTVFLSRQAIDDDHSLSLGTISDVVESENGVVAADDAR...,7.0,doi.org/10.1038/s41592-020-0801-4,50.7


Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,was_fixed
0,0,AAAAKAAALALLGEAPEVVDIWLPAGWRQPFRVFRLERKGDGVLVG...,7.0,doi.org/10.1038/s41592-020-0801-4,75.7,False
1,1,AAADGEPLHNEEERAGAGQVGRSLPQESEEQRTGSRPRRRRDLGSR...,7.0,doi.org/10.1038/s41592-020-0801-4,50.5,False
2,2,AAAFSTPRATSYRILSSAGSGSTRADAPQVRRLHTTRDLLAKDYYA...,7.0,doi.org/10.1038/s41592-020-0801-4,40.5,False
3,3,AAASGLRTAIPAQPLRHLLQPAPRPCLRPFGLLSVRAGSARRSGLL...,7.0,doi.org/10.1038/s41592-020-0801-4,47.2,False
4,4,AAATKSGPRRQSQGASVRTFTPFYFLVEPVDTLSVRGSSVILNCSA...,7.0,doi.org/10.1038/s41592-020-0801-4,49.5,False
...,...,...,...,...,...,...
28976,31385,YYMYSGGGSALAAGGGGAGRKGDWNDIDSIKKKDLHHSRGDEKAQG...,7.0,doi.org/10.1038/s41592-020-0801-4,51.8,False
28977,31386,YYNDQHRLSSYSVETAMFLSWERAIVKPGAMFKKAVIGFNCNVDLI...,7.0,doi.org/10.1038/s41592-020-0801-4,37.2,False
28978,31387,YYQRTLGAELLYKISFGEMPKSAQDSAENCPSGMQFPDTAIAHANV...,7.0,doi.org/10.1038/s41592-020-0801-4,64.6,False
28979,31388,YYSFSDNITTVFLSRQAIDDDHSLSLGTISDVVESENGVVAADDAR...,7.0,doi.org/10.1038/s41592-020-0801-4,50.7,False


In [6]:
_fixed_df = load_fixed_train_df(was_fixed_col=True)
_fixed_df[_fixed_df.was_fixed==True]

Unnamed: 0,seq_id,protein_sequence,pH,data_source,tm,was_fixed
948,973,DTSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESVLPT...,7.0,,48.4,True
959,986,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVTFQNRESVLPT...,7.0,,48.4,True
961,988,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFANRESVLPT...,7.0,,49.0,True
962,989,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFANRESVLPT...,5.5,,55.6,True
974,1003,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESTLPT...,7.0,,48.4,True
979,1012,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESVLPT...,7.0,,48.4,True
981,1014,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESVLPT...,5.5,,55.6,True
985,1018,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESVLPT...,7.0,,49.0,True
999,1037,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESVLPT...,7.0,,49.0,True
1004,1042,DVSGTVCLSALPPEATDTLNLIASDGPFPYSQDGVVFQNRESVLPT...,7.0,,48.4,True

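The row-wise `apply` in `load_fixed_train_df` is likely where most of the runtime goes. A vectorized alternative is sketched below, assuming `pH` and `tm` are float columns and that every correction row has both values present. The name `apply_train_updates` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def apply_train_updates(train_df, updates_df):
    """Drop flagged rows and overwrite pH/tm using vectorized operations."""
    is_drop = updates_df["pH"].isna()
    bad_seqids = updates_df.loc[is_drop, "seq_id"]
    # One (pH, tm) pair per seq_id, indexed by seq_id.
    fixes = updates_df.loc[~is_drop].groupby("seq_id")[["pH", "tm"]].first()

    out = train_df[~train_df["seq_id"].isin(bad_seqids)].reset_index(drop=True)
    for col in ["pH", "tm"]:
        # map() yields NaN for seq_ids without a correction, so fall back
        # to the original value in those rows.
        out[col] = out["seq_id"].map(fixes[col]).fillna(out[col])
    return out

# Toy frames (illustrative values only).
toy_train_df = pd.DataFrame({"seq_id": [0, 69, 973, 989],
                             "pH": [7.0, 5.0, 48.4, 55.6],
                             "tm": [75.7, 25.0, 7.0, 5.5]})
toy_updates_df = pd.DataFrame({"seq_id": [69, 973, 989],
                               "pH": [np.nan, 7.0, 5.5],
                               "tm": [np.nan, 48.4, 55.6]})

fixed_toy_df = apply_train_updates(toy_train_df, toy_updates_df)
print(fixed_toy_df)
```

Unlike the row-wise version, this leaves the column dtypes untouched, since each column is assigned as a whole.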