# Kaggle Challenge - where do proteins localise?  
Understanding where proteins localise is essential for uncovering their biological function and role in disease.  

### Challenge details and description  
This competition (https://www.kaggle.com/competitions/bbinf-26-subcell) challenges participants to build a machine learning model that predicts the subcellular localisation of metazoan proteins based on their features. Key characteristics of the problem:

- Proteins can be in more than one location (a multi-label problem)
- Some of the proteins are natural sequences and some are engineered proteins
- Imbalanced data: some compartments have many examples (like cytoplasm), while others have fewer (like peroxisome)
- Protein localisation depends on many subtle factors: AA sequence motifs, signal peptides, post-translational modifications, and 3D structure. Capturing all of these from sequence alone is difficult

### Dataset Description
The data are provided in `.csv` format as the following files:  
- `train.csv` - training set
- `test.csv` - test set  
- `sample_submission.csv` - example of a submission in the correct format  
- `metaData.csv` - supplementary information about the data  

### Submission and Evaluation  
For each protein in the test set, the submission file must contain a line with the protein ID followed by a 1 or 0 for each localisation, depending on whether that localisation is predicted or not. Example submission file:  

```
Id,cytoplasm,nucleus,extracellular,cell_surface,mitochondrion,endom
5,0,0,0,0,0,0
9,1,0,0,0,0,0
14,0,0,0,0,0,1
15,0,0,0,0,0,0
17,1,0,0,0,0,0
18,1,1,0,0,0,0
```

Submissions are evaluated using the macro-averaged F1-score.
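
As a rough illustration of the metric, macro-averaged F1 computes an F1-score for each localisation column separately and then takes the unweighted mean, so rare compartments count as much as common ones. The sketch below uses plain NumPy on made-up labels. The function name `macro_f1` is hypothetical, and scoring a column with no true or predicted positives as 0 is an assumption about how that edge case is handled. `sklearn.metrics.f1_score` with `average="macro"` should give the same value on 0/1 label matrices.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Return (macro F1, per-column F1) for 0/1 label matrices of equal shape."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = (y_true & y_pred).sum(axis=0)    # predicted 1, truly 1
    fp = (~y_true & y_pred).sum(axis=0)   # predicted 1, truly 0
    fn = (y_true & ~y_pred).sum(axis=0)   # predicted 0, truly 1
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); assumed to be 0 when the denominator is 0
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return f1.mean(), f1

# made-up labels: 4 proteins, 3 localisation columns
y_true = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0]])

score, per_label = macro_f1(y_true, y_pred)
print("per-label F1:", per_label)
print("macro F1:", score)
```

Looking at the per-label values alongside the mean shows which compartments pull the score down.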


# Challenge

In [1]:
import pandas as pd
import numpy as np
import sklearn  
import h5py
import os
from sklearn import tree  
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification


## Data setup

In [2]:
# data setup 
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")


print(f'Train DF length: {len(df_train)}')
print(f'Test/validation DF length: {len(df_test)}')
df_train.head()


Train DF length: 16077
Test/validation DF length: 4377


Unnamed: 0,Id,acc,partition,cytoplasm,nucleus,extracellular,cell_surface,mitochondrion,endom,sequence,...,aa_frac_M,aa_frac_N,aa_frac_P,aa_frac_Q,aa_frac_R,aa_frac_S,aa_frac_T,aa_frac_V,aa_frac_W,aa_frac_Y
0,0,P61966,0,0,0,0,0,0,1,MMRFMLLFSRQGKLRLQKWYLATSDKERKKMVRELMQVVLARKPKM...,...,0.051,0.013,0.013,0.044,0.063,0.057,0.019,0.063,0.013,0.044
1,1,Q9VTK2,0,0,0,0,0,0,1,MSATYTNTITQRRKTAKVRQQQQHQWTGSDLSGESNERLHFRSRST...,...,0.028,0.032,0.044,0.043,0.068,0.08,0.063,0.059,0.025,0.038
2,2,O95858,3,0,0,0,1,0,1,MPRGDSEQVRYCARFSYLWLKFSLIIYSTVFWLIGALVLSVGIYAE...,...,0.034,0.041,0.031,0.027,0.044,0.044,0.051,0.078,0.014,0.058
3,3,Q9WUX5,0,1,0,0,0,0,1,MGRSLTCPFGISPACGAQASWSIFGVGTAEVPGTHSHSNQAAAMPH...,...,0.023,0.036,0.089,0.051,0.05,0.117,0.044,0.058,0.008,0.011
4,4,Q9NQC3-3,1,0,0,0,0,0,1,MDGQKKNWKDKVVDLLYWRDIKKTGVVFGASLFLLLSLTVFSIVSV...,...,0.015,0.03,0.015,0.035,0.035,0.07,0.035,0.101,0.015,0.04


In [7]:
# count the partitions in the training data and test data
print(df_train["partition"].unique())
print(df_test["partition"].unique())

# the training partitions (0-3) and the test partition (4) do not overlap,
# so no extra partition handling is needed to keep train and test apart

[0 3 1 2]
[4]
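
Since the training data already carries partitions 0 to 3, one option is to reuse them as cross-validation folds instead of splitting at random. The sketch below runs on a toy frame. `partition_folds` is a hypothetical helper, and whether the partitions group similar sequences together (which would make such a split more informative) is an assumption, as the source does not say how they were built.

```python
import numpy as np
import pandas as pd

def partition_folds(df, col="partition"):
    """Yield (held_out, train_idx, val_idx), using each partition once as validation."""
    positions = np.arange(len(df))        # positional row indices
    parts = df[col].to_numpy()
    for held_out in sorted(np.unique(parts)):
        val_mask = parts == held_out
        yield int(held_out), positions[~val_mask], positions[val_mask]

# toy frame standing in for df_train
toy = pd.DataFrame({"Id": range(8), "partition": [0, 3, 1, 2, 0, 1, 2, 3]})

folds = list(partition_folds(toy))
for held_out, train_idx, val_idx in folds:
    print(f"partition {held_out}: {len(train_idx)} train rows, {len(val_idx)} validation rows")
```

The returned indices are positional, so they can be used with `df.iloc[...]` or to slice a NumPy feature matrix whose rows follow the same order as the data frame.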


## Checking for repeats in data

In [20]:
# Get the counts of each ID
id_counts = df_train["Id"].value_counts()

# checking for any duplicate ids:
df_train["Id"].value_counts().head(50)

# How many IDs are duplicated (appear more than once)
num_duplicated_ids = (id_counts > 1).sum()
print("Number of duplicated IDs:", num_duplicated_ids)

# group by Id and count unique sequences per ID
seq_per_id = df_train.groupby("Id")["sequence"].nunique()

# how many Ids have >1 unique sequence?
num_ids_with_multiple_sequences = (seq_per_id > 1).sum()
print("IDs with multiple sequences:", num_ids_with_multiple_sequences)

Number of duplicated IDs: 2425
IDs with multiple sequences: 0


In [66]:
# remove any occurrence of the same Id after the first
df_train = df_train.drop_duplicates(subset="Id", keep="first")
df_test = df_test.drop_duplicates(subset="Id", keep="first")

# sanity check for any duplicate ids:
df_test["Id"].value_counts().tail(9)

Id
17854    1
17855    1
17856    1
17857    1
17858    1
17859    1
17860    1
17861    1
17862    1
Name: count, dtype: int64
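
Because `value_counts()` sorts by count in descending order, its tail only shows the least frequent Ids, so it cannot reveal duplicates. A more direct check, sketched here on a toy frame with made-up values, is to ask pandas whether the column is unique once the duplicates have been dropped.

```python
import pandas as pd

# toy frame standing in for df_train / df_test
toy = pd.DataFrame({
    "Id": [5, 9, 9, 14, 14, 14],
    "sequence": ["MA", "MK", "MK", "MS", "MS", "MS"],
})

# rows whose Id has already appeared earlier in the frame
n_repeats = int(toy["Id"].duplicated().sum())
print("Rows that repeat an earlier Id:", n_repeats)

deduped = toy.drop_duplicates(subset="Id", keep="first")

# fails loudly if any Id still appears more than once
assert deduped["Id"].is_unique
print("Rows kept:", len(deduped))
```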

In [65]:
# write this out to a new .csv for embedding on a GPU in Google Colab
df_train.to_csv('train_trimmed.csv')
df_test.to_csv('test_trimmed.csv')

## Protein embedding
At this point I made an embedded `.h5` file from the protein sequences in `train_trimmed.csv` (see the Colab notebook).   

The embedding was run on a T4 GPU and stores every sequence as a fixed-length vector in the `.h5` file.
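
The Colab notebook is not reproduced here, so the exact pooling step is not shown. A common way to turn a per-residue embedding of shape `(sequence_length, 1024)` into one fixed-length vector is to average over the residues. The sketch below illustrates that idea, with random numbers standing in for real model output. Mean pooling and the function name `pool_embedding` are assumptions, not a description of what the notebook did.

```python
import numpy as np

def pool_embedding(per_residue):
    """Average a (sequence_length, dim) array over residues into a (dim,) vector."""
    per_residue = np.asarray(per_residue, dtype=np.float32)
    return per_residue.mean(axis=0)

rng = np.random.default_rng(0)
short_protein = rng.normal(size=(50, 1024))    # stand-in for a 50-residue protein
long_protein = rng.normal(size=(700, 1024))    # stand-in for a 700-residue protein

pooled = [pool_embedding(e) for e in (short_protein, long_protein)]
print([p.shape for p in pooled])
```

Both proteins end up with a vector of the same length regardless of sequence length, which is what lets them be stacked into a single feature matrix.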

### TRAIN .h5 embedded file
Import the train embeddings from the `.h5` file and link them to `df_train`.

In [59]:
# store the embeddings in a dictionary
def getEmbeddings(filename):
    """Load every embedding in an HDF5 file into a dict keyed by dataset name."""
    embeddings_dict = {}

    with h5py.File(filename, 'r') as f:
        # Iterate through all keys (protein accessions) in the HDF5 file
        for accession in f.keys():
            # skip the 'metadata' entry, which is not an embedding
            if accession != 'metadata':
                # Load the embedding for each accession
                embeddings_dict[accession] = f[accession][:]

    return embeddings_dict


filename="train_protT5_half_2048aa.h5"
embeddings_dict = getEmbeddings(filename)

len(embeddings_dict)


13398

In [60]:
# double checking to make sure embedding h5 file length = full df length
len(embeddings_dict) == len(df_train)

True

In [61]:
if embeddings_dict:
    # get the dimension of the embeddings from the first item
    embedding_dim = next(iter(embeddings_dict.values())).shape[0]
else:
    print("embeddings_dict is empty. Please check the loading process.")
    embedding_dim = 0

# stack the embeddings into one matrix, one row per protein, in df_train order
X = np.stack(
    df_train["Id"].astype(str).apply(lambda idx: embeddings_dict[idx])
)

print(X.shape)

(13398, 1024)


In [62]:
# sanity checks
print("X shape:", X.shape)
print("Any non-zero values?", np.any(X != 0))
print("Zero rows:", np.all(X == 0, axis=1).sum())


X shape: (13398, 1024)
Any non-zero values? True
Zero rows: 0
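
Matching lengths do not guarantee that every Id has an embedding, since one missing key and one extra key would cancel out. A stricter check compares the two sets of keys directly. The sketch below uses a toy dictionary in place of `embeddings_dict` and a hypothetical helper `check_embedding_coverage`.

```python
import numpy as np
import pandas as pd

def check_embedding_coverage(ids, embeddings):
    """Return (Ids without an embedding, embedding keys without an Id), both as strings."""
    id_keys = set(map(str, ids))
    emb_keys = set(embeddings)
    return sorted(id_keys - emb_keys), sorted(emb_keys - id_keys)

# toy data: same number of Ids and embeddings, but one mismatch on each side
toy_df = pd.DataFrame({"Id": [5, 9, 14]})
toy_embeddings = {"5": np.ones(4), "9": np.ones(4), "15": np.ones(4)}

missing, unused = check_embedding_coverage(toy_df["Id"], toy_embeddings)
print("Ids without an embedding:", missing)
print("Embeddings without an Id:", unused)
```

Both lists being empty means the lookup by Id cannot raise a `KeyError` and no embedding is silently ignored.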


### TEST .h5 embedded file
Repeat the steps above for the test set.

In [None]:
filename_test="test_protT5_half_2048aa.h5"
test_embeddings_dict = getEmbeddings(filename_test)

len(test_embeddings_dict)

## Split the features
The training and test sets need to be kept separate, and each needs its target and feature columns.  

To start, the targets are the `cytoplasm`, `nucleus`, `extracellular`, `cell_surface`, `mitochondrion`, and `endom` columns, and the features are simply the sequence.
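
One way to pick out these target columns, sketched on a toy frame with made-up values, is to select them by name rather than by position. This avoids picking up a neighbouring column if the column order changes, for example when an extra index column is present. The same frame can also report how often each label occurs and how many proteins carry more than one label, which relates to the imbalance and multi-label points in the challenge description.

```python
import pandas as pd

LABEL_COLS = ["cytoplasm", "nucleus", "extracellular",
              "cell_surface", "mitochondrion", "endom"]

# toy frame standing in for df_train
toy = pd.DataFrame({
    "Id": [0, 1, 2, 3],
    "partition": [0, 0, 3, 0],
    "cytoplasm": [0, 0, 0, 1],
    "nucleus": [0, 0, 0, 0],
    "extracellular": [0, 0, 0, 0],
    "cell_surface": [0, 0, 1, 0],
    "mitochondrion": [0, 0, 0, 0],
    "endom": [1, 1, 1, 1],
    "sequence": ["MMRF", "MSAT", "MPRG", "MGRS"],
})

Y = toy[LABEL_COLS].to_numpy()              # (n_proteins, n_labels) 0/1 matrix
label_fraction = toy[LABEL_COLS].mean()     # share of proteins carrying each label
n_multi = int((toy[LABEL_COLS].sum(axis=1) > 1).sum())

print(label_fraction)
print("Proteins with more than one label:", n_multi)
```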

In [58]:
# split features and target 

# features = protein sequence
features_train = df_train["sequence"]
print(features_train)
features_test = df_test["sequence"]
print(features_test)

0        MMRFMLLFSRQGKLRLQKWYLATSDKERKKMVRELMQVVLARKPKM...
1        MSATYTNTITQRRKTAKVRQQQQHQWTGSDLSGESNERLHFRSRST...
2        MPRGDSEQVRYCARFSYLWLKFSLIIYSTVFWLIGALVLSVGIYAE...
3        MGRSLTCPFGISPACGAQASWSIFGVGTAEVPGTHSHSNQAAAMPH...
4        MDGQKKNWKDKVVDLLYWRDIKKTGVVFGASLFLLLSLTVFSIVSV...
                               ...                        
13393    MESLPARLFPGLSIKIQRSNGLIHSANISTVNVEKSCVSVEWIEGG...
13394    MESLRGYTHSDIGYRSLAVGEDIEEVNDEKLTVTSLMARGGEDEEN...
13395    MESLVDGDGFPDLEEDEDIDQFNDDTFGAGAVDDDWREEHERLAEM...
13396    MESNFNQEGVPRPSYVFSADPIARPSEINFDGIKLDLSHEFSLVAP...
13397    MESKALLVLTLAVWLQSLTASRGGVAAADQRRDFIDIESKFALRTP...
Name: sequence, Length: 13398, dtype: str
0       MSVSALSSTRFTGSISGFLQVASVLGLLLLLVKAVQFYLQRQWLLK...
1       MSVRRRTHSDDFSYLLEKTRRPSKLNVVQEDPKSAPPQGYSLTTVI...
2       MDGLRQRVEHFLEQRNLVTEVLGALEAKTGVEKRYLAAGAVTLLSL...
3       MDGLRQRFERFLEQKNVATEALGALEARTGVEKRYLAAGALALLGL...
4       MMSIKAFTLVSAVERELLMGDKERVNIECVECCGRDLYVGTNDCFV...
                   

In [57]:
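# target = the six localisation columns of the training data frame, selected by name
label_cols = ["cytoplasm", "nucleus", "extracellular", "cell_surface", "mitochondrion", "endom"]
target_train = df_train[label_cols]
print(target_train)
# NOTE: this is also taken from df_train, so it duplicates target_train;
# if test.csv contains the label columns, df_test[label_cols] is what is wanted here
target_test = df_train[label_cols]
print(target_test)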


       cytoplasm  nucleus  extracellular  cell_surface  mitochondrion  endom
0              0        0              0             0              0      1
1              0        0              0             0              0      1
2              0        0              0             1              0      1
3              1        0              0             0              0      1
4              0        0              0             0              0      1
...          ...      ...            ...           ...            ...    ...
13393          1        1              0             0              0      0
13394          1        1              0             0              0      0
13395          1        0              0             0              0      0
13396          1        1              0             0              0      0
13397          0        0              1             0              0      0

[13398 rows x 6 columns]
       cytoplasm  nucleus  extracellular  cell_sur