# Kaggle Challenge - where do proteins localise?  
Understanding where proteins localise is essential for uncovering their biological function and role in disease.  

### Challenge details and description  
This competition (https://www.kaggle.com/competitions/bbinf-26-subcell) challenges participants to build a machine learning model that predicts the subcellular localisation of metazoan proteins based on their features. Points to note:

- Proteins can be in more than one location (a multi-label problem)
- Some of the proteins are natural sequences, and some are engineered proteins
- Imbalanced data: some compartments have many examples (like cytoplasm), while others have fewer (like peroxisome)
- Protein localisation depends on many subtle factors: amino acid (AA) sequence motifs, signal peptides, post-translational modifications, and 3D structure. Capturing all of these from sequence alone is difficult
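
To make the multi-label and imbalance points concrete, here is a minimal sketch on a toy label table. The column names follow the competition format, but the values are invented for illustration and are not taken from `train.csv`.

```python
import pandas as pd

# Toy multi-label table (made-up values; only the column names follow the competition format)
labels = pd.DataFrame(
    {
        "cytoplasm":     [1, 1, 0, 1, 0],
        "nucleus":       [1, 0, 0, 1, 0],
        "mitochondrion": [0, 0, 1, 0, 0],
    }
)

# Positive examples per compartment: a quick view of class imbalance
positives_per_label = labels.sum(axis=0)

# Compartments per protein: values above 1 mark multi-label proteins
labels_per_protein = labels.sum(axis=1)

print(positives_per_label)
print((labels_per_protein > 1).sum(), "proteins have more than one location")
```

Running the same two sums on the real label columns should show how uneven the compartments are and how many proteins carry several labels.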

### Dataset Description
The data are provided in `.csv` format and comprise the following files:  
- `train.csv` - training set
- `test.csv` - test set  
- `sample_submission.csv` - example of a submission in the correct format  
- `metaData.csv` - supplementary information about the data  

### Submission and Evaluation  
For each protein in the test set, the submission file must contain one line with the protein ID followed by a 1 or 0 for each localisation, depending on whether that localisation is predicted. Example submission file:  

```
Id,cytoplasm,nucleus,extracellular,cell_surface,mitochondrion,endom
5,0,0,0,0,0,0
9,1,0,0,0,0,0
14,0,0,0,0,0,1
15,0,0,0,0,0,0
17,1,0,0,0,0,0
18,1,1,0,0,0,0
```
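
A file in this format could be assembled with pandas roughly as follows. This is a sketch only: the ids and predictions are placeholders, and the output file name mentioned in the comment is an assumption.

```python
import io

import numpy as np
import pandas as pd

label_cols = ["cytoplasm", "nucleus", "extracellular",
              "cell_surface", "mitochondrion", "endom"]

# Placeholder ids and predictions (hypothetical; a real run would use the
# ids from test.csv and the 0/1 outputs of a trained model)
ids = [5, 9, 14]
preds = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
])

submission = pd.DataFrame(preds, columns=label_cols)
submission.insert(0, "Id", ids)

# Written to an in-memory buffer here; pass a path such as "submission.csv" instead
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing `index=False` matters, since otherwise pandas writes its row index as an extra leading column.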

Submissions are evaluated using the macro-averaged F1-score.
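
As a rough illustration of macro averaging, the sketch below scores toy predictions with scikit-learn's `f1_score`. The arrays are made up, and it is an assumption that the competition computes the metric the same way (one F1 per compartment, then an unweighted mean).

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy ground truth and predictions for 4 proteins and 3 compartments (made-up values)
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 0]])

# One F1 value per compartment, then their unweighted mean
per_label = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(per_label)
print(macro)
```

Because every compartment carries the same weight in the mean, a rare compartment that is never predicted pulls the score down as much as a common one would.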


# Challenge

In [None]:
import pandas as pd
import numpy as np
import sklearn  
import h5py
import os
from sklearn import tree  
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification


In [None]:
# data setup 
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")


print(f'Train DF length: {len(df_train)}')
print(f'Test/validation DF length: {len(df_test)}')
df_train.head()


Train DF length: 16077
Test/validation DF length: 4377


Unnamed: 0,Id,acc,partition,cytoplasm,nucleus,extracellular,cell_surface,mitochondrion,endom,sequence,...,aa_frac_M,aa_frac_N,aa_frac_P,aa_frac_Q,aa_frac_R,aa_frac_S,aa_frac_T,aa_frac_V,aa_frac_W,aa_frac_Y
0,0,P61966,0,0,0,0,0,0,1,MMRFMLLFSRQGKLRLQKWYLATSDKERKKMVRELMQVVLARKPKM...,...,0.051,0.013,0.013,0.044,0.063,0.057,0.019,0.063,0.013,0.044
1,1,Q9VTK2,0,0,0,0,0,0,1,MSATYTNTITQRRKTAKVRQQQQHQWTGSDLSGESNERLHFRSRST...,...,0.028,0.032,0.044,0.043,0.068,0.08,0.063,0.059,0.025,0.038
2,2,O95858,3,0,0,0,1,0,1,MPRGDSEQVRYCARFSYLWLKFSLIIYSTVFWLIGALVLSVGIYAE...,...,0.034,0.041,0.031,0.027,0.044,0.044,0.051,0.078,0.014,0.058
3,3,Q9WUX5,0,1,0,0,0,0,1,MGRSLTCPFGISPACGAQASWSIFGVGTAEVPGTHSHSNQAAAMPH...,...,0.023,0.036,0.089,0.051,0.05,0.117,0.044,0.058,0.008,0.011
4,4,Q9NQC3-3,1,0,0,0,0,0,1,MDGQKKNWKDKVVDLLYWRDIKKTGVVFGASLFLLLSLTVFSIVSV...,...,0.015,0.03,0.015,0.035,0.035,0.07,0.035,0.101,0.015,0.04


In [None]:
%%bash
# the `!` prefix is IPython shell syntax; inside a %%bash cell, call gdown directly
gdown 18XlzbtfEwqbmFZJOq3FXN-ir4vmg-uRO

In [None]:
# count the partitions in the training data and test data
print(df_train["partition"].unique())
print(df_test["partition"].unique())

# training uses partitions 0-3 and the test set only partition 4,
# so there is no overlap between the two to handle


[0 3 1 2]
[4]
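
The four training partitions could also serve as cross-validation folds. The sketch below uses scikit-learn's `LeaveOneGroupOut` on toy arrays, assuming the partitions were designed as predefined folds (the data description shown here does not confirm this).

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy stand-ins: 8 samples spread over the four training partitions (0-3)
X_toy = np.arange(16).reshape(8, 2)
partition = np.array([0, 0, 1, 1, 2, 2, 3, 3])

splitter = LeaveOneGroupOut()
folds = []
for train_idx, val_idx in splitter.split(X_toy, groups=partition):
    # Each fold holds out exactly one partition for validation
    held_out = np.unique(partition[val_idx])
    folds.append((train_idx, val_idx))
    print(f"validate on partition {held_out}, train on {len(train_idx)} samples")
```

With the real data, `df_train["partition"]` would take the place of `partition`.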


## Split the features
We need separate partitions for the training and test sets, as well as separate target and feature columns.  

I'll start with the targets as the `cytoplasm`, `nucleus`, `extracellular`, `cell_surface`, `mitochondrion`, and `endom` columns, and the features simply as the sequence:

In [None]:
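# split features and target

# features = protein sequence
features_train = df_train["sequence"]
print(features_train)
features_test = df_test["sequence"]
print(features_test)

# target = the six localisation columns, selected by name
# (iloc[:, 3:9] would include `partition` and leave out `endom`)
label_cols = ["cytoplasm", "nucleus", "extracellular",
              "cell_surface", "mitochondrion", "endom"]
target_train = df_train[label_cols]
print(target_train)
# test targets come from df_test (None if test.csv has no label columns)
target_test = df_test[label_cols] if set(label_cols) <= set(df_test.columns) else None
print(target_test)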


0        MMRFMLLFSRQGKLRLQKWYLATSDKERKKMVRELMQVVLARKPKM...
1        MSATYTNTITQRRKTAKVRQQQQHQWTGSDLSGESNERLHFRSRST...
2        MPRGDSEQVRYCARFSYLWLKFSLIIYSTVFWLIGALVLSVGIYAE...
3        MGRSLTCPFGISPACGAQASWSIFGVGTAEVPGTHSHSNQAAAMPH...
4        MDGQKKNWKDKVVDLLYWRDIKKTGVVFGASLFLLLSLTVFSIVSV...
                               ...                        
16072    MPVKFHTKTLESVIDPVAQQVGQLVLFHEQAESGLLKEDLTPLVQG...
16073    MKIFSESHKTVFVVDHCPYMAESCRQHVEFDMLVKNRTQGIIPLAP...
16074    MSQDRKPIVGSFHFVCALALIVGSMTPFSNELESMVDYSNRNLTHV...
16075    MREYKVVVLGSGGVGKSALTVQFVTGTFIEKYDPTIEDFYRKEIEV...
16076    MSNCWCFIFCKERVRSNSSSPQHDGTSREEADHQVDVSDGIRLVPD...
Name: sequence, Length: 16077, dtype: str
0       MSVSALSSTRFTGSISGFLQVASVLGLLLLLVKAVQFYLQRQWLLK...
1       MSVRRRTHSDDFSYLLEKTRRPSKLNVVQEDPKSAPPQGYSLTTVI...
2       MDGLRQRVEHFLEQRNLVTEVLGALEAKTGVEKRYLAAGAVTLLSL...
3       MDGLRQRFERFLEQKNVATEALGALEARTGVEKRYLAAGALALLGL...
4       MMSIKAFTLVSAVERELLMGDKERVNIECVECCGRDLYVGTNDCFV...
                   

In [None]:
# collect every distinct amino acid letter seen in the training sequences:
amino_acids = []

for seq in features_train:
    for aa in seq:
        if aa not in amino_acids:
            amino_acids.append(aa)

print(amino_acids)
print(len(amino_acids))


['M', 'R', 'F', 'L', 'S', 'Q', 'G', 'K', 'W', 'Y', 'A', 'T', 'D', 'E', 'V', 'P', 'C', 'I', 'N', 'H', 'U', 'X']
22
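
The 22 letters are the 20 standard amino acids plus `U` (selenocysteine) and `X` (an unknown residue). The sketch below shows one way to turn a sequence into per-residue fractions. It is an assumption that the `aa_frac_*` columns in `train.csv` were computed along these lines; their exact definition (rounding, handling of `U` and `X`) is not given here, and `aa_fractions` is a hypothetical helper name.

```python
from collections import Counter

STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def aa_fractions(seq):
    """Return the fraction of each standard amino acid in `seq`.

    Non-standard letters such as 'U' or 'X' count towards the length
    but get no fraction of their own (a design choice for this sketch).
    """
    counts = Counter(seq)
    length = len(seq)
    if length == 0:
        return {aa: 0.0 for aa in STANDARD_AA}
    return {aa: counts.get(aa, 0) / length for aa in STANDARD_AA}


fractions = aa_fractions("MKKLAX")
print(fractions["K"])
```

Since `X` contributes to the length but not to any fraction, the fractions of a sequence containing it sum to less than 1.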


### At this point I generated an .h5 file of embeddings from the protein sequences in train.csv (see Colab notebook)

### Reading in the .h5 embeddings file

In [None]:
# store the embeddings in a dictionary
def getEmbeddings(filename):
    embeddings_dict = {}

    with h5py.File(filename, 'r') as f:
        # Iterate through all keys (protein accessions) in the HDF5 file
        for accession in f.keys():
            if accession != 'metadata':
                # Load the embedding for each accession
                embeddings_dict[accession] = f[accession][:]

    return embeddings_dict


filename = "for_embed_prot_t5.h5"
embeddings_dict = getEmbeddings(filename)

if embeddings_dict:
    # Get the dimension of the embeddings from the first item
    embedding_dim = next(iter(embeddings_dict.values())).shape[0]
else:
    print("embeddings_dict is empty. Please check the loading process.")
    embedding_dim = 0 # Or handle this error appropriately

X = np.stack(df_train["acc"].apply(
    lambda acc: embeddings_dict.get(acc, np.zeros(embedding_dim)) # Use np.zeros for missing embeddings
))

print(X.shape)
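
Since accessions without an embedding are silently replaced by zero vectors, it may be worth counting them before training. The sketch below does this on toy data; `toy_embeddings` and `toy_acc` are hypothetical stand-ins for `embeddings_dict` and `df_train["acc"]`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for embeddings_dict and df_train["acc"] (hypothetical values)
toy_embeddings = {
    "P61966": np.ones(4),
    "O95858": np.full(4, 2.0),
}
toy_acc = pd.Series(["P61966", "Q9VTK2", "O95858"])

embedding_dim = next(iter(toy_embeddings.values())).shape[0]

# Boolean mask of accessions that have an embedding
has_embedding = toy_acc.isin(list(toy_embeddings.keys()))
n_missing = int((~has_embedding).sum())
print(f"{n_missing} of {len(toy_acc)} accessions have no embedding")

# Same zero-fill strategy as above, built from a plain list of arrays
X_toy = np.stack([
    toy_embeddings.get(acc, np.zeros(embedding_dim)) for acc in toy_acc
])
print(X_toy.shape)
```

A large number of missing accessions would suggest that the keys in the `.h5` file do not match the `acc` column exactly.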
