# Kaggle Challenge - where do proteins localise?  
Understanding where proteins localise is essential for uncovering their biological function and role in disease.  

### Challenge details and description  
This competition (https://www.kaggle.com/competitions/bbinf-26-subcell) challenges participants to build a machine learning model that predicts the subcellular localisation of metazoan proteins from their features. Key characteristics of the problem:

- Proteins can be in more than one location (a multi-label problem)
- Some of the proteins are natural sequences, and some are engineered proteins
- Imbalanced data: some compartments have many examples (like cytoplasm), while others have fewer (like peroxisome)
- Protein localisation depends on many subtle factors: amino acid (AA) sequence motifs, signal peptides, post-translational modifications, and 3D structure. Capturing all of these from sequence alone is difficult
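
The multi-label and imbalance points above can be checked directly on the label columns. A minimal sketch, assuming the six label columns listed in the sample submission header and using a small made-up frame in place of `train.csv` (the values are illustrative, not real data):

```python
import pandas as pd

# Label columns as listed in the sample submission header.
LABELS = ["cytoplasm", "nucleus", "extracellular",
          "cell_surface", "mitochondrion", "endom"]

# Toy stand-in for train.csv (hypothetical values, for illustration only).
toy = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 1],
     [1, 0, 0, 0, 1, 0]],
    columns=LABELS,
)

# Fraction of proteins carrying each label: small values flag rare classes.
label_freq = toy[LABELS].mean().sort_values(ascending=False)

# Number of locations per protein: values above 1 confirm the multi-label setup.
labels_per_protein = toy[LABELS].sum(axis=1)

print(label_freq)
print(labels_per_protein.value_counts().sort_index())
```

On the real data, replacing `toy` with the loaded training frame should give the same two summaries.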

### Dataset Description
The data are provided in `.csv` format as the following files:  
- `train.csv` - training set
- `test.csv` - test set  
- `sample_submission.csv` - example of a submission in the correct format  
- `metaData.csv` - supplementary information about the data  

### Submission and Evaluation  
For each protein in the test set, the submission file must contain one line with the protein ID followed by a 1 or 0 for each localisation, indicating whether that localisation is predicted. Example submission file:  

```
Id,cytoplasm,nucleus,extracellular,cell_surface,mitochondrion,endom
5,0,0,0,0,0,0
9,1,0,0,0,0,0
14,0,0,0,0,0,1
15,0,0,0,0,0,0
17,1,0,0,0,0,0
18,1,1,0,0,0,0
```
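
A file in this format could be assembled roughly as follows. This is only a sketch: `ids` and `preds` are hypothetical placeholders for the test-set IDs and a model's 0/1 predictions, and the output filename in the final comment is an assumption.

```python
import io

import numpy as np
import pandas as pd

# Label columns in the order used by the sample submission header.
LABELS = ["cytoplasm", "nucleus", "extracellular",
          "cell_surface", "mitochondrion", "endom"]

# Hypothetical test IDs and binary predictions
# (one row per protein, one column per label).
ids = [5, 9, 14]
preds = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
])

submission = pd.DataFrame(preds, columns=LABELS)
submission.insert(0, "Id", ids)

# Write without the pandas index so the header matches the sample exactly.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# For an actual submission, write to disk instead, e.g.:
# submission.to_csv("submission.csv", index=False)
```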

Submissions are evaluated using the macro-averaged F1-score.
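
Macro averaging computes one F1-score per localisation column and takes their unweighted mean, so a rare compartment counts as much as a common one. A minimal sketch with scikit-learn's `f1_score` on made-up labels; whether the competition scorer treats edge cases (such as a column with no predicted positives) in the same way is an assumption:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy ground truth and predictions for 4 proteins and 3 labels
# (illustrative values only, not competition data).
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1],
                   [1, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 0, 1]])

# One F1-score per label column.
per_label = f1_score(y_true, y_pred, average=None, zero_division=0)

# Macro F1: the plain mean of the per-label scores.
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(per_label, macro)
```

Here the second label is never predicted, so its F1-score is 0 and it pulls the macro average down as strongly as any other label would.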


# Challenge

```python
import pandas as pd
import sklearn
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
```

```python
# data setup
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")

df_train.head()
```

| Unnamed: 0 | Id | acc | partition | cytoplasm | nucleus | extracellular | cell_surface | mitochondrion | endom | sequence | ... | aa_frac_M | aa_frac_N | aa_frac_P | aa_frac_Q | aa_frac_R | aa_frac_S | aa_frac_T | aa_frac_V | aa_frac_W | aa_frac_Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | P61966 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | MMRFMLLFSRQGKLRLQKWYLATSDKERKKMVRELMQVVLARKPKM... | ... | 0.051 | 0.013 | 0.013 | 0.044 | 0.063 | 0.057 | 0.019 | 0.063 | 0.013 | 0.044 |
| 1 | 1 | Q9VTK2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | MSATYTNTITQRRKTAKVRQQQQHQWTGSDLSGESNERLHFRSRST... | ... | 0.028 | 0.032 | 0.044 | 0.043 | 0.068 | 0.08 | 0.063 | 0.059 | 0.025 | 0.038 |
| 2 | 2 | O95858 | 3 | 0 | 0 | 0 | 1 | 0 | 1 | MPRGDSEQVRYCARFSYLWLKFSLIIYSTVFWLIGALVLSVGIYAE... | ... | 0.034 | 0.041 | 0.031 | 0.027 | 0.044 | 0.044 | 0.051 | 0.078 | 0.014 | 0.058 |
| 3 | 3 | Q9WUX5 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | MGRSLTCPFGISPACGAQASWSIFGVGTAEVPGTHSHSNQAAAMPH... | ... | 0.023 | 0.036 | 0.089 | 0.051 | 0.05 | 0.117 | 0.044 | 0.058 | 0.008 | 0.011 |
| 4 | 4 | Q9NQC3-3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | MDGQKKNWKDKVVDLLYWRDIKKTGVVFGASLFLLLSLTVFSIVSV... | ... | 0.015 | 0.03 | 0.015 | 0.035 | 0.035 | 0.07 | 0.035 | 0.101 | 0.015 | 0.04 |
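
Given the imports above and the columns in this preview, one possible baseline is a decision tree trained on the `aa_frac_*` columns and scored with macro F1 on a held-out split. The sketch below is an assumption about how such a baseline might look, not the notebook's actual method. It runs on a synthetic stand-in frame with random values, and it assumes one `aa_frac_*` column per standard amino acid (the preview elides some columns).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

LABELS = ["cytoplasm", "nucleus", "extracellular",
          "cell_surface", "mitochondrion", "endom"]

# Synthetic stand-in for df_train: random features and random 0/1 labels.
# Assumption: one aa_frac_* column per standard amino acid.
rng = np.random.default_rng(0)
n = 200
feature_cols = [f"aa_frac_{aa}" for aa in "ACDEFGHIKLMNPQRSTVWY"]
toy = pd.DataFrame(rng.random((n, len(feature_cols))), columns=feature_cols)
for label in LABELS:
    toy[label] = rng.integers(0, 2, size=n)

# Select features by the aa_frac_ prefix; targets are the six label columns.
X = toy.filter(like="aa_frac_")
y = toy[LABELS]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# DecisionTreeClassifier accepts a 2D 0/1 target and fits all labels jointly.
model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(X_train, y_train)

# Score with the competition metric rather than accuracy.
val_pred = model.predict(X_val)
val_f1 = f1_score(y_val, val_pred, average="macro", zero_division=0)
print(f"validation macro F1: {val_f1:.3f}")
```

With the real data, `toy` would be replaced by `df_train`. Because the labels here are random, the printed score says nothing about real performance.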
