# Naive baseline

The naive baseline is the official baseline of the CAFA challenge. The idea is that if a functional annotation is frequent in the training data, it can be assigned to any other protein with a corresponding degree of confidence.

For each annotation term, its frequency in the training set is computed and used as the probability that a protein has that term:
$$
S(G_i , P_j) = \frac{N_{G_i}}{N_D}
$$
This formula indicates that the score for assigning GO term $G_i$ to protein $P_j$ equals the number of appearances of $G_i$ in the training set, $N_{G_i}$, divided by the number of proteins in the dataset, $N_D$. The score does not depend on $P_j$.

Each annotation term is assigned to every protein with the same score. Although this method is very basic, it shows how well you could perform with no knowledge about a protein.
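As a minimal sketch of the formula, the toy table below (made-up protein IDs, not the challenge data) counts how many proteins carry each term and divides by the number of distinct proteins. The names `toy_df`, `n_proteins`, `term_counts` and `naive_scores` are assumptions for illustration.

```python
import pandas as pd

# Toy annotations (hypothetical data): 4 proteins, one row per (protein, term) pair
toy_df = pd.DataFrame({
    'Protein_ID': ['P1', 'P1', 'P2', 'P2', 'P3', 'P4'],
    'GO_term': ['GO:0003674', 'GO:0005488', 'GO:0003674',
                'GO:0005488', 'GO:0003674', 'GO:0003674'],
})

n_proteins = toy_df['Protein_ID'].nunique()                       # N_D
term_counts = toy_df.groupby('GO_term')['Protein_ID'].nunique()   # N_{G_i}
naive_scores = (term_counts / n_proteins).to_dict()

print(naive_scores)
# {'GO:0003674': 1.0, 'GO:0005488': 0.5}
```

Every protein would receive these same two scores, whatever its sequence.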

### Loading the training data

In [1]:
import pandas as pd

In [2]:
train_df = pd.read_csv('../../data/students/train/train_set.tsv',sep='\t')
train_df.head()

Unnamed: 0,Protein_ID,aspect,GO_term
0,P91124,cellular_component,GO:0005575
1,P91124,cellular_component,GO:0110165
2,P91124,cellular_component,GO:0005737
3,P91124,cellular_component,GO:0005622
4,P91124,cellular_component,GO:0043226


### Calculating frequencies from the training set

In [3]:
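def get_frequency(aspect, df):
    """Return {GO term: relative frequency} for one aspect, most frequent term first."""
    temp_df = df[df['aspect'] == aspect]
    # value_counts() sorts the terms by count in descending order
    value_counts = temp_df['GO_term'].value_counts()
    # Normalise by the count of the most frequent term; this matches N_D in the
    # formula only if that term (typically the aspect root) annotates every protein
    maximum = value_counts.max()
    value_counts = value_counts.to_dict()
    frequency_values = {k: round(v / maximum, 3) for k, v in value_counts.items()}
    return frequency_values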

In [4]:
mfo_freq = get_frequency('molecular_function',train_df)
bpo_freq = get_frequency('biological_process',train_df)
cco_freq = get_frequency('cellular_component',train_df)
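`get_frequency` divides by the count of the most frequent term rather than by the number of proteins. The sketch below, on made-up data, checks when the two normalisations agree: they coincide if one term (here standing in for the aspect root) annotates every protein. `toy_df`, `by_max` and `by_proteins` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Toy annotations (hypothetical data): GO:0003674 is present on all three proteins
toy_df = pd.DataFrame({
    'Protein_ID': ['P1', 'P1', 'P2', 'P3', 'P3'],
    'aspect': ['molecular_function'] * 5,
    'GO_term': ['GO:0003674', 'GO:0005488', 'GO:0003674',
                'GO:0003674', 'GO:0005488'],
})

counts = toy_df['GO_term'].value_counts()

# Normalisation used in get_frequency: divide by the largest count
by_max = (counts / counts.max()).round(3)

# Normalisation in the formula: divide by the number of distinct proteins
by_proteins = (counts / toy_df['Protein_ID'].nunique()).round(3)

# True here, because the most frequent term annotates every toy protein
print(by_max.equals(by_proteins))
```

If the training annotations were not propagated up to the root term, the two results could differ, so this is worth checking on the real data.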

### Loading the test data

In [5]:
test_df = pd.read_csv('../../data/students/test/test_ids.txt',header = None)
test_df.columns = ['Protein_ID']
test_df.head()

Unnamed: 0,Protein_ID
0,O43747
1,Q969H0
2,Q9JMA2
3,P18065
4,A0A8I6AN32


### Assigning the training frequencies to the test data

In [6]:
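def get_go_terms():
    # The frequency dicts are ordered by descending count, so each slice
    # keeps the 500 most frequent terms of that aspect
    mfo_terms = list(mfo_freq.items())[:500]
    bpo_terms = list(bpo_freq.items())[:500]
    cco_terms = list(cco_freq.items())[:500]
    all_terms = mfo_terms + bpo_terms + cco_terms
    return all_terms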

In [7]:
# Every test protein gets the same list of (GO term, score) pairs
test_df['Predicted_terms'] = [get_go_terms()] * len(test_df)
test_df

Unnamed: 0,Protein_ID,Predicted_terms
0,O43747,"[(GO:0003674, 1.0), (GO:0005488, 0.66), (GO:00..."
1,Q969H0,"[(GO:0003674, 1.0), (GO:0005488, 0.66), (GO:00..."
2,Q9JMA2,"[(GO:0003674, 1.0), (GO:0005488, 0.66), (GO:00..."
3,P18065,"[(GO:0003674, 1.0), (GO:0005488, 0.66), (GO:00..."
4,A0A8I6AN32,"[(GO:0003674, 1.0), (GO:0005488, 0.66), (GO:00..."
...,...,...
995,P9WPA7,"[(GO:0003674, 1.0), (GO:0005488, 0.66), (GO:00..."
996,P13504,"[(GO:0003674, 1.0), (GO:0005488, 0.66), (GO:00..."
997,P70062,"[(GO:0003674, 1.0), (GO:0005488, 0.66), (GO:00..."
998,Q80TN5,"[(GO:0003674, 1.0), (GO:0005488, 0.66), (GO:00..."


In [8]:
test_df_expanded = test_df.explode('Predicted_terms')
test_df_expanded

Unnamed: 0,Protein_ID,Predicted_terms
0,O43747,"(GO:0003674, 1.0)"
0,O43747,"(GO:0005488, 0.66)"
0,O43747,"(GO:0005515, 0.507)"
0,O43747,"(GO:0003824, 0.447)"
0,O43747,"(GO:0097159, 0.221)"
...,...,...
999,Q9V2V6,"(GO:0030660, 0.001)"
999,Q9V2V6,"(GO:1990752, 0.001)"
999,Q9V2V6,"(GO:0001917, 0.001)"
999,Q9V2V6,"(GO:0000407, 0.001)"


In [9]:
test_df_expanded[['GO_ID', 'score']] = pd.DataFrame(test_df_expanded['Predicted_terms'].tolist(), index=test_df_expanded.index)
test_df_expanded

Unnamed: 0,Protein_ID,Predicted_terms,GO_ID,score
0,O43747,"(GO:0003674, 1.0)",GO:0003674,1.000
0,O43747,"(GO:0005488, 0.66)",GO:0005488,0.660
0,O43747,"(GO:0005515, 0.507)",GO:0005515,0.507
0,O43747,"(GO:0003824, 0.447)",GO:0003824,0.447
0,O43747,"(GO:0097159, 0.221)",GO:0097159,0.221
...,...,...,...,...
999,Q9V2V6,"(GO:0030660, 0.001)",GO:0030660,0.001
999,Q9V2V6,"(GO:1990752, 0.001)",GO:1990752,0.001
999,Q9V2V6,"(GO:0001917, 0.001)",GO:0001917,0.001
999,Q9V2V6,"(GO:0000407, 0.001)",GO:0000407,0.001


In [10]:
test_df_expanded = test_df_expanded.drop('Predicted_terms', axis=1)
test_df_expanded

Unnamed: 0,Protein_ID,GO_ID,score
0,O43747,GO:0003674,1.000
0,O43747,GO:0005488,0.660
0,O43747,GO:0005515,0.507
0,O43747,GO:0003824,0.447
0,O43747,GO:0097159,0.221
...,...,...,...
999,Q9V2V6,GO:0030660,0.001
999,Q9V2V6,GO:1990752,0.001
999,Q9V2V6,GO:0001917,0.001
999,Q9V2V6,GO:0000407,0.001
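
The list-multiply, `explode` and column-split steps can also be written as a single cross join. This is a sketch on toy inputs, assuming pandas 1.2 or later (where `merge(how='cross')` is available); `toy_test`, `toy_freq`, `terms_df` and `predictions` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins (assumptions) for test_df and one of the frequency dicts
toy_test = pd.DataFrame({'Protein_ID': ['O43747', 'Q969H0']})
toy_freq = {'GO:0003674': 1.0, 'GO:0005488': 0.66}

# One row per term, then pair every protein with every term
terms_df = pd.DataFrame(list(toy_freq.items()), columns=['GO_ID', 'score'])
predictions = toy_test.merge(terms_df, how='cross')

print(predictions)
```

The result has one row per (protein, term) pair with the columns `Protein_ID`, `GO_ID` and `score`, which is the same long format produced by the cells above.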


In [17]:
test_df_expanded.to_csv('../preds/naive_submission.tsv',header=None, index=False,sep='\t')
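
Before evaluating, it can be worth reading the file back to confirm the layout written by `to_csv`: three tab-separated columns (protein ID, GO term, score) and no header row. The sketch below writes a toy table to a temporary file rather than touching `../preds/`; `toy_preds`, `check` and the column names passed to `read_csv` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy predictions (hypothetical data) in the same long format as test_df_expanded
toy_preds = pd.DataFrame({
    'Protein_ID': ['O43747', 'O43747'],
    'GO_ID': ['GO:0003674', 'GO:0005488'],
    'score': [1.0, 0.66],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'naive_submission.tsv')
    toy_preds.to_csv(path, header=False, index=False, sep='\t')
    # Read the file back without a header row and name the columns ourselves
    check = pd.read_csv(path, sep='\t', header=None,
                        names=['Protein_ID', 'GO_ID', 'score'])

# Three columns, and every score lies in [0, 1]
print(check.shape[1] == 3 and check['score'].between(0, 1).all())
```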

# Evaluation

In [19]:
%pip install cafaeval

Note: you may need to restart the kernel to use updated packages.


In [20]:
from cafaeval.evaluation import cafa_eval, write_results
# Arguments: ontology file, folder containing the prediction files, ground-truth file
res = cafa_eval("../../data/students/train/go-basic.obo", "../preds/", "../../data/outputs/ground_truth.tsv")
write_results(*res)