# TKO_7092 Evaluation of Machine Learning Methods 2025

---

Student name: Arttu Kuitunen

Student number: 1500155

Student email: arsaku@utu.fi

---

## Exercise 3

Complete the tasks given to you in the letter below. In your submission, explain clearly, precisely, and comprehensively why the cross-validation described in the letter failed, what is the correct way to perform cross-validation in the given scenario, and why the correct cross-validation method will give a reliable estimate of the generalisation performance. Then implement the correct cross-validation for the scenario and report its results.

Remember to follow all the general exercise guidelines that are stated in Moodle. Full points (2p) will be given for a submission that demonstrates a deep understanding of cross-validation on pair-input data and implements the requested cross-validation correctly (incl. reporting the results). Partial points (1p) will be given if there are small error(s) but the overall approach is correct. No points will be given if there are significant error(s).

The deadline of this exercise is **Wednesday 19 February 2025 at 11:59 PM**. Please contact Juho Heimonen (juaheim@utu.fi) if you have any questions about this exercise.

---


Dear Data Scientist,

I have a long-term research project regarding a specific set of proteins. I am attempting to discover small organic compounds that can bind strongly to these proteins and thus act as drugs. I have already made laboratory experiments to measure the affinities between some proteins and drug molecules.

My colleague is working on another set of proteins, and the objectives of his project are similar to mine. He has recently discovered thousands of new potential drug molecules. He asked me if I could find the pairs that have the strongest affinities among his proteins and drug molecules. Obviously I do not have the resources to measure all the possible pairs in my laboratory, so I need to prioritise. I decided to do this with the help of machine learning, but I have encountered a problem.

Here is what I have done so far: First I trained a K-nearest neighbours regressor with the parameter value K=10 using all the 400 measurements I had already made in the laboratory with my proteins and drug molecules. They comprise 77 target proteins and 59 drug molecules. Then I performed a leave-one-out cross-validation with this same data to estimate the generalisation performance of the model. I used C-index and got a stellar score above 90%. Finally I used the model to predict the affinities of my colleague's proteins and drug molecules. The problem is: when I selected the highest predicted affinities and tried to verify them in the lab, I found that many of them are much lower in reality. My model clearly does not work despite the high cross-validation score.

Please explain why my estimation failed and how leave-one-out cross-validation should be performed to get a reliable estimate. Also, implement the correct leave-one-out cross-validation and report its results. I need to know whether it would be a waste of my resources if I were to use my model any further.

The data I used to create my model is available in the files `input.data`, `output.data` and `pairs.data` for you to use. The first file contains the features of the pairs, whereas the second contains their affinities. The third file contains the identifiers of the drug and target molecules of which the pairs are composed. The files are paired, i.e. the i<sup>*th*</sup> row in each file is about the same pair.

Looking forward to hearing from you soon.

Yours sincerely, \
Bio Scientist

---

#### Answer the questions about cross-validation on pair-input data

In [45]:
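# Why did the estimation described in the letter fail?

## The model was evaluated with ordinary leave-one-out cross-validation, which treats
## the 400 pairs as independent observations. They are not: each pair consists of a drug
## and a protein, and the same drug or protein can occur in many pairs. When one pair is
## left out, the training set can still contain pairs that share its drug or its protein,
## so information about the test pair leaks into training (data leakage).

## For pair-input data there are four out-of-sample settings:
## A: both the drug and the protein of the test pair also occur in the training data,
## B and C: only one of the two (the drug or the protein) occurs in the training data,
## D: neither the drug nor the protein occurs in the training data.
## Because the pairs cannot be assumed independent, as single observations could be,
## the evaluation has to match the setting in which the model will be used.

## Ordinary leave-one-out estimates performance in setting A. The model was then applied
## to the colleague's proteins and drug molecules, which are all new to it, i.e. setting D.
## The score above 90% therefore answers a different question than the one that matters
## and is too optimistic for the intended use.


# How should leave-one-out cross-validation be performed in the given scenario and why?

## Each pair is still left out in turn, but before training every pair that shares the
## drug or the protein with the test pair is also removed from the training set.
## The model then never sees the test drug or the test protein, which matches the
## situation it faces with the colleague's data, so the resulting C-index estimates
## generalisation to new drugs and new proteins (setting D).

## Cross-validation could additionally be used to choose K (only K=10 is tested here),
## but that is a separate question from the failure described in the letter.

# Remember to provide comprehensive and precise arguments.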
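
The filtering rule for setting D can be illustrated on a handful of made-up pairs. This is a minimal sketch, not the implementation used for the reported results: the identifiers in `toy_pairs` and the helper `setting_d_training_indices` are hypothetical and exist only for this example.

```python
# Toy (drug, protein) identifiers; hypothetical, not taken from pairs.data.
toy_pairs = [
    ("D1", "T1"),
    ("D1", "T2"),
    ("D2", "T1"),
    ("D2", "T3"),
    ("D3", "T3"),
]


def setting_d_training_indices(pairs, test_index):
    """Return indices of pairs that share neither drug nor protein with the test pair."""
    test_drug, test_protein = pairs[test_index]
    return [
        i
        for i, (drug, protein) in enumerate(pairs)
        if i != test_index and drug != test_drug and protein != test_protein
    ]


# Leave each pair out in turn and show which pairs remain usable for training.
for test_index, pair in enumerate(toy_pairs):
    kept = setting_d_training_indices(toy_pairs, test_index)
    print(pair, "-> train on", [toy_pairs[i] for i in kept])
```

With `("D1", "T1")` as the test pair, only `("D2", "T3")` and `("D3", "T3")` remain, because the other two pairs share either the drug or the protein.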

#### Import libraries

In [46]:
# Import the libraries you need.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import LeaveOneOut

#### Write utility functions

In [47]:
# Write the utility functions you need in your analysis.
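def cindex(y, yp):
    """Calculate the concordance index (C-index) of predictions yp against true values y."""
    n = 0  # number of comparable pairs (different true values)
    h_num = 0  # concordant pairs; tied predictions count as 0.5
    for i in range(len(y)):
        t = y[i]
        p = yp[i]
        for j in range(i+1, len(y)):
            nt = y[j]
            n_p = yp[j]  # not named `np`, to avoid shadowing numpy
            if t != nt:
                n += 1
                if (p < n_p and t < nt) or (p > n_p and t > nt):
                    h_num += 1
                elif p == n_p:
                    h_num += 0.5
    return h_num/n if n > 0 else 0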

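def get_modified_training_sets(X, pairs_data, train_idx, test_idx):
    """
    Build the leave-one-out training sets for settings A, B, C and D.

    A keeps all other pairs, B drops pairs that share the test drug, C drops pairs
    that share the test protein, and D drops pairs that share either one.
    """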
    # Get test pair information
    test_protein = pairs_data.iloc[test_idx]['protein'].values[0]
    test_drug = pairs_data.iloc[test_idx]['drug'].values[0]
    
    # Create masks for each scenario
    mask_B = pairs_data.iloc[train_idx]['drug'] != test_drug
    mask_C = pairs_data.iloc[train_idx]['protein'] != test_protein
    mask_D = mask_B & mask_C
    
    # Get indices for each scenario
    train_A = train_idx  # All training data
    train_B = train_idx[mask_B]  # Remove pairs with test drug
    train_C = train_idx[mask_C]  # Remove pairs with test protein
    train_D = train_idx[mask_D]  # Remove pairs with either test protein or drug
    
    # Create training sets
    training_sets = {
        'A': X[train_A],
        'B': X[train_B],
        'C': X[train_C],
        'D': X[train_D]
    }
    
    indices = {
        'A': train_A,
        'B': train_B,
        'C': train_C,
        'D': train_D
    }
    
    # Count samples in each set
    counts = {
        'A': len(train_A),
        'B': len(train_B),
        'C': len(train_C),
        'D': len(train_D)
    }
    
    return training_sets, indices, counts
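
A small worked example may help when reading the C-index values later on. The sketch below is a compact restatement of the same pairwise rule under the name `cindex_sketch` (a hypothetical name, used only here), applied to three made-up observations.

```python
from itertools import combinations


def cindex_sketch(y_true, y_pred):
    """Fraction of comparable pairs ranked in the right order (prediction ties count 0.5)."""
    score = 0.0
    comparable = 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # pairs with equal true values are not comparable
        comparable += 1
        if p_i == p_j:
            score += 0.5  # tied predictions get half credit
        elif (p_i < p_j) == (t_i < t_j):
            score += 1.0  # predictions ordered the same way as the true values
    return score / comparable if comparable else 0.0


# Made-up values: two of the three comparable pairs are ordered correctly.
toy_true = [1.0, 2.0, 3.0]
toy_pred = [0.1, 0.3, 0.2]
print(cindex_sketch(toy_true, toy_pred))
```

A perfect ranking gives 1.0, a fully reversed ranking gives 0.0, and predictions that are all identical give 0.5, which is why 0.5 is the reference point for an uninformative model.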



#### Load datasets

In [48]:
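# Read the data files (input.data, output.data, pairs.data).
input_data = pd.read_csv('input.data', header=None, sep=' ')
output_data = pd.read_csv('output.data', header=None, sep=' ')
# The column order below is an assumption. The letter lists the drug before the target,
# so check the identifiers in the file before relying on the 'protein'/'drug' labels.
pairs_data = pd.read_csv('pairs.data', header=None, sep=' ', names=['protein', 'drug'])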

# Convert to numpy arrays
X = input_data.values
y = output_data.values.ravel()
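
Since settings B and C depend on which column holds the drugs and which the proteins, counting the distinct identifiers per column is a cheap sanity check. The sketch below uses made-up rows in place of `pairs.data`; `toy_rows` and `unique_counts` are hypothetical names for illustration only.

```python
# Toy rows standing in for pairs.data; the identifiers are hypothetical.
toy_rows = [
    ("D1", "T1"),
    ("D1", "T2"),
    ("D2", "T1"),
    ("D2", "T3"),
    ("D3", "T3"),
    ("D1", "T4"),
]


def unique_counts(rows):
    """Count the distinct identifiers in the first and in the second column."""
    first_column = {row[0] for row in rows}
    second_column = {row[1] for row in rows}
    return len(first_column), len(second_column)


print(unique_counts(toy_rows))
```

On the real data the same counts are available from `pairs_data.nunique()`. According to the letter, the column with 77 distinct identifiers would hold the proteins and the one with 59 the drugs.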

In [49]:
print(input_data.head())
print(output_data.head())
print(pairs_data.head())

         0         1         2         3         4         5         6   \
0  0.759222  0.709585  0.253151  0.421082  0.727780  0.404487  0.709027   
1  0.034584  0.304720  0.688257  0.296396  0.151878  0.830755  0.270656   
2  0.737867  0.236079  0.905987  0.163612  0.801455  0.789823  0.393999   
3  0.406913  0.607740  0.235365  0.888679  0.150347  0.598991  0.130108   
4  0.697707  0.432565  0.650329  0.886065  0.328660  0.576926  0.523100   

         7         8         9   ...        57        58        59        60  \
0  0.242963  0.407292  0.379971  ...  0.838616  0.165050  0.515334  0.332678   
1  0.705392  0.186120  0.085594  ...  0.472762  0.730013  0.639373  0.445218   
2  0.522067  0.411352  0.781861  ...  0.595468  0.582292  0.836193  0.281514   
3  0.465818  0.799953  0.906878  ...  0.453880  0.311799  0.534668  0.563793   
4  0.080463  0.131349  0.913496  ...  0.583892  0.444141  0.249423  0.110690   

         61        62        63        64        65        66  
0  0

#### Implement and run cross-validation

In [50]:
# Implement and run the requested cross-validation. Report and interpret its results.
predictions = {'A': [], 'B': [], 'C': [], 'D': []}
true_values = {'A': [], 'B': [], 'C': [], 'D': []}
avg_train_sizes = {'A': 0, 'B': 0, 'C': 0, 'D': 0}
total_iterations = 0

# Perform leave-one-out cross-validation
loo = LeaveOneOut()
for train_idx, test_idx in loo.split(X):
    X_test = X[test_idx]
    y_test = y[test_idx]
    total_iterations += 1
    
    # Get modified training sets for each scenario
    training_sets, train_indices, counts = get_modified_training_sets(
        X, pairs_data, train_idx, test_idx)
    
    # Update average training set sizes
    for split_type in ['A', 'B', 'C', 'D']:
        avg_train_sizes[split_type] += counts[split_type]
    
    # For each scenario (A, B, C, D)
    for split_type in ['A', 'B', 'C', 'D']:
        if len(train_indices[split_type]) > 0:
            # Train model with modified training set
            X_train = training_sets[split_type]
            y_train = y[train_indices[split_type]]
            
            # Use min(10, len(y_train)) neighbors to handle small training sets
            k = min(10, len(y_train))
            model = KNeighborsRegressor(n_neighbors=k)
            model.fit(X_train, y_train)
            
            # Make prediction
            pred = model.predict(X_test)[0]
            
            # Store prediction and true value
            predictions[split_type].append(pred)
            true_values[split_type].append(y_test[0])

# Calculate average training set sizes
for split_type in avg_train_sizes:
    avg_train_sizes[split_type] = avg_train_sizes[split_type] / total_iterations

# Print results
print("\nCross-validation results:")
for split_type in ['A', 'B', 'C', 'D']:
    if len(predictions[split_type]) > 1:
        c_idx = cindex(true_values[split_type], predictions[split_type])
        print(f"\nSplit type {split_type}:")
        print(f"C-index = {c_idx:.3f}")
        print(f"Average training set size: {avg_train_sizes[split_type]:.1f}")
        print(f"Number of predictions: {len(predictions[split_type])}")


Cross-validation results:

Split type A:
C-index = 0.830
Average training set size: 399.0
Number of predictions: 400

Split type B:
C-index = 0.830
Average training set size: 393.7
Number of predictions: 400

Split type C:
C-index = 0.520
Average training set size: 392.4
Number of predictions: 400

Split type D:
C-index = 0.522
Average training set size: 387.1
Number of predictions: 400
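
The reported values are easiest to read against 0.5, the C-index expected from a ranking that carries no information. The sketch below only tabulates that margin for the four values printed above; `CHANCE_LEVEL` and `margins` are names introduced here for illustration.

```python
# C-index values copied from the cross-validation output above.
reported_cindex = {"A": 0.830, "B": 0.830, "C": 0.520, "D": 0.522}

# A ranking with no information is expected to score about 0.5.
CHANCE_LEVEL = 0.5

# Margin of each setting over the chance level, rounded to three decimals.
margins = {
    setting: round(value - CHANCE_LEVEL, 3)
    for setting, value in reported_cindex.items()
}

for setting, margin in margins.items():
    print(f"Setting {setting}: C-index {reported_cindex[setting]:.3f}, {margin:+.3f} vs. chance")
```

Setting D, which corresponds to new drugs and new proteins, sits about 0.02 above chance. On this evidence the model would likely be of little help for prioritising the colleague's pairs, even though the setting A estimate looks strong.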
