# Exercise: Cross-Validation with Symmetric Pair-Input Data

This exercise consists of two tasks. The first task is compulsory: you will not get the right to take the exam if you fail the first task. The second task is optional: you do not have to complete it, but a successful completion will give you an extra point in the exam.

In both tasks, use the K-nearest neighbors classifier with K=1 and Euclidean distance for learning and the concordance index for evaluation. You are encouraged to re-use your own code from the previous exercises. Use the data files `pairs.data`, `features.data`, and `labels.data` that are available in Moodle. The descriptions of these files are provided in the exercise overview, which is also available in Moodle.

Follow the general exercise guidelines of the course (listed in Moodle). Particularly,

- Describe and implement your solution directly in this Jupyter notebook file.
- Remember to describe your solution in general and add detailed comments to the critical parts of your code.
- Remember to justify your design choices and discuss your results.
- Your report must be easy to follow and your code must be runnable in Jupyter notebook.

Feel free to use markdown cells and code cells as you see appropriate.

Submit the finished work to Moodle before the **deadline Monday 18th of February 2019 at 23:59**. Late submissions will be ignored.

Submitted By

[Prashant Mahato, 516144, MSc Computer Science, UTU]

## Task 1 (compulsory)

**You must successfully complete this task in order to get the right to take the exam.**

1. Implement the modified leave-one-out cross-validation scheme that is described in the lecture notes.

2. Estimate and report the generalisation performance of the K-nearest neighbor classifier in predicting the functional similarity of proteins. Use both the unmodified and the modified leave-one-out cross-validation.

3. Discuss your results. In particular, answer the following questions:
 - Why do the two cross-validation schemes produce notably different estimates?
 - Which scheme is appropriate for estimating the generalisation performance on which types of pairs (A, B, or C) and why?

                        

Note: I performed all three cross-validations both on normalized data and on unnormalized data. The z-score is used for normalization.
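
As a compact illustration of the z-score normalization mentioned in the note, the following is a minimal vectorized sketch. The helper name `zscore_columns` is hypothetical and not part of the notebook, and mapping every zero-variance column to 0 is an assumption made here to avoid division by zero.

```python
import numpy as np

def zscore_columns(X):
    """Column-wise z-score; zero-variance columns become all zeros (assumption)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0), the np.std default
    safe_std = np.where(std == 0, 1.0, std)  # avoid dividing by zero
    return (X - mean) / safe_std

# Tiny example: the second and third columns are constant
X = np.array([[1.0, 0.0, 5.0],
              [3.0, 0.0, 5.0]])
print(zscore_columns(X))
```

Each non-constant column ends up with mean 0 and standard deviation 1, so no single feature dominates the Euclidean distance.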



In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
import operator


In [2]:
# load the dataset
def load_dataset(data):
    df = pd.read_csv(data, header=None) # read dataset
    return df
    

In [3]:
#run load_dataset function
df = load_dataset('features.data')
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,31,32,33,34,35,36,37,38,39,40
0,94,0,16,51,54,54,74,21,43,61,...,52,26,55,47,225,140,43,30,132,92
1,21,0,6,8,12,5,19,5,6,20,...,6,8,10,7,32,23,11,7,26,19
2,88,0,14,59,58,55,76,24,53,57,...,57,32,56,52,227,146,45,35,135,105
3,52,0,5,18,17,51,47,16,50,15,...,19,10,21,19,184,109,15,10,106,75
4,39,0,15,27,23,51,64,19,65,15,...,25,16,31,24,194,118,16,15,117,106


# Normalizing the Data

In [4]:
# Normalize the data
data_np = df.values  # convert to a NumPy array
data_np = data_np.astype(float)  # cast to float so the z-scores are not truncated

# Apply the z-score to each column (feature); data_np.T[i] is the i-th column.
for i in range(len(data_np.T)):

    # Skip columns whose entries are all zero: their standard deviation is 0,
    # so the z-score would divide by zero.
    # Without this check the C-index is 0.5.
    if not np.all(data_np.T[i] == 0):
        mean = np.mean(data_np.T[i])
        stdv = np.std(data_np.T[i])

        for j in range(len(data_np)):  # apply the z-score to each element of column i
            data_np[j][i] = (data_np[j][i] - mean) / stdv

data_np = pd.DataFrame(data_np)  # convert back to a pandas DataFrame
data_np.head()





Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,31,32,33,34,35,36,37,38,39,40
0,1.800051,0.0,1.669472,1.937529,1.850945,0.820203,1.182565,0.591083,0.133767,2.191697,...,1.897792,1.666761,1.977547,1.811226,0.93606,1.042499,2.033321,1.904285,0.950928,0.86391
1,-1.146221,0.0,-0.500156,-1.15782,-1.056642,-1.507873,-1.086588,-1.537936,-1.739502,-0.204895,...,-1.451704,-1.039945,-1.459044,-1.520167,-1.677808,-1.608468,-0.836361,-0.93009,-1.67331,-1.676895
2,1.557892,0.0,1.235546,2.513408,2.127858,0.867715,1.26508,0.990274,0.640056,1.957883,...,2.261868,2.568996,2.053916,2.22765,0.963147,1.178446,2.212676,2.520454,1.025199,1.316383
3,0.104936,0.0,-0.717119,-0.437971,-0.710501,0.677668,0.068617,-0.074236,0.488169,-0.497162,...,-0.505107,-0.7392,-0.618988,-0.520749,0.380782,0.340106,-0.477651,-0.560389,0.307247,0.272216
4,-0.419743,0.0,1.452509,0.209893,-0.295131,0.677668,0.769992,0.324955,1.247602,-0.497162,...,-0.068216,0.163036,0.144699,-0.104325,0.516216,0.544027,-0.387973,0.055779,0.579574,1.351188


In [5]:
#Features and Target are combined to one 
df_target = pd.read_csv('labels.data', names=['Target'])
data_np_labels = pd.concat([data_np, df_target], axis=1)
data_np_labels.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,Target
0,1.800051,0.0,1.669472,1.937529,1.850945,0.820203,1.182565,0.591083,0.133767,2.191697,...,1.666761,1.977547,1.811226,0.93606,1.042499,2.033321,1.904285,0.950928,0.86391,1
1,-1.146221,0.0,-0.500156,-1.15782,-1.056642,-1.507873,-1.086588,-1.537936,-1.739502,-0.204895,...,-1.039945,-1.459044,-1.520167,-1.677808,-1.608468,-0.836361,-0.93009,-1.67331,-1.676895,1
2,1.557892,0.0,1.235546,2.513408,2.127858,0.867715,1.26508,0.990274,0.640056,1.957883,...,2.568996,2.053916,2.22765,0.963147,1.178446,2.212676,2.520454,1.025199,1.316383,0
3,0.104936,0.0,-0.717119,-0.437971,-0.710501,0.677668,0.068617,-0.074236,0.488169,-0.497162,...,-0.7392,-0.618988,-0.520749,0.380782,0.340106,-0.477651,-0.560389,0.307247,0.272216,1
4,-0.419743,0.0,1.452509,0.209893,-0.295131,0.677668,0.769992,0.324955,1.247602,-0.497162,...,0.163036,0.144699,-0.104325,0.516216,0.544027,-0.387973,0.055779,0.579574,1.351188,1


In [6]:
# read pairs-input in pandas dataframe
pair_data = pd.read_csv('pairs.data', header=None)
pair_data.head()

Unnamed: 0,0,1
0,P0,P9
1,P6,P19
2,P0,P4
3,P5,P17
4,P15,P17


### Own implementation of KNN

Three functions are used to perform KNN classification. The `eculidean_dist()` function calculates the Euclidean distance between two data instances. The second function, `get_neighbours()`, finds the k nearest neighbours of the test instance in the training dataset. The third function, `get_response()`, returns the class that receives the most votes among those k neighbours.
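
For K=1 the three steps collapse into a single nearest-neighbour lookup. The following is a minimal vectorized sketch of that idea, not the implementation used in this notebook. The name `predict_1nn` is hypothetical, and unlike the notebook's functions it assumes features and labels are passed as separate arrays.

```python
import numpy as np

def predict_1nn(train_X, train_y, query):
    """Return the label of the training row closest to `query` (Euclidean distance)."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training row
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # argmin returns the first index in case of ties
    return np.asarray(train_y)[int(np.argmin(dists))]

train_X = [[0.0, 0.0], [10.0, 10.0]]
train_y = [0, 1]
print(predict_1nn(train_X, train_y, [1.0, 1.0]))
print(predict_1nn(train_X, train_y, [9.0, 8.0]))
```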

In [7]:
# Function to calculate the Euclidean distance between two instances
def eculidean_dist(data1, data2):

    distance = 0
    length = len(data2) - 1  # number of features; the last element is the target label

    for x in range(length):
        distance += np.power((data1[x] - data2[x]), 2)
    return np.sqrt(distance)
    


In [8]:
# Get the k nearest neighbours
def get_neighbours(trainingSet, testInstance, k):  # whole training set, test instance, number of neighbours
    distance = []

    # compute the distance from the test instance to every training instance
    for x in range(len(trainingSet)):
        dist = eculidean_dist(trainingSet[x], testInstance)
        distance.append((trainingSet[x], dist))

    # sort by distance, nearest first
    distance.sort(key=operator.itemgetter(1))

    neighbours = []
    for x in range(k):
        # keep only the k nearest training instances
        neighbours.append(distance[x][0])
    return neighbours

In [9]:
# Predict the class by majority vote among the k nearest neighbours
def get_response(neighbors):
    classVotes = {}  # maps each class label to its number of votes

    # count the votes for each class among the k neighbours
    for x in range(len(neighbors)):
        response = neighbors[x][-1]  # target variable (last element)
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1

    # sort the classes by vote count, highest first
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]  # class with the most votes

In [10]:
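# implementation of the C-index
def c_index(true_labels, predictions):
    n = 0      # number of pairs with different true labels
    h_sum = 0  # concordance score accumulated over those pairs

    for i in range(len(true_labels)):
        t = true_labels[i]
        p = predictions[i]

        for j in range(i + 1, len(true_labels)):
            next_t = true_labels[j]
            next_p = predictions[j]  # renamed from `np` so that NumPy is not shadowed

            if t != next_t:
                n += 1

                if (p < next_p and t < next_t) or (p > next_p and t > next_t):
                    h_sum += 1  # pair ranked in the correct order
                elif p == next_p:
                    h_sum += 0.5  # tied predictions count as half

    return h_sum / n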
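
To make the pair counting in the C-index concrete, here is a small worked example. The function `concordance_index` is a hypothetical, self-contained restatement of the same counting rule, written only for this illustration; the labels and predictions are made up.

```python
def concordance_index(true_labels, predictions):
    """Share of pairs with different true labels that are ranked correctly (ties count 0.5)."""
    score, comparable = 0.0, 0
    for i in range(len(true_labels)):
        for j in range(i + 1, len(true_labels)):
            if true_labels[i] == true_labels[j]:
                continue  # pairs with equal true labels are not comparable
            comparable += 1
            label_diff = true_labels[i] - true_labels[j]
            pred_diff = predictions[i] - predictions[j]
            if pred_diff == 0:
                score += 0.5  # tied predictions
            elif label_diff * pred_diff > 0:
                score += 1.0  # same ordering in labels and predictions
    return score / comparable

labels = [1, 0, 1, 0]
preds = [1, 0, 0, 0]
# Comparable pairs: (0,1) correct, (0,3) correct, (1,2) tied, (2,3) tied
print(concordance_index(labels, preds))  # (1 + 1 + 0.5 + 0.5) / 4 = 0.75
```

A value of 0.5 corresponds to random or constant predictions, and 1.0 to a perfect ranking.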

### Leave-one-out cross-validation

This is the standard leave-one-out cross-validation (without modification). Each instance is used once as the test instance, and all other instances form the training set. As in the lecture, we first do the normal CV, in which training pairs may share members with the test pair.
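
The splitting step of standard leave-one-out can be sketched as an index generator. This is a minimal illustration with a hypothetical helper name, `loo_splits`, not the code used in this notebook.

```python
import numpy as np

def loo_splits(n_samples):
    """Yield (train_indices, test_index) for standard leave-one-out."""
    indices = np.arange(n_samples)
    for test_idx in range(n_samples):
        # every index except the current test index
        train_idx = np.delete(indices, test_idx)
        yield train_idx, test_idx

for train_idx, test_idx in loo_splits(3):
    print(test_idx, train_idx.tolist())
```

No pair information is consulted here, so a training pair may contain one or both proteins of the test pair.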

In [11]:
def leave_one_out_cv(dataset, input_pair):
    # dataset: features and labels combined in one matrix
    # input_pair: protein pair of each instance (not used by the standard scheme)

    predicted_label = []
    true_label = []

    for x in range(len(dataset)):
        test_instance = dataset[x]  # features + label of the test instance at index x
        train_data = np.concatenate([dataset[0:x], dataset[1+x:]])  # leave the test instance out
        true_label.append(test_instance[-1])  # true class of the test instance

        # KNN with k=1 to get the nearest neighbour
        neighbours = get_neighbours(train_data, test_instance, 1)
        predict = get_response(neighbours)
        predicted_label.append(predict)

    c_index_val = c_index(true_label, predicted_label)  # call c_index()

    return c_index_val
            
    

In [12]:

c_val= leave_one_out_cv(data_np_labels.values, pair_data.values) 
print('C_index : {0:.4f}'.format(c_val))



C_index : 0.6923


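### Modified LOO CV

Here we remove from the training set every pair that shares at least one member with the test pair, so neither protein of the test pair appears in the training data. (Case C)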
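
The filtering rule can be sketched with set operations. This is a minimal sketch under the assumption that each pair is a two-element sequence of protein identifiers; the helper name `disjoint_train_indices` and the toy pairs are hypothetical.

```python
def disjoint_train_indices(pairs, test_idx):
    """Indices of the pairs that share no protein with pairs[test_idx]."""
    test_members = set(pairs[test_idx])
    return [
        i for i, pair in enumerate(pairs)
        if i != test_idx and test_members.isdisjoint(pair)
    ]

pairs = [("P0", "P9"), ("P6", "P19"), ("P0", "P4"), ("P9", "P6"), ("P5", "P17")]
# Test pair 0 is (P0, P9): pairs 2 and 3 each share a protein with it and are dropped
print(disjoint_train_indices(pairs, 0))  # [1, 4]
```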

In [13]:
def cross_validation(dataset, input_pair):

    # dataset: features + labels in matrix form
    # input_pair: protein pair of each instance, in matrix form

    predicted_class = []
    true_class = []

    for x in range(len(dataset)):  # iterate over test instances

        test_instance = dataset[x]
        train_data = []
        true_class.append(test_instance[-1])  # class of the test instance

        # keep only the training pairs that share no protein with the test pair
        for y in range(len(dataset)):
            if y != x:  # skip the test instance itself
                if input_pair[y][0] != input_pair[x][0] and input_pair[y][0] != input_pair[x][1] and input_pair[y][1] != input_pair[x][0] and input_pair[y][1] != input_pair[x][1]:
                    train_data.append(dataset[y])

        # KNN with k=1 to get the nearest neighbour
        train_dataset = np.array(train_data)
        neighbours = get_neighbours(train_dataset, test_instance, 1)
        predict = get_response(neighbours)
        predicted_class.append(predict)

    c_index_val = c_index(true_class, predicted_class)  # call c_index()

    return c_index_val
            
            
    

In [14]:
# run the cross-validation function


c_val = cross_validation(data_np_labels.values, pair_data.values)  # normalized data
print('C_index with data normalization: {0:.4f}'.format(c_val))


C_index with data normalization: 0.6730


In [15]:
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [16]:
knn=KNeighborsClassifier(n_neighbors=1)

In [17]:
loo = LeaveOneOut()
x = data_np.values
y = df_target.values.ravel()  # 1-D label vector; avoids the column-vector warning from knn.fit

loo_acc = []
true_labels = []
predicted_label = []
for train_index, test_index in loo.split(x):
    #print("TRAIN:", train_index, "TEST:", test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]

    knn.fit(x_train, y_train)
    predict_val = knn.predict(x_test)

    predicted_label.append(predict_val[0])  # store scalars rather than 1-element arrays
    true_labels.append(y_test[0])
ci = c_index(true_labels, predicted_label)
print(ci)

#     accuracy=accuracy_score(predict_val,y_test)
#     loo_acc.append(accuracy)

# print(np.average(loo_acc))
    


0.6923258003766478




### Discussion:
Most of the code was implemented in the earlier exercises and was adapted from machinelearningmastery.com.

I performed two types of CV as instructed. The results are discussed below:

1. In the modified version, we remove from the training data every pair of proteins whose intersection with the test pair is not empty, i.e. every training pair that shares at least one protein with the test pair. That way, we remove the dependencies between test and training data and do not get an overly optimistic result.

2. If we now get a new instance that compares two proteins that are not in the training set, we can still get a good prediction.

Although I expected the standard leave-one-out to give a much more optimistic result, its estimate (0.6923) is only slightly better than that of the modified scheme (0.6730).

I received a little help during the demo class.

## Task 2 (optional)

**Successfully completing this task will give you an extra point in the exam.**

1. Design a leave-one-out cross-validation scheme that is appropriate for estimating the generalisation performance on the type of pairs for which the two aforementioned schemes are not appropriate.

2. Explain why your cross-validation scheme is appropriate.

3. Implement your cross-validation scheme. Estimate and report the generalisation performance as in the first task.

4. Discuss your results. In particular, compare the results to those you obtained in the first task and give reasons for any similarities or differences you observe.

In [31]:
def cross_validation(dataset, input_pair):

    # dataset: features + labels in matrix form
    # input_pair: protein pair of each instance, in matrix form

    predicted_class = []
    true_class = []

    for x in range(len(dataset)):  # iterate over test instances

        test_instance = dataset[x]
        train_data = []   # training pairs that do not contain the 1st member of the test pair
        train_data1 = []  # training pairs that do not contain the 2nd member of the test pair
        # two predictions are made per test instance, so its class is stored twice
        true_class.append(test_instance[-1])
        true_class.append(test_instance[-1])

        # First prediction: drop every training pair that contains the 1st member of the test pair
        for y in range(len(dataset)):
            if y != x:  # skip the test instance itself
                if (input_pair[y][0] != input_pair[x][0] and input_pair[y][1] != input_pair[x][0]):
                    train_data.append(dataset[y])

        # KNN with k=1 to get the nearest neighbour
        train_dataset = np.array(train_data)
        neighbours = get_neighbours(train_dataset, test_instance, 1)
        predict = get_response(neighbours)
        predicted_class.append(predict)

        # Second prediction: drop every training pair that contains the 2nd member of the test pair
        for y in range(len(dataset)):
            if y != x:  # skip the test instance itself
                if (input_pair[y][0] != input_pair[x][1] and input_pair[y][1] != input_pair[x][1]):
                    train_data1.append(dataset[y])

        # KNN with k=1 to get the nearest neighbour
        train_dataset = np.array(train_data1)
        neighbours = get_neighbours(train_dataset, test_instance, 1)
        predict = get_response(neighbours)
        predicted_class.append(predict)

    c_index_val = c_index(true_class, predicted_class)  # call c_index()

    return round(c_index_val, 3)
            

In [32]:
print(cross_validation(data_np_labels.values, pair_data.values))

0.69
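
The scheme in the cell above makes two predictions per test pair, each time treating one of its two proteins as unseen while the other may still appear in the training pairs. The selection of training pairs can be sketched as follows. This is a minimal illustration: the helper name `one_member_held_out_indices` and the toy pairs are hypothetical.

```python
def one_member_held_out_indices(pairs, test_idx, member):
    """Training indices when protein pairs[test_idx][member] is treated as unseen.

    member is 0 or 1 and selects which protein of the test pair is held out.
    """
    held_out = pairs[test_idx][member]
    return [
        i for i, pair in enumerate(pairs)
        if i != test_idx and held_out not in tuple(pair)
    ]

pairs = [("P0", "P9"), ("P6", "P19"), ("P0", "P4"), ("P9", "P6"), ("P5", "P17")]
print(one_member_held_out_indices(pairs, 0, 0))  # P0 held out -> [1, 3, 4]
print(one_member_held_out_indices(pairs, 0, 1))  # P9 held out -> [1, 2, 4]
```

Compared with the modified scheme of Task 1, training pairs that contain only the other protein of the test pair are kept.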
