<a href="https://colab.research.google.com/github/abhik-99/MFSGC/blob/master/(Train)_5_fold_Cross_Validation_on_MFSGC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Multi-filtering Supervised Gene Clustering.

**(Train) (Using 5-fold Cross Validation) (Using Representational Genes generated from top 800 Genes)**
##Current Status:-
Completed
**Please Run the Dataset Splitter before running this Notebook.** Provide the 
*DATASET_train.csv* generated from the Dataset Splitter Notebook as the 
input to this Notebook at cell 4 as an upload.\
\
*If you already have Gene Representatives from a previous iteration, you can load them and use them here. Loading can be done using the last two cells of this notebook.

In [12]:
!pip install -U -q PyDrive
!pip install skfeature-chappers

Collecting skfeature-chappers
  Downloading https://files.pythonhosted.org/packages/e6/45/19bb801eb3b4a892534ab86468ad0669a68ff63578610f90051190e3622f/skfeature-chappers-1.0.3.tar.gz
Building wheels for collected packages: skfeature-chappers
  Building wheel for skfeature-chappers (setup.py) ... [?25l[?25hdone
  Created wheel for skfeature-chappers: filename=skfeature_chappers-1.0.3-py2.py3-none-any.whl size=59510 sha256=195357e067c684126ad7049ee555405dfb0d70f4333890767215d2a124c54e61
  Stored in directory: /root/.cache/pip/wheels/ac/61/bf/1b3a8c232a0072409508c2ec4c12f316e95681ae72ba7315d2
Successfully built skfeature-chappers
Installing collected packages: skfeature-chappers
Successfully installed skfeature-chappers-1.0.3


In [13]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
from google.colab import files

import json

import pandas as pd
import numpy as np

from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut

from scipy.sparse import *

In [33]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Creating Filter Methods for Scoring and filtering top rated genes
The Filter Methods chosen for evaluation are:-

1. Mutual Information.
2. ReliefF.
3. Chi Sq.
4. Fisher Score.
5. Signal To Noise Ratio (adapted for multi-class datasets).
6. T-Test.
7. Pearson Corelation Coefficient.

In [1]:
#construction of ReliefF function

"""
Given a dataset, number of random instances to pick form the dataset and
number of features to consider in each iteration (k), the function returns the weigths of the attributes
of the dataset.
These weigths can then be used as the final results out of the ReliefF algorithm

Paper-

Marko Robnik-ˇSikonja and Igor Kononenko. Theoretical and empirical analysis of relieff
and rrelieff. Machine learning, 53(1-2):23–69, 2003.

"""

def hit_miss_calculator(target,instance,k = 10, hit = True, c = None, ):
    m=len(target)
    upper,lower=instance-1,instance+1
    hits=[]
    hit_flag=False
    #finds k nearest hits
    while(not hit_flag):
      #print(upper,lower)
      if(len(hits)>=k):
        hit_flag = True
        break
      if upper < 0 and lower > m:
        hit_flag = True
        break
      if(upper>=0):
        if((target[upper]==target[instance]) and hit):
          hits.append(upper)
        elif((target[upper]!=target[instance]) and (not hit) and target[upper]==c):
          hits.append(upper)
        upper-=1          
      if(lower<m):
        if((target[lower]==target[instance]) and hit):
          hits.append(lower)
        elif((target[lower]!=target[instance]) and (not hit) and target[lower]==c):
          hits.append(lower)
        lower+=1
    hits.sort()
    return hits


def reliefF(feature,target,k=10,repetitions=10, seed = 0):
  np.random.seed(seed)
  if len(feature.shape)>1:
    m,n=feature.shape
  else:
    m=len(feature)
    n=1
  #print(m,n)
  observations=list(range(m))
  classes=np.unique(target)
  weights=np.zeros(n)
  d=(np.max(feature,axis=0)-np.min(feature,axis=0))*m*k

  for i in range(repetitions):
    instance=np.random.choice(observations,1)[0]
    #print("Iteration",i)
    #print(instance)
    hits=hit_miss_calculator(target,instance,k)
    hit_class_prob=len(np.where(target==target[instance])[0])/m
    #print("\nHit Probability -",hit_class_prob)
    #print("Repetition",i,"Class",target[instance],"Hits -",hits)

    miss={}
    miss_class_prob={}

    for each_class in classes:
      if(each_class != target[instance]):
        miss[each_class]=hit_miss_calculator(target,instance,k,False,each_class)
        class_prob=len(np.where(target==each_class)[0])/m
        #print(each_class,class_prob)
        miss_class_prob[each_class]=hit_class_prob/(1 - (class_prob))

    #print("Repetition",i,"Miss-",miss,"Miss Class Probability -",miss_class_prob)
    
    for hit in hits:
      if len(feature.shape)>1:
        weights-=np.subtract(feature.iloc[instance,:],feature.iloc[hit,:])/d
      else:
        weights-=np.subtract(feature.iloc[instance],feature.iloc[hit])/d
    for each_class in miss:
      for each_miss in miss[each_class]:
        if len(feature.shape)>1:
          weights+=(np.subtract(feature.iloc[instance,:],feature.iloc[each_miss,:])/d)*miss_class_prob[each_class]
        else:
          weights+=(np.subtract(feature.iloc[instance],feature.iloc[each_miss])/d)*miss_class_prob[each_class]
    
    
  return weights.tolist()

In [2]:
#This function discretizes the given features into 3 categories
def discretize_feature(feature):
  
  mean=np.mean(feature)
  std=np.std(feature)
  discretized=np.copy(feature)
  
  discretized[np.where(feature<(mean+std/2)) ,]=2#within 1/2 std div
  discretized[np.where(feature>(mean-std/2)),]=2#within 1/2 std div
  
  discretized[np.where(feature>(mean+std/2)),]=0#greater than half
  discretized[np.where(feature<(mean-std/2)),]=1#less than half
  
  return discretized

def Xfreq(x):
  xL={}
  for e in x:
    if e not in xL:
      xL[e]=0
    else:
      xL[e]+=1
  for e in xL:
    xL[e]/=len(x)
  return xL

def XYfreq(x,y):
  freq={}
  
  rX=np.unique(x)
  rY=np.unique(y)
      
  for e in rX:
    for f in rY:
      freq[(e,f)]=round(len(np.where(y[np.where(x==e)[0]]==f)[0])/len(x),4)
       
  return freq

def mutual_info(x,y):

  xFreq=Xfreq(x)
  yFreq=Xfreq(y)
  joint=XYfreq(x,y)
  
  Xentropy=0
  for e in xFreq:
    if xFreq[e]!=0:
      Xentropy-=xFreq[e]*np.log2(xFreq[e])
      
  Yentropy=0
  for e in yFreq:
    if yFreq[e]!=0:
      Yentropy-=yFreq[e]*np.log2(yFreq[e])
      
  jentropy=0
  for e in xFreq:
    for f in yFreq:
      if joint[(e,f)]!=0:
        jentropy-=joint[(e,f)]*np.log2(joint[(e,f)])
  
  return (Xentropy+Yentropy-jentropy)

def mutual_info_wrapper(features,target):

  mi=np.array([])
  for x in features:
    # print(x)
    discrete=discretize_feature(features[x])
    mi=np.append(mi,mutual_info(discrete,target))
  return np.array(mi)

In [3]:
"""
This cell is used for defining the method for calculating the t-scores
"""

def t_test(df,target):
  """
  Input:
  df= Dataframe of features (n_samples,n_features)
  target= Pandas Series/1D Numpy Array containing the class labels (n_samples)
  
  Output:
  scores= Descendingly Sorted array of features based on t-test 
  """
  import numpy as np
  from scipy.stats import ttest_ind
  scores=ttest_ind(df[:][target==0],df[:][target==1])[0] #Storing just the t-test scores and discarding the p-values from the result.
  
  # scores=np.argsort(scores,0)
  return [scores] if type(scores) != np.ndarray else scores

  

In [4]:
from scipy.sparse import *
def fisher_score(X, y):
    import numpy as np
    
    from skfeature.utility.construct_W import construct_W
    """
    This function implements the fisher score feature selection, steps are as follows:
    1. Construct the affinity matrix W in fisher score way
    2. For the r-th feature, we define fr = X(:,r), D = diag(W*ones), ones = [1,...,1]', L = D - W
    3. Let fr_hat = fr - (fr'*D*ones)*ones/(ones'*D*ones)
    4. Fisher score for the r-th feature is score = (fr_hat'*D*fr_hat)/(fr_hat'*L*fr_hat)-1

    Input
    -----
    X: {numpy array}, shape (n_samples, n_features)
        input data
    y: {numpy array}, shape (n_samples,)
        input class labels

    Output
    ------
    score: {numpy array}, shape (n_features,)
        fisher score for each feature

    Reference
    ---------
    He, Xiaofei et al. "Laplacian Score for Feature Selection." NIPS 2005.
    Duda, Richard et al. "Pattern classification." John Wiley & Sons, 2012.
    """

    # Construct weight matrix W in a fisherScore way
    kwargs = {"neighbor_mode": "supervised", "fisher_score": True, 'y': y}
    W = construct_W(X, **kwargs)

    # build the diagonal D matrix from affinity matrix W
    D = np.array(W.sum(axis=1))
    L = W
    tmp = np.dot(np.transpose(D), X)
    D = diags(np.transpose(D), [0])
    Xt = np.transpose(X)
    t1 = np.transpose(np.dot(Xt, D.todense()))
    t2 = np.transpose(np.dot(Xt, L.todense()))
    # compute the numerator of Lr
    D_prime = np.sum(np.multiply(t1, X), 0) - np.multiply(tmp, tmp)/D.sum()
    # compute the denominator of Lr
    L_prime = np.sum(np.multiply(t2, X), 0) - np.multiply(tmp, tmp)/D.sum()
    # avoid the denominator of Lr to be 0
    D_prime[D_prime < 1e-12] = 10000
    lap_score = 1 - np.array(np.multiply(L_prime, 1/D_prime))[0, :]

    # compute fisher score from laplacian score, where fisher_score = 1/lap_score - 1
    score = 1.0/lap_score - 1
    return np.transpose(score)


In [5]:

#Pearson corelation
def pearson_corr(feature,targetClass):
  import numpy as np
  coef=[np.abs(np.corrcoef(feature[i].values,targetClass)[0,1]) for i in feature.columns]
  # range(feature.shape[1])
  coef=[0 if np.isnan(i) else i for i in coef]
  return coef


In [6]:
#gini_index 

In [7]:
#signal to noise ratio
#using weighted one-vs-all strategy for multi-class data
def signaltonoise(feature, target, axis = 0, ddof = 0):
  import numpy as np
  classes = np.unique(target)
  if len(feature.shape)<2:
    feature = feature.reshape(-1,1)
  row, _ = feature.shape
  if len(classes) <= 2:
    m = None
    std = 0
    for each in classes:
      idx = np.where(target == each)[0]
      #convinient way of doing m1-m2
      if m is None:
        m = feature.iloc[idx, :].mean(axis)
      else:
        m -= feature.iloc[idx, :].mean(axis)

      #sd1+sd2
      std += feature.iloc[idx, :].std(axis = axis, ddof = ddof)

    return np.asanyarray(m/std)

  else:
    snr_scores = [] #for storing the weighted scores
    #using the one vs all strategy for each class with
    for each in classes:
      idx = np.where(target == each)[0]
      idxn = np.where(target != each)[0]
      m = feature.iloc[idx, :].mean(axis) - feature.iloc[idxn, :].mean(axis)
      std = feature.iloc[idx, :].std(axis = axis, ddof = ddof) + feature.iloc[idxn, :].std(axis = axis, ddof = ddof) 
      snr_scores.append((m/std) * len(idx)/row) #weighted snr

    return np.asanyarray(snr_scores).sum(axis = axis)

In [8]:
#intentionally left blank!

In [9]:
def feature_ranking(score):
    """
    Rank features in descending order according to fisher score, the larger the fisher score, the more important the
    feature is
    """
    idx = np.argsort(score, 0)
    return idx[::-1]

## Loading the Dataset

In [15]:
files.upload()

Saving breast.txt to breast.txt
Saving colon.txt to colon.txt
Saving leukemia.txt to leukemia.txt
Saving lung.txt to lung.txt
Saving MLL.txt to MLL.txt
Saving Prostate.csv to Prostate.csv
Saving rahc.txt to rahc.txt
Saving raoa.txt to raoa.txt
Saving rbreast.txt to rbreast.txt
Saving SRBCT.txt to SRBCT.txt


In [16]:
#DATASET is the name of the dataset being used.
DATASET="Leukemia"

#NEIGHBOURS determines neighbours arg for ReliefF
#for any dataset which contains any class sample 
# <10, make it less than 10. Eg of such dataset - SRBCT
NEIGHBOURS = 2 

#p is the number of top genes taken after sorting the filter scores
p = 800

#q is the number of top augmented genes chosen from each filter after running 
#SGC
q = 5

In [17]:
data_df=pd.read_csv("%s.txt"%(DATASET),sep="\t", index_col=None, header=None)
print(data_df.shape)
target=data_df.iloc[:,-1]
feature=pd.DataFrame(data_df.iloc[:,:-1].values,dtype='float')
m,n=feature.shape
print(m,n)
print(feature.head())
print("Number of classes - ")
classes = np.unique(target)
for x in classes:
  print("Class -",x,"Number of Samples -", len(np.where(target == x)[0]))

feature_norm=pd.DataFrame(MinMaxScaler().fit_transform(feature))

count_genes = dict(zip(map(int,feature.columns.tolist()), np.zeros(data_df.shape[1], dtype= np.int16)))

(72, 7071)
72 7070
    0     1      2     3      4     ...    7065   7066  7067   7068  7069
0  151.0  72.0  281.0  36.0 -299.0  ...   793.0  329.0  36.0  191.0 -37.0
1  263.0  21.0  250.0  43.0 -103.0  ...   782.0  295.0  11.0   76.0 -14.0
2   88.0 -27.0  358.0  42.0  142.0  ...  1138.0  777.0  41.0  228.0 -41.0
3  484.0  61.0  118.0  39.0  -11.0  ...   627.0  170.0 -50.0  126.0 -91.0
4  118.0  16.0  197.0  39.0  237.0  ...   250.0  314.0  14.0   56.0 -25.0

[5 rows x 7070 columns]
Number of classes - 
Class - 0 Number of Samples - 25
Class - 1 Number of Samples - 47


In [18]:
data_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,7031,7032,7033,7034,7035,7036,7037,7038,7039,7040,7041,7042,7043,7044,7045,7046,7047,7048,7049,7050,7051,7052,7053,7054,7055,7056,7057,7058,7059,7060,7061,7062,7063,7064,7065,7066,7067,7068,7069,7070
0,151,72,281,36,-299,57,186,1647,137,803,-894,-632,378,-26,-691,2,-156,155,355,1149,-131,158,1084,87,125,11,553,152,759,551,34,588,129,1128,919,399,-206,20,271,78,...,-763,172,149,341,788,21210,13771,598,396,245,14476,10882,701,2762,-325,-67,346,-68,229,-14,108,28,349,61,273,384,-306,-1827,1582,185,511,-125,389,-37,793,329,36,191,-37,1
1,263,21,250,43,-103,169,219,2043,188,756,-812,-700,249,-242,-369,-14,-98,131,431,941,-95,328,1215,53,-81,20,215,104,1656,739,37,781,182,944,759,219,-87,13,595,142,...,51,154,418,433,736,21059,15097,563,171,-149,13686,11789,76,1567,-191,-88,290,14,194,56,303,-242,214,-28,143,231,-336,-2380,624,169,837,-36,442,-17,782,295,11,76,-14,1
2,88,-27,358,42,142,359,237,1997,91,2514,-1715,-603,362,-31,-1385,-374,-213,270,603,1924,94,301,1281,128,70,12,460,-3,1130,885,141,1044,423,1019,1019,650,-183,-33,457,205,...,-474,180,272,591,959,24292,17378,1808,363,325,6560,5023,804,1090,-258,9,220,-58,294,95,143,-25,464,513,238,720,-204,-1772,753,315,1199,33,168,52,1138,777,41,228,-41,1
3,484,61,118,39,-11,274,245,2128,-82,1489,-969,-909,266,-181,-900,-237,-156,115,255,1078,-24,238,1316,112,41,34,512,-147,1062,654,90,652,147,904,614,761,-254,-9,414,142,...,-336,325,149,173,431,17558,13818,576,455,594,8955,9567,367,1708,-357,45,430,-35,128,42,22,-131,342,142,277,307,-320,-2022,743,240,835,218,174,-110,627,170,-50,126,-91,1
4,118,16,197,39,237,311,186,1608,204,322,-444,-254,554,16,-58,-78,-95,45,569,501,-15,181,296,-39,-1,-2,514,-29,822,756,159,999,243,1037,691,263,-122,-17,397,85,...,-56,279,183,259,605,18530,15619,65,122,126,8443,8512,182,1503,-78,29,159,18,71,42,44,-33,159,71,134,178,-182,-179,626,156,649,57,504,-26,250,314,14,56,-25,1


In [19]:
#utility function
def plot_feature(feature, target, c = ['r', 'b', 'g', 'y']):
  import matplotlib.pyplot as plt
  from matplotlib import style
  import numpy as np
  style.use('ggplot')
  for idx, each in enumerate(np.unique(target)):
    y = feature[np.where(target == each)[0]]
    x = len(y)
    plt.scatter(range(1, x+1), y, color = c[idx])
    plt.plot(range(1, x+1), y, color = c[idx])

In [20]:
print(len(count_genes))

7070


In [21]:
"""
Loading the Cluster JSON files from memory
"""

with open('%s_p%s_q%smi_cluster.json'%(DATASET, p, q), 'r') as fp:
  mi_cluster=json.load(fp)

with open('%s_p%s_q%srelief_cluster.json'%(DATASET, p, q), 'r') as fp:
  chi_cluster=json.load(fp)


with open('%s_p%s_q%schi_cluster.json'%(DATASET, p, q), 'r') as fp:
  relief_cluster=json.load(fp)

with open('%s_p%s_q%spc_cluster.json'%(DATASET, p, q), 'r') as fp:
  pc_cluster=json.load(fp)

with open('%s_p%s_q%sfs_cluster.json'%(DATASET, p, q), 'r') as fp:
  fs_cluster=json.load(fp)

with open('%s_p%s_q%stt_cluster.json'%(DATASET, p, q), 'r') as fp:
  tt_cluster=json.load(fp)

with open('%s_p%s_q%ssnr_cluster.json'%(DATASET, p, q), 'r') as fp:
  snr_cluster=json.load(fp)

In [22]:
"""
Loading Representative Genes from Memory
"""
gene_repre_1 = pd.read_csv("%s_p%s_q%sRepresentative_Genes_1.csv"%(DATASET, p, q),index_col = None)
gene_repre_2 = pd.read_csv("%s_p%s_q%sRepresentative_Genes_2.csv"%(DATASET, p, q),index_col = None)
gene_repre_3 = pd.read_csv("%s_p%s_q%sRepresentative_Genes_3.csv"%(DATASET, p, q),index_col = None)
gene_repre_4 = pd.read_csv("%s_p%s_q%sRepresentative_Genes_4.csv"%(DATASET, p, q),index_col = None)
gene_repre_5 = pd.read_csv("%s_p%s_q%sRepresentative_Genes_5.csv"%(DATASET, p, q),index_col = None)
gene_repre_6 = pd.read_csv("%s_p%s_q%sRepresentative_Genes_6.csv"%(DATASET, p, q),index_col = None)
gene_repre_7 = pd.read_csv("%s_p%s_q%sRepresentative_Genes_7.csv"%(DATASET, p, q),index_col = None)

In [23]:
#Can Skip this cell
print(gene_repre_1.shape)
print(gene_repre_2.shape)
print(gene_repre_3.shape)
print(gene_repre_4.shape)
print(gene_repre_5.head())
print(gene_repre_6.head())
print(gene_repre_7.head())

(72, 50)
(72, 7)
(72, 21)
(72, 6)
      4787    1822     6743     1983     2126    6285
0 -10105.0  -178.0 -36011.0  -9359.0  24738.0  5642.0
1 -10565.0  1788.0 -33324.0  18524.0  39818.0  4858.0
2 -11940.0  3021.0 -49194.0   3105.0  37881.0  -398.0
3 -13170.0  3431.0 -38118.0   -218.0  23339.0  -121.0
4 -10161.0  2072.0 -31341.0   9250.0  27976.0   687.0
     4787     1822     6125     6907     2188     6149    6945
0  1000.0  16183.0   9010.0  55364.0  19677.0  29982.0   783.0
1   914.0  15363.0  18909.0  54113.0  33295.0  35320.0  4514.0
2  -466.0  13833.0  14862.0  44653.0  25203.0  27204.0   401.0
3  -560.0  12355.0  13337.0  46064.0  11296.0  24707.0   395.0
4   609.0  14253.0  11874.0  51216.0   9510.0  27443.0  1393.0
     4787    1822     1614     6737     5916     3301     1053
0  3412.0  8236.0  49301.0  21955.0  67450.0  33592.0   7693.0
1  5425.0  6058.0  58357.0  25545.0  82825.0  35900.0   8757.0
2  3268.0  6631.0  54494.0  18467.0  69772.0  28581.0    541.0
3  2095.0  5

In [24]:
print("Number of MI Clusters formed -",len(mi_cluster))
print("Number of ReliefF Clusters formed -",len(relief_cluster))
print("Number of ChiSq. Clusters formed -",len(chi_cluster))
print("Number of Pearson Clusters formed -",len(pc_cluster))
print("Number of Fisher Score Clusters formed -",len(fs_cluster))
print("Number of T-Test Clusters formed -",len(tt_cluster))
print("Number of SNR Clusters formed -", len(snr_cluster))

Number of MI Clusters formed - 50
Number of ReliefF Clusters formed - 21
Number of ChiSq. Clusters formed - 7
Number of Pearson Clusters formed - 6
Number of Fisher Score Clusters formed - 6
Number of T-Test Clusters formed - 7
Number of SNR Clusters formed - 7


In [25]:
def sort_keys(scores,gene_repre,target,flag=True):
  score_dict={}
  x=0
  for i in gene_repre.columns:
    # print(x)
    score_dict[i]=scores[x]
    x+=1
  return [k for k, v in sorted(score_dict.items(), key=lambda item: item[1], reverse = True)]

In [26]:
"""
feature_ranking cannot be used here because it sorts and returns the indices from 0-1
They need to be sorted using a different function
"""
sorted_mi_keys=sort_keys(mutual_info_wrapper(gene_repre_1,target),gene_repre_1,target,True)[:q]

sorted_relief_keys=sort_keys(reliefF(gene_repre_2,target,k=NEIGHBOURS,repetitions=5),gene_repre_2,target,True)[:q]

mms=MinMaxScaler()
nfeature=mms.fit_transform(gene_repre_3)
chi_score,p_val=chi2(nfeature,target)
sorted_chi_keys=sort_keys(chi_score,gene_repre_3,target,False)[:q]

sorted_pc_keys=sort_keys(pearson_corr(gene_repre_4,target),gene_repre_4,target,True)[:q]

sorted_fs_keys=sort_keys(fisher_score(gene_repre_5.values,target),gene_repre_5,target,True)[:q]

sorted_tt_keys=sort_keys(t_test(gene_repre_6,target),gene_repre_6,target,True)[:q]

sorted_snr_keys = sort_keys(signaltonoise(gene_repre_7, target), gene_repre_7, target, True)[:q]

  class_idx_all = (class_idx[:, np.newaxis] & class_idx[np.newaxis, :])
  class_idx_all = (class_idx[:, np.newaxis] & class_idx[np.newaxis, :])


In [27]:
print("MI cluster after sorting - ",sorted_mi_keys)
print("Relief cluster after sorting - ",sorted_relief_keys)
print("Chi cluster after sorting - ",sorted_chi_keys)
print("Pearson cluster after sorting - ",sorted_pc_keys)
print("Fisher cluster after sorting - ",sorted_fs_keys)
print("T-Test cluster after sorting - ",sorted_tt_keys)
print("SNR cluster after sorting - ",sorted_snr_keys)

MI cluster after sorting -  ['2061', '4979', '1827', '2566', '3192']
Relief cluster after sorting -  ['2729', '6780', '6158', '4664', '215']
Chi cluster after sorting -  ['2228', '1822', '6746', '4313', '5916']
Pearson cluster after sorting -  ['4787', '1822', '6743', '1983', '2126']
Fisher cluster after sorting -  ['4787', '1822', '6743', '1983', '2126']
T-Test cluster after sorting -  ['4787', '1822', '6125', '6907', '2188']
SNR cluster after sorting -  ['4787', '1822', '1614', '6737', '5916']


## Testing Classification of the Augmented Genes.
Here the classfication accuracy is tested using **KNN, Decision Tree, Naive Bayes** and **SVM** as well as the **Ensemble** of them.\
\
Top i (where i ranges from 1 to q) are chosen from each augmented dataset of filters and a new dataset is created using these augmented genes. This dataset is used for classification.

In [28]:
LOOCV=LeaveOneOut()
data_KNN = KNeighborsClassifier(n_neighbors= int(feature.shape[0] ** 0.5))
data_SVM = SVC(kernel='rbf',gamma='scale')
data_NB = GaussianNB()
data_Tree = DecisionTreeClassifier()
rows = feature.shape[0]

classifiers=["NB","KNN","Tree","SVM"]
classifier_dict = { "NB": GaussianNB(), 
                   "KNN": KNeighborsClassifier(n_neighbors= int(feature.shape[0] ** 0.5)), 
                   "Tree": DecisionTreeClassifier(), 
                   "SVM": SVC(kernel='rbf',gamma='scale')}

"""

"""

keys_list={"MI":sorted_mi_keys, "ReliefF":sorted_relief_keys, 
           "Chi":sorted_chi_keys,  
           "Fisher":sorted_fs_keys, "Pearson":sorted_pc_keys, "tTest":sorted_tt_keys, 
           "SNR":sorted_snr_keys}
"""
   
"""
cluster_list={"MI":gene_repre_1, "ReliefF":gene_repre_2,
              "Chi":gene_repre_3, "Pearson":gene_repre_4, 
              "Fisher":gene_repre_5,"tTest":gene_repre_6, 
              "SNR":gene_repre_7}

## MFSGC-EC

## LOOCV


In [34]:
acc_df = []
for i in range(1, q+1):  
  #Ensemble of same type classifiers for each filter

  #records the ensemble classifier accuracy of each classifier with i genes
  classifier_accuracy = [] 

  for each_classifier in classifier_dict:
    #creating confusion matrix.
    confusion_matrix = np.zeros(classes.shape[0]*2).reshape(2,2)
    #records the accuracy of the current classifier
    acc=0
    for train_index, test_index in LOOCV.split(range(rows)):
      """
      Data is divided into train-test splits and then polling method is used 
      to find the classification results (ensemble of KNN,SVM,NB,Decision Tree)
      """
      
      filter_result = []
      for filter_name in keys_list:

        #taking the train-test split from each filter
        train_data,train_labels = cluster_list[filter_name].iloc[train_index,:i],target[train_index]
        test_data,test_labels = cluster_list[filter_name].iloc[test_index,:i],target[test_index].values.tolist()[0]

        #filter wise fit and predict from the same type of classifier
        classifier_dict[each_classifier].fit(train_data, train_labels)
        filter_result.append(classifier_dict[each_classifier].predict(test_data)[0])

      #generating the polling result of all filters' classifiers
      polling_result = None
      max_freq = 0
      for each in set(filter_result):
        freq = filter_result.count(each)
        if freq>max_freq:
          max_freq = freq
          polling_result = each
      # print(polling_result, test_labels)
      confusion_matrix[test_labels, polling_result] += 1
      if polling_result == test_labels:
        acc+=1
    
    acc = np.round(acc/rows,4)
    classifier_accuracy.append(acc)
    # print("Confusion Matrix for %s with %s genes:-\n"%(each_classifier, i), confusion_matrix)
    pd.DataFrame(confusion_matrix, index = classes, columns = classes).to_csv("LOOCV_confusion_matrix_%s_%s_%s.csv"%(DATASET, each_classifier, i))

  #records the ensemble classifier accuracy from 1 to q genes for each classifier
  acc_df.append(classifier_accuracy)    
acc_df = pd.DataFrame(acc_df, columns= classifier_dict.keys())
acc_df.to_csv("Train-EC-Multi-SGC-%s_p%s_q%s_Accuracy_Matrix.csv"%(DATASET, p, q))
print(acc_df)

    NB     KNN    Tree  SVM
0  1.0  1.0000  1.0000  1.0
1  1.0  1.0000  1.0000  1.0
2  1.0  1.0000  1.0000  1.0
3  1.0  0.9861  1.0000  1.0
4  1.0  0.9722  0.9861  1.0


## 5 Fold Cross Validation

In [38]:
from sklearn.model_selection import KFold

kfcv = KFold(5, True, random_state = 0)

In [70]:
acc_df = []
for i in range(1, q+1):  
  #Ensemble of same type classifiers for each filter

  #records the ensemble classifier accuracy of each classifier with i genes
  classifier_accuracy = [] 

  for each_classifier in classifier_dict:
    #creating confusion matrix.
    confusion_matrix = np.zeros(classes.shape[0]*2).reshape(2,2)
    #records the accuracy of the current classifier
    acc=0
    for train_index, test_index in kfcv.split(range(rows)):
      """
      Data is divided into train-test splits and then polling method is used 
      to find the classification results (ensemble of KNN,SVM,NB,Decision Tree)
      """
      
      filter_result = np.zeros((len(keys_list), len(test_index)))
      # print(filter_result)
      for idx1, filter_name in enumerate(keys_list):

        #taking the train-test split from each filter
        train_data,train_labels = cluster_list[filter_name].iloc[train_index,:i],target[train_index]
        test_data,test_labels = cluster_list[filter_name].iloc[test_index,:i].values,target[test_index].values
        # print("Test Label",test_labels)
        #filter wise fit and predict from the same type of classifier
        classifier_dict[each_classifier].fit(train_data, train_labels)
        for idx2, tst_data in enumerate(test_data):
          # print(idx2,tst_data)
          filter_result[idx1, idx2] = classifier_dict[each_classifier].predict(tst_data.reshape(1,-1))[0]

      #generating the polling result of all filters' classifiers
      for each_col in range(filter_result.shape[1]):

        polling_result = None
        max_freq = 0
        data_col = filter_result[:,each_col]
        for each in set(data_col):
          freq = data_col.tolist().count(each)
          if freq>max_freq:
            max_freq = freq
            polling_result = each
        # print(polling_result)
        confusion_matrix[test_labels[each_col], int(polling_result)] += 1
        if polling_result == test_labels[each_col]:
          acc+=1
    
    acc = np.round(acc/rows,4)
    classifier_accuracy.append(acc)
    # print("Confusion Matrix for %s with %s genes:-\n"%(each_classifier, i), confusion_matrix)
    pd.DataFrame(confusion_matrix, index = classes, columns = classes).to_csv("5-Fold-CV_confusion_matrix_%s_%s_%s.csv"%(DATASET, each_classifier, i))

  #records the ensemble classifier accuracy from 1 to q genes for each classifier
  acc_df.append(classifier_accuracy)    
acc_df = pd.DataFrame(acc_df, columns= classifier_dict.keys())
acc_df.to_csv("Train-EC-Multi-SGC-%s_p%s_q%s_Accuracy_Matrix.csv"%(DATASET, p, q))
print(acc_df)

    NB     KNN    Tree  SVM
0  1.0  1.0000  1.0000  1.0
1  1.0  1.0000  1.0000  1.0
2  1.0  1.0000  0.9861  1.0
3  1.0  0.9861  1.0000  1.0
4  1.0  0.9722  1.0000  1.0


## Copying Confusion Matrix to Google Drive

In [73]:
!cp LOOCV_confusion_matrix*.* "/content/drive/My Drive/For Sir and Ma'am/MFSGC Results/Confusion Matrix/LOOCV/"
!cp 5-Fold-CV_confusion_matrix*.* "/content/drive/My Drive/For Sir and Ma'am/MFSGC Results/Confusion Matrix/5-Fold/"

## Downloading the Files
Downloads all the Files generated for this dataset. Please note: The dataset name must be specified in the DATASET 
variable before proceeding with the steps below.

Incase of 'failed to fetch' error, please rerun these cell.

In [None]:
#Download the MFSGC and MFSGC-EC Results
files.download("Train-Multi-SGC-%s_p%s_q%s_Accuracy_Matrix.csv"%(DATASET, p, q))
files.download("Train-EC-Multi-SGC-%s_p%s_q%s_Accuracy_Matrix.csv"%(DATASET, p, q))

In [None]:
#Downloading the Cluster Formation Sequence
for x in ['mi','chi','relief', 'pc', 'fs', 'tt', 'snr']:
  files.download("%s_p%s_q%s%s_cluster.json"%(DATASET, p, q, x))

In [None]:
#Download Representative Genes Formed
for x in range(1,8):
  files.download("%s_p%s_q%sRepresentative_Genes_%d.csv"%(DATASET, p, q, x))

In [None]:
#*****************-----Notebook Training Code Ends Here. Below are cells for loading pre-calculated values for Representative Genes and Cluster Sequences.-------*****************************

In [None]:
#Cell Left Empty Intentionally.

## Loading Gene Representatives and Clusters
The below cells can be run to load gene representatives and clusters if you already have them prepared.

In [None]:
"""
Loading the Cluster JSON files from memory
"""

with open('%s_p%s_q%smi_cluster.json'%(DATASET, p, q), 'r') as fp:
  mi_cluster=json.load(fp)

with open('%s_p%s_q%srelief_cluster.json'%(DATASET, p, q), 'r') as fp:
  chi_cluster=json.load(fp)


with open('%s_p%s_q%schi_cluster.json'%(DATASET, p, q), 'r') as fp:
  relief_cluster=json.load(fp)

with open('%s_p%s_q%spc_cluster.json'%(DATASET, p, q), 'r') as fp:
  pc_cluster=json.load(fp)

with open('%s_p%s_q%sfs_cluster.json'%(DATASET, p, q), 'r') as fp:
  fs_cluster=json.load(fp)

with open('%s_p%s_q%stt_cluster.json'%(DATASET, p, q), 'r') as fp:
  tt_cluster=json.load(fp)

with open('%s_p%s_q%ssnr_cluster.json'%(DATASET, p, q), 'r') as fp:
  snr_cluster=json.load(fp)