<a href="https://colab.research.google.com/github/abhik-99/MFSGC/blob/master/Univariate_Supervised_Gene_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Univariate Supervised Gene Clustering
**UFSGC** is a method where by a specific filter algorithm is used to score and filter out high ranking genes from Gene Expression Dataset and then the filtered Genes are put through SGC for Gene Augmentation. The resulting Augmentation not only increases the class separability of the genes but also their expressions.\
This Augmented gene expression set is now used for classification of cancer from healthy patients.\
The Filter Methods chosen for evaluation are:- 
1. Mutual Information.
2. ReliefF.
3. Chi Sq.
4. Fisher Score.
5. Signal To Noise Ratio (adapted for multi-class datasets).
6. T-Test.
7. Pearson Corelation Coefficient.

This method is used for evaluation of **MFSGC**.

In [1]:
!pip install -U -q PyDrive
!pip install skfeature-chappers

Collecting skfeature-chappers
  Downloading https://files.pythonhosted.org/packages/e6/45/19bb801eb3b4a892534ab86468ad0669a68ff63578610f90051190e3622f/skfeature-chappers-1.0.3.tar.gz
Building wheels for collected packages: skfeature-chappers
  Building wheel for skfeature-chappers (setup.py) ... [?25l[?25hdone
  Created wheel for skfeature-chappers: filename=skfeature_chappers-1.0.3-py2.py3-none-any.whl size=59512 sha256=f6a0228ff6f5ff50dc985361ea3ddc0457665ed85c1a39236f7b77f3743fea87
  Stored in directory: /root/.cache/pip/wheels/ac/61/bf/1b3a8c232a0072409508c2ec4c12f316e95681ae72ba7315d2
Successfully built skfeature-chappers
Installing collected packages: skfeature-chappers
Successfully installed skfeature-chappers-1.0.3


In [23]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
from sklearn.feature_selection import chi2
import json

In [3]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
from google.colab import files

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

#2. Get the file
downloaded = drive.CreateFile({'id':'1oaOATE0D_f8MGPIMJOMXYVt0hBUWNCKV'}) # replace the id with id of file you want to access
# For Leukemia- 1xcL-LT-E_gUqWLlqqeVJP1DVHHpiAGe_
# For Colon - 1AUOto0GhTHW9fX52XSsf9kzYJS5ggv0G
# for Prostate - 13Hf7uGbyJ1sWYo8KDRDL8scm-2Fs9_gd
# For Lung- 1xuLzTWDGUbr4x3Pq1dnJj08MZqBB5I3U
# for Rahc - 1oaOATE0D_f8MGPIMJOMXYVt0hBUWNCKV
# for Raoa - 1d2vhPcT3I7ZFcAGOQYVLGB3Jx_vEMata
# for Rbreast - 1Vf-h8zfVP_twMXivcJJtbWtjThShUHvn
# for SRBCT - 1rO5EEvsoRJl2VVUB3ywKUd3kNiQ24oy3
# for MLL - 1rS7x4x_DhrUzaBhrgKMQH3uIaLJdPgW3
# for Breast - 1enhhyA4u2ByvOjnF81WoHflVNpXtfKpu
downloaded.GetContentFile('data.txt')

In [29]:
#DATASET is the name of the dataset being used.
DATASET="RAHC"

#NEIGHBOURS determines neighbours arg for ReliefF
#for any dataset which contains any class sample 
# <10, make it less than 10. Eg of such dataset - SRBCT
NEIGHBOURS = 3 

#p is the number of top genes taken after sorting the filter scores
p = 800

#q is the number of top genes to be taken from each filter after augmentation
q = 5

#uncomment the line below if using the dataset splitter else leave it commented 
#data_df = pd.read_csv("%s_train.csv"%(DATASET),index_col=0)

#uncomment the lines below if using the original dataset
dataset = pd.read_table("data.txt",header=None)
data_df = dataset



target = data_df.iloc[:,-1]
feature = pd.DataFrame(data_df.iloc[:,:-1].values,dtype='float')
m,n = feature.shape
print(m,n)
print(feature.head())
print("Number of classes - ")
classes = np.unique(target)
for x in classes:
  print("Class -",x,"Number of Sampples -", len(np.where(target == x)[0]))

feature_norm=pd.DataFrame(MinMaxScaler().fit_transform(feature))

49 41056
     0      1      2       3      4      ...  41051  41052  41053  41054  41055
0  0.84300  0.925  0.602  0.1466 -0.827  ...  1.342  0.196 -0.763 -2.223 -0.349
1  0.72900  0.569  0.481 -0.2620 -0.666  ...  2.723  0.213  0.132 -2.008 -0.278
2  1.93300  0.005  0.636  0.9800 -0.950  ...  2.193 -0.715 -0.258 -2.376 -0.322
3  1.52500  1.819  0.899  0.5750 -0.025  ...  1.100  0.094 -0.346 -1.472 -0.288
4  0.36836  1.552  1.237  1.7100 -0.215  ...  2.944 -0.014  0.048 -1.213  0.646

[5 rows x 41056 columns]
Number of classes - 
Class - 0 Number of Sampples - 34
Class - 1 Number of Sampples - 15


In [5]:
#utility function
def plot_feature(feature, target, c = ['r', 'b', 'g', 'y']):
  import matplotlib.pyplot as plt
  from matplotlib import style
  import numpy as np
  style.use('ggplot')
  for idx, each in enumerate(np.unique(target)):
    y = feature[np.where(target == each)[0]]
    x = len(y)
    plt.scatter(range(1, x+1), y, color = c[idx])
    plt.plot(range(1, x+1), y, color = c[idx])

In [6]:
#construction of ReliefF function

"""
Given a dataset, number of random instances to pick form the dataset and
number of features to consider in each iteration (k), the function returns the weigths of the attributes
of the dataset.
These weigths can then be used as the final results out of the ReliefF algorithm

Paper-

Marko Robnik-ˇSikonja and Igor Kononenko. Theoretical and empirical analysis of relieff
and rrelieff. Machine learning, 53(1-2):23–69, 2003.

"""

def hit_miss_calculator(target,instance,k = 10, hit = True, c = None, ):
    m=len(target)
    upper,lower=instance-1,instance+1
    hits=[]
    hit_flag=False
    #finds k nearest hits
    while(not hit_flag):
      #print(upper,lower)
      if(len(hits)>=k):
        hit_flag = True
        break
      if upper < 0 and lower > m:
        hit_flag = True
        break
      if(upper>=0):
        if((target[upper]==target[instance]) and hit):
          hits.append(upper)
        elif((target[upper]!=target[instance]) and (not hit) and target[upper]==c):
          hits.append(upper)
        upper-=1          
      if(lower<m):
        if((target[lower]==target[instance]) and hit):
          hits.append(lower)
        elif((target[lower]!=target[instance]) and (not hit) and target[lower]==c):
          hits.append(lower)
        lower+=1
    hits.sort()
    return hits


def reliefF(feature,target,k=10,repetitions=10, seed = 0):
  np.random.seed(seed)
  if len(feature.shape)>1:
    m,n=feature.shape
  else:
    m=len(feature)
    n=1
  #print(m,n)
  observations=list(range(m))
  classes=np.unique(target)
  weights=np.zeros(n)
  d=(np.max(feature,axis=0)-np.min(feature,axis=0))*m*k

  for i in range(repetitions):
    instance=np.random.choice(observations,1)[0]
    #print("Iteration",i)
    #print(instance)
    hits=hit_miss_calculator(target,instance,k)
    hit_class_prob=len(np.where(target==target[instance])[0])/m
    #print("\nHit Probability -",hit_class_prob)
    #print("Repetition",i,"Class",target[instance],"Hits -",hits)

    miss={}
    miss_class_prob={}

    for each_class in classes:
      if(each_class != target[instance]):
        miss[each_class]=hit_miss_calculator(target,instance,k,False,each_class)
        class_prob=len(np.where(target==each_class)[0])/m
        #print(each_class,class_prob)
        miss_class_prob[each_class]=hit_class_prob/(1 - (class_prob))

    #print("Repetition",i,"Miss-",miss,"Miss Class Probability -",miss_class_prob)
    
    for hit in hits:
      if len(feature.shape)>1:
        weights-=np.subtract(feature.iloc[instance,:],feature.iloc[hit,:])/d
      else:
        weights-=np.subtract(feature.iloc[instance],feature.iloc[hit])/d
    for each_class in miss:
      for each_miss in miss[each_class]:
        if len(feature.shape)>1:
          weights+=(np.subtract(feature.iloc[instance,:],feature.iloc[each_miss,:])/d)*miss_class_prob[each_class]
        else:
          weights+=(np.subtract(feature.iloc[instance],feature.iloc[each_miss])/d)*miss_class_prob[each_class]
    
    
  return weights.tolist()

In [7]:
#This function discretizes the given features into 3 categories
def discretize_feature(feature):
  
  mean=np.mean(feature)
  std=np.std(feature)
  discretized=np.copy(feature)
  
  discretized[np.where(feature<(mean+std/2)) ,]=2#within 1/2 std div
  discretized[np.where(feature>(mean-std/2)),]=2#within 1/2 std div
  
  discretized[np.where(feature>(mean+std/2)),]=0#greater than half
  discretized[np.where(feature<(mean-std/2)),]=1#less than half
  
  return discretized

def Xfreq(x):
  xL={}
  for e in x:
    if e not in xL:
      xL[e]=0
    else:
      xL[e]+=1
  for e in xL:
    xL[e]/=len(x)
  return xL

def XYfreq(x,y):
  freq={}
  
  rX=np.unique(x)
  rY=np.unique(y)
      
  for e in rX:
    for f in rY:
      freq[(e,f)]=round(len(np.where(y[np.where(x==e)[0]]==f)[0])/len(x),4)
       
  return freq

def mutual_info(x,y):

  xFreq=Xfreq(x)
  yFreq=Xfreq(y)
  joint=XYfreq(x,y)
  
  Xentropy=0
  for e in xFreq:
    if xFreq[e]!=0:
      Xentropy-=xFreq[e]*np.log2(xFreq[e])
      
  Yentropy=0
  for e in yFreq:
    if yFreq[e]!=0:
      Yentropy-=yFreq[e]*np.log2(yFreq[e])
      
  jentropy=0
  for e in xFreq:
    for f in yFreq:
      if joint[(e,f)]!=0:
        jentropy-=joint[(e,f)]*np.log2(joint[(e,f)])
  
  return (Xentropy+Yentropy-jentropy)

def mutual_info_wrapper(features,target):

  mi=np.array([])
  for x in features:
    discrete=discretize_feature(features[x])
    mi=np.append(mi,mutual_info(discrete,target))
  return np.array(mi)

In [8]:
"""
This cell is used for defining the method for calculating the t-scores
"""

def t_test(df,target):
  """
  Input:
  df= Dataframe of features (n_samples,n_features)
  target= Pandas Series/1D Numpy Array containing the class labels (n_samples)
  
  Output:
  scores= Descendingly Sorted array of features based on t-test 
  """
  import numpy as np
  from scipy.stats import ttest_ind
  scores=ttest_ind(df[:][target==0],df[:][target==1])[0] #Storing just the t-test scores and discarding the p-values from the result.
  
  # scores=np.argsort(scores,0)
  return [scores] if type(scores) != np.ndarray else scores

  

In [9]:
from scipy.sparse import *
def fisher_score(X, y):
    import numpy as np
    
    from skfeature.utility.construct_W import construct_W
    """
    This function implements the fisher score feature selection, steps are as follows:
    1. Construct the affinity matrix W in fisher score way
    2. For the r-th feature, we define fr = X(:,r), D = diag(W*ones), ones = [1,...,1]', L = D - W
    3. Let fr_hat = fr - (fr'*D*ones)*ones/(ones'*D*ones)
    4. Fisher score for the r-th feature is score = (fr_hat'*D*fr_hat)/(fr_hat'*L*fr_hat)-1

    Input
    -----
    X: {numpy array}, shape (n_samples, n_features)
        input data
    y: {numpy array}, shape (n_samples,)
        input class labels

    Output
    ------
    score: {numpy array}, shape (n_features,)
        fisher score for each feature

    Reference
    ---------
    He, Xiaofei et al. "Laplacian Score for Feature Selection." NIPS 2005.
    Duda, Richard et al. "Pattern classification." John Wiley & Sons, 2012.
    """

    # Construct weight matrix W in a fisherScore way
    kwargs = {"neighbor_mode": "supervised", "fisher_score": True, 'y': y}
    W = construct_W(X, **kwargs)

    # build the diagonal D matrix from affinity matrix W
    D = np.array(W.sum(axis=1))
    L = W
    tmp = np.dot(np.transpose(D), X)
    D = diags(np.transpose(D), [0])
    Xt = np.transpose(X)
    t1 = np.transpose(np.dot(Xt, D.todense()))
    t2 = np.transpose(np.dot(Xt, L.todense()))
    # compute the numerator of Lr
    D_prime = np.sum(np.multiply(t1, X), 0) - np.multiply(tmp, tmp)/D.sum()
    # compute the denominator of Lr
    L_prime = np.sum(np.multiply(t2, X), 0) - np.multiply(tmp, tmp)/D.sum()
    # avoid the denominator of Lr to be 0
    D_prime[D_prime < 1e-12] = 10000
    lap_score = 1 - np.array(np.multiply(L_prime, 1/D_prime))[0, :]

    # compute fisher score from laplacian score, where fisher_score = 1/lap_score - 1
    score = 1.0/lap_score - 1
    return np.transpose(score)


In [10]:

#Pearson corelation
def pearson_corr(feature,targetClass):
  import numpy as np
  coef=[np.abs(np.corrcoef(feature[i].values,targetClass)[0,1]) for i in feature.columns]
  # range(feature.shape[1])
  coef=[0 if np.isnan(i) else i for i in coef]
  return coef


In [11]:
#signal to noise ratio
#using weighted one-vs-all strategy for multi-class data
def signaltonoise(feature, target, axis = 0, ddof = 0):
  import numpy as np
  classes = np.unique(target)
  if len(feature.shape)<2:
    feature = feature.reshape(-1,1)
  row, _ = feature.shape
  if len(classes) <= 2:
    m = None
    std = 0
    for each in classes:
      idx = np.where(target == each)[0]
      #convinient way of doing m1-m2
      if m is None:
        m = feature.iloc[idx, :].mean(axis)
      else:
        m -= feature.iloc[idx, :].mean(axis)

      #sd1+sd2
      std += feature.iloc[idx, :].std(axis = axis, ddof = ddof)

    return np.asanyarray(m/std)

  else:
    snr_scores = [] #for storing the weighted scores
    #using the one vs all strategy for each class with
    for each in classes:
      idx = np.where(target == each)[0]
      idxn = np.where(target != each)[0]
      m = feature.iloc[idx, :].mean(axis) - feature.iloc[idxn, :].mean(axis)
      std = feature.iloc[idx, :].std(axis = axis, ddof = ddof) + feature.iloc[idxn, :].std(axis = axis, ddof = ddof) 
      snr_scores.append((m/std) * len(idx)/row) #weighted snr

    return np.asanyarray(snr_scores).sum(axis = axis)

In [12]:
def feature_ranking(score):
    """
    Rank features in descending order according to fisher score, the larger the fisher score, the more important the
    feature is
    """
    idx = np.argsort(score, 0)
    return idx[::-1]

In [13]:

relief_score=reliefF(feature,target,NEIGHBOURS)

mutual_inf=mutual_info_wrapper(feature,target)

mms=MinMaxScaler()
nfeature=mms.fit_transform(feature)
chi_score,p_val=chi2(nfeature,target)

p_corr = pearson_corr(feature, target)

f_score = fisher_score(feature.values, target)

tt_score = t_test(feature, target)

snr_score = signaltonoise(feature, target)

In [14]:
#The Features are sorted as per their scores
sorted_relief = feature_ranking(relief_score)[:p]
sorted_mi = feature_ranking(mutual_inf)[:p]
sorted_chi = feature_ranking(chi_score)[:p]
sorted_pc = feature_ranking(p_corr)[:p]
sorted_fs = feature_ranking(f_score)[:p]
sorted_tt = feature_ranking(tt_score)[:p]
sorted_snr = feature_ranking(snr_score)[:p]

In [15]:
#Can Skip this Cell

print("Features after sorting -")
print("\nSorted MI -",sorted_mi)
print("\nSorted Relief -",sorted_relief)
print("\nSorted Chi -",sorted_chi)
print("\nSorted Pearson Corr -",sorted_pc)
print("\nSorted Fisher Score -",sorted_fs)
print("\nSorted T-test -",sorted_tt)
print("\nSorted SNR - ", sorted_snr)

Features after sorting -

Sorted MI - [30281  4363 29141  9389 37463 32392 13914 30726 26339 38514  2415 37915
 24926 24779 30853  1857 31676  4286 18189 39599 16488 12514 12563 31856
 24211 11946  1342 21019 40435  6587 34674 34661 19234 31059  5203 28818
 25710 39774 39754 30769 21834 19262  1584 36138 15186 28673 20211 32340
  4084 21088 23949 27166 21280 11557 15969 25415   794 27637 37862 28495
  4309  1856 16079 35279  9417 30705 32861 21009 39725 39651  7712  6302
 10481 16564 10741 33908 15648 30753 37716 16335 19076 10376 26958  3527
  4357 15907 15152  4287 17304 22706 12214 14659 18279 13311 31976  2690
 21243 29885  5966  9114 34199 16261 18263 25227 27947 18145  2323 33043
 12255 26319 21066 35309 38181 19705  1068 36968  7232 29086  7747 27540
 15839  5562 13504 38671  5323 12153 40012 40458 39692 40413 40420 39370
 37354 14683 39172 29829   386  1210 28326 27952 18450 36597 35640 38756
 26787 32470 27650 31606  7862 27470 24519  7106 17509 20678   981  3155
 26778 35987 

In [16]:
def score(a,p,target):  
  if p==1:
    return mutual_info_wrapper(pd.DataFrame(a.reshape(-1,1)),target)
    
  if p==2:    
    ndf=pd.DataFrame()
    ndf[0]=a
    reliefa=reliefF(ndf,target,NEIGHBOURS,2)
    return reliefa
  
  if p==3:    
    from sklearn.preprocessing import MinMaxScaler
    mms=MinMaxScaler() 
    a=mms.fit_transform(a.reshape(-1,1))
    chia=chi2(a,target)[0]
    return chia
  
  if p==4:
    return pearson_corr(pd.DataFrame(a.reshape(-1, 1)), target)
  
  if p==5:
    return fisher_score(a.reshape(-1,1), target)
  
  if p==6:
    return t_test(a, target)
  
  if p==7:
    return signaltonoise(pd.DataFrame(a.reshape(-1,1)), target)

In [17]:
def get_clusters(genes,features,p,target):
  """
  genes - list of subset gene. These are the genes of picked by the score function. Please note that these are just the gene names. Their actual values are passed in the features dataframe
  features - the dataframe which contains the values of the genes
  p - this denotes the  type of score function. 1- mutual information, 2- reliefF, 3- chi square test.
  target - target is a pandas series of target clases for each observation
  """
  clusters={}
  cluster_gene={}
  x,y=0,0
  genes_copy_1=np.copy(genes)
  while(len(genes_copy_1)>0):
    # print("Starting New Iteration with", len(genes_copy_1),"number of genes!")
    genes_copy_2=np.copy(genes_copy_1)
    r_gene=genes_copy_2[0]
    r_gene_values=features[r_gene].values

    clusters[str(r_gene)]=[]
    
    genes_copy_2=np.delete(genes_copy_2,0)
    genes_copy_1=np.delete(genes_copy_1,0)
    
    
    
    r_score=score(r_gene_values,p,target)[0]
    
    # print("\nCluster number=",len(clusters))
    # print("First feature =========================j1=",r_gene,"\n")
    x+=1
    # print("Intial Relevance Score",r_score)

    while(len(genes_copy_2)>0):
      
      gs=genes_copy_2[0]
      gene=features[gs].values

      y+=1      
      
      a_plus=np.add(r_gene_values,gene,dtype='float64') #creating A+
      a_minus=np.subtract(r_gene_values,gene,dtype='float64') #Creating A-

      a_plus_score=score(a_plus,p,target)[0]
      a_minus_score=score(a_minus,p,target)[0]
      
      new_score=a_plus_score if a_plus_score>a_minus_score else a_minus_score
      # print("Gene",gs,"+ Score",a_plus_score,"- Score",a_minus_score)

      if new_score>r_score:

        if a_plus_score==new_score:

          # print("Gene Under Consideration",gs)
          # print("Initial Relevance",r_score,"Final Relevance",a_plus_score,r_score<a_plus_score)

          clusters[str(r_gene)].append(str(gs)+"+")
          r_gene_values=a_plus[:]
          r_score=a_plus_score

          # print("cluster member = +",gs,"\tRelevance Changed to",r_score)

        elif a_minus_score==new_score:

          # print("Gene Under Consideration",gs)
          # print("Initial Relevance",r_score,"Final Relevance",a_minus_score,r_score<a_minus_score)
          
          clusters[str(r_gene)].append(str(gs)+"-")
          r_gene_values=a_minus[:]
          r_score=a_minus_score

        #   print("cluster member = -",gs,"\tRelevance Changed to",r_score)
        # print("Gene",gs,"selected!",np.where(genes_copy_1 == gs))
        genes_copy_1 = np.delete(genes_copy_1, np.where(genes_copy_1 == gs))      
      genes_copy_2=np.delete(genes_copy_2,0)
    
    # for each in clusters[str(r_gene)]:
    #     genes_copy_1=np.delete(genes_copy_1,np.where(genes_copy_1==each))
    cluster_gene[r_gene]=r_gene_values

  #   print("\nFinal Relevance Score",r_score)
  print("Clusters formed! Returning Clusters and Gene Representatives")
  return clusters,cluster_gene

In [18]:
mi_cluster, gene_repre_1 = get_clusters(sorted_mi, feature, 1, target)
relief_cluster ,gene_repre_2 = get_clusters(sorted_relief, feature, 2, target)
chi_cluster, gene_repre_3 = get_clusters(sorted_chi, feature, 3, target)

Clusters formed! Returning Clusters and Gene Representatives
Clusters formed! Returning Clusters and Gene Representatives
Clusters formed! Returning Clusters and Gene Representatives


In [19]:
pc_cluster, gene_repre_4 = get_clusters(sorted_pc, feature, 4, target)
fs_cluster, gene_repre_5 = get_clusters(sorted_fs, feature, 5, target)
tt_cluster, gene_repre_6 = get_clusters(sorted_tt, feature, 6, target)
snr_cluster, gene_repre_7 = get_clusters(sorted_snr, feature, 7, target)

Clusters formed! Returning Clusters and Gene Representatives
Clusters formed! Returning Clusters and Gene Representatives
Clusters formed! Returning Clusters and Gene Representatives
Clusters formed! Returning Clusters and Gene Representatives


In [20]:
print("Number of MI Clusters formed -",len(mi_cluster))
print("Number of ReliefF Clusters formed -",len(relief_cluster))
print("Number of ChiSq. Clusters formed -",len(chi_cluster))
print("Number of Pearson Clusters formed -",len(pc_cluster))
print("Number of Fisher Score Clusters formed -",len(fs_cluster))
print("Number of T-Test Clusters formed -",len(tt_cluster))
print("Number of SNR Clusters formed -", len(snr_cluster))

Number of MI Clusters formed - 76
Number of ReliefF Clusters formed - 22
Number of ChiSq. Clusters formed - 27
Number of Pearson Clusters formed - 10
Number of Fisher Score Clusters formed - 10
Number of T-Test Clusters formed - 12
Number of SNR Clusters formed - 13


In [34]:
qmin = min([len(mi_cluster), len(relief_cluster), len(chi_cluster), len(pc_cluster), len(fs_cluster), len(tt_cluster), len(snr_cluster)])
q = q if q <= qmin else qmin
print("Choosing top %s Augmented Genes from each cluster"%(q))

Choosing top 5 Augmented Genes from each cluster


In [21]:
gene_repre_1 = pd.DataFrame(gene_repre_1)
gene_repre_2 = pd.DataFrame(gene_repre_2)
gene_repre_3 = pd.DataFrame(gene_repre_3)
gene_repre_4 = pd.DataFrame(gene_repre_4)
gene_repre_5 = pd.DataFrame(gene_repre_5)
gene_repre_6 = pd.DataFrame(gene_repre_6)
gene_repre_7 = pd.DataFrame(gene_repre_7)

In [24]:
"""
Saving the clusters to JSON files, this preserves the gene selection sequence
"""
with open('%s_p%smi_cluster.json'%(DATASET, p), 'w') as fp:
    json.dump(mi_cluster, fp)

with open('%s_p%srelief_cluster.json'%(DATASET, p), 'w') as fp:
    json.dump(relief_cluster, fp)


with open('%s_p%schi_cluster.json'%(DATASET, p), 'w') as fp:
    json.dump(chi_cluster, fp)

with open('%s_p%spc_cluster.json'%(DATASET, p), 'w') as fp:
    json.dump(pc_cluster, fp)

with open('%s_p%sfs_cluster.json'%(DATASET, p), 'w') as fp:
    json.dump(fs_cluster, fp)

with open('%s_p%stt_cluster.json'%(DATASET, p), 'w') as fp:
    json.dump(tt_cluster, fp)

with open('%s_p%ssnr_cluster.json'%(DATASET, p), 'w') as fp:
    json.dump(snr_cluster, fp)

In [25]:
def sort_keys(scores,gene_repre,target,flag=True):
  score_dict={}
  x=0
  for i in gene_repre.columns:
    score_dict[i]=scores[x]
    x+=1
  return [k for k, v in sorted(score_dict.items(), key=lambda item: item[1], reverse = True)]

In [35]:
"""
feature_ranking cannot be used here because it sorts and returns the indices 
from 0-1. They need to be sorted using a different function
"""
sorted_mi_keys=sort_keys(mutual_info_wrapper(gene_repre_1,target),gene_repre_1,target,True)[:q]

sorted_relief_keys=sort_keys(reliefF(gene_repre_2,target,k=NEIGHBOURS,repetitions=5),gene_repre_2,target,True)[:q]

mms=MinMaxScaler()
nfeature=mms.fit_transform(gene_repre_3)
chi_score,p_val=chi2(nfeature,target)
sorted_chi_keys=sort_keys(chi_score,gene_repre_3,target,False)[:q]

sorted_pc_keys=sort_keys(pearson_corr(gene_repre_4,target),gene_repre_4,target,True)[:q]

sorted_fs_keys=sort_keys(fisher_score(gene_repre_5.values,target),gene_repre_5,target,True)[:q]

sorted_tt_keys=sort_keys(t_test(gene_repre_6,target),gene_repre_6,target,True)[:q]

sorted_snr_keys = sort_keys(signaltonoise(gene_repre_7, target), gene_repre_7, target, True)[:q]

In [36]:
print("MI cluster after sorting - ",sorted_mi_keys)
print("Relief cluster after sorting - ",sorted_relief_keys)
print("Chi cluster after sorting - ",sorted_chi_keys)
print("Pearson cluster after sorting - ",sorted_pc_keys)
print("Fisher cluster after sorting - ",sorted_fs_keys)
print("T-Test cluster after sorting - ",sorted_tt_keys)
print("SNR cluster after sorting - ",sorted_snr_keys)

MI cluster after sorting -  [30281, 30726, 31676, 4286, 16488]
Relief cluster after sorting -  [39638, 20152, 27404, 36796, 36319]
Chi cluster after sorting -  [39602, 19160, 39743, 25633, 18588]
Pearson cluster after sorting -  [17976, 4363, 7291, 1172, 32817]
Fisher cluster after sorting -  [17976, 4363, 7291, 1172, 32817]
T-Test cluster after sorting -  [31676, 2415, 40278, 30206, 11562]
SNR cluster after sorting -  [10353, 17976, 12514, 38303, 11562]


In [40]:
#creating a Dataframe for containing the augmented gene keys
aug_df_keys = pd.DataFrame({"MI":sorted_mi_keys, "ReliefF":sorted_relief_keys, 
                     "Chi Sq":sorted_chi_keys, "Pearson":sorted_pc_keys, 
                     "Fisher":sorted_fs_keys, "tTest":sorted_tt_keys, 
                     "SNR":sorted_snr_keys})

aug_df_dict = {"MI":gene_repre_1, "ReliefF":gene_repre_2, 
                     "Chi Sq":gene_repre_3, "Pearson":gene_repre_4, 
                     "Fisher":gene_repre_5, "tTest":gene_repre_6, 
                     "SNR":gene_repre_7}
                     
print(aug_df_keys.head())
print(aug_df_dict)

      MI  ReliefF  Chi Sq  Pearson  Fisher  tTest    SNR
0  30281    39638   39602    17976   17976  31676  10353
1  30726    20152   19160     4363    4363   2415  17976
2  31676    27404   39743     7291    7291  40278  12514
3   4286    36796   25633     1172    1172  30206  38303
4  16488    36319   18588    32817   32817  11562  11562
{'MI':       30281    37463    13914    30726  ...    13400  34131  15622     7415 
0   0.11700 -2.26300  1.73300  2.19500  ...  0.09700 -4.729 -1.406  -0.21000
1  -0.36600  0.07400  1.14400  3.66100  ...  0.97500 -5.854 -0.500   0.08900
2   0.80500 -1.36700  2.06900  2.90500  ...  2.13486 -4.180 -2.114  -0.01500
3   0.70100 -1.19300  2.38800  1.18500  ...  0.27200 -4.814 -0.897   0.05200
4  -0.56800 -2.13600  2.21000  0.66300  ...  0.34400 -4.665 -1.511  -0.08900
5   0.59500 -2.25100  1.44000  1.86600  ...  0.83000 -3.563 -1.753  -0.00900
6   1.93100 -0.32100  1.52900  1.75900  ...  2.05100 -3.268 -1.226  -0.37900
7   5.75800 -0.15600  0.84500  2.74

In [44]:
LOOCV=LeaveOneOut()
data_KNN=KNeighborsClassifier(n_neighbors= int(feature.shape[0] ** 0.5))
data_SVM=SVC(kernel='rbf',gamma='scale')
data_NB=GaussianNB()
data_Tree= DecisionTreeClassifier()
rows=feature.shape[0]
classifiers=["NB","KNN","Tree","SVM"]

In [45]:
#Iterating over filters
for filter_name in aug_df_keys.columns:  
  acc_matrix = pd.DataFrame()
  for i in range(1,q+1):
    """
    Make a dataframe out of i keys from the gene representatives obtained from
    augmenting the chosen filter.
    Than use LOOCV to measure accuracy on Train Dataset.
    """
    acc=0
    individual_acc = np.zeros(4)
    cluster_df = aug_df_dict[filter_name].iloc[:, :i]

    # print(cluster_df.shape)

    for train_index,test_index in LOOCV.split(cluster_df):
      """
      Data is divided into train-test splits and then polling method is used 
      to find the classification results (ensemble of KNN,SVM,NB,Decision Tree)
      """
      train_data,train_labels=cluster_df.iloc[train_index,:],target[train_index]
      test_data,test_labels=cluster_df.iloc[test_index,:],target[test_index].values.tolist()[0]
      data_KNN.fit(train_data,train_labels)
      data_SVM.fit(train_data,train_labels)
      data_NB.fit(train_data,train_labels)
      data_Tree.fit(train_data,train_labels)

      class_list = [data_NB, data_KNN, data_Tree, data_SVM]
      results=[]

      #getting individual results
      for x in range(4):
        tem_result = class_list[x].predict(test_data)[0]
        if tem_result == test_labels:
          individual_acc[x]+=1
        results.append(tem_result)
      polling_result=0
      max_freq=0

      #getting ensemble results
      for x in results:
        freq=results.count(x)
        if freq>max_freq:
          max_freq=freq
          polling_result=x
      if polling_result == test_labels:
        acc+=1

    individual_acc = np.round(individual_acc/cluster_df.shape[0],4)
    individual_acc = np.append(individual_acc, np.round(acc/cluster_df.shape[0],4))
    # print(individual_acc)

    acc_matrix[i] = individual_acc
  acc_matrix = acc_matrix.T

  acc_matrix.columns = classifiers[:]+['Ensemble']

  print("\nFilter:-",filter_name, "\n",acc_matrix)


Filter:- MI 
        NB     KNN    Tree     SVM  Ensemble
1  1.0000  1.0000  1.0000  1.0000       1.0
2  0.9796  1.0000  1.0000  1.0000       1.0
3  1.0000  1.0000  0.9796  1.0000       1.0
4  1.0000  1.0000  0.9796  1.0000       1.0
5  1.0000  0.9796  1.0000  0.9796       1.0

Filter:- ReliefF 
        NB     KNN    Tree     SVM  Ensemble
1  0.7347  0.7347  0.6122  0.6327    0.7347
2  0.7551  0.7347  0.6531  0.7959    0.7551
3  0.7551  0.7551  0.6735  0.7755    0.7551
4  0.7143  0.7755  0.6735  0.7755    0.7143
5  0.6735  0.7959  0.5918  0.8163    0.6939

Filter:- Chi Sq 
        NB     KNN    Tree     SVM  Ensemble
1  0.7959  0.8163  0.7959  0.7959    0.7959
2  0.7755  0.8367  0.8163  0.8367    0.8367
3  0.7755  0.8367  0.7959  0.8163    0.8163
4  0.8776  0.9796  0.9796  0.9796    0.9796
5  0.8571  0.9796  0.9796  0.9796    0.9796

Filter:- Pearson 
     NB  KNN  Tree  SVM  Ensemble
1  1.0  1.0   1.0  1.0       1.0
2  1.0  1.0   1.0  1.0       1.0
3  1.0  1.0   1.0  1.0       1.0
4 