### In this notebook I test a few simple classifiers (ZeroR, OneR, Naive Bayes, etc.) to establish classification baselines.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [17]:
#Set file paths to data
readcounts_file_path = 'https://raw.githubusercontent.com/ekofman/CSE283_classifier_project/master/data/readcounts_96_nodup.tsv'
patient_info_file_path = 'https://github.com/ekofman/CSE283_classifier_project/raw/master/data/patient_info.csv'

In [23]:
#Import data
readcounts = pd.read_csv(readcounts_file_path, sep="\t")
patient_info = pd.read_csv(patient_info_file_path)

In [198]:
#Confirm data properly imported
#readcounts
patient_info

Unnamed: 0,sampleID,poiseid,fu,Final.Library.Conc..nM.,Final.Library.Conc..ng.ul.,sample.name.in.MiniSeq,sample_id,On.Plate,readcount,avglength,...,cancertype,cancerstage_cat,er_cat,pr_cat,her2_cat,chemo,datechemostart,datechemoend,daterecurrence,recurStatus
0,C1,2006,2,81.01,15.83,ZZ-20170524-Cellfree-10-01,S01_B14,A1,435826,75,...,Ductal,3,0,0,0,AC/T,31/12/2008,8/4/09,29/06/2009,R
1,C2,2006,3,35.99,7.03,ZZ-20170524-Cellfree-10-02,S02_B14,A2,563731,75,...,Ductal,3,0,0,0,AC/T,31/12/2008,8/4/09,29/06/2009,R
2,C3,2010,3,70.89,13.85,ZZ-20170524-Cellfree-10-03,S03_B14,A3,457102,74,...,Ductal,2,0,0,0,AC/T,7/1/09,20/05/2009,9/11/10,R
3,C4,2010,4,114.20,22.31,ZZ-20170524-Cellfree-10-04,S04_B14,A4,638280,75,...,Ductal,2,0,0,0,AC/T,7/1/09,20/05/2009,9/11/10,R
4,C5,2011,2,108.95,21.28,ZZ-20170524-Cellfree-10-05,S05_B14,A5,434579,75,...,Ductal,3,1,1,0,AC/T,12/1/09,27/04/2009,22/12/2010,R
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,C92,2005,3,14.16,2.77,ZZ-20170524-Cellfree-10-92,S92_B14,H8,351558,74,...,Ductal,2,1,1,0,TC,12/11/08,14/01/2009,1/3/25,N
92,C93,2003,2,114.52,22.37,ZZ-20170524-Cellfree-10-93,S93_B14,H9,446446,75,...,Ductal,2,1,1,0,AC/T,24/10/2008,12/2/09,1/3/25,N
93,C94,2003,3,67.67,13.22,ZZ-20170524-Cellfree-10-94,S94_B14,H10,367318,75,...,Ductal,2,1,1,0,AC/T,24/10/2008,12/2/09,1/3/25,N
94,C95,2002,2,38.65,7.55,ZZ-20170524-Cellfree-10-95,S95_B14,H11,363729,75,...,Ductal,3,1,1,1,AC/T,13/10/2008,19/01/2009,1/3/25,N


#### ZeroR Classifier
The ZeroR classifier is one of the simplest classifiers possible and is primarily used to establish a baseline classification accuracy. It uses none (zero) of the predictors (features) and depends only on the response variable (here, recurrence status). It works by identifying the most frequent response and assigning that label to every sample.

In [36]:
#First, count how often each value of the response variable occurs.
actual_counts = {}
for status in patient_info['recurStatus']:
    if status not in actual_counts:
        actual_counts[status] = 1
    else:
        actual_counts[status]+=1

#Determine the accuracy when all samples are labeled with the most frequent response (i.e. the accuracy of ZeroR)
accuracy_of_zeror = max(actual_counts.values())/sum(actual_counts.values())
accuracy_of_zeror

0.7083333333333334
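
As a cross-check on the counting loop above, here is a minimal sketch of the same ZeroR calculation using `value_counts`. The helper name `zero_r` and the toy label series are illustrative assumptions, not part of the original notebook; the 68 N / 28 R split is inferred from the 0.708 accuracy and from the constant 68 used in the OneR cell.

```python
import pandas as pd

def zero_r(labels: pd.Series):
    """Return the majority class and the accuracy of always predicting it."""
    counts = labels.value_counts()
    majority_class = counts.idxmax()
    accuracy = counts.max() / counts.sum()
    return majority_class, accuracy

# Toy labels standing in for patient_info['recurStatus'] (assumed 68 N, 28 R).
toy_status = pd.Series(['N'] * 68 + ['R'] * 28)
majority_class, accuracy = zero_r(toy_status)
print(majority_class, accuracy)
```

On the real data, the call would presumably be `zero_r(patient_info['recurStatus'])`.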

#### OneR Classifier
The OneR classifier is another simple classifier that is reported to be competitive with more complex algorithms on some datasets. It uses the single best predictor variable (input attribute) to make all classifications.
See these links for more on this approach:  
* https://www.youtube.com/watch?v=phnkMGDZUNI&list=PLea0WJq13cnCS4LLMeUuZmTxqsqlhwUoe&index=4  
* https://www.youtube.com/watch?v=bAqU3-1FsPA  

Here's how I created a OneR classifier for this data:  
First, I removed all rows (genes) that had an expression level of 0 in 19 or more of the 96 samples (roughly 20%, an arbitrary cutoff).  
Next, for each gene (each gene is treated as a predictor variable), I sorted the expression levels from low to high and then tested the accuracy of every possible low/high partition, calling the low group non-recurrent and the high group recurrent.  
The gene and partition with the highest accuracy would typically be selected as the classification scheme and tested on other data. Here, the accuracy is computed on the same 96 samples that were used to choose the partition.  
* One thing potentially worth revisiting in my approach is that I allowed the low/high partition to fall after any number of samples. It might be better to force the split into groups whose sizes equal the numbers of non-recurrent and recurrent samples. For example, the best gene achieved its accuracy with the 85 lowest-expression samples (of 96) being called non-recurrent, which might not be appropriate. 
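
The partition search described above can be sketched for a single gene as follows. This is a simplified illustration rather than the notebook's implementation: the function name `best_threshold` and the toy data are made up, and unlike the cell below it only tries cutoffs at observed expression values, so samples with tied expression are never split across the two groups.

```python
import numpy as np

def best_threshold(expression, labels, low_label='N', high_label='R'):
    """Find the cutoff on one feature that maximises training accuracy.

    Samples with expression <= cutoff are called low_label, the rest high_label.
    """
    expression = np.asarray(expression, dtype=float)
    labels = np.asarray(labels)
    best_cutoff, best_accuracy = None, -1.0
    for cutoff in np.unique(expression):
        predicted = np.where(expression <= cutoff, low_label, high_label)
        accuracy = np.mean(predicted == labels)
        # Strict '>' keeps the lowest cutoff when several are tied.
        if accuracy > best_accuracy:
            best_cutoff, best_accuracy = cutoff, accuracy
    return best_cutoff, best_accuracy

# Made-up read counts for one hypothetical gene across eight samples.
toy_expression = [0, 1, 1, 3, 4, 6, 7, 9]
toy_labels = ['N', 'N', 'N', 'N', 'R', 'N', 'R', 'R']
cutoff, accuracy = best_threshold(toy_expression, toy_labels)
print(cutoff, accuracy)
```

Running this per gene and keeping the gene with the highest accuracy would give a OneR rule of the same form (low expression called N, high expression called R).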

In [189]:
#Keep only rows of the dataframe "readcounts" with an expression level of 0 in fewer than 19 of the 96 samples (i.e. drop genes that are 0 in roughly 20% or more of the samples)
readcounts_fewer_zeros = readcounts[(readcounts == 0).astype(int).sum(axis=1) < 19]
len(readcounts_fewer_zeros)
readcounts_fewer_zeros

Unnamed: 0,gene_id,C1,C2,C3,C4,C5,C6,C7,C8,C9,...,C87,C88,C89,C90,C91,C92,C93,C94,C95,C96
0,ENSG00000000003,3,18,19,13,9,9,13,5,10,...,6,5,2,5,6,0,18,7,21,12
3,ENSG00000000457,4,1,6,4,9,20,4,0,0,...,12,11,17,0,21,32,13,2,9,6
4,ENSG00000000460,19,14,24,6,7,9,6,2,10,...,10,12,10,1,14,1,17,1,2,4
6,ENSG00000000971,25,12,8,18,39,14,21,10,18,...,14,17,8,6,17,4,30,18,12,13
8,ENSG00000001084,10,3,10,8,38,6,8,15,11,...,28,14,10,0,1,1,14,16,17,23
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60588,ENSG00000282935,0,8,5,3,1,3,3,1,7,...,1,12,0,0,3,2,0,3,6,1
60601,ENSG00000282961,13,4,22,14,24,15,6,6,29,...,20,23,9,12,15,23,9,12,24,24
60610,ENSG00000282987,12,1,2,7,8,4,11,0,2,...,0,4,1,3,4,0,5,1,12,1
60618,ENSG00000282998,1,2,0,7,3,6,0,5,4,...,2,2,8,1,0,0,2,13,3,6
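
The filter above hard-codes the count 19, which only makes sense for 96 samples. A possible alternative, sketched below under assumptions, expresses the cutoff as a fraction. The function `drop_sparse_genes`, its parameters, and the toy table are hypothetical. Note one difference from the cell above: this sketch keeps genes whose zero fraction is exactly at the cutoff, whereas `< 19` drops genes with exactly 19 zeros (19.8% of 96).

```python
import pandas as pd

def drop_sparse_genes(counts: pd.DataFrame, max_zero_fraction: float = 0.2,
                      id_column: str = 'gene_id') -> pd.DataFrame:
    """Keep genes whose fraction of zero-count samples is at most max_zero_fraction."""
    sample_columns = counts.columns.drop(id_column)
    # Fraction of samples with a read count of exactly 0, per gene (row).
    zero_fraction = (counts[sample_columns] == 0).mean(axis=1)
    return counts[zero_fraction <= max_zero_fraction]

# Toy table: three hypothetical genes measured in five samples.
toy_counts = pd.DataFrame({
    'gene_id': ['geneA', 'geneB', 'geneC'],
    'C1': [3, 0, 5],
    'C2': [1, 0, 0],
    'C3': [4, 2, 7],
    'C4': [2, 0, 1],
    'C5': [6, 1, 2],
})
filtered = drop_sparse_genes(toy_counts)
print(filtered['gene_id'].tolist())
```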


In [199]:
max_accuracy = []
for row in range(len(readcounts_fewer_zeros)):
    #For each row of data, I make a list of lists in which each element of the main list contains two elements. The
    #first is the expression level for that sample and the second is whether that sample was actually recurrent or
    #non-recurrent. I then sort the list of lists from low expression levels to high. 
    expression_and_status = list(map(list, sorted(tuple(zip(list(readcounts_fewer_zeros.iloc[row][1:97]), list(patient_info['recurStatus']))))))
    count = 0
    matches = []
    accuracy = []
    for sample in range(0, 96):
        #As I iterate through the sorted samples, I keep a running count of the non-recurrent samples seen so far
        if expression_and_status[sample][1] == 'N':
            count+=1
            matches.append(count)
        #If a given sample isn't non-recurrent (an N), I don't increment the count. 
        else:
            matches.append(count)
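    #For each low/high cutoff, I compute the accuracy by dividing the number of correctly called samples by the
    #total number of samples (96). With the first j+1 samples called N and the rest called R, the correct calls are
    #the number of N in the first partition (matches[j]) plus the number of R in the second partition. The latter is
    #the total number of R (96-68 = 28) minus the number of R that appeared in the first partition (j+1-matches[j]).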
    for j in range(0, 96):
        accuracy.append((matches[j]+95-68-j+matches[j])/96)
    #The max_accuracy stores the highest accuracy result from each row. 
    max_accuracy.append(max(accuracy))
print(max(max_accuracy)) #This gives the accuracy of the highest accuracy gene.  
best_gene_index = max_accuracy.index(max(max_accuracy)) #This identifies the index of the gene with the highest accuracy
readcounts_fewer_zeros.iloc[best_gene_index][0:97] #This returns the data for the gene with the highest accuracy

0.8020833333333334


gene_id    ENSG00000165175
C1                       6
C2                       3
C3                       4
C4                       4
                ...       
C92                      0
C93                      3
C94                      5
C95                      7
C96                      0
Name: 11546, Length: 97, dtype: object
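
For reference, the accuracy expression used in the loop above can be unpacked as follows. This is a reading of the code, assuming 96 samples of which 68 are non-recurrent (the constants hard-coded in the cell). The symbols are introduced here for illustration only: $j$ is the 0-based index of the last sample in the low partition, $n_j$ is `matches[j]`, and $N_R$ is the total number of recurrent samples.

$$
\begin{aligned}
\text{correct}(j) &= \underbrace{n_j}_{\text{N in low partition}} + \underbrace{N_R - \bigl((j+1) - n_j\bigr)}_{\text{R in high partition}} \\
&= 2n_j + N_R - 1 - j \\
&= 2n_j + 27 - j \qquad (N_R = 96 - 68 = 28), \\
\text{accuracy}(j) &= \frac{\text{correct}(j)}{96} = \frac{2n_j + 95 - 68 - j}{96}.
\end{aligned}
$$

As a sanity check, at $j = 95$ every sample is called N, so $n_j = 68$ and the accuracy is $(136 + 27 - 95)/96 = 68/96$, which matches the ZeroR result.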

In [None]:
#An inferior approach that changed the expression level entry into a recurrence status (N or R) based on the 
#current partition and then checked whether the status matched the original status. 
max_accuracy = []
for row in range(1):
    expression_and_status = list(map(list, sorted(tuple(zip(list(readcounts.iloc[row][1:97]), list(patient_info['recurStatus']))))))
    matches = 0
    accuracy = []
    for sample in range(len(expression_and_status)):
        for j in range(0, 96):
            if sample >= j:
                expression_and_status[j][0] = 'N'
            else:
                expression_and_status[j][0] = 'R'
            if expression_and_status[j][0] == expression_and_status[j][1]:
                matches+=1
        accuracy.append(matches/96)
        matches = 0
    max_accuracy.append(max(accuracy))
max(max_accuracy)

#### Naive Bayesian Classifier