### In this notebook I want to test a few simple classifiers (ZeroR, OneR, Naive Bayes, etc.) to establish some classification baselines.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [17]:
#Set file paths to data
readcounts_file_path = 'https://raw.githubusercontent.com/ekofman/CSE283_classifier_project/master/data/readcounts_96_nodup.tsv'
patient_info_file_path = 'https://github.com/ekofman/CSE283_classifier_project/raw/master/data/patient_info.csv'

In [23]:
#Import data
readcounts = pd.read_csv(readcounts_file_path, sep="\t")
patient_info = pd.read_csv(patient_info_file_path)

In [29]:
#Confirm data properly imported
#readcounts
patient_info

| Unnamed: 0 | sampleID | poiseid | fu | Final.Library.Conc..nM. | Final.Library.Conc..ng.ul. | sample.name.in.MiniSeq | sample_id | On.Plate | readcount | avglength | ... | cancertype | cancerstage_cat | er_cat | pr_cat | her2_cat | chemo | datechemostart | datechemoend | daterecurrence | recurStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C1 | 2006 | 2 | 81.01 | 15.83 | ZZ-20170524-Cellfree-10-01 | S01_B14 | A1 | 435826 | 75 | ... | Ductal | 3 | 0 | 0 | 0 | AC/T | 31/12/2008 | 8/4/09 | 29/06/2009 | R |
| 1 | C2 | 2006 | 3 | 35.99 | 7.03 | ZZ-20170524-Cellfree-10-02 | S02_B14 | A2 | 563731 | 75 | ... | Ductal | 3 | 0 | 0 | 0 | AC/T | 31/12/2008 | 8/4/09 | 29/06/2009 | R |
| 2 | C3 | 2010 | 3 | 70.89 | 13.85 | ZZ-20170524-Cellfree-10-03 | S03_B14 | A3 | 457102 | 74 | ... | Ductal | 2 | 0 | 0 | 0 | AC/T | 7/1/09 | 20/05/2009 | 9/11/10 | R |
| 3 | C4 | 2010 | 4 | 114.20 | 22.31 | ZZ-20170524-Cellfree-10-04 | S04_B14 | A4 | 638280 | 75 | ... | Ductal | 2 | 0 | 0 | 0 | AC/T | 7/1/09 | 20/05/2009 | 9/11/10 | R |
| 4 | C5 | 2011 | 2 | 108.95 | 21.28 | ZZ-20170524-Cellfree-10-05 | S05_B14 | A5 | 434579 | 75 | ... | Ductal | 3 | 1 | 1 | 0 | AC/T | 12/1/09 | 27/04/2009 | 22/12/2010 | R |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 91 | C92 | 2005 | 3 | 14.16 | 2.77 | ZZ-20170524-Cellfree-10-92 | S92_B14 | H8 | 351558 | 74 | ... | Ductal | 2 | 1 | 1 | 0 | TC | 12/11/08 | 14/01/2009 | 1/3/25 | N |
| 92 | C93 | 2003 | 2 | 114.52 | 22.37 | ZZ-20170524-Cellfree-10-93 | S93_B14 | H9 | 446446 | 75 | ... | Ductal | 2 | 1 | 1 | 0 | AC/T | 24/10/2008 | 12/2/09 | 1/3/25 | N |
| 93 | C94 | 2003 | 3 | 67.67 | 13.22 | ZZ-20170524-Cellfree-10-94 | S94_B14 | H10 | 367318 | 75 | ... | Ductal | 2 | 1 | 1 | 0 | AC/T | 24/10/2008 | 12/2/09 | 1/3/25 | N |
| 94 | C95 | 2002 | 2 | 38.65 | 7.55 | ZZ-20170524-Cellfree-10-95 | S95_B14 | H11 | 363729 | 75 | ... | Ductal | 3 | 1 | 1 | 1 | AC/T | 13/10/2008 | 19/01/2009 | 1/3/25 | N |


#### ZeroR Classifier
The ZeroR classifier is one of the simplest classifiers possible and is primarily used to establish a classification accuracy baseline. It uses none (zero) of the predictors (features) and depends only on the response variable (here, recurrence status). It works by identifying the most frequent response class and assigning that class to every sample, so its accuracy is simply the fraction of samples that belong to the most frequent class.

In [36]:
#First, count how often each response class (recurrence status) occurs.
from collections import Counter
actual_counts = Counter(patient_info['recurStatus'])

#ZeroR labels every sample with the most frequent class, so its accuracy
#is the fraction of samples that belong to that class.
accuracy_of_zeror = max(actual_counts.values())/sum(actual_counts.values())
accuracy_of_zeror

0.7083333333333334
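
The cell above reports only the baseline accuracy. As a minimal sketch (not part of the original analysis), the same idea can be wrapped in a small helper that also returns which class ZeroR would predict. The function name `zero_r` and the toy labels are assumptions made for illustration; the toy labels only mirror the `R`/`N` codes in `recurStatus` and are not the study data.

```python
from collections import Counter

def zero_r(labels):
    """Return (majority_label, accuracy) for a ZeroR baseline.

    ZeroR ignores all predictors and always predicts the most frequent
    class, so its accuracy is that class's share of the samples.
    """
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / sum(counts.values())

# Toy labels for illustration only (NOT the study data).
toy_labels = ['N', 'N', 'R', 'N', 'R', 'N', 'N', 'R']
label, accuracy = zero_r(toy_labels)
print(label, accuracy)
```

Because the helper accepts any iterable of labels, it should also work when passed a column such as `patient_info['recurStatus']`. If two classes are tied, `most_common` returns the one encountered first.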

#### OneR Classifier
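The OneR classifier is another simple classifier, but it can reportedly be competitive with more complex algorithms. It uses the single best predictor variable (input attribute) to make all classifications: for each predictor it builds a rule that maps every value of that predictor to its most frequent class, then it keeps the predictor whose rule makes the fewest errors. 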
See these links for more on this approach:  
https://www.youtube.com/watch?v=phnkMGDZUNI&list=PLea0WJq13cnCS4LLMeUuZmTxqsqlhwUoe&index=4  
https://www.youtube.com/watch?v=bAqU3-1FsPA  
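
To make the OneR idea concrete, here is a minimal sketch of the standard algorithm on categorical predictors. It is an illustration under assumptions, not the implementation used in this project: the function name `one_r`, the list-of-dicts input layout, and the gene names `geneA`/`geneB` are all hypothetical, and the toy data is invented.

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """Pick the one feature whose value -> majority-class rule
    makes the fewest errors on the training data.

    rows:   list of dicts mapping feature name -> categorical value
    labels: list of class labels, in the same order as rows
    Returns (best_feature, rule, training_accuracy).
    """
    best = None
    for feature in rows[0]:
        # Count class labels separately for each value of this feature.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[feature]][label] += 1
        # The rule maps each feature value to its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Number of training samples this rule classifies correctly.
        correct = sum(c[rule[value]] for value, c in counts.items())
        if best is None or correct > best[2]:
            best = (feature, rule, correct)
    feature, rule, correct = best
    return feature, rule, correct / len(labels)

# Toy data for illustration only (hypothetical genes, NOT the study data).
toy_rows = [
    {'geneA': 'high', 'geneB': 'low'},
    {'geneA': 'high', 'geneB': 'high'},
    {'geneA': 'low',  'geneB': 'low'},
    {'geneA': 'low',  'geneB': 'high'},
    {'geneA': 'high', 'geneB': 'low'},
    {'geneA': 'low',  'geneB': 'high'},
]
toy_labels = ['R', 'R', 'N', 'N', 'R', 'N']

feature, rule, accuracy = one_r(toy_rows, toy_labels)
print(feature, rule, accuracy)
```

Note that the accuracy returned here is measured on the same data used to build the rule, so it is likely optimistic. A held-out set or cross-validation would give a fairer comparison against the ZeroR baseline.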

Here's how I'll apply it to this problem:  
First, I think it'll be easier to test this algorithm if the read counts are split into two groups (e.g. high and low based on whether they are above or below the mean read count for that gene). 
Next 
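
The high/low split described in the first step above could be sketched as follows. This is only an illustration under assumptions: the layout of `readcounts` is not shown in this notebook, so the sketch takes a plain dict of per-gene counts, and the function name `binarize_by_mean` and the gene names are hypothetical. Counts exactly equal to the mean are labeled `'low'`, which is an arbitrary choice.

```python
from statistics import mean

def binarize_by_mean(counts_by_gene):
    """Label each read count 'high' if it is above that gene's mean
    read count, otherwise 'low' (ties go to 'low' by assumption).

    counts_by_gene: dict mapping gene name -> list of read counts,
                    one entry per sample.
    """
    binarized = {}
    for gene, counts in counts_by_gene.items():
        gene_mean = mean(counts)
        binarized[gene] = ['high' if c > gene_mean else 'low' for c in counts]
    return binarized

# Toy counts for illustration only (hypothetical genes, NOT the study data).
toy_counts = {
    'geneA': [10, 50, 30, 70],   # mean 40
    'geneB': [5, 5, 5, 25],      # mean 10
}
print(binarize_by_mean(toy_counts))
```

The `geneB` example shows one caveat of a mean threshold: a single large count pulls the mean up, so most samples end up labeled `'low'`. The median is a common alternative when counts are skewed.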

#### Naive Bayesian Classifier