## Machine Learning for Thyroid Cancer Diagnosis
## Part 2: Logistic regression
**The project was done with Rajiv Krishnakumar and Raghu Mahajan.**

The essential goal was to predict thyroid cancer from gene expression levels. A key hope is to definitively identify benign samples; this helps to avoid unnecessary surgeries, which often turn out to be much more harmful to a patient's health than the thyroid cancer itself.


- The data used here is pre-normalized to zero mean and unit standard deviation.
- The core data set consists of 265 patients whose biopsies were inconclusive, each with 173 reported gene expression levels.
- A further 102 patients had 'conclusive' biopsies, i.e. a human determination of benign vs. malignant, giving 367 patients in total.
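
The pre-normalization mentioned in the first bullet can be sketched as below. This is only an illustration on synthetic data, assuming a column-wise (per-gene) z-score; the notebook does not show how the normalization was actually performed, and `standardize` and `raw` are hypothetical names.

```python
import numpy as np

def standardize(raw):
    """Column-wise z-score: subtract each column's mean, divide by its std."""
    raw = np.asarray(raw, dtype=float)
    mean = raw.mean(axis=0)
    std = raw.std(axis=0)
    # Guard against constant columns (std == 0) to avoid division by zero
    std[std == 0] = 1.0
    return (raw - mean) / std

# Synthetic stand-in for an expression matrix: 265 samples x 173 genes
rng = np.random.RandomState(0)
raw = rng.normal(loc=5.0, scale=2.0, size=(265, 173))

normalized = standardize(raw)
col_means = normalized.mean(axis=0)  # each entry is numerically close to 0
col_stds = normalized.std(axis=0)    # each entry is numerically close to 1
```

After this step every gene is on the same scale, which matters for regularized models because the penalty treats all coefficients alike.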

Here is an abstract from our final report:

*We investigate the use of high throughput gene expression data in the diagnosis of thyroid cancers. Using logistic regression and support vector machines (SVMs), we develop a classifier which gives similar performance (89% sensitivity and 80% specificity) to the currently best-known classifier, but uses significantly fewer features. We used two different techniques, principal components analysis and mutual information score, to select features. The results do not depend significantly on which method is used for feature selection.*

The breakdown of topics covered in each notebook is as follows:
1. Data visualization, including PCA and tSNE visualizations.
2. Logistic regression, discussing the use of feature selection via mutual information vs. use of different regularizers.
3. SVMs with and without box constraints, and also using different kernel functions.
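
For readers unfamiliar with feature selection via mutual information (item 2), here is a minimal sketch. It assumes scikit-learn's `SelectKBest` together with `mutual_info_classif`, runs on synthetic data, and uses an arbitrary `k = 10`; it is not necessarily the procedure used in the project, and `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from functools import partial
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data: 265 samples x 173 features, binary labels
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(265, 173))
# Make the label depend on the first three features so they carry signal
y_demo = (X_demo[:, 0] + X_demo[:, 1] + X_demo[:, 2] > 0).astype(int)

# Score each feature by its estimated mutual information with the label
# and keep the k highest-scoring features (k = 10 is an arbitrary choice)
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_func, k=10)
X_selected = selector.fit_transform(X_demo, y_demo)

# Column indices of the features that were kept
selected_columns = selector.get_support(indices=True)
```

In a real analysis the selector should be fit on the training split only, so that the test set does not influence which features are chosen.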

In [39]:
#As usual import some modules and import the dataset
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#import the data and look at it
X = pd.read_csv("data/normalized_data_265.csv", header=None)
y = pd.read_csv("data/outcome_265.csv", header=None)

X_full = pd.read_csv("data/normalized_data_367.csv", header=None)
y_full = pd.read_csv("data/outcome_367.csv", header=None)

#Now turn these into numpy arrays to avoid problems with pandas dataframes
#(.values replaces as_matrix(), which was removed in pandas 1.0)
X = X.values
y = y.values.ravel()
X_full = X_full.values
y_full = y_full.values.ravel()


## Logistic regression 

Logistic regression is a simple and straightforward first pass.

In [70]:
#train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

#First perform a train-test split (80% train, 20% test)
#No random_state is set, so the split and the scores below change from run to run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

print("-----Results of train-test split-----")
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

#Now perform the logistic regression (scikit-learn defaults: L2 penalty, C = 1.0)
logit_bare = LogisticRegression(random_state = 1, solver = 'liblinear')
logit_bare.fit(X_train, y_train)

print("")
print("-----Logistic Regression-----")
print("")
print("Training accuracy: {}".format(logit_bare.score(X_train, y_train)))
print("Test set accuracy: {}".format(logit_bare.score(X_test, y_test)))

-----Results of train-test split-----
(212, 173)
(212,)
(53, 173)
(53,)

-----Logistic Regression-----

Training accuracy:  1.0
Test set accuracy:  0.754716981132


Ok - there is clearly some overfitting going on. We have 100% accuracy on the training set, but only 70-80% accuracy on the test set (about 75% in this run; the split is random, so the exact figure varies). Let's break it down further by looking at two metrics commonly used in biology:

$\text{sensitivity} = \frac{\text{true positives}}{\text{actual positives}}$ and
$\text{specificity} = \frac{\text{true negatives}}{\text{actual negatives}}$
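
As a small worked illustration of these two definitions, the sketch below counts the outcomes by hand for a handful of made-up labels. The labels are hypothetical (1 = positive, 0 = negative) and `sensitivity_specificity` is an illustrative helper, not part of the project code.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    sensitivity = tp / float(tp + fn)
    specificity = tn / float(tn + fp)
    return sensitivity, specificity

# Hypothetical labels: 4 actual positives followed by 5 actual negatives
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true_demo, y_pred_demo)
# sens == 0.75 (3 of 4 positives caught), spec == 0.8 (4 of 5 negatives caught)
```

In the thyroid setting, high specificity means few benign samples are wrongly flagged as malignant, while high sensitivity means few malignant samples are missed.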

In [71]:
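from sklearn.metrics import confusion_matrix

y_pred_test = logit_bare.predict(X_test)
y_pred_full = logit_bare.predict(X)

#Rows are true labels and columns are predicted labels; row 0 (the smaller
#label) is treated as the negative class and row 1 as the positive class
confusion_test = confusion_matrix(y_test, y_pred_test)
#Note: X also contains the training samples, so the "Total" figures are optimistic
confusion_full = confusion_matrix(y, y_pred_full)

def specificity(cm):
    #true negatives / all actual negatives
    return cm[0,0]/(1.0*(cm[0,0] + cm[0,1]))

def sensitivity(cm):
    #true positives / all actual positives
    return cm[1,1]/(1.0*(cm[1,0] + cm[1,1]))

print("Test Specificity = {}".format(specificity(confusion_test)))
print("Test Sensitivity = {}".format(sensitivity(confusion_test)))

print("Total Specificity = {}".format(specificity(confusion_full)))
print("Total Sensitivity = {}".format(sensitivity(confusion_full)))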

Test Specificity = 0.775
Test Sensitivity = 0.692307692308
Total Specificity = 0.95
Total Sensitivity = 0.952941176471
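
One common way to probe a gap like this between training and test performance is to vary the regularization strength and compare cross-validated accuracy. The sketch below is a hedged illustration on synthetic data with the same shape as ours; the grid of `C` values is arbitrary, `X_demo` and `y_demo` are hypothetical, and the numbers it produces say nothing about the thyroid data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 265 samples x 173 features, noisy binary labels
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(265, 173))
signal = X_demo[:, :5].sum(axis=1) + rng.normal(scale=1.0, size=265)
y_demo = (signal > 0).astype(int)

# Smaller C means stronger L2 regularization in scikit-learn
results = {}
for C in [0.001, 0.01, 0.1, 1.0]:
    model = LogisticRegression(C=C, penalty='l2', solver='liblinear',
                               random_state=1)
    # Mean accuracy over 5 cross-validation folds
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[C] = scores.mean()

# The C value with the highest mean cross-validated accuracy
best_C = max(results, key=results.get)
```

Cross-validation averages over several splits, so it gives a steadier estimate than a single 80/20 split of only 265 samples.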
