## Machine Learning for Thyroid Cancer Diagnosis
## Part 2: Logistic regression
**The project was done with Rajiv Krishnakumar and Raghu Mahajan.**

The goal was to predict thyroid cancer from gene expression data. A key hope is to identify benign samples with high confidence, which helps avoid unnecessary surgeries; these often turn out to be more harmful to a patient's health than the thyroid cancer itself.


- The data used here are pre-normalized to mean 0 and standard deviation 1.
- The core data set consists of 265 patients whose biopsies were inconclusive, each with 173 reported gene expression levels.
- A further 102 patients had 'conclusive' biopsies, i.e. a human determination of benign vs. malignant, giving 367 patients in total.
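
As a quick sanity check on the normalization claim, one could verify the mean and standard deviation directly. The sketch below is only illustrative: it assumes the normalization was applied per gene (per column), which the text does not state, and it uses synthetic data (`raw`) in place of the real CSV files.

```python
import numpy as np

# Synthetic stand-in for raw expression data: 265 patients x 173 genes.
rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=2.0, size=(265, 173))

# Standardize each column (gene) to mean 0 and standard deviation 1.
# Assumption: per-column normalization; the original preprocessing is not shown.
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# After standardization every column should have mean ~0 and std ~1.
col_means = standardized.mean(axis=0)
col_stds = standardized.std(axis=0)
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
```

The same two `allclose` checks could be run on the loaded data to see whether the per-column assumption holds.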

Here is an abstract from our final report:

*We investigate the use of high throughput gene expression data in the diagnosis of thyroid cancers. Using logistic regression and support vector machines (SVMs), we develop a classifier which gives similar performance (89% sensitivity and 80% specificity) to the currently best-known classifier, but uses significantly fewer features. We used two different techniques, principal components analysis and mutual information score, to select features. The results do not depend significantly on which method is used for feature selection.*
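
The abstract quotes sensitivity and specificity. As a reminder, sensitivity is the fraction of truly malignant samples that are flagged as malignant, and specificity is the fraction of truly benign samples that are flagged as benign. The sketch below computes both from made-up labels; it assumes that 1 means malignant and 0 means benign, which may not match the encoding in the outcome files, and the helper name `sensitivity_specificity` is hypothetical.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    Assumption: `positive` marks the malignant class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


# Made-up labels for illustration only: 4 malignant and 6 benign samples.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true_demo, y_pred_demo)
print("sensitivity:", sens)
print("specificity:", spec)
```

High sensitivity matters most for the goal stated above: a classifier that rarely misses a malignant sample makes a benign prediction more trustworthy.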

The breakdown of topics covered in each notebook is as follows:
1. Data visualization, including PCA and tSNE visualizations.
2. Logistic regression, discussing the use of feature selection via mutual information vs. use of different regularizers.
3. SVMs with and without box constraints, and also using different kernel functions.
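
To make the idea of feature selection via mutual information concrete, here is a minimal sketch using scikit-learn's `SelectKBest` with `mutual_info_classif`, followed by a logistic regression. It is not the project's actual pipeline: the data are synthetic (`make_classification`), and the choice of `k=20` features is an arbitrary assumption for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in with the same shape as the inconclusive-biopsy data.
X_demo, y_demo = make_classification(
    n_samples=265, n_features=173, n_informative=10, random_state=0
)

# Score each feature by its estimated mutual information with the label
# and keep the k highest-scoring ones (k=20 is an arbitrary choice here).
score_func = partial(mutual_info_classif, random_state=0)
model = make_pipeline(
    SelectKBest(score_func=score_func, k=20),
    LogisticRegression(solver="liblinear"),
)
model.fit(X_demo, y_demo)

# Column indices of the features that survived selection.
selected = model.named_steps["selectkbest"].get_support(indices=True)
print(len(selected), "features kept")
```

Placing the selector inside the pipeline means that, under cross-validation, features are chosen using only the training folds, which avoids leaking test information into the selection step.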

In [31]:
# Import the usual modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Load the 265 inconclusive-biopsy samples (features and outcomes)
X = pd.read_csv("data/normalized_data_265.csv", header=None)
y = pd.read_csv("data/outcome_265.csv", header=None)

# Load the full set of 367 samples (inconclusive plus conclusive biopsies)
X_full = pd.read_csv("data/normalized_data_367.csv", header=None)
y_full = pd.read_csv("data/outcome_367.csv", header=None)

# Look at the first few rows
X.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.17946 | -0.92523 | -0.6937 | 0.46983 | -0.34871 | -0.21465 | -0.30662 | 1.3596 | -0.50597 | 1.1031 | ... | 0.068605 | -0.53499 | 0.308 | -0.33358 | 0.82609 | 0.025969 | -0.70057 | -0.70648 | 0.46854 | 0.34735 |
| 1 | 1.3137 | -0.44796 | 0.12014 | -0.64247 | 0.093455 | -0.40658 | -0.61403 | -0.48841 | 0.77221 | -0.6834 | ... | -1.1425 | 2.1383 | -1.0025 | -1.2112 | 0.82492 | -0.013805 | 0.7909 | 1.2737 | 0.92238 | 1.193 |
| 2 | -0.3854 | 1.073 | -1.0373 | 0.48228 | 0.5017 | 0.43634 | 0.7702 | -0.13205 | -1.6549 | 0.57408 | ... | 0.16262 | -0.12682 | 1.5029 | 0.82987 | -0.71825 | 0.01689 | -0.73851 | -0.68747 | -1.0507 | -0.87795 |
| 3 | -0.20878 | 0.16227 | 1.0061 | -0.49424 | -0.36083 | 0.67324 | -0.11529 | 0.80011 | -1.1433 | 0.74535 | ... | 0.27848 | -1.0712 | -1.5862 | 0.30024 | 0.2521 | 0.83114 | 2.3543 | -0.86967 | -0.17322 | -0.4556 |
| 4 | -2.5194 | -0.60297 | -1.4615 | 1.3105 | -1.2644 | -0.77405 | -0.1842 | -1.0592 | -1.1224 | -1.9363 | ... | -0.014778 | -0.73844 | 1.3451 | 0.70413 | -3.2079 | -2.9599 | -0.53812 | -0.66792 | -3.2954 | -3.4168 |


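### Train-Test-Validation split

An 80/20 split of the 265 inconclusive samples (performed in the logistic regression cell below) gives 212 training and 53 test samples. The printed shapes of `X_train`, `y_train`, `X_test`, and `y_test` are:

(212, 173)
(212,)
(53, 173)
(53,)
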

## Logistic regression 

Logistic regression is a simple and natural first model to try.

In [35]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Convert the DataFrames to NumPy arrays; flatten the single-column outcomes to 1-D
X = X.to_numpy()
y = y.to_numpy().ravel()
X_full = X_full.to_numpy()
y_full = y_full.to_numpy().ravel()

# Hold out 20% of the samples as a test set.
# No random_state is set, so the split (and the scores below) vary between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

# Fit a logistic regression with scikit-learn's default regularization (L2 penalty, C=1.0)
logit_bare = LogisticRegression(random_state=1, solver='liblinear')
logit_bare.fit(X_train, y_train)

print("Training accuracy: ", logit_bare.score(X_train, y_train))
print("Test set accuracy: ", logit_bare.score(X_test, y_test))

Training accuracy:  1.0
Test set accuracy:  0.754716981132
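
The gap between a training accuracy of 1.0 and a test accuracy of about 0.75 suggests overfitting, which is not surprising with 173 features and only 212 training samples. One common response is to vary the regularizer and its strength and compare cross-validated accuracy. The sketch below illustrates that idea on synthetic data of the same shape; the grid of `C` values and the 5-fold setup are assumptions chosen for illustration, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same shape as the training set above.
X_demo, y_demo = make_classification(
    n_samples=212, n_features=173, n_informative=10, random_state=0
)

# Compare L1 and L2 penalties at a few regularization strengths.
# In scikit-learn, a smaller C means stronger regularization.
results = {}
for penalty in ("l1", "l2"):
    for C in (0.01, 0.1, 1.0):
        clf = LogisticRegression(
            penalty=penalty, C=C, solver="liblinear", random_state=1
        )
        scores = cross_val_score(clf, X_demo, y_demo, cv=5)
        results[(penalty, C)] = scores.mean()

for (penalty, C), acc in sorted(results.items()):
    print(penalty, C, round(acc, 3))
```

Using cross-validation on the training data to pick the regularizer keeps the held-out test set untouched for the final accuracy estimate.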
