# Dataset Description
<hr />

#### From https://www.kaggle.com/crawford/gene-expression:  This dataset comes from a proof-of-concept study published in 1999 by Golub et al. It showed how new cases of cancer could be classified by gene expression monitoring (via DNA microarray) and thereby provided a general approach for identifying new cancer classes and assigning tumors to known classes. These data were used to classify patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). 

#### The training dataset contains 38 samples and the test dataset contains 34 samples, as used in the paper. Both datasets contain ALL and AML sample measurements from bone marrow and peripheral blood. According to Kaggle and the paper, intensity values have been rescaled so that the overall intensities for each chip are equivalent.

#### Acknowledgments

#### Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression

#### Science 286:531-537. (1999). Published: 1999.10.14

#### T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander

In [1]:
import pandas as pd

In [2]:
# Load the Dataset

test_data=pd.read_csv('data_set_ALL_AML_independent.csv')
training_data=pd.read_csv('data_set_ALL_AML_train.csv')
patient_cancer_labels=pd.read_csv('actual.csv')

In [3]:
# Dataset Analysis

#### shape 
training_data_shape = training_data.shape
test_data_shape = test_data.shape
patient_cancer_shape = patient_cancer_labels.shape
print("Training Data Shape: %s rows, %s columns" % (training_data_shape[0], training_data_shape[1]))
print("Testing Data Shape: %s rows, %s columns" % (test_data_shape[0], test_data_shape[1]))
print("Patient Cancer Label Data Shape: %s rows, %s columns" % (patient_cancer_shape[0], patient_cancer_shape[1]))

Training Data Shape: 7129 rows, 78 columns
Testing Data Shape: 7129 rows, 70 columns
Patient Cancer Label Data Shape: 72 rows, 2 columns
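
The column counts reported above are consistent with the raw file layout: two descriptor columns (Gene Description and Gene Accession Number) plus one intensity column and one "call" column per sample. A minimal check of that arithmetic is sketched below; `expected_raw_columns` is a hypothetical helper introduced only for illustration and is not part of the original notebook.

```python
# Hypothetical helper (an assumption, not from the original notebook):
# expected number of columns in a raw expression table that has a fixed
# number of descriptor columns plus one intensity column and one "call"
# column for every sample.
def expected_raw_columns(n_samples, n_descriptor_columns=2):
    return n_descriptor_columns + 2 * n_samples

print(expected_raw_columns(38))  # training set: 38 samples -> 78
print(expected_raw_columns(34))  # test set: 34 samples -> 70
```

Both values agree with the shapes printed for the training and test files.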


In [4]:
#### First 5 rows
training_data.head()


Unnamed: 0,Gene Description,Gene Accession Number,1,call,2,call.1,3,call.2,4,call.3,...,29,call.33,30,call.34,31,call.35,32,call.36,33,call.37
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-214,A,-139,A,-76,A,-135,A,...,15,A,-318,A,-32,A,-124,A,-135,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-153,A,-73,A,-49,A,-114,A,...,-114,A,-192,A,-49,A,-79,A,-186,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,-58,A,-1,A,-307,A,265,A,...,2,A,-95,A,49,A,-37,A,-70,A
3,AFFX-BioC-5_at (endogenous control),AFFX-BioC-5_at,88,A,283,A,309,A,12,A,...,193,A,312,A,230,P,330,A,337,A
4,AFFX-BioC-3_at (endogenous control),AFFX-BioC-3_at,-295,A,-264,A,-376,A,-419,A,...,-51,A,-139,A,-367,A,-188,A,-407,A


In [5]:
#### First 5 rows
test_data.head()

Unnamed: 0,Gene Description,Gene Accession Number,39,call,40,call.1,42,call.2,47,call.3,...,65,call.29,66,call.30,63,call.31,64,call.32,62,call.33
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-342,A,-87,A,22,A,-243,A,...,-62,A,-58,A,-161,A,-48,A,-176,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-200,A,-248,A,-153,A,-218,A,...,-198,A,-217,A,-215,A,-531,A,-284,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,41,A,262,A,17,A,-163,A,...,-5,A,63,A,-46,A,-124,A,-81,A
3,AFFX-BioC-5_at (endogenous control),AFFX-BioC-5_at,328,A,295,A,276,A,182,A,...,141,A,95,A,146,A,431,A,9,A
4,AFFX-BioC-3_at (endogenous control),AFFX-BioC-3_at,-224,A,-226,A,-211,A,-289,A,...,-256,A,-191,A,-172,A,-496,A,-294,A


In [6]:
patient_cancer_labels.head(10)

Unnamed: 0,patient,cancer
0,1,ALL
1,2,ALL
2,3,ALL
3,4,ALL
4,5,ALL
5,6,ALL
6,7,ALL
7,8,ALL
8,9,ALL
9,10,ALL


In [7]:
#### The columns labeled "call" hold flags (such as A and P in the rows above) rather than intensity values and do not seem to help with classification, so they are filtered out

def drop_columns_containing_call(df):
    # Returns the names of the columns to KEEP: those without "call" in the name
    return [col for col in df.columns if "call" not in col]

training_data = training_data[drop_columns_containing_call(training_data)]
test_data = test_data[drop_columns_containing_call(test_data)]

# Keep only Gene Accession Number as the gene identifier
training_data = training_data.drop("Gene Description", axis=1)
test_data = test_data.drop("Gene Description", axis=1)

#### Transpose both the training and testing dataframes so that each row represents a sample and each column a gene. This makes each gene a feature for the machine learning models, with the microarray intensities as the values

test_data = test_data.T
training_data = training_data.T
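
To make the filtering and transposing step concrete, here is a small sketch on a toy dataframe with the same layout as the raw files. The values and gene names are made up for illustration; only the structure mirrors the dataset.

```python
import pandas as pd

# Toy frame mimicking the raw layout: two descriptor columns, then an
# intensity column and a "call" column per sample (values are made up).
toy = pd.DataFrame({
    "Gene Description": ["gene A", "gene B"],
    "Gene Accession Number": ["A_at", "B_at"],
    "1": [10, 20],
    "call": ["A", "P"],
    "2": [30, 40],
    "call.1": ["A", "A"],
})

# Keep every column whose name does not contain "call", then drop the description.
keep = [col for col in toy.columns if "call" not in col]
toy = toy[keep].drop("Gene Description", axis=1)

# After transposing, rows are samples and columns are genes.
toy_t = toy.T
print(toy_t.shape)        # (3, 2)
print(list(toy_t.index))  # ['Gene Accession Number', '1', '2']
```

Note that the transposed frame has one more row than there are samples, because the Gene Accession Number column becomes a row of its own.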

In [8]:
print(training_data.shape)
training_data.head()

(39, 7129)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
Gene Accession Number,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U48730_at,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at
1,-214,-153,-58,88,-295,-558,199,-176,252,206,...,185,511,-125,389,-37,793,329,36,191,-37
2,-139,-73,-1,283,-264,-400,-330,-168,101,74,...,169,837,-36,442,-17,782,295,11,76,-14
3,-76,-49,-307,309,-376,-650,33,-367,206,-215,...,315,1199,33,168,52,1138,777,41,228,-41
4,-135,-114,265,12,-419,-585,158,-253,49,31,...,240,835,218,174,-110,627,170,-50,126,-91
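
The shape above is (39, 7129) rather than (38, 7129) because the Gene Accession Number row is still part of the data after the transpose. One possible cleanup, shown here on a toy frame with made-up values, is to promote that row to the column names and convert the remaining values to numbers. The helper `promote_accession_row` is hypothetical and is not necessarily what the original notebook does next.

```python
import pandas as pd

# Toy transposed frame with the same layout as the output above
# (values are made up; only the structure matters).
transposed = pd.DataFrame(
    [["A_at", "B_at"], [10, 20], [30, 40]],
    index=["Gene Accession Number", "1", "2"],
)

def promote_accession_row(df, header_row="Gene Accession Number"):
    """Hypothetical helper: use the accession-number row as column names."""
    out = df.drop(index=header_row)
    out.columns = df.loc[header_row].to_list()
    # Values are object dtype after the transpose; convert them to numbers.
    return out.apply(pd.to_numeric)

cleaned = promote_accession_row(transposed)
print(cleaned.shape)          # (2, 2)
print(list(cleaned.columns))  # ['A_at', 'B_at']
```

With the header row removed, the number of rows equals the number of samples, and each column is named after its gene.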
