# Type of Leukemia Prediction

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
df = pd.read_csv("leukemia_data.csv")

In [6]:
# 72 leukemia patients (rows)
# 151 columns: gene expression values plus the class label (last column)
df.shape

(72, 151)

In [12]:
# Last column = Type of Leukemia (ALL or AML)
df.columns[-1]

'leukemia_type'

In [7]:
# Exploring data (first 5 rows)
df.head()

Unnamed: 0,Zyxin,"PRG1 Proteoglycan 1, secretory granule",CD33 CD33 antigen (differentiation antigen),DF D component of complement (adipsin),RNS2 Ribonuclease 2 (eosinophil-derived neurotoxin; EDN),CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage),APLP2 Amyloid beta (A4) precursor-like protein 2,"GLUTATHIONE S-TRANSFERASE, MICROSOMAL",CTSD Cathepsin D (lysosomal aspartyl protease),ATP6C Vacuolar H+ ATPase proton channel subunit,...,RNH Ribonuclease/angiogenin inhibitor,"MPO from Human myeloperoxidase gene, exons 1-4./ntype=DNA /annot=exon",Very-long-chain acyl-CoA dehydrogenase (VLCAD),Low-Mr GTP-binding protein (RAB31) mRNA,"KIAA0184 gene, partial cds",OBF-1 mRNA for octamer binding factor 1,LPAP gene,HOXB3 Homeo box B3,ALCAM Activated leucocyte cell adhesion molecule,leukemia_type
0,-8,69,2323,1656,575,-958,1524,795,480,952,...,1410,44,1284,303,174,-283,261,177,298,ALL
1,76,17,1238,-10,1169,-429,1980,833,678,388,...,2325,146,540,1358,309,-65,101,3460,307,ALL
2,184,333,1982,-91,545,-645,2311,1391,540,559,...,2789,110,236,254,226,-395,309,416,309,ALL
3,71,121,3947,166,716,-795,2408,996,538,626,...,1132,86,786,-304,151,-367,288,392,693,ALL
4,111,95,3324,405,1211,-237,1671,678,588,776,...,1269,104,968,-86,672,290,395,7972,713,ALL


## Part 1 - Using Weka

<img src="images/wekaibkout.jpg"/>

### 1) Record % correctly classified and confusion matrix (ALL positive class).

<img src="images/corrclass1.jpg"/>

<img src="images/confmat.JPG"/>



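**True Positive Rate**  =  TP / (TP + FN) = 44 / (44 + 0) = 1.0

100% sensitive: every ALL sample is classified as ALL.

**False Positive Rate** =  FP / (FP + TN) =  4 / (4  + 24) = 0.143

Low false positive rate: 4 of the 28 AML samples are misclassified as ALL.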
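
The two rates above can be checked with a few lines of Python. This is a minimal sketch: the helper name `rates` is hypothetical, and the counts (TP = 44, FN = 0, FP = 4, TN = 24) are taken from the formulas above with ALL as the positive class.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR) computed from the four confusion-matrix counts."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# ALL as the positive class, counts taken from the formulas above
tpr, fpr = rates(tp=44, fn=0, fp=4, tn=24)
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")  # TPR = 1.000, FPR = 0.143
```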
 

### What do ALL and AML stand for?

Types of Leukemia:  

- **ALL = Acute lymphocytic leukemia**

- **AML = Acute myelocytic leukemia**


### 2) Derive the confusion matrix when AML is the positive class.

        AML (Positive)     ALL (Negative)    <-- classified (Predicted) as

            24                 4               a = AML (Genuinely Positive)
            0                 44               b = ALL (Genuinely Negative)
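
Switching the positive class only reorders the matrix: rows and columns are both reversed. The sketch below assumes the ALL-positive matrix has the counts used in the rate formulas (44, 0, 4, 24), with rows as actual classes and columns as predicted classes. The variable names are illustrative.

```python
# Rows = actual class, columns = predicted class, order [ALL, AML]
cm_all_positive = [[44, 0],
                   [4, 24]]

# Reversing the row order and each row's column order puts AML first,
# which makes AML the positive class
cm_aml_positive = [row[::-1] for row in cm_all_positive[::-1]]

for row in cm_aml_positive:
    print(row)
# [24, 4]
# [0, 44]
```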

### 3) Calculate the class-dependent TP and FP rates when AML is the positive class


### TPR (Sensitivity): (more is better)

True Positive Rate  =  TP / (TP + FN) = 24 / (24 + 4) = 0.857

85.7% Sensitive

<img src="images/TPR.jpg"/>


### FPR (1 - Specificity): (less is better) (less = more specific)

False Positive Rate =  FP / (FP + TN) =  0 / (0  + 44) = 0

<img src="images/fpr2.jpg"/>
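
The class-dependent rates can also be computed for every class in one pass by treating each class in turn as positive. This is a sketch with a hypothetical helper `class_rates`. It assumes rows are actual classes, columns are predicted classes, and every class has at least one sample.

```python
labels = ["ALL", "AML"]
# Rows = actual class, columns = predicted class, in the order of `labels`
cm = [[44, 0],
      [4, 24]]

def class_rates(cm, k):
    """Return (TPR, FPR) when class index k is treated as the positive class."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                  # actual k, predicted as another class
    fp = sum(row[k] for row in cm) - tp   # other classes predicted as k
    tn = total - tp - fn - fp
    return tp / (tp + fn), fp / (fp + tn)

for k, name in enumerate(labels):
    tpr, fpr = class_rates(cm, k)
    print(f"{name}: TPR = {tpr:.3f}, FPR = {fpr:.3f}")
# ALL: TPR = 1.000, FPR = 0.143
# AML: TPR = 0.857, FPR = 0.000
```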
 

### ROC curves

ROC = Receiver Operating Characteristic

- Helps determine the cutoff point that best balances sensitivity and specificity for a given test.

- Can be used to assess the overall diagnostic accuracy of a test.

- Y axis = TPR = sensitivity (more is better)

- X axis = FPR = (1 - specificity) (less is better)

- Overall diagnostic accuracy = AUC = Area Under the ROC Curve
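
To illustrate how the points of a ROC curve and the area under it are obtained, here is a minimal sketch using scikit-learn's `roc_curve` and `roc_auc_score`. The labels and scores are made-up illustration values, not output from the leukemia data or from Weka.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels (1 = positive class) and classifier scores, for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold on the score yields one (FPR, TPR) point of the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

# Area under the ROC curve: 1.0 is a perfect ranking, 0.5 is chance level
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")  # AUC = 0.75
```

Plotting `fpr` on the x axis against `tpr` on the y axis gives the curve itself.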

### 4) Capture the ROC curves for ALL positive

<img src="images/rocall.jpg"/>

### 5) Capture the ROC curves for AML positive

<img src="images/rocaml.jpg"/>

### 6) Using the ZeroR baseline classifier

<img src="images/wekazerorout.jpg"/>

### 7) Record % correctly classified and confusion matrix (ALL positive class).

<img src="images/corrclass2.jpg"/>

<img src="images/confmat2.jpg"/>


**True Positive Rate**  =  TP / (TP + FN) = 44 / (44 + 0) = 1.0

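100% sensitive, but only because ZeroR labels every sample as ALL.

**False Positive Rate** =  FP / (FP + TN) = 28 / (28 + 0) = 1.0

100% false positive rate: all 28 AML samples are misclassified as ALL.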
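
ZeroR ignores all features and always predicts the most frequent class, which is why both rates above equal 1.0. The sketch below reproduces that behaviour in plain Python. It assumes the class counts implied by the confusion matrices (44 ALL, 28 AML) and ignores any cross-validation splitting that Weka may apply.

```python
from collections import Counter

# Class counts taken from the confusion matrices: 44 ALL and 28 AML samples
actual = ["ALL"] * 44 + ["AML"] * 28

# ZeroR: find the majority class and predict it for every sample
majority = Counter(actual).most_common(1)[0][0]
predicted = [majority] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(majority, round(accuracy, 3))  # ALL 0.611
```

Any useful classifier should beat this baseline accuracy of 44 / 72.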
 

### 8) Derive the confusion matrix when AML is the positive class.

        AML (Positive)     ALL (Negative)    <-- classified (Predicted) as

            0                 28               a = AML (Genuinely Positive)
            0                 44               b = ALL (Genuinely Negative)

### 9) Calculate the class-dependent TP and FP rates when AML is the positive class


### TPR (Sensitivity): (more is better)

True Positive Rate  =  TP / (TP + FN) = 0 / (0 + 28) = 0

0% Sensitive

<img src="images/zerotpr2.jpg"/>

### FPR (1 - Specificity): (less is better) (less = more specific)

False Positive Rate =  FP / (FP + TN) =  0 / (0  + 44) = 0

<img src="images/zerofpr2.jpg"/>
 

### 10) Capture the ROC curves for ALL positive

<img src="images/rocallzero.jpg"/>

### 11) Capture the ROC curves for AML positive

<img src="images/rocamlzero.jpg"/>

## Part 2


In [15]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy binary labels (1 = positive class, 0 = negative class)
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes, in sorted label order (0, 1)
results = confusion_matrix(actual, predicted)
print('Confusion Matrix :')
print(results)
print('Accuracy Score :', accuracy_score(actual, predicted))
print('Report : ')
print(classification_report(actual, predicted))

Confusion Matrix :
[[4 2]
 [1 3]]
Accuracy Score : 0.7
Report : 
              precision    recall  f1-score   support

           0       0.80      0.67      0.73         6
           1       0.60      0.75      0.67         4

    accuracy                           0.70        10
   macro avg       0.70      0.71      0.70        10
weighted avg       0.72      0.70      0.70        10
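
The TP and FP rates used in Part 1 can be read off the scikit-learn matrix as well. For a binary problem, `confusion_matrix(...).ravel()` returns the counts in the order TN, FP, FN, TP. The sketch below reuses the same toy labels, with 1 as the positive class.

```python
from sklearn.metrics import confusion_matrix

actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]

# For binary labels, ravel() flattens the 2x2 matrix to (TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

tpr = tp / (tp + fn)  # sensitivity, same as the recall of class 1 in the report
fpr = fp / (fp + tn)  # 1 - specificity
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")  # TPR = 0.750, FPR = 0.333
```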

