## Assignment 6 

In this assignment, the breast cancer dataset of sklearn library is used.     
I will try to classify the patients to be having malignant or beneign tumors.

Goal of the assignment is to calculate a model's accuracy using the leave-one-out cross validation technique.  

### Brief dataset description  

The dataset consists of 569 entries of 30 numeric predictive variables and the outcome.  
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.   

#### Attribute information  
* radius (mean of distances from center to points on the perimeter)
* texture (standard deviation of gray-scale values)
* perimeter
* area
* smoothness (local variation in radius lengths)
* compactness (perimeter^2 / area - 1.0)
* concavity (severity of concave portions of the contour)
* concave points (number of concave portions of the contour)
* symmetry
* fractal dimension (“coastline approximation” - 1)

The mean, standard error, and “worst” or largest (mean of the three worst/largest values) of these features were computed for each image, resulting in 30 features. For instance, field 0 is Mean Radius, field 10 is Radius SE, field 20 is Worst Radius.  


As found on [sklearn documentation](https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset)




In [1]:
import pandas as pd 
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, LeaveOneOut
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from tqdm import tqdm

In [2]:
cancerDF = load_breast_cancer(as_frame=True)['frame']

In [3]:
selected_vars = ['mean radius', 'mean texture', 'mean smoothness', 'mean compactness',
       'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error',
       'smoothness error', 'compactness error', 'symmetry error', 'worst symmetry', 
       'target']

cancerDF = cancerDF[selected_vars]

In [4]:
cancerDF.head()

Unnamed: 0,mean radius,mean texture,mean smoothness,mean compactness,mean symmetry,mean fractal dimension,radius error,texture error,smoothness error,compactness error,symmetry error,worst symmetry,target
0,17.99,10.38,0.1184,0.2776,0.2419,0.07871,1.095,0.9053,0.006399,0.04904,0.03003,0.4601,0
1,20.57,17.77,0.08474,0.07864,0.1812,0.05667,0.5435,0.7339,0.005225,0.01308,0.01389,0.275,0
2,19.69,21.25,0.1096,0.1599,0.2069,0.05999,0.7456,0.7869,0.00615,0.04006,0.0225,0.3613,0
3,11.42,20.38,0.1425,0.2839,0.2597,0.09744,0.4956,1.156,0.00911,0.07458,0.05963,0.6638,0
4,20.29,14.34,0.1003,0.1328,0.1809,0.05883,0.7572,0.7813,0.01149,0.02461,0.01756,0.2364,0


Normalize all variables to follow N(0,1) (z-scores).

In [5]:
X = cancerDF.drop(columns = 'target')
y = cancerDF['target'].astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=33)

In [6]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### Leave-One-Out cross validation (LOOCV)

Leave-one-out cross-validation, or LOOCV, is a configuration of k-fold cross-validation where k is set to the number of examples in the dataset.    
The benefit of so many fit and evaluated models is a more robust estimate of model performance as each row of data is given an opportunity to represent the entirety of the test dataset.

In [7]:
y_act = []
y_pred = []
cv = LeaveOneOut()

# each iteration returns row indices
for train_rows, test_rows in tqdm(cv.split(X_train), total=X_train.shape[0]):
    X_train_cv, X_test_cv = X_train[train_rows], X_train[test_rows]
    y_train_cv, y_test_cv = np.array(y_train)[train_rows], np.array(y_train)[test_rows]

    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_train_cv, y_train_cv)
    yhat = clf.predict(X_test_cv)
    y_act.append(y_train_cv[0])
    y_pred.append(yhat[0])

100%|██████████| 381/381 [01:43<00:00,  3.68it/s]


In [8]:
def calculate_metrics(y_act, y_pred):
    accuracy = sum([x==y for x,y in zip(y_act, y_pred)])/len(y_act)

    tp = sum([x==y for x,y in zip(y_act, y_pred) if x==1])
    tn = sum([x==y for x,y in zip(y_act, y_pred) if x==0])
    fp = len([x for x in y_pred if x==1]) - tp
    fn = len([x for x in y_pred if x==0]) - tn
    tpr = tp/(tp+fn)
    tnr = tn/(tn+fn)

    confusionDF = pd.crosstab(pd.Series(y_pred, name='Predicted'), pd.Series(y_act, name='Actual'), margins=True)

    print(f'Model accuracy: {accuracy*100:.2f}%')
    print(f'True positive, true negative, false positive and false negative values are: {tp, tn, fp, fn}')
    print(f'Model recall: {tpr:.2f}. Model specificity: {tnr:.2f}.\n')

    print('Confusion Matrix')
    print(confusionDF)

In [9]:
print('Metrics as predicted using LOO cross validation')
calculate_metrics(y_act, y_pred)

Metrics as predicted using LOO cross validation
Model accuracy: 33.86%
True positive, true negative, false positive and false negative values are: (0, 129, 251, 1)
Model recall: 0.00. Model specificity: 0.99.

Confusion Matrix
Actual       0  1  All
Predicted             
0          129  1  130
1          251  0  251
All        380  1  381


In [10]:
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test).tolist()
y_act = y_test.tolist()

In [11]:
print('Metrics using actual train - test split (70-30 split)')
calculate_metrics(y_act, y_pred)

Metrics using actual train - test split (70-30 split)
Model accuracy: 94.15%
True positive, true negative, false positive and false negative values are: (109, 68, 7, 4)
Model recall: 0.96. Model specificity: 0.94.

Confusion Matrix
Actual      0    1  All
Predicted              
0          68    4   72
1           7  109  116
All        75  113  188
