# L3b: Classification of Clinical Breast Cancer Samples
In this problem, we'll use [the Perceptron (Rosenblatt, 1957)](https://en.wikipedia.org/wiki/Perceptron) to classify clinical breast cancer samples. The dataset, developed by [Wolberg, W. (1990)](https://doi.org/10.24432/C5HP4Z), was obtained from [Dr. William H. Wolberg](https://pages.cs.wisc.edu/~olvi/uwmp/cancer.html) at the University of Wisconsin Hospitals, Madison, and is available [from the UCI dataset archive](https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original). It contains 699 instances, each with nine clinical features and a class label `{benign | malignant}`.

* __Challenge__: We've never seen this data and have no idea if it's linearly separable. Thus, we have no theoretical guarantee that [the Perceptron](https://en.wikipedia.org/wiki/Perceptron) will converge to a mistake-free classifier. Let's load the dataset, do some preprocessing, and then explore the performance of [the Perceptron](https://en.wikipedia.org/wiki/Perceptron) on this data.
* Let's compare our results to the study [Sidey-Gibbons, J., Sidey-Gibbons, C. Machine learning in medicine: a practical introduction. BMC Med Res Methodol 19, 64 (2019). https://doi.org/10.1186/s12874-019-0681-4](https://rdcu.be/d5NjG), which used the same dataset.

### Setup, Data, and Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. The `Include.jl` file loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem.

In [3]:
include("Include.jl");

#### Load and process the clinical dataset
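We load the clinical dataset from the `breast-cancer-wisconsin.csv` file into the `df::DataFrame` variable using the `CSV.read` method. Each row is one sample: an `id` column, nine integer-valued features, and a `Class` label in which `2` denotes benign and `4` denotes malignant.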

In [5]:
df = CSV.read(joinpath(_PATH_TO_DATA, "breast-cancer-wisconsin.csv"), DataFrame)

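| Row | id | ClumpThickness | UniformityCellSize | UniformityCellShape | MarginalAdhesion | SingleEpithelialCellSize | BareNuclei | BlandChromatin | NormalNucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 2 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 3 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 4 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 5 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 6 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |
| 7 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | 2 |
| 8 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 9 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | 2 |
| 10 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2 |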


The `Class` label is not in the $\{-1,1\}$ form that the perceptron expects, so let's transform the original data by mapping $2\rightarrow{-1}$ (benign) and $4\rightarrow{1}$ (malignant). We'll save the transformed data in the `dataset::DataFrame` variable.

In [7]:
dataset = let

    number_of_examples = nrow(df);
    for i ∈ 1:number_of_examples
        c = df[i,:Class];
        if (c == 2)
            df[i,:Class] = -1 # not cancer
        elseif (c == 4)
            df[i,:Class] = 1 # cancer
        end
    end
    df
end

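| Row | id | ClumpThickness | UniformityCellSize | UniformityCellShape | MarginalAdhesion | SingleEpithelialCellSize | BareNuclei | BlandChromatin | NormalNucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | -1 |
| 2 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | -1 |
| 3 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | -1 |
| 4 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | -1 |
| 5 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | -1 |
| 6 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 1 |
| 7 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | -1 |
| 8 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | -1 |
| 9 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | -1 |
| 10 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | -1 |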
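
The loop above overwrites the `Class` column of `df` in place, so `dataset` and `df` refer to the same table afterwards. As a minimal sketch of the same $2\rightarrow{-1}$, $4\rightarrow{1}$ mapping that leaves its input untouched, the example below works on a plain vector. The `raw_labels_demo` values and the `remap_label` helper are hypothetical and only for illustration.

```julia
# Hypothetical raw class labels: 2 = benign, 4 = malignant
raw_labels_demo = [2, 2, 4, 2, 4];

# Map 2 -> -1 (benign) and 4 -> 1 (malignant); any other value is an error
function remap_label(c::Int)
    c == 2 && return -1
    c == 4 && return 1
    throw(ArgumentError("unexpected class label: $c"))
end

# Broadcasting returns a new vector and does not modify the input
labels_demo = remap_label.(raw_labels_demo);
```

Raising an error on an unexpected value makes a mislabeled row visible immediately, whereas the loop above would silently leave it unchanged.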


Next, we partition the data into `training` and `test` sets. We'll use `training` to estimate the model parameters, and `test` to estimate how well the classifier performs on data it has not seen.

In [9]:
training, test = let

    number_of_training_examples = 456; # from Sidey-Gibbons, 2019
    D = Matrix(dataset);
    number_of_columns = size(D,2); # number of cols: id, nine features, and the class label
    number_of_examples = size(D,1); # number of rows (examples) in the clinical data
    full_index_set = range(1,stop=number_of_examples,step=1) |> collect |> Set;
    
    # build index sets for training and testing
    training_index_set = Set{Int64}();
    should_stop_loop = false;
    while (should_stop_loop == false)
        i = rand(1:number_of_examples);
        push!(training_index_set,i);

        if (length(training_index_set) == number_of_training_examples)
            should_stop_loop = true;
        end
    end
    test_index_set = setdiff(full_index_set,training_index_set);

    # build the test and train datasets -
    training = D[training_index_set |> collect,:];
    test = D[test_index_set |> collect,:];

    # return
    training, test
end;
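
The loop above keeps drawing random row indices until it has collected 456 distinct ones, and the remaining rows become the test set. A roughly equivalent sketch, assuming only the `Random` standard library, shuffles the row indices once and cuts the permutation in two. The seed and the `*_demo` names are assumptions made for reproducibility, so this split will not match the one used for the results shown in this notebook.

```julia
using Random

number_of_examples_demo = 699;           # rows in the dataset
number_of_training_examples_demo = 456;  # from Sidey-Gibbons, 2019

# Shuffle the row indices once, then cut the permutation in two
rng_demo = MersenneTwister(1234);        # hypothetical seed
shuffled_demo = randperm(rng_demo, number_of_examples_demo);
training_indices_demo = shuffled_demo[1:number_of_training_examples_demo];
test_indices_demo = shuffled_demo[(number_of_training_examples_demo + 1):end];
```

The two index vectors are disjoint by construction and together cover every row, which leaves 243 rows for testing.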

In [43]:
test

243×11 Matrix{Int64}:
  897471   4   8   8  5   4   5  10   4   1   1
 1124651   1   3   3  2   2   1   7   2   1  -1
  718641   1   1   1  1   5   1   3   1   1  -1
  536708   1   1   1  1   2   1   1   1   1  -1
  749653   3   1   1  1   2   1   2   1   1  -1
 1275807   4   2   4  3   2   2   2   1   1  -1
 1147748   5  10   6  1  10   4   4  10  10   1
 1257200  10  10  10  7  10  10   8   2   1   1
  763235   3   1   1  1   2   1   2   1   2  -1
  695091   5  10  10  5   4   5   4   4   1   1
 1293966   4   1   1  1   2   1   1   1   1  -1
 1276091   2   1   1  1   2   1   2   1   1  -1
 1132347   1   1   4  1   2   1   2   1   1  -1
       ⋮                  ⋮                   ⋮
  657753   3   1   1  4   3   1   2   2   1  -1
 1081791   6   2   1  1   1   1   7   1   1  -1
  566509   5   1   1  1   2   1   1   1   1  -1
  743348   3   2   2  1   2   1   2   3   1  -1
 1321942   5   1   1  1   2   1   3   1   1  -1
 1224329   1   1   1  2   2   1   3   1   1  -1
 1231853   4   2  

## Task 1: Build Classification Model Object and Learn Parameters
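In this task, we build a `MyPerceptronClassificationModel` instance using the `build(...)` method, with every parameter initialized to one and a mistake tolerance of zero. Then we estimate the parameters from the `training` data using the `learn(...)` method.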

In [12]:
model = let

    # How many features do we have?
    D = training; # let's look at the training data
    number_of_features = size(D,2) - 1; # drop the id and label columns (11 - 2 = 9), then add one for the bias term
    
    # build a model
    model = build(MyPerceptronClassificationModel, (
        parameters = ones(number_of_features),
        mistakes = 0 # willing to live with m mistakes
    ));

    model;
end;

Next, we'll use the `training` dataset to estimate the model parameters.

In [14]:
trainedmodel = let

    D = training; # what dataset are we going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
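    number_of_features = size(D,2) - 1; # nine features plus one bias term (id and label columns excluded)
    X = [D[:,2:end-1] ones(number_of_examples)]; # features: drop the id (first col) and label (last col), then append a column of ones for the bias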
    y = D[:,end]; # output: this is the target data (label)
    
    # train the model -
    trainedmodel = learn(X,y,model, maxiter = 1000, verbose = true);

    # return
    trainedmodel;
end

Stopped after number of iterations: 1000. We have number of errors: 21


MyPerceptronClassificationModel([26.0, -4.0, 20.0, 12.0, 2.0, 17.0, 26.0, -7.0, 4.0, -365.0], 0)
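
The `learn(...)` method comes from `Include.jl`, and its internals are not shown here. As a rough sketch of the classical perceptron rule, assuming labels in $\{-1,1\}$ and a trailing column of ones for the bias, each misclassified example $(x_{i},y_{i})$ triggers the update $\theta\leftarrow\theta+y_{i}x_{i}$. The `perceptron_sketch` function and the toy data below are hypothetical, and the course implementation may differ in its details.

```julia
using LinearAlgebra

# Classical perceptron update (hypothetical helper, not the `learn` method from Include.jl)
function perceptron_sketch(X::Matrix{Float64}, y::Vector{Int}; maxiter::Int = 1000)
    θ = zeros(size(X, 2))
    for _ in 1:maxiter
        mistakes = 0
        for i in axes(X, 1)
            x = X[i, :]
            if y[i] * dot(x, θ) ≤ 0      # misclassified, or exactly on the boundary
                θ .+= y[i] .* x          # move θ toward the correct side
                mistakes += 1
            end
        end
        mistakes == 0 && break           # a full pass without mistakes: stop
    end
    return θ
end

# Toy, linearly separable data: one feature plus a bias column of ones
X_toy = [1.0 1.0; 2.0 1.0; 4.0 1.0; 5.0 1.0];
y_toy = [-1, -1, 1, 1];
θ_toy = perceptron_sketch(X_toy, y_toy);
ŷ_toy = [dot(X_toy[i, :], θ_toy) > 0 ? 1 : -1 for i in axes(X_toy, 1)];
```

On separable toy data like this, the loop exits early with every example classified correctly. When the data are not separable, a loop of this kind runs until `maxiter` and stops with some errors remaining.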

## Task 2: Run a prediction to compute misclassification rate
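In this task, we use the trained model to predict the labels of the `test` samples with the `classify(...)` method, and then compare the predicted labels `ŷ` against the actual labels `y`.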

In [16]:
ŷ,y = let

    D = test; # what dataset are we going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2) - 1; # how many features do we have (cols)?
    X = [D[:,2:end-1] ones(number_of_examples)]; # features: need to add a 1 to each row (for bias), after removing the label
    y = D[:,end]; # output: this is the *actual* target data (label)

    # compute the estimated labels -
    ŷ = classify(X,trainedmodel) # use the model returned by learn(...)

    # return -
    ŷ,y
end;
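
The `classify(...)` method is also defined in `Include.jl`. A common decision rule for a perceptron, assumed here, is the sign of the linear score $x^{\top}\theta$. The sketch below applies that rule to the parameter vector printed by the training cell and to the first two rows of the `test` matrix displayed earlier. The `*_demo` names are hypothetical.

```julia
# Parameters printed by the training cell (the last entry is the bias)
θ_demo = [26.0, -4.0, 20.0, 12.0, 2.0, 17.0, 26.0, -7.0, 4.0, -365.0];

# First two rows of the displayed `test` matrix: nine features, id and label removed
features_demo = [4 8 8 5 4 5 10 4 1;
                 1 3 3 2 2 1  7 2 1];
X_demo = [features_demo ones(size(features_demo, 1))];  # append a column of ones for the bias

scores_demo = X_demo * θ_demo;                           # one linear score per row
predicted_demo = [s > 0 ? 1 : -1 for s in scores_demo];  # assumed rule: sign of the score
```

The scores are 256 and -74, so the assumed rule predicts `1` and `-1`, which agrees with the labels in the last column of those two rows.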

#### Performance
There are many ways to compute the performance of the binary classifier. Let's consider a few versions of the misclassification rate. We'll start with the overall misclassification rate (defined as the number of mistakes divided by the total number of samples).

In [18]:
overall_misclassified_percentage = let
    
    number_of_test_examples = length(ŷ);
    error_counter = 0;

    for i ∈ 1:number_of_test_examples
        if (ŷ[i] != y[i])
            error_counter += 1;
        end
    end
    
    error_counter/number_of_test_examples
end

0.03292181069958848

__Accuracy__: Accuracy is a fundamental metric for evaluating binary classifiers. It is defined as the ratio of correctly predicted instances (both true positives and true negatives) to the total number of cases:
$$
\begin{equation*}
\text{Accuracy} = \frac{N_{+}+N_{-}}{N_{T}}
\end{equation*}
$$
where $N_{+}$ (or $N_{-}$) denotes the number of true positive (or negative) classifications, i.e., the number of times the classifier estimates the proper label, and $N_{T}$ denotes the total number of prediction samples.

In [20]:
accuracy = let

    number_of_test_examples = length(ŷ);
    correct_counter = 0;

    for i ∈ 1:number_of_test_examples
        if (ŷ[i] == y[i])
            correct_counter += 1;
        end
    end
    
    correct_counter/number_of_test_examples
end

0.9670781893004116
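
The accuracy and the overall misclassification rate are complements: the two values reported above sum to one. A compact sketch of both quantities, using broadcasting and `mean` from the `Statistics` standard library, is shown below. The label vectors are hypothetical and only for illustration.

```julia
using Statistics

# Hypothetical predicted and actual labels
ŷ_demo = [1, -1, -1, 1, -1,  1, -1, -1];
y_demo = [1, -1,  1, 1, -1, -1, -1, -1];

misclassification_rate_demo = mean(ŷ_demo .!= y_demo);  # fraction of wrong predictions
accuracy_demo = mean(ŷ_demo .== y_demo);                # fraction of correct predictions
```

Two of the eight predictions are wrong here, so the misclassification rate is 0.25 and the accuracy is 0.75.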

__Negative predictive value__: The negative predictive value (NPV) measures the proportion of `negative` predictions that are correct. It should not be confused with the specificity, also known as the True Negative Rate (TNR), which instead divides by the number of actual `negative` samples, $N_{-,-}+N_{-,+}$. The NPV is defined as:
$$
\begin{equation*}
\text{NPV} = \frac{N_{-,-}}{N_{-,-}+N_{+,-}}
\end{equation*}
$$
where $N_{-,-}$ denotes the number of actual `negative` samples that were classified as `negative`, and $N_{+,-}$ denotes the number of actual `positive` samples that were misclassified as `negative` (false negatives). The denominator is the total number of samples predicted to be `negative`.

In [22]:
negative_predictive_value = let

    number_of_test_examples = length(ŷ);
    total_number_of_predicted_negatives = findall(c-> c == -1, ŷ) |> length;
    counter = 0;

    for i ∈ 1:number_of_test_examples
        if (y[i] == -1 && ŷ[i] == -1) # N(-1,-1)
            counter += 1;
        end
    end

    counter/total_number_of_predicted_negatives # N(-,-) / (N(-,-) + N(+,-))
end

0.9876543209876543

__Precision__: The precision measures the proportion of true positive predictions among all positive predictions. Precision is defined as:
$$
\begin{equation*}
\text{Precision} = \frac{N_{+,+}}{N_{+,+} + N_{-,+}}
\end{equation*}
$$
where $N_{+,+}$ denotes the number of actual `positive` samples that the model correctly predicts to be `positive`, and $N_{-,+}$ denotes the number of actual `negative` samples that the model predicts to be `positive` (false positives). The denominator is the total number of `positive` predictions.

In [24]:
precision = let

    number_of_test_examples = length(ŷ);
    total_number_of_predicted_positives = findall(c-> c == 1, ŷ) |> length;
    counter = 0;

    for i ∈ 1:number_of_test_examples
        if (y[i] == 1 && ŷ[i] == 1)
            counter += 1;
        end
    end

    counter/total_number_of_predicted_positives
end

0.9259259259259259
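
All of these ratios follow from four counts $N_{a,p}$, where the first index is the actual label and the second is the predicted label. The sketch below tabulates the counts once and derives the precision together with three related ratios. The label vectors and the `count_pairs` helper are hypothetical, and the names sensitivity, specificity, and NPV follow standard usage.

```julia
# Hypothetical labels, for illustration only
y_demo = [1, 1,  1,  1, -1, -1, -1, -1, -1];   # actual
ŷ_demo = [1, 1, -1, -1, -1, -1, -1,  1, -1];   # predicted

# Count N(actual, predicted) for a pair of label values
count_pairs(a, p) = count(i -> y_demo[i] == a && ŷ_demo[i] == p, eachindex(y_demo));

tp = count_pairs(1, 1);    # N(+,+): true positives
fn = count_pairs(1, -1);   # N(+,-): false negatives
fp = count_pairs(-1, 1);   # N(-,+): false positives
tn = count_pairs(-1, -1);  # N(-,-): true negatives

precision_demo   = tp / (tp + fp);  # correct among predicted positives
sensitivity_demo = tp / (tp + fn);  # correct among actual positives (recall)
specificity_demo = tn / (tn + fp);  # correct among actual negatives (TNR)
npv_demo         = tn / (tn + fn);  # correct among predicted negatives
```

Precision and NPV divide by the number of predictions of a given label, while sensitivity and specificity divide by the number of actual samples of that label. That is why the specificity (0.8) and the NPV (about 0.67) differ in this example.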