<a href="https://colab.research.google.com/github/chaurasiat/breastcancer_logisticRegression/blob/master/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center> <h1> CSE 574 Project 1 </h1> </center>

<center> <h2> Authors: Mihir Chauhan, Sargur Srihari </h2> </center>

<center> <h2> Due Time and Date: 11:59PM October 7th 2020 </h2> </center>

### Project 1 Task

The task of this project is to perform classification using machine learning for a two-class problem. The features used for classification are pre-computed from images of a fine needle aspirate (FNA) of a breast mass. Your task is to classify suspected FNA cells as Benign (class 0) or Malignant (class 1) using logistic regression as the classifier. The dataset in use is the Wisconsin Diagnostic Breast Cancer dataset (wdbc.csv).



### Dataset Description

You will be using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset for training, validation and testing. The dataset you are provided with is wdbc.csv, which contains 500 data points with 31 attributes (diagnosis (B/M) and 30 real-valued input features).

#### How are the 30 features computed? (The information below is for background only)
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The computed features describe the following characteristics of the cell nuclei present in the image.

|    |                              Feature                              |
|----|:-----------------------------------------------------------------:|
| 1  | radius (mean of distances from center to points on the perimeter) |
| 2  | texture (standard deviation of gray-scale values)                 |
| 3  | perimeter                                                         |
| 4  | area                                                              |
| 5  | smoothness (local variation in radius lengths)                    |
| 6  | compactness (perimeter^2 / area − 1.0)                            |
| 7  | concavity (severity of concave portions of the contour)           |
| 8  | concave points (number of concave portions of the contour)        |
| 9  | symmetry                                                          |
| 10 | fractal dimension (“coastline approximation” - 1)                 |

The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in <b>30 features</b>.

### Plan of work

#### STEP 1: Import Libraries

You are NOT ALLOWED to use any libraries for directly implementing Logistic Regression.

For example, [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) is <font color='red'>NOT ALLOWED</font>

You need to implement Logistic Regression from scratch using the Gradient Descent optimization algorithm.

You can use libraries for:
* Loading data (Pandas, NumPy), <font color='green'>ALLOWED</font>
* Preprocessing data (sklearn > preprocessing), <font color='green'>ALLOWED</font>
* Partitioning data (sklearn > train_test_split), <font color='green'>ALLOWED</font>
* Plotting graphs (Matplotlib), <font color='green'>ALLOWED</font>
* Finding accuracy, precision and recall using sklearn.metrics, <font color='green'>ALLOWED</font>

You can alternatively use other libraries to implement any sub-task (e.g. loading, partitioning etc.)

In [5]:
import math
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import accuracy_score, precision_score, recall_score
import matplotlib.pyplot as plt
from numpy import savetxt, loadtxt
%matplotlib inline

#### Step 2: Data Loading <font color='blue'>(5 Points)</font>

1. Read 'wdbc.csv' using the Pandas library and load the data into a dataframe
2. Drop the first row, as it is the header row (`pd.read_csv` treats the first row as the header by default)
3. Map Malignant to Class 1 and Benign to Class 0
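The label mapping in this step can be sketched as follows. This is a minimal illustration on a small hand-made dataframe rather than on wdbc.csv itself; the column names `y` and `x1` follow the loaded data shown in this notebook, while the variable names `toy`, `labels` and `features` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of wdbc.csv (values are illustrative only)
toy = pd.DataFrame({
    "y": ["M", "B", "B", "M"],
    "x1": [17.99, 11.42, 9.423, 20.57],
})

# Map the diagnosis letters to numeric classes: Malignant -> 1, Benign -> 0
labels = toy["y"].map({"M": 1, "B": 0})

# Everything except the label column is a feature
features = toy.drop(columns=["y"])

print(labels.tolist())
```

Using an explicit dictionary with `Series.map` has the side effect that any unexpected label value becomes `NaN`, which makes data problems easy to spot.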

In [9]:
data = pd.read_csv("wdbc.csv") 
data
X = data.loc[:, data.columns != 'y'] 
Y = data['y'] 
X,Y


(         x1     x2      x3      x4  ...      x27      x28     x29      x30
 0    17.990  10.38  122.80  1001.0  ...  0.71190  0.26540  0.4601  0.11890
 1    20.570  17.77  132.90  1326.0  ...  0.24160  0.18600  0.2750  0.08902
 2    19.690  21.25  130.00  1203.0  ...  0.45040  0.24300  0.3613  0.08758
 3    11.420  20.38   77.58   386.1  ...  0.68690  0.25750  0.6638  0.17300
 4    20.290  14.34  135.10  1297.0  ...  0.40000  0.16250  0.2364  0.07678
 ..      ...    ...     ...     ...  ...      ...      ...     ...      ...
 495  10.290  27.61   65.67   321.4  ...  0.20000  0.09127  0.2226  0.08283
 496  10.160  19.59   64.73   311.7  ...  0.01005  0.02232  0.2262  0.06742
 497   9.423  27.88   59.26   271.3  ...  0.00000  0.00000  0.2475  0.06969
 498  14.590  22.68   96.39   657.1  ...  0.36620  0.11050  0.2258  0.08004
 499  11.510  23.93   74.52   403.5  ...  0.36300  0.09653  0.2112  0.08732
 
 [500 rows x 30 columns], 0      M
 1      M
 2      M
 3      M
 4      M
       ..
 

#### Step 3: Data Partitioning <font color='blue'>(5 Points)</font>

1. Partition your data into training (80%), validation (20%) and testing (20%) data using the sklearn library (Hint: use train_test_split)
2. Separate the Target Label (y) and Features (x1 to x30) for training, validation and testing data.
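One possible way to obtain three partitions is to call `train_test_split` twice. The sketch below runs on synthetic data of the same shape as the dataset. Because the percentages listed in this step add up to more than 100%, it assumes a 60% / 20% / 20% split purely for illustration; the `random_state` value and the use of `stratify` are also assumptions, not requirements of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the WDBC data (500 rows, 30 features)
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 30))
y = rng.integers(0, 2, size=500)

# First split: hold out 20% of all rows as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Second split: 25% of the remaining 80% equals 20% of the full data
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(X_train.shape, X_val.shape, X_test.shape)
```

Changing the two `test_size` values is enough to get a different ratio.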

#### Step 4: Scaling Features <font color='blue'>(5 Points)</font>

One simple scaling function that you can use off the shelf is scikit-learn's `MinMaxScaler`. It transforms features by scaling each feature to a range between <b> 0 and 1 </b>. This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

The transformation is given by:

$X_{std} = \frac{X - X_{min}}{X_{max} - X_{min}}$

$X_{scaled} = X_{std} * (maxRange - minRange) + minRange$

$maxRange$ = 1, <br>
$minRange$ = 0, <br>
$X_{max}$ and $X_{min}$ are over axis = 0 (Each columns max and min value) 

##### Why do we need to scale features?
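We scale the data to bring all the features to the same range (in our case between 0 and 1). Without scaling, features with large numeric ranges (e.g. area) dominate the weighted sum and its gradients, which makes gradient descent converge more slowly.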
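The transformation above can be checked against `MinMaxScaler` on a small synthetic example. In this sketch the array names are hypothetical; the scaler is fitted on the training features only and then reused for the validation features, which is a common convention rather than something this assignment prescribes.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1000, size=(8, 3))  # synthetic training features
X_val = rng.uniform(0, 1000, size=(4, 3))    # synthetic validation features

# Fit the per-column min and max on the training set, then transform both sets
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# The same transformation written out by hand (column-wise, axis=0)
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)
X_train_manual = (X_train - x_min) / (x_max - x_min)

print(np.allclose(X_train_scaled, X_train_manual))
```

Note that validation or test values can fall slightly outside [0, 1] when they exceed the range seen during training.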

#### Hint: make sure the dimensionality of the training features, weights and bias is consistent for np.dot and np.multiply
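A short illustration of why the shapes matter, using placeholder arrays of the sizes used in this notebook (30 features, weights of shape (30, 1)). The variable names are hypothetical. The pitfall shown is NumPy broadcasting: subtracting a flat label vector from a column of predictions silently produces a square matrix.

```python
import numpy as np

n_samples, n_features = 5, 30
X = np.ones((n_samples, n_features))   # features: (5, 30)
W = np.zeros((n_features, 1))          # weights:  (30, 1)
b = 0.0                                # scalar bias broadcasts over all rows
y_flat = np.array([0, 1, 0, 1, 1])     # labels as a flat vector: (5,)

z = np.dot(X, W) + b                   # (5, 30) . (30, 1) -> (5, 1)

# Pitfall: (5, 1) minus (5,) broadcasts to (5, 5) instead of raising an error
bad_difference = z - y_flat

# Reshaping the labels into a column keeps everything (5, 1)
y_col = y_flat.reshape(-1, 1)
good_difference = z - y_col

print(z.shape, bad_difference.shape, good_difference.shape)
# prints: (5, 1) (5, 5) (5, 1)
```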

#### Step 5: Initialization of Variables
* Initialize Hyper-Parameters (Learning Rate, Number of Epochs) to some value
* Initialize weights to any random values (We have initialized weights as an array of random values sampled from a normal distribution)
* Initialize bias to any scalar value (We have initialized bias to value 0)
* Initialize other variables which may be used for tracking cost, number of data points, etc.

In [2]:
# Hyper parameter initialization
learningrate = 1
epochs = 1
bias = 0

# Initialize the weight matrix with random values:
# np.random.randn returns samples from the “standard normal” distribution.
weights = np.random.randn(30,1)

#### Step 6: TRAINING with Logistic Regression Implementation using Gradient Descent Algorithm <font color='blue'>(30 Points)</font>

Iteratively update the weights and biases for each epoch using:
* Step 6.1: Use genesis equation $\hat{y} = \sigma (W.X + b)$ where $W$ is the weight array, $X$ is the input features and $\hat{y}$ is the predicted value which will be between 0 and 1. (You will have to perform same operation on validation set as well)
* Step 6.2: Find Binary Cross Entropy Cost for training and validation set using predicted value $\hat{y}$ and truth value $y$
* Step 6.3: Find $ \Delta W = \frac{\partial L}{\partial W}$ and $ \Delta b = \frac{\partial L}{\partial b}$ (Proof for finding $\Delta W$ and $\Delta b$ is available in the professor's slides)
* Step 6.4: Update $W$ and $b$ using learning rate as follows:
  - $W = W - learningRate*\Delta W$
  - $b = b - learningRate*\Delta b$
* Step 6.5: Store BCE Cost for training and validation in separate cost tracking lists
* Step 6.6: Calculate Training and Validation Accuracy and store in separate accuracy tracking lists (<b>Hint</b>: Threshold $\hat{y}$ at 0.5 for category determination for finding accuracy)

Run Step 6 multiple times, each time with a different set of hyperparameters, and determine the set of hyperparameters that gives the best training and validation accuracy.

Corresponding to the best set of hyper parameters, you should get the best set of weights and bias.

In [3]:
# For each epoch:
    
    # Step 6.1
    
    # Step 6.2
    
    # Step 6.3
    
    # Step 6.4
    
    # Step 6.5
    
    # Step 6.6
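The steps of the loop can be sketched as below. This is one possible layout under stated assumptions, not the reference solution: it runs on synthetic data, the helper names `sigmoid` and `bce_cost` are hypothetical, the learning rate and epoch count are arbitrary, and the gradients are the standard ones for the mean binary cross entropy, $\Delta W = \frac{1}{n} X^T(\hat{y} - y)$ and $\Delta b = \frac{1}{n}\sum(\hat{y} - y)$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_cost(y_true, y_hat, eps=1e-12):
    """Mean binary cross entropy; clipping avoids log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat)))

# Synthetic, already-scaled data standing in for the real partitions (assumption)
rng = np.random.default_rng(0)
X_all = rng.uniform(0, 1, size=(200, 30))
scores = X_all @ rng.normal(size=(30, 1))
y_all = (scores > np.median(scores)).astype(float)        # labels, shape (200, 1)
X_train, y_train = X_all[:160], y_all[:160]
X_val, y_val = X_all[160:], y_all[160:]

weights = rng.normal(size=(30, 1))
bias = 0.0
learning_rate = 0.1   # arbitrary illustrative value
epochs = 500          # arbitrary illustrative value

train_costs, val_costs, train_accs, val_accs = [], [], [], []

for epoch in range(epochs):
    # Step 6.1: predictions for training and validation sets
    y_hat_train = sigmoid(X_train @ weights + bias)
    y_hat_val = sigmoid(X_val @ weights + bias)

    # Step 6.2: binary cross entropy cost
    train_cost = bce_cost(y_train, y_hat_train)
    val_cost = bce_cost(y_val, y_hat_val)

    # Step 6.3: gradients of the mean BCE with respect to weights and bias
    error = y_hat_train - y_train                          # shape (160, 1)
    delta_w = X_train.T @ error / len(X_train)             # shape (30, 1)
    delta_b = float(np.mean(error))

    # Step 6.4: gradient descent update
    weights = weights - learning_rate * delta_w
    bias = bias - learning_rate * delta_b

    # Step 6.5: track costs
    train_costs.append(train_cost)
    val_costs.append(val_cost)

    # Step 6.6: threshold at 0.5 and track accuracy
    train_accs.append(float(np.mean((y_hat_train >= 0.5) == (y_train == 1))))
    val_accs.append(float(np.mean((y_hat_val >= 0.5) == (y_val == 1))))
```

Only the training set contributes to the gradients; the validation set is used purely for monitoring.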

After performing all the steps above, the best set of updated weights and bias should be stored as a <i>weights_bias.csv</i> file. Your <i>weights_bias.csv</i> will be tested on a hidden test set, and you will be graded on how well your model (weights) performs on this hidden set <font color='blue'>(30 Points)</font>

<b>(Do not change the code provided in the cell below for storing the weights and bias)</b>

In [4]:
# Save the weights file (DO NOT CHANGE THIS CODE)
weights_bias = np.append(weights,bias)

if weights_bias.shape == (31,):
    print("Weights and Bias consistent :) ")
    savetxt('weights_bias.csv', weights_bias, delimiter=',')
else:
    print("Weights and Bias inconsistent :( \
          Weights array shape should be (30,) and Bias should be a scalar \
          weights_bias variable should be shaped as (31,)")

Weights and Bias consistent :) 
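To sanity-check the saved file, it can be read back with `loadtxt` and split into weights and bias again. The sketch below is only an assumption about how such a file might be consumed (the actual grading script is not shown in this notebook); it uses random placeholder weights and a temporary directory so that it does not overwrite your real `weights_bias.csv`.

```python
import os
import tempfile
import numpy as np

# Placeholder parameters with the shapes used in this notebook
rng = np.random.default_rng(0)
weights = rng.normal(size=(30, 1))
bias = 0.25
weights_bias = np.append(weights, bias)          # flattened to shape (31,)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "weights_bias.csv")
    np.savetxt(path, weights_bias, delimiter=",")
    loaded = np.loadtxt(path, delimiter=",")     # shape (31,)

# First 30 entries are the weights, the last entry is the bias
loaded_weights = loaded[:-1].reshape(-1, 1)
loaded_bias = float(loaded[-1])

print(loaded.shape, loaded_weights.shape, loaded_bias)
```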


#### Step 7: Plot Training and Validation Cost vs Number of Epochs <font color='blue'>(5 Points)</font>

####  Step 8: Plot Training and Validation Accuracy vs Number of Epochs <font color='blue'>(5 Points)</font>
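A minimal plotting sketch for Steps 7 and 8. The helper `plot_curves` is hypothetical, and the numbers passed to it are made-up placeholders; in practice you would pass the cost or accuracy tracking lists collected in Step 6.

```python
import matplotlib.pyplot as plt

def plot_curves(train_values, val_values, ylabel):
    """Plot one training curve and one validation curve against the epoch number."""
    fig, ax = plt.subplots()
    epoch_axis = range(1, len(train_values) + 1)
    ax.plot(epoch_axis, train_values, label="Training")
    ax.plot(epoch_axis, val_values, label="Validation")
    ax.set_xlabel("Epoch")
    ax.set_ylabel(ylabel)
    ax.set_title(f"Training and Validation {ylabel} vs Number of Epochs")
    ax.legend()
    return fig, ax

# Placeholder values only, not real results
fig, ax = plot_curves([0.9, 0.6, 0.4, 0.3], [1.0, 0.7, 0.5, 0.45], "BCE cost")
```

The same helper can be called a second time with the accuracy lists and `ylabel="Accuracy"`.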

#### Step 9: Test your model using testing data <font color='blue'>(15 Points)</font>

* Step 9.1: Use genesis equation $\hat{y} = \sigma (W.X_{test} + b)$ where $W$ is the weight array, $X_{test}$ is the input test features and $\hat{y}$ is the predicted value which will be between 0 and 1.
* Step 9.2: Threshold $\hat{y}$ at 0.5 to find the category for each data point.
* Step 9.3: Find accuracy, precision and recall for testing data (you can use the sklearn.metrics library)
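The three testing steps can be sketched on a tiny hand-made example. The two-feature weights, the test rows and the labels below are all made up for illustration (the real model has 30 weights), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameters and test data (2 features instead of 30, for readability)
W = np.array([[2.0], [-2.0]])
b = 0.0
X_test = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.2, 0.5],
    [0.9, 0.1],
    [0.6, 0.3],
    [0.1, 0.8],
])
y_test = np.array([1, 0, 1, 1, 0, 0])

# Step 9.1: predicted probabilities, shape (6, 1)
y_hat = sigmoid(X_test @ W + b)

# Step 9.2: threshold at 0.5 and flatten to match the label vector
y_pred = (y_hat >= 0.5).astype(int).ravel()

# Step 9.3: metrics from sklearn.metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(y_pred.tolist(), accuracy, precision, recall)
```

In this toy example there are 2 true positives, 1 false positive and 1 false negative, so accuracy is 4/6 and both precision and recall are 2/3.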

#### Step 10: Submission to timberlake server

* The code for your implementation should be in this Python notebook with necessary comments within the code.

* Your <b> Python Code file </b> `main.ipynb`, <b> Data File </b> `wdbc.csv` and your <b>Trained Weights and Bias File</b> `weights_bias.csv` should be put in a single folder named `proj1code`.

* `proj1code` folder should be zipped with the resulting zip file name as `proj1code.zip`.

* Submit the Python code on CSE timberlake server with the following script:

 - `submit_cse474 proj1code.zip` for undergraduates
 - `submit_cse574 proj1code.zip` for graduates

### Grading Rubric
* <b>30 Points:</b> Your trained `weights_bias.csv` will be automatically graded using a script on an unbiased hidden test data file. Hence, it is important that your `weights_bias` array has dimensionality (31,) and is properly trained.
* <b>30 Points:</b> Training logic for implementing logistic regression (Step 6)
* <b>15 Points:</b> Testing Accuracy, Precision and Recall (Step 9)
* <b>5 Points:</b> Plot of Training and Validation cost vs No. of epochs (Step 7) 
* <b>5 Points:</b> Plot of Training and Validation accuracy vs No. of epochs (Step 8)
* <b>5 Points:</b> Scaling features (Step 4)
* <b>5 Points:</b> Partitioning Data (Step 3)
* <b>5 Points:</b> Data loading (Step 2)