# Project objective
This project reviews the logistic regression method and its Python implementation using the Wisconsin breast cancer dataset.

Information about the dataset, technical details of the machine learning method used, and the mathematical definitions of the evaluation metrics are provided alongside the code.

# Packages we work with in this notebook
We are going to use the following libraries and packages:

* **numpy**: NumPy is the fundamental package for scientific computing with Python. (http://www.numpy.org/)
* **sklearn**: Scikit-learn is a machine learning library for the Python programming language. (https://scikit-learn.org/stable/)


In [0]:
import numpy as np
import sklearn as sk

# Introduction to the dataset

**Name**: Wisconsin breast cancer dataset

**Summary**: Predicting whether a tumor is malignant or benign from features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.

**Number of features**: 30 (real, positive)

**Number of data points (instances)**: 569

**Dataset accessibility**: The dataset is available as part of the sklearn package.

**Link to the dataset**: http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)




## Loading the dataset and separating features and labels
The dataset is available as part of the sklearn package. Hence, we do not need to download the data from the UCI ML repository ourselves.

In [0]:
from sklearn.datasets import load_breast_cancer

# Loading breast cancer data
target_dataset = load_breast_cancer()

# separating the feature array (X) and the labels (y)
input_features = target_dataset.data
output_var = target_dataset.target
# printing the number of data points and features
n_samples, n_features = input_features.shape
print("number of samples (data points):", n_samples)
print("number of features:", n_features)

number of samples (data points): 569
number of features: 30
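
Before modeling, it helps to check what the integer labels mean and how many samples each class has. The following is a minimal sketch; the variable names `label_names` and `class_counts` are chosen here for illustration. In scikit-learn's copy of the dataset, the printout should show that label 0 is malignant and label 1 is benign, so the positive class of a model fitted on these labels is benign.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names maps each integer label to its meaning (index = label value)
label_names = [str(name) for name in data.target_names]

# Count how many samples carry each label
class_counts = np.bincount(data.target)

for label, (name, count) in enumerate(zip(label_names, class_counts)):
    print(f"label {label} = {name}: {count} samples")
```

The two classes are not equally represented, which is one reason to look at balanced accuracy in addition to plain accuracy later on.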


## Splitting data to training and testing sets

If we do not have a separate dataset for validation and/or testing, we need to split the data into training and test sets so that we can check the generalizability of the model we train.

**test_size**: Traditionally, 30%-40% of the dataset is held out as the test set. If you split the data into training, validation, and test sets, you can use 60%, 20%, and 20% of the dataset, respectively.

**Note.**: The validation and test sets need to be big enough to check the generalizability of our model. At the same time, we would like to keep as much data as possible in the training set to train a better model.

**random_state**: As the name suggests, this seeds the internal random number generator that decides which samples go to the training set and which go to the test set. Fixing it makes the split reproducible.


In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(input_features, output_var, test_size=0.30, random_state=5)
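
The 60%/20%/20% split mentioned above can be obtained by calling `train_test_split` twice. This is a sketch under a few assumptions: the variable names (`X_rest`, `X_val`, and so on) are illustrative, and `stratify` is an optional addition, not used elsewhere in this notebook, that keeps the class proportions similar in every subset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Step 1: hold out 20% of the data as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=5, stratify=y
)

# Step 2: split the remaining 80% into 75% train / 25% validation,
# which corresponds to roughly 60% / 20% of the full dataset
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=5, stratify=y_rest
)

print("train / validation / test sizes:", len(X_train), len(X_val), len(X_test))
```

The validation set would be used for choosing settings such as regularization strength, leaving the test set untouched until the final evaluation.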

## Building the supervised learning model
We want to build a binary classification model as the output variable is categorical with 2 classes. Here we build a simple logistic regression model.

### Logistic regression
If we have a set of features $X_1$ to $X_n$, $y$ can be obtained as:
\begin{equation*} y=b_0+b_1X_1+b_2X_2+...+b_nX_n \end{equation*}

where $y$ is the linear score obtained as the intercept plus a weighted sum of the feature values.

The probability of the positive class (for example, that the tumor is malignant) is then obtained by applying the logistic function to $y$:

\begin{equation*} p(\text{class}=\text{malignant})=\frac{1}{1+\exp(-y)} \end{equation*}

Based on the class labels and the features given in the training data, the coefficients $b_0$ to $b_n$ are obtained during the optimization process.

$b_0$ to $b_n$ are fixed for all samples, while $X_1$ to $X_n$ are feature values specific to each sample. Hence, the logistic function gives a class probability for each sample. Finally, the model assigns each sample to the class with the highest probability.


**Note.** The logistic regression model is parametric, and its parameters are the regression coefficients $b_0$ to $b_n$.
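
To make the two formulas concrete, here is a small worked example for a single sample with two features. The coefficient and feature values are hypothetical, chosen only so the arithmetic is easy to follow, and `predict_probability` is an illustrative helper, not part of any library.

```python
import math


def predict_probability(intercept, coefficients, features):
    """Return the logistic-function output for one sample."""
    # Weighted sum: y = b0 + b1*X1 + ... + bn*Xn
    y = intercept + sum(b * x for b, x in zip(coefficients, features))
    # The logistic function maps y to the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-y))


# Hypothetical values, for illustration only
b0 = -1.0
b = [0.5, -0.25]
x = [4.0, 2.0]

p = predict_probability(b0, b, x)

# With two classes, "highest probability" means p >= 0.5 for the positive class
predicted_class = 1 if p >= 0.5 else 0
print(f"p = {p:.4f}, predicted class = {predicted_class}")
```

Here the weighted sum is -1.0 + 0.5 * 4.0 - 0.25 * 2.0 = 0.5, which the logistic function maps to a probability slightly above 0.62, so the sample is assigned to the positive class.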

In [0]:
from sklearn.linear_model import LogisticRegression 

# Create logistic regression object
logreg = LogisticRegression()

# Train the model using the training sets
logreg.fit(X_train, y_train)
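
Depending on the scikit-learn version, fitting on unscaled features such as these may stop at the solver's iteration limit and emit a `ConvergenceWarning`. One common remedy, sketched below as an optional variation and not as the approach used in this notebook, is to standardize the features inside a pipeline. The name `scaled_logreg` and the choice of `StandardScaler` are assumptions made for illustration, and the resulting scores will generally differ from the ones reported in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=5
)

# The scaler is fit on the training data only; the pipeline reuses those
# training statistics when it transforms the test data.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logreg.fit(X_train, y_train)

test_accuracy = scaled_logreg.score(X_test, y_test)
print("test accuracy with standardized features:", test_accuracy)
```

Wrapping the scaler and the classifier in one pipeline keeps test-set statistics out of the preprocessing step.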

## Prediction of test (or validation) set
We now use the trained model to predict the labels of the test set, which we will later compare against y_test.

In [0]:
# Make predictions using the testing set
y_pred = logreg.predict(X_test)
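
The class labels returned by `predict` come from the class probabilities described earlier. The sketch below retrains the same kind of model so that it is self-contained and then checks that picking the most probable class from `predict_proba` reproduces `predict`. Raising `max_iter` is an assumption made here to give the solver room to converge; it is not a setting used in the notebook, and `y_from_proba` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=5
)

# max_iter is raised here (an assumption, not the notebook's setting)
logreg = LogisticRegression(max_iter=10000)
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

# One row per sample, one column per class, in the order of logreg.classes_
proba = logreg.predict_proba(X_test)

# Picking the most probable class for each sample reproduces predict()
y_from_proba = logreg.classes_[np.argmax(proba, axis=1)]

print("classes:", logreg.classes_)
print("first three probability rows:\n", proba[:3])
print("matches predict():", np.array_equal(y_pred, y_from_proba))
```

Working with the probabilities directly is useful when a decision threshold other than 0.5 is wanted, for instance to reduce the number of missed malignant cases.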

## Evaluating performance of the model
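We need to assess the performance of the model using its predictions on the test set. We use accuracy and balanced accuracy; the latter is built from recall and specificity, so all four are defined here. In the formulas, tp, tn, fp, and fn are the numbers of true positives, true negatives, false positives, and false negatives, respectively: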

* **recall**: in this context also referred to as the true positive rate or sensitivity

$$\text{recall}=\frac{tp}{tp+fn}$$

* **specificity**: also referred to as the true negative rate

$$\text{specificity}=\frac{tn}{tn+fp}$$

* **accuracy**: the fraction of all samples, across both classes, that are predicted correctly:

$$\text{accuracy}=\frac{tp+tn}{tp+tn+fp+fn}=\frac{\text{number of correct predictions}}{\text{total number of data points (samples)}}$$


* **balanced accuracy**: the average of recall and specificity, so that each class contributes equally regardless of how many samples it has:

$$\text{balanced accuracy}=\frac{\text{recall}+\text{specificity}}{2}$$
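
The definitions above can be checked by hand on a small example. The labels below are made up for illustration (eight positives, two negatives), and `confusion_counts` is a hypothetical helper written for this sketch, not a library function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


# Made-up labels: the model gets every positive right but misses one of
# the two negatives
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)

recall = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
balanced_accuracy = (recall + specificity) / 2

print("recall:", recall)
print("specificity:", specificity)
print("accuracy:", accuracy)
print("balanced accuracy:", balanced_accuracy)
```

In this example accuracy is 0.9 while balanced accuracy is only 0.75, because half of the minority class is misclassified. This is why balanced accuracy is informative when the classes have different sizes.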


In [0]:
from sklearn import metrics

print("accuracy of the predictions:", metrics.accuracy_score(y_test, y_pred))
print("balanced accuracy of the predictions:", metrics.balanced_accuracy_score(y_test, y_pred))

accuracy of the predictions: 0.9532163742690059
balanced accuracy of the predictions: 0.9453800298062593


## Extracting the coefficients of the model
The trained logistic regression model predicts the class of a data point as a function of a linear combination of its feature values. Hence, each feature has a coefficient in this linear combination for predicting the output variable.

In [0]:
print('Coefficients: {}'.format(logreg.coef_))

Coefficients: [[ 0.80188312  0.47135725  0.34246283 -0.02092841 -0.02696836 -0.13034695
  -0.1897339  -0.08169113 -0.03612479 -0.0058486   0.04197023  0.44695641
   0.11623421 -0.10294458 -0.00241085 -0.02734028 -0.04143775 -0.0105686
  -0.00729451 -0.00227194  0.91413767 -0.52889934 -0.28273719 -0.00835908
  -0.04723636 -0.39423874 -0.5180321  -0.1517702  -0.11723418 -0.03670763]]
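
The raw array is easier to read when each coefficient is paired with its feature name. The sketch below retrains a model so that it is self-contained; raising `max_iter` is an assumption made here, so the printed values need not match the output above. The name `ranked` is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.30, random_state=5
)

# max_iter is raised here (an assumption, not the notebook's setting)
logreg = LogisticRegression(max_iter=10000)
logreg.fit(X_train, y_train)

# coef_ has shape (1, n_features) for a binary problem; take the single row
coefficients = logreg.coef_[0]

# Pair each coefficient with its feature name, largest magnitude first
ranked = sorted(
    zip(data.feature_names, coefficients),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

# The intercept b0 is stored separately from the feature coefficients
print("intercept (b0):", logreg.intercept_[0])
for name, value in ranked[:5]:
    print(f"{name:<25s} {value:+.4f}")
```

Because the features are measured on very different scales, the raw magnitudes are not directly comparable as importance scores; standardizing the features before fitting would make such a comparison more meaningful.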
