# Project Info

- Title of the Project: **Breast Cancer Detection**.
- Name of the author: **Samiulhaq Chardiwall**.
- Date: **01 Sep 2023**.

- Project Goal: **Building a Breast Cancer Diagnostic Model**.
---
- **Project Overview**:

    The goal of this project is to learn the complete flow of a machine learning project and to grasp the concept of logistic regression. The most important step in an ML project is knowing the data well, so the notebook starts by exploring the data. It then implements logistic regression and the evaluation metrics from scratch and compares them with scikit-learn's logistic regression, a random forest, and a neural network. Next, hyperparameter optimization is applied to logistic regression to achieve the best result. Finally, the project closes by testing the model and predicting some records.

    **Key Steps:**

    1. Data Collection:
       
       - Acquire the Breast Cancer Wisconsin dataset from the provided Kaggle link.
        - Data Preprocessing:
            - Handle missing values, if any.
            - Encode categorical variables, if present.
            - Split the dataset into training and testing sets.

    2. Exploratory Data Analysis (EDA):
       
        - Conduct exploratory data analysis to gain insights into the data.
        - Visualize the distribution of features and relationships between variables.
        - Feature Selection:

            - Identify and select relevant features for model building.
            - Use techniques like correlation analysis or feature importance scores.

    3. Model Selection:
    
        - Experiment with various machine learning algorithms such as Logistic Regression, Random Forests, and Neural Networks.
        - Train and evaluate each model using appropriate metrics like accuracy, precision, recall, and F1-score.
        - Hyperparameter Tuning:
            - Fine-tune the hyperparameters of the selected models to optimize their performance.

    4. Model Evaluation:
    
       - Compare the performance of different models using cross-validation.
       - Select the best-performing model based on evaluation metrics.
       - Model Interpretation:

           - Interpret the results and provide insights into the factors contributing to accurate breast cancer classification.
    
    5. Documentation and Reporting:
    
       - Create a well-structured notebook or report that documents your entire workflow, from data preprocessing to model selection.
       - Explain your decisions, insights, and the rationale behind choosing the final model.

**Outcome**: The primary outcome of this project is a reliable machine learning model built for learning purposes.

---

# Machine Specification

In [None]:
# This code uses shell commands to display system information such as the kernel name,
# network node host, machine hardware, and operating system in a clear and concise manner.

!echo Kernel Name: $(uname -s)
!echo

!echo Network Node Host: $(uname -n)
!echo

!echo Machine Hardware: $(uname -m)
!echo

!echo Machine Operating System: $(uname -o)

---

## Importing Essential Libraries

In [None]:
# This code imports essential libraries for data analysis, visualization, and machine learning plus additional libraries
import io
import os
from rich import print as pprint

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import ConfusionMatrixDisplay, PrecisionRecallDisplay
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.activations import linear, relu, sigmoid

# disable Tensorflow debugging information
import logging
logging.getLogger("tensorflow").setLevel(logging.INFO)

pd.set_option('display.precision', 3)

## Loading and Exploring Data

In [None]:
# This code uses the "tree" shell command to display the files in both the current directory and the parent directory.
!echo Files in Current Directory: && tree

!echo Files in Parent Directory: && tree ..

In [None]:
# This code builds the dataset path and prints it.
dataset_path = os.path.join('..', 'input', 'breast-cancer-wisconsin-data', 'data.csv')
print(dataset_path)

# This code loads the dataset with pandas and displays its shape (rows x columns).
data = pd.read_csv(dataset_path)
print('\nShape of the data (rows x columns):', data.shape, '\n')

# This line displays the first few rows of the dataset.
data.head()

In [None]:
# This method prints a concise summary of the DataFrame including the index data type and columns, non-null values and memory usage.
data.info()

In [None]:
# this portion of code displays the information about column data types in the DataFrame.

# - Print columns with 'object' data type (typically strings or categorical data).
pprint('Columns with Object Data Type:', data.select_dtypes(include='object').columns, sep='\n')

# - Print columns with 'integer' data type.
pprint('\n\nColumns with integer Data Type:', data.select_dtypes(include='int').columns, sep='\n')

# - Print columns with 'float' data type.
pprint('\n\nColumns with float Data Type:', data.select_dtypes(include='float').columns, sep='\n')

# - Also, print the total number of columns with 'float' data type.
print('Number of Columns with float data type:', len(data.select_dtypes(include='float').columns))

In [None]:
# Generate a statistical summary of the 'data' DataFrame: count, mean, standard deviation, minimum value, 25th percentile (first quartile), median (second quartile), 75th percentile (third quartile), and maximum value.
data.describe()

## Dealing with Null Values and Unuseful Columns

In [None]:
# A quick peek at the first 5 records
data.head()

In [None]:
# This code generates a dictionary that counts the number of unique values for each column in the 'data' DataFrame.
unique_value_sum = { c: len(data[c].unique()) for c in data.columns.values }
pprint(unique_value_sum)

In [None]:
# This code checks if there are any null (missing) values in the 'data' DataFrame and prints 'Yes' if there are any, otherwise 'No'.
print('Are there any null values:', 'Yes' if data.isnull().values.any() else 'No')
print('Number of Null Values:', data.isnull().values.sum())

In [None]:
# This code identifies and returns the column names in the 'data' DataFrame that contain at least one missing (null) value.
data.columns[data.isnull().any()]

In [None]:
# This code snippet counts the number of non-null values in the 'Unnamed: 32' column
# If the result of .count() is 0 for a specific column in a DataFrame, it means there are no actual values (valid data points) in that column.
data['Unnamed: 32'].count()

In [None]:
# This code drops the 'Unnamed: 32' column, which has no valid data points, and the 'id' column, which carries no predictive meaning.
data.drop(columns=['Unnamed: 32', 'id'], inplace=True)
print('Are there any null values:', 'Yes' if data.isnull().values.any() else 'No')

## Dealing With Categorical Data

In [None]:
data.head()

In [None]:
# This line of code prints the count of categorical columns.
print('Number of Categorical Columns:', len(data.select_dtypes(include='object').columns ))

# Then, it lists the names of these categorical columns.
print('Column Name:', *data.select_dtypes(include='object').columns.values)

In [None]:
# This code displays the number of unique values in the 'diagnosis' column, and then the actual unique values.
print( 'Number of unique value:', len(data['diagnosis'].unique()) )
print( 'Unique Values:', *data['diagnosis'].unique() )

In [None]:
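# pd.get_dummies converts categorical variables into dummy/indicator variables.
# With drop_first=True, 'diagnosis' becomes a single 'diagnosis_M' column (True for malignant).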
data = pd.get_dummies(data=data, drop_first=True)
data.head()

## Data Exploration Through Visualisation

In [None]:
# This code visualizes the frequency of benign and malignant records.
print("Benign(False): {}, Malignant(True): {}".format( *data.diagnosis_M.value_counts().values) )

sns.set_style('whitegrid')
ax = sns.countplot(data=data, x = 'diagnosis_M')

ax.set(xlabel = 'Diagnosis', ylabel = 'Count', title = 'Count of Diagnosis')
plt.show()

In [None]:
# This code plots the distribution of every feature, colored by the target variable.
fig, axes = plt.subplots(5, 6, figsize = (20,12), sharey = True)
ax = axes.ravel()

for idx, col in enumerate(data.columns.values[:-1]):
    sns.histplot(data = data, x = col, hue = 'diagnosis_M', ax=ax[idx])

plt.show()

The histograms show that there are some differences in the distribution of features between malignant and benign cases. For example, the histograms for radius, perimeter, area, and concavity show that malignant cells tend to be larger than benign cells. The histograms for smoothness, compactness, and symmetry show that malignant cells tend to be less smooth, more compact, and less symmetrical than benign cells.

Overall, the histograms provide a useful overview of the relationship between the different features in the breast cancer dataset and the target variable. This information can be used to develop machine learning models to predict whether a new case of breast cancer is malignant or benign.

In [None]:
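# Box plots of every feature, split by diagnosis (False = benign, True = malignant).
# The y-axis is not shared because the features are on very different scales.
fig, axes = plt.subplots(5, 6, figsize = (20, 12))
ax = axes.ravel()

for idx, col in enumerate(data.columns.values[:-1]):
    sns.boxplot(data = data, x = 'diagnosis_M', y = col, ax = ax[idx])

plt.tight_layout()
plt.show()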

In general, the box plots show that the distributions of most features are different between patients with and without breast cancer. For example, the median values of the features smoothness_worst, concave points_worst, symmetry_worst, and fractal dimension_worst are all higher in patients with breast cancer. This means that these features are more likely to have high values in patients with breast cancer.

However, there are also some features that do not appear to be different between patients with and without breast cancer. For example, the median values of the features compactness_worst and concavity_worst are similar in both groups of patients.

Here are some specific observations from the box plots:

- The distribution of the feature smoothness_worst is skewed to the right in patients with breast cancer, meaning that there are more patients with high values of this feature.
- The distribution of the feature concave points_worst is also skewed to the right in patients with breast cancer.
- The distribution of the feature symmetry_worst is more spread out in patients with breast cancer, meaning that there is a wider range of values for this feature in this group.
- The distribution of the feature fractal dimension_worst is also more spread out in patients with breast cancer.
- The distribution of the feature compactness_worst is similar in patients with and without breast cancer.
- The distribution of the feature concavity_worst is also similar in patients with and without breast cancer.

## Correlation

In [None]:
X_temp, y_temp = data.drop(['diagnosis_M'], axis = 'columns'), data.diagnosis_M

X_temp.corrwith(y_temp).plot.bar(
    figsize=(15,5,),
    title = "Correlation with diagnosis_M",
    grid = True, rot = 70)
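
`corrwith` computes, for each feature column, the Pearson correlation with the target. The sketch below spells out that calculation in NumPy. The helper `pearson_corr` and the arrays `feature` and `target` are hypothetical, with made-up values rather than rows from the dataset.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Hypothetical values: a feature that tends to be larger for malignant (1) cases.
feature = np.array([11.2, 12.5, 13.1, 17.9, 20.3, 19.4])
target = np.array([0, 0, 0, 1, 1, 1])

r = pearson_corr(feature, target)
print(f"correlation with target: {r:.3f}")
```

A value close to 1 or -1 indicates a strong linear relationship with the diagnosis, while a value near 0 indicates little linear relationship.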

In [None]:
corr = data.corr()
corr

In [None]:
plt.figure(figsize=(15, 8))
sns.heatmap(corr, annot=True,)
plt.show()

In [None]:
plt.figure(figsize=(15, 8))
sns.heatmap( corr.where( (abs(corr)>0.6) & (abs(corr)<1) ) , annot=True )
plt.show()

## Data Splitting

In [None]:
X = data.drop(columns = ['diagnosis_M'], axis = 'columns')
X.head()

In [None]:
Y = data.diagnosis_M
print(len(Y))
Y.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size= 0.4, random_state = 1)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size= 0.5, random_state =1)

print("X train shape: {}, y train shape: {}".format(X_train.shape, y_train.shape ) )
print("X test.shape: {}, y test.shape: {}".format( X_test.shape, y_test.shape) )
print("X Validation shape: {}, Y Validation shape: {}".format(X_val.shape, y_val.shape) )
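
The two `train_test_split` calls produce roughly a 60/20/20 split into training, test, and validation sets. The sketch below works through the arithmetic. It assumes that the test share is rounded up and the remainder goes to training, and it uses a row count of 569, which is an assumption here because the notebook output is not shown. `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumes the second (test) share is rounded up and the remainder
    goes to the first (train) share.
    """
    n_second = math.ceil(n_samples * test_size)
    return n_samples - n_second, n_second

n_rows = 569  # assumed dataset size; check data.shape in the notebook
n_train, n_holdout = split_sizes(n_rows, 0.4)   # first call: 60% train, 40% held out
n_test, n_val = split_sizes(n_holdout, 0.5)     # second call: held-out half/half

print(n_train, n_test, n_val)
```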

## Feature Scaling

### Standardization Code from Scratch

The formula for standardization is $ x' = {x - \mu \over \sigma} $, where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.
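
As a small illustration of the formula, the sketch below standardizes a made-up training array and then reuses the training mean and standard deviation on a made-up test array. All numbers are hypothetical. Note that `np.std` defaults to the population standard deviation (`ddof=0`), as does `StandardScaler`, whereas the pandas `Series.std` method defaults to the sample standard deviation (`ddof=1`).

```python
import numpy as np

# Made-up numbers: rows are samples, columns are two features on different scales.
train = np.array([[10.0, 200.0], [12.0, 260.0], [14.0, 320.0], [16.0, 380.0]])
test = np.array([[13.0, 290.0], [18.0, 500.0]])

# Statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean and std

print(train_scaled.mean(axis=0))  # approximately 0 per column
print(train_scaled.std(axis=0))   # approximately 1 per column
print(test_scaled)
```

The test rows are not expected to have zero mean or unit variance after scaling, because the statistics were not computed from them.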

In [None]:
# Compute the mean and standard deviation from the training set only,
# then reuse them for the test and validation sets to avoid data leakage.
mean_list = list()
std_list = list()

transform_x_train = np.zeros(X_train.shape)
for idx, clm in enumerate(X_train.columns.values):
    mean_list.append( X_train[clm].mean() )
    # ddof=0 gives the population standard deviation, matching StandardScaler
    # (pandas defaults to ddof=1).
    std_list.append( X_train[clm].std(ddof=0) )

    transform_x_train[:, idx] = ( X_train[clm] - mean_list[idx] ) / std_list[idx]

transform_x_test = np.zeros(X_test.shape)
for idx, clm in enumerate(X_test.columns.values):
    transform_x_test[:, idx] = ( X_test[clm] - mean_list[idx]) / std_list[idx]

transform_x_val = np.zeros(X_val.shape)
for idx, clm in enumerate(X_val.columns.values):
    transform_x_val[:, idx] = ( X_val[clm] - mean_list[idx]) / std_list[idx]

### Standardization using the Scikit-Learn Library

In [None]:
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_val = sc.transform(X_val)

#### Checking whether both implementations are nearly identical

In [None]:
print("Do the scikit-learn and from-scratch transforms of X_train match:", "Yes" if np.allclose( X_train, transform_x_train, rtol = 1e-2, atol = 1e-8) else 'No')
print("Do the scikit-learn and from-scratch transforms of X_test match:", "Yes" if np.allclose( X_test, transform_x_test, rtol = 1e-2, atol = 1e-8) else 'No')
print("Do the scikit-learn and from-scratch transforms of X_val match:", "Yes" if np.allclose( X_val, transform_x_val, rtol = 1e-2, atol = 1e-8) else 'No')

In [None]:
fig, axes = plt.subplots(5, 6, figsize = (20, 12))
ax = axes.ravel()

for i in range(X_train.shape[1]):
    sns.boxplot(x=X_train[:,i], ax = ax[i])
    
plt.show()

In [None]:
print('X Train\n', X_train[0])

print('\n\n\n X Test\n', X_test[0])
print('\n\n\n X Validation\n', X_val[0])

## Building Model

### Logistic Regression from Scratch

$$ \sigma(z) = {1 \over 1 + e^{-z}} $$

In [None]:
def sigmoid(z):
    return 1/( 1+np.exp(-z) )
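
The textbook form above can trigger overflow warnings in `np.exp` when `z` is a large negative number. The sketch below shows one common way to avoid that by choosing an algebraically equivalent form depending on the sign of `z`. `stable_sigmoid` is a hypothetical name, and this is an optional variant rather than the function used in the rest of the notebook.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow in exp() for large-magnitude inputs.

    Always returns an array with at least one dimension.
    """
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the textbook form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use exp(z) / (1 + exp(z)) so exp() never sees a large positive argument.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```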

In [None]:
def forward_propagation(X, w, b):
    z = np.dot(X, w) + b
    return sigmoid(z)

$$ J(w, b) = -{1 \over m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] $$

In [None]:
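def compute_cost(y, y_pred):
    # Use the number of samples actually passed in, not the global X.
    y = np.asarray(y, dtype=float)
    m = len(y)
    # Clip predictions away from 0 and 1 so that log() stays finite.
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    cost = np.sum( -y * np.log(y_pred) - (1-y) * np.log(1-y_pred) ) / m

    return cost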

In [None]:
def gradient_decent(X, y, w, b, alpha, num_iter):
    m = X.shape[0]
    
    for i in range(num_iter):
        y_pred = forward_propagation(X, w, b)

        dw = 1/m * np.dot(X.T, (y_pred - y))
        db = 1/m * np.sum( y_pred - y )

        w -= alpha * dw
        b -= alpha * db
    
    return w, b
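
The updates `dw` and `db` in `gradient_decent` are the partial derivatives of the cross-entropy cost. A short derivation follows, writing the cost as $J$, the prediction as $\hat{y}_i = \sigma(z_i)$, and $z_i = w^\top x_i + b$ (this notation is introduced here for illustration). Differentiating the cost with respect to the prediction, and the sigmoid with respect to its input, gives:

$$ {\partial J \over \partial \hat{y}_i} = {1 \over m} \, {\hat{y}_i - y_i \over \hat{y}_i (1 - \hat{y}_i)}, \qquad {\partial \hat{y}_i \over \partial z_i} = \hat{y}_i (1 - \hat{y}_i) $$

By the chain rule the product simplifies to:

$$ {\partial J \over \partial z_i} = {1 \over m} (\hat{y}_i - y_i) $$

Since $z_i$ is linear in $w$ and $b$, the gradients used in the loop are:

$$ {\partial J \over \partial w} = {1 \over m} X^\top (\hat{y} - y), \qquad {\partial J \over \partial b} = {1 \over m} \sum_{i=1}^{m} (\hat{y}_i - y_i) $$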

In [None]:
def predict(X, w, b):
    y_pred = forward_propagation(X, w, b)
    return (y_pred > 0.5).astype(int)

In [None]:
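def logistic_regression(X, y, alpha = 1e-9, num_iter = 100):
    # alpha is the learning rate; the default of 1e-9 is very small, so on
    # standardized features a larger value (e.g. 0.01) usually converges faster.
    n_samples, n_features = X.shape

    w = np.zeros((n_features, ))
    b = 0

    return gradient_decent(X, y, w, b, alpha, num_iter)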

In [None]:
y_pred = forward_propagation(X_train, np.zeros(X_train.shape[1]), 0.)
compute_cost(y_train, y_pred)
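
With all weights and the bias set to zero, every prediction is 0.5, so the mean cross-entropy equals $\ln 2 \approx 0.693$ regardless of the labels. That makes a convenient sanity check for a cost implementation. The sketch below demonstrates it on hypothetical random data; `cross_entropy`, `X_demo`, and `y_demo` are illustrative names that do not appear elsewhere in the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, y_pred, eps=1e-12):
    """Mean binary cross-entropy; eps keeps log() away from 0."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_pred) + (1.0 - y) * np.log(1.0 - y_pred))

# Hypothetical data: 5 samples, 3 features, arbitrary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))
y_demo = np.array([0, 1, 1, 0, 1])

# Zero weights and bias give z = 0, so every prediction is 0.5.
y_hat = sigmoid(X_demo @ np.zeros(3) + 0.0)
cost = cross_entropy(y_demo, y_hat)

print(cost, np.log(2))
```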

In [None]:
# Fit on the training set, then predict on the held-out test set.
w, b = logistic_regression(X_train, y_train)

predict(X_test, w, b)

### Logistic Regression using Scikit Learn

In [None]:
clf_log = LogisticRegression(random_state=0)

clf_log.fit(X_train, y_train)

clf_log.score(X_test, y_test)

### Evaluation Metrics 

#### Accuracy

In [None]:
def accuracyScore(y, y_pred):
    return sum(y == y_pred)/len(y)

In [None]:
def confusionMatrix(y, y_pred):
    # Count each outcome with vectorized boolean masks.
    tp = sum((y == 1) & (y_pred == 1))  # true positives
    tn = sum((y == 0) & (y_pred == 0))  # true negatives

    fn = sum((y == 1) & (y_pred == 0))  # false negatives
    fp = sum((y == 0) & (y_pred == 1))  # false positives

    return tp, fp, tn, fn

In [None]:
def precisionScore(y, y_pred):
    tp = sum((y == 1) & (y_pred == 1))
    fp = sum((y == 0) & (y_pred == 1))
    
    return tp/(tp+fp)

In [None]:
def recallScore(y, y_pred):
    tp = sum( (y == 1) & (y_pred == 1) )
    fn = sum( (y == 1) & (y_pred == 0) )
    
    return tp/(tp+fn)

In [None]:
def f1Score(y, y_pred):
    precision = precisionScore(y, y_pred)
    recall = recallScore(y, y_pred)
    
    return 2*precision*recall / (precision + recall)
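
To make the definitions concrete, the sketch below works through the four metrics on a tiny set of made-up labels (1 = malignant, 0 = benign). The arrays `y_true` and `y_hat` are hypothetical and unrelated to the dataset.

```python
import numpy as np

# Made-up labels: 1 = malignant, 0 = benign.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # malignant, predicted malignant
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # malignant, predicted benign
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # benign, predicted malignant
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # benign, predicted benign

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fn, fp, tn)
print(accuracy, precision, recall, f1)
```

In a diagnostic setting, recall is often watched most closely, because a false negative means a malignant case was missed.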

In [None]:
w, b = logistic_regression(X_train, y_train)
y_pred = predict(X_test, w, b)

accuracy = accuracyScore(y_test, y_pred)
cfn_mtx = confusionMatrix(y_test, y_pred)
f1_scr = f1Score(y_test, y_pred)
psn_scr = precisionScore(y_test, y_pred)
rcl_scr = recallScore(y_test, y_pred)

result = pd.DataFrame.from_dict( {'Logistic Regression (from scratch)': 
                                  {'Accuracy': accuracy, 'F1 Score': f1_scr, 'Precision Score': psn_scr, 'Recall Score': rcl_scr} 
                                 })

result

In [None]:
y_pred = clf_log.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cfn_mtx = confusion_matrix(y_test, y_pred)
f1_scr = f1_score(y_test, y_pred)
psn_scr = precision_score(y_test, y_pred)
rcl_scr = recall_score(y_test, y_pred)

result['Logistic Regression (sklearn)'] =  {'Accuracy': accuracy, 'F1 Score': f1_scr, 'Precision Score': psn_scr, 'Recall Score': rcl_scr} 

result

#### Confusion Matrix

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)

In [None]:
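# Plot the precision-recall curve from the predicted probabilities of the positive class.
PrecisionRecallDisplay.from_predictions(y_test, clf_log.predict_proba(X_test)[:, 1])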

In [None]:
cross_score = cross_val_score(clf_log, X_train, y_train, cv=10)

print('Accuracy: ', cross_score.mean()*100, '%')
print('Standard Deviation: ', cross_score.std()*100, '%')
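
`cross_val_score(..., cv=10)` repeats the fit-and-score cycle ten times, each time holding out a different tenth of the training data. The sketch below shows only the index bookkeeping for plain k-fold splitting. `k_fold_indices` is a hypothetical helper written for illustration; scikit-learn itself uses a stratified variant for classifiers, which additionally keeps class proportions similar across folds.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for plain, unshuffled k-fold splitting."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, k)  # folds differ in size by at most one
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=25, k=10))

for train_idx, val_idx in splits[:3]:
    print(len(train_idx), val_idx)
```

Every sample lands in the validation fold exactly once, so the mean of the ten scores uses all of the training data for evaluation.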

### Random Forest from Scikit Learn

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf_rfc = RandomForestClassifier(random_state=0)

clf_rfc.fit(X_train, y_train)

clf_rfc.score(X_test, y_test)

In [None]:
y_pred = clf_rfc.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cfn_mtx = confusion_matrix(y_test, y_pred)
f1_scr = f1_score(y_test, y_pred)
psn_scr = precision_score(y_test, y_pred)
rcl_scr = recall_score(y_test, y_pred)

In [None]:
result['Random Forest Classifier'] = {'Accuracy': accuracy, 'F1 Score': f1_scr, 'Precision Score': psn_scr, 'Recall Score': rcl_scr} 
result

In [None]:
cross_score = cross_val_score(clf_rfc, X_train, y_train, cv=10)

print('Accuracy: ', cross_score.mean()*100, '%')
print('Standard Deviation: ', cross_score.std()*100, '%')

### Neural Network

In [None]:
model = Sequential([
    Input(shape=(30,)),
    Dense(units=16, activation='relu', kernel_initializer='he_uniform'),
    Dense(units=8, activation='relu', kernel_initializer='he_uniform'),
    Dense(units=1, activation='linear', kernel_initializer='glorot_uniform'),
], name= 'Model')

model.summary()

In [None]:
model.compile(
    optimizer = tf.keras.optimizers.Adam(learning_rate = 0.05),
    loss = tf.keras.losses.BinaryCrossentropy(from_logits = True)
)

model.fit(X_train, y_train, epochs = 20, batch_size=16)

In [None]:
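# The output layer is linear, so the model returns logits.
# Thresholding a logit at 0 is equivalent to thresholding the probability at 0.5.
pred = model.predict(X_test).ravel() > 0

accuracy = accuracy_score(y_test, pred,)
cfn_mtx = confusion_matrix(y_test, pred)
f1_scr = f1_score(y_test, pred)
psn_scr = precision_score(y_test, pred)
rcl_scr = recall_score(y_test, pred)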

result['Neural Network'] =  {'Accuracy': accuracy, 'F1 Score': f1_scr, 'Precision Score': psn_scr, 'Recall Score': rcl_scr} 

result
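
Because the final `Dense` layer uses a linear activation and the loss is configured with `from_logits = True`, the network outputs raw logits rather than probabilities. The sketch below uses made-up logit values to show that a probability threshold of 0.5 corresponds to a logit threshold of 0, and that thresholding logits at 0.5 gives different labels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw outputs (logits) from a linear output layer.
logits = np.array([-2.3, -0.1, 0.0, 0.4, 3.0])

probabilities = sigmoid(logits)
labels_from_probs = probabilities > 0.5
labels_from_logits = logits > 0.0

# Thresholding the raw logits at 0.5 instead mislabels values between 0 and 0.5.
labels_wrong_threshold = logits > 0.5

print(labels_from_probs)
print(labels_from_logits)
print(labels_wrong_threshold)
```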

## Hyper-Parameter Tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV

parameters = {
    'penalty': ['l1', 'l2', 'elasticnet', None],
    'C': [0.01, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2],
    'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'],
    'fit_intercept': [True, False],
    'intercept_scaling': [0.01, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2]
}
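
Not every solver accepts every penalty, so a random search over this grid will sample some combinations that fail to fit. The sketch below lists which pairs are expected to be valid. The compatibility table is an assumption based on the scikit-learn documentation and should be checked against the installed version; `SUPPORTED_PENALTIES` and `is_valid_combination` are hypothetical names. The `'elasticnet'` penalty additionally requires an `l1_ratio` value.

```python
# Penalties each LogisticRegression solver is understood to accept
# (an assumption based on the scikit-learn documentation; verify for your version).
SUPPORTED_PENALTIES = {
    'lbfgs': {'l2', None},
    'liblinear': {'l1', 'l2'},
    'newton-cg': {'l2', None},
    'newton-cholesky': {'l2', None},
    'sag': {'l2', None},
    'saga': {'l1', 'l2', 'elasticnet', None},
}

def is_valid_combination(solver, penalty):
    """Return True if the solver name is known and lists the penalty."""
    return penalty in SUPPORTED_PENALTIES.get(solver, set())

valid_pairs = [
    (solver, penalty)
    for solver in SUPPORTED_PENALTIES
    for penalty in ('l1', 'l2', 'elasticnet', None)
    if is_valid_combination(solver, penalty)
]

for pair in valid_pairs:
    print(pair)
```

Solver names must match exactly: a string with stray whitespace, such as `'saga '`, is not a recognized solver.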



In [None]:
random_search = RandomizedSearchCV(
    estimator = clf_log,
    param_distributions= parameters,
    scoring = 'roc_auc',
    n_jobs=-1,
    cv = 10,
    verbose = 3
)

random_search.fit(X_train, y_train.to_numpy())

In [None]:
random_search.best_estimator_

In [None]:
random_search.best_score_

## Final Model

In [None]:
clf = LogisticRegression(C=1, intercept_scaling = .5, fit_intercept=True, solver='newton-cg', random_state = 0)

clf.fit(X_train, y_train)
clf.score(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cfn_mtx = confusion_matrix(y_test, y_pred)
f1_scr = f1_score(y_test, y_pred)
psn_scr = precision_score(y_test, y_pred)
rcl_scr = recall_score(y_test, y_pred)

result['Logistic Regression (hypertuned)'] =  {'Accuracy': accuracy, 'F1 Score': f1_scr, 'Precision Score': psn_scr, 'Recall Score': rcl_scr} 

result