# Breast Cancer Classification using Machine Learning Algorithms

<p>
    <b>Description:</b> Breast cancer is the most common cancer among women worldwide. It accounts for about 25% of all cancer cases in women, and affected over 2.1 million people in 2015 alone. It starts when cells in the breast begin to grow out of control. These cells usually form tumors that can be seen on an X-ray or felt as lumps in the breast area.
</p>
<p>
   <b>Problem: </b>The key challenge is to classify tumors as malignant (cancerous) or benign (non-cancerous).
</p>

### Importing Libraries and Dataset

In [None]:
# Data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# The breast cancer dataset bundled with scikit-learn
from sklearn.datasets import load_breast_cancer
# Splitting the data into training and test sets
from sklearn.model_selection import train_test_split
# Models and evaluation metrics
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
# Loading the dataset
dataset = load_breast_cancer()

In [None]:
# Creating a dataframe for the dataset
df = pd.DataFrame(dataset.data, columns = dataset['feature_names'])

In [None]:
# A sample from the dataset
df.sample(10)

In [None]:
# Adding the target column to the dataframe
df['target'] = dataset['target']

In [None]:
df.sample(5)

In [None]:
# Getting statistical description of the columns in the dataset
df.describe()

In [None]:
# Getting general information about the columns in the dataframe
df.info()

### Exploratory Data Analysis (EDA) and Data Pre-processing

#### NaN values

In [None]:
df.isna().sum()

#### Univariate, Bivariate, and Multivariate Analysis

#### Univariate Analysis

In [None]:
df.columns

In [None]:
# Column: mean radius
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(20,8))
sns.histplot(df['mean radius'], ax=ax1, bins = 15)
sns.boxplot(x = df['mean radius'], ax = ax2)
plt.show()

In [None]:
# Column: mean texture
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(20,8))
sns.histplot(df['mean texture'], ax=ax1, bins = 15)
sns.boxplot(x = df['mean texture'], ax = ax2)
plt.show()

In [None]:
df.columns

In [None]:
## Plotting the histogram and boxplot plots for all the numerical columns

# Getting the list of numerical/float columns except the target column 'target'
list_num_cols = list(df.columns[:-1])

# A for loop to create histogram and boxplot plots for each of the columns in 'list_num_cols'
for col in list_num_cols:
    print("Histogram plot and Box plot: ", col)
    fig, (ax1, ax2) = plt.subplots(1,2, figsize=(20,8))
    sns.histplot(df[col], ax=ax1, bins = 15)
    sns.boxplot(x = df[col], ax = ax2)
    plt.show()

#### Bivariate Analysis Between The Features and The Target Columns

In [None]:
df.target.value_counts()

<b>Note:</b> Class Distribution: 212 - Malignant, 357 - Benign
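The numeric codes can be mapped back to readable labels using the `target_names` array that ships with the dataset. The following is a minimal sketch; the variable names `label_names` and `counts` are only illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
target = pd.Series(dataset['target'], name='target')

# target_names is ordered by class code: index 0 -> 'malignant', index 1 -> 'benign'
label_names = dict(enumerate(dataset['target_names']))

# Count the samples per class using the readable labels
counts = target.map(label_names).value_counts()
print(counts)
```

This makes it explicit that class 0 is malignant and class 1 is benign, which matters later when reading the confusion matrix.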

In [None]:
sns.barplot(x = df.target.value_counts().index, y = df.target.value_counts().values)
plt.show()

In [None]:
df.columns

In [None]:
# Column: mean radius
sns.histplot(data = df, x = "mean radius", hue = "target")

In [None]:
# Column: mean texture
sns.histplot(data = df, x = "mean texture", hue = "target")

In [None]:
## Plotting the overlapping histograms between the numerical features and the target

# Getting the list of numerical/float columns except the target column 'target'
list_num_cols = list(df.columns[:-1])

# A for loop to create the overlapping histograms
for col in list_num_cols:
    print("Histogram plot: ", col)
    sns.histplot(data = df, x = col, hue = "target")
    plt.show()

#### Feature Engineering

In [None]:
# Checking the correlations between the columns (features and target)

corr = df.corr().abs()

# Mask the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(20,20))
sns.heatmap(corr, annot = True, mask = mask)

In [None]:
# Get only the correlations with the target column
plt.figure(figsize=(5,10))
sns.heatmap(df.corr().abs()[['target']].sort_values(by = 'target', ascending = False), annot = True)

In [None]:
df.corr().abs()['target'].sort_values()

<b>Note:</b> You can either select only the columns/features that are highly correlated with the target, or use all of them.
<br>
<b>Note 2:</b> Selecting the features that are highly correlated with the target often gives better results, but this is not guaranteed, so it is worth comparing both options on the test set.
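One possible way to do such a selection is to keep only the features whose absolute correlation with the target exceeds a cutoff. The sketch below assumes a cutoff of 0.5, which is an arbitrary illustrative value, and the names `threshold`, `target_corr`, and `selected_features` are hypothetical. Note that it computes the correlations on the full dataset for simplicity; computing them on the training set only would avoid leaking information from the test set.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
df = pd.DataFrame(dataset.data, columns=dataset['feature_names'])
df['target'] = dataset['target']

# Illustrative cutoff (an assumption, not a recommended value)
threshold = 0.5

# Absolute correlation of every feature with the target, excluding the target itself
target_corr = df.corr().abs()['target'].drop('target')

# Keep the features whose absolute correlation exceeds the cutoff
selected_features = target_corr[target_corr > threshold].index.tolist()
print(len(selected_features), "features selected out of", len(target_corr))

# Reduced feature matrix that could be passed to train_test_split
features_selected = df[selected_features]
```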

#### Model Development [Logistic Regression and Decision Trees]

In [None]:
# Defining the features and the target
features = df.drop(columns = ['target'])
target = df.target

In [None]:
# Splitting the data into training and test sets (random_state fixed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.3, random_state = 42)

<b>Note:</b> You can add a hyperparameter tuning/optimization step to find the best hyperparameters for each model, but I am not going to do it at the moment!
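For reference, a tuning step could look roughly like the sketch below, which uses scikit-learn's `GridSearchCV` on a decision tree. The parameter grid, the 5-fold setting, and the choice of `scoring="precision"` are assumptions made for illustration, not settings used in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Small illustrative grid (assumed values)
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

# 5-fold cross-validated search on the training set only
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="precision",
)
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated precision:", search.best_score_)
```

The search only sees the training data, so the test set stays untouched for the final evaluation. After fitting, `search.best_estimator_` holds the model refit with the best combination.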

In [None]:
## Logistic Regression
# Initializing the LogisticRegression model
LR_model = LogisticRegression(max_iter = 3000)
# Training the model, by fitting the data to it
LR_model.fit(X_train, y_train)

In [None]:
# Getting the LogisticRegression predictions on the test set (unseen data)
LR_predictions = LR_model.predict(X_test)

In [None]:
# Printing the classification report (actual values vs. predictions)
print(classification_report(y_test, LR_predictions))

In [None]:
# Showing the confusion matrix (fmt = 'd' displays the counts as integers instead of scientific notation)
sns.heatmap(confusion_matrix(y_test, LR_predictions), annot = True, fmt = 'd')

<p>
    <b>Remember:</b> Class Distribution: 212 - Malignant (Class 0 - Negative class), 357 - Benign (Class 1 - Positive class) 
</p>
<p>
    <b>Note:</b> Keep in mind that, in this problem, minimizing False Positive predictions (actual negative (malignant) but predicted positive (benign)) matters more than minimizing False Negative predictions (actual positive (benign) but predicted negative (malignant)), because a missed malignant tumor is the costlier error. That is why it is important to maximize the precision metric for the benign (positive) class.
</p>
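To make the point about False Positives concrete, the counts can be read directly from the confusion matrix and turned into the relevant metrics. The following is a self-contained sketch that retrains a logistic regression model with a fixed `random_state` (an assumption added for reproducibility); the names `precision_benign` and `recall_malignant` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=3000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print("False Positives (malignant predicted as benign):", fp)
print("False Negatives (benign predicted as malignant):", fn)

# Precision of the benign class: tp / (tp + fp)
precision_benign = precision_score(y_test, predictions, pos_label=1)
# Recall of the malignant class: tn / (tn + fp)
recall_malignant = recall_score(y_test, predictions, pos_label=0)
print("Precision (benign):", precision_benign)
print("Recall (malignant):", recall_malignant)
```

Both metrics drop as the number of False Positives grows, so either one can be used to track the error that matters most here.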

In [None]:
## Decision Trees
# Initializing the Decision Tree model
DT_model = DecisionTreeClassifier()
# Training the model, by fitting the data to it
DT_model.fit(X_train, y_train)

In [None]:
# Getting the Decision Tree predictions on the test set (unseen data)
DT_predictions = DT_model.predict(X_test)

In [None]:
# Printing the classification report (actual values vs. predictions)
print(classification_report(y_test, DT_predictions))

In [None]:
# Showing the confusion matrix (fmt = 'd' displays the counts as integers instead of scientific notation)
sns.heatmap(confusion_matrix(y_test, DT_predictions), annot = True, fmt = 'd')

<p>
    <b>Remember:</b> Class Distribution: 212 - Malignant (Class 0 - Negative class), 357 - Benign (Class 1 - Positive class) 
</p>
<p>
    <b>Note:</b> Keep in mind that, in this problem, minimizing False Positive predictions (actual negative (malignant) but predicted positive (benign)) matters more than minimizing False Negative predictions (actual positive (benign) but predicted negative (malignant)), because a missed malignant tumor is the costlier error. That is why it is important to maximize the precision metric for the benign (positive) class.
</p>
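Since precision is the metric of interest, one way to compare the two models more reliably than with a single train/test split is cross-validated precision. The sketch below assumes 5-fold cross-validation and a fixed `random_state` for the decision tree; both are illustrative choices, and `cv_precision` is a hypothetical name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(max_iter=3000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# scoring="precision" treats class 1 (benign) as the positive class
cv_precision = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="precision")
    cv_precision[name] = scores.mean()
    print(f"{name}: mean precision = {scores.mean():.3f} (+/- {scores.std():.3f})")
```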