# Breast Cancer - IA 2024

This project applies supervised machine learning models and algorithms
to the Breast Cancer dataset.

- 0. Business Understanding 
    - 0.1. Business Goals
    - 0.2. Data Mining Problem Definition
    - 0.3. Data Mining Goals
- 1. Data Understanding
    - 1.1. About the Dataset
    - 1.2. Initial Data
    - 1.3. Explore Data
    - 1.4. Data Quality
- 2. Data Preparation
    - 2.1. Removing the Outliers
        - 2.1.1. Z-Score
        - 2.1.2. Interquartile Range (IQR)
- 3. Modeling
    - 3.1. Unbalanced Data
    - 3.2. Building the Models
    - 3.3. Model Selection
        - 3.3.1. Decision Trees
        - 3.3.2. Neural Networks
        - 3.3.3. K-NN
- 4. Evaluation

## 0. Business Understanding

### 0.1. Business Goals

We aim to improve healthcare outcomes using the breast cancer dataset. Our goal is to develop predictive models that reduce misdiagnosis rates, minimize unnecessary procedures, and enhance treatment outcomes for patients with breast masses.

### 0.2. Data Mining Problem Definition

Our task is to create a model that accurately classifies breast masses as malignant or benign using features from digitized images. This model will help healthcare professionals make timely and accurate diagnoses, improving treatment planning and patient management.

### 0.3. Data Mining Goals

Our goals include building an accurate predictive model and identifying key features that contribute to classification. We aim to enhance model interpretability and integration into clinical practice for better patient care.

To assess our algorithms' performance in classifying breast masses, we'll utilize key metrics: accuracy for overall correctness, precision for accurate identification of malignant cases, recall for capturing all actual malignant cases, and the F1 score for a balanced evaluation. Additionally, the confusion matrix will provide a detailed breakdown of classifications, offering insights into our model's performance.

## 1. Data Understanding

The Breast Cancer dataset consists of features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe characteristics of cell nuclei present in the image.

### 1.1. About the Dataset

The dataset contains 569 records, of which 62.7% are benign (B) and 37.3% are malignant (M), and it has 32 attributes.

These attributes are:
- **ID**: Unique identifier for each sample.
- **Diagnosis**: The diagnosis of the breast mass (Malignant - M, Benign - B).
- **Radius Mean**: Mean of distances from the center to points on the perimeter.
- **Texture Mean**: Standard deviation of gray-scale values.
- **Perimeter Mean**: Mean size of the core tumor.
- **Area Mean**: Mean area of the core tumor.
- **Smoothness Mean**: Mean smoothness of the cell nuclei.
- **Compactness Mean**: Mean compactness of the cell nuclei.
- **Concavity Mean**: Mean concavity of the cell nuclei.
- **Concave Points Mean**: Mean number of concave portions of the contour.
- **Symmetry Mean**: Mean symmetry of the cell nuclei.
- **Fractal Dimension Mean**: Mean "coastline approximation" - 1.
- **Radius SE**: Standard error for the mean of distances from the center to points on the perimeter.
- **Texture SE**: Standard error for the standard deviation of gray-scale values.
- **Perimeter SE**: Standard error for the mean size of the core tumor.
- **Area SE**: Standard error for the mean area of the core tumor.
- **Smoothness SE**: Standard error for the mean smoothness of the cell nuclei.
- **Compactness SE**: Standard error for the mean compactness of the cell nuclei.
- **Concavity SE**: Standard error for the mean concavity of the cell nuclei.
- **Concave Points SE**: Standard error for the mean number of concave portions of the contour.
- **Symmetry SE**: Standard error for the mean symmetry of the cell nuclei.
- **Fractal Dimension SE**: Standard error for the mean "coastline approximation" - 1.
- **Radius Worst**: Largest radius measured from the center to the perimeter of the tumor.
- **Texture Worst**: Highest variation in gray-scale values in the tumor.
- **Perimeter Worst**: Largest perimeter measurement of the tumor.
- **Area Worst**: Largest area measurement of the tumor.
- **Smoothness Worst**: Highest level of surface irregularity measured on the tumor cells.
- **Compactness Worst**: Greatest density of the tumor cells (closeness of the cells).
- **Concavity Worst**: Largest concavity observed in the tumor cell contours.
- **Concave Points Worst**: Maximum number of concave points detected on the tumor contour.
- **Symmetry Worst**: Least symmetry observed in the tumor cells.
- **Fractal Dimension Worst**: Highest complexity observed in the tumor cell borders.

Target Variable:
- **Diagnosis**: Malignant (M) or Benign (B).

### 1.2. Initial Data

In [None]:
import pandas
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from scipy.stats import zscore
from sklearn.manifold import TSNE
from matplotlib.colors import ListedColormap
from matplotlib.lines import Line2D
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

from sklearn.metrics import confusion_matrix 
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score



df = pandas.read_csv("../data/data.csv")

df.drop(['id'], axis=1, inplace=True) # Drop the 'id' column as it is not needed for analysis

df.head()

### 1.3. Explore Data

In [None]:
diagnosis_counts = df['diagnosis'].value_counts(normalize=True) * 100  # Convert to percentages

plt.figure(figsize=(8, 6))
ax = diagnosis_counts.plot(kind='bar') # Create a bar plot for percentages
plt.title('Percentage of Benign vs Malignant Diagnoses')
plt.xlabel('Diagnosis')
plt.ylabel('Percentage (%)')
plt.xticks(rotation=0) 

# Show the percentages on the bars for better understanding
for p in ax.patches:
    ax.annotate(f"{p.get_height():.1f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

plt.show()

df.describe() # Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution

In [None]:
def box_plot():

    # Select numeric columns
    numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns.tolist()

    n_cols = 5  # Number of columns for subplots
    n_rows = (len(numeric_columns) + n_cols - 1) // n_cols  # Calculate the number of rows needed

    fig, axs = plt.subplots(n_rows, n_cols, figsize=(20, 4 * n_rows))  # Create subplots
    fig.subplots_adjust(hspace=0.4, wspace=0.3)  # Adjust the space between subplots

    custom_palette = {'B': 'green', 'M': 'red'}  # B for benign, M for malignant

    # Plot each boxplot on a subplot
    for index, column in enumerate(numeric_columns):
        row = index // n_cols
        col = index % n_cols

        # Adjust the whis parameter to extend the whiskers
        sb.boxplot(x='diagnosis', y=column, hue='diagnosis', data=df, ax=axs[row, col], palette=custom_palette, whis=2.5, legend=False)  
        axs[row, col].set_title(f'Box Plot of {column}')
        axs[row, col].set_xlabel('')
        axs[row, col].set_ylabel('')

    if len(numeric_columns) % n_cols != 0:  # Hide the last subplot if it is not used
        for ax in axs.flat[len(numeric_columns):]:
            ax.set_visible(False)

    plt.show()

box_plot()

def hist():
    df.hist(figsize=(20, 20), color='blue',
        edgecolor='black', alpha=0.4, bins=10, grid=False)

The box plot visualizes the distribution of data, highlighting the median, interquartile range (IQR), and overall variability within each class.

By extending the whiskers to 2.5 times the IQR, we capture a broader data range, reducing the number of points considered outliers. These whiskers indicate the minimum and maximum values within this adjusted range, while outliers — values that are unusually high or low relative to the rest of the data — are displayed as individual points outside the whiskers. These points will be addressed later in the analysis.

Additionally, the analysis suggests potential correlations among some values. Confirming these correlations could lead to the removal of redundant variables. We plan to explore and address these correlations in subsequent phases of our analysis.

This approach helps us better understand the dataset's spread and central tendencies, important for diagnosing and planning patient treatment. 

### 1.4. Data Quality

Poor data quality negatively affects many data processing efforts. As the box plots show, our dataset contains outliers and noise (extraneous objects or modifications to original values) that can disrupt the analysis. We will also check for missing values and duplicated rows. Addressing these issues is essential for ensuring the integrity and accuracy of our findings.

In [None]:
missing = df.isna().sum().sum()  # Count missing values (isnull() is an alias of isna(), so one check is enough)
duplicate = df.duplicated().sum()  # Count fully duplicated rows

number = missing + duplicate  # Total number of data-quality issues

print(f"Number of missing values + duplicates: {number}")

## 2. Data Preparation

While our dataset is free of missing or duplicated entries, it does contain outliers.

### 2.1. Removing the Outliers

To address these issues, we are implementing two statistical methods: the Z-Score and the Interquartile Range (IQR). These techniques will be compared to ensure the most effective identification and removal of outliers, enhancing the accuracy of our analyses.

Let's begin with a 3D scatter plot of a t-SNE projection, which we find more intuitive to interpret than box plots because it summarizes all attributes in a single view.

This will allow us to observe the raw data presentation without any alterations made to the dataset.

In [None]:
X = df.drop(['diagnosis'], axis=1)

diagnosis_map = {'M': -1, 'B': 1}
df['diagnosis_mapped'] = df['diagnosis'].map(diagnosis_map)

# Perform t-SNE
tsne = TSNE(n_components=3, perplexity=50, random_state=42).fit_transform(X)
tsne_df = pandas.DataFrame(tsne, columns=['comp1', 'comp2', 'comp3'])
tsne_df['diagnosis'] = df['diagnosis_mapped']

# Plotting
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(tsne_df['comp1'], tsne_df['comp2'], tsne_df['comp3'], c=tsne_df['diagnosis'],
                    cmap=ListedColormap(['red', 'green']))

# Labels and legend
ax.set_xlabel('comp1')
ax.set_ylabel('comp2')
ax.set_zlabel('comp3')
legend = plt.legend(*scatter.legend_elements(), loc='lower left')
legend.get_texts()[0].set_text('M')
legend.get_texts()[1].set_text('B')

plt.show()

In the plot above, certain data points, for example some of the red dots, lie far from the rest of their class.

Now, we'll revisit the plot, this time considering outliers and the methodologies employed to detect them.

#### 2.1.1. Z-Score

The Z-score quantifies how many standard deviations a data point is from the mean of the dataset. A Z-score of 0 signifies that the data point is at the mean. Positive or negative values indicate the number of standard deviations the data point is above or below the mean, respectively. This measurement is crucial for identifying outliers, which we define as observations that lie beyond 3 standard deviations from the mean.
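As a small worked illustration of the definition above, the sketch below computes Z-scores for a made-up list of values and flags those beyond 3 standard deviations. The helper names and the sample values are assumptions for illustration only. The population standard deviation is used, which matches the default (`ddof=0`) of `scipy.stats.zscore`.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the Z-score of each value using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def flag_outliers(values, threshold=3.0):
    """Return True for every value whose absolute Z-score exceeds the threshold."""
    return [abs(z) > threshold for z in z_scores(values)]

# Made-up measurements: nineteen typical values and one extreme value
sample = [10.0] * 10 + [12.0] * 9 + [40.0]
flags = flag_outliers(sample)
print(sum(flags), "value(s) flagged")
```

Because the mean and standard deviation are themselves pulled towards extreme values, a single very large point can partly mask itself or other outliers, which is one reason to compare this method with a second one.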

We'll start by applying the method and flagging the outliers in a new column named "outliers_zscore". Outliers are marked as True and inliers as False.

In [None]:
df.drop(['diagnosis_mapped'], axis=1, inplace=True)  # Drop the diagnosis_mapped column

# Drop the target variable for unsupervised learning
X = df.drop(['diagnosis'], axis=1)

X_scaled = X.apply(zscore)

# Flag a row as an outlier if any feature lies more than 3 standard deviations from its mean
outliers = (np.abs(X_scaled) > 3).any(axis=1)

df['outliers_zscore'] = outliers  # Boolean column: True for outliers, False for inliers

# check if the column was added
df.head(10)

Next, we'll compare the base scatter plot with the values flagged as outliers using the Z-Score method.

In [None]:

def zscore_scatter():
       colors = {
       ('B', False): 'green',  # Non-outlier Benign
       ('M', False): 'red',    # Non-outlier Malignant
       ('B', True): 'blue',    # Outlier Benign
       ('M', True): 'black'    # Outlier Malignant
       }

       # Apply the color map to each row in the dataframe
       outlier_colors = df.apply(lambda row: colors[(row['diagnosis'], row['outliers_zscore'])], axis=1)

       # Run t-SNE on the feature columns only
       tsne = TSNE(n_components=3, perplexity=50, random_state=42).fit_transform(df.drop(['diagnosis', 'outliers_zscore'], axis=1))
       tsne_df = pandas.DataFrame(tsne, columns=['comp1', 'comp2', 'comp3'])

       # Create the scatter plot
       fig = plt.figure(figsize=(8, 8))
       ax = fig.add_subplot(111, projection='3d')
       scatter = ax.scatter(tsne_df['comp1'], tsne_df['comp2'], tsne_df['comp3'], c=outlier_colors)

       # Custom legend
       legend_elements = [
       Line2D([0], [0], marker='o', color='w', label='Benign',
              markerfacecolor='green', markersize=10),
       Line2D([0], [0], marker='o', color='w', label='Malignant',
              markerfacecolor='red', markersize=10),
       Line2D([0], [0], marker='o', color='w', label='Outlier Benign',
              markerfacecolor='blue', markersize=10),
       Line2D([0], [0], marker='o', color='w', label='Outlier Malignant',
              markerfacecolor='black', markersize=10)
       ]
       ax.legend(handles=legend_elements, loc='lower left')

       # Axis labels
       ax.set_xlabel('comp1')
       ax.set_ylabel('comp2')
       ax.set_zlabel('comp3')

       plt.show()

zscore_scatter()

Upon comparison, some of the previously suspected dots have indeed been confirmed as outliers. However, there are discrepancies: certain dots we anticipated as outliers remain unflagged, while unexpectedly, others are now identified as outliers. 

To further validate these findings, we'll explore another outlier detection method.

#### 2.1.2. Interquartile Range (IQR)

The Interquartile Range (IQR) measures the statistical dispersion of the data by evaluating the range between the first quartile (25th percentile) and the third quartile (75th percentile). It identifies the middle 50% of the dataset. To detect outliers, we utilize the IQR in conjunction with a multiplier, typically set at 1.5 times the IQR. Observations falling below the first quartile minus the multiplier times the IQR or above the third quartile plus the multiplier times the IQR are deemed outliers.
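The fence rule described above can be sketched in plain Python as follows. The `quantile` helper (linear interpolation between sorted values), the function names, and the toy columns are assumptions for illustration. A row is flagged when any of its features falls outside that feature's fences, and the `multiplier` argument plays the same role as the whisker factor used in the box plots.

```python
def quantile(values, q):
    """Quantile with linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])

def iqr_fences(values, multiplier=1.5):
    """Return (lower fence, upper fence) for one feature."""
    q1, q3 = quantile(values, 0.25), quantile(values, 0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def flag_rows(columns, multiplier=1.5):
    """Flag a row when any of its feature values lies outside that feature's fences."""
    n_rows = len(next(iter(columns.values())))
    flags = [False] * n_rows
    for values in columns.values():
        low, high = iqr_fences(values, multiplier)
        for i, v in enumerate(values):
            if v < low or v > high:
                flags[i] = True
    return flags

# Made-up feature columns with ten rows each
toy = {
    "radius": [10, 11, 12, 12, 13, 13, 14, 15, 19, 30],
    "texture": [5, 20, 21, 21, 22, 22, 23, 23, 24, 24],
}
print(flag_rows(toy))        # conventional multiplier of 1.5
print(flag_rows(toy, 2.5))   # wider fences flag fewer rows
```

With many features, the "any feature" rule makes it increasingly likely that a row is flagged, so this method tends to mark a larger share of the rows than a single-feature view would suggest.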

As before, we'll start by applying the method and flagging the outliers in a new column named "outliers_iqr". Outliers are marked as True and inliers as False.


In [None]:
# Keep only the feature columns: drop the target and the Z-Score flag added earlier
X = df.drop(['diagnosis', 'outliers_zscore'], axis=1)

# Calculate the IQR
Q1 = X.quantile(0.25)
Q3 = X.quantile(0.75)
IQR = Q3 - Q1

# Flag a row as an outlier if any feature falls outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
outliers = ((X < (Q1 - 1.5 * IQR)) | (X > (Q3 + 1.5 * IQR))).any(axis=1)

# Add a Boolean column to indicate outliers based on the IQR method
df['outliers_iqr'] = outliers

# Check if the column was added
df.head(10)

Next, we'll compare the base scatter plot with the values flagged as outliers using the IQR method.

In [None]:

def irq_scatter():
       colors = {
       ('B', False): 'green',  # Non-outlier Benign
       ('M', False): 'red',    # Non-outlier Malignant
       ('B', True): 'blue',    # Outlier Benign
       ('M', True): 'black'    # Outlier Malignant
       }

       # Apply the color map to each row in the dataframe
       outlier_colors = df.apply(lambda row: colors[(row['diagnosis'], row['outliers_iqr'])], axis=1)

       # Run t-SNE on the feature columns only (exclude the target and both outlier flags)
       tsne = TSNE(n_components=3, perplexity=50, random_state=42).fit_transform(df.drop(['diagnosis', 'outliers_iqr', 'outliers_zscore'], axis=1))
       tsne_df = pandas.DataFrame(tsne, columns=['comp1', 'comp2', 'comp3'])

       # Create the scatter plot
       fig = plt.figure(figsize=(8, 8))
       ax = fig.add_subplot(111, projection='3d')
       scatter = ax.scatter(tsne_df['comp1'], tsne_df['comp2'], tsne_df['comp3'], c=outlier_colors)

       # Custom legend
       legend_elements = [
       Line2D([0], [0], marker='o', color='w', label='Benign',
              markerfacecolor='green', markersize=10),
       Line2D([0], [0], marker='o', color='w', label='Malignant',
              markerfacecolor='red', markersize=10),
       Line2D([0], [0], marker='o', color='w', label='Outlier Benign',
              markerfacecolor='blue', markersize=10),
       Line2D([0], [0], marker='o', color='w', label='Outlier Malignant',
              markerfacecolor='black', markersize=10)
       ]
       ax.legend(handles=legend_elements, loc='lower left')

       # Axis labels
       ax.set_xlabel('comp1')
       ax.set_ylabel('comp2')
       ax.set_zlabel('comp3')

       plt.show()

irq_scatter()

It's evident that the IQR method flagged significantly more points as outliers. However, to ensure the accuracy of our outlier identification, we need to determine which points are truly outliers. 

To achieve this, we'll investigate where both methods agree and what implications this agreement has for the dataset as a whole.

In [None]:
# Calculate the number of outliers detected by each method

outliers_iqr_count = df['outliers_iqr'].sum()
outliers_zscore_count = df['outliers_zscore'].sum()
agreed_outliers_count = ((df['outliers_iqr'] == True) & (df['outliers_zscore'] == True)).sum()

# Calculate the percentage of True values for each method
total_samples = len(df)
percentage_outliers_iqr = (outliers_iqr_count / total_samples) * 100
percentage_outliers_zscore = (outliers_zscore_count / total_samples) * 100
percentage_agreed_outliers = (agreed_outliers_count / total_samples) * 100

# Print the results
print("Number of outliers detected by IQR method:", outliers_iqr_count)
print("Number of outliers detected by Z-Score method:", outliers_zscore_count)
print("Percentage of outliers detected by IQR method:", percentage_outliers_iqr, "%")
print("Percentage of outliers detected by Z-Score method:", percentage_outliers_zscore, "%")
print("Number of outliers detected by both methods (agreed outliers):", agreed_outliers_count)
print("Percentage of outliers detected by both methods (agreed outliers):", percentage_agreed_outliers, "%")

df.head(10)


Considering the size of our dataset, removing 13% of it would significantly impact our study of the models. Therefore, instead of dropping the rows flagged by both methods, we replace their feature values with the corresponding column medians. This keeps the number of samples unchanged while limiting the influence of extreme values on our results.
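The replacement step can be illustrated on a toy table. This is a minimal sketch, assuming two hypothetical lists of detector flags and made-up columns; it overwrites every feature of a row flagged by both detectors with the median of the corresponding column.

```python
from statistics import median

def replace_flagged_rows_with_median(columns, flags_a, flags_b):
    """Overwrite rows flagged by both detectors with each column's median."""
    agreed = [a and b for a, b in zip(flags_a, flags_b)]
    cleaned = {}
    for name, values in columns.items():
        col_median = median(values)  # computed on the original column, outliers included
        cleaned[name] = [col_median if hit else v for v, hit in zip(values, agreed)]
    return cleaned, agreed

# Made-up columns and hypothetical detector outputs
toy = {"area": [400, 420, 450, 470, 2500], "texture": [18, 19, 20, 21, 22]}
zscore_flags = [False, False, False, False, True]
iqr_flags = [True, False, False, False, True]

cleaned, agreed = replace_flagged_rows_with_median(toy, zscore_flags, iqr_flags)
print(cleaned["area"])
print(cleaned["texture"])
```

Note that the whole row is overwritten, so a value that was not extreme on its own (the last `texture` entry here) is replaced as well. Replacing only the offending cells would be a gentler alternative.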

In [None]:
# Get the indices of agreed outliers
agreed_outliers_indices = df[(df['outliers_iqr'] == True) & (df['outliers_zscore'] == True)].index

# Replace outlier values with their respective median values
for column in X.columns:
    median_value = X[column].median()
    df.loc[agreed_outliers_indices, column] = median_value

df.drop(['outliers_iqr', 'outliers_zscore'], axis=1, inplace=True)  # Drop the outlier columns

# Check the first few rows of the modified dataset
df.head()

box_plot()


While the box plots show some improvement with fewer outliers, there are still some unusual values present. We'll need to study this deeper to ensure our approach is appropriate. 

It's also possible that these outliers represent rare but legitimate clinical cases.

For now, let's consider our data "free" of outliers and save it as the clean dataset.

In [None]:
df.to_csv('../data/data_clean.csv', index=False)

Let's proceed by analyzing and comparing the cleaned dataset with the original dataset through the use of histograms. This will allow us to visually assess differences and identify any significant changes or patterns resulting from the data cleaning process.

First let's check the old dataset:

In [None]:
# DATA BEFORE ANYTHING
Y = pandas.read_csv("../data/data.csv")

Y.drop(['id'], axis=1, inplace=True) # Drop the 'id' column as it is not needed for analysis

Y.hist(figsize=(20, 20), color='blue',
        edgecolor='black', alpha=0.4, bins=10, grid=False)

Now the new dataset:

In [None]:
# DATA AFTER CLEANING
X = pandas.read_csv("../data/data_clean.csv", index_col=None)

X.hist(figsize=(20, 20), color='blue',
        edgecolor='black', alpha=0.4, bins=10, grid=False)

After replacing the flagged outliers with median values, the cleaned dataset exhibits histograms with tighter distributions, reflecting reduced variability and skewness in key features.

The central tendency of the data is now more pronounced, suggesting that median values are more representative of the whole. 

Notably, the reduction in extreme values across features like radius_worst, texture_worst, and area_worst points to a more robust dataset, potentially leading to more reliable statistical analyses. 

Care has been taken to ensure valuable information is not lost in the process of outlier mitigation, maintaining the integrity of significant data points within the medical context of the study.

## Correlation Matrix

A correlation matrix is an essential tool in statistical analysis that measures the strength and direction of the linear relationship between pairs of variables. Each cell in the matrix displays the correlation coefficient between two variables. This coefficient ranges from -1 to 1, where:

- 1 indicates a perfect positive correlation, meaning that as one variable increases, the other variable increases proportionally.
- -1 indicates a perfect negative correlation, meaning that as one variable increases, the other decreases proportionally.
- 0 signifies no correlation, implying that the two variables do not have any linear relationship.

In the correlation matrix, variables are listed on both the rows and columns, with the main diagonal typically representing the correlation of each variable with itself, always equal to 1. Visualizing this matrix through a heatmap can quickly help identify areas where variables are strongly correlated, which can be crucial for tasks such as feature selection and predictive modeling. Filtering this matrix to display only strong correlations (for example, greater than 0.5 or less than -0.5) allows researchers to focus on more significant and potentially informative relationships.
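To make the coefficient and the filtering step concrete, here is a small sketch that computes the Pearson correlation by hand and keeps only the pairs whose absolute value exceeds a threshold. The function names and the toy columns are illustrative assumptions, not data from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def strong_pairs(columns, threshold=0.5):
    """Return {(name_a, name_b): r} for pairs with |r| above the threshold."""
    names = list(columns)
    pairs = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) > threshold:
                pairs[(first, second)] = round(r, 3)
    return pairs

# Made-up columns: perimeter is roughly 2 * pi * radius, noise is unrelated
toy = {
    "radius": [1.0, 2.0, 3.0, 4.0, 5.0],
    "perimeter": [6.3, 12.6, 18.8, 25.1, 31.4],
    "noise": [1.0, -2.0, 2.0, -2.0, 1.0],
}
pairs = strong_pairs(toy)
print(pairs)
```

Only the radius and perimeter pair survives the filter here, which mirrors the kind of size-related redundancy one would look for in the heatmap.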

In [None]:
# calculate the correlation between each pair of columns
df['diagnosis'] = df['diagnosis'].map({'B': 0, 'M': 1})
corr_matrix = df.corr()

# Keep only the cells with a correlation greater than 0.5 or less than -0.5 (weaker values become NaN and are left blank in the heatmap)
corr_matrix = corr_matrix[(corr_matrix > 0.5) | (corr_matrix < -0.5)]

mask = np.zeros_like(corr_matrix)
mask[np.triu_indices_from(mask)] = True
fig, ax = plt.subplots(figsize=(20, 15))
sb.heatmap(corr_matrix, annot=True, cmap='coolwarm', ax=ax, mask=mask)
plt.show()

df['diagnosis'] = df['diagnosis'].map({0: 'B', 1: 'M'})  # Swapping Back

Several features show a high degree of positive correlation, particularly those related to size (like radius, perimeter, and area), suggesting these may be interdependent. Conversely, the fractal dimension shows weaker correlations with other features, which could imply it provides distinct information.

Now, let's check in more detail how these features correlate with the diagnosis.

In [None]:
# Compute the correlation matrix and extract the correlations with 'diagnosis'
df['diagnosis'] = df['diagnosis'].map({'B': 0, 'M': 1})
corr_matrix = df.corr()

diagnosis_corr = corr_matrix['diagnosis'].sort_values(ascending=False)

plt.figure(figsize=(25, 10))
colors = ['green' if x >= 0 else 'red' for x in diagnosis_corr.values]
bars = diagnosis_corr.plot(kind='bar', color=colors)

plt.title('Correlation with Diagnosis')
plt.xlabel('Variables')
plt.ylabel('Correlation Coefficient')
plt.axhline(0, color='black', linewidth=0.8)
for bar in bars.patches:
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), f'{bar.get_height():.2f}', 
             ha='center', va='bottom', color='black')

plt.show()

df['diagnosis'] = df['diagnosis'].map({0: 'B', 1: 'M'})

## Backward elimination

Using backward elimination helps identify the most impactful attributes, allowing for the discarding of less critical ones. The importance of this lies in diminishing the model's complexity and cutting down training time.

It eliminates features that contribute little to the prediction, clearing away redundancy and irrelevant data that don't correlate strongly with the outcome.

Moreover, feature reduction affects the scope of data required, rendering the model more applicable in practical scenarios. For instance, it simplifies the data doctors need to collect from patients, streamlining the diagnostic process for efficiency and effectiveness.
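The elimination loop itself is independent of the model used to obtain p-values. The sketch below shows only that control flow, assuming a caller-supplied function `p_values_for` (hypothetical) that refits a model on the remaining features and returns a p-value per feature. The fixed p-values are made up so the example runs without any modeling library.

```python
def backward_eliminate(features, p_values_for, threshold=0.05):
    """Drop the feature with the largest p-value until all are at or below the threshold.

    `p_values_for` is a caller-supplied function (hypothetical here) that refits
    the model on the given features and returns {feature: p_value}.
    """
    remaining = list(features)
    while remaining:
        p_values = p_values_for(remaining)
        worst = max(remaining, key=lambda name: p_values[name])
        if p_values[worst] <= threshold:
            break
        remaining.remove(worst)
    return remaining

# Stand-in for a real model fit: fixed, made-up p-values that do not change
# between refits. In practice each refit produces new p-values.
FAKE_P_VALUES = {
    "radius_mean": 0.001,
    "texture_mean": 0.03,
    "symmetry_se": 0.40,
    "fractal_dimension_se": 0.72,
}

def fake_fit(names):
    return {name: FAKE_P_VALUES[name] for name in names}

selected = backward_eliminate(FAKE_P_VALUES, fake_fit)
print(selected)
```

Removing one feature at a time matters because p-values shift after every refit: a feature that looks insignificant next to a correlated partner may become significant once that partner is gone.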

In [None]:
df = pandas.read_csv("../data/data_clean.csv", index_col=None)
df['diagnosis'] = df['diagnosis'].map({'B': 0, 'M': 1})

# Separating the independent variables (X) and the dependent variable (y)
X = df.drop(['diagnosis'], axis=1)
y = df['diagnosis']

X = sm.add_constant(X)
p_value_threshold = 0.05

# Refit the model and drop the least significant feature until every remaining
# p-value is at or below the threshold. The intercept ('const') is never a
# candidate for removal, so it is always present when it is dropped below.
while len(X.columns) > 1:
    model = sm.Logit(y, X).fit(disp=0)
    p_values = model.pvalues.drop('const')
    max_p_value = p_values.max()
    feature_with_max_p_value = p_values.idxmax()
    if max_p_value > p_value_threshold:
        X.drop(feature_with_max_p_value, axis=1, inplace=True)
    else:
        break

significant_features = X.columns.drop('const')
significant_features_list = significant_features.tolist()

print('Significant features after backward elimination:')
print(significant_features_list)

The backward elimination process has identified 14 significant features that are potentially predictive of the diagnosis, which is coded as 0 for benign and 1 for malignant cases.

In [None]:
# keep only the significant features
df = pandas.read_csv("../data/data_clean.csv", index_col=None)
df = df[['diagnosis'] + significant_features_list]

df.head()

## 3. Modeling

With the dataset analyzed and cleaned, we now transition to the core of our project: modeling. In this phase, we aim to build predictive models that will allow us to test hypotheses and make predictions based on our data. 

This section outlines our approach to selecting, training, and evaluating different machine learning models. We will explore three distinct models to determine which best meets our project objectives.

### 3.1. Unbalanced Data

Since the dataset is quite unbalanced (357 benign and 212 malignant), and because models trained on unbalanced data tend to be biased towards the majority class, often at the expense of prediction accuracy for the minority class, we will explore three different resampling methods:

- **Oversampling the Minority Class**: This method involves replicating instances of the minority class to balance the class distribution. 
- **Undersampling the Majority Class**: This technique reduces the number of instances from the majority class to match the minority class size.
- **Synthetic Sample Generation (SMOTE)**: The Synthetic Minority Over-sampling Technique (SMOTE) generates synthetic samples from the minority class.
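The three strategies listed above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the points are made up, the function names are hypothetical, and the SMOTE-like step interpolates towards a single given neighbour, whereas the real algorithm picks that neighbour among the k nearest minority samples.

```python
import random

def random_oversample(minority, target_size, rng):
    """Duplicate randomly chosen minority samples until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def random_undersample(majority, target_size, rng):
    """Keep a random subset of the majority class."""
    return rng.sample(majority, target_size)

def smote_like_sample(point, neighbour, rng):
    """Create one synthetic point on the segment between a sample and a neighbour."""
    gap = rng.random()  # value in [0, 1)
    return [p + gap * (q - p) for p, q in zip(point, neighbour)]

rng = random.Random(42)
majority = [[float(i), float(i) + 1.0] for i in range(6)]  # made-up majority points
minority = [[10.0, 10.0], [12.0, 14.0]]                     # made-up minority points

oversampled = random_oversample(minority, len(majority), rng)
undersampled = random_undersample(majority, len(minority), rng)
synthetic = smote_like_sample(minority[0], minority[1], rng)

print(len(oversampled), len(undersampled), synthetic)
```

Oversampling only repeats existing rows and undersampling discards rows, while the interpolation step produces a new point that lies between two real minority samples.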

In [None]:
data = pandas.read_csv('../data/data_clean.csv')
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

print("Original class distribution:", Counter(y))

ros = RandomOverSampler(random_state=42)
rus = RandomUnderSampler(random_state=42)
smote = SMOTE(random_state=42)

X_ros, y_ros = ros.fit_resample(X, y)
X_rus, y_rus = rus.fit_resample(X, y)
X_smote, y_smote = smote.fit_resample(X, y)

print("Oversampled class distribution:", Counter(y_ros))
print("Undersampled class distribution:", Counter(y_rus))
print("SMOTE class distribution:", Counter(y_smote))


Oversampling is good for small datasets because it keeps all the information, but it can cause overfitting. Overfitting happens when a model learns the training data so well, including its errors and unimportant details, that it performs poorly on new data. On the other hand, undersampling works well for very large datasets because it makes the model run faster, but it means losing some data.

Given that our dataset isn't very large and we want to keep as much information as possible without losing diversity, SMOTE (Synthetic Minority Over-sampling Technique) is the best choice. SMOTE creates new examples from the minority class, helping avoid overfitting and making our model predict better on new data.

So, we'll use SMOTE to prepare our data for building the model.

In [26]:
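def applySampling(all_inputs, all_labels, test_size):
    # Split first, then resample only the training portion, so that no synthetic
    # sample is interpolated from points that end up in the test set
    X_train, X_test, y_train, y_test = train_test_split(
        all_inputs, all_labels, test_size=test_size, random_state=0)
    X_train, y_train = SMOTE(random_state=0).fit_resample(X_train, y_train)
    return X_train, X_test, y_train, y_test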

### 3.2. Building the Models

In this phase, we will evaluate various classification models to identify the most effective option based on their performance. We will utilize several key metrics to assess each model's accuracy and reliability:

- **Accuracy**: This metric helps us understand the overall effectiveness of a model by measuring the proportion of true results (both true positives and true negatives) among the total number of cases examined.
- **Precision**: Important for contexts where the cost of a false positive is high, precision measures the accuracy of positive predictions.
- **Recall**: Also known as sensitivity, recall assesses the model's ability to identify all relevant instances, crucial in scenarios where missing a positive instance (false negative) carries significant consequences.
- **F1 Score**: Harmonic mean of precision and recall, providing a balance between the two in uneven class distributions where one metric alone might be misleading.

Additionally, we will examine the confusion matrix for each model, which visualizes true positives, true negatives, false positives, and false negatives, giving us further insight into the performance relative to each metric.
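As a worked example of how the four metrics follow from the confusion matrix, the sketch below derives them from raw counts, treating malignant as the positive class. The counts are made up and the function name is an assumption for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts: 40 true positives, 5 false positives, 10 false negatives, 45 true negatives
metrics_example = classification_metrics(tp=40, fp=5, fn=10, tn=45)
for name, value in metrics_example.items():
    print(f"{name}: {value:.3f}")
```

In this example accuracy is 0.85 while recall is only 0.80, meaning one in five malignant cases would be missed. That gap is why recall deserves particular attention in a diagnostic setting.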

To ensure our results are robust and not skewed by any particular sample distribution, we will employ cross-validation techniques. This method enhances the validity of our model evaluation by using different subsets of the data for training and testing. Moreover, by tuning the parameters of each algorithm and selecting the best-performing settings, we aim to optimize the models' effectiveness. For the sake of clarity and efficiency in our presentation, we will focus on the outcomes derived from the optimal parameter settings. This approach ensures a streamlined and informative evaluation, allowing us to clearly demonstrate the most efficient solutions.
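The idea behind stratified cross-validation can be sketched as follows. This is a simplified round-robin assignment written for illustration, not scikit-learn's exact algorithm, and the labels and function names are made up. Each fold keeps roughly the class ratio of the full data and serves once as the test set.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold so every fold keeps roughly the class ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

def train_test_indices(folds):
    """Yield (train, test) index lists, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

labels = ["B"] * 6 + ["M"] * 4   # made-up labels
folds = stratified_folds(labels, n_splits=2)
for train, test in train_test_indices(folds):
    print("train:", train, "test:", test)
```

Averaging a metric over all folds gives a less noisy estimate than a single train/test split, which is useful when comparing parameter settings.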

### 3.3. Model Selection

We will evaluate the following three models:

#### 3.3.1. Decision Trees

- A straightforward method used for both classification and regression. Decision trees split the data into branches to form a tree structure, which makes them easy to interpret. However, they can overfit easily, especially with complex datasets.
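To illustrate how a single split is chosen, the sketch below scores candidate thresholds on one feature with the Gini impurity, one common splitting criterion. The values, labels, and function names are assumptions made up for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, 0.5 for a balanced two-class group."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between sorted distinct values and return the split
    with the lowest weighted Gini impurity as (threshold, impurity)."""
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

radius = [11.0, 12.0, 13.0, 17.0, 18.0, 20.0]   # made-up feature values
diagnosis = ["B", "B", "B", "M", "M", "M"]       # made-up labels
threshold, impurity = best_threshold(radius, diagnosis)
print(threshold, impurity)
```

A full tree repeats this search on every feature and then recursively on each resulting group. Without a depth or leaf-size limit it keeps splitting until the training data is fit perfectly, which is where the overfitting risk comes from.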

#### 3.3.2. Neural Networks

- Complex structures that simulate how human brains operate, ideal for capturing nonlinear relationships in large datasets. They are versatile and powerful but require substantial data and computational resources to train effectively. 

In [None]:
from sklearn.preprocessing import StandardScaler

def standardize(inputs):
    return StandardScaler().fit_transform(inputs)

In [None]:
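import warnings

import matplotlib.pyplot as plt
import pandas
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier

# Load the cleaned dataset
df = pandas.read_csv("../data/data_clean.csv")

# Separate the features from the target column
all_inputs = df.drop(['diagnosis'], axis=1).values
all_labels = df['diagnosis'].values

# Silence MLP convergence warnings raised during the grid search
warnings.filterwarnings("ignore", category=ConvergenceWarning)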


# Empty grid: only MLPClassifier's default hyperparameters are evaluated
parameter_grid = {}

clf = MLPClassifier()

(training_inputs, testing_inputs, training_classes, testing_classes) = applySampling(
        all_inputs, all_labels, 0.25)

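# Fit the scaler on the training set only and reuse it for the test set,
# so that no test-set statistics leak into the preprocessing
scaler = StandardScaler().fit(training_inputs)
training_inputs = scaler.transform(training_inputs)
testing_inputs = scaler.transform(testing_inputs)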

cross_validation = StratifiedKFold(n_splits=10)
grid_search = GridSearchCV(clf, param_grid=parameter_grid, cv=cross_validation)
grid_search.fit(training_inputs, training_classes)
model = grid_search.best_estimator_


# Model prediction and evaluation
testing_predictions = model.predict(testing_inputs)
accuracy = accuracy_score(testing_classes, testing_predictions)
precision = precision_score(testing_classes, testing_predictions, average='weighted')
recall = recall_score(testing_classes, testing_predictions, average='weighted')
f1 = f1_score(testing_classes, testing_predictions, average='weighted')

print("Accuracy: ", round(accuracy*100, 3), "%")
print("Precision: ", round(precision*100, 3), "%")
print("Recall: ", round(recall*100, 3), "%")
print("F1: ", round(f1*100, 3), "%")

# Generate and display confusion matrix
fig, ax = plt.subplots(figsize=(8, 8))  # Adjust the figure size as needed
ConfusionMatrixDisplay.from_estimator(model, testing_inputs, testing_classes, display_labels=['Benign', 'Malignant'], cmap=plt.cm.Blues, ax=ax)
ax.set_title('Confusion Matrix of the Classifier')
plt.show()

#### 3.3.3. K-Nearest Neighbours

- An instance-based learning algorithm that classifies a new case by majority vote among its k nearest neighbors under a chosen distance metric. It is simple and effective but sensitive to feature scaling, to noise in the local neighborhood, and to the choice of k.
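
The majority-vote idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the `KNeighborsClassifier` implementation: the two-feature points and the "B"/"M" labels are hypothetical, and the distance is plain Euclidean.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
labels = ["B", "B", "B", "M", "M", "M"]

print(knn_predict(points, labels, query=(1.1, 1.0), k=3))
print(knn_predict(points, labels, query=(2.9, 3.1), k=3))
```

Because the vote depends on raw distances, a feature measured on a larger scale dominates the result, which is why the inputs are standardized before fitting.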

In [27]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

# Load the cleaned dataset
df = pd.read_csv("../data/data_clean.csv")

# Prepare data
all_inputs = df.drop(['diagnosis'], axis=1).values
all_labels = df['diagnosis'].values

##### Finding the Best Parameters

In [29]:
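# applySampling and StandardScaler are assumed to be defined/imported in earlier cells
(training_inputs, testing_inputs, training_classes, testing_classes) = applySampling(all_inputs, all_labels, 0.25)
# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler().fit(training_inputs)
training_inputs = scaler.transform(training_inputs)
testing_inputs = scaler.transform(testing_inputs)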

# Classifier setup and parameter grid
knn = KNeighborsClassifier()

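# 'algorithm' and 'leaf_size' only change how neighbors are searched, not which
# neighbors are found, and 'p' only applies to the 'minkowski' metric, where
# p=1 and p=2 duplicate 'manhattan' and 'euclidean'. They are left out so the
# search stays tractable (60 combinations instead of 3200).
parameter_grid = {
        'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'weights': ['uniform', 'distance'],
        'metric': ['euclidean', 'manhattan', 'chebyshev']
}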

# Cross-validation setup
cross_validation = StratifiedKFold(n_splits=10)
grid_search = GridSearchCV(knn, param_grid=parameter_grid, cv=cross_validation)
grid_search.fit(training_inputs, training_classes)

print("Best parameters: ", grid_search.best_params_)



In [None]:
model = grid_search.best_estimator_

# Model prediction and evaluation
testing_predictions = model.predict(testing_inputs)
accuracy = accuracy_score(testing_classes, testing_predictions)
precision = precision_score(testing_classes, testing_predictions, average='weighted')
recall = recall_score(testing_classes, testing_predictions, average='weighted')
f1 = f1_score(testing_classes, testing_predictions, average='weighted')

print("Accuracy: ", round(accuracy*100, 3), "%")
print("Precision: ", round(precision*100, 3), "%")
print("Recall: ", round(recall*100, 3), "%")
print("F1: ", round(f1*100, 3), "%")

# Generate and display confusion matrix
fig, ax = plt.subplots(figsize=(8, 8))  # Adjust the figure size as needed
ConfusionMatrixDisplay.from_estimator(model, testing_inputs, testing_classes, display_labels=['Benign', 'Malignant'], cmap=plt.cm.Blues, ax=ax)
ax.set_title('Confusion Matrix of the Classifier')
plt.show()


## 4. Evaluation