# Lab 3: Breast Cancer Prediction Using Decision Tree

# Table of Contents
### Sections of Lab 3:
1. [Import Libraries](#section1)
2. [Load the Dataset](#section2)
3. [Preliminary Data Analysis](#section3)
    - [Initial Data Overview](#section3.1)
    - [Summary Statistics](#section3.2)
4. [Exploring the Data](#section4)
    - [Descriptive Interpretation of Data in Section 4](#section4.1)
    - [Plotting variation](#section4.2)
    - [Additional Recommendations based on research to optimise the descriptive interpretation](#section4.3)
5. [Preparing the Data](#section5)
    - [Imbalanced Dataset](#section5.1)
6. [Feature Selection and Analysis](#section6)
    - [Interpretation of Feature Selection and Analysis](#section6.1)
7. [Implementing Decision Tree algorithm](#section7)
    - [Interpretation of Decision Tree Algorithm](#section7.1)
8. [Optimization and Hyper-parameter Tuning](#section8)

## <a id='section1'></a>
## Section 1: Import Libraries


In [24]:
# Breast Cancer Prediction Lab 3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix


<a id='section2'></a>
## Section 2: Loading the Dataset

* The dataset is loaded into a pandas DataFrame

* This allows for a preliminary view of the dataset’s structure, including the names of the columns and the initial few rows of data.

In [None]:
# Please enter full directory to load the .csv included in the zip file

try:
    file_path = "breast-cancer.csv"
    data = pd.read_csv(file_path)
except FileNotFoundError:
    print(f"The file at {file_path} does not exist. Please check the file path and try again.")
    data = None
except pd.errors.EmptyDataError:
    print(f"The file at {file_path} is empty. Please provide a valid data file.")
    data = None
except Exception as e:
    print(f"An unexpected error occurred while loading the file: {e}")
    data = None

if data is not None:
    # Displaying the first five rows
    print(data.head())
    print("\n")

<a id='section3'></a>
## Section 3: Preliminary Data Analysis

* This section aims to conduct a preliminary analysis of the dataset to ensure its readiness for further data exploration and modeling. 

* It involves checking the structure of the dataset, identifying and removing any duplicate rows, and ensuring that all necessary columns are present in the dataset.

* This section ensures the quality of the dataset by removing duplicates and verifying the presence of all required columns, setting a solid foundation for the subsequent analyses.

In [None]:
# Columns actually present in the loaded DataFrame
print(data.columns, "\n")

# All columns the lab expects to find in the csv file
required_columns = [
    'id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'
    ]

# Duplicate rows in the dataset
duplicate_rows = data.duplicated()
print(f"Number of duplicate rows: {duplicate_rows.sum()}")

# Removing any duplicate rows
if duplicate_rows.any():
    data = data.drop_duplicates()
    print("Duplicate rows have been dropped.")
else:
    print("No duplicate rows found.")


# Checking that all columns are accounted for 
missing_columns = [col for col in required_columns if col not in data.columns]

# Conditional test to ensure all columns are included for model testing
if not missing_columns:
    print("All required columns are present!")
    print("\n")
else:
    raise ValueError(f"Missing columns: {', '.join(missing_columns)}")


#### Interpretation

* The ValueError exception is raised if there are any missing columns in the dataset.

* A detailed error message that specifies which columns are missing is generated using a formatted string that joins the names of the missing columns with a comma. This approach ensures that the script will halt execution and alert the user to the specific issue, thereby preventing silent failures and facilitating debugging.

<a id='section3.1'></a>
### Section 3.1: Initial Data Overview

* Summary of the dataset - This gives an overview of the size of your dataset.

* Statistical summary for the numerical columns - Provides a statistical summary of all the numerical columns. This will give insights such as mean, standard deviation, minimum, and maximum values of each column, thus aiding in understanding the distribution and central tendency of your data.

* Data Integrity - Ensuring the data integrity by checking for both duplicate rows and missing values, which are essential steps in data preprocessing.

In [None]:
# Summary statistics for the numerical columns
print("Summary statistics for the numerical columns : ")
display(data.describe())

# Dimensions of the DataFrame
print("The dataframe has", data.shape[0], "rows and", data.shape[1], "columns.\n")

# Check for duplicate rows
print("Number of duplicate rows : ", data.duplicated().sum())
print("\n")

# Check for missing values in each column
print("Number of missing values for each feature column : \n", data.isna().sum())
print("\n")


<a id='section4'></a>
## Section 4: Exploring the Data

* Descriptive statistics and visualizations of individual variables: a histogram with a Kernel Density Estimate (KDE) overlay shows how the values in each numerical column are distributed.

* Boxplots of individual variables: each numerical column is also drawn as a boxplot, which makes potential outliers easy to spot and shows the median and spread of the values. The variation of each variable by diagnosis is plotted separately in Section 4.2.


In [None]:
# Plot distributions of numerical columns
print("Distributions of numerical columns : \n")
for column in data.select_dtypes(include=['number']).columns:
    plt.figure()
    sns.histplot(data[column], kde=True)
    plt.title(f'Distribution of {column}')
    plt.show()

# Boxplot for numerical columns to identify potential outliers
print("\nBoxplots for numerical columns:")
for column in data.select_dtypes(include=['number']).columns:
    plt.figure()
    sns.boxplot(x=data[column])
    plt.title(f'Boxplot of {column}')
    plt.show()


<a id='section4.1'></a>
## Section 4.1: Descriptive Interpretation of Data in Section 4

### Plot distributions of numerical columns

* This part of the script plots the distribution of values for each numerical column in the dataset, giving insight into the range, central tendency, and dispersion of each feature.

### Boxplot for numerical columns to identify potential outliers

* Like the distribution plots, this loop iterates over all numerical columns.

* Each boxplot shows the median, the interquartile range, and any points that fall outside the whiskers, which makes it a quick way to spot potential outliers and other anomalies in a column.

### Overall Interpretation

* Together, the histograms and boxplots support exploratory data analysis by showing how the numerical variables in the dataset are distributed.

<a id='section4.2'></a>
## Section 4.2: Plotting variation of variables vs diagnosis 

### Interpretation:
* In the plots below, the clearest differences between the malignant and benign groups appear in radius, perimeter, and area.

* This is expected, since these features all describe the size of the cell nuclei, and malignant samples tend to have larger values than benign ones.

* Compactness and concavity also separate the two groups for some samples, although less sharply.

In [None]:
columns = [
    'radius_mean', 'texture_mean', 'perimeter_mean', 
    'area_mean', 'smoothness_mean', 'compactness_mean', 
    'concavity_mean', 'concave points_mean', 'symmetry_mean', 
    'fractal_dimension_mean', 'radius_se', 'texture_se', 
    'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 
    'concavity_se', 'concave points_se', 'symmetry_se', 
    'fractal_dimension_se', 'radius_worst', 'texture_worst', 
    'perimeter_worst', 'area_worst', 'smoothness_worst', 
    'compactness_worst', 'concavity_worst', 'concave points_worst', 
    'symmetry_worst', 'fractal_dimension_worst'
]

for column in columns:
    plt.figure(figsize=(8, 6))
    sns.histplot(data=data, x=column, hue='diagnosis', kde=True, element="step", stat="density", common_norm=False)
    plt.title(f'Distribution of {column} by Diagnosis')
    plt.xlabel(f'{column}')
    plt.ylabel('Density')
    plt.grid(axis='y', linestyle='--', alpha=0.7, linewidth=0.7)
    plt.show()


<a id='section4.3'></a>
### Section 4.3: Additional Recommendations based on research to optimise the descriptive interpretation


* To deepen the analysis, targeted statistical tests such as t-tests could be run on each feature to check whether the differences observed between the malignant and benign groups are statistically significant.

#### The code would follow these steps:

1. Import the test function: `from scipy.stats import ttest_ind`.

2. Group the data by diagnosis. Splitting the data by diagnosis category gives separate access to each group for the t-test and for computing the mean and standard deviation.

3. Run a t-test for the means of two independent samples (the malignant and benign groups), and print the mean and standard deviation of each group for additional insight.
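
The three steps above can be sketched as follows. This is a minimal illustration, not part of the original lab: the helper name `compare_groups` and the synthetic `demo` DataFrame are hypothetical, and the choice of `equal_var=False` (Welch's variant, which does not assume equal variances in the two groups) is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def compare_groups(df, feature, label_col="diagnosis", pos="M", neg="B"):
    """Compare one feature between the malignant and benign groups."""
    # Step 2: split the feature values by diagnosis
    malignant = df.loc[df[label_col] == pos, feature]
    benign = df.loc[df[label_col] == neg, feature]
    # Step 3: two-sample t-test (equal_var=False selects Welch's t-test)
    t_stat, p_value = ttest_ind(malignant, benign, equal_var=False)
    return {
        "feature": feature,
        "mean_M": malignant.mean(),
        "std_M": malignant.std(),
        "mean_B": benign.mean(),
        "std_B": benign.std(),
        "t_stat": t_stat,
        "p_value": p_value,
    }


# Synthetic stand-in for the lab's DataFrame (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "diagnosis": ["M"] * 50 + ["B"] * 50,
    "radius_mean": np.concatenate([rng.normal(17, 3, 50), rng.normal(12, 2, 50)]),
})

result = compare_groups(demo, "radius_mean")
print(result)
```

With the lab's data, the same helper could be called in a loop over the feature columns, for example `compare_groups(data, 'radius_mean')`. Because many features are tested, the p-values would ideally be corrected for multiple comparisons before drawing conclusions.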


<a id='section5'></a>
## Section 5: Preparing the Data

In [None]:
# Create a copy of the original dataset
prepared_data = data.copy()

# Drop the 'id' column as it is not a feature
prepared_data.drop('id', axis=1, inplace=True)

# Temporarily encode the 'diagnosis' column as numbers (M=1, B=0) for correlation analysis
diagnosis_mapping = {'M': 1, 'B': 0}
prepared_data['diagnosis_temp'] = prepared_data['diagnosis'].map(diagnosis_mapping)

# Create a correlation matrix using the temporary numerical 'diagnosis' column
plt.figure(figsize=[15, 13])
corr_matrix = prepared_data.drop('diagnosis', axis=1).corr()
sns.heatmap(corr_matrix, cmap='coolwarm', annot=True, fmt=".2f", linewidths=1, linecolor='black', cbar=True, cbar_kws={"shrink": .8})
plt.title('Multicollinearity Heatmap', fontsize=16)
plt.show()

# Drop the temporary numerical 'diagnosis' column, keeping the original 'diagnosis' column with 'M' and 'B' labels
prepared_data.drop('diagnosis_temp', axis=1, inplace=True)

# Checking the balance of the classifications in the 'diagnosis' column
diagnosis_count = prepared_data['diagnosis'].value_counts()
print(f"Count of each diagnosis category:\nMalignant (M): {diagnosis_count['M']}\nBenign (B): {diagnosis_count['B']}")

# Visual representation of the balance in the 'diagnosis' column
# Tick labels are derived from the index so they always match the order of the bars
label_names = {'B': 'Benign (B)', 'M': 'Malignant (M)'}
plt.figure(figsize=(6, 4))
sns.barplot(x=[label_names[code] for code in diagnosis_count.index], y=diagnosis_count.values, palette='viridis')
plt.title('Diagnosis Categories Count')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.grid(axis='y', linestyle='--', alpha=0.7, linewidth=0.7)
plt.show()


## Interpretation

* A copy of the original dataset (`data`) is created so that the preparation steps do not modify the original data.

* A temporary column named 'diagnosis_temp' is added, in which 'M' is encoded as 1 and 'B' as 0, so that the diagnosis can take part in the correlation analysis.

* A correlation matrix is then computed over all features and the numeric 'diagnosis_temp' column and drawn as a heatmap. It shows which features are most correlated with the diagnosis and which features are strongly correlated with each other. Afterwards the 'diagnosis_temp' column is dropped, leaving the original 'diagnosis' column with its 'M' and 'B' labels.
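
A 31 by 31 heatmap can be hard to read, so the features most correlated with the diagnosis can also be listed directly. The sketch below is an illustration under stated assumptions: the helper `rank_by_target_correlation` and the small `demo` DataFrame are hypothetical, and `numeric_only=True` assumes pandas 1.5 or newer.

```python
import pandas as pd


def rank_by_target_correlation(df, target_col):
    """Absolute Pearson correlation of each numeric column with target_col, highest first."""
    # Take the target's column of the correlation matrix and drop its self-correlation
    corr = df.corr(numeric_only=True)[target_col].drop(target_col)
    return corr.abs().sort_values(ascending=False)


# Tiny made-up dataset with the same column naming as the lab
demo = pd.DataFrame({
    "diagnosis": ["M", "M", "M", "B", "B", "B"],
    "radius_mean": [18.0, 20.0, 17.0, 12.0, 11.0, 13.0],
    "smoothness_mean": [0.10, 0.09, 0.11, 0.10, 0.11, 0.09],
})
demo["diagnosis_temp"] = demo["diagnosis"].map({"M": 1, "B": 0})

ranking = rank_by_target_correlation(demo, "diagnosis_temp")
print(ranking)
```

In the lab, the same call would have to run on `prepared_data` before the 'diagnosis_temp' column is dropped.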



<a id='section5.1'></a>
### Section 5.1: Imbalanced Dataset

* Check Class Distribution - Confirming that stratification worked as expected, i.e. that the training and testing sets contain the same proportion of each class.

* Matplotlib's `pie` function (`ax.pie`) is used to draw one pie chart per set. The percentage of each class is written directly into the wedge labels, so the proportions in the training and testing sets can be compared at a glance.


In [None]:
# Separating the independent and dependent variables
features = prepared_data.drop(columns=['diagnosis'])
target = prepared_data['diagnosis']

# Splitting the data into training and testing sets
# We are using a stratified split to ensure that the train and test sets have the same proportion of class labels
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, stratify=target, random_state=42)

# Display the number of records and features in the training and testing sets
print(f"Training set: {X_train.shape[0]} records, {X_train.shape[1]} features")
print(f"Testing set: {X_test.shape[0]} records, {X_test.shape[1]} features")

# Getting the counts for each class in both sets
train_counts = y_train.value_counts(normalize=True)
test_counts = y_test.value_counts(normalize=True)

# Displaying the class distribution
print("\n")
print("Class distribution in the training set:")
for label, value in train_counts.items():
    print(f"{label}: {value * 100:.2f}%")

print("\nClass distribution in the testing set:")
for label, value in test_counts.items():
    print(f"{label}: {value * 100:.2f}%")

# Plotting the distributions
fig, ax = plt.subplots(1, 2, figsize=(12, 6))

# Defining labels with percentages
train_labels = [f'{label}: {value * 100:.1f}%' for label, value in zip(train_counts.index, train_counts.values)]
test_labels = [f'{label}: {value * 100:.1f}%' for label, value in zip(test_counts.index, test_counts.values)]

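# Creating pie charts (the percentages are already part of the labels, so autopct is not needed;
# an empty autopct string would fail when matplotlib tries to format it)
ax[0].pie(train_counts, labels=train_labels, shadow=True)
ax[0].set_title('Training Set')

ax[1].pie(test_counts, labels=test_labels, shadow=True)
ax[1].set_title('Testing Set')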

plt.show()

<a id='section6'></a>
## Section 6: Feature Selection and Analysis

* Analyzing the correlation between different features and the target variable to assist in feature selection.

* Removing features with low variance.

* Using univariate feature selection to find the best features based on univariate statistical tests.

* Visualizing the chi-squared score of each feature.


In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, stratify=target, random_state=42)

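# Remove features with low variance (threshold = 0.8 * (1 - 0.8) = 0.16)
# Note: X_high_variance is computed for inspection only and is not used further below
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_high_variance = sel.fit_transform(X_train)

# Univariate feature selection: score every feature against the target
# (chi2 requires non-negative feature values)
X_best_features = SelectKBest(score_func=chi2, k=10)
X_best_features.fit(X_train, y_train)

# Get score of each feature
feature_scores = X_best_features.scores_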

# Get column names
columns = X_train.columns

# Create a dictionary and a dataframe with scores and features
feature_dict = dict(zip(columns, feature_scores))
feature_df = pd.DataFrame(feature_dict.items(), columns=['Feature', 'Score'])

# Sort the dataframe based on score
feature_df = feature_df.sort_values(by='Score', ascending=False)

# Plot the scores
plt.figure(figsize=(12,8))
sns.barplot(x='Score', y='Feature', data=feature_df)
plt.title('Chi-squared Score by Feature')
plt.show()

# Display the feature score dataframe
print(feature_df)

<a id='section6.1'></a>
### Section 6.1 - Interpretation of Feature Selection and Analysis

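Variance Thresholding - `VarianceThreshold` removes every feature whose training-set variance falls below the threshold, here 0.8 * (1 - 0.8) = 0.16. The expression comes from the variance of a Bernoulli variable, p(1 - p): for a boolean feature it removes columns that take the same value in more than 80% of the samples. The features in this dataset are continuous and measured on different scales, so here the threshold acts as a plain variance cut-off, and features with small numeric ranges may be dropped because of their scale rather than their information content.

Univariate Feature Selection - With k=10, `SelectKBest` keeps the 10 features with the highest chi-squared scores. The chi-squared statistic measures the dependence between each non-negative feature and the class label, so a higher score indicates a stronger association with the diagnosis. In the cell above the scores are used for ranking and plotting only; the classifier in Section 7 is still trained on all columns of `X_train`.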
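
To see which columns each selector actually keeps, the fitted selectors expose a boolean mask through `get_support()`. The sketch below is illustrative only: the three-column `X_demo` frame is synthetic, its column names are hypothetical, and `k=1` is used instead of the lab's `k=10` because the toy data has only three features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

# Synthetic, non-negative data: one informative, one noisy, one near-constant column
rng = np.random.default_rng(0)
n = 200
y_demo = np.array(["M"] * 100 + ["B"] * 100)
X_demo = pd.DataFrame({
    "informative": np.where(y_demo == "M", 20.0, 10.0) + rng.normal(0, 1, n),
    "noise": rng.uniform(5, 15, n),
    "near_constant": 0.1 + rng.normal(0, 0.01, n),
})

# Same threshold expression as in the lab: 0.8 * (1 - 0.8) = 0.16
threshold = 0.8 * (1 - 0.8)
vt = VarianceThreshold(threshold=threshold).fit(X_demo)
kept_by_variance = list(X_demo.columns[vt.get_support()])

# Keep the single feature with the highest chi-squared score
skb = SelectKBest(score_func=chi2, k=1).fit(X_demo, y_demo)
kept_by_chi2 = list(X_demo.columns[skb.get_support()])

print("Kept by variance threshold:", kept_by_variance)
print("Kept by chi-squared:", kept_by_chi2)
```

Applied to the lab's data, the same pattern (`X_train.columns[X_best_features.get_support()]`) would list the 10 selected feature names, which could then be used to train the classifier on the reduced feature set.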

<a id='section7'></a>
## Section 7: Implementing Decision Tree Algorithm

In [None]:
# Step 1: Initialize the DecisionTreeClassifier with default parameters
dt_classifier = DecisionTreeClassifier(random_state=42)

# Step 2: Fit the model to the training data
dt_classifier.fit(X_train, y_train)

# Step 3: Make predictions on the test data
y_pred = dt_classifier.predict(X_test)

# Step 4: Evaluate the model's performance using accuracy score, confusion matrix, and classification report
print('Accuracy Score:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))


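# Step 5: Visualize the decision tree
# Class names are taken from the fitted model so they always match its internal class order
plt.figure(figsize=(15, 10))
plot_tree(dt_classifier, filled=True, feature_names=list(X_train.columns), class_names=list(dt_classifier.classes_))
plt.show()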


<a id='section7.1'></a>
### Section 7.1 - Interpretation of Decision Tree Algorithm

* Initialize the DecisionTreeClassifier - Setting random_state to 42 makes the results reproducible from run to run.

* Fit the Model to the Training Data - The classifier is trained on the training set and learns decision rules from it. How well those rules carry over to previously unseen data is what the test-set evaluation measures.

* Predictions on the Test Data - The model uses the patterns it learned during training to predict the labels of the test data.

* Decision Tree - The visualization shows the decision rules the classifier uses to make predictions. Internal nodes show a decision rule (a feature and a threshold), and the leaves show the predicted class, 'B' for benign and 'M' for malignant.
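
For larger trees the plot can become hard to read, and the same rules can be printed as text with scikit-learn's `export_text`. The sketch below is a minimal illustration on a made-up eight-row dataset; the names `X_demo`, `y_demo`, and `tree_demo` are hypothetical and the values are not taken from the lab's data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: radius_mean separates the classes perfectly, texture_mean does not
X_demo = pd.DataFrame({
    "radius_mean": [10.0, 11.0, 12.0, 13.0, 17.0, 18.0, 19.0, 20.0],
    "texture_mean": [15.0, 22.0, 18.0, 20.0, 16.0, 21.0, 19.0, 17.0],
})
y_demo = ["B", "B", "B", "B", "M", "M", "M", "M"]

tree_demo = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# Each line is one node: inner nodes show "feature <= threshold", leaves show the class
rules = export_text(tree_demo, feature_names=list(X_demo.columns))
print(rules)
```

In the lab, the equivalent call would be `export_text(dt_classifier, feature_names=list(X_train.columns))`.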


## <a id='section8'></a>
## Section 8: Optimization and Hyper-parameter Tuning

* Hyper-parameter Tuning - The goal is to improve the decision tree by searching for the combination of hyperparameters that gives the best cross-validated accuracy.

* Avoid Overfitting - An overfitted model adapts too closely to the training data and generalizes poorly to unseen data. Limiting the tree's depth and the minimum number of samples per split and per leaf keeps the tree from memorizing the training set.

In [None]:
# Define the parameter grid to search
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10],
}

# Initialize a DecisionTreeClassifier
dtree = DecisionTreeClassifier(random_state=42)

# Initialize a GridSearchCV object which will find the best hyper-parameters
grid_search = GridSearchCV(dtree, param_grid, cv=5, scoring='accuracy')

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters found
print('Best parameters found: ', grid_search.best_params_)

# Print the best score found
print('Best cross-validation score: {:.2f}'.format(grid_search.best_score_))

# Test the model on the test data
test_score = grid_search.score(X_test, y_test)
print('Test set score: {:.2f}'.format(test_score))

### Explanation of hyperparameter ranges:

* Criterion - Both 'gini' (Gini impurity) and 'entropy' (information gain) are tried to see which split-quality measure gives the better cross-validated accuracy.

* Max_depth - The max_depth parameter restricts the depth of the tree. Setting it to None lets the tree grow until all leaves are pure or contain fewer than min_samples_split samples. The other values (10 to 50) are increasing limits on the tree depth that control the complexity of the model and can help prevent overfitting. A limit only has an effect when it is smaller than the depth the unconstrained tree would reach.

* Min samples split - The minimum number of samples a node must contain to be eligible for splitting. The grid tries 2, 5, 10, and 20. The smallest value, 2, lets the tree keep splitting any impure node that holds at least two samples, so it can learn fine details of the training data; larger values stop the splitting earlier and give simpler trees.

* Min samples leaf - The minimum number of training samples each leaf (terminal node) must contain. The grid tries 1, 2, 5, and 10. Values above 1 prevent the tree from creating leaves for individual outliers or noisy points, which helps to avoid overfitting.

Including a varied range of values for these hyperparameters ensures that the grid search can explore a broad space of possible models, which is essential to finding a well-tuned model.
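
The size of that search space can be checked before running the search. The sketch below counts the candidate models with scikit-learn's `ParameterGrid`; it repeats the grid from the cell above, and the variable names `n_candidates`, `cv_folds`, and `n_fits` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10],
}

# Every combination of values is one candidate model
n_candidates = len(ParameterGrid(param_grid))

# With cv=5, each candidate is fitted once per fold
cv_folds = 5
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

After the search has run, `grid_search.cv_results_` holds the mean cross-validated score of every candidate, which makes it possible to see how close the runner-up settings are to the best one.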