<a href="https://colab.research.google.com/github/txusser/Master_IA_Sanidad/blob/main/Modulo_2/2_3_4_Proyecto_Arboles_de_decision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Trees
Application to a classification problem using morphological and other clinically relevant features for breast cancer diagnosis.


## Project Description

In [None]:
# Load Required Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

from rich import print
from rich.console import Console
console = Console()

sns.set(style="whitegrid") # Configure Seaborn figure style

# This dataset is part of Scikit-learn's example datasets
# Reference: https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset
data = load_breast_cancer()
# DESCR is an attribute (not a method) that holds a text description of the dataset
print(data.DESCR)


## Exploration and Preprocessing

In [None]:
# Create a Pandas DataFrame from the dataset dictionary
df = pd.DataFrame(data.data, columns=data.feature_names)

# Check for missing values per variable
missing_values = df.isnull().sum()
print("Number of missing values per variable:\n", missing_values)

# Visualize the distribution of missing values
plt.figure(figsize=(12, 8))  # Adjust figure size for better readability
sns.heatmap(df.isnull(), cbar=True, cmap='coolwarm', yticklabels=False)
plt.title("Distribution of Missing Values in the Dataset")
plt.xlabel("Variables")
plt.ylabel("Entries")
plt.xticks(rotation=90)  # Rotate x-axis labels to prevent overlap
plt.grid(False)  # Optional: set to True if grid lines are preferred
plt.tight_layout()  # Automatically adjust subplot parameters
plt.show()


In [None]:
# Function to introduce missing values in a DataFrame (modifies it in place)
def add_missing_values(df, missing_percentage=0.05):
    # Total number of cells in the DataFrame
    # (df.size replaces np.product, which was removed in NumPy 2.0)
    total_cells = df.size
    # Number of cells that need to be NaN
    total_missing = int(total_cells * missing_percentage)

    # Draw distinct flat cell positions so that exactly total_missing cells are hit,
    # then convert them into (row, column) index pairs
    flat_indices = np.random.choice(total_cells, total_missing, replace=False)
    row_indices, col_indices = np.unravel_index(flat_indices, df.shape)

    # Assign NaN to the selected cells
    for row, col in zip(row_indices, col_indices):
        df.iat[row, col] = np.nan

# Create a copy of the DataFrame and add missing values
df_missing = df.copy()
add_missing_values(df_missing, missing_percentage=0.01)  # Add 1% of missing values

# Visualize the missing values introduced
plt.figure(figsize=(12, 8))  # Adjust figure size for better readability
sns.heatmap(df_missing.isnull(), cbar=True, cmap='coolwarm', yticklabels=False)
plt.title("Distribution of Missing Values After Introducing 1% NaNs")
plt.xlabel("Variables")
plt.ylabel("Entries")
plt.xticks(rotation=90)  # Rotate x-axis labels to prevent overlap
plt.grid(False)  # Optional: set to True if grid lines are preferred
plt.tight_layout()  # Automatically adjust subplot parameters
plt.show()
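
The cell above only injects and visualizes the missing values. One common way to handle them before training is median imputation; the following is a minimal sketch under that assumption, using scikit-learn's `SimpleImputer` on a small hypothetical frame (`toy` stands in for `df_missing`; its values are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with a few gaps (stand-in for df_missing)
toy = pd.DataFrame({
    "mean radius": [14.0, np.nan, 20.0, 12.0],
    "mean texture": [19.0, 21.0, np.nan, 17.0],
})

# Replace each NaN with the median of its column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
```

In a real workflow the imputer would be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.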


In [None]:
# The goal of the project is to achieve high accuracy in classifying
# examples in the test set into the two categories indicating whether
# the tumor is malignant or benign.
print("Target Classes:", data.target_names)


In [None]:
# Split the dataset into features and target variable
X, y = data.data, data.target

# Visualize the class distribution in the target variable
fig, ax = plt.subplots(figsize=(8, 5))
sns.countplot(x=y, ax=ax, palette="viridis")  # Using the 'viridis' color palette
ax.set_title("Class Distribution in Target Variable")
ax.set_xlabel("Classes")
ax.set_ylabel("Count")
ax.set_xticks([0, 1])  # Fix the tick positions before relabeling them
ax.set_xticklabels(data.target_names)

# Add data labels (integer counts) to the bars
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom', color='black')


## Class Balance

In [None]:
labels = data.target

# Count the number of instances per class and convert the counts to percentages
class_counts = np.bincount(labels)
class_labels = data.target_names
class_percentages = class_counts / len(labels) * 100

print("Class Counts:", class_counts)

# Create a bar chart
plt.figure(figsize=(8, 6))
plt.bar(class_labels, class_percentages, color=['red', 'green'])

# Add title and axis labels
plt.title('Class Balance in Breast Cancer Dataset')
plt.xlabel('Class')
plt.ylabel('Percentage (%)')
plt.ylim(0, 100)  # Set limits for y-axis for better clarity

# Display percentages on the bars
for i, percentage in enumerate(class_percentages):
    plt.text(i, percentage + 1, f'{percentage:.2f}%', ha='center')

plt.show()



**Conclusions**: The dataset is moderately imbalanced: benign samples (class 1, cancer absent) outnumber malignant samples (class 0, cancer present). With approximately 37% malignant and 63% benign cases, the difference is not extreme, and standard machine learning techniques can usually handle it without specific adjustments.

For algorithms sensitive to class imbalance or in scenarios where misclassifying one class has significantly more severe consequences, strategies to balance the distribution could be considered. These strategies include oversampling the minority class, undersampling the majority class, or applying synthetic data generation methods such as SMOTE.
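
As an illustration of the first strategy, random oversampling of the minority class can be sketched with `sklearn.utils.resample`. The arrays `X_toy` and `y_toy` are hypothetical stand-ins, not the notebook's data, and this is only one simple way to rebalance.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical toy data: 6 samples of class 1 and 2 samples of class 0
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([1, 1, 1, 1, 1, 1, 0, 0])

X_min, y_min = X_toy[y_toy == 0], y_toy[y_toy == 0]
X_maj, y_maj = X_toy[y_toy == 1], y_toy[y_toy == 1]

# Draw minority samples with replacement until both classes have the same size
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=42
)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_bal))
```

Resampling of this kind should be applied to the training split only; the test set should keep the original class distribution.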


In [None]:
# Use the 'talk' plotting context and scale its fonts to 70% so the annotations fit in the cells
sns.set_context('talk', font_scale=0.7)

# Create a figure to visualize the correlation matrix with an adjusted size
plt.figure(figsize=(20, 20))  # Adjust the figure size

# Generate a heatmap of the correlation matrix
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

# Set titles and axis labels
plt.title('Correlation Matrix')
plt.xlabel('Variables')
plt.ylabel('Variables')

# Rotate x-axis labels and improve subplot layout
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()  # Automatically adjust subplot parameters


In [None]:
# In the previous graph, we observed that several variables are highly correlated. Let's explore how we can remove them.

# Define a function that flags the columns to keep after dropping highly correlated ones
def find_lc_cols(df, thres):
    """
    Returns a boolean mask over the columns of df. A column is marked False
    (dropped) when its absolute correlation with any earlier column is
    greater than or equal to the threshold (thres).
    """
    corr = df.corr()
    columns = np.full((corr.shape[0],), True, dtype=bool)
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[0]):
            if abs(corr.iloc[i, j]) >= thres:
                columns[j] = False

    return columns

# Run the function and retrieve columns/variables with low correlation
lc_cols = find_lc_cols(df, thres=0.90)
print("Variables kept (|correlation| < 0.90 with every earlier variable):", df.columns[lc_cols].tolist())
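
The same filter can also be written without explicit loops. The sketch below is an assumed equivalent, not part of the original notebook: `low_corr_mask` is a hypothetical helper that masks the upper triangle of the absolute correlation matrix, and `toy` is a made-up frame used to show the behavior.

```python
import numpy as np
import pandas as pd

def low_corr_mask(df, thres):
    """Boolean mask: True for columns to keep (hypothetical helper)."""
    corr = df.corr().abs()
    # Keep only entries above the diagonal, so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it correlates strongly with any earlier column
    return ~(upper >= thres).any(axis=0).to_numpy()

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 1.0,    # perfectly correlated with "a"
    "b": rng.normal(size=200),  # independent noise
})

mask = low_corr_mask(toy, thres=0.90)
print(toy.columns[mask].tolist())
```

Note that the result depends on column order: of each highly correlated pair, the later column is the one removed.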

In [None]:
# Build a DataFrame that includes only the selected columns
s_cols = df.columns[lc_cols]
df_s = df[s_cols]

# Log the results
console.log(f"Selected variables: {len(df_s.columns)}", style="bold blue")
console.log(f"From a total of: {len(df.columns)}", style="bold blue")
console.log(f"\nFinal dataset: \n{df_s}", style="yellow")

In [None]:
# Split the dataset into training and testing sets using Scikit-learn
from sklearn.model_selection import train_test_split

X = df_s[df_s.columns]  # Features
y = data.target  # Target labels

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the datasets
print(" - X_train:", X_train.shape)
print(" - X_test:", X_test.shape)
print(" - y_train:", y_train.shape)
print(" - y_test:", y_test.shape)

# Visualize the distribution of the training and testing sets
shapes = {
    'X_train': X_train.shape[0],
    'y_train': y_train.shape[0],
    'X_test': X_test.shape[0],
    'y_test': y_test.shape[0]
}

plt.figure(figsize=(10, 6))
plt.bar(shapes.keys(), shapes.values(), color=['blue', 'orange', 'green', 'red'])
plt.xlabel('Data Sets')
plt.ylabel('Number of Instances')
plt.title('Distribution of Training and Testing Sets')
plt.show()
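
Because the classes are not perfectly balanced, a stratified split is a common variant of the cell above. The following sketch assumes the full 30-feature dataset and uses the `stratify` argument of `train_test_split`; the variable names are chosen so they do not overwrite the notebook's own splits.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)

# stratify=y_all keeps the class proportions nearly identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

print("Benign share, full set:", round(y_all.mean(), 3))
print("Benign share, train:   ", round(y_tr.mean(), 3))
print("Benign share, test:    ", round(y_te.mean(), 3))
```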


## Decision Tree

In [None]:
# It's time to configure our classifier and fit the model to the data
from sklearn.tree import DecisionTreeClassifier
clf_dt = DecisionTreeClassifier(criterion='gini', max_depth=2)
clf_dt.fit(X_train, y_train)
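
With `max_depth=2` the fitted tree consists of at most three decision nodes, so its rules can be read directly. The sketch below prints them with `export_text`; as an assumption for self-containment it trains on all 30 features rather than the correlation-filtered subset, so the thresholds may differ from those of `clf_dt`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

bc = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    bc.data, bc.target, test_size=0.2, random_state=42
)

# Note: this sketch uses all 30 features, not the filtered subset df_s
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=42)
tree.fit(X_tr, y_tr)

# Text rendering of the learned splits and leaf classes
rules = export_text(tree, feature_names=list(bc.feature_names))
print(rules)
```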

In [None]:
# With the model fitted, let's evaluate its performance
# Calculate the predictions of the trained model on the test set
y_test_pred = clf_dt.predict(X_test)

# Using the true labels in the test set, calculate the model's accuracy
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_test_pred) * 100
print("The model's accuracy is: {:0.2f}%".format(acc))

In [None]:
# Let's now look at the confusion matrix for a better assessment of performance.
# Remember that the test set includes 114 subjects.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Create the confusion matrix
cm = confusion_matrix(y_test, y_test_pred)

# Create an instance of ConfusionMatrixDisplay to visualize the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)

# Set the title and display the confusion matrix
disp.plot(cmap='Blues', values_format='d')
plt.title("Confusion Matrix")
plt.show()


In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, precision_score, recall_score, f1_score

# `y_test` holds the true labels and `y_test_pred` the decision tree predictions

# Create the confusion matrix and normalize each row (true class) so it sums to 1
cm = confusion_matrix(y_test, y_test_pred)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

# Calculate additional metrics, averaged over both classes and weighted by support
precision = precision_score(y_test, y_test_pred, average='weighted')
recall = recall_score(y_test, y_test_pred, average='weighted')
f1 = f1_score(y_test, y_test_pred, average='weighted')

# Create an instance of ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=cm_normalized, display_labels=data.target_names)

# Plot on an explicit axes; calling plt.figure() first would leave an empty extra figure,
# because disp.plot() creates its own figure when no axes is passed
fig, ax = plt.subplots(figsize=(8, 8))
disp.plot(ax=ax, cmap='Blues', values_format='.2f')
ax.set_title('Normalized Confusion Matrix')

# Annotate additional metrics below the plot
plt.figtext(0.5, -0.1, f'Precision (weighted): {precision:.2f}', ha='center', va='center')
plt.figtext(0.5, -0.15, f'Recall/Sensitivity (weighted): {recall:.2f}', ha='center', va='center')
plt.figtext(0.5, -0.2, f'F1-Score (weighted): {f1:.2f}', ha='center', va='center')

plt.tight_layout()  # Automatically adjust subplot parameters
plt.show()


## Results
The decision tree reaches a classification accuracy just above 91% on the test set (104 of 114 cases). Because the test set contains 43 malignant and 71 benign cases, the 104 correct predictions correspond to 67 benign and 37 malignant cases. The 10 errors are 6 malignant tumors predicted as benign and 4 benign tumors predicted as malignant; under scikit-learn's convention of treating class 1 (benign) as the positive class, these are 6 false positives and 4 false negatives. These misclassifications point to the areas where the model could be refined further.
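
As a quick arithmetic check, the reported accuracy follows directly from the four counts. A minimal sketch, using only the numbers quoted in this section:

```python
# Counts as reported in the Results paragraph
correct = 67 + 37   # correctly classified test cases
errors = 6 + 4      # misclassified test cases
total = correct + errors

accuracy = correct / total
print(f"{correct}/{total} correct -> accuracy = {accuracy:.2%}")
```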


## Apply Logistic Regression Model
Let's see how well a logistic regression model performs.

https://www.kaggle.com/neisha/heart-disease-prediction-using-logistic-regression


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create and fit the Logistic Regression model
clf_lr = LogisticRegression()
clf_lr.fit(X_train, y_train)

# Predict on the test set
y_test_pred = clf_lr.predict(X_test)

# Calculate and display the model's accuracy
accuracy = accuracy_score(y_test, y_test_pred) * 100
print("The model accuracy (Logistic Regression) is: {:.2f}%".format(accuracy))
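
With default settings, `LogisticRegression` may emit a convergence warning on features that live on very different scales, as they do in this dataset. A common remedy, sketched here as an assumption rather than as the notebook's method, is to standardize the features inside a pipeline and raise `max_iter`. For self-containment the sketch uses all 30 features, so its accuracy is not directly comparable to the cell above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, then applied to the test data
clf_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf_scaled.fit(X_tr, y_tr)

print(f"Test accuracy with scaling: {clf_scaled.score(X_te, y_te) * 100:.2f}%")
```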


## ROC Curve
The Receiver Operating Characteristic (ROC) curve is a graphical tool used to evaluate classification models. It provides a visual representation of how well a model can distinguish between two distinct classes by varying the decision threshold.

The curve is created by plotting the True Positive Rate (Sensitivity) on the y-axis and the False Positive Rate (1 - Specificity) on the x-axis. Here's an explanation of the key terms:

- **Sensitivity (True Positive Rate - TPR):** Measures the proportion of positive instances correctly classified as positive by the model. It is calculated as TP / (TP + FN), where TP is the number of true positives (correctly classified positive instances), and FN is the number of false negatives (positive instances incorrectly classified as negative).

- **Specificity (True Negative Rate - TNR):** Measures the proportion of negative instances correctly classified as negative by the model. It is calculated as TN / (TN + FP), where TN is the number of true negatives (correctly classified negative instances), and FP is the number of false positives (negative instances incorrectly classified as positive).

On the ROC curve, each point corresponds to a different decision threshold, which determines the true positive rate and the false positive rate. An ideal model would have a ROC curve that reaches the upper-left corner (sensitivity of 1, specificity of 1), meaning no false positives and no false negatives.

The Area Under the ROC Curve (AUC) is another important metric. It measures the overall discriminative ability of the model across all possible decision thresholds. An AUC close to 1 indicates a model with good discriminative power, while an AUC near 0.5 suggests performance similar to random guessing.
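
The definitions above can be made concrete with a small hand computation. The sketch below uses hypothetical labels and scores (`y_true`, `y_prob`) and a hypothetical helper `tpr_fpr` to show how moving the threshold trades sensitivity against the false positive rate; `roc_auc_score` then summarizes all thresholds in one number.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

def tpr_fpr(y_true, y_prob, threshold):
    """Return (TPR, FPR) when predicting positive for scores >= threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Each threshold yields one point of the ROC curve
for t in (0.3, 0.5, 0.85):
    tpr_t, fpr_t = tpr_fpr(y_true, y_prob, t)
    print(f"threshold={t:.2f}  TPR={tpr_t:.2f}  FPR={fpr_t:.2f}")

auc_value = roc_auc_score(y_true, y_prob)
print(f"AUC: {auc_value:.3f}")
```

In this toy case 8 of the 9 positive/negative pairs are ranked correctly, which is exactly what the AUC of about 0.889 expresses.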


In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Calculate the probability scores assigned by the model on the test set
y_score = clf_lr.predict_proba(X_test)[:,1]

# Compute the False Positive Rate (FPR) and True Positive Rate (TPR)
fpr, tpr, thresholds = roc_curve(y_test, y_score)

# Calculate the Area Under the ROC Curve (AUC)
roc_auc = auc(fpr, tpr)

# Create a figure to visualize the ROC curve
plt.figure(figsize=(8, 6))

# Plot the ROC curve
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC Curve (AUC = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--')

# Configure axis limits, labels, and add a legend
plt.xlim([-0.01, 1.0])
plt.ylim([0.0, 1.05])
plt.title('ROC Curve (Logistic Regression)')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.legend(loc='lower right')
plt.grid(True)


## Other Useful Models for Classification Problems

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Initialize Models
models = {
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "KNN": KNeighborsClassifier()
}

# Train and Evaluate Each Model
for name, model in models.items():
    model.fit(X_train, y_train)  # Train the model
    y_pred = model.predict(X_test)  # Predict on the test set
    console.rule(f"[bold]Model {name}[/bold]")
    print(classification_report(y_test, y_pred))  # Display performance metrics


## Comments on the Results

Analyzing the performance metrics of the Random Forest-based model:
* Class 0: Malignant tumor
* Class 1: Benign tumor

### Precision
- **Class 0**: 0.95. This means that 95% of the time the model predicts class 0, it is correct.
- **Class 1**: 0.96. 96% of the time the model predicts class 1, it is correct.
- High precision in both classes suggests the model is effective at minimizing false positives.

### Recall (Sensitivity)
- **Class 0**: 0.93. The model correctly identifies 93% of all actual instances of class 0.
- **Class 1**: 0.97. The model correctly identifies 97% of all actual instances of class 1.
- High recall in both classes indicates that the model misses few instances of either class (few false negatives).

### F1-Score
- **Class 0**: 0.94. The harmonic mean of precision and recall for class 0.
- **Class 1**: 0.97. The harmonic mean of precision and recall for class 1.
- F1-Score is especially useful in situations where a balance between precision and recall is important. A high value in both classes indicates a good balance.

### Support
- **Class 0**: 43. There are 43 instances of class 0 in the test set.
- **Class 1**: 71. There are 71 instances of class 1 in the test set.
- Support indicates the number of actual occurrences of each class in the test dataset.

### Overall Accuracy
- 0.96. The model correctly classified 96% of all instances.

### Macro Avg and Weighted Avg
- **Macro Avg**: Computes the arithmetic mean of the metrics for each class, treating all classes equally. For this model, the macro averages for precision, recall, and f1-score are 0.96, 0.95, and 0.95 respectively.
- **Weighted Avg**: Computes the weighted average of the metrics, considering the support (number of instances) for each class. The weighted averages for precision, recall, and f1-score are 0.96.
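
The difference between the two averages is easiest to see on a small example. The sketch below uses hypothetical labels (not the notebook's predictions) and computes the macro and weighted recall by hand from the per-class values returned by `precision_recall_fscore_support`.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical toy labels: 4 samples of class 0 and 6 samples of class 1
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

# Per-class precision, recall, F1 and support
prec, rec, f1, support = precision_recall_fscore_support(y_true, y_pred)

macro_recall = rec.mean()                           # every class counts equally
weighted_recall = np.average(rec, weights=support)  # classes weighted by support

print("Recall per class:", rec.round(3))
print("Macro recall:    ", round(macro_recall, 3))
print("Weighted recall: ", round(weighted_recall, 3))
```

Here the per-class recalls are 3/4 and 5/6, so the macro average is about 0.792 while the weighted average is 0.8, which coincides with the overall accuracy (8 of 10 correct).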

### General Interpretation
The model performs well across all metrics for both classes: high recall shows that it finds most instances of each class, and high precision shows that few of its predictions are false positives. The high F1-score in both classes indicates a good balance between the two, and the overall accuracy is also high. However, the application context must always be considered. In a diagnostic setting the costs of false positives and false negatives differ, so a high overall accuracy alone is not sufficient, and the per-class errors may call for a more nuanced evaluation.
