# Breast Cancer Dataset Analysis

This notebook focuses on analyzing the **Breast Cancer Wisconsin (Diagnostic)** dataset. The tasks include:

1. Installing required libraries.
2. Fetching the dataset from the UCI Machine Learning Repository.
3. Exploring the dataset to understand its structure and properties.
4. Preparing the dataset by splitting it into training and testing sets using various proportions.
5. Visualizing class distributions to ensure stratification in the splits.

We will use the following tools and libraries:
- `ucimlrepo` to fetch the dataset.
- `pandas` for data manipulation.
- `matplotlib` and `seaborn` for data visualization.
- `scikit-learn` for splitting the dataset into training and testing subsets.

## Step 0: Install Necessary Libraries

Before running the notebook, make sure you have all the required Python libraries installed:
- **ucimlrepo**: Fetches datasets from the UCI Machine Learning Repository.
- **pandas**: Handles data manipulation and analysis.
- **matplotlib** and **seaborn**: For creating visualizations.
- **scikit-learn**: Provides tools for splitting datasets and building machine learning models.

Run the following code to install them if they are not already installed.


In [None]:
# Install required libraries (run this cell only if needed)
%pip install ucimlrepo pandas matplotlib seaborn scikit-learn graphviz 




## Step 1: Import Libraries

We begin by importing the necessary libraries for data loading, manipulation, and visualization.

In [5]:
# Import libraries
from ucimlrepo import fetch_ucirepo  # Fetch dataset from UCI repository
import pandas as pd  # Data manipulation
import matplotlib.pyplot as plt  # Visualization
import seaborn as sns  # Advanced visualization
from sklearn.model_selection import train_test_split  # Train-test splitting

# Set plotting style for consistency
sns.set_style('whitegrid')


## Step 2: Fetch the Dataset

The **Breast Cancer Wisconsin (Diagnostic)** dataset is fetched from the UCI Machine Learning Repository using the `ucimlrepo` library. 


In [10]:
# Fetch the dataset using its unique ID
breast_cancer_data = fetch_ucirepo(id=17)

### Dataset metadata

In [12]:
# View metadata about the dataset
breast_cancer_data.metadata

{'uci_id': 17,
 'name': 'Breast Cancer Wisconsin (Diagnostic)',
 'repository_url': 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic',
 'data_url': 'https://archive.ics.uci.edu/static/public/17/data.csv',
 'abstract': 'Diagnostic Wisconsin Breast Cancer Database.',
 'area': 'Health and Medicine',
 'tasks': ['Classification'],
 'characteristics': ['Multivariate'],
 'num_instances': 569,
 'num_features': 30,
 'feature_types': ['Real'],
 'demographics': [],
 'target_col': ['Diagnosis'],
 'index_col': ['ID'],
 'has_missing_values': 'no',
 'missing_values_symbol': None,
 'year_of_dataset_creation': 1993,
 'last_updated': 'Fri Nov 03 2023',
 'dataset_doi': '10.24432/C5DW2B',
 'creators': ['William Wolberg',
  'Olvi Mangasarian',
  'Nick Street',
  'W. Street'],
 'intro_paper': {'ID': 230,
  'type': 'NATIVE',
  'title': 'Nuclear feature extraction for breast tumor diagnosis',
  'authors': 'W. Street, W. Wolberg, O. Mangasarian',
  'venue': 'Electronic imaging',
  'yea

### Variable Information

The variable information describes:
- The features (input variables) in the dataset.
- The target variable (`M` for malignant, `B` for benign).

In [15]:
# View feature and variable information
breast_cancer_data.variables


| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | ID | ID | Categorical | | | | no |
| 1 | Diagnosis | Target | Categorical | | | | no |
| 2 | radius1 | Feature | Continuous | | | | no |
| 3 | texture1 | Feature | Continuous | | | | no |
| 4 | perimeter1 | Feature | Continuous | | | | no |
| 5 | area1 | Feature | Continuous | | | | no |
| 6 | smoothness1 | Feature | Continuous | | | | no |
| 7 | compactness1 | Feature | Continuous | | | | no |
| 8 | concavity1 | Feature | Continuous | | | | no |
| 9 | concave_points1 | Feature | Continuous | | | | no |


### Extract Features and Labels

After fetching and examining the dataset, we will extract:
- **Features**: The numerical columns describing tumor properties.
- **Labels**: The target column, which indicates whether the tumor is malignant (`M`) or benign (`B`).


In [18]:
# Extract features and labels
features = breast_cancer_data.data.features  # Features (30 columns)
labels = breast_cancer_data.data.targets     # Labels ('M' for malignant, 'B' for benign)


Preview first 5 rows of `features`

In [21]:
features.head()

| | radius1 | texture1 | perimeter1 | area1 | smoothness1 | compactness1 | concavity1 | concave_points1 | symmetry1 | fractal_dimension1 | ... | radius3 | texture3 | perimeter3 | area3 | smoothness3 | compactness3 | concavity3 | concave_points3 | symmetry3 | fractal_dimension3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


Preview first 5 rows of `labels`

In [24]:
labels.head()

| | Diagnosis |
|---|---|
| 0 | M |
| 1 | M |
| 2 | M |
| 3 | M |
| 4 | M |


## Step 3: Prepare the Dataset

We will split the dataset into training and testing subsets with the following proportions:
- **40% Training / 60% Testing**
- **60% Training / 40% Testing**
- **80% Training / 20% Testing**
- **90% Training / 10% Testing**

### Tasks:
1. Use the `train_test_split` function from `sklearn` to split the data.
2. Apply **stratified splitting** to ensure that the class distribution (malignant/benign) remains consistent across the splits.
3. Store the splits in a dictionary for easy access.


In [27]:
# Import train_test_split to split the dataset into training and testing subsets
from sklearn.model_selection import train_test_split

# Define train-test proportions
train_sizes = [0.4, 0.6, 0.8, 0.9]

# Create a dictionary to store the splits
splits = {}


## Step 4: Perform Stratified Splitting

Using the defined train-test proportions, we will:
1. Split the dataset into `feature_train`, `feature_test`, `label_train`, and `label_test` subsets.
2. Ensure stratification so that the class balance is preserved in both training and testing sets.
3. Store the splits in a dictionary, where each key corresponds to the train-test proportion.
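
To make the idea of stratification concrete, here is a minimal pure-Python sketch. It is an illustration only, not scikit-learn's implementation: the helper name `stratified_split` is hypothetical, and the label list is synthetic, assumed to follow the class counts documented for this dataset (357 benign, 212 malignant). Because it rounds per class, its totals can differ by a sample from what `train_test_split` produces.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_size, seed=42):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_size * len(indices))  # per-class training count
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels (assumed class counts: 357 'B', 212 'M').
labels = ["B"] * 357 + ["M"] * 212
train_idx, test_idx = stratified_split(labels, train_size=0.8)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train share of M:", train_counts["M"] / len(train_idx))
print("test share of M: ", test_counts["M"] / len(test_idx))
```

Both printed shares should sit close to the overall malignant share, which is the property that `stratify=labels` asks `train_test_split` to preserve.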


In [30]:
# Perform stratified splitting for each proportion
for train_size in train_sizes:
    feature_train, feature_test, label_train, label_test = train_test_split(
        features, labels, train_size=train_size, stratify=labels, random_state=42
    )
    # Store the splits in the dictionary
    splits[f'{int(round(train_size * 100))}/{int(round((1 - train_size) * 100))}'] = {
        'feature_train': feature_train,
        'feature_test': feature_test,
        'label_train': label_train,
        'label_test': label_test
    }

## Step 5: Confirm Splits

We will now verify that the splits were created correctly by:
1. Checking the structure of the dictionary where splits are stored.
2. Checking the shapes of the training and test sets for one split (80/20) to confirm that the data is organized as expected.

In [33]:
# Display the keys of the splits dictionary to confirm the proportions
splits.keys()

dict_keys(['40/60', '60/40', '80/20', '90/10'])
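
The keys are derived from floating-point proportions, which is why the test share is rounded before being converted to an integer. A small sketch of the pitfall follows; the helper `split_key` is hypothetical and only mirrors the key format used above.

```python
def split_key(train_size):
    """Format a train proportion as a 'train/test' percentage label."""
    train_pct = round(train_size * 100)
    test_pct = round((1 - train_size) * 100)
    return f"{train_pct}/{test_pct}"


# 1 - 0.9 is not exactly 0.1 in binary floating point,
# so truncating with int() alone loses a percentage point.
truncated = int((1 - 0.9) * 100)
rounded = round((1 - 0.9) * 100)
print(truncated, rounded)  # 9 10

keys = [split_key(size) for size in [0.4, 0.6, 0.8, 0.9]]
print(keys)
```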

Shapes of the training and test features and labels for the 80/20 split

In [36]:
# Access and display a sample from the 80/20 split
feature_train = splits['80/20']['feature_train']
feature_test = splits['80/20']['feature_test']
label_train = splits['80/20']['label_train']
label_test = splits['80/20']['label_test']

# Show the shape of the training features, test features, training labels and test labels for 80/20 split
feature_train.shape, feature_test.shape, label_train.shape, label_test.shape

((455, 30), (114, 30), (455, 1), (114, 1))
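
These shapes follow from the dataset size of 569 rows. As a sketch, assuming the training count is the floor of `train_size * n_samples` and the test set takes the remainder (an assumption that reproduces the 455/114 shapes above), the expected sizes for every proportion can be worked out as follows:

```python
import math

n_samples = 569  # num_instances from the dataset metadata
train_sizes = [0.4, 0.6, 0.8, 0.9]

expected_sizes = {}
for train_size in train_sizes:
    n_train = math.floor(train_size * n_samples)  # assumed rounding rule
    n_test = n_samples - n_train
    expected_sizes[train_size] = (n_train, n_test)

for train_size, (n_train, n_test) in expected_sizes.items():
    print(f"train_size={train_size}: {n_train} train / {n_test} test")
```

Comparing these numbers with `feature_train.shape[0]` and `feature_test.shape[0]` for each entry in `splits` is a quick sanity check.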

## Step 6: Train the Decision Tree Classifier

We will now train a **Decision Tree Classifier** on the **80/20 training data** (`feature_train` and `label_train`). 

### Parameters:
- **criterion='entropy'**: The splitting criterion is information gain (entropy).
- **random_state=42**: Ensures reproducibility of the training process.

This classifier will learn to differentiate between malignant and benign tumors based on the training data.
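
As a rough illustration of the entropy criterion, the sketch below computes Shannon entropy and the information gain of a candidate split in plain Python. The helper names are hypothetical, and the labels are synthetic (357 `B` and 212 `M`, assumed from the class counts documented for this dataset); scikit-learn's own implementation is more elaborate.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Synthetic labels (assumed class counts: 357 'B', 212 'M').
parent = ["B"] * 357 + ["M"] * 212
parent_entropy = entropy(parent)  # a little under 1 bit for this class mix

# A perfect split separates the classes completely,
# so the gain equals the parent entropy.
perfect_gain = information_gain(parent, ["B"] * 357, ["M"] * 212)

print(f"parent entropy: {parent_entropy:.4f}")
print(f"gain of a perfect split: {perfect_gain:.4f}")
```

With `criterion='entropy'`, the tree prefers splits with higher information gain, so a threshold that separates the classes more cleanly is chosen over one that leaves both children mixed.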


In [39]:
from sklearn.tree import DecisionTreeClassifier

# Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)

# Train the classifier on the training data
clf.fit(feature_train, label_train)


## Step 7: Export the Decision Tree to Graphviz Format
We export the trained decision tree to Graphviz's DOT format, which allows us to visualize the tree structure.


In [None]:
from sklearn.tree import export_graphviz

# Export the decision tree to DOT format
dot_data = export_graphviz(
    clf,
    out_file=None,  # Do not save to file
    feature_names=list(feature_train.columns),  # Feature names as a plain list
    class_names=list(clf.classes_),  # Class names ('B', 'M')
    filled=True,  # Fill nodes with colors
    rounded=True,  # Use rounded edges
    special_characters=True,  # Allow special characters
    max_depth=2  # Visualize tree up to depth 2, as shown in the PDF example
)
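
For readers new to DOT, it is a plain-text graph description language. The sketch below hand-builds DOT text for a toy two-leaf tree to show the general shape of such a file. The node labels and the helper `to_dot` are hypothetical and unrelated to the trained classifier, and the exact text produced by `export_graphviz` will differ.

```python
def to_dot(tree):
    """Serialize a nested-dict tree into Graphviz DOT text."""
    lines = ["digraph Tree {", "node [shape=box];"]
    next_id = 0

    def visit(node):
        nonlocal next_id
        node_id = next_id
        next_id += 1
        # Declare the node, then recurse into its children and link them.
        lines.append(f'{node_id} [label="{node["label"]}"];')
        for child in node.get("children", []):
            child_id = visit(child)
            lines.append(f"{node_id} -> {child_id};")
        return node_id

    visit(tree)
    lines.append("}")
    return "\n".join(lines)


# Hypothetical toy tree: one decision node with two leaves.
toy_tree = {
    "label": "feature_a <= 0.5",
    "children": [
        {"label": "class = B"},
        {"label": "class = M"},
    ],
}

dot_text = to_dot(toy_tree)
print(dot_text)
```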



## Step 8: Visualize the Decision Tree Using Graphviz

The DOT data is converted to an image using Graphviz. This visualization provides a graphical representation of the decision tree:
- Splits based on feature thresholds.
- The entropy at each node.
- Predicted class and sample counts at leaf nodes.

In [49]:
from graphviz import Source

# Render and display the decision tree
graph = Source(dot_data)
graph.render("decision_tree_graphviz", format="png", cleanup=True)  # Save as PNG (optional)
graph


ModuleNotFoundError: No module named 'graphviz'

In [99]:
%pip install pydotplus


Collecting pydotplus
  Downloading pydotplus-2.0.2.tar.gz (278 kB)
Successfully built pydotplus
Installing collected packages: pydotplus
Successfully installed pydotplus-2.0.2
Note: you may need to restart the kernel to use updated packages.


In [47]:
# Check that the Graphviz `dot` executable is installed and on PATH
!dot -V

In [51]:
%pip install graphviz

Collecting graphviz
  Downloading graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Downloading graphviz-0.20.3-py3-none-any.whl (47 kB)
Installing collected packages: graphviz
Successfully installed graphviz-0.20.3
Note: you may need to restart the kernel to use updated packages.
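
Rendering with `graphviz.Source` needs two separate things: the `graphviz` Python package and the Graphviz system binaries that provide the `dot` executable. Installing the Python package does not install the binaries. A minimal sketch for checking both from Python, using only the standard library (the helper name `graphviz_status` is hypothetical):

```python
import importlib.util
import shutil


def graphviz_status():
    """Report whether the Python package and the `dot` executable are available."""
    return {
        "python_package": importlib.util.find_spec("graphviz") is not None,
        "dot_executable": shutil.which("dot") is not None,
    }


status = graphviz_status()
for name, available in status.items():
    print(f"{name}: {'found' if available else 'missing'}")
```

If `python_package` is missing, the `%pip install graphviz` cell above should fix it after a kernel restart. If `dot_executable` is missing, Graphviz itself has to be installed through the operating system's package manager or installer and added to `PATH`.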
