# Breast Cancer Dataset Analysis

This notebook focuses on analyzing the **Breast Cancer Wisconsin (Diagnostic)** dataset. The tasks include:

1. Installing required libraries.
2. Fetching the dataset from the UCI Machine Learning Repository.
3. Exploring the dataset to understand its structure and properties.
4. Preparing the dataset by splitting it into training and testing sets using various proportions.
5. Visualizing class distributions to verify that the splits are stratified.

We will use the following tools and libraries:
- `ucimlrepo` to fetch the dataset.
- `pandas` for data manipulation.
- `matplotlib` and `seaborn` for data visualization.
- `scikit-learn` for splitting the dataset into training and testing subsets.

## Step 0: Install Necessary Libraries

Before running the notebook, make sure you have all the required Python libraries installed:
- **ucimlrepo**: Fetches datasets from the UCI Machine Learning Repository.
- **pandas**: Handles data manipulation and analysis.
- **matplotlib** and **seaborn**: For creating visualizations.
- **scikit-learn**: Provides tools for splitting datasets and building machine learning models.

Run the following code to install them if they are not already installed.


In [4]:
# Install required libraries (run this cell only if needed)
!pip install ucimlrepo pandas matplotlib seaborn scikit-learn

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
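
To see which libraries are missing before installing anything, a minimal sketch using only the standard library could look like the following. The helper name `missing_packages` is hypothetical and not part of the notebook. Note that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names for the libraries used in this notebook
# (scikit-learn is imported as `sklearn`).
REQUIRED_MODULES = ["ucimlrepo", "pandas", "matplotlib", "seaborn", "sklearn"]


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


print("Missing modules:", missing_packages(REQUIRED_MODULES))
```

An empty list means the install cell can be skipped.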


## Step 1: Import Libraries

We begin by importing the necessary libraries for data loading, manipulation, and visualization.

In [15]:
# Import libraries
from ucimlrepo import fetch_ucirepo  # Fetch dataset from UCI repository
import pandas as pd  # Data manipulation
import matplotlib.pyplot as plt  # Visualization
import seaborn as sns  # Advanced visualization
from sklearn.model_selection import train_test_split  # Train-test splitting

# Set plotting style for consistency
sns.set_style('whitegrid')


## Step 2: Fetch the Dataset

The **Breast Cancer Wisconsin (Diagnostic)** dataset is fetched from the UCI Machine Learning Repository using the `ucimlrepo` library. 


In [21]:
# Fetch the dataset using its unique ID
breast_cancer_data = fetch_ucirepo(id=17)

### Dataset metadata

In [23]:
# View metadata about the dataset
breast_cancer_data.metadata

{'uci_id': 17,
 'name': 'Breast Cancer Wisconsin (Diagnostic)',
 'repository_url': 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic',
 'data_url': 'https://archive.ics.uci.edu/static/public/17/data.csv',
 'abstract': 'Diagnostic Wisconsin Breast Cancer Database.',
 'area': 'Health and Medicine',
 'tasks': ['Classification'],
 'characteristics': ['Multivariate'],
 'num_instances': 569,
 'num_features': 30,
 'feature_types': ['Real'],
 'demographics': [],
 'target_col': ['Diagnosis'],
 'index_col': ['ID'],
 'has_missing_values': 'no',
 'missing_values_symbol': None,
 'year_of_dataset_creation': 1993,
 'last_updated': 'Fri Nov 03 2023',
 'dataset_doi': '10.24432/C5DW2B',
 'creators': ['William Wolberg',
  'Olvi Mangasarian',
  'Nick Street',
  'W. Street'],
 'intro_paper': {'ID': 230,
  'type': 'NATIVE',
  'title': 'Nuclear feature extraction for breast tumor diagnosis',
  'authors': 'W. Street, W. Wolberg, O. Mangasarian',
  'venue': 'Electronic imaging',
  'yea

### Variable Information

The variable information describes:
- The features (input variables) in the dataset.
- The target variable (`M` for malignant, `B` for benign).

In [26]:
# View feature and variable information
breast_cancer_data.variables


| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | ID | ID | Categorical | | | | no |
| 1 | Diagnosis | Target | Categorical | | | | no |
| 2 | radius1 | Feature | Continuous | | | | no |
| 3 | texture1 | Feature | Continuous | | | | no |
| 4 | perimeter1 | Feature | Continuous | | | | no |
| 5 | area1 | Feature | Continuous | | | | no |
| 6 | smoothness1 | Feature | Continuous | | | | no |
| 7 | compactness1 | Feature | Continuous | | | | no |
| 8 | concavity1 | Feature | Continuous | | | | no |
| 9 | concave_points1 | Feature | Continuous | | | | no |
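
The `role` column makes it easy to separate inputs from the target. The sketch below is an illustration only: it rebuilds the first four rows of the table by hand as a stand-in (`variables` here is a toy DataFrame, not the fetched object) and filters on `role`.

```python
import pandas as pd

# Toy stand-in for `breast_cancer_data.variables` (first four rows only)
variables = pd.DataFrame({
    "name": ["ID", "Diagnosis", "radius1", "texture1"],
    "role": ["ID", "Target", "Feature", "Feature"],
    "type": ["Categorical", "Categorical", "Continuous", "Continuous"],
})

# Select column names by their role
feature_names = variables.loc[variables["role"] == "Feature", "name"].tolist()
target_names = variables.loc[variables["role"] == "Target", "name"].tolist()

print("Features:", feature_names)
print("Target:", target_names)
```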


### Extract Features and Labels

After fetching and examining the dataset, we will extract:
- **Features**: The numerical columns describing tumor properties.
- **Labels**: The target column, which indicates whether the tumor is malignant (`M`) or benign (`B`).


In [30]:
# Extract features and labels
features = breast_cancer_data.data.features  # Features (30 columns)
labels = breast_cancer_data.data.targets     # Labels ('M' for malignant, 'B' for benign)


Preview first 5 rows of `features`

In [36]:
features.head()

Unnamed: 0,radius1,texture1,perimeter1,area1,smoothness1,compactness1,concavity1,concave_points1,symmetry1,fractal_dimension1,...,radius3,texture3,perimeter3,area3,smoothness3,compactness3,concavity3,concave_points3,symmetry3,fractal_dimension3
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Preview first 5 rows of `labels`

In [41]:
labels.head()

| | Diagnosis |
|---|---|
| 0 | M |
| 1 | M |
| 2 | M |
| 3 | M |
| 4 | M |
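
The first five labels are all malignant, so the preview says nothing about class balance. A quick way to inspect the balance is to count the values in the `Diagnosis` column. The sketch below uses a small hand-made stand-in (`toy_labels`) rather than the real data, and the numeric encoding at the end is an optional assumption, not something the notebook requires.

```python
import pandas as pd

# Toy stand-in for `labels`: a one-column DataFrame like the preview above
toy_labels = pd.DataFrame({"Diagnosis": ["M", "M", "B", "B", "B"]})

# Absolute and relative class frequencies
class_counts = toy_labels["Diagnosis"].value_counts()
class_shares = toy_labels["Diagnosis"].value_counts(normalize=True)

# Optional numeric encoding: 1 = malignant, 0 = benign
encoded = toy_labels["Diagnosis"].map({"M": 1, "B": 0})

print(class_counts)
print(class_shares)
```

Applied to the real data, the same calls would be made on `labels["Diagnosis"]`.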


## Step 3: Prepare the Dataset

We will split the dataset into training and testing subsets with the following proportions:
- **40% Training / 60% Testing**
- **60% Training / 40% Testing**
- **80% Training / 20% Testing**
- **90% Training / 10% Testing**

### Tasks:
1. Use the `train_test_split` function from `sklearn` to split the data.
2. Apply **stratified splitting** to ensure that the class distribution (malignant/benign) remains consistent across the splits.
3. Store the splits in a dictionary for easy access.
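
With 569 instances, the proportions above do not divide evenly, so it helps to know the row counts to expect. The sketch below assumes that a fractional `train_size` is rounded down and that the test set receives the remaining rows; treat that rounding rule as an assumption and confirm it against the actual split shapes. The helper `expected_split_sizes` is hypothetical.

```python
import math

N_SAMPLES = 569  # num_instances from the dataset metadata
train_sizes = [0.4, 0.6, 0.8, 0.9]


def expected_split_sizes(n_samples, train_fraction):
    """Assumed rule: floor the training count, give the rest to the test set."""
    n_train = math.floor(train_fraction * n_samples)
    return n_train, n_samples - n_train


for fraction in train_sizes:
    n_train, n_test = expected_split_sizes(N_SAMPLES, fraction)
    print(f"train_size={fraction}: {n_train} train / {n_test} test")
```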


In [60]:
# Import train_test_split to split the dataset into training and testing subsets
from sklearn.model_selection import train_test_split

# Define train-test proportions
train_sizes = [0.4, 0.6, 0.8, 0.9]

# Create a dictionary to store the splits
splits = {}


## Step 4: Perform Stratified Splitting

Using the defined train-test proportions, we will:
1. Split the dataset into `feature_train`, `feature_test`, `label_train`, and `label_test` subsets.
2. Ensure stratification so that the class balance is preserved in both training and testing sets.
3. Store the splits in a dictionary, where each key corresponds to the train-test proportion.


In [63]:
# Perform stratified splitting for each proportion
for train_size in train_sizes:
    feature_train, feature_test, label_train, label_test = train_test_split(
        features, labels, train_size=train_size, stratify=labels, random_state=42
    )
    # Store the splits in the dictionary; round() avoids float truncation in the key
    splits[f'{round(train_size * 100)}/{round((1 - train_size) * 100)}'] = {
        'feature_train': feature_train,
        'feature_test': feature_test,
        'label_train': label_train,
        'label_test': label_test
    }
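
One way to confirm that stratification worked is to compare class shares in the training and testing labels. The sketch below uses synthetic stand-ins (`toy_features`, `toy_labels`) so that it runs without the real dataset. For the notebook's data, the same comparison would be applied to `label_train` and `label_test` from each entry in `splits`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for `features` and `labels` (100 rows, 30% malignant)
toy_features = pd.DataFrame({"x": range(100)})
toy_labels = pd.DataFrame({"Diagnosis": ["M"] * 30 + ["B"] * 70})

x_train, x_test, y_train, y_test = train_test_split(
    toy_features,
    toy_labels,
    train_size=0.8,
    stratify=toy_labels["Diagnosis"],
    random_state=42,
)

# Relative class frequencies in each subset, side by side
train_share = y_train["Diagnosis"].value_counts(normalize=True)
test_share = y_test["Diagnosis"].value_counts(normalize=True)
comparison = pd.DataFrame({"train": train_share, "test": test_share})
print(comparison)
```

If the split is stratified, the two columns should be nearly identical.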

## Step 5: Confirm Splits

We will now verify that the splits were created correctly by:
1. Checking the structure of the dictionary where splits are stored.
2. Displaying a few rows from all the splits to confirm that the data is organized as expected.

In [66]:
# Display the keys of the splits dictionary to confirm the proportions
splits.keys()

dict_keys(['40/60', '60/40', '80/20', '90/10'])
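
These keys are built from floating-point fractions, so the way the percentage is converted to an integer matters. Truncating with `int()` happens to work for the four proportions used here, but it can be off by one for other values. The sketch below shows a rounding-based alternative; the helper name `split_label` is hypothetical.

```python
def split_label(train_fraction):
    """Build a 'train/test' percentage label such as '80/20'."""
    train_pct = round(train_fraction * 100)
    return f"{train_pct}/{100 - train_pct}"


key_labels = [split_label(f) for f in [0.4, 0.6, 0.8, 0.9]]
print(key_labels)  # ['40/60', '60/40', '80/20', '90/10']

# 0.29 * 100 is slightly below 29 in floating point, so int() truncates to 28
print(int(0.29 * 100), split_label(0.29))  # 28 29/71
```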

First 5 rows of the training features and labels for 80/20 split

In [69]:
# Access and display a sample from the 80/20 split
feature_train_80 = splits['80/20']['feature_train']
label_train_80 = splits['80/20']['label_train']

# Show the first 5 rows of the training features and labels for 80/20 split
feature_train_80.head(), label_train_80.head()

(     radius1  texture1  perimeter1   area1  smoothness1  compactness1  \
 10     16.02     23.24      102.70   797.8      0.08206       0.06669   
 170    12.32     12.39       78.85   464.1      0.10280       0.06981   
 407    12.85     21.37       82.63   514.5      0.07551       0.08316   
 430    14.90     22.53      102.10   685.0      0.09947       0.22250   
 27     18.61     20.25      122.10  1094.0      0.09440       0.10660   
 
      concavity1  concave_points1  symmetry1  fractal_dimension1  ...  radius3  \
 10      0.03299          0.03323     0.1528             0.05697  ...    19.19   
 170     0.03987          0.03700     0.1959             0.05955  ...    13.50   
 407     0.06126          0.01867     0.1580             0.06114  ...    14.40   
 430     0.27330          0.09711     0.2041             0.06898  ...    16.35   
 27      0.14900          0.07731     0.1697             0.05699  ...    21.31   
 
      texture3  perimeter3   area3  smoothness3  compactness