# Week 3 Assignment: Breast Cancer Wisconsin Dataset Analysis

## I 320D: Data Science for Biomedical Informatics | Spring 2026

### 📋 Assignment Version E

---

## 🎯 This Week's Mantra

> **"Every Column Tells a Story"**

In this assignment, you'll apply the 10-Point Data Inspection to a real-world medical imaging dataset focused on breast cancer diagnosis. By the end, you'll understand not just *what* the data contains, but *why* each variable matters for clinical decision-making.

---

## Learning Objectives

By completing this assignment, you will be able to:

1. ✅ Apply the systematic 10-Point Inspection to a new healthcare dataset
2. ✅ Identify and classify feature types (continuous, discrete, categorical, ordinal)
3. ✅ Detect and document data quality issues (missing values, unexpected values)
4. ✅ Research and document clinical meaning for healthcare variables
5. ✅ Create meaningful data groupings based on clinical standards
6. ✅ Formulate answerable research questions about cancer diagnosis factors

---

## About the Dataset

**Dataset:** Wisconsin Diagnostic Breast Cancer (WDBC)  
**Source:** UCI Machine Learning Repository / Kaggle  
**File:** `data.csv`  
**Target Variable:** `diagnosis` (M = Malignant, B = Benign)

### Clinical Context

Breast cancer is the most common cancer among women worldwide, affecting about 2.3 million women annually according to the World Health Organization (WHO). This dataset contains features computed from digitized images of fine needle aspirate (FNA) of breast masses. The features describe characteristics of the cell nuclei present in the image.

Understanding these variables is crucial for:

- Computer-aided diagnosis (CAD) systems
- Early detection of malignant tumors
- Reducing unnecessary biopsies
- Supporting clinical decision-making in oncology

### Feature Categories

The dataset contains **30 numeric features** organized into three measurement types for each of **10 characteristics**:

| Suffix | Meaning | Description |
|--------|---------|-------------|
| `_mean` | Mean | Average value across all nuclei in the image |
| `_se` | Standard Error | Variation in measurements |
| `_worst` | Worst/Largest | Mean of the three largest values |

The **10 cell nuclei characteristics** measured are:
- **radius** - mean distance from center to points on the perimeter
- **texture** - standard deviation of gray-scale values
- **perimeter** - boundary length of the nucleus
- **area** - area of the nucleus
- **smoothness** - local variation in radius lengths
- **compactness** - (perimeter² / area) - 1.0
- **concavity** - severity of concave portions of the contour
- **concave points** - number of concave portions of the contour
- **symmetry** - symmetry of the nucleus
- **fractal dimension** - "coastline approximation" - 1
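
The naming scheme described above (10 characteristics × 3 suffixes) can be written out as a short sketch. The variable names `characteristics`, `suffixes`, and `expected_columns` are illustrative and not part of the assignment; the spellings `concave points` and `fractal_dimension` follow the column names as they appear in `data.csv`.

```python
# The ten characteristics and three suffixes listed above,
# spelled the way the CSV column names spell them.
characteristics = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
suffixes = ["mean", "se", "worst"]

# Group by suffix first: all _mean columns, then _se, then _worst.
expected_columns = [
    f"{name}_{suffix}" for suffix in suffixes for name in characteristics
]

print(len(expected_columns))   # 30
print(expected_columns[:3])    # ['radius_mean', 'texture_mean', 'perimeter_mean']
```

Once the data is loaded, `set(expected_columns) - set(df.columns)` would list any feature column that is missing or misspelled.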

---

## Getting Started

First, load the dataset and import your libraries:

In [12]:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load the dataset
df = pd.read_csv('data.csv')

# Display first few rows to confirm it loaded
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


## Part 1: The 10-Point Data Inspection (40 points)

Complete each inspection step and document your findings.

### Step 1: Shape (4 points)

**Your Code:**

In [27]:
print(df.shape)

(569, 33)


**Your Findings:**
- How many rows (observations/patients)?: 569 rows
- How many columns (features)?: 33 columns
- What does each row represent in clinical terms?: Each row represents one patient's breast mass sampled by fine needle aspiration (FNA) biopsy. The row records the measured characteristics of the cell nuclei in the digitized image, which can be used to classify the mass as malignant or benign.

### Step 2: Column Names (4 points)

**Your Code:**

In [28]:
column_names = df.columns.tolist()
print("Column Names:")
print(column_names)

Column Names:
['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32']


**Your Findings:**
- List all column names: id, diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave points_worst, symmetry_worst, fractal_dimension_worst. There is also one extra column named `Unnamed: 32`.

- Do you notice any pattern in the column naming convention?: Yes, there is a clear pattern. The feature columns fall into three groups based on the statistic reported: mean, se (standard error), and worst. The same 10 characteristics (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension) are reported once for each statistic.

- Which columns might need further research to understand?: The columns that need further research are the fractal dimension columns, the concavity and concave points columns, and the texture columns. I'm not sure what fractal dimension refers to for tumor cells, or what difference concavity and concave points make in a tumor cell. I'd also like to better understand how the texture of a cell nucleus is quantified.

### Step 3: Data Types (4 points)

**Your Code:**

In [29]:
data_types = df.dtypes
print("\n Data Types:")
print(data_types)


 Data Types:
id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concav

**Your Findings:**
- Which columns are numeric (int64 or float64)?:
The `id` column is the only integer (int64) column. The 30 feature columns (the ten `_mean`, ten `_se`, and ten `_worst` columns) plus `Unnamed: 32` are float64, for a total of 31 float columns.

- Which columns are categorical (object/string)?:
The diagnosis column is the only categorical (object) column in the dataset, with M representing malignant and B representing benign tumors.

- Are there any data types that seem incorrect?:
The `Unnamed: 32` column is typed float64 even though it holds no values. pandas assigns float64 to a column that contains only NaN, so this looks like an artifact column rather than a real feature.
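
The dtype tally above can be checked programmatically rather than by reading the printout. The sketch below uses a small made-up frame (`toy`) with the same dtype mix as the real file; `summarize_dtypes` is a hypothetical helper name, not something from the course materials.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) mirroring the dtype mix in data.csv.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 11.42, 7.76],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

def summarize_dtypes(frame):
    """Return column names grouped by broad dtype family."""
    return {
        "integer": frame.select_dtypes(include=[np.integer]).columns.tolist(),
        "float": frame.select_dtypes(include=[np.floating]).columns.tolist(),
        "non_numeric": frame.select_dtypes(exclude=["number"]).columns.tolist(),
    }

summary = summarize_dtypes(toy)
for family, columns in summary.items():
    print(f"{family}: {len(columns)} column(s) -> {columns}")
```

Note that the all-NaN column lands in the float group, which is why `Unnamed: 32` shows up as float64 in the real dataset.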

### Step 4: First Look (4 points)

**Your Code:**

In [36]:
# Load the dataset
df = pd.read_csv('data.csv')

# first look at the data
print("Dataset Overview:")
print(f"\nTotal Records: {len(df):,}")
print(f"Total Features: {len(df.columns)}")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# See the first few rows
print("\n First 5 Records:")
print(df.head())

# Columns
print("\n Column Names and Data Types:")
print("-"*40)
for col in df.columns:
    print(f"  • {col}: {df[col].dtype}")

# Quick statistical summary
print("\n Statistical Summary:")
print(df.describe().round(2))

Dataset Overview:

Total Records: 569
Total Features: 33
Memory Usage: 0.17 MB

 First 5 Records:
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0   
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280      

**Your Findings:**


- What do the actual values look like?: Based on the first 5 records, the ID column holds large integers of 6 to 8 digits. In the diagnosis column, the single-letter code 'M' appears in all five rows shown. The measurement columns have very different scales: small decimals (smoothness_mean, compactness_mean, concavity_mean, fractal_dimension), medium values (radius_mean, texture_mean), large values (perimeter_mean, area_mean), and very large values (area_worst).

- Do you notice anything unusual or unexpected?: Yes, several things. The `Unnamed: 32` column is completely empty. The feature scales differ by orders of magnitude, so the values will need to be standardized before any scale-sensitive analysis. The ID values also vary widely in magnitude, with some 5-6 digits long and others 8-9.

- What are the possible values for the `diagnosis` column?: 'M' for malignant and 'B' for benign

### Step 5: Last Look (4 points)

**Your Code:**

In [38]:
#Seeing last few rows of data
print("\nLast 5 Records:")
print(df.tail())

#Checking for missing values
print("\nMissing values in last 10 rows:")
print(df.tail(10).isnull().sum())


Last 5 Records:
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
564  926424         M        21.56         22.39          142.00     1479.0   
565  926682         M        20.13         28.25          131.20     1261.0   
566  926954         M        16.60         28.08          108.30      858.1   
567  927241         M        20.60         29.33          140.10     1265.0   
568   92751         B         7.76         24.54           47.92      181.0   

     smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
564          0.11100           0.11590         0.24390              0.13890   
565          0.09780           0.10340         0.14400              0.09791   
566          0.08455           0.10230         0.09251              0.05302   
567          0.11780           0.27700         0.35140              0.15200   
568          0.05263           0.04362         0.00000              0.00000   

     ...  texture_worst  perimete

**Your Findings:**

- Does the data end cleanly?: Yes. All 5 of the last rows are complete and contain the expected columns, there is no evidence of truncation or partial rows, and no errors occur when accessing them. The final index of 568 confirms that the dataset has exactly 569 rows.

- Are the last rows consistent with the first rows?: Yes. The same data types appear in both, the value ranges are similar, the diagnosis values are all valid, there is no sign of data corruption, and the `Unnamed: 32` column is empty in both.

### Step 6: Memory Usage (4 points)

**Your Code:**

In [19]:
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

Memory Usage: 0.17 MB


**Your Findings:**
- How much memory does the dataset use?: Approximately 0.17 MB (about 174 KB).
- Is this a "small" or "large" dataset by data science standards?: This is a very small dataset by data science standards.


### Step 7: Missing Values (4 points)

**Your Code:**

In [20]:
print("Missing Values Check:")
print("-"*40)
missing = df.isnull().sum()
if missing.sum() == 0:
    print("No missing values found! This data is clean.")
else:
    print(missing[missing > 0])

Missing Values Check:
----------------------------------------
Unnamed: 32    569
dtype: int64


**Your Findings:**
- Which columns have missing values (according to `.isnull()`)?: Only one column, `Unnamed: 32`, has missing values according to `.isnull()`.

- What percentage of each column is missing?: 32 columns have 0% missing data, while `Unnamed: 32` has 100% missing data.

- ⚠️ **IMPORTANT:** Do you notice any columns that appear to be entirely empty or have suspicious patterns?: Yes, the `Unnamed: 32` column is 100% empty. It contains no useful information and only takes up space in the dataset.
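
The percentages in the findings above can be computed directly, since the mean of a boolean mask is the fraction of `True` values. This is a minimal sketch on a made-up frame (`toy`); the same expression should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): one complete column, one partly missing,
# and one entirely empty like "Unnamed: 32".
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "radius_mean": [17.99, np.nan, 11.42, 7.76],
    "Unnamed: 32": [np.nan] * 4,
})

# isnull() gives True/False per cell; the column mean is the missing fraction.
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(1))
```

Filtering with `missing_pct[missing_pct > 0]` would show only the affected columns.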


### Step 8: Duplicates (4 points)

**Your Code:**

In [35]:
duplicates = df.duplicated().sum()
print(f"\n Duplicate Rows: {duplicates:,}")
print(f" Unique Records: {len(df) - duplicates:,}")


 Duplicate Rows: 0
 Unique Records: 569


**Your Findings:**
- Are there any duplicate rows?: There were no duplicate rows found in the dataset.
- Are all patient IDs unique? Yes, all the patient IDs were found to be unique.
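
`df.duplicated()` only flags rows that match in every column, so it cannot by itself confirm that IDs are unique. The sketch below shows the difference on a made-up frame (`toy`) where one ID is repeated with different measurements.

```python
import pandas as pd

# Toy frame (made-up values): the second and third rows share an ID
# but differ elsewhere, so they are not full-row duplicates.
toy = pd.DataFrame({
    "id": [842302, 842517, 842517],
    "diagnosis": ["M", "M", "B"],
})

full_row_duplicates = toy.duplicated().sum()       # compares every column
ids_are_unique = toy["id"].is_unique               # compares the ID column only
repeated = toy[toy["id"].duplicated(keep=False)]   # every row with a reused ID

print(f"Full-row duplicates: {full_row_duplicates}")
print(f"IDs unique: {ids_are_unique}")
print(repeated)
```

On the real dataset, `df['id'].is_unique` returning `True` is what supports the second finding.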

### Step 9: Basic Statistics (4 points)

**Your Code:**

In [39]:
print("\n Statistical Summary:")
df.describe().round(2)


 Statistical Summary:


Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,0.0
mean,30371830.0,14.13,19.29,91.97,654.89,0.1,0.1,0.09,0.05,0.18,...,25.68,107.26,880.58,0.13,0.25,0.27,0.11,0.29,0.08,
std,125020600.0,3.52,4.3,24.3,351.91,0.01,0.05,0.08,0.04,0.03,...,6.15,33.6,569.36,0.02,0.16,0.21,0.07,0.06,0.02,
min,8670.0,6.98,9.71,43.79,143.5,0.05,0.02,0.0,0.0,0.11,...,12.02,50.41,185.2,0.07,0.03,0.0,0.0,0.16,0.06,
25%,869218.0,11.7,16.17,75.17,420.3,0.09,0.06,0.03,0.02,0.16,...,21.08,84.11,515.3,0.12,0.15,0.11,0.06,0.25,0.07,
50%,906024.0,13.37,18.84,86.24,551.1,0.1,0.09,0.06,0.03,0.18,...,25.41,97.66,686.5,0.13,0.21,0.23,0.1,0.28,0.08,
75%,8813129.0,15.78,21.8,104.1,782.7,0.11,0.13,0.13,0.07,0.2,...,29.72,125.4,1084.0,0.15,0.34,0.38,0.16,0.32,0.09,
max,911320500.0,28.11,39.28,188.5,2501.0,0.16,0.35,0.43,0.2,0.3,...,49.54,251.2,4254.0,0.22,1.06,1.25,0.29,0.66,0.21,


**Your Findings:**

- What is the radius_mean range in the dataset?: 6.98 to 28.11
- What is the range of area_mean values?: 143.50 to 2501.00
- What is the range of concavity_mean values?: 0.00 to 0.43
- Do any min/max values seem impossible or clinically unlikely?: No, all the values appear clinically reasonable.
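
Reading ranges off the `describe()` table works, but they can also be pulled out for just the columns of interest, together with a basic plausibility check. This is a sketch on a made-up frame (`toy`), so the numbers printed are illustrative and not results from the dataset.

```python
import pandas as pd

# Toy frame (made-up values) standing in for three of the feature columns.
toy = pd.DataFrame({
    "radius_mean": [17.99, 6.98, 28.11],
    "area_mean": [1001.0, 143.5, 2501.0],
    "concavity_mean": [0.30, 0.0, 0.43],
})

# One row per column with its min, max, and spread.
ranges = toy.agg(["min", "max"]).T
ranges["range"] = ranges["max"] - ranges["min"]
print(ranges)

# Physical measurements such as radius and area cannot be negative.
has_negative = (toy < 0).any()
print(has_negative)
```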

### Step 10: Unique Counts (4 points)

**Your Code:**

In [43]:
#print(f" Unique Records: {len(df) - duplicates:,}")


print("\n Unique Value Counts by Column:")

# Get unique counts for all columns
for col in df.columns:
    unique_count = df[col].nunique()
    print(f"  • {col:30s}: {unique_count:4d} unique values")

# Separate categorical vs continuous
print("\n Column Classification:")

print("\nLikely CATEGORICAL (few unique values):")
categorical_threshold = 10
for col in df.columns:
    unique_count = df[col].nunique()
    if unique_count <= categorical_threshold:
        print(f"  • {col}: {unique_count} unique values")

print("\nLikely CONTINUOUS (many unique values):")
for col in df.columns:
    unique_count = df[col].nunique()
    if unique_count > categorical_threshold:
        print(f"  • {col}: {unique_count} unique values")

# Check if IDs are unique
print("\n ID Column Check:")
print(f"Total rows: {len(df)}")
print(f"Unique IDs: {df['id'].nunique()}")
print(f"Do IDs match rows? {len(df) == df['id'].nunique()}")


 Unique Value Counts by Column:
  • id                            :  569 unique values
  • diagnosis                     :    2 unique values
  • radius_mean                   :  456 unique values
  • texture_mean                  :  479 unique values
  • perimeter_mean                :  522 unique values
  • area_mean                     :  539 unique values
  • smoothness_mean               :  474 unique values
  • compactness_mean              :  537 unique values
  • concavity_mean                :  537 unique values
  • concave points_mean           :  542 unique values
  • symmetry_mean                 :  432 unique values
  • fractal_dimension_mean        :  499 unique values
  • radius_se                     :  540 unique values
  • texture_se                    :  519 unique values
  • perimeter_se                  :  533 unique values
  • area_se                       :  528 unique values
  • smoothness_se                 :  547 unique value

**Your Findings:**

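- Which columns have very few unique values (likely categorical)?: Only the diagnosis column (M/B) is truly categorical. `Unnamed: 32` also falls under the threshold, but only because it is empty and has no values at all.
- Which columns have many unique values (likely continuous)?: The 30 feature columns (all of the `_mean`, `_se`, and `_worst` columns) are continuous, each with hundreds of unique values (for example, 432 for symmetry_mean and 547 for smoothness_se). The `id` column has 569 unique values, but it is an identifier rather than a continuous measurement.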
- Does the number of unique IDs match the number of rows?: Yes, 569 unique IDs match the 569 total rows, meaning that every patient/sample has a unique identifier and there are no duplicate patient records.

## Part 2: Data Dictionary (20 points)

Complete the following data dictionary for the **key columns**. For each column, you must:
1. **Research** the clinical meaning
2. **Identify** the feature type (Continuous, Discrete, Categorical-Nominal, Categorical-Ordinal, Binary, Identifier)
3. **Document** the valid values/range you observe
4. **Note** any issues or questions

| Column | Description | Feature Type | Valid Values/Range | Notes/Issues |
|--------|-------------|--------------|-------------------|--------------|
| `id` | Unique patient/sample identifier | Identifier | Integers from 8670 to 911320502 | Variable length, inconsistent formatting |
| `diagnosis` | Tumor diagnosis classifier | Binary categorical | M (malignant), B (benign) | Target variable |
| `radius_mean` | Mean distance from center to perimeter of cell nucleus | Continuous | 6.98 to 28.11 | Larger values typically associated with malignancy; important diagnostic feature |
| `texture_mean` | Mean standard deviation of gray-scale values in nucleus | Continuous | 9.71 to 39.28 | Measures surface roughness/irregularity; higher texture means more irregular (often malignant) |
| `perimeter_mean` | Mean perimeter of cell nucleus | Continuous | 43.79 to 188.50 | Highly correlated with radius; malignant tumors tend to have larger perimeters |
| `area_mean` | Mean area of cell nucleus | Continuous | 143.50 to 2501.00 | Highly correlated with radius and perimeter; malignant cells are typically larger |
| `smoothness_mean` | Mean local variation in radius lengths | Continuous | 0.05 to 0.16 | Lower values indicate smoother borders; higher values indicate irregular borders |
| `compactness_mean` | Mean compactness | Continuous | 0.02 to 0.35 | Measures how compact (circle-like) the nucleus outline is; lower values indicate a more regular shape and higher values indicate more irregularity |
| `concavity_mean` | Mean severity of concave portions of the nucleus contour | Continuous | 0.00 to 0.43 | Measures indentations in the cell boundary; smooth benign nuclei can score zero, while higher values indicate irregular, often malignant cells |
| `concave points_mean` | Mean number of concave portions of the contour | Continuous | 0.00 to 0.20 | Counts the indentation points; benign tumors may have zero, but malignant tumors typically have multiple concave points |
| `symmetry_mean` | Mean symmetry of the nucleus | Continuous | 0.11 to 0.30 | Measures the mirror symmetry of the nucleus, with lower values indicating more symmetry; cancer cells tend to be asymmetrical because of abnormal cell division |
| `fractal_dimension_mean` | Mean "coastline approximation" | Continuous | 0.05 to 0.10 | Measures the complexity of the nucleus boundary using fractal geometry; higher values indicate more complex, irregular borders |

### Clinical Research Questions for Version E

Answer these questions based on your research (you may need to use Google):

**1. What is nuclear pleomorphism and why is it important in cancer grading systems like the Nottingham grading system? How does cell symmetry relate to pleomorphism?**

Your answer: 
Nuclear pleomorphism refers to variation in the size, shape, and appearance of cell nuclei. In normal tissue, cells have uniform, regular nuclei, but cancer cells often show marked variation and irregularity. This matters for cancer grading because the Nottingham grading system evaluates three components: tubule formation, nuclear pleomorphism, and mitotic count. Nuclear pleomorphism is scored from 1 to 3, with 1 indicating small, uniform nuclei with regular outlines and 3 indicating marked variation with large, irregular nuclei. Higher scores correlate with more aggressive tumors.
Cell symmetry is inversely related to pleomorphism: pleomorphic nuclei tend to be asymmetric, while more symmetric nuclei suggest regular, benign cells. Symmetry can therefore serve as a quantitative measure of one aspect of nuclear pleomorphism.

---

**2. What is the biological basis for why cancer cells often show asymmetry? How do abnormal cell division and chromosomal instability contribute to irregular cell shapes?**

Your answer:
The biological basis for cancer cell asymmetry includes abnormal cell division, chromosomal instability, loss of nuclear envelope integrity, altered nuclear organization, and dysregulated growth signals. Errors in mitosis distribute chromosomes unequally between daughter cells, so nuclei end up with abnormal amounts of DNA and correspondingly irregular sizes and shapes. These molecular and cellular defects combine to produce the asymmetric, irregular nuclear shapes that are characteristic of malignancy and can be quantified by symmetry measurements.

---

**3. Explain what the symmetry measurement in this dataset captures mathematically. How is symmetry calculated from the cell nucleus contour?**

Your answer: The symmetry measurement quantifies the difference between a cell nucleus and its mirror image, capturing how far the nucleus deviates from perfect mirror symmetry. First, the nucleus boundary is traced to create a closed contour, giving a set of (x, y) coordinates that outline the nucleus. Next, the major axis is found by identifying the longest line that can be drawn through the nucleus; this serves as the axis of symmetry. The contour on one side of the axis is then compared with the contour on the other side by measuring the length difference between corresponding perpendicular lines drawn from the major axis to the contour on each side (or the standard deviation of those length differences). The result is placed on a scale where lower values indicate more symmetry and higher values indicate more asymmetry.
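
The procedure described in the answer above can be sketched in code. This is one possible reading of those steps, not the dataset authors' actual feature-extraction code: `symmetry_score` is a hypothetical helper, the normalization (total left/right difference divided by total width) is an assumption, and the two test outlines are synthetic shapes rather than real nuclei.

```python
import numpy as np

def symmetry_score(contour, n_samples=50):
    """Relative left/right difference across the longest chord.

    Hypothetical helper for illustration only. Assumes a roughly convex
    outline given as an (N, 2) array of x, y points. Returns 0 for a
    shape that is mirror-symmetric about its major axis.
    """
    pts = np.asarray(contour, dtype=float)

    # Major axis: the pair of contour points that are farthest apart.
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.hypot(diff[..., 0], diff[..., 1])
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    length = dist[i, j]
    axis = (pts[j] - pts[i]) / length
    normal = np.array([-axis[1], axis[0]])

    # Express every point as (position along the axis, signed offset from it).
    rel = pts - pts[i]
    along = rel @ axis
    offset = rel @ normal

    # Sample the perpendicular half-width on each side at the same positions.
    grid = np.linspace(0.0, length, n_samples + 2)[1:-1]

    def half_width(mask):
        order = np.argsort(along[mask])
        return np.interp(grid, along[mask][order], np.abs(offset[mask][order]))

    side_a = half_width(offset > 0)
    side_b = half_width(offset < 0)

    # Total length difference relative to total width.
    return np.sum(np.abs(side_a - side_b)) / np.sum(side_a + side_b)


# Synthetic outlines: a symmetric ellipse and a shape whose lower half is flatter.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
ellipse = np.column_stack([3.0 * np.cos(theta), 2.0 * np.sin(theta)])
lopsided = np.column_stack([
    3.0 * np.cos(theta),
    np.where(np.sin(theta) >= 0, 2.0, 1.0) * np.sin(theta),
])

ellipse_score = symmetry_score(ellipse)
lopsided_score = symmetry_score(lopsided)
print(f"ellipse:  {ellipse_score:.4f}")
print(f"lopsided: {lopsided_score:.4f}")
```

The symmetric ellipse scores essentially zero and the lopsided shape scores higher, matching the convention that larger `symmetry_mean` values indicate more asymmetry.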

---

**4. What is the Breast Imaging Reporting and Data System (BI-RADS)? How do shape and symmetry features factor into the BI-RADS classification of breast masses?**

Your answer: BI-RADS is a standardized classification system developed by the American College of Radiology for reporting breast imaging findings from mammography, ultrasound, and MRI. Shape and symmetry factor into BI-RADS through mass shape descriptors (oval/round versus irregular), mass margin descriptors (circumscribed, microlobulated, obscured, indistinct/ill-defined, spiculated), and symmetry considerations (bilateral symmetry, asymmetry, mass asymmetry).

---

## Part 3: Data Validation (15 points)

### 3.1 Diagnosis Distribution Validation (5 points)

Write code to check:
- How many patients have malignant (M) tumors?
- How many patients have benign (B) tumors?
- What is the percentage of each?

**Your Code:**

In [48]:
print("DIAGNOSIS DISTRIBUTION VALIDATION")

# Count of each diagnosis
print("\nDiagnosis Counts:")
diagnosis_counts = df['diagnosis'].value_counts()
print(diagnosis_counts)

# Get specific counts
malignant_count = df[df['diagnosis'] == 'M'].shape[0]
benign_count = df[df['diagnosis'] == 'B'].shape[0]
total_count = len(df)

print(f"\nMalignant (M) tumors: {malignant_count}")
print(f"Benign (B) tumors: {benign_count}")
print(f"Total samples: {total_count}")

# Calculate percentages
print("\nDiagnosis Percentages:")
diagnosis_percentages = df['diagnosis'].value_counts(normalize=True) * 100
print(diagnosis_percentages.round(2))

malignant_pct = (malignant_count / total_count) * 100
benign_pct = (benign_count / total_count) * 100

print(f"\nMalignant (M): {malignant_pct:.2f}%")
print(f"Benign (B): {benign_pct:.2f}%")


DIAGNOSIS DISTRIBUTION VALIDATION

Diagnosis Counts:
diagnosis
B    357
M    212
Name: count, dtype: int64

Malignant (M) tumors: 212
Benign (B) tumors: 357
Total samples: 569

Diagnosis Percentages:
diagnosis
B    62.74
M    37.26
Name: proportion, dtype: float64

Malignant (M): 37.26%
Benign (B): 62.74%


**Your Findings:**

- Is this dataset balanced or imbalanced between the two classes?: This dataset is moderately imbalanced but still usable. The class distribution is approximately 37% malignant and 63% benign, a 1.68:1 ratio of benign to malignant cases. While not perfectly balanced, this level of imbalance is generally considered acceptable.

- In the real world, what percentage of breast biopsies are malignant vs benign?: In real-world practice, the percentage of breast biopsies that are malignant varies with the screening context and patient population. Generally it ranges from 2% to 30% malignant (according to the American Cancer Society), so the dataset's 37% malignant share sits somewhat above that range. The dataset therefore over-represents malignant cases relative to typical clinical proportions, but it remains suitable for classification modeling.
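
The ratio quoted in the findings above follows directly from the two counts printed earlier (357 benign, 212 malignant). The sketch below works through that arithmetic and also shows the majority-class baseline, an addition here and not something the assignment asks for: it is the accuracy a model would get by always predicting benign.

```python
# Counts taken from the value_counts() output above.
benign, malignant = 357, 212
total = benign + malignant

ratio = benign / malignant            # benign cases per malignant case
majority_baseline = benign / total    # accuracy of always guessing "B"

print(f"Benign : malignant = {ratio:.2f}:1")            # 1.68:1
print(f"Majority-class baseline = {majority_baseline:.2%}")  # 62.74%
```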

---

### 3.2 Empty Column Validation (5 points)

Write code to examine all columns for any that might be completely empty or contain only null values.

**Your Code:**

In [55]:
print("EMPTY COLUMN VALIDATION")

# Check for completely empty columns
print("\nChecking for Empty Columns:")

# Method 1: Count null values per column
null_counts = df.isnull().sum()
total_rows = len(df)

# Find completely empty columns
empty_columns = null_counts[null_counts == total_rows]

if len(empty_columns) > 0:
    print(f"Found {len(empty_columns)} completely empty column(s):")
    for col in empty_columns.index:
        print(f"  • {col}: {empty_columns[col]} null values ({(empty_columns[col]/total_rows)*100:.1f}%)")
else:
    print("No completely empty columns found")

print("\nSUMMARY:")
completely_empty = sum(null_counts == total_rows)
partially_empty = sum((null_counts > 0) & (null_counts < total_rows))
completely_full = sum(null_counts == 0)

print(f"Completely empty columns: {completely_empty}")
print(f"Partially empty columns: {partially_empty}")
print(f"Completely full columns: {completely_full}")
print(f"Total columns: {len(df.columns)}")

EMPTY COLUMN VALIDATION

Checking for Empty Columns:
Found 1 completely empty column(s):
  • Unnamed: 32: 569 null values (100.0%)

SUMMARY:
Completely empty columns: 1
Partially empty columns: 0
Completely full columns: 32
Total columns: 33


**Your Findings:**

- Did you find any columns that are entirely empty?: Yes, the `Unnamed: 32` column contains 569 null values out of 569 total rows, meaning it is completely empty. The other 32 columns are completely full, with zero missing values.

- What should you do with such columns before analysis?: Completely empty columns should be immediately removed from the dataset before any analysis begins. 

- Why might an empty column exist in a dataset?: An empty column may exist in a dataset due to a CSV formatting issue, data collection errors, or export/conversion errors. In the case of this dataset, the most probable cause of this empty column is a CSV formatting issue. However, this can be easily fixed through data cleaning.
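
A common way a CSV formatting issue produces an empty column is a trailing comma at the end of every line. Whether `data.csv` actually has trailing commas is an assumption here, though it is consistent with the auto-generated name `Unnamed: 32`. The sketch below reproduces the effect on a tiny made-up CSV and then drops the empty column.

```python
import io
import pandas as pd

# Made-up CSV text: every line ends with a trailing comma,
# so each row has one extra, empty field.
raw = "id,diagnosis,radius_mean,\n1,M,17.99,\n2,B,7.76,\n"
toy = pd.read_csv(io.StringIO(raw))

# pandas names the blank header field after its position.
print(toy.columns.tolist())

# Drop any column in which every value is missing.
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

On the real dataset, `df.dropna(axis=1, how='all')` or `df.drop(columns=['Unnamed: 32'])` would remove the column before analysis.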

---

### 3.3 Feature Range Validation (5 points)

Write code to check if the "worst" measurements are always greater than or equal to the "mean" measurements for the same characteristic.

**Your Code:**

In [56]:
print("FEATURE RANGE VALIDATION")

features = ['radius', 'texture', 'perimeter', 'area', 'smoothness',
            'compactness', 'concavity', 'concave points', 'symmetry', 
            'fractal_dimension']

for feature in features:
    violations = (df[f'{feature}_worst'] < df[f'{feature}_mean']).sum()
    status = "PASS" if violations == 0 else f"FAIL ({violations} violations)"
    print(f"{feature}: {status}")

FEATURE RANGE VALIDATION
radius: PASS
texture: PASS
perimeter: PASS
area: PASS
smoothness: PASS
compactness: PASS
concavity: PASS
concave points: PASS
symmetry: PASS
fractal_dimension: PASS


**Your Findings:**

- Does `radius_worst` always >= `radius_mean`?: Yes, radius_worst is always greater than or equal to radius_mean across all 569 records with zero violations, confirmed by the PASS result

- Does this relationship hold for other features?: Yes, all 10 features pass with zero violations. Every single worst measurement is always greater than or equal to its corresponding mean measurement across all 569 records.

- What would it mean if this relationship was violated?: A violation would indicate a serious data quality problem. Since "worst" is defined as the mean of the three largest values, it cannot be smaller than the overall mean. A violation would therefore suggest a data entry error, data corruption during storage or transfer, or a calculation error during feature extraction. Any violations would undermine trust in the dataset and could cause machine learning models to learn incorrect patterns, potentially leading to dangerous misclassification of malignant tumors as benign.
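
The reasoning in the last finding (the mean of the three largest values can never fall below the mean of all values) can be checked numerically. The per-nucleus values below are randomly generated for illustration and do not come from the dataset, which stores only the summary statistics.

```python
import numpy as np

# Made-up per-nucleus measurements for a single image.
rng = np.random.default_rng(0)
values = rng.normal(loc=0.18, scale=0.03, size=40)

overall_mean = values.mean()
worst = np.sort(values)[-3:].mean()   # mean of the three largest values

print(f"mean  = {overall_mean:.4f}")
print(f"worst = {worst:.4f}")
print(f"worst >= mean: {worst >= overall_mean}")
```

The comparison holds for any input, because each of the three largest values is at least as large as every value left out of the top three.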

---

## Part 4: Create Cell Symmetry Groups (10 points)

Create a new column called `symmetry_category` that categorizes tumors into clinically-meaningful groups based on `symmetry_mean` (a measure of how symmetrical the cell nuclei are).

### Version E: Symmetry-Based Clinical Categories

Use these categories based on observed symmetry values (where higher values indicate more asymmetry):

| Symmetry Category | Symmetry Range | Clinical Rationale |
|-------------------|----------------|-------------------|
| Highly Symmetric | < 0.14 | Very regular cell shape, typical of benign cells |
| Symmetric | 0.14 - 0.17 | Normal symmetry range, most cells fall here |
| Mildly Asymmetric | 0.17 - 0.20 | Some irregularity, warrants closer examination |
| Asymmetric | 0.20 - 0.25 | Notable asymmetry, associated with abnormal growth |
| Highly Asymmetric | ≥ 0.25 | Significant asymmetry, strong indicator of malignancy |

In [71]:
# Create symmetry categories
def categorize_symmetry(value):
    if value < 0.14:
        return 'Highly Symmetric'
    elif value < 0.17:
        return 'Symmetric'
    elif value < 0.20:
        return 'Mildly Asymmetric'
    elif value < 0.25:
        return 'Asymmetric'
    else:
        return 'Highly Asymmetric'

print("Data:")
df['symmetry_category'] = df['symmetry_mean'].apply(categorize_symmetry)
print(df.shape)


# Count each category
print("\nCounts per category:")
print(df['symmetry_category'].value_counts())


# Malignancy rate per category
print("\nMalignancy rate per category:")
for category in ['Highly Symmetric', 'Symmetric', 'Mildly Asymmetric', 
                  'Asymmetric', 'Highly Asymmetric']:
    group = df[df['symmetry_category'] == category]
    malignant = (group['diagnosis'] == 'M').sum()
    total = len(group)
    pct = (malignant / total) * 100
    print(f"{category}: {pct:.1f}% malignant")

Data:
(569, 34)

Counts per category:
symmetry_category
Mildly Asymmetric    247
Symmetric            180
Asymmetric           103
Highly Symmetric      26
Highly Asymmetric     13
Name: count, dtype: int64

Malignancy rate per category:
Highly Symmetric: 3.8% malignant
Symmetric: 21.7% malignant
Mildly Asymmetric: 41.3% malignant
Asymmetric: 60.2% malignant
Highly Asymmetric: 61.5% malignant


### Analysis Questions:

**1. How many tumors are in each symmetry category?**

Your answer: Highly Symmetric has 26 tumors (4.6%), Symmetric has 180 tumors (31.6%), Mildly Asymmetric has 247 tumors (43.4%), Asymmetric has 103 tumors (18.1%), and Highly Asymmetric has only 13 tumors (2.3%). The largest share of tumors (43.4%) falls in the Mildly Asymmetric category, suggesting that mild irregularity is the most common finding in this clinical dataset. Very few tumors fall at the extremes, with only 4.6% being Highly Symmetric and just 2.3% being Highly Asymmetric.

---

**2. What is the malignancy rate (percentage) for each symmetry category?**

Your answer: The malignancy rates show a dramatic and consistent upward trend across categories. Highly Symmetric has only 3.8% malignancy, Symmetric has 21.7% malignancy, Mildly Asymmetric has 41.3% malignancy, Asymmetric has 60.2% malignancy, and Highly Asymmetric has 61.5% malignancy. The jump from Symmetric (21.7%) to Mildly Asymmetric (41.3%) is particularly notable, nearly doubling the malignancy rate.

---

**3. At what level of asymmetry does malignancy rate noticeably increase? Does cell symmetry appear to be a useful diagnostic feature?**

Your answer: The malignancy rate rises noticeably at the Mildly Asymmetric category (symmetry_mean >= 0.17), where it nearly doubles from 21.7% to 41.3%. It rises again to 60.2% in the Asymmetric category and then levels off at 61.5% for Highly Asymmetric. The nearly identical rates in those two categories suggest that once symmetry_mean crosses 0.20, additional asymmetry adds little further risk, although the Highly Asymmetric group contains only 13 tumors, so its rate is an imprecise estimate. Cell symmetry does appear to be a useful diagnostic feature, given the roughly sixteenfold difference in malignancy rates between the Highly Symmetric (3.8%) and Asymmetric (60.2%) categories. However, it cannot serve as a standalone diagnostic tool: 58.7% of Mildly Asymmetric tumors are benign, and even in the two most asymmetric categories nearly 40% of tumors are benign.
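
The benign share in each category is the complement of its malignancy rate, and the fold difference is a simple ratio. The sketch below recomputes both from the rates printed in the output above; the variable names are illustrative only.

```python
# Malignancy rates (%) copied from the "Malignancy rate per category" output above.
rates = {
    "Highly Symmetric": 3.8,
    "Symmetric": 21.7,
    "Mildly Asymmetric": 41.3,
    "Asymmetric": 60.2,
    "Highly Asymmetric": 61.5,
}

# Benign share is whatever is left after removing the malignant share.
benign_share = {cat: round(100 - pct, 1) for cat, pct in rates.items()}
for cat, pct in benign_share.items():
    print(f"{cat}: {pct:.1f}% benign")

# Ratio of malignancy rates between two categories.
fold = rates["Asymmetric"] / rates["Highly Symmetric"]
print(f"Asymmetric vs Highly Symmetric: {fold:.1f}x")
```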

---

**4. Why might highly asymmetric cells be more likely to be malignant? (Think about how cancer cells divide and grow differently from normal cells.)**

Your answer: Highly asymmetric cells are more likely to be malignant because cancer cells divide and grow in fundamentally different ways compared to normal cells. Normal cells undergo tightly regulated mitosis, producing daughter cells with consistent, symmetric nuclear shapes. Cancer cells have lost these regulatory mechanisms due to genetic mutations, leading to chaotic and uncontrolled division that produces irregular, asymmetric nuclei. Chromosomal instability causes uneven distribution of genetic material during division, resulting in nuclei with varying shapes and sizes. Cancer cells also frequently undergo multipolar mitosis, where three or more spindle poles form instead of the normal two, creating highly unequal cell division and contributing to nuclear asymmetry. Mutations in genes controlling nuclear architecture physically distort the nucleus shape further. The more asymmetric a cell nucleus appears, the more likely it has undergone these chaotic processes that are characteristic of malignancy, making symmetry_mean a biologically meaningful and clinically relevant diagnostic feature.

---

## Part 5: Research Questions (15 points)

### 5.1 Write Three Answerable Questions (9 points)

Write three questions that THIS dataset can answer. Remember: the data can show relationships and patterns, but cannot prove causation.

**Your questions must explore these specific areas:**

1. **A question about symmetry and compactness together:**
Do malignant tumors show higher values in both symmetry_mean and compactness_mean compared to benign tumors, and is the combination of high asymmetry and high compactness a stronger indicator of malignancy than either measurement alone?

---

2. **A question comparing symmetry_mean vs symmetry_worst:**
How much does cell nucleus symmetry worsen from the average measurement (symmetry_mean) to the worst measurement (symmetry_worst) in malignant tumors compared to benign tumors, and does the difference between these two measurements provide additional diagnostic information beyond either measurement alone?

---

3. **A question about the relationship between symmetry and area:**
Is there a relationship between nuclear size (area_mean) and cell nucleus asymmetry (symmetry_mean), and do malignant tumors with larger nuclei tend to show greater asymmetry than malignant tumors with smaller nuclei?
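
As one possible starting point for question 2, the gap between the two symmetry measurements can be stored as a new column and averaged by diagnosis. This is a sketch on four hypothetical rows; `symmetry_gap` is an illustrative column name, and a real analysis would run the same two lines on the full `df`.

```python
import pandas as pd

# Hypothetical rows for illustration only (not taken from data.csv).
toy = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B"],
    "symmetry_mean": [0.24, 0.20, 0.17, 0.16],
    "symmetry_worst": [0.46, 0.36, 0.27, 0.24],
})

# How much larger the worst measurement is than the average measurement.
toy["symmetry_gap"] = toy["symmetry_worst"] - toy["symmetry_mean"]

gap_by_diagnosis = toy.groupby("diagnosis")["symmetry_gap"].mean()
print(gap_by_diagnosis.round(3))
```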

---

### 5.2 Identify One Question the Data CANNOT Answer (3 points)

Write one question about **tumor location or breast density** that this dataset cannot answer, and explain why.

**Question:**
Does tumor location within the breast (such as upper outer quadrant, lower inner quadrant, or retroareolar region) influence the degree of cell nucleus asymmetry, and are tumors found in areas of higher breast density more likely to show greater asymmetry and malignancy?


**Why it cannot be answered with this data:**
This question cannot be answered with the Wisconsin Breast Cancer Dataset because the dataset contains no information about tumor location, breast quadrant, or breast density classifications. All 30 features are purely computational measurements derived from microscopy images of cell nuclei obtained through Fine Needle Aspiration biopsy, meaning the data only describes what cells look like at a microscopic level with no anatomical, positional, or imaging information included. Answering this question would require linking patient records to radiology reports and mammography findings, none of which are present in this dataset.

---

### 5.3 Grouping Analysis (3 points)

Answer this question using a groupby analysis:

**"What is the average symmetry_mean for each diagnosis category (M vs B)?"**

In [61]:
print("AVERAGE SYMMETRY BY DIAGNOSIS:")
symmetry_by_diagnosis = df.groupby('diagnosis')['symmetry_mean'].mean()
for diagnosis, avg in symmetry_by_diagnosis.items():
    label = "Malignant" if diagnosis == "M" else "Benign"
    print(f"{label} ({diagnosis}): {avg:.4f}")

print(f"\nDifference: {symmetry_by_diagnosis['M'] - symmetry_by_diagnosis['B']:.4f}")

AVERAGE SYMMETRY BY DIAGNOSIS:
Benign (B): 0.1742
Malignant (M): 0.1929

Difference: 0.0187


**Your Interpretation:**

How does symmetry differ between malignant and benign tumors? What does this suggest about the shape characteristics of cancer cells?:

Malignant tumors have a higher average symmetry_mean (0.1929) than benign tumors (0.1742), a difference of 0.0187. The gap is small in absolute terms, roughly 11% of the benign average, but it points in the same direction as the category analysis in Part 4: on average, malignant tumors have more asymmetric nuclei than benign ones. This suggests that cancer cells tend to have more irregular, uneven nuclear shapes compared to the rounder, more uniform nuclei typical of benign cells. The irregularity is consistent with the underlying biology of cancer, where uncontrolled cell division and chromosomal instability produce nuclei that lack the orderly symmetry seen in normal healthy tissue.
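
A difference in group means says nothing about how much the two groups overlap. One way to look at this is to report the spread and group size alongside the mean. The sketch below uses six hypothetical rows rather than the real data.

```python
import pandas as pd

# Hypothetical values for illustration only (not taken from data.csv).
toy = pd.DataFrame({
    "diagnosis": ["M", "M", "M", "B", "B", "B"],
    "symmetry_mean": [0.24, 0.19, 0.17, 0.18, 0.17, 0.16],
})

# Mean, standard deviation, and group size in one table.
summary = toy.groupby("diagnosis")["symmetry_mean"].agg(["mean", "std", "count"])
print(summary.round(4))
```

In this toy example the malignant mean is higher, yet the lowest malignant value (0.17) sits below the highest benign value (0.18), which is the kind of overlap a mean alone would hide.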


---

## Part 6: Target Variable Analysis (Bonus - 5 points)

The `diagnosis` column is our **target variable** - what we're trying to predict. Analyze its relationship with key features.

In [62]:
#Distribution of diagnosis
print("Diagnosis Distribution:")
diagnosis_counts = df['diagnosis'].value_counts()
for diagnosis, count in diagnosis_counts.items():
    pct = (count / len(df)) * 100
    print(f"{diagnosis}: {count} ({pct:.2f}%)")

# Key features grouped by diagnosis
print("\nKey Feature Averages by Diagnosis:")
key_features = ['radius_mean', 'area_mean', 'concavity_mean', 
                'symmetry_mean', 'compactness_mean']

print(df.groupby('diagnosis')[key_features].mean().round(4))

Diagnosis Distribution:
B: 357 (62.74%)
M: 212 (37.26%)

Key Feature Averages by Diagnosis:
           radius_mean  area_mean  concavity_mean  symmetry_mean  \
diagnosis                                                          
B              12.1465   462.7902          0.0461         0.1742   
M              17.4628   978.3764          0.1608         0.1929   

           compactness_mean  
diagnosis                    
B                    0.0801  
M                    0.1452  


### Bonus Questions:

**1. What percentage of patients in this dataset have malignant tumors?**

Your answer: 37.26% of patients (212 out of 569) have malignant tumors, with the remaining 62.74% (357 patients) being benign, representing a moderate class imbalance of roughly 1.68:1.

---

**2. Which feature shows the largest difference between malignant and benign tumors?**

Your answer: Among the five features compared, concavity_mean shows the largest relative difference: it is roughly 249% higher in malignant cases (0.1608 vs 0.0461), which suggests it separates the two classes well. area_mean shows the largest absolute difference, at 515.59 units higher in malignant cases (978.38 vs 462.79).
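
The absolute and relative differences can be recomputed from the group means printed in the output above. This sketch copies those rounded means into a small table, so the percentages are approximate; `abs_diff` and `pct_diff` are illustrative column names.

```python
import pandas as pd

# Group means copied from the "Key Feature Averages by Diagnosis" output above.
means = pd.DataFrame(
    {
        "B": [12.1465, 462.7902, 0.0461, 0.1742, 0.0801],
        "M": [17.4628, 978.3764, 0.1608, 0.1929, 0.1452],
    },
    index=["radius_mean", "area_mean", "concavity_mean",
           "symmetry_mean", "compactness_mean"],
)

# Absolute gap, and the gap as a percentage of the benign mean.
means["abs_diff"] = means["M"] - means["B"]
means["pct_diff"] = 100 * means["abs_diff"] / means["B"]

print(means.sort_values("pct_diff", ascending=False).round(2))
```

Relative difference favors features with small benign means, while absolute difference favors features measured on large scales, so the two rankings need not agree.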

---

**3. Why does class imbalance matter for machine learning classification? (You may need to research this)**

Your answer: Class imbalance matters because algorithms naturally favor the majority class, meaning a model could achieve 62.74% accuracy by simply predicting every tumor as benign without learning anything meaningful. This is especially dangerous in medical diagnosis where missing a malignant tumor is far more costly than a false positive, so metrics like sensitivity, F1-score, and AUC should be used instead of simple accuracy, and techniques like SMOTE or class weighting should be applied during model training.
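
The accuracy figure above can be reproduced with a few lines of arithmetic. This sketch assumes a hypothetical "model" that labels every tumor benign and uses the class counts from the Diagnosis Distribution output.

```python
# Class counts copied from the "Diagnosis Distribution" output above.
n_benign, n_malignant = 357, 212
total = n_benign + n_malignant

# Hypothetical model that predicts "benign" for every tumor.
true_negatives = n_benign       # every benign case is labeled correctly
false_negatives = n_malignant   # every malignant case is missed
true_positives = 0

accuracy = (true_positives + true_negatives) / total
sensitivity = true_positives / (true_positives + false_negatives)

print(f"accuracy={accuracy:.4f}, sensitivity={sensitivity:.4f}")
```

The accuracy looks respectable while sensitivity is zero, which is why accuracy alone is a poor summary for this task.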

---

**4. If you were building a diagnostic model, which 3 features would you prioritize based on your analysis? Justify your choices.**

Your answer: The three features I would prioritize are concavity_mean (roughly 249% higher in malignant tumors, the largest relative difference among the features compared), area_mean (111.4% higher in malignant tumors, a direct measure of nuclear size), and symmetry_mean (a consistent relationship with malignancy throughout this analysis). Together these three features cover three distinct aspects of nuclear morphology (contour irregularity, size, and shape symmetry), providing complementary information that should yield stronger predictive performance than any single feature alone.

---

## Submission Checklist

Before submitting, verify you have completed:

- [ ] **Part 1:** All 10 inspection steps with code AND written findings
- [ ] **Part 2:** Complete data dictionary with 12 key columns filled in
- [ ] **Part 2:** Answered all 4 clinical research questions
- [ ] **Part 3:** All 3 validation checks with code and answers
- [ ] **Part 4:** Created `symmetry_category` column using **Symmetry-Based Clinical Categories**
- [ ] **Part 4:** Calculated malignancy rate by symmetry category with interpretation
- [ ] **Part 5:** Three research questions (symmetry+compactness, symmetry mean vs worst, symmetry+area)
- [ ] **Part 5:** One unanswerable question about tumor location/breast density
- [ ] **Part 5:** symmetry_mean by diagnosis groupby analysis
- [ ] **Bonus (Optional):** Target variable analysis

---

## Grading Rubric

| Component | Points | Requirements for Full Credit |
|-----------|--------|------------------------------|
| Part 1: 10-Point Inspection | 40 | All 10 steps complete with working code AND thoughtful written analysis |
| Part 2: Data Dictionary | 20 | All 12 columns documented with correct feature types and clinical research |
| Part 3: Data Validation | 15 | All validation checks complete with working code and insightful answers |
| Part 4: Symmetry Groups | 10 | Working code that creates correct groups AND meaningful interpretation |
| Part 5: Research Questions | 15 | Three good questions in specified areas, one clear limitation, groupby analysis complete |
| **Bonus:** Target Analysis | +5 | Thoughtful analysis with real-world connection |
| **Total** | 100 (+5 bonus) | |

---

## Hints (Read Before You Get Stuck!)

### ⚠️ Common Pitfalls:

1. **One column appears to be entirely empty** (all NaN values)
   - Check the last column carefully
   - This often happens with CSV exports that have trailing commas
   - You should drop this column before analysis

2. **The diagnosis column uses single letters** - "M" and "B"
   - Don't forget what these stand for when interpreting results
   - You may need to convert to 0/1 for some calculations

3. **Symmetry values are relatively small** - typically between 0.1 and 0.3
   - Pay attention to the decimal places when creating categories
   - Make sure your bin edges are precise

4. **Continuous features** - most features in this dataset are continuous
   - Think carefully about appropriate grouping strategies
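
For pitfall 2, one common way to convert the single-letter labels to 0/1 is `Series.map`. The sketch below uses five hypothetical rows; `is_malignant` is an illustrative column name.

```python
import pandas as pd

# Hypothetical labels for illustration only.
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "B"]})

# Map the letters to numbers: 1 = malignant, 0 = benign.
toy["is_malignant"] = toy["diagnosis"].map({"B": 0, "M": 1})

# With a 0/1 column, the mean is the malignancy rate.
malignancy_rate = toy["is_malignant"].mean()
print(f"{malignancy_rate:.1%} malignant")
```

Any label missing from the mapping becomes NaN, which makes unexpected values in the column easy to spot.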

### 💡 Pro Tips:

- Use `value_counts()` liberally to understand categorical columns
- Use `value_counts(dropna=False)` to see if there are any null values
- When using `pd.cut()` with custom bins, include `float('-inf')` or `float('inf')` to catch all values
- The `describe()` method works best with numeric columns
- For comparing groups, `groupby().mean()` is your friend
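
As a sketch of the `pd.cut()` tip, the Part 4 symmetry categories could be built with open-ended bins. The sample values below are hypothetical and chosen to sit on or near the bin edges. Setting `right=False` is an assumption that each range includes its lower bound and excludes its upper bound, for example 0.17 counts as Mildly Asymmetric.

```python
import pandas as pd

# Hypothetical symmetry values on or near the bin edges.
values = pd.Series([0.10, 0.14, 0.1699, 0.17, 0.20, 0.25, 0.30])

# Infinite outer edges catch every value, however small or large.
bins = [float("-inf"), 0.14, 0.17, 0.20, 0.25, float("inf")]
labels = ["Highly Symmetric", "Symmetric", "Mildly Asymmetric",
          "Asymmetric", "Highly Asymmetric"]

# right=False makes each bin closed on the left: [0.14, 0.17), [0.17, 0.20), ...
categories = pd.cut(values, bins=bins, labels=labels, right=False)
print(pd.DataFrame({"value": values, "category": categories}))
```

With the default `right=True`, a value of exactly 0.17 would fall in the lower bin instead, so it is worth checking edge values against the intended category table.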

---

## Useful Resources

- **UCI ML Repository - Original Dataset:** https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
- **Kaggle Dataset Page:** https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
- **American Cancer Society - Breast Cancer:** https://www.cancer.org/cancer/breast-cancer.html
- **Nottingham Grading System:** https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3714112/
- **Pandas Documentation:** https://pandas.pydata.org/docs/

---

*Remember: "Every Column Tells a Story" - your job is to figure out what that story is!*

---

**Due Date:** [See Canvas]

**Submission:** Upload your completed Jupyter notebook (.ipynb) to Canvas