# Week 3 Assignment: Breast Cancer Wisconsin Dataset Analysis

## I 320D: Data Science for Biomedical Informatics | Spring 2026

### 📋 Assignment Version E

---

## 🎯 This Week's Mantra

> **"Every Column Tells a Story"**

In this assignment, you'll apply the 10-Point Data Inspection to a real-world medical imaging dataset focused on breast cancer diagnosis. By the end, you'll understand not just *what* the data contains, but *why* each variable matters for clinical decision-making.

---

## Learning Objectives

By completing this assignment, you will be able to:

1. ✅ Apply the systematic 10-Point Inspection to a new healthcare dataset
2. ✅ Identify and classify feature types (continuous, discrete, categorical, ordinal)
3. ✅ Detect and document data quality issues (missing values, unexpected values)
4. ✅ Research and document clinical meaning for healthcare variables
5. ✅ Create meaningful data groupings based on clinical standards
6. ✅ Formulate answerable research questions about cancer diagnosis factors

---

## About the Dataset

**Dataset:** Wisconsin Diagnostic Breast Cancer (WDBC)  
**Source:** UCI Machine Learning Repository / Kaggle  
**File:** `data.csv`  
**Target Variable:** `diagnosis` (M = Malignant, B = Benign)

### Clinical Context

Breast cancer is the most common cancer among women worldwide, affecting about 2.3 million women annually according to the World Health Organization (WHO). This dataset contains features computed from digitized images of fine needle aspirate (FNA) of breast masses. The features describe characteristics of the cell nuclei present in the image.

Understanding these variables is crucial for:

- Computer-aided diagnosis (CAD) systems
- Early detection of malignant tumors
- Reducing unnecessary biopsies
- Supporting clinical decision-making in oncology

### Feature Categories

The dataset contains **30 numeric features** organized into three measurement types for each of **10 characteristics**:

| Suffix | Meaning | Description |
|--------|---------|-------------|
| `_mean` | Mean | Average value across all nuclei in the image |
| `_se` | Standard Error | Standard error of the value across nuclei in the image |
| `_worst` | Worst/Largest | Mean of the three largest values |

The **10 cell nuclei characteristics** measured are:
- **radius** - mean distance from center to points on the perimeter
- **texture** - standard deviation of gray-scale values
- **perimeter** - boundary length of the nucleus
- **area** - area of the nucleus
- **smoothness** - local variation in radius lengths
- **compactness** - (perimeter² / area) - 1.0
- **concavity** - severity of concave portions of the contour
- **concave points** - number of concave portions of the contour
- **symmetry** - symmetry of the nucleus
- **fractal dimension** - "coastline approximation" - 1
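The naming scheme above (10 characteristics × 3 suffixes) can be sketched in a few lines. This is only an illustration: the variable names `characteristics`, `suffixes`, and `feature_columns` are made up for this example, and the spellings follow the CSV header shown later in this notebook (note the space in `concave points` and the underscore in `fractal_dimension`).

```python
# The ten nucleus characteristics, spelled as they appear in the CSV header.
# "concave points" contains a space; "fractal_dimension" uses an underscore.
characteristics = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
suffixes = ["_mean", "_se", "_worst"]

# Build the names grouped by suffix, matching the column order in the file
feature_columns = [name + suffix for suffix in suffixes for name in characteristics]

print(len(feature_columns))   # number of feature columns
print(feature_columns[:3])    # first few names
```

A list like this is handy later for selecting only the feature columns, for example `df[feature_columns]`, without touching `id`, `diagnosis`, or any stray columns.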

---

In [12]:
## Getting Started

# First, load the dataset and import your libraries:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load the dataset
df = pd.read_csv('data.csv')

# Display first few rows to confirm it loaded
df.head()


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


## Part 1: The 10-Point Data Inspection (40 points)

Complete each inspection step and document your findings.

### Step 1: Shape (4 points)

**Your Code:**

In [27]:
print(df.shape)

(569, 33)


**Your Findings:**
- How many rows (observations/patients)?: 569 rows
- How many columns (features)?: 33 columns
- What does each row represent in clinical terms?: Each row represents one patient's breast mass, sampled by fine needle aspiration (FNA). The row records the characteristics of the cell nuclei in that sample, which can be used to classify the mass as malignant or benign.

### Step 2: Column Names (4 points)

**Your Code:**

In [28]:
column_names = df.columns.tolist()
print("Column Names:")
print(column_names)

Column Names:
['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32']


**Your Findings:**
- List all column names: id, diagnosis, radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave points_worst, symmetry_worst, fractal_dimension_worst, and Unnamed: 32. The last one is a single column with no header; pandas labels it `Unnamed: 32` because it sits at column position 32.

- Do you notice any pattern in the column naming convention?: Yes, there is a clear pattern in the column naming convention. The feature columns fall into three groups based on the statistic reported: mean, se (standard error), and worst. The same 10 characteristics (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension) are reported for each of these three statistics.

- Which columns might need further research to understand?: I feel that the columns that might need further research are the fractal dimension columns, the concavity and concave points columns, and the texture columns. I'm not sure what fractal dimension refers to for tumor cell nuclei, or how concavity differs from concave points. I'd also like to better understand how the texture of a cell nucleus is quantified.

### Step 3: Data Types (4 points)

**Your Code:**

In [29]:
data_types = df.dtypes
print("\n Data Types:")
print(data_types)


 Data Types:
id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concav

**Your Findings:**
- Which columns are numeric (int64 or float64)?:
The id column is the only integer (int64) column. The 30 feature columns (radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave points_mean, symmetry_mean, fractal_dimension_mean, plus all the _se and _worst columns) are float64. The Unnamed: 32 column is also float64, which brings the total to 31 float64 columns and 32 numeric columns overall.

- Which columns are categorical (object/string)?:
The diagnosis column is the only categorical (object) column in the dataset, with M representing malignant and B representing benign tumors.

- Are there any data types that seem incorrect?:
The Unnamed: 32 column is typed as float64 even though it holds no values. pandas assigns float64 to a column that is entirely NaN, so this looks like an artifact column rather than real data.

### Step 4: First Look (4 points)

**Your Code:**

In [34]:
# Load the dataset
df = pd.read_csv('data.csv')

# first look at the data
print("Dataset Overview:")
print(f"\nTotal Records: {len(df):,}")
print(f"Total Features: {len(df.columns)}")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# See the first few rows
print("\n First 5 Records:")
print(df.head())

# Columns
print("\n Column Names and Data Types:")
print("-"*40)
for col in df.columns:
    print(f"  • {col}: {df[col].dtype}")

# Quick statistical summary
print("\n Statistical Summary:")
print(df.describe().round(2))

Dataset Overview:

Total Records: 569
Total Features: 33
Memory Usage: 0.17 MB

 First 5 Records:
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0   
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280      

**Your Findings:**
- What do the actual values look like?: 

- Do you notice anything unusual or unexpected?

- What are the possible values for the `diagnosis` column?

---

### Step 5: Last Look (4 points)

**Your Code:**

In [18]:
#Seeing last few rows of data
print("\nLast 5 Records:")
print(df.tail())

#Checking for missing values
print("\nMissing values in last 10 rows:")
print(df.tail(10).isnull().sum())

**Your Findings:**
- Does the data end cleanly?

- Are the last rows consistent with the first rows?

---

### Step 6: Memory Usage (4 points)

**Your Code:**

In [19]:
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

Memory Usage: 0.17 MB


**Your Findings:**
- How much memory does the dataset use?: Approximately 0.17 MB (about 174 KB)
- Is this a "small" or "large" dataset by data science standards?: This would be considered a very "small" dataset by data science standards. 


### Step 7: Missing Values (4 points)

**Your Code:**

In [20]:
print("Missing Values Check:")
print("-"*40)
missing = df.isnull().sum()
if missing.sum() == 0:
    print("No missing values found! This data is clean.")
else:
    print(missing[missing > 0])

Missing Values Check:
----------------------------------------
Unnamed: 32    569
dtype: int64


**Your Findings:**
- Which columns have missing values (according to `.isnull()`)?: Only one column has missing values according to `.isnull()`: `Unnamed: 32`.

- What percentage of each column is missing?: 32 columns have 0% missing data, while the `Unnamed: 32` column has 100% missing data.

- ⚠️ **IMPORTANT:** Do you notice any columns that appear to be entirely empty or have suspicious patterns?: Yes, the `Unnamed: 32` column is 100% empty. It contains no useful information and only takes up space in the dataset.
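The per-column missing percentage asked for above can be computed directly rather than read off by eye. The sketch below uses a small made-up DataFrame (`toy`, with illustrative values only) instead of the real `data.csv`, so the numbers it prints are not results from the assignment data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.99, 12.30, np.nan, 20.10],
    "Unnamed: 32": [np.nan] * 4,
})

# isnull() gives booleans; the mean of booleans is the fraction that are True
missing_pct = (toy.isnull().mean() * 100).round(1)

# Show only the columns that have at least one missing value
print(missing_pct[missing_pct > 0])
```

On the real data, the same two lines applied to `df` would report the percentage for every column at once.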


### Step 8: Duplicates (4 points)

**Your Code:**

In [21]:
duplicates = df.duplicated().sum()
print(f"\n🔄 Duplicate Rows: {duplicates:,}")
print(f"📊 Unique Records: {len(df) - duplicates:,}")


🔄 Duplicate Rows: 0
📊 Unique Records: 569


**Your Findings:**
- Are there any duplicate rows?: There were no duplicate rows found in the dataset.
- Are all patient IDs unique? Yes, all the patient IDs were found to be unique.
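Note that `df.duplicated()` only flags rows where every column matches, so it does not by itself confirm that IDs are unique. A minimal sketch of checking the ID column separately is shown here on made-up data (`toy` is hypothetical and deliberately contains a repeated ID).

```python
import pandas as pd

# Toy data: no two rows are fully identical, but one id appears twice
toy = pd.DataFrame({
    "id": [101, 102, 103, 103],
    "radius_mean": [17.99, 12.30, 14.20, 15.00],
})

# Full-row duplicates: every column must match
full_row_duplicates = toy.duplicated().sum()

# ID-level check: is every value in the id column distinct?
ids_unique = toy["id"].is_unique

# List any id values that occur more than once
repeated_ids = toy.loc[toy["id"].duplicated(keep=False), "id"].unique().tolist()

print(f"Full-row duplicates: {full_row_duplicates}")
print(f"All ids unique: {ids_unique}")
print(f"Repeated ids: {repeated_ids}")
```

Running the same checks on `df["id"]` gives direct evidence for the ID-uniqueness answer.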

### Step 9: Basic Statistics (4 points)

**Your Code:**

In [22]:
print("\n📈 Statistical Summary:")
df.describe().round(2)


📈 Statistical Summary:


Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,0.0
mean,30371830.0,14.13,19.29,91.97,654.89,0.1,0.1,0.09,0.05,0.18,...,25.68,107.26,880.58,0.13,0.25,0.27,0.11,0.29,0.08,
std,125020600.0,3.52,4.3,24.3,351.91,0.01,0.05,0.08,0.04,0.03,...,6.15,33.6,569.36,0.02,0.16,0.21,0.07,0.06,0.02,
min,8670.0,6.98,9.71,43.79,143.5,0.05,0.02,0.0,0.0,0.11,...,12.02,50.41,185.2,0.07,0.03,0.0,0.0,0.16,0.06,
25%,869218.0,11.7,16.17,75.17,420.3,0.09,0.06,0.03,0.02,0.16,...,21.08,84.11,515.3,0.12,0.15,0.11,0.06,0.25,0.07,
50%,906024.0,13.37,18.84,86.24,551.1,0.1,0.09,0.06,0.03,0.18,...,25.41,97.66,686.5,0.13,0.21,0.23,0.1,0.28,0.08,
75%,8813129.0,15.78,21.8,104.1,782.7,0.11,0.13,0.13,0.07,0.2,...,29.72,125.4,1084.0,0.15,0.34,0.38,0.16,0.32,0.09,
max,911320500.0,28.11,39.28,188.5,2501.0,0.16,0.35,0.43,0.2,0.3,...,49.54,251.2,4254.0,0.22,1.06,1.25,0.29,0.66,0.21,


**Your Findings:**
- What is the radius_mean range in the dataset?: 6.98 to 28.11
- What is the range of area_mean values?: 143.50 to 2501.00
- What is the range of concavity_mean values?: 0.00 to 0.43
- Do any min/max values seem impossible or clinically unlikely?:

---

### Step 10: Unique Counts (4 points)

**Your Code:**

In [23]:
#```python

#```

**Your Findings:**
- Which columns have very few unique values (likely categorical)?

- Which columns have many unique values (likely continuous)?

- Does the number of unique IDs match the number of rows? _______________
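One possible way to approach this step is `DataFrame.nunique()`, which counts distinct values per column. The sketch below runs on a small made-up DataFrame (`toy`), so its output is illustrative and not an answer for the real dataset.

```python
import pandas as pd

# Toy stand-in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "diagnosis": ["M", "B", "B", "M", "B"],
    "radius_mean": [17.99, 12.30, 14.20, 15.00, 11.42],
})

# Distinct values per column, smallest first, so categorical candidates come first
unique_counts = toy.nunique().sort_values()
print(unique_counts)

# An identifier column should have as many distinct values as there are rows
ids_match_rows = toy["id"].nunique() == len(toy)
print(f"Unique ids match row count: {ids_match_rows}")
```

Columns with only a handful of distinct values are candidates for categorical features, while columns whose count approaches the number of rows are likely continuous or identifiers.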

---

## Part 2: Data Dictionary (20 points)

Complete the following data dictionary for the **key columns**. For each column, you must:
1. **Research** the clinical meaning
2. **Identify** the feature type (Continuous, Discrete, Categorical-Nominal, Categorical-Ordinal, Binary, Identifier)
3. **Document** the valid values/range you observe
4. **Note** any issues or questions

| Column | Description | Feature Type | Valid Values/Range | Notes/Issues |
|--------|-------------|--------------|-------------------|--------------|
| `id` | | | | |
| `diagnosis` | | | | |
| `radius_mean` | | | | |
| `texture_mean` | | | | |
| `perimeter_mean` | | | | |
| `area_mean` | | | | |
| `smoothness_mean` | | | | |
| `compactness_mean` | | | | |
| `concavity_mean` | | | | |
| `concave points_mean` | | | | |
| `symmetry_mean` | | | | |
| `fractal_dimension_mean` | | | | |

### Clinical Research Questions for Version E

Answer these questions based on your research (you may need to use Google):

**1. What is nuclear pleomorphism and why is it important in cancer grading systems like the Nottingham grading system? How does cell symmetry relate to pleomorphism?**

Your answer:

---

**2. What is the biological basis for why cancer cells often show asymmetry? How do abnormal cell division and chromosomal instability contribute to irregular cell shapes?**

Your answer:

---

**3. Explain what the symmetry measurement in this dataset captures mathematically. How is symmetry calculated from the cell nucleus contour?**

Your answer:

---

**4. What is the Breast Imaging Reporting and Data System (BI-RADS)? How do shape and symmetry features factor into the BI-RADS classification of breast masses?**

Your answer:

---

## Part 3: Data Validation (15 points)

### 3.1 Diagnosis Distribution Validation (5 points)

Write code to check:
- How many patients have malignant (M) tumors?
- How many patients have benign (B) tumors?
- What is the percentage of each?

**Your Code:**

#```python

#```

**Your Findings:**

- Is this dataset balanced or imbalanced between the two classes?

- In the real world, what percentage of breast biopsies are malignant vs benign?
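A hedged sketch of the count-and-percentage check, using `value_counts()` with and without `normalize=True`. The `diagnosis` Series here is invented for illustration, so the split it shows is not the split in `data.csv`.

```python
import pandas as pd

# Toy diagnosis column (illustrative only)
diagnosis = pd.Series(["M", "B", "B", "B", "M", "B", "B", "B"], name="diagnosis")

# Raw counts per class
counts = diagnosis.value_counts()

# normalize=True returns fractions, which are scaled to percentages here
percentages = (diagnosis.value_counts(normalize=True) * 100).round(1)

# Combine into one small table; pandas aligns the two Series on the class label
summary = pd.DataFrame({"count": counts, "percent": percentages})
print(summary)
```

On the real data the same calls would be applied to `df["diagnosis"]`.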

---

### 3.2 Empty Column Validation (5 points)

Write code to examine all columns for any that might be completely empty or contain only null values.

**Your Code:**

In [24]:
#```python

#```

**Your Findings:**

- Did you find any columns that are entirely empty?

- What should you do with such columns before analysis?

- Why might an empty column exist in a dataset?
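One way to find and remove fully empty columns is to combine `isna().all()` with `drop()`. This is a sketch on a made-up DataFrame (`toy`); the column name `Unnamed: 32` is copied from the output earlier in this notebook, and the values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data: one column with a single gap, one column that is entirely empty
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.99, np.nan, 14.20],
    "Unnamed: 32": [np.nan] * 3,
})

# isna().all() is True only for columns where every value is missing
empty_columns = toy.columns[toy.isna().all()].tolist()
print(f"Entirely empty columns: {empty_columns}")

# Drop them; columns with only some missing values are kept
cleaned = toy.drop(columns=empty_columns)
print(cleaned.columns.tolist())
```

An equivalent one-liner is `toy.dropna(axis=1, how="all")`, which drops any column whose values are all missing.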

---

### 3.3 Feature Range Validation (5 points)

Write code to check if the "worst" measurements are always greater than or equal to the "mean" measurements for the same characteristic.

**Your Code:**

In [25]:
#```python

#```

**Your Findings:**

- Is `radius_worst` always >= `radius_mean`?

- Does this relationship hold for other features?

- What would it mean if this relationship was violated?
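The worst-versus-mean check can be sketched as a loop over characteristic names, counting the rows where `_worst` is smaller than `_mean`. The DataFrame below is made up, covers only two characteristics, and includes one deliberate violation so the check has something to find.

```python
import pandas as pd

# Toy data (illustrative only); the last texture row deliberately breaks the rule
toy = pd.DataFrame({
    "radius_mean":   [17.99, 12.30, 14.20],
    "radius_worst":  [25.38, 13.10, 14.20],
    "texture_mean":  [10.38, 17.77, 21.25],
    "texture_worst": [17.33, 23.41, 20.00],
})

characteristics = ["radius", "texture"]

# For each characteristic, count rows where worst < mean
violations = {}
for name in characteristics:
    is_violation = toy[f"{name}_worst"] < toy[f"{name}_mean"]
    violations[name] = int(is_violation.sum())

print(violations)
```

For the full dataset, `characteristics` would list all ten names as spelled in the column headers.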

---

## Part 4: Create Cell Symmetry Groups (10 points)

Create a new column called `symmetry_category` that categorizes tumors into clinically-meaningful groups based on `symmetry_mean` (a measure of how symmetrical the cell nuclei are).

### Version E: Symmetry-Based Clinical Categories

Use these categories based on observed symmetry values (where higher values indicate more asymmetry):

| Symmetry Category | Symmetry Range | Clinical Rationale |
|-------------------|----------------|-------------------|
| Highly Symmetric | < 0.14 | Very regular cell shape, typical of benign cells |
| Symmetric | 0.14 - 0.17 | Normal symmetry range, most cells fall here |
| Mildly Asymmetric | 0.17 - 0.20 | Some irregularity, warrants closer examination |
| Asymmetric | 0.20 - 0.25 | Notable asymmetry, associated with abnormal growth |
| Highly Asymmetric | ≥ 0.25 | Significant asymmetry, strong indicator of malignancy |

In [None]:
### Your Code:

```python
# Create the symmetry_category column
# You can use a custom function with .apply() OR pd.cut()
# Remember: if using pd.cut(), use include_lowest=True!

```

### Verify your groupings worked:

```python
# Show counts per symmetry category

```

### Calculate malignancy rate by symmetry category:

```python
# Calculate the percentage of malignant diagnoses in each symmetry category

```
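As a rough sketch of the three steps above (categorize, count, compute malignancy rate), the example below uses `pd.cut()` on made-up data. It assumes the lower bound of each range is inclusive, since the table leaves the shared edges (0.14, 0.17, 0.20) ambiguous, and it uses `right=False` to get that behavior. That is a different choice from the `include_lowest=True` hint, which applies to the default right-closed bins.

```python
import pandas as pd

# Toy data (illustrative only), chosen so some values sit exactly on bin edges
toy = pd.DataFrame({
    "symmetry_mean": [0.12, 0.14, 0.16, 0.17, 0.19, 0.20, 0.24, 0.25, 0.30],
    "diagnosis":     ["B",  "B",  "B",  "B",  "M",  "B",  "M",  "M",  "M"],
})

bins = [float("-inf"), 0.14, 0.17, 0.20, 0.25, float("inf")]
labels = [
    "Highly Symmetric", "Symmetric", "Mildly Asymmetric",
    "Asymmetric", "Highly Asymmetric",
]

# right=False makes each bin [low, high): 0.14 is "Symmetric", 0.25 is "Highly Asymmetric"
toy["symmetry_category"] = pd.cut(
    toy["symmetry_mean"], bins=bins, labels=labels, right=False
)

# Counts per category, kept in category order
category_counts = toy["symmetry_category"].value_counts(sort=False)
print(category_counts)

# Malignancy rate: mean of a boolean "is malignant" flag within each category
malignancy_rate = (
    toy["diagnosis"].eq("M")
    .groupby(toy["symmetry_category"], observed=False)
    .mean()
    .mul(100)
    .round(1)
)
print(malignancy_rate)
```

Whichever boundary convention is chosen, it is worth testing a value that sits exactly on an edge (such as 0.17) to confirm it lands in the intended category.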

### Analysis Questions:

**1. How many tumors are in each symmetry category?**

Your answer:

---

**2. What is the malignancy rate (percentage) for each symmetry category?**

Your answer:

---

**3. At what level of asymmetry does malignancy rate noticeably increase? Does cell symmetry appear to be a useful diagnostic feature?**

Your answer:

---

**4. Why might highly asymmetric cells be more likely to be malignant? (Think about how cancer cells divide and grow differently from normal cells.)**

Your answer:

---

## Part 5: Research Questions (15 points)

### 5.1 Write Three Answerable Questions (9 points)

Write three questions that THIS dataset can answer. Remember: the data can show relationships and patterns, but cannot prove causation.

**Your questions must explore these specific areas:**

1. **A question about symmetry and compactness together:**


---

2. **A question comparing symmetry_mean vs symmetry_worst:**


---

3. **A question about the relationship between symmetry and area:**


---

### 5.2 Identify One Question the Data CANNOT Answer (3 points)

Write one question about **tumor location or breast density** that this dataset cannot answer, and explain why.

**Question:**


**Why it cannot be answered with this data:**


---

### 5.3 Grouping Analysis (3 points)

Answer this question using a groupby analysis:

**"What is the average symmetry_mean for each diagnosis category (M vs B)?"**

In [None]:
**Your Code:**
```python

```

**Your Interpretation:**

How does symmetry differ between malignant and benign tumors? What does this suggest about the shape characteristics of cancer cells?
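A minimal sketch of the groupby pattern for this question, run on made-up values (`toy` is hypothetical, so the means it prints are not the dataset's means):

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "diagnosis":     ["M",  "B",  "B",  "M"],
    "symmetry_mean": [0.24, 0.16, 0.18, 0.20],
})

# Split rows by diagnosis, then average symmetry_mean within each group
mean_symmetry = toy.groupby("diagnosis")["symmetry_mean"].mean().round(3)
print(mean_symmetry)
```

The result is a Series indexed by diagnosis label, so `mean_symmetry["M"]` and `mean_symmetry["B"]` can be compared directly.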


---

## Part 6: Target Variable Analysis (Bonus - 5 points)

The `diagnosis` column is our **target variable** - what we're trying to predict. Analyze its relationship with key features.

In [None]:
**Your Code:**
```python
# Show the distribution of diagnosis
# Calculate summary statistics for at least 3 key features, grouped by diagnosis

```
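One possible approach to comparing features across diagnosis groups is sketched below on made-up data. Because features such as `area_mean` and `symmetry_mean` live on very different scales, the sketch divides each gap in group means by that feature's overall standard deviation. This standardization is an assumption chosen for illustration, not a method the assignment prescribes.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "diagnosis":     ["M", "M", "M", "B", "B", "B"],
    "radius_mean":   [18.0, 20.0, 19.0, 12.0, 13.0, 11.0],
    "area_mean":     [1000.0, 1300.0, 1200.0, 450.0, 520.0, 400.0],
    "symmetry_mean": [0.20, 0.18, 0.19, 0.17, 0.19, 0.18],
})
features = ["radius_mean", "area_mean", "symmetry_mean"]

# Mean of each feature within each diagnosis group
group_means = toy.groupby("diagnosis")[features].mean()
print(group_means)

# Gap between group means, expressed in units of each feature's overall spread
standardized_gap = (
    (group_means.loc["M"] - group_means.loc["B"]) / toy[features].std()
).sort_values(ascending=False)
print(standardized_gap.round(2))
```

Comparing raw differences instead would almost always favor `area_mean`, simply because its values are numerically the largest.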

### Bonus Questions:

**1. What percentage of patients in this dataset have malignant tumors?**

Your answer:

---

**2. Which feature shows the largest difference between malignant and benign tumors?**

Your answer:

---

**3. Why does class imbalance matter for machine learning classification? (You may need to research this)**

Your answer:

---

**4. If you were building a diagnostic model, which 3 features would you prioritize based on your analysis? Justify your choices.**

Your answer:

---

## Submission Checklist

Before submitting, verify you have completed:

- [ ] **Part 1:** All 10 inspection steps with code AND written findings
- [ ] **Part 2:** Complete data dictionary with 12 key columns filled in
- [ ] **Part 2:** Answered all 4 clinical research questions
- [ ] **Part 3:** All 3 validation checks with code and answers
- [ ] **Part 4:** Created `symmetry_category` column using **Symmetry-Based Clinical Categories**
- [ ] **Part 4:** Calculated malignancy rate by symmetry category with interpretation
- [ ] **Part 5:** Three research questions (symmetry+compactness, symmetry mean vs worst, symmetry+area)
- [ ] **Part 5:** One unanswerable question about tumor location/breast density
- [ ] **Part 5:** symmetry_mean by diagnosis groupby analysis
- [ ] **Bonus (Optional):** Target variable analysis

---

## Grading Rubric

| Component | Points | Requirements for Full Credit |
|-----------|--------|------------------------------|
| Part 1: 10-Point Inspection | 40 | All 10 steps complete with working code AND thoughtful written analysis |
| Part 2: Data Dictionary | 20 | All 12 columns documented with correct feature types and clinical research |
| Part 3: Data Validation | 15 | All validation checks complete with working code and insightful answers |
| Part 4: Symmetry Groups | 10 | Working code that creates correct groups AND meaningful interpretation |
| Part 5: Research Questions | 15 | Three good questions in specified areas, one clear limitation, groupby analysis complete |
| **Bonus:** Target Analysis | +5 | Thoughtful analysis with real-world connection |
| **Total** | 100 (+5 bonus) | |

---

## Hints (Read Before You Get Stuck!)

### ⚠️ Common Pitfalls:

1. **One column appears to be entirely empty** (all NaN values)
   - Check the last column carefully
   - This often happens with CSV exports that have trailing commas
   - You should drop this column before analysis

2. **The diagnosis column uses single letters** - "M" and "B"
   - Don't forget what these stand for when interpreting results
   - You may need to convert to 0/1 for some calculations

3. **Symmetry values are relatively small** - typically between 0.1 and 0.3
   - Pay attention to the decimal places when creating categories
   - Make sure your bin edges are precise

4. **Continuous features** - most features in this dataset are continuous
   - Think carefully about appropriate grouping strategies
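The first two pitfalls can be reproduced on a tiny in-memory CSV. The text below is invented for illustration; it has a trailing comma on every line, which is the usual reason pandas creates an all-empty `Unnamed: N` column. The `is_malignant` column name is also made up for this sketch.

```python
import io

import pandas as pd

# A tiny CSV with a trailing comma on every line (illustrative only)
csv_text = "id,diagnosis,radius_mean,\n1,M,17.99,\n2,B,12.30,\n"
toy = pd.read_csv(io.StringIO(csv_text))

# The trailing comma produces an extra, header-less column filled with NaN
extra_column = toy.columns[-1]
print(toy.columns.tolist())
print(toy[extra_column].isna().all())

# Map the single-letter codes to 1/0 so a mean gives the malignant fraction
toy["is_malignant"] = toy["diagnosis"].map({"M": 1, "B": 0})
print(toy[["diagnosis", "is_malignant"]])
```

The number in the generated column name is the column's zero-based position, which is why the real file's extra column is called `Unnamed: 32`.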

### 💡 Pro Tips:

- Use `value_counts()` liberally to understand categorical columns
- Use `value_counts(dropna=False)` to see if there are any null values
- When using `pd.cut()` with custom bins, include `float('-inf')` or `float('inf')` to catch all values
- The `describe()` method works best with numeric columns
- For comparing groups, `groupby().mean()` is your friend

---

## Useful Resources

- **UCI ML Repository - Original Dataset:** https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
- **Kaggle Dataset Page:** https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
- **American Cancer Society - Breast Cancer:** https://www.cancer.org/cancer/breast-cancer.html
- **Nottingham Grading System:** https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3714112/
- **Pandas Documentation:** https://pandas.pydata.org/docs/

---

*Remember: "Every Column Tells a Story" - your job is to figure out what that story is!*

---

**Due Date:** [See Canvas]

**Submission:** Upload your completed Jupyter notebook (.ipynb) to Canvas