In this exercise, you'll be given labels for 60 mammograms that contain a suspicious mass. Whenever this occurs in a clinical setting, the patient is sent for a biopsy to determine whether the mass is benign or cancerous. The radiologist can still make a judgment about whether the mass appears malignant based on how it looks in the image.

Sometimes in algorithmic development settings, we are only able to obtain radiologist reports and we are not able to obtain biopsy reports for all studies. Since the true gold standard label is the biopsy result, it helps to get several opinions from different radiologists on the image appearance to make a more robust ground truth assessment in the absence of biopsy data.

Here, you are provided with labels from three different radiologists who have the following levels of clinical experience:

Rad1 = 5 years  
Rad2 = 10 years  
Rad3 = 15 years  

In this exercise, create three 'ground truths'. You can label benign as 1 and malignant as 0:  

Ground Truth Method 1: Using biopsy labels (true gold standard)  
Ground Truth Method 2: Using a voting system between the three radiologists  
Ground Truth Method 3: Using a weighted voting system with experience levels between the three radiologists  

Then assess how Methods 2 and 3 compare to Method 1, that is, how often their ground truths agree with the biopsy labels.

In [4]:
# Import relevant libraries
import numpy as np
import pandas as pd

Read in your label data:

In [5]:
# Read the CSV file of Mammogram biopsy findings into a DataFrame
labels = pd.read_csv('labels.csv')
labels.head(10)

Unnamed: 0,rad1,rad2,rad3,biopsy
0,benign,benign,benign,benign
1,benign,benign,benign,benign
2,benign,benign,benign,benign
3,benign,benign,benign,benign
4,benign,benign,cancer,benign
5,cancer,cancer,cancer,cancer
6,benign,benign,benign,benign
7,benign,benign,benign,benign
8,cancer,cancer,benign,cancer
9,benign,benign,cancer,benign


## Create your first ground truth as derived from biopsy labels: 

In [6]:
## Binarization or encoding of the labels 
## I'm going to replace everything in my 'labels' dataframe with 0's and 1's for easier processing later:
labels2 = labels.replace('benign',1).replace('cancer',0)
labels2.head(10)


Unnamed: 0,rad1,rad2,rad3,biopsy
0,1,1,1,1
1,1,1,1,1
2,1,1,1,1
3,1,1,1,1
4,1,1,0,1
5,0,0,0,0
6,1,1,1,1
7,1,1,1,1
8,0,0,1,0
9,1,1,0,1
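
One limitation of chained `.replace` calls is that any unexpected label (a typo, a missing value) passes through untouched. The sketch below shows an alternative encoding with an explicit dictionary and `Series.map`, which turns unknown values into NaN so they can be caught. It is only an illustration: the `sample` DataFrame is typed in by hand from the ten rows shown above, and the names `sample`, `encoding` and `encoded` are not part of the original notebook.

```python
import pandas as pd

# The ten rows displayed by labels.head(10), typed in by hand (label columns only)
sample = pd.DataFrame({
    "rad1":   ["benign", "benign", "benign", "benign", "benign",
               "cancer", "benign", "benign", "cancer", "benign"],
    "rad2":   ["benign", "benign", "benign", "benign", "benign",
               "cancer", "benign", "benign", "cancer", "benign"],
    "rad3":   ["benign", "benign", "benign", "benign", "cancer",
               "cancer", "benign", "benign", "benign", "cancer"],
    "biopsy": ["benign", "benign", "benign", "benign", "benign",
               "cancer", "benign", "benign", "cancer", "benign"],
})

# Same convention as the notebook: benign = 1, cancer = 0
encoding = {"benign": 1, "cancer": 0}

# Series.map returns NaN for any value that is missing from the dictionary
encoded = sample.apply(lambda col: col.map(encoding))

# Fail loudly instead of carrying unknown labels into the vote
unexpected = int(encoded.isna().sum().sum())
if unexpected:
    raise ValueError(f"{unexpected} label(s) were neither 'benign' nor 'cancer'")

encoded = encoded.astype(int)
print(encoded)
```

The printed frame should match the binarized table above, minus the `Unnamed: 0` index column.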


In [7]:
gt1 = labels2['biopsy']
gt1.head()

0    1
1    1
2    1
3    1
4    1
Name: biopsy, dtype: int64

## Create your second truth by voting system from the three radiologists:

In [12]:
# Creating a ground truth using a simple voting system from three radiologists
# This block uses pandas to sum the binary votes (benign=1, cancer=0) from three radiologists for each case.
# The sum across 'rad1', 'rad2', and 'rad3' columns gives the total number of benign votes per case.
# If the sum is greater than 1 (i.e., at least two radiologists voted benign), the case is labeled as benign (1), otherwise malignant (0).

gt2 = labels2[['rad1','rad2','rad3']].sum(axis=1)
gt2 = (gt2 > 1).astype(int)
gt2.head()

0    1
1    1
2    1
3    1
4    1
dtype: int64

## Create your third ground truth by weighting the three radiologists:

In [14]:
# Using pandas to create a weighted voting system for radiologist labels

# This block uses the pandas library to manipulate the DataFrame of radiologist votes.
# We copy the binarized labels DataFrame (labels2) to avoid modifying the original data.
# Then, we apply weights to each radiologist's vote based on their years of experience:
# - rad1 (5 years): weight 0.33
# - rad2 (10 years): weight 0.67
# - rad3 (15 years): weight 1 (no change)
# The multiplication is vectorized using pandas, so each value in the specified columns is scaled accordingly.
# The resulting DataFrame (weighted_labels) contains the weighted votes for each radiologist.

weighted_labels = labels2.copy()
weighted_labels['rad2'] = weighted_labels['rad2'] * 0.67
weighted_labels['rad1'] = weighted_labels['rad1'] * 0.33
weighted_labels.head()

Unnamed: 0,rad1,rad2,rad3,biopsy
0,0.33,0.67,1,1
1,0.33,0.67,1,1
2,0.33,0.67,1,1
3,0.33,0.67,1,1
4,0.33,0.67,0,1


In [16]:
# Creating a weighted voting ground truth using pandas
# 
# This block uses the pandas library to aggregate radiologist votes with different weights based on experience.
# The weighted votes for each radiologist (already applied in 'weighted_labels') are summed across columns 'rad1', 'rad2', and 'rad3'.
# The sum represents the total weighted "benign" score for each case.
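# If the sum is greater than 1, the case is labeled as benign (1), otherwise malignant (0).
# Because the comparison is strict, a weighted sum of exactly 1 (rad1 and rad2 voting benign, or rad3 alone) is labeled malignant.
# The boolean result is converted to integer labels with .astype(int) for consistency.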


gt3 = weighted_labels[['rad1','rad2','rad3']].sum(axis=1)
gt3 = (gt3 > 1).astype(int)
gt3.head()

0    1
1    1
2    1
3    1
4    0
dtype: int64
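
To see where the two voting rules can differ at all, it may help to enumerate every possible combination of three votes. The sketch below does that in plain Python, using the same weights (0.33, 0.67, 1) and the same `> 1` threshold as the cells above; the variable names are illustrative only, and the score is rounded to two decimals as a precaution against floating-point noise (the notebook itself does not round).

```python
from itertools import product

# Weights and threshold copied from the weighted-voting cells above
W1, W2, W3 = 0.33, 0.67, 1.0

differing = []
print("rad1 rad2 rad3 | majority weighted score")
for r1, r2, r3 in product([0, 1], repeat=3):
    # Simple vote: benign (1) when at least two of three vote benign
    majority = int(r1 + r2 + r3 > 1)
    # Weighted vote: benign (1) only when the weighted sum exceeds 1
    score = round(r1 * W1 + r2 * W2 + r3 * W3, 2)
    weighted = int(score > 1)
    print(f"{r1:4d} {r2:4d} {r3:4d} | {majority:8d} {weighted:8d} {score:5.2f}")
    if majority != weighted:
        differing.append((r1, r2, r3))

print("Combinations where the two rules differ:", differing)
```

Under these assumptions the rules differ only for (1, 1, 0): rad1 and rad2 both vote benign, their weights add up to exactly 1, and the strict threshold hands the decision to rad3. Row 4 of the outputs above is such a case.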

## Compare the three ground truths:

Here, explore the three sets of labels you created and see how often they agree.

In [19]:
# Comparing biopsy ground truth to the simple voting system
#
# This block uses the pandas library to compare two sets of ground truth labels:
# - `gt1`: The gold standard biopsy labels.
# - `gt2`: The labels generated by a simple majority voting system among three radiologists.
#
# The comparison is performed using the equality operator (`==`), which returns a boolean Series (`biopsy_to_votes`) indicating where the two ground truths agree (True) or disagree (False) for each case.
# The code then filters with the negated mask (`~biopsy_to_votes`) and displays only the cases where there is disagreement.

biopsy_to_votes = gt1 == gt2
biopsy_to_votes[~biopsy_to_votes]



12    False
14    False
22    False
29    False
30    False
34    False
37    False
52    False
57    False
dtype: bool

In [20]:
# Comparing biopsy ground truth to the weighted voting system
#
# This block uses the pandas library to compare two sets of ground truth labels:
# - `gt1`: The gold standard biopsy labels.
# - `gt3`: The labels generated by a weighted voting system among three radiologists.
#
# The comparison is performed using the equality operator (`==`), which returns a boolean Series (`biopsy_to_weights`) indicating where the two ground truths agree (True) or disagree (False) for each case.
# The code then filters with the negated mask (`~biopsy_to_weights`) and displays only the cases where there is disagreement.

biopsy_to_weights = gt1 == gt3
biopsy_to_weights[~biopsy_to_weights]

4     False
9     False
12    False
14    False
17    False
20    False
22    False
29    False
30    False
34    False
37    False
52    False
56    False
57    False
58    False
dtype: bool
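
Listing the disagreeing rows is useful, but a single agreement rate and a cross-tabulation are often easier to compare. The sketch below computes both on the ten binarized rows shown earlier, typed in by hand. It is not the full 60-case dataset, so the numbers only illustrate the calculation. The names `sample`, `raters` and `table` are assumptions made for this example.

```python
import pandas as pd

# Ten binarized rows copied from labels2.head(10); benign = 1, cancer = 0
sample = pd.DataFrame({
    "rad1":   [1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
    "rad2":   [1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
    "rad3":   [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    "biopsy": [1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
})
raters = ["rad1", "rad2", "rad3"]

gt1 = sample["biopsy"]

# Simple vote: benign when more than one radiologist votes benign
gt2 = (sample[raters].sum(axis=1) > 1).astype(int)

# Weighted vote: same weights and strict > 1 threshold as the notebook
weights = pd.Series({"rad1": 0.33, "rad2": 0.67, "rad3": 1.0})
gt3 = (sample[raters].mul(weights, axis=1).sum(axis=1) > 1).astype(int)

# Fraction of cases where each method matches the biopsy label
agree_votes = (gt1 == gt2).mean()
agree_weights = (gt1 == gt3).mean()
print(f"simple vote agreement:   {agree_votes:.2f}")
print(f"weighted vote agreement: {agree_weights:.2f}")

# Rows are biopsy labels, columns are weighted-vote labels
table = pd.crosstab(gt1, gt3, rownames=["biopsy"], colnames=["weighted vote"])
print(table)
```

On these ten rows the weighted vote disagrees with biopsy on rows 4 and 9, which are also the first two entries in the disagreement list above.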

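Interestingly, in the example above the weighted vote performs worse against the biopsy labels than simple voting: it disagrees with biopsy on 15 cases, compared with 9 for the simple vote. With these weights and a strict threshold of 1, rad1 and rad2 together only reach a weighted sum of exactly 1, so rad3 overrules them whenever they both vote benign. This may be an artefact of the weights we chose; weighted voting is not always worse than simple voting.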
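
One way to probe how much the result depends on the chosen weights is to rerun the vote with different weights or a different tie rule. The sketch below uses a hypothetical helper, `weighted_vote`, which is not part of the original notebook. It assumes integer weights equal to years of experience (5, 10, 15), which are proportional to the 0.33, 0.67 and 1 used above, and a threshold at half the total weight. It runs on the ten rows shown earlier only, so the printed rates say nothing definitive about the full dataset.

```python
import pandas as pd

# Ten binarized rows copied from labels2.head(10); benign = 1, cancer = 0
sample = pd.DataFrame({
    "rad1":   [1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
    "rad2":   [1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
    "rad3":   [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    "biopsy": [1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
})


def weighted_vote(votes, weights, ties_benign=False):
    """Hypothetical helper: label benign (1) when the benign weight
    exceeds half of the total weight; ties go to malignant by default."""
    w = pd.Series(weights)
    score = votes[w.index].mul(w, axis=1).sum(axis=1)
    half = w.sum() / 2
    if ties_benign:
        return (score >= half).astype(int)
    return (score > half).astype(int)


# Assumed weights: years of experience, and equal weights for comparison
experience = {"rad1": 5, "rad2": 10, "rad3": 15}
equal = {"rad1": 1, "rad2": 1, "rad3": 1}

results = {
    "experience, ties malignant": weighted_vote(sample, experience),
    "experience, ties benign": weighted_vote(sample, experience, ties_benign=True),
    "equal weights": weighted_vote(sample, equal),
}

# Agreement with the biopsy label for each variant
agreement = {name: (gt == sample["biopsy"]).mean() for name, gt in results.items()}
for name, rate in agreement.items():
    print(f"{name}: {rate:.2f}")
```

With equal weights the helper reduces to the simple majority vote, and the first variant reproduces the notebook's weighted rule, so the three lines give a rough sense of how sensitive the outcome is to these choices.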

This script aims to evaluate different methods for establishing ground truth labels in mammogram studies where biopsy results (the gold standard) are available, but radiologist opinions are also collected. The workflow involves reading mammogram label data, encoding categorical labels into binary format (benign=1, cancer=0), and generating three sets of ground truth: (1) directly from biopsy results, (2) via a simple majority vote among three radiologists, and (3) using a weighted voting system based on each radiologist's years of experience. The script then compares the agreement between the radiologist-based ground truths and the biopsy-based ground truth, highlighting cases of disagreement. The output provides insight into how well radiologist consensus and experience-weighted votes align with biopsy-confirmed diagnoses, informing the reliability of alternative labeling strategies when biopsy data is limited.