## Data pruning to solve imbalance problem and speed up training

## Strategy

Over 18k samples are "No Finding", while the second most common class ("Mass") has only 1,123 samples. Among the "No Finding" samples, we keep every sample with BI-RADS greater than 2. For BI-RADS 2 or lower, we sample from each combination of BI-RADS and density as evenly as possible, capping these sampled "No Finding" rows at 2,500 in total.

In [2]:
import pandas as pd

In [11]:
df = pd.read_csv('..\\metadata\\stratified_local.csv')

In [90]:
df.finding_categories.value_counts().reset_index(name='count')

Unnamed: 0,finding_categories,count
0,['No Finding'],18232
1,['Mass'],1123
2,['Suspicious Calcification'],402
3,['Focal Asymmetry'],232
4,['Architectural Distortion'],95
5,['Asymmetry'],90
6,"['Suspicious Calcification', 'Mass']",82
7,['Suspicious Lymph Node'],57
8,['Skin Thickening'],38
9,"['Suspicious Calcification', 'Focal Asymmetry']",31


In [14]:
no_finding_df = df[df["finding_categories"] == "['No Finding']"].copy()

In [16]:
no_finding_df.head()

Unnamed: 0,study_id,series_id,image_id,laterality,view_position,height,width,breast_birads,breast_density,finding_categories,finding_birads,xmin,ymin,xmax,ymax,split,fold
2254,b8d273e8601f348d3664778dae0e7e0b,b36517b9cbbcfd286a7ae04f643af97a,d8125545210c08e1b1793a5af6458ee2,L,CC,3518,2800,BI-RADS 2,DENSITY C,['No Finding'],,,,,,training,training
2255,b8d273e8601f348d3664778dae0e7e0b,b36517b9cbbcfd286a7ae04f643af97a,290c658f4e75a3f83ec78a847414297c,L,MLO,3518,2800,BI-RADS 2,DENSITY C,['No Finding'],,,,,,training,training
2256,b8d273e8601f348d3664778dae0e7e0b,b36517b9cbbcfd286a7ae04f643af97a,cd0fc7bc53ac632a11643ac4cc91002a,R,CC,3518,2800,BI-RADS 2,DENSITY C,['No Finding'],,,,,,training,training
2257,b8d273e8601f348d3664778dae0e7e0b,b36517b9cbbcfd286a7ae04f643af97a,71638b1e853799f227492bfb08a01491,R,MLO,3518,2800,BI-RADS 2,DENSITY C,['No Finding'],,,,,,training,training
2258,8269f5971eaca3e5d3772d1796e6bd7a,d931832a0815df082c085b6e09d20aac,dd9ce3288c0773e006a294188aadba8e,L,CC,3518,2800,BI-RADS 1,DENSITY C,['No Finding'],,,,,,training,training


In [18]:
combinations = no_finding_df.groupby(['breast_birads', 'breast_density']).size().reset_index(name='count')
print("Unique combinations and their counts:")
print(combinations)


Unique combinations and their counts:
   breast_birads breast_density  count
0      BI-RADS 1      DENSITY A     80
1      BI-RADS 1      DENSITY B   1326
2      BI-RADS 1      DENSITY C  10088
3      BI-RADS 1      DENSITY D   1912
4      BI-RADS 2      DENSITY A      8
5      BI-RADS 2      DENSITY B    400
6      BI-RADS 2      DENSITY C   3617
7      BI-RADS 2      DENSITY D    639
8      BI-RADS 3      DENSITY A      1
9      BI-RADS 3      DENSITY B      2
10     BI-RADS 3      DENSITY C    109
11     BI-RADS 3      DENSITY D     14
12     BI-RADS 4      DENSITY B      3
13     BI-RADS 4      DENSITY C     24
14     BI-RADS 4      DENSITY D      7
15     BI-RADS 5      DENSITY B      1
16     BI-RADS 5      DENSITY C      1


In [19]:
no_finding_birads_le2 = no_finding_df[no_finding_df['breast_birads'].str.replace('BI-RADS', '').astype(int) <= 2]

In [21]:
no_finding_birads_gt2 = no_finding_df[no_finding_df['breast_birads'].str.replace('BI-RADS', '').astype(int) > 2]
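Both filters parse `breast_birads` in the same way. As a minimal sketch on a toy frame, the numeric level can be parsed once and reused for both splits. The names `toy` and `birads_level` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame that mimics the format of the breast_birads column.
toy = pd.DataFrame(
    {"breast_birads": ["BI-RADS 1", "BI-RADS 2", "BI-RADS 3", "BI-RADS 5"]}
)

# Pull out the digits once and convert them to integers.
birads_level = toy["breast_birads"].str.extract(r"(\d+)", expand=False).astype(int)

# Reuse the parsed level for both sides of the split.
low = toy[birads_level <= 2]
high = toy[birads_level > 2]

print(low["breast_birads"].tolist())
print(high["breast_birads"].tolist())
```

Parsing once keeps the two masks consistent if the threshold or the label format changes later.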

In [22]:
no_finding_birads_gt2

Unnamed: 0,study_id,series_id,image_id,laterality,view_position,height,width,breast_birads,breast_density,finding_categories,finding_birads,xmin,ymin,xmax,ymax,split,fold
2332,bdf65210726eb71e919b1af1a1c87c61,2efe91988161bfbb2927f0b32371e6c2,35293500d91fb681fd8ff9171d8161c1,R,CC,3518,2800,BI-RADS 3,DENSITY C,['No Finding'],,,,,,training,training
2425,72f7758250782ee835586f7122acd9fd,36dc18c8c1a5b7203d6e3e26f32f3547,305835a194605ef353a1700def9a2429,L,CC,3518,2800,BI-RADS 3,DENSITY C,['No Finding'],,,,,,training,training
2450,cc529421ea90b754791cb0a17b5d5950,1db485d076206d7b2888e16270330a69,4faa502262cb0142a0e6406c4438be3a,L,MLO,3518,2800,BI-RADS 3,DENSITY C,['No Finding'],,,,,,training,training
2547,ed57df0b0549ec005fedec9dd9a23a99,3e99cc5facba25c43179a4b7f9432d15,1b44ec25f2859a76c7272b240e25e336,L,CC,3518,2800,BI-RADS 3,DENSITY C,['No Finding'],,,,,,training,training
2684,711543a7b6910a0e41ef126205eca9a9,a3047e690794511f75a84715f9f46a23,fb1d5c7b4a6a346e353654207d2950cd,R,MLO,3518,2800,BI-RADS 3,DENSITY C,['No Finding'],,,,,,training,training
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20067,7bdfdd72d14a004a88918758eec0ebf8,460ac23eb66bbeb63081f2db80ca47e3,f0a27d3d6d8a31aa72efa9a54db6b23d,R,MLO,3580,2766,BI-RADS 4,DENSITY D,['No Finding'],,,,,,training,training
20070,968f72a200354163e45343aeeead090d,efbe8a15d8929b30daa84f3ba17d3fb1,a0d8764fd9aab586c329e0458f120ee5,R,CC,3580,2702,BI-RADS 3,DENSITY C,['No Finding'],,,,,,test,training
20117,aba909f14c21dc68e13e516c8a8cbb7e,5be0ca890689abd40dbd923687b5fb3a,aad09d3b49967b0e47f36138aa513b27,L,CC,3580,2750,BI-RADS 4,DENSITY C,['No Finding'],,,,,,training,test
20384,7c98228fc11204260460934ba8c6e12b,5d6e9f5b8bad28ab4cea2f8775649d24,a2b207fbb38e7522dc318d2ca7a56d19,L,CC,3580,2645,BI-RADS 4,DENSITY C,['No Finding'],,,,,,training,training


In [28]:
main_dataset = df[df['finding_categories'] != "['No Finding']"].copy()

In [32]:
main_dataset.describe()

Unnamed: 0,height,width,xmin,ymin,xmax,ymax
count,2254.0,2254.0,2254.0,2254.0,2254.0,2254.0
mean,3426.414374,2681.761757,1160.120275,1439.259983,1426.837964,1740.29963
std,254.211505,277.416325,960.853053,444.646243,966.959392,437.777765
min,2812.0,2012.0,-26.844999,-5.22405,55.005161,152.147003
25%,3518.0,2800.0,244.267254,1165.827484,513.675766,1474.47998
50%,3518.0,2800.0,702.410004,1448.295044,989.629486,1765.705017
75%,3518.0,2800.0,2158.917542,1743.442505,2458.27002,2040.83501
max,3580.0,2812.0,2743.610107,2613.790039,2830.149902,2978.73999


In [33]:
main_dataset = pd.concat([main_dataset, no_finding_birads_gt2], ignore_index=True) # add no finding with birads > 2

In [35]:
main_dataset.head()

Unnamed: 0,study_id,series_id,image_id,laterality,view_position,height,width,breast_birads,breast_density,finding_categories,finding_birads,xmin,ymin,xmax,ymax,split,fold
0,48575a27b7c992427041a82fa750d3fa,26de4993fa6b8ae50a91c8baf49b92b0,4e3a578fe535ea4f5258d3f7f4419db8,R,CC,3518,2800,BI-RADS 4,DENSITY C,['Mass'],BI-RADS 4,2355.139893,1731.640015,2482.97998,1852.75,training,training
1,48575a27b7c992427041a82fa750d3fa,26de4993fa6b8ae50a91c8baf49b92b0,dac39351b0f3a8c670b7f8dc88029364,R,MLO,3518,2800,BI-RADS 4,DENSITY C,['Mass'],BI-RADS 4,2386.679932,1240.609985,2501.800049,1354.040039,training,training
2,75e8e48933289d70b407379a564f8594,853b70e7e6f39133497909d9ca4c756d,c83f780904f25eacb44e9030f32c66e1,R,CC,3518,2800,BI-RADS 3,DENSITY C,['Global Asymmetry'],BI-RADS 3,2279.179932,1166.51001,2704.439941,2184.26001,training,training
3,75e8e48933289d70b407379a564f8594,853b70e7e6f39133497909d9ca4c756d,893528bc38a0362928a89364f1b692fd,R,MLO,3518,2800,BI-RADS 3,DENSITY C,['Global Asymmetry'],BI-RADS 3,1954.27002,1443.640015,2589.76001,2193.810059,training,training
4,c3487424fee1bdd4515b72dc3fd69813,77619c914263eae44e9099f1ce07192c,318264c881bf12f2c1efe5f93920cc37,R,CC,3518,2800,BI-RADS 4,DENSITY C,['Architectural Distortion'],BI-RADS 4,2172.300049,1967.410034,2388.699951,2147.159912,training,training


In [36]:
combinations = no_finding_birads_le2.groupby(['breast_birads', 'breast_density']).size().reset_index(name='count')
num_combinations = combinations.shape[0]

In [39]:
print(combinations)

  breast_birads breast_density  count
0     BI-RADS 1      DENSITY A     80
1     BI-RADS 1      DENSITY B   1326
2     BI-RADS 1      DENSITY C  10088
3     BI-RADS 1      DENSITY D   1912
4     BI-RADS 2      DENSITY A      8
5     BI-RADS 2      DENSITY B    400
6     BI-RADS 2      DENSITY C   3617
7     BI-RADS 2      DENSITY D    639


In [41]:
print(main_dataset.value_counts('finding_categories').reset_index(name='count'))

                                   finding_categories  count
0                                            ['Mass']   1123
1                        ['Suspicious Calcification']    402
2                                 ['Focal Asymmetry']    232
3                                      ['No Finding']    162
4                        ['Architectural Distortion']     95
5                                       ['Asymmetry']     90
6                ['Suspicious Calcification', 'Mass']     82
7                           ['Suspicious Lymph Node']     57
8                                 ['Skin Thickening']     38
9     ['Suspicious Calcification', 'Focal Asymmetry']     31
10                               ['Global Asymmetry']     24
11  ['Suspicious Calcification', 'Architectural Di...     13
12                              ['Nipple Retraction']     12
13                                ['Skin Retraction']      7
14           ['Skin Thickening', 'Nipple Retraction']      6
15             ['Skin Th

In [42]:
grouped = no_finding_birads_le2.groupby(['breast_birads', 'breast_density'])

total_desired = 2500 # Desired no finding samples
num_groups = len(grouped)
base_sample_per_group = total_desired // num_groups  # 312

### Add samples from each combination of BI-RADS and density as evenly as possible. If the 2500-sample target is not reached, draw the extra samples from the larger combinations, in proportion to the rows they still have available.

In [44]:
random_state = 42

# Base draw: take up to base_sample_per_group rows from every (BI-RADS, density) group.
base_parts = []
for _, group in grouped:
    if len(group) <= base_sample_per_group:
        sampled = group  # small group: keep every row
    else:
        sampled = group.sample(n=base_sample_per_group, random_state=random_state)
    base_parts.append(sampled)
base_samples = pd.concat(base_parts, ignore_index=True)

current_total = len(base_samples)
remaining = total_desired - current_total
print(remaining)

if remaining > 0:
    # Rows still unused in each group after the base draw (0 for small groups).
    remaining_available = (grouped.size() - base_sample_per_group).clip(lower=0)
    total_remaining_available = remaining_available.sum()

    if total_remaining_available > 0:
        # Split the shortfall across groups in proportion to their unused rows.
        fractions = remaining_available / total_remaining_available
        extra_parts = []
        for name, group in grouped:
            if len(group) > base_sample_per_group:
                additional = int(round(fractions[name] * remaining))
                if additional > 0:
                    # NOTE: drawn from the full group with the same seed, so these
                    # rows can repeat rows already taken in the base draw.
                    extra_parts.append(
                        group.sample(n=additional, random_state=random_state)
                    )
        base_samples = pd.concat([base_samples, *extra_parts], ignore_index=True)

        # Rounding can overshoot the target by a few rows; trim back down.
        if len(base_samples) > total_desired:
            base_samples = base_samples.sample(
                n=total_desired, random_state=random_state
            )
elif remaining < 0:
    base_samples = base_samples.sample(n=total_desired, random_state=random_state)

540
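The extra rows are drawn from the full group with the same `random_state` as the base draw, so they may include rows that the base draw already selected. A minimal sketch of one way to avoid that, assuming each group has a unique index, is to draw the extras only from rows not yet chosen. The helper `sample_without_overlap` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd


def sample_without_overlap(group, base_n, extra_n, seed=42):
    """Draw up to base_n rows, then up to extra_n more from the rows not yet drawn."""
    if len(group) <= base_n:
        base = group
    else:
        base = group.sample(n=base_n, random_state=seed)
    # Exclude rows already taken (relies on a unique index within the group).
    leftover = group.drop(base.index)
    extra = leftover.sample(n=min(extra_n, len(leftover)), random_state=seed)
    return pd.concat([base, extra])


# Toy group standing in for one (BI-RADS, density) combination.
toy = pd.DataFrame(
    {
        "image_id": [f"img{i}" for i in range(20)],
        "breast_density": "DENSITY B",
    }
)

picked = sample_without_overlap(toy, base_n=5, extra_n=3)
print(len(picked), picked["image_id"].is_unique)
```

On the real data, checking `base_samples["image_id"].is_unique` after sampling would show whether any rows were drawn twice.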



In [46]:
base_samples.groupby(['breast_birads', 'breast_density']).size().reset_index(name='count') # final sampling distribution

Unnamed: 0,breast_birads,breast_density,count
0,BI-RADS 1,DENSITY A,80
1,BI-RADS 1,DENSITY B,346
2,BI-RADS 1,DENSITY C,640
3,BI-RADS 1,DENSITY D,365
4,BI-RADS 2,DENSITY A,8
5,BI-RADS 2,DENSITY B,315
6,BI-RADS 2,DENSITY C,423
7,BI-RADS 2,DENSITY D,323
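The per-group counts can be cross-checked by hand from the group sizes printed earlier. The sketch below repeats the allocation arithmetic in plain Python, without any sampling. The dictionary names are illustrative, and the group sizes are copied from the combinations table.

```python
# "No Finding" group sizes for BI-RADS <= 2, copied from the combinations table.
group_sizes = {
    ("BI-RADS 1", "DENSITY A"): 80,
    ("BI-RADS 1", "DENSITY B"): 1326,
    ("BI-RADS 1", "DENSITY C"): 10088,
    ("BI-RADS 1", "DENSITY D"): 1912,
    ("BI-RADS 2", "DENSITY A"): 8,
    ("BI-RADS 2", "DENSITY B"): 400,
    ("BI-RADS 2", "DENSITY C"): 3617,
    ("BI-RADS 2", "DENSITY D"): 639,
}

total_desired = 2500
base_quota = total_desired // len(group_sizes)

# Base draw: every group contributes up to base_quota rows.
base = {key: min(n, base_quota) for key, n in group_sizes.items()}
remaining = total_desired - sum(base.values())

# Rows each group still has available after the base draw.
spare = {key: max(n - base_quota, 0) for key, n in group_sizes.items()}
total_spare = sum(spare.values())

# Shortfall split proportionally to the spare rows, rounded per group.
extra = {key: int(round(s / total_spare * remaining)) for key, s in spare.items()}
quota = {key: base[key] + extra[key] for key in group_sizes}

for key, n in quota.items():
    print(key, n)
print("total before trimming:", sum(quota.values()))
```

Because each group's share is rounded independently, the quotas sum to 2501, one more than the target. This is consistent with the trimming step and with BI-RADS 1 / DENSITY D showing 365 rows in the table rather than its quota of 366.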


In [47]:
final_dataset = pd.concat([main_dataset, base_samples], ignore_index=True)

In [54]:
print(f"Total samples in final dataset: {final_dataset.shape[0]}")
print("Distribution of findings:")
print(final_dataset["finding_categories"].value_counts().reset_index(name="count"))
print("Distribution of BIRADS and Density in 'No finding' cases:")
print(
    final_dataset[final_dataset["finding_categories"] == "['No Finding']"]
    .groupby(["breast_birads","breast_density"])
    .size()
    .reset_index(name="count")
)

Total samples in final dataset: 4916
Distribution of findings:
                                   finding_categories  count
0                                      ['No Finding']   2662
1                                            ['Mass']   1123
2                        ['Suspicious Calcification']    402
3                                 ['Focal Asymmetry']    232
4                        ['Architectural Distortion']     95
5                                       ['Asymmetry']     90
6                ['Suspicious Calcification', 'Mass']     82
7                           ['Suspicious Lymph Node']     57
8                                 ['Skin Thickening']     38
9     ['Suspicious Calcification', 'Focal Asymmetry']     31
10                               ['Global Asymmetry']     24
11  ['Suspicious Calcification', 'Architectural Di...     13
12                              ['Nipple Retraction']     12
13                                ['Skin Retraction']      7
14           ['Skin Th

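### The final dataset holds 2662 "No Finding" samples rather than 2500, because the 162 samples with BI-RADS greater than 2 were added on top of the 2500 sampled ones. Rounding the fractional allocations overshoots by only one row, which the sampling step trims. Keeping the extra samples is reasonable given the test fold imbalance.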

In [55]:
final_dataset.to_csv('..\\metadata\\stratified_local_balanced.csv', index=False)

In [58]:
df.value_counts('fold').reset_index(name='count')

Unnamed: 0,fold,count
0,training,16404
1,test,4082


In [59]:
final_dataset.value_counts('fold').reset_index(name='count')

Unnamed: 0,fold,count
0,training,3923
1,test,993
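The two fold tables are easier to compare as test shares. A small sketch of that arithmetic, using the counts printed above (the dictionary names are illustrative):

```python
# Fold counts copied from the two tables above.
fold_counts = {
    "original": {"training": 16404, "test": 4082},
    "pruned": {"training": 3923, "test": 993},
}

# Share of rows assigned to the test fold in each dataset.
test_share = {
    name: counts["test"] / (counts["training"] + counts["test"])
    for name, counts in fold_counts.items()
}

for name, share in test_share.items():
    print(f"{name}: test share = {share:.1%}")
```

Both datasets place roughly 20% of their rows in the test fold, so pruning leaves the overall split ratio close to the original.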


In [64]:
final_dataset.groupby(["fold", "finding_categories"]).size().reset_index(
    name="count"
).sort_values(by="count", ascending=False)

Unnamed: 0,fold,finding_categories,count
32,training,['No Finding'],2098
27,training,['Mass'],917
7,test,['No Finding'],564
49,training,['Suspicious Calcification'],321
5,test,['Mass'],206
25,training,['Focal Asymmetry'],185
18,test,['Suspicious Calcification'],81
22,training,['Architectural Distortion'],77
24,training,['Asymmetry'],70
47,training,"['Suspicious Calcification', 'Mass']",67


### Some finding combinations are absent from either the test or the training fold because they are extremely rare. We duplicate their rows into the missing fold so that every combination appears in both folds. This introduces data leakage for those rows, but it is needed for every combination to be represented in both folds.

In [80]:
grouped = final_dataset.groupby(['fold', 'finding_categories']).size().reset_index(name='count')

train_findings = grouped[grouped['fold'] == 'training']['finding_categories'].unique()
test_findings = grouped[grouped['fold'] == 'test']['finding_categories'].unique()


missing_in_train = set(test_findings) - set(train_findings)
missing_in_test = set(train_findings) - set(test_findings)

In [81]:
missing_in_test, missing_in_train

({"['Architectural Distortion', 'Asymmetry']",
  "['Architectural Distortion', 'Mass']",
  "['Asymmetry', 'Mass']",
  "['Nipple Retraction', 'Asymmetry']",
  "['Nipple Retraction', 'Mass']",
  "['Nipple Retraction', 'Skin Thickening', 'Mass']",
  "['Skin Retraction', 'Architectural Distortion', 'Suspicious Calcification']",
  "['Skin Retraction', 'Nipple Retraction', 'Mass']",
  "['Skin Retraction', 'Nipple Retraction']",
  "['Skin Thickening', 'Asymmetry']",
  "['Skin Thickening', 'Focal Asymmetry']",
  "['Skin Thickening', 'Nipple Retraction']",
  "['Suspicious Calcification', 'Architectural Distortion', 'Focal Asymmetry']",
  "['Suspicious Calcification', 'Architectural Distortion', 'Mass']",
  "['Suspicious Calcification', 'Asymmetry']"},
 {"['Focal Asymmetry', 'Mass']",
  "['Skin Thickening', 'Global Asymmetry', 'Nipple Retraction']",
  "['Skin Thickening', 'Mass']",
  "['Suspicious Calcification', 'Architectural Distortion', 'Nipple Retraction', 'Skin Retraction']"})

In [82]:
final_df = final_dataset.copy()

In [84]:
for finding in missing_in_train:
    rows_to_add = final_df[
        (final_df["finding_categories"] == finding) & (final_df["fold"] == "test")
    ].copy()
    rows_to_add["fold"] = "training"
    final_df = pd.concat([final_df, rows_to_add], ignore_index=True)

for finding in missing_in_test:
    rows_to_add = final_df[
        (final_df["finding_categories"] == finding) & (final_df["fold"] == "training")
    ].copy()
    rows_to_add["fold"] = "test"
    final_df = pd.concat([final_df, rows_to_add], ignore_index=True)
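Since the duplication deliberately places the same images in both folds, it can help to record which images are affected. The sketch below applies the same idea to a toy frame, then checks that every category appears in both folds and lists the leaked `image_id` values. The toy data and variable names are hypothetical.

```python
import pandas as pd

# Toy data: "Mass" is in both folds, "Skin Retraction" only in training.
toy = pd.DataFrame(
    {
        "image_id": ["a", "b", "c"],
        "finding_categories": ["['Mass']", "['Mass']", "['Skin Retraction']"],
        "fold": ["training", "test", "training"],
    }
)

# Copy each category's rows into whichever fold it is missing from.
folds_present = toy.groupby("finding_categories")["fold"].unique()
copies = []
for category, folds in folds_present.items():
    for missing_fold in {"training", "test"} - set(folds):
        rows = toy[toy["finding_categories"] == category].copy()
        rows["fold"] = missing_fold
        copies.append(rows)
patched = pd.concat([toy, *copies], ignore_index=True)

# Every category should now appear in both folds.
coverage = patched.groupby("finding_categories")["fold"].nunique()

# Images that sit in more than one fold are the leaked ones.
folds_per_image = patched.groupby("image_id")["fold"].nunique()
leaked_ids = sorted(folds_per_image[folds_per_image > 1].index)

print(coverage.to_dict())
print(leaked_ids)
```

Keeping the list of leaked image IDs makes it possible to report test metrics with and without those rows.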

In [91]:

# Check the distribution of findings in each fold
print("\nDistribution of findings in training fold:")
print(final_df[final_df['fold'] == 'training']['finding_categories'].value_counts())

print("\nDistribution of findings in test fold:")
print(final_df[final_df['fold'] == 'test']['finding_categories'].value_counts())


Distribution of findings in training fold:
finding_categories
['No Finding']                                                                                      2098
['Mass']                                                                                             917
['Suspicious Calcification']                                                                         321
['Focal Asymmetry']                                                                                  185
['Architectural Distortion']                                                                          77
['Asymmetry']                                                                                         70
['Suspicious Calcification', 'Mass']                                                                  67
['Suspicious Lymph Node']                                                                             46
['Skin Thickening']                                                                              

In [87]:
final_df.to_csv('..\\metadata\\stratified_local_balanced_v2.csv', index=False)