## Gathering the Specifications of the Images

As indicated on the [BreakHis info page](https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis/), the name of each image file encodes the following information:
- the corresponding patient identification number, 
- the type of tumor $\rightarrow$ benign (B) or malignant (M),
- the subtype of tumor $\rightarrow$ the dataset contains four histologically distinct types of benign breast tumors: adenosis (A), fibroadenoma (F), phyllodes tumor (PT), and tubular adenoma (TA); and four malignant tumors (breast cancer): ductal carcinoma (DC), lobular carcinoma (LC), mucinous carcinoma (MC), and papillary carcinoma (PC),
- the magnification factor of that image, and
- the sequence number of the image for that particular patient.

In this notebook, using the folder of images prepared before, I extract the specifications of each image from its file name and store that information in a data frame.
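
As a minimal sketch of how a single file name decomposes into these fields, the snippet below applies a named-group version of the same pattern to one name taken from the dataset. The helper `parse_image_name` and the group names are hypothetical and chosen for illustration; in particular, the labels `procedure` and `year` for the two fields this notebook does not use are my assumption about what those fields mean.

```python
import re

# Named-group pattern for BreakHis file names.
# The group names are illustrative; `procedure` and `year` are assumed labels
# for the two fields that are not stored in the data frame.
NAME_PATTERN = re.compile(
    r'(?P<procedure>\w+)_(?P<tumor_type>\w)_(?P<subtype>\w+)'
    r'-(?P<year>\d+)-(?P<patient_id>\w+)'
    r'-(?P<magnification>\d+)-(?P<image_number>\d+)'
)


def parse_image_name(file_name):
    """Return the fields encoded in a file name, or None if it does not match."""
    match = NAME_PATTERN.search(file_name)
    if match is None:
        return None
    spec = match.groupdict()
    # Magnification and sequence number are numeric; the patient id stays a
    # string because some ids carry letter suffixes.
    spec['magnification'] = int(spec['magnification'])
    spec['image_number'] = int(spec['image_number'])
    return spec


example = parse_image_name('SOB_M_DC-14-13412-40-020.png')
print(example)
```

Returning `None` for a non-matching name lets a caller skip unexpected files instead of failing with an `IndexError`.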

In [1]:
import os
import re
import cv2
import pandas as pd

In [2]:
dst_dir = './Images/'

In [3]:
file_names = [f for f in os.listdir(dst_dir) if f.endswith('.png')]

In [4]:
df = pd.DataFrame(columns=['Image_Id', 'Patient_Id', 'Tumor_Type', 'Tumor_Subtype', 'Magnification', 'Image_Number'])

In [5]:
pids = []
iids = []
types = []
subtypes = []
mags = []
num = []

for file in file_names:
    data = re.findall(r'(\w+)_(\w)_(\w+)-(\d+)-(\w+)-(\d+)-(\d+)', file)
    types.append(data[0][1])
    subtypes.append(data[0][2])
    pids.append(data[0][4])
    mags.append(int(data[0][5]))
    num.append(int(data[0][6]))
    iids.append(file.split('.')[0])

df['Image_Id'] = iids
df['Patient_Id'] = pids
df['Tumor_Type'] = types
df['Tumor_Subtype'] = subtypes
df['Magnification'] = mags
df['Image_Number'] = num

In [6]:
df.head()

| | Image_Id | Patient_Id | Tumor_Type | Tumor_Subtype | Magnification | Image_Number |
|---|---|---|---|---|---|---|
| 0 | SOB_M_DC-14-13412-40-020 | 13412 | M | DC | 40 | 20 |
| 1 | SOB_M_DC-14-17915-40-009 | 17915 | M | DC | 40 | 9 |
| 2 | SOB_M_DC-14-20636-40-014 | 20636 | M | DC | 40 | 14 |
| 3 | SOB_B_TA-14-15275-40-010 | 15275 | B | TA | 40 | 10 |
| 4 | SOB_M_DC-14-17915-40-021 | 17915 | M | DC | 40 | 21 |


In [7]:
df['Patient_Id'].nunique()

81

In [8]:
df['Patient_Id'].unique()

array(['13412', '17915', '20636', '15275', '11520', '9461', '12312',
       '12773', '16184', '3411F', '14134', '25197', '9133', '13200',
       '17614', '15570C', '20629', '17901', '19440', '11951', '14134E',
       '15696', '14946', '15687B', '22549G', '22549CD', '15570',
       '22549AB', '21998AB', '6241', '18842', '12204', '5694', '4372',
       '16456', '18842D', '10147', '29315EF', '23222AB', '11031', '5695',
       '21998EF', '19979C', '16196', '16336', '16875', '5287', '2985',
       '15792', '23060AB', '12465', '21998CD', '23060CD', '14926',
       '13993', '13413', '29960CD', '16184CD', '19854C', '21978AB',
       '14015', '15704', '9146', '190EF', '16188', '16716', '15572',
       '16448', '2523', '22704', '2773', '18650', '2980', '10926',
       '19979', '3909', '29960AB', '8168', '16601', '13418DE', '4364'],
      dtype=object)

### Dropping Images with Non-Matching Size

According to the curators of [BreakHis](https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis/), all images are supposed to be 460 $\times$ 700 pixels (height $\times$ width). However, I found that for one of the patients the image height differs from 460. Since these few non-matching images would cause problems in later analysis of the dataset, I drop them from the data frame here.

In [None]:
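for index, row in df.iterrows():
    # Build the path from dst_dir, the image folder defined earlier
    file_name = os.path.join(dst_dir, row['Image_Id'] + '.png')
    img = cv2.imread(file_name)
    # cv2.imread returns None for unreadable files; img.shape[0] is the height
    if img is None or img.shape[0] != 460:
        df.drop(index=index, inplace=True)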
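
The same filtering can also be written with a boolean mask, which avoids modifying the data frame while iterating over it. The sketch below is one possible variant, not the method used in this notebook: the helper `keep_rows_with_height` is hypothetical, it assumes the image heights have already been collected (for example from `img.shape[0]`), and the toy ids and heights are made up for illustration.

```python
import pandas as pd


def keep_rows_with_height(specs, heights, expected_height=460):
    """Return only the rows whose image height equals expected_height.

    `heights` is assumed to hold one height per row of `specs`, in row order.
    """
    mask = pd.Series(list(heights), index=specs.index) == expected_height
    # Filtering returns a new data frame; the input is left unchanged.
    return specs[mask].reset_index(drop=True)


# Toy data: the ids and heights below are invented, not taken from BreakHis.
toy = pd.DataFrame({'Image_Id': ['img_a', 'img_b', 'img_c']})
filtered = keep_rows_with_height(toy, heights=[460, 456, 460])
print(filtered)
```

Because the function returns a new data frame, the unfiltered specifications remain available for comparison.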

In [9]:
df.to_csv('./specs.csv', index=False)