## Gathering Image Files

In its original layout, the [BreakHis](https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis/) dataset spreads its image files across folders organized by tumor subtype and then by individual patient. This notebook gathers the 40X-magnification images for all patients from those folders and copies them into a single folder called `Images`.

In [1]:
import os
import shutil

In [2]:
# Magnification to collect; each patient folder has one subfolder per magnification
mag = 40
mag_folder = '/{}X/'.format(mag)

In [3]:
root_dir = './BreaKHis_v1/histology_slides/breast'

src_folders = {'DC': '/malignant/SOB/ductal_carcinoma/',
               'LC': '/malignant/SOB/lobular_carcinoma/',
               'MC': '/malignant/SOB/mucinous_carcinoma/',
               'PC': '/malignant/SOB/papillary_carcinoma/',
               'A': '/benign/SOB/adenosis/',
               'F': '/benign/SOB/fibroadenoma/',
               'PT': '/benign/SOB/phyllodes_tumor/',
               'TA': '/benign/SOB/tubular_adenoma/'}

In [4]:
dst_dir = './Images/'
# Create the destination folder if it does not exist yet
os.makedirs(dst_dir, exist_ok=True)

In [5]:
counter = 0

# Walk each tumor-subtype folder, then each patient folder inside it
for subtype_path in src_folders.values():
    path = root_dir + subtype_path
    image_folders = [f.path + mag_folder for f in os.scandir(path) if f.is_dir()]
    for folder in image_folders:
        # Keep only the PNG images at the chosen magnification
        image_files = [f.path for f in os.scandir(folder) if f.name.endswith('.png')]
        for file in image_files:
            counter += 1
            shutil.copy(file, dst_dir)

In [6]:
print('Number of copied files is: {}'.format(counter))

Number of copied files is: 1995


**Note:** The total of 1995 copied files matches the number of 40X images reported in the [paper](http://www.inf.ufpr.br/lesoliveira/download/TBME-00608-2015-R2-preprint.pdf) by the curators of the dataset.
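
Because every image is copied into one flat folder, two source files that happened to share a name would overwrite each other while `counter` still increased. The following is a minimal sketch of a sanity check for that case, not part of the original notebook; the helper names `count_png_files` and `find_duplicate_names` are hypothetical and introduced only for illustration.

```python
import os
from collections import Counter


def count_png_files(folder):
    """Return the number of .png files directly inside `folder`."""
    return sum(1 for entry in os.scandir(folder)
               if entry.is_file() and entry.name.endswith('.png'))


def find_duplicate_names(paths):
    """Return the sorted base names that occur more than once in `paths`."""
    names = Counter(os.path.basename(p) for p in paths)
    return sorted(name for name, n in names.items() if n > 1)
```

Assuming the copy loop has already run, comparing `count_png_files(dst_dir)` with `counter` would show whether any file was overwritten. Passing the collected source paths to `find_duplicate_names` would list the colliding names, if there are any.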