# **Deep Learning Project - Dataset Cleaning**

---

## Table of Contents
1. [Introduction](#Introduction)
2. [Imports](#Imports)  
3. [Data Preparation](#data-preparation)
   - [Missing Values](#missing-values)
4. [Export Cleaned CSV](#export-csv-cleaned)
   
----

# Introduction

Breast cancer is one of the most common types of cancer worldwide, so early and accurate diagnosis is crucial for effective treatment. Breast Cancer Awareness Month, observed each October, highlights the importance of early detection and research in combating this disease. This project uses the BreakHis dataset of histopathological breast tissue images to explore two tasks: binary classification (benign vs. malignant tumor) and multi-class classification (specific tumor subtype).

# Imports

In [1]:
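import pandas as pd  # used below as `pd`; made explicit rather than relying on the wildcard import
from functions import *  # project helpers, including extract_info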

# Data Preparation

In [None]:
# Path to the directory
image_dir = "YOUR PATH HERE"

# Path to the CSV
data_path = 'image_data.csv'

In [3]:
# Set pandas to display full content to be able to see the full image path
pd.set_option('display.max_colwidth', None)

In [4]:
df = pd.read_csv(data_path)
df.head(5)

Unnamed: 0,path_to_image,Benign or Malignant,Cancer Type,Magnification
0,BreaKHis_v1/histology_slides/breast/benign/SOB/adenosis/SOB_B_A_14-22549AB/100X/SOB_B_A-14-22549AB-100-011.png,Benign,Adenosis,100X
1,BreaKHis_v1/histology_slides/breast/benign/SOB/adenosis/SOB_B_A_14-22549AB/100X/SOB_B_A-14-22549AB-100-005.png,Benign,Adenosis,100X
2,BreaKHis_v1/histology_slides/breast/benign/SOB/adenosis/SOB_B_A_14-22549AB/100X/SOB_B_A-14-22549AB-100-004.png,Benign,Adenosis,100X
3,BreaKHis_v1/histology_slides/breast/benign/SOB/adenosis/SOB_B_A_14-22549AB/100X/SOB_B_A-14-22549AB-100-010.png,Benign,Adenosis,100X
4,BreaKHis_v1/histology_slides/breast/benign/SOB/adenosis/SOB_B_A_14-22549AB/100X/SOB_B_A-14-22549AB-100-006.png,Benign,Adenosis,100X


In [5]:
df.shape

(7909, 4)

In [6]:
df[df.duplicated()]

Unnamed: 0,path_to_image,Benign or Malignant,Cancer Type,Magnification


## Missing Values

In [7]:
# Displays DataFrame info, memory usage, and checks for missing values
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7909 entries, 0 to 7908
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   path_to_image        7909 non-null   object
 1   Benign or Malignant  7906 non-null   object
 2   Cancer Type          7905 non-null   object
 3   Magnification        7905 non-null   object
dtypes: object(4)
memory usage: 2.8 MB


As shown, the dataset contains a few missing values: of the 7909 rows, 3 have no 'Benign or Malignant' entry, and 4 each have no 'Cancer Type' and no 'Magnification' entry. Let's take a closer look.
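A per-column count makes these gaps explicit. The sketch below uses a small stand-in DataFrame (the `toy` frame and its values are illustrative, not the real data); on the real data the equivalent call would be `df.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real DataFrame; values are illustrative only.
toy = pd.DataFrame({
    "path_to_image": ["a.png", "b.png", "c.png"],
    "Benign or Malignant": ["Benign", np.nan, "Malignant"],
    "Cancer Type": ["Adenosis", np.nan, np.nan],
})

# Number of missing entries per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```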

In [8]:
# Loop through each column in the DataFrame to display unique values
for column in df.columns:
    # Print the name of the column and the unique values within it
    print(f"Unique values on '{column}':")
    print(df[column].unique())
    print("\n")

Unique values on 'path_to_image':
['BreaKHis_v1/histology_slides/breast/benign/SOB/adenosis/SOB_B_A_14-22549AB/100X/SOB_B_A-14-22549AB-100-011.png'
 'BreaKHis_v1/histology_slides/breast/benign/SOB/adenosis/SOB_B_A_14-22549AB/100X/SOB_B_A-14-22549AB-100-005.png'
 'BreaKHis_v1/histology_slides/breast/benign/SOB/adenosis/SOB_B_A_14-22549AB/100X/SOB_B_A-14-22549AB-100-004.png'
 ...
 'BreaKHis_v1/histology_slides/breast/malignant/SOB/lobular_carcinoma/SOB_M_LC_14-12204/200X/SOB_M_LC-14-12204-200-006.png'
 'BreaKHis_v1/histology_slides/breast/malignant/SOB/lobular_carcinoma/SOB_M_LC_14-12204/200X/SOB_M_LC-14-12204-200-039.png'
 'BreaKHis_v1/histology_slides/breast/malignant/SOB/lobular_carcinoma/SOB_M_LC_14-12204/200X/SOB_M_LC-14-12204-200-038.png']


Unique values on 'Benign or Malignant':
['Benign' 'Malignant' nan]


Unique values on 'Cancer Type':
['Adenosis' 'Tubular Adenoma' 'Fibroadenoma' 'Phyllodes Tumor'
 'Mucinous Carcinoma' nan 'Papillary Carcinoma' 'Ductal Carcinoma'
 'Lobular Car

In [9]:
# Filter rows with any missing values
df_missing_values = df[df.isnull().any(axis=1)]

# Display the rows with missing values
df_missing_values

Unnamed: 0,path_to_image,Benign or Malignant,Cancer Type,Magnification
2871,BreaKHis_v1/histology_slides/breast/malignant/SOB/mucinous_carcinoma/SOB_M_MC_14-18842/100X/SOB_M_MC-14-18842-100-014.png,,,
3093,BreaKHis_v1/histology_slides/breast/malignant/SOB/mucinous_carcinoma/SOB_M_MC_14-13413/200X/SOB_M_MC-14-13413-200-010.png,Malignant,,
3228,BreaKHis_v1/histology_slides/breast/malignant/SOB/mucinous_carcinoma/SOB_M_MC_14-10147/400X/SOB_M_MC-14-10147-400-013.png,,,
4536,BreaKHis_v1/histology_slides/breast/malignant/SOB/ductal_carcinoma/SOB_M_DC_14-14926/40X/SOB_M_DC-14-14926-40-012.png,,,


Only four rows have missing values, and all of the missing fields are encoded in the image path itself: the directory names give the class (benign or malignant), the cancer subtype, and the magnification. We can therefore recover the missing values from the paths.
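The `extract_info` helper used in the next cell comes from the project's `functions` module, which is not shown here. As a rough sketch of how such a helper might work, assuming the directory layout seen in the paths above (`.../breast/<class>/SOB/<subtype>/<slide>/<magnification>/<file>.png`), the hypothetical `extract_info_sketch` below reads the three labels from the path components. Its name and the title-casing of the labels are assumptions, not the project's actual implementation.

```python
from pathlib import PurePosixPath


def extract_info_sketch(path):
    """Hypothetical helper: derive (class, subtype, magnification) from a path."""
    parts = PurePosixPath(path).parts
    # Locate the 'breast' directory; the labels sit at fixed offsets after it.
    i = parts.index("breast")
    tumor_class = parts[i + 1].capitalize()           # 'malignant' -> 'Malignant'
    subtype = parts[i + 3].replace("_", " ").title()  # 'mucinous_carcinoma' -> 'Mucinous Carcinoma'
    magnification = parts[i + 5]                      # e.g. '100X'
    return tumor_class, subtype, magnification


example = (
    "BreaKHis_v1/histology_slides/breast/malignant/SOB/mucinous_carcinoma/"
    "SOB_M_MC_14-18842/100X/SOB_M_MC-14-18842-100-014.png"
)
print(extract_info_sketch(example))
```

Indexing relative to the `breast` directory, rather than from the start of the path, keeps the sketch working if a base directory is later prepended to the paths.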

In [10]:
# Derive the three label columns from each image path.
# Note: this reassigns the labels for every row, not only the rows with missing values.
df[['Benign or Malignant', 'Cancer Type', 'Magnification']] = df['path_to_image'].apply(
    lambda x: pd.Series(extract_info(x)))


In [11]:
# df_missing_values was captured before the fill, so it still shows the original gaps;
# it is displayed here only to recall the index numbers of the affected rows
df_missing_values

Unnamed: 0,path_to_image,Benign or Malignant,Cancer Type,Magnification
2871,BreaKHis_v1/histology_slides/breast/malignant/SOB/mucinous_carcinoma/SOB_M_MC_14-18842/100X/SOB_M_MC-14-18842-100-014.png,,,
3093,BreaKHis_v1/histology_slides/breast/malignant/SOB/mucinous_carcinoma/SOB_M_MC_14-13413/200X/SOB_M_MC-14-13413-200-010.png,Malignant,,
3228,BreaKHis_v1/histology_slides/breast/malignant/SOB/mucinous_carcinoma/SOB_M_MC_14-10147/400X/SOB_M_MC-14-10147-400-013.png,,,
4536,BreaKHis_v1/histology_slides/breast/malignant/SOB/ductal_carcinoma/SOB_M_DC_14-14926/40X/SOB_M_DC-14-14926-40-012.png,,,


In [12]:
# Check that the previously incomplete rows are now filled in
df.loc[df_missing_values.index]

Unnamed: 0,path_to_image,Benign or Malignant,Cancer Type,Magnification
2871,BreaKHis_v1/histology_slides/breast/malignant/SOB/mucinous_carcinoma/SOB_M_MC_14-18842/100X/SOB_M_MC-14-18842-100-014.png,Malignant,Mucinous Carcinoma,100X
3093,BreaKHis_v1/histology_slides/breast/malignant/SOB/mucinous_carcinoma/SOB_M_MC_14-13413/200X/SOB_M_MC-14-13413-200-010.png,Malignant,Mucinous Carcinoma,200X
3228,BreaKHis_v1/histology_slides/breast/malignant/SOB/mucinous_carcinoma/SOB_M_MC_14-10147/400X/SOB_M_MC-14-10147-400-013.png,Malignant,Mucinous Carcinoma,400X
4536,BreaKHis_v1/histology_slides/breast/malignant/SOB/ductal_carcinoma/SOB_M_DC_14-14926/40X/SOB_M_DC-14-14926-40-012.png,Malignant,Ductal Carcinoma,40X
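The visual check above can also be done programmatically. Because the fill step reassigns all three label columns for every row, it is worth confirming both that no gaps remain and that labels that were already present did not change. The sketch below shows the idea on toy frames; `before` and `after` are illustrative stand-ins for the DataFrame before and after the fill, not the real data.

```python
import numpy as np
import pandas as pd

cols = ["Benign or Malignant", "Cancer Type"]

# Toy stand-ins: `before` has gaps, `after` is the frame once they are filled.
before = pd.DataFrame({
    "Benign or Malignant": ["Benign", np.nan, "Malignant"],
    "Cancer Type": ["Adenosis", np.nan, np.nan],
})
after = pd.DataFrame({
    "Benign or Malignant": ["Benign", "Malignant", "Malignant"],
    "Cancer Type": ["Adenosis", "Ductal Carcinoma", "Ductal Carcinoma"],
})

# 1) No missing values remain in the label columns.
no_gaps = not after[cols].isnull().any().any()

# 2) Every cell that was already filled still holds the same value
#    (cells that were missing before are allowed to differ).
same_or_was_missing = (after[cols] == before[cols]) | before[cols].isna()
unchanged = bool(same_or_was_missing.all().all())

print(no_gaps, unchanged)
```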


# Export CSV Cleaned

In [13]:
# Optional: if the images are not in the same directory as the CSV file,
# prepend a base path to the image paths. The two strings are concatenated
# directly, so base_path should end with a path separator.
# base_path = "YOUR PATH HERE"
# df['path_to_image'] = base_path + df['path_to_image']

In [14]:
# Save the cleaned DataFrame to a CSV file
df.to_csv("df_clean.csv", index=False)