# Preprocessing

## Import

In [1]:
import os
import pandas as pd

import cv2 as cv
from sklearn.model_selection import train_test_split

%run ../scripts/save_utils.py

## Data split

First of all, we need to properly split the data.  
  
I have merged the original train and test data into a single folder, with one subfolder per class. Now we need to pair each image with its label.  
  
I decided to build a dataframe with the image path and its label as columns:

In [2]:
images_path = '../data/raw/merged_data'

In [3]:
def get_folder_names(directory_path):
    entries = os.listdir(directory_path)
    folders = [entry for entry in entries if os.path.isdir(os.path.join(directory_path, entry))]
    return folders

def images_to_dataframe(directory_path):
    folders = get_folder_names(directory_path)
    file_paths = []
    labels = []

    for folder in folders:
        folder_path = os.path.join(directory_path, folder)
        files = os.listdir(folder_path)
        for file in files:
            file_path = os.path.relpath(os.path.join(folder_path, file), directory_path)
            file_path = file_path.replace('\\', '/')  # Replace backslashes with forward slashes
            file_paths.append(file_path)
            labels.append(folder)

    df = pd.DataFrame({
        'image_path': file_paths,
        'label': labels
    })

    # Shuffle the rows; no random_state is set here, so the order differs between runs
    df = df.sample(frac=1).reset_index(drop=True)

    return df

In [4]:
df = images_to_dataframe(images_path)

In [5]:
print(df)

                     image_path       label
0     meningioma/Tr-me_0905.jpg  meningioma
1      pituitary/Te-pi_0208.jpg   pituitary
2         glioma/Tr-gl_1317.jpg      glioma
3        notumor/Te-no_0095.jpg     notumor
4         glioma/Tr-gl_0917.jpg      glioma
...                         ...         ...
7018  meningioma/Te-me_0045.jpg  meningioma
7019     notumor/Te-no_0280.jpg     notumor
7020  meningioma/Tr-me_0820.jpg  meningioma
7021     notumor/Tr-no_0719.jpg     notumor
7022   pituitary/Tr-pi_0735.jpg   pituitary

[7023 rows x 2 columns]
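
The folder-per-class scan used above can also be written with `pathlib`. The following is only a minimal sketch of that alternative, not the notebook's own function: the name `list_labeled_images` and the toy folder layout are hypothetical, and entries are sorted so the result is deterministic before any shuffling.

```python
import tempfile
from pathlib import Path


def list_labeled_images(root):
    """Return sorted (relative_path, label) pairs for files one level below root."""
    root = Path(root)
    rows = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image in sorted(p for p in class_dir.iterdir() if p.is_file()):
            # as_posix() yields forward slashes on every platform
            rows.append((image.relative_to(root).as_posix(), class_dir.name))
    return rows


# Hypothetical toy layout: one subfolder per class, created in a temp directory
with tempfile.TemporaryDirectory() as tmp:
    for label, name in [("glioma", "a.jpg"), ("glioma", "b.jpg"), ("notumor", "c.jpg")]:
        folder = Path(tmp) / label
        folder.mkdir(exist_ok=True)
        (folder / name).touch()
    rows = list_labeled_images(tmp)

print(rows)
```

The resulting pairs can be passed to `pd.DataFrame(rows, columns=['image_path', 'label'])` to obtain the same two-column layout.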


Now we have a dataframe containing the paths to all images and their respective labels.  
Let's proceed to splitting:

In [6]:
# Hold out 20% of all data as the test set
x_tmp, x_test, y_tmp, y_test = train_test_split(df['image_path'], df['label'], test_size=0.2,
                                                shuffle=True, random_state=73, stratify=df['label'])

# 25% of the remaining 80% is 20% of all data, which leaves 60% for training
x_train, x_val, y_train, y_val = train_test_split(x_tmp, y_tmp, test_size=0.25,
                                                  shuffle=True, random_state=73, stratify=y_tmp)

Each split keeps the index values of the rows it was drawn from, so the indices are no longer consecutive. We reset them:

In [7]:
data = [x_train, y_train, x_val, y_val, x_test, y_test]
for entry in data:
    entry.reset_index(drop=True, inplace=True)

Now we have 3 separate sets: **train**, with **60%** of the initial data, and **validation** and **test**, **each** containing **20%** of the initial data.  
  
Let's now check how many samples each set has:

In [8]:
print('Number of samples in:')
print('  train set:      ', x_train.shape[0])
print('  validation set: ', x_val.shape[0])
print('  test set:       ', x_test.shape[0])

Number of samples in:
  train set:       4213
  validation set:  1405
  test set:        1405
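
These counts follow from the two split fractions: 20% of 7023 images go to the test set, and 25% of the remainder goes to validation. The sketch below redoes that arithmetic, assuming the held-out part of each split is rounded up (an assumption that reproduces the counts printed above); `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_kept, n_held_out), rounding the held-out part up (assumed)."""
    n_held_out = math.ceil(test_fraction * n_samples)
    return n_samples - n_held_out, n_held_out


n_total = 7023
n_tmp, n_test = split_sizes(n_total, 0.2)    # first split: 80% / 20%
n_train, n_val = split_sizes(n_tmp, 0.25)    # second split: 75% / 25% of the remainder

for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {n} samples ({n / n_total:.1%} of all data)")
```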


Finally, we save the splits so they can be reused later when building the actual models:

In [9]:
save_data(x_train, 'train_data', '../save_files/variables/x_train.pkl')
save_data(y_train, 'train_data', '../save_files/variables/y_train.pkl')
save_data(x_val, 'validation_data', '../save_files/variables/x_val.pkl')
save_data(y_val, 'validation_data', '../save_files/variables/y_val.pkl')
save_data(x_test, 'test_data', '../save_files/variables/x_test.pkl')
save_data(y_test, 'test_data', '../save_files/variables/y_test.pkl')

object successfuly saved to: ../save_files/variables/x_train.pkl
object successfuly saved to: ../save_files/variables/y_train.pkl
object successfuly saved to: ../save_files/variables/x_val.pkl
object successfuly saved to: ../save_files/variables/y_val.pkl
object successfuly saved to: ../save_files/variables/x_test.pkl
object successfuly saved to: ../save_files/variables/y_test.pkl
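
`save_data` comes from `../scripts/save_utils.py`, which is not shown here. The `.pkl` extension suggests pickle serialization; assuming that, a save-and-load round trip might look like the sketch below. The names `save_object` and `load_object` are hypothetical stand-ins, not the project's helpers, and the example writes to a temporary directory rather than `../save_files`.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, path):
    """Pickle obj to path, creating parent folders if needed (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Load a pickled object back from path (hypothetical helper)."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Round trip with toy data standing in for one of the saved splits
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "variables" / "x_train.pkl"
    paths = ["glioma/a.jpg", "notumor/c.jpg"]
    save_object(paths, target)
    restored = load_object(target)

print(restored == paths)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.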
