# Pre-Processing
## Business Problem
Leukemia is a cancer of the blood that often affects young people. Traditionally, pathologists have diagnosed patients by visually examining blood smears under the microscope, which is time-consuming and tedious. Image recognition technology has advanced considerably, so automated, computer-based solutions could be of great benefit to the medical community as an aid in cancer diagnosis.

The goal of this project is to address the following question: How can the doctors at the Munich University Hospital automate the diagnosis of patients with leukemia using images from blood smears?

**Plan:**
1. Load annotations file.
2. Create features and labels.
3. Perform train-test split.
    1. Without stratify.
    2. With stratify.
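
The difference between the two split variants in step 3 can be illustrated with a minimal sketch. The toy arrays and the `OTHER` label below are hypothetical stand-ins, not the project's data; only `train_test_split` and its `stratify` parameter come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 samples of a common class and 10 of a rare one.
X_toy = np.arange(100)
y_toy = np.array(['OTHER'] * 90 + ['BAS'] * 10)

# Plain random split: the rare-class share in the test set may drift.
_, _, _, y_test_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

# Stratified split: each class keeps its overall proportion in both subsets.
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)


def class_share(labels, name):
    """Return the fraction of entries in `labels` equal to `name`."""
    return float(np.mean(labels == name))


print('BAS share without stratify:', class_share(y_test_plain, 'BAS'))
print('BAS share with stratify:   ', class_share(y_test_strat, 'BAS'))
```

With stratification the rare class keeps its 10% share in the test set, whereas the plain split depends on the random seed.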

In [1]:
import os
import sys
sys.path.append('..')

import cv2 as cv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

from src import constants as con
from src.visualization import visualize as viz

# %load_ext autoreload
# %autoreload 2
%reload_ext autoreload
%matplotlib inline

## Load Annotations File
The annotations file contains the list of images, which will be used as the features, and the list of morphologies, which are the labels.

In [2]:
df_annotate = pd.read_csv(os.path.join(con.REFERENCES_DIR, 'annotations.dat'), sep=' ', 
                          names=['Image Dir', 'Morphology', 'First Re-Annotation', 'Second Re-Annotation'])

In [3]:
df_annotate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18365 entries, 0 to 18364
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Image Dir             18365 non-null  object
 1   Morphology            18365 non-null  object
 2   First Re-Annotation   1905 non-null   object
 3   Second Re-Annotation  1905 non-null   object
dtypes: object(4)
memory usage: 574.0+ KB
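
The `info()` output shows that only 1905 of the 18365 rows (about 10%) carry re-annotations. As a rough sketch of how that share could be computed directly, the snippet below uses a small hypothetical DataFrame (`df_demo`, with made-up rows and an `OTHER` label) rather than the real annotations file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the annotations table; values are invented.
df_demo = pd.DataFrame({
    'Morphology': ['BAS', 'BAS', 'OTHER', 'OTHER'],
    'First Re-Annotation': ['BAS', np.nan, np.nan, np.nan],
})

# Non-null count per column, the same numbers info() reports.
non_null = df_demo.notna().sum()

# Fraction of rows that were re-annotated.
share = df_demo['First Re-Annotation'].notna().mean()

print(non_null)
print(f'Re-annotated share: {share:.2%}')
```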


## Features and Labels
Select the features and labels from the annotations file.

In [4]:
X = df_annotate['Image Dir'].values
y = df_annotate['Morphology'].values

Print out a few items in the features array. These are image file names.

In [5]:
X[:5]

array(['BAS/BAS_0001.tiff', 'BAS/BAS_0002.tiff', 'BAS/BAS_0003.tiff',
       'BAS/BAS_0004.tiff', 'BAS/BAS_0005.tiff'], dtype=object)

Print out a few of the label values. These are leukocyte morphology types.

In [6]:
y[:5]

array(['BAS', 'BAS', 'BAS', 'BAS', 'BAS'], dtype=object)
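
Because the first labels are all `BAS`, it may be worth checking how the classes are distributed before splitting. A minimal sketch, assuming a small hypothetical label array (`y_demo`, with an invented `OTHER` class) in place of the real `y`:

```python
import numpy as np

# Hypothetical label array standing in for `y`; values are invented.
y_demo = np.array(['BAS', 'BAS', 'OTHER', 'OTHER', 'OTHER', 'OTHER'],
                  dtype=object)

# Count the occurrences of each distinct label.
classes, counts = np.unique(y_demo, return_counts=True)
label_counts = dict(zip(classes, counts))

for name, n in label_counts.items():
    print(f'{name}: {n} ({n / len(y_demo):.1%})')
```

A strongly skewed distribution would be one reason to prefer the stratified split.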