# Merged notebook containing code and learnings from the CXR project
This notebook is meant as a starting point for the master's thesis neural net. Ideas, learnings, and code from previous notebooks are gathered and described here to summarize the current status of the project.

## General components that can be kept
*1. Preamble*<br>
Import of packages<br>
Configuration variables like file paths, boolean switches, numeric settings, etc.

*2. Getting the data*<br>
Function definitions<br>
*CHANGES NECESSARY* Read metadata and image files and convert them to a dataframe<br>
*CHANGES NECESSARY* Unify labeling of datasets<br>
Train / Test / Val split<br>
Shuffling data

*3. Data preprocessing*<br>
Image augmentation<br>
Data generator class

*4. Model training*<br>
**TODO: Think about a concept for comparing the nets**<br>
Define neural net architecture<br>
Model settings like learning rate reduction, early stopping, and model save settings<br>
Training the model

*5. Model evaluation*<br>
Check performance metrics such as:
- accuracy
- loss
- recall
- f1

Evaluate generalizability on validation data<br>
Show confusion matrix

# 1. Preamble

In [84]:
import pandas as pd
import math
import tensorflow as tf
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import cv2
import os
from pathlib import Path
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import activations
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, Callback, ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.layers import Flatten, Dense, Dropout, BatchNormalization, AveragePooling2D, Conv2D, MaxPool2D, Activation, GlobalAveragePooling2D, Lambda
from sklearn.metrics import classification_report, confusion_matrix
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

In [85]:
# Config
IMG_SIZE = 256
BATCH_SIZE = 32
CHANNELS = 2
USE_MASKS = True

MODELSAVE = "models/cnn_gray_030121"

# Append a suffix that records whether masks were used
if USE_MASKS:
    MODELSAVE += "_w_masks.h5"
else:
    MODELSAVE += "_wo_masks.h5"

pd.set_option('display.max_colwidth', None)

base_path = '/mnt/f/DatasetsChestXRay/'

datasets = {
    'padchest': 'BIMCV-PadChest/',
    'cxr14': 'ChestX-ray14_NationalInstituesofHealthClinicalCenter/',
    'chexpert': 'CheXpert/',
    'kermany': 'KermanyChildStudy/',
    'mimic': 'MIMIC-CXR/',
    'openi': 'Open-i_IndianaUniversityNetworkforPatientCare/'
}

In [86]:
class dataset:
    def __init__(self, name, folder):
        self.name = name
        self.folder = folder
        self.path = base_path + self.folder
        self.df = self.create_dataset()
        self.train = None
        self.test = None

    def create_dataset(self):
        """Build a dataframe with one row per image found under self.path."""
        records = []
        # each dataset folder contains one subfolder per class
        for label_folder in ['normal/', 'pneumonia/']:
            # the subfolder name determines the pneumonia label
            label_pneumonia = 1 if label_folder == 'pneumonia/' else 0
            for filename in os.listdir(self.path + label_folder):
                img_path = self.path + label_folder + str(filename)
                # label_viral and label_covid are not available yet
                records.append({'img': img_path, 'dataset': self.name, 'label_pneumonia': label_pneumonia, 'label_viral': None, 'label_covid': None})

        print("Successfully created " + self.name + " dataset")
        return pd.DataFrame(records)

    def train_test_split(self, test_size=0.2, random_state=200):
        self.train, self.test = train_test_split(self.df, test_size=test_size, random_state=random_state)
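
The outline lists a Train / Test / Val split, while the class above only produces train and test parts. A minimal sketch of a three-way split follows, assuming two consecutive calls to scikit-learn's `train_test_split` and stratification on `label_pneumonia`. The function name `train_val_test_split`, its default sizes, and the toy dataframe are hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def train_val_test_split(df, val_size=0.1, test_size=0.2, random_state=200,
                         label_col="label_pneumonia"):
    """Split a dataframe into train, validation and test parts.

    Stratifying on `label_col` keeps the class ratio similar in all parts.
    """
    # Convert the fractions to absolute row counts of the full dataframe.
    n_test = round(test_size * len(df))
    n_val = round(val_size * len(df))
    # First cut off the test set.
    train_val, test = train_test_split(
        df, test_size=n_test, random_state=random_state,
        stratify=df[label_col])
    # Then take the validation rows out of the remainder.
    train, val = train_test_split(
        train_val, test_size=n_val, random_state=random_state,
        stratify=train_val[label_col])
    return train, val, test


# Toy dataframe standing in for dataset.df (file names are made up).
toy_df = pd.DataFrame({
    "img": [f"img_{i}.png" for i in range(100)],
    "label_pneumonia": [0] * 70 + [1] * 30,
})
train, val, test = train_val_test_split(toy_df)
print(len(train), len(val), len(test))
```

Because `train_test_split` shuffles by default, the rows of each part are already in random order.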

# 2. Load Data

In [90]:
# dictionary of all datasets
dataset_dict = {}

for dataset_name, folder_name in datasets.items():
    tmp_dataset = dataset(dataset_name, folder_name)
    tmp_dataset.train_test_split()
    dataset_dict[dataset_name] = tmp_dataset



Successfully created padchest dataset
Successfully created cxr14 dataset
Successfully created chexpert dataset
Successfully created kermany dataset
Successfully created mimic dataset
Successfully created openi dataset


In [88]:
dataset_dict['padchest'].df.head()

| | img | dataset | label_pneumonia | label_viral | label_covid |
|---|---|---|---|---|---|
| 0 | /mnt/f/DatasetsChestXRay/BIMCV-PadChest/normal/216840111366964012558082906712009307140128467_00-103-132.png | padchest | 0 | None | None |
| 1 | /mnt/f/DatasetsChestXRay/BIMCV-PadChest/normal/216840111366964012283393834152009029131935409_00-012-142.png | padchest | 0 | None | None |
| 2 | /mnt/f/DatasetsChestXRay/BIMCV-PadChest/normal/216840111366964012558082906712009338102001322_00-084-144.png | padchest | 0 | None | None |
| 3 | /mnt/f/DatasetsChestXRay/BIMCV-PadChest/normal/331526407267774289321200683504949782995_l8wza5.png | padchest | 0 | None | None |
| 4 | /mnt/f/DatasetsChestXRay/BIMCV-PadChest/normal/216840111366964012373310883942009163103156006_00-100-066.png | padchest | 0 | None | None |


In [91]:
dataset_dict['padchest'].train.head()

| | img | dataset | label_pneumonia | label_viral | label_covid |
|---|---|---|---|---|---|
| 1936 | /mnt/f/DatasetsChestXRay/BIMCV-PadChest/normal/216840111366964013076187734852011280100757461_00-195-078.png | padchest | 0 | None | None |
| 3952 | /mnt/f/DatasetsChestXRay/BIMCV-PadChest/normal/216840111366964013307756408102012073090901423_01-091-187.png | padchest | 0 | None | None |
| 4730 | /mnt/f/DatasetsChestXRay/BIMCV-PadChest/normal/205831107437374337656296325098340678881_egcigv.png | padchest | 0 | None | None |
| 749 | /mnt/f/DatasetsChestXRay/BIMCV-PadChest/normal/216840111366964012487858717522009231124942752_00-003-065.png | padchest | 0 | None | None |
| 7084 | /mnt/f/DatasetsChestXRay/BIMCV-PadChest/pneumonia/216840111366964013451228379692012298103355408_01-116-066.png | padchest | 1 | None | None |
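
To train on several datasets at once, the per-dataset splits have to be merged and shuffled. The sketch below shows one way to do this with `pd.concat` and `DataFrame.sample`. The helper name `combine_and_shuffle` and the toy frames are assumptions for illustration, not code from the earlier notebooks.

```python
import pandas as pd


def combine_and_shuffle(frames, random_state=200):
    """Concatenate per-dataset dataframes and shuffle the rows."""
    combined = pd.concat(frames, ignore_index=True)
    # sample(frac=1) returns every row once in random order;
    # reset_index then gives a clean 0..n-1 index.
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy frames standing in for the .train attribute of two datasets
# (file names are made up).
frame_a = pd.DataFrame({"img": ["a0.png", "a1.png"],
                        "dataset": "a",
                        "label_pneumonia": [0, 1]})
frame_b = pd.DataFrame({"img": ["b0.png", "b1.png", "b2.png"],
                        "dataset": "b",
                        "label_pneumonia": [0, 0, 1]})
train_all = combine_and_shuffle([frame_a, frame_b])
print(train_all["dataset"].value_counts())
```

With the objects defined in this notebook, the call would presumably look like `combine_and_shuffle([d.train for d in dataset_dict.values()])`.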


# 3. Data Preprocessing
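
The data generator class from the outline is not filled in yet. As a starting point, here is a minimal NumPy-only sketch of a batch generator that implements `__len__`, `__getitem__` and `on_epoch_end`, the methods a `tf.keras.utils.Sequence` subclass would provide. The class name `BatchGenerator`, the `load_fn` parameter and `fake_loader` are hypothetical. A real loader would read and resize the image file, and how the mask channel is produced is not covered here.

```python
import math

import numpy as np
import pandas as pd


class BatchGenerator:
    """Serve (images, labels) batches from a dataframe of image paths."""

    def __init__(self, df, load_fn, batch_size=32, shuffle=True, seed=200):
        self.df = df.reset_index(drop=True)
        self.load_fn = load_fn  # callable: path -> array of shape (H, W, C)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = np.random.default_rng(seed)
        self.order = np.arange(len(self.df))
        self.on_epoch_end()

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.df) / self.batch_size)

    def __getitem__(self, idx):
        rows = self.order[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch = self.df.iloc[rows]
        images = np.stack([self.load_fn(p) for p in batch["img"]]).astype("float32")
        labels = batch["label_pneumonia"].to_numpy(dtype="float32")
        return images, labels

    def on_epoch_end(self):
        # Reshuffle the row order between epochs.
        if self.shuffle:
            self.rng.shuffle(self.order)


def fake_loader(path, size=256, channels=2):
    """Placeholder that returns a blank image instead of reading a file."""
    return np.zeros((size, size, channels), dtype="float32")


# Toy dataframe with made-up file names.
toy_df = pd.DataFrame({"img": [f"img_{i}.png" for i in range(70)],
                       "label_pneumonia": [0, 1] * 35})
gen = BatchGenerator(toy_df, fake_loader, batch_size=32)
images, labels = gen[0]
print(len(gen), images.shape, labels.shape)
```

Passing the loader in as a function keeps file reading and augmentation separate from batching, so both can be swapped without touching the generator.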

# 4. Model Training
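
The model settings named in the outline (learning rate reduction, early stopping, model saving) map onto the three Keras callbacks imported in the preamble. The following is a sketch only: the monitored quantity, the patience values, the reduction factor and the file path are assumptions that still need tuning for this project.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

# Hypothetical output path; the notebook builds its own path in MODELSAVE.
model_path = "models/example_model.h5"

callbacks = [
    # Halve the learning rate when validation loss has not improved for 3 epochs.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3, min_lr=1e-6, verbose=1),
    # Stop after 10 epochs without improvement and restore the best weights.
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True, verbose=1),
    # Write the model to disk only when validation loss improves.
    ModelCheckpoint(model_path, monitor="val_loss", save_best_only=True, verbose=1),
]
```

The list is then passed to `model.fit` through its `callbacks` argument, together with validation data so that `val_loss` is available.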

# 5. Model Evaluation
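
The metrics listed in the outline (accuracy, loss, recall, f1) and the confusion matrix can all be computed with scikit-learn once predictions are available. The sketch below uses made-up labels and probabilities in place of real model output. The decision threshold of 0.5 is an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, recall_score)

# Toy ground truth and predicted probabilities standing in for model output.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.6, 0.2, 0.9, 0.8, 0.3, 0.7, 0.55, 0.45])

# Assumed threshold of 0.5; it can be tuned on validation data.
y_pred = (y_prob >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "loss": log_loss(y_true, y_prob),  # binary cross-entropy
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
# Rows are the true class, columns the predicted class.
cm = confusion_matrix(y_true, y_pred)
print(metrics)
print(cm)
```

In this toy case the matrix contains 3 true negatives, 1 false positive, 2 false negatives and 4 true positives, which gives an accuracy of 0.7 and a recall of 4/6.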