## Strategy for Pneumonia Classification

Record change in model performance across the following steps:

1. Build a base ML classifier using raw image data
2. Build an ML model using histogram data
3. Build a DL model using raw data
4. Augment the training data and build a DL model
5. Feature generation and selection for better pneumonia predictiveness
6. Use image cropping/segmentation to further improve pneumonia predictiveness
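
One lightweight way to record the change across these steps is a small log keyed by step name. The sketch below is an assumption about how such a log could look and is not part of the original skeleton: `record_step` and `performance_log` are hypothetical names, and the AUC values in the usage lines are placeholders, not measured results.

```python
# Hypothetical helper for tracking validation AUC across the strategy steps.
performance_log = []  # one dict per recorded step, in the order recorded

def record_step(step_name, auc):
    """Append a step and return the change in AUC relative to the previous step."""
    previous_auc = performance_log[-1]["auc"] if performance_log else None
    delta = None if previous_auc is None else auc - previous_auc
    performance_log.append({"step": step_name, "auc": auc, "delta": delta})
    return delta

# Placeholder numbers purely to show the call pattern; not real results.
record_step("1. base ML on raw pixels", 0.60)
change = record_step("2. ML on histograms", 0.65)
print(f"Change from step 1 to step 2: {change:+.3f}")
```

The same list can later be turned into a table with `pd.DataFrame(performance_log)` for reporting.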


## Skeleton Code

The code below provides a skeleton for the model building and training component of your project. Add to it, remove from it, or build on it however you see fit; it is meant only as a starting point.

In [1]:
!which python

/Users/eshaankirpal/miniforge3/envs/pneumonia-det/bin/python


In [17]:
!pip install mlflow
#!pip install tensorflow_decision_forests
# TF-DF requires Tensorflow < 2.15 or tf_keras
#!pip install tf_keras
#!pip install wurlitzer

Collecting mlflow
  Downloading mlflow-2.11.2-py3-none-any.whl.metadata (15 kB)
Collecting click<9,>=7.0 (from mlflow)
  Downloading click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting cloudpickle<4 (from mlflow)
  Downloading cloudpickle-3.0.0-py3-none-any.whl.metadata (7.0 kB)
Collecting gitpython<4,>=3.1.9 (from mlflow)
  Downloading GitPython-3.1.42-py3-none-any.whl.metadata (12 kB)
Collecting packaging<24 (from mlflow)
  Downloading packaging-23.2-py3-none-any.whl.metadata (3.2 kB)
Collecting sqlparse<1,>=0.4.0 (from mlflow)
  Downloading sqlparse-0.4.4-py3-none-any.whl.metadata (4.0 kB)
Collecting alembic!=1.10.0,<2 (from mlflow)
  Downloading alembic-1.13.1-py3-none-any.whl.metadata (7.4 kB)
Collecting docker<8,>=4.0.0 (from mlflow)
  Downloading docker-7.0.0-py3-none-any.whl.metadata (3.5 kB)
Collecting Flask<4 (from mlflow)
  Downloading flask-3.0.2-py3-none-any.whl.metadata (3.6 kB)
Collecting querystring-parser<2 (from mlflow)
  Downloading querystring_parser-1.2.4-py2

In [15]:
!conda list | grep "tensor"

tensorboard               2.16.2                   pypi_0    pypi
tensorboard-data-server   0.7.2                    pypi_0    pypi
tensorflow                2.16.1                   pypi_0    pypi
tensorflow-io-gcs-filesystem 0.36.0                   pypi_0    pypi


In [2]:
import numpy as np 
import pandas as pd 
import os
from glob import glob
import pathlib
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

from skimage.io import imread, imshow
from PIL import Image
from sklearn import model_selection, metrics, ensemble, linear_model

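# Ask tf.keras to use the legacy Keras 2 implementation (needs the tf_keras package).
# Note that the standalone `import keras` below still loads Keras 3.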
os.environ['TF_USE_LEGACY_KERAS'] = '1'
import tensorflow as tf
import keras

import mlflow

%matplotlib inline
print(f"Keras version: {keras.__version__}")
print(f"Tensorflow version: {tf.__version__}")

Keras version: 3.1.1
Tensorflow version: 2.16.1


In [3]:
import tensorflow_decision_forests as tfdf

ModuleNotFoundError: No module named 'tensorflow_decision_forests'

In [6]:
MLFLOW_TRACKING_URI="http://127.0.0.1:5000"

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

In [7]:
EXPERIMENT_NAME="pneumonia-classification"
mlflow.set_experiment(EXPERIMENT_NAME)

# with mlflow.start_run():
#     mlflow.log_metric("foo", 1)
#     mlflow.log_metric("bar", 2)

MlflowException: API request to endpoint /api/2.0/mlflow/experiments/get-by-name failed with error code 403 != 200. Response body: ''

## Do some early processing of your metadata for easier model training:

In [4]:
FILEPATH='../dataset/'
RAW_FILEPATH='raw'
EDA_FILEPATH='eda'

In [5]:
## Below is some helper code to read all of your full image filepaths into a dataframe for easier manipulation
FILENAME='data-entry-with-diseases.csv'

all_xray_df = pd.read_csv(os.path.join(FILEPATH, EDA_FILEPATH,FILENAME))
all_image_paths = {os.path.basename(x): x for x in 
                   glob(os.path.join('../dataset',RAW_FILEPATH,'images*', '*', '*.png'))}
print('Scans found:', len(all_image_paths), ', Total Headers', all_xray_df.shape[0])
all_xray_df['path'] = all_xray_df['ImageIndex'].map(all_image_paths.get)
all_xray_df.sample(5)

Scans found: 112120 , Total Headers 112120


Unnamed: 0.1,Unnamed: 0,ImageIndex,FindingLabels,Follow-up#,PatientID,PatientAge,PatientGender,ViewPosition,OriginalImageWidth,OriginalImageHeight,...,Pneumothorax,Cardiomegaly,Pleural_Thickening,Consolidation,Effusion,Pneumonia,Nodule,Infiltration,Atelectasis,path
91570,91570,00022837_001.png,No Finding,1,22837,64,M,PA,2992,2991,...,0,0,0,0,0,0,0,0,0,../dataset/raw/images_010/images/00022837_001.png
15108,15108,00003972_000.png,No Finding,0,3972,31,F,PA,2048,2500,...,0,0,0,0,0,0,0,0,0,../dataset/raw/images_003/images/00003972_000.png
89003,89003,00022078_000.png,No Finding,0,22078,28,F,PA,2992,2991,...,0,0,0,0,0,0,0,0,0,../dataset/raw/images_010/images/00022078_000.png
20100,20100,00005372_011.png,Atelectasis|Effusion,11,5372,47,M,AP,2500,2048,...,0,0,0,0,1,0,0,0,1,../dataset/raw/images_003/images/00005372_011.png
106228,106228,00028603_000.png,No Finding,0,28603,54,M,PA,3056,2544,...,0,0,0,0,0,0,0,0,0,../dataset/raw/images_012/images/00028603_000.png


In [6]:
all_xray_df.columns

Index(['Unnamed: 0', 'ImageIndex', 'FindingLabels', 'Follow-up#', 'PatientID',
       'PatientAge', 'PatientGender', 'ViewPosition', 'OriginalImageWidth',
       'OriginalImageHeight', 'OriginalImagePixelSpacing_x',
       'OriginalImagePixelSpacing_y', 'disease_count', 'diseased', 'Mass',
       'Edema', 'Emphysema', 'Hernia', 'Fibrosis', 'Pneumothorax',
       'Cardiomegaly', 'Pleural_Thickening', 'Consolidation', 'Effusion',
       'Pneumonia', 'Nodule', 'Infiltration', 'Atelectasis', 'path'],
      dtype='object')

In [7]:
all_xray_df.diseased.value_counts()

diseased
0    60361
1    51759
Name: count, dtype: int64

In [8]:
all_image_paths['00010007_182.png']

'../dataset/raw/images_005/images/00010007_182.png'

In [9]:
len(all_image_paths)

112120

In [10]:
## Here you may want to create some extra columns in your table with binary indicators of certain diseases 
## rather than working directly with the 'Finding Labels' column

# Todo()
all_xray_df['Pneumonia_plus_class'] = ((all_xray_df.Pneumonia==1) | (all_xray_df.Infiltration==1)).astype(int)
all_xray_df['Pneumonia_plus_class'].value_counts()

Pneumonia_plus_class
0    91400
1    20720
Name: count, dtype: int64
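
The comment in this cell suggests building binary indicator columns rather than working with the raw label string. A minimal sketch of one way to do that is pandas' `Series.str.get_dummies`, which splits a delimited string column into 0/1 columns. The tiny frame below is a hand-made stand-in for `all_xray_df` (its rows are illustrative, not taken from the dataset), and `pneumonia_class` follows the column name proposed in the next cell.

```python
import pandas as pd

# Toy stand-in for all_xray_df; only the FindingLabels column matters here.
toy_df = pd.DataFrame({
    "FindingLabels": ["No Finding", "Atelectasis|Effusion", "Pneumonia|Infiltration"]
})

# One 0/1 column per distinct label found between the '|' separators.
indicators = toy_df["FindingLabels"].str.get_dummies(sep="|")
toy_df = pd.concat([toy_df, indicators], axis=1)

# Binary target: pneumonia vs. everything else.
toy_df["pneumonia_class"] = toy_df["Pneumonia"].astype(int)
print(toy_df[["FindingLabels", "Pneumonia", "pneumonia_class"]])
```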

In [11]:
## Here we can create a new column called 'pneumonia_class' that will allow us to look at 
## images with or without pneumonia for binary classification

# Todo


## Create your training, testing, and validation data:

In [12]:
AUTOTUNE = -1
img_height = 256
img_width = 256

In [13]:
df=all_xray_df.copy()
df=df.query("diseased==0 or Pneumonia==1")
print(df.shape)
df.head()

(61792, 30)


Unnamed: 0.1,Unnamed: 0,ImageIndex,FindingLabels,Follow-up#,PatientID,PatientAge,PatientGender,ViewPosition,OriginalImageWidth,OriginalImageHeight,...,Cardiomegaly,Pleural_Thickening,Consolidation,Effusion,Pneumonia,Nodule,Infiltration,Atelectasis,path,Pneumonia_plus_class
3,3,00000002_000.png,No Finding,0,2,81,M,PA,2500,2048,...,0,0,0,0,0,0,0,0,../dataset/raw/images_001/images/00000002_000.png,0
13,13,00000005_000.png,No Finding,0,5,69,F,PA,2048,2500,...,0,0,0,0,0,0,0,0,../dataset/raw/images_001/images/00000005_000.png,0
14,14,00000005_001.png,No Finding,1,5,69,F,AP,2500,2048,...,0,0,0,0,0,0,0,0,../dataset/raw/images_001/images/00000005_001.png,0
15,15,00000005_002.png,No Finding,2,5,69,F,AP,2500,2048,...,0,0,0,0,0,0,0,0,../dataset/raw/images_001/images/00000005_002.png,0
16,16,00000005_003.png,No Finding,3,5,69,F,PA,2992,2991,...,0,0,0,0,0,0,0,0,../dataset/raw/images_001/images/00000005_003.png,0


In [14]:
df=df.drop('Unnamed: 0',axis=1)
df.head()

Unnamed: 0,ImageIndex,FindingLabels,Follow-up#,PatientID,PatientAge,PatientGender,ViewPosition,OriginalImageWidth,OriginalImageHeight,OriginalImagePixelSpacing_x,...,Cardiomegaly,Pleural_Thickening,Consolidation,Effusion,Pneumonia,Nodule,Infiltration,Atelectasis,path,Pneumonia_plus_class
3,00000002_000.png,No Finding,0,2,81,M,PA,2500,2048,0.171,...,0,0,0,0,0,0,0,0,../dataset/raw/images_001/images/00000002_000.png,0
13,00000005_000.png,No Finding,0,5,69,F,PA,2048,2500,0.168,...,0,0,0,0,0,0,0,0,../dataset/raw/images_001/images/00000005_000.png,0
14,00000005_001.png,No Finding,1,5,69,F,AP,2500,2048,0.168,...,0,0,0,0,0,0,0,0,../dataset/raw/images_001/images/00000005_001.png,0
15,00000005_002.png,No Finding,2,5,69,F,AP,2500,2048,0.168,...,0,0,0,0,0,0,0,0,../dataset/raw/images_001/images/00000005_002.png,0
16,00000005_003.png,No Finding,3,5,69,F,PA,2992,2991,0.143,...,0,0,0,0,0,0,0,0,../dataset/raw/images_001/images/00000005_003.png,0


In [15]:
df.Pneumonia.value_counts(normalize=True)

Pneumonia
0    0.976842
1    0.023158
Name: proportion, dtype: float64

In [16]:
df= df.sample(frac=1).reset_index(drop=True)
df.head()

Unnamed: 0,ImageIndex,FindingLabels,Follow-up#,PatientID,PatientAge,PatientGender,ViewPosition,OriginalImageWidth,OriginalImageHeight,OriginalImagePixelSpacing_x,...,Cardiomegaly,Pleural_Thickening,Consolidation,Effusion,Pneumonia,Nodule,Infiltration,Atelectasis,path,Pneumonia_plus_class
0,00028454_002.png,No Finding,2,28454,56,F,AP,3056,2544,0.139,...,0,0,0,0,0,0,0,0,../dataset/raw/images_012/images/00028454_002.png,0
1,00026248_001.png,No Finding,1,26248,34,M,PA,2920,2835,0.143,...,0,0,0,0,0,0,0,0,../dataset/raw/images_011/images/00026248_001.png,0
2,00008476_000.png,No Finding,0,8476,56,F,PA,2048,2500,0.168,...,0,0,0,0,0,0,0,0,../dataset/raw/images_004/images/00008476_000.png,0
3,00004064_001.png,No Finding,1,4064,53,F,PA,2986,2769,0.143,...,0,0,0,0,0,0,0,0,../dataset/raw/images_003/images/00004064_001.png,0
4,00008027_002.png,No Finding,2,8027,33,M,PA,2500,2048,0.168,...,0,0,0,0,0,0,0,0,../dataset/raw/images_004/images/00008027_002.png,0


In [17]:
y=df.Pneumonia.values

In [18]:
# create test set
df['test_set']=0

In [19]:
# create folds
df['kfold']=-1

In [20]:
# Separate test data
sss=model_selection.StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for i, (train_index, test_index) in enumerate(sss.split(X=df, y=y)):
    df.loc[test_index,'test_set']=1
df.test_set.value_counts()

test_set
0    49433
1    12359
Name: count, dtype: int64

In [21]:
print(df[df.test_set==1].Pneumonia.value_counts(normalize=True))
print(df[df.test_set==0].Pneumonia.value_counts(normalize=True))

Pneumonia
0    0.976859
1    0.023141
Name: proportion, dtype: float64
Pneumonia
0    0.976837
1    0.023163
Name: proportion, dtype: float64


In [22]:
df_trunc=df[df.test_set==0].reset_index(drop=True)
df_test=df[df.test_set==1].reset_index(drop=True)
df_trunc.shape,df_test.shape

((49433, 31), (12359, 31))

In [23]:
print(df_trunc.Pneumonia.value_counts(normalize=True))
print(df_test.Pneumonia.value_counts(normalize=True))

Pneumonia
0    0.976837
1    0.023163
Name: proportion, dtype: float64
Pneumonia
0    0.976859
1    0.023141
Name: proportion, dtype: float64


In [24]:
y=df_trunc.Pneumonia.values

In [25]:
# Create folds for training and validation data
kf=model_selection.StratifiedKFold(n_splits=5)
for f, (t_,v_) in enumerate(kf.split(X=df_trunc,y=y)):
    df_trunc.loc[v_,'kfold']=f
df_trunc.kfold.value_counts()

kfold
0    9887
1    9887
2    9887
3    9886
4    9886
Name: count, dtype: int64

In [26]:
for fold in range(5):
    #print(f"{fold}")
    print(f"Fold: {fold} \n{df_trunc[df_trunc.kfold==fold].Pneumonia.value_counts()}")

Fold: 0 
Pneumonia
0    9658
1     229
Name: count, dtype: int64
Fold: 1 
Pneumonia
0    9658
1     229
Name: count, dtype: int64
Fold: 2 
Pneumonia
0    9658
1     229
Name: count, dtype: int64
Fold: 3 
Pneumonia
0    9657
1     229
Name: count, dtype: int64
Fold: 4 
Pneumonia
0    9657
1     229
Name: count, dtype: int64


In [None]:
fold=0
df_train = df_trunc[df_trunc.kfold!=fold].reset_index(drop=True)
df_val = df_trunc[df_trunc.kfold==fold].reset_index(drop=True)
df_train.shape,df_val.shape

In [None]:
df_train.head()

## Load data

In [54]:
help(np.ravel)

Help on function ravel in module numpy:

ravel(a, order='C')
    Return a contiguous flattened array.
    
    A 1-D array, containing the elements of the input, is returned.  A copy is
    made only if needed.
    
    As of NumPy 1.10, the returned array will have the same type as the input
    array. (for example, a masked array will be returned for a masked array
    input)
    
    Parameters
    ----------
    a : array_like
        Input array.  The elements in `a` are read in the order specified by
        `order`, and packed as a 1-D array.
    order : {'C','F', 'A', 'K'}, optional
    
        The elements of `a` are read using this index order. 'C' means
        to index the elements in row-major, C-style order,
        with the last axis index changing fastest, back to the first
        axis index changing slowest.  'F' means to index the elements
        in column-major, Fortran-style order, with the
        first index changing fastest, and the last index changing
       

In [27]:
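def norm(x):
    # Standardize pixel values to zero mean and unit variance
    return (x - np.mean(x))/np.std(x)

def create_dataset(training_df, image_dir=None, img_height=256, img_width=256, desc='processing images'):
    """
    Load the images listed in training_df['path'] and return
    (images, histograms, targets, img_idx):
      images     : flattened, standardized grayscale thumbnails, one row per image
      histograms : 256-bin intensity histogram of each full-size grayscale image
      targets    : the 'Pneumonia' label (0 or 1) of each kept image
      img_idx    : 'ImageIndex' of every image skipped because its thumbnail
                   did not contain exactly img_width*img_height pixels
    """
    images, targets, histograms = [],[],[]
    img_idx=[]
    exclusion_count = 0
    reqd_size= img_width*img_height
    
    for index,row in tqdm(training_df.iterrows(),total=len(training_df), desc=desc):
        image_path= row['path']
        # Convert to single-channel grayscale so every image has one value per pixel
        image = Image.open(image_path).convert("L")
        # The histogram is taken on the full-resolution image, before downsizing
        histogram=np.array(image.histogram())
        # thumbnail() shrinks the image in place and preserves its aspect ratio
        image.thumbnail((img_width,img_height))
        image=np.array(image).ravel()
        
        # Skip (and remember) images whose flattened size is not the required one
        if image.size != reqd_size:
            exclusion_count+=1
            img_idx.append(row['ImageIndex'])
            continue

        image=norm(image)
        images.append(image)
        histograms.append(histogram)
        targets.append(int(row["Pneumonia"]))
        
    print(f"Images excluded: {exclusion_count}")
    images= np.array(images)
    histograms = np.array(histograms)
    targets=np.array(targets)
    
    return images, histograms, targets, img_idx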
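
`create_dataset` skips any image whose thumbnail does not have exactly `img_width * img_height` pixels. Because `Image.thumbnail` keeps the aspect ratio, a non-square input would come out smaller than 256×256 and be skipped. The sketch below uses a blank synthetic image (not a dataset file) to contrast this with `Image.resize`, which forces the exact target size at the cost of some distortion. Whether that trade-off is acceptable here is a judgment call.

```python
from PIL import Image

target = (256, 256)

# Blank synthetic grayscale image with a non-square shape (width, height).
wide = Image.new("L", (2500, 2048))

thumb = wide.copy()
thumb.thumbnail(target)          # in place, preserves the aspect ratio
resized = wide.resize(target)    # returns a new image of exactly 256x256

print("thumbnail size:", thumb.size)
print("resize size:   ", resized.size)
```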


In [28]:
df_train = df_trunc.sample(300).reset_index(drop=True)

xtrain, xtrain_hist,ytrain,img_idx=create_dataset(df_train)
print(xtrain.shape,ytrain.shape,img_idx)

processing images: 100%|██████████████████████| 300/300 [00:03<00:00, 87.25it/s]

Images excluded: 0
(300, 65536) (300,) []





In [73]:
for index in img_idx:
    image_path=df_train[df_train['ImageIndex']==index].reset_index(drop=True).loc[0,'path']
    print(image_path)
    im =Image.open(image_path)
    print(im.size)
    im.thumbnail((img_width,img_height))
    print(im.size)
    print(np.asarray(im).size)
    print(np.asarray(im).reshape(-1).size)

In [70]:
262144/(256*256)

4.0
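
A ratio of 4 is consistent with an image that carries four channels (for example RGBA) rather than one, which is presumably why `create_dataset` converts every image to mode `"L"` before flattening. A small sketch with a synthetic RGBA image, not a dataset file, shows the arithmetic:

```python
import numpy as np
from PIL import Image

rgba = Image.new("RGBA", (256, 256))   # synthetic 4-channel image
gray = rgba.convert("L")               # single-channel grayscale

rgba_size = np.asarray(rgba).size      # height * width * 4 channels
gray_size = np.asarray(gray).size      # height * width

print(rgba_size, gray_size, rgba_size / gray_size)
```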

In [None]:
train_data={}
val_data={}
img_indexes=[]
tree_models={}


for fold in tqdm(range(2),desc="creating data and training RF models"):
    df_train = df_trunc[df_trunc.kfold!=fold].reset_index(drop=True)
    df_val = df_trunc[df_trunc.kfold==fold].reset_index(drop=True)
    
    xtrain, xtrain_hist, ytrain, train_img_idx=create_dataset(df_train,desc=f'processing Train images for Fold {fold}')
    xval, xval_hist, yval, val_img_idx=create_dataset(df_val, desc=f'processing Val images for Fold {fold}')

    print(xtrain.shape, xtrain_hist.shape, ytrain.shape)
    print(xval.shape, xval_hist.shape, yval.shape)

    #train_data[f'Fold_{fold}']= (x_train,x_train_hist,y_train)
    #val_data[f'Fold_{fold}']= (xval, xval_hist, yval)
    img_indexes+=train_img_idx
    img_indexes+=val_img_idx

    rf=ensemble.RandomForestClassifier(n_estimators=100, max_depth=None, n_jobs=-1)
    rf.fit(xtrain,ytrain)
    
    tree_models[f"Fold_{fold}"]=rf

    preds=rf.predict_proba(xval)[:,1]
    
    print(f"Val Fold: {fold}")
    print(f"AUC= {metrics.roc_auc_score(yval,preds)}\n")    

    rfh=ensemble.RandomForestClassifier(n_estimators=100, max_depth=None, n_jobs=-1)
    rfh.fit(xtrain_hist,ytrain)

    tree_models[f"HistFold_{fold}"]=rfh

    preds=rfh.predict_proba(xval_hist)[:,1]
    
    print("Trained on histogram data:")
    print(f"Fold: {fold}")
    print(f"AUC= {metrics.roc_auc_score(yval,preds)}\n")

creating data and training RF models:   0%|               | 0/2 [00:00<?, ?it/s]
processing Train images for Fold 0:   0%|             | 0/39546 [00:00<?, ?it/s]

In [None]:
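#Train Random Forest Models on pixel data
tree_models={}

for fold in tqdm(range(5),desc='training tree models on pixel data'):
    # Rebuild the arrays for this fold; otherwise every iteration would reuse
    # whatever xtrain/xval were left over from an earlier cell
    df_train = df_trunc[df_trunc.kfold!=fold].reset_index(drop=True)
    df_val = df_trunc[df_trunc.kfold==fold].reset_index(drop=True)
    xtrain, _, ytrain, _ = create_dataset(df_train, desc=f'processing Train images for Fold {fold}')
    xval, _, yval, _ = create_dataset(df_val, desc=f'processing Val images for Fold {fold}')

    rf=ensemble.RandomForestClassifier(n_estimators=100, max_depth=None, n_jobs=-1)
    rf.fit(xtrain,ytrain)
    
    tree_models[f"Fold_{fold}"]=rf

    preds=rf.predict_proba(xval)[:,1]
    
    print(f"Val Fold: {fold}")
    print(f"AUC= {metrics.roc_auc_score(yval,preds)}\n")    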

In [None]:
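#Train Random Forest Models on Histogram data
for fold in tqdm(range(5),desc='training tree models on histogram data'):
    # Rebuild the histogram features for this fold so each model sees its own split
    df_train = df_trunc[df_trunc.kfold!=fold].reset_index(drop=True)
    df_val = df_trunc[df_trunc.kfold==fold].reset_index(drop=True)
    _, xtrain_hist, ytrain, _ = create_dataset(df_train, desc=f'processing Train images for Fold {fold}')
    _, xval_hist, yval, _ = create_dataset(df_val, desc=f'processing Val images for Fold {fold}')

    rfh=ensemble.RandomForestClassifier(n_estimators=100, max_depth=None, n_jobs=-1)
    rfh.fit(xtrain_hist,ytrain)

    tree_models[f"HistFold_{fold}"]=rfh

    preds=rfh.predict_proba(xval_hist)[:,1]

    print(f"Fold: {fold}")
    print(f"AUC= {metrics.roc_auc_score(yval,preds)}\n")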

In [None]:

lr_models={}

for fold in tqdm(range(1),desc="training models"):
    # Train on a single fold (fold 2) to keep this run short
    df_train = df_trunc[df_trunc.kfold==2].reset_index(drop=True)
    df_val = df_trunc[df_trunc.kfold==fold].reset_index(drop=True)
    
    # create_dataset returns four values; the list of skipped images is not needed here
    xtrain, xtrain_hist, ytrain, _=create_dataset(df_train)
    xval, xval_hist, yval, _=create_dataset(df_val)

    #Train Logistic Regression
    lr=linear_model.LogisticRegression()
    lr.fit(xtrain,ytrain)
    
    lr_models[f"ValFold_{fold}"]=lr
    
    preds=lr.predict_proba(xval)[:,1]
    
    print(f"Fold: {fold}")
    print(f"AUC= {metrics.roc_auc_score(yval,preds)}\n")

In [None]:
dataset = tf.data.Dataset.list_files(os.path.join(FILEPATH,RAW_FILEPATH,"images*/*/*.png"),shuffle=False)
type(dataset)

In [None]:
len(dataset)

In [None]:
for element in dataset.take(5):
    print(element.numpy())
    x=os.path.basename(element.numpy()).decode("utf-8")
    print(all_xray_df[all_xray_df["ImageIndex"]==x]['Pneumonia'].iloc[0])

In [None]:
all_xray_df[all_xray_df["ImageIndex"]==x]['Pneumonia'].iloc[0]

In [None]:
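def get_label(file_path):
    # file_path arrives as an eager string tensor (see tf.py_function below),
    # so .numpy() is available here
    img_name=os.path.basename(file_path.numpy()).decode("utf-8")
    label=all_xray_df[all_xray_df["ImageIndex"]==img_name]['Pneumonia'].iloc[0]
    return np.int64(label)

def decode_img(img):
    # The files are PNGs, so decode them with the PNG decoder into a 3D uint8 tensor
    img = tf.io.decode_png(img, channels=3)
    # Resize the image to the desired size
    return tf.image.resize(img, [img_height, img_width])

def process_path(file_path):
    # dataset.map traces this function in graph mode, where the pandas lookup
    # cannot run; tf.py_function executes get_label eagerly instead
    label = tf.py_function(get_label, [file_path], tf.int64)
    label.set_shape([])
    img = tf.io.read_file(file_path)
    img = decode_img(img)
    return img, label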

In [None]:
# Set `num_parallel_calls` so multiple images are loaded/processed in parallel.
all_ds = dataset.map(process_path, num_parallel_calls=AUTOTUNE)
type(all_ds)

In [None]:
#help(all_ds)

In [None]:
for image, label in all_ds.take(1):
    print("Image shape: ", image.numpy().shape)
    print("Label: ", label.numpy())

In [None]:
plt.figure(figsize=(20, 15))
i=0
for images, labels in all_ds.take(9):
    i=i+1
    ax = plt.subplot(3, 3, i)
    plt.imshow(images.numpy().astype("uint8"))
    plt.title(labels.numpy())
    plt.axis("off")

In [None]:
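# NOTE: `data_dir` is not defined anywhere in this notebook and `batch_size` is only
# set in a later cell; with inferred labels, image_dataset_from_directory also expects
# one sub-folder per class.
train_ds = tf.keras.utils.image_dataset_from_directory(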
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

In [None]:
def create_splits(vargs):
    
    ## Either build your own or use a built-in library to split your original dataframe into two sets 
    ## that can be used for training and testing your model
    ## It's important to consider here how balanced or imbalanced you want each of those sets to be
    ## for the presence of pneumonia
    
    # Todo
    
    return train_data, val_data
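
As the comments in `create_splits` note, the balance of each set is a design decision. One common option, sketched here as an assumption rather than the notebook's method, is to undersample the negative class in the training set only and leave the validation data at its natural prevalence. `undersample_negatives` is a hypothetical helper and the frame is synthetic.

```python
import pandas as pd

def undersample_negatives(train_df, label_col="Pneumonia", random_state=42):
    """Return a shuffled frame with as many negatives as positives."""
    positives = train_df[train_df[label_col] == 1]
    negatives = train_df[train_df[label_col] == 0]
    # Keep every positive and draw an equal number of negatives at random.
    sampled_negatives = negatives.sample(n=len(positives), random_state=random_state)
    balanced = pd.concat([positives, sampled_negatives])
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Synthetic training frame: 5 positives among 100 rows.
toy_train = pd.DataFrame({"Pneumonia": [1] * 5 + [0] * 95})
balanced_train = undersample_negatives(toy_train)
print(balanced_train["Pneumonia"].value_counts())
```

Undersampling discards most negatives; class weights during training are an alternative that keeps all rows.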

# Now we can begin our model-building & training

#### First suggestion: perform some image augmentation on your data

In [None]:
batch_size = 32
img_height = 224
img_width = 224

In [None]:
AUTOTUNE = tf.data.AUTOTUNE
AUTOTUNE

### Base Models
#### Logistic Regression

In [None]:
model=linear_model.LogisticRegression()
model.fit(xtrain,ytrain)


In [None]:
valid_preds=model.predict_proba(xval)[:,1]

In [None]:
auc = metrics.roc_auc_score(yval,valid_preds)

#### Random Forest

In [None]:
model=ensemble.RandomForestClassifier(n_jobs=-1)
model.fit(xtrain,ytrain)

In [None]:
valid_preds=model.predict_proba(xval)[:,1]

In [None]:
auc = metrics.roc_auc_score(yval,valid_preds)

In [None]:
def my_image_augmentation(vargs):
    
    ## recommendation here to implement a package like Keras' ImageDataGenerator
    ## with some of the built-in augmentations 
    
    ## keep an eye out for types of augmentation that are or are not appropriate for medical imaging data
    ## Also keep in mind what sort of augmentation is or is not appropriate for testing vs validation data
    
    ## STAND-OUT SUGGESTION: implement some of your own custom augmentation that's *not*
    ## built into something like a Keras package
    
    # Todo
    
    return my_idg

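from keras import layers

data_augmentation = keras.Sequential(
  [
    # Declare the input shape once, instead of passing input_shape to the first layer
    keras.Input(shape=(img_height, img_width, 3)),
    layers.RandomFlip("horizontal"),
    #layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
  ]
)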

def make_train_gen(vargs):
    
    ## Create the actual generators using the output of my_image_augmentation for your training data
    ## Suggestion here to use the flow_from_dataframe library, e.g.:
    
#     train_gen = my_train_idg.flow_from_dataframe(dataframe=train_df, 
#                                          directory=None, 
#                                          x_col = ,
#                                          y_col = ,
#                                          class_mode = 'binary',
#                                          target_size = , 
#                                          batch_size = 
#                                          )
     # Todo

    return train_gen


def make_val_gen(vargs):
    
#     val_gen = my_val_idg.flow_from_dataframe(dataframe = val_data, 
#                                              directory=None, 
#                                              x_col = ,
#                                              y_col = ,
#                                              class_mode = 'binary',
#                                              target_size = , 
#                                              batch_size = ) 
    
    # Todo
    return val_gen

In [None]:
## May want to pull a single large batch of random validation data for testing after each epoch:
valX, valY = next(val_gen)

In [None]:
## May want to look at some examples of our augmented training data. 
## This is helpful for understanding the extent to which data is being manipulated prior to training, 
## and can be compared with how the raw data look prior to augmentation

t_x, t_y = next(train_gen)
fig, m_axs = plt.subplots(4, 4, figsize = (16, 16))
for (c_x, c_y, c_ax) in zip(t_x, t_y, m_axs.flatten()):
    c_ax.imshow(c_x[:,:,0], cmap = 'bone')
    if c_y == 1: 
        c_ax.set_title('Pneumonia')
    else:
        c_ax.set_title('No Pneumonia')
    c_ax.axis('off')

## Build your model: 

Recommendation here to use a pre-trained network downloaded from Keras for fine-tuning

In [None]:
def load_pretrained_model(vargs):
    
    # model = VGG16(include_top=True, weights='imagenet')
    # transfer_layer = model.get_layer(lay_of_interest)
    # vgg_model = Model(inputs = model.input, outputs = transfer_layer.output)
    
    # Todo
    
    return vgg_model


In [None]:
def build_my_model(vargs):
    
    # my_model = Sequential()
    # ....add your pre-trained model, and then whatever additional layers you think you might
    # want for fine-tuning (Flatten, Dense, Dropout, etc.)
    
    # if you want to compile your model within this function, consider which layers of your pre-trained model, 
    # you want to freeze before you compile 
    
    # also make sure you set your optimizer, loss function, and metrics to monitor
    
    # Todo
    
    return my_model
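
The skeleton above leaves the architecture open. As one possible starting point, and only a sketch under assumptions, the function below freezes a VGG16 convolutional base (the network named in the `load_pretrained_model` comments) and adds a small binary head. `build_transfer_model`, the pooling layer, the dropout rate and the learning rate are illustrative choices, not settings from this notebook. `weights=None` is the default here only so the sketch builds without a download; passing `weights="imagenet"` would load the pre-trained filters.

```python
import tensorflow as tf

def build_transfer_model(img_height=224, img_width=224, weights=None):
    """Frozen VGG16 convolutional base with a small binary classification head."""
    base = tf.keras.applications.VGG16(
        include_top=False,          # drop the ImageNet classifier layers
        weights=weights,            # None = random init; "imagenet" = pre-trained
        input_shape=(img_height, img_width, 3),
    )
    base.trainable = False          # freeze the convolutional layers

    inputs = tf.keras.Input(shape=(img_height, img_width, 3))
    x = base(inputs, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model

sketch_model = build_transfer_model()
print(sketch_model.output_shape)
```

Only the final dense layer is trainable in this sketch; unfreezing the last convolutional block is a common next experiment.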



## STAND-OUT Suggestion: choose another output layer besides just the last classification layer of your model
## to output class activation maps to aid in clinical interpretation of your model's results

In [None]:
## Below is some helper code that will allow you to add checkpoints to your model,
## This will save the 'best' version of your model by comparing it to previous epochs of training

## Note that you need to choose which metric to monitor for your model's 'best' performance if using this code. 
## The 'patience' parameter is set to 10, meaning that your model will train for ten epochs without seeing
## improvement before quitting

# Todo

# weight_path="{}_my_model.best.hdf5".format('xray_class')
# (with Keras 3 and save_weights_only=True, the file name must end in ".weights.h5")

# checkpoint = ModelCheckpoint(weight_path, 
#                              monitor= CHOOSE_METRIC_TO_MONITOR_FOR_PERFORMANCE, 
#                              verbose=1, 
#                              save_best_only=True, 
#                              mode= CHOOSE_MIN_OR_MAX_FOR_YOUR_METRIC, 
#                              save_weights_only = True)

# early = EarlyStopping(monitor= SAME_AS_METRIC_CHOSEN_ABOVE, 
#                       mode= CHOOSE_MIN_OR_MAX_FOR_YOUR_METRIC, 
#                       patience=10)

# callbacks_list = [checkpoint, early]

### Start training! 

In [None]:
## train your model

# Todo

# fit_generator is deprecated in TF 2.x and absent from Keras 3; fit accepts generators directly
# history = my_model.fit(train_gen, 
#                        validation_data = (valX, valY), 
#                        epochs = , 
#                        callbacks = callbacks_list)

##### After training for some time, look at the performance of your model by plotting some performance statistics:

Note, these figures will come in handy for your FDA documentation later in the project

In [None]:
## After training, make some predictions to assess your model's overall performance
## Note that detecting pneumonia is hard even for trained expert radiologists, 
## so there is no need to make the model perfect.
my_model.load_weights(weight_path)
pred_Y = my_model.predict(valX, batch_size = 32, verbose = True)

In [None]:
def plot_auc(t_y, p_y):
    
    ## Hint: can use scikit-learn's built in functions here like roc_curve
    
    # Todo
    
    return

## what other performance statistics do you want to include here besides AUC? 


# def ... 
# Todo

# def ...
# Todo
    
#Also consider plotting the history of your model training:

def plot_history(history):
    
    # Todo
    return
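
Following the hint in `plot_auc`, here is a minimal sketch of a ROC plot built on scikit-learn's `roc_curve` and `auc`. `plot_roc_sketch` is a hypothetical name and the four labels and scores are a hand-made toy example, not model output.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics

def plot_roc_sketch(true_labels, predicted_scores, ax=None):
    """Draw a ROC curve on the given axes and return the area under it."""
    fpr, tpr, _ = metrics.roc_curve(true_labels, predicted_scores)
    auc_value = metrics.auc(fpr, tpr)
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 6))
    ax.plot(fpr, tpr, label=f"ROC (AUC = {auc_value:.3f})")
    ax.plot([0, 1], [0, 1], linestyle="--", label="chance")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.legend(loc="lower right")
    return auc_value

# Tiny hand-made example; not model output.
toy_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])
toy_auc = plot_roc_sketch(toy_true, toy_scores)
```

With roughly 2% positives in this dataset, a precision-recall curve (`metrics.precision_recall_curve`) is worth plotting alongside the ROC curve.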

In [None]:
## plot figures

# Todo

Once you feel you are done training, you'll need to decide the proper classification threshold that optimizes your model's performance for a given metric (e.g. accuracy, F1, or precision; the choice is yours).

In [None]:
## Find the threshold that optimizes your model's performance,
## and use that threshold to make binary classifications. Make sure you take all your metrics into consideration.

# Todo
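
One way to pick the threshold, sketched here under the assumption that F1 is the metric of interest, is to scan the cut-offs returned by scikit-learn's `precision_recall_curve` and keep the one with the highest F1. `best_f1_threshold` is a hypothetical helper and the labels and scores are a toy example, not model output.

```python
import numpy as np
from sklearn import metrics

def best_f1_threshold(true_labels, predicted_scores):
    """Return (threshold, f1) for the score cut-off with the highest F1."""
    precision, recall, thresholds = metrics.precision_recall_curve(
        true_labels, predicted_scores
    )
    # The final precision/recall pair has no matching threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    denominator = precision + recall
    # Guard against 0/0 when both precision and recall are zero.
    f1 = np.divide(
        2 * precision * recall,
        denominator,
        out=np.zeros_like(denominator),
        where=denominator > 0,
    )
    best = int(np.argmax(f1))
    return thresholds[best], f1[best]

toy_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])
threshold, f1_value = best_f1_threshold(toy_true, toy_scores)

# Scores at or above the threshold are classified as positive.
toy_predictions = (toy_scores >= threshold).astype(int)
print(threshold, f1_value, toy_predictions)
```

The threshold should be chosen on validation data and then applied unchanged to the held-out test set.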

In [None]:
## Let's look at some examples of predicted v. true with our best model: 

# Todo

# fig, m_axs = plt.subplots(10, 10, figsize = (16, 16))
# i = 0
# for (c_x, c_y, c_ax) in zip(valX[0:100], valY[0:100], m_axs.flatten()):
#     c_ax.imshow(c_x[:,:,0], cmap = 'bone')
#     if c_y == 1: 
#         if pred_Y[i] > YOUR_THRESHOLD:
#             c_ax.set_title('1, 1')
#         else:
#             c_ax.set_title('1, 0')
#     else:
#         if pred_Y[i] > YOUR_THRESHOLD: 
#             c_ax.set_title('0, 1')
#         else:
#             c_ax.set_title('0, 0')
#     c_ax.axis('off')
#     i=i+1

In [None]:
## Just save model architecture to a .json:

model_json = my_model.to_json()
with open("my_model.json", "w") as json_file:
    json_file.write(model_json)