
You're given a dataframe with image labels for 5,000 chest x-rays. Your goal is to prepare a training and a validation dataset for an algorithm that predicts the presence of a Pneumothorax (collapsed lung). Remember, we want our model to see an equal number of positive and negative cases during training, but when we evaluate its performance, we should use a class balance (or imbalance) that better reflects the real world.

In this exercise, you will notice that Pneumothorax isn't a very common finding in this dataset, so you'll want to make the most of your data by using all of the true Pneumothorax cases in training. Given the large class imbalance, however, you may end up throwing away images that don't contain Pneumothorax.

Here's an assumption you can make when creating your validation set: Despite the large imbalance of Pneumothorax in this dataset, in the actual clinical setting where you want to deploy your algorithm, the prevalence of Pneumothorax will be about 20%. This is because patients are only being x-rayed based on their clinical symptoms that make Pneumothorax highly likely.
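The 20% figure translates directly into a ratio of negative to positive cases. The sketch below works through that arithmetic; `negatives_needed` is a hypothetical helper introduced here for illustration, not part of the exercise.

```python
def negatives_needed(n_positive: int, prevalence: float) -> int:
    """Return how many negative cases to pair with n_positive positives
    so that positives make up `prevalence` of the combined set."""
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be strictly between 0 and 1")
    # positives / (positives + negatives) = prevalence
    # => negatives = positives * (1 - prevalence) / prevalence
    return round(n_positive * (1 - prevalence) / prevalence)

# At 20% prevalence, each positive case needs four negative cases.
print(negatives_needed(50, 0.20))
# At 50% prevalence (a balanced set), the counts are equal.
print(negatives_needed(50, 0.50))
```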

In [0]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from random import sample

from itertools import chain
import scipy

import sklearn.model_selection as skl

## First read in the dataframe. You'll notice it's similar to the dataframe that you ended the final solution with in Lesson 2, Exercise 4, only with more data:

In [0]:
d = pd.read_csv('https://raw.githubusercontent.com/anhle/AI-Healthcare/master/AI_2D/Ex/data/findings_data_5000.csv')
d.head()

## Just like in Lesson 2, Exercise 4, we want to see how different diseases are distributed with our disease of interest, as well as how age and gender are distributed:

In [0]:
all_labels = np.unique(list(chain(*d['Finding Labels'].map(lambda x: x.split('|')).tolist())))
all_labels = [x for x in all_labels if len(x)>0]
all_labels

In [0]:
d[all_labels].sum()
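Note that `d[all_labels]` only works because the CSV already contains one binary column per finding. If you were starting from the `Finding Labels` column alone, one possible way to build such columns is sketched below on a made-up three-row frame (`toy` and its contents are assumptions for illustration, not rows from the real data).

```python
import pandas as pd

# Made-up example rows; the real dataframe has many more columns.
toy = pd.DataFrame({
    'Finding Labels': ['Pneumothorax|Effusion', 'No Finding', 'Effusion']
})

# Split each '|'-separated string and create one 0/1 column per label.
one_hot = toy['Finding Labels'].str.get_dummies(sep='|')
toy = pd.concat([toy, one_hot], axis=1)

# Per-label image counts, analogous to d[all_labels].sum() above.
print(toy[one_hot.columns].sum())
```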

In [0]:
ax = d[all_labels].sum().plot(kind='bar')
ax.set(ylabel = 'Number of Images with Label')

In [0]:
##Since there are many combinations of potential findings, I'm going to look at the 30 most common co-occurrences:
plt.figure(figsize=(16,6))
d[d.Pneumothorax==1]['Finding Labels'].value_counts()[0:30].plot(kind='bar')

In [0]:
## Next, look at how patient gender is distributed among the Pneumothorax cases:
plt.figure(figsize=(6,6))
d[d.Pneumothorax ==1]['Patient Gender'].value_counts().plot(kind='bar')

In [0]:
plt.figure(figsize=(10,6))
plt.hist(d[d.Pneumothorax==1]['Patient Age'])

## Now, knowing what we know from above, let's create the appropriate training and validation sets for a model that we want to train to classify the presence of a Pneumothorax

In [0]:
train_df, valid_df = skl.train_test_split(d, 
                                   test_size = 0.2, 
                                   stratify = d['Pneumothorax'])

In [0]:
train_df['Pneumothorax'].sum()/len(train_df)

In [0]:
valid_df['Pneumothorax'].sum()/len(valid_df)

Great, our train_test_split made sure that we had the same proportions of Pneumothorax in both sets!
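To see the effect of `stratify` in isolation, here is a minimal sketch on a made-up frame with 10% positives. The `toy` frame and the fixed `random_state` are assumptions added for reproducibility; the exercise code above does not set a seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels: 10 positive and 90 negative cases.
toy = pd.DataFrame({'Pneumothorax': [1] * 10 + [0] * 90})

toy_train, toy_valid = train_test_split(
    toy,
    test_size=0.2,
    stratify=toy['Pneumothorax'],  # keep the class ratio in both splits
    random_state=0,                # fixed seed so the split is repeatable
)

# The positive rate should match in both splits.
train_rate = toy_train['Pneumothorax'].mean()
valid_rate = toy_valid['Pneumothorax'].mean()
print(train_rate, valid_rate)
```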

But we know that we want our model to be trained on a set that has _equal_ proportions of Pneumothorax and no Pneumothorax, so we're going to have to throw away some data:

In [0]:
p_inds = train_df[train_df.Pneumothorax==1].index.tolist()
np_inds = train_df[train_df.Pneumothorax==0].index.tolist()

np_sample = sample(np_inds,len(p_inds))
train_df = train_df.loc[p_inds + np_sample]

In [0]:
train_df['Pneumothorax'].sum()/len(train_df)

Ta-da! We randomly chose a set of non-Pneumothorax images using the sample() function that was the same length as the number of true Pneumothorax cases we had, and then we threw out the rest of the non-Pneumothorax cases. Now our training dataset is balanced 50-50.
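The same balancing step can also be written with pandas' own `DataFrame.sample`, which accepts a `random_state` so the result is repeatable. This is an alternative sketch, not the approach used in the cell above; `balance_by_downsampling` is a hypothetical helper, and it assumes the positive class is the minority.

```python
import pandas as pd

def balance_by_downsampling(df, label_col, seed=0):
    """Keep every positive row and an equally sized random subset of
    negative rows. Assumes positives are the minority class."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0].sample(n=len(pos), random_state=seed)
    # Shuffle so positives and negatives are interleaved.
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)

# Made-up labels: 5 positive and 45 negative cases.
toy = pd.DataFrame({'Pneumothorax': [1] * 5 + [0] * 45})
balanced = balance_by_downsampling(toy, 'Pneumothorax')
print(len(balanced), balanced['Pneumothorax'].mean())
```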

Finally, we want to make the balance in our validation set more like 20-80 since our exercise told us that the prevalence of Pneumothorax in this clinical situation is about 20%:

In [0]:
p_inds = valid_df[valid_df.Pneumothorax==1].index.tolist()
np_inds = valid_df[valid_df.Pneumothorax==0].index.tolist()

# The following code pulls a random sample of non-Pneumothorax data that's 4 times as big as the Pneumothorax sample.
np_sample = sample(np_inds,4*len(p_inds))
valid_df = valid_df.loc[p_inds + np_sample]

In [0]:
valid_df['Pneumothorax'].sum()/len(valid_df)
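One caveat: `random.sample` raises a `ValueError` if asked for more items than the list contains, so the cell above only works when the validation set has at least four negatives per positive. That holds here because Pneumothorax is rare, but a more defensive version might look like the sketch below. `resample_to_prevalence` is a hypothetical helper, and dropping surplus positives when negatives run short is one possible design choice, not something the exercise prescribes.

```python
import pandas as pd

def resample_to_prevalence(df, label_col, prevalence, seed=0):
    """Subsample df so positives make up roughly `prevalence` of the rows.
    If there are too few negatives, positives are subsampled as well."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    ratio = (1 - prevalence) / prevalence  # negatives per positive
    # Keep only as many positives as the available negatives can support.
    n_pos = min(len(pos), int(len(neg) / ratio))
    n_neg = min(len(neg), round(n_pos * ratio))
    return pd.concat([
        pos.sample(n=n_pos, random_state=seed),
        neg.sample(n=n_neg, random_state=seed),
    ])

# Made-up case with too few negatives for all 10 positives at 20% prevalence.
toy = pd.DataFrame({'Pneumothorax': [1] * 10 + [0] * 20})
resampled = resample_to_prevalence(toy, 'Pneumothorax', prevalence=0.2)
print(len(resampled), resampled['Pneumothorax'].mean())
```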