
The other important aspect of EDA is exploring your population. In this exercise, you'll be given a dataframe that describes a large dataset. Your goal is to perform EDA on the population in the dataset such that you can answer the following questions:

1. How are the different diseases distributed in my dataset in terms of frequency and co-occurrence with one another? (For the sake of time, just choose one of the diseases and assess its co-occurrence frequencies with all other diseases.)
2. How is age distributed across my dataset? Is it distributed differently for different diseases?
3. How is sex distributed across my dataset? Is it distributed differently for different diseases?
4. For findings that have a Mass_Size (i.e., not just a binary classification of disease presence), is there a relationship between size and age, sex, or the presence of other diseases?

In [0]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats  # import the submodule explicitly; `import scipy` alone may not expose scipy.stats

from itertools import chain
from random import sample

In [0]:
d = pd.read_csv('https://raw.githubusercontent.com/anhle/AI-Healthcare/master/AI_2D/Ex/data/findings_data.csv')

To understand how variables are distributed across diseases, let's split the 'Finding Labels' column into one additional column per disease (e.g. one for 'Cardiomegaly', one for 'Emphysema', etc.) and put a binary flag in each column to indicate the presence of that disease.

In [0]:
## Here I'm just going to split up my "Finding Labels" column so that I have one column in my dataframe
# per disease, with a binary flag. This makes EDA a lot easier!

all_labels = np.unique(list(chain(*d['Finding Labels'].map(lambda x: x.split('|')).tolist())))
all_labels = [x for x in all_labels if len(x)>0]
print('All Labels ({}): {}'.format(len(all_labels), all_labels))
for c_label in all_labels:
    if len(c_label)>1: # leave out empty labels
        # Match whole labels; a plain substring test would also flag any label that merely contains c_label
        d[c_label] = d['Finding Labels'].map(lambda finding: 1 if c_label in finding.split('|') else 0)
d.sample(3)
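
As an alternative to the loop above, pandas can build the same kind of flag columns in a single call. The sketch below uses `Series.str.get_dummies` on a small made-up dataframe (`toy` and its rows are illustrative assumptions, not the exercise data). It splits on the `|` separator, so it matches whole labels rather than substrings.

```python
import pandas as pd

# Toy stand-in for the real dataframe; these rows are invented for illustration.
toy = pd.DataFrame({
    'Finding Labels': [
        'Effusion|Infiltration',
        'No Finding',
        'Infiltration',
        'Cardiomegaly|Effusion',
    ]
})

# One 0/1 column per pipe-separated label, with columns in sorted order.
flags = toy['Finding Labels'].str.get_dummies(sep='|')

# Attach the flag columns to the original rows.
toy = toy.join(flags)
print(toy)
```

Because the flags come back as a separate dataframe, `flags.columns` can double as the list of labels for later plots.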

In [0]:
len(all_labels)

I see here that there are 14 unique types of labels in my dataset.

In [0]:
d[all_labels].sum()/len(d)

In [0]:
ax = d[all_labels].sum().plot(kind='bar')
ax.set(ylabel = 'Number of Images with Label')

Above, I see the relative frequencies of each disease in my dataset. It looks like 'No Finding' is the most common occurrence. 'No Finding' can never appear with any other label by definition, so we know that in 57.5% of this dataset, there is no finding in the image. Beyond that, it appears that 'Infiltration' is the most common disease-related label, and it is followed by 'Effusion' and 'Atelectasis.'

Since 'Infiltration' is the most common, I'm going to now look at how frequently it appears with all of the other diseases: 

In [0]:
d[d.Infiltration == 1]['Finding Labels'].value_counts()[:10].plot(kind='bar')

In [0]:
##Since there are many combinations of potential findings, I'm going to look at the 30 most common co-occurrences:
plt.figure(figsize=(16,6))
d[d.Infiltration==1]['Finding Labels'].value_counts()[0:30].plot(kind='bar')

It looks like Infiltration actually occurs alone for the most part, and that its most-common comorbidities are Atelectasis and Effusion. 
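
The bar charts above count exact label combinations. A complementary view, sketched here on made-up rows, counts how often each other label appears alongside Infiltration, regardless of what else is present. The `toy` dataframe and the names `co_counts` and `alone` are assumptions for illustration only.

```python
import pandas as pd

# Invented rows for illustration; not the exercise data.
toy = pd.DataFrame({'Finding Labels': [
    'Infiltration',
    'Effusion|Infiltration',
    'Atelectasis|Infiltration',
    'Atelectasis|Effusion|Infiltration',
    'No Finding',
    'Effusion',
]})
flags = toy['Finding Labels'].str.get_dummies(sep='|')

# Restrict to rows where Infiltration is present.
with_infiltration = flags[flags['Infiltration'] == 1]

# For every other label, count the rows where it co-occurs with Infiltration.
co_counts = (with_infiltration
             .drop(columns='Infiltration')
             .sum()
             .sort_values(ascending=False))

# Rows where Infiltration is the only label set.
alone = int((with_infiltration.sum(axis=1) == 1).sum())

print(co_counts)
print('Infiltration alone:', alone)
```

Dividing `co_counts` by `len(with_infiltration)` would turn the counts into co-occurrence rates, which are easier to compare across diseases of different prevalence.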

Let's see if the same is true for another label. We'll try Effusion:

In [0]:
##Since there are many combinations of potential findings, I'm going to look at the 30 most common co-occurrences:
plt.figure(figsize=(16,6))
d[d.Effusion==1]['Finding Labels'].value_counts()[0:30].plot(kind='bar')

Same thing: Effusion also occurs alone for the most part. Now let's move on to looking at age and gender:

In [0]:
plt.figure(figsize=(10,6))
plt.hist(d['Patient Age'])

In [0]:
plt.figure(figsize=(10,6))
plt.hist(d[d.Infiltration==1]['Patient Age'])

In [0]:
plt.figure(figsize=(10,6))
plt.hist(d[d.Effusion==1]['Patient Age'])

It looks like the age distribution of the whole population differs slightly from the distributions for Infiltration and Effusion specifically. Infiltration appears to be more skewed towards younger individuals, and Effusion spans the age range but has a large peak around 55.
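
Histograms of groups with very different sizes can be hard to compare by eye, so it may help to put summary statistics side by side. The sketch below does this with `describe()` on a made-up dataframe; the ages and flags in `toy` are invented and only show the mechanics.

```python
import pandas as pd

# Invented ages and disease flags, for illustration only.
toy = pd.DataFrame({
    'Patient Age':  [25, 30, 35, 55, 60, 65],
    'Infiltration': [1, 1, 1, 0, 0, 0],
    'Effusion':     [0, 0, 1, 1, 1, 1],
})

# One column of summary statistics per group.
summary = pd.DataFrame({
    'All': toy['Patient Age'].describe(),
    'Infiltration': toy.loc[toy['Infiltration'] == 1, 'Patient Age'].describe(),
    'Effusion': toy.loc[toy['Effusion'] == 1, 'Patient Age'].describe(),
})

# Show group size, mean, and median.
print(summary.loc[['count', 'mean', '50%']])
```

On the real data, passing `density=True` to `plt.hist` is another way to compare shapes without the group sizes dominating the picture.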

In [0]:
plt.figure(figsize=(6,6))
d['Patient Gender'].value_counts().plot(kind='bar')

In [0]:
plt.figure(figsize=(6,6))
d[d.Infiltration ==1]['Patient Gender'].value_counts().plot(kind='bar')

In [0]:
plt.figure(figsize=(6,6))
d[d.Effusion ==1]['Patient Gender'].value_counts().plot(kind='bar')

The gender distribution seems to be roughly balanced in the whole population as well as among Infiltration cases, with a slightly higher share of females among Effusion cases.
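
Raw bar heights depend on how many images each group contains, so proportions can make the comparison more direct. A minimal sketch, assuming a made-up `toy` dataframe with the same column names as the exercise data, uses `pd.crosstab` with `normalize='columns'`:

```python
import pandas as pd

# Invented rows for illustration; not the exercise data.
toy = pd.DataFrame({
    'Patient Gender': ['M', 'F', 'M', 'F', 'F', 'M', 'F', 'M'],
    'Effusion':       [0,   0,   0,   0,   1,   1,   1,   0],
})

# Share of each gender within the Effusion == 0 and Effusion == 1 groups.
# normalize='columns' makes every column sum to 1.
props = pd.crosstab(toy['Patient Gender'], toy['Effusion'], normalize='columns')
print(props)
```

Each column of `props` sums to 1, so the `1` column reads directly as the gender split among images flagged with Effusion.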

#### Finally, let's look at if and how age & gender relate to mass size in individuals who have a mass as a finding:

In [0]:
plt.scatter(d['Patient Age'],d['Mass_Size'])

In [0]:
# Keep only the rows that have a measured mass size
mass_rows = d[d['Mass_Size'].notna()]
ages = mass_rows['Patient Age']
mass_sizes = mass_rows['Mass_Size'].values
# Returns the Pearson correlation coefficient and its p-value
scipy.stats.pearsonr(mass_sizes, ages)

The above tells us that age and mass size are significantly correlated, with a Pearson's correlation coefficient of 0.727.
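
`scipy.stats.pearsonr` returns two values: the correlation coefficient and a p-value. The coefficient measures the strength of the linear relationship, and the p-value is what supports calling it significant. The sketch below unpacks both on made-up numbers; the `toy` values are assumptions and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Invented ages and mass sizes; NaN marks images without a mass.
toy = pd.DataFrame({
    'Patient Age': [30, 40, 50, 60, 70, 80],
    'Mass_Size':   [10.0, np.nan, 14.0, 15.0, np.nan, 21.0],
})

# Drop rows without a mass so both inputs have the same length.
with_mass = toy.dropna(subset=['Mass_Size'])

# r is the correlation coefficient, p_value tests the null hypothesis r == 0.
r, p_value = stats.pearsonr(with_mass['Patient Age'], with_mass['Mass_Size'])
print(f'n = {len(with_mass)}, r = {r:.3f}, p = {p_value:.3f}')
```

Reporting the sample size next to `r` and the p-value makes it easier to judge how much weight the correlation can bear.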

In [0]:
np.mean(d[d['Patient Gender']== 'M']['Mass_Size'])

In [0]:
np.mean(d[d['Patient Gender']== 'F']['Mass_Size'])

In [0]:
scipy.stats.ttest_ind(d[d['Patient Gender']== 'F']['Mass_Size'],d[d['Patient Gender']== 'M']['Mass_Size'],nan_policy='omit')

The above tells us that there is no statistically significant difference in mass size between genders.
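
Printing the group sizes and means next to the test result makes the comparison easier to interpret. The sketch below does that on made-up values. It also passes `equal_var=False`, which runs Welch's t-test rather than the pooled-variance test used above; that choice is an assumption of this sketch, not part of the original analysis. A non-significant p-value means the data show no evidence of a difference, which is weaker than showing the groups are equal.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Invented rows for illustration; NaN marks images without a mass.
toy = pd.DataFrame({
    'Patient Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'Mass_Size':      [12.0, 11.0, np.nan, 14.0, 15.0, np.nan, 13.0, 16.0],
})

# Group sizes (non-NaN only), means, and standard deviations.
group_stats = toy.groupby('Patient Gender')['Mass_Size'].agg(['count', 'mean', 'std'])
print(group_stats)

female = toy.loc[toy['Patient Gender'] == 'F', 'Mass_Size']
male = toy.loc[toy['Patient Gender'] == 'M', 'Mass_Size']

# Welch's t-test (unequal variances); NaN values are ignored.
t_stat, p_value = stats.ttest_ind(female, male, equal_var=False, nan_policy='omit')
print(f't = {float(t_stat):.3f}, p = {float(p_value):.3f}')
```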