> **This notebook is based on the Udacity Pneumonia Detection from Chest X-Rays project, which can be accessed** [here](https://www.udacity.com/course/ai-for-healthcare-nanodegree--nd320)

> **Most of my work is inspired by this GitHub repo** [here](https://github.com/aymanaboghonim/udacity-healthcare-ai/tree/master/02-medical-imaging-2d)

# <span style="color:blue"> **EDA**</span>


# Importing Essential Libraries


In [None]:
import numpy as np
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from glob import glob
from scipy.stats import norm
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

##Import any other packages you may need here
import matplotlib.image as image

EDA is open-ended, and it is up to you to decide how to slice and dice your data. A good starting point is to look at the requirements for the FDA documentation in the final part of this project to guide (some of) the analyses you do.

This EDA should also help to inform you of how pneumonia looks in the wild. E.g. what other types of diseases it's commonly found with, how often it is found, what ages it affects, etc. 

Note that this NIH dataset was not specifically acquired for pneumonia. So, while this is a representation of 'pneumonia in the wild,' the prevalence of pneumonia may be different if you were to take only chest x-rays that were acquired in an ER setting with suspicion of pneumonia. 

Perform the following EDA:
1. The patient demographic data such as gender, age, patient position, etc. (as it is available)
2. The x-ray views taken (i.e. view position)
3. The number of cases including: 
    1. number of pneumonia cases,
    2. number of non-pneumonia cases
4. The distribution of other diseases that are comorbid with pneumonia
5. Number of diseases per patient
6. Pixel-level assessments of the imaging data for healthy & disease states of interest (e.g. histograms of intensity values) and compare distributions across diseases.

Note: use the full NIH data to perform the first few EDA items and use `sample_labels.csv` for the pixel-level assessments.

Also, **describe your findings and how you will set up the model training based on the findings.**

In [None]:
!ls -la ../input/data

In [None]:
## Below is some helper code to read data for you.
## Load NIH data
all_xray_df = pd.read_csv('../input/data/Data_Entry_2017.csv')
all_xray_df.head()

In [None]:

all_xray_df.shape


In [None]:
#Print a concise summary of our DataFrame 
all_xray_df.info()

### It turns out that
1. All columns except the last one have no missing values.
    So, missing data handling is not required!
2. The last column is unnamed and contains only missing data.
    So, we will drop it!
3. We have some columns of object data type.
    So, we need to examine them in further detail before encoding them into numerical format.
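
Dropping the empty trailing column noted above could look like the sketch below. The toy frame and the column name `Unnamed: 11` are assumptions for illustration only; `dropna(axis=1, how="all")` removes any column that holds nothing but missing values, so the exact name does not matter.

```python
import numpy as np
import pandas as pd

# Toy stand-in for all_xray_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "Image Index": ["a.png", "b.png", "c.png"],
    "Patient Age": [58, 81, 74],
    "Unnamed: 11": [np.nan, np.nan, np.nan],  # assumed name of the empty column
})

# Drop every column whose values are all missing.
cleaned_df = toy_df.dropna(axis=1, how="all")
print(list(cleaned_df.columns))
```

Columns that contain at least one real value are left untouched by this call.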

# 1. Analysis of Demographic Data
## 1.1. Age

In [None]:
#Summary statistics 
all_xray_df['Patient Age'].describe()

### It turns out that
1. Max age is 414!
    So, there are outliers in our data which need to be handled!
2. Min age is 1.
    So, we have no outlier values here.
3. Mean and median are close to each other.
    So, the age distribution is only slightly skewed!
    We will confirm that visually using a *histogram*.
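
Besides comparing mean and median by eye, skewness can be checked numerically. The sketch below uses made-up ages (not the NIH data) to show how a single extreme value pulls the mean above the median and produces a positive `Series.skew()`.

```python
import pandas as pd

# Made-up ages; 414 mimics the kind of outlier seen in the real column.
ages = pd.Series([25, 33, 41, 47, 52, 58, 63, 70, 414])

mean_age = ages.mean()      # pulled upward by the extreme value
median_age = ages.median()  # robust to the extreme value
skewness = ages.skew()      # positive for a long right tail

print(f"mean={mean_age:.1f}, median={median_age:.1f}, skew={skewness:.2f}")
```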
    

In [None]:
Age = all_xray_df['Patient Age']
# Fit a normal distribution to the data:
mu, std = norm.fit(Age)

# Plot the histogram.
plt.hist(Age, bins=50, density=True, alpha=0.6, color='g')

# Plot the PDF.
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

**Finding Outliers in Age Column**
> Subjectively, I will consider any age above 100 an outlier, to be dropped if there are just a few such records.
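
The cutoff rule above can be sketched on toy data as follows. The constant name `MAX_PLAUSIBLE_AGE` and the sample ages are assumptions for illustration; only the threshold of 100 comes from the text.

```python
import pandas as pd

# Toy ages; the real column is all_xray_df['Patient Age'].
toy_df = pd.DataFrame({"Patient Age": [5, 34, 67, 100, 148, 414]})

MAX_PLAUSIBLE_AGE = 100  # subjective cutoff taken from the text above

# Flag rows strictly above the cutoff, count them, then keep the rest.
outlier_mask = toy_df["Patient Age"] > MAX_PLAUSIBLE_AGE
n_outliers = int(outlier_mask.sum())
filtered_df = toy_df[~outlier_mask].reset_index(drop=True)

print(n_outliers, filtered_df["Patient Age"].tolist())
```

Counting the flagged rows first makes it easy to confirm that only a few records are being discarded.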

In [None]:

over100 = all_xray_df[all_xray_df['Patient Age'] > 100]
print(len(over100))
over100

### Dropping Age outliers

In [None]:
# Drop ages above 100 (keep every record with age <= 100)
all_xray_df = all_xray_df[all_xray_df['Patient Age'] <= 100]
# Reset the index so it is contiguous after filtering
all_xray_df = all_xray_df.reset_index(drop=True)

In [None]:
# Checking distribution after removing outliers
Age = all_xray_df['Patient Age']
# Fit a normal distribution to the data:
mu, std = norm.fit(Age)

# Plot the histogram.
plt.hist(Age, bins=50, density=True, alpha=0.6, color='g')

# Plot the PDF.
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

## 1.2. Gender

In [None]:
# Checking Gender distribution visually
all_xray_df['Patient Gender'].value_counts().plot(kind="bar", title="Gender Distribution", color = ["purple","orange"]);

In [None]:
all_xray_df['Patient Gender'].dtypes

### It turns out that:
1. There is no messy data in Gender (only male & female).
    So, no further cleaning is required.
2. There is an imbalance in gender.
    So, we should calculate the ratio numerically.
3. The data is of object type.
    So, we may need to encode it into numerical format.
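
As a compact way to get both counts and percentages, `value_counts` can be called once with and once without `normalize=True`. The sketch below uses a made-up gender column; the `count` and `percent` column names are arbitrary choices for this example.

```python
import pandas as pd

# Made-up stand-in for all_xray_df['Patient Gender'].
gender = pd.Series(["M", "F", "M", "M", "F"], name="Patient Gender")

counts = gender.value_counts()
shares = gender.value_counts(normalize=True).mul(100).round(2)

# Both Series share the same index (the category labels), so they align.
summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```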
    

In [None]:
#Calculate the ratio
males = all_xray_df[all_xray_df['Patient Gender'] == 'M']
females = all_xray_df[all_xray_df['Patient Gender'] == 'F']
print(f'Patient Gender distribution\nMale: {len(males)} ({100.0*len(males)/len(all_xray_df):.2f}%), Female: {len(females)} ({100.0*len(females)/len(all_xray_df):.2f}%)')

## 1.3. View position

In [None]:
# Descriptive statistics 
all_xray_df['View Position'].describe()

In [None]:
# Checking View Position distribution visually
all_xray_df['View Position'].value_counts().plot(kind="bar", title="View Position Distribution", color = ["purple","orange"]);

In [None]:
all_xray_df['View Position'].dtypes

### It turns out that:
1. There is no messy data in View Position (only PA & AP).
    So, no further cleaning is required.
2. There is an imbalance in View Position.
    So, we should calculate the ratio numerically.
3. The data is of object type.
    So, we may need to encode it into numerical format.
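
One simple way to encode two-valued object columns like these is a 0/1 indicator per column. This is only a sketch on toy data; the new column names `gender_is_male` and `view_is_pa` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy rows using the same category values as the real columns.
toy_df = pd.DataFrame({
    "Patient Gender": ["M", "F", "F", "M"],
    "View Position": ["PA", "AP", "PA", "PA"],
})

# Compare against one category and cast the boolean result to 0/1.
encoded_df = toy_df.assign(
    gender_is_male=(toy_df["Patient Gender"] == "M").astype(int),
    view_is_pa=(toy_df["View Position"] == "PA").astype(int),
)
print(encoded_df)
```

For columns with more than two categories, `pd.get_dummies` would be a more general option.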
    

In [None]:
PA_View = all_xray_df[all_xray_df['View Position'] == 'PA']
AP_View = all_xray_df[all_xray_df['View Position'] == 'AP']
print(f'View position value distribution\nPA: {len(PA_View)} ({100.0*len(PA_View)/len(all_xray_df):.2f}%), AP: {len(AP_View)} ({100.0*len(AP_View)/len(all_xray_df):.2f}%)')

## 1.4. Follow-up number

In [None]:
all_xray_df['Follow-up #'].describe()

In [None]:
all_xray_df['Follow-up #'].hist(bins=100);

### It turns out that:
1. There is no messy data in follow-up numbers (only integer values).
    So, no further cleaning is required.
2. The distribution follows a roughly exponential, generally decreasing pattern.
3. 75% of records have follow-up numbers below 11, and only 25% are between 11 and 183.

## 1.5. Finding Labels

In [None]:
# Checking statistics 
all_xray_df['Finding Labels'].describe()

In [None]:
# Visualize the distribution of the first (Largest) twenty labels
all_xray_df['Finding Labels'].value_counts()[:20].plot(kind= "bar");


In [None]:
# Display the count for every label
all_xray_df['Finding Labels'].value_counts()

### There are 15 classes (14 diseases, and one for "No findings"). Images can be classified as "No findings" or one or more disease classes. Reference [here](https://www.kaggle.com/nih-chest-xrays/data)

### It turns out that :


### 1. We have 835 combinations (co-occurrences) of our 14 primary classes.

### 2. More than half of the images are normal (No Finding).
### 3. There is a significant imbalance between labels, which will need to be handled well during training to get reliable results.
### 4. Some label combinations are represented by only one image!
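
Because each row stores its findings as one `|`-separated string, counting combinations and counting individual labels are two different questions. The sketch below, on made-up label strings, shows both: `nunique()` for distinct combinations, and `str.split` plus `explode` for per-label counts.

```python
import pandas as pd

# Made-up label strings in the same '|'-separated format as 'Finding Labels'.
finding_labels = pd.Series([
    "No Finding",
    "Infiltration",
    "Effusion|Infiltration",
    "Pneumonia|Infiltration",
    "No Finding",
    "Atelectasis|Effusion|Pneumonia",
])

# Number of distinct label combinations (what describe() reports as 'unique').
n_combinations = finding_labels.nunique()

# Split each string into a list, put one label per row, then count.
per_label_counts = finding_labels.str.split("|").explode().value_counts()
print(n_combinations)
print(per_label_counts)
```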

In [None]:
# Visualize the distribution of the most frequent positive findings (excluding the No Finding cases)

Finding = all_xray_df[all_xray_df['Finding Labels'] != 'No Finding']
Finding['Finding Labels'].value_counts()[:20].plot(kind= 'bar');


## 1.6. Patient ID

In [None]:
# Checking if ID is unique or not ?
all_xray_df['Patient ID'].nunique()

In [None]:
# Display the frequency of each Patient ID
all_xray_df['Patient ID'].value_counts()[:50]

In [None]:
all_xray_df['Patient ID'].value_counts().describe()

In [None]:

all_xray_df['Patient ID'].nunique()/len(all_xray_df['Patient ID'])

### It turns out that:
### 1. Patient ID is not unique, which is expected because each patient may have more than one image.
### 2. The percentage of unique values is about 27.5%.
### 3. 75% of patients have fewer than 3 images.
### 4. 25% of patients have more than 3 images, and one patient has 184 images!
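
Since one patient can contribute many images, it may be worth keeping all of a patient's images on the same side of a train/validation split. The sketch below is one possible way to do that on toy data, not necessarily what the original project does; the 20% validation fraction and the seed are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame: 10 images belonging to 5 patients.
toy_df = pd.DataFrame({
    "Image Index": [f"img_{i}.png" for i in range(10)],
    "Patient ID": [1, 1, 1, 2, 3, 3, 4, 5, 5, 5],
})

# Images per patient, the quantity summarized in the findings above.
images_per_patient = toy_df.groupby("Patient ID").size()

# Shuffle the unique patient IDs and hold out a fraction of patients.
rng = np.random.default_rng(0)
patient_ids = rng.permutation(toy_df["Patient ID"].unique())
n_val = max(1, int(0.2 * len(patient_ids)))
val_patients = patient_ids[:n_val]

is_val = toy_df["Patient ID"].isin(val_patients)
train_df, val_df = toy_df[~is_val], toy_df[is_val]
print(len(train_df), len(val_df))
```

Splitting by patient rather than by image means no patient appears in both subsets.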


## 1.7. Image width & height, pixel spacing

In [None]:
all_xray_df['OriginalImage[Width'].describe()

In [None]:
all_xray_df['OriginalImage[Width'].value_counts().describe()

In [None]:
# The largest 6 values
all_xray_df['OriginalImage[Width'].value_counts()[:6]

In [None]:
# Calculate the percentage of the largest six categories 
100 *all_xray_df['OriginalImage[Width'].value_counts()[:6].sum()/len(all_xray_df['OriginalImage[Width'])

### It turns out that :
### 1. We have 904 unique values for the Image width.
### 2. The most common value is `2500` which appeared in one third of patients.
### 3. 82 % of patient images are represented by only 6 values! 

In [None]:
all_xray_df['Height]'].describe()


In [None]:
all_xray_df['Height]'].value_counts().describe()


In [None]:
# The 10 most frequent values
all_xray_df['Height]'].value_counts()[:10]

In [None]:
# Calculate the percentage of the largest six categories 
100 *all_xray_df['Height]'].value_counts()[:6].sum()/len(all_xray_df['Height]'])

### It turns out that :
### 1. We have 1137 unique values for the Image Height.
### 2. The most common value is `2048` which appeared in one third of patients.
### 3. 88 % of patient images are represented by only 6 values! 

**There is a strong similarity between Width and Height statistics, which is expected**
> **So let's visualize their bivariate relationship using a scatter plot**

In [None]:
all_xray_df.plot(x='OriginalImage[Width', y='Height]', style='o');


**There is a strong positive linear relationship**
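
A scatter plot shows the relationship visually; a Pearson correlation coefficient puts a number on it. The sketch below uses made-up sizes (not the NIH values) with the dataset's column names, so the printed coefficient says nothing about the real data.

```python
import pandas as pd

# Made-up image sizes using the same column names as the dataset.
sizes = pd.DataFrame({
    "OriginalImage[Width": [1000, 1500, 2000, 2500, 3000],
    "Height]": [900, 1400, 2100, 2400, 3100],
})

# Series.corr defaults to the Pearson correlation coefficient.
pearson_r = sizes["OriginalImage[Width"].corr(sizes["Height]"])
print(f"Pearson r = {pearson_r:.3f}")
```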

In [None]:
# Calculate the share of images of size 2500 * 2048, the most common values of Width and Height
size2500x2048 = all_xray_df[(all_xray_df['OriginalImage[Width'] == 2500) & (all_xray_df['Height]'] == 2048)]
len (size2500x2048) / all_xray_df.shape[0]

### As expected, one third of the images have the 2500 * 2048 size!

# 2. Analysis of Pneumonia cases


## 2.1. Calculation of Pneumonia Cases
> **Because we have co-occurrences of findings, we need to split them out to get accurate statistics about Pneumonia cases.** 
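
The splitting described above can also be sketched with pandas' built-in `Series.str.get_dummies`, which creates one 0/1 column per label and matches whole labels rather than substrings. This is an alternative illustration on toy data, not the approach used in the cells of this notebook.

```python
import pandas as pd

# Toy rows in the same '|'-separated format as 'Finding Labels'.
toy_df = pd.DataFrame({
    "Finding Labels": [
        "No Finding",
        "Pneumonia|Infiltration",
        "Effusion",
        "Edema|Pneumonia",
    ],
})

# One indicator column per distinct label, split on the '|' separator.
indicators = toy_df["Finding Labels"].str.get_dummies(sep="|")
toy_df = pd.concat([toy_df, indicators], axis=1)

# Pneumonia cases, whether alone or co-occurring with other findings.
n_pneumonia = int(toy_df["Pneumonia"].sum())
print(n_pneumonia)
```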

In [None]:
# Getting unique Findings
findings = set()
for f in all_xray_df['Finding Labels'].unique():
    findings.update(f.split('|'))
print(f'Total number of single diagnoses: {len(findings)}')
findings

In [None]:
# Map each single finding to its own indicator column
# 1 means the finding is present in the image's labels, 0 means it is absent
for finding in findings:
    all_xray_df[finding] = all_xray_df['Finding Labels'].map(lambda x: 1 if finding in x.split('|') else 0)

all_xray_df

In [None]:
# Calculate the number of pneumonia cases, whether found alone or co-occurring with other findings
pneumonia = all_xray_df[all_xray_df['Pneumonia'] == 1]
pneumonia

In [None]:
# Calculate all findings cases 
all_findings = all_xray_df[all_xray_df["No Finding"] == 0]
print(f'All findings: {len(all_findings)}')

In [None]:
print(f'Pneumonia images: {len(pneumonia)} ({100.0*len(pneumonia)/len(all_xray_df) :.2f}% of all)')
print(f'Pneumonia images: {len(pneumonia)} ({100.0*len(pneumonia)/len(all_findings) :.2f}% of findings)')

In [None]:
no_pneumonia = all_xray_df[all_xray_df["Pneumonia"] == 0]
print(f'No pneumonia: {len(no_pneumonia)}')

In [None]:
no_pneumonia_findings = all_xray_df[ (all_xray_df["Pneumonia"] == 0) & (all_xray_df["No Finding"] == 0) ]
print(f'No pneumonia among findings: {len(no_pneumonia_findings)}')

### It turns out that:
### 1. Pneumonia cases are only 1.28% of all cases.
### 2. Pneumonia images are 2.76% of all images with findings.
### So, we have a severe imbalance among classes, which will need extra caution.
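
One common way to handle this kind of imbalance when building a training set is to downsample the negative class to match the positives. The sketch below shows the idea on toy data with a 5% prevalence; it is only one option among several (class weights and oversampling are others), and the seed is arbitrary.

```python
import pandas as pd

# Toy labels: 5 positives and 95 negatives (5% prevalence, made up).
toy_df = pd.DataFrame({"Pneumonia": [1] * 5 + [0] * 95})

prevalence = toy_df["Pneumonia"].mean()

positives = toy_df[toy_df["Pneumonia"] == 1]
negatives = toy_df[toy_df["Pneumonia"] == 0]

# Keep all positives, sample an equal number of negatives, then shuffle.
balanced_df = (
    pd.concat([positives, negatives.sample(n=len(positives), random_state=0)])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)
print(prevalence, len(balanced_df))
```

A balanced set like this is typically used for training only; a validation set usually keeps a prevalence closer to what is expected in practice.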

## 2.2. Bivariate analysis of pneumonia cases with other features

> ### **With Age**

In [None]:
Age = pneumonia['Patient Age']
# Fit a normal distribution to the data:
mu, std = norm.fit(Age)

# Plot the histogram.
plt.hist(Age, bins=50, density=True)

# Plot the PDF.
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, std)
plt.title(title)

plt.show()

In [None]:
# Summary statistics of Age across Pneumonia cases
pneumonia['Patient Age'].describe()


In [None]:
# Summary statistics of Age across the whole data
all_xray_df["Patient Age"].describe()

### It turns out that:
### 1. Age among Pneumonia cases is approximately normally distributed.
### 2. Age distribution among Pneumonia cases is similar to Age distribution in the whole dataset.
> **So, there is no evidence that getting infected with Pneumonia depends on your age!**
 
> **This claim needs to be verified with further analysis.**
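
One possible form of that further analysis is a permutation test on the difference in mean age between the two groups. The sketch below is a minimal version on made-up ages; the function name `permutation_p_value`, the number of permutations, and the seeds are all assumptions for illustration.

```python
import numpy as np


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Reassign group membership at random and recompute the difference.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up ages standing in for pneumonia and non-pneumonia images.
rng = np.random.default_rng(42)
ages_pneumonia = rng.normal(loc=45, scale=17, size=150)
ages_other = rng.normal(loc=47, scale=17, size=1500)

p_value = permutation_p_value(ages_pneumonia, ages_other)
print(f"p-value for a difference in mean age: {p_value:.3f}")
```

A small p-value would suggest the mean ages differ. Note that this test treats images as independent, which is only approximately true when one patient contributes several images.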

> ### **With Gender**

In [None]:
pneumonia['Patient Gender'].value_counts().plot(kind="bar", title="Gender Distribution", color = ["purple","orange"]);

In [None]:
#Calculate the ratio
males = pneumonia[pneumonia['Patient Gender'] == 'M']
females = pneumonia[pneumonia['Patient Gender'] == 'F']
print(f'Patient Gender distribution\nMale: {len(males)} ({100.0*len(males)/len(pneumonia):.2f}%), Female: {len(females)} ({100.0*len(females)/len(pneumonia):.2f}%)')

### It turns out that the Gender distribution across Pneumonia cases is very similar to the Gender distribution across the whole data.
### So, getting infected with pneumonia may not depend on your Gender!
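
The comparison above can be made in a single table with `pd.crosstab`, normalizing within each Pneumonia group so the gender shares sit side by side. The rows below are made up for illustration.

```python
import pandas as pd

# Toy rows: gender plus the 0/1 Pneumonia indicator.
toy_df = pd.DataFrame({
    "Patient Gender": ["M", "F", "M", "F", "M", "M", "F", "M"],
    "Pneumonia": [1, 0, 0, 1, 0, 1, 0, 0],
})

# normalize="columns" makes each Pneumonia column sum to 1.
gender_share = pd.crosstab(
    toy_df["Patient Gender"], toy_df["Pneumonia"], normalize="columns"
)
print(gender_share)
```

If the two columns look alike, gender and pneumonia status show little association in the sample.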

> ### **With View Position**

In [None]:
pneumonia['View Position'].value_counts().plot(kind="bar", title="View Position Distribution", color = ["purple","orange"]);

> ### **With Patient ID**

In [None]:
# Checking if ID is unique or not ?
pneumonia['Patient ID'].nunique()

In [None]:
round (100* pneumonia['Patient ID'].nunique()/len(pneumonia['Patient ID']), 2)