# Lung diseases data analysis

You'll learn how to analyze sample lung disease data.

This demo is a Jupyter notebook, i.e. it is intended to be run step by step.

Author: Eric Einspänner
<br>
Contributor: Nastaran Takmilhomayouni

First version: 6th of July 2023


Copyright 2023 Clinic of Neuroradiology, Magdeburg, Germany

License: Apache-2.0

*This notebook is inspired by: https://www.kaggle.com/code/sbernadac/lung-deseases-data-analysis/notebook*

*Dataset: http://openaccess.thecvf.com/content_cvpr_2017/papers/Wang_ChestX-ray8_Hospital-Scale_Chest_CVPR_2017_paper.pdf*

# Table Of Contents
0. [Initial Set-Up for Google Colab](#initial-set-up-for-google-colab)
1. [Initial Set-Up (offline)](#initial-set-up-offline)
2. [Data Analysis](#Data-Analysis)
    - [Data cleaning](#Data-cleaning)
    - [Display number of each disease by patient gender](#Display-number-of-each-deseases-by-patient-gender)
3. [Age distribution](#Age-distribution)

## Initial Set-Up for Google Colab
<u> Execute these code blocks only in Google Colab! </u>

In [None]:
!git clone https://github.com/University-Clinic-of-Neuroradiology/python-bootcamp.git

In [None]:
import os
import sys
from google.colab import output
output.enable_custom_widget_manager()

sys.path.insert(0,'/content/python-bootcamp/notebooks/DataManagement')
os.chdir(sys.path[0])

In [None]:
%pip install -q ipympl numpy matplotlib pandas seaborn

In [None]:
import numpy as np                          # linear algebra
import pandas as pd                         # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt             # visualization
import seaborn as sns                       # visualization
import matplotlib.gridspec as gridspec      # grid layout to place subplots within a figure.
import matplotlib.ticker as ticker          # configuring plot tick locating and formatting
sns.set_style('whitegrid')                  # color of the background and whether a grid is enabled

## Initial Set-Up (offline)

In [None]:
# Make sure figures appear inline and animations work
# Edit this to "%matplotlib notebook" when using the "classic" Jupyter notebook interface
%matplotlib widget

In [None]:
import numpy as np                          # linear algebra
import pandas as pd                         # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt             # visualization
import seaborn as sns                       # visualization
import matplotlib.gridspec as gridspec      # grid layout to place subplots within a figure.
import matplotlib.ticker as ticker          # configuring plot tick locating and formatting
sns.set_style('whitegrid')                  # color of the background and whether a grid is enabled

## --- Start notebook ---

## Data Analysis
To read a CSV file, you can use the `read_csv` function from the pandas module.

In [None]:
# import data set description
df = pd.read_csv('./Data/Data_Entry_2017.csv') # read the csv file from the given location
df.head() # return the first n rows. n=int, default 5

In [None]:
df.describe() # generate descriptive statistics of quantitative variables
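
If the data file is not at hand, the same calls can be tried on a tiny in-memory table. This is only a sketch: the three rows below are made up, and the column names are assumed to mimic the ones used in this notebook.

```python
import io

import pandas as pd

# Hypothetical three-row sample (not taken from the real data set)
csv_text = """Finding Labels,Follow-up #,Patient ID,Patient Age,Patient Gender
Cardiomegaly,0,1,058Y,M
Cardiomegaly|Emphysema,1,1,058Y,M
No Finding,0,2,081Y,F
"""

# read_csv accepts file paths as well as file-like objects such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))     # first two rows
print(sample.dtypes)      # 'Patient Age' is read as text because of the unit suffix
print(sample.describe())  # by default only the numeric columns are summarized
```

Note that `describe()` skips 'Patient Age' here, because a value such as `058Y` is not a number until the unit suffix is removed.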

### Data cleaning
You can select whichever columns of your dataframe you want to work with.

In [None]:
# drop unused columns
# keep the columns of the dataframe you want
df = df[['Index','Finding Labels','Follow-up #','Patient ID','Patient Age','Patient Gender']]
df.head() # return the first n rows. n=int, default 5
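
Selecting columns can be written in two equivalent ways: list the columns to keep, or drop the ones you do not need. A minimal sketch on a made-up two-row frame (the 'View Position' column is only an illustrative extra column):

```python
import pandas as pd

# Hypothetical toy frame with one column we do not need
toy = pd.DataFrame({
    'Finding Labels': ['Hernia', 'No Finding'],
    'Patient ID': [1, 2],
    'View Position': ['PA', 'AP'],
})

kept = toy[['Finding Labels', 'Patient ID']]   # keep by listing what you want
dropped = toy.drop(columns=['View Position'])  # or remove what you do not want

print(kept.equals(dropped))  # both approaches give the same dataframe
```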

In [None]:
# create a new column for each disease
# List of labels you are looking for
pathology_list = ['Cardiomegaly','Emphysema','Effusion','Hernia','Nodule','Pneumothorax','Atelectasis','Pleural_Thickening','Mass','Edema','Consolidation','Infiltration','Fibrosis','Pneumonia']

# print all the labels in 'Finding Labels' column of the df
print(df['Finding Labels'])


You can add columns to your dataframe and fill them by applying a function to already existing columns.

In [None]:
# apply a function on the labels in 'Finding Labels' column of the df
# which adds a column for each pathology and 1 if the pathology is in 'Finding Labels' and 0 if not
for pathology in pathology_list:
    df[pathology] = df['Finding Labels'].apply(lambda x: 1 if pathology in x else 0)  # function is applied to each value of the column

df.head()
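
As a possible alternative to the loop, pandas can split the label string and build all indicator columns in one call with `Series.str.get_dummies`. The sketch below compares both approaches on made-up labels; it is an illustration, not the method used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical labels in the same 'A|B' format as the 'Finding Labels' column
toy = pd.DataFrame({'Finding Labels': ['Cardiomegaly|Emphysema', 'No Finding', 'Emphysema']})

# Approach used in this notebook: one substring test per pathology
for pathology in ['Cardiomegaly', 'Emphysema']:
    toy[pathology] = toy['Finding Labels'].apply(lambda x: 1 if pathology in x else 0)

# Alternative: split on '|' and create one indicator column per distinct label
dummies = toy['Finding Labels'].str.get_dummies(sep='|')

print(toy)
print(dummies)
```

One difference to keep in mind: `get_dummies` also creates a column for 'No Finding', because it treats every distinct label the same way.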

In [None]:
print(df.select_dtypes(include=['category']))

### Display number of each disease by patient gender <a id="Display-number-of-each-deseases-by-patient-gender"></a>

In [None]:
df.head()

In [None]:
# use melt to reshape the data from wide to long format
# 'Patient Gender' is kept as the identifier column
# every column in 'pathology_list' is unpivoted: one output row per (original row, pathology) pair
data1 = pd.melt(df, id_vars=['Patient Gender'], value_vars = list(pathology_list))
data1

In [None]:
# change the column name 'variable' to 'Category' and 'value' to 'Count'
data1 = pd.melt(df,                             #dataframe
             id_vars = ['Patient Gender'],        #columns to keep
             value_vars = list(pathology_list), #variables with values of those columns
             var_name = 'Category',             #change 'variable' name to 'Category'
             value_name = 'Count')              #change 'value'  name to 'Count' 
data1

In [None]:
# Let's keep only those rows that have Count>0
data1 = data1.loc[data1.Count > 0]
data1
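
To see what `melt` does, it helps to run it on a table small enough to read in full. A minimal sketch with two made-up patients and two pathologies:

```python
import pandas as pd

# Hypothetical wide table: one row per X-ray, one 0/1 column per pathology
wide = pd.DataFrame({
    'Patient Gender': ['M', 'F'],
    'Hernia': [1, 0],
    'Mass': [0, 1],
})

# Long format: one row per (X-ray, pathology) pair
long = pd.melt(wide,
               id_vars=['Patient Gender'],
               value_vars=['Hernia', 'Mass'],
               var_name='Category',
               value_name='Count')
print(long)  # 2 rows x 2 pathologies = 4 rows

# Keep only the pairs where the pathology is actually present
positives = long.loc[long['Count'] > 0]
print(positives)
```

After the filter, every remaining row stands for one occurrence of a pathology, which is why a count plot of 'Category' gives the number of cases per pathology.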

In [None]:
# Let's take a look at the dataframe one more time
# In the 'Finding Labels' column we have 'No Finding' entries
df

In [None]:
# Let's create a 'Nothing' column with value 1 if 'No Finding' was in that row of the 'Finding Labels' column  and 0 if not
df['Nothing'] = df['Finding Labels'].apply(lambda x: 1 if 'No Finding' in x else 0)
df

In [None]:
# use melt to reshape the data from wide to long format
# 'Patient Gender' is kept as the identifier column
# the 'Nothing' column is unpivoted: one output row per original row
# change the column name 'variable' to 'Category' and 'value' to 'Count'
data2 = pd.melt(df,
             id_vars=['Patient Gender'],
             value_vars = ['Nothing'],
             var_name = 'Category',
             value_name = 'Count')
# Let's keep only those rows that have Count>0
data2 = data2.loc[data2.Count>0]
data2

Pathology and Non-pathology counts in female and male cases 

In [None]:
# Let's plot
plt.figure(figsize=(15,10)) # generate a new figure
gs = gridspec.GridSpec(8,1) # create a new grid layout with 8 rows and 1 column
ax1 = plt.subplot(gs[:7, :]) # ax1 to plot the first 7 rows
ax2 = plt.subplot(gs[7, :])  # ax2 to plot the last row
#-------------------------------------------
# Plot 'Cardiomegaly','Emphysema','Effusion','Hernia','Nodule',
#      'Pneumothorax','Atelectasis','Pleural_Thickening','Mass',
#      'Edema','Consolidation','Infiltration','Fibrosis','Pneumonia' counts for male and female

g = sns.countplot(y='Category', hue='Patient Gender', data=data1, ax=ax1, order = data1['Category'].value_counts().index)
ax1.set(ylabel="",xlabel="")
ax1.figure.set_size_inches(9, 9)
ax1.legend(fontsize=13)
ax1.set_title(f'X-ray partition (total number = {len(df)})', fontsize=15)  # total is computed from the data
ax1.set_xlim([0,12000])
plt.tight_layout()

# Plot 'Nothing' counts for male and female

g = sns.countplot(y='Category', hue='Patient Gender', data=data2, ax=ax2)
ax2.set(ylabel="", xlabel="Number of cases")
ax2.get_legend().remove()  # the legend of ax1 already explains the colors
plt.subplots_adjust(hspace=.5)
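
The layout idea of the cell above, a tall axis over a short one, can be sketched in isolation. The names `top` and `bottom` are only illustrative:

```python
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

fig = plt.figure(figsize=(6, 6))
gs = gridspec.GridSpec(8, 1, figure=fig)  # 8 rows, 1 column

top = fig.add_subplot(gs[:7, :])    # rows 0-6: room for many categories
bottom = fig.add_subplot(gs[7, :])  # row 7: room for a single category

top.set_title('tall axis')
bottom.set_xlabel('short axis')
```

Slicing the grid like a NumPy array is what lets one axis span several rows, so the single 'Nothing' bar does not take as much space as the 14 pathology bars.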

Patient age counts in female and male cases

In [None]:
# Let's draw a count plot: one bar per distinct "Patient Age" value, showing how often it occurs
# One panel is drawn for each "Patient Gender"
g = sns.catplot(x="Patient Age", col="Patient Gender", data=df, kind="count", aspect=0.8, palette="GnBu_d")
g.set_xticklabels(np.arange(0,121))
g.set_xticklabels(step=10)
g.figure.suptitle('Age distribution by sex', fontsize=15)
g.figure.subplots_adjust(top=.8)
g.figure.set_size_inches(9, 9)

# Age distribution 

In [None]:
# Let's take a look at the dataframe 
df

In [None]:
# Let's add an 'Age Type' column
df['Age Type'] = df['Patient Age'].apply(lambda x: x[-1:]) # keep only the last character: Y, M or D
df['Age Type'].unique()  # list the distinct units => Y, M and D

# we mainly have ages expressed in Years, but also a few expressed in Months or in Days
print('age expressed in years', df[df['Age Type']=='Y']['Patient ID'].count())
print('age expressed in months', df[df['Age Type']=='M']['Patient ID'].count())  
print('age expressed in days', df[df['Age Type']=='D']['Patient ID'].count())
df

In [None]:
# we are going to remove the unit character after the patient's age; ages in D and M are converted to years in the next cell
df['Age'] = df['Patient Age'].apply(lambda x: x[:-1]).astype(int) # keep everything before the last character, e.g. '058'; astype(int) drops the leading 0 -> 58
df

In [None]:
# convert the values in the 'Age' column that are given in months ('M') or days ('D') into years

df.loc[df['Age Type']=='M',['Age']] = df[df['Age Type']=='M']['Age'].apply(lambda x: round(x/12.)).astype(int)

df.loc[df['Age Type']=='D',['Age']] = df[df['Age Type']=='D']['Age'].apply(lambda x: round(x/365.)).astype(int)
print(df[df['Age Type']=='D']['Age'])
df[df['Age Type']=='M']['Age']

df['Age'].sort_values(ascending=False).head(20) # sort the age values in descending order and show the 20 largest

df.loc[df['Patient ID']==5567, ['Patient Age','Finding Labels','Follow-up #']].sort_values('Follow-up #', ascending=True)
df.loc[df['Patient ID']==5567, ['Patient Age','Finding Labels','Follow-up #']].sort_values('Patient Age', ascending=False)
df.head()
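
The three steps above (split off the unit, parse the number, convert to years) can also be written without `apply`. The following is a minimal sketch on made-up values, assuming every entry ends in exactly one of 'Y', 'M' or 'D':

```python
import pandas as pd

# Hypothetical age strings in the same format as the 'Patient Age' column
ages = pd.Series(['058Y', '024M', '730D'])

unit = ages.str[-1]                 # 'Y', 'M' or 'D'
number = ages.str[:-1].astype(int)  # numeric part, leading zeros are dropped

# How many of each unit make up one year
divisor = unit.map({'Y': 1, 'M': 12, 'D': 365})

years = (number / divisor).round().astype(int)
print(years.tolist())  # [58, 2, 2]
```

Mapping the unit to a divisor handles all three cases in one expression, so no separate `df.loc[...]` assignment per unit is needed.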

In [None]:
# Let's draw a count plot: one bar per distinct "Age" value, showing how often it occurs
# One panel is drawn for each "Patient Gender"
g = sns.catplot(x="Age", col="Patient Gender",data=df, kind="count", aspect=0.8, palette="GnBu_d")
g.set_xticklabels(np.arange(0,108))
g.set_xticklabels(step=10)
g.figure.suptitle('Age distribution by sex', fontsize=14)
g.figure.subplots_adjust(top=.8)

# Distribution looks more realistic now -> Exercise for students: Fix the age column!?

## Display pathology distribution by age and sex

In [None]:
f, axarr = plt.subplots(7, 2, sharex=True, figsize=(10, 20)) # create layout with 7 rows and 2 columns

i = 0
j = 0
x = np.arange(0, 100, 10)
for pathology in pathology_list: # Count the occurrences of each pathology for each age value and each "Patient Gender"
    # only rows whose 'Finding Labels' is exactly this pathology (single-label cases) are counted
    g = sns.countplot(x='Age', hue="Patient Gender", data=df[df['Finding Labels']==pathology], ax=axarr[i, j])
    axarr[i, j].set_title(pathology)
    g.set_xlim(0, 90)
    g.set_xticks(x) # place a tick every 10 positions
    g.set_xticklabels(x)
    j = (j + 1) % 2
    if j == 0:
        i = (i + 1) % 7
f.subplots_adjust(hspace=0.4) # height of the padding between subplots for a better view
f.subplots_adjust(wspace=0.1) # width of the padding between subplots for a better view

## Display number of patients by follow-up in detail

In [None]:
f, (ax1,ax2) = plt.subplots( 2, figsize=(15, 10))

# get the rows with a 'Follow-up #' of less than 15
data = df[df['Follow-up #'] < 15]

# plot number of patients having each number of 'Follow-up #' in an ordinal distribution
g = sns.countplot(x='Follow-up #', data=data, palette="GnBu_d", ax=ax1)

ax1.set_title('Follow-up distribution')
data = df[df['Follow-up #']>14] # get the rows with a 'Follow-up #' of 15 or more

# plot number of patients having each number of 'Follow-up #' in an ordinal distribution
g = sns.countplot(x='Follow-up #', data=data, palette="GnBu_d", ax=ax2)

x = np.arange(15,100,10)
g.set_ylim(15,450)
g.set_xlim(15,100)
g.set_xticks(x)
g.set_xticklabels(x)
f.subplots_adjust(top=1)

## Try to find links between pathologies

In [None]:
# Let's look at the dataframe again
df

In [None]:
# First display the most frequent combinations of multiple diseases (taken from the 23 most frequent label strings)
data = df.groupby('Finding Labels').count().sort_values('Patient ID', ascending=False).head(23)
data = data[['|' in index for index in data.index.values]]
data

In [None]:
# Group the dataframe by the 'Finding Labels' column and count the values in each group
# sort the groups by their 'Patient ID' count in descending order
data = df.groupby('Finding Labels').count().sort_values('Patient ID', ascending=False)
data.head()

In [None]:
# Some row labels of data contain more than one label, separated by a '|'
print(data.index) #The index (row labels) of the DataFrame.

In [None]:
df1 = data[['|' in index for index in data.index]].copy()  # rows whose label combines several pathologies

df2 = data[['|' not in index for index in data.index]]     # rows with exactly one label
df2 = df2[['No Finding' not in index for index in df2.index]].copy()  # drop 'No Finding'; copy so a column can be added safely

df2['Finding Labels'] = df2.index.values  # Simple Pathology dataframe
df1['Finding Labels'] = df1.index.values  # Multiple Pathology dataframe
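
The split into single and multiple pathologies can be checked on a handful of made-up labels. This sketch uses `value_counts` as a shortcut for the group-and-count step; it is an illustration under that assumption, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical labels in the same 'A|B' format as the 'Finding Labels' column
labels = pd.Series(['Mass', 'Mass|Nodule', 'No Finding', 'Mass', 'Mass|Nodule', 'Nodule'],
                   name='Finding Labels')

counts = labels.value_counts()  # one entry per distinct label string

# regex=False makes '|' a literal character instead of the regex "or" operator
is_multi = counts.index.to_series().str.contains('|', regex=False)

multiple = counts[is_multi]                    # combinations of pathologies
single = counts[~is_multi].drop('No Finding')  # exactly one pathology

print(multiple)
print(single)
```

The `regex=False` argument matters: as a regular expression, a bare `|` matches the empty string and would therefore match every label.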

In [None]:
f, ax = plt.subplots(sharex=True, figsize=(15, 10))

sns.set_color_codes("pastel")
g = sns.countplot(y='Category',data=data1, ax=ax, order = data1['Category'].value_counts().index,color='b',label="Multiple Pathologies")

sns.set_color_codes("muted")
g = sns.barplot(x='Patient ID', y='Finding Labels', data=df2, ax=ax, color="b", label="Simple Pathology")

ax.legend(ncol=2, loc="center right", frameon=True, fontsize=20)
ax.set(ylabel="", xlabel="Number of cases")
ax.set_title("Comparison between single and multiple diseases", fontsize=20)

sns.despine(left=True)

In [None]:
# keep only the groups of pathologies that appear more than 30 times
df3 = df1.loc[df1['Patient ID'] > 30, ['Patient ID','Finding Labels']]

for pathology in pathology_list:
    df3[pathology] = df3.apply(lambda x: x['Patient ID'] if pathology in x['Finding Labels'] else 0, axis=1)

df3.head(20)
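
Another way to look for links between pathologies is a co-occurrence matrix: how often does each pair of labels appear on the same X-ray? The sketch below uses made-up labels and is only one possible approach, not the one followed in the next cells.

```python
import pandas as pd

# Hypothetical labels in the same 'A|B' format as the 'Finding Labels' column
labels = pd.Series(['Mass|Nodule', 'Mass', 'Mass|Nodule|Effusion', 'Effusion'])

# One 0/1 column per label, one row per X-ray
indicators = labels.str.get_dummies(sep='|').astype(int)

# Entry (a, b) counts the rows that contain both a and b;
# the diagonal holds the total count of each label
cooccurrence = indicators.T.dot(indicators)
print(cooccurrence)
```

In this toy example 'Mass' and 'Nodule' share two rows, so `cooccurrence.loc['Mass', 'Nodule']` is 2, while the diagonal entry for 'Mass' is 3.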

In [None]:
# 'Hernia' does not have enough values to appear here
df4 = df3[df3['Hernia'] > 0]  # df4.size == 0

# remove 'Hernia' from list
pat_list = [elem for elem in pathology_list if 'Hernia' not in elem]

f, axarr = plt.subplots(13, sharex=True, figsize=(10, 140))
i = 0
for pathology in pat_list :
    df4 = df3[df3[pathology] > 0]
    if df4.size > 0:  # skip pathologies without any group left (as is the case for 'Hernia')
        axarr[i].pie(df4[pathology], labels = df4['Finding Labels'], autopct='%1.1f%%')
        axarr[i].set_title('main disease: ' + pathology, fontsize=14)