# Case Study: Data Preprocessing (Keras & PyTorch)

In the first assignment of this course, you will work with chest x-ray images taken from the public [ChestX-ray8 dataset](https://arxiv.org/abs/1705.02315). In this notebook, you'll get a chance to explore this dataset and familiarize yourself with some of the techniques you'll use in the first graded assignment.

<img src="images/xray-image.png" alt="Chest X-ray image" width="300" align="middle"/>

The first step before jumping into writing code for any machine learning project is to explore your data. A standard Python package for analyzing and manipulating data is [pandas](https://pandas.pydata.org/docs/#).

In the next two code cells, you'll import `pandas`, `numpy` (for numerical manipulation), and the plotting libraries `matplotlib` and `seaborn`, then use `pandas` to read a CSV file into a dataframe and print out the first few rows of data.

In [1]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os
import seaborn as sns

# %matplotlib inline
sns.set()

### Exploration

In [2]:
train_df = pd.read_csv('data/train-small.csv')

# report the shape, then show the first 3 rows
print(f'There are {train_df.shape[0]} rows & {train_df.shape[1]} columns')
train_df.head(3)

There are 1000 rows & 16 columns


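| | Image | Atelectasis | Cardiomegaly | Consolidation | Edema | Effusion | Emphysema | Fibrosis | Hernia | Infiltration | Mass | Nodule | PatientId | Pleural_Thickening | Pneumonia | Pneumothorax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00008270_015.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8270 | 0 | 0 | 0 |
| 1 | 00029855_001.png | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 29855 | 0 | 0 | 0 |
| 2 | 00001297_000.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1297 | 1 | 0 | 0 |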


To decide how to handle the data, we need to investigate the columns thoroughly. Let's first find out which diseases are present and how often each one occurs.

The preview shows that a single image can carry several labels at once (row 1 is positive for Atelectasis, Effusion, and Infiltration), while other images carry none. We therefore need to know how many rows contain no disease, a single disease, or multiple diseases.

In [7]:
# check unique diseases
columns = train_df.columns.tolist()

# drop irrelevant values
columns.remove('Image')
columns.remove('PatientId')

# show unique diseases
print(columns)

['Atelectasis', 'Cardiomegaly', 'Consolidation', 'Edema', 'Effusion', 'Emphysema', 'Fibrosis', 'Hernia', 'Infiltration', 'Mass', 'Nodule', 'Pleural_Thickening', 'Pneumonia', 'Pneumothorax']


In [11]:
# Print out the number of positive labels for each class
for column in columns:
    print(f"The class {column} has {train_df[column].sum()} samples")

The class Atelectasis has 106 samples
The class Cardiomegaly has 20 samples
The class Consolidation has 33 samples
The class Edema has 16 samples
The class Effusion has 128 samples
The class Emphysema has 13 samples
The class Fibrosis has 14 samples
The class Hernia has 2 samples
The class Infiltration has 175 samples
The class Mass has 45 samples
The class Nodule has 54 samples
The class Pleural_Thickening has 21 samples
The class Pneumonia has 10 samples
The class Pneumothorax has 38 samples
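
The per-class counts above can also be computed without an explicit loop. The following is a minimal sketch, assuming a small stand-in dataframe (`toy_df`, with made-up values) in place of `train_df`; the names `toy_df`, `label_cols`, `positive_counts`, and `prevalence` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for train_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Image": ["a.png", "b.png", "c.png", "d.png"],
    "Atelectasis": [0, 1, 0, 1],
    "Effusion": [0, 1, 0, 0],
    "Hernia": [0, 0, 0, 0],
    "PatientId": [1, 2, 3, 4],
})

# Every column except the image name and patient ID is treated as a label
label_cols = [c for c in toy_df.columns if c not in ("Image", "PatientId")]

# Column-wise sum of 0/1 labels gives the number of positives per class
positive_counts = toy_df[label_cols].sum().sort_values(ascending=False)

# Column-wise mean of 0/1 labels gives the fraction of positive samples
prevalence = toy_df[label_cols].mean()

print(positive_counts)
print(prevalence)
```

Sorting the counts makes the rarest classes easy to spot, which is useful when the labels are as unevenly distributed as in the output above.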


In [10]:
# check the number of missing values
print(train_df.isna().sum())

# check the dataset
train_df.info()

Image                 0
Atelectasis           0
Cardiomegaly          0
Consolidation         0
Edema                 0
Effusion              0
Emphysema             0
Fibrosis              0
Hernia                0
Infiltration          0
Mass                  0
Nodule                0
PatientId             0
Pleural_Thickening    0
Pneumonia             0
Pneumothorax          0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Image               1000 non-null   object
 1   Atelectasis         1000 non-null   int64 
 2   Cardiomegaly        1000 non-null   int64 
 3   Consolidation       1000 non-null   int64 
 4   Edema               1000 non-null   int64 
 5   Effusion            1000 non-null   int64 
 6   Emphysema           1000 non-null   int64 
 7   Fibrosis            1000 non-null   int64 
 8   Hernia      

In [15]:
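def get_name(series):
    # keep only the entries of this row whose value equals 1 (the positive labels)
    name = series[series == 1]
    print(name, end='\n\n\n')

# containers for counting rows (initialized here, not yet filled in this cell)
normal_ids, normal_count = [], 0
multi_ids, multi_count = [], 0

# iterate over the dataframe and print the positive labels of the first six rows
for idx, content in train_df.iterrows():
    get_name(content)

    # stop after the row with index 5
    if idx == 5:
        break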

Series([], Name: 0, dtype: object)


Atelectasis     1
Effusion        1
Infiltration    1
Name: 1, dtype: object


Pleural_Thickening    1
Name: 2, dtype: object


Series([], Name: 3, dtype: object)


Infiltration    1
Name: 4, dtype: object


Atelectasis     1
Cardiomegaly    1
Effusion        1
Infiltration    1
Name: 5, dtype: object
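
The loop above only prints the positive labels of the first six rows; the `normal_*` and `multi_*` counters it sets up are not updated. One possible way to obtain those counts is a row-wise sum over the label columns. This is a minimal sketch, assuming a made-up stand-in dataframe (`toy_df`) rather than the real `train_df`; `label_cols`, `labels_per_row`, and `single_ids` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical stand-in for train_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Image": ["a.png", "b.png", "c.png", "d.png", "e.png"],
    "Atelectasis": [0, 1, 0, 0, 1],
    "Effusion": [0, 1, 0, 0, 1],
    "Infiltration": [0, 1, 0, 1, 0],
    "PatientId": [10, 11, 12, 13, 14],
})

# Restrict to label columns so that IDs and file names are never compared to 1
label_cols = [c for c in toy_df.columns if c not in ("Image", "PatientId")]

# Number of positive labels on each row
labels_per_row = toy_df[label_cols].sum(axis=1)

# Row indices grouped by how many diseases they carry
normal_ids = toy_df.index[labels_per_row == 0].tolist()
single_ids = toy_df.index[labels_per_row == 1].tolist()
multi_ids = toy_df.index[labels_per_row > 1].tolist()

print(f"normal: {len(normal_ids)}, single: {len(single_ids)}, multiple: {len(multi_ids)}")
```

Restricting the sum to the label columns also avoids a subtle issue with comparing a whole row to 1: a `PatientId` that happens to equal 1 would otherwise be reported as if it were a positive label.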


