# **Prelim Skills Exam** <br>

**Name**: Mj Spencer Almodiel <br>
**Instructor**: Engr. Roman Richard <br>
**Date Performed**: 3/12/24

1. Build and train a CNN model from scratch. Apply different regularization techniques and data preprocessing to reduce overfitting.

2. Plot the training and validation loss and accuracy. The target accuracy is 85% or above.

3. Use the assigned pre-trained model and fine-tune it.

4. Build and train a CNN model using the modified pre-trained model.

5. Plot the training and validation loss and accuracy. The target accuracy is 95% or above.

6. Use the classification report, confusion matrix, and ROC AUC metric to also evaluate the performance of the scratch and pre-trained models.

### Import the data and apply data preprocessing techniques

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

import glob  # used to search for image files
%matplotlib inline

In [2]:
xray_df = pd.read_csv(r"archive\Data_Entry_2017.csv")
xray_df

Unnamed: 0,Image Index,Finding Labels,Follow-up #,Patient ID,Patient Age,Patient Gender,View Position,OriginalImage[Width,Height],OriginalImagePixelSpacing[x,y],Unnamed: 11
0,00000001_000.png,Cardiomegaly,0,1,58,M,PA,2682,2749,0.143,0.143,
1,00000001_001.png,Cardiomegaly|Emphysema,1,1,58,M,PA,2894,2729,0.143,0.143,
2,00000001_002.png,Cardiomegaly|Effusion,2,1,58,M,PA,2500,2048,0.168,0.168,
3,00000002_000.png,No Finding,0,2,81,M,PA,2500,2048,0.171,0.171,
4,00000003_000.png,Hernia,0,3,81,F,PA,2582,2991,0.143,0.143,
...,...,...,...,...,...,...,...,...,...,...,...,...
112115,00030801_001.png,Mass|Pneumonia,1,30801,39,M,PA,2048,2500,0.168,0.168,
112116,00030802_000.png,No Finding,0,30802,29,M,PA,2048,2500,0.168,0.168,
112117,00030803_000.png,No Finding,0,30803,42,F,PA,2048,2500,0.168,0.168,
112118,00030804_000.png,No Finding,0,30804,30,F,PA,2048,2500,0.168,0.168,


**Remove unnecessary columns**

In [11]:
xray_df = xray_df[['Image Index', 'Finding Labels', 'Follow-up #', 'Patient Age', 'Patient Gender']]

In [13]:
xray_df

Unnamed: 0,Image Index,Finding Labels,Follow-up #,Patient Age,Patient Gender
0,00000001_000.png,Cardiomegaly,0,58,M
1,00000001_001.png,Cardiomegaly|Emphysema,1,58,M
2,00000001_002.png,Cardiomegaly|Effusion,2,58,M
3,00000002_000.png,No Finding,0,81,M
4,00000003_000.png,Hernia,0,81,F
...,...,...,...,...,...
112115,00030801_001.png,Mass|Pneumonia,1,39,M
112116,00030802_000.png,No Finding,0,29,M
112117,00030803_000.png,No Finding,0,42,F
112118,00030804_000.png,No Finding,0,30,F


- The Patient Age column has a maximum value of 414, which is not a plausible age. Let's clean the data and keep only rows with an age under 100.

In [17]:
xray_df = xray_df[xray_df['Patient Age'] < 100]
xray_df.describe()

Unnamed: 0,Follow-up #,Patient Age
count,112104.0,112104.0
mean,8.574172,46.872574
std,15.406734,16.598152
min,0.0,1.0
25%,0.0,35.0
50%,3.0,49.0
75%,10.0,59.0
max,183.0,95.0


In [19]:
xray_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 112104 entries, 0 to 112119
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   Image Index     112104 non-null  object
 1   Finding Labels  112104 non-null  object
 2   Follow-up #     112104 non-null  int64 
 3   Patient Age     112104 non-null  int64 
 4   Patient Gender  112104 non-null  object
dtypes: int64(2), object(3)
memory usage: 5.1+ MB


In [20]:
# Map each image file name to its full path on disk
image_paths = {os.path.basename(x): x for x in glob.glob(os.path.join('archive', 'images*', '*', '*.png'))}
len(image_paths)

112120

In [21]:
# Work on an explicit copy so that column assignments do not trigger SettingWithCopyWarning
xray_df = xray_df.copy()
xray_df['path'] = xray_df['Image Index'].map(image_paths.get)
xray_df.head(5)

Unnamed: 0,Image Index,Finding Labels,Follow-up #,Patient Age,Patient Gender,path
0,00000001_000.png,Cardiomegaly,0,58,M,archive\images_001\images\00000001_000.png
1,00000001_001.png,Cardiomegaly|Emphysema,1,58,M,archive\images_001\images\00000001_001.png
2,00000001_002.png,Cardiomegaly|Effusion,2,58,M,archive\images_001\images\00000001_002.png
3,00000002_000.png,No Finding,0,81,M,archive\images_001\images\00000002_000.png
4,00000003_000.png,Hernia,0,81,F,archive\images_001\images\00000003_000.png
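Because `map(image_paths.get)` yields a missing value for any file name that was not found on disk, it can be worth checking for unmapped rows before training. The following is a minimal sketch of such a check on a toy frame; the file names, the path, and the names `toy_paths`, `toy`, and `n_missing` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical lookup table: only "a.png" exists on disk.
toy_paths = {"a.png": "archive/images_001/images/a.png"}

# Hypothetical metadata rows; "b.png" has no matching file.
toy = pd.DataFrame({"Image Index": ["a.png", "b.png"]})
toy["path"] = toy["Image Index"].map(toy_paths.get)

# Count rows whose image file could not be located.
n_missing = int(toy["path"].isna().sum())
print("rows without an image file:", n_missing)

# Drop those rows so that later image loading does not fail.
toy = toy.dropna(subset=["path"])
```

In this notebook the glob found 112120 files, so no missing paths are expected, but the check is cheap.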


- The first thing I noticed in the data is that some rows have multiple labels. I should convert the labels to binary values, one per finding, to preserve this multi-label structure.
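To illustrate what such a binary representation looks like, here is a minimal sketch that uses pandas' `Series.str.get_dummies` on a toy frame. The three rows are made up for illustration, and this is an alternative to the manual encoding used later in this notebook, not the method the notebook itself uses.

```python
import pandas as pd

# Toy data for illustration only; these rows are hypothetical.
toy = pd.DataFrame({
    "Finding Labels": ["Cardiomegaly", "Cardiomegaly|Effusion", "No Finding"]
})

# One 0/1 column per label found in the pipe-separated strings.
multi_hot = toy["Finding Labels"].str.get_dummies(sep="|")

# Treat "No Finding" as the absence of every finding.
multi_hot = multi_hot.drop(columns="No Finding")
print(multi_hot)
```

Each row can have several columns set to 1, which is what distinguishes multi-label encoding from ordinary one-hot encoding.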

In [22]:
# Replace the "No Finding" label with an empty string
xray_df['Finding Labels'] = xray_df['Finding Labels'].map(lambda x : x.replace('No Finding', ''))



In [23]:
from itertools import chain
# Split each row's findings on '|' and collect the unique label names
labels = np.unique(list(chain(*xray_df['Finding Labels'].map(lambda x: x.split('|')).tolist())))

# Remove the empty string left over from "No Finding"
labels = [x for x in labels if len(x) > 0]
labels

['Atelectasis',
 'Cardiomegaly',
 'Consolidation',
 'Edema',
 'Effusion',
 'Emphysema',
 'Fibrosis',
 'Hernia',
 'Infiltration',
 'Mass',
 'Nodule',
 'Pleural_Thickening',
 'Pneumonia',
 'Pneumothorax']

In [24]:
xray_df['Finding Labels'].value_counts()[:20]

                                     60353
Infiltration                          9546
Atelectasis                           4214
Effusion                              3955
Nodule                                2705
Pneumothorax                          2193
Mass                                  2139
Effusion|Infiltration                 1603
Atelectasis|Infiltration              1350
Consolidation                         1310
Atelectasis|Effusion                  1165
Pleural_Thickening                    1126
Cardiomegaly                          1093
Emphysema                              892
Infiltration|Nodule                    829
Atelectasis|Effusion|Infiltration      737
Fibrosis                               727
Edema                                  627
Cardiomegaly|Effusion                  484
Consolidation|Infiltration             441
Name: Finding Labels, dtype: int64

- As you can see, the dataset has over 112,000 rows, and some of them have multiple findings.

- Let's use only the labels that have more than 1000 instances. Some combinations of three findings exist, but they are rare compared to the size of the whole dataset. We can change this threshold later if the model's performance is good.

In [25]:
minCount = 1000

# Count the rows in which each label occurs
label_counts = {}
for label in labels:
    label_counts[label] = xray_df['Finding Labels'].str.contains(label, regex=False).sum()

# Keep only the labels that occur more than minCount times
filtered_labels = [label for label, count in label_counts.items() if count > minCount]
filtered_labels

['Atelectasis',
 'Cardiomegaly',
 'Consolidation',
 'Edema',
 'Effusion',
 'Emphysema',
 'Fibrosis',
 'Infiltration',
 'Mass',
 'Nodule',
 'Pleural_Thickening',
 'Pneumonia',
 'Pneumothorax']

In [26]:
for label, count in label_counts.items():
    if count > minCount:
        print(label, count)

Atelectasis 11558
Cardiomegaly 2776
Consolidation 4667
Edema 2302
Effusion 13316
Emphysema 2516
Fibrosis 1686
Infiltration 19891
Mass 5779
Nodule 6331
Pleural_Thickening 3384
Pneumonia 1430
Pneumothorax 5301
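The counts above rely on substring matching with `str.contains`. As a hedged alternative, the same tally can be sketched with `str.split` and `explode`, which counts exact label names and so cannot be confused by one label name appearing inside another. The toy series and the names `toy_labels`, `min_count`, and `kept` are hypothetical, and the threshold of 1 is chosen only to suit the toy data.

```python
import pandas as pd

# Toy label strings for illustration only; "" stands in for "No Finding".
toy_labels = pd.Series(["Mass|Nodule", "Mass", "", "Nodule|Hernia", "Mass"])

# Split on '|', put each label on its own row, then count exact matches.
counts = toy_labels.str.split("|").explode().value_counts()

# Drop the empty placeholder so it is not treated as a label.
counts = counts.drop("", errors="ignore")

# Keep labels that occur more than min_count times (assumed threshold).
min_count = 1
kept = sorted(counts[counts > min_count].index)
print(kept)
```

For the label set in this dataset both approaches should give the same counts, since no label name is a substring of another.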


### Create training data

- Let's transform the labels into arrays of 0s and 1s so that the model can train on them. Because this is a multi-label classification problem, a single image can have more than one finding, so each row gets one binary value per label.

In [27]:
label_encoding = {label: idx for idx, label in enumerate(filtered_labels)}

def transform_label(label_string):
    # Convert a '|'-separated label string into a multi-hot array over filtered_labels
    labels = label_string.split('|')
    binary_array = np.zeros(len(filtered_labels), dtype=int)  # np.int is deprecated since NumPy 1.20
    for label in labels:
        if label in label_encoding:
            idx = label_encoding[label]
            binary_array[idx] = 1
    return binary_array

In [28]:
xray_df['Encoded Labels'] = xray_df['Finding Labels'].apply(transform_label)


In [29]:
xray_df['Finding Labels'].head(5)

0              Cardiomegaly
1    Cardiomegaly|Emphysema
2     Cardiomegaly|Effusion
3                          
4                    Hernia
Name: Finding Labels, dtype: object

In [30]:
xray_df['Encoded Labels']

0         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1         [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
2         [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
3         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
4         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
                           ...                   
112115    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0]
112116    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
112117    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
112118    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
112119    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Name: Encoded Labels, Length: 112104, dtype: object

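- After transforming the labels, each row has a 13-element binary array with one position per label in `filtered_labels`. Rows with no finding, or with only a filtered-out label such as Hernia, are encoded as all zeros.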
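For training, the per-row arrays usually need to be stacked into a single 2-D array of shape (samples, labels). A minimal sketch, assuming a column of equal-length NumPy arrays like `Encoded Labels`; the toy frame and the name `y` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only: three samples, three labels.
toy = pd.DataFrame({
    "Encoded Labels": [
        np.array([0, 1, 0]),
        np.array([1, 1, 0]),
        np.array([0, 0, 0]),
    ]
})

# Stack the per-row arrays into one (n_samples, n_labels) target matrix.
y = np.stack(toy["Encoded Labels"].to_numpy())
print(y.shape)  # (3, 3)

# Column sums give the number of positive samples per label.
print(y.sum(axis=0))
```

The column sums are a quick way to confirm that the encoded targets match the label counts computed earlier.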

### Build a CNN model from scratch.