2023 © Volintine Ander \\
This notebook is for the **AI 101: Your First Step into Machine Learning with Tensorflow** workshop at Universiti Teknologi PETRONAS.

# Classify images with CNN

In this activity, you will train a neural network to label chest x-rays into two categories:

* Normal (healthy)
* Pneumonia

The dataset was sourced from https://huggingface.co/datasets/trpakov/chest-xray-classification by Paul Mooney.

## Getting started

Install the package ```huggingface_hub``` using pip. \\
Import the libraries required:

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
import numpy as np
import PIL
import matplotlib.pyplot as plt

## Download the dataset

We will need the ```pathlib``` library to handle directories and ```hf_hub_download``` to download the dataset from Hugging Face🤗

In [None]:
import pathlib
from huggingface_hub import hf_hub_download
#hf_hub_download outputs a str
path_to_data = hf_hub_download(repo_id="trpakov/chest-xray-classification", local_dir = "/root/.keras/datasets/", filename="data/train.zip", repo_type="dataset")

Print out ```path_to_data```

In [None]:
#Insert code here

## Extract the zip file
You need to import the ```zipfile``` library. Once you have imported it, run the code below.

In [None]:
with zipfile.ZipFile(path_to_data, 'r') as training_archive:
  training_archive.extractall("/root/.keras/datasets/data/")

Find the number of images in the dataset. You will need to import the libraries ```fnmatch``` and ```os```. Once you have imported them, run the code below.

In [None]:


dir_path = r'/root/.keras/datasets/data/PNEUMONIA'
count_pneumonia = len(fnmatch.filter(os.listdir(dir_path), '*.jpg'))
print('Pneumonia images:', count_pneumonia)

dir_path = r'/root/.keras/datasets/data/NORMAL'
count_normal = len(fnmatch.filter(os.listdir(dir_path), '*.jpg'))
print('Normal images:', count_normal)


Write code that prints out the total number of images.

### Open an image from the dataset

Modify the code below to find the 1st image under ```PNEUMONIA```.

In [None]:
path_to_dataset = '/root/.keras/datasets/data/'
path_to_stuff = pathlib.Path(path_to_dataset)
pneumonia = sorted(path_to_stuff.glob('PNEUMONIA/*')) #A sorted list of paths to each x-ray image in the PNEUMONIA folder
PIL.Image.open(str(pneumonia[49]))

### Create a dataset

Define the batch size, x-ray height, and x-ray width. \\
The images will be resized to this height and width when they are loaded.

The batch size is the number of images passed through the network at one time before the model parameters are updated.

In one epoch, the model sees every training image once and is updated once per batch. The number of updates per epoch is therefore the number of training images divided by the batch size, rounded up.
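
The relationship between batch size and updates per epoch can be sketched with plain arithmetic. The image count below is a made-up number for illustration, not the size of this dataset.

```python
import math

# Hypothetical number of training images; substitute your own count.
num_training_images = 1000
batch_size = 30

# One update per batch; the final batch may hold fewer than batch_size images.
updates_per_epoch = math.ceil(num_training_images / batch_size)
last_batch_size = num_training_images - (updates_per_epoch - 1) * batch_size

print(updates_per_epoch)  # 34
print(last_batch_size)    # 10
```

A smaller batch size gives more updates per epoch, but each update is based on fewer images.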

In [None]:
batch_size = 30
xray_height = 150
xray_width = 150

It's good practice to hold out part of the data as a validation set when developing your model; an 80:20 training-validation split is a common choice. \\
For this activity, first train the model using a 70:30 training-validation split (`validation_split = 0.3`), and then run it again using an 80:20 split (`validation_split = 0.2`).

In [None]:
train_ds = tf.keras.utils.image_dataset_from_directory(
  path_to_dataset,
  validation_split = 0.3,
  subset = "training",
  seed = 100, #Seed for shuffling the files; use the same value for both subsets so they do not overlap
  image_size = (xray_height, xray_width),
  batch_size = batch_size)

In [None]:
valid_ds = tf.keras.utils.image_dataset_from_directory(
  path_to_dataset,
  validation_split = 0.3,
  subset = "validation",
  seed = 100, #Seed for shuffling the files; must match the value used for the training subset
  image_size = (xray_height, xray_width),
  batch_size = batch_size)
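
The roles of `validation_split` and `seed` can be illustrated without TensorFlow. The sketch below is a simplified model of the idea (shuffle the file list with a fixed seed, then reserve the tail for validation), not the Keras implementation. The file names, the `split` helper, and the count of 1000 are hypothetical.

```python
import random

# Hypothetical file list; the real dataset size will differ.
files = [f"xray_{i}.jpg" for i in range(1000)]

def split(files, validation_split, subset, seed):
    """Shuffle with a fixed seed, then reserve the tail for validation."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(validation_split * len(shuffled))
    return shuffled[:cut] if subset == "training" else shuffled[cut:]

train_files = split(files, 0.3, "training", seed=100)
valid_files = split(files, 0.3, "validation", seed=100)
print(len(train_files), len(valid_files))      # 700 300
print(set(train_files) & set(valid_files))     # set(): no overlap

# With a different seed the shuffle order changes, so images leak between subsets.
mismatched = split(files, 0.3, "validation", seed=7)
leaked = set(train_files) & set(mismatched)
print(len(leaked) > 0)
```

Because both calls shuffle in the same order, every image lands in exactly one subset. That is why the two dataset calls must share the same `seed`.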

You can find the class names in the `class_names` attribute on these datasets. These correspond to the directory names in alphabetical order.

In [None]:
categories = train_ds.class_names
print(categories)

## View the dataset

Show some of the images in the dataset with labels using ```matplotlib```.

In [None]:
plt.figure(figsize = (10, 10))
for images, labels in train_ds.take(1):
  for i in range(16):
    ax = plt.subplot(4, 4, i + 1)
    plt.imshow(images[i].numpy().astype("uint8"))
    plt.title(categories[labels[i]])
    plt.axis("off")

## Optimize dataset for training

The following code speeds up training by keeping the images in memory after they are first loaded from disk, instead of reading them from disk again in every epoch. This is called caching.

It also loads from the dataset for a future training step while the current training step executes. This is called prefetching.

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE) #Cache, shuffle, and prefetch the training data
valid_ds = valid_ds.cache().prefetch(buffer_size = AUTOTUNE) #Cache and prefetch the validation data

## Normalize RGB channel values

A color image contains three integers to describe the color of each pixel. They are the Red, Green, and Blue channels. Each channel integer can span from 0 to 255. Normalization rescales the integers to span from 0 to 1.

Add the code below as the first layer in the definition of ```model``` in the next section, followed by a comma.

```
layers.Rescaling(1./255, input_shape = (xray_height, xray_width, 3))
```
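
As a rough illustration of what this rescaling does to individual channel values, here is the same arithmetic in plain Python. The three sample values are chosen for illustration. The layer multiplies by `1./255`, which matches dividing by 255 up to floating-point rounding.

```python
# Three example channel values: darkest, mid-range, and brightest.
pixel_values = [0, 128, 255]

# Dividing by 255 maps the range 0..255 onto 0..1.
scaled = [value / 255 for value in pixel_values]

for original, new in zip(pixel_values, scaled):
    print(original, "->", round(new, 3))
```

Every value ends up between 0 and 1, and the ordering of brightness is preserved.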

## Specify layers

The model will use three convolution layers, each followed by a max pooling layer. A max pooling layer downsamples its input by keeping only the largest value in each 2x2 window, so the strongest responses are retained while the height and width are halved.
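
To make max pooling concrete, here is a minimal pure-Python sketch of 2x2 pooling on a small grid. The `max_pool_2x2` helper and the example values are illustrative only and are not part of Keras. The last lines apply the same halving (rounded down) to a 150-pixel side, assuming the default 2x2 pool size.

```python
def max_pool_2x2(grid):
    """Keep the largest value in each non-overlapping 2x2 window."""
    rows, cols = len(grid) // 2, len(grid[0]) // 2
    return [
        [
            max(grid[2 * r][2 * c], grid[2 * r][2 * c + 1],
                grid[2 * r + 1][2 * c], grid[2 * r + 1][2 * c + 1])
            for c in range(cols)
        ]
        for r in range(rows)
    ]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[4, 2], [2, 7]]

# Side length after each of the three pooling layers, starting from 150.
size = 150
sizes = []
for _ in range(3):
    size //= 2  # an odd row or column left over at the edge is dropped
    sizes.append(size)
print(sizes)  # [75, 37, 18]
```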

In [None]:
num_classes = len(categories)

model = Sequential([
  
  layers.Conv2D(16, 3, padding = 'same', activation = 'relu'), #3 input channels, 16 output channels. 3x3 convolution kernel.
  layers.MaxPooling2D(), #Downsample 150x150 -> 75x75
  layers.Conv2D(32, 3, padding = 'same', activation = 'relu'), #16 input channels, 32 output channels. 3x3 convolution kernel.
  layers.MaxPooling2D(), #Downsample, 75x75 -> 37x37
  layers.Conv2D(64, 3, padding = 'same', activation = 'relu'), #32 input channels, 64 output channels. 3x3 convolution kernel.
  layers.MaxPooling2D(), #Downsample, 37x37 -> 18x18
  layers.Flatten(), #1 dimensional tensor, 18*18*64 = 20736 values. No trainable parameters.
  layers.Dense(128, activation = 'relu'), #1 dimensional tensor, 128 outputs. 128*(20736+1) = 2654336 parameters
  layers.Dense(num_classes) #1 dimensional tensor, 2 outputs (one raw score for each category, NORMAL and PNEUMONIA).
])
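
The parameter counts in the comments above can be checked by hand. The sketch below assumes 150x150 inputs with 3 channels, 3x3 kernels, one bias per output, and 2 classes. The helper functions and dictionary keys are labels chosen here and may not match the layer names that Keras prints.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Each output channel has one kernel per input channel, plus one bias."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels

def dense_params(inputs, outputs):
    """Each output has one weight per input, plus one bias."""
    return (inputs + 1) * outputs

counts = {
    "conv_1": conv2d_params(3, 3, 16),
    "conv_2": conv2d_params(3, 16, 32),
    "conv_3": conv2d_params(3, 32, 64),
    "dense_1": dense_params(18 * 18 * 64, 128),
    "dense_2": dense_params(128, 2),
}
total = sum(counts.values())

for name, count in counts.items():
    print(name, count)
print("total", total)  # total 2678178
```

Almost all of the parameters sit in the first dense layer, because it connects every one of the 20736 flattened values to each of its 128 outputs.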

### Compile the model

Use the `tf.keras.optimizers.Adam` optimizer and the `tf.keras.losses.SparseCategoricalCrossentropy` loss function.

In [None]:
model.compile(optimizer = 'adam',
              loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics = ['accuracy'])
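
The final layer of the model outputs raw scores (logits) rather than probabilities, which is why the loss is created with `from_logits=True`. For a single image, the loss is the negative log of the softmax probability assigned to the correct class. The sketch below computes that value in plain Python for made-up logits. The helper name is hypothetical, and the Keras loss additionally averages over the batch.

```python
import math

def crossentropy_from_logits(logits, label):
    """Loss for one example: -log(softmax(logits)[label])."""
    highest = max(logits)  # subtracting the maximum keeps exp() from overflowing
    log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in logits))
    return log_sum_exp - logits[label]

logits = [2.0, 0.0]  # hypothetical scores for the two classes

loss_if_class_0 = crossentropy_from_logits(logits, 0)
loss_if_class_1 = crossentropy_from_logits(logits, 1)

print(round(loss_if_class_0, 4))  # 0.1269
print(round(loss_if_class_1, 4))  # 2.1269
```

The loss is small when the highest score belongs to the correct class and large when it does not, which is the signal the optimizer uses to adjust the weights.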

### Model summary

Use the `Model.summary` method to view all the layers, sizes, and associated parameters.

In [None]:
model.summary()

### Train the model

Run the code below to train the model:

In [None]:
epochs = 5
history = model.fit(
  train_ds,
  validation_data = valid_ds,
  epochs = epochs
)

## Analyze model performance

Create two plots:

1. Training and validation accuracy vs. epoch
2. Training and validation loss vs. epoch

Did the model perform well? Explain.

In [None]:
accuracy = history.history['accuracy']
valid_accuracy = history.history['val_accuracy']

loss = history.history['loss']
valid_loss = history.history['val_loss']

no_of_epochs = range(epochs)

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(no_of_epochs, accuracy, label='Training Accuracy')
plt.plot(no_of_epochs, valid_accuracy, label='Validation Accuracy')
plt.legend(loc='lower center')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(no_of_epochs, loss, label='Training Loss')
plt.plot(no_of_epochs, valid_loss, label='Validation Loss')
plt.legend(loc='upper center')
plt.title('Training and Validation Loss')
plt.show()

## Predict on new data

Find an x-ray that was neither in the training nor validation dataset. \\
See if the model can correctly distinguish between healthy and pneumonia lungs.

In [None]:
path_to_predict = "Replace with the path to your uploaded image"
predict = tf.keras.utils.load_img(
    path_to_predict, target_size=(xray_height, xray_width)
)

convert_to_array = tf.keras.utils.img_to_array(predict)
convert_to_array = tf.expand_dims(convert_to_array, 0)

prediction = model.predict(convert_to_array)
chance = tf.nn.softmax(prediction[0])

print("{:.3f}% likely to be {}".format(100 * np.max(chance), categories[np.argmax(chance)].lower()))
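
The last two steps above turn the model's raw scores into a percentage and a label. A minimal sketch of that conversion in plain Python follows. The logits are made-up values, and the category list assumes the alphabetical directory names `NORMAL` and `PNEUMONIA`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

categories = ["NORMAL", "PNEUMONIA"]  # assumed class order
logits = [-1.2, 2.3]                  # hypothetical model output for one image

chance = softmax(logits)
best = max(range(len(chance)), key=lambda i: chance[i])  # index of the largest probability

print("{:.3f}% likely to be {}".format(100 * chance[best], categories[best].lower()))
```

The class with the largest logit always has the largest probability, so softmax changes how the confidence is expressed but not which label is chosen.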