# Changelog and Notes

### Changelog
**Modifying file processing and creation (22/02/2024):**
* Made separate files for training_clean, training_pneu, testing_clean and testing_pneu in the BuildData process. This should make it easier to split data evenly when training the model (i.e. we can choose an even split between "clean" and "pneu" samples).

**Fixing data splitting and the running network (24/02/2024):**
* Split files evenly. Created dataset "testing_both" and "training_both" as required, as well as their children datasets. Fixed issues with unshuffled data.
* Fixed "train_X" being bigger than "train_y" by reading images with cv2.IMREAD_GRAYSCALE to reduce dimensionality.
* Ran neural network locally. Tweaked learning rate and other parameters to prevent neuron death (this was an issue).
* *Current results (saved in "pneumonia_model2")*:  92.7% on test_X.

**Changed path names for generalization (29/02/2024)**
* Modified implementation of "Building Datasets" (slightly) to include generalized path names.
* Included assert statements (very briefly) and some error-handling to avoid misuse.
* Made project ready for initial public release.

### For next time:
* Optimize the network by modifying parameters: learning rate, image compression size (IMG_SIZE), number of layers and neurons.
* Try implementing momentum and/or decaying learning rate. Maybe use SGD instead of Adam for optimization. How do things change?
* Run CNN on data that is UNMODIFIED (namely, go back to the Chest-X dataset and find the other datasets; see generalizations).

Note: try implementing all of this remotely; use cloud computing software to run these larger batches and epochs.

### Research Ideas:
1. Compare learning rate decay methods in optimizing results (eg. "None" vs. "ReduceLROnPlateau"). Which one is best here?
2. Does the IMG_SIZE affect results much? To what extent can this be reduced and results be preserved?
3. Visualizing intermediate layers as images (after all, these are convolutional layers, so they should be visualizable)
4. Creating a GAN to artificially create images of lungs. How realistic can these images be? (I'll have to make another network to differentiate between pictures of lungs and outliers, so this is another kind of deep learning application). Then run these through the existing network to see what it says about pneumonia/clean. 

# Implementation:

### Building Datasets:
**BEFORE RUNNING:**
Ensure the dataset found at the following link is installed, and paths are configured as specified in the next cell.



In [9]:
# Setting relevant constants.

# An absolute path to the current directory that "public_pneumonia.ipynb" is currently in: 
#    ~~ Modify before executing anything else in "public_pneumonia.ipynb"! ~~
ROOT_DIR = None

# An absolute path to the directory "Chest X-Rays.v3-augmented.folder":
#    ~~ Modify before executing anything else in "public_pneumonia.ipynb"! ~~
DATA_DIR = None

In [52]:
import torch
import os
import cv2
import numpy as np
from tqdm import tqdm
import pickle
import matplotlib.pyplot as plt
import sys


# Constants to modify based on data being built:
REBUILD_TRAIN_DATA = False
REBUILD_TEST_DATA = True


class BuildData():
    # Chosen image size for compression:
    IMG_SIZE = 100
    # Initializing datasets:
    build_clean = []
    build_pneu = []
    datasets = [build_clean, build_pneu]
    # Counting instances:
    cleancount = 0
    pneucount = 0
    
    def __init__(self, data_type):
        # Universal paths to each of the unprocessed datasets:
        self.data_type = data_type
        try:
            self.LUNGS_CLEAN = DATA_DIR + f"/{data_type}/NORMAL"
            self.LUNGS_PNEU = DATA_DIR + f"/{data_type}/PNEUMONIA"
        except TypeError:
            print("ERROR: Please update DATA_DIR in the above cell (TypeError at BuildData: __init__()).")
        # Enumerating values for each path in a dictionary:
        self.LABELS = {self.LUNGS_CLEAN: 0, self.LUNGS_PNEU: 1}
        

    # Reading, processing and collecting data into build_data:
    #    type:    "train"/"test"
    def make_data(self):
        # Iterating between CLEAN and PNEU directories, and saving respectively:
        for label in self.LABELS:
            print(label)
            for f in tqdm(os.listdir(label)):
                try:
                    path = os.path.join(label, f)  # creating full path to image
                    # Loading/reading the image:
                    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
                    img = cv2.resize(img, (self.IMG_SIZE, self.IMG_SIZE))  # image resize
                    
                    # Appending to appropriate dataset:
                    if label == self.LUNGS_CLEAN:   # label is clean
                        self.build_clean.append([np.array(img), np.eye(2)[self.LABELS[label]]])
                        self.cleancount += 1
                    elif label == self.LUNGS_PNEU:                               # label is pneu
                        self.build_pneu.append([np.array(img), np.eye(2)[self.LABELS[label]]])
                        self.pneucount += 1
    
                except Exception as e:
                    pass
                
        # Shuffling our data randomly:
        np.random.shuffle(self.build_clean)
        np.random.shuffle(self.build_pneu)

        # Saving to the appropriate files:
        filename_clean = ROOT_DIR + f"processed/{self.data_type}ing_clean.pkl"
        filename_pneu = ROOT_DIR + f"processed/{self.data_type}ing_pneu.pkl"
        
        with open(filename_clean, "wb") as f:
            pickle.dump(self.build_clean, f)
            print(f"Saved CLEAN scans '{self.data_type}' in {filename_clean}.")
        with open(filename_pneu, "wb") as f:
            pickle.dump(self.build_pneu, f)
            print(f"Saved PNEU scans '{self.data_type}' in {filename_pneu}.")

        print(f"Clean count: {self.cleancount}")
        print(f"Pneu count: {self.pneucount}")


if REBUILD_TRAIN_DATA:
    try:
        pneuvsclean = BuildData("train")
        pneuvsclean.make_data()
    except TypeError:
        print("ERROR: Please update ROOT_DIR in the above cell (TypeError at BuildData: make_data()).")

if REBUILD_TEST_DATA:
    try: 
        pneuvsclean = BuildData("test")
        pneuvsclean.make_data()
    except:
        print("ERROR: Please update ROOT_DIR in the above cell (TypeError at BuildData: make_data()).")



ERROR: Please update DATA_DIR in the above cell (TypeError at BuildData: __init__()).
ERROR: Please update ROOT_DIR in the above cell (TypeError at BuildData: make_data()).


In [6]:
assert ROOT_DIR != None, "ERROR: ROOT_DIR is NoneType."
assert DATA_DIR != None, "ERROR: DATA_DIR is NoneType."

# Loading all data from files:

training_clean = np.load(f"{ROOT_DIR}/processed/training_clean.pkl", allow_pickle=True)
training_pneu = np.load(f"{ROOT_DIR}/processed/training_pneu.pkl", allow_pickle=True)
testing_clean = np.load(f"{ROOT_DIR}/processed/testing_clean.pkl", allow_pickle=True)
testing_pneu = np.load(f"{ROOT_DIR}/processed/testing_pneu.pkl", allow_pickle=True)

# Displaying first element graphically using pyplot, for intuition:
plt.imshow(training_clean[3][0])
plt.show()

# [1, 0]:  Clean
# [0, 1]:  Pneu
print(training_clean[1][1])


# Displaying first element graphically using pyplot, for intuition:
plt.imshow(training_pneu[3][0])
plt.show()

# [1, 0]:  Clean
# [0, 1]:  Pneu
print(training_pneu[1][1])

AssertionError: ERROR: ROOT_DIR is NoneType.

### Splitting Datasets:

In [7]:
assert ROOT_DIR != None, "ERROR: ROOT_DIR is NoneType."
assert DATA_DIR != None, "ERROR: DATA_DIR is NoneType."

################################################################################
# The loading datasets code is run again:
# Loading all data from files:
training_clean = np.load(f"{ROOT_DIR}/processed/training_clean.pkl", allow_pickle=True)
training_pneu = np.load(f"{ROOT_DIR}/processed/training_pneu.pkl", allow_pickle=True)
testing_clean = np.load(f"{ROOT_DIR}/processed/testing_clean.pkl", allow_pickle=True)
testing_pneu = np.load(f"{ROOT_DIR}/processed/testing_pneu.pkl", allow_pickle=True)

################################################################################


# How many samples of each (CLEAN and PNEU) is desired?
TRAIN_SPLIT_INPUT = 500
# Capping train_split by its limiting factors:
TRAIN_SPLIT = min(TRAIN_SPLIT_INPUT, min(len(training_clean), len(training_pneu)))

# Shuffling existing datasets to ensure randomness:
np.random.shuffle(training_clean)
np.random.shuffle(training_pneu)

np.random.shuffle(testing_clean)
np.random.shuffle(testing_pneu)

# Concatenating lists and shuffling result:
training_both = training_clean[:TRAIN_SPLIT] + training_pneu[:TRAIN_SPLIT]
testing_both = testing_clean + testing_pneu

np.random.shuffle(training_both)
np.random.shuffle(testing_both)

# Conversion to a Tensor:
#    (scaled to avoid overflow)
train_X = torch.Tensor(np.array([np.array(i[0]) for i in training_both])).view(-1, 100, 100) / 255
train_y = torch.Tensor(np.array([np.array(i[1]) for i in training_both]))

test_X = torch.Tensor(np.array([np.array(i[0]) for i in testing_both])).view(-1, 100, 100) / 255
test_y = torch.Tensor(np.array([np.array(i[1]) for i in testing_both]))

# Checking the lengths are equal:
print(len(train_X))
print(len(train_y))

print(len(test_X))
print(len(test_y))

AssertionError: ERROR: ROOT_DIR is NoneType.

### Creating Neural Network:

In [343]:
import torch.nn as nn
import torch.nn.functional as F

CONV_LAYERS = 3
LINEAR_LAYERS = 2

FLATTEN_SHAPE = 12800

# Defining our Neural Network class:
#    inherits from nn.Module.
class NeuralNet(nn.Module):
    # Initialization of a network:
    def __init__(self):
        super().__init__()
        # Convolutional layers:
        #    syntax:  nn.Conv2d(inputs, outputs, kernel-size)  for a 2D convnet.
        self.conv1 = nn.Conv2d(1, 32, 5)
        self.conv2 = nn.Conv2d(32, 64, 5)
        self.conv3 = nn.Conv2d(64, 128, 3)
        # Defining our flattening function for "x":
        self.flat = nn.Flatten()

        self.fc1 = nn.Linear(FLATTEN_SHAPE, 256)
        self.fc2 = nn.Linear(256, 2)

    # Forward propagation:
    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv3(x)), (2, 2))
        # Flattening to pass through linear layers:
        x = x.view(-1, FLATTEN_SHAPE)
        
        x = F.relu(self.fc1(x))
        x = F.softmax(self.fc2(x), dim=1)
        return x

net = NeuralNet()
x = torch.randn(100, 100).view(-1, 1, 100, 100)
net.forward(x)

tensor([[0.4645, 0.5355]], grad_fn=<SoftmaxBackward0>)

#### Standard Adam Optimization:

In [345]:
import torch.optim as optim

# Using the Adam optimizer:
optimizer = optim.Adam(net.parameters(), lr=0.0001)

# Defining loss function MSE:
loss_function = nn.MSELoss()


# Running the neural network:
# Choosing batch size:
BATCH_SIZE = 64
BATCHES = len(train_X) // BATCH_SIZE
EPOCHS = 3

for epoch in range(EPOCHS):
    for i in tqdm(range(0, len(train_X), BATCHES)):
        batch_X = train_X[i : i + BATCHES].view(-1, 1, 100, 100)
        batch_y = train_y[i : i + BATCHES]

        # Zeroing gradient:
        optimizer.zero_grad()

        # Calculating forward propagation:
        outputs = net(batch_X)
        loss = loss_function(outputs, batch_y)

        # Backpropagation:
        loss.backward()
        optimizer.step()
        
    print(loss)



100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:04<00:00, 13.63it/s]


tensor(0.0233, grad_fn=<MseLossBackward0>)


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:04<00:00, 14.40it/s]


tensor(0.0084, grad_fn=<MseLossBackward0>)


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:04<00:00, 14.82it/s]

tensor(0.0045, grad_fn=<MseLossBackward0>)





### Testing:

In [346]:
import pandas as pd

# Testing accuracy:
correct = 0
total = 0
TEST_RANGE = 400

with torch.no_grad():
    for i in tqdm(range(TEST_RANGE)):
        real_class = torch.argmax(test_y[i])
        net_out = net(test_X[i].view(-1, 1, 100, 100))[0]
        predicted_class = torch.argmax(net_out)
        if predicted_class == real_class:
            correct += 1
            # print(f"Correct! Predicted: {predicted_class}. Actual: {real_class}")
        # else:
            # print(f"WRONG! Predicted: {predicted_class}. Actual: {real_class}")
        total += 1
    print(f"Accuracy: {round(correct/total, 3)}")
    print(f"Total: {total}")

# plt.imshow(test_X[0])
# plt.show()

# [1, 0]:  Clean
# [0, 1]:  Pneu
# print(test_y[0])

# Save code for models, if wanted.
# torch.save(net.state_dict(), f"{ROOT_DIR}saved_models/current_model")

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [00:01<00:00, 356.84it/s]

Accuracy: 0.9
Total: 400





#### Accuracies:

* 88.00:  BATCH_SIZE=32, EPOCHS=5

* 90.25:  BATCH_SIZE=32, EPOCHS=5
* 92.70:  BATCH_SIZE=64, EPOCHS=3  (rerun)
* 92.50:  BATCH_SIZE=32, EPOCHS=5  (rerun)

