![logo](https://github.com/donatellacea/DL_tutorials/blob/main/notebooks/figures/1128-191-max.png?raw=true)

# Classification with Deep Learning 
---

The term **Machine Learning** (ML) refers to a set of algorithms and methods that give a machine the ability to learn automatically and improve from experience without being explicitly programmed.
When the data are labeled, we can use the labels to guide the learning process; this is called **Supervised learning**. When the data are not labeled, there is no guide or supervision, and this is called **Unsupervised learning**.
Within Supervised learning there are two kinds of problems:
 - **Regression problem**: the task of predicting a continuous quantity,  
 - **Classification problem**: the task of predicting a label or a class (discrete values).

This tutorial shows how to classify images with a deep Neural Network (NN). We will work with two public datasets and cover both a binary classification problem and a multi-class classification problem.

To start working on the notebook, click the following button. It opens this page in the Colab environment, where you can execute the code on your own.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/donatellacea/DL_tutorials/blob/main/notebooks/DL_Classification_tutorial.ipynb)


## Set up the environment

If you already completed this step for the TensorFlow Playground tutorial, you can skip the setup section and start with the Import and install section. Otherwise, complete the next steps before starting the tutorial.

Now that you are viewing the notebook in Colab, run the next cell to mount your Google Drive, which makes it available as a folder in Colab. All the files for this tutorial will be saved in this folder. After the first execution you might receive some warnings and notifications; please follow these instructions:
1. Warning: This notebook was not authored by Google. Click on 'Run anyway'.
2. Permit this notebook to access your Google Drive files? Click on 'Yes', and select your account.
3. Google Drive for desktop wants to access your Google Account. Click on 'Allow'.

At this point your Drive is available as a folder, and you can browse it through the left-hand panel in Colab. You might also have received an email informing you about the access to your Google Drive.

In [None]:
# Mount your Google Drive under /content/drive
from google.colab import drive
drive.mount('/content/drive')

Execute the next cells to clone the repository from GitHub. The files and notebooks needed for this tutorial will be downloaded to your working folder on the Drive that you set up in the previous step.

In [None]:
%cd drive/MyDrive

In [None]:
!git clone https://github.com/donatellacea/DL_tutorials

In [None]:
%cd DL_tutorials

### Import and install

In [None]:
!pip install alive_progress

In [None]:
# Run this cell to import the main packages we will use
import pandas as pd
import numpy as np
import os
import shutil
import glob
import sklearn
import random
random.seed(1)
import matplotlib.pyplot as plt 
import PIL
import plotly.graph_objects as go
from skimage import io 
from alive_progress import alive_bar

import torch
torch.manual_seed(0)
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms

## Binary Classification
---

In this problem, we will use the Lung CT scans dataset to predict whether a patient has Covid-19 or not. Since the output can only be positive or negative, this is a classic example of **binary classification**. 

### Dataset 
The dataset, available on Kaggle (https://www.kaggle.com/datasets/luisblanche/covidct), will be downloaded to the Google Drive folder set up in the first step of the tutorial.

It contains a total of 746 images, divided as follows:
- 397 No Covid
- 349 Covid
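
The two classes are close to balanced. A quick way to see what this means for evaluation is to compute the accuracy of a trivial model that always predicts the larger class. The sketch below does this arithmetic from the counts listed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the dataset description above
class_counts = {"No Covid": 397, "Covid": 349}

total = sum(class_counts.values())
# Fraction of the dataset that belongs to each class
class_fractions = {name: count / total for name, count in class_counts.items()}
# Accuracy of a trivial model that always predicts the most frequent class
majority_baseline = max(class_fractions.values())

print(f"Total images: {total}")
for name, fraction in class_fractions.items():
    print(f"{name}: {fraction:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any trained classifier should beat this baseline (about 53%) to be considered useful.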

The images are CT scans, obtained through Computed Tomography, an X-ray-based medical imaging technique used in radiology to obtain detailed internal images of the body noninvasively for diagnostic purposes. Interpreting the scans requires proper training, so without a radiology or medical background it is hard to recognize signs of Covid-19 in a scan. We will see, however, that a well-trained NN can help technicians and doctors diagnose this kind of disease.

Run the next cells to download the data. You should see a folder that contains two subfolders, one for each class: Covid and No Covid.

In [None]:
# Define path
main_path = '/content/drive/MyDrive/DL_tutorials/notebooks/'

In [None]:
!curl -L https://www.dropbox.com/s/ynxtbh7t0mts30k/Dataset_CT_lungs.zip?dl=1 > /content/drive/MyDrive/DL_tutorials/notebooks/Dataset_CT_lungs.zip

In [None]:
shutil.unpack_archive(main_path + 'Dataset_CT_lungs.zip', main_path)
# Remove the macOS metadata folder, if the archive created one
shutil.rmtree(main_path + '__MACOSX', ignore_errors=True)

In [None]:
# Create the path to each folder
data_path = os.path.join(main_path, 'Dataset_CT_lungs')
pos_files = glob.glob(os.path.join(data_path, "CT_COVID",'*.*'))
neg_files = glob.glob(os.path.join(data_path, 'CT_NonCOVID','*.*'))
images = pos_files + neg_files
num_total = len(images)
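
The lists above contain only file paths, while a classifier also needs a numeric label for each image. A minimal sketch of one way to derive it from the folder name follows. It assumes the convention 1 = Covid and 0 = No Covid; that convention, the helper `label_from_path`, and the example paths are hypothetical and not taken from the original notebook.

```python
import os

def label_from_path(path):
    """Return 1 for a scan stored in CT_COVID, 0 for one stored in CT_NonCOVID."""
    # The class is encoded in the name of the folder that contains the file
    folder = os.path.basename(os.path.dirname(path))
    if folder == "CT_COVID":
        return 1
    if folder == "CT_NonCOVID":
        return 0
    raise ValueError(f"Unexpected folder name: {folder}")

# Hypothetical example paths, used only to illustrate the helper
example_paths = [
    "Dataset_CT_lungs/CT_COVID/scan_a.png",
    "Dataset_CT_lungs/CT_NonCOVID/scan_b.png",
]
example_labels = [label_from_path(p) for p in example_paths]
print(example_labels)
```

Matching the folder name exactly, rather than searching for a substring anywhere in the path, avoids wrong labels if another part of the path happens to contain the same text.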

In [None]:
# Plot 9 random CT scans from the dataset to see what they look like
# random.seed(7)
plt.subplots(3, 3, figsize=(8, 8)) 
num_fig = 9
ax_name = ['No Covid'] * num_fig
for i, number in enumerate(random.sample(range(num_total), num_fig)):
    im = PIL.Image.open(images[number])
    arr = np.array(im)
    plt.subplot(3, 3, i + 1)
    if 'CT_COVID' in images[number]:        
        ax_name[i] = 'Covid'
    plt.xlabel(ax_name[i], fontsize=15)
    plt.imshow(arr, cmap="gray", vmin=0, vmax=255)
plt.tight_layout()
plt.show()

## Multi-class Classification
---

In this problem, we will use the MedNIST dataset to predict which of six possible classes an image belongs to. Since the output can take more than two values, this is a classic example of **multi-class classification**. 
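
In a multi-class setting, a network typically produces one score per class, and the predicted class is the one with the highest score. The sketch below illustrates this step in plain Python with made-up scores for six classes; the `softmax` helper and the numbers are assumptions for illustration only.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtract the maximum score for numerical stability
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical network outputs for one image and six classes
logits = [0.5, 2.0, -1.0, 0.1, 0.0, 1.2]
probabilities = softmax(logits)
# The predicted class is the index with the highest probability
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
print(predicted_class)
```

Because softmax preserves the ordering of the scores, taking the largest raw score gives the same predicted class as taking the largest probability.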

### Dataset 
The MedNIST dataset was gathered from several sets from TCIA, the RSNA Bone Age Challenge, and the NIH Chest X-ray dataset.

The dataset is kindly made available by Dr. Bradley J. Erickson M.D., Ph.D. (Department of Radiology, Mayo Clinic) under the Creative Commons CC BY-SA 4.0 license. If you use the MedNIST dataset, please acknowledge the source, e.g.

https://github.com/Project-MONAI/MONAI/blob/master/examples/notebooks/mednist_tutorial.ipynb.

The following commands download and unzip the dataset (~60MB).

In [None]:
!curl -L https://www.dropbox.com/s/1c5em1n5suasf1c/MedNIST.zip?dl=1 > /content/drive/MyDrive/DL_tutorials/notebooks/MedNIST.zip

In [None]:
shutil.unpack_archive(main_path + 'MedNIST.zip', main_path)
# Remove the macOS metadata folder, if the archive created one
shutil.rmtree(main_path + '__MACOSX', ignore_errors=True)

#### Define the structure of the convolutional Neural Network (CNN)

In [None]:
class CNN(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        # Two convolution + max-pooling stages extract image features
        self.conv1 = nn.Conv2d(in_channels=in_channels, out_channels=8, kernel_size=(3,3))
        self.pool1 = nn.MaxPool2d(kernel_size=(2,2), stride=(2,2))
        self.conv2 = nn.Conv2d(in_channels=8, out_channels=16, kernel_size=(3,3))
        self.pool2 = nn.MaxPool2d(kernel_size=(2,2), stride=(2,2))
        self.flatten = nn.Flatten()
        self.dropout = nn.Dropout(0.2)
        # 3136 = 16 channels * 14 * 14, the flattened size for 64x64 input images
        self.lin1 = nn.Linear(3136, 64)
        self.lin2 = nn.Linear(64, num_classes)
        
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        x = self.flatten(x)
        x = self.dropout(x)  # randomly zero 20% of the features during training
        x = F.relu(self.lin1(x))
        x = self.lin2(x)  # raw class scores (logits), no softmax applied here
        
        return x
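
The input size of the first linear layer (3136) has to match the number of values left after the two convolution and pooling stages. The sketch below reproduces that arithmetic, assuming 64x64 pixel input images, no padding, and stride 1 for the convolutions (the PyTorch defaults). The helper `conv_output_size` is hypothetical and only illustrates the standard output-size formula.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (uses floor division)."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 64  # assumed input height and width in pixels
size = conv_output_size(size, kernel_size=3)            # conv1: 64 -> 62
size = conv_output_size(size, kernel_size=2, stride=2)  # pool1: 62 -> 31
size = conv_output_size(size, kernel_size=3)            # conv2: 31 -> 29
size = conv_output_size(size, kernel_size=2, stride=2)  # pool2: 29 -> 14

out_channels = 16  # number of channels produced by conv2
flattened_features = out_channels * size * size
print(flattened_features)
```

If the input resolution or any kernel size changes, this number changes too, and the first argument of `nn.Linear` has to be updated accordingly.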

#### Set up the parameters

In [None]:
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyperparameters
in_channels = 1
num_classes = 6
lr = 0.0001  # learning rate
batch_size = 64
num_epochs = 10
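
To get a feeling for how `batch_size` and `num_epochs` interact, the following sketch counts the number of weight updates for a training set of a given size. The training-set size is a made-up value, and the calculation assumes that the last, possibly smaller, batch of each epoch is kept.

```python
import math

batch_size = 64
num_epochs = 10
num_train_images = 1000  # hypothetical training-set size

# The last batch of an epoch may contain fewer than batch_size images
batches_per_epoch = math.ceil(num_train_images / batch_size)
total_updates = batches_per_epoch * num_epochs
print(batches_per_epoch, total_updates)
```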

#### Load the data

In [None]:
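class MedicalMNIST(Dataset):
    """Dataset that reads the images listed in a dataframe (column 0 = relative path, column 1 = label)."""
    def __init__(self, df, root_dir, transform=None):
        self.annotations = df
        self.root_dir = root_dir
        self.transform = transform
    def __len__(self):
        # Number of samples in the dataset
        return len(self.annotations)
    def __getitem__(self, index):
        # Load the image and the label stored in the given row
        img_path = os.path.join(self.root_dir, self.annotations.iloc[index, 0])
        image = io.imread(img_path)
        y_label = torch.tensor(int(self.annotations.iloc[index, 1]))
        
        if self.transform:
            image = self.transform(image)
        
        # Return the image, its label, and the relative file path
        return (image, y_label, self.annotations.iloc[index, 0])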

#### Create the dataset and train-test split

In [None]:
mp = {}
i = 0
for category in os.listdir(main_path + "MedNIST"):
    if category != 'README.md' and not category.startswith("."):
        mp[category] = i
        i += 1
print(mp)
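
`os.listdir` returns entries in arbitrary order, so the mapping printed above can differ between machines. One possible variant, sketched here on a hard-coded list instead of the real folder, sorts the names first. The list of entries is an assumption about the folder contents, and `build_class_mapping` is a hypothetical helper. If such a mapping is used, the same ordering has to be applied wherever class indices are assigned.

```python
def build_class_mapping(entries):
    """Map each class folder name to an integer index, in sorted order."""
    # Skip the README and hidden entries, as in the cell above
    class_names = sorted(
        name for name in entries
        if name != "README.md" and not name.startswith(".")
    )
    return {name: index for index, name in enumerate(class_names)}

# Assumed folder contents, listed in an arbitrary order
entries = ["Hand", "README.md", "CXR", ".DS_Store", "HeadCT",
           "AbdomenCT", "ChestCT", "BreastMRI"]
class_mapping = build_class_mapping(entries)
print(class_mapping)
```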

In [None]:
def create_data(base_path, name_folder, percentage_to_treat=None):
    """Build a dataframe of file paths and ground-truth labels for PyTorch.

    base_path: path that contains the MedNIST folder
    name_folder: name of the folder of the new dataset (currently unused)
    percentage_to_treat: list with the fraction of files (between 0 and 1)
        to sample from each class folder; defaults to all files

    The returned dataframe has three unnamed columns:
    0 = relative file path, 1 = class index, 2 = class name.
    """
    data_path = base_path + 'MedNIST/'
    # Class folders, skipping the README and hidden entries
    list_of_dirs = [name for name in os.listdir(data_path)
                    if name != 'README.md' and not name.startswith(".")]
    if percentage_to_treat is None:
        percentage_to_treat = [1.] * len(list_of_dirs)

    df_new = pd.DataFrame()

    for i, name in enumerate(list_of_dirs):
        # List the folder once instead of once per sampled file
        files = os.listdir(data_path + name)
        number_of_files = len(files)
        number_of_files_treat = int(percentage_to_treat[i] * number_of_files)

        sampled_files = []
        with alive_bar(number_of_files_treat, title=name, force_tty=True, bar='classic', spinner='dots_waves') as bar:
            for number in random.sample(range(number_of_files), number_of_files_treat):
                sampled_files.append([name + '/' + files[number], i, name])
                bar()

        df_new = pd.concat([df_new, pd.DataFrame(sampled_files)])

    return df_new

In [None]:
folder_name = 'prova'
# Sample 1% of the files from the first three class folders and 5% from the last three
df = create_data(main_path, folder_name, percentage_to_treat=[0.01, 0.01, 0.01, 0.05, 0.05, 0.05])
df

In [None]:
num_total = len(df)
num_total

In [None]:
df[2].value_counts()
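
Once the samples are collected, they can be divided into a training set and a test set. The following is a minimal sketch of a stratified split, in which every class is split with the same ratio, written in plain Python. The helper `stratified_split`, the 80/20 ratio, and the toy records are assumptions for illustration, not the method used in the original notebook.

```python
import random

def stratified_split(records, test_fraction=0.2, seed=0):
    """Split (filename, label) records so that each label keeps the same test ratio."""
    rng = random.Random(seed)
    # Group the records by label
    by_label = {}
    for record in records:
        by_label.setdefault(record[1], []).append(record)

    train, test = [], []
    for label_records in by_label.values():
        shuffled = label_records[:]
        rng.shuffle(shuffled)
        n_test = int(round(test_fraction * len(shuffled)))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test

# Toy records: 10 files for each of 3 hypothetical labels
records = [(f"class{label}/img{i}.jpeg", label) for label in range(3) for i in range(10)]
train_records, test_records = stratified_split(records)
print(len(train_records), len(test_records))
```

Splitting each class separately keeps the class proportions of the test set close to those of the full dataset, which matters when the classes are sampled with different fractions.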

#### Have a look at the data

In [None]:
plt.subplots(4, 4, figsize=(8, 8))
random.seed(1) # change this number to create other random images
for i, k in enumerate(random.sample(range(len(df)), 16)):
    im = PIL.Image.open(main_path + "MedNIST/" + df[0].iloc[k])
    # im = PIL.Image.open(df[0].iloc[k])

    arr = np.array(im)
    plt.subplot(4, 4, i + 1)
    plt.xlabel(df[2].iloc[k])
    plt.imshow(arr, cmap="gray", vmin=0, vmax=255)
plt.tight_layout()
plt.show()