# Practical machine learning and deep learning. Lab 2
# Tools and Processes for Machine Learning and Data Analysis

## [Competition](https://www.kaggle.com/t/936f7286a02b4161b70e9037553ef01d)

### Goal

In Lab 2, you have two tasks: 1) recap the training flow of a neural network and 2) use a tracking tool to control this flow.

### Submission
You are asked to implement a neural network that classifies whether a photo shows a person with blond hair, and to generate `submission.csv` for the test set.

## Frameworks we're using in this lab

#### PyTorch
   PyTorch is an open-source machine learning library primarily developed by Meta's AI Research lab (Meta is designated as an extremist organization in the Russian Federation). It is widely used for deep learning tasks.

#### TensorBoard
   TensorBoard is a visualization tool provided by TensorFlow for monitoring and visualizing the training process and model performance during machine learning experiments.


#### ClearML
   ClearML is an open-source machine learning platform designed to automate and streamline the end-to-end machine learning workflow, including data management, model training, and deployment.

In [2]:
!pip install tensorboard



In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
!pip install torch 



In [5]:
!pip install matplotlib



In [6]:
!pip install torchvision



## 1. Data set

The dataset, a part of [CelebA](https://www.kaggle.com/datasets/jessicali9530/celeba-dataset/data), contains about 11,000 cropped images of faces. Your task is to detect people with blond hair, that is, to do binary classification (blond / not blond) for each image.

There are a couple of small challenges you're going to face with the dataset. First, the images are much bigger than in typical teaching datasets: 178 by 218 pixels in RGB. Second, the CelebA dataset is quite well labelled, and each image has lots of metadata about facial features. These attributes are usually used to train generative or face recognition models. We intentionally left all of them in, although most are redundant for this task.
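Image size matters here because the preprocessing cell below stacks every image into a single tensor. A rough back-of-the-envelope estimate, assuming float32 values, a resize to 224x224 as in that cell, and the approximate count of 11,000 images quoted above (not an exact figure):

```python
# Rough memory estimate for holding every image in one float32 tensor.
# n_images is the approximate figure from the dataset description, not an exact count.
n_images = 11_000
channels, height, width = 3, 224, 224   # size after the Resize step used in this lab
bytes_per_value = 4                     # float32

bytes_per_image = channels * height * width * bytes_per_value
total_gib = n_images * bytes_per_image / 2**30

print(f"{bytes_per_image} bytes per image, about {total_gib:.1f} GiB in total")
```

If this does not fit into memory on your machine, a smaller target size or loading images lazily per batch are possible ways out.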

Please do not train a classifier on other data than given (including other parts of CelebA).


## 1.1 Data preprocessing

In [7]:
# necessary imports
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from tqdm.notebook import tqdm

from torch.utils.tensorboard import SummaryWriter

import torchvision.transforms as transforms
from torchvision.io import read_image

In [8]:
def load_img(fname):
    """
    Load an image from file, apply transformations (including random augmentation)
    and return it as a float torch.Tensor of shape (C, 224, 224)

    :param fname: path to jpg image
    """
    # read_image returns a uint8 tensor of shape (C, H, W); scale it to [0, 1].
    # The result is already float32, so no further dtype conversion is needed.
    img = read_image(fname)
    x = img / 255.

    # Write your code here
    # Note: the random augmentations are sampled once per call, so every image
    # loaded through this function (test images included) gets one fixed augmentation.
    transform = transforms.Compose(
        [
            transforms.Resize((224, 224)),                # Resize image to 224x224 (or the size of your model input)
            transforms.RandomHorizontalFlip(p=0.5),       # Random horizontal flip with 50% probability
            # Keep color augmentations mild, since the target is hair color
            transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
            transforms.RandomRotation(degrees=15),        # Randomly rotate the image by up to 15 degrees
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ]
    )

    return transform(x)
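The `Normalize` step computes `(value - mean) / std` per channel, which is why the pixel values have to be scaled to [0, 1] first. The sketch below reproduces that arithmetic for a single pixel in plain Python, using the mean and std values from `load_img` (these are the widely used ImageNet statistics); `normalize_pixel` is a hypothetical helper for illustration only.

```python
# Per-channel normalization: (value - mean) / std.
# MEAN and STD are the values passed to Normalize in load_img.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))

white = normalize_pixel((1.0, 1.0, 1.0))   # brightest possible pixel
black = normalize_pixel((0.0, 0.0, 0.0))   # darkest possible pixel
print(white, black)
```

After normalization the values are roughly within [-2.1, 2.6], centered near zero.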

In [9]:
img_path = "C:/Users/pv010/OneDrive/Рабочий стол/Homework/PMDL/Assignment 1/DeployedModel/code/datasets/archive"
# img_path = "C:/Users/pv010/OneDrive/Рабочий стол/Homework/PMDL/Drafts/hw2/archive"
# Image attributes
train_features = pd.read_csv(f"{img_path}/train.csv")

# Load and transform images 
images = torch.stack([load_img(f"{img_path}/img_align_celeba/train/{item['image_id']}") for _, item in  train_features.iterrows()])

# Write your code here
# Select label(s) from train_features
# Attributes are encoded as 1 / -1; map -1 to 0 so that labels are 0 or 1,
# then convert to float for simplicity
labels = train_features['Blond_Hair'].replace(-1, 0)
labels = torch.from_numpy(labels.to_numpy()).float()


## 1.2 Visualization

In [10]:
import matplotlib.pyplot as plt


def plot_images(images, captions=(), rows=2, columns=5, title="", **kwargs):
    """
    Plots images with captions

    :param images: list of images to plot (each in a layout plt.imshow accepts, e.g. H x W x C)
    :param captions: captions of images
    :param rows: number of rows in figure
    :param columns: number of columns in figure
    :param title: super title of figure
    :param kwargs: extra keyword arguments passed to plt.imshow
    """
    fig = plt.figure(figsize=(6, 3))
    for i, img in enumerate(images):
        fig.add_subplot(rows, columns, i + 1)
        plt.imshow(img, **kwargs)
        if i < len(captions):
            plt.title(captions[i])
        plt.axis("off")
    fig.suptitle(title)
    plt.show()

## 1.3 Data loaders creation

In [11]:
from torch.utils.data import TensorDataset, DataLoader

processed_dataset = TensorDataset(images, labels)

# Write your code here
# Set proportion and split dataset into train and validation parts
proportion = 0.9
train_size = int(len(processed_dataset) * proportion)
val_size = len(processed_dataset) - train_size

train_dataset, val_dataset = torch.utils.data.random_split(
    processed_dataset,
    [train_size, val_size],
)


In [12]:
# Create DataLoaders for training and validation
# A DataLoader is an iterable over a dataset that yields batches
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
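To sanity-check the progress bars later on, it helps to know how many samples and batches each loader should produce. A minimal sketch of that arithmetic, assuming the 0.9 split and batch size 64 from the cells above; `split_and_batch_counts` is a hypothetical helper, and 11,000 is only the approximate dataset size quoted earlier.

```python
import math

def split_and_batch_counts(n_samples, proportion, batch_size):
    """Return (train_size, val_size, train_batches, val_batches)."""
    train_size = int(n_samples * proportion)
    val_size = n_samples - train_size          # remainder goes to validation
    # DataLoader keeps the last partial batch by default, hence the ceiling.
    return (
        train_size,
        val_size,
        math.ceil(train_size / batch_size),
        math.ceil(val_size / batch_size),
    )

# 11_000 is an approximate dataset size, used for illustration only.
sizes = split_and_batch_counts(11_000, 0.9, 64)
print(sizes)
```

Computing the validation size as the remainder guarantees that the two parts always add up to the full dataset, which `random_split` requires when given absolute lengths.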


## 2. Training


## 2.1 Defining a model

We will implement a CNN.

> If you want a higher score, implement any suitable model you know and like.

Check the PyTorch [documentation](https://pytorch.org/docs/stable/nn.html) for layers, loss functions, etc.

In [13]:
import torch
import torch.nn as nn

class CNNClassificationModel(nn.Module):
    """
    CNN-based classification model for blond / not blond face images
    """

    def __init__(self, num_classes=2):
        super().__init__()

        # Two convolutional blocks followed by a fully connected head.
        # The input is expected to be a batch of 3x224x224 images:
        # block1 reduces 224 -> 111 (3x3 kernel, stride 2),
        # block2 reduces 111 -> 109 (3x3 kernel, stride 1),
        # so the flattened feature vector has 8 * 109 * 109 = 95048 values.
        # The last layer returns a vector of size num_classes.

        self.block1 = nn.Sequential(
            nn.BatchNorm2d(3),
            nn.Conv2d(3, 4, kernel_size=(3, 3), stride=2),
            nn.ReLU(),
            nn.BatchNorm2d(4),
        )

        self.block2 = nn.Sequential(
            nn.Conv2d(4, 8, kernel_size=(3, 3)),
            nn.ReLU(),
            nn.BatchNorm2d(8),
        )

        self.out = nn.Sequential(
            nn.Linear(95048, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 16),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(16, num_classes),
        )

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = x.view(x.size(0), -1)

        x = self.out(x)

        return x
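The input size of the first `nn.Linear` layer (95048) depends on the image size and on every convolution before it. The sketch below derives it with the standard output-size formula for a convolution without dilation, assuming 224x224 inputs; `conv_out` is a hypothetical helper, not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

side = 224                                  # input is resized to 224x224
side = conv_out(side, kernel=3, stride=2)   # block1: 224 -> 111
side = conv_out(side, kernel=3, stride=1)   # block2: 111 -> 109
flattened = 8 * side * side                 # block2 has 8 output channels
print(side, flattened)                      # 109 95048
```

If you change the resize target or add layers, rerun this calculation and update the first `nn.Linear` accordingly.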



## 2.2 Defining training & validation loops

Here is a sample function for the training procedure.
We save a checkpoint whenever the validation accuracy improves. For inference, you need to load the best checkpoint into the model.

> You can add early stopping for better results.
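One possible way to add early stopping is a small helper that counts epochs without improvement. This is a sketch under assumptions, not part of the lab template: the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the accuracy values are all made up for illustration.

```python
class EarlyStopping:
    """Signal a stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's score (higher is better); return True to stop."""
        if self.best is None or score > self.best + self.min_delta:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation accuracies, for illustration only.
stopper = EarlyStopping(patience=2)
for epoch, acc in enumerate([0.80, 0.85, 0.84, 0.85, 0.90]):
    if stopper.step(acc):
        print(f"stopping after epoch {epoch}")
        break
```

In a training function, `step` would be called once per epoch with the validation accuracy, breaking out of the epoch loop when it returns `True`.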

In [33]:
def train(
    model,
    optimizer,
    loss_fn,
    train_loader,
    val_loader,
    writer=None,
    epochs=1,
    device="cpu",
    ckpt_path="../models/best.pt",
):
    # best score for checkpointing
    best = 0.0
    
    # iterating over epochs
    for epoch in range(epochs):
        # training loop description
        train_loop = tqdm(
            enumerate(train_loader, 0), total=len(train_loader), desc=f"Epoch {epoch}"
        )
        model.train()
        train_loss = 0.0
        # iterate over dataset 
        for _, (inputs, labels) in train_loop:
            # Write your code here
            # Move data to a device, do forward pass and loss calculation, do backward pass and run optimizer
            inputs = inputs.to(device)
            # class-index losses such as CrossEntropyLoss expect integer labels
            labels = labels.to(device).long()

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()
            train_loop.set_postfix({"loss": loss.item()})
        # write loss to tensorboard
        if writer:
            writer.add_scalar("Loss/train", train_loss / len(train_loader), epoch)
        
        # Validation
        correct = 0
        total = 0
        model.eval()  # evaluation mode
        with torch.no_grad():
            val_loop = tqdm(enumerate(val_loader, 0), total=len(val_loader), desc="Val")
            for _, (inputs, labels) in val_loop:
                # Write your code here
                # Get predictions and compare them with labels
                inputs = inputs.to(device)
                labels = labels.to(device).long()
                outputs = model(inputs)
                # predicted class = index of the largest logit
                predicted = outputs.argmax(dim=1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                val_loop.set_postfix({"acc": correct / total})

        # save a checkpoint whenever validation accuracy improves
        if correct / total > best:
            torch.save(model.state_dict(), ckpt_path)
            best = correct / total


## 2.3 Combining everything together

In [34]:
import torch.optim as optim
from model import CNNClassificationModel
# Write your code here
# Pick optimizer from torch.optim and loss function loss_fn from torch.nn that suits best the model
# SummaryWriter is used by tensorboard and could be set None
model = CNNClassificationModel()

train(
    model=model,
    optimizer=optim.Adam(model.parameters(), lr=0.001),
    loss_fn=nn.CrossEntropyLoss(),
    train_loader=train_loader,
    val_loader=val_loader,
    device='cpu',
    writer=SummaryWriter(),
    epochs=7
)
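`nn.CrossEntropyLoss` takes the raw model outputs (logits) and integer class indices, which is why the training loop casts the labels to `int64`. The plain-Python sketch below shows the per-sample quantity it computes, `-log(softmax(logits)[target])`; the function `cross_entropy` and the logit values are hypothetical and only illustrate the formula.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    shift = max(logits)                          # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

# Two logits per sample: index 0 = not blond, index 1 = blond.
confident_right = cross_entropy([-2.0, 3.0], target=1)
confident_wrong = cross_entropy([-2.0, 3.0], target=0)
uninformed = cross_entropy([0.0, 0.0], target=1)     # equals log(2)
print(confident_right, confident_wrong, uninformed)
```

A training loss close to log(2), about 0.693, therefore suggests that a two-class model is still doing no better than guessing.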



## 2.4 Inference
Here you need to run the trained model on the test data.

Load the best checkpoint from training into the model and run inference.

In [40]:
# load best checkpoint to model
from pathlib import Path
model = CNNClassificationModel()
Path("../models/best.pt").rename("C:/Users/pv010/OneDrive/Рабочий стол/Homework/PMDL/Assignment 1/DeployedModel/models/best.pt")
ckpt = torch.load("C:/Users/pv010/OneDrive/Рабочий стол/Homework/PMDL/Assignment 1/DeployedModel/models/best.pt")
model.load_state_dict(ckpt)

<All keys matched successfully>

In [None]:
import torch
from tqdm.notebook import tqdm

def predict(model, test_loader, device):
    """
    Run model inference on test data

    :param model: trained model
    :param test_loader: DataLoader yielding batches of input images (no labels)
    :param device: device to run inference on
    :return: list of predicted class indices
    """
    predictions = []
    model.eval()  # evaluation mode
    with torch.no_grad():
        test_loop = tqdm(enumerate(test_loader, 0), total=len(test_loader), desc="Test")

        for _, inputs in test_loop:
            # Write your code here
            # Similar to validation part in training cell
            inputs = inputs.to(device)
            predicted = model(inputs).argmax(dim=1)

            # Extend overall predictions by prediction for a batch
            predictions.extend(predicted.tolist())
    return predictions
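The model returns two logits per image, and the predicted class is the index of the larger one. The sketch below, in plain Python with made-up logits, shows that this is equivalent to thresholding the softmax probability of the "blond" class at 0.5; `softmax` and `predict_class` are hypothetical helpers for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(logits):
    """Index of the largest logit (the argmax over classes)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical model outputs for three images.
batch_logits = [[1.2, -0.3], [-0.5, 2.0], [0.1, 0.4]]
classes = [predict_class(row) for row in batch_logits]
blond_probs = [softmax(row)[1] for row in batch_logits]
print(classes, [round(p, 3) for p in blond_probs])
```

Keeping the probabilities around can be useful if you later want a decision threshold other than 0.5.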


In [20]:
# process test data and run inference on it
test_features = pd.read_csv(f"{img_path}/test.csv")
images = torch.stack([load_img(f"{img_path}/img_align_celeba/test/{item['image_id']}") for _, item in  test_features.iterrows()])

test_loader = DataLoader(images, batch_size=batch_size, shuffle=False)
predictions = predict(model, test_loader, device='cpu')

# generate the submission file
submission_df = pd.DataFrame(columns=['ID', 'Blond_Hair'])
submission_df['ID'] = test_features.index
submission_df['Blond_Hair'] = predictions
submission_df.to_csv('submission.csv', index=False)
submission_df.head()


|   | ID | Blond_Hair |
|---|----|------------|
| 0 | 0  | 1          |
| 1 | 1  | 0          |
| 2 | 2  | 0          |
| 3 | 3  | 0          |
| 4 | 4  | 0          |
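Before uploading, it can be worth checking the file format. The sketch below validates CSV text against the layout shown above (an `ID` column and a `Blond_Hair` column with values 0 or 1); `validate_submission` is a hypothetical helper, and the exact requirements of the competition are an assumption based on this notebook only.

```python
import csv
import io

def validate_submission(csv_text, expected_rows):
    """Check header, row count, and that every label is 0 or 1."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows or list(rows[0].keys()) != ["ID", "Blond_Hair"]:
        return False
    if len(rows) != expected_rows:
        return False
    return all(row["Blond_Hair"] in {"0", "1"} for row in rows)

# Sample mirroring the first rows shown above; a real check would read submission.csv.
sample = "ID,Blond_Hair\n0,1\n1,0\n2,0\n"
print(validate_submission(sample, expected_rows=3))
```

For the real file, pass the contents of `submission.csv` and the number of rows in `test.csv`.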
