# Fundamental Machine Learning Questions

## Easy
### Score - 3 points for each correct and confident answer
1. What is the difference between supervised, semi-supervised, weakly supervised, and unsupervised learning?

2. Can you explain the concept of overfitting? How can you avoid it? Answer the question from a traditional ML perspective.

3. What is generalization? How do you evaluate a model's ability to generalize?


## Medium
### Score - 5 points for each correct and confident answer

1. Explain the Bias and Variance Trade-off. What is the need to trade-off? What is the importance of each (bias and variance)?

2. Is the Logistic Regression model a regression model? If yes, what is the core working principle behind it?

3. What are the metrics generally used for a classification problem? Can you build a confusion matrix and how many metrics can one derive from it?


## Hard
### Score - 8 points for each correct and confident answer

1. Describe briefly a project you worked on and please share the GitHub repo (code). What was your underlying motivation to start off with this project?

2. Discuss various preprocessing methods used in ML such as data imputation, dimensionality reduction, data normalization, and feature selection. What are various techniques or ways to solve each of the problems? Is there a global solution to tackle all these problems at once? If no, what could be the reason?

3. Pick a machine learning deterministic algorithm of your choice from the below list and provide an in-detail explanation of the underlying mathematical formulation and use the sklearn library (https://scikit-learn.org/stable/) to implement it on the data of your choice from the UCI repo (https://archive.ics.uci.edu/).

Algorithm List: [Linear Regression, Logistic Regression, Decision Tree Regressor/Classifier, Support Vector Regressor/Classifier]


# Easy Questions -> answers

### 1. What is the difference between supervised, semi-supervised, weakly supervised, and unsupervised learning?
#### Supervised learning

This is when we train a model using data that already has correct answers (labels). The model learns by matching input to the known output, like learning from solved examples.

#### Semi-supervised learning

Here, we have a small amount of labeled data and a large amount of unlabeled data. The model learns from the labeled part and also uses patterns from the unlabeled data to improve its performance.

#### Weakly supervised learning

In this case, we have labels, but they are not fully reliable or detailed. The labels might be noisy, automatically generated, or roughly correct, and the model learns even though the supervision is imperfect.

#### Unsupervised learning

This is when we train the model without any labels at all. The model tries to find structure on its own, like grouping similar data points or discovering hidden patterns.

### 2. Can you explain the concept of overfitting? How can you avoid it? Answer the question from a traditional ML perspective.
Overfitting happens when a machine learning model learns the training data too well, including its noise and random patterns instead of learning the true general trend. So it performs very well on training data but gives poor results on new or unseen data.

We can reduce overfitting by keeping the model simpler and more general. Common ways include using more, and more representative, training data, reducing model complexity (for example, limiting tree depth or using fewer features), and applying regularization (such as L1/L2). A held-out test set or cross-validation does not reduce overfitting by itself, but it lets us detect it, and for iteratively trained models early stopping on a validation set keeps the model from fitting the training data too closely.
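
As a rough illustration of the complexity point, the sketch below compares an unconstrained decision tree with a depth-limited one. The dataset (`load_breast_cancer`), the split, and `max_depth=3` are assumptions chosen for the example, not part of the original answer.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

results = {}
for name, depth in [("unconstrained", None), ("max_depth=3", 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # A large gap between train and test accuracy is the usual sign of overfitting.
    results[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for name, (train_acc, test_acc) in results.items():
    gap = train_acc - test_acc
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}, gap={gap:.3f}")
```

Whether the depth limit actually improves test accuracy depends on the data, so the gap should be read from the output rather than assumed.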

### 3. What is generalization? How do you evaluate a model's ability to generalize?
Generalization means how well a model can perform on new, unseen data, not just the data it was trained on. A model that generalizes well learns the real underlying pattern, so it gives good predictions even on fresh inputs.

We evaluate generalization by testing the model on data it has never seen during training, such as a held-out test set or a validation set. Common methods include train-test split, cross-validation, and checking performance using metrics like accuracy, precision, recall, F1-score, RMSE, etc. If a model performs well on both training and unseen data with a small gap, it usually means it generalizes well.
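
A minimal sketch of cross-validated evaluation, assuming the scikit-learn breast cancer data and a scaled logistic regression as a stand-in model (both are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Putting the scaler inside the pipeline means it is re-fit on each training fold,
# so no statistics from the held-out fold leak into training.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean +/- std: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

The spread across folds gives a rough sense of how stable the estimate is, which a single train-test split cannot show.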

# Medium Questions -> answers

### 1. Explain the Bias and Variance Trade-off. What is the need to trade-off? What is the importance of each (bias and variance)?

The bias and variance trade-off explains why a model cannot be perfect in every way. If a model is too simple, it makes strong assumptions and may miss important patterns which leads to high bias. If a model is too complex, it may learn even the noise in the training data which leads to high variance. So we need a balance where the model learns real patterns but does not become too sensitive to the training data.

We need the trade-off because improving one side often worsens the other. When we reduce bias by making the model more complex, the model can start overfitting and variance increases. When we reduce variance by making the model simpler, the model may underfit and bias increases. The goal is to find a model complexity that gives the best performance on unseen data.

Importance of bias is that it shows how much error comes from wrong assumptions in the model. A high-bias model is usually too simple and performs poorly even on training data because it cannot capture the true relationship. Reducing bias helps the model learn better patterns and improves accuracy on training data.

Importance of variance is that it shows how much the model’s output changes when the training data changes. A high-variance model depends too much on the training data and performs well on training but poorly on unseen data. Reducing variance helps the model become more stable and reliable in real-world predictions.
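
The trade-off can be seen in a small experiment. The sketch below fits polynomials of increasing degree to noisy samples of a sine curve; the target function, noise level, sample sizes, and degrees are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(2*pi*x) on [0, 1]."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

errors = {}
for degree in (1, 3, 9):
    # Higher degree = more flexible model (lower bias, higher variance).
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```

Training error can only go down as the degree grows, because each polynomial family contains the lower-degree ones. The test error is the quantity to compare when choosing the complexity.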

### 2. Is the Logistic Regression model a regression model? If yes, what is the core working principle behind it?
Not in the usual sense. Despite the word "regression" in its name, Logistic Regression is mainly used for classification, especially binary classification. The name comes from the fact that it fits a linear model to the log-odds of the positive class.

#### Core working principle
It takes a linear combination of the input features, like a normal linear model, and passes it through the sigmoid function to convert the output into a probability between 0 and 1.
So it predicts P(y = 1 | x), and then uses a threshold (usually 0.5) to decide the final class:
- probability >= 0.5 -> class 1
- probability < 0.5 -> class 0
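
The two steps above (linear score, then sigmoid and threshold) can be sketched in plain Python. The weights, bias, and inputs are hypothetical values chosen only for illustration.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted class) for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical weights and inputs, chosen only for illustration.
weights, bias = [0.8, -0.4], 0.1
for features in ([2.0, 1.0], [-1.0, 2.0]):
    p, label = predict(features, weights, bias)
    print(features, "-> p =", round(p, 3), "class =", label)
```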

### 3. What are the metrics generally used for a classification problem? Can you build a confusion matrix and how many metrics can one derive from it?
The most common metrics are Accuracy, Precision, Recall, F1-score, and ROC-AUC.
For imbalanced datasets, Precision, Recall, and F1-score are usually more informative than accuracy. People also use Log Loss, and for multi-class problems, macro- or micro-averaged Precision, Recall, and F1.

Yes, we can build a confusion matrix. For binary classification it is a 2×2 table that compares actual vs predicted classes. Sample syntax:
```python
from sklearn.metrics import confusion_matrix
# y_true = actual labels
# y_pred = predicted labels
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
```
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |

We can derive many useful metrics from a confusion matrix. The most common ones are:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall (Sensitivity / TPR) = TP / (TP + FN)
- F1-score = 2 × (Precision × Recall) / (Precision + Recall)
- False Positive Rate = FP / (FP + TN)
- False Negative Rate = FN / (FN + TP)
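
These formulas translate directly into code. The helper name `metrics_from_counts` and the counts passed to it are hypothetical, and the sketch does not guard against zero denominators. Note that scikit-learn's `confusion_matrix` orders rows and columns by sorted label, so for labels 0 and 1 it returns `[[TN, FP], [FN, TP]]`, which is the table above with rows and columns reversed.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }

# Hypothetical counts, for illustration only.
m = metrics_from_counts(tp=40, fp=10, fn=5, tn=45)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```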

# Hard Questions -> answers

### 1. Describe briefly a project you worked on and please share the GitHub repo (code). What was your underlying motivation to start off with this project?
One project I’m proud of is RoadEye, a road safety focused, road infrastructure defect auditing system that I built for the National Road Safety Hackathon at IIT Madras, where it was selected and presented as a Top 10 finalist project among 3.7k participants nationwide.

The project focuses on building a computer vision based road infrastructure defect auditing system that can analyze road images/footage and detect critical road safety issues and surface defects. Technically, it’s designed as a modular ML pipeline where we handle data preprocessing, run object detection inference using a YOLO-style deep learning workflow, and generate structured outputs by extracting metadata that can be plugged into a reporting layer as an automated road audit. The overall aim was to convert raw road visuals into actionable, measurable defect insights that can support faster road maintenance and safety monitoring.

[github repo](https://github.com/aryy8/road-safety)

### 2. Discuss various preprocessing methods used in ML such as data imputation, dimensionality reduction, data normalization, and feature selection. What are various techniques or ways to solve each of the problems? Is there a global solution to tackle all these problems at once? If no, what could be the reason?

In ML, preprocessing is basically the set of steps we apply to raw data to make it clean, usable, and more model-friendly. Some of the most common preprocessing methods are data imputation, dimensionality reduction, normalization, and feature selection.
#### 1) Data Imputation (handling missing values)
When our dataset has missing values, we either remove them or fill them in using a sensible strategy. Common ways include:
- Dropping rows (only if missing values are very few)
- Dropping columns (if a column has too many missing values)
- Filling numeric values using mean or median
- Filling categorical values using mode
- Forward fill or backward fill (mostly in time-series)
- KNN imputation (using nearest data points)
- Regression-based imputation (predict missing values from other features)
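
A short sketch of three of these strategies using scikit-learn's `SimpleImputer` and `KNNImputer`. The toy matrix is made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy matrix with missing entries (np.nan); values are made up for illustration.
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
    [5.0, 6.0],
])

# Replace each missing value with the column mean / median.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)
median_imputed = SimpleImputer(strategy="median").fit_transform(X)

# Replace each missing value with the average of its 2 nearest rows.
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

print("Mean imputation:\n", mean_imputed)
print("KNN imputation:\n", knn_imputed)
```

In a real workflow the imputer would be fit on the training split only and then applied to the test split.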

#### 2) Dimensionality Reduction
When we have too many features, it can lead to overfitting, slower training, and redundant information. To reduce dimensions, we use:
- PCA (most common for reducing numeric features)
- SVD (very useful in sparse data like text vectors)
- LDA (supervised dimensionality reduction)
- Removing highly correlated features
- Autoencoders (deep learning based compression approach)
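
As an example of the first option, a minimal PCA sketch on the scikit-learn breast cancer data. Keeping 2 components is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Original shape:", X.shape)
print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

The explained variance ratio shows how much information each retained component carries, which helps decide how many components to keep.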

#### 3) Data Normalization / Scaling
Scaling is important when features are in different ranges, because many models like KNN, SVM, and logistic regression depend on distances or on gradient-based optimization. Common scaling methods are:
- Min-Max scaling (0 to 1 range)
- Standardization (mean 0, standard deviation 1)
- Robust scaling (good when outliers exist)
- Log transform (reduces skewness)
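
The first three methods are simple column-wise formulas, sketched here in NumPy on a made-up matrix. scikit-learn provides the same transforms as `MinMaxScaler`, `StandardScaler`, and `RobustScaler`.

```python
import numpy as np

# Two hypothetical features on very different scales.
X = np.array([
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 2000.0],
    [4.0, 10000.0],
])

# Min-max scaling: (x - min) / (max - min), column by column.
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

# Standardization: (x - mean) / std, column by column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Robust scaling: (x - median) / IQR, less affected by the 10000 outlier.
q1, med, q3 = np.percentile(X, [25, 50, 75], axis=0)
X_robust = (X - med) / (q3 - q1)

print("Min-max:\n", X_minmax.round(3))
print("Standardized:\n", X_std.round(3))
print("Robust:\n", X_robust.round(3))
```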

#### 4) Feature Selection
Feature selection helps us keep only the most important features, reduce noise, and improve generalization. Some common techniques are:
- Filter methods (fast and simple)
  - Correlation threshold
  - Chi-square test
  - ANOVA / F-test
- Wrapper methods (more accurate but slower)
  - Forward selection
  - Backward elimination
  - Recursive Feature Elimination (RFE)
- Embedded methods (built into model training)
  - L1 regularization (Lasso)
  - Elastic Net (L1 + L2)
  - Tree-based feature importance like Random Forest or XGBoost
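
One representative of each family, sketched with scikit-learn. The dataset, the choice of 5 features, and the specific estimators are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = data.target

# Filter method: keep the k features with the highest ANOVA F-score.
filter_selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
filter_features = data.feature_names[filter_selector.get_support()]

# Wrapper method: recursively drop the weakest feature of a fitted model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
rfe_features = data.feature_names[rfe.get_support()]

# Embedded method: rank features by importance learned during tree training.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top5 = np.argsort(forest.feature_importances_)[::-1][:5]
embedded_features = data.feature_names[top5]

print("Filter:  ", list(filter_features))
print("Wrapper: ", list(rfe_features))
print("Embedded:", list(embedded_features))
```

The three approaches need not agree, since each measures "importance" differently.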


No, there is no single global solution that works for all preprocessing problems together.
The reason is that preprocessing depends heavily on the dataset and the model. For example, missing value handling depends on why values are missing, scaling depends on the model type, dimensionality reduction may reduce interpretability, and feature selection depends on feature-target relationships. So we usually choose preprocessing methods based on the problem statement, the data distribution, and the ML algorithm we plan to use.




### 3. Pick a machine learning deterministic algorithm of your choice from the below list and provide an in-detail explanation of the underlying mathematical formulation.
#### 1) What Logistic Regression is
Logistic Regression is a deterministic supervised machine learning algorithm used for classification (mostly binary classification).  
Instead of predicting a continuous value, it predicts the probability that an input belongs to a class.

#### 2) Mathematical Formulation


##### 2.1 Linear model (score function)
For an input feature vector $x = [x_1, x_2, ..., x_n]$, we compute:

$$
z = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n
$$

##### 2.2 Sigmoid function (probability output)
We convert this score into probability using the sigmoid function:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
\hat{p} = P(y=1|x) = \sigma(z)
$$

##### 2.3 Final class prediction
We use a threshold (usually 0.5) to decide the class:

$$
\hat{y} =
\begin{cases}
1, & \hat{p} \ge 0.5 \\
0, & \hat{p} < 0.5
\end{cases}
$$

##### 2.4 Loss function (Binary Cross Entropy / Log Loss)
Logistic Regression learns parameters by minimizing:

$$
J(\beta) = -\frac{1}{m}\sum_{i=1}^{m}
\left[
y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)
\right]
$$

Where:
- $m$ = number of training samples  
- $y_i$ = actual label (0 or 1)  
- $\hat{p}_i$ = predicted probability  
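
The loss above can be computed directly. The sketch below is a plain-Python version with made-up labels and probabilities; the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-15):
    """Mean log loss J for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]
confident_wrong = [0.1, 0.9, 0.2, 0.8]

loss_right = binary_cross_entropy(y_true, confident_right)
loss_wrong = binary_cross_entropy(y_true, confident_wrong)
print("Loss for good predictions:", round(loss_right, 4))
print("Loss for bad predictions: ", round(loss_wrong, 4))
```

Confident wrong predictions are penalized much more heavily than confident correct ones are rewarded, which is what pushes the model toward calibrated probabilities.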

##### 2.5 Regularization (to avoid overfitting)
In practice, we often use L2 regularization:

$$
J_{\text{reg}}(\beta) = J(\beta) + \lambda \sum_{j=1}^{n}\beta_j^2
$$

This controls model complexity by preventing weights from becoming too large.
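
For reference, a sketch of the gradient that a gradient-based solver would follow, derived from the definitions above using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. Here $x_{ij}$ denotes feature $j$ of sample $i$ (a symbol not used in the original), and the intercept $\beta_0$ is left unpenalized, matching the sum starting at $j = 1$:

$$
\frac{\partial}{\partial \beta_j}\left[J(\beta) + \lambda\sum_{k=1}^{n}\beta_k^2\right]
= \frac{1}{m}\sum_{i=1}^{m}(\hat{p}_i - y_i)\,x_{ij} + 2\lambda\beta_j, \qquad j = 1, \dots, n
$$

$$
\frac{\partial J}{\partial \beta_0} = \frac{1}{m}\sum_{i=1}^{m}(\hat{p}_i - y_i)
$$

In scikit-learn's `LogisticRegression` the penalty strength is set through `C`, the inverse of the regularization strength, so a smaller `C` corresponds to a larger $\lambda$.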

#### 3) Implementation using `sklearn` (UCI Breast Cancer Dataset)

##### Dataset used
 **Breast Cancer Wisconsin (Diagnostic)** dataset from the UCI repository.



In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1) UCI dataset
data = load_breast_cancer()
X = data.data
y = data.target  # 0 = malignant, 1 = benign

print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

# 2) Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train samples:", X_train.shape[0])
print("Test samples:", X_test.shape[0])

# 3) Feature scaling (important for Logistic Regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


Shape of X: (569, 30)
Shape of y: (569,)
Train samples: 455
Test samples: 114


In [3]:
from sklearn.linear_model import LogisticRegression

# 4) Train the Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)


LogisticRegression(max_iter=1000)


In [4]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

#  5) Predict on test data
y_pred = model.predict(X_test_scaled)


# 6) Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.9824561403508771

Confusion Matrix:
 [[41  1]
 [ 1 71]]

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98        42
           1       0.99      0.99      0.99        72

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114
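
As a check, the reported numbers can be reproduced by hand from the confusion matrix. scikit-learn puts true labels on rows and predicted labels on columns, in sorted label order, so row 0 is malignant and row 1 is benign:

$$
\text{Accuracy} = \frac{41 + 71}{114} = \frac{112}{114} \approx 0.9825
$$

$$
\text{Precision}_{1} = \frac{71}{71 + 1} \approx 0.986, \qquad \text{Recall}_{1} = \frac{71}{71 + 1} \approx 0.986
$$

$$
\text{Precision}_{0} = \frac{41}{41 + 1} \approx 0.976, \qquad \text{Recall}_{0} = \frac{41}{41 + 1} \approx 0.976
$$

These round to the 0.99 and 0.98 shown in the classification report.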



# Deep Learning Assessment
## Solving Face Recognition using Deep Learning Models

Most of the code is given. There is one `TODO` section where you have to code the entire architecture in PyTorch from scratch. Once the model architecture is ready, all the components in the code will work. Further instructions are provided below.

In [6]:
# Installing Libraries
!pip3 install matplotlib Pillow


In [1]:

# Basic Imports
import os
import sys
import warnings
import numpy as np
import pandas as pd
from scipy import linalg

# Loading and plotting data
from PIL import Image
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Features
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import _class_means,_class_cov
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.manifold import TSNE
plt.ion()
%matplotlib inline


In [2]:
opt = {
    'image_size': 32,
    'is_grayscale': True,
    'val_split': 0.75
}


In [3]:
cfw_dict = {'Amitabhbachan': 0,
    'AamirKhan': 1,
    'DwayneJohnson': 2,
    'AishwaryaRai': 3,
    'BarackObama': 4,
    'NarendraModi': 5,
    'ManmohanSingh': 6,
    'VladimirPutin': 7}

imfdb_dict = {'MadhuriDixit': 0,
     'Kajol': 1,
     'SharukhKhan': 2,
     'ShilpaShetty': 3,
     'AmitabhBachan': 4,
     'KatrinaKaif': 5,
     'AkshayKumar': 6,
     'Amir': 7}

# Load Image using PIL for dataset
def load_image(path):
    im = Image.open(path).convert('L' if opt['is_grayscale'] else 'RGB')
    im = im.resize((opt['image_size'],opt['image_size']))
    im = np.array(im)
    im = im/256
    return im

# Load the full data from directory
def load_data(dir_path):
    image_list = []
    y_list = []

    if "CFW" in dir_path:
        label_dict = cfw_dict

    elif "yale" in dir_path.lower():
        label_dict = {}
        for i in range(15):
            label_dict[str(i+1)] = i
    elif "IMFDB" in dir_path:
        label_dict = imfdb_dict
    else:
        raise KeyError("Dataset not found.")


    for filename in sorted(os.listdir(dir_path)):
        if filename.endswith(".png"):
            im = load_image(os.path.join(dir_path,filename))
            y = filename.split('_')[0]
            y = label_dict[y]
            image_list.append(im)
            y_list.append(y)
        else:
            continue

    image_list = np.array(image_list)
    y_list = np.array(y_list)

    print("Dataset shape:",image_list.shape)

    return image_list,y_list

# Display N Images in a nice format
def disply_images(imgs,classes,row=1,col=2,w=64,h=64):
    fig=plt.figure(figsize=(8, 8))
    for i in range(1, col*row +1):
        img = imgs[i-1]
        fig.add_subplot(row, col, i)

        if opt['is_grayscale']:
            plt.imshow(img , cmap='gray')
        else:
            plt.imshow(img)

        plt.title("Class:{}".format(classes[i-1]))
        plt.axis('off')
    plt.show()

> 1. Download the IMFDB [dataset](https://iiithydresearch-my.sharepoint.com/:f:/g/personal/kamalaker_dadi_research_iiit_ac_in/IgAqPZkfn02VT6hTU5_eQgKpAT5TZrq6qHwkbIsy6ojVw5M?e=6iZiCI) and upload it to your drive.
> 2. Once it's uploaded, sync your drive with Colab.
> 3. Once that is done, paste the exact directory into `dirpath` (locate the correct path for the code to execute).

In [4]:
# Loading the dataset from your drive
# First you have to upload the IMFDB dataset in the drive and locate the exact path for code to run
dirpath = '/Users/aryan/Downloads/IIITHcode/IMFDB'
X,y = load_data(dirpath)
N,H,W = X.shape[0:3]
C = 1 if opt['is_grayscale'] else X.shape[3]

Dataset shape: (400, 32, 32)


# Example Illustration of loading VGG-16,19 Models and Features in PyTorch
## TODO
> 1. Initially we have provided a pre-trained model here for VGG-16,19 as an example.
> 2. Your task is to choose `1` of the 3 architectures and write the entire code only for that architecture.
> 3. Later, you have to answer all the questions regardless of the chosen architecture.

Note: You have to quote the statements from the original paper and justify your answer from the perspective of the authors. An example question and answer is provided.


In [5]:
import torch
import torch.nn as nn
from torchvision import models, transforms, datasets
from torch.utils.data import DataLoader


In [6]:
# from the above functions, build your dataloader for the dataset

from torch.utils.data import Dataset, DataLoader
from PIL import Image
import numpy as np
import os
import torch

class CustomDataset(Dataset):
    def __init__(self, dir_path, transform=None):
        self.dir_path = dir_path
        self.transform = transform

        if "CFW" in dir_path:
            self.label_dict = cfw_dict
        elif "yale" in dir_path.lower():
            self.label_dict = {str(i+1): i for i in range(15)}
        elif "IMFDB" in dir_path:
            self.label_dict = imfdb_dict
        else:
            raise KeyError("Dataset not found.")

        self.data = []
        self.labels = []

        for filename in sorted(os.listdir(dir_path)):
            if filename.endswith(".png"):
                path = os.path.join(dir_path, filename)
                label = filename.split('_')[0]
                label = self.label_dict[label]
                self.data.append(path)
                self.labels.append(label)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        img_path = self.data[idx]
        label = self.labels[idx]
        image = Image.open(img_path).convert('RGB')
        image = image.resize((opt['image_size'], opt['image_size']))

        if self.transform:
            image = self.transform(image)

        return image, label



In [17]:
from torchvision import transforms

transformations = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])


dataset = CustomDataset(dir_path="/Users/aryan/Downloads/IIITHcode/IMFDB", transform=transformations)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)


## Example Block
## Just for illustration, no credits will be given

In [None]:
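# predefined torch model VGG-16
# (torchvision >= 0.13 deprecates `pretrained=True` in favor of `weights=`)
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

# predefined torch model VGG-19
vgg19 = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)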

In [None]:
from torchsummary import summary
summary(vgg16, (3,224,224))

In [None]:
summary(vgg19, (3,224,224))

## TODO Block

### Choose any `1` of the three architectures below and write the code in the subsequent cells

1. VGG-16 and VGG-19 [1]
2. ResNet-34 and ResNet-50 [2]
3. ViT-Base [3]


References

[1] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014). URL: https://arxiv.org/abs/1409.1556

[2] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. URL: https://arxiv.org/abs/1512.03385

[3] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020). URL: https://arxiv.org/abs/2010.11929

## VGG-16, 19 Code

----TODO----

In [None]:
## Place your code here, architecture should include every block detailed as per paper [1]
# class VGG16():
#   ## TODO ##

# class VGG19():
#   ## TODO ##



In [None]:
from torchsummary import summary
summary(VGG16(), (3,224,224))
print("\n########################\n")
summary(VGG19(), (3,224,224))

## ResNet-34, 50 Code

----TODO----

In [18]:
## Place your code here, architecture should include every block detailed as per paper [2]
# class ResNet34():
#   ## TODO ##

# class ResNet50():
#   ## TODO ##

import torch
import torch.nn as nn



class BasicBlock(nn.Module):
    """
    ResNet-34 uses BasicBlock:
    (3x3 conv) -> BN -> ReLU -> (3x3 conv) -> BN -> add skip -> ReLU
    """
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # projection shortcut if shape changes
        self.downsample = None
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out = out + identity
        out = self.relu(out)
        return out


class Bottleneck(nn.Module):
    """
    ResNet-50 uses Bottleneck:
    (1x1) -> BN -> ReLU -> (3x3) -> BN -> ReLU -> (1x1) -> BN -> add -> ReLU
    expansion=4 
    """
    expansion = 4

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        width = out_channels  # bottleneck inner channels

        self.conv1 = nn.Conv2d(in_channels, width, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(width)

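        # Note: stride is applied in the 3x3 conv (the torchvision "ResNet v1.5"
        # variant); the original paper [2] places it in the first 1x1 conv.
        self.conv2 = nn.Conv2d(width, width, kernel_size=3, stride=stride,
                               padding=1, bias=False)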
        self.bn2 = nn.BatchNorm2d(width)

        self.conv3 = nn.Conv2d(width, out_channels * self.expansion, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)

        self.relu = nn.ReLU(inplace=True)

        self.downsample = None
        if stride != 1 or in_channels != out_channels * self.expansion:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels * self.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels * self.expansion)
            )

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out = out + identity
        out = self.relu(out)
        return out



# ResNet Backbones 


class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=1000, in_channels=3):
        super().__init__()

        self.inplanes = 64

        # Conv1: 7x7, stride 2
        self.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)

        # MaxPool: 3x3 stride 2
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # conv2_x, conv3_x, conv4_x, conv5_x
        self.layer1 = self._make_layer(block, 64,  layers[0], stride=1)
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        # Global Average Pool + FC
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        # Kaiming init (good practice)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def _make_layer(self, block, out_channels, blocks, stride):
        layers = []

        # first block of stage may downsample
        layers.append(block(self.inplanes, out_channels, stride=stride))
        self.inplanes = out_channels * block.expansion

        # remaining blocks keep same shape
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, out_channels, stride=1))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)     # [B,64,112,112]
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)   # [B,64,56,56]

        x = self.layer1(x)    # conv2_x
        x = self.layer2(x)    # conv3_x
        x = self.layer3(x)    # conv4_x
        x = self.layer4(x)    # conv5_x

        x = self.avgpool(x)   # [B,512*exp,1,1]
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x






class ResNet34(nn.Module):
    """
    ResNet-34 block counts (paper):
    [3, 4, 6, 3] using BasicBlock
    """
    def __init__(self, num_classes=1000, in_channels=3):
        super().__init__()
        self.model = ResNet(BasicBlock, [3, 4, 6, 3], num_classes=num_classes, in_channels=in_channels)

    def forward(self, x):
        return self.model(x)


class ResNet50(nn.Module):
    """
    ResNet-50 block counts (paper):
    [3, 4, 6, 3] using Bottleneck
    """
    def __init__(self, num_classes=1000, in_channels=3):
        super().__init__()
        self.model = ResNet(Bottleneck, [3, 4, 6, 3], num_classes=num_classes, in_channels=in_channels)

    def forward(self, x):
        return self.model(x)
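
The shape comments in `ResNet.forward` and the shared block counts `[3, 4, 6, 3]` can be checked with plain arithmetic. The sketch below is an illustration, not part of the model code: `conv_out`, `resnet_trace`, and `weighted_depth` are hypothetical helper names, and it assumes a 224x224 input and the standard convolution output-size formula without dilation.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a conv/pool layer (floor division, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_trace(input_size=224, expansion=1):
    """Return (stage, channels, spatial size) after the stem and each stage."""
    trace = []
    size = conv_out(input_size, kernel=7, stride=2, padding=3)  # conv1
    trace.append(("conv1", 64, size))
    size = conv_out(size, kernel=3, stride=2, padding=1)        # maxpool
    trace.append(("maxpool", 64, size))
    stages = [("layer1", 64, 1), ("layer2", 128, 2),
              ("layer3", 256, 2), ("layer4", 512, 2)]
    for name, width, stride in stages:
        # Only the first block of a stage strides; its 3x3 conv sets the size.
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        trace.append((name, width * expansion, size))
    return trace


def weighted_depth(blocks, convs_per_block):
    """Stem conv + convs inside the residual blocks + final fc layer."""
    return 1 + convs_per_block * sum(blocks) + 1


for row in resnet_trace(expansion=4):   # Bottleneck (ResNet-50) widths
    print(row)
print(weighted_depth([3, 4, 6, 3], convs_per_block=2))  # BasicBlock
print(weighted_depth([3, 4, 6, 3], convs_per_block=3))  # Bottleneck
```

Both networks share the block counts; the depth differs only because a `BasicBlock` holds two convolutions and a `Bottleneck` holds three. Projection shortcuts are not counted, following the usual naming convention.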



In [15]:

from torchsummary import summary
summary(ResNet34(), (3,224,224))
print("\n########################\n")
summary(ResNet50(), (3,224,224))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 112, 112]           9,408
       BatchNorm2d-2         [-1, 64, 112, 112]             128
              ReLU-3         [-1, 64, 112, 112]               0
         MaxPool2d-4           [-1, 64, 56, 56]               0
            Conv2d-5           [-1, 64, 56, 56]          36,864
       BatchNorm2d-6           [-1, 64, 56, 56]             128
              ReLU-7           [-1, 64, 56, 56]               0
            Conv2d-8           [-1, 64, 56, 56]          36,864
       BatchNorm2d-9           [-1, 64, 56, 56]             128
             ReLU-10           [-1, 64, 56, 56]               0
       BasicBlock-11           [-1, 64, 56, 56]               0
           Conv2d-12           [-1, 64, 56, 56]          36,864
      BatchNorm2d-13           [-1, 64, 56, 56]             128
             ReLU-14           [-1, 64,

In [21]:
from torchsummary import summary
summary(ResNet50(num_classes=8), (3,224,224))


----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 112, 112]           9,408
       BatchNorm2d-2         [-1, 64, 112, 112]             128
              ReLU-3         [-1, 64, 112, 112]               0
         MaxPool2d-4           [-1, 64, 56, 56]               0
            Conv2d-5           [-1, 64, 56, 56]           4,096
       BatchNorm2d-6           [-1, 64, 56, 56]             128
              ReLU-7           [-1, 64, 56, 56]               0
            Conv2d-8           [-1, 64, 56, 56]          36,864
       BatchNorm2d-9           [-1, 64, 56, 56]             128
             ReLU-10           [-1, 64, 56, 56]               0
           Conv2d-11          [-1, 256, 56, 56]          16,384
      BatchNorm2d-12          [-1, 256, 56, 56]             512
           Conv2d-13          [-1, 256, 56, 56]          16,384
      BatchNorm2d-14          [-1, 256,
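
The `Param #` column in the two listings above can be reproduced by hand. This is a small sketch with hypothetical helper names (`conv2d_params`, `batchnorm2d_params`); it assumes bias-free convolutions, as in the model code, and counts only the learnable scale and shift of batch norm.

```python
def conv2d_params(in_ch, out_ch, kernel, bias=False):
    """Learnable weights of a square-kernel Conv2d layer."""
    return kernel * kernel * in_ch * out_ch + (out_ch if bias else 0)


def batchnorm2d_params(channels):
    """Learnable scale (weight) and shift (bias); running stats are buffers."""
    return 2 * channels


print(conv2d_params(3, 64, 7))     # stem 7x7 conv
print(conv2d_params(64, 64, 3))    # 3x3 conv inside a block
print(conv2d_params(64, 64, 1))    # Bottleneck 1x1 reduce
print(conv2d_params(64, 256, 1))   # Bottleneck 1x1 expand / projection shortcut
print(batchnorm2d_params(64), batchnorm2d_params(256))
```

The two 16,384-parameter rows in the ResNet-50 listing are consistent with the expanding 1x1 conv and the 1x1 projection shortcut of the first Bottleneck, both mapping 64 to 256 channels.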

## ViT-B Code
----TODO----

In [None]:
## Place your code here; the architecture should include every block detailed in the paper [3]
# class ViT_B():
#   ## TODO ##


In [None]:
from torchsummary import summary
summary(ViT_B(), (3,224,224))  # pass a model instance, not the class

In [19]:
def train_model(model, data_loader, epochs=10):
    """Train `model` with SGD and cross-entropy, printing the loss of every batch."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    model.train()
    for epoch in range(epochs):
        for i, (images, labels) in enumerate(data_loader, start=1):
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()  # clear gradients from the previous step
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            print(f"Batch {i}, Loss: {loss.item()}")
        # NOTE: this is the loss of the last batch only, not an epoch average
        print(f"Epoch {epoch+1}, Loss: {loss.item()}")


In [20]:
#---Train the model that you have coded from scratch not pre-trained model.-----#
#--- NOTE: You should not use pre-trained model here---#
# train_model(VGG16, dataloader, epochs=1)
model = ResNet34(num_classes=8)
train_model(model, dataloader, epochs=1)


Batch 1, Loss: 2.1150505542755127
Batch 2, Loss: 2.053494453430176
Batch 3, Loss: 2.2113051414489746
Batch 4, Loss: 2.2727081775665283
Batch 5, Loss: 2.0004029273986816
Batch 6, Loss: 2.1644344329833984
Batch 7, Loss: 1.8990137577056885
Batch 8, Loss: 2.2147254943847656
Batch 9, Loss: 2.6722021102905273
Batch 10, Loss: 1.9719842672348022
Batch 11, Loss: 1.8526585102081299
Batch 12, Loss: 2.222278118133545
Batch 13, Loss: 1.7165026664733887
Epoch 1, Loss: 1.7165026664733887


In [22]:
model = ResNet50(num_classes=8)
train_model(model, dataloader, epochs=1)


Batch 1, Loss: 2.1136598587036133
Batch 2, Loss: 3.380072832107544
Batch 3, Loss: 3.351116895675659
Batch 4, Loss: 4.312042236328125
Batch 5, Loss: 3.466721773147583
Batch 6, Loss: 4.465058326721191
Batch 7, Loss: 3.996025800704956
Batch 8, Loss: 5.048213005065918
Batch 9, Loss: 3.70552396774292
Batch 10, Loss: 3.444004535675049
Batch 11, Loss: 3.602144241333008
Batch 12, Loss: 4.088256359100342
Batch 13, Loss: 5.704384803771973
Epoch 1, Loss: 5.704384803771973
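
One way to read the two training logs above (an interpretation, not part of the assignment): with 8 classes, a classifier that predicts a uniform distribution has cross-entropy ln(8) ≈ 2.079, so that value is a natural reference point. The sketch below uses rounded copies of the logged batch losses; `summarize` is a hypothetical helper, and the mean is shown because the printed "Epoch" loss is only the last batch.

```python
import math

NUM_CLASSES = 8  # from num_classes=8 in the runs above

# Batch losses copied from the logs above, rounded to 4 decimals.
resnet34_losses = [2.1151, 2.0535, 2.2113, 2.2727, 2.0004, 2.1644, 1.8990,
                   2.2147, 2.6722, 1.9720, 1.8527, 2.2223, 1.7165]
resnet50_losses = [2.1137, 3.3801, 3.3511, 4.3120, 3.4667, 4.4651, 3.9960,
                   5.0482, 3.7055, 3.4440, 3.6021, 4.0883, 5.7044]


def summarize(losses, num_classes):
    """Return (uniform-guess baseline, mean loss, batches above baseline)."""
    baseline = math.log(num_classes)
    mean_loss = sum(losses) / len(losses)
    above = sum(1 for loss in losses if loss > baseline)
    return baseline, mean_loss, above


for name, losses in [("ResNet34", resnet34_losses), ("ResNet50", resnet50_losses)]:
    baseline, mean_loss, above = summarize(losses, NUM_CLASSES)
    print(f"{name}: baseline={baseline:.3f}, mean={mean_loss:.3f}, "
          f"batches above baseline={above}/{len(losses)}")
```

In this single epoch the ResNet-34 losses hover around the baseline, while every ResNet-50 batch sits above it and the values grow. That suggests the ResNet-50 run, as configured, is not yet learning; the logs alone do not establish the cause, though a lower learning rate is one thing to try.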


# Answer the following questions (all are mandatory unless marked optional)

## <ins>Question-0 (Example)</ins>: In the ViT paper [3], did the authors specify anywhere that their model is better than standard convolutional architectures?
## <ins>Answer-0</ins>: Yes. In the abstract, the authors claim that Vision Transformers consume less computational power to achieve better performance.

### <ins>Supporting statement in paper</ins>: "When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train."

## <ins>Question-1</ins>: What are the major contributions of the VGG-16/19 architectures? Justify your answer.

## <ins>Question-2</ins>: What are the main contributions of Residual Networks [2]? What problem do they solve? Are they better than VGG? If yes, how? Please answer your question briefly and justify your statements with supporting statements from the paper.

## <ins>Question-3</ins>: What are the main contributions of ViT? What are the main advantages of ViT compared to CNNs? Are there any limitations? If yes, what are they? Please answer your question briefly and justify your statements with supporting statements from the paper.

## <ins>Question-4</ins>: Now, comparing these three models, which one is better? Why is the selected model better than the others? Can you briefly state the applications of them which are related to your domain of interest? To be specific, using the model of your choice, can you specify the directions which could help to solve problems other than face recognition?

## <ins>Question-5</ins>: What are the key architectural differences among the three models? Differentiate between each of the components and explain why each component acts like building blocks for these architectures.

## <ins>Question-6 (Optional)</ins>: All the above-mentioned models are for classification purposes. Can you use these models for any other tasks (other than classification)? If so, can you specify your ideas and justify your statements with research articles you have studied?

## Answers

### Answer 1:

VGG showed that stacking many layers of small 3×3 convolution filters and increasing network depth to 16–19 weight layers improves classification accuracy and produces strong generalizable feature representations.

##### <ins>Supporting statement in paper</ins>:

“very small (3x3) convolution filters” and “pushing the depth to 16-19 weight layers”
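
The benefit of small filters can be made concrete with a little arithmetic. The sketch below (hypothetical helper names, stride-1 convolutions with the same channel count in and out, biases ignored) compares a stack of 3×3 layers with a single larger filter that covers the same receptive field.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def stacked_weights(num_layers, channels, kernel=3):
    """Weights in a stack of convs with `channels` inputs and outputs each."""
    return num_layers * kernel * kernel * channels * channels


def single_weights(kernel, channels):
    """Weights in one conv layer with `channels` inputs and outputs."""
    return kernel * kernel * channels * channels


C = 64
print(stacked_receptive_field(3))                  # same coverage as one 7x7
print(stacked_weights(3, C), single_weights(7, C)) # 27*C^2 vs 49*C^2
```

Three stacked 3×3 layers see the same 7×7 region as one 7×7 layer with roughly 45% fewer weights, and they apply three non-linearities instead of one.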

### Answer 2:

Residual Networks (ResNet) introduce residual learning, where layers learn residual functions with respect to the input. This makes training deeper networks easier, solves the degradation problem, and enables very deep models to outperform older architectures like VGG.

##### <ins>Supporting statement in paper</ins>:

“we explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.”

##### <ins>Supporting comparative statement</ins>:

ResNets were tested up to 152 layers, described as “8x deeper than VGG nets”, while achieving better accuracy.
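
A toy illustration of the residual formulation y = F(x) + x, using plain Python lists instead of tensors. The function names are hypothetical and the final ReLU of a real block is left out for clarity. The point is that when the learned residual F is zero, the block reduces to the identity, so extra layers do not have to hurt.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x element-wise, where F is `residual_fn`."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """F(x) = 0: the block then passes its input through unchanged."""
    return [0.0 for _ in x]


def small_residual(x):
    """F(x) = 0.1 * x: a small learned correction on top of the input."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_residual))   # identity mapping
print(residual_block(x, small_residual))  # input plus a small correction
```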

### Answer 3:

Vision Transformer (ViT) splits an image into a sequence of fixed-size patches and feeds that sequence to a standard Transformer encoder. The key contribution is showing that a pure Transformer, with no convolutions, can match or exceed state-of-the-art CNNs when pre-trained on large datasets, while requiring substantially fewer computational resources to train. Its main limitation is the dependence on large-scale pre-training: without it, ViT does not reach its best performance.

##### <ins>Supporting statement in paper</ins>:

“a pure transformer applied directly to sequences of image patches can perform very well” and “requiring substantially fewer computational resources to train.”

##### <ins>Supporting limitation statement</ins>:

“When pre-trained on large amounts of data ... Vision Transformer (ViT) attains excellent results” (showing the need for large-scale pretraining).
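
To make "sequence of image patches" concrete, the sketch below computes the sequence a ViT encoder would see. The helper `patch_sequence` is hypothetical, and the 16×16 patch size is an assumption (the common ViT-B/16 configuration); the answer above does not fix a patch size.

```python
def patch_sequence(image_size=224, patch_size=16, channels=3, class_token=True):
    """Return (num_patches, flattened patch dim, sequence length)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    patch_dim = patch_size * patch_size * channels  # pixels flattened per patch
    seq_len = num_patches + (1 if class_token else 0)
    return num_patches, patch_dim, seq_len


print(patch_sequence())               # 16x16 patches on a 224x224 RGB image
print(patch_sequence(patch_size=32))  # larger patches give a shorter sequence
```

Each flattened patch is then linearly projected to the model width, and a learnable class token is prepended before the encoder.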

### Answer 4:

No single model is best in every case, but ResNet (ResNet-34/50) is the most practical choice for many real-world vision tasks: residual learning addresses the degradation problem, so the network trains reliably as depth increases.
For face recognition, VGG is a simple and strong baseline, but ResNet usually reaches better accuracy because it can be trained stably at greater depth. ViT can be excellent, but it performs best with large-scale pre-training, which makes it less suitable for small datasets when trained from scratch.

##### <ins>Supporting statement in paper (ResNet)</ins>:

“with the network depth increasing, accuracy gets saturated … and then degrades rapidly” (degradation problem).

##### <ins>Supporting statement in paper (ViT)</ins>:

“When pre-trained on large amounts of data … ViT attains excellent results compared to state-of-the-art convolutional networks …”

Applications beyond face recognition (related to my domain):

* Weapon detection / violence detection: ResNet as backbone + temporal modeling (3D CNN, LSTM/GRU, Transformers)
* Object detection: ResNet backbone in Faster R-CNN / RetinaNet
* Action recognition: ResNet for frame features + temporal aggregation
* Medical imaging: classification and diagnosis support
* Industrial inspection: defect detection and anomaly classification

### Answer 5:

VGG-16/19: Uses repeated 3×3 convolution blocks + pooling, showing deeper networks with small filters improve performance.

##### <ins>Supporting statement (VGG)</ins>:

“very small (3x3) convolution filters” and “pushing the depth to 16-19 weight layers”

ResNet-34/50: Uses skip connections (identity shortcuts) so the network learns a residual mapping, enabling stable training of deep networks and avoiding degradation.

##### <ins>Supporting statement (ResNet)</ins>:

“we explicitly reformulate the layers as learning residual functions with reference to the layer inputs…”

ViT-Base: Replaces convolutions with a Transformer over image patches, enabling global context modeling through attention.

##### <ins>Supporting statement (ViT)</ins>:

“a pure transformer applied directly to sequences of image patches can perform very well…”

Component-level differences (building blocks):

* VGG building block: (Conv 3×3 + ReLU) repeated, followed by pooling
  Reason: gradually increases receptive field and keeps architecture uniform
* ResNet building block: Residual block = (conv stack + skip connection)
  Reason: stabilizes deep training by learning residual mapping
* ViT building block: Patch embedding + Multi-head self-attention + MLP
  Reason: models global relationships between patches using attention
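
The "global relationships" of the ViT block come from self-attention, where every patch token attends to every other token. Below is a deliberately simplified, pure-Python sketch of single-head scaled dot-product attention. It assumes Q = K = V = the input tokens; a real ViT uses learned query/key/value projections and multiple heads, which are omitted here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention with Q = K = V = tokens."""
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights, outputs = [], []
    for q in tokens:
        # Similarity of this token to every token in the sequence.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all token vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, tokens))
                        for i in range(d)])
    return weights, outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy "patch" embeddings
weights, outputs = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` has one entry per token, so even the first layer mixes information across the whole image, unlike a 3×3 convolution, which only sees a local neighborhood.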

### Answer 6:

Yes. VGG, ResNet, and ViT can be used beyond classification by using them as backbones (feature extractors) and attaching task-specific heads.

1. Object Detection (Faster R-CNN):
   VGG/ResNet features can be used for bounding box prediction.

##### <ins>Supporting statement</ins>:

“The Region Proposal Network (RPN) shares full-image convolutional features with the detection network.” (Faster R-CNN)

2. Instance Segmentation (Mask R-CNN):
   Add a mask branch along with detection.

##### <ins>Supporting statement</ins>:

“Mask R-CNN adds a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.” (Mask R-CNN)

3. Semantic Segmentation (FCN):
   Convert classifiers into pixel-wise predictors.

##### <ins>Supporting statement</ins>:

FCN adapts “classification networks… into fully convolutional networks” for dense prediction. (FCN)

4. Video Understanding (SlowFast / 3D CNN):
   Use ResNet per frame + temporal modeling for action/violence detection.

##### <ins>Supporting statement</ins>:

SlowFast uses two pathways to capture spatial semantics and motion. (SlowFast)

5. Transformer-based Detection (DETR):
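   DETR places a Transformer encoder-decoder on top of a CNN backbone (ResNet in the paper), showing that attention-based models can be applied to object detection.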

##### <ins>Supporting statement</ins>:

DETR formulates detection as “direct set prediction” using transformers. (DETR)


---