# Resources & Acknowledgements

While planning this book, I often found myself wondering what my Linear Algebra and ODE professor, Lin Bon-Soon, would have said in his lectures. My goal was to emulate how other works achieve a tone of profound rigor in their acknowledgements—often through opening with a saying from a great physicist (or mathematician)—so this work could supposedly stand alongside other well-crafted ML/AI resources. Professor Lin and I both know that Evan rarely attended class, which makes this entire endeavor somewhat ironic.

Fret not, I made the effort to look through my old emails and found this gem from him:
>"If you do decide to pick something that matters to you, then focus on mastery, instead of achievement. Mastery is current, future-proof, and demonstrable; achievement is past-tense and soon irrelevant. (I refer to personal achievement—not, for instance, an achievement that saved the world.) I'm not saying achievement isn't great; mastery is simply better."

Somehow, it resonates greatly with what I am trying to achieve. Instead of just building a simple ML/AI project by using a pretrained model from YouTube and then calling it a day, I want to actually master the basics of ML/AI through a Herculean task: building a super-duper complex ML model from scratch that can tell whether the image you took is of a cat, a dog, or neither (a problem worth tackling!).

Anyway, innocent reader, if you have made it this far, I will reveal what is used in this project. This project is built as a learning exercise, and the following explanations, analysis, and theoretical foundations are based on these incredible resources:
- **Pattern Recognition and Machine Learning (2006)** by Christopher M. Bishop. This textbook was my primary guide for understanding the core mathematical concepts of machine learning.
- **Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow** by Aurélien Géron.
- **Deep Learning Research Papers:**
  - He, K., Zhang, X., Ren, S., & Sun, J. (2015). [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)  
    *The architectural blueprint for the ResNet model at the core of this project.*
  - LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). [Gradient-Based Learning Applied to Document Recognition](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)  
    *The foundational paper that established the modern convolutional neural network (CNN) architecture.*

- **AWS Documentation** for ECS, SageMaker, CodeDeploy, and CodePipeline.

They are, as my generation puts it, the GOAT of resources.

# 1.0 The Building Blocks
This notebook will attempt to build all of the crucial units that form the foundation of the CNN (Convolutional Neural Network) layer. Then, we will use the CNN layer to put together the ResNet (Residual Network) model, a very powerful and efficient model for computer vision tasks.

We will attempt to build such a model that, given a picture, is able to predict if it's a cat, a dog, or neither.

## 1.1 What is a model?
At its simplest essence, a model is just a function. It's a box with a bunch of knobs and dials on it. You feed it an input—a number, a list of numbers, a 2D grid of numbers (like a black-and-white photo), or even a 3D block of numbers (like a color photo with Red, Green, and Blue layers). You twist the knobs, and it gives you an output.

Okay, let's not stray from the topic. A model accepts an input and it outputs something. That something could be a true/false value, a scalar, a vector... you get the idea.

Now, we want to use our model to accept an image of a cat (which is that 3D matrix) and have it output a decision. Something like this:

$$(1, 0, 0)$$

This output is a vector where the number $1$ indicates "true" and $0$ indicates "false." Each item in the vector corresponds to a **label**: $[\text{Cat}, \text{Dog}, \text{Neither}]$.

So in this case, the vector $(1, 0, 0)$ translates to: Cat: True, Dog: False, Neither: False. Our model is confident that the image's **label** is a cat.

Oh yeah, I'm going to **bold** terminology **you** may **not** understand. In a classification problem, you want to predict the name, or to use the more common term, the **label**, of an image.
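
To make the label vector concrete, here is a minimal sketch in plain Python. The names `LABELS`, `one_hot`, and `decode` are hypothetical helpers made up for illustration; they are not part of this project's code.

```python
# Order of the labels, matching the vector positions described above
LABELS = ["Cat", "Dog", "Neither"]

def one_hot(label):
    """Turn a label name into a vector with a 1 at that label's position."""
    return [1 if name == label else 0 for name in LABELS]

def decode(output):
    """Return the label whose position holds the largest value."""
    best_index = max(range(len(output)), key=lambda i: output[i])
    return LABELS[best_index]

prediction = decode([1, 0, 0])
print(prediction)        # Cat
print(one_hot("Dog"))    # [0, 1, 0]
```

Picking the largest entry also works when the output is not exactly 0s and 1s, which is an assumption about how a real model's output would look.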

Now a model by itself is pretty dumb; think Evan facing his first Linear Algebra exam with no sleep and no studying. It must be able to "learn" somehow. Now, you may ask:

"That's pretty stupid, how could a model **learn**?"

Models have **weights**. Weights are tunable parameters within the model that influence the outcome. Think of them as the knobs on a complicated soundboard you adjust through trial and error to get the perfect sound.

These weights are adjusted through trial and error by a certain process, and we call that process "**learning**".

We will dive into the Stochastic Gradient Descent algorithm, which forces a model to learn (they must! they have no choice). The analogy is comparable to Evan and his Linear Algebra exams: Evan got a bad score on his first exam. So, by taking the partial derivative of the loss function with respect to the weight (to be explained), multiplying it by the learning rate, and then subtracting the result from the weight, Evan should perform better on the next exam. Surprisingly, Evan did on the second exam.
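
As a preview, and only a sketch of a single step rather than the full algorithm, the update for one weight $w$ with loss $l$ is usually written as follows. The symbol $\eta$ for the learning rate is an assumed notation that does not appear elsewhere in this notebook.

$$w \leftarrow w - \eta \frac{\partial l}{\partial w}$$

The minus sign is what makes the loss go down: the gradient points in the direction where the loss increases, so we step the other way.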

Now we have an idea of what a model does and how it can learn. We will dive further by building one of the oldest and most elementary models, Linear Regression. Then we will move on to Softmax Regression, a model suited to classification. Next, we will stack models together into a Multilayer Perceptron, and finally learn the tricks that help our model learn and converge efficiently and quickly. Also take note of how many times the words "efficiently" and "simplest" are used in our notebook.

## 1.2 Linear Regression
This is one of the most basic ML models. In the context of Machine Learning, Linear Regression is a model that predicts a continuous value based on the **features** of our data, under the premise that the relationship between the features and that value is linear.

A **feature** is just one characteristic of a datapoint. For example, in predicting a house's price, features like location, land size, and age are used to train the model.

So, what does Linear Regression look like mathematically?

We define $\hat{y}$ to be the **prediction** (the output of the model), $(x_1, x_2, ..., x_n)$ to be the **input**, with each $x_i$ being one of our features, and $(w_1, w_2, ..., w_n)$ to be our **weights**.

Thus, the Linear Regression formula that ingests a single datapoint is:

$$\hat{y} = x_1 w_1 + x_2 w_2 + ... + x_n w_n$$
$$\hat{y} = (x_1, x_2, ..., x_n)^T(w_1, w_2, ..., w_n)$$

Let $\mathbf{x}$ be the feature vector of a single datapoint and $\mathbf{w}$ be our weight vector. We can write this compactly in the manner of Linear Algebra (thanks Lin! Don't forget to thank Lin, dear readers):
$$\hat{y} = \mathbf{x}^T \mathbf{w}$$

Wait! There's a problem. If all of our inputs $(x_1, x_2, ..., x_n)$ are really small or near zero, then wouldn't our **prediction** also be near 0 in some cases? To avoid that, Bishop showed that by adding a **bias** $b$, we can allow for a "fixed offset" (Bishop, 2006, 138), yielding the complete formula:

$$\hat{y} = \mathbf{x}^T \mathbf{w} + b$$
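
To see the formula at work, here is a small worked example in plain Python (no PyTorch, so the arithmetic stays visible). The feature values, weights, and bias are made-up numbers, and `predict` is a hypothetical helper, not the module built later.

```python
def predict(x, w, b):
    """Compute y_hat = x^T w + b for a single datapoint."""
    return sum(x_i * w_i for x_i, w_i in zip(x, w)) + b

x = [2.0, 3.0, 10.0]    # hypothetical features of one datapoint
w = [50.0, 20.0, -1.0]  # hypothetical weights
b = 5.0                 # hypothetical bias

y_hat = predict(x, w, b)
print(y_hat)  # 2*50 + 3*20 + 10*(-1) + 5 = 155.0

# With all-zero inputs the weights contribute nothing; only the bias remains.
print(predict([0.0, 0.0, 0.0], w, b))  # 5.0
```

The second call shows the point of the bias: without $b$, an all-zero input would always give a prediction of exactly 0, whatever the weights are.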

Let us now convert this into a PyTorch module for future use.

In [14]:
import torch
from torch import nn

We will attempt to replicate `nn.Linear` from scratch. First, we define the weight, the bias, and the `forward` method, which is called to compute our prediction $\hat{y}$.

In [15]:
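class LinearRegression(nn.Module):
    def __init__(self, in_features, out_features=1, bias=True, device=None):
        super().__init__()
        # Weight of shape (in_features, out_features). requires_grad=True lets
        # autograd track it so we can update it manually later.
        self.w = torch.zeros(in_features, out_features, requires_grad=True, device=device)
        # Bias (the fixed offset b), created only when bias=True
        self.b = torch.zeros(out_features, requires_grad=True, device=device) if bias else None
        self.device = device

    def forward(self, X):
        # X @ w + b, where X has shape (batch_size, in_features)
        y_hat = torch.matmul(X, self.w)
        return y_hat if self.b is None else y_hat + self.b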
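
For comparison, here is a small sketch of the built-in `nn.Linear` on a batch of two datapoints. The weight and bias values are made up for illustration. Note that `nn.Linear` stores its weight with shape `(out_features, in_features)`, the transpose of the layout used in our class.

```python
import torch
from torch import nn

layer = nn.Linear(in_features=3, out_features=1)

# Overwrite the randomly initialised parameters with known, made-up values
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[0.5, -1.0, 2.0]]))  # shape (1, 3)
    layer.bias.copy_(torch.tensor([0.1]))                 # shape (1,)

# A batch of 2 datapoints with 3 features each
X = torch.tensor([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 0.0]])

y_hat = layer(X)                              # shape (2, 1)
manual = X @ layer.weight.T + layer.bias      # the same x^T w + b by hand

print(y_hat.shape)
print(torch.allclose(y_hat, manual))
```

The first row comes out at roughly 4.6 (0.5 - 2.0 + 6.0 + 0.1), and the all-zero second row returns just the bias.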

With this model in the palm of your hand, dear reader, we are ready to do more. It's time to make this static cluster of symbols somehow **learn**, but before that we need to understand how it learns.

### 1.2.1 Loss
The model must learn to find the **weights** that predict the **label** with the least error, or in other words, with **minimal loss**. **Loss** simply tells you how far your prediction is from the **target** label, which is the ground truth, or the actual value in reality.

But how is loss relevant to the process of learning?

The process of learning is iterative. For each iteration, our model, through SGD (explained in detail in a later section), will pick a batch of inputs from the dataset, $X_k = (\mathbf{x}_i, \mathbf{x}_{i+1}, ..., \mathbf{x}_{i+n-1})$, where $n$ is the size of the batch, and output predictions $\hat{Y}_k = (\hat{\mathbf{y}}_i, \hat{\mathbf{y}}_{i+1}, ..., \hat{\mathbf{y}}_{i+n-1})$. We then calculate the loss by comparing the predictions to the **target** labels $Y_k = (\mathbf{y}_i, \mathbf{y}_{i+1}, ..., \mathbf{y}_{i+n-1})$, which gives us the model's current performance on that batch. We seek to reduce the loss at every step so that we know the model is becoming more and more accurate.
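
The batching step can be sketched in plain Python as below. The helper `minibatches` is hypothetical, the toy data is made up, and shuffling the indices once per pass is an assumption about how batches get picked; the SGD section may do it differently.

```python
import random

def minibatches(features, targets, batch_size, seed=0):
    """Yield (X_k, Y_k) pairs holding at most batch_size datapoints each."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # assumption: shuffle once per pass
    for start in range(0, len(indices), batch_size):
        chosen = indices[start:start + batch_size]
        X_k = [features[i] for i in chosen]
        Y_k = [targets[i] for i in chosen]
        yield X_k, Y_k

# Toy dataset: one feature per datapoint, target is twice the feature
features = [[1.0], [2.0], [3.0], [4.0], [5.0]]
targets = [2.0, 4.0, 6.0, 8.0, 10.0]

batches = list(minibatches(features, targets, batch_size=2))
for X_k, Y_k in batches:
    print(X_k, Y_k)
```

With five datapoints and a batch size of 2, this yields three batches, and the last one holds the single leftover datapoint.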

One of the most commonly used loss functions for regression is the **Mean Squared Error** (MSE), which we will use for Linear Regression. For $n$ predictions, it is defined by
$$l(y,\hat{y}) = \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
So for each iteration, **MSE** will serve as a useful benchmark. Let's create an MSE function and then attach it to the `LinearRegression` module above.

In [17]:
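def MSE(self, y, y_hat):
    # Mean Squared Error: average of the squared differences over all elements
    n = y.numel()
    # Reshape y to match y_hat so (n,) vs (n, 1) does not broadcast to (n, n)
    return torch.sum((y.reshape(y_hat.shape) - y_hat) ** 2) / n

# Attach the loss to the module so it can be called as model.loss(y, y_hat)
LinearRegression.loss = MSE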
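
To check the formula by hand, here is the same arithmetic in plain Python on three made-up target and prediction pairs. The lowercase `mse` is a hypothetical stand-alone helper, separate from the `MSE` method attached to the module.

```python
def mse(y, y_hat):
    """Mean of the squared differences between targets and predictions."""
    n = len(y)
    return sum((t - p) ** 2 for t, p in zip(y, y_hat)) / n

y = [3.0, 5.0, 7.0]      # made-up targets
y_hat = [2.0, 5.0, 9.0]  # made-up predictions

loss = mse(y, y_hat)
print(loss)       # (1 + 0 + 4) / 3, roughly 1.667
print(mse(y, y))  # 0.0 when every prediction matches its target
```

Squaring makes every error positive, so the miss of 2 on the last datapoint costs four times as much as the miss of 1 on the first.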

### 1.2.2 Stochastic Gradient Descent

### 1.2.3 Putting it together

## References
1. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. [https://www.springer.com/gp/book/9780387310732](https://www.springer.com/gp/book/9780387310732)