In [3]:
#test import to make sure that the environment works
import torch
import numpy as np

layer = torch.tensor([1, 2, 3], dtype=float, requires_grad=True)

print(layer)

tensor([1., 2., 3.], dtype=torch.float64, requires_grad=True)


TEST

Links:
https://saturncloud.io/blog/how-to-use-latex-in-jupyter-notebook/

https://en.wikibooks.org/wiki/LaTeX

Outline Very Basic:

#The Main Idea behind Machine Learning

It is well known that machines are potent in processing defined algorithms with its combination of speed, memory, and accuracy. Once a human defined an algorithm, or a series of steps, for the computer to follow, it can do so faster and better than any other human.

However, machines themselves are unable to tackle the more abstract problems such as differentiating a photo of a dog from a cat. 

To humans, this task may be trivial. However, humans themselves are unable to clearly explain their thought process for separating dogs and cats in a concise way. They may suggest tips such as looking at its ears or tail, but this is another ambiguous question in itself, especially to a computer which percieves images not by its greater pattern, but by each individual pixel and its color values. For humans, their brains act like a black box: being able to intuitively process the information accurately but unsure of the exact algorithms underneath. As humans are unable to create a concise algorithm for such intuitive tasks, they cannot write code for a machine to follow in order to accomplish the same task.

So, how do humans do it? Does this mean that humans were born with an innate ability to differentiate between dogs and cats? There are no strong evidence supporting this argument, so the leading theory is that humans develop their classification abilities later on, probably by observing an uncountable amount of dogs and cats throughout their lives. This implies that the classification process can be learnt, most likely by identifying groups of hidden patterns that gives deeper insight than just the raw data itself.

**The broadest idea of machine learning is that there are intrinsic patterns in data. By matching and gathering a large amount input and output pairs, it may be possible to find the function or formula which converts an input into the desired corresponding output.**

The rest of this paper would discuss the more practical concepts in implementing simpler neural network models.

#Embedding Vectors and Representing Information (WIP):

Before making a neural network, there needs to be a quantitative way of representing the information mathamatically. This is most commonly done through vectors, matrices, and tensors. These are essencially an array or list of a certain dimension. The process of converting information from one form to a vector space is known as embedding. The general idea is to map objects in the vector space based on their properties, so that more similiar items have a smaller difference. 

Usually, each dimension or direction in the vector space would represent a certain trait or attribute. For instance, in a good embedding of English words, the difference between vectors representing man and woman should be very similar to the difference of vectors of king and queen, boy and girl, father and mother, and so on. However in practice, larger neural networks may organize their data in another unknown method in their training.

#Neurons and Linear Layers:

The idea behind a neuron is that it is the smallest possible component in a larger neural network, just like a human's neuron to their brain. While biology and chemistry powers a human neuron, a machine's neuron is defined by math.

In mathamatics, multiplication is the most common way to alter a value's size by its proportion. For instance, multiplying X by 0.5 yields X/2, something half as large in magnitude. Meanwhile, multiplying X by 2 yields 2X, something twice as large in magnitude. This is a useful way to amplify or diminish a value's magnitude without changing its inherent composition (attributes such as its prime factors, which may carry inherent information). For instance, if you multiply 15 by 2 to get 30, it still contain the prime factors 3 and 5 afterwards. Another simple way to manipulate values is addition. This operation can shift a value along the number line, or alternatively a vector along a certain axis. Although addition also affects the size of a value, it will disturb said value's composition. For instance, unlike previously with multiplication, if you add 15 by 2, the result 17 no longer contains the prime factor 3 and 5. Overall, it is best to think of multiplication as adjusting a value's size, while addition acts as an offset.

These mathmatical ideas also applies to the field of machine learning, which comes in the form of a neuron. Instead of just being a numerical value, the input to a neuron is assumed to represent embedded information in some way unknown to us. The neuron is then able to amplify and offset the input into the final output, which is akin to adjust its significance or value. Practically, the neuron accomplish this by being a function, with two inherent adjustable properties known as the weight and the bias. The neuron will take the input signal, multiply it with its weight attribute, add the product with its bias attribute, and return the final sum as its modified output signal. For example, a neuron with a large weight would amplify the input signal into a larger output signal, and vice versa. The term for these adjustable weight and bias values is parameters.

Here is a simple formula for a single neuron that incorporated the concepts from above:

$$
y = wx + b
$$

*(where w is the weight, x is the input, and b is the bias value)*

Neurons are then organized into layers, or groups of neurons in parallel. By assigning different weights to each neuron in the layers, the input signals will get amplified or diminished in its corresponding areas. Neuron layers can then be stacked sequentially, using the output of the previous layer as the input, to further add complexity and power, resulting in the final neural network.

Practically, all weights, inputs, and biases are represented as matrices or tensors, both of which are a common way to group large amount of numbers. This allows for the ease of processing large amount of calculations which neural networks need. 

Besides the core components of weights and biases, a non-linear function is also needed to help neurons with its expressiveness. Taking a look at the current model, it is a linear function. However, not all input-output pairs can be represented by a linear model, a famous example of which is the xor logic gate. 

As such, the final output of a neural layer is often passed through a non-linear function before it is actually sent to the next neural layer in the model. Common examples for non-linear functions in neural networks include sigmoid and tanh.

Here is a human-friendly example of a neural network:

In [4]:
#Inputs
X = np.array([2, 3, 5])

#Neural Layer Properties (Given)
W = np.array([1, 2, 4])
B = 0

#Non-linear function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class linear_neuron_layer:
    def __init__(self, w, b):
        self.w = w
        self.b = b
    
    def forward(self, x):
        if len(x) == len(self.w):
            return sigmoid(sum(x*self.w) + self.b)
        else:
            return "Weight/Input Mismatch"

neuron = linear_neuron_layer(W, B)

neuron.forward(X)
#y = sigmoid(28)
#y = 0.9999999999993086

0.9999999999993086

After summarizing everything, here are the general formula of a single neural network layer.

General formula:

$$
y = f(x_1w_1 + x_2w_2 + ... + x_nw_n + b) = f((\sum_{i=1}^{n} x_iw_i) + b)
$$

Formula in matrix form:

$$
y = f(X*W + b)
$$

where X is the input matrix, W is the matrix containing the weights, b is the bias term, and f is the non-linear function.


#Forward Pass and Backpropagation:

The process described previously is the definition of a forward pass, which means putting inputs into a neural network and obtaining an output from it.

Backpropagation, on the other hand, is the process of finding out how wrong a given model is, and then using that information to correct its weights and biases so that its accuracy is improved.

Before we can correct our network, we need a way to measure how wrong our current model is from our target. Backpropagation uses an output, input and a "true value" and passes it back through the network to update its weights and biases. The "true value" is what the neural network should has created as output with the given input. Backpropagation uses the difference between the true value and the actual output from the network as a reference to adjust the weights and bias of its neurons. This difference is also known as the error of the neural network, which is a important benchmark to gauge the network's accuracy.

There are multiple ways of calculating error, which is usually specific to what the network is designed to accomplish. This paper will use the L2 Norm function as an example, which is the following:

$$
e(t, y) = 0.5(t - y)^2
$$
*(where t is the true value and y is the actual output)*


As described before, a neural network is a complicated mathamatical function at its core. As such, it is possible to obtain the derivatives of said functions, which in turn can be used to find the extremas (local maximums and minimums) of the neural network. Practically, the goal is to minimize the result of the error function as much as possible, and if the derivative is taken on the error function in respect to a parameter, the result will indicate how to adjust said parameter.



Here is an example of doing backpropagation on a single linear neuron layer using the functions we have so far, which were all taken from above:

Neuron layer before activation:

$$
s = X*W + b
$$

Neuron layer with sigmoid activation function:
$$
y = f(s) = sigmoid(s)
$$

Error of neuron layer:
$$
E = e(t, y) = 0.5(t - y)^2
$$

If we wish to find the derivative of the error function in respect to weight w_n and bias b_n, we can use the derivatives of the above functions and chain rule to obtain the following:

$$
\frac{dE}{dw_n} = \frac{dE}{dy} * \frac{dy}{ds} * \frac{ds}{dw_n}
$$
$$
\frac{dE}{db_n} = \frac{dE}{dy} * \frac{dy}{ds} * \frac{ds}{db_n}
$$

Here, we find the derivative of each function:

$$
\frac{dE}{dy} = \frac{d}{dy} 0.5(t - y)^2 = -(t - y)
$$
$$
\frac{dy}{ds} = \frac{d}{ds} sigmoid(s) = sigmoid(s) * (1 - sigmoid(s))
$$

$$
\frac{ds}{dw_n} = \frac{d}{dw_n} X * W + B = \frac{d}{dw_n} x_1w_1 + x_2w_2 + ... + x_nw_n + ... + b_1 + b_2 + ... = \frac{d}{dw_n} x_nw_n = x_n
$$
$$
\frac{ds}{db_n} = \frac{d}{db_n} X * W + B = \frac{d}{db_n} x_1w_1 + x_2w_2 + ... + b_1 + b_2 + ... + b_n + ... = \frac{d}{db_n} db_n = x_n
$$

(Note, since we are finding the derivative in respect to w_n and b_n, all the other terms without w_n or b_n as a factor can be ignored, since the entire function is one giant summation)

Here is the final result by substituting the derivative back into the overall equation:

$$
\frac{dE}{dw_n} = -(t - y) * sigmoid(s) * (1 - sigmoid(s)) * x_n
$$
$$
\frac{dE}{db_n} = -(t - y) * sigmoid(s) * (1 - sigmoid(s))
$$



With these formulas, the direction and magnitude of how each individual neuron should be changed is known, and can be adjusted so the error function would return a lower value. The collection of derivates in respect to every parameter is known as the gradient.

There is one caveat regarding backpropagation. Since the derivatives were found assuming all other variables are constant, the final model after each parameters were tweaked may not reflect a perfectly downward trend in error, as all of the weights or biases would have been shifted slightly. This problem is reduced by multiplying the gradient by a value called the learning rate, which is a small constant used to reduce the changes on the model. In other words, learning rate reduces the magnitude of change to the model in order to allow it to adjust in more precise steps.

Here is the Python implementation of backpropagation onto the linear_neuron_layer class from before. 

In [5]:
#Inputs
X = np.array([2, 3, 5])

#Neural Layer Properties
W = np.array([1, 2, 4])
B = 0

#Functions
def sigmoid(x): #Sigmoid function
    return 1 / (1 + np.exp(-x))

def sigmoidDerivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

def error(y, t): #L2 Norm
    return 0.5 * np.power((t-y), 2)

def errorDerivative(y, t):
    return -(t-y)

class linear_neuron_layer:
    def __init__(self, w, b):
        self.w = w
        self.b = b
    
    def S(self, x):
        if len(x) == len(self.w):
            return sum(x*self.w) + self.b
        else:
            return "Weight/Input Mismatch"

    def forward(self, x):
        if len(x) == len(self.w):
            return sigmoid(self.S(x))
        else:
            return "Weight/Input Mismatch"
    
    def updateWeights(self, g, u): #g is gradient, u is learning rate
        if len(self.w) == len(g):
            self.w = self.w + g*self.w * u
            self.b = self.b + g*u
    
    def backpropagate(self, x, t, u):
        y = self.forward(x)
        s = self.S(x)
        err = errorDerivative(y, t)
        g = err * sigmoidDerivative(s)

        self.updateWeights(g, x, u)

AND project below

#Project Example: AND Gate



In [240]:
#Constants
u = 0.01

#Helpers
def sigmoid(x): #Sigmoid function
    return 1 / (1 + np.exp(-x))

def sigmoidDerivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

def error(y, t): #L2 Norm
    return 0.5 * np.power((t-y), 2)

def errorDerivative(y, t):
    return -(t-y)

class NeuronLayerSingle:
    def __init__(self, w, b, f):
        self.w = w #weight list
        self.b = b #bias value
        self.f = f #non-linear function
    
    #functions
            
    def updateWeights(self, g, x, u):
        self.w = self.w + g*x*u
        self.b = self.b + g*u

    def getOutputRaw(self, x):
        if len(x) == len(self.w):
            return sum(x*self.w) + self.b
        else:
            return "Weight/Input Mismatch"

    def getOutput(self, x):
        if len(x) == len(self.w):
            return self.f(self.getOutputRaw(x))
        else:
            return "Weight/Input Mismatch"
    
    def toString(self):
        return f"weights: {self.w} | bias: {self.b}"


class Model:
    def __init__(self, inputSize):

        mag = 1 / np.sqrt(inputSize) #shallow weight initialization

        #Neuron layer, init values [-1/sqrt(x), 1/sqrt(x)]

        self.layer1 = NeuronLayerSingle(w=np.array([(np.random.rand() * mag * 2 - mag) for i in range(inputSize)]),
                                       b=np.random.rand() * mag * 2 - mag, 
                                       f=sigmoid
                                       )

        self.output = None

    def run(self, x):
        self.output = self.layer1.getOutput(x)
        return self.output
    
    def getError(self, x, t):
        y = self.run(x)
        err = error(y, t)

        return err
    
    def trainOnce(self, x, t, u):
        y = self.run(x)
        err = errorDerivative(y, t)

        x1 = x
        s1 = self.layer1.getOutputRaw(x1)
        g1 = err * sigmoidDerivative(s1)

        self.layer1.updateWeights(g1, x1, u)

    def toString(self):
        print(self.layer1.toString())

        

#For generating true value
def actualAND(x):
    a = x[0]
    b = x[1]

    if a == 1 and a == b:
        return 1
    return 0

trainedAND = Model(2)

training_data = [
    np.array([0, 0]),
    np.array([1, 0]),
    np.array([0, 1]),
    np.array([1, 1])
]

trainedAND.toString()
print("-------------AND Before Training")

avg_error = 0
for input in training_data:
    cur_error = trainedAND.getError(input, actualAND(input))
    avg_error += cur_error
    print(cur_error)
avg_error /= 4
print(f"average error: {avg_error}")


for i in range(500): #arbitrary amount of training epochs
    for training_input in training_data:
        trainedAND.trainOnce(training_input, actualAND(training_input), u)

#Note: u values were cherry picked for better results

trainedAND.toString()
print("-------------AND After Training")
avg_error = 0
for input in training_data:
    cur_error = trainedAND.getError(input, actualAND(input))
    avg_error += cur_error
    print(cur_error)
avg_error /= 4
print(f"average error: {avg_error}")

print("-------------AND Answer")

print(actualAND(np.array([0,0])))
print(actualAND(np.array([0,1])))
print(actualAND(np.array([1,0])))
print(actualAND(np.array([1,1])))

print("-------------AND Output")

print(trainedAND.run(np.array([0,0])))
print(trainedAND.run(np.array([0,1])))
print(trainedAND.run(np.array([1,0])))
print(trainedAND.run(np.array([1,1])))


weights: [-0.38720493  0.37201764] | bias: 0.5959288497049258
-------------AND Before Training
0.20783473719069062
0.1523477827507228
0.2626022918709611
0.06435507895044915
average error: 0.17178497269070592
weights: [0.10292532 0.71379843] | bias: 2.1524315246499093
-------------AND After Training
0.4013146199783163
0.409613508764064
0.4476014172304342
0.001192621449730052
average error: 0.3149305418556361
-------------AND Answer
0
0
0
1
-------------AND Output
0.8958957751639599
0.9461515916917692
0.9051116050124028
0.9511610514091458


#BELOW ARE UNFINISHED

#Common Pitfalls and Solutions

After discussing the theoretical of training a neural network, this section will discuss the more practical issues with Neural Network trainings, especially common problems that may arise during backpropagation.

One of the most important aspect in training is ensuring the quality and the quantity of training data, since it is what the model would base its behavior and patterns on. It is recommended to check for unwanted noise, clarity, erronous "true" values, and the normalization for each input and answer pair. Overall, it is good practice to prevent the "garbage in, garbage out" situation (where bad inputs naturally leads to bad results).

Underfitting is an issue which occurs when a neural network is too simple or small to effectively do its task, or that the neural network did not recieve adequate training.
Symptoms of this issue include seemingly random outputs or high skewness. The best solutions are to either increase the complexity of the model by adding more layers, or to run additional epochs on the model.

On the otherhand, overfitting is the problem where a model is trained with the same constant set of training data for too much, leading to inflexibility against new, unseen data. In a more human metaphor, overfitting is akin to reciting answers to every question, rather than learning to solve them. Although this might make them excel at the original training data, they are practically useless, as their ultimate final goal was to help identify new values, not to classify known values. One great solution is known as dropoff layer. Unlike other neural network layers, dropoff layers is a simple utility layer which randomly removes parts of its input, before passing the rest onwards to the rest of the model. This helps prevent the model from overfixating or overelying on a single data point, and ensure that it is robust enouogh to withstand more interference, and thereby remain flexible. Alternatively, simply running less epochs of training may help the model stay flexible, and less fixated on the given training data.



Neural Network training:
Problems: overfitting, underfitting
Techniques: Learning rate adjustment (optimizers), momentum(adagrad), dropout layers


#Convolutions and Image Processing

Convolution is a common way to aggregate data. 

In the context of image processing, convolution uses a kernel (a tensor of numbers) and multiply each internal value with a respective value taken from a section of the input tensor. These products are then added together to return a constant as the final result of the convolution operation. 

(Add diagram here)

The reasoning behind using convolutions in image processing is that it can summarize sections of pixels at once. Usually, each individual pixel by itself holds little significant, but by considering multiple of them together, more information can be extracted. In other words, the whole is greater than the sum of its parts. 

Even if a linear model may theoretically work with infinite time and training, convolutional layers help bring things back into the practical realm by simplfying large tensors into smaller, informationally-dense counterparts. This effectively embeds the picture into a high dimentional vector plane, before transforming it into the desired output form through more neural layers.



Project result:
MNIST Reader
ResNet18 with CIFAR-10
ViT with CIFAR-10


Generative Programs:
GAN
generator vs discriminator

DDRM diffusion
image to noise and vise versa