<a href="https://colab.research.google.com/github/aaronmichaelfrost/pytorch-cuda-learning/blob/main/Iris.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Aaron Frost - Jan 2025
# To learn about deep neural networks, I'm creating a neural network that classifies a flower given its instance properties/features.
# It is fully-connected feed forward, so all neurons in each layer influence all neurons in adjacent layers.

# The dataset: https://archive.ics.uci.edu/dataset/53/iris
# --> contains:
#       - 3 flower classes
#       - 50 instances of each class
#       - 4 continuous features per instance
#
#  The idea is that we can separate the 4D feature vectors into 3 groups.
#     "One class is linearly separable from the other 2; the latter are not linearly separable from each other."
#     --> this means that all the points in 4D space for *one* class are on one side of a line travelling through 4D space,
#         and the other two class instances are all on the other side of that line.

# Some breakdown of the fundamentals


# The Network - - - - - - - - -
    #   Neuron - container of a number called an "activation"
    #     - The first layer of a network should contain all the input features.
    #         - If your input features are pixels of an image, you'd have Width*Height Neurons in the first layer, each containing a value like alpha or RGB.
    #     - The last (output layer) has one neuron for each class you're trying to split into.
    #         - The activations in the output layer neurons can be seen as the model's prediction for the class (the likelihood of being that class given the input features)

    # It's resonable to expect a layered structure to behave intelligently because human recognition breaks down things into subcomponents.
    #   -->  Each subcomponent could be mapped to a neuron in a layer, ideally, with the activation representing the liklihood of that component being present.
    #   --> The expectation or hope is to have each layer activate on a particular set of qualities from the original input vector.
    #       The leftmost layers, we hope, are more granular / microscopic, and those neurons that activate should be fed forward to activate neurons representing a more macroscopic quality presense.
    #     ***Microscopic qualities to fed forward (left to right) through layers to Macroscopic activations
    #     ***Concrete to abstract layers***

    # The hope might be for a single neuron in a particular layer to pick up on the presence of a quality.
    #   In order to step from one layer of the network to another, there has to be some sort of transformation.
    #   This is where "weights" come into play.
    #       On each "line" connecting the neruons of one layer to the neuron of the next (remember it's fully connected), you can imagine that there is a set of numbers that if you apply them as multiplicative weights, the sum will
    #       fit a certain criteria required for activation (be positive, for example) only when a given feature is present. The "weights", are like a mask, that determines whether a single quality is present or not given the qualities from the previous layer.
    #
    # Example I made up. "Should I get pizza classifier":
    #     Input features:
    #         Hunger: 1
    #         Pizza-love: 3
    #         Burrito-love: 4

    #      Hidden layer 1 neuron:
    #         - maybe this neuron should fire if we like burritos more than pizza (arbitrary quality)
    #               neuron-activation = Hunger(1)*(weight: 0) + Pizza-love(weight:-1) + Burrito-love(weight: 1)
    #                 = 0 + -3 + 4
    #                 = 1 (it's positive so the neuron activated!)
    #             (notice I made the weight of pizza love -1, and burrito-love 1, this is so the neuron would fire if we like burritos more).
    #             These weights made this neuron fire (I guess we can call it the "burrito is better than pizza" neuron). This neuron firing should, ideally, feeding it forward, lead to activation of the output neuron that classifies the answer as no, let's get a burrito instead!
    #
    #  This 3B1B video does a good job showing how weights are like a mask that determines presence of a feature: https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=1&ab_channel=3Blue1Brown
    #
    #  When you compute the weighted sum, the output could range.. so you want to SQUISH the range into 0 to 1.
    #     There is a method called the sigmoid func (aka "logistic curve") which puts very negative inputs close to 0 and positive inputs close to 1.
    #     The sigmoid curve measures "how positive or negative" the weighted sum is, AKA whether or not to "light up" the destination neuron
    #       input 0 -> output .5
    #       input -3 -> output close to 0
    #       input 1 -> output close to 1
    #     you can apply some bias (maybe a neuron should only activate if it is super positive? sure.), by adding (or subtracting) a number from the weighted sum before passing into sigmoid
    #     ex. activation = sigmoid(weighted_sum(prev_layer) + bias)
    #       By convention, every neuron has this bias value, added to the sum before squishing with the sigmoid.

    #     Weights and biases combined are called "parameters".
    #     "Learning" is finding the valid setting for all these parameters, to solve the problem at hand.


    #   Okay, with all that out of the way, let's go over how we're going to translate this to vectors we can work with in Python.
    #     For each pair of layers:
    #     1. The neuron activations from the left-layer is a single vertical array
    #     2. The weight values for all connections to the next layer are a 2D array
    #         - each row represents the weights to a neuron in the next layer
    #
    #     This means if you multiply a row from this 2d array of weights against the column vector (the entire left-layer activations), the number you recieve represents the weighted sum for the right-layer neuron associated with that row.
    #     Matrix multiplication requires multiplying matching members and summing them together. If we multiply the weights 2D matrix against the column vector, we actually get the activations for the next row.
    #       -->  since the height of the column vector is equal to the width of the matrix, you can multiply the 2d-Array by the column vector. To multiply two vectors component by component, and sum them all, you are computing the "dot product", is what this is called.
    #         The dot product is expressed this way algebraically.
    #     The biases is just another column vector that gets added to the previous matrix-vector product.
    #       Finally, apply the sigmoid to each compoent of the resulting column vector.
    #         OR you can apply a reLU function, which returns y=x when input is positive, and y=0 when input is negative.

# How the Network Learns - - - - - - - - -
# Define a cost function using training data.
#   Training data point:
#     The desired output-layer neuron activations values (0-1), given a particular input activation vector.
#   The COST of a single training example is outputActivations.sum(o => Math.Square(o.DesiredActivation - o.ActualActivation))
#       cost is small when it is correct. cost is large when incorrect.
#   The AVERAGE cost over ALL training examples is the COST FUNCTION.. a measure of how good the network is doing..
# The COST is a function of all the weights and biases, and returns the average over all training examples

#  you can figure out what the best weights and biases are by travelling down the slope of the cost function. You should descend down the space at a step-rate relative to the slope, so you don't overstep the minimum cost.
#     the "gradient" of a function give syou the direction of steepest incline, and * -1 = steepest descent.
#  So just compute the gradient direction, step the weights in biases in that direction (downhill), repeast that over and over while the cost return value decreases.
#   the heart of the algorithm for computing the gradients is called back-propogation.
# WIP.


import torch
import torch.nn as nn             # neural network lib
import torch.nn.functional as F   # to move data forward

# Model class that inherits nerural network module (nn.module)
class Model(nn.Module):
  # neural network has input layer, hidden layers, and output layer.
  # the input layer in this dataset requires 4 features of the iris flower (vector components).
  # we will feed these features forward through the hidden layers.
  # --> all the way to the output layer --> outputs classification.

  # constructor dunder (double-underscore) -> this is just a built-in python method to create the class instance.
  def __init__(
      self,
      count_in_features: int = 4,                  # four flower features
      count_hidden_neurons_layer1 : int = 8,    # arbitrary
      count_hidden_neurons_layer2 : int = 9,    # arbitrary
      count_out_features: int = 3)              # 3 classes
    -> None:

      # we're going to create a fully-connected (fc) neural network, so every node is connected to every node in adjacent layers
      # set up connected3b1 layers:
      self.fc1 = nn.Linear(count_in_features, count_hidden_neurons_layer1)
      self.fc2 = nn.Linear(count_hidden_neurons_layer1, count_hidden_neurons_layer2)
      # output layer
      self.out = nn.Linear(count_hidden_neurons_layer2, count_out_features)

SyntaxError: expected ':' (<ipython-input-1-f3227b6d26c2>, line 61)