# Lecture 1 - Building Neural Network for Intelligence

## Introduction

### Section 1: Natural Intelligence = Brain

#### Electrical Brain
1. Group of Interconnected Wires (Neuron Connections) with different amounts of Fat Insulation (Myelination)
2. Carrying electrical signals (Data)

#### Network Brain
The brain is a
1. Large 
2. Interconnected
3. Network of
4. Neurons
   1. A group of Neurons at the same level is called a Layer
5. See the [Brain Neural Network](http://nxxcxx.github.io/Neural-Network/) visualization

#### Single Neuron - 3 Things
1. Input Data through Wire
2. Wire with varying Insulation Strength via Myelination
3. Output Connection
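
The three parts listed above can be mirrored in a few lines of plain Python. This is only a sketch of the artificial analogue, not a model of a biological neuron; the function name `single_neuron` and all numbers are made up for illustration.

```python
# Hypothetical helper and values, chosen only to illustrate the three parts of a neuron.
def single_neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs; each weight plays the role of a wire's insulation strength."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

inputs  = [1.0, 2.0, 3.0]     # 1. input data arriving through three wires
weights = [0.5, -1.0, 0.25]   # 2. strength of each wire
output  = single_neuron(inputs, weights)   # 3. value sent on through the output connection

print(output)   # -0.75
```

A strong wire (large weight) lets its input dominate the output, while a weight near zero effectively disconnects that input.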

#### Single Neuron Diagram
![Biological Neuron Model](https://www.researchgate.net/publication/341241129/figure/fig1/AS:888908187443205@1588943635819/Biological-Neuron-Model.ppm)
![Single Neuron](https://media.geeksforgeeks.org/wp-content/uploads/20230410104038/Artificial-Neural-Networks.webp)
![Layers of Neurons](https://qph.cf2.quoracdn.net/main-qimg-084ade3ed1f8a97709e374090a92e1ca.webp)

#### Brain as Layered Neural Network
![Visual Processing in Brain](https://neuwritesd.files.wordpress.com/2015/10/visual_stream_small.png)



### Section 2: Artificial Intelligence

Brute Force Error Minimizer from Data

```python

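import torch

# Assumes `model`, `error_func`, `train_data_loader` and `learning_rate` are defined elsewhere.
for x_actual, y_actual in train_data_loader:
    y_predicted_LOGITS = model(x_actual)
    loss               = error_func(y_predicted_LOGITS, y_actual)

    parameters      = list(model.parameters())
    dError_dWeights = torch.autograd.grad(outputs= loss, inputs = parameters)
    with torch.no_grad():  # update the weights in place, outside the autograd graph
        for weight, gradient in zip(parameters, dError_dWeights):
            weight -= gradient * learning_rate
            print(weight.shape, gradient.shape)
            print(weight, gradient)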

```


### Neural Network in Pytorch
Neural Network has 4 Steps
1. Data
2. Model Architecture
3. Model Training
4. Model Evaluation

Model Training has 5 Steps
1. Predict from Existing Weight Values. (Network)
2. Calculate Error of Prediction wrt y_actual
3. Clear dError_dWeights
4. Calculate dError_dWeights
5. Update the weights: $w = w - lr \cdot \nabla_{w} Loss$

$\large trained\_model = \operatorname*{argmin}_{\mathbf{w}, b}\ Loss(y_{predicted}, y_{actual})$
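
The five training steps can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the data (`y = 3x + 1`), the one-neuron model, the MSE loss and the learning rate are not the MNIST setup used later in this lecture.

```python
import torch, torch.nn as nn

torch.manual_seed(0)

# Toy data (assumption): points on the line y = 3x + 1
x_actual = torch.linspace(-1, 1, 32).unsqueeze(1)   # shape (32, 1)
y_actual = 3 * x_actual + 1

model     = nn.Linear(in_features=1, out_features=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_func = nn.functional.mse_loss

losses = []
for step in range(100):
    y_predicted = model(x_actual)               # 1. predict from the existing weights
    loss = loss_func(y_predicted, y_actual)     # 2. error of the prediction wrt y_actual
    optimizer.zero_grad()                       # 3. clear dError_dWeights
    loss.backward()                             # 4. calculate dError_dWeights
    optimizer.step()                            # 5. w = w - lr * gradient
    losses.append(loss.item())

print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.6f}")
```

The loss shrinks from step to step because each update moves the weight and bias a small distance against the gradient of the error.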

In [None]:
!pip install datasets
!pip install wandb
!pip install torchmetrics
!pip install torchinfo
!pip install torchvision
!pip install ipyplot

In [5]:
import torch , torch.nn as nn

### Data

```python
# ! KEY CODE

import torch, torch.nn as nn
import datasets as huggingface_datasets
import ipyplot

# 1. Download Dataset from Huggingface
training_dataset = huggingface_datasets.load_dataset("mnist", split="train")
hg_dataset_split = training_dataset.train_test_split(train_size=0.9, test_size=0.1)
training_dataset, validation_dataset    = hg_dataset_split['train'], hg_dataset_split['test']

# 2. Plot multiple Images. (Image format not Tensor)
ipyplot.plot_images(training_dataset['image'][0:5]);

# 3. Dataset -> Pytorch Tensors
# with_format returns a new dataset whose columns are read as tensors
# (images come back as integer tensors, so cast with .float() before feeding a model)
training_dataset = training_dataset.with_format("torch")

# 4. Pytorch Dataset -> Data Loader
BATCH_SIZE = 4
TOTAL_BATCHES = len(training_dataset) // BATCH_SIZE
training_data_dataloader = torch.utils.data.DataLoader(training_dataset, batch_size= BATCH_SIZE)
# Each batch is a dict keyed by column name:
# batch = next(iter(training_data_dataloader)); X_BATCH, Y_BATCH = batch["image"], batch["label"]

```

In [7]:
# ! KEY CODE

from torchvision import datasets as torchvision_datasets
from torchvision import transforms as torchvision_transforms
import ipyplot

# 1. Download Dataset from those available in torchvision_datasets
training_dataset = torchvision_datasets.MNIST( root= '../dataset', transform= torchvision_transforms.ToTensor(), train= True, download= True )
training_dataset, validation_dataset = torch.utils.data.random_split(training_dataset, [0.9, 0.1])

# 2. Plot Image
x,y = training_dataset[0]
ipyplot.plot_images(x);

# 3. Dataset -> Pytorch Tensors
pass

# 4. Pytorch Dataset -> Data Loader
BATCH_SIZE = 4
TOTAL_BATCHES = len(training_dataset) / BATCH_SIZE

training_dataloader   = torch.utils.data.DataLoader( dataset= training_dataset  , batch_size= BATCH_SIZE, shuffle= True )
validation_dataloader = torch.utils.data.DataLoader( dataset= validation_dataset, batch_size= BATCH_SIZE, shuffle= True )
X_BATCH, Y_BATCH = next(iter(training_dataloader))
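
The dataset, split and data loader steps can be tried without downloading anything. The sketch below is an assumption-laden stand-in: it replaces MNIST with 100 random tensors of the same shape, so only the shapes and batch counts are meaningful.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

torch.manual_seed(0)

# Stand-in for MNIST (assumption): 100 random "images" of shape (1, 28, 28) with labels 0..9
fake_images = torch.rand(100, 1, 28, 28)
fake_labels = torch.randint(low=0, high=10, size=(100,))
dataset     = TensorDataset(fake_images, fake_labels)

# 90 / 10 split, then batches of 4 as in the cells above
train_part, validation_part = random_split(dataset, [90, 10])
loader = DataLoader(train_part, batch_size=4, shuffle=True)

x_batch, y_batch = next(iter(loader))
print(x_batch.shape, y_batch.shape)   # one batch of images and one batch of labels
print(len(loader))                    # number of batches per epoch (the last batch may be smaller)
```

`len(loader)` already counts the final partial batch, which makes it a convenient alternative to dividing the dataset length by the batch size.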

#### Model

In [79]:
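# Keyword arguments can be passed in any order: both lines build the same kind of layer (2 inputs -> 1 output)
layer = nn.Linear(in_features = 2, out_features = 1)
layer = nn.Linear(out_features = 1, in_features = 2)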

x = torch.randn(2)
x_batch = torch.randn(10,2)

# Forward Pass
print(layer.forward(x))
print(layer.forward(x_batch))

# Dot Product = Weighted Sum
print(f'X vector is= {x}')
print(f'Weights vector is = {layer.weight}')


tensor([-0.1905], grad_fn=<AddBackward0>)
tensor([[ 0.1053],
        [ 0.0142],
        [-0.1397],
        [-0.4612],
        [ 0.4644],
        [ 0.2745],
        [-0.5693],
        [ 0.3611],
        [ 0.3861],
        [ 0.1093]], grad_fn=<AddmmBackward0>)
X vector is= tensor([-0.5936, -0.3838])
Weights vector is = Parameter containing:
tensor([[ 0.2785, -0.1303]], requires_grad=True)
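
The "Dot Product = Weighted Sum" comment can be checked by hand. The sketch below rebuilds the output of a 2-input, 1-output `nn.Linear` from its own `weight` and `bias`; the seed and variable names are assumptions, so the numbers will differ from the output printed above.

```python
import torch, torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(in_features=2, out_features=1)
x     = torch.randn(2)

# Weighted sum written out term by term for the single neuron in this layer
w = layer.weight[0]          # shape (2,): one weight per input
b = layer.bias[0]
manual = x[0] * w[0] + x[1] * w[1] + b

# The same computation as a matrix product
matrix_form = x @ layer.weight.T + layer.bias

print(manual.item(), matrix_form.item(), layer(x).item())
```

All three printed values agree up to floating point rounding, which is the sense in which a linear layer is a weighted sum plus a bias.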


In [17]:
# Neural Network for Digits Dataset
import torch, torch.nn as nn

input_layer = nn.Identity()
layer_1     = nn.Linear(out_features= 20, in_features=28*28*1)
layer_2     = nn.Linear(out_features= 10, in_features=20)

# Finding Parameters of layer. Method `named_parameters`
for name, parameter in layer_1.named_parameters():
    print(name, parameter.shape)

# Using Parameter Name Directly. weight
print("Entire Layer Parameters", layer_1.weight.shape, layer_1.bias.shape)
print(f'3rd Neurons Parameters weight = {layer_1.weight[2].shape}, bias = {layer_1.bias[2]}')

# Important Properties of Parameters
print("Will gradients be calculated?", layer_1.weight.requires_grad)
print("Parameters =>", layer_1.weight)
print("Gradient dError_dParameters =>", layer_1.weight.grad)

weight torch.Size([20, 784])
bias torch.Size([20])
Entire Layer Parameters torch.Size([20, 784]) torch.Size([20])
3rd Neurons Parameters weight = torch.Size([784]), bias = -0.020121073350310326
Will gradients be calculated? True
Parameters => Parameter containing:
tensor([[ 0.0123, -0.0293,  0.0221,  ..., -0.0339,  0.0101,  0.0162],
        [-0.0208, -0.0229,  0.0126,  ...,  0.0159,  0.0226,  0.0337],
        [-0.0082, -0.0053, -0.0338,  ..., -0.0116, -0.0308, -0.0094],
        ...,
        [ 0.0107,  0.0118,  0.0335,  ..., -0.0225,  0.0007, -0.0254],
        [ 0.0007,  0.0068, -0.0327,  ...,  0.0096,  0.0176, -0.0030],
        [-0.0164,  0.0103, -0.0133,  ...,  0.0150,  0.0282, -0.0066]],
       requires_grad=True)
Gradient dError_dParameters => None


In [20]:
from torchinfo import summary
summary(layer_1, input_size=(1*28*28,), 
        verbose=2, col_names = ["input_size", "output_size","kernel_size", "num_params","trainable", "params_percent"]);

Layer (type:depth-idx)                   Input Shape               Output Shape              Kernel Shape              Param #                   Trainable                 Param %
Linear                                   [784]                     [20]                      --                        15,700                    True                      100.00%
├─weight                                                                                     [784, 20]                 ├─15,680
├─bias                                                                                       [20]                      └─20
Total params: 15,700
Trainable params: 15,700
Non-trainable params: 0
Total mult-adds (M): 0.31
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.06
Estimated Total Size (MB): 0.07
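
The `15,700` in the summary can be reproduced with simple arithmetic: every output neuron has one weight per input plus one bias. The helper name `linear_parameter_count` is hypothetical and only used for this check.

```python
def linear_parameter_count(in_features, out_features):
    """Parameters of a fully connected layer: one weight per (input, output) pair plus one bias per output."""
    weights = in_features * out_features
    biases  = out_features
    return weights + biases

layer_1_params = linear_parameter_count(in_features=28*28*1, out_features=20)
layer_2_params = linear_parameter_count(in_features=20, out_features=10)

print(layer_1_params)                    # 15700
print(layer_2_params)                    # 210
print(layer_1_params + layer_2_params)   # 15910
```

Almost all parameters sit in the first layer because it is connected to all 784 pixels.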


$y = X W^{T} + b$ (a matrix product, with $W$ of shape `[out_features, in_features]` as stored by `nn.Linear`)

```python
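import os
os.environ["KERAS_BACKEND"] = "torch"   # select the backend before importing keras_core

import keras_core as keras
import torch

keras.layers.Input(shape=(2,))                     # input with 2 features
keras.layers.Dense(units = 4, activation="relu")
keras.layers.Dense(units = 8)
keras.layers.Dense(units = 14)

# PyTorch needs both sizes spelled out; Keras infers the input size from the previous layer
torch.nn.Linear(in_features= 2, out_features= 4)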
```

In [8]:
# ! KEY CODE

from torch.nn import ReLU as ActivatePositive

# 1. Model Architecture - [20 neurons, 10 neurons]
# network = input -> Linear 20 -> Linear 10 = output
model = nn.Sequential(
    nn.Flatten(start_dim=1), # IMAGE RESHAPE from 2D(28,28) -> 1D(28*28)
    
    nn.Identity(),                                                              # LAYER 1: INPUT LAYER
    nn.Linear(out_features = 20, in_features = 28*28*1), ActivatePositive(),    # LAYER 2: 1st Hidden Layer
    nn.Linear(out_features = 10 , in_features = 20),                            # LAYER 3: Output Layer
)

# 2. Model Parameters & Their Relationship with Error
model_parameters = list(model.parameters())

# 3. Registering Model Parameters with Optimizer, as variables to be minimized
gradient_step    = 10
LEARNING_RATE    = gradient_step
OPTIMIZER        = torch.optim.SGD( params= model_parameters, lr= gradient_step )

# 4. Calculation of Error of Prediction
ERROR_FUNC = nn.functional.cross_entropy

# 5. Calculation of relationship of Error & Parameters
X_BATCH, Y_BATCH = next(iter(training_dataloader))
GRADIENTS_accumulated = torch.autograd.grad(outputs = ERROR_FUNC(model(X_BATCH), Y_BATCH), inputs = model_parameters)
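# torch.autograd.grad returns dloss/dw for each parameter as a tuple; it does not write to w.grad.
# The alternative, loss.backward(), computes dloss/dw for every parameter w which has requires_grad=True
# and accumulates it: w.grad += dloss/dw.
# By default, gradients are accumulated in buffers (i.e., not overwritten) whenever .backward() is called.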

# 6. Model Layers, Parameters Visualization
from torchinfo import summary
summary(model, input_size=(1,28,28), 
        verbose=2, col_names = ["input_size", "output_size","kernel_size", "num_params","trainable", "params_percent"]);

Layer (type:depth-idx)                   Input Shape               Output Shape              Kernel Shape              Param #                   Trainable                 Param %
Sequential                               [1, 28, 28]               [1, 10]                   --                        --                        True                           --
├─Flatten: 1-1                           [1, 28, 28]               [1, 784]                  --                        --                        --                             --
├─Identity: 1-2                          [1, 784]                  [1, 784]                  --                        --                        --                             --
├─Linear: 1-3                            [1, 784]                  [1, 20]                   --                        15,700                    True                       98.68%
│    └─weight                                                                                [784, 20]   



**Same code as Pytorch, easier to read & understand in keras**
```python

import os
os.environ["KERAS_BACKEND"] = "torch"   # must be set before keras_core is imported

import keras_core as keras
from keras_core import layers, models

k_model = models.Sequential([
    layers.Input(shape=(28,28,1)),                  # LAYER 1: Input Layer
    layers.Flatten(),
    layers.Dense(units = 20, activation="relu"),    # LAYER 2: 1st Hidden Layer with Activation Function
    layers.Dense(units = 10 )                       # LAYER 3: Output Layer
])
k_model.summary()

# The output layer produces logits and the labels are integer class ids
k_model.compile(loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True), optimizer = "adam")
```

#### Model Reducing Error by Looking at Data

```python
loss = error_func( y_predicted_logits, y_actual )

# Option A: autograd fills parameter.grad, then the optimizer updates every parameter
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Option B (instead of Option A, not in addition): the same SGD update written by hand
# de_dw = torch.autograd.grad(outputs= loss, inputs = list(model.parameters()) )
# with torch.no_grad():
#     for parameter, gradient in zip(model.parameters(), de_dw):
#         parameter -= step_length * gradient
```
$\Huge \frac{\partial E}{\partial W}$
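
The two ways of obtaining $\frac{\partial E}{\partial W}$ behave differently, which matters for the "clear gradients" step. The sketch below uses a made-up two-element weight and a made-up loss (`sum(w**2)`, whose gradient is `2w`) to show that `torch.autograd.grad` returns gradients without storing them, while `.backward()` adds into `w.grad` on every call.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

def loss_of(weights):
    return (weights ** 2).sum()        # d(loss)/d(weights) = 2 * weights

# 1. torch.autograd.grad returns the gradient and leaves w.grad untouched
(grad_returned,) = torch.autograd.grad(outputs=loss_of(w), inputs=[w])
grad_attribute_after_autograd_grad = w.grad          # still None

# 2. backward() stores the gradient in w.grad and accumulates across calls
loss_of(w).backward()
first_backward = w.grad.clone()        # 2 * w
loss_of(w).backward()
second_backward = w.grad.clone()       # 2 * w + 2 * w

# 3. Clearing, which is what optimizer.zero_grad() does for every registered parameter
w.grad = None

print(grad_returned, grad_attribute_after_autograd_grad)
print(first_backward, second_backward)
```

Without the clearing step, each batch would be updated with the sum of all gradients seen so far rather than its own gradient.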


In [15]:
x_actual, y_actual = next(iter(training_dataloader))


y_predicted_LOGITS = model.forward(input=x_actual)
loss               = ERROR_FUNC(y_predicted_LOGITS, y_actual)
model_parameters   = list(model.parameters())

OPTIMIZER.zero_grad()
# loss.backward() would compute dloss/dw for every parameter w which has requires_grad=True
# and accumulate it: w.grad += dloss/dw (gradients are not overwritten between calls).
# torch.autograd.grad instead returns the gradients as a tuple and leaves w.grad untouched.
dError_dWeights = torch.autograd.grad(outputs= loss, inputs = model_parameters)

# A SINGLE STEP UPDATES ALL PARAMETERS from w.grad.
# Here w.grad is still empty (loss.backward() was not called), so this call changes nothing.
OPTIMIZER.step() 

for (name, weight), gradient in zip(model.named_parameters(), dError_dWeights):
    print(name)
    # Manual SGD step, for inspection only: this builds a new tensor and does not modify the model
    updated_weight = weight - gradient * LEARNING_RATE

    print("weight & gradient shape = ", updated_weight.shape, gradient.shape)
    print("weight & gradient values= ", updated_weight, gradient)
    print("non zero gradients", torch.count_nonzero(gradient))

2.weight
weight & gradient shape =  torch.Size([20, 784]) torch.Size([20, 784])
weight & gradient values=  tensor([[ 0.0007,  0.0313, -0.0111,  ..., -0.0171, -0.0164, -0.0328],
        [ 0.0052,  0.0043, -0.0242,  ...,  0.0071, -0.0301,  0.0218],
        [ 0.0202,  0.0034,  0.0308,  ...,  0.0203, -0.0304,  0.0334],
        ...,
        [ 0.0182,  0.0347, -0.0090,  ...,  0.0219,  0.0026, -0.0165],
        [-0.0248,  0.0065, -0.0276,  ..., -0.0066, -0.0177, -0.0129],
        [-0.0258,  0.0085, -0.0303,  ..., -0.0304,  0.0135, -0.0334]],
       grad_fn=<SubBackward0>) tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
non zero gradients tensor(4030)
2.bias
weight & gradient shape =  torch.Size([20]) torch.Size([20])
weight & gradient values=  tensor([-0.0005,  0.0033,  0.0067, -0.0061,  0.

In [None]:
# ! KEY CODE

import torchmetrics
import wandb
wandb.init()

REPEAT = 10

def trainer_function(training_dataloader, model, error_func, optimizer, epochs):
    model.train(mode=True)

    for epoch_no in range(epochs):

        loss_total, accuracy_total = 0, 0
        for batch_no, (x_actual, y_actual) in enumerate(training_dataloader):

            y_predicted_LOGITS = model.forward(x_actual)
            y_predicted_probs  = nn.functional.softmax(y_predicted_LOGITS, dim= 1)
            loss               = error_func(y_predicted_LOGITS, y_actual)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            loss_batch      = loss.item()
            accuracy_batch  = torchmetrics.functional.accuracy(y_predicted_LOGITS, y_actual, task="multiclass", num_classes=10)
            
            loss_total      = loss_total + loss_batch 
            accuracy_total  = accuracy_total + accuracy_batch
            
            metrics_per_batch = {
                "loss": loss_batch,
                "accuracy_batch": accuracy_batch,
                "accuracy_average": accuracy_total / (batch_no + 1),
                "batch_no": batch_no
            }
            wandb.log(metrics_per_batch)
            # Alternative training step (use instead of the three optimizer lines above):
            # compute the gradients explicitly and apply the SGD update by hand.
            #
            # model_parameters = list(model.parameters())
            # dError_dWeights  = torch.autograd.grad(outputs= loss, inputs = model_parameters)
            # with torch.no_grad():
            #     for weight, gradient in zip(model_parameters, dError_dWeights):
            #         weight -= gradient * LEARNING_RATE
            print("END OF BATCH")
        
        accuracy_average    = accuracy_total / TOTAL_BATCHES
        metrics_per_epoch   = {
            "training_accuracy_average_per_epoch": accuracy_average
        }
        wandb.log(metrics_per_epoch)
        print(f"END OF ENTIRE EPOCH no {epoch_no}")
        evaluate_model(validation_dataloader, model, error_func, epoch_no)

def evaluate_model(dataloader, model, error_func, epoch_no):
    model.train(mode=False)   # evaluation mode, same as model.eval()

    loss_total, accuracy_total = 0, 0
    with torch.no_grad():     # no gradients are needed for evaluation
        for x_actual, y_actual in dataloader:
            y_predicted_LOGITS = model(x_actual)
            loss = error_func(y_predicted_LOGITS, y_actual)
            accuracy = torchmetrics.functional.accuracy(y_predicted_LOGITS, y_actual, task="multiclass", num_classes=10)

            loss_total = loss_total + loss.item()
            accuracy_total = accuracy_total + accuracy.item()

    # Accuracy is accumulated once per batch, so average over the number of batches
    accuracy_average = accuracy_total / len(dataloader)
    wandb.log({
        "validation_accuracy_average_per_epoch": accuracy_average
        })
    model.train(mode=True)    # back to training mode for the next epoch

trainer_function(training_dataloader, model, ERROR_FUNC, OPTIMIZER, REPEAT)
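
The training loop passes raw logits to the error function and computes softmax probabilities separately. The sketch below, on two made-up rows of logits, illustrates why that works: `nn.functional.cross_entropy` applies the softmax internally, so it equals the mean negative log probability of the correct class.

```python
import torch, torch.nn as nn

# Made-up logits for 2 samples and 3 classes, with their true class ids
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2,  3.0]])
labels = torch.tensor([0, 2])

# Softmax turns each row of logits into probabilities that sum to 1
probs = nn.functional.softmax(logits, dim=1)

# Cross entropy by hand: minus the log probability given to the correct class, averaged over samples
manual_loss  = -torch.log(probs[torch.arange(2), labels]).mean()
builtin_loss = nn.functional.cross_entropy(logits, labels)

# The predicted class is the position of the largest logit (softmax does not change the ordering)
predicted_classes = logits.argmax(dim=1)

print(probs.sum(dim=1))
print(manual_loss.item(), builtin_loss.item())
print(predicted_classes)
```

Passing probabilities instead of logits to `cross_entropy` would apply softmax twice and distort the loss.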

### Section 4: Brain vs Artificial Neural Network

- The brain does not learn by implementing a single, global optimization principle within a uniform and undifferentiated neural network.
- Rather, biological brains are modular, with distinct but interacting subsystems underpinning key functions such as memory, language, and cognitive control.
- The primate visual system also works differently from a standard artificial network. Rather than processing all input in parallel, visual attention shifts strategically among locations and objects, centering processing resources and representational coordinates on a series of regions in turn.
- Continual Learning is the ability to master new tasks without forgetting how to perform prior tasks. The brain does continual learning easily. Standard neural networks struggle with it: training on a new task tends to overwrite what was learned before, which is called Catastrophic Forgetting.
- Efficient Learning: the ability to rapidly learn new concepts from only a handful of examples.
- Transfer Learning: the ability to reuse knowledge gained on one task to learn a related task.

## Neural Networks in more Detail

### 7 Steps to Learned Neural Network
1. Dataset in Detail
2. Neural Network Forward Pass & Dot Product & Activation
3. Error Function & Calculation for each Data
4. Error Gradient Calculation / Backward Pass
5. PARAMETER update in direction of Error Reduction
6. Model Training Monitoring
7. Model Report

In [None]:
# TODO: MODEL 
feature_extractor = nn.Sequential(
    nn.Conv2d( out_channels = 50, in_channels = 1, kernel_size = (3,3) , padding="same"),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2,2), stride = 2),
  
    nn.Conv2d(out_channels = 100, in_channels = 50, kernel_size = (3,3), padding="same"),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2,2), stride = 2),

)

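decision_maker = nn.Sequential(
  nn.Flatten(start_dim=1),                               # (batch, 100, 7, 7) -> (batch, 100*7*7)
  nn.Linear(out_features = 50, in_features = 100*7*7 ),
  nn.ReLU(),                                             # non-linearity between the two Linear layers
  nn.Linear(out_features = 10, in_features = 50)
)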

model = nn.Sequential(
  feature_extractor,
  decision_maker
)
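
The `in_features = 100*7*7` of the first `nn.Linear` can be checked by tracing a dummy batch through the convolutional part. The sketch below repeats the feature extractor and feeds it zeros of MNIST shape; the batch size of 4 is an assumption. It also shows that the 4D output has to be flattened before a `Linear` layer can consume it.

```python
import torch, torch.nn as nn

feature_extractor = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=50, kernel_size=(3, 3), padding="same"),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 2), stride=2),
    nn.Conv2d(in_channels=50, out_channels=100, kernel_size=(3, 3), padding="same"),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 2), stride=2),
)

x = torch.zeros(4, 1, 28, 28)     # dummy batch: 4 grayscale 28x28 images
shapes = []
for layer in feature_extractor:
    x = layer(x)
    shapes.append(tuple(x.shape))
    print(f"{layer.__class__.__name__:<10} -> {tuple(x.shape)}")

# "same" padding keeps 28x28; each 2x2 max pool halves it: 28 -> 14 -> 7
flattened = torch.flatten(x, start_dim=1)
print(flattened.shape)            # one row of 100*7*7 features per image
```

The channel count comes from the last convolution and the `7*7` from the two pooling steps, so changing either means updating `in_features` as well.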

## Rest

##### Types of Intelligence
1. No Intelligence      - 
1. Narrow Intelligence  - Single Task Intelligence
1. General Intelligence - Multiple Tasks Intelligence
1. Super Intelligence   - More tasks than possible by Single Human

---
##### Complexity of Intelligence
1. Standing Up & Picking Up a Pen
2. Identifying an Object
3. Understanding Words
---
##### Possible Applications via Flexibility
1. Robotics
2. Visual Factory Hand 
3. ChatGPT+

