<h1 align="center" style="color:green;font-size: 3em;" >Neural Networks</h1>
<h1 align="center" style="color:green;font-size: 2em;" >Universal Function Approximators</h1>

## An exercise on the basics of neural networks:

**1. Data Preprocessing.**

**2. Perceptron (Linear Regression).**

**3. Hyperparameters (Optimizer and Loss Function).**

**4. Training.**

**5. Exploring Loss Function Landscape.**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset, Dataset
import matplotlib.pyplot as plt


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data Preprocessing

## Create our data, convert it to tensors, store it in a PyTorch dataset, and use a loader to shuffle and serve it in batches

In [None]:
# Define the input data
size = np.array([2104, 1600, 2400])
price = np.array([399900, 329900, 369000])

# size = np.array([2000, 2500, 3000, 3500, 4000])
# price = np.array([300000, 350000, 400000, 450000, 500000])

# size = np.array([2104, 1600, 2400, 1890])
# price = np.array([399900, 329900, 369000, 340900])

# Convert numpy arrays to PyTorch tensors
size_tensor = torch.from_numpy(size).float().unsqueeze(1)
price_tensor = torch.from_numpy(price).float().unsqueeze(1)

# Create a TensorDataset
data = TensorDataset(size_tensor, price_tensor)

# Create a DataLoader for training
train_loader = DataLoader(data, batch_size = 1, shuffle = True)

print(data[0])

## Visualize our data

In [None]:
# Scatter plot of house size (x-axis) against price (y-axis)
plt.scatter(size, price)
plt.axis([1500, 2500, 200000, 500000])
plt.xlabel("Size")
plt.ylabel("Price")

# Display the plot
plt.show()

# Basics of Deep Learning: Creating our model

This is a very simple neuron, also known as a perceptron, with 1 weight and 1 bias. This is the building block of all neural networks.

## Linear Transformation

According to PyTorch's documentation, `nn.Linear` applies the linear transformation:

$$y = nA^T + b$$

where,
* n is our input
* A^T is the transpose of the weight matrix
* b is the bias

With a single input feature and a single output, A is a 1×1 matrix, so this reduces to y = mx + b and the graph of our function is a line.

![Linear Transformation](https://jalammar.github.io/images/NNs_bias_2.png)

* Source: [A Visual and Interactive Guide to the Basics of Neural Networks](https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/)
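
To make the formula concrete, here is a minimal sketch that checks `nn.Linear` against the manual computation `x @ A.T + b`. The sample inputs and the variable names (`layer`, `manual`) are illustrative assumptions and are not part of the model defined in this notebook.

```python
import torch
import torch.nn as nn

# Illustrative layer: 1 input feature -> 1 output (same shape as the perceptron)
layer = nn.Linear(in_features=1, out_features=1)

# A batch of three made-up inputs, shape (3, 1)
x = torch.tensor([[1.0], [2.0], [3.0]])

# nn.Linear stores its weight with shape (out_features, in_features),
# so the manual version multiplies by the transpose
manual = x @ layer.weight.T + layer.bias

# Both routes should give the same predictions, one per input row
print(torch.allclose(layer(x), manual))
print(layer(x).shape)
```

The weight and bias start at random values, so the predictions themselves differ between runs; only the agreement between the two computations is the point here.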

In [None]:
class Perceptron(nn.Module):
    """
    Perceptron with linear (identity) activation.

    Args:
      in_features: Number of input features per sample.

    Forward input:
      x: Input tensor of shape (batch_size, in_features).

    Returns:
      Output tensor of shape (batch_size, 1).
    """
    def __init__(self, in_features):
        super().__init__()

        # Linear combination of weights and input:
        # y = x @ w.T + b, with w of shape (1, in_features)
        self.linear = nn.Linear(in_features, 1, bias=True)

    def forward(self, x):
        y = self.linear(x)
        return y

In [None]:
neuron = Perceptron(1)  # Assuming 1 input feature
print(neuron)

# Define our hyperparameters

## Optimizer

An optimizer is the algorithm we use to adjust the weights and bias so that the model's outputs move closer to the expected outputs. Visually, this results in a fitted line that passes close to our data points. It also drives the loss function's output toward its minimum (ideally close to 0): if we plot the loss against training steps, the curve falls and approaches the x-axis.

Our optimizer relies on gradients computed by a process called backpropagation. Once we have passed our input through the linear function described earlier and computed the loss, we find the derivative of the loss with respect to each parameter (the weight and the bias). Together these derivatives form the gradient, which points in the direction that increases the loss fastest, so the optimizer moves the parameters a small step in the opposite direction.

Note: Backpropagation is often discussed in the context of bigger models that have many weights and biases (parameters), so terms like chain rule, partial derivatives, and gradients make the most sense in those contexts.
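
As a rough illustration of what backpropagation delivers for a single neuron, the sketch below computes the two partial derivatives of the squared error by hand and checks them against a finite-difference estimate. The numbers (`x`, `y`, `w`, `b`) and the helper names are made up for illustration; PyTorch's autograd computes the same quantities automatically.

```python
def loss(w, b, x, y):
    """Squared error of the prediction w * x + b for one sample."""
    return (w * x + b - y) ** 2

def gradients(w, b, x, y):
    """Partial derivatives of the loss, obtained with the chain rule."""
    error = w * x + b - y
    dloss_dw = 2 * error * x  # d(loss)/dw
    dloss_db = 2 * error      # d(loss)/db
    return dloss_dw, dloss_db

# Made-up sample and starting parameters
x, y = 2.0, 7.0
w, b = 1.0, 0.5

dw, db = gradients(w, b, x, y)

# Finite-difference check: nudge each parameter slightly and measure the slope
eps = 1e-6
dw_numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
db_numeric = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

# One gradient descent step: move against the gradient
learning_rate = 0.05
new_w = w - learning_rate * dw
new_b = b - learning_rate * db

print("analytic:", dw, db)
print("numeric: ", dw_numeric, db_numeric)
print("loss before:", loss(w, b, x, y), "after:", loss(new_w, new_b, x, y))
```

Both gradients are negative here, so stepping against them increases the weight and the bias, and the loss after the step is lower than before.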

## Loss Function

A loss function is a separate function that takes the output of our model and compares it to the expected (correct) output. Since it's a function, it too has a graph. We want to get the output of this function as close to 0 as we can. To do this, we use the optimizer described above.
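
The loss used in this notebook is the mean squared error, $\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$. The short sketch below walks through that arithmetic step by step in plain Python; the predictions and targets are made-up numbers chosen only to keep the calculation easy to follow.

```python
# Made-up predictions and targets, purely for illustration
predictions = [1.0, 2.0, 3.0]
targets = [1.0, 4.0, 7.0]

# Step 1: difference between each prediction and its target
errors = [p - t for p, t in zip(predictions, targets)]

# Step 2: square each difference so the sign no longer matters
squared_errors = [e ** 2 for e in errors]

# Step 3: average the squared differences
mse = sum(squared_errors) / len(squared_errors)

print(errors)
print(squared_errors)
print(mse)
```

A perfect prediction contributes 0 to the sum, while larger misses are penalized disproportionately because the difference is squared.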

In [None]:
def mean_squared_error(output, price):
    """
    Finds the difference between the predicted and actual values,
    squares it so that negative and positive errors do not cancel,
    and returns the mean of the squared differences.

    Args:
      output: Predicted values, tensor of shape (batch_size, 1).
      price: Actual values, tensor of the same shape as output.

    Returns:
      A scalar tensor holding the mean squared error.
    """

    # Compute the mean squared error
    mse = torch.mean((output - price) ** 2)

    return mse

In [None]:
# Define the learning rate and number of epochs.
# The inputs are unscaled (sizes around 2000), so the gradients are huge;
# a learning rate of 0.001 makes SGD diverge here, hence the tiny value.
learning_rate = 1e-7
num_epochs = 100

# Define loss and optimizer
criterion = mean_squared_error  # called later as criterion(prediction, actual)
optimizer = torch.optim.SGD(neuron.parameters(), lr=learning_rate)

# Training

In [None]:
# Training loop
for epoch in range(num_epochs):
    epoch_loss = 0.0  # Track loss for each epoch

    # Named sizes/prices so the NumPy arrays size and price are not overwritten
    for sizes, prices in train_loader:
        optimizer.zero_grad()

        # Forward pass
        outputs = neuron(sizes)

        # Loss & Backward pass
        loss = criterion(outputs, prices)
        loss.backward()

        optimizer.step()

        epoch_loss += loss.item()  # Accumulate loss

    # Print the mean loss at the end of each epoch
    print(f'Epoch [{epoch + 1} / {num_epochs}] Loss: {epoch_loss / len(train_loader)}')

    # Print weights and biases
    # Detach and clone so the copies are independent of the model's parameters
    weight_copy = neuron.linear.weight.detach().clone()
    bias_copy = neuron.linear.bias.detach().clone()
    print("Weight:", weight_copy)
    print("Bias:", bias_copy)

# Print the final weights and bias
print("Final Weight:", weight_copy)
print("Final Bias:", bias_copy)

# Exploring the Loss Function Landscape

Visually, finding the gradient of the loss function and moving the opposite way is like navigating an invisible, hilly landscape. We can only see the steps we have already taken, but the gradient, built from the partial derivatives, tells us which direction is steepest in our immediate vicinity, and we "descend" by stepping the opposite way.

When visualizing gradient descent in linear regression, you're visualizing the descent on the loss function landscape, not on the fitted function (y = nA^T + b) itself.

## Loss Function Landscape:

Imagine a 3D space where the x and y axes represent the model's parameters (e.g., slope and intercept for linear regression), and the z-axis represents the loss function value.

This 3D space is called the loss function landscape. Points with lower loss values are like valleys, and points with higher loss values are like hills or mountains.
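
For a model with only a slope and an intercept, this landscape can be sampled directly. The sketch below evaluates the mean squared error on a grid of candidate parameters using the house data from this notebook. The parameter ranges, the grid resolution, and the helper name `mse_loss` are assumptions chosen only to bracket a sensible fit.

```python
import numpy as np

# House data from this notebook
size = np.array([2104, 1600, 2400], dtype=float)
price = np.array([399900, 329900, 369000], dtype=float)

def mse_loss(w, b):
    """Mean squared error of the line price = w * size + b."""
    return np.mean((w * size + b - price) ** 2)

# Illustrative parameter ranges (assumed)
weights = np.linspace(0.0, 400.0, 81)          # slope candidates
biases = np.linspace(-200000.0, 400000.0, 61)  # intercept candidates

# landscape[i, j] is the loss at (weights[i], biases[j])
landscape = np.array([[mse_loss(w, b) for b in biases] for w in weights])

# Locate the lowest point of the sampled grid (the bottom of the valley)
i, j = np.unravel_index(np.argmin(landscape), landscape.shape)
print("lowest sampled loss:", landscape[i, j])
print("at weight:", weights[i], "bias:", biases[j])
```

To draw the surface as a filled contour plot, one option is `plt.contourf(weights, biases, landscape.T)`, which puts the slope on the x-axis and the intercept on the y-axis.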

## Gradient Descent in this Landscape:

Gradient descent works by iteratively adjusting the model's parameters in the direction that minimizes the loss function.

We calculate the gradient of the loss function at the current parameter values. The gradient points in the direction of steepest ascent on the loss function landscape.

By taking small steps in the negative direction of the gradient (opposite direction of steepest ascent), we move the model's parameters towards lower loss regions in the landscape.
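
The update rule described here can be sketched on a one-parameter toy problem. The bowl-shaped loss, the starting point, and the learning rate below are assumptions picked for illustration and are unrelated to the house-price model.

```python
def loss(w):
    """A simple bowl-shaped loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the loss; it points uphill."""
    return 2.0 * (w - 3.0)

w = 0.0              # assumed starting point
learning_rate = 0.1  # assumed step size
path = [w]           # parameter values visited during the descent

for _ in range(50):
    # Step against the gradient to move downhill
    w = w - learning_rate * gradient(w)
    path.append(w)

print("first steps:", path[:4])
print("final w:", w, "final loss:", loss(w))
```

Each step shrinks the distance to the minimum by the same factor, so the early steps are large and the later ones become tiny as the slope flattens out.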

## Visualizing Descent:

In visualizations, we typically project this 3D landscape onto a 2D plane, often showing the loss value as a color gradient (darker for lower loss, brighter for higher loss).

We then plot the path of the model's parameters (a series of points) as they move through the landscape during gradient descent iterations. This path represents the model's journey towards finding the minimum loss (often visualized as a valley in the landscape).

While you see the line representing the model's fit getting "closer" to the data points, you're not directly visualizing movement on that line. You're visualizing the descent on the loss function landscape, which guides the model's parameters to create a better fit.

A natural question: we don't know what the landscape looks like ahead of time, right? It is only "rendered" as we calculate loss values. Are the derivatives (or the partial derivatives, obtained via the chain rule, when there are multiple parameters) acting like a pair of powerful goggles that reveal only the area in our immediate proximity, with the gradient pointing toward the steepest way up from where we stand?

That is exactly right. The loss function landscape is a helpful conceptualization, but it's not something we can directly "see" or fully render beforehand. Here's a more detailed explanation:

## Unknown Landscape:

The true loss function landscape for a complex model with many parameters is high-dimensional and very difficult to visualize in its entirety. We only have access to the loss value and its derivatives at specific points (parameter values) during training.

## Navigating with Derivatives:

The goggles analogy fits. We use the derivatives (partial derivatives, computed via the chain rule when there are multiple parameters) as a tool to navigate this hidden landscape.

The gradient, calculated using these derivatives, tells us the direction of steepest ascent in the loss function landscape at the current parameter location.

By taking the negative of the gradient (since we want to minimize the loss), we get the direction of steepest descent.

## Limited View:

However, the derivatives only provide local information about the immediate vicinity of the current parameter values. They don't give us a complete picture of the global landscape.
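
One consequence of this limited view is that the starting point can decide where the descent ends up. The sketch below uses a made-up one-dimensional loss with two valleys of different depth; the function, learning rate, and step count are all illustrative assumptions.

```python
def loss(x):
    """A made-up loss with two valleys of different depth."""
    return x ** 4 - 3 * x ** 2 + x

def gradient(x):
    """Derivative of the loss above."""
    return 4 * x ** 3 - 6 * x + 1

def descend(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent from a given starting point."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)
    return x

left = descend(-2.0)   # ends in the valley on the left
right = descend(2.0)   # ends in the valley on the right

print("from -2.0:", left, "loss:", loss(left))
print("from  2.0:", right, "loss:", loss(right))
```

Both runs stop where the gradient is zero, but only the left valley is the deeper one. The run started on the right has no local signal telling it that a better minimum exists elsewhere.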

## Optimization Algorithms:

Gradient descent algorithms like the SGD optimizer used in the training loop above use this local information from derivatives to iteratively update the parameters.

With each update, the algorithm moves to a new point (parameter values) in the landscape and calculates a new gradient to continue its descent towards a minimum.

## Challenges and Stochasticity:

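The landscape can have many valleys and ridges, making it challenging to find the global minimum. (For linear regression with mean squared error the landscape is a single convex bowl, so this concern applies mainly to larger networks.)

In practice, we often settle for a good local minimum, one whose loss is low and that still generalizes to unseen data.

Stochastic gradient descent (SGD) computes each gradient from a randomly chosen mini-batch rather than the full dataset. The resulting noise in the updates can help the parameters escape shallow local minima and explore the landscape more broadly.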
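
Where that randomness comes from can be seen by comparing a gradient computed on a single sample with the gradient averaged over the whole dataset. The sketch below assumes a one-weight model `y = w * x` and a small made-up dataset; none of these values come from the house-price example.

```python
import random

# Made-up, already-scaled data (x, y) for illustration
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]
w = 0.0  # current weight of the model y = w * x

def sample_gradient(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2 * (w * x - y) * x

# Full-batch gradient: the average over every sample
per_sample = [sample_gradient(w, x, y) for x, y in data]
full_gradient = sum(per_sample) / len(per_sample)

# Stochastic gradient: computed from one randomly chosen sample
random.seed(0)  # fixed seed so the run is repeatable
x, y = random.choice(data)
stochastic_gradient = sample_gradient(w, x, y)

print("per-sample gradients:", per_sample)
print("full-batch gradient:", full_gradient)
print("one stochastic gradient:", stochastic_gradient)
```

Every per-sample gradient points the same way here but with a very different magnitude, so a single-sample update is a noisy estimate of the full-batch step. With `batch_size = 1`, the `DataLoader` in this notebook produces exactly this kind of single-sample update.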