##3.1 Linear Regression

In [None]:
pip install d2l==1.0.3

In [None]:
%matplotlib inline
import math
import time
import numpy as np
import torch
from d2l import torch as d2l

###3.1.2 Vectorization for Speed

In [None]:
n = 10000
a = torch.ones(n)
b = torch.ones(n)

In [None]:
# Add the two vectors one element at a time with a Python for-loop
c = torch.zeros(n)
t = time.time()
for i in range(n):
    c[i] = a[i] + b[i]
f'{time.time() - t:.5f} sec'

In [None]:
# Add the two vectors with a single vectorized call to the overloaded + operator
t = time.time()
d = a + b
f'{time.time() - t:.5f} sec'

###3.1.3 The Normal Distribution and Squared Loss

In [None]:
def normal(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 * (x - mu)**2 / sigma**2)

In [None]:
# Use NumPy again for visualization
x = np.arange(-7, 7, 0.01)

# Mean and standard deviation pairs
params = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
         ylabel='p(x)', figsize=(4.5, 2.5),
         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])

##Discussion and Review 3.1

Key Concepts

- Regression Problems: Focus on predicting continuous numerical values, such as house prices, lengths of hospital stays, and stock prices.
- Training Dataset: Contains features (inputs) and labels (targets) used to teach the model.
- Features: Variables influencing predictions (e.g., area, age of a house).
- Labels/Targets: Values that need to be predicted (e.g., house prices).

###3.1.1

- Definition: Linear regression is a foundational method for regression tasks, based on the assumption of a linear relationship between the target and features.

- Model Equation:
  $y = w_1 \cdot x_1 + w_2 \cdot x_2 + b$

  - $w_1$ and $w_2$: Weights corresponding to the features.

  - $b$: Bias term (offset) that lets the model fit data whose target is nonzero when all features are zero.

- Noise: Real-world data contains noise, often assumed to follow a Gaussian distribution.

####3.1.1.1

Model Representation
- Linear models express the prediction $\hat{y}$ as a weighted sum of the features plus a bias:

  $\hat{y} = w_1 \cdot \text{(area)} + w_2 \cdot \text{(age)} + b$

- Weights: Determine each feature's contribution to predictions.

- Bias: The model’s output when all features are zero.

- Affine Transformation: Linear regression involves a linear transformation (weighted sum) plus a translation (bias).

Compact Representation
- High-Dimensional Data: Using vector and matrix notation simplifies representation:
  - Single Example Prediction:
  $\hat{y} = w^\top x + b$
  - Multiple Examples Prediction:
  $\hat{y} = Xw + b$
  - $x$: Feature vector, $w$: Weight vector, $X$: Design matrix of all examples.
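
The single-example and multiple-example forms above can be checked against each other numerically. The following is a minimal sketch in NumPy; the feature values, weights, and bias are made-up illustrative numbers, not taken from any dataset.

```python
import numpy as np

# Hypothetical toy data: 3 houses, 2 features each (area, age).
X = np.array([[100.0, 5.0],
              [80.0, 20.0],
              [120.0, 1.0]])
w = np.array([2.0, -0.5])  # illustrative weights, one per feature
b = 10.0                   # illustrative bias

# Single example: y_hat = w^T x + b
y_hat_single = w @ X[0] + b

# All examples at once: y_hat = X w + b (the scalar b is broadcast to every row)
y_hat_all = X @ w + b

# The same predictions computed one example at a time, for comparison
y_hat_loop = np.array([w @ x + b for x in X])

print(y_hat_single, y_hat_all)
```

The matrix form gives the same numbers as the loop, but in one call, which is the vectorization idea from Section 3.1.2.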

Goal of Linear Regression

- Objective: Minimize prediction error by optimizing weight vector $w$ and bias $b$.
- Error Sources: Measurement inaccuracies and inherent noise in data.
- Noise Term: Essential to include in the model to account for discrepancies.

Next Steps in Linear Regression

- Model Quality: Establish a metric to evaluate prediction quality, often through loss functions.
- Optimization: Implement algorithms (e.g., gradient descent) to iteratively refine model parameters to reduce prediction errors.

####3.1.1.2

- Loss Function: Quantifies how well the model’s predictions match actual targets, with lower values indicating better performance.
  - Squared Error: Commonly used loss function for regression:
  $L(\hat{y}, y) = \frac{1}{2} (\hat{y} - y)^2$
    - Penalizes larger errors more severely due to the quadratic nature.
    - The constant $\frac{1}{2}$ simplifies the derivative calculation.

- Total Loss: For the entire dataset, the per-example losses are summed (or averaged, by dividing by $n$):

  $\text{Total Loss} = \sum_{i=1}^{n} L(\hat{y}^{(i)}, y^{(i)})$

    - Objective: Minimize this total loss by adjusting model parameters (weights and bias).
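
As a small illustration of the squared error and the total loss, the sketch below evaluates both on three made-up prediction/target pairs. The helper name `squared_loss` is chosen here for illustration only.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Per-example squared error, including the factor 1/2."""
    return 0.5 * (y_hat - y) ** 2

# Made-up targets and predictions
y = np.array([3.0, -1.0, 2.0])
y_hat = np.array([2.5, 0.0, 2.0])

per_example = squared_loss(y_hat, y)  # errors of 0.5, 1.0, and 0.0
total_loss = per_example.sum()        # summed over the dataset
mean_loss = per_example.mean()        # averaged over the dataset

# Doubling an error quadruples its loss, which is the quadratic penalty
ratio = squared_loss(2.0, 0.0) / squared_loss(1.0, 0.0)

print(per_example, total_loss, mean_loss, ratio)
```

Summing and averaging differ only by the constant factor $n$, so both are minimized by the same parameters.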

####3.1.1.3

- Analytical Solution: Linear regression offers a mathematical method to find optimal parameters.

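  - Appending a column of 1s to the design matrix $X$ absorbs the bias into $w$. Setting the gradient of the squared loss to zero then yields the normal equation:
  $w^* = (X^\top X)^{-1} X^\top y$
  - This solution is unique if the design matrix $X$ has full column rank (i.e., its columns are linearly independent), so that $X^\top X$ is invertible.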

- Complex Models: Analytical solutions are rare in deep learning due to increased model complexity, necessitating iterative optimization techniques.
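
The normal equation can be tried out on synthetic data. The sketch below is one way to do it in NumPy, under the assumption of made-up "true" parameters (`true_w`, `true_b`) and a small amount of Gaussian noise. It solves the linear system with `np.linalg.solve` instead of forming the inverse explicitly, which is usually the numerically safer choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 2
true_w = np.array([2.0, -3.4])  # illustrative ground-truth weights
true_b = 4.2                    # illustrative ground-truth bias

# Synthetic data: linear relationship plus a little Gaussian noise
X = rng.normal(size=(n, d))
y = X @ true_w + true_b + 0.01 * rng.normal(size=n)

# Append a column of ones so the bias becomes the last entry of the weight vector
X_aug = np.hstack([X, np.ones((n, 1))])

# Solve (X^T X) theta = X^T y, which is the normal equation without an explicit inverse
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w_hat, b_hat = theta[:d], theta[d]

print(w_hat, b_hat)
```

With low noise and linearly independent columns, the recovered `w_hat` and `b_hat` land close to the values used to generate the data.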

####3.1.1.4

- Gradient Descent: An optimization technique for updating model parameters to reduce loss iteratively.

- Full-batch Gradient Descent: Utilizes all data points to compute gradients but may be slow for large datasets.

- Stochastic Gradient Descent (SGD): Updates parameters using a single randomly selected sample at each step, which is faster but less stable.

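- Minibatch SGD: A hybrid approach that computes each update from a small random batch of $m$ examples, balancing efficiency and stability:
  $w \leftarrow w - \frac{\eta}{m} \sum_{i=1}^{m} \nabla_w L(\hat{y}^{(i)}, y^{(i)})$
  - $m$: Minibatch size. Dividing by $m$ averages the gradient, so the step size does not grow with the batch size. The bias $b$ is updated in the same way.
  - $\eta$: Learning rate, a small positive value that controls the step size.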

- Hyperparameters: Learning rate and minibatch size are not learned during training but set beforehand, often tuned using methods like Bayesian optimization.

- Stopping Criterion: Training concludes after a predetermined number of iterations or upon meeting a specific condition (e.g., loss stabilization).
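
To make the update rule concrete, here is a minimal minibatch SGD loop for linear regression in plain NumPy. It is a sketch under stated assumptions: the synthetic data, the hyperparameter values (`eta`, `m`, `num_epochs`), and the fixed number of epochs as the stopping criterion are all illustrative choices. The per-example gradients are averaged over the minibatch, so the step size does not depend on $m$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2
true_w, true_b = np.array([2.0, -3.4]), 4.2  # illustrative ground truth
X = rng.normal(size=(n, d))
y = X @ true_w + true_b + 0.01 * rng.normal(size=n)

# Parameters start at zero; hyperparameters are set by hand, not learned
w, b = np.zeros(d), 0.0
eta, m, num_epochs = 0.1, 10, 5

for epoch in range(num_epochs):
    perm = rng.permutation(n)  # reshuffle the examples every epoch
    for start in range(0, n, m):
        idx = perm[start:start + m]
        err = X[idx] @ w + b - y[idx]       # y_hat - y for each example in the batch
        grad_w = X[idx].T @ err / len(idx)  # average gradient of 0.5 * err**2 w.r.t. w
        grad_b = err.mean()                 # average gradient w.r.t. b
        w -= eta * grad_w
        b -= eta * grad_b

print(w, b)
```

Setting `m = n` turns this into full-batch gradient descent, and `m = 1` gives single-example SGD.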

####3.1.1.5

- Making Predictions: The trained model can make predictions on unseen data:
  $\hat{y} = w^\top x + b$
  - This phase is referred to as inference in deep learning, although this term may lead to confusion with statistical inference.
- Generalization: The goal is not just to minimize training loss but to ensure the model performs well on unseen data, a concept known as generalization.
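
One simple way to look at generalization is to hold out some examples, fit on the rest, and compare the two errors. The sketch below does this on synthetic data. The data generator, the split sizes, and the use of `np.linalg.lstsq` as a stand-in for training are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_w, true_b = np.array([2.0, -3.4]), 4.2  # illustrative ground truth

def make_data(n):
    """Draw n synthetic examples from the linear model with Gaussian noise."""
    X = rng.normal(size=(n, 2))
    y = X @ true_w + true_b + 0.1 * rng.normal(size=n)
    return X, y

X_train, y_train = make_data(80)
X_test, y_test = make_data(20)  # stands in for unseen data

# Fit on the training set only (least squares with a column of ones for the bias)
A = np.hstack([X_train, np.ones((len(X_train), 1))])
theta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
w, b = theta[:2], theta[2]

def predict(X):
    """Prediction step: y_hat = X w + b."""
    return X @ w + b

train_mse = np.mean((predict(X_train) - y_train) ** 2)
test_mse = np.mean((predict(X_test) - y_test) ** 2)

print(train_mse, test_mse)
```

Here the model family matches the data-generating process, so the held-out error stays close to the training error. A large gap between the two would indicate poor generalization.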

###3.1.3

- Squared Loss: There is a close relationship between squared loss and the normal distribution.

  - Squared loss is the maximum likelihood choice for linear regression when the noise (the gap between the observed targets and the underlying linear relationship) follows a normal distribution.
- Normal Distribution:
  - Characterized by a mean $\mu$ and variance $\sigma^2$, the probability density function (PDF) is expressed as:

  $p(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
  - Often referred to as the Gaussian distribution, it is foundational in statistics and machine learning.
- Connection between Squared Loss and Normal Distribution:

  - In the context of linear regression, if the errors (or noise) are assumed to follow a normal distribution, minimizing the squared loss equates to maximizing the likelihood of the observed data under that distribution.

###3.1.3

- When fitting a linear regression model, we assume that the observed data points stem from a deterministic relationship with some added noise, which is typically modeled as following a normal distribution. This leads us to express the relationship as:

  $y = w^\top x + b + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$

- This means that the target variable $y$ is modeled as a linear combination of input features $x$, plus some normally distributed noise $\epsilon$.

Likelihood of Observing Data

- The likelihood of observing a specific target value $y$ for a given input $x$ is given by:

  $P(y \mid x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(y - (w^\top x + b))^2}{2\sigma^2}\right)$

####3.1.3.1

The overall likelihood for the entire dataset, considering $n$ independent observations, is:

  $L(w, b) = \prod_{i=1}^{n} P(y_i \mid x_i)$

Because the logarithm is monotonically increasing, maximizing the log-likelihood yields the same parameters as maximizing the likelihood, and it turns the product into a sum that is easier to work with. The log-likelihood for all observations is:

  $\log L(w, b) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - (w^\top x_i + b))^2$

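Maximizing the log-likelihood is the same as minimizing its negative. The first term does not depend on $w$ or $b$, and the factor $\frac{1}{2\sigma^2}$ is a positive constant, so both can be dropped. What remains is the sum of squared errors, which has the same minimizer as the mean squared error (MSE). Minimizing squared error therefore corresponds to maximum likelihood estimation (MLE) under the assumption of Gaussian noise: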

  $\min_{w, b} \sum_{i=1}^{n} (y_i - (w^\top x_i + b))^2$
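
The equivalence can also be checked numerically. The sketch below uses a one-dimensional toy problem with made-up parameters and a fixed, assumed-known noise level `sigma`. It evaluates the negative log-likelihood and the sum of squared errors on a grid of candidate slopes and confirms that both are minimized at the same grid point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 0.5  # sample size and (assumed known) noise standard deviation
x = rng.normal(size=n)
y = 1.5 * x + 0.3 + sigma * rng.normal(size=n)  # illustrative slope 1.5 and bias 0.3

def sum_squared_error(w, b):
    return ((y - (w * x + b)) ** 2).sum()

def neg_log_likelihood(w, b):
    # Negative of the Gaussian log-likelihood of all n observations
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + sum_squared_error(w, b) / (2 * sigma**2)

# Scan candidate slopes while keeping the bias fixed at 0.3
ws = np.linspace(0.0, 3.0, 301)
sse = np.array([sum_squared_error(w, 0.3) for w in ws])
nll = np.array([neg_log_likelihood(w, 0.3) for w in ws])

best_w_sse = ws[sse.argmin()]
best_w_nll = ws[nll.argmin()]

print(best_w_sse, best_w_nll)
```

The two curves differ only by an additive constant and a positive scale factor, which is why their minimizers coincide.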


###3.1.4

Despite its simplicity, linear regression can be interpreted as a basic form of a neural network:

- Input Layer: Contains one neuron for each input feature $x_1, x_2, \ldots, x_d$.

- Output Layer: Produces a single output $\hat{y}$, representing the model's prediction.

In this neural network perspective:

- There is a single output neuron.
- The input layer is fully connected to the output layer.
- There are no hidden layers.

While this architecture is too simplistic for complex tasks, it lays the groundwork for more advanced neural network architectures.
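
In PyTorch, this single-layer view corresponds to one fully connected layer with a single output. The sketch below uses `torch.nn.Linear`; the input dimension `d = 3` and the random inputs are arbitrary choices for illustration.

```python
import torch
from torch import nn

d = 3                  # number of input features (arbitrary for this sketch)
net = nn.Linear(d, 1)  # d inputs fully connected to one output, no hidden layer

X = torch.randn(4, d)  # a batch of 4 random examples
with torch.no_grad():
    y_hat = net(X)                        # shape (4, 1)
    manual = X @ net.weight.T + net.bias  # the same affine map written out by hand

# The layer holds d weights and 1 bias
num_params = sum(p.numel() for p in net.parameters())

print(y_hat.shape, num_params)
```

The layer's output matches the hand-written affine map, so the "network" here is exactly the linear regression model $\hat{y} = w^\top x + b$.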

####3.1.4.1

A biological neuron includes:

- Dendrites: Receive input signals.
- Nucleus: Processes the weighted inputs.
- Axon: Transmits the output signal to other neurons or actuators.