- Assumptions of Linear Regression
    - The relationship between the independent variables $\mathbf{x}$ and the dependent variable $y$ is linear, i.e., $y$ can be expressed as a weighted sum of the elements of $\mathbf{x}$, plus some noise on the observations
    - Any noise $\epsilon$ is well-behaved (following a Gaussian distribution)
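
To make the two assumptions concrete, here is a minimal sketch that generates data satisfying them: labels are a weighted sum of the features plus Gaussian noise. The function name `synthetic_data`, the values of `true_w` and `true_b`, and the noise scale are illustrative assumptions, not part of the notes.

```python
import numpy as np

def synthetic_data(w, b, num_examples, noise_std=0.01, seed=0):
    """Generate y = Xw + b + Gaussian noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Features drawn from a standard normal distribution (an assumption).
    X = rng.normal(0.0, 1.0, size=(num_examples, len(w)))
    # Linear relationship plus well-behaved (Gaussian) noise.
    y = X @ w + b + rng.normal(0.0, noise_std, size=num_examples)
    return X, y

true_w = np.array([2.0, -3.4])  # illustrative weights
true_b = 4.2                    # illustrative bias
X, y = synthetic_data(true_w, true_b, num_examples=1000)
```

Each row of `X` is one example, and `y` holds one label per row.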

### Dataset
1. Training Set
2. Validation Set
3. Test Set

### Linear Model

Let $n$ denote the number of examples in our dataset. We index the examples by $i$, denoting each input as $\mathbf{x}^{(i)} = [x_1^{(i)}, \ldots, x_d^{(i)}]$, where $d$ is the number of features, and the corresponding label as $y^{(i)}$.

Prediction: $\hat{y} = w_1 x_1 + \ldots + w_d x_d + b$, where $\mathbf{w}$ is the weight vector, $\mathbf{x}$ is the feature vector, and $b$ is the bias (the $y$-intercept).

Compact form:

$\hat{y} = \mathbf{w} \cdot \mathbf{x} + b$

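For a collection of $n$ examples stored as the rows of a matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, the predictions for all examples are

${\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b$

where $b$ is added to every entry of $\mathbf{X} \mathbf{w}$.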
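
The batched prediction can be sketched in NumPy as a single matrix-vector product, assuming `X` stores one example per row. The function name `linreg` and the numbers below are illustrative only.

```python
import numpy as np

def linreg(X, w, b):
    """Linear model: one prediction per row of X."""
    # X has shape (n, d), w has shape (d,); b is broadcast over all n rows.
    return X @ w + b

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])
b = 2.0
y_hat = linreg(X, w, b)  # array([ 0.5, -0.5, -1.5])
```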

### Loss Function

- A measure of how close $\hat{y}^{(i)}$ is to $y^{(i)}$

Squared error for example $i$:

$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$

To measure the quality of a model on the entire dataset of  $n$  examples, we take the average loss:

$L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w} \cdot \mathbf{x}^{(i)} + b - y^{(i)}\right)^2$

When training the model, we want to find the parameters $(\mathbf{w}^*, b^*)$ that minimize $L(\mathbf{w}, b)$:

$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\  L(\mathbf{w}, b)$
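
The per-example squared error and its average over the dataset can be sketched as follows, assuming `X` holds one example per row. The function names and the toy values are hypothetical.

```python
import numpy as np

def squared_loss(y_hat, y):
    """Per-example loss: 0.5 * (y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2

def mean_loss(X, y, w, b):
    """Average squared loss L(w, b) over all n examples."""
    return squared_loss(X @ w + b, y).mean()

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
y = np.array([1.0, 2.0])
# With all-zero parameters the predictions are 0, so the per-example
# losses are 0.5 * 1^2 = 0.5 and 0.5 * 2^2 = 2.0, and their mean is 1.25.
loss = mean_loss(X, y, w=np.zeros(2), b=0.0)
```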

### Gradient Descent

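- Compute the partial derivatives of $L(\mathbf{w}, b)$ over the entire dataset with respect to $\mathbf{w}$ and $b$, then move the parameters a small step in the direction of the negative gradient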
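
For the average squared loss defined earlier, these partial derivatives can be worked out directly (the factor $\frac{1}{2}$ cancels the exponent):

$\partial_{\mathbf{w}} L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \left(\mathbf{w} \cdot \mathbf{x}^{(i)} + b - y^{(i)}\right) \mathbf{x}^{(i)}$

$\partial_{b} L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \left(\mathbf{w} \cdot \mathbf{x}^{(i)} + b - y^{(i)}\right)$

A full-batch update then takes the form below, where the step size $\eta > 0$ is a symbol assumed here for the learning rate:

$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \, \partial_{(\mathbf{w}, b)} L(\mathbf{w}, b)$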

### Minibatch Stochastic Gradient Descent

1. Initialize the model parameters $\mathbf{w}$ and $b$, often randomly
2. In each iteration, randomly sample a minibatch $\mathcal{B}$ of examples from the dataset
3. Compute the partial derivatives of the average loss on the minibatch with respect to $\mathbf{w}$ and $b$
4. Update $\mathbf{w}$ and $b$:
$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)$, where $\eta$ is the learning rate and $|\mathcal{B}|$ is the number of examples in the minibatch (its cardinality)

$\eta$ and $|\mathcal{B}|$ are called *hyperparameters*. Unlike $\mathbf{w}$ and $b$, they are not updated in the training loop; they are typically tuned based on results on the validation set.
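
The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes the squared loss, draws minibatches by shuffling the data once per epoch (one common way to sample them), and uses made-up hyperparameter values and synthetic data. The names `minibatch_sgd`, `true_w`, and `true_b` are hypothetical.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.03, batch_size=10, num_epochs=5, seed=0):
    """Fit w and b by minibatch SGD on the squared loss (sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 1: initialize the parameters (small random weights, zero bias).
    w = rng.normal(0.0, 0.01, size=d)
    b = 0.0
    for _ in range(num_epochs):
        perm = rng.permutation(n)  # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            # Step 2: take the next random minibatch of indices.
            idx = perm[start:start + batch_size]
            # Step 3: gradients of the average minibatch loss.
            err = X[idx] @ w + b - y[idx]
            grad_w = X[idx].T @ err / len(idx)
            grad_b = err.mean()
            # Step 4: step against the gradient, scaled by the learning rate.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic data with known parameters (illustrative values).
rng = np.random.default_rng(42)
true_w, true_b = np.array([2.0, -3.4]), 4.2
X = rng.normal(size=(1000, 2))
y = X @ true_w + true_b + rng.normal(0.0, 0.01, size=1000)

w_hat, b_hat = minibatch_sgd(X, y)
```

On data like this, `w_hat` and `b_hat` should land close to `true_w` and `true_b`; how close depends on the learning rate, the batch size, and the number of epochs.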

### Making Predictions

Once we have found $\mathbf{w}^*, b^*$, we can make predictions. Let $\mathbf{x}_{val}$ denote an example from our validation set; then $\hat{y} = \mathbf{w}^* \cdot \mathbf{x}_{val} + b^*$.

### Neuron

![neuron.svg](../images/neuron.svg)

### Neural Network

![singleneuron.svg](../images/singleneuron.svg)