### **Linear Regression**



### Simple Linear Regression

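Given a dataset of $n$ points $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, we model the relationship as a straight line: $y = mx + c$, or equivalently $\hat{y} = b_0 + b_1 x$, where $b_1$ is the slope and $b_0$ is the intercept.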


##### Loss Function: Mean Squared Error (MSE)

The MSE loss function is:

$$
L(b_0, b_1) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^n (y_i - (b_0 + b_1 x_i))^2
$$
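
As a quick illustration of the formula above, here is a minimal Python sketch of the MSE for a candidate line. The function name `mse_loss` and the toy data are made up for this example.

```python
def mse_loss(b0, b1, xs, ys):
    """Mean squared error of the line y_hat = b0 + b1 * x over paired samples."""
    n = len(xs)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / n


# Hypothetical toy data lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(mse_loss(1.0, 2.0, xs, ys))  # 0.0 (perfect fit)
print(mse_loss(0.0, 2.0, xs, ys))  # 1.0 (every residual is exactly 1)
```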


***Derivatives***
To use gradient descent, we need the partial derivatives:

**Derivative w\.r.t $b_0$:**

$$
\frac{\partial L}{\partial b_0} = \frac{-2}{n} \sum_{i=1}^n (y_i - (b_0 + b_1 x_i))
$$

**Derivative w\.r.t $b_1$:**

$$
\frac{\partial L}{\partial b_1} = \frac{-2}{n} \sum_{i=1}^n x_i (y_i - (b_0 + b_1 x_i))
$$
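
The two partial derivatives can be sanity-checked numerically. The sketch below (function names and data are illustrative assumptions) evaluates the analytic gradients and compares them with a central finite difference of the MSE.

```python
def mse(b0, b1, xs, ys):
    """MSE of the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradients(b0, b1, xs, ys):
    """Analytic partial derivatives of the MSE w.r.t. b0 and b1."""
    n = len(xs)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    grad_b0 = -2.0 / n * sum(residuals)
    grad_b1 = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))
    return grad_b0, grad_b1


# Hypothetical toy data lying on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

g0, g1 = mse_gradients(0.0, 0.0, xs, ys)
print((g0, g1))  # (-8.0, -17.0)

# Central finite differences as an independent check
h = 1e-5
num_g0 = (mse(h, 0.0, xs, ys) - mse(-h, 0.0, xs, ys)) / (2 * h)
num_g1 = (mse(0.0, h, xs, ys) - mse(0.0, -h, xs, ys)) / (2 * h)
```

Both gradients are negative at $(b_0, b_1) = (0, 0)$, so a gradient descent step would increase both parameters, moving towards the true values.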


Setting both derivatives to zero and solving the resulting pair of equations gives:

1. Slope ($b_1$):
	- For just two points, the slope is $m = \Large \frac{y_2 - y_1}{x_2 - x_1}$
	- In general, $b_1$ = $\Large \frac{Cov(x,y)}{Var(x)}$ = $\large \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$ = $\Large {\frac{\sum x_{i}y_{i}-n{\bar{x}}{\bar{y}}}{\sum x_{i}^{2}-n{\bar{x}}^{2}}}$
2. Intercept ($b_0$):
	- $b_0 = \bar{y} - b_1 \bar{x}$
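
The closed-form estimates above translate directly into a few lines of Python. This is a minimal sketch; `fit_simple_linear_regression` and the sample data are hypothetical names and values chosen for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) from the closed-form least-squares formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of Cov(x, y) / Var(x); the 1/n factors cancel
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical sample data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_simple_linear_regression(xs, ys)
print(b0, b1)  # approximately 2.2 and 0.6
```

Here $\bar{x} = 3$, $\bar{y} = 4$, $\sum (x_i - \bar{x})(y_i - \bar{y}) = 6$ and $\sum (x_i - \bar{x})^2 = 10$, so $b_1 = 0.6$ and $b_0 = 4 - 0.6 \cdot 3 = 2.2$.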
 

### Multiple Linear Regression Model

This approach treats the data as a matrix and uses linear algebra operations to estimate the optimal values of all coefficients at once.

**Matrix formulation:** $Y = Xb + \varepsilon$

Where:
- $Y$ is an $n \times 1$ vector of outputs,
- $X$ is an $n \times (p+1)$ **design matrix** (the first column is all 1's for the intercept),
- $b$ is a $(p+1) \times 1$ vector of coefficients (including $b_0$),
- $\varepsilon$ is the $n \times 1$ error vector.

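For example, with two predictors ($p = 2$) the model is:

$Y = b_0 + b_1X_1 + b_2X_2 + \varepsilon$

and the design matrix is:

$X = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} \end{bmatrix}$

Minimizing the squared error then gives the least-squares estimate (assuming $\mathbf{X}^\top \mathbf{X}$ is invertible):

$\hat{\mathbf{b}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}$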
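
A minimal NumPy sketch of this estimate follows. The helper name `fit_ols` and the data are assumptions for illustration. It solves the system $(X^\top X)\, b = X^\top Y$ instead of forming the inverse explicitly, which gives the same result when $X^\top X$ is invertible and is usually better behaved numerically.

```python
import numpy as np


def fit_ols(features, y):
    """Least-squares coefficients [b0, b1, ..., bp] via the normal equations."""
    features = np.asarray(features, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept b0
    X = np.column_stack([np.ones(len(features)), features])
    # Solve (X^T X) b = X^T y rather than computing the inverse
    return np.linalg.solve(X.T @ X, X.T @ y)


# Hypothetical noise-free data generated from y = 1 + 2*x1 - 1*x2
features = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [1, 4, 3, 6, 6]

b_hat = fit_ols(features, y)
print(np.round(b_hat, 6))  # close to [1, 2, -1]
```

When columns of $X$ are (nearly) collinear, $X^\top X$ becomes singular or ill-conditioned; `np.linalg.lstsq(X, y, rcond=None)` is a more robust alternative in that case.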


### **Logistic Regression**


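Logistic regression models the probability of the positive class by passing the linear predictor $\hat y = x^\top \beta$ through the sigmoid function $\sigma$:

$P(y = 1 \mid x) = \sigma(\hat y) = \frac{1}{1 + e^{- \hat y}}$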
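
A small sketch of the sigmoid in plain Python. The two-branch form is an implementation choice made here (not something the formula requires) to avoid overflow in `math.exp` for large negative inputs.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z)
    e = math.exp(z)
    return e / (1.0 + e)


print(sigmoid(0.0))      # 0.5
print(sigmoid(1000.0))   # 1.0
print(sigmoid(-1000.0))  # 0.0
```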


1. **Log-Likelihood**

	$\ell(\beta) = \log L(\beta) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$

	Substitute: $p_i = \sigma(x_i^\top \beta)$
	
	$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log(\sigma(x_i^\top \beta)) + (1 - y_i) \log(1 - \sigma(x_i^\top \beta)) \right]$

2. **Gradient of Log-Likelihood**
	
	To find parameters, we take the gradient of the log-likelihood and set it to zero:
	
	$\nabla_\beta \ell(\beta) = \sum_{i=1}^{n} (y_i - \sigma(x_i^\top \beta)) x_i$
	
	In matrix form: $\boxed{\nabla_\beta \ell(\beta) = X^\top (y - p)}$
	
	Where:
	
	- $X \in \mathbb{R}^{n \times d}$ is the design matrix,
	- $y \in \mathbb{R}^n$ is the vector of labels,
	- $p \in \mathbb{R}^n$ is the vector of predicted probabilities.
	    
	
	This gradient is used by optimization algorithms (such as **gradient ascent**, or more often **Newton-Raphson** or **BFGS**) to find the optimal $\beta$, since setting it to zero has no closed-form solution.
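
The log-likelihood and its gradient $X^\top (y - p)$ can be sketched with NumPy as below. The function names and the tiny dataset are assumptions for illustration, and this simple `sigmoid` is not hardened against very large inputs.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_likelihood(beta, X, y):
    """Sum over i of y_i*log(p_i) + (1 - y_i)*log(1 - p_i), with p = sigmoid(X beta)."""
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))


def log_likelihood_gradient(beta, X, y):
    """Gradient of the log-likelihood: X^T (y - p)."""
    p = sigmoid(X @ beta)
    return X.T @ (y - p)


# Hypothetical data: first column of ones (intercept), one feature
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]])
y = np.array([1.0, 0.0, 1.0, 0.0])
beta = np.zeros(2)

print(log_likelihood(beta, X, y))           # 4 * log(0.5), about -2.7726
print(log_likelihood_gradient(beta, X, y))  # about [0.0, 2.15]
```

At $\beta = 0$ every $p_i$ is $0.5$, so the gradient is $X^\top (y - 0.5)$. Its positive second component says that increasing the feature's coefficient raises the likelihood.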


### Gradient Descent

**Input:**
- Loss function $J(\beta)$
- Gradient $\nabla J(\beta)$
- Learning rate $\alpha$
- Initial parameters $\beta = \beta_0$
- Number of iterations $N$

**For** $i = 1$ to $N$:
1. Compute the gradient: $g = \nabla J(\beta)$
2. Update the parameters: $\beta \leftarrow \beta - \alpha g$

**Return** $\beta$
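
The procedure above can be sketched in Python as follows. The function `gradient_descent`, the example gradient `mse_grad`, the learning rate and the data are all illustrative assumptions. The example minimizes the MSE of a simple linear regression; for logistic regression one would take $J(\beta) = -\ell(\beta)$, whose gradient is $-X^\top (y - p)$.

```python
def gradient_descent(grad, beta0, alpha, n_iters):
    """Repeat beta <- beta - alpha * grad(beta) for a fixed number of steps."""
    beta = list(beta0)
    for _ in range(n_iters):
        g = grad(beta)
        beta = [b - alpha * g_j for b, g_j in zip(beta, g)]
    return beta


# Hypothetical toy data lying on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def mse_grad(beta):
    """Gradient of the MSE loss w.r.t. (b0, b1) for the toy data."""
    b0, b1 = beta
    n = len(xs)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return [
        -2.0 / n * sum(residuals),
        -2.0 / n * sum(x * r for x, r in zip(xs, residuals)),
    ]


beta = gradient_descent(mse_grad, beta0=[0.0, 0.0], alpha=0.05, n_iters=5000)
print([round(b, 4) for b in beta])  # close to [1.0, 2.0]
```

The learning rate matters: if $\alpha$ is too large the updates diverge, and if it is too small convergence is slow. The value 0.05 was picked by hand for this toy problem.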
