# Linear Regression

## Math

|Symbol| Meaning|
|--|--|
|$x$ | feature|
|$y$ | target|
|$x^{(i)}$ | i<sup>th</sup> training example in case of Single Variable Linear Regression|
|$\hat y$ | estimated value of target|
|$m$|No. of training examples|
|$n$|No. of features in a single training example|
|$x^{(i)}_j$ | j<sup>th</sup> feature of i<sup>th</sup> training example for Multiple Variable Linear Regression|
|$\mathbf x^{(i)}$ | i<sup>th</sup> training example for Multiple Variable Linear Regression represented as a vector|

Note on notation: bold capital symbols denote matrices (2d arrays); bold lowercase symbols denote vectors (1d arrays), which are treated as single-column matrices when used in matrix equations.


**Model Representation** (sometimes called a hypothesis function):

For a training set with a single feature:
\begin{align*} f(x^{(i)}) &= \hat y^{(i)} \\
f_{wb}(x^{(i)}) &= wx^{(i)} + b \tag{1} \end{align*}

**Mean Squared Error (MSE)** Cost
\begin{align*}J &= \sum_{i=1}^{m} \frac{(f(x^{(i)})-y^{(i)})^2}{2m} \tag{2} \\
J_{wb} &= \sum_{i=1}^{m} \frac{(wx^{(i)} + b-y^{(i)})^2}{2m} \tag{3} \end{align*}
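As a minimal sketch of eq (3), the cost can be computed with NumPy roughly as follows. The helper name `compute_cost` and the toy data are assumptions made up for illustration.

```python
import numpy as np

def compute_cost(x, y, w, b):
    """MSE cost of eq (3) for single-feature data (hypothetical helper)."""
    m = x.shape[0]
    errors = w * x + b - y              # f(x^(i)) - y^(i) for every example
    return np.sum(errors ** 2) / (2 * m)

# Toy data lying exactly on y = 2x + 1 (made up for illustration)
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([3.0, 5.0, 7.0])

cost_exact = compute_cost(x_train, y_train, w=2.0, b=1.0)   # perfect fit
cost_off = compute_cost(x_train, y_train, w=2.0, b=0.0)     # every error is -1
print(cost_exact, cost_off)
```

With the perfect fit the cost is zero; with $b$ off by 1, every squared error is 1 and the cost is $3/(2 \cdot 3) = 0.5$.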

To find $w$ and $b$, and thus the estimator function, we must minimize the cost $J$ with respect to $w$ and $b$.

For intuition about the Mean Squared Error cost function, consider this: to approximate a bunch of numbers by a single number, we often cite the data's average, or arithmetic mean.

**The mean is the point for which the signed differences from all the numbers sum to zero.**

Let $\bar x$ denote the mean of a set of scalars $x^{(i)}$:

$$ \sum_{i=1}^{m} (x^{(i)}-\bar x) = \sum_{i=1}^{m}x^{(i)} - m\bar x = \sum_{i=1}^{m}x^{(i)}- \sum_{i=1}^{m}x^{(i)} = 0$$


Now suppose we look for a number, say $\mu$, that minimizes the sum of squared differences from all the numbers in a dataset. That number turns out to be the mean.
To minimize $$ \sum_{i=1}^{m} (x^{(i)}-\mu)^2$$
take the derivative of the above term with respect to $\mu$ and set it to zero.
\begin{align*}\dfrac{d}{d \mu} \left(\sum_{i=1}^{m} (x^{(i)}-\mu)^2\right)=0 \\
\sum_{i=1}^{m} \left(2(x^{(i)}-\mu) \cdot \frac{d(x^{(i)}-\mu)}{d \mu}\right) = 0 \\
-2 \sum_{i=1}^{m} (x^{(i)}-\mu) = 0 \\
\sum_{i=1}^{m} (x^{(i)}-\mu) = 0 \\
\Rightarrow \mu = \bar x
\end{align*}


**Thus the (arithmetic) mean is the point that minimizes the sum of squared differences from all the numbers in a dataset.**

**Related fact: the median is the point that minimizes the sum of absolute differences (or distances) from all the numbers in a dataset.**
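Both facts can be checked numerically with a brute-force search over candidate points. This is only a sketch: the data values and the grid resolution are arbitrary choices made for illustration.

```python
import numpy as np

data = np.array([1.0, 2.0, 4.0, 10.0, 13.0])   # made-up numbers
candidates = np.linspace(0.0, 15.0, 1501)      # grid with step 0.01

# Rows are data points, columns are candidate points
diffs = data[:, None] - candidates[None, :]
sq_cost = (diffs ** 2).sum(axis=0)             # sum of squared differences
abs_cost = np.abs(diffs).sum(axis=0)           # sum of absolute differences

best_sq = candidates[np.argmin(sq_cost)]
best_abs = candidates[np.argmin(abs_cost)]
print(best_sq, np.mean(data))      # grid minimizer vs. mean
print(best_abs, np.median(data))   # grid minimizer vs. median
```

Up to the grid resolution, the first minimizer lands on the mean (6.0) and the second on the median (4.0).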

Fun geometric fact: in a Euclidean 2D plane, the arithmetic mean of the x coordinates of a set of points is the number that minimizes the sum of squared x differences, and the mean of the y coordinates does the same for the y differences. Because the two sums are minimized independently, the point whose coordinates are these two means also minimizes their total, which is the sum of squared Euclidean distances from that point to all the points.

Thus on a line, in a plane, and likewise in 3D or higher dimensions, the arithmetic mean of the coordinates of the points gives the coordinates of the point that minimizes the sum of squared L2 norms.

On a line (1D), the L1 and L2 norms are the same and the median minimizes the sum of distances to all the points. In 2D or higher dimensions, the coordinate-wise median minimizes only the sum of L1 norms, i.e. the sum of x and y distances over all points, not the sum of true geometric (L2) distances. The point that minimizes the sum of L2 norms is called the geometric median.

**Thus arithmetic mean minimizes the sum of squared L2 norms, whereas geometric median minimizes the sum of L2 norms and the coordinate-wise median minimizes the sum of L1 norms.** 
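The three statements can be compared numerically on a small 2D point set. This is a rough sketch: the points are made up, and the geometric median is approximated with Weiszfeld's iteration, a technique brought in here for illustration and not discussed in these notes.

```python
import numpy as np

# Made-up 2D points, one per row
points = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0], [6.0, 8.0], [10.0, 1.0]])

def sum_sq_l2(p):
    return np.sum((points - p) ** 2)                    # sum of squared L2 norms

def sum_l2(p):
    return np.sum(np.linalg.norm(points - p, axis=1))   # sum of L2 norms

def sum_l1(p):
    return np.sum(np.abs(points - p))                   # sum of L1 norms

mean_pt = points.mean(axis=0)
cw_median_pt = np.median(points, axis=0)                # coordinate-wise median

# Weiszfeld's iteration (assumed technique): a distance-weighted average,
# repeated a fixed number of times starting from the mean
geo_median_pt = mean_pt.copy()
for _ in range(200):
    dist = np.maximum(np.linalg.norm(points - geo_median_pt, axis=1), 1e-12)
    geo_median_pt = (points / dist[:, None]).sum(axis=0) / np.sum(1.0 / dist)

for name, p in [("mean", mean_pt),
                ("coordinate-wise median", cw_median_pt),
                ("geometric median", geo_median_pt)]:
    print(f'{name}: {p}, sq L2 = {sum_sq_l2(p):.3f}, '
          f'L2 = {sum_l2(p):.3f}, L1 = {sum_l1(p):.3f}')
```

Each of the three points should show the smallest value in its own column: the mean for squared L2, the geometric median for L2, and the coordinate-wise median for L1.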

Coming back to the cost function: just as the mean minimizes the sum of squared differences and zeroes out the sum of differences, when fitting a straight line for the linear regression problem we pick a cost function built from the sum of squared error terms and minimize it, hoping to cancel out the differences and represent the data as closely as possible. We divide the sum of squared errors by the number of training examples because the sum is expected to grow with the number of examples; dividing by $m$ lets us compare costs between training sets of different sizes. The extra factor of 2 in the denominator is a convenience that cancels the 2 produced by differentiation.

Also note that when we approximate a bunch of numbers on a number line by a single number, choosing the mean both minimizes the sum of squared errors and makes the algebraic sum of errors zero.
When finding the straight line fit for the single-variable (single-feature) linear regression problem, making the sum of errors zero (which gives the equation below) does not by itself give a unique solution.
$$\sum_{i=1}^{m} (wx^{(i)} + b-y^{(i)})=0$$
The above equation has 2 unknowns ($w$ and $b$), so we need something more to determine both uniquely. This is where the MSE cost comes in: minimizing it yields both the above equation and the additional equation required to find $w$ and $b$.

From eq (3) note that $J$ is quadratic in both $w$ and $b$.

\begin{align*}\frac {\partial J}{\partial w} &= \sum_{i=1}^{m} \frac {(wx^{(i)} + b-y^{(i)})x^{(i)}}{m}\tag{4} \\
\frac {\partial J}{\partial b} &= \sum_{i=1}^{m} \frac {(wx^{(i)} + b-y^{(i)})}{m} \tag{5} \end{align*}

Equating both the partial derivatives to zero we get:
\begin{align*} \sum_{i=1}^{m} \left((wx^{(i)} + b-y^{(i)})x^{(i)}\right) &=0 \tag{6} \\
\sum_{i=1}^{m} (wx^{(i)} + b-y^{(i)})&=0 \tag{7} \end{align*}

Eq (7) is the one we wrote earlier when we set out to make the sum of error terms zero; we now see that it is one of the results of minimizing the MSE cost function. Together, the above 2 equations give us sufficient information to solve for the 2 unknowns $w$ and $b$.
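Since eqs (6) and (7) are linear in $w$ and $b$, they can be solved as a 2x2 linear system. Below is a minimal sketch on made-up data; the variable names `A` and `rhs` are illustrative, and `np.polyfit` is used only as an independent cross-check.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])   # made-up data, roughly y = 2x + 1
m = x.shape[0]

# Eq (6): w*sum(x^2) + b*sum(x) = sum(x*y)
# Eq (7): w*sum(x)   + b*m      = sum(y)
A = np.array([[np.sum(x ** 2), np.sum(x)],
              [np.sum(x),      m]])
rhs = np.array([np.sum(x * y), np.sum(y)])
w, b = np.linalg.solve(A, rhs)

residuals = w * x + b - y
print(f'w = {w:.4f}, b = {b:.4f}')
print(np.sum(residuals), np.sum(residuals * x))   # both approximately 0
print(np.polyfit(x, y, 1))                        # [slope, intercept] cross-check
```

The two printed sums are the left-hand sides of eqs (7) and (6) at the solution, so both should vanish up to floating point error.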

For **Multiple Variable Linear Regression**, the equations in vector form are:
\begin{align*} f^{(i)} &= \mathbf w \cdot \mathbf x^{(i)} + b \tag{8}\\
J &= \sum_i \frac {(f^{(i)}-y^{(i)})^2}{2m} \tag{9}\\
\frac {\partial J}{\partial \mathbf w} &= \sum_i \frac{(f^{(i)}-y^{(i)})\mathbf x^{(i)}}{m} \tag{10}\\
\frac {\partial J}{\partial b} &= \sum_i \frac{(f^{(i)}-y^{(i)})}{m} \tag{11} \end{align*}

If the training data set is given as a matrix $\mathbf X$ in which every training example occupies its own row and every feature its own column, we can write the following matrix equations (the scalar $b$ is added to every component of $\mathbf X \mathbf w$):
\begin{align*} \mathbf f &= \mathbf X \mathbf w + b \tag{12} \\
\frac {\partial J}{\partial \mathbf w} &= \frac {\mathbf X^T(\mathbf f -\mathbf y)}{m} \tag{13} \end{align*}
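A minimal sketch of eqs (11) to (13) in NumPy, checked against the per-example sum of eq (10). The helper name `compute_gradients` and all the numbers are assumptions for illustration.

```python
import numpy as np

def compute_gradients(X, y, w, b):
    """Vectorized gradients of J (hypothetical helper)."""
    m = X.shape[0]
    errors = X @ w + b - y          # eq (12) minus the targets: f - y
    dj_dw = X.T @ errors / m        # eq (13)
    dj_db = np.sum(errors) / m      # eq (11)
    return dj_dw, dj_db

# Made-up data: 3 examples (rows), 2 features (columns)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.5])
b = 0.1

dj_dw, dj_db = compute_gradients(X, y, w, b)

# Same gradient accumulated one example at a time, as in eq (10)
dj_dw_loop = np.zeros_like(w)
for i in range(X.shape[0]):
    f_i = np.dot(w, X[i]) + b
    dj_dw_loop += (f_i - y[i]) * X[i] / X.shape[0]

print(dj_dw, dj_dw_loop, dj_db)
```

The matrix form and the loop should agree; the matrix form simply lets NumPy do the summation over examples.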

In some contexts, $\mathbf w$ and $b$ are combined into a single parameter vector $\boldsymbol{\theta}$, with $b$ as the first component of $\boldsymbol{\theta}$ and the remaining components coming from $\mathbf w$. Here the feature matrix must be prefixed with a column of ones. With this notation, the following holds:
$$\mathbf f = \mathbf X\boldsymbol{\theta}$$
Optimizing $J$:
\begin{align*} \frac {\partial J}{\partial \boldsymbol{\theta}} &=0 \\
\frac {\mathbf X^T(\mathbf X\boldsymbol{\theta} - \mathbf y)}{m} &= 0\\
\mathbf X^T \mathbf X\boldsymbol{\theta} &= \mathbf X^T\mathbf y\\
\boldsymbol{\theta} &= (\mathbf X^T \mathbf X)^{-1}\mathbf X^T\mathbf y \tag{14} \end{align*}

The above equation, known as the **normal equation**, gives the exact least-squares solution of linear regression, provided $\mathbf X^T \mathbf X$ is invertible.
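A sketch of eq (14) on synthetic, noise-free data, so the recovered parameters can be compared with the ones used to generate the targets. The data, the seed and `true_theta` are made up; solving the linear system with `np.linalg.solve` in place of forming the inverse explicitly is a choice made here, not something these notes prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.random((50, 2))                 # 50 examples, 2 features (synthetic)
true_theta = np.array([1.0, 2.0, -3.0])     # [b, w_1, w_2], chosen arbitrarily

# Prefix the feature matrix with a column of ones so that f = X @ theta
X = np.column_stack([np.ones(X_raw.shape[0]), X_raw])
y = X @ true_theta                          # noise-free targets

# Solve X^T X theta = X^T y, the line just before eq (14)
theta = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver as a cross-check
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta, theta_lstsq)
```

Both results should reproduce `true_theta` up to floating point error.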

### Gradient Descent and cost plots wrt parameters

Even though we have an analytical solution for Linear Regression, an iterative algorithm such as Gradient Descent (with proper techniques like normalization) can often get us close to an acceptable solution faster than the normal equation. This approach also works for other types of regression and classification where we don't have an analytical solution. So the recommendation is to almost never implement the normal equation method ourselves in a real-world ML problem (it's possible that some libraries use the normal equation internally). This advice is from Andrew Ng; some of it needs more research to verify.

Going back to the single variable Linear Regression problem, $J$ is quadratic in both $w$ and $b$ (eq (3)) and when plotted forms an upward facing bowl like shape ($w$ and $b$ being in the xy plane and $J$ along the z axis that's pointing upwards). Since $J$ is quadratic in both $w$ and $b$, keeping either of $w$ or $b$ constant and plotting $J$ vs the other produces an upwards facing parabola. Expanding the terms in eq (3) gives:

\begin{align*} J &= \sum \frac{(x^{(i)})^2w^2 + b^2 + (y^{(i)})^2 + 2x^{(i)}wb - 2y^{(i)}b - 2x^{(i)}y^{(i)}w} {2m} \\
\\
&= \frac{\displaystyle{\left(\sum({x^{(i)}})^2\right) w^2 +\ \left(2\sum x^{(i)} \right) wb + mb^2 - \left(2\sum x^{(i)}y^{(i)} \right)w - \left(2\sum y^{(i)} \right) b + \sum(y^{(i)})^2}} {2m} \end{align*}

Now $\displaystyle{m\sum({x^{(i)}})^2 - \left(\sum x^{(i)}\right)^2 \ge 0}$ for any set of real $x^{(i)}$, with strict inequality whenever the $x^{(i)}$ are not all equal. In that case, setting the above expression equal to any value greater than the minimum cost gives the equation of an ellipse (in the variables $w$ and $b$), and thus the contour plots of the cost are ellipses.
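A short sketch of why this quantity decides the shape, using the standard discriminant test for conic sections. The labels $A$, $B$ and $C$ are introduced here for illustration; they are the coefficients of $w^2$, $wb$ and $b^2$ in the numerator of the expanded cost:

\begin{align*}
A &= \sum (x^{(i)})^2, \qquad B = 2\sum x^{(i)}, \qquad C = m \\
B^2 - 4AC &= 4\left(\sum x^{(i)}\right)^2 - 4m\sum (x^{(i)})^2 = -4m\sum \left(x^{(i)} - \bar x\right)^2
\end{align*}

The right-hand side is negative whenever the $x^{(i)}$ are not all equal, and a negative discriminant is the condition for the level curves of a quadratic in two variables to be ellipses.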

Going back to Gradient Descent, the objective is to move from an initial point in parameter space along the direction of steepest descent of the cost. We keep updating the parameters $w$ and $b$ in the direction of steepest descent, scaled by a learning rate $\alpha$, until successive steps stop providing an appreciable cost reduction or until we reach a minimum of $J$, which is a function of $w$ and $b$. In general the minimum reached through Gradient Descent is a local minimum, but in the case of linear regression it is also the global one.

We need to keep $\alpha$ small enough that the cost does not overshoot in successive iterations, but big enough not to slow down convergence to the solution.

Gradient Descent algorithm:
\begin{align*} w &\to w - \alpha \frac {\partial J}{\partial w} \\
b &\to b - \alpha \frac {\partial J}{\partial b} \end{align*} 
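The update rule can be sketched in a few lines of NumPy for the single-feature case. The function name, the learning rate, the iteration count and the toy data are assumptions picked so that this small example converges; they are not recommended defaults.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, num_iters=5000):
    """Fit w and b by repeating the update rule above (illustrative sketch)."""
    w, b = 0.0, 0.0
    m = x.shape[0]
    for _ in range(num_iters):
        errors = w * x + b - y
        dj_dw = np.sum(errors * x) / m   # eq (4)
        dj_db = np.sum(errors) / m       # eq (5)
        # Both gradients are computed before either parameter changes
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b

# Toy data lying exactly on y = 2x + 1 (made up for illustration)
x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = 2.0 * x_train + 1.0

w_fit, b_fit = gradient_descent(x_train, y_train)
print(w_fit, b_fit)
```

On this data the parameters approach $w = 2$ and $b = 1$. A larger $\alpha$ would eventually make the iterates diverge, and a much smaller one would need many more iterations.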

## NumPy

NumPy is the most popular Python library for linear algebra, i.e. for doing math on vectors and matrices. Vectors are represented as 1d arrays and matrices as 2d arrays, and NumPy supports higher-dimensional arrays as well. The total number of dimensions (also called **axes** in NumPy terminology) is referred to here as the **rank** of the NumPy array; NumPy exposes it as the `ndim` attribute. A 2d np array can be thought of as an array of m arrays, each of which in turn contains n scalars. We say (m,n) is the **shape** of this 2d array (the shape is a tuple in Python). The rank here is 2, so the length of the shape is the rank. m is the length of the first axis and n is the length of the last axis. When such an np array is printed, it's shown as a grid of m rows and n columns.

### Some properties/syntax with examples:

In [1]:
import numpy as np

In [11]:
a = np.array([1,2,3,4,5])
print(a)
print(f'shape of a: {a.shape}')
print(f'rank of a: {len(a.shape)}')
print(f'data type of a: {type(a)}')
print(f'data type of elements of a: {a.dtype}')

[1 2 3 4 5]
shape of a: (5,)
rank of a: 1
data type of a: <class 'numpy.ndarray'>
data type of elements of a: int32


In [24]:
b= np.array([[1.0,2,3,4],[5,6,7,8]])
print(b)
print(f'shape of b: {b.shape}')
print(f'rank of b: {len(b.shape)}')
print(f'data type of elements of b: {b.dtype}')

[[1. 2. 3. 4.]
 [5. 6. 7. 8.]]
shape of b: (2, 4)
rank of b: 2
data type of elements of b: float64


All the elements of an np array must be of the same data type. Above, the integers were converted to float64 because one element was given as a float.

In [60]:
c=np.array(3.0)
print(c)
print(f'shape of c: {c.shape}')
print(f'rank of c: {len(c.shape)}')
print(f'data type of elements of c: {c.dtype}')
print(f'type of c: {type(c)}')

3.0
shape of c: ()
rank of c: 0
data type of elements of c: float64
type of c: <class 'numpy.ndarray'>


*An np array can also have 0 dimensions or rank = 0 but can still contain a scalar value*

In [29]:
np.array(3) == c

True

*When checking equality, `==` compares the values, not the data types*

In [14]:
np.zeros(2) # 1d array of 2 elements

array([0., 0.])

In [21]:
np.zeros((2,)) #Passing a scalar or a tuple with length 1 as a parameter produces same result

array([0., 0.])

In [39]:
d=np.zeros((3,4,2))
print(d)

[[[0. 0.]
  [0. 0.]
  [0. 0.]
  [0. 0.]]

 [[0. 0.]
  [0. 0.]
  [0. 0.]
  [0. 0.]]

 [[0. 0.]
  [0. 0.]
  [0. 0.]
  [0. 0.]]]


In [20]:
np.arange(10) # takes numeric start/stop/step arguments, not a shape tuple. A single integer n produces a 1d array of n elements starting from 0.

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [32]:
np.random.rand(10)

array([0.78745649, 0.22815109, 0.04346343, 0.60629973, 0.56631849,
       0.68268549, 0.53360991, 0.50686169, 0.55880697, 0.80054397])

`random.rand` takes the dimensions as separate integer arguments (not as a tuple) and produces an array of that shape with values drawn uniformly from [0,1). `random.random_sample` takes the shape as a single argument, which can be an integer or a tuple.
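A small sketch of the two calling conventions side by side; the variable names `r1` and `r2` are made up for this example.

```python
import numpy as np

r1 = np.random.rand(5, 2)               # dimensions passed as separate integers
r2 = np.random.random_sample((5, 2))    # shape passed as a single tuple

print(r1.shape, r2.shape)               # both are (5, 2)
print(r1.min() >= 0.0, r1.max() < 1.0)  # values lie in [0, 1)
```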

In [33]:
np.random.random_sample(10)

array([0.7583801 , 0.93510172, 0.88793483, 0.48953159, 0.91958823,
       0.5794756 , 0.3711545 , 0.63061935, 0.68224789, 0.69938625])

In [34]:
np.random.random_sample((5,2))

array([[0.34719564, 0.66872739],
       [0.02564214, 0.12422266],
       [0.68309453, 0.55549225],
       [0.58875464, 0.0978877 ],
       [0.91238643, 0.98325602]])

Like lists in Python we can slice an np array with an identical syntax `a[start:stop:step]`

In [35]:
a[1:4:2]

array([2, 4])

In [42]:
print(b[1,:]) # from row number 1 pick all columns
print(b[:,1]) # from every row, pick column number 1

[5. 6. 7. 8.]
[2. 6.]


Both the above produce 1d arrays from 2d array

In [43]:
print(b[:,1:2])

[[2.]
 [6.]]


Notice the difference between the above 2 results. When a range is given in the slice, that axis is kept and the result stays 2d; when an explicit integer index is given, that axis is dropped and the result has one dimension fewer.
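Printing the shapes makes the difference explicit. The array repeats the values of `b` used above; the names `col_index` and `col_slice` are made up for this sketch.

```python
import numpy as np

b = np.array([[1.0, 2, 3, 4], [5, 6, 7, 8]])

col_index = b[:, 1]      # integer index: the column axis is dropped
col_slice = b[:, 1:2]    # slice: the column axis is kept with length 1

print(col_index.shape)   # (2,)
print(col_slice.shape)   # (2, 1)
```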

In [44]:
e=np.array([[1],[2],[3],[4],[5]])
e

array([[1],
       [2],
       [3],
       [4],
       [5]])

In [45]:
e[:,0]

array([1, 2, 3, 4, 5])

We can also get above using **reshape**.

In [47]:
f=e.reshape((5,))
f

array([1, 2, 3, 4, 5])

and get back to original by this:

In [48]:
f.reshape((5,1))

array([[1],
       [2],
       [3],
       [4],
       [5]])

While providing the expected shape parameter in the reshape method, we can put a -1 in exactly 1 of the places and it figures out the right shape automatically

In [50]:
f.reshape((-1,1))

array([[1],
       [2],
       [3],
       [4],
       [5]])

In [49]:
e.reshape((-1,))

array([1, 2, 3, 4, 5])

#### Broadcasting

Broadcasting is an elegant feature of NumPy arrays that lets us do arithmetic between arrays of different shapes.

There are only 2 important rules to remember about broadcasting:
1. Shapes are compared starting from the trailing axes/dimensions.
So if a.shape = (5,4,2) \
and b.shape = (4,2) \
then we can do arithmetic with a and b, because the last 2 dimensions are the same.

2. A pair of dimensions is compatible if the two are equal or if one of them is 1.\
So if b.shape were (4,1), a and b would still be compatible.\
It would also have worked if the shape of b had been (2,), since missing leading dimensions are treated as 1.
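The shapes from the two rules above can be tried out directly. This is a sketch only; the array name `a3d` and the result names are made up for illustration.

```python
import numpy as np

a3d = np.ones((5, 4, 2))

shape_same_trailing = (a3d + np.ones((4, 2))).shape   # trailing (4, 2) matches
shape_with_one = (a3d + np.ones((4, 1))).shape        # the 1 stretches to 2
shape_1d = (a3d + np.ones(2)).shape                   # (2,) matches the last axis
print(shape_same_trailing, shape_with_one, shape_1d)

# Incompatible: the trailing dimensions are 2 and 4, and neither is 1
try:
    a3d + np.ones(4)
    incompatible_failed = False
except ValueError as err:
    incompatible_failed = True
    print(f'ValueError: {err}')
```

In all three compatible cases the result takes the larger shape, (5, 4, 2).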

#### Dot Product

`np.dot(a,b)`: if $a$ and $b$ are both 1d arrays, it is exactly the dot product of 2 vectors ($a$ and $b$ must have the same shape, i.e. the same number of components), which is the sum of the products of their respective components, or sum product in short. Either $a$ or $b$, or both, can also be scalars, in which case the result is an ordinary multiplication.

In [62]:
dot_product_scalars = np.dot(3,4)
print(dot_product_scalars)
print(type(dot_product_scalars))

12
<class 'numpy.int32'>


In [73]:
dot_product_0d_arrays = np.dot(np.array(3),np.array(4))
print(dot_product_0d_arrays)
print(type(dot_product_0d_arrays))

12
<class 'numpy.int32'>


In [71]:
a=np.array([3,4])
b=np.array([1.,0])
dot_product_vectors = np.dot(a,b)
print(dot_product_vectors)
print(type(dot_product_vectors))

3.0
<class 'numpy.float64'>


In [72]:
print(np.dot(a,2)) #dot product between vector and scalar

[6 8]


In [75]:
np.dot(a,np.array([2]))

ValueError: shapes (2,) and (1,) not aligned: 2 (dim 0) != 1 (dim 0)

Broadcasting doesn't apply to `dot`; it's meant for elementwise arithmetic like addition, subtraction, multiplication etc.

In [76]:
a*np.array([2])

array([6, 8])

**If $a$ is an N-D array and $b$ is a 1-D array, it is a sum product over the last axis of $a$ and $b$.**\
Above property implies that if $a$ is a 2d array then the result is exactly like a matrix multiplication in maths, as if $b$ were a single column matrix.

Thus when we need to multiply a matrix of multiple rows and columns by a single-column matrix, we don't have to represent it in NumPy as a matrix multiplication between two 2d arrays. For simplicity, we can write it as a dot product of the 2d array with a 1d array standing in for the single-column matrix, although technically a matrix is by definition always a 2d object, whether or not it contains multiple columns.



In [82]:
X=np.array([[1,2],[3,4],[5,6]])
w=np.array([3,5])
print(f'X:\n{X}')
print(f'w:{w}')
print(f'X dot w: {np.dot(X,w)}')

X:
[[1 2]
 [3 4]
 [5 6]]
w:[3 5]
X dot w: [13 29 45]


In [85]:
W= np.array([[3],[5]])
print(f'X:\n{X}')
print(f'W:\n{W}')
print(f'XW:\n{X@W}')

X:
[[1 2]
 [3 4]
 [5 6]]
W:
[[3]
 [5]]
XW:
[[13]
 [29]
 [45]]


*`@` is the operator for matrix multiplication. `a@b` is shorthand for `np.matmul(a,b)`.* **Incidentally np.dot between two 2d arrays (2 matrices) produces the same result as matmul although matmul or `@` is preferred**

In [88]:
print(f'dot:\n{np.dot(X,W)}')
print(f'matmul:\n{np.matmul(X,W)}')

dot:
[[13]
 [29]
 [45]]
matmul:
[[13]
 [29]
 [45]]


**In fact, if $a$ and $b$ are np arrays with 1 <= rank <= 2, `np.dot` and `np.matmul` produce the same result. `matmul` doesn't accept a scalar or even a 0d np array as an operand, so if the operands could possibly be scalars, we must use `dot` instead of `matmul`.**

In [89]:
print(f'X dot w: {np.dot(X,w)}')
print(f'X matmul w: {np.matmul(X,w)}')

X dot w: [13 29 45]
X matmul w: [13 29 45]


In [90]:
a=np.array([3,4])
b=np.array([1.,0])
print(f'dot_product_vectors: {np.dot(a,b)}')
print(f'matmul_vectors: {np.matmul(a,b)}')

dot_product_vectors: 3.0
matmul_vectors: 3.0


In [92]:
c=np.array([3])
d=np.array(2)
np.matmul(c,d)

ValueError: matmul: Input operand 1 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?,m?) requires 1)

In [93]:
np.matmul(c,2)

ValueError: matmul: Input operand 1 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?,m?) requires 1)