# "Machine Learning" Course from [Coursera](https://www.coursera.org/learn/machine-learning/home/week/2)

## My Understanding
- There are traditional, closed-form ways to minimize the cost function (squared error function), such as the normal equation (least squares method).
- The problem with the closed-form approach is its computational cost when there are thousands or millions of parameters; in that setting, gradient descent scales better.
- The idea of gradient descent is to search for a local minimum iteratively: first compute the derivative (the slope of the tangent at the current point), then take a step against it whose size is set by the learning rate, and repeat until the parameters converge to a local minimum (for a convex quadratic function there is only one minimum, the global one).

## Multivariate Linear Regression
- **Model Representation**
    - Hypothesis with multiple features: <BR>
    define $x_{0}^{(i)} = 1$ for $i \in \{1, \dots, m\}$, <BR>
    $x_{j}^{(i)}$ = value of feature $j$ in the $i^{th}$ training example, <BR>
    $x^{(i)}$ = the input (features) of the $i^{th}$ training example, <BR>
    $m$ = the number of training examples, <BR>
    $n$ = the number of features <BR>
$h_\theta{(x)} = \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \theta_{3}x_{3} + \dots + \theta_{n}x_{n}$
<BR>
$x = \begin{bmatrix}x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n\end{bmatrix} \in \mathbb {R}^{n+1}$,   
$\theta = \begin{bmatrix}\theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n\end{bmatrix} \in \mathbb {R}^{n+1}$
<BR>
Then we can get
<BR>
$h_\theta(x) = \begin{bmatrix}\theta_0 & \theta_1 & \dots & \theta_n\end{bmatrix}\begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n\end{bmatrix} = \theta^T x$

![Multivariate%20Linear%20Regression.png](./W2/Multivariate%20Linear%20Regression.png)    
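
As a minimal sketch of the vectorized form $h_\theta(x) = \theta^T x$, the NumPy snippet below stacks the training examples as rows of a matrix `X`, so that `X @ theta` evaluates the hypothesis for every example at once. The toy feature values and the parameter vector are made up for illustration; they are not taken from the course data.

```python
import numpy as np

# Hypothetical toy data: m = 3 training examples, n = 2 features.
features = np.array([[2104.0, 3.0],
                     [1600.0, 3.0],
                     [2400.0, 4.0]])
m = features.shape[0]

# Prepend the x_0 = 1 column so that X has shape (m, n + 1).
X = np.hstack([np.ones((m, 1)), features])

# Arbitrary illustrative parameters theta_0, theta_1, theta_2, shape (n + 1, 1).
theta = np.array([[1.0], [0.5], [2.0]])

# h_theta(x^(i)) = theta^T x^(i), computed for every row of X at once.
predictions = X @ theta

# The same value for the first example, written as the explicit sum over j.
first_by_hand = sum(theta[j, 0] * X[0, j] for j in range(X.shape[1]))

print(predictions.ravel())
print(first_by_hand)
```

Storing one example per row means the column vector $x^{(i)}$ from the definition appears as row $i$ of `X`, which is why the code multiplies `X @ theta` rather than `theta.T @ X`.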


## Cost Function
- Hypothesis: $h_\theta{(x)} = \theta^T x = \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \theta_{3}x_{3} + \dots + \theta_{n}x_{n}$
- Parameters: $\theta = (\theta_0, \theta_1, \dots, \theta_n)$
- Cost Function or Squared error function: $J(\theta) = J(\theta_0, \theta_1, \dots, \theta_n) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}^{(i)}- y^{(i)} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2$
- Goal: minimize $J(\theta)$
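
To make the formula concrete, here is a small sketch of $J(\theta)$ in NumPy, evaluated on a hypothetical three-point data set that lies exactly on the line $y = 1 + 2x$. The function name `compute_cost` and the data are assumptions chosen for illustration.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared error cost J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    m = y.shape[0]                 # number of training examples
    residuals = X @ theta - y      # h_theta(x^(i)) - y^(i) for every i
    return float(np.sum(residuals ** 2) / (2 * m))

# Hypothetical toy data; the first column of X is x_0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([[3.0], [5.0], [7.0]])

# With theta = 0 every prediction is 0, so J = (3^2 + 5^2 + 7^2) / (2 * 3).
cost_at_zero = compute_cost(X, y, np.zeros((2, 1)))

# With theta = (1, 2) the line passes through every point, so J = 0.
cost_at_optimum = compute_cost(X, y, np.array([[1.0], [2.0]]))

print(cost_at_zero, cost_at_optimum)
```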



## Gradient descent
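- Applying the gradient descent algorithm to the cost function requires its partial derivative with respect to each parameter: $\dfrac{\partial}{\partial\theta_{j}} J(\theta) = \dfrac{\partial}{\partial\theta_{j}} \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2 = \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right) \cdot x_{j}^{(i)}$
- The gradient descent update has the same form as in the single-feature case; we repeat it for every parameter $\theta_j$, $j = 0, \dots, n$, updating all of them simultaneously: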
<br>
repeat until convergence {
<br>
$\theta_{j} := \theta_{j} - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right) \cdot x_{j}^{(i)}$
<br>
$\text{for } j := 0, \dots, n \ \rbrace$
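
The update rule above can be sketched in vectorized NumPy as follows. This is one possible implementation under stated assumptions, not the course's reference code: the function name `gradient_descent`, the toy data, the learning rate and the iteration count are all chosen for illustration.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent with a simultaneous update of every theta_j."""
    m = y.shape[0]
    theta = theta.astype(float).copy()
    cost_history = []
    for _ in range(num_iters):
        error = X @ theta - y              # h_theta(x^(i)) - y^(i), shape (m, 1)
        gradient = (X.T @ error) / m       # entry j is (1/m) * sum(error * x_j)
        theta = theta - alpha * gradient   # all theta_j change in the same step
        cost_history.append(float(np.sum((X @ theta - y) ** 2) / (2 * m)))
    return theta, cost_history

# Hypothetical toy data on the line y = 1 + 2x; first column is x_0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([[3.0], [5.0], [7.0]])

theta, cost_history = gradient_descent(X, y, np.zeros((2, 1)),
                                       alpha=0.1, num_iters=2000)
print(theta.ravel())       # should approach (1, 2)
print(cost_history[-1])    # should approach 0
```

Computing `X.T @ error` evaluates the sum over $i$ for all $j$ in one matrix product, which also guarantees that every $\theta_j$ is updated from the same old value of $\theta$.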


## Gradient Descent in Practice I - Feature Scaling
- When the features have very different scales, gradient descent can take a long time to converge; putting the features in roughly the same range speeds it up. _(This is because θ descends quickly on small ranges and slowly on large ranges, so it oscillates inefficiently down to the optimum when the variables are very uneven.)_
- Two techniques to help with this are **feature scaling** and **mean normalization**
<BR> $x_i := \dfrac {x_i - \mu_i}{s_i}$
<BR>where $\mu_i$ is the mean of all values of feature $i$, and $s_i$ is either the range of those values (max - min) or their standard deviation.
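
A minimal sketch of this transformation, supporting both choices of $s_i$, might look like the following. The helper name `mean_normalize`, its `use_range` flag and the toy house data are assumptions for illustration, and the code assumes no feature is constant (otherwise $s_i = 0$ and the division fails).

```python
import numpy as np

def mean_normalize(features, use_range=False):
    """Return (scaled, mu, s) with scaled = (features - mu) / s, column by column."""
    mu = features.mean(axis=0)
    if use_range:
        s = features.max(axis=0) - features.min(axis=0)   # max - min
    else:
        s = features.std(axis=0)                          # standard deviation
    return (features - mu) / s, mu, s

# Hypothetical data: house size and number of bedrooms.
features = np.array([[2104.0, 3.0],
                     [1600.0, 3.0],
                     [2400.0, 4.0],
                     [1416.0, 2.0]])

scaled_by_std, mu, s = mean_normalize(features)
scaled_by_range, _, _ = mean_normalize(features, use_range=True)

# A new example must be scaled with the training mu and s, not its own.
new_example = np.array([[1650.0, 3.0]])
new_scaled = (new_example - mu) / s

print(scaled_by_std)
print(new_scaled)
```

Returning `mu` and `s` alongside the scaled array matters because predictions for unseen inputs need exactly the same transformation that was applied to the training set.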


## Gradient Descent in Practice II - Learning Rate
- **Debugging gradient descent**: Make a plot with number of iterations on the x-axis. Now plot the cost function, $J(\theta)$ over the number of iterations of gradient descent. If $J(\theta)$ ever increases, then you probably need to decrease α
- To summarize: 
    - For sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration
    - If $\alpha$ is too small: slow convergence.
    - If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and thus may not converge.
    - Plot $J(\theta)$ against the number of iterations for several values of $\alpha$, and choose a value just below the largest one for which $J(\theta)$ still decreases on every iteration.


![Learning%20Rate.png](./W2/Learning%20Rate.png)    
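
The debugging procedure can be sketched without a plotting library by recording $J(\theta)$ after every iteration and checking whether it ever increases. The three learning rates, the toy data and the helper names below are assumptions for illustration; the recorded histories could equally be passed to a plotting tool.

```python
import numpy as np

def cost(X, y, theta):
    m = y.shape[0]
    return float(np.sum((X @ theta - y) ** 2) / (2 * m))

def cost_history(X, y, alpha, num_iters):
    """Run gradient descent from theta = 0 and record J(theta) at every step."""
    m = y.shape[0]
    theta = np.zeros((X.shape[1], 1))
    history = [cost(X, y, theta)]
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        history.append(cost(X, y, theta))
    return history

def decreases_every_iteration(history):
    return all(later < earlier for earlier, later in zip(history, history[1:]))

# Hypothetical toy data on the line y = 1 + 2x; first column is x_0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([[3.0], [5.0], [7.0]])

histories = {alpha: cost_history(X, y, alpha, num_iters=50)
             for alpha in (0.01, 0.1, 0.5)}

for alpha, history in histories.items():
    print(alpha, decreases_every_iteration(history), history[-1])
```

On this particular data set the smallest rate decreases slowly, the middle one decreases faster, and the largest one makes the cost grow, which mirrors the three cases summarized in the list above.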



## Features and Polynomial Regression
- We can sometimes design better features by combining existing ones: for example, instead of using frontage and depth as two separate features to predict a house price, we can use area (frontage * depth) as a single feature that may fit the data better.
- For data that follows a curve, we can use polynomial regression (quadratic function, cubic function, square root function, etc.)
    - Square root function: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1} $
    - Cubic function: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 $
- When doing so, feature scaling becomes important, because terms such as $x_1^2$ and $x_1^3$ have very different ranges from $x_1$.

![Polynomial%20Regression.png](./W2/Polynomial%20Regression.png)    
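
As a rough sketch of how the polynomial terms become ordinary features, the snippet below builds $x_1$, $x_1^2$ and $x_1^3$ from a single hypothetical size column and then scales them. The sample sizes are invented for illustration.

```python
import numpy as np

# Hypothetical single feature x_1 (for example, house size).
size = np.array([[100.0], [250.0], [500.0], [1000.0]])

# Cubic model: treat x_1, x_1^2 and x_1^3 as three separate features.
poly = np.hstack([size, size ** 2, size ** 3])

# Square root model: treat x_1 and sqrt(x_1) as two separate features.
sqrt_features = np.hstack([size, np.sqrt(size)])

# The raw ranges differ by orders of magnitude, so scale each column.
print(poly.max(axis=0))
mu = poly.mean(axis=0)
sigma = poly.std(axis=0)
poly_scaled = (poly - mu) / sigma

# Prepend x_0 = 1; X can now be used like any multivariate design matrix.
X = np.hstack([np.ones((poly.shape[0], 1)), poly_scaled])
print(X.shape)
```

Once the columns are built, the model is still linear in $\theta$, so the same cost function and gradient descent update apply unchanged.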


## Normal Equation
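- The normal equation (least squares method) finds the optimal $\theta$ analytically, without iteration.
    - The normal equation formula: $\theta = (X^T X)^{-1}X^Ty$
    - $X$ is the $m \times (n+1)$ matrix whose $i^{th}$ row is $(x^{(i)})^T$ (including $x_0 = 1$), $y$ is the vector of the $m$ target values, $m$ is the number of observations, and $n$ is the number of features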
<BR>

|Gradient Descent	|Normal Equation|
|-----------|-----------|
|Need to choose alpha	|No need to choose alpha|
|Needs many iterations  |No need to iterate|
|Need to do feature scaling	|No need to do feature scaling|
|Computations $O(kn^2)$	|Computations $O(n^3)$, need to calculate inverse of $X^T X$|
|Works well when n is large	|Slow if n is very large|
<BR>

- In practice, when $n$ exceeds about 10,000 it may be a good time to switch from the normal equation to an iterative method such as gradient descent
- In some cases, $X^T X$ might be **noninvertible**: 1) **Redundant features**, where two features are very closely related (i.e. they are linearly dependent); 2) **Too many features** (e.g. m ≤ n). In this case, delete some features or use "regularization".
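
The formula and the noninvertible case can be illustrated with a short NumPy sketch. It uses the pseudo-inverse `np.linalg.pinv` instead of a plain inverse so that it still returns an answer when $X^T X$ is singular. The function name `normal_equation`, the toy data and the explicit tolerances (`rcond`, `tol`) are assumptions chosen for this example.

```python
import numpy as np

def normal_equation(X, y):
    """theta = pinv(X^T X) X^T y; pinv also copes with a singular X^T X."""
    return np.linalg.pinv(X.T @ X, rcond=1e-10) @ X.T @ y

# Hypothetical toy data on the line y = 1 + 2x; first column is x_0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([[3.0], [5.0], [7.0]])

theta = normal_equation(X, y)   # no alpha, no iterations, no feature scaling
print(theta.ravel())

# Redundant feature: the new third column is exactly 2 * the second column,
# so the columns are linearly dependent and X^T X is not invertible.
X_redundant = np.hstack([X, 2.0 * X[:, [1]]])
gram = X_redundant.T @ X_redundant
rank = np.linalg.matrix_rank(gram, tol=1e-10)
print(rank, gram.shape[0])      # rank is smaller than the matrix size

theta_redundant = normal_equation(X_redundant, y)
predictions_match = np.allclose(X_redundant @ theta_redundant, y)
print(predictions_match)
```

With the redundant column the parameters are no longer unique, but the pseudo-inverse picks one solution and the predictions still match the targets; dropping the redundant feature or adding regularization restores a unique $\theta$.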



## Linear Regression with Multiple Variables

In [1]:
import numpy as np
import pandas as pd

np.set_printoptions(suppress=True)

# x = np.loadtxt('./W2/ex1data2.txt',delimiter=',')
df = pd.read_csv('./W2/ex1data1.txt',delimiter=',', header=None)



x = df.iloc[:,:1].to_numpy()
y = df.iloc[:,1:].to_numpy()

def normalization(arr):
    """Standardize each column: subtract its mean, divide by its standard deviation."""
    mn = np.mean(arr, axis=0)
    sd = np.std(arr, axis=0)

    # Broadcasting applies the per-column mean and std to every row
    norm = (arr - mn) / sd

    return mn, sd, norm



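def costfunction(x, y, theta):
    """Squared error cost J(theta) = 1/(2m) * sum((x @ theta - y)^2)."""
    m = y.shape[0]  # number of training examples

    se = x @ theta - y  # prediction errors, shape (m, 1)

    J = se.T @ se / (2 * m)  # sum of squared errors as a 1x1 array

    return J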

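mn, sd, x_norm = normalization(x)

# Prepend the x_0 = 1 column so that theta_0 acts as the intercept
X = np.hstack([np.ones((x_norm.shape[0], 1)), x_norm])

theta = np.zeros((2, 1))

costfunction(X, y, theta)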


# x, y

# # Create a matrix
# A = np.array([[1, 2, 3],[4, 5, 6],[7,8,9],[10,11,12]])
# print(A)

# # Initialize a vector 
# v = np.array([[1], [2], [3]])
# print(v)

# # Get the dimension of the matrix A where m = rows and n = columns
# dim_A = A.shape
# print(dim_A)

# # Get the dimension of the vector v 
# dim_v = v.shape
# print(dim_v)

# # Now let's index into the 2nd row 3rd column of matrix A
# A_23 = A[1][2]
# a_23 = A[1,2]
# print(A_23, a_23)


array([[32.07273388]])