# Model and Cost Function

For a given set of $m$ input/output pairs $(x_{i}, y_{i})$, we try to use some hypothesis function $h_{\theta}$ to approximate $y_{i}$ given $x_{i}$.

$$
h_{\theta}(x) = \theta_{0} + \theta_{1}x
$$

We then use this hypothesis function to derive our cost function, $J(\theta_{0},\theta_{1})$:

$$
J(\theta_{0},\theta_{1}) = \frac{1}{2m}\sum_{i=1}^{m} (h_{\theta}(x_{i}) - y_{i})^{2}
$$

So, we try to find the $\theta_{0}$ and $\theta_{1}$ which minimize $J$, i.e. the average squared error (multiplied by $\frac{1}{2}$ to "make the math easier" according to Ng) over the data set of $X, Y$.
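
To make the cost function concrete, here is a minimal sketch in plain Python, taking the hypothesis to be the straight line $\theta_{0} + \theta_{1}x$. The function names and the toy data set are made up for illustration; they aren't from the course.

```python
def hypothesis(theta0, theta1, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Half the mean squared error of the hypothesis over the data set."""
    m = len(xs)
    squared_errors = [
        (hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)
    ]
    return sum(squared_errors) / (2 * m)


# Toy data generated by y = 1 + 2x, so theta0 = 1, theta1 = 2 fits exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

perfect_fit_cost = cost(1.0, 2.0, xs, ys)  # every error is zero
worse_fit_cost = cost(0.0, 0.0, xs, ys)    # (1 + 9 + 25 + 49) / 8 = 10.5
```

The better the line fits, the smaller $J$ gets, which is why minimizing $J$ is the same thing as finding the best $\theta_{0}$ and $\theta_{1}$.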

# Parameter Learning and Gradient Descent


The process of gradient descent can be described as repeating the following until convergence is reached:

$$
\theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta_{0},\theta_{1})
$$

- $\alpha$ is the _learning rate_. We'll consider it a constant scalar for the moment.
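- Crucially, $\theta_{0} \ldots \theta_{n}$ must be updated simultaneously: compute every partial derivative from the current values before overwriting any of them. Each $\theta_{j}$ is the weight for a different _feature_ (with $\theta_{0}$ as the intercept term).
- the direction of the "step" is determined by the sign of the derivative of $J$ w/r/t $\theta_{j}$; we always move against it, i.e. downhill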
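
The simultaneous update is easy to get wrong in code, so here is a small sketch of a single step. The helper `gradient_step` and the cost $J(\theta_{0},\theta_{1}) = (\theta_{0} + \theta_{1} - 1)^{2}$ are hypothetical, chosen only because both partial derivatives depend on both parameters.

```python
def gradient_step(theta, partials, alpha):
    """One simultaneous gradient descent update.

    theta    -- list of current parameter values
    partials -- list of functions; partials[j](theta) returns dJ/dtheta_j
    alpha    -- learning rate
    """
    # Evaluate every partial derivative at the OLD theta first...
    grads = [partial(theta) for partial in partials]
    # ...and only then overwrite the parameters.
    return [t - alpha * g for t, g in zip(theta, grads)]


# Partials of the hypothetical cost J = (theta0 + theta1 - 1)^2.
partials = [
    lambda th: 2 * (th[0] + th[1] - 1),
    lambda th: 2 * (th[0] + th[1] - 1),
]

theta = gradient_step([0.0, 0.0], partials, alpha=0.25)  # [0.5, 0.5]
```

If we had updated $\theta_{0}$ first and then used the new value when computing the derivative for $\theta_{1}$, we would have ended up at $[0.5, 0.25]$ instead, which is a different algorithm.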

Conveniently, we can get convergence while leaving the _learning rate_ $\alpha$ constant (as long as it isn't too large), because as we approach the local optimum, the magnitude of the slope will decrease, decreasing our step size.
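
A quick way to see the shrinking steps is to run gradient descent on a toy one-parameter cost, $J(\theta) = \theta^{2}$, with a fixed learning rate. This is only an illustrative sketch; the function `descend`, the starting point, and the value of $\alpha$ are all assumptions.

```python
def descend(theta, alpha, steps):
    """Run gradient descent on J(theta) = theta**2 with a fixed learning rate.

    Returns the final theta and the absolute size of each step taken.
    """
    step_sizes = []
    for _ in range(steps):
        slope = 2 * theta        # dJ/dtheta for J = theta**2
        step = alpha * slope     # alpha never changes...
        theta = theta - step
        step_sizes.append(abs(step))  # ...but the step still shrinks
    return theta, step_sizes


final_theta, step_sizes = descend(theta=10.0, alpha=0.1, steps=50)
```

Each step here is 80% of the size of the previous one, purely because the slope flattens out near the minimum at $\theta = 0$. If $\alpha$ is too large the steps can overshoot and grow instead.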

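Next, we can use gradient descent to minimize the cost function used in linear regression. So it seems like the $\frac{1}{2}$ constant we multiplied the average squared error by in our cost function "makes the math easier" because we're differentiating a square function: the $2$ brought down by the power rule cancels the $\frac{1}{2}$. With $h_{\theta}(x) = \theta_{0} + \theta_{1}x$, the partial derivatives of $\frac{1}{2m}\sum_{i=1}^{m} (h_{\theta}(x_{i}) - y_{i})^{2}$ are:

$$
\frac{\partial}{\partial \theta_{0}} J(\theta_{0},\theta_{1}) = \frac{1}{m}\sum_{i=1}^{m} (h_{\theta}(x_{i}) - y_{i})
$$

$$
\frac{\partial}{\partial \theta_{1}} J(\theta_{0},\theta_{1}) = \frac{1}{m}\sum_{i=1}^{m} (h_{\theta}(x_{i}) - y_{i})\,x_{i}
$$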
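
Putting the pieces together, here is a rough sketch of batch gradient descent for one-variable linear regression. For $\theta_{0}$ the gradient is the mean error; for $\theta_{1}$ the chain rule contributes an extra factor of $x_{i}$. The function name, the defaults for `alpha` and `iterations`, and the toy data are assumptions picked so this small example converges, not recommended settings.

```python
def fit_line(xs, ys, alpha=0.1, iterations=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Errors h(x_i) - y_i under the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J = (1 / 2m) * sum(errors^2).
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients came from the old thetas.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data generated by y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit_line(xs, ys)  # approaches theta0 = 1, theta1 = 2
```

A fixed iteration count stands in for "until convergence" here; a real implementation would more likely stop once $J$ or the parameters stop changing by more than some tolerance.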

# Linear Algebra Review

- the dimensions of a matrix are written rows x columns, so an $m$ x $n$ matrix has $m$ rows and $n$ columns
- $A_{ij}$ expresses the element at the $i$th row and $j$th column of matrix $A$
- no zero-indexing here. sad!
- $\mathbb{R}$ refers to the set of real scalar values, $\mathbb{R}^{n}$ refers to the set of all $n$-dimensional, real-valued vectors
- addition/multiplication of matrices, vectors, and scalars is the same as it's been in every single other math class you've already taken
- an $m$ x $n$ matrix can only be multiplied by a $1$ x $m$ vector on its left or an $n$ x $1$ vector on its right.
- similarly, an $m$ x $n$ matrix can only be multiplied on its left by a $p$ x $m$ matrix or on its right by an $n$ x $p$ matrix, where $p$ is an arbitrary natural number.
- matrix multiplication is not commutative, but it is associative
- the identity matrix ($I$) is square, with $1$s down its top-left to bottom-right diagonal and zeros everywhere else; any matrix $A$ multiplied by it (on either the left or the right) is equal to $A$, given that the dimensions are compatible.
- the transpose of a matrix $A$ is denoted as $A^{T}$ and simply swaps its rows and columns, so $(A^{T})_{ij} = A_{ji}$, i.e. if $A$ is $m$ x $n$, then $A^{T}$ is $n$ x $m$.
- the inverse of a matrix $A$ is denoted as $A^{-1}$ and is defined by $A^{-1}A = AA^{-1} = I$. Only square matrices can have an inverse, and not all of them do.
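
Several of the properties above can be checked with a few lines of plain Python, treating a matrix as a list of rows. This is just a sketch for sanity-checking the rules; the helper names and example matrices are made up, and the inverse of `A` was worked out by hand rather than computed.

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows; inner dimensions must match."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]


def identity(n):
    """n x n matrix with 1s on the diagonal and 0s elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


A = [[1, 2], [3, 4]]          # 2 x 2
B = [[0, 1], [1, 0]]          # 2 x 2
C = [[1, 2, 3], [4, 5, 6]]    # 2 x 3

AB = matmul(A, B)             # [[2, 1], [4, 3]]
BA = matmul(B, A)             # [[3, 4], [1, 2]], so AB != BA

# Associative: (AB)C equals A(BC).
associative = matmul(matmul(A, B), C) == matmul(A, matmul(B, C))

# Identity works on either side, with the matching size.
left_identity = matmul(identity(2), C) == C
right_identity = matmul(C, identity(3)) == C

C_T = transpose(C)            # 3 x 2: [[1, 4], [2, 5], [3, 6]]

# Hand-computed inverse of A (its determinant is -2).
A_inv = [[-2, 1], [1.5, -0.5]]
inverse_checks_out = (matmul(A_inv, A) == identity(2)
                      and matmul(A, A_inv) == identity(2))
```

Trying `matmul(C, A)` raises a `ValueError`, since a 2 x 3 matrix can't be multiplied on its right by a 2 x 2 one.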