# Machine Learning from Coursera by Andrew Ng

---

## Octave
 
Octave is useful for prototyping ML projects because its built-in functions make ML algorithms easy to implement.

---
 
Examples of Machine Learning:

* Database mining (Web click data, medical records, biology, engineering)
* Applications that can't be programmed by hand (Autonomous helicopter, handwriting recognition, most of Natural Language Processing (NLP), computer vision)
* Self-customizing programs (Amazon, Netflix product recommendations)

## Machine Learning Algorithms
 
### Supervised Learning
 
* The "correct answers" are given for the algorithm to learn from; these labeled examples are the "training set".
* __Regression__ problems: predicting a continuous-valued output.
* __Classification__ problems: predicting a discrete-valued output, like 0 or 1 (no or yes), specific categories, etc.
* Notation for this course: $m$ = number of training examples, $x$ = input variable (features), $y$ = output variable (target variable), $(x,y)$ = single training example.
 
 
### Unsupervised Learning
 
* The "correct answers" are not given and the algorithm has to learn some structure in the data.
* A common task is to __cluster__ the data into separate groups (e.g. Google News, human genes, organizing computing clusters, social network analysis, market segmentation, astronomical data analysis).
 
 
### Univariate Linear Regression
 
* With one input variable, the model $h_{\theta}(x) = \theta_0 + \theta_1 x$ is a straight line mapping $x$ to $y$, fitted to the $m$ training examples.
* Fitting process:
 * Find the $\theta_0$ and $\theta_1$ in $h_{\theta}(x) = \theta_0 + \theta_1 x$ that minimize the __cost function__ (or __squared error function__), where $(x_i, y_i)$ is the $i^{th}$ training example:
 \begin{equation*}
    J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x_i)-y_i)^2
 \end{equation*}
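
As a rough illustration of the cost function above, here is a minimal Python sketch (the course itself works in Octave; Python is used here only for illustration). The function names `hypothesis` and `cost` and the toy training set are assumptions made for this example, not part of the course material.

```python
def hypothesis(theta0, theta1, x):
    """Straight-line model h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) over the m training examples."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Made-up training set (an assumption) lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(cost(1.0, 2.0, xs, ys))  # perfect fit, prints 0.0
print(cost(0.0, 0.0, xs, ys))  # poor fit, prints 10.5
```

A lower value of `cost` means the line fits the training set better, which is why the fitting process searches for the parameters that minimize it.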

 
### Gradient Descent

* The basic idea is to start with some $\theta_0$, $\theta_1$ and keep changing them to reduce $J(\theta_0, \theta_1)$ until we end up at a minimum.
* This can be applied to any number of parameters $\theta_0,\dots,\theta_n$.
* For each parameter $\theta_j$, repeatedly replace $\theta_j$ with $\theta_j - \alpha\,\partial J/\partial\theta_j$, where $\alpha$ is the __learning rate__.
* You must update all $\theta_j$ simultaneously: compute every partial derivative from the current values before changing any $\theta_j$, because updating one parameter changes the derivatives for the others.
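
The update rule can be sketched in Python for the one-variable model $h_{\theta}(x) = \theta_0 + \theta_1 x$. This is only an illustrative sketch: the function name `gradient_descent`, the default learning rate, the iteration count, and the toy data are all assumptions. It relies on the partial derivatives of the squared error cost, $\partial J/\partial\theta_0 = \frac{1}{m}\sum_i (h_{\theta}(x_i)-y_i)$ and $\partial J/\partial\theta_1 = \frac{1}{m}\sum_i (h_{\theta}(x_i)-y_i)\,x_i$.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit theta0 and theta1 by batch gradient descent (illustrative sketch)."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # arbitrary starting point
    for _ in range(iterations):
        # Prediction errors computed from the *current* parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed before
        # either parameter is changed.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up training set (an assumption) lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)  # approximately 1 and 2
```

Computing `grad0` and `grad1` before the tuple assignment is what makes the update simultaneous; updating `theta0` first and then reusing it to compute `grad1` would be a different procedure.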


### Multivariate Linear Regression

* When there's more than one feature (independent variable) involved, the hypothesis becomes

\begin{equation*}
    h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n
\end{equation*}

* Notation:
 * $x^{(i)} =$ the vector of features of the $i^{th}$ training example
 * $x_j^{(i)} =$ value of feature $j$ in the $i^{th}$ training example
 * $m =$ the number of training examples
 * $n =$ the number of features

* Let $x_0^{(i)} = 1$ for all $i$. Then $\theta$ and $x$ are both $(n+1)$-dimensional vectors, and the hypothesis reduces to a single vector product:

\begin{equation*}
    h_{\theta}(x) = \theta^{T}x
\end{equation*}
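
A small Python sketch of the vectorized hypothesis, using plain lists in place of vectors. The helper names `add_intercept` and `hypothesis` and the numbers are assumptions chosen only for illustration.

```python
def add_intercept(features):
    """Prepend x_0 = 1 so that x has n + 1 entries, matching theta."""
    return [1.0] + list(features)


def hypothesis(theta, x):
    """Compute h_theta(x) = theta^T x as a dot product."""
    return sum(t * xj for t, xj in zip(theta, x))


theta = [1.0, 2.0, 3.0]          # theta_0, theta_1, theta_2 (made-up values)
x = add_intercept([4.0, 5.0])    # becomes [1.0, 4.0, 5.0]

print(hypothesis(theta, x))      # 1*1 + 2*4 + 3*5, prints 24.0
```

With $x_0 = 1$ in place, the intercept $\theta_0$ needs no special handling: it is just one more term of the dot product.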

### Gradient Descent

* Use the same cost function, but now $\theta$ and each $x^{(i)}$ are $(n+1)$-dimensional vectors (each $y^{(i)}$ is still a scalar), and the derivatives are taken with respect to each $\theta_j$:

\begin{equation*}
    J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})^2
\end{equation*}

\begin{equation*}
    \theta_j := \theta_j - \alpha\frac{\partial J(\theta)}{\partial\theta_j} \textrm{ simultaneously for all } j = 0,\dots,n
\end{equation*}
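
For the squared error cost, the partial derivative works out to $\frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})\,x_j^{(i)}$. The following Python sketch applies the update with that derivative. The names `predict` and `gradient_descent`, the default hyperparameters, and the toy data are assumptions for illustration only, and each row of `X` is assumed to already include $x_0 = 1$.

```python
def predict(theta, x):
    """h_theta(x) = theta^T x for one example x (with x_0 = 1 included)."""
    return sum(t * xj for t, xj in zip(theta, x))


def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent for multivariate linear regression (sketch)."""
    m = len(X)
    num_params = len(X[0])          # n + 1, counting x_0
    theta = [0.0] * num_params      # arbitrary starting point
    for _ in range(iterations):
        # Errors for all m examples under the current theta.
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        # One partial derivative per parameter theta_j.
        grads = [
            sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j in range(num_params)
        ]
        # Simultaneous update of every theta_j from the same gradients.
        theta = [t - alpha * g for t, g in zip(theta, grads)]
    return theta


# Made-up data (an assumption) generated from y = 1 + 2*x1 + 3*x2.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]

theta = gradient_descent(X, y)
print(theta)  # approximately [1, 2, 3]
```

Building the whole `grads` list before reassigning `theta` keeps the update simultaneous for all $j = 0,\dots,n$.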

 