# Machine Learning (Coursera) - Andrew Ng - Notes 

## Introduction (week 1)
### ML Definition 
- Arthur Samuel (1959): the field of study that gives computers the ability to learn without being explicitly programmed.
- Tom Mitchell (1998): a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Diagram: 
<img src="MLDiagram.jpeg" style="height:300px">

### Cost Function 
We can measure the accuracy of our hypothesis function by using a cost function. It takes an average (actually a fancier version of an average: half the mean of the squared differences) between the results of the hypothesis on the inputs x and the actual outputs y.
<img src="cost_function.png" style="height:300px">
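
A minimal sketch of this cost, assuming the squared-error form J(theta) = 1/(2m) * sum((h(x) - y)^2) with a one-variable hypothesis h(x) = theta_0 + theta_1 * x. The function name `compute_cost` and the toy data are made up for illustration.

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = theta0 + theta1 * x  # hypothesis h(x)
        total += (prediction - y) ** 2
    return total / (2 * m)


# Toy data lying exactly on y = 2x (made up for illustration)
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

perfect_fit_cost = compute_cost(0.0, 2.0, xs, ys)  # line through every point
poor_fit_cost = compute_cost(0.0, 0.0, xs, ys)     # flat line h(x) = 0
```

A hypothesis that passes through every point has zero cost, and the cost grows as the line moves away from the data.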

Intuition: h is the hypothesis (a function of x for fixed theta) and J(theta) is the cost function. Every choice of theta gives one hypothesis h and one value of J, and we want the theta that minimizes J. Visually, if you have 2 theta parameters you can draw J as contours:
<img src="intuition_cost.png" style="height:300px">


### Gradient Descent 
We put theta_0 on the x axis and theta_1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.
<img src="gradient1.png" style="height:300px">

The learning rate alpha sets the size of each step, and therefore how fast we move towards the local/global minimum.
<img src="gradient2.png" style="height:300px">
The partial derivative shrinks as we approach a minimum, so the steps get smaller automatically and gradient descent can converge even if the learning rate is fixed.


## Regression (week 2)
### Gradient Descent for Linear Regression 
This method looks at every example in the entire training set on every step, and is called *batch gradient descent*. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem posed here for linear regression has a single global optimum and no other local optima, because J is a convex quadratic function.
<img src="gradient_regression.png" style="width:450px">
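
As a rough sketch of batch gradient descent for one-variable linear regression, assuming h(x) = theta_0 + theta_1 * x and the squared-error cost. The function name, the default `alpha` and `iterations`, and the toy data are illustrative choices, not values from the course.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Errors over the *entire* training set (hence "batch")
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old thetas
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x (made up for illustration)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = batch_gradient_descent(xs, ys)
```

Because J is convex here, the result should approach the generating parameters (theta_0 near 1, theta_1 near 2) from any starting point, provided alpha is small enough.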

### Gradient Descent for Multiple Linear Regression 
We just have to repeat the update for each of our n features:
<img src="gradient_regression2.png" style="width:450px">

### Feature Scaling: to help Gradient Descent 
We can speed up gradient descent by having each of our input values in roughly the same range. This is because theta will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
Two techniques to help with this are feature scaling and mean normalization. 

- **Feature scaling** involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. 
- **Mean normalization** involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero. 
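
The two techniques are usually combined as x := (x - mean) / range. A minimal sketch, assuming the range (max minus min) as the divisor; the standard deviation is a common alternative. The helper name `mean_normalize` and the house sizes are hypothetical.

```python
def mean_normalize(values):
    """Return (x - mean) / (max - min) for each value in a feature column."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    return [(v - mean) / value_range for v in values]


# Hypothetical house sizes in square feet
sizes = [1000.0, 1500.0, 2000.0, 3000.0]
scaled = mean_normalize(sizes)
```

After the transformation the feature has zero mean and spans a range of exactly 1. A constant feature (max equal to min) would divide by zero and needs separate handling.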



### In practice: 
- **Debugging gradient descent** Make a plot with number of iterations on the x-axis. Now plot the cost function, J(theta) over the number of iterations of gradient descent. If J(theta) ever increases, then you probably need to decrease alpha.
- **Automatic convergence test** Declare convergence if J(theta) decreases by less than E in one iteration, where E is some small value such as 10⁻³. In practice, however, it is difficult to choose this threshold value.

Rules:

- If alpha is too small: slow convergence.
- If alpha is too large: J(theta) may not decrease on every iteration and thus may not converge.
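
The rules above can be illustrated on a toy problem. This sketch assumes the one-parameter cost J(theta) = theta^2 (not from the course) so that the effect of alpha is easy to see; the function names and the three alpha values are made up for illustration.

```python
def run_gradient_descent(alpha, theta=1.0, iterations=20):
    """Minimize J(theta) = theta^2 and return the cost after each iteration."""
    history = []
    for _ in range(iterations):
        gradient = 2 * theta          # dJ/dtheta
        theta -= alpha * gradient
        history.append(theta ** 2)    # J(theta) after the update
    return history


def converged(history, epsilon=1e-3):
    """Automatic convergence test: did the last step reduce J by less than epsilon?"""
    return 0 <= history[-2] - history[-1] < epsilon


small_alpha = run_gradient_descent(alpha=0.01)  # decreases, but slowly
good_alpha = run_gradient_descent(alpha=0.1)    # decreases quickly
large_alpha = run_gradient_descent(alpha=1.1)   # J grows on every step
```

Plotting each `history` against the iteration number gives the debugging plot described above: a curve that rises, as with `large_alpha`, suggests decreasing alpha.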






## Normal Equation 
Gradient descent gives one way of minimizing J. A second way performs the minimization explicitly, without resorting to an iterative algorithm. In the "Normal Equation" method (classic method learned in Uni!), we minimize J by explicitly taking its derivatives with respect to each theta_j and setting them to zero. This gives the optimum theta directly. The normal equation formula is given below:
<img src="normal_equation.png" style="width:450px">
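
A small sketch of the normal equation theta = (X^T X)^-1 X^T y in plain Python. Instead of inverting X^T X, it solves the equivalent linear system (X^T X) theta = X^T y by Gauss-Jordan elimination, which is a substitution made here to keep the example dependency-free. The function name and data are illustrative.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y by Gauss-Jordan elimination.

    X is a list of rows, each starting with 1.0 for the intercept term.
    Assumes X^T X is invertible.
    """
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y]
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


# Toy data generated from y = 1 + 2x (made up for illustration)
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = normal_equation(X, y)
```

No learning rate and no iterations are needed. In practice a linear-algebra library would be used instead of hand-written elimination.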

So when is it better to use the Normal Equation, and when Gradient Descent?
<img src="normal_equation2.png" style="width:650px">

If X^T X is non-invertible, the common causes are:
- Redundant features, where two features are very closely related (i.e. they are linearly dependent)
- Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization".



## Classification (week 3) 
The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. 


## Logistic Regression 
### Hypothesis Representation 
<img src="logistic1.png" style="width:450px">
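
A minimal sketch of the logistic regression hypothesis, assuming the standard form h(x) = g(theta^T x) with the sigmoid g(z) = 1 / (1 + e^(-z)). The parameter values below are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): estimated probability that y = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


# Hypothetical parameters; x[0] = 1 is the intercept feature
theta = [-3.0, 1.0, 1.0]
p_boundary = hypothesis(theta, [1.0, 1.5, 1.5])  # theta^T x = 0
p_positive = hypothesis(theta, [1.0, 3.0, 3.0])  # theta^T x = 3
```

Points where theta^T x = 0 sit on the decision boundary and get probability 0.5; larger values of theta^T x push the output towards 1.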

### Cost Function 
We cannot use the same cost function that we use for linear regression, because plugging the Logistic Function into the squared error makes the cost wavy (non-convex), with many local optima.
<img src="logistic2.png" style="width:450px">
Note that writing the cost function in this way guarantees that J(theta) is convex for logistic regression.
<img src="logistic3.png" style="width:550px">

Vectorized implementation: 
<img src="logistic4_vec.png" style="width:450px">
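
A sketch of the same cost written with plain loops, assuming the usual form J(theta) = -(1/m) * sum(y * log(h) + (1 - y) * log(1 - h)). The function name and toy data are illustrative.

```python
import math


def logistic_cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    m = len(y)
    total = 0.0
    for row, label in zip(X, y):
        z = sum(t * xi for t, xi in zip(theta, row))
        h = 1.0 / (1.0 + math.exp(-z))  # sigmoid hypothesis
        total += label * math.log(h) + (1 - label) * math.log(1 - h)
    return -total / m


# Toy data (made up): intercept column plus one feature
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
cost_at_zero = logistic_cost([0.0, 0.0], X, y)   # h = 0.5 for every example
cost_better = logistic_cost([-3.0, 2.0], X, y)   # decision boundary at x = 1.5
```

With theta = 0 every prediction is 0.5, so the cost equals log(2). Note that `math.log` fails if h saturates to exactly 0 or 1, so a real implementation would guard against that.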


### Gradient Descent 
This algorithm has the same form as the one we used in linear regression; only the hypothesis h is different. We still have to simultaneously update all values in theta.

<img src="logistic5.png" style="width:450px">



Vectorized implementation: 
<img src="logistic6.png" style="width:450px">
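
A rough sketch of one update step, assuming the rule theta_j := theta_j - (alpha/m) * sum((h(x) - y) * x_j). The function name, alpha, iteration count and data are illustrative choices.

```python
import math


def logistic_gradient_step(theta, X, y, alpha):
    """One simultaneous update: theta_j -= alpha/m * sum((h - y) * x_j)."""
    m = len(y)
    errors = []
    for row, label in zip(X, y):
        z = sum(t * xi for t, xi in zip(theta, row))
        errors.append(1.0 / (1.0 + math.exp(-z)) - label)
    # Build the new vector from the old theta so every theta_j updates together
    return [
        t - alpha / m * sum(e * row[j] for e, row in zip(errors, X))
        for j, t in enumerate(theta)
    ]


# Toy, non-separable data (made up): intercept column plus one feature
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 1, 0, 1]
first_step = logistic_gradient_step([0.0, 0.0], X, y, alpha=0.1)

theta = [0.0, 0.0]
for _ in range(1000):
    theta = logistic_gradient_step(theta, X, y, alpha=0.1)
```

Returning a new list, rather than updating `theta` in place inside the loop over j, is what keeps the update simultaneous.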

## Advanced optimization (not only gradient descent)
"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. You should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized. Octave provides them. In the course we use the function "fminunc()", passing it our cost function, our initial vector of theta values, and the "options" object that we created beforehand.


## Multi-class Classification - One vs all strategy 
We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.
<img src="multiclass1.png" style="width:450px">
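
A minimal sketch of the prediction step, assuming one binary logistic classifier has already been trained per class. The class labels and parameter values in `all_theta` are hypothetical.

```python
import math


def predict_one_vs_all(all_theta, x):
    """Return the class whose binary classifier gives the highest h_theta(x)."""
    best_class, best_prob = None, -1.0
    for label, theta in all_theta.items():
        z = sum(t * xi for t, xi in zip(theta, x))
        prob = 1.0 / (1.0 + math.exp(-z))  # h_theta(x) for this class
        if prob > best_prob:
            best_class, best_prob = label, prob
    return best_class


# Hypothetical, already-trained parameters for three classes (x[0] = 1)
all_theta = {
    "triangle": [2.0, -1.0, 0.0],
    "square": [-2.0, 1.0, 0.0],
    "cross": [-1.0, 0.0, 0.5],
}
prediction = predict_one_vs_all(all_theta, [1.0, 4.0, 1.0])
```

For this input the "square" classifier has the largest theta^T x, so it is the predicted class.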




------
## Regularization 
### Overfitting 
We'll say the figure on the left shows an instance of underfitting, in which the data clearly shows structure not captured by the model, and the figure on the right is an example of overfitting.
<img src="overfitting.png" style="width:550px">
- **Underfitting, or high bias**, is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. 
- **Overfitting, or high variance**, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

###  How to deal with Overfitting? 
1) Reduce the number of features:

    - Manually select which features to keep.
    - Use a model selection algorithm (studied later in the course).

2) Regularization

    - Keep all the features, but reduce the magnitude of parameters theta_j.
    - Regularization works well when we have a lot of slightly useful features.
    
### How will the cost function change? 
The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated. This means: 
- if lambda is too large, the thetas will be so small that you risk smoothing out the model too much (underfitting). 
- if lambda is too small (in the limit, λ=0), the cost function is the same as before and the overfitting problem is not tackled. 
<img src="reg1.png" style="width:450px">
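
A sketch of the regularized cost for linear regression, assuming the form J(theta) = 1/(2m) * [sum((h(x) - y)^2) + lambda * sum(theta_j^2 for j >= 1)]. The function name, the `lam` parameter name and the toy data are illustrative.

```python
def regularized_cost(theta, X, y, lam):
    """J(theta) = 1/(2m) * [sum((h - y)^2) + lambda * sum(theta_j^2, j >= 1)]."""
    m = len(y)
    squared_errors = sum(
        (sum(t * xi for t, xi in zip(theta, row)) - label) ** 2
        for row, label in zip(X, y)
    )
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta_0 is not penalized
    return (squared_errors + penalty) / (2 * m)


# Toy data on y = 2x (made up); x[0] = 1 is the intercept feature
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta = [0.0, 2.0]  # fits the toy data exactly
unregularized = regularized_cost(theta, X, y, lam=0.0)
regularized = regularized_cost(theta, X, y, lam=3.0)
```

Even a perfect fit pays a price once lambda is positive, which is what pushes the optimizer towards smaller thetas.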


### Regularized linear regression  
Remember to exclude theta_0 because we don't need to regularize the intercept. 
<img src="reg2.png" style="width:650px">
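
A rough sketch of one regularized gradient descent step, assuming the update theta_j := theta_j - alpha * [(1/m) * sum((h(x) - y) * x_j) + (lambda/m) * theta_j] for j >= 1, and the unregularized update for theta_0. Names and numbers are illustrative.

```python
def regularized_gradient_step(theta, X, y, alpha, lam):
    """One gradient descent step for regularized linear regression."""
    m = len(y)
    errors = [
        sum(t * xi for t, xi in zip(theta, row)) - label
        for row, label in zip(X, y)
    ]
    new_theta = []
    for j, t in enumerate(theta):
        gradient = sum(e * row[j] for e, row in zip(errors, X)) / m
        if j > 0:
            gradient += lam / m * t  # penalty term, skipped for theta_0
        new_theta.append(t - alpha * gradient)
    return new_theta


# Toy data (made up); x[0] = 1 is the intercept feature
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta = [1.0, 2.0]
step_plain = regularized_gradient_step(theta, X, y, alpha=0.1, lam=0.0)
step_reg = regularized_gradient_step(theta, X, y, alpha=0.1, lam=3.0)
```

The intercept moves identically in both cases, while the regularized step shrinks theta_1 further.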

### Regularized logistic regression 
Remember to exclude theta_0 because we don't need to regularize the intercept. 
<img src="reg3.png" style="width:650px">

-------
## Neural Networks 
