# I/ What is Machine Learning?

In traditional programming, a computer needs to be told exactly what to do through explicit instructions (if you see X, then do Y). We configure a machine to accept our input and produce an output based on a fixed algorithm (input = command, output = predetermined response). When problems get trickier and we can't explicitly instruct the computer what to do, we need machine learning.

Machine Learning is similar to how humans learn: we give the computer the data and tools it needs to study and solve the problem without being told exactly what to do. The computer has the ability to adapt, evolve, and learn.

**Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.**

# II/ How Does Machine Learning Work?

UC Berkeley breaks out the learning system of a machine learning algorithm into three main parts.
1. **A Decision Process**: In general, machine learning algorithms are used to make a prediction or classification. Based on some input data, which can be labelled or unlabeled, your algorithm will produce an estimate about a pattern in the data.


2. **An Error Function**: An error function serves to evaluate the prediction of the model. If there are known examples, an error function can make a comparison to assess the accuracy of the model.


3. **A Model Optimization Process**: If the model can fit the data points in the training set better, the weights are adjusted to reduce the discrepancy between the known examples and the model's estimates. The algorithm repeats this evaluate-and-optimize process, updating weights autonomously until a threshold of accuracy has been met.

# III/ Different Ways a Machine learns:
1. **Supervised Learning** (we teach the computer how to do something)

    Supervised Learning is defined by its use of *labeled datasets* to train algorithms to classify data or predict outcomes accurately. As input data is fed into the model, it adjusts its weights until the model has been fitted appropriately. This occurs as part of the cross-validation process, which helps ensure that the model avoids overfitting or underfitting. Some methods used in supervised learning include neural networks, naïve Bayes, linear regression, logistic regression, decision trees/random forests, support vector machines (SVM), and more.


2. **Unsupervised Learning** (we let the computer learn by itself)

    Unsupervised Learning uses machine learning algorithms to analyze and cluster *unlabeled datasets*. These algorithms discover hidden patterns or data groupings without the need for human intervention. Some algorithms used in unsupervised learning include neural networks, k-means clustering, probabilistic clustering methods, and more.
        

3. **Semi-supervised learning** offers a happy medium between supervised and unsupervised learning. During training, it uses a smaller labeled data set to guide classification and feature extraction from a larger, unlabeled data set. Semi-supervised learning can solve the problem of not having enough labeled data (or not being able to afford to label enough data) to train a supervised learning algorithm.


4. **Reinforcement learning** is a behavioral machine learning model that is similar to supervised learning, but the algorithm isn’t trained using sample data. This model learns as it goes by using trial and error. A sequence of successful outcomes will be reinforced to develop the best recommendation or policy for a given problem.

Source: https://www.ibm.com/cloud/learn/machine-learning

# IV/ Supervised Learning

There are two types of Supervised Learning:

1. Regression
2. Classification

## 1. Regression

### a) Linear Regression

https://www.youtube.com/watch?v=F6GSRDoB-Cg&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=8

We start by feeding the training data to a learning algorithm. The algorithm outputs a function called the hypothesis, $h_{\theta}(x)$, which maps a new input $x$ to a predicted value:
$$h_{\theta}(x) = {\theta_0} + {\theta_1}{x_1} + ... + {\theta_n}{x_n}$$

How do we choose the parameters $\theta_j$ so that $h_{\theta}(x) \approx y$ for each training example $(x, y)$?


**Hypothesis:**
$$h_{\theta}(x) = {\theta_0} + {\theta_1}{x_1} + ... + {\theta_n}{x_n}$$

**Parameters:**
$$\theta = [{\theta_0}, {\theta_1}, ... , {\theta_n}]^T$$

**Cost Function / Error function:**
$$ J({\theta}) = \frac{1}{2m} \sum_{i=1}^{m}(h_{\theta}(x^{i}) - y^{i})^2$$

**Goal: Minimize $J({\theta_0},{\theta_1},..,{\theta_n})$**

*Notation:*
- *m = number of training examples*
- *($x^{i}, y^{i}$) = $i^{th}$ example*
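
The hypothesis and cost function above can be written out directly. The following is a minimal sketch in plain Python, not a reference implementation: the function names `hypothesis` and `cost` and the toy data are assumptions made for illustration, and each row of `X` holds the features $x_1, .., x_n$ of one example (without the constant term).

```python
def hypothesis(theta, x):
    """Return h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n.

    theta has n + 1 entries; x has n entries (no leading 1).
    """
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))


def cost(theta, X, y):
    """Return J(theta) = 1/(2m) * sum of squared prediction errors."""
    m = len(y)
    return sum((hypothesis(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


# Toy data (made up for illustration): y = 1 + 2*x exactly.
X = [[1.0], [2.0], [3.0]]
y = [3.0, 5.0, 7.0]

print(cost([1.0, 2.0], X, y))  # perfect fit, so the cost is 0.0
print(cost([0.0, 0.0], X, y))  # predicting 0 everywhere gives a larger cost
```

A lower value of `cost` means the chosen $\theta$ fits the training examples better, which is why the goal is stated as minimizing $J$.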

Two ways to find the minimum of the cost function $J({\theta})$:
1. **Gradient descent** = an optimization algorithm for finding a local minimum of a differentiable function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient.
2. **Normal equation** = linear algebra approach using matrix multiplication to find the optimal parameters $\theta$

**1. Gradient Descent:**

We simultaneously update every ${\theta_j}$ until $J({\theta})$ converges to a local minimum

    repeat until convergence (for j = 0, .., n) {
$${\theta_j} := {\theta_j} - \alpha\frac{\partial}{\partial {\theta_j}}J({\theta})$$
    
    }
    
where:
- $\alpha$ = learning rate
- $\frac{\partial}{\partial {\theta_j}}J({\theta})$ = partial derivative of cost function J with respect to $\theta_j$ = slope of tangent line
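
The update rule can be sketched in plain Python as follows. This is a hedged illustration, not the course's code: the names `predict`, `cost`, and `gradient_descent`, the default `alpha` and `iterations`, and the toy data are all assumptions. The partial derivative is not spelled out in these notes; the sketch uses the standard result of differentiating the squared-error cost, $\frac{\partial}{\partial \theta_j}J(\theta) = \frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{i}) - y^{i})x^{i}_j$ with $x_0 = 1$.

```python
def predict(theta, x):
    """h_theta(x) with x_0 = 1 implied: theta_0 + sum_j theta_j * x_j."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared prediction errors."""
    m = len(y)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Run batch gradient descent; return theta and the cost after each step."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    history = []
    for _ in range(iterations):
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        # Partial derivatives of J: (1/m) * sum_i error_i * x_j^i, with x_0 = 1.
        grad = [sum(errors) / m]
        grad += [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(n)]
        # Simultaneous update: every theta_j is computed from the same old theta.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
        history.append(cost(theta, X, y))
    return theta, history


# Toy data (made up for illustration): y = 1 + 2*x exactly.
X = [[1.0], [2.0], [3.0]]
y = [3.0, 5.0, 7.0]
theta, history = gradient_descent(X, y)
print(theta)        # should be close to [1.0, 2.0]
print(history[-1])  # should be close to 0
```

Returning the cost history makes it easy to plot or inspect $J(\theta)$ per iteration and confirm that it keeps decreasing for the chosen $\alpha$.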

<img src="gradient_descent.png">

#### Note:

- **Feature Scaling**: make sure features are on a similar scale, which helps gradient descent converge more quickly.

    We should get every feature into approximately the $-1 \leq x_i \leq 1$ range.
    
    We can do that by applying *Mean Normalization*: 
$$x_i := \frac{{x_i} - {\mu}_i}{s_i}$$
     where: $\mu_i$ = mean of $x_i$ in the training set, and $s_i$ = range (max - min) of $x_i$ in the training set
     
     
- **Making sure gradient descent is working correctly**: $J(\theta)$ should decrease after every iteration.
  
   If $J(\theta)$ increases or oscillates, use a smaller $\alpha$. 
   
   - If $\alpha$ is too small, gradient descent can be slow to converge.
   
   - If $\alpha$ is too large, $J(\theta)$ may not decrease on every iteration and may fail to converge.
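
As an illustration of the feature scaling note, mean normalization might be sketched like this in plain Python. The function name `mean_normalize`, the fallback for constant columns, and the example numbers are assumptions made for this sketch.

```python
def mean_normalize(X):
    """Rescale each feature column to (x - mean) / range.

    Returns the scaled rows plus the per-feature means and ranges, which are
    needed to scale new inputs the same way at prediction time.
    """
    columns = list(zip(*X))
    means = [sum(col) / len(col) for col in columns]
    # Fall back to 1.0 for a constant column to avoid division by zero.
    ranges = [(max(col) - min(col)) or 1.0 for col in columns]
    scaled = [
        [(v - mu) / s for v, mu, s in zip(row, means, ranges)]
        for row in X
    ]
    return scaled, means, ranges


# Made-up example: house size in square feet and number of bedrooms.
X = [[2000.0, 3.0], [1000.0, 2.0], [3000.0, 5.0]]
scaled, means, ranges = mean_normalize(X)
print(scaled)  # every value now lies between -1 and 1
```

Keeping `means` and `ranges` matters because the same shift and scale computed on the training set must be applied to any new example before it is passed to the hypothesis.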

**2. Normal Equation:**

Hypothesis
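$$h_{\theta}(x) = {\theta_0} + {\theta_1}{x_1} + ... + {\theta_n}{x_n}$$

Let us define $x_0 = 1$; then we have:
$$h_{\theta}(x) = {\theta_0}{x_0} + {\theta_1}{x_1} + ... + {\theta_n}{x_n} = \theta^T x$$

where $x = [{x_0},{x_1},..,{x_n}]^T$. Stack the $m$ training examples as rows to form the $m \times (n+1)$ matrix $X$, whose $i^{th}$ row is $(x^{i})^T$, and let $y = [y^{1}, .., y^{m}]^T$.

We can calculate the optimal $\theta$ using the Normal Equation below:
$$\theta = ({{X^T}X})^{-1}{X^T}y$$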
$$\theta = ({{X^T}X})^{-1}{X^T}y$$

*Note: If ${X^T}X$ is non-invertible, try to remove redundant features or use regularization*
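
A minimal sketch of the Normal Equation in plain Python is shown below. It is an illustration under stated assumptions, not a production routine: the helper names `solve_linear_system` and `normal_equation` and the toy data are made up here, and instead of forming the inverse explicitly the sketch solves the equivalent linear system $({X^T}X)\theta = {X^T}y$ by Gaussian elimination, which gives the same $\theta$ whenever ${X^T}X$ is invertible. In practice a linear algebra library such as NumPy would normally be used for this step.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Work on an augmented copy [A | b] so the inputs are left untouched.
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (non-invertible)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def normal_equation(features, y):
    """Return theta solving (X^T X) theta = X^T y."""
    X = [[1.0] + list(row) for row in features]  # prepend x_0 = 1
    cols = list(zip(*X))
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(a * yi for a, yi in zip(ci, y)) for ci in cols]
    return solve_linear_system(XtX, Xty)


# Toy data (made up for illustration): y = 1 + 2*x exactly.
features = [[1.0], [2.0], [3.0]]
y = [3.0, 5.0, 7.0]
print(normal_equation(features, y))  # should be close to [1.0, 2.0]
```

The `ValueError` branch corresponds to the non-invertible case in the note above, for example when two features are exact copies of each other.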

**When to use Gradient Descent or Normal Equation?**

Choose the Normal Equation only when the number of features $n$ is small; otherwise use Gradient Descent.

| **Gradient Descent** | **Normal Equation** |
| :-: | :-: |
| Need to choose $\alpha$ | Don't need to choose $\alpha$ |
| Need many iterations | Don't need many iterations |
| Works well when n is large | Slow if n is large (because computing $({{X^T}X})^{-1}$ is $O(n^3)$) |