# Introduction to Machine Learning

### Examples of Machine Learning:
- Database mining
    - Large datasets from growth of automation/web. E.g. Web click data, medical records, biology, engineering, etc.
- Applications we can't program by hand
    - E.g. Autonomous helicopter, handwriting recognition, NLP, Computer Vision (CV)
- Self-customizing programs
    - E.g. Amazon, Netflix product recommendations
- Understanding human learning (brain, real AI)

### What is machine learning?

- Arthur Samuel (1959) defined machine learning as the field of study that gives computers the ability to learn without being explicitly programmed.
- Tom Mitchell (1998) defined machine learning as: a computer program is said to *learn* from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.

### Types of machine learning:
- Supervised learning
- Unsupervised learning

Others: Reinforcement learning, recommender systems

### Supervised Learning

- A learning method in which the computer is provided a training set of data with the "right answers" given.
- Supervised learning problems are categorized into "regression" and "classification" problems.
    - Regression problems = predict continuous valued output (e.g. the price of houses in the Portland housing price example)
    - Classification problems = predict discrete valued output (0 or 1)

### Unsupervised Learning
Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.

We can derive this structure by clustering the data based on relationships among the variables in the data.

With unsupervised learning there is no feedback based on the prediction results.

- Examples include:
    - Organize computing clusters
    - Social network analysis
    - Market segmentation
    - Astronomical data analysis

### Model Representation: Linear Regression with one variable
Notation:
- m = Number of training examples
- x's = input variable/features
- y's = output variable/target variable (i.e. the value we want to predict)
- (x,y) = one training example
- $ (x^i,y^i) $ = the ith training example (i is an index into the training set, not an exponent)

Algorithm workflow:
![image-7.png](attachment:image-7.png)
How do we represent *h*?
- $ h_\mathsf{\theta}(x)= \mathsf{\theta}_0 +\mathsf{\theta}_1x$
    - the prediction for y is the value of the linear function $h_\mathsf{\theta}(x)$ at the input x
    - Univariate linear regression = a linear regression problem with a single (1) variable
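
As a minimal sketch, the hypothesis can be written as a plain Python function. The function name `hypothesis` and the parameter values used below are assumptions chosen only for illustration; they are not taken from a trained model.

```python
def hypothesis(theta0, theta1, x):
    """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Hypothetical parameter values, chosen only to show the arithmetic.
prediction = hypothesis(theta0=1.0, theta1=2.0, x=3.0)
print(prediction)  # 1.0 + 2.0 * 3.0 = 7.0
```

Here `theta0` is the intercept of the line and `theta1` is its slope.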
    
### Linear Regression with one variable: Cost Function
- $ h_\mathsf{\theta}(x)= \mathsf{\theta}_0 +\mathsf{\theta}_1x$
    - $\mathsf{\theta}_i$'s = the parameters of the model
- different values of $\mathsf{\theta}$ can affect the model's performance
    - The goal is to choose values for $\mathsf{\theta}_0$,$\mathsf{\theta}_1$ so that $ h_\mathsf{\theta}(x)$ is close to y for the training examples (x,y).
    
- $\underset{\mathsf{\theta}_0,\mathsf{\theta}_1}{\text{minimize}}$ $\frac{1}{2m}$ $\sum \limits_{i=1} ^{m}(h_\mathsf{\theta}(x^i)-y^i)^2$
    - $h_\mathsf{\theta}(x^i)$ = the prediction
    - $y^i$ = the actual value
    - The mean is halved ($\frac{1}{2}$) as a convenience for gradient descent: differentiating the squared term produces a factor of 2 that cancels the $\frac{1}{2}$
    - the equation sums the squared differences between the predictions and the actual values over all m training examples, then divides the sum by 2m (half the average squared error)
    - this is called the **squared error cost function** or **mean squared error**
    - Also written as: J($\mathsf{\theta}_0$, $\mathsf{\theta}_1$) =  $\frac{1}{2m}$ $\sum \limits_{i=1} ^{m}(h_\mathsf{\theta}(x^i)-y^i)^2$
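
The squared error cost function can be sketched in a few lines of Python. The name `compute_cost` and the three-point training set are assumptions made up for this example; the data is constructed so that y = 2x exactly.

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared error cost: J = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = theta0 + theta1 * x  # h_theta(x)
        total += (prediction - y) ** 2    # squared error for this example
    return total / (2 * m)


# Made-up training set in which y = 2x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(compute_cost(0.0, 2.0, xs, ys))  # 0.0, the line passes through every point
print(compute_cost(0.0, 0.0, xs, ys))  # (4 + 16 + 36) / 6, a much worse fit
```

A lower cost means the line defined by the two parameters lies closer to the training examples.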
    
#### What is J($\mathsf{\theta}_0$, $\mathsf{\theta}_1$) doing?
A simplified hypothesis looks like: $h_\mathsf{\theta}(x) = \mathsf{\theta}_1 x$ where $\mathsf{\theta}_0 = 0$
- There are two functions of interest:
    1. $h_\mathsf{\theta}(x)$ = for fixed $\mathsf{\theta}_1$, this is a function of x.
    2. $J(\mathsf{\theta}_1$) = function of the parameter $\mathsf{\theta}_1$
- For the simplified hypothesis, see the following graphical representations of the hypothesis and cost functions:
![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
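
To make the relationship between the two functions concrete, the sketch below evaluates the cost for several values of the single parameter, with the intercept fixed at 0. The training set (1,1), (2,2), (3,3) is an assumption chosen for illustration and may differ from the data plotted in the figures.

```python
def cost_simplified(theta1, xs, ys):
    """J(theta1) for the simplified hypothesis h(x) = theta1 * x."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Assumed training set lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

# Each value of theta1 gives a different line h(x) and one point on the curve J(theta1).
for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"theta1 = {theta1:.1f}  J = {cost_simplified(theta1, xs, ys):.3f}")
```

For this data the cost is zero at a slope of 1 and grows as the slope moves away from 1 in either direction, which is the bowl shape a plot of the cost against the parameter would show.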

What if we were to use J($\mathsf{\theta}_0$, $\mathsf{\theta}_1$) with both parameters?
- A contour plot is a graph that contains many contour lines. A contour line of a two variable function has a constant value at all points of the same line. Examples of such graphs are listed below.
![image-4.png](attachment:image-4.png)
![image-5.png](attachment:image-5.png)
![image-6.png](attachment:image-6.png)

### Gradient Descent and its application in Linear Regression Problems
If we have some function J($\mathsf{\theta}_0$, $\mathsf{\theta}_1$), we want to find the values of ($\mathsf{\theta}_0$, $\mathsf{\theta}_1$) that minimize it.

Outline:
1. Start with some ($\mathsf{\theta}_0$,$\mathsf{\theta}_1$)
    - These are initial "guesses" that the gradient descent algorithm begins with; they may be random. For example, the initial values of the parameters may be $\mathsf{\theta}_0$ = 0 and $\mathsf{\theta}_1$ = 0
2. Keep changing the values of ($\mathsf{\theta}_0$,$\mathsf{\theta}_1$) to reduce J($\mathsf{\theta}_0$, $\mathsf{\theta}_1$) until a minimum is (hopefully) reached.

Gradient Descent Algorithm formula:
![image-8.png](attachment:image-8.png)

- $\mathsf{\theta}_j := \mathsf{\theta}_j -\mathsf{\alpha}\frac{\partial}{\partial\mathsf{\theta}_j}J(\mathsf{\theta}_0, \mathsf{\theta}_1)$ (for j = 0 and j = 1)
    - repeat the above formula until convergence
    - := is used to denote assignment. For example, a:=b means that the computer will take the value of b and overwrite whatever the value of a is.
    - $\mathsf{\alpha}$ = the learning rate (controls how big the 'step' downhill is).
    - $\frac{\partial}{\partial\mathsf{\theta}_j}J(\mathsf{\theta}_0, \mathsf{\theta}_1)$ = the partial derivative of J with respect to $\mathsf{\theta}_j$
    - j = 0, 1 is the index of the parameter being updated.
    - $\mathsf{\theta}_0$ and $\mathsf{\theta}_1$ are simultaneously updated
        - $temp0 := \mathsf{\theta}_0 -\mathsf{\alpha}\frac{\partial}{\partial\mathsf{\theta}_0}J(\mathsf{\theta}_0, \mathsf{\theta}_1)$
        - $temp1 := \mathsf{\theta}_1 -\mathsf{\alpha}\frac{\partial}{\partial\mathsf{\theta}_1}J(\mathsf{\theta}_0, \mathsf{\theta}_1)$
        - $\mathsf{\theta}_0$ := temp0
        - $\mathsf{\theta}_1$ := temp1
        - At each iteration, all of the parameters $\mathsf{\theta}_0$, $\mathsf{\theta}_1$, ..., $\mathsf{\theta}_n$ should be updated simultaneously. Overwriting one parameter before computing the update for another within the same iteration is an incorrect implementation, because the later derivative term would be evaluated with a mix of old and new parameter values.
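
The difference between a simultaneous and a non-simultaneous update can be seen in a short sketch. The cost function used here, J = theta0^2 + theta0*theta1 + theta1^2, is a made-up example (not the linear regression cost) chosen because its partial derivatives are easy to write by hand; all function names are hypothetical.

```python
def grad_theta0(theta0, theta1):
    """Partial derivative of J = theta0^2 + theta0*theta1 + theta1^2 w.r.t. theta0."""
    return 2 * theta0 + theta1


def grad_theta1(theta0, theta1):
    """Partial derivative of the same J with respect to theta1."""
    return theta0 + 2 * theta1


def simultaneous_step(theta0, theta1, alpha):
    # Correct: both temporaries are computed from the OLD parameter values.
    temp0 = theta0 - alpha * grad_theta0(theta0, theta1)
    temp1 = theta1 - alpha * grad_theta1(theta0, theta1)
    return temp0, temp1


def sequential_step(theta0, theta1, alpha):
    # Incorrect: theta0 is overwritten before the theta1 update is computed,
    # so grad_theta1 sees the new theta0 together with the old theta1.
    theta0 = theta0 - alpha * grad_theta0(theta0, theta1)
    theta1 = theta1 - alpha * grad_theta1(theta0, theta1)
    return theta0, theta1


print(simultaneous_step(1.0, 1.0, 0.5))  # (-0.5, -0.5)
print(sequential_step(1.0, 1.0, 0.5))    # (-0.5, 0.25)
```

The two versions return different values for the second parameter after a single step, which is why the temporaries are needed.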
 
- What does the derivative term, $\frac{\partial}{\partial\mathsf{\theta}_j}J(\mathsf{\theta}_0, \mathsf{\theta}_1)$, do and how do you calculate it?
    - Regardless of the sign of the slope $\frac{\partial}{\partial\mathsf{\theta}_1}J(\mathsf{\theta}_1)$, the update moves $\mathsf{\theta}_1$ toward the minimum of J: when the slope is positive $\mathsf{\theta}_1$ decreases, and when the slope is negative $\mathsf{\theta}_1$ increases (provided $\mathsf{\alpha}$ is not too large).
    - For a detailed walkthrough of the derivative calculations, refer to: https://mccormickml.com/2014/03/04/gradient-descent-derivation/#:~:text=When%20there%20are%20multiple%20variables,update%20rule%20for%20each%20variable.&text=A%20partial%20derivative%20just%20means,%CE%B82%20as%20a%20constant.
    
![image-9.png](attachment:image-9.png)

- The learning rate $\mathsf{\alpha}$ determines how large a step is taken at each iteration while moving towards the minimum
    - If $\mathsf{\alpha}$ is too small, then gradient descent can be slow.
    - If $\mathsf{\alpha}$ is too large, gradient descent can overshoot the minimum. This means that the algorithm might fail to converge or even diverge. 
![image-10.png](attachment:image-10.png)
![image-11.png](attachment:image-11.png)
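
The effect of the learning rate can be illustrated on a one-parameter toy problem. The cost J(theta) = theta^2, whose derivative is 2*theta and whose minimum is at 0, is an assumption picked for simplicity, as are the three learning rates and the function name.

```python
def run_gradient_descent(alpha, theta=1.0, steps=10):
    """Run gradient descent on the toy cost J(theta) = theta^2 (minimum at 0)."""
    for _ in range(steps):
        gradient = 2 * theta           # dJ/dtheta for J = theta^2
        theta = theta - alpha * gradient
    return theta


# Three assumed learning rates: too small, reasonable, too large.
for alpha in [0.01, 0.4, 1.1]:
    print(f"alpha = {alpha}: theta after 10 steps = {run_gradient_descent(alpha)}")
```

After ten steps the smallest rate has barely moved away from the starting value of 1, the middle rate is very close to the minimum at 0, and the largest rate has moved further from the minimum than where it started, because every step overshoots.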

- Once the derivatives of the cost function with respect to $\mathsf{\theta}_0$ and $\mathsf{\theta}_1$ have been calculated, we can plug them back into the gradient descent update rule. For linear regression this gives:
    - $\mathsf{\theta}_0 := \mathsf{\theta}_0 -\mathsf{\alpha}\frac{1}{m}\sum \limits_{i=1}^{m}(h_\mathsf{\theta}(x^i)-y^i)$
    - $\mathsf{\theta}_1 := \mathsf{\theta}_1 -\mathsf{\alpha}\frac{1}{m}\sum \limits_{i=1}^{m}((h_\mathsf{\theta}(x^i)-y^i)(x^i))$
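
Putting the pieces together, gradient descent for univariate linear regression can be sketched as follows. The function name, the default learning rate, the iteration count, and the four-point training set (generated from y = 1 + 2x) are all assumptions for illustration, and a fixed number of iterations stands in for a real convergence check.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Gradient descent for h(x) = theta0 + theta1 * x with squared error cost."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # initial guesses
    for _ in range(iterations):
        # Prediction error h(x_i) - y_i for every training example.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # dJ/dtheta1
        # Tuple assignment performs the simultaneous update.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up training set generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))  # 1.0 2.0
```

Each iteration sums over every training example before the parameters change.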

- The Gradient Descent Algorithm is also sometimes referred to as "Batch" Gradient Descent.
    - 'Batch': Each step of the gradient descent uses **all** of the training examples.
![image-12.png](attachment:image-12.png)

### Linear Algebra Review

- Matrices:
    - a matrix is a rectangular array of numbers:
    $$A=\begin{bmatrix} 1902 & 191 \\ 1371 & 821 \\ 949 & 1437 \\ 147 & 1448 \end{bmatrix}$$
    - the dimension of a matrix is written as the number of rows (R) by the number of columns (C), so **R x C**
    - the set of all real-valued matrices of a given dimension is written with the dimension as a superscript, e.g. $\mathbb{R}^{3 \times 2}$ is the set of all matrices with 3 rows and 2 columns whose entries are real numbers; the matrix A above belongs to $\mathbb{R}^{4 \times 2}$
    - considering the matrix A above, $A_{ij}$ means the entry in the $i^{th}$ row and the $j^{th}$ column of A
- Vectors:
    - a vector is an n x 1 matrix, or a matrix with only 1 column:
    $$y=\begin{bmatrix} 460\\ 232 \\ 315 \\ 178 \end{bmatrix}$$
    - the example above might also be called a **4-dimensional vector**, a 4 x 1 matrix, or an element of $\mathbb{R}^{4}$
    - $y_i = i^{th}$ element
    - there are two different ways to index a vector:
        1. 1-indexed vectors $$y=\begin{bmatrix} y_1\\ y_2 \\ y_3 \\ y_4 \end{bmatrix}$$
        2. 0-indexed vectors $$y=\begin{bmatrix} y_0\\ y_1 \\ y_2 \\ y_3 \end{bmatrix}$$
- Matrices are usually referred to via capital variables, e.g. A,B,C,X
- Vectors are usually referred to via lowercase variables, e.g. a,b,x,y
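
As a small sketch, the matrix A and vector y above can be stored as plain Python lists to show how dimensions and indexing work. Nested lists are used here only to keep the example dependency-free (in practice a numerical library would normally be used), and the variable names are assumptions.

```python
# The 4 x 2 matrix A from above, stored as a list of rows.
A = [
    [1902, 191],
    [1371, 821],
    [949, 1437],
    [147, 1448],
]

rows, cols = len(A), len(A[0])
print(rows, cols)  # 4 2

# Math notation is 1-indexed, Python lists are 0-indexed,
# so the entry A_{ij} is read as A[i - 1][j - 1].
a_32 = A[3 - 1][2 - 1]
print(a_32)  # 1437

# The 4-dimensional vector y from above.
y = [460, 232, 315, 178]
y_1 = y[0]  # first element: y_1 when 1-indexed, y_0 when 0-indexed
print(y_1)  # 460
```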

- Addition and Scalar Multiplication:
    - Matrix Addition:
        $$\begin{bmatrix} 1 & 0 \\ 2 & 5 \\ 3 & 1 \end{bmatrix}+\begin{bmatrix} 4 & 0.5 \\ 2 & 5 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 5 & 0.5 \\ 4 & 10 \\ 3 & 2 \end{bmatrix}$$ 
        - you can only add two matrices if they are of the same dimension
    - Scalar Multiplication:
    $$3 \times \begin{bmatrix} 1 & 0 \\ 2 & 5 \\ 3 & 1 \end{bmatrix}= \begin{bmatrix} 3 & 0 \\ 6 & 15 \\ 9 & 3 \end{bmatrix}$$
        - you can also divide by a number, which is the same as multiplying by the fraction:
    $$\frac{\begin{bmatrix} 4 & 0 \\ 6 & 3\end{bmatrix}}{4} = \frac{1}{4} \times \begin{bmatrix} 4 & 0 \\ 6 & 3\end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \frac{3}{2} & \frac{3}{4}\end{bmatrix}$$
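
The element-wise nature of both operations can be sketched with nested lists. The helper names `mat_add` and `scalar_mul` are hypothetical, and the inputs are the matrices from the examples above.

```python
def mat_add(A, B):
    """Element-wise sum of two matrices with the same dimension."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must have the same dimension")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scalar_mul(c, A):
    """Multiply every entry of A by the number c."""
    return [[c * a for a in row] for row in A]


A = [[1, 0], [2, 5], [3, 1]]
B = [[4, 0.5], [2, 5], [0, 1]]

print(mat_add(A, B))                        # [[5, 0.5], [4, 10], [3, 2]]
print(scalar_mul(3, A))                     # [[3, 0], [6, 15], [9, 3]]
print(scalar_mul(1 / 4, [[4, 0], [6, 3]]))  # [[1.0, 0.0], [1.5, 0.75]]
```

Passing matrices of different dimensions to `mat_add` raises an error, mirroring the rule that only matrices of the same dimension can be added.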
    
- Matrix-vector Multiplication:
    - **Example 1:**
    $$\begin{bmatrix} 1 & 3 \\ 4 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1\\ 5\end{bmatrix}=\begin{bmatrix} 16 \\ 4 \\ 7 \end{bmatrix}$$
    - **Example 2:**
     $$\begin{bmatrix} 1 & 2 & 1 & 5 \\ 0 & 3 & 0 & 4 \\ -1 & -2 & 0 & 0 \end{bmatrix} \begin{bmatrix} 1\\ 3 \\ 2 \\ 1 \end{bmatrix} = \begin{bmatrix}14 \\ 13 \\ -7 \end{bmatrix}$$
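
Each entry of a matrix-vector product is the sum of the element-wise products of one matrix row with the vector. The sketch below applies that rule to the two examples; the helper name `mat_vec` is hypothetical.

```python
def mat_vec(A, x):
    """Multiply an m x n matrix A by an n-dimensional vector x, giving m entries."""
    if any(len(row) != len(x) for row in A):
        raise ValueError("the number of columns of A must equal the length of x")
    # Entry i of the result is row i of A multiplied element-wise with x, then summed.
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


# Example 1: a 3 x 2 matrix times a 2-dimensional vector.
print(mat_vec([[1, 3], [4, 0], [2, 1]], [1, 5]))  # [16, 4, 7]

# Example 2: a 3 x 4 matrix times a 4-dimensional vector.
print(mat_vec([[1, 2, 1, 5], [0, 3, 0, 4], [-1, -2, 0, 0]], [1, 3, 2, 1]))  # [14, 13, -7]
```

The result always has as many entries as the matrix has rows.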
     
- Matrix-matrix Multiplication: