# Introduction

- *What is Machine Learning?* Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
- Regression: Continuous variables; predicts real-valued output
- Classification: Discrete variables
- When the output can take thousands of distinct values (for example, a count of items), treat the problem as regression.
- Example:
    - Regression - Given a picture of a person, we have to predict their age.
    - Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.


# Model and Cost Function 
- **Supervised Learning**
    - Supervised learning problems are categorized into "regression" and "classification" problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output.
    
    
- **Unsupervised Learning**
    - Clustering: Putting things in groups, such as types of customers for a retail business or groups of people on a social media platform
    - Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.
    - With unsupervised learning there is no feedback based on the prediction results.


- When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.


- **Linear regression: Cost Function**
     - The cost function measures how well a choice of parameters θ0, θ1 fits the training data, so we use it to choose the parameters.
     - This function is otherwise called the "squared error function" or "mean squared error". The mean is halved (1/2) as a convenience for computing gradient descent, because the derivative of the squared term cancels the 1/2.
     - For a linear regression line, we sum the squared errors between the predicted and observed values and multiply by 1/(2m), m being the total number of observations. Then we plot the resulting cost against θ1 (the slope, sometimes written beta_1), with θ1 on the x-axis.
     - If we plot a range of values, we see that we get a convex parabola.
     - Our objective is to get the best possible line. The best possible line is the one for which the average squared vertical distance of the scattered points from the line is smallest. Ideally, the line should pass through all the points of our training data set. In such a case, the value of J(θ0,θ1) will be 0.
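
The cost described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the hypothesis h(x) = θ0 + θ1·x; the function name `cost` and the tiny training set are made up for the example.

```python
def cost(theta0, theta1, xs, ys):
    """Return J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2) for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = theta0 + theta1 * x   # hypothesis h(x)
        total += (prediction - y) ** 2     # squared vertical distance to the point
    return total / (2 * m)


# Hypothetical training set whose points lie exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

perfect = cost(1.0, 2.0, xs, ys)   # the line passes through every point
worse = cost(1.0, 1.0, xs, ys)     # wrong slope, so some error remains
print(perfect, worse)
```

Evaluating `cost` for a range of θ1 values (with θ0 fixed) traces out the convex parabola mentioned above, with its lowest point at the best slope.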

# Parameter Learning
- **Gradient Descent**
    - So we have our hypothesis function and we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.
    - We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters.
    - We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum.
    - The reason is that we want the least cost: the sum of the squared differences between the predicted and observed values should be as small as possible.
    - You start at some point and, from there, keep moving in the direction that lowers the cost the most, repeating until you reach a minimum.
    - The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent, and the size of each step is determined by the parameter α, which is called the learning rate.
    - alpha: the learning rate (a large value means we are taking large steps and a small value means we are taking small steps). This number is multiplied by the partial derivatives, whose signs give the direction to move in.
    - Uses partial derivatives
    - There are drawbacks either way: if alpha is too small, gradient descent takes a long time; if it's too large, it takes big steps, so you might overshoot the minimum and could even diverge.
    - This method looks at every example in the entire training set on every step, and is called batch gradient descent
    - While gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges to the global minimum (assuming the learning rate α is not too large).
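
The update loop described in these notes can be sketched as follows. This is an illustrative version only, assuming the one-variable hypothesis h(x) = θ0 + θ1·x and the squared error cost; the function name, the default `alpha` and `iterations` values, and the data are assumptions chosen for the example.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0              # arbitrary starting point
    for _ in range(iterations):
        # "Batch": every training example is used on every step.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                               # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # dJ/dtheta1
        # Simultaneous update: both gradients were computed from the old values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical data lying on y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]
theta0, theta1 = batch_gradient_descent(xs, ys)
print(theta0, theta1)   # close to 1 and 2
```

Computing both partial derivatives before changing either parameter matters: updating θ0 first and then using the new θ0 to compute the θ1 gradient would be a different algorithm.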
    
    
- **“Batch” Gradient Descent**: Each step of gradient descent uses all the training examples.
   
   
**Gradient Descent Tips**
- Debugging gradient descent. Make a plot with number of iterations on the x-axis. Now plot the cost function, J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α.
- Automatic convergence test. Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as 10^-3. However, in practice it's difficult to choose this threshold value.
- It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every iteration. Andrew Ng recommends decreasing α by multiples of 3.
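
Both tips can be combined in one loop that records J(θ) on every iteration. The sketch below is only one possible arrangement: the function name, the status strings, and the data are invented for the example, and `epsilon` plays the role of E above.

```python
def descend_with_history(xs, ys, alpha, epsilon=1e-3, max_iterations=10000):
    """Run gradient descent on h(x) = theta0 + theta1 * x, recording J each iteration."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    history = []                       # J(theta) per iteration, suitable for plotting
    status = "max_iterations"
    for _ in range(max_iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / (2 * m))
        if len(history) >= 2:
            decrease = history[-2] - history[-1]
            if decrease < 0:           # J went up: alpha is probably too large
                status = "increasing"
                break
            if decrease < epsilon:     # automatic convergence test
                status = "converged"
                break
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return history, status


xs = [1.0, 2.0, 3.0]   # hypothetical data on y = 1 + 2x
ys = [3.0, 5.0, 7.0]
good_history, good_status = descend_with_history(xs, ys, alpha=0.1)
bad_history, bad_status = descend_with_history(xs, ys, alpha=1.0)
print(good_status, bad_status)
```

Plotting `history` against its index gives the debugging plot described in the first tip; a rising curve is the signal to try a smaller α.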

# Linear Algebra Review
- **Matrices**
     - Rows x Columns
     - Addition and Subtraction
         - Add (or subtract) element by element, matching each position in one matrix with the same position in the other; both matrices must have the same dimensions, and the result keeps those dimensions
     - Scalar Mult
         - Multiply each value in the matrix by the scalar (a real number)
     - Combination of Operations
         - Follows the same order of operations as regular algebra
     - Matrix Vector Mult
         - Remember, go across a row of the matrix and down the vector's column, multiply the paired entries, and then add all the products
         - Can also be used to evaluate an equation on many inputs at once, where the equation's parameters are just a vector of numbers
     - Matrix Matrix Mult
         - Can only multiply matrices whose inner dimensions match (the first matrix's number of columns must equal the second matrix's number of rows)
         - As with the matrix vector mult, you can use it to evaluate equations, but now you can evaluate more than one equation at once
     - Properties
         - Is NOT commutative: in general, A x B is NOT equal to B x A
         - IS associative: A x B x C can be computed as (A x B) x C or A x (B x C)
         - I is the *Identity Matrix*, with 1s on the diagonal and 0s elsewhere. For any matrix A, A x I = I x A = A
         - Inverse: A x A^-1 (inverse) = I
         - Matrices that don't have an inverse are called *singular* or *degenerate*
         - Matrix Transpose: The first row of a matrix is the first column of the transpose matrix
- **Vector**: A matrix with only one column and n rows
     - 1 indexed vs 0 indexed (The number of the index the rows begin with)
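
The operations and properties above can be checked on small examples. The helpers below (`mat_add`, `scalar_mul`, `transpose`, `mat_mul`) are hypothetical names written in plain Python with nested lists so the sketch stays self-contained; in practice a numerical library such as NumPy would normally handle this.

```python
def mat_add(A, B):
    """Element-wise sum; A and B must have the same dimensions."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

def scalar_mul(c, A):
    """Multiply every entry of A by the scalar c."""
    return [[c * a for a in row] for row in A]

def transpose(A):
    """Rows of A become columns of the result."""
    return [list(col) for col in zip(*A)]

def mat_mul(A, B):
    """Row-by-column products; columns of A must equal rows of B."""
    if len(A[0]) != len(B):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 3]]
I = [[1, 0], [0, 1]]                     # 2 x 2 identity matrix
A_inv = [[-2.0, 1.0], [1.5, -0.5]]       # inverse of A, worked out by hand

print(mat_mul(A, B) == mat_mul(B, A))                           # not commutative
print(mat_mul(mat_mul(A, B), C) == mat_mul(A, mat_mul(B, C)))   # associative
print(mat_mul(A, I) == A and mat_mul(I, A) == A)                # identity
print(mat_mul(A, A_inv) == I)                                   # inverse

# Matrix-vector product: each row (1, x) times the parameter column (theta0, theta1)
# evaluates h(x) = theta0 + theta1 * x for one input. The numbers are made up.
X = [[1, 2104], [1, 1416]]
theta = [[-40], [0.25]]
predictions = mat_mul(X, theta)
print(predictions)
```

Note that Python lists are 0-indexed, so `A[0][1]` here is the element the notes would call A12.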
   

Notation and terms:

- Aij refers to the element in the ith row and jth column of matrix A.
- A vector with 'n' rows is referred to as an 'n'-dimensional vector
- vi refers to the element in the ith row of the vector.
- In general, all our vectors and matrices will be 1-indexed. Note that for some programming languages, the arrays are 0-indexed.
- Matrices are usually denoted by uppercase names while vectors are lowercase.
- "Scalar" means that an object is a single value, not a vector or matrix.
- ℝ refers to the set of scalar real numbers
- ℝⁿ refers to the set of n-dimensional vectors of real numbers


<img src="../images/image1.png" alt="Drawing" style="width: 800px;" >

**Matrix Notation**

The Gradient Descent rule can be expressed as:

θ := θ − α∇J(θ)

where ∇J(θ) is a column vector of the form:
<img src="../images/image2.png" alt="Drawing" style="width: 800px;">
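
One step of this rule can be sketched in Python. The sketch assumes the usual linear regression gradient, whose j-th entry is (1/m) Σ (h(x_i) − y_i) · x_ij, and a design matrix X whose first column is all ones; the function names and data are hypothetical, and plain lists stand in for the column vectors.

```python
def gradient(X, y, theta):
    """Return the gradient of J at theta as a list with one entry per parameter."""
    m = len(X)
    # errors[i] = h(x_i) - y_i, where h(x_i) is the dot product of row i with theta
    errors = [sum(x_j * t_j for x_j, t_j in zip(row, theta)) - y_i
              for row, y_i in zip(X, y)]
    # Entry j averages error * feature j over all training examples.
    return [sum(e * row[j] for e, row in zip(errors, X)) / m
            for j in range(len(theta))]

def step(X, y, theta, alpha):
    """One update theta := theta - alpha * gradient(theta), applied to every entry at once."""
    grad = gradient(X, y, theta)
    return [t - alpha * g for t, g in zip(theta, grad)]


# Hypothetical data on y = 1 + 2x; the leading 1 in each row multiplies theta0.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]

theta = [0.0, 0.0]
for _ in range(2000):
    theta = step(X, y, theta, alpha=0.1)
print(theta)   # close to [1, 2]
```

Because the update is written over the whole parameter vector, the same code works unchanged for any number of features.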