### <mark >Supervised Machine Learning: Regression and Classification</mark>

#### **<span style="color:hotpink">Week 1: Introduction to Machine Learning</span>**

\
**<span style="color:pink">What is Machine Learning?</span>**

"Field of study that gives computers the ability to learn without being explicitly programmed." -- Arthur Samuel (1959)

\
**<span style="color:pink">Machine Learning Algorithms</span>**

* Supervised Learning -- used most in real-world applications (covered in courses 1 & 2)
* Unsupervised Learning
* Recommender Systems
* Reinforcement Learning

>Note: Unsupervised learning, recommender systems and reinforcement learning will be covered in more detail in course 3.

___
#### **<span style="color:hotpink">Supervised Learning</span>**

Machine learning algorithms that learn `x` to `y` or `input` to `output` label mappings.

The key characteristic of supervised learning is that you give your learning algorithm examples to learn from that include the "right answers" (correct `y-label` for a given input `x`).

The learning algorithm learns to take just the input alone and gives a reasonably accurate prediction.

\
<span style="color:pink">Some applications of supervised machine learning include...</span>

|Input (X)|Output (Y)|Application|
|:---:|:---:|:---:|
|email|spam? (0/1)|<span style="color:pink">spam filtering</span>|
|audio|text transcripts|<span style="color:pink">speech recognition</span>|
|English|Spanish|<span style="color:pink">machine translation</span>|
|ad, user info|click? (0/1)|<span style="color:pink">online advertising</span>|
|image, radar info|position of other cars|<span style="color:pink">self-driving car</span>|
|image of phone|defect? (0/1)|<span style="color:pink">visual inspection</span>|

>Note: In each of the above applications, you would first train your model with examples of inputs (`x`) and the right answers (`y` labels). After the model has learned from these `x` and `y` pairs, it can take a brand new input `x` and try to produce the appropriate corresponding output `y`.

___
#### **<span style="color:hotpink">There are two main types of supervised learning:</span>**
* **Regression** - *used to predict a number* from infinitely many possible outputs
* **Classification** - used to predict a small number of outputs or categories

___
#### **<span style="color:hotpink">Regression Example: Predicting Housing Prices</span>**

<span style="color:pink">Say you want to predict housing prices based on the size of the house.</span>

You have collected and plotted your data...

<img src="images/regression_1.png" alt="" height="" width="400">

<span style="color:pink">Fitting the data with a line, curve or other function, we can interpolate a price for a house of a given size.</span>

How much can we sell a 750 square foot house for, using either a line or a curve to interpolate?

<img src="images/regression_2.png" alt="" width="400" height="">

From the diagram above, a straight line will predict a selling price near $150,000, whereas a curved line will predict a price closer to $200,000.

>Note: Getting an algorithm to systematically choose the most appropriate line, curve or other function to fit to this data is an example of regression.

___
#### **<span style="color:hotpink">Classification Example: Breast cancer detection</span>**

<span style="color:pink">Say you are building a system so that doctors can have a diagnostic tool to detect breast cancer.</span>

This is important because early detection can save lives.

Using a patient's records, machine learning tries to figure out whether a tumor is malignant or benign.

\
<span style="color:pink">Let's say your data set has tumors of various sizes and these tumors are labeled as benign (0) or malignant (1).</span>

We can plot the data such that the horizontal axis represents the size of the tumor and the vertical axis takes on only two values, 0 or 1, depending on whether the tumor is benign or malignant.

<img src="images/classification_1.png" alt="" width="400" height="">

<span style="color:pink">One way that this example differs from regression is that we are only trying to predict a small number of possible outputs or categories.</span>

In this case there are two possible outputs: 0 or 1, benign or malignant.

This is different from regression, which tries to predict any number out of infinitely many possible numbers (our regression example can predict infinitely many prices, for example $0.25, $4.73, etc.).

\
<span style="color:pink">You can also plot this data set on a line, using two different symbols to denote the category (`o` for benign and `x` for malignant).</span>

<img src="images/classification_02.jpg" alt="" height="" width="400">

\
If a new patient walks in for a diagnosis and they have a lump of a given size, will your system classify this as benign or malignant?

<span style="color:pink">In classification problems, we can have more than two possible output categories.</span>

Maybe your learning algorithm can output multiple types of cancer diagnoses if it turns out to be malignant.

\
Let's call the two types of cancer `type 1` and `type 2`.

Now our learning algorithm will have three possible output categories it could predict...

<img src="images/classification_4.jpg" alt="" height="" width="400">

Note: In classification, the terms output classes and categories are used interchangeably.

<span style="color:pink">Classification algorithms predict categories, which can be non-numerical.</span>

For example, we can predict if a picture is that of a cat or a dog, or if a tumor is benign or malignant. 

\
<span style="color:pink">Categories can also be numbers.</span>

For example, our classifier predicts 0, 1 or 2, but not the numbers in between, such as 0.5 or 1.7.

\
<span style="color:pink">We can use multiple inputs to predict an output.</span>

For example, let's say we know both the tumor size and the patient's age in years ... our new dataset now has two inputs.

We can plot benign tumors as `o`'s and malignant tumors as `x`'s...

\
<span style="color:pink">Given a patient's tumor size and age, how can we predict whether the tumor is benign or malignant?</span>

<img src="images/classification_5.jpg" alt="" height="" width="400">

\
<span style="color:pink">A machine learning algorithm could find some boundary that separates out the malignant tumors from the benign ones...</span>

<img src="images/classification_6.jpg" alt="" height="" width="400">

\
So the machine learning algorithm needs to decide how to fit a boundary line to this data.

This can help the doctor determine the likelihood that a tumor will be benign or malignant.

>Note: In actual practice, many additional inputs would be used (thickness of tumor clump, uniformity of the cell shape, etc.)

___
#### **<span style="color:hotpink">Summary: Supervised Learning</span>**

<img src="images/supervised_learning_1.jpg" alt="" height="" width="400">

___

#### **<span style="color:hotpink">Unsupervised Learning</span>**

Unsupervised learning is the most widely used form of machine learning after supervised learning.

\
**<span style="color:pink">Supervised Learning vs Unsupervised Learning</span>**

<img src="images/unsupervised_learning_0.jpg" alt="" width="400" height="">

In our supervised learning examples, each data point was associated with an output label y, such as benign or malignant.

In unsupervised learning we are given data that isn't associated with any output labels.

For example, you are given data on a patient's tumor size and age but not whether the tumor is benign or malignant.

\
<span style="color:pink">How can we determine whether a tumor is benign or malignant using unsupervised learning?</span>

Since we are not given any labels from our dataset, our job is to find some structure or pattern in the data; or just something interesting.

We call this unsupervised learning because we are not trying to supervise the algorithm to give the "right answer" for every input. Instead we ask the algorithm to figure out, all by itself, what's interesting or what patterns or structures there might be in this data.

**<span style="color:pink">Clustering</span>**

<img src="images/unsupervised_learning_1.jpg" alt="" width="400" height="">

In this particular example, an unsupervised learning algorithm may decide that the data can be assigned into two different groups or clusters.

This is known as a clustering algorithm because it places the unlabeled data into different clusters.
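
As a rough sketch of what a clustering algorithm does, the snippet below runs k-means, one common clustering method. The notes do not name a specific algorithm, so this choice is an assumption, and the data points, starting centroids and function name `kmeans` are all made up for illustration.

```python
def kmeans(points, centroids, n_iters=10):
    """Minimal k-means on 2-D points, starting from the given centroids."""
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                              + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((sum(p[0] for p in members) / len(members),
                                      sum(p[1] for p in members) / len(members)))
            else:
                new_centroids.append(centroids[k])  # empty cluster: keep old centroid
        centroids = new_centroids
    return labels, centroids


# Hypothetical unlabeled data: (tumor size, patient age), with no benign/malignant label.
points = [(1.0, 30), (1.5, 35), (1.2, 32), (4.0, 60), (4.5, 65), (4.2, 62)]
labels, centroids = kmeans(points, centroids=[(1.0, 30), (4.0, 60)])
print(labels)  # cluster index assigned to each point
```

The algorithm is never told which group a point belongs to; the two clusters come only from how close the points are to each other.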

Clustering is used in many applications...

<span style="color:pink">Clustering: Google News</span>

<img src="images/clustering_1.png" alt="" width="400" height="">

Every day Google News looks at hundreds of thousands of news articles on the internet and groups related stories together.

This is done by grouping articles containing similar words into clusters.

There is so much content, it is not feasible to have individuals sorting through news articles every day.

The algorithm needs to figure out on its own, without supervision, what the clusters of news articles are today.

<span style="color:pink">Clustering: DNA Microarray</span>

<img src="images/clustering_2.png" alt="" width="400" height="">

Clustering is also applied to DNA microarray data.

Each column represents the DNA activity of one person (represented by stick figures).

Each row represents a particular gene (eye color, height, taste, etc).

\
DNA microarrays are used to measure how much certain genes are expressed for each individual person.

The colors show the degree to which individuals do or do not have a specific gene active.

\
We can use a clustering algorithm to group individuals into categories / types of people (type 1, type 2, etc).

This is unsupervised learning because we are not telling the algorithm in advance that there is a type 1 person with certain characteristics, or a type 2 person with certain characteristics, etc. Instead we are saying: "Here is a bunch of data. I don't know what the different types of people are, but can you automatically find structure in the data and identify the major types of individuals?"

<span style="color:pink">Clustering: Grouping Customers</span>

Many companies have huge databases of customer information.

Given this data, can you automatically group your customers into different market segments so that you can more efficiently serve your customers?

\
The `DeepLearning.AI` team did some research to better understand their community.

Why do individuals take certain classes? 

Do they subscribe to the newsletter or attend events?

\
Let's visualize the `DeepLearning.AI` community as this collection of people.

<img src="images/clustering_4.png" alt="" width="400" height="">

Who are the major learners in the `DeepLearning.AI` community?

Running clustering (or market segmentation) found a few distinct groups of individuals:

* Group 1: primary motivation was seeking knowledge to grow their skills
* Group 2: primary motivation was looking for a way to build their career
* Group 3: primary motivation was to stay updated on how AI impacts their field of work

#### <span style="color:hotpink">Summary</span>

<img src="images/unsupervised_learning_2.png" alt="" width="400" height="">


>Anomaly detection is especially important for fraud detection in the financial system where unusual transactions could be a sign of fraud.

>Dimensionality reduction is used to compress a data set while minimizing the loss of information.

___
___
#### **<span style="color:hotpink">Linear Regression Model</span>**

Most widely used learning algorithm.

Basically fitting a straight line to your data.

\
<span style="color:pink">Example: You want to predict the price of a house based on size...</span>

>Note: We will use a data set on houses in Portland.

<img src="images/lr_1.png" alt="" width="400" height="">

If we want to determine the selling price from this data we can build a linear regression model.

In other words we can fit a straight line (best fit line) to the data from which we can interpolate.

<img src="images/lr_2.png" alt="" width="400" height="">

We call this supervised learning because you first train the model by giving it data that has the right answers.

This linear regression model is a particular type of supervised learning model. It's called a regression model because it predicts numbers as the output, like prices in dollars.

Any supervised learning model that predicts a number such as 220,000 or 1.5 or negative 33.2 is addressing what's called a regression problem. 

___
#### **<span style="color:hotpink">Terminology</span>**

**<span style="color:pink">Training Set</span>** - Data used to train the model

**<span style="color:pink">Input Variable</span>** - Also called the feature variable, is denoted by the letter `x` 

**<span style="color:pink">Output Variable</span>** - Also called the target variable, is denoted by the letter `y` 

**<span style="color:pink">Training Example</span>** - To indicate a single training example, we use the notation `(x,y)`

**<span style="color:pink">Number of Training Examples</span>** - Represented by the letter `m`

**<span style="color:pink">i<sup>th</sup> Training Example</span>** - Represented by the notation <span style="color:orange">(x<sup>(i)</sup>,y<sup>(i)</sup>)</span>

\
Example Training Set:

<img src="images/training_set_1.png" alt="" width="200" height="">

In our first example, the data used to train our model is our `training set`. The `input variable` is the house size in square feet, denoted by `x`. The `output variable` is the price of the corresponding house, in thousands of dollars, which is represented by the letter `y`.

Each row in the `training set` represents a different training example.

In our first `training example` `(x,y)` = (2104,400) and the `total number of training examples` is 47.

To refer to a specific training example we use the notation <span style="color:orange">(x<sup>(i)</sup>,y<sup>(i)</sup>)</span>, where `i` refers to a specific row. Therefore, <span style="color:orange">(x<sup>(1)</sup>,y<sup>(1)</sup>)</span> = (2104,400).

>Note: <span style="color:orange">x<sup>(2)</sup></span> is not an exponent; it refers to the 2nd training example.
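
To make the notation concrete, here is a small sketch of how a training set might be stored in Python. Only the first row, (2104, 400), comes from the notes; the other rows and the helper name `training_example` are made up for illustration. Python lists are indexed from 0, so the notes' 1-based index `i` is shifted by one.

```python
# Only the first row (2104, 400) is from the notes; the rest are illustrative values.
x_train = [2104, 1416, 1534, 852]   # input feature x: house size in square feet
y_train = [400, 232, 315, 178]      # output target y: price, in the units of the table

m = len(x_train)                    # m = number of training examples


def training_example(i):
    """Return (x^(i), y^(i)) for the notes' 1-based index i."""
    return x_train[i - 1], y_train[i - 1]


print(m)
print(training_example(1))  # the first training example, (x^(1), y^(1))
```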

___
#### **<span style="color:hotpink">How Does Supervised Learning Work?</span>**

<img src="images/supervised_learning_2.jpg" alt="" width="400" height="">

In supervised learning the `training set` includes <span style="color:orange">input features</span> (size of the house) as well as the <span style="color:red">output targets</span> (price of the house).

To train the model we pass our `training set` to our `learning algorithm`.

The `learning algorithm` will produce a function *<span style="color:aqua">f</span>*.

>Note: The function *<span style="color:aqua">f</span>* used to be called the hypothesis.

The job of this function *<span style="color:aqua">f</span>* is to take a new input <span style="color:orange">x</span> and output an estimate or prediction, <span style="color:fuchsia">ŷ</span>.

\
<span style="color:aqua">f</span> is the model.

<span style="color:orange">x</span> is the input feature.

<span style="color:fuchsia">ŷ</span> is the prediction, or the estimated value of <span style="color:red">y</span>.

<span style="color:red">y</span> is the target or "true value" taken from our training set.

##### **`Linear Regression`**

When we design a learning algorithm, a key question is: `"How are we going to represent the function` <span style="color:aqua">*f*</span> `?"` or `"What is the math formula we are going to use to compute` <span style="color:aqua">*f*</span> `?"`

Assuming <span style="color:aqua">*f*</span> is a straight line our function can be written as _**<span style="color:aqua">f<sub>w,b</sub>(x) = wx + b</span>**_.

If we plot our training set such that our input feature <span style="color:orange">x</span> is on the horizontal axis and our output target <span style="color:red">y</span> is on the vertical axis, our algorithm "learns" from this data and generates a best fit line, <span style="color:aqua">*f*</span> (a straight line in this case).

In this case our <span style="color:aqua">function</span> is making predictions for the value of <span style="color:red">y</span> using a straight line function of <span style="color:orange">x</span>.

This model is called `linear regression`.

More specifically, a `linear model with one input` or `univariate linear regression` because we only have one input variable (size of the house).

\
Why use a linear function instead of a <span style="color:purple">non-linear function</span> like a curve or parabola?

Linear functions are easier to work with. Therefore we will use a line as a foundation to understand more complex non-linear models.
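
The model f<sub>w,b</sub>(x) = wx + b is short enough to write directly as a function. This is a minimal sketch; the function name `f_wb` and the parameter values are hypothetical and chosen only to show a call.

```python
def f_wb(x, w, b):
    """Linear model with one input: returns the prediction y_hat = w * x + b."""
    return w * x + b


# Hypothetical parameters, not learned from any data.
w, b = 0.5, 10.0
y_hat = f_wb(1200, w, b)   # prediction for an input x = 1200
print(y_hat)
```

Changing `w` changes the slope of the line and changing `b` shifts it up or down, which is what the learning algorithm adjusts during training.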

___
#### **<span style="color:hotpink">Cost Function</span>**

The cost function tells us how well our model is doing.

<img src="images/cost_func_1.png" alt="" width="400" height="">

Recall:
* Training set contains input features x and output targets y
* The model we are using to fit the training set is a linear function f<sub>w,b</sub>(x) = wx + b.
* w and b are the parameters of the model (also called coefficients or weights); they can be adjusted during training to improve the model.

\
Let's see what happens when we change our parameters...

<img src="images/cost_func_2.png" alt="" width="400" height="">

We want to choose values of w and b so that the straight line we get from our function f fits the data well.

<img src="images/cost_func_3.png" alt="" width="400" height="">

<span style="color:pink">Let's break down the cost function...</span>

>We want to measure how far off the prediction is from the target.

\
<span style="color:pink">The cost function takes the prediction <span style="color:fuchsia">ŷ</span> and compares it to the target <span style="color:red">y</span> by taking the difference between the two.</span>

<span style="color:fuchsia">ŷ</span> - <span style="color:red">y</span>

\
<span style="color:pink">The difference or "error" is squared to get a positive number.</span>

(<span style="color:fuchsia">ŷ</span> - <span style="color:red">y</span>)<sup>2</sup>

\
<span style="color:pink">We want to measure the error across the entire training set by summing the squared errors.</span>

∑<sup><span style="color:purple">m</span></sup><sub>i=1</sub>(<span style="color:fuchsia">ŷ</span><sup>(i)</sup> - <span style="color:red">y</span><sup>(i)</sup>)<sup>2</sup>

\
<span style="color:pink"><span style="color:purple">m</span> is the total number of training examples (47 for this data set). As <span style="color:purple">m</span> increases so will our cost function, since we are summing over more examples. To correct for this we will compute the average squared error rather than the total squared error. The average is computed by dividing the cost function by <span style="color:purple">m</span>.</span>

(1/<span style="color:purple">m</span>) ∑<sup><span style="color:purple">m</span></sup><sub>i=1</sub>(<span style="color:fuchsia">ŷ</span><sup>(i)</sup> - <span style="color:red">y</span><sup>(i)</sup>)<sup>2</sup>

\
<span style="color:pink">By convention we also divide this value by 2. This makes the later calculations (the derivatives) neater, but the cost function works the same either way.</span>

`Cost Function`: J<sub>(<span style="color:aqua">w,b</span>)</sub> = (1/2<span style="color:purple">m</span>) ∑<sup><span style="color:purple">m</span></sup><sub>i=1</sub>(<span style="color:fuchsia">ŷ</span><sup>(i)</sup> - <span style="color:red">y</span><sup>(i)</sup>)<sup>2</sup>

\
>Note: The `Cost Function` is also called the `Squared Error Cost Function` because we are taking the square of the error terms. In machine learning different cost functions are used for different applications, but the `squared error cost function` is the most common for linear regression.

\
<span style="color:pink">The prediction <span style="color:fuchsia">ŷ</span> is equal to the output of the model <span style="color:aqua">*f*</span>(<span style="color:orange">x</span>). Therefore, we can rewrite the function as follows:</span>

`Cost Function`: J<sub>(<span style="color:aqua">w,b</span>)</sub> = (1/2<span style="color:purple">m</span>) ∑<sup><span style="color:purple">m</span></sup><sub>i=1</sub>(<span style="color:aqua">f<sub>w,b</sub></span>(<span style="color:orange">x</span><sup>(i)</sup>) - <span style="color:red">y</span><sup>(i)</sup>)<sup>2</sup>

>Note: Eventually, we will want to find values of w and b that make the cost function small.
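
The squared error cost function can be sketched in a few lines of Python. The function name `compute_cost` and the three-point training set are assumptions made for illustration; they are not the housing data from the slides.

```python
def compute_cost(x, y, w, b):
    """Squared error cost J(w, b) = (1 / (2m)) * sum((f_wb(x_i) - y_i)^2)."""
    m = len(x)
    total = 0.0
    for x_i, y_i in zip(x, y):
        y_hat = w * x_i + b            # model prediction for this example
        total += (y_hat - y_i) ** 2    # squared error for this example
    return total / (2 * m)


# Hypothetical training set lying exactly on the line y = x.
x_train = [1, 2, 3]
y_train = [1, 2, 3]

perfect = compute_cost(x_train, y_train, w=1, b=0)   # line passes through every point
poor = compute_cost(x_train, y_train, w=0, b=0)      # flat line at zero
print(perfect, poor)
```

A line that passes through every training point gives a cost of zero; the further the predictions are from the targets, the larger the cost.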

#### **<span style="color:hotpink">What is the Cost Function doing?</span>**

<img src="images/cost_func_4.png" alt="" width="400" height="">

<span style="color:pink">Standard Model Review</span>

We want to use our model to fit a straight line to the training data.

We can generate different lines depending on the chosen parameters (`w` & `b`)

We use the cost function to test how well the model fits our training data.

The cost function measures the difference between the model's predictions `ŷ` or `f`<sub>`w,b`</sub>`(x)` and the actual "true" values for `y`.

In other words, we want to minimize the cost function `J` by adjusting the parameters `w` and `b`.

<span style="color:pink">Simplified Model</span>

Let's look at a simplified model by setting the parameter `b` equal to `0`.

Now we only have one parameter, `w`.

Now `f`<sub>`w`</sub>`(x`<sup>`(i)`</sup>`) = wx`<sup>`(i)`</sup> and `J` is a function of `w`.

With this simplified model, our goal is to find a value for `w` that minimizes `J(w)`.

Using this simplified model, let's see how the cost function changes as you choose different values of the parameter `w`...

* If `w` is fixed, `f`<sub>`w`</sub>`(x)` is a function of the input `x`.

* In contrast, `J(w)` is a function of the parameter `w`, which controls the slope of the line defined by `f`<sub>`w`</sub>`(x)`. Therefore the cost function depends on the parameter `w`.

<span style="color:pink">`w=1` → `f`<sub>`w`</sub>`(x)=x` → `J(w)=0` (there is NO difference between predicted and "real" value)</span>

<img src="images/cost_func_05.png" alt="" width="400" height="">

<span style="color:pink">`w=0.5` → `f`<sub>`w`</sub>`(x)=0.5x` → `J(w)≃ 0.58` (there is a difference between predicted and "real" value)</span>

<img src="images/cost_func_6.png" alt="" width="400" height="">

<span style="color:pink">`w=0` → `f`<sub>`w`</sub>`(x)=0` → `J(w)≃2.3` (there is a difference between predicted and "real" value)</span>

<img src="images/cost_func_7.png" alt="" width="400" height="">

We can continue calculating the cost function for different values of `w`...

\
`w` can be any number including negative values. 

Negative values of `w` result in downward sloping lines.

If `w=-0.5` the line would slope downward and we would have a greater value of `J(w)≃5.25`.

<img src="images/cost_func_8.jpeg" alt="" width="400" height="">

Each value of `w` corresponds to a different straight line fit on the graph on the left.

For each value of `w`, we can calculate a cost `J(w)`. This value corresponds to a single point on the graph on the right.

By computing a range of `w` values we can trace out what the `J(w)` function looks like.
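
The costs quoted above can be checked numerically. The sketch below assumes the training set is the three points (1,1), (2,2), (3,3); the notes do not state the data explicitly, but this choice is consistent with the quoted values of `J(w)`.

```python
# Assumed training set (not stated in the notes): three points on the line y = x.
x_train = [1, 2, 3]
y_train = [1, 2, 3]


def J(w):
    """Squared error cost for the simplified model f_w(x) = w * x (b = 0)."""
    m = len(x_train)
    return sum((w * x - y) ** 2 for x, y in zip(x_train, y_train)) / (2 * m)


# Each value of w gives one line on the left plot and one point on the J(w) curve.
for w in [1, 0.5, 0, -0.5]:
    print(f"w = {w:>4}: J(w) = {J(w):.2f}")
```

Under this assumption the loop prints costs of 0.00, 0.58, 2.33 and 5.25, tracing out four points on the u-shaped `J(w)` curve.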

<span style="color:pink">How can we choose a value `w` that results in a function `f`<sub>`w`</sub>`(x)` that fits our training set "well"?</span>

We want to choose a parameter `w` that minimizes the cost function `J(w)`.

>Note: In the general case, we want to choose parameters `w` and `b` that would minimize `J(w,b)`.

<span style="color:pink">Let's look at the cost function using our original model...</span>

<img src="images/cost_func_9.png" alt="" width="400" height="">



<img src="images/cost_func_10.png" alt="" width="400" height="">

On the right is a training set of house sizes and prices.

When we only had one parameter `w` the cost function was a `u-shaped` curve like the one on the right.

\
Using two parameters `w` and `b`, the cost function becomes a little more complex...

<img src="images/cost_func_12.png" alt="3d surface plot" width="400" height="">

Any single point on this surface represents a particular choice for `w` and `b`.
 
For example, if `w=-10` and `b=-15` then the height of the surface above this point is the value of `J(w=-10,b=-15)`.

Another way to visualize this is by using a contour plot...

<img src="images/cost_func_13.png" alt="" width="400" height="">

At the bottom of this slide is a 3-d surface plot of the cost function `J(w,b)`.

At the upper right is a contour plot of the exact same cost function, with `b` on the vertical axis and `w` on the horizontal axis. Each point on a particular ellipse is at the exact same height (same value of `J(w,b)` even though `b` and `w` are different).

On the upper left you can see that these three points correspond to different functions `f`<sub>`w,b`</sub>`(x)`.

The cost function is at a minimum at the center of the ellipses.
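
A contour plot is built by evaluating `J(w,b)` on a grid of parameter values. The sketch below does this without plotting, on a made-up three-point training set; the grid ranges and the names `cost` and `grid` are assumptions for illustration.

```python
# Hypothetical training set lying on the line y = x.
x_train = [1, 2, 3]
y_train = [1, 2, 3]


def cost(w, b):
    """Squared error cost J(w, b) for the model f(x) = w * x + b."""
    m = len(x_train)
    return sum((w * x + b - y) ** 2 for x, y in zip(x_train, y_train)) / (2 * m)


# Evaluate J on a coarse grid of (w, b) pairs, as a contour plot would.
w_values = [i * 0.5 for i in range(-4, 9)]   # -2.0 to 4.0 in steps of 0.5
b_values = [j * 0.5 for j in range(-4, 5)]   # -2.0 to 2.0 in steps of 0.5
grid = {(w, b): cost(w, b) for w in w_values for b in b_values}

# The grid point with the lowest cost plays the role of the center of the ellipses.
best_w, best_b = min(grid, key=grid.get)
print(best_w, best_b, grid[(best_w, best_b)])

# Two different (w, b) pairs with the same cost sit on the same contour line.
same_contour = cost(1.0, 0.5) == cost(1.0, -0.5)
print(same_contour)
```

A grid search like this only works for one or two parameters; it is shown here to explain the plot, not as a way to train a model.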

<span style="color:hotpink">Let's look at some more visualizations of `w` and `b`...</span>

<img src="images/cost_func_15.png" alt="" width="400" height="">

<span style="color:hotpink">Top Left:</span> This is a contour plot. The point indicated on the graph, where `w≃-0.15` and `b≃800`, corresponds to one pair of values `w` and `b` that yield a particular cost `J`. The point is far from the center of the ellipses, indicating that our function `f(x)` is not a good fit.

<span style="color:hotpink">Top Right:</span> This plot shows our training set and a function corresponding to the parameters `w=-0.15` and `b=800`. Just looking at the plot, we can see that our function `f(x)=-0.15x+800` is not a good fit to the data. This is because many of the predictions `ŷ` are far from the actual target value of `y` that is in the training data.

<img src="images/cost_func_16.png" alt="" width="400" height="">

Not a good fit, but slightly better...

<span style="color:hotpink">Top Left:</span> The point indicated on this contour plot represents the cost corresponding to the values `w≃0` and `b≃360`. 

<span style="color:hotpink">Top Right:</span> This pair of parameters corresponds to the function `f(x)=360`, which is a flat line.

<img src="images/cost_func_17.png" alt="" width="400" height="">

Not a great fit to the data. Further away from the minimum than the previous example.

>Recall: Minimum is at the center of smallest ellipse.

<img src="images/cost_func_18.png" alt="" width="400" height="">

<span style="color:hotpink">Top Left:</span> The function `f(x)` is a pretty good fit to training data. If you measure vertical distances between the data points and the predicted values on the straight line, the sum of their squared errors is close to the minimum.

<span style="color:hotpink">Top Right:</span> Close to center (minimum).

___
#### **<span style="color:hotpink">Gradient Descent</span>**

<img src="images/gradient_descent_1.png" alt="" width="400" height="">

Provides a more systematic way of finding optimal values of `w` and `b` that minimize `J`.

Gradient descent applies to more general functions, including other cost functions that work with models that have more than two parameters.

For example, suppose you have a cost function J(w<sub>1</sub>, w<sub>2</sub>,...,w<sub>n</sub>,b). Your objective is then to minimize `J` over the parameters `w`<sub>`1`</sub> to `w`<sub>`n`</sub> and `b`. In other words, you want to pick values for these parameters that give you the smallest possible value of `J`.

We start by making some initial guesses for `w` and `b`. In linear regression it doesn't matter what the initial values are so it is common to set them both equal to `0` for the initial guess.

Next, we continue changing these values to reduce the cost function until we settle at or near a minimum.

>Note: Not all cost functions `J` will have a bowl shape. In other words, it is possible to have more than one minimum.

<span style="color:pink">Let's look at a more complex surface plot...</span>

<img src="images/gradient_descent_2.png" alt="" width="400" height="">

This is not a squared error cost function.

Linear regression with a squared error cost function always ends up with a bowl-shape or hammock-shape.

This is the type of function you may get if you were training a neural network model.

On this plot we have `w` and `b` on the bottom axes.

For different values of `w` and `b`, you get different points on the surface `J(w,b)`.

The height of the surface at some point is the value of the cost function.

From some initial starting point, we look around until we find the `direction of steepest descent`. You can imagine yourself standing on the top of a hill, looking for the quickest way down into one of the valleys.

Once we determine that direction, we take a "step" in the direction of steepest descent and keep repeating this procedure until we reach some minimum value.

What we did was to go through multiple steps of gradient descent until we reached a local minimum.

>Note: If we repeated this process using a different starting point, we may end up at a different local minimum.

Next, let's look at the math behind gradient descent...

<img src="images/gradient_descent_3.png" alt="" width="400" height="">

<span style="color:pink">Let's break down the first equation...</span>

`w=w-α(∂/∂w)J(w,b)`

This equation is saying: `w` is updated to the old value of `w` minus `α` multiplied by the partial derivative of `J(w,b)` with respect to `w`.

The `=` symbol is the assignment operator (*see chart below*).

`α` is the learning rate. It is usually a small positive value between 0 and 1 (e.g. 0.01) and controls how big a step we take "downhill". A large value of `α` corresponds to an aggressive gradient descent procedure that takes huge steps downhill; a small value of `α` takes small steps.

`(∂/∂w)J(w,b)` is the derivative term for the cost function `J`. It controls the direction in which we want to "step". In combination with the learning rate `α`, it also controls the size of the steps we take.

\
<span style="color:pink">Now let's look at the second equation</span>

The second equation `b=b-α(∂/∂b)J(w,b)` is saying: `b` is updated to the old value of `b` minus `α` multiplied by the partial derivative of `J(w,b)` with respect to `b`.

\
<span style="color:pink">Implementing gradient descent</span>

In gradient descent the parameters `w` and `b` are repeatedly updated until the algorithm "converges", meaning that it reaches a point at a local minimum where the parameters no longer change much with each additional step you take.

To correctly implement gradient descent, we want to simultaneously update `w` and `b`.

This means that we want to use the pre-updated values of `w` and `b` to calculate `temp_w` and `temp_b` before updating our parameters.

An incorrect way to do it would be to calculate `temp_w` and then update `w` before calculating `temp_b`. In that scenario we would be using the old value of `w` to update `w` and the new updated version of `w` to calculate `b`.

>Note: If we did use the incorrect method to implement gradient descent, our model would still probably work; but it is incorrect.
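
The difference between the two update orders can be seen in a small sketch. It assumes the squared error cost for the model `f(x) = w*x + b`, whose partial derivatives are written out in `gradients`; the function names and the three-point data set are made up for illustration.

```python
def gradients(x, y, w, b):
    """Partial derivatives of the squared error cost J(w, b) for f(x) = w*x + b."""
    m = len(x)
    dj_dw = sum((w * xi + b - yi) * xi for xi, yi in zip(x, y)) / m
    dj_db = sum((w * xi + b - yi) for xi, yi in zip(x, y)) / m
    return dj_dw, dj_db


def step_simultaneous(x, y, w, b, alpha):
    """Correct: both temporaries are computed from the OLD w and b."""
    dj_dw, dj_db = gradients(x, y, w, b)
    temp_w = w - alpha * dj_dw
    temp_b = b - alpha * dj_db
    return temp_w, temp_b


def step_sequential(x, y, w, b, alpha):
    """Incorrect: w is overwritten before the derivative for b is computed."""
    dj_dw, _ = gradients(x, y, w, b)
    w = w - alpha * dj_dw
    _, dj_db = gradients(x, y, w, b)   # this gradient already sees the NEW w
    b = b - alpha * dj_db
    return w, b


# Hypothetical training set and starting point.
x_train, y_train = [1, 2, 3], [1, 2, 3]
w_ok, b_ok = step_simultaneous(x_train, y_train, 0.0, 0.0, alpha=0.1)
w_bad, b_bad = step_sequential(x_train, y_train, 0.0, 0.0, alpha=0.1)
print(w_ok, b_ok)
print(w_bad, b_bad)
```

After one step both versions agree on `w`, but they produce different values of `b`, because the sequential version mixes old and new parameter values.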

\
<span style="color:pink">Assignment Operator vs Truth Assertion</span>

||Assignment Operator|Truth Assertion|
|:--:|:--:|:--:|
|Definition|Assigns a value to a variable|Asserts the truth of the equality of two values|
|Used in|Coding|Mathematics and Coding|
|Symbol|`=`|`=` (math); `==` (coding)|
|a=c|Takes the value `c` and stores it in your computer inside of the variable `a`|"Asserts" or claims that the values for `a` and `c` are equal|
|a=a+1|Increments the value of `a` by one|This statement is mathematically incorrect|
|a==c|N/A|Tests to see if `a` is equal to `c`|

>In math notation we can use `=` to indicate either an assignment operator or a truth assertion, so we will try to indicate which is which in the notes. In the slide above we are using `=` to represent the assignment operator.
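
In Python the two meanings use different symbols, which the short sketch below illustrates. The variable names follow the table; the starting values are arbitrary.

```python
a = 3
c = 5

a = c                   # assignment: store the value of c in the variable a
equal_after_assign = (a == c)   # equality test: are the two values the same?

a = a + 1               # assignment: valid in code, it increments a by one
equal_after_increment = (a == c)

print(a, equal_after_assign, equal_after_increment)
```

The gradient descent update `w = w - α(∂/∂w)J(w,b)` is an assignment in this sense: the right-hand side is computed first and then stored back into `w`.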

Let's dive more deeply into gradient descent to get better intuition...

<img src="images/gradient_descent_4.png" alt="" width="400" height="">

Recall from the previous slide that `α` is the learning rate. The learning rate controls how big of a step you take when updating the model's parameters, `w` and `b`.

The derivative term controls the direction of the "step" we are going to take.

\
What is the effect of these terms when updating `w` and `b`?

\
Let's use a simpler example, minimizing one parameter `w` to get a better understanding of how the learning rate and the derivative term work together.

Using the simplified equation `w=w-α(∂/∂w)J(w)` our goal is to minimize the cost `J(w)` by adjusting the parameter `w`.

This is similar to a previous example, where we temporarily set the parameter `b=0`.

Looking at one parameter instead of two, we can visualize gradient descent using a 2-D graph...

<img src="images/gradient_descent_5.png" alt="" width="400" height="">

Let's initialize gradient descent for some starting value for `w`...

>Note: A derivative is the slope of the tangent line.

>Note: The learning rate `α` is always a positive number.

<span style="color:pink">Upper Left:</span> The derivative term is a positive number; therefore, `w` will decrease.

<span style="color:pink">Lower Left:</span> The derivative term is a negative number; therefore, `w` will increase.
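
The effect of the sign of the derivative can be checked numerically. This sketch assumes the simplified model `f(x) = w*x` with a squared error cost on a made-up three-point training set, for which the minimum is at `w = 1`; the function name `dJ_dw` is hypothetical.

```python
# Hypothetical training set on the line y = x, so J(w) is smallest at w = 1.
x_train = [1, 2, 3]
y_train = [1, 2, 3]


def dJ_dw(w):
    """Derivative of J(w) = (1 / (2m)) * sum((w*x - y)^2) with respect to w."""
    m = len(x_train)
    return sum((w * x - y) * x for x, y in zip(x_train, y_train)) / m


alpha = 0.1
updates = {}
for w in [2.0, 0.0]:                  # one start on each side of the minimum
    slope = dJ_dw(w)
    updates[w] = w - alpha * slope    # gradient descent update
    print(f"w = {w}: slope = {slope:+.2f}, updated w = {updates[w]:.2f}")
```

Starting to the right of the minimum the slope is positive and `w` decreases; starting to the left the slope is negative and `w` increases. In both cases the update moves `w` toward the minimum.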

<span style="color:pink">How to choose an appropriate value for the learning rate?</span>

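The choice of the learning rate, `α`, will have a huge impact on the efficiency of your implementation of gradient descent.

If the learning rate is chosen poorly, gradient descent may not work at all. If `α` is too small, gradient descent still works but takes very many tiny steps; if `α` is too large, each step can overshoot the minimum, and the cost may grow instead of shrinking.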
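
A small experiment makes the role of the learning rate visible. The sketch below runs ten steps of gradient descent on the simplified model `f(x) = w*x` for three learning rates; the data set, the specific values of `α` and the function names are assumptions chosen for illustration.

```python
# Hypothetical training set on the line y = x, so the minimum of J(w) is at w = 1.
x_train = [1, 2, 3]
y_train = [1, 2, 3]


def dJ_dw(w):
    """Derivative of the squared error cost for f(x) = w * x."""
    m = len(x_train)
    return sum((w * x - y) * x for x, y in zip(x_train, y_train)) / m


def run_gradient_descent(alpha, w=0.0, steps=10):
    """Apply the update w = w - alpha * dJ/dw a fixed number of times."""
    for _ in range(steps):
        w = w - alpha * dJ_dw(w)
    return w


results = {alpha: run_gradient_descent(alpha) for alpha in [0.01, 0.2, 0.5]}
for alpha, w in results.items():
    print(f"alpha = {alpha}: w after 10 steps = {w:.4f}")
```

On this data the smallest rate is still far from `w = 1` after ten steps, the middle rate has essentially converged, and the largest rate overshoots further on every step and moves away from the minimum.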

<img src="images/gradient_descent_6.png" alt="" width="400" height="">

<span style="color:pink">What happens if you initialize gradient descent at a local minimum?</span>

<img src="images/gradient_descent_7.png" alt="" width="400" height="">

If you are already at a local minimum, gradient descent leaves `w` unchanged.

<span style="color:pink">Step size decreases as we approach the minimum</span>

<img src="images/gradient_descent_8.png" alt="" width="400" height="">

In cases where we do not initialize gradient descent at a local minimum, the steps naturally get smaller as they approach some local minimum.

This is because the slope of the tangent line approaches zero (it is not as steep) with each subsequent step; in other words, as we approach the minimum the derivative gets closer and closer to zero.
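
The shrinking steps can be observed directly by recording the size of each update. As before, this is a sketch on a made-up three-point data set with the simplified model `f(x) = w*x`, using a fixed learning rate chosen for illustration.

```python
# Hypothetical training set on the line y = x, so the minimum of J(w) is at w = 1.
x_train = [1, 2, 3]
y_train = [1, 2, 3]


def dJ_dw(w):
    """Derivative of the squared error cost for f(x) = w * x."""
    m = len(x_train)
    return sum((w * x - y) * x for x, y in zip(x_train, y_train)) / m


alpha = 0.1          # fixed learning rate: it never changes during the run
w = 0.0
step_sizes = []
for _ in range(6):
    step = alpha * dJ_dw(w)       # size of the step depends on the current slope
    step_sizes.append(abs(step))
    w = w - step

print([round(s, 4) for s in step_sizes])
```

Each recorded step is smaller than the one before it even though `α` stays fixed, because the derivative shrinks as `w` gets closer to the minimum.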

#### <span style="color:hotpink">Putting it all together...</span>

<img src="images/gradient_descent_9.png" alt="" width="400" height="">

<img src="images/gradient_descent_10.png" alt="" width="400" height="">

<img src="images/gradient_descent_11.png" alt="" width="400" height="">

<img src="images/gradient_descent_12.png" alt="surface plot with more than one local minimum" width="400" height="">

Gradient descent can lead to a local minimum rather than a global minimum.

The global minimum is the point that has the lowest possible value for the cost function `J`.

The surface plot above has more than one local minimum.

Depending on where you initialize the parameters, you can end up at a different local minimum.

<img src="images/gradient_descent_13.png" alt="" width="400" height="">

When using a squared error cost function with linear regression, the cost function will never have multiple local minima.

It has a single global minimum because it is a convex or bowl-shaped function.

One nice property of convex functions is that as long as you choose an appropriate learning rate, gradient descent will always converge to the global minimum.
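
Putting the pieces together, the following is a minimal sketch of gradient descent for linear regression with a squared error cost. The function name, the default learning rate, the number of iterations and the data (generated from `y = 2x + 1`) are all assumptions made for illustration, not values from the course.

```python
def gradient_descent(x, y, w=0.0, b=0.0, alpha=0.05, num_iters=5000):
    """Fit f(x) = w*x + b by gradient descent on the squared error cost."""
    m = len(x)
    for _ in range(num_iters):
        # Prediction errors for the current parameters.
        errors = [w * xi + b - yi for xi, yi in zip(x, y)]
        # Partial derivatives of J(w, b) with respect to w and b.
        dj_dw = sum(e * xi for e, xi in zip(errors, x)) / m
        dj_db = sum(errors) / m
        # Simultaneous update: both right-hand sides use the old w and b.
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b


# Hypothetical training set generated from y = 2x + 1.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(x_train, y_train)
print(round(w, 3), round(b, 3))
```

Because the squared error cost is convex, starting from `w = 0` and `b = 0` with a suitably small learning rate leads to the single global minimum, which for this data is the line that generated it.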