# Introduction to Deep Learning 
Scale Drives Deep Learning Progress

### Logistic Regression as a Neural Network

- An image is stored as a matrix of pixel values.
- The most common format stacks three such matrices (channels), one each for the Red, Green and Blue (RGB) intensities of the image.
- Each pixel is therefore a vector of 3 integers, each ranging from 0 to 255.
- Hence, a 64x64 image has 64x64x3 = 12288 features once all pixel values are unrolled into a single feature vector.
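
The unrolling step can be sketched in plain Python. The helper name `flatten_image` and the all-red test image are made up for illustration, and the pixel-by-pixel ordering is an assumption: any ordering works as long as it is the same for every image.

```python
def flatten_image(image):
    """Unroll a height x width x 3 nested list into one flat feature list."""
    return [value for row in image for pixel in row for value in pixel]

# Hypothetical 64x64 image in which every pixel is pure red (255, 0, 0).
image = [[[255, 0, 0] for _ in range(64)] for _ in range(64)]

features = flatten_image(image)
print(len(features))  # 12288
```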

### Notation
- X = [x<sup>(1)</sup>, x<sup>(2)</sup>, ..., x<sup>(m)</sup>]; a matrix of shape (n<sub>x</sub>, m), where n<sub>x</sub> is the number of features per example and m is the number of training examples. Each column of X holds the input features of one training example.
- Y = [y<sup>(1)</sup>, y<sup>(2)</sup>, ..., y<sup>(m)</sup>]; a row vector of shape (1, m). Y contains the labels of the training examples, in the same order as the columns of X.
- Hence, a training example is represented by (x<sup>(i)</sup>, y<sup>(i)</sup>).
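
As a rough illustration of these shapes, the sketch below stacks two toy examples as columns. The helper `stack_columns` and the data values are hypothetical. Nested lists stand in for matrices to keep the example dependency-free; an array library would normally be used instead.

```python
def stack_columns(examples):
    """Arrange feature vectors as columns: the result has n_x rows and m columns."""
    n_x = len(examples[0])
    return [[x[j] for x in examples] for j in range(n_x)]

# Toy data: m = 2 training examples, each with n_x = 3 features.
examples = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
labels = [1, 0]

X = stack_columns(examples)  # column i is x^(i)
Y = [labels]                 # a single row of labels

shape_X = (len(X), len(X[0]))  # (n_x, m) = (3, 2)
shape_Y = (len(Y), len(Y[0]))  # (1, m)   = (1, 2)
print(shape_X, shape_Y)
```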

### Logistic Regression
- Logistic regression is a binary classifier: it assigns the input to one of two classes, labeled 0 and 1.
- Its output $\hat{y}$ is a number between 0 and 1, interpreted as the probability that the input belongs to class 1.
- The output is computed by applying the sigmoid function to a linear function of the input.
- The sigmoid function is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
- Given an input X, the output of the logistic regression is calculated as:
$$\hat{y} = \sigma(w^T X + b)$$ 
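-> where w is a weight vector of shape (n<sub>x</sub>, 1) and b is a scalar bias. With X of shape (n<sub>x</sub>, m), the result is a row of m predictions, one per training example.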
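
A minimal sketch of this computation for a single example x follows. The function name `predict_proba` and the sample values are illustrative assumptions. The two-branch form of `sigmoid` is an implementation choice, not part of the definition: it avoids overflow in `math.exp` for large negative z.

```python
import math

def sigmoid(z):
    """Sigmoid of a scalar z, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    """Compute y_hat = sigmoid(w^T x + b) for one feature vector x."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical parameters and input with n_x = 2 features.
w = [0.5, -0.25]
b = 0.1
x = [1.0, 2.0]
y_hat = predict_proba(w, b, x)  # z = 0.5 - 0.5 + 0.1 = 0.1
print(y_hat)
```

Since z is slightly positive here, the predicted probability lands a little above 0.5.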

### Loss (Error) Function
- The loss function measures the error of the model's prediction $\hat{y}$ on a single training example with true label y; a smaller loss means a better prediction.
- The loss function is defined as:
$$L(\hat{y}, y) = - (y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}))$$
- If y = 1, the second term vanishes and the loss reduces to the expression below. It approaches 0 as $\hat{y}$ approaches 1, so we want $\hat{y}$ to be close to 1:
$$L(\hat{y}, y) = - \log(\hat{y})$$
- If y = 0, the first term vanishes and the loss reduces to the expression below. It approaches 0 as $\hat{y}$ approaches 0, so we want $\hat{y}$ to be close to 0:
$$L(\hat{y}, y) = - \log(1 - \hat{y})$$
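
The behaviour of the loss in both cases can be checked with a short sketch. The function name `loss` and the probabilities 0.9 and 0.1 are chosen only for illustration. The sketch assumes $\hat{y}$ lies strictly between 0 and 1, since the logarithm is undefined at 0.

```python
import math

def loss(y_hat, y):
    """Logistic loss for one example; assumes 0 < y_hat < 1 and y in {0, 1}."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# For y = 1, a confident correct prediction is penalized less than a confident wrong one.
good_when_positive = loss(0.9, 1)
bad_when_positive = loss(0.1, 1)

# For y = 0, the roles are reversed.
good_when_negative = loss(0.1, 0)
bad_when_negative = loss(0.9, 0)

print(good_when_positive < bad_when_positive)  # True
print(good_when_negative < bad_when_negative)  # True
```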

### Cost Function
- The cost function is the average of the loss function over all m training examples, so it measures how well the parameters w and b fit the training set as a whole.
- The cost function is defined as:
$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$
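
The averaging can be sketched as follows. The function names and the toy predictions are illustrative assumptions. The cost is computed here from predictions that are already available, whereas in practice they would come from the model's current w and b.

```python
import math

def loss(y_hat, y):
    """Logistic loss for one example; assumes 0 < y_hat < 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def cost(y_hats, ys):
    """Average the per-example loss over all m training examples."""
    m = len(ys)
    return sum(loss(y_hat, y) for y_hat, y in zip(y_hats, ys)) / m

# Hypothetical predictions for m = 4 examples.
labels = [1, 0, 1, 0]
uninformed = [0.5, 0.5, 0.5, 0.5]   # a model that always outputs 0.5
informed = [0.9, 0.2, 0.8, 0.1]     # a model that leans the right way

print(cost(uninformed, labels))  # equals log(2), about 0.693
print(cost(informed, labels) < cost(uninformed, labels))  # True
```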

### Gradient Descent
- Gradient descent is an optimization algorithm used to find the values of w and b that minimize the cost function J(w, b).
- The algorithm starts from some initial values of w and b (for logistic regression, zeros work well).
- It then iteratively moves w and b a small step in the direction that decreases the cost fastest, which is the negative gradient.
- The algorithm stops when the cost no longer decreases noticeably, or after a fixed number of iterations. Because the logistic regression cost is convex, it has no local minima other than the global one.
- Each iteration applies the update:
$$w = w - \alpha \frac{\partial J(w, b)}{\partial w}$$
$$b = b - \alpha \frac{\partial J(w, b)}{\partial b}$$
-> where $\alpha$ is the learning rate, which controls the size of each step.
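
The update rule can be sketched end to end in plain Python. The notes do not spell out the partial derivatives, so the sketch uses the standard gradients for the logistic cost, $\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_i (\hat{y}^{(i)} - y^{(i)}) x_j^{(i)}$ and $\frac{\partial J}{\partial b} = \frac{1}{m} \sum_i (\hat{y}^{(i)} - y^{(i)})$. The toy data, the zero initialization, the learning rate of 0.1 and the fixed count of 1000 iterations are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Sigmoid of a scalar z, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, b, x):
    """y_hat = sigmoid(w^T x + b) for one feature vector x."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)

def cost(w, b, examples, labels):
    """Average logistic loss J(w, b) over the training set."""
    total = 0.0
    for x, y in zip(examples, labels):
        y_hat = predict(w, b, x)
        total += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return total / len(labels)

def gradient_descent(examples, labels, alpha=0.1, iterations=1000):
    """Run a fixed number of gradient descent updates starting from zeros."""
    m = len(examples)
    n_x = len(examples[0])
    w = [0.0] * n_x
    b = 0.0
    for _ in range(iterations):
        dw = [0.0] * n_x
        db = 0.0
        # Accumulate the gradients of J with respect to w and b.
        for x, y in zip(examples, labels):
            error = predict(w, b, x) - y
            for j in range(n_x):
                dw[j] += error * x[j] / m
            db += error / m
        # Apply the update rule: w = w - alpha * dJ/dw, b = b - alpha * dJ/db.
        w = [w_j - alpha * dw_j for w_j, dw_j in zip(w, dw)]
        b = b - alpha * db
    return w, b

# Toy one-feature data set: small inputs are class 0, large inputs are class 1.
examples = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]

initial_cost = cost([0.0], 0.0, examples, labels)
w, b = gradient_descent(examples, labels)
final_cost = cost(w, b, examples, labels)
print(initial_cost, final_cost)
```

With w and b initialized to zero every prediction is 0.5, so the initial cost is log(2). The cost after training should be lower, and a larger or smaller `alpha` changes how quickly it falls.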