# Supervised Learning
Algorithms are trained using labeled examples: inputs for which the desired output is already known (the labels are usually produced by data annotators)
- for example, a segment of text could have a category label such as:
    - Spam Email vs. Legitimate Email
    - Positive vs. Negative Movie Review

The model learns from historical data in order to predict future events

In chronological order,  it goes...

![33.PNG](attachment:33.PNG)

### 1. Data Acquisition

Data can come from anywhere: customers, physical sensors, online databases, etc.

### 2. Data Cleaning

We need to clean and format the data so that it can actually be used by a neural network. This is usually done with the pandas library.

### 3. Split the data into: A) Training Data... OR... B) Test Data

Set aside some portion of the data as test data, and keep the large majority as training data. We then fit the network/model to the training set.
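As a minimal sketch of the split described above, the following uses only the Python standard library. The function name `split_train_test`, the 20% test fraction, and the sample data are illustrative assumptions, not part of the original notes.

```python
import random

def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and split them into (train, test) lists."""
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled data: (text, label) pairs
data = [(f"email {i}", "spam" if i % 3 == 0 else "legit") for i in range(10)]
train_set, test_set = split_train_test(data, test_fraction=0.2)
print(len(train_set), len(test_set))  # 8 2
```

Shuffling before splitting avoids a test set that only contains the last rows of an ordered dataset.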

What is the difference between Training Data and Test Data? 

#### Training Dataset (Training Set)
This is the subset of data used to train the model. It consists of input data paired with corresponding ground truth labels or target values. In supervised learning, "ground truth" refers to the true labels or target values associated with the input data. These true labels represent the correct answers or outcomes that the model aims to predict during training and evaluation.

#### Testing Dataset (Test Set)
This subset of the data is used to evaluate the performance of the trained model. We are "testing" the model to see if it knows the correct answer. Like the training dataset, it consists of input data paired with corresponding ground truth labels or target values. However, this data is withheld from the model during training, and the model's predictions on it are compared against the true labels to assess its performance.

"Ground Truth" refers to the true labels or target values associated with both training and testing data. The images themselves are not the ground truth; rather, they are the input data for which we want the model to make predictions, which are then compared to the true labels to evaluate the model's performance.

### 4. Model Testing/Evaluation

Next we want to know how the model actually performed. Because we know the correct label for every test example, we can run the test features through the model, get its predictions, and compare them to the right answers. Based on this evaluation, we may go back and adjust the model's hyperparameters (this might mean adding more layers or more neurons) to get a better fit.

- When testing the model against the test data, you get some performance metric. Is it fair to use the accuracy on the test data as the model's final performance metric? Technically, you were able to update the model again and again after seeing the evaluation results on the test set. How do we fix this?
    - In NNs and deep learning, the data is split into 3 sets: 1. Training data 2. Validation data 3. Test data
        1. Training data is used to train the model parameters
            - the model looks at the features and the correct outputs, and is fit to the training data
        2. Validation data is used to decide which model hyperparameters to adjust
            - after training on the training data, we check performance on the validation data; based on that performance, we may go back and adjust the model (e.g. more neurons, more layers, or a different architecture)
            - repeat this process until we are satisfied with the model's performance on the validation data
        3. Test data is used to get a final performance metric
            - use this third dataset, which the model has never seen before, to get the final performance metric
            - key: the metric measured on the test data is the performance you expect from the model in the real world, since you do not go back and adjust any weights or hyperparameters afterwards
            - once you evaluate on the final test dataset, you may not go back and adjust the model to improve its performance on that dataset
            - this final measure is what we refer to as the true performance of the model
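The three-way workflow above can be sketched with a toy model. This example assumes a 1-D k-nearest-neighbors classifier whose hyperparameter is `k`; the data, the candidate values of `k`, and all function names are made up for illustration.

```python
def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, dataset, k):
    """Fraction of examples in `dataset` that the k-NN model labels correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in dataset)
    return correct / len(dataset)

# Hypothetical (feature, label) pairs; the training set contains one noisy point (3, 1)
train = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 0), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
validation = [(2.6, 0), (3.4, 0), (5.5, 1), (6.5, 1)]
test = [(1.5, 0), (4.5, 0), (7.5, 1), (8.5, 1)]

# 1. "Training" for k-NN is just storing the training set.
# 2. Use the validation set to choose the hyperparameter k.
best_k = max([1, 3, 5], key=lambda k: accuracy(train, validation, k))

# 3. Touch the test set exactly once, after all tuning is finished.
final_accuracy = accuracy(train, test, best_k)
print(best_k, final_accuracy)
```

The test accuracy can come out lower than the validation accuracy, which is the point: the validation score was used for tuning, so only the test score is an unbiased estimate.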

If we can, we will iterate on the model and improve it in this step

### 5. Model Deployment

When we are satisfied with this, we can deploy the model into the real world. 



## Neural Networks
For NNs, the network receives a set of inputs along with the corresponding correct outputs, and the algorithm learns by comparing its actual output with the correct outputs to find errors (this is basic pattern matching). It then modifies the model accordingly (such as adjusting the weights and bias values).
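To make the "compare output with correct output, then adjust weights and bias" loop concrete, here is a minimal sketch with a single neuron (one weight, one bias, no activation function) trained by gradient descent on squared error. The data, learning rate, and epoch count are assumptions chosen for illustration; real networks have many neurons and use a framework.

```python
# One neuron: prediction = w * x + b
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [3.0, 5.0, 7.0, 9.0]   # hypothetical data generated by y = 2x + 1

w, b = 0.0, 0.0
learning_rate = 0.05

for epoch in range(2000):
    for x, y_true in zip(inputs, targets):
        y_pred = w * x + b            # actual output
        error = y_pred - y_true       # compare with the correct output
        # Gradient of 0.5 * error**2 with respect to w and b
        w -= learning_rate * error * x
        b -= learning_rate * error

print(round(w, 2), round(b, 2))  # 2.0 1.0
```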

## Overfitting
Overfitting is when the model fits too closely to the noise in the data. This often results in low error on training sets but high error on test/validation sets.

- A good model looks like an average amongst different data points
- An overfitted model looks like it spikes beyond the data points or has drastic ups and downs


## Underfitting
Underfitting is when the model does not capture the underlying trend of the data and does not fit the data well enough

- There is low variance but high bias
- Underfitting is often a result of an excessively simple model
- What that looks like is a scatterplot with a straight line going through the data (the line doesn't really represent the whole dataset)


## Overfitting & Underfitting over multi-dimensional datasets

With a good model, the error decreases with more training time and more epochs
- An epoch is one full pass of the model over all the training examples
- Each training session has multiple epochs
- A "batch" is a subset of the dataset; processing one batch (one parameter update) is called an "iteration"
- Each epoch consists of several batches
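The relationship between epochs, batches, and iterations can be sketched as a pair of nested loops. This assumes the common convention that one iteration means processing one batch; the dataset size, batch size, and epoch count are arbitrary.

```python
import math

dataset = list(range(10))   # 10 hypothetical training examples
batch_size = 4
num_epochs = 2

# 10 examples in batches of 4 gives 3 batches per epoch: 4 + 4 + 2
iterations_per_epoch = math.ceil(len(dataset) / batch_size)

iteration = 0
for epoch in range(num_epochs):
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        iteration += 1   # one parameter update would happen here
        print(f"epoch {epoch + 1}, iteration {iteration}, batch {batch}")
```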


## Relationship between overfitting & underfitting and model performance with training vs. test set

When we think about overfitting and underfitting, we want to think about the relationship of the model performance on the training set vs. the test/validation set.

Ideally, the error on the training set decreases over more epochs, and the error on the test set decreases with similar behavior, so the model performs well on both.

But what happens if we overfit on the training data? The model will perform poorly on new test data. The training error keeps decreasing, while the test error starts to increase again after some point. The point where the test error begins to rise is where we should consider cutting off training.
- it's a good indicator that we spent too much time training on the training set
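A minimal sketch of picking that cut-off point, assuming we recorded one error value per epoch for each set. The numbers below are invented to show the typical shape: training error keeps falling while validation error bottoms out and then rises.

```python
# Hypothetical error curves, one value per epoch
train_error = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18, 0.15]
val_error = [0.95, 0.70, 0.55, 0.48, 0.46, 0.49, 0.55, 0.62]

def best_epoch(errors):
    """Return the 1-based epoch with the lowest error."""
    return min(range(len(errors)), key=lambda i: errors[i]) + 1

stop_at = best_epoch(val_error)
print(stop_at)  # 5
```

Training past epoch 5 here only improves the training error, which is the overfitting signature described above.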

# Classification Model Performance Evaluation Metrics

Usually in classification, the model can only achieve 2 results: (1) Model was correct (2) Or its prediction was incorrect

For supervised learning problems, we will first fit/train a model on training data (that means we will have pre-labeled image data), then test that model on testing data (data that the model has never seen before, also known as new images). 

We'll show new images to the model. Then, we'll see the model's prediction and compare the model's prediction from the test data (also known as "x_test" data) with the correct answers (true y-values) we already know. In the real world, a single metric will not tell the complete story. 

5 Key Classification Metrics:
- 1. Accuracy
    - proportion of correctly classified instances out of the total number of instances
        - the number of model's correct predictions divided by the total number of predictions
            - ex: 80 correct images out of 100 total predictions... 80% accuracy
        - accuracy is most useful when target classes are well balanced
            - ex: we would have roughly the same amount of cat images as we have dog images throughout dataset
        - accuracy is NOT a good metric to use when we have unbalanced classes
            - thought experiment: if we had 99 images of dogs and 1 image of a cat, if our model wasn't machine learning, and only printed a line that it was a dog, it would have a 99% accuracy (which is not accurate at all)
- 2. Recall
    - ability of a model to find all relevant cases within a dataset
        - it is the number of true positives divided by the number of true positives PLUS the number of false negatives
    - proportion of true positive predictions out of all actual positive instances in the dataset
    - good for an unbalanced dataset
- 3. Precision
    - ability of a classification model to identify only the relevant data points
        - number of true positives divided by the number of true positives plus the number of false positives
    - proportion of true positive predictions out of all positive predictions made by the model
    - good for an unbalanced dataset
    
    
    
Tradeoffs between Recall and Precision
- recall expresses the ability to find all relevant instances in a dataset
- precision expresses the proportion of the data points our model said were relevant that actually were relevant
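The definitions of accuracy, precision, and recall can be written directly in terms of the four counts (true/false positives and negatives). The sketch below replays the 99-dogs-and-1-cat thought experiment, treating "cat" as the positive class. Returning 0.0 when a ratio is undefined is an assumption made here for convenience.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """True positives divided by everything predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """True positives divided by everything that is actually positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# 99 dogs, 1 cat, and a "model" that always answers "dog" (cat = positive class)
tp, tn, fp, fn = 0, 99, 0, 1
print(accuracy(tp, tn, fp, fn))  # 0.99
print(recall(tp, fn))            # 0.0
print(precision(tp, fp))         # 0.0
```

The 99% accuracy hides the fact that the model never finds the one cat, which recall exposes immediately.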
    
    
    
- 4. F1-Score
    - the harmonic mean of precision and recall, providing a balanced measure between the two
        - combination of recall and precision
        - F1 score = 2((precision * recall)/(precision + recall))
    - we use harmonic mean because we want to punish extreme values (fair assessment of tradeoffs of precision vs. recall)
        - ex: a classifier with a precision of 1.0 and a recall of 0.0 has a simple average of 0.5, but an F1 score of 0
- 5. Confusion Matrix 
    - table that summarizes the counts of true positive, false positive, true negative, and false negative predictions
    - organizes the actual values against the predicted values, showing where the model gets confused
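The F1 formula from the list above can be checked with a short function. This is a sketch of the formula only; returning 0.0 when precision and recall are both zero is a convention assumed here to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

print(f1_score(1.0, 0.0))  # 0.0, while the simple average would be 0.5
print(f1_score(0.5, 0.5))  # 0.5
```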

    


![22222.PNG](attachment:22222.PNG)

- True Positive = somebody has a disease, and model predicts their disease
- True Negative = somebody not having a disease, and model predicts no disease
- False Positive (Type 1 Error) = somebody has NO disease, and model predicts that they do have it
- False Negative (Type 2 Error) = somebody has disease, and the model predicts they do NOT have it
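The four cells of the confusion matrix can be counted by comparing true labels with predicted labels one pair at a time. In this sketch, 1 means "has the disease" and 0 means "no disease"; the label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary classification problem."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical ground truth
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]   # hypothetical model predictions
print(confusion_counts(y_true, y_pred))
# {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```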


Key takeaway: the confusion matrix and the various calculated metrics are all fundamentally ways of comparing the predicted values against the true values. What constitutes "good" metrics will depend on the situation.

There is no universal rating for whether an accuracy is "good" enough. It'll depend on the context. Do we have balanced or unbalanced classes? 

These models are typically used as a quick diagnostic test before a more invasive test (we should consider what is at stake).
   - ex: getting a urine test before getting a straight up biopsy
   
Should our models focus on fixing false positives or false negatives? Decreasing false negatives usually comes at the cost of increasing false positives.
   - ex: in disease diagnosis, it's probably better to go in the direction of false positives (increasing them) than false negatives so we make sure we correctly classify as many cases of disease as possible 
       - it's probably better for somebody to be falsely diagnosed with a disease and have them adopt healthier living habits or move them to the next step of additional testing than to not warn them about a disease they already have
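One common way this tradeoff shows up is through the decision threshold applied to a model's score. The sketch below assumes the model outputs a score between 0 and 1 and that we predict "disease" when the score is at or above the threshold; the scores and labels are invented.

```python
# Hypothetical (model score, has disease) pairs; higher score = more likely sick
scored = [(0.95, 1), (0.80, 1), (0.65, 0), (0.55, 1),
          (0.40, 0), (0.35, 1), (0.20, 0), (0.10, 0)]

def errors_at(threshold):
    """Return (false positives, false negatives) for a given threshold."""
    fp = sum(1 for score, truth in scored if score >= threshold and truth == 0)
    fn = sum(1 for score, truth in scored if score < threshold and truth == 1)
    return fp, fn

for threshold in (0.7, 0.5, 0.3):
    fp, fn = errors_at(threshold)
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
# threshold=0.7: FP=0, FN=2
# threshold=0.5: FP=1, FN=1
# threshold=0.3: FP=2, FN=0
```

Lowering the threshold catches every sick patient in this toy data, at the price of two healthy people being sent for further testing.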
       

CONCLUSION: There is no universal truth for these evaluation metrics, and these models are not evaluated in a "vacuum". It depends a lot on context and domain.

### Dog Classification Example:

We take a picture of a dog from the x_test set (the set that the model has not seen yet) and the corresponding correct label from the test labels (y_test). We feed the test image into the model to see whether it predicts the correct label. Then we compare the label that the model predicts with the known correct label and see if it's correct.
   - FYI: "x" in "x_test" set just means features, and the image itself is a feature
   - If the model predicts "DOG" and the correct corresponding label is "DOG", it asks does "DOG == DOG"? If yes, prediction is correct
   
Then we repeat this process for all x_test images in the data (some are dogs, some are cats), and we see at the end, what the count of the correct matches and incorrect matches are. 
   - *The key realization we need to make is that in the real world, NOT ALL INCORRECT OR CORRECT MATCHES HOLD EQUAL VALUE* 
    

# What is BatchNorm?

BatchNorm is not directly related to supervised learning itself, but it is a technique used during the training phase of supervised learning models in deep neural networks. BatchNorm is a normalization technique applied to the activations of neurons within each layer of the network. It helps stabilize and accelerate training by normalizing the activations so that they have consistent statistical properties across different mini-batches of data. This mitigates issues like vanishing and exploding gradients, and thereby improves the convergence and generalization performance of the model.
- While it's not a component of supervised learning, it is a technique used to enhance training of supervised learning models


In deep learning, when we train neural networks, we're adjusting the parameters (i.e. weights and biases) to make better predictions. During training, the activations within the layers of the network can sometimes become too large or too small. This can slow down the training process or even lead to the model not learning well.

   - Activations become too large or too small for several reasons:
       - the nature of the optimization process during training (as the network learns, activations within each layer can exhibit large variations in magnitude, which slows down training or even leads to numerical instability)
       - as the network parameters are updated during training, the distributions of the activations may shift, forcing later layers to keep adapting to these changes
       
BatchNorm helps mitigate these issues by normalizing the activations within each layer, ensuring that they have similar statistical properties regardless of variations in the input data or changes in parameters. Batch normalization ensures that the values within each layer of the neural network are balanced and consistent. It does this by:
1. Calculating mean and variance 
    - for each mini-batch of data passed through the network during training, BatchNorm calculates the mean and variance of the activations (outputs) of the neurons in that mini-batch
2. Normalize Activations
    - then, it normalizes the activations by subtracting the mean and dividing by the square root of the variance; this centers the activations around zero and scales them to have unit variance
3. Scale and Shift
    - then, it introduces two learnable parameters per neuron: a scale parameter (gamma) and a shift parameter (beta)
    - these parameters allow the network to adaptively scale and shift the normalized activations to better suit the learning process
4. Update Parameters
    - during training, BatchNorm learns the optimal values for gamma and beta along with the other parameters of the network through backpropagation and gradient descent
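The first three steps can be sketched for a single neuron in plain Python. This is an illustrative forward pass only: `gamma` and `beta` are fixed numbers here rather than learned, and the small constant `eps` (added inside the square root for numerical stability) uses an assumed value of 1e-5.

```python
import math

def batch_norm(activations, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one neuron's activations across a mini-batch."""
    m = len(activations)
    # Step 1: mean and variance over the mini-batch
    mean = sum(activations) / m
    variance = sum((x - mean) ** 2 for x in activations) / m
    # Step 2: normalize to roughly zero mean and unit variance
    normalized = [(x - mean) / math.sqrt(variance + eps) for x in activations]
    # Step 3: scale and shift
    return [gamma * x_hat + beta for x_hat in normalized]

batch = [2.0, 4.0, 6.0, 8.0]   # hypothetical activations of one neuron
out = batch_norm(batch)
print([round(v, 3) for v in out])
```

With `gamma=1.0` and `beta=0.0` the output is just the normalized activations; during real training, step 4 would update both values by gradient descent.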

### Neuron Activation

Neuron activation occurs as follows:

1. Forward Pass
    - during the forward pass of training, input data in the form of a mini-batch flows through the NN
    - as the data passes through each layer, various math operations (like matrix multiplication, bias addition, activation functions) are applied to produce the activations of the neurons in that layer

2. Calculation of mean and variance
    - BatchNorm calculates the mean and variance within the mini-batch
    - This means that for each neuron in the layer, BatchNorm computes the average activation across all the data samples in the mini-batch, as well as the variance of these activations

3. Normalization of Activations
    - BatchNorm normalizes the activations of each neuron within the layer
    - this involves subtracting the mean calculated earlier from each activation and dividing by the square root of the variance, ensuring that the activations are centered around zero and have a similar scale

4. Scaling and shifting
    - after normalization, BatchNorm scales and shifts the normalized activations using learnable parameters (gamma and beta)
    - these parameters allow the network to adaptively adjust the scale and offset of the activations providing additional flexibility during training

5. Forward propagation
    - the scaled and shifted activations are passed to the next layer of the network for further processing

Summary: 
Neuron activation happens through a series of operations that help ensure that the activations within each layer of the network are well-conditioned and conducive to effective training. BatchNorm can be used in conjunction with quantization and pruning: it complements those techniques by stabilizing activations and facilitating training in scenarios where model size or computational complexity needs to be reduced.

### Math behind BatchNorm

Given a mini-batch of activations where X = {x1,x2, ... xm} from a layer where "m" is batch size, the steps of batch normalization can be summarized mathematically as follows:

    Tip: Here's a legend of what each symbol means:
    
![legend.PNG](attachment:legend.PNG)


Step 1: Calculating the Mean and Variance:

![1.PNG](attachment:1.PNG)

- X = {x1,x2, ... xm} represents a mini-batch of activations from a layer, where m is the batch size. x sub i represents the activation of a given neuron for the i-th example in the mini-batch.

- Mu is the mean of the activations in the mini-batch, it's calculated by summing up all the activations and dividing by the batch size m. This tells us roughly where the average is in the group
    - this "group" refers to a batch of data inputs that pass through a layer of NNs during training, usually containing multiple data points, each with a set of features. 
    - "group" in this context is referring to the activations of neurons across a batch of input data, not the individual neurons themselves

- Sigma squared is the variance of the activations in the mini-batch. It measures the spread or dispersion of the activations from the mean Mu. 


In summary: Mean is the average of all the numbers, variance is how spread out those numbers are from the average. 

Step 2: Normalize Activations:

![2.PNG](attachment:2.PNG)

- hat-x sub i represents the normalized activation for the i-th example in the mini-batch. It's obtained by subtracting the mean Mu from x sub i and dividing by the square root of the variance Sigma squared plus a small constant Epsilon for numerical stability.
- Epsilon is just a tiny number that we add for stability. It's like a safety measure to make sure we don't accidentally divide by zero.

In summary: normalizing the activations means putting them all on the same scale, so that across the mini-batch they have a mean of about zero and a variance of about one

Step 3: Scale and Shift: 

This step involves the scaled activation, the scale parameter, and the shift parameter
- after making sure the activations are all on the same scale, we let the network stretch or shrink them again; this is called "scaling"
- we may also want to move the activations up or down by an offset; this is called "shifting"

![3.PNG](attachment:3.PNG)

- y sub i represents the output of the batch normalization layer for the i-th example in the mini-batch. It's obtained by scaling (Gamma times hat-x sub i) and shifting (+ Beta) the normalized activation hat-x sub i. 
    - this is scaled activation
- Gamma is a learnable parameter called the "scale" parameter. It's used to scale the normalized activations, allowing the network to learn the optimal scale for each neuron
    - this is scaling parameter, also known as how much we want to stretch or shrink the activations
- Beta is another learnable parameter called the "shift" parameter. It's used to shift the scaled activations, enabling the network to learn the optimal offset for each neuron
    - this is shift parameter, telling us how much left or right we want to move the activations
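For reference, the three steps described in words above correspond to the standard batch normalization equations, written out here in LaTeX. This is a transcription of the usual formulation using the same symbols as the descriptions (m, x sub i, Mu, Sigma squared, Epsilon, Gamma, Beta); it is assumed to match the equations in the attached images.

```latex
\begin{align}
\mu &= \frac{1}{m}\sum_{i=1}^{m} x_i
  && \text{(Step 1: mean)} \\
\sigma^2 &= \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2
  && \text{(Step 1: variance)} \\
\hat{x}_i &= \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
  && \text{(Step 2: normalize)} \\
y_i &= \gamma\,\hat{x}_i + \beta
  && \text{(Step 3: scale and shift)}
\end{align}
```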

### Tradeoffs for BatchNorm

1. Increased Memory Usage:
    - it requires storing additional parameters (mean and variance) for each layer of the NN during training, increasing memory footprint of the model, especially for large networks with many layers

2. Computational Overhead:
    - during training, it involves computations for calculating means and variances, normalizing activations, and updating the scale and shift parameters, resulting in increased computational overhead, especially when the mini-batch sizes are large or when deploying models on resource-constrained devices

3. Dependency on Mini-batch Statistics: 
    - during inference, it operates differently than during training: it uses population statistics (e.g. running averages of the mean and variance) instead of batch statistics
        - this introduces additional computational & memory overhead during inference, which may impact inference latency and resource utilization in real-time or latency-sensitive applications

4. Sensitivity to Learning Rate and Initialization:
    - the additional scale and shift parameters (gamma and beta) need to be learned during training, and BatchNorm's effectiveness is sensitive to the choice of learning rate and the initialization strategy for these parameters
        - suboptimal choices may lead to slower convergence or training instability

5. Limited Benefit in Some Scenarios:
    - in certain scenarios, such as transfer learning with small datasets, BatchNorm may not offer significant performance improvements
        - in such cases, computational and memory overhead may outweigh its benefits
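The difference between training and inference (batch statistics versus running averages) can be sketched for a single neuron as below. The class name `RunningBatchNorm` is hypothetical, gamma and beta are left out for brevity, and updating the running averages with an exponential moving average controlled by `momentum` is an assumption modeled on how deep learning frameworks commonly do it.

```python
import math

class RunningBatchNorm:
    """Single-neuron BatchNorm that tracks running statistics for inference."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def train_step(self, batch):
        """Normalize with the batch's own statistics and update the running averages."""
        m = len(batch)
        mean = sum(batch) / m
        var = sum((x - mean) ** 2 for x in batch) / m
        # Exponential moving average of the batch statistics
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]

    def infer(self, x):
        """Normalize a single activation with the stored running statistics."""
        return (x - self.running_mean) / math.sqrt(self.running_var + self.eps)

bn = RunningBatchNorm()
for _ in range(200):
    bn.train_step([2.0, 4.0, 6.0, 8.0])   # hypothetical mini-batch, repeated

print(round(bn.running_mean, 3), round(bn.running_var, 3))  # 5.0 5.0
print(round(bn.infer(5.0), 3))                              # 0.0
```

Because `infer` does not need a batch, it works on a single input, which is why inference relies on the stored statistics.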


### End result: 
BatchNorm ensures that the activations in each layer of the neural network have similar statistical properties, regardless of the variations in input data. It helps stabilize the training process, speed up convergence, and lead to better generalization performance of the model, helping NN learn more efficiently and effectively by ensuring that values within each layer stay within a desirable range throughout training. 

# Regression Evaluation Metrics

This section covers evaluating the performance of regression models

Regression is a task where a model attempts to predict continuous values (unlike predicting categorical values, which is classification)
- ex: predicting the price of a house given its features

### The 3 most common evaluation metrics for regression: 
1. Mean Absolute Error (MAE)
2. Mean Squared Error (MSE)
3. Root Mean Squared Error (RMSE)

## Mean Absolute Error
- this is the mean of the absolute value of error

![MAE.PNG](attachment:MAE.PNG)

Compare predictions (a continuous value) to the true y-label
- ex: compare prediction of house price to actual true house price

Calculated: 
(True price) minus (the predicted price (y-hat)), and take the absolute value of that, and then average that out for all your predictions
- reason why we take absolute value is because prediction could be over or under the real value of the house
- since we want to average this out across all predictions, we take the absolute value
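The calculation described above is short enough to write out directly. The house prices below are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true value - predicted value| over all predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

true_prices = [200_000, 350_000, 500_000]   # hypothetical true house prices
predicted = [210_000, 340_000, 530_000]     # hypothetical model predictions
print(round(mean_absolute_error(true_prices, predicted), 2))  # 16666.67
```

Without the absolute value, the +10,000 and -10,000 errors would cancel each other out and make the model look better than it is.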

Cons: 
- MAE does not punish large errors more heavily than small ones
- Anscombe's quartet shows why a single averaged number can be misleading
    - the four datasets have a wide variety of scatter points, but the line of best fit is the same for all of them

![ambscombes%20quartet.PNG](attachment:ambscombes%20quartet.PNG)

In the (x3, y3) plot, we see that one point is a huge outlier 

We want our error metrics to account for these outliers

## Mean Squared Error

- this is the mean of the squared errors
- larger errors are penalized more heavily than with MAE, which makes MSE more popular

![MSE.PNG](attachment:MSE.PNG)

we take the difference between the true value and the predicted value, and then square it
- when we take the square, larger errors count for more than they do with MAE
- we punish the model for outlier situations that it's not fitting to
    - and we don't have to take the absolute value because a squared number is never negative
    
    
Cons: 
- squaring the errors also squares the units themselves 
    - ex: predicting price of house, then our error metric would be in units of dollars squared
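To see how squaring changes the picture, the sketch below compares MAE and MSE on two hypothetical sets of predictions that have the same total absolute error: one spreads the error evenly, the other concentrates it in a single outlier.

```python
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average of (true value - predicted value) squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10, 10, 10, 10]
even_errors = [12, 12, 12, 12]   # every prediction is off by 2
one_outlier = [10, 10, 10, 18]   # one prediction is off by 8

print(mean_absolute_error(y_true, even_errors), mean_absolute_error(y_true, one_outlier))  # 2.0 2.0
print(mean_squared_error(y_true, even_errors), mean_squared_error(y_true, one_outlier))    # 4.0 16.0
```

MAE cannot tell the two cases apart, while MSE is four times larger for the predictions with the outlier.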

## Root Mean Squared Error

- most popular error metric
- this is the root of the mean of squared errors (maintains the same unit measurements, no foot^2 or dollars^2 that you would find in MSE)

![RMSE.PNG](attachment:RMSE.PNG)

at the end, take the square root of your mean squared error

it still punishes larger error values, and it is in the same units as y

error numbers require context... an RMSE of $10 is fantastic when predicting the price of a house, but terrible if it's a candy bar

we should always compare error metric to the average value of the label in your dataset to try to get an intuition of its overall performance
- domain knowledge plays an important role
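The house versus candy bar comparison can be made concrete by dividing the RMSE by the average label value. The helper `relative_rmse` and all the prices below are hypothetical; the ratio is just one simple way to build the intuition described above.

```python
import math

def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean of squared errors (same units as y)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

def relative_rmse(y_true, y_pred):
    """RMSE divided by the average true label (hypothetical helper)."""
    mean_label = sum(y_true) / len(y_true)
    return root_mean_squared_error(y_true, y_pred) / mean_label

house_true, house_pred = [300_000.0, 400_000.0], [300_010.0, 399_990.0]
candy_true, candy_pred = [2.0, 2.0], [12.0, 12.0]

print(root_mean_squared_error(house_true, house_pred))  # 10.0
print(root_mean_squared_error(candy_true, candy_pred))  # 10.0
print(relative_rmse(house_true, house_pred) < 0.001)    # True: tiny relative to house prices
print(relative_rmse(candy_true, candy_pred))            # 5.0: five times the average price
```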

# Machine Learning with Python

using Scikit Learn to perform ML with Python

- Training Set = the dataset that we train the model on (the portion of the original dataset used for fitting)
- Test Set = dataset that the model has never seen before and will test if the model is actually making correct predictions

### Scikit Learn process

every algorithm is exposed in scikit-learn via an "Estimator" object

## step 1: import the model, general form looks like...

from (insert sklearn.family) import (insert Model)

for example:

`from sklearn.linear_model import LinearRegression`
- "linear_model" is the family of models
- "LinearRegression" is the model, also known as the "Estimator" object

## step 2: instantiate the model/estimator

*note: all the parameters of an estimator can be set when it is instantiated, and they have suitable default values (in a Jupyter notebook, you can use shift+tab to check the possible parameters)*

for example:

`model = LinearRegression(normalize=True)`

`print(model)`

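printing the estimator shows its parameters (in the output below, "copy_X" and "fit_intercept" are at their default of "True", while "normalize" is "True" because we set it; note that "normalize" only exists in older scikit-learn versions and was removed in version 1.2)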

`LinearRegression(copy_X=True, fit_intercept=True, normalize=True)`

these are parameters to tune the model to be more specific

## step 3: fit model to some data

once we have created the model with parameters, we fit the model to some data

we should split this data into a training set and test set first! 

for example (how we can do this): 

![111.PNG](attachment:111.PNG)



- import numpy as np to create data
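- `from sklearn.model_selection import train_test_split` imports the function that splits a dataset into two (older scikit-learn versions used `sklearn.cross_validation`, which was removed in version 0.20)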
- we have the sets of data, x & y, where "x" is the features, and "y" is the label for the feature rows


![214214.PNG](attachment:214214.PNG)

- using "train_test_split", pass in "x" and "y", and pass in the "test_size"
- using "train_test_split" on the features and labels, Scikit Learn will automatically output a training set and a testing set, and we will have "x_train", "y_train", "x_test", and "y_test"
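Putting steps 1 to 3 together, a minimal end-to-end sketch might look like the following. It assumes a recent scikit-learn (where `train_test_split` lives in `sklearn.model_selection`) and uses made-up data; the array shapes, the `test_size=0.3` value, and the `random_state` are illustrative choices, not necessarily what the screenshots above show.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with 2 features each, and one label per row
X = np.arange(20).reshape(10, 2)
y = X[:, 0] * 2.0 + 1.0   # made-up linear relationship between features and label

# Split features and labels into training and test sets (30% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression()            # step 2: instantiate the estimator
model.fit(X_train, y_train)           # step 3: fit on the training data only
predictions = model.predict(X_test)   # predict on data the model has never seen

print(X_train.shape, X_test.shape)    # (7, 2) (3, 2)
```

Comparing `predictions` against `y_test` with one of the regression metrics above is then how the model would be evaluated.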
