<a href="https://colab.research.google.com/github/dominiksakic/deeplearning00/blob/main/FirstPrinciples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# First Principles Machine Learning Notebook

## Goal
1. To create a notebook that sums up my understanding of the topic.
2. To gain a deeper understanding while creating this notebook.
3. To get familiar with notebooks and what I can do in them.

### What is Machine learning?

Machine learning models are trained to learn a mapping from input features to output targets. This is done by adjusting the model's parameters to minimize the difference between its predictions and the actual target values in the training data. This process is called optimization.
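To make "adjusting the parameters to minimize the difference" concrete, here is a minimal sketch of optimization with plain gradient descent. The toy data, the linear model `w*x + b`, the mean squared error loss, and the values of `steps` and `lr` are all assumptions chosen for illustration.

```python
# Toy data generated from y = 2x + 1 (values are made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def mse(w, b):
    """Mean squared error between predictions w*x + b and the targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(steps=2000, lr=0.05):
    """Fit the parameters w and b with plain gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the MSE with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move the parameters a small step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = train()
print(f"w={w:.3f}, b={b:.3f}, loss={mse(w, b):.6f}")
```

The parameters start at 0 and end up close to the values that generated the data (w near 2, b near 1), because each step reduces the loss a little.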

#### A model can be in one of three states:
1. **Underfit**: An underfit model is too simple to capture the underlying patterns in the data. It performs poorly on both the training data and unseen data. This often happens because the model hasn't been trained for long enough or isn't complex enough.
2. **Well-generalized (Good Fit)**: A well-generalized model (or a model with a "good fit") learns the underlying patterns in the data without memorizing the noise. It performs well on the training data and, crucially, also performs well on unseen data. This ability to perform well on unseen data is called generalization, which is the core goal of machine learning.
3. **Overfit**: An overfit model learns the training data too well, including the noise. While it might perform very well on the training data, it struggles to generalize to unseen data. This happens when the model is too complex or has been trained for too long, essentially memorizing the specifics of the training set instead of learning the general patterns.
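The three states can be sketched without a neural network. The example below swaps in a k-nearest-neighbor average as a stand-in model, which is my own assumption, not something these notes rely on: the value of `k` plays the role of model complexity. The noisy sine data, the noise level, and the chosen values of `k` are made up.

```python
import math
import random

random.seed(0)

def make_data(n):
    """Noisy samples of sin(x) on [0, 2*pi]; the noise level 0.3 is arbitrary."""
    data = []
    for _ in range(n):
        x = random.uniform(0.0, 2.0 * math.pi)
        data.append((x, math.sin(x) + random.gauss(0.0, 0.3)))
    return data

train_set = make_data(60)
val_set = make_data(60)

def knn_predict(x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train_set, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(data, k):
    """Mean squared error of the k-NN predictor on the given data."""
    return sum((knn_predict(x, k) - y) ** 2 for x, y in data) / len(data)

# k = len(train_set): always predicts the global mean     -> too simple
# k = 1: reproduces every training target, noise included -> memorizes
# k = 7: averages over a small neighborhood               -> a middle ground
for k in (len(train_set), 7, 1):
    print(f"k={k:2d}  train MSE={mse(train_set, k):.3f}  val MSE={mse(val_set, k):.3f}")
```

With `k=1` the training error is exactly zero because every training point is its own nearest neighbor, yet the validation error is not zero: that gap is what memorizing noise looks like.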

#### Training data may include noise, uncertainty, or rare features:
**Noise**: refers to errors in the measurement of the features or targets. It's random and doesn't represent true underlying patterns.

**Uncertainty**: represents the inherent randomness in the data itself. Even with the same input features, the true output might vary due to some underlying randomness in the process being modeled.

**Rare Features**: are features that appear infrequently in the data. They can lead to spurious correlations, where the model learns to rely on these features for predictions, but those features don't generalize to new data. This is because the model mistakes a correlation in the small number of examples for a true underlying relationship.

### Why does Generalization work?

### Manifold Hypothesis
Data lies on a low-dimensional manifold (the latent manifold) within the high-dimensional space where it is encoded.

**Note:** A manifold is a lower-dimensional subspace of some parent space.

**Implications**
1. Machine learning models only have to fit this lower-dimensional manifold, not the whole high-dimensional space. One example is a crumpled paper ball: the lower-dimensional manifold is the 2D surface of the paper.
2. Within these manifolds, it is possible to **interpolate** between two inputs, that is, to morph from one to the other via a connected, continuous path.
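A toy sketch of the second implication, under the assumption that the "data" lives on a unit circle: a 1D manifold (parameterized by an angle) inside a 2D encoding space. The function names are hypothetical. Interpolating along the latent coordinate stays on the manifold, while a straight line in the encoding space leaves it.

```python
import math

def on_circle(theta):
    """Map a 1D latent coordinate (an angle) to a point in 2D space."""
    return (math.cos(theta), math.sin(theta))

theta_a, theta_b = 0.0, math.pi / 2
a, b = on_circle(theta_a), on_circle(theta_b)

def interpolate_ambient(t):
    """Straight line between a and b in the 2D encoding space."""
    return tuple((1 - t) * pa + t * pb for pa, pb in zip(a, b))

def interpolate_latent(t):
    """Move along the manifold by interpolating the latent angle."""
    return on_circle((1 - t) * theta_a + t * theta_b)

def radius(p):
    """Distance from the origin; points on the manifold have radius 1."""
    return math.hypot(p[0], p[1])

mid_ambient = interpolate_ambient(0.5)
mid_latent = interpolate_latent(0.5)
print(radius(mid_ambient))  # below 1: the midpoint has left the circle
print(radius(mid_latent))   # 1 up to rounding: still on the circle
```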

### Interpolation, source of Generalization

Deep learning achieves generalization via interpolation on a learned approximation of the data manifold.

1. Local generalization: making sense of things that are very close to what you've seen before (interpolation).
2. Extreme generalization: abstraction, symbolic models, innate priors (something we humans can do as well).

### Clarification

The higher-dimensional space exists outside of the model.
The model (weights, biases, and architecture) is looking for a lower-dimensional pathway through that space that explains the patterns in your data.
The model's curve is fit to the data gradually and smoothly over time.

At some point during training, it is a rough approximation of the latent manifold.

### Conclusion
Deep learning is well suited to learning these latent manifolds, since deep learning models are smooth and continuous in their structure, just like the latent manifolds.

Thus generalization is more a consequence of the natural structure of the data than of the model.

**The model will only be able to generalize where the data forms a manifold on which it can interpolate between points.**

From this we can see that if our data sampling is noisy or not dense, the learned curve will be a bad approximation of the natural manifold.
Between sparsely sampled points, the fit simply takes the direct way from one point to the next.

I want to end with this comment from Chollet:
"The only thing you will find in your Deep learning model is what you put into it; The priors encoded in its architecture and the data it was trained on."



### How to evaluate a machine learning model?
**You can only control what you can observe.**

One hurdle in machine learning is that a lot happens out of sight, and we only gain insight by measuring.

What will we measure: the training process, or the model at the end of the training process?
And with what data?

1. Training data: used to optimize the weights.
2. Validation data: used to evaluate the model during development and to tune the hyperparameters.*
3. Test data: used once, after all optimization is done, for the final evaluation.


*The engineer observes the results on the validation data and then decides how to adjust the hyperparameters.
--> This leads to overfitting on the validation set, because information from it leaks into the model, making it perform better on that set.
To counteract this, you must test on a third dataset that the model never encounters until the end.
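A minimal sketch of cutting one dataset into the three parts described above. The function name `three_way_split` and the 60/20/20 proportions are assumptions for illustration, not a recommendation.

```python
import random

def three_way_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut the data into train, validation and test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                 # touched only once, at the very end
    val = shuffled[n_test:n_test + n_val]    # used while tuning hyperparameters
    train = shuffled[n_test + n_val:]        # used to optimize the weights
    return train, val, test

train, val, test = three_way_split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```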


Hyperparameter = priors

Parameter = weights

There are three strategies to split data:

---

### Data Splits

**Simple Holdout Validation**
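Set apart part of the data as a validation set and train on the rest, with the test set kept separate: (Validation set | Training set); Test set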


**How to spot if you don't have enough data for this strategy?**

1. Before splitting the data, shuffle it.
2. Then train and measure.
3. Repeat with a different shuffle. Are the performance measurements very different each time?

If yes, you don't have enough data: the validation score depends too much on which samples happen to land in which split.
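The check above can be sketched as follows. The data is synthetic (`y = 3x + noise`), the "model" is a closed-form least-squares line, and the dataset sizes, the 20% validation fraction, and the 20 repetitions are all assumptions chosen for illustration.

```python
import random
import statistics

def make_data(n, seed=0):
    """Hypothetical noisy linear data: y = 3x + noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        data.append((x, 3.0 * x + rng.gauss(0.0, 1.0)))
    return data

def fit_line(points):
    """Closed-form least-squares fit of y = w*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    w = cov_xy / var_x
    return w, mean_y - w * mean_x

def holdout_score(data, seed, val_fraction=0.2):
    """Shuffle, split once, train on one part, return the validation MSE."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    val, train = shuffled[:n_val], shuffled[n_val:]
    w, b = fit_line(train)
    return sum((w * x + b - y) ** 2 for x, y in val) / len(val)

def score_spread(data, runs=20):
    """Mean and standard deviation of the score over several shuffles."""
    scores = [holdout_score(data, seed) for seed in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)

spreads = {n: score_spread(make_data(n)) for n in (15, 1500)}
for n, (mean, spread) in spreads.items():
    print(f"n={n:4d}  mean val MSE={mean:.2f}  std across shuffles={spread:.2f}")
```

A large standard deviation across shuffles relative to the mean is the warning sign: the measurement says more about the split than about the model.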

**K-Fold Validation** Split the data into K partitions of equal size.

For each partition i, train a model on the remaining K - 1 partitions and evaluate it on partition i.

The final score is the average of the K scores obtained.

Like simple holdout validation, this method still uses a dedicated validation portion, just a different one in each run.
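A from-scratch sketch of K-fold validation. The helper names, the stand-in "model" (which just predicts the mean target of its training folds), and the toy data are assumptions; any training and scoring function could be plugged in instead.

```python
import random

def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder so every sample is validated once.
        end = n_samples if fold == k - 1 else start + fold_size
        val = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, val

def k_fold_score(data, k, train_fn, score_fn):
    """Train one model per fold and return the average validation score."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), k):
        model = train_fn([data[i] for i in train_idx])
        scores.append(score_fn(model, [data[i] for i in val_idx]))
    return sum(scores) / len(scores)

# Hypothetical stand-in model: predict the mean target of the training folds.
def train_mean(points):
    return sum(y for _, y in points) / len(points)

def mse(model, points):
    return sum((model - y) ** 2 for _, y in points) / len(points)

rng = random.Random(0)
data = [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in range(20)]
rng.shuffle(data)  # shuffle once before partitioning
print(k_fold_score(data, k=4, train_fn=train_mean, score_fn=mse))
```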

**Iterated K-Fold with Shuffling** Same as K-fold, but you apply K-fold multiple times, shuffling the data each time before splitting it into K partitions.

This is useful for precise measurements but is compute heavy: you end up training P * K models, where P is the number of times K-fold is applied.
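A compact sketch of the iterated variant, mainly to make the cost visible. The function name, the stand-in model (mean of the training targets), and the toy numbers are assumptions. For brevity this version leaves any remainder samples in the training part instead of growing the last fold.

```python
import random

def iterated_k_fold(data, k, repeats, train_fn, score_fn, seed=0):
    """Run K-fold `repeats` times, reshuffling before each run.

    Returns the average score and the number of models trained.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)  # a fresh shuffle for every repetition
        fold_size = len(shuffled) // k
        for fold in range(k):
            start, end = fold * fold_size, (fold + 1) * fold_size
            val = shuffled[start:end]
            train = shuffled[:start] + shuffled[end:]
            scores.append(score_fn(train_fn(train), val))
    return sum(scores) / len(scores), len(scores)

# Hypothetical stand-in model: predict the mean of the training targets.
def train_mean(values):
    return sum(values) / len(values)

def mse(model, values):
    return sum((model - v) ** 2 for v in values) / len(values)

data = [float(v) for v in range(12)]
score, models_trained = iterated_k_fold(data, k=3, repeats=5,
                                        train_fn=train_mean, score_fn=mse)
print(models_trained)  # 15 models for 5 repetitions of 3-fold
```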

---

### Set a common-sense baseline:

**What is your model better at?
How would one think about that?**


If you have other models to compare to, good.

But if you don't have other models, try looking at the bigger picture.

For example, suppose you have a model that predicts "good" or "bad". If you were to guess randomly on an evenly split dataset (the class balance is part of the baseline as well), the random model would be right 50% of the time.

Can the machine learning model beat that?
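Two simple baselines can be computed directly from the labels. This is a sketch with made-up label lists; the helper names are hypothetical. It also shows why the class balance belongs to the baseline: on a 90/10 split, always predicting the majority class is already 90% right.

```python
import random
from collections import Counter

def accuracy(predictions, targets):
    """Fraction of predictions that match the targets."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

def majority_baseline(targets):
    """Accuracy of always predicting the most frequent class."""
    most_common_class = Counter(targets).most_common(1)[0][0]
    return accuracy([most_common_class] * len(targets), targets)

def random_baseline(targets, seed=0):
    """Accuracy of guessing uniformly at random among the classes."""
    rng = random.Random(seed)
    classes = sorted(set(targets))
    return accuracy([rng.choice(classes) for _ in targets], targets)

balanced = ["good"] * 50 + ["bad"] * 50    # made-up labels, even split
imbalanced = ["good"] * 90 + ["bad"] * 10  # made-up labels, 90/10 split

print(majority_baseline(balanced))    # 0.5
print(majority_baseline(imbalanced))  # 0.9
print(random_baseline(balanced))      # close to 0.5, varies with the seed
```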

---
### Things to be careful about when splitting data:

1. **Data Representativeness**: Your data should be well shuffled so that both the training and validation sets are representative of the data as a whole. Having some classes not represented in the training set will affect the model!
2. **Arrow of Time**: Don't shuffle randomly if you are trying to predict the future based on time. The test set should always be posterior to the data in the training set!
3. **Redundancy in Your Data**: If a data point appears twice (once in training and once in validation), you end up testing on part of your training data, which is problematic because the model has already fit that data.
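Points 2 and 3 can be sketched as two small helpers. The record layout (`t` for a timestamp, `x` for a feature, `y` for the target) and the function names are assumptions for illustration.

```python
def drop_duplicates(records):
    """Remove exact duplicate (feature, target) pairs before splitting."""
    seen, unique = set(), []
    for r in records:
        key = (r["x"], r["y"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def temporal_split(records, train_fraction=0.8):
    """Sort by timestamp and keep the most recent records for testing."""
    ordered = sorted(records, key=lambda r: r["t"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

records = [
    {"t": 3, "x": 0.3, "y": 1},
    {"t": 1, "x": 0.1, "y": 0},
    {"t": 2, "x": 0.2, "y": 0},
    {"t": 5, "x": 0.3, "y": 1},  # same feature and target as the t=3 record
    {"t": 4, "x": 0.4, "y": 1},
    {"t": 6, "x": 0.6, "y": 1},
]

unique = drop_duplicates(records)
train, test = temporal_split(unique)
print([r["t"] for r in train])  # [1, 2, 3, 4]
print([r["t"] for r in test])   # [6]
```

Everything in the test part is later in time than everything in the training part, and no (feature, target) pair appears on both sides.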

---
### Improving the Model Fit!

**To achieve the perfect fit, you must first overfit.**
--> First get a model that shows some generalization and is able to overfit; after that, you can focus on refining generalization.


**What if:**
1. Training doesn't get started:
* Training loss doesn't go down over time;
* there is no overfitting.

Overcoming this hurdle is usually possible, because the main problem is simply that the loss is stuck.
**The model is always capable of memorizing data (overfitting) no matter what!**

At the core of this problem is the configuration of the gradient descent process. The optimizer choice, the initial values of the weights, the learning rate, and the batch size can all be adjusted.

**Try this**: Adjust the learning rate and the batch size!

Logic behind this:
* A learning rate that is too high might overshoot a proper fit.
* A learning rate that is too low might seem to make no progress because training is just too slow.
* A larger batch size leads to gradients that are more informative and less noisy.
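Both effects show up even on toy problems. The sketch below assumes a one-parameter loss `w**2` for the learning rate part and synthetic per-example gradients for the batch size part; the specific learning rates, batch sizes, and noise level are arbitrary choices.

```python
import random
import statistics

def final_loss(lr, steps=50, w=1.0):
    """Gradient descent on loss(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w -= lr * 2 * w
    return w ** 2

# Too high overshoots and diverges, too low barely moves, 0.1 converges.
for lr in (1.1, 0.0001, 0.1):
    print(f"lr={lr:<7} final loss={final_loss(lr):.3e}")

# Batch size: a mini-batch gradient is the mean of per-example gradients,
# so larger batches give less noisy estimates. The per-example gradients
# here are synthetic (true mean 1.0, standard deviation 2.0).
rng = random.Random(0)
per_example_grads = [rng.gauss(1.0, 2.0) for _ in range(10_000)]

def gradient_noise(batch_size, trials=200):
    """Standard deviation of the mini-batch gradient estimate."""
    estimates = [
        statistics.fmean(rng.sample(per_example_grads, batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)

for batch_size in (1, 64):
    print(f"batch size={batch_size:<3} gradient noise={gradient_noise(batch_size):.3f}")
```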

2. Training gets started, but there is no meaningful generalization:
* The common-sense baseline is still better than your model!

This might be the worst-case scenario: no generalization means something is **fundamentally wrong with your approach**.

**Try this**:
* Check if the input data contains sufficient information to predict the targets.
* The model used might not be suitable for the problem.

**Using a model that makes the right assumptions about the problem is key to achieving generalization.**


3. Validation loss and training loss go down over time, but there is no overfitting, meaning we are still underfitting:

You’ve already won half the battle!

The model you are using may be lacking representational power.

**Also, never forget: Overfitting is ALWAYS achievable.**

**Try this**: Increase the capacity of the model:
* bigger layers (more units per layer)
* more layers
* layer types that are better suited to the problem


# General Workflow of Machine Learning

|--Data---Assumption---Model Topology---Learning---Observing---Validating---Using the Model--|

**The goal is to build a model that generalizes well, meaning it works on unseen data.**


There are key moments that are worth revisiting while building.

1.   The assumptions about the data: these determine the topology and thus the hypothesis space of the model.
2.   The loss function: the model will take any shortcut it can to minimize it, so choose the right loss function for the problem (check your assumptions!).






## General Advice

1. Preprocess raw data before feeding it into a neural network.

2. Normalize the data: shift and scale each feature so that it has a mean of 0 and a standard deviation of 1 (subtract the mean and divide by the standard deviation).

3. Overfit the model first to figure out the ideal number of training epochs. (Measure twice, cut once.)

4. With little training data, use a small model with only one or two layers to avoid overfitting.

5. With many categories for classification, you need larger layers. --> Otherwise the layers become an information bottleneck and features get compressed.

6. When working with little data, K-fold validation can help you evaluate your model reliably.
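As a sketch of the normalization in point 2, here is the computation for a single feature column. The numbers and function names are made up for illustration. The mean and standard deviation are computed on the training data only and then reused for the test data, so that no information from the test set leaks into preprocessing.

```python
import statistics

def fit_normalizer(train_values):
    """Compute mean and standard deviation on the training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std

def normalize(values, mean, std):
    """Subtract the mean and divide by the standard deviation."""
    return [(v - mean) / std for v in values]

train_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up feature column
test_values = [3.0, 8.0]

mean, std = fit_normalizer(train_values)       # mean 5.0, std 2.0
train_norm = normalize(train_values, mean, std)
test_norm = normalize(test_values, mean, std)  # reuse the training statistics

print(train_norm)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(test_norm)   # [-1.0, 1.5]
```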