# Ensemble Learning Techniques

Ensemble learning is an ML paradigm in which multiple models (often called "learners" or "base models") are trained and combined to solve a particular ML problem.

The idea is to build a prediction model by combining the outputs of multiple smaller models to improve robustness, accuracy, and performance. Depending on the technique, the goal is to reduce variance (bagging), reduce bias (boosting), or improve predictions by learning how to combine models (stacking).

Ensembling is a powerful approach in ML for achieving high accuracy and robustness across a wide range of tasks by combining multiple models. Ensembles can capture patterns in data that individual models typically miss, making them popular in settings that require high predictive performance.

General Guidance:
- Boosting is often chosen for problems where accuracy is paramount and the individual models are weak or biased.
- Stacking is chosen when you have access to diverse models, and you're aiming for the best possible predictive performance, often at the expense of computational resources and model interpretability.
- Bagging is preferred when dealing with high-variance models and when you want to improve robustness and stability without necessarily making the model more complex.





## Bootstrap Aggregation (Bagging)
https://en.wikipedia.org/wiki/Bootstrap_aggregating

Example: Random Forest Classifier

### Process:
Bagging involves training multiple models of the same type on different subsets of the training data. The subsets are created by randomly sampling with replacement from the original dataset (so some samples can appear more than once). The final prediction is made by averaging the models' predictions (for regression problems) or by majority vote (for classification problems).

Bagging is typically used in decision trees (Random Forests), but can be applied to most ML models to improve performance. 

### Sampling: 
You start with a standard training dataset "D" with a sample size of "n". Bagging generates "m" new training sets (the bootstrap samples) by sampling from "D" with replacement. The size of each new training set is described by "n'", which is usually equal to "n" but can differ.
- "With replacement" means that when you draw a sample from the original dataset "D" and add it to a new training set, you don't remove that sample from "D", so it can be drawn again

The new training sets created through this process are the bootstrap samples. When n' = n (the new training sets are the same size as the original dataset D):
- it's statistically expected that, for large n, each bootstrap sample will contain about 63.2% of the unique instances from the original dataset (63.2% comes from the formula "1 - (1/e)", where "e" is the base of the natural logarithm, roughly 2.71828)
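
The 63.2% figure can be checked with a small simulation. This is a minimal sketch using only the Python standard library; the dataset size, trial count, and function name are arbitrary choices for illustration.

```python
import math
import random


def unique_fraction(n, rng):
    """Draw one bootstrap sample of size n from range(n); return the share of distinct items."""
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000                # size of the original dataset D (and of each bootstrap sample)
trials = 200            # number of bootstrap samples to average over

avg_unique = sum(unique_fraction(n, rng) for _ in range(trials)) / trials
expected = 1 - 1 / math.e

print(f"simulated: {avg_unique:.3f}, 1 - 1/e: {expected:.3f}")
```

The simulated share should land close to 1 - 1/e; for small n the exact value, 1 - (1 - 1/n)^n, is slightly higher.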

Sampling with replacement means every draw is made from the full dataset, so the selection of one data point does not affect the selection of another. Each bootstrap sample is also drawn independently of the others, so each training set can be considered independently created - this is crucial to ensure the models have diverse perspectives on the data

#### Why is Out-of-bag Dataset important? 

The original dataset contains every data point exactly once, but a bootstrapped dataset of the same size usually contains repeats. Every repeated draw takes a slot that another data point could have filled, so some of the original samples never make it into the bootstrapped dataset. Those left-out samples form the "out-of-bag" dataset: the remainder of samples not represented in the bootstrap dataset
- `Original Dataset` (12 samples) minus `Bootstrap dataset` (12 total samples, but only 7 unique samples) = `Out-of-bag Dataset` (5 samples) 
    - 12 - 7 = 5 
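
The split into in-bag and out-of-bag samples can be sketched as follows. The 12 integer IDs stand in for the 12 samples above; the exact number of unique samples depends on the random draw, so it will not always be 7.

```python
import random

rng = random.Random(42)
original = list(range(12))  # 12 sample IDs standing in for dataset D

# Bootstrap sample: 12 draws with replacement from the original dataset
bootstrap = [rng.choice(original) for _ in original]

in_bag = set(bootstrap)                                 # unique samples that were drawn
out_of_bag = [i for i in original if i not in in_bag]   # samples never drawn

print("bootstrap draws:", sorted(bootstrap))
print("unique in bag:  ", len(in_bag))
print("out-of-bag:     ", out_of_bag)
```

Whatever the draw, the unique in-bag count and the out-of-bag count always add up to the size of the original dataset.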
    
The out-of-bag dataset is used to estimate the accuracy of a random forest: each sample is scored only by the trees whose bootstrap dataset did not contain it, so it acts as unseen test data for those trees. For example, a model that builds 50 trees from bootstrap and out-of-bag datasets will typically be more accurate than one that builds 10 trees. 
- the algorithm generates multiple trees, and therefore multiple bootstrap datasets, so the chance that a sample is left out of every bootstrap dataset is low 
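
The bullet above can be checked with a little arithmetic. Assuming each bootstrap dataset has the same size n as the original, a given sample is missed by one draw with probability 1 - 1/n, so it is missed by all m datasets with probability (1 - 1/n)^(n*m). The function name below is made up for this sketch.

```python
import math


def prob_never_sampled(n, m):
    """Probability that one given sample is missing from all m bootstrap datasets of size n."""
    return (1 - 1 / n) ** (n * m)


# With the 12-sample dataset from above and a growing number of trees
for m in (1, 10, 50):
    print(f"{m:>2} trees: {prob_never_sampled(12, m):.3g}")
```

For a single bootstrap dataset the value approaches 1/e (about 36.8%) as n grows, which is the complement of the 63.2% figure; it shrinks rapidly as more trees are added.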

A decision tree, the base model in a random forest, is a flowchart-like sequence of feature-based splits that ends in a prediction


#### Why is it ok to have duplicate values in the bootstrapped dataset? 
Multiple occurrences of the same data point in one bootstrapped dataset, but not in others, introduce variability between the models. When the predictions from these models are aggregated (through majority vote or averaging), the ensemble can achieve more generalized performance

This ensemble model assumes that while each model may have its own biases due to its training dataset (like having duplicates of a certain datapoint), the aggregation process will even out the biases and lead to a final prediction that is more accurate and less prone to overfitting than any single model prediction


### Model Training:
Training - for each of the "m" bootstrap samples, a separate model is trained on that sample. This means you end up with "m" models, each trained on a slightly different set of data due to the random sampling process

Combining - once we have the predictions from the "m" models, we combine the results
- for regression: the output is averaged across all the "m"-models
- for classification: there is a majority vote on what to classify 
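
The training and combining steps can be sketched from scratch. This is an illustrative toy, not a production implementation: the base model is a one-feature threshold classifier ("stump"), and all function names and the synthetic data are assumptions made for the example.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a one-feature threshold classifier: predict `right` if x > t, else 1 - right."""
    best = None
    for t in sorted({x for x, _ in points}):
        for right in (0, 1):
            errors = sum((right if x > t else 1 - right) != y for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, t, right)
    _, t, right = best
    return lambda x: right if x > t else 1 - right


def bagging_fit(points, m, rng):
    """Train m stumps, each on its own bootstrap sample (n' = n, with replacement)."""
    models = []
    for _ in range(m):
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models


def predict_vote(models, x):
    """Classification: majority vote across the m models."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


def predict_average(models, x):
    """Regression-style combination: mean of the m model outputs."""
    return sum(model(x) for model in models) / len(models)


rng = random.Random(0)
# Toy data: the label is 1 when x > 5, with roughly 10% of labels flipped as noise
train = []
for _ in range(200):
    x = rng.uniform(0, 10)
    y = int(x > 5)
    if rng.random() < 0.1:
        y = 1 - y
    train.append((x, y))

models = bagging_fit(train, m=25, rng=rng)
print(predict_vote(models, 8.0), predict_vote(models, 2.0))
print(predict_average(models, 8.0))
```

An odd number of models is used here so that a two-class majority vote cannot tie.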


### Pros:
1. Bagging reduces variance (without increasing bias), leading to a model that generalizes better to unseen data 
    - however, bagging might not significantly reduce bias if a single model is already biased (but it doesn't dramatically increase bias either)
    - the main objective is variance reduction (when averaging the outputs of multiple models, the variance decreases)
2. Helps avoid overfitting when models get complex
    - even though each individual model might have high-variance predictions due to overfitting to its bootstrapped dataset, averaging the models can cancel out the individual variances, leading to a more stable prediction
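
The variance-reduction effect can be illustrated with a simulation. This sketch makes a simplifying assumption: each "model" is an unbiased prediction of a true value of 10.0 with independent Gaussian noise. Real bagged models are trained on overlapping data and are correlated, so the reduction in practice is smaller than shown here.

```python
import random
import statistics

rng = random.Random(1)


def noisy_prediction():
    """One high-variance model: the true value 10.0 plus Gaussian noise (std 2.0)."""
    return 10.0 + rng.gauss(0, 2.0)


def ensemble_prediction(m):
    """Average of m independent noisy predictions."""
    return sum(noisy_prediction() for _ in range(m)) / m


single = [noisy_prediction() for _ in range(5000)]
averaged = [ensemble_prediction(25) for _ in range(5000)]

var_single = statistics.variance(single)
var_avg = statistics.variance(averaged)
print(f"variance, single model:  {var_single:.2f}")
print(f"variance, average of 25: {var_avg:.2f}")
print(f"mean, average of 25:     {statistics.mean(averaged):.2f}")
```

The averaged predictions spread far less around the true value, while their mean stays where it was, which matches the point that bagging targets variance and leaves bias largely unchanged.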

![bagging%20pros%20and%20cons.PNG](attachment:bagging%20pros%20and%20cons.PNG)



### When to use Bagging
- When you have a variance problem: Bagging is effective at reducing variance without increasing bias. If your model is overfitting the training data, bagging can help generalize better to unseen data
- With unstable and complex models: Bagging can be beneficial for complex models, such as deep neural networks or large decision trees, that are prone to overfitting. It makes such models more robust by averaging out their predictions
- For parallelizable training: Since each model in bagging can be trained independently of the others, bagging can be efficiently parallelized, making it suitable for problems where speed and efficiency are concerns
- When model simplicity is not critical: Bagging, especially in the form of random forests, can lead to very accurate models, but these models can be large and not easily interpretable




## Boosting 
Boosting focuses on reducing bias and building a strong model from a number of weak ones in sequential order. Each model in the sequence focuses on correctly predicting the instances that were misclassified by the previous model. Predictions from all models are then combined through a weighted vote (or sum) to produce a final prediction.

The technique builds the model in stages, and at each stage, it adjusts the weights of incorrectly classified instances so that subsequent models focus more on difficult cases. Boosting is particularly known for its ability to reduce bias and variance, leading to improved model accuracy. 


Examples: 
- Adaptive Boost (AdaBoost) - adjust weights of incorrectly classified instances so that subsequent classifiers focus on difficult cases
- Gradient Boosting - builds models in a sequential manner but uses the gradient of the loss function to guide the learning process (for both regression and classification)
- Extreme Gradient Boosting (XGBoost) - optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable (under the gradient boosting framework) 
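
To make the gradient boosting idea concrete, here is a minimal from-scratch sketch for regression with squared loss, where the negative gradient is simply the residual. The stump learner, the learning rate of 0.1, the 50 rounds, and the toy target are all assumptions for illustration; libraries such as XGBoost add many refinements that are not shown.

```python
def fit_regression_stump(xs, targets):
    """Find the split on x that minimizes squared error, predicting the mean on each side."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Squared-loss gradient boosting: each new stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)  # initial constant prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - prediction
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 10 for i in range(50)]
ys = [x * x for x in xs]  # toy target: y = x^2
mean_y = sum(ys) / len(ys)


def baseline(x):
    """Constant model that always predicts the mean target."""
    return mean_y


boosted = gradient_boost(xs, ys)
print(f"baseline MSE: {mse(baseline, xs, ys):.3f}")
print(f"boosted MSE:  {mse(boosted, xs, ys):.3f}")
```

Each round only nudges the prediction by a fraction of the stump's output; that shrinkage is the role of the learning rate.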


### Process
The process starts with a dataset in which every instance is assigned an equal weight. The weights indicate how important it is for the next model to classify each instance correctly. 

Next, train a series of weak models (often decision trees) in a sequential manner. A "weak model" refers to a model that performs slightly better than random guessing but is still very simple with high bias. 

After training each model, adjust the weights of the instances based on the correctness of the model's predictions. Increase the weights of the instances that were incorrectly predicted (making them more important for the next model) and decrease the weights of the correctly predicted instances (making them less important for the next model). 

Then each model is given a weight that reflects its accuracy, and the final prediction is made by a weighted vote (for classification) or weighted sum (for regression) of the predictions of all models. 
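
The four steps above follow the shape of AdaBoost, which can be sketched from scratch as below. The labels are encoded as -1/+1, the weak model is a one-feature threshold stump, and the model weight uses the standard AdaBoost formula alpha = 0.5 * ln((1 - err) / err). The function names and the toy data are assumptions for illustration.

```python
import math


def fit_weighted_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error (labels are -1 or +1)."""
    best = None
    for t in sorted(set(xs)):
        for sign in (-1, 1):
            preds = [sign if x > t else -sign for x in xs]
            err = sum(w for p, y, w in zip(preds, ys, weights) if p != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return err, (lambda x: sign if x > t else -sign)


def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1 / n] * n  # step 1: equal weight for every instance
    ensemble = []          # list of (model_weight, stump) pairs
    for _ in range(n_rounds):
        # step 2: train a weak model on the weighted data
        err, stump = fit_weighted_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        # step 4 (model weight): a more accurate stump gets a larger say
        alpha = 0.5 * math.log((1 - err) / err)
        # step 3: misclassified instances (y * prediction = -1) get heavier, correct ones lighter
        weights = [w * math.exp(-alpha * y * stump(x)) for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
        ensemble.append((alpha, stump))
    return ensemble


def predict(ensemble, x):
    """Weighted vote: sign of the alpha-weighted sum of stump outputs."""
    score = sum(alpha * stump(x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy 1D data that no single threshold can separate: +1 only in the middle band
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]

ensemble = adaboost(xs, ys, n_rounds=3)
accuracy = sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)
print("training accuracy:", accuracy)
```

No single stump can classify the middle band correctly, but the weighted vote of a few stumps can, which is the "strong model from weak ones" idea in miniature.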

### Sampling

Boosting does not involve sampling of instances to create multiple datasets (unlike bagging). Instead, all instances are used for each model, but their weights are adjusted, effectively changing their distribution for the next model. This method iteratively re-weights difficult predictions, making sure the ensemble pays more attention to them

### Model Training
Models are trained one at a time, with each model learning from the errors of the previous one (the sequential order is crucial because each model's performance determines which instances to emphasize next) 

The individual models in boosting are often simple (weak learners), such as shallow decision trees. The rationale is that simple models contribute to overall model diversity and reduce the risk of overfitting. 

Each model's influence on the final prediction is weighted by its accuracy. Models that perform better have more say in the final prediction than those with weaker performances

### When to use Boosting
- When you have a bias problem: If the model is too simple and is underfitting the training data, boosting can help increase model complexity and reduce bias 
- With imbalanced data: Boosting has shown good performance on imbalanced datasets, especially AdaBoost, by focusing more on hard-to-classify instances
- For improving accuracy: If your primary goal is to squeeze out every bit of accuracy from your model, and you're less concerned about model interpretability or computational efficiency, boosting is often a strong candidate
- When individual models are weak: Boosting is designed to improve the performance of models that are only slightly better than random guessing. It is effective when you're working with simple models and want to incrementally improve their performance






## Stacked Generalization (Stacking) 
Stacking involves training a new model to combine the predictions of several base models. Base models are trained on the complete training set, then their predictions are used as input features for the final model (the "stacker" or "meta-learner") to make the final prediction. This approach can leverage the strengths of each base model. 
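
A minimal sketch of the idea, in plain Python. Two base regressors (a least-squares line and a 1-nearest-neighbor model) feed a meta-learner that learns one weight per base model. One deliberate deviation from the description above, labeled here as an assumption: the meta-learner is fitted on a held-out slice rather than on the data the base models were trained on, because the nearest-neighbor model reproduces its own training targets exactly and would otherwise receive all the weight. All names and the synthetic data are hypothetical.

```python
import math
import random


def fit_linear(xs, ys):
    """Base model 1: ordinary least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b


def fit_nearest_neighbor(xs, ys):
    """Base model 2: predict the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def fit_meta(features, ys):
    """Meta-learner: least-squares weights (w1, w2) for two base-model predictions."""
    s11 = sum(f[0] * f[0] for f in features)
    s12 = sum(f[0] * f[1] for f in features)
    s22 = sum(f[1] * f[1] for f in features)
    t1 = sum(f[0] * y for f, y in zip(features, ys))
    t2 = sum(f[1] * y for f, y in zip(features, ys))
    det = s11 * s22 - s12 * s12  # solve the 2x2 normal equations directly
    return (t1 * s22 - t2 * s12) / det, (t2 * s11 - t1 * s12) / det


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
xs = [rng.uniform(0, 6) for _ in range(300)]
ys = [math.sin(x) + 0.5 * x + rng.gauss(0, 0.3) for x in xs]

base_x, base_y = xs[:100], ys[:100]        # used to train the base models
meta_x, meta_y = xs[100:200], ys[100:200]  # used to train the meta-learner
test_x, test_y = xs[200:], ys[200:]        # used only for evaluation

linear = fit_linear(base_x, base_y)
neighbor = fit_nearest_neighbor(base_x, base_y)

# Base-model predictions become the input features of the meta-learner
meta_features = [(linear(x), neighbor(x)) for x in meta_x]
w1, w2 = fit_meta(meta_features, meta_y)


def stacked(x):
    """Final prediction: learned weighted combination of the base models."""
    return w1 * linear(x) + w2 * neighbor(x)


mse_linear = mse(linear, test_x, test_y)
mse_neighbor = mse(neighbor, test_x, test_y)
mse_stacked = mse(stacked, test_x, test_y)
print(f"weights: linear={w1:.2f}, neighbor={w2:.2f}")
print(f"test MSE: linear={mse_linear:.3f}, neighbor={mse_neighbor:.3f}, stacked={mse_stacked:.3f}")
```

A common refinement is to generate the meta-learner's features with k-fold cross-validation so that no data has to be set aside; the held-out split here just keeps the sketch short.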

### Process

### Sampling

### Model Training

### Pros

### When to use Stacking
- When diversity is key: Stacking is effective when you can train diverse models that make different assumptions about the data. The diversity can come from using different types of models, different feature sets, or different hyperparameters
- For maximizing predictive performance: Stacking can outperform individual models and other ensemble techniques by learning how to best combine their predictions. It's often used in machine learning competitions for this reason
- When computational resources and time are not the primary concern: Stacking involves training multiple models and a meta-model, which can be computationally expensive and time-consuming, especially with large datasets and complex base models


# Interpreting ML and Traditional ML Algorithms

## Interpretability Analysis

# Sampling and Data Splitting


# Loss

## Class-balanced Loss

## Focal-loss 

## Cross-entropy loss

## MSE loss

## MAE loss

## Huber loss

# Model and Data Parallelism

# Regularization

## L1 and L2 Regularization

## Entropy Regularization

# K-fold cross validation

# Dropout

# Optimization Algorithms

## Stochastic Gradient Descent

## AdaGrad 

## Momentum

## RMSProp 


# Activation Function

## ELU

## ReLU

## Tanh

## Sigmoid

# Model Eval

## FID Score

## Inception score

## BLEU metrics

## METEOR metrics

## ROUGE score

## CIDEr score

## SPICE score

## Model Compression Survey

## Shadow deployment

## A/B Testing

## Canary Release 




# Quantization-aware training


# Interleaving Experiment

# Multi-armed Bandit

# ML Infrastructure