All the models and descriptions

# Supervised Learning

Supervised learning models map inputs to outputs and attempt to apply patterns learned from past data to unseen data. They can be either regression models, where we try to predict a continuous variable such as stock prices, or classification models, where we try to predict a binary or multi-class variable, such as whether a customer will churn. In the section below, we'll explain two popular types of supervised learning models: linear models and tree-based models.

## Linear Models

In a nutshell, linear models fit a best-fit line (or hyperplane, when there are several features) to predict unseen data. They assume that the output is a linear combination of the features. In this section, we'll cover commonly used linear models in machine learning, along with their advantages and disadvantages.
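
As a minimal sketch of what "a linear combination of features" means in practice, the snippet below fits `LinearRegression` on toy data. The data and the generating rule `y = 1 + 2*x1 + 3*x2` are assumptions made up for illustration; the point is only that the fitted prediction equals the intercept plus a weighted sum of the inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from a known linear rule: y = 1 + 2*x1 + 3*x2 (illustrative assumption)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

model = LinearRegression().fit(X, y)

# The fitted model is an intercept plus a weighted sum of the features
manual_prediction = model.intercept_ + X @ model.coef_
print(model.intercept_, model.coef_)
```

Because the toy data contains no noise, the learned coefficients should match the generating rule, which is what makes linear models easy to interpret.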

| Algorithm          | Description                                                                                                              | Applications                               | Advantages                                     | Disadvantages                                   |
|--------------------|--------------------------------------------------------------------------------------------------------------------------|--------------------------------------------|------------------------------------------------|-------------------------------------------------|
| Linear Regression  | A simple algorithm that models a linear relationship between inputs and a continuous numerical output variable          | Stock Price Prediction                     | - Explainable method - Results are interpretable through the fitted coefficients - Faster to train than most other machine learning models     | - Assumes linearity between inputs and output - Sensitive to outliers - Can overfit with small, high-dimensional data |
| Logistic Regression| A simple algorithm that models a linear relationship between inputs and a categorical output (1 or 0)                   | Predicting credit risk score              | - Interpretable and explainable - Less prone to overfitting when using regularization - Applicable for multi-class predictions | - Assumes linearity between inputs and outputs - Can overfit with small, high-dimensional data |
| Ridge Regression   | Part of the regression family — it penalizes features that have low predictive outcomes by shrinking their coefficients closer to zero. Can be used for classification or regression | Predictive maintenance for automobiles     | - Less prone to overfitting - Best suited where data suffer from multicollinearity - Explainable & interpretable - All the predictors are kept in the final model | - Doesn't perform feature selection |
| Lasso Regression   | Part of the regression family — it penalizes features that have low predictive outcomes by shrinking their coefficients to zero. Can be used for classification or regression | Predicting housing prices                  | - Less prone to overfitting - Can handle high-dimensional data - No need for feature selection | - Can lead to poor interpretability as it can keep highly correlated variables |

### Import List for Linear Models
```python
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
```
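
The difference between Ridge (coefficients shrunk closer to zero) and Lasso (coefficients shrunk to zero) can be seen in a small experiment. This is a hedged sketch on synthetic data: the data-generating rule, the `alpha` values, and the random seed are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features carry signal; the other three are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge keeps every predictor with a small weight; Lasso can drop predictors entirely
print("ridge:", np.round(ridge.coef_, 3))
print("lasso:", np.round(lasso.coef_, 3))
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
```

In this setup the noise features should receive exactly zero weight under Lasso, while Ridge keeps all five predictors in the model.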

## Tree-based models

In a nutshell, tree-based models use a series of "if-then" rules, organized as decision trees, to make predictions. In this section, we'll cover commonly used tree-based models in machine learning, along with their advantages and disadvantages.

| Algorithm                 | Description                                                                                             | Applications                        | Advantages                            | Disadvantages                          |
|---------------------------|---------------------------------------------------------------------------------------------------------|-------------------------------------|---------------------------------------|----------------------------------------|
| Decision Tree             | Decision Tree models make decision rules on the features to produce predictions. It can be used for classification or regression | Customer churn prediction            | - Explainable and interpretable - Can handle missing values | - Prone to overfitting - Sensitive to outliers |
| Random Forests            | An ensemble learning method that combines the output of multiple decision trees                         | Credit score modeling, Predicting housing prices | - Reduces overfitting - Typically more accurate than a single decision tree | - Training complexity can be high - Not very interpretable |
| Gradient Boosting Regression | Gradient Boosting Regression employs boosting to make predictive models from an ensemble of weak predictive learners | Predicting car emissions, Predicting ride-hailing fare amount | - Better accuracy compared to other regression models - It can handle multicollinearity - It can handle non-linear relationships | - Sensitive to outliers and can therefore cause overfitting - Computationally expensive and has high complexity |
| XGBoost                   | Gradient Boosting algorithm that is efficient & flexible. Can be used for both classification and regression tasks | Churn prediction, Claims processing in insurance | - Provides accurate results - Captures non-linear relationships | - Hyperparameter tuning can be complex - Can overfit small or noisy datasets without regularization |
| LightGBM Regressor        | A gradient boosting framework that is designed to be more efficient than other implementations           | Predicting flight time for airlines, Predicting cholesterol levels based on health data | - Can handle large amounts of data - Computationally efficient & fast training speed - Low memory usage | - Can overfit due to leaf-wise splitting and high sensitivity - Hyperparameter tuning can be complex |

### Import List for Tree-based Models
```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
import lightgbm as lgb
```
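
To make the "if-then" rules concrete, the sketch below fits a one-split decision tree and prints its rules with `export_text`. The toy data (a single hypothetical `age` feature and a binary churn label) is an assumption for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: one feature (customer age) and a binary churn label
X = [[20], [25], [30], [45], [50], [60]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# export_text renders the learned splits as nested if-then rules
rules = export_text(tree, feature_names=["age"])
print(rules)

predictions = tree.predict([[22], [55]])
```

A single tree like this is easy to read, which is the interpretability advantage listed in the table; ensembles such as random forests trade that readability for robustness.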

# Unsupervised Learning

Unsupervised learning is about discovering general patterns in data without labeled outputs. The most popular example is clustering or segmenting customers and users. This type of segmentation is generalizable and can be applied broadly, such as to documents, companies, and genes. Unsupervised learning includes clustering models, which learn how to group similar data points together, and association algorithms, which find rules describing which items tend to occur together.

## Clustering models

| Algorithm              | Description                                                                                                  | Applications                                   | Advantages                                 | Disadvantages                                   |
|------------------------|--------------------------------------------------------------------------------------------------------------|------------------------------------------------|---------------------------------------------|-------------------------------------------------|
| K-Means                | K-Means is the most widely used clustering approach. It partitions the data into K clusters based on Euclidean distances   | Customer segmentation, Recommendation systems | - Scales to large datasets - Simple to implement and interpret - Results in tight clusters | - Requires the expected number of clusters from the beginning - Struggles with varying cluster sizes and densities |
| Hierarchical Clustering | A "bottom-up" approach where each data point is treated as its own cluster—and then the closest two clusters are merged together iteratively | Fraud detection, Document clustering based on similarity | - There is no need to specify the number of clusters - The resulting dendrogram is informative | - Doesn’t always result in the best clustering - Not suitable for large datasets due to high complexity |
| Gaussian Mixture Models| A probabilistic model for modeling normally distributed clusters within a dataset                            | Customer segmentation, Recommendation systems | - Computes a probability for an observation belonging to a cluster - Can identify overlapping clusters | - Requires complex tuning - Requires setting the number of expected mixture components or clusters |

### Import List for Clustering Models
```python
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
```
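
The contrast between hard assignments (K-Means) and per-cluster probabilities (Gaussian Mixture Models) can be sketched as follows. The two synthetic blobs, their locations, and the seed are assumptions chosen so the clusters are clearly separated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs of 50 points each
blob_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# K-Means: the number of clusters must be given up front; each point gets one label
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
hard_labels = kmeans.labels_

# GMM: each point gets a probability of belonging to every component
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)
```

Each row of `soft_memberships` sums to 1, which is what lets a mixture model express overlapping clusters where K-Means cannot.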

## Association

| Algorithm          | Description                                                                                           | Applications                       | Advantages                                      | Disadvantages                                  |
|--------------------|-------------------------------------------------------------------------------------------------------|------------------------------------|-------------------------------------------------|------------------------------------------------|
| Apriori Algorithm  | Rule-based approach that finds frequent itemsets in a dataset, using the property that every subset of a frequent itemset must itself be frequent | Product placements, Recommendation engines, Promotion optimization | - Results are intuitive and interpretable - Exhaustive approach as it finds all rules based on the confidence and support | - Generates many uninteresting itemsets - Computationally and memory intensive - Results in many overlapping item sets |
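
The level-wise search behind Apriori can be sketched in plain Python. This is a simplified teaching version, not a library implementation: the function name `apriori`, the toy `baskets`, and the `min_support` threshold are all hypothetical choices for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must already be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
frequent = apriori(baskets, min_support=0.6)

# Confidence of the rule bread -> milk: support(bread, milk) / support(bread)
confidence = frequent[frozenset({"bread", "milk"})] / frequent[frozenset({"bread"})]
```

The repeated support counting over all transactions is what makes the approach computationally and memory intensive on large datasets.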

Source

[Machine Learning Cheat Sheet](https://www.datacamp.com/cheat-sheet/machine-learning-cheat-sheet)

# Unsupervised Learning
## Clustering Algorithms

| Algorithm                                  | Description                                                                                            | Applications                                | Advantages                                      | Disadvantages                                  |
|--------------------------------------------|--------------------------------------------------------------------------------------------------------|---------------------------------------------|-------------------------------------------------|------------------------------------------------|
| K-means                                    | Partitioning method that partitions data into K clusters by minimizing the within-cluster variance     | Image segmentation, Customer segmentation  | - Simple and efficient - Scales well to large datasets - Easy to interpret clusters | - Assumes spherical clusters - Sensitive to initial centroid positions - Requires specifying the number of clusters beforehand |
| K-medoids (PAM)                            | Partitioning method similar to K-means, but uses actual data points as cluster centers (medoids)      | Data mining, Biological data analysis      | - Robust to noise and outliers - Can handle non-Euclidean distances | - Computationally expensive - Sensitive to the choice of distance metric |
| Agglomerative Clustering                   | Hierarchical method that starts with each point as its own cluster and merges clusters based on distance | Biology, Text mining                       | - Produces a dendrogram for hierarchical clustering - No need to specify the number of clusters | - Computationally expensive for large datasets - Memory intensive for storing distance matrix |
| DBSCAN                                     | Density-based method that groups together closely packed points based on density-reachability            | Anomaly detection, Spatial data analysis  | - Robust to noise and outliers - Can discover clusters of arbitrary shapes | - Sensitive to parameters epsilon and minPts - Doesn't work well with varying density clusters |
| OPTICS                                    | Extension of DBSCAN that provides a hierarchical clustering based on the density reachability          | Spatial data analysis, Network analysis   | - Captures clusters of varying densities - No need to fix a single epsilon beforehand | - Still requires minPts - Computationally expensive - Memory intensive for storing reachability plot |
| HDBSCAN                                    | Hierarchical extension of DBSCAN that overcomes its limitations by using a minimum spanning tree        | Image analysis, Customer segmentation     | - Automatically finds the optimal number of clusters - Robust to noise and outliers | - Computationally expensive for large datasets - Requires tuning of min_cluster_size parameter |
| Spectral Clustering                        | Graph-based method that uses eigenvectors of a similarity matrix to partition data into clusters       | Image segmentation, Community detection   | - Can capture complex structures and non-linear boundaries - Not sensitive to initializations | - Computationally expensive for large datasets - Requires tuning of parameters such as affinity matrix construction |
| Mean-Shift Clustering                      | Density-based method that locates centroids of clusters by iteratively shifting data points towards the mode | Image segmentation, Object tracking    | - No need to specify the number of clusters - Can handle non-linear data and arbitrary shapes | - Computationally expensive - Sensitive to bandwidth parameter - Not suitable for high-dimensional data |
| Gaussian Mixture Models (GMM)              | Probabilistic method that models data points as belonging to one of several Gaussian distributions      | Pattern recognition, Density estimation   | - Can model clusters with different shapes and sizes - Provides soft clustering with associated probabilities | - Computationally expensive for high-dimensional data - Convergence to local optima can occur |
| Variational Bayesian Gaussian Mixture (VBGMM) | Bayesian extension of GMM that infers the number of components and their parameters from data       | Image segmentation, Natural language processing | - Automatically determines the number of components - Provides uncertainty estimation of cluster assignments | - Computationally expensive for large datasets - Requires careful choice of prior distributions |
| Finite Mixture Models (FMM)                | Generalization of GMM that allows for different distributions to be used for modeling each cluster    | Bioinformatics, Finance                    | - More flexible than GMM in modeling cluster shapes - Can handle non-Gaussian distributions | - Requires specifying the number and types of distributions beforehand - Computationally expensive for high-dimensional data |
| STING                                      | Grid-based clustering that partitions data space into rectangular cells and merges them based on density | Geographic information systems (GIS), Spatial data analysis | - Efficient for large datasets - Scalable to high-dimensional data | - Assumes rectangular cluster shapes - Sensitive to grid size and resolution |
| WaveCluster                                | Grid-based clustering that uses wavelet transforms to perform density estimation and cluster detection | Image segmentation, Pattern recognition   | - Can handle arbitrary shapes and densities - Robust to noise and outliers | - Requires tuning of parameters such as wavelet bandwidth and resolution - Computationally expensive for high-dimensional data |
| Fuzzy C-Means (FCM)                        | Soft clustering method that assigns data points to clusters with associated membership degrees         | Medical image analysis, Customer segmentation | - Provides soft clustering with fuzzy memberships - Robust to noise and outliers | - Sensitive to the choice of fuzziness parameter - Computationally expensive for large datasets |
| Gustafson-Kessel Algorithm                 | Extension of FCM that adapts the distance metric based on the data distribution and covariance matrices | Pattern recognition, Biometrics           | - Accounts for the shape and orientation of data distribution - Robust to noise and outliers | - Computationally expensive for high-dimensional data - Sensitive to the choice of initial cluster centers |
| Self-Organizing Maps (SOM)                 | Competitive learning method that maps high-dimensional data onto a low-dimensional grid               | Feature extraction, Visualization          | - Produces a topological map of the input space - Robust to noise and outliers | - Requires tuning of parameters such as learning rate and neighborhood function - Computationally expensive for large datasets |
| Growing Neural Gas (GNG)                   | Competitive learning method that incrementally constructs a neural network to represent the data space | Data visualization, Clustering            | - Adapts to non-linear data distributions - Robust to noise and outliers | - Requires tuning of parameters such as learning rate and decay factor - Computationally expensive for large datasets |
| Affinity Propagation                       | Graph-based method that finds exemplars (data points) representing clusters based on message passing | Image segmentation, Social network analysis | - Automatically determines the number of clusters - Robust to noise and outliers | - Computationally expensive for large datasets - Sensitive to the choice of damping factor |
| BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) | Hierarchical clustering method that incrementally builds a tree structure to represent the data space | Stream data clustering, Image compression | - Scalable to large datasets - Memory efficient for storing clusters in a tree structure | - Assumes spherical clusters - Sensitive to the choice of branching factor and threshold |
| COBWEB | Incremental hierarchical conceptual clustering method that builds a concept hierarchy guided by category utility | Concept formation, Decision support systems | - Adapts to concept drift and incremental updates - Provides explanation of clustering decisions | - Results depend on the order in which instances are presented - Limited to small to medium-sized datasets |
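
As a small illustration of density-based clustering, the sketch below runs scikit-learn's `DBSCAN` on two dense grids plus one isolated point. The data layout and the `eps` and `min_samples` values are assumptions picked so the outcome is easy to reason about.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense 3x3 grids far apart, plus one isolated point
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
X = np.vstack([grid, grid + 10.0, [[5.0, 20.0]]])

# eps is the neighborhood radius; min_samples is the density threshold (minPts)
db = DBSCAN(eps=1.5, min_samples=4).fit(X)
labels = db.labels_

# DBSCAN marks noise points with the label -1, so exclude it when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(labels, n_clusters)
```

Unlike K-means, the number of clusters is never specified; it falls out of the density parameters, which is also why results are sensitive to them.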


## Dimensionality Reduction

| Algorithm                                  | Description                                                                                            | Applications                                | Advantages                                      | Disadvantages                                  |
|--------------------------------------------|--------------------------------------------------------------------------------------------------------|---------------------------------------------|-------------------------------------------------|------------------------------------------------|
| Principal Component Analysis (PCA)         | Technique for reducing the dimensionality of data by projecting it onto a lower-dimensional subspace  | Data visualization, Feature extraction     | - Retains as much variance as possible in fewer dimensions - Helps in removing redundant features | - Assumes linear relationships between variables - May not perform well with non-linear data |
| t-Distributed Stochastic Neighbor Embedding (t-SNE) | Technique for visualizing high-dimensional data by preserving local structure and distances         | Visualization, Dimensionality reduction   | - Preserves local structure and non-linear relationships - Effective for visualizing high-dimensional data in low-dimensional space | - Computationally expensive for large datasets - Interpretability can be challenging |
| Linear Discriminant Analysis (LDA)         | Technique for dimensionality reduction and classification by maximizing the separability between classes | Pattern recognition, Bioinformatics        | - Maximizes class separability in reduced space - Supervised approach for feature extraction | - Assumes Gaussian distribution of data in classes - Sensitive to outliers and noise |
| Non-negative Matrix Factorization (NMF)    | Technique for decomposing non-negative data into parts to represent the original data in a reduced form | Image processing, Text mining              | - Generates parts-based representations suitable for interpretation - Can handle sparse and non-negative data | - Requires specifying the number of components - May converge to local optima |
| Independent Component Analysis (ICA)       | Technique for separating a multivariate signal into additive, independent components                | Blind source separation, Image processing  | - Uncovers underlying independent sources in mixed signals - Useful for feature extraction and noise reduction | - Assumes statistical independence of components - Sensitive to scaling and whitening of data |
| Autoencoders                               | Neural network architecture used for learning efficient data codings in an unsupervised manner       | Dimensionality reduction, Feature learning | - Learn non-linear transformations for dimensionality reduction - Can capture complex data distributions | - Requires tuning of architecture and hyperparameters - May suffer from overfitting with small datasets |
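
A brief sketch of PCA with scikit-learn follows. The toy data is an assumption: three features that are all multiples of one underlying variable, so a single principal component should capture essentially all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

t = np.linspace(-1.0, 1.0, 20)
# Three features, but only one underlying degree of freedom
X = np.column_stack([t, 2.0 * t, -t])

pca = PCA(n_components=1).fit(X)
X_reduced = pca.transform(X)                    # shape (20, 1)
X_restored = pca.inverse_transform(X_reduced)   # back to shape (20, 3)

print(pca.explained_variance_ratio_)
```

On real data the explained variance ratio is the usual guide for choosing how many components to keep; non-linear structure would not be recovered this cleanly.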

## Association Rule Learning

| Algorithm          | Description                                                                                           | Applications                       | Advantages                                      | Disadvantages                                  |
|--------------------|-------------------------------------------------------------------------------------------------------|------------------------------------|-------------------------------------------------|------------------------------------------------|
| Apriori Algorithm  | Rule-based approach that identifies the most frequent itemset in a given dataset using prior knowledge | Market basket analysis, Recommender systems | - Results are intuitive and interpretable - Helps in understanding associations between items | - Generates many uninteresting itemsets - Computationally and memory intensive - Results in many overlapping item sets |
| FP-Growth Algorithm | Frequent pattern mining algorithm that uses a tree structure to efficiently discover frequent itemsets | Market basket analysis, Web usage mining | - Generates fewer candidate itemsets compared to Apriori - Requires fewer passes over the dataset | - Requires a large amount of memory for constructing FP-tree - Can be challenging to implement efficiently |

## Anomaly Detection

| Algorithm          | Description                                                                                           | Applications                       | Advantages                                      | Disadvantages                                  |
|--------------------|-------------------------------------------------------------------------------------------------------|------------------------------------|-------------------------------------------------|------------------------------------------------|
| Isolation Forest   | Tree-ensemble algorithm that isolates anomalies with random splits, so anomalous points need fewer splits to isolate   | Intrusion detection, Fraud detection | - Efficient for high-dimensional data - Scalable to large datasets - Handles outliers well | - Thresholding depends on the expected proportion of anomalies (contamination) - May struggle with clusters of anomalies |
| One-Class SVM      | Support vector machine-based algorithm that learns a decision boundary around normal data points     | Outlier detection, Novelty detection | - Suitable for high-dimensional data - Effective in detecting outliers in sparse datasets | - Requires tuning of hyperparameters such as kernel and nu - Sensitive to choice of kernel function |
| Local Outlier Factor (LOF) | Density-based algorithm that measures the local deviation of a data point with respect to its neighbors | Anomaly detection, Outlier detection | - Does not assume a specific distribution of data - Can detect outliers in arbitrarily shaped clusters | - Sensitive to the choice of distance metric - Parameter tuning required for optimal performance |
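
The following sketch applies two of the detectors above, `IsolationForest` and `LocalOutlierFactor`, to the same synthetic data. The data (a Gaussian cloud plus one far-away point) and the seeds are assumptions for illustration; default hyperparameters are used rather than tuned ones.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
outlier = np.array([[8.0, 8.0]])           # one obvious anomaly, appended last
X = np.vstack([normal_points, outlier])

# Isolation Forest: lower score_samples means "easier to isolate", i.e. more anomalous
iso = IsolationForest(random_state=0).fit(X)
iso_scores = iso.score_samples(X)
iso_pred = iso.predict(X)                  # -1 = anomaly, 1 = normal

# LOF: compares each point's local density with that of its neighbors
lof = LocalOutlierFactor(n_neighbors=20)
lof_pred = lof.fit_predict(X)              # -1 = anomaly, 1 = normal
```

Both detectors return `-1` for points they consider anomalous, so their outputs can be compared directly.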

## Density Estimation

| Algorithm                   | Description                                                                                          | Applications                          | Advantages                                      | Disadvantages                                  |
|-----------------------------|------------------------------------------------------------------------------------------------------|---------------------------------------|-------------------------------------------------|------------------------------------------------|
| Kernel Density Estimation   | Non-parametric method for estimating the probability density function of a random variable          | Data smoothing, Statistical modeling | - Flexibility in modeling arbitrary distributions - Easy to implement and interpret | - Computationally expensive for large datasets - Bandwidth selection can affect estimation quality |
| Gaussian Kernel Density Estimation | Special case of KDE where Gaussian kernels are used for smoothing                                       | Image processing, Anomaly detection  | - Smooth and continuous probability density estimates - Suitable for data with known distributions | - Requires specifying the bandwidth parameter - Sensitivity to outliers and noise |
| Generative Adversarial Networks (GANs) | Deep learning framework for learning to generate data by training a generator and discriminator network | Image generation, Data augmentation  | - Learn complex data distributions without explicit modeling - Can generate realistic data samples | - Training instability - Mode collapse - Requires large amounts of data and computational resources |
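
Kernel density estimation with a Gaussian kernel can be sketched with scikit-learn's `KernelDensity`. The bimodal sample, the bandwidth of 0.3, and the evaluation grid are assumptions for illustration; in practice the bandwidth would be selected, for example by cross-validation.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Synthetic 1-D sample with two modes, around -2 and 3
samples = np.concatenate([
    rng.normal(loc=-2.0, scale=0.5, size=200),
    rng.normal(loc=3.0, scale=0.5, size=200),
]).reshape(-1, 1)

kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(samples)

# score_samples returns log-density, so exponentiate to get the density itself
grid = np.linspace(-6.0, 7.0, 500).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))

# A valid density should integrate to roughly 1 (simple Riemann sum over the grid)
area = float(np.sum(density) * (grid[1, 0] - grid[0, 0]))
density_at_mode = float(np.exp(kde.score_samples([[-2.0]]))[0])
density_between = float(np.exp(kde.score_samples([[0.5]]))[0])
```

A smaller bandwidth would produce a spikier estimate and a larger one would blur the two modes together, which is the bandwidth sensitivity noted in the table.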

