- When the number of features is very large relative to the number of observations in your dataset, certain algorithms struggle to train effective models. This is called the **“Curse of Dimensionality,”** and it’s especially relevant for clustering algorithms that rely on distance calculations.
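
One way to see why distance-based methods suffer is to measure how much farther the farthest point is than the nearest one as dimensions are added. The sketch below is illustrative only: the helper name `relative_contrast`, the uniform random data, and the sample sizes are assumptions, not part of the notes above.

```python
import numpy as np

def relative_contrast(n_points, n_dims, rng):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))   # uniform samples in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)
low = relative_contrast(500, 2, rng)       # few features
high = relative_contrast(500, 1000, rng)   # many features
print(f"relative contrast in 2 dims: {low:.2f}, in 1000 dims: {high:.2f}")
```

With many features the contrast shrinks toward zero, so "near" and "far" neighbors become hard to tell apart, which is what hurts distance-based clustering.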

## Feature Selection
- Feature selection is for filtering irrelevant or redundant features from your dataset. The key difference between feature selection and extraction is that feature selection keeps a subset of the original features while feature extraction creates brand new ones.
- Some supervised algorithms already have built-in feature selection, such as Regularized Regression and Random Forests. 


- **Variance Thresholds**
    - Variance thresholds remove features whose values don't change much from observation to observation (i.e. their variance falls below a threshold). These features provide little value.
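    - Example: In a public health dataset where 96% of observations are for 35-year-old men, the 'Age' and 'Gender' features can be eliminated without a major loss of information.
    - Because variance depends on scale, always put your features on a common scale first (for example, min-max scaling to [0, 1]). Avoid standardizing to unit variance for this step, since that gives every feature the same variance and makes the threshold meaningless.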
    - **Strengths:** Applying variance thresholds is based on solid intuition: features that don't change much also don't add much information. This is an easy and relatively safe way to reduce dimensionality at the start of your modeling process.
    - **Weaknesses:** If your problem does require dimensionality reduction, applying variance thresholds is rarely sufficient. Furthermore, you must manually set or tune a variance threshold, which could be tricky. We recommend starting with a conservative (i.e. lower) threshold.
- **Correlation Thresholds**
    - Correlation thresholds remove features that are highly correlated with others (i.e. their values change very similarly to another feature's). These features provide redundant information.
    - You'd first calculate all pair-wise correlations. Then, if the correlation between a pair of features is above a given threshold, you'd remove the one that has the larger mean absolute correlation with the other features.
    - **Strengths:** Applying correlation thresholds is also based on solid intuition: similar features provide redundant information. Some algorithms are not robust to correlated features, so removing them can boost performance.
    - **Weaknesses:** You must manually set or tune a correlation threshold, which can be tricky to do. Plus, if you set your threshold too low, you risk dropping useful information. Whenever possible, we prefer algorithms with built-in feature selection over correlation thresholds. Even for algorithms without built-in feature selection, Principal Component Analysis (PCA) is often a better alternative.
- **Genetic Algorithms (GA)**
    - Genetic algorithms are search algorithms inspired by evolutionary biology and natural selection. They combine mutation and cross-over to efficiently traverse large solution spaces.
    - GA's have two main uses. 
        - The first is for optimization, such as finding the best weights for a neural network.
        - The second is for supervised feature selection. In this use case, "genes" represent individual features and the "organism" represents a candidate set of features. Each organism in the "population" is graded on a fitness score such as model performance on a hold-out set. The fittest organisms survive and reproduce, repeating until the population converges on a solution some generations later.
    - **Strengths:** Genetic algorithms can efficiently select features from very high dimensional datasets, where exhaustive search is unfeasible. When you need to preprocess data for an algorithm that doesn't have built-in feature selection (e.g. nearest neighbors) and when you must preserve the original features (i.e. no PCA allowed), GA's are likely your best bet. These situations can arise in business/client settings that require a transparent and interpretable solution.
    - **Weaknesses:** GA's add a higher level of complexity to your implementation, and they aren't worth the hassle in most cases. If possible, it's faster and simpler to use PCA or to directly use an algorithm with built-in feature selection.
- **Honorable Mention: Stepwise Search**
    - Stepwise search is a supervised feature selection method based on sequential search, and it has two flavors: forward and backward. For forward stepwise search, you start without any features. Then, you'd train a 1-feature model using each of your candidate features and keep the version with the best performance. You'd continue adding features, one at a time, until your performance improvements stall.
    - Backward stepwise search is the same process, just reversed: start with all features in your model and then remove one at a time until performance starts to drop substantially.
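    - Stepwise search rarely does well in practice. It is a greedy method, so it cannot account for how each added or removed feature interacts with later changes.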
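
The variance-threshold and correlation-threshold procedures described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the min-max scaling choice, the toy data, and the threshold values (0.02 and 0.9) are all assumptions made for the example.

```python
import numpy as np

def variance_filter(X, threshold):
    """Return indices of columns whose scaled variance exceeds `threshold`."""
    # Min-max scale each column to [0, 1] so variances are comparable (assumed choice).
    col_min = X.min(axis=0)
    span = X.max(axis=0) - col_min
    span[span == 0] = 1.0                  # constant columns keep a variance of 0
    scaled = (X - col_min) / span
    return np.flatnonzero(scaled.var(axis=0) > threshold)

def correlation_filter(X, threshold):
    """Return indices of columns kept after dropping highly correlated ones.

    Assumes no column is constant (run the variance filter first).
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)            # ignore each feature's self-correlation
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        sub = corr[np.ix_(keep, keep)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        if sub[i, j] <= threshold:
            break                          # no remaining pair exceeds the threshold
        # Of the offending pair, drop the feature with the larger
        # mean absolute correlation with the remaining features.
        mean_abs = sub.mean(axis=1)
        keep.pop(i if mean_abs[i] >= mean_abs[j] else j)
    return keep

# Toy data: two independent features, a near-duplicate, and a nearly constant flag.
rng = np.random.default_rng(0)
n = 200
a = rng.random(n)
b = rng.random(n)
near_copy = a + 0.01 * rng.normal(size=n)  # almost identical to `a`
rare_flag = np.zeros(n)
rare_flag[:2] = 1.0                        # 1% of rows differ
X = np.column_stack([a, near_copy, b, rare_flag])

informative = variance_filter(X, threshold=0.02)
kept = informative[correlation_filter(X[:, informative], threshold=0.9)]
print("columns kept:", kept)
```

The variance filter removes the nearly constant flag, and the correlation filter then keeps only one of the two near-identical columns. Both thresholds would need tuning on real data.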

## Feature Extraction
- Feature extraction is for creating a new, smaller set of features that still captures most of the useful information.


- **Principal Component Analysis (PCA)**
    - Principal component analysis is an unsupervised algorithm that creates linear combinations of the original features. The new features are orthogonal, which means that they are uncorrelated. They are ranked in order of their "explained variance." The first principal component explains the most variance in your dataset, PC2 explains the second-most variance, and so on.
    - **Strengths:** PCA is a versatile technique that works well in practice. It's fast and simple to implement, which means you can easily test algorithms with and without PCA to compare performance. In addition, PCA offers several variations and extensions (e.g. kernel PCA and sparse PCA) to tackle specific roadblocks.
    - **Weaknesses:** The new principal components are not interpretable, which may be a deal-breaker in some settings. In addition, you must still manually set or tune a threshold for cumulative explained variance.
- **Linear Discriminant Analysis (LDA)**
    - LDA (not to be confused with latent Dirichlet allocation) also creates linear combinations of your original features. However, unlike PCA, LDA doesn't maximize explained variance. Instead, it maximizes the separability between classes.
    - LDA is a supervised method that can only be used with labeled data.
    - **Strengths:** LDA is supervised, which can (but doesn't always) improve the predictive performance of the extracted features. Furthermore, LDA offers variations (e.g. quadratic discriminant analysis) to tackle specific roadblocks.
    - **Weaknesses:** As with PCA, the new features are not easily interpretable, and you must still manually set or tune the number of components to keep. LDA also requires labeled data, which makes it more situational.
- **Autoencoders**
    - Autoencoders are neural networks that are trained to reconstruct their original inputs. 
    - For example, image autoencoders are trained to reproduce the original images instead of classifying the image as a dog or a cat.
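    - The key is to structure the hidden layer to have fewer neurons than the input/output layers. That bottleneck forces the network to learn a smaller representation of the input, which serves as the extracted features.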
    - **Strengths:** Autoencoders are neural networks, which means they perform well for certain types of data, such as image and audio data.
    - **Weaknesses:** Autoencoders are neural networks, which means they require more data to train. They are not used as general-purpose dimensionality reduction algorithms.
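
To make the PCA description above concrete, here is a minimal sketch that computes principal components with an SVD and keeps just enough of them to reach a cumulative explained-variance target. The function name `pca`, the 0.95 target, and the synthetic data are assumptions for illustration. The sketch also assumes the features are already on comparable scales.

```python
import numpy as np

def pca(X, variance_to_keep=0.95):
    """Project X onto the fewest principal components that reach the variance target."""
    Xc = X - X.mean(axis=0)                          # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)                  # explained-variance ratio per component
    # Smallest k whose cumulative explained variance reaches the target.
    k = int(np.searchsorted(np.cumsum(explained), variance_to_keep)) + 1
    k = min(k, len(S))
    return Xc @ Vt[:k].T, explained[:k]

# Synthetic data: 5 observed features driven by 2 hidden factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mix = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 1.0, 1.0]])
X = latent @ mix + 0.01 * rng.normal(size=(300, 5))

Z, ratios = pca(X, variance_to_keep=0.95)
print("components kept:", Z.shape[1])
print("explained variance ratios:", np.round(ratios, 3))
```

The returned columns are ordered by explained variance and are uncorrelated with each other, matching the description of PC1, PC2, and so on. The variance target is the threshold you would still need to set or tune.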