## What is Dimensionality Reduction?

Dimensionality reduction refers to techniques for reducing the number of input variables in training data.

Fewer input dimensions often mean correspondingly fewer parameters or a simpler structure in the machine learning model, referred to as degrees of freedom. A model with too many degrees of freedom is likely to overfit the training dataset and therefore may not perform well on new data.

Dimensionality reduction is a data preparation technique performed on data prior to modeling. It might be performed after data cleaning and data scaling and before training a predictive model. As such, the transformation fitted on the training data must also be applied, unchanged, to any new data, such as a test dataset, a validation dataset, and the data used when making a prediction with the final model.
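As a minimal sketch of that last point, the snippet below fits a scaler and a reducer on a training split and then reuses the fitted objects on held-out data. The synthetic data, the choice of PCA as the reducer, and the number of components are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 200 samples, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Fit the scaler and the reducer on the training split only.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=3).fit(scaler.transform(X_train))

# Reuse the already-fitted objects on new data; do not refit them.
X_train_reduced = pca.transform(scaler.transform(X_train))
X_test_reduced = pca.transform(scaler.transform(X_test))

print(X_train_reduced.shape, X_test_reduced.shape)
```

Refitting on the test split would give it a different projection from the one the model was trained on, which is why only `transform` is called on the new data.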

## Why do we need it?

1. The space required to store the data shrinks as the number of dimensions comes down
2. Fewer dimensions lead to less computation/training time
3. Some algorithms do not perform well when the data has a large number of dimensions, so the dimensions need to be reduced for the algorithm to be useful
4. It takes care of multicollinearity by removing redundant features. For example, suppose you have two variables: ‘time spent on treadmill in minutes’ and ‘calories burnt’. These variables are highly correlated, since the more time you spend running on a treadmill, the more calories you burn. Hence, there is little point in storing both, as one of them carries most of the information
5. It helps in visualizing data. It is very difficult to visualize data in higher dimensions, so reducing the space to 2D or 3D lets us plot the data and observe patterns more clearly

## Common Dimensionality Reduction Techniques

Dimensionality reduction can be done in two different ways:

1. By only keeping the most relevant variables from the original dataset (this technique is called <b>feature selection</b>)
2. By finding a smaller set of new variables, each being a combination of the input variables, that contains essentially the same information as the input variables (this technique is often called <b>feature extraction</b>, or dimensionality reduction in the narrower sense)

### Summary:

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2018/08/Screenshot-from-2018-08-10-12-07-43.png">

- Missing Value Ratio: If the dataset has too many missing values, we use this approach to reduce the number of variables. We can drop the variables having a large number of missing values in them
- Low Variance filter: We apply this approach to identify and drop constant or near-constant variables from the dataset. A variable that barely varies carries little information about the target variable, so it can usually be dropped
- High Correlation filter: A pair of variables having high correlation increases multicollinearity in the dataset. So, we can use this technique to find highly correlated pairs and drop one variable from each pair
- Random Forest: This is one of the most commonly used techniques, and it tells us the importance of each feature present in the dataset. We can rank the features by importance and keep the top-ranked ones, resulting in dimensionality reduction
- Backward Feature Elimination and Forward Feature Selection: Both techniques take a lot of computational time and are thus generally used on smaller datasets
- Principal Component Analysis: This is one of the most widely used techniques for data with linear structure. It transforms the data into a set of components, each of which tries to explain as much of the remaining variance as possible
- ISOMAP: We use this technique when the data is strongly non-linear
- t-SNE: This technique also works well when the data is strongly non-linear. It works extremely well for visualizations as well
- UMAP: This technique works well for high dimensional data. Its run-time is shorter as compared to t-SNE
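The high correlation filter from the list above can be sketched as follows. This is only an illustration: the column names, the synthetic data, and the 0.9 threshold are assumptions, and which variable of a correlated pair to drop is a judgment call (here, the later column is dropped).

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption): calories are almost a linear function of minutes.
rng = np.random.default_rng(0)
minutes = rng.uniform(10, 60, size=100)
df = pd.DataFrame({
    "treadmill_minutes": minutes,
    "calories_burnt": minutes * 10 + rng.normal(0, 5, size=100),
    "resting_heart_rate": rng.normal(65, 8, size=100),
})

threshold = 0.9  # illustrative cut-off for "highly correlated"
corr = df.corr().abs()

# Keep only the upper triangle so that each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop every column that is highly correlated with an earlier column.
columns_to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=columns_to_drop)

print(columns_to_drop)
```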

### Missing Value Ratio:

Suppose that while exploring the data, you find that your dataset has some missing values. If a variable has only a few missing values, we can impute them. But what if it has too many (say more than 50%)? Should we impute the missing values or drop the variable? One approach is to set a threshold: if the percentage of missing values in any variable is more than that threshold, we drop the variable.
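A minimal sketch of this thresholding with pandas is shown below. The toy DataFrame, its column names, and the 50% threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption): "income" is missing in 4 of 5 rows.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [np.nan, np.nan, np.nan, 52000, np.nan],
    "score": [0.7, 0.4, 0.9, np.nan, 0.5],
})

threshold = 0.5  # drop variables with more than 50% missing values

# Fraction of missing values per column.
missing_ratio = df.isna().mean()

columns_to_drop = missing_ratio[missing_ratio > threshold].index.tolist()
df_reduced = df.drop(columns=columns_to_drop)

print(missing_ratio)
print(columns_to_drop)
```

The remaining columns still contain a few missing values, which would then be imputed in a separate step.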

### Low Variance Filter:

Consider a variable in our dataset where all the observations have the same value, say 1. If we use this variable, do you think it can improve the model we will build? The answer is no, because this variable has zero variance and therefore cannot help distinguish one observation from another.

So, we calculate the variance of each variable and drop the variables whose variance is low compared to the other variables in the dataset. The premise is that a feature that barely varies has very little predictive power. Because variance depends on the scale of a variable, the variables should be brought to a comparable scale before their variances are compared.
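As a rough sketch, scikit-learn's `VarianceThreshold` can apply this filter. The toy columns and the threshold of 0.2 are assumptions for illustration, and the columns are assumed to be on comparable scales already.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data (an assumption): one constant, one nearly constant, one varied column.
df = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1],
    "almost_constant": [0, 0, 0, 0, 1],
    "varied": [3, 7, 1, 9, 5],
})

# Keep only the features whose variance exceeds the (illustrative) threshold.
selector = VarianceThreshold(threshold=0.2)
selector.fit(df)

kept = df.columns[selector.get_support()].tolist()
df_reduced = df[kept]

print(dict(zip(df.columns, selector.variances_)))
print(kept)
```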

### Random Forest:

Random Forest is one of the most widely used algorithms for feature selection. It comes packaged with in-built feature importance so you don’t need to program that separately. This helps us select a smaller subset of features.

Random Forest (Scikit-Learn implementation) takes only numeric inputs, so we first need to convert the data into numeric form, for example by applying one-hot encoding to categorical variables.

Once the model is fitted, we can rank the features by importance and keep the top ones manually. Alternatively, we can use [SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html) from sklearn. It keeps the features whose importance, as reported by the fitted model, is above a threshold.
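The following sketch puts these pieces together: one-hot encoding, a random forest, and `SelectFromModel`. The synthetic DataFrame, its column names, and the `"mean"` importance threshold are assumptions for illustration, not a recommended configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (an assumption): only "signal" is related to the target.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
    "color": rng.choice(["red", "green", "blue"], size=n),
})
y = (df["signal"] > 0).astype(int)

# One-hot encode the categorical column so every input is numeric.
X = pd.get_dummies(df, columns=["color"], dtype=float)

# Fit the forest inside SelectFromModel and keep features above the mean importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0), threshold="mean"
)
selector.fit(X, y)

importances = pd.Series(
    selector.estimator_.feature_importances_, index=X.columns
).sort_values(ascending=False)
selected = X.columns[selector.get_support()].tolist()

print(importances)
print(selected)
```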

### Backward Feature Elimination:

1. We first take all the n variables present in our dataset and train the model using them
2. We then calculate the performance of the model
3. Now, we compute the performance of the model after eliminating each variable (n times), i.e., we drop one variable every time and train the model on the remaining n-1 variables
4. We identify the variable whose removal has produced the smallest (or no) change in the performance of the model, and then drop that variable
5. Repeat this process until no variable can be dropped without a noticeable loss in performance
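The steps above can be sketched as a small loop. Everything specific here is an assumption made for illustration: the synthetic data, the use of 5-fold cross-validated R² as the performance measure, the `tolerance` that decides what counts as a noticeable loss, and the helper names `cv_score` and `backward_elimination`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption): only columns 0 and 1 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)


def cv_score(columns):
    """Mean 5-fold cross-validated R^2 of a linear model on the given columns."""
    return cross_val_score(LinearRegression(), X[:, columns], y, cv=5).mean()


def backward_elimination(n_features, tolerance=0.001):
    remaining = list(range(n_features))
    best_score = cv_score(remaining)
    while len(remaining) > 1:
        # Score the model with each remaining variable removed in turn.
        candidates = [
            (cv_score([c for c in remaining if c != col]), col) for col in remaining
        ]
        score, col = max(candidates)
        # Stop when even the least harmful removal costs more than the tolerance.
        if score < best_score - tolerance:
            break
        remaining.remove(col)
        best_score = max(best_score, score)
    return remaining


selected = backward_elimination(X.shape[1])
print(selected)
```

Each pass retrains the model once per remaining variable, which is where the computational cost of this method comes from.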

<b>This method can be used when building Linear Regression or Logistic Regression models</b>

### Forward Feature Selection:

This is the opposite process of the Backward Feature Elimination we saw above. Instead of eliminating features, we try to find the best features which improve the performance of the model. This technique works as follows:

1. We start with a single feature. Essentially, we train the model n number of times using each feature separately
2. The variable giving the best performance is selected as the starting variable
3. Then we repeat this process and add one variable at a time. The variable that produces the highest increase in performance is retained
4. We repeat this process until no significant improvement is seen in the model’s performance
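One way to try this out is scikit-learn's `SequentialFeatureSelector`, which performs a greedy forward search based on cross-validated scores. Note that in this sketch the search stops at a fixed number of features (`n_features_to_select=2`) rather than at the "no significant improvement" rule from step 4; that simplification, the synthetic data, and the linear model are all assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption): only columns 0 and 1 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Greedy forward search: start from no features and add the best one each round.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
selector.fit(X, y)

selected = np.flatnonzero(selector.get_support()).tolist()
X_selected = selector.transform(X)

print(selected)
```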

### NOTE :

Both Backward Feature Elimination and Forward Feature Selection are time consuming and computationally expensive. In practice, they are only used on datasets that have a small number of input variables. More generally, the techniques we have seen so far are used when the dataset does not have a very large number of variables, and they are more or less feature selection techniques.

### Principal Component Analysis (PCA):

Source: 

- https://builtin.com/data-science/step-step-explanation-principal-component-analysis (Explains it really well)
- https://youtu.be/FgakZw6K1QQ - StatsQuest explanation
- https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c
- https://youtu.be/9oSkUej63yk -  till 7:29 explains math behind PCA

Key Components:

- A principal component is a linear combination of the original variables
- Principal components are extracted in such a way that the first principal component explains the maximum variance in the dataset
- The second principal component explains as much of the remaining variance as possible and is uncorrelated with the first principal component
- The third principal component explains as much as possible of the variance not explained by the first two principal components, and so on
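These properties can be checked on a small example with scikit-learn's `PCA`. The synthetic data (five observed features built from two latent factors) and the number of components are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): 5 features that are linear mixes of 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

# Standardize first so that no feature dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3).fit(X_scaled)
scores = pca.transform(X_scaled)

# Each row of components_ holds the weights of one linear combination.
print(pca.components_.shape)

# Share of variance explained by each component, in decreasing order.
ratios = pca.explained_variance_ratio_
print(ratios)

# Correlations between component scores; off-diagonal entries are ~0.
corr = np.corrcoef(scores, rowvar=False)
print(np.round(corr, 3))
```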

### Non-linear Dimensionality Reduction Methods:

Non-linear transformation methods, or manifold learning methods, are used when the data doesn’t lie on a linear subspace. They are based on the manifold hypothesis, which says that in high-dimensional data, most of the relevant information is concentrated in a small number of low-dimensional manifolds. If a linear subspace is a flat sheet of paper, then a rolled-up sheet of paper is a simple example of a nonlinear manifold. Informally, this is called a Swiss roll, a canonical problem in the field of non-linear dimensionality reduction.

Some popular manifold learning methods are:

1. <b>Multi-dimensional scaling (MDS)</b>: A technique for analyzing the similarity or dissimilarity of data as distances in a geometric space. It projects data to a lower dimension such that data points that are close to each other (in terms of Euclidean distance) in the higher dimension are close in the lower dimension as well.

2. <b>Isometric Feature Mapping (Isomap)</b>: Projects data to a lower dimension while preserving the geodesic distance (rather than the Euclidean distance, as in MDS). Geodesic distance is the shortest distance between two points measured along the manifold, approximated by shortest paths in a nearest-neighbor graph. It is computationally expensive.

3. <b>t-distributed Stochastic Neighbor Embedding (t-SNE)</b>: Converts pairwise distances in the high-dimensional space into probabilities that two data points are neighbors, and then chooses a low-dimensional embedding which produces a similar distribution.

4. <b>Uniform Manifold Approximation and Projection (UMAP)</b>: t-SNE works very well for visualization, but it also has its limitations, such as loss of large-scale information, slow computation time, and difficulty scaling to very large datasets. UMAP is a dimension reduction technique that can preserve as much of the local, and more of the global, data structure as compared to t-SNE, with a shorter runtime. This method uses the concept of k-nearest neighbors and optimizes the results using stochastic gradient descent. It first calculates the distances between the points in the high-dimensional space, projects them onto the low-dimensional space, and calculates the distances between the points in this low-dimensional space. It then uses stochastic gradient descent to minimize the difference between these distances.
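As a rough illustration, the first three methods can be run on the Swiss roll with scikit-learn. The sample size and the hyperparameters (`n_neighbors`, `perplexity`) are assumptions chosen only to keep the example small. UMAP is left out of the code because it is not part of scikit-learn; it lives in the separate `umap-learn` package.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS, TSNE, Isomap

# A 3D Swiss roll; `color` encodes the position of each point along the roll.
X, color = make_swiss_roll(n_samples=300, random_state=0)

# Embed the same data into 2D with three different methods.
embeddings = {
    "MDS": MDS(n_components=2, random_state=0).fit_transform(X),
    "Isomap": Isomap(n_neighbors=10, n_components=2).fit_transform(X),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X),
}

for name, embedding in embeddings.items():
    print(name, embedding.shape)
```

Plotting each embedding as a scatter plot colored by `color` is a common way to see whether the roll has been unrolled.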