# Dimensionality Reduction in Python

High-dimensional datasets can be overwhelming and leave you not knowing where to start. Typically, you’d explore a new dataset visually first, but with too many dimensions the classical approaches fall short. Fortunately, there are visualization techniques designed specifically for high-dimensional data. After exploring the data, you’ll often find that many features hold little information, either because they show no variance or because they duplicate other features; you can detect these features and drop them so that you can focus on the informative ones. Next, you might build a model on the remaining features and find that some have no effect on the target you’re trying to predict. You’ll learn how to detect and drop these irrelevant features too, reducing dimensionality and thus complexity. Finally, you’ll learn how feature extraction techniques can reduce dimensionality for you by calculating uncorrelated principal components.

### Exploring High Dimensional Data
Learn the difference between feature selection and feature extraction, and apply both techniques for data exploration.
* **Dimensionality:** the number of columns in your dataset (assuming that you have a tidy dataset)
* **Tidy dataset:** every column represents a variable or feature, and every row represents one observation (an instance with a value for each variable).
* **High-dimensional:** a dataset with many columns, or features; high dimensionality usually indicates complexity.
* **Note:** by default, `.describe()` ignores the non-numeric columns in a dataset. To summarize the non-numeric columns instead, pass the argument `exclude='number'`, as in `df.describe(exclude='number')`; the summary statistics are then adapted to non-numeric data.
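
As a rough illustration of the note above, the sketch below checks the dimensionality of a small DataFrame and compares the two `.describe()` calls. The DataFrame and its column names are made up for this example and are not part of any real dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "height_cm": [170.0, 182.5, 165.2, 177.8],
    "weight_kg": [65.0, 80.3, 58.9, 72.1],
    "gender": ["F", "M", "F", "M"],
})

# shape is (rows, columns); the number of columns is the dimensionality.
print(df.shape)

# Default behaviour: only the numeric columns are summarized.
numeric_summary = df.describe()
print(numeric_summary)

# Excluding numeric columns summarizes the non-numeric ones instead
# (count, unique, top, freq).
non_numeric_summary = df.describe(exclude="number")
print(non_numeric_summary)
```

Comparing the two summaries side by side is a quick way to see which columns are numeric and which are categorical before deciding what to drop.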

* Becoming familiar with the shape of your dataset and the properties of its features is a crucial step to take before you decide to reduce dimensionality

#### Methods for reducing dimensionality:
* Drop columns with little to no variance (a feature that barely changes cannot help you determine differences among observations in a dataset)
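
One possible way to drop low-variance columns with pandas is sketched below. The toy data, the column names, and the `threshold` value are assumptions chosen for illustration; a sensible threshold depends on the scale of each feature, so in practice you may want to normalize the features before comparing variances.

```python
import pandas as pd

# Hypothetical toy data; "n_legs" is constant across all observations.
df = pd.DataFrame({
    "height_cm": [170.0, 182.5, 165.2, 177.8],
    "weight_kg": [65.0, 80.3, 58.9, 72.1],
    "n_legs": [2, 2, 2, 2],
})

# Assumed cut-off: 0.0 drops only columns that never vary.
threshold = 0.0

# Variance of each numeric column.
variances = df.var()

# Columns whose variance is at or below the threshold.
low_variance_cols = variances[variances <= threshold].index.tolist()

# Keep everything else.
reduced_df = df.drop(columns=low_variance_cols)
print(low_variance_cols)
print(reduced_df.columns.tolist())
```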

#### Feature selection vs Feature Extraction
* Reducing the number of dimensions in your dataset has multiple benefits. Your dataset will:
    * be less complex
    * require less disk space
    * require less computation time
    * carry a lower chance of model overfitting
    
* The simplest way to reduce dimensionality is to only select the features or columns that are important to you from a larger dataset
    * If you're new to a dataset or have little background knowledge of a dataset topic, you'll likely have to do some exploring to determine which features are both relevant and useful.
    * Seaborn's **pairplot** is excellent for visually exploring small to medium-sized datasets
    
```
import seaborn as sns
sns.pairplot(ansur_df, hue='gender', diag_kind='hist')  # ansur_df: a DataFrame loaded beforehand
```
#### Pairplots
* **sns pairplot** provides a pairwise comparison of the numeric features in the dataset, with one scatterplot per pair of features. On the diagonal, it shows the distribution of each feature (for example, as a histogram, as specified in the above code).
    * Pairplots make it very easy to visually spot duplicated features (such as a weight column recorded in two different units, kilograms and pounds), as well as unvarying features (such as a constant); both of these types of columns can typically be dropped for dimensionality reduction

* Always try to minimize information loss by only removing features that are irrelevant or hold little unique information (if possible)
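
Duplicated features can also be flagged numerically rather than by eye. The sketch below uses a correlation matrix, which is a different technique from the pairplot described above; the toy data, the column names, and the 0.99 cut-off are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; weight is recorded in two different units.
df = pd.DataFrame({
    "weight_kg": [65.0, 80.3, 58.9, 72.1],
    "height_cm": [170.0, 182.5, 165.2, 160.0],
})
df["weight_lbs"] = df["weight_kg"] * 2.20462

# Absolute pairwise correlations between the numeric columns.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is considered once
# and a column is never compared with itself.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# Assumed cut-off: flag a column if it is almost perfectly
# correlated with any column that appears before it.
to_drop = [col for col in upper.columns if (upper[col] > 0.99).any()]

deduplicated_df = df.drop(columns=to_drop)
print(to_drop)
```

Only the later column of each near-duplicate pair is flagged, so one copy of the information is always kept.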

#### Feature extraction
* Compared to feature selection, **feature extraction** is a completely different approach but with the same goal of reducing dimensionality
* Instead of selecting a subset of features from our initial dataset, we calculate or extract new features from the original ones (for example: PCA).
    * These new features have as little redundant information as possible and are therefore fewer in number
    * One downside: the newly created features are often less intuitive to understand than the original ones
* The dimensionality of datasets with many strong correlations between features can be reduced substantially with feature extraction
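
A minimal sketch of feature extraction with PCA, assuming scikit-learn is available. The synthetic data (three noisy copies of one underlying signal), the scaling step, and the choice of a single component are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three strongly correlated features, each a noisy
# copy of the same underlying signal.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.05 * rng.normal(size=(200, 1)) for _ in range(3)])

# Standardize so every feature contributes on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Extract a single new feature (principal component) from the three.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # three columns reduced to one
print(pca.explained_variance_ratio_)  # share of variance kept by the component
```

Because the three original features carry nearly the same information, one component retains most of the variance. The trade-off is the one noted above: the component is a weighted mix of the originals and is harder to interpret than any single input column.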