# Dimensionality Reduction in Python

High-dimensional datasets can be overwhelming and leave you not knowing where to start. Typically, you'd explore a new dataset visually first, but with too many dimensions the classical approaches fall short. Fortunately, there are visualization techniques designed specifically for high-dimensional data. After exploring the data, you'll often find that many features hold little information because they show no variance or because they duplicate other features; you can detect these features and drop them from the dataset to focus on the informative ones. Next, you might build a model on the remaining features and find that some have no effect on the target you're trying to predict. You'll learn how to detect and drop these irrelevant features too, reducing dimensionality and thus complexity. Finally, you'll learn how feature extraction techniques can reduce dimensionality through the calculation of uncorrelated principal components.

### Exploring High Dimensional Data
Learn the difference between feature selection and feature extraction, and apply both techniques for data exploration.
* **Dimensionality:** the number of columns in your dataset (assuming that you have a tidy dataset)
* **Tidy data set:** Every column represents a variable or feature and every row represents an observation or instance of each variable.
* **High-dimensional:** When you have many columns, or features, in your dataset; high-dimensionality indicates complexity.
* **Note:** by default, `.describe()` ignores the non-numeric columns in a dataset. To summarize the non-numeric columns instead, pass the argument `exclude='number'`, as in `df.describe(exclude='number')`; the summary statistics are then adapted to non-numeric data

* Becoming familiar with the shape of your dataset and the properties of its features is a crucial step to take before you decide to reduce dimensionality
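
The `.describe()` behavior noted above can be tried on a small table. This is a minimal sketch; `toy_df` and its columns are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and one non-numeric column
toy_df = pd.DataFrame({
    "height_cm": [170.0, 182.5, 165.0, 177.0],
    "weight_kg": [65.0, 80.0, 58.5, 72.0],
    "gender": ["Female", "Male", "Female", "Male"],
})

# Default behavior: only the numeric columns are summarized
numeric_summary = toy_df.describe()
print(numeric_summary.columns.tolist())

# exclude='number': only the non-numeric columns are summarized,
# with count / unique / top / freq instead of mean, std, quartiles
categorical_summary = toy_df.describe(exclude="number")
print(categorical_summary)
```

Checking `toy_df.shape` alongside these two summaries gives a quick first impression of dimensionality and feature types.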

#### Methods for reducing dimensionality:
* Drop columns with little to no variance (when you are looking to determine differences among observations in a dataset)

#### Feature selection vs Feature Extraction
* Reducing the number of dimensions in your dataset has multiple benefits. Your dataset will:
    * be less complex
    * require less disk space
    * require less computation time
    * have a lower chance of model overfitting
    
* The simplest way to reduce dimensionality is to only select the features or columns that are important to you from a larger dataset
    * If you're new to a dataset or have little background knowledge of a dataset topic, you'll likely have to do some exploring to determine which features are both relevant and useful.
    * Seaborn's **pairplot** is excellent to visually explore small to medium sized datasets
    
```
import seaborn as sns

sns.pairplot(ansur_df, hue='gender', diag_kind='hist')
```
#### Pairplots
* **sns pairplot** provides a pairwise comparison of the numeric features in the dataset in the form of scatterplots, plus, on the diagonal, a view of the distribution of each feature (for example a histogram, as specified in the code above).
    * Pairplots make it very easy to visually spot duplicated features (such as a weight column in different units, kilograms and pounds), as well as unvarying features (such as a constant); both of these types of columns can typically be dropped for dimensionality reduction

* Always try to minimize information loss by only removing features that are irrelevant or hold little unique information (if possible)

#### Feature extraction
* Compared to feature selection, **feature extraction** is a completely different approach but with the same goal of reducing dimensionality
* Instead of selecting a subset of features from our initial dataset, we calculate or extract new features from the original ones (for example: PCA).
    * These new features have as little redundant information as possible and are therefore fewer in number
    * One downside: the newly created features are often less intuitive to understand than the original ones
* The dimensionality of datasets with many strong correlations between features can be reduced a lot with feature extraction
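
As a rough illustration of feature extraction on strongly correlated features, the sketch below applies scikit-learn's `PCA` to two made-up columns that measure the same thing in different units. The data, the variable names, and the manual standardization step are assumptions for this example only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: the same weight measured in kilograms and in pounds
weight_kg = 70 + 10 * rng.normal(size=200)
weight_lbs = weight_kg * 2.20462 + rng.normal(scale=0.1, size=200)
X = np.column_stack([weight_kg, weight_lbs])

# Standardize so both columns are on a comparable scale
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Extract a single new feature from the two original ones
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```

Because the two columns are almost perfectly correlated, one extracted component should retain nearly all of the variance, at the cost of being less directly interpretable than kilograms or pounds.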

### t-SNE visualization of high-dimensional data
* **t-SNE** = **t-Distributed Stochastic Neighbor Embedding**
* A powerful technique to visualize high-dimensional data using feature extraction
* t-SNE will maximize the distance in 2-D space between observations that are most different in high-dimensional space
    * Because of this, observations that are similar will end up close to one another and may form clusters
    * **t-SNE doesn't work with non-numeric data** (though you can always use one-hot-encoding to get around this if necessary).
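
The one-hot-encoding workaround mentioned above could look like the following. This is a sketch on made-up data; `toy_df` is hypothetical, and `df_numeric` simply mirrors the name used in the t-SNE code in this section.

```python
import pandas as pd

# Hypothetical data with one non-numeric column
toy_df = pd.DataFrame({
    "height_cm": [170.0, 182.5, 165.0, 177.0],
    "gender": ["Female", "Male", "Female", "Male"],
})

# Replace the 'gender' column by one 0/1 indicator column per category
df_numeric = pd.get_dummies(toy_df, columns=["gender"], dtype=int)

print(df_numeric.columns.tolist())
print(df_numeric.dtypes)
```

An alternative is to drop the non-numeric columns before fitting and use them only for coloring the resulting plot.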

```
from sklearn.manifold import TSNE

m = TSNE(learning_rate=50)
tsne_features = m.fit_transform(df_numeric)
tsne_features[1:4,:]

df['x'] = tsne_features[:, 0]
df['y'] = tsne_features[:, 1]

```
* `.fit_transform()` will project our high-dimensional dataset onto a NumPy array with two dimensions
* Assign these two dimensions back to our original dataset, naming them `'x'` and `'y'`.

* **High learning rates** will cause the algorithm to be **more adventurous** in the configurations it tries out
* **Low learning rates** will cause the algorithm to be **conservative**.
* Usually, learning rates fall in the 10 to 1000 range

* Plot t-SNE using seaborn's scatterplot:

```
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x='x', y='y', data=df)
plt.show()
```

```
import matplotlib.pyplot as plt
sns.scatterplot(x = 'x', y = 'y', hue = 'BMI_class', data = df)
plt.show()
```

* t-SNE helps us to visually explore our dataset and identify the most important drivers of variance in body shapes.

### Feature selection I, selecting for feature information
#### The curse of dimensionality

* Models tend to overfit badly on high-dimensional data
* How can we detect low-quality features, and how can we remove them?
* With each feature you add to a dataset, you should also be ready to increase the number of observations in your dataset
    * If you don't, you'll end up with a lot of unique combinations of features that models can easily memorize and thus overfit to
* The solution to the curse of dimensionality is to apply dimensionality reduction.
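
The point about unique feature combinations can be made concrete with a small simulation. This is only a sketch on randomly generated binary features (an assumption for illustration): it counts how many distinct rows exist when only the first few columns are considered.

```python
import numpy as np

rng = np.random.default_rng(0)
n_observations = 200

# Hypothetical data: 20 binary features for 200 observations
X = rng.integers(0, 2, size=(n_observations, 20))

unique_counts = []
for n_features in (2, 5, 10, 20):
    # Number of distinct feature combinations among the first n_features columns
    n_unique = len(np.unique(X[:, :n_features], axis=0))
    unique_counts.append(n_unique)
    print(f"{n_features:>2} features: {n_unique} unique combinations in {n_observations} rows")
```

As more features are included, almost every observation ends up with its own combination, which is exactly the situation a model can memorize instead of generalizing from.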

```
# Import SVC from sklearn.svm and accuracy_score from sklearn.metrics
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Create an instance of the Support Vector Classification class
svc = SVC()

# Fit the model to the training data
svc.fit(X_train, y_train)

# Calculate accuracy scores on both train and test data
accuracy_train = accuracy_score(y_train, svc.predict(X_train))
accuracy_test = accuracy_score(y_test, svc.predict(X_test))

print("{0:.1%} accuracy on test set vs. {1:.1%} on training set".format(accuracy_test, accuracy_train))
```

#### Features with missing values or little variance
* Automate the selection of features that have sufficient variance and not too many missing values 
* Creating a feature selector:

#### VarianceThreshold()
* built-in sklearn feature selection tool

```
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=1)
sel.fit(ansur_df)

mask = sel.get_support()
print(mask)

reduced_df = ansur_df.loc[:, mask]
print(reduced_df.shape)
```
* The `.get_support()` method returns a `True` or `False` value for each feature, indicating whether its variance is above the threshold
* We call this type of Boolean array a mask, and we can use this mask to reduce the number of dimensions in our dataset

#### Variance selector caveats
* One problem with variance thresholds is that the variance values aren't always easy to interpret or compare between features
* A boxplot, for example `buttock_df.boxplot()`, helps to compare the spread of features visually
* If, for example, features with higher values have higher variances, we should normalize the features before using their variance for feature selection
    * To do so, divide each column by its mean value before fitting the selector
    * After normalization, the variances in the dataset will be lower, so we can lower the variance threshold (but make sure to inspect your data visually while setting this value)
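
To see why dividing by the mean makes variances comparable, consider this small sketch. The two columns are made up and deliberately have the same relative spread, just expressed in different units.

```python
import pandas as pd

# Hypothetical measurements with identical relative spread but different units
toy_df = pd.DataFrame({
    "length_mm": [1000.0, 1100.0, 900.0, 1050.0, 950.0],
    "width_cm": [10.0, 11.0, 9.0, 10.5, 9.5],
})

# Raw variances differ by orders of magnitude because of the units
raw_var = toy_df.var()
print(raw_var)

# After dividing each column by its mean, the variances are comparable
norm_var = (toy_df / toy_df.mean()).var()
print(norm_var)
```

With a raw threshold, `width_cm` would look far less informative than `length_mm`, even though both vary by the same proportion around their mean.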
    
```
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold = 0.005)
sel.fit(ansur_df / ansur_df.mean())
mask = sel.get_support()
reduced_df = ansur_df.loc[:, mask]
print(reduced_df.shape)
```

* Another reason to drop a feature is that it contains a lot of missing values.
* `.isna()` flags the missing values in a dataframe
* To get the ratio of missing values (between 0 and 1) for each column/feature in a dataframe:
* **`pokemon_df.isna().sum() / len(pokemon_df)`**
* Based on this ratio, we can create a mask for features that have fewer missing values than a certain threshold:

```
mask = pokemon_df.isna().sum() / len(pokemon_df) < 0.3
reduced_df = pokemon_df.loc[:, mask]
```
* When features have some missing values, but not *too* many, we could apply imputation to fill in the blanks
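
The two steps above, dropping features with too many missing values and imputing the rest, might be combined as in the sketch below. The toy columns are invented, and mean imputation via `fillna` is just one simple choice assumed here; the notes do not prescribe a particular imputation method.

```python
import numpy as np
import pandas as pd

# Hypothetical data with different amounts of missing values per column
toy_df = pd.DataFrame({
    "attack": [49.0, 62.0, np.nan, 100.0, 52.0],
    "defense": [49.0, np.nan, np.nan, np.nan, 43.0],
    "speed": [45.0, 60.0, 80.0, 100.0, 65.0],
})

# Ratio of missing values per column (between 0 and 1)
missing_ratio = toy_df.isna().sum() / len(toy_df)
print(missing_ratio)

# Keep only the columns with fewer than 30% missing values
mask = missing_ratio < 0.3
reduced_df = toy_df.loc[:, mask]

# Fill the remaining gaps with each column's mean (one possible imputation)
imputed_df = reduced_df.fillna(reduced_df.mean())
print(imputed_df)
```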

```
from sklearn.feature_selection import VarianceThreshold

# Create a VarianceThreshold feature selector
sel = VarianceThreshold(threshold=0.001)

# Fit the selector to normalized head_df
sel.fit(head_df/ head_df.mean())

# Create a boolean mask
mask = sel.get_support()

# Apply the mask to create a reduced dataframe
reduced_df = head_df.loc[:, mask]

print("Dimensionality reduced from {} to {}.".format(head_df.shape[1], reduced_df.shape[1]))
```

#### Pairwise correlation
* Look at how features relate to one another to decide if they are worth keeping.
* `sns.pairplot(ansur, hue = 'gender')` allows us to visually identify strongly correlated features; however, if we want to quantify the correlation between features, this method falls short
* To solve this, we use the **correlation coefficient** $\rho$
* The value of $\rho$ always lies between minus one and plus one;
    * -1 describes a perfectly negative correlation
    * 1 describes a perfectly positive correlation
    * 0 describes no correlation 
* Calculate **correlation matrix:**
    * `weights_df.corr()`
    * correlation matrix shows the correlation coefficient for each pairwise combination of features in the dataset
    * By definition, the diagonal in our correlation matrix shows a series of ones, telling us, not surprisingly, that each feature is perfectly correlated to itself.
* Visualizing the correlation matrix:

```
cmap = sns.diverging_palette(h_neg = 10,
                             h_pos = 240,
                             as_cmap = True)

sns.heatmap(weights_df.corr(), center = 0, cmap= cmap, linewidths= 1, annot= True, fmt= ".2f")
```
* We can improve this plot further by removing duplicate and unnecessary information like the correlation coefficients of 1 on the diagonal; to do so, we'll create a boolean mask.

```
import numpy as np

corr = weights_df.corr()

mask = np.triu(np.ones_like(corr, dtype=bool))
```
* NumPy's `ones_like()` function creates a matrix filled with True values (or 1's) with the same dimensions as our correlation matrix 
* We then pass this to NumPy's `triu()` ("triangle upper") function, to set all non-upper triangle values to False.
* When we pass this mask to the heatmap() function, it will ignore the upper triangle, allowing us to focus on the interesting part of the plot:
* `sns.heatmap(weights_df.corr(), mask=mask, center=0, cmap=cmap, linewidths=1, annot= True, fmt=".2f")`
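
To make the shape of this mask concrete, here is a small sketch on a made-up three-column dataframe. It prints the boolean mask and, using the pandas `mask()` method instead of a heatmap, shows which correlation values remain visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data with three numeric features
toy_df = pd.DataFrame({
    "weight_kg": [60.0, 72.0, 85.0, 90.0],
    "weight_lbs": [132.3, 158.7, 187.4, 198.4],
    "height_cm": [175.0, 168.0, 180.0, 172.0],
})

corr = toy_df.corr()

# True on the diagonal and above it, False below the diagonal
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# Positions where the mask is True are hidden (set to NaN)
lower_only = corr.mask(mask)
print(lower_only)
```

Only the three values below the diagonal survive, which is one value per pair of features.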


#### Removing highly correlated features 
* Features that are perfectly correlated with each other, with a correlation coefficient of 1 or -1, bring no new information to a dataset but do add to the complexity.
    * So, naturally, we'd want to drop one of the duplicated features that holds the same information
* In addition to this, we might want to drop features that have correlation coefficients close to 1 or -1 if they are measurements of the same or similar things
* If you are confident that dropping highly correlated features will not cause you to lose too much information, you can filter them out using a threshold value.

```
# Create a matrix of absolute correlation values
corr_df = chest_df.corr().abs()
# Create a mask for the upper triangle and apply it
mask = np.triu(np.ones_like(corr_df, dtype=bool))
tri_df = corr_df.mask(mask)
tri_df
```
* When we pass this mask to the pandas DataFrame `mask()` method, it will replace all positions in the dataframe where the mask has a True value with NA
* Then use a list comprehension to find all columns that have a correlation to any feature stronger than the threshold value.

```
# Find columns that meet threshold
to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.95)]
# Drop those columns 
reduced_df = chest_df.drop(to_drop, axis=1)
```
* The reason we use a mask to set half of the matrix to NA values is that we want to avoid removing *both* features when they have a strong correlation
* This method is a bit of a brute force approach that *should only be applied if you have a good understanding of the dataset.*
* **Note:** Correlation coefficients can produce misleading results when the relation between two features is non-linear or when outliers are involved.
* **Strong correlations do not imply causation.**
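
Both caveats about non-linear relations and outliers can be reproduced with a few made-up numbers. The arrays below are purely illustrative.

```python
import numpy as np

# Non-linear relation: y is fully determined by x, yet the
# linear correlation is (numerically) zero
x = np.arange(-5, 6, dtype=float)
y = x ** 2
r_nonlinear = np.corrcoef(x, y)[0, 1]
print(r_nonlinear)

# Outlier: a perfectly negative relation ...
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
r_clean = np.corrcoef(a, b)[0, 1]

# ... turns into a strongly positive one after adding a single extreme point
a_out = np.append(a, 100.0)
b_out = np.append(b, 100.0)
r_outlier = np.corrcoef(a_out, b_out)[0, 1]
print(r_clean, r_outlier)
```

This is why it helps to look at the pairplot or a scatterplot before dropping features on the basis of a correlation threshold alone.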

```
# Calculate the correlation matrix and take the absolute value
corr_matrix = ansur_df.corr().abs()

# Create a True/False mask and apply it
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
tri_df = corr_matrix.mask(mask)

# List column names of highly correlated features (r > 0.95)
to_drop = [c for c in tri_df.columns if any(tri_df[c] >  0.95)]

# Drop the features in the to_drop list
reduced_df = ansur_df.drop(to_drop, axis=1)

print("The reduced dataframe has {} columns.".format(reduced_df.shape[1]))
```

### Feature selection II, selecting for model accuracy
* Learn how to let models help you find the most important features in a dataset for predicting a particular target feature.

#### Selecting features for model performance
* A more pragmatic approach to dimension reduction is to select features based on how they affect model performance

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lr = LogisticRegression()
lr.fit(X_train_std, y_train)

X_test_std = scaler.transform(X_test)
y_pred = lr.predict(X_test_std)
print(accuracy_score(y_test, y_pred))

print(lr.coef_)
```
* Use the above method to check if any coefficient values are close to zero.
* Since these coefficients will be multiplied with the feature values when the model makes a prediction, features with coefficients close to zero will contribute little to the end result
* We can use the zip function to transform the output into a dictionary that shows which feature has which coefficient.
* `print(dict(zip(X.columns, abs(lr.coef_[0]))))`
* If we want to remove a feature from the initial dataset with as little effect on the predictions as possible, we should pick the one(s) with the lowest coefficient(s).
* **The fact that we standardized the data first makes sure that we can compare the coefficients to one another.**
* We could repeat the above process until we have the desired number of features, but there is a sklearn function for that: **RFE** for **Recursive Feature Elimination** is a feature selection algorithm that can be wrapped around any model that produces feature coefficients or feature importance values.
* We can pass it the model we want to use and the number of features we want to select
* While fitting to our data, it will repeat a process where it first fits the internal model and then drops the feature with the weakest coefficient.
    * It will keep doing this until the desired number of features is reached.
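
The elimination loop described above can be sketched by hand. This is a minimal illustration and not scikit-learn's implementation: it uses ordinary least squares (via `np.linalg.lstsq`) as a stand-in for the internal model, and the data and feature names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200
feature_names = ["f0", "f1", "f2", "f3", "f4"]

# Hypothetical data: only f0 and f2 actually influence the target
X = rng.normal(size=(n_samples, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=n_samples)

# Standardize so that coefficient sizes are comparable between features
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

remaining = list(range(X_std.shape[1]))
n_features_to_select = 2

while len(remaining) > n_features_to_select:
    # Refit the model on the features that are still in play
    coef, *_ = np.linalg.lstsq(X_std[:, remaining], y - y.mean(), rcond=None)
    # Drop the feature with the weakest (smallest absolute) coefficient
    weakest = int(np.argmin(np.abs(coef)))
    dropped = remaining.pop(weakest)
    print(f"Dropped {feature_names[dropped]}, {len(remaining)} features left")

selected = [feature_names[i] for i in remaining]
print(selected)
```

Refitting inside the loop is the essential part: the coefficients are recomputed after every drop rather than ranked once up front.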

```
from sklearn.feature_selection import RFE
rfe = RFE(estimator= LogisticRegression(), n_features_to_select=2, verbose=1)
rfe.fit(X_train_std, y_train)
```
* If we set RFE's verbose parameter higher than zero, we'll be able to see that features are dropped one by one while fitting
* Another advantage: dropping features one at a time is safer than dropping several at once, because dropping one feature causes all of the other coefficients to change (which is why the task must be done recursively).

```
X.columns[rfe.support_]
```
* The `.support_` attribute contains True/False values indicating which features were kept in the dataset
* Using the zip function once more, we can also check out rfe's `.ranking_` attribute to see in which iteration a feature was dropped (values of 1 mean that a feature was kept in the dataset until the end, while high values mean the feature was dropped early on):

```
X.columns[rfe.support_]
print(dict(zip(X.columns, rfe.ranking_)))
print(accuracy_score(y_test, rfe.predict(X_test_std)))
```

#### Tree-based feature selection
* Some models will perform feature selection by design to avoid overfitting 
    * For example: Random Forest Classifier
* Random Forest algorithm naturally calculates feature importance values
    * These values can be extracted from a trained model using the **`feature_importances_`** attribute:
    
```
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print(rf.feature_importances_)
```
* Just like the coefficients produced by the logistic regressor, these feature importance values can be used to perform feature selection, as unimportant features will be closer to zero.
* An advantage of these feature importance values over coefficients is that they are comparable between features by default, since they always sum up to one (meaning we don't have to scale our input data first)
* We can also use the feature importance values to create a True/False (boolean) mask for features that meet a certain importance threshold:

```
mask = rf.feature_importances_ > 0.1
X_reduced = X.loc[:, mask]
print(X_reduced.columns)
```
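
The claim that the importance values sum to one can be checked on synthetic data. This is a sketch that assumes a dataset generated with scikit-learn's `make_classification`; `X_toy` and `y_toy` are hypothetical names, and no input scaling is applied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic classification data with 6 features
X_toy, y_toy = make_classification(n_samples=200, n_features=6,
                                   n_informative=3, random_state=0)

# Fit a random forest on the unscaled features
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_toy, y_toy)

importances = rf.feature_importances_
print(importances.round(3))
print(importances.sum())
```

Because the values share one scale, a fixed threshold such as the one used for the mask above means the same thing for every feature.
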
* **Remember** that dropping one weak feature can make other features relatively more or less important; if you want to play it safe and minimize the risk of dropping the wrong features, you should not drop all least important features at once, but rather one by one; to do so, once again wrap a Recursive Feature Eliminator (or `RFE()`) around the model:

```
from sklearn.feature_selection import RFE

rfe= RFE(estimator= RandomForestClassifier(), 
                    n_features_to_select= 6,
                    verbose=1)
rfe.fit(X_train, y_train)
```
* However, training the model once for each feature we want to drop can result in too much computational overhead; to speed up the process, we can pass the `step` parameter to `RFE()`:

```
from sklearn.feature_selection import RFE

rfe= RFE(estimator= RandomForestClassifier(), 
                    n_features_to_select= 6,
                    step = 10,
                    verbose=1)
rfe.fit(X_train, y_train)
```
* With `step` set to 10, on each iteration, the 10 least important features are dropped.
* Once the final model is trained, we can use the feature eliminator's `.support_` attribute as a mask to print the remaining column names.
* `print(X.columns[rfe.support_])`

```
# Perform a 75% training and 25% test data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the random forest model to the training data
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

# Calculate the accuracy
acc = accuracy_score(y_test, rf.predict(X_test))

# Print the importances per feature
print(dict(zip(X.columns, rf.feature_importances_.round(2))))

# Print accuracy
print("{0:.1%} accuracy on test set.".format(acc))
```
***
```
# Create a mask for features importances above the threshold
mask = rf.feature_importances_ > 0.15
# Apply the mask to the feature dataset X
reduced_X = X.loc[:, mask]
```
***

```
# Set the feature eliminator to remove 2 features on each step
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=2, step=2, verbose=1)

# Fit the model to the training data
rfe.fit(X_train, y_train)

# Create a mask
mask = rfe.support_

# Apply the mask to the feature dataset X and print the result
reduced_X = X.loc[:, mask]
print(reduced_X.columns)
```

#### Regularized linear regression
* Linear regression: build a model that derives a linear function between the input features (three in this example) and a target

```
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

# Calculate R-squared on the test set
print(lr.score(X_test, y_test))
```
* With **regularization**, the model will not only try to be as accurate as possible by minimizing the loss function, but it will also **try to keep the model simple** by keeping the coefficients low. 
* The strength of regularization can be tweaked with **alpha** ($\alpha$).
* When alpha is too low, the model might overfit.
* When alpha is too high, the model might become too simple and inaccurate (the model might be underfitted).
* **LASSO regularization:** **L**east **A**bsolute **S**hrinkage and **S**election **O**perator
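
For reference, the objective that Lasso minimizes can be written as follows. This is the form documented for scikit-learn's `Lasso`; the symbols $n$ (number of observations), $\beta_0$ (intercept), $\beta_j$ (coefficients) and $x_{ij}$ (feature values) are notation introduced here for illustration.

$$
\min_{\beta_0,\,\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j} \beta_j x_{ij} \Big)^2 \; + \; \alpha \sum_{j} |\beta_j|
$$

The first term is the usual squared-error loss, and the second term penalizes the absolute size of the coefficients. Setting $\alpha = 0$ recovers ordinary linear regression, while a larger $\alpha$ pushes more coefficients to exactly zero.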

```
from sklearn.linear_model import Lasso

la = Lasso(alpha = 0.05)
la.fit(X_train, y_train)
print(la.coef_)
```

```
# Set the test size to 30% to get a 70-30% train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the scaler on the training features and transform these in one go
X_train_std = scaler.fit_transform(X_train)

# Create the Lasso model
la = Lasso()

# Fit it to the standardized training data
la.fit(X_train_std, y_train)

# Transform the test set with the pre-fitted scaler
X_test_std = scaler.transform(X_test)

# Calculate the coefficient of determination (R squared) on X_test_std
r_squared = la.score(X_test_std, y_test)
print("The model can predict {0:.1%} of the variance in the test set.".format(r_squared))

# Create a list that has True values when coefficients equal 0
zero_coef = la.coef_ == 0

# Calculate how many features have a zero coefficient
n_ignored = sum(zero_coef)
print("The model has ignored {} out of {} features.".format(n_ignored, len(la.coef_)))
```
* **NOTE TO SELF:** Always `.fit_transform` the scaler on `X_train` $\Rightarrow$ then fit the regularized model (`Lasso()`, etc.) on `X_train_std`, `y_train` $\Rightarrow$ then use the pre-fitted scaler to **ONLY** `.transform` `X_test`.

```
# Find the highest alpha value with R-squared above 98%
la = Lasso(alpha=0.1, random_state=0)

# Fits the model and calculates performance stats
la.fit(X_train_std, y_train)
r_squared = la.score(X_test_std, y_test)
n_ignored_features = sum(la.coef_ == 0)

# Print performance stats
print("The model can predict {0:.1%} of the variance in the test set.".format(r_squared))
print("{} out of {} features were ignored.".format(n_ignored_features, len(la.coef_)))
```

#### Combining feature selectors 
* Above we manually found an optimal alpha value, but this can be very tedious.
* To automate this, use the `LassoCV()` regressor class
* **`LassoCV()`** will **use cross validation to try out different alpha settings and select the best one.**

```
from sklearn.linear_model import LassoCV
lcv = LassoCV()
lcv.fit(X_train, y_train)
print(lcv.alpha_)
```
* When we fit this model to our training data, it will get an `.alpha_` attribute with the optimal value
* To actually remove the features to which the Lasso regressor assigned a zero, we create a mask for all non-zero coefficients:

```
mask = lcv.coef_ != 0
reduced_X = X.loc[:, mask]
```

**Recap: Random Forests as a type of ensemble model**
* Random forest is a combination of decision trees
* We can use a combination of models for feature selection too.
* Designed on the idea that a lot of weak predictors can combine to form a strong one
* Instead of trusting a single model to tell us which features are important, we could have multiple models each cast their vote on whether we should keep a feature or not

* **Feature selection with LassoCV:**

```
from sklearn.linear_model import LassoCV
lcv = LassoCV()
lcv.fit(X_train, y_train)
lcv.score(X_test, y_test)
lcv_mask = lcv.coef_ != 0
sum(lcv_mask)
```
* Output: `66`

* **Feature selection with random forest:**

```
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

rfe_rf = RFE(estimator=RandomForestRegressor(),
             n_features_to_select=66,
             step=5,
             verbose=1)
rfe_rf.fit(X_train, y_train)
rf_mask= rfe_rf.support_
```
* Note that here (above) we've wrapped an RFE around the model to have it select the same number of features as the LassoCV() regressor (66).

* **Feature selection with a gradient boosting regressor:**

```
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingRegressor

rfe_gb = RFE(estimator=GradientBoostingRegressor(),
             n_features_to_select=66, step=5, verbose=1)
rfe_gb.fit(X_train, y_train)
gb_mask = rfe_gb.support_
``` 

* **Combining the (3) feature selectors:**

```
import numpy as np

votes = np.sum([lcv_mask, rf_mask, gb_mask], axis=0)

print(votes)
```
* **The output will be an array with the number of votes that each feature got.** For example:
* `array([3, 2, 2, ..., 3, 0, 1])`
* What we do with this vote then depends on how conservative we want to be
    * If we want to make sure we don't lose any information, we could select all features with at least one vote
    * For majority voting (in this case of using 3 estimators in the vote), we could use 2:

```
mask = votes >= 2
reduced_X = X.loc[:, mask]
```
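
To illustrate how the vote threshold changes the outcome, here is a toy version with five features. The three masks are made up; in practice they would come from the fitted selectors above.

```python
import numpy as np

feature_names = np.array(["f0", "f1", "f2", "f3", "f4"])

# Hypothetical masks from three different feature selectors
lcv_mask = np.array([True, True, False, True, False])
rf_mask = np.array([True, False, False, True, True])
gb_mask = np.array([True, True, False, False, False])

# Count how many selectors voted to keep each feature
votes = np.sum([lcv_mask, rf_mask, gb_mask], axis=0)
print(votes)  # [3 2 0 2 1]

# Majority voting: keep features with at least 2 of the 3 votes
majority = votes >= 2
print(feature_names[majority])

# Conservative choice: keep every feature with at least one vote
at_least_one = votes >= 1
print(feature_names[at_least_one])
```

Here majority voting keeps three features, while the conservative setting keeps four and only drops the feature that no selector wanted.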