# Dimensionality Reduction Techniques



### Feature Selection

Main libraries: `sklearn.feature_selection` and `sklearn.manifold`


#### t-SNE
- Use it when you want to visually explore the patterns in a high-dimensional dataset.
- Stands for t-distributed stochastic neighbor embedding.
- A statistical method for visualizing high-dimensional data by giving each data point a location in a two- or three-dimensional map.
- **HOW:** First, t-SNE constructs a probability distribution over pairs of high-dimensional objects such that similar objects are assigned a higher probability and dissimilar objects a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map.
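
Written out, the two steps above take the following form in the standard formulation of t-SNE. The symbols are notation introduced here rather than taken from these notes: $x_i$ are the original points, $y_i$ the map points, $\sigma_i$ a per-point bandwidth, and $N$ the number of points.

$$
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}
$$

$$
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},
\qquad
\mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

The map locations $y_i$ are the only free variables, and they are adjusted by gradient descent to make $\mathrm{KL}(P \parallel Q)$ small.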


-  **PROS:**
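    - Captures non-linear structure and keeps similar points close together, which makes clusters easy to spot in a 2D or 3D plot.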
-  **CONS:**
    - Only numeric data can be transformed. Non-numeric categorical columns must be encoded first, for example with one-hot encoding.

In [None]:
from sklearn.manifold import TSNE

# High learning rates can make the embedding unstable; low learning rates are more conservative
m = TSNE(learning_rate=50)

# df must contain only numeric columns; the result has one row per sample and two columns
tsne_features = m.fit_transform(df)
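
As a rough end-to-end sketch of the cell above, the following runs t-SNE on a small made-up dataset. The column names (`height`, `weight`, `group`) and the data are hypothetical and only serve to show the one-hot encoding step mentioned in the cons.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

# Hypothetical toy data: two numeric columns and one categorical column
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "height": rng.normal(170, 10, size=100),
    "weight": rng.normal(70, 8, size=100),
    "group": rng.choice(["a", "b", "c"], size=100),
})

# One-hot encode the categorical column so that every column is numeric
toy_numeric = pd.get_dummies(toy_df, columns=["group"]).astype(float)

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, learning_rate=50, perplexity=30, random_state=0)
tsne_features = tsne.fit_transform(toy_numeric)

# Store the two map coordinates next to the original rows for plotting
toy_df["x"] = tsne_features[:, 0]
toy_df["y"] = tsne_features[:, 1]
print(toy_df[["x", "y", "group"]].head())
```

The `x` and `y` columns can then be passed to a scatter plot, colored by `group`, to look for clusters.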


#### Variance Thresholds

- Since we want to keep the features with the most variance, we can set a minimum variance using **VarianceThreshold**.
- Features with larger values usually also have larger variances, so to normalize, each column is usually divided by its mean before applying the feature selection function.

In [None]:
from sklearn.feature_selection import VarianceThreshold

# threshold is a minimum variance; 0.005 is a reasonable value after normalization
sel = VarianceThreshold(threshold=0.005)

sel.fit(df / df.mean())  # divide by the column means to normalize

# mask is a boolean array with True for every column that is kept
mask = sel.get_support()

df.loc[:, mask]
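
To see the effect of the mean normalization, here is a small sketch on made-up data. The columns `a`, `b` and `c` are hypothetical: `b` is nearly constant, while `a` and `c` vary on very different scales.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "a": rng.normal(10, 2, size=200),     # normalized variance around 0.04
    "b": rng.normal(5, 0.01, size=200),   # almost constant
    "c": rng.normal(100, 30, size=200),   # normalized variance around 0.09
})

# Same threshold as above, applied to the mean-normalized columns
sel = VarianceThreshold(threshold=0.005)
sel.fit(toy_df / toy_df.mean())

mask = sel.get_support()
kept = toy_df.loc[:, mask]
print(list(kept.columns))
```

Without the division by the mean, the raw variances (roughly 4, 0.0001 and 900) would mostly reflect the scale of each column rather than how much it varies relative to its size.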

##### Missing values or high correlation:
- Check whether more than 30% of a feature's values are missing (or another cutoff of your choice). If so, the feature might as well be dropped.
- Compute the correlation between all features. For a pair of highly correlated features, one of the two can be dropped.
- **MUST BE CAREFUL!** If you are not sure that dropping a correlated feature will not lose important information, use feature extraction instead.

In [None]:
import numpy as np
import seaborn as sns

# Missing/NaN values: boolean Series, True for columns with less than 30% missing values
mask = df.isna().sum() / len(df) < 0.30

# Correlation matrix: absolute pairwise correlation between all columns
corr = df.corr().abs()

# A pairplot is an excellent way to check for correlation as well
# (optionally pass hue=<name of a categorical column>)
sns.pairplot(data=df)

# OR

# Upper triangle of True values, the same shape as the correlation matrix
mask_df = np.triu(np.ones_like(corr, dtype=bool))

# Plot the correlation matrix; the mask hides the redundant upper half
sns.heatmap(data=corr, mask=mask_df)

# Filter for correlation

# Blank out the upper triangle and the diagonal so that each pair appears only once
tri_df = corr.mask(mask_df)

# List column names of highly correlated features (r > 0.95)
to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.95)]

# Drop the features in the to_drop list
reduced_df = df.drop(to_drop, axis=1)
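
The two filters above can be tried on a small made-up DataFrame. In this sketch the column names are hypothetical: `y` is nearly a copy of `x`, `z` is independent, and half of `w` is missing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
toy_df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.01, size=n),  # nearly a copy of x
    "z": rng.normal(size=n),                       # independent of x
    "w": rng.normal(size=n),
})
toy_df.loc[toy_df.index[: n // 2], "w"] = np.nan   # half of w is missing

# Step 1: keep only columns with less than 30% missing values
missing_mask = toy_df.isna().sum() / len(toy_df) < 0.30
toy_df = toy_df.loc[:, missing_mask]

# Step 2: drop one feature out of every highly correlated pair
corr = toy_df.corr().abs()
upper = np.triu(np.ones_like(corr, dtype=bool))
tri_df = corr.mask(upper)
to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.95)]
reduced_df = toy_df.drop(to_drop, axis=1)

print(to_drop)
print(list(reduced_df.columns))
```

Because only the lower triangle is kept, the feature that gets dropped from a correlated pair is the one that comes first in column order; which of the two to keep is otherwise arbitrary.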

            

## Feature Selection using model performance and complexity

- Split the data into train and test sets, then standardize the features using StandardScaler. This is necessary in order to compare feature coefficients.
- Then fit the model. If the coefficient of a certain feature is low, you may drop it to reduce model complexity.
- Each time a feature is dropped, the coefficients of the other features change, so this is a recursive method.

- sklearn has a class called **RFE - Recursive Feature Elimination**. It repeatedly fits the estimator on the training set and drops the least important features (by coefficient or feature importance) until the requested number of features remains.

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(),         # can be any estimator with coef_ or feature_importances_
          n_features_to_select=2,
          step=10,                                # number of features to drop in each iteration
          verbose=1)

rfe.fit(X_train_std, y_train)  # fit with the standardized training set

# Print the features and their ranking (high = dropped early on)
print(dict(zip(X.columns, rfe.ranking_)))

# Print the features that are not eliminated
print(X.columns[rfe.support_])
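
A self-contained sketch of the whole workflow (split, standardize, then eliminate) might look like the following. The dataset is synthetic, generated with `make_classification`, and the choice of three features to keep is an assumption made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, of which 3 carry signal
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Drop one feature per iteration until 3 remain
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=3, step=1)
rfe.fit(X_train_std, y_train)

print(rfe.ranking_)                      # 1 = kept, higher = dropped earlier
print(rfe.support_)                      # boolean mask of the kept features
print(rfe.score(X_test_std, y_test))     # accuracy using only the kept features
```

With `step=1` every eliminated feature gets its own rank, which makes the ranking easier to read than with a large `step`.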



#### Tree Based Feature Selection
- RandomForestClassifier() exposes a feature_importances_ value for each feature after fitting.
- Since it is a tree-based ensemble model, there is no need to standardize the variables.
- It can also be used as the estimator in RFE.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Fit the random forest model to the training data
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

# Calculate the accuracy
acc = accuracy_score(y_test, rf.predict(X_test))

# Create a mask for features importances above the threshold
mask = rf.feature_importances_ > 0.15

# Apply the mask to the feature dataset X
reduced_X = X.loc[:,mask]

'''OR'''

# Set the feature eliminator to remove 2 features on each step
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=2, step=2, verbose=1)

# Fit the model to the training data
rfe.fit(X_train, y_train)

# Create a mask
mask = rfe.support_

# Apply the mask to the feature dataset X and print the result
reduced_X = X.loc[:, mask]
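
The importance-threshold variant can be checked on synthetic data as in the sketch below. Using the mean importance as the cutoff (instead of the fixed 0.15 above) is an assumption made here so that the example always keeps at least one feature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: 8 features, of which 3 carry signal
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# No standardization needed for a tree-based model
rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

# Importances sum to 1; keep the features with above-average importance
importances = rf.feature_importances_
mask = importances > importances.mean()

# Refit on the reduced feature set and compare test accuracy
rf_small = RandomForestClassifier(random_state=0)
rf_small.fit(X_train[:, mask], y_train)

acc_full = accuracy_score(y_test, rf.predict(X_test))
acc_small = accuracy_score(y_test, rf_small.predict(X_test[:, mask]))
print(np.round(importances, 3))
print(acc_full, acc_small)
```

Comparing the two accuracies is a quick check that the dropped features were not carrying information the model needed.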

#### Regularization:

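- **Lasso**: uses an L1-norm penalty, which can shrink coefficients to exactly zero and so effectively removes features.
- **Ridge**: uses an L2-norm penalty, which shrinks coefficients toward zero without removing them.
- The penalty strength is often written as lambda; in sklearn it is the `alpha` parameter of both `Lasso` and `Ridge`.
- Ridge and Lasso penalties can be used with any algorithm involving weight parameters, including neural nets.
- Dropout is primarily used in neural networks (e.g. ANN, DNN, CNN or RNN) to regularize the learning.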

**We can also use an ensemble approach and have multiple models vote on the importance of each feature.**
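
As a rough illustration of how regularization doubles as feature selection, the sketch below fits `Lasso` and `Ridge` on synthetic regression data and counts the non-zero coefficients. The dataset and the choice `alpha=1.0` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, of which 3 actually influence the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

# Standardize so that the penalty treats every feature equally
X_std = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X_std, y)
ridge = Ridge(alpha=1.0).fit(X_std, y)

# Features with a non-zero Lasso coefficient are the ones that survive
lasso_mask = lasso.coef_ != 0
print("Lasso keeps", lasso_mask.sum(), "of", X.shape[1], "features")
print("Ridge non-zero coefficients:", np.count_nonzero(ridge.coef_))
```

Raising `alpha` makes Lasso drop more features; Ridge only makes its coefficients smaller.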

## Feature Extraction

#### PCA - Principal Component Analysis
- Uses **eigendecomposition** (of the covariance matrix of the data)
- We are trying to find the "intrinsic dimensions" of a dataset. PCA reveals them as the few components that explain most of the variance.
- NOT recommended for categorical datasets
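
To make the link to eigendecomposition concrete, the sketch below compares the eigenvalues of the covariance matrix with the variances reported by sklearn's `PCA`. The data is synthetic: five observed columns built from two hidden factors plus a little noise, so the intrinsic dimension is roughly two.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 observed features driven by 2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(150, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(150, 5))
X_std = StandardScaler().fit_transform(X)

# Eigendecomposition of the covariance matrix, sorted from largest eigenvalue
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# sklearn's PCA reports the same variances
pca = PCA().fit(X_std)
print(np.allclose(eigvals, pca.explained_variance_))
print(np.round(pca.explained_variance_ratio_, 3))
```

The eigenvectors match the rows of `pca.components_` up to sign, and the first two explained variance ratios should account for almost all of the variance in this example.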

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Create the scaler and standardize the data
scaler = StandardScaler()
df_std = scaler.fit_transform(df)  # always standardize to mean 0 and variance 1 before PCA

# Create the PCA instance and fit and transform the data with pca
pca = PCA()
pc = pca.fit_transform(df_std)

# This changes the numpy array output back to a DataFrame
# (this example had 4 components; by default PCA keeps min(n_samples, n_features) components)
pc_df = pd.DataFrame(pc, columns=['PC 1', 'PC 2', 'PC 3', 'PC 4'])

# Inspect the explained variance ratio per component
print(pca.explained_variance_ratio_)

# Understanding components: each row of components_ holds the feature weights of one component
pca.components_[0]  # 1st component

#Pipelines
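
One possible way to chain the scaling and PCA steps is sklearn's `Pipeline`. The sketch below is a minimal example on made-up data; the step names `scaler` and `reducer` are arbitrary labels chosen here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: 4 features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])

# Standardize, then keep the first two principal components
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("reducer", PCA(n_components=2)),
])
pc = pipe.fit_transform(X)

# The fitted PCA step is still accessible by name
pca = pipe.named_steps["reducer"]
print(pc.shape)
print(pca.explained_variance_ratio_)
print(pca.components_.shape)  # (n_components, n_features)
```

A pipeline keeps the scaler and the PCA fitted on the same training data, so the same transformation can later be applied to new data with `pipe.transform`.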

