The techniques for feature selection in machine learning can be broadly classified into the following categories:

### Supervised Techniques: 
These techniques apply to labeled data. They identify the features most relevant to the target, which improves the efficiency of supervised models such as classifiers and regressors.

### Unsupervised Techniques: 
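These techniques apply to unlabeled data. They score features using only the features themselves (for example, their variance or their correlation with one another), without reference to a target.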

In [1]:
import pandas as pd
import numpy as np

data_tox: pd.DataFrame = pd.read_csv('../CD databases/qsar_oral_toxicity.csv')
data_heart: pd.DataFrame = pd.read_csv('../CD databases/heart_failure_clinical_records_dataset.csv')

### Chi-square Test (supervised)
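Best suited to a categorical target. scikit-learn's `chi2` requires non-negative feature values, such as counts or binary indicators.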

In [25]:
from sklearn.feature_selection import SelectKBest, chi2

y_tox: np.ndarray = data_tox.pop('classification').values
X_tox: np.ndarray = data_tox.values

best_vars_tox = SelectKBest(chi2, k=200).fit_transform(X_tox, y_tox)

print('Original feature number: ', X_tox.shape[1])
print('Reduced feature number: ', best_vars_tox.shape[1])

Original feature number:  1024
Reduced feature number:  200


In [26]:
y_heart: np.ndarray = data_heart.pop('DEATH_EVENT').values
X_heart: np.ndarray = data_heart.values
    
best_vars_heart = SelectKBest(chi2, k=5).fit_transform(X_heart, y_heart)

print('Original feature number: ', X_heart.shape[1])
print('Reduced feature number: ', best_vars_heart.shape[1])

Original feature number:  12
Reduced feature number:  5
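
`fit_transform` returns only the reduced array, so the names of the selected columns are lost. The sketch below shows one way to recover them with `get_support()` and to inspect the chi-square scores. The data is a hypothetical toy set (`X_demo`, `y_demo` and the `f_*` column names are invented for illustration), not the toxicity or heart-failure data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: a balanced binary target and four binary features.
rng = np.random.default_rng(0)
y_demo = np.array([0, 1] * 100)
X_demo = pd.DataFrame({
    'f_signal': y_demo,                           # identical to the target
    'f_noise_a': rng.integers(0, 2, size=200),    # unrelated to the target
    'f_noise_b': rng.integers(0, 2, size=200),
    'f_noise_c': rng.integers(0, 2, size=200),
})

# Fit the selector, then map the boolean support mask back to column names.
selector = SelectKBest(chi2, k=2).fit(X_demo, y_demo)
selected = X_demo.columns[selector.get_support()].tolist()

# One chi-square score per feature; higher means stronger dependence on the target.
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
print(selected)
print(scores)
```

The same pattern should carry over to the real data, provided the column names are saved before the DataFrame is converted to an array with `.values`.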


### Variance Threshold (unsupervised)
The variance threshold is a simple baseline approach to feature selection. It removes every feature whose variance does not meet a given threshold. By default, it removes only zero-variance features, i.e., features that have the same value in all samples. The underlying assumption is that features with higher variance may carry more useful information.

In [33]:
from sklearn.feature_selection import VarianceThreshold

th = VarianceThreshold(threshold=0.8)
#tox_high_variance = th.fit_transform(data_tox)
heart_high_variance = th.fit_transform(data_heart)

print('Original feature number: ', X_heart.shape[1])
print('Reduced feature number: ', heart_high_variance.shape[1])

Original feature number:  12
Reduced feature number:  7
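
Variance depends on the unit of measurement, so a fixed threshold treats the same information differently at different scales. The sketch below illustrates this on a hypothetical toy array (`X_demo` is invented for illustration): the third column is the second multiplied by 10, yet only the third survives a threshold of 2.0.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: a constant column, a small-scale column,
# and the same column multiplied by 10.
X_demo = np.array([
    [0.0, 1.0, 10.0],
    [0.0, 2.0, 20.0],
    [0.0, 3.0, 30.0],
    [0.0, 4.0, 40.0],
])

# Default threshold (0.0): only the constant column is removed.
kept_default = VarianceThreshold().fit(X_demo).get_support()

# threshold=2.0: the population variances are 0, 1.25 and 125,
# so only the rescaled column is kept.
selector = VarianceThreshold(threshold=2.0).fit(X_demo)
kept_strict = selector.get_support()

print(kept_default)
print(kept_strict)
print(selector.variances_)
```

One possible remedy, if the features are measured in very different units, is to bring them to a comparable range before thresholding. Whether that is appropriate depends on the data.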


### Correlation Coefficient (unsupervised)
If two variables are strongly correlated, one can be approximately predicted from the other. Therefore, if two features are strongly correlated, the model only really needs one of them, as the second adds little additional information.

In [2]:
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()
tox_data = pd.read_csv('../CD databases/qsar_oral_toxicity.csv')

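corr_mtx = tox_data.corr()

high_vars = []
# Scan the correlation matrix and collect every column whose correlation with
# some other column exceeds the threshold in absolute value.
# Note: the matrix is symmetric, so both members of each correlated pair are collected.
for i in range(len(corr_mtx)):
    for j in range(len(corr_mtx)):
        value = corr_mtx.iat[i, j]
        # Entries equal to 1 are skipped in order to exclude the diagonal.
        if (value < -0.75 or value > 0.75) and value != 1:
            if j not in high_vars:
                high_vars.append(j)

# Column names are assumed to be the 1-based position of each feature.
high_vars_str = [str(i + 1) for i in high_vars]
tox_data = tox_data.drop(columns=high_vars_str)
print(len(high_vars_str))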

144


In [3]:
tox_data.to_csv('../CD databases/toxicity_reduced.csv', index=False)

In [39]:
tox_data.shape

(8992, 881)
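
The loop above removes every feature that takes part in a highly correlated pair, so both members of a pair are dropped. If the goal is to keep one feature from each pair, as the section introduction suggests, a common alternative is to inspect only the upper triangle of the correlation matrix. The sketch below is a minimal version of that idea on hypothetical toy data; `drop_correlated` and `demo` are names introduced here for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.75):
    """Drop each column that is highly correlated with an earlier column."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    # and the diagonal is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if any earlier column correlates with it above the threshold.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Hypothetical toy data: 'b' is an exact multiple of 'a'; 'c' is only weakly related.
demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})

reduced, dropped = drop_correlated(demo)
print(dropped)
print(reduced.columns.tolist())
```

This greedy rule depends on column order: the first feature of each correlated pair is the one retained. Applied to the toxicity data, it would be expected to remove fewer columns than the loop above, though the exact count has not been computed here.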