# Improving Classification Models: Feature Subset Selection

## What to Expect:

1. Quick Intro: Iris Dataset
2. What is Feature Subset Selection
3. Variable Thresholding
4. K-Best Features
5. Forward and Backward Stepwise Selection (Wine Dataset) 

## Objectives:

* Build a better understanding of Feature Subset Selection
* Attempt Variable Thresholding, K-Best Feature Selection, and Forward/Backward Stepwise Selection

---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Quick Intro: Iris Dataset

In [2]:
from sklearn.datasets import load_iris

In [3]:
# Load the Iris dataset
iris = load_iris()

In [4]:
# Create a DataFrame from the Iris dataset
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add the target column to the DataFrame
iris_df['target'] = iris.target

# Display the first few rows of the DataFrame
iris_df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [5]:
X = iris_df.drop(columns = 'target')
y = iris_df['target']

In [6]:
# Import the scaler module
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()

# Scale data
X_scaled = scaler.fit_transform(X)

---

## 2. What is Feature Subset Selection?

The success of a machine learning algorithm depends on the quality of the data it learns from. If the data is inadequate or contains irrelevant information, the algorithm may produce inaccurate or unintelligible results. Feature subset selection **removes irrelevant and redundant features** before learning. This **reduces training time**, lowers *data dimensionality*, improves **algorithm efficiency**, and can **enhance performance and interpretability**.

---

## 3. Variable Thresholding

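Variable thresholding (implemented in scikit-learn as `VarianceThreshold`) removes every feature whose variance falls below a chosen threshold. A feature that barely changes across samples carries little information for telling classes apart, so dropping it reduces dimensionality at little cost. The method looks only at the features, not at the target, so it can be applied before any model is trained.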

```python
# Import syntax for the VarianceThreshold
from sklearn.feature_selection import VarianceThreshold
```

But first, let's take a look at the variance of each feature.

In [None]:
from sklearn.feature_selection import VarianceThreshold

# Display the variance of each column
column_variances = iris_df.var()

print("Variance of Each Column:")
print(column_variances)
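
One detail is worth keeping in mind when comparing these numbers with a threshold: pandas' `.var()` defaults to the sample variance (`ddof=1`), whereas `VarianceThreshold` compares against the population variance (`ddof=0`), which it stores in its `variances_` attribute. The sketch below puts the two side by side for the Iris features; the names `features`, `vt`, and `comparison` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# Fit the selector only to read the variances it computes internally
vt = VarianceThreshold(threshold=0.2).fit(features)

comparison = pd.DataFrame(
    {
        "pandas_var_ddof1": features.var(),        # sample variance (pandas default)
        "pandas_var_ddof0": features.var(ddof=0),  # population variance
        "variance_threshold": vt.variances_,       # values the selector compares
    },
    index=features.columns,
)
print(comparison)
```

On a dataset of this size the difference is small, but it can matter for a feature whose variance sits very close to the threshold.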


In [None]:
# Use VarianceThreshold to remove low-variance features

# You can adjust this threshold based on your needs
variance_threshold = 0.2

selector = VarianceThreshold(threshold=variance_threshold)

# Exclude the target column (the last column) before fitting.
# Careful: variance depends on each feature's scale, so features measured
# on different scales may need to be scaled before thresholding.
selected_features = selector.fit_transform(iris_df.iloc[:, :-1])

# Get the selected feature indices
selected_feature_indices = selector.get_support(indices=True)

# Get the names of the selected features from the original DataFrame
selected_feature_names = iris_df.columns[selected_feature_indices]

# Create a DataFrame for the selected features with their names
selected_features_df = pd.DataFrame(data=selected_features,
                                    columns=selected_feature_names)

selected_features_df.head()

How would you apply this to data that has been split into training and test sets?
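
One possible answer, offered as a minimal sketch and not the only way to do it: fit the selector on the training portion only, then reuse the fitted selector to drop the same columns from the test portion. The variable names (`X_demo`, `vt_split`, and so on) are made up for this example, and the split settings simply mirror the ones used later in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

# Load Iris as a DataFrame (features) and Series (target)
X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)

vt_split = VarianceThreshold(threshold=0.2)
X_tr_sel = vt_split.fit_transform(X_tr)  # decide which columns to keep from train only
X_te_sel = vt_split.transform(X_te)      # apply the same column mask to test

kept = X_demo.columns[vt_split.get_support()]
print(list(kept))
print(X_tr_sel.shape, X_te_sel.shape)
```

Fitting on the training data alone keeps information from the test set out of the selection step, so the test set remains a fair check of the final model.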

---

## 4. K-Best Features

K-Best feature selection keeps the top k features ranked by a **statistical measure** of their relationship with the target (below, the ANOVA F-value computed by `f_classif`). Focusing on the most informative attributes reduces dimensionality, can mitigate overfitting, and improves computational efficiency.

In [None]:
from sklearn.model_selection import train_test_split

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
                                                    test_size=0.2,
                                                    random_state=10)

In [None]:
# Import the feature selector module
from sklearn import feature_selection
from sklearn.feature_selection import f_classif

# Set up selector, choosing score function and number of features
selector_kbest = feature_selection.SelectKBest(score_func=f_classif,
                                               k=3)

# Transform (i.e.: run selection on) the training data
X_train_kbest = selector_kbest.fit_transform(X_train, y_train)

In [None]:
X_train_kbest.shape

In [None]:
# Get the selected feature indices
selected_feature_indices = selector_kbest.get_support(indices=True)

# Get the names of the selected features from the original DataFrame
selected_feature_names = X.columns[selected_feature_indices]

# Create a DataFrame for the selected features with their names
selected_features_df = pd.DataFrame(data=X_train_kbest, 
                                    columns=selected_feature_names)

# Print the DataFrame
selected_features_df.head()
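
To see why particular features were kept, it can help to look at the scores the selector computed. The sketch below is self-contained and rebuilds the same scaled split under fresh, illustrative names (`X_demo`, `kbest`, `scores`). It prints the F-score and p-value of every feature, then applies the fitted selector to the test set so that both sets end up with the same columns.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)
X_demo_scaled = MinMaxScaler().fit_transform(X_demo)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo_scaled, y_demo, test_size=0.2, random_state=10
)

# Fit on the training data only
kbest = SelectKBest(score_func=f_classif, k=3).fit(X_tr, y_tr)

# One row per original feature, highest F-score first
scores = pd.DataFrame(
    {
        "feature": X_demo.columns,
        "f_score": kbest.scores_,
        "p_value": kbest.pvalues_,
        "selected": kbest.get_support(),
    }
).sort_values("f_score", ascending=False)
print(scores)

# Reuse the fitted selector on the test set
X_te_kbest = kbest.transform(X_te)
print(X_te_kbest.shape)  # (30, 3)
```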

---

## 5. Forward and Backward Stepwise Selection


Forward and Backward Stepwise Selection are feature selection techniques used in statistical modeling and machine learning to improve model efficiency and interpretability. Both search greedily, changing the feature set by one feature at a time.

- Forward Stepwise Selection

Forward Stepwise Selection begins with an empty set of features. At each step it adds the feature that most improves the model according to a chosen criterion (for example, a cross-validated score), and it stops when a predefined criterion is met, such as reaching a target number of features. Because it starts small, it is particularly useful when the dataset contains a large number of potential features.

- Backward Stepwise Selection

Backward Stepwise Selection, in contrast, starts with the entire set of features and removes the least useful one at each step until the stopping criterion is satisfied.
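
The forward procedure described above can be sketched by hand in a few lines. This is an illustration of the idea under stated assumptions, not the implementation any library uses: the helper `forward_select` is hypothetical, the criterion is mean cross-validated accuracy, and the stopping rule is simply a fixed number of features.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler


def forward_select(model, X, y, n_features, cv=5):
    """Greedy forward selection (hypothetical helper for illustration)."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        # Score each remaining feature when added to the current subset
        trial_scores = {
            j: cross_val_score(model, X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        # Keep the candidate with the best mean cross-validated accuracy
        best = max(trial_scores, key=trial_scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


wine_demo = load_wine()
X_w = MinMaxScaler().fit_transform(wine_demo.data)
chosen = forward_select(
    LogisticRegression(max_iter=1000), X_w, wine_demo.target, n_features=4
)
print([wine_demo.feature_names[j] for j in chosen])
```

Backward selection mirrors this loop: start with every column in `selected` and, at each step, drop the feature whose removal hurts the score least.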

#### Now onto the code!

In [None]:
from sklearn.datasets import load_wine
# Load the Wine dataset
wine = load_wine()

In [None]:
# Create a DataFrame from the Wine dataset
wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)

# Add the target column to the DataFrame
wine_df['target'] = wine.target

# Display the first few rows of the DataFrame
wine_df.head()

In [None]:
# Display the variance of each column
column_variances = wine_df.var()

print("Variance of Each Column:")
print(column_variances)


In [None]:
data = load_wine()
mask = data.target != 0  # Drop class 0 so that only classes 1 and 2 remain (binary classification)
X1, y1 = data.data[mask], data.target[mask]

In [None]:
# Import the scaler module
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()

# Scale data
X_scaled1 = scaler.fit_transform(X1)

In [None]:
from sklearn.model_selection import train_test_split

# Split into train and test
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_scaled1, y1, 
                                                            test_size=0.2, 
                                                            random_state=10)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# Logistic regression model used by the stepwise selector
lm_sfs = LogisticRegression()

In [None]:
!pip install mlxtend

In [None]:
# Import the selector module, and f1_score to compute performance
from sklearn.metrics import f1_score
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

In [None]:
# We then build our forward feature selector.
# The class is imported as SFS so that the instance name `sfs` does not
# shadow it, which lets this cell be re-run safely.
sfs = SFS(lm_sfs, k_features=4, forward=True,
          scoring='f1', cv=5)

In [None]:
new_selected_sfs_data = sfs.fit(X_train1, y_train1)

In [None]:
# Plot the results
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

fig1 = plot_sfs(new_selected_sfs_data.get_metric_dict(), 
                kind='std_dev')

plt.ylim([0.9, 1.1])
plt.title('Sequential Forward Selection (w. StdDev)')
plt.grid()
plt.show()
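
The cells above cover forward selection only. For the backward direction, here is a hedged sketch that swaps in scikit-learn's own `SequentialFeatureSelector` (a different class from the mlxtend one used above) with `direction="backward"`. It rebuilds the same two-class Wine split under new, illustrative names so that it runs on its own.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

wine_demo = load_wine()
keep = wine_demo.target != 0  # keep classes 1 and 2 only, as above
X_b = MinMaxScaler().fit_transform(wine_demo.data[keep])
y_b = wine_demo.target[keep]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_b, y_b, test_size=0.2, random_state=10
)

# Start from all 13 features and remove one at a time until 4 remain
backward = SequentialFeatureSelector(
    LogisticRegression(),
    n_features_to_select=4,
    direction="backward",
    scoring="f1",
    cv=5,
)
backward.fit(X_tr, y_tr)

kept_names = [
    name
    for name, is_kept in zip(wine_demo.feature_names, backward.get_support())
    if is_kept
]
print(kept_names)
print(backward.transform(X_te).shape)
```

Forward and backward searches are both greedy, so they need not end at the same four features; comparing the two results is a useful sanity check.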

---

## Objectives:

* Build a better understanding of Feature Subset Selection
* Attempt Variable Thresholding, K-Best Feature Selection, and Forward/Backward Stepwise Selection