### **Feature Selection – Filter Methods Only**

**Definition:**
Filter methods evaluate each feature **independently of the model** using statistical tests and rank them based on their relevance to the target.

---

### **Common Filter Techniques:**

1. **Correlation Coefficient**

   * Measures the strength of the linear relationship between two numeric variables.
   * Goal: keep features that correlate strongly with the target, and drop one feature from any pair that correlates strongly with each other.

2. **Chi-Square Test (χ²)**

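   * Used for **categorical features** with a **categorical target**.
   * Tests whether the feature and the target are independent; a higher statistic indicates stronger dependence. scikit-learn's `chi2` requires non-negative feature values (e.g., counts or one-hot indicators).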

3. **ANOVA (F-test)**

   * Used for **continuous features** and a **categorical target**.
   * Compares the variance between class groups with the variance within them; a high F-score means the feature's mean differs across classes.

4. **Mutual Information**

   * Measures how much information a feature gives about the target.
   * Works with both categorical and continuous variables.
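
The scoring functions above can be tried out in a few lines. This is a minimal sketch, assuming synthetic data from `make_classification` rather than this notebook's dataset; the sizes, `k=5`, and all variable names are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Synthetic stand-in data: 20 features, of which the first 5 are informative
# (shuffle=False keeps the informative columns at the front).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    n_redundant=0, shuffle=False, random_state=0,
)

# ANOVA F-test: one score per feature, computed without training a model.
anova = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Mutual information: one non-negative score per feature.
mi_scores = mutual_info_classif(X, y, random_state=0)

# Rank features by each score and show the five best column indices.
top_anova = np.argsort(anova.scores_)[::-1][:5]
top_mi = np.argsort(mi_scores)[::-1][:5]
print("top 5 by F-test:            ", np.sort(top_anova).tolist())
print("top 5 by mutual information:", np.sort(top_mi).tolist())
```

The two rankings need not agree, since the F-test looks at differences in class means while mutual information can also pick up non-linear dependence.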

---

### **Why Use Filter Methods:**

* Fast and scalable to large datasets.
* Model-independent.
* Useful as a **preprocessing step** before applying wrapper or embedded methods.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [5]:
df = pd.read_csv('human_activity_recognition_using_smartphone.csv')

In [6]:
df.shape

(7352, 563)

In [9]:
df.columns
df.drop('subject',axis=1,inplace=True)

In [10]:
df.head()

| | tBodyAcc-mean()-X | tBodyAcc-mean()-Y | tBodyAcc-mean()-Z | tBodyAcc-std()-X | tBodyAcc-std()-Y | tBodyAcc-std()-Z | tBodyAcc-mad()-X | tBodyAcc-mad()-Y | tBodyAcc-mad()-Z | tBodyAcc-max()-X | ... | fBodyBodyGyroJerkMag-skewness() | fBodyBodyGyroJerkMag-kurtosis() | angle(tBodyAccMean,gravity) | angle(tBodyAccJerkMean),gravityMean) | angle(tBodyGyroMean,gravityMean) | angle(tBodyGyroJerkMean,gravityMean) | angle(X,gravityMean) | angle(Y,gravityMean) | angle(Z,gravityMean) | Activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.288585 | -0.020294 | -0.132905 | -0.995279 | -0.983111 | -0.913526 | -0.995112 | -0.983185 | -0.923527 | -0.934724 | ... | -0.298676 | -0.710304 | -0.112754 | 0.0304 | -0.464761 | -0.018446 | -0.841247 | 0.179941 | -0.058627 | STANDING |
| 1 | 0.278419 | -0.016411 | -0.12352 | -0.998245 | -0.9753 | -0.960322 | -0.998807 | -0.974914 | -0.957686 | -0.943068 | ... | -0.595051 | -0.861499 | 0.053477 | -0.007435 | -0.732626 | 0.703511 | -0.844788 | 0.180289 | -0.054317 | STANDING |
| 2 | 0.279653 | -0.019467 | -0.113462 | -0.99538 | -0.967187 | -0.978944 | -0.99652 | -0.963668 | -0.977469 | -0.938692 | ... | -0.390748 | -0.760104 | -0.118559 | 0.177899 | 0.100699 | 0.808529 | -0.848933 | 0.180637 | -0.049118 | STANDING |
| 3 | 0.279174 | -0.026201 | -0.123283 | -0.996091 | -0.983403 | -0.990675 | -0.997099 | -0.98275 | -0.989302 | -0.938692 | ... | -0.11729 | -0.482845 | -0.036788 | -0.012892 | 0.640011 | -0.485366 | -0.848649 | 0.181935 | -0.047663 | STANDING |
| 4 | 0.276629 | -0.01657 | -0.115362 | -0.998139 | -0.980817 | -0.990482 | -0.998321 | -0.979672 | -0.990441 | -0.942469 | ... | -0.351471 | -0.699205 | 0.12332 | 0.122542 | 0.693578 | -0.615971 | -0.847865 | 0.185151 | -0.043892 | STANDING |


In [18]:
np.round(df.Activity.value_counts()/df.shape[0] * 100,2)

| Activity | count (% of rows) |
|---|---|
| LAYING | 19.14 |
| STANDING | 18.69 |
| SITTING | 17.49 |
| WALKING | 16.68 |
| WALKING_UPSTAIRS | 14.59 |
| WALKING_DOWNSTAIRS | 13.41 |


In [20]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [49]:
x = df.drop('Activity',axis=1)
y = df.Activity

In [50]:
encode = LabelEncoder()
y = encode.fit_transform(y)
# y.sample(5)

In [51]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [52]:
# shape
print('rows for training\t',x_train.shape[0],'\nrows for testing\t',x_test.shape[0])

rows for training	 5881 
rows for testing	 1471


In [53]:
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr

In [54]:
y_pred = lr.predict(x_test)
y_pred

array([4, 4, 3, ..., 1, 1, 1])

In [55]:
print('accuracy score\t',np.round(accuracy_score(y_test,y_pred)*100,2))

accuracy score	 98.03


In [56]:
# next: how does the model do with about 100 columns instead of 561?

In [57]:
# count duplicate features (transpose so that duplicated() compares whole columns)
df.T.duplicated().sum()

np.int64(21)

We have 21 duplicate columns.
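
The transpose trick used for this count is easier to see on a small frame. A minimal sketch, assuming a made-up three-column DataFrame (`toy` and its column names are hypothetical, not part of the dataset):

```python
import pandas as pd

# Hypothetical toy frame: "a_copy" holds exactly the same values as "a".
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "a_copy": [1, 2, 3],
})

# duplicated() compares rows, so transposing first makes it compare columns.
# With the default keep="first", only the later copy is flagged.
dup_mask = toy.T.duplicated()
print(dup_mask)

# Keep every column that is not flagged as a duplicate.
deduped = toy.loc[:, ~dup_mask]
print(list(deduped.columns))  # ['a', 'b']
```

Transposing a wide frame can be slow and memory-hungry, so on much larger datasets a different duplicate check may be preferable.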

In [58]:
# build the duplicate mask on the training set only, then apply it to both splits
dup_cols = x_train.T.duplicated()
x_train = x_train.loc[:, ~dup_cols]
x_test = x_test.loc[:, ~dup_cols]

In [59]:
# variance threshold
# constant
# quasi constant feature

### 1. Variance Threshold
A feature selection technique that removes features (columns) whose variance falls below a chosen threshold.

### 2. Constant Feature
A feature that has the same value in every row, i.e., zero variance.

### 3. Quasi-Constant Feature
A feature that has the same value in almost all rows (e.g., 99% the same).
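
To make the three definitions concrete, here is a minimal sketch on a made-up frame. The column names, the 99% split, and the `0.01` threshold are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 1000

# Hypothetical toy data with one column of each kind.
toy = pd.DataFrame({
    "constant": np.ones(n),                                   # same value everywhere
    "quasi_constant": np.r_[np.ones(10), np.zeros(n - 10)],   # 99% zeros
    "informative": rng.normal(size=n),                        # variance close to 1
})

# Population variance (ddof=0) is what VarianceThreshold compares to the threshold.
print(toy.var(ddof=0))

# Features whose variance is not above the threshold are removed.
selector = VarianceThreshold(threshold=0.01).fit(toy)
kept = toy.columns[selector.get_support()].tolist()
print(kept)  # ['informative']
```

The quasi-constant column has variance 0.01 × 0.99 = 0.0099, which is why a threshold of 0.01 removes it along with the constant one.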

In [68]:
# (x_train.var()<0.05).sum()
# 191 columns have variance below 0.05

In [71]:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=0.05)
sel

In [72]:
sel.fit(x_train)

In [75]:
sel.get_support().sum()

np.int64(349)

In [82]:
cols = np.array(x_train.columns[sel.get_support()])
# # print(cols)
# cols

In [83]:
x_train = sel.transform(x_train)
x_test = sel.transform(x_test)

In [86]:
x_train = pd.DataFrame(x_train,columns=cols)
x_test = pd.DataFrame(x_test,columns=cols)

In [89]:
print('number of columns left\t',x_train.shape[1])

number of columns left	 349


**Points to consider before applying a Variance Threshold** for feature selection:

---

### 1. **Data Should Be Numeric**

* Variance threshold works only on numerical features.
* Convert categorical variables (if any) using encoding techniques before applying.


### 2. **Scale Matters**

* Variance is affected by the scale of features.
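* If features are on different scales (e.g., age vs. salary), bring them to a common range with `MinMaxScaler` before thresholding. Avoid `StandardScaler` for this step: it rescales every feature to unit variance, so a variance threshold can no longer tell features apart.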
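
The effect of each scaler on the variances can be checked directly. A minimal sketch, assuming two hypothetical columns (`age` and `salary`) drawn from uniform ranges picked only for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)

# Hypothetical columns on very different scales.
age = rng.uniform(18, 70, size=500)
salary = rng.uniform(20_000, 200_000, size=500)
X = np.column_stack([age, salary])

raw_var = X.var(axis=0)
mm_var = MinMaxScaler().fit_transform(X).var(axis=0)
std_var = StandardScaler().fit_transform(X).var(axis=0)

print("raw variances:       ", raw_var)   # salary dwarfs age
print("after MinMaxScaler:  ", mm_var)    # comparable, bounded by 0.25
print("after StandardScaler:", std_var)   # every feature becomes 1.0
```

After `StandardScaler` every variance is 1, so any threshold below 1 keeps all features and any threshold of 1 or more risks dropping all of them.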


### 3. **Check for Constant Features First**

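* Before setting a threshold, remove features with **zero variance** (constant features). They carry no information about the target, so they can always be dropped; `VarianceThreshold()` with its default `threshold=0.0` does exactly this.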


### 4. **Use Domain Knowledge**

* Some low-variance features might still be important (e.g., disease presence might be rare but very predictive).

### 5. **Experiment with Different Thresholds**

* Try multiple thresholds (e.g., 0.01, 0.02, 0.05) and **observe model performance**.
* There's no one-size-fits-all value; the right threshold is dataset-specific.
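
One way to run such an experiment is a small loop over candidate thresholds. This is a sketch under stated assumptions: the data is synthetic (`make_classification`), the per-column rescaling is added only so that the thresholds have something to remove, and the pipeline is one possible setup rather than the notebook's own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data; columns are shrunk by different factors so that
# their variances spread out across the candidate thresholds.
X, y = make_classification(n_samples=600, n_features=30, n_informative=8, random_state=0)
X = X * np.linspace(0.05, 1.0, X.shape[1])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for t in [0.0, 0.01, 0.02, 0.05]:
    # The selector is fitted on the training split only, inside the pipeline.
    pipe = make_pipeline(VarianceThreshold(threshold=t), LogisticRegression(max_iter=1000))
    pipe.fit(X_tr, y_tr)
    n_kept = int(pipe[0].get_support().sum())
    results[t] = (n_kept, pipe.score(X_te, y_te))
    print(f"threshold={t:<5} kept={n_kept:<3} test accuracy={results[t][1]:.3f}")
```

Comparing the kept-feature count against test accuracy shows where raising the threshold starts to cost performance.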

### 6. **Quasi-constant Check Alternative**

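* If a feature has the same value in more than 98% of rows, it is quasi-constant and may not be worth keeping. For a binary feature, this corresponds to a variance below 0.98 × 0.02 ≈ 0.0196.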
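
This check can also be done by counting values directly instead of going through variance. A minimal sketch, where `quasi_constant_columns` is a hypothetical helper written for illustration and the toy frame is made up:

```python
import pandas as pd

def quasi_constant_columns(frame, max_share=0.98):
    """Return columns whose most frequent value covers more than max_share of rows."""
    flagged = []
    for col in frame.columns:
        # value_counts sorts by frequency, so the first entry is the dominant value.
        top_share = frame[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > max_share:
            flagged.append(col)
    return flagged

# Hypothetical toy data: one column is 99% zeros, the other is evenly split.
toy = pd.DataFrame({
    "mostly_zero": [0] * 99 + [1],
    "balanced": [0, 1] * 50,
})

print(quasi_constant_columns(toy))  # ['mostly_zero']
```

Unlike a variance threshold, this version does not depend on the feature's scale and also works for non-numeric columns.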

### 7. **Beware of Sparse Data**

* In text classification or one-hot encoded datasets, many features may be sparse but still valuable. Don't drop them on the basis of low variance alone.