## Bernoulli Variance Feature Selection

A random variable follows the Bernoulli distribution when it takes the value 1 with probability $p$ and the value 0 with probability $q = 1 - p$. Less formally, it can be thought of as a model for the set of possible outcomes of any single experiment that asks a True/False or yes/no question.

$$ P(X=1) = p = 1-P(X=0) = 1-q$$

The variance of the Bernoulli distribution is:

$$ Var(X) = p(1-p) = pq$$
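
As a quick sanity check of the variance formula, the following sketch compares $p(1-p)$ with the variance reported by `scipy.stats.bernoulli`. The probabilities used are illustrative values chosen for this demonstration only.

```python
import numpy as np
from scipy.stats import bernoulli

# illustrative success probabilities (assumed values for demonstration)
probs = np.array([0.30, 0.50, 0.70, 0.90])

# closed-form variance p * (1 - p)
closed_form = probs * (1 - probs)

# variance computed by scipy for the same probabilities
scipy_var = bernoulli.var(probs)

for p, cf, sv in zip(probs, closed_form, scipy_var):
    print(f"p = {p:.2f}  p(1-p) = {cf:.4f}  scipy = {sv:.4f}")
```

The variance peaks at $p = 0.5$ (where it equals 0.25) and shrinks toward 0 as $p$ approaches 0 or 1, which is why a variance threshold filters out near-constant binary features.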

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import bernoulli
from sklearn.feature_selection import VarianceThreshold

In [2]:
# create bernoulli variables with varying variance
v_1 = bernoulli.rvs(0.30, size=10, random_state = 101)
v_2 = bernoulli.rvs(0.50, size=10, random_state = 101)
v_3 = bernoulli.rvs(0.70, size=10, random_state = 101)
v_4 = bernoulli.rvs(0.90, size=10, random_state = 101)

In [3]:
# prepare the data for a dataframe: one column per vector, one row per sample
matrix = np.column_stack([v_1, v_2, v_3, v_4])

In [4]:
# create dataframe
df = pd.DataFrame(matrix, columns = ['Vector1', 'Vector2','Vector3','Vector4'])

# evaluate the data
df

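|   | Vector1 | Vector2 | Vector3 | Vector4 |
|---|---------|---------|---------|---------|
| 0 | 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 | 1 |
| 2 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 1 | 1 |
| 4 | 0 | 1 | 1 | 1 |
| 5 | 1 | 1 | 0 | 1 |
| 6 | 0 | 0 | 1 | 1 |
| 7 | 1 | 1 | 0 | 1 |
| 8 | 1 | 1 | 0 | 1 |
| 9 | 0 | 0 | 1 | 1 |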


In [5]:
# calculate the variance for each vector
df.var()

Vector1    0.233333
Vector2    0.266667
Vector3    0.233333
Vector4    0.000000
dtype: float64
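
Note that `DataFrame.var()` defaults to the sample variance (`ddof=1`), while `VarianceThreshold` works with the population variance (`ddof=0`). A minimal sketch of the difference, with the values from the dataframe above hard-coded so the block runs on its own:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# values copied from the dataframe shown above
df = pd.DataFrame({
    'Vector1': [0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
    'Vector2': [1, 1, 0, 0, 1, 1, 0, 1, 1, 0],
    'Vector3': [1, 1, 1, 1, 1, 0, 1, 0, 0, 1],
    'Vector4': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})

sample_var = df.var()            # pandas default: ddof=1 (divides by n - 1)
population_var = df.var(ddof=0)  # divides by n

# variances_ holds the per-feature variances computed by the selector
selector = VarianceThreshold().fit(df)
selector_var = pd.Series(selector.variances_, index=df.columns)

print(pd.DataFrame({
    'pandas ddof=1': sample_var,
    'pandas ddof=0': population_var,
    'VarianceThreshold': selector_var,
}))
```

With only 10 samples the two definitions differ by a factor of 10/9, so the values the selector compares against the threshold are somewhat smaller than those printed by `df.var()`.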

<b>Threshold:</b> Features with a training-set variance lower than this threshold will be removed. The default (`threshold=0.0`) keeps all features with non-zero variance, i.e. it removes only the features that have the same value in all samples.
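
To see how the threshold changes the selection, the sketch below fits `VarianceThreshold` with three values: the default `0.0` and two illustrative cut-offs (`0.1875` and `0.22`, both chosen here for demonstration). The data are hard-coded from the dataframe shown earlier so the block is self-contained.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# values copied from the dataframe shown earlier
df = pd.DataFrame({
    'Vector1': [0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
    'Vector2': [1, 1, 0, 0, 1, 1, 0, 1, 1, 0],
    'Vector3': [1, 1, 1, 1, 1, 0, 1, 0, 0, 1],
    'Vector4': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})

kept = {}
for threshold in (0.0, 0.1875, 0.22):
    # fit a fresh selector for each candidate threshold
    selector = VarianceThreshold(threshold=threshold).fit(df)
    # get_support() is a boolean mask over the input columns
    kept[threshold] = list(df.columns[selector.get_support()])
    print(f"threshold = {threshold}: kept {kept[threshold]}")
```

By default only the constant column is dropped. Raising the threshold past 0.21, the population variance of `Vector1` and `Vector3`, removes those two columns as well. If the threshold is so high that no feature passes, `fit` raises a `ValueError`.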

In [6]:
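# calculate the desired threshold: p(1 - p) with p = 0.75, so a feature
# that takes the same value in more than 75% of samples falls below it
pq = 0.75 * (1 - 0.75)

print('Threshold is', pq)

# run threshold by variance
thresholder = VarianceThreshold(threshold=pq)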

# Fit the transformer
thresholder.fit(df)

# show the outcome
print('Support:', thresholder.get_support())

# get the vector names being kept in the model
print('kept features:', thresholder.get_feature_names_out())

Threshold is 0.1875
Support: [ True  True  True False]
kept features: ['Vector1' 'Vector2' 'Vector3']
