# Feature Selection

Feature selection is the process of choosing the features that contribute most to a model's predictions. It attempts to identify the noise in the data and remove it.
The goal of feature selection is to improve machine learning capabilities by:
- increasing the predictive power
- reducing the time cost

Feature selection methods:
- Statistical
- Model-based

This separation may not fully capture the complexity of the science and art of feature selection, but it is enough to drive real and actionable results in our machine learning pipeline.

- Statistical-based feature selection will rely heavily on statistical tests that are separate from our machine learning
models in order to select features during the training phase of our pipeline.

- Model-based selection relies on a preprocessing step that involves training a secondary machine learning model and using that model's predictive power to select features.

Both of these types of feature selection attempt to reduce the size of our data by subsetting from our original features only the best ones with the highest predictive power.
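
As a rough illustration of the model-based idea, the sketch below uses scikit-learn's `SelectFromModel` with a decision tree as the secondary model. The synthetic data from `make_classification`, the `threshold='mean'` setting and the `*_demo` variable names are assumptions made for illustration; they are not part of the credit card example that follows.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): 10 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Train a secondary model, then keep the features whose importance
# is at least the mean importance across all features
selector = SelectFromModel(DecisionTreeClassifier(random_state=0), threshold='mean')
X_demo_reduced = selector.fit_transform(X_demo, y_demo)

print('Original number of features:', X_demo.shape[1])
print('Reduced number of features:', X_demo_reduced.shape[1])
print('Kept column indices:', selector.get_support(indices=True))
```

The reduced matrix can then be passed to the main model in place of the full feature matrix.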

Improved performance of machine learning algorithms is measured by "metrics" and "meta-metrics".<br/>
<u>Metrics for classification tasks</u>:
- Accuracy
- True and false positive rate
- Sensitivity and specificity
- False negative and false positive rate
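
To make these definitions concrete, here is a small worked example that derives each classification metric from the four cells of a binary confusion matrix. The counts are hypothetical numbers chosen for illustration only.

```python
# Hypothetical confusion-matrix counts for a binary classifier (illustrative numbers only)
tp, fn = 40, 10    # actual positives: predicted correctly / incorrectly
tn, fp = 130, 20   # actual negatives: predicted correctly / incorrectly

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)            # true positive rate
specificity = tn / (tn + fp)            # true negative rate
false_positive_rate = fp / (fp + tn)    # equals 1 - specificity
false_negative_rate = fn / (fn + tp)    # equals 1 - sensitivity

print("Accuracy: {:.3f}".format(accuracy))
print("Sensitivity (TPR): {:.3f}".format(sensitivity))
print("Specificity (TNR): {:.3f}".format(specificity))
print("False positive rate: {:.3f}".format(false_positive_rate))
print("False negative rate: {:.3f}".format(false_negative_rate))
```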

<u>Metrics for regression</u>:
- RMSE (root mean squared error)
- Mean absolute error
- R-squared
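
The regression metrics can be computed by hand in a few lines. The sketch below uses hypothetical targets and predictions (illustrative numbers only) and plain Python, so that each formula is visible.

```python
import math

# Hypothetical targets and predictions (illustrative numbers only)
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# Mean absolute error: average size of the errors
mae = sum(abs(e) for e in errors) / n

# Root mean squared error: square root of the average squared error
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# R-squared: 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r_squared = 1 - ss_res / ss_tot

print("MAE: {:.3f}".format(mae))
print("RMSE: {:.3f}".format(rmse))
print("R-squared: {:.3f}".format(r_squared))
```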

<u>Meta-metrics</u>:
- Time it takes to fit/train data
- Time it takes to predict new instances
- Data size if it needs to be persisted

In [5]:
# Generic function to evaluate models and produce metrics
from sklearn.model_selection import GridSearchCV

def get_best_model_and_accuracy(model, params, X, y):
    grid = GridSearchCV(model, params, error_score=0)
    grid.fit(X, y)
    
    print("Best Accuracy: {}".format(grid.best_score_))
    print("Best Parameters: {}".format(grid.best_params_))
    print("Average time to fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    print("Average time to score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))

In [6]:
import pandas as pd
import numpy as np

np.random.seed(123)

credit_card_default = pd.read_csv('Data/credit_card_default.csv')
credit_card_default.shape

(30000, 24)

In [7]:
credit_card_default.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [8]:
# Transpose results for viewing
credit_card_default.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
LIMIT_BAL,30000.0,167484.322667,129747.661567,10000.0,50000.0,140000.0,240000.0,1000000.0
SEX,30000.0,1.603733,0.489129,1.0,1.0,2.0,2.0,2.0
EDUCATION,30000.0,1.853133,0.790349,0.0,1.0,2.0,2.0,6.0
MARRIAGE,30000.0,1.551867,0.52197,0.0,1.0,2.0,2.0,3.0
AGE,30000.0,35.4855,9.217904,21.0,28.0,34.0,41.0,79.0
PAY_0,30000.0,-0.0167,1.123802,-2.0,-1.0,0.0,0.0,8.0
PAY_2,30000.0,-0.133767,1.197186,-2.0,-1.0,0.0,0.0,8.0
PAY_3,30000.0,-0.1662,1.196868,-2.0,-1.0,0.0,0.0,8.0
PAY_4,30000.0,-0.220667,1.169139,-2.0,-1.0,0.0,0.0,8.0
PAY_5,30000.0,-0.2662,1.133187,-2.0,-1.0,0.0,0.0,8.0


In [9]:
# Check for missing values
# Notice that there are no missing values

credit_card_default.isnull().sum()

LIMIT_BAL                     0
SEX                           0
EDUCATION                     0
MARRIAGE                      0
AGE                           0
PAY_0                         0
PAY_2                         0
PAY_3                         0
PAY_4                         0
PAY_5                         0
PAY_6                         0
BILL_AMT1                     0
BILL_AMT2                     0
BILL_AMT3                     0
BILL_AMT4                     0
BILL_AMT5                     0
BILL_AMT6                     0
PAY_AMT1                      0
PAY_AMT2                      0
PAY_AMT3                      0
PAY_AMT4                      0
PAY_AMT5                      0
PAY_AMT6                      0
default payment next month    0
dtype: int64

In [10]:
# Create feature matrix
X = credit_card_default.drop('default payment next month', axis=1)

# Create response variable
y = credit_card_default['default payment next month']

In [11]:
# Null accuracy (always predicting the majority class, 0) is 77.88%; this is the baseline our models need to beat
y.value_counts(normalize=True)

0    0.7788
1    0.2212
Name: default payment next month, dtype: float64
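
The null accuracy can also be obtained from a baseline model. The sketch below assumes scikit-learn's `DummyClassifier` with the `most_frequent` strategy and uses toy labels (`y_toy`, `X_toy` are hypothetical names and values), not the credit card data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (an assumption for illustration): 7 negatives and 3 positives
y_toy = np.array([0] * 7 + [1] * 3)
X_toy = np.zeros((len(y_toy), 1))  # the features are ignored by this strategy

# A baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

null_accuracy = baseline.score(X_toy, y_toy)
print('Null accuracy:', null_accuracy)
```

Any real model should score above this value to be worth its fitting cost.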

In [12]:
# Import learning models upon which we select best one (Spot check)

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

lr_params = {'C': [1e-1, 1e0, 1e1, 1e2], 'penalty': ['l1', 'l2']}
knn_params = {'n_neighbors': [1, 3, 5, 7]}
tree_params = {'max_depth': [None, 1, 3, 5, 7]}
forest_params = {'n_estimators': [10, 50, 100], 'max_depth': [None, 1, 3, 5, 7]}

In [13]:
lr = LogisticRegression(solver='liblinear')
knn = KNeighborsClassifier()
d_tree = DecisionTreeClassifier()
forest = RandomForestClassifier()

In [14]:
# Logistic Regression model
# Note that the accuracy 80.95 % > Null Accuracy (77.88%)
# Fit time = 0.921
# Predict time = 0.003

get_best_model_and_accuracy(lr, lr_params, X, y)



Best Accuracy: 0.8095333333333333
Best Parameters: {'C': 0.1, 'penalty': 'l1'}
Average time to fit (s): 0.918
Average time to score (s): 0.003


In [15]:
# KNN (without scaling)
# Note that the accuracy is 76.02 %, which is below the null accuracy (77.88 %)
# Fit time = 0.043
# Predict time = 0.885

get_best_model_and_accuracy(knn, knn_params, X, y)



Best Accuracy: 0.7602333333333333
Best Parameters: {'n_neighbors': 7}
Average time to fit (s): 0.039
Average time to score (s): 0.885


In [16]:
# KNN is a distance-based model: it relies on a measure of closeness in feature space, which assumes that all the features
# are on the same scale, and we already know that our data is not. So, for KNN, we construct a pipeline that applies
# "z-score normalization" to the features before applying KNN

# Notice that the pipeline now beats the null accuracy
#        i.e. accuracy = 80.08 %
#            Fit time = 0.078
#            and predicting time = 9.652 

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

knn_pipe_params = {'classifier__{}'.format(k): v for k, v in knn_params.items()}
print (knn_pipe_params)

knn_pipe = Pipeline([('scale', StandardScaler()), ('classifier', knn)])

get_best_model_and_accuracy(knn_pipe, knn_pipe_params, X, y)


{'classifier__n_neighbors': [1, 3, 5, 7]}



Best Accuracy: 0.8008
Best Parameters: {'classifier__n_neighbors': 7}
Average time to fit (s): 0.076
Average time to score (s): 7.719
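
To see what the scaling step in the pipeline above does, the sketch below computes z-scores by hand and compares them with `StandardScaler`. The toy matrix `X_scale_demo` is a hypothetical example with two columns on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (illustrative values only)
X_scale_demo = np.array([[20000.0, 24.0],
                         [120000.0, 26.0],
                         [90000.0, 34.0],
                         [50000.0, 37.0]])

# z-score: subtract the column mean, then divide by the column standard deviation
z_manual = (X_scale_demo - X_scale_demo.mean(axis=0)) / X_scale_demo.std(axis=0)

# StandardScaler applies the same transformation (population standard deviation, ddof=0)
z_sklearn = StandardScaler().fit_transform(X_scale_demo)

print('Manual and StandardScaler results match:', np.allclose(z_manual, z_sklearn))
print('Column means after scaling:', z_sklearn.mean(axis=0).round(6))
print('Column stds after scaling:', z_sklearn.std(axis=0).round(6))
```

After scaling, both columns contribute comparably to the distances that KNN computes.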




In [17]:
# Decision Tree model
# Note the metrics
#     Accuracy = 82.02 %
#     Fit Time = 0.172
#     Prediction Time = 0.002

get_best_model_and_accuracy(d_tree, tree_params, X, y)



Best Accuracy: 0.8202666666666667
Best Parameters: {'max_depth': 3}
Average time to fit (s): 0.166
Average time to score (s): 0.004


In [18]:
# Random Forest model
# Note the metrics
#     Accuracy = 81.89 %
#     Fit Time = 1.225
#     Prediction Time = 0.073

get_best_model_and_accuracy(forest, forest_params, X, y)



Best Accuracy: 0.8195666666666667
Best Parameters: {'max_depth': 7, 'n_estimators': 50}
Average time to fit (s): 1.127
Average time to score (s): 0.065


In [19]:
# Summary of model performance

# Model                Accuracy         Fit Time       Predict Time
# =================================================================
# Logistic Regression   80.95            0.921         0.003
# KNN (with scaling)    80.08            0.078(Best)   9.652
# Decision Tree         82.02(Best)      0.172         0.002 (Best)
# Random Forest         81.89            1.225         0.073
# =================================================================

#### Statistical-based Feature Selection

Statistical-based feature selection will rely heavily on statistical tests that are separate from our machine learning
models in order to select features during the training phase of our pipeline.

Statistics provides us with relatively quick and easy methods of interpreting both quantitative and qualitative data. We used some statistical measures in previous chapters to gain new knowledge and perspective on our data; specifically, we used the mean and standard deviation to calculate z-scores and scale our data. We will look at the following univariate methods of feature selection:
- Pearson correlations
- Hypothesis testing

They are quick and handy when the task is to evaluate a single feature at a time in order to create a better dataset for our machine learning pipeline.

- Pearson Correlation<br>
The Pearson correlation coefficient (which is the default for pandas) measures the linear relationship between columns. The value of the coefficient varies between -1 and +1, where 0 implies no linear correlation between them. The closer a correlation is to -1 or +1, the stronger the linear relationship.
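
As a small worked example, the sketch below computes the coefficient from its definition (covariance divided by the product of the standard deviations) and checks it against `DataFrame.corr()`. The toy columns and the helper name `pearson_r` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy columns (illustrative values only)
df_toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y_linear': [2.0, 4.0, 6.0, 8.0, 10.0],    # exactly 2 * x
    'y_inverse': [10.0, 8.0, 6.0, 4.0, 2.0],   # exactly 12 - 2 * x
})

def pearson_r(a, b):
    """Pearson r: covariance of a and b divided by the product of their standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

r_linear = pearson_r(df_toy['x'], df_toy['y_linear'])
r_inverse = pearson_r(df_toy['x'], df_toy['y_inverse'])

print('r(x, y_linear):', r_linear)
print('r(x, y_inverse):', r_inverse)
print(df_toy.corr())
```

A perfectly increasing linear relationship gives +1 and a perfectly decreasing one gives -1; real features fall somewhere in between.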

In [20]:
# Let us select features with the help of Pearson's correlation
# The correlation matrix below includes the response variable, so it shows both the correlations
# between features and each feature's correlation with the response variable

credit_card_default.corr()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
LIMIT_BAL,1.0,0.024755,-0.219161,-0.108139,0.144713,-0.271214,-0.296382,-0.286123,-0.26746,-0.249411,...,0.293988,0.295562,0.290389,0.195236,0.178408,0.210167,0.203242,0.217202,0.219595,-0.15352
SEX,0.024755,1.0,0.014232,-0.031389,-0.090874,-0.057643,-0.070771,-0.066096,-0.060173,-0.055064,...,-0.02188,-0.017005,-0.016733,-0.000242,-0.001391,-0.008597,-0.002229,-0.001667,-0.002766,-0.039961
EDUCATION,-0.219161,0.014232,1.0,-0.143464,0.175061,0.105364,0.121566,0.114025,0.108793,0.09752,...,-0.000451,-0.007567,-0.009099,-0.037456,-0.030038,-0.039943,-0.038218,-0.040358,-0.0372,0.028006
MARRIAGE,-0.108139,-0.031389,-0.143464,1.0,-0.41417,0.019917,0.024199,0.032688,0.033122,0.035629,...,-0.023344,-0.025393,-0.021207,-0.005979,-0.008093,-0.003541,-0.012659,-0.001205,-0.006641,-0.024339
AGE,0.144713,-0.090874,0.175061,-0.41417,1.0,-0.039447,-0.050148,-0.053048,-0.049722,-0.053826,...,0.051353,0.049345,0.047613,0.026147,0.021785,0.029247,0.021379,0.02285,0.019478,0.01389
PAY_0,-0.271214,-0.057643,0.105364,0.019917,-0.039447,1.0,0.672164,0.574245,0.538841,0.509426,...,0.179125,0.180635,0.17698,-0.079269,-0.070101,-0.070561,-0.064005,-0.05819,-0.058673,0.324794
PAY_2,-0.296382,-0.070771,0.121566,0.024199,-0.050148,0.672164,1.0,0.766552,0.662067,0.62278,...,0.222237,0.221348,0.219403,-0.080701,-0.05899,-0.055901,-0.046858,-0.037093,-0.0365,0.263551
PAY_3,-0.286123,-0.066096,0.114025,0.032688,-0.053048,0.574245,0.766552,1.0,0.777359,0.686775,...,0.227202,0.225145,0.222327,0.001295,-0.066793,-0.053311,-0.046067,-0.035863,-0.035861,0.235253
PAY_4,-0.26746,-0.060173,0.108793,0.033122,-0.049722,0.538841,0.662067,0.777359,1.0,0.819835,...,0.245917,0.242902,0.239154,-0.009362,-0.001944,-0.069235,-0.043461,-0.03359,-0.026565,0.216614
PAY_5,-0.249411,-0.055064,0.09752,0.035629,-0.053826,0.509426,0.62278,0.686775,0.819835,1.0,...,0.271915,0.269783,0.262509,-0.006089,-0.003191,0.009062,-0.058299,-0.033337,-0.023027,0.204149


In [21]:
# Let us keep the features whose absolute correlation with the response variable exceeds 0.2 (a threshold we set),
# and drop the response variable itself from the list (its correlation with itself is 1)

highly_correlated_features = credit_card_default.columns[credit_card_default.corr()['default payment next month'].abs() > 0.2].drop('default payment next month')
highly_correlated_features

Index(['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5'], dtype='object')

In [22]:
# Let us check the metrics when modeling with only these 5 features
# Note that the accuracy (81.97 %) is slightly lower than the 82.03 % obtained with all features, but the fit time is roughly 20 times shorter

X_subsetted = X[highly_correlated_features]
get_best_model_and_accuracy(d_tree, tree_params, X_subsetted, y)



Best Accuracy: 0.8196666666666667
Best Parameters: {'max_depth': 3}
Average time to fit (s): 0.009
Average time to score (s): 0.003


In [23]:
# Let us improve the CustomCorrelationChooser to include logic of selecting best features using correlation
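from sklearn.base import TransformerMixin, BaseEstimator

class CustomCorrelationChooser(TransformerMixin, BaseEstimator):
    # cols_to_keep defaults to None rather than [] to avoid a shared mutable default argument
    def __init__(self, response, cols_to_keep=None, threshold=None):
        self.response = response
        self.threshold = threshold
        self.cols_to_keep = cols_to_keep
        
    def transform(self, X):
        # keep only the columns chosen during fit
        return X[self.cols_to_keep]

    def fit(self, X, *_):
        # append the response as the last column so that we can correlate against it
        df = pd.concat([X, self.response], axis=1)
        # columns whose absolute correlation with the response exceeds the threshold
        self.cols_to_keep = df.columns[df.corr()[df.columns[-1]].abs() > self.threshold]
        # drop the response column itself from the selection
        self.cols_to_keep = [c for c in self.cols_to_keep if c in X.columns]
        return self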


In [24]:
ccc = CustomCorrelationChooser(threshold = 0.2, response=y)
ccc.fit(X)
ccc.cols_to_keep

['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5']

In [25]:
# Notice that the transform method retains only the features whose absolute correlation with the response exceeds 0.2
ccc.transform(X).head()

Unnamed: 0,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5
0,2,2,-1,-1,-2
1,-1,2,0,0,0
2,0,0,0,0,0
3,0,0,0,0,0
4,-1,0,-1,0,0


In [26]:
tree_pipe_params = {'classifier__max_depth': 
                    [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}

In [27]:
# Our pipeline is showing us that if we threshold at 0.1, we have eliminated noise enough to improve accuracy and
# also cut down on the fitting time (from .158 seconds without the selector).
# Best Accuracy: 0.8206
# Best Parameters:
#   classifier__max_depth: 5
#   correlation_select__threshold: 0.1
# Average time to fit (s): 0.154
# Average time to score (s): 0.004
#

import copy

# instantiate our feature selector with the response variable set
ccc = CustomCorrelationChooser(response=y)

# make our new pipeline, including the selector
ccc_pipe = Pipeline([('correlation_select', ccc), 
                     ('classifier', d_tree)])

# make a copy of the decision tree pipeline parameters
ccc_pipe_params = copy.deepcopy(tree_pipe_params)

# update that dictionary with feature selector specific parameters
ccc_pipe_params.update({'correlation_select__threshold':[0, .1, .2, .3]})

print (ccc_pipe_params)

# slightly better than the original, and a bit faster on
# average overall
get_best_model_and_accuracy(ccc_pipe, ccc_pipe_params, X, y)  

{'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21], 'correlation_select__threshold': [0, 0.1, 0.2, 0.3]}




Best Accuracy: 0.8206
Best Parameters: {'classifier__max_depth': 5, 'correlation_select__threshold': 0.1}
Average time to fit (s): 0.15
Average time to score (s): 0.004


In [28]:
# Examine which columns are selected for above metrics of threshold of 0.1
# It appears that our selector has decided to keep the five columns that we found, as well as two more, 
# viz., the LIMIT_BAL and the PAY_6 columns.
# Great! This is the beauty of automated pipeline gridsearching in scikit-learn. It allows our models to do what they
# do best and intuit things that we could not have on our own

ccc = CustomCorrelationChooser(threshold=0.1, response=y)
ccc.fit(X)

ccc.cols_to_keep

['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

- Hypothesis Testing<br/>
A hypothesis test is a statistical test used to decide whether a certain condition can be assumed to hold for an entire population, given a data sample. We usually base this conclusion on a p-value (a number between 0 and 1), which we compare against a chosen significance level. <br/> In the case of feature selection, the null hypothesis is "The feature has no relevance to the response variable". We test this by measuring the feature's relationship with the response variable (for example, its correlation). <br/> - If the relationship is weak, the data give us no grounds to reject the null hypothesis, so we treat the feature as having no relevance to the response variable. <br/> - If the relationship is strong, we reject the null hypothesis in favor of the alternative hypothesis, that the feature has some relevance. <br/>We repeat this exercise for every feature against the response variable.<br/><br/>
A p-value is a number between 0 and 1 that represents the probability of observing data at least as extreme as ours if the null hypothesis were true.<br/>
Simply put, the lower the p-value, the stronger the evidence against the null hypothesis. For our purposes, the smaller the p-value, the stronger the evidence that the feature has some relevance to our response variable and that we should keep it.
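
To build some intuition for p-values before applying them to our data, the sketch below runs scikit-learn's `f_classif` on two synthetic features: one that is relevant to the class by construction and one that is pure noise. The data and the names `X_anova` and `y_anova` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.RandomState(0)
n = 200
y_anova = np.array([0] * (n // 2) + [1] * (n // 2))

# Feature 0: the class means differ by 3 standard deviations (relevant by construction)
informative = rng.normal(loc=3.0 * y_anova, scale=1.0)
# Feature 1: drawn independently of the class labels (noise by construction)
noise = rng.normal(loc=0.0, scale=1.0, size=n)

X_anova = np.column_stack([informative, noise])

# f_classif returns one F-statistic and one p-value per feature
f_scores, p_vals = f_classif(X_anova, y_anova)

print('F-statistics:', f_scores)
print('p-values:', p_vals)
```

The relevant feature should receive a much larger F-statistic and a much smaller p-value than the noise feature, which is the pattern we look for when deciding which features to keep.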

In [29]:
# SelectKBest selects features according to the k highest scores of a given scoring function
from sklearn.feature_selection import SelectKBest

# f_classif computes the ANOVA F-statistic (and its p-value) between each feature and the class labels
from sklearn.feature_selection import f_classif
# f_classif accepts negative feature values; not all scoring functions do
# chi2 is another very common scoring function for classification, but it only accepts non-negative values

# regression has its own statistical tests
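
The non-negativity requirement of `chi2` is easy to check. The sketch below uses toy data with negative values (an assumption for illustration) and shows one possible workaround, rescaling each column to the [0, 1] range with `MinMaxScaler`; whether chi-squared scores are meaningful on rescaled continuous data depends on the problem.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Toy data containing negative values (illustrative values only)
X_neg = np.array([[-2.0, 1.0, 0.0],
                  [-1.0, 3.0, 5.0],
                  [2.0, 0.0, 1.0],
                  [3.0, 2.0, 4.0]])
y_neg = np.array([0, 0, 1, 1])

# chi2 raises a ValueError when the input contains negative values
try:
    chi2(X_neg, y_neg)
    chi2_failed = False
except ValueError as err:
    chi2_failed = True
    print('chi2 rejected the input:', err)

# One possible workaround: rescale every column to the [0, 1] range first
X_scaled = MinMaxScaler().fit_transform(X_neg)
X_top2 = SelectKBest(chi2, k=2).fit_transform(X_scaled, y_neg)
print('Shape after selection:', X_top2.shape)
```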

In [30]:
#  keep only the best five features according to p-values of ANOVA test
k_best = SelectKBest(f_classif, k=5)

In [31]:
# matrix after selecting the top 5 features
k_best.fit_transform(X, y)

array([[ 2,  2, -1, -1, -2],
       [-1,  2,  0,  0,  0],
       [ 0,  0,  0,  0,  0],
       ...,
       [ 4,  3,  2, -1,  0],
       [ 1, -1,  0,  0,  0],
       [ 0,  0,  0,  0,  0]], dtype=int64)

In [32]:
# p values of columns
k_best.pvalues_

p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value')
p_values.head()

Unnamed: 0,column,p_value
5,PAY_0,0.0
6,PAY_2,0.0
7,PAY_3,0.0
8,PAY_4,1.899297e-315
9,PAY_5,1.126608e-279


In [33]:
# features with a low p value
p_values[p_values['p_value'] < .05]

Unnamed: 0,column,p_value
5,PAY_0,0.0
6,PAY_2,0.0
7,PAY_3,0.0
8,PAY_4,1.899297e-315
9,PAY_5,1.126608e-279
10,PAY_6,7.29674e-234
0,LIMIT_BAL,1.302244e-157
17,PAY_AMT1,1.146488e-36
18,PAY_AMT2,3.166657e-24
20,PAY_AMT4,6.830942000000001e-23


In [34]:
# features with a high p value
p_values[p_values['p_value'] >= .05]

Unnamed: 0,column,p_value
14,BILL_AMT4,0.078556
15,BILL_AMT5,0.241634
16,BILL_AMT6,0.352123


In [35]:
# Let's use our SelectKBest in a pipeline to see if we can grid search our way into a better machine learning pipeline

k_best = SelectKBest(f_classif)

# Make a new pipeline with SelectKBest
select_k_pipe = Pipeline([('k_best', k_best), 
                          ('classifier', d_tree)])

select_k_best_pipe_params = copy.deepcopy(tree_pipe_params)

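# 'all' is excluded from the grid here: in Python 3, range(1, 23) + ['all'] raises a TypeError
# (list(range(1, 23)) + ['all'] would be the working equivalent)
# select_k_best_pipe_params.update({'k_best__k': list(range(1, 23)) + ['all']})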
                                 
select_k_best_pipe_params.update({'k_best__k':range(1, 23), })

print (select_k_best_pipe_params)
# comparable to our results with correlationchooser
get_best_model_and_accuracy(select_k_pipe, select_k_best_pipe_params, X, y)

{'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21], 'k_best__k': range(1, 23)}




Best Accuracy: 0.8206
Best Parameters: {'classifier__max_depth': 5, 'k_best__k': 7}
Average time to fit (s): 0.104
Average time to score (s): 0.003


In [36]:
# Let's see which columns our tests are selecting for us

k_best = SelectKBest(f_classif, k=7)

In [37]:
# lowest 7 p values match what our custom correlationchooser chose before
# ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

p_values.head(7)

Unnamed: 0,column,p_value
5,PAY_0,0.0
6,PAY_2,0.0
7,PAY_3,0.0
8,PAY_4,1.899297e-315
9,PAY_5,1.126608e-279
10,PAY_6,7.29674e-234
0,LIMIT_BAL,1.302244e-157


#### Model-based Feature Selection

In [38]:
from sklearn import datasets 
from sklearn.model_selection import cross_val_score
from sklearn import svm

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores 

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

In [39]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)


In [40]:
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [41]:
# Load iris data
iris = load_iris()

# Create features and target
X = iris.data
y = iris.target

# Convert to categorical data by converting data to integers
X = X.astype(int)

In [42]:
# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
X_kbest = chi2_selector.fit_transform(X, y)

In [43]:
# Show results
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_kbest.shape[1])

Original number of features: 4
Reduced number of features: 2


In [44]:
X_kbest

array([[1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [1, 0],
       [4, 1],
       [4, 1],
       [4, 1],
       [4, 1],
       [4, 1],
       [4, 1],
       [4, 1],
       [3, 1],
       [4, 1],
       [3, 1],
       [3, 1],
       [4, 1],
       [4, 1],
       [4, 1],
       [3, 1],
       [4, 1],
       [4,