# Big Idea

- We create an artificial dataset and use it to test common feature selection algorithms.
- We expect that the common algorithms work well to select features that give high accuracy on balanced datasets.  
- Our dataset is highly imbalanced, and we care about recall, not accuracy.  
- Is there a feature selection algorithm that will select the most useful features for our problem?
- Write a paper where we compare the efficacy of different approaches to feature selection, and apply them to our dataset and other well-known datasets.  

# Vocabulary Notes

- Confusion Matrix


<table>
    <tr> <td> <td>     <td colspan="2"> Prediction
    <tr> <td> <td> <th> N <th> P 
    <tr> <td rowspan='2'> Actual <th scope="row"> N <td> TN <td> FP 
    <tr> <th scope="row"> P <td> FN <td> TP 
</table>

- Accuracy is the proportion of predictions that are correct.

$$\frac{TP + TN}{TN + TP + FN + FP}$$

- Precision is the proportion of the things we predicted as positive that are actually positive.

$$\frac{TP}{TP + FP}$$

- Recall is the proportion of the things that are actually positive that we predicted as positive.

$$\frac{TP}{TP + FN}$$

- f1 is the harmonic mean of precision and recall.

$$f1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
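
As a quick check of the four definitions above, here is a minimal sketch in plain Python. The function name `metrics_from_counts` and the example counts are illustrative assumptions, not part of the notebook.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Return accuracy, precision, recall, and f1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical counts: 90 actual negatives and 10 actual positives.
acc, pre, rec, f1 = metrics_from_counts(tn=72, fp=18, fn=2, tp=8)
print(f"accuracy={acc:.3f} precision={pre:.3f} recall={rec:.3f} f1={f1:.3f}")
```

With these counts the accuracy is 0.80 even though fewer than a third of the predicted positives are correct, which is the kind of gap between accuracy and the other metrics that matters on imbalanced data.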

# Method

- We will create a target ($y$-values) dataframe that can be balanced or imbalanced.  
- The dataset will have $n$ records.
- *Balanced* would mean that half of the elements of $y$ are $True$, half $False$.
- *Imbalanced* with parameter $imb$ would mean that $n \times imb$ elements would be $True$, and $n \times (1 - imb)$ would be $False$.  If $imb = 0.5$, then the set is balanced.  In our crash dataset, if we're looking to predict fatal crashes, $imb \approx 0.004$.

- Our *confusion matrix* for $y$ on itself is:

<table>
    <tr> <td> <td>     <td colspan="2"> Prediction
    <tr> <td> <td> <th> N <th> P 
    <tr> <td rowspan='2'> Actual <th scope="row"> N <td> n * (1 - imb) <td> 0 
    <tr> <th scope="row"> P <td> 0 <td> n * imb 
</table>
        

- We will create a dataframe of features ($x$-values), each starting with the ground truth values and randomly swapping True values to False with some probability, and/or swapping the False values to True with some probability, changing the Accuracy, Precision, and/or Recall.  

- For feature $x^{(i)}$, choose a parameter $p$ for modifying the $False$ values of $y$.  For each record $j$, choose a random $r_j \in [0,1)$.  If $r_j>p$ and $y_j = False$, then $x^{(i)}_j = True$; otherwise $x^{(i)}_j = y_j$.  In expectation, a $p$ portion of the True Negatives stays unchanged and the remaining $1-p$ portion becomes False Positives.  Now the expected confusion matrix for $x^{(i)}$ on $y$ is:
        
        
<table>
    <tr> <td>  <td>   <td colspan="2"> Prediction
    <tr> <td> <td> <th> N <th> P 
    <tr> <td rowspan='2'> Actual <th scope="row"> N <td> n * (1 - imb) * p <td> n * (1 - imb) * (1-p) 
    <tr> <th scope="row"> P <td> 0 <td> n * imb 
</table>
        
Note that if $p=1$, nothing changes; $p$ is the portion of the negatives that don't change.
        

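- Similarly, use parameter $q$ to swap some True Positive values to False Negative: if $r_j>q$ and $y_j = True$, then $x^{(i)}_j = False$, so $q$ is the portion of the positives that don't change.  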
        
<table>
    <tr> <td> <td>     <td colspan="2"> Prediction
    <tr> <td> <td> <th> N <th> P 
    <tr> <td rowspan='2'> Actual <th scope="row"> N <td> n * (1 - imb) * p <td> n * (1 - imb) * (1-p) 
    <tr> <th scope="row"> P <td> n * imb * (1-q) <td> n * imb * q
</table>
        

- This confusion matrix gives us these values for accuracy, precision, and recall.
        
\begin{align}
    Accuracy &= \frac{TN + TP}{TN + FP + FN + TP} \cr
        &= \frac{n(1-imb)p + n(imb)(q)}{n} \cr
        &= (1-imb)p + imb \cdot q \cr
        \cr
    Precision &= \frac{TP}{TP + FP} \cr
        &= \frac{n \cdot imb \cdot q}{n \cdot imb \cdot q + n \cdot (1-imb) \cdot (1-p)} \cr
        &= \frac{imb \cdot q}{imb \cdot q + (1-imb) \cdot (1-p)} \cr
    \cr
Recall &= \frac{TP}{TP + FN} \cr
        &= \frac{n \cdot imb \cdot q}{n \cdot imb \cdot q + n \cdot imb (1-q)} \cr
        &= \frac{imb \cdot q}{imb \cdot q + imb (1-q)} \cr
        &= \frac{q}{q + (1-q)} \cr
        &= q
\end{align}


        

- We will loop through different values of $p$ and $q$ to give us a variety of sets of values for accuracy, precision, and recall.
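
The closed-form results above can be sketched as a small function. This is a minimal illustration in plain Python; the name `expected_metrics` and the three imbalance levels tried are assumptions made for the example, and the values are expectations that ignore sampling noise.

```python
def expected_metrics(imb, p, q):
    """Expected accuracy, precision, and recall for a feature built with (p, q)."""
    tn = (1 - imb) * p          # negatives left unchanged
    fp = (1 - imb) * (1 - p)    # negatives flipped to True
    fn = imb * (1 - q)          # positives flipped to False
    tp = imb * q                # positives left unchanged
    accuracy = tn + tp          # the four cells sum to 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Same (p, q), three levels of imbalance.
balanced = expected_metrics(imb=0.5, p=0.8, q=0.8)
mild = expected_metrics(imb=0.1, p=0.8, q=0.8)
severe = expected_metrics(imb=0.004, p=0.8, q=0.8)
for name, (acc, pre, rec) in [("balanced", balanced), ("mild", mild), ("severe", severe)]:
    print(f"{name}: accuracy={acc:.3f} precision={pre:.3f} recall={rec:.3f}")
```

With $p = q = 0.8$, accuracy and recall stay at 0.8 for every value of $imb$, while precision falls as the positive class gets rarer. Only precision depends on the imbalance once $p$ and $q$ are equal.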

# Import Libraries

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

import random

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
from sklearn.feature_selection import RFECV
from sklearn.feature_selection import SelectFromModel

from sklearn.ensemble import ExtraTreesClassifier

from sklearn.svm import SVR
from sklearn.svm import LinearSVC

# Set the Parameters

In [2]:
n = 100000
imb = 0.1

# Create the Target

In [3]:
A = [False if i < (1-imb)*n else True for i in range (n)]
D = {'y':A}
data_y = pd.DataFrame(D).astype(bool)
data_y.head()

Unnamed: 0,y
0,False
1,False
2,False
3,False
4,False


# Create $x$-Features

In [4]:
data_x = pd.DataFrame()

## Create x-features Using Different Levels of $p$ and $q$.
- Ignore pairs $(p,q)$ that give accuracy less than 50%, because a fitting algorithm would just use a negative coefficient to swap the predictions.  
- Use SciKit-Learn's metrics to check my calculations of accuracy, precision, and recall.
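
The first bullet relies on the fact that negating a boolean feature with accuracy $a$ gives a feature with accuracy $1-a$. The sketch below checks this on made-up data in plain Python; the names `accuracy`, `truth`, `feature`, and `negated` are illustrative and do not appear elsewhere in the notebook.

```python
import random


def accuracy(truth, feature):
    """Fraction of positions where the feature agrees with the truth."""
    matches = sum(t == f for t, f in zip(truth, feature))
    return matches / len(truth)


rng = random.Random(0)                      # fixed seed so the sketch is repeatable
truth = [i >= 900 for i in range(1000)]     # 10% positives, like imb = 0.1
# A deliberately poor feature: it disagrees with the truth about 70% of the time.
feature = [(not t) if rng.random() < 0.7 else t for t in truth]
negated = [not f for f in feature]

a = accuracy(truth, feature)
b = accuracy(truth, negated)
print(a, b)
```

Every position where `feature` is wrong is a position where `negated` is right, so the two accuracies sum to 1.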

In [5]:
for x in range (100, 45, -20):       # p = 1.0, 0.8, 0.6
    for y in range (100, 0, -20):    # q = 1.0, 0.8, 0.6, 0.4, 0.2
        p = x/100
        q = y/100
        # Expected metrics from the formulas in the Method section
        acc = ((1-imb)*p + imb*q)
        pre = ((imb*q)/( imb*q + (1-imb)*(1-p) ))
        rec = q
        f1 = (2*pre*rec)/(pre+rec)
        # Truncate to whole percentages for use in the column name
        acc = int(acc*100)
        pre = int(pre*100)
        rec = int(rec*100)
        f1 = int(f1*100)

        s = 'Acc_' + str(acc) + '_Pre_' + str(pre) + '_Rec_' + str(rec) + '_f1_' + str(f1)
        noise = np.random.random(n)
        # Skip p = q = 1 (an exact copy of y) and anything below 50% accuracy
        if acc>=50 and acc<100:
            # Flip a negative when noise > p, and a positive when noise > q
            data_x[s] = np.where (
                (
                    ((noise > p ) & (data_y['y']==False)) |
                    ((noise > q ) & (data_y['y']==True))
                ),
                np.logical_not(data_y['y']), data_y['y'] )
            print (p, q, acc, pre, rec)
            # Compare the expected values with the measured ones
            C = confusion_matrix(data_y['y'],data_x[s])
            check_acc = round(accuracy_score(data_y['y'], data_x[s])*100,2)
            check_pre = round(precision_score(data_y['y'], data_x[s])*100,2)
            check_rec = round(recall_score(data_y['y'], data_x[s])*100,2)

            print (s)
            print (C/n*100)
            print(check_acc, check_pre, check_rec)
            print ()
            
            
data_x

1.0 0.8 98 100 80
Acc_98_Pre_100_Rec_80_f1_88
[[90.     0.   ]
 [ 1.994  8.006]]
98.01 100.0 80.06

1.0 0.6 96 100 60
Acc_96_Pre_100_Rec_60_f1_74
[[90.     0.   ]
 [ 4.007  5.993]]
95.99 100.0 59.93

1.0 0.4 94 100 40
Acc_94_Pre_100_Rec_40_f1_57
[[90.     0.   ]
 [ 5.974  4.026]]
94.03 100.0 40.26

1.0 0.2 92 100 20
Acc_92_Pre_100_Rec_20_f1_33
[[90.     0.   ]
 [ 7.977  2.023]]
92.02 100.0 20.23

0.8 1.0 82 35 100
Acc_82_Pre_35_Rec_100_f1_52
[[72.093 17.907]
 [ 0.    10.   ]]
82.09 35.83 100.0

0.8 0.8 80 30 80
Acc_80_Pre_30_Rec_80_f1_44
[[71.991 18.009]
 [ 2.015  7.985]]
79.98 30.72 79.85

0.8 0.6 78 25 60
Acc_78_Pre_25_Rec_60_f1_35
[[72.211 17.789]
 [ 3.949  6.051]]
78.26 25.38 60.51

0.8 0.4 76 18 40
Acc_76_Pre_18_Rec_40_f1_25
[[71.925 18.075]
 [ 5.953  4.047]]
75.97 18.29 40.47

0.8 0.2 74 10 20
Acc_74_Pre_10_Rec_20_f1_13
[[72.041 17.959]
 [ 7.932  2.068]]
74.11 10.33 20.68

0.6 1.0 64 21 100
Acc_64_Pre_21_Rec_100_f1_35
[[54.026 35.974]
 [ 0.    10.   ]]
64.03 21.75 100.0

0.6 0.8 

Unnamed: 0,Acc_98_Pre_100_Rec_80_f1_88,Acc_96_Pre_100_Rec_60_f1_74,Acc_94_Pre_100_Rec_40_f1_57,Acc_92_Pre_100_Rec_20_f1_33,Acc_82_Pre_35_Rec_100_f1_52,Acc_80_Pre_30_Rec_80_f1_44,Acc_78_Pre_25_Rec_60_f1_35,Acc_76_Pre_18_Rec_40_f1_25,Acc_74_Pre_10_Rec_20_f1_13,Acc_64_Pre_21_Rec_100_f1_35,Acc_62_Pre_18_Rec_80_f1_29,Acc_60_Pre_14_Rec_60_f1_23,Acc_58_Pre_10_Rec_40_f1_16,Acc_56_Pre_5_Rec_20_f1_8
0,False,False,False,False,False,False,False,False,True,False,True,False,False,False
1,False,False,False,False,False,False,False,False,False,True,False,True,True,False
2,False,False,False,False,False,False,True,True,False,False,False,False,True,True
3,False,False,False,False,False,False,True,True,False,False,False,False,False,False
4,False,False,False,False,True,False,False,False,False,False,False,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,True,False,True,False,True,False,False,False,True,True,True,True,True,False
99996,True,True,True,False,True,True,True,True,False,True,True,True,True,False
99997,True,False,False,True,True,False,True,False,False,True,True,True,False,False
99998,False,False,False,False,True,True,True,False,False,True,True,True,False,False


# Apply Feature Selection Algorithms

In [6]:
A = data_x.columns
D = {'Features':A}
FS = pd.DataFrame(D)
FS.head()

Unnamed: 0,Features
0,Acc_98_Pre_100_Rec_80_f1_88
1,Acc_96_Pre_100_Rec_60_f1_74
2,Acc_94_Pre_100_Rec_40_f1_57
3,Acc_92_Pre_100_Rec_20_f1_33
4,Acc_82_Pre_35_Rec_100_f1_52


## Variance Threshold
- I don't think it's relevant to our situation.  

In [7]:
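V = []
A = []
# Iterating over a DataFrame yields its column names
for col in data_x:
    v = data_x[col].var()
    v = round(v,4)
    A.append(v)
    V.append([v, col])
FS['Variance'] = A
# Print the features from lowest to highest variance
V = sorted(V, key=lambda x:x[0], reverse=False)
for row in V:
    print (row)
FS.head()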

[0.0198, 'Acc_92_Pre_100_Rec_20_f1_33']
[0.0386, 'Acc_94_Pre_100_Rec_40_f1_57']
[0.0563, 'Acc_96_Pre_100_Rec_60_f1_74']
[0.0737, 'Acc_98_Pre_100_Rec_80_f1_88']
[0.1602, 'Acc_74_Pre_10_Rec_20_f1_13']
[0.1723, 'Acc_76_Pre_18_Rec_40_f1_25']
[0.1816, 'Acc_78_Pre_25_Rec_60_f1_35']
[0.1924, 'Acc_80_Pre_30_Rec_80_f1_44']
[0.2012, 'Acc_82_Pre_35_Rec_100_f1_52']
[0.235, 'Acc_56_Pre_5_Rec_20_f1_8']
[0.2401, 'Acc_58_Pre_10_Rec_40_f1_16']
[0.2434, 'Acc_60_Pre_14_Rec_60_f1_23']
[0.2462, 'Acc_62_Pre_18_Rec_80_f1_29']
[0.2484, 'Acc_64_Pre_21_Rec_100_f1_35']


Unnamed: 0,Features,Variance
0,Acc_98_Pre_100_Rec_80_f1_88,0.0737
1,Acc_96_Pre_100_Rec_60_f1_74,0.0563
2,Acc_94_Pre_100_Rec_40_f1_57,0.0386
3,Acc_92_Pre_100_Rec_20_f1_33,0.0198
4,Acc_82_Pre_35_Rec_100_f1_52,0.2012


In [8]:
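# For a boolean column with a share r of True values, the variance is about r*(1-r),
# so this threshold drops columns that are more than about 80% True or 80% False.
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(data_x)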
data_VT = data_x[data_x.columns[sel.get_support(indices=True)]]
data_VT.columns.symmetric_difference(data_x.columns)
data_VT.head()

Unnamed: 0,Acc_82_Pre_35_Rec_100_f1_52,Acc_80_Pre_30_Rec_80_f1_44,Acc_78_Pre_25_Rec_60_f1_35,Acc_76_Pre_18_Rec_40_f1_25,Acc_74_Pre_10_Rec_20_f1_13,Acc_64_Pre_21_Rec_100_f1_35,Acc_62_Pre_18_Rec_80_f1_29,Acc_60_Pre_14_Rec_60_f1_23,Acc_58_Pre_10_Rec_40_f1_16,Acc_56_Pre_5_Rec_20_f1_8
0,False,False,False,False,True,False,True,False,False,False
1,False,False,False,False,False,True,False,True,True,False
2,False,False,True,True,False,False,False,False,True,True
3,False,False,True,True,False,False,False,False,False,False
4,True,False,False,False,False,False,False,True,True,True


## SelectKBest
- SelectKBest ranks the features by some score, then keeps the $k$ best, where $k$ is an integer or 'all'.  (scikit-learn's SelectPercentile does the same for a percentage of the features.)
- Test first with the default scoring function, f_classif,
- Then test with $\chi^2$,
- Then test with mutual_info_classif.  I haven't gotten this to work.
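
The cells below use `k='all'` so that every feature gets a score. For reference, here is a minimal sketch of what SelectKBest does with a smaller `k`. The two features and all variable names are made up for the illustration and are much simpler than the notebook's `data_x`.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)            # fixed seed for repeatability
n_rows = 2000
target = np.arange(n_rows) >= 1800        # 10% positives

# Two made-up boolean features: one agrees with the target about 90% of the
# time, the other is independent coin-flip noise.
informative = np.where(rng.random(n_rows) < 0.9, target, ~target)
coin_flip = rng.random(n_rows) < 0.5
X = np.column_stack([informative, coin_flip]).astype(int)

kbest = SelectKBest(chi2, k=1).fit(X, target.astype(int))
print(kbest.scores_)         # one score per column
print(kbest.get_support())   # boolean mask of the k columns kept
```

`get_support()` returns the mask of retained columns, so it can be used to index the original column names in the same way the VarianceThreshold cell does.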

In [9]:
# Create and fit selector
selector = SelectKBest(f_classif, k='all')
selector.fit(data_x, data_y['y'])
scores = selector.scores_
Scores = []
for i, feature in enumerate(data_x):
    Scores.append([round(scores[i],0), feature])
Scores = sorted(Scores, key=lambda x:x[0], reverse=True)
for row in Scores:
    print (row)
FS['SKB_f_classif'] = scores
FS.head()

[361347.0, 'Acc_98_Pre_100_Rec_80_f1_88']
[134604.0, 'Acc_96_Pre_100_Rec_60_f1_74']
[60652.0, 'Acc_94_Pre_100_Rec_40_f1_57']
[40259.0, 'Acc_82_Pre_35_Rec_100_f1_52']
[22824.0, 'Acc_92_Pre_100_Rec_20_f1_33']
[20124.0, 'Acc_80_Pre_30_Rec_80_f1_44']
[15018.0, 'Acc_64_Pre_21_Rec_100_f1_35']
[8967.0, 'Acc_78_Pre_25_Rec_60_f1_35']
[6170.0, 'Acc_62_Pre_18_Rec_80_f1_29']
[2219.0, 'Acc_76_Pre_18_Rec_40_f1_25']
[1591.0, 'Acc_56_Pre_5_Rec_20_f1_8']
[1486.0, 'Acc_60_Pre_14_Rec_60_f1_23']
[3.0, 'Acc_74_Pre_10_Rec_20_f1_13']
[0.0, 'Acc_58_Pre_10_Rec_40_f1_16']


Unnamed: 0,Features,Variance,SKB_f_classif
0,Acc_98_Pre_100_Rec_80_f1_88,0.0737,361346.835105
1,Acc_96_Pre_100_Rec_60_f1_74,0.0563,134604.24572
2,Acc_94_Pre_100_Rec_40_f1_57,0.0386,60651.615869
3,Acc_92_Pre_100_Rec_20_f1_33,0.0198,22823.913577
4,Acc_82_Pre_35_Rec_100_f1_52,0.2012,40258.869794


In [10]:
# Create and fit selector
selector = SelectKBest(chi2, k='all')
selector.fit(data_x, data_y['y'])
scores = selector.scores_
Scores = []
for i, feature in enumerate(data_x):
    Scores.append([round(scores[i],0), feature])
Scores = sorted(Scores, key=lambda x:x[0], reverse=True)
for row in Scores:
    print (row)
FS['SKB_chi2'] = scores
FS.head()

[72054.0, 'Acc_98_Pre_100_Rec_80_f1_88']
[53937.0, 'Acc_96_Pre_100_Rec_60_f1_74']
[36234.0, 'Acc_94_Pre_100_Rec_40_f1_57']
[20693.0, 'Acc_82_Pre_35_Rec_100_f1_52']
[18207.0, 'Acc_92_Pre_100_Rec_20_f1_33']
[12398.0, 'Acc_80_Pre_30_Rec_80_f1_44']
[7054.0, 'Acc_64_Pre_21_Rec_100_f1_35']
[6267.0, 'Acc_78_Pre_25_Rec_60_f1_35']
[3266.0, 'Acc_62_Pre_18_Rec_80_f1_29']
[1691.0, 'Acc_76_Pre_18_Rec_40_f1_25']
[975.0, 'Acc_56_Pre_5_Rec_20_f1_8']
[851.0, 'Acc_60_Pre_14_Rec_60_f1_23']
[2.0, 'Acc_74_Pre_10_Rec_20_f1_13']
[0.0, 'Acc_58_Pre_10_Rec_40_f1_16']


Unnamed: 0,Features,Variance,SKB_f_classif,SKB_chi2
0,Acc_98_Pre_100_Rec_80_f1_88,0.0737,361346.835105,72054.0
1,Acc_96_Pre_100_Rec_60_f1_74,0.0563,134604.24572,53937.0
2,Acc_94_Pre_100_Rec_40_f1_57,0.0386,60651.615869,36234.0
3,Acc_92_Pre_100_Rec_20_f1_33,0.0198,22823.913577,18207.0
4,Acc_82_Pre_35_Rec_100_f1_52,0.2012,40258.869794,20693.337191


In [11]:
# Create and fit selector
selector = SelectKBest(mutual_info_classif, k='all')
selector.fit(data_x, data_y['y'])
scores = selector.scores_
Scores = []
for i, feature in enumerate(data_x):
    Scores.append([round(scores[i],6), feature])
Scores = sorted(Scores, key=lambda x:x[0], reverse=True)
for row in Scores:
    print (row)
FS['SKB_mutual_info_classif'] = scores
FS.head()

[0.229415, 'Acc_98_Pre_100_Rec_80_f1_88']
[0.159788, 'Acc_96_Pre_100_Rec_60_f1_74']
[0.143674, 'Acc_82_Pre_35_Rec_100_f1_52']
[0.101197, 'Acc_94_Pre_100_Rec_40_f1_57']
[0.086008, 'Acc_64_Pre_21_Rec_100_f1_35']
[0.073265, 'Acc_80_Pre_30_Rec_80_f1_44']
[0.04984, 'Acc_92_Pre_100_Rec_20_f1_33']
[0.034531, 'Acc_78_Pre_25_Rec_60_f1_35']
[0.033167, 'Acc_62_Pre_18_Rec_80_f1_29']
[0.012037, 'Acc_60_Pre_14_Rec_60_f1_23']
[0.010956, 'Acc_76_Pre_18_Rec_40_f1_25']
[0.010562, 'Acc_56_Pre_5_Rec_20_f1_8']
[0.004034, 'Acc_58_Pre_10_Rec_40_f1_16']
[0.001627, 'Acc_74_Pre_10_Rec_20_f1_13']


Unnamed: 0,Features,Variance,SKB_f_classif,SKB_chi2,SKB_mutual_info_classif
0,Acc_98_Pre_100_Rec_80_f1_88,0.0737,361346.835105,72054.0,0.229415
1,Acc_96_Pre_100_Rec_60_f1_74,0.0563,134604.24572,53937.0,0.159788
2,Acc_94_Pre_100_Rec_40_f1_57,0.0386,60651.615869,36234.0,0.101197
3,Acc_92_Pre_100_Rec_20_f1_33,0.0198,22823.913577,18207.0,0.04984
4,Acc_82_Pre_35_Rec_100_f1_52,0.2012,40258.869794,20693.337191,0.143674


## RFECV:  Recursive Feature Elimination and Cross-Validated selection of the best number of features.

In [None]:
#- This one takes a long time, so I turned the cell into Markdown while I'm developing the rest of the notebook.

estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=4)
selector = selector.fit(data_x, data_y['y'])
selector.support_
selector.ranking_
scores = selector.ranking_
FS['RFECV'] = scores
FS.head()
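
Since the Big Idea section says we care about recall, one variation that may be worth trying is RFECV with a classifier and `scoring="recall"`. The sketch below is an alternative under stated assumptions, not the notebook's method: it swaps in `LogisticRegression` for `SVR`, uses a shuffled `StratifiedKFold` because the target is sorted, and builds a small made-up dataset so that it runs quickly. The helper `noisy_copy` and all variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n_rows = 2000
target = (np.arange(n_rows) >= 1800).astype(int)   # 10% positives, sorted


def noisy_copy(y, p, q):
    """Keep a p share of negatives and a q share of positives; flip the rest."""
    noise = rng.random(len(y))
    flip = ((noise > p) & (y == 0)) | ((noise > q) & (y == 1))
    return np.where(flip, 1 - y, y)


X = np.column_stack([
    noisy_copy(target, 1.0, 0.8),   # high precision, good recall
    noisy_copy(target, 0.8, 1.0),   # lower precision, perfect recall
    noisy_copy(target, 0.6, 0.2),   # weak on both
])

folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
rfecv = RFECV(LogisticRegression(max_iter=1000), step=1, cv=folds, scoring="recall")
rfecv.fit(X, target)
print(rfecv.support_)   # mask of the features kept
print(rfecv.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

Whether recall-based scoring changes which features survive on the full `data_x` is an open question that this sketch does not answer.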

## SelectFromModel

### L1-based Feature Selection

In [12]:
print (data_x.shape)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(data_x, data_y['y'])
scores = lsvc.coef_[0]  
FS['SFM_L1'] = scores
FS.head()

(100000, 14)


Unnamed: 0,Features,Variance,SKB_f_classif,SKB_chi2,SKB_mutual_info_classif,SFM_L1
0,Acc_98_Pre_100_Rec_80_f1_88,0.0737,361346.835105,72054.0,0.229415,1.710469
1,Acc_96_Pre_100_Rec_60_f1_74,0.0563,134604.24572,53937.0,0.159788,1.593073
2,Acc_94_Pre_100_Rec_40_f1_57,0.0386,60651.615869,36234.0,0.101197,1.44534
3,Acc_92_Pre_100_Rec_20_f1_33,0.0198,22823.913577,18207.0,0.04984,1.000318
4,Acc_82_Pre_35_Rec_100_f1_52,0.2012,40258.869794,20693.337191,0.143674,0.29644


In [13]:
model = SelectFromModel(lsvc, prefit=True)
model.transform(data_x)
data_New = data_x[data_x.columns[model.get_support(indices=True)]]
print ("Number of retained features")
print (data_New.shape)
print ("Retained Features")
print (data_New.columns)
print ("Deleted Features")
data_New.columns.symmetric_difference(data_x.columns)

Number of retained features
(100000, 12)
Retained Features
Index(['Acc_98_Pre_100_Rec_80_f1_88', 'Acc_96_Pre_100_Rec_60_f1_74',
       'Acc_94_Pre_100_Rec_40_f1_57', 'Acc_92_Pre_100_Rec_20_f1_33',
       'Acc_82_Pre_35_Rec_100_f1_52', 'Acc_80_Pre_30_Rec_80_f1_44',
       'Acc_78_Pre_25_Rec_60_f1_35', 'Acc_76_Pre_18_Rec_40_f1_25',
       'Acc_64_Pre_21_Rec_100_f1_35', 'Acc_62_Pre_18_Rec_80_f1_29',
       'Acc_60_Pre_14_Rec_60_f1_23', 'Acc_56_Pre_5_Rec_20_f1_8'],
      dtype='object')
Deleted Features


Index(['Acc_58_Pre_10_Rec_40_f1_16', 'Acc_74_Pre_10_Rec_20_f1_13'], dtype='object')

### Tree-Based Feature Selection

In [14]:
print (data_x.shape)
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(data_x, data_y['y'])
scores = clf.feature_importances_  
FS['SFM_Tree_Based'] = scores
FS.head()

(100000, 14)


Unnamed: 0,Features,Variance,SKB_f_classif,SKB_chi2,SKB_mutual_info_classif,SFM_L1,SFM_Tree_Based
0,Acc_98_Pre_100_Rec_80_f1_88,0.0737,361346.835105,72054.0,0.229415,1.710469,0.345929
1,Acc_96_Pre_100_Rec_60_f1_74,0.0563,134604.24572,53937.0,0.159788,1.593073,0.227512
2,Acc_94_Pre_100_Rec_40_f1_57,0.0386,60651.615869,36234.0,0.101197,1.44534,0.123787
3,Acc_92_Pre_100_Rec_20_f1_33,0.0198,22823.913577,18207.0,0.04984,1.000318,0.056299
4,Acc_82_Pre_35_Rec_100_f1_52,0.2012,40258.869794,20693.337191,0.143674,0.29644,0.115319


In [15]:
model = SelectFromModel(clf, prefit=True)
model.transform(data_x)
data_New = data_x[data_x.columns[model.get_support(indices=True)]]
print ("Number of retained features")
print (data_New.shape)
print ("Retained Features")
print (data_New.columns)
print ("Deleted Features")
data_New.columns.symmetric_difference(data_x.columns)

Number of retained features
(100000, 4)
Retained Features
Index(['Acc_98_Pre_100_Rec_80_f1_88', 'Acc_96_Pre_100_Rec_60_f1_74',
       'Acc_94_Pre_100_Rec_40_f1_57', 'Acc_82_Pre_35_Rec_100_f1_52'],
      dtype='object')
Deleted Features


Index(['Acc_56_Pre_5_Rec_20_f1_8', 'Acc_58_Pre_10_Rec_40_f1_16',
       'Acc_60_Pre_14_Rec_60_f1_23', 'Acc_62_Pre_18_Rec_80_f1_29',
       'Acc_64_Pre_21_Rec_100_f1_35', 'Acc_74_Pre_10_Rec_20_f1_13',
       'Acc_76_Pre_18_Rec_40_f1_25', 'Acc_78_Pre_25_Rec_60_f1_35',
       'Acc_80_Pre_30_Rec_80_f1_44', 'Acc_92_Pre_100_Rec_20_f1_33'],
      dtype='object')
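
The gap between 12 retained features for L1 and 4 for the tree-based model may come down to `SelectFromModel`'s default threshold. As I read the scikit-learn documentation, when no threshold is given it uses `1e-5` for estimators with an L1 penalty and the mean importance otherwise. The sketch below applies the mean rule by hand to the five `SFM_Tree_Based` values displayed above. It assumes the 14 importances sum to 1, so the mean is 1/14. The other nine values are not shown in the notebook output, so they are left out.

```python
# Importances copied from the FS.head() output for SFM_Tree_Based.
shown_importances = {
    "Acc_98_Pre_100_Rec_80_f1_88": 0.345929,
    "Acc_96_Pre_100_Rec_60_f1_74": 0.227512,
    "Acc_94_Pre_100_Rec_40_f1_57": 0.123787,
    "Acc_92_Pre_100_Rec_20_f1_33": 0.056299,
    "Acc_82_Pre_35_Rec_100_f1_52": 0.115319,
}

n_features = 14
mean_threshold = 1.0 / n_features   # assumes the importances sum to 1

kept = [name for name, imp in shown_importances.items() if imp >= mean_threshold]
dropped = [name for name, imp in shown_importances.items() if imp < mean_threshold]
print("threshold:", round(mean_threshold, 4))
print("kept:", kept)
print("dropped:", dropped)
```

Under that assumption the result agrees with the output above: the four retained features are the ones whose importance exceeds 1/14 (about 0.0714), and `Acc_92_Pre_100_Rec_20_f1_33`, at 0.056299, falls just below it.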