# Feature selection/ Extraction in Machine Learning for Numerical Data

This data originates from blog posts. The raw HTML documents of the blog posts were crawled and processed. The prediction task is to predict the number of comments each blog post will receive in the upcoming 24 hours. To simulate this situation, a basetime (in the past) is selected, and the blog posts published at most 72 hours before that basetime are taken. The features of the selected blog posts are then computed from the information that was available at the basetime.

The target is the number of comments that a blog post received in the 24 hours following its basetime.

Data set and data description taken from https://archive.ics.uci.edu/ml/datasets/BlogFeedback#

In [2]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
def load_data():
    return pd.read_csv("../Data/blogData_train.csv", header = None)

In [4]:
raw_training_data = load_data()

#Assign custom column names to the features since we will need header names for calculations later.
custom_column_names = ["feature"+str(idx) for idx in range(raw_training_data.shape[1])]
raw_training_data.columns = custom_column_names

In [5]:
raw_training_data.head()

Unnamed: 0,feature0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,...,feature271,feature272,feature273,feature274,feature275,feature276,feature277,feature278,feature279,feature280
0,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,40.30467,53.845657,0.0,401.0,15.0,15.52416,32.44188,0.0,377.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.0


In [6]:
raw_training_data.describe()

Unnamed: 0,feature0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,...,feature271,feature272,feature273,feature274,feature275,feature276,feature277,feature278,feature279,feature280
count,52397.0,52397.0,52397.0,52397.0,52397.0,52397.0,52397.0,52397.0,52397.0,52397.0,...,52397.0,52397.0,52397.0,52397.0,52397.0,52397.0,52397.0,52397.0,52397.0,52397.0
mean,39.444167,46.806717,0.358914,339.853102,24.681661,15.214611,27.959159,0.002748,258.66603,5.829151,...,0.171327,0.162242,0.154455,0.096151,0.088917,0.119167,0.0,1.242094,0.769505,6.764719
std,79.121821,62.359996,6.840717,441.430109,69.598976,32.251189,38.584013,0.131903,321.348052,23.768317,...,0.376798,0.368676,0.361388,0.2948,0.284627,1.438194,0.0,27.497979,20.338052,37.706565
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.285714,5.214318,0.0,29.0,0.0,0.891566,3.075076,0.0,22.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,10.63066,19.35312,0.0,162.0,4.0,4.150685,11.051215,0.0,121.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,40.30467,77.44283,0.0,478.0,15.0,15.998589,45.701206,0.0,387.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,1122.6666,559.4326,726.0,2044.0,1314.0,442.66666,359.53006,14.0,1424.0,588.0,...,1.0,1.0,1.0,1.0,1.0,136.0,0.0,1778.0,1778.0,1424.0


In [7]:
#Histograms are not plotted here because there are too many features to display.
#raw_training_data.hist(bins = 50, figsize = (20,15))
#bins = The number of bins between the minimum and maximum data points for each feature, by default 10.
#figsize = The size in inches of the figure to create.

In [8]:
'''Separate features and labels'''
num_features = len(raw_training_data.iloc[0]) - 1 #subtracting the label

raw_features, labels = raw_training_data.iloc[:,:num_features], raw_training_data.iloc[:,num_features]
#convert labels series to pandas data frame to give it a 2D shape
labels = pd.DataFrame(labels)

# FEATURE SELECTION/ EXTRACTION BEGINS

## 1.Check Missing Values

In [9]:
#Use df.describe() to generate summary statistics, then derive the fraction of missing values from the counts.
features_desc= raw_features.describe().T
features_desc['missing %'] = 1- (features_desc['count']/len(raw_features))


#prevent pandas from skipping the middle elements while displaying
pd.options.display.max_rows  = 2000

features_desc

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,missing %
feature0,52397.0,39.444167,79.121821,0.0,2.285714,10.63066,40.30467,1122.6666,0.0
feature1,52397.0,46.806717,62.359996,0.0,5.214318,19.35312,77.44283,559.4326,0.0
feature2,52397.0,0.358914,6.840717,0.0,0.0,0.0,0.0,726.0,0.0
feature3,52397.0,339.853102,441.430109,0.0,29.0,162.0,478.0,2044.0,0.0
feature4,52397.0,24.681661,69.598976,0.0,0.0,4.0,15.0,1314.0,0.0
feature5,52397.0,15.214611,32.251189,0.0,0.891566,4.150685,15.998589,442.66666,0.0
feature6,52397.0,27.959159,38.584013,0.0,3.075076,11.051215,45.701206,359.53006,0.0
feature7,52397.0,0.002748,0.131903,0.0,0.0,0.0,0.0,14.0,0.0
feature8,52397.0,258.66603,321.348052,0.0,22.0,121.0,387.0,1424.0,0.0
feature9,52397.0,5.829151,23.768317,0.0,0.0,1.0,2.0,588.0,0.0


None of the features has any missing values, so no further processing is required to handle them. If it were required, please refer to the Handling_Missing_Values.ipynb file.
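
The same check can also be made directly, without going through `describe()`. The following is a minimal sketch on a toy frame (the frame and its planted `NaN` are made up for illustration and are not part of the blog data); on the notebook's data, `raw_features` would take the place of `toy_features`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_features; the NaN is planted for illustration.
toy_features = pd.DataFrame({
    "feature0": [1.0, 2.0, np.nan, 4.0],
    "feature1": [0.0, 0.0, 1.0, 1.0],
})

# Fraction of missing entries per column.
missing_fraction = toy_features.isna().mean()
print(missing_fraction)

# Columns that would need imputation or removal.
columns_with_gaps = missing_fraction[missing_fraction > 0].index.tolist()
print(columns_with_gaps)
```

An empty `columns_with_gaps` list would correspond to the situation found above, where nothing is missing.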

## 2.Discard zero-variance features

In [11]:
'''If the variance of a feature is zero, then the feature is constant and will not improve the performance of
the model. In that case, it should be removed. 

sklearn provides a class that discards every feature whose variance does not exceed a given threshold. By
default, the threshold is 0, so only zero-variance features are discarded.
'''

from sklearn.feature_selection import VarianceThreshold

print("Num of features before zero-variance features' removal: ", raw_features.shape[1])

high_variance_selector = VarianceThreshold(threshold = 0.) #By default threshold argument is 0.

features_with_variance = high_variance_selector.fit_transform(raw_features)
features_with_variance = pd.DataFrame(features_with_variance)

print("Num of features after zero-variance features' removal: ", features_with_variance.shape[1])


#Give column names. Note that the columns are renumbered from 0, so the names no longer match those in raw_features.
custom_column_names_feat_with_variance = ["feature"+str(idx) for idx in range(len(features_with_variance.iloc[0]))] 
features_with_variance.columns = custom_column_names_feat_with_variance



Num of features before zero-variance features' removal:  280
Num of features after zero-variance features' removal:  276


In [12]:
#There were 4 features with zero variance and these were removed from the data set.
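
Because the cell above renumbers the columns after filtering, a name such as `feature275` in later cells refers to a position in the filtered matrix rather than to the original column. If the original names are needed, one option is to read the selector's boolean mask with `get_support()`. This is a small sketch on toy data (the frame is made up for illustration):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame standing in for raw_features; feature1 is constant on purpose.
toy_features = pd.DataFrame({
    "feature0": [1.0, 2.0, 3.0, 4.0],
    "feature1": [5.0, 5.0, 5.0, 5.0],
    "feature2": [0.0, 1.0, 0.0, 1.0],
})

selector = VarianceThreshold(threshold=0.0)
selector.fit(toy_features)

# Boolean mask over the input columns: True means the column is kept.
kept_mask = selector.get_support()
kept_columns = toy_features.columns[kept_mask]
dropped_columns = toy_features.columns[~kept_mask]

# Slicing with the mask keeps the original column names.
toy_with_variance = toy_features.loc[:, kept_mask]
print(list(kept_columns))
print(list(dropped_columns))
```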

In [13]:
#If there were too many features, maybe tens of thousands, so that computations became expensive, we could also discard
#the features with low but non-zero variance by first min-max normalizing all the features to the same scale between 0 and
#1 and then removing the features with very low standard deviation/ variance using the same method as above but with
#a threshold above 0. However, we can manage with 276 features for now.
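
The low-variance idea described in this cell could look roughly like the sketch below. The toy data and the cut-off `low_variance_threshold = 0.01` are assumptions chosen for illustration; a suitable threshold depends on the data set.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
toy_features = pd.DataFrame({
    "wide_spread": rng.uniform(0, 1000, size=200),
    "nearly_constant": np.r_[np.zeros(199), 1.0],  # a single non-zero value
})

# Bring every feature to the [0, 1] range so that variances are comparable.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy_features), columns=toy_features.columns
)

low_variance_threshold = 0.01  # hypothetical cut-off, not taken from the notebook
selector = VarianceThreshold(threshold=low_variance_threshold).fit(scaled)
kept_columns = list(scaled.columns[selector.get_support()])

print(scaled.var(ddof=0))
print(kept_columns)
```

Scaling first matters because the raw variance depends on the unit of each feature, so an unscaled threshold would mostly remove features that happen to be measured on a small scale.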

## 3.Remove correlation/ multicollinearity among the independent variables.

Multicollinearity among the features can be handled in three different ways.

### a.Multicollinearity Analysis using Condition Index
This method reduces but does not completely eliminate multicollinearity in the data set. It is the method followed here.
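
As a small worked example, the condition index of each eigenvalue $\lambda_i$ of the correlation matrix is $\sqrt{\lambda_{max}/\lambda_i}$, and a largest index above 30 is treated in this notebook as severe multicollinearity. The toy correlation matrix below is made up for illustration, and `np.linalg.eigvalsh` is used here because the matrix is symmetric.

```python
import numpy as np

# Toy correlation matrix: x0 and x1 are almost collinear, x2 is unrelated.
toy_corr = np.array([
    [1.0,   0.999, 0.0],
    [0.999, 1.0,   0.0],
    [0.0,   0.0,   1.0],
])

# eigvalsh is meant for symmetric matrices and returns real eigenvalues
# in ascending order.
eigen_vals = np.linalg.eigvalsh(toy_corr)

# One condition index per eigenvalue; the largest belongs to the smallest eigenvalue.
condition_indices = np.sqrt(eigen_vals.max() / eigen_vals)
print(condition_indices.max())
```

Here the eigenvalues are 0.001, 1 and 1.999, so the largest condition index is $\sqrt{1999}$, which is well above 30.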

### b.Clustering Analysis of the features (not the data samples)
If the task at hand is supervised and we want to retain the explicability of the model (for example, we want to feed the final features into a linear regression and explain how much the response variable changes with a unit change in a given feature), then we would prefer a cluster analysis of the features. Then, for each cluster, we take

i.either the centroid of all the features in the cluster, and assume that this centroid represents them all <br>
ii.or the one feature that represents the cluster best. This feature can be chosen using the 1-R$^{2}$ ratio.

Method i does not preserve the explicability of the original features. The clustering analysis method does not completely eliminate multicollinearity among the original features.
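
A rough sketch of option ii on toy data follows. Several choices here are assumptions made for illustration: the distance `1 - |correlation|`, average linkage, the cut at 0.5, and the rule for picking the representative (the member with the highest mean absolute correlation to its cluster). That rule is a simplified stand-in for the 1-R$^{2}$ ratio and is not the same computation.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
base_a = rng.normal(size=500)
base_b = rng.normal(size=500)
toy_features = pd.DataFrame({
    "a1": base_a,
    "a2": base_a + 0.05 * rng.normal(size=500),
    "b1": base_b,
    "b2": base_b + 0.05 * rng.normal(size=500),
})

# Cluster the features (columns), not the samples: strongly correlated
# features are close to each other under the distance 1 - |corr|.
abs_corr = toy_features.corr().abs()
distance = 1.0 - abs_corr.values
np.fill_diagonal(distance, 0.0)
condensed = squareform(distance, checks=False)

feature_linkage = linkage(condensed, method="average")
cluster_ids = fcluster(feature_linkage, t=0.5, criterion="distance")

# Pick one representative per cluster (simplified rule, see the note above).
representatives = []
for cluster_id in np.unique(cluster_ids):
    members = abs_corr.columns[cluster_ids == cluster_id]
    mean_corr = abs_corr.loc[members, members].mean(axis=1)
    representatives.append(mean_corr.idxmax())

print(dict(zip(abs_corr.columns, cluster_ids)))
print(representatives)
```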

### c.Principal Component Analysis (PCA)
If the task at hand is unsupervised, so that there is no relation for the model to explain (and all we care about is that similar items are clustered together), then we can use PCA. The principal components are mutually uncorrelated, so PCA completely eliminates multicollinearity in the transformed data set.
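
The claim about PCA can be checked on toy data. In the sketch below (the data is synthetic and only for illustration), two of the three input features are nearly collinear, yet the correlations between the resulting components are zero up to floating-point error.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
toy_features = np.hstack([
    base,
    base + 0.1 * rng.normal(size=(300, 1)),  # nearly collinear with the first column
    rng.normal(size=(300, 1)),
])

# Standardize first so that no feature dominates because of its scale.
scaled = StandardScaler().fit_transform(toy_features)
components = PCA(n_components=3).fit_transform(scaled)

# Correlation matrix of the components; off-diagonal entries should vanish.
component_corr = np.corrcoef(components, rowvar=False)
off_diagonal = component_corr - np.eye(3)
print(np.abs(off_diagonal).max())
```

The price is that each component is a mix of all the original features, which is why the text reserves PCA for cases where explicability is not needed.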

In [14]:
#The code below is inspired from the presentation by Vishal Patel at PyData 2016.


#generate the correlation matrix
corr_matrix = features_with_variance.corr()

custom_column_names_corr_matrix = ["feature"+str(idx) for idx in range(len(corr_matrix))] 
corr_matrix.columns = custom_column_names_corr_matrix



#set the minimum number of variables to keep for the regression task, to prevent discarding too many features
min_variables_to_keep = 50

count_of_features = len(corr_matrix)

print("Performing Multicollinearity Analysis.")

#if the current number of features > the minimum number of features to keep
if count_of_features > min_variables_to_keep:
    
    while True:
        
        col_names = corr_matrix.keys() #get column names if present, else the column number
        eigen_vals, eigen_vects = np.linalg.eig(corr_matrix)
        #Note that numpy may return complex eigenvalues because of floating-point round-off
        
        #The condition indices are computed as the square root of the maximum eigenvalue divided by
        #each eigenvalue of the correlation matrix. 
        
        condition_indices = (max(eigen_vals)/eigen_vals)**(1/2)  #Taking the square root
        
        #If the largest condition index is <= 30, then multicollinearity is not severe.
        #len(corr_matrix) is used because count_of_features is not updated inside the loop.
        if max(condition_indices) <= 30 or len(corr_matrix) <= min_variables_to_keep:
            break
        
        for idx, val in enumerate(eigen_vals):
            if val == min(eigen_vals):
                
                for idxxx, eigen_vec in enumerate(eigen_vects[:,idx]):
                    
                    if abs(eigen_vec) == max( (abs(eigen_vects[:,idx]) )):
                        
                        #Boolean mask that is False only at the position of the feature to drop
                        mask = np.arange(len(corr_matrix)) != idxxx
                        
                        #Delete the row corresponding to this feature with the highest loading in the Eigen vector
                        corr_matrix = corr_matrix[mask]
                        #Delete the column corresponding to the feature that has the highest loading in the Eigen vector
                        corr_matrix.pop(col_names[idxxx])
                        
        
print("Shape of the remaining features after removing multicollinearity to some extent: ", corr_matrix.shape)

Performing Multicollinearity Analysis.
Shape of the remaining features after removing multicollinearity to some extent:  (220, 220)
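
The same pruning idea can be written more compactly. The function below is a sketch under stated assumptions, not a drop-in replacement for the cell above: the name `prune_by_condition_index` and its parameters are hypothetical, and it swaps `np.linalg.eig` for `np.linalg.eigh`, which returns real, sorted eigenvalues for symmetric input. In near-tie situations it may therefore drop different features than the loop above.

```python
import numpy as np
import pandas as pd

def prune_by_condition_index(corr, max_condition_index=30.0, min_features=2):
    """Drop one feature at a time until the largest condition index is acceptable."""
    corr = corr.copy()
    while len(corr) > min_features:
        # eigh returns real eigenvalues in ascending order for symmetric input.
        eigen_vals, eigen_vects = np.linalg.eigh(corr.values)
        # Guard against a zero or slightly negative smallest eigenvalue.
        smallest = max(eigen_vals[0], np.finfo(float).eps)
        if np.sqrt(eigen_vals[-1] / smallest) <= max_condition_index:
            break
        # Drop the feature with the largest loading on the weakest eigenvector.
        worst = corr.columns[np.argmax(np.abs(eigen_vects[:, 0]))]
        corr = corr.drop(index=worst, columns=worst)
    return corr

# Toy data: x1 is almost a copy of x0, x2 is independent.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
toy_features = pd.DataFrame({
    "x0": x0,
    "x1": x0 + 0.01 * rng.normal(size=500),
    "x2": rng.normal(size=500),
})

pruned = prune_by_condition_index(toy_features.corr(), min_features=1)
print(list(pruned.columns))
```

On the toy data one of the two near-duplicate columns is removed and the independent column survives.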


In [15]:
#Get the names of the features that should stay and slice out these features from the features matrix.
cols = list(corr_matrix.columns)

#Select the surviving columns from the features matrix in one step.
features_without_multicollinearity = features_with_variance[cols].copy()

features_without_multicollinearity.head()

Unnamed: 0,feature3,feature7,feature15,feature16,feature18,feature19,feature21,feature23,feature25,feature26,...,feature265,feature266,feature267,feature269,feature270,feature271,feature272,feature273,feature274,feature275
0,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
print("The num of features after removing multicollinearity is: ", features_without_multicollinearity.shape[1])

The num of features after removing multicollinearity is:  220


In [17]:
#As seen above, severity of multicollinearity has been reduced by removing the features contributing highly towards
#multicollinearity. Now, there are only 220 features left to work on.

## 4.Correlation with the Target
Regardless of which method you follow to remove multicollinearity, if the data set has a class label/ target, check whether the remaining variables have a significant correlation with the target variable. Remove the variables whose absolute correlation with the target is too low.

In [19]:
#DataFrame.append was removed in pandas 2.0, so all the correlations are computed in one call instead.
abs_corr = features_without_multicollinearity.corrwith(labels.iloc[:, 0]).abs()

absolute_corr_with_target = pd.DataFrame({"feature_name": abs_corr.index, "abs_corr_with_target": abs_corr.values})
        
        
print(f"Abs corr of features with the target:\n {absolute_corr_with_target}")

Abs corr of features with the target:
     feature_name  abs_corr_with_target
0       feature3              0.356604
1       feature7              0.034916
2      feature15              0.384654
3      feature16              0.053221
4      feature18              0.486316
5      feature19              0.503375
6      feature21              0.280792
7      feature23              0.004137
8      feature25              0.266815
9      feature26              0.001228
10     feature28              0.338961
11     feature32              0.461627
12     feature36              0.002224
13     feature40              0.232089
14     feature41              0.323661
15     feature42              0.233080
16     feature44              0.230493
17     feature46              0.002513
18     feature47              0.314446
19     feature48              0.472061
20     feature49              0.117642
21     feature50              0.314177
22     feature53              0.260903
23     feature54         

In [37]:
#Keep only the features whose absolute correlation with the target is below 0.65.
#Note that this filter drops features that are strongly correlated with the target, not the weakly correlated ones described above.
final_features = []
for idx in range(len(absolute_corr_with_target)):
    if absolute_corr_with_target.iloc[idx,1] < 0.65:
        #keeping only the features that have an abs correlation less than the threshold
        final_features.append(str(absolute_corr_with_target.iloc[idx,0]))
        
    
final_dataset = features_without_multicollinearity.filter(final_features, axis = 1)
final_dataset.head(10)


Unnamed: 0,feature3,feature7,feature15,feature16,feature18,feature19,feature21,feature23,feature25,feature26,...,feature265,feature266,feature267,feature269,feature270,feature271,feature272,feature273,feature274,feature275
0,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.000000
1,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
2,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
3,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.000000
4,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.000000
5,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
6,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
7,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.000000
8,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.000000
9,401.0,0.0,48.475178,0.0,12.0,1.479934,-356.0,0.0,1.795416,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.000000
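
The text of this section describes removing features whose absolute correlation with the target is too low. A minimal sketch of that variant on toy data is given below; the data is synthetic and the cut-off `min_abs_corr = 0.2` is a hypothetical value, not one used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
signal = rng.normal(size=500)
toy_features = pd.DataFrame({
    "informative": signal + 0.5 * rng.normal(size=500),
    "noise": rng.normal(size=500),
})
toy_target = pd.Series(signal, name="target")

# Absolute Pearson correlation of every feature with the target.
abs_corr = toy_features.corrwith(toy_target).abs()

min_abs_corr = 0.2  # hypothetical cut-off; tune it for the data at hand
kept_columns = abs_corr[abs_corr >= min_abs_corr].index.tolist()
toy_final = toy_features[kept_columns]

print(abs_corr)
print(kept_columns)
```

Pearson correlation only captures linear association, so a feature with a strong non-linear relation to the target could still be dropped by this filter.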


### After this, we perform Forward/ Backward/ Stepwise Selection, LASSO, or tree-based methods. However, these methods are standalone procedures and will be covered independently in a different notebook.

## 5.Evaluation of the performance of the features before and after the engineering methods applied above
Here we will evaluate how well the curated features perform in comparison to the original features.
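
One possible way to set up such a comparison is sketched below. Everything in it is an assumption made for illustration: the helper name `mean_cv_r2`, the choice of a plain linear regression, 5-fold cross-validation, and the synthetic data. On the notebook's data, `raw_features` and `final_dataset` would be compared against `labels` in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def mean_cv_r2(features, target, folds=5):
    """Mean cross-validated R^2 of a plain linear regression."""
    scores = cross_val_score(
        LinearRegression(), features, target, cv=folds, scoring="r2"
    )
    return scores.mean()

# Synthetic stand-in: only feature0 and feature1 drive the target.
rng = np.random.default_rng(0)
toy_all = pd.DataFrame(
    rng.normal(size=(400, 6)), columns=[f"feature{i}" for i in range(6)]
)
toy_target = (
    3.0 * toy_all["feature0"]
    - 2.0 * toy_all["feature1"]
    + rng.normal(scale=0.5, size=400)
)
toy_curated = toy_all[["feature0", "feature1"]]

score_all = mean_cv_r2(toy_all, toy_target)
score_curated = mean_cv_r2(toy_curated, toy_target)
print(score_all, score_curated)
```

Scoring both feature sets with the same model and the same folds keeps the comparison about the features rather than about the evaluation setup.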