In [22]:
import pandas as pd

data = pd.read_csv("bank-additional-full.csv", delimiter=";")
data.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [23]:
print(data.shape) # Shape of the dataset.

(41188, 21)


In [24]:
data.info() # Data Summary.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

**Encoding Categorical Data.**

In [25]:
from sklearn.preprocessing import LabelEncoder

# Columns to label-encode: the categorical features, two numeric columns that
# this notebook treats as categories, and the target 'y'.
cols_to_encode = ['job', 'marital', 'education', 'default', 'housing', 'loan',
                  'contact', 'month', 'day_of_week', 'poutcome',
                  'emp.var.rate', 'nr.employed', 'y']

le = LabelEncoder()
for col in cols_to_encode:
    # fit_transform refits the encoder on every call, so `le` only keeps
    # the mapping of the last column it was fitted on.
    data[col] = le.fit_transform(data[col])
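
Refitting a single `LabelEncoder` discards the previous column's mapping, so the integer codes cannot be translated back to category names afterwards. A minimal sketch of one way around this, assuming a small hypothetical frame (`toy`, not the bank data) and a hypothetical `encoders` dictionary that keeps one fitted encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for two of the categorical columns.
toy = pd.DataFrame({
    "job": ["services", "admin.", "housemaid", "services"],
    "marital": ["married", "single", "married", "divorced"],
})

encoders = {}          # column name -> fitted LabelEncoder
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])  # classes are sorted alphabetically
    encoders[col] = enc

# The stored encoder can map the integer codes back to the original labels.
recovered = encoders["job"].inverse_transform(encoded["job"])
print(encoded)
print(list(recovered))
```

Keeping the encoders around also makes it possible to apply the same mapping to new data with `transform` instead of refitting.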

**Split Dataset into Independent (X) and Dependent (y) variables.**

In [26]:
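feature_cols = ['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome','emp.var.rate','nr.employed']

# Cast only the selected (already label-encoded) columns; casting the whole
# frame would truncate float columns such as cons.price.idx.
X = data[feature_cols].astype("int")
y = data['y']  # 1-D target, the shape scikit-learn expects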

# **Chi-Square $\chi^{2}$ Test for Feature Selection.**

Compute chi-squared statistics between each non-negative feature and the class label.

*   This score is used to evaluate categorical features in a classification task.

The chi-square statistic is computed between each feature and the target variable, and the desired number of features with the highest scores is then selected. For the test to be valid when relating the features in the dataset to the target variable, the following conditions have to be satisfied: the variables must be categorical, the observations must be sampled independently, and every cell should have an expected frequency greater than 5.
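
To make the expected-frequency condition concrete, here is a minimal worked sketch of the classical Pearson chi-square test on a 2x2 contingency table. The counts are made up for illustration and are not taken from the bank dataset; the closed-form p-value used here holds only for 1 degree of freedom.

```python
import math

# Hypothetical 2x2 contingency table (counts are made up for illustration):
# rows = contact type, columns = target y ("yes", "no").
observed = [
    [30, 70],   # cellular
    [10, 90],   # telephone
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row_total * col_total / grand_total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# A 2x2 table has 1 degree of freedom, for which the upper-tail
# probability has the closed form erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

min_expected = min(min(row) for row in expected)
print(f"chi-square = {chi_square:.2f}, p-value = {p_value:.6f}")
print(f"smallest expected count = {min_expected:.1f} (rule of thumb: > 5)")
```

Every expected count here is at least 20, so the rule of thumb is satisfied, and the small p-value suggests that the two hypothetical variables are not independent.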

The score can be used to select the $n\_features$ features with the highest chi-squared statistic from $X$, which must contain only non-negative values such as booleans or frequencies (e.g., term counts in document classification) relative to the classes.

Recall that the chi-square test measures dependence between stochastic variables, so using this function "weeds out" the features that are the most likely to be independent of class and therefore irrelevant for classification.

[sklearn.feature_selection.chi2](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html)
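
The sketch below reproduces, with NumPy, what `chi2` appears to compute: each feature's values are summed per class (the "observed" counts) and compared with the sums expected if the feature were independent of the class. The helper name `chi2_by_hand` and the random toy data are assumptions made for illustration; the final comparison against scikit-learn's own result is the check that the sketch matches.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist
from sklearn.feature_selection import chi2


def chi2_by_hand(X, y):
    """Hypothetical re-implementation of the chi2 score for non-negative X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot class membership, shape (n_samples, n_classes).
    Y = (y[:, None] == classes[None, :]).astype(float)

    observed = Y.T @ X                              # per-class sum of each feature
    feature_total = X.sum(axis=0, keepdims=True)    # shape (1, n_features)
    class_prob = Y.mean(axis=0, keepdims=True)      # shape (1, n_classes)
    expected = class_prob.T @ feature_total         # sums expected under independence

    stat = ((observed - expected) ** 2 / expected).sum(axis=0)
    p = chi2_dist.sf(stat, df=len(classes) - 1)     # upper-tail probability
    return stat, p


# Toy data: 200 samples, 3 non-negative integer features, binary target.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 4, size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

hand_stat, hand_p = chi2_by_hand(X_toy, y_toy)
sk_stat, sk_p = chi2(X_toy, y_toy)
print(np.allclose(hand_stat, sk_stat), np.allclose(hand_p, sk_p))
```

Because feature values are treated as counts, the score of a label-encoded column depends on which integer each category happens to receive; one-hot encoding the categories first is a common way to avoid that dependence.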


In [27]:
from sklearn.feature_selection import chi2, SelectKBest

chi2_features = chi2(X, y)

**Rank every categorical feature's importance, based on Chi-Square Score.**

In [28]:
p_values = pd.Series(chi2_features[1], index=X.columns)  # chi2 returns (scores, p-values)
p_values.sort_values(kind="stable")  # ascending: smallest p-value (strongest dependence) first

emp.var.rate     0.000000e+00
nr.employed      0.000000e+00
contact         3.500598e-121
default          5.521476e-72
education        2.464796e-38
poutcome         3.722828e-23
job              2.179406e-21
marital          1.348325e-07
day_of_week      1.380665e-03
housing          2.566075e-02
month            1.650559e-01
loan             2.077547e-01
dtype: float64

***Observation:*** **"emp.var.rate" and "nr.employed" have the smallest p-values (both underflow to 0), followed by "contact". "month" and "loan" have p-values above 0.05, so the test gives no evidence that they depend on the target "y".**
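
`chi2` returns two arrays, the scores and the p-values, and it can help to view them side by side. A minimal sketch on hypothetical toy data (the column names `related` and `unrelated`, the 0.05 threshold, and the `ranking` frame are all assumptions made for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=500)
X_toy = pd.DataFrame({
    "related": y_toy + rng.integers(0, 2, size=500),   # constructed to depend on y_toy
    "unrelated": rng.integers(0, 3, size=500),          # drawn independently of y_toy
})

scores, p_vals = chi2(X_toy, y_toy)

# Higher score and lower p-value both indicate stronger dependence on the target.
ranking = pd.DataFrame(
    {"chi2_score": scores, "p_value": p_vals}, index=X_toy.columns
).sort_values("chi2_score", ascending=False)
ranking["significant"] = ranking["p_value"] < 0.05
print(ranking)
```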


# **----------------------------------------------------------------------------------------------------**

**Select KBest Categorical Features, based on Chi-Square Score.**

[sklearn.feature_selection.SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)

In [29]:
select_chi2_features = SelectKBest(chi2, k=9) # Select features according to the K highest scores.
X_KBest_features = select_chi2_features.fit_transform(X, y)

In [30]:
print("Original Features Number:", X.shape[1])
print("Reduced Features Number:", X_KBest_features.shape[1])

Original Features Number: 12
Reduced Features Number: 9
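
`fit_transform` returns a plain array, so the names of the retained columns are lost. A minimal sketch, on hypothetical toy data, of recovering them through the selector's boolean mask from `get_support()` (the column names and `k=1` are assumptions made for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(42)
y_toy = rng.integers(0, 2, size=300)
X_toy = pd.DataFrame({
    "signal": 2 * y_toy + rng.integers(0, 2, size=300),  # constructed to depend on y_toy
    "noise_a": rng.integers(0, 4, size=300),              # drawn independently of y_toy
    "noise_b": rng.integers(0, 4, size=300),              # drawn independently of y_toy
})

selector = SelectKBest(chi2, k=1).fit(X_toy, y_toy)

mask = selector.get_support()              # boolean array, True for kept columns
selected = list(X_toy.columns[mask])
print("Selected features:", selected)
```

The same mask can be applied to the original `X.columns` after fitting the selector on the real data.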
