# 03 - Redundancy management and dimensionality reduction

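Redundancy Management:
In data analysis, redundancy refers to repeated or uninformative data (duplicate rows, near-constant features, or features that largely duplicate one another) that can skew results or contribute to overfitting.

Dimensionality Reduction:
The process of reducing the number of variables (features) in a dataset while retaining as much information as possible.

In this notebook I train three logistic regression models to compare how these two techniques affect the classification metrics: one on the original features, one after a variance threshold, and one after PCA.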
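To make the two definitions concrete, here is a minimal sketch on a made-up three-column array (the data and the names `X_toy`, `selector`, and `pca_toy` are illustrative assumptions, not part of the bank dataset). It shows a variance threshold discarding a constant column and PCA collapsing two nearly identical columns into one component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: one informative column, a near-copy of it, and a constant
rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=n)
X_toy = np.column_stack([
    informative,                                          # varying feature
    2.0 * informative + rng.normal(scale=0.01, size=n),   # nearly a rescaled copy of column 0
    np.full(n, 3.0),                                      # constant column (zero variance)
])

# Redundancy management: keep only features whose variance is above the threshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X_toy)
print("Kept columns mask:", selector.get_support())

# Dimensionality reduction: keep the fewest components explaining >= 95% of the variance
pca_toy = PCA(n_components=0.95)
X_projected = pca_toy.fit_transform(X_selected)
print("Shape after PCA:", X_projected.shape)
```

Note that the variance threshold keeps the near-copy column: it only looks at each feature on its own, while PCA picks up the correlation between the two.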

In [None]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report



In [None]:
# Load dataset
df = pd.read_csv("bank_numeric.csv")

# Define features and target
target_column = "deposit"
X = df.drop(columns=[target_column])
y = df[target_column]

In [19]:
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,41,9,1,1,0,1270,1,0,2,5,5,1389,1,-1,0,3,1
1,42,4,2,2,0,0,1,1,2,5,5,562,2,-1,0,3,1
2,37,9,1,1,0,1,1,0,2,6,5,608,1,-1,0,3,1
3,38,0,2,1,0,100,1,0,2,7,5,786,1,-1,0,3,1
4,30,1,1,1,0,309,1,0,2,7,5,1574,2,-1,0,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5252,34,1,2,1,0,-72,1,0,0,7,7,273,5,-1,0,3,0
5253,33,1,2,0,0,1,1,0,0,20,4,257,1,-1,0,3,0
5254,39,7,1,1,0,733,0,0,2,16,6,83,4,-1,0,3,0
5255,32,9,2,1,0,29,0,0,0,19,8,156,2,-1,0,3,0


In [8]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [12]:
# Option 1: baseline model on the original features

print("Initial Dataset Shape:", X_train.shape)

# Logistic Regression on the original dataset
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train, y_train)

# Predict and evaluate metrics
y_pred = log_reg.predict(X_test)
print("\nInitial Model Metrics:")
print(classification_report(y_test, y_pred))

acc = accuracy_score(y_test, y_pred)
print("\nModel overall accuracy (Original): {:.2f}%".format(acc * 100))

Initial Dataset Shape: (3679, 16)

Initial Model Metrics:
              precision    recall  f1-score   support

           0       0.80      0.89      0.84       915
           1       0.82      0.68      0.75       663

    accuracy                           0.81      1578
   macro avg       0.81      0.79      0.79      1578
weighted avg       0.81      0.81      0.80      1578


Model overall accuracy (Original): 80.54%


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [13]:
# Option 2: redundancy management (variance threshold)
# VarianceThreshold drops features whose training-set variance is not above the threshold
var_thresh = VarianceThreshold(threshold=0.01)
X_train_reduced = var_thresh.fit_transform(X_train)
X_test_reduced = var_thresh.transform(X_test)

print("\nDataset Shape After Redundancy Management:", X_train_reduced.shape)

# Logistic Regression after Redundancy Management
log_reg_reduced = LogisticRegression(max_iter=1000, random_state=42)
log_reg_reduced.fit(X_train_reduced, y_train)

# Predict and evaluate metrics
y_pred_reduced = log_reg_reduced.predict(X_test_reduced)
print("\nMetrics After Redundancy Management:")
print(classification_report(y_test, y_pred_reduced))

acc_reduced = accuracy_score(y_test, y_pred_reduced)
print("\nModel overall accuracy (After Redundancy Management): {:.2f}%".format(acc_reduced * 100))



Dataset Shape After Redundancy Management: (3679, 16)

Metrics After Redundancy Management:
              precision    recall  f1-score   support

           0       0.80      0.89      0.84       915
           1       0.82      0.68      0.75       663

    accuracy                           0.81      1578
   macro avg       0.81      0.79      0.79      1578
weighted avg       0.81      0.81      0.80      1578


Model overall accuracy (After Redundancy Management): 80.54%


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
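The shape stayed at 16 columns because a variance threshold only catches near-constant features. A complementary check, shown here as a rough sketch on invented data, is to look for highly correlated feature pairs. The column names, the 0.95 correlation cutoff, and the `toy` frame are all hypothetical choices, not something taken from the bank dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical frame: a feature, a noisy rescaled copy of it, an unrelated
# feature, and a column that is zero in all but two rows
rng = np.random.default_rng(42)
n = 500
base = rng.normal(size=n)
toy = pd.DataFrame({
    "a": base,
    "a_copy": base * 3.0 + rng.normal(scale=0.05, size=n),
    "b": rng.normal(size=n),
    "almost_constant": np.where(np.arange(n) < 2, 1.0, 0.0),
})

# Step 1: the variance threshold removes only the near-constant column
vt = VarianceThreshold(threshold=0.01).fit(toy)
kept = toy.columns[vt.get_support()].tolist()
print("Kept by VarianceThreshold:", kept)

# Step 2: scan the upper triangle of the absolute correlation matrix
corr = toy[kept].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print("Highly correlated, candidates to drop:", to_drop)
```

Whether a correlated column should actually be dropped is a modelling decision; the scan only flags candidates.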


In [14]:
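# Option 3: dimensionality reduction (PCA)
# Note: the features are not standardized here, so high-variance columns dominate the components
pca = PCA(n_components=0.95)  # Keep the fewest components that explain at least 95% of the variance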
X_train_pca = pca.fit_transform(X_train_reduced)
X_test_pca = pca.transform(X_test_reduced)

print("\nDataset Shape After PCA:", X_train_pca.shape)

# Logistic Regression after Dimensionality Reduction
log_reg_pca = LogisticRegression(max_iter=1000, random_state=42)
log_reg_pca.fit(X_train_pca, y_train)

# Predict and evaluate metrics
y_pred_pca = log_reg_pca.predict(X_test_pca)
print("\nMetrics After Dimensionality Reduction:")
print(classification_report(y_test, y_pred_pca))

acc_pca = accuracy_score(y_test, y_pred_pca)
print("\nModel overall accuracy (After Dimensionality Reduction): {:.2f}%".format(acc_pca * 100))


Dataset Shape After PCA: (3679, 2)

Metrics After Dimensionality Reduction:
              precision    recall  f1-score   support

           0       0.74      0.90      0.81       915
           1       0.80      0.57      0.66       663

    accuracy                           0.76      1578
   macro avg       0.77      0.73      0.74      1578
weighted avg       0.76      0.76      0.75      1578


Model overall accuracy (After Dimensionality Reduction): 75.67%
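Going from 16 columns to 2 is what PCA tends to do when a few features have much larger numeric ranges than the rest. The sketch below reproduces the effect on made-up data; the column scales are arbitrary assumptions and are not measured from the bank dataset. It compares the number of components kept with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Six independent synthetic features with very different (made-up) scales
rng = np.random.default_rng(0)
n = 1000
X_mixed = np.column_stack([
    rng.normal(scale=3000.0, size=n),  # very large scale
    rng.normal(scale=1000.0, size=n),  # large scale
    rng.normal(scale=10.0, size=n),
    rng.normal(scale=1.0, size=n),
    rng.normal(scale=0.5, size=n),
    rng.normal(scale=0.3, size=n),
])

# PCA on raw values: the two large-scale columns account for almost all variance
pca_raw = PCA(n_components=0.95).fit(X_mixed)

# PCA on standardized values: every column contributes unit variance
X_mixed_scaled = StandardScaler().fit_transform(X_mixed)
pca_scaled = PCA(n_components=0.95).fit(X_mixed_scaled)

print("Components kept (raw):   ", pca_raw.n_components_)
print("Components kept (scaled):", pca_scaled.n_components_)
```

On independent features, standardized PCA has little to compress, so it keeps nearly everything. On real data with correlated features the reduction would usually land somewhere in between.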


In [None]:
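# Insights:

# Redundancy management did not remove any features: the shape stayed at (3679, 16).
# VarianceThreshold only drops features whose variance is not above the threshold (0.01),
# so this means no feature is near-constant; it says nothing about relevance.
# In addition, all duplicates and NaN values had already been removed.

# Dimensionality reduction removed almost all columns: 16 features became 2 components.
# To be honest, at first it looked like a bug in the tool, but it is expected behaviour:
# PCA was run on unscaled features, so the columns with the largest variance dominate.
# Surprisingly, overall accuracy did not collapse (75.67% vs. 80.54%).

# I was wondering which original columns the two components correspond to (see the next cell):
# PC1 is heavily influenced by balance (with a weight of 0.999716).
# PC2 is heavily influenced by duration (with a weight of 0.999738).
# These two columns carry the most variance in the raw data,
# which is not the same as being the most important predictors.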


In [23]:
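# Get the PCA components (contributions of original features to PCs)
# Build the row labels from the fitted number of components instead of hard-coding them
pca_components = pd.DataFrame(
    pca.components_,
    columns=X_train.columns[var_thresh.get_support()],
    index=[f"PC{i + 1}" for i in range(pca.n_components_)]
)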

print("\nContribution of Original Features to Principal Components:")
print(pca_components)



Contribution of Original Features to Principal Components:
          age       job   marital  education   default   balance   housing  \
PC1  0.000457  0.000242  0.000053   0.000066 -0.000029  0.999716 -0.000056   
PC2 -0.000529 -0.000181  0.000050  -0.000072 -0.000002 -0.022853  0.000082   

         loan   contact       day     month  duration  campaign     pdays  \
PC1 -0.000060 -0.000040 -0.000112  0.000097  0.022855 -0.000267  0.006750   
PC2  0.000013 -0.000057 -0.000404 -0.000067  0.999738 -0.000262 -0.000442   

     previous  poutcome  
PC1  0.000075 -0.000035  
PC2 -0.000043  0.000021  
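Reading a wide loadings table by eye gets tedious. A small sketch of how the dominant feature per component could be extracted programmatically follows, using an invented three-column frame (`toy_features` and its column names are hypothetical). It also prints `explained_variance_ratio_`, which shows how much variance each component accounts for.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical independent features with very different scales
rng = np.random.default_rng(1)
n = 800
toy_features = pd.DataFrame({
    "big": rng.normal(scale=2000.0, size=n),
    "medium": rng.normal(scale=600.0, size=n),
    "small": rng.normal(scale=2.0, size=n),
})

pca_demo = PCA(n_components=2).fit(toy_features)

# Rows are components, columns are the original features
loadings = pd.DataFrame(
    pca_demo.components_,
    columns=toy_features.columns,
    index=[f"PC{i + 1}" for i in range(pca_demo.n_components_)],
)

# For each component, the feature with the largest absolute weight
dominant = loadings.abs().idxmax(axis=1)
print(dominant)

# Share of total variance captured by each component
print(pca_demo.explained_variance_ratio_.round(4))
```

A large loading identifies which raw column a component mostly tracks. It does not by itself say how useful that column is for predicting the target.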
