# **Second Assignment: Machine Learning with Spark**


## Authors: Alfonso Delgado Lara & Diego Rivera Suárez

The primary objective of this project is to develop a scalable classification model to predict the success of telemarketing campaigns for a banking institution. Specifically, the goal is to predict whether a client will subscribe to a term deposit (binary classification: "yes" or "no"). This problem is critical for optimizing marketing strategies, reducing wasted resources, and increasing conversion rates.

To handle the data processing and modeling efficiently, we used Apache Spark (PySpark), leveraging its MLlib library to create robust Machine Learning pipelines. The core classification algorithm selected for this study is Logistic Regression, chosen for its interpretability and efficiency in binary classification tasks. Finally, this assignment focuses heavily on the UnivariateFeatureSelector and its effect on model performance and interpretability, highlighting its importance as a tool for reducing overfitting and computational cost.

## Uploading the data

The first step of this study is to load the dataset, which is the same one used for the first assignment.

In [2]:
import pandas as pd

df = pd.read_pickle("bank_92.pkl")

In the first assignment, we were asked to preprocess the **pdays** variable, since in its raw form it is not suitable for classification. That preprocessing is replicated here. It is not included in a Spark pipeline; instead, it is applied up front with pandas so that the dataset already has the desired structure before being exported to .csv format. The preprocessing works as follows:

The pdays variable records the number of days since the client was last contacted, with -1 indicating that the client was never contacted. Machine learning models can misinterpret such placeholder values, so pdays is transformed into two new features:

- **contacted_before** is a binary indicator where 1 means the client was previously contacted (pdays > -1) and 0 means never contacted (pdays = -1).

- **days_since_last_contact** preserves the original pdays value for previously contacted clients, while assigning max(pdays) + 100 to clients never contacted.

This transformation allows both tree-based and distance-based models to interpret the information effectively, treating "never contacted" as a meaningful and distinct situation.

In [3]:
#Preprocessing pdays using a custom transformer
#This transformer creates two new features and drops the original pdays column
from sklearn.base import BaseEstimator, TransformerMixin

class PdaysTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.max_pdays = None

    def fit(self, X, y=None):
        #Find the maximum pdays for clients who were previously contacted
        self.max_pdays = X.loc[X['pdays'] > -1, 'pdays'].max()
        return self

    def transform(self, X):
        X = X.copy()
        #Create binary feature: 1 if client was contacted before, 0 otherwise
        X['contacted_before'] = (X['pdays'] > -1).astype(int)

        #Create numeric feature for days since last contact
        #Assign max_pdays + 100 to clients never contacted
        X['days_since_last_contact'] = X['pdays'].apply(
            lambda x: x if x > -1 else self.max_pdays + 100
        )

        # Drop the original pdays column
        X = X.drop(columns=['pdays'])
        return X

transformer = PdaysTransformer()
df = transformer.fit_transform(df)
print(df.head())

   age         job  marital  education default  balance housing loan  contact  \
0   59      admin.  married  secondary      no     2343     yes   no  unknown   
1   56      admin.  married  secondary      no       45      no   no  unknown   
2   41  technician  married  secondary      no     1270     yes   no  unknown   
3   55    services  married  secondary      no     2476     yes   no  unknown   
4   54      admin.  married   tertiary      no      184      no   no  unknown   

   day month  duration  campaign  previous poutcome deposit  contacted_before  \
0    5   may      1042         1         0  unknown     yes                 0   
1    5   may      1467         1         0  unknown     yes                 0   
2    5   may      1389         1         0  unknown     yes                 0   
3    5   may       579         1         0  unknown     yes                 0   
4    5   may       673         2         0  unknown     yes                 0   

   days_since_last_contact

With the **pdays** variable processed, the dataset can now be exported to a .csv file.

In [4]:
df.to_csv('bank_92.csv', index=False)

## Creating the Spark Session

PySpark is the Python API for Apache Spark, a powerful open-source distributed computing framework. It allows developers to harness the simplicity of the Python language while leveraging Spark’s ability to process massive datasets in parallel across a cluster of computers. It effectively bridges the gap between data science in Python and big data engineering, enabling scalable data processing and Machine Learning (via the MLlib library).

This snippet initializes the Spark environment, establishing the necessary connection to the computing cluster. It creates a SparkSession, which serves as the unified entry point for programming with DataFrames, and exposes the SparkContext (sc) to coordinate task execution across the distributed worker nodes.

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Second Assignment: Machine Learning with Spark") \
    .getOrCreate()
sc = spark.sparkContext

print(spark)
print(sc)

<pyspark.sql.session.SparkSession object at 0x7baa98ccd3d0>
<SparkContext master=local[*] appName=Second Assignment: Machine Learning with Spark>


Next, we load the dataset into the Spark session:

In [6]:
ava_sd=spark.read.csv(path='bank_92.csv',header=True,inferSchema=True)
ava_sd.show()

+---+-----------+--------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+--------+--------+-------+----------------+-----------------------+
|age|        job| marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|previous|poutcome|deposit|contacted_before|days_since_last_contact|
+---+-----------+--------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+--------+--------+-------+----------------+-----------------------+
| 59|     admin.| married|secondary|     no|   2343|    yes|  no|unknown|  5|  may|    1042|       1|       0| unknown|    yes|               0|                    954|
| 56|     admin.| married|secondary|     no|     45|     no|  no|unknown|  5|  may|    1467|       1|       0| unknown|    yes|               0|                    954|
| 41| technician| married|secondary|     no|   1270|    yes|  no|unknown|  5|  may|    1389|       1|       0| unknown|    yes|               0|           

Next, we prepare the data for the PySpark Machine Learning environment, starting by renaming the target column **deposit** to **label**:

In [7]:
data_sd = ava_sd.withColumnRenamed("deposit", "label")

ignore = ['label']


### Data Preprocessing Strategy

**Why is this step necessary?**

Machine Learning algorithms, such as Logistic Regression, rely on mathematical equations that require numerical input. Our raw dataset, however, contains **categorical variables** (strings like "married" or "admin"). Furthermore, PySpark's **MLlib** library has a specific architectural requirement: it expects all input features to be consolidated into a **single vector column**. Therefore, this preprocessing stage is essential to transform human-readable text into a machine-readable numerical format without introducing statistical bias.

The code implements a transformation pipeline consisting of three key stages:

1. **Label Indexing:**
   The target variable **label** is currently a string ("yes"/"no"). We use **StringIndexer** to convert this into a numerical label (0.0 or 1.0), allowing the algorithm to calculate the classification error.

2. **Categorical Encoding (The Loop):**
   We iterate through each categorical variable **cat_cols** and apply a two-step transformation:
   * **StringIndexer:** First, maps each unique string category to a numerical index based on frequency.
   * **OneHotEncoder:** Converts these indices into binary vectors. This step is vital for nominal variables to prevent the model from assuming an artificial ordinal relationship (e.g., incorrectly assuming that job category 2 is "greater than" job category 1).

3. **Feature Assembly:**
   Finally, the **VectorAssembler** takes the newly created One-Hot Encoded vectors and combines them with the original numerical columns **num_cols**. This results in a single column named **features**, containing a comprehensive vector representation of each client.

All these steps are wrapped in a **Pipeline**, ensuring that the sequence of transformations is applied consistently to both training and validation data.
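
To make the two encoding steps concrete, the following is a minimal pure-Python sketch of what they do conceptually. It is a simplification and not Spark's implementation: it ignores `handleInvalid="keep"` and the encoder's handling of the last category, the helper names `fit_string_index` and `one_hot` are hypothetical, and the toy column is invented for illustration.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: idx for idx, category in enumerate(ordered)}

def one_hot(index, size):
    """Return a dense 0/1 list with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Hypothetical toy column, not taken from the dataset
marital = ["married", "single", "married", "divorced", "married", "single"]

mapping = fit_string_index(marital)
encoded = [one_hot(mapping[v], len(mapping)) for v in marital]

print(mapping)     # category -> index
print(encoded[0])  # one-hot vector for the first row
```

Because each category gets its own 0/1 position, the model never sees an ordering between categories, which is the point of applying the encoder after the indexer.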

In [8]:
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline

cat_cols = ['job', 'marital', 'education', 'default', 'housing',
            'loan', 'contact', 'poutcome','month']
num_cols = ['age','balance','duration','campaign','contacted_before',
            'days_since_last_contact','previous','day']

# Now, we are going to create the Pipeline for preprocessing the categorical
# features, since we cannot use the VectorAssembler function if we have them.

stages = []

label_indexer = StringIndexer(inputCol="label", outputCol="label_index")
stages += [label_indexer]

for categoricalCol in cat_cols:
    # First, we use StringIndexer to transform each category into a numerical index
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "_index",
                                  handleInvalid="keep")

    # Then, we use OneHotEncoder to transform the indices into a one-hot vector
    #  (e.g., 0 -> [1,0], 1 -> [0,1])
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()],
                            outputCols=[categoricalCol + "_vec"],
                            handleInvalid="keep")

    stages += [stringIndexer, encoder]


# Now, we join the OHE new columns with the original numerical ones:

assemblerInputs = [c + "_vec" for c in cat_cols] + num_cols

assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# And the final Pipeline for this preprocessing part will be:

pipeline1= Pipeline(stages=stages)

Once the Pipeline is constructed, we fit it to the data and use it to transform the dataset into the format that PySpark's ML algorithms expect.

In [9]:
pipeline_model = pipeline1.fit(data_sd)
dataFinal = pipeline_model.transform(data_sd)

dataFinal = dataFinal \
    .drop("label") \
    .withColumnRenamed("label_index", "label")

dataFinal.select('label', 'features').show(5, truncate=False)

+-----+-------------------------------------------------------------------------------------------------------------------------+
|label|features                                                                                                                 |
+-----+-------------------------------------------------------------------------------------------------------------------------+
|1.0  |(61,[3,13,17,22,26,28,32,35,40,53,54,55,56,58,60],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,59.0,2343.0,1042.0,1.0,954.0,5.0])|
|1.0  |(61,[3,13,17,22,25,28,32,35,40,53,54,55,56,58,60],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,56.0,45.0,1467.0,1.0,954.0,5.0])  |
|1.0  |(61,[2,13,17,22,26,28,32,35,40,53,54,55,56,58,60],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,41.0,1270.0,1389.0,1.0,954.0,5.0])|
|1.0  |(61,[4,13,17,22,26,28,32,35,40,53,54,55,56,58,60],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,55.0,2476.0,579.0,1.0,954.0,5.0]) |
|1.0  |(61,[3,13,18,22,25,28,32,35,40,53,54,55,56,58,60],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,

Before training the model, we must split the dataset to simulate how the model will perform on unseen data. Given our dataset size of approximately **11,000 records**, we selected an **80/20 split ratio**:

* **Training Set (80% ~ 8,800 rows):** This provides the algorithm with sufficient data to learn the underlying patterns and relationships between features without underfitting.
* **Validation Set (20% ~ 2,200 rows):** This is large enough to evaluate the model's performance reliably. Since marketing datasets are often **imbalanced** (fewer "yes" responses), a smaller validation set (e.g., 10%) might contain too few positive examples to yield a stable AUC or accuracy estimate. The 20% hold-out makes the evaluation metrics more robust.

In [10]:
random_seed = 100566492

train, validation = dataFinal.randomSplit([0.8, 0.2], seed=random_seed)
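
The row counts quoted for the split are approximate. As far as we understand, `randomSplit` assigns each row to a split through an independent random draw, so the realised sizes fluctuate around the requested weights. The sketch below illustrates that idea in plain Python; the function `random_split` and the stand-in row list are hypothetical and do not reproduce Spark's sampler or its exact assignments.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or validation with an independent seeded draw."""
    rng = random.Random(seed)
    train_rows, validation_rows = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train_rows.append(row)
        else:
            validation_rows.append(row)
    return train_rows, validation_rows

rows = list(range(11000))  # stand-in for the ~11,000 records
train_rows, validation_rows = random_split(rows, 0.8, seed=100566492)

# Sizes are close to 80/20 but not exact; the same seed gives the same split
print(len(train_rows), len(validation_rows))
```

Fixing the seed is what makes the experiment repeatable: all four models are then compared on exactly the same validation rows.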

# FEATURE SELECTION APPROACHES

Following the preprocessing stage, specifically the One-Hot Encoding of categorical variables, the dimensionality of our dataset has increased significantly (from the original input columns to over 60 generated features). While this provides granular information, a high number of features can lead to **overfitting**, increased computational cost, and "noise" that confuses the model.

To address this, we employ the **UnivariateFeatureSelector**. This tool evaluates each feature individually against the target variable **label** using statistical tests (ANOVA F-test) to determine the strength of the relationship.



We will conduct a comparative analysis using **four distinct strategies** to identify the optimal balance between model complexity and performance:

1.  **Baseline (No Selection):**
    We utilize all generated features. This serves as the control group to measure if feature selection actually improves or degrades the model.

2.  **FPR (False Positive Rate):**
    The **least conservative** strategy. It keeps every feature whose p-value falls below the chosen threshold, allowing more features to pass through. It prioritizes keeping potential signals, even at the risk of including some noise.

3.  **FWE (Family-Wise Error Rate):**
    The **most conservative** strategy. It applies a strict penalty to the p-values to control the probability of making even a single false discovery. This results in a leaner model containing only the most statistically significant features.

4.  **Percentile (Top 25%):**
    A pragmatic approach that ignores p-value thresholds and simply selects the top 25% of features with the highest statistical scores. This forces a fixed "budget" of features (reducing the 61 inputs to approximately 15).

*Note: For the UnivariateFeatureSelector to function correctly with our mixed dataset in PySpark, we configure the selector to treat all features as "continuous," allowing the underlying ANOVA test to assess both numerical and binary (dummy) variables uniformly.*
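
The difference between the three selection modes can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: the function `select_features` and the p-values are hypothetical, and the `fwe` branch assumes the threshold is scaled by 1 / (number of features), which is how we understand that mode to work.

```python
def select_features(p_values, mode, threshold):
    """Return the sorted indices of the features kept under a selection mode."""
    n = len(p_values)
    if mode == "fpr":
        # Keep every feature whose p-value is below the threshold
        kept = [i for i, p in enumerate(p_values) if p < threshold]
    elif mode == "fwe":
        # Same rule, but with the threshold divided by the number of features
        kept = [i for i, p in enumerate(p_values) if p < threshold / n]
    elif mode == "percentile":
        # Keep a fixed fraction of features, smallest p-values first
        ranked = sorted(range(n), key=lambda i: p_values[i])
        kept = ranked[: int(n * threshold)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(kept)

# Hypothetical p-values for 8 features
p_values = [0.0001, 0.03, 0.20, 0.004, 0.049, 0.60, 0.0009, 0.012]

print(select_features(p_values, "fpr", 0.05))         # [0, 1, 3, 4, 6, 7]
print(select_features(p_values, "fwe", 0.05))         # [0, 3, 6]
print(select_features(p_values, "percentile", 0.25))  # [0, 6]
```

With the same 0.05 threshold, `fwe` always keeps a subset of what `fpr` keeps, while `percentile` keeps a fixed number of features regardless of how significant they are.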


To predict the binary outcome **label**, we employ **Logistic Regression**. Despite its name, this is a linear model used for classification rather than regression. It estimates the probability of an event occurring (e.g., the probability of a client saying "yes") by mapping the output of a linear equation to a value between 0 and 1 using the **sigmoid function**.
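
As a small illustration of the mapping described above, the sketch below computes a probability from a linear score with the sigmoid function. The weights, intercept, and feature values are hypothetical and are not taken from the trained model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to (0, 1), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, intercept, features):
    """Probability of the positive class for one observation."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients and a hypothetical client with two features
weights = [0.8, -1.2]
intercept = -0.5
client = [2.0, 0.5]

p = predict_proba(weights, intercept, client)
label = 1 if p >= 0.5 else 0  # 0.5 is the usual default decision threshold
print(round(p, 4), label)
```

A linear score of zero corresponds to a probability of exactly 0.5, which is why the sign of the score decides the predicted class at the default threshold.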



**Note on Feature Scaling:**
Typically, linear models require input features to be scaled (standardized) so that variables with large magnitudes (like **balance** or **duration**) do not disproportionately influence the model compared to binary variables (0/1). However, we **excluded an explicit StandardScaler step** from our pipeline. This is because PySpark's **LogisticRegression** implementation includes a built-in parameter **standardization=True** (enabled by default). This automatically handles the scaling of features during the training process to ensure numerical stability and correct regularization, simplifying our workflow.
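
To show why scaling matters, the sketch below standardizes the five **balance** values visible in the first rows printed earlier in this notebook. It is a plain z-score illustration with a hypothetical helper `standardize`; the exact procedure Spark applies internally when `standardization=True` may differ in its details.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (sample) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

# balance values from the first five rows displayed earlier in this notebook
balance = [2343, 45, 1270, 2476, 184]
scaled = standardize(balance)

print([round(v, 3) for v in scaled])
```

After rescaling, a variable measured in thousands sits on a scale comparable to the 0/1 dummy variables, so a regularization penalty treats their coefficients more evenly.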

## No feature selection

Before applying any filtering strategy, it is essential to train a model using **all available features** (the full set of 61 columns generated by the preprocessing). This serves as our **control group**.

By training the Logistic Regression on the complete dataset, we establish a baseline performance level (Accuracy and AUC). This allows us to answer the critical question later: *Does removing features actually improve the model, or does it cause a loss of valuable information?*

In [11]:
from pyspark.ml.classification import LogisticRegression

# We create a Pipeline with Logistic Regression as its only step
lr = LogisticRegression()
pipeline_no_fs = Pipeline(stages=[lr])

# We train the model
model_no_fs = pipeline_no_fs.fit(train)

# And finally make predictions on the validation set.
predictions_no_fs = model_no_fs.transform(validation)
predictions_no_fs.select('label', 'prediction', 'probability').show(5)

+-----+----------+--------------------+
|label|prediction|         probability|
+-----+----------+--------------------+
|  1.0|       1.0|[1.66110065873039...|
|  1.0|       1.0|[0.19288408764953...|
|  0.0|       1.0|[0.17038390343469...|
|  1.0|       0.0|[0.52568405879518...|
|  1.0|       1.0|[0.03716517858075...|
+-----+----------+--------------------+
only showing top 5 rows


## **Feature Selection using the FPR Strategy**

Feature selection was applied using the **False Positive Rate (FPR)** strategy. This approach is considered the least conservative, selecting a larger set of features that may be relevant for the model.  

A significance threshold of 0.05 was used, retaining only features that show a statistically significant relationship with the target variable.  
The objective is to evaluate whether this feature selection improves, maintains, or reduces the performance of the logistic regression model compared to the baseline model without feature selection.


In [12]:
from pyspark.ml.feature import UnivariateFeatureSelector

# Configure the feature selector using the FPR strategy
selector_fpr = UnivariateFeatureSelector(
    featuresCol="features",
    outputCol="selectedFeatures_fpr",
    labelCol="label",
    selectionMode="fpr",
)

# Set the significance threshold to 0.05
selector_fpr.setSelectionThreshold(0.05)


UnivariateFeatureSelector_2a93f70cffb5

At this stage, with the preprocessing already done, all input variables are treated as continuous and the target variable as categorical. This setup lets the UnivariateFeatureSelector evaluate the statistical relevance of each feature with respect to the target under assumptions that match the data types.


In [13]:
# Specify that all input features are treated as continuous
selector_fpr.setFeatureType("continuous")

# Specify that the target variable is categorical
selector_fpr.setLabelType("categorical")

UnivariateFeatureSelector_2a93f70cffb5

After configuring the feature selection using the FPR strategy, a logistic regression model is applied to the selected features.  

The pipeline is constructed to combine the feature selector and the classifier, ensuring that the selection process and model training are performed sequentially.  

The model is then trained on the training dataset, and predictions are generated for the validation set in order to assess its performance.


In [14]:
# Configure the Logistic Regression model to use the features selected by the FPR strategy
lr = LogisticRegression(
    featuresCol="selectedFeatures_fpr",
    labelCol="label"
)

# Create a pipeline including the feature selector and the logistic regression model
pipeline_fpr = Pipeline(stages=[selector_fpr, lr])

# Train the model using the training data and generate predictions on the validation set
model_fpr = pipeline_fpr.fit(train)
predictions_fpr = model_fpr.transform(validation)


Once the model has been trained, the fitted feature selector is extracted from the pipeline in order to examine which features were retained by the FPR strategy.  

The indices of the selected features are retrieved and compared to the total number of original features, providing an overview of the dimensionality reduction achieved.  

This step allows assessing how many and which features the feature selection process considered most relevant for the model.


In [15]:
# Extract the fitted feature selector from the trained pipeline
selector_model = model_fpr.stages[0]

# Retrieve the indices of the features that were selected
selected_indices = selector_model.selectedFeatures

# Display the number of original features and the number of features selected by the FPR strategy
print(f"Original number of features: 61")
print(f"Number of selected features: {len(selected_indices)}")
print(f"Selected feature indices: {selected_indices}")


Original number of features: 61
Number of selected features: 43
Selected feature indices: [13, 19, 4, 16, 55, 28, 29, 54, 14, 50, 35, 51, 37, 57, 36, 45, 1, 17, 18, 12, 9, 38, 49, 56, 47, 53, 22, 25, 46, 59, 32, 0, 42, 23, 40, 7, 8, 58, 44, 60, 31, 26, 5]


## **Feature Selection using the FWE Strategy**

The feature selection process is now applied using the **Family-Wise Error (FWE)** strategy.  

This method is more conservative than the FPR approach, retaining only features that show a strong statistical relationship with the target variable.

The objective is to evaluate how a stricter selection criterion affects the performance of the logistic regression model compared to both the baseline and the FPR-based selection.


In [16]:
# Configure the feature selector using the FWE strategy
selector_fwe = UnivariateFeatureSelector(
    featuresCol="features",
    outputCol="selectedFeatures",
    labelCol="label",
    selectionMode="fwe"
)

# Set the significance threshold to 0.05
selector_fwe.setSelectionThreshold(0.05)

UnivariateFeatureSelector_fdcc738ccee1


Again, the input features are treated as continuous and the target variable as categorical, ensuring that the FWE-based feature selection evaluates the statistical relevance of each feature correctly.

This configuration prepares the selector to apply the stricter significance criterion defined by the Family-Wise Error approach.


In [17]:
selector_fwe.setFeatureType("continuous")
selector_fwe.setLabelType("categorical")

UnivariateFeatureSelector_fdcc738ccee1

The logistic regression model is applied to the features selected by the FWE strategy.  

The pipeline is constructed and trained in the same way as for the FPR approach, allowing a direct comparison of model performance under the stricter feature selection criterion.


In [18]:
lr_fwe = LogisticRegression(featuresCol='selectedFeatures', labelCol='label')

# Create the pipeline specific to FWE selection
pipeline_fwe = Pipeline(stages=[selector_fwe, lr_fwe])

# Train the model
model_fwe = pipeline_fwe.fit(train)

# Generate predictions on the validation set
predictions_fwe = model_fwe.transform(validation)


In the same way as before, after training the model, the fitted feature selector is extracted from the pipeline to examine which features were retained by the FWE strategy.

The indices of the selected features are retrieved and compared to the total number of original features, providing an overview of the dimensionality reduction achieved.  

This step allows assessing how many and which features the stricter FWE-based selection process considered most relevant for the model.


In [19]:
# Extract the fitted feature selector from the trained pipeline
selector_model_fwe = model_fwe.stages[0]

# Extract the indices of the selected features
selected_indices_fwe = selector_model_fwe.selectedFeatures

# Display summary of feature selection
print(f"Original number of features: 61")
print(f"Number of features selected (FWE): {len(selected_indices_fwe)}")
print(f"Selected feature indices: {selected_indices_fwe}")

Original number of features: 61
Number of features selected (FWE): 37
Selected feature indices: [13, 19, 4, 16, 55, 28, 29, 54, 14, 50, 35, 51, 37, 57, 45, 1, 17, 18, 12, 9, 38, 49, 56, 47, 25, 46, 59, 32, 42, 40, 7, 8, 58, 60, 31, 26, 5]
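
A quick way to see what the stricter criterion changed is to compare the two printed index lists directly. The snippet below copies the FPR and FWE indices from the outputs above and computes their difference with plain Python sets.

```python
# Indices copied from the printed FPR and FWE outputs in this notebook
fpr_indices = [13, 19, 4, 16, 55, 28, 29, 54, 14, 50, 35, 51, 37, 57, 36, 45,
               1, 17, 18, 12, 9, 38, 49, 56, 47, 53, 22, 25, 46, 59, 32, 0, 42,
               23, 40, 7, 8, 58, 44, 60, 31, 26, 5]
fwe_indices = [13, 19, 4, 16, 55, 28, 29, 54, 14, 50, 35, 51, 37, 57, 45, 1,
               17, 18, 12, 9, 38, 49, 56, 47, 25, 46, 59, 32, 42, 40, 7, 8, 58,
               60, 31, 26, 5]

# Features kept by FPR but removed by FWE
dropped_by_fwe = sorted(set(fpr_indices) - set(fwe_indices))
print(dropped_by_fwe)                        # [0, 22, 23, 36, 44, 53]

# Every FWE feature was also selected by FPR
print(set(fwe_indices) <= set(fpr_indices))  # True
```

FWE therefore removes six of the 43 FPR features and adds none. One of them is index 53, which, given the column order passed to the VectorAssembler, corresponds to **age**.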


## **Feature Selection using Percentile Strategy**

The Percentile-based feature selection strategy selects a fixed proportion of the most relevant features, in this case the top 25% based on statistical relevance.  

All input features are treated as continuous and the target as categorical, as in the previous approaches.  

The main difference compared to the FPR and FWE strategies is that the selection threshold is defined as a proportion of features rather than a significance level. With 61 input features, 25% corresponds to 15.25, so the selector keeps 15 features.


In [20]:
# Configure the feature selector using the Percentile strategy
selector_pct = UnivariateFeatureSelector(
    featuresCol="features",
    outputCol="selectedFeatures",
    labelCol="label",
    selectionMode="percentile"
)

# Select the top 25% of features based on their statistical relevance
selector_pct.setSelectionThreshold(0.25)

selector_pct.setFeatureType("continuous")
selector_pct.setLabelType("categorical")

lr_pct = LogisticRegression(featuresCol='selectedFeatures', labelCol='label')
pipeline_pct = Pipeline(stages=[selector_pct, lr_pct])

model_pct = pipeline_pct.fit(train)

predictions_pct = model_pct.transform(validation)

selector_model_pct = model_pct.stages[0]
selected_indices_pct = selector_model_pct.selectedFeatures

print(f"Original number of features: 61")
print(f"Number of features selected (top 25%): {len(selected_indices_pct)}")
print(f"Selected feature indices: {selected_indices_pct}")

Original number of features: 61
Number of features selected (top 25%): 15
Selected feature indices: [1, 5, 7, 12, 13, 14, 16, 18, 25, 26, 28, 29, 31, 32, 35]


# **Model Performance: Accuracy and AUC**

The performance of all four logistic regression models was evaluated using two complementary metrics: **accuracy** and **Area Under the Curve (AUC)**.  

Accuracy measures the proportion of correct predictions over the total number of instances. However, it depends on a fixed decision threshold (typically 0.5 in logistic regression), and therefore only reflects model performance at that specific cutoff.  

The AUC, derived from the Receiver Operating Characteristic (ROC) curve, evaluates the model's discriminative ability across all possible thresholds. It represents the probability that a randomly chosen positive instance receives a higher predicted score than a randomly chosen negative instance. In this way, AUC captures the overall ranking quality of the model's predictions and is less sensitive to the choice of threshold. Even in the case of balanced classes, AUC is valuable for comparing models because it highlights differences in their ability to distinguish positive from negative cases that might not be apparent from accuracy alone.  

Together, these metrics provide a more complete assessment of model performance, allowing for an informed comparison of the baseline logistic regression without feature selection and the three feature selection strategies (FPR, FWE, and top 25% percentile).
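
The contrast between the two metrics can be illustrated with a small pure-Python sketch. The labels and scores below are hypothetical, and `auc` uses the pairwise-ranking definition given above rather than the evaluator's own computation, so it is only meant to show the idea.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of correct predictions at a given decision threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities for six clients
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.6, 0.55, 0.4, 0.3, 0.1]

print(accuracy(labels, scores, threshold=0.5))   # 4 of 6 correct
print(accuracy(labels, scores, threshold=0.35))  # 5 of 6 correct
print(auc(labels, scores))                       # 8 of 9 pairs ranked correctly
```

Moving the threshold changes accuracy, while the AUC stays the same because it depends only on how the scores are ordered.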


In [21]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
import pandas as pd

# Accuracy evaluator
evaluator_acc = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)

# Calculate accuracy for each model
acc_no_fs = evaluator_acc.evaluate(predictions_no_fs)
acc_fpr = evaluator_acc.evaluate(predictions_fpr)
acc_fwe = evaluator_acc.evaluate(predictions_fwe)
acc_pct = evaluator_acc.evaluate(predictions_pct)

# AUC evaluator
evaluator_auc = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"
)

# Calculate AUC for each model
auc_no_fs = evaluator_auc.evaluate(predictions_no_fs)
auc_fpr = evaluator_auc.evaluate(predictions_fpr)
auc_fwe = evaluator_auc.evaluate(predictions_fwe)
auc_pct = evaluator_auc.evaluate(predictions_pct)

# Create a summary table
results_df = pd.DataFrame({
    "Model": ["No Feature Selection", "FPR", "FWE", "Top 25% Percentile"],
    "Accuracy": [acc_no_fs, acc_fpr, acc_fwe, acc_pct],
    "AUC": [auc_no_fs, auc_fpr, auc_fwe, auc_pct]
})

# Display results
results_df


| Model | Accuracy | AUC |
|---|---|---|
| No Feature Selection | 0.82389 | 0.906912 |
| FPR | 0.823418 | 0.906932 |
| FWE | 0.826251 | 0.907998 |
| Top 25% Percentile | 0.653919 | 0.715262 |


The table summarizes the performance of the logistic regression models under different feature selection strategies.  

The baseline model without feature selection achieves an accuracy of 0.824 and an AUC of 0.907, providing a reference for comparison. The FPR-based selection results in virtually identical performance, indicating that the less conservative selection retains most relevant features without degrading model quality.  

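The FWE strategy yields marginally higher values for both metrics, with an accuracy of 0.826 and an AUC of 0.908. The gains over the baseline are small (about 0.002 in accuracy and 0.001 in AUC) and come from a single validation split, so they suggest, rather than demonstrate, that a more conservative selection can help by removing marginally relevant features.  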

In contrast, the top 25% percentile strategy significantly reduces performance, with accuracy dropping to 0.654 and AUC to 0.715. The 15 retained indices are all below 53, so none of the eight numerical columns at positions 53 to 60 (including `duration` at position 55) survives the cut. Discarding these variables removes important information, impairing both the overall correctness of predictions and the model's ability to rank positive instances above negatives.  

Overall, the results highlight that moderate or conservative feature selection (FPR or FWE) can maintain or slightly improve model performance, while overly aggressive reduction (top 25%) can be detrimental.


##  **Extracting and Interpreting Selected Feature Names**

This section extracts the names of the features selected by the FWE-based feature selection strategy (the strategy with the best accuracy and the best AUC).

Using the metadata stored in the vectorized feature column, the function retrieves the original feature names corresponding to the indices selected by the model. This allows identifying which features the selection process deemed most relevant for the logistic regression model.


In [22]:
# Function to get original feature names from a VectorAssembler column
def get_feature_names(df, features_col="features"):
    meta = df.schema[features_col].metadata
    attrs = meta.get("ml_attr", {}).get("attrs", {})

    all_attrs = []
    for t in ["numeric", "binary", "nominal"]:
        all_attrs += attrs.get(t, [])

    all_attrs.sort(key=lambda x: x["idx"])
    return [x["name"] for x in all_attrs]

# Retrieve all feature names
all_feature_names = get_feature_names(train)

# Retrieve indices selected by the FWE model
selected_indices = model_fwe.stages[0].selectedFeatures

# Display selected features with their positions
print(f"Selected Features by FWE ({len(selected_indices)}): ")
for idx in selected_indices:
    name = all_feature_names[idx]
    print(f"Position {idx}: {name}")


Selected Features by FWE (37): 
Position 13: marital_vec_married
Position 19: education_vec_primary
Position 4: job_vec_services
Position 16: marital_vec___unknown
Position 55: duration
Position 28: loan_vec_no
Position 29: loan_vec_yes
Position 54: balance
Position 14: marital_vec_single
Position 50: month_vec_mar
Position 35: poutcome_vec_unknown
Position 51: month_vec_dec
Position 37: poutcome_vec_success
Position 57: contacted_before
Position 45: month_vec_apr
Position 1: job_vec_blue-collar
Position 17: education_vec_secondary
Position 18: education_vec_tertiary
Position 12: job_vec___unknown
Position 9: job_vec_entrepreneur
Position 38: poutcome_vec_other
Position 49: month_vec_sep
Position 56: campaign
Position 47: month_vec_oct
Position 25: housing_vec_no
Position 46: month_vec_feb
Position 59: previous
Position 32: contact_vec_unknown
Position 42: month_vec_jul
Position 40: month_vec_may
Position 7: job_vec_student
Position 8: job_vec_unemployed
Position 58: days_since_last_co

The FWE (Family-Wise Error) strategy retained 37 features, reducing noise while keeping the variables most relevant to customer behavior. These features can be grouped into five main dimensions:

1. Immediate Engagement  
   - Feature: `duration`  
   - Interpretation: Call duration emerges as a key predictor. Longer calls generally indicate that meaningful conversations took place and the client was engaged, whereas very short calls suggest low interest or immediate disengagement.

2. Previous Campaign History  
   - Features: `poutcome_success`, `poutcome_unknown`, `previous`, `contacted_before`, `days_since_last_contact`  
   - Interpretation: Past interactions strongly influence future behavior. Clients who accepted previous offers are more likely to subscribe again. Additionally, the recency and frequency of contact help differentiate between clients who are currently engaged and those who are "cold" leads.

3. Financial Stability & Liquidity  
   - Features: `balance`, `housing_yes/no`, `loan_yes/no`  
   - Interpretation: The model takes into account both the client’s available funds and existing debts. `balance` reflects savings, while `housing` and `loan` indicators capture liabilities, which can affect disposable income and the likelihood of investing.

4. Seasonality  
   - Features: Specific months (`mar`, `dec`, `sep`, `oct`, `apr`, `feb`, `may`, `jul`) and `day`  
   - Interpretation: Selected months indicate seasonal patterns in subscriptions, likely tied to economic cycles, fiscal quarters, or periods of higher liquidity, such as bonuses or year-end savings.

5. Demographic Profiles  
   - Features: `job_retired`, `job_student`, `job_blue-collar`, `marital_single/married`, `education_primary/tertiary`  
   - Interpretation: Certain socioeconomic segments appear more likely to subscribe. `job_retired` may indicate clients seeking low-risk investments, `job_student` first-time account holders, and marital or education status reflects different financial priorities and life stages.
