# Project 7: Feature Optimization for Classification Problems using Recursive Feature Elimination (RFE)
**Project Title:** Feature Optimization for Classification Problems using Recursive Feature Elimination (RFE)

**Project Goal:** To enhance the performance and interpretability of a classification model by optimizing its feature set. This project will leverage Recursive Feature Elimination (RFE) on the Titanic dataset to identify the most impactful features for predicting survival, followed by a comparative analysis of model performance with and without the reduced feature set.

**Objectives:**

1. Dataset Preparation:
   * Utilize a preprocessed and feature-engineered version of the Titanic dataset (as prepared in previous projects, if applicable).
   * Ensure the dataset is ready for classification modeling, with features (X) and the target variable (y, 'Survived') clearly defined.
2. Recursive Feature Elimination (RFE) Implementation:
   * Apply Recursive Feature Elimination (RFE) to the dataset.
   * Select an appropriate base estimator (e.g., Logistic Regression or a tree-based model) for RFE.
   * Determine the optimal number of features to select using RFE with cross-validation. This will involve iterating RFE to find the feature subset that maximizes a chosen performance metric (e.g., accuracy, F1-score).
3. Model Training and Performance Comparison:
   * Baseline Model: Train a classification model (e.g., Logistic Regression or Decision Tree, consistent with RFE's estimator) using all available features from the prepared dataset. Evaluate its performance on a held-out test set using relevant classification metrics (accuracy, precision, recall, F1-score).
   * Optimized Model: Train the same classification model using only the features selected by RFE. Evaluate its performance on the same held-out test set using the identical set of classification metrics.
4. Results Analysis and Documentation:
   * Present the list of features selected by RFE and the rationale behind their selection.
   * Provide a comparative analysis of the performance metrics between the baseline model (all features) and the RFE-optimized model (selected features).
   * Discuss the implications of feature optimization on model performance, complexity, and interpretability.

**Tools/Libraries:**

* Python 3.x
* Pandas (for data manipulation)
* Scikit-learn (for RFE: RFE, RFECV; classification models: LogisticRegression, DecisionTreeClassifier; and evaluation metrics: accuracy_score, precision_score, recall_score, f1_score, classification_report)
* NumPy (for numerical operations, if needed)

**Deliverables:**

* A well-structured and commented Python script (.py or .ipynb) that implements the RFE process, trains both models, and performs the comparative analysis.
* Clear output displaying:
   * The list of features selected by RFE.
   * The classification performance metrics for the baseline model (using all features).
   * The classification performance metrics for the RFE-optimized model (using selected features).
* A written analysis (within comments or markdown cells) interpreting the results, discussing the impact of feature reduction on model performance and potential benefits (e.g., faster training, better generalization, reduced overfitting).

**Success Criteria:**

* RFE is correctly applied to select an optimal subset of features.
* Both the baseline model and the RFE-optimized model are trained and evaluated effectively.
* A comprehensive comparison of their performance metrics is provided.
* The analysis clearly articulates the benefits or drawbacks of feature optimization in this context.
* The code is robust, readable, and reproducible.

In [None]:
# Data handling
import numpy as np
import pandas as pd

# Modeling, feature selection, and evaluation
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import cross_val_score, train_test_split


### Step 1: Dataset Preparation

In [None]:
# Load dataset
df = pd.read_csv("/content/titanic_preprocessed.csv")

In [None]:
df.head()

,Survived,Pclass,Age,Fare,Embarked_Q,Embarked_S,Sex_male,FamilySize,Title_Miss,Title_Mr,Title_Mrs,Title_Rare
0,0,3,-0.565736,0.014151,0.0,1.0,1.0,2,0.0,1.0,0.0,0.0
1,1,1,0.663861,0.139136,0.0,0.0,0.0,2,0.0,0.0,1.0,0.0
2,1,3,-0.258337,0.015469,0.0,1.0,0.0,1,1.0,0.0,0.0,0.0
3,1,1,0.433312,0.103644,0.0,1.0,0.0,2,0.0,0.0,1.0,0.0
4,0,3,0.433312,0.015713,0.0,1.0,1.0,1,0.0,1.0,0.0,0.0


In [None]:
# Define X and y
X = df.drop("Survived", axis=1)
y = df["Survived"]

In [None]:
# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

### Step 2: Recursive Feature Elimination (RFE) Implementation

I use Logistic Regression as the base estimator.
RFE refits this model repeatedly and uses its coefficients to decide which feature to drop at each step.

In [None]:
# Create Logistic Regression model
logreg = LogisticRegression(max_iter=1000, random_state=42)


In [None]:
# Try RFE with different feature counts

# Store results for comparison
results = {}


In [None]:
# Try selecting different numbers of features
for n in range(1, X_train.shape[1] + 1):
    # Fit RFE on the training set to keep the top-n features
    rfe = RFE(estimator=logreg, n_features_to_select=n)
    rfe.fit(X_train, y_train)

    # Cross-validation score with selected features.
    # Note: the features were chosen using all of X_train, so this
    # score can be slightly optimistic.
    X_train_rfe = X_train[X_train.columns[rfe.support_]]
    scores = cross_val_score(logreg, X_train_rfe, y_train, cv=5, scoring='accuracy')
    results[n] = scores.mean()


I loop over different feature counts (1 to all features).

For each number of features:

* Run RFE,

* Train the model,

* Test with cross-validation,

* Store the average accuracy.

Finally, I print how accuracy changes with feature count.

In [None]:
# Show results
for n, score in results.items():
    print(f"{n} features -> CV Accuracy: {score:.4f}")

1 features -> CV Accuracy: 0.7852
2 features -> CV Accuracy: 0.7851
3 features -> CV Accuracy: 0.7950
4 features -> CV Accuracy: 0.8034
5 features -> CV Accuracy: 0.7838
6 features -> CV Accuracy: 0.7838
7 features -> CV Accuracy: 0.8048
8 features -> CV Accuracy: 0.8048
9 features -> CV Accuracy: 0.8203
10 features -> CV Accuracy: 0.8203
11 features -> CV Accuracy: 0.8189


**Select the best feature count**

In [None]:
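# Best number of features.
# If several counts share the top score, max() returns the first one
# inserted, i.e. the smallest feature count.
best_n = max(results, key=results.get)
print("Best number of features selected by RFE:", best_n)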

Best number of features selected by RFE: 9
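
In the printed scores, 9 and 10 features share the top value (0.8203) after rounding, and several smaller subsets are close behind. A possible refinement, sketched here under assumptions, is to pick the smallest feature count whose score is within a chosen tolerance of the best. The helper `smallest_near_best` and the dictionary `results_demo` are hypothetical names; the scores are copied from the rounded output above, so the unrounded values in `results` may differ slightly.

```python
# CV accuracies copied from the printed output above (rounded to 4 decimals).
results_demo = {
    1: 0.7852, 2: 0.7851, 3: 0.7950, 4: 0.8034, 5: 0.7838, 6: 0.7838,
    7: 0.8048, 8: 0.8048, 9: 0.8203, 10: 0.8203, 11: 0.8189,
}

def smallest_near_best(scores, tolerance=0.0):
    """Return the smallest feature count whose score is within `tolerance` of the best."""
    best_score = max(scores.values())
    candidates = [n for n, s in scores.items() if s >= best_score - tolerance]
    return min(candidates)

n_exact = smallest_near_best(results_demo)                  # exact ties only
n_loose = smallest_near_best(results_demo, tolerance=0.02)  # accept a 2-point drop
print(n_exact, n_loose)
```

With these rounded scores the exact-tie rule returns 9, matching the notebook, while a tolerance of 0.02 would accept the 4-feature subset. The tolerance is a judgment call, not something the notebook prescribes.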


In [None]:
# Run RFE again with the best number
rfe_final = RFE(estimator=logreg, n_features_to_select=best_n)
rfe_final.fit(X_train, y_train)

In [None]:
# Get selected features
selected_features = X_train.columns[rfe_final.support_]
print("Selected Features:", list(selected_features))

Selected Features: ['Pclass', 'Age', 'Fare', 'Embarked_S', 'Sex_male', 'FamilySize', 'Title_Mr', 'Title_Mrs', 'Title_Rare']


* Here I find which feature count gave the highest accuracy.

* Then I run RFE again with that number of features.

* Finally, I print out the names of the selected features.

### Step 3: Model Training and Performance Comparison
* Train a Baseline Model (all features).

* Train an Optimized Model (only RFE-selected features).

* Compare their performance using accuracy, precision, recall, F1.

**Train Baseline Model (all features)**

In [None]:
# Baseline model: Logistic Regression with ALL features
logreg.fit(X_train, y_train)
y_pred_base = logreg.predict(X_test)

In [None]:
# Evaluate baseline
baseline_acc = accuracy_score(y_test, y_pred_base)
baseline_prec = precision_score(y_test, y_pred_base)
baseline_rec = recall_score(y_test, y_pred_base)
baseline_f1 = f1_score(y_test, y_pred_base)

In [None]:
print("Baseline Model Performance (All Features):")
print("Accuracy:", baseline_acc)
print("Precision:", baseline_prec)
print("Recall:", baseline_rec)
print("F1 Score:", baseline_f1)


Baseline Model Performance (All Features):
Accuracy: 0.8100558659217877
Precision: 0.7777777777777778
Recall: 0.7101449275362319
F1 Score: 0.7424242424242424


This is my baseline model trained on all available features.
I check its performance on the test set using accuracy, precision, recall, and F1-score.
This gives me a reference point to compare against.

**Train Optimized Model (RFE-selected features)**

In [None]:
# Optimized model: Logistic Regression with SELECTED features
logreg.fit(X_train[selected_features], y_train)
y_pred_opt = logreg.predict(X_test[selected_features])

In [None]:
# Evaluate optimized
opt_acc = accuracy_score(y_test, y_pred_opt)
opt_prec = precision_score(y_test, y_pred_opt)
opt_rec = recall_score(y_test, y_pred_opt)
opt_f1 = f1_score(y_test, y_pred_opt)

In [None]:
print("Optimized Model Performance (RFE-selected Features):")
print("Accuracy:", opt_acc)
print("Precision:", opt_prec)
print("Recall:", opt_rec)
print("F1 Score:", opt_f1)


Optimized Model Performance (RFE-selected Features):
Accuracy: 0.8044692737430168
Precision: 0.765625
Recall: 0.7101449275362319
F1 Score: 0.7368421052631579


This is my optimized model using only the features selected by RFE.
I evaluate it on the same test set using the same metrics.
This shows me whether removing unimportant features improves or worsens performance.
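
Because the baseline and optimized cells repeat the same fit-predict-score steps, they could be folded into one helper. The following is a minimal sketch, assuming a hypothetical function `evaluate` and synthetic data with made-up column names `f0` to `f5`; it uses `sklearn.base.clone` so the shared estimator is never refit in place.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

def evaluate(estimator, X_tr, y_tr, X_te, y_te, columns):
    """Fit a fresh copy of `estimator` on `columns` and return test-set metrics."""
    model = clone(estimator).fit(X_tr[columns], y_tr)
    y_pred = model.predict(X_te[columns])
    return {
        "accuracy": accuracy_score(y_te, y_pred),
        "precision": precision_score(y_te, y_pred),
        "recall": recall_score(y_te, y_pred),
        "f1": f1_score(y_te, y_pred),
    }

# Synthetic stand-in data with hypothetical column names f0..f5.
X_arr, y_arr = make_classification(n_samples=200, n_features=6, random_state=42)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_arr, test_size=0.2, random_state=42, stratify=y_arr
)

base = LogisticRegression(max_iter=1000)
all_cols = list(X_demo.columns)
subset_cols = all_cols[:3]  # hypothetical subset, standing in for selected_features

metrics_all = evaluate(base, X_tr, y_tr, X_te, y_te, all_cols)
metrics_subset = evaluate(base, X_tr, y_tr, X_te, y_te, subset_cols)
print(metrics_all)
print(metrics_subset)
```

In the notebook, the same call with `logreg`, `X_train`, `X_test`, and `selected_features` would produce both sets of metrics without reusing a fitted model object.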

In [None]:
print("Performance Comparison:")
print(f"Baseline (All features):   Accuracy={baseline_acc:.4f}, Precision={baseline_prec:.4f}, Recall={baseline_rec:.4f}, F1={baseline_f1:.4f}")
print(f"Optimized (RFE features): Accuracy={opt_acc:.4f}, Precision={opt_prec:.4f}, Recall={opt_rec:.4f}, F1={opt_f1:.4f}")


Performance Comparison:
Baseline (All features):   Accuracy=0.8101, Precision=0.7778, Recall=0.7101, F1=0.7424
Optimized (RFE features): Accuracy=0.8045, Precision=0.7656, Recall=0.7101, F1=0.7368
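
The two printed lines can also be laid out as a small table with an explicit difference column. This sketch hard-codes the metrics reported in the result cells above; the variable name `comparison` and the column labels are illustrative choices.

```python
import pandas as pd

# Metrics copied from the two result cells above.
comparison = pd.DataFrame(
    {
        "baseline": {
            "accuracy": 0.8100558659217877,
            "precision": 0.7777777777777778,
            "recall": 0.7101449275362319,
            "f1": 0.7424242424242424,
        },
        "rfe": {
            "accuracy": 0.8044692737430168,
            "precision": 0.765625,
            "recall": 0.7101449275362319,
            "f1": 0.7368421052631579,
        },
    }
)

# Negative delta means the RFE model scored lower than the baseline.
comparison["delta"] = comparison["rfe"] - comparison["baseline"]
print(comparison.round(4))
```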


### Step 4: Results Analysis and Documentation

In [None]:
print("Final Selected Features by RFE:", list(selected_features))


Final Selected Features by RFE: ['Pclass', 'Age', 'Fare', 'Embarked_S', 'Sex_male', 'FamilySize', 'Title_Mr', 'Title_Mrs', 'Title_Rare']


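These are the features that RFE kept as most important for predicting survival. RFE refits the Logistic Regression model repeatedly and, at each step, drops the feature with the smallest absolute coefficient. The ranking therefore reflects coefficient magnitude, which depends on how each feature is scaled, rather than a direct measure of predictive power. Here the two features dropped were Embarked_Q and Title_Miss.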
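
To see the elimination order rather than only the kept set, a fitted `RFE` object exposes `ranking_` (1 for kept features, larger numbers for features removed earlier) and `estimator_` (the model refit on the kept features). The sketch below is run on synthetic data with hypothetical column names `f0` to `f5`, so its output says nothing about the Titanic features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 3 informative and 3 noise features.
X_arr, y_arr = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rfe_demo = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_demo.fit(X_demo, y_arr)

# ranking_ == 1 marks kept features; larger values were eliminated earlier.
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns, name="rank").sort_values()

# estimator_ is the model refit on the kept features only.
coefs = pd.Series(
    rfe_demo.estimator_.coef_[0],
    index=X_demo.columns[rfe_demo.support_],
    name="coef",
)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(ranking)
print(coefs)
```

Applied to `rfe_final` and `X_train.columns`, the same two series would show which feature was dropped first and the sign and size of each remaining coefficient.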

**Performance Comparison:**

**Baseline Model (All Features):**

* Accuracy = 0.8101

* Precision = 0.7778

* Recall = 0.7101

* F1-score = 0.7424

**Optimized Model (RFE Features):**

* Accuracy = 0.8045

* Precision = 0.7656

* Recall = 0.7101

* F1-score = 0.7368



The baseline model with all features performed slightly better than the optimized model: accuracy and F1-score each dropped by about 0.56 percentage points (0.8101 vs. 0.8045 and 0.7424 vs. 0.7368).

However, the optimized model uses 9 features instead of 11 (Embarked_Q and Title_Miss were dropped), which makes it slightly simpler and easier to interpret. Any training-time saving is negligible at this scale.

Both models achieved identical recall (0.7101), so they identify survivors equally well on this test set.

Precision fell from 0.7778 to 0.7656, which accounts for the small drop in F1-score.
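
To put the gap in perspective, the reported fractions can be reproduced from confusion-matrix counts. The notebook does not print a confusion matrix, so the counts below are inferred (an assumption): they are a set of integers consistent with the reported metrics, and they suggest the two models differ by a single extra false positive on a 179-row test set. The helper `metrics_from_counts` is a hypothetical name.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts inferred from the reported fractions (assumption, not notebook output).
baseline_counts = dict(tp=49, fp=14, fn=20, tn=96)
optimized_counts = dict(tp=49, fp=15, fn=20, tn=95)

baseline_metrics = metrics_from_counts(**baseline_counts)
optimized_metrics = metrics_from_counts(**optimized_counts)

for name, m in [("baseline", baseline_metrics), ("optimized", optimized_metrics)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
```

If these counts are right, the difference between the models is one prediction, which is too small to rank them with any confidence.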

**Final Note:**

Feature optimization with RFE did not improve the Titanic classification model's test performance: it cost about half a percentage point of accuracy while removing two of the eleven features. In real-world projects, such simplification is often valuable when model interpretability and efficiency are priorities.