## Package Installation

This cell installs the required packages, including Qiskit for quantum computations and scikit-learn for classical machine learning tasks.


In [None]:
!pip install pandas numpy matplotlib scikit-learn qiskit==0.45.1 qiskit-terra==0.45.1 qiskit-aer==0.12.2 qiskit-algorithms==0.2.1



## Imports and Dataset Loading

This cell imports the libraries needed for the classical and quantum steps. The Heart Disease dataset itself is loaded in the preprocessing cell that follows.


In [None]:
# === 🔢 Data Manipulation & Analysis ===
import pandas as pd                     # DataFrame handling and CSV loading
import numpy as np                      # Numerical computations and array operations

# === 📊 Machine Learning (Classical Baseline) ===
from sklearn.feature_selection import SelectKBest, f_classif  # Feature scoring (ANOVA F-test)
from sklearn.linear_model import LogisticRegression           # Classifier for performance comparison
from sklearn.metrics import accuracy_score                    # Evaluation metric

# === 🧼 Data Preprocessing ===
from sklearn.preprocessing import StandardScaler              # Normalize feature values
from sklearn.model_selection import train_test_split          # Split data into train/test sets

# === ⚛️ Quantum Optimization (QAOA & QUBO modeling) ===
from qiskit.primitives import Sampler                         # Backend sampler to evaluate quantum circuits
from qiskit.algorithms.minimum_eigensolvers import QAOA       # Quantum Approximate Optimization Algorithm
from qiskit.algorithms.optimizers import COBYLA               # Classical optimizer used in QAOA
from qiskit.quantum_info import SparsePauliOp                 # Efficient representation of Ising Hamiltonians
from qiskit.opflow import PauliSumOp                          # Wrapper for compatibility with current QAOA API



## Data Preprocessing

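We load `heart.csv`, separate the features from the `target` column, standardize each feature with `StandardScaler`, and split the data into training (70%) and test (30%) sets. Note that the scaler is fitted on the full dataset before the split.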
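
A common variant of this step estimates the scaling statistics from the training rows only, so that no information from the test set reaches the model. The sketch below illustrates the idea with plain NumPy on synthetic data; the names `X_demo`, `X_train_demo`, and `X_test_demo` are placeholders and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for a feature matrix: 100 samples, 3 features on different scales
X_demo = rng.normal(loc=[50.0, 130.0, 1.0], scale=[9.0, 17.0, 1.2], size=(100, 3))

# Shuffle row indices and hold out 30% of them as the test split
indices = rng.permutation(len(X_demo))
n_test = int(0.3 * len(X_demo))
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Estimate mean and standard deviation from the training rows only
train_mean = X_demo[train_idx].mean(axis=0)
train_std = X_demo[train_idx].std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same training statistics to both splits
X_train_demo = (X_demo[train_idx] - train_mean) / train_std
X_test_demo = (X_demo[test_idx] - train_mean) / train_std
```

With scikit-learn, the same pattern would be `scaler.fit_transform(...)` on the training split followed by `scaler.transform(...)` on the test split.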


In [None]:
df = pd.read_csv("heart.csv")

X = df.drop(columns=["target"])
y = df["target"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)


In [None]:
df.head()

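|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |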


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [None]:
df.describe()

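|       | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|-------|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |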


## Classical Feature Ranking using ANOVA F-score

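We use `SelectKBest` with the ANOVA F-test (`f_classif`) to rank the features. For each subset size k from 1 to 13, we keep the k highest-scoring features, train a logistic regression on them, and record the k that gives the best test-set accuracy. The resulting subset is the classical baseline for comparison with the quantum method.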
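
To make the scoring step concrete, the sketch below computes the one-way ANOVA F-statistic per feature by hand, which is the quantity `f_classif` reports. It uses NumPy on synthetic data; the helper `anova_f_scores` and the demo arrays are illustrative names, not part of the notebook.

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F-statistic of each column of X against class labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for label in classes:
        group = X[y == label]
        group_mean = group.mean(axis=0)
        # Spread of the class means around the overall mean
        ss_between += len(group) * (group_mean - overall_mean) ** 2
        # Spread of the samples around their own class mean
        ss_within += ((group - group_mean) ** 2).sum(axis=0)
    df_between = len(classes) - 1
    df_within = X.shape[0] - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 50)                      # two balanced classes
informative = y_demo * 2.0 + rng.normal(size=100)   # class means differ by 2
noise = rng.normal(size=100)                        # unrelated to the label
X_demo = np.column_stack([noise, informative])

scores = anova_f_scores(X_demo, y_demo)
ranking = np.argsort(scores)[::-1]                  # feature indices, best first
```

A feature whose class means are far apart relative to its within-class spread gets a large F-score, which is why the informative column ranks first here.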


In [None]:
X_train_all = X_train
X_test_all = X_test
all_feature_names = X.columns.tolist()

best_acc = 0
best_k = 0
best_features_classic = []

for k in range(1, len(all_feature_names) + 1):
    selector = SelectKBest(score_func=f_classif, k=k)
    X_train_k = selector.fit_transform(X_train_all, y_train)
    X_test_k = selector.transform(X_test_all)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train_k, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test_k))

    if acc > best_acc:
        best_acc = acc
        best_k = k
        selected_indices = selector.get_support(indices=True)
        best_features_classic = [all_feature_names[i] for i in selected_indices]

print(f"📊 Best Classical Accuracy (Unconstrained): {best_acc:.4f} using {best_k} features")
print("🔍 Selected Features (Classical):", best_features_classic)


cp: F-score = 48.60
thalach: F-score = 40.93
exang: F-score = 57.70
oldpeak: F-score = 48.51
ca: F-score = 46.81
thal: F-score = 37.69


## QUBO Modeling

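We model feature selection as maximizing the total normalized F-score of the selected features, using one binary variable per feature (1 if the feature is selected, 0 otherwise). Written as a minimization, this is a Quadratic Unconstrained Binary Optimization (QUBO) problem. In this notebook only the linear terms are populated; the quadratic dictionary is left empty.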
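
As a worked form of this objective (the symbols $s_i$, $\tilde{s}_i$, $x_i$, and $Z_i$ are our notation, not names from the code), the linear QUBO and its Ising counterpart can be written as follows, using the standard substitution $x_i = (1 - Z_i)/2$:

```latex
\min_{x \in \{0,1\}^n} \; -\sum_{i=1}^{n} \tilde{s}_i \, x_i ,
\qquad \tilde{s}_i = \frac{s_i}{\max_j s_j}

x_i = \frac{1 - Z_i}{2}
\;\;\Longrightarrow\;\;
-\tilde{s}_i \, x_i = -\frac{\tilde{s}_i}{2} + \frac{\tilde{s}_i}{2} \, Z_i
```

Each feature therefore contributes a $Z_i$ term with coefficient $\tilde{s}_i/2$ and a constant offset of $-\tilde{s}_i/2$. Assuming every score is strictly positive, the exact minimizer of this purely linear objective sets every $x_i = 1$, so any smaller subset returned by an approximate solver reflects the approximation rather than the objective.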

In [None]:
# 📊 Use all features
full_scores = f_classif(X_train, y_train)[0]
n = len(full_scores)
normalized_scores = full_scores / np.max(full_scores)

# Build QUBO objective: maximize total feature score by minimizing -score_i * x_i
linear = {i: -normalized_scores[i] for i in range(n)}
quadratic = {}

# Convert to Ising operator using the substitution x_i = (1 - Z_i) / 2
def qubo_to_ising(linear, quadratic):
    pauli_dict = {}
    offset = 0

    def add_term(qubits, coeff):
        # Qiskit Pauli labels are little-endian: qubit 0 is the rightmost character
        pauli = ['I'] * n
        for q in qubits:
            pauli[q] = 'Z'
        key = ''.join(reversed(pauli))
        pauli_dict[key] = pauli_dict.get(key, 0) + coeff

    # Linear term: c * x_i = c/2 - (c/2) * Z_i
    for i, coeff in linear.items():
        add_term([i], -0.5 * coeff)
        offset += 0.5 * coeff

    # Quadratic term (i != j): c * x_i * x_j = (c/4) * (1 - Z_i - Z_j + Z_i Z_j)
    for (i, j), coeff in quadratic.items():
        add_term([i, j], 0.25 * coeff)
        add_term([i], -0.25 * coeff)
        add_term([j], -0.25 * coeff)
        offset += 0.25 * coeff

    pauli_dict["I" * n] = pauli_dict.get("I" * n, 0) + offset
    return SparsePauliOp.from_list([(k, float(v)) for k, v in pauli_dict.items()])

feature_map = {i: name for i, name in enumerate(X.columns)}
ising_op = qubo_to_ising(linear, quadratic)
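
The `quadratic` dictionary is empty in this notebook, so the objective places no limit on how many features are selected. One possible extension, shown here only as a sketch, adds a penalty `penalty * (sum(x) - target_size)**2` so that subsets of a chosen size are favored. The function names, `target_size`, and `penalty` are hypothetical and not part of the notebook. The example uses a brute-force search over four toy scores to check the construction.

```python
from itertools import product

def build_size_constrained_qubo(scores, target_size, penalty):
    """QUBO for: minimize -sum(s_i x_i) + penalty * (sum(x_i) - target_size)^2."""
    n = len(scores)
    # Expanding the square with x_i^2 = x_i gives a linear shift of penalty * (1 - 2k)
    linear = {i: -scores[i] + penalty * (1 - 2 * target_size) for i in range(n)}
    # ... and a coupling of 2 * penalty for every pair i < j
    quadratic = {(i, j): 2.0 * penalty for i in range(n) for j in range(i + 1, n)}
    offset = penalty * target_size ** 2
    return linear, quadratic, offset

def qubo_energy(bits, linear, quadratic, offset=0.0):
    """Evaluate the QUBO objective for a tuple of 0/1 values."""
    energy = offset
    energy += sum(c * bits[i] for i, c in linear.items())
    energy += sum(c * bits[i] * bits[j] for (i, j), c in quadratic.items())
    return energy

demo_scores = [1.0, 0.2, 0.8, 0.5]  # toy normalized scores
linear_demo, quadratic_demo, offset_demo = build_size_constrained_qubo(
    demo_scores, target_size=2, penalty=2.0
)

# Exhaustive search is feasible for small n and gives the exact minimizer
best_bits = min(
    product([0, 1], repeat=len(demo_scores)),
    key=lambda b: qubo_energy(b, linear_demo, quadratic_demo, offset_demo),
)
best_energy = qubo_energy(best_bits, linear_demo, quadratic_demo, offset_demo)
```

With scores normalized to at most 1, a penalty above 1 is enough for the minimizer to have exactly `target_size` features, since adding or dropping one feature then costs more than any single score can repay. Here the minimizer picks the two highest-scoring features.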


## Running QAOA (Quantum Approximate Optimization Algorithm)

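We run QAOA with one layer (`reps=1`, i.e. p = 1) to approximately minimize the Ising form of the QUBO. Qiskit's `Sampler` primitive evaluates the parameterized circuits, and the classical optimizer COBYLA (up to 100 iterations) tunes the circuit parameters. The operator is wrapped in `PauliSumOp` for compatibility with the QAOA API. The most probable bitstring of the final state is taken as the selected feature subset.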

In [None]:
sampler = Sampler()
optimizer = COBYLA(maxiter=100)
qaoa = QAOA(sampler=sampler, optimizer=optimizer, reps=1)

result = qaoa.compute_minimum_eigenvalue(operator=PauliSumOp(ising_op))

# Extract most probable bitstring
bitstring_probabilities = result.eigenstate.binary_probabilities()
most_probable_bitstring = max(bitstring_probabilities, key=bitstring_probabilities.get)

selected_indices = [i for i, bit in enumerate(reversed(most_probable_bitstring)) if bit == "1"]
selected_features_qaoa = [feature_map[i] for i in selected_indices]

print("🧠 QAOA Selected Features:", selected_features_qaoa)



🧠 QAOA Selected Features: ['cp', 'thalach', 'exang', 'oldpeak', 'ca', 'thal']
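
The decoding step in the cell above reverses the bitstring before reading it because Qiskit writes bitstrings with qubit 0 as the rightmost character. The standalone sketch below shows that mapping on a toy example; `decode_selection` and `demo_names` are illustrative names, not part of the notebook.

```python
def decode_selection(bitstring, feature_names):
    """Map a little-endian bitstring (qubit 0 rightmost) to selected feature names."""
    return [
        feature_names[i]
        for i, bit in enumerate(reversed(bitstring))
        if bit == "1"
    ]

demo_names = ["age", "sex", "cp", "trestbps"]
# "0101" read right to left: qubit 0 = 1, qubit 1 = 0, qubit 2 = 1, qubit 3 = 0
selected = decode_selection("0101", demo_names)
```

Reading the string left to right instead would select `sex` and `trestbps`, so getting the order wrong silently changes which features are reported.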


## Performance Evaluation (Logistic Regression)

We compare the performance of the two feature sets (classical and quantum) using a logistic regression model, and measure their accuracy on the test set.

In [None]:
# Classical Feature Selection over the Full Dataset (varying number of features)
all_feature_names = X.columns.tolist()
best_acc = 0
best_k = 0
best_features_classic = []

# Evaluate all feature subset sizes from 1 to total number of features
for k in range(1, len(all_feature_names) + 1):
    selector = SelectKBest(score_func=f_classif, k=k)
    X_train_k = selector.fit_transform(X_train, y_train)
    X_test_k = selector.transform(X_test)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train_k, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test_k))

    if acc > best_acc:
        best_acc = acc
        best_k = k
        selected_indices = selector.get_support(indices=True)
        best_features_classic = [all_feature_names[i] for i in selected_indices]
        X_train_classic = X_train[:, selected_indices]
        X_test_classic = X_test[:, selected_indices]

# Load features selected by QAOA from previous step
X_train_qaoa = X_train[:, [X.columns.get_loc(f) for f in selected_features_qaoa]]
X_test_qaoa = X_test[:, [X.columns.get_loc(f) for f in selected_features_qaoa]]

# Train and compare both models
clf = LogisticRegression(max_iter=1000)

clf.fit(X_train_classic, y_train)
acc_classic = accuracy_score(y_test, clf.predict(X_test_classic))

clf.fit(X_train_qaoa, y_train)
acc_qaoa = accuracy_score(y_test, clf.predict(X_test_qaoa))

# Print results
print(f"Accuracy (Best Classical - k={best_k}): {acc_classic:.4f}")
print("Selected Features (Classical):", best_features_classic)
print(f"Accuracy (QAOA): {acc_qaoa:.4f}")
print("Selected Features (QAOA):", selected_features_qaoa)


Accuracy (Best Classical - k=5): 0.8462
Selected Features (Classical): ['cp', 'thalach', 'exang', 'oldpeak', 'ca']
Accuracy (QAOA): 0.8352
Selected Features (QAOA): ['cp', 'thalach', 'exang', 'oldpeak', 'ca', 'thal']


## Conclusion

In this project, we compared a classical and a quantum approach to feature selection on the Heart Disease dataset. The classical method used ANOVA F-score ranking combined with logistic regression to identify the best-performing subset of features. On the quantum side, we formulated the feature selection task as a QUBO problem and solved it approximately using QAOA with Qiskit's `Sampler` primitive.

Both methods were evaluated using logistic regression accuracy on a held-out test set. While the classical approach explored all subset sizes to find the optimal number of features, the quantum method selected a subset based on score maximization without explicit size constraints.

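On this single train/test split, the six features chosen by QAOA reached a test accuracy of 0.8352, close to the 0.8462 of the best classical subset (five features; the QAOA subset additionally includes `thal`). This suggests that QAOA can identify feature subsets that perform competitively with classical methods, and illustrates its potential as an alternative approach for combinatorial optimization in feature selection. Classical simulation of QAOA scales poorly with the number of qubits, and real hardware adds noise, but the experiment shows how a quantum algorithm can be slotted into a standard machine learning workflow.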
