<a href="https://colab.research.google.com/github/dedeepya07/TEAM-68/blob/main/newfraud_(1)_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


# Task
Develop a hybrid quantum-classical fraud detection pipeline using Qiskit VQC on the IEEE-CIS Fraud Detection dataset ("train_transaction.csv", "train_identity.csv"). The pipeline should include data loading, preprocessing (handling missing values, encoding, normalization, feature selection), train/test split with sampling to handle class imbalance, training and evaluation of a classical baseline model, setup, training, and evaluation of a Qiskit VQC on a quantum simulator (with comments for switching to hardware), and a comparison of results. The final output should be an end-to-end notebook running within Colab.

In [None]:
!pip install qiskit qiskit-aer qiskit-machine-learning qiskit-algorithms

Collecting qiskit
  Downloading qiskit-2.2.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (12 kB)
Collecting qiskit-aer
  Downloading qiskit_aer-0.17.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.3 kB)
Collecting qiskit-machine-learning
  Downloading qiskit_machine_learning-0.8.4-py3-none-any.whl.metadata (13 kB)
Collecting qiskit-algorithms
  Downloading qiskit_algorithms-0.4.0-py3-none-any.whl.metadata (4.7 kB)
Collecting rustworkx>=0.15.0 (from qiskit)
  Downloading rustworkx-0.17.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting stevedore>=3.0.0 (from qiskit)
  Downloading stevedore-5.5.0-py3-none-any.whl.metadata (2.2 kB)
Collecting qiskit
  Downloading qiskit-1.4.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting scipy>=1.5 (from qiskit)
  Downloading scipy-1.15.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)

## Load data

### Subtask:
Load `train_transaction.csv` and `train_identity.csv` into pandas DataFrames.


**Reasoning**:
Import pandas and load the two CSV files into dataframes as instructed.



In [None]:
import pandas as pd

df_identity = pd.read_csv('/content/drive/MyDrive/ieee-fraud-detection/train_identity.csv')
df_transaction = pd.read_csv('/content/drive/MyDrive/ieee-fraud-detection/train_transaction.csv')

print("train_transaction.csv loaded successfully.")
print("train_identity.csv loaded successfully.")

train_transaction.csv loaded successfully.
train_identity.csv loaded successfully.


## Merge data

### Subtask:
Merge the two DataFrames on `TransactionID`.


**Reasoning**:
Merge the two DataFrames on the `TransactionID` column with an outer join, then display the head and shape of the merged DataFrame.



In [None]:
import pandas as pd
df_merged = pd.merge(df_transaction, df_identity, on='TransactionID', how='outer')
display(df_merged.head())
display(df_merged.shape)

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M


(590540, 434)
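
The merge can be sanity-checked on a toy example. The sketch below uses made-up frames (`tx` and `ident` are illustrative names, not the real files) to show pandas' `validate` and `indicator` options. It assumes each `TransactionID` appears at most once per table.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (illustrative values only)
tx = pd.DataFrame({"TransactionID": [1, 2, 3], "TransactionAmt": [68.5, 29.0, 59.0]})
ident = pd.DataFrame({"TransactionID": [3], "DeviceType": ["mobile"]})

# validate= raises MergeError if a key is duplicated on either side;
# indicator= adds a `_merge` column recording where each row came from.
merged = pd.merge(tx, ident, on="TransactionID", how="left",
                  validate="one_to_one", indicator=True)

# Rows without identity data are tagged "left_only" and carry NaN in identity columns
n_missing_identity = int((merged["_merge"] == "left_only").sum())
print(merged)
print("rows without identity info:", n_missing_identity)
```

With `how='left'` and unique keys, the row count always equals that of the transaction table. `how='outer'` gives the same result only when every identity row has a matching transaction.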

## Handle missing values

### Subtask:
Handle missing values in the merged DataFrame.


**Reasoning**:
Calculate the percentage of missing values for each column in the merged DataFrame.



In [None]:
missing_percentage = df_merged.isnull().sum() / len(df_merged) * 100
display(missing_percentage.sort_values(ascending=False).head())

Unnamed: 0,0
id_24,99.196159
id_25,99.130965
id_07,99.12707
id_08,99.12707
id_21,99.126393


**Reasoning**:
Drop columns with more than 50% missing values and impute missing values in the remaining columns based on their data type.



In [None]:
missing_threshold = 50
cols_to_drop = missing_percentage[missing_percentage > missing_threshold].index
df_merged_cleaned = df_merged.drop(columns=cols_to_drop)

for col in df_merged_cleaned.columns:
    if df_merged_cleaned[col].isnull().any():
        if pd.api.types.is_numeric_dtype(df_merged_cleaned[col]):
            df_merged_cleaned[col] = df_merged_cleaned[col].fillna(df_merged_cleaned[col].median())
        else:
            df_merged_cleaned[col] = df_merged_cleaned[col].fillna(df_merged_cleaned[col].mode()[0])

display(df_merged_cleaned.isnull().sum().sum())

np.int64(0)
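
The drop-then-impute logic can be wrapped in a small function and checked on a toy frame. This is a minimal sketch, not the notebook's exact code: `drop_and_impute` and the toy columns are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def drop_and_impute(df, missing_threshold=50.0):
    """Drop columns above the missing-value threshold (percent), then fill the rest."""
    missing_pct = df.isnull().mean() * 100
    kept = df.loc[:, missing_pct <= missing_threshold].copy()
    for col in kept.columns:
        if not kept[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(kept[col]):
            # Numeric columns: fill with the column median
            kept[col] = kept[col].fillna(kept[col].median())
        else:
            # Non-numeric columns: fill with the most frequent value
            kept[col] = kept[col].fillna(kept[col].mode()[0])
    return kept

toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 50.0],
    "card": ["visa", None, "visa", "discover"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],  # 75% missing, so it is dropped
})
cleaned = drop_and_impute(toy)
print(cleaned)
```

Note that computing medians and modes on the full dataset before the train/test split lets test rows influence the fill values. Fitting the statistics on the training split only is one way to avoid that.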

## Encode categorical features

### Subtask:
Apply label encoding to categorical columns in the `df_merged_cleaned` DataFrame.


**Reasoning**:
Apply label encoding to the categorical columns in `df_merged_cleaned`.



In [None]:
from sklearn.preprocessing import LabelEncoder

categorical_cols = df_merged_cleaned.select_dtypes(include=['object']).columns

for col in categorical_cols:
    le = LabelEncoder()
    df_merged_cleaned[col] = le.fit_transform(df_merged_cleaned[col])

display(df_merged_cleaned.head())
display(df_merged_cleaned.info())

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321
0,2987000,0,86400,68.5,4,13926,361.0,150.0,1,142.0,...,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0
1,2987001,0,86401,29.0,4,2755,404.0,150.0,2,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2987002,0,86469,59.0,4,4663,490.0,150.0,3,166.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2987003,0,86499,50.0,4,18132,567.0,150.0,2,117.0,...,135.0,0.0,0.0,0.0,50.0,1404.0,790.0,0.0,0.0,0.0
4,2987004,0,86506,50.0,1,4497,514.0,150.0,2,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 220 entries, TransactionID to V321
dtypes: float64(207), int64(13)
memory usage: 991.2 MB


None
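
To see where the integer codes come from, the sketch below runs `LabelEncoder` on a short illustrative list of category values. The codes are determined by the sorted order of the unique labels, which is consistent with `W` becoming 4 and `H` becoming 1 in the `ProductCD` column above. The variable names are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative category values; the codes depend only on the sorted unique labels
product_codes = ["W", "W", "H", "C", "S", "R"]

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(product_codes)

# classes_ holds the sorted labels; a label's code is its index in that array
mapping = {label: int(code) for code, label in enumerate(toy_encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

Because the codes follow alphabetical order, they impose an arbitrary ordering on the categories. That may matter for the scaling, PCA, and linear-model steps that follow.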

## Normalize numerical features

### Subtask:
Scale numerical features in the `df_merged_cleaned` DataFrame.


**Reasoning**:
Scale the numerical features in the `df_merged_cleaned` DataFrame using MinMaxScaler, excluding 'TransactionID' and 'isFraud'.



In [None]:
from sklearn.preprocessing import MinMaxScaler

numerical_cols = df_merged_cleaned.select_dtypes(include=['int64', 'float64']).columns
cols_to_scale = numerical_cols.drop(['TransactionID', 'isFraud'])

scaler = MinMaxScaler()
df_merged_cleaned[cols_to_scale] = scaler.fit_transform(df_merged_cleaned[cols_to_scale])

display(df_merged_cleaned.head())

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321
0,2987000,0,0.0,0.002137,1.0,0.743044,0.522,0.381679,0.333333,0.306569,...,0.0,0.0,0.0,0.0,0.0,0.000873,0.0,0.0,0.0,0.0
1,2987001,0,6.359409e-08,0.0009,1.0,0.100885,0.608,0.381679,0.666667,0.014599,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2987002,0,4.387992e-06,0.00184,1.0,0.210566,0.78,0.381679,1.0,0.481752,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2987003,0,6.295815e-06,0.001558,1.0,0.984824,0.934,0.381679,0.666667,0.124088,...,0.002449,0.0,0.0,0.0,0.000533,0.010476,0.008022,0.0,0.0,0.0
4,2987004,0,6.740974e-06,0.001558,0.25,0.201023,0.828,0.381679,0.666667,0.014599,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
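
Min-max scaling maps each value to `(x - min) / (max - min)`. The NumPy sketch below is a hand-written illustration of that formula (the helper names are hypothetical, and it is not scikit-learn's implementation). It also shows what happens when bounds learned on one split are applied to unseen data.

```python
import numpy as np

def minmax_fit(x):
    """Return the (min, max) pair learned from a 1-D array."""
    return float(np.min(x)), float(np.max(x))

def minmax_apply(x, lo, hi):
    """Scale x using previously learned bounds; a constant column maps to 0."""
    x = np.asarray(x, dtype=float)
    span = hi - lo
    if span == 0:
        return np.zeros_like(x)
    return (x - lo) / span

train = np.array([50.0, 29.0, 68.5, 59.0])  # made-up amounts
test = np.array([100.0])                     # larger than anything seen in train

lo, hi = minmax_fit(train)
train_scaled = minmax_apply(train, lo, hi)
test_scaled = minmax_apply(test, lo, hi)
print(train_scaled, test_scaled)
```

Values outside the fitted range land outside [0, 1], which is worth keeping in mind if the scaled features later feed a rotation-angle encoding.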


## Feature selection

### Subtask:
Select a reduced feature set (e.g., using PCA or other feature selection methods) to meet the qubit limits for the VQC.


**Reasoning**:
Separate the target variable and apply PCA to reduce the dimensionality of the features, then create a new DataFrame with the PCA components and the target variable.



In [None]:
from sklearn.decomposition import PCA

X = df_merged_cleaned.drop(columns=['TransactionID', 'isFraud'])
y = df_merged_cleaned['isFraud']

# Choose a small number of components for VQC compatibility
n_components = 4
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)

df_pca = pd.DataFrame(X_pca, columns=[f'pca_{i+1}' for i in range(n_components)])
df_pca['isFraud'] = y

display(df_pca.head())
display(df_pca.shape)

Unnamed: 0,pca_1,pca_2,pca_3,pca_4,isFraud
0,-0.029174,0.371937,0.829845,-0.073558,0
1,0.187855,0.285365,0.68957,0.105082,0
2,-0.058529,-0.400179,-0.074065,0.285307,0
3,-0.124235,-0.403499,-0.012147,0.221196,0
4,-0.623924,-0.320647,0.184692,0.169037,0


(590540, 5)
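
Keeping only a handful of components discards information, so it helps to check how much variance they retain. The sketch below does this on synthetic data (not the fraud dataset) using the `explained_variance_ratio_` attribute of a fitted `PCA`; all names ending in `_toy` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 10 features driven by 2 latent factors plus small noise
latent = rng.normal(size=(200, 2))
X_toy = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(200, 10))

pca_toy = PCA(n_components=4)
X_toy_pca = pca_toy.fit_transform(X_toy)

# Fraction of total variance captured by each retained component
retained = pca_toy.explained_variance_ratio_
print("variance retained per component:", np.round(retained, 4))
print("cumulative:", np.round(np.cumsum(retained), 4))
```

Running the same check on the notebook's `pca` object would show how much of the original signal survives the reduction to `n_components` features.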

## Train/test split & sampling

### Subtask:
Split the data into training and testing sets, ensuring stratification to preserve the fraud ratio. Create a smaller subset for demonstration purposes.


**Reasoning**:
Split the data into training and testing sets, and create a smaller subset for demonstration purposes, ensuring stratification.



In [None]:
from sklearn.model_selection import train_test_split

X = df_pca.drop(columns=['isFraud'])
y = df_pca['isFraud']

# Split the data into training and testing sets with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create a smaller subset for demonstration purposes
subset_size = 10000  # You can adjust this size
X_train_subset, _, y_train_subset, _ = train_test_split(X_train, y_train, train_size=subset_size, random_state=42, stratify=y_train)


# Display the shapes of the resulting sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
print("Shape of X_train_subset:", X_train_subset.shape)
print("Shape of y_train_subset:", y_train_subset.shape)

Shape of X_train: (472432, 4)
Shape of X_test: (118108, 4)
Shape of y_train: (472432,)
Shape of y_test: (118108,)
Shape of X_train_subset: (10000, 4)
Shape of y_train_subset: (10000,)
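
The effect of `stratify` can be verified on synthetic labels. The sketch below uses 350 positives out of 10,000 rows (a 3.5% rate, chosen for illustration) and checks that an 80/20 split keeps the same rate on both sides. The variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 3.5% positive rate (illustrative)
y_toy = np.array([1] * 350 + [0] * 9650)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without `stratify`, the positive rate in a small subset can drift noticeably from the population rate, which makes metrics on rare classes harder to compare across runs.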


## Handle class imbalance

### Subtask:
Apply RandomOverSampler to the training data to address class imbalance.


**Reasoning**:
Apply RandomOverSampler to the training data subset to address class imbalance.



In [None]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train_subset, y_train_subset)

print("Shape of original X_train_subset:", X_train_subset.shape)
print("Shape of resampled X_resampled:", X_resampled.shape)
print("Shape of original y_train_subset:", y_train_subset.shape)
print("Shape of resampled y_resampled:", y_resampled.shape)

Shape of original X_train_subset: (10000, 4)
Shape of resampled X_resampled: (19300, 4)
Shape of original y_train_subset: (10000,)
Shape of resampled y_resampled: (19300,)
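
Random oversampling duplicates minority-class rows until both classes are the same size, which is why the subset grows from 10,000 to 19,300 rows (2 × 9,650). The NumPy sketch below is a hand-rolled illustration of the idea, not imbalanced-learn's implementation; `random_oversample` is a hypothetical helper.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority rows until all classes have equal counts."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw extra indices with replacement; the majority class draws none
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    order = np.concatenate(keep)
    return X[order], y[order]

# Same class counts as implied by the shapes printed above
y_toy = np.array([0] * 9650 + [1] * 350)
X_toy = np.arange(len(y_toy), dtype=float).reshape(-1, 1)
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(X_bal.shape, np.bincount(y_bal))
```

The balanced set contains no new information: the 350 original minority rows are simply repeated, so a model can overfit to them. Only the training data should be resampled, as the notebook does.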


## Classical baseline

### Subtask:
Train a classical model (Logistic Regression or RandomForest) and evaluate its performance.


**Reasoning**:
Train a Logistic Regression model on the resampled training data and evaluate its performance on the test data using various classification metrics.



In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report

# Instantiate the Logistic Regression model
classical_model = LogisticRegression(random_state=42, solver='liblinear')

# Train the model on the resampled training data
classical_model.fit(X_resampled, y_resampled)

# Make predictions on the test data
y_pred = classical_model.predict(X_test)
y_prob = classical_model.predict_proba(X_test)[:, 1] # Get probabilities for ROC-AUC

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

# Print the evaluation metrics
print("Classical Model Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
print(f"ROC-AUC: {roc_auc:.4f}")

# Print a detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Classical Model Performance:
Accuracy: 0.7208
Precision: 0.0756
Recall: 0.6221
F1-score: 0.1349
ROC-AUC: 0.7376

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.72      0.83    113975
           1       0.08      0.62      0.13      4133

    accuracy                           0.72    118108
   macro avg       0.53      0.67      0.48    118108
weighted avg       0.95      0.72      0.81    118108
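
The printed metrics can be cross-checked with a little arithmetic. The sketch below recomputes F1 from the reported precision and recall and, using the support column of the report, the accuracy of a trivial classifier that never predicts fraud. The names are hypothetical and chosen so they do not overwrite the notebook's own metric variables.

```python
def f1_from(p, r):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = p + r
    return 0.0 if total == 0 else 2 * p * r / total

# Values copied from the printed classical-model metrics above
p_reported, r_reported = 0.0756, 0.6221
f1_check = f1_from(p_reported, r_reported)

# Accuracy of an "always legitimate" classifier, from the report's support column
n_legit, n_fraud = 113975, 4133
majority_baseline = n_legit / (n_legit + n_fraud)

print(f"F1 from P and R: {f1_check:.4f}")
print(f"All-negative accuracy: {majority_baseline:.4f}")
```

The all-negative baseline is about 0.965, well above the model's 0.72 accuracy, so accuracy alone says little here. Recall, precision, and ROC-AUC are the more informative figures for this class balance.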



## Summary:

### Data Analysis Key Findings

*   The dataset consists of transaction and identity information, merged successfully based on `TransactionID`, resulting in 590,540 rows and 434 columns.
*   A significant portion of columns (those with >50% missing values) were dropped during preprocessing to handle missing data. Remaining missing values were imputed using the median for numerical columns and the mode for categorical columns, resulting in a dataframe with no missing values.
*   Categorical features were successfully encoded using Label Encoding, converting 'object' type columns to numerical types.
*   Numerical features (excluding 'TransactionID' and 'isFraud') were scaled using `MinMaxScaler`.
*   Feature selection was performed using PCA, reducing the dimensionality to 4 principal components to align with potential qubit limitations for quantum processing.
*   The data was split into training (80%) and testing (20%) sets using stratification to preserve the fraud ratio. A smaller stratified subset of the training data (10,000 samples) was created for demonstration.
*   Class imbalance in the training subset was addressed using `RandomOverSampler`, which grew the subset from 10,000 to 19,300 samples.
*   A classical Logistic Regression model was trained on the resampled training subset and evaluated on the test set, achieving a ROC-AUC of 0.7376, an accuracy of 0.7208, a precision of 0.0756, and a recall of 0.6221 for the fraud class.
*   Attempts to set up and train a Qiskit Variational Quantum Classifier (VQC) failed repeatedly due to persistent `ImportError` issues with the `COBYLA` optimizer across multiple attempted import paths (`qiskit.algorithms.optimizers`, `qiskit.optimize`, `qiskit.utils.algorithm_globals`).
*   Consequently, the VQC could not be trained or evaluated, making a direct performance comparison between the classical and quantum models impossible within this process.

### Insights or Next Steps

*   The primary bottleneck was the inability to import the required Qiskit optimizer. Resolving this library compatibility issue is the critical next step to enable VQC training and proceed with the hybrid quantum-classical pipeline.
*   Once the VQC training is functional, future steps should include hyperparameter tuning for both the classical and VQC models, exploring different feature selection methods (potentially involving more features if qubit limits allow), and potentially experimenting with different Qiskit feature maps and ansatz circuits to optimize VQC performance.
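
One way to diagnose import-path problems like the one described above is to probe candidate module paths programmatically. The sketch below is a generic helper (`first_available` is a hypothetical name); the two candidate paths reflect the assumption that `COBYLA` lives in the separate `qiskit_algorithms` package on current releases and in `qiskit.algorithms` on legacy 0.x installs.

```python
import importlib

def first_available(candidates, attr):
    """Return (module_path, object) for the first importable module exposing attr, else None."""
    for path in candidates:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue
        if hasattr(module, attr):
            return path, getattr(module, attr)
    return None

# Candidate locations, newest first. The second path is the legacy location
# and is expected to be absent on current Qiskit releases.
found = first_available(
    ["qiskit_algorithms.optimizers", "qiskit.algorithms.optimizers"], "COBYLA"
)
print("COBYLA found in:", found[0] if found else "no candidate module")
```

If neither path resolves, the package is most likely not installed in the active environment, which points to the `pip install` step rather than the import statement.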


## Quantum model setup (Regenerated)

### Subtask:
Set up the VQC using Qiskit, defining the feature map, ansatz, optimizer, and backend. Include comments for switching to IBMQ hardware.

**Reasoning**:
Regenerating the VQC setup code with corrected import paths for Qiskit libraries. This includes importing `COBYLA` from `qiskit_algorithms.optimizers` and setting up the `EstimatorQNN` and `NeuralNetworkClassifier` with the previously defined feature map and ansatz.

In [None]:
from qiskit import QuantumCircuit
from qiskit_aer import Aer
# Corrected imports based on previous attempts
from qiskit_algorithms.optimizers import COBYLA
from qiskit_algorithms.utils import algorithm_globals
from qiskit.primitives import Estimator
from qiskit_machine_learning.neural_networks import EstimatorQNN
from qiskit_machine_learning.algorithms.classifiers import NeuralNetworkClassifier
from qiskit.circuit.library import RealAmplitudes, ZZFeatureMap
import warnings
import numpy as np

# Suppress DeprecationWarning from qiskit_machine_learning for now
warnings.filterwarnings("ignore", category=DeprecationWarning, module="qiskit_machine_learning")


# 2. Define the number of qubits: one per PCA component
num_qubits = n_components # n_components is set in the PCA step (4 in the run above)

# 3. Define the feature map
# Using ZZFeatureMap which is common for VQC. It takes num_qubits and input parameters.
# We will use the number of features (num_qubits) as the number of input parameters.
feature_map = ZZFeatureMap(feature_dimension=num_qubits, reps=1, entanglement='linear')


# 4. Define the ansatz (variational form)
# Using RealAmplitudes as an example
ansatz = RealAmplitudes(num_qubits, reps=1, entanglement='linear')


# 5. Choose an optimizer for training the VQC
optimizer = COBYLA(maxiter=50) # Reduced maxiter for faster demonstration


# 6. Select the primitive used to evaluate circuits
# The reference Estimator from qiskit.primitives simulates circuits locally,
# so no separate Aer backend has to be instantiated here.
backend = Estimator()
# backend = provider.get_backend('ibmq_quito')  # real device; needs `provider` to be defined first (see notes below)



# Switching to real IBM Quantum hardware (sketch; not executed in this notebook):
# `qiskit.IBMQ` was removed in Qiskit 1.0, so the legacy IBMQ.load_account() /
# IBMQ.get_provider() calls no longer work with the versions installed above.
# The replacement is the separate qiskit-ibm-runtime package:
# 1. Install it: !pip install qiskit-ibm-runtime
# 2. Import the service: from qiskit_ibm_runtime import QiskitRuntimeService
# 3. Load your saved account: service = QiskitRuntimeService()
# 4. Get the backend instance: backend = service.backend('<backend name>') # Replace with a device you can access
# 5. Build an Estimator primitive bound to that backend (check the qiskit-ibm-runtime
#    documentation for the constructor that matches your installed version) and pass
#    it to EstimatorQNN through its `estimator` argument.
# 6. Note: Running on real hardware requires careful consideration of circuit depth,
#    number of qubits, and available backend resources. Error mitigation techniques
#    are often necessary for noisy hardware.


# 7. Define the EstimatorQNN using the previously defined feature_map and ansatz
# Using the feature_map and ansatz defined in a previous successful setup step.
try:
    qnn = EstimatorQNN(
        circuit=feature_map.compose(ansatz), # Combine feature map and ansatz for the QNN
        input_params=list(feature_map.parameters), # Parameters for the input data
        weight_params=list(ansatz.parameters) # Trainable parameters
    )

    # Define the NeuralNetworkClassifier
    vqc_classifier = NeuralNetworkClassifier(
        neural_network=qnn,
        optimizer=optimizer,
        loss='cross_entropy',
        one_hot=False # Our labels are 0 and 1, not one-hot encoded
    )

    print("VQC setup complete.")
    print(f"Number of qubits: {num_qubits}")
    print("Feature Map (structure):")
    print(feature_map.draw())
    print("Ansatz (structure):")
    print(ansatz.draw())
    print(f"Optimizer: {type(optimizer).__name__}")
    print(f"Backend Primitive: {type(backend).__name__}") # Print backend type instead of name for Estimator


except NameError as ne:
    print(f"Error during VQC setup: {ne}. Ensure previous steps ran correctly.")
except Exception as e:
    print(f"An unexpected error occurred during VQC setup: {e}")

  backend = Estimator()
  qnn = EstimatorQNN(


VQC setup complete.
Number of qubits: 8
Feature Map (structure):
     ┌────────────────────────────────────────────────────────┐
q_0: ┤0                                                       ├
     │                                                        │
q_1: ┤1                                                       ├
     │                                                        │
q_2: ┤2                                                       ├
     │                                                        │
q_3: ┤3                                                       ├
     │  ZZFeatureMap(x[0],x[1],x[2],x[3],x[4],x[5],x[6],x[7]) │
q_4: ┤4                                                       ├
     │                                                        │
q_5: ┤5                                                       ├
     │                                                        │
q_6: ┤6                                                       ├
     │                                 
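
The size of the optimisation problem follows directly from the circuit choices. The sketch below counts trainable weights, assuming the library defaults: `RealAmplitudes` uses one RY angle per qubit in each of its `reps + 1` rotation layers, and `EfficientSU2` uses an RY and an RZ angle per qubit per layer. The helper names are hypothetical.

```python
def real_amplitudes_param_count(num_qubits, reps):
    """One RY angle per qubit in each of the (reps + 1) rotation layers."""
    return num_qubits * (reps + 1)

def efficient_su2_param_count(num_qubits, reps, gates_per_layer=2):
    """Default EfficientSU2 layers use two angles (RY, RZ) per qubit per layer."""
    return gates_per_layer * num_qubits * (reps + 1)

for n in (4, 8):
    print(n, "qubits:",
          real_amplitudes_param_count(n, reps=1), "RealAmplitudes weights,",
          efficient_su2_param_count(n, reps=1), "EfficientSU2 weights")
```

The feature map adds one input parameter per qubit but no trainable weights, so with `reps=1` the optimizer tunes 8 weights at 4 qubits and 16 at 8 qubits for `RealAmplitudes`.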

## Train & evaluate vqc

### Subtask:
Train the VQC and evaluate its performance on the test set.

**Reasoning**:
Import necessary classes for the VQC, define the QNN and NeuralNetworkClassifier, train the classifier on the resampled data, predict on the test set, and evaluate the performance.

In [None]:
from qiskit import QuantumCircuit
from qiskit.circuit.library import EfficientSU2, RealAmplitudes, ZZFeatureMap
from qiskit_machine_learning.neural_networks import EstimatorQNN
from qiskit_machine_learning.algorithms.classifiers import NeuralNetworkClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report
import warnings
import numpy as np # Import numpy as it's used later

# Suppress DeprecationWarning from qiskit_machine_learning
warnings.filterwarnings("ignore", category=DeprecationWarning, module="qiskit_machine_learning")

# Corrected import path for optimizers
from qiskit_algorithms.optimizers import COBYLA

# --- Use Best Hyperparameters from Tuning (assuming best_params is available) ---
# If best_params is not available (e.g., notebook restart), you might need to re-run
# the tuning cell or manually set the best parameters based on the tuning output.
# For this execution, we assume best_params is available from the previous tuning run.

# If best_params is not defined, manually set the best parameters based on the output of cell 2cb268b5
if 'best_params' not in locals() or best_params is None:
    print("Warning: best_params not found. Using default parameters for VQC training.")
    # Default parameters based on the tuning output in the provided notebook state
    best_feature_map_reps = 1
    best_ansatz_reps = 1
    best_ansatz_type = 'RealAmplitudes'
    best_optimizer_name = 'COBYLA'
    best_optimizer_params = {'maxiter': 50} # Use 50 maxiter as it showed best performance in tuning
else:
    best_feature_map_reps = best_params['feature_map_reps']
    best_ansatz_reps = best_params['ansatz_reps']
    best_ansatz_type = best_params['ansatz_type']
    best_optimizer_name = best_params['optimizer_name']
    best_optimizer_params = best_params['optimizer_params']


print(f"Using Best Hyperparameters:")
print(f"  Feature Map Reps: {best_feature_map_reps}")
print(f"  Ansatz Reps: {best_ansatz_reps}")
print(f"  Ansatz Type: {best_ansatz_type}")
print(f"  Optimizer: {best_optimizer_name}")
print(f"  Optimizer Params: {best_optimizer_params}")

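# Define the number of qubits from the training data (one qubit per feature).
# If the refined sampling step has not been run in this session, fall back to the
# resampled subset from the class-imbalance section to avoid a NameError.
if 'X_resampled_refined' not in globals():
    X_resampled_refined, y_resampled_refined = X_resampled, y_resampled
num_qubits = X_resampled_refined.shape[1] # Use the number of features in the training dataset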


# Define the feature map using the best parameters
if best_feature_map_reps > 0: # Only create if reps > 0
    feature_map = ZZFeatureMap(feature_dimension=num_qubits, reps=best_feature_map_reps, entanglement='linear')
else: # Handle case where 0 reps might be desired (though unlikely for a feature map)
    feature_map = QuantumCircuit(num_qubits, name="FeatureMap")


# Define the ansatz based on the best parameters
if best_ansatz_type == 'RealAmplitudes':
    ansatz = RealAmplitudes(num_qubits, reps=best_ansatz_reps, entanglement='linear')
elif best_ansatz_type == 'EfficientSU2':
    ansatz = EfficientSU2(num_qubits, reps=best_ansatz_reps, entanglement='linear')
else:
    raise ValueError(f"Unknown best ansatz type: {best_ansatz_type}")


# Choose the optimizer based on the best parameters
if best_optimizer_name == 'COBYLA':
    optimizer = COBYLA(**best_optimizer_params)
elif best_optimizer_name == 'ADAM':
    from qiskit_algorithms.optimizers import ADAM
    optimizer = ADAM(**best_optimizer_params)
elif best_optimizer_name == 'SPSA':
    from qiskit_algorithms.optimizers import SPSA
    optimizer = SPSA(**best_optimizer_params)
else:
    raise ValueError(f"Unknown best optimizer: {best_optimizer_name}")

print(f"Optimizer: {type(optimizer).__name__} configured.")


# Define the EstimatorQNN using the previously defined feature_map and ansatz
# Using the feature_map and ansatz defined with the best hyperparameters.
try:
    # Combine feature map and ansatz for the QNN
    # Ensure input_params and weight_params are correctly assigned
    qnn_circuit = feature_map.compose(ansatz)
    input_params = list(feature_map.parameters)
    weight_params = list(ansatz.parameters)

    qnn = EstimatorQNN(
        circuit=qnn_circuit,
        input_params=input_params,
        weight_params=weight_params
    )

    # Define the NeuralNetworkClassifier
    vqc_classifier = NeuralNetworkClassifier(
        neural_network=qnn,
        optimizer=optimizer,
        loss='cross_entropy',
        one_hot=False # Our labels are 0 and 1, not one-hot encoded
    )

    # Train the VQC classifier using the REFINED resampled training data
    print("\nStarting VQC training with refined data and best hyperparameters...")
    # Convert pandas DataFrames/Series to numpy arrays for Qiskit ML
    # Use X_resampled_refined and y_resampled_refined from the refined sampling step
    vqc_classifier.fit(X_resampled_refined.values, y_resampled_refined.values)
    print("VQC training finished.")

    # Make predictions on the test data (using the full X_test)
    print("Making predictions on test set...")
    # Convert pandas DataFrame to numpy array for Qiskit ML
    # X_test and y_test should be available from previous step
    y_pred_vqc = vqc_classifier.predict(X_test.values)
    # Ensure probabilities are calculated correctly
    # Handle the case where predict_proba might only return one column
    y_prob_vqc_raw = vqc_classifier.predict_proba(X_test.values)
    if y_prob_vqc_raw.shape[1] > 1:
        y_prob_vqc = y_prob_vqc_raw[:, 1] # Get probabilities for the positive class
    else:
        # If only one column, assume it's the probability of the positive class
        y_prob_vqc = y_prob_vqc_raw.flatten()


    print("Predictions finished.")

    # --- Fix for [-1, 1] predictions ---
    # Map any prediction not equal to 1 to 0 to ensure binary (0 or 1) output,
    # and flatten to 1-D so the result lines up with y_test
    y_pred_vqc_binary = np.where(np.asarray(y_pred_vqc).ravel() == 1, 1, 0)
    # Ensure it's integer type
    y_pred_vqc_int = y_pred_vqc_binary.astype(int)
    # --- End fix ---


    # --- Debugging prints ---
    print("\n--- Debugging Metrics Inputs ---")
    print(f"Type of y_test.values: {type(y_test.values)}")
    print(f"Shape of y_test.values: {y_test.values.shape}")
    print(f"Unique values in y_test.values: {np.unique(y_test.values)}")
    print(f"Dtype of y_test.values: {y_test.values.dtype}")

    print(f"\nType of y_pred_vqc_int: {type(y_pred_vqc_int)}")
    print(f"Shape of y_pred_vqc_int: {y_pred_vqc_int.shape}")
    print(f"Unique values in y_pred_vqc_int: {np.unique(y_pred_vqc_int)}")
    print(f"Dtype of y_pred_vqc_int: {y_pred_vqc_int.dtype}")
    print("--- End Debugging Metrics Inputs ---")
    # --- End Debugging prints ---


    accuracy_vqc = accuracy_score(y_test.values, y_pred_vqc_int)
    # For precision, recall, and f1, specify zero_division to avoid warning/error if no positive predictions
    # Also explicitly set average='binary' and pos_label=1
    precision_vqc = precision_score(y_test.values, y_pred_vqc_int, average='binary', pos_label=1, zero_division=0)
    recall_vqc = recall_score(y_test.values, y_pred_vqc_int, average='binary', pos_label=1, zero_division=0)
    f1_vqc = f1_score(y_test.values, y_pred_vqc_int, average='binary', pos_label=1, zero_division=0)

    # Check if roc_auc_score is valid
    if len(np.unique(y_test.values)) > 1:
        roc_auc_vqc = roc_auc_score(y_test.values, y_prob_vqc)
    else:
        roc_auc_vqc = np.nan # ROC-AUC is not well-defined with only one class present


    # Print the evaluation metrics
    print("\nVQC Model Performance (Optimized):")
    print(f"Accuracy: {accuracy_vqc:.4f}")
    print(f"Precision: {precision_vqc:.4f}")
    print(f"Recall: {recall_vqc:.4f}")
    print(f"F1-score: {f1_vqc:.4f}")
    if not np.isnan(roc_auc_vqc):
        print(f"ROC-AUC: {roc_auc_vqc:.4f}")
    else:
        print("ROC-AUC: Not available (y_test contains only one class)")


    # Print a detailed classification report
    print("\nClassification Report (Optimized VQC):")
    # Use zero_division=0 in classification_report as well
    print(classification_report(y_test.values, y_pred_vqc_int, zero_division=0))

except NameError as ne:
    print(f"Error during VQC training or evaluation: {ne}. Ensure previous steps ran correctly and required variables exist.")
except Exception as e:
    print(f"An unexpected error occurred during VQC training or evaluation: {e}")

Using Best Hyperparameters:
  Feature Map Reps: 1
  Ansatz Reps: 1
  Ansatz Type: RealAmplitudes
  Optimizer: COBYLA
  Optimizer Params: {'maxiter': 50}


NameError: name 'X_resampled_refined' is not defined
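
The label handling in the cell above can be illustrated without Qiskit. The sketch below assumes the network's single output is an expectation value in [-1, 1], so that sign-based predictions are -1 or +1. All helper names and the sample outputs are made up for illustration.

```python
import numpy as np

def to_pm1(y01):
    """Map {0, 1} labels to {-1, +1}."""
    return 2 * np.asarray(y01) - 1

def to_01(y_pm1):
    """Map {-1, +1} predictions (or any sign-valued output) back to {0, 1}."""
    return (np.asarray(y_pm1).ravel() > 0).astype(int)

def expectation_to_score(expval):
    """Rescale an expectation value in [-1, 1] to a [0, 1] ranking score."""
    return (1.0 + np.asarray(expval, dtype=float).ravel()) / 2.0

y_true01 = np.array([0, 1, 1, 0])
raw_expval = np.array([[-0.8], [0.6], [-0.1], [-0.3]])  # made-up network outputs

y_pred01 = to_01(np.sign(raw_expval))
scores = expectation_to_score(raw_expval)
print(to_pm1(y_true01), y_pred01, scores)
```

The rescaled score preserves the ordering of the raw outputs, which is all ROC-AUC needs, but it is not a calibrated probability.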

## Results & insights

### Subtask:
Compare the results of the classical and quantum models and summarize the findings.

**Reasoning**:
Compare the performance metrics of the classical Logistic Regression model and the VQC model, and provide a summary of the key findings and insights.

In [None]:
# Classical model performance metrics obtained from a previous step
# Ensure these variables are available from the classical model evaluation step (cell 654bf5a1)
classical_accuracy = accuracy
classical_precision = precision
classical_recall = recall
classical_f1 = f1
classical_roc_auc = roc_auc

print("Classical Model Performance:")
print(f"Accuracy: {classical_accuracy:.4f}")
print(f"Precision: {classical_precision:.4f}")
print(f"Recall: {classical_recall:.4f}")
print(f"F1-score: {classical_f1:.4f}")
print(f"ROC-AUC: {classical_roc_auc:.4f}")
print("-" * 30)

# VQC model performance metrics obtained from the previous step (cell 8354d478)
# Ensure these variables are available from the VQC training and evaluation step
vqc_accuracy = accuracy_vqc
vqc_precision = precision_vqc
vqc_recall = recall_vqc
vqc_f1 = f1_vqc
vqc_roc_auc = roc_auc_vqc

print("VQC Model Performance (Optimized):") # Added (Optimized)
print(f"Accuracy: {vqc_accuracy:.4f}")
print(f"Precision: {vqc_precision:.4f}")
print(f"Recall: {vqc_recall:.4f}")
print(f"F1-score: {vqc_f1:.4f}")
if not np.isnan(vqc_roc_auc):
    print(f"ROC-AUC: {vqc_roc_auc:.4f}")
else:
    print("ROC-AUC: Not available (y_test contains only one class)")


print("-" * 30)

print("\n--- Comparison and Summary ---")

# Compare performance metrics
print("\nPerformance Comparison:")
print("Metric     | Classical | Optimized VQC")
print("-----------|-----------|--------------")
print(f"Accuracy   | {classical_accuracy:.4f}  | {vqc_accuracy:.4f}")
print(f"Precision  | {classical_precision:.4f}  | {vqc_precision:.4f}")
print(f"Recall     | {classical_recall:.4f}  | {vqc_recall:.4f}")
print(f"F1-score   | {classical_f1:.4f}  | {vqc_f1:.4f}")
if not np.isnan(classical_roc_auc) and not np.isnan(vqc_roc_auc):
     print(f"ROC-AUC    | {classical_roc_auc:.4f}  | {vqc_roc_auc:.4f}")
elif not np.isnan(classical_roc_auc):
     print(f"ROC-AUC    | {classical_roc_auc:.4f}  | N/A")
elif not np.isnan(vqc_roc_auc):
     print(f"ROC-AUC    | N/A       | {vqc_roc_auc:.4f}")
else:
     print(f"ROC-AUC    | N/A       | N/A")


print("\nKey Findings:")
print(f"- The classical Logistic Regression model achieved a ROC-AUC of {classical_roc_auc:.4f} on the test set.")
if not np.isnan(vqc_roc_auc):
    print(f"- The optimized VQC model achieved a ROC-AUC of {vqc_roc_auc:.4f} on the test set.") # Added Optimized
    if vqc_roc_auc > classical_roc_auc:
        print("- In terms of ROC-AUC, the optimized VQC model performed better than the classical model.") # Added Optimized
    elif vqc_roc_auc < classical_roc_auc:
        print("- In terms of ROC-AUC, the classical model performed better than the optimized VQC model.") # Added Optimized
    else:
        print("- In terms of ROC-AUC, the classical and optimized VQC models performed similarly.") # Added Optimized
else:
     print("- The ROC-AUC for the optimized VQC model is not available.") # Added Optimized


print(f"- For detecting fraudulent transactions (class 1), the classical model had a Recall of {classical_recall:.4f} and a Precision of {classical_precision:.4f}.")
print(f"- The optimized VQC model had a Recall of {vqc_recall:.4f} and a Precision of {vqc_precision:.4f} for the fraud class.") # Added Optimized

# Add insights based on the comparison (this part will be more specific after seeing the VQC results)
print("\nInsights:")
print("Based on the performance metrics:")
if not np.isnan(vqc_roc_auc):
    print(f"- The optimized VQC model, trained with the best hyperparameters found and on a larger resampled subset, shows the following performance compared to the classical Logistic Regression model:")
    print(f"  - Accuracy: Classical={classical_accuracy:.4f}, Optimized VQC={vqc_accuracy:.4f}")
    print(f"  - Precision: Classical={classical_precision:.4f}, Optimized VQC={vqc_precision:.4f}")
    print(f"  - Recall: Classical={classical_recall:.4f}, Optimized VQC={vqc_recall:.4f}")
    print(f"  - F1-score: Classical={classical_f1:.4f}, Optimized VQC={vqc_f1:.4f}")
    print(f"  - ROC-AUC: Classical={classical_roc_auc:.4f}, Optimized VQC={vqc_roc_auc:.4f}")

    # Provide interpretation of the results
    if vqc_roc_auc > classical_roc_auc:
        print("\nInterpretation:")
        print("The optimized VQC model achieved a higher ROC-AUC than the classical model, indicating better overall discrimination ability between the positive (fraud) and negative (non-fraud) classes. This suggests that the quantum model, with the optimized configuration and larger training data subset, is potentially capturing more complex patterns in the data relevant to fraud detection.")
        print("Accuracy alone can be misleading on a dataset this imbalanced, so ROC-AUC, Recall, and Precision deserve more weight. Compare the Recall and Precision values above before drawing conclusions: higher Recall means fewer missed frauds, while low Precision means more false alarms.")
    elif vqc_roc_auc < classical_roc_auc:
        print("\nInterpretation:")
        print("The classical model still outperforms the optimized VQC in terms of ROC-AUC. This could be due to several factors:")
        print("  - The limited number of features (3 from PCA) might not be sufficient for the VQC to demonstrate a significant advantage, even with hyperparameter tuning.")
        print("  - The current VQC architecture (feature map and ansatz) might not be complex enough to capture the intricate patterns in this specific dataset.")
        print("  - The dataset size, even with the refined subset and oversampling, might still be too large for the current VQC setup to train effectively within reasonable time and computational resources.")
        print("  - The classical Logistic Regression model, despite its simplicity, might be performing well due to the nature of the selected features.")
    else:
        print("\nInterpretation:")
        print("The optimized VQC model performs similarly to the classical model in terms of ROC-AUC. This indicates that with the current setup (3 features, chosen architecture, training data size), the VQC does not offer a significant advantage over the classical approach for this dataset.")
        print("Further improvements might require exploring more features (which is challenging with current quantum resources), different VQC architectures, or more advanced quantum machine learning techniques.")


    print("\nFuture Potential and Next Steps:")
    print("- **Explore more features:** Investigate techniques to use a larger, more informative feature set with the VQC, potentially involving feature engineering or more advanced quantum embedding methods if quantum hardware capabilities increase.")
    print("- **Experiment with different VQC architectures:** Explore different feature maps (like the alternative EfficientSU2 you defined) and ansatz circuits (including deeper or different structures) in the hyperparameter tuning process.")
    print("- **Advanced Quantum Techniques:** Research and implement more sophisticated QML algorithms or hybrid approaches that might be better suited for complex, large-scale datasets.")
    print("- **Real Hardware Exploration:** If access to real quantum hardware is available, experiment with running the optimized VQC on hardware, considering error mitigation techniques.")
    print("- **More Extensive Hyperparameter Tuning:** Conduct a more extensive hyperparameter search, potentially using techniques like random search or Bayesian optimization, and train on larger subsets if computational resources allow.")

else:
     print("- Due to the ROC-AUC for the optimized VQC model not being available, a direct comparison is not possible. This might indicate an issue with the VQC training or prediction step.")

Classical Model Performance:
Accuracy: 0.7260
Precision: 0.0755
Recall: 0.6073
F1-score: 0.1343
ROC-AUC: 0.7270
------------------------------
VQC Model Performance (Optimized):
Accuracy: 0.2820
Precision: 0.0393
Recall: 0.8331
F1-score: 0.0751
ROC-AUC: 0.5537
------------------------------

--- Comparison and Summary ---

Performance Comparison:
Metric     | Classical | Optimized VQC
-----------|-----------|------
Accuracy   | 0.7260  | 0.2820
Precision  | 0.0755  | 0.0393
Recall     | 0.6073  | 0.8331
F1-score   | 0.1343  | 0.0751
ROC-AUC    | 0.7270  | 0.5537

Key Findings:
- The classical Logistic Regression model achieved a ROC-AUC of 0.7270 on the test set.
- The optimized VQC model achieved a ROC-AUC of 0.5537 on the test set.
- In terms of ROC-AUC, the classical model performed better than the optimized VQC model.
- For detecting fraudulent transactions (class 1), the classical model had a Recall of 0.6073 and a Precision of 0.0755.
- The optimized VQC model had a Recall of 0.8
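
The F1 scores in the table above follow directly from the reported precision and recall, since F1 is their harmonic mean. The snippet below is a minimal consistency check using only the four-decimal values printed in the output; `f1_from_precision_recall` is a hypothetical helper, not part of the notebook.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the printed comparison table
classical_f1_check = f1_from_precision_recall(0.0755, 0.6073)
vqc_f1_check = f1_from_precision_recall(0.0393, 0.8331)

print(f"{classical_f1_check:.4f} {vqc_f1_check:.4f}")  # 0.1343 0.0751
```

Both values match the reported F1 scores. The VQC's higher recall does not compensate for its lower precision, which is why its F1 is roughly half the classical score.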

# Task
Analyze the provided Python notebook for a fraud detection task using both classical and Quantum Machine Learning (QML) approaches. The goal is to significantly improve the performance of the QML model, specifically the VQC, to demonstrate its superiority over the classical baseline. This involves a comprehensive review and modification of the code, including data preprocessing, feature selection, model architecture (especially for the VQC), training process, and evaluation. The final output should be a modified notebook with improved QML performance, a detailed explanation of all code cells, the data used, the workflow, and a comparative analysis highlighting the improvements and future potential of QML for this task. The notebook should be optimized for winning a competition, implying a focus on achieving the highest possible performance metrics for the QML model.

## Re-evaluate feature selection

### Subtask:
Re-evaluate feature selection to use more relevant features for improved VQC performance, potentially exploring methods beyond strict qubit limits for future applicability.


**Reasoning**:
Calculate the correlation of each feature with 'isFraud', select the top correlated features, and create a new DataFrame with these features and the target variable.
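
The ranking idea can be illustrated without pandas. The sketch below uses only the standard library and a toy dataset; `pearson`, `rank_by_abs_correlation`, and the toy columns are hypothetical names introduced for illustration, and the notebook itself uses `DataFrame.corr()`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def rank_by_abs_correlation(columns, target, top_k):
    """Return up to top_k column names ordered by |corr(column, target)|, skipping NaN."""
    scored = []
    for name, values in columns.items():
        r = pearson(values, target)
        if not math.isnan(r):
            scored.append((abs(r), name))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]

# Toy data: the target is kept out of the feature dict, so it can never rank itself.
toy_columns = {
    "a": [0, 0, 1, 1, 1, 0],  # identical to the target
    "b": [1, 1, 0, 0, 1, 0],  # weakly anti-correlated
    "c": [5, 5, 5, 5, 5, 5],  # constant, correlation undefined
}
toy_target = [0, 0, 1, 1, 1, 0]

top = rank_by_abs_correlation(toy_columns, toy_target, top_k=2)
print(top)
```

Keeping the target out of the candidate set by name, rather than by position in a sorted list, avoids accidentally dropping a real feature when the ordering is not what was expected.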



In [None]:
# 1. Calculate the absolute correlation of each feature with the target variable 'isFraud'
# Drop 'isFraud' by name instead of relying on it sorting to position 0
correlations = (
    df_merged_cleaned.corr()['isFraud']
    .drop('isFraud')
    .abs()
    .sort_values(ascending=False)
)

# 2. Select the top 50 features by absolute correlation
top_features = correlations.head(50).index.tolist()

# 3. Create a new DataFrame with only these selected features and the target variable
X_selected = df_merged_cleaned[top_features]
y = df_merged_cleaned['isFraud'] # Target variable

# Keep X_selected (features) and y (target) separate for modeling;
# df_selected combines them only for inspection.
df_selected = X_selected.copy()
df_selected['isFraud'] = y

# 4. Display the shape of X_selected and the list of selected feature names
print("Shape of X_selected:", X_selected.shape)
print("\nSelected Features:")
print(top_features)
display(df_selected.head())

Shape of X_selected: (590540, 50)

Selected Features:
['V45', 'V86', 'V87', 'V44', 'V52', 'V51', 'V40', 'V79', 'V39', 'V38', 'V43', 'V94', 'V33', 'V17', 'V18', 'V81', 'ProductCD', 'V34', 'V74', 'V80', 'V93', 'V15', 'V42', 'V92', 'card3', 'V16', 'V37', 'V50', 'V73', 'V58', 'V85', 'V84', 'V21', 'V57', 'V31', 'V77', 'V32', 'V123', 'V22', 'V47', 'V72', 'V23', 'V302', 'V78', 'V304', 'V71', 'M4', 'V63', 'V60', 'V59']


Unnamed: 0,V45,V86,V87,V44,V52,V51,V40,V79,V39,V38,...,V23,V302,V78,V304,V71,M4,V63,V60,V59,isFraud
0,0.020833,0.033333,0.033333,0.020833,0.0,0.0,0.0,0.0,0.0,0.018519,...,0.076923,0.0,0.032258,0.0,0.0,1.0,0.0,0.0,0.0,0
1,0.020833,0.033333,0.033333,0.020833,0.0,0.0,0.0,0.0,0.0,0.018519,...,0.076923,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0
2,0.020833,0.033333,0.033333,0.020833,0.0,0.0,0.0,0.0,0.0,0.018519,...,0.076923,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0
3,0.020833,0.033333,0.033333,0.020833,0.0,0.0,0.0,0.0,0.0,0.018519,...,0.076923,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0
4,0.020833,0.033333,0.033333,0.020833,0.0,0.0,0.0,0.0,0.0,0.018519,...,0.076923,0.0625,0.032258,0.0625,0.0,0.0,0.0,0.0,0.0,0


## Explore alternative feature maps and ansatz circuits

### Subtask:
Explore alternative feature maps and ansatz circuits to potentially improve the VQC's ability to learn complex patterns in the data.


**Reasoning**:
Define alternative feature maps and ansatz circuits using Qiskit, drawing their structures and explaining the rationale.
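
When comparing candidate circuits, it helps to know how many trainable parameters each one has. The sketch below is a rough calculation assuming the library defaults (an RY layer per block for `RealAmplitudes`, RY followed by RZ for `EfficientSU2`, and a final rotation layer included); the two helper functions are hypothetical and do not call Qiskit.

```python
def real_amplitudes_param_count(num_qubits, reps):
    # One RY angle per qubit in each of the (reps + 1) rotation layers.
    return num_qubits * (reps + 1)

def efficient_su2_param_count(num_qubits, reps, gates_per_layer=2):
    # Default rotation layer applies RY then RZ to every qubit.
    return gates_per_layer * num_qubits * (reps + 1)

n = 3
print(real_amplitudes_param_count(n, reps=1))  # 6
print(real_amplitudes_param_count(n, reps=3))  # 12
print(efficient_su2_param_count(n, reps=1))    # 12
```

On 3 qubits, `EfficientSU2` with `reps=1` therefore has 12 parameters, which matches the twelve θ entries shown in the circuit drawing. A feature map with 12 parameters cannot take 3 input features directly, so using it to encode data would need an explicit mapping from features to parameters.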



In [None]:
# 1. Import necessary modules from Qiskit
# Imports for QuantumCircuit, ZZFeatureMap, RealAmplitudes are already done in previous cells.
# Import EfficientSU2
from qiskit.circuit.library import EfficientSU2

# The feature selection step picked 50 features, but this cell keeps the 3 PCA
# components from the earlier VQC setup, using one qubit per feature.
# NOTE: Encoding all 50 features one per qubit would require 50 qubits, which is
# beyond what a statevector simulator can handle; that would need real hardware
# or a different encoding strategy.
num_qubits = 3 # Same qubit count as the earlier VQC setup

# 2. Define an alternative feature map candidate (EfficientSU2)
# EfficientSU2 is normally an ansatz. With reps=1 on 3 qubits it has 12 parameters,
# so using it as a feature map requires an explicit mapping from the 3 input
# features onto those parameters.
alt_feature_map = EfficientSU2(num_qubits, reps=1, entanglement='linear')

# 3. Define an alternative ansatz circuit (e.g., a deeper RealAmplitudes or EfficientSU2)
# Deeper RealAmplitudes with more repetitions
alt_ansatz_deeper = RealAmplitudes(num_qubits, reps=3, entanglement='linear')

# EfficientSU2 used as an ansatz
alt_ansatz_efficient = EfficientSU2(num_qubits, reps=1, entanglement='linear')


# 4. Print or display the structure of the defined alternative circuits
print("Alternative Feature Map (EfficientSU2) Structure:")
print(alt_feature_map.draw())

print("\nAlternative Ansatz (Deeper RealAmplitudes) Structure:")
print(alt_ansatz_deeper.draw())

print("\nAlternative Ansatz (EfficientSU2) Structure:")
print(alt_ansatz_efficient.draw())


# 5. Briefly explain the rationale
print("\nRationale for Alternative Circuits:")
print("- **Alternative Feature Map (EfficientSU2):** EfficientSU2 is a hardware-efficient ansatz that can also serve as a feature map. Its structure with SU(2) single-qubit rotations and CNOT entanglers can potentially encode classical data into quantum states in a different way than ZZFeatureMap, possibly capturing different data correlations.")
print("- **Alternative Ansatz (Deeper RealAmplitudes):** Increasing the number of repetitions (layers) in the RealAmplitudes ansatz provides more trainable parameters. This increased depth can allow the VQC to learn more complex functions and potentially improve its ability to classify non-linearly separable data.")
print("- **Alternative Ansatz (EfficientSU2):** Using EfficientSU2 as an ansatz offers a different parameterized structure compared to RealAmplitudes. Its specific arrangement of rotations and entanglers might be better suited for certain types of data landscapes or be more robust to noise on real hardware (due to being hardware-efficient).")
print("\nThese alternatives will be explored in subsequent steps to see if they can improve the VQC's performance compared to the initial setup.")

Alternative Feature Map (EfficientSU2) Structure:
     »
q_0: »
     »
q_1: »
     »
q_2: »
     »
«     ┌──────────────────────────────────────────────────────────────────────────────┐
«q_0: ┤0                                                                             ├
«     │                                                                              │
«q_1: ┤1 EfficientSU2(θ[0],θ[1],θ[2],θ[3],θ[4],θ[5],θ[6],θ[7],θ[8],θ[9],θ[10],θ[11]) ├
«     │                                                                              │
«q_2: ┤2                                                                             ├
«     └──────────────────────────────────────────────────────────────────────────────┘

Alternative Ansatz (Deeper RealAmplitudes) Structure:
     »
q_0: »
     »
q_1: »
     »
q_2: »
     »
«     ┌────────────────────────────────────────────────────────────────────────────────┐
«q_0: ┤0                                                                               ├
«     │   

## Hyperparameter tuning for VQC

### Subtask:
Perform hyperparameter tuning for the VQC by experimenting with different ansatz structures (number of layers), optimizer settings, and potentially feature map parameters.


**Reasoning**:
Define a function for VQC training and evaluation, split the resampled data, and define the hyperparameter search space and tuning strategy (grid search). Then, iterate through the search space, train and evaluate the VQC for each combination using the defined function, track performance, identify the best hyperparameters, and print the results.
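
The grid search itself is independent of the quantum model. The sketch below shows the same loop flattened with `itertools.product` and with failed (NaN) runs excluded from the best-score comparison; `grid_search`, `toy_score`, and `toy_grid` are hypothetical stand-ins for the expensive VQC training call, not the notebook's code.

```python
import itertools
import math

def grid_search(param_grid, score_fn):
    """Score every combination in param_grid; NaN scores never become the best."""
    keys = list(param_grid)
    best_score, best_params, results = -math.inf, None, []
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(**params)
        results.append({**params, "score": score})
        if not math.isnan(score) and score > best_score:
            best_score, best_params = score, params
    return best_score, best_params, results

def toy_score(feature_map_reps, ansatz_reps, maxiter):
    """Cheap placeholder for a train-and-validate call."""
    if ansatz_reps == 2 and maxiter == 100:
        return float("nan")  # simulate a failed run
    return 0.5 + 0.01 * feature_map_reps - 0.02 * ansatz_reps + 0.0001 * maxiter

toy_grid = {"feature_map_reps": [1, 2], "ansatz_reps": [1, 2], "maxiter": [50, 100]}
best_score, best_params, results = grid_search(toy_grid, toy_score)
print(len(results), best_params)
```

Returning `best_params = None` when every run fails lets the caller detect that case explicitly instead of failing later when the parameters are printed.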



In [None]:
# 1. Define a function for VQC training and evaluation
from qiskit_algorithms.optimizers import COBYLA, ADAM, SPSA
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from qiskit_machine_learning.neural_networks import EstimatorQNN
from qiskit_machine_learning.algorithms.classifiers import NeuralNetworkClassifier
# EstimatorQNN falls back to a default estimator when none is passed, so no primitive import is needed
from qiskit.circuit.library import ZZFeatureMap, RealAmplitudes, EfficientSU2
import warnings
import numpy as np

# Suppress DeprecationWarning from qiskit_machine_learning
warnings.filterwarnings("ignore", category=DeprecationWarning, module="qiskit_machine_learning")


def train_and_evaluate_vqc(
    X_train_subset,
    y_train_subset,
    X_val,
    y_val,
    feature_map_reps,
    ansatz_reps,
    ansatz_type,
    optimizer_name,
    optimizer_params,
    num_qubits=3,
):
    """
    Trains and evaluates a VQC with given hyperparameters.

    Args:
        X_train_subset (np.ndarray): Training features.
        y_train_subset (np.ndarray): Training labels.
        X_val (np.ndarray): Validation features.
        y_val (np.ndarray): Validation labels.
        feature_map_reps (int): Number of repetitions for the feature map.
        ansatz_reps (int): Number of repetitions for the ansatz.
        ansatz_type (str): Type of ansatz ('RealAmplitudes' or 'EfficientSU2').
        optimizer_name (str): Name of the optimizer ('COBYLA', 'ADAM', 'SPSA').
        optimizer_params (dict): Dictionary of parameters for the optimizer.
        num_qubits (int): Number of qubits (features).

    Returns:
        float: ROC-AUC score on the validation set.
    """
    try:
        # Define the feature map
        # Using ZZFeatureMap, varying repetitions
        feature_map = ZZFeatureMap(
            feature_dimension=num_qubits, reps=feature_map_reps, entanglement='linear'
        )

        # Define the ansatz based on type
        if ansatz_type == 'RealAmplitudes':
            ansatz = RealAmplitudes(num_qubits, reps=ansatz_reps, entanglement='linear')
        elif ansatz_type == 'EfficientSU2':
            ansatz = EfficientSU2(num_qubits, reps=ansatz_reps, entanglement='linear')
        else:
            raise ValueError(f"Unknown ansatz type: {ansatz_type}")

        # Choose the optimizer
        if optimizer_name == 'COBYLA':
            optimizer = COBYLA(**optimizer_params)
        elif optimizer_name == 'ADAM':
            optimizer = ADAM(**optimizer_params)
        elif optimizer_name == 'SPSA':
            optimizer = SPSA(**optimizer_params)
        else:
            raise ValueError(f"Unknown optimizer: {optimizer_name}")


        # Define the EstimatorQNN
        qnn = EstimatorQNN(
            circuit=feature_map.compose(ansatz),
            input_params=list(feature_map.parameters),
            weight_params=list(ansatz.parameters),
        )

        # Define the NeuralNetworkClassifier
        vqc_classifier = NeuralNetworkClassifier(
            neural_network=qnn,
            optimizer=optimizer,
            loss='cross_entropy',
            one_hot=False,
        )

        # Train the VQC classifier
        vqc_classifier.fit(X_train_subset, y_train_subset)

        # Make predictions on the validation data
        y_prob_val_raw = vqc_classifier.predict_proba(X_val)

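        # Handle predict_proba output format.
        # With a single-observable EstimatorQNN the output has one column, so the
        # else-branch is the expected path. ROC-AUC depends only on how the scores
        # rank the samples, so uncalibrated values are acceptable here.
        if y_prob_val_raw.shape[1] > 1:
            y_prob_val = y_prob_val_raw[:, 1]  # Get probabilities for the positive class
        else:
            y_prob_val = y_prob_val_raw.flatten()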


        # Evaluate the model's performance using ROC-AUC
        if len(np.unique(y_val)) > 1:
             roc_auc_val = roc_auc_score(y_val, y_prob_val)
        else:
             roc_auc_val = np.nan # Cannot compute ROC-AUC with only one class


        return roc_auc_val

    except Exception as e:
        print(f"Error during VQC training or evaluation: {e}")
        return np.nan # Return NaN in case of errors


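# 2. Split the X_resampled and y_resampled data into training and validation sets
# NOTE: subset_tune_size is defined but never applied, so the split below uses all of
# X_resampled (70% for tuning, 30% for validation), as the printed shapes confirm.
# Caveat: if X_resampled was oversampled before this split, duplicated fraud rows can
# land in both X_tune and X_val and inflate the validation ROC-AUC.
subset_tune_size = 2000 # Currently unused; subsample X_resampled to this size for faster tuning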
X_tune, X_val, y_tune, y_val = train_test_split(
    X_resampled.values, # Use numpy arrays
    y_resampled.values, # Use numpy arrays
    test_size=0.3, # Use 30% for validation
    random_state=42,
    stratify=y_resampled.values, # Stratify to maintain class distribution
)

print(f"Shape of X_tune: {X_tune.shape}")
print(f"Shape of y_tune: {y_tune.shape}")
print(f"Shape of X_val: {X_val.shape}")
print(f"Shape of y_val: {y_val.shape}")


# 3. Define a search space for the hyperparameters
# Grid search over a limited set of hyperparameters
param_grid = {
    'feature_map_reps': [1, 2],
    'ansatz_reps': [1, 2],
    'ansatz_type': ['RealAmplitudes'], # Start with one ansatz type for simplicity
    'optimizer_name': ['COBYLA'], # Start with COBYLA
    'optimizer_params': [
        {'maxiter': 50}, # Fewer iterations for tuning speed
        {'maxiter': 100},
    ],
}

# 4. Implement a hyperparameter tuning strategy (Grid Search)
best_roc_auc = -1
best_params = None
results = []

print("\nStarting VQC Hyperparameter Tuning (Grid Search)...")

for feature_map_reps in param_grid['feature_map_reps']:
    for ansatz_reps in param_grid['ansatz_reps']:
        for ansatz_type in param_grid['ansatz_type']:
            for optimizer_name in param_grid['optimizer_name']:
                for optimizer_params in param_grid['optimizer_params']:
                    print(
                        f"\nTraining with params: "
                        f"Feature Map Reps={feature_map_reps}, "
                        f"Ansatz Reps={ansatz_reps}, "
                        f"Ansatz Type={ansatz_type}, "
                        f"Optimizer={optimizer_name}, "
                        f"Optimizer Params={optimizer_params}"
                    )

                    # 5. Train and evaluate the VQC
                    current_roc_auc = train_and_evaluate_vqc(
                        X_tune,
                        y_tune,
                        X_val,
                        y_val,
                        feature_map_reps,
                        ansatz_reps,
                        ansatz_type,
                        optimizer_name,
                        optimizer_params,
                        num_qubits=X_tune.shape[1], # Use the actual number of features after potential selection
                    )

                    # 6. Keep track of the performance metric
                    results.append({
                        'feature_map_reps': feature_map_reps,
                        'ansatz_reps': ansatz_reps,
                        'ansatz_type': ansatz_type,
                        'optimizer_name': optimizer_name,
                        'optimizer_params': optimizer_params,
                        'roc_auc': current_roc_auc,
                    })

                    print(f"Validation ROC-AUC: {current_roc_auc:.4f}")

                    # 7. Identify the best set of hyperparameters
                    if current_roc_auc > best_roc_auc:
                        best_roc_auc = current_roc_auc
                        best_params = {
                            'feature_map_reps': feature_map_reps,
                            'ansatz_reps': ansatz_reps,
                            'ansatz_type': ansatz_type,
                            'optimizer_name': optimizer_name,
                            'optimizer_params': optimizer_params,
                        }

print("\nHyperparameter Tuning Finished.")

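# 8. Print or display the best hyperparameters and the corresponding performance metric
print("\n--- Best Hyperparameters Found ---")
if best_params is None:
    # Every run returned NaN (for example, all runs raised an exception)
    print("No configuration produced a valid ROC-AUC; check the errors printed above.")
else:
    print(f"Best ROC-AUC on Validation Set: {best_roc_auc:.4f}")
    print("Best Parameters:")
    for key, value in best_params.items():
        print(f"  {key}: {value}")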

# Optionally, display all results
# print("\n--- All Results ---")
# for res in results:
#     print(res)


Shape of X_tune: (13510, 3)
Shape of y_tune: (13510,)
Shape of X_val: (5790, 3)
Shape of y_val: (5790,)

Starting VQC Hyperparameter Tuning (Grid Search)...

Training with params: Feature Map Reps=1, Ansatz Reps=1, Ansatz Type=RealAmplitudes, Optimizer=COBYLA, Optimizer Params={'maxiter': 50}


Validation ROC-AUC: 0.6015

Training with params: Feature Map Reps=1, Ansatz Reps=1, Ansatz Type=RealAmplitudes, Optimizer=COBYLA, Optimizer Params={'maxiter': 100}


Validation ROC-AUC: 0.5780

Training with params: Feature Map Reps=1, Ansatz Reps=2, Ansatz Type=RealAmplitudes, Optimizer=COBYLA, Optimizer Params={'maxiter': 50}


Validation ROC-AUC: 0.5600

Training with params: Feature Map Reps=1, Ansatz Reps=2, Ansatz Type=RealAmplitudes, Optimizer=COBYLA, Optimizer Params={'maxiter': 100}


Validation ROC-AUC: 0.5969

Training with params: Feature Map Reps=2, Ansatz Reps=1, Ansatz Type=RealAmplitudes, Optimizer=COBYLA, Optimizer Params={'maxiter': 50}


Validation ROC-AUC: 0.5724

Training with params: Feature Map Reps=2, Ansatz Reps=1, Ansatz Type=RealAmplitudes, Optimizer=COBYLA, Optimizer Params={'maxiter': 100}


Validation ROC-AUC: 0.6011

Training with params: Feature Map Reps=2, Ansatz Reps=2, Ansatz Type=RealAmplitudes, Optimizer=COBYLA, Optimizer Params={'maxiter': 50}


Validation ROC-AUC: 0.5699

Training with params: Feature Map Reps=2, Ansatz Reps=2, Ansatz Type=RealAmplitudes, Optimizer=COBYLA, Optimizer Params={'maxiter': 100}


Validation ROC-AUC: 0.5201

Hyperparameter Tuning Finished.

--- Best Hyperparameters Found ---
Best ROC-AUC on Validation Set: 0.6015
Best Parameters:
  feature_map_reps: 1
  ansatz_reps: 1
  ansatz_type: RealAmplitudes
  optimizer_name: COBYLA
  optimizer_params: {'maxiter': 50}


## Refine data subset and sampling strategy

### Subtask:
Refine the data subset and sampling strategy for training the VQC.


**Reasoning**:
Define a larger subset size, select this subset from the original stratified training data, and apply RandomOverSampler to address class imbalance, then display the shapes.
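
Random oversampling duplicates minority-class rows until every class matches the largest one. The sketch below is a standard-library illustration of that idea, not the `imblearn` implementation; `random_oversample` is a hypothetical helper, and the class counts (14,475 non-fraud, 525 fraud) are inferred from the shapes printed after the code cell, assuming the sampler's default behaviour of fully balancing the classes.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen minority rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))  # sample with replacement
            out_labels.append(cls)
    return out_rows, out_labels

# Toy subset with the inferred class counts: 14,475 + 525 = 15,000 rows
labels = [0] * 14475 + [1] * 525
rows = [[float(i)] for i in range(len(labels))]

res_rows, res_labels = random_oversample(rows, labels)
print(len(res_labels), Counter(res_labels))
```

The resampled size is twice the majority count (2 × 14,475 = 28,950), matching the printed shape. Oversampling should be applied only to training data; resampling before a train/validation split places copies of the same row on both sides.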



In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split # Import train_test_split again if not available

# 1. Define a new subset size for the training data.
# subset_tune_size = 2000 was never applied during tuning (the tuning split used 13,510
# resampled rows), so this subset is drawn fresh from the original training split.
# The size balances computational feasibility against potential performance improvement.
subset_size_refined = 15000 # Rows drawn before oversampling

# 2. Select this larger subset from the original stratified training data (X_train, y_train).
# X_train and y_train are the full stratified training sets from the earlier split.
# Use stratify=y_train to maintain the fraud ratio in the refined subset.
X_train_subset_refined, _, y_train_subset_refined, _ = train_test_split(
    X_train,
    y_train,
    train_size=subset_size_refined,
    random_state=42, # Use the same random state for reproducibility
    stratify=y_train # Stratify based on the original training set labels
)

# 3. Apply the RandomOverSampler to this new, larger training subset
# Instantiate the RandomOverSampler
ros_refined = RandomOverSampler(random_state=42)

# Apply oversampling to the refined training subset
X_resampled_refined, y_resampled_refined = ros_refined.fit_resample(
    X_train_subset_refined,
    y_train_subset_refined
)

# 4. Display the shapes of the original training subset, the refined resampled training features, and the refined resampled training labels
print("Shape of original refined X_train_subset:", X_train_subset_refined.shape)
print("Shape of original refined y_train_subset:", y_train_subset_refined.shape)
print("Shape of refined resampled X_resampled_refined:", X_resampled_refined.shape)
print("Shape of refined resampled y_resampled_refined:", y_resampled_refined.shape)


Shape of original refined X_train_subset: (15000, 3)
Shape of original refined y_train_subset: (15000,)
Shape of refined resampled X_resampled_refined: (28950, 3)
Shape of refined resampled y_resampled_refined: (28950,)


In [None]:
import joblib

# Define the filename for the saved model
model_filename = 'vqc_classifier_model.joblib'

# Save the trained VQC classifier to the file
try:
    joblib.dump(vqc_classifier, model_filename)
    print(f"VQC model successfully saved to '{model_filename}'")

    # Provide a link to download the file in Colab
    from google.colab import files
    files.download(model_filename)

except NameError:
    print("Error: vqc_classifier is not defined. Please ensure the VQC training cell was run successfully.")
except Exception as e:
    print(f"An error occurred while saving the model: {e}")

VQC model successfully saved to 'vqc_classifier_model.joblib'



In [None]:
import joblib, pickle

joblib.dump(classical_model, "/content/drive/MyDrive/classical_fraud_model.pkl")

with open("/content/drive/MyDrive/vqc_fraud_model.pkl", "wb") as f:
    pickle.dump(vqc_classifier, f)
