# **Feature Selection Task: Instructions**

Welcome to the feature selection evaluation task! Below are the instructions you need to follow to complete this exercise.

---

## **1. Objective**
The goal is to find the smallest set of features that still achieves a high AUC score. You will:
- Propose a feature list (e.g., 8, 10, or 20 features).
- Train and test a model using these features.
- Measure the AUC and verify which features actually contribute to model performance.

---

## **2. Dataset**
### **Load the Dataset**
- The dataset is provided in the notebook or as a file.
- Load the dataset into a `pandas` `DataFrame`.
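
As a rough illustration of the loading step, the sketch below reads a CSV and separates the features from the target. It assumes the data is a CSV with a column named `target` (the name used in the function docstrings further down); the helper `load_features_and_target` and the in-memory demo data are hypothetical stand-ins, not part of the task.

```python
import io

import pandas as pd


def load_features_and_target(source, target_col="target"):
    """Read a CSV and return (X, y). `target_col` is an assumed column name."""
    data = pd.read_csv(source)
    if target_col not in data.columns:
        raise KeyError(f"Expected a '{target_col}' column in the dataset")
    X = data.drop(columns=[target_col])
    y = data[target_col]
    return X, y


# Tiny in-memory stand-in for the real file; replace with the actual path.
demo_csv = io.StringIO("f1,f2,target\n1.0,0.5,0\n2.0,0.1,1\n3.0,0.9,0\n")
X_demo, y_demo = load_features_and_target(demo_csv)
print(X_demo.shape, y_demo.tolist())
```

`pd.read_csv` accepts either a file path or a file-like object, so the same helper should work once the real file location is known.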

---

## **3. Steps to Complete the Task**

### **Step 1: Split the Data**
1. Split the dataset into **training** and **testing** sets.
   - Suggested ratio: 80% training, 20% testing.
   - Use `train_test_split` from `sklearn.model_selection` with a fixed `random_state` so the split is random but reproducible.
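
A minimal sketch of such a split is shown here on synthetic data. The seed value `42` and the use of `stratify=y` (which keeps the class balance similar in both sets) are illustrative choices, not requirements of the task.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 4 features, binary target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=[f"f{i}" for i in range(4)])
y = pd.Series(rng.integers(0, 2, size=100), name="target")

# 80/20 split; a fixed random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))
```

Re-running the cell with the same `random_state` should give the same rows in each set, which makes AUC comparisons between feature lists fair.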

### **Step 2: Select Features**
2. Decide on a list of features you think are important:
   - **Example**: Start with 8, 10, or 20 features.
   - Select features using any method you prefer (e.g., domain knowledge, correlation, or feature importance from a model such as Random Forest).
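
One possible way to rank features is sketched below, using the impurity-based importances of a Random Forest fitted on the training data only. The helper name `top_n_by_forest_importance`, the hyperparameters, and the synthetic data are assumptions made for illustration; any other ranking method would fit the task equally well.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def top_n_by_forest_importance(X_train, y_train, n_features, random_state=0):
    """Return the names of the n_features columns with the highest importance."""
    forest = RandomForestClassifier(n_estimators=200, random_state=random_state)
    forest.fit(X_train, y_train)
    importances = pd.Series(forest.feature_importances_, index=X_train.columns)
    ranked = importances.sort_values(ascending=False)
    return ranked.head(n_features).index.tolist()


# Synthetic stand-in for the training set: 10 features, 3 of them informative.
values, labels = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X_train = pd.DataFrame(values, columns=[f"f{i}" for i in range(10)])
y_train = pd.Series(labels, name="target")

selected = top_n_by_forest_importance(X_train, y_train, n_features=3)
print(selected)
```

Ranking on the training set alone keeps the test set untouched, so the final AUC remains an honest estimate.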

### **Step 3: Train a Model**
3. Train a model using your chosen features:
   - Use any binary classification model you prefer (e.g., Logistic Regression, Random Forest, or Decision Tree).
   - Ensure the model is trained only on the **training set**.

### **Step 4: Evaluate**
4. Evaluate the model on the **test set**:
   - Measure the **AUC score** (area under the ROC curve).
   - Identify which features significantly affect the AUC score.
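
Steps 3 and 4 can be sketched together as follows. This is only an illustration on synthetic data: the Logistic Regression model and the feature list `selected` are hypothetical choices. Note that `roc_auc_score` is given the predicted probability of the positive class rather than hard 0/1 predictions, since AUC is computed from how the scores rank the samples.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real task would use the provided dataset.
values, labels = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(values, columns=[f"f{i}" for i in range(6)])
y = pd.Series(labels, name="target")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

selected = ["f0", "f1", "f2"]  # hypothetical feature list

# Fit on the training set only, restricted to the selected columns.
model = LogisticRegression(max_iter=1000)
model.fit(X_train[selected], y_train)

# Column 1 of predict_proba holds the probability of the positive class.
positive_scores = model.predict_proba(X_test[selected])[:, 1]
auc = roc_auc_score(y_test, positive_scores)
print(f"Test AUC with {len(selected)} features: {auc:.3f}")
```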

---

## **4. Criteria for Evaluation**
- Aim to **minimize the number of features** while maintaining a high AUC score.
- **Do not select features arbitrarily**:
  - If your initial feature list includes features that have no significant impact, you are expected to refine the selection.
  - Keep features only if they improve or maintain the AUC score.
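
The refinement described above could look roughly like the backward-elimination sketch below. Several things here are assumptions rather than task requirements: the helper names `cv_auc` and `prune_features`, the `tolerance` of 0.005, and the use of cross-validated AUC on the training data to decide what to drop (so that the test set is consulted only for the final score).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def cv_auc(X_train, y_train, columns):
    """Mean 5-fold cross-validated ROC AUC on the training data only."""
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, X_train[columns], y_train, cv=5, scoring="roc_auc")
    return scores.mean()


def prune_features(X_train, y_train, features, tolerance=0.005):
    """Drop features one at a time while AUC stays within `tolerance` of the baseline."""
    kept = list(features)
    baseline = cv_auc(X_train, y_train, kept)
    # Try the last-listed (assumed least important) features first.
    for candidate in reversed(features):
        if len(kept) == 1:
            break
        trial = [name for name in kept if name != candidate]
        if cv_auc(X_train, y_train, trial) >= baseline - tolerance:
            kept = trial  # removing `candidate` did not hurt, so leave it out
    return kept, baseline, cv_auc(X_train, y_train, kept)


# Synthetic stand-in for the training set: 8 features, 3 of them informative.
values, labels = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_train = pd.DataFrame(values, columns=[f"f{i}" for i in range(8)])
y_train = pd.Series(labels, name="target")

kept, baseline_auc, pruned_auc = prune_features(X_train, y_train, list(X_train.columns))
print(f"{len(kept)} features kept; CV AUC {baseline_auc:.3f} -> {pruned_auc:.3f}")
```

Comparing every trial against the original baseline, rather than against the previous step, bounds the total AUC loss by `tolerance` no matter how many features are removed.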

---

## **5. Recommendations**
- Use any method to calculate feature importance (e.g., Random Forest feature importance or SHAP values).
- Document your process clearly:
  - List the selected features.
  - Show the AUC scores before and after feature refinement.
  - Explain why specific features were kept or removed.

---

## **6. Deliverables**
By the end of the task, you should submit:
1. A final AUC score on the **test set**.
2. A list of selected features.
3. A brief explanation of your process and decisions.

---

### **Notes**
- The dataset consists of feature columns (**X**) and a `target` column (**y**).
- Aim for an initial selection of **8-10 features**, but you can experiment with more or fewer as you refine the list.

In [2]:
import pandas as pd
import numpy as np
from typing import List, Tuple, Dict

In [None]:
def load_and_prepare_data(data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Load and prepare the dataset for feature selection.
    You should:
    1. Separate features (X) and target variable (y)
    2. Handle any missing values if present
    3. Perform any necessary data type conversions

    Parameters:
    data (pd.DataFrame): Input dataset with 'target' column

    Returns:
    Tuple[pd.DataFrame, pd.Series]: Features (X) and target variable (y)
    """
    # YOUR CODE HERE
    pass


In [None]:
def split_dataset(X: pd.DataFrame, y: pd.Series) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """
    Split the dataset into training and testing sets (80% train, 20% test).

    Parameters:
    X (pd.DataFrame): Feature matrix
    y (pd.Series): Target variable

    Returns:
    Tuple containing (X_train, X_test, y_train, y_test)
    """
    # YOUR CODE HERE
    pass

In [None]:
def select_features(X_train: pd.DataFrame, y_train: pd.Series, n_features: int) -> List[str]:
    """
    Select the top N most important features using any method you prefer.
    You should:
    1. Calculate feature importance using any method
    2. Select the top n_features based on importance
    3. Return the list of selected feature names

    Parameters:
    X_train (pd.DataFrame): Training features
    y_train (pd.Series): Training target
    n_features (int): Number of features to select

    Returns:
    List[str]: List of selected feature names
    """
    # YOUR CODE HERE
    pass

In [None]:
def train_and_evaluate(X_train: pd.DataFrame,
                      X_test: pd.DataFrame,
                      y_train: pd.Series,
                      y_test: pd.Series,
                      selected_features: List[str]) -> float:
    """
    Train a model using selected features and evaluate its performance.
    You should:
    1. Train any classification model of your choice
    2. Make predictions on test set
    3. Calculate and return the AUC score

    Parameters:
    X_train (pd.DataFrame): Training features
    X_test (pd.DataFrame): Testing features
    y_train (pd.Series): Training target
    y_test (pd.Series): Testing target
    selected_features (List[str]): List of features to use

    Returns:
    float: AUC score on test set
    """
    # YOUR CODE HERE
    pass

In [None]:
def feature_selection_pipeline(data: pd.DataFrame, n_features: int) -> Tuple[List[str], float]:
    """
    Complete feature selection pipeline that combines all steps.
    This function should:
    1. Load and prepare the data
    2. Split the dataset
    3. Select the best features
    4. Train and evaluate the model
    5. Return selected features and final AUC score

    Parameters:
    data (pd.DataFrame): Input dataset
    n_features (int): Number of features to select

    Returns:
    Tuple[List[str], float]: Selected feature names and final AUC score
    """
    # YOUR CODE HERE
    return selected_features, auc_score

In [None]:
def submit(selected_features: List[str]) -> None:
    """Write the selected feature names to 'submission.csv'."""
    with open('submission.csv', 'w') as f:
        f.write(f"Selected Features: {selected_features}\n")