# Problem 2 - Automated Feature Engineering

## 2.1

**Answer:**

Interpretability and explainability matter because they build trust in machine learning models, help verify their reliability, support debugging and performance improvements, enable regulatory compliance, and allow people to make informed decisions based on model outputs.

- An interpretable model is transparent: users can see how it arrives at a prediction, which makes its output easier to trust, especially in critical applications such as healthcare, finance, and legal decisions.
- Understanding the decision process lets developers and data scientists diagnose and fix errors, biases, and overfitting, which leads to more robust and accurate models.
- Many industries are subject to regulations that require explanations for automated decisions, especially those affecting people's lives. Interpretability supports compliance and helps maintain ethical standards by exposing unfair or discriminatory outcomes.
- When the stakes are high, decision-makers need to understand the rationale behind a prediction before acting on it, so that they can weigh it against other evidence and make an informed decision.
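
As a small illustration of what "understanding how a model arrives at its conclusions" can look like, the sketch below breaks a linear model's prediction into additive per-feature contributions. The weights, intercept, and input values are made up for illustration and are not fitted to any dataset; `explain` is a hypothetical helper name.

```python
# Toy linear model: weights, intercept, and inputs are invented for illustration only.
weights = {"bmi": 2.0, "bp": 1.0, "age": -0.5}
intercept = 10.0

def explain(sample):
    """Return the prediction and each feature's additive contribution to it."""
    contributions = {name: weights[name] * value for name, value in sample.items()}
    prediction = intercept + sum(contributions.values())
    return prediction, contributions

prediction, contributions = explain({"bmi": 3.0, "bp": 1.5, "age": 4.0})
print("prediction:", prediction)  # 15.5
for name, value in contributions.items():
    print(f"  {name}: {value:+.1f}")
```

Because each contribution is just weight times input, a reader can see directly which feature pushed the prediction up or down, which is the kind of transparency the points above describe.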

## 2.2

In [3]:
from sklearn.datasets import load_diabetes
from autofeat import FeatureSelector

# Load data
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

In [6]:
import pandas as pd
fs = FeatureSelector(verbose=1)
new_X = fs.fit_transform(pd.DataFrame(X), pd.Series(y))

# Check how many features were discarded
print("Original feature count:", X.shape[1])
print("Selected feature count:", new_X.shape[1])
discarded_features = X.shape[1] - new_X.shape[1]
print("Features discarded:", discarded_features)

[featsel] Scaling data...done.
Original feature count: 10
Selected feature count: 6
Features discarded: 4
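
The counts above do not show which columns were dropped. A minimal sketch of how the discarded columns could be listed is given below. It assumes the selected data keeps the original column labels, and it compares labels as strings in case the selector converts them. `discarded_columns` is a hypothetical helper, and the example values are invented, not the actual selection result.

```python
def discarded_columns(original_cols, selected_cols):
    """Return the original columns that are missing from the selected set."""
    selected = {str(c) for c in selected_cols}
    return [c for c in original_cols if str(c) not in selected]

# Invented example: 10 original columns, 6 of them kept.
original = list(range(10))
kept = ["0", "2", "3", "6", "8", "9"]
print(discarded_columns(original, kept))  # [1, 4, 5, 7]

# In the notebook this might be called as (assuming new_X is a DataFrame):
# discarded_columns(pd.DataFrame(X).columns, new_X.columns)
```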


## 2.3

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit model
model = LinearRegression()
model.fit(X_train, y_train)

# R² score on training and test set
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
print("R^2 - training set:", r2_train)
print("R^2 - test set:", r2_test)

R^2 - training set: 0.5279193863361498
R^2 - test set: 0.4526027629719195
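
For reference, `LinearRegression.score` returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below computes it in plain Python to make the definition concrete; `r2` is an illustrative helper, and `sklearn.metrics.r2_score(y_true, y_pred)` computes the same quantity.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

print(r2([1, 2, 3], [1, 2, 3]))  # 1.0: perfect predictions
print(r2([1, 2, 3], [2, 2, 2]))  # 0.0: no better than predicting the mean
```

A score of 0.45 on the test set therefore means the model explains roughly 45% of the variance in the held-out targets.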


## 2.4

In [8]:
from autofeat import AutoFeatRegressor

# Initialize and fit AutoFeatRegressor
afreg = AutoFeatRegressor(verbose=1)
X_train_feat = afreg.fit_transform(X_train, y_train)
X_test_feat = afreg.transform(X_test)

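# Refit the linear model on the engineered features (scores are computed in the next cell).
# Note: this overwrites the baseline fit from 2.3.
model.fit(X_train_feat, y_train)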

[featsel] Scaling data...done.


In [14]:
r2_train_feat = model.score(X_train_feat, y_train)
r2_test_feat = model.score(X_test_feat, y_test)
print("R^2 - training set:", r2_train_feat)
print("R^2 - test set:", r2_test_feat)

# New features generated by autofeat.
# The first 10 columns are the original features, so everything after them is new.
X_train_feat_df = pd.DataFrame(X_train_feat)

new_features = list(X_train_feat_df.columns[10:])
print("Five new features generated:", new_features[:5])

R^2 - training set: 0.5721579622952007
R^2 - test set: 0.5053018198487318
Five new features generated: ['x000**3*x001', 'Abs(x008)/x008', 'exp(x002)*exp(x008)', 'exp(x002)*exp(x003)', 'x002*x003']


**Answer:**

With the features generated by AutoFeatRegressor, both scores improve: the training $R^2$ rises from 0.53 to 0.57 and the test $R^2$ from 0.45 to 0.51. The gap between training and test scores does not widen. It narrows slightly, from about 0.075 to about 0.067, so these numbers show no sign that the engineered features increased overfitting. That said, the result comes from a single 80/20 split of a small dataset (442 samples), so the improvement should be confirmed with cross-validation before concluding that the new features generalize.
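
One way to check how the train/test gap changed is to compute it directly from the scores printed above. The sketch below does this; the numbers are copied from the notebook output, and `generalization_gap` is an illustrative helper name.

```python
# R^2 scores copied from the notebook output above.
baseline = {"train": 0.5279193863361498, "test": 0.4526027629719195}
engineered = {"train": 0.5721579622952007, "test": 0.5053018198487318}

def generalization_gap(scores):
    """Difference between training and test R^2; larger values hint at overfitting."""
    return scores["train"] - scores["test"]

gap_baseline = generalization_gap(baseline)
gap_engineered = generalization_gap(engineered)
print(f"gap (original features):   {gap_baseline:.3f}")    # 0.075
print(f"gap (engineered features): {gap_engineered:.3f}")  # 0.067
```

Comparing the two gaps on a single split is only a rough indicator; cross-validation would give a more reliable picture.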

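The five features listed above are nonlinear transformations and combinations of the original features. For example, `x002*x003` is an interaction term, `Abs(x008)/x008` is the sign of `x008`, and `exp(x002)*exp(x008)` equals `exp(x002 + x008)`. Terms like these let a linear model capture relationships that are not linear in the raw inputs. The trade-off is that each added feature increases model complexity and is harder to interpret, and a larger feature set can fit noise in the data rather than the underlying patterns.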
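
To make the generated features easier to read, the autofeat-style names can be mapped back to the dataset's feature names. The sketch below assumes that `x000` to `x009` follow the column order of the diabetes dataset; the name list is hard-coded here and matches `load_diabetes().feature_names`. `readable` is a hypothetical helper.

```python
import math
import re

# Feature names of the scikit-learn diabetes dataset, in column order
# (hard-coded; assumed to correspond to x000 ... x009).
FEATURE_NAMES = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]

def readable(expr):
    """Replace names like x002 with the corresponding dataset feature name."""
    return re.sub(r"x(\d{3})", lambda m: FEATURE_NAMES[int(m.group(1))], expr)

generated = ["x000**3*x001", "Abs(x008)/x008", "exp(x002)*exp(x008)",
             "exp(x002)*exp(x003)", "x002*x003"]
for expr in generated:
    print(f"{expr:22s} -> {readable(expr)}")

# Two of the expressions simplify:
# Abs(v)/v is the sign of v (undefined at v == 0),
# and exp(a)*exp(b) is exp(a + b).
v, a, b = -0.04, 0.03, -0.02  # arbitrary example values
print(abs(v) / v)                                              # -1.0
print(math.isclose(math.exp(a) * math.exp(b), math.exp(a + b)))  # True
```

Read this way, `x002*x003` becomes `bmi*bp`, an interaction between body mass index and blood pressure, which is easier to reason about than the raw column indices.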
