# Feature Selection

In this notebook, we investigate which features are most informative for predicting the target class.
We combine correlation analysis, tree‑based feature importance, statistical tests, and recursive feature elimination to identify a compact subset of variables that balances predictive power and model simplicity.


In [1]:
import pandas as pd
import seaborn as sns
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
import warnings
warnings.filterwarnings('ignore')

In [2]:
training_data_wt = pd.read_csv('Nata_Files/processed/train_dataset.csv', index_col=0)
testing_data_wt = pd.read_csv('Nata_Files/processed/test_dataset.csv', index_col=0)
prediction_data = pd.read_csv('Nata_Files/processed/pred_scaled.csv', index_col=0)
pastel_d_nata_lrn = pd.read_csv('Nata_Files/learn.csv', index_col=0)
# Keep only rows that have a quality label.
pastel_d_nata_lrn = pastel_d_nata_lrn[pastel_d_nata_lrn['quality_class'].notna()]

In [3]:
train_data = training_data_wt.drop('target', axis=1)
train_target = training_data_wt[['target']]
test_data = testing_data_wt.drop('target', axis=1)
test_target = testing_data_wt[['target']]

### Heatmap of correlations between all features and target

<p float="left">
  <img src="Nata_Files/imgs/heatmap-full-combo.png" width="36%" />
  <img src="Nata_Files/imgs/heatmaps-target.png" width="56%" />
</p>

The first heatmap shows pairwise correlations among all numeric features, including the derived `target`.
A very strong positive correlation emerges between `oven_temp`, `egg_temp`, and `final_temp`, suggesting that these three temperature‑related variables are largely redundant and that some can be removed without losing much information.

In the second heatmap, which focuses on correlations with the `target`, `egg_yolk_cnt` stands out as most strongly associated with the outcome, followed by variables such as `baking_duration` and `sugar_content`.
In contrast, the dummy‑encoded `origin` appears only weakly correlated with the target, hinting that bakery location may play a minor role in predicting quality.
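
The two readings above (redundant feature pairs and per-feature correlation with the target) can also be extracted numerically rather than read off the images. The following is a minimal sketch on synthetic data: the column names mirror the notebook, but the values and the 0.9 redundancy threshold are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training frame; values are made up for illustration.
rng = np.random.default_rng(0)
n = 500
oven_temp = rng.normal(size=n)
demo = pd.DataFrame({
    "oven_temp": oven_temp,
    "final_temp": oven_temp + rng.normal(scale=0.1, size=n),  # near-duplicate of oven_temp
    "egg_yolk_cnt": rng.normal(size=n),
    "sugar_content": rng.normal(size=n),
})
demo["target"] = (demo["egg_yolk_cnt"] + 0.3 * rng.normal(size=n) > 0).astype(int)

corr = demo.corr()

# Features ranked by absolute correlation with the target.
target_corr = corr["target"].drop("target").abs().sort_values(ascending=False)

# Feature pairs whose absolute correlation exceeds a hypothetical 0.9 threshold.
features = corr.drop(index="target", columns="target")
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
redundant_pairs = upper.stack()
redundant_pairs = redundant_pairs[redundant_pairs.abs() > 0.9]

print(target_corr)
print(redundant_pairs)
```

Masking everything except the upper triangle avoids listing each pair twice and drops the trivial self-correlations on the diagonal.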


### Decision Tree feature importance

To complement the correlation analysis, we fit a Decision Tree Classifier on the processed training set (`train_data`). In the same step, the original learning dataset is split into training and validation sets with the same proportions used elsewhere; that split feeds the origin analysis in the following sections.
The tree's `feature_importances_` give each feature's share of the total impurity reduction achieved by the splits that use it, normalized to sum to 1, so features that drive large, early splits rank highest when predicting the target.

The resulting importances confirm that `egg_yolk_cnt` and `baking_duration` are particularly influential, with `egg_temp`, `vanilla_extract`, and `sugar_content` also contributing meaningfully.
Meanwhile, `origin_dummy` receives zero importance, which is consistent with the weak correlation seen earlier. Because a single tree can ignore a feature that is still informative, we test `origin` directly before deciding whether to drop it.


In [4]:
X = pastel_d_nata_lrn.drop(['notes_baker','pastry_type', 'quality_class'], axis=1)
y = pastel_d_nata_lrn['quality_class']
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.75, random_state=15)
X_train_dtc = X_train.drop('origin', axis=1)
X_train_dtc['origin_dummy'] = train_data['origin_dummy']

In [5]:
dtc = DecisionTreeClassifier(max_leaf_nodes=112).fit(train_data, train_target)
pd.Series(dtc.feature_importances_, index=train_data.columns, name='feature importance').sort_values(ascending=False)

egg_yolk_cnt         0.330434
baking_duration      0.187521
egg_temp             0.092586
vanilla_extract      0.077023
sugar_content        0.065766
preheat_time         0.055282
salt_ratio           0.041570
oven_temp            0.032848
final_temp           0.031412
cooling_period       0.029473
ambient_humidity     0.023111
cream_fat_content    0.022414
lemon_zest_ph        0.010560
origin_dummy         0.000000
Name: feature importance, dtype: float64
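
To make the meaning of the ranking above concrete, here is a small sketch on synthetic data showing two properties of impurity-based importances: they sum to 1, and a feature the tree never splits on gets exactly 0. The feature names (`informative`, `weak`, `constant`) and the data are hypothetical and unrelated to the nata dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends mostly on 'informative' and slightly on 'weak'.
rng = np.random.default_rng(7)
n = 600
X = pd.DataFrame({
    "informative": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "constant": np.zeros(n),  # cannot be used in any split
})
y = (X["informative"] + 0.3 * X["weak"] > 0).astype(int)

tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X, y)

# Normalized impurity reduction per feature, highest first.
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

A zero importance therefore means only that this particular tree did not split on the feature, which is why a separate check on `origin` is still worthwhile.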

### Comparing approval rates by origin

Before discarding `origin`, we check whether there is a difference in approval rates between Lisbon and Porto.
After normalizing the `origin` labels, we compute the proportion of `OK` natas in each city, revealing that Lisbon bakeries have a noticeably higher approval rate than their counterparts in Porto.


In [6]:
def origin_replace(df):
    """Normalize spelling variants of the city names in `origin` (modifies df in place)."""
    origin_update = []
    for city in df['origin']:
        name = str(city).lower()
        if 'lisboa' in name or 'lisbon' in name:
            origin_update.append('Lisbon')
        elif 'porto' in name:
            origin_update.append('Porto')
        else:
            # Leave unrecognized values unchanged.
            origin_update.append(city)
    df.loc[:, 'origin'] = pd.Series(origin_update, index=df.index)
    return df

In [7]:
# origin_replace normalizes X_train['origin'] in place (the Chi-squared test relies on this);
# the copy keeps the label column out of X_train and avoids SettingWithCopy warnings.
proportion_testing = origin_replace(X_train).copy()
proportion_testing['quality_class'] = y_train.map({'OK': 1, 'KO': 0})
lisbon = proportion_testing[proportion_testing['origin'] == 'Lisbon']
porto = proportion_testing[proportion_testing['origin'] == 'Porto']
# Percentage of approved ('OK') natas per city.
lisbon_percent = 100 * lisbon['quality_class'].sum() / len(lisbon)
porto_percent = 100 * porto['quality_class'].sum() / len(porto)
f'{lisbon_percent:.2f}% of Pastel de Natas from Lisbon got the seal of approval, whereas it was only {porto_percent:.2f}% from Porto.'

'66.06% of Pastel de Natas from Lisbon got the seal of approval, whereas it was only 53.21% from Porto.'
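
The same per-city proportion can be computed in one step with a group-by. This is a sketch on toy data: the column names follow the notebook, while the rows are invented for illustration and do not reproduce the figures above.

```python
import pandas as pd

# Toy data with made-up values; only the column names follow the notebook.
demo = pd.DataFrame({
    "origin": ["Lisbon", "Lisbon", "Lisbon", "Porto", "Porto", "Porto", "Porto"],
    "quality_class": ["OK", "OK", "KO", "OK", "KO", "KO", "KO"],
})

# The mean of a boolean 'is OK' flag per city is the approval rate; scale to percent.
approval_rate = (
    demo["quality_class"].eq("OK")
    .groupby(demo["origin"])
    .mean()
    .mul(100)
)
print(approval_rate.round(2))
```

This form scales to any number of cities without a separate filter per city.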

### Chi‑squared test of independence

To formally assess whether `origin` and `quality_class` are statistically associated, we apply a Chi‑squared test of independence.
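The test compares the observed contingency table of city by quality with the counts expected under independence. A p‑value below the significance level (0.05 here) leads us to reject independence, whereas a large p‑value would indicate that the differences in approval rates are not strong enough to treat `origin` as a predictive feature. In this case the test rejects independence, so `origin` remains a candidate feature despite its low tree‑based importance.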
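
As a worked illustration of "expected counts under independence", the sketch below builds a hypothetical 2x2 table (the counts are invented, not taken from the nata data) and checks the manual formula against `scipy.stats.chi2_contingency`. Note the assumption: `correction=False` is passed here so that the statistic matches the textbook formula; with its default settings SciPy applies Yates' continuity correction to 2x2 tables.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table (rows: KO, OK; columns: Lisbon, Porto).
observed = np.array([[30, 50],
                     [70, 50]])

# Expected counts under independence: row_total * column_total / grand_total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Textbook statistic: sum over cells of (observed - expected)^2 / expected.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# correction=False disables Yates' continuity correction so both values agree.
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

print(expected_manual)
print(chi2, chi2_manual, dof, p)
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p‑value.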


In [8]:
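def TestIndependence(X, y, var, alpha=0.05):
    """Chi-squared test of independence between a categorical feature X and the label y."""
    # Observed counts: one row per class of y, one column per category of X.
    dfObserved = pd.crosstab(y, X)
    chi2, p, dof, expected = stats.chi2_contingency(dfObserved.values)
    # Reject independence when the p-value falls below the significance level.
    if p < alpha:
        result = "{0} is IMPORTANT for Prediction".format(var)
    else:
        result = "{0} is NOT an important predictor. (Discard {0} from model)".format(var)
    print(result)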

In [9]:
TestIndependence(X_train['origin'],y_train, 'origin')

origin is IMPORTANT for Prediction


### Final feature subset

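Combining the correlation analysis, tree‑based importances, statistical tests, and recursive feature elimination (RFE), we refine the feature set used for modeling.
We drop two highly redundant temperature variables (`oven_temp` and `egg_temp`) and keep `final_temp`, which RFE ranks first among the three with both a decision tree and a logistic regression estimator. We also exclude `ambient_humidity`, `preheat_time`, `cream_fat_content`, `lemon_zest_ph`, and `cooling_period`, which RFE ranks lowest and which showed limited predictive value or added unnecessary complexity. `origin_dummy` is retained, in line with the Chi‑squared result.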

The remaining features form a more compact and interpretable design matrix that retains the most relevant process and ingredient characteristics for predicting `target`.
We then align the training, testing, and prediction datasets to this reduced feature set and save them for use in the modeling phase.
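
For readers unfamiliar with RFE, the sketch below shows how `ranking_` and `support_` relate: RFE repeatedly fits the estimator and discards the weakest feature, so selected features share rank 1 and the remaining ranks record the order of elimination in reverse. The data and feature names (`signal_a`, `noise_a`, and so on) are synthetic and hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: only 'signal_a' and 'signal_b' influence the label.
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = (2.0 * X["signal_a"] + 1.0 * X["signal_b"] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Keep the two strongest features according to the logistic-regression coefficients.
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=2).fit(X, y)

# ranking_: 1 for selected features, larger values for features eliminated earlier.
ranking = pd.Series(rfe.ranking_, index=X.columns, name="rfe_ranking").sort_values()
# support_: boolean mask over the columns, True for selected features.
selected = X.columns[rfe.support_].tolist()

print(ranking)
print(selected)
```

Setting `n_features_to_select=1`, as in the cells that follow, yields a full ordering of all features, while a larger value returns the final subset directly.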


In [10]:
# Rank the three correlated temperature features with a tree-based estimator.
temp_cols = ['egg_temp', 'final_temp', 'oven_temp']
model = DecisionTreeClassifier(max_leaf_nodes=112)
rfe1 = RFE(estimator=model, n_features_to_select=1)
rfe1.fit(train_data[temp_cols], train_target.values.ravel())
pd.Series(rfe1.ranking_, index=temp_cols, name='temp ranking').sort_values(ascending=True)

final_temp    1
egg_temp      2
oven_temp     3
Name: temp ranking, dtype: int32

In [11]:
# Repeat the ranking with a linear estimator as a cross-check.
temp_cols = ['egg_temp', 'final_temp', 'oven_temp']
model = LogisticRegression()
rfe1 = RFE(estimator=model, n_features_to_select=1)
rfe1.fit(train_data[temp_cols], train_target.values.ravel())
pd.Series(rfe1.ranking_, index=temp_cols, name='temp ranking').sort_values(ascending=True)

final_temp    1
egg_temp      2
oven_temp     3
Name: temp ranking, dtype: int32

In [12]:
# Drop the two redundant temperature features, then rank the rest with the logistic-regression RFE.
cooler_train = train_data.drop(['egg_temp', 'oven_temp'], axis=1)
rfe1.fit(cooler_train, train_target.values.ravel())
pd.Series(rfe1.ranking_, index=cooler_train.columns, name='feature_ranking').sort_values(ascending=True)

egg_yolk_cnt          1
baking_duration       2
final_temp            3
sugar_content         4
vanilla_extract       5
salt_ratio            6
origin_dummy          7
preheat_time          8
cream_fat_content     9
ambient_humidity     10
lemon_zest_ph        11
cooling_period       12
Name: feature_ranking, dtype: int32

In [13]:
rfe7 = RFE(estimator = model, n_features_to_select = 7)
rfe7.fit_transform(cooler_train, train_target)
train_data_lf_nt = cooler_train.loc[:,rfe7.support_]
train_data_lf_nt

| id | baking_duration | egg_yolk_cnt | final_temp | salt_ratio | sugar_content | vanilla_extract | origin_dummy |
|---|---|---|---|---|---|---|---|
| 3710 | 0.415806 | 0.285714 | -1.297071 | 1.852718 | -0.161560 | 1.430876 | -1.0 |
| 396 | 0.654083 | 0.857143 | -1.213389 | 0.158425 | -0.078790 | 1.602032 | -1.0 |
| 4517 | 0.958058 | 0.428571 | 0.301255 | 0.166553 | 0.050139 | 0.029062 | 1.0 |
| 4103 | 1.867231 | 0.285714 | -1.347280 | 0.883008 | -0.194986 | 0.767947 | -1.0 |
| 1651 | -0.177479 | 0.428571 | 0.384937 | -0.319586 | 0.056506 | 0.179253 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3273 | 1.553116 | 0.428571 | -1.297071 | 1.105605 | -0.171110 | 1.340187 | -1.0 |
| 2717 | -1.260220 | 0.285714 | -0.184100 | -1.463543 | -0.252288 | 0.390266 | 1.0 |
| 2205 | -1.260220 | 0.285714 | -0.853556 | -0.418466 | -0.104258 | -0.475187 | 1.0 |
| 2695 | 1.211917 | 0.142857 | -0.410042 | 4.320901 | 0.031039 | 4.013433 | -1.0 |


In [14]:
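# Restrict every dataset to the selected features and re-attach the labels where they exist.
train_data_lf = train_data_lf_nt.join(train_target)
test_data_lf = test_data[train_data_lf_nt.columns].join(test_target)
# Joining train_target here would attach training labels to unrelated rows,
# so the prediction set keeps only the selected features.
pred_data_lf = prediction_data[train_data_lf_nt.columns]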

In [16]:
train_data_lf.to_csv('Nata_Files/less_features/train_data_lf.csv')
test_data_lf.to_csv('Nata_Files/less_features/test_data_lf.csv')
pred_data_lf.to_csv('Nata_Files/less_features/pred_data_lf.csv')