
## Instructions

Review the dataset information here: [Wine Information Dataset on Kaggle](https://www.kaggle.com/datasets/dev7halo/wine-information)

### Objectives

1. **Regression Task**
   Use the dataset variables to predict wine price. Re-implement the model architecture and results you created last week.

2. **Classification Task**
   Use the dataset variables to classify the nation of origin.

### Requirements

#### Neural Network Implementation

* Re-implement your previous neural networks using Keras.
* For each model:

  * Print a model summary or include a model plot.
  * Print model performance metrics using a train-test split.

#### Additional Exploration

* Explore at least three different Keras function input settings not used in your previous implementation.
* Provide commentary on what you discover about these settings and how they affect the model.

#### Evaluation

* Report at least three different model performance metrics.
* Construct a confusion matrix for each neural network model. You may use libraries such as `pandas`, `numpy`, or `scikit-learn`.

#### Feature Constraints

* Do not use any variables that explicitly identify the nation when predicting nation.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import keras
from keras import layers
from sklearn import preprocessing
from keras import regularizers
from itertools import product
import random

import os
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures, LabelEncoder
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score, confusion_matrix, precision_score, \
    recall_score, f1_score, roc_auc_score, roc_curve, cohen_kappa_score, make_scorer, mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingRegressor, StackingRegressor, StackingClassifier
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import root_mean_squared_error

def set_seeds(seed=123):
    np.random.seed(seed)
    random.seed(seed)
    tf.random.set_seed(seed)

In [2]:
# Read in the dataset
wine_df = pd.read_csv(
    'Data/cleansingWine.csv', low_memory=False
).drop(columns=['Unnamed: 0'])


# Display the first few rows to get a sense of the structure
wine_df.head()

Unnamed: 0,id,name,producer,nation,local1,local2,local3,local4,varieties1,varieties2,...,use,abv,degree,sweet,acidity,body,tannin,price,year,ml
0,137197,Altair,Altair,Chile,Rapel Valley,,,,Cabernet Sauvignon,Carmenere,...,Table,14~15,17~19,SWEET1,ACIDITY4,BODY5,TANNIN4,220000,2014,750
1,137198,"Altair, Sideral",Altair,Chile,Rapel Valley,,,,Cabernet Sauvignon,Merlot,...,Table,14~15,16~18,SWEET1,ACIDITY3,BODY4,TANNIN4,110000,2016,750
2,137199,Baron du Val Red,Baron du Val,France,,,,,Carignan,Cinsault,...,Table,11~12,15~17,SWEET2,ACIDITY3,BODY2,TANNIN2,0,0,750
3,137200,Baron du Val White,Baron du Val,France,,,,,Carignan,Ugni blanc,...,Table,11~12,9~11,SWEET1,ACIDITY3,BODY2,TANNIN1,0,0,750
4,137201,"Benziger, Cabernet Sauvignon",Benziger,USA,California,,,,Cabernet Sauvignon,,...,Table,13~14,17~19,SWEET1,ACIDITY3,BODY3,TANNIN4,0,2003,750


## Preparing Feature Sets

Our first goal is to predict the price of a wine based on a subset of features from the dataset.

To do this, we will:
- Build several **Neural Networks** with different settings to test how changes in architecture and hyperparameters affect performance. We already used Keras last week, so this week we build on our three most successful models from that work.

Our target variable is **`price`**.

#### Feature Selection

For simplicity and clarity, we focus on the following features for regression:

- `producer`
- `type`
- `use`
- `abv` (Alcohol by Volume)
- `sweet` (Sweetness level)
- `acidity` (Acidity level)
- `body` (Body level)
- `tannin` (Tannin level)
- `year` (Vintage year)
- `local1` (Local region)
- `varieties1` (Grape variety)

These features were chosen because they are intuitively related to wine pricing and were relatively clean after preprocessing. Adding `local1` and `varieties1` helped capture more variation in wine characteristics, leading to improved model performance.

In [3]:
from sklearn.model_selection import train_test_split

# Select only the columns of interest
features = ['producer', 'local1', 'varieties1', 'type', 'use', 'abv', 'sweet', 'acidity', 'body', 'tannin', 'year']
target = 'price'

# Make a copy of the working data
model_data = wine_df[features + [target]].copy()

# Drop any rows with missing values
model_data = model_data.dropna()

# Keep only rows where price is greater than 0
model_data = model_data[model_data['price'] > 0]


# Convert features to appropriate numeric types
def clean_range(value):
    """ Helper function to clean values like '14~15' into an average """
    if isinstance(value, str) and '~' in value:
        low, high = value.split('~')
        return (float(low) + float(high)) / 2
    try:
        return float(value)
    except (TypeError, ValueError):
        # Non-numeric or missing input becomes None and is dropped later
        return None


for col in ['abv', 'year']:
    model_data[col] = model_data[col].apply(clean_range)


# Convert categorical columns like 'sweet', 'acidity', 'body', 'tannin'
# These are text codes like 'SWEET1', so we extract the number
def extract_number(value):
    """ Helper to pull numbers out of text labels; returns None if no digits are found """
    if isinstance(value, str):
        digits = ''.join(filter(str.isdigit, value))
        return int(digits) if digits else None
    return None


for col in ['sweet', 'acidity', 'body', 'tannin']:
    model_data[col] = model_data[col].apply(extract_number)

# Drop again any rows with missing values after cleaning
model_data = model_data.dropna()

# Separate X and y
X = model_data[features]
y = model_data[target]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Preprocessing: scale numeric features, one-hot encode categoricals
categorical_features = ['producer', 'local1', 'varieties1', 'type', 'use']
numeric_features = ['abv', 'sweet', 'acidity', 'body', 'tannin', 'year']


preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False), categorical_features)
    ]
)

# Fit the preprocessor
X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep = preprocessor.transform(X_test)

# Number of input features after preprocessing (used by the model builders)
input_shape = X_train_prep.shape[1]



## Regression Modeling

#### Model Architectures

**Model 3B – Deeper Network with Bias Initialization**

This model builds on a basic feedforward structure and adds depth through **two hidden layers**:

* The **first layer** has 32 neurons and uses the **ReLU activation function**, which helps the network handle non-linear patterns in the data.
* The **second layer** has 16 neurons, also with ReLU, allowing the network to further refine learned patterns.
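* A **bias initializer (`he_normal`)** draws the initial bias values from a scaled truncated normal distribution instead of the Keras default of zeros. He initialization is normally applied to the kernel weights of ReLU layers, so applying it to the biases is one of the non-default settings explored here.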
* The **final layer** outputs a single numeric value (wine price), as this is a regression task.

**Model 5B – Dropout Regularization and L2 Penalty**

This architecture is designed to prevent **overfitting** by regularizing the model in two ways:

* The **first layer** has 32 neurons and applies **L2 regularization** (`kernel_regularizer`). This discourages overly large weights by adding a penalty to the loss function.
* A **Dropout layer** randomly disables 30% of neurons during each training step, forcing the network to generalize rather than memorize.
* The **second hidden layer** (16 neurons, ReLU) processes the information passed through the dropout.
* The final output layer again predicts wine price.
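
To make the two regularizers in Model 5B concrete, here is a minimal NumPy sketch of what they compute. It is an illustration rather than the Keras implementation: it assumes the default factor of 0.01 used by `regularizers.L2()` and the usual "inverted" dropout, in which surviving activations are scaled by `1 / (1 - rate)` during training. The weight matrix and activations are made up.

```python
import numpy as np

def l2_penalty(weights, l2=0.01):
    """Penalty added to the loss: l2 * sum of squared weights."""
    return l2 * float(np.sum(np.square(weights)))

def inverted_dropout(activations, rate, rng):
    """Zero out roughly `rate` of the units and rescale the rest (training mode only)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(123)

# Made-up 2x2 weight matrix: 0.01 * (0.25 + 1.0 + 4.0 + 0.0)
weights = np.array([[0.5, -1.0], [2.0, 0.0]])
penalty = l2_penalty(weights)
print(penalty)

# Made-up activations of a 32-unit layer for 1000 samples
activations = np.ones((1000, 32))
dropped = inverted_dropout(activations, rate=0.3, rng=rng)

# About 30% of entries are zero; the rescaling keeps the mean close to 1
print((dropped == 0).mean(), dropped.mean())
```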

**Model 6B – Batch Normalization**

This architecture introduces **Batch Normalization**, which stabilizes and accelerates training:

* The **first hidden layer** has 64 neurons with ReLU and **zero-initialized biases**, followed by a **Batch Normalization layer**. This adjusts layer outputs so they have consistent scale and distribution, which helps the network train more reliably.
* The **second hidden layer** has 32 neurons and applies ReLU again.
* The **final layer** outputs a single numeric value (wine price).
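
The parameter counts that `model.summary()` reports for these architectures can be checked by hand. The sketch below assumes a hypothetical input width of 100 features; the real width is `input_shape`, which depends on how many one-hot columns the preprocessor produces. A dense layer has one weight per input plus one bias per unit, and a batch normalization layer stores four values per feature (two trainable, two moving statistics).

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return (n_inputs + 1) * n_units

def batchnorm_params(n_features):
    """gamma and beta (trainable) plus moving mean and variance (non-trainable)."""
    return 4 * n_features

n_features = 100  # hypothetical stand-in for input_shape

# Model 3B: 32 -> 16 -> 1 (Model 5B has the same count; Dropout adds no parameters)
params_3b = dense_params(n_features, 32) + dense_params(32, 16) + dense_params(16, 1)

# Model 6B: 64 -> BatchNorm -> 32 -> 1
params_6b = (dense_params(n_features, 64) + batchnorm_params(64)
             + dense_params(64, 32) + dense_params(32, 1))

print(params_3b, params_6b)
```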

In [4]:
# Model Architectures
def build_model_3B():
    return keras.Sequential([
        layers.Input(shape=(input_shape,)),
        layers.Dense(32, activation='relu', bias_initializer='he_normal'),
        layers.Dense(16, activation='relu', bias_initializer='he_normal'),
        layers.Dense(1)
    ])

def build_model_5B():
    return keras.Sequential([
        layers.Input(shape=(input_shape,)),
        layers.Dense(32, activation='relu', kernel_regularizer = regularizers.L2()),
        layers.Dropout(0.3),
        layers.Dense(16, activation='relu'),
        layers.Dense(1)
    ])

def build_model_6B():
    return keras.Sequential([
        layers.Input(shape=(input_shape,)),
        layers.Dense(64, activation='relu', bias_initializer='zeros'),
        layers.BatchNormalization(),
        layers.Dense(32, activation='relu'),
        layers.Dense(1)
    ])

In [5]:
# Build the models
model_3B = build_model_3B()
model_3B.summary()

In [6]:
model_5B = build_model_5B()
model_5B.summary()

In [7]:
model_6B = build_model_6B()
model_6B.summary()

#### Training Function:

This function handles the full lifecycle of training a neural network model, evaluating it, and printing key performance results.

* **Model compilation:** The model is compiled with a specified optimizer, mean squared error (`mse`) as the loss function (appropriate for regression), and tracks two additional metrics:

  * Mean Absolute Error (`mae`)
  * Mean Absolute Percentage Error (`mape`)
* **Training process:** The model is trained on the training set using the given number of epochs and batch size.
  A portion (20%) of the training data is reserved as a **validation set**, used to monitor performance during training.
* **Prediction and evaluation:** After training, the model makes predictions on the test set, and we compute:

  * **MSE (Mean Squared Error):** Measures the average squared difference between predictions and true values.
  * **MAE (Mean Absolute Error):** Gives a straightforward interpretation of average error in original units.
  * **R² (Coefficient of Determination):** Measures how well the model explains the variation in wine prices.
* **Reporting:** Results are printed to the console, making it easy to compare different configurations later.
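
As a reference for the three test metrics, the following sketch computes them directly with NumPy on four made-up prices. It mirrors the standard definitions behind `mean_squared_error`, `mean_absolute_error`, and `r2_score`; the numbers are illustrative only.

```python
import numpy as np

# Made-up true and predicted prices
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 380.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)     # average squared error
mae = np.mean(np.abs(errors))  # average absolute error, in price units

# R² compares the model's squared error with that of always predicting the mean
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, mae, r2)
```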

#### Grid Search Function:

This function automates the process of testing multiple neural network configurations by looping over combinations of hyperparameters:

* **Inputs:**

  * `model_architecture_fn`: A function that builds a model (e.g., `build_model_3B`)
  * Lists of possible values for:
    
    * Epochs
    * Batch sizes
    * Learning rates
    * Optimizers (`'adam'`, `'rmsprop'`)
* **Execution:**

  * For each combination of hyperparameters, it:

    * Rebuilds a fresh model
    * Initializes the chosen optimizer with the given learning rate
    * Trains and evaluates the model using `build_and_train()`
* **Output:**

  * Results for each configuration (including MSE, MAE, and R²) are saved to a table (`pandas DataFrame`) for easy comparison and selection of the best-performing model.
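
The size and order of the grid follow directly from `itertools.product`. This small sketch uses the same value lists as the regression runs and shows that each architecture is trained eight times, with the optimizer varying fastest and the batch size slowest.

```python
from itertools import product

epochs_list = [100]
batch_sizes_list = [32, 64]
learning_rates_list = [0.001, 0.0005]
optimizer_name_list = ['adam', 'rmsprop']

# Same argument order as in grid_search_nn: the last list varies fastest
configs = list(product(epochs_list, batch_sizes_list,
                       learning_rates_list, optimizer_name_list))

print(len(configs))
for epochs, batch_size, lr, opt_name in configs:
    print(epochs, batch_size, lr, opt_name)
```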

In [8]:
def build_and_train(model, model_name, optimizer='adam', epochs=50, batch_size=32):
    model.compile(optimizer=optimizer, loss='mse', metrics=['mae', 'mape'])
    history = model.fit(
        X_train_prep, y_train,
        epochs=epochs,
        batch_size=batch_size,
        validation_split=0.2,
        verbose=0
    )
    y_pred = model.predict(X_test_prep).flatten()
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{model_name} | Epochs: {epochs} | Batch Size: {batch_size}")
    print(f"Test MSE: {mse:.2f}")
    print(f"Test MAE: {mae:.2f}")
    print(f"Test R²: {r2:.3f}")
    print("-" * 40)
    return mse, mae, r2

def grid_search_nn(model_architecture_fn, model_name, epochs_list, batch_sizes_list, learning_rates_list=None, optimizer_name_list=None):
    set_seeds(123)
    results = []
    if learning_rates_list is None:
        learning_rates_list = [0.001]
    if optimizer_name_list is None:
        optimizer_name_list = ['adam']

    for epochs, batch_size, lr, opt_name in product(epochs_list, batch_sizes_list, learning_rates_list, optimizer_name_list):
        model = model_architecture_fn()
        if opt_name == 'adam':
            optimizer = keras.optimizers.Adam(learning_rate=lr)
        elif opt_name == 'rmsprop':
            optimizer = keras.optimizers.RMSprop(learning_rate=lr)
        else:
            raise ValueError(f"Unsupported optimizer: {opt_name}")

        mse, mae, r2 = build_and_train(
            model,
            model_name=f"{model_name} (opt={opt_name}, epochs={epochs}, batch={batch_size}, lr={lr})",
            optimizer=optimizer,
            epochs=epochs,
            batch_size=batch_size
        )
        results.append({
            "Model": model_name,
            "Optimizer": opt_name,
            "Epochs": epochs,
            "Batch Size": batch_size,
            "Learning Rate": lr,
            "MSE": mse,
            "MAE": mae,
            "R2": r2
        })

    return pd.DataFrame(results)

In [9]:
epochs_list = [100]
batch_sizes_list = [32, 64]
learning_rates_list = [0.001, 0.0005]
optimizer_name_list = ['adam', 'rmsprop']

In [10]:
results_3B = grid_search_nn(
    model_architecture_fn=build_model_3B,
    model_name="Model 3B: Deeper",
    epochs_list=epochs_list,
    batch_sizes_list=batch_sizes_list,
    learning_rates_list=learning_rates_list,
    optimizer_name_list=optimizer_name_list
)

50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 694us/step
Model 3B: Deeper (opt=adam, epochs=100, batch=32, lr=0.001) | Epochs: 100 | Batch Size: 32
Test MSE: 46935855104.00
Test MAE: 88776.02
Test R²: 0.236
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 868us/step
Model 3B: Deeper (opt=rmsprop, epochs=100, batch=32, lr=0.001) | Epochs: 100 | Batch Size: 32
Test MSE: 49585274880.00
Test MAE: 90663.42
Test R²: 0.193
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 592us/step
Model 3B: Deeper (opt=adam, epochs=100, batch=32, lr=0.0005) | Epochs: 100 | Batch Size: 32
Test MSE: 52446396416.00
Test MAE: 97487.82
Test R²: 0.147
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 643us/step
Model 3B: Deeper (opt=rmsprop, epochs=100, batch=32, lr=0.0005) | Epochs: 100 | Batch Size: 32
Test MSE: 53966786560.00
Tes

In [11]:
results_5B = grid_search_nn(
    model_architecture_fn=build_model_5B,
    model_name="Model 5B: Dropout",
    epochs_list=epochs_list,
    batch_sizes_list=batch_sizes_list,
    learning_rates_list=learning_rates_list,
    optimizer_name_list=optimizer_name_list
)

50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 694us/step
Model 5B: Dropout (opt=adam, epochs=100, batch=32, lr=0.001) | Epochs: 100 | Batch Size: 32
Test MSE: 47509233664.00
Test MAE: 87167.94
Test R²: 0.227
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 653us/step
Model 5B: Dropout (opt=rmsprop, epochs=100, batch=32, lr=0.001) | Epochs: 100 | Batch Size: 32
Test MSE: 49630380032.00
Test MAE: 89229.25
Test R²: 0.192
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 653us/step
Model 5B: Dropout (opt=adam, epochs=100, batch=32, lr=0.0005) | Epochs: 100 | Batch Size: 32
Test MSE: 53400096768.00
Test MAE: 97704.76
Test R²: 0.131
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 653us/step
Model 5B: Dropout (opt=rmsprop, epochs=100, batch=32, lr=0.0005) | Epochs: 100 | Batch Size: 32
Test MSE: 55029374976.00

In [12]:
results_6B = grid_search_nn(
    model_architecture_fn=build_model_6B,
    model_name="Model 6B: BatchNorm",
    epochs_list=epochs_list,
    batch_sizes_list=batch_sizes_list,
    learning_rates_list=learning_rates_list,
    optimizer_name_list=optimizer_name_list
)

50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 6B: BatchNorm (opt=adam, epochs=100, batch=32, lr=0.001) | Epochs: 100 | Batch Size: 32
Test MSE: 45039837184.00
Test MAE: 87428.64
Test R²: 0.267
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 714us/step
Model 6B: BatchNorm (opt=rmsprop, epochs=100, batch=32, lr=0.001) | Epochs: 100 | Batch Size: 32
Test MSE: 44283944960.00
Test MAE: 86290.17
Test R²: 0.279
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 735us/step
Model 6B: BatchNorm (opt=adam, epochs=100, batch=32, lr=0.0005) | Epochs: 100 | Batch Size: 32
Test MSE: 44050755584.00
Test MAE: 93987.96
Test R²: 0.283
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 694us/step
Model 6B: BatchNorm (opt=rmsprop, epochs=100, batch=32, lr=0.0005) | Epochs: 100 | Batch Size: 32
Test MSE: 4350439

### Overall Observations

| Model  | Best R²   | Optimizer | LR     | Batch Size |
| ------ | --------- | --------- | ------ | ---------- |
| **3B** | 0.236     | adam      | 0.001  | 32         |
| **5B** | 0.227     | adam      | 0.001  | 32         |
| **6B** | **0.292** | rmsprop   | 0.0005 | 32         |

### **Model 3B: Deeper**

* Performs best with **Adam + 0.001 LR + 32 batch** (R² = 0.236).
* Performance declines with **lower LR** (R² drops to 0.081 with 0.0005 and batch=64).
* **RMSprop** consistently underperforms Adam here.

**Conclusion**: This model is sensitive to learning rate and benefits from a moderately small batch size and a stable optimizer (Adam).

### **Model 5B: Dropout Regularization**

* Performs similarly to Model 3B with **Adam + 0.001 LR + 32 batch** (R² = 0.227).
* Dropout and the L2 penalty are meant to reduce overfitting, but in this grid they come with a slightly lower best R² than Model 3B (0.227 versus 0.236).
* **Performance degrades with larger batch sizes** and lower learning rates.

**Conclusion**: Dropout and L2 regularization did not beat the unregularized Model 3B, which has the same depth, under these settings.

### **Model 6B: Batch Normalization**

* Consistently **better R² across the board**, peaking at **0.292 with RMSprop + 0.0005 + 32 batch**.
* **Small batch sizes (32)** and **lower learning rate (0.0005)** yielded the most stable results.
* Both **Adam** and **RMSprop** performed well, but **RMSprop edged ahead slightly**.

**Conclusion**: Batch normalization stabilizes training and boosts performance. Model 6B is the best overall model under current settings.

### Key Takeaways

* **BatchNorm (6B) > plain deeper network (3B) > Dropout (5B)** by best R², at least in this regression task.
* The best optimizer and learning-rate combinations were **Adam at 0.001** (Models 3B and 5B) and **RMSprop at 0.0005** (Model 6B).
* **Batch size of 32** consistently yields better generalization than 64.
* The highest **R² = 0.292** indicates the model explains ~29% of the variance in price. That is a modest fit, but a reasonable result given the noisy, high-cardinality features in our wine data.
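
Because MSE is in squared price units, it is easier to judge alongside its square root (RMSE), which is in the same units as `price`. The sketch below converts one of the printed results, the Model 6B run with Adam at a learning rate of 0.0005, and compares it with the MAE printed for the same run.

```python
import math

# Values copied from the printed output for Model 6B (adam, lr=0.0005, batch=32)
test_mse = 44050755584.00
test_mae = 93987.96

# RMSE is the square root of MSE and shares the units of the target
test_rmse = math.sqrt(test_mse)
print(f"RMSE: {test_rmse:,.0f}")

# An RMSE well above the MAE suggests a few large errors dominate the squared loss
print(f"RMSE / MAE: {test_rmse / test_mae:.2f}")
```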



## Classification Modeling

### Feature Selection - Cannot Carry National/Regional Hints

In [13]:
# How many Nations are there to Predict?
wine_df.nation.value_counts().shape[0]

31

In [14]:
from sklearn.preprocessing import LabelEncoder
# Select only the columns of interest
features = ['name', 'producer', 'varieties1', 'type', 'use', 'abv', 'sweet', 'acidity', 'body', 'tannin', 'year']
target = 'nation'

# Make a copy of the working data
model_data = wine_df[features + [target]].copy()

# Drop any rows with missing values
model_data = model_data.dropna()

# Keep only rows with a known nation (already guaranteed by dropna above; kept as an explicit guard)
model_data = model_data[model_data['nation'].notna()]

# Clean columns with ranges like '14~15'
def clean_range(value):
    if isinstance(value, str) and '~' in value:
        low, high = value.split('~')
        return (float(low) + float(high)) / 2
    try:
        return float(value)
    except (TypeError, ValueError):
        # Non-numeric or missing input becomes None and is dropped later
        return None

for col in ['abv', 'year']:
    model_data[col] = model_data[col].apply(clean_range)

# Convert coded categorical levels to integers
def extract_number(value):
    # Returns None when the label contains no digits
    if isinstance(value, str):
        digits = ''.join(filter(str.isdigit, value))
        return int(digits) if digits else None
    return None

for col in ['sweet', 'acidity', 'body', 'tannin']:
    model_data[col] = model_data[col].apply(extract_number)

# Drop again any rows with missing values after cleaning
model_data = model_data.dropna()

# Separate features and label
X = model_data[features]
y = model_data[target]

# Encode target variable (nation) as integers
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)  # Needed for sparse_categorical_crossentropy

# Save class names (for inverse mapping later)
class_names = label_encoder.classes_

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=123)

# Preprocessing pipeline
categorical_features = ['name', 'producer', 'varieties1', 'type', 'use']
numeric_features = ['abv', 'sweet', 'acidity', 'body', 'tannin', 'year']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False), categorical_features)
    ]
)

# Transform the data
X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep = preprocessor.transform(X_test)

# Define input shape
input_shape = X_train_prep.shape[1]



### Updating Model Architectures for Classifying Nation

Each of the following models is a reimplementation of its regression counterpart, modified to classify wines into one of **31 possible countries of origin**. The key architectural change is the use of a **softmax activation function in the final layer**, which outputs a probability distribution across the 31 classes.
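
To show what the softmax head and the `sparse_categorical_crossentropy` loss compute, here is a small NumPy sketch with three classes instead of 31. The logits and labels are made up; the point is that each row of probabilities sums to 1, the predicted class is the argmax, and the loss is the negative log-probability assigned to the integer-encoded true class.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up pre-activation outputs for two wines and three classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 3.0]])
true_labels = np.array([0, 2])  # integer class IDs, as produced by LabelEncoder

probs = softmax(logits)
predicted = probs.argmax(axis=1)

# Sparse categorical cross-entropy: -log of the probability of the true class
losses = -np.log(probs[np.arange(len(true_labels)), true_labels])
mean_loss = losses.mean()

print(probs.round(3), predicted, mean_loss)
```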

**Model 3B: Deeper Network with Bias Initialization**

This model uses a two-layer fully connected architecture to learn patterns in the wine data:

* The first hidden layer has **32 neurons** with the **ReLU** activation function and **He-normal bias initialization** in place of the Keras default of zero-initialized biases.
* The second hidden layer has **16 neurons**, also with ReLU and He-normal bias initialization.
* The final layer has **31 neurons** with a **softmax** activation function, producing a probability for each possible nation.

**Model 5B: Dropout Regularization with L2 Penalty**

This model adds regularization to help prevent overfitting, which is especially important with a high number of output classes:

* The first layer has **32 neurons** with ReLU activation and **L2 regularization**, which penalizes overly large weights.
* A **Dropout layer** randomly disables 30% of neurons during training, encouraging robustness.
* The second hidden layer has **16 neurons** with ReLU activation.
* The final **softmax** layer outputs the probability distribution across the 31 nations.

**Model 6B: Batch Normalization with Larger Capacity**

This model increases the network's capacity (wider hidden layers) and its training stability:

* The first hidden layer has **64 neurons** with ReLU activation and zero-initialized biases.
* A **Batch Normalization layer** follows, which standardizes the outputs of the previous layer and helps the network train faster and more reliably.
* The second hidden layer has **32 neurons** with ReLU activation.
* The output layer is a **31-unit softmax**, which enables multi-class classification.
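
The standardization step performed by batch normalization during training can be sketched in a few lines of NumPy. This is a simplified illustration, not the Keras layer: it assumes the initial scale and shift (`gamma = 1`, `beta = 0`) and the Keras default `epsilon` of 0.001, and it ignores the moving averages that are used at inference time. The input batch is random, made-up data.

```python
import numpy as np

def batch_norm_train(x, gamma=1.0, beta=0.0, epsilon=1e-3):
    """Standardize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + epsilon) + beta

rng = np.random.default_rng(123)

# Made-up batch of 64 samples with 3 features on very different scales
batch = rng.normal(loc=[10.0, -5.0, 200.0], scale=[1.0, 5.0, 50.0], size=(64, 3))
normalized = batch_norm_train(batch)

# After normalization each feature has mean close to 0 and std close to 1
print(normalized.mean(axis=0).round(6), normalized.std(axis=0).round(3))
```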


In [15]:
def build_model_3B_classification():
    return keras.Sequential([
        layers.Input(shape=(input_shape,)),
        layers.Dense(32, activation='relu', bias_initializer='he_normal'),
        layers.Dense(16, activation='relu', bias_initializer='he_normal'),
        layers.Dense(31, activation='softmax')  # 31 output classes
    ])

def build_model_5B_classification():
    return keras.Sequential([
        layers.Input(shape=(input_shape,)),
        layers.Dense(32, activation='relu', kernel_regularizer=regularizers.L2()),
        layers.Dropout(0.3),
        layers.Dense(16, activation='relu'),
        layers.Dense(31, activation='softmax')  # classification head
    ])

def build_model_6B_classification():
    return keras.Sequential([
        layers.Input(shape=(input_shape,)),
        layers.Dense(64, activation='relu', bias_initializer='zeros'),
        layers.BatchNormalization(),
        layers.Dense(32, activation='relu'),
        layers.Dense(31, activation='softmax')  # classification head
    ])

In [16]:
from sklearn.metrics import accuracy_score, classification_report

def build_and_train_classification(model, model_name, optimizer='adam', epochs=50, batch_size=32):
    # Compile with categorical loss and accuracy
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    # Train the model
    history = model.fit(
        X_train_prep, y_train,
        epochs=epochs,
        batch_size=batch_size,
        validation_split=0.2,
        verbose=0
    )

    # Predict class probabilities, then get predicted class labels
    y_pred_probs = model.predict(X_test_prep)
    y_pred_labels = np.argmax(y_pred_probs, axis=1)

    # Evaluate accuracy
    acc = accuracy_score(y_test, y_pred_labels)
    print(f"{model_name} | Epochs: {epochs} | Batch Size: {batch_size}")
    print(f"Test Accuracy: {acc:.3f}")
    print("-" * 40)
    return acc

def grid_search_nn_classification(model_architecture_fn, model_name, epochs_list, batch_sizes_list, learning_rates_list=None, optimizer_name_list=None):
    set_seeds(123)
    results = []

    if learning_rates_list is None:
        learning_rates_list = [0.001]
    if optimizer_name_list is None:
        optimizer_name_list = ['adam']

    for epochs, batch_size, lr, opt_name in product(epochs_list, batch_sizes_list, learning_rates_list, optimizer_name_list):
        model = model_architecture_fn()

        # Build optimizer
        if opt_name == 'adam':
            optimizer = keras.optimizers.Adam(learning_rate=lr)
        elif opt_name == 'rmsprop':
            optimizer = keras.optimizers.RMSprop(learning_rate=lr)
        else:
            raise ValueError(f"Unsupported optimizer: {opt_name}")

        # Train and evaluate
        acc = build_and_train_classification(
            model,
            model_name=f"{model_name} (opt={opt_name}, epochs={epochs}, batch={batch_size}, lr={lr})",
            optimizer=optimizer,
            epochs=epochs,
            batch_size=batch_size
        )

        # Store results
        results.append({
            "Model": model_name,
            "Optimizer": opt_name,
            "Epochs": epochs,
            "Batch Size": batch_size,
            "Learning Rate": lr,
            "Accuracy": acc
        })

    return pd.DataFrame(results)

from sklearn.metrics import classification_report, confusion_matrix
def evaluate_classification(model, X_test, y_test, class_names=None):
    # Predict probabilities and convert to class labels
    y_pred_probs = model.predict(X_test)
    y_pred = np.argmax(y_pred_probs, axis=1)

    # Generate consistent list of label IDs
    n_classes = len(class_names) if class_names is not None else len(np.unique(y_test))
    labels = list(range(n_classes))

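    # Print classification report; zero_division=0 reports 0.0 without warnings
    # for classes that have no predictions or no test samples
    print("Classification Report:\n")
    print(classification_report(y_test, y_pred, target_names=class_names, labels=labels, zero_division=0))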
    
def evaluate_best_model(results_df, model_fn, X_train, y_train, X_test, y_test, class_names):
    best_config = results_df.sort_values("Accuracy", ascending=False).iloc[0]
    best_epochs = int(best_config["Epochs"])
    best_batch = int(best_config["Batch Size"])
    best_lr = float(best_config["Learning Rate"])
    best_opt = best_config["Optimizer"]

    print(f"Evaluating best model config:\n{best_config}\n")

    model = model_fn()
    if best_opt == 'adam':
        optimizer = keras.optimizers.Adam(learning_rate=best_lr)
    elif best_opt == 'rmsprop':
        optimizer = keras.optimizers.RMSprop(learning_rate=best_lr)
    else:
        raise ValueError(f"Unsupported optimizer: {best_opt}")

    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=best_epochs, batch_size=best_batch, validation_split=0.2, verbose=0)

    evaluate_classification(model, X_test, y_test, class_names=class_names)
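
The helpers above print a classification report, while the assignment also asks for a confusion matrix. A minimal NumPy sketch of one way to build it from integer labels follows; the three class names and the label arrays are made up, and in the notebook the inputs would be `y_test` and the argmax of the predicted probabilities. Rows are true classes and columns are predicted classes, the same layout `sklearn.metrics.confusion_matrix` uses.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Count matrix with true classes on rows and predicted classes on columns."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (y_true, y_pred), 1)  # increment one cell per sample
    return matrix

# Made-up example with three classes
class_names = ["Chile", "France", "USA"]
y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])

cm = confusion_counts(y_true, y_pred, len(class_names))
print(cm)

# Accuracy is the diagonal share; per-class recall divides the diagonal by row totals
accuracy = cm.trace() / cm.sum()
recall = cm.diagonal() / cm.sum(axis=1)
print(accuracy, recall)
```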

In [17]:
model_3B_classification = build_model_3B_classification()
model_3B_classification.summary()

In [18]:
model_5B_classification = build_model_5B_classification()
model_5B_classification.summary()

In [19]:
model_6B_classification = build_model_6B_classification()
model_6B_classification.summary()

In [20]:
results_3B_class = grid_search_nn_classification(
    model_architecture_fn=build_model_3B_classification,
    model_name="Model 3B",
    epochs_list=[100],
    batch_sizes_list=[64],
    learning_rates_list=[0.01, 0.001],
    optimizer_name_list=['adam', 'rmsprop']
)

89/89 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 3B (opt=adam, epochs=100, batch=64, lr=0.01) | Epochs: 100 | Batch Size: 64
Test Accuracy: 0.929
----------------------------------------
89/89 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 3B (opt=rmsprop, epochs=100, batch=64, lr=0.01) | Epochs: 100 | Batch Size: 64
Test Accuracy: 0.929
----------------------------------------
89/89 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 3B (opt=adam, epochs=100, batch=64, lr=0.001) | Epochs: 100 | Batch Size: 64
Test Accuracy: 0.934
----------------------------------------
89/89 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 3B (opt=rmsprop, epochs=100, batch=64, lr=0.001) | Epochs: 100 | Batch Size: 64
Test Accuracy: 0.903
----------------------------------------


In [21]:
# Sort configurations by test accuracy (best first)
results_3B_class.sort_values("Accuracy", ascending=False)

Unnamed: 0,Model,Optimizer,Epochs,Batch Size,Learning Rate,Accuracy
2,Model 3B,adam,100,64,0.001,0.934178
0,Model 3B,adam,100,64,0.01,0.928898
1,Model 3B,rmsprop,100,64,0.01,0.928546
3,Model 3B,rmsprop,100,64,0.001,0.903203


In [22]:
evaluate_best_model(
    results_df=results_3B_class,
    model_fn=build_model_3B_classification,
    X_train=X_train_prep,
    y_train=y_train,
    X_test=X_test_prep,
    y_test=y_test,
    class_names=class_names
)

Evaluating best model config:
Model            Model 3B
Optimizer            adam
Epochs                100
Batch Size             64
Learning Rate       0.001
Accuracy         0.934178
Name: 2, dtype: object

89/89 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Classification Report:

                              precision    recall  f1-score   support

                   Argentina       0.95      0.98      0.97        63
                   Australia       0.94      0.90      0.92       229
                     Austria       1.00      0.94      0.97        17
                    Bulgaria       0.00      0.00      0.00         1
                      Canada       0.86      0.86      0.86         7
                       Chile       0.99      0.97      0.98       319
                       China       0.00      0.00      0.00         2
                     Croatia       0.00      0.00      0.00         0
                      France       0.92      0.93      0.93 

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [23]:
results_5B_class = grid_search_nn_classification(
    model_architecture_fn=build_model_5B_classification,
    model_name="Model 5B",
    epochs_list=[100],
    batch_sizes_list=[64],
    learning_rates_list=[0.01, 0.001],
    optimizer_name_list=['adam', 'rmsprop']
)

Model 5B (opt=adam, epochs=100, batch=64, lr=0.01) | Epochs: 100 | Batch Size: 64
Test Accuracy: 0.803
----------------------------------------
Model 5B (opt=rmsprop, epochs=100, batch=64, lr=0.01) | Epochs: 100 | Batch Size: 64
Test Accuracy: 0.643
----------------------------------------
Model 5B (opt=adam, epochs=100, batch=64, lr=0.001) | Epochs: 100 | Batch Size: 64
Test Accuracy: 0.909
----------------------------------------
Model 5B (opt=rmsprop, epochs=100, batch=64, lr=0.001) | Epochs: 100 | Batch Size: 64
Test Accuracy: 0.806
----------------------------------------


In [24]:
# Sort the grid-search results by test accuracy, best configuration first
results_5B_class.sort_values("Accuracy", ascending=False)

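|   | Model    | Optimizer | Epochs | Batch Size | Learning Rate | Accuracy |
| - | -------- | --------- | ------ | ---------- | ------------- | -------- |
| 2 | Model 5B | adam      | 100    | 64         | 0.001         | 0.908835 |
| 3 | Model 5B | rmsprop   | 100    | 64         | 0.001         | 0.806054 |
| 0 | Model 5B | adam      | 100    | 64         | 0.01          | 0.802534 |
| 1 | Model 5B | rmsprop   | 100    | 64         | 0.01          | 0.643435 |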


In [25]:
evaluate_best_model(
    results_df=results_5B_class,
    model_fn=build_model_5B_classification,
    X_train=X_train_prep,
    y_train=y_train,
    X_test=X_test_prep,
    y_test=y_test,
    class_names=class_names
)

Evaluating best model config:
Model            Model 5B
Optimizer            adam
Epochs                100
Batch Size             64
Learning Rate       0.001
Accuracy         0.908835
Name: 2, dtype: object

Classification Report:

                              precision    recall  f1-score   support

                   Argentina       0.95      0.98      0.97        63
                   Australia       0.84      0.87      0.85       229
                     Austria       0.93      0.82      0.88        17
                    Bulgaria       0.00      0.00      0.00         1
                      Canada       0.86      0.86      0.86         7
                       Chile       0.97      0.97      0.97       319
                       China       0.00      0.00      0.00         2
                     Croatia       0.00      0.00      0.00         0
                      France       0.85      0.96      0.90 

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [26]:
results_6B_class = grid_search_nn_classification(
    model_architecture_fn=build_model_6B_classification,
    model_name="Model 6B",
    epochs_list=[100],
    batch_sizes_list=[64],
    learning_rates_list=[0.01, 0.001],
    optimizer_name_list=['adam', 'rmsprop']
)

Model 6B (opt=adam, epochs=100, batch=64, lr=0.01) | Epochs: 100 | Batch Size: 64
Test Accuracy: 0.936
----------------------------------------
Model 6B (opt=rmsprop, epochs=100, batch=64, lr=0.01) | Epochs: 100 | Batch Size: 64
Test Accuracy: 0.852
----------------------------------------
Model 6B (opt=adam, epochs=100, batch=64, lr=0.001) | Epochs: 100 | Batch Size: 64
Test Accuracy: 0.943
----------------------------------------
Model 6B (opt=rmsprop, epochs=100, batch=64, lr=0.001) | Epochs: 100 | Batch Size: 64
Test Accuracy: 0.936
----------------------------------------


In [27]:
# Sort the grid-search results by test accuracy, best configuration first
results_6B_class.sort_values("Accuracy", ascending=False)

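|   | Model    | Optimizer | Epochs | Batch Size | Learning Rate | Accuracy |
| - | -------- | --------- | ------ | ---------- | ------------- | -------- |
| 2 | Model 6B | adam      | 100    | 64         | 0.001         | 0.94333  |
| 3 | Model 6B | rmsprop   | 100    | 64         | 0.001         | 0.935938 |
| 0 | Model 6B | adam      | 100    | 64         | 0.01          | 0.935586 |
| 1 | Model 6B | rmsprop   | 100    | 64         | 0.01          | 0.852165 |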


In [28]:
evaluate_best_model(
    results_df=results_6B_class,
    model_fn=build_model_6B_classification,
    X_train=X_train_prep,
    y_train=y_train,
    X_test=X_test_prep,
    y_test=y_test,
    class_names=class_names
)

Evaluating best model config:
Model            Model 6B
Optimizer            adam
Epochs                100
Batch Size             64
Learning Rate       0.001
Accuracy          0.94333
Name: 2, dtype: object

Classification Report:

                              precision    recall  f1-score   support

                   Argentina       0.97      0.98      0.98        63
                   Australia       0.92      0.91      0.92       229
                     Austria       1.00      0.94      0.97        17
                    Bulgaria       0.00      0.00      0.00         1
                      Canada       0.86      0.86      0.86         7
                       Chile       0.95      0.98      0.96       319
                       China       0.00      0.00      0.00         2
                     Croatia       0.00      0.00      0.00         0
                      France       0.91      0.95      0.93   

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


### Overall Comparison

| Model  | Optimizer | LR    | Accuracy  | Notes                                                    |
| ------ | --------- | ----- | --------- | -------------------------------------------------------- |
| **3B** | Adam      | 0.001 | 0.934     | Strong general model, stable                             |
| **5B** | Adam      | 0.001 | 0.909     | Dropout adds regularization but reduces top-end accuracy |
| **6B** | Adam      | 0.001 | **0.943** | Best performer, benefits from BatchNorm                  |

**Model 6B (BatchNorm + Adam, 0.001 LR)** delivers the highest classification accuracy, **0.943**, with strong per-class metrics on the well-supported classes.
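
To gauge how much weight the gaps between these accuracies can bear, the sketch below computes a binomial standard error for each score. It assumes about 2,841 test samples, a figure inferred from the 89 prediction batches and the spacing of the reported accuracies rather than stated in the notebook. `accuracy_standard_error` and `n_test` are illustrative names, and each score comes from a single training run.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# Assumption: test-set size inferred from the output, not reported directly.
n_test = 2841

# Best accuracy per model, taken from the comparison table above.
best_accuracy = {"3B": 0.934, "5B": 0.909, "6B": 0.943}

for name, acc in best_accuracy.items():
    half_width = 1.96 * accuracy_standard_error(acc, n_test)
    print(f"Model {name}: {acc:.3f} +/- {half_width:.3f} (approx. 95% interval)")
```

Under that assumption the interval half-width is roughly one percentage point, so the gap between 3B and 6B is suggestive rather than conclusive, while 5B's deficit is clearer. Repeating the runs with different seeds would give a firmer answer.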

### **Model 3B: Deeper Network**

* Performed well (0.934 accuracy) with **Adam + 0.001 LR**
* **RMSprop at 0.001 LR degraded performance** (0.903); at 0.01 LR the two optimizers were effectively tied (0.929 each)
* Misclassifications concentrate in rare classes (China, UK, Japan), while recall stays high on dominant classes like France, Chile, and Italy.

**Conclusion**: Stable and general-purpose; performs well but may benefit from normalization.

### **Model 5B: Dropout Regularization**

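* Performance maxed out at **0.909 accuracy** (Adam, 0.001 LR) and was sensitive to the learning rate: at 0.01 LR accuracy fell to 0.803 with Adam and 0.643 with RMSprop
* Dropout regularizes the network, but within the 100-epoch budget it appears to have limited what the model could fit; train-versus-validation curves would be needed to confirm that it reduced overfitting
* Accuracy and recall dropped on smaller classes; many minority nations had 0% recall

**Conclusion**: The more conservative option. It may suit noisier data, but here it lowered test accuracy and was weakest on fine-grained, low-support classes.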

### **Model 6B: Batch Normalization**

* Best performer at **0.943 accuracy**
* Strong precision/recall for dominant countries like France, Italy, and USA
* Still struggles with **ultra-rare classes**. Classes with zero test support (for example Croatia) have undefined recall, which scikit-learn reports as 0.00 with a warning, so those rows are expected

**Conclusion**: BatchNorm is generally expected to stabilize training, and here it gave the best overall accuracy. It does not fix the class imbalance on its own: macro F1 is tied with Model 3B at 0.67, so rare classes remain poorly predicted.


#### Class-Specific Highlights

| Class                                | Observation                                                  |
|--------------------------------------|--------------------------------------------------------------|
| France, Italy, Chile, USA            | High precision + recall, dominant classes well modeled       |
| Argentina, Germany, Portugal, Spain  | Also well predicted across all models                        |
| China, Japan, Uruguay, UK, Moldova   | Always near 0, not enough support in training data           |
| New Zealand, Australia, South Africa | Variable depending on model, with moderate-to-high precision |
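
One way to see where the low-support classes end up is to count the most frequent (true, predicted) error pairs, which is the off-diagonal part of a confusion matrix in list form. The sketch below uses toy labels, not the notebook's predictions, and `top_confusions` is a hypothetical helper rather than part of the original code.

```python
from collections import Counter


def top_confusions(y_true, y_pred, k=3):
    """Return the k most common (true_label, predicted_label) error pairs."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)


# Toy example: the rare class is absorbed by a dominant one.
y_true_toy = ["China", "China", "Japan", "France", "France", "France"]
y_pred_toy = ["France", "France", "Italy", "France", "France", "Italy"]

for (true_label, predicted_label), count in top_confusions(y_true_toy, y_pred_toy):
    print(f"{true_label} -> {predicted_label}: {count}")
```

Applied to the decoded test labels and predictions, a listing like this would show whether the rare nations are absorbed by one dominant class or spread across several.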


### Metric Summary

| Metric Type | Model 3B | Model 5B | Model 6B  |
| ----------- | -------- | -------- | --------- |
| Accuracy    | 0.934    | 0.909    | **0.943** |
| Macro F1    | **0.67** | 0.43     | **0.67**  |
| Weighted F1 | 0.93     | 0.90     | **0.94**  |
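
The gap between macro F1 and the other two metrics follows from how they are averaged: macro F1 gives every class equal weight, so classes the model never gets right pull it down even when they contain only a handful of samples. The sketch below illustrates this with toy labels in plain Python. `f1_summary` is an illustrative helper, and scoring an undefined F1 as 0 is an assumption made here for simplicity.

```python
from collections import Counter


def f1_summary(y_true, y_pred, labels=None):
    """Return (accuracy, macro_f1, weighted_f1) for two equal-length label lists."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    per_class_f1 = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted samples has undefined F1;
        # it is scored as 0 here (an assumption of this sketch).
        per_class_f1[label] = 2 * tp / denom if denom else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    macro_f1 = sum(per_class_f1.values()) / len(labels)
    weighted_f1 = sum(per_class_f1[c] * support[c] for c in labels) / len(pairs)
    return accuracy, macro_f1, weighted_f1


# Toy labels: one dominant class and two rare classes the model ignores.
y_true_toy = ["France"] * 8 + ["China", "Bulgaria"]
y_pred_toy = ["France"] * 10

acc, macro, weighted = f1_summary(y_true_toy, y_pred_toy)
print(f"accuracy={acc:.3f}  macro_f1={macro:.3f}  weighted_f1={weighted:.3f}")
```

In this toy case accuracy and weighted F1 stay high while macro F1 collapses, which mirrors the pattern in the table: the three models differ far more on macro F1 than on accuracy.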


### Key Takeaways

* **Use Model 6B** for best overall accuracy (0.943 versus 0.934 for Model 3B). The edge is modest and, judging by the tied macro F1, comes mainly from the well-represented nations.
* **Model 5B is more defensive**: reasonable when overfitting is a risk, but not optimal when high precision is needed.
* **Rare classes** remain a challenge; data augmentation or class weighting may help if improving those is a goal.
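
As a starting point for the class-weighting idea, the sketch below computes "balanced" weights, `n_samples / (n_classes * class_count)`, in plain Python on toy labels. `balanced_class_weights` is an illustrative helper. Keras accepts a dict that maps integer class indices to weights through the `class_weight` argument of `model.fit`; whether the notebook's grid-search helper forwards such an argument is not known from this section.

```python
from collections import Counter


def balanced_class_weights(y):
    """Map each class label to n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy integer-encoded labels: class 0 dominates, class 2 is very rare.
y_train_toy = [0] * 90 + [1] * 9 + [2] * 1

weights = balanced_class_weights(y_train_toy)
for label in sorted(weights):
    print(f"class {label}: weight {weights[label]:.3f}")

# Hypothetical usage with a compiled Keras model (not run here):
# model.fit(X_train_prep, y_train, class_weight=weights, epochs=100, batch_size=64)
```

Weights this extreme can destabilize training when a class has only one or two samples, so capping them or merging the rarest nations into an "Other" class may be worth comparing.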
