## Predicting Wine Price

For background on the dataset, see its Kaggle page: [https://www.kaggle.com/datasets/dev7halo/wine-information](https://www.kaggle.com/datasets/dev7halo/wine-information)

Your goal is to use the other variables in the dataset to predict wine price. Feel free to use only a subset of the variables.

### Assignment Specs

- You should compare Neural Networks as we discussed this week to at least one of our previous models from this quarter.
- A secondary goal of this assignment is to test the effects of the neural network function(s) arguments on the algorithm's performance.
- You should explore at least 5 different sets of settings for the function inputs, and you should do your best to find values for these inputs that actually change the results of your modelling. That is, try not to run three different sets of inputs that result in the same performance. The goal here is for you to better understand how to set these input values yourself in the future. Comment on what you discover about these inputs and how they behave.
- Additionally, I'd like you to include pictures of the network architecture for each of the neural network models you run. You may hand-draw them and insert pictures into your submitted files if you wish. You may also use software (e.g. draw.io) to create nice looking diagrams. I want you to become intimately familiar with these types of models and what they look like.
- Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.
- Again, submit an HTML, ipynb, or Colab link. Be sure to rerun your entire notebook fresh before submitting!


### The Data


In this activity, we explore a dataset containing detailed information about wines, including attributes like producer, nation, region (`local1` through `local4`), grape varieties, alcohol by volume, sweetness, acidity, body, tannin, and vintage year. Our main goal is to use the available information to **predict the price** of a wine.

The dataset we are using is sourced from Kaggle and has already undergone some initial cleansing (hence the file name `cleansingWine.csv`). We will perform further exploration and modeling to understand the patterns and relationships between wine characteristics and their pricing.

To begin, we will load the dataset using pandas and perform a quick initial inspection.


In [None]:
import pandas as pd
import numpy as np

In [112]:
# Read in the dataset
wine_df = pd.read_csv(
    'Data/cleansingWine.csv', low_memory=False
).drop(columns=['Unnamed: 0'])


# Display the first few rows to get a sense of the structure
wine_df.head()

|   | id | name | producer | nation | local1 | local2 | local3 | local4 | varieties1 | varieties2 | ... | use | abv | degree | sweet | acidity | body | tannin | price | year | ml |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 137197 | Altair | Altair | Chile | Rapel Valley | NaN | NaN | NaN | Cabernet Sauvignon | Carmenere | ... | Table | 14~15 | 17~19 | SWEET1 | ACIDITY4 | BODY5 | TANNIN4 | 220000 | 2014 | 750 |
| 1 | 137198 | Altair, Sideral | Altair | Chile | Rapel Valley | NaN | NaN | NaN | Cabernet Sauvignon | Merlot | ... | Table | 14~15 | 16~18 | SWEET1 | ACIDITY3 | BODY4 | TANNIN4 | 110000 | 2016 | 750 |
| 2 | 137199 | Baron du Val Red | Baron du Val | France | NaN | NaN | NaN | NaN | Carignan | Cinsault | ... | Table | 11~12 | 15~17 | SWEET2 | ACIDITY3 | BODY2 | TANNIN2 | 0 | 0 | 750 |
| 3 | 137200 | Baron du Val White | Baron du Val | France | NaN | NaN | NaN | NaN | Carignan | Ugni blanc | ... | Table | 11~12 | 9~11 | SWEET1 | ACIDITY3 | BODY2 | TANNIN1 | 0 | 0 | 750 |
| 4 | 137201 | Benziger, Cabernet Sauvignon | Benziger | USA | California | NaN | NaN | NaN | Cabernet Sauvignon | NaN | ... | Table | 13~14 | 17~19 | SWEET1 | ACIDITY3 | BODY3 | TANNIN4 | 0 | 2003 | 750 |


In [113]:
wine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21605 entries, 0 to 21604
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           21605 non-null  int64 
 1   name         21605 non-null  object
 2   producer     21605 non-null  object
 3   nation       21603 non-null  object
 4   local1       20705 non-null  object
 5   local2       11145 non-null  object
 6   local3       3591 non-null   object
 7   local4       2 non-null      object
 8   varieties1   21256 non-null  object
 9   varieties2   7518 non-null   object
 10  varieties3   4028 non-null   object
 11  varieties4   1330 non-null   object
 12  varieties5   379 non-null    object
 13  varieties6   105 non-null    object
 14  varieties7   31 non-null     object
 15  varieties8   18 non-null     object
 16  varieties9   7 non-null      object
 17  varieties10  6 non-null      object
 18  varieties11  5 non-null      object
 19  varieties12  4 non-null  

## Modeling

Our goal is to predict the price of a wine based on a subset of features from the dataset.

To do this, we will:
- Build a **baseline model** using a Bagging Regressor with a Decision Tree estimator.
- Build several **Neural Networks** with different settings to test how changes in the architecture and hyperparameters affect performance.

Our target variable is **`price`**.

### Feature Selection

For simplicity and clarity, we focus on the following features:

- `producer`
- `type`
- `use`
- `abv` (Alcohol by Volume)
- `sweet` (Sweetness level)
- `acidity` (Acidity level)
- `body` (Body level)
- `tannin` (Tannin level)
- `year` (Vintage year)
- `local1` (Local region)
- `varieties1` (Grape variety)

These features were chosen because they are intuitively related to wine pricing and were relatively clean after preprocessing. Adding `local1` and `varieties1` helped capture more variation in wine characteristics, leading to improved model performance.

### Preparing Feature Sets

In [114]:
from sklearn.model_selection import train_test_split

# Select only the columns of interest
features = ['producer', 'local1', 'varieties1', 'type', 'use', 'abv', 'sweet', 'acidity', 'body', 'tannin', 'year']
target = 'price'

# Make a copy of the working data
model_data = wine_df[features + [target]].copy()

# Drop any rows with missing values
model_data = model_data.dropna()

# Keep only rows where price is greater than 0
model_data = model_data[model_data['price'] > 0]

# Convert features to appropriate numeric types
def clean_range(value):
    """Convert a range string like '14~15' to its midpoint; return None if unparseable."""
    try:
        if isinstance(value, str) and '~' in value:
            low, high = value.split('~')
            return (float(low) + float(high)) / 2
        return float(value)
    except (TypeError, ValueError):
        # Unparseable entries become missing and are removed by the dropna() below
        return None

for col in ['abv', 'year']:
    model_data[col] = model_data[col].apply(clean_range)

# Convert ordinal text codes in 'sweet', 'acidity', 'body', 'tannin'
# These look like 'SWEET1', so we extract the number
def extract_number(value):
    """Pull the numeric level out of a text label, e.g. 'SWEET1' -> 1."""
    if isinstance(value, str):
        digits = ''.join(filter(str.isdigit, value))
        # A label without digits cannot be converted, so treat it as missing
        return int(digits) if digits else None
    return None

for col in ['sweet', 'acidity', 'body', 'tannin']:
    model_data[col] = model_data[col].apply(extract_number)

# Drop again any rows with missing values after cleaning
model_data = model_data.dropna()

# Separate X and y
X = model_data[features]
y = model_data[target]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
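
As a quick illustration of the two cleaning ideas above, the sketch below applies them to a few sample values taken from the rows shown earlier. The helper names `range_midpoint` and `label_level` are hypothetical stand-ins written for this example, not the notebook's own functions.

```python
def range_midpoint(value):
    """Return the midpoint of a 'low~high' string, or the value itself as a float."""
    if isinstance(value, str) and "~" in value:
        low, high = value.split("~")
        return (float(low) + float(high)) / 2
    return float(value)


def label_level(label):
    """Extract the numeric level from a label such as 'TANNIN4'."""
    return int("".join(ch for ch in label if ch.isdigit()))


print(range_midpoint("14~15"))  # 14.5
print(range_midpoint("2014"))   # 2014.0
print(label_level("SWEET1"))    # 1
print(label_level("ACIDITY4"))  # 4
```

Treating the levels as numbers assumes they are ordered (for example, that `BODY5` is fuller than `BODY2`), which seems reasonable for these columns.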

### Bagging

To build a strong baseline for comparison against our Neural Networks, we first train a Bagging Regressor. Bagging (short for bootstrap aggregating) reduces variance by averaging the predictions of many decision trees, each trained on a bootstrap sample of the training data, meaning rows drawn at random with replacement. We tune the number of estimators, the maximum tree depth, and the minimum number of samples needed to split a node using `GridSearchCV` with 5-fold cross-validation. The final model's MSE and R² scores provide a benchmark for evaluating our Neural Networks.
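
To make the idea concrete, here is a minimal sketch of bagging done by hand. It uses synthetic data rather than the wine dataset, and the settings (25 trees, depth 5) are arbitrary choices for illustration. `BaggingRegressor` performs essentially these steps for us.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(123)

# Synthetic 1-D regression problem (illustrative data, not the wine dataset)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

n_estimators = 25
trees = []
for _ in range(n_estimators):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(max_depth=5, random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

X_new = np.array([[2.5], [7.0]])

# Each row holds one tree's predictions; the bagged prediction is their average
all_preds = np.stack([t.predict(X_new) for t in trees])
bagged_pred = all_preds.mean(axis=0)

print(all_preds.shape)    # (25, 2)
print(bagged_pred.shape)  # (2,)
```

Because each tree sees a slightly different sample, the individual predictions disagree somewhat, and averaging smooths out those disagreements.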


In [115]:
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Categorical features to OneHotEncode
categorical_features = ['producer', 'local1', 'varieties1', 'type', 'use']
categorical_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')

# Preprocessor
tree_preprocessor = ColumnTransformer(
    transformers=[('cat', categorical_transformer, categorical_features)],
    remainder='passthrough'
)

# Full pipeline: preprocessing + bagging
pipe_tree = Pipeline(steps=[
    ('preprocessor', tree_preprocessor),
    ('regressor', BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=123),
        random_state=123
    )),
])

# Parameter grid for grid search
param_grid_tree = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__estimator__max_depth': [3, 5, 7],
    'regressor__estimator__min_samples_split': [2, 5]
}

# Grid search
grid_search = GridSearchCV(pipe_tree, param_grid_tree, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Evaluate the best model
best_tree = grid_search.best_estimator_
y_pred = best_tree.predict(X_test)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Bagging MSE: {mean_squared_error(y_test, y_pred):.2f}")
print(f"Bagging R²: {r2_score(y_test, y_pred):.3f}")

Best Parameters: {'regressor__estimator__max_depth': 7, 'regressor__estimator__min_samples_split': 2, 'regressor__n_estimators': 100}
Bagging MSE: 37368762858.38
Bagging R²: 0.392




The Bagging Regressor achieved a Mean Squared Error (MSE) of approximately 37.37 billion and an R² score of 0.392 on the test set. Taking the square root of the MSE gives a typical error of roughly 193,000 in the dataset's price units. The R² means the model explains about 39% of the variance in wine prices, which leaves most of the variation unexplained but is a reasonable result given how variable wine pricing is. Note that the selected `max_depth` of 7 is the largest value in our grid, so deeper trees might do better still. Bagging gives us a solid baseline for comparing the more complex Neural Network models.
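
MSE is hard to interpret because it is in squared price units. The short calculation below converts the reported MSE to a root mean squared error (RMSE), and uses the identity R² = 1 - MSE / Var(y) to back out the error we would get by always predicting the average price. The second figure is approximate because it relies on the rounded R² of 0.392.

```python
import math

mse = 37368762858.38  # Bagging test MSE reported above
r2 = 0.392            # Bagging test R² reported above (rounded)

# RMSE is in the same units as price
rmse = math.sqrt(mse)

# R² = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R²)
# Var(y) is also the MSE of always predicting the mean price
mean_guess_rmse = math.sqrt(mse / (1 - r2))

print(f"Bagging RMSE:          {rmse:,.0f}")
print(f"Mean-guess RMSE (est): {mean_guess_rmse:,.0f}")
```

The Bagging RMSE comes out near 193,000, compared with roughly 248,000 for the mean-price guess.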

### 5 Neural Networks for Wine Price Prediction

To explore how Neural Networks perform on our wine price prediction task, we designed five different models, each with a different architecture. The goal is to understand how specific design choices (the number of neurons, the depth of the network, the activation function, and regularization) affect the model's accuracy and generalization. Each model varies one of these aspects so that we can observe the effect of that kind of change. For each model, we report the test performance and provide a diagram of its architecture.

#### Model 1: Small Simple Network
![Model 1](Diagrams/Model1.jpg)

#### Model 2: More Neurons
![Model 2](Diagrams/Model2.jpg)

#### Model 3: Deeper Network
![Model 3](Diagrams/Model3.jpg)

#### Model 4: Tanh Activation
![Model 4](Diagrams/Model4.jpg)

#### Model 5: Dropout Regularization
![Model 5](Diagrams/Model5.jpg)

___

In [116]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Preprocessing: scale numeric features, one-hot encode categoricals
numeric_features = ['abv', 'sweet', 'acidity', 'body', 'tannin', 'year']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features)
    ]
)

# Fit the preprocessor
X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep = preprocessor.transform(X_test)

# Number of input features after preprocessing (the width of the input layer)
input_shape = X_train_prep.shape[1]

# Function to compile, train, and evaluate a model
def build_and_train(model, model_name, optimizer='adam', epochs=50, batch_size=32):
    model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])
    history = model.fit(
        X_train_prep, y_train,
        epochs=epochs,
        batch_size=batch_size,
        validation_split=0.2,
        verbose=0
    )
    y_pred = model.predict(X_test_prep).flatten()
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{model_name} | Epochs: {epochs} | Batch Size: {batch_size}")
    print(f"Test MSE: {mse:.2f}")
    print(f"Test MAE: {mae:.2f}")
    print(f"Test R²: {r2:.3f}")
    print("-" * 40)
    return mse, mae, r2

from itertools import product

def grid_search_nn(model_architecture_fn, model_name, epochs_list, batch_sizes_list, learning_rates_list=None):
    results = []
    
    # Handle learning rate (optional)
    if learning_rates_list is None:
        learning_rates_list = [0.001]
    
    for epochs, batch_size, lr in product(epochs_list, batch_sizes_list, learning_rates_list):
        # Rebuild model fresh each time
        model = model_architecture_fn()
        
        # Custom optimizer with given learning rate
        optimizer = keras.optimizers.Adam(learning_rate=lr)
        
        mse, mae, r2 = build_and_train(
            model,
            model_name=f"{model_name} (epochs={epochs}, batch={batch_size}, lr={lr})",
            optimizer=optimizer,
            epochs=epochs,
            batch_size=batch_size
        )
        
        results.append({
            "Model": model_name,
            "Epochs": epochs,
            "Batch Size": batch_size,
            "Learning Rate": lr,
            "MSE": mse,
            "MAE": mae,
            "R2": r2
        })
        
    return pd.DataFrame(results)



In [117]:
# Model 1: Small Simple Network
def build_model_1():
    return keras.Sequential([
        layers.Input(shape=(input_shape,)),
        layers.Dense(16, activation='relu'),
        layers.Dense(1)
    ])

# Model 2: More Neurons
def build_model_2():
    return keras.Sequential([
        layers.Input(shape=(input_shape,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(1)
    ])

# Model 3: Deeper Network
def build_model_3():
    return keras.Sequential([
        layers.Input(shape=(input_shape,)),
        layers.Dense(32, activation='relu'),
        layers.Dense(16, activation='relu'),
        layers.Dense(1)
    ])

# Model 4: Different Activation Function (tanh)
def build_model_4():
    return keras.Sequential([
        layers.Input(shape=(input_shape,)),
        layers.Dense(32, activation='tanh'),
        layers.Dense(1)
    ])

# Model 5: Dropout Regularization
def build_model_5():
    return keras.Sequential([
        layers.Input(shape=(input_shape,)),
        layers.Dense(32, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(16, activation='relu'),
        layers.Dense(1)
    ])

# Model 6: Improved Network with Batch Normalization
def build_model_6():
    return keras.Sequential([
        layers.Input(shape=(input_shape,)),
        layers.Dense(64, activation='relu'),
        layers.BatchNormalization(),
        layers.Dense(32, activation='relu'),
        layers.Dense(1)
    ])

In [118]:
# Define hyperparameter settings
epochs_list = [50, 100]
batch_sizes_list = [16, 32]
learning_rates_list = [0.001] 
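
These lists define a small grid. The sketch below enumerates it the same way `grid_search_nn` does with `itertools.product`, which is why each model prints four runs in the order shown.

```python
from itertools import product

epochs_list = [50, 100]
batch_sizes_list = [16, 32]
learning_rates_list = [0.001]

# Every combination of epochs, batch size, and learning rate
combos = list(product(epochs_list, batch_sizes_list, learning_rates_list))

for epochs, batch_size, lr in combos:
    print(f"epochs={epochs}, batch={batch_size}, lr={lr}")

print(f"{len(combos)} runs per model")
```

Adding a second value to `learning_rates_list` would double the number of runs per model.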

#### Model 1: Small Simple Network (Baseline)
Architecture: Input → Dense(16, relu) → Output(1)

In [119]:
results_model1 = grid_search_nn(
    model_architecture_fn=build_model_1,
    model_name="Model 1: Small Simple Network",
    epochs_list=epochs_list,
    batch_sizes_list=batch_sizes_list,
    learning_rates_list=learning_rates_list
)

50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 1: Small Simple Network (epochs=50, batch=16, lr=0.001) | Epochs: 50 | Batch Size: 16
Test MSE: 78538915840.00
Test MAE: 132266.70
Test R²: -0.278
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 1: Small Simple Network (epochs=50, batch=32, lr=0.001) | Epochs: 50 | Batch Size: 32
Test MSE: 79675752448.00
Test MAE: 135753.27
Test R²: -0.296
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 1: Small Simple Network (epochs=100, batch=16, lr=0.001) | Epochs: 100 | Batch Size: 16
Test MSE: 70710386688.00
Test MAE: 107798.47
Test R²: -0.151
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 1: Small Simple Network (epochs=100, batch=32, lr=0.001) | Epochs: 100 | Batch Size: 32
Test MSE: 77643071488.00
T

The first Neural Network used a simple architecture with just one hidden layer of 16 neurons. Across the training settings, the best result came from 100 epochs with a batch size of 16 (R² = -0.151), but every run produced a negative R². A negative R² means the model performed worse than simply predicting the average price for every wine. The small size of the network is one likely reason, but the training setup probably matters too: prices run into the hundreds of thousands and are not rescaled, so with a learning rate of 0.001 the network may not have finished converging within 50 to 100 epochs.
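
To see how R² can drop below zero, the sketch below computes it by hand on a tiny made-up example. Predicting the mean for every item gives exactly 0, and predictions that are further from the truth than the mean give a negative value.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R² = 1 - (squared error of the model) / (squared error of the mean guess)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


y_true = np.array([10.0, 20.0, 30.0, 40.0])        # made-up prices
mean_guess = np.full_like(y_true, y_true.mean())   # always predict the average
bad_guess = np.array([40.0, 30.0, 20.0, 10.0])     # predictions in reverse order

print(r_squared(y_true, mean_guess))  # 0.0
print(r_squared(y_true, bad_guess))   # -3.0
```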


#### Model 2: More Neurons (Capacity Test)
Architecture: Input → Dense(64, relu) → Output(1)

In [120]:
results_model2 = grid_search_nn(
    model_architecture_fn=build_model_2,
    model_name="Model 2: More Neurons",
    epochs_list=epochs_list,
    batch_sizes_list=batch_sizes_list,
    learning_rates_list=learning_rates_list
)

50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 2: More Neurons (epochs=50, batch=16, lr=0.001) | Epochs: 50 | Batch Size: 16
Test MSE: 69099921408.00
Test MAE: 103644.16
Test R²: -0.124
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 2: More Neurons (epochs=50, batch=32, lr=0.001) | Epochs: 50 | Batch Size: 32
Test MSE: 76046262272.00
Test MAE: 124582.50
Test R²: -0.237
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 2: More Neurons (epochs=100, batch=16, lr=0.001) | Epochs: 100 | Batch Size: 16
Test MSE: 60545134592.00
Test MAE: 92502.96
Test R²: 0.015
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 2: More Neurons (epochs=100, batch=32, lr=0.001) | Epochs: 100 | Batch Size: 32
Test MSE: 67547275264.00
Test MAE: 100267.10
Test R²: -0.099

In this model, we increased the number of neurons to 64 in a single hidden layer. At the 100-epoch, batch-size-16 setting, the wider network reduced the Mean Absolute Error (MAE) from about 107,800 to about 92,500, but R² was negative in three of the four runs and reached only 0.015 in the best one. Simply making the network wider, without adding depth or other improvements, was not enough to meaningfully capture the complexity of wine pricing.

#### Model 3: Deeper Network (More Layers)
Architecture: Input → Dense(32, relu) → Dense(16, relu) → Output(1)

In [121]:
results_model3 = grid_search_nn(
    model_architecture_fn=build_model_3,
    model_name="Model 3: Deeper Network",
    epochs_list=epochs_list,
    batch_sizes_list=batch_sizes_list,
    learning_rates_list=learning_rates_list
)

50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 3: Deeper Network (epochs=50, batch=16, lr=0.001) | Epochs: 50 | Batch Size: 16
Test MSE: 49275625472.00
Test MAE: 91354.25
Test R²: 0.198
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 3: Deeper Network (epochs=50, batch=32, lr=0.001) | Epochs: 50 | Batch Size: 32
Test MSE: 51513749504.00
Test MAE: 95582.19
Test R²: 0.162
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 3: Deeper Network (epochs=100, batch=16, lr=0.001) | Epochs: 100 | Batch Size: 16
Test MSE: 45317394432.00
Test MAE: 86743.20
Test R²: 0.263
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 3: Deeper Network (epochs=100, batch=32, lr=0.001) | Epochs: 100 | Batch Size: 32
Test MSE: 49432190976.00
Test MAE: 92049.80
Test R²: 0.1

This model added a second hidden layer (32 neurons → 16 neurons). Adding depth led to a clear improvement: R² was positive in every run, whereas Model 2 only barely reached 0.015 once. The best performance (R² = 0.263) came from training with 100 epochs and a batch size of 16. In these runs, more epochs and the smaller batch size both went with lower error, although each configuration was trained only once and the code shown sets no random seed for the networks, so small differences should be read with caution.

#### Model 4: Different Activation (tanh)
Architecture: Input → Dense(32, tanh) → Output(1)

In [123]:
results_model4 = grid_search_nn(
    model_architecture_fn=build_model_4,
    model_name="Model 4: Tanh Activation",
    epochs_list=epochs_list,
    batch_sizes_list=batch_sizes_list,
    learning_rates_list=learning_rates_list
)

50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 4: Tanh Activation (epochs=50, batch=16, lr=0.001) | Epochs: 50 | Batch Size: 16
Test MSE: 80635559936.00
Test MAE: 138475.80
Test R²: -0.312
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 4: Tanh Activation (epochs=50, batch=32, lr=0.001) | Epochs: 50 | Batch Size: 32
Test MSE: 80699146240.00
Test MAE: 138705.20
Test R²: -0.313
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 4: Tanh Activation (epochs=100, batch=16, lr=0.001) | Epochs: 100 | Batch Size: 16
Test MSE: 80501809152.00
Test MAE: 137992.02
Test R²: -0.310
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 4: Tanh Activation (epochs=100, batch=32, lr=0.001) | Epochs: 100 | Batch Size: 32
Test MSE: 80629530624.00
Test MAE: 138454.02
T

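We replaced the ReLU activation with tanh in a single hidden layer of 32 neurons. Across all training settings, this model performed poorly and almost identically, with an MSE between 80.5 and 80.7 billion and an R² of about -0.31. Results this constant suggest the network barely moved from its starting point. A plausible explanation is that tanh outputs are confined to the range -1 to 1, so the output layer's weights must grow very large to reach prices in the hundreds of thousands, which happens slowly at a learning rate of 0.001. ReLU outputs have no upper limit, which makes large target values easier to reach. The numeric inputs are standardized, so the scale of the inputs is unlikely to be the cause.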
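
The difference in output range between the two activations can be seen directly. This is a small numeric illustration, not a diagnosis of the trained model, and the target price of 200,000 is an illustrative value rather than a figure computed from the data.

```python
import numpy as np

z = np.array([-100.0, -2.0, 0.0, 2.0, 100.0])  # example pre-activation values

tanh_out = np.tanh(z)          # squashed into the range [-1, 1]
relu_out = np.maximum(z, 0.0)  # zero for negatives, unbounded above

print(tanh_out)
print(relu_out)

# With 32 tanh units feeding a linear output, |prediction| <= sum(|w_i|) + |bias|.
# Ignoring the bias, reaching the illustrative price below requires the output
# weights to average at least this magnitude:
n_units = 32
target_price = 200_000  # illustrative value only
min_avg_weight = target_price / n_units

print(min_avg_weight)  # 6250.0
```

Weights usually start out much smaller than 1, so growing them to the thousands in small steps can take many epochs.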

#### Model 5: Add Dropout (Regularization)
Architecture: Input → Dense(32, relu) → Dropout(0.3) → Dense(16, relu) → Output(1)

In [125]:
results_model5 = grid_search_nn(
    model_architecture_fn=build_model_5,
    model_name="Model 5: Dropout Regularization",
    epochs_list=epochs_list,
    batch_sizes_list=batch_sizes_list,
    learning_rates_list=learning_rates_list
)

50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 5: Dropout Regularization (epochs=50, batch=16, lr=0.001) | Epochs: 50 | Batch Size: 16
Test MSE: 49716576256.00
Test MAE: 91169.36
Test R²: 0.191
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 5: Dropout Regularization (epochs=50, batch=32, lr=0.001) | Epochs: 50 | Batch Size: 32
Test MSE: 52386275328.00
Test MAE: 96110.25
Test R²: 0.148
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 5: Dropout Regularization (epochs=100, batch=16, lr=0.001) | Epochs: 100 | Batch Size: 16
Test MSE: 45954666496.00
Test MAE: 86285.38
Test R²: 0.252
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 5: Dropout Regularization (epochs=100, batch=32, lr=0.001) | Epochs: 100 | Batch Size: 32
Test MSE: 47438626816.00

This network added a Dropout layer (rate 0.3) after the first hidden layer of the Model 3 architecture to help prevent overfitting. Its results track Model 3 closely and are slightly lower: the best R² was 0.252 (100 epochs, batch size 16) versus 0.263 without dropout. Dropout is designed to make a network less prone to memorizing the training data, but here it did not improve test performance, and the gap is small enough that it could be run-to-run variation.
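
For readers curious about what a Dropout layer does, here is a minimal NumPy sketch of the "inverted dropout" scheme that Keras uses: during training a random fraction of units is zeroed and the survivors are scaled up by 1 / (1 - rate), while at prediction time the layer does nothing. The function and the all-ones activations are illustrative stand-ins, not code from this notebook.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # no change at prediction time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(123)
h = np.ones((10_000, 32))  # stand-in hidden-layer activations

h_train = dropout(h, rate=0.3, rng=rng, training=True)
h_test = dropout(h, rate=0.3, rng=rng, training=False)

print(f"Fraction zeroed in training: {(h_train == 0).mean():.3f}")
print(f"Mean activation in training: {h_train.mean():.3f}")
print(f"Unchanged at test time:      {np.array_equal(h_test, h)}")
```

The rescaling keeps the average activation about the same in training and prediction, so the next layer sees inputs on a consistent scale.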

### Improving Upon Our Best NN

With help from GeeksForGeeks: https://www.geeksforgeeks.org/what-is-batch-normalization-in-deep-learning/

To improve upon our best Neural Network so far, we built a new, slightly larger model with two hidden layers. The first hidden layer has 64 neurons, followed by Batch Normalization to help stabilize and speed up training. The second hidden layer has 32 neurons. We trained it on the same grid as the earlier models (50 or 100 epochs, batch size 16 or 32, learning rate 0.001) so that the results are directly comparable.
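
As a rough picture of what Batch Normalization computes during training, the NumPy sketch below standardizes each feature using the mean and variance of the current batch, then applies a learnable scale (`gamma`) and shift (`beta`). It is a simplified sketch: the real Keras layer also tracks moving averages for use at prediction time. The random input data is made up, and `eps=1e-3` mirrors the Keras default.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Training-mode batch normalization over the batch axis (axis 0)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # standardize each feature
    return gamma * x_hat + beta              # learnable rescale and shift


rng = np.random.default_rng(123)

# A batch of 256 rows with 3 features on very different scales (made-up data)
x = rng.normal(loc=[5.0, -3.0, 100.0], scale=[1.0, 10.0, 50.0], size=(256, 3))

gamma = np.ones(3)   # initial scale
beta = np.zeros(3)   # initial shift
out = batch_norm_train(x, gamma, beta)

print(out.mean(axis=0).round(3))
print(out.std(axis=0).round(3))
```

After normalization, every feature has a mean near 0 and a standard deviation near 1, regardless of its original scale.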

#### Model 6: Improved Network with Batch Norm
![Model 6](Diagrams/Model6.jpg)

In [126]:
results_model6 = grid_search_nn(
    model_architecture_fn=build_model_6,
    model_name="Model 6: Improved Network with Batch Norm",
    epochs_list=epochs_list,
    batch_sizes_list=batch_sizes_list,
    learning_rates_list=learning_rates_list
)

50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 6: Improved Network with Batch Norm (epochs=50, batch=16, lr=0.001) | Epochs: 50 | Batch Size: 16
Test MSE: 44054331392.00
Test MAE: 78002.59
Test R²: 0.283
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 6: Improved Network with Batch Norm (epochs=50, batch=32, lr=0.001) | Epochs: 50 | Batch Size: 32
Test MSE: 38779949056.00
Test MAE: 83923.19
Test R²: 0.369
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 6: Improved Network with Batch Norm (epochs=100, batch=16, lr=0.001) | Epochs: 100 | Batch Size: 16
Test MSE: 47651987456.00
Test MAE: 84262.73
Test R²: 0.225
----------------------------------------
50/50 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Model 6: Improved Network with Batch Norm (epochs=100, batch=32, lr=0.001) | Epochs: 100 |

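For our best model, we expanded the network to two hidden layers (64 neurons → 32 neurons) and added Batch Normalization after the first layer. Batch Normalization is intended to stabilize training by keeping the values passed to the next layer on a consistent scale. Model 6 achieved the best Neural Network result overall, an R² of 0.369 when trained for 50 epochs with a batch size of 32. Two caveats are worth noting. First, the lowest MAE among the runs shown (about 78,000) came from a different setting, 50 epochs with a batch size of 16, so the "best" run depends on the metric. Second, training for 100 epochs with a batch size of 16 scored lower (R² = 0.225) than 50 epochs (0.283), so more training did not help here. Because we picked the best setting using test-set scores, 0.369 is likely a slightly optimistic estimate. Even so, thoughtful architectural changes brought the Neural Network close to the performance of an ensemble method like Bagging.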

## Conclusion

In this activity, we explored different ways to predict wine prices using machine learning models, starting with a Bagging Regressor as a strong baseline and then developing six Neural Network architectures. Each network was built to show how a specific design choice (the number of neurons, the depth of the network, the activation function, dropout regularization, or batch normalization) affects model performance. Our best-performing Neural Network (Model 6), with two hidden layers and Batch Normalization, achieved an R² of 0.369, close to but still below the Bagging model's R² of 0.392. With thoughtful design and tuning, Neural Networks can be competitive with traditional ensemble methods on this task, although here the simpler Bagging model remained slightly ahead.
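
To put all the models side by side, the sketch below collects the best test R² for each one into a single table. The values are copied by hand from the printed outputs above rather than recomputed.

```python
import pandas as pd

# Best test R² per model, copied from the printed outputs in this notebook
summary = pd.DataFrame(
    [
        ("Bagging (baseline)", "max_depth=7, n_estimators=100", 0.392),
        ("Model 1: Small Simple Network", "epochs=100, batch=16", -0.151),
        ("Model 2: More Neurons", "epochs=100, batch=16", 0.015),
        ("Model 3: Deeper Network", "epochs=100, batch=16", 0.263),
        ("Model 4: Tanh Activation", "epochs=100, batch=16", -0.310),
        ("Model 5: Dropout Regularization", "epochs=100, batch=16", 0.252),
        ("Model 6: Batch Norm", "epochs=50, batch=32", 0.369),
    ],
    columns=["Model", "Best setting", "Test R2"],
).sort_values("Test R2", ascending=False, ignore_index=True)

print(summary.to_string(index=False))
```

Sorted this way, the table shows Bagging first, Model 6 second, and the two-hidden-layer networks (Models 3 and 5) ahead of every single-hidden-layer network.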
