# Predicting Wine Price

Please review the following site for information on our dataset of interest: https://www.kaggle.com/datasets/dev7halo/wine-information

Your goal is to use the other variables in the dataset to predict wine price. Feel free to use only a subset of the variables.

Assignment Specs

You should compare Neural Networks as we discussed this week to at least one of our previous models from this quarter.
A secondary goal of this assignment is to test the effects of the neural network function(s) arguments on the algorithm's performance. 
You should explore at least 5 different sets of settings for the function inputs, and you should do your best to find values for these inputs that actually change the results of your modelling. That is, try not to run three different sets of inputs that result in the same performance. The goal here is for you to better understand how to set these input values yourself in the future. Comment on what you discover about these inputs and how they behave.
Additionally, I'd like you to include pictures of the network architecture for each of the neural network models you run. You may hand-draw them and insert pictures into your submitted files if you wish. You may also use software (e.g. draw.io) to create nice looking diagrams. I want you to become intimately familiar with these types of models and what they look like.
Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.
Again, submit an HTML, ipynb, or Colab link. Be sure to rerun your entire notebook fresh before submitting!

In [2]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, KBinsDiscretizer
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import AdaBoostClassifier, BaggingRegressor, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor, MLPClassifier
from xgboost import XGBClassifier

In [3]:
wine = pd.read_csv("cleansingWine.csv")

# select columns
wine = wine[["id", "name", "producer", "nation", "local1", "varieties1", 
"type", "use", "abv", "degree", "sweet", "acidity", "body", "year", "ml", "price"]]

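# drop rows where price is zero; .copy() gives an independent frame,
# so the price assignment below does not act on a slice of the original
wine = wine[wine["price"] != 0].copy()

# convert to USD using a fixed exchange rate
wine["price"] = wine["price"] * 0.000703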



## Neural Network

In [6]:
X = wine.drop(columns=['price'])

# Drop columns that are not used as features
# (errors='ignore' skips names such as 'tannin' and 'varieties2' that are not in the frame)
X = X.drop(columns=['name', 'producer', 'type','acidity', 'tannin', 'year', 'ml', 'varieties1', 'varieties2'], errors='ignore')
y = np.log1p(wine['price'])

# Identify column types by dtype
categorical_cols = X.select_dtypes(include='object').columns.tolist()
numerical_cols = X.select_dtypes(include='number').columns.tolist()

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing transformer
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_cols),
    ('num', StandardScaler(), numerical_cols)
])

# Model configurations
model_configs = [
    {"name": "Model_1", "hidden_layer_sizes": (1,), "activation": "relu", "solver": "adam", "learning_rate_init": 0.001},
    {"name": "Model_2", "hidden_layer_sizes": (5,), "activation": "tanh", "solver": "adam", "learning_rate_init": 0.005},
    {"name": "Model_3", "hidden_layer_sizes": (10,), "activation": "logistic", "solver": "sgd", "learning_rate_init": 0.01},
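    # Note: learning_rate_init is only used by the 'sgd' and 'adam' solvers, so it has no effect for 'lbfgs'
    {"name": "Model_4", "hidden_layer_sizes": (50,), "activation": "relu", "solver": "lbfgs", "learning_rate_init": 0.01},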
    {"name": "Model_5", "hidden_layer_sizes": (100,), "activation": "tanh", "solver": "adam", "learning_rate_init": 0.0001}
]

# Run and evaluate models
results = []
for config in model_configs:
    pipe = Pipeline([
        ('preprocess', preprocessor),
        ('regressor', MLPRegressor(
            hidden_layer_sizes=config['hidden_layer_sizes'],
            activation=config['activation'],
            solver=config['solver'],
            learning_rate_init=config['learning_rate_init'],
            max_iter=1000,
            random_state=1
        ))
    ])
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)

    results.append({
        "Model": config["name"],
        "MSE": mean_squared_error(y_test, preds),
        "R2 Score": r2_score(y_test, preds)
    })

# Convert results to DataFrame
results_df = pd.DataFrame(results)
print(results_df)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


     Model       MSE  R2 Score
0  Model_1  0.466751  0.523148
1  Model_2  0.463477  0.526493
2  Model_3  0.458019  0.532069
3  Model_4  0.630257  0.356104
4  Model_5  0.467810  0.522067


## Network Architecture

![Network architecture diagram](network-architecture.jpeg)
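
To go with the diagram, here is a minimal sketch that tallies the layer sizes and the number of trainable parameters (weights plus biases) for each of the five hidden-layer settings. The helper `count_parameters` and the input width `N_INPUTS = 20` are hypothetical: the real input width is however many columns the preprocessor produces after one-hot encoding, which is not shown in this notebook.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Count weights and biases in a fully connected feed-forward network."""
    layer_sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # one weight per connection, plus one bias per unit in the next layer
        total += fan_in * fan_out + fan_out
    return total


# Hypothetical input width (assumption); replace with the real number of
# preprocessed feature columns.
N_INPUTS = 20

hidden_sizes = {
    "Model_1": (1,),
    "Model_2": (5,),
    "Model_3": (10,),
    "Model_4": (50,),
    "Model_5": (100,),
}

for name, hidden in hidden_sizes.items():
    layers = [N_INPUTS, *hidden, 1]
    print(name, "layers:", layers, "parameters:", count_parameters(N_INPUTS, hidden))
```

After fitting a pipeline, the actual layer shapes could be checked from the fitted regressor's `coefs_` attribute, for example `[w.shape for w in pipe.named_steps['regressor'].coefs_]`.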

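None of the models achieved strong predictive performance: R² on the log-transformed price ranged from 0.36 to 0.53, so even the best model explains only about half of the variance. Model_3 (10 hidden units, logistic activation, SGD solver) performed best with an R² of 0.53, but Models 1, 2, and 5 were within about 0.01 of it despite ranging from 1 to 100 hidden units, which suggests that network width had little effect with these features. Model_4 was the clear outlier with an R² of 0.36. It is the only configuration using the lbfgs solver, which stopped at the iteration limit, as the warning above shows.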
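
One detail about Model_4 is worth checking directly. scikit-learn documents `learning_rate_init` as being used only by the `sgd` and `adam` solvers, so changing it should make no difference under `lbfgs`. The sketch below tests this on small synthetic data (not the wine data); `X_demo`, `y_demo`, and `fit_predict` are made-up names for illustration.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

# Small synthetic regression problem (assumption: stands in for the wine data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)


def fit_predict(solver, learning_rate_init):
    """Fit a one-hidden-layer network and return its training-set predictions."""
    model = MLPRegressor(
        hidden_layer_sizes=(10,),
        solver=solver,
        learning_rate_init=learning_rate_init,
        max_iter=200,
        random_state=1,
    )
    with warnings.catch_warnings():
        # Silence convergence warnings so the comparison output stays readable
        warnings.simplefilter("ignore", ConvergenceWarning)
        model.fit(X_demo, y_demo)
    return model.predict(X_demo)


# Same solver and seed, two very different learning rates
lbfgs_same = np.allclose(fit_predict("lbfgs", 0.0001), fit_predict("lbfgs", 0.01))
print("lbfgs predictions identical across learning rates:", lbfgs_same)
```

If the predictions match, the settings that matter for an lbfgs model are the ones it actually reads, such as `hidden_layer_sizes`, `activation`, and `max_iter`.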

## Boosting Model

In [7]:
from sklearn.ensemble import GradientBoostingRegressor

# Build a gradient boosting pipeline that reuses the same preprocessing
boosting_model = Pipeline([
    ('preprocess', preprocessor),
    ('regressor', GradientBoostingRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=1
    ))
])

# Fit and evaluate
boosting_model.fit(X_train, y_train)
boosting_preds = boosting_model.predict(X_test)

# Append results
results.append({
    "Model": "GradientBoosting",
    "MSE": mean_squared_error(y_test, boosting_preds),
    "R2 Score": r2_score(y_test, boosting_preds)
})

# Updated results DataFrame
results_df = pd.DataFrame(results)
print(results_df)


              Model       MSE  R2 Score
0           Model_1  0.466751  0.523148
1           Model_2  0.463477  0.526493
2           Model_3  0.458019  0.532069
3           Model_4  0.630257  0.356104
4           Model_5  0.467810  0.522067
5  GradientBoosting  0.473744  0.516004


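The GradientBoosting model performs similarly to the neural networks, with an MSE of 0.474 and an R² of 0.516 on the log-transformed price. That places it slightly behind four of the five neural networks (R² of 0.522 to 0.532) and well ahead of Model_4 (0.356), so it is competitive but does not beat the best neural network, Model_3.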
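
Because the target is `np.log1p(price)`, the MSE values in the table are in log units, which are hard to explain to a non-expert audience. A minimal sketch of how the predictions could be converted back to dollars with `np.expm1` before measuring error is shown below. The helper `usd_errors` and the example prices are made up for illustration and are not results from this notebook.

```python
import numpy as np


def usd_errors(y_true_log, y_pred_log):
    """Undo the log1p transform and report absolute errors in dollars."""
    y_true = np.expm1(np.asarray(y_true_log, dtype=float))
    y_pred = np.expm1(np.asarray(y_pred_log, dtype=float))
    abs_err = np.abs(y_true - y_pred)
    return {
        "MAE_USD": float(abs_err.mean()),
        "MedianAE_USD": float(np.median(abs_err)),
    }


# Made-up prices in USD (assumption, for illustration only)
true_prices = np.array([10.0, 25.0, 80.0])
pred_prices = np.array([12.0, 20.0, 100.0])

metrics = usd_errors(np.log1p(true_prices), np.log1p(pred_prices))
print(metrics)
```

With the notebook's own objects, the same helper could be called as `usd_errors(y_test, preds)` or `usd_errors(y_test, boosting_preds)`.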