# PA - Neural Networks

Your goal is to use the other variables in the dataset to predict wine price. Feel free to use only a subset of the variables.

## Assignment Specs

You should compare Neural Networks as we discussed this week to at least one of our previous models from this quarter.
A secondary goal of this assignment is to test the effects of the neural network function(s) arguments on the algorithm's performance. 
You should explore at least 5 different sets of settings for the function inputs, and you should do your best to find values for these inputs that actually change the results of your modelling. That is, try not to run three different sets of inputs that result in the same performance. The goal here is for you to better understand how to set these input values yourself in the future. Comment on what you discover about these inputs and how they behave.
Additionally, I'd like you to include pictures of the network architecture for each of the neural network models you run. You may hand-draw them and insert pictures into your submitted files if you wish. You may also use software (e.g. draw.io) to create nice looking diagrams. I want you to become intimately familiar with these types of models and what they look like.
Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.
Again, submit an HTML, ipynb, or Colab link. Be sure to rerun your entire notebook fresh before submitting!

In [None]:
import pandas as pd
wine = pd.read_csv("/In-Class/Week-4/cleansingWine.csv")
#wine_info = pd.read_csv(r"C:\Users\achur\OneDrive\Desktop\School\CP Spring 2024\545\GSB545\Labs\wine_info.csv")



In [15]:
# clean data
# optional: drop the extra variety and locality columns (currently disabled)
#extra_cols = [f'varieties{i}' for i in range(3, 13)] + [f'local{i}' for i in range(2, 5)]
#wine = wine.drop(columns=extra_cols)

# convert price to numeric
wine['price'] = pd.to_numeric(wine['price'], errors='coerce')
wine.loc[wine['price'] == 0, 'price'] = pd.NA

# convert to USD using a fixed rate of 1,300 units of the original currency per dollar
wine['price_usd'] = wine['price'] / 1300

# clean white space
str_cols = wine.select_dtypes(include='object').columns
wine[str_cols] = wine[str_cols].apply(lambda x: x.str.strip())

# drop na in price
wine = wine.dropna(subset=['price_usd'])

### Bagging Regression
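
No bagging code appears under this heading. As a hedged sketch of what such a comparison baseline might look like, the block below fits scikit-learn's `BaggingRegressor` (which bags decision trees by default) on a small synthetic dataset. The synthetic data, the helper name `fit_bagging_baseline`, and the choice of 50 estimators are all assumptions for illustration, not settings or results from this notebook.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_bagging_baseline(X, y, n_estimators=50, random_state=42):
    """Fit a bagged-trees regressor and return (model, mse, r2) on a held-out 20% split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    model = BaggingRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return model, mean_squared_error(y_test, preds), r2_score(y_test, preds)


# Synthetic stand-in data (hypothetical): 500 rows, 3 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=500)

bag_model, bag_mse, bag_r2 = fit_bagging_baseline(X_demo, y_demo)
print(f"Bagging baseline  MSE: {bag_mse:.3f}  R2: {bag_r2:.3f}")
```

To try this on the wine data, the function would need a fully numeric feature matrix (for example, the categorical columns one-hot encoded first) together with the log-price target, so that its MSE and R2 are comparable to the neural network results.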

### Neural Network Regression
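
As a rough aid for drawing the architecture diagrams the assignment asks for, the sketch below lists the layer widths of a fully connected network and counts its weights and biases. The helper `describe_mlp` and the input width of 10 are hypothetical; the real input width depends on how many columns the one-hot encoding produces.

```python
def describe_mlp(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Return (layer_sizes, n_parameters) for a fully connected network."""
    layer_sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    n_parameters = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Each layer has one weight per input-output pair plus one bias per output unit.
        n_parameters += fan_in * fan_out + fan_out
    return layer_sizes, n_parameters


# Hypothetical example: 10 input features feeding a single hidden unit.
layers, n_params = describe_mlp(n_inputs=10, hidden_layer_sizes=(1,))
print(" -> ".join(str(size) for size in layers), f"({n_params} parameters)")
```

With `hidden_layer_sizes=(1,)`, every input funnels through one hidden unit, so the diagram is a fan-in to a single node followed by a single output node.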

In [16]:
#pip install scikit-learn  # the models below use scikit-learn's MLPRegressor; TensorFlow is not needed

In [17]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Drop rows with missing target
#cleansing_wine = cleansing_wine.dropna(subset=['price_usd'])

# Use numeric columns only
wine_data = wine.select_dtypes(include=[np.number]).dropna()

# Define features and target
#X = wine_data.drop(columns=['price', 'price_usd'])
#y = wine_data['price_usd']

# Scale features
#scaler = StandardScaler()
#X_scaled = scaler.fit_transform(X)

# Split data
#X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=1)


In [20]:
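from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, r2_score

X = wine.drop(columns=['price', 'price_usd'])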

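# Drop columns excluded from the model (high-cardinality text fields and other unused columns)
X = X.drop(columns=['name', 'producer', 'type','acidity', 'tannin', 'year', 'ml', 'varieties1', 'varieties2'], errors='ignore')
# Target is log(1 + price_usd), so the metrics reported below are on the log scale
y = np.log1p(wine['price_usd'])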

# Identify categorical and numerical columns by dtype
categorical_cols = X.select_dtypes(include='object').columns.tolist()
numerical_cols = X.select_dtypes(include='number').columns.tolist()

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing transformer
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_cols),
    ('num', StandardScaler(), numerical_cols)
])

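# Model configurations
# Note: learning_rate_init only affects the 'sgd' and 'adam' solvers, so it has no effect on Model_4 (lbfgs)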
model_configs = [
    {"name": "Model_1", "hidden_layer_sizes": (1,), "activation": "relu", "solver": "adam", "learning_rate_init": 0.001},
    {"name": "Model_2", "hidden_layer_sizes": (1,), "activation": "tanh", "solver": "adam", "learning_rate_init": 0.005},
    {"name": "Model_3", "hidden_layer_sizes": (1,), "activation": "logistic", "solver": "sgd", "learning_rate_init": 0.01},
    {"name": "Model_4", "hidden_layer_sizes": (1,), "activation": "relu", "solver": "lbfgs", "learning_rate_init": 0.01},
    {"name": "Model_5", "hidden_layer_sizes": (1,), "activation": "tanh", "solver": "adam", "learning_rate_init": 0.0001}
]

# Run and evaluate models
results = []
for config in model_configs:
    pipe = Pipeline([
        ('preprocess', preprocessor),
        ('regressor', MLPRegressor(
            hidden_layer_sizes=config['hidden_layer_sizes'],
            activation=config['activation'],
            solver=config['solver'],
            learning_rate_init=config['learning_rate_init'],
            max_iter=1000,
            random_state=1
        ))
    ])
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)

    results.append({
        "Model": config["name"],
        "MSE": mean_squared_error(y_test, preds),
        "R2 Score": r2_score(y_test, preds)
    })

# Convert results to DataFrame
results_df = pd.DataFrame(results)
print(results_df)

     Model       MSE  R2 Score
0  Model_1  0.400001  0.592712
1  Model_2  0.415830  0.576594
2  Model_3  0.982141 -0.000035
3  Model_4  0.402055  0.590620
4  Model_5  0.401217  0.591474
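
Because the target was transformed with `np.log1p`, the MSE values above are in log units rather than dollars. The sketch below shows one way to map predictions back with `np.expm1` and report an error in USD. The arrays are made-up stand-ins for `y_test` and `preds`, not values from this notebook.

```python
import numpy as np

# Made-up prices in USD (hypothetical stand-ins for the real test set).
true_usd = np.array([10.0, 25.0, 40.0, 120.0])
y_test_log = np.log1p(true_usd)  # same transform as the notebook's target

# Pretend the model is off by a constant +0.1 on the log scale.
preds_log = y_test_log + 0.1

# Undo log1p with expm1 to get predictions in dollars.
preds_usd = np.expm1(preds_log)

mse_log = np.mean((y_test_log - preds_log) ** 2)
mae_usd = np.mean(np.abs(true_usd - preds_usd))
print(f"MSE (log scale): {mse_log:.4f}")
print(f"MAE (USD):       {mae_usd:.2f}")
```

A constant error on the log scale corresponds to a roughly constant percentage error in dollars, so expensive bottles contribute larger dollar errors than cheap ones.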
