
# Kaggle Competition

In this part of the lab, you will train your own machine learning models to predict the bitterness of a beer.

Remember that the data is available from two sources:

- on [Kaggle](https://www.kaggle.com/competitions/beer2024/data) (You will need to be signed in and to have joined the Kaggle competition to view this link.)
- on [GitHub](https://github.com/dlsun/pods/tree/master/data/beer) (direct links: https://dlsun.github.io/pods/data/beer/beer_train.csv and https://dlsun.github.io/pods/data/beer/beer_test.csv)

A description of the variables is available [on Kaggle](https://www.kaggle.com/competitions/beer2024/data).

You will submit the models that you develop in this lab to a Kaggle competition. [Use this link to join the Kaggle competition](https://www.kaggle.com/t/fc19bcab992642cb882edc7bdfbba3eb).





In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler, MaxAbsScaler, Normalizer
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.linear_model import LinearRegression, Ridge, Lars, ElasticNet, Lasso
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import LinearSVR
from sklearn.ensemble import GradientBoostingRegressor, BaggingRegressor
from sklearn.exceptions import DataConversionWarning
from sklearn.tree import DecisionTreeRegressor
import warnings

In [None]:
train_df = pd.read_csv("https://dlsun.github.io/pods/data/beer/beer_train.csv")
train_df

Unnamed: 0,id,abv,available,description,glass,ibu,isOrganic,name,originalGravity,srm
0,0,8.2,"Available at the same time of year, every year.",A Belgian-Abbey-Style Tripel that is big in al...,,31.0,N,LoonyToonTripel,1.070,8
1,1,5.7,"Available at the same time of year, every year.",Covert Hops is a crafty ale. Its stealthy dark...,Pint,45.0,N,Covert Hops,1.056,35
2,2,5.8,"Available at the same time of year, every year.",This is a traditional German-style Marzen char...,Mug,25.0,N,Oktoberfest,1.048,10
3,3,5.5,Available year round as a staple beer.,A West Coast-Style Pale Ale balancing plenty o...,Pint,55.0,N,Pale Ale,1.044,5
4,4,4.8,Available year round as a staple beer.,This Bombshell has a tantalizing crisp and cle...,Pint,11.4,N,Head Turner Blonde Ale,1.045,3
...,...,...,...,...,...,...,...,...,...,...
5995,5995,5.5,Available year round as a staple beer.,"Taking its cues from “Three Threads”, a barten...",Pint,33.0,N,Mayflower Porter,1.040,Over 40
5996,5996,11.0,"Available at the same time of year, every year.",Our barley wine is what would be considered an...,,30.0,N,Barbieswine,1.085,20
5997,5997,5.2,Available year round as a staple beer.,Our version of a west coast pale ale pushes th...,Pint,45.0,N,Canuck Pale Ale,1.044,8
5998,5998,6.2,Limited availability.,Aquarius White IPA combines the phenolic and e...,,50.0,N,Aquarius,1.044,2


## Question 1

Fit at least **three** different machine learning models that are different from the models that you fit in Part A of this lab.

Models are considered "different" if they use:

- different sets of input features,
- different preprocessing, or
- different machine learning methods/algorithms.

You are encouraged to try varying all three!

For each of your three models, you should do a grid search to find the best hyperparameters.

# Model 1: Linear Support Vector Regression (RMSE 21.15)

## Step 1: evaluate best combination of features

In [None]:
# Select prediction features
X_train_1 = train_df[["abv", "name"]]
X_train_2 = train_df[["abv", "available", "name", "glass", "description"]].fillna("")
X_train_3 = train_df[["abv", "available", "name", "glass", "description", "srm"]].fillna("")
y_train = train_df[["ibu"]]
y_train = y_train.values.ravel()

In [None]:
# make column transformer for each features group
col_transformer_1 = make_column_transformer(
    (StandardScaler(), ["abv"]),
    (TfidfVectorizer(max_features=100), "name"),
    remainder="drop"
    )

col_transformer_2 = make_column_transformer(
    (StandardScaler(), ["abv"]),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["available", "glass"]),
    (TfidfVectorizer(max_features=100), "name"),
    remainder="drop"
  )

col_transformer_3 = make_column_transformer(
    (StandardScaler(), ["abv"]),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["available", "glass", "srm"]),
    (TfidfVectorizer(max_features=100), "name"),
    (TfidfVectorizer(max_features=100), "description"),
    remainder="drop"
  )

In [None]:
#make pipeline for each features group
pipe1 = make_pipeline(
    col_transformer_1,
    LinearSVR()
)

pipe2 = make_pipeline(
    col_transformer_2,
    LinearSVR()
)

pipe3 = make_pipeline(
    col_transformer_3,
    LinearSVR()
)

In [None]:
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

#calculate RMSE for each features group
train_list = [X_train_1, X_train_2, X_train_3]
pipe_list = [pipe1, pipe2, pipe3]
features_list = ["Group 1", "Group 2", "Group 3"]

def calculate_RMSE(name, pipeline, training_data):
  scores = cross_val_score(
    pipeline,
    X=training_data,
    y=y_train,
    cv=15,
    scoring="neg_mean_squared_error"
  )
  print(name + " RMSE: ", np.sqrt(-scores.mean()))
  return

for i in range(3):
  calculate_RMSE(features_list[i], pipe_list[i], train_list[i])

Group 1 RMSE:  23.27987743419027
Group 2 RMSE:  23.161141354967096
Group 3 RMSE:  21.39284098916195


Conclusion: feature group 3 has the lowest cross-validated RMSE (21.39), so I use it for the grid search in Step 2.

## Step 2: determine best combination of parameters

In [None]:
def estimate_test_mse(scaler):
  # Grid-search the LinearSVR hyperparameters with the given scaler applied to "abv"
  col_transformer = make_column_transformer(
    (scaler, ["abv"]),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["available", "glass", "srm"]),
    (TfidfVectorizer(max_features=100), "name"),
    (TfidfVectorizer(max_features=100), "description"),
    remainder="drop"
  )

  pipeline = make_pipeline(
    col_transformer,
    LinearSVR()
  )

  grid_cv = GridSearchCV(
    pipeline,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
    param_grid={
        "linearsvr__loss": ["epsilon_insensitive", "squared_epsilon_insensitive"],
        "linearsvr__epsilon": [0, 0.5, 1],
        "linearsvr__fit_intercept": [True, False]
    },
    cv=10,
  )
  grid_cv.fit(X_train_3, y_train)
  return grid_cv

# Run one grid search per scaler and collect the fitted GridSearchCV objects
scalers = [StandardScaler(), MinMaxScaler(), MaxAbsScaler(), Normalizer()]
grid_cv_list_LSVR = []
for scaler in scalers:
  grid_cv_list_LSVR.append(estimate_test_mse(scaler))
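
One caveat about the scaler list above: `Normalizer` rescales each row (sample) to unit norm rather than each column, so on a single-column input such as `["abv"]` every nonzero value becomes 1. A minimal sketch on made-up values (not the beer data) illustrates the difference from a column-wise scaler:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Made-up single-column values standing in for "abv" (illustrative only).
abv = np.array([[8.2], [5.7], [4.8]])

# Normalizer divides each row by its own L2 norm, so a one-column row x becomes x / |x|.
row_normalized = Normalizer().fit_transform(abv)

# StandardScaler works column-wise and keeps the relative differences between rows.
col_scaled = StandardScaler().fit_transform(abv)

print(row_normalized.ravel())
print(col_scaled.ravel())
```

In the pipelines where the scaler sees only `abv`, the `Normalizer()` run therefore carries no ABV information, which is worth keeping in mind when comparing its score with the other scalers.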


In [None]:
#extract best pipeline from grid_cv list
best_grid_LSVR = max(grid_cv_list_LSVR, key=lambda x: x.best_score_)
best_pipeline_LSVR = best_grid_LSVR.best_estimator_
best_pipeline_LSVR

In [None]:
#calculate error with calculate_RMSE (defined in Step 1)
calculate_RMSE("Best LinearSVR combination", best_pipeline_LSVR, X_train_3)

Best LinearSVR combination RMSE:  21.151924524819197


# Model 2: Ridge Regression (RMSE 21.15)

## Step 1: evaluate best combination of features

In [None]:
# Select prediction features
X_train_1 = train_df[["abv", "name"]]
X_train_2 = train_df[["abv", "available", "name", "glass", "description", "srm"]].fillna("")
X_train_3 = train_df[["abv", "available", "name", "glass", "description"]].fillna("")
y_train = train_df[["ibu"]]
y_train = y_train.values.ravel()

In [None]:
# make column transformer for each features group

col_transformer_1 = make_column_transformer(
    (StandardScaler(), ["abv"]),
    (TfidfVectorizer(max_features=100), "name"),
    remainder="drop"
    )

col_transformer_2 = make_column_transformer(
    (StandardScaler(), ["abv"]),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["available", "glass", "srm"]),
    (TfidfVectorizer(max_features=100), "name"),
    remainder="drop"
  )

col_transformer_3 = make_column_transformer(
    (StandardScaler(), ["abv"]),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["available", "glass"]),
    (TfidfVectorizer(max_features=100), "name"),
    (TfidfVectorizer(max_features=100), "description"),
    remainder="drop"
  )

In [None]:
#make pipeline for each features group
pipe1 = make_pipeline(
    col_transformer_1,
    Ridge()
)

pipe2 = make_pipeline(
    col_transformer_2,
    Ridge()
)

pipe3 = make_pipeline(
    col_transformer_3,
    Ridge()
)

In [None]:
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

#calculate RMSE for each features group
train_list = [X_train_1, X_train_2, X_train_3]
pipe_list = [pipe1, pipe2, pipe3]
features_list = ["Group 1", "Group 2", "Group 3"]

def calculate_RMSE(name, pipeline, training_data):
  scores = cross_val_score(
    pipeline,
    X=training_data,
    y=y_train,
    cv=15,
    scoring="neg_mean_squared_error"
  )
  print(name + " RMSE: ", np.sqrt(-scores.mean()))
  return

for i in range(3):
  calculate_RMSE(features_list[i], pipe_list[i], train_list[i])

Group 1 RMSE:  22.907959251719415
Group 2 RMSE:  22.51290500833821
Group 3 RMSE:  21.201501474588166


Conclusion: feature group 3 has the lowest cross-validated RMSE (21.20), so I use it for the grid search in Step 2.

## Step 2: determine best combination of parameters

In [None]:
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

def estimate_test_mse(scaler):
  # Grid-search the Ridge hyperparameters with the given scaler applied to "abv"
  col_transformer = make_column_transformer(
    (scaler, ["abv"]),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["available", "glass"]),
    (TfidfVectorizer(max_features=100), "name"),
    (TfidfVectorizer(max_features=100), "description"),
    remainder="drop"
  )

  pipeline = make_pipeline(
    col_transformer,
    Ridge()
  )

  grid_cv = GridSearchCV(
    pipeline,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
    param_grid={
        "ridge__alpha": [1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20,30,35,40,45,50,55,100],
        "ridge__solver": ["auto"]
    },
    cv=10,
  )
  grid_cv.fit(X_train_3, y_train)
  return grid_cv

# Run one grid search per scaler and collect the fitted GridSearchCV objects
scalers = [StandardScaler(), MinMaxScaler(), MaxAbsScaler(), Normalizer()]
grid_cv_list_RR = []
for scaler in scalers:
  grid_cv_list_RR.append(estimate_test_mse(scaler))

In [None]:
#extract best pipeline from grid_cv list
best_grid_cv = max(grid_cv_list_RR, key=lambda x: x.best_score_)
best_pipeline_RR = best_grid_cv.best_estimator_
best_pipeline_RR

Best settings found by the grid search:


*   StandardScaler()
*   alpha=10
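
The scaler and `alpha` reported above could also be read from a single grid search by treating the scaler itself as a hyperparameter. The sketch below is one way to do that, not the code used for the results in this notebook: it runs on a small synthetic data set, and the step names `prep`, `num`, `cat`, and `model` are names I chose for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Small synthetic stand-in for the beer data (all values are made up).
rng = np.random.default_rng(0)
n = 60
toy_X = pd.DataFrame({
    "abv": rng.uniform(3, 12, size=n),
    "glass": rng.choice(["Pint", "Mug", ""], size=n),
})
toy_y = 5 * toy_X["abv"].to_numpy() + rng.normal(0, 2, size=n)

# Explicit step names make the grid keys easy to read.
pipeline = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["abv"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["glass"]),
    ])),
    ("model", Ridge()),
])

grid = GridSearchCV(
    pipeline,
    param_grid={
        "prep__num": [StandardScaler(), MinMaxScaler()],  # swap the whole scaler
        "model__alpha": [0.1, 1, 10],
    },
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(toy_X, toy_y)

# best_params_ reports the chosen scaler and alpha together.
print(grid.best_params_)
print("CV RMSE:", np.sqrt(-grid.best_score_))
```

Searching this way compares every scaler and `alpha` pair on the same folds, and `grid.best_params_` records the winning combination without having to inspect the printed pipeline.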



In [None]:
#calculate error with calculate_RMSE (defined in Step 1)
calculate_RMSE("Best Ridge Regression combination", best_pipeline_RR, X_train_3)

Best Ridge Regression combination RMSE:  21.151270773480896


# Model 3: Linear Regression (RMSE 21.23)

## Step 1: evaluate best combination of features

In [None]:
# Select prediction features
X_train_1 = train_df[["abv", "name"]]
X_train_2 = train_df[["abv", "available", "name", "glass", "description", "originalGravity"]].fillna("")
X_train_3 = train_df[["abv", "available", "name", "glass", "srm", "description"]].fillna("")
y_train = train_df[["ibu"]]
y_train = y_train.values.ravel()



In [None]:
# make column transformer for each features group

col_transformer_1 = make_column_transformer(
    (StandardScaler(), ["abv"]),
    (TfidfVectorizer(max_features=100), "name"),
    remainder="drop"
    )

col_transformer_2 = make_column_transformer(
    (StandardScaler(), ["abv", "originalGravity"]),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["available", "glass"]),
    (TfidfVectorizer(max_features=100), "name"),
    remainder="drop"
  )

col_transformer_3 = make_column_transformer(
    (StandardScaler(), ["abv"]),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["available", "glass"]),
    (TfidfVectorizer(max_features=100), "name"),
    (TfidfVectorizer(max_features=100), "srm"),
    (TfidfVectorizer(max_features=100), "description"),
    remainder="drop"
  )

In [None]:
#make pipeline for each features group
pipe1 = make_pipeline(
    col_transformer_1,
    LinearRegression()
)

pipe2 = make_pipeline(
    col_transformer_2,
    LinearRegression()
)

pipe3 = make_pipeline(
    col_transformer_3,
    LinearRegression()
)

In [None]:
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

#calculate RMSE for each features group
train_list = [X_train_1, X_train_2, X_train_3]
pipe_list = [pipe1, pipe2, pipe3]
features_list = ["Group 1", "Group 2", "Group 3"]

def calculate_RMSE(name, pipeline, training_data):
  scores = cross_val_score(
    pipeline,
    X=training_data,
    y=y_train,
    cv=15,
    scoring="neg_mean_squared_error"
  )
  print(name + " RMSE: ", np.sqrt(-scores.mean()))
  return

for i in range(3):
  calculate_RMSE(features_list[i], pipe_list[i], train_list[i])

Group 1 RMSE:  22.917630522219117
Group 2 RMSE:  22.82881339173775
Group 3 RMSE:  21.230687870321777


## Step 2: find best parameters

In [None]:
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

def estimate_test_mse(scaler):
  # Grid-search the LinearRegression options with the given scaler applied to "abv"
  col_transformer = make_column_transformer(
    (scaler, ["abv"]),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["available", "glass"]),
    (TfidfVectorizer(max_features=100), "name"),
    (TfidfVectorizer(max_features=100), "srm"),
    (TfidfVectorizer(max_features=100), "description"),
    remainder="drop"
  )

  pipeline = make_pipeline(
    col_transformer,
    LinearRegression()
  )

  grid_cv = GridSearchCV(
    pipeline,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
    param_grid={
        "linearregression__fit_intercept": [True, False]
    },
    cv=10,
  )
  grid_cv.fit(X_train_3, y_train)
  return grid_cv

# Run one grid search per scaler and collect the fitted GridSearchCV objects
scalers = [StandardScaler(), MinMaxScaler(), MaxAbsScaler(), Normalizer()]
grid_cv_list_LR = []
for scaler in scalers:
  grid_cv_list_LR.append(estimate_test_mse(scaler))

In [None]:
#extract best pipeline from grid_cv list
best_grid_cv = max(grid_cv_list_LR, key=lambda x: x.best_score_)
best_pipeline_LR = best_grid_cv.best_estimator_
best_pipeline_LR

In [None]:
#calculate error with calculate_RMSE (defined in Step 1)
calculate_RMSE("Best Linear Regression combination", best_pipeline_LR, X_train_3)

Best Linear Regression combination RMSE:  21.230687870321777


# Model 4: Elastic Net (RMSE 24.37)

## Step 1: evaluate best combination of features

In [None]:
# Select prediction features
X_train_1 = train_df[["abv", "name"]]
X_train_2 = train_df[["abv", "available", "name", "glass", "description", "originalGravity"]].fillna("")
X_train_3 = train_df[["abv", "available", "name", "glass", "description"]].fillna("")
y_train = train_df[["ibu"]]
y_train = y_train.values.ravel()

In [None]:
# make column transformer for each features group

col_transformer_1 = make_column_transformer(
    (StandardScaler(), ["abv"]),
    (TfidfVectorizer(max_features=100), "name"),
    remainder="drop"
    )

col_transformer_2 = make_column_transformer(
    (StandardScaler(), ["abv", "originalGravity"]),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["available", "glass"]),
    (TfidfVectorizer(max_features=100), "name"),
    remainder="drop"
  )

col_transformer_3 = make_column_transformer(
    (StandardScaler(), ["abv"]),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["available", "glass"]),
    (TfidfVectorizer(max_features=100), "name"),
    (TfidfVectorizer(max_features=100), "description"),
    remainder="drop"
  )

In [None]:
#make pipeline for each features group
pipe1 = make_pipeline(
    col_transformer_1,
    ElasticNet()
)

pipe2 = make_pipeline(
    col_transformer_2,
    ElasticNet()
)

pipe3 = make_pipeline(
    col_transformer_3,
    ElasticNet()
)

In [None]:
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

#calculate RMSE for each features group
train_list = [X_train_1, X_train_2, X_train_3]
pipe_list = [pipe1, pipe2, pipe3]
features_list = ["Group 1", "Group 2", "Group 3"]

def calculate_RMSE(name, pipeline, training_data):
  scores = cross_val_score(
    pipeline,
    X=training_data,
    y=y_train,
    cv=4,
    scoring="neg_mean_squared_error"
  )
  print(name + " RMSE: ", np.sqrt(-scores.mean()))
  return

for i in range(3):
  calculate_RMSE(features_list[i], pipe_list[i], train_list[i])

Group 1 RMSE:  25.671682209826745
Group 2 RMSE:  25.48537859979887
Group 3 RMSE:  25.601512736294595


## Step 2: determine best combination of parameters

In [None]:
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

def estimate_test_mse(scaler):
  # Grid-search the ElasticNet hyperparameters with the given scaler applied to the numeric columns
  col_transformer = make_column_transformer(
    (scaler, ["abv", "originalGravity"]),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["available", "glass"]),
    (TfidfVectorizer(max_features=100), "name"),
    remainder="drop"
  )


  pipeline = make_pipeline(
    col_transformer,
    ElasticNet()
  )

  grid_cv = GridSearchCV(
    pipeline,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
    param_grid={
        "elasticnet__alpha": [1, 1.5, 2],
        "elasticnet__l1_ratio": [0, 0.5, 1],
    },
    cv=10,
  )
  grid_cv.fit(X_train_2, y_train)
  return grid_cv

# Run one grid search per scaler and collect the fitted GridSearchCV objects
scalers = [StandardScaler(), MinMaxScaler(), MaxAbsScaler(), Normalizer()]
grid_cv_list_EN = []
for scaler in scalers:
  grid_cv_list_EN.append(estimate_test_mse(scaler))

In [None]:
#extract best pipeline from grid_cv list
best_grid_cv = max(grid_cv_list_EN, key=lambda x: x.best_score_)
best_pipeline_EN = best_grid_cv.best_estimator_
best_pipeline_EN

In [None]:
#calculate error
def calculate_RMSE(name, pipeline, training_data):
  scores = cross_val_score(
    pipeline,
    X=training_data,
    y=y_train,
    cv=15,
    scoring="neg_mean_squared_error"
  )
  print(name + " RMSE: ", np.sqrt(-scores.mean()))
  return

calculate_RMSE("Best Elastic Net combination", best_pipeline_EN, X_train_2)

Best Elastic Net combination RMSE:  24.370939172262407


# Interpretation

I ended up fitting four regression models: linear support vector regression (LinearSVR), ridge regression, ordinary linear regression, and elastic net, which combines the L1 (lasso) and L2 (ridge) penalties. I also tried some models that combine several regressors, and none of them performed as well as the single regressors. The first three models had the lowest cross-validated RMSE of everything I tried (21.15, 21.15, and 21.23); each model's RMSE is listed in its section heading. For each model, I compared three feature groups, kept the group with the lowest RMSE, and then used GridSearchCV to choose the scaler and hyperparameters. In the end, ridge regression performed best when I submitted my models to Kaggle.
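
For reference, scikit-learn's `ElasticNet` minimizes a least-squares loss plus a blend of the lasso (L1) and ridge (L2) penalties. In the formula below, $\alpha$ corresponds to `alpha`, $\rho$ to `l1_ratio`, $n$ is the number of training rows, and $w$ is the coefficient vector:

$$
\min_{w}\; \frac{1}{2n}\lVert y - Xw \rVert_2^2 \;+\; \alpha\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

Setting $\rho = 1$ leaves only the L1 term (lasso), and $\rho = 0$ leaves only the L2 term, which is why the grid over `l1_ratio` in Model 4 spans both extremes.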


## Question 2

Take the best performing model from Question 1, and use it to make predictions on the test data. Submit these predictions to the [Kaggle competition](https://www.kaggle.com/competitions/beer2024).

Note that your predictions will need to be formatted in a very specific way for Kaggle to accept it. See [`sample.csv` on Kaggle](https://www.kaggle.com/competitions/beer2024/data?select=sample.csv) for an example.
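
A minimal sketch of producing a submission file is shown below. It is not the exact code used for my submissions: the output column names `id` and `ibu` are assumptions that should be checked against `sample.csv`, the helper `make_submission` is a name I made up, and the tiny frames stand in for the real train and test CSVs.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_submission(pipeline, X_train, y_train, test_df, feature_cols,
                    id_col="id", target_col="ibu"):
    """Fit on all training rows and return an (id, prediction) DataFrame.

    id_col and target_col are assumptions; match them to sample.csv.
    """
    pipeline.fit(X_train[feature_cols], y_train)
    predictions = pipeline.predict(test_df[feature_cols])
    return pd.DataFrame({id_col: test_df[id_col], target_col: predictions})


# Tiny made-up frames standing in for beer_train.csv and beer_test.csv.
toy_train = pd.DataFrame({"abv": [4.8, 5.5, 8.2, 11.0],
                          "ibu": [11.4, 55.0, 31.0, 30.0]})
toy_test = pd.DataFrame({"id": [6000, 6001], "abv": [5.0, 9.0]})

submission = make_submission(
    make_pipeline(StandardScaler(), Ridge(alpha=10)),
    toy_train, toy_train["ibu"], toy_test, feature_cols=["abv"],
)

# index=False keeps the pandas row index out of the file.
submission.to_csv("submission.csv", index=False)
print(submission)
```

With the real data, the pipeline would be the best one from Question 1 (for example `best_pipeline_RR`), and the test frame would need the same `fillna("")` treatment as the training features.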

To earn full credit, at least one of your models must beat the final model from Part A of this lab on the public leaderboard.

There will be extra credit for the top finishers on the _private_ leaderboard (which will not be revealed until after the end of the contest):

- 1st place gets +5 extra credit points and a gift card.
- 2nd place gets +4 extra credit points and a gift card.
- 3rd place gets +3 extra credit points.
- 4th place gets +2 extra credit points.
- 5th place gets +1 extra credit point.

_COMPARE THE TEST RMSE ON PUBLIC LEADERBOARD WITH THE ESTIMATE YOU GOT USING CROSS-VALIDATION._

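The test RMSE on the public leaderboard was lower than the RMSE I estimated with cross-validation for every model I submitted. I expected the two to be close, with the leaderboard RMSE possibly a little higher, because cross-validation already scores each fold on data the model was not trained on. One possible explanation is that each cross-validation model is trained on only part of the training data, while the submitted model is trained on all of it. Another is ordinary sampling variation, since the leaderboard is scored on a different set of beers.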
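
One way to judge whether such a gap is surprising is to look at how much the RMSE varies from fold to fold. The sketch below shows the mechanics on synthetic data (not the beer data), using scikit-learn's `neg_root_mean_squared_error` scorer to get one RMSE per fold; the model and data settings are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data, just to show the mechanics.
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

# One RMSE per fold; scikit-learn returns the negated value so that higher is better.
fold_rmse = -cross_val_score(
    Ridge(alpha=10), X, y, cv=15, scoring="neg_root_mean_squared_error"
)

print("mean RMSE:", fold_rmse.mean())
print("std across folds:", fold_rmse.std())
print("min / max fold RMSE:", fold_rmse.min(), fold_rmse.max())
```

If the leaderboard RMSE falls inside the range of the per-fold values, the difference from the cross-validation average is plausibly just fold-to-fold noise.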

## Submission Instructions

- Restart this notebook and run the cells from beginning to end.
  - Go to Runtime > Restart and Run All.

In [None]:
# @markdown Run this cell to download this notebook as a webpage, `_NOTEBOOK.html`.

import google, json, nbformat

# Get the current notebook and write it to _NOTEBOOK.ipynb
raw_notebook = google.colab._message.blocking_request("get_ipynb",
                                                      timeout_sec=30)["ipynb"]
with open("_NOTEBOOK.ipynb", "w", encoding="utf-8") as ipynb_file:
  ipynb_file.write(json.dumps(raw_notebook))

# Use nbconvert to convert .ipynb to .html.
!jupyter nbconvert --to html --log-level WARN _NOTEBOOK.ipynb

# Download the .html file.
google.colab.files.download("_NOTEBOOK.html")


- Open `_NOTEBOOK.html` in your browser, and save it as a PDF.
    - Go to File > Print > Save as PDF.
- Double check that all of your code and output is visible in the saved PDF.
- Upload the PDF to [Gradescope](https://www.gradescope.com/courses/694907).
    - Please be sure to select the correct pages corresponding to each question.