
# ST 590 Homework 7 By: Eric Warren

Here I suppress warnings so the output is easier to follow.

In [1]:
import warnings

# Suppress warnings
warnings.filterwarnings("ignore")

## Read in and Combine Data

- Read in the `winequality-red.csv` and `winequality-white.csv` files available on the UCI machine learning repository site.
- Combine these two datasets and create a new variable that represents the type of wine (red or white)

In [2]:
# import pandas
import pandas as pd

# Read in the red wine data
red_wine = pd.read_csv("winequality-red.csv", sep = ";")
red_wine['color_white'] = 0

# Read in the white wine data
white_wine = pd.read_csv("winequality-white.csv", sep = ";")
white_wine['color_white'] = 1

# Combine the data into one data frame
wine = pd.concat([red_wine, white_wine]) # DataFrame.append was removed in pandas 2.0
wine # Show our combined data frame

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color_white
0,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,0
1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,0
4,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4893,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6,1
4894,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5,1
4895,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6,1
4896,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,1
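
The combined frame above keeps each file's original row labels, so labels such as 0 appear once for red and once for white. As a small sketch on made-up data (the `red` and `white` frames and their values are hypothetical stand-ins, not the UCI files), `pd.concat` with `ignore_index=True` stacks the rows and gives a fresh index:

```python
import pandas as pd

# Tiny stand-ins for the red and white data frames (values are made up)
red = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
white = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "quality": [6, 6, 6]})

# Flag the wine type before stacking, as in the cell above
red["color_white"] = 0
white["color_white"] = 1

# Stack the rows; ignore_index=True relabels the rows 0..n-1
combined = pd.concat([red, white], ignore_index=True)

print(combined.shape)
print(combined["color_white"].value_counts().to_dict())
```

Whether to reset the index is a judgment call; nothing later in this notebook relies on the row labels.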


## Split the Data

Split up the data set into a training and test set. For this, I want you to use stratified sampling to
make sure that you have a similar proportion of white and red wines in the training and test sets. This
can be done with the `train_test_split()` function.

In [3]:
# Import needed functions for modeling
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.linear_model import LinearRegression, LassoCV, Lasso, Ridge, RidgeCV, ElasticNet, ElasticNetCV, LogisticRegression, LogisticRegressionCV

# Split the data so that the proportion of red and white wine is the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    wine.drop("alcohol", axis = 1),
    wine["alcohol"],
    test_size = 0.20,
    random_state = 99,
    stratify = wine['color_white'])
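
A quick way to check that `stratify` did its job is to compare the share of white wines in each piece. The sketch below uses a made-up data frame (`df`, with a hypothetical 25/75 split between the two types) rather than the wine data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 25% of rows flagged 0 (red), 75% flagged 1 (white)
df = pd.DataFrame({
    "x": np.arange(400, dtype=float),
    "color_white": [0] * 100 + [1] * 300,
})

train, test = train_test_split(
    df, test_size=0.20, random_state=99, stratify=df["color_white"])

# Share of white wines in each piece; stratifying keeps these nearly equal
train_share = train["color_white"].mean()
test_share = test["color_white"].mean()
print(train_share, test_share)
```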

## Regression Task (alcohol as Response)

### Train Models

Fit four different multiple linear regression models.

- At least one should include interaction terms
- At least one should include some polynomial terms
- Use CV to select your best MLR model

In [4]:
# Allow us to use interaction terms and numpy
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Make a full MLR model
mlr1 = cross_validate(
    LinearRegression(),
    X_train,
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")

# Make a MLR model with only the pH, sulphates, quality, and color_white
mlr2 = cross_validate(
    LinearRegression(),
    X_train[["pH", "sulphates", "quality", "color_white"]],
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")

# Make a MLR model with the interaction terms for all of the columns we selected above
poly = PolynomialFeatures(interaction_only = True, include_bias = False)
X_design = poly.fit_transform(X_train[["pH", "sulphates", "quality", "color_white"]])
mlr3 = cross_validate(
    LinearRegression(),
    X_design,
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")

# Make a MLR model that includes a quadratic term for our columns
poly2 = PolynomialFeatures(degree = 2, interaction_only = False, include_bias = False)
X_design2 = poly2.fit_transform(X_train[["pH", "sulphates", "quality", "color_white"]])
mlr4 = cross_validate(
    LinearRegression(),
    X_design2,
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")

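# Print the CV error for each model and select the model with the lowest value to test later
# (each value is the square root of the summed fold MSEs, i.e. sqrt(5) times the CV RMSE, so the ranking is unaffected)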
print(np.sqrt(-sum(mlr1['test_score'])),
      np.sqrt(-sum(mlr2['test_score'])),
      np.sqrt(-sum(mlr3['test_score'])),
      np.sqrt(-sum(mlr4['test_score'])))

1.0333519118768444 2.3530136757851543 2.348408083800981 2.3302342552083974


As we can see, our first model, the full model with every predictor, has the lowest CV error. Let us save this model to test later.
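
The comparison above takes the square root of the summed fold scores. A minimal sketch of the more usual CV RMSE, which averages the fold MSEs first, is below. It runs on synthetic data from `make_regression` (an assumption made so the block is self-contained), and it also shows that the two quantities differ only by a factor of sqrt(5) when there are five folds:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the training data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=99)

cv = cross_validate(LinearRegression(), X, y, cv=5,
                    scoring="neg_mean_squared_error")

# Scores are negated MSEs, one per fold
fold_mse = -cv["test_score"]

rmse_from_mean = np.sqrt(fold_mse.mean())  # CV RMSE
root_of_sum = np.sqrt(fold_mse.sum())      # what the cell above prints

print(rmse_from_mean, root_of_sum)
```

Because the scaling factor is the same for every model, either version picks the same winner.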

In [5]:
mlr_best = LinearRegression().fit(X_train, y_train)

Now we are going to fit a LASSO model with a set of predictors of our choosing

- Use at least five predictors
- Use CV to select the tuning parameter

Then we are going to fit the model with the best tuning parameter to use for testing later.

In [6]:
# Fit our lasso model to get the best tuning parameter
lasso_mod = LassoCV(cv = 5, random_state = 99).fit(X_train, y_train)

# Print the best alpha tuning parameter
print(lasso_mod.alpha_)

# Fit our best model
lasso_best = Lasso(lasso_mod.alpha_).fit(X_train, y_train)

0.017540010541499004


Here we can see that our best tuning parameter is quite small (about 0.0175). With a penalty of exactly 0 the LASSO would reduce to the MLR fit.
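
Penalized fits depend on the scale of each predictor, because the penalty acts on the raw coefficients. The sketch below is one hedged way to handle that: it standardizes inside a pipeline before `LassoCV`. The data are synthetic, and the rescaled first column is an assumption added only to mimic predictors measured in very different units:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; blow up one column so the predictors have very different scales
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       noise=5.0, random_state=99)
X[:, 0] *= 1000.0

# Standardize, then let LassoCV choose alpha by 5-fold CV
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=99))
pipe.fit(X, y)

lasso = pipe.named_steps["lassocv"]
n_kept = int(np.sum(lasso.coef_ != 0))  # predictors with a nonzero coefficient

print(lasso.alpha_, n_kept)
```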

Now we are going to fit a Ridge Regression model with a set of predictors of our choosing

- Use at least five predictors
- Use CV to select the tuning parameter

Then we are going to fit our model with the best tuning parameter to use for testing later.

In [7]:
# Fit our ridge model to get the best tuning parameter
ridge_mod = RidgeCV(cv = 5,
                    alphas=[0.0001, 0.001, 0.01, 0.02, 0.05, 0.1, 0.2, 0.25, 0.5, 0.75, 0.9, 0.93, 0.95, 0.97, 0.98, 0.99, 1]).fit(X_train, y_train)

# Print the best alpha tuning parameter
print(ridge_mod.alpha_)

# Fit our best model
ridge_best = Ridge(ridge_mod.alpha_).fit(X_train, y_train)

0.0001


Again the selected penalty is small: 0.0001 is the smallest value in our grid, so very little shrinkage is applied and the fit should be close to the MLR model.
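
When the chosen value sits at the edge of the grid, it can be worth checking whether a wider grid would pick something else. Here is a small sketch of that check on synthetic data; the log-spaced `alphas` grid and the `on_boundary` flag are illustrative choices, not part of the assignment:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic stand-in for the training data
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=99)

# Log-spaced grid covering several orders of magnitude
alphas = np.logspace(-4, 3, 30)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)

# True if CV picked the smallest or largest alpha it was offered
on_boundary = bool(np.isclose(ridge_cv.alpha_, alphas[0])
                   or np.isclose(ridge_cv.alpha_, alphas[-1]))

print(ridge_cv.alpha_, on_boundary)
```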

Lastly, we are going to fit an Elastic Net model with a set of predictors of your choosing

- Use at least five predictors
- Use CV to select the tuning parameters

Then we are going to fit our model with the best tuning parameters to use for testing later.

In [8]:
# Fit our elastic net model to get the best tuning parameters
regr = ElasticNetCV(cv = 5,
                    random_state = 99,
                    l1_ratio = [0.0001, 0.001, 0.01, 0.02, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.96, 0.98, 0.99, 1],
                    n_alphas = 50)
regr.fit(X_train, y_train)

# Print the best alpha parameter
print(regr.alpha_)

# Print the best ratio parameter
print(regr.l1_ratio_)

# Fit our best elastic net model
en_best = ElasticNet(alpha = regr.alpha_, l1_ratio = regr.l1_ratio_).fit(X_train, y_train)

0.017540010541499004
1.0


The best alpha matches the one from our LASSO model, and the best `l1_ratio` is 1, which puts all of the penalty on the L1 term. In other words, our best elastic net model is just our LASSO model.
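
To see why an `l1_ratio` of 1 gives back the LASSO, it helps to write out the objective that scikit-learn's `ElasticNet` minimizes. Here $n$ is the number of training rows, $w$ the coefficient vector, $\alpha$ the `alpha` parameter, and $\rho$ the `l1_ratio` (the symbols are my own shorthand):

$$
\min_{w}\; \frac{1}{2n}\lVert y - Xw\rVert_2^2 \;+\; \alpha\,\rho\,\lVert w\rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w\rVert_2^2
$$

Setting $\rho = 1$ removes the last term and leaves

$$
\min_{w}\; \frac{1}{2n}\lVert y - Xw\rVert_2^2 \;+\; \alpha\,\lVert w\rVert_1 ,
$$

which is the `Lasso` objective with the same $\alpha$. That is consistent with the two models reporting the same alpha above.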

### Test Models

Using your four selected models, compare their performance on the test set.

- Do so using RMSE as your model metric
- Do so using MAE as your model metric

First look at using RMSE as our model metric.

In [9]:
# Get our predictions from our model using our test set
mlr_pred = mlr_best.predict(X_test)
lasso_pred = lasso_best.predict(X_test)
ridge_pred = ridge_best.predict(X_test)
en_pred = en_best.predict(X_test)

# Get our RMSE values to see which has the lowest test value
print(np.sqrt(mean_squared_error(y_test, mlr_pred)),
      np.sqrt(mean_squared_error(y_test, lasso_pred)),
      np.sqrt(mean_squared_error(y_test, ridge_pred)),
      np.sqrt(mean_squared_error(y_test, en_pred)))

0.6323347806538819 0.9921748582591342 0.6299583102758477 0.9921748582591342


Here we can see our Ridge Regression model has the lowest RMSE (0.6300 vs. 0.6323 for MLR), so by this metric it is the best model to use, though only narrowly.

Next look at using MAE as our model metric.

In [10]:
# Get our MAE values to see which has the lowest test value
print(mean_absolute_error(y_test, mlr_pred),
      mean_absolute_error(y_test, lasso_pred),
      mean_absolute_error(y_test, ridge_pred),
      mean_absolute_error(y_test, en_pred))

0.34808428128412455 0.7894563357522837 0.3496689651462296 0.7894563357522837


In this case the full MLR model does slightly better than our Ridge Regression model (0.3481 vs. 0.3497).
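
RMSE and MAE can rank models differently because RMSE squares the errors, so a few large misses count for more. A tiny worked example with made-up numbers (`y_true` and `y_hat` are hypothetical) shows both metrics computed by hand and with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up alcohol values and predictions; the last prediction misses by 1.0
y_true = np.array([9.4, 9.8, 10.5, 11.2])
y_hat = np.array([9.5, 9.6, 10.5, 12.2])

errors = y_true - y_hat

# RMSE: square, average, then take the root
rmse_manual = np.sqrt(np.mean(errors ** 2))
# MAE: average of the absolute errors
mae_manual = np.mean(np.abs(errors))

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
mae_sklearn = mean_absolute_error(y_true, y_hat)

print(rmse_manual, mae_manual)
```

Here the single large miss pulls RMSE well above MAE, which is the kind of difference that can flip a close comparison like Ridge vs. MLR.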

Because the Ridge and Multiple Linear Regression models perform about the same on the test set, either model is sufficient to use. Our penalty term for Ridge Regression was quite small, so we expected the Ridge and MLR results to be roughly the same. For the sake of interpretability I would use the MLR model, since we usually only want to sacrifice interpretability when another model is clearly better, which isn't the case here. Also note that LASSO (and elastic net, which selected the same model) performs noticeably worse than Ridge and MLR.

## Classification Task (Wine Type as Response)

- Repeat the training and testing done previously but use logistic regression models.
- Use log-loss or negative log-loss as your metric for choosing models during the training process
- During the testing portion, compare your models on both log-loss and accuracy

### Split the Data Again

First we are going to split our data with now our response being the wine type (really the color of the wine).

In [11]:
# Split the data in a way that the proportion of red and white wine is same
X_train, X_test, y_train, y_test = train_test_split(
    wine.drop("color_white", axis = 1),
    wine["color_white"],
    test_size = 0.20,
    random_state = 99,
    stratify = wine['color_white'])

### Train Different Models

Fit four different logistic regression models.

- At least one should include interaction terms
- At least one should include some polynomial terms
- Use CV to select your best logistic regression model

In [12]:
# Fit a full logistic regression model
log_reg1 = cross_validate(
    LogisticRegression(penalty = None, solver = "newton-cg"), # penalty = 'none' was removed in scikit-learn 1.4
    X_train,
    y_train,
    cv = 5,
    scoring = "neg_log_loss")

# Fit a logistic regression model with pH, sulphates, quality, and alcohol
log_reg2 = cross_validate(
    LogisticRegression(penalty = None, solver = "newton-cg"),
    X_train[["pH", "sulphates", "quality", "alcohol"]],
    y_train,
    cv = 5,
    scoring = "neg_log_loss")

# Make a logistic regression model with the interaction terms for all of the columns we selected above
# Note: this model is scored with negative MSE (the misclassification rate for 0/1 labels), not negative log-loss
poly = PolynomialFeatures(interaction_only = True, include_bias = False)
X_design = poly.fit_transform(X_train[["pH", "sulphates", "quality", "alcohol"]])
log_reg3 = cross_validate(
    LogisticRegression(penalty = None, solver = "newton-cg"),
    X_design,
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")

# Make a logistic regression model that includes quadratic terms for our columns
# Note: this model is scored with negative MSE (the misclassification rate for 0/1 labels), not negative log-loss
poly2 = PolynomialFeatures(degree = 2, interaction_only = False, include_bias = False)
X_design2 = poly2.fit_transform(X_train[["pH", "sulphates", "quality", "alcohol"]])
log_reg4 = cross_validate(
    LogisticRegression(penalty = None, solver = "newton-cg"),
    X_design2,
    y_train,
    cv = 5,
    scoring = "neg_mean_squared_error")

# Print which model has the best CV error and select that model to test in future
print(round(log_reg1['test_score'].mean(), 4),
      round(log_reg2['test_score'].mean(), 4),
      round(log_reg3['test_score'].mean(), 4),
      round(log_reg4['test_score'].mean(), 4))

-0.0334 -0.3823 -0.178 -0.1738


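Here again our full model is by far the best: its CV log-loss (0.0334) is far below that of the reduced model (0.3823). The third and fourth values are negative misclassification rates rather than negative log-losses, because those models were scored with `neg_mean_squared_error`, so they are not on the same scale as the first two. We are going to use the full model as our logistic regression candidate for the "regular" (unpenalized) approach.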
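
Since the interaction and quadratic models above are scored with `neg_mean_squared_error`, here is a hedged sketch of how every candidate could be put on the same log-loss scale. It uses synthetic data from `make_classification`, a small helper `cv_log_loss` (hypothetical, introduced for this example), and scikit-learn's default L2 penalty rather than the unpenalized fit used in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=99)

def cv_log_loss(model):
    """Mean 5-fold log-loss (lower is better)."""
    scores = cross_validate(model, X, y, cv=5, scoring="neg_log_loss")
    return -scores["test_score"].mean()

main_effects = make_pipeline(StandardScaler(),
                             LogisticRegression(max_iter=1000))
interactions = make_pipeline(
    PolynomialFeatures(interaction_only=True, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000))

results = {"main effects": cv_log_loss(main_effects),
           "interactions": cv_log_loss(interactions)}
print(results)
```

Putting the feature expansion inside the pipeline also means each CV fold builds its own design matrix, which keeps the workflow tidy as more candidates are added.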

In [13]:
reg_log_best = LogisticRegression(penalty = None, solver = "newton-cg").fit(X_train, y_train)

Now we are going to fit a logistic regression LASSO model with a set of predictors of our choosing (this uses only the L1 penalty)

- Use at least five predictors
- Use CV to select the tuning parameter

Then we are going to fit the model with the best tuning parameter to use for testing later.

In [14]:
# Make the cross validation to find the best tuning parameter
lasso_log_cv = LogisticRegressionCV(cv = 5,
                                    solver = "saga",
                                    penalty = "l1",
                                    Cs = 25,
                                    scoring = "neg_log_loss",
                                    random_state = 99)
lasso_log_cv.fit(X_train, y_train)

# Show the regularization value
print(lasso_log_cv.C_)

# Now fit our best model
lasso_best_log = LogisticRegression(solver = "saga",
                                    penalty = "l1",
                                    C = lasso_log_cv.C_[0],
                                    random_state = 99)
lasso_best_log.fit(X_train, y_train)

[10000.]


Note the high value of `C`. Since `C` is the inverse of the regularization strength, a value of 10000 (the largest in the grid) means the model is barely regularized, so a penalized fit may not be the best procedure here.

Now we are going to fit a logistic regression ridge model with a set of predictors of our choosing (this uses only the L2 penalty)

- Use at least five predictors
- Use CV to select the tuning parameter

Then we are going to fit the model with the best tuning parameter to use for testing later.

In [15]:
# Make the cross validation to find the best tuning parameter
ridge_log_cv = LogisticRegressionCV(cv = 5,
                                    solver = "newton-cg",
                                    penalty = "l2",
                                    Cs = 25,
                                    scoring = "neg_log_loss",
                                    random_state = 99)
ridge_log_cv.fit(X_train, y_train)

# Show the regularization value
print(ridge_log_cv.C_)

# Now fit our best model
ridge_best_log = LogisticRegression(solver = "newton-cg",
                                    penalty = "l2",
                                    C = ridge_log_cv.C_[0],
                                    random_state = 99)
ridge_best_log.fit(X_train, y_train)

[10000.]


Again the selected `C` is the largest value in the grid. Because `C` is the inverse of the regularization strength, the ridge penalty is doing very little here, so this might not be the best procedure.
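
To make the role of `C` concrete, the sketch below fits an L2-penalized logistic regression (scikit-learn's default penalty) at a small and a large `C` and compares the size of the coefficients. The data are synthetic and the `coef_norms` dictionary is just an illustrative container:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized stand-in for the training data
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=99)
X = StandardScaler().fit_transform(X)

coef_norms = {}
for C in [0.01, 100.0]:
    # Small C = strong penalty, large C = weak penalty
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))

print(coef_norms)
```

The coefficients are shrunk much harder at the small `C`, which is why a selected `C` of 10000 behaves almost like an unpenalized fit.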

Lastly, we are going to fit a logistic regression elastic net model with a set of predictors of our choosing (this mixes the L1 and L2 penalties, with the mix controlled by `l1_ratio`)

- Use at least five predictors
- Use CV to select the tuning parameter

Then we are going to fit the model with the best tuning parameters to use for testing later.

In [16]:
# Make the cross validation to find the best tuning parameter
e_net_log_cv = LogisticRegressionCV(cv = 5,
                                    solver = "saga",
                                    penalty = "elasticnet",
                                    Cs = 10,
                                    l1_ratios = [0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99],
                                    scoring = "neg_log_loss",
                                    random_state = 99)
e_net_log_cv.fit(X_train, y_train)

# Show the regularization value
print(e_net_log_cv.C_)
print(e_net_log_cv.l1_ratio_)

# Now fit our best model
e_net_best_log = LogisticRegression(solver = "saga",
                                    penalty = "elasticnet",
                                    C = e_net_log_cv.C_[0],
                                    l1_ratio = e_net_log_cv.l1_ratio_[0],
                                    random_state = 99)
e_net_best_log.fit(X_train, y_train)

[10000.]
[0.001]


Once more `C` is at its largest grid value, so the model is barely regularized, which suggests this might not be the best procedure. Also note the low `l1_ratio`: 0.001 is the smallest value we offered, so almost none of the penalty is on the L1 (or lasso) component, and the model is similar in form to our L2 (or ridge) model.

### Test Logistic Models

Using your four selected models, compare their performance on the test set.

- Do so using log-loss as your model metric
- Do so using accuracy as your model metric



First we will check using log-loss as our metric.

In [17]:
# Import our metrics
from sklearn.metrics import log_loss, accuracy_score

# Get our predictions from our models using our test set
# (predict returns hard 0/1 labels, not predicted probabilities)
reg_pred = reg_log_best.predict(X_test)
lasso_pred = lasso_best_log.predict(X_test)
ridge_pred = ridge_best_log.predict(X_test)
en_pred = e_net_best_log.predict(X_test)

# Get our log-loss values to see which has the lowest test value
print(log_loss(y_test, reg_pred),
      log_loss(y_test, lasso_pred),
      log_loss(y_test, ridge_pred),
      log_loss(y_test, en_pred))

0.1940812105567849 1.8576344439006534 0.2495329850015805 1.8576344439006534


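Here we can see that our regular (no penalty term) logistic regression model has the lowest value. Because `log_loss` was given hard class labels from `predict` rather than probabilities from `predict_proba`, each value is the test misclassification rate times a constant (about 36), so this comparison ranks the models the same way accuracy does. We will check the accuracy as well.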
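
Log-loss is normally computed from predicted probabilities. The sketch below contrasts the two ways of calling `log_loss` on synthetic data (the data set, the `flip_y` noise level, and the variable names are assumptions made for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic data with some label noise so the classifier makes mistakes
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=99)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=99, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

labels = model.predict(X_te)       # hard 0/1 labels
probs = model.predict_proba(X_te)  # one probability column per class

loss_from_probs = log_loss(y_te, probs)    # the usual log-loss
loss_from_labels = log_loss(y_te, labels)  # driven only by the error count
error_rate = 1 - accuracy_score(y_te, labels)

print(loss_from_probs, loss_from_labels, error_rate)
```

With hard labels every mistake is treated as a maximally confident wrong prediction, so the result carries no information beyond the error rate.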

In [18]:
# Get our accuracy values to see which has the highest test value
print(accuracy_score(y_test, reg_pred),
      accuracy_score(y_test, lasso_pred),
      accuracy_score(y_test, ridge_pred),
      accuracy_score(y_test, en_pred))

0.9946153846153846 0.9484615384615385 0.9930769230769231 0.9484615384615385


Here we can see that our regular (no penalty term) logistic regression model has the highest accuracy, which, coupled with the log-loss results, makes us say that it is the best model to use. Therefore, for our classification task we should just use a logistic regression model with no penalty terms.