
# **Using Physicochemical Elements to Predict Wine Quality**


## Introduction

 Wine is a complex beverage composed of numerous compounds that contribute to its overall quality and taste. It is viewed as a luxury good and is enjoyed by consumers all over the world. Its quality is most commonly assessed through both physicochemical properties and sensory tests (Mor et al.), and this is the basis for our project. In this data science project, we determine which physicochemical properties of wine are most influential and use them to predict overall wine quality.


 Through data analysis, we hope to find predictive relationships among the compounds that make up wine in order to answer our project’s primary question: 
 **How accurately can we predict the quality of wine with the most relevant physicochemical elements using the K-nearest neighbors classification algorithm?**

 We will use the 2009 “Wine Quality” dataset from Portugal, which describes several different red wines using physicochemical tests and a quality score between 0 and 10 derived from sensory data. Portugal is one of the top 10 wine-exporting countries, and its wine industry is investing in technologies for winemaking and selling. In that context, wine certification and quality assessment are key elements. 
 The dataset has the following columns:
 
* 1 - fixed acidity
* 2 - volatile acidity
* 3 - citric acid
* 4 - residual sugar
* 5 - chlorides
* 6 - free sulfur dioxide
* 7 - total sulfur dioxide
* 8 - density
* 9 - pH
* 10 - sulphates
* 11 - alcohol
* 12 - quality (output variable based on sensory data; score between 0 and 10)



## Methods and Results

In [None]:
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)


── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [None]:
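# read_csv2() expects ";" as the field delimiter and "," as the decimal mark
wine_data <- read_csv2("https://raw.githubusercontent.com/choialice707/DSCI-100-Group56-Proj/main/winequality-red.csv") |>
    mutate(quality = as_factor(quality)) |>   # quality is the class label
    filter(alcohol < 150) |>                  # drop outlying alcohol values
    # convert the columns that were read as character to numeric
    mutate(`volatile acidity` = as.numeric(`volatile acidity`),
           `citric acid` = as.numeric(`citric acid`),
           chlorides = as.numeric(chlorides),
           density = as.numeric(density),
           sulphates = as.numeric(sulphates)) |>
    na.omit()                                 # remove rows with missing values
head(wine_data)
tail(wine_data)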

 The first step in our data analysis, after loading the necessary packages, was to read the "Wine Quality" dataset into Jupyter and convert quality to a factor, since it is the class we want to predict. We then kept only the rows with an alcohol value below 150, removing outliers that would otherwise compress the visualizations into a single line. This allowed us to better compare the variables. Finally, we applied as.numeric to the columns that had been read as character vectors, converting them to numeric vectors so that they plot on continuous axes with readable tick labels.
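
 The character columns and the very large alcohol values may be a side effect of how the file was parsed, since `read_csv2()` treats `,` as the decimal mark. The sketch below is a hypothetical alternative, not the approach used in this notebook. It assumes the file uses `;` as the delimiter and `.` as the decimal mark, and it reads a small inline sample (values made up for illustration) with `read_delim()` to show that every column then arrives as numeric.

```r
library(readr)

# Inline sample in the ASSUMED layout: ";" delimiter, "." decimal mark.
# The values are illustrative, not rows taken from the real file.
sample_text <- I("fixed acidity;volatile acidity;alcohol;quality\n7.4;0.70;9.4;5\n7.8;0.88;9.8;5\n")

# read_delim() uses "." as the decimal mark by default.
wine_sample <- read_delim(sample_text, delim = ";", show_col_types = FALSE)

# Inspect the parsed column types.
str(wine_sample)
```

 If the real file follows this layout, the same call on the URL would make the `as.numeric` conversions unnecessary; that would need to be checked against the file itself.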

In [None]:
# standardize each predictor to mean 0 and standard deviation 1;
# scale() returns a one-column matrix, so as.numeric() turns it back into a plain vector
wine_data_scaled <- wine_data |> 
    mutate(scaled_fixed_acidity = as.numeric(scale(`fixed acidity`, center = TRUE)), 
           scaled_volatile_acidity = as.numeric(scale(`volatile acidity`, center = TRUE)),
           scaled_citric_acid = as.numeric(scale(`citric acid`, center = TRUE)),
           scaled_chlorides = as.numeric(scale(chlorides, center = TRUE)),
           scaled_free_sulfur_dioxide = as.numeric(scale(`free sulfur dioxide`, center = TRUE)),
           scaled_total_sulfur_dioxide = as.numeric(scale(`total sulfur dioxide`, center = TRUE)),
           scaled_density = as.numeric(scale(density, center = TRUE)),
           scaled_pH = as.numeric(scale(pH, center = TRUE)),
           scaled_sulphates = as.numeric(scale(sulphates, center = TRUE)),
           scaled_alcohol = as.numeric(scale(alcohol, center = TRUE)))

head(wine_data_scaled)

 Here the data were standardized to keep the visualizations clear and the comparison of variables fair; otherwise, a given distance along the x axis and along the y axis would carry different weight. Standardization shifts each physicochemical variable to a mean of 0 and rescales it to a standard deviation of 1. This prevents variables measured on larger scales from dominating the visualizations, and makes the relationships between the physicochemical variables and wine quality easier to compare and interpret.
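
 As a small illustration of what standardization does, the sketch below computes z-scores by hand and compares them with `scale()`. The `alcohol` vector is a made-up example, not data from the wine file.

```r
# Hypothetical measurements, for illustration only.
alcohol <- c(9.4, 9.8, 10.5, 11.2, 12.8)

# Standardize by hand: subtract the mean, divide by the standard deviation.
z_manual <- (alcohol - mean(alcohol)) / sd(alcohol)

# scale() performs the same computation but returns a one-column matrix,
# so as.numeric() is used to get a plain vector back.
z_scale <- as.numeric(scale(alcohol))

# Compare the two results and check the mean and standard deviation.
all.equal(z_manual, z_scale)
mean(z_manual)
sd(z_manual)
```

 In a tidymodels workflow, the same standardization could instead be placed inside the recipe with `step_center()` and `step_scale()`, so that the means and standard deviations are estimated from the training set only.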

In [None]:
set.seed(1234)

wine_split <- initial_split(wine_data_scaled, prop = 0.75 , strata = quality)  
wine_train <- training(wine_split)   
wine_test <- testing(wine_split)

head(wine_train)
head(wine_test)

In [None]:
wine_summary <- wine_train |>
    group_by(quality) |>
    summarize(count_per_quality = n()) 

wine_summary

In [None]:
options(repr.plot.width = 8, repr.plot.height = 8) 

wine_summary_plot <- wine_summary |>
    ggplot(aes(x = quality, y = count_per_quality)) +
    geom_bar(stat = "identity") +
    labs(x = "Quality of Wine", y = "Total instances")
    
wine_summary_plot

In [None]:
library(gridExtra)
options(repr.plot.width = 25, repr.plot.height = 6) 

wine_plot1 <- wine_train |>
  ggplot(aes(x = scaled_sulphates, y = scaled_pH, color = quality)) +
  geom_point(alpha = 0.6) +
  labs(x = "Sulphates", 
       y = "pH",
       color = "Quality") +
  theme(text = element_text(size = 17))+
scale_color_brewer(palette = "Set2")

wine_plot2 <- wine_train |>
  ggplot(aes(x = scaled_sulphates, y = scaled_total_sulfur_dioxide, color = quality)) +
  geom_point(alpha = 0.6) +
  labs(x = "Sulphates", 
       y = "Total Sulfur Dioxide",
       color = "Quality") +
  theme(text = element_text(size = 17))+
scale_color_brewer(palette = "Set2")

wine_plot3 <- wine_train |>
  ggplot(aes(x = scaled_sulphates, y = scaled_alcohol, color = quality)) +
  geom_point(alpha = 0.6) +
  labs(x = "Sulphates", 
       y = "Alcohol",
       color = "Quality") +
  theme(text = element_text(size = 17))+
scale_color_brewer(palette = "Set2")

wine_plot4 <- wine_train |>
  ggplot(aes(x = scaled_sulphates, y = scaled_volatile_acidity, color = quality)) +
  geom_point(alpha = 0.6) +
  labs(x = "Sulphates", 
       y = "Volatile Acidity",
       color = "Quality") +
  theme(text = element_text(size = 17))+
scale_color_brewer(palette = "Set2")

#############


wine_plot5 <- wine_train |>
  ggplot(aes(x = scaled_pH, y = scaled_total_sulfur_dioxide, color = quality)) +
  geom_point(alpha = 0.6) +
  labs(x = "pH", 
       y = "Total Sulfur Dioxide",
       color = "Quality") +
  theme(text = element_text(size = 17))+
scale_color_brewer(palette = "Set2")

wine_plot6 <- wine_train |>
  ggplot(aes(x = scaled_pH, y = scaled_alcohol, color = quality)) +
  geom_point(alpha = 0.6) +
  labs(x = "pH", 
       y = "Alcohol",
       color = "Quality") +
  theme(text = element_text(size = 17))+
scale_color_brewer(palette = "Set2")

wine_plot7 <- wine_train |>
  ggplot(aes(x = scaled_pH, y = scaled_volatile_acidity, color = quality)) +
  geom_point(alpha = 0.6) +
  labs(x = "pH", 
       y = "Volatile Acidity",
       color = "Quality") +
  theme(text = element_text(size = 17))+
scale_color_brewer(palette = "Set2")

#########


wine_plot8 <- wine_train |>
  ggplot(aes(x = scaled_total_sulfur_dioxide, y = scaled_alcohol, color = quality)) +
  geom_point(alpha = 0.6) +
  labs(x = "Total Sulfur Dioxide", 
       y = "Alcohol",
       color = "Quality") +
  theme(text = element_text(size = 17))+
scale_color_brewer(palette = "Set2")

wine_plot9 <- wine_train |>
  ggplot(aes(x = scaled_total_sulfur_dioxide, y = scaled_volatile_acidity, color = quality)) +
  geom_point(alpha = 0.6) +
  labs(x = "Total Sulfur Dioxide", 
       y = "Volatile Acidity",
       color = "Quality") +
  theme(text = element_text(size = 17))+
scale_color_brewer(palette = "Set2")

##########


wine_plot10 <- wine_train |>
  ggplot(aes(x = scaled_alcohol, y = scaled_volatile_acidity, color = quality)) +
  geom_point(alpha = 0.6) +
  labs(x = "Alcohol", 
       y = "Volatile Acidity",
       color = "Quality") +
  theme(text = element_text(size = 17))+
scale_color_brewer(palette = "Set2")



grid.arrange(wine_plot1, wine_plot2, wine_plot3, wine_plot4, nrow = 1, top = '.')
grid.arrange(wine_plot5, wine_plot6, wine_plot7, nrow = 1, top = '.')
grid.arrange(wine_plot8, wine_plot9, wine_plot10, nrow = 1, top = '.')

     

 To determine the variables to include, we referred to a research paper by Cortez et al. as a valuable source of information. According to the findings presented in the paper, the top five relevant variables for predicting wine quality were sulphates, pH, total sulfur dioxide, alcohol, and volatile acidity. We conducted a comparative analysis among these five elements to validate and select the most suitable variables for our prediction model.

 By examining the relationships and patterns between these variables, we aimed to identify which ones showed the clearest separation between quality levels. We did this visually, by plotting each pair of variables against each other and coloring the points by quality. We then assessed the strength of the relationships, selecting the variables with the strongest apparent association with quality as the most relevant for our model.
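
 A visual comparison like this could be complemented with a numeric check. The sketch below is one hypothetical way to do so, not the procedure used in this analysis: it treats quality as an ordered score and ranks predictors by the absolute value of their Spearman correlation with it. The data frame `toy` and its values are made up for illustration.

```r
# Toy data, made up for illustration; not the wine dataset.
toy <- data.frame(
  quality   = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol   = c(9.0, 9.4, 9.8, 10.0, 10.9, 11.2, 12.0, 12.8),
  sulphates = c(0.50, 0.58, 0.52, 0.61, 0.60, 0.70, 0.66, 0.80),
  pH        = c(3.30, 3.51, 3.20, 3.41, 3.26, 3.38, 3.16, 3.35)
)

predictors <- setdiff(names(toy), "quality")

# Spearman correlation of each predictor with the quality score.
rho <- sapply(predictors, function(p) {
  cor(toy[[p]], toy$quality, method = "spearman")
})

# Rank predictors from strongest to weakest absolute association.
ranking <- sort(abs(rho), decreasing = TRUE)
ranking
```

 Spearman correlation is used here because quality is an ordinal score; it only captures monotone relationships, so it would not replace the scatter plots.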

 Upon analyzing the results of our comparison, we determined that sulphates, total sulfur dioxide, pH, and alcohol displayed the most consistent and noticeable associations with wine quality. Therefore, these four variables were selected as the predictors for our model, which uses the K-nearest neighbors (KNN) classification algorithm.
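
 For reference, the KNN rule itself can be sketched in a few lines of base R. The helper `knn_predict` below is hypothetical and is not part of tidymodels or kknn; it uses Euclidean distance and unweighted votes, which is the behavior that `weight_func = "rectangular"` requests. The training points are made up.

```r
# Hypothetical helper: classify one new point by majority vote among its
# k nearest training points (Euclidean distance, unweighted votes).
knn_predict <- function(train_x, train_y, new_x, k) {
  # Distance from the new point to every training row.
  diffs <- sweep(train_x, 2, new_x)
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest rows.
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Most frequent label among them (ties go to the first level in table order).
  names(which.max(table(nearest)))
}

# Made-up standardized predictors (two columns) and quality labels.
train_x <- matrix(c(-1.0, -0.8,
                    -0.9, -1.1,
                    -1.2, -0.7,
                     1.0,  0.9,
                     1.1,  1.2,
                     0.8,  1.0),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c("5", "5", "5", "7", "7", "7"))

pred_high <- knn_predict(train_x, train_y, new_x = c(0.9, 1.0), k = 3)
pred_low  <- knn_predict(train_x, train_y, new_x = c(-1.0, -1.0), k = 3)
pred_high
pred_low
```

 Because the vote depends entirely on distances, predictors on very different scales would contribute unequally, which is why the predictors are standardized before fitting.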

In [None]:
set.seed(1234)

wine_recipe <- recipe(quality ~ scaled_sulphates + scaled_pH + scaled_total_sulfur_dioxide + scaled_alcohol, data = wine_train)
wine_recipe

wine_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("classification")
wine_tune


In [None]:
set.seed(1234)

wine_vfold <- vfold_cv(wine_train, v = 5, strata = quality)

In [None]:
set.seed(1234)

k_vals <- tibble(neighbors = seq(2, 10, 1))

wine_results <- workflow() |>
      add_recipe(wine_recipe) |>
      add_model(wine_tune) |>
      tune_grid(resamples = wine_vfold, grid = k_vals) |>
      collect_metrics()
wine_results

In [None]:
wine_accuracies <- wine_results |>
    filter(.metric == "accuracy")
wine_accuracies

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10) 

accuracy_vs_k_plot <- wine_accuracies |>
    ggplot(aes(x = neighbors, y = mean))+
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate") +
    scale_x_continuous(breaks = seq(0, 10, by = 1)) +  # adjusting the x-axis
    scale_y_continuous(limits = c(0.5, 0.6)) # adjusting the y-axis

accuracy_vs_k_plot

In [None]:
choice_of_k <- wine_accuracies |>
    arrange(desc(mean)) |>
    slice(1) |>
    pull(neighbors)
choice_of_k

In [None]:
set.seed(1234) 

wine_recipe_2 <- recipe(quality ~ scaled_sulphates + scaled_pH + scaled_total_sulfur_dioxide + scaled_alcohol, data = wine_train)
wine_recipe_2

wine_spec <-  nearest_neighbor(weight_func = "rectangular", neighbors = choice_of_k)|>
    set_engine("kknn") |>
     set_mode("classification")

wine_spec



In [None]:
wine_fit <- workflow() |>
    add_recipe(wine_recipe_2) |>
    add_model(wine_spec) |>
    fit(data = wine_train)

wine_fit

In [None]:
set.seed(1234)

wine_predictions <- predict(wine_fit, wine_test) |>
    bind_cols(wine_test)
head(wine_predictions)

wine_metrics <- wine_predictions |>
    metrics(truth = quality, estimate = .pred_class) |> 
    filter(.metric == "accuracy")
wine_metrics

wine_conf_mat <- wine_predictions |> 
       conf_mat(truth = quality, estimate = .pred_class)

wine_conf_mat

In [1]:
# autoplot() needs ggplot2 and yardstick attached, so run the library(tidyverse)
# and library(tidymodels) cell at the top of the notebook before this one
autoplot(wine_conf_mat, type = "heatmap") +
    scale_fill_distiller(palette = "Oranges", name = "Frequency") + 
    labs(title = "Quality Confusion Matrix", caption = "[Figure 1.3]") +
    theme(legend.position = "left", text = element_text(size = 22),
          plot.caption = element_text(size = 15, hjust = 0))


In [None]:
# create the grid of sulphates/pH values, and arrange it in a data frame
sul_grid <- seq(min(wine_test$scaled_sulphates), 
                max(wine_test$scaled_sulphates), 
                length.out = 100)

pH_grid <- seq(min(wine_test$scaled_pH), 
               max(wine_test$scaled_pH), 
               length.out = 100)

# the fitted workflow expects all four predictors, so the two that are not
# plotted are held at 0 (roughly their mean on the standardized scale);
# this is an assumption made only for the 2-D visualization
asgrid <- as_tibble(expand.grid(scaled_sulphates = sul_grid, 
                                scaled_pH = pH_grid,
                                scaled_total_sulfur_dioxide = 0,
                                scaled_alcohol = 0))

# use the fit workflow to make predictions at the grid points
knnPredGrid <- predict(wine_fit, asgrid)

# bind the predictions as a new column with the grid points
prediction_table <- bind_cols(knnPredGrid, asgrid) |> 
  rename(Class = .pred_class)

# plot:
# 1. the colored scatter of the original data
# 2. the faded colored scatter for the grid points
wkflw_plot <-
  ggplot() +
  geom_point(data = wine_test, 
             mapping = aes(x = scaled_pH, 
                           y = scaled_sulphates, 
                           color = quality), 
             alpha = 0.75) +
  # the grid predictions are stored in Class (renamed from .pred_class)
  geom_point(data = prediction_table, 
             mapping = aes(x = scaled_pH, 
                           y = scaled_sulphates, 
                           color = Class), 
             alpha = 0.02, 
             size = 5) +
  labs(color = "Quality", 
       x = "pH (scaled)", 
       y = "Sulphates (scaled)") +
  theme(text = element_text(size = 12))

wkflw_plot

## **Discussion**

The data are imbalanced: for example, there are only 10 wines of quality 3 but roughly 600 of qualities 5 and 6. We tried to balance the classes, but the themis package would not load. To improve our accuracy, we would add this balancing step using the step_upsample function.
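
The balancing step could look roughly like the sketch below. It is a base R stand-in for the idea behind upsampling, not the themis implementation of step_upsample: each class is resampled with replacement until it has as many rows as the largest class. The helper `upsample` and the data frame `toy` are hypothetical.

```r
set.seed(1234)

# Toy imbalanced data, made up for illustration.
toy <- data.frame(
  quality = factor(c(rep("3", 2), rep("5", 6), rep("6", 4))),
  alcohol = round(runif(12, min = 9, max = 13), 1)
)

# Hypothetical helper: resample every class (with replacement) up to the
# size of the largest class.
upsample <- function(data, class_col) {
  target <- max(table(data[[class_col]]))
  rows_by_class <- split(seq_len(nrow(data)), data[[class_col]])
  keep <- unlist(lapply(rows_by_class, function(rows) {
    extra <- target - length(rows)
    # Keep all original rows, then add random duplicates to reach the target.
    c(rows, rows[sample.int(length(rows), extra, replace = TRUE)])
  }))
  data[keep, , drop = FALSE]
}

balanced <- upsample(toy, "quality")

# Class counts before and after balancing.
table(toy$quality)
table(balanced$quality)
```

Upsampling of this kind would be applied to the training set only, so that duplicated rows never end up in the test set.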

## **References**

* Cortez, Paulo, et al. "Modeling wine preferences by data mining from physicochemical properties." Decision Support Systems, ScienceDirect, May 2009, https://www.sciencedirect.com/science/article/abs/pii/S0167923609001377.

* Mor, Nuriel Shalom, et al. "Wine Quality and Type Prediction from Physicochemical Properties Using Neural Networks for Machine Learning: A Free Software for Winemakers and Customers." Journal of Wine Research, Taylor & Francis Online, 2019, https://www.tandfonline.com/doi/full/10.1080/09571264.2019.1590937.
