
# Predicting the Red Wine Quality Score

**Introduction**

A 2022 study conducted by John Dunham and Associates found that the wine industry accounts for $276.07 billion in economic output in America. While wine is a cornerstone of the US economy, few individuals are actually experts in the subject: there are only 269 Master Sommeliers in the world. This is a clear disconnect between the economic impact of wine and the number of individuals trained to assess a wine's quality, and it leaves room for restaurants, liquor stores, and wineries to overprice (or underprice) wines for consumers. In this project, we will use predictors that could be obtained by contacting wine producers to see whether we can predict the quality [0-10] of a red wine using regression.

The dataset was collected by Cortez et al. (2009) for their paper, "Modeling wine preferences by data mining from physicochemical properties", in which they modelled wine preference on a scale of 0 [bad] to 10 [excellent].


**Preliminary Exploratory Data Analysis:**



In [1]:
library(repr)
library(tidyverse)
library(tidymodels)
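# Install corrplot only if it is not already available
if (!requireNamespace("corrplot", quietly = TRUE)) install.packages("corrplot")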
library(corrplot)
options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0
✔

In [None]:
wine_data <- read_csv2(file = "data/winequality-red (1).csv", skip = 1,
                       col_names = c("fixed_acidity",
                                     "volatile_acidity",
                                     "citric_acid",
                                     "residual_sugar",
                                     "chlorides",
                                     "free_sulfur_dioxide",
                                     "total_sulfur_dioxide",
                                     "density",
                                     "pH",
                                     "sulphates",
                                     "alcohol",
                                     "quality")) |>
              mutate(across("fixed_acidity"

**Methods**

Building a good model requires a careful selection of relevant predictors. Our dataset contains 11 candidate predictor variables, many of which are redundant. We will use both domain knowledge and a correlation matrix to determine which ones to use.

First, alcohol will be used, as ethanol is known to affect individuals' perception of wine flavor, thus influencing perceived quality (Caballero & Segura, 2017). Second, total sulfur dioxide will be used: it acts as an antioxidant and is necessary in winemaking, with probable impacts on quality (Araptisas et al., 2018).

Furthermore, the correlation matrix suggests that ## are also significant. The ID is distributed randomly, so it has no impact, and ### are insignificant according to the correlation matrix.
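The correlation screening described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_wine` is a simulated stand-in (not the real wine data), it contains just three of the candidate predictors, and the simulated quality depends on alcohol only so that the ranking has something to find. The sketch uses base R's `cor()` so that it runs without extra packages.

```r
# Simulated stand-in for the wine data: these values are NOT the real dataset.
set.seed(2022)
n <- 200
toy_wine <- data.frame(
  alcohol              = rnorm(n, mean = 10.4, sd = 1.0),
  total_sulfur_dioxide = runif(n, min = 6, max = 150),
  pH                   = rnorm(n, mean = 3.3, sd = 0.15)
)
# Assumption for illustration: quality is driven by alcohol plus noise.
toy_wine$quality <- 5.6 + 0.4 * (toy_wine$alcohol - 10.4) + rnorm(n, sd = 0.6)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(toy_wine)

# Correlation of each candidate predictor with quality, strongest first.
quality_cor <- cor_matrix[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]
ranked <- sort(abs(quality_cor), decreasing = TRUE)
print(round(ranked, 2))

# With the corrplot package loaded, corrplot(cor_matrix, method = "number")
# would draw the same matrix as a plot.
```

On the real data, the same ranking would be computed on the training split only, so that the choice of predictors is not informed by the test set.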

Even though quality takes only 11 possible values (0 to 10), we will use regression rather than classification for prediction. The quality values are ordered, and the dataset is too unbalanced to use a classifier. Most scores are centered around the mean, so we believe a multiple linear regression with 6 variables should give the best results. Rounding each predicted value to the nearest integer gives a quality score that can easily be checked with a confusion matrix. We will use its outputs to calculate the accuracy score.
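A minimal sketch of this pipeline, fitting a multiple linear regression, rounding the predictions, and scoring them with a confusion matrix, might look like the following. It rests on several assumptions: the data are simulated rather than the real wine data, only two predictors are used instead of the planned six, the 75/25 split is an arbitrary choice, and base R's `lm()` stands in for whatever modelling framework the final analysis uses.

```r
# Simulated stand-in for the wine data: these values are NOT the real dataset.
set.seed(2022)
n <- 300
toy_wine <- data.frame(
  alcohol              = rnorm(n, mean = 10.4, sd = 1.0),
  total_sulfur_dioxide = runif(n, min = 6, max = 150)
)
# Assumed relationship for illustration only; quality is an integer score.
toy_wine$quality <- round(
  5.6 + 0.5 * (toy_wine$alcohol - 10.4) -
    0.004 * (toy_wine$total_sulfur_dioxide - 78) + rnorm(n, sd = 0.6)
)

# Hold out 25% of the rows for testing (split proportion is an assumption).
test_rows  <- sample(seq_len(n), size = round(0.25 * n))
wine_train <- toy_wine[-test_rows, ]
wine_test  <- toy_wine[test_rows, ]

# Multiple linear regression on the training rows.
fit <- lm(quality ~ alcohol + total_sulfur_dioxide, data = wine_train)

# Predict on the test rows, round to the nearest integer, and clamp to 0-10.
raw_pred <- predict(fit, newdata = wine_test)
pred     <- pmin(pmax(round(raw_pred), 0), 10)

# Confusion matrix over the full 0-10 scale; the diagonal holds exact matches.
score_levels <- 0:10
confusion <- table(
  actual    = factor(wine_test$quality, levels = score_levels),
  predicted = factor(pred, levels = score_levels)
)
accuracy <- sum(diag(confusion)) / sum(confusion)

# RMSE on the unrounded predictions, as a complementary regression metric.
rmse <- sqrt(mean((raw_pred - wine_test$quality)^2))
print(accuracy)
print(rmse)
```

Reporting RMSE alongside accuracy is a design choice rather than part of the original plan: accuracy treats a prediction that is off by one the same as one that is off by five, while RMSE reflects how far off the predictions are.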
 

**Expected Outcomes and Significance**

We expect to be able to predict the quality of a particular sample on a scale of 0 (bad) to 10 (excellent), using the chosen predictors (important compounds present in a wine, such as alcohol and total sulfur dioxide).

Impact of our analysis:
High-quality wines have intense flavours that last a long time, even after swallowing the wine. This analysis would be helpful to both consumers and businesses like restaurants and hotels. If our model reaches a suitable level of accuracy, consumers could use it to judge whether they are paying a suitable price for the perceived wine quality, using factors from a wine's ingredients list. This would help consumers single out good-quality wine without overspending their hard-earned money.

Businesses like hotels and restaurants have thin profit margins. They can use our data analysis to predict the quality of a wine and determine how their customers would perceive it. This would allow them to purchase only wines of good quality and maximise their profits.

Future questions: 
If our data analysis is successful, we can look into other factors that determine the quality of a particular wine, such as its origin, grape-growing practices, winemaking practices, fermentation temperature, and age. We can also conduct a data analysis linking price to the quality of wine.


**Data Sources:**

https://archive.ics.uci.edu/ml/datasets/wine+quality

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

2022 economic impact study of the American wine industry methodology. WineAmerica. (2022, September 21). Retrieved October 28, 2022, from https://wineamerica.org/economic-impact-study/2022-american-wine-industry-methodology/ 

Members archive. Court of Master Sommeliers Europe. (n.d.). Retrieved October 28, 2022, from https://www.courtofmastersommeliers.org/members/ 

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5328826/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5770432/