Vines and Wines by Zach Dischner
========================================================
**Date:** April 4 2017
Part of the Udacity Data Analyst Nanodegree program
![Red Wines](http://static1.buchi.com/sites/default/files/styles/content_large/public/application-note-images/application-volatile-acids.jpg?itok=ULK99tjw)
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
# Load up all packages we want to use
library(ggplot2)
library(gridExtra)
library(GGally)
library(scales)
library(memisc)
```
```{r echo=FALSE, Load_the_Data}
setwd("/Users/dischnerz/code/udatascience/projects/p4-R-investigation")
# Load the Data
data <- read.csv("wineQualityReds.csv")
## Some preliminary manipulation
data$total.acidity <- data$fixed.acidity + data$volatile.acidity
## Set up Plot Themes
theme_set(theme_minimal(10))
```
### Dataset Description
This investigation is about red wine. It will analyze red wine properties in
order to correlate chemical components to overall wine quality.
**Data Structure**
```{r echo=FALSE}
str(data)
```
**Data Summary**
```{r echo=FALSE}
summary(data)
```
Observations of Summary Data:
* There are 1599 samples of Red Wine properties and quality values
* No wine achieves either a terrible (0) or perfect (10) quality score
* Residual Sugar measurement has a maximum that is nearly 19 times farther away\
from the 3rd quartile than the 3rd quartile is from the 1st.
* Citric Acid had a minimum of 0.0. No other property had a value of exactly 0.
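
The residual sugar observation compares the distance from the 3rd quartile to the maximum against the distance between the 1st and 3rd quartiles. A minimal sketch of that ratio follows. `tail_ratio` is a hypothetical helper name, and the toy vector is made up for illustration rather than taken from the wine data.

```r
# Ratio of (max - Q3) to (Q3 - Q1): how far the largest value sits
# beyond the upper quartile, in units of the interquartile range.
# `tail_ratio` is a hypothetical helper, not part of the original analysis.
tail_ratio <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  (max(x, na.rm = TRUE) - q[2]) / (q[2] - q[1])
}

# Toy data (made up): one extreme value far above the bulk
toy <- c(1, 2, 3, 4, 100)
tail_ratio(toy)
```

With the dataset loaded as above, the same helper could be applied as `tail_ratio(data$residual.sugar)`.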
# Univariate Plots Section
Overall distribution of *quality* rating of red wines. Quality here is a single
measurement between 0 and 10 of overall wine quality, as decided by three
separate agencies.
```{r echo=FALSE, Univariate_Plots}
# Easiest, look at quality
qplot(x=quality, data=data, geom='bar')
```
Overall wine quality, rated on a scale from 0 to 10, has a roughly normal shape
and very few exceptionally high or low quality ratings.
```{r echo=FALSE, message=FALSE, warning=FALSE, stats}
# Fixed and Volatile acidity
q1<-ggplot(aes(x=pH), data=data)+
geom_histogram(color =I('black'),fill = I('#990000'))+
ggtitle('pH distribution')
q2<-ggplot(aes(x=sulphates), data=data)+
geom_histogram(color =I('black'),fill = I('#990000'))+
ggtitle('Sulphates distribution')
q3<-ggplot(aes(x=chlorides), data=data)+
geom_histogram(color =I('black'),fill = I('#990000'))+
ggtitle('Chlorides distribution')
q4<-ggplot(aes(x=citric.acid), data=data)+
geom_histogram(color =I('black'),fill = I('#990000'))+
ggtitle('Citric Acid')
q5<-ggplot(aes(x=total.sulfur.dioxide), data=data)+
geom_histogram(color =I('black'),fill = I('#990000'))+
ggtitle('Total SO2 distribution')
q6<-ggplot(aes(x=total.acidity), data=data)+
geom_histogram(color =I('black'),fill = I('#990000'))+
ggtitle('Total Acidity (fixed + volatile)')
grid.arrange(q1,q2,q3,q4,q5,q6,ncol=2)
```
**Observations of Univariate Properties:**
* pH distribution is roughly normal centered around 3.3 with tails at 3.0 and \
3.6
* Sulphates look fairly normal centered at 0.6 +- 0.3, but with a positive skew\
that has measurements all the way out to 2.0
* Chlorides show a similar behavior with a center around 0.075
* Citric acid looks nearly like a double ramp function, with a peak at 0.0 and \
nearly constantly decreasing values until another peak citric acid value at\
0.25. A third jump at 0.5 is notable as well.
* Total acidity is roughly normal with a center at 7.5 and bounds at approx 5.0\
and 15.0.
Of all single component distributions, `pH` appears to be the only measurement
that is normally distributed. `Acidity` is next, being normal with just a slight
positive skew. `Citric Acid` and `SO2` particularly exhibit a strong positive
skew. `pH` and `Acidity` reflect measurements that are centered around non-zero
numbers, whereas the other measurements are clustered at near-zero numbers.
Since these measurements cannot have negative values, the skewed distribution is
sensible.
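
The skew described above was judged by eye from the histograms. One way to put a number on it is a moment-based sample skewness, sketched below. `sample_skewness` is a hypothetical helper, and both vectors are simulated stand-ins (a symmetric, pH-like shape and a zero-bounded, right-tailed shape), not the wine measurements.

```r
# Moment-based sample skewness: mean cubed deviation divided by the
# cubed (population) standard deviation. Positive values indicate a
# longer right tail. `sample_skewness` is a hypothetical helper.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)
symmetric  <- rnorm(1000, mean = 3.3, sd = 0.15)  # simulated, pH-like shape
right_tail <- rexp(1000, rate = 10)               # simulated, bounded at zero

sample_skewness(symmetric)   # near zero for a symmetric sample
sample_skewness(right_tail)  # positive for a right-tailed sample
```

Applied to the real columns, for example `sample_skewness(data$chlorides)`, this would let the "slight" and "strong" skew labels be compared on one scale.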
**Note on Acidity**
Wine acidity is measured in multiple components: volatile and fixed. Total
acidity is not to be confused with the `pH` of a wine. The former details the
amount of acid in the wine, the latter relates to the *strength* of the acids.
More on *fixed* and *volatile* acidity measures here:
http://winemakersacademy.com/understanding-wine-acidity/
# Univariate Analysis
#### Overview
The red wine dataset features 1599 separate observations, each for a different
red wine sample. For each sample, 11 *laboratory* based measurements were made,
as well as a single subjective overall quality rating. The quality of a wine was
decided upon by three separate professional wine assessment institutions.
As presented, each wine sample is provided as a single row in the dataset. Due to
the nature of how some measurements are gathered, some values given represent
*components* of a measurement total. For example, `data$fixed.acidity` and
`data$volatile.acidity` are both obtained via separate measurement techniques,
and must be summed to indicate the total acidity present in a wine sample. For
these cases, I supplemented the data given by computing the total and storing it
in the dataframe as a `data$total.*` variable.
#### Features of Interest
The main interesting measurement here is the wine `quality`. It is the
subjective measurement of how attractive the wine might be to a consumer. The
goal here will be to try and correlate non-subjective wine properties with its
quality.
I am curious about a few trends in particular:
* sulphates vs quality - Low sulphate wine has a reputation for not causing \
hangovers
* alcohol vs quality - Just an interesting measurement. Strong beers are \
typically higher quality and harder to make than lower proof beers.
At first, the lack of an age metric might be surprising since it is commonly
a factor in quick assumptions of wine quality. However, since the actual effect
of wine age is on the wine's measurable chemical properties, its inclusion here
is not necessary.
### Distributions
I left distributions alone here. Many measurements that were clustered close to
zero had a positive skew (you can't have negative percentages or amounts).
Others such as `pH` and `total.acidity` and `quality` had normal looking
distributions.
# Bivariate Plots Section
## Scatterplot Matrix
```{r echo=FALSE, message=FALSE, warning=FALSE, Bivariate_Plots}
## Autocorrelation plots
# ggpairs(data)
# lower = list(continuous = wrap("points", shape = I('.'))),
# upper = list(combo = wrap("box", outlier.shape = I('.'))))
set.seed(666)
ggpairs(data[sample.int(nrow(data),1000),])
```
**Observations of Correlation Plot Matrix**
* Total Acidity is highly correlated with fixed acidity
* pH appears correlated with acidity, citric acid, chlorides, and residual\
sugars
* No single property appears to have a strong correlation with quality
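
The observations above were read off the plot matrix. A minimal sketch of ranking properties by their Pearson correlation with one target column is shown below. `cor_with_target` is a hypothetical helper, and the `toy` data frame is simulated (made-up values that only borrow column names from the wine data).

```r
# Pearson correlation of every numeric column against one target column,
# sorted by absolute strength. `cor_with_target` is a hypothetical helper.
cor_with_target <- function(df, target) {
  num <- df[vapply(df, is.numeric, logical(1))]
  others <- setdiff(names(num), target)
  r <- vapply(others, function(nm) cor(num[[nm]], num[[target]]), numeric(1))
  r[order(abs(r), decreasing = TRUE)]
}

# Simulated stand-in for the wine data (values are made up)
set.seed(42)
n <- 500
toy <- data.frame(alcohol = rnorm(n, 10.4, 1), pH = rnorm(n, 3.3, 0.15))
toy$quality <- round(5.6 + 0.4 * (toy$alcohol - 10.4) + rnorm(n, 0, 0.7))

cor_with_target(toy, "quality")
```

On the real data the call would be `cor_with_target(data, "quality")`. Because quality is an integer rating, a rank-based measure such as `cor(..., method = "spearman")` may be a reasonable alternative.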
## Sulphates
```{r, message=FALSE, warning=FALSE}
q1 <- ggplot(aes(x=sulphates, y=quality), data=data) +
geom_jitter(alpha=2/3) +
geom_smooth() +
ggtitle("Sulphates vs Quality")
q2 <- ggplot(aes(x=sulphates, y=quality), data=subset(data, data$sulphates < 1)) +
geom_jitter(alpha=2/3) +
geom_smooth() +
ggtitle("Sulphates vs Quality without outliers")
grid.arrange(q1,q2, ncol=1)
```
**Observations**:
* There is a slight trend implying a relationship between sulphates and wine\
quality, particularly if you disregard extreme sulphate values
* Disregarding measurements where sulphates > 1.0 is the same as disregarding \
the positive tail of the distribution, keeping just the normal-looking portion
* The relationship is still a weak one
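
To check how much a tail influences a relationship, the same correlation can be computed with and without the upper tail. The sketch below assumes a quantile cutoff rather than the fixed threshold of 1.0 used in the plot. `cor_trimmed` is a hypothetical helper and the toy vectors are made up so that a single extreme point reverses the trend.

```r
# Compare a correlation on the full data with the same correlation after
# dropping rows above an upper quantile of x.
# `cor_trimmed` is a hypothetical helper, not part of the original analysis.
cor_trimmed <- function(x, y, upper = 0.99) {
  keep <- x <= quantile(x, upper, names = FALSE)
  c(full = cor(x, y), trimmed = cor(x[keep], y[keep]), dropped = sum(!keep))
}

# Toy data (made up): a perfect upward trend plus one extreme, low-rated point
x <- c(1:10, 50)
y <- c(1:10, 0)
cor_trimmed(x, y, upper = 0.9)
```

For the wine data, a call such as `cor_trimmed(data$sulphates, data$quality)` would show whether dropping the top 1% of sulphate values changes the strength of the trend.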
## Alcohol
```{r,message=FALSE, warning=FALSE, ALCOHOL}
q0 <- ggplot(aes(x=alcohol, y=quality), data=data) +
geom_jitter(alpha=2/3) +
geom_smooth() +
ggtitle("Alcohol Content vs Quality")
q1 <- ggplot(aes(x=alcohol), data=data) +
geom_density(fill=I("#BB0000")) +
facet_wrap("quality") +
ggtitle("Alcohol Content for Wine Quality Ratings")
q2 <- ggplot(aes(x=residual.sugar, y=alcohol), data=data) +
geom_point(alpha=2/3) +
geom_smooth() +
ggtitle("Alcohol vs Residual Sugar Content")
grid.arrange(q1, arrangeGrob(q0, q2), ncol = 2)
```
**Observations**
* Alcohol and quality appear to be somewhat correlated
* Lower quality wines tended to have lower alcohol contents
* Higher quality wines tended to have progressively higher alcohol content
* There is no clear relation (at best an erratic one) between sugar and alcohol \
content, which I found surprising as alcohol is a byproduct of the yeast \
feeding off of sugar during the fermentation process
## SO2, Chlorides, Sugar, and Density
```{r,message=FALSE, warning=FALSE, SO2}
q1 <- ggplot(aes(x=total.sulfur.dioxide, y=quality),
data=subset(data, data$total.sulfur.dioxide <
quantile(total.sulfur.dioxide, 0.99))) +
geom_jitter(alpha=1/3) +
geom_smooth() + ggtitle("SO2")
q2 <- ggplot(aes(x=chlorides, y=quality),data=data) +
geom_jitter(alpha=1/3) +
geom_smooth() + ggtitle("Chlorides")
q3 <- ggplot(aes(x=residual.sugar, y=quality),data=data) +
geom_jitter(alpha=1/3) +
geom_smooth() + ggtitle("Residual Sugar")
q4 <- ggplot(aes(x=total.acidity, y=quality),data=data) +
geom_jitter(alpha=1/3) +
geom_smooth() + ggtitle("Total Acidity")
grid.arrange(q1,q2, q3, q4)
```
**Observations**
* There is little to no noticeable correlation between SO2 and wine quality
* Residual Sugar has a particularly weak correlation with wine quality
* Acidity has no noticeable correlation with wine quality. This \
surprised me more than any other trend
# Bivariate Analysis
### Strong Correlations
I did not find a single, obvious and strong correlation between any wine
property and the given quality. Alcohol content is a strong contender but even
so, the correlation wasn't particularly strong.
### Notables
Most properties have roughly normal distributions, with some skew in one tail.
Scatterplot relationships between these properties often showed a slight trend
within the bulk of property values. However, as soon as we leave the
*expected range*, the trends actually reverse. See Alcohol Content or Sulphate
vs Quality for examples. The trend isn't a definitive one, but it is seen
in different variables. Possibly, obtaining an *outlier* property (say sulphate
content) is particularly difficult to do in the wine making process. Or, the
wines that exhibit outlier properties are deliberately of a non-standard variety.
In that case, it could be that wine judges have a harder time agreeing on
a quality rating.
# Multivariate Plots Section
```{r,message=FALSE, warning=FALSE, echo=FALSE, Multivariate_Plots}
q1 <- ggplot(aes(x=alcohol, y=chlorides, color=quality), data=subset(data,
data$chlorides < quantile(data$chlorides, 0.99))) +
geom_point(position='jitter') +
ggtitle("Alcohol Content vs Chlorides and Wine Quality Ratings")
q2 <- ggplot(aes(x=citric.acid, y=pH, color=quality),data=subset(data,
data$citric.acid < quantile(data$citric.acid, 0.99))) +
geom_point(position='jitter') +
geom_smooth() +
ggtitle("Citric Acid vs pH and Wine Quality Ratings")
grid.arrange(q1,q2)
```
**Observations**
* Adding chlorides to the Alcohol vs Quality plot added little insight
* Unusually bright chloride points occurred at different qualities and alcohol\
content points with no discernible pattern
* Higher alcohol content and lower chloride content appear to correlate with \
higher quality wines
* Higher alcohol content and higher citric acid content appear to correlate with\
higher quality wines
* pH has no notable effect on wine quality
```{r echo=FALSE,message=FALSE, warning=FALSE, Multivariate_Plots2}
q1 <- ggplot(aes(x=density, y=total.acidity, color=quality), data=data) +
geom_point(position='jitter') +
geom_smooth() +
ggtitle("Density vs Acidity colored by Wine Quality Ratings")
q2 <- ggplot(aes(x=residual.sugar, y=chlorides, color=quality), data=subset(data,
data$chlorides < quantile(data$chlorides, 0.95))) +
geom_point(position='jitter') +
geom_smooth() +
ggtitle("Sugar vs Chlorides colored by Wine Quality Ratings")
grid.arrange(q1, q2)
```
**Observations**
* Higher quality wines appear to have a slight correlation with higher acidity \
across all densities
* There are abnormally high and low quality wines coincident with higher-than\
-usual sugar content.
### Model Creation
```{r, message=FALSE, warning=FALSE, LINEAR_MODEL}
m1 <- lm((quality ~ alcohol), data = data)
m2 <- update(m1, ~ . + citric.acid)
m3 <- update(m2, ~ . + chlorides)
m4 <- update(m3, ~ . + residual.sugar)
m5 <- update(m4, ~ . + total.acidity)
mtable(m1, m2, m3, m4, m5)
```
```{r, message=FALSE, warning=FALSE, LINEAR_MODEL2}
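# Baseline attempt: `guess` is the median quality, repeated for every row
data$guess <- median(data$quality)
# NOTE: the response here is a constant, so the R-squared values reported
# for these fits are not a meaningful baseline for m1..m5
m21 <- lm((guess ~ quality), data = data)
m22 <- update(m21, ~ . + alcohol)
m23 <- update(m22, ~ . + chlorides)
mtable(m21, m22, m23)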
```
# Multivariate Analysis
The strongest relationship between wine properties and wine quality was that of
alcohol, chlorides and citric acid. Typically, pH is considered important when
assessing wine quality, however the data does not show an appreciable
correlation. In fact, any correlations were weak ones. The model built with
alcohol, chlorides, citric acid, residual sugar and total acidity featured a
low R-squared value of just 0.3. The second model, which regresses a constant
guess (the median wine quality) on the wine properties, reported an R-squared
of 0.5, but that figure is not a meaningful baseline: a constant response has
no variance to explain, so the two values cannot be compared directly.
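
One way to compare a fitted model against a constant guess is to score both sets of predictions against the observed quality with the same formula, R-squared = 1 - SS_res / SS_tot. The sketch below does that on simulated data. `r_squared` is a hypothetical helper and the `toy` data frame contains made-up values, so the numbers it prints say nothing about the wine dataset itself.

```r
# R-squared of arbitrary predictions against observed values:
# 1 - SS_res / SS_tot. `r_squared` is a hypothetical helper.
r_squared <- function(observed, predicted) {
  ss_res <- sum((observed - predicted)^2)
  ss_tot <- sum((observed - mean(observed))^2)
  1 - ss_res / ss_tot
}

# Simulated stand-in for the wine data (values are made up)
set.seed(7)
n <- 400
toy <- data.frame(alcohol = rnorm(n, 10.4, 1))
toy$quality <- round(5.6 + 0.4 * (toy$alcohol - 10.4) + rnorm(n, 0, 0.7))

fit <- lm(quality ~ alcohol, data = toy)

r2_model  <- r_squared(toy$quality, fitted(fit))                  # same as summary(fit)$r.squared
r2_mean   <- r_squared(toy$quality, rep(mean(toy$quality), n))    # constant mean guess
r2_median <- r_squared(toy$quality, rep(median(toy$quality), n))  # constant median guess
c(model = r2_model, mean_guess = r2_mean, median_guess = r2_median)
```

Scored this way, a constant guess at the mean has an R-squared of zero and a constant guess at the median cannot do better than that, so any model with a positive R-squared improves on both.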
Often, the tails of property distributions showed a varied relationship with
quality. Sometimes, the tails would even reverse trends exhibited by the bulk
of the property distribution.
I was most surprised that pH and residual sugar had no appreciable effect on
wine quality as these are two factors that I personally have heard when
sommeliers discuss a wine.
------
# Final Plots and Summary
### Plot One
```{r echo=FALSE, message=FALSE, warning=FALSE, Plot_One}
ggplot(aes(x=alcohol), data=data) +
geom_density(fill=I("#BB0000")) +
facet_wrap("quality") +
ggtitle("Alcohol Content for Wine Quality Ratings")
```
### Description One
This plot I think paints the most complete picture of the analysis of wine
quality with two takeaways.
1. Alcohol content has *some* effect on wine quality rating
2. The effect is not a strong one, and even for a given rating, there is \
significant variability
This visualization was especially appealing to me because of the way that
you can almost see the distribution shuffle from left to right as wine
ratings increase. Again, just showing a general tendency instead of a
strong correlation.
### Plot Two
```{r echo=FALSE,message=FALSE, warning=FALSE, Plot_Two}
ggplot(aes(x=citric.acid, y=pH, color=quality),data=subset(data,
data$citric.acid < quantile(data$citric.acid, 0.99))) +
geom_point(position='jitter') +
geom_smooth() +
ggtitle("Citric Acid vs pH and Wine Quality Ratings")
```
### Description Two
This plot illustrates that (to my surprise) there was little to no correlation
between wine quality and pH. At the same time, there is a trend where more highly
rated wines (brighter blue) tend to be grouped at higher Citric Acid values.
However, the trend is far from definite. There are very highly rated wines
at low citric acid values as well as very lowly rated wines at high citric
acid values.
In summary, this plot illustrates one of the main themes in wine analysis thus
far:
There *are* observable correlations between wine chemical properties and
quality. *However*, in each case, there are outlier ratings that undermine
the general relationship.
### Plot Three
```{r echo=FALSE,message=FALSE, warning=FALSE, Plot_Three}
ggplot(aes(x=citric.acid, y=quality),
data=subset(data, data$citric.acid <
quantile(data$citric.acid, 0.99))) +
geom_jitter(alpha=2/3) +
geom_smooth() + ggtitle("Citric Acid vs Quality")
```
### Description Three
After alcohol content, citric acid seemed to have the most prominent effect
on wine quality. There is a specific bump at around 0.25 where the wine quality
jumps up a rating. That bump was also present in the univariate distribution
plot for citric acid. Is that a feature of certain kinds of wine? Or of wines
of a certain age? Possibly, just a "sweet spot" of citric acid content that
makes for a higher quality wine? The root cause is unknown but I found the
behavior interesting nonetheless.
When plotted against wine quality in a scatter plot like this, only citric acid
and alcohol displayed a noticeable trend. In each case, the distribution is
slightly skewed towards higher citric acid and higher quality.
------
# Reflection
Overall, I was initially surprised by the seemingly dispersed nature of the
wine data. Nothing was immediately correlatable to being an inherent quality
of good wines. However, upon reflection, this is a sensible finding. Wine
making is still something of a science and an art, and if there was one
single property or process that continually yielded high quality wines, the
field wouldn't be what it is.
I was surprised to find that alcohol content and citric acid were the most
correlatable properties to wine quality. In my mind, sulphates and acidity
were what I assumed would be the main correlations.
In the future, I would like to do some research into the wine making process.
Some winemakers might actively try for some property values or combinations,
and finding those combinations (of 3 or more properties) might be the key
to truly predicting wine quality. This investigation was not able to find
a strong set of two properties that would consistently be able to predict
wine quality with any degree of certainty.
Additionally, having the wine type would be helpful for further analysis.
Sommeliers might prefer certain types of wines to have different
properties and behaviors. For example, a Port (a sweet dessert wine)
surely is rated differently from a dark and robust Cabernet Sauvignon,
which is rated differently from a bright and fruity Syrah. Without knowing
the type of wine, it is entirely possible that we are almost literally
comparing apples to oranges and can't find a correlation.