---
title: "Report of Wine Reviews: Demystifying the sophisticated alcohol"
output: html_document
---
#### Data from: https://www.kaggle.com/zynicide/wine-reviews
#### Inspiration from: https://www.kaggle.com/graeme16161/wine-data-explore
Wine is one of the delights that life has given us; to some, it is even an art. Those who practise this art are known as sommeliers. However, many question how subjective the field is and wonder whether it is all just a hoax.
Are the prices of wines justified by the sommeliers' ratings? Are other factors at play in the pricing of wines?
Let us dive into the realm of this sophisticated beverage.
```{r setup, include= FALSE}
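library(tidyverse)  # ggplot2, dplyr, readr and related packages
library(knitr)
library(tm)
library(readr)
# Load the reviews; the path is relative to this .Rmd file
wine_complete <- read.csv('./wine_data/winemag-data_first150k.csv')
# Drop the first column, which holds only the row index stored in the CSV
wine_complete[1] <- NULL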
```
* Remove rows that contain missing values
```{r}
wine_complete <- wine_complete[complete.cases(wine_complete),]
```
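* As a minimal sketch of what `complete.cases()` keeps and drops, consider the toy data frame below. The data frame `toy` and its values are hypothetical and not taken from the wine dataset. Note that only `NA` counts as missing; an empty string in a text column does not, which is worth keeping in mind because `read.csv()` by default reads blank text fields as empty strings.

```{r}
# Hypothetical toy data, not taken from the wine dataset
toy <- data.frame(
  points = c(90, 85, NA, 92),
  price  = c(20, NA, 15, 40),
  region = c("A", "B", "C", ""),
  stringsAsFactors = FALSE
)

# TRUE for rows with no NA in any column
keep <- complete.cases(toy)

# Rows 2 and 3 are dropped; row 4 survives because "" is not NA
toy_clean <- toy[keep, ]
toy_clean
```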
```{r}
head(wine_complete)
```
```{r}
ggplot(wine_complete, aes(x = points)) + geom_histogram(binwidth = 1) + ggtitle("Frequency of each points score")
```
* Points form an ordered scale, so comparisons are self-explanatory: a 99-point bottle is rated higher than a 98-point bottle. The raw numbers do not tell us where good wine ends and bad wine begins, though, so we group the points into named grades.
```{r}
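# Intervals are closed on the left, e.g. [80, 90); include.lowest = TRUE
# closes the last interval to [95, 100] so a 100-point wine is not graded NA
wine_complete$Grade <- cut(wine_complete$points, c(0,50,60,70,80,90,95,100), right = FALSE, include.lowest = TRUE, labels = c("Undrinkable", "Not recommended", "Average", "Above Average", "Very good", "Superior", "Exceptional"))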
```
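* The behaviour of `cut()` at the interval edges is easy to get wrong, so here is a small sketch on a hypothetical vector of scores (`toy_points` is made up for illustration). With `right = FALSE` each interval includes its lower bound and excludes its upper bound, which means the top break (100) falls outside every interval unless `include.lowest = TRUE` is set.

```{r}
# Hypothetical scores chosen to sit on the grade boundaries
toy_points <- c(79, 80, 89, 90, 94, 95, 100)
breaks <- c(0, 50, 60, 70, 80, 90, 95, 100)
grades <- c("Undrinkable", "Not recommended", "Average", "Above Average",
            "Very good", "Superior", "Exceptional")

# Without include.lowest, the score 100 matches no interval and becomes NA
open_top <- cut(toy_points, breaks, right = FALSE, labels = grades)

# With include.lowest = TRUE, the last interval becomes [95, 100]
closed_top <- cut(toy_points, breaks, right = FALSE,
                  include.lowest = TRUE, labels = grades)

data.frame(toy_points, open_top, closed_top)
```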
```{r}
ggplot(wine_complete, aes(x = points, fill = Grade)) + geom_histogram(binwidth = 1) + ggtitle("Points showing how wines have been categorised by grade")
```
* Next, we look at the relationship between price and points
```{r}
ggplot(wine_complete, aes(x = price, y = points)) + geom_point()
```
* How much value for money do these wines offer? We compute an economic score for each wine: the number of points it earns per dollar of its price.
```{r}
wine_complete$econ_score <- wine_complete$points / wine_complete$price
```
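* To make the economic score concrete, here is a small worked example on three hypothetical wines (the data frame `toy_wines` and its values are invented for illustration). A cheap wine with a good score gets a high economic score, while an expensive wine with an excellent score gets a low one.

```{r}
# Hypothetical wines, not taken from the dataset
toy_wines <- data.frame(
  points = c(90, 95, 88),
  price  = c(15, 190, 22)
)

# Points earned per dollar paid
toy_wines$econ_score <- toy_wines$points / toy_wines$price

# 90 / 15 = 6, 95 / 190 = 0.5, 88 / 22 = 4
toy_wines
```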
* Now we can take a look at the distribution of the economic score.
```{r}
ggplot(wine_complete, aes(x = 1, y = econ_score)) + geom_boxplot()
```
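* A boxplot can be hard to read when a few values are extreme, so a numeric summary may help alongside it. The sketch below uses a hypothetical vector `toy_econ` rather than the real scores, and flags high outliers with the common 1.5 × IQR rule, which is assumed here as a rule of thumb.

```{r}
# Hypothetical economic scores with one extreme value
toy_econ <- c(0.5, 2, 3, 4, 6, 45)

# Five-number summary plus the mean
summary(toy_econ)

# Quartiles and interquartile range
q <- quantile(toy_econ, probs = c(0.25, 0.75))
iqr <- IQR(toy_econ)

# Values above the upper fence are treated as high outliers
upper_fence <- unname(q[2]) + 1.5 * iqr
outliers <- toy_econ[toy_econ > upper_fence]
outliers
```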