# Wine quality analysis

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Summary

## Introduction

Red wine has a long history and remains popular around the world. Depending on the raw materials and the production process, its quality can vary significantly, and typically only professional sommeliers can tell the differences. With advances in technology, it is now possible to quantify certain metrics of red wine. Here we want to find out how these metrics influence quality. Using a red wine dataset containing quality ratings and 11 metrics, we aim to build a machine learning model that predicts the quality of a bottle of red wine.

## Methods & Results

Read the red wine quality dataset

In [2]:
data = pd.read_csv("data/winequality-red.csv", sep=';')
data.head(5)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
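
The `sep=';'` argument matters because the file is delimited by semicolons rather than commas. As a minimal sketch, the snippet below parses a small in-memory excerpt; the string `raw` and the choice of four columns are illustrative assumptions, not the full file.

```python
import io

import pandas as pd

# Small semicolon-delimited excerpt (a subset of columns, for illustration only).
raw = '"fixed acidity";"pH";"alcohol";"quality"\n7.4;3.51;9.4;5\n7.8;3.2;9.8;5\n'

# With the default comma separator each line would land in a single column;
# sep=";" splits it into the intended fields.
sample = pd.read_csv(io.StringIO(raw), sep=";")
print(sample.columns.tolist())
print(sample.shape)
```

Checking `sample.shape` right after loading is a quick way to catch a wrong delimiter, since a mis-parsed file usually shows up as a single wide column.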


Show summary statistics of the dataset

In [3]:
print(data.describe())

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000         

The dataset contains 11 features that we use to predict the quality of red wine, which is rated on a scale from 0 to 10. In this dataset, the quality scores range from 3 to 8, with a mean of about 5.6. We therefore label red wines with a quality score of 6 or higher as "good" (marked as 1) and the others as "not good" (marked as 0).

In [4]:
# Binary target: 1 if quality is 6 or higher, otherwise 0
data["is_good"] = (data["quality"] > 5).astype(int)
data.head(5)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | is_good |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 0 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 0 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 1 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
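
Once the scores are binarized, it can be useful to check how balanced the two classes are, since a heavily skewed target changes how accuracy should be read. The sketch below applies the same threshold to a small hypothetical frame; the name `toy` and its scores are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical quality scores, for illustration only (not the real dataset).
toy = pd.DataFrame({"quality": [3, 5, 5, 6, 7, 8, 5, 6]})

# Same rule as in the analysis: 6 or higher is "good" (1), the rest "not good" (0).
toy["is_good"] = (toy["quality"] > 5).astype(int)

# Fraction of samples in each class, ordered by label.
class_share = toy["is_good"].value_counts(normalize=True).sort_index()
print(class_share)
```

Running the same `value_counts(normalize=True)` call on `data["is_good"]` would give the class shares for the actual wine data.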


Data splitting

In [5]:

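# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
train_data, test_data = train_test_split(data, train_size=0.8, random_state=123)
# Features are all columns except the last two ("quality" and "is_good")
x_train = np.array(train_data.iloc[:, :-2])
y_train = np.array(train_data["is_good"])
x_test = np.array(test_data.iloc[:, :-2])
y_test = np.array(test_data["is_good"])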
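
Selecting features with `iloc[:, :-2]` relies on `quality` and `is_good` being the last two columns. One possible alternative, sketched here on a hypothetical toy frame (the name `toy`, the two feature columns, and the random values are assumptions for illustration), is to drop the target columns by name and confirm the resulting shapes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the wine data (values are random).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "alcohol": rng.uniform(8, 14, size=20),
    "pH": rng.uniform(2.8, 4.0, size=20),
    "quality": rng.integers(3, 9, size=20),
})
toy["is_good"] = (toy["quality"] > 5).astype(int)

# Same split settings as in the analysis: 80% train, fixed seed.
train, test = train_test_split(toy, train_size=0.8, random_state=123)

# Drop the target columns by name instead of by position.
target_cols = ["quality", "is_good"]
x_train = train.drop(columns=target_cols).to_numpy()
y_train = train["is_good"].to_numpy()
x_test = test.drop(columns=target_cols).to_numpy()
y_test = test["is_good"].to_numpy()

print(x_train.shape, x_test.shape)
```

Dropping by name keeps the feature matrix correct even if more derived columns are appended to the frame later.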


## Discussion

## References