# Phân tích khám phá về chất lượng rượu vang đỏ (Red Wine Quality)


## 1. Định nghĩa vấn đề (Define Problem)
- **Mô tả**:
    - Tập “Red Vinho Verde” (Bồ Đào Nha) thuộc bộ Wine Quality gốc UCI, được mirror trên Kaggle. Dữ liệu chỉ gồm thử nghiệm hoá lý và điểm cảm quan “quality”.
+ **Dữ liệu vào**:
    - fixed acidity: 
    - volatile acidity
    - citric acid
    - residual sugar 
    - chlorides
    - free sulfur dioxide 
    - total sulfur dioxide 
    - density, pH
    - sulphates
    - alcohol
+ **Kết quả**: quality (3-8)

## 2. Chuẩn bị vấn đề (Prepare Problem)

### 2.1. Khai báo thư viện (Load Libraries)

In [2]:
# Importing all import python libraries
import pandas as pd
from IPython import display
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 2.2. Nạp dữ liệu (Load Dataset)

In [17]:
winequality_df = pd.read_csv("winequality-red.csv")
winequality_df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [None]:
print(f"Content Unique columns:")
print(f"Fixed acidity: {np.sort(winequality_df["quality"].unique())}")
print(f"volatile acidity: {np.sort(winequality_df["volatile acidity"].unique())}")
print(f"citric acid: {np.sort(winequality_df["citric acid"].unique())}")
print(f"residual sugar: {np.sort(winequality_df["residual sugar"].unique())}")
print(f"chlorides: {np.sort(winequality_df["chlorides"].unique())}")
print(f"free sulfur dioxide: {np.sort(winequality_df["free sulfur dioxide"].unique())}")
print(f"total sulfur dioxide: {np.sort(winequality_df["total sulfur dioxide"].unique())}")
print(f"density: {np.sort(winequality_df["density"].unique())}")
print(f"pH: {np.sort(winequality_df["pH"].unique())}")
print(f"sulphates: {np.sort(winequality_df["sulphates"].unique())}")
print(f"alcohol: {np.sort(winequality_df["alcohol"].unique())}")
print(f"quality: {np.sort(winequality_df["quality"].unique())}")


Content Unique columns:
Fixed acidity: [3 4 5 6 7 8] /n
volatile acidity: [0.12  0.16  0.18  0.19  0.2   0.21  0.22  0.23  0.24  0.25  0.26  0.27
 0.28  0.29  0.295 0.3   0.305 0.31  0.315 0.32  0.33  0.34  0.35  0.36
 0.365 0.37  0.38  0.39  0.395 0.4   0.41  0.415 0.42  0.43  0.44  0.45
 0.46  0.47  0.475 0.48  0.49  0.5   0.51  0.52  0.53  0.54  0.545 0.55
 0.56  0.565 0.57  0.575 0.58  0.585 0.59  0.595 0.6   0.605 0.61  0.615
 0.62  0.625 0.63  0.635 0.64  0.645 0.65  0.655 0.66  0.665 0.67  0.675
 0.68  0.685 0.69  0.695 0.7   0.705 0.71  0.715 0.72  0.725 0.73  0.735
 0.74  0.745 0.75  0.755 0.76  0.765 0.77  0.775 0.78  0.785 0.79  0.795
 0.8   0.805 0.81  0.815 0.82  0.825 0.83  0.835 0.84  0.845 0.85  0.855
 0.86  0.865 0.87  0.875 0.88  0.885 0.89  0.895 0.9   0.91  0.915 0.92
 0.935 0.95  0.955 0.96  0.965 0.975 0.98  1.    1.005 1.01  1.02  1.025
 1.035 1.04  1.07  1.09  1.115 1.13  1.18  1.185 1.24  1.33  1.58 ]
citric acid: [0.   0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0

- Edit those columns (ex: fixed acidity -> fixed_acidity )

## 3. Phân tích dữ liệu (Analyze Data)

### 3.1. Thống kê mô tả (Descriptive Statistics)

#### (1) **Hiển thị một số thông tin về dữ liệu**
+ Số dòng, số cột của dữ liệu
+ Kiểu dữ liệu của từng cột
+ 5 dòng đầu và 5 dòng cuối của bảng dữ liệu
+ Thông tin chung về dữ liệu

In [19]:
# shape
print(f'+ Shape: {winequality_df.shape}')
# head, tail
print(f'+ Contents: ')
display.display(winequality_df.head(5))
display.display(winequality_df.tail(5))
# info
winequality_df.info()

+ Shape: (1599, 12)
+ Contents: 


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
1594,6.2,0.6,0.08,2.0,0.09,32.0,44.0,0.9949,3.45,0.58,10.5,5
1595,5.9,0.55,0.1,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.51,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5
1598,6.0,0.31,0.47,3.6,0.067,18.0,42.0,0.99549,3.39,0.66,11.0,6


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


**Nhận xét**:
+ Dữ liệu có 11 tính chất để phân lớp: Fixed acidity (g/L), Volatile acidity , Citric acid, Residual sugar, Chlorides, Free sulfur dioxide, Total sulfur dioxide, Density, pH, Sulphates (SO₄²⁻), Alcohol    
+ Tổng số dòng dữ liệu là 1599 dòng
+ Dữ liệu để phân lớp ở cột quality 

#### (2) **Kiểm tra tính toàn vẹn của dữ liệu**
+ Dữ liệu có bị trùng lặp không? Hiển thị dòng bị vi phạm.
+ Dữ liệu có tồn tại giá trị Null không? Hiển thị dòng bị vi phạm.
+ Dữ liệu có tồn tại giá trị NaN không? Hiển thị dòng bị vi phạm.

In [6]:
has_null = winequality_df.isnull().sum().any()
has_nan  = winequality_df.isna().sum().any()
n_duplicated = winequality_df.duplicated().sum()
print(f'Tính toàn vẹn dữ liệu:')
print(f'+ Có giá trị Null: {has_null}')
if has_null:
    display.display(winequality_df[winequality_df.isnull().any(axis=1)])
print(f'+ Có giá trị Nan: {has_nan}')
if has_nan:
    display.display(winequality_df[winequality_df.isna().any(axis=1)])
print(f'+ Số dòng trùng: {n_duplicated}')

Tính toàn vẹn dữ liệu:
+ Có giá trị Null: False
+ Có giá trị Nan: False
+ Số dòng trùng: 244


**Nhận xét**:
+ Dữ liệu co 244 bị trùng và không có giá trị rỗng (NaN, Null)