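# Drawing Conclusions Using Groupby

In the notebook below, you will use Pandas' groupby function to investigate two questions about this data. Here are tips for answering each question: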

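- Question 1: Is a certain type of wine (red or white) associated with higher quality?

For this question, compare the average quality rating of red wine with that of white wine. To do this, group by color, then find the mean quality rating of each group.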

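- Question 2: What level of acidity (pH value) receives the highest average rating?

This question is trickier. Color has clear categories to group by (red and white), but pH is a quantitative variable with no clear categories. There is a simple solution, however: you can create a categorical variable from a quantitative one by defining your own categories. Pandas' cut function lets you "cut" data into groups. Use it to create a new column called acidity_levels with the following categories. Note that a lower pH means a higher acidity, which is why the lowest pH values are labeled "High".
Acidity levels:

    High: lowest 25% of pH values
    Moderately High: 25% - 50% of pH values
    Medium: 50% - 75% of pH values
    Low: 75% - max of pH values (the highest 25%)

Here, the data is split at the 25th, 50th, and 75th percentiles. Remember, you can get these numbers with Pandas' describe() function! After creating these four categories, use groupby to get the mean quality rating for each acidity level.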
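As a rough illustration of how cut assigns labels, here is a minimal sketch on made-up pH readings. The `ph` values and the `edges` list are assumptions chosen for this example, not numbers from the wine dataset.

```python
import pandas as pd

# Made-up pH readings, for illustration only
ph = pd.Series([3.05, 3.15, 3.25, 3.45])

# Five edges define four bins; each bin is right-closed, e.g. (3.0, 3.1]
edges = [3.0, 3.1, 3.2, 3.3, 3.5]
labels = ['high', 'mod_high', 'medium', 'low']  # lowest pH = highest acidity

# Each value receives the label of the bin it falls into
levels = pd.cut(ph, bins=edges, labels=labels)
print(levels.tolist())
```

The list of labels must be one shorter than the list of edges, because n edges delimit n - 1 bins.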


# Drawing Conclusions Using Groupby

In [1]:
# Load `winequality_edited.csv`
import pandas as pd

df = pd.read_csv('winequality_edited.csv')

### Is a certain type of wine associated with higher quality?

In [2]:
# Find the mean quality of each wine type (red and white) with groupby
# Selecting the column before aggregating avoids averaging non-numeric columns
df.groupby('color')['quality'].mean()

color
red      5.636023
white    5.877909
Name: quality, dtype: float64
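A group mean on its own does not show how many rows stand behind it. The sketch below, on a made-up `toy` frame (an assumption for illustration, not rows from the wine file), shows one way to report the mean and the group size together with `agg`.

```python
import pandas as pd

# Made-up ratings, for illustration only
toy = pd.DataFrame({
    'color': ['red', 'red', 'white', 'white', 'white'],
    'quality': [5, 6, 6, 7, 8],
})

# One row per color, with the mean rating and the number of wines in the group
summary = toy.groupby('color')['quality'].agg(['mean', 'count'])
print(summary)
```

The same call on `df` would give the group sizes for the red and white wines next to the means shown above.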

### What level of acidity receives the highest average rating?

In [3]:
# View the min, 25%, 50%, 75%, max pH values with Pandas describe
df['pH'].describe()

count    6497.000000
mean        3.218501
std         0.160787
min         2.720000
25%         3.110000
50%         3.210000
75%         3.320000
max         4.010000
Name: pH, dtype: float64
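Instead of copying the five numbers by hand, the edges can also be computed. This is a minimal sketch on a made-up `ph` series (an assumption for illustration). It uses `Series.quantile`, which by default interpolates the same way describe() does.

```python
import pandas as pd

# Made-up pH values, for illustration only
ph = pd.Series([2.8, 3.0, 3.2, 3.4, 3.6])

# min, 25%, 50%, 75%, max: the same five statistics describe() reports
bin_edges = ph.quantile([0, 0.25, 0.5, 0.75, 1]).tolist()
print(bin_edges)

# The quartiles agree with the describe() output
print(ph.describe()[['min', '25%', '50%', '75%', 'max']].tolist())
```

Computing the edges this way keeps them in sync with the data if the file changes.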

In [4]:
# Bin edges that will be used to "cut" the data into groups
bin_edges = [2.72, 3.11, 3.21, 3.32, 4.01]  # min, 25%, 50%, 75%, max from describe() above

In [5]:
# Labels for the four acidity level groups
bin_names = ['high', 'mod_high', 'medium', 'low']  # ordered from the lowest pH bin to the highest

In [6]:
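# Creates acidity_levels column
# Note: pd.cut bins are right-closed by default, so a pH equal to the lowest edge (2.72)
# falls outside the first bin and gets NaN unless include_lowest=True is passed
df['acidity_levels'] = pd.cut(df['pH'], bin_edges, labels=bin_names)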

# Checks for successful creation of this column
df.head()

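| | alcohol | chlorides | citric_acid | color | density | fixed_acidity | free_sulfur_dioxide | pH | quality | residual_sugar | sulphates | total_sulfur-dioxide | total_sulfur_dioxide | volatile_acidity | acidity_levels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.4 | 0.076 | 0.0 | red | 0.9978 | 7.4 | 11.0 | 3.51 | 5 | 1.9 | 0.56 | 34.0 | NaN | 0.7 | low |
| 1 | 9.8 | 0.098 | 0.0 | red | 0.9968 | 7.8 | 25.0 | 3.2 | 5 | 2.6 | 0.68 | 67.0 | NaN | 0.88 | mod_high |
| 2 | 9.8 | 0.092 | 0.04 | red | 0.997 | 7.8 | 15.0 | 3.26 | 5 | 2.3 | 0.65 | 54.0 | NaN | 0.76 | medium |
| 3 | 9.8 | 0.075 | 0.56 | red | 0.998 | 11.2 | 17.0 | 3.16 | 6 | 1.9 | 0.58 | 60.0 | NaN | 0.28 | mod_high |
| 4 | 9.4 | 0.076 | 0.0 | red | 0.9978 | 7.4 | 11.0 | 3.51 | 5 | 1.9 | 0.56 | 34.0 | NaN | 0.7 | low |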
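Besides looking at the first rows, it can help to count how many rows ended up without a label. The sketch below uses made-up values (the `ph` series and `edges` are assumptions for illustration) where one reading sits exactly on the lowest edge, to show the effect of `include_lowest`.

```python
import pandas as pd

# Made-up pH values, for illustration only; 3.0 equals the lowest bin edge
ph = pd.Series([3.0, 3.1, 3.2, 3.3, 3.4])
edges = [3.0, 3.1, 3.2, 3.3, 3.4]
labels = ['high', 'mod_high', 'medium', 'low']

# Default bins are right-closed, so the first bin is (3.0, 3.1] and excludes 3.0
default_cut = pd.cut(ph, bins=edges, labels=labels)
print(default_cut.isna().sum())  # number of rows left without a label

# include_lowest=True also closes the first bin on the left
inclusive_cut = pd.cut(ph, bins=edges, labels=labels, include_lowest=True)
print(inclusive_cut.isna().sum())
```

Running `df['acidity_levels'].isna().sum()` on the wine data would show whether any rows with the minimum pH were left unlabeled.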


In [7]:
# Find the mean quality of each acidity level with groupby
# Selecting the column before aggregating avoids averaging non-numeric columns
df.groupby('acidity_levels')['quality'].mean()

acidity_levels
high        5.783343
mod_high    5.784540
medium      5.850832
low         5.859593
Name: quality, dtype: float64
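To read the answer off programmatically rather than by eye, `idxmax` returns the label with the largest value. This sketch rebuilds the result as a small Series using the four means printed above. The variable names are assumptions for illustration.

```python
import pandas as pd

# Mean quality per acidity level, copied from the groupby output above
mean_quality = pd.Series(
    {'high': 5.783343, 'mod_high': 5.784540, 'medium': 5.850832, 'low': 5.859593},
    name='quality',
)

# Label of the group with the highest mean rating
best_level = mean_quality.idxmax()
print(best_level)

# Spread between the highest and lowest group means
print(round(mean_quality.max() - mean_quality.min(), 6))
```

In the notebook itself, the same call can be chained directly onto the groupby result.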

In [8]:
# Save changes for the next section
df.to_csv('winequality_edited.csv', index=False)