# Introduction

Maps allow us to transform the data in a DataFrame or Series one value at a time for an entire column. Often, however, we want to group our data first and then do something specific to the group each row belongs to.

As you'll learn, we do this with the `groupby()` operation. We'll also cover some additional topics, such as more complex ways to index your DataFrames and how to sort your data.

In [1]:
import pandas as pd
reviews = pd.read_csv("./input/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)

In [3]:
reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


# Groupwise analysis

One function we've been using heavily thus far is `value_counts()`. We can replicate what `value_counts()` does as follows:

In [None]:
reviews.groupby('points').points.count()

points
80     397
81     692
      ... 
99      33
100     19
Name: points, Length: 21, dtype: int64

This groups the `reviews` DataFrame by the values in the `points` column and then counts the number of occurrences of each unique value in that column. The result is a Series whose index holds the unique values from the `points` column and whose values are the counts of those unique values.

`groupby()` created groups of reviews, one for each point value, each containing the wines that were given that score. Then, for each of these groups, we grabbed the `points` column and counted how many times it appeared. `value_counts()` is just a shortcut for this `groupby()` operation.
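
The equivalence described above can be checked on a small example. The sketch below uses a hypothetical toy DataFrame (not the wine dataset) and assumes only that pandas is installed. One difference worth noting: `value_counts()` orders its result by count, while `groupby()` orders it by group key, so the sketch sorts the `value_counts()` result by index before comparing.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews.
toy = pd.DataFrame({"points": [87, 87, 90, 85, 90, 87]})

# Count rows per point value with groupby(); the result is ordered by group key.
by_group = toy.groupby("points").points.count()

# value_counts() orders by count, so sort by index to line the two results up.
by_value_counts = toy.points.value_counts().sort_index()

print(by_group.to_dict())
print(by_value_counts.to_dict())
```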

We can use any of the summary functions we've used before with this data. For example, to get the price of the cheapest wine in each point value category, we can do the following:

In [7]:
reviews.groupby('points').price.min()

points
80      5.0
81      5.0
       ... 
99     44.0
100    80.0
Name: price, Length: 21, dtype: float64

(⭐️⭐️⭐️)

You can think of each group we generate as a slice of our DataFrame containing only the rows whose values match. This DataFrame is accessible to us directly through the `apply()` method, and we can then manipulate the data in any way we see fit. For example, here's one way of selecting the name of the first wine reviewed from each winery in the dataset:

In [8]:
reviews.groupby('winery').apply(lambda df: df.title.iloc[0])

winery
1+1=3                          1+1=3 NV Rosé Sparkling (Cava)
10 Knots                 10 Knots 2010 Viognier (Paso Robles)
                                  ...                        
àMaurice    àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka                         Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object
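
To make the "slice of the DataFrame" idea concrete, here is a minimal sketch on a hypothetical toy DataFrame (the winery names and titles are made up for illustration). The function passed to `apply()` is called once per winery with that winery's rows. Selecting the column before calling `apply()` is an alternative that hands the function a Series per group instead. Depending on your pandas version, the first form may emit a deprecation warning about grouping columns.

```python
import pandas as pd

# Hypothetical toy data; names and scores are illustrative only.
toy = pd.DataFrame({
    "winery": ["A", "B", "A", "B", "C"],
    "title": ["A red", "B white", "A rosé", "B red", "C white"],
    "points": [88, 90, 91, 85, 87],
})

# The lambda receives the sub-DataFrame for one winery at a time.
first_titles = toy.groupby("winery").apply(lambda df: df.title.iloc[0])

# Alternative: select the column first, so each group is a Series.
first_titles_alt = toy.groupby("winery").title.apply(lambda s: s.iloc[0])

print(first_titles.to_dict())
```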

For even more fine-grained control, you can also group by more than one column. For example, here's how we would pick out the best wine by country _and_ province:

In [9]:
reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])

country,province,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
Argentina,Mendoza Province,Argentina,"If the color doesn't tell the full story, the ...",Nicasia Vineyard,97,120.0,Mendoza Province,Mendoza,,Michael Schachner,@wineschach,Bodega Catena Zapata 2006 Nicasia Vineyard Mal...,Malbec,Bodega Catena Zapata
Argentina,Other,Argentina,"Take note, this could be the best wine Colomé ...",Reserva,95,90.0,Other,Salta,,Michael Schachner,@wineschach,Colomé 2010 Reserva Malbec (Salta),Malbec,Colomé
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Uruguay,San Jose,Uruguay,"Baked, sweet, heavy aromas turn earthy with ti...",El Preciado Gran Reserva,87,50.0,San Jose,,,Michael Schachner,@wineschach,Castillo Viejo 2005 El Preciado Gran Reserva R...,Red Blend,Castillo Viejo
Uruguay,Uruguay,Uruguay,"Cherry and berry aromas are ripe, healthy and ...",Blend 002 Limited Edition,91,22.0,Uruguay,,,Michael Schachner,@wineschach,Narbona NV Blend 002 Limited Edition Tannat-Ca...,Tannat-Cabernet Franc,Narbona
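
The same kind of "best row per group" selection can also be written without `apply()`. The sketch below is an alternative technique, not the approach used in the cell above: it asks each group for the index label of its highest score with `idxmax()` and then looks those rows up in the original frame. The toy DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; titles and scores are illustrative only.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "Italy", "Italy"],
    "province": ["Oregon", "Oregon", "California", "Tuscany", "Tuscany"],
    "title": ["O1", "O2", "C1", "T1", "T2"],
    "points": [87, 92, 90, 95, 89],
})

# For each (country, province) pair, find the row label of the top score.
best_rows = toy.groupby(["country", "province"]).points.idxmax()

# Look those labels up in the original DataFrame.
best = toy.loc[best_rows]

print(best)
```

Unlike the `apply()` version, the result here keeps the original row labels as its index rather than a (country, province) multi-index.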


Another `groupby()` method worth mentioning is `agg()`, which lets you run several different functions on each group at once. For example, we can generate a simple statistical summary of the dataset as follows:

In [10]:
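# Passing 'min' and 'max' as strings avoids the warning that newer pandas versions emit for the builtins.
reviews.groupby(['country']).price.agg([len, 'min', 'max'])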

country,len,min,max
Argentina,3800,4.0,230.0
Armenia,2,14.0,15.0
...,...,...,...
Ukraine,14,6.0,13.0
Uruguay,109,10.0,130.0
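
When a column contains missing values, as `price` does here, it helps to know which counting function you are using. The sketch below, on a hypothetical toy DataFrame, passes aggregation names as strings: `'size'` counts every row in the group (like `len`), while `'count'` only counts non-missing values.

```python
import pandas as pd

# Hypothetical toy data; one US price is deliberately missing.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "Italy", "Italy"],
    "price": [10.0, None, 30.0, 20.0, 40.0],
})

# Several aggregations at once, named by string.
summary = toy.groupby("country").price.agg(["size", "count", "min", "max"])

print(summary)
```

In this sketch the US group has three rows but only two prices, so its `size` and `count` differ.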


Effective use of `groupby()` will allow you to do a lot of powerful things with your dataset.

# Multi-indexes

In all of the examples we've seen thus far, we've been working with DataFrame or Series objects with a single-label index. `groupby()` is slightly different in that, depending on the operation we run, it will sometimes produce what is called a multi-index.

A multi-index differs from a regular index in that it has multiple levels. For example:

In [11]:
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed

country,province,len
Argentina,Mendoza Province,3264
Argentina,Other,536
...,...,...
Uruguay,San Jose,3
Uruguay,Uruguay,24


In [12]:
mi = countries_reviewed.index
type(mi)

pandas.core.indexes.multi.MultiIndex

Multi-indices have several methods for dealing with their tiered structure that are absent from single-level indices. They also require one label per level to retrieve a value, which here means two. Dealing with multi-index output is a common "gotcha" for users new to pandas.
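
As a minimal sketch of what "one label per level" means in practice, the example below builds a small two-level result from a hypothetical toy DataFrame and looks values up with `.loc`. A tuple selects a single entry, while a label for the outer level alone returns everything beneath it.

```python
import pandas as pd

# Hypothetical toy data; provinces and titles are illustrative only.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "Italy", "Italy"],
    "province": ["Oregon", "Oregon", "California", "Tuscany", "Tuscany"],
    "title": ["O1", "O2", "C1", "T1", "T2"],
})

# Grouping by two columns yields a Series with a two-level MultiIndex.
counts = toy.groupby(["country", "province"]).size()

# A full lookup needs one label per level, passed as a tuple.
oregon_count = counts.loc[("US", "Oregon")]

# A label for the outer level only returns the provinces under it.
us_counts = counts.loc["US"]

print(counts.index.nlevels, oregon_count)
print(us_counts.to_dict())
```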

The use cases for a multi-index, along with instructions on using them, are detailed in the [MultiIndex / Advanced Selection](https://pandas.pydata.org/pandas-docs/stable/advanced.html) section of the pandas documentation.

In general, however, the multi-index method you will use most often is the one for converting back to a regular index, `reset_index()`:

In [13]:
countries_reviewed.reset_index()

Unnamed: 0,country,province,len
0,Argentina,Mendoza Province,3264
1,Argentina,Other,536
...,...,...,...
423,Uruguay,San Jose,3
424,Uruguay,Uruguay,24


# Sorting

Looking again at `countries_reviewed`, we can see that grouping returns data in index order, not in value order. That is to say, when outputting the result of a `groupby()`, the order of the rows depends on the values in the index, not on the values in the data.

To get the data in the order we want, we can sort it ourselves. The `sort_values()` method is handy for this.

In [14]:
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')

Unnamed: 0,country,province,len
179,Greece,Muscat of Kefallonian,1
192,Greece,Sterea Ellada,1
...,...,...,...
415,US,Washington,8639
392,US,California,36247


`sort_values()` defaults to an ascending sort, where the lowest values go first. However, most of the time we want a descending sort, where the higher numbers go first. That is done as follows:

In [15]:
countries_reviewed.sort_values(by='len', ascending=False)

Unnamed: 0,country,province,len
392,US,California,36247
415,US,Washington,8639
...,...,...,...
63,Chile,Coelemu,1
149,Greece,Beotia,1


To sort by index values, use the companion method `sort_index()`. It accepts similar arguments, such as `ascending`, and has the same default order:

In [16]:
countries_reviewed.sort_index()

Unnamed: 0,country,province,len
0,Argentina,Mendoza Province,3264
1,Argentina,Other,536
...,...,...,...
423,Uruguay,San Jose,3
424,Uruguay,Uruguay,24


Finally, note that you can sort by more than one column at a time:

In [17]:
countries_reviewed.sort_values(by=['country', 'len'])

Unnamed: 0,country,province,len
1,Argentina,Other,536
0,Argentina,Mendoza Province,3264
...,...,...,...
424,Uruguay,Uruguay,24
419,Uruguay,Canelones,43
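
When sorting by several columns, `ascending` can also be given as a list with one entry per column, so each column gets its own direction. The sketch below, on a hypothetical toy DataFrame with made-up counts, sorts countries alphabetically and, within each country, puts the provinces with the most reviews first.

```python
import pandas as pd

# Hypothetical toy data; the counts are made up for illustration.
counts = pd.DataFrame({
    "country": ["US", "Italy", "US", "Italy", "US"],
    "province": ["Oregon", "Tuscany", "California", "Veneto", "Washington"],
    "len": [12, 30, 100, 5, 40],
})

# country ascending, then len descending within each country.
ordered = counts.sort_values(by=["country", "len"], ascending=[True, False])

print(ordered)
```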
