# Introduction

Selecting specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation you'll run, so one of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to you quickly and effectively.

In [1]:
import pandas as pd
reviews = pd.read_csv("./input/winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)

# Native accessors

Native Python objects provide good ways of indexing data. Pandas carries all of these over, which makes it easy to get started.

Consider this DataFrame:

In [2]:
reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit


In Python, we can access a property of an object as an attribute. A `book` object, for example, might have a `title` property, which we can access with `book.title`. Columns in a pandas DataFrame work in much the same way.

Hence to access the `country` property of `reviews` we can use:

In [3]:
reviews.country

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

If we have a Python dictionary, we can access its values using the indexing (`[]`) operator. We can do the same with columns in a DataFrame:

In [4]:
reviews['country']

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

These are the two ways of selecting a specific Series out of a DataFrame. Neither of them is more or less syntactically valid than the other, but the indexing operator `[]` does have the advantage that it can handle column names that are not valid Python identifiers, such as names containing spaces (e.g. if we had a `country providence` column, `reviews.country providence` wouldn't work).

Doesn't a pandas Series look kind of like a fancy dictionary? It pretty much is, so it's no surprise that, to drill down to a single specific value, we need only use the indexing operator `[]` once more:

In [5]:
reviews['country'][0]

'Italy'
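
The two access styles can be compared side by side on a small example. This is a minimal sketch: the `wines` DataFrame and its values are made up for illustration and stand in for `reviews`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine reviews data.
wines = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "country providence": ["Etna", "Douro", "Oregon"],
})

# Attribute access and bracket access return the same Series.
by_attribute = wines.country
by_brackets = wines["country"]
print(by_attribute.equals(by_brackets))

# A column name containing a space is only reachable with brackets;
# writing `wines.country providence` would be a SyntaxError.
spaced = wines["country providence"]
print(spaced.tolist())

# Indexing the Series once more drills down to a single value.
first_country = wines["country"][0]
print(first_country)
```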

# Indexing in pandas

The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, `loc` and `iloc`. For more advanced operations, these are the ones you're supposed to be using.

### Index-based selection

Pandas indexing works in one of two paradigms. The first is **index-based selection**: selecting data based on its numerical position in the data. `iloc` follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:

In [6]:
reviews.iloc[0]

country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                                     ...                        
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object

On its own, the `:` operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the `country` column from just the first, second, and third rows, we would do:

In [7]:
reviews.iloc[:3, 0]

0       Italy
1    Portugal
2          US
Name: country, dtype: object

Or, to select just the second and third entries, we would do:

In [8]:
reviews.iloc[1:3, 0]

1    Portugal
2          US
Name: country, dtype: object

It's also possible to pass a list:

In [9]:
reviews.iloc[[0, 1, 2], 0]

0       Italy
1    Portugal
2          US
Name: country, dtype: object
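
To see that `iloc` really does ignore the index labels, the same three selections can be tried on a frame whose index is not numeric. This is a small sketch with made-up data; the `wines` name and the letter index are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data with a deliberately non-numeric index, to show that
# iloc looks only at position and never at the labels.
wines = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US", "France"],
     "points": [87, 87, 87, 90]},
    index=["a", "b", "c", "d"],
)

first_three = wines.iloc[:3, 0]      # rows at positions 0, 1, 2 of column 0
middle_two = wines.iloc[1:3, 0]      # rows at positions 1 and 2 of column 0
picked = wines.iloc[[0, 1, 2], 0]    # the same rows as first_three, as a list

print(first_three.tolist())
print(middle_two.tolist())
print(picked.equals(first_three))
```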

Finally, it's worth knowing that negative numbers can be used in selection. These count backwards from the _end_ of the values. So, for example, here are the last five rows of the dataset:

In [10]:
reviews.iloc[-5:]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
129966,Germany,Notes of honeysuckle and cantaloupe sweeten th...,Brauneberger Juffer-Sonnenuhr Spätlese,90,28.0,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
129967,US,Citation is given as much as a decade of bottl...,,90,75.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
129968,France,Well-drained gravel soil gives this wine its c...,Kritt,90,30.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit
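
Negative positions behave the same way as they do for Python lists. A short sketch on a made-up Series (the `numbers` name is purely illustrative):

```python
import pandas as pd

numbers = pd.Series([10, 20, 30, 40, 50, 60])

last_two = numbers.iloc[-2:]   # slice starting two positions before the end
last_one = numbers.iloc[-1]    # the final entry on its own

print(last_two.tolist())
print(last_one)
```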


### Label-based selection

The second paradigm for attribute selection is the one followed by the `loc` operator: **label-based selection**. In this paradigm, it's the data index value, not its position, which matters.

For example, to get the `country` of the first entry in `reviews`, we would now do the following:

In [11]:
reviews.loc[0, 'country']

'Italy'

`iloc` is conceptually simpler than `loc` because it ignores the dataset's indices. When we use `iloc` we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. `loc`, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using `loc` instead. For example, here's one operation that's much easier using `loc`:

In [12]:
reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

Unnamed: 0,taster_name,taster_twitter_handle,points
0,Kerin O’Keefe,@kerinokeefe,87
1,Roger Voss,@vossroger,87
...,...,...,...
129969,Roger Voss,@vossroger,90
129970,Roger Voss,@vossroger,90
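
For comparison, here is a sketch of the same kind of column selection done both ways on a tiny made-up frame (the `wines` name and its values are assumptions). With `iloc`, the position of every column has to be worked out by hand first.

```python
import pandas as pd

# Hypothetical two-row stand-in for the reviews data.
wines = pd.DataFrame({
    "country": ["Italy", "Portugal"],
    "points": [87, 87],
    "taster_name": ["Kerin O'Keefe", "Roger Voss"],
    "taster_twitter_handle": ["@kerinokeefe", "@vossroger"],
})

# With loc, the columns are requested by name.
by_label = wines.loc[:, ["taster_name", "taster_twitter_handle", "points"]]

# With iloc, the same selection needs each column's position instead.
by_position = wines.iloc[:, [2, 3, 1]]

print(by_label.equals(by_position))
```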


### Choosing between `loc` and `iloc`

When choosing or transitioning between `loc` and `iloc`, there is one "gotcha" worth keeping in mind, which is that the two methods use slightly different indexing schemes.

`iloc` uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So `0:10` will select entries `0,...,9`. `loc`, meanwhile, indexes inclusively. So `0:10` will select entries `0,...,10`.

Why the change? Remember that `loc` can index any stdlib type: strings, for example. If we have a DataFrame with index values `Apples, ..., Potatoes, ...`, and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index `df.loc['Apples':'Potatoes']` than it is to index something like `df.loc['Apples':'Potatoet']` (`t` coming after `s` in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. `0,...,1000`. In this case `df.iloc[0:1000]` will return 1000 entries, while `df.loc[0:1000]` returns 1001 of them! To get 1000 elements using `loc`, you will need to go one lower and ask for `df.loc[0:999]`.

Otherwise, the semantics of using `loc` are the same as those for `iloc`.
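
The difference in slice endpoints is easy to check directly. The sketch below uses two made-up frames: `df` with index labels `0,...,1000`, and a hypothetical `produce` frame with a sorted string index.

```python
import pandas as pd

# Index labels run from 0 to 1000 inclusive, so there are 1001 rows.
df = pd.DataFrame({"value": range(1001)})

n_iloc = len(df.iloc[0:1000])       # end position excluded
n_loc = len(df.loc[0:1000])         # end label included
n_loc_fixed = len(df.loc[0:999])    # one lower to match iloc
print(n_iloc, n_loc, n_loc_fixed)

# Label slices on strings include both endpoints as well.
produce = pd.DataFrame(
    {"price": [1.2, 0.5, 2.0, 0.8, 0.3]},
    index=["Apples", "Bananas", "Cherries", "Potatoes", "Turnips"],
)
selected = produce.loc["Apples":"Potatoes"]
print(list(selected.index))
```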

# Manipulating the index

Label-based selection derives its power from the labels in the index. Critically, the index we use is not immutable. We can manipulate the index in any way we see fit.

The `set_index()` method can be used to do the job. Here is what happens when we `set_index` to the `title` field:

In [13]:
reviews.set_index("title")

title,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,variety,winery
Nicosia 2013 Vulkà Bianco (Etna),Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,White Blend,Nicosia
Quinta dos Avidagos 2011 Avidagos Red (Douro),Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...
Domaine Marcel Deiss 2012 Pinot Gris (Alsace),France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Pinot Gris,Domaine Marcel Deiss
Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caroline Gewurztraminer (Alsace),France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Gewürztraminer,Domaine Schoffit


This is useful if you can come up with an index for the dataset which is better than the current one.
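
As a rough illustration of why a better index helps, here is a sketch on made-up data (the `wines` frame and its titles are hypothetical). Once `title` is the index, a row can be looked up by its title with `loc`. Note that `set_index()` returns a new DataFrame by default and leaves the original untouched.

```python
import pandas as pd

# Hypothetical miniature stand-in for the reviews data.
wines = pd.DataFrame({
    "title": ["Wine A 2013", "Wine B 2011"],
    "country": ["Italy", "Portugal"],
    "points": [87, 87],
})

by_title = wines.set_index("title")

# Label-based lookup now works with the wine's title.
country_of_b = by_title.loc["Wine B 2011", "country"]
print(country_of_b)

# The original frame still has `title` as an ordinary column.
print("title" in wines.columns, "title" in by_title.columns)
```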

# Conditional selection

So far we've been selecting various slices of data, using structural properties of the DataFrame itself. To do *interesting* things with the data, however, we often need to ask questions based on conditions.

For example, suppose that we're interested specifically in better-than-average wines produced in Italy.

We can start by checking if each wine is Italian or not:

In [14]:
reviews.country == 'Italy'

0          True
1         False
          ...  
129969    False
129970    False
Name: country, Length: 129971, dtype: bool

This operation produced a Series of `True`/`False` booleans based on the `country` of each record. This result can then be used inside of `loc` to select the relevant data:

In [15]:
reviews.loc[reviews.country == 'Italy']

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129961,Italy,"Intense aromas of wild cherry, baking spice, t...",,90,30.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,COS 2013 Frappato (Sicilia),Frappato,COS
129962,Italy,"Blackberry, cassis, grilled herb and toasted a...",Sàgana Tenuta San Giacomo,90,40.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,Cusumano 2012 Sàgana Tenuta San Giacomo Nero d...,Nero d'Avola,Cusumano


This DataFrame has ~20,000 rows. The original had ~130,000. That means that around 15% of wines originate from Italy.
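
A share like this can also be read straight off the boolean Series, since `True` counts as 1 when summed or averaged. A minimal sketch on a made-up `countries` Series, not the real dataset:

```python
import pandas as pd

# Hypothetical country column with 3 Italian wines out of 8.
countries = pd.Series(
    ["Italy", "US", "France", "Italy", "US", "US", "Spain", "Italy"]
)

is_italian = countries == "Italy"
italian_count = int(is_italian.sum())      # number of True values
italian_share = float(is_italian.mean())   # fraction of True values

print(italian_count, italian_share)
```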

We also wanted to know which ones are better than average. Wines are reviewed on an 80-to-100 point scale, so this could mean wines that accrued at least 90 points.

We can use the ampersand (`&`) to bring the two questions together:

In [16]:
reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
120,Italy,"Slightly backward, particularly given the vint...",Bricco Rocche Prapó,92,70.0,Piedmont,Barolo,,,,Ceretto 2003 Bricco Rocche Prapó (Barolo),Nebbiolo,Ceretto
130,Italy,"At the first it was quite muted and subdued, b...",Bricco Rocche Brunate,91,70.0,Piedmont,Barolo,,,,Ceretto 2003 Bricco Rocche Brunate (Barolo),Nebbiolo,Ceretto
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129961,Italy,"Intense aromas of wild cherry, baking spice, t...",,90,30.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,COS 2013 Frappato (Sicilia),Frappato,COS
129962,Italy,"Blackberry, cassis, grilled herb and toasted a...",Sàgana Tenuta San Giacomo,90,40.0,Sicily & Sardinia,Sicilia,,Kerin O’Keefe,@kerinokeefe,Cusumano 2012 Sàgana Tenuta San Giacomo Nero d...,Nero d'Avola,Cusumano


Suppose we'll buy any wine that's made in Italy _or_ which is rated above average. For this we use a pipe (`|`):

In [17]:
reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit
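
The two combinations can be checked on a frame small enough to read by eye. This is a sketch with made-up data; the `wines` frame is hypothetical, and the `~` (negation) operator in the last line goes beyond what the text above covers. When the conditions are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>=` in Python.

```python
import pandas as pd

# Hypothetical four-row stand-in for the reviews data.
wines = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "US"],
    "points": [87, 92, 90, 85],
})

is_italian = wines.country == "Italy"
is_high = wines.points >= 90

both = wines.loc[is_italian & is_high]         # Italian AND at least 90 points
either = wines.loc[is_italian | is_high]       # Italian OR at least 90 points
neither = wines.loc[~(is_italian | is_high)]   # everything else

print(list(both.index), list(either.index), list(neither.index))
```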


Pandas comes with a few built-in conditional selectors, two of which we will highlight here. 

The first is `isin`. `isin` lets you select data whose value "is in" a list of values. For example, here's how we can use it to select wines only from Italy or France:

In [18]:
reviews.loc[reviews.country.isin(['Italy', 'France'])]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit
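
One way to think about `isin` is as shorthand for a chain of `==` comparisons joined with `|`. The sketch below checks that equivalence on a made-up `wines` frame (the data is hypothetical):

```python
import pandas as pd

wines = pd.DataFrame({
    "country": ["Italy", "US", "France", "Spain", "Italy"],
    "points": [87, 88, 90, 86, 91],
})

with_isin = wines.loc[wines.country.isin(["Italy", "France"])]
with_or = wines.loc[(wines.country == "Italy") | (wines.country == "France")]

print(with_isin.equals(with_or))
print(list(with_isin.index))
```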


The second is `isnull` (and its companion `notnull`). These methods let you highlight values which are (or are not) empty (`NaN`). For example, to filter out wines lacking a price tag in the dataset, here's what we would do:

In [19]:
reviews.loc[reviews.price.notnull()]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit
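
The two methods are exact opposites, which a small made-up example makes visible. In this sketch the `wines` frame and its prices are hypothetical, and the missing price is written explicitly as `float("nan")`.

```python
import pandas as pd

wines = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "price": [float("nan"), 15.0, 14.0],
})

priced = wines.loc[wines.price.notnull()]    # rows that have a price
unpriced = wines.loc[wines.price.isnull()]   # rows with a missing price

print(priced.title.tolist())
print(unpriced.title.tolist())
```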


# Assigning data

Going the other way, assigning data to a DataFrame is easy. You can assign either a constant value:

In [20]:
reviews['critic'] = 'everyone'
reviews['critic']

0         everyone
1         everyone
            ...   
129969    everyone
129970    everyone
Name: critic, Length: 129971, dtype: object

Or with an iterable of values:

In [21]:
reviews['index_backwards'] = range(len(reviews), 0, -1)
reviews['index_backwards']

0         129971
1         129970
           ...  
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64
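
Both kinds of assignment can be tried on a small frame. The sketch below uses a made-up three-row `wines` frame. It also shows one thing to watch for: a list whose length does not match the number of rows is rejected with a `ValueError`.

```python
import pandas as pd

# Hypothetical three-row stand-in for the reviews data.
wines = pd.DataFrame({"country": ["Italy", "Portugal", "US"]})

# A constant is broadcast to every row.
wines["critic"] = "everyone"

# An iterable supplies one value per row.
wines["index_backwards"] = range(len(wines), 0, -1)
print(wines["index_backwards"].tolist())

# A list of the wrong length cannot be assigned.
length_mismatch_rejected = False
try:
    wines["too_short"] = [1, 2]
except ValueError:
    length_mismatch_rejected = True
print(length_mismatch_rejected)
```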