# Introduction

In the last tutorial, we learned how to select relevant data out of a DataFrame or Series. Plucking the right data out of our data representation is critical to getting work done, as we demonstrated in the exercises.

However, the data does not always come out of memory in the format we want right off the bat. Sometimes we have to do some more work ourselves to reformat it for the task at hand. This tutorial covers different operations we can apply to our data to get the input "just right".

We'll use the Wine Magazine data for demonstration.

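In [1]:
import numpy as np
import pandas as pd

# Show at most 5 rows when pandas displays a DataFrame or Series
pd.set_option('display.max_rows', 5)
# The first CSV column holds the row labels, so use it as the index
reviews = pd.read_csv("./input/winemag-data-130k-v2.csv", index_col=0)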

In [2]:
reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,90,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit


# Summary functions

Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. For example, consider the `describe()` method:

In [3]:
reviews.points.describe()

count    129971.000000
mean         88.447138
             ...      
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64

This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data; for string data here's what we get:

In [4]:
reviews.taster_name.describe()

count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object
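
As a minimal sketch of this type-awareness, the snippet below calls `describe()` on a numeric column and a text column of a small made-up DataFrame. The `toy` frame and its values are hypothetical and are not taken from the wine data.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews.
toy = pd.DataFrame({
    "points": [87, 90, 90, 95],
    "taster": ["Ann", "Bob", "Ann", None],  # one missing taster
})

numeric_summary = toy["points"].describe()  # count, mean, std, min, quartiles, max
text_summary = toy["taster"].describe()     # count, unique, top, freq

print(numeric_summary.index.tolist())
print(text_summary.index.tolist())
```

Note that `count` only includes non-missing entries, which is why the `taster_name` count above (103727) is smaller than the number of rows in `reviews` (129971).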

If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that makes it happen. 

For example, to see the mean of the points allotted (i.e., how well an average wine is rated), we can use the `mean()` method:

In [5]:
reviews.points.mean()

np.float64(88.44713820775404)
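
Other single-number summaries follow the same calling pattern. The sketch below uses a hypothetical `points` Series (made-up scores, not the wine data) to show a few of them side by side.

```python
import pandas as pd

# Hypothetical scores, used only for illustration.
points = pd.Series([87, 90, 90, 95], name="points")

print(points.mean())    # arithmetic mean of the scores
print(points.median())  # middle value of the sorted scores
print(points.min(), points.max())  # lowest and highest score
```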

To see an array of the unique values, we can use the `unique()` method:

In [6]:
reviews.taster_name.unique()

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)

To see a list of unique values _and_ how often they occur in the dataset, we can use the `value_counts()` method:

In [7]:
reviews.taster_name.value_counts()

taster_name
Roger Voss           25514
Michael Schachner    15134
                     ...  
Fiona Adams             27
Christina Pickard        6
Name: count, Length: 19, dtype: int64
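
One difference between the two outputs above is worth spelling out: `unique()` lists the missing value (`nan`) among its 20 entries, while `value_counts()` reports only 19, because it drops missing values by default. A small sketch on a hypothetical `tasters` Series (made-up names) illustrates this.

```python
import pandas as pd

# Hypothetical taster names with one missing entry.
tasters = pd.Series(["Ann", "Bob", "Ann", None, "Ann"], name="taster_name")

distinct = tasters.unique()                          # includes the missing value
counts = tasters.value_counts()                      # missing values dropped by default
counts_with_na = tasters.value_counts(dropna=False)  # keep the missing value as its own row

print(len(distinct), len(counts), len(counts_with_na))
```

Passing `dropna=False` is one way to check how many rows have no value at all.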

# Maps

A **map** is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often. 

[`map()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) is the first, and slightly simpler, one. For example, suppose that we wanted to re-center the scores the wines received so that their mean is 0. We can do this as follows:

In [8]:
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

The function you pass to `map()` should expect a single value from the Series (a point value, in the above example), and return a transformed version of that value. `map()` returns a new Series where all the values have been transformed by your function.
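
To make the "one value in, one value out" contract concrete, here is a minimal sketch on a hypothetical three-element Series. The `center` helper and the sample scores are made up for illustration.

```python
import pandas as pd

# Hypothetical scores; their mean is exactly 90.
points = pd.Series([87, 90, 93], name="points")
points_mean = points.mean()

def center(p):
    # Receives a single score and returns the transformed score.
    return p - points_mean

centered = points.map(center)

print(centered.tolist())
print(points.tolist())  # the Series that map() was called on is left as it was
```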

[`apply()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.

In [9]:
def remean_points(row):
    # Store the centered score on the row; apply() collects the returned rows
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,-1.447138,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,-1.447138,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,1.552862,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
129970,France,"Big, rich and off-dry, this is powered by inte...",Lieu-dit Harth Cuvée Caroline,1.552862,21.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit


If we had called `reviews.apply()` with `axis='index'`, then instead of passing a function to transform each row, we would need to give a function to transform each *column*.
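
The difference between the two `axis` settings is easiest to see on a tiny example. The sketch below uses a hypothetical two-column frame; the column values and the two lambdas are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns.
toy = pd.DataFrame({"points": [87, 90, 93], "price": [15.0, 20.0, 40.0]})

# axis='columns': the function receives one row at a time,
# as a Series indexed by the column names.
points_per_dollar = toy.apply(lambda row: row["points"] / row["price"], axis="columns")

# axis='index': the function receives one column at a time,
# as a Series indexed by the row labels.
column_ranges = toy.apply(lambda col: col.max() - col.min(), axis="index")

print(points_per_dollar)  # one result per row
print(column_ranges)      # one result per column
```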

Note that `map()` and `apply()` return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of `reviews`, we can see that it still has its original `points` value.

In [10]:
reviews.head(1)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
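
If we do want to keep the transformed values, we have to store the result ourselves. The sketch below shows one possible way, using `DataFrame.assign()` on a hypothetical toy frame; `assign()` is not used elsewhere in this tutorial and is only one of several options.

```python
import pandas as pd

# Hypothetical toy frame standing in for `reviews`.
toy = pd.DataFrame({"points": [87, 90, 93]})
points_mean = toy["points"].mean()

# map() returns a new Series and leaves `toy` untouched.
centered = toy["points"].map(lambda p: p - points_mean)

# assign() returns a new DataFrame with the column replaced; `toy` is still unchanged.
toy_centered = toy.assign(points=centered)

print(toy["points"].tolist())
print(toy_centered["points"].tolist())
```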


Pandas provides many common mapping operations as built-ins. For example, here's a faster way of re-centering our `points` column:

In [11]:
review_points_mean = reviews.points.mean()
reviews.points - review_points_mean

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

In this code we are performing an operation between a lot of values on the left-hand side (everything in the Series) and a single value on the right-hand side (the mean value). Pandas looks at this expression and figures out that we must mean to subtract that mean value from every value in the dataset.
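
As a quick sanity check, the sketch below confirms on a hypothetical three-element Series (made-up scores) that the operator form and the `map()` form give the same result.

```python
import pandas as pd

# Hypothetical scores; their mean is exactly 90.
points = pd.Series([87, 90, 93], name="points")
points_mean = points.mean()

via_operator = points - points_mean                # scalar applied to every element
via_map = points.map(lambda p: p - points_mean)    # same computation, one call per value

print(via_operator.equals(via_map))
```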

Pandas will also understand what to do if we perform these operations between Series of equal length. For example, an easy way of combining country and region information in the dataset would be to do the following:

In [12]:
reviews.country + " - " + reviews.region_1

0            Italy - Etna
1                     NaN
               ...       
129969    France - Alsace
129970    France - Alsace
Length: 129971, dtype: object
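
Row 1 in the output above is `NaN` because the Portuguese wine has no `region_1` value, and a missing value on either side makes the combined result missing. A minimal sketch with two hypothetical Series (values chosen to mirror the first two rows) shows the same behavior.

```python
import pandas as pd

# Hypothetical data mirroring the first two rows of the dataset.
country = pd.Series(["Italy", "Portugal"])
region = pd.Series(["Etna", None])  # the second region is missing

combined = country + " - " + region  # paired up element by element

print(combined[0])
print(pd.isna(combined[1]))
```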

These operators are faster than `map()` or `apply()` because they use speed-ups built into pandas. All of the standard Python operators (`>`, `<`, `==`, and so on) work in this manner.
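
For comparison operators, the result is a Series of booleans, one per element. A minimal sketch on a hypothetical Series of scores:

```python
import pandas as pd

# Hypothetical scores, used only for illustration.
points = pd.Series([87, 90, 93], name="points")

is_above_90 = points > 90    # True where the score exceeds 90
is_exactly_90 = points == 90 # True where the score equals 90

print(is_above_90.tolist())
print(is_exactly_90.tolist())
```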

However, they are not as flexible as `map()` or `apply()`, which can do more advanced things, such as applying conditional logic, that cannot be done with addition and subtraction alone.
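
As an illustration of such conditional logic, the sketch below labels each score with `map()`. The `rating_label` function, its thresholds, and the sample scores are hypothetical and are not part of the wine dataset.

```python
import pandas as pd

# Hypothetical scores, used only for illustration.
points = pd.Series([87, 90, 96], name="points")

def rating_label(score):
    # Made-up thresholds: branch on the score to pick a label.
    if score >= 95:
        return "outstanding"
    elif score >= 90:
        return "very good"
    return "good"

labels = points.map(rating_label)

print(labels.tolist())
```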