# CHAPTER 5 Getting Started With pandas

pandas will be a major tool of interest throughout much of the rest of the book. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy’s idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops. 
<font color=Indigo>
在这本书余下的大部分内容里，pandas 将成为一个贯穿始终的主要工具。它包含数据结构和数据操作，以及在Python中快速、方便地进行数据清理和分析的工具。pandas 经常和数值计算工具(类似 NumPy 和 SciPy)，分析库(类似 statsmokit 和 scikit-learn)，以及像 matplotlib 这样的数据可视化库一起使用 。pandas 采用了 NumPy 的惯用风格——基于数组(array-based)的计算方式，特别是基于数组的函数(array-based functions)，以及对不通过 for 循环对数据进行处理的偏爱。

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.
<font color=Indigo>
虽然 pandas 采用了许多 NumPy 的编码习惯，但最大的不同是，pandas 的设计目的是为了处理表格形或多样化、异性的数据。相比之下，NumPy 最适合于处理同类的规整的数值形数组数据。

Since becoming an open source project in 2010, pandas has matured into a quite large library that’s applicable in a broad set of real-world use cases. The developer community has grown to over 800 distinct contributors, who’ve been helping build the project as they’ve used it to solve their day-to-day data problems. Throughout the rest of the book, I use the following import convention for pandas:
<font color=Indigo>
自从2010年成为一个开源项目以来，pandas 已经成长为一个相当大的库，它适用于广泛的现实世界的用例。开发人员社区已经发展到超过800个不同的贡献者，他们一直在帮助构建这个项目，因为他们已经用它来解决他们的日常数据问题。在本书的剩余部分，我将使用以下的 pandas 导入惯例：

In [2]:
import pandas as pd
import numpy as np

Thus, whenever you see pd. in code, it’s referring to pandas. You may also find it easier to import Series and DataFrame into the local namespace since they are so frequently used:
<font color=Indigo>
因此，无论你在代码中何时看到 pd ，它指的是 pandas 。你可能还会发现，将 Series 和 DataFrame 导入本地命名空间会更好一些，因为它们经常被使用：

In [3]:
from pandas import Series, DataFrame

## 5.1 Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.
<font color=Indigo>
开始使用 pandas 之前，你需要适应它的两个重要的数据结构: Series 和 DataFrame 。虽然它们不是解决所有问题的通用解决方案，但它们为大多数应用程序提供了可靠的、易于使用的基础。

### Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:
<font color=Indigo>
Series 是一个一维数组类的对象，包含一个值序列(与NumPy类型相似的类型)和与之相关的数据标签数组(称为它的索引)。最简单的 Series 仅仅是由一个数据序列组成的:

In [4]:
obj = pd.Series([4, 7, -5, 3]) ; obj

0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively:
<font color=Indigo>
上述 Series 显示的行，相对应地表示了左边的索引和右边的值。由于我们没有为数据指定一个索引，因此创建了一个默认值，其中包含了 0 到 N-1(其中N是数据的长度)的整数。您可以通过它的 `values` 和 `index` 属性分别获得该 Series 的数组表示的值和索引对象:

In [5]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [6]:
obj.index  # like range(4)

RangeIndex(start=0, stop=4, step=1)

Often it will be desirable to create a Series with an index identifying each data point with a label:
<font color=Indigo>
通常，我们希望为每一行数据创建一个具有标识的索引:

In [7]:
obj2 = pd.Series(data=[4, 7, -5, 3], index=['d', 'b', 'a', 'c']) ; obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [8]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:
<font color=Indigo>
与 NumPy 数组相比，在选择单个值或一组值时，可以使用索引中的标签。

In [9]:
obj2['a']

-5

In [10]:
obj2['d'] = 6
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it contains strings instead of integers.
<font color=Indigo>
`['c', 'a', 'd']` 被解释为一个索引列表，尽管它包含的是字符串而不是整数。

Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:
<font color=Indigo>
使用 NumPy 函数或 NumPy 风格的操作，例如使用 boolean array 进行过滤、标量乘法或应用数学函数，将保留索引和值之间链接:

In [11]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [12]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [13]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
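
As a small sketch of the point above (the names `s`, `positive`, `doubled`, and `rooted` are illustrative and not part of the original text), the following checks that each operation returns a Series whose labels still match the values they started with:

```python
import numpy as np
import pandas as pd

# Same data as obj2 after the assignment obj2['d'] = 6
s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = s[s > 0]         # boolean filtering keeps the labels of the surviving rows
doubled = s * 2             # scalar multiplication keeps every label
rooted = np.sqrt(s.abs())   # a NumPy ufunc returns a Series, not a bare ndarray

print(list(positive.index))
print(doubled['a'])
print(type(rooted).__name__)
```

Because the labels travel with the values, `doubled['a']` can be looked up by label without tracking where `'a'` sits positionally.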

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict:
<font color=Indigo>
另一种理解 Series 的方法是固定长度的、有序的 dict，因为 dict 是索引值到数据值的映射。它可以应用在很多场景，你可以这样使用它:

In [14]:
'b' in obj2

True

In [15]:
'e' in obj2

False

Should you have data contained in a Python dict, you can create a Series from it by passing the dict:
<font color=Indigo>
如果数据被存放在一个Python字典中，也可以直接通过这个字典来创建Series：

In [16]:
sdata = {'Ohio': 335000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

Ohio      335000
Oregon     16000
Texas      71000
Utah        5000
dtype: int64

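When you pass only a dict, the index of the resulting Series is built from the dict’s keys. In the pandas version that produced the output above, the keys appear in sorted order; pandas 0.23 and later on Python 3.6+ keep the dict’s insertion order instead. You can override the order by passing the dict keys in the order you want them to appear in the resulting Series: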
<font color=Indigo>
如果只传入一个字典，Series的index将按照顺序沿用dict的keys。你可以通过index可选参数，根据需要改变其在结果Series中出现的顺序：

In [17]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California         NaN
Ohio          335000.0
Oregon         16000.0
Texas          71000.0
dtype: float64

Here, three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which is considered in pandas to mark missing or NA values. Since 'Utah' was not included in states, it is excluded from the resulting object.
<font color=Indigo>
在这个例子中，sdata中和states索引匹配的那3个值会被找出来并放到相应的位置上，但是由于“California”所对应的sdata值找不到，所以其结果就为NaN，即“非数字”（not a number），在pandas中，它用于表示缺失或NA值。由于“Utah”不包括在states内，因此它被排除在结果对象之外。

I will use the terms “missing” or “NA” interchangeably to refer to missing data. The isnull and notnull functions in pandas should be used to detect missing data:
<font color=Indigo>
我们将使用缺失（missing）或NA表示缺失的数据。pandas的isnull和notnull函数可以用于检测缺失数据：

In [18]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [19]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Series also has these as instance methods:
<font color=Indigo>
Series也有这些实例方法:

In [20]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
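
Recent pandas versions also expose the same checks under the names `isna` and `notna`. The sketch below assumes a pandas release that has these aliases, and the variable names are made up for illustration. It counts the missing entries and keeps only the rows that have data:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 335000.0, 16000.0, 71000.0],
              index=['California', 'Ohio', 'Oregon', 'Texas'])

missing = s.isna()               # same result as s.isnull()
n_missing = int(missing.sum())   # True counts as 1 when summed
present = s[s.notna()]           # keep only the rows that have a value

print(n_missing)
print(list(present.index))
```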

I discuss working with missing data in more detail in Chapter 7.
<font color=Indigo>
在第7章中，我将更详细地讨论如何处理丢失的数据。

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:
<font color=Indigo>
对于许多应用程序来说，一个有用的Series特性是，它会自动在算术操作中对索引标签进行对齐:

In [21]:
obj3

Ohio      335000
Oregon     16000
Texas      71000
Utah        5000
dtype: int64

In [22]:
obj4

California         NaN
Ohio          335000.0
Oregon         16000.0
Texas          71000.0
dtype: float64

In [23]:
obj3 + obj4

California         NaN
Ohio          670000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Data alignment features will be addressed in more detail later. If you have experience with databases, you can think about this as being similar to a join operation.
<font color=Indigo>
数据对齐功能稍后将更详细地讨论。如果你有数据库方面的经验，可以将其视为与连接操作类似。
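
To make the alignment rule concrete, here is a minimal sketch with two small Series (the names `a`, `b`, `total`, and `filled` are illustrative). It also shows one possible way to avoid the NaN for unmatched labels, using the `add` method with `fill_value`; whether that is appropriate depends on what a missing value means in your data:

```python
import pandas as pd

a = pd.Series({'Ohio': 335000, 'Oregon': 16000, 'Texas': 71000, 'Utah': 5000})
b = pd.Series({'Ohio': 335000.0, 'Oregon': 16000.0, 'Texas': 71000.0})

total = a + b                     # 'Utah' has no partner in b, so its sum is NaN
filled = a.add(b, fill_value=0)   # treat a missing partner as 0 instead

print(total)
print(filled)
```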

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:
<font color=Indigo>
所有的Series对象本身及其索引都有一个name属性，它与pandas功能的其他关键功能关系非常密切：

In [24]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California         NaN
Ohio          335000.0
Oregon         16000.0
Texas          71000.0
Name: population, dtype: float64

A Series’s index can be altered in-place by assignment:
<font color=Indigo>
一个Series的index可以通过赋值来改变：

In [25]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [26]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

### DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame’s internals are outside the scope of this book.
<font color=Indigo>
DataFrame是一个表格型的数据结构，它含有一组有序的列，每列可以是不同的值类型（数值、字符串、布尔值等）。DataFrame既有行索引也有列索引，它可以被看作有Series组成的字典（共用同一个索引）。跟其他类似的数据结构相比（如R的data.frame），DataFrame中面向行和面向列的操作基本上是平衡的。其实，DataFrame中的数据是以一个或多个二维块存放的（而不是列表、字典或别的一维数据结构）。有关DataFrame内部技术细节远远超出了倍数所讨论的范围。

__While a DataFrame is physically two-dimensional, you can use it to represent higher dimensional data in a tabular format using hierarchical indexing, a subject we will discuss in Chapter 8 and an ingredient in some of the more advanced data-handling features in pandas.__
<font color=Indigo>
虽然DataFrame物理上是二维的，但您可以使用分层索引来表示更高维度的数据，这是pandas中许多高级数据处理功能的关键要素，我们将在第8章讨论这个问题，。

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:
<font color=Indigo>
有很多方法可以构造一个DataFrame，不过最常见的方法之一是使用等长的列表或NumPy数组组成的字典:

In [27]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

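The resulting DataFrame will have its index assigned automatically as with Series. In the output below the columns are placed in sorted order; pandas 0.23 and later on Python 3.6+ keep the dict’s insertion order (state, year, pop) instead: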
<font color=Indigo>
生成的DataFrame会自动加上索引（和Series一样），且全部列会被有序排列：

In [28]:
frame

| | pop | state | year |
:-|:-|:-|:-
0 | 1.5 | Ohio | 2000
1 | 1.7 | Ohio | 2001
2 | 3.6 | Ohio | 2002
3 | 2.4 | Nevada | 2001
4 | 2.9 | Nevada | 2002
5 | 3.2 | Nevada | 2003


If you are using the Jupyter notebook, pandas DataFrame objects will be displayed as a more browser-friendly HTML table.
<font color=Indigo>
如果您使用的是Jupyter notebook，那么pandas的DataFrame对象将会显示为一个HTML表。

For large DataFrames, the head method selects only the first five rows:
<font color=Indigo>
对于大型的DataFrames，head方法只选择前5行:

In [29]:
frame.head()

| | pop | state | year |
:-|:-|:-|:-
0 | 1.5 | Ohio | 2000
1 | 1.7 | Ohio | 2001
2 | 3.6 | Ohio | 2002
3 | 2.4 | Nevada | 2001
4 | 2.9 | Nevada | 2002


If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:
<font color=Indigo>
如果您指定了一个columns序列，那么DataFrame的columns将按指定的顺序排列:

In [30]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

| | year | state | pop |
:-|:-|:-|:-
0 | 2000 | Ohio | 1.5
1 | 2001 | Ohio | 1.7
2 | 2002 | Ohio | 3.6
3 | 2001 | Nevada | 2.4
4 | 2002 | Nevada | 2.9
5 | 2003 | Nevada | 3.2


If you pass a column that isn’t contained in the dict, it will appear with missing values in the result:
<font color=Indigo>
如果您传递了一个没有包含在dict中的column，它将会在结果中引入缺失值:

In [31]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                            index=['one', 'two', 'three', 'four','five', 'six'])
frame2

| | year | state | pop | debt |
:-|:-|:-|:-|:-
one | 2000 | Ohio | 1.5 | NaN
two | 2001 | Ohio | 1.7 | NaN
three | 2002 | Ohio | 3.6 | NaN
four | 2001 | Nevada | 2.4 | NaN
five | 2002 | Nevada | 2.9 | NaN
six | 2003 | Nevada | 3.2 | NaN


A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:
<font color=Indigo>
通过类似字典标记的方式或属性的方式，可以将DataFrame的列获取为一个Series：

In [32]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [33]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

__Attribute-like access (e.g., frame2.year) and tab completion of column names in IPython is provided as a convenience.__
<font color=Indigo>
在IPython中，可以使用属性的方式访问(例如，frame2.year)并通过tab键自动补齐名称，这是Ipython提供的一种便利。

__frame2[column] works for any column name, but frame2.column only works when the column name is a valid Python variable name.__
<font color=Indigo>
frame2[column]可用于任何列名称，但是frame2.column只有在列名是有效的Python变量名时才起作用。

Note that the returned Series have the same index as the DataFrame, and their name attribute has been appropriately set.
<font color=Indigo>
注意，返回的Series具有与DataFrame相同的index，并且它们的name属性已经自动设置完成。

Rows can also be retrieved by name with the special loc attribute, or by integer position with iloc (much more on this later):
<font color=Indigo>
行也可以通过位置或名称的方式进行获取，比如用特殊的loc属性(后面会详细介绍):

In [34]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
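
The following sketch contrasts the two row accessors on a cut-down version of the frame (the frame `df` and the result names are illustrative). `loc` looks a row up by its index label, while `iloc` looks it up by integer position:

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2002],
                   'state': ['Ohio', 'Ohio', 'Ohio']},
                  index=['one', 'two', 'three'])

by_label = df.loc['three']   # the row whose index label is 'three'
by_position = df.iloc[2]     # the third row, counting from 0

print(by_label)
print(by_label.equals(by_position))
```

Either way the row comes back as a Series whose index is the frame's column labels and whose name is the row label.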

Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:
<font color=Indigo>
可以通过赋值来修改columns。例如，空的“debt”列可以被分配一个标量值或一组值:

In [35]:
frame2['debt'] = 16.5
frame2

| | year | state | pop | debt |
:-|:-|:-|:-|:-
one | 2000 | Ohio | 1.5 | 16.5
two | 2001 | Ohio | 1.7 | 16.5
three | 2002 | Ohio | 3.6 | 16.5
four | 2001 | Nevada | 2.4 | 16.5
five | 2002 | Nevada | 2.9 | 16.5
six | 2003 | Nevada | 3.2 | 16.5


In [36]:
frame2['debt'] = np.arange(6.)
frame2

| | year | state | pop | debt |
:-|:-|:-|:-|:-
one | 2000 | Ohio | 1.5 | 0.0
two | 2001 | Ohio | 1.7 | 1.0
three | 2002 | Ohio | 3.6 | 2.0
four | 2001 | Nevada | 2.4 | 3.0
five | 2002 | Nevada | 2.9 | 4.0
six | 2003 | Nevada | 3.2 | 5.0


When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any holes:
<font color=Indigo>
当您将列表或数组分配到一个列时，值的长度必须与DataFrame的长度匹配。如果赋值的时一个Series，它的的标签会精确匹配到DataFrame的索引，所有的空位都将被天上缺失值:

In [37]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

| | year | state | pop | debt |
:-|:-|:-|:-|:-
one | 2000 | Ohio | 1.5 | NaN
two | 2001 | Ohio | 1.7 | -1.2
three | 2002 | Ohio | 3.6 | NaN
four | 2001 | Nevada | 2.4 | -1.5
five | 2002 | Nevada | 2.9 | -1.7
six | 2003 | Nevada | 3.2 | NaN
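
The two assignment rules can be checked side by side. In this sketch (the frame `df`, the label `'nine'`, and the variable `length_error` are made up for illustration), a list of the wrong length is rejected, while a Series is aligned on labels regardless of its length:

```python
import pandas as pd

df = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# A list must match the number of rows exactly.
try:
    df['debt'] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is aligned on labels instead: rows without a match get NaN,
# and labels that are not in the frame ('nine') are ignored.
df['debt'] = pd.Series([-1.2, 9.9], index=['two', 'nine'])

print(length_error)
print(df)
```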


Assigning a column that doesn’t exist will create a new column. The del keyword will delete columns as with a dict.
<font color=Indigo>
为一个不存在的cloumns赋值将会创建一个新的columns。del关键字将删除columns。和字典一样。

As an example of del, I first add a new column of boolean values where the state column equals 'Ohio':
<font color=Indigo>
作为del的一个例子，我首先添加了一个布尔值的新列，在这个列中布尔值为：state列是否为“Ohio”:

In [38]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

| | year | state | pop | debt | eastern |
:-|:-|:-|:-|:-|:-
one | 2000 | Ohio | 1.5 | NaN | True
two | 2001 | Ohio | 1.7 | -1.2 | True
three | 2002 | Ohio | 3.6 | NaN | True
four | 2001 | Nevada | 2.4 | -1.5 | False
five | 2002 | Nevada | 2.9 | -1.7 | False
six | 2003 | Nevada | 3.2 | NaN | False


__New columns cannot be created with the frame2.eastern syntax.__
<font color=Indigo>
不能使用frame2.eastern的语法创建新的列。

The del statement can then be used to remove this column:
<font color=Indigo>
接下来可以使用del方法删除这一列:

In [39]:
del frame2['eastern']

In [40]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

__The column returned from indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied with the Series’s copy method.__
<font color=Indigo>
从索引DataFrame返回的列只是相应的数据的视图而已，并不是副本。因此，对该Series所做的任何修改都将在源DataFrame中反映出来。可以用Series的copy方法显式地复制列。

Another common form of data is a nested dict of dicts:
<font color=Indigo>
另一种常见的数据形式是嵌套的字典:

In [41]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices:
<font color=Indigo>
如果将嵌套的字典传递给DataFrame，那么pandas将把外键作为columns和内键作为row index:

In [42]:
frame3 = pd.DataFrame(pop)
frame3

| | Nevada | Ohio |
:-|:-|:-
2000 | NaN | 1.5
2001 | 2.4 | 1.7
2002 | 2.9 | 3.6


You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:
<font color=Indigo>
您可以将DataFrame进行转置（交换行和列），与NumPy数组相似的语法:

In [43]:
frame3.T

| | 2000 | 2001 | 2002 |
:-|:-|:-|:-
Nevada | NaN | 2.4 | 2.9
Ohio | 1.5 | 1.7 | 3.6


The keys in the inner dicts are combined and sorted to form the index in the result. This isn’t true if an explicit index is specified:
<font color=Indigo>
内层字典的键会被合并、排列以形成最终的索引。如果显示指定了索引，则不会这样：

In [44]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

| | Nevada | Ohio |
:-|:-|:-
2001 | 2.4 | 1.7
2002 | 2.9 | 3.6
2003 | NaN | NaN


Dicts of Series are treated in much the same way:
<font color=Indigo>
由Series组成的Dicts处理方式也大致相同:

In [45]:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
pd.DataFrame(pdata)

| | Nevada | Ohio |
:-|:-|:-
2000 | NaN | 1.5
2001 | 2.4 | 1.7


For a complete list of things you can pass the DataFrame constructor, see Table 5-1.
<font color=Indigo>
表5-1：可以输入给DataFrame构造器的数据的一个完整的列表

Table 5-1. Possible data inputs to DataFrame constructor

Type|Notes
:-|:-
2D ndarray | A matrix of data, passing optional row and column labels
dict of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length
NumPy structured/record array | Treated as the “dict of arrays” case
dict of Series | Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed
dict of dicts | Each inner dict becomes a column; keys are unioned to form the row index as in the “dict of Series” case
List of dicts or Series | Each item becomes a row in the DataFrame; union of dict keys or Series indexes become the DataFrame’s column labels
List of lists or tuples | Treated as the “2D ndarray” case
Another DataFrame | The DataFrame’s indexes are used unless different ones are passed
NumPy MaskedArray | Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result
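
A few of the input types in Table 5-1 are sketched below so the row and column rules can be compared directly. The variable names and sample values are illustrative only:

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['r1', 'r2'], columns=['a', 'b', 'c'])

# List of dicts: each dict becomes a row, the union of keys gives the columns
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Dict of Series: each Series becomes a column, and the Series indexes
# are unioned to form the row index
from_series = pd.DataFrame({'x': pd.Series([1.0, 2.0], index=[2000, 2001]),
                            'y': pd.Series([3.0, 4.0], index=[2001, 2002])})

print(from_array)
print(from_records)
print(from_series)
```

In the last two cases, any cell with no corresponding input value is filled with NaN.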

表 5-1. 可以输入给DataFrame构造器的数据

类型|说明
:-|:-
2维 ndarray | 数据矩阵，还可以传入行标和列标
由数组、列表或元祖组成的字典 | 每个序列会变成DataFrame的一列。所有序列的长度必须相同。
NumPy的结构化/记录数组 | 类似于“由数组组成的字典”
由Series组成的字典 | 每个Series会成为一列。如果没有显示指定索引，则各Series的索引会被合并成结果的行索引。
由字典组成的字典 | 各内层字典会成为一列。键会被合并成结果的行索引，跟“由Series组成的字典”的情况一样。
字典或Series的列表 | 各项将会成为DataFrame的一行。字典键或Sereis索引的并集将会成为DataFrame的列标。
由列表或元祖组成的列表 | 类似于“二维Ndarray”
另一个DataFrame | 该DataFrame的索引将会被沿用，除非显示指定了其他索引。
NumPy MaskedArray | 类似于“二维Ndarray”的情况，只是掩码值在结果DataFrame会变成NA/缺失值。

If a DataFrame’s index and columns have their name attributes set, these will also be displayed:
<font color=Indigo>
如果DataFrame的index和columns有它们的name属性集，那么它们也会显示出来:

In [46]:
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3

state | Nevada | Ohio
:-|:-|:-
year | |
2000 | NaN | 1.5
2001 | 2.4 | 1.7
2002 | 2.9 | 3.6


As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray:
<font color=Indigo>
与Series一样，values属性将DataFrame中包含的数据返回为一个二维的ndarray

In [47]:
frame3.values

array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

If the DataFrame’s columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns:
<font color=Indigo>
如果DataFrame的columns是不同的dtype，那么数组的dtype将会使用能够兼容所有列类型的数据类型:

In [48]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

### Index Objects

pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:
<font color=Indigo>
pandas的索引对象负责管理轴标签和其他元数据（比如轴名称等）。构建Series或DataFrame时，所用到的任何数组或其他序列的标签都会被转换成一个Index：

In [49]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [50]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user:
<font color=Indigo>
索引对象是不可变的(immutable)，因此不能被用户修改:

In [51]:
# index[1] = 'd'  # raises TypeError
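
A runnable version of the same check is sketched below. The names `raised` and `new_index` are illustrative. It catches the error and then shows that methods such as `insert` return a new Index rather than changing the existing one:

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'      # item assignment is not supported on an Index
    raised = False
except TypeError:
    raised = True

# Methods that appear to change an Index return a new object instead.
new_index = index.insert(1, 'd')

print(raised)
print(list(index))
print(list(new_index))
```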

Immutability makes it safer to share Index objects among data structures:
<font color=Indigo>
不可修改性非常重要，因为这样才能使Index对象在多个数据结构之间安全共享：

In [52]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [53]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [54]:
obj2.index is labels

True

__Some users will not often take advantage of the capabilities provided by indexes, but because some operations will yield results containing indexed data, it’s important to understand how they work.__
<font color=Indigo>
有些用户不会经常利用Index提供的功能，但是因为一些操作会产生包含Index数据的结果，所以理解它们是如何工作的很重要。

In addition to being array-like, an Index also behaves like a fixed-size set:
<font color=Indigo>
除了长的像array ，Index的功能也类似于要给固定大小的set:

In [55]:
frame3

state | Nevada | Ohio
:-|:-|:-
year | |
2000 | NaN | 1.5
2001 | 2.4 | 1.7
2002 | 2.9 | 3.6


In [56]:
frame3.columns

Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [57]:
'Ohio' in frame3.columns

True

In [58]:
2003 in frame3.index

False

Unlike Python sets, a pandas Index can contain duplicate labels:
<font color=Indigo>
与Python的set不同的是，pandas的Index可以包含重复标签:

In [59]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

Selections with duplicate labels will select all occurrences of that label.
<font color=Indigo>
选择带有重复标签的labels将选择该labels的所有出现的内容。
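
The effect of duplicate labels on selection can be seen in a short sketch (the Series `s` and its values are made up for illustration). Selecting a repeated label returns every matching row, while a unique label returns a single value:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['foo', 'foo', 'bar', 'baz'])

dupes = s['foo']     # label occurs twice: the result is a Series with both rows
single = s['baz']    # label occurs once: the result is a scalar

print(dupes)
print(single)
print(s.index.is_unique)
```

Because the return type depends on whether the label is repeated, code that expects a scalar may want to check `is_unique` first.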

Each Index has a number of methods and properties for set logic, which answer other common questions about the data it contains. Some useful ones are summarized in Table 5-2.
<font color=Indigo>
每个Index都有一些方法和属性，它们可以用于设置逻辑并回答有关该索引所包含的数据的常见问题。表5-2总结了一些有用的方法。

Table 5-2. Some Index methods and properties

Method|Description
:-|:-
append | Concatenate with additional Index objects, producing a new Index
difference | Compute set difference as an Index
intersection | Compute set intersection
union | Compute set union
isin | Compute boolean array indicating whether each value is contained in the passed collection
delete | Compute new Index with element at index i deleted
drop | Compute new Index by deleting passed values
insert | Compute new Index by inserting element at index i
is_monotonic | Returns True if each element is greater than or equal to the previous element (removed in pandas 2.0 in favor of is_monotonic_increasing)
is_unique | Returns True if the Index has no duplicate values
unique | Compute the array of unique values in the Index
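
Several of the methods in Table 5-2 are exercised below on two small, made-up Index objects (`left` and `right` are illustrative names). Each call returns a new object and leaves the original Index unchanged:

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

both = left.intersection(right)     # labels present in both
either = left.union(right)          # labels present in at least one
only_left = left.difference(right)  # labels in left but not in right
mask = left.isin(['a', 'c'])        # boolean array, one entry per label
trimmed = left.drop('b')            # new Index without 'b'
grown = left.insert(1, 'z')         # new Index with 'z' at position 1

print(list(both), list(either), list(only_left))
print(mask.tolist(), list(trimmed), list(grown))
```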

表 5-2. Index的方法和属相

方法|说明
:-|:-
append | 连接另一个Index对象，产生一个新的Index
difference | 计算差集，并得到一个Index
intersection | 计算交集
union | 计算并集
isin | 计算一个指标各值是否都包含在参数集合中的布尔型数组
delete | 删除索引i处的元素，并得到新的Index
drop | 删除传入的值，并的到新的Index
insert | 将元素插入到索引i处，并得到新的Index
is_monotonic | 当各元素均大于等于前一个元素时，返回True
is_unique |当Index没有重复值时，返回True
unique | 计算Index中唯一值的数组

## 5.2 Essential Functionality
<font color=Indigo>
基本功能

This section will walk you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame. In the chapters to come, we will delve more deeply into data analysis and manipulation topics using pandas. This book is not intended to serve as exhaustive documentation for the pandas library; instead, we’ll focus on the most important features, leaving the less common (i.e., more esoteric) things for you to explore on your own.
<font color=Indigo>
本节中，我将介绍操作Series和DataFrame中的数据基本手段。后续章节将更加深入地挖掘pandas在数据分析和处理方面的功能。本书不是pandas库的详尽文档，主要关注的是最重要的功能，那些不太常用的内容（也就是那些更深奥的内容）就交给你自己去摸索吧。

### Reindexing
<font color=Indigo>
重新索引

An important method on pandas objects is reindex, which means to create a new object with the data conformed to a new index. Consider an example:
<font color=Indigo>
pandas对象的一个重要方法是reindex,其作用是创建一个适应新索引的新对象。以之前的一个简单示例来说：

In [60]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:
<font color=Indigo>
调用该Series的reindex将会根据新索引进行重排。如果某个索引值当前不存在，就引入缺失值：

In [61]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as ffill, which forward-fills the values:
<font color=Indigo>
对于时间序列这样的有序数据，重新索引时可能需要做一些插值处理。method选项即可达到此目的，例如，使用ffill可以实现向前值填充：

In [62]:
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
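
The other fill-related arguments of reindex can be compared on the same data. This is a sketch with illustrative variable names; `'unknown'` is an arbitrary placeholder value chosen for the example:

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

forward = colors.reindex(range(6), method='ffill')           # carry the last value forward
backward = colors.reindex(range(6), method='bfill')          # pull the next value backward
capped = colors.reindex(range(8), method='ffill', limit=1)   # fill at most 1 step per gap
constant = colors.reindex(range(6), fill_value='unknown')    # use a fixed value for new labels

print(forward.tolist())
print(backward.tolist())
print(capped.tolist())
print(constant.tolist())
```

With `bfill`, label 5 has no later value to borrow from and stays missing; with `limit=1`, labels 6 and 7 stay missing because they are more than one step past the last known value.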

With DataFrame, reindex can alter either the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result:
<font color=Indigo>
使用DataFrame,reindex可以修改(row)index、columns或两者都修改。当仅传递一个序列时，它将对结果中的row进行重新索引:

In [63]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame

| | Ohio | Texas | California |
:-|:-|:-|:-
a | 0 | 1 | 2
c | 3 | 4 | 5
d | 6 | 7 | 8


In [64]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

| | Ohio | Texas | California |
:-|:-|:-|:-
a | 0.0 | 1.0 | 2.0
b | NaN | NaN | NaN
c | 3.0 | 4.0 | 5.0
d | 6.0 | 7.0 | 8.0


The columns can be reindexed with the columns keyword:
<font color=Indigo>
可以用columns关键字来对列进行索引:

In [65]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

| | Texas | Utah | California |
:-|:-|:-|:-
a | 1 | NaN | 2
c | 4 | NaN | 5
d | 7 | NaN | 8


See Table 5-3 for more about the arguments to reindex.
<font color=Indigo>
有关重新索引的参数，请参阅表5 - 3。

As we’ll explore in more detail, older pandas versions also let you reindex more succinctly by label-indexing with loc. Since pandas 1.0, passing labels that are missing from the index (such as 'b' or 'Utah' here) to loc raises a KeyError, so the output below reflects the older behavior; use reindex when some labels may be absent:
<font color=Indigo>
您可以通过loc的label-indexing更简洁地进行索引，许多用户只喜欢使用它:

In [66]:
frame.loc[['a', 'b', 'c', 'd'], states]

| | Texas | Utah | California |
:-|:-|:-|:-
a | 1.0 | NaN | 2.0
b | NaN | NaN | NaN
c | 4.0 | NaN | 5.0
d | 7.0 | NaN | 8.0


Table 5-3. reindex function arguments

Argument|Description
:-|:-
index | New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying.
method | Interpolation (fill) method; 'ffill' fills forward, while 'bfill' fills backward.
fill_value | Substitute value to use when introducing missing data by reindexing.
limit | When forward- or backfilling, maximum size gap (in number of elements) to fill.
tolerance | When forward- or backfilling, maximum size gap (in absolute numeric distance) to fill for inexact matches.
level | Match simple Index on level of MultiIndex; otherwise select subset of.
copy | If True, always copy underlying data even if new index is equivalent to old index; if False, do not copy the data when the indexes are equivalent.

表 5-3. reindex函数参数

Argument|Description
:-|:-
index | 用作索引的新序列。即可以时Index实例，也可以是其他序列型的Python数据结构。Index会被完全使用，就像没有任何复制一样。
method | 插值（填充）方式，具体参数请参见表5-4
fill_value | 在重新索引的过程中，需要引入缺失值的替代值
limit | 向前或向后填充时的最大填充量
tolerance | 当向前或向后填充时，最大尺寸间隙填充(绝对数字距离)不精确匹配。
level | 在Multlindex的制定级别上匹配简单索引，否则选取其子集
copy | 默认为True，无论如何都复制；如果为False，则新旧相等就不复制。

### Dropping Entries from an Axis
<font color=Indigo>
丢弃制定轴上的项

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:
<font color=Indigo>
丢弃某条轴上的一个或多个项很简单，只要有一个索引数组或列表即可。由于需要执行一些数据整理和集合逻辑，所以drop方法返回的是一个在指定轴上删除了指定值的新对象：

In [67]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [68]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:
<font color=Indigo>
使用DataFrame，可以从任意一个轴删除索引值。为了说明这一点，我们首先创建了一个DataFrame示例:

In [69]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

| | one | two | three | four |
:-|:-|:-|:-|:-
Ohio | 0 | 1 | 2 | 3
Colorado | 4 | 5 | 6 | 7
Utah | 8 | 9 | 10 | 11
New York | 12 | 13 | 14 | 15


Calling drop with a sequence of labels will drop values from the row labels (axis 0):
<font color=Indigo>
将一组行标签传入drop，将会从行标签中删除值(axis 0):

In [70]:
data.drop(['Colorado', 'Ohio'])

| | one | two | three | four |
:-|:-|:-|:-|:-
Utah | 8 | 9 | 10 | 11
New York | 12 | 13 | 14 | 15


You can drop values from the columns by passing axis=1 or axis='columns':
<font color=Indigo>
您可以通过传递axis=1或axis='columns'来从列中删除值。

In [71]:
data.drop('two', axis=1)

| | one | three | four |
:-|:-|:-|:-
Ohio | 0 | 2 | 3
Colorado | 4 | 6 | 7
Utah | 8 | 10 | 11
New York | 12 | 14 | 15


In [72]:
data.drop(['two', 'four'], axis='columns')

| | one | three |
:-|:-|:-
Ohio | 0 | 2
Colorado | 4 | 6
Utah | 8 | 10
New York | 12 | 14


Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object in-place without returning a new object:
<font color=Indigo>
许多函数，比如drop，它可以修改一个Series或DataFrame的大小或形状，可以在不返回新对象的情况下就地对原始对象进行操作。

In [73]:
obj.drop('c', inplace=True)
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

Be careful with inplace, as it destroys any data that is dropped.
<font color=Indigo>
小心使用inplace，因为它会破坏掉原始数据。
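
The difference between the two calling styles is sketched below with illustrative variable names. The default call leaves the original untouched and returns a new object, while `inplace=True` changes the original and returns `None`:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

trimmed = obj.drop('c')                 # returns a new Series; obj is unchanged
still_has_c = 'c' in obj

result = obj.drop('c', inplace=True)    # modifies obj itself and returns None

print(still_has_c)
print(result)
print(list(obj.index))
```

Assigning the result of an in-place call to a variable, as in `obj = obj.drop('c', inplace=True)`, would therefore replace the Series with `None`.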

### Indexing, Selection, and Filtering
<font color=Indigo>
索引、选择和过滤

Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:
<font color=Indigo>
Series的索引(obj[...])与NumPy数组索引类似，但您可以使用该Series的索引值，而不是只使用整数。以下是一些例子:

In [74]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [75]:
obj['b']

1.0

In [76]:
obj[1]

1.0

In [77]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [78]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [79]:
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [80]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:
<font color=Indigo>
利用标签的切片运算与普通的Python切片运算不同，其末端是包含的（inclusive），即封闭区间。

In [81]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64
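To make the endpoint rule concrete, here is a minimal sketch that slices the same Series once by label and once by position. Using loc and iloc explicitly is my own choice for clarity; the names `by_label` and `by_position` are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: both endpoints ('b' and 'c') are included.
by_label = obj.loc['b':'c']

# Positional slice: the stop position (2) is excluded, as in plain Python.
by_position = obj.iloc[1:2]

print(list(by_label.index))     # ['b', 'c']
print(list(by_position.index))  # ['b']
```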

Setting using these methods modifies the corresponding section of the Series:
<font color=Indigo>
使用这些方法设置和修改当前Series的相应部分:

In [82]:
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

Indexing into a DataFrame is for retrieving one or more columns either with a single value or sequence:
<font color=Indigo>
对一个DataFrame进行索引是为了检索一个或多个列，或者一个单独的值或序列:

In [83]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [84]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [85]:
data[['three', 'one']]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


Indexing like this has a few special cases. First, slicing or selecting data with a boolean array:
<font color=Indigo>
这种索引方式有几个特殊的情况。首先通过切片或布尔型数组选取行：

In [86]:
data[:2]

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [87]:
data[data['three'] > 5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


The row selection syntax data[:2] is provided as a convenience. Passing a single element or a list to the [] operator selects columns.
<font color=Indigo>
行选择语法data[:2]提供了方便。通过将单个元素或列表传递给操作符[]来选择列。

Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison:
<font color=Indigo>
另一个用例是用布尔DataFrame进行索引，例如一个标量比较:

In [88]:
data < 5

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [89]:
data[data < 5] = 0
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


This makes DataFrame syntactically more like a two-dimensional NumPy array in this particular case.
<font color=Indigo>
这使得DataFrame在这个特殊的情况下更像是一个二维的NumPy数组。
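The assignment data[data < 5] = 0 modifies data in place. If the original values need to survive, one possible alternative (a sketch, not something the text above prescribes) is the mask or where method, each of which returns a new DataFrame. The names `zeroed` and `kept` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# mask replaces values where the condition is True ...
zeroed = data.mask(data < 5, 0)

# ... while where keeps values where the condition is True and
# replaces the rest, so the condition is inverted here.
kept = data.where(data >= 5, 0)

print(zeroed)
print(data.loc['Ohio', 'two'])  # the original is unchanged: 1
```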

#### Selection with loc and iloc
<font color=Indigo>
Selection with loc and iloc

For DataFrame label-indexing on the rows, I introduce the special indexing operators loc and iloc. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either axis labels (loc) or integers (iloc).
<font color=Indigo>
对于在行上使用DataFrame标签索引，我将介绍特殊的索引操作符loc和iloc。它们使您可以使用任一axis标签(loc)或整数(iloc)来从DataFrame中选择行和列的一个子集。

As a preliminary example, let’s select a single row and multiple columns by label:
<font color=Indigo>
作为一个初步示例，让我们根据标签选择单个行和多个列。

In [90]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int32

We’ll then perform some similar selections with integers using iloc:
<font color=Indigo>
然后，我们将使用iloc执行一些类似的选择:

In [91]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

In [92]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [93]:
data.iloc[[1, 2], [3, 0, 1]]

          four  one  two
Colorado     7    0    5
Utah        11    8    9


Both indexing functions work with slices in addition to single labels or lists of labels:
<font color=Indigo>
这两个索引函数都可以在单独的标签或标签列表的情况下使用切片。

In [94]:
data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [95]:
data.iloc[:, :3][data.three > 5]

          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14


So there are many ways to select and rearrange the data contained in a pandas object. For DataFrame, Table 5-4 provides a short summary of many of them. As you’ll see later, there are a number of additional options for working with hierarchical indexes.
<font color=Indigo>
因此，有许多方法可以选择和重新排列pandas对象中包含的数据。对于DataFrame，表5-4提供了许多这方面的简短摘要。稍后您将看到，有许多其他的选项用于处理分层索引。

__When originally designing pandas, I felt that having to type frame[:, col] to select a column was too verbose (and error-prone), since column selection is one of the most common operations. I made the design trade-off to push all of the fancy indexing behavior (both labels and integers) into the ix operator. In practice, this led to many edge cases in data with integer axis labels, so the pandas team decided to create the loc and iloc operators to deal with strictly label-based and integer-based indexing, respectively. The ix indexing operator still exists, but it is deprecated. I do not recommend using it.__
<font color=Indigo>
最初设计pandas时，我觉得必须输入frame[:, col]选择一个列太啰嗦(而且容易出错)，因为选择列是最常见的操作之一。于是我设计了一个折衷方案，将所有复杂的索引行为(包括标签和整数)都放到ix操作符中。在实践中，这导致了带有整数轴标签数据的许多边缘情况，因此pandas团队决定创建一个loc和iloc操作符来处理严格的基于标签的和基于整数的索引。ix索引操作符仍然存在，但它已被弃用。我不推荐使用它。

Table 5-4. Indexing options with DataFrame

Type|Notes
:-|:-
df[val] | Selects a single column or sequence of columns from the DataFrame; special-case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)
df.loc[val] | Selects a single row or subset of rows from the DataFrame by label
df.loc[:, val] | Selects a single column or subset of columns by label
df.loc[val1, val2] | Selects both rows and columns by label
df.iloc[where] | Selects a single row or subset of rows from the DataFrame by integer position
df.iloc[:, where] | Selects a single column or subset of columns by integer position
df.iloc[where_i, where_j] | Selects both rows and columns by integer position
df.at[label_i, label_j] | Selects a single scalar value by row and column label
df.iat[i, j] | Selects a single scalar value by row and column position (integers)
reindex method | Selects either rows or columns by labels
get_value, set_value methods | Select or set a single value by row and column label (deprecated in later pandas releases in favor of at and iat)
--------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------

Table 5-4. DataFrame的索引选项

类型|说明
:-|:-
df[val] | 从DataFrame中选择单个列或序列;在一些特殊的情况下会比较便利:布尔数组(过滤行)、slice(片行)或布尔DataFrame(基于条件设置值)
df.loc[val] | 选取DataFrame的单个行或一组行
df.loc[:, val] | 选取单个列或列子集
df.loc[val1, val2] | 同时选取行和列
df.iloc[where] | 从DataFrame中选择单个行或子集，由整数位置选择
df.iloc[:, where] | 根据整数位置选择列或列的子集
df.iloc[where_i, where_j] | 根据整数位置同时选取行和列
df.at[label_i, label_j] | 通过行和列标签选择一个标量值
df.iat[i, j] | 通过行和列位置(整数)选择一个标量值
reindex method | 根据标签选择行或列
get_value, set_value methods | 通过行和列标签选择单个值
--------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------
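A few of the options in Table 5-4 are not exercised in the examples above. The sketch below tries at, iat, and reindex on the same kind of DataFrame and checks that label-based and position-based selection agree. The variable names are illustrative, and the frame is rebuilt from scratch so the snippet does not depend on earlier cells.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Scalar access by label and by integer position.
by_label = data.at['Utah', 'three']
by_position = data.iat[2, 2]

# Column selection by label and by position gives the same frame here.
same_columns = data.loc[:, ['one', 'two']].equals(data.iloc[:, [0, 1]])

# reindex selects (and reorders) rows and columns by label.
subset = data.reindex(index=['Utah', 'Ohio'], columns=['four', 'one'])

# at can also be used to set a single value.
data.at['Utah', 'three'] = 100

print(by_label, by_position, same_columns)
print(subset)
```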

### Integer Indexes
<font color=Indigo>
整数索引

Working with pandas objects indexed by integers is something that often trips up new users due to some differences with indexing semantics on built-in Python data structures like lists and tuples. For example, you might not expect the following code to generate an error:
<font color=Indigo>
处理整数索引的pandas对象，通常会导致新用户犯错误，因为在诸如列表和元组这样的内置Python数据结构中，在索引语义有一些不同。例如，您可能不会认为以下代码将产生错误:

In [96]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [97]:
# ser[-1] # Error

In this case, pandas could “fall back” on integer indexing, but it’s difficult to do this in general without introducing subtle bugs. Here we have an index containing 0, 1, 2, but inferring what the user wants (label-based indexing or position-based) is difficult.
<font color=Indigo>
在这种情况下，pandas可能会“求助于”整数索引，但是没有那种方法，能够即不引入任何bug又安全有效地解决该问题。这里，我们有一个包含0、1、2的索引，但是很难推断用户想要的是什么(基于标签的索引或基于位置的)：

On the other hand, with a non-integer index, there is no potential for ambiguity:
<font color=Indigo>
另一方面，使用非整数索引，不存在歧义的问题:

In [98]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]

2.0

To keep things consistent, if you have an axis index containing integers, data selection will always be label-oriented. For more precise handling, use loc (for labels) or iloc (for integers):
<font color=Indigo>
为了保持良好的一致性，如果你有一个包含整数的轴索引，那么根据整数进行数据选取的操作总是针对标签的。为更精确的处理，可以使用loc(用于标签)或iloc(对于整数)

In [99]:
ser[:1]

0    0.0
dtype: float64

In [100]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In [101]:
ser.iloc[:1]

0    0.0
dtype: float64
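The following sketch spells out the difference on an integer-indexed Series. It is only an illustration under the assumption of a default 0, 1, 2 index; the names `last_value`, `label_found`, `by_label`, and `by_position` are mine. Being explicit with iloc also avoids relying on how plain [] treats integers, which I understand recent pandas releases have been tightening.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))   # default integer index 0, 1, 2

# Position-based: -1 means "last element", as with a Python list.
last_value = ser.iloc[-1]

# Label-based: there is no label -1, so the lookup fails.
try:
    ser.loc[-1]
    label_found = True
except KeyError:
    label_found = False

# Slices differ too: loc includes the endpoint label, iloc excludes the stop.
by_label = ser.loc[:1]
by_position = ser.iloc[:1]

print(last_value, label_found, len(by_label), len(by_position))
```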

### Arithmetic and Data Alignment
<font color=Indigo>
算数运算和数据对其

An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. Let’s look at an example:
<font color=Indigo>
对某些应用来说，pandas的一个重要特性是，它可以对不同索引对象进行算术运算。在将对象相加时，如果存在不同的索引对，则结果的索引就是该索引对的并集。对于有数据库经验的用户来说，这类似于索引标签上的自动outer连接。让我们来看一个例子:

In [102]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [103]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

Adding these together yields:
<font color=Indigo>
将它们相加就会产生：

In [104]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces missing values in the label locations that don’t overlap. Missing values will then propagate in further arithmetic computations.
<font color=Indigo>
自动的数据对其操作在不重叠的索引处引入了NA值。缺失值会在算术运算过程中传播。
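One way to check the "outer join" description is to compare the result's index with the union of the two input indexes, and to confirm that only the overlapping labels carry values. This is a small sketch using the same s1 and s2; `union`, `overlap`, and `total` are illustrative names.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

total = s1 + s2

# The result index is the union of both indexes.
union = s1.index.union(s2.index)

# Only labels present in both inputs produce non-missing values.
overlap = s1.index.intersection(s2.index)

print(list(total.index))
print(list(overlap))
print(total.loc[overlap].notna().all(), total.drop(overlap).isna().all())
```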

In the case of DataFrame, alignment is performed on both the rows and the columns:
<font color=Indigo>
对于DataFrame，对齐操作会同时发生在行和列上：

In [105]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df1

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0


In [106]:
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df2

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame:
<font color=Indigo>
把他们相加后会返回一个新的DataFrame，其索引和列为原来那两个DataFrame的并集：

In [107]:
df1 + df2

            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


Since the 'c' and 'e' columns are not found in both DataFrame objects, they appear as all missing in the result. The same holds for the rows whose labels are not common to both objects.
<font color=Indigo>
由于在两个DataFrame对象中都没有找到“c”和“e”列，因此它们在结果中都出现了缺失。对于这两个对象的行标签也同样适用。

If you add DataFrame objects with no column or row labels in common, the result will contain all nulls:
<font color=Indigo>
如果在没有列或行标签的情况下对DataFrame对象相加，那么结果全为null:

In [108]:
df1 = pd.DataFrame({'A': [1, 2]})
df1

   A
0  1
1  2


In [109]:
df2 = pd.DataFrame({'B': [3, 4]})
df2

   B
0  3
1  4
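The listing above builds df1 and df2 but stops before combining them. As a hedged sketch of the step it seems to be leading up to (my assumption, since the operation itself is not shown here), adding the two frames yields a result in which every entry is missing, because no column label is shared:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})

# Columns 'A' and 'B' each exist in only one operand, so every cell
# of the aligned result is NaN.
combined = df1 + df2

print(combined)
print(combined.isna().all().all())
```

Note that the row labels (0 and 1) are shared in this small example; it is the disjoint column labels that leave nothing to add.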


#### Arithmetic methods with fill values
<font color=Indigo>
在算数方法中填充值

In arithmetic operations between differently indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other:
<font color=Indigo>
在对不同索引的对象进行算数运算时，你可能希望在一个对象中某个轴标签在另一个对象中找不到时填充一个特殊值（比如0）：

In [110]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df1

     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0


In [111]:
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2

      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0


In [112]:
df2.loc[1, 'b'] = np.nan

In [113]:
df2

      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0


Adding these together results in NA values in the locations that don’t overlap:
<font color=Indigo>
将这些加在一起，没有重叠的位置就会产生NA值：

In [114]:
df1 + df2

      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0   NaN  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN


Using the add method on df1, I pass df2 and an argument to fill_value:
<font color=Indigo>
使用df1的add方法，传入df2以及一个fill_value参数：

In [115]:
df1.add(df2, fill_value=0)

      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0


See Table 5-5 for a listing of Series and DataFrame methods for arithmetic. Each of them has a counterpart, starting with the letter r, that has arguments flipped. So these two statements are equivalent:
<font color=Indigo>
请参阅表5-5，以获得Series和DataFrame的算术方法。它们中的每一个都有一个对应的，以字母r开头的版本，它的参数和调用对象是翻转的。所以这两个表述是等价的：

In [116]:
1 / df1

          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909


In [117]:
df1.rdiv(1)

          a         b         c         d
0       inf  1.000000  0.500000  0.333333
1  0.250000  0.200000  0.166667  0.142857
2  0.125000  0.111111  0.100000  0.090909


Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:
<font color=Indigo>
相应地，当对一个Series或DataFrame重新索引时，也可以指定一个不同的填充值:

In [118]:
df1.reindex(columns=df2.columns, fill_value=0)

     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0


Table 5-5. Flexible arithmetic methods

Method|Description
:-|:-
add, radd | Methods for addition (+)
sub, rsub | Methods for subtraction (-)
div, rdiv | Methods for division (/)
floordiv, rfloordiv | Methods for floor division (//)
mul, rmul | Methods for multiplication (`*`)
pow, rpow | Methods for exponentiation (`**`)
--------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------

表 5-5. 灵活的算数方法

Method|Description
:-|:-
add, radd | 用于加法（+）的方法
sub, rsub | 用于减法（-）的方法
div, rdiv | 用于除法（/）的方法
floordiv, rfloordiv | 用于地板除（//）的方法
mul, rmul | 用于乘法（`*`）的方法
pow, rpow | 用于乘方（`**`）的方法
--------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------
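As a quick check of the "arguments flipped" rule from Table 5-5, the sketch below compares several reversed methods with the equivalent operator expressions. It is only an illustration; the particular scalars are arbitrary.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

# Each r-method puts the DataFrame on the right-hand side of the operator.
checks = {
    'radd': df1.radd(1).equals(1 + df1),
    'rsub': df1.rsub(10).equals(10 - df1),
    'rmul': df1.rmul(3).equals(3 * df1),
    'rpow': df1.rpow(2).equals(2 ** df1),
    # The non-reversed forms match the plain operators.
    'floordiv': df1.floordiv(3).equals(df1 // 3),
}

print(checks)
```

The method forms are mainly useful when you need the extra keyword arguments, such as fill_value or axis, that the operators cannot accept.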

#### Operations between DataFrame and Series
<font color=Indigo>
DataFrame和Series之间的运算

As with NumPy arrays of different dimensions, arithmetic between DataFrame and Series is also defined. First, as a motivating example, consider the difference between a two-dimensional array and one of its rows:
<font color=Indigo>
跟Numpy数组一样，DataFrame和Series之间算数运算也是有明确规定的。先来看一个具有启发性的例子，计算一个二维数组与其某行之间的差：

In [119]:
arr = np.arange(12.).reshape((3, 4))
arr

array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [120]:
arr[0]

array([ 0.,  1.,  2.,  3.])

In [121]:
arr - arr[0]

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

When we subtract arr[0] from arr, the subtraction is performed once for each row. This is referred to as broadcasting and is explained in more detail as it relates to general NumPy arrays in Appendix A. Operations between a DataFrame and a Series are similar:
<font color=Indigo>
当我们从arr中减去arr[0]时，每一行都执行一次减法。这称为广播，附录A中与NumPy数组相关的部分有更详细地解释。DataFrame和Series的操作类似:

In [122]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


In [123]:
series = frame.iloc[0]
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows:
<font color=Indigo>
默认情况下，DataFrame和Seires之间的算术运算会将Series的索引匹配到DataFrame的列上，然后沿着行一直向下广播:

In [124]:
frame - series

          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0


If an index value is not found in either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form the union:
<font color=Indigo>
如果某个索引值在DataFrame的列或Series的索引中找不到，则参与运算的两个对象就会被重新索引以形成并集:

In [125]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2

          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN


If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods. For example:
<font color=Indigo>
如果你希望匹配行并在列上广播，则必须使用算数运算方法。例如：

In [126]:
series3 = frame['d']
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [127]:
frame.sub(series3, axis='index')

          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0


The axis that you pass is the axis to match on. In this case we mean to match on the DataFrame’s row index (axis='index' or axis=0) and broadcast across the columns.
<font color=Indigo>
传入的轴号就是希望匹配的轴。在本例中，我们的目的时匹配DataFrame的行（axis='index或axis=0）索引并进行广播。
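A common use of matching on the row index is subtracting a per-row statistic. The sketch below, which is my own example and not from the text, removes each row's mean and also shows what happens if the axis argument is forgotten. The names `row_means`, `demeaned`, and `misaligned` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# One mean per row; the resulting Series is indexed by the row labels.
row_means = frame.mean(axis='columns')

# Match on the row index and broadcast across the columns.
demeaned = frame.sub(row_means, axis='index')

# Without axis='index', the row labels are matched against the column
# labels, nothing lines up, and the result is entirely missing.
misaligned = frame - row_means

print(demeaned)
print(misaligned.shape, misaligned.isna().all().all())
```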

### Function Application and Mapping
<font color=Indigo>
函数应用和映射

NumPy ufuncs (element-wise array methods) also work with pandas objects:
<font color=Indigo>
NumPy的ufuncs(元素数组方法)也适用于pandas对象:

In [128]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

               b         d         e
Utah   -0.394437  0.970761 -0.235309
Ohio    0.072318 -0.098665 -1.227519
Texas   0.495589 -0.326478  1.239539
Oregon  0.762072 -1.189422 -0.130408


In [129]:
np.abs(frame)

               b         d         e
Utah    0.394437  0.970761  0.235309
Ohio    0.072318  0.098665  1.227519
Texas   0.495589  0.326478  1.239539
Oregon  0.762072  1.189422  0.130408


Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this:
<font color=Indigo>
另一个常见的操作是，将函数应用到由各列或行所形成的一维数组上。DataFrame的apply方法即可实现此功能：

In [130]:
f = lambda x: x.max() - x.min()

In [131]:
frame.apply(f)

b    1.156510
d    2.160183
e    2.467058
dtype: float64

Here the function f, which computes the difference between the maximum and minimum of a Series, is invoked once on each column in frame. The result is a Series having the columns of frame as its index.
<font color=Indigo>
这里函数f计算了一个Series的最大值和最小值之间的差值，它在frame的每一列上被调用一次。结果是一个Series，它将frame的列名作为它的索引。

If you pass axis='columns' to apply, the function will be invoked once per row instead:
<font color=Indigo>
如果传递axis='columns'，那么该函数将在每个行上被调用一次:

In [132]:
frame.apply(f, axis='columns')

Utah      1.365198
Ohio      1.299837
Texas     1.566016
Oregon    1.951494
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
<font color=Indigo>
许多最为常见的数组统计功能都被实现成DataFrame的方法（如sum和mean），因此无需使用apply方法。
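For instance, the max-minus-min computation used with apply earlier can be written with built-in reductions instead. The sketch below compares the two on randomly generated data; the fixed seed and the names `via_apply` and `via_methods` are my own choices for reproducibility, not something from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Column-wise spread via apply ...
via_apply = frame.apply(lambda x: x.max() - x.min())

# ... and the same result from the built-in reductions.
via_methods = frame.max() - frame.min()

# The row-wise version just changes the axis.
row_spread = frame.max(axis='columns') - frame.min(axis='columns')

print(np.allclose(via_apply, via_methods))
print(row_spread)
```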

The function passed to apply need not return a scalar value; it can also return a Series with multiple values:
<font color=Indigo>
除标量值外，传递给apply的函数还可以返回由多个值组成的Series:

In [133]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

In [134]:
frame.apply(f)

            b         d         e
min -0.394437 -1.189422 -1.227519
max  0.762072  0.970761  1.239539


Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating-point value in frame. You can do this with applymap:
<font color=Indigo>
此外，元素级的Python函数也是可以使用的。假如你想得到frame中各个浮点值的格式化字符串，使用applymap即可：

In [135]:
format = lambda x: '%.2f' % x

In [136]:
frame.applymap(format)

            b      d      e
Utah    -0.39   0.97  -0.24
Ohio     0.07  -0.10  -1.23
Texas    0.50  -0.33   1.24
Oregon   0.76  -1.19  -0.13


The reason for the name applymap is that Series has a map method for applying an element-wise function:
<font color=Indigo>
之所以叫applymap，是因为Series有一个用于应用元素级函数的map方法：

In [137]:
frame['e'].map(format)

Utah      -0.24
Ohio      -1.23
Texas      1.24
Oregon    -0.13
Name: e, dtype: object
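Because applymap is, as far as I know, deprecated in recent pandas releases in favor of a DataFrame-level map method, a version-neutral way to get the same element-wise result is to apply Series.map to each column. The sketch below uses the values displayed above; the helper name `two_decimals` is hypothetical and chosen to avoid shadowing Python's built-in format.

```python
import pandas as pd

frame = pd.DataFrame(
    [[-0.394437, 0.970761, -0.235309],
     [0.072318, -0.098665, -1.227519],
     [0.495589, -0.326478, 1.239539],
     [0.762072, -1.189422, -0.130408]],
    columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def two_decimals(x):
    """Format a single float as a string with two decimal places."""
    return '%.2f' % x

# apply hands each column (a Series) to the lambda, and Series.map
# then calls two_decimals on every element of that column.
formatted = frame.apply(lambda col: col.map(two_decimals))

print(formatted)
```

The results are strings, so trailing zeros such as '-0.10' and '0.50' are preserved.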

### Sorting and Ranking
<font color=Indigo>
排序和排名

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:
<font color=Indigo>
根据条件对数据集排序（sorting）也是一种重要的内置运算。要对行或列索引进行排序（按字典顺序），可以是应用sort_index方法，它将返回一个已排序的新对象：

In [138]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int32

With a DataFrame, you can sort by index on either axis:
<font color=Indigo>
对于DataFrame，则可以根据任意一个轴上的索引进行排序：

In [140]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame

       d  a  b  c
three  0  1  2  3
one    4  5  6  7


In [141]:
frame.sort_index()

       d  a  b  c
one    4  5  6  7
three  0  1  2  3


In [142]:
frame.sort_index(axis=1)

       a  b  c  d
three  1  2  3  0
one    5  6  7  4


The data is sorted in ascending order by default, but can be sorted in descending order, too:
<font color=Indigo>
数据默认是按升序排列的，但也可以降序排序：

In [144]:
frame.sort_index(axis=1, ascending=False)

       d  c  b  a
three  0  3  2  1
one    4  7  6  5


To sort a Series by its values, use its sort_values method:
<font color=Indigo>
根据值对Series进行排序，可以使用sort_values方法:

In [146]:
obj = pd.Series([4, 7, -3, 2])
obj

0    4
1    7
2   -3
3    2
dtype: int64

In [147]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

Any missing values are sorted to the end of the Series by default:
<font color=Indigo>
在排序时，任何缺失的值默认都会被放到Series的末尾:

In [148]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj

0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64

In [149]:
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
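If missing values should come first instead, sort_values accepts an na_position argument. The following is a minimal sketch; `nan_first` and `descending` are illustrative names.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Put missing values at the start rather than the end.
nan_first = obj.sort_values(na_position='first')

# Descending order still keeps missing values last by default.
descending = obj.sort_values(ascending=False)

print(nan_first)
print(descending)
```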

When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of sort_values:
<font color=Indigo>
在DataFrame上，你可能希望根据一个或多个列中的值进行排序。将一个或多个列的名字传递给by选项即可达到该目的：

In [150]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

   a  b
0  0  4
1  1  7
2  0 -3
3  1  2


In [151]:
frame.sort_values(by='b')

   a  b
2  0 -3
3  1  2
0  0  4
1  1  7


To sort by multiple columns, pass a list of names:
<font color=Indigo>
要根据多个列进行排序，传入名称的列表即可：

In [153]:
frame.sort_values(by=['a', 'b'])

   a  b
2  0 -3
0  0  4
3  1  2
1  1  7
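When sorting by several columns, each key can have its own direction by passing a list to ascending. This is a small sketch of that option on the same data; `mixed` is an illustrative name.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each 'a' group.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(mixed)
print(list(mixed.index))  # [0, 2, 1, 3]
```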


Ranking assigns ranks from one through the number of valid data points in an array. The rank methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean rank:
<font color=Indigo>
排序从1开始，到一个数组中的有效数据点的数量分配排名。接下来介绍Series和DataFrame的rank方法；在默认情况下，rank是通过“为各组分配一个平均排名”的方式破坏平级关系的：

In [155]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [156]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Ranks can also be assigned according to the order in which they’re observed in the data:
<font color=Indigo>
也可以根据值在原数据中出现的顺序给出排名（类似于稳定排序）

In [157]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

Here, instead of using the average rank 6.5 for the entries 0 and 2, they have been set to 6 and 7 because label 0 precedes label 2 in the data.
<font color=Indigo>
这里，并不是使用标签0和2的平均排名6.5，而是将它们设置为6和7，因为标签0在标签2之前。

You can rank in descending order, too:
<font color=Indigo>
你也可以按降序进行排名：

In [158]:
# Assign tie values the maximum rank in the group
# 将绑定值指定为组中最大的排名
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

See Table 5-6 for a list of tie-breaking methods available.
<font color=Indigo>
表5-6列出了所有用于破坏平级关系的method选项。

DataFrame can compute ranks over the rows or the columns:
<font color=Indigo>
DataFrame也可以在行或列上计算排名

In [160]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})
frame

   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5


In [161]:
frame.rank(axis='columns')

     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0


Table 5-6. Tie-breaking methods with rank

Method|Description
:-|:-
'average' | Default: assign the average rank to each entry in the equal group
'min' | Use the minimum rank for the whole group
'max' | Use the maximum rank for the whole group
'first' | Assign ranks in the order the values appear in the data
'dense' | Like method='min', but ranks always increase by 1 in between groups rather than the number of equal elements in a group
--------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------

表 5-6. 排名时用于破坏平级关系的method选项

Method|Description
:-|:-
'average' | 默认：在相等分组中，为各个值分配平均排名
'min' | 使用整个分组的最小排名
'max' | 使用整个分组的最大排名
'first' | 按值在元数数据中出现的顺序分配排名
'dense' | 类似于 method='min', 但在组中，rank总是增加1，而不是组中相等的元素的数量
--------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------
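To see the options from Table 5-6 side by side, the sketch below ranks the same Series with every method and collects the results in one DataFrame. The names `methods` and `ranks` are illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

methods = ['average', 'min', 'max', 'first', 'dense']

# One column of ranks per tie-breaking method.
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})

print(ranks)
```

The tied values (the two 7s and the two 4s) are the only rows where the columns disagree, and 'dense' is the only method whose largest rank (5) is smaller than the number of data points.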

### Axis Indexes with Duplicate Labels
<font color=Indigo>
带有重复值的轴索引

Up until now all of the examples we’ve looked at have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:
<font color=Indigo>
到目前为止，我所介绍的所有范例都有着唯一的轴标签（索引值）。虽然许多pandas函数（如reindex）都要求标签唯一，但这并不是强制性的。我们来看看下面这个简单的带有重复索引值的Series：

In [162]:
obj = pd.Series(range(5),index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int32

The index’s is_unique property can tell you whether its labels are unique or not:
<font color=Indigo>
索引的is_unique属性可以告诉你它的值是否时唯一的：

In [164]:
obj.index.is_unique

False
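When is_unique is False, it can help to locate the repeated labels. The sketch below uses the index's duplicated method to flag them and, as one possible cleanup (an assumption about what you might want, not a recommendation from the text), keeps only the first entry for each label. The names `dup_mask` and `deduped` are illustrative.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# True for every label that has already appeared earlier in the index.
dup_mask = obj.index.duplicated(keep='first')

# Keep only the first occurrence of each label.
deduped = obj[~dup_mask]

print(dup_mask.tolist())
print(deduped.index.is_unique)
```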

Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value:
<font color=Indigo>
对于带有重复值的索引，数据选取的行为将会有些不同。如果某个索引对应多个值，则返回一个Series；而对应单个值的，则返回一个标量值。