# Data selection on `DataFrame` objects

In [1]:
import pandas as pd

area = pd.Series({'California':423967,'Texas':695662,
                  'New York':141297,'Floriade':170312,
                  'Illinois':149995})

pop = pd.Series({'California':38332521,'Texas':26448193,
                  'New York':19651127,'Floriade':19552860,
                  'Illinois':12882135})

data = pd.DataFrame({'area':area, 'pop':pop})
print(data)

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Floriade    170312  19552860
Illinois    149995  12882135


Because a `DataFrame` is essentially a mapping from column labels to `Series`, dictionary-style indexing by column name returns the corresponding column as a `Series`.

In [2]:
print(data['area'])

California    423967
Texas         695662
New York      141297
Floriade      170312
Illinois      149995
Name: area, dtype: int64
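The dictionary analogy goes a little further than indexing. The following is a minimal sketch, assuming a reduced three-state copy of the data above, showing that iteration and membership tests on a `DataFrame` also work on column labels:

```python
import pandas as pd

# Same figures as in the cells above, restricted to three states.
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297})
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127})
data = pd.DataFrame({'area': area, 'pop': pop})

# Iterating over a DataFrame yields its column labels, like iterating over a dict.
column_labels = list(data)

# Membership tests check column labels, not row labels.
has_area = 'area' in data
has_texas = 'Texas' in data

# Indexing with a column label returns that column as a Series.
area_column = data['area']

print(column_labels)
print(has_area, has_texas)
print(type(area_column).__name__)
```

Row labels such as `'Texas'` are not keys in this sense, which is why row selection goes through the indexers described later.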


Dictionary-style assignment also adds new columns:

In [3]:
data['a'] = [1,2,3,4,5]
data['b'] = 0
print(data)

              area       pop  a  b
California  423967  38332521  1  0
Texas       695662  26448193  2  0
New York    141297  19651127  3  0
Floriade    170312  19552860  4  0
Illinois    149995  12882135  5  0


Compute the population density of each state:

In [5]:
data = pd.DataFrame({'area':area, 'pop':pop})
data['density'] = data['pop'] / data['area']
print(data)

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Floriade    170312  19552860  114.806121
Illinois    149995  12882135   85.883763


### Accessing the data as an `ndarray`
The `values` attribute returns the data as a two-dimensional `ndarray`.

In [6]:
print(data.values)

[[4.23967000e+05 3.83325210e+07 9.04139261e+01]
 [6.95662000e+05 2.64481930e+07 3.80187404e+01]
 [1.41297000e+05 1.96511270e+07 1.39076746e+02]
 [1.70312000e+05 1.95528600e+07 1.14806121e+02]
 [1.49995000e+05 1.28821350e+07 8.58837628e+01]]
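The integer columns appear in scientific notation above because a single array has one dtype for all of its elements. A small sketch, assuming a two-row copy of the data, makes the conversion visible:

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662], 'pop': [38332521, 26448193]},
    index=['California', 'Texas'],
)
data['density'] = data['pop'] / data['area']

# Each column keeps its own dtype inside the DataFrame.
column_dtypes = data.dtypes

# The combined array needs one dtype for every element, so integers become floats.
array = data.values

# Selecting only the integer columns first keeps an integer array.
int_array = data[['area', 'pop']].values

print(column_dtypes)
print(array.dtype)
print(int_array.dtype)
```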


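Once the data is available as a two-dimensional `ndarray` through `values`, it can be indexed by row and by element like any other two-dimensional array. Note that `data.values[1:][:2]` applies two row slices in sequence (rows from position 1 onward, then the first two of those); it does not slice columns.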

In [7]:
print(data.values[2])
print(data.values[1][1])
print(data.values[1:][:2])

[1.41297000e+05 1.96511270e+07 1.39076746e+02]
26448193.0
[[6.95662000e+05 2.64481930e+07 3.80187404e+01]
 [1.41297000e+05 1.96511270e+07 1.39076746e+02]]
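The difference between chaining two slices and slicing both axes at once is easy to miss. The sketch below uses an arbitrary stand-in array (not the state data) to compare the two forms:

```python
import numpy as np

# A small stand-in array: 5 rows, 3 columns (values are arbitrary).
values = np.arange(15).reshape(5, 3)

# Two slices in a row both act on the first axis:
# rows 1 onward, then the first two of those rows.
chained = values[1:][:2]

# A single index with a comma slices rows and columns together:
# rows 1 onward, first two columns.
two_axis = values[1:, :2]

print(chained.shape)
print(two_axis.shape)
```

To restrict columns, the comma form `values[rows, columns]` is the one to use.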


### Row and column indexing with the three indexers

* **`iloc` indexer:** both rows and columns are sliced by implicit integer position, and every slice is half-open (start included, end excluded)

In [8]:
print(data.iloc[:2,1:2])

                 pop
California  38332521
Texas       26448193


* **`loc` indexer:** slices by explicit label values, and both endpoints are included

In [11]:
print(data)
print(data.loc[:'Floriade', 'area':'pop'])

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Floriade    170312  19552860  114.806121
Illinois    149995  12882135   85.883763
              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Floriade    170312  19552860
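The endpoint rules of `iloc` and `loc` can be compared side by side. A minimal sketch, assuming a three-row copy of the data:

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297], 'pop': [38332521, 26448193, 19651127]},
    index=['California', 'Texas', 'New York'],
)

# Positional slice: stops before position 2, so two rows come back.
by_position = data.iloc[:2]

# Label slice: includes the end label 'New York', so three rows come back.
by_label = data.loc[:'New York']

print(len(by_position), len(by_label))
```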


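* **`ix` indexer:** slices rows by implicit integer position and columns by explicit label name. `ix` is deprecated (see the warning printed after the output below) and was removed in pandas 1.0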

In [13]:
print(data)
# Column selection: a contiguous label slice
print(data.ix[:3, 'area':'pop'])
# Column selection: an explicit list of column names
print(data.ix[:3, ['area','density']])

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Floriade    170312  19552860  114.806121
Illinois    149995  12882135   85.883763
              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
              area     density
California  423967   90.413926
Texas       695662   38.018740
New York    141297  139.076746


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated
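Since the warning recommends `loc` and `iloc` instead, the mixed "positions for rows, labels for columns" selection can be rewritten with either one. This is one possible sketch rather than the only way to do it, assuming a four-row copy of the data; `first_three` and `column_positions` are illustrative names:

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312],
     'pop': [38332521, 26448193, 19651127, 19552860]},
    index=['California', 'Texas', 'New York', 'Floriade'],
)

# Option 1: translate the row positions into labels, then use loc for both axes.
first_three = data.loc[data.index[:3], ['area', 'pop']]

# Option 2: translate the column labels into positions, then use iloc for both axes.
column_positions = data.columns.get_indexer(['area', 'pop'])
first_three_alt = data.iloc[:3, column_positions]

print(first_three)
print(first_three_alt.equals(first_three))
```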


**Slicing rows only does not require an indexer**

In [16]:
print(data)
print(data['Texas':'Floriade'])
print(data[1:3])

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Floriade    170312  19552860  114.806121
Illinois    149995  12882135   85.883763
            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746
Floriade  170312  19552860  114.806121
            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746


### Note: a single value inside `[ ]` indexes a column, while a slice operates on rows
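A short sketch of this rule, assuming a three-row copy of the data. It also shows that a single row label is not found by plain brackets, so `loc` is needed for that case:

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297], 'pop': [38332521, 26448193, 19651127]},
    index=['California', 'Texas', 'New York'],
)

# A single label is looked up among the columns and returns a Series.
column = data['area']

# A slice is applied to the rows and returns a DataFrame.
rows = data['Texas':'New York']

# A single row label is therefore not found by plain brackets.
try:
    data['Texas']
    found = True
except KeyError:
    found = False

# Use loc to select one row by label.
texas = data.loc['Texas']

print(type(column).__name__, type(rows).__name__, found)
```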

## Conditional filtering on a `DataFrame`

In [17]:
print(data[data['density'] > 100])

            area       pop     density
New York  141297  19651127  139.076746
Floriade  170312  19552860  114.806121


In [19]:
print(data.loc[data['density'] > 100, ['area','density']])

            area     density
New York  141297  139.076746
Floriade  170312  114.806121
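The filter expression is an ordinary boolean `Series`, so several conditions can be combined. A minimal sketch, with the area threshold of 150000 chosen only for illustration:

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Floriade', 'Illinois'],
)
data['density'] = data['pop'] / data['area']

# A comparison on a column produces a boolean Series aligned with the row index.
mask = data['density'] > 100

# Conditions are combined element-wise with & and |; each one needs its own parentheses.
dense_and_large = data[(data['density'] > 100) & (data['area'] > 150000)]

print(mask)
print(dense_and_large)
```

Python's `and` and `or` do not work here because they try to reduce a whole `Series` to a single truth value.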


## Modifying values in a `DataFrame`
To change a single value in a `DataFrame`, locate that item with any of the indexers and assign to it.

In [20]:
data.loc['Floriade', 'area'] = 9999999
data.iloc[4,1] = 8888888
print(data)

               area       pop     density
California   423967  38332521   90.413926
Texas        695662  26448193   38.018740
New York     141297  19651127  139.076746
Floriade    9999999  19552860  114.806121
Illinois     149995   8888888   85.883763
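In the output above the `density` column still shows the values computed before the edit: derived columns are stored as plain data and are not updated automatically. A sketch of recomputing one, assuming a two-row copy of the data and an arbitrary new area value:

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662], 'pop': [38332521, 26448193]},
    index=['California', 'Texas'],
)
data['density'] = data['pop'] / data['area']
old_density = data.loc['Texas', 'density']

# Assigning through loc changes only the addressed cell.
data.loc['Texas', 'area'] = 1000000

# The derived column still holds the value computed from the old area.
stale = data.loc['Texas', 'density'] == old_density

# Recompute the derived column so that it matches the new data.
data['density'] = data['pop'] / data['area']

print(stale)
print(data)
```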
