# DataFrame  
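A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share one index.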
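
To make the "dict of Series sharing one index" picture concrete, here is a minimal sketch. The Series names and the `area` values are illustrative assumptions, not data from these notes.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative values).
pop = pd.Series({'Ohio': 1.5, 'Nevada': 2.4})
area = pd.Series({'Ohio': 44.8, 'Texas': 268.6})

# Each dict key becomes a column; the row index is the union of the
# Series indexes, and cells with no data are filled with NaN.
table = pd.DataFrame({'pop': pop, 'area': area})

print(table)
print(table.dtypes)
```

Every column of `table` is itself a Series, and all of them are aligned on the same row index.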

## Constructing a DataFrame

### From a dict or NumPy array

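As with a Series, the resulting DataFrame has its index assigned automatically. In the pandas version used here the columns come out sorted by name; recent pandas versions keep the dict's insertion order instead: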

In [1]:
import numpy as np
import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)
frame

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002
3,2.4,Nevada,2001
4,2.9,Nevada,2002


If you specify a column order, the DataFrame's columns are arranged in exactly the order you pass:

In [2]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


As with a Series, if you pass a column that is not contained in data, it appears in the result as NA values:

In [3]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four','five'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [4]:
dates = pd.date_range('20171118', periods=6)
dates

DatetimeIndex(['2017-11-18', '2017-11-19', '2017-11-20', '2017-11-21',
               '2017-11-22', '2017-11-23'],
              dtype='datetime64[ns]', freq='D')

In [5]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2017-11-18,0.792894,-0.56887,1.426474,1.527781
2017-11-19,-1.14978,-1.599802,-0.060131,-0.431938
2017-11-20,-0.394942,2.328338,0.370515,-1.266174
2017-11-21,-0.360854,-0.572085,-2.1033,-0.71144
2017-11-22,0.946979,-0.940775,1.765257,0.061518
2017-11-23,-0.314954,0.062449,-0.767526,-1.340866


### From a nested dict of dicts

When a nested dict is passed to DataFrame, its outer keys are interpreted as the column index and its inner keys as the row index:

In [6]:
pop = {'Nevada':{2001:2.4, 2002:2.9}, 'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


The result can of course be transposed:

In [7]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6
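
As a small sketch of how the inner keys drive the row index, the same nested dict can be combined with an explicit `index`. The year 2003 is an assumption added here to show what happens for a label with no data.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys -> columns, inner keys -> rows; missing combinations become NaN.
frame3 = pd.DataFrame(pop)

# An explicit index selects and orders the rows; 2003 has no data,
# so that row is entirely NaN.
subset = pd.DataFrame(pop, index=[2001, 2002, 2003])

print(frame3)
print(subset)
```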


### From Series objects

In [8]:
pd.DataFrame([df.median(), df.mean(), df.std()], index=['median', 'mean', 'std'])

Unnamed: 0,A,B,C,D
median,-0.337904,-0.570478,0.155192,-0.571689
mean,-0.080109,-0.215124,0.105215,-0.360186
std,0.799291,1.359611,1.430972,1.063328
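
Because the table above comes from random data, here is a deterministic sketch of the same idea: each Series in the list becomes one row. The small input frame is an assumption chosen so the statistics are easy to verify by hand; `DataFrame.agg` is shown only as one possible alternative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 6.0], 'B': [10.0, 20.0, 30.0]})

# Each reduction returns a Series indexed by column name;
# a list of such Series becomes one row per Series.
summary = pd.DataFrame([df.median(), df.mean(), df.std()],
                       index=['median', 'mean', 'std'])

# The same table built in one call.
summary_agg = df.agg(['median', 'mean', 'std'])

print(summary)
```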


## Converting a DataFrame to other formats

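### Converting a DataFrame to a dict

Building a DataFrame from a dict already suggests how the reverse works: a DataFrame converts to a nested dict.
Selecting a single column and converting it is the same as converting a Series to a dict, while dict() applied to several columns maps each column name to its Series.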

In [9]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002],  'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data, index = [2,3,4,5,6])
frame

Unnamed: 0,pop,state,year
2,1.5,Ohio,2000
3,1.7,Ohio,2001
4,3.6,Ohio,2002
5,2.4,Nevada,2001
6,2.9,Nevada,2002


In [10]:
dict(frame['year'])

{2: 2000, 3: 2001, 4: 2002, 5: 2001, 6: 2002}

In [11]:
k = dict(frame[['pop', 'year']])
print(type(k))
print(k)
print(type(k['pop']))
print(k['pop'])

<class 'dict'>
{'pop': 2    1.5
3    1.7
4    3.6
5    2.4
6    2.9
Name: pop, dtype: float64, 'year': 2    2000
3    2001
4    2002
5    2001
6    2002
Name: year, dtype: int64}
<class 'pandas.core.series.Series'>
2    1.5
3    1.7
4    3.6
5    2.4
6    2.9
Name: pop, dtype: float64
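
Note that `dict()` above keeps each column as a Series. If a plain nested dict is wanted, `DataFrame.to_dict()` may be the more direct route; the sketch below uses a shortened two-row frame as an assumption.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]}, index=[2, 3])

nested = frame.to_dict()                   # {column: {index: value}}
records = frame.to_dict(orient='records')  # one dict per row
lists = frame.to_dict(orient='list')       # {column: [values]}

print(nested)
print(records)
print(lists)
```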


### Converting a DataFrame to an ndarray or a nested list

First use np.array() to convert the DataFrame to an np.ndarray, then call tolist() to turn the ndarray into a nested list:

In [12]:
# Per-column alternative: ltu_list = [col.tolist() for _, col in frame.items()]
tmp = np.array(frame)
print(type(tmp))
tmp.tolist()

<class 'numpy.ndarray'>


[[1.5, 'Ohio', 2000],
 [1.7, 'Ohio', 2001],
 [3.6, 'Ohio', 2002],
 [2.4, 'Nevada', 2001],
 [2.9, 'Nevada', 2002]]
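
A short sketch of the two orientations, assuming a two-row version of the frame: `values.tolist()` gives one inner list per row, while iterating over the columns gives one inner list per column.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7],
                      'state': ['Ohio', 'Ohio'],
                      'year': [2000, 2001]})

rows = frame.values.tolist()               # one inner list per row
cols = [frame[c].tolist() for c in frame]  # one inner list per column

print(rows)
print(cols)
```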

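### Converting DataFrame data types

DataFrame.dtypes shows the data type of each column. By default pandas recognizes int and float64 and treats everything else as object; the columns that usually need converting are dates and times.  
DataFrame.astype() converts the whole DataFrame or a single column to another type and accepts both Python and NumPy data types.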

In [13]:
frame.dtypes

pop      float64
state     object
year       int64
dtype: object

In [14]:
frame['year'] = frame['year'].astype('int')
frame

Unnamed: 0,pop,state,year
2,1.5,Ohio,2000
3,1.7,Ohio,2001
4,3.6,Ohio,2002
5,2.4,Nevada,2001
6,2.9,Nevada,2002
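
Since dates are named as the usual conversion target, here is a minimal sketch that converts several columns at once. The `raw` frame and its `date` column are assumptions for illustration; `pd.to_datetime` is used for the date column rather than `astype`.

```python
import pandas as pd

# Everything arrives as text, as it often does after reading a file.
raw = pd.DataFrame({'year': ['2000', '2001'],
                    'pop': ['1.5', '1.7'],
                    'date': ['2017-11-18', '2017-11-19']})

# astype accepts a {column: dtype} mapping to convert several columns.
typed = raw.astype({'year': 'int64', 'pop': 'float64'})

# Date strings are parsed into datetime64 values.
typed['date'] = pd.to_datetime(typed['date'])

print(typed.dtypes)
```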


## Viewing data

### See the top & bottom rows of the frame

head() shows the first rows and tail() shows the last rows; both default to 5 rows and accept a row count.

In [15]:
frame.head()

Unnamed: 0,pop,state,year
2,1.5,Ohio,2000
3,1.7,Ohio,2001
4,3.6,Ohio,2002
5,2.4,Nevada,2001
6,2.9,Nevada,2002


In [16]:
frame.tail(3)

Unnamed: 0,pop,state,year
4,3.6,Ohio,2002
5,2.4,Nevada,2001
6,2.9,Nevada,2002


In [17]:
frame.head(3)

Unnamed: 0,pop,state,year
2,1.5,Ohio,2000
3,1.7,Ohio,2001
4,3.6,Ohio,2002


### Slicing, loc, iloc, at, iat, ix

Row slicing

In [18]:
frame[0:4]

Unnamed: 0,pop,state,year
2,1.5,Ohio,2000
3,1.7,Ohio,2001
4,3.6,Ohio,2002
5,2.4,Nevada,2001


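Column selection: to select several columns, pass a list of column names; a positional range such as 0:3 cannot be used as it is for rows.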

In [19]:
frame[['pop', 'state']]

Unnamed: 0,pop,state
2,1.5,Ohio
3,1.7,Ohio
4,3.6,Ohio
5,2.4,Nevada
6,2.9,Nevada


Region selection

In [20]:
frame[:3][['state', 'year']]

Unnamed: 0,state,year
2,Ohio,2000
3,Ohio,2001
4,Ohio,2002


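### loc: selecting by index label

loc selects rows and columns by index label. Note that, unlike the slicing shown earlier, a loc slice includes its end label: frame.loc[2:3] returns the rows labeled 2 and 3, whereas a positional slice such as frame[0:4] stops before position 4.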

In [21]:
frame.loc[2:3]

Unnamed: 0,pop,state,year
2,1.5,Ohio,2000
3,1.7,Ohio,2001


In [22]:
frame2.loc[['one', 'two']]

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
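
The difference between label slicing and positional slicing is easiest to see when the labels are integers that do not start at 0. A minimal sketch, assuming a one-column version of the frame:

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002, 2001, 2002]},
                     index=[2, 3, 4, 5, 6])

by_label = frame.loc[2:3]      # labels 2 and 3, end label included
by_position = frame.iloc[2:3]  # position 2 only, end position excluded

print(by_label)
print(by_position)
```

The same numbers `2:3` select two rows through loc and a single, different row through iloc.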


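loc can select the data between two specific dates, with both endpoints included. With an unsorted index both dates must be present in the index; a sorted DatetimeIndex also accepts endpoints that are not among its labels.  
Unless you have a specific reason not to, **prefer loc and use plain [] sparingly**: assigning through loc avoids the chained indexing problem, whereas assigning through chained [] is likely to make pandas emit a SettingWithCopyWarning.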

In [23]:
import datetime as dt
dates = pd.date_range('20171118', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# Build the two endpoint dates
fecha_1 = dt.datetime(2017, 11, 19)
fecha_2 = dt.datetime(2017, 11, 21)
# Slice the rows between them (both endpoints included)
df.loc[fecha_1:fecha_2]

Unnamed: 0,A,B,C,D
2017-11-19,0.780579,-2.02889,-1.539165,-0.410289
2017-11-20,0.767227,-1.367373,0.384481,-0.502857
2017-11-21,-1.030629,-0.316625,0.183439,0.156224
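
The chained indexing concern can be sketched as follows. The frame and the value `0.0` are assumptions for illustration; the chained form is left as a comment because whether it warns or silently fails depends on the pandas version.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                      'pop': [1.5, 1.7, 2.4]})

# Chained form: two indexing steps, so the assignment may land on a
# temporary copy instead of `frame`:
#     frame[frame['state'] == 'Ohio']['pop'] = 0.0

# Single loc call: row mask and column label in one indexing operation,
# so the assignment is applied to `frame` itself.
frame.loc[frame['state'] == 'Ohio', 'pop'] = 0.0

print(frame)
```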


### iloc: selecting by position

iloc ignores the index labels and looks only at positions, so only integer positions can be used inside the brackets.

In [24]:
df.iloc[0:3]

Unnamed: 0,A,B,C,D
2017-11-18,-1.147734,2.234395,-0.490016,0.157359
2017-11-19,0.780579,-2.02889,-1.539165,-0.410289
2017-11-20,0.767227,-1.367373,0.384481,-0.502857


In [25]:
df.iloc[:, 0:2]

Unnamed: 0,A,B
2017-11-18,-1.147734,2.234395
2017-11-19,0.780579,-2.02889
2017-11-20,0.767227,-1.367373
2017-11-21,-1.030629,-0.316625
2017-11-22,-0.071375,0.289792
2017-11-23,-1.203889,-1.220025


In [26]:
df.iloc[:, [0, 2]]

Unnamed: 0,A,C
2017-11-18,-1.147734,-0.490016
2017-11-19,0.780579,-1.539165
2017-11-20,0.767227,0.384481
2017-11-21,-1.030629,0.183439
2017-11-22,-0.071375,-1.398324
2017-11-23,-1.203889,-0.093599


In [27]:
df.iloc[[1, 2], [2, 3]]

Unnamed: 0,C,D
2017-11-19,-1.539165,-0.410289
2017-11-20,0.384481,-0.502857
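
To show that iloc really ignores the labels, the sketch below uses an integer index that starts at 2 (an assumption mirroring the earlier `frame`), so position 0 and label 0 mean different things.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6],
                      'year': [2000, 2001, 2002]},
                     index=[2, 3, 4])

first_row = frame.iloc[0]      # position 0, whose label happens to be 2
same_row = frame.loc[2]        # the same row addressed by label
corner = frame.iloc[0:2, 0:1]  # first two rows, first column

print(first_row)
print(corner)
```

Here `frame.loc[0]` would raise a KeyError, because no row carries the label 0.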


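### at and iat: accessing a single cell

at is used like loc but is faster, and it can access only a single element, not several.  
iat is to iloc what at is to loc: a faster position-based selector that, like at, can access only a single element.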

In [28]:
df.at[fecha_1, 'A']

0.7805786104806125

In [29]:
df.iat[1, 0]

0.7805786104806125
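
A deterministic sketch of the same two lookups, with fixed values assumed in place of the random data. Both accessors can also be used to set a single cell.

```python
import pandas as pd

dates = pd.date_range('20171118', periods=3)
df = pd.DataFrame({'A': [0.1, 0.2, 0.3], 'B': [1.0, 2.0, 3.0]}, index=dates)

by_label = df.at[dates[1], 'A']  # label-based scalar access
by_position = df.iat[1, 0]       # position-based access to the same cell

# Writing through at changes exactly one cell.
df.at[dates[1], 'A'] = 9.9

print(by_label, by_position, df.iat[1, 0])
```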

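### ix: slicing with bounds that are not in the DataFrame index

Exact-label lookups with the methods above require the label to be in the index, and positions must be within range. Label slicing with ix also accepts bounds that do not appear in the DataFrame's index. ix was deprecated in pandas 0.20 and removed in pandas 1.0; on a sorted index, df.loc[date1:date2] returns the same rows.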

In [32]:
date1 = dt.datetime(2017, 11, 10)
date2 = dt.datetime(2017, 12, 10)
df.ix[date1:date2]

Unnamed: 0,A,B,C,D
2017-11-18,-1.147734,2.234395,-0.490016,0.157359
2017-11-19,0.780579,-2.02889,-1.539165,-0.410289
2017-11-20,0.767227,-1.367373,0.384481,-0.502857
2017-11-21,-1.030629,-0.316625,0.183439,0.156224
2017-11-22,-0.071375,0.289792,-1.398324,-0.730211
2017-11-23,-1.203889,-1.220025,-0.093599,-2.300757
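
For pandas versions without ix, the same out-of-range slice can be sketched with loc. This assumes a sorted DatetimeIndex; the single column `A` with values 0 to 5 is an illustrative stand-in for the random data.

```python
import datetime as dt
import pandas as pd

dates = pd.date_range('20171118', periods=6)
df = pd.DataFrame({'A': range(6)}, index=dates)

# Neither bound is a label in the index, but the index is sorted,
# so loc returns every row that falls between them.
date1 = dt.datetime(2017, 11, 10)
date2 = dt.datetime(2017, 12, 10)
window = df.loc[date1:date2]

# A bound inside the range trims the result as expected.
tail_part = df.loc[dt.datetime(2017, 11, 20):date2]

print(window)
print(tail_part)
```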


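### Iterating over a DataFrame

```for i in obj```: iterates over the column names of obj  
```for i in obj.iteritems()```: for a DataFrame, iterates over columns, yielding tuple(column name, Series); removed in pandas 2.0, where ```obj.items()``` does the same  
```for i in obj.iterrows()```: iterates over the rows of a DataFrame, yielding tuple(index, Series)  
```for i in obj.itertuples()```: also iterates over rows, yielding a namedtuple; usually faster than iterrows because no row is converted to a Series  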

In [33]:
for i in frame:
    print(i)

pop
state
year


In [34]:
for i in frame.iteritems():
    print(i)

('pop', 2    1.5
3    1.7
4    3.6
5    2.4
6    2.9
Name: pop, dtype: float64)
('state', 2      Ohio
3      Ohio
4      Ohio
5    Nevada
6    Nevada
Name: state, dtype: object)
('year', 2    2000
3    2001
4    2002
5    2001
6    2002
Name: year, dtype: int32)


In [35]:
for i in frame.iterrows():
    print(i)

(2, pop       1.5
state    Ohio
year     2000
Name: 2, dtype: object)
(3, pop       1.7
state    Ohio
year     2001
Name: 3, dtype: object)
(4, pop       3.6
state    Ohio
year     2002
Name: 4, dtype: object)
(5, pop         2.4
state    Nevada
year       2001
Name: 5, dtype: object)
(6, pop         2.9
state    Nevada
year       2002
Name: 6, dtype: object)


In [36]:
for i in frame.itertuples():
    print(i)
    print(type(i))

Pandas(Index=2, pop=1.5, state='Ohio', year=2000)
<class 'pandas.core.frame.Pandas'>
Pandas(Index=3, pop=1.7, state='Ohio', year=2001)
<class 'pandas.core.frame.Pandas'>
Pandas(Index=4, pop=3.6000000000000001, state='Ohio', year=2002)
<class 'pandas.core.frame.Pandas'>
Pandas(Index=5, pop=2.3999999999999999, state='Nevada', year=2001)
<class 'pandas.core.frame.Pandas'>
Pandas(Index=6, pop=2.8999999999999999, state='Nevada', year=2002)
<class 'pandas.core.frame.Pandas'>
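
The iteration styles above usually end up unpacking their tuples. A compact sketch, assuming a two-row frame and using `items()` so that it also runs on pandas versions without `iteritems()`:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'year': [2000, 2001]},
                     index=[2, 5])

# Column-wise: (column name, Series) pairs.
col_names = [name for name, _ in frame.items()]

# Row-wise with namedtuples: fields are read as attributes,
# and the row label is available as `Index`.
labels = [f"{row.Index}:{row.state}-{row.year}" for row in frame.itertuples()]

# Row-wise with iterrows: (label, Series) pairs, read by column name.
row_states = [row['state'] for _, row in frame.iterrows()]

print(col_names, labels, row_states)
```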


### Checking the row count and dtypes

In [37]:
frame.shape

(5, 3)

In [38]:
frame.shape[0]

5

In [39]:
frame.dtypes

pop      float64
state     object
year       int32
dtype: object

### Column names, index labels, and values

In [40]:
frame.index

Int64Index([2, 3, 4, 5, 6], dtype='int64')

In [41]:
frame.columns

Index(['pop', 'state', 'year'], dtype='object')

In [42]:
list(frame.index)

[2, 3, 4, 5, 6]

In [43]:
frame.values

array([[1.5, 'Ohio', 2000],
       [1.7, 'Ohio', 2001],
       [3.6, 'Ohio', 2002],
       [2.4, 'Nevada', 2001],
       [2.9, 'Nevada', 2002]], dtype=object)

## Modifying DataFrame values

### Direct column assignment (creates the column if it does not exist)

In [44]:
frame

Unnamed: 0,pop,state,year
2,1.5,Ohio,2000
3,1.7,Ohio,2001
4,3.6,Ohio,2002
5,2.4,Nevada,2001
6,2.9,Nevada,2002


In [45]:
frame['pop'] = ['1.4', '1.6', '2.6', '2.9', '3.3']
frame

Unnamed: 0,pop,state,year
2,1.4,Ohio,2000
3,1.6,Ohio,2001
4,2.6,Ohio,2002
5,2.9,Nevada,2001
6,3.3,Nevada,2002


In [46]:
frame['city'] = ['beijing', 'shanghai', 'hangzhou', 'shenzhen', 'zhuhai']
frame

Unnamed: 0,pop,state,year,city
2,1.4,Ohio,2000,beijing
3,1.6,Ohio,2001,shanghai
4,2.6,Ohio,2002,hangzhou
5,2.9,Nevada,2001,shenzhen
6,3.3,Nevada,2002,zhuhai
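
One detail worth noting about the assignment above: the new `pop` values are strings, so the column no longer holds numbers. The sketch below, on a shortened frame, converts the column back and also shows assignment from a Series, which is aligned on the index. The `debt` column and its value are assumptions for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=[2, 3, 4])

frame['pop'] = ['1.4', '1.6', '2.6']           # overwrite: values are now strings
frame['pop'] = frame['pop'].astype('float64')  # convert back to numbers

frame['city'] = ['beijing', 'shanghai', 'hangzhou']  # new column from a list

# A Series is matched on index labels; rows without a match get NaN.
frame['debt'] = pd.Series([0.5], index=[3])

print(frame)
print(frame.dtypes)
```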


### 