<img src="http://blog.welcomege.com/wp-content/uploads/2017/07/jupyter.png"
style="width:120px;height:30px;float:right">

# Pandas Tutorial

Created on 29/12/2017, Fan Li

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Creating Objects

**A Series is a one-dimensional sequence of values: a single column plus an index. The example below uses the default integer index.**

In [2]:
s = pd.Series([1,3,5,np.nan,6,8])

In [3]:
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
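
The index does not have to be the default integers. A minimal sketch, assuming a hypothetical set of string labels (`a` through `f`) that are not part of the original example:

```python
import numpy as np
import pandas as pd

# Hypothetical labels chosen for illustration; any hashable values work.
s_labeled = pd.Series([1, 3, 5, np.nan, 6, 8], index=list("abcdef"))

# Look up a value by its label instead of by position.
value_c = s_labeled["c"]

# The NaN entry forces float64, just as in the default-index example.
print(s_labeled.dtype)
print(value_c)
```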

**A DataFrame is a table with multiple columns, each with its own label. Like a Series, a DataFrame also has an index.**

In [4]:
dates = pd.date_range('20180101', periods=6)

In [5]:
dates

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', freq='D')

In [6]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [7]:
df

,A,B,C,D
2018-01-01,-0.840159,0.919306,-0.425277,-0.284685
2018-01-02,0.389555,-0.703242,0.880569,-0.948373
2018-01-03,0.053589,0.899564,0.665736,-2.184503
2018-01-04,2.327503,-0.565789,-0.274481,-1.17204
2018-01-05,-0.447019,2.16692,0.385466,-1.546954
2018-01-06,-1.927569,-1.023112,0.407856,-1.29275
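
Because `np.random.randn` is unseeded here, rerunning the cell produces different numbers from the table above. One possible way to make the table repeatable is a seeded generator. The seed value `0` and the names `df_seeded` and `df_again` are assumptions made for this sketch:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20180101", periods=6)

# A seeded generator (seed 0 is an arbitrary choice) makes the table repeatable.
rng = np.random.default_rng(0)
df_seeded = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# Rebuilding with the same seed gives identical values.
rng_again = np.random.default_rng(0)
df_again = pd.DataFrame(rng_again.standard_normal((6, 4)), index=dates, columns=list("ABCD"))
print(df_seeded.equals(df_again))
```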


**If the argument is a dict, each value in the dict is converted to a Series; scalar values are repeated to fill every row.**

In [8]:
df2 = pd.DataFrame({
    'A' : 1.,
    'B' : pd.Timestamp('20180102'),
    'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
    'D' : np.array([3]*4, dtype="int32"),
    'E' : pd.Categorical(["test", "train", "test", "train"]),
    'F' : 'Fan'
})

In [9]:
df2

,A,B,C,D,E,F
0,1.0,2018-01-02,1.0,3,test,Fan
1,1.0,2018-01-02,1.0,3,train,Fan
2,1.0,2018-01-02,1.0,3,test,Fan
3,1.0,2018-01-02,1.0,3,train,Fan
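
The scalar entries (`A`, `B`, `F`) are stretched to match the length set by the array-like entries. A small sketch of that behavior, using a reduced hypothetical frame `df_small`, and of what happens when two array-likes disagree on length:

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    "A": 1.0,                                # scalar, repeated for every row
    "D": np.array([3] * 4, dtype="int32"),   # array-like, fixes the length at 4
    "F": "Fan",                              # scalar string, also repeated
})
print(df_small.shape)

# Array-likes of different lengths cannot be aligned and raise ValueError.
try:
    pd.DataFrame({"x": [1, 2, 3], "y": [1, 2]})
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```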


**Use dtypes to inspect the data type of each column.**

In [10]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
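
If a column's type is not the one you want, `astype` is one way to convert it. This is a sketch on a hypothetical two-column frame `df_types`; the target types are chosen only for illustration:

```python
import numpy as np
import pandas as pd

df_types = pd.DataFrame({
    "C": pd.Series(1, index=range(4), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
})

# astype returns a new DataFrame; the original is left unchanged.
df_wide = df_types.astype({"C": "float64", "D": "int64"})
print(df_wide.dtypes)
```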

**You can think of a DataFrame as a collection of Series, one per column.**

In [11]:
df2.A

0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64

In [12]:
df2.B

0   2018-01-02
1   2018-01-02
2   2018-01-02
3   2018-01-02
Name: B, dtype: datetime64[ns]
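
To see the "collection of Series" view directly, you can iterate over the columns. A minimal sketch on a hypothetical frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "A": [1.0, 2.0],
    "B": pd.to_datetime(["2018-01-02", "2018-01-03"]),
})

# items() yields (column label, Series) pairs.
for label, column in df_demo.items():
    print(label, type(column).__name__, column.dtype)

# Every column shares the DataFrame's index.
shares_index = all(col.index.equals(df_demo.index) for _, col in df_demo.items())
print(shares_index)
```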

## Viewing Data

**Use head and tail to view the first and last few rows.**

In [13]:
df.head()

,A,B,C,D
2018-01-01,-0.840159,0.919306,-0.425277,-0.284685
2018-01-02,0.389555,-0.703242,0.880569,-0.948373
2018-01-03,0.053589,0.899564,0.665736,-2.184503
2018-01-04,2.327503,-0.565789,-0.274481,-1.17204
2018-01-05,-0.447019,2.16692,0.385466,-1.546954


In [14]:
df.tail(3)

,A,B,C,D
2018-01-04,2.327503,-0.565789,-0.274481,-1.17204
2018-01-05,-0.447019,2.16692,0.385466,-1.546954
2018-01-06,-1.927569,-1.023112,0.407856,-1.29275
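
Both methods take an optional row count and default to five rows, which is why `df.head()` above returns five of the six rows. A short sketch on a hypothetical ten-row frame `df_rows`:

```python
import numpy as np
import pandas as pd

df_rows = pd.DataFrame({"n": np.arange(10)})

first_two = df_rows.head(2)     # rows at positions 0 and 1
last_two = df_rows.tail(2)      # rows at positions 8 and 9
default_head = df_rows.head()   # the row count defaults to 5
print(len(default_head))
```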


**Internally, a DataFrame stores its data in NumPy arrays. You can also inspect the index, the columns, and the underlying values separately.**

In [15]:
df.index

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06'],
              dtype='datetime64[ns]', freq='D')

In [16]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [17]:
df.values

array([[-0.84015939,  0.91930594, -0.42527729, -0.28468528],
       [ 0.38955499, -0.70324159,  0.8805686 , -0.94837273],
       [ 0.05358902,  0.89956369,  0.66573647, -2.18450302],
       [ 2.32750259, -0.56578856, -0.27448112, -1.17204047],
       [-0.44701914,  2.1669201 ,  0.38546628, -1.5469544 ],
       [-1.92756927, -1.02311198,  0.40785571, -1.29275047]])
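
The pandas documentation recommends `to_numpy()` over `.values` in newer code. The sketch below, on two hypothetical frames, also shows that a frame with mixed column types falls back to a common `object` dtype when converted:

```python
import numpy as np
import pandas as pd

df_num = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
arr = df_num.to_numpy()          # all-float columns stay float64

df_mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["x", "y"]})
arr_mixed = df_mixed.to_numpy()  # float and string columns share only object

print(arr.dtype, arr_mixed.dtype)
```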

**describe() shows summary statistics for each column: count, mean, std, min, the quartiles (25%, 50%, 75%), and max.**

In [19]:
df.describe()

,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.074017,0.282275,0.273311,-1.238218
std,1.425499,1.243642,0.517985,0.631088
min,-1.927569,-1.023112,-0.425277,-2.184503
25%,-0.741874,-0.668878,-0.109494,-1.483403
50%,-0.196715,0.166888,0.396661,-1.232395
75%,0.305563,0.91437,0.601266,-1.00429
max,2.327503,2.16692,0.880569,-0.284685
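
Each row of the summary corresponds to an ordinary Series method, so the figures can be checked by hand. A sketch on a hypothetical five-value column `col`:

```python
import math
import pandas as pd

col = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="x")
summary = col.describe()

# Each entry can be reproduced with the matching Series method.
same_mean = math.isclose(summary["mean"], col.mean())
same_std = math.isclose(summary["std"], col.std())            # sample std (ddof=1)
same_q1 = math.isclose(summary["25%"], col.quantile(0.25))    # linear interpolation
same_median = math.isclose(summary["50%"], col.median())
print(same_mean, same_std, same_q1, same_median)
```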


**As in NumPy, the transpose is easy to obtain.**

In [20]:
df.T

,2018-01-01 00:00:00,2018-01-02 00:00:00,2018-01-03 00:00:00,2018-01-04 00:00:00,2018-01-05 00:00:00,2018-01-06 00:00:00
A,-0.840159,0.389555,0.053589,2.327503,-0.447019,-1.927569
B,0.919306,-0.703242,0.899564,-0.565789,2.16692,-1.023112
C,-0.425277,0.880569,0.665736,-0.274481,0.385466,0.407856
D,-0.284685,-0.948373,-2.184503,-1.17204,-1.546954,-1.29275
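
Transposing swaps the row and column labels along with the data. A small sketch on a hypothetical 2x3 frame `df_t`:

```python
import numpy as np
import pandas as pd

df_t = pd.DataFrame(np.arange(6).reshape(2, 3), index=["r0", "r1"], columns=list("ABC"))
transposed = df_t.T

print(transposed.shape)                 # rows and columns swap: (3, 2)
print(list(transposed.index))           # the former column labels
round_trip = transposed.T.equals(df_t)  # transposing twice restores the original
```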


**Sort along an axis by its labels (axis=1 is the second dimension, i.e. the column labels; ascending=False means descending order).**

In [22]:
df.sort_index(axis=1, ascending=False)

,D,C,B,A
2018-01-01,-0.284685,-0.425277,0.919306,-0.840159
2018-01-02,-0.948373,0.880569,-0.703242,0.389555
2018-01-03,-2.184503,0.665736,0.899564,0.053589
2018-01-04,-1.17204,-0.274481,-0.565789,2.327503
2018-01-05,-1.546954,0.385466,2.16692,-0.447019
2018-01-06,-1.29275,0.407856,-1.023112,-1.927569
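
The same call works on the row labels with `axis=0`, which is the default. A sketch on a hypothetical frame `df_idx` whose dates and columns start out of order:

```python
import pandas as pd

df_idx = pd.DataFrame(
    {"B": [1, 2, 3], "A": [4, 5, 6]},
    index=pd.to_datetime(["2018-01-02", "2018-01-01", "2018-01-03"]),
)

by_row_label = df_idx.sort_index(axis=0, ascending=False)  # newest date first
by_col_label = df_idx.sort_index(axis=1)                   # columns in order A, B

# sort_index returns a new object; df_idx keeps its original order.
print(list(df_idx.columns))
```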


**Sort rows by the values in a given column (here, column B).**

In [24]:
df.sort_values(by='B')

,A,B,C,D
2018-01-06,-1.927569,-1.023112,0.407856,-1.29275
2018-01-02,0.389555,-0.703242,0.880569,-0.948373
2018-01-04,2.327503,-0.565789,-0.274481,-1.17204
2018-01-03,0.053589,0.899564,0.665736,-2.184503
2018-01-01,-0.840159,0.919306,-0.425277,-0.284685
2018-01-05,-0.447019,2.16692,0.385466,-1.546954
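
`sort_values` also accepts `ascending=False` and a list of columns for tie-breaking. A sketch on a hypothetical frame `df_sort`:

```python
import pandas as pd

df_sort = pd.DataFrame({"A": [2, 1, 2, 1], "B": [0.5, 0.9, 0.1, 0.3]})

# Largest B first.
desc_b = df_sort.sort_values(by="B", ascending=False)

# Sort by A first, then break ties with B.
by_a_then_b = df_sort.sort_values(by=["A", "B"])
print(by_a_then_b)
```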


## Selecting Rows and Columns

**Selecting a single column from a DataFrame yields a Series.**

In [25]:
df["B"]

2018-01-01    0.919306
2018-01-02   -0.703242
2018-01-03    0.899564
2018-01-04   -0.565789
2018-01-05    2.166920
2018-01-06   -1.023112
Freq: D, Name: B, dtype: float64
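
Passing a list of labels instead of a single label keeps the result as a DataFrame, even for one column. A minimal sketch on a hypothetical frame `df_sel`:

```python
import pandas as pd

df_sel = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

one_column = df_sel["B"]      # a single label returns a Series
as_frame = df_sel[["B"]]      # a list of labels returns a DataFrame

print(type(one_column).__name__, type(as_frame).__name__)
```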

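**As in NumPy, you can also slice with []. A slice selects rows: an integer slice works by position and excludes the end, while a label slice includes both endpoints.**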

In [26]:
df[0:3]

,A,B,C,D
2018-01-01,-0.840159,0.919306,-0.425277,-0.284685
2018-01-02,0.389555,-0.703242,0.880569,-0.948373
2018-01-03,0.053589,0.899564,0.665736,-2.184503


In [27]:
df["20180104": "20180106"]

,A,B,C,D
2018-01-04,2.327503,-0.565789,-0.274481,-1.17204
2018-01-05,-0.447019,2.16692,0.385466,-1.546954
2018-01-06,-1.927569,-1.023112,0.407856,-1.29275
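
The two slices above treat their end point differently, which is easy to miss. A sketch on a hypothetical frame `df_slice`, whose single column `n` records each row's position:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20180101", periods=6)
df_slice = pd.DataFrame({"n": np.arange(6)}, index=dates)

by_position = df_slice[0:3]                   # positions 0, 1, 2 (end excluded)
by_label = df_slice["20180104":"20180106"]    # Jan 4 through Jan 6 (end included)

print(list(by_position["n"]), list(by_label["n"]))
```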


### Selection by Label