Notes on [10 Minutes to pandas](https://www.pypandas.cn/docs/getting_started/10min.html).

In [1]:
import numpy as np
import pandas as pd

# Object creation


In [2]:
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [3]:
dates = pd.date_range('20130101',periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

A DataFrame can be created from a NumPy array, using the DatetimeIndex above as its index and a list of labels as its columns:

In [44]:
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2013-01-01 | -1.15813 | -1.273136 | 0.120422 | 0.370596 |
| 2013-01-02 | -0.373393 | 0.827477 | 0.94134 | 0.612299 |
| 2013-01-03 | -1.273525 | -0.877686 | -2.112548 | 0.140516 |
| 2013-01-04 | 1.34242 | -0.075805 | 0.227908 | -0.318044 |
| 2013-01-05 | -0.097336 | 0.199299 | 0.692315 | 0.627982 |
| 2013-01-06 | 1.91629 | 0.302726 | 2.249316 | -0.554638 |


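A Series behaves much like a list with an index attached. A DataFrame can also be created from a dict whose values are scalars, Series, arrays, or other list-like objects: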

In [5]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2

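| | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |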


In [7]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Each column of the DataFrame above has its own dtype.
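
The per-column dtypes can be checked on a smaller frame. This is only a sketch: the frame `demo` and its column names are made up for illustration, and `select_dtypes` is used simply as one way to pick out the numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the column names are illustrative only.
demo = pd.DataFrame({
    'scalar': 1.0,                                   # a scalar is broadcast to every row
    'ints': np.array([3] * 4, dtype='int32'),        # the array length fixes the row count
    'cat': pd.Categorical(['test', 'train', 'test', 'train']),
})

# Each column keeps the dtype it was built with.
print(demo.dtypes)

# Keep only the numeric columns.
numeric_cols = demo.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```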

# Viewing data

In [12]:
df

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2013-01-01 | -1.698328 | 0.213797 | -0.153667 | -0.005892 |
| 2013-01-02 | -1.084556 | 0.605959 | -0.179345 | -0.612155 |
| 2013-01-03 | 1.593621 | 0.485248 | 1.243203 | 0.956194 |
| 2013-01-04 | -0.932383 | 0.642543 | 0.393981 | -0.045905 |
| 2013-01-05 | 0.824812 | 0.013194 | 0.19379 | -1.2428 |
| 2013-01-06 | 0.761398 | 0.112971 | -0.450235 | -0.685198 |


In [10]:
df.head(0)

| | A | B | C | D |
| --- | --- | --- | --- | --- |


In [13]:
df.head(1)

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2013-01-01 | -1.698328 | 0.213797 | -0.153667 | -0.005892 |


In [14]:
df.tail(0)

| | A | B | C | D |
| --- | --- | --- | --- | --- |


In [11]:
df.tail(1)

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2013-01-06 | 0.761398 | 0.112971 | -0.450235 | -0.685198 |


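As shown above, `tail()` and `head()` work the same way. Called without an argument, each returns 5 rows by default, so on this 6-row frame only one row is left out. Called with 0, each returns only the header.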
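
A quick way to check the default row count is the sketch below. The frame `demo` is a stand-in filled with consecutive integers rather than the random data used in these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical 6-row frame with the same index and columns as df in these notes.
dates = pd.date_range('20130101', periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

first_five = demo.head()     # n defaults to 5
last_five = demo.tail()      # n defaults to 5
header_only = demo.head(0)   # no rows, but the columns are kept

print(len(first_five), len(last_five), len(header_only))
print(list(header_only.columns))
```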

In [16]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [17]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

Converting a DataFrame to a NumPy array:

In [18]:
df.to_numpy()

array([[-1.69832825,  0.21379673, -0.15366707, -0.00589176],
       [-1.08455629,  0.60595927, -0.17934515, -0.61215469],
       [ 1.59362065,  0.48524788,  1.24320274,  0.95619429],
       [-0.93238264,  0.64254341,  0.39398096, -0.04590538],
       [ 0.82481209,  0.01319449,  0.19378956, -1.24280048],
       [ 0.76139816,  0.11297147, -0.45023513, -0.68519842]])

In [19]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

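The fundamental difference between pandas and NumPy: a NumPy array has a single dtype for the whole array, while a DataFrame has one dtype per column. When the columns of a DataFrame have different dtypes, `to_numpy()` has to find one dtype that can hold all of them (`object` in the `df2` output above), which makes the conversion relatively expensive.

Also note that the conversion to NumPy discards the index and the column labels.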
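
The effect on the resulting dtype can be seen in a small sketch. Both frames and their labels are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: one with a single dtype, one with mixed dtypes.
homogeneous = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]}, index=['r1', 'r2'])
mixed = pd.DataFrame({'x': [1.0, 2.0], 'label': ['a', 'b']}, index=['r1', 'r2'])

arr_h = homogeneous.to_numpy()
arr_m = mixed.to_numpy()

print(arr_h.dtype)   # all-float columns stay float64
print(arr_m.dtype)   # float + string columns fall back to object

# The result is a plain ndarray: the row labels 'r1', 'r2' and the column names are gone.
print(type(arr_h), arr_h.shape)
```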

Transposing the data (same as in NumPy):

In [20]:
df.T

| | 2013-01-01 | 2013-01-02 | 2013-01-03 | 2013-01-04 | 2013-01-05 | 2013-01-06 |
| --- | --- | --- | --- | --- | --- | --- |
| A | -1.698328 | -1.084556 | 1.593621 | -0.932383 | 0.824812 | 0.761398 |
| B | 0.213797 | 0.605959 | 0.485248 | 0.642543 | 0.013194 | 0.112971 |
| C | -0.153667 | -0.179345 | 1.243203 | 0.393981 | 0.19379 | -0.450235 |
| D | -0.005892 | -0.612155 | 0.956194 | -0.045905 | -1.2428 | -0.685198 |


Sorting by an axis:

In [21]:
df.sort_index(axis=1,ascending=False)

| | D | C | B | A |
| --- | --- | --- | --- | --- |
| 2013-01-01 | -0.005892 | -0.153667 | 0.213797 | -1.698328 |
| 2013-01-02 | -0.612155 | -0.179345 | 0.605959 | -1.084556 |
| 2013-01-03 | 0.956194 | 1.243203 | 0.485248 | 1.593621 |
| 2013-01-04 | -0.045905 | 0.393981 | 0.642543 | -0.932383 |
| 2013-01-05 | -1.2428 | 0.19379 | 0.013194 | 0.824812 |
| 2013-01-06 | -0.685198 | -0.450235 | 0.112971 | 0.761398 |


In [22]:
df.sort_values(by='B')

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2013-01-05 | 0.824812 | 0.013194 | 0.19379 | -1.2428 |
| 2013-01-06 | 0.761398 | 0.112971 | -0.450235 | -0.685198 |
| 2013-01-01 | -1.698328 | 0.213797 | -0.153667 | -0.005892 |
| 2013-01-03 | 1.593621 | 0.485248 | 1.243203 | 0.956194 |
| 2013-01-02 | -1.084556 | 0.605959 | -0.179345 | -0.612155 |
| 2013-01-04 | -0.932383 | 0.642543 | 0.393981 | -0.045905 |


# Selection

In [23]:
df.loc['2013-01-05']

A    0.824812
B    0.013194
C    0.193790
D   -1.242800
Name: 2013-01-05 00:00:00, dtype: float64

In [24]:
df.loc['2013-01-05',['A','B']]

A    0.824812
B    0.013194
Name: 2013-01-05 00:00:00, dtype: float64

In [26]:
df.iloc[0]

A   -1.698328
B    0.213797
C   -0.153667
D   -0.005892
Name: 2013-01-01 00:00:00, dtype: float64

In [30]:
df.iloc[[0,2,3],0:2]

| | A | B |
| --- | --- | --- |
| 2013-01-01 | -1.698328 | 0.213797 |
| 2013-01-03 | 1.593621 | 0.485248 |
| 2013-01-04 | -0.932383 | 0.642543 |


In [31]:
df['A']

2013-01-01   -1.698328
2013-01-02   -1.084556
2013-01-03    1.593621
2013-01-04   -0.932383
2013-01-05    0.824812
2013-01-06    0.761398
Freq: D, Name: A, dtype: float64

In [45]:
df['2013-01-01':'2013-01-03']

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2013-01-01 | -1.15813 | -1.273136 | 0.120422 | 0.370596 |
| 2013-01-02 | -0.373393 | 0.827477 | 0.94134 | 0.612299 |
| 2013-01-03 | -1.273525 | -0.877686 | -2.112548 | 0.140516 |


**Note**: with plain `[]` indexing, a column is selected by passing its column name directly, while rows are selected with a slice.
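
The two behaviours of `[]` can be compared side by side in the sketch below. The frame `demo` is a stand-in with integer data, and the `.loc` call is included only as the explicit alternative for selecting rows and columns together.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same index and columns as df in these notes.
dates = pd.date_range('20130101', periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

col_a = demo['A']                                  # column name -> Series
rows_by_pos = demo[0:3]                            # positional slice -> first three rows
rows_by_label = demo['2013-01-01':'2013-01-03']    # label slice, both endpoints included

# Rows and columns selected explicitly in one step.
explicit = demo.loc['2013-01-01':'2013-01-03', ['A', 'B']]

print(col_a.name, len(rows_by_pos), len(rows_by_label), explicit.shape)
```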

# Operations

## The apply function

In [32]:
df.apply(lambda x: x.max()-x.min())

A    3.291949
B    0.629349
C    1.693438
D    2.198995
dtype: float64
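
By default `apply` passes each column to the function, which is why the result above has one value per column. The sketch below, on a made-up frame with a hypothetical helper `value_range`, contrasts that with `axis=1`, which passes each row instead.

```python
import pandas as pd

# Hypothetical frame with easy-to-check values.
demo = pd.DataFrame({'A': [1, 4, 2], 'B': [10, 30, 20]})

def value_range(x):
    """Spread between the largest and smallest value of a Series."""
    return x.max() - x.min()

per_column = demo.apply(value_range)          # axis=0 (default): one result per column
per_row = demo.apply(value_range, axis=1)     # axis=1: one result per row

print(per_column)
print(per_row)
```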

# Merging

## Concat

In [46]:
df = pd.DataFrame(np.random.randn(10, 4))
pieces = [df[:3], df[3:7], df[7:]]
pieces

[          0         1         2         3
 0 -0.263194  0.348138  0.671059 -1.131329
 1  1.099333 -0.253627  0.144834 -0.594846
 2 -0.322605  0.187960  0.042065  0.171241,
           0         1         2         3
 3 -0.347857 -0.605265 -0.932975  1.313521
 4 -1.171566  0.967818  0.828199 -0.142316
 5  1.860283  0.484442 -1.426788 -0.163725
 6  0.596286  0.120438  0.579895 -1.358645,
           0         1         2         3
 7 -0.655525  0.809190 -2.385250 -0.123670
 8 -0.393709 -1.573537  0.572671 -0.624143
 9  0.060042  0.584087  0.498921 -0.295325]

In [47]:
pd.concat(pieces)

| | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | -0.263194 | 0.348138 | 0.671059 | -1.131329 |
| 1 | 1.099333 | -0.253627 | 0.144834 | -0.594846 |
| 2 | -0.322605 | 0.18796 | 0.042065 | 0.171241 |
| 3 | -0.347857 | -0.605265 | -0.932975 | 1.313521 |
| 4 | -1.171566 | 0.967818 | 0.828199 | -0.142316 |
| 5 | 1.860283 | 0.484442 | -1.426788 | -0.163725 |
| 6 | 0.596286 | 0.120438 | 0.579895 | -1.358645 |
| 7 | -0.655525 | 0.80919 | -2.38525 | -0.12367 |
| 8 | -0.393709 | -1.573537 | 0.572671 | -0.624143 |
| 9 | 0.060042 | 0.584087 | 0.498921 | -0.295325 |


## Join

In [49]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left

| | key | lval |
| --- | --- | --- |
| 0 | foo | 1 |
| 1 | foo | 2 |


In [50]:
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right

| | key | rval |
| --- | --- | --- |
| 0 | foo | 4 |
| 1 | foo | 5 |


In [51]:
pd.merge(left,right)

| | key | lval | rval |
| --- | --- | --- | --- |
| 0 | foo | 1 | 4 |
| 1 | foo | 1 | 5 |
| 2 | foo | 2 | 4 |
| 3 | foo | 2 | 5 |


In [52]:
pd.merge(left,right,on='key')

| | key | lval | rval |
| --- | --- | --- | --- |
| 0 | foo | 1 | 4 |
| 1 | foo | 1 | 5 |
| 2 | foo | 2 | 4 |
| 3 | foo | 2 | 5 |


In [53]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
pd.merge(left, right, on='key')

| | key | lval | rval |
| --- | --- | --- | --- |
| 0 | foo | 1 | 4 |
| 1 | bar | 2 | 5 |
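
The difference between the 4-row and 2-row results comes from duplicate keys: when `key` repeats on both sides, every matching left row is paired with every matching right row. With no `on` given, `merge` joins on the columns the two frames share, which is why the two calls on the `foo`-only frames give the same result. The sketch below reproduces both cases under hypothetical variable names and, as one optional safeguard, uses the `validate` argument to reject unexpected duplicates.

```python
import pandas as pd

# Duplicate keys on both sides: each 'foo' on the left pairs with each 'foo' on the right.
dup_left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
dup_right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
many_to_many = pd.merge(dup_left, dup_right, on='key')

# Unique keys: each row finds exactly one partner.
uniq_left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
uniq_right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
one_to_one = pd.merge(uniq_left, uniq_right, on='key')

print(len(many_to_many), len(one_to_one))

# validate= makes merge raise when the keys are not unique as declared.
try:
    pd.merge(dup_left, dup_right, on='key', validate='one_to_one')
    duplicates_rejected = False
except pd.errors.MergeError:
    duplicates_rejected = True
print(duplicates_rejected)
```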


## Append

In [54]:
df = pd.DataFrame(np.random.randn(8,4),columns=['A','B','C','D'])
df

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | 0.013086 | 0.59619 | -1.292056 | -0.594901 |
| 1 | 1.837913 | 0.723589 | 1.367277 | -0.427229 |
| 2 | 0.849559 | -1.210506 | -0.624218 | 0.066001 |
| 3 | 0.23716 | -1.293985 | 0.581259 | -0.385105 |
| 4 | 2.010033 | -1.128762 | 0.034815 | -0.307704 |
| 5 | 0.223058 | 1.00637 | 0.981667 | -0.124479 |
| 6 | 0.782073 | -0.944668 | 0.6394 | -2.541169 |
| 7 | -1.083972 | -1.306321 | -0.594673 | 0.818462 |


In [55]:
s = df.loc[6]
s

A    0.782073
B   -0.944668
C    0.639400
D   -2.541169
Name: 6, dtype: float64

In [56]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df, s.to_frame().T], ignore_index=True)

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | 0.013086 | 0.59619 | -1.292056 | -0.594901 |
| 1 | 1.837913 | 0.723589 | 1.367277 | -0.427229 |
| 2 | 0.849559 | -1.210506 | -0.624218 | 0.066001 |
| 3 | 0.23716 | -1.293985 | 0.581259 | -0.385105 |
| 4 | 2.010033 | -1.128762 | 0.034815 | -0.307704 |
| 5 | 0.223058 | 1.00637 | 0.981667 | -0.124479 |
| 6 | 0.782073 | -0.944668 | 0.6394 | -2.541169 |
| 7 | -1.083972 | -1.306321 | -0.594673 | 0.818462 |
| 8 | 0.782073 | -0.944668 | 0.6394 | -2.541169 |


# Grouping