# 10 Minutes to pandas

In [1]:
import numpy as np

import pandas as pd

## Object creation
---

Creating a Series from a list of values lets pandas generate a default integer index:

In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])

s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
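
As a small illustration of the default index and of how the missing value affects the dtype, the sketch below rebuilds the Series above next to an all-integer one. The name `ints` is introduced here only for comparison and is not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Same values as above: np.nan is a float, so the whole Series is stored as float64.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Hypothetical comparison Series without missing values: it keeps an integer dtype.
ints = pd.Series([1, 3, 5])

print(s.index)     # the automatically generated integer index
print(s.dtype)     # float64
print(ints.dtype)  # an integer dtype
```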

Creating a DataFrame from a NumPy array, with a datetime index and labeled columns:

In [19]:
dates = pd.date_range('20130101', periods=6)

dates

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

df

,A,B,C,D
2013-01-01,-0.952008,-1.884322,0.304232,-0.72831
2013-01-02,-1.15262,-0.023535,-0.339995,1.632461
2013-01-03,-0.064363,-1.760031,-1.134183,0.105973
2013-01-04,-1.917089,-1.139279,-0.750057,0.875505
2013-01-05,-0.663622,0.127321,-0.861308,-1.402733
2013-01-06,1.390489,-1.185069,-1.792517,0.856189
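
`np.random.randn` draws fresh values on every run, so the numbers in such a frame are not reproducible. A minimal sketch of one way to get repeatable values, assuming NumPy's `default_rng` generator and an arbitrarily chosen seed of `0` (neither appears in the original tutorial):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)

# Assumption: a seeded generator replaces np.random.randn; the seed 0 is arbitrary.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# Rebuilding with the same seed gives an identical frame.
df_again = pd.DataFrame(
    np.random.default_rng(0).standard_normal((6, 4)),
    index=dates,
    columns=list("ABCD"),
)
print(df.equals(df_again))  # True
```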


Creating a DataFrame from a dictionary of objects that can be converted into Series-like columns:

In [4]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

df2

,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The columns of the resulting DataFrame have different dtypes:

In [5]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
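
Per-column dtypes can also be used to pick columns. The sketch below is an illustration rather than part of the original walkthrough: it rebuilds `df2` and uses `DataFrame.select_dtypes` to keep only the numeric columns. The variable name `numeric` is made up for this example.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

# Keep only columns with a numeric dtype (float64, float32 and int32 here).
numeric = df2.select_dtypes(include="number")
print(list(numeric.columns))  # ['A', 'C', 'D']
```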

## Viewing data
---
View the top and bottom rows of a DataFrame:

In [6]:
df.head()

df.tail(3)

,A,B,C,D
2021-12-15,-2.218056,-0.103874,1.313427,-0.783677
2021-12-16,0.435264,-1.060215,-0.713016,-0.244315
2021-12-17,-0.954491,-0.92923,-0.46156,-1.877273
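
A short sketch of how the row counts behave, using a stand-in frame filled with `np.arange` values (an assumption made so the result is deterministic): `head()` returns five rows by default, and `tail(3)` returns the last three.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Stand-in data: integers 0..23 instead of random values.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

top = df.head()      # head is a method and needs parentheses; default is n=5
bottom = df.tail(3)  # the last three rows

print(len(top), len(bottom))  # 5 3
```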


Display the index and the column labels:

In [7]:
df.index

df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

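DataFrame.to_numpy() returns a NumPy representation of the underlying data.

The output of DataFrame.to_numpy() does not include the index or the column labels.

Because every value in df is a floating-point number, DataFrame.to_numpy() is fast here and does not need to copy data.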

In [8]:
df.to_numpy()

array([[-2.14757203, -1.24942395, -0.42143462, -1.04387454],
       [-0.20992466, -1.2598119 , -0.11491146, -1.52270852],
       [ 1.80077402,  0.63760864,  0.08118179,  0.37367259],
       [-2.21805605, -0.10387359,  1.31342712, -0.78367681],
       [ 0.43526392, -1.06021453, -0.71301628, -0.24431467],
       [-0.95449051, -0.92923044, -0.4615597 , -1.87727309]])

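df2 holds columns of several dtypes. A NumPy array has a single dtype for the whole array, so DataFrame.to_numpy() has to find one dtype that can hold every value (here object) and cast all values to it, which is comparatively expensive.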

In [9]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
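
The difference between the two cases shows up in the dtype of the resulting array. A minimal sketch with two small stand-in frames (`floats` and `mixed` are hypothetical names, not frames from this tutorial):

```python
import numpy as np
import pandas as pd

# All columns share one dtype, so the array keeps it.
floats = pd.DataFrame(np.zeros((3, 2)), columns=["x", "y"])

# Mixed column types have no common numeric dtype, so the array falls back to object.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

print(floats.to_numpy().dtype)  # float64
print(mixed.to_numpy().dtype)   # object
```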

describe() shows a quick statistical summary of the data:

In [10]:
df.describe()

,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.549001,-0.660824,-0.052719,-0.849696
std,1.556328,0.765887,0.724879,0.826322
min,-2.218056,-1.259812,-0.713016,-1.877273
25%,-1.849302,-1.202122,-0.451528,-1.403
50%,-0.582208,-0.994722,-0.268173,-0.913776
75%,0.273967,-0.310213,0.032158,-0.379155
max,1.800774,0.637609,1.313427,0.373673
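
To make the rows of the summary easier to check by hand, here is a sketch on a tiny hypothetical frame (`small`) with known values. Note that the `std` row is the sample standard deviation (divisor n - 1).

```python
import pandas as pd

# Hypothetical four-value column so the statistics can be verified by hand.
small = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
summary = small.describe()

print(summary.loc["count", "x"])  # 4.0
print(summary.loc["mean", "x"])   # 2.5
print(summary.loc["50%", "x"])    # 2.5 (the median)
print(summary.loc["std", "x"])    # sqrt(5/3), the sample standard deviation
```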


Transposing the data:

In [11]:
df.T

,2021-12-12,2021-12-13,2021-12-14,2021-12-15,2021-12-16,2021-12-17
A,-2.147572,-0.209925,1.800774,-2.218056,0.435264,-0.954491
B,-1.249424,-1.259812,0.637609,-0.103874,-1.060215,-0.92923
C,-0.421435,-0.114911,0.081182,1.313427,-0.713016,-0.46156
D,-1.043875,-1.522709,0.373673,-0.783677,-0.244315,-1.877273


Sorting by an axis:

In [12]:
df.sort_index(axis=1, ascending=False)

,D,C,B,A
2021-12-12,-1.043875,-0.421435,-1.249424,-2.147572
2021-12-13,-1.522709,-0.114911,-1.259812,-0.209925
2021-12-14,0.373673,0.081182,0.637609,1.800774
2021-12-15,-0.783677,1.313427,-0.103874,-2.218056
2021-12-16,-0.244315,-0.713016,-1.060215,0.435264
2021-12-17,-1.877273,-0.46156,-0.92923,-0.954491


Sorting by values:

In [13]:
df.sort_values(by='B')

,A,B,C,D
2021-12-13,-0.209925,-1.259812,-0.114911,-1.522709
2021-12-12,-2.147572,-1.249424,-0.421435,-1.043875
2021-12-16,0.435264,-1.060215,-0.713016,-0.244315
2021-12-17,-0.954491,-0.92923,-0.46156,-1.877273
2021-12-15,-2.218056,-0.103874,1.313427,-0.783677
2021-12-14,1.800774,0.637609,0.081182,0.373673
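
The two sorting calls can be compared on a small deterministic frame. This is a sketch under the assumption of a made-up frame named `frame`; both methods return a new object and leave the original untouched.

```python
import pandas as pd

# Hypothetical frame with columns deliberately out of alphabetical order.
frame = pd.DataFrame({"B": [3, 1, 2], "A": [9, 8, 7]}, index=["r1", "r2", "r3"])

# axis=1 sorts the column labels, not the values.
by_columns = frame.sort_index(axis=1)

# sort_values reorders the rows by the values in column B, largest first.
by_value = frame.sort_values(by="B", ascending=False)

print(list(by_columns.columns))  # ['A', 'B']
print(list(by_value.index))      # ['r1', 'r3', 'r2']
```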


## Selection
### Getting
Selecting a single column yields a Series; df['A'] is equivalent to df.A:

In [14]:
df['A']

2021-12-12   -2.147572
2021-12-13   -0.209925
2021-12-14    1.800774
2021-12-15   -2.218056
2021-12-16    0.435264
2021-12-17   -0.954491
Freq: D, Name: A, dtype: float64
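
The equivalence of the two spellings has a caveat worth sketching: attribute access only reaches a column when its name is a valid Python identifier that does not clash with an existing DataFrame attribute. The column name `count` below is chosen purely to illustrate such a clash.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["A", "B"])

# Both spellings return the same Series for an ordinary column name.
same = df["A"].equals(df.A)
print(same)  # True

# Illustrative clash: a column called "count" is shadowed by the DataFrame.count method.
df["count"] = [10, 20, 30]
print(type(df["count"]).__name__)  # Series
print(callable(df.count))          # True: df.count is the method, not the column
```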

Selecting via [], which slices the rows:

In [15]:
df[0:3]

df['20211214':'20211216']

,A,B,C,D
2021-12-14,1.800774,0.637609,0.081182,0.373673
2021-12-15,-2.218056,-0.103874,1.313427,-0.783677
2021-12-16,0.435264,-1.060215,-0.713016,-0.244315
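
The two slices behave differently at the end point: a positional slice excludes it, while a label slice includes it. A minimal sketch, assuming a stand-in frame indexed by the 2013 dates used at the start of this tutorial:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Stand-in data: integers 0..23 instead of random values.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_position = df[0:3]                 # rows at positions 0, 1, 2 (end excluded)
by_label = df["20130102":"20130104"]  # 2013-01-02 through 2013-01-04 (end included)

print(len(by_position), len(by_label))  # 3 3
```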


### Selection by label
Select a row of data by label:

In [16]:
df.loc[dates[0]]

A   -2.147572
B   -1.249424
C   -0.421435
D   -1.043875
Name: 2021-12-12 00:00:00, dtype: float64

Select multiple columns by label:

In [17]:
df.loc[:, ['A', 'B']]

,A,B
2021-12-12,-2.147572,-1.249424
2021-12-13,-0.209925,-1.259812
2021-12-14,1.800774,0.637609
2021-12-15,-2.218056,-0.103874
2021-12-16,0.435264,-1.060215
2021-12-17,-0.954491,-0.92923


Slicing by label includes both endpoints:

In [20]:
df.loc['20130102':'20130104', ['A', 'B']]

,A,B
2013-01-02,-1.15262,-0.023535
2013-01-03,-0.064363,-1.760031
2013-01-04,-1.917089,-1.139279


Reducing the dimensions of the returned object:

In [21]:
df.loc['20130102', ['A', 'B']]

A   -1.152620
B   -0.023535
Name: 2013-01-02 00:00:00, dtype: float64
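
Whether the result loses a dimension depends on how the row is specified. The sketch below, on a stand-in frame, contrasts a single row label (which yields a Series) with a one-element list of labels (which keeps a DataFrame). The names `row` and `frame` are made up for this example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Stand-in data: integers 0..23 instead of random values.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

row = df.loc[dates[1], ["A", "B"]]      # single label: result is a Series
frame = df.loc[[dates[1]], ["A", "B"]]  # list of labels: result stays a DataFrame

print(type(row).__name__)    # Series
print(type(frame).__name__)  # DataFrame
print(frame.shape)           # (1, 2)
```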

Getting a scalar value:

In [22]:
df.loc[dates[0], 'A']

-0.9520078239729454

For fast access to a scalar (equivalent to the previous method):

In [23]:
df.at[dates[0], 'A']

-0.9520078239729454
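
A quick check that the two accessors agree, sketched on a stand-in frame with known values (the integer data is an assumption made so the result is predictable):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Stand-in data: integers 0..23 instead of random values.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

value_loc = df.loc[dates[0], "A"]  # general label-based lookup
value_at = df.at[dates[0], "A"]    # scalar-only accessor

print(value_loc == value_at)  # True
```

`at` accepts only a single row label and a single column label, which is what lets it skip the more general handling that `loc` performs.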

### Selection by position