## Getting Started with pandas

---

In [1]:
import numpy as np
import pandas as pd

### Data structures

#### Series -- one-dimensional data

In [None]:
obj = pd.Series([1,3,5,7,9])
obj

The leftmost column is the index. It defaults to range(n) and can be set explicitly

In [None]:
obj = pd.Series([1,3,5,7,9], index=['a','b','c','d','e'])
obj

Use .values to get the values and .index to get the index

In [None]:
obj.values
obj.index

Elements can be accessed by label

In [None]:
obj = pd.Series([1,3,5,7,9], index=['a','b','c','d','e'])
obj['a']
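obj.iloc[0]  # access by position; plain obj[0] is deprecated in recent pandas when the index is not integer-based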
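
The difference between label-based and position-based access is easiest to see when the labels are themselves integers. The following is a minimal sketch; the Series `codes` and its labels are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical Series whose labels are integers that do not start at 0
codes = pd.Series(['x', 'y', 'z'], index=[10, 20, 30])

by_label = codes.loc[10]      # looks up the label 10
by_position = codes.iloc[0]   # looks up the first element, whatever its label
last_item = codes.iloc[-1]    # negative positions count from the end

print(by_label, by_position, last_item)
```

Using .loc and .iloc explicitly avoids any ambiguity about whether an integer means a label or a position.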

**A Series can be created from a dictionary**; array-style operations work much as they do in NumPy

In [None]:
dic = {'a':1, 'b':2, 'c':3}
obj = pd.Series(dic)
obj

In [None]:
# Passing an index changes the order of the elements
index = ['b','c','a']
obj = pd.Series(dic, index=index)
obj

Use isnull and notnull to detect missing data

In [None]:
dic = {'a':1, 'b':2, 'c':3}
obj = pd.Series(dic, index=['d','a','b','c','e'])
obj 

In [None]:
pd.isnull(obj)
pd.notnull(obj)
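
As a small sketch of how the missing-data checks are typically used, the boolean result can serve as a mask. The variable names `missing_labels` and `present` are illustrative only.

```python
import pandas as pd

dic = {'a': 1, 'b': 2, 'c': 3}
# 'd' and 'e' are not keys of dic, so their values become NaN
obj = pd.Series(dic, index=['d', 'a', 'b', 'c', 'e'])

missing_mask = obj.isnull()                      # method form of pd.isnull(obj)
missing_labels = list(obj.index[missing_mask])   # labels whose value is missing
present = obj[obj.notnull()]                     # keep only the non-missing entries

print(missing_labels)
print(present)
```

Note that introducing NaN turns the integer values into floats, since NaN is a floating-point value.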

Setting the name attribute of a Series

In [None]:
dic = {'a':1, 'b':2, 'c':3}
obj = pd.Series(dic, index=['d','a','b','c','e'])
obj.name = "Series_obj 1"  # name of the Series itself
obj.index.name = "Letter"  # name of the index
obj

#### DataFrame -- two-dimensional table

Usually built from a dictionary of **equal-length lists or NumPy arrays**

In [None]:
dic = {
    'Name':['Angela','Bob','Cindy','David'],
    'Sex':['Female','Male','Female','Male'],
    'Age':[23,25,64,10]
      }
df = pd.DataFrame(dic)
df

Use .head() to get the first 5 rows, which is handy for large datasets
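
A minimal sketch of .head() on a larger table; the DataFrame `big` is invented for illustration.

```python
import pandas as pd

# Hypothetical 100-row table
big = pd.DataFrame({'x': range(100), 'y': range(100, 200)})

first_five = big.head()     # default: first 5 rows
first_three = big.head(3)   # an explicit row count can be passed

print(first_five)
print(first_three)
```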

Pass the columns argument to specify the column order

In [None]:
df = pd.DataFrame(dic, columns=['Sex','Name','Age'], index=list(range(1,len(dic['Name'])+1)) )
df

A column can be retrieved as a Series with dictionary-style access

In [None]:
df['Name'] # works for any column name
df.Name    # only works when the column name is a valid Python identifier

Transposing works as in NumPy

In [None]:
df.T

index and values are retrieved the same way as for a Series

In [None]:
df.index
df.values

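An Index object in pandas behaves like an array and also like a **fixed-size** set, so it offers a number of related methods and attributes

- append: concatenates with other Index objects, producing a new Index
- difference: set difference of two indexes
- intersection: intersection of two indexes
- union: union of two indexes
- isin: boolean array marking whether each value is contained in the passed collection
- delete: new Index with the element at position i removed
- drop: new Index with the given values removed
- is_monotonic_increasing (is_monotonic in older pandas versions): whether the index values are increasing
- is_unique: whether the index has no duplicate values
- unique: the unique values of the index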

***Note: all of the methods above belong to Index objects.***
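
A short sketch of the set-like Index methods listed above. The two indexes `left` and `right` are made up for illustration.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

union = left.union(right)             # values in either index
common = left.intersection(right)     # values in both indexes
only_left = left.difference(right)    # values in left but not in right
membership = left.isin(['a', 'c'])    # boolean array, one entry per value
without_first = left.delete(0)        # remove by position
without_b = left.drop('b')            # remove by value

# An Index is immutable: item assignment raises TypeError
try:
    left[0] = 'z'
    assignment_failed = False
except TypeError:
    assignment_failed = True

print(union, common, only_left, membership, without_first, without_b)
```

Every call returns a new object; `left` and `right` themselves are never modified.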

### Basic functionality

#### Reindexing

Use the .reindex() method

In [3]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),
                    index = ['a','b','c'],
                    columns = ['Chengdu', 'Changsha', 'Beijing'])
frame

| | Chengdu | Changsha | Beijing |
|---|---|---|---|
| a | 0 | 1 | 2 |
| b | 3 | 4 | 5 |
| c | 6 | 7 | 8 |


##### Reordering the row index

In [5]:
frame2 = frame.reindex(['c','a','b'])
frame2

| | Chengdu | Changsha | Beijing |
|---|---|---|---|
| c | 6 | 7 | 8 |
| a | 0 | 1 | 2 |
| b | 3 | 4 | 5 |


##### Reordering the columns

In [7]:
frame3 = frame.reindex(columns=['Beijing','Chengdu','Changsha'])
frame3

| | Beijing | Chengdu | Changsha |
|---|---|---|---|
| a | 2 | 0 | 1 |
| b | 5 | 3 | 4 |
| c | 8 | 6 | 7 |


##### The same reordering can be done with .loc[]
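
The notebook gives no cell for this, so here is a minimal sketch under the assumption that the same `frame` as above is used. It also shows one way the two approaches differ in current pandas: reindex fills labels that do not exist with NaN, while .loc raises a KeyError for them.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'b', 'c'],
                     columns=['Chengdu', 'Changsha', 'Beijing'])

# Reorder rows and columns in one step with label lists
reordered = frame.loc[['c', 'a', 'b'], ['Beijing', 'Chengdu', 'Changsha']]

# reindex accepts a label that is absent and fills its row with NaN
with_missing = frame.reindex(['a', 'b', 'c', 'd'])

# .loc refuses labels that are absent
try:
    frame.loc[['a', 'd']]
    loc_raised = False
except KeyError:
    loc_raised = True

print(reordered)
print(with_missing)
```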

#### Dropping entries

In [8]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),
                    columns = ['a','b','c'],
                    index = ['Chengdu', 'Changsha', 'Beijing'])
frame

| | a | b | c |
|---|---|---|---|
| Chengdu | 0 | 1 | 2 |
| Changsha | 3 | 4 | 5 |
| Beijing | 6 | 7 | 8 |


##### Dropping rows

In [11]:
frame.drop(['Chengdu','Beijing'])

| | a | b | c |
|---|---|---|---|
| Changsha | 3 | 4 | 5 |


##### Dropping columns -- set the axis argument

In [13]:
frame.drop('b', axis=1)
frame.drop('b', axis="columns")

| | a | c |
|---|---|---|
| Chengdu | 0 | 2 |
| Changsha | 3 | 5 |
| Beijing | 6 | 8 |

| | a | c |
|---|---|---|
| Chengdu | 0 | 2 |
| Changsha | 3 | 5 |
| Beijing | 6 | 8 |


##### Modifying the original object -- set the inplace argument

In [14]:
frame.drop('Beijing', inplace=True)
frame

| | a | b | c |
|---|---|---|---|
| Chengdu | 0 | 1 | 2 |
| Changsha | 3 | 4 | 5 |
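
For contrast, a minimal sketch of what happens without inplace=True: drop returns a new object and leaves the original untouched. Reassigning the result is a common alternative to modifying in place. The name `dropped` is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     columns=['a', 'b', 'c'],
                     index=['Chengdu', 'Changsha', 'Beijing'])

dropped = frame.drop('Beijing')          # new DataFrame without the row
still_there = 'Beijing' in frame.index   # the original keeps the row

frame = frame.drop('Beijing')            # reassign to keep the result
print(frame)
```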


#### Indexing, selection, and filtering

##### Basic operations

###### Series

In [19]:
s = pd.Series(np.arange(2,7), index=['a','b','c','d','e'])
s

a    2
b    3
c    4
d    5
e    6
dtype: int32

In [36]:
s.iloc[0]
s.iloc[[0,2,4]]
s[:3]
s['a':'d'] ## Note: slicing a Series by label includes the endpoint!
s[s>3]

2

a    2
c    4
e    6
dtype: int32

a    2
b    3
c    4
dtype: int32

a    2
b    3
c    4
d    5
dtype: int32

c    4
d    5
e    6
dtype: int32
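
The endpoint behaviour noted above is worth seeing side by side. This sketch reuses the same `s` and makes the two slicing modes explicit with .loc and .iloc.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(2, 7), index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['a':'d']    # label slice: 'd' is included
by_position = s.iloc[0:3]    # position slice: position 3 is excluded

print(by_label)
print(by_position)
```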

###### DataFrame

In [25]:
df = pd.DataFrame(np.arange(1,17).reshape((4,4)),
                 index=['Beijing','Shanghai','Guangdong','Shenzhen'],
                 columns = ['a','b','c','d'])
df

| | a | b | c | d |
|---|---|---|---|---|
| Beijing | 1 | 2 | 3 | 4 |
| Shanghai | 5 | 6 | 7 | 8 |
| Guangdong | 9 | 10 | 11 | 12 |
| Shenzhen | 13 | 14 | 15 | 16 |


In [53]:
## Note: a DataFrame row cannot be selected with a single integer such as df[1]
df[2:]  # select rows with a slice
df[['b','d']]  # select columns
df[df['d']>10]

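| | a | b | c | d |
|---|---|---|---|---|
| Guangdong | 9 | 10 | 11 | 12 |
| Shenzhen | 13 | 14 | 15 | 16 |

| | b | d |
|---|---|---|
| Beijing | 2 | 4 |
| Shanghai | 6 | 8 |
| Guangdong | 10 | 12 |
| Shenzhen | 14 | 16 |

| | a | b | c | d |
|---|---|---|---|---|
| Guangdong | 9 | 10 | 11 | 12 |
| Shenzhen | 13 | 14 | 15 | 16 |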


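##### Selecting with loc and iloc

- **loc** selects by ***axis label***, that is, by the row and column labels (strings in these examples)
- **iloc** selects by ***integer position***, counting from 0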

In [50]:
df.loc['Guangdong']
df.loc['Guangdong','b']
df.loc[['Guangdong','Beijing'],['a','d']]
df.loc[:, 'a':'c'][df.c>7]

a     9
b    10
c    11
d    12
Name: Guangdong, dtype: int32

10

| | a | d |
|---|---|---|
| Guangdong | 9 | 12 |
| Beijing | 1 | 4 |

| | a | b | c |
|---|---|---|---|
| Guangdong | 9 | 10 | 11 |
| Shenzhen | 13 | 14 | 15 |


In [51]:
df.iloc[2]
df.iloc[2, 1]
df.iloc[[2,0], [0,3]]
df.iloc[:, :3][df.c>7]

a     9
b    10
c    11
d    12
Name: Guangdong, dtype: int32

10

| | a | d |
|---|---|---|
| Guangdong | 9 | 12 |
| Beijing | 1 | 4 |

| | a | b | c |
|---|---|---|---|
| Guangdong | 9 | 10 | 11 |
| Shenzhen | 13 | 14 | 15 |
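
The last line of each cell above filters in two chained steps. As a sketch of an alternative, the row condition and the column selection can be given to a single indexer. With iloc the boolean mask is converted to a NumPy array first, because iloc works on positions rather than labels. The name `mask` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 17).reshape((4, 4)),
                  index=['Beijing', 'Shanghai', 'Guangdong', 'Shenzhen'],
                  columns=['a', 'b', 'c', 'd'])

mask = df['c'] > 7                          # boolean Series aligned on the row labels

via_loc = df.loc[mask, 'a':'c']             # rows by mask, columns by label slice
via_iloc = df.iloc[mask.to_numpy(), :3]     # rows by boolean array, columns by position

print(via_loc)
print(via_iloc)
```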


#### Sorting

##### Sorting by index

In [56]:
df = pd.DataFrame(np.arange(1,17).reshape((4,4)),
                 index=['d','c','b','a'],
                 columns=['Beijing','Shanghai','Guangdong','Shenzhen'])
df

| | Beijing | Shanghai | Guangdong | Shenzhen |
|---|---|---|---|---|
| d | 1 | 2 | 3 | 4 |
| c | 5 | 6 | 7 | 8 |
| b | 9 | 10 | 11 | 12 |
| a | 13 | 14 | 15 | 16 |


Sorting the rows by their index labels

In [57]:
df.sort_index()

| | Beijing | Shanghai | Guangdong | Shenzhen |
|---|---|---|---|---|
| a | 13 | 14 | 15 | 16 |
| b | 9 | 10 | 11 | 12 |
| c | 5 | 6 | 7 | 8 |
| d | 1 | 2 | 3 | 4 |


Sorting the columns by their labels

In [61]:
# Set ascending=False to sort in descending order
df.sort_index(axis=1, ascending=False)

| | Shenzhen | Shanghai | Guangdong | Beijing |
|---|---|---|---|---|
| d | 4 | 2 | 3 | 1 |
| c | 8 | 6 | 7 | 5 |
| b | 12 | 10 | 11 | 9 |
| a | 16 | 14 | 15 | 13 |


Sorting by value

In [63]:
# For a DataFrame, the by argument specifies which column(s) to sort by
df.sort_values(by="Shanghai")
df.sort_values(by=['Beijing','Shanghai']) # sort by multiple columns; priority follows the list order

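| | Beijing | Shanghai | Guangdong | Shenzhen |
|---|---|---|---|---|
| d | 1 | 2 | 3 | 4 |
| c | 5 | 6 | 7 | 8 |
| b | 9 | 10 | 11 | 12 |
| a | 13 | 14 | 15 | 16 |

| | Beijing | Shanghai | Guangdong | Shenzhen |
|---|---|---|---|---|
| d | 1 | 2 | 3 | 4 |
| c | 5 | 6 | 7 | 8 |
| b | 9 | 10 | 11 | 12 |
| a | 13 | 14 | 15 | 16 |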
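
Because the data above is already sorted and has no ties, both calls return the same table. The following sketch uses a small invented table, `scores`, in which the first sort column contains ties, so the effect of the second column and of a per-column ascending list becomes visible.

```python
import pandas as pd

# Hypothetical data with repeated values in 'group'
scores = pd.DataFrame({'group': ['b', 'a', 'b', 'a'],
                       'score': [2, 3, 1, 1]},
                      index=['w', 'x', 'y', 'z'])

# Sort by group first; ties within a group are broken by score
ordered = scores.sort_values(by=['group', 'score'])

# ascending may be a list with one flag per sort column
mixed = scores.sort_values(by=['group', 'score'], ascending=[True, False])

print(ordered)
print(mixed)
```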


### Statistics and computation

#### Basic operations

In [64]:
df = pd.DataFrame(np.arange(1,17).reshape((4,4)),
                 index=['d','c','b','a'],
                 columns=['Beijing','Shanghai','Guangdong','Shenzhen'])
df

| | Beijing | Shanghai | Guangdong | Shenzhen |
|---|---|---|---|---|
| d | 1 | 2 | 3 | 4 |
| c | 5 | 6 | 7 | 8 |
| b | 9 | 10 | 11 | 12 |
| a | 13 | 14 | 15 | 16 |


###### Sum

1. Sum of each column

In [65]:
df.sum()

Beijing      28
Shanghai     32
Guangdong    36
Shenzhen     40
dtype: int64

2. Sum of each row

In [66]:
df.sum(axis=1)

d    10
c    26
b    42
a    58
dtype: int64

###### Mean

NaN elements, if any, are skipped automatically when computing the mean. Rows versus columns are selected the same way as for the sum

In [67]:
df.mean()
df.mean(axis=1)

Beijing       7.0
Shanghai      8.0
Guangdong     9.0
Shenzhen     10.0
dtype: float64

d     2.5
c     6.5
b    10.5
a    14.5
dtype: float64
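
The table above contains no NaN, so the skipping behaviour is not visible there. A minimal sketch with an invented table, `gaps`, that has one missing value; the skipna argument switches the behaviour off.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in 'Beijing'
gaps = pd.DataFrame({'Beijing': [1.0, np.nan, 9.0],
                     'Shanghai': [2.0, 6.0, 10.0]},
                    index=['d', 'c', 'b'])

col_means = gaps.mean()               # NaN is skipped: Beijing averages 1 and 9
strict = gaps.mean(skipna=False)      # NaN propagates into the result
row_sums = gaps.sum(axis=1)           # sum also skips NaN by default

print(col_means)
print(strict)
print(row_sums)
```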

###### Maximum and minimum