# Pandas Tutorial

### 2017 七月在线 (julyedu.com) Python Data Analysis Bootcamp
by 褚则伟 zeweichu@gmail.com

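pandas is a Python library built specifically for data analysis.

## Introduction to [Pandas](http://pandas.pydata.org/)
- A Python library for data analysis
- Built on NumPy (operations on ndarray)
- Feels a bit like doing Excel/SQL/R/Matlab work in Python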

## Contents
- Series
- DataFrame
- Index

## The Series data structure


In [1]:
import pandas as pd
import numpy as np

### Constructing and initializing a Series

A Series is a one-dimensional data structure. Below are several ways to initialize one.

In [2]:
s = pd.Series([7, 'Beijing', 2.18, -3.1415, "Happy Birthday!"])
s

0                  7
1            Beijing
2               2.18
3            -3.1415
4    Happy Birthday!
dtype: object

By default pandas uses the integers 0 to n-1 as the index of a Series, but we can also specify the index ourselves. The index can be thought of as the keys of a dict.

In [3]:
s = pd.Series([7, 'Beijing', 2.18, -3.1415, "Happy Birthday!"],
             index = ["A", "B", "C", "D", "E"])
s

A                  7
B            Beijing
C               2.18
D            -3.1415
E    Happy Birthday!
dtype: object

A Series can also be constructed from a dictionary, since a Series is essentially a collection of key-value pairs.

In [5]:
cities = {"Beijing": 55000, "Shanghai": 60000, "Shenzhen": 50000, 
          "Hangzhou": 20000, "Guangzhou": 25000, "Suzhou": None}
apts = pd.Series(cities)
apts

Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
dtype: float64
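
The output above lists the cities alphabetically, which is how older pandas versions ordered dict keys. Newer versions keep the dict's insertion order instead, so the positional examples that follow may pick different cities. A minimal sketch, assuming a recent pandas, that restores the alphabetical order with `sort_index` (the name `apts_sorted` is only for illustration):

```python
import pandas as pd

cities = {"Beijing": 55000, "Shanghai": 60000, "Shenzhen": 50000,
          "Hangzhou": 20000, "Guangzhou": 25000, "Suzhou": None}

# sort_index() orders the labels alphabetically, matching the output above
# regardless of how the installed pandas version orders dict keys.
apts_sorted = pd.Series(cities).sort_index()
print(apts_sorted)
```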

A Series can also be built from a NumPy ndarray.

In [7]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
s

a   -2.577716
b   -1.569732
c   -1.511730
d    0.288231
e    1.162825
dtype: float64

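### Selecting data

We can treat a Series much like a list. (In recent pandas versions, `iloc` is the recommended way to select by position, for example `apts.iloc[[4, 3, 5]]`.)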

In [14]:
apts[[4,3,5]]

Shenzhen    50000.0
Shanghai    60000.0
Suzhou          NaN
dtype: float64

In [17]:
apts[1:6:2]

Guangzhou    25000.0
Shanghai     60000.0
Suzhou           NaN
dtype: float64

In [18]:
apts[:-1]

Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
dtype: float64

Why does the following produce two NaNs?

In [21]:
apts[1:] - apts[:-1]

Beijing      NaN
Guangzhou    0.0
Hangzhou     0.0
Shanghai     0.0
Shenzhen     0.0
Suzhou       NaN
dtype: float64
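
The answer is index alignment: arithmetic between two Series matches values by label, not by position. `apts[1:]` has no `Beijing` and `apts[:-1]` has no `Suzhou`, so those two labels have nothing to pair with and come out as NaN. A small sketch with made-up values (the Series `s` is hypothetical) shows the same effect, plus one way to subtract by position instead:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 40.0], index=["x", "y", "z"])

# Label alignment: "x" is missing on the left, "z" on the right.
by_label = s.iloc[1:] - s.iloc[:-1]
print(by_label)      # x and z are NaN, y is 0.0

# Positional difference: strip the labels with to_numpy() first.
by_position = s.iloc[1:].to_numpy() - s.iloc[:-1].to_numpy()
print(by_position)   # [10. 20.]
```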

A Series also works like a dict: the index defined earlier is what we use to select data.

In [23]:
apts["Guangzhou"]

25000.0

In [24]:
apts["Shenzhen"]

50000.0

In [25]:
"Hangzhou" in apts

True

In [26]:
"Chongqing" in apts

False

A safer way to read a value by key is the following.

In [32]:
apts.get("Chongqing", 0)

0

In [33]:
cities.get("Chongqing", 0)

0

With the following form, a missing key raises a KeyError.

In [28]:
# apts["Chongqing"]
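
To make the difference concrete, here is a minimal sketch (the small Series `prices` is made up for illustration) contrasting `get` with plain bracket access:

```python
import pandas as pd

prices = pd.Series({"Beijing": 55000, "Shanghai": 60000})

# get() returns the default instead of raising.
fallback = prices.get("Chongqing", 0)

# Bracket access raises KeyError when the label is absent.
try:
    prices["Chongqing"]
    raised = False
except KeyError:
    raised = True

print(fallback, raised)   # 0 True
```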

Boolean indexing works much as it does in NumPy.

In [34]:
apts < 50000

Beijing      False
Guangzhou     True
Hangzhou      True
Shanghai     False
Shenzhen     False
Suzhou       False
dtype: bool

In [35]:
apts[apts < 50000]

Guangzhou    25000.0
Hangzhou     20000.0
dtype: float64

Let me show in more detail how this boolean indexing works.

In [36]:
less_than_50k = apts < 50000
less_than_50k

Beijing      False
Guangzhou     True
Hangzhou      True
Shanghai     False
Shenzhen     False
Suzhou       False
dtype: bool

In [37]:
apts[less_than_50k]

Guangzhou    25000.0
Hangzhou     20000.0
dtype: float64
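
Note that Suzhou is `False` in the mask above even though it has no price at all: an ordering comparison with NaN always evaluates to `False`. A short sketch, using a hypothetical two-city Series, of how a missing value behaves under both a condition and its opposite:

```python
import numpy as np
import pandas as pd

s = pd.Series({"Hangzhou": 20000.0, "Suzhou": np.nan})

below = s < 50000          # Hangzhou True, Suzhou False
at_or_above = s >= 50000   # Hangzhou False, Suzhou False as well

# The NaN row is selected by neither mask.
print(s[below])
print(s[at_or_above])
```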

In [38]:
apts.median()

50000.0

In [39]:
apts.max()

60000.0

In [40]:
apts.mean()

42000.0

In [41]:
apts[apts > apts.mean()]

Beijing     55000.0
Shanghai    60000.0
Shenzhen    50000.0
dtype: float64

### Assigning Series elements

Elements of a Series can be assigned.

In [43]:
print("Old value", apts["Shenzhen"])
apts["Shenzhen"] = 58000
print("New value", apts["Shenzhen"])

Old value 50000.0
New value 58000.0


The boolean indexing covered earlier can also be used in assignment.

In [44]:
print("old values: ", apts)
apts[apts <= 50000] = 40000
print("new values: ", apts)

old values:  Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     58000.0
Suzhou           NaN
dtype: float64
new values:  Beijing      55000.0
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     58000.0
Suzhou           NaN
dtype: float64


### Arithmetic

Next, some basic arithmetic operations.

In [45]:
apts / 2

Beijing      27500.0
Guangzhou    20000.0
Hangzhou     20000.0
Shanghai     30000.0
Shenzhen     29000.0
Suzhou           NaN
dtype: float64

In [47]:
apts ** 2 

Beijing      3.025000e+09
Guangzhou    1.600000e+09
Hangzhou     1.600000e+09
Shanghai     3.600000e+09
Shenzhen     3.364000e+09
Suzhou                NaN
dtype: float64

NumPy functions can be applied to pandas objects.

In [48]:
np.sqrt(apts)

Beijing      234.520788
Guangzhou    200.000000
Hangzhou     200.000000
Shanghai     244.948974
Shenzhen     240.831892
Suzhou              NaN
dtype: float64

Let's define another Series and add the two together.

In [52]:
cars = pd.Series({"Beijing": 300000, "Shanghai": 400000, "Shenzhen": 250000, 
                 "Tianjin": 200000, "Guangzhou": 200000, "Chongqing": 150000})
cars

Beijing      300000
Chongqing    150000
Guangzhou    200000
Shanghai     400000
Shenzhen     250000
Tianjin      200000
dtype: int64

In [50]:
apts

Beijing      55000.0
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     58000.0
Suzhou           NaN
dtype: float64

In [53]:
apts * 100 + cars

Beijing      5800000.0
Chongqing          NaN
Guangzhou    4200000.0
Hangzhou           NaN
Shanghai     6400000.0
Shenzhen     6050000.0
Suzhou             NaN
Tianjin            NaN
dtype: float64
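
Cities that appear in only one of the two Series come out as NaN because there is nothing to add them to. If a missing entry should count as zero instead, the `add` method accepts a `fill_value`. A sketch with made-up numbers (`apts_small` and `cars_small` are hypothetical):

```python
import pandas as pd

apts_small = pd.Series({"Beijing": 55000.0, "Hangzhou": 40000.0})
cars_small = pd.Series({"Beijing": 300000, "Tianjin": 200000})

# Plain + leaves NaN wherever a label exists on only one side.
plain = apts_small * 100 + cars_small

# add(..., fill_value=0) substitutes 0 for the missing side before adding.
filled = (apts_small * 100).add(cars_small, fill_value=0)
print(filled)
```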

### Missing data

[reference](https://pandas.pydata.org/pandas-docs/stable/missing_data.html)

In [54]:
"Hangzhou" in apts

True

In [55]:
"Hangzhou" in cars

False

In [56]:
apts.notnull()

Beijing       True
Guangzhou     True
Hangzhou      True
Shanghai      True
Shenzhen      True
Suzhou       False
dtype: bool

In [59]:
apts[apts.isnull()] = 0.
apts

Beijing      55000.0
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     58000.0
Suzhou           0.0
dtype: float64

In [60]:
apts = pd.Series(cities)
apts

Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
dtype: float64

In [61]:
apts[apts.isnull()]

Suzhou   NaN
dtype: float64

In [63]:
apts[apts.notnull()]

Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
dtype: float64
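
The mask-based steps above also have dedicated shortcuts. A minimal sketch (with a hypothetical Series `s`) using `fillna` and `dropna`, both of which return a new Series and leave the original untouched:

```python
import numpy as np
import pandas as pd

s = pd.Series({"Beijing": 55000.0, "Suzhou": np.nan})

filled = s.fillna(0)    # same values as assigning 0 through an isnull() mask
dropped = s.dropna()    # same rows as selecting with a notnull() mask

print(filled)
print(dropped)
```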

A Series is at once a list, a dictionary, and a 1-D array.

## The [DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) data structure


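A DataFrame is a table. Where a Series represents a one-dimensional array, a DataFrame is a two-dimensional one, comparable to an Excel spreadsheet. A DataFrame can also be viewed as a collection of Series.

### Creating a DataFrame

A DataFrame can be constructed from a dictionary.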

In [64]:
data = {"city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou", "Chongqing"],
       "year": [2016, 2017, 2016, 2017, 2016, 2016],
       "population": [2100, 2300, 1000, 700, 500, 500]}
data

{'city': ['Beijing',
  'Shanghai',
  'Guangzhou',
  'Shenzhen',
  'Hangzhou',
  'Chongqing'],
 'population': [2100, 2300, 1000, 700, 500, 500],
 'year': [2016, 2017, 2016, 2017, 2016, 2016]}

In [65]:
pd.DataFrame(data)

Unnamed: 0,city,population,year
0,Beijing,2100,2016
1,Shanghai,2300,2017
2,Guangzhou,1000,2016
3,Shenzhen,700,2017
4,Hangzhou,500,2016
5,Chongqing,500,2016


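The names and order of the columns can be specified. A column name that is not in the data, such as `debt` below, is filled with NaN.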

In [67]:
pd.DataFrame(data, columns = ["year", "city", "population", "debt"])

Unnamed: 0,year,city,population,debt
0,2016,Beijing,2100,
1,2017,Shanghai,2300,
2,2016,Guangzhou,1000,
3,2017,Shenzhen,700,
4,2016,Hangzhou,500,
5,2016,Chongqing,500,


In [69]:
frame2 = pd.DataFrame(data, columns = ["year", "city", "population", "debt"],
            index = ["one", "two", "three", "four", "five", "six"])
frame2

Unnamed: 0,year,city,population,debt
one,2016,Beijing,2100,
two,2017,Shanghai,2300,
three,2016,Guangzhou,1000,
four,2017,Shenzhen,700,
five,2016,Hangzhou,500,
six,2016,Chongqing,500,


A DataFrame can also be built from several Series.

In [70]:
apts

Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
dtype: float64

In [71]:
cars

Beijing      300000
Chongqing    150000
Guangzhou    200000
Shanghai     400000
Shenzhen     250000
Tianjin      200000
dtype: int64

In [74]:
df = pd.DataFrame({"apts": apts, "cars": cars})
df

Unnamed: 0,apts,cars
Beijing,55000.0,300000.0
Chongqing,,150000.0
Guangzhou,25000.0,200000.0
Hangzhou,20000.0,
Shanghai,60000.0,400000.0
Shenzhen,50000.0,250000.0
Suzhou,,
Tianjin,,200000.0


In [73]:
df = pd.DataFrame([apts, cars])
df

Unnamed: 0,Beijing,Chongqing,Guangzhou,Hangzhou,Shanghai,Shenzhen,Suzhou,Tianjin
0,55000.0,,25000.0,20000.0,60000.0,50000.0,,
1,300000.0,150000.0,200000.0,,400000.0,250000.0,,200000.0
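
The two constructors differ in orientation: a dict of Series makes each Series a column, while a list of Series makes each Series a row. A small sketch with made-up Series `a` and `b`:

```python
import pandas as pd

a = pd.Series({"Beijing": 1.0, "Shanghai": 2.0}, name="a")
b = pd.Series({"Beijing": 3.0, "Tianjin": 4.0}, name="b")

by_column = pd.DataFrame({"a": a, "b": b})   # cities become rows
by_row = pd.DataFrame([a, b])                # cities become columns

print(by_column.shape)   # (3, 2)
print(by_row.shape)      # (2, 3)
```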


A DataFrame can also be built from a list of dicts.

In [75]:
data = [{"July": 999999, "Han": 50000, "Zewei": 1000}, 
        {"July": 99999, "Han": 8000, "Zewei": 200}]
pd.DataFrame(data)

Unnamed: 0,Han,July,Zewei
0,50000,999999,1000
1,8000,99999,200


In [76]:
pd.DataFrame(data, index=["salary", "bonus"])

Unnamed: 0,Han,July,Zewei
salary,50000,999999,1000
bonus,8000,99999,200


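## Exercise

- Build three Series holding the unit price, the unit of measure, and the quantity of a set of goods. Which goods and which units is up to you.
- Then combine the three Series into a single DataFrame.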
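
One possible solution sketch; the goods, units, and numbers below are invented purely for illustration:

```python
import pandas as pd

goods = ["apple", "milk", "rice"]
price = pd.Series([6.5, 12.0, 4.2], index=goods)
unit = pd.Series(["kg", "litre", "kg"], index=goods)
quantity = pd.Series([3, 2, 10], index=goods)

# A dict of Series becomes a DataFrame with one column per Series.
order = pd.DataFrame({"price": price, "unit": unit, "quantity": quantity})
order["total"] = order["price"] * order["quantity"]
print(order)
```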


In [80]:
import time
start = time.time()
test_df = pd.DataFrame([np.random.rand(100000), np.random.rand(100000), np.random.rand(100000)])
print("costs {} seconds".format(time.time() - start))
test_df.head()


costs 6.052450895309448 seconds


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,99990,99991,99992,99993,99994,99995,99996,99997,99998,99999
0,0.664721,0.135933,0.196986,0.728308,0.76126,0.742622,0.887648,0.276397,0.51233,0.902568,...,0.438236,0.197958,0.21941,0.102096,0.273686,0.376666,0.925884,0.235395,0.92167,0.251485
1,0.417645,0.379543,0.875853,0.473266,0.69471,0.857835,0.747609,0.209197,0.114588,0.165442,...,0.622295,0.433176,0.527761,0.030541,0.924604,0.306224,0.257116,0.523443,0.351672,0.053223
2,0.036179,0.10077,0.445545,0.184311,0.670971,0.455545,0.241892,0.606992,0.364626,0.244174,...,0.401856,0.313008,0.538772,0.372599,0.617379,0.717936,0.028525,0.758961,0.677974,0.327543


In [84]:
start = time.time()
print(test_df.max(axis=1))
print("costs {} seconds".format(time.time() - start))

0    1.000000
1    0.999993
2    1.000000
dtype: float64
costs 0.005897045135498047 seconds



In [86]:
type(df["apts"])

pandas.core.series.Series

In [89]:
df["total_cost"] = df["apts"]*100 + df["cars"]
df

Unnamed: 0,apts,cars,total_cost
Beijing,55000.0,300000.0,5800000.0
Chongqing,,150000.0,
Guangzhou,25000.0,200000.0,2700000.0
Hangzhou,20000.0,,
Shanghai,60000.0,400000.0,6400000.0
Shenzhen,50000.0,250000.0,5250000.0
Suzhou,,,
Tianjin,,200000.0,


In [91]:
frame2["city"]

one        Beijing
two       Shanghai
three    Guangzhou
four      Shenzhen
five      Hangzhou
six      Chongqing
Name: city, dtype: object

In [92]:
frame2.year

one      2016
two      2017
three    2016
four     2017
five     2016
six      2016
Name: year, dtype: int64

The `ix` indexer can fetch a row, but it is deprecated (and was removed in pandas 1.0).

In [102]:
df.ix["Shenzhen"]

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.


apts            50000.0
cars           250000.0
total_cost    5250000.0
Name: Shenzhen, dtype: float64

`loc` fetches a row by label.

In [103]:
df.loc["Shenzhen"]

apts            50000.0
cars           250000.0
total_cost    5250000.0
Name: Shenzhen, dtype: float64

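Plain bracket indexing such as `df["apts"]` selects columns by default, not rows.

`ix` can also fetch a row by integer position, though that use is likewise deprecated.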

In [106]:
df.ix[2]

apts            25000.0
cars           200000.0
total_cost    2700000.0
Name: Guangzhou, dtype: float64

The recommended replacement is `iloc`.

In [105]:
df.iloc[3]

apts          20000.0
cars              NaN
total_cost        NaN
Name: Hangzhou, dtype: float64

In [112]:
df.loc["Beijing":"Shenzhen", "apts"]

Beijing      55000.0
Chongqing        NaN
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Name: apts, dtype: float64

In [114]:
df.iloc[2:5, 1:2]

Unnamed: 0,cars
Guangzhou,200000.0
Hangzhou,
Shanghai,400000.0
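
In short, `loc` selects by label and `iloc` by integer position, and their slices differ: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, like a Python list. A sketch on a small made-up frame (`small` is hypothetical):

```python
import pandas as pd

small = pd.DataFrame({"apts": [1, 2, 3, 4], "cars": [10, 20, 30, 40]},
                     index=["Beijing", "Guangzhou", "Hangzhou", "Shanghai"])

by_label = small.loc["Beijing":"Hangzhou", "apts"]   # 3 rows, end included
by_position = small.iloc[0:2, 0]                     # 2 rows, end excluded

print(by_label.tolist())      # [1, 2, 3]
print(by_position.tolist())   # [1, 2]
```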


In [118]:
df.values

array([[   55000.,   300000.,  5800000.],
       [      nan,   150000.,       nan],
       [   25000.,   200000.,  2700000.],
       [   20000.,       nan,       nan],
       [   60000.,   400000.,  6400000.],
       [   50000.,   250000.,  5250000.],
       [      nan,       nan,       nan],
       [      nan,   200000.,       nan]])

### Assigning DataFrame elements

In [121]:
frame2

Unnamed: 0,year,city,population,debt
one,2016,Beijing,2100,
two,2017,Shanghai,2300,
three,2016,Guangzhou,1000,
four,2017,Shenzhen,700,
five,2016,Hangzhou,500,
six,2016,Chongqing,500,


Small exercise: look into the relationship between copies and views in pandas.

In [122]:
frame2["population"]["two"] = 2200
frame2

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,year,city,population,debt
one,2016,Beijing,2100,
two,2017,Shanghai,2200,
three,2016,Guangzhou,1000,
four,2017,Shenzhen,700,
five,2016,Hangzhou,500,
six,2016,Chongqing,500,


In [123]:
frame2.loc["two", "population"] = 2500
frame2

Unnamed: 0,year,city,population,debt
one,2016,Beijing,2100,
two,2017,Shanghai,2500,
three,2016,Guangzhou,1000,
four,2017,Shenzhen,700,
five,2016,Hangzhou,500,
six,2016,Chongqing,500,
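
As a starting point for the exercise: chained indexing first extracts a column and then assigns into it, and whether that intermediate object is a view or a copy depends on the pandas version and settings, hence the warning. A single `.loc` call avoids the question, and `.copy()` makes independence explicit. A sketch on a hypothetical small frame:

```python
import pandas as pd

frame = pd.DataFrame({"population": [2100, 2300]}, index=["one", "two"])

# One indexing step: modifies `frame` itself.
frame.loc["two", "population"] = 2500

# An explicit copy is independent of the original.
backup = frame.copy()
backup.loc["two", "population"] = 0

print(frame.loc["two", "population"])    # 2500
print(backup.loc["two", "population"])   # 0
```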


An entire column can be assigned at once.

In [127]:
frame2["debt"] = 100
frame2

Unnamed: 0,year,city,population,debt
one,2016,Beijing,2100,100
two,2017,Shanghai,2500,100
three,2016,Guangzhou,1000,100
four,2017,Shenzhen,700,100
five,2016,Hangzhou,500,100
six,2016,Chongqing,500,100


An entire row can be assigned in the same way.

In [128]:
frame2.loc["six"] = 0
frame2

Unnamed: 0,year,city,population,debt
one,2016,Beijing,2100,100
two,2017,Shanghai,2500,100
three,2016,Guangzhou,1000,100
four,2017,Shenzhen,700,100
five,2016,Hangzhou,500,100
six,0,0,0,0


A column can also be filled from an array of matching length.

In [132]:
frame2["debt"] = np.linspace(100, 600, 6)
frame2

Unnamed: 0,year,city,population,debt
one,2016,Beijing,2100,100.0
two,2017,Shanghai,2500,200.0
three,2016,Guangzhou,1000,300.0
four,2017,Shenzhen,700,400.0
five,2016,Hangzhou,500,500.0
six,0,0,0,600.0


In [129]:
frame2.index

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

A Series can also be used to specify which index labels to modify and their corresponding values; labels that are not specified default to NaN.
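
For example, assigning a Series to a column aligns on the index labels, and rows that the Series does not mention become NaN. A minimal sketch with a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"city": ["Beijing", "Shanghai", "Guangzhou"]},
                     index=["one", "two", "three"])

# Only "one" and "three" are given; "two" has no matching label.
frame["debt"] = pd.Series([100.0, 300.0], index=["one", "three"])
print(frame)
```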

To see which columns a DataFrame has, use `columns`.

In [131]:
frame2.columns

Index(['year', 'city', 'population', 'debt'], dtype='object')

Like a NumPy 2-D array, a DataFrame can be transposed.

In [135]:
frame2_transpose = frame2.T

In [136]:
frame2_transpose.index

Index(['year', 'city', 'population', 'debt'], dtype='object')

In [137]:
frame2_transpose.columns

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

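Specifying the index order, and initializing data from slices. In the nested dict below, the outer keys become the columns and the inner keys become the index.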

In [139]:
pop = {
    "Beijing": {2016: 2100, 2017: 2200},
    "Shanghai": {2015: 2400, 2016: 2500, 2017:2600}
}

In [140]:
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Beijing,Shanghai
2015,,2400
2016,2100.0,2500
2017,2200.0,2600


We can also name the index and the columns.

In [144]:
frame3.index.name = "year"
frame3.columns.name = "city"
frame3

city,Beijing,Shanghai
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2015,,2400
2016,2100.0,2500
2017,2200.0,2600


In [145]:
frame3.index

Int64Index([2015, 2016, 2017], dtype='int64', name='year')

In [149]:
np.sum(frame3.values, 0)

array([   nan,  7500.])
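
The `nan` in the first position comes from Beijing's missing 2015 value: summing the raw `values` array propagates NaN. The DataFrame's own `sum` skips missing values by default. A sketch that rebuilds the same population table:

```python
import numpy as np
import pandas as pd

pop = {"Beijing": {2016: 2100, 2017: 2200},
       "Shanghai": {2015: 2400, 2016: 2500, 2017: 2600}}
frame3 = pd.DataFrame(pop)

raw_sum = np.sum(frame3.values, 0)   # NaN propagates for Beijing
col_sum = frame3.sum()               # skipna=True by default

print(col_sum)
```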

In [150]:
pd.DataFrame(np.random.rand(5,5))

Unnamed: 0,0,1,2,3,4
0,0.535125,0.019538,0.592836,0.209418,0.688002
1,0.902057,0.357361,0.709754,0.942224,0.978626
2,0.212925,0.592423,0.79533,0.439496,0.660353
3,0.391505,0.854762,0.671805,0.812453,0.525896
4,0.168395,0.609642,0.29439,0.817607,0.483327


## Index

### index object

In [177]:
obj = pd.Series(range(6), index=["a", "b", "c", "d", "e", "f"])
obj

a    0
b    1
c    2
d    3
e    4
f    5
dtype: int64

In [154]:
obj["a":"c"]

a    0
b    1
c    2
dtype: int64

In [156]:
index = obj.index
index

Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

In [157]:
type(index)

pandas.core.indexes.base.Index

The values of an index cannot be modified.

In [161]:
index[0] = "d"

TypeError: Index does not support mutable operations
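
Individual labels cannot be changed in place, but the whole index can be replaced, or selected labels renamed, which produces a new Index object. A sketch with a hypothetical Series:

```python
import pandas as pd

obj = pd.Series([0, 1, 2], index=["a", "b", "c"])

try:
    obj.index[0] = "d"          # Index objects are immutable
    mutated = True
except TypeError:
    mutated = False

renamed = obj.rename({"a": "d"})   # new Series with a new index
obj.index = ["x", "y", "z"]        # replace the index wholesale

print(mutated, list(renamed.index), list(obj.index))
```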

In [164]:
index = pd.Index(np.arange(3)+5)
index

Int64Index([5, 6, 7], dtype='int64')

In [165]:
obj2 = pd.Series([2,5,7], index=index)
obj2

5    2
6    5
7    7
dtype: int64

In [166]:
obj2.index is index

True

In [169]:
obj2.index is np.arange(3)+5

False

In [173]:
2015 in frame3.index

True

In [174]:
"Beijing" in frame3.columns

True

### Indexing and slicing by index

The default integer positions can still be used.

In [183]:
obj[-4:7]

c    2
d    3
e    4
f    5
dtype: int64

In [184]:
obj[obj < 2]

a    0
b    1
dtype: int64

Next, slicing a Series by label. Note that a label slice includes its end label.

In [186]:
obj["b":"d"] = 5
obj

a    0
b    5
c    5
d    5
e    4
f    5
dtype: int64

Indexing a DataFrame works much the same as indexing a Series.

In [188]:
frame = pd.DataFrame(np.random.rand(3,3)*1000,
                    index = ["a", "c", "d"],
                    columns = ["Hangzhou", "Shenzhen", "Nanjing"])
frame

Unnamed: 0,Hangzhou,Shenzhen,Nanjing
a,618.765001,593.180867,258.286209
c,49.364032,763.332012,431.624973
d,860.6407,983.031531,107.875363


In [190]:
frame[["Nanjing", "Shenzhen"]]

Unnamed: 0,Nanjing,Shenzhen
a,258.286209,593.180867
c,431.624973,763.332012
d,107.875363,983.031531


In [192]:
frame[1:3]

Unnamed: 0,Hangzhou,Shenzhen,Nanjing
c,49.364032,763.332012,431.624973
d,860.6407,983.031531,107.875363


In [200]:
frame.loc[:"z", "Hangzhou":"Shenzhen"]

Unnamed: 0,Hangzhou,Shenzhen
a,618.765001,593.180867
c,49.364032,763.332012
d,860.6407,983.031531


A DataFrame also supports conditional selection.

In [206]:
frame[frame.Hangzhou > 500]

Unnamed: 0,Hangzhou,Shenzhen,Nanjing
a,618.765001,593.180867,258.286209
d,860.6407,983.031531,107.875363


In [208]:
frame[frame > 500] *= 100
frame

Unnamed: 0,Hangzhou,Shenzhen,Nanjing
a,61876.500143,59318.086722,258.286209
c,49.364032,76333.20124,431.624973
d,86064.069981,98303.15306,107.875363


### [reindex](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html)

`reindex` rearranges a Series or DataFrame according to a new index order. Labels that were not in the original index get NaN.

In [209]:
obj = pd.Series([4.5, 6.7, -5.3, 2.1], index=["d", "b", "c", "a"])
obj

d    4.5
b    6.7
c   -5.3
a    2.1
dtype: float64

In [210]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
obj2

a    2.1
b    6.7
c   -5.3
d    4.5
e    NaN
dtype: float64

In [212]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0)
obj2

a    2.1
b    6.7
c   -5.3
d    4.5
e    0.0
dtype: float64

In [213]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [215]:
obj3.reindex(range(6), method="ffill")

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [216]:
obj3.reindex(range(6), method="bfill")

0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object

Since we can reindex a Series, we can reindex a DataFrame in the same way.

In [217]:
frame = pd.DataFrame(np.random.rand(3,3)*1000, 
                    index = ["a", "c", "d"],
                    columns = ["Hangzhou", "Tianjin", "Shanghai"])
frame

Unnamed: 0,Hangzhou,Tianjin,Shanghai
a,505.708089,765.346595,271.461783
c,502.068271,940.419039,566.955583
d,997.265391,26.699206,428.983069


In [218]:
frame2 = frame.reindex(["a", "b", "c", "d"])
frame2

Unnamed: 0,Hangzhou,Tianjin,Shanghai
a,505.708089,765.346595,271.461783
b,,,
c,502.068271,940.419039,566.955583
d,997.265391,26.699206,428.983069


While reindexing, we can also specify new columns.

In [220]:
frame2 = frame.reindex(index = ["a", "b", "c", "d"], 
                       columns = ["Chongqing", "Hangzhou", "Shenzhen"])
frame2

Unnamed: 0,Chongqing,Hangzhou,Shenzhen
a,,505.708089,
b,,,
c,,502.068271,
d,,997.265391,


In [222]:
frame2.loc["b", "Chongqing"]

nan

In [223]:
frame2.T

Unnamed: 0,a,b,c,d
Chongqing,,,,
Hangzhou,505.708089,,502.068271,997.265391
Shenzhen,,,,


Next, using `drop` to remove index entries from a Series or DataFrame.

In [225]:
obj4 = obj3.drop(2)
obj4

0      blue
4    yellow
dtype: object

In [226]:
obj3.drop([2,4])

0    blue
dtype: object

In [228]:
frame.drop(["a", "c"])

Unnamed: 0,Hangzhou,Tianjin,Shanghai
d,997.265391,26.699206,428.983069


In [230]:
frame.drop("Shanghai", axis=1)

Unnamed: 0,Hangzhou,Tianjin
a,505.708089,765.346595
c,502.068271,940.419039
d,997.265391,26.699206


In [231]:
frame.drop(["Shanghai", "Hangzhou"], axis=1)

Unnamed: 0,Tianjin
a,765.346595
c,940.419039
d,26.699206


In [232]:
frame

Unnamed: 0,Hangzhou,Tianjin,Shanghai
a,505.708089,765.346595,271.461783
c,502.068271,940.419039,566.955583
d,997.265391,26.699206,428.983069


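`drop` can remove columns as well as rows (pass `axis=1`). As the unchanged `frame` above shows, it returns a new object rather than modifying the original.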

### Hierarchical index

Hierarchical indexing on a Series

In [233]:
data = pd.Series(np.random.randn(10), index=[
    ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'd'],
    [1, 2, 3, 1, 2, 1, 2, 3, 1, 2]])
data

a  1    0.855212
   2    1.029673
   3   -0.366195
b  1   -0.980646
   2   -1.176016
c  1   -0.489860
   2   -0.591196
   3   -0.407260
d  1    0.063668
   2    0.876177
dtype: float64

In [234]:
data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 2, 3, 3], [0, 1, 2, 0, 1, 0, 1, 2, 0, 1]])

In [235]:
data.b

1   -0.980646
2   -1.176016
dtype: float64

In [237]:
data["b":"c"]

b  1   -0.980646
   2   -1.176016
c  1   -0.489860
   2   -0.591196
   3   -0.407260
dtype: float64

In [239]:
data[1:3]

a  2    1.029673
   3   -0.366195
dtype: float64

`unstack` and `stack` let us switch between a hierarchically indexed Series and a DataFrame.

In [241]:
data

a  1    0.855212
   2    1.029673
   3   -0.366195
b  1   -0.980646
   2   -1.176016
c  1   -0.489860
   2   -0.591196
   3   -0.407260
d  1    0.063668
   2    0.876177
dtype: float64

In [240]:
data.unstack()

Unnamed: 0,1,2,3
a,0.855212,1.029673,-0.366195
b,-0.980646,-1.176016,
c,-0.48986,-0.591196,-0.40726
d,0.063668,0.876177,


In [242]:
type(data.unstack())

pandas.core.frame.DataFrame

In [243]:
type(data)

pandas.core.series.Series

In [245]:
data.unstack().stack()

a  1    0.855212
   2    1.029673
   3   -0.366195
b  1   -0.980646
   2   -1.176016
c  1   -0.489860
   2   -0.591196
   3   -0.407260
d  1    0.063668
   2    0.876177
dtype: float64
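
Notice that `unstack` had to insert NaN for the missing combinations `(b, 3)` and `(d, 3)`. A deterministic sketch of the round trip on a made-up Series; `dropna()` is called explicitly here on the assumption that whether `stack` discards those NaN entries by itself may differ between pandas versions:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
              index=[["a", "a", "a", "b", "b"], [1, 2, 3, 1, 2]])

wide = s.unstack()               # 2 x 3 DataFrame, (b, 3) is NaN
long = wide.stack().dropna()     # back to a Series with a MultiIndex

print(wide.shape)    # (2, 3)
print(len(long))     # 5
```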

Hierarchical indexing on a DataFrame

In [255]:
frame = pd.DataFrame(np.arange(12).reshape(4,3),
                    index = [['a', 'a', 'b','b'], [1,2,1,2]],
                    columns = [["Beijing", "Beijing", "Shanghai"], ["apts", "cars", "apts"]])
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Beijing,Beijing,Shanghai
Unnamed: 0_level_1,Unnamed: 1_level_1,apts,cars,apts
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [256]:
frame.columns.names = ["city", "type"]

In [257]:
frame.index.names = ["key1", "key2"]

In [251]:
frame

Unnamed: 0_level_0,city,Beijing,Beijing,Shanghai
Unnamed: 0_level_1,type,apts,cars,apts
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [252]:
frame.loc["a", 1]

city      type
Beijing   apts    0
          cars    1
Shanghai  apts    2
Name: (a, 1), dtype: int64

In [259]:
frame.loc["a", 2]["Beijing"]["apts"]

3
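
Chained lookups like the one above work, but a single `loc` call with one tuple per axis does the same in one step. A sketch that rebuilds the frame from this section:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Beijing", "Beijing", "Shanghai"],
                              ["apts", "cars", "apts"]])

# Row key (level 0, level 1) and column key (level 0, level 1) as tuples.
value = frame.loc[("a", 2), ("Beijing", "apts")]
beijing = frame["Beijing"]     # selecting on the outer column level

print(value)                   # 3
print(list(beijing.columns))   # ['apts', 'cars']
```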

In [263]:
frame.stack().stack()

key1  key2  type  city    
a     1     apts  Beijing      0.0
                  Shanghai     2.0
            cars  Beijing      1.0
      2     apts  Beijing      3.0
                  Shanghai     5.0
            cars  Beijing      4.0
b     1     apts  Beijing      6.0
                  Shanghai     8.0
            cars  Beijing      7.0
      2     apts  Beijing      9.0
                  Shanghai    11.0
            cars  Beijing     10.0
dtype: float64