# Pandas Tutorial

### 2017 七月在线 Python Data Analysis Bootcamp, julyedu.com
by 褚则伟 zeweichu@gmail.com

pandas is a Python library designed specifically for data analysis.

## Introduction to [Pandas](http://pandas.pydata.org/)
- A Python library for data analysis
- Built on NumPy (operations on ndarray)
- Feels like doing Excel/SQL/R work in Python

## Contents
- Series
- DataFrame
- Index

## The Series data structure


### Constructing and initializing a Series

In [1]:
import pandas as pd
import numpy as np

A Series is a one-dimensional data structure. Below are several ways to initialize one.

In [2]:
s = pd.Series([7, 'Beijing', 2.17, -12344, 'Happy Birthday!'])
s

0                  7
1            Beijing
2               2.17
3             -12344
4    Happy Birthday!
dtype: object

By default pandas uses the integers 0 to n-1 as the index of a Series, but we can also specify the index ourselves. The index can be thought of as the keys of a dict.

In [3]:
s = pd.Series([7, 'Beijing', 2.17, -12344, 'Happy Birthday!'],
             index=['A', 'B', 'C', 'D', 'E'])
s

A                  7
B            Beijing
C               2.17
D             -12344
E    Happy Birthday!
dtype: object

A Series can also be constructed from a dictionary, since a Series is essentially a set of key-value pairs.

In [4]:
cities = {'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000, 'Hangzhou': 20000, 'Guangzhou': 25000, 'Suzhou': None}
# apts = pd.Series(cities)
apts = pd.Series(cities, name="price")
apts

Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: price, dtype: float64
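As a minimal sketch of the dictionary constructor (the prices below are made up for illustration), a `None` value is stored as `NaN`, which forces a float dtype, and passing `index=` selects keys and fixes their order. The alphabetical ordering in the output above reflects an older pandas; newer versions are expected to keep the dictionary's insertion order.

```python
import numpy as np
import pandas as pd

# Hypothetical prices for illustration; None marks a missing value.
prices = {"Beijing": 55000, "Shanghai": 60000, "Suzhou": None}

s = pd.Series(prices, name="price")
print(s.dtype)         # a float dtype, so that NaN can be stored
print(s.isna().sum())  # number of missing entries

# index= picks keys and fixes their order; keys absent from the dict become NaN.
ordered = pd.Series(prices, index=["Suzhou", "Beijing", "Chongqing"])
print(ordered)
```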

A Series can also be built from a NumPy ndarray.

In [5]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -0.804751
b    1.229966
c   -0.628068
d   -0.408734
e    0.209191
dtype: float64

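### Selecting data

A Series can be treated much like a list. The examples below select by integer position; recent pandas versions deprecate bare integer keys on a labeled Series, so `apts.iloc[[4, 3, 1]]` is the more explicit spelling.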

In [6]:
apts[[4,3,1]]

Shenzhen     50000.0
Shanghai     60000.0
Guangzhou    25000.0
Name: price, dtype: float64

In [7]:
apts[1:]

Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: price, dtype: float64

In [8]:
apts[:-1]

Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Name: price, dtype: float64

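Why does the following produce two NaN values? Arithmetic between two Series aligns on index labels: `Beijing` is missing from `apts[1:]`, and `Suzhou` is missing from `apts[:-1]` (and is NaN to begin with), so neither label has two valid operands.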

In [9]:
apts[1:] + apts[:-1]

Beijing           NaN
Guangzhou     50000.0
Hangzhou      40000.0
Shanghai     120000.0
Shenzhen     100000.0
Suzhou            NaN
Name: price, dtype: float64
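The NaN values come from label alignment, which can be sketched on a smaller Series (hypothetical data, with `.iloc` used for the positional slices). `Series.add` with `fill_value` is one way to treat a label that is absent on one side as zero; a label that is missing on both sides still yields NaN.

```python
import numpy as np
import pandas as pd

prices = pd.Series({"Beijing": 55000.0, "Guangzhou": 25000.0, "Suzhou": np.nan})

tail = prices.iloc[1:]    # Guangzhou, Suzhou
head = prices.iloc[:-1]   # Beijing, Guangzhou

# The + operator aligns on labels; a label absent from either side gives NaN.
total = tail + head
print(total)

# fill_value=0 substitutes 0 for a label that is missing on one side only.
filled = tail.add(head, fill_value=0)
print(filled)
```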

A Series also behaves like a dict: the index defined earlier is used to select data.

In [10]:
apts["Hangzhou"]

20000.0

In [11]:
apts[["Hangzhou", "Beijing", "Shenzhen"]]

Hangzhou    20000.0
Beijing     55000.0
Shenzhen    50000.0
Name: price, dtype: float64

In [12]:
"Hangzhou" in apts

True

In [13]:
"Chongqing" in apts

False

A safer way to read a value by key is `get`, which returns `None` instead of raising when the key is absent:

In [14]:
apts.get("Chongqing")

The form below raises a `KeyError` if the key does not exist:

In [15]:
# apts["Chongqing"]

In [16]:
apts.get("Hangzhou")

20000.0
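A short sketch contrasting the two access styles, using made-up data. `Series.get` also accepts a default value to return for a missing key; the `-1.0` sentinel here is only an illustrative choice.

```python
import pandas as pd

prices = pd.Series({"Hangzhou": 20000.0, "Beijing": 55000.0})

missing = prices.get("Chongqing")          # no such key: returns None
fallback = prices.get("Chongqing", -1.0)   # explicit default instead of None

# Square brackets raise for an unknown key, so guard them if the key may be absent.
try:
    prices["Chongqing"]
except KeyError as err:
    print("KeyError:", err)
```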

Boolean indexing works much as in NumPy.

In [17]:
apts[apts < 50000]

Guangzhou    25000.0
Hangzhou     20000.0
Name: price, dtype: float64

In [18]:
apts.median()

50000.0

In [19]:
apts[apts > apts.median()]

Beijing     55000.0
Shanghai    60000.0
Name: price, dtype: float64

Here is a more detailed look at how boolean indexing works.

In [20]:
less_than_50000 = apts < 50000
print(less_than_50000)

Beijing      False
Guangzhou     True
Hangzhou      True
Shanghai     False
Shenzhen     False
Suzhou       False
Name: price, dtype: bool


In [21]:
print(apts[less_than_50000])

Guangzhou    25000.0
Hangzhou     20000.0
Name: price, dtype: float64
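Boolean masks can be combined before indexing. The sketch below (hypothetical data) uses `&`, `|` and `~`; each comparison needs its own parentheses because these operators bind more tightly than `<` and `>`.

```python
import pandas as pd

prices = pd.Series({"Beijing": 55000.0, "Guangzhou": 25000.0,
                    "Hangzhou": 20000.0, "Shanghai": 60000.0})

# Element-wise AND: strictly between 20000 and 60000.
mid = prices[(prices > 20000) & (prices < 60000)]

# Element-wise OR: at either end of the range.
extremes = prices[(prices <= 20000) | (prices >= 60000)]

# Element-wise NOT: everything that is not below 50000.
not_cheap = prices[~(prices < 50000)]

print(mid, extremes, not_cheap, sep="\n\n")
```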


### Assigning to Series elements

Elements of a Series can be assigned to.

In [22]:
print("Old value: ", apts['Shenzhen'])
apts['Shenzhen'] = 55000
print("New value: ", apts['Shenzhen'])

Old value:  50000.0
New value:  55000.0


Boolean indexing can also be used on the left-hand side of an assignment.

In [23]:
print(apts[apts < 50000])
print()
apts[apts <= 50000] = 40000
print(apts[apts < 50000])

Guangzhou    25000.0
Hangzhou     20000.0
Name: price, dtype: float64

Guangzhou    40000.0
Hangzhou     40000.0
Name: price, dtype: float64


### Arithmetic

Next, some basic arithmetic operations.

In [24]:
apts / 2

Beijing      27500.0
Guangzhou    20000.0
Hangzhou     20000.0
Shanghai     30000.0
Shenzhen     27500.0
Suzhou           NaN
Name: price, dtype: float64

In [25]:
apts ** 2

Beijing      3.025000e+09
Guangzhou    1.600000e+09
Hangzhou     1.600000e+09
Shanghai     3.600000e+09
Shenzhen     3.025000e+09
Suzhou                NaN
Name: price, dtype: float64

NumPy functions can be applied to pandas objects.

In [26]:
np.square(apts)

Beijing      3.025000e+09
Guangzhou    1.600000e+09
Hangzhou     1.600000e+09
Shanghai     3.600000e+09
Shenzhen     3.025000e+09
Suzhou                NaN
Name: price, dtype: float64

Let's define another Series and add the two together.

In [27]:
cars = pd.Series({'Beijing': 300000, 'Shanghai': 400000, 'Shenzhen': 300000,
                  'Tianjin': 200000, 'Guangzhou': 200000, 'Chongqing': 150000})
cars

Beijing      300000
Chongqing    150000
Guangzhou    200000
Shanghai     400000
Shenzhen     300000
Tianjin      200000
dtype: int64

In [28]:
print(cars + apts * 100)

Beijing      5800000.0
Chongqing          NaN
Guangzhou    4200000.0
Hangzhou           NaN
Shanghai     6400000.0
Shenzhen     5800000.0
Suzhou             NaN
Tianjin            NaN
dtype: float64


### Missing data

[reference](https://pandas.pydata.org/pandas-docs/stable/missing_data.html)

In [29]:
print('Hangzhou' in apts)
print('Hangzhou' in cars)

True
False


In [30]:
apts.notnull()

Beijing       True
Guangzhou     True
Hangzhou      True
Shanghai      True
Shenzhen      True
Suzhou       False
Name: price, dtype: bool

In [31]:
print(apts.isnull())
print()

Beijing      False
Guangzhou    False
Hangzhou     False
Shanghai     False
Shenzhen     False
Suzhou        True
Name: price, dtype: bool



In [32]:
print(apts[apts.isnull()])

Suzhou   NaN
Name: price, dtype: float64


In [33]:
print(apts[apts.notnull()])

Beijing      55000.0
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     55000.0
Name: price, dtype: float64
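Beyond filtering with `isnull`/`notnull`, missing values can be dropped or replaced directly. The sketch below uses made-up data; filling with the mean is just one illustrative choice, not a recommendation from the original notebook.

```python
import numpy as np
import pandas as pd

prices = pd.Series({"Beijing": 55000.0, "Shanghai": 60000.0, "Suzhou": np.nan},
                   name="price")

# dropna returns a new Series without the missing entries.
cleaned = prices.dropna()

# fillna returns a new Series with NaN replaced; mean() skips NaN by default.
filled = prices.fillna(prices.mean())

print(cleaned)
print(filled)
```

Both calls leave `prices` itself unchanged.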


## The [DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) data structure


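A DataFrame is a table. Where a Series is a one-dimensional array, a DataFrame is two-dimensional, comparable to an Excel spreadsheet. A DataFrame can also be viewed as a collection of Series that share an index.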

### Creating a DataFrame

A DataFrame can be constructed from a dictionary.

In [34]:
data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
       'year': [2016,2017,2016,2017,2016, 2016],
       'population': [2100, 2300, 1000, 700, 500, 500]}
print(pd.DataFrame(data))

        city  population  year
0    Beijing        2100  2016
1   Shanghai        2300  2017
2  Guangzhou        1000  2016
3   Shenzhen         700  2017
4   Hangzhou         500  2016
5  Chongqing         500  2016


The names and order of the columns can be specified.

In [36]:
print(pd.DataFrame(data, columns=['year', 'city', 'population']))

   year       city  population
0  2016    Beijing        2100
1  2017   Shanghai        2300
2  2016  Guangzhou        1000
3  2017   Shenzhen         700
4  2016   Hangzhou         500
5  2016  Chongqing         500


In [37]:
frame2 = pd.DataFrame(data, \
                     columns = ['year', 'city', 'population', 'debt'],
                     index = ['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)

       year       city  population debt
one    2016    Beijing        2100  NaN
two    2017   Shanghai        2300  NaN
three  2016  Guangzhou        1000  NaN
four   2017   Shenzhen         700  NaN
five   2016   Hangzhou         500  NaN
six    2016  Chongqing         500  NaN


A DataFrame can also be built from several Series.

In [38]:
apts

Beijing      55000.0
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     55000.0
Suzhou           NaN
Name: price, dtype: float64

In [39]:
cars

Beijing      300000
Chongqing    150000
Guangzhou    200000
Shanghai     400000
Shenzhen     300000
Tianjin      200000
dtype: int64

In [40]:
df = pd.DataFrame({"apts": apts, "cars": cars})
df

              apts      cars
Beijing    55000.0  300000.0
Chongqing      NaN  150000.0
Guangzhou  40000.0  200000.0
Hangzhou   40000.0       NaN
Shanghai   60000.0  400000.0
Shenzhen   55000.0  300000.0
Suzhou         NaN       NaN
Tianjin        NaN  200000.0


A DataFrame can also be built from a list of dicts.

In [41]:
data = [{"July": 999999, "Han": 50000, "Zewei": 1000}, {"July": 99999, "Han": 8000, "Zewei": 200}]
pd.DataFrame(data)

     Han    July  Zewei
0  50000  999999   1000
1   8000   99999    200


In [42]:
pd.DataFrame(data, index=["salary", "bonus"])

          Han    July  Zewei
salary  50000  999999   1000
bonus    8000   99999    200


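## Exercise

- Build three Series holding the unit price, unit of measure, and quantity of a set of products. Choose whatever products and units you like.
- Then combine the three Series into one DataFrame.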

In [43]:
price = pd.Series([20, 2, 3, 50, 40],
             index=["Apple", "Banana", "Orange", "Watermelon", "Strawberry"])
unit = pd.Series(["kg", "each", "each", "each", "kg"],
             index=["Apple", "Banana", "Orange", "Watermelon", "Strawberry"])
amount = pd.Series([5, 10, 6, 1, 2],
             index=["Apple", "Banana", "Orange", "Watermelon", "Strawberry"])
fruit_df = pd.DataFrame({"price": price, "unit": unit, "amount": amount}, columns=["price", "unit", "amount"])
fruit_df

            price  unit  amount
Apple          20    kg       5
Banana          2  each      10
Orange          3  each       6
Watermelon     50  each       1
Strawberry     40    kg       2



## Selecting columns and rows of a DataFrame


In [44]:
df["apts"]

Beijing      55000.0
Chongqing        NaN
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     55000.0
Suzhou           NaN
Tianjin          NaN
Name: apts, dtype: float64

In [45]:
df["total_cost"] = df["apts"]*100 + df["cars"]
df

              apts      cars  total_cost
Beijing    55000.0  300000.0   5800000.0
Chongqing      NaN  150000.0         NaN
Guangzhou  40000.0  200000.0   4200000.0
Hangzhou   40000.0       NaN         NaN
Shanghai   60000.0  400000.0   6400000.0
Shenzhen   55000.0  300000.0   5800000.0
Suzhou         NaN       NaN         NaN
Tianjin        NaN  200000.0         NaN


In [46]:
print(frame2['city'])
type(frame2['city'])

one        Beijing
two       Shanghai
three    Guangzhou
four      Shenzhen
five      Hangzhou
six      Chongqing
Name: city, dtype: object


pandas.core.series.Series

In [47]:
print(frame2.year)
type(frame2.year)

one      2016
two      2017
three    2016
four     2017
five     2016
six      2016
Name: year, dtype: int64


pandas.core.series.Series

The `ix` indexer can fetch a row, but it is deprecated (and was removed entirely in pandas 1.0).

In [48]:
print(frame2.ix['three'])
type(frame2.ix['three'])

year               2016
city          Guangzhou
population         1000
debt                NaN
Name: three, dtype: object


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.


pandas.core.series.Series

`loc` fetches a row by label.

In [49]:
print(frame2.loc['three'])
type(frame2.loc['three'])

year               2016
city          Guangzhou
population         1000
debt                NaN
Name: three, dtype: object


pandas.core.series.Series

Plain square brackets select a column by default, not a row, so the line below would fail:

In [52]:
# print(frame2["three"])

`ix` can also fetch a row by integer position, but again this is no longer recommended.

In [53]:
print(frame2.ix[1])
type(frame2.ix[1])

year              2017
city          Shanghai
population        2300
debt               NaN
Name: two, dtype: object


pandas.core.series.Series

The recommended replacement is `iloc`.

In [55]:
frame2.iloc[1]

year              2017
city          Shanghai
population        2300
debt               NaN
Name: two, dtype: object
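To summarize the two replacements for `ix`: `loc` works with labels and `iloc` with integer positions. A minimal sketch on a small, made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"year": [2016, 2017, 2016],
                   "city": ["Beijing", "Shanghai", "Guangzhou"]},
                  index=["one", "two", "three"])

by_label = df.loc["two"]      # row whose index label is "two"
by_position = df.iloc[1]      # second row, regardless of its label

cell = df.loc["three", "city"]   # single cell by (row label, column label)
cell_pos = df.iloc[2, 1]         # the same cell by (row position, column position)

label_slice = df.loc["one":"two"]  # label slices include the end label
pos_slice = df.iloc[0:2]           # position slices exclude the end position

print(by_label, cell, label_slice, sep="\n\n")
```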

### Assigning to DataFrame elements

In [56]:
frame2["population"]["one"] = 2200

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [76]:
frame2.loc["one", "population"] = 2200
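The warning above appears because `frame2["population"]["one"] = ...` indexes twice, and the first step may return a temporary copy, so the write is not guaranteed to reach the original frame. A single `.loc` call with both row and column avoids the ambiguity. The sketch below uses a small made-up frame; the threshold of 2250 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Beijing", "Shanghai"],
                   "population": [2100, 2300]},
                  index=["one", "two"])

# One indexing operation: row label and column label together.
df.loc["one", "population"] = 2200

# The row selector can also be a boolean mask.
df.loc[df["population"] > 2250, "population"] = 0

print(df)
```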

A whole column can be assigned at once.

In [59]:
frame2['debt'] = 100
print(frame2)

       year       city  population  debt
one    2016    Beijing        2200   100
two    2017   Shanghai        2300   100
three  2016  Guangzhou        1000   100
four   2017   Shenzhen         700   100
five   2016   Hangzhou         500   100
six    2016  Chongqing         500   100


In [61]:
frame2.loc['six'] = 0
print(frame2)

       year       city  population  debt
one    2016    Beijing        2200   100
two    2017   Shanghai        2300   100
three  2016  Guangzhou        1000   100
four   2017   Shenzhen         700   100
five   2016   Hangzhou         500   100
six       0          0           0     0


In [65]:
frame2
frame2.index
frame2['city']

one        Beijing
two       Shanghai
three    Guangzhou
four      Shenzhen
five      Hangzhou
six              0
Name: city, dtype: object

In [66]:
frame2.debt = np.arange(6)
print(frame2)

       year       city  population  debt
one    2016    Beijing        2200     0
two    2017   Shanghai        2300     1
three  2016  Guangzhou        1000     2
four   2017   Shenzhen         700     3
five   2016   Hangzhou         500     4
six       0          0           0     5


A Series can also be used to specify which index labels to modify and their values; labels not present in the Series are set to NaN.

In [67]:
val = pd.Series([100, 200, 300], index=['two', 'three', 'five'])
frame2['debt'] = val
print(frame2)

       year       city  population   debt
one    2016    Beijing        2200    NaN
two    2017   Shanghai        2300  100.0
three  2016  Guangzhou        1000  200.0
four   2017   Shenzhen         700    NaN
five   2016   Hangzhou         500  300.0
six       0          0           0    NaN
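Column assignment from a Series aligns on the frame's index rather than on position. As a small sketch with made-up data, labels in the Series that the frame does not have are simply ignored:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"population": [2200, 2300, 1000]},
                  index=["one", "two", "three"])

# "four" is not a row of df, so its value is dropped during alignment;
# rows "one" and "three" have no matching label and receive NaN.
df["debt"] = pd.Series([100, 300], index=["two", "four"])

print(df)
```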


In [68]:
frame2['western'] = (frame2.city == 'Chongqing')
print(frame2)

       year       city  population   debt  western
one    2016    Beijing        2200    NaN    False
two    2017   Shanghai        2300  100.0    False
three  2016  Guangzhou        1000  200.0    False
four   2017   Shenzhen         700    NaN    False
five   2016   Hangzhou         500  300.0    False
six       0          0           0    NaN    False


To see which columns a DataFrame has, use `columns`.

In [70]:
print(frame2.columns)

Index(['year', 'city', 'population', 'debt', 'western'], dtype='object')


Like a NumPy 2-D array, a DataFrame can be transposed.

In [71]:
pop = {'Beijing': {2016: 2100, 2017:2200},
      'Shanghai': {2015:2400, 2016:2500, 2017:2600}}

In [72]:
frame3 = pd.DataFrame(pop)
print(frame3)
print(frame3.T)

      Beijing  Shanghai
2015      NaN      2400
2016   2100.0      2500
2017   2200.0      2600
            2015    2016    2017
Beijing      NaN  2100.0  2200.0
Shanghai  2400.0  2500.0  2600.0


We can specify the order of the index, or initialize the data from slices.

In [73]:
print(pd.DataFrame(pop, index=[2016,2015,2017]))

      Beijing  Shanghai
2016   2100.0      2500
2015      NaN      2400
2017   2200.0      2600


In [74]:
pdata = {'Beijing': frame3['Beijing'][:-1], 'Shanghai':frame3['Shanghai'][:-1]}
print(pd.DataFrame(pdata))

      Beijing  Shanghai
2015      NaN      2400
2016   2100.0      2500


We can also name the index and the columns.

In [75]:
frame3.index.name = 'year'
frame3.columns.name = 'city'
print(frame3)

city  Beijing  Shanghai
year                   
2015      NaN      2400
2016   2100.0      2500
2017   2200.0      2600


In [76]:
print(frame3.values)
print(frame2)
print(type(frame2.values))

[[   nan  2400.]
 [ 2100.  2500.]
 [ 2200.  2600.]]
       year       city  population   debt  western
one    2016    Beijing        2200    NaN    False
two    2017   Shanghai        2300  100.0    False
three  2016  Guangzhou        1000  200.0    False
four   2017   Shenzhen         700    NaN    False
five   2016   Hangzhou         500  300.0    False
six       0          0           0    NaN    False
<class 'numpy.ndarray'>


## Index

### index object

In [77]:
obj = pd.Series(range(3), index = ['a', 'b', 'c'])
index = obj.index
print(index)
print(index[1:])

Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')


The values of an Index cannot be modified in place.

In [78]:
index[1] = 'd'

TypeError: Index does not support mutable operations
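Because an Index is immutable, changing labels means building a new index. The sketch below shows two common options: `rename` with a mapping, which returns a new Series, and assigning a whole new list to `.index` (done on a copy here so the original stays intact).

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])

# Item assignment on an Index raises TypeError.
try:
    obj.index[1] = "d"
    mutated = True
except TypeError:
    mutated = False

# rename returns a new Series with the mapped labels; obj is untouched.
renamed = obj.rename(index={"b": "d"})

# Replacing the entire index at once is allowed.
relabeled = obj.copy()
relabeled.index = ["x", "y", "z"]

print(mutated, list(renamed.index), list(relabeled.index))
```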

In [79]:
index = pd.Index(np.arange(3))
obj2 = pd.Series([2,5,7], index=index)
print(obj2)
print(obj2.index is index)
print(obj2.index is np.arange(3))

0    2
1    5
2    7
dtype: int64
True
False


In [80]:
pop = {'Beijing': {2016: 2100, 2017:2200},
      'Shanghai': {2015:2400, 2016:2500, 2017:2600}}
frame3 = pd.DataFrame(pop)
print('Shanghai' in frame3.columns)
print(2015 in frame3.index)

True
True


### Indexing and slicing by index

In [81]:
obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
print(obj['b'])

1


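Integer positions still work even when the index holds labels. Recent pandas versions deprecate this fallback, so `obj.iloc[3]` and `obj.iloc[[1, 3]]` are the explicit spellings.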

In [82]:
print(obj[3])
print()
print(obj[[1,3]])

3

b    1
d    3
dtype: int64


In [83]:
print(obj[obj<2])

a    0
b    1
dtype: int64


Next, slicing a Series. Note that a label slice includes its end point.

In [84]:
print(obj['b':'c'])
obj['b':'c'] = 5
print(obj)

b    1
c    2
dtype: int64
a    0
b    5
c    5
d    3
dtype: int64
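The output above shows that `'b':'c'` returned both `b` and `c`. A brief sketch of the difference from positional slicing, using the explicit `.loc` and `.iloc` accessors:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=["a", "b", "c", "d"])

by_label = obj.loc["b":"c"]    # label slice: both end points are included
by_position = obj.iloc[1:2]    # position slice: the end position is excluded

print(by_label)
print(by_position)
```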


Indexing a DataFrame works much the same as indexing a Series.

In [85]:
frame = pd.DataFrame(np.arange(9).reshape(3,3), 
                    index = ['a', 'c', 'd'],
                    columns = ['Hangzhou', 'Shenzhen', 'Nanjing'])

In [86]:
print(frame)

   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5
d         6         7        8


In [87]:
print(frame['Hangzhou'])

a    0
c    3
d    6
Name: Hangzhou, dtype: int64


In [89]:
print(frame[['Shenzhen', 'Nanjing']])

   Shenzhen  Nanjing
a         1        2
c         4        5
d         7        8


In [96]:
print(frame[:2])

   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5


In [98]:
print(frame.loc['a'])

Hangzhou    0
Shenzhen    1
Nanjing     2
Name: a, dtype: int64


In [99]:
print(frame.loc[['a','d'], ['Shenzhen', 'Nanjing']])

   Shenzhen  Nanjing
a         1        2
d         7        8


In [102]:
print(frame.loc[:'c', 'Hangzhou'])

a    0
c    3
Name: Hangzhou, dtype: int64


A DataFrame also supports selection by condition.

In [103]:
print(frame[frame.Hangzhou > 1])

   Hangzhou  Shenzhen  Nanjing
c         3         4        5
d         6         7        8


In [104]:
print(frame < 5)

   Hangzhou  Shenzhen  Nanjing
a      True      True     True
c      True      True    False
d     False     False    False


In [105]:
frame[frame < 5] = 0
print(frame)

   Hangzhou  Shenzhen  Nanjing
a         0         0        0
c         0         0        5
d         6         7        8


### [reindex](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html)

`reindex` rearranges a Series or DataFrame to follow a new index order.

In [106]:
import numpy as np
import pandas as pd

In [107]:
obj = pd.Series([4.5, 7.2, -5.3, 3.2], index=['d', 'b', 'a', 'c'])
print(obj)

d    4.5
b    7.2
a   -5.3
c    3.2
dtype: float64


In [108]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2)

a   -5.3
b    7.2
c    3.2
d    4.5
e    NaN
dtype: float64


In [109]:
print(obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0))

a   -5.3
b    7.2
c    3.2
d    4.5
e    0.0
dtype: float64


In [110]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index = [0,2,4])
print(obj3)

0      blue
2    purple
4    yellow
dtype: object


In [111]:
print(obj3.reindex(range(6), method='ffill'))

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object


In [112]:
print(obj3.reindex(range(6), method='bfill'))

0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object
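In the two cells above, `method='ffill'` fills each new label from the nearest earlier existing label, while `method='bfill'` fills from the nearest later one and leaves NaN where no later label exists. A compact sketch that reproduces both behaviors:

```python
import pandas as pd

colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Forward fill: each new label takes the value of the last label before it.
forward = colors.reindex(range(6), method="ffill")

# Backward fill: each new label takes the value of the next label after it;
# label 5 has nothing after it, so it stays missing.
backward = colors.reindex(range(6), method="bfill")

print(forward)
print(backward)
```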


Since a Series can be reindexed, a DataFrame can be reindexed in the same way.

In [113]:
frame = pd.DataFrame(np.arange(9).reshape(3,3), 
                    index = ['a', 'c', 'd'],
                    columns = ['Hangzhou', 'Shenzhen', 'Nanjing'])
print(frame)

   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5
d         6         7        8


In [114]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)

   Hangzhou  Shenzhen  Nanjing
a       0.0       1.0      2.0
b       NaN       NaN      NaN
c       3.0       4.0      5.0
d       6.0       7.0      8.0


While reindexing, we can also specify new columns.

In [115]:
print(frame.reindex(columns = ['Shenzhen', 'Hangzhou', 'Chongqing']))

   Shenzhen  Hangzhou  Chongqing
a         1         0        NaN
c         4         3        NaN
d         7         6        NaN


In [116]:
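print(frame.reindex(index = ['a', 'b', 'c', 'd'],
                    columns = ['Chongqing', 'Hangzhou', 'Shenzhen']))
# Note: since pandas 1.0, .loc raises KeyError when a listed label is missing;
# the second output below comes from an older version. Prefer reindex here.
print(frame.loc[['a', 'b', 'c', 'd'],['Shenzhen', 'Hangzhou', 'Chongqing']])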

   Chongqing  Hangzhou  Shenzhen
a        NaN       0.0       1.0
b        NaN       NaN       NaN
c        NaN       3.0       4.0
d        NaN       6.0       7.0
   Shenzhen  Hangzhou  Chongqing
a       1.0       0.0        NaN
b       NaN       NaN        NaN
c       4.0       3.0        NaN
d       7.0       6.0        NaN
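When some requested labels may not exist, `reindex` is the dependable choice: it inserts NaN rows or columns for them. The sketch below, on a small made-up frame, wraps the `.loc` call in `try`/`except` because newer pandas versions are expected to raise `KeyError` there, whereas the older version used in this notebook returned NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(4).reshape(2, 2),
                     index=["a", "c"],
                     columns=["Hangzhou", "Shenzhen"])

wanted_rows = ["a", "b", "c"]   # "b" does not exist in frame

# reindex tolerates unknown labels and fills them with NaN.
safe = frame.reindex(index=wanted_rows)

# .loc with a list containing an unknown label may raise, depending on version.
try:
    frame.loc[wanted_rows]
    loc_failed = False
except KeyError:
    loc_failed = True

print(safe)
print("loc raised KeyError:", loc_failed)
```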


Next, using `drop` to remove index entries from a Series or DataFrame.

In [117]:
print(obj3)
obj4 = obj3.drop(2)
print(obj4)

0      blue
2    purple
4    yellow
dtype: object
0      blue
4    yellow
dtype: object


In [118]:
print(obj3.drop([2,4]))

0    blue
dtype: object


In [119]:
print(frame)

   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5
d         6         7        8


In [120]:
print(frame.drop(['a', 'c']))

   Hangzhou  Shenzhen  Nanjing
d         6         7        8


`drop` can remove columns as well as rows.

In [121]:
print(frame.drop('Shenzhen', axis=1))

   Hangzhou  Nanjing
a         0        2
c         3        5
d         6        8


In [122]:
print(frame.drop(['Shenzhen', 'Hangzhou'], axis=1))

   Nanjing
a        2
c        5
d        8
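Two details are worth a short sketch: `drop` returns a new object and leaves the original untouched, and the `index=` and `columns=` keywords are an alternative to `axis=` that states the direction explicitly.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=["a", "c", "d"],
                     columns=["Hangzhou", "Shenzhen", "Nanjing"])

no_row = frame.drop(index="a")             # same as frame.drop("a")
no_col = frame.drop(columns=["Shenzhen"])  # same as frame.drop("Shenzhen", axis=1)

print(no_row)
print(no_col)
print(frame.shape)   # the original frame still has all rows and columns
```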


### hierarchical index

In [123]:
import numpy as np
import pandas as pd

Hierarchical indexing on a Series

In [124]:
data = pd.Series(np.random.randn(10), index=[['a','a','a','b','b','c','c','c','d','d'], [1,2,3,1,2,1,2,3,1,2]])
print(data)

a  1   -0.112995
   2   -1.598109
   3   -2.460582
b  1    0.635737
   2    0.960789
c  1    1.079376
   2   -0.013745
   3    0.121006
d  1   -0.552221
   2    0.775818
dtype: float64


In [125]:
print(data.index)

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 2, 3, 3], [0, 1, 2, 0, 1, 0, 1, 2, 0, 1]])


In [126]:
print(data.b)

1    0.635737
2    0.960789
dtype: float64


In [127]:
print(data['b':'c'])

b  1    0.635737
   2    0.960789
c  1    1.079376
   2   -0.013745
   3    0.121006
dtype: float64


In [128]:
print(data[:2])

a  1   -0.112995
   2   -1.598109
dtype: float64


`unstack` and `stack` convert between a hierarchically indexed Series and a DataFrame.

In [129]:
print(data.unstack())
print(type(data.unstack()))

          1         2         3
a -0.112995 -1.598109 -2.460582
b  0.635737  0.960789       NaN
c  1.079376 -0.013745  0.121006
d -0.552221  0.775818       NaN
<class 'pandas.core.frame.DataFrame'>


In [130]:
print(data.unstack().stack())
print(type(data.unstack().stack()))

a  1   -0.112995
   2   -1.598109
   3   -2.460582
b  1    0.635737
   2    0.960789
c  1    1.079376
   2   -0.013745
   3    0.121006
d  1   -0.552221
   2    0.775818
dtype: float64
<class 'pandas.core.series.Series'>
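The round trip can be checked on deterministic data. In this sketch (made-up values), `unstack` moves the inner index level into columns and fills the combinations that did not exist with NaN. Whether `stack` drops those NaN cells by itself is assumed here to vary between pandas versions, so `dropna()` is called explicitly.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0],
                 index=[["a", "a", "b"], [1, 2, 1]])

# Inner level (1, 2) becomes the columns; ("b", 2) never existed, so it is NaN.
wide = data.unstack()

# Back to a long Series; dropna() removes the placeholder cell if it survived.
long = wide.stack().dropna()

print(wide)
print(long)
```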


Hierarchical indexing on a DataFrame

In [131]:
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
                    index = [['a','a','b','b'], [1,2,1,2]],
                    columns = [['Beijing', 'Beijing', 'Shanghai'], ['apts', 'cars', 'apts']])
print(frame)

    Beijing      Shanghai
       apts cars     apts
a 1       0    1        2
  2       3    4        5
b 1       6    7        8
  2       9   10       11


In [132]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['city', 'type']
print(frame)

city      Beijing      Shanghai
type         apts cars     apts
key1 key2                      
a    1          0    1        2
     2          3    4        5
b    1          6    7        8
     2          9   10       11


In [134]:
print(frame.loc['a', 1])
print(type(frame.loc['a', 1]))

city      type
Beijing   apts    0
          cars    1
Shanghai  apts    2
Name: (a, 1), dtype: int64
<class 'pandas.core.series.Series'>


In [135]:
print(frame.loc['a', 2]['Beijing'])

type
apts    3
cars    4
Name: (a, 2), dtype: int64


In [136]:
print(frame.loc['a', 2]['Beijing']['apts'])

3


In [140]:
print(frame.loc['a'])

city Beijing      Shanghai
type    apts cars     apts
key2                      
1          0    1        2
2          3    4        5
