Pandas is a Python library for data analysis. It is already installed once you have installed Anaconda.

# 1 Basic data structures
Pandas provides two main data structures:
+ Series
+ DataFrame

A Series stores one-dimensional data, while a DataFrame stores more complex data.

## Series
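A Series represents one-dimensional data and has a simple internal structure: two associated arrays. The main array holds the data, and every element of the main array has a label associated with it, stored in the index array.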

index | value
--- | ---
 0 | 12
 1 | -4
 2 | 22
 3 | 0
 

In [1]:
import numpy as np
import pandas as pd
a = pd.Series([12,-4,22,0])
a

0    12
1    -4
2    22
3     0
dtype: int64

In [2]:
# Replace the default labels with meaningful ones
a = pd.Series([12,-4,22,0],index=['a','b','c','d'])
print(a)

a    12
b    -4
c    22
d     0
dtype: int64


In [3]:
print(a.values)
print(a.index)

[12 -4 22  0]
Index([u'a', u'b', u'c', u'd'], dtype='object')


### Selecting data

In [4]:
print(a[2])
print(a['b'])
print(a[0:2])
print(a[['b','d']])

22
-4
a    12
b    -4
dtype: int64
b   -4
d    0
dtype: int64
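
The selections above mix positions and labels inside the same square brackets. As a minimal sketch, the same lookups can be written with the explicit `.iloc` (position) and `.loc` (label) indexers; the variable names below are only illustrative.

```python
import pandas as pd

a = pd.Series([12, -4, 22, 0], index=['a', 'b', 'c', 'd'])

third = a.iloc[2]              # third element, selected by position
b_val = a.loc['b']             # element labelled 'b', selected by label
first_two = a.iloc[0:2]        # positional slice, end position excluded
label_slice = a.loc['a':'c']   # label slice, both end labels included

print(third, b_val)
print(first_two)
print(label_slice)
```

Note that a label slice includes its end label, while a positional slice excludes its end position.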


### Assigning values

In [5]:
a[2]=10
a['d']=-5

### Creating a Series from a NumPy array

In [6]:
arr =np.array([1,2,3,4])
s = pd.Series(arr)
s

0    1
1    2
2    3
3    4
dtype: int64

In [7]:
s1 = pd.Series(s)
s1[0]=-10
# As the output shows, s1 shares its data with s here, so s changes as well
print(s)

0   -10
1     2
2     3
3     4
dtype: int64
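
In the run above, changing `s1` also changed `s`, because the new Series was built on the same underlying data. Whether this sharing happens may depend on the pandas version (newer releases with Copy-on-Write are expected to behave differently), so the sketch below takes an explicit copy when an independent Series is wanted.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 2, 3, 4]))

# copy() returns a Series with its own data
s1 = s.copy()
s1.iloc[0] = -10

print(s)    # the original keeps its first value
print(s1)   # only the copy is modified
```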


### Filtering data

In [8]:
s[s>2]

2    3
3    4
dtype: int64

### Mathematical operations

In [9]:
s/2

0   -5.0
1    1.0
2    1.5
3    2.0
dtype: float64

In [10]:
np.log(s)

0         NaN
1    0.693147
2    1.098612
3    1.386294
dtype: float64

### NaN
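When a field is empty, or a value does not meet the definition of a valid number (such as the logarithm of a negative number in the previous example), pandas returns NaN (Not a Number).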

In [1]:
import numpy as np
import pandas as pd
s2 = pd.Series([5,-3,np.nan,14])  # np.nan is the spelling that works across NumPy versions
s2

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

In [2]:
s2.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [4]:
s2.notnull()

0     True
1     True
2    False
3     True
dtype: bool

In [5]:
# Use the boolean mask as a filter condition
s2[s2.notnull()]

0     5.0
1    -3.0
3    14.0
dtype: float64

In [6]:
s2[s2.isnull()]

2   NaN
dtype: float64
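
Filtering with `notnull()` is common enough that pandas also offers `dropna()` for it. The short sketch below shows that both give the same result and counts the missing values; `kept` and `n_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

s2 = pd.Series([5, -3, np.nan, 14])

kept = s2.dropna()                   # same rows as s2[s2.notnull()]
n_missing = int(s2.isnull().sum())   # True counts as 1 when summed

print(kept)
print(n_missing)
```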

### Creating a Series from a dictionary

In [8]:
mydic = {'red':200,'blue':1000,'yellow':500,'orange':1000}
myseries = pd.Series(mydic)
myseries

blue      1000
orange    1000
red        200
yellow     500
dtype: int64

In [9]:
colors = ['red','yellow','orange','blue','green']
mySeries = pd.Series(mydic, index=colors)
mySeries

red        200.0
yellow     500.0
orange    1000.0
blue      1000.0
green        NaN
dtype: float64

### Operations between Series

In [10]:
mydict2 = {'red':400,'yellow':1000,'black':700}
mySeries2 = pd.Series(mydict2)
mySeries + mySeries2

black        NaN
blue         NaN
green        NaN
orange       NaN
red        600.0
yellow    1500.0
dtype: float64
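
The addition above aligns the two Series on their labels, and every label present in only one of them ends up as NaN. If a missing label should instead count as zero, the `add` method with `fill_value` is one option. This is a sketch with simplified inputs (`s_a` and `s_b` are illustrative names, without the `green` entry that is NaN on one side).

```python
import pandas as pd

s_a = pd.Series({'red': 200, 'blue': 1000, 'yellow': 500, 'orange': 1000})
s_b = pd.Series({'red': 400, 'yellow': 1000, 'black': 700})

plain = s_a + s_b                    # labels found on one side only become NaN
total = s_a.add(s_b, fill_value=0)   # a label missing on one side is treated as 0

print(plain)
print(total)
```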

## DataFrame

A DataFrame is a tabular structure much like an Excel spreadsheet. It was designed to extend the Series to multiple dimensions.

index | color | object | price
--- | --- | --- | ---
0 | blue | ball | 1.2
1 | green | pen | 1.0
2 | yellow | pencil | 0.6

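Unlike a Series, a DataFrame has two index arrays:

1. The first is associated with the rows and works like the index of a Series.
2. The second holds a set of labels, each associated with one column of data.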

In [1]:
import numpy as np
import pandas as pd
data = {'color':['blue','green','yellow'],'object':['ball','pen','pencil'],'price':[1.2,1.0,0.6]}
frame = pd.DataFrame(data)
frame

index | color | object | price
--- | --- | --- | ---
0 | blue | ball | 1.2
1 | green | pen | 1.0
2 | yellow | pencil | 0.6


You can also pick only the columns you are interested in:

In [3]:
frame2 = pd.DataFrame(data,columns=['object','price'])
frame2

index | object | price
--- | --- | ---
0 | ball | 1.2
1 | pen | 1.0
2 | pencil | 0.6


Setting custom index labels:

In [4]:
frame2 = pd.DataFrame(data,index=['one','two','three'])
frame2

index | color | object | price
--- | --- | --- | ---
one | blue | ball | 1.2
two | green | pen | 1.0
three | yellow | pencil | 0.6


Another way to create a DataFrame:

In [7]:
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),
                     index=['red','blue','yellow','white'],
                     columns=['ball','pen','pencil','paper'])
frame3

index | ball | pen | pencil | paper
--- | --- | --- | --- | ---
red | 0 | 1 | 2 | 3
blue | 4 | 5 | 6 | 7
yellow | 8 | 9 | 10 | 11
white | 12 | 13 | 14 | 15


#### Selecting elements

In [9]:
frame.columns

Index([u'color', u'object', u'price'], dtype='object')

In [10]:
frame.index

RangeIndex(start=0, stop=3, step=1)

In [11]:
frame.values

array([['blue', 'ball', 1.2],
       ['green', 'pen', 1.0],
       ['yellow', 'pencil', 0.6]], dtype=object)

In [12]:
frame['price']

0    1.2
1    1.0
2    0.6
Name: price, dtype: float64

In [13]:
frame.loc[2]  # .ix has been removed from recent pandas releases; .loc selects by label

color     yellow
object    pencil
price        0.6
Name: 2, dtype: object

In [16]:
# Select several rows at once
frame.loc[[0,2]]

index | color | object | price
--- | --- | --- | ---
0 | blue | ball | 1.2
2 | yellow | pencil | 0.6


In [17]:
# Rows can also be selected with slice syntax
frame[0:1]

index | color | object | price
--- | --- | --- | ---
0 | blue | ball | 1.2


In [18]:
frame[0:3]

index | color | object | price
--- | --- | --- | ---
0 | blue | ball | 1.2
1 | green | pen | 1.0
2 | yellow | pencil | 0.6


In [19]:
# Select a single element
frame['object'][2]

'pencil'
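
The lookup above first selects a column and then an element inside it. As a hedged alternative, a single cell can be addressed in one step with `.loc`, `.iloc`, or `.at`; the sketch assumes the same `frame` as earlier, with its default integer index.

```python
import pandas as pd

data = {'color': ['blue', 'green', 'yellow'],
        'object': ['ball', 'pen', 'pencil'],
        'price': [1.2, 1.0, 0.6]}
frame = pd.DataFrame(data)

by_label = frame.loc[2, 'object']   # row label 2, column label 'object'
by_position = frame.iloc[2, 1]      # third row, second column
fast_scalar = frame.at[2, 'object'] # label-based access to a single value

print(by_label, by_position, fast_scalar)
```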

#### Assigning values

In [23]:
frame.index.name='id'
frame.columns.name='item'
frame

id / item | color | object | price
--- | --- | --- | ---
0 | blue | ball | 1.2
1 | green | pen | 1.0
2 | yellow | pencil | 0.6


In [24]:
# Add a new column
frame['new']=12 # every row receives the same default value
frame

id / item | color | object | price | new
--- | --- | --- | --- | ---
0 | blue | ball | 1.2 | 12
1 | green | pen | 1.0 | 12
2 | yellow | pencil | 0.6 | 12


In [25]:
frame['new'] = [3.0,1.3,2.2]
frame

id / item | color | object | price | new
--- | --- | --- | --- | ---
0 | blue | ball | 1.2 | 3.0
1 | green | pen | 1.0 | 1.3
2 | yellow | pencil | 0.6 | 2.2


In [26]:
# Alternatively, update the column from a Series
ser = pd.Series(np.arange(3))
frame['new']=ser
frame

id / item | color | object | price | new
--- | --- | --- | --- | ---
0 | blue | ball | 1.2 | 0
1 | green | pen | 1.0 | 1
2 | yellow | pencil | 0.6 | 2
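
When a column is assigned from a Series, the values are matched by index label rather than by position. The sketch below uses a small hypothetical frame and a Series with a different, incomplete index to make the alignment visible.

```python
import pandas as pd

frame = pd.DataFrame({'price': [1.2, 1.0, 0.6]})

# This Series has labels 2 and 0 only, in a different order
ser = pd.Series([10, 20], index=[2, 0])
frame['new'] = ser

# Row 0 receives 20, row 2 receives 10, row 1 has no match and becomes NaN
print(frame)
```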


#### Membership of elements

In [30]:
frame.isin([1.0,'pen'])

id / item | color | object | price | new
--- | --- | --- | --- | ---
0 | False | False | False | False
1 | False | True | True | True
2 | False | False | False | False


In [31]:
frame[frame.isin([1.0,'pen'])]

id / item | color | object | price | new
--- | --- | --- | --- | ---
0 | NaN | NaN | NaN | NaN
1 | NaN | pen | 1.0 | 1.0
2 | NaN | NaN | NaN | NaN
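
Applying `isin` to the whole frame masks individual cells. To keep complete rows instead, one possible approach is to apply `isin` to a single column and use the resulting boolean Series as a row filter; `mask` and `subset` are illustrative names.

```python
import pandas as pd

data = {'color': ['blue', 'green', 'yellow'],
        'object': ['ball', 'pen', 'pencil'],
        'price': [1.2, 1.0, 0.6]}
frame = pd.DataFrame(data)

mask = frame['object'].isin(['pen', 'pencil'])  # one True/False per row
subset = frame[mask]                            # keeps whole rows where mask is True

print(subset)
```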


#### Deleting a column

In [32]:
del frame['new']
frame

id / item | color | object | price
--- | --- | --- | ---
0 | blue | ball | 1.2
1 | green | pen | 1.0
2 | yellow | pencil | 0.6


#### Filtering elements

In [33]:
frame[frame<3]

id / item | color | object | price
--- | --- | --- | ---
0 | NaN | NaN | 1.2
1 | NaN | NaN | 1.0
2 | NaN | NaN | 0.6
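
The comparison above is applied to text columns as well. On Python 3, comparing strings with a number may raise a `TypeError`, so a more portable sketch is to restrict the comparison to numeric columns first, or to filter rows on one column. The threshold 1.1 is an arbitrary value chosen for illustration.

```python
import pandas as pd

data = {'color': ['blue', 'green', 'yellow'],
        'object': ['ball', 'pen', 'pencil'],
        'price': [1.2, 1.0, 0.6]}
frame = pd.DataFrame(data)

numeric = frame.select_dtypes(include='number')  # keeps only the price column
masked = numeric[numeric < 1.1]                  # cells failing the test become NaN
cheap_rows = frame[frame['price'] < 1.1]         # whole rows whose price passes the test

print(masked)
print(cheap_rows)
```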


#### Creating a DataFrame from a nested dictionary
pandas interprets the outer keys as column names and the inner keys as index labels.

In [34]:
nestdict = {'red':{2012:22,2013:33},'white':{2011:13,2012:22,2013:16},'blue':{2011:17,2012:27,2013:18}}
frame2 = pd.DataFrame(nestdict)
frame2

index | blue | red | white
--- | --- | --- | ---
2011 | 17 | NaN | 13
2012 | 27 | 22.0 | 22
2013 | 18 | 33.0 | 16


#### Transposing a DataFrame

In [36]:
frame2.T

index | 2011 | 2012 | 2013
--- | --- | --- | ---
blue | 17.0 | 27.0 | 18.0
red | NaN | 22.0 | 33.0
white | 13.0 | 22.0 | 16.0
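
If the outer keys of a nested dictionary should become rows from the start, `DataFrame.from_dict` with `orient='index'` is one way to get that layout without a separate transpose. The sketch reuses the dictionary from above; the order of rows and columns in the result may differ from the transposed frame.

```python
import pandas as pd

nestdict = {'red': {2012: 22, 2013: 33},
            'white': {2011: 13, 2012: 22, 2013: 16},
            'blue': {2011: 17, 2012: 27, 2013: 18}}

# Outer keys (colors) become the index, inner keys (years) become the columns
by_color = pd.DataFrame.from_dict(nestdict, orient='index')

print(by_color)
```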


### The Index object

In [3]:
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
ser.index

Index([u'red', u'blue', u'yellow', u'white', u'green'], dtype='object')

#### Methods that return index labels

In [4]:
ser.idxmin()

'blue'

In [5]:
ser.idxmax()

'white'

### An index with duplicate labels

In [6]:
serd = pd.Series(range(6),index=['white','white','blue','green','green','yellow'])
serd

white     0
white     1
blue      2
green     3
green     4
yellow    5
dtype: int64

In [7]:
serd['white']

white    0
white    1
dtype: int64

In [10]:
serd.index.is_unique

False
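
With duplicate labels, the type of the result depends on how many times the label occurs. The sketch below contrasts a label that appears once with one that appears twice, and uses `Index.duplicated()` to flag the repeated entries.

```python
import pandas as pd

serd = pd.Series(range(6),
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

one = serd['blue']    # label occurs once: a single value comes back
many = serd['white']  # label occurs twice: a Series comes back

# True marks every occurrence of a label after its first one
dups = serd.index.duplicated()

print(one)
print(many)
print(dups)
```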

# Other uses of the Index object
+ Reindexing
+ Dropping
+ Alignment

## Reindexing

In [11]:
ser = pd.Series([2,5,7,4],index=['one','two','three','four'])
ser

one      2
two      5
three    7
four     4
dtype: int64

In [13]:
ser.reindex(['three','four','five','one'])

three    7.0
four     4.0
five     NaN
one      2.0
dtype: float64
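
In the output above, the label `five` has no match and becomes NaN, which also turns the values into floats. If a default value is preferable, `reindex` accepts a `fill_value` argument; the choice of 0 here is only an illustration.

```python
import pandas as pd

ser = pd.Series([2, 5, 7, 4], index=['one', 'two', 'three', 'four'])

# Labels missing from the original index receive fill_value instead of NaN
filled = ser.reindex(['three', 'four', 'five', 'one'], fill_value=0)

print(filled)
```

Labels that are not listed in the new index, such as `two`, are left out of the result.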