# Data processing with pandas

pandas is built on top of NumPy, so most NumPy functions also work on pandas objects.

The two pandas data structures to get familiar with are `Series` and `DataFrame`.

## Series
A `Series` is an array-like object made up of a sequence of values and an associated set of labels (its index).

In [1]:
import pandas as pd
 
object=pd.Series([2,5,8,9])
 
print(object)

0    2
1    5
2    8
3    9
dtype: int64


The values and the index are available through the `values` and `index` attributes.

In [2]:
print(object.values)
print(object.index)

[2 5 8 9]
RangeIndex(start=0, stop=4, step=1)


You can also supply your own labels.

In [3]:
object1=pd.Series([2,5,8,9],index=['a','b','c','d'])
 
print(object1)

a    2
b    5
c    8
d    9
dtype: int64


Operations on a Series, such as filtering with a boolean condition:

In [4]:
print(object[object>5])

2    8
3    9
dtype: int64
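
The filter above is one case of element-wise operations on a Series. A minimal sketch (the names `s`, `doubled`, `mask` and `filtered` are illustrative, not from the original notebook) showing that arithmetic and comparisons are applied per element and keep the index:

```python
import pandas as pd

s = pd.Series([2, 5, 8, 9])

doubled = s * 2        # element-wise arithmetic, index is preserved
mask = s > 5           # boolean Series aligned with s
filtered = s[mask]     # keeps only the rows where mask is True

print(doubled.tolist())          # [4, 10, 16, 18]
print(filtered.index.tolist())   # [2, 3]
```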


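A Series can be treated like a dict: the `in` operator tests membership.

Note that `in` checks the index labels only; values are not matched by it.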

In [5]:
print('a' in object1)
print(0 in object)

True
True
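
Since `in` looks at labels, checking for a value needs a different route. A small sketch, assuming a hypothetical Series `s` with string labels; `s.values` and `isin` are two possible ways to search the values:

```python
import pandas as pd

s = pd.Series([2, 5, 8, 9], index=['a', 'b', 'c', 'd'])

label_hit = 'a' in s             # True: 'a' is an index label
value_as_label = 2 in s          # False: 2 is a value, not a label
value_hit = 2 in s.values        # True: searches the underlying array
any_match = s.isin([2]).any()    # True: element-wise membership test

print(label_hit, value_as_label, value_hit, any_match)
```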


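Some useful Series methods and attributes:

`isnull` and `notnull` detect missing values.

The `name` and `index.name` attributes assign a name to the Series and to its index.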
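
A short sketch of these methods and attributes; the data and the names `'income'` and `'id'` are made up for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([3000, np.nan, 4500], index=['a', 'b', 'c'])

missing = s.isnull()     # True where the value is NaN
present = s.notnull()    # the inverse of isnull

s.name = 'income'        # name of the Series itself
s.index.name = 'id'      # name of the index

print(missing.tolist())  # [False, True, False]
print(s)
```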

## DataFrame
A `DataFrame` is a tabular data structure, similar to a data frame in R.

In [6]:
data={'year':[2000,2001,2002,2003],
          'income':[3000,3500,4500,6000]}
 
data=pd.DataFrame(data)
 
print(data)

   income  year
0    3000  2000
1    3500  2001
2    4500  2002
3    6000  2003


If a requested column such as `outcome` does not exist in the data, it is filled with NaN.

In [7]:
data1 ={'year':[2000,2001,2002,2003],
          'income':[3000,3500,4500,6000]}

data2=pd.DataFrame(data1,
                   columns=['year','income','outcome'],
                   index=['a','b','c','d']
                  )
print(data2)

   year  income outcome
a  2000    3000     NaN
b  2001    3500     NaN
c  2002    4500     NaN
d  2003    6000     NaN


### Ways to index

In [8]:
print(data2['year'])

a    2000
b    2001
c    2002
d    2003
Name: year, dtype: int64


In [9]:
print(data2.year)

a    2000
b    2001
c    2002
d    2003
Name: year, dtype: int64


Rows are indexed in a different way, through a label-based indexer:

In [10]:
print(data2.loc['a'])

year       2000
income     3000
outcome     NaN
Name: a, dtype: object
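
In current pandas, rows are selected with `.loc` (by label) or `.iloc` (by position); the older `.ix` indexer was removed in pandas 1.0. A minimal sketch on a hypothetical frame `df` mirroring the data above:

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2002, 2003],
                   'income': [3000, 3500, 4500, 6000]},
                  index=['a', 'b', 'c', 'd'])

row_by_label = df.loc['a']      # row whose label is 'a'
row_by_pos = df.iloc[0]         # first row by position
cell = df.loc['b', 'income']    # single value: row 'b', column 'income'

label_slice = df.loc['b':'c']   # label slices include both endpoints
pos_slice = df.iloc[1:3]        # positional slices exclude the end

print(cell)                     # 3500
```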


Slicing by position selects rows:

In [11]:
print(data2[1:3])

   year  income outcome
b  2001    3500     NaN
c  2002    4500     NaN


Adding and deleting columns

In [12]:
import numpy as np
data2['money']=np.arange(4)
print(data2)

   year  income outcome  money
a  2000    3000     NaN      0
b  2001    3500     NaN      1
c  2002    4500     NaN      2
d  2003    6000     NaN      3


In [13]:
del data2['outcome']
print(data2)

   year  income  money
a  2000    3000      0
b  2001    3500      1
c  2002    4500      2
d  2003    6000      3
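
Column assignment and `del` modify the frame in place. As a possible alternative, `assign` and `drop(columns=...)` return new frames and leave the original untouched. A sketch with illustrative names (`df`, `with_money`, `without_money`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2002, 2003],
                   'income': [3000, 3500, 4500, 6000]},
                  index=['a', 'b', 'c', 'd'])

with_money = df.assign(money=np.arange(4))          # new frame with an extra column
without_money = with_money.drop(columns=['money'])  # new frame without it

print(list(with_money.columns))     # ['year', 'income', 'money']
print(list(df.columns))             # ['year', 'income'], df is unchanged
```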


The `reindex` method rebuilds the index; labels that did not exist before get NaN.

In [14]:
data_1 = {'year':[2000,2001,2002,2003],
          'income':[3000,3500,4500,6000]}

data_2 = pd.DataFrame(data_1,columns=['year','income','outcome'],
index=['a','b','c','d'])
 
data_3 =data_2.reindex(['a','b','c','d','e'])
print(data_3)

     year  income outcome
a  2000.0  3000.0     NaN
b  2001.0  3500.0     NaN
c  2002.0  4500.0     NaN
d  2003.0  6000.0     NaN
e     NaN     NaN     NaN


In [15]:
data_3=data_2.reindex(['a','b','c','d','e'],method='ffill')
print(data_3)

   year  income outcome
a  2000    3000     NaN
b  2001    3500     NaN
c  2002    4500     NaN
d  2003    6000     NaN
e  2003    6000     NaN
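
The two `reindex` calls above differ only in how the new label `e` is filled. A compact sketch on a hypothetical Series `s`, also showing the `fill_value` parameter as a third option:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
new_index = ['a', 'b', 'c', 'd']

plain = s.reindex(new_index)                    # 'd' becomes NaN
filled = s.reindex(new_index, fill_value=0.0)   # 'd' becomes 0.0
forward = s.reindex(new_index, method='ffill')  # 'd' copies the previous label's value

print(plain['d'], filled['d'], forward['d'])    # nan 0.0 3.0
```

Note that `method='ffill'` requires the original index to be sorted (monotonic).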


Dropping index entries, filtering, and related methods

In [16]:
print(data_2.drop(['a']))

   year  income outcome
b  2001    3500     NaN
c  2002    4500     NaN
d  2003    6000     NaN


In [17]:
print(data_2[data_2['year']>2001])

   year  income outcome
c  2002    4500     NaN
d  2003    6000     NaN


In [18]:
print(data_2.loc[['a','b'],['year','income']])

   year  income
a  2000    3000
b  2001    3500


In [19]:
print(data_2.loc[data_2.year>2000, data_2.columns[:2]])

   year  income
b  2001    3500
c  2002    4500
d  2003    6000


### Arithmetic on DataFrames

In [20]:
data={'year':[2000,2001,2002,2003],
'income':[3000,3500,4500,6000]}
 
data1=pd.DataFrame(data,columns=['year','income','outcome'],
index=['a','b','c','d'])
 
data2=pd.DataFrame(data,columns=['year','income','outcome'],
index=['a','b','c','d'])
 
data1['outcome']=range(1,5)
 
print('\ndata1\n',data1)

print('\ndata2\n',data2)

data2=data2.reindex(['a','b','c','d','e'])
 
print('\ndata_reindex\n',data1.add(data2,fill_value=0))


data1
    year  income  outcome
a  2000    3000        1
b  2001    3500        2
c  2002    4500        3
d  2003    6000        4

data2
    year  income outcome
a  2000    3000     NaN
b  2001    3500     NaN
c  2002    4500     NaN
d  2003    6000     NaN

data_reindex
      year   income outcome
a  4000.0   6000.0       1
b  4002.0   7000.0       2
c  4004.0   9000.0       3
d  4006.0  12000.0       4
e     NaN      NaN     NaN
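
Arithmetic between frames aligns on row and column labels first. A minimal sketch with two small hypothetical frames `a` and `b`, comparing `+` with `add(..., fill_value=0)`:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
b = pd.DataFrame({'x': [10, 20]}, index=['b', 'c'])

plain = a + b                     # labels missing on either side give NaN
filled = a.add(b, fill_value=0)   # a missing side is treated as 0

print(plain['x'].tolist())        # [nan, 12.0, nan]
print(filled['x'].tolist())       # [1.0, 12.0, 20.0]
```

With `fill_value`, a cell stays NaN only when it is missing in both operands, which is why row `e` above remains NaN.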


Sorting a DataFrame

In [21]:
data=pd.DataFrame(np.arange(15).reshape((3,5)),
                  index=['c','a','b'], 
                  columns=['one','four','two','three','five'])
 
print(data)

   one  four  two  three  five
c    0     1    2      3     4
a    5     6    7      8     9
b   10    11   12     13    14


In [22]:
print(data.sort_index())

   one  four  two  three  five
a    5     6    7      8     9
b   10    11   12     13    14
c    0     1    2      3     4


In [23]:
print(data.sort_index(axis=1))

   five  four  one  three  two
c     4     1    0      3    2
a     9     6    5      8    7
b    14    11   10     13   12


In [24]:
print(data.sort_values(by='one'))

   one  four  two  three  five
c    0     1    2      3     4
a    5     6    7      8     9
b   10    11   12     13    14


In [25]:
print(data.sort_values(by='one',ascending=False))

   one  four  two  three  five
b   10    11   12     13    14
a    5     6    7      8     9
c    0     1    2      3     4
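
`sort_values` also accepts a list of columns with a per-column sort direction. A sketch on made-up data (the columns `group` and `score` are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'group': ['b', 'a', 'b', 'a'],
                   'score': [1, 4, 3, 2]})

# Sort by group ascending, then by score descending within each group.
ordered = df.sort_values(by=['group', 'score'], ascending=[True, False])

print(ordered.index.tolist())   # [1, 3, 2, 0]
```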


Aggregation and descriptive statistics

In [26]:
data=pd.DataFrame(np.arange(15).reshape((3,5)),
                  index=['c','a','b'], 
                  columns=['one','four','two','three','five'])
 
print(data.describe())

        one  four   two  three  five
count   3.0   3.0   3.0    3.0   3.0
mean    5.0   6.0   7.0    8.0   9.0
std     5.0   5.0   5.0    5.0   5.0
min     0.0   1.0   2.0    3.0   4.0
25%     2.5   3.5   4.5    5.5   6.5
50%     5.0   6.0   7.0    8.0   9.0
75%     7.5   8.5   9.5   10.5  11.5
max    10.0  11.0  12.0   13.0  14.0


In [27]:
print(data.sum())

one      15
four     18
two      21
three    24
five     27
dtype: int64


In [28]:
print(data.sum(axis=1))

c    10
a    35
b    60
dtype: int64


In [29]:
data=pd.DataFrame(np.random.random(20).reshape((4,5)),index=['c','a','b','c'],
columns=['one','four','two','three','five'])
 
print(data)

        one      four       two     three      five
c  0.997724  0.787519  0.544104  0.735576  0.638625
a  0.600537  0.968203  0.445935  0.881803  0.501845
b  0.958477  0.201321  0.831380  0.628254  0.598649
c  0.506096  0.759721  0.132219  0.304490  0.630988


In [30]:
# correlation coefficient
print(data.one.corr(data.three))

0.366611001385


In [31]:
# covariance
print(data.one.cov(data.three))

0.022358112986


In [32]:
# correlation of column one with every column
print(data.corrwith(data.one))

one      1.000000
four    -0.547833
two      0.829598
three    0.366611
five     0.359399
dtype: float64
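
Because the cells above use random data, their numbers change on every run. A deterministic sketch (the columns `x`, `y`, `z` are made up) that makes the same calls easy to check by hand:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4],
                   'y': [2, 4, 6, 8],    # y = 2 * x, perfectly correlated
                   'z': [4, 3, 2, 1]})   # z = 5 - x, perfectly anti-correlated

corr_matrix = df.corr()          # pairwise Pearson correlation of all columns
cov_xy = df['x'].cov(df['y'])    # sample covariance (divides by n - 1)

print(corr_matrix.round(2))
print(cov_xy)
```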


In [33]:
data=pd.Series(['a','a','b','b','b','c','d','d'])
 
print(data.unique())

['a' 'b' 'c' 'd']


In [34]:
print(data.isin(['b']))

0    False
1    False
2     True
3     True
4     True
5    False
6    False
7    False
dtype: bool


In [35]:
print(data.value_counts())

b    3
d    2
a    2
c    1
dtype: int64
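
The counts can also be expressed as relative frequencies with `normalize=True`. A small sketch on the same kind of data (the names `counts` and `shares` are illustrative):

```python
import pandas as pd

s = pd.Series(['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd'])

counts = s.value_counts()                 # sorted by count, descending
shares = s.value_counts(normalize=True)   # fractions that sum to 1

print(counts['b'])   # 3
print(shares['b'])   # 0.375
```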


#### Handling missing values

In [36]:
data=pd.Series(['a','a','b',np.nan,'b','c',np.nan,'d'])
 
print(data.isnull())

0    False
1    False
2    False
3     True
4    False
5    False
6     True
7    False
dtype: bool


In [37]:
print(data.dropna())

0    a
1    a
2    b
4    b
5    c
7    d
dtype: object


In [38]:
print(data.ffill())

0    a
1    a
2    b
3    b
4    b
5    c
6    c
7    d
dtype: object


In [39]:
print(data.fillna(0))

0    a
1    a
2    b
3    0
4    b
5    c
6    0
7    d
dtype: object
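
For numeric data, a statistic such as the mean is a common fill value, and on a DataFrame `dropna` can be controlled with `how`. A sketch on hypothetical data, not taken from the notebook:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])
mean_filled = s.fillna(s.mean())      # mean() skips NaN, so the fill value is 2.0

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0]})
any_dropped = df.dropna()             # drop rows containing any NaN
all_dropped = df.dropna(how='all')    # drop only rows that are entirely NaN

print(mean_filled.tolist())           # [1.0, 2.0, 3.0, 2.0]
print(any_dropped.index.tolist())     # [2]
print(all_dropped.index.tolist())     # [0, 2]
```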


### Hierarchical indexing

Data can be indexed along multiple levels.

In [40]:
data = pd.Series(np.random.randn(10), 
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'], 
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
 
print(data)

a  1   -1.370134
   2   -0.392633
   3    0.448226
b  1   -2.638579
   2   -1.074331
   3    0.174993
c  1    1.691425
   2   -0.046041
d  2   -1.329235
   3   -0.796645
dtype: float64


In [41]:
print(data.index)

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])


In [42]:
print(data['c'])

1    1.691425
2   -0.046041
dtype: float64


In [43]:
print(data[:,3])

a    0.448226
b    0.174993
d   -0.796645
dtype: float64


In [44]:
# convert the data into a DataFrame
print(data.unstack())

          1         2         3
a -1.370134 -0.392633  0.448226
b -2.638579 -1.074331  0.174993
c  1.691425 -0.046041       NaN
d       NaN -1.329235 -0.796645


In [45]:
print(data.unstack().stack())

a  1   -1.370134
   2   -0.392633
   3    0.448226
b  1   -2.638579
   2   -1.074331
   3    0.174993
c  1    1.691425
   2   -0.046041
d  2   -1.329235
   3   -0.796645
dtype: float64
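
The same operations on fixed values are easier to verify than with random data. A sketch assuming a small Series with named index levels (`key` and `num` are hypothetical names), which also shows selecting on the inner level with `xs` and aggregating per outer level:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1), ('b', 2)], names=['key', 'num'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

outer = s.loc['a']                      # all rows under outer label 'a'
inner = s.xs(2, level='num')            # rows whose inner label is 2
wide = s.unstack()                      # inner level becomes the columns
totals = s.groupby(level='key').sum()   # sum per outer label

print(inner.tolist())    # [2.0, 4.0]
print(totals.tolist())   # [3.0, 7.0]
```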
