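## Pandas: A Basic Introduction
* ### How NumPy and Pandas differ
* ### Series
* ### DataFrame
* ### Some simple DataFrame operations

* ### How NumPy and Pandas differ

Pandas is built on top of NumPy.
Roughly speaking, a NumPy array behaves like a list (elements are accessed by position), while a pandas object behaves like a dict (elements are accessed by label).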
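
To make the list/dict analogy concrete, here is a minimal sketch. The labels `'x'`, `'y'`, `'z'` and the variable names are made up for this illustration.

```python
import numpy as np
import pandas as pd

# NumPy array: elements are addressed by integer position, like a list.
arr = np.array([10, 20, 30])
second_by_position = arr[1]

# pandas Series: elements can be addressed by label, like a dict.
# The labels 'x', 'y', 'z' are hypothetical.
ser = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
second_by_label = ser['y']

# The Series can still hand back its data as a NumPy array.
underlying = ser.to_numpy()
print(second_by_position, second_by_label, type(underlying))
```

Both lookups return the same value; only the way the element is addressed differs.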

* ### Series

In [3]:
# A Series is like a one-dimensional dict
import pandas as pd
import numpy as np
s = pd.Series([1,3,6,np.nan,44,1])
print('s=\n',s)

s=
 0     1.0
1     3.0
2     6.0
3     NaN
4    44.0
5     1.0
dtype: float64


In [15]:
# Create a Series directly, with an explicit index and dtype
s = pd.Series([1,3,6,np.nan,44,1],index=list(range(6)),dtype='float32')
print(s)

0     1.0
1     3.0
2     6.0
3     NaN
4    44.0
5     1.0
dtype: float32
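
Both outputs above show a floating-point dtype even though the inputs are mostly integers. A short sketch of why, assuming the default NumPy-backed dtypes: `np.nan` is itself a float, so its presence pushes the whole Series to a float dtype.

```python
import numpy as np
import pandas as pd

ints_only = pd.Series([1, 3, 6])
with_nan = pd.Series([1, 3, 6, np.nan])

print(ints_only.dtype)   # an integer dtype
print(with_nan.dtype)    # a float dtype, because NaN is a float

# Count the missing values, or replace them with a chosen default.
n_missing = with_nan.isna().sum()
filled = with_nan.fillna(0)
print(n_missing, filled.tolist())
```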


* ### DataFrame

In [4]:
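# A DataFrame is a tabular data structure, something like a two-dimensional dict; its two axes are named index (rows) and columns instead of being referred to only as axis=0/1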
dates = pd.date_range('20160101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d'])
print('df=\n',df)

df=
                    a         b         c         d
2016-01-01 -0.356408 -1.445475  2.017946  0.377428
2016-01-02 -0.280402 -1.247819  0.144409  0.761430
2016-01-03  0.644781  1.679094  0.858624  0.921503
2016-01-04 -0.863005 -0.463600  0.602037  1.673293
2016-01-05  0.407146  1.373404 -0.431531 -1.250529
2016-01-06  1.176787 -0.750215  0.925333 -1.375727


* ### Some simple DataFrame operations

In [7]:
# Indexing: select a single column by its label
print('df=\n',df)
print("df['c']=\n",df['c'])

df=
                    a         b         c         d
2016-01-01 -0.356408 -1.445475  2.017946  0.377428
2016-01-02 -0.280402 -1.247819  0.144409  0.761430
2016-01-03  0.644781  1.679094  0.858624  0.921503
2016-01-04 -0.863005 -0.463600  0.602037  1.673293
2016-01-05  0.407146  1.373404 -0.431531 -1.250529
2016-01-06  1.176787 -0.750215  0.925333 -1.375727
df['c']=
 2016-01-01    2.017946
2016-01-02    0.144409
2016-01-03    0.858624
2016-01-04    0.602037
2016-01-05   -0.431531
2016-01-06    0.925333
Freq: D, Name: c, dtype: float64
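
As the output above shows, indexing with a single column label returns a Series. A minimal sketch of the difference between a single label and a list of labels (variable names here are illustrative):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20160101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])

col_as_series = df['c']     # single label -> Series
col_as_frame = df[['c']]    # list of labels -> DataFrame with one column

print(type(col_as_series).__name__, col_as_series.shape)
print(type(col_as_frame).__name__, col_as_frame.shape)
```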


In [11]:
# Create a DataFrame without explicit labels (default integer index and columns)
df1 = pd.DataFrame(np.arange(12).reshape((3,4)))
print(df1)

   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11


In [12]:
# Create a DataFrame with explicit column labels from a dict
df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo'})
                    
print(df2)

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


In [13]:
# Inspect the column names, the index (row labels), the column dtypes, and the raw values
print('df2.columns=\n',df2.columns)
print('df2.index=\n',df2.index)
print('df2.dtypes=\n',df2.dtypes)
print('df2.values=\n',df2.values)

df2.columns=
 Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
df2.index=
 Int64Index([0, 1, 2, 3], dtype='int64')
df2.dtypes=
 A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
df2.values=
 [[1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
 [1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']
 [1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'test' 'foo']
 [1.0 Timestamp('2013-01-02 00:00:00') 1.0 3 'train' 'foo']]


In [16]:
# Summary statistics of the table (here computed for the numeric columns only)
df2.describe()

         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0
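
The summary above skips the non-numeric columns. As a hedged sketch: when a frame has no numeric columns at all, `describe()` falls back to a different summary (count, unique, top, freq). The frame `labels` below is a made-up subset mirroring columns E and F.

```python
import pandas as pd

# Hypothetical frame holding only non-numeric columns.
labels = pd.DataFrame({
    'E': pd.Categorical(['test', 'train', 'test', 'train']),
    'F': ['foo'] * 4,
})

# With no numeric columns present, describe() summarises the labels instead.
summary = labels.describe()
print(summary)
```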


In [17]:
# Transpose the data (swap rows and columns)
print(df2.T)

                     0                    1                    2  \
A                    1                    1                    1   
B  2013-01-02 00:00:00  2013-01-02 00:00:00  2013-01-02 00:00:00   
C                    1                    1                    1   
D                    3                    3                    3   
E                 test                train                 test   
F                  foo                  foo                  foo   

                     3  
A                    1  
B  2013-01-02 00:00:00  
C                    1  
D                    3  
E                train  
F                  foo  


In [21]:
# Sort by column labels (axis=1) in ascending order
print('df2=\n',df2,'\nAfter sorting by column labels\n')
print(df2.sort_index(axis=1, ascending=True)) # ascending order

df2=
      A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo 
After sorting by column labels

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


In [22]:
# Sort by column labels in descending order; axis=0 still means the rows (index) and axis=1 the columns
print('df2=\n',df2,'\nAfter sorting by column labels\n')
print(df2.sort_index(axis=1, ascending=False))

df2=
      A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo 
After sorting by column labels

     F      E  D    C          B    A
0  foo   test  3  1.0 2013-01-02  1.0
1  foo  train  3  1.0 2013-01-02  1.0
2  foo   test  3  1.0 2013-01-02  1.0
3  foo  train  3  1.0 2013-01-02  1.0
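
Because `df2` already has its rows in order, the effect of `axis` is easier to see on a frame whose labels are shuffled in both directions. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical frame with unsorted row labels and unsorted column labels.
frame = pd.DataFrame([[1, 2], [3, 4], [5, 6]], index=[2, 0, 1], columns=['b', 'a'])

by_rows = frame.sort_index(axis=0)   # reorder rows by their index labels
by_cols = frame.sort_index(axis=1)   # reorder columns by their names

print(by_rows)
print(by_cols)
```

`sort_index` returns a new object; `frame` itself keeps its original order.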


In [23]:
# Sort the rows by the values in column B
print(df2.sort_values(by='B'))

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
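
Every value in column B is identical, so the sort above leaves the rows unchanged. A sketch with made-up data where the order actually changes (the `scores` frame is hypothetical):

```python
import pandas as pd

scores = pd.DataFrame({'name': ['x', 'y', 'z'], 'score': [2, 9, 5]})

# Sort rows by the 'score' column, largest first.
ranked = scores.sort_values(by='score', ascending=False)
print(ranked)
```

The original row labels travel with their rows, so the index of `ranked` is no longer in increasing order.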


 ---

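## Pandas: Selecting Data
* ### Simple selection
* ### By label: loc
* ### By position: iloc
* ### Mixing both: ix
* ### Filtering by condition

* #### Select a single value
* #### Select one row/column
* #### Select contiguous/non-contiguous rows/columns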

---

In [26]:
# Build a 6x4 DataFrame
datas = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=datas,columns=['A','B','C','D'])
print(df)

             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23


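* ### Simple selection

In [38]:
# Simple column selection: both forms below work
print(df['A']) # dict-style indexing
print(df.A) # attribute-style access
#print(df['20130101'])
print(df[0:3]) # the first three rows (by position)
print('20130102:20130104\n',df['20130102':'20130104']) # contiguous rows selected by label; both endpoints are included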

2013-01-01     0
2013-01-02     4
2013-01-03     8
2013-01-04    12
2013-01-05    16
2013-01-06    20
Freq: D, Name: A, dtype: int32
2013-01-01     0
2013-01-02     4
2013-01-03     8
2013-01-04    12
2013-01-05    16
2013-01-06    20
Freq: D, Name: A, dtype: int32
            A  B   C   D
2013-01-01  0  1   2   3
2013-01-02  4  5   6   7
2013-01-03  8  9  10  11
20130102:20130104
              A   B   C   D
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
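
The two row slices above follow different endpoint rules, which is easy to miss. A minimal sketch rebuilding the same frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

# Integer slice: positional, end position is excluded (rows at positions 1 and 2).
by_position = df[1:3]

# Label slice: both endpoint labels are included (three rows).
by_label = df['20130102':'20130104']

print(len(by_position), len(by_label))
```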


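* ### By label: loc (a label is an index name or a column name)

In [44]:
# Why loc: plain df[...] with a single label only selects a column, and a plain slice only selects rows, so it cannot pick one row by label or slice a range of columns
# The arguments to loc are the row labels and the column labels to select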
print('df=\n',df)
print("df.loc['20130102']=\n",df.loc['20130102'])
print("df.loc[:,['A','B']]=\n",df.loc[:,['A','B']])
print("df.loc['20130102',['A','B']]=\n",df.loc['20130102',['A','B']])

df=
              A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
df.loc['20130102']=
 A    4
B    5
C    6
D    7
Name: 2013-01-02 00:00:00, dtype: int32
df.loc[:,['A','B']]=
              A   B
2013-01-01   0   1
2013-01-02   4   5
2013-01-03   8   9
2013-01-04  12  13
2013-01-05  16  17
2013-01-06  20  21
df.loc['20130102',['A','B']]=
 A    4
B    5
Name: 2013-01-02 00:00:00, dtype: int32
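
`loc` also accepts label slices on both axes, and like the row label slices shown earlier, both endpoints are included. A sketch under the same setup; `.at` is used here for the single-value lookup:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

# Rows 2013-01-02 through 2013-01-03 and columns B through D, endpoints included.
block = df.loc['20130102':'20130103', 'B':'D']

# A single value addressed by row label and column label.
single = df.at[pd.Timestamp('2013-01-02'), 'B']

print(block)
print(single)
```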


* ### By position: iloc, e.g. df.iloc[1,2] addresses a position

In [46]:
# The arguments to iloc are integer positions (row, column)
print('df.iloc[3,1]=\n',df.iloc[3,1]) # a single element
print('df.iloc[3:5,1:3]=\n',df.iloc[3:5,1:3]) # any contiguous block
print('df.iloc[[1,3,5],1:3]=\n',df.iloc[[1,3,5],1:3]) # non-contiguous rows

df.iloc[3,1]=
 13
df.iloc[3:5,1:3]=
              B   C
2013-01-04  13  14
2013-01-05  17  18
df.iloc[[1,3,5],1:3]=
              B   C
2013-01-02   5   6
2013-01-04  13  14
2013-01-06  21  22
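
`iloc` follows ordinary Python indexing rules: slice ends are excluded and negative positions count from the end. A short sketch on the same frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

first_two = df.iloc[:2]              # positions 0 and 1; the end is excluded
last_row = df.iloc[-1]               # negative positions count from the end
corners = df.iloc[[0, -1], [0, -1]]  # first/last row crossed with first/last column

print(first_two)
print(last_row.tolist())
print(corners)
```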


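* ### Mixing both: ix
#### One way to think about mixed selection: loc and iloc each accept only one kind of selector, either labels (loc) or integer positions (iloc).
ix accepts a mix of the two.

In [48]:
# Note: .ix was removed in pandas 1.0, so this cell only runs on older versions
print('df=\n',df)
print("df.ix[:3,['A','C']]=\n",df.ix[:3,['A','C']])

df=
              A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
df.ix[:3,['A','C']]=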
             A   C
2013-01-01  0   2
2013-01-02  4   6
2013-01-03  8  10


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
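
Following the warning above, the same mixed selection (first three rows by position, columns A and C by label) can be written with `loc` or `iloc` alone. This is one possible sketch, not the only way to do it:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

# Option 1: turn the row positions into labels, then use loc.
via_loc = df.loc[df.index[:3], ['A', 'C']]

# Option 2: turn the column labels into positions, then use iloc.
via_iloc = df.iloc[:3, df.columns.get_indexer(['A', 'C'])]

print(via_loc)
```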
  


* ### Filtering by condition: keep the rows that match

In [49]:
# Boolean mask: keep the rows where column A is greater than 8
print('df=\n',df)
print('df[df.A>8]=\n',df[df.A>8])

df=
              A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
df[df.A>8]=
              A   B   C   D
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
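
Conditions can be combined. A sketch assuming the same 6x4 frame; note that each comparison needs its own parentheses and that `&` / `|` are used rather than `and` / `or`:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

# Rows where A > 8 AND B < 21.
mask = (df['A'] > 8) & (df['B'] < 21)
both = df[mask]

# Rows whose A value is one of a given set.
either_end = df[df['A'].isin([0, 20])]

print(both)
print(either_end)
```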


----

## Pandas: Setting Values

* ### Create the data

In [53]:
dates = pd.date_range('20130101',periods = 6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index = dates, columns=['A','B','C','D'])
print(df)

             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23


* ### Setting by position or label: iloc and loc


In [56]:
df.iloc[2,2] = 1111
df.loc['20130101','B'] = 2222
print("after df.iloc[2,2] && df.loc['20130101','B'],df=\n",df)

after df.iloc[2,2] && df.loc['20130101','B'],df=
              A     B     C   D
2013-01-01   0  2222     2   3
2013-01-02   4     5     6   7
2013-01-03   8     9  1111  11
2013-01-04  12    13    14  15
2013-01-05  16    17    18  19
2013-01-06  20    21    22  23


* ### Setting by condition

In [58]:
df.loc[df.A > 4, 'B'] = 0 # set B to 0 where A > 4; a single loc call avoids chained assignment
print(df)

             A     B     C   D
2013-01-01   0  2222     2   3
2013-01-02   4     5     6   7
2013-01-03   8     0  1111  11
2013-01-04  12     0    14  15
2013-01-05  16     0    18  19
2013-01-06  20     0    22  23
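
A boolean mask inside `loc` can also target several columns at once, and `Series.mask` offers a variant that returns a new Series instead of modifying the frame. A sketch on a fresh copy of the 6x4 frame (the replacement value `-1` is arbitrary):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

cond = df['A'] > 4

# In-place: set B and C to 0 on every row where the condition holds.
df.loc[cond, ['B', 'C']] = 0

# Not in-place: build a new Series from D with -1 where the condition holds.
replaced_d = df['D'].mask(cond, -1)

print(df)
print(replaced_d.tolist())
```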


In [60]:
df['F'] = np.nan # add a new column 'F' filled with NaN
print(df)

             A     B     C   D   F
2013-01-01   0  2222     2   3 NaN
2013-01-02   4     5     6   7 NaN
2013-01-03   8     0  1111  11 NaN
2013-01-04  12     0    14  15 NaN
2013-01-05  16     0    18  19 NaN
2013-01-06  20     0    22  23 NaN


* ### Adding data

In [61]:
# Add a column from a Series; its values are matched to rows by index label
df['E'] = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130101',periods=6)) 
print(df)

             A     B     C   D   F  E
2013-01-01   0  2222     2   3 NaN  1
2013-01-02   4     5     6   7 NaN  2
2013-01-03   8     0  1111  11 NaN  3
2013-01-04  12     0    14  15 NaN  4
2013-01-05  16     0    18  19 NaN  5
2013-01-06  20     0    22  23 NaN  6
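
The Series above has exactly the same index as the frame, so every row gets a value. A sketch of what happens when the indexes only partly overlap, using a smaller made-up frame: rows without a matching label get NaN, whereas a plain list is assigned by position.

```python
import pandas as pd

dates = pd.date_range('20130101', periods=3)
df = pd.DataFrame({'A': [0, 4, 8]}, index=dates)

# A Series covering only the first and third dates (hypothetical values).
partial = pd.Series([10, 30], index=[dates[0], dates[2]])
df['E'] = partial          # aligned by label; the middle row becomes NaN

# A list has no labels, so it is assigned in order and must match the length.
df['G'] = [7, 8, 9]

print(df)
```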
