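## Setting
- Setting a new column automatically aligns the data by index
- Setting values by label
- Setting values by position
- Setting a column by assigning a NumPy array
- The result of the preceding set operations
- Setting values with a where operation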

In [1]:
import numpy as np 
import pandas as pd

dates = pd.date_range('20130101', periods=6)
print(dates)
print('\n')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD')) # 6 rows x 4 columns, indexed by dates, with columns named A, B, C, D
print(df)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')


                   A         B         C         D
2013-01-01 -0.873462  2.847167  0.693864 -2.035999
2013-01-02 -1.185963 -0.290307  0.566101  0.973414
2013-01-03 -1.065812 -0.696262  0.945434 -3.397630
2013-01-04 -1.246650  0.757272 -0.995949 -0.085033
2013-01-05 -0.356536  0.462087  1.006266  0.153631
2013-01-06  0.049527 -1.658512  2.297461 -0.730905


In [3]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102',periods=6))
print(s1)
print('\n')
df['F']=s1
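print(df) # aligned on index: labels of s1 that are not in df are dropped, labels of df missing from s1 are filled with NaN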

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64


                   A         B         C         D    F
2013-01-01 -0.873462  2.847167  0.693864 -2.035999  NaN
2013-01-02 -1.185963 -0.290307  0.566101  0.973414  1.0
2013-01-03 -1.065812 -0.696262  0.945434 -3.397630  2.0
2013-01-04 -1.246650  0.757272 -0.995949 -0.085033  3.0
2013-01-05 -0.356536  0.462087  1.006266  0.153631  4.0
2013-01-06  0.049527 -1.658512  2.297461 -0.730905  5.0

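The alignment rule can be checked on a smaller frame. This is a minimal sketch; `frame` and `s` are hypothetical names, not the tutorial's `df` and `s1`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame (not the tutorial's df)
idx = pd.date_range('20130101', periods=3)
frame = pd.DataFrame({'A': [1.0, 2.0, 3.0]}, index=idx)

# This Series starts one day later than the frame and ends one day later
s = pd.Series([10, 20, 30], index=pd.date_range('20130102', periods=3))

# Assignment aligns on index labels, not on position
frame['F'] = s

print(frame)
```

The first row has no matching label in `s`, so it gets NaN, and the extra label 2013-01-04 in `s` is discarded. Because NaN is a float, the integer values are stored as floats in the new column.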

In [7]:
# Set a value by label
print(df.at[dates[0],'B'])
df.at[dates[0],'B']=6.66
print(df.at[dates[0],'B'])
print(10*'==')
# Set a value by position
print(df.iloc[0,2])
df.iloc[0,2] = 8.8
print(df.iloc[0,2])

6.66
6.66
====================
0.6938638905139981
8.8

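For comparison, here is a small sketch of the four common setters side by side. `frame` is a hypothetical frame used only for illustration. `at` and `iat` address a single cell, while `loc` and `iloc` also accept slices.

```python
import pandas as pd

# Hypothetical frame for illustration
idx = pd.date_range('20130101', periods=3)
frame = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]}, index=idx)

frame.at[idx[0], 'B'] = 6.66          # one cell, addressed by label
frame.iat[0, 0] = 8.8                 # one cell, addressed by integer position
frame.loc[idx[1]:idx[2], 'A'] = 0.0   # label slice: both endpoints are included
frame.iloc[1:3, 1] = -1.0             # position slice: the end is excluded

print(frame)
```

Note the different slice semantics: the label slice `idx[1]:idx[2]` and the position slice `1:3` both cover the last two rows.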

In [11]:
# Set a column by assigning a NumPy array
print(df['C'])
print('\n')
df['C'] = np.array([5] * len(df))
print(df['C'])

2013-01-01    8.800000
2013-01-02    0.566101
2013-01-03    0.945434
2013-01-04   -0.995949
2013-01-05    1.006266
2013-01-06    2.297461
Freq: D, Name: C, dtype: float64


2013-01-01    5
2013-01-02    5
2013-01-03    5
2013-01-04    5
2013-01-05    5
2013-01-06    5
Freq: D, Name: C, dtype: int64


In [14]:
# Set values with a where operation
print(df)
print('\n')
df2 = df.copy()
df2[df2>0] = -df2 # df2>0 is a boolean mask marking the positive entries; each of them is replaced by the matching entry of -df2
print(df2)

                   A         B  C  D    F
2013-01-01  6.660000  6.660000  5  5  NaN
2013-01-02 -1.185963 -0.290307  5  5  1.0
2013-01-03 -1.065812 -0.696262  5  5  2.0
2013-01-04 -1.246650  0.757272  5  5  3.0
2013-01-05 -0.356536  0.462087  5  5  4.0
2013-01-06  0.049527 -1.658512  5  5  5.0


                   A         B  C  D    F
2013-01-01 -6.660000 -6.660000 -5 -5  NaN
2013-01-02 -1.185963 -0.290307 -5 -5 -1.0
2013-01-03 -1.065812 -0.696262 -5 -5 -2.0
2013-01-04 -1.246650 -0.757272 -5 -5 -3.0
2013-01-05 -0.356536 -0.462087 -5 -5 -4.0
2013-01-06 -0.049527 -1.658512 -5 -5 -5.0

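The same result can also be written without in-place assignment. The sketch below, on a hypothetical `frame`, compares boolean-mask assignment with `DataFrame.mask` and `DataFrame.where`; it is one possible equivalent formulation, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical frame containing positive, negative, and missing values
frame = pd.DataFrame({'A': [1.5, -2.0, np.nan], 'B': [-3.0, 4.0, 5.0]})

# Boolean-mask assignment, as in the cell above
by_assignment = frame.copy()
by_assignment[by_assignment > 0] = -by_assignment

# mask() replaces entries where the condition is True
by_mask = frame.mask(frame > 0, -frame)

# where() keeps entries where the condition is True and replaces the rest
by_where = frame.where(frame <= 0, -frame)

print(by_mask)
```

All three produce the same frame. NaN compares as False against any number, so missing entries stay NaN in every variant.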

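## Missing data
- pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations.
- Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data.
- Dropping every row that contains missing data
- Filling in missing data
- Getting a boolean mask that marks the missing values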

In [16]:
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])
print(df1)
print('\n')
df1.loc[dates[0]:dates[1],'E'] = 1
print(df1)

                   A         B  C  D    F   E
2013-01-01  6.660000  6.660000  5  5  NaN NaN
2013-01-02 -1.185963 -0.290307  5  5  1.0 NaN
2013-01-03 -1.065812 -0.696262  5  5  2.0 NaN
2013-01-04 -1.246650  0.757272  5  5  3.0 NaN


                   A         B  C  D    F    E
2013-01-01  6.660000  6.660000  5  5  NaN  1.0
2013-01-02 -1.185963 -0.290307  5  5  1.0  1.0
2013-01-03 -1.065812 -0.696262  5  5  2.0  NaN
2013-01-04 -1.246650  0.757272  5  5  3.0  NaN


In [21]:
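# Drop every row that contains missing data
print(df1.dropna(how='any')) # how='any' drops a row if any of its values is missing; the original df1 is not modified
print('\n')
print(df1.fillna(value=8)) # fill the missing values with 8; the original df1 is not modified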

                   A         B  C  D    F    E
2013-01-02 -1.185963 -0.290307  5  5  1.0  1.0


                   A         B  C  D    F    E
2013-01-01  6.660000  6.660000  5  5  8.0  1.0
2013-01-02 -1.185963 -0.290307  5  5  1.0  1.0
2013-01-03 -1.065812 -0.696262  5  5  2.0  8.0
2013-01-04 -1.246650  0.757272  5  5  3.0  8.0


In [24]:
# Get a boolean mask that marks the missing values
print(pd.isnull(df1))

                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

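A short sketch of a few related options on a hypothetical `frame`: `how='all'` as the counterpart of `how='any'`, per-column fill values passed as a dict, and counting missing values with `isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the last row is entirely missing
frame = pd.DataFrame({
    'F': [np.nan, 1.0, 2.0, np.nan],
    'E': [1.0, 1.0, np.nan, np.nan],
})

any_dropped = frame.dropna(how='any')         # keep rows with no NaN at all
all_dropped = frame.dropna(how='all')         # drop only rows that are entirely NaN
filled = frame.fillna({'F': 0.0, 'E': 8.0})   # a different fill value per column
nan_per_column = frame.isna().sum()           # number of missing values per column

print(any_dropped, all_dropped, filled, nan_per_column, sep='\n\n')
```

None of these calls changes `frame` itself; each returns a new object.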

## Stats
- These operations generally exclude missing data
- Computing a descriptive statistic
- Performing the same operation on the other axis

In [33]:
print(df)
print('\n')
df3 = df.fillna(value=6)
print('\n')
print(df3)
print('\n')
print(df3.mean()) # mean of each column
print('\n')
print(df3.mean(axis=1)) # mean of each row

                   A         B  C  D    F
2013-01-01  6.660000  6.660000  5  5  NaN
2013-01-02 -1.185963 -0.290307  5  5  1.0
2013-01-03 -1.065812 -0.696262  5  5  2.0
2013-01-04 -1.246650  0.757272  5  5  3.0
2013-01-05 -0.356536  0.462087  5  5  4.0
2013-01-06  0.049527 -1.658512  5  5  5.0




                   A         B  C  D    F
2013-01-01  6.660000  6.660000  5  5  6.0
2013-01-02 -1.185963 -0.290307  5  5  1.0
2013-01-03 -1.065812 -0.696262  5  5  2.0
2013-01-04 -1.246650  0.757272  5  5  3.0
2013-01-05 -0.356536  0.462087  5  5  4.0
2013-01-06  0.049527 -1.658512  5  5  5.0


A    0.475761
B    0.872380
C    5.000000
D    5.000000
F    3.500000
dtype: float64


2013-01-01    5.864000
2013-01-02    1.904746
2013-01-03    2.047585
2013-01-04    2.502125
2013-01-05    2.821110
2013-01-06    2.678203
Freq: D, dtype: float64

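The cell above fills the NaN with 6 before averaging, which changes the result compared with the default behavior of skipping missing values. A minimal sketch of the difference, using a standalone Series `s` with the same values as column F:

```python
import numpy as np
import pandas as pd

# Same values as column F above: one missing value, then 1 to 5
s = pd.Series([np.nan, 1.0, 2.0, 3.0, 4.0, 5.0])

skipped = s.mean()                  # NaN is ignored: (1+2+3+4+5) / 5
propagated = s.mean(skipna=False)   # NaN propagates into the result
filled = s.fillna(6).mean()         # NaN replaced by 6: (6+1+2+3+4+5) / 6

print(skipped, propagated, filled)
```

Filling with 6 gives 3.5, matching the mean of column F in the output above, while the default would give 3.0.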

## Apply
- Applying functions to the data

In [35]:
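df3.apply(np.cumsum) # cumulative sum down each column: row 1 is unchanged, row 2 is the sum of the first 2 rows, row 3 is the sum of the first 3 rows, and so on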

                   A         B   C   D     F
2013-01-01  6.660000  6.660000   5   5   6.0
2013-01-02  5.474037  6.369693  10  10   7.0
2013-01-03  4.408225  5.673431  15  15   9.0
2013-01-04  3.161575  6.430703  20  20  12.0
2013-01-05  2.805039  6.892790  25  25  16.0
2013-01-06  2.854566  5.234279  30  30  21.0


In [36]:
df3.apply(lambda x: x.max()-x.min()) # difference between the largest and smallest value in each column

A    7.906650
B    8.318512
C    0.000000
D    0.000000
F    5.000000
dtype: float64

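By default `apply` passes each column to the function. As a sketch on a hypothetical `frame`, passing `axis=1` hands over each row instead, and the built-in `cumsum` method gives the running total without going through `apply`.

```python
import pandas as pd

# Hypothetical two-row frame for illustration
frame = pd.DataFrame({'A': [1.0, 4.0], 'B': [3.0, 10.0]})

# Range of each column (default axis=0): the function receives one column at a time
col_range = frame.apply(lambda x: x.max() - x.min())

# Range of each row (axis=1): the function receives one row at a time
row_range = frame.apply(lambda x: x.max() - x.min(), axis=1)

# Running total down each column, using the built-in method
running = frame.cumsum()

print(col_range, row_range, running, sep='\n\n')
```

The column result is indexed by column name, and the row result is indexed by the frame's row labels.
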
In [40]:
# Histogramming: count how often each value occurs
s = pd.Series(np.random.randint(0,7,size=10))
print(s)
print('\n')
print(s.value_counts()) # frequency count: each distinct value and the number of times it occurs

0    6
1    4
2    2
3    2
4    4
5    5
6    4
7    3
8    3
9    4
dtype: int64


4    4
3    2
2    2
6    1
5    1
dtype: int64

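A small sketch of two common variations, reusing the ten values printed above as a fixed Series so that the result is reproducible: `normalize=True` returns relative frequencies, and `sort_index()` orders the counts by value instead of by frequency.

```python
import pandas as pd

# The ten values from the output above, hard-coded instead of drawn at random
s = pd.Series([6, 4, 2, 2, 4, 5, 4, 3, 3, 4])

counts = s.value_counts()                  # sorted by frequency, most common first
shares = s.value_counts(normalize=True)    # fraction of the total instead of a count
by_value = s.value_counts().sort_index()   # sorted by the value itself

print(counts, shares, by_value, sep='\n\n')
```

Values that never occur, such as 0 and 1 here, do not appear in the result at all.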

In [44]:
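# String methods
# Series is equipped with a set of string processing methods in its str attribute, which make it easy to operate on each element of the array, as in the snippet below.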
s1 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
print(s1)
print('\n')
print(s1.str.lower())

0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object


0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

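As the output shows, missing values pass through `str` methods unchanged. A brief sketch with a shorter hypothetical Series; the `na` parameter of `str.contains` decides what a missing entry becomes in the boolean result.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing entry
s = pd.Series(['A', 'Aaba', np.nan, 'dog'])

lowered = s.str.lower()                  # the missing entry stays missing
has_a = s.str.contains('a', na=False)    # case-sensitive; missing entries count as False
lengths = s.str.len()                    # the missing entry has no length

print(lowered, has_a, lengths, sep='\n\n')
```

Without `na=False`, the missing entry would remain missing in the result of `str.contains`, which gets in the way when the result is used as a boolean filter.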